Threads and task scheduling

Here’s a spoiler: the next post in the “process abstraction” series will discuss, among other things, the kernel entity that manages threads and the interface used to handle them. But before I can write that, I first have to define the thread abstraction of this OS and the associated task scheduling mechanisms, which is the goal of this article.

Basic purpose of a thread manager

The role of threads has already been introduced in the first “process abstraction” article. They are isolated execution units, each with a private CPU state and stack, which can run concurrently on a set of CPUs. When they do so, they must share CPU time fairly, which is typically achieved using an algorithm called round-robin scheduling: each CPU rapidly switches from one thread to another, giving each thread a time window of a certain length to run in. In the end, the net result is that each thread gets a tunable percentage of CPU time.

Attentive readers will have noticed that by defining the CPU time percentage that each thread gets, we have not fully defined the length of the time windows they are allowed to run in, only the relative size of the time windows with respect to each other. On a personal computer, absolute window sizes should be chosen small enough that the illusion of simultaneous execution is preserved from the point of view of the user. But infinitely small windows are not good either, because switching between threads has a performance cost. So there is a trade-off to consider here; let’s leave it for implementation time.

From the definitions above, a thread manager should be able to perform the following tasks:

  • Keep track of the intrinsic properties of each process with respect to threading.
  • Create and destroy internal thread structures. These structures comprise at least one private region in the parent process’ address space, which is used as a stack during execution, and two kernel-side management structures: one to save the CPU state of the thread while it’s not running, and one to store its scheduling properties (a rough sketch of these structures follows this list).
  • Perform the switch between two threads on a given CPU, which involves saving the CPU state of the previously running thread to the dedicated storage space and loading the CPU state of the thread that is going to run next.
  • Periodically switch threads, in a fashion which gives each thread a fair share of CPU time (a functionality called task scheduling).
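
Before going further, here is what the kernel-side thread structures mentioned above could look like. This is only a minimal sketch under my own assumptions; every type and field name is hypothetical and would be refined in the actual design.

    /* Hypothetical sketch of the kernel-side thread structures described in the
       list above. Every name and field is illustrative, not a finished design. */

    struct process;                      /* defined by the process abstraction        */

    typedef struct cpu_state {
        unsigned long registers[32];     /* architecture-specific saved registers     */
    } cpu_state_t;

    typedef struct thread {
        cpu_state_t     saved_state;     /* CPU state while the thread is not running */
        struct process *parent_process;  /* owner of the address space                */
        void           *stack_region;    /* private stack in the parent address space */
        int             priority;        /* scheduling properties (refined below)     */
        struct thread  *next_in_queue;   /* link used by a round robin run queue      */
    } thread_t;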

Task prioritization on personal computers

Introduction

I have been mentioning the concept of a “fair share of CPU time” for some time now, but it is clear that this concept is ill-defined at this point. What I want to express is that it is not always good to give each running thread an equal share of CPU time, or even to give each process an equal share of CPU time and each thread an equal share of its process’ execution time. For optimal performance, it can be relevant to prioritize some tasks above others. And in the specific context of a personal computer, I can think of three reasons to do so:

  • To express user or developer opinion concerning the relative importance of different pieces of code
  • To make sure that background tasks, which the user is not directly interacting with and only has a remote perception of, do not overly affect the performance of foreground tasks, which the user is actively waiting for
  • To guarantee that some tasks will always be completed within a certain amount of time, regardless of the number of concurrently running threads

Let’s now define each of these forms of prioritization in more detail.

Relative process importance, or priority

This form of prioritization is the easiest to deal with, and coincidentally also the most frequently implemented. It can be handled easily and elegantly using the aforementioned round robin algorithm. Let T be an elementary time slice (say, 1 ms), also called a “quantum”, and A and B two threads which share a single CPU. If we want to make B “twice as important” as A, we will alternately make A run for one time slice and B run for two time slices. The net result will be that A and B seemingly run simultaneously, but that B gets twice as much CPU time as A on average, and thus seemingly runs twice as fast.
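
To make the quantum arithmetic above concrete, here is a minimal sketch of such a weighted round robin loop for two threads. The weight field and the run_for_one_quantum() helper are hypothetical names of my own, not part of any existing design.

    struct simple_thread { int weight; /* plus CPU state, stack, etc. */ };

    extern void run_for_one_quantum(struct simple_thread *t);  /* runs t for one quantum T */

    void schedule_a_and_b(struct simple_thread *a, struct simple_thread *b)
    {
        for (;;) {
            for (int q = 0; q < a->weight; q++)   /* weight 1: one quantum per round */
                run_for_one_quantum(a);
            for (int q = 0; q < b->weight; q++)   /* weight 2: twice the CPU time    */
                run_for_one_quantum(b);
        }
    }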


Now, remember that we live in a world of isolated processes, which may spawn as many threads as they want. With this in mind, it becomes obvious that if we want processes to remain isolated and unable to “steal” CPU time from their neighbors, we must consider time slices at the scale of processes, and then have each thread’s private time slice carved out of its parent process’ time slice, in a nested form of round robin scheduling.
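
As a rough illustration of this nesting, here is a sketch of a two-level round robin loop, where an outer round robin distributes quanta between processes and an inner one distributes each process’ quanta between its threads. All names below are hypothetical.

    struct process;
    struct thread;

    extern struct process *next_process_round_robin(void);              /* outer level       */
    extern struct thread  *next_thread_round_robin(struct process *p);  /* inner level       */
    extern int              process_weight(struct process *p);          /* relative priority */
    extern void              run_for_one_quantum(struct thread *t);

    void nested_scheduler_loop(void)
    {
        for (;;) {
            struct process *p = next_process_round_robin();
            /* The process as a whole owns process_weight(p) quanta per round... */
            for (int q = 0; q < process_weight(p); q++) {
                /* ...and its threads share them through a second, nested round robin. */
                run_for_one_quantum(next_thread_round_robin(p));
            }
        }
    }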

Preventing background from hurting foreground too much

This form of prioritization is extremely important for user comfort, as everyone who has ever used an OS that feels sluggish under load can attest, but it is also quite tricky to implement, because it requires separating foreground tasks from background tasks. One criterion which could be used to this end is direct communication between foreground software and the user through a communication channel that is known to the operating system (currently displayed GUI, audio stream emission, keyboard focus…), but this information is not available at the kernel level, and must be provided by higher-level subsystems of the OS (GUI toolkit, audio stack…).

At this point, it is worth pointing out that information can also flow the other way between the kernel’s CPU scheduler and higher-level subsystems. Each time there is a shared resource which all threads must appear to use simultaneously but which only one thread may actually use at a time (as can be the case for disk drives, network connections, GPUs…), a form of task scheduling is needed. To be performed efficiently, this task scheduling requires information about each thread’s priority from the kernel scheduler. Conversely, it has been shown above that the kernel needs high-level information to adjust task priorities in an efficient way. The conclusion is that just like there is an “insulator API” which allows user-mode processes to do their part of the system-wide process abstraction, there needs to be a “scheduling API” which allows user-mode processes to read and write processes’ and threads’ scheduling properties in a controlled way.

Once the distinction between foreground and background tasks has been made, making sure that background tasks do not hog CPU time is fairly straightforward: we can define what percentage of CPU time background processes are allowed to use, then apply some form of the round robin algorithm at a third level of nesting to segregate foreground and background tasks.
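
As a sketch of what this third nesting level could look like, consider the loop below, in which background processes collectively get at most a fixed share of elementary quanta. The 10% figure, the group identifiers and the helper function are all illustrative assumptions.

    enum task_group { FOREGROUND_GROUP, BACKGROUND_GROUP };

    #define BACKGROUND_SHARE 10   /* percent of CPU time left to background tasks */

    /* Runs one quantum from the nested round robin of the given group. */
    extern void run_one_quantum_from_group(enum task_group group);

    void foreground_background_loop(void)
    {
        for (;;) {
            for (int q = 0; q < 100 - BACKGROUND_SHARE; q++)
                run_one_quantum_from_group(FOREGROUND_GROUP);
            for (int q = 0; q < BACKGROUND_SHARE; q++)
                run_one_quantum_from_group(BACKGROUND_GROUP);
        }
    }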

Deadlines and real time scheduling

The third form of prioritization which I have mentioned is making sure that some tasks are performed within a certain amount of time, irrespective of the current CPU time sharing landscape. Here, the keywords to look for if you want to study research papers are “real-time computing”, “real-time constraints”, and “real-time operating systems”. Now, there are operating systems which are built from the ground up to solve this kind of problem, used in areas such as mission-critical systems, audio processing, or telecommunications, and I do not pretend to match or beat decades of research in these RTOSs with my humble hobby OS. But I think that the personal computing world does have its share of real-time constraints, which should be taken into account in order to reach optimal performance. Here are a few examples:

  • On average, CPU interrupts should be dispatched faster than they arrive, otherwise PIC buffers will fill up and incoming interrupts will be lost. If we have, as an example, a hardware clock which fires interrupts at a frequency of 1 kHz, these interrupts must be dispatched to interrupt handlers in less than 1 ms, otherwise the hardware will start to drop interrupts.
  • More generally, when an input stream of data fills a hardware buffer in a circular fashion, incoming data must be processed faster than it arrives, otherwise information will be lost. As an example, if we are recording audio using a sound card, and the sound card has a buffer with room for 64 ms worth of audio data that is filled with a new 8 ms block every 8 ms, each block must be processed in less than 8 ms, or else audio data will be lost, resulting in ugly artifacts in the recorded audio (a toy sketch of this timing constraint follows the list).
  • Conversely, when software has to provide a stream of output data, outgoing data must be provided faster than it leaves the hardware buffer, otherwise a buffer underflow occurs, which can have dramatic consequences. If we consider again the example of audio I/O, and have a sound card with a playback buffer containing 64 ms worth of audio data that is read by the hardware in a circular fashion in 8 ms blocks, each block must be generated and provided to the hardware in less than 8 ms to avoid ugly playback artifacts.
  • When information is to be transmitted to the user, the total latency between the instant where the information is generated and the instant where effective transmission takes place (the on-screen GUI is updated, audio notification playback begins) should be as small as possible. Ideally, it should be smaller than the smallest latency that a human being may perceive, which is around 10 ms. When this is not technically feasible, a “latency goal” which software must stay under should still be defined: the larger the gap between the displayed representation of an object and its actual state inside the software, the worse the consequences of lag-induced user actions can be.
  • For feedback, this criterion becomes even more stringent. Between the time when a user sends input to the computer (pressing a key, moving a mouse, tapping a touchscreen…) and the time when this input has perceptible consequences (the on-screen GUI is updated, a sound is played), the whole latency should be smaller than the 10 ms human-perceivable latency threshold, or than a slightly larger “latency goal” when that is not feasible.
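
To make the audio capture example a bit more tangible, here is a toy restatement of its timing constraint in code. The figures and helper names are purely illustrative.

    #define BUFFER_MS 64   /* total room in the sound card's capture buffer */
    #define BLOCK_MS   8   /* hardware delivers one new block every 8 ms    */

    extern void wait_for_next_block(void);   /* returns when a new block is available  */
    extern void process_block(void);         /* must keep up with the BLOCK_MS pace,   */
                                              /* otherwise the capture buffer eventually */
                                              /* overflows and samples are lost          */

    void capture_loop(void)
    {
        for (;;) {
            wait_for_next_block();
            process_block();   /* real-time deadline: roughly BLOCK_MS per block */
        }
    }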

Noticed a pattern there? All the numerical values of temporal deadlines provided here, which are to the best of my knowledge realistic, are small enough to border on human perception limits, a few tens of milliseconds at most. I cannot think of any active processing job on personal computers which needs to stay below a temporal deadline larger than that. Timers which must fire after a precise number of milliseconds has elapsed? Sure. But on desktops, laptops, tablets and the like, real-time constraints on processing work only seem to exist on very small time scales. Could this property possibly be used to simplify the whole real-time scheduling problem?

Let’s review what we have said so far. There are some tasks which must complete before a temporal deadline is reached. If they do not complete within this time frame, horrible things will happen. We are not just talking about software that runs annoyingly slower, as happens when background tasks steal too much CPU time from foreground ones, we are talking about stuff that breaks or becomes extremely painful to use. Breakage is an infinitely worse outcome than running slow, so real-time tasks should have a priority that is infinitely higher than that of normal tasks to avoid this outcome. In round robin terms, “infinitely higher priority” means that real-time software gets approximately 100% of CPU time. So the simplest possible implementation of real-time tasks which we can imagine is that all normal tasks are stopped while a real-time task is running. Now, the question is, could this “simplest possible implementation” work?

Well, let’s consider a simple use case where a single RT task pops up out of nowhere: a keyboard interrupt. The (real-time) kernel interrupt manager fetches the interrupt and dispatches it to the keyboard driver by spawning a (real-time) thread in it. The driver then converts the keypress into a generic input event, which is sent to an input event manager through an RPC call that again spawns a real-time thread. And so on. Now, let’s assume that for some reason, one of the system components in this chain fails and falls into an infinite loop. Then every task on the system is brought to a grinding halt.

Yes, but we have this deadline thing which we have not used yet. Let us now assume that real-time threads and their RPC children can only keep their real-time status until the deadline is reached, after which they become regular foreground threads. After all, if our deadline is busted, it really does not matter by how much: failure is failure. Keeping the use case of hung RT tasks in mind, a “watchdog” notification could also be sent to the process which was running the real-time thread, so that it can halt, clean up and destroy that thread if it makes sense to do so.

So for a single RT task popping up out of nowhere, we are happy. Now, let’s go one step further in complexity, and consider an RT task that is spawned periodically: the previously mentioned clock interrupt. Each time that clock interrupt fires, the kernel interrupt manager fetches it and dispatches it to the appropriate clock driver, which then does its own thing in an RT thread. Let us denote by T1 the elapsed time between two clock interrupts and by T2 the temporal deadline for clock interrupt processing.

Here, three cases are to be considered:

  • T1 < T2: The real-time deadline is inappropriate. If, by chance, the previous clock interrupt handler is done with its work, the kernel can start a new interrupt handler. But if not, the only sensible thing the kernel can do is buffer clock interrupt handling tasks in the clock driver’s queue and drop further interrupts once the queue is full. Interrupt handling performance will be bad. This is probably an implementation error in the driver, and should be reported in debugging information.
  • T1 > T2: The real-time deadline is appropriate. Each clock interrupt is fully processed before the next one arrives, and regular processes get at least T1 - T2 of CPU time to run during each period (a small worked example follows this list). This is normal driver behavior.
  • T1 = T2: If the clock driver needs exactly T2 to process each clock interrupt, then the interrupt handler ends up running permanently and regular software never runs. So the real-time deadline is, again, inappropriate.
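
As a purely illustrative calculation (the processing time figure below is made up), take the 1 kHz clock from the earlier example, so that T1 = 1 ms, and suppose the driver needs at most T2 = 0.2 ms to handle each interrupt:

    CPU time left to regular software per period   =  T1 - T2         =  1 ms - 0.2 ms  =  0.8 ms
    fraction of CPU time left to regular software  =  (T1 - T2) / T1  =  0.8 / 1.0      =  80%

As T2 grows toward T1, this fraction shrinks toward zero, which is exactly the starvation problem discussed just below.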

Of course, T1 = T2 is an unrealistic limit, but if T1 is smaller than T2, or only slightly larger, we get a behavior that is equivalent in practice, where normal software only gets a ridiculously small amount of CPU time and the clock driver ends up eating everything.

In general, having a whole group of running tasks stop perceptibly executing for an extended period of time (also known as “CPU time starvation” in CS jargon) is not a bright idea. The software will be perceived as hung, which in most users’ perception is a form of failure as bad as audio glitches or failed optical disk burning. So just like we want to keep a few percent of CPU time dedicated to background tasks when foreground tasks are running, we also want to keep a few percent of CPU time dedicated to non-RT tasks when RT tasks are running. Similar problems call for similar solutions: we want a form of round robin to ensure that if RT tasks run for too long, the CPU is regularly given back to non-RT tasks for a while. However, in order to avoid missing RT deadlines, we want switching to normal tasks to be a last-resort measure that only happens after a relatively long period of time has elapsed in RT mode, so this form of round robin scheduling will typically have a much larger elementary time quantum than the other round robin schedulers on the system.

Now that we have treated the case of a single source of RT threads, we can also consider the case where several unrelated RT threads may be spawned throughout the system. In that case, it may happen that multiple RT threads are running at the same time, and making them share CPU time fairly using a scheduling algorithm like prioritized round robin is pretty much the only option, unless we want to prioritize some RT tasks infinitely above others (a situation which I will not treat in this article). However, such CPU time sharing introduces irregularity in RT thread execution times, which may in turn make them miss their deadlines. As such, RT threads must remain few in number to work properly. A typical way to achieve this would be to make spawning RT threads a highly privileged action, pretty much the hallmark of critical system services.

Unified task prioritization model

Alright. So far, we have introduced a thread scheduler which supports three different forms of prioritization, and explained why each of them is useful and necessary for a modern desktop or mobile operating system to use system resources efficiently and guarantee the best performance for the tasks which need it most. Still, having to consider three different forms of prioritization is complex, and that complexity goes into kernel code. In software, complexity means bugs and exploits, which in a non-sandboxed codebase like an OS kernel are extremely bad. As such, we would really like to unify everything that was previously introduced under the umbrella of a single abstraction, which would be simpler to implement and potentially more open to future kernel evolution.

Which is why I will now introduce the concept of priority layers.

In this model, there exists a range of so-called “priority layers”, which follow a hierarchy where the “top” layer holds the highest-priority tasks, which run first, and the “bottom” layer holds the lowest-priority tasks. Each layer is associated with an integer index, which represents its position in the layer hierarchy and shall never be used directly by applications, and each thread on the system belongs to one of the priority layers. The previously introduced “RT”, “foreground”, and “background” varieties of running tasks are each associated with a priority layer. Below them, at the very bottom, there is an extra special priority layer, layer 0, where the OS puts threads which are halted (stopped by their parent process) or blocked (stopped by the OS) and shall never run.

All threads belong to one of the priority layers. Each process is also associated with a priority layer, which both specifies the layer where its threads are spawned by default and the highest layer where it may spawn threads without OS help. On a running system, background software belongs to the background layer, foreground software belongs to the foreground layer, and only a handful of system processes, typically the kernel and a few drivers which deal with interrupt sources, are given the very special permission to spawn threads in the RT layer.

The general rule of priority layer scheduling is that tasks from the top priority layer run first, followed by tasks from the next priority layer, and so on until layer 0 (halted and blocked tasks) is reached, at which point there is no work left to do and the CPU is put in a power-saving state. CPU starvation prevention is provided through a pair of per-layer parameters: a layer’s “minimal resource giveaway” (in %) and its “watchdog time” (in ms). When both of these parameters are set and several priority layers are occupied, the scheduler regularly monitors how CPU time is shared between priority layers on the scale of the watchdog time, and temporarily switches to lower occupied priority layers whenever they stop getting their minimal resource giveaway share of CPU time on their own.
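
As a rough sketch of these two anti-starvation parameters, a priority layer could look like the structure below, with a periodic check enforcing the minimal giveaway. Names, types and the exact policy are assumptions of mine, not a finished design.

    struct thread;

    typedef struct priority_layer {
        int            index;             /* position in the hierarchy, 0 = halted/blocked */
        int            min_giveaway_pct;  /* minimal CPU share owed to lower layers (%)    */
        int            watchdog_ms;       /* window over which that share is measured (ms) */
        struct thread *runqueue;          /* threads currently living in this layer        */
    } priority_layer_t;

    extern int  cpu_share_of_layers_below(priority_layer_t *layer);  /* measured, in %, over watchdog_ms */
    extern void yield_cpu_to_highest_occupied_layer_below(priority_layer_t *layer);

    /* Called roughly every watchdog_ms while this layer keeps the CPU busy. */
    void enforce_giveaway(priority_layer_t *layer)
    {
        if (cpu_share_of_layers_below(layer) < layer->min_giveaway_pct)
            yield_cpu_to_highest_occupied_layer_below(layer);
    }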

To implement RT threads in the formalism of priority layers, we now introduce the concepts of “lifetime” and “decay target”. Each thread may have a lifetime (in ms) in each priority layer, after which it “decays” to a lower priority layer called its “decay target”. For the RT threads presented above, for example, the lifetime in the RT layer is the thread’s deadline, and the decay target of the RT layer is the foreground layer. One could also envision RT threads whose decay target is layer 0, i.e. threads which are stopped by the OS if they run beyond their deadline, or threads that decay from a high layer to a low layer in a cascading fashion, for example to implement “soft” real-time deadlines where slightly missing an RT deadline is less critical than missing it by a significant amount of time.
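
The lifetime and decay target mechanism could then be sketched as follows; as before, every name is hypothetical and the real bookkeeping would live inside the scheduler.

    #include <stdint.h>

    #define LIFETIME_INFINITE UINT64_MAX   /* threads that never decay */

    struct sched_thread {
        uint64_t lifetime_ms;           /* how long it may stay in its current layer */
        uint64_t entered_layer_at_ms;   /* when it entered that layer                */
        int      current_layer;
        int      decay_target;          /* layer it falls to when the lifetime ends  */
    };

    extern void move_thread_to_layer(struct sched_thread *t, int layer);
    extern void notify_owner_of_decay(struct sched_thread *t);   /* the "watchdog" message */

    void check_decay(struct sched_thread *t, uint64_t now_ms)
    {
        if (t->lifetime_ms != LIFETIME_INFINITE &&
            now_ms - t->entered_layer_at_ms >= t->lifetime_ms) {
            move_thread_to_layer(t, t->decay_target);   /* e.g. RT layer to foreground layer */
            notify_owner_of_decay(t);
            t->entered_layer_at_ms = now_ms;            /* the lifetime clock restarts below */
        }
    }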

In all cases, for the debugging purposes mentioned in the RT part of this article, it is important to allow kernel code to notify user-mode code when a thread “decays”, but allowing this is mostly an implementation concern. To prevent abuse, a layer may also have a “maximal lifetime”, which threads are not allowed to exceed, and a “highest decay target”, above which threads are not allowed to decay.

The last special case which we want to manage is the OS-enforced transition of processes from the foreground to the background, or, in the most general case, from one priority layer to another. As a first approach, we may imagine a simple model where, as a process transitions from one priority layer to another, all threads which have it as their scheduling parent follow. This simple model works very well as long as threads are spawned within the process’ own priority layer: if a foreground process spawns a foreground thread, then goes to the background, the foreground thread becomes a background thread. If that background process goes back to the foreground at a later time, the thread also goes back to the foreground. So far, so good.

But now, consider threads which are spawned in a non-default priority layer, as when a foreground process spawns a background thread. How do we manage those? Well, the best option in my opinion is to give threads a “preferred layer” property, which specifies which priority layer a thread was created in (or has decayed to) and would ideally stay in. As long as the parent process remains in a priority layer above the thread’s preferred layer, nothing happens. If the parent process moves to a lower layer, the thread follows it into the reduced priority layer, but goes back to its preferred layer as soon as a new transition makes that possible.
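
In code, the “preferred layer” rule could boil down to something as simple as the function below, assuming higher indices mean higher priority; names are, as usual, hypothetical.

    /* parent_layer is the layer the thread's scheduling parent currently lives in,
       preferred_layer the layer the thread was created in or has decayed to. */
    int effective_layer(int preferred_layer, int parent_layer)
    {
        /* Follow the parent downwards, but never rise above the preferred layer. */
        return (parent_layer < preferred_layer) ? parent_layer : preferred_layer;
    }

With such a rule, moving the parent process back up automatically restores every thread to its preferred layer at the next scheduling decision, which is the reversible behavior described above.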

Now, what happens to higher-priority threads, such as RT threads that are spawned by the kernel (or another privileged entity) within processes? Well, the way RPC works is that these threads are, as far as the scheduler is concerned, considered as belonging to the privileged process which spawned them. They only happen to be jailed using the process abstraction of their current host. So altering the scheduling properties of their host won’t affect them, and there is nothing to do here.
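
This distinction between the hosting process and the scheduling parent could be captured by two separate references in the thread structure, as in the sketch below (hypothetical names again):

    struct process;

    struct thread_identity {
        struct process *host_process;        /* address space the thread is jailed in      */
        struct process *scheduling_parent;   /* process whose priority the thread inherits */
    };

For an RT thread spawned into another process through RPC, host_process is the callee while scheduling_parent is the privileged caller, which is why moving the host between priority layers leaves the thread untouched.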

To conclude, while I believe that this layered scheduling abstraction is easier to understand, more future-proof, and less code-hungry than the three separate forms of scheduling mentioned above, it remains worryingly complex. The kernel code associated with it will have to be very rigorously tested for bugs that may turn into vulnerabilities, and there is quite a bit of fine-tuning to be done before a good equilibrium between priority layers is reached. But one thing which I don’t think we will have to worry about in the end is the performance impact of this technology. Today, we are dealing with computers which can compute millions of exact decimals of Pi per second and render smooth, realistic real-time fluid simulations, so the reactivity and perceived performance of a mere desktop GUI really shouldn’t be a problem anymore. If it still is, it is because of an incredibly inefficient use of hardware resources, and if we can increase resource usage efficiency by a significant amount, be it at the unexpectedly high cost of a few percent of CPU power, the net benefits should still be huge.

In fact, it is not even necessarily true that this scheduler will use more CPU resources. On OSs based on pure prioritized round robin, kernel developers are forced to use relatively small time slices to achieve some level of reactivity. But on an OS where reactivity is enforced using RT tasks and foreground prioritization, the round robin quanta of non-RT tasks are no longer the main guarantor of reactivity, and they could go from the ~1 ms of current desktop schedulers to something much longer like 10 ms, or even to 100 ms for background tasks, where there is no need to give a human-perceptible illusion of simultaneous execution. This would mean a scheduler that runs less often, fewer task switches, fewer cache flushes and HDD head movements, and thus potentially a scheduler that uses fewer hardware resources overall.

Conclusions: Thread manager, the full picture

So, what have I been saying so far?

There is a need for an abstraction that represents independent execution units which can run on any available CPU in a prioritized fashion: the thread. Threads have two separate relationships with processes. There is the process which they currently reside in, which determines their isolation properties, and there is the process which started them, which determines their execution priority. When a process A spawns a thread in a process B through RPC, the resulting thread is isolated within the boundaries of B, but prioritized as a thread of A. If that thread in turn spawns a thread in a process C through RPC, the resulting thread is isolated within the boundaries of C, but still prioritized as a thread of A. This distinction ensures that it is tasks, and not artificial process boundaries, that are prioritized.

Prioritization of threads is a critical concern on a desktop or mobile operating system. If tasks are not prioritized properly, as is sadly the case on many current OSs, a device which feels butter-smooth at rest will become slow and sluggish under some amount of background load. The current approach, which is to brute-force this problem away and raise the threshold background load by requiring ever more powerful hardware, puts a growing number of unnecessary hurdles in front of desktop developers (parallel computing, then NUMA, then distributed computing…) and won’t scale indefinitely. Mobile operating systems, which can’t afford high-performance CPUs at a decent price due to hardware constraints, already give the best picture of what will happen if nothing is done: decently performing computers will be a luxury of the wealthy, and everyone else will get unusable crap. Other families of personal computers also exhibit this market segmentation phenomenon, to a weaker extent.

To avoid this tragic fate, we need to prioritize tasks better than we currently do. To this end, I propose a three-level categorization of tasks: real-time (RT) tasks, which must complete within a specific amount of time or will fail to work as expected; foreground tasks, which the user is directly interacting with; and background tasks, which the user is not strongly interacting with. To a first approximation, RT tasks all run before foreground tasks are allowed to run, and foreground tasks all run before background tasks are allowed to run. In practice, however, each family of tasks is asked to leave a small share of system resources to lower families, so that no task is ever fully deprived of system resources on a loaded system.

To future-proof this prioritization scheme and streamline its implementation, I propose the priority layer abstraction, which is suitable for all the use cases presented above and more in the future. There is a stack of so-called “priority layers” in the system, where the lowest priority layer holds fully halted tasks (stopped either by their parent process or by the OS) and the layers above represent the aforementioned task categories in order of growing priority. Threads may “decay” from a high priority layer to a lower one after some time has elapsed, as happens for RT threads which do not need RT priority anymore once their deadline is busted, and whole processes (along with their associated threads) may be moved across priority layers in a fully reversible fashion, as is needed when software moves back and forth between the foreground and the background.

As a word of conclusion, nothing that has been presented here is CPU-specific. There are several hardware and software resources on a running system, managed at a higher level than the kernel, which can be shared between multiple running tasks and require task prioritization; examples include disk drives and GPUs. Conversely, the kernel needs information from higher-level OS components to make some scheduling decisions, such as moving processes from the background to the foreground as they start to interact with the user in a significant way, and vice versa. So in the final thread manager design, there needs to be an interface for scheduling information to flow in a controlled way between the kernel and the higher-level system services which also deal with the problem of scheduling and task prioritization.

And this is where I thank you for reading this very long article, ask for comments, and recommend that you stay tuned for the final thread manager interface design, probably coming in a week or two, once it’s done.
