It’s not the first time which I’ve said this, and it won’t be the last: on a modern operating system, prioritizing code with respect to which process it runs in is largely useless and should be replaced with more advanced scheduling mechanisms which can actually grasp the context of tasks. In this post, after a quick reminder of why this is the case, I will discuss ways to achieve task-based scheduling, along with various edge cases which have to be handled properly for such scheduling to work well.
When process priorities do not work
Let us picture ourselves someone doing live sound engineering work, consisting of taking audio streams from a multichannel input, applying various effects to them, mixing them together in a five-channel output, and sending the result to large speakers through appropriate output. Personal computers have fast enough to do this kind of work for a very long time, provided that the effects being used are not too fancy and that appropriate audio interfaces are available.
However, in practice, people who do this kind of thing will tend to be paranoid about running any kind of background application when such an audio pipeline is running, or use vastly overpowered hardware for this use case. The reason is simple: sometimes, when pushing a machine relatively close to its limits, all it takes is for an e-mail to be received or some kind of background processing to be carried out, before the audio output will start to stutter and exhibit other latency-bound glitches. Experience shows that in this case, adjusting the priority of the audio software to ensure that other software does not still computer resources away from it will either be useless or even worsen the actual system performance.
Pay attention when you are using your computer, and you will notice other examples of silly performance glitches like this, where programs start slowing down as soon as some kind of mild background processing is going on. And the million dollar question is, why are the task prioritization mechanisms of personal computer operating systems so inefficient, to the point of recent mobile operating systems either giving up on user-accessible multitasking altogether or violently castrating it to the point of rendering it nearly useless? It is not for a lack of scheduling mecanisms, for sure, as any modern operating system is capable of ensuring that a given process gets exactly N% of CPU time, in timeslices of miliseconds that are respected up to a nanosecond granularity, along with providing additional “real-time” mechanisms for tasks which really need to run quickly. The key to understanding this problem resides, in my opinion, in asking oneself what is being prioritized and why.
The need for a task abstraction
In the example above, if we increase the priority of the audio processing process, we do so at the expense of every other process running on the system. This includes, in particular, the various system processes which will be invoked by said audio software in order to read and write data from disk, send data to and from audio interfaces, and perhaps also do more advanced multimedia processing tasks depending on the design of the OS being implemented. In legacy OSs, this was less of an issue, because system tasks used to preempt everything else and run fully until completion, at the risk of freezing the whole system in the event of a bug. However, with modern operating systems treating system tasks more and more like any other task for the sake of increased reliability, this system task priority reduction problem becomes more and more of an issue. Basically, this is why increasing the priority of a user process to very high levels will often dramatically reduce the whole system performance, ironically including the application itself.
It gets worse, though. Suppose now that a background task is started in this prioritized scheme. Since it has a process priority similar to that of system tasks, but significantly lesser than that of the foreground process, it will actually eat into the CPU time of these system tasks more than in that of the foreground task, which is actually an intended behaviour, but can quickly starve these vital tasks from CPU time altogether. Moreover, the background task can itself give work to do to system processes, and that work will be handled on an equal footing with that of other tasks running inside of that system processes, including those which were started by the foreground task.
In essence, giving scheduling priorities to processes is a fundamentally broken way to control the priority of tasks, which only affects the work that is directly performed by applications and not the work which they delegate to system APIs. As a conclusion, it is also worth noting that well-designed applications, which take much care to make use of system abstractions instead of reimplementing everything in their own way, will effectively end up being more penalized for this choice than applications which feature everything and the kitchen sink within the boundaries on their code, with respect to this prioritization scheme.
To avoid this, the only option is to have the OS scheduler actually be able to prioritize not only process code in and of itself, but also the code which is started by a given process inside of other processes, and specifically for the purpose of doing the task at hand. For the purpose of this discussion, I will call this whole set of running code a task in the remainder of this article.
Implementing tasks into a scheduler
So, we want the system scheduler to be aware of when a client process gives work to do to a server process, and to know which thread of the server process is in charge of processing that work, so that it may run that thead at client priority. In the meantime, we would like not to break process isolation by letting the server schedule threads at whichever priority it wants, we also need it to keep the ability to delegate client work to other processes with same execution priority or to do multithreaded processing, and finally we need to make the associated technical solution reasonably independent of the IPC primitives being used and the way the server is being implemented.
Achieving all of these goals is no trivial task. In fact, at least one of them is probably unrealistic : without deep knowledge of the underlying server implementation, a scheduler cannot guarantee that privileged client time won’t be used by a server process to do other things. All it can do is to set some restrictions on the way that privileged CPU time will be used. One of such restrictions would be to ensure that once a task has been performed, no server is able to run threads with the priority associated to that task. This only requires the system scheduler to be notified of when a task is started and ended, which can be done by the client and silently automated in the specific case where RPC calls are used for IPC. For those asynchronous tasks which never return or are not RPC-based, the server process should probably be trusted to notify the kernel of work completion instead.
How could the client “pass around” its scheduling settings to other processes in practice? In my view, an attractive option is to have the system scheduler issue, on the request of a process, revocable “task tokens” that can be passed around to other processes. These tokens would represent the client’s permission to let code associated to its task be performed at its scheduling priority. A server process would then be able to use such a token to have any number of threads scheduled with client priority, as long as the token stays valid. When the client notifies the scheduler that a task has been performed, the scheduler could then ensure that no thread remain running with client priority using the associated task token.
Starting from this simple concept, let’s now discuss how a task-based scheduler could handle the number of edge cases that will ineluctably arise…
Dealing with task-related issues
Mitigating priority abuse
A problem with task scheduling which has been outlined above is that server processes which have been handed a scheduling token could potentially never return to the client and instead use that token for some nefarious purposes involving running tasks with inappropriately high CPU priorities. Above, I have merely handwaved the issue by stating that the server process should probably be trusted anyway, but I reckon that the most paranoid of us would not be satisfied with such an ostrich solution.
For purposes where it is critical that a server can never acquire client priority permanently, start a lot of threads using it, or pass it around to other processes, I see two possible mitigation strategies. One would be to set a reasonable limit on the amount of threads that can simultaneously use a given scheduling token. The other would be to limit the duration for which a scheduling token can be used, and have it timeout automatically as if the task had been completed after that time. In both cases, threads which cannot use the token could fall back to the “normal” server process thread priority, though additional thoughts must be turned towards how the server process should be informed of the situation.
Sometimes, client processes do stupid things. Like trying to break process isolation and dying in the process, dereferencing null pointers, or giving other processes some computation and never reading the results. All of these scenarios lead to a situation where a client process dies while other server process around are still using its task tokens. What should happen then?
I see two possible strategies regarding this outcome. One is to revocate all task tokens associated to a client process at the point where it dies. It has simplicity in its favor, as there is no need for the system scheduler to keep records about dead processes, but it can also let some server threads run at an unfairly low priority if the client actually intended server threads to complete some work for it without caring about the results. The other option is to keep task tokens alive until tasks are completed. It is friendlier towards asynchronous processing, but adds complexity to the scheduler and can potentially lead to memory leaks. For this reason, I would spontaneously prefer the first option.
Deferring task definition
Not all tasks are started by a process. When a hardware event occurs, additional processing is often needed before work related to that event can be assigned to a process. At which priority should that processing be performed?
My opinion is that in doubt, most hardware events should be handled with very high priority until their actual priority is determined. After all, the hardware drivers which handle these events are highly trusted by other processes anyway, so there is little harm in giving them high privileges. For these purposes, I would probably use real time tasks together with a reasonable (driver-specific) deadline that makes the threads fall back to background priority when it expires. This would ensure that hung driver task do not completely stall the system.