
Clarifications on processes, pop-up threads, and services

May 21, 2011

In the recently published version of my interrupt handling model, readers noticed that one section, written a bit too late in the evening, was lacking in details and took several logical shortcuts bordering on misinformation: the one about pop-up threads. Considering how important this part is, I thought it was worth going into more detail about why I think pop-up threads are a good solution to this OS's problems, and how I plan to use them.

Flashback: Why this project?

Since its inception, the world of computers has mostly been dominated by monolithic operating systems. When security models appeared on these operating systems, they were conceived in days when it was common for thousands of users to share the CPU time of one huge server, and essentially aimed at protecting the server from attacks by its users, more than at protecting the users themselves. In those days, kernels had only a few features, drivers were few and came directly from the manufacturer of the hardware or the operating system, and malware, to put it simply, did not have a significant presence, as it lacked an essential ingredient it needs to prosper nowadays: a cheap and omnipresent worldwide computer network.

Times have changed. Although client-server systems still exist in academic and corporate settings, the vast majority of computers sold and used nowadays aim at being owned and used by a single user, or sometimes a family, for most of their lives. These “personal computers”, to borrow IBM’s trademark, are connected to thousands of different and incompatible peripherals, each requiring a different driver that is often painfully duct-taped together by the manufacturer in time for release, and most of all to a gigantic, beautiful, and sadly malware-ridden network, the Internet, on which people share not only knowledge but also lots of totally untrusted but incredibly cool-looking software. And to make it even more complicated, the person in charge of administering this hell is no longer someone who has dedicated a significant part of his life to computers, like you and me, but rather an average individual with no more computer knowledge than it takes to move a mouse around, use Word, and copy files to and from a USB pen drive.

In this context, providing a computing service that’s still fast, reliable, usable, and trustworthy for its sole user brings new meaning to the word “challenge”, and different market actors have different ways of dealing with it. Some try to subtly bring the user back to the passive side of a client-server model administered by a trusted and friendly sysadmin, without him noticing, with products like Apple’s App Stores and Google’s ChromeOS. Others do their best to create a transition path from legacy operating systems to an OS model that’s better suited to this new ecosystem, starting from the lowest levels, with strategies like the modularization of the NT kernel or the implementation of some capability-based security in Android. Myself, I’m just a hobbyist who wants to take a clean break from the past and create from the ground up a proof-of-concept OS truly suited to this purpose, showing that it is indeed possible, with a reasonable amount of effort, to create a true personal computer OS having all the desirable qualities mentioned above. An OS that, without putting users under the sysadmin’s dictatorship once again or forcing them to read thousands of pages of technical documentation, manages to be snappy and powerful, crashes rarely and with little impact, is fun to use, and gives malware trying to steal personal data and turn the user’s computer into part of a botnet a serious run for its money, without asking for a lot of security knowledge from the user.

Benefits of a multiprocess design

Now let’s come back to what happens here and now. I’m building the first hardware abstraction layers of this OS, which in a traditional OS would be grouped together in a monolithic kernel. It is a known fact that hardware drivers are hard to code, which makes them significantly more likely to be buggy and/or exploitable than “normal” application code (3x as much, if Tanenbaum’s statistics are to be believed). Yet in classical monolithic kernel designs, those drivers are packed together in the same address space, with full access to the system’s capabilities, and are in direct contact with the core functionality of the OS (memory allocation, process and thread management…). As many computer scientists have pointed out in the past, this is a disaster waiting to happen, so I do my best to instead put drivers in isolated processes with the minimal set of security permissions they need to work, which is known as a microkernel design. I am also seriously considering using such a sandboxed, isolated process structure in higher layers of the OS, to further improve security, reduce the impact of software component failures, and help developers find out where these occur.

If we want to separate OS components into isolated processes this way, Inter-Process Communication (IPC) becomes vitally important. I plan to implement various well-known IPC mechanisms (pipes, signals, remote procedure calls, shared memory…) in order not to needlessly disturb developers’ habits in this regard. But which one should I mainly use?

From libraries to remote calls

Fundamentally, all those little processes are here to provide services. A keyboard driver, for example, is in charge of handling keyboard interrupts, communicating with the keyboard controller, interpreting its weird language with the help of verses from the Necronomicon, and extracting from it clean and understandable standard input events that it sends to higher-level input management layers. It’s always about contacting another process and having it perform actions. Which is, when one thinks of it, very similar to the way libraries work. And what is the natural way of interacting with a library? Through functions (and object methods, for those who make a distinction).

So what we’d like to do is call a function from another process, and the common name for this is a remote procedure call. Traditional function calls work this way: the caller pushes the function’s parameters on the stack, transfers control to the called function, and in the end the function optionally returns a result in a CPU register or on the stack. In the following, we’re going to review the characteristics that must change before we get a suitable remote call model.
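
To make the distance concrete, here is what the same call could look like locally and remotely. This is only a sketch: rpc_call, math_service, and RPC_OK are hypothetical names, not an existing API.

    /* A local call : parameters go on the stack, the result comes
       back in a CPU register, all within one address space. */
    int sum = add(2, 3);

    /* A hypothetical remote counterpart : the target process and the
       entry point must be named explicitly, and the call can fail. */
    int remote_sum;
    if (rpc_call(math_service, "add", &remote_sum, 2, 3) != RPC_OK)
        handle_rpc_error();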

From local calls to remote calls

Before a remote call can be made…

…the process that is to be called has to broadcast entry points, that is, pointers to functions within it that can be called by other processes. This avoids an obvious design mistake where malicious RPC code could call arbitrary functions in a process that isn’t supposed to expose them. To avoid tons of classical stack- and buffer-based attacks, and more generally to ease communication between processes, the service provider will also be asked to broadcast a full prototype of each function to be called, including the nature of each parameter (pointer or data) and its size. In the future, development tools may be tweaked to generate such information automatically and reduce the tediousness of this step.
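
As an illustration, here is a minimal C sketch of what such a broadcast entry point descriptor could contain; all names here are assumptions of mine, not a settled interface.

    #include <stddef.h>

    /* Nature of each parameter, as described above. */
    typedef enum { PARAM_DATA, PARAM_POINTER } param_kind_t;

    typedef struct {
        param_kind_t kind;  /* plain data or pointer to shared memory */
        size_t       size;  /* size of the parameter, in bytes        */
    } param_desc_t;

    /* One of these is broadcast per remotely callable function. */
    typedef struct {
        const char         *name;        /* public name of the entry point */
        void               *address;     /* where it lives in the service  */
        size_t              param_count;
        const param_desc_t *params;      /* full prototype, one per param  */
    } entry_point_t;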

The function parameters

A key benefit of using separate processes is that they can’t peek into each other’s address space. So even though we could theoretically share the stack of the calling thread with the called process, that wouldn’t be such a good idea, because the called process could read information there that it shouldn’t have access to, voluntarily or not. As such, two options remain: either use the stack of one of the existing threads of the called process, which will be in charge of doing the call (asynchronous RPC), or create a new thread within the called process for each remote call (threaded RPC).

We’ll discuss the benefits and drawbacks of each approach later. One thing worth noting for now is that if developers use pointers in RPC parameters without first making sure that the pointed object is shared into the called process’ address space, and without using a pointer to the shared version, the call will fail. One way to solve this problem is to introduce a special “pointer” object for shared memory and have the compiler/kernel/RPC library spit out errors and warnings if “normal” pointers (32- or 64-bit integers, depending on the platform) are put in remote calls instead. Conversely, care must be taken that pointer parameters cannot be used by the calling process to remotely modify “private” parts of the called process’ address space: before going further with the processing of the call, it should be checked that pointers exclusively target areas of the called process’ address space which are shared with the caller.
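
Here is a minimal sketch of that last check, assuming the kernel keeps, for each call, the list of memory regions that the callee shares with the caller (types and names are hypothetical).

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A region of the callee's address space shared with the caller. */
    typedef struct {
        uintptr_t base;
        size_t    length;
    } shared_region_t;

    /* Accept a pointer parameter only if the whole pointed object lies
       inside one of the regions shared between caller and callee. */
    static bool rpc_pointer_is_valid(const shared_region_t *regions,
                                     size_t region_count,
                                     uintptr_t ptr, size_t object_size)
    {
        for (size_t i = 0; i < region_count; i++) {
            if (ptr >= regions[i].base &&
                object_size <= regions[i].length &&
                ptr - regions[i].base <= regions[i].length - object_size)
                return true;   /* fully inside a shared region */
        }
        return false;          /* reject the remote call */
    }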

Control transfer

During a usual procedure call, the calling code is halted and control is transferred to the called code; when the called code ends, the calling code’s execution is resumed. This kind of blocking call can certainly be implemented for RPCs, using pipes or with a bit of help from the kernel, and certainly should be, since it most closely mirrors the way “normal” calls work and as such will be very familiar to developers. However, for functions which do not return results (procedures), the very nature of RPC also gives us the option of introducing a bit more flexibility, in the form of “non-blocking calls”, in which execution of the calling code continues after the RPC request has been sent. This permits several things, among which implementing timeouts in case RPC requests haven’t been processed within a certain amount of time, and is typically a feature worth having in an RPC mechanism.
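
In terms of API, the two flavours could look like the following sketch, in which every name (rpc_call_blocking, rpc_call_nonblocking, sound_driver…) is a hypothetical illustration.

    /* Blocking call : behaves like a normal function call, the
       calling thread sleeps until the result is available. */
    int volume;
    rpc_status_t status = rpc_call_blocking(sound_driver, "get_volume",
                                            &volume, sizeof(volume));

    /* Non-blocking call (procedures only, no result) : the request is
       queued and the caller keeps running ; the timeout bounds how
       long the request may stay unserviced before an error is
       reported. */
    int new_volume = 50;
    status = rpc_call_nonblocking(sound_driver, "set_volume",
                                  &new_volume, sizeof(new_volume),
                                  500 /* timeout in milliseconds */);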

Returning results

Results can be returned either with a bit of help from the kernel, or using IPC and “stub” code within the calling process that is in charge of putting register values in registers and stack values on the stack.
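
In the library-based variant, such a stub could look like this; a sketch only, with all the rpc_* helpers and the math_service handle being hypothetical.

    /* User-mode stub generated for a remote "int add(int a, int b)".
       It hides the IPC behind an ordinary function call. */
    int add(int a, int b)
    {
        rpc_message_t request, reply;

        rpc_message_init(&request, "add");
        rpc_message_push(&request, &a, sizeof(a));  /* marshal parameters */
        rpc_message_push(&request, &b, sizeof(b));

        rpc_send_and_wait(math_service, &request, &reply);  /* block */

        /* The "register" result of the remote function comes back in
           the reply message ; return it as if add() had run locally. */
        return reply.register_value;
    }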

Threaded vs asynchronous

As said before, there’s a choice to make between having a single thread in the called process take care of all RPCs (asynchronous operation) and spawning one thread within it per procedure call (threaded operation). If we consider the way normal procedure calls work, threaded operation is the most natural way to do RPCs: when developers make a remote call, they expect it to begin being serviced right after the “CALL” instruction, not after an unpredictable delay caused by other remote calls from other, unknown processes getting in the way. If the remote calls are CPU-bound and last sufficiently long (e.g. multimedia data processing), threaded operation also provides better performance, due to the added parallelism. There are some considerations to keep in mind about threading, though:

  • The threaded model may sound more failure-proof than the asynchronous model at first, because if one thread gets stuck in a while(1) loop or something similarly silly, the other threads just continue to run. But this advantage can be mitigated through the use of timeouts (the task is killed, process/hardware state is restored, and an error code is returned to the caller if the call is not over after a total running time of xxx), which even the threaded model benefits from, but which only work if tasks take a fairly predictable time to complete.
  • Threads stop benefiting performance as soon as there are more running threads than CPU cores, so once this point is reached, it is best to switch to a “semi-asynchronous” model where new threads are set to an inactive state and wait until running threads terminate or block on I/O before being scheduled to run in their place (see the sketch after this list). When a thread blocks on I/O, once that I/O is completed, the old thread should be set to run again and the new one put back to sleep, in order to prevent an accumulation of half-completed threads in RAM.
  • Although threads will always carry some CPU time cost due to their creation and annihilation, it is possible to reduce the RAM usage of a thread which has never run to a negligible amount by not allocating its stack right away, but waiting until the thread actually runs before doing so.
  • Most service requests should be independent from each other, like library calls (have you ever seen a serious modern library use lots of global variables?), and this is what the threaded model excels at. However, there can be situations where work must be done in a sequential fashion, without even a way to run a significant part of it in parallel before writing the results sequentially. In that case, one is better off using an asynchronous model, since the threaded model then becomes equivalent to an asynchronous one, with synchronization primitives as an extra debugging hurdle and performance hit.
  • Threading has a slight overhead that only pays off for longer-running tasks. If a small task (one that takes a performance hit from threading) is run very frequently, the performance gain of the asynchronous model should become noticeable.
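
To make this “semi-asynchronous” model more concrete, here is a minimal C sketch of the policy (it leaves out the I/O wake-up rule described above, and all the types and helpers are assumptions, not an existing interface).

    /* Called whenever an RPC request arrives for a process. */
    void on_rpc_request(process_t *proc, rpc_request_t *req)
    {
        /* The pop-up thread is created stackless : its stack is only
           allocated when it first gets to run. */
        thread_t *t = create_popup_thread(proc, req);

        if (proc->active_popup_threads < cpu_core_count())
            activate(t);                        /* run right away ;
                                                   bumps the counter */
        else
            enqueue(&proc->pending_popups, t);  /* wait for a slot   */
    }

    /* Called whenever an active pop-up thread terminates or blocks. */
    void on_popup_slot_freed(process_t *proc)
    {
        thread_t *next = dequeue(&proc->pending_popups);
        if (next != NULL)
            activate(next);                     /* reuse the slot    */
    }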

Implementation considerations

I usually try not to mix design and implementation discussions, but here there are some particularly important implementation considerations worth going through.

First, should a system call for RPC exist in the kernel, or should RPC be done by the calling and called processes using a library and a pipe-based communication protocol? There are good arguments on both sides.

  • RPC is easier to implement in the kernel because, well, the kernel knows about every process.
  • Like all other forms of IPC, RPC is also a truly vital mechanism on a microkernel OS, so if there’s a security flaw in the way we do it, we really want to have that flaw patched as soon as possible for all software, which cannot be done if it’s library-based and silly people get their software statically linked to that library instead of using dynamic linking.
  • Furthermore, at some point, we’ll probably want to introduce security token checks (like “you can only run this remote procedure if you have the appropriate security token”). Do we want anything but the kernel to do this check and be responsible for (part of) the OS’ security ?
  • Doing everything in user mode means that we’ll need fewer syscalls, since once a pipe between two processes has been created, the communication on that pipe can mostly be done in user mode. On the other hand, user-mode code won’t be able to transfer control immediately from one process to another instead of waiting for the scheduler to do the switch, which would have helped reactivity. Furthermore, this syscall advantage does not exist in a threaded model, since a syscall is necessary to create threads anyway (if Tanenbaum’s book has taught me one thing about threads, it’s that user-mode ones are not worth the effort).
  • Putting RPC outside of the kernel means one less piece of potentially buggy and/or insecure code in kernel mode to worry about.
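
For what it’s worth, if the kernel route were chosen, the whole mechanism could hide behind a pair of syscalls. This is a hypothetical sketch of such an interface, not a settled design.

    #include <stddef.h>
    #include <sys/types.h>   /* pid_t */

    /* Called by a service process to broadcast its entry points
       (see the entry_point_t descriptor sketched earlier). */
    int sys_rpc_register(const entry_point_t *entries, size_t count);

    /* Called by a client process. The kernel would check security
       tokens, validate pointer parameters against shared regions,
       then spawn a pop-up thread (or queue the request) in the
       callee. */
    int sys_rpc_call(pid_t callee,
                     const char *entry_point,
                     const void *args, size_t args_size,
                     void *result, size_t result_size);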

Another thing to consider is the way the asynchronous model will be managed. To reduce the complexity of the implementation, it’s possible to emulate it on top of the threaded model, using a mechanism similar to the one above, as I mentioned in the previous article:

  • If no thread is running, create a thread and have it run right away.
  • Subsequently generated threads will be put in an inactive state until the current thread is done with its job. If the sole running thread blocks on I/O, nothing will replace it.

But it could also be done by having a thread “waiting for RPC”. That thread simply uses blocking calls either to check a pipe for RPC requests to handle (for user-mode implementations) or to notify the kernel that it is ready to process an RPC request. In the latter case, the kernel keeps a specialized asynchronous task buffer at hand for requests that have been cleared for processing by the called process. Each time the thread notifies the kernel that it is ready, the kernel has that thread perform the next call in the queue by putting the parameters on its stack and making the thread jump to the right place.
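
As a rough sketch of that kernel-notifying variant (the syscall and the request structure are assumptions of mine), the “waiting for RPC” thread could look like this:

    /* Service-side worker thread for the asynchronous model : it
       handles all cleared RPC requests, one at a time, and never
       terminates. */
    void rpc_worker_thread(void)
    {
        for (;;) {
            /* Block until a request has been cleared for processing
               by this process. */
            rpc_request_t *req = sys_rpc_wait_next();

            /* Dispatch to the registered entry point. In the real
               design, the kernel would instead put the parameters on
               this thread's stack and make it jump there itself. */
            req->handler(req->args);
        }
    }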

Once all performance optimizations are applied, the emulated method is expected to be about as fast as the “pure” asynchronous method, except in situations where a very small remote call is made very frequently.


Comments
  1. Brendan permalink

    The problem with your “threaded operation” is that it quickly becomes a nightmare for developers to handle re-entrancy. Any use of any type of lock leads to potential deadlock, livelock, priority inversion, etc; and because you’re creating threads all over the place there’s no way to control it.

    For a simple example, process A acquires a lock, then does an RPC call to process B. During handling the RPC call, process B does a second RPC call back to process A. While handling the second RPC call, process A tries to acquire the lock again, but the lock is already acquired, and now you’ve got deadlock. The author/designer of process A had no way to know that process B would need to do the second RPC call and therefore couldn’t have avoided the problem; and the author/designer of process B had no way to know that process A would use locks like that and therefore couldn’t have avoided the problem.

    If one simple example can lead to an unavoidable deadlock, then imagine how much fun it’ll be with many processes.

  2. Interesting problem indeed. Can you provide a less abstract example of a situation where this could happen? Until now, I’ve never thought of a situation where RPC going back and forth between A and B was involved; it was always more like usual stacked procedure calls…

  3. Brendan permalink

    For an example, consider a file system that supports notifications. Process A tells the file system “send me an RPC call whenever directory “foo” is modified”. Then process A acquires a lock, and sends an RPC to the file system to modify something in directory “foo”. The file system sends back a notification saying that “something changed in directory foo”, and process A tries to re-acquire the lock while handling the notification, causing deadlock.

    That is a very simple example (and probably easily avoided because the author/designer of process A can see the problem in their own code). It gets complex when multiple processes are involved – process A acquires lock and then sends RPC to process B which sends RPC to process C which modifies something in directory “foo” causing the notification back to process A (and the deadlock).

    If the system looks like a tree of processes; then a process can send RPC to its parent (or its parent’s parent or…) without any problem; but if a process sends RPC to one of its children (or its children’s children or …) you risk problems – basically anything that looks like a callback. In this case, you can avoid the problem by saying “never use RPC to talk to children while locks are held”.

    If the system doesn’t look like a tree of processes (e.g. arbitrary processes sending RPC to arbitrary processes) then there’s no easy way to tell where the problems will be. In this case, you can fix the problem by saying “never use RPC to talk to anyone while locks are held”.

    Of course these “fixes” just push the problem elsewhere – it makes it harder for processes to avoid race conditions, etc. because they can’t rely on locks so much.

    A similar problem exists for synchronous systems where the sender/caller waits until the receiving thread is blocked (e.g. waiting to receive); except it’s different/worse. Here the problem is a thread trying to send RPC to something that is waiting for a reply. E.g. thread A sends to thread B and waits for reply, thread B tries to send to thread A and can’t. The normal fix in this case is to force software to look like a tree of tasks where you’re simply not allowed to send RPC to a child (ouch – no callbacks).

    For asynchronous, all the problems disappear (anyone can send anything to anyone else at any time without caring about any deadlocks, etc). Here the problem is that it doesn’t look like a normal function call, and you end up writing software differently (e.g. everything done with “for(;;) { getMessage(); handleMessage(); }” loops and state machines).

  4. For an example, consider a file system that supports notifications. Process A tells the file system “send me an RPC call whenever directory “foo” is modified”. Then process A acquires a lock, and sends an RPC to the file system to modify something in directory “foo”. The file system sends back a notification saying that “something changed in directory foo”, and process A tries to re-acquire the lock while handling the notification, causing deadlock.

    If I understand correctly…

    -Thread T1 of A asks B to notify him when foo changes.
    -T1 acquires lock.
    -T1 RPCs B to write something in foo.
    -B sends notification to A and starts writing in foo
    -Let’s assume that blocking RPC is used for simplicity first (I handle the nonblocking case later) : T1 is blocked until B is done.
    -Threaded case : Notification causes a new thread T2 to be spawned in A. T2 tries to acquire the lock and gets blocked.
    -Async case : Notification processing is scheduled to be done later in A’s event queue.
    -Later, B is done with writing in foo and unblocks T1.
    -T1 releases lock, and dies or goes doing something else.
    -Threaded case : T2 is unblocked, can process the notification.
    -Async case : Notification is processed once T1 is done.

    In this case, threaded falls back nicely to async, I think. I don’t see the deadlock.

    That is a very simple example (and probably easily avoided because the author/designer of process A can see the problem in their own code). It gets complex when multiple processes are involved – process A acquires lock and then sends RPC to process B which sends RPC to process C which modifies something in directory “foo” causing the notification back to process A (and the deadlock).

    -T1 asks for notifications.
    -T1 acquires lock.
    -T1 RPCs process B.
    -Process B RPCs process C.
    -Process C modifies something in folder foo.
    -New thread T2 is spawned in A, blocks for lock.
    -Modification of folder foo is done, threads in process C and B complete their work.
    -T1 wakes up, releases the lock, goes back to living its life.
    -T2 starts handling foo modification notification.

    Smooth fallback to async again?

  5. Here’s the nonblocking RPC case…

    Threaded model + nonblocking RPC :
    -T1 asks for notifications.
    -T1 acquires lock.
    -T1 RPCs process B, sets up a callback to release the lock when work of B is done, and goes doing something else.
    -Process B sends notification back, starts working.
    -T2 is spawned, blocks for lock.
    -When process B is done, it sends callback to process A.
    -T3 is spawned to handle callback, releases lock and dies or does something else.
    -T2 can proceed.

    Async model + nonblocking RPC :
    -T1 asks for notifications.
    -T1 acquires lock.
    -T1 RPCs process B, sets up a callback to release the lock when work of B is done, and goes doing something else.
    -Process B sends notification back, starts working.
    -T2 is queued for processing notification once T1 is done.
    -When process B is done, it sends callback to process A.
    -Callback is added to A’s event queue.
    -At some point before or later, T1 is done.
    -Process A switches to foo notification handling, tries to acquire the lock that should be released by the next item in the event queue, and blocks: A is deadlocked.

    Actually, the threaded approach sounds more robust here.

  6. Brendan permalink

    -Thread T1 of A asks B to notify him when foo changes.
    -T1 acquires lock.
    -T1 RPCs B to write something in foo, creating T2
    -T2 modifies foo
    -T2 checks the list of notifiers
    -T2 RPCs A to tell A that foo changed, creating T3
    -T3 tries to acquire lock and fails because the lock is already acquired
    -T1 can’t free lock until T2 completes
    -T2 can’t complete until T3 completes
    -T3 can’t complete until T1 frees lock
    = deadlock

    If either of those RPCs are non-blocking, then “for asynchronous all the problems disappear”…

    You mostly end up with a system where programmers have to be very careful about how and when they use (blocking) RPCs. A good programmer will just use non-blocking RPC instead so they don’t need to care about the problems. A bad programmer will get all confused and write single-threaded code that still manages to break everything. :-)

  7. I see the problem in my comments above. I’ve implicitly assumed that notifications were nonblocking, because I couldn’t see a single benefit in not doing so. After all, it’s called a notification for a reason: you push it to another process because you’ve been told to do so, but what it does with it afterwards is none of your business, so there’s no point blocking until the other process has processed it.

    Anyway, I still don’t see how the async approach differs from the threaded approach there. The threaded approach too has no problem getting through as soon as people start using nonblocking calls (T2 may complete its job after spawning T3, without caring about its fate, see the example above), and with blocking calls I don’t think async would perform any better. It’d be like…

    -Thread T1 of A asks B to notify him when foo changes.
    -T1 acquires lock.
    -T1 RPCs B to write something in foo, creating T2
    -T2 modifies foo
    -T2 checks the list of notifiers
    -T2 RPCs A to tell A that foo changed
    -T2 blocks until foo notification processing is done
    -T1 can’t free lock until T2 completes
    -T2 can’t complete until foo notification has been processed
    -foo won’t be processed until T1 is done.
    Deadlock there, too

    Maybe the lesson to be taken from that example is that I should ditch blocking calls altogether and tell people “Blocking was a remnant of the synchronous days of single-core CPUs, callbacks are the way it’s done now, get used to it”?

    You mostly end up with a system where programmers have to be very careful about how and when they use (blocking) RPCs. A good programmer will just use non-blocking RPC instead so they don’t need to care about the problems. A bad programmer will get all confused and write single-threaded code that still manages to break everything. :-)

    Sometimes, blocking code is just more intuitive, but I wish we could get rid of it. Some IDEs, like Delphi (RIP), managed to provide good metaphors for event-driven programming; maybe it’s the development tools and tutorials that must change before more people start giving callbacks a try…
