This document is an updated version of “The OS-periment’s service model : RC2” that takes into account the feedback submitted by Alfman on said document’s comments.
Highlights of this new release, visible in the form of red text, include…
- Naming : nonblocking RPC it is
- Caching and batching in the stub library code
- Serialization issues, and why it seems they can’t be fully avoided if C compatibility is to be maintained
- More details on shared memory, automated sharing
- Details on connections and dynamic setup of remote calls
- Management of server death
Following the microkernel tradition and the philosophy behind early UNIX systems, this OS is to be structured as a set of relatively independent modules, separated by process boundaries, which provide services to each other in a controlled way. The question I aim to answer here is: which way? How should processes communicate with each other in order to provide these services? Read more to find out.
I. Principle and use cases
If we take two processes, a client and a server, the action “server providing a service to the client” is performed by having the client ask the server to do work for it. The server does the work, then notifies the client when it’s done, if the client is interested in that information, optionally providing results that way. The programming abstraction commonly used for such purposes on library-based OSs is the function call: client code gives work to the server (the library), which does the work and notifies the client when it’s done, optionally providing results (through the library function’s return mechanism). So the most naive implementation of a “service” would be an IPC primitive which emulates a function call, having the client “call” a function of the server. This IPC primitive is called Remote Procedure Call, or RPC.
A problem with RPC is that it is not as universal an IPC primitive as one might wish. Consider, as an example, a notification mechanism. When a server sends a notification to a client, it certainly doesn’t care about what the client will do with it. So what the server would wish to do is just send the notification, then go do something else. But if notifications were implemented using RPC, the server would have to stay blocked while the client processes the notification, until the client’s code returns. There would be, of course, ways to avoid this, like having the server create an extra thread to send the notification, but all in all one must admit that RPC simply doesn’t fit this use case well. The reason: it forces the server to care about when the client has completed its task, even when it doesn’t want or need to, by blocking it while the call is processed. There are several other issues with blocking calls: they aren’t suitable for handling unexpected events, they cause deadlocks, and getting around them leads processes to create several threads only to have them block shortly thereafter, in a waste of system resources… All in all, the RPC model could be quite good, but blocking is definitely not a part of it which we want around.
So let’s remove it.
Our reworked IPC primitive would be something that works like a function call on the client’s side and runs a function of the server with the parameters given by the client, like in RPC. But this time, immediately after the work has been handed to the server, the client’s call returns. If the server wants to provide results, it does so by calling, in turn, a function of the client whose parameters are standardized: a callback.
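To make the shape of this primitive concrete, here is a minimal single-process sketch in C. Every name in it is made up for illustration, not an actual interface: the client-side call returns as soon as the work has been handed over, and the result comes back later through a callback.

```c
#include <stddef.h>

/* Hypothetical sketch of the nonblocking RPC shape: remote_compute()
   returns as soon as the work has been handed over, and the result
   arrives later through a client-provided callback. */
typedef void (*callback_t)(int result);

/* One pending-call slot standing in for the kernel's queueing. */
static struct { int arg; callback_t cb; int pending; } slot;
static int last_result;

/* Client side: hand the work over and return immediately. */
void remote_compute(int arg, callback_t cb) {
    slot.arg = arg;
    slot.cb = cb;
    slot.pending = 1;   /* no blocking: the client goes on with its life */
}

/* Server side, run later: do the work, then notify via the callback. */
void server_poll(void) {
    if (slot.pending) {
        int result = slot.arg * 2;   /* the "service" */
        slot.pending = 0;
        if (slot.cb) slot.cb(result);
    }
}

static void on_result(int r) { last_result = r; }

int demo(void) {
    remote_compute(21, on_result);   /* returns at once, result not ready */
    server_poll();                   /* server runs, callback fires */
    return last_result;
}
```

In a real implementation the pending-call slot would live in the kernel and server_poll() would be the server’s dispatch loop, but the control flow would be the same.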
I call this IPC primitive “non-blocking RPC”, and will refer to it as “remote call” regularly for clarity purposes.
Oh, and also, using RPC obviously does not prevent drivers from having standard interfaces. They’ll just have standard RPC-based interfaces.
II. How should it work in practice?
II.1) Service broadcasting
II.1.a) Server side
Client code should obviously not be able to call all the functions of the server code, otherwise our IPC mechanism totally breaks process isolation, and that’s not good at all. Therefore, as part of its initialization, the server should specify to the kernel which of its functions may be called by client processes, along with function pointers so that the kernel may actually run them when they are called.
To allow for function overloading, the server should not only broadcast function names, but also a full function prototype. To allow for interoperability between clients and servers written in different programming languages, function prototypes should be written in one specific programming language, no matter which programming language the server is written in, so that the kernel may easily parse them. Translation to and from other languages’ function prototypes can be done by library code as needed. We adopt C/C++’s conventions for this task, as this is what most of the operating system will be written in and there’s no reason not to make our life simpler :)
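As an illustration of what prototype broadcasting could look like on the server side, here is a sketch. The rpc_broadcast() “syscall” and the registry layout are pure assumptions; the point is that overloads remain distinct because lookup is done on the full C prototype, not just the function name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of server-side prototype broadcasting. */
typedef void (*fn_ptr)(void);

/* Stand-in for the kernel's table of broadcast functions. */
static struct { const char *prototype; fn_ptr fn; } registry[16];
static int registry_count;

/* Server side: hand a C prototype string and a function pointer over. */
int rpc_broadcast(const char *c_prototype, fn_ptr fn) {
    if (registry_count == 16) return -1;
    registry[registry_count].prototype = c_prototype;
    registry[registry_count].fn = fn;
    return registry_count++;
}

/* Kernel side: resolve a call by its full prototype, so that two
   overloads with the same name stay distinct. */
fn_ptr rpc_lookup(const char *c_prototype) {
    for (int i = 0; i < registry_count; i++)
        if (strcmp(registry[i].prototype, c_prototype) == 0)
            return registry[i].fn;
    return NULL;
}

static void draw_stub(void) {}

int demo(void) {
    rpc_broadcast("void draw(int x, int y)", draw_stub);
    return rpc_lookup("void draw(int x, int y)") == draw_stub;
}
```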
As the programming language used to code the server may not support function overloading itself, it should be possible for a server to broadcast a function of its code under a different name (i.e. the prototype broadcasting process should not be fully automated by development tools).
To allow for breaking API changes that do not alter a function’s parameter structure (as an example, if a structure’s definition changes), a versioning system should be introduced. The server process may optionally broadcast an integer version number along with each function prototype; if it does not, the function is assumed to be at version 0.
To allow for non-breaking API changes that extend a function’s feature set by appending extra parameters to it, the remote call subsystem should accept default parameter values. Conversely, when a callback is sent by the newer server to the older client, the new parameters of that callback should be silently ignored by the kernel.
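Here is a sketch of how appending default values could work; names and layout are illustrative only. A version-0 client passes two parameters, and the “kernel” completes the newer parameter block with the default registered by the server.

```c
/* Illustrative sketch: an old (v0) client calls draw(x, y); the server's
   newer (v1) prototype has an extra color parameter with a registered
   default, which the kernel appends before making the server-side call. */
struct draw_v1_params { int x; int y; int color; };

static const int registered_default_color = 7;   /* from the server */

/* "Kernel" side: complete a v0 parameter block into a v1 one. */
struct draw_v1_params upgrade_v0_call(int x, int y) {
    struct draw_v1_params p = { x, y, registered_default_color };
    return p;
}

int demo(void) {
    struct draw_v1_params p = upgrade_v0_call(3, 4);
    return p.x == 3 && p.y == 4 && p.color == registered_default_color;
}
```

The opposite direction, stripping new callback parameters for an older client, is the same operation run backwards, which is why the kernel needs to know parameter lengths (see II.2 below).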
Up to this point, the method through which the server sends its function prototypes to the IPC subsystem has been left unspecified. This is done using a system call, with remote call definitions hard-coded in the server’s code.
Internally, the kernel keeps track of remote calls by PID, though client code will obviously have to find out about the server’s PID the first time it makes a remote call.
It must be possible to set up remote calls between the client and the server at any time during their operation. The typical use case for this is a notification feature: somewhere in the middle of their running time, clients should be able to tell the server “hey, I want to receive notification x through remote call y”.
II.1.b) Client side
To make different versions of the server and the client work well together using the features described above, it is necessary that the client, too, sends as part of its initialization the names, prototypes, versions, and default parameter values of the functions which it has been designed to work with.
Client developers wouldn’t have to care about that in practice, however, as good server developers will provide, either by themselves or using automated development tools, “stub” library calls which hide the internals of the remote call mechanism from them and make it as simple as calling a library function with the proper parameters. The only thing which the client’s developer will have to work on is callbacks, and again libraries can be used to make it as simple as writing the callback function itself and providing a function pointer to it as one of the library function’s parameters.
An extra benefit of using “stub” library code is that it can reduce the cost of RPC in some situations by caching client calls on the client side and sending them in batches, reducing the number of system calls required. This only requires the kernel to support receiving and processing several remote calls at once, which would be rather easy to implement.
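A sketch of what such batching could look like in the stub library, with every name being an assumption: calls accumulate in a client-side buffer, and a single “submit many” syscall flushes them.

```c
/* Sketch of client-side call batching in the stub library: calls
   accumulate in a buffer and reach the kernel through one syscall. */
#define BATCH_MAX 8

static struct { int call_id; int arg; } batch[BATCH_MAX];
static int batch_len;
static int syscalls_made;   /* instrumentation for the sketch */

/* Stand-in for the one syscall handing several remote calls over. */
static void sys_remote_call_many(int count) {
    (void)count;
    syscalls_made++;
}

void stub_call(int call_id, int arg) {
    batch[batch_len].call_id = call_id;
    batch[batch_len].arg = arg;
    if (++batch_len == BATCH_MAX) {   /* flush a full buffer */
        sys_remote_call_many(batch_len);
        batch_len = 0;
    }
}

int demo(void) {
    for (int i = 0; i < 16; i++) stub_call(1, i);
    return syscalls_made;   /* 16 stub calls, only 2 syscalls */
}
```

A real stub would also flush on a timer or at explicit synchronization points, so that calls are not delayed indefinitely while the buffer fills.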
Sending server PIDs and prototypes to the kernel each time the client makes a call would be an unnecessary performance hit. It would also mean that if the server dies and comes back under a different PID, the client stops working properly, unless it constantly polls for the server’s PID. For all these reasons, a connection system is used: the first time a client makes a given remote call, the kernel checks compatibility, sets up management structures for future calls, and returns a unique integer identifier designating this structure. From that point on, client code, including library code, will only use this integer to designate which remote call it wants to make. This way, the kernel may internally and silently manage server death by adjusting these management structures, without having the client care about it.
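The connection system could be sketched as follows; the structure layout and function names are assumptions. The first call yields an integer identifier, and server death is handled by silently repointing the connection.

```c
/* Sketch of the connection table: the first call resolves a server PID
   and prototype into an integer id; later, the kernel can silently
   repoint the connection if the server is restarted under another PID. */
static struct { int server_pid; const char *prototype; } connections[16];
static int connection_count;

int rpc_connect(int server_pid, const char *prototype) {
    /* (prototype compatibility checks would happen here) */
    connections[connection_count].server_pid = server_pid;
    connections[connection_count].prototype = prototype;
    return connection_count++;   /* what client code uses from now on */
}

/* Kernel-side fixup on server restart: clients keep their ids. */
void rpc_rebind(int old_pid, int new_pid) {
    for (int i = 0; i < connection_count; i++)
        if (connections[i].server_pid == old_pid)
            connections[i].server_pid = new_pid;
}

int demo(void) {
    int id = rpc_connect(42, "void notify(int event)");
    rpc_rebind(42, 57);          /* server died, came back as PID 57 */
    return connections[id].server_pid;
}
```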
II.2) Making the remote call
Okay, so the server and the clients have sent their respective prototypes to the kernel, which has itself done its compatibility checks and confirmed that everything is okay. Now, how does the client call one of the server’s functions?
“Normal” function calls are done by pushing the function’s parameters on the stack, in an order and a fashion specified by the programming language and compiler’s calling convention, then jumping to the location of the function’s code. Obviously, this cannot be done here, because if client code can call server code seamlessly we have a major process isolation malfunction. Sharing the client’s stack with the server cannot be considered either, for the same reason. So we have to somehow transfer the function parameters from the client’s stack (where they are when the stub library code is called) to a stack within the server’s address space, and we want that transfer to be as fast as possible, because this remote call mechanism is to be used even in the lowest layers of the OS.
As mentioned earlier, most of the OS (at least its lowest layers) is going to be written in C/C++. If the client and the server are both written in C/C++, and use the same calling conventions, all we have to do is to copy the stack region associated with the function parameters somewhere on a stack within the server’s address space, add the new default parameters if the client was built for an older version of the server’s function, and then we can make the call on the server side while on the client side we remove the parameters from the stack and return. That is as fast as one can possibly get, and only requires the stub library code on the client side to specify the size of the function parameters on the client stack.
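A sketch of this parameter transfer, with a struct standing in for the client’s stack frame (the names are illustrative):

```c
#include <string.h>

/* Sketch: with identical calling conventions on both sides, parameter
   transfer is a raw copy of the parameter block; the stub only has to
   report its size. The struct stands in for the client's stack frame. */
struct add_params { int a; int b; };

/* "Kernel" side: copy the block from client space to server space. */
void transfer_params(void *server_dst, const void *client_src, size_t len) {
    memcpy(server_dst, client_src, len);
}

/* The server function eventually invoked on the copied block. */
int server_add(const struct add_params *p) { return p->a + p->b; }

int demo(void) {
    struct add_params client_frame = { 2, 40 };  /* on the client "stack" */
    struct add_params server_frame;              /* server-side storage */
    transfer_params(&server_frame, &client_frame, sizeof client_frame);
    return server_add(&server_frame);
}
```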
If communication between clients and servers written using two different programming languages is involved, it is possible to use translation layers at the library level. In fact, this is how languages which do not belong to the C family do system calls on most C-based modern OSs, and it works just fine.
Care must be taken with pointers, though, as client-side pointers won’t be valid on the server side. Either binaries are loaded at a fixed location in their process’ address space, in which case all kinds of pointers are fundamentally incompatible between processes, or binaries share a common address space and are relocatable using program counter-relative data pointers, in which case the changing location of code between the client and server processes will still be a problem. Run-time symbol table parsing akin to dynamic linking could address these problems, but only at the cost of a fantastic complexity that has no place in low-level system layers. And besides, pointers must still be checked for validity: remote calls may only depend on data that is shared between the client and the server, not on arbitrary server data which the client should not have access to.
To address these issues, an automated sharing mechanism is suggested. When a pointer is used as a remote call parameter, the kernel automatically detects and shares the memory which it points to with the server, and alters the passed pointer so that it points to the server-side version; all that is asked of the client is that data to be shared be allocated in a shareable way. When the server is done, its shared copy is freed automatically. See appendix A for more details about the sharing mechanism.
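The pointer rewrite could be sketched like this; the real mechanism would remap pages rather than copy bytes, and every name here is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the pointer rewrite performed by the kernel: the real
   mechanism remaps pages, which is faked here with a second buffer. */
static char client_buffer[32];   /* allocated with the shareable allocator */
static char server_buffer[32];   /* the "mapping" in the server's space */

/* Share len bytes with the server and return the server-side pointer
   that the kernel substitutes into the remote call's parameters. */
void *share_with_server(const void *client_ptr, size_t len) {
    memcpy(server_buffer, client_ptr, len);  /* stands in for remapping */
    return server_buffer;
}

int demo(void) {
    strcpy(client_buffer, "hello");
    char *server_ptr = share_with_server(client_buffer, 6);
    /* the server sees the same bytes through a different pointer */
    return server_ptr != client_buffer && strcmp(server_ptr, "hello") == 0;
}
```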
This does not solve the problem of pointers which are hidden within data structures or shared memory blocks, though. Considering the very high extra complexity required to handle these automatically while keeping C/C++ compatibility, we leave the serialization of all pointers included within shared memory blocks (e.g. linked lists) up to good server programming practices. If using pointers in data structures which are included within function parameters cannot be avoided, programming tool-level support for them can be introduced, automatically broadcasting to the kernel the prototypes of all data structures used as function parameters, including all kinds of padding between structure members which can be introduced by the compiler. The kernel would then only have to interpret the header automatically generated by the programming tools. But it sounds like an unnecessarily high hurdle to me, at the moment.
II.2.b) Running the code
At this point, server-side code is ready to be run. Two modes of server operation are supported. One of them is the asynchronous mode: if the server is idle, a thread is spawned within it to handle the task. While a task is being processed, further remote calls are put in a queue, either as inactive threads or as mere stacks of function parameters. When the processing is done, the next task in the queue is processed, and so on, one task at a time.
The other mode of server operation is the threaded mode. The difference with the asynchronous mode is that each time a new remote call occurs, a new thread is spawned within the server process in an active state, until the number of running threads within the server process equals the number of CPU cores, at which point remote call processing falls back to asynchronous behaviour.
A cache of “dead” threads may be used to greatly speed up thread creation by removing the need to allocate memory each time a new thread is created.
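Such a cache is essentially a free list; here is a minimal sketch (names are illustrative):

```c
#include <stdlib.h>

/* Sketch of a dead-thread cache: finished threads go on a free list and
   are recycled, so steady-state thread creation allocates no memory. */
struct thread { struct thread *next; };

static struct thread *free_list;
static int allocations;   /* instrumentation for the sketch */

struct thread *thread_acquire(void) {
    if (free_list) {                  /* recycle a cached dead thread */
        struct thread *t = free_list;
        free_list = t->next;
        return t;
    }
    allocations++;                    /* allocate only on a cache miss */
    return malloc(sizeof(struct thread));
}

void thread_release(struct thread *t) {
    t->next = free_list;              /* the thread "dies" into the cache */
    free_list = t;
}

int demo(void) {
    struct thread *a = thread_acquire();   /* miss: one allocation */
    thread_release(a);
    struct thread *b = thread_acquire();   /* hit: served from the cache */
    thread_release(b);
    return allocations;
}
```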
The threaded operating mode is recommended for independent CPU-bound tasks, as it makes optimal use of the computer’s multiple CPU cores. On the other hand, for IO-bound tasks which do not benefit from the extra parallelism, and in situations where running tasks in parallel would require so much synchronization that they would run slower than in a sequential fashion, asynchronous operation should be preferred.
When the server is done with its work, it may have to send a notification and results to the client using a “callback” remote call. It is interesting to investigate what happens then. At first sight, this is exactly the kind of remote call used by the client, but in practice there is an important difference: this time, the kernel must deal with the compatibility of newer server callbacks with older client implementations. So it must remove function parameters from a stack. Unlike adding parameters, the kernel may only do this for the server if the length of the new parameters is known to it in advance. This can be done in two ways: either the server specifies the length of each function parameter every time it calls a callback function, adding overhead to the interprocess communication protocol, or variable-length parameters are forbidden in remote calls.
Considering how much of a bad idea variable-length parameters generally are (pushing a whole array on the stack is a memcpy() hell and a stack overflow waiting to happen), I can’t help but lean towards the second solution, especially considering that adding variable-length parameters later is possible if there’s demand for it. Considering that this is not a distributed operating system, shared memory buffers and pointers to the server’s address space sound like a very valid alternative to variable-length function parameters. But this point is open for discussion.
III. Possible extensions to this model
All this leads to a beautifully simple and fast remote call model, but sometimes extra features are needed. This is why the remote call infrastructure should offer room for extensions that alter the behaviour of the calls. Here are examples of what such extensions could be.
III.1) Timeouts
Nothing can cause more painful problems than a server stuck in the equivalent of a while(1) loop, especially if that server uses the asynchronous operating mode. However, in some situations, it is possible to prevent this issue by having the server automatically abort its current task if it is not completed after a certain amount of time, send an error to the caller process, and revert its internal state to what it was before the killed thread started to run.
This extension can improve the reliability of the operating system significantly, but it should be optional, for two reasons:
- It adds a lot of overhead to client and server development, which may just be too much for non-critical services. The server must support reverting its internal state to an earlier version, which implies regularly saving it (causing performance hits) and can become quite complicated, especially when the threaded operating mode is being used. On its side, the client must cope with its operations on the server being killed at any time.
- It is hard to predict how much time a wide range of operations will take, as such data is hardware-dependent. Timeouts which work well on the developer’s PC may be too small on a user’s machine. Also, as soon as the remote call operates on a variable amount of data, predicting in advance how long the remote call should last becomes even more tricky.
III.2) Cooldown time
Sometimes, a notification may be fired very frequently, and it may not actually matter whether the notification calls are all processed. As an example, imagine a program where, each time a data structure is modified, a notification that says nothing about which changes have occurred is fired to the UI layer to ensure that said UI is updated. Updating a UI more than, say, 30 times per second may simply be a waste of CPU power.
For those situations, it is advantageous to give the notification remote call a “cooldown time”. If two calls are made with a time interval between them smaller than that time, the second call is scheduled to only run once the cooldown time has elapsed, and all further notification calls are dropped.
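A sketch of this cooldown logic; time is passed in explicitly to keep the example self-contained, and all names are made up.

```c
/* Sketch of cooldown filtering: a call within the cooldown window is
   deferred to the end of the window, and further calls are dropped. */
static const long cooldown = 33;     /* ms, ~30 deliveries per second */
static long last_delivery = -1000;   /* long ago */
static int deferred;
static int delivered;                /* instrumentation for the sketch */

void notify(long now) {
    if (now - last_delivery >= cooldown) {
        last_delivery = now;
        delivered++;                 /* deliver immediately */
    } else {
        deferred = 1;                /* coalesce: at most one pending */
    }
}

/* Invoked when the cooldown timer expires. */
void cooldown_elapsed(long now) {
    if (deferred) {
        deferred = 0;
        last_delivery = now;
        delivered++;
    }
}

int demo(void) {
    notify(0);             /* delivered at once */
    notify(10);            /* inside the window: deferred */
    notify(20);            /* dropped, one delivery already pending */
    cooldown_elapsed(33);  /* the deferred call finally goes out */
    return delivered;
}
```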
III.3) Security checks
At the core of this OS’ philosophy stands the statement that system modules shouldn’t be allowed to do much more than they need in order to complete their task. This sandboxing-centric philosophy also applies to services: why should processes be able to access system services which are not useful for their tasks? As remote calls are the way system services are requested on this OS, a remote call sounds like the perfect place to check whether a process has the right to access a given service, by having a look at its security tokens, or whatever other security mechanism I end up implementing.
This extension could work by having the kernel itself examine the process’ security permissions, or it could work by having a remote call sent to the server when the client asks to run a specific function for the first time, and having the server itself check the client’s security permissions. The latter is potentially worse for privacy and may increase the harm which a compromised server can do in a way which I can’t think of right now, but the former implies putting complex security checks in the kernel while all complexity which can be taken out of the kernel should be taken out of the kernel.
Appendix A : Shared memory
A.1) Fundamentals of paging-based memory sharing
To handle memory sharing between two processes located in two different address spaces, a paging trick is used. New pages are allocated to the process with which the memory is shared, and are set up to point to the shared pages’ physical location. This means that perfect sharing, with no information inadvertently leaking between the two processes, requires that the shared content be alone in the memory pages where it is located. This goes against basic memory allocator space optimizations, so it should not be the default. A special memory allocator must hence be introduced for memory that is to be shared.
A.2) The OS-periment’s memory sharing model
- Memory that is allocated using a special allocator may be shared between processes.
- When sharing memory, a range of pages pointing to the shared object is created in the remote process’ address space, and a pointer to the shared version of the object is automatically generated and returned to the kernel, which may then transmit it to the most relevant process depending on the situation.
- When a process frees some shared memory, the kernel can’t actually fully free it, because that would mean that the freeing process has the power to instantly crash other processes using the shared chunk. So what it does is solely destroy the part of the process’ address space which points to the shared object, while keeping the other mappings of the object alive. The physical object is only destroyed when the last process holding it frees it.
- But what happens when an object is shared multiple times with the destination process, as an example when it is used in multiple simultaneous remote calls? In that case, reference counting is used, so that the object is only freed from the destination process’ address space when it has been freed the same number of times as it has been shared.
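This reference counting could be sketched as follows; the names and layout are illustrative only.

```c
/* Sketch: per-destination-process reference counting of a shared
   object's mapping; the mapping disappears only when frees balance
   shares. */
struct shared_mapping { int refcount; int mapped; };

void mapping_share(struct shared_mapping *m) {
    if (m->refcount++ == 0)
        m->mapped = 1;    /* first share creates the page mapping */
}

void mapping_free(struct shared_mapping *m) {
    if (m->refcount > 0 && --m->refcount == 0)
        m->mapped = 0;    /* last free actually destroys the mapping */
}

int demo(void) {
    struct shared_mapping m = { 0, 0 };
    mapping_share(&m);               /* remote call 1 shares the object */
    mapping_share(&m);               /* remote call 2 shares it again */
    mapping_free(&m);
    int still_mapped = m.mapped;     /* one share outstanding: still there */
    mapping_free(&m);
    return still_mapped == 1 && m.mapped == 0;
}
```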