Big summer update 1 – RPC testing results

Like last year, I’ve had quite a bit of spare time at hand for coding purposes, and I haven’t had Internet access available during that time. So here’s a big update on what I’ve been doing. In this first part of the update, let’s talk about the elephant in the room: I’m done checking the viability of my RPC model. The tests, which follow the plan defined earlier, have been performed on three emulators: Bochs, QEMU, and VirtualBox, from slowest to fastest. All three ran on my faithful laptop, sporting a Core i5 M430 and 4GB of RAM, mostly running Fedora Linux 14 64-bit for OSdeving purposes. Here are the results:

Server startup performance tests

This step is pretty easy to optimize: since the server broadcasts its whole interface at once, pooled memory allocation can be used, which improves performance dramatically. As a result, an optimized version of the code takes about half a second to execute in Bochs, and runs instantaneously under QEMU and VirtualBox. Conclusion: on real hardware, there’s no need to worry about server startup performance.
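To make the idea concrete, here is a minimal sketch of that kind of pooling, with purely hypothetical structures and names (nothing from the actual codebase): the server gathers its whole interface in one contiguous block, then hands it to the kernel in a single operation instead of paying for one allocation and one registration per descriptor.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative descriptor for one entry of the server's RPC interface.
struct CallDescriptor {
    const char* name;            // e.g. "open_file"
    void*       entry_point;     // where the server handles the call
    uint32_t    parameter_count; // used later for compatibility checks
};

// All descriptors live in a single pool, grown once, so broadcasting the
// whole interface costs one allocation instead of one per descriptor.
class DescriptorPool {
public:
    explicit DescriptorPool(std::size_t capacity) { pool_.reserve(capacity); }

    void add(const char* name, void* entry, uint32_t params) {
        pool_.push_back(CallDescriptor{name, entry, params});
    }

    // The whole block is then handed to the kernel in one go.
    const CallDescriptor* data() const { return pool_.data(); }
    std::size_t           size() const { return pool_.size(); }

private:
    std::vector<CallDescriptor> pool_;  // single backing allocation
};
```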

Client startup performance tests

As predicted, there’s no way this test could succeed the way it was originally defined. As a performance compromise, I had to give up on the requirement that all 5000 call descriptors be outdated, as it resulted in a major performance hit (gigantic heap pollution, several minutes for test completion under VirtualBox). This could be optimized to some extent, sure, but is it worth it? Outdated call descriptors exist as a compatibility workaround for clients that do not keep up to speed with servers, which shouldn’t be the case for the vital system services that start on boot, assuming that users keep their OS installation in a coherent distribution instead of updating some things without updating others.

When using up-to-date descriptors, the test still takes a long time to complete under Bochs (20s), but not in QEMU (2s) or VirtualBox (instantaneous). It should be noted that Bochs becomes weirdly slow on heavy memory manipulation tasks, to the point where it is no longer representative of real hardware, even slow hardware. As an example, Bochs takes half a minute to map 4GB of RAM in x86’s multilevel page tables, a task that even really old real hardware completes instantaneously. Therefore, I’ll go with the hypothesis that the bad performance in Bochs is just a case of its emulation being stupidly slow, and will consider that this stuff is still ready to run on most real hardware.

Threaded RPC performance tests

Without much optimization, the code was too fast for the test to be of interest with 1 000 calls, so I’ve taken the liberty of increasing this to 10 000 calls. This took 4s to complete in Bochs, less than 1s in QEMU, and was instantaneous in VirtualBox. What this means is that even a relatively slow computer (as illustrated by Bochs, which fully emulates x86 hardware internally) would be able to execute thousands of RPC calls per second. I’d say that’s good enough for a start :)

For future performance improvements, two optimizations may be envisioned. First, stop allocating and freeing stacks all the time, when possible, by keeping a pool of dead threads around instead. This is very likely to make its way into the final codebase, but quantifying the performance improvement it would bring is difficult. Another optimization would be to avoid going through an intermediary cdecl stack, as the current code does, and fully embrace the AMD64 ABI instead. But as the designers of said ABI chose to maximize performance at the expense of cleanness, that would be difficult and would make the code a mess, so I don’t want to do it unless it’s absolutely necessary.
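As a rough illustration of the first optimization (hypothetical names, not the actual scheduler code), the idea is simply to park the stacks of dead RPC threads on a free list instead of returning them to the allocator:

```cpp
#include <cstddef>
#include <vector>

// A recyclable thread stack.
struct Stack {
    void*       base;
    std::size_t size;
};

class StackPool {
public:
    // Reuse a parked stack if one is available; otherwise allocate a new one.
    // (This sketch assumes all RPC stacks have the same size.)
    Stack acquire(std::size_t size) {
        if (!free_list_.empty()) {
            Stack s = free_list_.back();
            free_list_.pop_back();
            return s;
        }
        return Stack{::operator new(size), size};
    }

    // When an RPC thread dies, park its stack for the next call instead of
    // freeing it, so most calls never touch the allocator at all.
    void release(Stack s) { free_list_.push_back(s); }

    ~StackPool() {
        for (Stack& s : free_list_) ::operator delete(s.base);
    }

private:
    std::vector<Stack> free_list_;
};
```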

Asynchronous RPC performance tests

Not having to allocate and free stacks gives the code a nice performance boost, though a smaller one than I would have expected. If we look at the time it takes to emulate 100 000 RPC calls, Bochs needs 40s for threaded code vs 34s for async code. QEMU needs 6s for threaded and 5s for async. VirtualBox continues to execute the code near-instantaneously, keeping me amazed by the power of a modern CPU.

Conclusion

While writing these tests, especially the last two, I’ve been wondering about the possibility of creating a code generator for RPC calls (in an IDE, for example, you would right-click the definition of a server-side prototype and easily turn it into a remote call descriptor, plus client-side stubs that you can copy-paste into a library). So far, I see nothing standing in the way, so that’s probably how things will work in the final development tools. If I ever get to that point, obviously.
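To give an idea of what I mean, here is a purely hypothetical sketch of what such a generator’s output could look like for an imaginary draw_pixel() remote call; none of these names or structures come from the current code.

```cpp
#include <cstdint>

// Server-side prototype that the generator starts from.
void draw_pixel_impl(int x, int y, uint32_t color);

// Generated remote call descriptor, registered by the server at startup.
struct RemoteCallDescriptor {
    const char* name;
    void*       entry_point;
    int         parameter_count;
};
static const RemoteCallDescriptor draw_pixel_descriptor = {
    "draw_pixel",
    reinterpret_cast<void*>(&draw_pixel_impl),
    3
};

// Generated client-side stub, ready to be copy-pasted into a library.
// rpc_call_async() stands in for whatever the final call primitive will be.
void rpc_call_async(int connection, int x, int y, uint32_t color);

inline void draw_pixel(int connection, int x, int y, uint32_t color) {
    rpc_call_async(connection, x, y, color);
}
```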

The second thing is that, since I find the performance of this RPC interprocess communication primitive very satisfying, implementation will begin soon.


21 thoughts on “Big summer update 1 – RPC testing results”

  1. Brendan August 9, 2011 / 8:15 am

    Conclusion: I’ve got no idea what any of these tests are actually doing (or how many CPUs are being used), so the results are inconclusive.

    As a very rough guide (based on past experience), Bochs is around 100 times slower than the host machine and Qemu is about 10 times slower.

    For the async test, if it takes Qemu 5 seconds to do 100 000 RPC calls, then it’d take the host machine about 500 ms to do those 100 000 RPC calls. That works out to about 5 us per RPC call on real hardware (the host, a Core i5 M430).

    A Core i5 M430 is a 2.26 GHz part, so 1 us is 2260 cycles and 5 us is 11300 cycles. A software task switch (including switching virtual address spaces) is typically around 250 cycles, various TLB misses (caused by the switching virtual address spaces) and CPL=0/CPL=3 switching probably adds another 200 cycles. The extra overhead of the async messaging (alone) should only be about 200 cycles on top of that (depending on message size and a bunch of other unmentioned things); so expected results would be around 650 cycles (and nowhere near 11300 cycles).

    Based on all of that guesswork, your code is an order of magnitude (around 15 times) slower than it should be (assuming “single core ping-pong”).

    Of course if the tests are using both cores of the Core i5 M430, or all four logical CPUs, then there should be no need for any task switching for async (e.g. sender and receiver running on different CPUs), and your code is probably several orders of magnitude slower than it should be.

  2. Hadrien August 9, 2011 / 9:31 am

    Well, as for what the tests are doing, this post was indeed missing a link to the original post that explained that. Added one, and here’s an extra one just for you :)

    As for those of your questions which this page does not answer, here is some extra stuff which you might like to know…

    I do not have multicore support yet, so the tests are done on a single core.

    I do not have fully implemented user-space processes yet either, so for context switches and syscalls I’ve taken a pessimistic model (a rough sketch follows the list):

    • For a context switch, I flush the TLB and move 130 longs around (to account for the register copying and other extra overhead that has to take place; in real life, I guess that register-to-memory copying is much faster than this memory-to-memory copying).
    • I model a syscall as two context switches (one to come in, one to go out) and two mutex grab/release cycles (kernel stuff frequently requires some amount of mutual exclusion somewhere).
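    For reference, a rough sketch of that simulation model (illustrative kernel-mode code, the identifiers are made up; the CR3 reload needs CPL=0):

```cpp
#include <cstdint>

static volatile int fake_kernel_mutex = 0;  // stand-in for some kernel lock
static uint64_t saved_state_src[130];
static uint64_t saved_state_dst[130];

inline void simulated_context_switch() {
    // Flush the TLB by reloading CR3.
    uint64_t cr3;
    asm volatile("mov %%cr3, %0" : "=r"(cr3));
    asm volatile("mov %0, %%cr3" :: "r"(cr3) : "memory");

    // Move 130 longs around to account for register saving/restoring and
    // other per-switch overhead.
    for (int i = 0; i < 130; ++i) saved_state_dst[i] = saved_state_src[i];
}

inline void simulated_syscall() {
    // Modeled as two context switches (one to enter the kernel, one to
    // leave it)...
    simulated_context_switch();
    // ...plus two mutex grab/release cycles for kernel-side mutual exclusion.
    for (int i = 0; i < 2; ++i) {
        while (__sync_lock_test_and_set(&fake_kernel_mutex, 1)) { /* spin */ }
        __sync_lock_release(&fake_kernel_mutex);
    }
    simulated_context_switch();
}
```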

    I’ve looked at the AMD64 ABI and ran away screaming in horror (no way I can make simple, viability-testing code support that), tried forcing _cdecl on gcc, realized that it wouldn’t let me do that, and gave up, so the current code does one unnecessary parameter stack copy.

    Also, as mentioned in the document linked above, I’ve experimented with a relatively slow (compared to what you would expect) but potentially powerful feature: allowing buffers to be transferred between client and server through shared memory. Without that (i.e. assuming that all parameters are of a non-pointer kind), you get a significant performance increase: in QEMU, for 10^6 calls (I have a margin of error of ~1s when doing measurements in QEMU), it’s now around 32s for threaded, 28s for async.

    So according to your calculation…

    28s in QEMU means 2.8s on real hardware, and as it’s for 10^6 calls it means 2.8µs per call.
    Total overhead: 6328 cycles. The improvement is about twofold.

    Now, I’d like to have a closer look at the assumption that QEMU is about 10x slower than real hardware, so I’ve tried 10^7 calls in VirtualBox (which is slower than real hardware, although not by much for this kind of number-crunching work). If the assumption holds, VirtualBox should take about the same time as QEMU.

    Results are 19s for threaded, 18s for async.

    So something is strange here. If we assume that VirtualBox is as fast as real hardware (which it probably isn’t), then you need to apply a factor of 19/32 = 0.6 to the results above.

    Which gives us an overhead of 3797 cycles.

    That’s 6x slower than your expectation, which is still bad but not quite as bad as a factor of 14. And I’m still assuming that VBox is as fast as real hardware, using a pessimistic context switch model, and avoiding the horrors of the AMD64 ABI through an extra copy of the parameters (10 64-bit integers).

    Then, let us examine another side of the situation. As I said myself…

    “We’ll also assume that those clients all have a slightly outdated interface definition, and that they only know about the first 9 parameters of each server-side function call.”

    This implies the existence and manipulation of a “compatibility stack frame” that’s appended at the end of every client-side stack. I guess you didn’t assume that when doing your calculation. So let’s remove it from the current code.

    Results are 17s for threaded, 16s for async. Now we’re talking about 1.6µs per call, so the overhead would be 3616 cycles. We are now 5.6x slower than your expectations. At this point, I give up: all the rest is either my fault, or a consequence of the things mentioned above (assuming VBox is as fast as real hardware, slow virtual context switches, extra stack copying).

  3. Brendan August 10, 2011 / 9:29 am

    Why do servers “broadcast 30 entries to the kernel” (surely that’d be a syscall, with nothing broadcast?).

    Why is RPC using function pointers? You’d want to give functions numerical “function IDs” and do a “switch(functionID) {}” thing in the server. That way the clients can just use a header file containing the function IDs for the server (e.g. an “enum” of function IDs) and the kernel won’t need to know/care what functions are supported (or what addresses they’re at). All the overhead of validating pointers, etc is gone (and replaced by the much smaller overhead of the “switch(functionID) {}” thing in the server, which would hopefully be optimised into a fast lookup table in most cases).
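    A minimal sketch of that scheme (illustrative names only) would be a header shared by server and clients, plus the server-side switch:

```cpp
#include <cstdint>

// ---- shared header: the only thing clients need to know ----
enum FunctionID : uint32_t {
    FN_GET_VERSION = 0,
    FN_OPEN_FILE   = 1,
    FN_READ_FILE   = 2,
    MAX_FUNCTION_ID
};

// ---- server side ----
void do_get_version(const void* args);
void do_open_file(const void* args);
void do_read_file(const void* args);

// The kernel never inspects function IDs; the server dispatches on them
// itself (ideally compiled down to a jump table).
void dispatch(uint32_t function_id, const void* args) {
    switch (function_id) {
        case FN_GET_VERSION: do_get_version(args); break;
        case FN_OPEN_FILE:   do_open_file(args);   break;
        case FN_READ_FILE:   do_read_file(args);   break;
        default: /* reject unknown or out-of-range IDs */ break;
    }
}
```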

    When the server starts, it tells the kernel “I implement the service called SOME_NAME”. When a client starts it asks the kernel “Who implements the service SOME_NAME?”. Server startup time should be almost zero (the cost of one syscall to register the name).

    Function parameters should never be pointers (to data). It makes no sense for RPC (which is mostly intended for executing functions on remote computers, where the idea of shared memory is silly and there’s no guarantees that the computers aren’t completely different – different register sizes, different endian-ness, etc). Let’s say you have a function that takes an integer and a pointer to an ASCII string. The caller builds a “data packet” containing the length of the packet, the function ID, the integer, then the offset of the string within the packet, then the string data itself (where everything is converted to some standardised representation). The kernel delivers the packet to the receiver/server (using nothing more than the “length” field and not needing to care about how many parameters, etc). Then the server does the “switch(functionID) {}” thing and decodes the packet (e.g. the “offset of the string as a 4-byte little endian integer” might be converted into a “64-bit big endian pointer to the string”).

    Of course to simplify things for clients, typically a server would provide a header file and library for clients to use, so clients can just call a “foo( myInt, char *myString)” function in the library and this stub function would build the packet and send it to the server (and wait for a reply packet containing returned values and decode it, if necessary).
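    As a rough illustration of that stub-plus-packet idea (the exact layout and send_packet_to_server() are made up for the example):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

void send_packet_to_server(const void* packet, uint32_t length);

// The client just calls foo(myInt, myString); the stub flattens everything
// into a self-describing packet:
//   [length][functionID][myInt][offset of string][string bytes...]
void foo(int32_t myInt, const char* myString) {
    const uint32_t FN_FOO  = 42;                          // function ID
    const uint32_t header  = 4 * sizeof(uint32_t);        // fixed fields
    const uint32_t str_len = uint32_t(std::strlen(myString)) + 1;
    const uint32_t length  = header + str_len;

    std::vector<uint8_t> packet(length);
    const uint32_t fields[4] = {length, FN_FOO, uint32_t(myInt), header};
    std::memcpy(packet.data(), fields, sizeof(fields));
    std::memcpy(packet.data() + header, myString, str_len); // string payload

    send_packet_to_server(packet.data(), length);
    // A synchronous stub would now wait for the reply packet and decode it.
}
```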

    Also note that for “async”, if the data is copied into the packet then it’s “safe”. Otherwise (e.g. if you try to use pointers to shared memory buffers) you’d have to do something to make sure that the data isn’t trashed by the client after the RPC call is made but before the server receives it.

    The “Behavioral model in asynchronous operation” is: caller sends 1000 “RPC packets” (where the kernel puts them onto the receiver’s queue, which is in kernel space to avoid any need for any context switches); then (if and only if the server can’t run on another CPU in parallel with the client) there’s one context switch (from client to server) and the server handles all 1000 of the “RPC packets” it received.

    An OS typically marks kernel pages as “global” so that they aren’t flushed when you switch virtual address spaces. When you flush the TLB (to simulate a task switch), are kernel’s pages also flushed?

    A syscall is not equivalent to 2 context switches. Most 64-bit OSs use the SYSCALL and SYSRET instructions, which are probably around 20 cycles. On top of that (unless you’re using assembly) there’s some extra overhead setting up a stack frame to suit the calling convention a compiler expects, so maybe about 30 cycles. If you’ve got the overhead of actual IPC code then you don’t need to add any extra overhead (the “2 mutex grab/release cycles”) to simulate the syscall’s overhead.

    I mostly used guesswork to estimate the cost of one RPC call from the information provided, because I can’t use RDTSC on a real computer to measure it accurately. Why are you using similar guesswork (given that you can use RDTSC on a real computer)? ;-)

  4. Hadrien August 10, 2011 / 10:18 am

    Why do servers “broadcast 30 entries to the kernel” (surely that’d be a syscall, with nothing broadcast?).

    The server sends descriptors of the remote calls it provides to the kernel. This way, when a client asks for a connection to call x of the server, the kernel can quickly check if that call actually exists, and automagically set up compatibility stuff if the client’s descriptor is older than the server’s in a way that can be fixed (parameters are either added or removed at the end of the parameter list, new parameters have default values).
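    A rough sketch of what that connection-time compatibility check could look like (field names are illustrative, not the actual kernel structures):

```cpp
#include <cstdint>
#include <vector>

// Kernel-side view of one remote call, as broadcast by the server.
struct ServerCallDescriptor {
    void*                 entry_point;     // where the server handles the call
    std::vector<uint64_t> default_values;  // one per parameter, newest last
};

// Client-side view of the same call, possibly from an older interface.
struct ClientCallDescriptor {
    uint32_t known_parameter_count;
};

// At connection setup, the kernel checks that the call exists and, when the
// client only knows the first N parameters, records the default values that
// will be appended to every call so the server sees a full parameter list.
bool setup_compatibility(const ServerCallDescriptor& server,
                         const ClientCallDescriptor& client,
                         std::vector<uint64_t>&      defaults_to_append) {
    const uint32_t server_count =
        static_cast<uint32_t>(server.default_values.size());
    if (client.known_parameter_count > server_count)
        return false;  // the "client newer" case (dropping parameters) is not sketched here
    for (uint32_t i = client.known_parameter_count; i < server_count; ++i)
        defaults_to_append.push_back(server.default_values[i]);
    return true;  // later calls are fast: just paste the defaults on the stack
}
```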

    Why is RPC using function pointers? You’d want to give functions numerical “function IDs” and do a “switch(functionID) {}” thing in the server. That way the clients can just use a header file containing the function IDs for the server (e.g. an “enum” of function IDs) and the kernel won’t need to know/care what functions are supported (or what addresses they’re at). All the overhead of validating pointers, etc is gone (and replaced by the much smaller overhead of the “switch(functionID) {}” thing in the server, which would hopefully be optimised into a fast lookup table in most cases).

    This overhead is not much of a problem, because it occurs infrequently (when a client sets up a connection with the server). On the other side of the coin, the advantage of using full-blown server descriptors over simple numerical IDs is that they bring extra flexibility as far as compatibility is concerned. One can easily add features to an existing remote call, controlled through extra parameters whose default values are consistent with the old behavior. This way, the hurdle of compatibility is minimal, and old servers don’t have to either maintain gigantic switches/lookup tables for all the past versions of their function calls or break compatibility with old programs. You don’t get Win32-style “CreateWindowExtendedBlackholeCaterpillar” prototypes; rather, you extend the existing CreateWindow function so that it meets the new needs.

    Granted, breaking changes will still be required from time to time, but the need for them may be greatly minimized through good design.

    Another good side of making the kernel aware of function characteristics through function descriptors is that it allows for shared pointer parameters. It seems that we do not agree on whether this is a good idea, however. More on that once this silly Windows box is done forcefully rebooting.

  5. Hadrien August 10, 2011 / 11:09 am

    Function parameters should never be pointers (to data). It makes no sense for RPC (which is mostly intended for executing functions on remote computers, where the idea of shared memory is silly and there’s no guarantees that the computers aren’t completely different – different register sizes, different endian-ness, etc).

    Indeed, it makes no sense for networked RPC. However, although you’re right that this design was proposed in the context of remote computers, I plan to mostly use it in a different context, as a local IPC primitive which processes use to provide services to each other. Networked RPC operation is certainly possible, but with it comes all the nastiness of everything that occurs on a network (no mutexes and shared memory, heterogeneous hardware, propagation delays, packet loss, possible server hardware failure while the client is still alive), which I don’t want to inflict on people for something that is not a favored use case of my OS. If I were aiming at a distributed operating system or some other variety of network-centric computing platform, I’d consider this criticism a bit further, but that’s not what I’m looking for.

    Let’s say you have a function that takes an integer and a pointer to an ASCII string. The caller builds a “data packet” containing the length of the packet, the function ID, the integer, then the offset of the string within the packet, then the string data itself (where everything is converted to some standardised representation). The kernel delivers the packet to the receiver/server (using nothing more than the “length” field and not needing to care about how many parameters, etc). Then the server does the “switch(functionID) {}” thing and decodes the packet (e.g. the “offset of the string as a 4-byte little endian integer” might be converted into a “64-bit big endian pointer to the string”).

    If the chunk of data which you want the server to play with is large, say a 1GB RAW image, you end up copying boatloads of memory around and eating up twice the RAM that should be necessary. Another area of potential criticism when playing with large data sets is the “standardized representation”: though I can’t think of a good example right now, converting stuff to and from this representation could be an expensive operation, one that is unnecessary if both the client and the server speak the same dialect.

    Finally, the third gripe which I have with this approach is that it brings variable-sized IPC packets to the table. I don’t like them, just like I don’t like scanf’s variable-sized output, because I see them as an area of potential issues. If packets are treated as atomic objects (the server does not hear about them until the client has fully sent them), a malicious client may stall the server packet queue (and potentially overflow something, if there are coding mistakes around) by sending absurdly large packets. If packets are not treated as atomic objects, then you get the usual attack vectors of web servers: sending a packet extremely slowly (“paying with pennies”), opening lots of connections without sending a full packet until the server connection limit is reached… Sure, there are workarounds for that, but they often resort to the use of arbitrary numbers (e.g. a per-client connection limit) and strike me as something one would not want in one’s codebase if it can be avoided.

  6. Hadrien August 10, 2011 / 11:35 am

    Of course to simplify things for clients, typically a server would provide a header file and library for clients to use, so clients can just call a “foo( myInt, char *myString)” function in the library and this stub function would build the packet and send it to the server (and wait for a reply packet containing returned values and decode it, if necessary).

    I totally agree that client-side library stubs may be used to put syntactic sugar around the actual gory IPC mechanisms.

    Also note that for “async”, if the data is copied into the packet then it’s “safe”. Otherwise (e.g. if you try to use pointers to shared memory buffers) you’d have to do something to make sure that the data isn’t trashed by the client after the RPC call is made but before the server receives it.

    Why only after the RPC call is made and before it is received? The client could just as well send trashed data in the buffer to begin with, and the kernel still won’t check that; it’s the server’s problem to take the relevant precautions when dealing with untrusted data.

    Another problem is what happens when the client modifies the data that the server is currently dealing with (after the RPC call has been received and while it is being processed). Now, that’s a classic multithreaded programming problem, to which I have no clear answer, except to tell server developers to be careful about what they use shared data for, so that impromptu modification by the client cannot cause a server crash. I have also eliminated the only crash factor I could think of that servers can do nothing about: the client freeing the shared buffer which the server is currently using. In my memory allocation and sharing model, all processes which share a memory chunk have a symmetric role, and a memory buffer is not actually freed from memory until all owners have given up on it.
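    A minimal sketch of that ownership rule (illustrative bookkeeping, not the actual memory manager code):

```cpp
#include <cstdint>

// One shared memory chunk, as tracked by the memory manager.
struct SharedChunk {
    void*    frames;        // underlying physical memory
    uint32_t owner_count;   // every process sharing the chunk counts as one
};

// Called whenever a process (client or server alike) maps the chunk.
inline void add_owner(SharedChunk& c) { ++c.owner_count; }

// Called when a process "frees" the chunk: the memory only really goes away
// once the last owner has given up on it, so a client freeing a buffer that
// a server is still using cannot pull the rug out from under that server.
inline bool drop_owner(SharedChunk& c) {
    if (--c.owner_count == 0) {
        // ...release 'frames' back to the physical frame allocator here...
        return true;   // chunk actually freed
    }
    return false;      // other owners remain, the mapping stays valid for them
}
```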

    To go further, I could have made it so that the client can’t touch the shared buffer that the server is dealing with, by forcefully removing the client’s access to it, but I’m not sure that would be a bright idea. There could be some cases, which I cannot think of at the moment, where the client and server play a symmetric role and actually want to cooperatively work on a chunk of memory. Better stick with the less limiting abstraction for the basic mechanism, and let client and server voluntarily impose more restrictive protocols upon themselves when they want to, in my opinion. This can be a mistake though. I don’t know at the moment, I don’t have enough perspective right now. But there certainly will be time to enforce a new “buffer transfer” mechanism at the kernel level later, if it turns out there’s a need for it.

  7. Hadrien August 10, 2011 / 11:45 am

    An OS typically marks kernel pages as “global” so that they aren’t flushed when you switch virtual address spaces. When you flush the TLB (to simulate a task switch), are kernel’s pages also flushed?

    Probably not, I had forgotten about that, and my TLB flush is triggered simply by reloading CR3. Is there a way to force a full TLB flush, including global pages?

    A syscall is not equivalent to 2 context switches. Most 64-bit OSs use the SYSCALL and SYSRET instructions, which are probably around 20 cycles. On top of that (unless you’re using assembly) there’s some extra overhead setting up a stack frame to suit the calling convention a compiler expects, so maybe about 30 cycles. If you’ve got the overhead of actual IPC code then you don’t need to add any extra overhead (the “2 mutex grab/release cycles”) to simulate the syscall’s overhead.

    I mostly used guesswork to estimate the cost of one RPC call from the information provided, because I can’t use RDTSC on a real computer to measure it accurately. Why are you using similar guesswork (given that you can use RDTSC on a real computer)? ;-)

    I didn’t know that those instructions were so powerful that they avoided context switching overhead altogether, they decidedly sound really nice and I should totally have a look at them once I have user space and syscalls :)

    The thing is, at the moment, I don’t, and as such I don’t have actual IPC code at hand for benchmarking purposes either. With this relatively crude benchmark, I wanted to investigate the viability of what I currently consider as my main future IPC primitive, using a sort of rough code draft, before I start to implement user space stuff that will make heavy use of it.

  8. Brendan August 11, 2011 / 8:36 am

    So, rather than the overhead of checking what the message is in the server, you’ve shifted this overhead into the kernel, and added the extra overhead and complexity involved in telling the kernel what is allowed (and how to deal with the compatibility stuff). To me that sounds like you’ve optimized the rarely used paths at the expense of the “happy path”. I’d also be worried about how the server describes what it accepts to the kernel – is it too restrictive, or too complex, or a mixture of both?

    Note: I’m officially ignoring your “flexibility” and “compatibility” arguments – using function pointers is no more flexible and no easier for compatibility than using “function IDs”; the only difference is whether the burden of flexibility/compatibility is pushed onto the kernel or not.

    One thing you’re overlooking is that what a server accepts isn’t static, but changes. For example, the server might only accept “setup()” and “get_version()” initially; but then after a client calls “setup()” it might also accept “foo()” and “change_state()”; and then depending on the current state it might (or might not) accept “first()” or “second()” or “third()”. Then there’s parameters themselves – if “change_state()” accepts one 8-bit integer as its only parameter, then which values are accepted and when? In the same way, one “server” may provide multiple entirely different interfaces – for example, it might have one set of RPC calls that anonymous clients can use, and an entirely different set of RPC calls that only some special clients are allowed to use. The end result of all this is that the kernel has to validate function parameters and decide whether or not the call might be accepted; and then the server validates function parameters a second time and decides if the call is actually accepted.

    The next thing you’re overlooking is that communication is often 2-way, and is sometimes n-way. For example, do clients have to tell the kernel what they accept, so the kernel can figure out if the client might accept a call from a server?

    Let’s have a semi-realistic example. Consider a “pipelined” compiler (or assembler or something – doesn’t matter what it is really and any sort of “pipelined” workload will do). The first stage communicates with some sort of file system server to read files and handles “include” directives, and sends “lines of raw source code (converted to UTF32)” to the next stage. The second stage does preprocessing – it accepts “lines of raw source code” and sends “lines of preprocessed source code” to the next stage (while keeping track of macros and expanding them, handling conditional code, etc). The third stage does parsing and converts source code into tokens. The fourth stage converts to some sort of intermediate language. The fifth, sixth and seventh stages do different optimizations (and may be enabled or disabled). The eighth stage converts to assembly. The ninth (optional) stage does some final optimizations and the tenth stage communicates with some sort of file system server to write the final output file. Each stage only communicates with the next stage in the pipeline. However, on top of all that there’s a “main” piece which communicates with all stages, and handles things like sending errors to STDERR and synchronizing the pipeline (e.g. if the main piece detects that a stage crashed, the main piece can tell all remaining stages to abort).

    When the compiler is started, the main piece checks the computer it’s running on and decides which stages should be run where. If there’s 2 NUMA domains with 4 CPUs per NUMA domain, it might decide that the first 6 stages run as one process with 7 threads (including the main piece) in the first NUMA domain, and the remaining 4 stages run as a second process with 4 threads in the second NUMA domain. Of course to make it simple (and to maximize the chance of keeping all CPUs busy without the need to use any locks/mutexes, etc), each stage uses asynchronous IPC to communicate with other stages (regardless of whether or not the other stages are running as threads in the same process or threads in a different process).

    Now, who is a server and who is a client, and which RPC calls are accepted by what?

  9. Hadrien August 11, 2011 / 9:43 am

    So, rather than the overhead of checking what the message is in the server, you’ve shifted this overhead into the kernel

    Actually, I think that the connection setup overhead should belong to the client, just like in modern postal systems the burden of writing an address and paying for a stamp belongs to the sender. But the client itself can’t be trusted to send correct RPC packets, so these must be checked by a trusted third party, the kernel. Now you may say “okay, but the kernel shouldn’t be wasting its CPU time on analyzing client calls either”. That is right, and that’s why the kernel threads which do this job should be scheduled as client threads, in my opinion. Also, care must be taken that the connection setup process does not significantly disturb other kernel activities (i.e. everything happens as if the client was doing the connection setup process itself). But given that these conditions are met, I’m not sure what the problem is.

    and added the extra overhead and complexity involved in telling the kernel what is allowed (and how to deal with the compatibility stuff). To me that sounds like you’ve optimized the rarely used paths at the expense of the “happy path”.

    Telling the kernel that is only about moving a few bytes of memory around, and I’m not sure I see your point about complexity: how would complexity change if that stuff were put inside servers instead of inside the kernel? Complexity is just moved around; I don’t think it can be reduced at feature parity.

    Now, if you want to express that there should be as little code in the kernel as possible, I see your point. But I think that IPC, in the sense of setting up controlled communication channels between two isolated processes, is intrinsically kernel stuff. You can put a bit of logic in the client or the server, but a significant part of the work, the one which actually does something dangerous, will always be in the kernel anyway. By artificially splitting communication code in two halves, one only increases the number of syscalls involved.

    Also, there’s something which I wouldn’t let a server do: shared memory through pointer function parameters. The server doesn’t trust the client, but the client shouldn’t trust the server either, so the server shouldn’t have the right to choose which parts of the client’s memory it has access to.

    I’d also be worried about how the server describes what it accepts to the kernel – is it too restrictive, or too complex, or a mixture of both?

    I’m worried about that too, but again I can’t see how putting stuff on the server side instead of the kernel side would solve the problem. To check that what it has received is what it wants, the server must know what it wants, which implies being able to describe it. Once you know how to describe something, it’s pretty easy to communicate it.

    (more coming later)

  10. Brendan August 11, 2011 / 9:49 am

    Networked RPC operation: that’s your decision I guess. I’d be relatively hesitant to assume you’d never want to use your RPC over a network (even if its primary purpose is local IPC); and I’d be tempted to assume that eventually you will want something for networked RPC (even if you don’t want it now), and that you’re risking eventually ending up with 2 incompatible forms of RPC.

    However remote/networked communication isn’t the only problem with shared memory, and shared memory (for local communication) isn’t quite as forgiving as you might be hoping. For a start, if the client sends a pointer to something in its virtual address space, there’s no guarantee that the same specific virtual addresses will be “free” in the server. To get around that, shared memory areas are usually negotiated in advance; and “offset within negotiated shared memory area” is typically used instead of pointers.

    Then there’s information leaks – for example, the client can’t have a 1 KiB shared memory buffer at virtual address 0x12345000 and then store 3 KiB of confidential passwords at 0x12345400; because that would give the server access to the entire 4 KiB page containing the 1 KiB of shared memory, which includes the 3 KiB of confidential passwords.

    Finally there’s synchronisation. Let’s say a client puts some data into its shared memory and sends an asynchronous RPC to a server (that uses the data in that shared memory area). How soon can the client modify its shared memory after sending the asynchronous RPC? The client couldn’t modify that data immediately after sending the asynchronous RPC, because the server wouldn’t have had a chance to use that data yet (and by the time the server does use the data it’d be modified/corrupt). Would it be safe for the client to modify the data 10 ms after sending the asynchronous RPC to the server? What about 20 seconds after sending the asynchronous RPC? How can the client know when it’s safe? It can’t. Basically it’s only safe for the client to modify the data in shared memory after it knows the server has finished with that data, which makes asynchronous RPC (where the server doesn’t send back any reply) worthless when shared memory is involved – in most cases you’d have to use synchronous RPC instead (where the client can know when the server has finished with the data).

    Of course you probably realise I’m not a fan of “synchronous” for anything modern (where multiple CPUs are likely). To get good scalability out of “synchronous” you have to spawn more threads (which increases overhead) and then you end up with hassles synchronising threads (with things like locks/mutexes, polling, signals, etc) which are far too easy to get wrong (and far too easy to get wrong in ways that result in “Heisenbugs”).

    If the chunk of data which you want the server to play with is large, let’s say 1GB RAW image as an example, you end up moving one page directory from client to server, which eats no extra RAM and can be extremely fast if it’s done right (e.g. if the page directory is mapped into the server’s address space during a task switch to avoid additional TLB flushing costs, when possible).

    Variable-sized IPC packets are already on the table – for your RPC; a function that takes no parameters implies a tiny IPC packet (maybe 0 bytes or 8 bytes or something), and a function that takes 20 different parameters implies a much larger packet (maybe 120 bytes or something). Packets can easily be treated as atomic objects regardless of what the actual or maximum size is. You probably don’t like variadic functions (functions that take a variable number of parameters), but only because your “kernel tries to pre-check stuff that it can’t check properly anyway” approach can’t handle variadic functions in a sane/consistent way (although that’s probably not limited to variadic functions – e.g. anything involving unions would be equally “un-pre-checkable”).

  11. Brendan August 11, 2011 / 10:34 am

    To simulate a task/process switch, you don’t want to flush the entire TLB (you only want to flush “non-global” TLB entries). If you ever did want to flush all TLB entries (including “global” entries) then you’d toggle the “PGE” flag in CR4.
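    For reference, a kernel-mode sketch of both flushes (x86-64, GCC inline assembly, only usable at CPL=0):

```cpp
#include <cstdint>

// Flush only non-global TLB entries, as a task switch would: reload CR3.
inline void flush_nonglobal_tlb() {
    uint64_t cr3;
    asm volatile("mov %%cr3, %0" : "=r"(cr3));
    asm volatile("mov %0, %%cr3" :: "r"(cr3) : "memory");
}

// Flush everything, including "global" kernel pages: toggle CR4.PGE.
inline void flush_full_tlb() {
    const uint64_t CR4_PGE = 1ull << 7;   // Page Global Enable bit
    uint64_t cr4;
    asm volatile("mov %%cr4, %0" : "=r"(cr4));
    asm volatile("mov %0, %%cr4" :: "r"(cr4 & ~CR4_PGE) : "memory"); // PGE off
    asm volatile("mov %0, %%cr4" :: "r"(cr4) : "memory");            // PGE back on
}
```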

    There’s about 4 different types of “context switch”, depending on, um, context. There’s switching from the context of one process to the context of another process; switching from the context of one thread in a process to the context of another thread in the same process; switching from “user context” to “kernel context” (e.g. CPL=3 to CPL=0) and back; and switching into and out of “interrupt context”.

    Instructions like SYSCALL/SYSRET and SYSENTER/SYSEXIT switch from “user context” to “kernel context” (e.g. CPL=3 to CPL=0) and back, and do it relatively quickly (e.g. quicker than the older methods like call gates and software interrupts; because a lot of the protection checks involved are skipped). For calling the kernel API that is all that is required (but certain functions in the kernel may directly or indirectly cause a switch to another thread or a switch to another process; which is an entirely different and entirely separate type of context switch).

    For simulating synchronous IPC on a single CPU; you’d have a CPL=3 to CPL=0 switch (e.g. SYSCALL), some IPC overhead, a process switch (caused by client blocking and server unblocking), then a CPL=0 to CPL=3 switch (return to “user context” in the server). Then you’d have the reverse for the reply – a CPL=3 to CPL=0 switch (e.g. SYSCALL), some IPC overhead, a process switch (caused by client unblocking), then a CPL=0 to CPL=3 switch (return to “user context” in the client).

    For simulating asynchronous IPC on a single CPU; you’d have a CPL=3 to CPL=0 switch (e.g. SYSCALL), some IPC overhead, then a CPL=0 to CPL=3 switch (return to “user context” in the client). Eventually the client would block or be pre-empted for some reason (not necessarily directly related to the asynchronous IPC – maybe it calls “sleep()” after sending lots of things to a bunch of different servers). Whenever the client blocks (for whatever reason), you’d have a CPL=3 to CPL=0 switch (e.g. SYSCALL), a process switch to the server, then a CPL=0 to CPL=3 switch. Once the server is running it’d start fetching any async IPC it received, so (for each IPC received) you’d have a CPL=3 to CPL=0 switch (e.g. SYSCALL), some IPC overhead, then a CPL=0 to CPL=3 switch (return to “user context” in the server).

  12. Brendan August 11, 2011 / 1:21 pm

    Um, what??

    Rather than doing “client sends data to server and server does checking”, and rather than doing “client sends data, kernel does unnecessary pre-checking (while still running in the context of the client thread), then server does proper checking”, you’re planning to do “client sends data, kernel does unnecessary thread switch, then kernel does unnecessary pre-checking, then server does proper checking”?

    Surely you’ve forgotten to mention that everything will be in XML? I’ve got an idea – you could add more threads (and more pointless overhead and thread switching) if you have “special” kernel threads for encoding/decoding all the data into/out of XML format. Something like “client calls kernel’s “encode as XML” function, kernel switches to special “XML encode”, XML encode thread encodes the raw data as XML, kernel returns XML to client; client calls “RPC_send()” with the XML data, kernel switches to a special “RPC send” thread, kernel checks the XML data on behalf of the server; “RPC send” thread sends XML to server. Then server receives the XML data and calls the kernel’s “XML decode” function, and kernel sends the XML to a special “XML decode” thread, which returns the decoded XML to the server. Only after all of that should the server be allowed to do anything useful, like properly checking the data it received (to determine if the call is valid for the current state, if parameters are in range, etc). :-)

    Of course I’m joking about XML; but how does the server describe what it accepts to the kernel? You’ll need some sort of “RPC description language”, but what exactly does that look like? If the first parameter is meant to be a 32-bit “float” and the client tries to pass a “uint32_t” instead, does the kernel do type checking (and how can kernel know the difference)? If a client can’t use “bar()” until it has used “foo()”, then will the kernel need to keep track of state?

    When a client says “call the function at virtual address 0x12345678 in server XYZ” does the kernel calculate a hash from 0x12345678, then find the hash table for server XYZ, then try to find the entry for 0x12345678 in the corresponding hash table? That’s the fastest way I can think of.

    Now imagine if the kernel did nothing more than passing (variable sized) packets of data from client to server. No “RPC description language” and no setup overhead needed at all. When the server receives a packet it can do something like “if(functionID < MAX_FUNCTION_ID), then load the function ID into RAX and do something like "call [function_table + RAX * 8]". No hash table lookup. Once the correct function is running in the server it can easily, correctly and completely validate that the function is acceptable for the current state, the correct number of parameters were supplied, the all parameters are within valid ranges, etc.

    Let's summarize the advantages and disadvantages (as compared to the "function ID" approach):

    1) It's a lot more complex – "RPC description language" stuff would need to be designed and documented in some sort of formal specification, and understood by everyone writing a server. For "function IDs", it'd still need to be documented and understood, but it's very simple and could be described in a few sentences (rather than a few hundred pages).

    2) The setup overhead is a lot higher than "none" – the "RPC description language" would have to be created and sent to the kernel, and parsed by the kernel (including the kernel checking if it's valid, e.g. if the description complies with some sort of "RPC description language" specification).

    3) After it's all set up, it's still slower.

    4) It fails to avoid the need for the server to check parameters anyway (unless the "RPC description language" is several orders of magnitude more complex, and the kernel keeps track of state, allowed range/s of parameters, etc)

    5) It fails to handle things like variadic functions, unions, variable length arrays, etc (unless the "RPC description language" is several orders of magnitude more complex)

    6) The overhead of checking parameters, etc is attributed to the client and not the server. I call that a disadvantage. Ideally, everything involved in using the server should be attributed to the server, including the server's library that the client uses, checking the parameters the server's library sends to the server, etc. For example, imagine you're comparing servers, and one server pushes a lot of work into its library (making the client seem slow and the server seem fast) while another doesn't – is that a fair comparison? However, as long as people know how CPU time is accounted for, I doubt anyone will actually care (it's not like anyone is planning to charge $$ for CPU time used by the clients, are they?).

    7) It's less flexible. Every time the server's code changes, the library that the client uses also has to change (and if static linking is used by clients, every client needs to be recompiled). It also makes it difficult to do things like load distribution (e.g. multiple different versions of the server running at the same time, where the client connects to the "closest" instance of the server), and fail-over and live upgrades (e.g. seamlessly switching to a different/newer version of the server without clients being interrupted). In theory, by making the "RPC description language" more complex, you might be able to work around "some" of the problem. In practice I doubt "some" is going to be anywhere near as much as you hope – you'd be able to work around simple changes (like the same functions at different addresses) relatively easily, but the more changes you try to cope with the more complex the "RPC description language" will become and the more overhead you'll end up with. Most people writing servers are going to avoid half the hassle by putting a jump table at the start of their executable anyway (so that the kernel calls a "jmp real_address_of_function" instruction instead of calling the real address of the function directly), so adding extra complexity to work around simple changes (like the same functions at different addresses) isn't going to be very useful.

    8) It limits the OS's ability to extend the design to distributed computing in future (where the idea of using function pointers makes even less sense).

    9) If a client sends "bogus" packets, then (depending on a number of things) it might avoid the need to switch to the server's thread only to find out the data was dodgy (especially for "synchronous" – for "asynchronous" the server would hopefully already be running or need to run anyway). This advantage is mostly irrelevant because clients (or more correctly, the server's library in the client) shouldn't be sending dodgy data to begin with.

  13. Hadrien August 11, 2011 / 9:05 pm

    Note: I’m officially ignoring your “flexibility” and “compatibility” arguments – using function pointers is no more flexible and no easier for compatibility than using “function IDs”; the only difference is whether the burden of flexibility/compatibility is pushed onto the kernel or not.

    Except I don’t use function pointers directly (they are only kept by the kernel as a way to execute the call on the server side once it’s ready), but rather a more complex function descriptor structure which tells a little bit more than just the address at which the function is executed.

    One thing you’re overlooking is that what a server accepts isn’t static, but changes. For example, the server might only accept “setup()” and “get_version()” initially; but then after a client calls “setup()” it might also accept “foo()” and “change_state()”; and then depending on the current state it might (or might not) accept “first()” or “second()” or “third()”.

    Fair point, but this problem exists in every form of IPC in which servers can be in multiple states. A way to address it would be to allow the server to enable/disable each of its remote calls (and the associated connections), with a disabled call resulting in instant call failure when the client tries to invoke it. Frustrating? Sure, but servers with an inconsistent behaviour that changes depending on their mood and the time of day generally are…
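    A rough sketch of what that enable/disable bookkeeping could look like (hypothetical, not an existing interface):

```cpp
#include <cstdint>
#include <vector>

struct RemoteCall {
    void* entry_point;
    bool  enabled;   // toggled by the server as its internal state changes
};

struct ServerInterface {
    std::vector<RemoteCall> calls;

    // Server side: e.g. enable change_state() only once setup() has succeeded.
    void set_enabled(uint32_t call_id, bool on) { calls[call_id].enabled = on; }

    // Kernel side: a call on a disabled entry fails immediately instead of
    // reaching a server that is not in a state to handle it.
    bool may_invoke(uint32_t call_id) const {
        return call_id < calls.size() && calls[call_id].enabled;
    }
};
```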

    Then there’s parameters themselves – if “change_state()” accepts one 8-bit integer as its only parameter, then which values are accepted and when?

    The kernel does not check parameter values, because that would imply some overhead on *every* remote call, and I’m against that. If the server offers a parameter which can have a wrong value, it’s up to it to check the values and provide some exception system which can be used to report bad values to the client. What the kernel checks is that clients may only use valid server entry points, and with a stack of the right length, because doing otherwise may cause an instant server crash (if server code tries to access a missing parameter beyond the beginning of the stack).

    In the same way, one “server” may provide multiple entirely different interfaces – for example, it might have one set of RPC calls that anonymous clients can use, and an entirely different set of RPC calls that only some special clients are allowed to use.

    This is something I’ve actually already been thinking about for some time, although I’ve not settled for a specific solution yet. What’s for sure is that the solution will be based on the capability system that will be at the heart of TOSP’s security, but the kernel can only do “binary” capability checks (capabilities that you either have or don’t have), while the rest must be implemented on server side. I don’t know yet if letting the kernel manage what it can manage or putting everything on server side is the best option, and I’ve decided to postpone this decision until I’m actually developing said capability system.

    The end result of all this is that the kernel has to validate function parameters and decide whether or not the call might be accepted; and then the server validates function parameters a second time and decides if the call is actually accepted.

    Yes and no. First, no work is duplicated (what the kernel does, servers don’t have to do). Second, I think you mistakenly believe that the kernel does parameter checking on each call. With the current system, there’s only some significant overhead involved when a connection is set up between a client and a server. After that, the kernel can certify that the client calls are valid without doing the checks all over again each time.

    The next thing you’re overlooking is that communication is often 2-way, and is sometimes n-way. For example, do clients have to tell the kernel what they accept, so the kernel can figure out if the client might accept a call from a server?

    Of course. Processes do not belong to either the client or the server category; this terminology is only used to distinguish the process that initiates a call from the process that receives it. Actually, any operation which returns a result (e.g. filesystem access) will require the server to become a client and the client to become a server at some point, when the server wants to send some results back.

    When the compiler is started, the main piece checks the computer it’s running on and decides which stages should be run where. If there’s 2 NUMA domains with 4 CPUs per NUMA domain, it might decide that the first 6 stages run as one process with 7 threads (including the main piece) in the first NUMA domain, and the remaining 4 stages run as a second process with 4 threads in the second NUMA domain. Of course to make it simple (and to maximize the chance of keeping all CPUs busy without the need to use any locks/mutexes, etc), each stage uses asynchronous IPC to communicate with other stages (regardless of whether or not the other stages are running as threads in the same process or threads in a different process).

    Now, who is a server and who is a client, and which RPC calls are accepted by what?

    This strikes me as something which goes beyond the scope of RPC as I see it. RPC is a method of interprocess communication; it’s meant to be used to achieve communication between processes. If you have multiple threads running within a single process, then there are much more efficient communication primitives. Also, I can’t see the point of software where things are or aren’t in isolated processes depending on the weather and the number of cores of the host machine. If you are crazy about performance, you use a single process with threads, because that will always be faster than any IPC mechanism in existence. If you want the isolation and robustness that process isolation brings, you use multiple processes. Stuff in between is a dirty compromise that belongs more to the realm of marketing (“look, we have process isolation!”) than to that of serious software development, in my opinion :)

    (PS: Wow, you post quickly considering the amount of detail you put in your comments! :) I’ll answer as much as I can tonight, but I guess you’re still gonna have to wait to get all the answers.)

  14. Hadrien August 12, 2011 / 9:12 am

    Networked RPC operation: that’s your decision I guess. I’d be relatively hesitant to assume you’d never want to use your RPC over a network (even if its primary purpose is local IPC); and I’d be tempted to assume that eventually you will want something for networked RPC (even if you don’t want it now), and that you’re risking eventually ending up with 2 incompatible forms of RPC.

    To be honest, I don’t think RPC (or another high-level IPC protocol, for that matter) is the right fit as the main IPC primitive over a network. When working with networked stuff, which has “hardware failure” and “heterogeneous hardware” among its normal use cases, I’d want to use something of extreme structural simplicity, with a very small level of abstraction, like a basic stream of bytes, and let applications use whatever higher-level protocol they want on top of it.

    Either that, or fully go the distributed OS route, assume that all computers on the network run my OS, and abstract all differences between them away so that they may coherently work like a single machine, with lots of NUMA and propagation delays between cores. Allow processes, and even a significant part of the kernel, not to care so much about all the issues of networked stuff. And then implement higher-level IPC on top of that abstract “hive mind”.

    I don’t believe in stuff in between.

    But that was just an aside. The real issue, basically, is one of boundaries. I don’t have an infinite amount of lives and motivation in front of me, so I have to restrict myself to a limited set of problems which I want to address, and leave the rest to others. And the problem of “abstracting away the network infrastructure from processes so that they can transparently use local IPC methods over it without a loss of functionality” doesn’t interest me right now. A single computer is already a fun enough toy to play with, and if I manage to get it right, I’ll already have done, in my opinion, a better job than Apple, Google, Microsoft, and other mainstream OS developers. Ambitious enough, I think.

    However remote/networked communication isn’t the only problem with shared memory, and shared memory (for local communication) isn’t quite as forgiving as you might be hoping. For a start, if the client sends a pointer to something in its virtual address space, there’s no guarantee that the same specific virtual addresses will be “free” in the server. To get around that, shared memory areas are usually negotiated in advance; and “offset within negotiated shared memory area” is typically used instead of pointers.

    Here you’re getting into stuff which I had been thinking about quite heavily back when I implemented my memory manager :) About this problem, I think that unless you have language- or compiler-level support, the only thing which can be done is to restrict the sharing functionality to the sharing of “simple”, pointer-free data. Other stuff, like linked lists, would still have to be serialized first, typically using offsets within the buffer as you say. Shared memory is certainly not a silver bullet that makes RPC work exactly like local function calls; it’s only an extra feature that’s quite useful when working with large buffers, finite-speed memory buses, and small stacks, just like pointer function parameters are in C.

    (Also, on that subject, note that any class which internally uses pointers cannot be used directly as a parameter for RPC purposes. I’m still trying to figure out a way to bypass this heavy limitation, in the specific case where said pointers point to buffers (e.g. a string class), through minor modification of the affected classes. Something that makes classes “declare” their pointer members, so that the related buffers can be silently added to the function parameters, is typically what I have in mind.)
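    A very rough sketch of that “declare your buffers” idea (every name here is hypothetical, nothing is implemented):

```cpp
#include <cstddef>

// What a class reports about one of its internal buffers.
struct BufferDescriptor {
    void*       base;
    std::size_t size;
};

// Classes that want to be usable as RPC parameters expose their internal
// buffers, so the RPC layer can silently share or transfer them along with
// the call instead of sending dangling pointers.
class RpcString {
public:
    // Write descriptors of internal buffers into 'out'; return how many.
    std::size_t declare_buffers(BufferDescriptor* out, std::size_t max) const {
        if (max == 0 || data_ == nullptr) return 0;
        out[0] = BufferDescriptor{data_, length_ + 1};  // include terminator
        return 1;
    }

private:
    char*       data_   = nullptr;  // heap buffer holding the characters
    std::size_t length_ = 0;
};
```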

  15. Hadrien August 12, 2011 / 11:40 am

    Then there’s information leaks – for example, the client can’t have a 1 KiB shared memory buffer at virtual address 0x12345000 and then store 3 KiB of confidential passwords at 0x12345400; because that would give the server access to the entire 4 KiB page containing the 1 KiB of shared memory, which includes the 3 KiB of confidential passwords.

    Known and addressed. There’s a special allocator for shareable buffers, which makes sure that the allocated data is alone in its page(s). Since this is an unusual feature, I’ve made it more discoverable by ensuring that only buffers allocated this way may be shared, using a little flag.
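    As a rough illustration of that rule (a user-space approximation; the real allocator lives in the memory manager and also sets the “shareable” flag):

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t PAGE_SIZE = 4096;

// Round the request up to whole pages and return page-aligned memory, so
// nothing else (e.g. confidential data) can ever share a page with the
// buffer that will be exposed to another process.
void* allocate_shareable(std::size_t size) {
    const std::size_t rounded =
        ((size + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
    // Only memory obtained this way would carry the flag that makes it
    // eligible for sharing; anything else is refused by the sharing call.
    return std::aligned_alloc(PAGE_SIZE, rounded);
}
```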

    Finally there’s synchronisation. Let’s say a client puts some data into its shared memory and sends an asynchronous RPC to a server (that uses the data in that shared memory area). How soon can the client modify its shared memory after sending the asynchronous RPC? The client couldn’t modify that data immediately after sending the asynchronous RPC, because the server wouldn’t have had a chance to use that data yet (and by the time the server does use the data it’d be modified/corrupt). Would it be safe for the client to modify the data 10 ms after sending the asynchronous RPC to the server? What about 20 seconds after sending the asynchronous RPC? How can the client know when it’s safe? It can’t. Basically it’s only safe for the client to modify the data in shared memory after it knows the server has finished with that data, which makes asynchronous RPC (where the server doesn’t send back any reply) worthless when shared memory is involved – in most cases you’d have to use synchronous RPC instead (where the client can know when the server has finished with the data).

    Of course you probably realise I’m not a fan of “synchronous” for anything modern (where multiple CPUs are likely). To get good scalability out of “synchronous” you have to spawn more threads (which increases overhead) and then you end up with the hassle of synchronising threads (with things like locks/mutexes, polling, signals, etc) which are far too easy to get wrong (and far too easy to get wrong in ways that result in “Heisenbugs”).

    I don’t think it’s that clear-cut.

    Synchronous RPC is when you can do *nothing* until the server has returned something. That’s not what happens here. You just can’t touch one piece of data until the server has notified you that it’s done with it. That doesn’t cut you off from async goodness, like sending several unrelated calls to different servers in a row without having to wait for replies.

    Here are some uses I see for shared memory as I currently plan to implement it, which do not imply going back to purely synchronous behavior.

    1. You want to send a big block of data to the server as a parameter of a remote call, and won’t ever need it back. In that case, you do the call, free the data on the client side, and can go do something else. Anything pipelined qualifies as an example.
    2. You await some results from the server, so you give it a dedicated buffer in client memory to write them into. The server sends a notification to the client once it’s done writing to the buffer, and the client then works with the data. It’s pretty much like DMA. Disk or network drivers could work this way (a sketch of this pattern follows the list).
    3. You actually want several clients and servers to cooperatively work on some piece of data. In that case, you either use read-only memory blocks, or get the usual synchronization machinery to avoid writing to the same data at the same time. God help you with the debugging in the latter case. For read-only blocks, one possible use would be “broadcasting” some piece of information to several services.
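
    To illustrate the second case, here is a sketch of the client side of a disk read. The rpc_* functions and their signatures are invented for the example (and alloc_shareable() refers to the allocator sketched above), so treat this as a picture of the intended flow rather than as the actual API.

      #include <cstddef>
      #include <stdint.h>

      // Hypothetical primitives, declared here only to make the flow readable
      void* alloc_shareable(std::size_t size);                   // page-isolated buffer
      int   rpc_call_async(int connection, const void* args, std::size_t args_size);
      void  rpc_on_notification(int connection, void (*handler)(void* buffer));

      // Arguments of a hypothetical ReadSector remote call
      struct ReadSectorArgs {
          uint64_t lba;         // Which sector to read
          void*    destination; // Shareable buffer the server will write into
      };

      static void on_read_complete(void* buffer) {
          // Only now is it safe to touch the buffer : the server said it's done
          (void)buffer; // ... parse the freshly read sector here ...
      }

      void read_sector_async(int disk_connection, uint64_t lba) {
          ReadSectorArgs args;
          args.lba = lba;
          args.destination = alloc_shareable(512); // one disk sector

          rpc_on_notification(disk_connection, on_read_complete);
          rpc_call_async(disk_connection, &args, sizeof(args));

          // No blocking : the client can keep issuing unrelated calls, as long
          // as it leaves args.destination alone until the notification arrives.
      }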

    If the chunk of data which you want the server to play with is large, let’s say a 1 GB RAW image as an example, you end up moving one page directory from client to server, which eats no extra RAM and can be extremely fast if it’s done right (e.g. if the page directory is mapped into the server’s address space during a task switch to avoid additional TLB flushing costs, when possible).

    Pretty close to what I advocate with the “sharing memory” expression, actually, unless I’ve misunderstood you.

    Variable-sized IPC packets are already on the table – for your RPC, a function that takes no parameters implies a tiny IPC packet (maybe 0 bytes or 8 bytes or something), and a function that takes 20 different parameters implies a much larger packet (maybe 120 bytes or something). Packets can easily be treated as atomic objects regardless of what the actual or maximum size is. You probably don’t like variadic functions (functions that take a variable number of parameters), but only because your “kernel tries to pre-check stuff that it can’t check properly anyway” approach can’t handle variadic functions in a sane/consistent way (although that’s probably not limited to variadic functions – e.g. anything involving unions would be equally “un-pre-checkable”).

    Touché, it’s indeed variadic stuff which I’m not fond of. It strikes me as an area where a server must have blind confidence that the client will make packets of the right size. And yes, I don’t like printf.

  16. Hadrien August 12, 2011 / 11:47 am

    About the different kinds of context switching, thanks for the explanation ! :)

  17. Hadrien August 12, 2011 / 12:10 pm

    Um, what??

    Rather than doing “client sends data to server and server does checking”, and rather than doing “client sends data, kernel does unnecessary pre-checking (while still running in the context of the client thread), then server does proper checking”, you’re planning to do “client sends data, kernel does unnecessary thread switch, then kernel does unnecessary pre-checking, then server does proper checking”?

    Surely you’ve forgotten to mention that everything will be in XML? I’ve got an idea – you could add more threads (and more pointless overhead and thread switching) if you have “special” kernel threads for encoding/decoding all the data into/out of XML format. Something like: the client calls the kernel’s “encode as XML” function, the kernel switches to a special “XML encode” thread, that thread encodes the raw data as XML, and the kernel returns the XML to the client; the client calls “RPC_send()” with the XML data, the kernel switches to a special “RPC send” thread, the kernel checks the XML data on behalf of the server, and the “RPC send” thread sends the XML to the server. Then the server receives the XML data and calls the kernel’s “XML decode” function, and the kernel sends the XML to a special “XML decode” thread, which returns the decoded XML to the server. Only after all of that should the server be allowed to do anything useful, like properly checking the data it received (to determine if the call is valid for the current state, if parameters are in range, etc). :-)

    XML is not suitable, because it’s too old, and the brackets it uses are too ugly. An incompatible variant of CSS would totally be a better fit for that purpose ! :)

    And now, for a more serious discussion…

    Of course I’m joking about XML; but how does the server describe what it accepts to the kernel? You’ll need some sort of “RPC description language”, but what exactly does that look like? If the first parameter is meant to be a 32-bit “float” and the client tries to pass a “uint32_t” instead, does the kernel do type checking (and how can kernel know the difference)? If a client can’t use “bar()” until it has used “foo()”, then will the kernel need to keep track of state?

    Currently, I do it using data structures which basically contain the same information as a C function prototype.

    On the server side, it’s…

    • Function pointer
    • Function name
    • Number of parameters
    • For each parameter…
      • Type name
      • Type size
      • Default value (if any)

    On the client side, it’s…

    • Server name
    • Function name
    • Number of parameters
    • For each parameter…
      • Type name
      • Type size

    This is not necessarily final, though (as an example, I wonder if it’s worth it to have separate client and server descriptors, and if a unified descriptor format wouldn’t be a better fit even if it means wasting a bit of space. I also still wonder if having the client ask for ServerName.FunctionName is the right thing, or if a unified namespace would be better. There are pros and cons to each approach). It’s just what I use in my testing code.
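
    Written down as C++ data structures, the descriptors above could look roughly like this. This is only a transcription of the lists, not the actual code; the field names and the use of std::string / std::vector are my own shorthand for the example.

      #include <cstddef>
      #include <string>
      #include <vector>

      // One parameter of a remotely callable function
      struct ParameterDescriptor {
          std::string       type_name;     // e.g. "uint32_t", "Color"
          std::size_t       type_size;     // Size of the type, in bytes
          bool              has_default;   // Server side only : default value present ?
          std::vector<char> default_value; // Raw bytes of the default, if any
      };

      // What the server broadcasts for each function it exposes
      struct ServerCallDescriptor {
          void (*entry_point)();                       // Function pointer
          std::string function_name;
          std::vector<ParameterDescriptor> parameters; // Count = parameters.size()
      };

      // What the client declares when it wants to use a remote call
      struct ClientCallDescriptor {
          std::string server_name;
          std::string function_name;
          std::vector<ParameterDescriptor> parameters; // No defaults on this side
      };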

    The problem which you mention may still occur if a given type (typically a class) is modified internally while keeping the same name, but I think that would be a problem even with libraries (the calling code relies on an old library header where classes are defined in a specific way, if I’m not mistaken).

    When a client says “call the function at virtual address 0x12345678 in server XYZ” does the kernel calculate a hash from 0x12345678, then find the hash table for server XYZ, then try to find the entry for 0x12345678 in the corresponding hash table? That’s the fastest way I can think of.

    Again, the client does not rely on a fixed server-side function pointer. First there’s a sort of handshake process where the client and the server exchange descriptors and the kernel silently sets up compatibility stuff if the client has an outdated descriptor that can be fixed. At the end of that handshake, the client gets an integer “connection number”, which it can use to refer directly to the remote call from that point on. It’s pretty much like your function IDs, in a way, and optimizations like hash tables do of course apply.
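
    From the client’s point of view, the intended flow would look something like the sketch below. rpc_connect() and rpc_call() are placeholder names for an API that isn’t final, and the descriptor type is the one sketched above.

      #include <cstddef>

      struct ClientCallDescriptor; // As sketched above

      // Placeholder system interface, not the final API : rpc_connect() performs
      // the descriptor exchange (and compatibility setup, if needed) and returns
      // an integer connection number, or a negative error code.
      int rpc_connect(const ClientCallDescriptor& descriptor);
      int rpc_call(int connection, const void* args, std::size_t args_size);

      void client_startup(const ClientCallDescriptor& descriptor) {
          // The handshake happens once, typically when the client starts
          int connection = rpc_connect(descriptor);
          if (connection < 0) return; // Server missing or descriptors incompatible

          // From then on, the connection number alone identifies the remote call.
          // (Pointer parameters would go through the sharing machinery above.)
          unsigned arguments[2] = { 640, 480 };
          rpc_call(connection, arguments, sizeof(arguments));
      }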

  18. Hadrien August 12, 2011 / 4:39 pm

    Now imagine if the kernel did nothing more than passing (variable sized) packets of data from client to server. No “RPC description language” and no setup overhead needed at all. When the server receives a packet it can do something like “if(functionID < MAX_FUNCTION_ID)”, then load the function ID into RAX and do something like “call [function_table + RAX * 8]”. No hash table lookup. Once the correct function is running in the server it can easily, correctly and completely validate that the function is acceptable for the current state, that the correct number of parameters were supplied, that all parameters are within valid ranges, etc.

    Then your IPC primitive is not strict RPC, but rather a data stream with an RPC protocol implemented on top of it.

    Syntactic pedantry aside, you can’t just do “call [function_table + RAX * 8]” and cross your fingers, because you have to check a number of things about the parameter stack first if you don’t want the function to crash. A way to do that would be to put an assembly snippet at the beginning of each server function, but that’s not exactly better than doing things in a centralized way, as it involves duplicating code.
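
    To picture what those checks amount to in the function-ID scheme, here is a sketch of what the server-side dispatcher has to do anyway before jumping through the function table. The table layout and the names are made up for the example.

      #include <cstddef>
      #include <stdint.h>

      // Hypothetical per-function metadata : the handler plus the argument size
      // it expects, so short or oversized packets are rejected before dispatching.
      struct FunctionEntry {
          void (*handler)(const void* args);
          std::size_t expected_args_size;
      };

      static const std::size_t MAX_FUNCTION_ID = 2;
      extern const FunctionEntry function_table[MAX_FUNCTION_ID];

      // Equivalent of "if(functionID < MAX_FUNCTION_ID) call [function_table + RAX*8]",
      // plus the packet/stack sanity check that has to happen somewhere anyway.
      bool dispatch(uint32_t function_id, const void* args, std::size_t args_size) {
          if (function_id >= MAX_FUNCTION_ID) return false;        // Unknown call
          const FunctionEntry& entry = function_table[function_id];
          if (args_size != entry.expected_args_size) return false; // Truncated or padded packet
          entry.handler(args);
          return true;
      }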

    Let’s summarize the advantages and disadvantages (as compared to “function ID” approach):

    1) It’s a lot more complex – “RPC description language” stuff would need to be designed and documented in some sort of formal specification, and understood by everyone writing a server. For “function IDs”, it’d still need to be documented and understood, but it’s very simple and could be described in a few sentences (rather than a few hundred pages).

    I confess there’s some extra complexity, but again, I don’t think there’s much more there than what you’d need anyway on server side to check that the client is calling a function properly. Besides, if we stick with relatively simple descriptors like the ones described above, it’s actually simple to have a code generator craft them automatically. Server maintainers would only need to be careful about what types they use as parameters to remote calls, and that’s true of any form of RPC.

    2) The setup overhead is a lot higher than “none” – the “RPC description language” would have to be created and sent to the kernel, and parsed by the kernel (including the kernel checking if it’s valid, e.g. if the description complies with some sort of “RPC description language” specification).

    Yet having some initial setup done also avoids doing some checks each and every time a call is made, which strikes me as an interesting compromise overall.

    3) After it’s all set up it’s still slower

    Not so much, if done properly. I don’t know what’s wrong with the code tested in this article, or even if something’s wrong, but it basically copies function parameters around, shares any pointer parameter, and optionally appends an extra “compatibility stack frame” at the end. Nothing which another RPC implementation with similar features could avoid, really. There’s the overhead of going through the kernel, perhaps, but you just said that modern syscalls are cheap (20 CPU cycles, what is that really ?), and if having them around is too much of a bother you can also introduce some dedicated “RPC buffer” mechanism so that the kernel only processes the call once the scheduler switches to it.

    4) It fails to avoid the need for the server to check parameters anyway (unless the “RPC description language” is several orders of magnitude more complex, and the kernel keeps track of state, allowed range/s of parameters, etc)

    As mentioned before, I really don’t think that servers with an internal state are such a bright idea, but you can manage them by allowing servers to enable and disable their remote calls, and having the kernel also “disable” the associated client connections. I’d hardly call that orders of magnitude more complex.

    As far as allowed ranges of parameters are concerned, the server indeed still has to check those itself, but there’s a huge difference in server implementation complexity between…

    • Checking the values of some parameters through conditional structures
    • Not even being able to tell, without a platform-dependent assembly snippet, whether all the parameters to the remotely called function actually exist, or whether touching a nonexistent parameter will make the server brutally underflow its stack.

    5) It fails to handle things like variadic functions, unions, variable length arrays, etc (unless the “RPC description language” is several orders of magnitude more complex)

    Unions : Why shouldn’t they work ? The RPC system, in its current form, has no knowledge of the internal contents of the types which are fed into it; it only knows their name and their size. Unions do not have a variable size, so they do work. (I have yet to find a use for them, though.)

    Variable-length arrays : it depends on how they are implemented, really.

    • Most C/C++ programs manage arrays whose length is not known in advance in a fashion similar to MyType* array = new MyType[n]. This can easily be ported to RPC-compatible code by using the “shareable” memory allocator instead of the regular one (see the sketch after this list).
    • If you want to get fancy and have an array whose length varies over its lifetime, then you need something like an STL container, where pointers are hidden on the inside. As said before, how this situation should be managed is still an open question. But I’m pretty sure that it is a problem for all RPC methods which do not have a full description of all involved classes at hand.
    • Then, of course, you can also allocate arrays on the stack and enjoy some well-deserved stack overflows. I’m personally not so fond of that.
    • Or perhaps you can also have static VLAs in C99 (genuine question, I really don’t know). How to transfer static arrays from a process A to a process B without resorting to slow and ugly copies is something I still have to think about. If I can’t find anything, the worst-case scenario is to copy the static array into a shareable heap object, which is approximately as fast as copying an array onto a message queue with regular message passing.
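
    Here is the kind of porting I have in mind for the first case. alloc_shareable() is the hypothetical allocator sketched earlier, and this only works as-is for POD element types (anything with a non-trivial constructor would need placement new on top of it).

      #include <cstddef>
      #include <stdint.h>

      void* alloc_shareable(std::size_t size); // Hypothetical allocator from earlier

      // A POD element type, as an example
      struct Sample {
          int32_t left;
          int32_t right;
      };

      // Before : Sample* samples = new Sample[n];
      // After : the array lives in a page-isolated, shareable buffer, so it can
      // be handed to a server as an RPC parameter without copying it.
      Sample* alloc_sample_array(std::size_t n) {
          return static_cast<Sample*>(alloc_shareable(n * sizeof(Sample)));
      }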

    Variadic functions : there I agree. Then again, I can’t think of a clean way to make function ID-based RPC support shared memory buffers, and I use heap objects much more often than I use variadic functions… I guess no form of RPC will ever, with a reasonable amount of complexity, be able to offer the full coding comfort of an in-process call. Which set of C/C++ features one chooses to support is essentially a matter of taste.

    6) The overhead of checking parameters, etc is attributed to the client and not the server. I call that a disadvantage. Ideally, everything involved in using the server should be attributed to the server, including the server’s library that the client uses, checking the parameters the server’s library sends to the server, etc.

    For example, imagine you’re comparing servers, and one server pushes a lot of work into its library (making the client seem slow and the server seem fast) while another doesn’t – is that a fair comparison?

    Now this is getting philosophical, but if I’m not mistaken, your point is “why should the client have to pay for server idiocy ?”, whereas my point is “why should the server have to pay for client idiocy ?”. If that’s effectively what’s happening, we could argue about it for a long time, considering that both play relatively symmetrical roles…

    But okay, let’s picture a situation where a server provides a service to many clients. Let’s also assume that it does so asynchronously, that is, with a pending request queue. In that case, a server slowdown affects the response time of all clients, whereas a client slowdown only affects that particular client. Therefore, I consider that clients should carry the weight of their own mistakes.

    Now about your example in particular…

    For example, imagine you’re comparing servers, and one server pushes a lot of work into its library (making the client seem slow and the server seem fast) while another doesn’t – is that a fair comparison?

    In my opinion, the performance of a server is that of everything it provides (library stub included), combined. At least, if I had to benchmark two servers doing the same thing against each other myself, I’d write a little client which gives the same task to each server N times, run it on an idle, stable machine, and compare the total time it takes for each batch of operations to complete.

    However, as long as people know how CPU time is accounted for, I doubt anyone will actually care (it’s not like anyone is planning to charge $$ for CPU time used by the clients, are they?).

    Not sure I understand what you mean there, but I think it means that we agree on this :)

    7) It’s less flexible. Every time the server’s code changes, the library that the client uses also has to change (and if static linking is used by clients, every client needs to be recompiled).

    No ! The whole point of having client-side and server-side descriptions of the call to compare against each other is to enforce compatibility to the fullest extent possible without recompiling the client.

    Let’s assume that our server is a window manager. At some point, to create a window, it provided the call CreateWindow(wchar_t* name, uint32_t width, uint32_t height). Our client was compiled to work with that.

    Then a new version of the server is rolled out, with extended functionality, and we can now choose the color of the border of each window we’re creating. The new prototype, reflected by the new server descriptor, is CreateWindow(wchar_t* name, uint32_t width, uint32_t height, Color border_color = CL_BLUE).

    When the server starts, it provides a call descriptor to the kernel. Then the client comes in and provides its own version of the call descriptor to the kernel. The kernel takes a closer look, and mumbles “Hmmm… This doesn’t match, but I can fix it”. So it creates a “compatibility stack frame”, containing the default border_color value (CL_BLUE), that is to be appended to the stack provided by the client each time CreateWindow is called.

    Thanks to this, the server doesn’t have to maintain deprecated versions of CreateWindow, and the client can operate without recompilation at the cost of a slight performance hit. This is a win-win situation, in my opinion.
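
    In terms of mechanism, the “compatibility stack frame” boils down to something like the sketch below : the kernel notices that the client’s argument block stops short of what the server expects, and appends the raw bytes of the missing defaults. The flat byte-blob representation is of course a simplification of the real stack handling.

      #include <cstddef>
      #include <vector>

      // Simplified view : the client's arguments arrive as a flat blob of bytes,
      // and the server expects that blob followed by defaults for the parameters
      // that were added since the client was compiled.
      std::vector<char> append_compatibility_frame(
              const std::vector<char>& client_arguments,
              const std::vector< std::vector<char> >& missing_defaults) {
          std::vector<char> fixed = client_arguments;
          for (std::size_t i = 0; i < missing_defaults.size(); ++i) {
              // e.g. the raw bytes of CL_BLUE for the new border_color parameter
              fixed.insert(fixed.end(),
                           missing_defaults[i].begin(), missing_defaults[i].end());
          }
          return fixed;
      }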

    It also makes it difficult to do things like load distribution (e.g. multiple different versions of the server running at the same time, where the client connects to the “closest” instance of the server)

    If you’re talking about having one copy of the server running on each NUMA domain, and having the client connect to a copy of the server that is on the same NUMA domain, that is certainly possible. You only have to make the kernel look for the server on the same NUMA domain first, and then extend the search to other NUMA domains if it can’t find it there.

    and fail-over and live upgrades (e.g. seamlessly switching to a different/newer version of the server without clients being interrupted).

    Thought about those, actually. Live upgrades can be managed pretty simply; all the old server has to do (sketched in code after the list) is to…

    • Install and start the new version of itself, check that it’s alive and well.
    • Notify the kernel in some way that it is shutting down and, as such, is not a valid target for RPC requests anymore. This will in effect discard all existing RPC connections from clients, and prevent the creation of new ones (so that they are redirected to the new version of the server). We need such functionality for computer shutdown to work properly anyway :)
    • Finish processing the ongoing RPC requests, then kill itself
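
    Spelled out as code, with made-up names for the system calls involved, the sequence in the old server would look roughly like this :

      // Hypothetical system interface, for illustration only
      bool spawn_process(const char* path); // Install and start the new version
      void rpc_retire_server();             // No new connections, existing ones discarded
      bool rpc_pending_requests();          // Calls left in our queue ?
      void process_next_request();          // Handle one queued call
      void exit_process();

      void live_upgrade(const char* new_version_path) {
          if (!spawn_process(new_version_path)) return; // New version not alive and well : abort

          rpc_retire_server();           // From now on, clients go to the new version

          while (rpc_pending_requests()) // Finish what was already accepted...
              process_next_request();

          exit_process();                // ...and bow out
      }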

    Can it really be made much simpler ?

    As for server crash handling, this is a bit more complicated, because it requires more kernel involvement : since the server has died, it’s not there to manage its own death anymore, leaving that dirty job to the kernel.

    The job of the kernel, at this point, is essentially to…

    • Resuscitate the server (if it has the karma it takes :))
    • Inform clients which had sent work to the server of its tragic death, so that they may send the work again

    The first job is relatively straightforward. Check if the server has automatic resurrection enabled, and if so, run it and wait until it is ready (requirement : a way for the server to tell the kernel when it’s “ready”).

    The second part, however, is trickier. I’m not sure how to do it without drastically sacrificing performance by logging which functions clients have been calling and which ones have been fully executed by the server. But I’m pretty sure that you’d have the same problem with any RPC system. Or can you prove the contrary ?

    In theory, by making the “RPC description language” more complex, you might be able to work around “some” of the problem. In practice I doubt “some” is going to be anywhere near as much as you hope – you’d be able to work around simple changes (like the same functions at different addresses) relatively easily, but the more changes you try to cope with, the more complex the “RPC description language” will become and the more overhead you’ll end up with. Most people writing servers are going to avoid half the hassle by putting a jump table at the start of their executable anyway (so that the kernel calls a “jmp real_address_of_function” instruction instead of calling the real address of the function directly), so adding extra complexity to work around simple changes (like the same functions at different addresses) isn’t going to be very useful.

    So far, I’ve not seen anything which requires a major feature-packing of the RPC API to the point of significantly increasing its overhead, except this server death notification thing.

    8) It limits the OS’s ability to extend the design to distributed computing in future (where the idea of using function pointers makes even less sense).

    There aren’t many local abstractions which scale well to a distributed computing environment. As said earlier, it’s an illusion to believe that a mature OS can be made “distributed-ready” in the blink of an eye, in my opinion. Either you develop a distributed OS from the ground up, or you develop a local OS, and there is no way back in either case. A distributed OS will never be a good local OS, and vice-versa.

    9) If a client sends “bogus” packets, then (depending on a number of things) it might avoid the need to switch to the server’s thread only to find out the data was dodgy (especially for “synchronous” – for “asynchronous” the server would hopefully already be running or need to run anyway). This advantage is mostly irrelevant because clients (or more correctly, the server’s library in the client) shouldn’t be sending dodgy data to begin with.

    Sure, but that’s not what I want to achieve with this. I want to make it easier for servers to deal with the case where a buggy or malicious client sends trashed data. I won’t do all the work for them, but a stack of insufficient length is typically something which the RPC subsystem (be it kernel code or a server-side library) can frown upon very easily, whereas the called code must resort to horrible practices like assembly snippets to detect it.
