Like last year, I’ve had quite a bit of spare time for coding, and no Internet access during that time. So here’s a big update on what I’ve been doing. In the first part of this update, let’s talk about the elephant in the room: I’m done checking the viability of my RPC model. The tests, which follow the plan defined earlier, were performed on three emulators: Bochs, QEMU, and VirtualBox, listed from slowest to fastest. All three ran on my trusty laptop, which sports a Core i5 M430 and 4GB of RAM and mostly runs 64-bit Fedora Linux 14 for OSdeving purposes. Here are the results:
Server startup performance tests
This step is pretty easy to optimize: since the server broadcasts its whole interface at once, pooled memory allocation can be used, which improves performance dramatically. As a result, an optimized version of the code takes about half a second to execute in Bochs, and executes instantaneously under QEMU and VirtualBox. Conclusion: on real hardware, there’s no need to worry about server startup performance.
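To illustrate what pooled allocation means here, below is a minimal sketch of a bump-style pool in C. It is not the code used in the tests: the names `pool_t`, `pool_init`, `pool_alloc` and `pool_destroy` are made up for this example, and `malloc` stands in for whatever the kernel allocator would provide. The idea is simply that the whole interface is reserved in one heap request, and each descriptor is then carved out by advancing an offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump ("pool") allocator: one big allocation up front,
   then each descriptor is carved out by advancing an offset. */
typedef struct {
    uint8_t *base;  /* start of the pooled region */
    size_t   size;  /* total bytes in the pool */
    size_t   used;  /* bytes handed out so far */
} pool_t;

/* Reserve the whole pool in a single heap request. Returns 0 on success. */
int pool_init(pool_t *p, size_t size) {
    p->base = malloc(size);
    p->size = p->base ? size : 0;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Hand out n bytes at an offset aligned to `align` (a power of two no
   larger than malloc's own alignment), or NULL if the pool is full. */
void *pool_alloc(pool_t *p, size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (p->used + align - 1) & ~(align - 1);
    if (start > p->size || n > p->size - start)
        return NULL;
    p->used = start + n;
    return p->base + start;
}

/* Release every descriptor at once: there is no per-object free. */
void pool_destroy(pool_t *p) {
    free(p->base);
    p->base = NULL;
    p->size = 0;
    p->used = 0;
}
```

With this scheme, registering thousands of descriptors costs one heap request plus a few additions per descriptor, which is presumably why this step responds so well to the optimization.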
Client startup performance tests
As predicted, there’s no way this test could succeed the way it was originally defined. As a performance compromise, I had to give up on the requirement that all 5000 call descriptors be outdated, as it resulted in a major performance hit (gigantic heap pollution, several minutes for test completion under VirtualBox). This could be optimized to some extent, sure, but is it worth it? Outdated call descriptors appear as a compatibility workaround when clients do not keep up with servers, which shouldn’t be the case for the vital system services that start on boot, assuming that users keep their OS installation as a coherent distribution instead of updating some things without updating others.
When using up-to-date descriptors, the test still takes a long time to complete under Bochs (20s), but not in QEMU (2s) or VirtualBox (instantaneous). It should be noted that Bochs becomes weirdly slow on heavy memory manipulation tasks, to the point where it is no longer representative of real hardware, not even slow real hardware. As an example, Bochs takes half a minute to map 4GB of RAM in x86’s multilevel page tables, whereas even really old real hardware completes this task instantaneously. Therefore, I’ll take the view that the bad performance in Bochs is just a case of its emulation being stupidly slow, and will consider that this stuff is still ready to run on most real hardware.
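To give an idea of how much work that 4GB mapping represents, here is a rough count of the paging structures involved. This is a sketch under assumptions that the text above does not state: 4 KiB pages, the four-level x86-64 layout with 512 entries per table, and a contiguous mapping starting at a suitably aligned address. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE         4096ULL  /* assumed 4 KiB pages */
#define ENTRIES_PER_TABLE 512ULL   /* 64-bit entries in one 4 KiB table */

/* Number of tables needed at one level to hold `entries` entries. */
static uint64_t tables_for(uint64_t entries) {
    return (entries + ENTRIES_PER_TABLE - 1) / ENTRIES_PER_TABLE;
}

/* Total paging structures (PT + PD + PDPT + PML4) needed to map
   `bytes` of contiguous memory. */
uint64_t paging_structures_for(uint64_t bytes) {
    assert(bytes > 0);
    uint64_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    uint64_t pt    = tables_for(pages);  /* page tables */
    uint64_t pd    = tables_for(pt);     /* page directories */
    uint64_t pdpt  = tables_for(pd);     /* page directory pointer tables */
    uint64_t pml4  = tables_for(pdpt);   /* top-level table */
    return pt + pd + pdpt + pml4;
}
```

Under these assumptions, 4GB means 1,048,576 page entries spread over 2,048 page tables, plus 4 page directories, 1 PDPT and 1 PML4: 2,054 tables, or about 8MB of memory to fill in. That is a lot of memory writes, but nothing a real CPU should need a noticeable amount of time for.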
Threaded RPC performance tests
Without much optimization, the code was too fast for the test to be of interest with 1,000 calls, so I’ve taken the liberty of increasing this to 10,000 calls. This took 4s to complete in Bochs, less than 1s in QEMU, and was instantaneous in VirtualBox. What this means is that even a relatively slow computer (as illustrated by Bochs, which fully emulates x86 hardware on the inside) would be able to execute thousands of RPC calls per second. I’d say that’s good enough for a start :)
For future performance improvements, two optimizations may be envisioned. The first is to stop allocating and freeing stacks all the time, when possible, by keeping a pool of dead threads around instead. This is very likely to make its way into the final codebase, but quantifying the performance improvement it brings is difficult. The other would be to avoid going through an intermediary cdecl stack, as the current code does, and fully embrace the AMD64 ABI instead. But as the designers of said ABI chose to maximize performance at the expense of cleanness, that would be difficult and would make the code a mess, so I don’t want to do it unless it’s absolutely necessary.
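The first optimization could look something like the sketch below, which recycles the stacks of dead threads through a simple free list. This is only an illustration: `stack_pool_t`, `stack_acquire`, `stack_release` and the 16 KiB stack size are invented for the example, `malloc` stands in for the kernel’s allocator, and a real version would need locking and a cap on the number of cached stacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_SIZE (16 * 1024)  /* hypothetical per-thread stack size */

/* A cached stack stores the link to the next cached stack in its own
   first bytes, so the pool needs no extra memory. */
typedef struct free_stack {
    struct free_stack *next;
} free_stack_t;

typedef struct {
    free_stack_t *head;          /* stacks of dead threads awaiting reuse */
    size_t        cached;        /* number of stacks currently in the list */
    size_t        fresh_allocs;  /* times we fell back to the allocator */
} stack_pool_t;

/* Get a stack for a new thread, reusing a dead thread's stack if any. */
void *stack_acquire(stack_pool_t *pool) {
    if (pool->head != NULL) {
        free_stack_t *s = pool->head;
        pool->head = s->next;
        pool->cached--;
        return s;
    }
    pool->fresh_allocs++;
    return malloc(STACK_SIZE);
}

/* Called when a thread dies: keep its stack around instead of freeing it. */
void stack_release(stack_pool_t *pool, void *stack) {
    assert(stack != NULL);
    free_stack_t *s = stack;
    s->next = pool->head;
    pool->head = s;
    pool->cached++;
}
```

A zero-initialized `stack_pool_t` is an empty pool. Once a few threads have died, `stack_acquire` becomes a couple of pointer operations instead of a trip through the allocator.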
Asynchronous RPC performance tests
Not having to allocate and free stacks gives the code a nice performance boost, though a smaller one than I would have expected. If we look at the time it takes to emulate 100,000 RPC calls, Bochs needs 40s for threaded code vs 34s for async code. QEMU needs 6s for threaded and 5s for async. VirtualBox continues to execute code near-instantaneously, keeping me amazed by the power of a modern CPU.
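To put these timings in perspective, they can be converted into throughput and relative gain with a bit of arithmetic. The helpers below are hypothetical and only there to spell out the calculation. Keep in mind that the timings are rounded to the second, so the results are rough.

```c
#include <assert.h>

/* Average number of RPC calls per second for a run of `calls` calls
   that lasted `seconds`. */
double calls_per_second(double calls, double seconds) {
    assert(seconds > 0.0);
    return calls / seconds;
}

/* Share of the slower run's time that the faster run saves, in percent. */
double time_saved_percent(double slow_seconds, double fast_seconds) {
    assert(slow_seconds > 0.0);
    return 100.0 * (slow_seconds - fast_seconds) / slow_seconds;
}
```

For Bochs, this gives 2,500 calls per second threaded against about 2,940 async, or 15% of the time saved. For QEMU, it gives about 16,700 calls per second threaded against 20,000 async, or roughly 17% saved.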
When writing these tests, the last two especially, I’ve been wondering about the possibility of creating a code generator for RPC calls. In an IDE, for example, you would right-click the definition of a server-side prototype and easily turn it into a remote call descriptor, along with client-side stuff that you can copy-paste into a library. So far, I see nothing standing in the way, so that’s probably how things will be in the final development tools. If I get to that point, obviously.
The second thing is that, since I find the performance of this RPC interprocess communication primitive very satisfying, implementation will begin soon.