Library-, message-, and RPC-based system services: a comparison

Traditionally, system services have been implemented in two ways, depending on their requirements. Either people wrote shared libraries for “light” work, or they wrote daemons that were contacted through bytestream- or message-based IPC primitives (such as pipes, sockets, or D-Bus). With RPC, I have proposed another IPC method that would be more specifically tailored to the job, but how well would it fare against the old ways at a conceptual level? Well, here is my attempt at a fair comparison.

The comparison criteria

So, what would be a good way of implementing system services? First, it should be fit for most purposes, so as to avoid the schizophrenic separation between libraries and daemons that we have these days, in which most system services end up implemented half in the form of shared libraries and half in the form of privileged daemons (the most obvious example being memory allocation). Second, it should guarantee a good level of reliability. System services shouldn’t crash, and when they do, they should be able to do it quietly, restart automatically, and resume work right away. Third, it should be fast. No one likes laggy operating systems. Fourth, it should be future-proof, because DLL hell and system-breaking updates are evil.

Other criteria which I had considered were security and simplicity. Concerning security, it appears to me that most of the parameters which affect it (user and system code sharing a single address space or communicating through well-controlled entry points, bug likelihood) also affect reliability, so putting both in the same comparison would be somewhat redundant. At this low level, simplicity only affects reliability and performance, so again I am not sure that mentioning it separately would be relevant.

The table

|                                                                                        | Library | Daemons + RPC | Daemons + Messages |
|----------------------------------------------------------------------------------------|---------|---------------|--------------------|
| Universality                                                                           |         |               |                    |
| Services may access privileged system resources that are beyond client reach           | No      | Yes           | Yes                |
| Tasks can be processed with the priority of client code                                | Yes     | Yes           | No                 |
| Tasks can be processed with a service-specific priority                                | No      | Yes           | Yes                |
| Support is bundled in all programming languages and no wrapper code is required        | Yes     | No            | No                 |
| Reliability and security                                                               |         |               |                    |
| Client code is forbidden from messing with service code and vice versa                 | No      | Yes           | Yes                |
| Service bugs only have an indirect impact on client code                               | No      | Yes           | Yes                |
| If the service crashes for one client, it does not necessarily affect other clients    | Yes     | No            | No                 |
| The communication protocol can gracefully handle crashes                               | No      | Yes           | No                 |
| Development complexity (and thus bug likelihood) is…                                   | Low     | Medium        | High               |
| Service code has a separate stack, and thus may not cause client stack overflow        | No      | Yes           | Yes                |
| Performance                                                                            |         |               |                    |
| Function call overhead is…                                                             | Low     | Medium        | High               |
| Services and clients may easily share memory for efficient data exchange               | Yes     | Yes           | No                 |
| Services can be loaded and initialized in advance                                      | No      | Yes           | Yes                |
| When a new client is started, it can reuse running services instead of initializing a new copy | No | Yes        | Yes                |
| Thread safety is unnecessary when processing concurrent requests                       | No      | Yes           | Yes                |
| Future-proofness                                                                       |         |               |                    |
| Service functions can be extended without requiring existing client recompilation      | No      | Yes           | Yes                |
| Given some support code, services can be updated without rebooting the whole system    | No      | Yes           | Yes                |

Conclusion

From this comparison, the reason why system services are traditionally implemented using a mix of libraries and message-driven daemons becomes pretty apparent: they literally complement each other. Does RPC succeed at its task of bridging the gap between the two? I would say yes. The only defects which it inherits from its daemon-based nature are the need for some automatically generated wrapper code, some coding complexity and function call overhead that would make it unsuitable for very simple tasks, and a centralized nature that requires better-than-usual crash management. (It should be noted, however, that libraries have the bad habit of killing their clients in cold blood, without any hope of recovery, when they crash…)

Overall, I’m pretty satisfied with that. But is that all there is to it? If you have feedback on this comparison, or more criteria in mind to add, do not hesitate to talk about it in the comments!

21 thoughts on “Library-, message-, and RPC-based system services: a comparison”

  1. Alfman May 25, 2012 / 6:53 am

    Hi!

    Just stopping by, very interesting research here!
    I actually disagree with some comparisons, but I really wanted to ask about concurrency in the 3 models you’ve listed.

    A shared library can be thread-safe or not. Sometimes safe and unsafe functions are compiled together in the same library even. Many standard unix functions are not multi-thread safe because global variables are reused (consider strerror vs strerror_r or readdir vs readdir_r).
    Do you specify requirements that would avoid this situation in your OS?

    I presume everything about RPC should be thread safe.

    With sockets, if every thread has its own socket then it should be thread-safe as well.
    However, sockets offer an interesting alternative form of parallelism, since it’s possible to transmit hundreds of requests without having to wait for individual responses.

    I personally have a distaste for shared libraries the way they’re often used in Linux. Libresolv, for example (DNS name resolution), should be a system daemon and not code running inside the local process. It may seem like an arbitrary opinion on the surface, but I’ve faced situations where the library was a setback. In an app having many extremely lightweight threads (the minimum 4k stack size), each thread needed to do a DNS resolution before connecting to a server. I got segfaults inside the DNS calls, but none with hardcoded IPs. It turns out libresolv itself needed a 16K stack in each thread. This was not documented anywhere, but nevertheless my use of libresolv included a hidden dependency on stack sizes 300% bigger than what I needed for myself. The good thing is I noticed the problem right away, but what if it had been right on the border? My code could have easily broken after a libresolv update, or on end-user systems with differing implementations of the *same API*. Ideally, that just shouldn’t happen.

    I realize you’ve already ruled out managed languages, but I think the software VM driver isolation we discussed last year would compare rather favorably in these categories.

  2. Alfman May 25, 2012 / 6:58 am

    One more thing.
    Do you plan on supporting Thread Local Storage?

    The most common example of this is “errno”, which looks like a global variable, but each thread gets its own copy.

    I’m bringing it up because it has interesting uses. For example, it allows a shared library to save variables and keep track of resources per thread, using that information in future calls to the library. It seems to me these probably wouldn’t be compatible with the daemon approaches.

  3. Hadrien May 25, 2012 / 8:46 am

    Just stopping by, very interesting research here!

    Thanks :)

    I actually disagree with some comparisons,

    I’m always interested in hearing which, and why!

    but I really wanted to ask about concurrency in the 3 models you’ve listed.

    A shared library can be thread-safe or not. Sometimes safe and unsafe functions are compiled together in the same library even. Many standard unix functions are not multi-thread safe because global variables are reused (consider strerror vs strerror_r or readdir vs readdir_r).
    Do you specify requirements that would avoid this situation in your OS?

    I tend to think that library code should always be thread-safe, so as to let software that uses it freely spawn as many threads as it likes. But in the end, library developers do whatever they want, so I cannot tell user software developers to assume thread-safety from libraries that do not come from me either…

    Maybe a good compromise would be to assume that all libraries are thread-unsafe unless stated otherwise. This could be suggested as a good practice to developers, but another approach would be to implement this rule in software, by putting a “thread-safe” flag in the library binary and automatically surrounding library calls with mutexes when this flag is off.
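    A minimal sketch of that flag-based idea, with invented names (`lib_handle` and `call_library` are not an existing loader API, just an illustration): when the binary’s thread-safe flag is off, every call into the library is routed through one per-library mutex.

    ```c
    #include <pthread.h>
    #include <stdbool.h>

    /* Hypothetical per-library state the loader would keep around. */
    typedef struct {
        bool thread_safe;        /* flag read from the library binary */
        pthread_mutex_t lock;    /* used only when thread_safe is false */
    } lib_handle;

    /* Wrapper the loader would route exported calls through. */
    static int call_library(lib_handle *lib, int (*fn)(int), int arg) {
        if (lib->thread_safe)
            return fn(arg);      /* safe libraries are called directly */
        pthread_mutex_lock(&lib->lock);
        int result = fn(arg);    /* unsafe libraries are serialized */
        pthread_mutex_unlock(&lib->lock);
        return result;
    }
    ```

    The appeal of doing this in the loader is that a thread-unsafe library gets correctness for free, at the price of serializing all of its callers.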

    I presume everything about RPC should be thread safe.

    Well, your presumption would be wrong, or more precisely a bit outdated.

    During the early discussions in which the RPC concept was built, I have had heated arguments with some OSnews members, notably Brendan, concerning which behaviour is best: putting incoming service requests in a queue and popping them from there one at a time, or just creating one thread per incoming request and letting it fly.

    Basically, my conclusions were that the latter works well for requests that can be processed in a parallel (or quasi-parallel) fashion, and may reduce the impact of a frozen thread in a server (which, in the case of an asynchronous queue, blocks everything), but requires more work from service developers (as you say, everything must be thread-safe), and may have an adverse effect on performance or memory usage when used improperly (lots of blocked threads).

    So instead, what I came to propose is to let developers choose. When creating an RPC service, there is a boolean switch that one can flip in order to choose between an “asynchronous/queued” mode (one request is processed at a time) and a “threaded” mode (all requests are processed in parallel).

    The two approaches have a lot in common, so supporting both requires very little extra code. As such, I tend to believe that this is the best option in the end.
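    To make the switch concrete, here is a hedged sketch of such a per-service dispatcher (all names, `rpc_service` and `dispatch` included, are invented for illustration and are not the actual TOSP API): the same handler is either serialized behind a queue lock or spawned in its own thread.

    ```c
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    typedef void (*rpc_handler)(void *request);

    typedef struct {
        rpc_handler handler;
        bool threaded;               /* the proposed boolean switch */
        pthread_mutex_t queue_lock;  /* serializes the queued mode */
    } rpc_service;

    struct call { rpc_service *svc; void *request; };

    static void *thread_entry(void *arg) {
        struct call *c = arg;
        c->svc->handler(c->request);
        free(c);
        return NULL;
    }

    static void dispatch(rpc_service *svc, void *request) {
        if (svc->threaded) {
            /* "threaded" mode: one thread per incoming request */
            struct call *c = malloc(sizeof *c);
            c->svc = svc;
            c->request = request;
            pthread_t t;
            pthread_create(&t, NULL, thread_entry, c);
            pthread_detach(t);
        } else {
            /* "asynchronous/queued" mode: one request at a time */
            pthread_mutex_lock(&svc->queue_lock);
            svc->handler(request);
            pthread_mutex_unlock(&svc->queue_lock);
        }
    }
    ```

    Since both branches funnel into the same handler signature, the extra code for supporting both modes really is just the `if` above plus the thread-spawning boilerplate.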

    I would gladly write more, but I have to go and do some mad science in my lab.

  4. Hadrien May 25, 2012 / 7:42 pm

    With sockets, if every thread has its own socket then it should be thread-safe as well.
    However, sockets offer an interesting alternative form of parallelism, since it’s possible to transmit hundreds of requests without having to wait for individual responses.

    Well, as far as I can tell, pretty much every modern IPC method allows data to be sent asynchronously, so I’m not sure if sockets have that big of an advantage here… Anyway, I think I’m going to add something about concurrency in the “performance” section later.

    I personally have a distaste for shared libraries the way they’re often used in Linux. Libresolv, for example (DNS name resolution), should be a system daemon and not code running inside the local process. It may seem like an arbitrary opinion on the surface, but I’ve faced situations where the library was a setback. In an app having many extremely lightweight threads (the minimum 4k stack size), each thread needed to do a DNS resolution before connecting to a server. I got segfaults inside the DNS calls, but none with hardcoded IPs. It turns out libresolv itself needed a 16K stack in each thread. This was not documented anywhere, but nevertheless my use of libresolv included a hidden dependency on stack sizes 300% bigger than what I needed for myself. The good thing is I noticed the problem right away, but what if it had been right on the border? My code could have easily broken after a libresolv update, or on end-user systems with differing implementations of the *same API*. Ideally, that just shouldn’t happen.

    I hate library-based service implementations too. They crash your software, are a major OS exploit vector, add weight to software packages… In my dream world, we would only use them for wrappers, and very simple functionality for which multiple processes are overkill (e.g. math functions).

    I realize you’ve already ruled out managed languages, but I think the software VM driver isolation we discussed last year would compare rather favorably in these categories.

    We have already been discussing this many times, so as a sort of birthday post I am going to try and add something new to this debate between us.

    In theory, VMs can do lots of things that are beyond the reach of native code. They can be used to write platform-agnostic code, make old software use the performance tricks of the latest CPU models, improve security by implementing higher-level security checks, reduce context switching overhead, make reverse-engineering work harder, and much more.

    The problem is that when it comes to actually implementing a VM, developers have to make design choices. High-level vs low-level, multi-platform vs single-platform, JIT vs interpreted code… Each of these choices will have a decisive impact on the characteristics of the resulting VM: performance, security, versatility, coding simplicity (and thus bugs), and so on.

    Due to this, it sounds like cheating to me to invoke the advantages of “VMs” as a whole. You at least have to be more specific about the characteristics of the VM that you are discussing, and a proof-of-concept implementation would be a welcome asset.

    All that notwithstanding, I am not sure that libraries would cease to exist in a VM world. Processes living in separate address spaces still require different coding practices than processes which share code, stack and data. And although VMs can reduce the cost of context switching, they cannot completely destroy it without voiding system security (or at least requiring a completely different approach than the sandboxed process one).

    In the end, though, you are right that I can’t take this debate very far anyway, since in my OS I have decided to reinvent so many parts of the computing world already that also reinventing low-level programming would not be a reasonable option :)

  5. Hadrien May 25, 2012 / 8:14 pm

    Edited the comparison to add one bit about thread safety and one about stack sharing!

    One more thing.
    Do you plan on supporting Thread Local Storage?

    The most common example of this is “errno”, which looks like a global variable, but each thread gets its own copy.

    I’m bringing it up because it has interesting uses. For example, it allows a shared library to save variables and keep track of resources per thread, using that information in future calls to the library. It seems to me these probably wouldn’t be compatible with the daemon approaches.

    I have heard of Thread Local Storage in the past, but I’m still not sure I get what the point is.

    As far as I can tell, two existing programming abstractions are already suitable for the jobs that TLS is designed for : stack objects (abstracted as local variables by most languages) as a light abstraction for storing a state during the lifetime of a thread, and heap objects as a heavier abstraction to store large amounts of data or keep track of a state between service runs.

    Why is there a need for more?

  6. Alfman May 26, 2012 / 7:21 am

    “I’m always interested in hearing which, and why!”

    Ok, just a few quick comments:

    “Services can be loaded and initialized in advance”
    Shared libraries might already be loaded and memory-mapped prior to process execution, so arguably such services are already loaded before the process runs.

    “Service functions can be extended without requiring existing client recompilation”
    I might not be understanding you precisely. However, oftentimes library conventions use namespaces that allow function overloading (by appending the number of parameters, and even the parameter types, to the function name), and can therefore be “extended”. An existing client could continue to use the old function prototype.

    What about “service code is forbidden from messing with client code”?
    A caller might not want a service to have access to anything other than the I/O parameters involved in the function call.

    “Well, as far as I can tell, pretty much every modern IPC method allows data to be sent asynchronously, so I’m not sure if sockets have that big of an advantage here… ”

    My thought with sockets was that it’d often be trivial to submit one composite packet containing multiple back-to-back requests, without needing to send each of them individually. It’s a bit ironic for an asynchronous interface to invoke calls sequentially instead of in parallel. I suppose it’s theoretically possible for an RPC mechanism to bundle requests and “commit” them in one shot, but I haven’t seen it before.

    “We have already been discussing this many times, so as a sort of birthday post I am going to try and add something new to this debate between us.”

    It’s not my goal to drag you away from more important things, I’m sure there’s plenty of fun things to talk about that have a more direct impact on your project.

    “All that notwithstanding, I am not sure that libraries would cease to exist in a VM world. Processes living in separate address spaces still require different coding practices than processes which share code, stack and data. And although VMs can reduce the cost of context switching, they cannot completely destroy it without voiding system security…”

    Ah, I think you’ve forgotten the zero-syscall isolation model that had been proposed, where completely separate “processes” would run in the same address space and remain oblivious of one another on account of not having any references to each other’s objects. Protection would be achieved via managed-language semantics instead of CPU memory barriers. Objects could be reassigned as they are passed via RPC. Security would be enforced by the VM, which doesn’t permit impermissible code to be generated in the first place. I think there’s very little difference between a “library” and a “daemon” in the VM-enforced isolation model.

    “Why is there a need for more ?”

    Have you given thought to how “errno” works in multithreaded programs?

    Thread 1:
    xyz(); // sets errno to something

    Thread 2:
    abc(); // sets errno to something

    Both threads have a different copy of errno, which very much resembles a global, except it’s per thread. I tend to dislike globals in principle, and TLS is very similar; nevertheless, I have found uses for TLS in special cases.

    Consider an object factory (like malloc/free). Where do they keep track of their lists of objects? In global variables, of course. Well, MT versions need to be thread-safe, which means mutexes and/or spinlocks, and these become a costly bottleneck when many threads are simultaneously accessing the factory. So what if each thread could have its own local allocation pool? That pool could be accessed instantaneously, without synchronization! But how does a call to malloc/free get a pointer to the local pool for the running thread? One might use the pthread calls for storing information specific to the thread, but it turns out they’re slow dictionary lookups, and not good in functions we prefer to inline. The answer is thread-local storage!

    This turns out to be extremely fast since TLS just uses segment registers that are specific to each thread.

    Thread 1:
    mov EBX, es:[mylocalpool]

    Thread 2:
    mov EBX, es:[mylocalpool]

    Thread 3:
    mov EBX, es:[mylocalpool]

  7. Hadrien May 26, 2012 / 6:45 pm

    “Services can be loaded and initialized in advance”
    Shared libraries might already be loaded and memory-mapped prior to process execution, so arguably such services are already loaded before the process runs.

    I had anticipated this objection.

    First, I’d argue that library images may be pre-fetched from the hard drive, like any other file for that matter, but not fully loaded, since library loading involves some dynamic linking steps that can only be performed during a client process’s startup, at the point where the position of the library in that process’s virtual address space is effectively decided.

    You may consider this to be nitpicking, though, since on modern computers, fetching the library binary image from disk is already doing a fair share of the library loading work. And I can concede that.

    Which leads me to my second point: since pre-fetched libraries are not fully loaded and thus cannot run code, it would be impossible, unless I’m mistaken, to initialize a library before the point where a client process which uses it is run. As such, client startup time actually takes a hit due to library initialization time, which seems unfair to me. What’s more, this is the case for every client, not just the first one that uses the system service, as would be the case with a daemon that is started on demand.

    “Service functions can be extended without requiring existing client recompilation”
    I might not be understanding you precisely. However, oftentimes library conventions use namespaces that allow function overloading (by appending the number of parameters, and even the parameter types, to the function name), and can therefore be “extended”. An existing client could continue to use the old function prototype.

    Here is an example of why I don’t feel that this is satisfying yet.

    Let’s say that I am writing a simple monochromatic graphics library, in which I have a function which draws black lines. Its original syntax would be something along the lines of…

    DrawLine(SourcePoint, DestPoint)

    Then, later, I decide to add support for drawing lines of different colors. So I would now like to extend the prototype of DrawLine to…

    DrawLine(SourcePoint, DestPoint, Color = Black)

    Now, notice the “Color = Black” default value. This new library function is source-compatible with older clients without changes, because it provides a sane default value for the “Color” parameter. Yet with current library conventions, in which it is the job of client code to push every single required function parameter on the stack, the new prototype is not binary-compatible with older clients, and requires either a recompile or the addition of an ugly kludge of the form…

    DrawLine(SourcePoint, DestPoint) -> DrawLine(SourcePoint, DestPoint, Black)

    On the contrary, daemons which either use their own homemade function call semantics, or the RPC system that I propose, can tolerate clients which send an insufficient number of parameters, by automatically appending default values at the end of the parameter stack. This is, in my opinion, a superior approach, since it does not require an ever-expanding set of compatibility kludges.
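    A sketch of how such default-padding could look on the callee side (the names `rpc_function`, `rpc_call`, and the flat `int` parameter representation are invented for illustration; a real marshaling layer would carry typed parameters): the service descriptor records the current arity and the defaults, and short parameter lists from old clients are padded before the call.

    ```c
    #include <stddef.h>

    enum { BLACK = 0, RED = 1 };

    typedef struct {
        size_t expected;       /* parameters the current service version takes */
        const int *defaults;   /* default values, indexed by parameter position */
        int (*impl)(const int *params);
    } rpc_function;

    /* Pads a short parameter list with defaults before invoking the service.
       This sketch assumes at most 8 parameters. */
    static int rpc_call(const rpc_function *fn, const int *sent, size_t nsent) {
        int params[8];
        for (size_t i = 0; i < fn->expected; i++)
            params[i] = (i < nsent) ? sent[i] : fn->defaults[i];
        return fn->impl(params);
    }
    ```

    With this scheme, extending DrawLine from two parameters to three only changes the descriptor; binaries that still marshal two parameters keep working unmodified.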

    What about “service code is forbidden from messing with client code”?
    A caller might not want a service to have access to anything other than the I/O parameters involved in the function call.

    Actually, the original version was something along the lines of “client code is forbidden from messing with service code, and vice versa”. Then I thought that since system services are generally trusted not to buffer-overflow user processes or do other nasty things like that, readers might not immediately feel concerned about the issue of service code messing with client code. So I added another table line containing one specific example of how such a situation can go wrong: client crashes due to a service bug.

  8. Hadrien May 26, 2012 / 7:29 pm

    “Well, as far as I can tell, pretty much every modern IPC method allows data to be sent asynchronously, so I’m not sure if sockets have that big of an advantage here… ”

    My thought with sockets was that it’d often be trivial to submit one composite packet containing multiple back-to-back requests, without needing to send each of them individually. It’s a bit ironic for an asynchronous interface to invoke calls sequentially instead of in parallel. I suppose it’s theoretically possible for an RPC mechanism to bundle requests and “commit” them in one shot, but I haven’t seen it before.

    It would be possible to do that with an RPC mechanism, but I do not plan to do it in version 1 of the implementation, for two reasons:

    1/ It is a performance optimization more than a vital feature, and I can add support for it later without breaking existing software. So I first want to see whether there is actually a bottleneck to optimize there that would be worth the extra OS complexity.

    2/ When a service is often subject to such “bulk requests”, it often makes more sense to provide a special function for that at the service level, instead of having clients spam it with a large number of small requests. Because although individual request overhead can be reduced, I don’t think that it can be completely voided.

    “All that notwithstanding, I am not sure that libraries would cease to exist in a VM world. Processes living in separate address spaces still require different coding practices than processes which share code, stack and data. And although VMs can reduce the cost of context switching, they cannot completely destroy it without voiding system security…”

    Ah, I think you’ve forgotten the zero-syscall isolation model that had been proposed, where completely separate “processes” would run in the same address space and remain oblivious of one another on account of not having any references to each other’s objects. Protection would be achieved via managed-language semantics instead of CPU memory barriers. Objects could be reassigned as they are passed via RPC. Security would be enforced by the VM, which doesn’t permit impermissible code to be generated in the first place. I think there’s very little difference between a “library” and a “daemon” in the VM-enforced isolation model.

    A VM can replace separate address spaces for process isolation, but it does not, in itself, void the need for, and computational cost of, separate processes. Maybe it’s best that I explain first what I call a process, though, because over the course of this project, I have realized that there is more than one definition of those.

    For me, a process is something that is conceptually close to a security sandbox. It is a restricted environment in which code can run, independently from other code and within OS-set boundaries. When system services have to answer the “should we obey this request?” question, they do it by answering the “can this process do this?” one. Processes can naturally be confused with programs, since there is generally exactly one running program per process, but a program is just code and data, whereas a process is an isolated box in which code and data are allowed to run, and whose OS-defined boundaries define what said code can do.

    In a VM-based OS, I would still use processes as the elementary protection unit. It would just happen that memory isolation between processes, avoiding collisions between unrelated programs, would be done in software, not in hardware. But there would still be such a thing as context switching, in which I would switch from one box of running code to another, changing the VM protection rules from those of the old process to those of the new one. And this has a cost. Daemons (separate software) pay it; libraries (code added to the software) do not.

    Then, of course, I am not saying that processes are the only possible approach to OS security. But I believe that few competing security models have managed to build something as elegant as the idea of unrelated programs each running in independent boxes, tailored for their needs, and only communicating with each other and the outside world in an OS-controlled way.

    (I will reply to the part about TLS in a while)

  9. Alfman May 27, 2012 / 12:42 am

    Very quick response:

    “at the point where the position of the library in that process’ virtual address space is effectively decided.”

    Actually, I believe library addresses on Linux are system-wide and not per-process (not 100% sure if this is always the case). However, obviously the client’s own code needs to be relocated to point to the library’s address, so your rebuttal is still valid :-)

    “2/When a service is often subject to such ‘bulk requests’, it often makes more sense to provide a special function for that at the service level, instead of having clients spam it with a large amount of small requests.”

    Yes, you’re right, but technically I still think it’s an advantage for sockets, since they can handle the case more naturally. It’s true you can rewrite any RPC functions to take a message block full of concatenated requests, but I kind of feel that’s cheating, since it now means your RPC function becomes a wrapper for a message-passing API rather than something I’d consider a strict RPC API.

    For example, linux supports “read” and “write” between processes, yet I’d be reluctant to call those “RPC”. Like you though, I’m just nitpicking…

    “So I added another table line containing one specific examples of how such a situation can go wrong : client crashes due to a service bug.”

    Well, I don’t consider a “bug” the same thing as a daemon escalating its privilege within the client on purpose. If I link in a Facebook API as a C library, it could easily turn my process into a trojan. If I call a Facebook service API, then it wouldn’t be able to escalate its privilege within my process, and as long as I check return values I can use the API without giving it privileged access to my process. In the end this distinction doesn’t really affect your conclusion anyway, so my point is pretty moot.

    Will respond to the next post later.

  10. Alfman May 27, 2012 / 4:46 am

    “In a VM-based OS, I would still use processes as the elementary protection unit…would be done in software, not in hardware.”

    Correct.

    “But there would still be such a thing as context switching, in which I would switch from one box of running code to another, changing the VM protection rules from those of the old process to those of the new one.”

    Well, kind of… however, in the implementation I was thinking of, a “context switch” may be implicit within the JIT code that the CPU is running, instead of actually involving a change of state in CPU registers or memory, which involves overhead. The JIT compiler could enforce that process A code can only call other process A functions, and only access process A globals; the same goes for process B. For RPC calls between them, the JIT compiler could look up A’s permission to call a specific process B function. Security is achieved by making sure code that escalates a process’s security is never generated; assuming we’re using a managed language to start with, that shouldn’t be too hard to do. This way, run-time RPC security checks are never needed. There’d be zero overhead, since the security checks were already enforced by the JIT compiler that generated the code.

    Assuming you can accept the above paragraph, the conclusion is that an RPC call from process A to process B might be identical to the equivalent library call entirely within process A – the two cases wouldn’t require any differences at run time. Well, there might be if the JIT injects instructions for process accounting, such as tracking process time, but I don’t think any context changes are necessary for security.

    If you aren’t convinced, can you illustrate a case where this idea would fail?

    “And this has a cost. Daemons (separate software) pay it; libraries (code added to the software) do not.”

    Did the above explanation change your mind?

  11. Hadrien May 28, 2012 / 5:33 am

    Have you given thought to how “errno” works in multithreaded programs?

    Thread 1:
    xyz(); // sets errno to something

    Thread 2:
    abc(); // sets errno to something

    Both threads have a different copy of errno, which very much resembles a global, except it’s per thread. I tend to dislike globals in principle, and TLS is very similar; nevertheless, I have found uses for TLS in special cases.

    Isn’t this, as you imply yourself, about patching up the flaws of global variables without trying to understand what’s wrong with them? I mean, this behaviour of errno messes with the core mechanics of C-style languages by introducing an object which looks exactly like a global variable to non-knowledgeable developers, but is not shared by all threads of a program like a global would be. Shouldn’t this ring an alarm bell in our heads?

    I will give these “special cases” a chance, though :)

    Consider an object factory (like malloc/free). Where do they keep track of their lists of objects? In global variables, of course. Well, MT versions need to be thread safe, which means mutexes and/or spinlocks, which become a costly bottleneck when many threads are simultaneously accessing the factory. So what if each thread could have its own local allocation pool? That pool could be accessed instantaneously, without synchronization! But how does a call to malloc/free get a pointer to the local pool for the running thread? One might turn to the pthread calls for storing information specific to the thread, but it turns out they’re slow dictionary lookups, and not good in functions we prefer to inline. The answer is thread local storage!
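The per-thread pool described above could be sketched roughly like this. It is only an illustration of the idea, not a real allocator: LocalPool and its members are hypothetical names, and a real implementation would also track block sizes instead of assuming they are uniform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy per-thread free list: each thread recycles its own blocks, so the
// hot path needs no mutex at all. Assumes all blocks have the same size.
struct LocalPool {
    std::vector<void*> free_blocks;  // blocks released by this thread

    void* allocate(std::size_t size) {
        if (!free_blocks.empty()) {       // fast path: no synchronization
            void* p = free_blocks.back();
            free_blocks.pop_back();
            return p;
        }
        return ::operator new(size);      // slow path: hit the global heap
    }

    void release(void* p) { free_blocks.push_back(p); }
};

// One pool per thread, located through TLS rather than a
// pthread_getspecific-style dictionary lookup.
thread_local LocalPool local_pool;
```

The point of the sketch is the fast path: once a thread has released a block, it can get it back without touching any shared state.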

    Now, this is a much better use case for this feature ! But then, I have to wonder : what happens to the local allocation pool when the thread is terminated ? Will it just hang there forever as a sad example of a memory leak ? Or has some sort of “thread-local dynamic memory allocation”, whose heap is automatically freed when the thread is discarded, been used here ?

    Another thing which puzzles me is that I believe that the pthread call that you mention (and the equivalent TlsAlloc functionality on Windows) are considered to be thread-local storage, so I am not sure if I understand what you consider “true” thread-local storage.

    This turns out to be extremely fast since TLS just uses segment registers that are specific to each thread.

    Thread 1:
    mov EBX, es:[mylocalpool]

    Thread 2:
    mov EBX, es:[mylocalpool]

    Thread 3:
    mov EBX, es:[mylocalpool]

    I doubt that it works like this on all CPU architectures, though, considering that x86_64 pretty much gets rid of segmentation (it assumes identity-mapped flat segments in most cases, with few exceptions), and other architectures like ARM or POWER have never had segmentation functionality to begin with.

    So I have to guess that most modern TLS implementations just use lookup tables, as Wikipedia suggests Windows is doing.

    The principle, if I get it right, is as follows : you provide each thread with an array (that can dynamically grow through usual techniques) of pointer-sized integers. If all that a thread needs to store is an integer, it can be stored directly in the table. Anything bigger will be dynamically allocated, with its address stored in the array. This is actually much more efficient than a dictionary lookup.
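For concreteness, here is a toy model of that slot-based scheme, in the spirit of Windows’ TlsAlloc/TlsSetValue (the function names and the use of a thread_local vector are my own illustration, not how any real runtime implements it): a key is just an index into a per-thread array of pointer-sized slots, so lookup is O(1) rather than a dictionary search.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Process-wide counter handing out slot indices.
static std::atomic<std::size_t> next_key{0};

// Each thread owns its own growable array of pointer-sized slots.
thread_local std::vector<std::uintptr_t> tls_slots;

std::size_t tls_alloc() { return next_key.fetch_add(1); }

void tls_set(std::size_t key, std::uintptr_t value) {
    if (key >= tls_slots.size()) tls_slots.resize(key + 1, 0);  // grow on demand
    tls_slots[key] = value;
}

std::uintptr_t tls_get(std::size_t key) {
    return key < tls_slots.size() ? tls_slots[key] : 0;  // unset slots read as 0
}
```

An integer fits directly in a slot; anything bigger would be heap-allocated with its address stored in the slot, exactly as described above.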

  12. Alfman May 28, 2012 / 7:08 am

    “this behaviour of errno is messing with the core mechanics of C-style languages…”

    Perhaps you are right, I see two factors involved.
    1. C doesn’t offer a “nice” way to return multiple values, hence the original motive for errno.
    2. In order to evolve threads, *nix needed this kludge to stay compatible.

    Interestingly I think the faulty design pattern can be exposed without using globals at all:

    Thread1: x = obj.action(); // sets a class variable obj.errno
    Thread2: x = obj.action(); // sets a class variable obj.errno

    “what happens to the local allocation pool when the thread is terminated ?…Or has some sort of ‘thread-local dynamic memory allocation’, whose heap is automatically freed when the thread is discarded, been used here ?”

    Pthread offers a way to hook into the termination sequence and clean up the thread local pool.

    My own implementation of this actually has both global pools and thread pools. When there are too many objects in the local pool, they’re bumped into the global one in multiples of an efficiency factor. When the local pool is depleted, it is repopulated from the global pool, or by expanding the heap. This design explicitly supports one thread consuming all the objects and passing them to another thread to be freed.

    “so I am not sure if I understand what you consider ‘true’ thread-local storage.”

    Well, the pthread/windows functions are dictionaries built at run time, while storage for variables with the TLS attribute is determined at compile time (like globals) so that they have a static offset into the TLS area (it’s a bit more complicated due to dynamic linking…)

    “I doubt that it works like this on all CPU architectures, though, considering that x86_64 pretty much gets rid of segmentation”

    I suspect architectures with 32+ registers reserve one for TLS base offset. And although I’ve never done assembly on it, I think x86_64 still uses fs/gs.

    http://www.pagetable.com/?p=25
    “The CS (code), DS (data 1), ES (data 2) and SS (stack) segment registers are practically gone, and the FS and GS segments still support a base (which can be used in tricks to quickly access data at a constant position, like the TCB), but the limit is no longer enforced.”

    “So I have to guess that most modern TLS implementations just use lookup tables, as Wikipedia suggests Windows is doing”

    The Wikipedia article is wrong, or at least neglects to mention C/C++ TLS mechanisms in the Microsoft Windows section.

    Here’s more information about using TLS in general.

    http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=346

    I guess this discussion, while tangential to the main topic, could still prove important if you intend to support TLS, which brings up my original question: do you intend to support it? :-)

  13. Hadrien May 28, 2012 / 4:04 pm

    Yes, you’re right, but technically I still think it’s an advantage for sockets, since they can handle the case more naturally. It’s true you can rewrite any RPC functions to take a message block full of concatenated requests, but I kinda feel that’s cheating, since it now means your RPC function becomes a wrapper for a message-passing API rather than something I’d consider a strict RPC API.

    For example, Linux supports “read” and “write” between processes, yet I’d be reluctant to call those “RPC”. Like you, though, I’m just nitpicking…

    True, that, and in the case of heterogeneous requests the “bulk request” model also breaks down.

    Again, it’s possible to do it, but I think that’s the kind of advanced functionality that I should add after I get something that works and can be tested for performance, not in a first release.

    If I would do it, I guess I’d use some kind of batch system to allow RPC requests to be put on hold until an order is given to send them all at once. Given some syntactic sugar from the wrapper library, it could look like this in everyday use…

    start_rpc_batch()
    rpc_call1()
    rpc_call2()
    rpc_call3()
    send_rpc_batch()

    Then of course, being able to make it look nice does not mean that it would be easy to implement. As a trivial example, start_rpc_batch() and send_rpc_batch() would have to correctly manage the case where several threads try to make batches at the same time, among many, many other issues. Globally, I think that request batches would be quite a bit more difficult to handle than isolated requests, which is one more reason why I prefer to wait until I have working isolated-request code at hand AND face a specific problem that requires batched requests before attempting to code this.
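One cheap way to sidestep the multi-thread interleaving problem mentioned above is to make the pending batch itself thread-local, so that each thread builds its own batch independently. A minimal sketch (start_rpc_batch/rpc_call/send_rpc_batch are the hypothetical wrapper-library names from the example, and real requests would of course be serialized messages rather than std::function callbacks):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Per-thread batch state: no locking needed, since no other thread
// can see this thread's pending requests.
thread_local std::vector<std::function<void()>> rpc_batch;
thread_local bool batching = false;

void start_rpc_batch() { batching = true; }

// Stand-in for an individual RPC call: queued while a batch is open,
// dispatched immediately otherwise.
void rpc_call(std::function<void()> request) {
    if (batching) rpc_batch.push_back(std::move(request));
    else request();
}

void send_rpc_batch() {
    for (auto& request : rpc_batch) request();  // one flush for the whole batch
    rpc_batch.clear();
    batching = false;
}
```

Between start_rpc_batch() and send_rpc_batch(), nothing is actually sent; the flush dispatches everything in order, in one go.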

    Well, I don’t consider a “bug” the same thing as a daemon escalating its privilege within the client on purpose. If I link in a facebook API as a C library, it could easily turn my process into a trojan. If I call a facebook service API, then it wouldn’t be able to escalate its privilege within my process, and as long as I check return values, I can use the API without giving it privileged access to my process. In the end this distinction doesn’t really affect your conclusion anyway, so my point is pretty moot.

    Then I’ll get the “vice-versa” part back…

  14. Hadrien May 28, 2012 / 4:50 pm

    Regarding VMs and context switches :

    If, like me, you use processes as sandboxing units, then you must agree that at the point where the VM will switch from compiling and executing A’s bytecode to compiling and executing B’s bytecode, its state will have to change, just like the state of the kernel and the CPU is changed during a traditional context switch.

    The VM will need to stop gathering its process-specific data from the area of RAM that contains A’s security permissions, JIT cache, PID, etc…, and start gathering those from the area of RAM that contains B’s process-specific data. Since B has not executed for a long time, it is unlikely that its data is still cached by the processor. As such, all this data will need to be fetched from RAM.

    This phenomenon has a similar computational cost to the TLB flushing that occurs in hardware-based context switching, which is one of the most costly parts of a context switch. Adding to the mix the fact that VMs do not do things as efficiently as native code (because they cannot just let code run, and have to regularly pay CPU cycles to monitor and follow its flow), I would not take it for granted that VM-based context switching would have a negligible cost as compared to hardware-based context switching.

    But again, without specifics of how the VM that we are talking about works, it is hard to make even qualitative performance discussions.

  15. Alfman May 28, 2012 / 7:17 pm

    “Again, it’s possible to do it, but I think that’s the kind of advanced functionality that I should add after I get something that works and can be tested for performance, not in a first release.”

    ‘nough said!

    “Regarding VMs and context switches :”
    “you must agree that at the point where the VM will switch from compiling and executing A’s bytecode to compiling and executing B’s bytecode, its state will have to change, just like the state of the kernel and the CPU is changed during a traditional context switch.”

    Just to keep this discussion simpler, I’d like to jump past the complexity of the JIT compiling phase and assume everything’s already been compiled. In this case we’re just executing, and the security constraints should have been validated at compile time.

    “The VM will need to stop gathering its process-specific data from the area of RAM that contains A’s security permissions, JIT cache, PID, etc…, and start gathering those from the area of RAM that contains B’s process-specific data.”

    I agree that process accounting will have overhead. But I’d still like to see a scenario in which a managed process is inherently able to escape its sandbox if we don’t apply a (security) context switch. What lines of code would it execute?

    1. The process would not have access to privileged CPU instructions. Y/N?

    2. The managed language doesn’t have to provide a mechanism to produce arbitrary userland instructions which would allow the process to tamper with memory that it’s not supposed to access. Y/N?

    3. All function calls are pre-vetted by the JIT, so it couldn’t call unauthorized functions. Y/N?

    4. As long as a procedure’s compiled code (and that of its dependencies) doesn’t change, all the security checks that passed at the time of JIT compilation will remain valid for every invocation of the procedure. Y/N?

    What’s left that a process can do to escalate its privileges and do something unauthorized? I’m not denying that context information can be extremely helpful for OS process management, but I don’t see why it’d be necessary to enforce security boundaries in principle, given this model.

    This discussion has got me thinking: would it be possible to keep track of process accounting without a context switch? I think on a probabilistic level, the answer is yes. The OS could sample processes at random intervals. It could derive a lot of runtime information just by looking at the instruction pointer and stack traces. Probabilistic sampling could open up sample-timing vulnerabilities, but it’s an interesting idea, no?

  16. Hadrien May 28, 2012 / 8:11 pm

    Perhaps you are right, I see two factors involved.
    1. C doesn’t offer a “nice” way to return multiple values, hence the original motive for errno.
    2. In order to evolve threads, *nix needed this kludge to stay compatible.

    The worst part is that C++ did offer several ways to report errors and status information without using global variables.

    Stroustrup was a strong supporter of exceptions, so he kind of favored those in the language’s design, but apart from that, one could also imagine returning status information along with a result in a more traditional way, by use of some very simple templating.

    template <typename ValueType> class ReturnValue {
        private:
            ValueType returned_value;
        public:
            int return_status;
            const ValueType* operator->(void) const { return &returned_value; }
            const ValueType& operator*(void) const { return returned_value; }
    };

    This works exactly like a pointer to the specified value type, except that you also have access to function status information by looking up the return_status integer (which can also be something fancier than an integer if you need it).
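A compilable version of that template might look as follows; the constructor, the read_config() function and its "max_threads=4" payload are additions of mine purely so the example can be instantiated and exercised, not part of the original sketch.

```cpp
#include <cassert>
#include <string>
#include <utility>

template <typename ValueType> class ReturnValue {
    private:
        ValueType returned_value;
    public:
        int return_status;
        ReturnValue(ValueType v, int status)
            : returned_value(std::move(v)), return_status(status) {}
        // Pointer-like access to the wrapped value.
        const ValueType* operator->(void) const { return &returned_value; }
        const ValueType& operator*(void) const { return returned_value; }
};

// Hypothetical function returning both a value and a status, no errno involved.
ReturnValue<std::string> read_config() {
    return ReturnValue<std::string>("max_threads=4", 0 /* success */);
}
```

The caller checks return_status, then dereferences the result like a pointer; nothing is shared between threads, so the errno problem never arises.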

    Well, the pthread/windows functions are dictionaries built at run time, while storage for variables with the TLS attribute is determined at compile time (like globals) so that they have a static offset into the TLS area (it’s a bit more complicated due to dynamic linking…)

    Oh, I see. But as far as OS support is concerned, one can simply give each thread a page (or a dynamically growable heap if we want to get fancy) of “TLS storage space”, have a pointer to this memory block accessible through a syscall or RPC command, and let threaded code deal with it using as much syntactic sugar as it wants, right ?

    I guess this discussion, while tangential to the main topic, could still prove important if you intend to support TLS, which brings up my original question: do you intend to support it? :-)

    As a basic protection against bloat, my basic answer to every question of the form “Do you plan to support feature x ?” is “Is it useful, and is it worth the development cost ?”.

    So far, your malloc example has demonstrated how TLS can be useful in massively parallel applications, such as supercomputer programs, in which a process-wide malloc mutex would indeed be a performance bottleneck. The errno example also suggests that I will be more or less forced to implement it if I want to easily port existing software, due to the large amount of software out there that uses a UNIX-like development methodology. So the only thing that I’m a bit sad about is that I would really have liked an example of how it can be useful in the smaller-scale computers that I’m interested in : desktops, laptops, tablets and cousins.

    But since it sounds relatively easy to implement, and since I am forced to deal with it in order to port existing software and accommodate its developers, I guess I’ll have to add it anyway.

    P.S : Just taking a breath to say, thanks for passing by ! It’s great to have some lively and fruitful technical discussions in the comments of this blog, which usually feels like me screaming on a desert cliff in order to sort out my OSdeving ideas.

  17. Hadrien May 28, 2012 / 9:45 pm

    Just to keep this discussion simpler, I’d like to jump past the complexity of the JIT compiling phase and assume everything’s already been compiled. In this case we’re just executing, and the security constraints should have been validated at compile time.

    Fine, although in such a case, VM-enforced security constraints would have to be kept simple and low-level. Kind of like the “software doesn’t touch the memory of other software without permission” one that is usually managed in hardware.

    I agree that process accounting will have overhead. But I’d still like to see a scenario in which a managed process is inherently able to escape its sandbox if we don’t apply a (security) context switch. What lines of code would it execute?

    1. The process would not have access to privileged CPU instructions. Y/N?

    2. The managed language doesn’t have to provide a mechanism to produce arbitrary userland instructions which would allow the process to tamper with memory that it’s not supposed to access. Y/N?

    3. All function calls are pre-vetted by the JIT, so it couldn’t call unauthorized functions. Y/N?

    4. As long as a procedure’s compiled code (and that of its dependencies) doesn’t change, all the security checks that passed at the time of JIT compilation will remain valid for every invocation of the procedure. Y/N?

    What’s left that a process can do to escalate its privileges and do something unauthorized? I’m not denying that context information can be extremely helpful for OS process management, but I don’t see why it’d be necessary to enforce security boundaries in principle, given this model.

    1/It is fairly easy and not very damaging to enforce N, I agree.

    2/Enforcing N is possible, but a much tougher choice. No compiler is perfect, and as of today, many highly-optimized programs (such as 3D rendering engines, video codecs, game engines) have to resort to hand-written assembly and manual pointer manipulation in order to reach sufficient performance. I am curious about how VMs would deal with this situation : would they try to analyze what the ASM code is doing ahead of time in order to determine if it is harmless without harming its runtime performance, follow what it’s doing in real time at the cost of a potentially serious performance hit, or completely forbid hand-optimized code altogether ?

    Going a bit beyond pure VM considerations, I believe that this problem arises even on non-VM OSs when one brings GPUs to the mix, since high-performance GPU-accelerated software often uses some sort of shader scripting language (GLSL, HLSL), which is compiled and sent to the GPU by its driver. In this situation, how would the OS control what happens there, and make sure that programs do not use shader programs to do nasty things ? (I’ve seen Microsoft claim that they have found a solution since Windows Vista and WDDM, but no detailed explanation so far)

    3/Same as 2, I guess : as soon as you can have some hand-optimized assembly around, how do you deal with code that engineers a home-made function pointer and attempts to call it ?

    4/Yes, so long as one of these checks is the prevention of self-modifying code, which is a common practice in everything that includes a JIT (interpreters, web browsers, other VMs…)

    This discussion has got me thinking: would it be possible to keep track of process accounting without a context switch? I think on a probabilistic level, the answer is yes. The OS could sample processes at random intervals. It could derive a lot of runtime information just by looking at the instruction pointer and stack traces. Probabilistic sampling could open up sample-timing vulnerabilities, but it’s an interesting idea, no?

    Since even human beings have a hard time finding out what running programs are doing by looking at their stack and registers, I would spontaneously say that this might not work with an acceptable performance cost. On the other hand, we are talking about code coming from a VM there, so maybe said VM’s JIT compiler could ask generated code to put some debug information on the stack in order to ease this task, at the expense of slightly slower performance and increased memory usage.

    I don’t know. Sounds interesting. By the way, I’ve been wanting to ask for some time : have you ever thought about trying to implement all those VM-based OS ideas that you have in a hobby project ? After all, if I could take the crazy decision of getting into hobby OS development, so can you, as long as your weekly schedule and health give you enough time…

  18. Alfman May 29, 2012 / 12:54 am

    ReturnValue class:
    Interesting idea, I’ve never thought of doing it that way.

    Exceptions are probably the best answer. In the past when I gave this problem more thought, I concluded that ideally “exceptions” should be abstracted into more generic constructs that aren’t limited to exceptions.

    A caller would call a function with multiple return addresses, and the callee would decide which one to take, none being treated specially by the language. The possible paths would be declared as part of the function prototype. Errors are an obvious case, but there are other times when we might want to continue a logical branch that took place inside the function. An example is “fork”, which would have three possible branches: Error, parent, child. This eliminates the artificial need to collapse the code path upon return only to branch again.
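Lacking language support, the "multiple return addresses" idea above can be approximated with one continuation per declared outcome: the callee invokes exactly one of them, so the caller never has to collapse onto a status code and re-branch. This is only a sketch of the concept; fork_like, its parameters and the three outcome branches are hypothetical illustrations, not a real API.

```cpp
#include <cassert>
#include <functional>

// A fork-shaped call with three declared exits: error, parent, child.
// The callee chooses which continuation to take; none is "special".
void fork_like(int result,
               const std::function<void()>& on_error,
               const std::function<void(int)>& on_parent,
               const std::function<void()>& on_child) {
    if (result < 0)      on_error();          // the fork failed
    else if (result > 0) on_parent(result);   // result is the child's PID
    else                 on_child();          // result == 0: we are the child
}
```

A real language feature would declare the three paths in the function prototype and jump to them directly, rather than paying for std::function indirection; the control flow, however, is the same.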

    (I’m so sorry, I’ve branched the discussion yet again…bad Alfman)

    TLS:

    “as far as OS support is concerned…have a pointer to this memory block accessible through a syscall or RPC command, and let threaded code deal with it using as much syntactic sugar as it wants, right ?”

    Well sure, in theory. However, if you want to support __thread allocation attributes, then you’d either have to modify the compiler to support your OS syscall, or you’d have to stick with the TLS mechanism that the C/C++ compilers are already generating code for.

    “P.S : Just taking a breath to say, thanks for passing by ! It’s great to have some lively and fruitful technical discussions in the comments of this blog, which usually feels like me screaming on a desert cliff in order to sort out my OSdeving ideas.”

    It’s a lot of fun. I don’t get to work on or even talk about any of these things in my professional life and to be honest it’s left a rather big void, which is probably why I’m here.

    VM:
    “I am curious about how VMs would deal with this situation : would they try to analyze what the ASM code is doing ahead of time in order to determine if it is harmless without harming its runtime performance, follow what it’s doing in real time at the cost of a potentially serious performance hit, or completely forbid hand-optimized code altogether ?”

    Well, if you’re talking about a software VM like VMware, then I think it would be extremely difficult to make sure the code is contained in all cases. There are so many tricks a program can use (like using RET/IRET to CALL a procedure instead of returning, etc.) that VMware needs to protect against all possibilities… that’s a big wow in my book! However, since we’re talking about a language VM like the JVM, it only has to make sure that it doesn’t generate such unauthorised instructions itself, and I think that is far easier to commit to. .NET/Java will run “native” code if you want, but you have to tell them to trust it implicitly, as their VMs cannot vouch for its security.

    “On the other hand, we are talking about code coming from a VM there, so maybe said VM’s JIT compiler could ask generated code to put some debug information on the stack in order to ease this task, at the expense of slightly slower performance and increased memory usage.”

    All true; however, many languages are already keeping track of stack frames in order to support exception handling, so I don’t think we need anything new. One can still debug a C stack frame without debug information, we just lose all the variable/function names. BTW, GCC can be told to disable stack frame tracking to gain an extra register (BP), but then you lose stack traces, obviously.

    “have you ever thought about trying to implement all those VM-based OS ideas that you have in a hobby project ? After all, if I could take the crazy decision of getting into hobby OS development, so can you, as long as your weekly schedule and health give you enough time…”

    Hmm, well these ideas came about more recently than my first OS project a decade ago, which was far more conventional and truthfully had little merit other than being an educational tool that carried some bragging rights. Everything changes after university. The need to make a living becomes imperative. If I could make a living out of it, I’d love to do it, but otherwise there’s a lot of economic pressure to focus on things that can be monetised. I used to think having strong skills would equate to easy money, but that’s not been the case. It’s still a dream of mine to do a startup with friends, some day maybe.

  19. Hadrien May 29, 2012 / 9:37 pm

    ReturnValue class:
    Interesting idea, I’ve never thought of doing it that way.

    I’m always impressed by what can be done with templates. If only I could think of using them more often…

    Exceptions are probably the best answer. In the past when I gave this problem more thought, I concluded that ideally “exceptions” should be abstracted into more generic constructs that aren’t limited to exceptions.

    A caller would call a function with multiple return addresses, and the callee would decide which one to take, none being treated specially by the language. The possible paths would be declared as part of the function prototype. Errors are an obvious case, but there are other times when we might want to continue a logical branch that took place inside the function. An example is “fork”, which would have three possible branches: Error, parent, child. This eliminates the artificial need to collapse the code path upon return only to branch again.

    This idea strikes me as something out of the asynchronous world, very similar to what can be done using callbacks. The main issue which I have with those so far is that they tend to make it harder to visualize the code path in a program, making me wish for a simpler abstraction in simple cases… But that’s me anticipating one long-planned blog post on how RPC calls would “return” results or status information.

    (I’m so sorry, I’ve branched the discussion yet again…bad Alfman)

    That’s okay, it goes very well with my way of thinking… Always jumping from one thing to another.

    Well sure, in theory. However, if you want to support __thread allocation attributes, then you’d either have to modify the compiler to support your OS syscall, or you’d have to stick with the TLS mechanism that the C/C++ compilers are already generating code for.

    But then I’d just have to stick my nose in the C++11 spec and find out how compilers are supposed to locate the TLS area (here’s hoping that it is part of the standard), then put my TLS base pointer there instead of putting it within syscall reach… If so, it remains a fairly simple OS abstraction to support.
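For what it’s worth, C++11 standardizes only the declaration side of this; how the generated code locates each thread’s block comes from the platform ABI (for instance, via the %fs segment base on x86_64 Linux), which is what an OS would have to honor. A minimal illustration of the language-level feature:

```cpp
#include <cassert>
#include <thread>

// C++11 thread_local: every thread gets its own independent copy,
// initialized to 0, located through whatever mechanism the ABI defines.
thread_local int per_thread_counter = 0;

int bump() { return ++per_thread_counter; }  // no locking needed
```

Two threads calling bump() never see each other’s counter, which is exactly the errno-like behaviour discussed earlier, but made explicit in the type system.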

    Hmm, well these ideas came about more recently than my first OS project a decade ago, which was far more conventional and truthfully had little merit other than being an educational tool that carried some bragging rights. Everything changes after university. The need to make a living becomes imperative. If I could make a living out of it, I’d love to do it, but otherwise there’s a lot of economic pressure to focus on things that can be monetised. I used to think having strong skills would equate to easy money, but that’s not been the case. It’s still a dream of mine to do a startup with friends, some day maybe.

    I don’t know… Here, at the rhythm at which my past and current internships go, I still have enough spare time to work on OS matters… So if “real jobs” go at the same rhythm, which they are legally supposed to, I would expect to still be able to have hobbies after I get one of those. Not accounting for all the spare time that my annoying bad health eats for lunch, of course.

    It may be that as a French person and due to the specific field that I work in (public uni teaching and research), I have it easier than you, though. And then there’s the issue of kids : right now, I’m happily living alone in my flat, sharing it with my girlfriend when she passes by, but I keep fearing that once we start to make those, I will have to give up on all my private life, hobbies and sleep in order to properly raise them.

  20. Alfman May 30, 2012 / 6:10 am

    Jobs & Life & hobbies:
    “So if ‘real jobs’ go at the same rhythm, which they are legally supposed to, I would expect to still be able to have hobbies after I get one of those.”

    I’d say on average my FT jobs have been 48hrs per week. I donno if it’s corporate pressure, deadlines, culture, competition, poor job market, but in some places it’s just the norm to leave work at around 19:00. What I could do with an extra 52 days/year! All the places I interned at ended shifts at 17:00, so the norms are probably different everywhere. A hypothesis I haven’t researched is that not long after my own internships, the FLSA reclassified software engineers as explicitly exempt from legal employee overtime pay requirements. It may be responsible for a shift in the corporate work week.

    I’ve heard from people overseas who seem to think all salaried employees can come and go as they please. Does it work like that in France? Do you guys have a 35 hour week by law?

    “And then there’s the issue of kids : right now, I’m happily living alone in my flat, sharing it with my girlfriend when she passes by, but I keep fearing that once we start to make those, I will have to give up on all my private life, hobbies and sleep in order to properly raise them.”

    Well, that’s where I’m at now. All my free hobby time suddenly vanished. Our girl wants attention all the time, she won’t sleep through the night. I love our time together, but I also miss having time to myself. It’d be very nice to have more help. I try speaking to her exclusively in French (no matter how badly) to teach her a second language. Watching her learn new interactions is a lot of fun, that kind of thing makes every parent proud.

    (Go ahead and take this off your blog if you want to and/or continue via email considering how OT this subject is)
