Out of memory !

Until some clever electronics engineers come up with a form of computer memory that is simultaneously fast, cheap, and nonvolatile, we will have RAM in our computers, and in fairly limited amounts. And as with any hardware resource of limited availability, this RAM will fill up at some point. In this article, I will discuss what should happen then.

Swap is not a solution

Some may argue that RAM shortage can be prevented by using swap, that is, using files on mass storage devices as RAM. This is, however, partly untrue and overall a very ugly solution to this problem, for the following reasons:

  1. Swap files also have a limited size so they can be filled up too, though admittedly more slowly
  2. Slow swap is assumed to be fast RAM by software, causing major system performance hits
  3. OSs often set many restrictions on what can be put in swap, reducing its usefulness
  4. Removing the drive on which the swap file is located will cause a near-instant major system crash
  5. Swapping causes massive strain and wear on mass storage devices as compared to normal OS usage

In short, I believe that swap is a kludge that can be used as a temporary solution to address a well-known lack of RAM on a machine, but shouldn’t be used as a long-term protection against RAM shortage. Other countermeasures should be used instead.

Separating “vital” RAM and “extra” RAM

More often than not, large programs do not strictly require all the RAM that they allocate. A part of it is actually necessary for their job, while another part is made of caches and similar stuff that is used to speed them up. Effectively, this can be handled by evaluating the amount of free RAM when the program is started, and then allocating a reasonable amount of cache memory, or even none at all if the situation is truly critical. However, this is not the proper way to do it.

The Linux disk cache shows an example of how such situations should be handled instead: as long as there’s free RAM, it keeps allocating into it to reduce the impact of disk accesses, but does so in such a fashion that if other software ends up allocating more RAM than is available, that “cached” RAM can be promptly reclaimed by the kernel without issues. On the inside, it likely works as follows: each time the kernel filesystem driver wants to access a part of the disk cache, it first probes the associated memory region to see if it is still allocated. If not, it considers the associated cache line to be invalid, and reloads it from the disk to another cached RAM region.

Obviously, such a mechanism cannot work without explicit support from the caller application. However, nothing prevents the system library from featuring some sort of “cache memory” abstraction that hides away the low-level work and makes this mechanism easier to use for user-mode developers. The benefit would be to ensure that as long as applications are well-coded, RAM shortage would only happen when all allocated RAM chunks are strictly necessary to the well-being of the associated software, and not, as an example, when some web browser’s in-RAM web cache has grown anarchically beyond reasonable dimensions.
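
To make this idea a bit more concrete, here is a minimal sketch of what such an abstraction could look like from the application’s point of view. The names (cache_alloc, get_cached and so on) are hypothetical, and the “kernel” is simulated with plain malloc()/free() so the example can actually run; it only illustrates the probe-then-regenerate contract, not a real implementation.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical "cache memory" block: data may vanish between two uses. */
    typedef struct {
        void  *data;   /* NULL once the system has reclaimed the block */
        size_t size;
    } cache_block;

    /* Ask for reclaimable cache RAM; failing right away is perfectly legal. */
    static cache_block cache_alloc(size_t size)
    {
        cache_block b = { malloc(size), size };
        return b;
    }

    /* In a real system, the kernel would do this behind the caller's back
     * whenever "vital" allocations need the space. */
    static void system_reclaim(cache_block *b)
    {
        free(b->data);
        b->data = NULL;
    }

    /* Probe the block before each use; regenerate the contents if it has
     * been reclaimed, just like the Linux disk cache reloads an evicted
     * page from disk. */
    static const char *get_cached(cache_block *b,
                                  const char *(*regenerate)(cache_block *))
    {
        if (b->data == NULL)
            return regenerate(b);   /* cache miss: rebuild the data          */
        return b->data;             /* cache hit: the data is still resident */
    }

    static const char *rebuild(cache_block *b)
    {
        *b = cache_alloc(32);
        if (b->data == NULL)
            return "uncached fallback";  /* RAM is truly tight: run uncached */
        strcpy(b->data, "expensively recomputed data");
        return b->data;
    }

    int main(void)
    {
        cache_block b = cache_alloc(32);
        if (b.data)
            strcpy(b.data, "cheaply cached data");

        printf("first access : %s\n", get_cached(&b, rebuild));
        system_reclaim(&b);                    /* memory pressure strikes */
        printf("second access: %s\n", get_cached(&b, rebuild));

        free(b.data);
        return 0;
    }

The important property is that the application never assumes its cached data is still there: every access goes through a probe, and losing the block is a recoverable cache miss rather than an error.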

Preventing RAM shortages

Even with the aforementioned precautions, there would be some situations in which software would fill up the available RAM through perfectly legitimate operations. All it takes is a single application that is given a heavy enough workload, massive numbers of independent applications running simultaneously, or an application that is subject to memory leaks that gradually eat up all of the available RAM as time passes. Before discussing what should happen when all RAM has been legitimately exhausted in such a way, one can wonder if and when such an event could be anticipated in advance, allowing the system to warn its user so that he can take action.

Sudden allocation of massive amounts of RAM can rarely be anticipated in advance. Clean failure, which will be discussed in the next section, sounds like the only option in such a case. But when RAM is filled up more slowly, and especially if this happens on the user’s request, one can envision setting a warning threshold of RAM exhaustion (which can be absolute, relative, or a mix of both), and displaying a warning notification when this threshold is reached. This threshold could either be an arbitrary number, or something more advanced based on statistical data about users’ and applications’ RAM usage.
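
As a hedged illustration of the “absolute, relative, or a mix of both” idea, here is a small sketch of what such a threshold check could look like; the constants are arbitrary placeholders, not recommendations.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical mixed threshold: warn when free RAM drops below a fixed
     * floor or below a fraction of total RAM, whichever is larger. */
    #define WARN_ABSOLUTE_FLOOR   (256ULL * 1024 * 1024)  /* 256 MiB          */
    #define WARN_RELATIVE_PERCENT 10ULL                   /* 10% of total RAM */

    static bool should_warn_user(uint64_t total_ram, uint64_t free_ram)
    {
        uint64_t relative_floor = total_ram * WARN_RELATIVE_PERCENT / 100;
        uint64_t threshold = relative_floor > WARN_ABSOLUTE_FLOOR
                           ? relative_floor : WARN_ABSOLUTE_FLOOR;
        return free_ram < threshold;
    }

    int main(void)
    {
        /* A 4 GiB machine with 300 MiB free: below the 10% floor, so warn. */
        uint64_t total = 4ULL << 30, free_ram = 300ULL << 20;
        printf("warn: %s\n", should_warn_user(total, free_ram) ? "yes" : "no");
        return 0;
    }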

Failing with and without casualties

There are a number of ways a lack of RAM can be managed when it happens. Two solutions tend to stand out, however: the one in which the allocating software manages the failure by itself and is brutally killed by the OS if it doesn’t (as is implemented by most BSDs in the form of returning NULL to the memory allocation request), and the one in which the OS directly kills some memory hog in order to free up some RAM to answer the allocation request (as is implemented by Linux in the form of the “out of memory killer”).

I tend to be biased towards the first solution, since it is my opinion that software should never ignore the error return codes of system functions. It is also undoubtedly the best option when a user asks software to do an impossible task, like loading 64GB of data in RAM. However, not all processes are equal in the world of operating systems, and this creates situations in which rejecting a process’ memory allocation requests might be highly undesirable.

Consider the following scenario, as an example: as the machine is running close to its RAM limits, a low-level system service such as the kernel or a hardware driver suddenly ends up requiring more RAM to do its job. If this RAM request is not satisfied, it will result in some sort of driver failure (though not a full crash if that driver is well-coded) that may affect all running user-mode processes in some way. In such circumstances, wouldn’t it be better to sacrifice just one user-mode process so that the driver may continue to operate properly?

From this, it appears that having the RAM allocation request fail is the best option in some situations, and that killing some process to free up RAM is the best option in other situations. Which is likely why both approaches have their defenders. The solution which I propose, and have in fact already implemented to some extent, is to allow malloc() to have both behaviours, switching from one to the other through an optional “force” flag. If that force flag is disabled, malloc will fail and return NULL when there isn’t enough available RAM. If that force flag is enabled, it is guaranteed that malloc calls will either succeed or kill the caller, no matter how much process-killing it takes. Since this malloc flag puts a lot of nuisance power in the hands of the caller, its use will obviously be subject to the possession of a special security permission that only trusted system services should have.
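
Here is a toy, user-space model of that two-mode allocator. Everything in it (the RAM budget, the process table, the trust check, the victim-picking rule) is made up purely for illustration; it only demonstrates the intended contract of the flag, not how the real allocator would be implemented.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    enum { ALLOC_DEFAULT = 0, ALLOC_FORCE = 1 };

    static size_t ram_free = 1000;          /* pretend we have 1000 units of RAM */

    struct fake_process { const char *name; size_t ram; bool alive; };
    static struct fake_process procs[] = {
        { "web_browser",  600, true },
        { "media_player", 250, true },
    };

    /* Pick the victim the heuristic likes best (here: simply the fattest). */
    static bool kill_one_victim(void)
    {
        struct fake_process *victim = NULL;
        for (size_t i = 0; i < sizeof procs / sizeof procs[0]; i++)
            if (procs[i].alive && (!victim || procs[i].ram > victim->ram))
                victim = &procs[i];
        if (!victim)
            return false;                   /* nothing left to kill */
        printf("OOM killer: sacrificing %s (%zu units)\n", victim->name, victim->ram);
        victim->alive = false;
        ram_free += victim->ram;
        return true;
    }

    static void *os_malloc(size_t size, int flags, bool caller_is_trusted)
    {
        while (ram_free < size) {
            if (!(flags & ALLOC_FORCE) || !caller_is_trusted)
                return NULL;                /* clean failure: caller handles it */
            if (!kill_one_victim())
                return NULL;                /* the real design would kill the caller here */
        }
        ram_free -= size;
        return malloc(size);
    }

    int main(void)
    {
        printf("normal malloc of 500: %s\n", os_malloc(500, ALLOC_DEFAULT, false) ? "ok" : "NULL");
        printf("normal malloc of 800: %s\n", os_malloc(800, ALLOC_DEFAULT, false) ? "ok" : "NULL");
        printf("forced malloc of 800: %s\n", os_malloc(800, ALLOC_FORCE,  true)  ? "ok" : "NULL");
        return 0;
    }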

Deciding to kill processes to free up RAM is not enough, though. It is also of primary importance to kill the right process. On Linux, as an example, I can’t count the number of times when the OOM killer has decided to kill X and all GUI applications simply because I asked too much from VirtualBox, and this shouldn’t happen. The decision should probably be based on rock-solid heuristics, and since out of memory events are getting pretty rare these days as computers now sport multiple gigabytes of RAM, it might even be a good idea to interactively ask for user feedback before proceeding.

Here is my take on the basic building blocks of an optimal “out of memory killer” heuristic:

  • The more RAM a process has allocated, the more likely it is that no additional murders will be needed.
  • The more third-party RPC clients a process has, the higher impact killing it is likely to have.
  • The higher the scheduling priority of a process, the more its existence is likely to matter to users.
  • Processes which have no unsaved data or “writing” file connexions are better candidates for killing than processes that do.
  • The kernel always comes last on the list, since killing it amounts to halting the system with a panic screen.
  • In any case, that killing spree stops once the malloc caller has been killed, and never starts if it is determined that killing “better” candidates won’t solve the problem.
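
As a rough sketch of how these building blocks could be folded into a single score, here is one hedged possibility. The weights are arbitrary, the per-process fields are assumed to be supplied by a hypothetical process manager, and the “stop once the malloc caller is dead” rule would live in the loop that consumes these scores rather than in the score itself.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct oom_candidate {
        const char *name;
        uint64_t ram_used;          /* bytes currently allocated             */
        unsigned rpc_clients;       /* third-party processes relying on it   */
        int      sched_priority;    /* higher = more important to the user   */
        bool     has_unsaved_data;  /* unsaved documents or open writes      */
        bool     is_kernel;         /* killing it amounts to a kernel panic  */
    };

    /* Higher score = better candidate for killing. */
    static int64_t oom_score(const struct oom_candidate *p)
    {
        if (p->is_kernel)
            return INT64_MIN;                         /* always last on the list   */

        int64_t score = (int64_t)(p->ram_used >> 20); /* more RAM freed per kill   */
        score -= 100 * (int64_t)p->rpc_clients;       /* many dependants: spare it */
        score -= 10  * (int64_t)p->sched_priority;    /* high priority: spare it   */
        if (p->has_unsaved_data)
            score -= 1000;                            /* avoid losing user data    */
        return score;
    }

    int main(void)
    {
        /* The GUI server is spared despite being smaller, because 25 clients
         * rely on it; the fat browser with unsaved data still scores higher. */
        struct oom_candidate browser = { "browser", 2ULL << 30, 0,  0,  true,  false };
        struct oom_candidate gui     = { "gui",     256 << 20,  25, 10, false, false };
        printf("%s: %lld\n", browser.name, (long long)oom_score(&browser));
        printf("%s: %lld\n", gui.name,     (long long)oom_score(&gui));
        return 0;
    }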

You may notice that there is no direct notion of system services here, save for the kernel’s special treatment. That is because in a sandboxed OS design, root access does not exist anymore, and determining what is most likely to be a system process using lists of security privileges becomes extremely complex and error-prone. Which is why an RPC-based heuristic is used instead to determine more pragmatically how many processes may rely on the candidate to operate.

Another thing which I should point out is that some sensible killing heuristics may use fairly high-level concepts. As an example, the RPC one requires making sure that a malicious process cannot spawn lots of children and use them as RPC clients to reduce its likelihood of being killed. In a sandboxed design, ensuring with fair reliability that the connected processes use third-party code may be done by checking that they belong to a separate sandboxed application, since that limitation requires much more elaborate schemes for crackers to bypass (the malicious application must have received permission to call executables from other applications and the user must have installed several applications from the cracker).

An interesting limitation of this heuristic, however, is that it only works if RPC is mostly used for client-to-server communication, further strengthening the asymmetric role of something which I originally envisioned as a symmetric communication protocol.

And I think that’s it for out of memory issues. Just one more unrelated thing though: while working on the documentation of kernel components, it has occurred to me that some of them might be ill-named, in the sense that their names do not clearly reflect what they do. So while no third-party code relies on the names yet, I thought that it would be a good idea to change that. Here goes:

  • “Physical memory manager” is quite a long name and does not clearly state that this component deals with the allocation of page-aligned chunks of RAM. Thus, I would spontaneously change the name to “RAM manager”.
  • “Virtual memory manager” is again a long and misleading name for something which deals with the virtual address space of processes. Since I only support page-based address translation anyway, I might as well call it “Paging manager”.

23 thoughts on “Out of memory !”

  1. kdepepo November 2, 2012 / 3:39 pm

    The application allocating memory should indicate how costly it would be to re-generate the data contained in that part of the memory. If it is data received from the network, reading it again is very slow. If the memory is used to cache data from a file, it will be faster to re-read it from disk, but this is still very slow. If the data, however, is converted from other data available in other memory, temporarily freeing it and later re-generating it might be much faster than swapping it out to disk.

    Such a collaborative way to handle memory was used in AmigaOS (due to lack of an MMU), where each application could register a “memory handler”. If memory got filled, the OS asked the applications in turn to release some of their caches.

    Now combine virtual memory with the above-described cost-dependent releasing, and you should always get optimum performance.

  2. Brendan November 2, 2012 / 7:52 pm

    While slow swap is assumed to be fast RAM by software, this shouldn’t cause major system performance hits. It’s the OS’s job to (attempt to) ensure that only the “least likely to be needed” pages are swapped out. For example, if a process repeatedly accesses 10 MiB of pages but hardly ever accesses another 10 MiB of pages, then sending those “hardly ever accessed” pages to swap won’t make any noticeable difference to performance. Swap only affects performance if software is actually using those pages frequently enough.

    For example, you might have a computer with 4 GiB of RAM. When software is using 3 GiB of data frequently and another 123 GiB of data rarely then everything works fine. When software is using 5 GiB of data frequently you start seeing some performance loss. When software is using 7 GiB of data frequently you see major performance problems (but the user probably would’ve had plenty of warning before things get that bad).

    There are also much less intuitive situations. For example, imagine a computer with an FTP server under heavy load (that doesn’t really use much RAM at all, as it’s relying on the OS’s disk caches), and a graphics image editor that somebody used 2 days ago (that has been minimised/unused since) that is tying up 1234 MiB of RAM. In this case, by sending the graphics image editor’s pages to swap space you could increase the amount of data cached by the OS’s disk caches and make the FTP server faster, and reduce disk wear caused by “disk cache misses”. In this case; even though no process actually needs the RAM, and even though there may be plenty of “free” RAM already being used by the OS’s disk caches, you improve performance (and reduce disk wear) by sending data to swap!

    I’d also point out that (for most OSs) support for swap space is tied up with support for memory mapped files – e.g. maybe those “hardly ever accessed” pages are part of a memory mapped file and were never loaded into memory to begin with (and if any of them were loaded, maybe they weren’t modified and can be freed immediately without saving anything in a swap partition first, and then loaded from the original file if they’re needed again).

    The point of all this is to avoid premature failure (e.g. instead of failing as soon as actual RAM is exhausted, performance degrades as the amount of data actually being used increases and failures only occur in extreme cases).

    – Brendan

  3. Hadrien November 3, 2012 / 10:31 am

    @kdepepo:

    The application allocating memory should indicate how costly it would be to re-generate the data contained in that part of the memory. If it is data received from the network, reading it again is very slow. If the memory is used to cache data from a file, it will be faster to re-read it from disk, but this is still very slow. If the data, however, is converted from other data available in other memory, temporarily freeing it and later re-generating it might be much faster than swapping it out to disk.

    Such a collaborative way to handle memory was used in AmigaOS (due to lack of an MMU), where each application could register a “memory handler”. If memory got filled, the OS asked the applications in turn to release some of their caches.

    That would be interesting indeed! However, it makes me wonder about the way data acquisition costs could be estimated in a manner that is comparable from one program to another and ideally can be specified at RAM allocation time (thus, before the data has actually been acquired).

    The simplest measurement which I can think of is an estimation of the amount of real time that it will take to acquire the cached data, but even that can vary a lot from one attempt at performing a task to another. In the realm of mass storage, high-end SSDs are much faster than HDDs, which are themselves much faster than the crappy flash cells used in your typical USB pen drive or SD card, and the performance of all these can vary considerably depending on if they are busy, idle but active, or turned off by the power management system. For another example, network performance varies considerably from one internet connexion to another, from one server to another, and even across the typical workday.

    With all this in mind, would it still be possible to give a simple yet fairly reliable estimate of the cost of some block of cached data, that is precise enough to actually differentiate one block of data from another?

  4. kdepepo November 3, 2012 / 4:16 pm

    The application could use a “malloc / fill / report time” sequence. The OS then knows how much time the application would need to recreate the content, but you could also add heuristics to compute moving averages for disk and network speed. Using direct measurements might solve questions such as “How much time does a browser need to recreate the rendering tree from an HTML stream”, so that it can decide to keep the tree, discard it, or even discard the raw HTML data, if it can be re-read from disk again.

    Indeed, with SSD systems having similar speed to main memory, it could be useful to swap out. I was mainly thinking about rotational disks, which are still the most used storage devices.

  5. Alfman November 4, 2012 / 5:16 am

    I don’t like the out of memory killer in linux, I probably wouldn’t like it much here either.
    But if it’s necessary, here are my thoughts:

    1. Have a new “killer malloc” function in addition to normal malloc.
    As you’ve indicated, it is rather silly for a malloc invocation to successfully return memory, having killed another application, when the caller would have been willing to handle the out of memory condition gracefully by itself. But instead of a flag, I’d rather there be a completely separate allocation function for critical allocations which should succeed at the cost of other processes.
    I guess it’s personal taste:

    malloc(sizeof(abc), true);
    malloc_critical(sizeof(abc)); // I like this better

    2. Distinguish between system critical and normal processes. A system critical service wouldn’t need permission to call the critical malloc. A normal process shouldn’t be able to kill processes without explicit permission (issued either ahead of time, or as a real time popup). System critical services should be exempt from “kill duty” regardless of their priority levels or other factors.

    3. Implement a garbage collection signal.
    Ask processes to voluntarily give up ram before resources get critically low.

    4. Consider alternatives to killing altogether. Instead of killing any processes at all, suspend them to user files such that they can be restarted later. In fact, you might even display suspended processes on the task bar so they are easy to recover. Suspended applications might even be available to restart after a reboot.

    http://checkpointing.org/

    I think I like this idea even better than having generic swap memory – an application instance can get its very own optional swap file where it can be hibernated/resumed/deleted. No need for all processes to share swap.

    I’m not sure if I’d go this far, but an idea to consider is that all mallocs could succeed as long as the user had enough disk quota to fulfil the request. An admin wouldn’t have to define a global system limit in terms of memory+swap space.

    5. Returning null versus terminating. A process that doesn’t handle null will terminate anyways when it accesses the memory, so I don’t know what the benefit would be for adding a new mode to malloc. My preference is to avoid process level bits which change the behaviour of the programming environment since you might be building a conflict when some libraries begin expecting it to work one way, and other libraries another way (I’m reminded of PHP’s magic quotes).

    6. What about your stance on overcommitting? Fork is an obvious case. Another case is when processes are loaded from a binary/object and haven’t yet written to the global variables stored in shared data pages. Actually you might have told me once that you weren’t using fork to spawn other processes.

  6. Hadrien November 5, 2012 / 8:32 am

    @Brendan:

    You make some very good points as usual, in particular pointing out that I was assuming that all RAM that is allocated by software is actually used by it. You are also right in saying that swap is a way to work around the limitations of software that allocates large amounts of RAM and uses little of it, at least to some extent. However, I believe that you might be a bit too optimistic as to what swap can do in other areas.

    For example, you might have a computer with 4 GiB of RAM. When software is using 3 GiB of data frequently and another 123 GiB of data rarely then everything works fine. When software is using 5 GiB of data frequently you start seeing some performance loss. When software is using 7 GiB of data frequently you see major performance problems (but the user probably would’ve had plenty of warning before things get that bad).

    An issue which I have with this specific example is that the first statement implies that there could be a 123 GiB large swap file full of infrequently used data lying somewhere on a hard drive, eating up disk space that could have been used by user data. When the average hard drive can still rarely store more than 500 GB of data, and when sales of SSDs with capacities of 250 GB or less are booming, I would not consider this an acceptable situation. Of course, things are different if the swap space can be reclaimed by the OS, but then again that would mean that copying large files to the system disk would result in random out of memory crashes, which is probably not acceptable either.

    This is actually one of my gripes with swap: it is designed to work around small amounts of RAM, but is typically bounded to one single HDD file or partition whose size is of the same order of magnitude as RAM size, so I am not sure if so much has been gained by using it. Perhaps larger amounts of swap, like those which you use in this example, could work if swap space was something that each specific process owns, rather than something which all processes share (kind of like Alfman’s proposal of saving processes to disk to free up RAM). In particular, killing a single process that abuses swap space would be enough to free up disk space when it is required.

    There are also much less intuitive situations. For example, imagine a computer with an FTP server under heavy load (that doesn’t really use much RAM at all, as it’s relying on the OS’s disk caches), and a graphics image editor that somebody used 2 days ago (that has been minimised/unused since) that is tying up 1234 MiB of RAM. In this case, by sending the graphics image editor’s pages to swap space you could increase the amount of data cached by the OS’s disk caches and make the FTP server faster, and reduce disk wear caused by “disk cache misses”. In this case; even though no process actually needs the RAM, and even though there may be plenty of “free” RAM already being used by the OS’s disk caches, you improve performance (and reduce disk wear) by sending data to swap!

    You are right that this would be an interesting use case for swap, but how would it work in practice? How would the OS disk cache figure out that the RAM is not actually filled up, but that some large software can be swapped out, allowing it to allocate even more cache, without doing so in situations in which the relevant software is actually active and thus in which swapping it out would be undesirable?

    One option would be to automatically swap out software if it has been inactive for an arbitrary amount of time and the disk cache has run out of memory, but then that would mean that just leaving a running computer for a while with some background task that makes mild use of the disk (say, a BitTorrent client) would result, on wakeup, in the characteristic sluggishness of something that has run out of RAM and started to eat into swap. Personally, I wouldn’t want that on my desktop.

    I’d also point out that (for most OSs) support for swap space is tied up with support for memory mapped files – e.g. maybe those “hardly ever accessed” pages are part of a memory mapped file and were never loaded into memory to begin with (and if any of them were loaded, maybe they weren’t modified and can be freed immediately without saving anything in a swap partition first, and then loaded from the original file if they’re needed again).

    Memory mapped files are truly a beautiful idea in principle, and I considered using these as a central concept for a while since they simplify many system tasks. However, at some point, I realized that they really cannot apply to most modern file use cases, due in no small part to one little problem: compressed files.

    Since we live in an era of computing where mass storage media are slow and CPUs are fast, pretty much any kind of large file is nowadays stored in a compressed form, so as to reduce disk footprint, ease transport, and sometimes reduce loading times (while increasing it at other times due to data nonlocality, but that’s another debate). The problem is that memory-mapping a compressed file is basically useless, since these can only be read and written to in an uncompressed form, that has to be maintained separately by software anyway.

    Another area where I find memory-mapped files to perform pretty weakly is when content is inserted in the middle of a file or appended at the end of it. Addressing these use cases requires either moving huge amounts of data around (and even if RAM is fast, it has its limits), or creating pathetic fragmented monsters that are stored in an endless linked list that must be parsed over and over again or operated solely through the intermediary use of a hash table…

    In short, it appears to me that memory-mapped files only perform well when used for manipulating file formats which no one uses anymore (for large amounts of data at least) in a read-only fashion. In these conditions, is it still an abstraction that can help a lot with swapping issues?

  7. Hadrien November 5, 2012 / 8:57 am

    @kdepepo:

    The application could use a “malloc / fill / report time” sequence. The OS then knows how much time the application would need to recreate the content, but you could also add heuristics to compute moving averages for disk and network speed. Using direct measurements might solve questions such as “How much time does a browser need to recreate the rendering tree from an HTML stream”, so that it can decide to keep the tree, discard it, or even discard the raw HTML data, if it can be re-read from disk again.

    Indeed, that could well be the least bad option in the end. My issue with such sequences, however, is that one would start to encounter problems as soon as the RAM gets filled up with cache data: when a caching process would request cache RAM for data of unknown cost, the system wouldn’t know if it should reject the cache request or invalidate some least costly cached data instead.

    Example: If the RAM contains lots of disk cache and the network services request a bit of network cache, the right decision would likely be to invalidate some disk cache since network accesses are more costly. Conversely, if the RAM contains lots of network cache and the mass storage services request a bit of disk cache, the system should probably reject the caching request, unless the disk cache is getting really small.

    Perhaps the system should revert to a simpler unprioritized FIFO policy for memory allocations, and only take account of data costs when the amount of cache-able RAM decreases (i.e. when “vital” RAM is allocated). That would reduce the risk of starving “low-cost” caches on the plus side, but would also mean that cache is considered more or less valuable depending on the situation, which is a debatable matter.

    Also, how should the system treat unfilled caches when some cache RAM has to be reclaimed by user processes?

  8. Hadrien November 6, 2012 / 8:27 pm

    @Alfman:

    1. Have a new “killer malloc” function in addition to normal malloc.
    As you’ve indicated, it is rather silly for a malloc invocation to successfully return memory, having killed another application, when the caller would have been willing to handle the out of memory condition gracefully by itself. But instead of a flag, I’d rather there be a completely separate allocation function for critical allocations which should succeed at the cost of other processes.
    I guess it’s personal taste:

    malloc(sizeof(abc), true);
    malloc_critical(sizeof(abc)); // I like this better

    I tend to prefer optional flags to new functions myself, as long as the behaviour of the function is not fundamentally changed, because the latter tends to result in overly long function names as the feature set grows (e.g. malloc_critical_shareable would already be a bit long, and future features could make it longer). However, I agree that what constitutes a “fundamental change” is in itself a matter of debate.

    2. Distinguish between system critical and normal processes. A system critical service wouldn’t need permission to call the critical malloc. A normal process shouldn’t be able to kill processes without explicit permission (issued either ahead of time, or as a real time popup). System critical services should be exempt from “kill duty” regardless of their priority levels or other factors.

    I actually raised this idea in the blog post, while pointing out that there’s an obvious problem with it: distinguishing what is a system process from what isn’t, since the boundary between both can get a bit blurry. The Linux OOM killer algorithm heuristically guesses this using permission flags like root access and raw hardware access, and only kills the associated processes in extreme cases, but the permission flag architecture which I envision is much more complex than the Linux one and thus unsuitable for this kind of game. This is why I suggested using an IPC-based “how many processes apparently rely on this fellow” heuristic.

    3. Implement a garbage collection signal.
    Ask processes to voluntarily give up ram before resources get critically low.

    In my opinion, this should probably remain a system task, since most user applications operate in a highly abstract library-based world in which they wouldn’t know which tasks they can perform while freeing up RAM without exhausting the remaining resources. This is why I tend to favor implementing a dedicated “cached RAM” abstraction which the system can free up itself when it needs it, even if it is certainly less flexible than full garbage collection.

    4. Consider alternatives to killing all together. Instead of killing any processes at all, suspend them to user files such that they can be restarted later. In fact, you might even display suspended processes on the task bar so they are easy to recover.

    (…)

    http://checkpointing.org/

    I think I like this idea even better than having generic swap memory – an application instance can get its very own optional swap file where it can be hibernated/resumed/deleted. No need for all processes to share swap.

    I agree that the idea of per-application swap files has potential, but what would be the advantage of fully hibernating entire applications instead of just using these files like “regular” swap? As Brendan pointed out above, swap can actually make sense when applications keep large data sets in RAM without operating on much of them at once, and resuming a fully swapped out application would probably result in even slower performance than resuming parts of it as part of a “regular” swapping process.

    Suspended applications might even be available to restart after a reboot.

    That reminds me a bit of an earlier “process backup” idea of mine to increase reliability, back in the early brainstorming days of this project. Basic idea was that if a process crashed, it would be great if we could just jump in our time machine, reload a copy of its RAM image from a few minutes ago, and start over before data loss occurred. Further reflexion suggested that this wouldn’t work very well without explicit system and application support, though.

    That is because processes tend to rely on a fair share of run-time system state, such as the PID of other processes which they are communicating with, unique “handle” identifiers which have been handed to them by lower-level system services, tasks which said system services are currently performing for them… The issue being that if you reinitialize core system services through a reboot, then try to start a hibernated process that relied on their previous state, you are likely to face some serious issues.

    I’m not sure if I’d go this far, but an idea to consider is that all mallocs could succeed as long as the user had enough disk quota to fulfil the request. An admin wouldn’t have to define a global system limit in terms of memory+swap space.

    I like the idea too, however it also must address a point which I have made since in the parallel discussion with Brendan: if swap has no size limit but can be reclaimed, does it mean that copying large files to a disk that is almost full of swap files would result in OOM application crashes? If swap can get infinitely large and cannot be reclaimed, does it mean that users can have all their disk space eaten by swap files?

    5. Returning null versus terminating. A process that doesn’t handle null will terminate anyways when it accesses the memory, so I don’t know what the benefit would be for adding a new mode to malloc. My preference is to avoid process level bits which change the behaviour of the programming environment since you might be building a conflict when some libraries begin expecting it to work one way, and other libraries another way (I’m reminded of PHP’s magic quotes).

    As mentioned in the blog post, I do not consider the “killing malloc” as something which your average user application should be able to use. Rather, it’s here to address the problem of critical processes getting starved of RAM by less critical processes. To recall my earlier example: if you have 5 “unimportant” user processes that rely on the activity of one driver, and the driver runs out of memory, rejecting the driver’s memory allocation request will potentially smash 6 processes in some way, whereas killing off one of the user processes will only affect that one process which just died.

    6. What about your stance on overcommitting? Fork is an obvious case. Another case is when processes are loaded from a binary/object and haven’t yet written to the global variables stored in shared data pages. Actually you might have told me once that you weren’t using fork to spawn other processes.

    Yup, I’m not a fan of fork since for modern use cases, having multiple clones of the same process running simultaneously is getting less and less useful (threads are more suitable for parallel tasks), and thus the overhead of fork+exec as compared to a “pure” CreateProcess-ish mechanism sounds less and less justified.

    As for overcommitting… Honestly, I don’t know. It’s quite the huge trade-off: if you do it, you can potentially use hardware resources much more efficiently by only acquiring them when they are actually needed. But on the other hand, I hate the idea of lying to processes by telling them that some resources are available on request when they actually aren’t.

    To make matters worse, I guess that such a choice would also have profound implications on future work, since if I have overcommitting, I’ll tend to use it everywhere so as to benefit from it and to design system abstractions to work around its flaws, whereas if I don’t have it, I’ll need to work around problems that can be solved easily through overcommitting with more complex methods.

  9. kdepepo November 6, 2012 / 10:28 pm

    > when a caching process would request cache RAM for data of unknown cost, the system wouldn’t know if it should reject the cache request or invalidate some least costly cached data instead.

    You only need to select one of the already filled caches and find the ones with least cost. malloc simply returns a valid virtual pointer. If you never fill the block, there is no physical storage used. If you fill it, you know the time that was needed to fill it.

    Of course you can reject keeping the pages later, effectively forcing the application to generate/convert the data again and again, but you do not need to reject them “in advance”.

  10. Alfman November 9, 2012 / 7:13 am

    Hadrien,

    “This is why I suggested using an IPC-based ‘how many processes apparently rely on this fellow’ heuristic.”

    Well, sometimes the pro-OOM killer view is that it’s just a matter of getting the right heuristic, but my opinion is that having any heuristic at all in the kernel for killing off innocent processes is the wrong approach. It implicitly kills off processes that were not involved in the allocation that triggered the OOM killer.

    For me, it boils down to this: my application is well behaved, it uses the resources that were approved by the OS, now the OS is turning around and killing my process because *it* failed to adequately manage resources in the first place. It’s like claiming eminent domain on something that’s already been promised to me. It’s immoral, don’t you see :-)

    I’ll try my best to embrace whatever OOM model you put forward! But I’d like to see every reasonable effort go into avoiding the killer in the first place, like pre-emptively denying memory allocation requests of non-critical tasks if they’re likely to trigger the OOM killer later, maybe having some ram reserved for critical services (don’t linux and solaris reserve ram for this?). Ask processes to voluntarily go into hibernation and/or free resources. Etc.

    “I agree that the idea of per-application swap files has potential, but what would be the advantage of fully hibernating entire applications instead of just using these files like “regular” swap?”

    Well, think in terms of “fine grained sessions”. A system wide pagefile typically creates a huge session containing every process running on the system. We can hibernate the system as a whole, but in multi-user scenarios (and sometimes even single user scenarios) that’s often not what we want. For example, a user locks his session, leaving web browsers & documents open. Upon the next login, these are immediately available because they were paged out to the global swap. But the whole time the user was away, these processes continue to consume global swap resources. This may even lead subsequent users to experience OOM conditions, which is especially unfortunate considering that the first user’s processes really should not be consuming swap resources while hibernating and their user is away. Of course, the OOM handler could kill the hibernating processes, but if they resided in local user/process swap instead, the problem wouldn’t have arisen at all.

    “As Brendan pointed out above, swap can actually make sense when applications keep large data sets in RAM without operating on much of them at once, and resuming a fully swapped out application would probably result in even slower performance than resuming parts of it as part of a “regular” swapping process.”

    Couldn’t it work the same way? When I say “hibernate”, I’m referring to the state of the application (ie: not running), not necessarily that it’s been “fully swapped out” to disk in one shot. It could be written to/read from disk on demand as memory is required (exactly like regular swap), the only difference would be that it’s mapped to a different file. A side benefit might be that a process based swap file would inherently be less fragmented than the global swap (since it contains only one process’s ram).

    “Further reflexion suggested that this wouldn’t work very well without explicit system and application support, though…That is because processes tend to rely on a fair share of run-time system state, such as the PID of other processes which they are communicating with, unique “handle” identifiers which have been handed to them by lower-level system services, tasks which said system services are currently performing for them”

    I’d argue that PIDs are racy anyways, it’s poor practice to hold one for any extended length of time whether hibernation is involved or not. Unfortunately it’s standard practice because some apis provide no safe alternatives. Depending on the contents of /proc/sys/kernel/pid_max, wraparounds may pose a serious problem to someone who’s killing processes from ‘ps’ or ‘top’.

    Beyond these global ids however, I don’t think there’s a problem saving and resuming resource handles on linux because each process has its own handle table within the kernel anyways. The file handle the process sees is simply an index into the process’s file/socket/etc table in the kernel starting with the initial #0-2 (stdin/stdout/stderr). I think it’d be pretty easy to save the contents of this table upon hibernation, and then reopen the files upon reload. As far as the process is concerned, the same file handles would still be valid before and after hibernation. Any errors (esp. sockets) could just leave the handle in a closed/errored state, and the process would naturally get an IO error on the next call.

    Of course, this may be much trickier to pull off in a microkernel, I won’t pretend that I know where to start :-)

    “if swap has no size limit but can be reclaimed, does it mean that copying large files to a disk that is almost full of swap files would result in OOM application crashes?”

    A dedicated swap partition obviously doesn’t compete with user files, but I guess even the hypothetical user/process based swap files could reside on a dedicated partition to isolate them in a similar way. Quotas could help keep the OOM conditions of one user from affecting another user, but ultimately there needs to be enough ram+disk space or an OOM is probably inevitable.

  11. Alfman November 9, 2012 / 7:40 am

    kdepepo,

    “You only need to select one of the already filled caches and find the ones with least cost. malloc simply returns a valid virtual pointer. If you never fill the block, there is no physical storage used. If you fill it, you know the time that was needed to fill it.”

    I personally dislike the deferred allocation approach used by linux/malloc because it essentially renders the “if (ptr==null) …” test useless and causes the application to page fault in rather unexpected places because the OS lied about the availability of ram. So my opinion is that it’d be bad to require malloc to use deferred allocation.

    But this doesn’t really preclude you from reserving/allocating ram and setting up a page fault anyways to see when a page changes, but how do you get around the lack of one to one correspondence between malloc objects and page sizes? Processes are typically indifferent to page sizes whether they’re 4K-8M, but here it sounds like you need the size of the page to be on par with the size of the buffer?

  12. Alfman November 9, 2012 / 8:56 pm

    kdepepo,

    Just a quick question. Are you aware of any programs that actually depend on deferred allocation under linux? (I don’t know of another platform that does it). In theory something might allocate a huge array and then go use only some of the elements without initialising the rest, but I’ve never knowingly encountered anything like that. I’d be very interested in learning where this would ever be considered practical.

  13. Hadrien November 13, 2012 / 7:44 am

    @kdepepo:

    > when a caching process would request cache RAM for data of unknown cost, the system wouldn’t know if it should reject the cache request or invalidate some least costly cached data instead.

    You only need to select one of the already filled caches and find the ones with least cost. malloc simply returns a valid virtual pointer. If you never fill the block, there is no physical storage used. If you fill it, you know the time that was needed to fill it.

    Of course you can reject keeping the pages later, effectively forcing the application to generate/convert the data again and again, but you do not need to reject them “in advance”.

    Wouldn’t this only solve the problem in one specific situation, that where a process has allocated memory and not used it at all? If a process allocates memory and only fills part of it, then as far as I can tell the system is back to square one: even if the unfilled part remains unallocated due to paging magic, the system still has one “partially filled cache” of unknown cost on its hands and cannot tell if it can get rid of it as needed or not.

  14. Hadrien November 13, 2012 / 8:30 am

    Well, sometimes the pro-OOM killer view is that it’s just a matter of getting the right heuristic, but my opinion is that having any heuristic at all in the kernel for killing off innocent processes is the wrong approach. It implicitly kills off processes that were not involved in the allocation that triggered the OOM killer.

    For me, it boils down to this: my application is well behaved, it uses the resources that were approved by the OS, now the OS is turning around and killing my process because *it* failed to adequately manage resources in the first place. It’s like claiming eminent domain on something that’s already been promised to me. It’s immoral, don’t you see :-)

    I’ll try my best to embrace whatever OOM model you put forward! But I’d like to see every reasonable effort go into avoiding the killer in the first place, like pre-emptively denying memory allocation requests of non-critical tasks if they’re likely to trigger the OOM killer later, maybe having some ram reserved for critical services (don’t linux and solaris reserve ram for this?). Ask processes to voluntarily go into hibernation and/or free resources. Etc.

    Consider me a moderate pro-OOM advocate myself, perhaps, in that

    1. I tend to agree that killing processes should be avoided if reasonably possible, but
    2. Since computers have finite storage resources (think low-end tablets with their 8GB of mass storage versus 1GB of RAM), even with the best countermeasure, there can always be a situation in which it will be a choice between killing one innocent process or trashing the whole running OS, and in such a situation I’ll pragmatically kill the innocent guy.

    I agree, however, that if swap can be made less useless in some ways, it could be an efficient countermeasure: think warning the user as soon as swap usage (as measured by a combination of swapping-related disk activity and maximal disk usage over a given time period) gets abnormally high, suggest options to free up memory based on the automatic killer heuristic, and if he chooses to do nothing… well, as long as we have made sure to warn him soon enough, whatever happens when the swap is full is his problem.

    “I agree that the idea of per-application swap files has potential, but what would be the advantage of fully hibernating entire applications instead of just using these files like “regular” swap?”

    Well, think in terms of “fine grained sessions”. A system wide pagefile typically creates a huge session containing every process running on the system. We can hibernate the system as a whole, but in multi-user scenarios (and sometimes even single user scenarios) that’s often not what we want. For example, a user locks his session, leaving web browsers & documents open. Upon the next login, these are immediately available because they were paged out to the global swap. But the whole time the user was away, these processes continue to consume global swap resources. This may even lead subsequent users to experience OOM conditions, which is especially unfortunate considering that the first user’s processes really should not be consuming swap resources while hibernating and their user is away. Of course, the OOM handler could kill the hibernating processes, but if they resided in local user/process swap instead, the problem wouldn’t have arisen at all.

    That didn’t quite answer my question, perhaps I worded it inappropriately.

    We can have per-process or per user swap files, with the advantages that you describe. But that doesn’t require per-application hibernation functionality (i.e. the application state survives across reboots while that of system processes is reinitialized), which sounds much more difficult to implement altogether. It’s just that: a regular swap file, which only the caller process can allocate into, and which is accounted by the system as per-application disk usage.

    So, what would be the advantage of “full” software hibernation as compared to that?

    “Further reflexion suggested that this wouldn’t work very well without explicit system and application support, though…That is because processes tend to rely on a fair share of run-time system state, such as the PID of other processes which they are communicating with, unique “handle” identifiers which have been handed to them by lower-level system services, tasks which said system services are currently performing for them”

    I’d argue that PIDs are racy anyways, it’s poor practice to hold one for any extended length of time whether hibernation is involved or not. Unfortunately it’s standard practice because some apis provide no safe alternatives. Depending on the contents of /proc/sys/kernel/pid_max, wraparounds may pose a serious problem to someone who’s killing processes from ‘ps’ or ‘top’.

    Well, what other indicator would you use to identify the specific instance of a given system process that is currently processing a task from a user process? String filenames are fine for bootstrapping, but they might quickly get costly for run-time checks and IPC requests…

    Beyond these global ids however, I don’t think there’s a problem saving and resuming resource handles on linux because each process has its own handle table within the kernel anyways. The file handle the process sees is simply an index into the process’s file/socket/etc table in the kernel starting with the initial #0-2 (stdin/stdout/stderr). I think it’d be pretty easy to save the contents of this table upon hibernation, and then reopen the files upon reload. As far as the process is concerned, the same file handles would still be valid before and after hibernation. Any errors (esp. sockets) could just leave the handle in a closed/errored state, and the process would naturally get an IO error on the next call.

    Of course, this may be much trickier to pull off in a microkernel, I won’t pretend that I know where to start :-)

    That’s precisely this microkernel issue which I am worried about. Since Linux is a monolithic and coherent (well, somewhat) construct, it’s easy to enforce a system-wide policy like “when a process is hibernated, I want to be able to save all of its state, including that stored in system processes, to a file”. On a microkernel where system components can theoretically be implemented independently and should not rely on each other’s implementation, however, such system-wide features become quite costly. Which is why I would avoid them unless I can truly justify the cost to myself in some way.

    (Example of such a self-justification : “Power management is costly to implement, but necessary because without it, anything battery-powered will run like shit” :))

  15. Alfman November 13, 2012 / 5:10 pm

    Hadrien,

    “even with the best countermeasure, there can always be a situation in which it will be a choice between killing one innocent process or trashing the whole running OS”

    I don’t think this is implicitly true. In a kernel where all operations are designed to be atomic, all operations can fail in a safe way with no negative side effects (other than a service request returning an exception, of course). Using something like NT’s structured kernel exceptions could make it possible to handle faults anywhere. It ought to work with a micro-kernel as well, however I concede this might be difficult and out of scope with TOSP goals.

    “and in such a situation I’ll pragmatically kill the innocent guy.”

    Just to play devil’s advocate, there may be some cases where rebooting the system would be preferable because having a system return to a known state could be superior to having a “running” system where a needed process got killed. Say, a Tivo where the main process is heuristically killed when a smaller process tried allocating memory, or a remote server where all processes are necessary. There’s the obligatory pacemaker example too :-) Of course the developers of these things ought to take extraordinary steps anyways, like some kind of kill-proof restart daemon or custom kernel.

    Trying to reach a conclusion… I’d feel better about an OOM killer that was a special process rather than one with logic hard coded into the kernel, is that something you could do? Do you think there’d be a way to guarantee an OOM-killer process could be signaled to do its duty under OOM conditions?

    “So, what would be the advantage of “full” software hibernation as compared to that?”

    Oh, nothing I guess, my misunderstanding.

    “Well, what other indicator would you use to identify the specific instance of a given system process that is currently processing a task from a user process?”

    I think the right way to avoid race conditions would be to treat processes as any other resources and return handles to them that could be held indefinitely. This needn’t change anything using an OOP design, the object would use a handle instead of a pid, though obviously a handle needs to be closed. The handle itself might be used as a process iterator similar to a dirfd.

    “That’s precisely this microkernel issue which I am worried about…On a microkernel where system components can theoretically be implemented independently and should not rely on each other’s implementation,”

    Well, in theory each component could implement a standard API for serializing and deserializing state for held resources. “Serialize the resources held by this/these handles, unload them now and pass me the information. At a later time, I’ll supply the same information, you can deserialize it and reconstitute the previous state” This process could be recursive such that components relying on other components could also be serialized and hibernated. This way, any process which relies exclusively on components that support this hibernation API could be hibernated and resumed at will.

    On the one hand, I can see how this is a non-goal at this point, and it can always be added later. On the other hand, having transparent & native application hibernation support could change the entire premise for an OS UI: save & open might become hibernate & resume.

    Note: I’m not expecting you to take this seriously at this point, but it’s an idea to keep on the back shelf.

  16. Hadrien November 14, 2012 / 9:07 am

    “even with the best countermeasure, there can always be a situation in which it will be a choice between killing one innocent process or trashing the whole running OS”

    I don’t think this is implicitly true. In a kernel where all operations are designed to be atomic, all operations can fail in a safe way with no negative side effects (other than a service request returning an exception, of course). Using something like NT’s structured kernel exceptions could make it possible to handle faults anywhere. It ought to work with a micro-kernel as well, however I concede this might be difficult and out of scope with TOSP goals.

    So, if I follow your reasoning…

    1. System services should only need extra RAM when requested to do something.
    2. Any service request should come with error reporting capabilities.
    3. Therefore, system services can always fail nicely and leave the caller with errors, so critical RAM allocations (and the killing it generates in OOM conditions) are unnecessary.

    I agree with this in principle. In practice, however, I believe choosing to go down this path will likely put some significant extra design overhead on service and/or application design.

    For a simple example, let’s take IPC. In a microkernel design, IPC is omnipresent: if a process is not able to perform it, it is isolated in its own small world and unable to even report errors on a GUI (well, if GUIs are managed by an isolated process like I want, at least). To avoid this undesirable “autistic hung process” situation, at least some IPC channels must in turn be guaranteed not to use any dynamic memory allocation, and that can be a strong design constraint.

    Generalizing this, any system service which is relied on when reporting errors must be designed in an OOM-safe way. That likely includes such things as the filesystem (for software which reports errors through written logs), some GUI functionality (for software which reports errors through dialogs or notifications), and the underlying system services which each of these fellows depend on.

    And that’s only counting the most commonly used error reporting mechanisms. The proper way to design things would probably be to have a flexible system-wide error reporting system (like stderr, but with many more features), have application developers know that if they don’t use it their error messages are not guaranteed to work, and make sure that neither it nor any of the components it relies on would critically need to allocate RAM at some point.

    Then there is the whole family of drivers whose operation is driven by hardware rather than by user processes, in which case returning error codes is generally impossible. Think hardware hot-plugging, power management, input A/V streams… All those should also do their best not to allocate memory under any circumstances, and in some cases (such as input data streams) I wouldn’t know how to do it except for the “make a big enough buffer and use the ostrich algorithm” option.

    Would you see a way to avoid all this complexity?

    “and in such a situation I’ll pragmatically kill the innocent guy.”

    Just to play devil’s advocate, there may be some cases where rebooting the system would be preferable, because having the system return to a known state could be superior to having a “running” system where a needed process got killed. Say, a TiVo where the main process is heuristically killed when a smaller process tries allocating memory, or a remote server where all processes are necessary. There’s the obligatory pacemaker example too :-) Of course, the developers of these things ought to take extraordinary steps anyway, like some kind of kill-proof restart daemon or custom kernel.

    I believe I can answer this in at least three different ways:

    1. For such situations where a seemingly mundane user process is known in advance to matter a lot, the Linux OOM killer allows some processes to be considered more important than others as far as killing is concerned. I could likely do the same. To make sure that the underlying system services are also preserved, the “IPC OOM killer grade” of a system process could be defined as the sum of the OOM killer grades of all processes which rely on it, instead of just the number of such processes (see the sketch after this list).
    2. I do not target those critical systems which cannot fail under any circumstance ever, since my main target is after all desktops and other kinds of personally owned, interactive systems, and going for the former involves a whole new world of pain (hard real time constraints, full hardware self-tests and graceful operation when a part has failed…) which I am not prepared to deal with.
    3. Wouldn’t a rebooting pacemaker or set-top box fall in the category of “things that should never happen under any circumstances” already? I’d honestly prefer to kill that leaking Firefox process which the user has been running on his pacemaker for some weird reason rather than to reboot the whole thing :)
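
    To make that grading rule from point 1 concrete, here is a minimal sketch (hypothetical structures, and it assumes the dependency graph is acyclic):

        #include <stddef.h>

        /* A system service inherits the "badness" of its clients, so the OOM
           killer will not pick a service that more important processes still
           depend on. (Hypothetical structures, acyclic dependency graph.) */
        struct process {
            int              own_grade;        /* grade of a plain user process        */
            struct process **dependents;       /* processes that rely on this service  */
            size_t           dependent_count;
        };

        int oom_grade(const struct process *p)
        {
            if (p->dependent_count == 0)
                return p->own_grade;
            int grade = 0;
            for (size_t i = 0; i < p->dependent_count; i++)
                grade += oom_grade(p->dependents[i]);   /* sum, not just a head count */
            return grade;
        }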

    Trying to reach a conclusion… I’d feel better about an OOM killer that was a special process rather than one with logic hard-coded into the kernel; is that something you could do? Do you think there’d be a way to guarantee that an OOM-killer process could be signaled to do its duty under OOM conditions?

    Well, we go back to the “malloc-free IPC” problem… If we are running out of memory, we cannot allocate extra memory for the extra IPC packet, so that memory must somehow have been allocated in advance, or there must be a malloc-free IPC method that is good enough for full communication between the kernel and the OOM killer.

    Just out of curiosity, why would you put that in a separate process though, apart from “so that I can ask it to just kill itself”? :) So far, my criterion for picking kernel functionality has been “if it is essential to basic process operation, then it fits”, and since I consider memory allocation to be an essential service, I would tend to put services which are necessary to memory allocation (such as an OOM killer, if present) in the kernel as well.

    “Well, what other indicator would you use to identify the specific instance of a given system process that is currently processing a task from a user process?”

    I think the right way to avoid race conditions would be to treat processes like any other resource and return handles to them that could be held indefinitely. This needn’t change anything in an OOP design: the object would use a handle instead of a PID, though obviously a handle needs to be closed. The handle itself might be used as a process iterator, similar to a dirfd.

    I’m not very fond of this option because it means having two separate abstractions for users and developers (users see PIDs in task managers, developers see handles) that do pretty much the same thing. But what is the race issue with PIDs anyway?

    “That’s precisely this microkernel issue which I am worried about…On a microkernel where system components can theoretically be implemented independently and should not rely on each other’s implementation,”

    Well, in theory each component could implement a standard API for serializing and deserializing state for held resources: “Serialize the resources held by this/these handles, unload them now and pass me the information. At a later time, I’ll supply the same information, and you can deserialize it and reconstitute the previous state.” This process could be recursive, so that components relying on other components could also be serialized and hibernated. This way, any process which relies exclusively on components that support this hibernation API could be hibernated and resumed at will.

    The “any process which relies exclusively on components that support this hibernation API” (and supports this API itself) part is worrying me a bit, because it means that not all processes are equal regarding hibernation, and then I would like to ask what criteria should be applied to determine which processes should be able to hibernate and which shouldn’t.

    If we just dictate that all processes should support hibernation, then at least the problem is simpler, since the question becomes a combination of 1/”How will I force developers to support this” and 2/”Why should they do it?”

    On the one hand, I can see how this is a non-goal at this point, and it can always be added later. On the other hand, having transparent & native application hibernation support could change the entire premise for an OS UI: save & open might become hibernate & resume.

    “Changing the entire premise for an OS UI” reads like something which I would write in order to demand extra funding for my research :) I will need more precise descriptions than that to make informed decisions. Would it be something like how iOS and Android phones sometimes ask applications to transparently save their state before silently killing them when there’s not enough RAM (an option which requires explicit app support, but which is arguably more efficient than pure swap on devices with small amounts of mass storage, like phones), except again in a fashion that survives reboots?

    Note: I’m not expecting you to take this seriously at this point, but it’s an idea to keep on the back shelf.

    Well, sure, but if you’re inspired to talk about it right now, this blog page will remain here as a reminder later :)

  17. Alfman November 17, 2012 / 8:38 am

    Hadrien,

    “To avoid this undesirable ‘autistic hung process’ situation, at least some IPC channels must in turn be guaranteed not to use any dynamic memory allocation, and that can be a strong design constraint.”

    My impression is that usually kernels don’t need dynamic memory allocation to perform IPC on an established channel. Ie, data is transferred from one pre-allocated buffer to another. Is this incorrect?

    “Generalizing this, any system service which is relied on when reporting errors must be designed in an OOM-safe way. That likely includes such things as the filesystem (for software which reports errors through written logs), some GUI functionality…”

    You’re not wrong; however, OOM is but one exception and you eventually have to answer the question more generically: how should the logger handle its own exceptions? It obviously cannot log them… Maybe the logger needs to throw its own exceptions due to the failure of underlying subsystems like disk, file system, SMTP outages, system file limits, etc.

    “Well, we go back to the ‘malloc-free IPC’ problem… If we are running out of memory, we cannot allocate extra memory for the extra IPC packet, so that memory must somehow have been allocated in advance, or there must be a malloc-free IPC method that is good enough for full communication between the kernel and the OOM killer.”

    I’m not entirely sure there is a malloc-free problem, but then I’m not intimately familiar with your IPC even though we’ve talked about it. Assuming there was a problem, a solution might be to have the OOM killer initiate the IPC to the kernel, have it normally block, and only return when an OOM condition occurs. All other processes could be blocked until the OOM killer finishes its dirty deeds and re-establishes its blocked IPC link. Wouldn’t this help ensure that the OOM killer has enough resources for its own IPC?

    “Just out of curiosity, why would you put that in a separate process though, apart from ‘so that I can ask it to just kill itself’?”

    It’s just my opinion that a sysadmin should be able to configure their own local OOM killer algorithms without recompiling a custom kernel (aka policy does not belong in the kernel). Kernel level modules/dlls would do the trick, but are those even on the drawing board?

    “But what is the race issue with PIDs anyway?”

    PIDs can be/are reused (my Linux system wraps at 32K) without any assurance that those who hold on to them have “forgotten” about them. Consider this scenario: the PID counter has wrapped and the last allocated PID was 97. I intend to kill a long-running task because it’s consuming too many resources. I do “ps” and get its id=100; ps itself was 98. Now I type “kill 100”, but before I hit enter, PID 100 ended normally or got killed by someone else, and some other task is now using PID 100. Therefore the kill signal goes to the wrong process. The odds of this happening increase greatly on multiuser systems with many tasks, where a few hundred PIDs might be issued between the time I observe the PID and the time I execute the signal.

    Although it might be unbearable to use, 64-bit PIDs would greatly reduce the odds that a process would end at about the same time the counter approaches its own PID, since it would take a very long time just to count that high even once; 32-bit PIDs help to a much lesser degree. You might try to tombstone PIDs for some length of time to prevent their immediate reuse, but I still think that’s an ostrich algorithm, as you called it.
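
    One mitigation along the 64-bit line, sketched with hypothetical names (not something either of us has committed to): pack a recyclable slot index together with a generation counter, so a reused slot yields a different identifier and a stale one can be rejected instead of silently signalling the wrong process.

        #include <stdint.h>

        #define SLOT_BITS 20u    /* up to ~1M live processes; purely illustrative */

        /* 64-bit process identifier = [ generation | slot index ] */
        static inline uint64_t make_pid(uint32_t slot, uint64_t generation)
        {
            return (generation << SLOT_BITS) | (uint64_t)slot;
        }

        static inline uint32_t pid_slot(uint64_t pid)
        {
            return (uint32_t)(pid & ((1u << SLOT_BITS) - 1u));
        }

        static inline uint64_t pid_generation(uint64_t pid)
        {
            return pid >> SLOT_BITS;
        }

        /* kill(pid) would then check that the generation stored in the slot
           still matches pid_generation(pid) before delivering the signal. */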

    “I would like to ask what criteria should be applied to determine which processes should be able to hibernate and which shouldn’t.”

    Assuming the kernel has the ability to track a process’s resource handles, couldn’t it look up all referenced services to check whether hibernation is supported by them?

    “If we just dictate that all processes should support hibernation, then at least the problem is simpler, since the question becomes a combination of 1/’How will I force developers to support this’ and 2/’Why should they do it?'”

    Well, only the processes that export stateful services to other processes would need to explicitly support hibernation – most user processes don’t fit under that classification. As for enforcement, that’s a political problem :) In theory though, if hibernation were a common occurrence (hibernate and resume things on the fly during normal use), then it’d become apparent rather quickly that a service lacking hibernation support is registering exceptions.

  18. Hadrien December 3, 2012 / 9:09 pm

    My impression is that usually kernels don’t need dynamic memory allocation to perform IPC on an established channel. Ie, data is transferred from one pre-allocated buffer to another. Is this incorrect?

    IPC buffer size is best chosen so that this assumption holds most of the time, if hardware allows it, but I don’t think that the “malloc will never be needed” requirement can always be strictly enforced in a nonblocking IPC system.

    If, as an example, during a burst of intense activity, a sender ends up throwing IPC packets faster than their recipients can process them, a filled buffer is to be expected at some point. At this point, there are three options: either allocate more buffer space (which cannot be done without killing innocents in OOM conditions), reject the IPC request with the hope that the client will try again later (which is inappropriate for that “cannot fail” functionality which logging arguably belongs to), or block the sender while the recipient processes requests.

    For critical system functions, falling back to blocking is arguably the best option, since later is better than never. But by doing so, the OS breaches its nonblocking IPC contract with processes. Since those weren’t designed with the pitfalls of blocking in mind, they may then do all sorts of things that are inappropriate in a blocking environment, such as using IPC calls in time-critical tasks or engaging in deadlock-inducing behaviours.

    So we end up again seeing a dichotomy between those processes which must do their job no matter the cost and those for which gentle failure is possible, each needing specialized tools to do their job. Only this time, it’s about IPC, not memory allocation. I guess that picking which is better is a matter of design choices…

    You’re not wrong; however, OOM is but one exception and you eventually have to answer the question more generically: how should the logger handle its own exceptions? It obviously cannot log them… Maybe the logger needs to throw its own exceptions due to the failure of underlying subsystems like disk, file system, SMTP outages, system file limits, etc.

    Indeed. OOM is sadly only the first of many hard-to-handle exceptions which I’ll have to take care of :) Overall, it seems unfair that hardware-related exceptions are so hard to handle properly when they are so rarely encountered in real life. I don’t think it’ll be possible to devise a generic way to handle them due to their very system-disrupting nature, though.

    I’m not entirely sure there is a malloc-free problem, but then I’m not intimately familiar with your IPC even though we’ve talked about it. Assuming there was a problem, a solution might be to have the OOM killer initiate the IPC to the kernel, have it normally block, and only return when an OOM condition occurs. All other processes could be blocked until the OOM killer finishes its dirty deeds and re-establishes its blocked IPC link. Wouldn’t this help ensure that the OOM killer has enough resources for its own IPC?

    Sure, it could even be done fairly easily without a need for OOMK-specific IPC primitives: kernel grabs a mutex, OOM killer attempts to grab it too once it’s ready to operate, and kernel releases the mutex as soon as an OOM condition occurs. By storing the required process descriptor database in the same shared memory region where the mutex lies, it would also be possible to reduce the odds that the killer will itself run into OOM conditions while performing its dirty duty.
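
    A user-space sketch of that handshake, using a process-shared POSIX mutex as a stand-in for whatever kernel primitive would actually be used (all names hypothetical):

        #include <pthread.h>

        /* Shared region mapped by both the kernel and the OOM killer process.
           (User-space analogue of the handshake; hypothetical layout.) */
        struct oom_shared_region {
            pthread_mutex_t oom_gate;   /* held by the kernel in normal times */
            /* ... pre-allocated process descriptor snapshot lives here too ... */
        };

        /* Kernel side, at boot: take the gate and keep it until memory runs out. */
        void oom_gate_init(struct oom_shared_region *r)
        {
            pthread_mutexattr_t attr;
            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
            pthread_mutex_init(&r->oom_gate, &attr);
            pthread_mutex_lock(&r->oom_gate);
        }

        /* Kernel side, when an OOM condition is detected: open the gate. */
        void oom_gate_open(struct oom_shared_region *r)
        {
            pthread_mutex_unlock(&r->oom_gate);
        }

        /* OOM killer side: blocks here until the kernel opens the gate, then
           works only on the pre-allocated snapshot, never calling malloc(). */
        void oom_killer_wait_and_run(struct oom_shared_region *r)
        {
            pthread_mutex_lock(&r->oom_gate);
            /* ... pick and kill a victim using the shared descriptor database ... */
        }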

    It’s just my opinion that a sysadmin should be able to configure their own local OOM killer algorithms without recompiling a custom kernel (aka policy does not belong in the kernel). Kernel level modules/dlls would do the trick, but are those even on the drawing board?

    Not even on the design radar right now, since I fail to see a satisfactory use case for them. Most of the time, when a kernel bit is independent enough from the rest that it can be torn off into a module with little pain, it might as well be put in a separate process, unless there is a very good reason not to do so.

    PIDs can be/are reused (my Linux system wraps at 32K) without any assurance that those who hold on to them have “forgotten” about them. Consider this scenario: the PID counter has wrapped and the last allocated PID was 97. I intend to kill a long-running task because it’s consuming too many resources. I do “ps” and get its id=100; ps itself was 98. Now I type “kill 100”, but before I hit enter, PID 100 ended normally or got killed by someone else, and some other task is now using PID 100. Therefore the kill signal goes to the wrong process. The odds of this happening increase greatly on multiuser systems with many tasks, where a few hundred PIDs might be issued between the time I observe the PID and the time I execute the signal.

    Although it might be unbearable to use, 64-bit PIDs would greatly reduce the odds that a process would end at about the same time the counter approaches its own PID, since it would take a very long time just to count that high even once; 32-bit PIDs help to a much lesser degree.

    Sounds like a design oversight from Linux to me, or perhaps something that was done with the aim to keep compatibility with ill-designed Unix programs that assume the size of a PID to be 16 bits or write stuff in the high-order bits of their PID-related requests. There is no reason not to use PIDs that are at least 32 bits large these days, which would address the deficiency you point out except on very large-scale systems, which I don’t target anyway.

    You might try to tombstone PIDs for some length of time to prevent their immediate reuse, but I still think that’s an ostrich algorithm as you called it.

    Credit for that awesome expression goes to Andy Tanenbaum, who introduced it in Modern Operating Systems in a discussion of deadlocks and how most modern OSs handle them. :)

    “I would like to ask what criteria should be applied to determine which processes should be able to hibernate and which shouldn’t.”

    Assuming the kernel has the ability to track a processes’ resource handles, then couldn’t it look up all referenced services to check whether hibernation is supported by them?

    Oh, I’m not disputing that the system could find out on its own which processes are capable of hibernation and which aren’t; what I was thinking of was more along the lines of “In which circumstances should a software developer take the time to painfully implement hibernation in his projects, and would the benefits be worth the cost in terms of OS design complexity and potential state degradation across reboots (if all handles are to be maintained)?”

    Well, only the processes that export stateful services to other processes would need to explicitly support hibernation – most user processes don’t fit under that classification. As for enforcement, that’s a political problem :) In theory though, if hibernation were a common occurrence (hibernate and resume things on the fly during normal use), then it’d become apparent rather quickly that a service lacking hibernation support is registering exceptions.

    That is true; what remains of my previous question is thus only the “What would I benefit from the pain of implementing that?” part.

  19. Alfman December 4, 2012 / 10:06 am

    neolander,

    “IPC buffer size is best chosen so that this assumption holds most of the time, if hardware allows it, but I don’t think that the “malloc will never be needed” requirement can always be strictly enforced in a nonblocking IPC system.”

    I’m not really familiar with hardware that needs dynamic allocation on the fly. Of course the drivers might do it, but the hardware?

    “If, as an example, during a burst of intense activity, a sender ends up throwing IPC packets faster than their recipients can process them, a filled buffer is to be expected at some point.”

    That’s true; however, I don’t see this as specific to OOM. IPC code has to be prepared to block if writes are not preceded by a writable socket test. All async code should do this anyway and should not be assuming unlimited write buffers, so I’m not sure if there’s any additional burden to support the OOM case?
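
    For reference, the “writable socket test” on a POSIX-style descriptor looks something like the sketch below (poll() here is just an illustrative stand-in for whatever notification primitive TOSP ends up using):

        #include <poll.h>
        #include <sys/types.h>
        #include <unistd.h>

        /* Wait until `fd` can accept more data, then write, instead of blocking
           indefinitely on a full buffer. Returns bytes written or -1 on error. */
        ssize_t write_when_writable(int fd, const void *buf, size_t len)
        {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0)    /* block until writable (or error) */
                return -1;
            return write(fd, buf, len);   /* may still be a partial write */
        }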

    “At this point, there are three options: either allocate more buffer space (which cannot be done without killing innocents in OOM conditions), reject the IPC request with the hope that the client will try again later (which is inappropriate for that “cannot fail” functionality which logging arguably belongs to), or block the sender while the recipient processes requests.”

    Are you talking about userspace or kernel requesting more memory for buffers? Weren’t you going to use shared memory for your IPC implementation? If that’s still the case, does the kernel need any buffers at all to aid in IPC?

    I found a link about some typical buffer sizes. I’m not really sure whether these buffers are allocated on the fly, or upon initiation. You could support minimum guaranteed buffers without depending on dynamic allocation though.

    http://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer

    “Sure, it could even be done fairly easily without a need for OOMK-specific IPC primitives: kernel grabs a mutex, OOM killer attempts to grab it too once it’s ready to operate, and kernel releases the mutex as soon as an OOM condition occurs. “

    The main reason I suggested a blocking IPC call, though, is that we would know that the memory needed for the OOMK’s IPC channel is already allocated and could be reused for other calls. But your idea seems more normal.

    “Sounds like a design oversight from Linux to me, or perhaps something that was done with the aim to keep compatibility with ill-designed Unix programs that assume the size of a PID to be 16 bits or write stuff in the high-order bits of their PID-related requests.”

    A guess is that there was a performance tradeoff between using an array versus a sparse data structure such as a hash table. Internal kernel PID lookups are undoubtedly very common, because I think the PID is the primary key used by Linux.

    “That is true; what remains of my previous question is thus only the “What would I benefit from the pain of implementing that?” part.”

    Is it necessary? I guess not. But having fine-grained hibernation buys us the ability to migrate an application between devices, which would be an awesome feature to me.

    Of course, it’s more important to get something working than to spin wheels on details and expanding scope!

  20. Hadrien December 4, 2012 / 8:55 pm

    This is a great day, for this morning I have managed to nearly empty my personal and professional mailboxes. Does not happen that often.

    “IPC buffer size is best chosen so that this assumption holds most of the time, if hardware allows it, but I don’t think that the “malloc will never be needed” requirement can always be strictly enforced in a nonblocking IPC system.”

    I’m not really familiar with hardware that needs dynamic allocation on the fly. Of course the drivers might do it, but the hardware?

    Actually, I was mentioning the existence of hardware constraints, such as the finite nature of RAM, that put an upper limit on the default IPC buffer size :)

    “If, as an example, during a burst of intense activity, a sender ends up throwing IPC packets faster than their recipients can process them, a filled buffer is to be expected at some point.”

    That’s true; however, I don’t see this as specific to OOM. IPC code has to be prepared to block if writes are not preceded by a writable socket test. All async code should do this anyway and should not be assuming unlimited write buffers, so I’m not sure if there’s any additional burden to support the OOM case?

    I was under the impression that using some variant of a dynamically growing array as a buffer would have been a more flexible option in such a case; why wouldn’t it work? Code that constantly checks the state of IPC buffers before doing anything sounds pretty unreadable to me.

    “At this point, there are three options: either allocate more buffer space (which cannot be done without killing innocents in OOM conditions), reject the IPC request with the hope that the client will try again later (which is inappropriate for that “cannot fail” functionality which logging arguably belongs to), or block the sender while the recipient processes requests.”

    Are you talking about userspace or kernel requesting more memory for buffers? Weren’t you going to use shared memory for your IPC implementation? If that’s still the case, does the kernel need any buffers at all to aid in IPC?

    I think that whether the allocation is performed in the kernel or in user space is probably irrelevant in this context, since we’re talking about a privilege-agnostic design decision: an IPC subsystem that is located between process A and process B, either in the form of a library or as a kernel functionality, receives a request that it cannot satisfy right now due to a lack of memory. The question is whether it should abort with an IPC-specific error code or wait potentially indefinitely for the memory outage situation to resolve itself.

    I found a link about some typical buffer sizes. I’m not really sure whether these buffers are allocated on the fly, or upon initiation. You could support minimum guaranteed buffers without depending on dynamic allocation though.

    http://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer

    Well, my current idea would be to allocate a medium-sized buffer that should be good enough for most uses at the time when an IPC connection of some sort is created, then allocate extra memory as needed and garbage-collect it once it isn’t needed anymore. But it seems that there would be some issues with that design which I had not foreseen.

    “Sounds like a design oversight from Linux to me, or perhaps something that was done with the aim to keep compatibility with ill-designed Unix programs that assume the size of a PID to be 16 bits or write stuff in the high-order bits of their PID-related requests.”

    A guess is that there was a performance tradeoff between using an array versus a sparse data structure such as a hash table. Internal kernel PID lookups are undoubtedly very common, because I think the PID is the primary key used by Linux.

    Indeed, I have been wondering from time to time about how good PID lookup performance can be achieved without using arbitrary limitations myself… But the underlying problem of building efficient sparse arrays has been studied in so much detail throughout the history of computing that I’m sure theoreticians have come up with great solutions in the past. The only problem is to find the right one within the huge mess of data structures that have been invented across the ages of computing (associative arrays, self-organizing linked lists, hash tables, VLists…) :)

    “That is true; what remains of my previous question is thus only the “What would I benefit from the pain of implementing that?” part.”

    Is it necessary? I guess not. But having fine-grained hibernation buys us the ability to migrate an application between devices, which would be an awesome feature to me.

    Do you mean something like moving running software’s state from one machine to another without closing it? Sounds like something that can only be achieved using VMs to me, or am I missing something?

    Of course, it’s more important to get something working than to spin wheels on details and expanding scope!

    Yup! I’m sure you’ll love the upcoming technical post which I have drafted today, and if the spare time avalanche continues I might even manage to complete the Github migration and go back to implementation matters before Christmas…

  21. Alfman December 5, 2012 / 3:55 am

    Hadrien,

    “Actually, I was mentioning the existence of hardware constraints, such as the finite nature of RAM, that put an upper limit on the default IPC buffer size :)”

    Oh, OK. What is the minimum RAM (virtual or physical) that you are targeting?

    I suspect making the buffers too big will perform worse if live data overflows the CPU caches. I haven’t performed benchmarks on this.

    “I was under the impression that using some variant of a dynamically growing array as a buffer would have been a more flexible option in such a case; why wouldn’t it work? Code that constantly checks the state of IPC buffers before doing anything sounds pretty unreadable to me.”

    That’s fine, but to what end do you want to increase the buffer size because the receiver is slower than the sender? For example: cat /dev/sdc | bzip -c > hdimage.bzip.

    1. Unlimited buffer size, so the sender never blocks (this breaks down quickly due to the fact that sdc >> ram).
    2. Limited buffer size, the sender is throttled by blocking
    3. Limited buffer size, the sender is throttled by waiting for a writable socket notification.

    You’d be right that async code can be less readable, because the code flow is broken up into callbacks/event handlers rather than using the more traditional threaded code. For async, this generally needs to be done; otherwise it can block and isn’t true async.

    In theory, a blocking thread could get away with zero buffering because the kernel would be free to read data directly from the blocked thread’s memory and write it directly to the receiver’s buffer.

    “The question is whether it should abort with an IPC-specific error code or wait potentially indefinitely for the memory outage situation to resolve itself.”

    I’m OK with either way; obviously your OOMK will ultimately try to avoid a permanent condition. It’s (still) not clear to me whether an established IPC channel would ever need to fail under OOM conditions, but the devil is in the details.

    “Indeed, I have been wondering from time to time about how good PID lookup performance can be achieved without using arbitrary limitations myself…”

    Well, you might never have to use PIDs internally to the kernel at all. Everywhere you’d use a PID, just use a pointer to the process record instead. A PID lookup would only be necessary for userspace calls.
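
    A sketch of that split, with hypothetical names: inside the kernel everything passes struct process pointers around, and a single lookup table is only consulted when a PID crosses the user/kernel boundary.

        #include <stddef.h>
        #include <stdint.h>

        #define PID_TABLE_SIZE 1024   /* illustrative; a real table would be larger */

        struct process {
            uint64_t        pid;
            struct process *parent;   /* kernel-internal references are pointers... */
            /* ... */
        };

        static struct process *pid_table[PID_TABLE_SIZE];   /* ...PIDs only index this */

        /* Only called at the syscall boundary, when user space hands us a PID.
           A real implementation would handle collisions (e.g. chaining); this
           direct-mapped version is kept short on purpose. */
        struct process *process_from_pid(uint64_t pid)
        {
            struct process *p = pid_table[pid % PID_TABLE_SIZE];
            return (p != NULL && p->pid == pid) ? p : NULL;  /* stale PID -> NULL */
        }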

    “Do you mean something like moving running software’s state from one machine to another without closing it? Sounds like something that can only be achieved using VMs to me, or am I missing something?”

    Kind of. If you hibernate the process and save the state of its resource dependencies using the driver hibernation API we talked about earlier, then the process could be resumed elsewhere.

    For example, a game running at home has a process memory image + open files + IPC handles for I/O. Under the premise that all drivers support serialization and deserialization of state on behalf of individual processes entering hibernation, there’s no reason it couldn’t be “migrated” to a computer at work if the same dependencies existed at the target. The state could just be stored as XML files (i.e. handle#5=game.dat RO, handle#6=game.cfg RW, handle#7=x11 RW, etc). Rsync could become a tool for migrating whole application instances!

    Going even further, it might even be possible to migrate between different versions of one dependency if the serialised state information is compatible between the two. This potentially allows us to upgrade dependencies in place without even restarting our programs! Wow! :)

    If you were to take this seriously, I’d propose four “hibernation” modes (sketched as an enum after this list):
    1. Running
    2. Paused
    3. Partial hibernation, still comes up as a system process, resources inactive but references intact.
    4. Full hibernation, all state information saved, no trace of process.
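
    As a plain enum, just to pin the vocabulary down (hypothetical, of course):

        /* Hypothetical process hibernation states matching the four modes above. */
        enum hibernation_mode {
            PROC_RUNNING,             /* normal execution                              */
            PROC_PAUSED,              /* scheduled out, everything still resident      */
            PROC_PARTIAL_HIBERNATION, /* still listed as a process; resources inactive,
                                         but references kept intact                    */
            PROC_FULL_HIBERNATION     /* all state serialized to storage, nothing in RAM */
        };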

  22. Hadrien December 9, 2012 / 5:58 pm

    Oh, OK. What is the minimum RAM (virtual or physical) that you are targeting?

    Considering the time it’d take for this project to get anywhere close to an interesting public release, I think asking for 1GB of RAM as a minimum is reasonable. RAM is getting so cheap that even some cellphones are starting to have that much these days…

    That’s fine, but to what end do you want to increase the buffer size because the receiver is slower than the sender? For example: cat /dev/sdc | bzip -c > hdimage.bzip.

    1. Unlimited buffer size, so the sender never blocks (this breaks down quickly due to the fact that sdc >> ram).
    2. Limited buffer size, the sender is throttled by blocking
    3. Limited buffer size, the sender is throttled by waiting for a writable socket notification.

    You’d be right that async code can be less readable, because the code flow is broken up into callbacks/event handlers rather than using the more traditional threaded code. For async, this generally needs to be done; otherwise it can block and isn’t true async.

    In theory, a blocking thread could get away with zero buffering because the kernel would be free to read data directly from the blocked thread’s memory and write it directly to the receiver’s buffer.

    You’re right that unlimited buffers would be a bad idea, since they wouldn’t be able to cover all use cases anyway. Perhaps the best option to address the aforementioned “activity burst” use case with good reliability, without constantly polling the IPC channel buffer size, would then be one of the following:

    1. Allocate large enough IPC buffers to start with. Like, a few MB or so, but this is obviously largely application-dependent and should be chosen by the client and/or server processes. This works, but is arguably wasteful since that memory will only be used from time to time.
    2. Start with small buffers that can grow up to a limit chosen equal to the “large enough buffer size” discussed above. This option is more memory-efficient, but requires a bit more supporting code and may also cause a minor computational delay any time extra buffer space is allocated.

    In all cases, once the buffer is filled up, or in the event that a process tries to grow it in OOM conditions, the client process would block. Processes for which never blocking is of critical importance can poll the remaining buffer space on each IPC call as you described previously, while less critical processes can rest confident that their async calls are unlikely to block (no guarantee that they won’t, but it will be an uncommon event).
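
    A sketch of option 2 combined with the blocking fallback (all helper names are hypothetical; channel_send() would be a fixed-capacity primitive like the one sketched earlier, and wait_for_receiver() parks the sender until the recipient drains some data):

        #include <stddef.h>

        struct ipc_channel;                          /* opaque; see the earlier sketch */
        int  channel_send(struct ipc_channel *, const void *, size_t);   /* 0 or -1   */
        int  channel_try_grow(struct ipc_channel *, size_t max_capacity);/* 0 or -1   */
        void wait_for_receiver(struct ipc_channel *); /* blocks until space is freed   */

        /* Start small, grow up to `max_capacity`, and fall back to blocking once
           the buffer cannot grow any further (including under OOM). */
        void ipc_send(struct ipc_channel *ch, const void *msg, size_t len,
                      size_t max_capacity)
        {
            while (channel_send(ch, msg, len) != 0) {        /* buffer currently full */
                if (channel_try_grow(ch, max_capacity) != 0) /* at the cap, or OOM    */
                    wait_for_receiver(ch);                   /* later beats never     */
            }
        }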

    Well, you might never have to use PIDs internally to the kernel at all. Everywhere you’d use a PID, just use a pointer to the process record instead. A PID lookup would only be necessary for userspace calls.

    That makes sense indeed, though benchmarking will perhaps be needed at some point to find out whether user-space calls still require an unacceptably high number of PID lookups.

    Kind of. If you hibernate the process and save the state of its resource dependencies using the driver hibernation API we talked about earlier, then the process could be resumed elsewhere.

    For example, a game running at home has a process memory image + open files + IPC handles for I/O. Under the premise that all drivers support serialization and deserialization of state on behalf of individual processes entering hibernation, there’s no reason it couldn’t be “migrated” to a computer at work if the same dependencies existed at the target. The state could just be stored as XML files (i.e. handle#5=game.dat RO, handle#6=game.cfg RW, handle#7=x11 RW, etc). Rsync could become a tool for migrating whole application instances!

    All that advanced technology just to slack off at work? I am disappointed :P More seriously, wouldn’t processes still risk crashing if their hibernated selves were moved to less capable hardware? To follow your game example, it is common for games to probe GPU capabilities during initialization and then be done with that, since probing them over and over again would be wasteful. Wouldn’t that be a problem in a hibernation scenario?

    Going even further, it might even be possible to migrate between different versions of one dependency if the serialised state information is compatible between the two. This potentially allows us to upgrade dependencies in place without even restarting our programs! Wow! :)

    I already have something in mind to update system processes without restarting the user processes that depend on them; is that what you are discussing here?
