Useful discussion about “out-of-memory” handling

This post collects some quotes from Lennart Poettering and other free software people about OOM handling in applications. I really liked the discussion and wanted to share it. Happy reading :)

It all starts with the following patch sent to jack-devel:

thread_args = (jack_thread_arg_t *) malloc (sizeof (jack_thread_arg_t));
+    if (thread_args == NULL) {
+        return 0;
+    }

Then Fons says:

“Any real-life jack system is likely to be on its knees and begging for mercy long before the first malloc() fails. And on Linux, a non-NULL return from malloc doesn’t even mean that the memory is really available.”

Nedko Arnaudov slightly disagrees:

“I tend to disagree on this issue, but I’ve been arguing with other people working on Linux audio software and I don’t see the point of going into such theoretical discussions. IMO, checking for allocation failures is a good thing. The only bad thing about it is that it adds more code: bad as more code to maintain, and somewhat reduced readability. It is definitely good otherwise. Most importantly, it does not really hurt.”

Stéphane Letz suggests looking at PulseAudio’s malloc() implementation, which can easily be browsed thanks to gitweb.

Then Lennart talks about glibc’s malloc() redefinition hooks, and finally replies with an extremely instructive e-mail, which I paste directly here:

The OOM situation has been discussed quite often in various
projects. On modern systems it has become pretty clear that for normal
userspace software only an aborting malloc() makes sense for a couple
of reasons:

OOM handling creates a substantial number of error paths that need to
be tested, but seldom are. I.e. when a malloc() in an inner routine
fails, you need to roll back the entire operation, which can be very
hard to get right, and since this is seldom tested it almost never
works. OOM error paths add a substantial amount of code to most
projects, and that code is seldom tested and verified. Also very
interesting here are Havoc’s notes on the OOM handling in D-Bus:

http://log.ometer.com/2008-02.html#4.2

And note that I can list you a couple of places where D-Bus’
transactional rollback code still didn’t get it right, although it
is much, much better and more systematically tested than most other
projects. (e.g. https://bugs.freedesktop.org/show_bug.cgi?id=21259)

And then there’s the big thing that malloc() returning NULL seldom
means what people think it means. On modern systems RAM exhaustion is
signalled differently, certainly not via malloc(), since that
allocates address space, not actual memory. So if malloc() returns
NULL, it is a sign of address space exhaustion or of hitting some
resource limit (which is actually more or less the same thing). In
that case it seldom makes sense to try to go on, and in particular
the often-suggested solution of calling “malloc() in a loop” is not
going to help a tiny bit.

If the system is truly short on RAM, the kernel will kill a process
via the OOM killer, in an unfriendly way. I’d say it only makes sense
for a process to treat address space exhaustion the same way as true
OOM is treated: by killing itself.

Also, almost all sensible libraries enforce an aborting malloc()
anyway. For example, glib/gtk works exclusively with an aborting
malloc, and the same is true for quite a few other libraries. That
means the entirety of your GNOME desktop, and even many system
daemons such as HAL or DeviceKit-disks, abort on OOM; if you think
it is worth handling OOM in your user software, one might wonder
what the point is when the underlying services don’t.

Now, the often-heard argument for properly handling malloc()
returning NULL is that users don’t want their data lost, and that the
app should exit cleanly, saving everything there is to save. In my
opinion that is a completely artificial issue. A program should always
be written in a way that the data it can lose is minimal, and that
includes when it is terminated abruptly by a power outage, or by being
killed by the OOM killer. So auto-save the user data from time to time
anyway; don’t wait until you’re officially fucked by malloc() returning
NULL. Also, it is incredibly hard to write code that is still able to
write data safely to disk when you cannot allocate memory while doing
that.

So in summary, OOM-safety is wrong:

- Because it increases your code size by 30%-40%
- You’re trying to be more catholic than the pope, since the various
system services you build on and interface with aren’t OOM-safe
anyway
- You are trying to solve the wrong problem. Real OOM will be signalled
via SIGKILL, not malloc() returning NULL.
- You are trying to solve the wrong problem. Make sure your app never
loses data, not only when malloc() returns NULL.
- You can barely test the OOM code paths

Does that mean it never makes sense to write OOM-safe userspace code?
No. For very low-level system daemons, such as udev or the init system
itself, clean OOM handling is important. And in some embedded
appliances that is true, too. But it is very easy to identify these
cases: these are the programs that modify their own oom_adj value in
/proc, or appliances where memory overcommit is disabled anyway. (And
recursively, the few libraries that are used by those daemons should
be OOM-safe, too, i.e. libc, dbus.)

I don’t think that Jack or libjack qualify as that.

So, say no to OOM handling! Saves you time, makes your code faster and
nicer and shorter!

Or, to put it more sarcastically: the most visible effect of the
extra code you have to write to make things OOM-safe will be that,
due to the higher memory/address space consumption, the OOM situation
will come earlier than it would without it.

Then Jack O’Quin answers that JACK does function as one of those low-level daemons Lennart mentions.

Lennart clarifies his arguments:

Producing music is not something lives depend on, nor something where
a machine reboot would cost millions of dollars. And that means it is
not important that these programs never, ever crash.

That’s different in a car or in medical devices. If one of the
gazillions of embedded devices that make up today’s cars or medical
devices fails, this could have fatal consequences, and hence spending
the extra time and money on investing in OOM-safety makes sense. But
if you claim that having the music go on under all circumstances is
similarly important, this is quite some hubris.

[…]

And finally: if your system really gets under memory pressure, isn’t
the jack daemon the first thing that should be killed? I.e. it just
provides connectivity, right? Killing it will stop audio, but should
not lose data. So I’d argue that jack is the first thing that should
go away, not the last. It should go away well before your Ardour is
shot down! Because losing Ardour means losing data; losing jack
merely means a temporary interruption of your work.

And the discussion goes on with a lot of good explanations of VM mechanisms such as memory overcommitting.

The whole thread is at:

http://thread.gmane.org/gmane.comp.audio.jackit/19983/focus=19998

Posted Wednesday, December 23rd, 2009 under Operating Systems, Programming.

