How Node.js fails to deal with backpressure

(written by lawrence krubner; however, indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

Interesting:

The Node concurrency model is kind of like a credit card for computing work. Credit cards free you from the hassle and risks of carrying cash, and they are completely great in almost every way, unless you spend more than you make. It’s hard to even know you have spent more than you make until the next bill comes. Similarly, node lets you do more work, and it’ll call you back when it’s done, whenever that is. You might not even realize that you’ve been scheduling work, but any time you have an event listener, you must handle that event when it goes off, on the stack where the event was fired. There is no way in node’s event system to defer callbacks until the system is “less busy”, whatever that means.

This is a subtle but important tradeoff, so it’s worth repeating. Any time you have an event listener, you must handle that event when it goes off. This applies to everything from new connection listeners to new request listeners, data events, and anything else. Node will not defer the dispatch of an event when one is emitted.
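
To make the quoted point concrete, here is a tiny sketch of my own (not from the article): EventEmitter invokes listeners synchronously, on the same stack as the emit() call, which is exactly why the work cannot be deferred until the process is “less busy”.

    // EventEmitter dispatches synchronously, on the stack of the emit()
    // call. There is no built-in way to ask Node to deliver it later.
    const EventEmitter = require('events');

    const emitter = new EventEmitter();
    emitter.on('work', (job) => {
      console.log('handling', job); // runs before emit() returns
    });

    console.log('before emit');
    emitter.emit('work', 'job-1'); // the listener runs right here
    console.log('after emit');
    // prints: before emit / handling job-1 / after emit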

The Shape of the Problem

When you schedule work in a node process, some memory needs to be allocated, typically some from the V8 heap for JS objects and closures, and some in the node process heap for Buffers, TLS, and gzip resources. This memory allocation happens automatically, and it’s hard, perhaps impossible, to know in advance how much memory an operation will require, or how much space you have left. Of course the actual size or cost of an operation is probably irrelevant to a program at runtime, but it’s interesting to notice where in the system this backpressure problem starts.
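
As a rough illustration (my own sketch, with made-up sizes), process.memoryUsage() lets you watch both pools the author mentions: heapUsed tracks the V8 heap, while external covers memory allocated outside it, which is where Buffer contents live.

    // Watch the two pools: V8 heap (objects, closures) vs. memory
    // outside it (Buffers, TLS, zlib). Sizes below are arbitrary.
    const before = process.memoryUsage();

    const closures = Array.from({ length: 100000 }, (_, i) => () => i);      // V8 heap
    const buffers = Array.from({ length: 1000 }, () => Buffer.alloc(65536)); // external

    const after = process.memoryUsage();
    console.log('heapUsed delta (MB):', (after.heapUsed - before.heapUsed) / 1048576);
    console.log('external delta (MB):', (after.external - before.external) / 1048576);
    console.log('still referenced:', closures.length, buffers.length); // keep them alive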

If you schedule too much work for a node process, here is how things usually break down and ultimately fail. As work is accepted faster than it is finished, more memory is required. The cost of garbage collection is directly related to the number of objects in the heap, and all of these new objects will steadily grow the heap. As the heap grows, GC cost grows, which cuts into the rate that useful work can be done. Since the process was already unable to keep up, this causes even more work to queue up, growing the heap and GC time in a vicious feedback cycle. Eventually the process spends almost all of its time doing GC and very little time doing actual work. If you are lucky, the heap will grow so large that V8 will exceed its own heap size limits and the process will abort. If you are unlucky, the process will languish in pain for hours, blindly accepting new connections that it’ll never be able to service until you eventually run out of file descriptors. Processes will often go into a tight logging loop complaining about “EMFILE”, which will then start filling up your disks.
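
The feedback loop is easy to reproduce in miniature. In this sketch of mine (the rates are arbitrary), work is accepted ten times faster than it is finished, so the backlog, the heap, and the GC bill all grow without bound:

    const backlog = [];

    setInterval(() => {
      // "accept" new work: ~100,000 jobs per second
      for (let i = 0; i < 1000; i++) backlog.push({ id: i, payload: 'x'.repeat(1024) });
    }, 10);

    setInterval(() => {
      backlog.splice(0, 100); // "finish" a tenth of what arrived
      const mb = process.memoryUsage().heapUsed / 1048576;
      console.log(`backlog=${backlog.length} heapUsed=${mb.toFixed(1)}MB`);
    }, 10);
    // Run it with a small heap (node --max-old-space-size=64 demo.js) and
    // watch V8 abort the process once it blows past its heap limit.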

If you’ve got a node process that’s using more memory than you expect, the CPU is at 100%, but work is still proceeding slowly, chances are good that you’ve hit this problem. If you are on a system with DTrace, you can watch the GC rate or overall performance profile to definitively diagnose this problem. Whether or not you have DTrace, at this point your process is pretty much screwed. You’ve spent past your credit limit, and now you are stuck paying GC interest payments. Recovering gracefully is nearly impossible, so restarting the process is usually the best move.
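
If you are not on a DTrace system, there are still portable ways to catch the symptom. One is the --trace-gc flag; another, my suggestion below, is the event loop delay histogram in the perf_hooks module (Node 11.10 or later), since a process stuck in GC shows up as a climbing p99 delay.

    // Run with `node --trace-gc app.js` to see scavenge/mark-sweep lines
    // arrive closer and closer together, or measure from the inside:
    const { monitorEventLoopDelay } = require('perf_hooks');

    const histogram = monitorEventLoopDelay({ resolution: 20 });
    histogram.enable();

    setInterval(() => {
      // percentile() reports nanoseconds; a healthy process stays in
      // single-digit milliseconds, a GC-bound one climbs into seconds.
      console.log('p99 loop delay (ms):', histogram.percentile(99) / 1e6);
      histogram.reset();
    }, 5000);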

But wait, it gets worse. Once a process can’t keep up with its work and starts to get slow, this slowness often cascades into other processes in the system. Streams are not the solution here. In fact, streams are kind of what gets us into this mess in the first place, by providing a convenient fire-and-forget API for an individual socket or HTTP request without regard to the health of the rest of the system.
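
To be fair to streams, they do provide backpressure for a single destination, and it is worth seeing how narrow that is. In this standard write()/'drain' pattern (my example), the server stops writing when one socket’s buffer fills, but nothing in the picture knows whether the process as a whole is drowning:

    const net = require('net');

    net.createServer((socket) => {
      // Send an endless stream, pausing whenever this one socket is full.
      function pump() {
        let ok = true;
        while (ok) {
          ok = socket.write(Buffer.alloc(65536)); // false = buffer is full
        }
        socket.once('drain', pump); // resume only when *this* socket drains
      }
      pump();
    }).listen(8124);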

However, I was just talking with Vache of Haystack.im, and he was telling me about the success he has had with the “fail fast” model, where Node apps get pushed to their limit, die, and then restart. For any problem where you can get away with the “fail fast” architecture (and that is a very large category of software), you can clearly use Node.js without running into too many problems.
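
Here is a minimal sketch of what “fail fast” can look like (my own illustration, not necessarily how Haystack.im does it, and the 512 MB threshold is an arbitrary stand-in for a real budget): a supervisor respawns workers, and each worker kills itself instead of limping into the GC death spiral.

    const cluster = require('cluster');

    if (cluster.isPrimary) { // cluster.isMaster on Node before v16
      cluster.fork();
      cluster.on('exit', () => cluster.fork()); // die fast, restart fast
    } else {
      setInterval(() => {
        if (process.memoryUsage().heapUsed > 512 * 1024 * 1024) {
          process.exit(1); // bail out before the death spiral begins
        }
      }, 1000).unref(); // don't keep the process alive just for this check
      // ...start the worker's server here...
    }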

Source