A painful pause while my app is running

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

I used Clojure to build a RESTful API. I was successful in so far as that went, but now I face the issue that every once in a while, the program pauses, for a painfully long time — sometimes 30 seconds, which causes some requests to the API to timeout. We are still in testing, so there is no real load on the app, just the frontenders, writing Javascript and making Ajax calls to the service.

The service seems like a basic Clojure web app. I use Jetty as the webserver, and the libraries in use are:

Ring

Compojure

Liberator

Monger

Timbre

Lamina

Dire

When someone complains about the pauses, I will go test the service, and I can hit with 40 requests in 10 seconds and it has great performance. The pauses actually seem to come after periods of inactivity, which made me think that this had something to do with garbage collection, except that the pauses are so extreme — like I said, sometimes as much as 30 seconds, causing requests to timeout. When the app does finally start to respond it again, it goes very fast, and responds to those pending request very fast.

But I have to find a way to fix these pauses.

Right now I packaged the app as an Uberjar and put it on the server, spun it up on port 24000 and proxied it through Apache. I put a script in /etc/init.d to start the app using start-stop-daemon.

Possible things that could be going wrong:

Maybe Jetty needs more threads, or maybe less threads? How would I test that?

Maybe the link to MongoDB sometimes dies? (Mongo is on another server at Amazon) How would I test that?

Maybe it is garbage collection? How would I test that?

Maybe I have some code that somehow blocks the whole app? Seems unlikely but I’m trying to keep an open mind.

Maybe the thread pool managed by Lamina sometimes gets overwhelmed? How would I test that?

Maybe when Timbre writes to the log file it causes things to pause? (But I believe Timbre does this in its own thread?) How do I test that?

This is a small app: only about 1,100 lines of code.

Then I saw the article “Logging can be tricky“. This is similar to my issue:

No significant amount of resources appeared to be in use — disk I/O, network I/O, CPU, and memory all looked fairly tame. Furthermore, the bulk of queries being served were all performing as expected.

If I run:

sudo strace -c -f -p 19363

I get:

Hmm, no fsync. But the restart_syscall does seem strange.

Now I have to figure out how to track down what that means.

Post external references

  1. 1
    http://corner.squareup.com/2014/09/logging-can-be-tricky.html
Source