A surprising NodeJS failure mode: deterministic code becomes probabilistic under load

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

I have a client. Let’s say they run some scripts that scrape the web for data, then they run some NLP scripts to pull out the facts they need, then they need to insert that into their MySQL database. When they gather facts, they need to know if the facts are about an organization that is already in the database. So they asked me to create an API that could take their new facts, and try to match those facts with something in the database.

If I had the choice of technology I would have used Clojure, because I feel the JVM offers a lot of what is needed here: concurrency to handle massive load, lots of tools to handle backpressure, and Clojure to make the concurrency less painful than in something like plain Java. Also I hate Java. Also I love Clojure.

But this client had no software that ran on the JVM. They used PHP and NodeJS. (The NLP guy used Python.) They’d already used NodeJS for a read-only API that delivered certain kinds of data to paying customers. So, okay, I should use NodeJS.

I took their existing NodeJS API (built with HapiJS) and modified it. The import/NLP scripts would now call this new API I was creating. The import/NLP scripts would send in some strings, and then I would do some fuzzy matching, and generate a score about how likely it was that the new facts being imported were about an organization that was already in the database.

I’m running the NodeJS API behind an Nginx reverse proxy. I don’t think this should affect anything I observe.

Just to avoid any confusion, I want you to understand that this code is deterministic. Given the same input, you should get the same score every time (that is, at any given point in time. Obviously if you run the script a year from now you might get a different score, since the data in the database might have changed).

I created the API such that most of its parameters could be set by the caller. So for instance, how many points should be given when a Jaro-Winkler match was made? The default is 16 points, but the caller could vary that. And how many points should be given for an exact match?
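To make the scoring idea concrete, here is a minimal sketch of how a per-field score might be computed. The parameter names mirror the `api_call_meta` block shown later in this post, but the `scoreField` function and its logic are illustrative guesses, not the actual API code:

```javascript
// A minimal Jaro-Winkler similarity implementation (standard algorithm:
// Jaro similarity plus a prefix bonus of up to 4 characters).
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1;
  const len1 = s1.length, len2 = s2.length;
  if (len1 === 0 || len2 === 0) return 0;
  const window = Math.max(Math.floor(Math.max(len1, len2) / 2) - 1, 0);
  const match1 = new Array(len1).fill(false);
  const match2 = new Array(len2).fill(false);
  let matches = 0;
  for (let i = 0; i < len1; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(len2 - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!match2[j] && s1[i] === s2[j]) {
        match1[i] = match2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  let transpositions = 0, k = 0;
  for (let i = 0; i < len1; i++) {
    if (match1[i]) {
      while (!match2[k]) k++;
      if (s1[i] !== s2[k]) transpositions++;
      k++;
    }
  }
  const jaro = (matches / len1 + matches / len2 +
                (matches - transpositions / 2) / matches) / 3;
  let prefix = 0;
  while (prefix < 4 && s1[prefix] === s2[prefix]) prefix++;
  return jaro + prefix * 0.1 * (1 - jaro);
}

// Sketch (hypothetical): a field earns exact-match points, or Jaro-Winkler
// points if similarity clears the threshold, or nothing -- scaled by the
// caller-supplied weight for that field.
function scoreField(fromCall, fromDb, params, weight) {
  if (fromCall === fromDb) {
    return params.score_given_whenever_an_exact_match_is_made_on_a_field * weight;
  }
  if (jaroWinkler(fromCall, fromDb) >= params.allowed_jaro_winkler_distance) {
    return params.allowed_points_for_a_jaro_winkler_match * weight;
  }
  return 0;
}
```

The point is simply that every branch here is deterministic: the same two strings and the same parameters always produce the same score.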

Since there were so many parameters, and they could all be tuned, it seemed wise to test all possible combinations to see which gave the best F1 Score. There were 11 parameters and I tested 3 possibilities for each, so there were 3^11 = 177,147 possible combinations. I wrote the test app in Clojure, and made it heavily multi-threaded so I could pound the API. One of my goals was to stress test the API.
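The sweep itself is just a cartesian product. A sketch (in JavaScript for illustration; the real test app was Clojure, and the parameter names and candidate values here are made up):

```javascript
// Enumerate every combination of N parameters, each with 3 candidate values.
// With 11 parameters this yields 3^11 = 177,147 combinations to test.
function cartesian(choices) {
  // choices: { paramName: [v1, v2, v3], ... }
  return Object.entries(choices).reduce(
    (combos, [name, values]) =>
      combos.flatMap(combo => values.map(v => ({ ...combo, [name]: v }))),
    [{}]
  );
}

// Illustrative placeholder parameters, 3 candidate values each.
const choices = {};
for (let i = 1; i <= 11; i++) choices[`param_${i}`] = [1, 2, 3];

const combos = cartesian(choices);
console.log(combos.length); // 177147
```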

The Clojure test app starts up and grabs our real data from the database, then pounds the API with that. This gives us the True Positives and False Negatives. Then the app mangles the text slightly, inserting some characters randomly. Then we pound the API with the mangled data. This gives us the True Negatives and False Positives.
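From those four tallies, the F1 Score for each parameter combination follows directly. A sketch of the arithmetic (the counts here are made-up numbers for illustration):

```javascript
// F1 is the harmonic mean of precision and recall, computed from
// True Positives, False Positives, and False Negatives.
function f1Score(tp, fp, fn) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}

// e.g. 90 true positives, 10 false positives, 30 false negatives:
// precision = 0.9, recall = 0.75, F1 = 0.818...
console.log(f1Score(90, 10, 30));
```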

A few years back I read Matt Ranney’s very good essay “Backpressure and Unbounded Concurrency in Node.js”. He wrote:

The Hardest Problem in node, in my view, is unbounded concurrency. Most tricky problems that we’ve tracked down at Voxer have ultimately been problems of too much concurrency. This is a kind of backpressure problem, but at a different level than streams address. Node makes it very easy and relatively efficient to manage high concurrency driven primarily by network events, which is why we use node in the first place. There just aren’t good ways to limit this concurrency before it causes problems. Node’s unbounded concurrency might also cause problems in other systems it touches, like disk drives and databases. Even if node itself isn’t out of memory or CPU, it can easily overwhelm these other systems.

With that in mind, I thought I understood the ways my NodeJS app could fail. But NodeJS still managed to surprise me.

When I started the Clojure test app, it pounded the API with several thousand HTTP calls every minute. The API is a bit slow and could barely handle 100 requests a minute (the fuzzy matching is slow and I’ve done nothing to optimize it — and none of that matters).

Here is what I expected: the API would respond to 5% of the requests, but 95% of the time I would get a SocketTimeout Exception. And that would be fine. After all, part of my aim was to overwhelm the API and see where it might break.

At first, everything happened the way I expected. Sometimes a request went through and came back with the correct answer. Other times I got a SocketTimeout Exception.

But then I started to see something that really surprised me: the API would respond with a 200 OK HTTP status, and the JSON shell of a response, but it would claim there was no match, for a string that should have matched. I tested this from the Terminal on my Mac, using curl, and I was amazed to see that, for exactly the same input, I sometimes got the right answer, sometimes got a SocketTimeout Exception, and sometimes got the wrong answer! My deterministic code had become probabilistic.

So I might run a request such as this:

curl --verbose -d '{"profile_name":"!Go B.V.","headquarters_addr1":"Wilhelminakanaal Noord 3","headquarters_city":"oosterhou dw","headquarters_phone":"31 112 37400","api_call_meta":{"allowed_url_weight":1.5,"allowed_shortness_of_string_less_than__which_we_increase_strictness":4,"allowed_headquarters_city_weight":1,"allowed_headquarters_addr1_weight":3,"allowed_minimum_total_score_necessary_to_be_returned_from_best_match":40,"score_given_whenever_an_exact_match_is_made_on_a_field":19,"allowed_points_for_a_jaro_winkler_match":15,"allowed_headquarters_phone_weight":1.5,"allowed_profile_name_weight":3,"allowed_headquarters_zip_code_weight":3,"allowed_jaro_winkler_distance":0.94}}' -H 'Content-Type: application/json' -X POST http://dev.example.com/identify/v6/

And I get back some JSON with 26 matches in the “records” field. I’ve shrunk this down, but this gives you the general idea of what the correct answer is:

* upload completely sent off: 668 out of 668 bytes
 HTTP/1.1 200 OK
 Server: nginx/1.11.9
 Date: Thu, 30 Mar 2017 23:32:59 GMT
 Content-Length: 31258
 Content-Type: application/json; charset=utf-8
 Connection: keep-alive
 vary: origin
 cache-control: no-cache
{
  "records": {
    "2326191f-b4a4-4429-855e-dce6625c275f": {
      "headquarters_addr1": {
        "type_of_match": "exact_match",
        "field": "headquarters_addr1",
        "santized_string_from_api_call": "wilhelminakanaal noord 3",
        "santized_string_from_database": "wilhelminakanaal noord 3",
        "score": 58
      },
      "total_score": 107.5,
      "headquarters_city": {
        "type_of_match": "exact_match",
        "field": "headquarters_city",
        "santized_string_from_api_call": "oosterhou",
        "santized_string_from_database": "oosterhou",
        "score": 20
      },
      "headquarters_phone": {
        "type_of_match": "exact_match",
        "field": "headquarters_phone",
        "santized_string_from_api_call": " 31 112 37400",
        "santized_string_from_database": " 31 112 37400",
        "score": 29.5
      }
    },

   // REDACTED

  },
  "highest_score": 107.5,
  "context": {
    "comparison_id": "a590e1a9-5364-4452-9623-6ce1c5430495",
    "api_call_meta": {
      "allowed_url_weight": 1.5,
      "allowed_shortness_of_string_less_than__which_we_increase_strictness": 4,
      "allowed_headquarters_city_weight": 1,
      "allowed_headquarters_addr1_weight": 3,
      "allowed_minimum_total_score_necessary_to_be_returned_from_best_match": 40,
      "score_given_whenever_an_exact_match_is_made_on_a_field": 19,
      "allowed_points_for_a_jaro_winkler_match": 15,
      "allowed_headquarters_phone_weight": 1.5,
      "allowed_profile_name_weight": 3,
      "allowed_headquarters_zip_code_weight": 3,
      "allowed_jaro_winkler_distance": 0.94,
      "allowed_difference_in_score_for_close_matches": 7,
      "allowed_exponential_fade": true,
      "allowed_headquarters_phone_area_code_weight": 1,
      "bonus_score_for_passing_a_filter": 5.7
    },
    "errors": [],
    "how_many_documents_were_searched": 116,
    "how_many_documents_were_returned_after_all_root_endpoint_filters": 26,
    "which_root_endpoint_filters_passed": []
  }
}

Then I run the request several more times. 5 times in a row I get a SocketTimeout Exception. And that’s fine, I expect that. But then I get this, which is simply the wrong answer:

* upload completely sent off: 668 out of 668 bytes
 HTTP/1.1 200 OK
 Server: nginx/1.11.9
 Date: Thu, 30 Mar 2017 23:33:47 GMT
 Content-Type: application/json; charset=utf-8
 Content-Length: 1122
 Connection: keep-alive
 vary: origin
 cache-control: no-cache
{
  "records": {},
  "highest_score": 0,
  "context": {
    "comparison_id": "b854936c-c166-421c-82be-601aec1dbfe9",
    "api_call_meta": {
      "allowed_url_weight": 1.5,
      "allowed_shortness_of_string_less_than__which_we_increase_strictness": 4,
      "allowed_headquarters_city_weight": 1,
      "allowed_headquarters_addr1_weight": 3,
      "allowed_minimum_total_score_necessary_to_be_returned_from_best_match": 40,
      "score_given_whenever_an_exact_match_is_made_on_a_field": 19,
      "allowed_points_for_a_jaro_winkler_match": 15,
      "allowed_headquarters_phone_weight": 1.5,
      "allowed_profile_name_weight": 3,
      "allowed_headquarters_zip_code_weight": 3,
      "allowed_jaro_winkler_distance": 0.94,
      "allowed_difference_in_score_for_close_matches": 7,
      "allowed_exponential_fade": true,
      "allowed_headquarters_phone_area_code_weight": 1,
      "bonus_score_for_passing_a_filter": 5.7
    },
    "errors": [],
    "how_many_documents_were_searched": 1,
    "how_many_documents_were_returned_after_all_root_endpoint_filters": 0,
    "which_root_endpoint_filters_passed": []
  }
}

You’ll notice that this is empty:

"records": {},

It should have 26 records!

There is nothing in my API code that would give rise to a probabilistic answer. The exact matching and the Jaro-Winkler matching are all deterministic. But under load, my app has become probabilistic!

And if I keep running the curl call, over and over again, I keep getting some mix of these 3 responses. Mostly I get SocketTimeout Exceptions. Sometimes I get the right answer. And, very rarely, I get the wrong answer.

Here is a surprising difference between the correct and the incorrect responses:

"how_many_documents_were_searched": 116,

"how_many_documents_were_searched": 1,

Yet, given the same input, the same documents should be matched against each time.

I did not know that NodeJS had this failure mode. I assume I’m dealing with some kind of race condition, but it is not the obvious one you might expect: a callback from the database. In fact, the API does not call the database while it is running. It calls the database at startup, and then it stores 4 gigs of data in memory, breaking the data up into various indexes to help with fast lookup. There are no callbacks in the code, other than the one that matches an incoming request to an endpoint.
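To make the architecture concrete, here is a sketch of the load-once, index-in-memory idea. The field names echo the API’s response fields, but the `sanitize` and `buildIndexes` functions are illustrative, not the actual implementation:

```javascript
// At startup, load all records from the database once, then build in-memory
// indexes keyed by sanitized field values, so each request is answered from
// memory with no database callbacks on the hot path.
function sanitize(s) {
  return s.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
}

function buildIndexes(records) {
  const byCity = new Map();
  for (const rec of records) {
    const city = sanitize(rec.headquarters_city || "");
    if (!byCity.has(city)) byCity.set(city, []);
    byCity.get(city).push(rec);
  }
  // ...in reality, one index per searchable field
  return { byCity };
}
```

With this design, a request handler does only synchronous, CPU-bound lookups and string comparisons, which is exactly why a dropped or wrong answer is so hard to explain here.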

Possibly this is some kind of resource contention in NodeJS, but I’m not sure how to instrument for it, and it would be tough to reproduce this bug reliably.
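One crude thing that could at least confirm event-loop saturation is measuring how late timers fire under load. A sketch (the threshold and interval are arbitrary):

```javascript
// Measure event-loop lag: schedule a repeating timer and record how much
// later than expected it actually fires. When CPU-bound work (like fuzzy
// matching) blocks the loop, the lag climbs.
function measureLag(intervalMs, callback) {
  let last = Date.now();
  setInterval(() => {
    const now = Date.now();
    const lag = now - last - intervalMs; // how late this tick fired
    last = now;
    callback(lag);
  }, intervalMs);
}

// Usage: log whenever the loop was blocked for more than 100 ms.
// measureLag(500, lag => {
//   if (lag > 100) console.warn(`event loop lag: ${lag} ms`);
// });
```

This would not explain a wrong-but-well-formed answer by itself, but it would show whether requests are being processed while the loop is badly starved.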

Sadly, this undermines my attempt to get an accurate F1 Score. While I can easily filter out the SocketTimeout Exceptions, I have no way to distinguish the wrong answers from correct True Negatives.

I’m not feeling great about NodeJS right now, but I am open to constructive suggestions.
