How does your frontend know the id of a document if you’re using an async architecture? Rely on UUIDs.

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Building an async system is more complicated than building a sequential system, but it can offer competitive advantages. For instance, you can build mobile apps that work offline, and later sync their data with your system, and this ability to function to work offline can enable uses that your competitors don’t have. This is what allowed Handshake CRM to thrive, even though they started after the CRM market had seemingly consolidated down to a handful of winners. Given the dominance that Salesforce had in the CRM market, it took courage to start a new CRM company. But they had a unique idea. The founder had once been trying to sell product in the Javits Center, down in the basement, where the wifi was non-existent. And he had to write orders down pieces of paper, because none of his software could get a connection to the Internet. And he thought, what if we build frontend software that work offline? Thus a new company was born, that could go into the crowded CRM space and offer something new.

Likewise with farm software. I have a client that is trying to revolutionize the way farmers manage their fields. The farmers are often out in rural areas, where cell phone coverage is poor. So it was important that the frontend of this system could work offline. That is, a farmer could go out into the fields, gather data and type it into their iPhone or iPad, and then later, when they get home, the app on the iPhone or iPad could sync with the online system.

For the sake of this essay, assume I’m talking about a system that looks like this:

I recently had this conversation with a frontend developer, working on this project, who was completely new to async architectures.

Them:

How do I know I have created a document on the backend? Suppose a new user wants to create their profile. If we were using a normal website, something like Django or Ruby On Rails, then my Javascript code does a POST to the server, and I’m going to either get back a 200 or a 400 or a 404 or maybe a 500. Either it works, or it doesn’t work, and I get back an HTTP response code that tells me what happened. And then the frontend code can show an appropriate message to the user, and if there was an error then the user can try to save the data again. But in this architecture that you are talking about, I send a JSON document via POST and then… what? Maybe the API Gateway gives me a 200, but that doesn’t tell me if they document made it all the way to the database. That just means I had the correct API key and so I was allowed to make a POST. But I’ve got no idea if anything succeeded, and I’ve got no way to reference this document. I’ve got no ID, no way of knowing how to refer to this in the future.

Me:

It might seem like a complicated architecture, but in my experience, when things are working correctly, a new document can get through the system in less than 2 seconds. It should never take more than 2 or 3 seconds for a document to go through these steps:

1.) from the users web browser to the API Gateway

2.) to an app that reformats the data for Kafka

3.) to Kafka

4.) to the backend robot that handles User profiles

5.) to PostGres

6.) to the Translator app

7.) to ElasticSearch

If, for some reason, your system slows down the point where it is taking 20 or 30 seconds for a document to get through the system then, first of all, you need to look at the bottle neck and fix it, but, second of all, I have seen many startups cheat a bit, and for high priority information, the backend robots will both write to PostGres and also write directly to ElasticSearch. That leads to tight coupling between the backend robots and the denormalized data in ElasticSearch, so I don’t recommend it, but I have often seen it done.

Assume that 99.9% of the time a new document can get through your system and arrive in ElasticSearch in less than 2 seconds. Then your code can POST to the write point, wait 2 seconds, and then make a read from the read point — your new document should be there.

Them:

But how would I know? How would I be able to tell if the new document is in ElasticSearch? I don’t know its ID in the database, so I have no way to identify it.

Me:

Before you send it, you’ll want to create a UUID and add that in as the document id. And you’ll want to send it with the session id. So 2 seconds later your code can ask, is there a document with this session id and this document id? In other words, for a new user profile, you send something like this:

{
“first_name” : “Allie”,
“last_name” : “Veblen”,
“phone” : “01-987-234-9812”,
“company_id” : “9871fVGTED120dkfir93F”,
“document_id” : “f02343dlGTRF9090909dLc”,
“session_id” : “888F8F2cvnd4jf09LKJiv2”
}

And then 2 seconds later you can query for:

{
“document_id” : “f02343dlGTRF9090909dLc”,
“session_id” : “888F8F2cvnd4jf09LKJiv2”
}

If the document comes back, then you know it was successfully created. If nothing comes back, wait another 5 seconds and try again. Then another 5 seconds and try again. You can set some cut off, let’s say 15 seconds. If the document is not there after 15 seconds, you can assume something failed. You can either do the POST again, or tell the user that there is a problem.

In general, you really should not rely on the auto-generated ID of a database table to be your document id. That style was popularized by PHP and MySQL and then Ruby On Rails and MySQL, but that style needs to die. Arguably it is a security risk, since outsiders can easily guess the id of any document in your system, which they can not do if you use UUIDs.

Them:

But even with UUIDs, isn’t there still a big security risk? Can’t they potentially guess a UUID?

Me:

Sure, hackers can try to guess UUIDs, that is a well known attack. But there are a few things to keep in mind. First of all, with UUIDs, they would have to guess many trillions and trillions of UUIDs before they would even have a 50% chance of correctly guessing a UUID that you are using. Second of all, you should have some code that checks to see if hackers are attacking your site with random guesses. If you use the AWS API Gateway, then you don’t have to write any code yourself, because Amazon has already done that for you. The only requests that get through will be requests that have your API key.

Them:

A UUID seems complicated. I don’t know how to create them. Maybe I can just use a timestamp instead?

Me:

No, you don’t want to use a timestamp. That would only work for the simplest cases. First of all, as time goes by, your frontend will become more complex, and eventually you will find you are attaching multiple events to a single button click. These would all have the same timestamp, so the timestamp would fail to tell you which event/document is which. Second of all, you don’t need to create UUIDs yourself, there are many good libraries that do you this for you, and which are easy to use.

How does the user login?

Especially nowadays, inexperienced developers have gotten so used to working with systems that automatically handle sessions that many such developers don’t seem to understand what a session is, or how to handle it manually. So I am often asked a variant of this, by whoever is working on the frontend:

How can I handle login in a system like this? If I build a website using WordPress or Ruby On Rails or Django then all of those systems have code to handle logins. Normally I get the user’s username and password, from an HTML form, and then I POST that to the server, and if the username and password match an existing username and password, then the user is logged in. But in this OWPORPOL system, I don’t have a direct connection to the backend, so I don’t think I can maintain a session?

To answer, let’s first talk about a totally normal website, like the type run by WordPress. In that case:

1.) user types username and password into an HTML form

2.) the username and password are POSTed to the server

3.) (assume success) the server sends back a session_id, to be set as a cookie

4.) the web browser records the cookie

5.) thereafter, the web browser sends the session_id with every request, in the HTTP cookie headers

One could just send the username and password with every single request, but that would get tedious and would also possibly be a security risk.

How to handle authentication with an OWPORPOL system? There are a few ways, including the traditional “send username and password from HTML form.” But nowadays 3rd party auth has become very common, so let’s talk about that too.

In the (very traditional) first case:

1.) you have a website where the user fills in the username and password in an HTML form.

2.) You also have a small auth app that is available at a public endpoint. This has access to a cache of all usernames and passwords (I’m just going to assume encryption and not talk about it). This auth app is an exception to the overall architecture, since it amounts to a write point, other than the main write point. I think this is a minor exception that doesn’t hurt the overall system

3.) The HTML form POSTs the username and password to the auth app

4.) let’s assume the auth app finds a match for that username and password

5.) the auth app generates a UUID to be used as the session_id. This gets sent back to the web browser

6.) the auth app also writes the session_id and username to Kafka (this is a direct write, and therefore another exception to the notion of “one write point”), so that the User robot on the backend can pick it up and see what the current session_id is. If your backend is written in a standard Django or Ruby On Rails style, you might have a Sessions table in PostGres where you store all current session_ids

7.) thereafter, any document sent to the write point should include the session_id

In other words, the auth app writes something like this to Kafka:

{
“action” : “login”,
“username” : “joel”,
“session_id” : “9230940329420394”,
“status” : “success”
}

or this for a failed login:

{
“action” : “login”,
“username” : “joel”,
“status” : “failed”
}

(To many failed login attempts is a problem. You should have some backend robot that tracks this.)

Later, someone tries to send in a New Business action, which might look like this:

{
“item-type” : “business”,
“name” : “Joel’s Country Vegetables”,
“document_id” : “frt43209KJH3111nvBt2”,
“session_id” : “9230940329420394”
}

This gets picked up by the Business robot, which needs to know if this is a valid user, so it checks the session_id in PostGres. This is exactly what the major frameworks do: Flask, Ruby On Rails, Django, Symfony, etc. In fact, you could use any of those frameworks for the backend.

The Business robot sees the session_id is valid, so it creates a new row in the Business table in PostGres. And it should write a message “Your new business has been created!” into the Messages table, and that should be passed along to ElasticSearch, and then to the user (all of this usually happens in less than 2 seconds).

PostGres is a cache that holds the current snapshot of what data is valid. However, it is not the canonical source of truth. Note that you could delete PostGres, and delete all of your PostGres backups, and you could still correctly recreate all of the information in PostGres, by having your robots run over the data in Kafka. So the history of events in Kafka is the canonical source of truth, in the sense that all the knowledge of your app can be recreated by re-reading the data from Kafka. However, in terms of what is the current data, you can keep that in PostGres, and in that sense the app will work a lot like a typical Flask app, or a typical Ruby On Rails app.

The failure mode is also simple, if the Business robot can not find a matching session_id in the Sessions table, then it should log this as a failed attempt. Failures should also be stored in the Messages table, to be passed along to the user. “We could not create a business, could not find a valid session.”

How long does login last? Every website has different rules about this. Most sites will allow you stay logged in forever — the session never expires. A user remains logged in unless they specifically initiate a logout action. However, some sites, such as my bank, are wary about security, so they cancel sessions after just 5 minutes. Ending a session just means deleting the session_id from the Sessions database table. Or maybe the user clicks the “logout” button on the website, at which point a “logout” event is sent to the public write point, then to Kafka, then it gets picked up by the User robot (or maybe you have a Sessions robot) and that deletes the session_id from the Sessions database table.

Now let’s consider the other case, when you use 3rd party auth.

Firebase: let’s assume you use something like Firebase for all authentication.

Your frontend code can make a call directly to Firebase. You’ll get back a JWT, which contains a session key (for some reason, Firebase uses an enormous session key, something like 900 characters. I’ve no idea why this is so huge. But it works fine as a session key). Pass that session key along with every document that you send to the public write point. Each of the robots on the backend will then have to verify that the session token is valid. That is, they get the session key, then they make a call to Firebase to see if the key is valid and who does it belong to. For this, it can be helpful if you are using a monolith on the backend, such as Django or Ruby On Rails, as it makes it easier to use the “verify with Firebase” code just once for each document. Otherwise, potentially, if one document needs to be processed by 3 different backend robots, each robot will have to verify the session key with Firebase. But if you have developed your backend as a series of microservices, it is not much work to write a simple wrapper for your Firebase code and then include that everywhere as a shared library. A very minor pain point.

In either approach, authentication and maintaining sessions are not radically different compared to traditional websites, especially if your website was already mostly Javascript (so, “traditional” in the post-2005 sense, not in the 1993 sense).

After I explain these ideas to a team, at some point I get this pushback:

This is too complicated. It’s also unnecessary. I understand why our web scraping script has to dump raw HTML somewhere — it’s pulling in 50 gigabytes a day. So, yes, the web scraping script can dump its raw HTML on S3 or Kafka, or wherever. But I’m just writing the website code. This part of our code can be a normal Ruby On Rails app. There is no reason why our frontend code needs to interact with a system as complicated as what you are talking about.

My response is usually something like this:

Be wary! You might well be in the early stages of creating a disaster like the one created by Imani and Santiago. If you build a system with multiple write points, at the very least be sure you have a unified log through which all messages go. Otherwise you are on the road to a system that has multiple data silos and no canonical source of truth. The unified log is essential.

What should be stored permanently in Kafka?

You can store a lot in Kafka. At LinkedIn, they currently keep 900 terabytes of data in Kafka. But your system will almost certainly generate some useless debris that does not need to be stored forever in Kafka. You should use Kafka as your main communication channel, so everything should go through Kafka, but some types of messages can be deleted after a few minutes.

I’m going to share an actual chat I had with a client, because I think this shows the learning process that a team goes through:

Them:

When you used ElasticSearch, how did you resolve race conditions pertaining to two documents updating at the same time? Some merge logic? Did you retry the write?

Me:

I was simply pulling from MySQL and writing to ElasticSearch so generally it was fine to go with “Last write wins.” You need to think carefully about how much harm will be caused if sometimes you have a document in ElasticSearch that’s out-of-date. Obviously, if the issue is payment, that could be very painful (if the user has paid, but ElasticSearch still thinks the user has not paid). The need for “real time” handling can get very complicated. When we say “real time” we mean “high priority.” Some companies use Apache Storm in this situation, which has multiple redundant checks and restarts on failed functions, as such would allow a team to avoid the possibility that a failed app or a failed write leaves the final cache (ElasticSearch) in an out-of-date state. But both Jay Kreps and Nathan Marz have agreed that you’ll want to run Kafka no matter what, but running both Kafka and Storm leads to an incredible amount of complexity. Still, some companies do this. Parsely does this. But in general, you want to keep things simple. Just stick with the basic architecture, and try to make the translator app as fast as possible, so if it gets something wrong now, hopefully it will be less than a minute before it re-tries and get things correct. Will your customer be angry if the data is wrong for one minute? Can you calm them down? Because if you can manage their expectations, then your architecture can be a lot less complex.

Them:

Hmmm, I think perhaps I have a different kind of race condition than what you just talked about. I’m dealing with the async nature of NodeJS. I’m writing the Translator app in NodeJS. And there are multiple Kafka messages coming in.

Me:

I never wrote directly from Kafka to ElasticSearch, so I’m not sure how to answer. What I typically do is I have the backend robots do some processing and then store the data in normalized form in PostGres.

Them:

So, say a user makes a payment. To facilitate this, a message is posted to Kafka saying, “This user, with these credentials, would like to make a payment.” This message is picked up by the “Billing” service, which uses the incoming information to determine if the billing succeeded.

At this point, there are two ways the front-end could be notified of this change.

Option A: The Billing service could cache the successful billing receipt in PostGres. The Translator app could listen to this log and update accordingly. The User service could also listen to this log and update some info in the User profile.

Option B: The Billing service does not cache anything in PostGres, but instead posts a “Transaction Complete” message to Kafka. Anything listening can do whatever it wants with that information, including updating some information in ElasticSearch or info in a User profile.

In both of these situations, the event is stored somewhere.

We will have some forms of orchestration, and I don’t think listening to PostGres in that regard is the best idea. I don’t think it does what we want. We have Kafka specifically for events.

Me:

Sending a Message to ElasticSearch that says “Your payment was successful” is the right idea. And you can mark a “Payment successful” in Kafka. I’d say those are two separate messages that the Billing robot will send. I think it is okay to cheat a little bit, if you need the speed, and have the Billing robot write directly to ElasticSearch, rather than write to PostGres. Again, if you need the speed. Balance this against the fact that you are making your life a little bit more complicated every time you make an exception to the overall architecture. And when you think you need that extra bit of speed, so you want to cheat a little bit, you are taking a step down the road that lead the industry to the whole debate about “Should we use Kafka and Storm together?” See this:

“The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable.”

As a general rule you only write to Kafka what you would need to recreate the whole history of your app, and therefore the current state. That does not include all human readable messages, but it does include things like receipts, since you would not be able to recreate that later on.

As to how many events you write to Kafka, you can be practical about that. No writing two messages when one will describe the event. Don’t write permanent messages if your only goal is to trigger some other backend robot that you know is waiting for some particular signal.

Them:

Hmmm. I see what you’re saying. Would you agree there is a difference between a “Please Process A Payment” message, versus a “A Payment Was Successful” type of message?

Me:

You can distinguish between intentions and facts. The future tense versus the present tense. In standard CQRS lingo, these are called “commands” and “events.”

“I would like to make a payment with these credentials” is a command.

“Your payment was successful” is an event.

I don’t often use that lingo because I don’t know if it clarifies much, but maybe it is useful?

Them:

Oh it does!

It slipped my mind that while the incoming messages are commands, they’re not important and all we need are the resulting events.

Me:

That is a reasonable architectural decision. There are different ways to handle commands. The crucial thing is that the long-term permanent records in Kafka hold all info that you would need to recreate the history of your service, and therefore the current state.

Them:

So “command” messages can delete themselves after an hour, or something like that. 5 minutes. Or a day. Whatever. They don’t need to be permanent.

Me:

That’s correct, they don’t need to be permanent. Some of that depends on what sort of experience you want to guarantee for your customers.

You don’t need to permanently record the commands, you could just permanently records the results (the events). But if an app dies when it’s in the middle of handling a command, it’s useful to have recorded the command, so the app can try again when it restarts. I assume the app will need less than 1 second to restart. For instance if you get the command “I would like to make a payment with these credentials” and your Billing app dies when it tries to handle the command, then when the Billing app restarts it would see an unhandled command and could re-run it. Or you can simply assume that your customer will try again later. This latter option is less complicated, so it is preferable if you are allowed to treat your customers that way. Just be sure it’s really okay with your business model.

Them:

Yes.

So, to bring this back to the Translator app. It could also respond to these events, and be somewhat agnostic regarding the use of PostGres. We don’t need PostGres, do we? If ElasticSearch is our cache, why do we also need PostGres?

Me:

Be wary. This conversation runs the risk of confusing two technologies with two different kinds of cache. You can use ElasticSearch for everything, but you will still have two conceptually different caches: one that holds normalized data, and one that holds denormalized data.

You don’t need to use PostGres. You can use any kind of data store. You can use Cassandra of MongoDB or MySQL or Redis. Since you are already using ElasticSearch, you can use ElasticSearch for everything.

Your backend robot services are going to want to access data that is in a normalized form. Your frontend is going to want to access data that has been denormalized for the convenience of the frontend. When your backend Notification app needs to ask “Is this User a valid, paid up user?” the easiest way for it to answer the question is to look up the status of the User in a User table. That is, the Notification robot will want to look up the answer in data that is normalized, and is therefore ready to answer such questions. But you don’t want your frontend talking to a normalized data source, because then your frontend would have to make 20 or 30 or 40 Ajax calls to render a page. The frontend should talk to a denormalized data source, so the frontend will only need to make 1 or 2 or maybe 3 Ajax calls to render a page.

So it doesn’t matter if you use PostGres. You could use ElasticSearch for everything. Just remember that, regardless of which technologies you use, you still will have 2 conceptually different caches — one for normalized data and one for denormalized data. You can put both caches in ElasticSearch, but remember how different they are.

And also remember, even if you use PostGres, you are using it as a cache. The permanent items in Kafka are the only canonical source of truth for your system. You could delete your PostGres database, and delete all the backups for the PostGres database, and that wouldn’t matter at all, because you could regenerate that data by having your backend robots re-read the history of all events in Kafka. Your log is your source of truth. Everything else is a cache.

.
.
.

On a related subject, you might want to read Why Don’t Software Developers Understand Kafka and Immutable Logs?