One write point, one read point, one log

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

My last several clients have needed high-write-throughput systems. Most of these clients were doing something with Natural Language Processing (NLP) and they were often sucking in gigabytes of data, which had to be put somewhere until it could be processed. I found that entrepreneurs were surprised at how much time and money had to go into the support systems. So finally I created this all-in-one OWPORPOL graphic, which tries to show how all the different parts of the system fit together.

I encourage comments, but please read the whole essay first. Maybe I’ve already answered your question.

A rant about industry jargon

All of the folks who write about CQRS are very smart, and nothing in this essay is meant as criticism of them. Martin Fowler is ten times smarter than I am, and you can certainly learn more from him than you can learn from me. His essay about CQRS is great and worth reading. And since I’m going to talk about Kafka in particular, it is worth reading Jay Kreps’s awesome essay “The Log: What every software engineer should know about real-time data’s unifying abstraction“.

Still, there is a problem in this industry (the tech industry) of people putting their own spin on what should be standard words. In particular, the industry has overloaded certain words with multiple meanings, to the point that these words have become almost useless for expressing actual meaning:

event

message

document

command

query

ownership

responsibility

service

If I told you I was working on an event-driven messaging system, you would not know whether I meant a frontend GUI that passes “messages” (Ajax calls) to a backend after someone clicks a “Send” button (an event), or a dating app that sends text messages via Twilio whenever someone swipes left on your profile, or an NLP system that updates a PostGres database with a message/document whenever the NLP system finds a Statistically Significant Phrase (SSP) in a scanned article. The words “event” and “message” and “document” all get so overused that they no longer clarify anything.

That is one (very minor) complaint that I have about Command Query Responsibility Segregation (CQRS): merely reading those four words doesn’t tell me much about the architecture. Meanwhile, consider this blog post (the one you are now reading). You could skip the whole thing; if you simply read the headline, you already have a glimmer of the architecture that I’m going to talk about: one write point, one read point, one log (OWPORPOL). An experienced software developer can probably guess 75% of what I’m going to say, just based on the title. I think it’s a good title because it communicates so much in just eight words.

(Anecdote alert! Skip this paragraph if you’re in a hurry.) Recently I was invited to dinner at the New York Fogo de Chão Steakhouse to discuss with a data center provider (both traditional and cloud) and consultancy how they could get higher margins from the consulting side of their business. There were 20 of us there, invited to give advice, and the main idea that everyone agreed to was “Write a white paper where you propose your own proprietary system for managing projects. Invent your own meanings for industry standard jargon. Doesn’t matter what the white paper says, just claim that it arises from your team’s long years of experience, and you’ve used this process many times before, and it guarantees success, so long as your client follows it rigorously.” And then the CEO of the company said, “No, no, no, we’re looking for recurring revenue.” And the smartest guy in the room shot back with, “It is recurring revenue! You tell your clients ‘You need to do X once a week, and Y once a month, and then once a quarter you do Z, to be sure that everything is going well.’ And then you oversee, X, Y, and Z, to be sure everything is going well. And that is recurring high-margin revenue.” And the CEO said “We make $50 million a year right now, but I’ve promised the Board that I can take revenues to $150 million a year within 4 years. Do you think this strategy could help drive that kind of growth?” And everyone in the room agreed, yes, it can. And, hey, this works. It’s tough to argue with this strategy when it works so well. And it’s not really harmful, so I don’t want to sound vicious. Presumably the fact that so many large enterprises find it desirable to hire these expensive consulting firms means that these expensive consulting firms offer something of value to large enterprises (yes, yes, I know, the point is debatable, but let’s leave that discussion for another essay). What that means is there are a lot of consultancies out there, and they have a strong incentive to invent their own meanings for well-accepted, industry standard jargon. And I think that’s how a phrase like CQRS (or, going back in time, “Web Services” or “Agile“) ends up with multiple meanings, and thus becomes meaningless.

One write point, one read point, one log

I’m about to describe what I once thought of as CQRS. What I’m describing could be called Classic CQRS or Simplified CQRS, but there are many experienced software developers who would insist that what I’m talking about here has nothing to do with CQRS, because in their mind the key thing about CQRS is the separation of command code from query code, and the distribution of responsibilities within the code, whereas my focus is on the physical setup of the network and the overall architecture of the system, while I’m also being deliberately agnostic about how the software model is organized (for reasons I talk about later). Also, as a practical (not theoretical) matter, when you search Google for information about CQRS, most of the top-ranked articles combine the subject with Event Sourcing, or you find articles that focus on the difference between mutable events versus immutable subscriptions (as an example, see how a CustomerWorkflow subscribes to a series of ReflectiveMutableCommandProcessingAggregate). And there are some interesting articles, such as (Microservices, Events and CQRS at Klarity), that show graphs that mainly focus on the flow of logic inside the codebase, but mix in some items that are obviously physical, so you get a mish-mash of logical and physical items. All of which can lead to confusion. Since my focus is more on the physical movement of data through the network, for the rest of this essay I’ll talk about “one write point, one read point, one log” (OWPORPOL). If you feel this essay is simply about CQRS, I’m okay with that, and if you feel this essay has nothing to do with CQRS, I’m okay with that. The nomenclature is not a thing I want to argue about.

Most of this essay is a long list of questions that I’ve received from my clients. And mostly I’m dealing with inexperienced teams because if they had someone like me in-house, then they would not need to hire me. So don’t be surprised if the questions seem like they are from someone inexperienced.

You can see that I’ve put together a giant graphic that tries to explain the system I’m talking about. You’ll see I also include a lot of devops details that you might think of as superfluous to OWPORPOL. That’s because a few months back I showed a classic CQRS diagram to a client and I had this conversation:

CLIENT: So, this is all we need to build?

ME: Yes, that is the whole system for importing the text and running the NLP processes on them.

(A week passes)

CLIENT: Okay, the team has made some progress reading and writing to Kafka, and running the NLP, but it seems like we have to manually copy our code to the servers to make it run for real?

ME: No worries, we will set up a Jenkins server, which can run tests, build the uberjar, and then deploy it to your servers.

CLIENT: What? We need another server?

ME: For the build system, yes.

CLIENT: Last week you said you showed us everything we need to build! Are you trying to rip us off?

ME: What?

CLIENT: You told me that graphic you gave us was everything we needed to build!

ME: Sure, for importing text and running NLP. But you need a build system, a deployment system, a health check system, a lot of other things.

CLIENT: God damn, are you kidding me! How much is all this going to cost?

I didn’t like the idea that they thought I was trying to manipulate them into giving me more money. I’d stupidly made the assumption that they already knew that there would have to be other systems. I was wrong to assume that. So I created the all-in-one graphic, because I’ve found it useful to have a graphic that explains how all the systems combine and work together. Especially if I’m talking to a non-technical entrepreneur, this lets us have an honest conversation about what the total costs will be.

The nightmare when ad-hoc quick fixes become your long-term architecture

Why use OWPORPOL? Martin Fowler says:

CQRS allows you to separate the load from reads and writes allowing you to scale each independently. If your application sees a big disparity between reads and writes this is very handy. Even without that, you can apply different optimization strategies to the two sides… If your domain isn’t suited to CQRS, but you have demanding queries that add complexity or performance problems, remember that you can still use a ReportingDatabase…
Despite these benefits, you should be very cautious about using CQRS. Many information systems fit well with the notion of an information base that is updated in the same way that it’s read, adding CQRS to such a system can add significant complexity. I’ve certainly seen cases where it’s made a significant drag on productivity, adding an unwarranted amount of risk to the project, even in the hands of a capable team.

Fowler is right to warn about the complexity, although I’ve seen teams get into even worse situations when they try to implement piecemeal ad-hoc solutions to deal with high write throughput. I believe what happens at such teams goes like this (based on a real example):

Santiago: Huh, our Ruby On Rails app has always been able to take input from the Web and write it to our MySQL database, but now that we are letting our users upload Excel spreadsheets, the app is going too slowly, what should we do?

Imani: No worries. Let’s set up some separate apps that just handle the Excel spreadsheets! One app can convert the XLSX to CSV and put it on a RabbitMQ queue, then another app can parse the CSV and write it to MySQL.

Santiago: Excellent idea! That solves all of our problems!

(One year later)

Santiago: Our new Web scraping Python scripts are pulling in 50 gigabytes of HTML each day. We can’t write all of this to MySQL. What should we do?

Imani: No worries. We can store all the raw HTML in S3 buckets, then we can have other scripts that pull the HTML files and run them through our NLP filters, then write the parsed Statistically Significant Phrases (SSPs) and linked sentences back to other S3 buckets.

Santiago: Excellent idea! That solves all of our problems!

(One year later)

Santiago: The sales team wants us to allow our customers to set up notifications, so when a particular name is discovered by our NLP scripts, we can send an alert to our customers, but our Notifications table is in MySQL and our NLP output is in an S3 bucket. What do we do?

Imani: No worries. We can pull from MySQL and the S3 buckets and combine them in MongoDB, then create an app that looks for new data appearing in MongoDB, and then we can send the appropriate notifications!

Santiago: Excellent idea! That solves all of our problems!

(One year later)

Santiago: Our business intelligence team wants to find out if our customers are getting any real benefit from our Notifications, but their analysis tools need data from S3 and MySQL and MongoDB. What should we do?

Imani: No worries. We’ll create an app that pulls from S3 and MySQL and MongoDB and combines all the data in Redshift. Then the business intelligence team can run their queries against Redshift.

Santiago: Excellent idea! That solves all of our problems!

(More years pass, more additions are made, and so on, and so on.) [1 About the names]

No, this does not solve all their problems. What has happened, over time, is that a series of ad-hoc solutions has built up data silos, each of which now holds a bit of the canonical truth of the system, which is to say, the system no longer has a canonical source of truth. This anti-pattern is so destructive that I need to emphasize it:

The system no longer has a canonical source of truth

What’s more, the intuitions that allow this system to work exist only in Imani’s head, and if you tried to document the system, you would have great difficulty writing it down in a way that would let a new member of the tech team understand what is actually happening.

At this point we need to reconsider Martin Fowler’s warning:

I’ve certainly seen cases where it’s made a significant drag on productivity, adding an unwarranted amount of risk to the project

because now OWPORPOL would be simpler and less risky than what Santiago and Imani have done.

Remember what Jay Kreps warned about: without a unified log, every data store and every service ends up wired to every other one through a tangle of ad-hoc, point-to-point pipelines. What you want instead is a single unified log that every system writes to and reads from.

Of the ideas in this essay, the unified log is more important than “one write point, one read point” because there are some companies that have tried to implement CQRS without having a unified log and they end up worse off:

To be honest our system doesn’t do that much yet and all these services might seem like overkill. They do to us at times as well, that’s for sure. It can be annoying that adding a property to a case will pretty much always involve touching more than one service. It can be frustrating to set up a new service just for “this one little thing”. But given the premise that we know that we must build for change I think we are heading in the right direction. Our services are loosely coupled, our query services can easily be thrown away or replaced, the services have one and only one reason to live and are easy to reason about. We can add new query services (I can see lots of them coming up, e.g. popular-cases and cases-recommended-to-you) without touching any of the command services.

You’ll notice that the graphic in that article has an “events database” but it is not the center of the whole architecture. That article has no concept of the unified log, and I think that is why they got into trouble. I find it fascinating how their ideas overlap with 90% of mine but the remaining 10% seems to be very important. They do understand this much:

You get a transaction log for free, and you can get the current state by replaying all the events for a given order. Add to that, now you can also get the order state at any given point in time (just stop the replay at this time).

But they don’t seem to understand the idea that the log needs to be the center of all communication in their system. Indeed, they call it a “transaction log” rather than a “unified log.” In this case, it’s possible the nomenclature is significant. They also say this:

Worry not, there is a better way, and it is called event collaboration. With event collaboration, the Order Management Service would publish events as they occur. You typically use a message bus (e.g. Kafka, RabbitMQ, Google Cloud PubSub, AWS SNS) and publish the event to a topic.

If you combine their “transaction log” and their “message bus” then you get the concept of a Unified Log. Maybe they understand this, or maybe they don’t; what I can see is that the graphic they use in that article has “publish” and “subscribe” scattered around the system in a disorganized way.

An all-in-one OWPORPOL graphic

A few notes about the big workflow graphic:

1.) I previously complained about graphics that are “a mish-mash of logical and physical items.” In this graphic, you can assume each item is a physical item, if you believe the system is built using a microservices style. However, the graphic is really agnostic on that issue. I used to promote the microservices style, but I’ve since seen some teams do well with it and other teams do badly with it. The more I consult with different teams, the more I’ve come to appreciate that creating software is more of an art than a science, and therefore architectural ideas should show deference to the artists who are creating the software. If your lead engineer loves microservices, use them. If your lead engineer loves a monolith, then use that monolith. In the graphic, you can believe that each item is its own app, or you can believe they are separate modules in Ruby On Rails or Sinatra or Django or Flask or Symfony or Zend or Axon or whatever you want to use. If you have an inexperienced team that mostly just works with Django, then using Django and PostGres on the backend can give them a pleasant feeling that at least part of this architecture is familiar to them.

2.) Many of the graphics used in other tech articles are vague and abstract. Understandably, many developers try to be “technology agnostic” when they write about architecture, because they assume other developers will want to implement the same architecture in a different language. However, my non-technical (or technical but inexperienced) audience is allergic to this much abstraction, so here I wanted to be specific, using the names of actual technologies. But depending on your needs, any of these technologies could be replaced by something else, and reasonable software developers might prefer different technologies simply because they have experience with those technologies. So for instance, I use ElasticSearch in this graphic, but I worked with one company that converted to GraphQL, as part of a re-write during which they committed to the React eco-system (at least for now, I would recommend against GraphQL unless you are committing to the React system). In that case, we started with Code Foundries Universal Boilerplate for Relay, but we tore out the code for Cassandra and used MongoDB — we did this simply because we all had experience with MongoDB, but we didn’t have experience with Cassandra. This is not a vote against Cassandra, as I have friends at Parsely, where they use Cassandra to good effect. Likewise, elsewhere in the graphic I use PostGres, but of course you could use MySQL or Oracle or MS SQL Server, or for that matter, since it is only a cache, you could use any cache technology, such as MongoDB or even Redis. But most of the time I recommend PostGres, because RDBMSs are familiar to most teams, and PostGres is the best of such Open Source systems. Also, I mention Kafka as the Unified Log, but AWS has a similar product called Kinesis.

I am grateful to my clients for educating me

I am fascinated with the things my clients don’t understand. Their questions are a major part of my own education. I am deeply grateful for the chance to answer their questions. What follows are the most common questions I’ve gotten regarding OWPORPOL.

How does the frontend talk to this backend?

I had this conversation with a frontend developer who was completely new to the OWPORPOL architecture.

Them:

How do I know I have created something on the backend? Suppose a new user wants to create their profile. If we were using a normal website, something like Django, then my Javascript code does a POST to the server, and I’m going to either get back a 200 or a 400 or a 404 or maybe a 500. Either it works, or it doesn’t work, and I get back an HTTP response code that tells me what happened. And then the frontend code can show an appropriate message to the user. But in this architecture that you are talking about, I send a JSON document via POST and then… what? Maybe the API Gateway gives me a 200, but that doesn’t tell me if the document made it all the way to the database. That just means I had the correct API key and so I was allowed to make a POST. But I’ve got no idea if anything succeeded, and I’ve got no way to reference this document. I’ve got no ID, no way of knowing how to refer to this in the future.

Me:

It might seem like a complicated architecture, but in my experience, when things are working correctly, a new document can get through the system in less than 2 seconds. It should never take more than 2 or 3 seconds for a document to go through these steps:

1.) from the user’s web browser to the API Gateway

2.) to an app that reformats the data for Kafka

3.) to Kafka

4.) to the backend robot that handles User profiles

5.) to PostGres

6.) to the Translator app

7.) to ElasticSearch

If, for some reason, your system slows down to the point where it is taking 20 or 30 seconds for a document to get through the system, then, first of all, you need to look at the bottleneck and fix it, but, second of all, I have seen many startups cheat a bit: for high-priority information, the backend robots will both write to PostGres and also write directly to ElasticSearch. That leads to tight coupling between the backend robots and the denormalized data in ElasticSearch, so I don’t recommend it, but I have often seen it done.

Assume that 99.9% of the time a new document can get through your system and arrive in ElasticSearch in less than 2 seconds. Then your code can POST to the write point, wait 2 seconds, and then make a read from the read point — your new document should be there.

Them:

But how would I know? How would I be able to tell if the new document is in ElasticSearch? I don’t know its ID in the database, so I have no way to identify it.

Me:

Before you send it, you’ll want to create a UUID and add that in as the document id. And you’ll want to send it with the session id. So 2 seconds later your code can ask, is there a document with this session id and this document id? In other words, for a new user profile, you send something like this:

{
    "first_name" : "Allie",
    "last_name" : "Veblen",
    "phone" : "01-987-234-9812",
    "company_id" : "9871fVGTED120dkfir93F",
    "document_id" : "f02343dlGTRF9090909dLc",
    "session_id" : "888F8F2cvnd4jf09LKJiv2"
}

And then 2 seconds later you can query for:

{
    "document_id" : "f02343dlGTRF9090909dLc",
    "session_id" : "888F8F2cvnd4jf09LKJiv2"
}

If the document comes back, then you know it was successfully created. If nothing comes back, wait another 5 seconds and try again. Then another 5 seconds and try again. You can set some cutoff, let’s say 15 seconds. If the document is not there after 15 seconds, you can assume something failed. You can either do the POST again, or tell the user that there is a problem.
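To make that concrete, here is a minimal sketch of the write-then-poll pattern in Python, using the requests library. The endpoint URLs, the API key header, and the exact query parameters are assumptions on my part; substitute whatever your API Gateway and your read point actually expose.

    import time
    import uuid
    import requests

    WRITE_POINT = "https://api.example.com/write"   # assumed API Gateway endpoint
    READ_POINT = "https://api.example.com/read"     # assumed ElasticSearch-backed read point
    API_KEY = "your-api-key"

    def create_profile(first_name, last_name, session_id):
        document_id = str(uuid.uuid4())
        doc = {
            "item-type": "user_profile",
            "first_name": first_name,
            "last_name": last_name,
            "document_id": document_id,
            "session_id": session_id,
        }
        requests.post(WRITE_POINT, json=doc, headers={"x-api-key": API_KEY})

        # Poll the read point: once after 2 seconds, then every 5 seconds, up to roughly 15 seconds.
        for wait in (2, 5, 5):
            time.sleep(wait)
            resp = requests.get(READ_POINT,
                                params={"document_id": document_id, "session_id": session_id},
                                headers={"x-api-key": API_KEY})
            if resp.ok and resp.json():   # assumes the read point returns an empty list when nothing matches
                return document_id        # the document made it all the way to ElasticSearch
        return None                       # give up: re-POST, or tell the user something went wrong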

In general, you really should not rely on the auto-generated ID of a database table to be your document id. That style was popularized by PHP and MySQL and then Ruby On Rails and MySQL, but that style needs to die. Arguably it is a security risk, since outsiders can easily guess the id of any document in your system, which they cannot do if you use UUIDs.

Them:

But even with UUIDs, isn’t there still a big security risk? Can’t they potentially guess a UUID?

Me:

Sure, hackers can try to guess UUIDs, that is a well known attack. But there are a few things to keep in mind. First of all, with UUIDs, they would have to guess many trillions and trillions of UUIDs before they would even have a 50% chance of correctly guessing a UUID that you are using. Second of all, you should have some code that checks to see if hackers are attacking your site with random guesses. If you use the AWS API Gateway, then you don’t have to write any code yourself, because Amazon has already done that for you. The only requests that get through will be requests that have your API key.

Them:

A UUID seems complicated. I don’t know how to create them. Maybe I can just use a timestamp instead?

Me:

No, you don’t want to use a timestamp. That would only work for the simplest cases. First of all, as time goes by, your frontend will become more complex, and eventually you will find you are attaching multiple events to a single button click. These would all have the same timestamp, so the timestamp would fail to tell you which event/document is which. Second of all, you don’t need to create UUIDs yourself; there are many good libraries that do this for you, and they are easy to use.
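For example, in Python the standard library does it in one line, and most languages have an equivalent built in or available as a small, well-tested package:

    import uuid

    document_id = str(uuid.uuid4())   # something like '1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed'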

How does the user login?

Especially nowadays, inexperienced developers have gotten so used to working with systems that automatically handle sessions that many of them don’t seem to understand what a session is, or how to handle one manually. So I am often asked a variant of this question by whoever is working on the frontend:

How can I handle login in a system like this? If I build a website using WordPress or Ruby On Rails or Django then all of those systems have code to handle logins. Normally I get the user’s username and password, from an HTML form, and then I POST that to the server, and if the username and password match an existing username and password, then the user is logged in. But in this OWPORPOL system, I don’t have a direct connection to the backend, so I don’t think I can maintain a session?

To answer, let’s first talk about a totally normal website, like the type run by WordPress. In that case:

1.) user types username and password into an HTML form

2.) the username and password are POSTed to the server

3.) (assume success) the server sends back a session_id, to be set as a cookie

4.) the web browser records the cookie

5.) thereafter, the web browser sends the session_id with every request, in the HTTP cookie headers

One could just send the username and password with every single request, but that would get tedious and would also possibly be a security risk.

How to handle authentication with an OWPORPOL system? There are a few ways, including the traditional “send username and password from HTML form.” But nowadays 3rd party auth has become very common, so let’s talk about that too.

In the (very traditional) first case:

1.) you have a website where the user fills in the username and password in an HTML form.

2.) You also have a small auth app that is available at a public endpoint. This has access to a cache of all usernames and (hashed) passwords (I’m just going to assume proper password hashing and not talk about it). This auth app is an exception to the overall architecture, since it amounts to a write point other than the main write point. I think this is a minor exception that doesn’t hurt the overall system.

3.) The HTML form POSTs the username and password to the auth app

4.) let’s assume the auth app finds a match for that username and password

5.) the auth app generates a UUID to be used as the session_id. This gets sent back to the web browser

6.) the auth app also writes the session_id and username to Kafka (this is a direct write, and therefore another exception to the notion of “one write point”), so that the User robot on the backend can pick it up and see what the current session_id is. If your backend is written in a standard Django or Ruby On Rails style, you might have a Sessions table in PostGres where you store all current session_ids

7.) thereafter, any document sent to the write point should include the session_id

In other words, the auth app writes something like this to Kafka:

{
    "action" : "login",
    "username" : "joel",
    "session_id" : "9230940329420394",
    "status" : "success"
}

or this for a failed login:

{
    "action" : "login",
    "username" : "joel",
    "status" : "failed"
}

(Too many failed login attempts are a problem. You should have some backend robot that tracks this, as in the sketch below.)
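Here is a minimal sketch of what such a robot might look like, assuming the confluent_kafka Python client and an “auth” topic carrying the login events shown above; the topic name, the threshold, and the alerting are all placeholders.

    import json
    import time
    from collections import defaultdict, deque
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",   # assumed broker address
        "group.id": "failed-login-watcher",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["auth"])                  # assumed topic carrying login events

    failures = defaultdict(deque)                 # username -> timestamps of recent failures

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        if event.get("action") == "login" and event.get("status") == "failed":
            now = time.time()
            window = failures[event["username"]]
            window.append(now)
            while window and now - window[0] > 600:   # keep a 10-minute sliding window
                window.popleft()
            if len(window) > 5:
                print(f"ALERT: {event['username']} has {len(window)} failed logins in 10 minutes")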

Later, someone tries to send in a New Business action, which might look like this:

{
    "item-type" : "business",
    "name" : "Joel's Country Vegetables",
    "document_id" : "frt43209KJH3111nvBt2",
    "session_id" : "9230940329420394"
}

This gets picked up by the Business robot, which needs to know if this is a valid user, so it checks the session_id in PostGres. This is exactly what the major frameworks do: Flask, Ruby On Rails, Django, Symfony, etc. In fact, you could use any of those frameworks for the backend.

The Business robot sees the session_id is valid, so it creates a new row in the Business table in PostGres. And it should write a message “Your new business has been created!” into the Messages table, and that should be passed along to ElasticSearch, and then to the user (all of this usually happens in less than 2 seconds).

PostGres is a cache that holds the current snapshot of what data is valid. However, it is not the canonical source of truth. Note that you could delete PostGres, and delete all of your PostGres backups, and you could still correctly recreate all of the information in PostGres, by having your robots run over the data in Kafka. So the history of events in Kafka is the canonical source of truth, in the sense that all the knowledge of your app can be recreated by re-reading the data from Kafka. However, in terms of what is the current data, you can keep that in PostGres, and in that sense the app will work a lot like a typical Flask app, or a typical Ruby On Rails app.

The failure mode is also simple: if the Business robot cannot find a matching session_id in the Sessions table, then it should log this as a failed attempt. Failures should also be stored in the Messages table, to be passed along to the user: “We could not create a business, could not find a valid session.”
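As a sketch, the heart of the Business robot might look something like this, assuming psycopg2 and the table and column names used in this essay (the surrounding Kafka consumer loop would look like the failed-login watcher sketched earlier):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")   # assumed connection settings

    def handle_new_business(doc):
        with conn, conn.cursor() as cur:
            # Is this a valid session? Look it up in the Sessions cache in PostGres.
            cur.execute("SELECT username FROM sessions WHERE session_id = %s",
                        (doc["session_id"],))
            row = cur.fetchone()
            if row is None:
                cur.execute(
                    "INSERT INTO messages (session_id, body) VALUES (%s, %s)",
                    (doc["session_id"],
                     "We could not create a business, could not find a valid session."))
                return
            # Valid session: create the business and queue a success message for the user.
            cur.execute(
                "INSERT INTO businesses (document_id, name, owner) VALUES (%s, %s, %s)",
                (doc["document_id"], doc["name"], row[0]))
            cur.execute(
                "INSERT INTO messages (session_id, body) VALUES (%s, %s)",
                (doc["session_id"], "Your new business has been created!"))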

How long does login last? Every website has different rules about this. Most sites will allow you to stay logged in forever — the session never expires. A user remains logged in unless they specifically initiate a logout action. However, some sites, such as my bank, are wary about security, so they cancel sessions after just 5 minutes. Ending a session just means deleting the session_id from the Sessions database table. Or maybe the user clicks the “logout” button on the website, at which point a “logout” event is sent to the public write point, then to Kafka, then it gets picked up by the User robot (or maybe you have a Sessions robot) and that deletes the session_id from the Sessions database table.

Now let’s consider the other case, when you use 3rd party auth.

Firebase: let’s assume you use something like Firebase for all authentication.

Your frontend code can make a call directly to Firebase. You’ll get back a JWT, which contains a session key (for some reason, Firebase uses an enormous session key, something like 900 characters. I’ve no idea why this is so huge. But it works fine as a session key). Pass that session key along with every document that you send to the public write point. Each of the robots on the backend will then have to verify that the session token is valid. That is, they get the session key, then they make a call to Firebase to see if the key is valid and who it belongs to. For this, it can be helpful if you are using a monolith on the backend, such as Django or Ruby On Rails, as it makes it easier to use the “verify with Firebase” code just once for each document. Otherwise, potentially, if one document needs to be processed by 3 different backend robots, each robot will have to verify the session key with Firebase. But if you have developed your backend as a series of microservices, it is not much work to write a simple wrapper for your Firebase code and then include that everywhere as a shared library. A very minor pain point.
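A minimal sketch of that shared wrapper, assuming the firebase_admin Python library and a service-account file whose path is a placeholder:

    import firebase_admin
    from firebase_admin import auth, credentials

    firebase_admin.initialize_app(credentials.Certificate("service-account.json"))

    def verify_session(id_token):
        """Return the Firebase user id if the token is valid, otherwise None."""
        try:
            decoded = auth.verify_id_token(id_token)
            return decoded["uid"]
        except Exception:
            # expired, revoked, or malformed token: treat all of these as "not logged in"
            return None

Each backend robot imports verify_session() instead of talking to Firebase directly, so the Firebase-specific details live in one place.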

In either approach, authentication and maintaining sessions are not radically different compared to traditional websites, especially if your website was already mostly Javascript (so, “traditional” in the post-2005 sense, not in the 1993 sense).

Too complicated! I don’t wanna!

After I explain these ideas to a team, at some point I get this pushback:

This is too complicated. It’s also unnecessary. I understand why our web scraping script has to dump raw HTML somewhere — it’s pulling in 50 gigabytes a day. So, yes, the web scraping script can dump its raw HTML on S3 or Kafka, or wherever. But I’m just writing the website code. This part of our code can be a normal Ruby On Rails app. There is no reason why our frontend code needs to interact with a system as complicated as what you are talking about.

My response is usually something like this:

Be wary! You might well be in the early stages of creating a disaster like the one created by Imani and Santiago. If you build a system with multiple write points, at the very least be sure you have a unified log through which all messages go. Otherwise you are on the road to a system that has multiple data silos and no canonical source of truth. The unified log is essential.

Less essential, but also helpful, is the idea of having a single write point. That means that even your web scraping app writes to that one write point. Yes, you could instead have the web scraping app write directly to Kafka. You could have lots of apps write directly to Kafka. If you do that, three possible annoyances often arise:

1.) logic for writing to Kafka is all over the place

2.) logic for enforcing schemas is all over the place

3.) lots of material appears in Kafka and you may not know where it is coming from

That said, I see a lot of companies that cheat regarding “one write point” and they don’t seem to suffer much. But you absolutely need the unified log. That much is essential.
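As an illustration of how small the single write point can be, here is a sketch using Flask and the confluent_kafka producer. The required fields, the topic name, and the broker address are assumptions, and a real version would also check the API key or session before accepting anything.

    import json
    from flask import Flask, request, jsonify
    from confluent_kafka import Producer

    app = Flask(__name__)
    producer = Producer({"bootstrap.servers": "localhost:9092"})   # assumed broker address

    REQUIRED_FIELDS = {"item-type", "document_id", "session_id"}

    @app.route("/write", methods=["POST"])
    def write_point():
        doc = request.get_json(silent=True) or {}
        missing = REQUIRED_FIELDS - doc.keys()
        if missing:
            # Schema enforcement lives in exactly one place.
            return jsonify({"error": f"missing fields: {sorted(missing)}"}), 400
        producer.produce("incoming-documents", json.dumps(doc).encode("utf-8"))
        producer.poll(0)   # serve delivery callbacks without blocking the request
        return jsonify({"accepted": doc["document_id"]}), 202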

Caches are cheap, build a lot of them for the frontend

There was a stretch, maybe from 2005 to 2015, which will be remembered as the era of simple Ajax uses. The era started when Sam Stephenson released the Prototype library, but it really got going once jQuery was released. It was an era of Javascript fetching data via Ajax, and relying on the backend code to decide what data is sent in response to each request.

That era is evolving into something different. Frontends are now complicated enough that the frontend coders need to have control over what their queries fetch. GraphQL was invented to satisfy this need.

Maybe GraphQL is the big wave of the future. I don’t know. For now, I favor a simpler approach.

I wrote a small translator app. It is less than 1,000 lines of Clojure. I hope to eventually clean it up and release it as Open Source. You can recreate it easily on your own. The idea is to pull data from PostGres and denormalize it to ElasticSearch. The crucial idea is that the SQL, and the reformatting of the data, should be controlled by a simple configuration file, which the frontenders can change at will. Therefore, whenever they need a new data structure in ElasticSearch, they just add a few more lines to the configuration file, and the Translator app picks up the changes in the configuration file and immediately starts creating a new data structure in ElasticSearch. And the Translator app is forever polling PostGres, so whenever new data appears in PostGres, the Translator app is quick to pick up the change and move it to ElasticSearch.

In the early days of Ajax, it was common to offer RESTful interfaces that exactly matched the database tables in a SQL database. So if your MySQL or PostGres had tables for Users, Businesses, Messages, then the RESTful interface offered endpoints for Users, Businesses, and Messages. And the Ajax interfaces were very busy, making a lot of calls. I can recall early Angular websites where the pages were making 20 or 30 or even 40 Ajax calls to fill in the page.

You should denormalize your SQL data into a form that is convenient for the frontend. Your frontend should only have to make 1 or 2 or maybe 3 Ajax calls to fill in a web page.

Because the Translator app is easy to use, the frontenders can create different kinds of documents for every page on the website. They can have some documents for the Profile page, different documents for the Rising Trends page, different documents for the Most Commented page. You can have a cache for the Rising Trends page when the user is logged out and another for when they are logged in. If a given page shows data from 8 different SQL database tables, then you need to fetch those 8 tables and combine the data into one JSON document, so the frontend can get everything it needs with 1 Ajax call. You can and should have as many cache collections as possible. The only limit is the speed of writes to ElasticSearch. On a B2B site you could potentially have a cache for every user. On a B2C site you would probably have too many users to do so. But in the interest of convenience for the frontend, you should certainly push the limit of how many caches you can have before the write/updates begin to slow down to unacceptable levels.

When I first created my Translator app, it was at a company that had a PHP website which made straight calls to MySQL. They were trying to develop an API, so they could have some separation between the backend code and the frontend code. I built the Translator app to make it easy for frontenders to pull data from MySQL and get it to ElasticSearch. Here is an example of the config the frontenders could use to create 2 document collections:

{
    "denormalization" : [

        {
            "sql" : " SELECT id as profile_id, name as profile_name, headquarters_addr1, headquarters_city, headquarters_state_code, headquarters_country_code, headquarters_phone, page_views_rank, clean_name, clean_name2, visibility_name_search_auto, dtc FROM company_profile WHERE name is not null and name != '' ",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "recent_investments_page"
        },

        {
            "sql" : " SELECT id as profile_id, name as profile_name, headquarters_addr1, headquarters_city, headquarters_state_code, headquarters_country_code, headquarters_phone, clean_name, clean_name2, dtc FROM advisor_profile WHERE name is not null and name != '' ",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "recent_investments_page"
        },

        {
            "sql" : " SELECT id as profile_id, name as profile_name, headquarters_addr1, headquarters_city, headquarters_state_code, headquarters_country_code, headquarters_phone, clean_name, clean_name2, dtc FROM investor_profile WHERE name is not null and name != '' ",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "recent_investments_page"
        },

        {
            "sql" : " SELECT id as profile_id, name as profile_name, headquarters_addr1, headquarters_city, headquarters_state_code, headquarters_country_code, headquarters_phone, page_views_rank, clean_name, clean_name2, visibility_name_search_auto, dtc FROM company_profile WHERE name is not null and name != '' ",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "company_details_page"
        },

        {
            "sql" : "SELECT * FROM company_websites",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "company_details_page"
        },

        {
            "sql" : " SELECT  advisor_profile_id as profile_id, name as advisor_former_name_name FROM advisor_former_name ",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "company_details_page"
        },

        {
            "sql" : " SELECT * FROM company_board_members",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "company_details_page"
        },

        {
            "sql" : " SELECT * FROM company_investors",
            "key_to_denormalize_upon" : "profile_id",
	    "collection_name" : "company_details_page"
        }

        ]       

}

This creates 2 document collections:

recent_investments_page

company_details_page

(I used the word collection because I was originally going to use MongoDB, not ElasticSearch.)

This allows the frontenders to write whatever SQL they need, have the data combine around some key, and live in denormalized state in ElasticSearch. I believe this gives all the flexibility that frontenders seek from GraphQL, but setting this up is easier than setting up a GraphQL system.
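For what it’s worth, here is a rough Python sketch of the core loop of a translator app like the one described above (my actual version was written in Clojure): run each configured SQL query, group the rows by the denormalization key, and write one merged document per key into the named collection. The connection settings and the config filename are placeholders.

    import json
    import time
    from collections import defaultdict
    import psycopg2
    import psycopg2.extras
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")          # assumed ElasticSearch address
    conn = psycopg2.connect("dbname=app user=app")       # assumed database settings

    def run_once(config):
        merged = defaultdict(lambda: defaultdict(dict))  # collection -> key -> merged document
        for entry in config["denormalization"]:
            key_field = entry["key_to_denormalize_upon"]
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(entry["sql"])
                for row in cur:
                    merged[entry["collection_name"]][row[key_field]].update(row)
        for collection, docs in merged.items():
            for key, doc in docs.items():
                # older elasticsearch-py versions use body= instead of document=
                es.index(index=collection, id=key, document=doc)

    while True:
        with open("translator-config.json") as f:        # the frontenders edit this file at will
            run_once(json.load(f))
        time.sleep(5)                                     # keep polling PostGres for new data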

What should be stored permanently in Kafka?

You can store a lot in Kafka. At LinkedIn, they currently keep 900 terabytes of data in Kafka. But your system will almost certainly generate some useless debris that does not need to be stored forever in Kafka. You should use Kafka as your main communication channel, so everything should go through Kafka, but some types of messages can be deleted after a few minutes.
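Kafka makes this easy because retention is configured per topic, so short-lived command/trigger traffic can live on a topic that cleans itself up, while the canonical events are kept forever. Here is a hedged sketch with the confluent_kafka admin client; the topic names and sizing are placeholders.

    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})   # assumed broker address

    topics = [
        # Canonical events: retention.ms of -1 means "keep forever."
        NewTopic("events", num_partitions=3, replication_factor=3,
                 config={"retention.ms": "-1"}),
        # Short-lived commands/triggers: delete after one hour.
        NewTopic("commands", num_partitions=3, replication_factor=3,
                 config={"retention.ms": str(60 * 60 * 1000)}),
    ]

    for topic, future in admin.create_topics(topics).items():
        future.result()   # raises if the topic could not be created
        print(f"created topic {topic}")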

I’m going to share an actual chat I had with a client, because I think this shows the learning process that a team goes through:

Them:

When you used ElasticSearch, how did you resolve race conditions pertaining to two documents updating at the same time? Some merge logic? Did you retry the write?

Me:

I was simply pulling from MySQL and writing to ElasticSearch, so generally it was fine to go with “last write wins.” You need to think carefully about how much harm will be caused if sometimes you have a document in ElasticSearch that’s out-of-date. Obviously, if the issue is payment, that could be very painful (if the user has paid, but ElasticSearch still thinks the user has not paid). The need for “real time” handling can get very complicated. When we say “real time” we mean “high priority.” Some companies use Apache Storm in this situation, which has multiple redundant checks and restarts for failed functions, and which would allow a team to avoid the possibility that a failed app or a failed write leaves the final cache (ElasticSearch) in an out-of-date state. Both Jay Kreps and Nathan Marz have agreed that you’ll want to run Kafka no matter what, but running both Kafka and Storm leads to an incredible amount of complexity. Still, some companies do this. Parsely does this. But in general, you want to keep things simple. Just stick with the basic architecture, and try to make the translator app as fast as possible, so if it gets something wrong now, hopefully it will be less than a minute before it re-tries and gets things correct. Will your customer be angry if the data is wrong for one minute? Can you calm them down? Because if you can manage their expectations, then your architecture can be a lot less complex.

Them:

Hmmm, I think perhaps I have a different kind of race condition than what you just talked about. I’m dealing with the async nature of NodeJS. I’m writing the Translator app in NodeJS. And there are multiple Kafka messages coming in.

Me:

I never wrote directly from Kafka to ElasticSearch, so I’m not sure how to answer. What I typically do is I have the backend robots do some processing and then store the data in normalized form in PostGres.

Them:

So, say a user makes a payment. To facilitate this, a message is posted to Kafka saying, “This user, with these credentials, would like to make a payment.” This message is picked up by the “Billing” service, which uses the incoming information to determine if the billing succeeded.

At this point, there are two ways the front-end could be notified of this change.

Option A: The Billing service could cache the successful billing receipt in PostGres. The Translator app could listen to this log and update accordingly. The User service could also listen to this log and update some info in the User profile.

Option B: The Billing service does not cache anything in PostGres, but instead posts a “Transaction Complete” message to Kafka. Anything listening can do whatever it wants with that information, including updating some information in ElasticSearch or info in a User profile.

In both of these situations, the event is stored somewhere.

We will have some forms of orchestration, and I don’t think listening to PostGres in that regard is the best idea. I don’t think it does what we want. We have Kafka specifically for events.

Me:

Sending a Message to ElasticSearch that says “Your payment was successful” is the right idea. And you can mark a “Payment successful” in Kafka. I’d say those are two separate messages that the Billing robot will send. I think it is okay to cheat a little bit, if you need the speed, and have the Billing robot write directly to ElasticSearch, rather than write to PostGres. Again, if you need the speed. Balance this against the fact that you are making your life a little bit more complicated every time you make an exception to the overall architecture. And when you think you need that extra bit of speed, so you want to cheat a little bit, you are taking a step down the road that led the industry to the whole debate about “Should we use Kafka and Storm together?” See this:

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable.

As a general rule, you only write to Kafka what you would need to recreate the whole history of your app, and therefore the current state. That does not include all human-readable messages, but it does include things like receipts, since you would not be able to recreate those later on.

As to how many events you write to Kafka, you can be practical about that. Don’t write two messages when one will describe the event. Don’t write permanent messages if your only goal is to trigger some other backend robot that you know is waiting for some particular signal.

Them:

Hmmm. I see what you’re saying. Would you agree there is a difference between a “Please Process A Payment” message, versus a “A Payment Was Successful” type of message?

Me:

You can distinguish between intentions and facts. The future tense versus the past tense. In standard CQRS lingo, these are called “commands” and “events.”

“I would like to make a payment with these credentials” is a command.

“Your payment was successful” is an event.

I don’t often use that lingo because I don’t know if it clarifies much, but maybe it is useful?

Them:

Oh it does!

It slipped my mind that while the incoming messages are commands, they’re not important and all we need are the resulting events.

Me:

That is a reasonable architectural decision. There are different ways to handle commands. The crucial thing is that the long-term permanent records in Kafka hold all info that you would need to recreate the history of your service, and therefore the current state.

Them:

So “command” messages can delete themselves after an hour, or something like that. 5 minutes. Or a day. Whatever. They don’t need to be permanent.

Me:

That’s correct, they don’t need to be permanent. Some of that depends on what sort of experience you want to guarantee for your customers.

You don’t need to permanently record the commands; you could just permanently record the results (the events). But if an app dies when it’s in the middle of handling a command, it’s useful to have recorded the command, so the app can try again when it restarts. I assume the app will need less than 1 second to restart. For instance, if you get the command “I would like to make a payment with these credentials” and your Billing app dies when it tries to handle the command, then when the Billing app restarts it would see an unhandled command and could re-run it. Or you can simply assume that your customer will try again later. This latter option is less complicated, so it is preferable if you are allowed to treat your customers that way. Just be sure it’s really okay with your business model.

Them:

Yes.

So, to bring this back to the Translator app. It could also respond to these events, and be somewhat agnostic regarding the use of PostGres. We don’t need PostGres, do we? If ElasticSearch is our cache, why do we also need PostGres?

Me:

Be wary. This conversation runs the risk of confusing two technologies with two different kinds of cache. You can use ElasticSearch for everything, but you will still have two conceptually different caches: one that holds normalized data, and one that holds denormalized data.

You don’t need to use PostGres. You can use any kind of data store. You can use Cassandra or MongoDB or MySQL or Redis. Since you are already using ElasticSearch, you can use ElasticSearch for everything.

Your backend robot services are going to want to access data that is in a normalized form. Your frontend is going to want to access data that has been denormalized for the convenience of the frontend. When your backend Notification app needs to ask “Is this User a valid, paid up user?” the easiest way for it to answer the question is to look up the status of the User in a User table. That is, the Notification robot will want to look up the answer in data that is normalized, and is therefore ready to answer such questions. But you don’t want your frontend talking to a normalized data source, because then your frontend would have to make 20 or 30 or 40 Ajax calls to render a page. The frontend should talk to a denormalized data source, so the frontend will only need to make 1 or 2 or maybe 3 Ajax calls to render a page.

So it doesn’t matter if you use PostGres. You could use ElasticSearch for everything. Just remember that, regardless of which technologies you use, you still will have 2 conceptually different caches — one for normalized data and one for denormalized data. You can put both caches in ElasticSearch, but remember how different they are.

And also remember, even if you use PostGres, you are using it as a cache. The permanent items in Kafka are the only canonical source of truth for your system. You could delete your PostGres database, and delete all the backups for the PostGres database, and that wouldn’t matter at all, because you could regenerate that data by having your backend robots re-read the history of all events in Kafka. Your log is your source of truth. Everything else is a cache.
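A sketch of what that rebuild looks like in practice: point a consumer with a brand-new group id at the beginning of the events topic and let the robots re-apply every event. The topic name and the handler are placeholders (for instance, the handle_new_business sketch from earlier).

    import json
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",   # assumed broker address
        "group.id": "rebuild-cache-2024-01-01",  # a brand-new group id starts from offset 0
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["events"])

    def apply_event(event):
        # dispatch to the same handlers the robots use day to day,
        # e.g. handle_new_business(event) from the earlier sketch
        pass

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            break                                 # no message within the timeout: assume we have caught up
        if msg.error():
            continue
        apply_event(json.loads(msg.value()))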

What sort of machines should be used to run Kafka?

This info comes from Chris Clarke, who runs devops at Parsely:

For any production system, I’d recommend running at least 3 Kafka brokers on EC2 instances with local SSD storage and replication set to at least 2. Five nodes with replication of 3 will allow for faster recovery. I highly recommend using a separate Zookeeper cluster that is dedicated to this Kafka cluster. These can be tiny machines. We use c5d.large, but this is overkill. They use almost no CPU, and around 2G of RAM, which is mostly page cache. These things are just high-availability key-value stores.

To give a little background, we run an old version of Kafka (0.8.2) on EC2 instances and connect to it using pykafka, which is an open source Python module that we maintain. This version of Kafka is buggy, and there are some problems with pykafka, particularly around handling disconnected/down brokers. These issues have led to a few terrible disasters, so we’re testing the newest version of Kafka and a number of clients. As of right now, the confluent client[1] appears to be the clear winner at least with regard to reliability.

.

.

.

Epilogue:

Do you use Docker? Then keep reading. But if you don’t use Docker, then you can stop reading now.

How to use Docker to handle software deploys

I recommend against using Docker, but I sometimes have clients who insist on Docker, so I did recently build a system that used the AWS Docker ecosystem. The parts of the system:

1.) a public URL that pointed to a load balancer

2.) behind the load balancer, some EC2 instances optimized for the Elastic Container Service (these use a particular AMI that Amazon makes available)

3.) Docker images stored in the Elastic Container Registry

4.) A Jenkins buildfile that could pull the Python code from Bitbucket, build the Docker images, and then push them into the Elastic Container Registry.

5.) A call to “aws ecs update-service” to force AWS to push the new Docker image out to the ECS instances.

The Jenkins build script did stuff like run unit tests and fail the build if the tests did not pass. For the sake of stability, the client decided that we would only deploy code when someone logged into Jenkins and hit the “deploy” button. They would have to add a tag such as “production” or “staging” to determine which system we were deploying to. We captured the user’s input by using the “input” step:

    stage('Add tags') {

        try {
            timeout(time: 180, unit: 'SECONDS') {
                userInput = input(
                    id: 'userInput',
                    message: "Add tags, separated by whitespace. 'develop' or 'staging' or 'production' will deploy to our various environments.",
                    parameters: [
                        [$class: 'TextParameterDefinition', defaultValue: "build-${env.BUILD_NUMBER}", description: 'Whatever you type becomes a Docker tag.', name: 'Add tags']
                    ])

                listOfDockerTags = userInput.tokenize()
            }
        } catch(e) { // timeout reached or input false

            println "An exception occurred during the 'Add tags' stage:"
            println e.getMessage()
            throw e
        }

        println "End of the 'Add tags' stage"
    }

Obviously, this is all written in Groovy, which is the DSL for Jenkins pipelines.

The Jenkins script first had to authenticate itself with AWS ECR, capturing a login command that it could then run to authorize Docker pushes to the registry. For the sake of high availability, we were multi-regional, and since ECR logins are per region, I had to log in to the 2 regions we were operating in:

    stage('AWS ECR get-login') {
        println "Start of stage: AWS ECR get-login"

        // 2018-03-26 -- possibly I should test for the presence of 'docker login' to be sure that this line works
        def awsEcrLoginEast1 = sh(script: 'aws ecr get-login  --no-include-email', returnStdout: true)
        sh " ${awsEcrLoginEast1} "

        def awsEcrLoginEast2 = sh(script: "aws ecr get-login  --region 'us-east-2' --no-include-email", returnStdout: true)
        sh " ${awsEcrLoginEast2} "

        println "End of stage: AWS ECR get-login"
    }

Finally we call “aws ecs update-service” to force AWS to push the new Docker images from the ECR to the ECS servers:

    stage('ECS update-service') {
        println "At the start of stage ECS update-service"
        println "As of 2018-03-27 we call 'aws ecs update-service --service aui-stage-fcr-service --cluster aui-stage-ecs-cluster --force-new-deployment' after every build, since we only do manual builds. We assume the person doing the build is confident that the code can be deployed, assuming we have a service/task set up in the Elastic Container Service."

        for (String oneDockerTag : listOfDockerTags) {
            if (oneDockerTag == 'production') {
                sh "aws ecs update-service --service aui-prod-fcr-service --cluster aui-prod-ecs-cluster --region 'us-east-1' --force-new-deployment"
            }
            if (oneDockerTag == 'staging') {
                sh "aws ecs update-service --service aui-stage-fcr-service --cluster aui-stage-ecs-cluster --region 'us-east-1' --force-new-deployment"
            }
            if (oneDockerTag == 'develop') {
                sh "aws ecs update-service --service aui-dev-fcr-service  --cluster aui-dev-ecs-cluster  --region 'us-east-1' --force-new-deployment"
            }
        }

        println "At the end of stage ECS update-service"
    }

The best thing about this is that AWS handles the graceful rollout of the new app. This is awesome because I’ve wasted many long hours of my life writing that kind of code, trying to ensure minimal downtime when we roll out a new version of our code, trying to spin up a new set of servers while the old servers spin down. But if you use this system, then AWS does all that for you.

If I call “aws ecs update-service” and then ssh (via the bastion box) to the ECS box where the Docker containers are running, I can see 3 old containers that have been running for weeks, and 3 that have only been running for a few seconds.

And if I check again in 15 minutes, the old ones are gone, and all that is left running are the 3 containers I just started.

But please, please, please, don’t actually do this.

Why isn’t Docker a part of this graphic?

If you’ve read this blog before you know that I don’t recommend Docker. I’ve previously made the case that if you want security, isolation, flexibility, orchestration, and an easy way to scale up, then the smart way to go is to use Terraform/Packer. Please read my article “Docker is the dangerous gamble which we will regret.” The article sparked very long conversations at places such as Reddit and HackerNews, which are linked in the comment section of the blog post.

Most of that article was devoted to criticizing Docker, and very little to explaining Terraform and Packer. I’ll try to say a little more about Packer here.

Sean Hull wrote this to me in a private email:

Terraform can do everything up until the machine is running. So things post boot, *could* be done with Ansible. That said another route is to BAKE your AMIs the way you want them. There’s a tool called “packer” also from hashicorp. It builds AMIs the way docker builds images. In that model, you can simply use terraform to deploy a fully baked AMI. If your configuration changes, you re-bake and redeploy that new AMI. Barring that, it’s terraform for everything up until boot, and then Ansible for all post-boot configuration.

Create your ideal Linux setup, then save it as an image. Use Packer, which will create the image for you.
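
A minimal template, modeled loosely on the example in Packer’s own getting-started guide, looks something like this (a sketch only; the region, source AMI ID, AMI name, and install script are placeholders, not values from this project):

    {
        "builders": [
            {
                "type": "amazon-ebs",
                "region": "us-east-1",
                "source_ami": "ami-0abcdef1234567890",
                "instance_type": "t2.micro",
                "ssh_username": "ubuntu",
                "ami_name": "my-baked-app-{{timestamp}}"
            }
        ],
        "provisioners": [
            {
                "type": "shell",
                "script": "./install-everything.sh"
            }
        ]
    }

Run “packer build” against that file and, a few minutes later, there is a fresh AMI sitting in your AWS account.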

Packer only builds images. It does not attempt to manage them in any way. After they’re built, it is up to you to launch or destroy them as you see fit. After running the above example, your AWS account now has an AMI associated with it. AMIs are stored in S3 by Amazon, so unless you want to be charged about $0.01 per month, you’ll probably want to remove it. Remove the AMI by first deregistering it on the AWS AMI management page. Next, delete the associated snapshot on the AWS snapshot management page.

In terms of provisioning a Linux box with your software and all of its dependencies, you can tell Packer to use any normal Linux tool. You can use a simple bash script, or Chef, or Ansible, or you can use the “user_data” that Amazon lets you pass whenever you launch a new instance from the AMI.

Here is a fragment of a Packer template (Packer templates are written in JSON; Terraform uses HCL, the HashiCorp Configuration Language):

    "provisioners": [
        {
            "type": "file",
            "source": "./welcome.txt",
            "destination": "/home/ubuntu/"
        },
        {
            "type": "shell",
            "inline":[
                "ls -al /home/ubuntu",
                "cat /home/ubuntu/welcome.txt"
            ]
        },
        {
            "type": "shell",
            "script": "./example.sh"
        }
    ]

You can see how easy it is to include and run bash scripts; you could just as easily run Chef or Ansible scripts.
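
For example, as far as I know, pointing Packer at an Ansible playbook is just one more entry in that same "provisioners" list (the playbook path here is a made-up placeholder):

        {
            "type": "ansible",
            "playbook_file": "./playbook.yml"
        }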

For most of the reasons that people use Docker, you can just use Packer.

Then you can deploy your system with Terraform. The level of automation here is truly incredible. Once you’ve got your whole system described by Terraform, you only need to run one command to set up a completely new system. Or run your Terraform code 3 times to set up a development environment, a staging environment, and a production environment. Or if you have 20 engineers, they could run the Terraform code 20 times, so they all have their own system to play with. Or if you decide that you need to be a highly available, highly reliable service, and therefore you want redundant systems in multiple AWS regions, you only need to run one command to recreate your EC2 instances, your ElasticSearch, your translator app, or your gateway app that writes to Kafka. Since PostGres is only a cache, you could have duplicate instances without much worry. But I would only use one Kafka log. You don’t want to split that, and it is already set up to be a distributed log.
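
A sketch of what that Terraform code can look like, assuming the AMI naming convention from the Packer sketch above (the names, filter, instance type, and tags are hypothetical, not from any real project):

    variable "environment" {
      default = "staging"
    }

    # Find the most recent AMI that Packer baked for us.
    data "aws_ami" "app" {
      most_recent = true
      owners      = ["self"]

      filter {
        name   = "name"
        values = ["my-baked-app-*"]
      }
    }

    # Launch an EC2 instance from that fully baked image.
    resource "aws_instance" "app" {
      ami           = data.aws_ami.app.id
      instance_type = "t2.medium"

      tags = {
        Name = "app-server-${var.environment}"
      }
    }

Run “terraform apply” once per environment, each against its own state (or Terraform workspace), and you get the separate development, staging, and production copies described above.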

One warning: Terraform treats infrastructure as disposable, so if you ever tear down a system via Terraform, and that system includes an AWS database, such as PostGres via RDS, then Terraform will delete your database AND IT WILL DELETE THE AUTOMATED BACKUPS THAT GO WITH IT.

That was a big surprise when we first discovered it (thankfully we were working with Sean Hull, who saved the day). You’ll need to write your own backup script for the databases. We copied them to S3. That much was safe from Terraform.
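
If you do decide to keep the RDS instance under Terraform’s control, there are a couple of settings that soften the blow. Here is a sketch (the identifier, instance class, storage size, and credentials are placeholders, not our actual config):

    variable "db_password" {}

    resource "aws_db_instance" "main" {
      identifier        = "app-postgres"
      engine            = "postgres"
      instance_class    = "db.t2.medium"
      allocated_storage = 20
      username          = "app"
      password          = var.db_password

      # Ask RDS to take one last snapshot before the instance is destroyed.
      skip_final_snapshot       = false
      final_snapshot_identifier = "app-postgres-final"

      # Make 'terraform destroy' refuse to touch this resource at all.
      lifecycle {
        prevent_destroy = true
      }
    }

Even with those settings, I would still keep the independent backups in S3; belt and suspenders.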

Why is Docker awesome?

I’ve recently heard two criticisms of the Packer/Terraform approach, which almost have some merit:

1.) A full VM image like an AMI is much more heavyweight than a Docker image. A typical software developer, with a good computer, can run 10 Docker containers on their machine, but they would struggle to run more than one VM.

2.) If a software developer wants to make some changes to the dependencies of the system, they can do this more easily via Docker than via Packer.

Regarding #1, Docker is obviously lightweight compared to a full VM. But do I really need to run 10 VMs on my laptop? From 1999 to the current day I’ve worked on many major projects, and I’ve never faced this particular need. Two other approaches have worked for me. I often ask the devops person to set up a single dev database, or multiple dev databases, available over the Internet with access locked down so only my team can reach them. Rather than run the databases on their laptops, I encourage the software developers to use the dev database that the devops person set up for them. I’ve also been able to run multiple major apps on my laptop. In my book How To Destroy A Tech Startup In Three Easy Steps I had to run all the software on my laptop, including a huge Java app (the binary was a gigabyte in size) which contained the entire Stanford NLP library, plus 5 Clojure apps. I could run all of those apps on my laptop; I didn’t need to run them in a VM. It is fairly normal for a team to find ways to use similar technologies, and to limit their resource use, so that they can run multiple apps on one laptop.

But if you decide to use Docker, you have to commit to the Docker ecosystem, which is large and brings in many additional technologies: Kubernetes, maybe Rancher, a deployment path for the images (ECR or Docker Hub), and maybe ECS if you are on AWS. Think very carefully about whether the benefit is worth the enormous increase in complexity. Do you really want to force your team to commit to this large (and rapidly changing) ecosystem? What do you get in return?

Regarding #2, I typically have to struggle to get control of the software development team. Especially where deadlines are concerned, I’m often asking myself “Is this person lying to me?” I don’t want software developers to have the power to make large scale changes without first getting my permission. The fact that Docker makes it easy to make such changes seems like a bug, not a feature. On most projects, I bring in a devops person who I have worked with before, someone I trust, and as much as possible, I want them to have absolute control over the devops side of the project. I don’t want the team to have the freedom that Docker gives them. That freedom undermines my ability to take responsibility for delivering working software on time. And often, if I think a software developer is lying to me, the easiest way for me to find the truth is through the devops side of the project — can their app pass the messages that it is supposed to pass, or meet its schema enforcement contract or pass its unit tests? Once I have control over the CI system, such as Jenkins, it becomes easier for me to see who is making progress and who is falling behind.

I’ll share a story that happened this summer:

A startup got going in late 2017, and hired a development team in Portugal in November of 2017. The team in Portugal wrote an app in Python. In April they were told they had to Dockerize the code. They used Ubuntu as the base OS for the Docker image. I joined the project in March and set to work on setting up Jenkins and writing the code to integrate the build system with the rest of the deployment system on AWS (ECR, ECS). In June, a new developer joined the team. He was young and very smart, but also very impulsive. He wasn’t sure how he should contribute to the project. I assigned him some tickets via Jira, but he ignored all the work that I assigned to him. He knew Python very well, though he lacked experience with most other languages; in that regard, he was deep but narrow. I would not call him a novice, but he was not highly experienced. He was strongly opinionated.

He realized he could shrink the Docker image if he got rid of Ubuntu and used Alpine instead. He pushed this change through on his own, without talking to anyone else. He successfully shrank the Docker image by 70%, which is a very nice optimization. But remember “Premature optimization is the root of all evil.” The dependencies of his new image were a bit different from before, in ways that broke my build system. We spent 3 days tracking down the problems. There was a permission problem that we had to solve with gosu, a problem with the Docker cache that we had to jump through hoops to fix, and then a mysterious problem which should have been impossible, but was there anyway, and which seemed sensitive to the way certain setup commands were called; in particular, there were timing issues in making sure that Django, gunicorn, and Nginx started in the correct order. It all worked fine on his laptop, but it took 3 days to get it to work on the main Jenkins server. And I was furious to lose those 3 days to a project that I had not authorized. It would have made more sense for us to get the basic app into production, and then later in the summer we could have done some cool optimizations, such as the one he did.

So the fact that it’s easier for software developers to change the dependencies of the system with Docker strikes me as a bug, not a feature. If you are in charge of a team, I’d be surprised if this freedom doesn’t eventually sting you.

Can Docker coexist with Terraform?

Finally, a point that I’ve never seen resolved. At previous clients, we’ve had long debates about this one, since there seems to be no perfect answer.

If you’re going to use Docker to build and deploy, then you’ll push your images to a registry such as the AWS Elastic Container Registry (ECR), and you’ll need to define an ECS task, using JSON, which specifies things like which app container to use, plus dependencies such as which Nginx container to use.

However, we were also committed to providing a highly reliable, highly available service, so we had embraced the modern devops slogan of “infrastructure as code.” This idea has gained momentum over the last ten years, and software such as Puppet and Chef and Ansible has helped make it real, but as the tech world moved to the cloud, Terraform emerged as the ultimate way to script entire cloud systems (including computers, public networks, private networks, security groups, databases, and more).

The question came up: where do we define the JSON for the ECS task? We initially defined it in the Terraform, which is the right way to go if you really want to commit to the idea of “Infrastructure as code.” However, there was, on the team, a software developer, very smart but also very impulsive and opinionated, who insisted that all of that be handled from the Docker side, because if he wanted to make a change, he should be able to make it from the Docker config; after all, that kind of flexibility is why he wanted to use Docker in the first place.
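
For what it is worth, keeping the task definition in Terraform does not take much code. Here is a sketch (the family name, account ID, image tag, memory figure, and ports are all hypothetical, guessed only to make the example concrete):

    resource "aws_ecs_task_definition" "app" {
      family                = "aui-stage-fcr-task"
      container_definitions = <<-DEFINITION
        [
          {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/aui-app:staging",
            "memory": 512,
            "essential": true,
            "portMappings": [
              { "containerPort": 8000, "hostPort": 8000 }
            ]
          }
        ]
      DEFINITION
    }

An aws_ecs_service resource can then point at that task definition.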

But if we went that route, there was a lot of code we’d have to move from Terraform to Docker.

So, be wary. Docker brings in some complications. Docker will cause you to have long debates about how you build and deploy your code. Think very carefully before you commit to it.

My general advice is, don’t use Docker. Terraform and Packer will give you all of the flexibility and security and isolation and orchestration that you want.

If I am responsible for a project, then I am responsible for the devops

Previously, I’ve been misunderstood regarding Docker and my sense of responsibility for a project, so I’m going to repeat myself.

If someone puts me in charge of a project and says “Here is a million dollars, you need to have this software done in six months” then the first thing I do is try to find a good devops person to help me with this project. For me personally, that often means reaching out to either Chris Clarke or Sean Hull to see if they have any time in their schedule. My experience has been that if I can keep the devops side of the project on schedule, then it is much easier to keep the rest of the project on schedule. Setting up the support systems (the build tool, the monitoring tools, the health checks) allows the rest of the team to go faster.

Relying on Terraform/Packer helps to centralize devops decisions in the hands of the devops person. Docker puts too much of that power into the hands of ordinary software developers. I do not want every developer on the team to have the power to make devops decisions. That is a recipe for anarchy.

Therefore, strictly from a project management point of view, I recommend against Docker.

By the way, Dave Elliman wrote this essay about a project where they decided to use Clojure:

Two months early. 300k under budget

Is anyone aware of a similar article about Docker? Can anyone point me to an essay where the writer says “We thought this project would take a year, but once we switched to Docker we knew it would only take 3 months”? Because I am not aware of any such essay. I’ve read a lot of horror stories about Docker and Kubernetes leading to problems, and I’ve seen such problems with my own eyes on projects that I’ve worked on. But for all the hype about Docker, I’ve yet to see much evidence of teams having success with it.

Acknowledgements

For tips about devops, I give my thanks to Chris Clarke and Sean Hull.

[1] About the names

There is an old tradition in computer science that the names used in these examples should be “Alice” and “Bob”, but the names “Alice” and “Bob” sound Caucasian, and the two of them are typically drawn as Caucasians.

Therefore, in the interest of diversity, I looked up the name rated as the most distinctively African American female name (Imani) and the most common Hispanic male name (Santiago), and I’ve used those names here.

[ [ UPDATE 2018-11-28 ] ]

An interesting comment on Hacker News:

I work in the public sector. We operate more than 500 systems at the municipality where I work. Some are major, some are not. Some have support completely handled by the providers, some do not. Some are built in house, some are bought from local software companies and developed specifically for us, and others again are standard software.
What’s common among them is that they run on different technology.

So we have 500 different ways to do front-ends, middle-ware, back-ends and databases. So literally thousands of combinations.

We have 5 IT-operations staff to run and maintain it. They’re very good, but they are very good at specific things. Like one guy knows networking, another knows server infrastructure and user management, but only for Microsoft products because that’s our backbone.

But we’d like to run a JBOSS server on redhat, and we’d like to run some Django and Flask servers, and we’d like to have a PostgreSQL and a MongoDB for GIS, a MySQL for some of the cheap open source stuff and a MSSQL for most things. And so on.

Guess what makes that easy, docker for Windows. Our IT staff can’t support all those techs, but they can deploy and maintain containers and keep surveillance on them. Sure the content of the containers is up to the developers, but it’s infinitely easier to operate than having to support the infrastructure to do it without docker.

The same thing is less true in Azure or AWS where your cloud provider can act similarly to containers, with web-apps and serverless, but it’s still nice to keep things in containers in case you need to vacate the cloud for whatever reason.

I can believe there might be some benefit to Docker when you face the problem of putting together so many systems. This is not the case for the average startup, and so I do not recommend Docker to my clients. Also, the sheer anarchy of the situation described above is something that companies should, as much as possible, try to avoid, even if there is a technical solution that reduces the overall pain.
