Why are software developers confused by Kafka and immutable logs?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

Simplicity. Why is it so hard to keep things simple? There are many concepts that we use in software which arise from common sense which we then encumber with needless complexity. Why?

For instance, logs. It’s great to have a record of things. If you’ve built some software that gathers information, it is often very useful to know the order in which information arrived in your system. That is, it is insightful to know the history of events that built the store of knowledge that your system now possesses. Logs are great. Simple and straightforward. Why does this simple idea get trapped in a world of buzzwords and acronyms and complex theories about The Way Things Ought To Be?

Let me digress for a moment and talk about microservices. I was talking with a fellow the other day, an acquaintance who is the lead engineer at a small startup. They tried to build their original system following what they thought of as the microservices methodology. We were exchanging texts on WhatsApp and he wrote:

I’m never doing microservices again. Having worked on a game engine in the past, which was monolithic, I know that it is definitely possible to have high performance software in monolith form. And with stable development processes.

I responded:

I’ve worked on single apps that are fast, even when large. I don’t think microservices should be a goal, they should arise organically. If your company needs new functionality, and it doesn’t make sense to ram the new functionality into the old monolith, then it makes sense to create a new app. And if you do that a few times, then you are on the road to microservices, at which point you need to think of an architecture that supports the interactions of multiple apps. But it doesn’t make sense to say “We will build new software, and the most important thing about this new system is that it be microservices.”

Microservices is one of those things that started off seeming like a commonsense approach to managing the increasing complexity of a system, but then it was hijacked by a crew that heaped a large dose of methodology on it. Mind you, I’m not criticizing the kind of careful thinking on display in Building Microservices: Designing Fine-Grained Systems by Sam Newman. That’s a good book. Lots of good advice on the difficult issues of whether services should share a given database, or whether each service should “own” its own database, or “own” specific tables in a database, or whether multiple services should be allowed to read a database table while only one service is allowed to write to it.

A year after I first heard of microservices, I was being told that the word “microservices” referred to a specific setup whereby apps use HTTP to communicate with each other. And I was like, what? One of the best microservice systems I ever built was something that looked very much like the system that Matthias Nehlsen describes: a bunch of Clojure apps communicating with each other by sending serialized objects through Redis. No HTTP needed.

On July 30th, 2013 I wrote my essay “An Architecture Of Small Apps”. At the time I’d never heard the phrase “microservices”. It wasn’t until March of 2014 that Martin Fowler wrote his essay about microservices, an essay which announced the arrival of the trend as a general movement in the software world.

I used the phrase “An Architecture Of Small Apps” because there was no other word for what I was describing, but also because what I was talking about seemed so simple and straightforward. Did such a simple idea need a complex name?

Also, notice that the process I described in that essay was very organic. Timeout.com was struggling with an old monolithic PHP framework, built around Symfony, and it was terribly slow, and the issues were severe, so it made sense to look for the bottlenecks, then pull those bottlenecks out and write them as separate services. It also made sense to move to the JVM, so we could work with more performant languages for the services that were processor intensive. So, for instance, the main API, which had to create a unified interface over several databases, became its own service written in Scala. And we needed an internal ad server, because the sales team was allowing advertisers to run surveys of our readers, and we had special handling for the interaction between our readers and those surveys. So I wrote that service using Clojure. We left the PHP/Symfony code to mostly handle the HTML templates, while we moved all the demanding parts of the system to Scala or Clojure apps that were faster and more specialized. The whole process was driven by common sense. We did not need any acronyms, nor did we need any weird theories, like the idea that HTTP was the only valid communication protocol.

My point is, the initial impetus that led to an architecture of small apps was clear thinking and pragmatism. And yet, within a year, there were consultants claiming that they knew the secret of implementing microservices “the right way,” and the systems they wanted to build were immediately more complex than what common sense demanded.

I’ve seen something similar happen with the idea of having a historical log. While the argument for logs is a simple one, there is a remarkable army of consultants ready to make things complex. You say “log” and they hear “CQRS”. You say “messages” and they hear “Event Streaming Platform”.

Mind you, some situations really are complex, and in those cases, it is appropriate to tackle those situations with complex solutions. But the complexity is not the goal. In those cases, simplicity is still the goal, and the complexity of the software is necessary only to match the reality of the situation. That is, the complex solution should be the simplest complex solution that can handle the underlying complexity of the real-life situation that is being dealt with.

It’s also true that if you take a step back from a specific software project, and you get honest with your client about all the other pieces that might make up the total system, the overall picture can begin to look complex, even if the main part of the system is still fairly simple. I’ve recently adopted a strategy, when dealing with new clients, where I try to be honest with them about all the many aspects of what a complete system involves, so that they can have a realistic idea about the cost. So for one recent client I developed this graphic:

The software at the center of that system is not wildly complicated, but here I’ve included all of the devops concerns, with the goal of educating my clients about where some of the costs are going to be (and I advocated for Terraform and Packer, while advocating against Docker/Kubernetes, for reasons I’ve written about elsewhere). Devops is often expensive, and that’s an expense in addition to the main software development. (If you would like to lose my respect forever, please feel free to argue “Devops is cheap and easy if you just use Docker and Kubernetes.”)

Let’s talk about a simple use of logs. At Parsely they conduct analytics by adding a one-pixel GIF to pages. The requests for this GIF end up in their Nginx logs. Parsely has a very complicated system for analyzing the data in those logs — they use Cassandra and Apache Storm and Kafka and ElasticSearch. But the starting point for their system is their Nginx logs. And they’ve kept all of their logs since they started back in 2009. And those logs are the only thing they need to rebuild all the data in their system. That is to say, the Nginx logs are the canonical source of truth at Parsely. If some horrible accident happened, and they lost all the data in Cassandra and Storm and Kafka and ElasticSearch, and they lost every database backup they have, it wouldn’t matter. They could regenerate all of the data, for all of their customers, going back to 2009, just by having their software reprocess the Nginx logs. It might take a few hours, or even a few days, but they could do it.

That’s a good use of an immutable log.
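To make the replay idea concrete, here is a minimal sketch in Python. The log pattern and the pageview counter are hypothetical stand-ins, not Parsely’s actual pipeline; the point is only that re-reading immutable log lines, in order, rebuilds the derived state from nothing:

```python
import re
from collections import Counter

# Hypothetical combined-format pattern: ip - user [time] "GET /path ..."
LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "GET (?P<path>\S+)')

def replay(log_paths):
    """Rebuild pageview counts from scratch by re-reading the logs in order."""
    pageviews = Counter()
    for log_path in log_paths:
        with open(log_path) as f:
            for line in f:
                m = LINE.match(line)
                if m:
                    pageviews[m.group("path")] += 1  # one pixel request = one hit
    return pageviews

# Lose every downstream database and it doesn't matter:
# pageviews = replay(["access.log-2009", "access.log-2010", ...])
```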

Any use of Kafka should be as simple as Parsely’s use of Nginx logs. Kafka does offer some nice bells and whistles that give you more flexibility in how you set up your system, and I encourage you to make use of those bells and whistles where appropriate, but your core use case should always be something nearly as simple as recording requests in an Nginx log.
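To make that concrete, appending an event to Kafka can be as plain as Nginx appending a line to a file. A minimal sketch using the kafka-python client (the topic name and event fields are made up for illustration):

```python
from kafka import KafkaProducer  # pip install kafka-python
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The Kafka equivalent of Nginx writing one log line for one pixel request.
producer.send("pageviews", {"path": "/articles/42", "ip": "203.0.113.7"})
producer.flush()
```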

The way Jay Kreps talks about the origins of Kafka, he makes it sound like a commonsense solution to a problem they were facing at LinkedIn: an exponential explosion of complex communication lines, with every data-producing system wired directly to every system that needed the data.

But Kafka allowed them to simplify all of that: with the log in the middle, each system needed only one connection, to Kafka.

As a general rule, if Kafka is not making your system simpler, then you are probably using Kafka the wrong way. That is a sadly common mistake. I have worked with many companies that were using Kafka unnecessarily, so it complicated their architecture instead of simplifying it.

As a rough rule of thumb, if you are using Kafka but it is not the canonical source of truth of your system, you need to ask yourself whether you really need Kafka. I’ve worked with companies where the canonical source of truth was elsewhere, and Kafka was being used as a message queue. Is that the right use of Kafka? Maybe, but be careful. There are a lot of great message queues out there; if a queue is all you need, then you don’t necessarily need Kafka. Kafka is distributed and requires ZooKeeper, which adds to the complexity tax you pay (via devops). If you can use a simple, centralized message queue, then do so; it’s simpler than Kafka.

I need to add some qualifiers to this essay. Any argument that focuses on simplicity runs the risk of lapsing into anti-intellectualism, and I hope to avoid that. Nothing in this essay should be read as denying the reality of complex situations. There are some situations that demand profound thought. I’ve only admiration for those who invent appropriate innovations for difficult situations. However, as I’ve said, the goal should never be complexity, the goal should be to create the simplest possible system that satisfies the goals of a project. If the project is tackling a difficult reality, the system might end up being complex, which is fine, so long as complexity is never the goal.

The next few paragraphs, in particular, might sound borderline anti-intellectual, but keep in mind I’ve no real criticism of difficult strategies, so long as they are used in appropriate situations.

There are many companies looking to sell services to enterprises, in part by formalizing some of the solutions I’ve already mentioned (microservices, logs). As they formalize these solutions, they tend to make them sound very complex. Some of this is marketing: if they were honest about how simple these solutions can be, who would buy from them? Making logs sound complex is a way to sell services around logs. So instead of talking about logs, they talk about CQRS and Event Driven Platforms. And sometimes, as they talk up the complexity of these systems, they invent processes that really do add complexity. This is terrible, and it adds to the overall confusion in the software profession. These companies hire teams to write intelligent-sounding white papers, the more readable of which end up discussed on Reddit and Hacker News, and then junior developers hear about these topics and assume that these solutions (microservices, logs) are cool but necessarily complex.

So for instance, think of the poor reasoning that led to the ESB (Enterprise Service Bus), and the absolutely confusing mess now occurring as vendors struggle to invent the architecture that will replace ESBs:

There seems to be a de facto consensus that iPaaS (Integration Platform as a Service) is the next big thing. So much so that the MuleSoft, one of the leading iPaaS vendors, achieved unicorn status last year after a successful round of venture capital financing and, then, this past March, went on to have a very successful IPO.

And while there’s little doubt for CIOs that legacy middleware integration technologies such as ESB are on the verge of obsolescence, there should be plenty of doubt as to whether iPaaS (and the vendors such as MuleSoft, Jitterbit, SnapLogic, etc. that offer iPaaS) is the logical successor to fill the white space being left behind. Why? Because iPaaS is stubbornly more of the same. It’s a modernization, but not the much-needed re-imagination, of the ESB model that is currently on life support at organizations around the world.

Back in March I wrote “Why are large companies so difficult to rescue (regarding bad internal technology)”, in which I described a client of mine that was trying to clean up a truly complex reality:

The writes are another thing. If a customer in London wants to rent a resource from the subsidiary we have in London, the rent request (the database write) needs to go to the central API, but does that mean the central API has to know which internal database that particular write is supposed to go to? Likewise the write requests happening in Nigeria, and Germany, and Brazil, each of which will go to different databases. This becomes a bit of a nightmare. Twenty years ago this line of thinking led to the creation of the Enterprise Service Bus architecture, but ESB is now going out of favor, because it was too complex and rigid and unwieldy.

Two years ago, SuperRentalCorp decided to become a customer of MuleSoft, to help create their new API. They have so far spent about $25 million on their efforts involving MuleSoft. MuleSoft has some great tools for building APIs but those tools seem to help with the reads much more than with the writes. Which is to say, MuleSoft helps with the easy stuff, but not so much the hard stuff. (Having said that, I’ll add that there are engineers working for SuperRentalCorp who love MuleSoft.)

In terms of the best integration architecture, what seems to me the only long-term solution is something like the unified log architecture that Jay Kreps wrote about back in 2013. All incoming writes need to go into a centralized log, such as Kafka, and then from there the various databases can pull what they need, with each team making its own decisions about what it needs from that central log. However, SuperRentalCorp has retail outlets with POS (point of sale) systems which talk directly to specific databases, and the path of that write (straight from the POS to the database) is hardcoded in ways that will be difficult to change, so it will be a few years before the company can have a single write-point. For now, each database team needs to be accepting writes from multiple sources. But a unified log is the way to go in the long-term. And that represents a large change of process for every one of those 20 teams. Which helps explain why the company has spent 2 years and $25 million trying to build an API, and so far they have failed.

So here is, again, a place where Kafka seems like the right solution, because here is a case where Kafka simplifies a complex situation.
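Here is a rough sketch of what one team’s side of a unified log might look like, again using kafka-python. The topic name, group id, and “region” field are hypothetical, not SuperRentalCorp’s actual schema; the idea is just that every write lands on one topic, and each database team runs its own consumer group and keeps only what it needs:

```python
from kafka import KafkaConsumer  # pip install kafka-python
import json

consumer = KafkaConsumer(
    "writes",                          # the one topic all writes flow through
    bootstrap_servers="localhost:9092",
    group_id="london-rentals",         # each team gets its own consumer group
    auto_offset_reset="earliest",      # a new team can replay from the start
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    if event.get("region") == "london":
        ...  # apply the write to this team's own database
```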

What I find surprising, and disappointing, is how many intelligent software developers don’t seem to get the basic idea of an immutable log, which is to say, a log that can be replayed to regenerate the accurate state of the system. And again, the Nginx logs at Parsely are a good example. But not every company can use a web server log as their canonical source of truth, and for those other companies, Kafka can be an excellent choice.

I’ve worked with many clients who don’t seem to understand this fundamental idea, how a log can simplify things. I also see essays from companies that got confused by jargon and industry hype and hurt themselves implementing ideas that seemed cool but only added complexity, because they missed the underlying idea that would actually have improved their situation. So, for instance, some companies have tried to implement CQRS without having a unified log, and they end up worse off:

To be honest our system doesn’t do that much yet and all these services might seem like overkill. They do to us at times as well, that’s for sure. It can be annoying that adding a property to a case will pretty much always involve touching more than one service. It can be frustrating to set up a new service just for “this one little thing”. But given the premise that we know that we must build for change I think we are heading in the right direction. Our services are loosely coupled, our query services can easily be thrown away or replaced, the services have one and only one reason to live and are easy to reason about. We can add new query services (I can see lots of them coming up, e.g. popular-cases and cases-recommended-to-you) without touching any of the command services.

You’ll notice that the graphic in that article has an “events database” but it is not the center of the whole architecture. That article has no concept of the unified log, and I think that is why they got into trouble. I find it fascinating how their ideas overlap with 80% of mine but the remaining 20% seems to be very important. They do understand this much:

You get a transaction log for free, and you can get the current state by replaying all the events for a given order. Add to that, now you can also get the order state at any given point in time (just stop the replay at this time).
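That much is right, and easy to sketch. Assuming a hypothetical event shape (a timestamp plus a dict of field changes, not the article’s actual schema), point-in-time state is just a replay that stops early:

```python
def order_state_at(events, cutoff):
    """Fold events in order, stopping at the cutoff timestamp."""
    state = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] > cutoff:
            break  # stop the replay here for a historical snapshot
        state.update(event["changes"])
    return state

# Replaying everything gives current state; an earlier cutoff gives
# the order as it looked at that moment:
# order_state_at(events, cutoff=float("inf"))
# order_state_at(events, cutoff=1374105600)
```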

But they don’t seem to understand the idea that the log needs to be the center of all communication in their system. It needs to be the canonical source of truth. Indeed, they call it a “transaction log” rather than a “unified log.” In this case, it’s possible the nomenclature is significant. They also say this:

Worry not, there is a better way, and it is called event collaboration. With event collaboration, the Order Management Service would publish events as they occur. You typically use a message bus (e.g. Kafka, RabbitMQ, Google Cloud PubSub, AWS SNS) and publish the event to a topic.

If you combine their “transaction log” and their “message bus” then you get the concept of a Unified Log. Maybe they understand this, or maybe they don’t; what I can see is that the graphic they use in that article has “publish” and “subscribe” scattered around the system in a disorganized way. If I’d been hired as their consultant, that would have jumped out at me as something worrisome that I needed to ask more questions about. And indeed, I’ve worked with many clients like this, who seem to get 80% of the right ideas, but what they miss is fundamental.

The other thing I would note is how often I see essays which jumble together CQRS, events, message streaming, and logs, even though these are different subjects. To be clear, the biggest advantage with a unified log is when it is used as the canonical source of truth for your system. CQRS is something very different: the big advantage there is the separate handling of reads and writes. It’s important to remember that these are different strategies, because they have different risks. Martin Fowler has written about how often CQRS leads to unnecessary complexity:

CQRS allows you to separate the load from reads and writes allowing you to scale each independently. If your application sees a big disparity between reads and writes this is very handy. Even without that, you can apply different optimization strategies to the two sides… If your domain isn’t suited to CQRS, but you have demanding queries that add complexity or performance problems, remember that you can still use a ReportingDatabase…
Despite these benefits, you should be very cautious about using CQRS. Many information systems fit well with the notion of an information base that is updated in the same way that it’s read, adding CQRS to such a system can add significant complexity. I’ve certainly seen cases where it’s made a significant drag on productivity, adding an unwarranted amount of risk to the project, even in the hands of a capable team.

But again, I don’t want my focus on simplicity to be read as a kind of anti-intellectualism. If we refuse to think about how a system might evolve in the future, we run the risk of the nightmare in which ad-hoc quick fixes become the long-term architecture.

Fowler is right to warn about the complexity, although I’ve seen teams get into even worse situations when they try to implement piecemeal ad-hoc solutions to deal with an evolving situation. I believe what happens on such teams goes like this (based on a real example):

Santiago: Huh, our Ruby On Rails app has always been able to take input from the Web and write it to our MySQL database, but now that we are letting our users upload Excel spreadsheets, the app is going too slowly, what should we do?

Imani: No worries. Let’s set up some separate apps that just handle the Excel spreadsheets! One app can convert the XLSX files to CSV and put them on a RabbitMQ queue, then another app can parse the CSV and write it to MySQL.

Santiago: Excellent idea! That solves all of our problems!

(One year later)

Santiago: Our new Web scraping Python scripts are pulling in 50 gigabytes of HTML each day. We can’t write all of this to MySQL. What should we do?

Imani: No worries. We can store all the raw HTML in S3 buckets, then we can have other scripts that pull the HTML files and run them through our NLP filters, then write the parsed Statistically Significant Phrases (SSPs) and linked sentences back to other S3 buckets.

Santiago: Excellent idea! That solves all of our problems!

(One year later)

Santiago: The sales team wants us to allow our customers to set up notifications, so that when a particular name is discovered by our NLP scripts we can send an alert to our customers. But our Notifications table is in MySQL and our NLP output is in an S3 bucket. What do we do?

Imani: No worries. We can pull from MySQL and the S3 buckets and combine them in MongoDB, then create an app that looks for new data appearing in MongoDB, and then we can send the appropriate notifications!

Santiago: Excellent idea! That solves all of our problems!

(One year later)

Santiago: Our business intelligence team wants to find out if our customers are getting any real benefit from our Notifications, but their analysis tools need data from S3 and MySQL and MongoDB. What should we do?

Imani: No worries. We’ll create an app that pulls from S3 and MySQL and MongoDB and combines all the data in Redshift. Then the business intelligence team can run their queries against Redshift.

Santiago: Excellent idea! That solves all of our problems!

(More years pass, more additions are made, and so on, and so on.)

Eventually, this does not solve all their problems. Eventually, the complexity of the system becomes difficult to manage, for all the reasons that Jay Kreps wrote about back in 2013, in his original essay about unified logs. What has happened, over time, is that a series of ad-hoc solutions has built up data silos, each of which now holds a bit of the canonical truth of the system, which is to say, the system no longer has a canonical source of truth. This is such a powerful anti-pattern that I need to emphasize it:

The system no longer has a canonical source of truth

What’s more, the intuitions that allow this system to work exist only in Imani’s head, and if you tried to document this system, you would have great difficulty writing it down in a way that would let a new member of the tech team understand what is actually happening.

At this point we need to reconsider Martin Fowler’s warning:

I’ve certainly seen cases where it’s made a significant drag on productivity, adding an unwarranted amount of risk to the project

because now CQRS and a unified log would be simpler and less risky than what Santiago and Imani have done.
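To see the difference, sketch the same history with a unified log at the center (all names here are hypothetical). Each of Imani’s one-off pipelines becomes just another consumer group reading the same canonical topic, rather than a new silo holding its own piece of the truth:

```python
from kafka import KafkaConsumer  # pip install kafka-python

def run_view(group_id, handle):
    """One derived view = one consumer group over the canonical log."""
    consumer = KafkaConsumer(
        "canonical-events",            # hypothetical unified-log topic
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",  # every new view can replay all history
    )
    for msg in consumer:
        handle(msg.value)

# Year one:   run_view("spreadsheet-ingest", write_to_mysql)
# Year two:   run_view("nlp-phrases", write_to_s3)
# Year three: run_view("notifications", alert_customers)
# Year four:  run_view("bi-warehouse", write_to_redshift)
```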

To repeat myself, there is nothing wrong with complexity when it’s the appropriate answer to a complex reality. But complexity itself should never be the goal.

So what is Kafka for?

Here are some rough rules of thumb to keep in mind:

1.) When you introduce Kafka, it should simplify communication in your system. And if you are just starting your company, make Kafka your canonical source of truth only if you believe it will offer the easiest way to regenerate the state of your system as that system evolves.

2.) What should be stored permanently in Kafka? You can store a lot in Kafka. At LinkedIn, they currently keep 900 terabytes of data in Kafka. But your system will almost certainly generate some useless debris that does not need to be stored forever in Kafka. Keep the data that is necessary to regenerate the state of the system. Feel free to delete the rest. If you have an app that pulls data from Kafka, runs some analysis, and then writes its results back to Kafka, those results are probably something you can delete, because the app could regenerate the same data by again drawing the original data from Kafka. It’s the original data that you need to keep.

3.) If you could replace Kafka with a simpler message queue, then you should. If you are only using Kafka as a message queue, then you probably don’t need Kafka.

4.) If you are using Kafka as your canonical source of truth, keep in mind that all the other databases in your system are either caches or input devices; they are no longer canonical sources of truth. You might pull data from Kafka, build a current snapshot, and store that in an SQL database, but that SQL database could be erased and it wouldn’t matter, which means it is just a cache. You could easily regenerate that snapshot by re-reading the data from Kafka, as sketched below.
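Here is a minimal sketch of that fourth rule, using the kafka-python client (the topic name, table name, and database handle are all hypothetical). The snapshot is disposable precisely because rebuilding it is just a replay from offset zero:

```python
from kafka import KafkaConsumer  # pip install kafka-python
import json

def rebuild_snapshot(db):
    consumer = KafkaConsumer(
        "canonical-events",                # hypothetical unified-log topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",      # start from the beginning of the log
        consumer_timeout_ms=5000,          # stop once we run out of messages
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    db.execute("DELETE FROM snapshot")     # safe: the snapshot is only a cache
    for msg in consumer:
        apply_event(db, msg.value)         # hypothetical per-event upsert

def apply_event(db, event):
    ...  # upsert the event's fields into the snapshot table
```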

The basic idea here is simple. Try to hang on to that simplicity, despite the tendency of the software industry to make simple solutions sound more complex than necessary.

A final thought: I do realize there are some situations that have no easy answers. If you are WalMart, you have a Point Of Sale (POS) system that handles sales in each store. And for years, you’ve probably thought of that POS as the canonical source of truth for customer interactions. But you’ve probably reached a level of internal complexity where you need to put something like Kafka in between the POS and all the many systems that need information from it (your inventory system, your customer service, your tax accountants, your Profit & Loss managers, all need different kinds of reports). You want to treat Kafka as your canonical source of truth, but it will take years before you can rewrite every system so that those systems stop looking to the POS for The Truth. That’s a legitimately difficult situation. And yet, in such situations, it is more important than ever that the technical leader speak plainly and clearly to all of the stakeholders who need information, and that message needs to be: “We are moving towards a world where Kafka will be The Truth, and all other data stores will merely be caches. Remember this.”
