My final post regarding the flaws of Docker / Kubernetes and their eco-system

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

A surprising number of Docker advocates argue for Docker/Kubernetes in a vacuum, as if the question was “Do you want devops, or not?” when the correct question is “Of this set of dozens of devops technologies, which subset is the right one for my company?” I’m often subjected to a three part argument that’s missing its middle part:

1.) Do you want easy development, deployment, consistency, networking, security, isolation and infinite scalability?

2.) ???

3.) In conclusion, Docker!!!!!!

Some of the conversations I’ve recently had about containers remind me of the older conversations about objects; especially the circular reasoning, blindspots, and conclusions which must be motivated by something that’s been left unsaid. This is what it often sounds like:

Advocate: Docker really helps with development. It’s all really simple.

Critic: More than a standard Virtual Machine (VM), the kind of thing I could run on my Mac in VirtualBox?

Advocate: Are you joking? Docker is very lightweight compared to a standard VM. You can run a hundred Docker containers on your laptop. Can you do that with a standard VM?

Critic: No, of course not.

Advocate: See? Docker allows you to run a hundred apps in a lightweight manner! It’s all really simple!

Critic: But it forces me to run a lot of apps, doesn’t it? I don’t have a choice? I mean, if every app is in a separate container?

Advocate: That is a ridiculous myth! You don’t have to put each app in a separate container! You could put a hundred apps in a single container!

Critic: Oh, I see. Do you put a hundred apps in a single container?

Advocate: No, I put each app in a separate container.

Critic: Oh…

Advocate: But it’s easy to orchestrate all of the different apps! We have better and better tools all the time to make it easy to orchestrate a hundred apps in a hundred containers!

Critic: But if I actually needed to run a hundred apps, I could write a simple bash configuration script to orchestrate them. Isn’t Docker just giving me the same thing, but making it more complicated?

Advocate: What if an app fails to start? Are you going to add restart logic to your bash script? After a while, that becomes a very complicated script.

Critic: Well, I’d only have to write the configuration script once.

Advocate: Do you realize how stupid you sound? The words “I’d only have to write the configuration once” are among the most notorious words that a software developer can say. Because the reality is you end up changing your configuration endlessly, as your project grows and changes. You really end up writing it a million times. Anyway, no one writes bash anymore. This isn’t 1995. Using bash is an incredibly stupid idea.

Critic: Sure, that’s true, but you could use Ansible or Chef or even something older, like Puppet.

Advocate: None of those are real orchestration technologies. They don’t make it easy to get a hundred apps to talk to each other.

Critic: Okay, how does Docker help you get a hundred apps to talk to each other?

Advocate: The very premise of your question is ridiculous. Docker isn’t about orchestration! We use Kubernetes for that! And setting up a Kubernetes cluster is easy! It’s all really simple!

Critic: Oh. So what is Docker for?

Advocate: It’s actually a shipping container!

Critic: Oh? It allows you to deploy code?

Advocate: Exactly! It’s the easiest way to deploy the app once you’ve built it!

Critic: But wouldn’t it be easier for me to create a binary with all dependencies included, and deploy that? Like an uberjar on the JVM, or a Go binary, or something like that?

Advocate: Ha! You know nothing about the modern tech world! How can you standardize your deployments when you’re using different build systems?

Critic: So Docker frees me from building my code?

Advocate: No, you still need to build your code. But do you have a standardized way to build different apps with different technologies? How is your app going to get its environment variables?

Critic: In the old days I would just set the environment variables on my machines, and that is still a valid answer in simple setups. But assuming we’re in the cloud, can’t I just inject the environment variables at build time? Jenkins has several plugins that make this painless.

Advocate: Ha! You know nothing about the modern tech world! Nobody uses Jenkins anymore!

Critic: Why is that?

Advocate: Because we’re all using containers!

Critic: Why is that?

Advocate: Without containers, how is your app going to get its environment variables?

Critic: Um, wait, really? We just…

Advocate: And what about databases? What about the failure modes of your persistence layer?

Critic: Well, at the application level, I’d use a library that has automatic retries, and at the system level, I’d presumably have some health checks in place, that can force a restart when necessary.

Advocate: So complicated! Think about how many technologies you just mentioned! That’s totally insane! You need to simplify your system!

Critic: Oh, okay, well, how would you do it?

Advocate: Easy! First I Dockerize my app and create the Dockerfile that details what sort of operating system I’m expecting in the container, and what software, if any, I expect to see in that container, then, for security, I create a private image repository and verify the integrity of everything in it, or I use a 3rd party repo, after I verify the integrity of its images. It’s all really simple!

Critic: Oh, I see, that’s for the software libraries, right? Like if I’m working in Java, I might have a private Maven repo?

Advocate: No, if you’re working in Java and need private packages, you still need to set up a local Maven repo. Setting up a private Docker image repository would be in addition to that.

Critic: Oh!

Advocate: Stop interrupting! The beautiful thing about creating your app container with Docker is that every command that you run becomes its own unique layer in the container, and you can run the command “docker history” to see exactly how your image was created. If you run a command such as ^ RUN mkdir -p /usr/local/bin/ ^ that becomes its own layer! And then if you run ^ RUN yum install emacs ^ that becomes its own layer! And you can run the “docker diff” command to see the difference between each image, so if you save each layer as its own image, you can see the difference between every layer! This gives you fine-grained control over the way each layer makes up your app image! It’s all really simple!

Critic: That sounds fascinating! And how have people used this power?

Advocate: Oh, God only knows. I make a change and then recreate the whole image. I guess it helps with some cache stuff, in some situations? But listen, someone, somewhere, is probably doing very cool and amazing and mind-blowing things with this power!

Critic: Um, could you give an examp…

Advocate: VERY COOL THINGS!!!!!

Critic: Um…

Advocate: Stop interrupting! Locally, I can now run the “docker build” command to create my image. It’s all really simple!

Critic: Image?

Advocate: Yes, as the documentation says, “read-only templates from which Docker containers are launched.”

Critic: How long do the builds take?

Advocate: For small apps, just a few seconds.

Critic: And large, complex apps?

Advocate: A bit longer.

Critic: And I rebuild after each change?

Advocate: That’s how most developers have done it in the past.

Critic: It sounds like the old cycle of write, compile, run, change, compile, run, change, compile, run, change, and so on. Like Java programming, 15 years ago, before we had ways of doing hot reloading.

Advocate: Don’t worry, layer caching is getting so good, soon it’ll be just like hot reloading. For some platforms and some languages you can already do it! Almost.

Critic: Oh, good. How soon will that be reliable and consistent?

Advocate: You’re getting sidetracked by trivialities.

Critic: My productivity as a developer is a triviality?

Advocate: Would you like a system that can be scaled to infinity?

Critic: Well, sure, but I have to balance a menu of concerns, so…

Advocate: If you want to live in the future, some sacrifices will be necessary.

Critic: Are you saying that my productivity as a developer should be sacrificed so…

Advocate: DO YOU WANT A SYSTEM THAT CAN SCALE TO INFINITY?

Critic: Well, I’m not opposed…

Advocate: Good! So stop interrupting. Now, as I was saying, I can use the “docker run” command to spin up the image, then I can use “docker ps” to see the IDs of the running containers, then I use “docker exec” to get an ssh session in the container and see what is going on. It’s all really simple!

Critic: Wouldn’t it be easier just to run the app in my terminal?

Advocate: You’re embarrassing yourself! Who cares if it works on your machine? You call yourself a professional? All that matters is whether it runs on other machines, and that’s what Docker is giving you, true consistency across platforms, true certainty that what is running on your machine will also run in the cloud!

Critic: But I could use a terminal inside of a standard VM, right? And then deploy that to any cloud. An AMI can run on my Mac, in VirtualBox, and the same AMI could also run on AWS.

Advocate: Are you sure you’re intelligent? Because sometimes you seem pretty slow. We already covered why you can’t use a standard VM!

Critic: And why was that again?

Advocate: Because it isn’t Docker!

Critic: Oh…

Advocate: Try to keep up. Now, I configure my CI to build the container image. This is pulling from my secure private Docker image hub, or some 3rd party service where I’ve verified the integrity of the images. These can be pulled in and built and stored in my image repository. It’s all really simple!

Critic: How would I handle a situation where my app has heavy-weight and very specific dependencies that need to be locally cached for fast building?

Advocate: Oh, I don’t know, maybe write a bash script to force a local cache refresh?

Critic: I thought you said nobody was using bash anymore? That using bash was “an incredibly stupid idea”?

Advocate: Everything is smart once you use Docker! Bad ideas become good again!

Critic: Um…

Advocate: And that’s it! Basically, on the app side, I’ve covered everything. It really is that simple. Set your CI to deploy from your repo. You spin up your containers and the app runs. Of course, you also need to set up the database.

Critic: But don’t you actually have to map every port in the container to a port in the outside world, and then those external ports have to be dynamically remapped to other ports depending on what apps are actually running? Isn’t auto-discovery a lot to configure?

Advocate: No, for God’s sake, you are so stuck back in 2016. Maybe once upon a time you had to be careful to enable autodiscovery on your Kubernetes daemonset and also be careful to attach the proper annotations to your pods, but that was a long, long time ago. I haven’t done it in weeks. Nowadays we just use Helm and the correct Helm Chart. Helm is the package management system for Kubernetes. It’s all really simple!

Critic: I can see using a default Chart for a default setup, but isn’t it true that most developers will have to tweak the setup to their actual needs? For instance, most companies will have some specific logging needs? Or if I’m using the PostGres database, what if I want to add in pgBadger to generate reports?

Advocate: Yes, that can get complicated, but it’s fine because, really, you only have to write the configuration once.

Critic: But you just…

Advocate: So long as you use the right configuration file, you’ll be fine. Be sure you don’t touch stuff like postgresqlConfiguration because you might end up wiping out some of the values you need. But, really, it’s very easy. Do you understand what a nightmare it can be to try to achieve high availability of databases without Docker and Kubernetes?

Critic: I typically set up a system that relies on etcd, with some kind of “I’m alive” heartbeat check against etcd.

Advocate: That’s a complete nightmare! Think about how much work it is to build a system like that!

Critic: But Kubernetes is built around etcd, isn’t it? All the data about the cloud, everything that Kubernetes is supposed to set up or maintain, all of the state of the current system, that has to live in etcd, doesn’t it?

Advocate: Yes, but we no longer have to set up everything manually! We have tools to automate the work! It’s all really simple!

Critic: Even if Helm and Helm Charts make it easy to install a set of apps into a Kubernetes cluster, surely some actions are more complicated than a simple install? What about upgrading a PostGres database?

Advocate: Way ahead of you! We worked out that problem a long time ago! You just use a Helm Operator! It’s all really simple!

Critic: Really? And this solves all the problems of upgrading PostGres in a stable and reliable way?

Advocate: Uh, well, it’s supposed to. It’s, uh, all really simple?

Critic: Supposed to?

Advocate: Sure, so long as you have a working Go environment, you just use the Operator SDK to generate the scaffold for your Operator.

Critic: The scaffold?

Advocate: Sure, the scaffold sets up the basics, figures out the permissions, the dependencies, everything needed to install the Helm Chart. Then you can build the Operator container, and install it in your Kubernetes cluster. It’s all really simple!

Critic: This sounds complicated.

Advocate: Way ahead of you! The good folks at RedHat knew people like you were going to whine about stuff, since you obviously like to whine about stuff, so they created the Operator Lifecycle Manager to make all of this a lot easier.

Critic: Doesn’t it seem like we keep piling new technology on top of new technology, to manage the excessive complications of the previous layer of technologies?

Advocate: Hey, think of the alternatives. You don’t want to go back to the bad old days of the past, do you?

Critic: You mean, the bad old days when stuff mostly worked and I didn’t have to learn 3 new alpha technologies each day?

Advocate: That’s a ridiculous exaggeration! Some of these technologies are beta.

Critic: And you seriously regard these piles of code, heaped upon piles of code, as an improvement on the old situation?

Advocate: Are you kidding? It’s like night and day. I’d rather drink arsenic than get dragged back to the bad old days when I had to write Ansible scripts. We live in the future now. We’ve escaped the old world where every attempt at devops became a painful, confusing, unmaintainable disaster after 2 years.

Critic: How long have you been using Docker/Kubernetes in production?

Advocate: 18 months.

Critic: Did Docker/Kubernetes solve all of the problems you were having with Ansible?

Advocate: If you want to dive deep into the weeds then you’re going to have to carefully define your technical terms.

Critic: Which technical terms?

Advocate: Well, uh, define the word “all”.

Critic: I’ll re-phrase. Has Docker/Kubernetes solved your top 3 Ansible problems?

Advocate: Yes!

Critic: And it did this without introducing any new problems?

Advocate: Define “new”.

Critic: Did you actually get rid of Ansible, or are you still using it?

Advocate: Define “still”.

Critic: So you’re just adding more technologies on top of the old technologies?

Advocate: Define “on top”.

Critic: How long has your Docker/Kubernetes setup been stable?

Advocate: Define “stable”.

Critic: Doesn’t the complexity get insanely out of control?

Advocate: Way ahead of you! Have I told you about Rancher? It’s a complete platform for managing Kubernetes because, let’s face it, sometimes Kubernetes is a nightmare.

Critic: But you said that setting up Kubernetes clusters was easy!

Advocate: Define “easy”.

Critic: You’re the one who used “easy” so maybe you should define it!

Advocate: Uh, “People on Reddit seem to like it.”

Critic: And once I’ve worked with the Helm Operator SDK and generated the scaffold, don’t I still need to write out some operational knowledge of how to upgrade my specific instance of PostGres? After all, every version of PostGres has some unique concerns when doing an upgrade.

Advocate: Sure, we can’t automate everything. In the end, you have to write a few details down.

Critic: But how is this an improvement over the bad old days when I wrote a bash script to upgrade my instance of PostGres?

Advocate: Infinite scaling! Your old bash script was ad-hoc and error prone and could only be used once! We have real automation now! Look, I get that Docker and Kubernetes and Helm and Helm Operators might seem like a little bit of setup work…

Critic: A little!

Advocate: …but once you’ve got it all set up, then you’ve got a system that scales to infinity! Don’t you want infinity? It’s all really simple!

Critic: Infinity is nice, but I’ve actually got a menu of concerns I am responsible for, and infinite scaling is just one of them.

Advocate: Why are you being so conservative with your technology choices?

Critic: What?

Advocate: You’re stuck in the past! You refuse to learn new things! You’re an example of an “Expert Beginner”! You think you know things, but all of your knowledge is out of date! You haven’t kept up with the times!

Critic: Well, I just learned Terraform and Packer.

Advocate: Never heard of them. Do they help with Docker?

Critic: They could, but actually, they sort of make Docker unnecessary. It’s fascinating because with a tiny Terraform script you can…

Advocate: You’re doing it again!

Critic: What?

Advocate: Refusing to learn new things!

Critic: But I was just telling you about Terraform. It’s really interesting because you can use it to…

Advocate: [ putting hands over ears and shouting loudly ] BLAH BLAH BLAH I CAN NOT HEAR YOU BLAH BLAH BLAH YOUR WORDS CAN NOT HURT ME NOW!

There are many problems that will probably never be solved inside the paradigm established by containers and Kubernetes, but it is absolutely impressive how much money and intellectual brilliance is being poured into the effort. And yet, is there really a way to automate something as complex as upgrading a database, or reattaching a database’s persistent volumes to a database master that is running in a stateless pod? Consider the effort that is being made to try to solve these problems:

The CoreOS team (now part of RedHat) developed the concept of Kubernetes Operators. An Operator implements common operational tasks in code. These are run either manually when an API is invoked, or automatically when required or on a schedule. Such tasks could be “back up database” or “create a new read replica”. As such, Operators can reduce the administrative burden even for complex systems.

However, as we all know, automating the relatively easy tasks is easy. It is much harder when the tasks are more difficult. Adding a read replica may be easy, but fixing a database’s broken write-ahead-log file that was corrupted by a failing file system is not. Therefore, the engineering effort that goes into Operators is considerable. The etcd Operator is one of the most mature ones, and it currently has about 9,000 lines of code. And counting.

Sadly, it is unlikely that any Kubernetes Operator can cover all operational aspects of even a single complex stateful data store. They definitely make certain tasks easier. But if they could cover all the error cases and recover automatically, why would that functionality not already be in the code of the stateful data store to begin with?
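To make “operational tasks in code” concrete, here is a minimal sketch of the kind of reconcile loop an Operator is built around. It is plain Python, not the Operator SDK, and the names `ClusterState` and `reconcile`, along with the action strings, are hypothetical and exist only for illustration. Notice that it handles the easy cases (replicas, overdue backups) and simply gives up on the hard one:

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of a database cluster (illustration only)."""
    read_replicas: int
    hours_since_backup: float
    wal_corrupted: bool = False


def reconcile(desired: ClusterState, observed: ClusterState,
              backup_interval_hours: float = 24.0) -> list[str]:
    """Return the actions needed to move `observed` toward `desired`."""
    actions: list[str] = []

    # The hard case: no safe automatic fix is assumed here, so ask for a human.
    if observed.wal_corrupted:
        return ["page a human: write-ahead log is corrupted"]

    # The easy cases: scale read replicas up or down.
    diff = desired.read_replicas - observed.read_replicas
    if diff > 0:
        actions += ["create read replica"] * diff
    elif diff < 0:
        actions += ["remove read replica"] * (-diff)

    # Scheduled task: back up when the last backup is too old.
    if observed.hours_since_backup >= backup_interval_hours:
        actions.append("back up database")

    return actions


desired = ClusterState(read_replicas=3, hours_since_backup=0)
observed = ClusterState(read_replicas=1, hours_since_backup=30)
print(reconcile(desired, observed))
```

A real Operator would watch the Kubernetes API and carry out these actions rather than return strings, and every branch that currently says “page a human” is where the thousands of lines of code go.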

Also, Jessie Frazelle asks us to consider that this ticket has been open for a year:

Jessie Frazelle has an excellent post with more details:

Kubernetes is not to be used for stateful data. There has been a lot of work done in this area but it is still not sufficient. For the more technical members of our audience I direct you to exhibit A. The linked issue goes over problems when a “StatefulSet” gets into an error during deploying or upgrading. This can lead to data loss or corruption since Kubernetes will need manual intervention to fix the state of the deployment. This could even lead to the point where the only recommended fix is you delete the state. What does this mean for your business? Well, if you lose or corrupt your data it could mean a lot of different things depending on what the data was. If the data was your customer database of new account signups, well you might have just lost the data for your new customers. If you are an ecommerce site, it might have been your latest sale. If you are in banking or investments, it might have been data accounting for the movement of capital.

Is this the end of the era of the inexpensive-to-launch software startup?

The following advice (from the Elastisys article) is correct for any one specific business, but for the overall world of startups this situation leaves me feeling sad about where the tech industry has got itself:

You should ask yourself this. Is what makes your business unique your ability to manage databases (or other stateful data stores)? No? Then get a hosted database service from your cloud provider. Spend your time and effort on what makes your business unique instead. And on the off chance that you answered “Yes!”, then you should go out and find everybody out there who answered “No”. Because there are many out there!

I’ve been hearing this more and more: use hosted services, because the new devops situation is too complex for mere mortals; only a handful of experts really understand it.

My concern is this: we just enjoyed a roughly 25 year stretch, maybe 1990 to 2015, when the economics of starting a business favored software startups: cheaper computers, cheaper network bandwidth, open source software; it all combined to create a world where a small handful of people could come together, start a company, and do amazing things. And the magic ingredient was “It is really cheap to get started.” But nowadays, more and more, I’m hearing, “All this devops stuff is so damn crazy complicated, you probably can’t figure it out, so you should probably just use a hosted solution.” And that raises costs. I worry that we are slowly going back to the world that existed before 1990, when “software” meant “expensive”. If we are not careful, this beautiful era of software startups will be suffocated by the complexity we are needlessly inflicting on ourselves.

When I say this, some people respond, “These cloud services are ridiculously cheap and they actually help lower costs.” I’ll respond to that in a moment.

(Please note, I’m not arguing against all managed services here. I’m only arguing against making standard devops tools so complicated that we poor mortals have no choice but to use managed services.)

You are not Google

I said “the complexity we are needlessly inflicting on ourselves”. I mean “needless” in the sense that Oz Nova meant when he wrote “You Are Not Google“:

Software engineers go crazy for the most ridiculous things. We like to think that we’re hyper-rational, but when we have to choose a technology, we end up in a kind of frenzy — bouncing from one person’s Hacker News comment to another’s blog post until, in a stupor, we float helplessly toward the brightest light and lay prone in front of it, oblivious to what we were looking for in the first place. This is not how rational people make decisions, but it is how software engineers decide to use MapReduce.

As Joe Hellerstein sideranted to his undergrad databases class (54 min in):
The thing is there’s like 5 companies in the world that run jobs that big. For everybody else… you’re doing all this I/O for fault tolerance that you didn’t really need. People got kinda Google mania in the 2000s: “we’ll do everything the way Google does because we also run the world’s largest internet data service” [tilts head sideways and waits for laughter]

In response to one of my previous essays, alter3d criticized my ideas with this comment:

This guy needs to tell Google that 100% of their infrastructure is wrong.

That’s a valid criticism, if you are overseeing infrastructure at Google. If you run devops at Google, please ignore my essays, I have not written anything that is relevant to you.

But are you running devops at Google?

Let’s talk about the word “agile”. It has meant different things in different eras. I’ve been writing software for 20 years, and I’ve been building software startups for 17 years. Almost everything we did in 2002 would be considered wildly unprofessional by today’s standards, and some of it was considered unprofessional by the standards of 2002. But it did allow us to iterate fast.

We did not start using any version control till 2005, and then it was Subversion. (I didn’t start using Git till 2012). In 2002, we were working on a blog engine written in PHP. (Weblogs were still a new idea then, and around that time Typepad raised $23 million to build out their weblog service.) We had two big web servers that we rented from Hostway, one for serving our frontend, and one for the database. Each server was $100 a month. There was no failover for the database. The backups for the database were saved to a folder which I had to remember to download to my computer every day or two, or three. Deployment meant we edited a PHP file, or an HTML file, and we uploaded it with FTP to our frontend server. We deployed 50 times a day, sometimes 100 times in a day. We definitely tested in production, but maybe not in a cool, sophisticated way. We had a ridiculous amount of fun, brainstorming ideas and pushing them out at a fast pace. We were extremely agile, under the only definition of agile which should matter to a small startup, which is fast iteration of the basic idea. When weblogs didn’t work out for us, we pivoted to ecommerce software, and there we had our main success. Being able to do fast pivots is the life-or-death question for small startups. When I think of that era, I am embarrassed about a great deal, yet our system had some attributes that I would still be willing to copy for a small startup:

1. Our hosting costs were ridiculously cheap. During the first year we did not spend more than $200 a month on servers.

2. Our devops setup was so simple that the non-technical staff understood it perfectly. Our graphic designer could completely redo our user interface, and deploy the new version, without needing any help from a computer programmer.

3. We were extremely agile. If a customer sent us a suggestion for a feature, we could design it, code it and deploy it in a day. We could experiment with different versions of code at different vhosts in Apache.

The extreme simplicity of the devops situation meant that we could focus on the other aspects of product/market fit. (My understanding of startup development is that Steve Blank will be angry with you if you focus on operational optimizations before you have found product/market fit.)

There are other definitions of “Agile” that are valid, but be aware what you might lose if you adopt them. Consider this definition, being pushed by Microsoft’s Azure service:

Achieve agility at scale with Kubernetes and DevOps

As containers, environments, and the teams that work with them multiply, release frequency can increase, along with developmental and operational complexity. By employing DevOps in Kubernetes environments, you can move quickly at scale with enhanced security.

Under this definition, you have to spend a lot of money to get back to the release frequency that we innocently enjoyed in 2002. If you actually need the reliability offered by these services, then of course you should investigate them to see if they answer your company’s needs. The most important phrase in the Microsoft text is “at scale”. Be sure your need for scale is real before you start spending money on this option. As I mentioned in “High Availability is not compatible with an MVP, because an MVP is about fast iteration”, every one of the CEOs I’ve worked with recently has insisted they need High Availability right from the start. This is perfectionism, and perfectionism is dangerous for a business.

By the way, everywhere in this essay that I use the word “Kubernetes” I’m sure someone will suggest a hosted service that is supposed to remove all the pain of dealing with Kubernetes. Jessie Frazelle wants to remind us that even when you do use a hosted service, you are not avoiding all of the pain:

Now you are probably thinking, “my cloud provider said they’d take away all the pain you just described by selling me their managed Kubernetes.” That is indeed the dream. However, it is not reality. Having worked for some cloud providers, I have seen the pain customers still go through trying to learn the patterns Kubernetes implements and applying those patterns to their existing applications. This means your teams will still have to handle the steep learning curve. Just because it’s managed does not mean that your application’s uptime and availability are covered. That is still on your team.

(Slightly off topic, but I previously wrote how surprised I was, when I decided to use the managed ElasticSearch service that AWS offers, that it did not auto-scale the memory use, so I started getting OutOfMemory errors. Seriously? I have to do that manually? What is the advantage of using a hosted service?)

Cloud services help you save money!

Not exactly. Azure and Google Cloud and AWS can save you some money in certain situations, but typically you have to re-invent your architecture to take advantage of what cloud services offer. You get nothing but pain if you take your co-location data center setup and move it directly, unmodified, to the cloud. In “High Availability is not compatible with an MVP, because an MVP is about fast iteration” I mention my friends who were paying $7,000 a month at their co-location center, moved without changes to AWS, and ended up paying $35,000 a month. If you want to take advantage of cloud services, you will have to change your architecture, and re-inventing your architecture is itself a cost that should be included with any cost comparison.

If you offer a service that is easily decomposed into discrete compute units, and if usage is very uneven, with big spikes and long lulls, then something like AWS Lambda can save you money. But keep in mind it will be slower than what you can achieve on your own, and if your usage pattern is steady, instead of lumpy, then Lambda might in fact be more expensive than simply running some servers 24/7.
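Here is a rough way to check this for your own workload. Every price in the sketch is a placeholder that I made up for illustration, not a real AWS rate, and the function names and the $200 server bill are hypothetical; plug in current numbers from the pricing pages before trusting the result.

```python
def monthly_lambda_cost(requests_per_month: float,
                        avg_duration_s: float,
                        memory_gb: float,
                        price_per_gb_second: float,
                        price_per_million_requests: float) -> float:
    """Pay-per-use cost: compute time plus a per-request fee."""
    compute = requests_per_month * avg_duration_s * memory_gb * price_per_gb_second
    requests = requests_per_month / 1_000_000 * price_per_million_requests
    return compute + requests


def break_even_requests(server_cost_per_month: float,
                        avg_duration_s: float,
                        memory_gb: float,
                        price_per_gb_second: float,
                        price_per_million_requests: float) -> float:
    """Monthly request count at which pay-per-use costs as much as fixed servers."""
    cost_per_request = (avg_duration_s * memory_gb * price_per_gb_second
                        + price_per_million_requests / 1_000_000)
    return server_cost_per_month / cost_per_request


# Placeholder prices, NOT real AWS rates.
PRICE_PER_GB_SECOND = 0.00002
PRICE_PER_MILLION_REQUESTS = 0.20

threshold = break_even_requests(
    server_cost_per_month=200.0,  # hypothetical bill for always-on servers
    avg_duration_s=0.5,
    memory_gb=1.0,
    price_per_gb_second=PRICE_PER_GB_SECOND,
    price_per_million_requests=PRICE_PER_MILLION_REQUESTS,
)
print(f"Break-even at roughly {threshold:,.0f} requests per month")
```

Below the threshold the pay-per-use option is cheaper; above it, the servers that run 24/7 win. The sketch ignores data transfer, the managed services you end up adding, and the engineering time spent re-inventing your architecture, all of which belong in the comparison.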

There are many issues to consider. This exchange from Hacker News brings up the kinds of arguments that developers are still having regarding AWS Lambda:

abiro wrote:

PSA: porting an existing application one-to-one to serverless almost never goes as expected. Couple of points that stand out from the article:

1. Don’t use .NET, it has terrible startup time. Lambda is all about zero-cost horizontal scaling, but that doesn’t work if your runtime takes 100 ms+ to initialize. The only valid options for performance sensitive functions are JS, Python and Go.

2. Use managed services whenever possible. You should never handle a login event in Lambda, there is Cognito for that.

3. Think in events instead of REST actions. Think about which events have to hit your API, what can be directly processed by managed services or handled by you at the edge. Eg. never upload an image through a Lambda function, instead upload it directly to S3 via a signed URL and then have S3 emit a change event to trigger downstream processing.

4. Use GraphQL to pool API requests from the front end.

.

foxtr0t wrote:

So, to summarize, you should:

1. not use the programming language that works best for your problem, but the programming language that works best with your orchestration system

2. lock yourself into managed services wherever possible

3. choose your api design style based on your orchestration system instead of your application.

4. Use a specific frontend rpc library because why not.

Four different approaches to devops

Docker/Kubernetes offers you unrivaled fine-grained control of resources. Do you need that? There are older approaches that might work for you. Here I’ll list the 4 main styles I’ve seen over the last 20 years:

1.) Slap some code on the server.

2.) Bare metal servers, with enough redundancy to easily support failover frontends and failover databases.

3.) Virtual Machines, Vagrant, Heroku.

4.) The true cloud optimized technologies: Terraform/Packer and Docker/Kubernetes.

Of these 4 approaches, which is best? The whole point of this essay is that there is no best. All 4 of these approaches are still in use, and all 4 approaches are valid.

(And obviously I’ve simplified things. I’m leaving out the many variations of CI/CD setups, version control workflows, and monitoring/observability tools. I cannot cover every variation without writing a book.)

Each of these approaches has both strengths and weaknesses, which I’ll detail next.

Slap some code on a server

This is what we did in 2002. It really worked fine for us. Our customers understood our website occasionally had glitches. If you follow this strategy, you might get a reputation for being a bit unprofessional, but your devops costs will be rock bottom, and you can pass those cost savings along to your customers, or put them in your own pocket.

You might think this approach is hopelessly obsolete but I believe there are many advertising networks that still work this way — they often seem extremely glitchy, and website consumers nowadays use so many ad-blockers, it would almost be stupid for an ad network to offer an SLA contract to advertisers. What is the point of offering 99.999999% reliability when 50% of ads are blocked by some other filter, somewhere else on the network, or in the browser?

Also, I know of many graphic designers who knock up cheap WordPress sites for clients, and this is exactly their devops setup: they FTP code to the production server (which is the only server). If a company is paying $500 for a complete website, this is exactly the level of setup they should expect. Some companies compete on the basis of rapid iteration of marketing ideas; all they need is fast micro-sites for their next campaign. If cost really is more important than quality, for whatever market the company is in, then they are making a rational decision.

Bare metal servers, with redundancy

In 2011, when I worked at Shermans Travel, they were renting 24 machines on a long-term lease. They had two frontend web servers with another 2 for failover, 3 small test frontend machines, a big database machine with a failover, 3 small machines for test databases for development, plus a few other machines for other things, such as email.

We wrote code on our own machines but we didn’t run databases on them; instead we used the various remote test database servers. The tech team consisted of 6 programmers, 1 devops person, and 1 project manager. The devops person ensured that we always had some remote test database we could develop against, and that it had a copy of the current data, so we could see real world effects, such as any slowness that might appear if we wrote an 11 table JOIN against tables with tens of millions of rows.

Our project manager was also our QA team, and she knew how to deploy the code from Subversion to various test servers, where she would then kick it around and tell us if she found any bugs. Deployment was handled with Capistrano scripts.

The system was simple, everyone understood it, everyone could work with it. It was a perfectly good system that allowed us to push out changes fast.

There are two problems with a system like this, which you need to think about before you imitate this style at your own company.

First of all, each developer had to set up the software on their own machine, and it often took a few days of effort to get everything installed and running locally. So that is a tax that every new developer pays. How often should your company pay that tax? If you have one or two developers, and they stay with you for many years, then the tax is not worth worrying about. But if you have 300 software developers, that tax will be horrendous, so you need to adopt a different style of development.

Second of all, when the company was at its peak, with 4 million weekly users, the long-term lease machines were cheaper than AWS. However, the company eventually lost most of its audience, and when the audience fell to just 1 million weekly subscribers, the long-term leased machines represented excess capacity and were way too expensive. When the lease ended the company moved to AWS, because when your audience is fading, it’s good to be able to cut costs quickly. I previously mentioned that co-location data centers can be surprisingly cheap, but if you are in a long-term lease, that can be a problem if you’ve committed to more resources than you actually need.

Virtual Machines, Vagrant, Heroku

I worked with some startups where the software was developed in Ruby On Rails, in a VM, run in Vagrant, and then deployed to Heroku. For as long as a startup can use Heroku, it probably should, because Heroku keeps the devops situation as simple as possible.

One place where I saw the transition was at TimeOut.com, in 2012. They had built their main CMS with PHP, using the Symfony framework. They were also building a new API using Scala. And they’d bought a company that had a ticket selling system in Ruby On Rails. And when I got there, every developer was setting up all 3 software systems on their own machines. The Scala and the Rails apps were easy enough, but the PHP system had dozens of dependencies that were not managed by Composer and it took me 2 weeks to get it running.

While I was there they came up with a VM running CentOS (the same as our production machines) and they put the PHP CMS on that. After I left I believe they added the other apps to the VM, so when a new developer was hired, all the developer had to do was download the VM and spin it up in something like VirtualBox, and voilà, they had all 3 apps running. Certainly, it’s an option to consider for your company. This makes it very easy for a new developer to become productive, and they work in an environment that is identical to the production machines. You can set up a standard Virtual Machine that runs Linux or Windows, and that means you can stay with the environment that you have many years of experience with, using tools that you’re familiar with.

There are two problems with this style of development.

One is that resource use is coarse-grained compared to Docker/Kubernetes. If you want to increase your capacity, you are spinning up a new VM (or in AWS, a new EC2 server).

Two, if you are a developer, running a VM on your machine can be a real pain. From the point of view of 2016, the interest in Docker was obvious, and justifiable, because everyone was sick of working with standard VMs via software like VirtualBox on the Mac. Consider these comments on Hacker News:

rhinoceraptor on July 29, 2016

I think by ‘production’, they mean ‘ready for general use on developer laptops’. No one in their right mind is deploying actual production software on Docker, on OS X/Windows.
I’ve been using it on my laptop daily for a month or two now, and it’s been great. Certainly much better than the old Virtualbox setup.

.

mherrmann on July 29, 2016

I’m still using VirtualBox. Could you elaborate why Docker is better?

.

numbsafari on July 29, 2016

Leaving containers vs VMs aside, Docker for Mac leverages a custom hypervisor rather than VirtualBox. My overall experience with it is that it is more performant (generally), plays better with the system clock and power management, and is otherwise less cumbersome than VirtualBox. They are just getting started, but getting rid of VirtualBox is the big winner for me.

The phrase “slippery slope” is used too easily and too often, but in this case it really applies: “A VM is sluggish on my machine, Docker is lightweight, let’s switch to that. Oh, but wait, does that lead to complications in production? Hmm, okay, so we’ll just use Docker in development, we won’t use it in production. But wait, aren’t we missing all of the real benefits, if we don’t use it in production? It’s too weird to develop a working container and then not use it. So yes, let’s use it in production, but wait, we need to orchestrate this, so let’s use Kubernetes, but wait, Kubernetes can be a pain, so let’s also use Rancher.”

I’ve seen too many startups take one tiny step down that road and a minute later they are asking “How do we re-attach a failover persistent volume that might contain some corrupted data?”

Also, keep in mind, perfectionism can hurt you. I’ve run into a lot of CTOs who seem to want a devops setup that is truly painless. Is that realistic? I am willing to believe that God has a flawless system of failovers for the databases where God tracks our sins and our virtues, but I don’t believe any such devops setup will ever be created by the hand of mortal flesh born of this fallen and hollow world. There will always be some pain; if you are CTO, you must decide what tradeoffs are best for your company. What Charity Majors said about debugging in development also applies to devops:

Could we have ironed out all the bugs before running it in prod? No. You can never, ever guarantee that you have ironed out all the bugs. We certainly could have spent a lot more time trying to increase our confidence that we had ironed out all possible bugs, but you quickly reach a point of fast-diminishing returns.

We’re a startup. Startups don’t tend to fail because they moved too fast. They tend to fail because they obsess over trivialities that don’t actually provide business value. It was important that we reach a reasonable level of confidence, handle errors, and have multiple levels of fail-safes (i.e., backups).

There are many things to consider when thinking about the devops setup for your company but “It should be painless” is probably not the right lens to use for this question.

The true cloud optimized technologies

When the cloud first emerged circa 2009, most companies treated the cloud servers as if they were standard servers in a standard data center. Though new projects like Docker were launched, cloud native technologies did not grab the attention of the mainstream of the tech industry till somewhere around 2014 or 2015.

You can use Packer to create a VM that you can use on your own machine and also run in the cloud, and you can use Terraform to script how many servers you want to spin up, what security groups you want, what permissions they will have, what databases you want to run, and what backups you want to have. Just about everything in your devops setup can be scripted with Terraform. Terraform/Packer allows you to stick with the pattern of development I mentioned in #3; it just adds a layer of automation that makes everything easier, and allows you to take full advantage of the cloud.

When you want to scale up, using a full VM is more coarse-grained than using Docker/Kubernetes, so in some cases you might waste resources, and in some cases you might waste money. With Docker/Kubernetes, if you have 100 microservices running, and your business expands, you can increase one microservice by 3%, another by 8%, another by 2%, another by 342%, another by 17%, and so on. Nothing can match the fine-grained control that Docker/Kubernetes gives you, which is probably why Google likes it so much. It’s a good choice for some companies, just please be sure you’ve thought about it carefully before jumping on that bandwagon.
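The difference in granularity can be shown with a little arithmetic. The sketch below is a deliberate simplification: the service names, the CPU demands (in millicores, where 1000 millicores is one CPU core), and the 2000-millicore machine size are all hypothetical, and it ignores memory, scheduling overhead, and bin-packing fragmentation. It compares giving each service its own whole VMs against letting containers from all services share a pool of nodes.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, avoiding floating-point error."""
    return -(-a // b)


def per_service_vms(demand: dict[str, int], vm_millicores: int) -> int:
    """VM-style scaling: every service rounds up to whole VMs on its own."""
    return sum(ceil_div(d, vm_millicores) for d in demand.values())


def shared_nodes(demand: dict[str, int], node_millicores: int) -> int:
    """Container-style scaling: only the combined demand rounds up."""
    return ceil_div(sum(demand.values()), node_millicores)


def grow(demand: dict[str, int], pct: dict[str, int]) -> dict[str, int]:
    """Increase each service's demand by its own percentage, rounded up."""
    return {name: ceil_div(d * (100 + pct[name]), 100)
            for name, d in demand.items()}


# Hypothetical services and CPU demand in millicores.
demand = {"search": 1300, "checkout": 400, "email": 200, "images": 1900}
# Uneven growth, using percentages like those mentioned in the text.
growth = {"search": 3, "checkout": 8, "email": 2, "images": 342}

MACHINE = 2000  # assumed capacity of one VM or node, in millicores

before = (per_service_vms(demand, MACHINE), shared_nodes(demand, MACHINE))
grown = grow(demand, growth)
after = (per_service_vms(grown, MACHINE), shared_nodes(grown, MACHINE))

print("machines before growth (per-service VMs, shared nodes):", before)
print("machines after growth  (per-service VMs, shared nodes):", after)
```

With these invented numbers the shared pool needs fewer machines both before and after the uneven growth. Whether that saving outweighs the cost of running an orchestrator is exactly the tradeoff this essay is about.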

Sometimes professionalism is bad, sometimes sloppiness is good, sometimes less is more

It’s becoming common that when I say “We don’t need to go with Docker/Kubernetes yet” someone else says “All professional shops are committed to it, if we don’t do this, then we are unprofessional.”

This continues the frustrating pattern, which has recurred many times in the tech industry, where a given technology is seen as the “believable promise” that is going to solve all the problems of the tech industry, and so it is then elevated to a status where it becomes untouchable, unimpeachable, unquestionable. I would really like to see the tech industry avoid making this mistake again.

“All professional shops are committed to it, if we don’t do this, then we are unprofessional” was applied to different technologies during different years:

1.) Compiled languages (as opposed to script languages) for all of the 90s

2.) Microsoft’s stack (as opposed to open source software) for all of the 90s

3.) Java EJBs/Struts, circa 2001

4.) XML (It’s a configuration language! It’s an RDF serialization! Strict XHTML is the future of HTML!) circa 2004

In every case, the tech industry was over-committed to a technology that eventually was seen to have some limits, and where alternatives existed that were eventually discovered to have unique benefits that could not be imitated by the technology that everyone had committed to.

Arguing “We must do this because this is professional” is a variation of “argument from authority”. You’re not explaining the inherent goodness of the technology, but rather, you’re basically saying we should use it because everyone else is using it.

But in truth, sometimes a bit of sloppiness has cost savings that will be appropriate for your business. In the same way that sometimes a money manager might go long on soybeans and see that daring bet pay off, sometimes a company can run up a large tech debt, carry it for a long time, and have that daring bet pay off. Indeed, this is exactly what happens in real life, when tech debt is not an accident, but part of a deliberate plan. And, as any money manager will tell you, it is impossible to mitigate all risks. In a theoretically pure market there is no profit because perfect competition drives out all the profit. In the real world, the winner, with the biggest profit, is the company that finds the best way to arbitrage the risk.

Charity Majors said it well:

Organizations will differ in their appetite for risk. And even within an organization, there may be a wide range of tolerances for risk. Tolerance tends to be lowest and paranoia highest the closer you get to laying bits down on disk, especially user data or billing data. Tolerance tends to be higher toward the developer tools side or with offline or stateless services, where mistakes are less user-visible or permanent. Many engineers, if you ask them, will declare their absolute rejection of all risk. They passionately believe that any error is one too many. Yet these engineers somehow manage to leave the house each morning and sometimes even drive cars. (The horror!) Risk pervades everything that we do.

I might regret suggesting sloppiness can be good, since I realize this part of the essay is easily misread. The point is subtle. On the one hand, it is good to be realistic about the reasons why smart managers sometimes decide to allow tech debt. On the other hand, most tech debt arises accidentally, and it has a terrible effect on the morale of the software development team. Please don’t think this essay is making an argument for tech debt. For the most part, I’m arguing for the right level of simplicity, which can be understood as minimizing the number of moving parts that your team needs to worry about.

As CTO, you have an obligation to manage the risks of tech debt

Tech debt, if it is allowed to accumulate, will have a negative effect on the happiness of your software developers. In response to my previous essays, people made the point that some companies have a chaotic mix of different software packages and versions, and that Docker is the best way to manage the chaos. jstoja made this point on Lobste.rs:

jstoja
How do you manage easily 3 different versions of PHP with 3 different version of MariaDB? I mean, this is something that Docker solves VERY easily.

.

friendlysock
Maybe if your team requires 3 versions of a database and language runtime they’ve goofed…

.

jstoja
It’s always amusing to have answers pointing to the legacy and saying “it shouldn’t exist”. I mean, yes it’s weird, annoying but it exists now and will exists later.

.

friendlysock
It doesn’t have to exist at all–like, literally, the cycles spent wrapping the mudballs in containers could be spent just…you know…cleaning up the mudballs.

.

jstoja
I see it more like, the application runs fine, the team that was working on it doesn’t exist anymore, instead of spending time to upgrade it (because I’m no java 6 developer), and I still want to benefit from bin packing, re-scheduling, …

That’s absolutely true, and from the point of view of the individual software developers, Docker seems like a miracle that solves a lot of problems. If you are CTO and you are overseeing a situation where “3 different versions of PHP with 3 different version of MariaDB” is normal, and you have no plan for reducing the tech debt, your individual software developers will come up with a plan of their own, and you might not like it.

I once suggested that Docker will eventually be the kind of tech debt that we now jokingly associate with legacy Java apps. Someone responded that once everything is Dockerized, a company doesn’t have to worry about tech debt any more. That statement is as valid as “Once we do the complete re-write, we won’t have to worry about tech debt anymore,” which has become an industry joke. There is no reason to believe that Docker will solve the problems of tech debt, but rather, Docker moves everything to a higher level. Docker easily solves the problem of “3 different versions of PHP with 3 different version of MariaDB” but only by introducing a whole host of new devops technologies.

If you are a money manager, a big part of your job is managing the risk from the leverage (debt) you’ve taken on. If you are a CTO, a big part of your job is managing the risk from the tech debt you’ve taken on. What friendlysock wrote is altogether correct: “The cycles spent wrapping the mudballs in containers could be spent just… you know… cleaning up the mudballs.” jstoja’s point about sunk costs would be valid if the costs were actually in the past, but if you have to invest new money to keep old software running (by Dockerizing it), then that old software is not a sunk cost, and so re-inventing the software might be justified, since you have already decided to spend new money on it. Only you can make that decision; just be sure you take into account the full costs of a container strategy when you are at those crossroads.

When the tech industry gave up on Objects For Everything an important step forward was made

In my previous essay I wrote:

The tech industry considers itself open minded, but in fact it is full of movements which gather momentum, then shut down all competing conversations, for a few years, then recede, and then it becomes acceptable for all of us to poke fun at how excessive some of the arguments were. In 2000 the excesses were XML and Object Oriented Programming (OOP).

It was OOP then, it’s containers now, but let’s consider an interesting possibility for the future.

During the 1990s, as the mania built around Objects For Everything, there was a major focus on solving the problem of object relational impedance mismatch. Getting rows of data from a SQL database, and then transforming those rows into objects, seemed like a flaw in the system. SQL was a dust mote in the eye of God, an unholy ugliness that needed to be abolished. SQL was not object oriented, therefore it needed to be destroyed and replaced by… uh, what exactly?

In a big company, you might have many teams, some using Java, and others using C++. Since SQL is a universal database language, both teams could read and write to the database, so long as their code understood SQL. But that means, when getting rid of SQL and replacing it with pure object databases, some new problems had to be confronted. If you want to write a Java object to the database, you need to serialize it first. That’s easy, but if you want the same universal quality as SQL, you need to serialize the Java object in a way that the C++ code can easily read and write to it. And vice versa, the Java code needs to be able to read and write to the serialized C++ objects. And if you’re selling data to 3rd parties, you need a system of serialization that will support all object oriented languages, and can work in a self-describing manner so that code that knows nothing about your code can still figure out how to search and read your serialized objects.

This led to the Web Service Specification, one of the great mistakes in the history of the tech industry. After the fever finally broke, and people gave up on the dream of Objects For Everything, a reaction set in. As James Lewis and Martin Fowler said:

…a reaction away from central standards that have reached a complexity that is, frankly, breathtaking. (Any time you need an ontology to manage your ontologies you know you are in deep trouble.)

Yes, the Web Service Specification drove Martin Fowler to complain of “a complexity that is, frankly, breathtaking”, and if it befuddles someone as great as Fowler, then how are we mere mortals supposed to understand it?

But the interesting thing is what happened next. People gave up on Objects For Everything and re-thought the problem. Object databases failed because every language had to come up with an object serialization that could be understood by every other language. But what if a language had a reasonable “plain text literal” format that could be used as a kind of intermediate universal serialization language? What if we could describe a User object like this:

{
 "_id" : "f9323nvhg829384",
 "name" : "Lawrence Krubner",
 "street" : "254 W 98th St",
 "apartment" : "6A",
 "city" : "New York",
 "state" : "NY",
 "phone" : "434 825 7694"
}

In other words, what if we all used JSON for serialization, and therefore, when in doubt, we could fall back to JavaScript rules regarding reading and writing and querying? (Huge credit to del.icio.us for coming up with the first JSON API, back in 2004.) Then each language only needed to serialize its data to and from JSON, and other languages could decide how they were going to consume that JSON.
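As a minimal sketch of what “serialize to and from JSON” looks like from one language’s side, here is the User object above round-tripped in Python. The `User` class and the helper names `to_json` and `from_json` are hypothetical, invented for this illustration; the only real API used is Python’s standard `json` and `dataclasses` modules. A Java or C++ program would do the equivalent with its own JSON library, with no knowledge of this Python class.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    """Hypothetical in-memory object mirroring the JSON document above."""
    _id: str
    name: str
    street: str
    apartment: str
    city: str
    state: str
    phone: str


def to_json(user: User) -> str:
    """Flatten the object into plain text that any language can parse."""
    return json.dumps(asdict(user))


def from_json(text: str) -> User:
    """Rebuild the object from the plain-text form."""
    return User(**json.loads(text))


original = User(
    _id="f9323nvhg829384",
    name="Lawrence Krubner",
    street="254 W 98th St",
    apartment="6A",
    city="New York",
    state="NY",
    phone="434 825 7694",
)

wire_text = to_json(original)      # what would be stored or sent
restored = from_json(wire_text)    # what a consumer reconstructs

print(wire_text)
print(restored == original)
```

The consumer only has to agree on the field names, not on the class definition, which is the simplification the paragraph above describes.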

This was a big change for the industry. Dave Winer, who had a history of being wrong about things, acted like JSON was going to destroy the tech industry. He shouted “IT’S NOT EVEN XML!”. It’s worth reading his full reaction to get a sense of how certain people were shocked by JSON’s simplicity.

This idea was one of the starting points for NoSQL databases. And some of these databases, especially the document stores such as MongoDB and CouchDB, have answered some of the goals that people had initially hoped would be solved by pure object databases. Java and C++ can write to MongoDB. And yes, we gave up on some of the goals, such as a self-describing serialization format, because that seemed hopelessly complex. Instead we standardized around an API style that is sometimes called RESTful (though others insist it is still RPC).

We might enjoy a similar conceptual simplification once the mania for Containers For Everything is dead. There are some interesting ideas in this movement, though they are currently being dealt with through more and more layers of complexity, in a replay of the mistakes that led to the Web Service Specification. What is needed, instead, is a re-thinking of the problem at the level of basic concepts. One of the more interesting ideas now associated with containers is separating compute from all other aspects of computer activity, such that functions can exist as pure entities floating in the cloud. AWS Lambda is currently the nearest thing we have to seeing this ideal come to life. But there might be other approaches that work with fewer moving parts. Consider the argument that invoking a process on another machine should be exactly the same as invoking a process on one’s local machine. This idea has been under discussion since the 1980s, and perhaps should be re-examined as the right way forward for the tech industry. As my essay is already too long, I won’t waste any more words on the idea, but for those of you who are interested, start by reading the Wikipedia page about RINA. (To be clear, RINA is fairly new, but it grew from a critique of the Internet that’s been percolating for decades.)

Thank you for reading.

For anyone interested in the previous conversations, here are the 3 essays and a partial list of places they were discussed:

Why would anyone choose Docker over uber binaries? (2017)

Lobste.rs

Hacker News

Docker protects a programming paradigm that we should get rid of (2018)

Lobste.rs

Hacker News

Docker is the dangerous gamble which we will regret (2018)

Hacker News

[ Style note: inspiration for some of the humor in the dialogue owes a debt to Pete Lacey’s old classic about SOAP. ]

Off-topic: I host a once-a-month party that is mostly a tech event. Do you live in New York City? Would you like to be invited? Contact me via LinkedIn.
