November 7th, 2014
(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at: firstname.lastname@example.org
I joined LinkedIn about seven years ago. At the time I joined, the company was just starting to run into scaling problems in its core database, its social graph system, its search engine, and its data warehouse. Each of the systems that stored data couldn’t keep pace with the growing user base. At the time, each of our data systems ran on a single server. This left us having to continually buy bigger and bigger machines as we grew. So the task I ended up focusing on for much of my time at LinkedIn was helping to transition to distributed data systems so we could scale our infrastructure horizontally and incrementally. We built a distributed key-value store to help scale our data layer, we built a distributed search and social graph system too, and we brought in Hadoop to scale our data warehouse. And yes, all of these systems had names just as unusual as Kafka (e.g. Voldemort, Azkaban, Camus, and Samza).
This transition was both powerful and painful. It was powerful because beyond just keeping our website alive as we grew, moving to this type of horizontally distributed system allowed us to really make use of the data about the professional space that LinkedIn had collected. LinkedIn has always had amazing data on the people and companies in the professional world and how they are all connected and inter-related, but finally being able to apply arbitrary amounts of computational power to this data using frameworks like Hadoop let us build really cool products.
But adopting these systems was also painful. Each system had to be populated with data, and the value it could provide was really only possible if that data was fresh and accurate.
This was how I found myself understanding how data flows through organizations, and a little bit about what people in the data warehouse world call ETL. The current state of this kind of data flow is a scary thing. It is often built out of CSV file dumps, rsync, and duct tape, and it’s in no way ready to scale with the big distributed systems that are increasingly available for companies to adopt.