The surprisingly difficult task of data importing

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

The level of difficulty is a surprise to me:

    Second, it was clear that reliable data loads would require deep support from the data pipeline. If we captured all the structure we needed, we could make Hadoop data loads fully automatic, so that no manual effort was expended adding new data sources or handling schema changes—data would just magically appear in HDFS and Hive tables would automatically be generated for new data sources with the appropriate columns.

    Third, we still had very low data coverage. That is, if you looked at the overall percentage of the data LinkedIn had that was available in Hadoop, it was still very incomplete. And getting to completion was not going to be easy given the amount of effort required to operationalize each new data source.
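That phrase "Hive tables would automatically be generated" is worth making concrete. If the schema travels with the data, the DDL can be generated rather than written by hand for each new source. Here is a rough Python sketch of the idea; the function, table, and column names are mine, not anything from LinkedIn's actual tooling:

# Rough sketch: given a captured schema, generate the Hive DDL for a new
# data source automatically. Names and storage format are illustrative.

def hive_ddl(source_name, columns, hdfs_root="/data"):
    """Build a CREATE EXTERNAL TABLE statement from (name, hive_type) pairs."""
    col_defs = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {source_name} (\n"
        f"  {col_defs}\n"
        f")\nSTORED AS SEQUENCEFILE\nLOCATION '{hdfs_root}/{source_name}';"
    )

print(hive_ddl("page_views", [("member_id", "BIGINT"),
                              ("url", "STRING"),
                              ("viewed_at", "TIMESTAMP")]))

Once the schema is captured up front like this, adding a source becomes a registration step rather than an engineering project.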

    The way we had been proceeding, building out custom data loads for each data source and destination, was clearly infeasible. We had dozens of data systems and data repositories. Connecting all of these would have led to building custom piping between each pair of systems, something like this:

    Note that data often flows in both directions, as many systems (databases, Hadoop) are both sources and destinations for data transfer. This meant we would end up building two pipelines per system: one to get data in and one to get data out.

    This clearly would take an army of people to build and would never be operable. As we approached full connectivity we would end up with something like O(N²) pipelines.
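To make that O(N²) point concrete, here is the arithmetic as a small Python sketch (the function names are mine; the article just gestures at the number):

# Back-of-the-envelope arithmetic for the two architectures.
# Point-to-point: a dedicated pipeline for every ordered pair of systems.
# Central pipeline: each system connects once in and once out.

def point_to_point_pipelines(n_systems):
    return n_systems * (n_systems - 1)   # one per ordered pair, roughly O(N^2)

def central_pipeline_connections(n_systems):
    return 2 * n_systems                 # one in, one out, per system

for n in (5, 10, 20, 50):
    print(n, point_to_point_pipelines(n), central_pipeline_connections(n))

At five systems the difference is merely annoying; at fifty it is 2,450 pipelines versus 100 connections, which is the difference between a team and an army.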

    Instead, we needed something generic like this:

    As much as possible, we needed to isolate each consumer from the source of the data. They should ideally integrate with just a single data repository that would give them access to everything.
    The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of each consumer of data.
    This experience led me to focus on building Kafka to combine what we had seen in messaging systems with the log concept popular in databases and distributed system internals. We wanted something to act as a central pipeline first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, etc.
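The mechanics of that isolation are easy to sketch: everything appends to one log, and every downstream system keeps its own read position, so producers never know or care who is consuming. A toy in-memory version, nothing like the real implementation:

# Toy version of the central-log idea: producers append, and each consumer
# tracks its own offset, so adding a consumer touches nothing upstream.

class CentralLog:
    def __init__(self):
        self.records = []   # append-only sequence of records
        self.offsets = {}   # consumer name -> next index to read

    def append(self, record):
        self.records.append(record)

    def poll(self, consumer):
        """Return every record this consumer has not yet seen."""
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.records)
        return self.records[start:]

log = CentralLog()
log.append({"source": "profile_db", "event": "member_updated", "id": 42})
log.append({"source": "web", "event": "page_view", "id": 42})

print(log.poll("hadoop_loader"))    # sees both records
log.append({"source": "web", "event": "ad_click", "id": 7})
print(log.poll("hadoop_loader"))    # only the new record
print(log.poll("search_indexer"))   # a late consumer starts from the beginning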

    For a long time, Kafka was a little unique (some would say odd) as an infrastructure product—neither a database nor a log file collection system nor a traditional messaging system. But recently Amazon has offered a service that is very very similar to Kafka called Kinesis. The similarity goes right down to the way partitioning is handled, data is retained, and the fairly odd split in the Kafka API between high- and low-level consumers. I was pretty happy about this. A sign you’ve created a good infrastructure abstraction is that AWS offers it as a service! Their vision for this seems to be exactly similar to what I am describing: it is the piping that connects all their distributed systems—DynamoDB, RedShift, S3, etc.—as well as the basis for distributed stream processing using EC2.
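The partitioning similarity he mentions is worth spelling out: both Kafka and Kinesis hash a record's key to pick a partition (Kinesis calls it a shard), which is what preserves ordering per key. A rough sketch of the idea, not the exact hash either system uses:

# Key-based partitioning: the same key always hashes to the same partition,
# so all events for that key stay in order. Illustrative only; neither
# Kafka nor Kinesis uses exactly this hash.

import hashlib

def partition_for(key, num_partitions):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

NUM_PARTITIONS = 8
for key in ("member-17", "member-17", "member-99"):
    print(key, "->", partition_for(key, NUM_PARTITIONS))

Because "member-17" lands on the same partition every time, its events stay in order without any coordination between producers.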

I wish I were still working for LaunchOpen, as this would be a crucial warning to them: writing importers for all the different software packages in use in K-12 education is going to require a simplifying approach like this to be successful. Thankfully, they were already using MongoDB as a centralized log.
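If MongoDB is going to play that role, a capped collection is the closest native fit: it preserves insertion order and ages out old records by size. A sketch with pymongo; the connection string, database, collection, and field names here are all invented for illustration:

# Hedged sketch of "MongoDB as a centralized log" using a capped collection.
# All names (database, collection, fields) are made up for illustration.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["school_data"]

# Capped collections preserve insertion order and cap total size in bytes.
if "import_log" not in db.list_collection_names():
    db.create_collection("import_log", capped=True, size=16 * 1024 * 1024)

db.import_log.insert_one({
    "source": "gradebook_vendor_x",   # hypothetical upstream package
    "event": "roster_updated",
    "payload": {"student_id": "s-123", "course": "algebra-1"},
})

# Downstream importers read the log in insertion ($natural) order.
for record in db.import_log.find().sort("$natural", 1):
    print(record["source"], record["event"])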

Post external references

  1. http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  2. http://launchopen.com/