My idea for a startup that standardizes the import of data through ML and NLP scripts

(Written by Lawrence Krubner; indented passages are often quotes.)

Suppose I were to quiz you about this paragraph:

“In December 2012, MoonEx acquired one of the other Google Lunar X-Prize teams, Rocket City Space Pioneers, from Dynetics for an undisclosed sum. The new agreement makes Tim Pickens, the former lead of the RCSP team, the Chief Propulsion Engineer for MoonEx.”

Proper nouns which NLP can discover:

1.) MoonEx

2.) Google

3.) Rocket City Space Pioneers

4.) Dynetics

5.) Tim Pickens

6.) RCSP

7.) Chief Propulsion Engineer

Suppose I were to ask you, “Which of these nouns is the company that was acquired?”

You would probably answer #3, which is Rocket City Space Pioneers.

This is an easy question for humans, but not so easy for computers. Even with recent advances in Natural Language Processing (NLP), resolving a paragraph in which many nouns refer to each other is still a very tricky problem.

There will always be a need for some human intervention with NLP. Accuracy might reach 95% or 96% or 97%, but there will always be some errors, and dealing with those errors needs to be done in a productive way. Even if the (unlikely) day comes when NLP models are 100% accurate, there will still be the need to develop a training set, and that part always needs to be done by humans. It is tedious, it is slow, and it is unproductive. It is also increasingly important. Therefore, this is a big opportunity.

What can be automated regarding the paragraph about MoonEx? The detection of Proper Nouns. They can be pulled out and put into a form and presented to a human, who then only needs to click on #3 (assuming you are building a model that is looking to find companies that have been acquired). Such an interface would automate the construction of training data.
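The workflow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the capitalized-word regex is a crude stand-in for a real NLP proper-noun detector, so its list will differ somewhat from the hand-made list above, and the names `candidate_nouns`, `render_form`, and `record_click` are hypothetical, not part of any existing tool.

```python
import re

PARAGRAPH = (
    "In December 2012, MoonEx acquired one of the other Google Lunar X-Prize "
    "teams, Rocket City Space Pioneers, from Dynetics for an undisclosed sum. "
    "The new agreement makes Tim Pickens, the former lead of the RCSP team, "
    "the Chief Propulsion Engineer for MoonEx."
)

# ASSUMPTION: a run of capitalized words approximates a proper noun.
# A real system would use a trained NER model instead of this regex.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][A-Za-z-]*(?:\s+[A-Z][A-Za-z-]*)*")

# Words that are capitalized only because they start a sentence.
SENTENCE_STARTERS = {"In", "The"}


def candidate_nouns(text):
    """Return unique capitalized phrases in order of first appearance."""
    candidates = []
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group(0).split()
        while words and words[0] in SENTENCE_STARTERS:
            words.pop(0)
        phrase = " ".join(words)
        if phrase and phrase not in candidates:
            candidates.append(phrase)
    return candidates


def render_form(question, candidates):
    """Build the numbered list that a human reviewer would click on."""
    lines = [question]
    lines += [f"{i}.) {c}" for i, c in enumerate(candidates, start=1)]
    return "\n".join(lines)


def record_click(text, candidates, clicked_number):
    """Turn one click (1-based) into one row of training data."""
    return {
        "text": text,
        "candidates": candidates,
        "label": candidates[clicked_number - 1],
    }


candidates = candidate_nouns(PARAGRAPH)
print(render_form("Which of these is the company that was acquired?", candidates))
```

The reviewer's single click becomes a labeled example via `record_click`, which is the part that replaces the copying-and-pasting into spreadsheets.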

More broadly, there will always need to be interaction between humans and NLP scripts, and automating the interface will bring important productivity gains to an increasingly important task.

I have seen talented people waste many weeks copying and pasting text into spreadsheets that then serve as the starting point for the training data for NLP models. Anything that makes the process easier is a big win for the company, and if the task can be structured such that less-skilled people can do it, then the productivity gains become huge.

Because of its innovative interface, Slack took over the market for corporate chat. Likewise, the right interface for training and resolving NLP data can transform this work, from a task that is tiresome and expensive and bespoke, into something that is streamlined and inexpensive and ubiquitous.

I see the basis of a startup.

The above is the summary of my idea.

You can stop reading now.

For those interested, below is a much longer argument for the same idea.


For many years I’ve worked with startups involved in data mining, so I’ve gotten to know how the current crop of these firms operates. The companies that I’ve worked with rely on Machine Learning (ML) and Natural Language Processing (NLP).

There is no ML/NLP process that delivers the right results 100% of the time, so most companies find they need some human review of the results of the ML/NLP software. The human review requires two levels of skill:

1. people without skill, found via a service such as Mechanical Turk

2. people with skill, most often staff that the company has hired and trained

For the sake of productivity, as much as possible you want to rely on software rather than humans, and when you must use humans, you want to rely on cheap, unskilled labor, rather than expensive, trained workers.

There is a great deal of redundancy in the current efforts to find the right balance of ML/NLP and humans. Each company stumbles through a painful process of trying to figure out where it should use humans to patch the failures of its ML/NLP techniques.

After months or years of failure, most companies eventually come to some version of this process:

1. run the data through various scripts and generate a confidence estimate. Is this the data we are looking for?

2. if step 1 generates a strong “yes” or “no” then the data can be confidently accepted or rejected. If accepted, it’s added to the database of items that people in the company actually use.

3. if step 1 fails to generate a strong “yes” or “no” then the data should be sent to Mechanical Turk. Since no one person on Mechanical Turk can be trusted to make a judgement like this, the data should be sent to 5 people. If all 5 can agree on the data, then we can accept their judgement as a strong “yes” or “no”.

4. if there is no agreement among the people on Mechanical Turk, then the data needs to be sent to staff for review. These staff might work remotely, but they will have some training and experience, so they can exercise much more judgement in rendering a decision (compared to people on Mechanical Turk). Whatever they decide results in a strong “yes” or “no”.
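The four steps above amount to an escalation ladder, which can be sketched roughly as follows. The thresholds (0.9 and 0.1) and the three callables `score_fn`, `turk_fn`, and `staff_fn` are assumptions made for illustration; a real company would plug in its own scripts, its own Mechanical Turk integration, and its own staff review queue.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


# ASSUMPTION: illustrative cutoffs for a "strong" yes or no from the scripts.
ACCEPT_ABOVE = 0.9
REJECT_BELOW = 0.1
TURK_VOTES = 5  # the text suggests sending each item to 5 people


def triage(item, score_fn, turk_fn, staff_fn):
    """Return (decision, decided_by) for one piece of data.

    score_fn(item) -> float in [0, 1], the confidence from the ML/NLP scripts
    turk_fn(item)  -> bool, one crowd worker's yes/no vote
    staff_fn(item) -> bool, a trained staff member's final yes/no
    """
    # Steps 1 and 2: trust the scripts when they are confident.
    confidence = score_fn(item)
    if confidence >= ACCEPT_ABOVE:
        return Decision.ACCEPT, "script"
    if confidence <= REJECT_BELOW:
        return Decision.REJECT, "script"

    # Step 3: ask several crowd workers and require unanimity.
    votes = [turk_fn(item) for _ in range(TURK_VOTES)]
    if all(votes):
        return Decision.ACCEPT, "turk"
    if not any(votes):
        return Decision.REJECT, "turk"

    # Step 4: no agreement, so escalate to trained staff.
    decision = Decision.ACCEPT if staff_fn(item) else Decision.REJECT
    return decision, "staff"
```

Recording who made each decision (`decided_by`) makes it possible to measure how much work reaches the expensive tiers, which is the number a company would want to drive down.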

There are many startups that offer ML and NLP services. These services are basically nice graphical interfaces over the tools that power them. They can make data analysis easier for inexperienced workers. However, they make no effort to automate the process of vetting; so far, no startup has attempted to standardize this process. For now, each company builds a bespoke system: bespoke data import tools, bespoke ML/NLP analysis, bespoke rejection filters, bespoke escalation filters, bespoke dashboards and interfaces, bespoke database designs. The data analysis offered by such tools is just the beginning of a long process. It is time someone automated the rest of the process.


Let’s start by talking about one particular industry, which is firms that sell data about privately-held companies. There is a great conversation on Quora that summarizes the strengths and weaknesses of most of the enterprises in this field:

“How do CB Insights, PrivCo, DataFox, Owler, Tracxn!, Mattermark, and Venture Scanner compare for private company research?”

Danielle Morrill has written a great blog post about how and why she created Mattermark (“The Deal Intelligence Company”):

According to Crunchbase, she has so far raised $17 million to build out her company. Samiur Rahman, the lead data engineer at Mattermark, has given a revealing interview about how NLP helps Mattermark pinpoint data about deals that companies may be arranging:

Anyone who tries to scour the Web for information about companies, either publicly held or privately owned, immediately runs into a few problems, including the fact that any given company may have dozens of subsidiaries with similar names. The law, or a regulatory agency, might have forced the company to break up. So, for instance, Deutsche Bank has set up different companies for loans and investments. As a consequence, one finds the following names on the web:

Deutsche Bank – Corretora de Valores S.A.
Deutsche Bank A S
Deutsche Bank AG
Deutsche Bank GmbH
Deutsche Bank S.A. Banco Alemao
Deutsche Bank Trust Company Americas
Deutsche Bank Trust Company Delaware
Deutsche Bank

Journalists tend to misspell these names, or just use the generic “Deutsche Bank,” which makes it difficult to discern which incarnation of Deutsche Bank is the subject of an article.
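To make the problem concrete, here is a small sketch that maps a mention found in an article to the known names listed above. It uses fuzzy string matching from Python's standard `difflib` module, which is my own choice for illustration and not a technique any of the companies discussed here is known to use; the function name `candidate_entities` and the cutoff of 0.6 are assumptions.

```python
import difflib

# The name variants listed above.
KNOWN_ENTITIES = [
    "Deutsche Bank – Corretora de Valores S.A.",
    "Deutsche Bank A S",
    "Deutsche Bank AG",
    "Deutsche Bank GmbH",
    "Deutsche Bank S.A. Banco Alemao",
    "Deutsche Bank Trust Company Americas",
    "Deutsche Bank Trust Company Delaware",
    "Deutsche Bank",
]


def candidate_entities(mention, known=KNOWN_ENTITIES, limit=3, cutoff=0.6):
    """Return up to `limit` known names that resemble the mention, best first."""
    return difflib.get_close_matches(mention, known, n=limit, cutoff=cutoff)


# A misspelled but specific mention still finds its entity as the top match.
print(candidate_entities("Deutche Bank Trust Company Americas"))

# The generic mention matches several entities, so it stays ambiguous.
print(candidate_entities("Deutsche Bank"))
```

A misspelling is recoverable by string similarity alone, but the generic “Deutsche Bank” returns several plausible entities, and that ambiguity is exactly the kind of case that needs context from the article or a human reviewer.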

Samiur Rahman says he is a fan of “word vectors” and “paragraph vectors”, and then he uses the Nearest Neighbor algorithm to figure out which companies are similar to each other. As the interview makes clear, Mattermark got into a business where they thought they could use humans to do data aggregation; now they are desperately trying to build ML/NLP stacks to automate the work.
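For readers unfamiliar with the technique, here is a minimal sketch of nearest-neighbor search over vectors. It is not Mattermark's code: the company names and the 3-dimensional vectors are made up for illustration, whereas real word or paragraph vectors have hundreds of dimensions and are learned from text.

```python
import math

# ASSUMPTION: invented companies and toy vectors, purely for illustration.
COMPANY_VECTORS = {
    "Acme AI": [0.9, 0.1, 0.0],
    "DeepWidgets": [0.8, 0.2, 0.1],
    "Main Street Bakery": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(name, vectors, k=1):
    """Return the k companies whose vectors are most similar to `name`."""
    query = vectors[name]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in vectors.items()
        if other != name
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


print(nearest_neighbors("Acme AI", COMPANY_VECTORS))
```

This brute-force version compares the query against every other company, which is fine for a sketch; at the scale of millions of companies one would normally use an approximate nearest-neighbor index instead.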

Danielle Morrill says she initially thought about going to work as a journalist for TechCrunch, so her mindset was very much focused on human beings gathering data. Samiur Rahman discloses that they started off using standard ElasticSearch string matching for search, and only later did they realize they should try to get into ML/NLP. Apparently, they were at a very early stage in 2016. Word vectors and Nearest Neighbor are simple techniques. (Though it’s possible Rahman was dumbing down his comments for the interviewer.)

Mattermark charges $500 per user per year, with a cap at $50,000 a year for an entire firm. They keep track of 1.5 million companies worldwide. Samiur Rahman says, “If someone searches for AI companies, they should get a list of AI companies.” Mattermark only aspires to a level of search accuracy that matches Google. This is surprising, because their customers pay a lot of money for a level of accuracy that is much higher (more targeted) than what is offered by Google.

Some customers buy from several of these data-analysis companies. Some buy from both CB Insights, which has the best data about deals, and PrivCo, which has the best estimates on revenue. The customers are often sales teams hoping to find new customers for their own companies, but sometimes the data is also used for research (as at universities) and sometimes the data is used for VC investments, as Danielle Morrill made clear in her history of Mattermark. Right now there are something like 30 companies in this field (selling data about private companies), though I assume this will eventually consolidate to something like 3 or 4 companies.


A generalized process for pulling meaning from text has 3 markets that make up its fat head:

1.) enterprise sales

2.) law

3.) medicine

There is also a long tail of small scale uses, some of which I discussed above:

4.) VC investments

5.) possible corporate acquisitions

6.) academic research

7.) journalistic research (it is worth noting that Donald Trump’s businesses are privately held, and PrivCo is one of the only organizations that has good estimates of Trump’s revenue and number of employees. In 2016 PrivCo was cited by the New York Times, the Washington Post, the Wall Street Journal, and a hundred other news outlets).

Up above, I discussed the enterprise-sales applications of a hybrid approach to data aggregation. Now I’ll briefly discuss its application to the legal profession.

Lawsuits between large corporations cost tens of millions of dollars; the largest cost is “discovery.” A lot of text has to be sifted through before the lawyers can find the sentence “While I worked at Google I downloaded all of Google’s documents for self-driving cars and now I’d like Uber to finance my new self-driving car startup”. It is very much like looking for a needle in a haystack. And discovery is still done by humans. Ten years ago I had several friends graduate from law school. Most of them could not get traditional jobs in law firms, but they did pick up a lot of lucrative work doing legal reading for law firms doing discovery. They would be paid something like $100 an hour, and they would work 12-hour days, locked in a secure room with millions of documents. The law firms would hire an army of these out-of-work law school graduates to read through the documents and try to find the incriminating sentences that might be lost amid the endless boring emails about ordinary corporate activities. Trial discovery is an area that would benefit from a hybrid approach of ML/NLP and humans.


To summarize, my customers would be every group that:

1.) needs the output of ML/NLP scripts…

2.) …applied to a data source they specify…

3.) …vetted by humans to a high level of confidence…

4.) …through an interface that makes the humans productive

The market for ML and AI startups might seem crowded, but I think there are still many opportunities there, in particular in standardizing the process whereby humans review the output of ML and NLP scripts.

I have seen talented people waste many weeks copying and pasting text into spreadsheets that then serve as the starting point for the training data for NLP models. Anything that makes the process easier is a big win for the company, and if the task can be structured such that less-skilled people can do it, then the productivity gains become huge.
