My idea for a startup that standardizes the import of data through ML and NLP scripts

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at:

Two scenarios:

1. Imagine you are a lawyer working for a corporation, and you realize that during 2016 another big corporation stole your best technology. So now you are initiating a major lawsuit. In the discovery phase of the lawsuit, you demand all emails from the year 2016, and the judge agrees with your request, so the enemy corporation sends you 5 million emails — all of its internal communications from 2016. Your company’s great technology is called AmazeTech, which you announced in June of 2016. When you search the emails for instances of “AmazeTech” you find thousands of instances, but sadly they all occur after June, and so the enemy corporation can claim that any mention of AmazeTech was simply in reaction to the public announcement. But you are certain they are guilty. You need to find the places where they talked about the new technology, before it had a name. What sort of sentences should you be looking for? What sort of sentences suggest criminal intent? What words would people use to describe AmazeTech, before it had a name? You can’t read all 5 million documents, so you need to find a way to narrow the search to those documents that will reveal infringement.

2. Imagine you run a company that uses software bots to scan the web for information about private companies. Your software bots find an article that mentions “vera security”. In your database, you are already tracking three companies: Vera, a real estate investment trust (REIT); Vera Security, Inc, a firm which rents bodyguards; and Vera Inc, a computer security firm. Are any of these companies a match for the article that you just found? There is no automated method that will allow you to be 100% certain, but suppose you had a way to be 95% certain?

In both cases, you can maximize productivity by automating some, but not all, of the discovery process. Since I’ve seen this pattern repeat at several companies, I’ve become convinced that someone should start a company that standardizes the process, and sells the knowledge back to the companies that need it. After all, not every company, or law firm, has the know-how to automate this work, but one company that specialized in this task could learn to do it well.

What follows is my argument for why such a company should exist.


For many years I’ve worked with startups involved in data mining, so I’ve gotten to know how the current crop of these firms operates. The companies that I’ve worked with rely on Machine Learning (ML) and Natural Language Processing (NLP).

There is no ML/NLP process that delivers the right results 100% of the time, so most companies find they need some human review of the results of the ML/NLP software. The human review requires two levels of skill:

1. people without skill, found via a service such as Mechanical Turk

2. people with skill, most often staff that the company has hired and trained

For the sake of productivity, as much as possible you want to rely on software rather than humans, and when you must use humans, you want to rely on cheap, unskilled labor, rather than expensive, trained workers.

There is a great deal of redundancy in the current efforts to find the right balance of ML/NLP and humans. Each company stumbles through a painful process of trying to figure out where it should use humans to patch the failures of its ML/NLP techniques.

After months or years of failure, most companies eventually come to some version of this process:

1. run the data through various scripts and generate a confidence estimate. Is this the data we are looking for?

2. if step 1 generates a strong “yes” or “no” then the data can be confidently accepted or rejected. If accepted, it’s added to the database of items that people in the company actually use.

3. if step 1 fails to generate a strong “yes” or “no” then the data should be sent to Mechanical Turk. Since no one person on Mechanical Turk can be trusted to make a judgement like this, the data should be sent to 5 people. If all 5 can agree on the data, then we can accept their judgement as a strong “yes” or “no”.

4. if there is no agreement among the people on Mechanical Turk, then the data needs to be sent to staff for review. These staff might work remotely, but they will have some training and experience, so they can exercise much more judgement in rendering a decision (compared to people on Mechanical Turk). Whatever they decide results in a strong “yes” or “no”.
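The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a description of any particular company's system: the threshold values, the `Decision` type, and the `crowd_votes` and `staff_review` callbacks are all assumptions made up for the example. In a real system the callbacks would post tasks to Mechanical Turk and to an internal review queue.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical thresholds; a real system would tune these per data source.
ACCEPT_THRESHOLD = 0.95
REJECT_THRESHOLD = 0.05
CROWD_SIZE = 5  # the number of Mechanical Turk workers named in step 3


@dataclass
class Decision:
    accepted: bool
    decided_by: str  # "model", "crowd", or "staff"


def triage(
    confidence: float,
    crowd_votes: Callable[[], List[bool]],
    staff_review: Callable[[], bool],
) -> Decision:
    """Route one item through the model, then the crowd, then staff."""
    # Steps 1 and 2: a strong "yes" or "no" from the scripts is final.
    if confidence >= ACCEPT_THRESHOLD:
        return Decision(True, "model")
    if confidence <= REJECT_THRESHOLD:
        return Decision(False, "model")

    # Step 3: ask the crowd, and trust it only when all 5 workers agree.
    votes = crowd_votes()
    if len(votes) == CROWD_SIZE and len(set(votes)) == 1:
        return Decision(votes[0], "crowd")

    # Step 4: no agreement, so trained staff make the final call.
    return Decision(staff_review(), "staff")
```

Because the crowd and staff callbacks are only invoked when the cheaper step fails, the expensive human review is paid for only on the items that actually need it.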

I’ve seen multiple companies build similar processes, which is why I think this should be a business of its own. Most companies go through a painful process of self-education before they figure out how to build such a system. Often the process is delayed by the fact that no one person is fully devoted to figuring out the best strategy. A startup that focused on this task would have the advantage of a singular focus.

There are many startups that offer ML and NLP services. These services are basically nice graphical interfaces over the tools that power them. They can make data analysis easier for inexperienced workers. However, they make no effort to automate the process of vetting; so far, no startup has attempted to standardize this process. For now, each company builds a bespoke system: bespoke data import tools, bespoke ML/NLP analysis, bespoke rejection filters, bespoke escalation filters, bespoke dashboards and interfaces, bespoke database designs. The data analysis offered by such tools is just the beginning of a long process. It is time someone automated the rest of the process.

This is the end of the summary.

You don’t need to read anything else.

All the same, if you’d like to hear more of my experiences in this area, I post them below.


Let’s start by talking about one particular industry, which is firms that sell data about privately-held companies. There is a great conversation on Quora that summarizes the strengths and weaknesses of most of the enterprises in this field:

“How do CB Insights, PrivCo, DataFox, Owler, Tracxn!, Mattermark, and Venture Scanner compare for private company research?”

Danielle Morrill has written a great blog post about how and why she created Mattermark (“The Deal Intelligence Company”):

According to Crunchbase, she has so far raised $17 million to build out her company. Samiur Rahman, the lead data engineer at Mattermark, has given a revealing interview about how NLP helps Mattermark pinpoint data about deals that companies may be arranging:

Anyone who tries to scour the Web for information about companies, either publicly held or privately owned, immediately runs into a few problems, including the fact that any given company may have dozens of subsidiaries with similar names. The law, or a regulatory agency, might have forced the company to break up. So, for instance, Deutsche Bank has set up different companies for loans and investments. As a consequence, one finds the following names on the web:

Deutsche Bank – Corretora de Valores S.A.
Deutsche Bank A S
Deutsche Bank AG
Deutsche Bank GmbH
Deutsche Bank S.A. Banco Alemao
Deutsche Bank Trust Company Americas
Deutsche Bank Trust Company Delaware
Deutsche Bank

Journalists tend to misspell these names, or just use the generic “Deutsche Bank,” which makes it difficult to discern which incarnation of Deutsche Bank is the subject of an article.
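To make the ambiguity concrete, here is a rough Python sketch that lists which of the names above are consistent with a mention found in an article. The token-subset rule and the function names are my own simplifications for illustration. The sketch does nothing about misspellings, which would need fuzzy matching on top.

```python
import re
from typing import List

# The names listed above, exactly as they appear on the web.
KNOWN_ENTITIES = [
    "Deutsche Bank – Corretora de Valores S.A.",
    "Deutsche Bank A S",
    "Deutsche Bank AG",
    "Deutsche Bank GmbH",
    "Deutsche Bank S.A. Banco Alemao",
    "Deutsche Bank Trust Company Americas",
    "Deutsche Bank Trust Company Delaware",
    "Deutsche Bank",
]


def tokens(name: str) -> List[str]:
    """Lowercase the name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def candidates(mention: str, entities: List[str]) -> List[str]:
    """Return every entity whose tokens include all tokens of the mention."""
    wanted = set(tokens(mention))
    return [e for e in entities if wanted <= set(tokens(e))]


# The generic mention matches every entry, so it needs human review.
print(len(candidates("Deutsche Bank", KNOWN_ENTITIES)))
# A more specific mention narrows the field to a single entry.
print(candidates("Deutsche Bank AG", KNOWN_ENTITIES))
```

A mention that returns exactly one candidate can be accepted automatically, while a mention that returns several is the kind of item that should be escalated.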

Samiur Rahman says he is a fan of “word vectors” and “paragraph vectors”, and that he uses the Nearest Neighbor algorithm to figure out which companies are similar to each other. As the interview makes clear, Mattermark got into a business where they thought they could use humans to do data aggregation; now they are desperately trying to build ML/NLP stacks to automate the work.
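For readers who have not seen the technique, here is a small Python sketch of nearest-neighbor search over vectors using cosine similarity. It is not Mattermark's code. The company names and the three-dimensional vectors are invented for the example, whereas real word or paragraph vectors would come from a trained model and have hundreds of dimensions. The sketch also assumes no vector is all zeros.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(
    query: str, vectors: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k companies most similar to `query`, best match first."""
    q = vectors[query]
    scored = [
        (name, cosine(q, vec)) for name, vec in vectors.items() if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy vectors; each dimension loosely stands for a topic.
TOY_VECTORS = {
    "ai_startup_a": [0.9, 0.1, 0.0],
    "ai_startup_b": [0.8, 0.2, 0.1],
    "security_firm": [0.1, 0.9, 0.2],
    "real_estate_firm": [0.0, 0.1, 0.9],
}

# The other AI startup ranks first, ahead of the unrelated firms.
print(nearest("ai_startup_a", TOY_VECTORS, k=1))
```

This brute-force version compares the query against every other company, which is fine for a toy example but would be replaced by an approximate index at the scale of millions of companies.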

Danielle Morrill says she initially thought about going to work as a journalist for TechCrunch, so her mindset was very much focused on human beings gathering data. Samiur Rahman discloses that they started off using standard ElasticSearch string matching for search, and only later did they realize they should try to get into ML/NLP. Apparently, they were at a very early stage in 2016. Word vectors and Nearest Neighbor are simple techniques. (Though it’s possible Rahman was dumbing down his comments for the interviewer).

Mattermark charges $500 per user per year, with a cap at $50,000 a year for an entire firm. They keep track of 1.5 million companies worldwide. Samiur Rahman says, “If someone searches for AI companies, they should get a list of AI companies.” Mattermark only aspires to a level of search accuracy that matches Google. This is surprising, because their customers pay a lot of money for a level of accuracy that is much higher (more targeted) than what is offered by Google.

Some customers buy from several of these data-analysis companies. Some buy from both CB Insights, which has the best data about deals, and PrivCo, which has the best estimates of revenue. The customers are often sales teams hoping to find new customers for their own companies, but sometimes the data is used for research (as at universities), and sometimes it is used for VC investments, as Danielle Morrill made clear in her history of Mattermark. Right now there are something like 30 companies in this field (selling data about private companies), though I assume this will eventually consolidate to something like 3 or 4 companies.


What the last year taught me is that there is a hard limit on how much ML/NLP techniques can help. Maybe someday a company will develop an article-reader that is as good as a human, but for now that remains science-fiction. So a hybrid approach, mixing ML/NLP with human intervention, is going to be the ideal solution for many years to come.

I’ll offer an example of the limits of ML/NLP. Suppose you want to find which company acquired which other company, as described in an article which your web scraping scripts have pulled in. There is currently no ML/NLP technique that can give a high level of confidence for parsing this paragraph:

“In December 2012, MoonEx acquired one of the other Google Lunar X-Prize teams, Rocket City Space Pioneers, from Dynetics for an undisclosed sum. The new agreement makes Tim Pickens, the former lead of the RCSP team, the Chief Propulsion Engineer for MoonEx.”

Proper nouns which NLP can discover:

MoonEx

Google Lunar X-Prize

Rocket City Space Pioneers

Dynetics

Tim Pickens

Chief Propulsion Engineer

What are the relationships between these proper nouns? It’s possible some day that a magically sentient Deep Neural Net will be able to parse this paragraph with a high degree of confidence, but for now this remains out of reach.


A generalized process for pulling meaning from text has 3 markets that make up its fat head:

1.) enterprise sales

2.) law

3.) medicine

There is also a long tail of small scale uses, some of which I discussed above:

4.) VC investments

5.) possible corporate acquisitions

6.) academic research

7.) journalistic research (it is worth noting that Donald Trump’s businesses are privately held, and PrivCo is one of the only organizations that has good estimates of Trump’s revenue and number of employees. In 2016 PrivCo was cited by the New York Times, the Washington Post, the Wall Street Journal, and a hundred other news outlets).

Up above, I discussed the enterprise-sales applications of a hybrid approach to data aggregation. Now I’ll briefly discuss its application to the legal profession.

Lawsuits between large corporations cost tens of millions of dollars; the largest cost is “discovery.” A lot of text has to be sifted through before the lawyers can find the sentence “While I worked at Google I downloaded all of Google’s documents for self-driving cars and now I’d like Uber to finance my new self-driving car startup”. It is very much like looking for a needle in a haystack. And discovery is still done by humans. Ten years ago I had several friends graduate from law school. Most of them could not get traditional jobs in law firms, but they did pick up a lot of lucrative work doing legal reading for law firms doing discovery. They would be paid something like $100 an hour, and they would work 12 hour days, locked in a secure room with millions of documents. The law firms would hire an army of these out-of-work law school graduates to read through and try to find the incriminating sentences that might be lost amid the endless boring emails about ordinary corporate activities. Trial discovery is an area that would benefit from a hybrid approach of ML/NLP and humans.


There are some companies that specialize in simply delivering data in raw JSON format. Because PrivCo has a strong sales team but a weak tech team, it has considered abandoning its own data-gathering activities and focusing exclusively on sales and marketing. PrivCo could buy its data from Amenity Analytics or a similar supplier, and then resell that data.

Amenity Analytics and companies like it are upstream of companies such as CB Insights and PrivCo. The startup that I plan to create would be upstream of Amenity Analytics and its peers. Whereas they use proprietary technology to gather data which they then sell, I plan to sell access to the technology, to make it easier for other entrepreneurs to launch companies like Amenity Analytics, as well as companies like CB Insights and PrivCo. The upstream market is much bigger. Whereas Amenity Analytics and its peers are full companies, with an HR department and a tech team and a marketing team, I aim to empower small teams inside of larger corporations to easily set up their own data-gathering operations. For instance, if a team of 8 lawyers (part of a legal firm that perhaps has hundreds of lawyers) needed to automate their discovery efforts, they would be able to do so.

All of the companies that I’ve described so far make their own decisions about their data sources. Amenity Analytics lists which sources on the Web it pulls data from, but it doesn’t allow you to use its technology to scour corporate emails. They look for specific kinds of data. My startup will not be focused on a specific data source. Rather, my startup leaves that decision to the customers. In this sense, the startup I’m thinking of would be more like BigML. Maybe my customers want to use the Web as a data source. Or maybe my customers are a bunch of lawyers, and they have 1 million documents that they want to examine for specific phrases.

My startup is comparable to some of the pure NLP startups. But my startup would deliver data whose meaning has already been vetted to a much higher level of confidence (for example, if my customer was looking for data about investments made by the investment wing of Deutsche Bank, my hybrid ML/NLP/human approach would have already determined if an article was about the investment wing of Deutsche Bank).


To summarize, my customers would be every group that:

1.) needs the output of ML/NLP scripts…

2.) …applied to a data source they specify…

3.) …vetted to a high level of confidence

The market for ML and AI startups might seem crowded, but I think there are still many opportunities there, in particular in standardizing the process whereby humans review the output of ML and NLP scripts.