Big Data solves cancer

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at:


IN MAY last year, a supercomputer in San Jose, California, read 100,000 research papers in 2 hours. It found completely new biology hidden in the data. Called KnIT, the computer is one of a handful of systems pushing back the frontiers of knowledge without human help.

KnIT didn’t read the papers like a scientist – that would have taken a lifetime. Instead, it scanned for information on a protein called p53, and a class of enzymes that can interact with it, called kinases. Also known as “the guardian of the genome”, p53 suppresses tumours in humans. KnIT trawled the literature searching for links that imply undiscovered p53 kinases, which could provide routes to new cancer drugs.

Having analysed papers up until 2003, KnIT identified seven of the nine kinases discovered over the subsequent 10 years. More importantly, it also found what appeared to be two p53 kinases unknown to science. Initial lab tests confirmed the findings, although the team wants to repeat the experiment to be sure.
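The test described above is a retrospective check: hide everything published after a cutoff, let the system predict, then count how many later discoveries it recovered. A minimal sketch of that arithmetic follows. The kinase names are placeholders, not the real proteins, and the size of the prediction list is an assumption, since the article does not say how many candidates KnIT proposed in total.

```python
def recall(predicted, discovered_later):
    """Fraction of the later discoveries that appear among the predictions."""
    hits = predicted & discovered_later
    return len(hits) / len(discovered_later)

# Placeholder names (hypothetical): nine kinases found in the 10 years after the cutoff.
discovered_later = {f"kinase_{i}" for i in range(1, 10)}

# Assumed prediction list: seven of those nine, plus two candidates unknown to science.
predicted = {f"kinase_{i}" for i in range(1, 8)} | {"candidate_A", "candidate_B"}

score = recall(predicted, discovered_later)
novel = predicted - discovered_later

print(f"recall: {score:.0%}")          # 7 of 9
print("novel candidates:", sorted(novel))
```

The recall figure only says how well the system recovers what humans later found. The two novel candidates are the interesting part, and they still need lab confirmation.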

KnIT is a collaboration between IBM and Baylor College of Medicine in Houston, Texas. It is the latest step into a weird world where autonomous machines make discoveries that are beyond scientists, simply by rifling more thoroughly through what we already know, and faster than any human can.

In a paper to be presented at the Conference on Knowledge Discovery and Data Mining in New York City this week, the researchers say that society is better at generating new information than at analysing what it already has. “This leads to deep inefficiencies in translating research into progress for humanity,” they write. KnIT aims to iron out that inefficiency.

“In general, new p53 kinases are discovered at a rate of one per year,” says Olivier Lichtarge, who leads the work at Baylor. “We hope to greatly accelerate that rate of discovery.”

Studying kinases is important for cancer research, but the Baylor team thinks the approach can extend beyond biomedical studies to all areas of science. And if KnIT’s algorithmic discoveries hold up, they point to a future in which everyone could have a personalised algorithm trawling and making sense of the scientific literature to figure out cures for their ailments, including ones tailored at a genetic level.

Expanding KnIT to other areas of biology or the physical sciences isn’t straightforward. “We could run into big problems when we try and generalise to more proteins and genes,” Lichtarge says. And in subjects like physics, results tend to be presented using equations and graphs rather than words. However, data-mining groups are working to retrieve information from these too.

The idea that new knowledge can be unearthed by finding links between disparate strands of research was first crystallised in 1986 by information scientist Don Swanson at the University of Chicago. He analysed a database of scientific literature manually to deduce that fish oil might be a good treatment for Raynaud’s syndrome, a circulatory disorder, because studies showed that fish oil could reverse certain conditions also seen in Raynaud’s. His hunch turned out to be right.
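Swanson's reasoning can be read as a simple pattern: if A is linked to B in one set of papers, and B is linked to C in another, then A and C may be related even though no paper connects them directly. The sketch below shows that pattern on a toy graph. The term pairs are illustrative stand-ins chosen for this example, not data from a real literature database, and the function names are hypothetical. It is not the method KnIT uses.

```python
from collections import defaultdict

# Toy data (illustrative only): each pair means "these two terms
# are discussed together in at least one paper".
LINKS = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
    ("aspirin", "platelet aggregation"),
    ("aspirin", "headache"),
]

def build_graph(links):
    """Build an undirected graph: term -> set of directly linked terms."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def indirect_links(graph, start):
    """Return {target: bridging terms} for terms two steps from `start`
    that have no direct link to it."""
    found = defaultdict(set)
    for bridge in graph[start]:
        for target in graph[bridge]:
            if target != start and target not in graph[start]:
                found[target].add(bridge)
    return dict(found)

graph = build_graph(LINKS)
candidates = indirect_links(graph, "fish oil")

# Rank candidates by how many independent bridges support them.
for target, bridges in sorted(candidates.items(), key=lambda kv: -len(kv[1])):
    print(target, "via", sorted(bridges))
```

Counting bridges is a crude ranking, but it captures why the fish oil hunch stood out: several separate intermediate findings pointed at the same untested connection.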

Modern science has given us a far larger and more intricate haystack than the one Swanson picked through, but machine intelligence is now sorting through it to find new connections.