It's like Moore's Law on steroids: The total volume of biological data worldwide, having doubled roughly every 18 months in recent years, is now doubling every three to six months. And this isn't a momentary spike, but a long-term trend that may require new ways to measure, analyze and mine biological databases.
The need to manage and extract meaningful information from this exploding mass of data is challenging the electronics business in ways that echo the frontier quality of the business IT revolution.
The biotech information explosion is the direct result of the automation of biochemistry. Robots, gene probe arrays and labs-on-a-chip are combining with large-scale computing capability to amplify established laboratory techniques while generating a raft of novel approaches. As chips that can perform 10,000 experiments in parallel roll off the fab lines, biologists, medical researchers and biotech companies are seizing the opportunity to probe new methods and avenues of research, which in turn churn out vast quantities of data.
The volume of data may be less problematic than its diversity, which defies containment in any one data analysis system. The data can include descriptions of highly complex interactive molecular systems, maps of the surface of proteins, or simply storehouses of drug experiment findings or patient records.
The groups tasked with managing it all are tapping an array of new tools being marketed by the bioinformatics divisions of such household names as IBM, Hewlett-Packard and Sun Microsystems. Also in the game are smaller startups and consulting companies, offering specialized services that run on high-performance biological data servers or supercomputers.
Sun Microsystems has a joint technology development partnership with InforSense to develop a turnkey system that will facilitate the mining of biological data to create patentable intellectual property. The system is based on open standards such as Java, XML and WSDL. InforSense, based in London, was formed in 1990 with the objective of developing open systems for accessing academic databases.
Hewlett-Packard got into the bioresearch game early. Leveraging the Alpha server technology it picked up through its acquisition of Compaq Computer Corp., which had earlier absorbed Digital Equipment Corp., HP rolled out a dedicated system called BioCluster, which was used by several academic centers engaged in the project to sequence the human genome. BioCluster uses 27 four-processor AlphaServer nodes, each with 54 Gbytes of local storage. The servers are connected by high-speed Ethernet to a central server with a 1-Tbyte storage capacity.
The marshaling of computers, databases and algorithms to digest the rapidly growing mass of biotech data is classified as bioinformatics, but the field is expanding so quickly that the definition's in flux. "It's not a single science, actually, but an aggregate of interdisciplinary areas," said Isidore Rigoutsos, who manages IBM's bioinformatics and pattern-matching effort at the T.J. Watson Research Center (Yorktown Heights, N.Y.). "In the early days, 10 years ago, people had a unified definition of bioinformatics, but they soon came to realize that this is something that needs the knowledge and expertise of people from various disciplines: computer science, obviously, and then mathematics, chemistry, biology, electrical engineering and so on. The more things people want to do in this field, the more disciplines you need to bring into it."
IBM has been in the IT business for some time, but its standard technology was not designed to handle the flood of biotech data. "The reality is that this data is accumulating at an unbelievable rate," Rigoutsos said. "Before you can even access the data, you need to store it and manage it, which means that you need to use database management systems. What is happening is that DBMSes are being used in a different context."
Re-engineering database management systems is only the first step. Once the data is accessible to practitioners in a specific area, algorithms that analyze and model data for specific purposes must be created. For example, searching vast DNA databases for a specific sequence of base pairs requires sophisticated pattern-matching algorithms. Someone else might need to search protein databases that describe the complex signaling pathways inside cells.
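To make the pattern-matching task concrete, the sketch below shows the simplest form such a search can take: a linear scan for an exact base-pair motif. It is an illustration only, not any vendor's code; the sequence and motif are invented, and production tools lean on indexed and approximate-matching algorithms far beyond this.

```python
# Minimal sketch: exact-match scan for a DNA motif in a longer sequence.
# Real bioinformatics search tools use indexed and approximate matching
# (suffix arrays, BLAST-style heuristics); this only shows the core idea.

def find_motif(sequence: str, motif: str) -> list[int]:
    """Return the start positions of every exact occurrence of `motif`."""
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        if sequence[i:i + len(motif)] == motif:
            hits.append(i)
    return hits

# Hypothetical data, for illustration only.
genome_fragment = "ATGCGTACGTTAGCATGCGTACG"
print(find_motif(genome_fragment, "GCGTACG"))   # -> [2, 16]
```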
A decade ago, the sequencing of the genome was the achievement of the day. The ability to perform the required chemical analysis emboldened biological researchers to propose ever more ambitious projects. Thus the genome project ushered in a scientific revolution that opened a potentially lucrative IT market.
Though IBM named its flagship scalable supercomputer Blue Gene, the architecture was not created to probe the human genome. "But its architecture makes it easily applicable to some of the questions we are facing in the context of computational biology," said Rigoutsos. Since Blue Gene is scalable, processor configurations can be tailored to the computational needs of specific projects. HP and Sun also offer hardware and software systems that are both scalable and configurable for individual problems. Indeed, those attributes can clinch the sale for a bioinformatics system.
But there is also a niche at the topmost level of the computer business. Biologists and biotech researchers would like to be able to perform "in silico" experiments. With enough knowledge of the biochemical workings of the cell and enough raw supercomputing power it might be possible to simulate the creation and activities of proteins, the basic hardware of life, without having to use living tissue or lab-on-a-chip technology.
Sequencing the genome was a trivial exercise in comparison.
Biologists now know how a segment of the gene sequence on a chromosome produces a family of proteins. The process begins with a deceptively simple operation: A segment of DNA is transcribed into messenger RNA, which is read by a protein-and-RNA complex called the ribosome; the ribosome then links amino acid building blocks into a corresponding linear chain called a peptide.
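A toy sketch can make that "deceptively simple" step concrete: read the coding sequence three bases at a time and emit the corresponding chain of amino acids. The codon table below covers only a handful of the 64 real codons, and the input string is invented; the cell, of course, works through messenger RNA rather than a text string.

```python
# Toy sketch of translation: read a coding sequence three bases at a time
# and emit the corresponding chain of amino acids (a peptide).
# Only a handful of codons are included here, purely for illustration.

CODON_TABLE = {
    "ATG": "Met",  # start codon
    "TTT": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "TAA": None,   # stop codon
}

def translate(dna: str) -> list[str]:
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3])
        if amino_acid is None:          # stop codon (or codon not in the toy table)
            break
        peptide.append(amino_acid)
    return peptide

print(translate("ATGTTTGGCAAATAA"))     # -> ['Met', 'Phe', 'Gly', 'Lys']
```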
What happens next is the difficult part: The peptide chain spontaneously folds into a unique three-dimensional shape that defines the protein's operation.
Simulating that physical process on a computer is no mystery; the problem can be framed in the dynamical equations of molecular and quantum mechanics. But its computational cost is so daunting that only the simplest proteins are now within reach of the world's top supercomputers.
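In practice, folding simulations approximate that physics with classical molecular-dynamics force fields rather than solving the quantum equations outright. The fragment below is a bare-bones sketch of the time-stepping such codes perform: a velocity Verlet integrator for a single particle in a harmonic well, with every constant invented for illustration. Real codes track thousands of atoms under far richer force terms.

```python
import numpy as np

# Bare-bones velocity Verlet integrator: one particle in a harmonic well,
# standing in for the force-field time-stepping that molecular-dynamics
# codes apply to every atom in a protein. All constants are illustrative.

def force(x, k=1.0):
    return -k * x                            # Hooke's-law force from a harmonic potential

def velocity_verlet(x, v, dt=0.01, steps=1000, mass=1.0):
    trajectory = []
    a = force(x) / mass
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt**2     # position update
        a_new = force(x) / mass
        v = v + 0.5 * (a + a_new) * dt       # velocity update
        a = a_new
        trajectory.append(x)
    return np.array(trajectory)

traj = velocity_verlet(x=1.0, v=0.0)
print(traj[:5])                              # the particle oscillates in the well
```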
A protein-folding expert at Stanford University hit on a novel hybrid software/Internet attack on the problem that didn't require an investment in supercomputers. Called Folding@home, it marshals a worldwide system of 200,000 desktop computers, which use their spare cycles to execute segments of a protein-folding algorithm devised by Vijay Pande and his Stanford colleagues.
"A lot of problems in biology can be related to problems of sampling where you need to get lots of samples to understand what is going on," said Pande. "The power of distributed computing allowed us to sample sizes that were orders of magnitude larger than what our peers could do."
The Folding@home approach had already been demonstrated by a project initiated by the Search for Extraterrestrial Intelligence (SETI), which set up a similar system to comb large databases of radio telescope data for meaningful patterns (see story, page 14).
Implementing the same solution for protein folding required some challenging software development. Folding times follow an exponential probability distribution, not the typical Gaussian bell curve, so numerous independent trials must be run before a random thermodynamic fluctuation kicks any one of them into the folded state.
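A back-of-the-envelope sketch, with invented numbers, shows why those exponential statistics favor many short, independent runs: the fraction of runs that fold within a short window t is roughly t/tau, so a large enough pool of runs will always catch some folding events.

```python
import numpy as np

# Why many short runs work: if folding times are exponentially distributed
# with mean tau, the chance that any single short run of length t folds is
# about t/tau for t << tau -- so 10,000 short runs still catch on the order
# of a hundred folding events. All numbers here are illustrative.

rng = np.random.default_rng(0)
tau = 10_000.0          # mean folding time (arbitrary units)
t_run = 100.0           # length of each short, independent run
n_runs = 10_000

folding_times = rng.exponential(tau, size=n_runs)
folded = np.count_nonzero(folding_times < t_run)

print(f"observed folding fraction:     {folded / n_runs:.4f}")
print(f"predicted (1 - exp(-t/tau)):   {1 - np.exp(-t_run / tau):.4f}")
```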
"There are many protein-folding problems that we cannot attack with this approach, but it's been surprising how many medically relevant proteins are within reach," Pande said. In particular, some relatively short proteins, consisting of about 50 base pairs, figure in such diseases as Alzheimer's, Huntington's and many cancers. Folding@home has successfully simulated the folding of three of those proteins, he said.
Pande has been working with a group at IBM, "largely because Blue Gene is the only peer machine that can do calculations anywhere close to what we can do," he said. "We have a lot to talk about because we have similar problems and similar concerns, but the architectures are sufficiently different that we're hoping to come up with projects where we can do something that neither of us could do by ourselves."
If today's top supercomputers can hardly make a dent in the protein-folding problem, it may be a long time before in-silico systems begin to streamline the life sciences. Companies hoping to compete in bioinformatics markets are therefore gluing together a wealth of advanced technologies to get there faster.
Robotic systems, for example, are being used to insert genes into bacteria. The bacteria are cultured in biotechnology equipment so that the protein expressed by the gene can be extracted in quantities large enough to crystallize; the crystals then go into X-ray diffraction imaging systems. It's a shortcut to determining the geometric form of a protein.
And, as has become typical of such efforts, each run generates gigabytes of information.