Bioinformatics Infrastructure Got You Down? Head to the Cloud, Rent a Supercomputer!

Cloud-Based Bioinformatics
Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

So you’ve sequenced your sample. Congratulations! Now you need to store and analyze the data. But when it comes to next-generation DNA sequencing data, that’s more easily said than done.

The raw dataset for a single whole human genome can be on the order of hundreds of gigabytes, and many studies involve dozens or hundreds of samples. The computational resources required just to move such a dataset around, let alone manipulate and share it, far exceed those of a typical desktop or laptop computer.
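To put those numbers in perspective, here is a back-of-the-envelope calculation, in Python, of how long a single raw genome takes just to move. It's a minimal sketch: the 200 GB figure is a round number from the range above, and the 100-megabit network link is an assumption.

```python
# How long does it take just to move one raw whole-genome dataset?
# The dataset size comes from the article; the link speed is an assumption.
dataset_gb = 200      # "hundreds of gigabytes" -- 200 GB as a round figure
link_mbps = 100       # assumed fast campus connection, in megabits/second

seconds = dataset_gb * 8_000 / link_mbps   # GB -> megabits, then divide by rate
print(f"{seconds / 3600:.1f} hours per genome")   # ~4.4 hours
```

Multiply that by a hundred samples and the transfer alone becomes a multi-week proposition.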

The traditional solution to such a problem is to farm the storage and computational analyses out to a computer cluster, a resource to which many researchers have little or no access. But more and more researchers today are choosing another route. Taking a page from popular services like Dropbox and Gmail, they are migrating their work to the cloud.

The case for cloud computing

Say you want to build your own computer cluster and storage array to handle your bioinformatics data. You could do it, but it's neither easy nor cheap. First, there's the hardware itself: clusters comprise anywhere from dozens to hundreds of compact computers running in parallel. You also need a place to physically house those computers, software to drive them, networking infrastructure to link them together and electricity to run and cool them.

“The power bill for a cluster can be easily $30,000 to $40,000 per year,” estimates Mark Gerstein, Albert L. Williams Professor of Biomedical Informatics at Yale University.

Once the cluster is running, it must be maintained: hardware replaced and upgraded, software patched, and security and user access enforced. You'll likely need a trained IT person to keep the system humming along. Costs can mount rapidly.

Cloud-based bioinformatics platforms make those issues mostly disappear. “The notion of obtaining, installing and compiling software becomes irrelevant,” says Jordan Stockton, director of marketing for Illumina’s Enterprise Informatics Business Unit. “We make the technology available to people who are not inclined or are unable to hire an IT person.”

In a cloud-based environment, the user essentially rents a virtual cluster of the desired specification, uses it and then releases it. Because these platforms are built atop massive cloud infrastructures, such as Amazon Web Services or the Google Cloud Platform, system resources can grow or shrink as needed, and users are charged only for the CPU time and storage they actually consume. They can upload as much data as they want, or pull it in from external resources, including both public and private databases. Everything else, including hardware maintenance, security, user access and so on, is handled by the service provider, leaving users free to focus on their work.
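What "rent, use, release" looks like in practice is sketched below, against Amazon EC2 via the boto3 Python library. The machine image ID, instance type and cluster size are placeholder assumptions, not details from any provider mentioned here.

```python
# A minimal sketch of the rent-use-release cycle on Amazon EC2 via boto3.
# The AMI ID, instance type and count below are placeholder assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# "Rent" a small virtual cluster: eight compute instances, sized to the job.
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",     # hypothetical machine image with your tools
    InstanceType="c5.4xlarge",  # 16 vCPUs per instance
    MinCount=8,
    MaxCount=8,
)
instance_ids = [inst["InstanceId"] for inst in resp["Instances"]]

# ... dispatch the analysis to those instances ...

# "Release" the cluster when the job finishes, so billing stops.
ec2.terminate_instances(InstanceIds=instance_ids)
```

The platforms described below wrap this provisioning cycle so that most users never see it directly.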

“The advantage of the cloud is its completely variable capacity,” explains Dick Daly, CEO of DNAnexus. “It’s like water; you can fill a pool or just a glass. You don’t have to decide [upfront] how much infrastructure you need.”

For cash-poor researchers and startup companies, the economics of such an arrangement often are compelling, says Lincoln Stein, director of the Informatics and Biocomputing Program at the Ontario Institute for Cancer Research. The math is trickier for those who already have access to extensive computational resources, but even they rarely get exclusive, on-demand use of high-performance hardware for long; for them, the decision often comes down to a trade-off between expense and speed.

Another complicating factor is time. Hour for hour, the cloud is relatively expensive, so at some level of sustained use the balance shifts toward investing in local infrastructure. You can create a 10,000-core cluster on demand, but "if you are going to be using those 10,000 cores over and over again at 90% utilization for a year, it ends up being far more expensive to do that in the cloud than on local hardware," Stein says.
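Stein's point is easy to check with rough numbers. The sketch below compares a year of sustained cloud usage against amortized local hardware; the per-core-hour and per-core prices are illustrative assumptions, not quotes from any provider, while the power figure is Gerstein's estimate from above.

```python
# Back-of-the-envelope: sustained cloud use vs. a local cluster for a year.
# All prices are illustrative assumptions except the power bill, which is
# the ~$40,000/year estimate quoted earlier in the article.
CORES = 10_000                    # cluster size from Stein's example
UTILIZATION = 0.90                # fraction of the year the cores stay busy
HOURS_PER_YEAR = 365 * 24

cloud_rate = 0.05                 # assumed on-demand $/core-hour
hw_per_core = 150.0               # assumed hardware cost per core, USD
hw_lifetime_years = 3             # assumed depreciation period
power_bill = 40_000.0             # annual power estimate from the article

core_hours = CORES * UTILIZATION * HOURS_PER_YEAR
cloud_annual = core_hours * cloud_rate
local_annual = CORES * hw_per_core / hw_lifetime_years + power_bill

print(f"cloud: ${cloud_annual:,.0f}/year")   # ~$3.9 million
print(f"local: ${local_annual:,.0f}/year")   # ~$540,000 (amortized + power)
```

Under these assumptions, full year-round utilization costs roughly seven times more in the cloud, though the comparison omits staff, storage and networking, and it reverses completely for short bursts of work.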

Still, the resources cloud-based systems can bring to bear on a problem are stunning. One large project, called CHARGE (Cohorts for Heart and Aging Research in Genetic Epidemiology), required Herculean resources. As described in a DNAnexus case study, the CHARGE dataset comprised 3,751 whole genomes and nearly 11,000 exomes, and needed to be accessible to some 300 researchers across five institutions. “Over the course of a four-week period, approximately 3.3 million core-hours of computational time were used, generating 430 TB [1 terabyte = 1,000 gigabytes] of results and nearly 1 PB [1 petabyte = 1,000 TB] of data storage hosted for further analysis.”
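A quick calculation using only the figures in that case study shows what "Herculean" means in hardware terms:

```python
# Average number of cores that had to stay busy around the clock for CHARGE.
# Both figures come from the DNAnexus case study quoted above.
core_hours = 3_300_000               # ~3.3 million core-hours of compute
wall_clock_hours = 4 * 7 * 24        # a four-week period = 672 hours

avg_concurrent_cores = core_hours / wall_clock_hours
print(f"average concurrent cores: {avg_concurrent_cores:,.0f}")   # ~4,900
```

Sustaining roughly 5,000 busy cores for a month is far beyond what most institutional clusters can dedicate to a single project.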

Heads in the cloud

Working in the cloud isn't effortless, though. Taking full advantage of the distributed compute and storage resources that cloud environments provide requires special computational know-how, and a range of commercial and free systems has emerged to simplify the task.

Like many such systems, DNAnexus runs on the Amazon cloud. The system, Daly explains, is both a platform and a service. Users can run any bioinformatics tool they want through a command-line interface, or they can try a small number of pre-canned workflows in a user-friendly interface for tasks like mapping and variant calling. “You can upload any kind of file, produced by any instrument, any dataset, and analyze it any way you want with any software,” Daly says—basically, if you can make the software run on a computer, it can run in the cloud (though some optimization may well be required). Users can also share those data and workflows with colleagues on a secure, regulatory-compliant platform.
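Daly's "any software" claim is easy to picture with a standard pipeline. The sketch below is an illustration, not DNAnexus code: a small Python driver that invokes the command-line tools BWA and samtools (assumed to be installed) exactly as it would on a local workstation. Run on a rented cloud node, the script doesn't change at all.

```python
# Drive a standard alignment pipeline from Python. The same script runs
# unchanged on a laptop or on a cloud node; bwa and samtools are assumed
# to be installed, and the file names are placeholders.
import subprocess

def align(reference: str, reads: str, out_bam: str, threads: int = 16) -> None:
    """Map reads with BWA-MEM, then sort the alignments with samtools."""
    bwa = subprocess.Popen(
        ["bwa", "mem", "-t", str(threads), reference, reads],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["samtools", "sort", "-o", out_bam, "-"],  # "-" reads from stdin
        stdin=bwa.stdout,
        check=True,
    )
    bwa.wait()

align("hg19.fa", "sample_reads.fastq", "sample.sorted.bam")
```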

Illumina’s BaseSpace® informatics platform is also built on Amazon’s cloud. BaseSpace accepts data directly from Illumina sequencers and makes it available to users, along with a host of analytical tools, including genome browsers, aligners and variant callers, all in a user-friendly interface.

Illumina has adopted the “app store” metaphor for BaseSpace, with both Illumina and third-party tools available. According to Stockton, the company offers some 25 apps at the moment, including SeqMan NGen, from DNASTAR, for de novo bacterial assembly; BWA/GATK, from Illumina, for alignment and variant calling; and IGV, the Integrative Genomics Viewer, from the Broad Institute. For now, BaseSpace storage is free, though the company has announced a pricing schedule in which the first terabyte is free and additional storage costs $250/month for 1 TB or $1,500/month for 10 TB. Apps are either free or fee-based, with costs assessed per run or by data volume.
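Worked out in Python, that announced schedule looks like the sketch below. Note that the choice between the two tiers, where a 10 TB block beats per-terabyte pricing, is our own reading of the two published price points, not an official calculator.

```python
# Monthly cost under the announced BaseSpace storage schedule: first TB free,
# then $250/month per additional TB or $1,500/month per 10 TB block. Picking
# the cheaper tier is our own interpretation of those two price points.
import math

def monthly_storage_cost(tb_used: float) -> int:
    billable = max(0.0, tb_used - 1.0)            # first terabyte is free
    per_tb = math.ceil(billable) * 250            # pay per additional TB...
    blocks = math.ceil(billable / 10) * 1500      # ...or in 10 TB blocks
    return min(per_tb, blocks)

for tb in (0.5, 3.0, 8.0, 12.0):
    print(f"{tb:>4} TB -> ${monthly_storage_cost(tb):,}/month")
```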

CloudBioLinux and Galaxy also run on Amazon and are completely free and open source, though users still incur Amazon’s usage charges. CloudBioLinux is a customized Amazon Machine Image (AMI), basically an Amazon virtual machine (like a computer fresh from the factory with nothing on it) preloaded with bioinformatics tools. “The goal was to make something people could use to do bioinformatics work with minimal overhead,” explains Brad Chapman, a research scientist at the Harvard School of Public Health who has contributed to the project’s development. “You spin it up and have aligners, BLAST and other standard tools available for analysis work.” But CloudBioLinux is a tool for advanced users, Chapman warns. “It’s aimed at developers and bioinformaticians rather than biologists.”

Galaxy offers a web interface for UNIX command-line bioinformatics tools, enabling such analyses as RNA-seq, ChIP-seq, variant detection and genome assembly. It provides an easy way to execute and chain those tools together to build a workflow or pipeline. Users can access any one of 50 or so public Galaxy implementations around the world (for instance, at usegalaxy.org), install the system locally or instantiate it in the cloud (at usegalaxy.org/cloudlaunch) and even share their implementation and workflows with collaborators. Using a tool called “CloudMan,” users can instantiate a Galaxy cluster of whatever size they desire and maintain it across sessions (so data aren’t lost when you log out).
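Galaxy also exposes these operations through a web API, so sessions can be scripted as well as clicked through. Below is a minimal sketch using the community-maintained bioblend Python library; the server URL is the public instance mentioned above, while the API key, input file and workflow ID are placeholders you would supply yourself.

```python
# Script a Galaxy session with bioblend. The URL is the public server named
# in the article; the API key, file name and workflow ID are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

# Each analysis lives in a "history"; create one and upload reads into it.
history = gi.histories.create_history(name="rna-seq run")
upload = gi.tools.upload_file("sample_reads.fastq", history["id"])
dataset_id = upload["outputs"][0]["id"]

# Launch a previously saved workflow, feeding it the uploaded dataset.
gi.workflows.invoke_workflow(
    "WORKFLOW_ID",
    inputs={"0": {"src": "hda", "id": dataset_id}},
    history_id=history["id"],
)
```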

The cloud-based implementation of Galaxy is built atop CloudBioLinux, so users can access all the tools available in that AMI. New tools can be added either by pulling them from the Galaxy Tool Shed or by creating a “tool wrapper” that tells the system how to invoke the tool and what parameters it takes, says Enis Afgan, a research scientist at the Ruđer Bošković Institute (RBI) in Zagreb, Croatia, and a member of the Galaxy project.

The new reality

According to Gerstein, cloud-based informatics reflects the new reality of the next-gen sequencing marketplace. Just a few years ago, he says, sequencing was relatively expensive and the analysis comparatively cheap. But as the cost of sequencing has plummeted, datasets have ballooned and informatics expenses have risen sharply. For many researchers, it’s now easier and less expensive to farm that work out to experts than to build a local cluster from scratch.

But the cloud, Gerstein says, “cuts both ways.” On one hand, cloud providers generally offer a far more secure environment than academic IT departments do, and the data can be accessed from anywhere without having to move them. On the other, entrusting data to off-site servers presents its own difficulties, including loss of physical control over the data, privacy concerns (especially for patient genome data) and the possibility of data loss or theft. (A 2010 Nature feature examined academic cybersecurity in depth.)

Ultimately, each lab and institution must decide for itself what solution makes sense. But this, at least, is clear: Thanks to cloud computing, high-performance bioinformatics is no longer reserved for the scientific haves. Whatever your resources, it’s just a mouse-click away.

