Forecasting Science in the Cloud

Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

It used to be that, when it came to DNA sequence analysis, the sequencing itself was the hard part. Today, thanks to “next-generation” DNA sequencing (NGS) technologies, the burden has shifted: Data collection is a cinch, but analysis is increasingly daunting.

“We’re now able to sequence genomes at the proverbial $1,000 price point,” notes Sandeep Sanga, vice president of products and services at Station X. “The challenge is how to take that information, understand it … and generate new discoveries that increase human health.”

The problem is multifaceted but fundamentally comes down to scale. Next-gen sequencers produce so much data so rapidly that simply storing it poses problems, let alone archiving, analyzing and sharing it with collaborators. For the information-technology workers tasked with overseeing a university’s computational infrastructure, the job can seem Sisyphean, a never-ending struggle to add and maintain storage and compute resources.

Answers in the cloud

Increasingly, researchers are turning to cloud-based bioinformatics to ease the burden. Some such resources, such as Galaxy, are free, open-source, community-driven products. But there are commercial alternatives, too, from companies such as DNAnexus, Illumina, Qiagen, SCIEX, Station X and Thermo Fisher Scientific. Some are free, others are pay-per-job, pay-per-byte or subscription-based. But all aim to simplify the unique IT headaches posed by NGS bioinformatics.

According to Mike Lelivelt, senior director of bioinformatics products at Thermo Fisher Scientific, cloud computing offers several significant advantages for today’s bioinformaticians: easy data sharing and on-the-go access (think Dropbox for scientists), freedom from hardware and software maintenance and upgrades, and easy scalability, among others.

“You can do things you cannot do on the desktop—have huge datasets, aggregate multiple data sources. When you have the space of the cloud, you can bring so much more to end users,” Lelivelt says.

Another benefit of the cloud, says Ramon Felciano, cofounder of Ingenuity (now part of Qiagen), which offers such cloud-based tools as Ingenuity Variant Analysis and Qiagen Clinical Insight, is instant access. “As soon as someone decides they want something, they can get it; there’s no delay.” And, he adds, the cloud enables researchers to try a service without huge upfront financial investments.

Thermo Fisher Scientific offers a cloud-computing service called Thermo Fisher Cloud. Built atop Amazon Web Services (AWS), Thermo Fisher Cloud accepts data directly from Thermo instruments and currently includes modules for holding and processing Sanger sequencing and qPCR data. Existing but separate tools for digital PCR and the company’s Ion Torrent sequencing data (called Ion Reporter) will be folded into Thermo Fisher Cloud by year’s end, Lelivelt says, and a new module for proteomics data is expected to go live in August.

‘Omics come together

Andreas Huhmer, director of marketing at Thermo Fisher Scientific, likens these tools to Google Docs or Microsoft’s Office 365—a place to store, share and work with data. In particular, he says, the Thermo Fisher Cloud service, currently available at no cost, enables “a multi-omics collaborative environment”—a way to integrate genome, transcriptome, proteome and metabolome datasets.

Among other things, the Ion Reporter software has workflows for processing 16S ribosomal RNA sequencing data, and it includes mechanisms for documenting and controlling software versioning. Normally, Lelivelt says, cloud-based tools update frequently, and users have no say in the matter. But that can be a problem for researchers who may be reluctant to upgrade from working versions, especially those in regulated environments. “What we developed was versioning at the workflow level, so the workflow is consistent and the core data path is versioned.”
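The idea of pinning versions at the workflow level can be sketched in a few lines. This is a hypothetical illustration of the concept only, not Ion Reporter’s actual implementation; every name and version number below is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """A workflow pins the exact version of every tool in its data path."""
    name: str
    version: str
    tool_versions: tuple  # (tool_name, version) pairs, fixed at release time


# Even if newer tool releases appear in the cloud, this workflow keeps
# running with the versions it was validated against.
WF_16S = Workflow(
    name="16S-rRNA-metagenomics",
    version="1.0",
    tool_versions=(("aligner", "2.1.3"), ("classifier", "0.9.8")),
)


def run(workflow: Workflow, sample_id: str) -> dict:
    # A real system would dispatch to the pinned tool binaries; here we
    # only record the provenance that makes a result reproducible.
    return {
        "sample": sample_id,
        "workflow": f"{workflow.name}/{workflow.version}",
        "tools": dict(workflow.tool_versions),
    }


result = run(WF_16S, "sample-042")
print(result["workflow"])  # -> 16S-rRNA-metagenomics/1.0
```

Because the workflow, not the individual tool, carries the version, two runs of `WF_16S` months apart produce results with identical provenance, which is the consistency regulated environments need.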

The cloud-based OneOmics™ Project, from mass-spectrometry vendor SCIEX, also aims to create a multi-omics environment, says Aaron Hudson, the company’s senior director of academic and clinical research business.

Integrated into Illumina’s cloud-based BaseSpace system (which is built on AWS), OneOmics allows researchers to integrate transcriptomics data with SWATH “next-gen” proteomics data, which is generated by an unbiased workflow for collecting MS/MS fragmentation data on the company’s TripleTOF mass spectrometers. Those data can then be attacked using any of a variety of third-party apps available for BaseSpace.

For instance, says Hudson, researchers can use a pathway-mapping BaseSpace app from a company called Advaita Bio to overlay differentially expressed transcript and protein data on metabolic pathways and so identify, say, candidate biomarkers of disease. Another app, Yale University’s RNA-Seq Translator, converts transcriptome data into a customized protein-sequence database, enabling researchers to identify novel proteins in their proteomics experiments, a workflow Hudson calls “proteogenomics.”
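The proteogenomics step Hudson describes rests on a simple idea: translate each transcript sequence into candidate protein sequences, then search proteomics spectra against them. Below is a toy sketch of that idea (three forward reading frames only, standard genetic code); it is an illustration of the concept, not the Yale RNA-Seq Translator app itself.

```python
# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
       "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AAS[i]
    for i, (a, b, c) in enumerate(
        (a, b, c) for a in BASES for b in BASES for c in BASES
    )
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate a DNA transcript in one reading frame, stopping at a stop codon."""
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i : i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def protein_db(transcripts: dict) -> dict:
    """Candidate proteins from all three forward frames of each transcript."""
    return {
        (tid, frame): translate(seq, frame)
        for tid, seq in transcripts.items()
        for frame in range(3)
    }


db = protein_db({"tx1": "ATGGCCAAGTGA"})  # toy transcript: Met-Ala-Lys-stop
print(db[("tx1", 0)])  # -> MAK
```

A real tool would also handle reverse-strand frames, splice variants and sequence variants called from the RNA-seq data, which is what makes the resulting database “customized” to the sample.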

Oftentimes, Hudson notes, transcriptomics and proteomics experiments are performed in different labs at different institutions. In the past, sharing data might have meant “FedEx’ing a hard disk.” But with OneOmics, he continues, “the data stay in the same place. To collaborate with people, it’s as easy as sending them an email and giving them access.”

Navigating the systems

One benefit of cloud-based tools is currency. Ingenuity, for instance, has spent years developing “a very large-scale atlas of human disease and biology by paying expert scientists to read papers and books and converting that knowledge into a structured database we have built and maintained for over 15 years,” says Felciano—a genomic analog of how Google keeps its Maps application up to date. Users of Qiagen’s cloud software can tap into that database, which stays current because the work of incorporating newly published clinical research is ongoing, Felciano says. “That is a natural thing to use in the cloud.”

Similarly, Station X—a company whose name derives from the British code-breaking effort at Bletchley Park during World War II—gives users of its GenePool software easy access to the data generated by The Cancer Genome Atlas (TCGA). According to Sanga, TCGA is a rich compendium of sequencing and clinical data, more than two petabytes collected over the past decade from more than 10,000 patients by an initiative funded by the National Institutes of Health. GenePool users can compare their data to TCGA without needing to build out massive compute and storage centers, to see, for instance, whether the mutations in their patients have been seen before and to gain insight into the clinical implications.

Not all cloud-based systems provide premade tools and workflows, however. DNAnexus describes itself as a “platform” or “infrastructure” for bioinformatics work, albeit a tool-agnostic one. “You can have a sequencer anywhere in the world. The data from that sequencer can flow to our platform, and you can use whatever tool you want [on it],” explains Mike Lin, director of research and development at DNAnexus.

Those tools typically are the same free, open-source applications that desktop users would use—such as TopHat for RNA-seq analysis—but supercharged thanks to the computing and storage facilities available through the cloud, Lin says. “We’re the infrastructure, [providing] unlimited compute and storage power…. The key element is the data are centralized, and many people can look at it.”

Oftentimes, those people are widely distributed, a situation in which sharing NGS data traditionally has proven difficult. Among the projects enabled by DNAnexus are ENCODE and the 3,000 Rice Genomes Project. The latter project, which as its name suggests sequenced 3,000 rice genomes, has generated more than 100 terabytes of data. Downloading such a dataset is impossible in many cases, and certainly impractical, Lin says. But using the DNAnexus platform, researchers in both China and the Philippines could develop working bioinformatics pipelines, upload and instantiate them in DNAnexus through the company’s application programming interface (API) and interact with the data remotely. It is “a beautiful example of [how] having data centralized in the cloud enables a global collaboration,” Lin says.
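A back-of-envelope calculation shows why downloading a dataset of that size is impractical. The link speed below is an assumption chosen for illustration, not a figure from the project.

```python
# How long would a 100 TB download take over an (optimistic) sustained
# 1 Gbit/s connection? All figures are illustrative assumptions.
dataset_tb = 100
link_gbps = 1.0

seconds = dataset_tb * 1e12 * 8 / (link_gbps * 1e9)
days = seconds / 86_400
print(f"{days:.1f} days of uninterrupted transfer")  # about 9.3 days
```

In practice, sustained international transfer rates are often far below 1 Gbit/s, stretching that window into months, which is why moving the analysis to the data, rather than the reverse, wins.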

Indeed, the beauty of the cloud is its ability to provide access to data without burdening users with the actual management of the data. Some GenePool power users, for instance, need to compare data from tens of thousands of patients and time points, Sanga notes. With each sample measuring tens of gigabytes or more, “the problem adds up and becomes almost untenable on hard drives,” he says. “That’s where the cloud fits in: You can interact with the data in a way that is not limited by the hardware.”
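The scale Sanga describes adds up quickly. The cohort size and per-sample size below are assumptions picked from within the ranges he mentions (“tens of thousands” of patients and time points, “tens of gigabytes” per sample).

```python
# Rough aggregate storage for the kind of cohort a GenePool power user
# might compare. Both figures are illustrative assumptions.
samples = 20_000      # tens of thousands of patients and time points
gb_per_sample = 20    # tens of gigabytes each

total_tb = samples * gb_per_sample / 1_000
print(f"{total_tb:.0f} TB")  # 400 TB, far beyond local hard drives
```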


