Data Mining for Cancer Genes

Caitlin Smith has a B.A. in biology from Reed College, a Ph.D. in neuroscience from Yale University, and completed postdoctoral work at the Vollum Institute.

Next-generation sequencing (NGS), combined with other molecular characterization technologies, has captured a wealth of sequence data from both normal and diseased human samples. For cancer researchers, collecting the data is only a start; the pressure is on to find genes that cause or contribute to the disease. The critical challenge is pinpointing the specific mutations that trigger normal cells to become cancerous. At the moment, most of these cancer-driving mutations are buried in a sea of genetic information. Here are some new computational and bioinformatic tools that researchers are developing to fish these mutations out and home in on the genetic bases of cancer.

Commercial tools for digging deeper

Thermo Fisher Scientific's Oncomine® web-based platform enables researchers to search large databases of curated gene-expression studies for different cancer types. Oncomine® Power Tools add visual and interactive tools for interrogating the oncology database, including the Oncomine® Gene Browser, Gene Expression Browser, DNA Copy Number Browser and Mutation Browser. The Oncomine® NGS Power Tools allow analysis of NGS data, including biomarker prediction and driver mutation identification features.

Affymetrix’s freely available Transcriptome Analysis Console (TAC) software is designed to analyze and interpret whole-transcriptome expression data. TAC can perform gene-level, exon-level or alternative-splicing analysis. TAC also features a microRNA (miRNA) interaction networks tool, which is useful for cancer researchers because “miRNAs have become an important part of cancer research, as dysregulation of miRNA has been shown to contribute towards cancer onset and progression,” says John Keefe, senior product manager in the expression business unit at Affymetrix. But studying miRNA regulation of mRNA is complex, as one or more miRNAs can bind to one or more mRNAs, according to Keefe. “TAC software provides the ability to quickly and easily visualize miRNA and mRNA fold changes, overlaid onto all of the potential mRNA-miRNA interactions networks, for rapid identification of relationships,” says Keefe.
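The idea of overlaying fold changes on a many-to-many miRNA-mRNA network can be sketched in a few lines of Python. This is only an illustration of the concept, not TAC's implementation: the expression values, the interaction list, the names (miR-A, GENE1 and so on) and the threshold are all hypothetical.

```python
from math import log2

# Hypothetical mean expression values: name -> (control, tumor).
MEAN_EXPRESSION = {
    "miR-A": (100.0, 400.0),
    "miR-B": (200.0, 200.0),
    "GENE1": (800.0, 200.0),
    "GENE2": (300.0, 330.0),
}

# Hypothetical predicted interactions; one miRNA may target several mRNAs
# and one mRNA may be targeted by several miRNAs.
INTERACTIONS = [("miR-A", "GENE1"), ("miR-A", "GENE2"), ("miR-B", "GENE2")]


def log2_fold_change(control, tumor):
    """Log2 ratio of tumor to control expression."""
    return log2(tumor / control)


def inverse_pairs(expression, interactions, min_abs_lfc=1.0):
    """Return interactions where miRNA and mRNA change in opposite directions.

    Both fold changes must reach min_abs_lfc in absolute value.
    """
    lfc = {name: log2_fold_change(*values) for name, values in expression.items()}
    pairs = []
    for mirna, mrna in interactions:
        a, b = lfc[mirna], lfc[mrna]
        if abs(a) >= min_abs_lfc and abs(b) >= min_abs_lfc and a * b < 0:
            pairs.append((mirna, mrna, a, b))
    return pairs


if __name__ == "__main__":
    for mirna, mrna, a, b in inverse_pairs(MEAN_EXPRESSION, INTERACTIONS):
        print(f"{mirna} ({a:+.1f}) -> {mrna} ({b:+.1f})")
```

With these toy values, only the miR-A/GENE1 pair is flagged, because the miRNA rises fourfold while its putative target falls fourfold.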

Illumina’s free NextBio Research software enables researchers to compare gene-expression datasets across different experiments. “NextBio Research contains thousands of curated studies that have already been imported from various publicly accessible databases, including the Gene Expression Omnibus and the Stanford Microarray Database,” says Andrew Boudreau, senior manager, product marketing, in informatics at Illumina. “The software also performs normalization across experimental platforms, so different expression datasets can be compared to one another, both within and across experimental platforms.”

Agilent’s GeneSpring platform is geared toward analysis of multi-omics data, including genomics, proteomics, metabolomics and transcriptomics. It can integrate these different types of datasets, using correlation and pathways for a multilayered view of results. The metadata framework feature allows for visualization and sorting based on numerical or categorical parameters, such as phenotype, treatment, clinical information and other sample-associated values. Carolina Livi, segment marketing manager in bioinformatics at Agilent, says that an important feature of GeneSpring is its ability to incorporate nonexperimental parameters as metadata in the analysis. “The GeneSpring 13 platform now offers this metadata visualization framework to make this even easier,” she says. “Often there are batches or subtypes unexpectedly affecting the differences coming up in the comparisons.” Vanessa Lordi, bioinformatics product manager, points out that “GeneSpring addresses the challenges in multi-omic data analysis by providing comprehensive analytical and visualization tools for multiple data types.” Agilent will continue optimizing the multi-omic workflow to overlay genes, proteins and metabolites in pathways for future research and drug discovery.

Histopathology plus data mining

Another open-access, web-based option for researchers is cBioPortal for Cancer Genomics, developed at Memorial Sloan Kettering Cancer Center. This software enables researchers to explore large-scale genomic datasets from diverse cancer studies [1]. Dejun Shen, assistant professor of pathology at the University of Alabama at Birmingham, recently combined the use of cBioPortal with histopathology information from cancer patients [2]. “I expect that any bioinformatics tools using a pathology-centered approach will greatly improve the efficacy of the genomic data mining,” he says.

Shen’s approach, which he calls pathology-centered data mining, uses pathology features to classify the cancer-related genetic abnormalities. “I first classify the patients with pathology features and then find and compare the genetic abnormalities among various groups,” Shen says. In contrast to studies that base their bioinformatic criteria on in vitro experiments, his method of including diagnostic pathology incorporates the in vivo status of the cancer-associated genes, he says. “I believe that my results more reliably reflect the functional status of the gene in vivo.”
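The two steps Shen describes, grouping patients by pathology and then comparing genetic abnormalities between groups, might look like the following minimal Python sketch. It is not Shen's actual pipeline: the patient records, group labels and gene names are hypothetical placeholders, and the use of Fisher's exact test for the comparison is an assumption made for illustration.

```python
from collections import defaultdict
from math import comb

# Hypothetical toy records: (patient_id, pathology_group, mutated genes).
PATIENTS = [
    ("P1", "type_I", {"GENE_A", "GENE_B"}),
    ("P2", "type_I", {"GENE_A"}),
    ("P3", "type_I", {"GENE_B"}),
    ("P4", "type_II", {"GENE_C", "GENE_B"}),
    ("P5", "type_II", {"GENE_C"}),
    ("P6", "type_II", {"GENE_C", "GENE_A"}),
]


def mutation_counts_by_group(patients):
    """Step 1: classify by pathology. Returns {group: (n_patients, {gene: n_mutated})}."""
    sizes = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for _patient_id, group, genes in patients:
        sizes[group] += 1
        for gene in genes:
            counts[group][gene] += 1
    return {group: (sizes[group], dict(counts[group])) for group in sizes}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x mutated patients in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= observed * (1 + 1e-9))


def compare_gene(patients, gene, group_a, group_b):
    """Step 2: compare one gene's mutation frequency between two pathology groups."""
    summary = mutation_counts_by_group(patients)
    n_a, genes_a = summary[group_a]
    n_b, genes_b = summary[group_b]
    mut_a, mut_b = genes_a.get(gene, 0), genes_b.get(gene, 0)
    p_value = fisher_exact_two_sided(mut_a, n_a - mut_a, mut_b, n_b - mut_b)
    return mut_a / n_a, mut_b / n_b, p_value


if __name__ == "__main__":
    print(compare_gene(PATIENTS, "GENE_C", "type_I", "type_II"))
```

Even a gene mutated in every patient of one group and none of the other yields a p-value of only 0.1 with three patients per group, which illustrates why this kind of comparison depends on large cohorts such as those available through cBioPortal.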

New developments in analysis tools

Other academic labs also are producing innovations in data-mining tools. For example, the lab of Zhongming Zhao at Vanderbilt University recently published a data-mining method called mutation set enrichment analysis (MSEA) [3]. The method combs through existing online collections of data from cancer studies. Such a task is tricky, because cancer samples contain many mutations, and many of those have nothing to do with cancer. MSEA enables Zhao’s team to focus on mutation hot spots in regions of cancer-related genes. They applied their methods to data from the Cancer Gene Census, the Catalogue of Somatic Mutations in Cancer database and The Cancer Genome Atlas and found that 51% of cancer genes contained mutation hot spots, which led them to predict cancer genes based on patterns of mutation clusters. In addition, the MSEA-domain feature searches for locations in protein domains that contain an unusual number of DNA insertions, deletions and point mutations. Because protein domains often are involved in cellular functions, connecting this information to mutation analysis may help researchers better understand the regulatory events of cancer progression.
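To make the notion of a mutation hot spot concrete, here is a minimal Python sketch that scans a protein with a sliding window and flags windows holding more mutations than a uniform background would predict. It is not the published MSEA statistic; the binomial null model, the window size, the significance threshold and the toy mutation positions are all assumptions chosen for illustration.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def find_hotspots(positions, protein_length, window=5, alpha=0.01):
    """Return (start, end, count, p_value) for each enriched window.

    Null model (an assumption): every mutation falls uniformly along the
    protein, so each lands in a given window with probability
    window / protein_length. No multiple-testing correction is applied.
    """
    n_mutations = len(positions)
    p_window = window / protein_length
    hits = []
    for start in range(1, protein_length - window + 2):
        end = start + window - 1
        count = sum(start <= pos <= end for pos in positions)
        if count == 0:
            continue
        p_value = binomial_tail(count, n_mutations, p_window)
        if p_value < alpha:
            hits.append((start, end, count, p_value))
    return hits


if __name__ == "__main__":
    # Hypothetical residue positions of mutations pooled across samples.
    toy_positions = [12, 13, 13, 14, 15, 40, 77, 90]
    for start, end, count, p_value in find_hotspots(toy_positions, protein_length=100):
        print(f"residues {start}-{end}: {count} mutations, p = {p_value:.2g}")
```

On the toy positions, only windows overlapping the cluster at residues 12 to 15 are reported; the isolated mutations at 40, 77 and 90 are treated as background, which mirrors the distinction between clustered and scattered mutations described above.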

Another recent innovation from academia comes from the lab of Lude Franke at University Medical Center Groningen [4]. The group took advantage of existing gene-expression studies, which provided immediate access to large numbers of cancer-patient samples with almost 80,000 expression profiles. The team developed a statistical method that allowed them to find DNA abnormalities from the RNA profiles. They found that DNA copy number correlated strongly with expression levels, and they linked genomically unstable cancers to disruptions in genes, using more than 16,000 tumor samples from patients. Such an approach would have been impossible only a few years ago, but thanks to advances in computing it is now feasible.
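The relationship between copy number and expression can be illustrated with a simple Pearson correlation for a single gene across samples. This Python sketch is not the statistical method from the Franke lab's paper, which works from expression profiles alone; it only shows the kind of gene-dosage correlation the study reports, and the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values for one gene in five tumor samples.
    copy_number = [1, 2, 2, 3, 4]                # DNA copies of the gene
    log2_expression = [4.0, 5.0, 5.2, 6.1, 7.0]  # log2 mRNA abundance
    r = pearson(copy_number, log2_expression)
    print(f"copy number vs. expression: r = {r:.3f}")
```

A coefficient close to 1 for a gene means its expression tracks its copy number closely, which is the property that makes it plausible to infer DNA-level changes from RNA profiles.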

When choosing a data-mining tool, think ahead to your future needs: If your lab uses another type of experiment to verify or strengthen your current research, will your data-mining method be able to analyze both types of data together? “There are increasing interests in looking at the cancer genome from multiple angles and [in integrating] data analysis to link genetic changes to functional changes and vice versa,” says Keefe, who suggests considering “tools which are compatible with each other, for the ability to analyze different types of datasets from the same sample.” Whatever data-mining method you choose, it is important to stay current on the latest developments in this evolving field, because new (and possibly more useful) methods may emerge at any time.

References

[1] Gao, J, et al., “Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal,” Sci Signal, 6(269):pl1, 2013. [PubMed ID: 23550210]

[2] Ping, Z, et al., “Mining genome sequencing data to identify the genomic features linked to breast cancer histopathology,” J Pathol Inform, 5:3, 2014. [PubMed ID: 24672738]

[3] Jia, P, et al., “MSEA: detection and quantification of mutation hotspots through mutation set enrichment analysis,” Genome Biol, 15:489, 2014. [PubMed ID: 25348067]

[4] Fehrmann, RSN, et al., “Gene expression analysis identifies global gene dosage sensitivity in cancer,” Nat Genet, 47:115-125, 2015. [PubMed ID: 25581432]
