Bioinformatics Tools for Gene Expression: Crunching the Numbers

Software Tools for Gene Expression Analysis
Amber Dance is an award-winning freelance science writer based in Southern California. She is the ALS (Lou Gehrig’s disease) reporter for the Alzheimer Research Forum. She contributes to The Scientist and Nature journals, and has written about topics ranging from record-breaking rocks to bizarre new ant species.

Gene-expression studies yield multitudes of data, such as millions of spots on microarray chips or hundreds of RNA transcripts sequenced millions of times over. Even qPCR reactions to validate a handful of genes generate data that require further analysis. That means scientists need computer software to extract meaning from the raw information. A variety of options are available, including free, open-source software and commercial platforms that may come with equipment or cost extra.

Users often find they need multiple software applications to answer their research questions. For example, a program that comes with the microarray chip or PCR machine could be used to process raw data, eliminating noise and technical variations, and produce a list of genes truly up- or down-regulated in a given experiment. Then, researchers might move to different software applications to give that list biological significance, for example by identifying categories of genes or pathways.
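
The second stage described above, giving a gene list biological meaning by grouping it into categories, can be illustrated with a minimal Python sketch. Everything here is hypothetical: the gene names, the category labels and the `tally_categories` helper are invented for illustration and do not come from any particular product or annotation database.

```python
from collections import Counter

# Hypothetical gene-to-category annotations; a real study would load these
# from an annotation resource rather than hard-coding them.
GENE_CATEGORIES = {
    "GENE_A": "cell cycle",
    "GENE_B": "cell cycle",
    "GENE_C": "apoptosis",
    "GENE_D": "metabolism",
}


def tally_categories(gene_list, annotations):
    """Count how many genes in gene_list fall into each annotated category."""
    counts = Counter()
    for gene in gene_list:
        # Genes without an annotation are grouped under "unannotated".
        counts[annotations.get(gene, "unannotated")] += 1
    return counts


# Made-up output of the first stage: genes called up-regulated.
up_regulated = ["GENE_A", "GENE_B", "GENE_C", "GENE_X"]
category_counts = tally_categories(up_regulated, GENE_CATEGORIES)

for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

A simple tally like this only shows which categories appear most often in the list. Deciding whether a category is over-represented would take a statistical test, which this sketch does not attempt.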

Open-source options

If you’re comfortable with computer programming and biostatistics, you have a variety of open-source gene-expression applications to choose from. To find them, ask friends and colleagues and check the scientific literature (e.g., Genome Biology, Nature Biotechnology and Bioinformatics), recommends Naim Rashid, a postdoc at the Dana-Farber Cancer Institute in Boston, who writes and uses open-source software in his studies of cancer RNA sequences. One popular option is Bioconductor, a collection of more than 700 modules written in the programming language R and contributed by scientists worldwide, including more than 70 related to gene expression.

You can use your favorite programming language to stitch together software modules from various sources as you go from raw data through different kinds of analysis; this is called “creating a pipeline,” Rashid says.
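
As a rough illustration of what “creating a pipeline” can look like, the Python sketch below chains two small processing steps so that the output of one becomes the input of the next. The step functions, the `run_pipeline` helper and the expression values are all assumptions made for this example. Real pipelines typically call established packages at each stage.

```python
import math
from functools import reduce
from statistics import median


def log2_transform(expression):
    """Convert raw intensities to log2 scale (assumes all values are positive)."""
    return {gene: math.log2(value) for gene, value in expression.items()}


def median_center(expression):
    """Subtract the median so that values are centered on zero."""
    center = median(expression.values())
    return {gene: value - center for gene, value in expression.items()}


def run_pipeline(data, steps):
    """Pass data through each step in order, feeding each output to the next."""
    return reduce(lambda current, step: step(current), steps, data)


# Made-up raw intensities for three hypothetical genes.
raw = {"GENE_A": 8.0, "GENE_B": 32.0, "GENE_C": 128.0}
processed = run_pipeline(raw, [log2_transform, median_center])
print(processed)
```

Keeping each step as a separate function with the same input and output shape makes it easy to swap one module for another, which is the main appeal of the pipeline approach.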

Open-source software is particularly prevalent in RNA-sequencing labs. The research community has had years to develop and settle on trustworthy algorithms for microarray analysis, and companies have built those into commercial programs. With RNA sequencing, however, scientists are still working out the right processes, and the field has yet to settle on the best analysis pathways.

These programs won’t cost a cent, but they don’t have much in the way of technical support and there’s no guarantee they’ll work for your data. Rashid recommends testing new software on a standard dataset to make sure it yields the expected results.
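
One way to act on that advice is to run the new tool on a dataset whose correct answers are already known and compare the two sets of results automatically. The sketch below assumes the tool's output and the reference values are both available as simple gene-to-value mappings. The `find_mismatches` helper, the tolerance and the numbers are hypothetical.

```python
import math


def find_mismatches(observed, expected, tolerance=1e-3):
    """Return genes whose observed value is missing or differs from the expected one."""
    mismatches = []
    for gene, expected_value in expected.items():
        observed_value = observed.get(gene)
        if observed_value is None or not math.isclose(
            observed_value, expected_value, rel_tol=0.0, abs_tol=tolerance
        ):
            mismatches.append(gene)
    return mismatches


# Made-up reference results for a standard dataset.
expected = {"GENE_A": 1.0, "GENE_B": -2.0, "GENE_C": 0.5}
# Made-up output from the software being evaluated.
observed = {"GENE_A": 1.0000001, "GENE_B": -1.5}

problems = find_mismatches(observed, expected)
print("Genes that disagree with the reference:", problems)
```

An empty list would suggest the software reproduces the reference within the chosen tolerance. Any genes it reports are a starting point for working out whether the tool, its settings or the input data are at fault.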

Commercial software

If you’re uncomfortable with instructions like “Install the latest release of R, then enter the following commands,” you might prefer user-friendly commercial software with a visual interface.

Molecular biologist Christopher Phiel at the University of Colorado in Denver, for one, found open-source options a bit intimidating when he started working with microarrays. Commercial software was more appropriate for his needs. These products tend to be more comprehensive, so you don’t need to assemble a pipeline from many individual programs.

If you have the equipment to run qPCR or microarrays, you may already have some software, because it is often included with the equipment. For example, Bio-Rad Laboratories provides its CFX Manager™ software with its CFX PCR machines. This all-in-one program not only runs the PCR but also helps users analyze and visualize the results or export data for use in other software applications, such as Excel.

Similarly, Affymetrix microarray users can start with the company’s software, which is available to anyone using Affymetrix arrays. Expression Console™ (EC) performs basic quality control and normalization and also summarizes data. Earlier this year, Affymetrix launched the Transcriptome Analysis Console (TAC), which performs further analyses. TAC’s abilities include calculating the fold-change in gene expression, visualizing gene-expression changes and alternative splicing events and accessing public gene annotations. TAC can analyze data from any Affymetrix chip, including older legacy arrays and the newer Human Transcriptome Array.
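
For readers unfamiliar with fold-change, the calculation can be sketched in a few lines of Python. This is a generic illustration, not a description of how TAC computes it. The function name and the replicate values are made up, and the sketch assumes the inputs are positive, already normalized expression values on a linear scale.

```python
import math
from statistics import mean


def log2_fold_change(treated, control):
    """Log2 of the ratio of mean treated expression to mean control expression."""
    return math.log2(mean(treated) / mean(control))


# Made-up normalized expression values for one gene, three replicates each.
treated = [400.0, 420.0, 380.0]
control = [100.0, 90.0, 110.0]

log2fc = log2_fold_change(treated, control)
fold_change = 2 ** log2fc
print(f"log2 fold change: {log2fc:.2f}, fold change: {fold_change:.1f}")
```

Working on the log2 scale makes up- and down-regulation symmetric: a fourfold increase gives +2 and a fourfold decrease gives -2.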

Third-party solutions

Many researchers, however, work with more than one microarray brand or with multiple types of expression data. Suppose, for example, that you used to work with Affymetrix chips but recently switched to RNA sequencing. You might also want to compare your results with those of a collaborator who uses Illumina arrays. It’s impossible to fully integrate data from different platforms into a single dataset, cautions Christian Reece, a senior product manager at Affymetrix. But third-party software may be able to help.

For example, ArrayStar® from DNASTAR can combine data from multiple array platforms with RNA-sequencing data, and it can identify clusters of genes with related functions. BioDiscovery’s Nexus Expression™ also incorporates different brands of arrays with basic RNA-sequencing analysis.

Phiel chose GeneSpring, from Agilent Technologies, because of the wide variety of analyses he could perform within the single program.

GeneSpring is a “Grand Central Station” for integrated analysis, including organizing gene annotations and pathways, says Antoni Wandycz, Agilent’s director of bioinformatics solutions. It’s ready to analyze data from Agilent’s arrays as well as chips from other manufacturers, plus DNA and RNA sequences, qPCR results and metabolomics and proteomics data.

Similarly, the Ingenuity Pathway Analysis (IPA) program from Ingenuity Systems, a QIAGEN Company, incorporates gene-expression data with proteomic and metabolomic data, helping users understand what all the data points could mean. For example, IPA’s new Causal Network Analysis module compares your gene-expression pattern to a scientist-curated, literature-based database of genetic regulators. It then predicts what regulatory networks might have caused such a pattern, such as upstream transcription factors or microRNAs that could have mediated the effects you’re seeing.

No matter what, analyzing gene-expression data is likely to require an investment. The choice is whether you invest the time to create an open-source pipeline, the personal capital to recruit a biostatistician to do it for you or potentially thousands of dollars for a user-friendly commercial system.

Image: DNASTAR's ArrayStar software
