A Pathway-centric Approach to Multi-omics Research Powered by GeneSpring Analytics

As Steve mentioned, I am the Director of Integrated Biology Marketing. I have been with Agilent for about a year and a half

And tonight, I'll be talking to you about integrated biology. And integrated biology, for those of you who are in the field of research, is probably something that you're getting used to currently. It's a term that we introduced

It is different from systems biology in the sense that we are trying to limit the scope of what we are doing to multiomics data acquisition and data integration. So, that's what we mean by integrated biology. And that is the subject of my presentation tonight

The first slide right here is to provide motivation for the presentation, as to why we even want to do multiomics research. And if you saw those two pictures merge together, I don't need to explain to you why

What you saw on the left-hand side was a Google map rendition from a cartographer's view of a specific place in the world. And on the right-hand side, what you saw was a satellite image of the same place

And when you bring those two pieces of domain knowledge together, you develop what we call interpretive value. And you understand the relationships between the pieces that make up the physical landscape that you're looking at

That is exactly the same value proposition that we're trying to bring to you with integrated biology, except that, instead of satellite images and cartographers' maps, what we're trying to bring to you is the integration of the information that the different technologies and technology platforms that Agilent offers bring to bear on your study of biology

So, for this particular audience, the first piece of technology at the top, LCMS/GCMS, is highly relevant. But, in other areas of research, people who are doing genomics or transcriptomics are also interested in the second and third pieces of this diagram

And the challenge then is to bring all this information together in a way that makes sense biologically. And we decided to guide our efforts and to offer back to the community a solution that is based on the core principles of biology

Biology is not random. Biology is organized. And there are core principles for the organization of biological information. And one of them is the organization around pathways

So, we have constructed a workflow that actually consists of the technologies that provide you input for the different measurements that you make and a way of bringing that information together

And there are many ways to bring this information together. But, one specific way is to bring it together around pathways because there is a lot of prior knowledge that is already captured in pathways. And we want to be able to exploit that prior knowledge and to bring it to bear on the analytical problem

We are doing so using a software platform that we call GeneSpring. Those of you who have been dealing with or are connected to genomics groups will probably have heard about GeneSpring

GeneSpring has been around and is very popular among people doing genomics. It's probably not so popular among people who are doing proteomics or mass spectrometry or metabolomics in general

But, what we're trying to do is to change that landscape and to provide you with a tool that will allow you to interact better with your colleagues in your departments and your institutes using the other omics technologies to bring it all together and to derive biological knowledge

I'm going to illustrate that to you using data from a collaboration that we have ongoing with Q. Re [sp] and his group at Cornell Medical College

And Q. and his team are very keen on understanding what the biological response is to the drugs used to treat tuberculosis

It is amazing when I look at this slide as I was preparing this presentation to realize that all of the front-line drugs and treatments to TB are so old. They've been around for a long time

We have a general idea of what they do mechanistically or at least what biological processes they interfere with. But, we really don't have a clear picture of what the mechanism or the molecular targets are for any of those specific drugs

So, the endeavor of Q. and his team is to try to understand, for these traditional drugs that are in use, what exactly is the mechanism of action? What exactly are the molecular targets? And can we start elucidating those things by using a pathway-driven approach to multiomics data measurements?

So, the general workflow that I showed you before for how you bring the different technologies together, in the case of Q.'s research, actually transforms to something like this

And I apologize in advance because a lot of this data is actually fairly new. And Q. would like to go ahead and publish. So, he asked me to deidentify a lot of the drugs and a lot of the treatments and the doses and the times and so on and so forth. So, I'm calling it generically Drug X, Drug Y, Drug Z; low, mid, and high treatment. This is all in vitro experimentation with TB

And Q. and his team acquired two sets of data: metabolomics data, label free, using LCMS on a 6520 QTOF with the appropriate number of replicates and so on and so forth, and gene expression data using custom microarrays, because TB arrays are not commercial off-the-shelf products that we make. But, we can actually make custom arrays if you wish to order those

And then those two data streams come together under the GeneSpring platform so you can actually corroborate known mechanisms of action, discover new unknown cellular responses, or suggest new druggable targets, which is the ultimate goal of his research

We have analyzed some of the data, and I'll be giving you a glimpse of what the data is looking like. And in fact, there is also ongoing work at the moment at Agilent Labs because, once you analyze the transcriptomics and the metabolomics data, what comes out of that analysis are specific pathways that seem to be enriched by the perturbations of the biological system

And what you want to do in the next step, obviously, is to try to validate or verify those findings at the protein level, except that, at that stage, you're not going in an unbiased way, but you actually have a list of targeted proteins that you want to go and try to verify. And this is exactly the work that is ongoing currently in Chris Miller's lab at Agilent in Santa Clara

I will not be showing you that data because it's something that is currently being acquired. But, I just wanted to give you an idea of how we're trying to not only do discovery research but also have a feedback loop that closes the cycle and moves you from purely discovery experimentation into verification and validation

So, from one of the treatments and one of the drugs that Q.'s team has looked at, this is what the gene expression trend lines look like. We're looking at all genes here and the three different doses, so low, medium, and high, and looking at the fold change with the change in treatment for genes

If you go ahead and look at the top and bottom 5 percent of regulated genes, you see two distinct groups over there
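As a rough illustration of that selection step, the sketch below ranks genes by fold change and keeps the top and bottom 5 percent. The function name, the gene names, and the fold-change values are all made up for the example; it is not how GeneSpring itself does the selection.

```python
def top_and_bottom_fraction(fold_changes, fraction=0.05):
    """Return (up, down): gene names in the top and bottom `fraction` by fold change.

    `fold_changes` maps a gene name to its (log) fold change for one treatment.
    """
    # Sort gene names from most down regulated to most up regulated.
    ranked = sorted(fold_changes, key=fold_changes.get)
    # Keep at least one gene on each side, even for very small inputs.
    n = max(1, round(len(ranked) * fraction))
    return ranked[-n:], ranked[:n]


# Hypothetical data: 40 genes with fold changes from -20 to +19.
fold_changes = {f"gene_{i:02d}": float(i - 20) for i in range(40)}
up, down = top_and_bottom_fraction(fold_changes)
print("up regulated:", up)
print("down regulated:", down)
```

The two lists that come back are the two groups that go into the enrichment step.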

And you can take those two then and try to do pathway and gene ontology enrichment to understand whether those two groups of genes that are up and down regulated cluster together into specific pathways. To do that, GeneSpring is a great tool that allows you to group those genes into groups with the proper statistical analysis, in this case a P value coming from hypergeometric testing

So, we have two populations of pathways, pathways that are up regulated and pathways that are down regulated. And with that, we begin then the exploration to try to understand mechanisms of action or potential drug targets

So, I'm going to open up a parenthesis here because I think it's very important to understand. And I think the capability to do this kind of on-the-fly, very rapid statistical testing and then render the information back to you at the level of pathways is quite unique to GeneSpring IB

Again, if you have been involved doing genomics, this is nothing new to you. This is exactly how genomics has been doing enrichment analysis for a long time

They do hypergeometric testing, where they're trying to calculate the probability that X number of differentially expressed genes cluster within a pathway A

And it's a simple calculation that is given by the formula on the right-hand side of the equation right there. And all the parameters are defined in the diagram, where capital N is the number of genes on the array, which is known because you build the array

M is the number of genes that are differentially expressed. K is the number of genes that you expect to find in a pathway. And X is the number of genes that you observe being differentially expressed in your actual measurements

If you plug them into this equation, you can then calculate a P value: the probability of seeing at least that many differentially expressed genes together in a specific pathway by chance alone. A small P value supports the hypothesis that the genes really do belong together in that pathway
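The slide with the formula is not part of this transcript, so here is a minimal sketch of the standard hypergeometric tail probability, using the same N, M, K, and X as defined above. It assumes the usual one-sided enrichment test (the chance of X or more pathway genes among the differentially expressed ones); the function name and the example numbers are illustrative only.

```python
from math import comb


def hypergeometric_p_value(N, M, K, X):
    """P(at least X of the M differentially expressed genes fall in the pathway).

    N: genes on the array
    M: genes that are differentially expressed
    K: genes that belong to the pathway
    X: differentially expressed genes observed in the pathway
    """
    total = comb(N, M)  # all ways to pick M genes out of N
    upper = min(K, M)   # cannot observe more hits than this
    # Sum the probability of seeing exactly i hits, for i = X .. upper.
    tail = sum(comb(K, i) * comb(N - K, M - i) for i in range(X, upper + 1))
    return tail / total


# Hypothetical numbers: 4000 genes on the array, 200 differentially expressed,
# a pathway with 40 genes, 8 of which are differentially expressed.
print(hypergeometric_p_value(N=4000, M=200, K=40, X=8))
```

With 200 out of 4000 genes differentially expressed, you would expect about 2 hits in a 40-gene pathway by chance, so 8 hits gives a small P value.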

In the case of metabolomics or proteomics data, this is not so simple because N, the total number of entities, is not specifically known. We don't know how many metabolites there are. We don't know how many proteins and protein isoforms there are in the complete space

So, we take a slightly different approach for those two datasets. And what we do is we simply count the number of occurrences of specific molecules or chemical entities in the pathways. And we rank accordingly
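A minimal sketch of that counting approach might look like the following. The pathway names and memberships are invented for the example, and the tie-breaking rule (alphabetical) is an assumption, since the talk does not say how ties are ranked.

```python
def rank_pathways_by_hits(pathways, observed):
    """Count how many observed entities occur in each pathway and rank by count.

    `pathways` maps a pathway name to the entities it contains.
    `observed` is the list of metabolites (or proteins) found in the experiment.
    Returns (pathway, count) pairs, highest count first, ties in name order.
    """
    observed = set(observed)
    counts = {name: len(observed & set(members)) for name, members in pathways.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pathway definitions and measurements.
pathways = {
    "pathway_A": ["m1", "m2", "m3", "m4"],
    "pathway_B": ["m3", "m5"],
    "pathway_C": ["m6", "m7"],
}
observed = ["m1", "m3", "m4", "m5"]
for name, hits in rank_pathways_by_hits(pathways, observed):
    print(name, hits)
```

No P value comes out of this; the count itself is the score that you threshold on.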

At the end of that analysis, what we have is essentially a Venn diagram with two circles, where the circle on the left gives you the collection of pathways that are enriched in your transcriptomics or genomics measurement. And on the right, you have your proteomics or metabolomics

And by doing the union or the intersection of those two, you find what's common between the two

But, what you really would like to do is to be able to change that intersection by varying the P value, right, which is what gives you the size of that circle, or the number of occurrences in the pathway

Well, you can sit and do it manually. Joking

What we have provided is actually controls in GeneSpring, where you simply slide the position of a radio dial, if you will, and say, "I want to operate with a P value of this for my transcriptomic data and a number of occurrences of this for my metabolomics or proteomics data." And that gives you that intersect

And that intersect is then listed as a list of protein pathways that appear on the table at the top. So, fairly straightforward, but I wanted to kind of give you an idea of how it is that we're approaching this problem
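To make the two dials concrete, here is a small sketch of the intersection under two adjustable cutoffs: a maximum P value on the transcriptomics side and a minimum number of occurrences on the metabolomics or proteomics side. The function name, the default cutoffs, and the pathway scores are assumptions for illustration, not GeneSpring's actual interface.

```python
def shared_pathways(gene_p_values, metabolite_hits, max_p=0.05, min_hits=2):
    """Pathways that pass both cutoffs, i.e. the intersection of the two circles.

    `gene_p_values` maps pathway name to the enrichment P value from gene data.
    `metabolite_hits` maps pathway name to the number of occurrences counted
    from metabolomics or proteomics data.
    """
    from_genes = {name for name, p in gene_p_values.items() if p <= max_p}
    from_metabolites = {name for name, hits in metabolite_hits.items() if hits >= min_hits}
    return sorted(from_genes & from_metabolites)


# Hypothetical scores for four pathways.
gene_p_values = {"pathway_A": 0.001, "pathway_B": 0.04, "pathway_C": 0.20, "pathway_D": 0.01}
metabolite_hits = {"pathway_A": 3, "pathway_B": 1, "pathway_C": 5, "pathway_D": 2}

print(shared_pathways(gene_p_values, metabolite_hits))              # default cutoffs
print(shared_pathways(gene_p_values, metabolite_hits, min_hits=1))  # relax one dial
```

Moving either cutoff grows or shrinks one circle, and the list of shared pathways changes with it.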

The other thing that is not trivial but is kind of running under the hood, and is an underappreciated function of GeneSpring, has to do with the fact that the identifiers for these chemical entities, and even for genes, are a big problem in bioinformatics, right? Sometimes the same chemical species has four or five different names. And depending on where you grab that name from, you end up with a big mess for the same molecule

What we needed to build was a tool, BridgeDb, that actually allows you to unify this and to kind of look up the answer in the back of the book and to say, "How many names is this molecule known by, and do I see it in this particular pathway," and map it

So, BridgeDb is also running kind of under the hood, like I said, a little underappreciated, but I think it's a tool that we should all learn to appreciate because it's solving a fairly significant problem for the community
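The idea of looking up every name a molecule is known by can be sketched with a plain synonym table. This is only a toy illustration of the mapping problem and does not use or reproduce the actual API of the identifier-mapping tool; the canonical keys (`met_001`, `met_002`) and the function names are hypothetical.

```python
# Hypothetical synonym table: one made-up canonical key per molecule.
SYNONYMS = {
    "met_001": ["UDP-GlcNAc", "UDP-N-acetylglucosamine", "UDP-N-acetyl-D-glucosamine"],
    "met_002": ["D-Ala", "D-alanine"],
}


def build_lookup(synonyms):
    """Map every known name (case-insensitive) to its canonical key."""
    lookup = {}
    for canonical, names in synonyms.items():
        for name in names:
            lookup[name.strip().lower()] = canonical
    return lookup


def unify(names, lookup):
    """Return the set of canonical keys for the names that can be mapped."""
    keys = set()
    for name in names:
        key = lookup.get(name.strip().lower())
        if key is not None:
            keys.add(key)
    return keys


lookup = build_lookup(SYNONYMS)
# Two different spellings from two data sources collapse to a single molecule.
print(unify(["udp-glcnac", "UDP-N-acetylglucosamine", "unknown compound"], lookup))
```

Once every name is reduced to one key, asking whether a molecule appears in a pathway becomes a simple set lookup.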

So, in essence, then what you're doing when you're doing this kind of analysis, you're calculating an overrepresentation of a subset of molecules, whether they are genes or metabolites or proteins in a class rather than looking at the entire space

You're actually asking, in a particular class, in a particular pathway, how well represented are these genes, rather, like I said, than going to the total space available. Okay?

So, we do the hypergeometric testing with multiple test correction, and we give you corrected P values to generate a table that contains an actionable list of pathways that you can now go and explore
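The talk does not say which multiple test correction is applied. As one common choice, and purely as an assumption, the sketch below uses the Benjamini-Hochberg procedure to turn a list of raw pathway P values into corrected ones; the function name and the input values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the same order as the input."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so each adjusted value is capped
    # by the adjusted values of all larger P values (keeps them monotone).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

The corrected values are what you would put in the table next to each pathway before deciding which ones to explore.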

But, Excel is great for making tables and for providing P values, but it's really not a good environment in which to look at this data in a way that is interesting and understandable to biologists

So, what we wanted to do is to be able to click on any of those pathways listed on the table and to give you a graphical view of the data. Okay? So, let's take a look at what that looks like with Q.'s data. Again, I apologize because the more interesting biological pathways in the data I'm not able to talk to you about

So, I'm going to choose a fairly mundane example that is overrepresented. I mean, I'm not making this up. This is the data. This is what the data looks like, but it really isn't one of the pathways that would perhaps produce some of the more intriguing targets to go after

And the pathway in question that I'm talking about here is peptidoglycan biosynthesis, or cell wall biosynthesis, in TB

Now, we have the canonical representation of this pathway. And it has X number of entities composing it. The question is, there are many molecules in that pathway. And there are many paths in that pathway. Which of those paths are interesting to me because, certainly, not the whole thing is interesting? Maybe only some of the genes are being affected by the drug treatment. Some of the proteins are being affected. Maybe some of the metabolites are changing. What I need to know is, what is interesting about it?

And I'm just going to quickly go through some of the representations here of what the data looks like. So, this is actually from GeneSpring using Q.'s data. We are looking at a subnetwork of the pathway from wikipathways.org in GeneSpring

The items that you can see highlighted in yellow, I think, are the ones that are significantly down regulated under all drug treatments

The new bar graphs that you see next to each of the genes are actually showing you the expression changes as you change treatment. In some cases, the bars are pointing up. In some cases, the bars are pointing down simply because of a normalization effect

If you want to look at the metabolites, they can also be represented like this in a similar way. And if you want to look at all of the items together, you can just bring it all up into one view

And this allows you then in one view to actually have all your data consolidated for two omics measurements, metabolomics and transcriptomics, and to say to you these metabolites are interesting. These genes are interesting

And what it allows you to do in this particular case is to generate hypotheses to go measure the corresponding proteins, for example, the proteins that are expressed by those genes

This slide here is just to convey to you that GeneSpring--we'll refer to it internally sometimes as GeneSpring Integrated Biology

GeneSpring Integrated Biology is not a product per se. GeneSpring Integrated Biology is a capability. GeneSpring has multiple modules, GeneSpring GX, GeneSpring MPP, GeneSpring PA

All of those act on a specific type of omics data. And they represent back to you pathways that are derived from the analysis of that data

But, if you combine two or more of those modules, then you get a slightly different capability, which is what we call multiomics or integrated biology

MPP, of course, is one of those modules. Most of you are probably familiar with it. So, it lives within GeneSpring. I don't need to spend time telling you about this. You probably know more about it than I do

But, as I said, the ultimate goal for the creation of these tools is to actually allow you to expedite cycle number one, that is, to accelerate discovery research by allowing you to project the data onto pathways, which is a useful metaphor if you're working with biologists, and then to move quickly into the second phase of the experimental enterprise, which is hypothesis-driven experimentation, in this case, proteomics

So, what is coming for GeneSpring? Those of you who are going to be attending the Skyline workshop at ASMS will find out that we have a very strong collaboration there. In fact, we gave an award to Mike MacCoss and his team as part of the integrated biology program so that we can tightly integrate the output of GeneSpring because, if you think about it, the output of GeneSpring IB is either a list of genes or a list of proteins

And if you want to go make measurements in mass spectrometry, what you need to do is to have a way to integrate that with an MRM builder

So, we are working closely with Mike MacCoss and the team to simply take the output of GeneSpring, make it the input of Skyline, generate methods that can be run quickly on a QTOF or a triple quad, and then drive the whole validation experiment

So, I really encourage you to go to the Skyline users' meeting, if you can, to learn more about this

Again, because of the time that we have to do the presentations, it's not sufficient to go in depth into the data. The data is not mine

But, I also wanted to point you to a couple of recent publications--oops, this is on autopilot--that give you an idea of how people in the field, other colleagues of yours, are already doing integrated biology, combining multiomics measurements. Achilles [sp] span these references on the left. That is a particular case of proteomics combined with transcriptomics

And then there's a really nice paper recently published by Mike Snyder and his team at Stanford, where he combined a multitude of omics technologies to construct what he calls the integrative personal omics profile, you know, for trying to have predictive measurements with diagnostic value

With that, I wanted to thank you. Again, I apologize for not being able to go to any depth. I just wanted to give you an overview of what we're trying to do

And I wanted to also acknowledge some of the people: outside, our collaborators, Q. Re and his team; and at Agilent Technologies, Ben Gordon, Michael Janice, Steve Fisher, Teo San [sp], and Chris Miller. And I encourage you to actually learn more

We have a brand new integrated biology Website that is going to be launching on the 2nd of April. So, please be on the lookout for that. All these presentations, all the information about the products is going to be there. And some of that, actually, you can get a preview at the hospitality suite at this meeting

And with that, I wanted to thank you again
