So, what I thought to do today was share a little bit with you about our experiences sequencing primarily tumor samples. So, at OICR we're fortunate enough to have, you know, all of the toys and--with all the bells and whistles, and we can really select, you know, the various technology to go along with the various projects that we want to do and, you know, kind of use our expertise in sequencing to try and optimize the type of data we're receiving.
And mainly what we sequence are tumor samples, which come in all sorts of shapes and sizes. Largely we're dealing with biopsy samples, which can have either very little material or quite a lot. We're also dealing with cellularity issues: tumors are extremely heterogeneous, both in terms of normal stromal contamination within the tumor itself and in terms of sub-clones within the tumor that can have different genetic profiles.
And then, you know, the big thing that we're dealing with is pathologists' favorite tool, formalin-fixed material, which can cause all sorts of problems with our DNA samples. And so, these are the types of things that we're trying to deal with on a daily basis sequencing tumors.
And now, there are several different target enrichment methods out there. And the way we look at it is you pick your enrichment method based on the needs of your project. So, for a large-scale project, you might be looking at lots of targets and a large number of samples, using things like exome sequencing. But, with more of our clinical work, we've been using smaller targets, and things like 100 percent coverage and fast turnaround become more important, as does being able to accept variable input from, say, small biopsies.
So, what I thought I'd do is I'd touch on two projects that we're currently working on, one large scale and one clinical project.
So, we're part of the International Cancer Genome Consortium, which is a worldwide project spanning 13 countries. Our goal is to sequence some 20,000 tumors of various types.
In Canada, we're part of the Pancreatic Cancer Project, and so currently we're trying to get about 375 pancreatic cancer tumors. These are surgically resected, mostly in Toronto; some come from the US as well. A chunk of the tumor gets stored in our biobank for long-term future studies, but we also take the primary tumor and grow it up in mouse xenografts so that we can have lots of tumor material. And then, we do all sorts of genome analysis on these samples.
So, one of these methods is we're using the SureSelect 50-megabase target to do exome sequencing. We generally do about one HiSeq lane per exome, which gives us about 200x coverage, and it's extremely important to have high coverage when you're dealing with these heterogeneous tumor samples.
To give you an idea of the number of variants that we're seeing: each individual line on the bottom is a different sample, and this represents the number of predicted-to-be-deleterious somatic variants per sample; we're seeing about 18 to 30 on average. Interestingly, if you look at the cellularity of these samples, in the higher-cellularity samples we see more variants, whereas in some of the samples there's not much tumor in the chunk itself, and we're not seeing as many variants. So, this is a bit of a headache, and we're trying to tease this out.
When you start to look at the number of genes that are mutated in these samples, the typical candidates pile up, K-ras, p53 and so on, but we actually have this tail of over 1,000 genes that we're seeing somatic variants in. So, you can imagine that this is a lot of work to try and follow up on all of these pathways and do various types of analysis.
And even just trying to validate variants in tumor samples becomes a bit of a headache. Here's a quick example where we PCR-amplified between 20 and 25 variants per sample and sequenced them using Sanger. We did other methods like Lapack Biowin [sp] as well. We can see that our validation rate differs depending on the depth of coverage and the type of sequencing technology.
Lee Tims [sp] has a poster on Thursday that talks about this whole verification pipeline and the problems associated with doing this in tumor samples.
So, I mean, this is a large-scale project, and I think the exome sequencing is proving to work really well. I think the key is to actually get lots of depth per sample, but the analysis is ongoing and will continue for the next several months.
So, to change gears a little bit, we're also looking at more of a clinical model to research cancer, and I think cancer is the perfect disease for this type of approach because, obviously, it's a genetic disease, but we also actually know a lot about somatic mutations that are predicted to benefit from targeted therapy. And so, with the latest in desktop or rapid sequencing technologies, we thought to undertake a project where we could sequence things and turn them around in a quick manner.
So, to give you a bit more background on why we thought of doing this: if we just look at the number of somatic mutations in various cancer genes across various cancer types, we can see that while we are seeing the traditional ones, like BRAF mutations in melanoma samples or K-ras in colorectal, we're also seeing the same variants across multiple cancer types. So, the idea is that you could potentially think of tumors not purely based on what type of tumor it is, but more on the mutational load that it carries, and so there's an opportunity to try and link treatments with these given biomarkers. And so, the challenge is, well, will this work, and what kind of clinical trial design do we need to actually show that it will work?
So, if you look at the current literature out there, there really aren't many known actionable mutations, that is, mutations that oncologists would actually consider guiding therapy based on. They really fall into only a handful of genes. And so groups at the Broad, the Dana-Farber, MD Anderson, and so on have started using genotyping to look at these mutations and have screened between 200 and 1,000 samples. And what they're seeing is that around 30 to 40 percent of patients carry one of these known actionable mutations that can actually predict a therapeutic response.
So, with this in mind, we set up to create our own trial, where we recruited people with advanced metastatic disease. We didn't care what type of tumor they had, as long as they were a potential candidate for a clinical trial. Obviously, we needed to be able to biopsy their disease, and we needed adequate material.
To give you a bit of an overview of how this trial works: a patient comes in, gets consented for the trial, and undergoes a biopsy through radiology; we get blood as well so we can screen their germline genome. This all remains in the hospital clinical setting through tissue processing and DNA extraction. Up here, the sample gets split. We always keep some sample in the CLIA lab so that anything can be validated without leaving the clinical setting. We receive an aliquot of the sample and do targeted sequencing, and everything, as I mentioned, gets verified either by Sanger or Sequenom genotyping. This produces a final report that goes to an expert panel of clinicians and scientists, who review it and get it back to the patient. The key thing here is this whole pipeline: our goal is to keep it under 21 days.
So, you can imagine that, with the capabilities in next-gen sequencing and all the various assays and tests that we could do, you could envision reporting back something crazy and convoluted, with the patient's genome and their metabolome and all sorts of network and pathway analysis, but obviously this is just going to be completely overwhelming. And so, here's an example of the type of report that we send back to the physician. It has the name of the somatic variant, any relevant background information about the variant, the frequencies at which it's been seen in other cancer types, and any characteristics and potential therapies that are associated with it.
So, to give you a bit of overview, we're up to about 50 patients right now. We've been recruiting them from about five sites across Ontario. You can see that the primary diagnosis varies. As I said, we're cancer agnostic. But, what we have noticed is that the amount of material we get per biopsy and per site varies quite drastically. So, all of the biopsies get immediately fixed in formalin so that pathology can confirm the diagnosis and estimate cellularity. And then, we also, when we can--if they have a biopsy from their original diagnosis, we go and try to get that sample as well.
So, you can see that for these fresh, small biopsies we're not quite getting a microgram of DNA. Obviously, from blood we're getting loads of DNA. And if we're lucky, we're getting on average just over a microgram from these archival samples.
The initial way we did this is we looked at things like the OncoCarta gene list. This is the Sequenom genotyping assay that covers 238 known somatic variants. So, what we did is we PCR-amplify any exon that contains one of these variants. It's about 70 exons in total. After much optimization, we were able to multiplex this into 19 reactions.
We are continuing to expand, but one of the problems is that just to amplify all the coding sequence for these 19 genes takes 400 amplicons, and that multiplexing has become a bit of a headache. We plan to expand to about 200 genes in the next year or so, but to give you an example, to PCR those 200 genes, the coding sequence is around 6,000 amplicons. So, you can see that PCR, while it's working really well, is just not scalable to increase our target size.
I mean, that being said, we're still finding somatic actionable mutations in about 30 percent of our patients. In addition, we've also found several novel somatic mutations that we've been following up with functional assays, and they seem to be interesting.
John McPherson is going to be talking about this project as a whole, and the plans going forward, on Friday, if anyone's interested.
So, the initial benchmark of the project was that we would find a variant in at least 30 percent of patients, and so the clinicians on the project are very happy because we're meeting our goals. But, as a geneticist, I see that we're missing the mark in 70 percent of our patients. I think it's probably because our target's not big enough yet, so this is where we're trying to expand. And one of the ways we're doing this is using the HaloPlex system.
So, our initial assay design was the complete coding sequence of these 19 genes; it's about a 61 kb target. And we've been pairing this with MiSeq 2x150 base pair reads. The result is that we're getting about two gigabases of raw data, with almost 100 percent of mapped reads on target. And the nice thing is that we can go from library to sequencing analysis in just around three days, which fits nicely with our timeline of keeping it within 21 days.
So, I mentioned it's 19 genes; that's about 321 exons total. And you can see that, using the HaloPlex design report, they expected we'd have about 98 percent coverage, assuming Illumina 2x100 HiSeq runs. But with the MiSeq and the 150 base pair paired-end reads, we're actually getting over 99 percent coverage of our targets.
So, we've tried to throw as many different samples at this as possible: things with high and low DNA content, FFPE samples. The blue line here is a mouse sample; we sequenced a mouse to see how well that would work. And you can see that our average coverage is about 600x, and that even at 50x we're getting over 90 percent of the targets covered.
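Metrics like these, mean depth and percent of target covered at 50x, fall straight out of per-base depth counts. Here's a minimal sketch of computing breadth of coverage; the function name and the depth values are illustrative, not from the actual pipeline described in the talk.

```python
def breadth_of_coverage(depths, threshold=50):
    """Fraction of target bases covered at or above `threshold`x.

    `depths` is a per-base depth list, e.g. as produced by a tool
    like `samtools depth` over the target regions.
    """
    covered = sum(1 for d in depths if d >= threshold)
    return covered / len(depths)

# Toy example: 8 of 10 bases are at >= 50x, so breadth is 0.8.
depths = [610, 580, 45, 700, 52, 50, 49, 120, 300, 90]
print(breadth_of_coverage(depths))  # 0.8
```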
I mentioned that we've got it down to 100 nanograms of DNA, and we don't see a significant drop-off in coverage going down to lower input. If we divide it by the type of sample: here I have archival samples (some of these FFPE blocks are five to seven years old), blood, the fresh FFPE samples from the biopsy, whole genome amplified material, which we've started looking at, and some of these mouse xenograft samples with the corresponding mouse and so on. And really, we're seeing a slight shift in the coverage curve for the FFPE samples, but still, overall, we're getting 500 to 600x coverage and well over 90 percent at 50x.
This just shows some more examples comparing formalin-fixed samples and blood at various input levels. And as I said, the formalin-fixed samples are showing a little less coverage, but still enough that we're happy with it. And here's the reproducibility: this is the same sample done 10 times, and you can see that some of these are within the same reaction runs and some are across multiple reactions, and we're getting nice reproducibility.
Here's just a look at the coverage of these 321 targets. So, focus on the left for now. These are all standard normal samples ranging between 100 and 900 nanograms of input. But you can see that there are patterns: the coverage is consistent across multiple experiments, with peaks in the same areas and some valleys.
Interestingly, the only samples where we start to see differences in the coverage are the whole genome amplified samples and the mouse sample. Obviously, we know that whole genome amplification is not consistent across the entire genome and preferentially amplifies certain areas. But we're still getting coverage of every target.
So, more importantly for us, we wanted to look at the variants that are being called. No matter what sequencing technology we use, we try to run everything through a standard pipeline where we can just take BAM files and call variants, so that everything gets processed the same way no matter what we're doing. So, if we look at just the number of single nucleotide variants, you can see that across the various input levels it's consistent. Where we start to see an increase in the number of variants is when we look at the xenograft and the mouse model, and I'll touch on that a little bit later.
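The real pipeline runs established callers on BAM files; purely to illustrate the idea of calling a variant from read counts at a single position, here's a toy allele-fraction caller. The thresholds and function name are my own illustrative choices, not the project's; production callers also model base quality and mapping error.

```python
def call_genotype(ref_count, alt_count, min_depth=20, het_min=0.15, hom_min=0.85):
    """Naive genotype call from ref/alt read counts at one site.

    Returns 'ref', 'het', 'hom', or 'low_depth'. This is only a
    sketch of the principle, not a real variant caller.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return 'low_depth'   # too little evidence to call anything
    vaf = alt_count / depth  # variant allele fraction
    if vaf >= hom_min:
        return 'hom'
    if vaf >= het_min:
        return 'het'
    return 'ref'

print(call_genotype(40, 35))  # het (VAF ~0.47)
print(call_genotype(2, 98))   # hom (VAF 0.98)
```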
So, we had a lot of samples with known variants; this particular sample had a known K-ras variant, Q61H I believe. And you can see that testing the FFPE sample at 900 nanograms, at 100 nanograms, and even whole genome amplified, we can still clearly identify the variant at around 60 percent, which is the proper frequency.
In addition, where I get happy is that now we've actually started to identify somatic variants that previously were unknown in some of these samples. In this example, we had a heterozygous reference, and there's been a loss of heterozygosity in the primary sample. And we've actually confirmed this by Sanger sequencing; this is a real variant that we're now following up functionally.
So, one of the advantages of growing these tumors up in xenografts is that we can get a lot more tumor tissue and we can increase the cellularity. This particular sample, based on deep sequencing, we estimated was only about 4 percent tumor. And we can see that at about 4 percent we saw this variant, which is what we would expect, whereas in the xenograft model, where we've got a lot more tumor material, we can amplify this variant up to about 73 percent, which makes things a lot easier to call. Interestingly, in the mouse sample there was a depth of about five there, but we didn't see any variants.
But this can also be a bit of a detriment, because we often see things like this, and this is the reason we're seeing more variants in the mouse and xenograft models. If we look at the primary sequence, we'd see that nothing was called, whereas in the mouse we see these homozygous variants that in the xenograft get called as heterozygous; these are simply interspecies differences between the genomes. In this particular region, chromosome four in human and chromosome five in mouse, there happen to be four differences.
And so, we've done a lot of work to create databases of these interspecies variants, as we're calling them, that can be filtered out. And when we do that, we can see that the number of variants goes back down to about the same for all sample types.
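Conceptually this filtering step is a set lookup: any call whose position and allele match the known human/mouse difference database gets dropped. A minimal sketch, where the record format, names, and the specific coordinates are made up for illustration:

```python
# Known human/mouse sequence differences, keyed by (chrom, pos, alt allele).
# In a real pipeline this database would be built from sequenced mouse
# samples; these entries are invented for the example.
interspecies_db = {
    ('chr4', 101220345, 'T'),
    ('chr4', 101220398, 'G'),
}

def filter_interspecies(calls, db):
    """Drop variant calls that match a known interspecies difference."""
    return [c for c in calls if (c['chrom'], c['pos'], c['alt']) not in db]

calls = [
    {'chrom': 'chr4', 'pos': 101220345, 'alt': 'T'},  # mouse-contamination artifact
    {'chrom': 'chr12', 'pos': 25398284, 'alt': 'A'},  # retained somatic call
]
print(len(filter_interspecies(calls, interspecies_db)))  # 1
```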
We've done some mixing experiments where we take cell line DNA from a tumor and mix it in various ratios with the corresponding reference material to see whether we can still detect variants at various cellularities. So, you can see here that at this point we have only 25 percent tumor and 75 percent reference, and we're still able to detect the variant at a rate of around 15 percent, which is roughly what you'd expect for a heterozygous variant.
For loss of heterozygosity, we see the opposite pattern: the variant is at about 50 percent, as for a heterozygote, when it's 100 percent reference, and it goes down to about zero when it's all cell line.
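Both curves follow from simple mixing arithmetic, under the assumption of diploid, copy-neutral events: a heterozygous somatic variant appears at half the tumor fraction, and a germline heterozygous allele lost in the tumor appears at half the non-tumor fraction. These formulas are my own back-of-envelope model, matching the rough numbers above:

```python
def expected_vaf_somatic_het(tumor_fraction):
    """Expected allele fraction of a heterozygous somatic variant in a
    tumor/normal mixture, assuming a diploid, copy-neutral genome."""
    return tumor_fraction / 2

def expected_vaf_germline_het_loh(tumor_fraction):
    """Expected allele fraction of a germline het variant whose alternate
    allele is lost in the tumor (copy-neutral LOH assumption)."""
    return (1 - tumor_fraction) / 2

print(expected_vaf_somatic_het(0.25))      # 0.125, near the ~15% observed
print(expected_vaf_germline_het_loh(0.0))  # 0.5 at 100% reference
print(expected_vaf_germline_het_loh(1.0))  # 0.0 when it's all cell line
```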
So, finally, recently we had some early access to the six-hour protocol, and the nice thing about this for us is that, in our hands, we're able to go in one day: get the sample in the morning, start the library prep, and have it on the sequencer that afternoon. And we can see that the coverage so far, in the samples we've done, is even more consistent and just as high, almost 95 percent at 50x. It's very reproducible; there's a third line under here that you can't even see because it's exactly the same. And across various sample types at various inputs, we're still seeing this very consistent coverage. So, we're very happy that now we can fit our whole protocol into a single day.
So, I guess I'll conclude. There's a lot of talk, as the cost of sequencing gets lower and lower, about why we don't just sequence everyone's whole genome, or everyone's exome, and then extract the data corresponding to the targets of interest.
So, here's a comparison: this is the coverage across these 321 targets for one sample, but I also took whole genome and exome data for this same sample. And we can see that, yes, the whole genome seems to be a little more consistent. The exome is a little more variable: some areas are higher, some areas are not covered at all. I guess the big thing is time and cost. We're talking thousands versus hundreds of dollars, but also weeks to months of actual library prep, sequencing, and analysis, versus the roughly two days we have this down to.
In addition to that, obviously, we'd be dealing with huge numbers of variants when you start looking at a whole exome. These are just numbers I averaged from some of the recent exome and whole genome papers: we're talking about 75 to 200 potentially damaging mutations in a whole exome, and this goes up to probably close to a thousand in a whole genome. So, the headache of trying to deal with all this information just becomes completely unrealistic in a clinical setting if you want to turn it around in a time that would be reasonable to affect patient care.
So, to summarize: tumor samples are difficult for a number of reasons. We can have variable quantities, and we can have lots of heterogeneity, both within the sample and among samples, so the enrichment method for your given project is going to really depend on the needs: do you have a large target or a small target, how fast, how cheap, and do you want consistent coverage or 100 percent coverage? And so far we've been very happy with this HaloPlex approach. It's been rapid; we can get things running in a short period of time. And when we combine it with a clinical lab to validate any findings we make, it could be well integrated into a clinical trial setting.
And so, with that, I'd like to thank everyone at OICR and our collaborators at Halo, and to recognize the government of Ontario for base funding. I'd be happy to take any questions.