Ranking Tumor Phylogeny Trees by Likelihood


1 Ranking Tumor Phylogeny Trees by Likelihood
Dikshant Pradhan

2 Cancer Phylogenetics
Cancer is characterized by rapid cell division and mutation.
This leads to many heterogeneous subclonal populations within a tumor.
Driver mutations are causal in the adaptation and spread of cancerous cells.
Passenger mutations have no functional consequence.
Research interest in identifying driver and passenger mutations leads to interest in tracing the mutational patterns of tumors.
Deconvolution of the cell mixture and construction of the phylogeny:
Taxa cannot be observed directly in tumor samples because of heterogeneity.
Taxa must be deconvolved from the samples.
Notes: cancer cause; driver and passenger mutations; deconvolution.

3 Review of Tree Estimation Strategies
Infinite sites assumption
Each SNV (mutation) arises only once.
Different mutations do not occur at the same location.
Topological constraint rules
Ancestor Condition: an ancestor mutation must have a frequency equal to or higher than that of each of its descendants.
Sum Condition: if the phylogeny branches, the ancestor mutation must have a frequency at least as high as the sum of the frequencies of its direct descendants.
Crossing Rule: if the frequency of one mutation is not greater than or equal to that of another in every sample, then it cannot be an ancestor of that mutation.
Mudaliar, M. (2015, June 12). Variant (SNP) calling - an introduction (with a worked example, using... Retrieved April 04, 2018, from
Jiao, W., Vembu, S., Deshwar, A. G., Stein, L., & Morris, Q. (2014). Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics, 15(1), 35. doi: /
Notes: SNV terminology; infinite sites and homoplasy; topological constraints and examples; infinite sites is related to no-homoplasy.
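The three topological rules can be sketched as simple checks on per-sample mutation frequencies. This is a minimal illustration, not code from any of the cited tools: the function names, the list-of-frequencies layout, and the small tolerance `EPS` are all assumptions made for the example.

```python
from typing import Sequence

EPS = 1e-9  # assumed tolerance for floating-point comparisons


def ancestor_condition(parent: Sequence[float], child: Sequence[float]) -> bool:
    """True if the parent frequency is >= the child frequency in every sample."""
    return all(p + EPS >= c for p, c in zip(parent, child))


def sum_condition(parent: Sequence[float],
                  children: Sequence[Sequence[float]]) -> bool:
    """True if, in every sample, the parent frequency is >= the summed
    frequencies of its direct descendants."""
    return all(p + EPS >= sum(cs) for p, cs in zip(parent, zip(*children)))


def crossing(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if neither mutation dominates the other across all samples,
    so neither can be an ancestor of the other."""
    return not ancestor_condition(a, b) and not ancestor_condition(b, a)


# Hypothetical frequencies for three mutations in two samples.
freq_a = [0.8, 0.6]
freq_b = [0.5, 0.2]
freq_c = [0.2, 0.3]

a_over_b = ancestor_condition(freq_a, freq_b)       # A may be an ancestor of B
a_branches = sum_condition(freq_a, [freq_b, freq_c])  # B and C fit under A
b_c_cross = crossing(freq_b, freq_c)                # B and C must be siblings
```

In this made-up example B and C cross (B is more frequent in the first sample, C in the second), so they can only appear on separate branches, and the sum condition confirms that both fit under A.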

4 Project Focus
Methods of cancer phylogeny estimation attempt to infer the different taxa in the tumor as well as their proportional presence in each sample.
Hypothesis: by calculating and comparing the likelihood that each solution produces the input set of variant allele frequencies, we can rank different solutions.

5 SPRUCE
Model allows for mutations that alter the copy number of genes.
Returns all possible solution trees.
Given: variant allele frequency for each SNV in each sample.
Output: full solution space for the given data.
Allows comparison of all possible trees constructed from the same data set.

6 Approach
For each solution, calculate the probability that the inferred tree produced the observed VAF:
Iterate through each sample.
Iterate through each SNV.
Iterate through each state.
Sum, over the states, the total variant alleles divided by the total alleles, times the frequency of the state.
Compute the binomial probability of the solution: P = Binom(observed VAF | # of reads, inferred VAF).
Expect more accurate solutions to have higher likelihoods.
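The loop above can be sketched as follows. This is a hedged illustration rather than the author's implementation: the `(variant_alleles, total_alleles, state_frequency)` tuple layout, the function names, and the choice to combine SNVs and samples by summing log-probabilities (which treats them as independent) are assumptions, since the slide does not say how the per-SNV probabilities are combined.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical layout: (variant_alleles, total_alleles, state_frequency)
State = Tuple[int, int, float]


def inferred_vaf(states: Sequence[State]) -> float:
    """Sum over states of (variant alleles / total alleles) * state frequency,
    following the wording of the slide."""
    return sum(freq * var / tot for var, tot, freq in states if tot > 0)


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial point probability of k variant reads out of n."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite at p = 0 or 1
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def solution_log_likelihood(
    observed: List[List[Tuple[int, int]]],  # [sample][snv] -> (variant_reads, total_reads)
    solution: List[List[Sequence[State]]],  # [sample][snv] -> states of that SNV
) -> float:
    """Assumed combination rule: add log-probabilities over samples and SNVs."""
    total = 0.0
    for obs_sample, sol_sample in zip(observed, solution):
        for (k, n), states in zip(obs_sample, sol_sample):
            total += binom_log_pmf(k, n, inferred_vaf(states))
    return total


# One sample, one SNV, 20 variant reads out of 100.
observed = [[(20, 100)]]
close_solution = [[[(0, 2, 0.6), (1, 2, 0.4)]]]  # inferred VAF 0.2
far_solution = [[[(1, 2, 1.0)]]]                 # inferred VAF 0.5
ll_close = solution_log_likelihood(observed, close_solution)
ll_far = solution_log_likelihood(observed, far_solution)
```

Working in log space avoids underflow when many SNVs and samples are multiplied together; solutions can then be ranked by sorting on the returned value.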

7 Results
Sample analysis of a solution space
The true solution has a recall of 1 and a relatively high likelihood compared to the rest of the solutions. What's interesting is that incorrect solutions also have a high likelihood. So this method cannot isolate the true solution, but what it can do is clear out a great deal of the incorrect solutions and effectively reduce noise.

8 Cumulative Binomial versus Point
One thing that I want to note is that I looked at two different ways of computing likelihood: the regular (point) binomial probability and the cumulative binomial probability. What I saw was that the point binomial probability was not very reliable in its predictions. The overall likelihoods were much lower, and the true solution was often grouped in the middle, as you can see here. When I tried out the cumulative binomial, the results changed to what you can see here.
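To make the two quantities concrete, here is a small sketch of the point and cumulative binomial probabilities. The talk does not say which tail the cumulative version used, so the lower tail P(X <= k) below is an assumption, as are the function names. The example also shows one property that may be relevant to the observation above: the point probability of any single read count shrinks as read depth grows, even when the observed count matches the expected one exactly.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Point probability of exactly k variant reads out of n.
    Direct formula; adequate for moderate n, use logs for very deep coverage."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """Cumulative probability of at most k variant reads (assumed lower tail)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Observed count equals the expected count (VAF 0.2) at two read depths.
point_50 = binom_pmf(10, 50, 0.2)
point_1000 = binom_pmf(200, 1000, 0.2)
cumulative_50 = binom_cdf(10, 50, 0.2)
cumulative_1000 = binom_cdf(200, 1000, 0.2)
```

The cumulative values stay on a comparable scale across read depths, whereas the point values depend strongly on the number of reads.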

9 Trends
I ran this method on a few different simulated datasets, which differed in the number of samples and reads that were taken. The true trees were also different in these cases. Because the true tree used to generate the samples was different in each case, I can't draw any real conclusion about how the number of reads or samples affects the output. I also looked at the position of the true solution's likelihood within the range of likelihoods across all solutions in each case, and I saw that the true solution was, as I said, always in the 90th percentile of solutions. It seemed to go higher in confidence with more reads, but again, I can't say for sure without comparing against simulations generated from the same tree.

10 Trends
Number of solutions:

              50 reads   100 reads   500 reads   1000 reads   10000 reads
2 samples       205267       54044        9170         6227           791
5 samples         2832        1572         292           92            12
10 samples: 40, 33, 2 (only three values survive in the transcript; their columns are not recoverable)

Another thing that I looked into was the percent of solutions that had a higher likelihood than the true solution. I saw that, at best, only around 12% of possible solutions had a higher likelihood than the true solution. Then, as the number of reads increased, that number increased. However, an important thing to note is that this is because there are fewer possible solutions when you take more reads.
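The statistic described here, the share of solutions that score above the true one, can be computed as in the sketch below. The function names and the example log-likelihoods are made up for illustration; only the definition of the statistic comes from the talk.

```python
from typing import List, Sequence


def fraction_above_true(likelihoods: Sequence[float], true_index: int) -> float:
    """Fraction of all solutions whose likelihood is strictly higher
    than that of the known true solution."""
    true_value = likelihoods[true_index]
    higher = sum(1 for value in likelihoods if value > true_value)
    return higher / len(likelihoods)


def keep_top_fraction(likelihoods: Sequence[float], fraction: float) -> List[int]:
    """Indices of the highest-likelihood solutions, keeping at least one.
    A hypothetical filter for discarding low-likelihood solutions."""
    n_keep = max(1, round(len(likelihoods) * fraction))
    ranked = sorted(range(len(likelihoods)),
                    key=lambda i: likelihoods[i], reverse=True)
    return ranked[:n_keep]


# Made-up log-likelihoods for five solutions; index 2 is the true tree.
log_likelihoods = [-10.0, -3.0, -5.0, -4.0, -20.0]
share_above = fraction_above_true(log_likelihoods, true_index=2)
survivors = keep_top_fraction(log_likelihoods, fraction=0.6)
```

With so few solutions the share moves in large steps, which mirrors the caveat above: when deeper sequencing leaves only a handful of candidate trees, a single better-scoring tree is already a large percentage.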

11 Limitations
Incorrect solutions can have a higher likelihood than the true solution.
The likelihood threshold needed to keep the true solution is hard to predict.
Need to know the full range of solutions to know which solutions are definitively incorrect.

12 Summary
Problem: Rank different solutions to the same deconvolution problem.
Approach: Estimate likelihood based on binomial probability.
Results: May be good for eliminating noise, but cannot isolate the true solution.
Limitations: With enough reads, the solution space is small enough that this approach provides no benefit.
Future Directions: Incorporate likelihood estimation and correction into the solution calculation.

13 References
El-Kebir, M., Satas, G., Oesper, L., & Raphael, B. J. (2016). Inferring the Mutational History of a Tumor Using Multi-state Perfect Phylogeny Mixtures. Cell Systems, 3(1), doi: /j.cels
Jiao, W., Vembu, S., Deshwar, A. G., Stein, L., & Morris, Q. (2014). Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics, 15(1), 35. doi: /

14 Input and Assumptions
Deep sequencing:
Sequence particular regions of DNA hundreds or thousands of times.
Allows detection of rare clonal types within a sample.
Observe the frequency of mutations within each sample.
Assumptions:
Clonal evolution model: all cells in the tumor are derived from ancestors, and mutations that confer advantages will proliferate.
All tumor cells are derived from a single wild-type clone.
Infinite sites.
The copy number of SNVs is given as input.
Assume that each SNV has the same copy number.
Notes: Deep sequencing is the process of generating a high number of reads for each region in a genome from a sample, which makes it possible to observe rare mutations. Assumptions; copy number. Deep sequencing is necessary for tumors because of their heterogeneity. The authors suggest further (whole-genome) sequencing for subclonal lineages.
Image: El-Kebir, M., Satas, G., Oesper, L., & Raphael, B. J. (2016). Inferring the Mutational History of a Tumor Using Multi-state Perfect Phylogeny Mixtures. Cell Systems, 3(1), doi: /j.cels
What is a copy number variant, and why are they important risk factors for ASD? (n.d.). Retrieved April 04, 2018, from

15 Approach
Infinite sites assumption
Each SNV (mutation) arises only once.
Different mutations do not occur at the same location.
Topological constraint rules
Ancestor Condition: an ancestor mutation must have a frequency equal to or higher than that of each of its descendants.
Sum Condition: if the phylogeny branches, the ancestor mutation must have a frequency at least as high as the sum of the frequencies of its direct descendants.
Crossing Rule: if the frequency of one mutation is not greater than or equal to that of another in every sample, then it cannot be an ancestor of that mutation.
Mudaliar, M. (2015, June 12). Variant (SNP) calling - an introduction (with a worked example, using... Retrieved April 04, 2018, from
Notes: SNV terminology; infinite sites and homoplasy; topological constraints and examples; infinite sites is related to no-homoplasy.

16 Algorithm
Input:
Read counts for each SNV in each sample
Copy-number status for each SNV
Process:
Group SNVs into sub-lineages
Output:
"Partial order plot"
Represents posterior uncertainty in the phylogeny
Notes: overview; explain each tree; edge weights, partial order plot. They do not compare every SNV; they sample the SNV space for a set number of iterations to generate relationships.

