1 Viral Quasispecies Reconstruction from Amplicon 454 Pyrosequencing Reads
Nicholas Mancuso, Department of Computer Science, Georgia State University
Joint work with Bassam Tork (GSU), Pavel Skums (CDC), Ion Măndoiu (UConn), Alex Zelikovsky (GSU)
CAME 2011, Atlanta, Georgia

2 Viral Quasispecies and NGS
RNA viruses
- HIV, HCV, SARS, influenza
- higher mutation rates (than DNA)
- → quasispecies: a set of closely related variants rather than a single species
Knowing the quasispecies can help
- interferon HCV therapy effectiveness (Skums et al. 2011)
NGS makes it possible to find individual quasispecies sequences
- 454 Life Sciences: 400-600 Mb with reads 300-800 bp long
Sequencing is challenging
- multiple quasispecies
- quasispecies (qsps) sequences are very similar
- different qsps may be indistinguishable for > 1 kb (longer than the reads)

3 Outline
- Shotgun vs Amplicon Sequencing
- Viral Quasispecies Reconstruction Problem
- Challenges and Approaches
- Data Structure for Reads: Read Graph
- Novel Methods for Solving QSR Problem
- Observed vs True Read Frequencies
- True Frequency Reconstruction
- Simulations and Results

4 Shotgun versus Amplicon Sequencing
Shotgun reads
- starting positions distributed uniformly
Amplicon reads
- each read has a predefined start and end, covering fixed overlapping windows
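To make the "fixed overlapping windows" idea concrete, here is a minimal Python sketch that lists amplicon window coordinates. The function name `amplicon_windows` and the numbers in the example (amplicon length 300, shift 250) are illustrative assumptions, not values taken from the talk.

```python
def amplicon_windows(genome_length, amplicon_length, shift):
    """Return (start, end) pairs for fixed amplicon windows.

    Hypothetical helper: consecutive windows start `shift` bases apart,
    so they overlap by amplicon_length - shift bases.
    """
    last_start = genome_length - amplicon_length
    return [(s, s + amplicon_length) for s in range(0, last_start + 1, shift)]


# Illustrative values only: a 1734-long fragment, 300-long amplicons, shift 250.
windows = amplicon_windows(1734, 300, 250)
overlap = 300 - 250
```

With these assumed values every read from the second amplicon spans positions 250 to 550, whereas shotgun reads could start anywhere.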

5 Viral Quasispecies Spectrum Reconstruction Problem
Given
- a collection of amplicon reads from a quasispecies population with unknown variants and distribution
Find
- the viral quasispecies sequences and their frequencies

6 Amplicon Sequencing Challenges
Collapse of quasispecies in an amplicon
- distinct quasispecies may be indistinguishable within a window
Collapse of quasispecies in an overlap
- reads from consecutive windows that come from the same qsp must be matched
First approach: Prosperi et al. (2011)
- guide distribution
- choose a column
- go right/left, matching the closest in-order neighbor
Example counts from the slide figure:
220 200 140 160 150
200 140 130 150 140
 70 130 120 140 130
 10  20 110 130 120
  0  10 100  20  60

7 Approaches to QSP Reconstruction
Shotgun approaches
- estimate the probability that consecutive reads come from the same qsp (ViSpA, Astrovskaya et al. 2011)
- parsimony: minimum number of distinct sequences covering all reads (ShoRAH, Zagordi et al. 2010)
Why not use shotgun approaches for amplicons?
- estimating the probability in ViSpA relies on a uniform distribution of reads
- amplicon reads have fixed beginnings and ends
Optimization approach
- most parsimonious solution
  - minimize the number of distinct sequences covering all reads
  - too coarse: many different optimal solutions
- minimum information entropy (Shannon, 1948)
  - also takes frequency into account
  - fractional relaxation of pure parsimony

8 Min Entropy vs Parsimony
Parsimony and Min Entropy both select AC and BD if a = c and b = d
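A small Python sketch of why entropy can break ties that parsimony cannot. The fork and its counts are hypothetical: left reads A = 70 and B = 30, right reads C = 60 and D = 40. Both candidate solutions use three variants and reproduce all four read counts, so parsimony alone cannot choose between them.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical fork: A = 70, B = 30 on the left; C = 60, D = 40 on the right.
# Each candidate is consistent with those counts, e.g. AC + AD = 70.
candidate_1 = {"AC": 60, "AD": 10, "BD": 30}
candidate_2 = {"AC": 30, "AD": 40, "BC": 30}

h1 = shannon_entropy(candidate_1.values())
h2 = shannon_entropy(candidate_2.values())
preferred = "candidate_1" if h1 < h2 else "candidate_2"
```

Under these assumed counts the first candidate has the lower entropy, so a minimum-entropy criterion would prefer it while pure parsimony sees a tie.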

9 Data Structure for Reads: Read Graph
K amplicons → K-staged read graph
- vertices → distinct reads
- edges → reads with consistent overlap
- vertices and edges have a count function
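The read graph described above could be sketched in Python roughly as follows. This is an illustration under simplifying assumptions, not the authors' implementation: reads are plain strings, consecutive windows share a constant overlap of at least one base, and only vertex counts are stored (edge counts are left out).

```python
from collections import Counter


def build_read_graph(amplicon_reads, overlap):
    """Build a K-staged read graph.

    amplicon_reads: list of K lists of read strings, one list per amplicon.
    overlap: number of bases shared by consecutive windows (assumed >= 1).
    Returns (stages, edges): stages[k] maps each distinct read to its count,
    edges holds (k, left_read, right_read) for reads with a consistent overlap.
    """
    stages = [Counter(reads) for reads in amplicon_reads]
    edges = []
    for k in range(len(stages) - 1):
        for left in stages[k]:
            for right in stages[k + 1]:
                # Consistent overlap: suffix of the left read equals
                # the prefix of the right read.
                if left[-overlap:] == right[:overlap]:
                    edges.append((k, left, right))
    return stages, edges


# Toy data (hypothetical reads): two amplicons overlapping by 2 bases.
reads = [["ACGT", "ACGT", "ACGA"], ["GTCC", "GTCC", "GACC"]]
stages, edges = build_read_graph(reads, overlap=2)
```

Here `stages[0]["ACGT"]` is 2, and the two edges connect each left read only to the right read whose prefix agrees with its suffix.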

10 Read Graph
The graph may be transformed into a 'forked' graph
- each overlap is represented by a fork vertex

11 Fork Resolving Problem
Minimum Entropy is NP-hard
- can be solved optimally for each small fork separately (future work)
Greedy heuristic
- at most a+b-1 matches are sufficient when resolving a fork with a distinct reads on the left and b on the right
- this can be done by greedily matching the largest remaining reads
- this does not guarantee the minimum number of distinct qsps
Better way: globally match the most frequent reads (max bandwidth)
- find an s-t path maximizing the minimum read count
- subtract the minimum count from each read in the path
- this exhausts at least one read in the path

12-20 Greedy Method (figure slides)
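The greedy fork resolution from the Fork Resolving slide can be sketched in Python as below. This is one reading of "greedily matching largest", assuming the left and right counts of the fork are positive and already balanced (equal totals); the function name and the example counts are hypothetical.

```python
def resolve_fork_greedy(left, right):
    """Greedily pair left and right reads of one fork.

    left, right: dicts mapping read -> positive count, with equal totals.
    Repeatedly matches the largest remaining left read with the largest
    remaining right read. Every step exhausts at least one read, so at
    most a + b - 1 pairs are produced for a left and b right reads.
    """
    left, right = dict(left), dict(right)
    pairs = []
    while left and right:
        l = max(left, key=left.get)
        r = max(right, key=right.get)
        matched = min(left[l], right[r])
        pairs.append((l, r, matched))
        left[l] -= matched
        right[r] -= matched
        if left[l] == 0:
            del left[l]
        if right[r] == 0:
            del right[r]
    return pairs


# Hypothetical fork: A = 70, B = 30 on the left; C = 60, D = 40 on the right.
pairs = resolve_fork_greedy({"A": 70, "B": 30}, {"C": 60, "D": 40})
```

For this toy fork the sketch yields three pairs, which meets the a+b-1 bound with a = b = 2.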

21-37 Maximum Bandwidth Method (figure slides)
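The maximum bandwidth idea (repeatedly find a path maximizing the minimum read count, then subtract that minimum along the path) can be sketched in Python with a stage-by-stage dynamic program. This is an illustration under assumptions: the graph is given as per-stage count dicts plus `(k, left_read, right_read)` edges, a layout chosen here for convenience, and tie-breaking is arbitrary.

```python
def max_bandwidth_paths(stages, edges):
    """Decompose a K-staged read graph into widest paths.

    stages: list of dicts read -> count, one per amplicon (K >= 1).
    edges: iterable of (k, left_read, right_read) between stage k and k + 1.
    Returns a list of (path, width), where width is the bottleneck count.
    """
    counts = [dict(s) for s in stages]
    edge_set = set(edges)
    results = []
    while True:
        # best[k][read] = (bottleneck of widest path ending here, predecessor)
        best = [{r: (c, None) for r, c in counts[0].items() if c > 0}]
        for k in range(1, len(counts)):
            layer = {}
            for r, c in counts[k].items():
                if c <= 0:
                    continue
                cands = [(min(b, c), p) for p, (b, _) in best[k - 1].items()
                         if (k - 1, p, r) in edge_set]
                if cands:
                    layer[r] = max(cands, key=lambda t: t[0])
            best.append(layer)
        if not best[-1]:
            break  # no complete path with positive counts remains
        end = max(best[-1], key=lambda r: best[-1][r][0])
        width = best[-1][end][0]
        path = [end]
        for k in range(len(counts) - 1, 0, -1):
            path.append(best[k][path[-1]][1])
        path.reverse()
        for k, r in enumerate(path):
            counts[k][r] -= width  # exhausts at least one read on the path
        results.append((path, width))
    return results


# Hypothetical 3-stage graph in which every consecutive pair is compatible.
stages = [{"A": 70, "B": 30}, {"C": 60, "D": 40}, {"E": 50, "F": 50}]
edges = [(k, l, r) for k in range(2) for l in stages[k] for r in stages[k + 1]]
paths = max_bandwidth_paths(stages, edges)
```

Each returned path is a candidate quasispecies and its width a candidate count; in this toy graph the first path found is A, C, E with width 50.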

38 Observed vs Ideal Read Frequencies
Ideal frequency
- consistent across forks
Observed frequency (count)
- inconsistent across forks
All methods perform better under ideal frequencies

39 Fork Balancing Problem
Given
- a set of reads and their respective frequencies
Find
- minimal frequency offsets that balance all forks
The simplest approach is to scale frequencies from left to right
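One possible reading of "scale frequencies from left to right" is sketched below in Python. It assumes that a fork groups all reads sharing the same overlap string, that counts are positive, and that the overlap length is constant and at least one base. The talk does not spell out these details, so treat the function as illustrative only.

```python
from collections import defaultdict


def scale_left_to_right(stages, overlap):
    """Rescale read counts so every fork is balanced.

    stages: list of dicts read -> positive count, one per amplicon.
    overlap: number of bases shared by consecutive windows (assumed >= 1).
    Stage 0 is kept as observed. In each later stage, the reads of a fork
    are scaled so their total equals the (already scaled) total of the
    left-hand reads of the same fork.
    """
    balanced = [{r: float(c) for r, c in stages[0].items()}]
    for k in range(1, len(stages)):
        left_total = defaultdict(float)
        for read, c in balanced[k - 1].items():
            left_total[read[-overlap:]] += c
        right_total = defaultdict(float)
        for read, c in stages[k].items():
            right_total[read[:overlap]] += c
        balanced.append({
            read: c * left_total[read[:overlap]] / right_total[read[:overlap]]
            for read, c in stages[k].items()
        })
    return balanced


# Toy data (hypothetical): the observed right-hand counts disagree with the left.
observed = [{"ACGT": 60, "ACGA": 40}, {"GTCC": 50, "GACC": 50}]
balanced = scale_left_to_right(observed, overlap=2)
```

In this sketch a right-hand read with no compatible left-hand read is scaled to zero, which is a design choice of the example rather than something stated in the slides.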

40 Least Squares Approach
Quadratic program for read offsets
- q: fork; o_i: observed frequency; x_i: frequency offset
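The formula itself did not survive in the transcript. One plausible form of such a quadratic program, written with the slide's symbols, is given below as an assumption: minimize the squared offsets subject to every fork being balanced. The sets L(q) and R(q), denoting the reads on the left and right of fork q, are notation introduced here, and the non-negativity constraint is likewise a guess.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i} x_i^{2} \\
\text{s.t.}\quad & \sum_{i \in L(q)} (o_i + x_i) \;=\; \sum_{j \in R(q)} (o_j + x_j)
  \quad \text{for every fork } q, \\
& o_i + x_i \ge 0 \quad \text{for every read } i .
\end{aligned}
```

Under this reading, the balanced frequency of read i is o_i + x_i, and the objective keeps the corrected frequencies as close as possible to the observed ones.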

41 Flowchart

42 Data Sets and Metrics
Simulated error-free HCV data (fragment of length 1734)
- quasispecies drawn from uniform, geometric, and skewed distributions
- shift → delta of starting position
Sensitivity
- percentage of true quasispecies that are correctly assembled
PPV
- percentage of true quasispecies among all assembled sequences
Jensen-Shannon divergence
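The three metrics can be sketched in Python as follows. Sensitivity and PPV are returned as fractions rather than percentages, and the Jensen-Shannon divergence uses base-2 logarithms over the union of sequences. How the talk's evaluation aligned true and assembled distributions is not stated, so the input format (dicts mapping sequence to frequency) is an assumption.

```python
import math


def sensitivity(true_seqs, assembled_seqs):
    """Fraction of true quasispecies that were assembled exactly."""
    true_set, assembled_set = set(true_seqs), set(assembled_seqs)
    return len(true_set & assembled_set) / len(true_set)


def ppv(true_seqs, assembled_seqs):
    """Fraction of assembled sequences that are true quasispecies."""
    true_set, assembled_set = set(true_seqs), set(assembled_seqs)
    return len(true_set & assembled_set) / len(assembled_set)


def jensen_shannon_divergence(p, q):
    """JSD in bits between two distributions given as dicts seq -> frequency."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical example: 4 true variants, 3 assembled, 2 of them correct.
truth = ["s1", "s2", "s3", "s4"]
assembled = ["s1", "s2", "s5"]
sens = sensitivity(truth, assembled)
precision = ppv(truth, assembled)
```

With base-2 logarithms the divergence ranges from 0 for identical distributions to 1 for distributions with no sequence in common.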

43 Sensitivity Results

44 PPV Results

45 Divergence Results

46 ViSpA Comparison

47 Conclusion
Two novel methods for solving the QSR problem
- outperform Prosperi et al. on average
- outperform the ViSpA approach on average
The Maximum Bandwidth approach worked best
Future work: exact local solution for minimum entropy

48 Thanks

