# Viral Quasispecies Reconstruction Based on Unassembled Frequency Estimation


Alex Zelikovsky, Department of Computer Science, Georgia State University

Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu

Viral Quasispecies Reconstruction Based on Unassembled Frequency Estimation

Outline
- Introduction
- ML Model
- EM Algorithm
- VSEM Algorithm
- Experimental Results
- Conclusions and future work

ISBRA 2011, Central South University, Changsha, China

454 Pyrosequencing
- Emulsion PCR
- Single nucleotide addition
  - Natural nucleotides
  - DNA polymerase pauses until the complementary nucleotide is dispensed
  - Nucleotide incorporation triggers an enzymatic reaction that results in emission of light

ML Model
- Panel: bipartite graph
  - RIGHT: strings
    - unknown frequencies
  - LEFT: reads
    - observed frequencies
  - EDGES: probability of the read being emitted by the string
    - weights are calculated from the mapping of the reads to the strings

Figure: bipartite panel with strings S1, S2, S3 and reads R1, R2, R3, R4.

ML estimates of string frequencies
- The probability that a read is sampled from string j is proportional to its frequency f(j)
- The ML estimate of f(j) is n(j)/(n(1) + ... + n(N))
  - n(j): number of reads sampled from string j

EM algorithm
- E-step: compute the expected number n(j) of reads that come from string j, assuming the current string frequencies f(j) are correct
- M-step: for each string j, set the new value of f(j) to the fraction of all observed reads in the sample that originate from string j
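The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `em_frequencies`, the `weights[j][r]` matrix layout (edge weight between string j and read r, 0 when there is no edge), the uniform starting point, and the convergence tolerance are all assumptions made for the example.

```python
def em_frequencies(weights, counts, n_iter=200, tol=1e-9):
    """Estimate string frequencies f(j) by EM on a bipartite panel.

    weights[j][r]: assumed edge weight between string j and read r (0 = no edge).
    counts[r]:     number of times read r was observed.
    """
    n_strings = len(weights)
    f = [1.0 / n_strings] * n_strings  # assumed uniform starting point
    for _ in range(n_iter):
        # E-step: expected number n(j) of reads coming from string j,
        # treating the current f(j) as correct.
        n = [0.0] * n_strings
        for r, c in enumerate(counts):
            total = sum(f[j] * weights[j][r] for j in range(n_strings))
            if total == 0:
                continue  # read is not explained by any string in the panel
            for j in range(n_strings):
                n[j] += c * f[j] * weights[j][r] / total
        # M-step: f(j) becomes the fraction of assigned reads that come from j.
        assigned = sum(n)
        if assigned == 0:
            return f
        new_f = [x / assigned for x in n]
        if max(abs(a - b) for a, b in zip(f, new_f)) < tol:
            return new_f
        f = new_f
    return f


# Two strings sharing the middle read; the symmetric counts give equal frequencies.
freqs = em_frequencies([[1, 1, 0], [0, 1, 1]], [10, 20, 10])
```

Reads with no edge to any string are skipped in the E-step, which is the situation an incomplete panel produces.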

ML Model Quality
- How well does the maximum likelihood model explain the reads?
- Measured by the deviation between expected and observed read frequencies
  - expected read frequency:
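The slide's formula for the expected read frequency is not preserved here, so the following is only a sketch under stated assumptions: each string's edge weights are normalized to sum to 1, the expected frequency of a read is the frequency-weighted sum over strings, and the deviation D is the mean absolute difference between observed and expected frequencies. The mean-absolute-difference reading is consistent with the example slides, where O = .25 for four reads and E = .32, .32, .16, .16 give D = .08. The function names are hypothetical.

```python
def expected_read_frequencies(weights, f):
    """Expected frequency of each read, assuming each string's weights
    are normalized to sum to 1 and mixed according to f(j)."""
    n_reads = len(weights[0])
    expected = [0.0] * n_reads
    for j, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum == 0:
            continue  # string emits no reads
        for r, w in enumerate(row):
            expected[r] += f[j] * w / row_sum
    return expected


def deviation(observed, expected):
    """Assumed definition of D: mean absolute difference per read."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# Values from the example slides: O = .25 for every read.
d_first = deviation([0.25] * 4, [0.32, 0.32, 0.16, 0.16])
d_second = deviation([0.25] * 4, [0.30, 0.30, 0.15, 0.15])
```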

VSEM: Virtual String EM

Flowchart:
1. Start from the (incomplete) panel plus a virtual string whose read weights are all 0
2. EM: ML estimates of string frequencies
3. Compute expected read frequencies
4. Compute the deviation between expected and observed read frequencies
5. Stop condition: if yes, output the string frequencies and reads; if no, update the weights of reads in the virtual string and return to EM
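The loop in the flowchart can be illustrated with a toy Python sketch. The slides do not state how the virtual string's weights are updated or what the stop condition is, so both are assumptions here: the virtual weight of a read grows by the positive part of (observed minus expected) frequency, and the loop stops once the deviation falls below a small threshold. The helper names and the `weights[j][r]` layout are also hypothetical.

```python
def em(weights, counts, n_iter=500, tol=1e-12):
    """Plain EM for string frequencies; restarts from a uniform guess."""
    k = len(weights)
    f = [1.0 / k] * k
    for _ in range(n_iter):
        n = [0.0] * k
        for r, c in enumerate(counts):
            total = sum(f[j] * weights[j][r] for j in range(k))
            if total > 0:
                for j in range(k):
                    n[j] += c * f[j] * weights[j][r] / total
        s = sum(n)
        if s == 0:
            return f
        new_f = [x / s for x in n]
        converged = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if converged:
            break
    return f


def expected(weights, f):
    """Expected read frequencies, with each string's weights normalized."""
    e = [0.0] * len(weights[0])
    for j, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum > 0:
            for r, w in enumerate(row):
                e[r] += f[j] * w / row_sum
    return e


def vsem(panel, counts, max_rounds=50, eps=1e-6):
    """Toy VSEM loop; the virtual string is the last entry of the result."""
    total = sum(counts)
    observed = [c / total for c in counts]
    virtual = [0.0] * len(counts)  # virtual string starts with 0-weights
    for _ in range(max_rounds):
        weights = panel + [virtual]
        f = em(weights, counts)
        e = expected(weights, f)
        d = sum(abs(o - x) for o, x in zip(observed, e)) / len(counts)
        if d < eps:  # assumed stop condition
            break
        # Assumed update rule: add the under-explained part of each read.
        virtual = [w + max(0.0, o - x) for w, o, x in zip(virtual, observed, e)]
    return f, virtual, d


# Panel holds one string that explains only the first read; the second read
# (75% of the sample) can only be absorbed by the virtual string.
f_incomplete, vs_weights, d_incomplete = vsem([[1.0, 0.0]], [1, 3])
f_complete, _, d_complete = vsem([[1.0, 0.0], [0.0, 1.0]], [1, 3])
```

In this toy case the virtual string ends up with frequency 0.75 when the panel misses the string behind the second read, and frequency 0 when the panel is complete, which mirrors the idea of estimating the total frequency of missing strings.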

Example: 1st iteration

Figure: full panel (strings S1, S2, S3; reads R1, R2, R3, R4) beside the incomplete panel (strings S1, S2 and the virtual string VS; reads R1, R2, R3, R4). Observed read frequency O = 0.25.

Example: 1st iteration
- ML estimates, full panel: 0.25, 0.5, 0.25
- ML estimates, incomplete panel: 0.33, 0.66

Example: 1st iteration
- Observed (O) and expected (E) read frequencies, incomplete panel: O = .25 for each read; E = .32, .32, .16, .16
- ML estimates, full panel: .25, .5, .25
- ML estimates, incomplete panel: .33, .66

Example: 1st iteration
- Observed (O) and expected (E) read frequencies, incomplete panel: O = .25 for each read; E = .32, .32, .16, .16
- ML estimates, full panel: .25, .5, .25
- ML estimates, incomplete panel: .34, .66
- Deviation: D = 0 (full panel), D = .08 (incomplete panel)

Example: 1st iteration
- Observed (O) and expected (E) read frequencies, incomplete panel: O = .25 for each read; E = .3, .3, .15, .15
- ML estimates, full panel: .25, .5, .25, 0
- ML estimates, incomplete panel: .32, .65, .02
- Deviation: D = 0 (full panel), D = .075 (incomplete panel)

Example: last iteration
- ML estimates, full panel: .25, .5, .25, 0
- ML estimates, incomplete panel: .20, .6, .2
- Deviation: D = 0

VSEM: Virtual String EM
- Decide if the panel is likely to be incomplete
- Estimate the total frequency of missing strings
- Identify the read spectrum emitted by missing strings

ViSpA
- ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads
  - align reads
  - build a read graph:
    - V: reads
    - E: overlaps between reads
    - each path: candidate sequence
  - filter based on ML frequencies

ViSpA-VSEM

Flowchart (box and edge labels): reads; ViSpA weighted assembler; assembled qsps; qsps library; VSEM (Virtual String EM); reads, weights; ViSpA ML estimator; removing duplicated & rare qsps; stopping condition (YES / NO); viral spectrum + statistics.

Simulation Setup and Accuracy Measures
- Real quasispecies sequence data from [von Hahn et al. 2006]
  - 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus
  - Error-free data was simulated by an in-house simulator
    - population sizes: 10, 20, 30, and 40 sequences
    - population distributions: geometric, skewed normal, uniform
- Accuracy measures
  - Kullback-Leibler divergence
  - Correlation between real and predicted frequencies
  - Average prediction error
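The three accuracy measures can be computed as in the sketch below. The slides name the measures but not their exact definitions, so the choices here are assumptions: the Kullback-Leibler divergence is taken from the real to the predicted distribution, the correlation is Pearson's r, and the average prediction error is the mean absolute difference in frequency. The function names are hypothetical.

```python
import math


def kl_divergence(real, predicted):
    """KL(real || predicted); assumes predicted > 0 wherever real > 0."""
    return sum(p * math.log(p / q) for p, q in zip(real, predicted) if p > 0)


def pearson(x, y):
    """Pearson correlation between two equally long frequency vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_error(real, predicted):
    """Assumed definition: mean absolute difference in frequency."""
    return sum(abs(p - q) for p, q in zip(real, predicted)) / len(real)


real = [0.5, 0.3, 0.2]
predicted = [0.4, 0.4, 0.2]
kl = kl_divergence(real, predicted)
r = pearson(real, predicted)
err = average_error(real, predicted)
```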

Experimental Validation of VSEM
- Detection of panel incompleteness
  - VSEM can detect 1% of missing strings
- Improving quasispecies frequencies
- Detection of reads emitted by missing strings
  - Correlation between predicted reads and reads emitted by missing strings: >65%

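EM vs VSEM

Columns are grouped by % of missing strings; each group reports r and err.

| Method | r.l./n.r | <10% r | <10% err | 10%-20% r | 10%-20% err | 20%-30% r | 20%-30% err | 30%-40% r | 30%-40% err | 40%-50% r | 40%-50% err | >50% r | >50% err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViSpA | 100/20K | 90.2 | 4.5 | 91.0 | 6.8 | 75.4 | 5.1 | 68.6 | 1.6 | 40.8 | 2.3 | 39.8 | 10.4 |
| ViSpA-VSEM | 100/20K | 91.6 | 2.3 | 92.8 | 4.4 | 76.5 | 4.1 | 70.5 | 1.4 | 54.2 | 2.0 | 50.8 | 7.4 |
| ViSpA | 300/20K | 95.7 | 3.8 | 93.2 | 10.2 | 89.8 | 1.0 | 66.7 | 1.5 | 62.1 | 2.1 | 46.8 | 9.7 |
| ViSpA-VSEM | 300/20K | 95.4 | 1.7 | 95.8 | 1.1 | 96.9 | 0.6 | 85.7 | 0.9 | 88.0 | 0.9 | 60.4 | 2.6 |
| ViSpA | 100/100K | 95.2 | 4.5 | 93.9 | 9.1 | 84.8 | 1.4 | 74.2 | 1.8 | 74.5 | 2.3 | 73.4 | 9.9 |
| ViSpA-VSEM | 100/100K | 97.8 | 2.6 | 95.6 | 3.0 | 86.3 | 1.3 | 79.8 | 1.7 | 79.0 | 2.1 | 74.2 | 8.8 |
| ViSpA | 300/100K | 96.2 | 3.9 | 88.6 | 12.4 | 88.9 | 1.0 | 85.1 | 1.4 | 75.1 | 2.3 | 49.5 | 10.5 |
| ViSpA-VSEM | 300/100K | 96.2 | 2.0 | 92.8 | 0.9 | 93.7 | 0.7 | 90.2 | 1.2 | 84.4 | 1.7 | 67.1 | 4.8 |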

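ViSpA vs ViSpA-VSEM

| Distribution | ViSpA PPV | ViSpA Sensitivity | ViSpA RE | ViSpA r | ViSpA err | ViSpA-VSEM PPV | ViSpA-VSEM Sensitivity | ViSpA-VSEM RE | ViSpA-VSEM r | ViSpA-VSEM err | Gain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Geometric | 0.767 | 0.5 | -0.0099 | 0.954 | 7.36 | 0.591 | 0.73 | 0.0276 | 0.909 | 2.91 | 2.3 |
| Skewed | 0.733 | 0.4 | -0.0196 | 0.673 | 13.01 | 0.701 | 0.77 | 0.0085 | 0.967 | 2.5 | 4 |
| Uniform | 0.733 | 0.4 | -0.0191 | 0.716 | 12.76 | 0.645 | 0.73 | 0.0108 | 0.976 | 2.34 | 3.7 |

100K reads from 10 QSPS, average length 300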

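ViSpA vs ViSpA-VSEM

| #mismatches | ViSpA PPV / Sensitivity | ViSpA RE | ViSpA r | ViSpA err | ViSpA-VSEM PPV | ViSpA-VSEM Sensitivity | ViSpA-VSEM RE | ViSpA-VSEM r | ViSpA-VSEM err | Gain |
|---|---|---|---|---|---|---|---|---|---|---|
| k = 0 | 0.5 | 0.0720 | 0.9860 | 9.98 | 0.546 | 0.6 | 0.0494 | 0.974 | 7.54 | 1 |
| k = 2 | 0.6 | 0.0668 | 0.9860 | 9.16 | 0.636 | 0.7 | 0.0434 | 0.9680 | 6.67 | 1 |
| k = 6 | 0.7 | 0.0577 | 0.9856 | 7.95 | 0.727 | 0.8 | 0.0369 | 0.946 | 6.20 | 1 |
| k = 7 | 0.8 | 0.0525 | 0.9866 | 7.26 | 0.818 | 0.9 | 0.0335 | 0.948 | 5.65 | 1 |

100K reads from 10 QSPS, average length 300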

Conclusions & Future Work
- Apply VSEM to RNA-Seq data
- Assemble missing strings from the set of reads emitted by missing strings
- Handle chimeric strings present in the panel

Acknowledgments
- NFS …
