Informatics for next-generation sequence analysis – SNP calling. Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008.


1 Informatics for next-generation sequence analysis – SNP calling. Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008.

2 Read length and throughput. [Figure: sequencing platforms plotted by read length (10 bp to 1,000 bp) against bases per machine run (1 Mb to 1 Gb).] ABI capillary sequencer. 454 pyrosequencer (20-100 Mb in 100-250 bp reads). Illumina/Solexa and AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads).

3 Current and future application areas. Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery. De novo genome sequencing. Short-read sequencing will be (at least) an alternative to microarrays for: DNA-protein interaction analysis (ChIP-Seq), novel transcript discovery, quantification of gene expression, epigenetic analysis (methylation profiling). [Figure labels: DEL, SNP, reference genome.]

4 Fundamental informatics challenges (I). 1. Interpreting machine readouts: base calling, base error estimation. 2. Dealing with non-uniqueness in the genome: resequenceability. 3. Alignment of billions of reads.

5 Informatics challenges (II). 4. SNP, short INDEL, and structural variation discovery. 5. Data visualization. 6. Data storage & management.

6 Resequencing-based SNP discovery. Steps shown against the genome reference sequence: read mapping, read alignment, paralog identification, SNP detection + inspection.

7 SNP calling workflow: read alignment, SNP detection, visual checking.

8 Bayesian detection algorithm. Inputs: base call + base quality, polymorphism rate (prior), base composition, depth of coverage. Output: the Bayesian posterior probability, i.e. the SNP score. [Figure: allele columns AAAAAAAAAA, CCCCCCCCCC, TTTTTTTTTT, GGGGGGGGGG, contrasting a polymorphic combination with a monomorphic combination.]
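The slide gives only the ingredients of the Bayesian SNP score, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes haploid samples, a uniform base composition, sequencing errors spread evenly over the three other bases, and a prior that splits the polymorphism rate evenly across all polymorphic allele combinations. The names `allele_likelihood`, `snp_posterior`, and `prior_snp` are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def allele_likelihood(allele, reads):
    """P(observed base calls | true allele); reads = [(base, phred_quality), ...]."""
    lik = 1.0
    for base, q in reads:
        err = 10 ** (-q / 10)  # Phred quality -> error probability
        lik *= (1 - err) if base == allele else err / 3
    return lik

def snp_posterior(samples, prior_snp=0.001):
    """Posterior probability that the site is polymorphic.

    samples: one list of (base, quality) reads per haploid sample.
    prior_snp: assumed prior polymorphism rate (illustrative value).
    """
    n = len(samples)
    n_mono = len(BASES)                 # AAAA..., CCCC..., GGGG..., TTTT...
    n_poly = len(BASES) ** n - n_mono   # every other allele combination
    mono = poly = 0.0
    for combo in product(BASES, repeat=n):
        lik = 1.0
        for allele, reads in zip(combo, samples):
            lik *= allele_likelihood(allele, reads)
        if len(set(combo)) == 1:
            mono += lik * (1 - prior_snp) / n_mono
        else:
            poly += lik * prior_snp / n_poly
    return poly / (mono + poly)

# Two samples with consistent, high-quality but different base calls
print(snp_posterior([[("A", 30)] * 3, [("C", 30)] * 3]))
# Same site, but the only mismatch is a single low-quality call
print(snp_posterior([[("A", 30)] * 3, [("A", 30), ("A", 30), ("C", 10)]]))
```

In this sketch the first site scores close to 1 and the second close to 0, which is how the score separates a supported allele difference from a likely sequencing error.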

9 Base quality values for SNP calling. Base quality values help us decide whether mismatches are true polymorphisms or sequencing errors. Accurate base qualities are crucial, especially at lower coverage.
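As a brief illustration of why base qualities matter, the sketch below uses the standard Phred convention, Q = -10 log10(P_error), to convert between a quality value and an error probability. The slide does not state which quality scale each platform reports, so Phred scaling is an assumption here, and the function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

A mismatch supported by a Q30 base call is a thousand-to-one bet against a sequencing error, while a Q10 mismatch is wrong one time in ten. At low coverage there are few reads to average over, so the call rests on these per-base values.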

10 Priors for specific resequencing scenarios. [Figure: three aligned sequences (AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA) shown in two panels, one labeled strain 1, strain 2, strain 3 and one labeled individual 1, individual 2, individual 3.]

11 Consensus sequence generation (genotyping). [Figure: aligned sequences AACGTTAGCATA and AACGTTCGCATA for strain 1, strain 2, strain 3 and for individual 1, individual 2, individual 3, annotated with consensus base calls (A, C, A) and genotype calls (A/C, C/C, A/A).]
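The genotype labels on this slide (A/A, A/C, C/C) can be illustrated with a simple maximum-likelihood genotype caller. This is a sketch under assumptions, not the method used in the talk: it treats a diploid individual, assumes each read samples either chromosome with probability 1/2, spreads errors evenly over the other three bases, and applies no genotype prior. The function names are hypothetical.

```python
from itertools import combinations_with_replacement

def genotype_likelihoods(reads):
    """Likelihood of each unordered diploid genotype.

    reads: list of (base, phred_quality) tuples covering one site.
    """
    out = {}
    for a, b in combinations_with_replacement("ACGT", 2):
        lik = 1.0
        for base, q in reads:
            err = 10 ** (-q / 10)
            p_a = (1 - err) if base == a else err / 3
            p_b = (1 - err) if base == b else err / 3
            lik *= 0.5 * p_a + 0.5 * p_b  # read drawn from either chromosome
        out[a + "/" + b] = lik
    return out

def call_genotype(reads):
    """Return the genotype with the highest likelihood."""
    liks = genotype_likelihoods(reads)
    return max(liks, key=liks.get)

print(call_genotype([("A", 30), ("A", 30), ("C", 30), ("C", 30)]))
print(call_genotype([("A", 30)] * 4))
```

For a haploid strain the same idea reduces to choosing the single most likely base, which corresponds to the consensus base calls in the strain panel.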

12 SNP calling in Roche/454 pyrosequences

13 SNP calling in low 454 coverage. With Andy Clark (Cornell) and Elaine Mardis (Wash. U.); DNA courtesy of Chuck Langley, UC Davis. 10 different African and American D. melanogaster isolates; 10 runs of 454 reads (~300,000 reads per isolate, ~1.5X total). Can we detect SNPs in survey-style 454 read coverage? [Figure: iso-1 reference aligned with a 46-2 454 read and 46-2 ABI reads (2 fwd + 2 rev).] Validation rate: 92.9% (1,342 / 1,443). Missed SNP rate: 2.0% (25 / 1,247).

14 SNP calling in Illumina/Solexa short-reads

15 SNP calling in short-read coverage. C. elegans reference genome (Bristol, N2 strain); resequenced strain: Pasadena, CB4858 (1 ½ machine runs). SNP calling error rate very low: validation rate = 97.8% (224/229), conversion rate = 92.6% (224/242), missed SNP rate = 3.75% (26/693). INDEL candidates validate and convert at similar rates to SNPs: validation rate = 89.3% (193/216), conversion rate = 87.3% (193/221). [Figure labels: SNP, INS.]

16 SNP calling in AB/SOLiD color-space reads. [Figure: three aligned sequences (ACGGTCGTCGTGTGCGT, ACGGTCGCCGTGTGCGT, ACGGTCGTCGTGTGCGT) illustrating the three cases: no change, SNP, measurement error.]
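The distinction between a SNP and a measurement error in color space can be sketched with the two-base encoding used by SOLiD, in which each color encodes a pair of adjacent bases. The code below assumes the usual description of that encoding (A=0, C=1, G=2, T=3, color = XOR of the two adjacent codes). The helper names are hypothetical, and the simulated measurement error is made up for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Encode a base sequence as colors, one per adjacent base pair."""
    return [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]

def diff_positions(ref_colors, read_colors):
    """Indices at which two color sequences disagree."""
    return [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]

ref = "ACGGTCGTCGTGTGCGT"
snp = "ACGGTCGCCGTGTGCGT"   # one base differs from the reference

ref_colors = to_colors(ref)
snp_colors = to_colors(snp)

# Simulate a measurement error: one color misread, bases unchanged
err_colors = list(ref_colors)
err_colors[6] = (err_colors[6] + 1) % 4

print(diff_positions(ref_colors, snp_colors))  # two adjacent colors change
print(diff_positions(ref_colors, err_colors))  # a single color changes
```

Because each base contributes to two adjacent colors, a true single-base change alters two neighboring colors, while an isolated color mismatch is more consistent with a measurement error.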

17 Mutational profiling: deep 454/Illumina/SOLiD data. Collaboration with Doug Smith at Agencourt. Pichia stipitis converts xylose to ethanol (bio-fuel production). One mutagenized strain had especially high conversion efficiency; the goal was to determine where the mutations that caused this phenotype were. We resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads. 14 true point mutations in the entire genome. At about 15X nominal coverage, each technology can find every point mutation with essentially no false positives. [Figure: Pichia stipitis reference sequence; image from JGI web site.]

18 Our software is available for testing http://bioinformatics.bc.edu/marthlab/Beta_Release

19 Credits. http://bioinformatics.bc.edu/marthlab. Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby. Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt). Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.).

