Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory.

1 Informatics challenges and computer tools for sequencing 1000s of human genomes. Gabor T. Marth, Boston College Biology Department. Cold Spring Harbor Laboratory Personal Genomes meeting, October 9-12, 2008

2 Large-scale individual human resequencing

3 Next-gen sequencers offer vast throughput… [Figure: bases per machine run (1 Mb to 10 Gb) vs. read length (10 bp to 1,000 bp)] Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads); 454 pyrosequencer (100-400 Mb in 200-450 bp reads); ABI capillary sequencer

4 The resequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation [Figure labels: REF, IND]

5 The variation discovery “toolbox”: base callers; read mappers; SNP callers; SV callers; assembly viewers

6 1. Base calling: base sequence; base quality (Q-value) sequence. Early manufacturer-supplied base callers were imperfect; third party software made substantial improvements; machine manufacturers are now focusing more on base calling.
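The slide mentions base quality (Q-value) without defining it. A minimal sketch, assuming the standard Phred convention Q = -10 log10(P_error); the function names are illustrative and not taken from any base caller:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability (0 < p <= 1) to a Phred quality."""
    return -10 * math.log10(p)

# Under this convention Q30 means about 1 miscalled base in 1,000.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```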

7 2. Read mapping: Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Larger, more unique pieces are easier to place than others…

8 Next-gen reads are generally short [Figure: read length [bp], axis 0-400; ranges shown: ~200-450 (variable); 25-70 (fixed); 25-50 (fixed); 20-60 (variable)]

9 Base error rates are low [Figure panels: Illumina, 454]

10 Strategies to deal with non-unique mapping

11 Mapping probabilities (qualities) [Figure: one read with candidate placements of probability 0.8, 0.19, 0.01]
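One way to turn placement probabilities like these into a single mapping quality is sketched below. It assumes that per-location alignment likelihoods are normalized into posteriors and that the quality is the Phred-scaled probability that the best placement is wrong. The cap of 60 and the function names are assumptions for illustration; the slide does not say how MOSAIK or any other mapper computes its values.

```python
import math

def mapping_posteriors(likelihoods):
    """Normalize per-location alignment likelihoods into probabilities."""
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

def mapping_quality(posteriors, cap=60):
    """Phred-scale the probability that the best placement is wrong.

    `cap` is an assumed upper bound, used when the best placement
    has probability 1 and the logarithm would be undefined.
    """
    best = max(posteriors)
    if best >= 1.0:
        return cap
    return min(cap, round(-10 * math.log10(1.0 - best)))

# The three placement probabilities shown on the slide.
print(mapping_quality([0.8, 0.19, 0.01]))
```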

12 Error types are very different [Figure panels: Illumina, 454]

13 Gapped alignments

14 MOSAIK: fast, accurate, gapped, versatile (short + long reads)

15 3. SNP and short-INDEL calling: deep alignments of 100s / 1000s of individuals; trio sequences

16 Allele discovery is a multi-step sampling process: Population → Samples → Reads

17 Capturing the allele in the sample
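The first sampling step can be sketched with a simple calculation: the chance that an allele at a given population frequency is carried by at least one sampled chromosome. This assumes diploid individuals drawn independently from a randomly mating population, which is a simplification and not necessarily the model behind the slide.

```python
def capture_probability(allele_frequency: float, n_individuals: int,
                        ploidy: int = 2) -> float:
    """Probability that at least one sampled chromosome carries the allele.

    Assumes independent draws: each of ploidy * n_individuals chromosomes
    carries the allele with probability `allele_frequency`.
    """
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_frequency) ** n_chromosomes

# A 1% allele is captured in most, but not all, samples of 100 individuals.
for n in (10, 100, 1000):
    print(n, round(capture_probability(0.01, n), 3))
```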

18 Allele calling in the reads [Figure labels: base quality; allele call in read; number of individuals]

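19 How many reads needed to call an allele? [Figure: aligned reads aatgtagtaAgtacctac / aatgtagtaCgtacctac / aatgtagtaAgtacctac] [Table residue: columns Q30, Q40, Q50, Q60; rows 1, 2, 3 (reads); surviving values: row 1: 0.01, 0.1, 0.5; row 2: 0.82, 1.0]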

20 The need for accurate data…

21 … and realistic base quality values

22 Recalibrated base quality values (Illumina)

23 More samples or deeper coverage / sample? Shallower read coverage from more individuals… …or deeper coverage from fewer samples? (Simulation analysis by Aaron Quinlan)

24 Analysis indicates a balance

25 SNP calling in trios: the child inherits one chromosome from each parent; there is a small probability for a mutation in the child
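The inheritance constraint can be written as a prior on the child's genotype given the parents' genotypes. The sketch below assumes a biallelic site with alleles A and C (the pair used in the read examples) and a symmetric per-transmission mutation probability `mu`. The default value of `mu` is a placeholder and not a figure from the talk.

```python
from itertools import product

ALLELES = ("A", "C")  # assumed biallelic site

def gamete_probs(genotype: str, mu: float) -> dict:
    """Probability of transmitting each allele from a diploid genotype.

    Each parental allele is picked with probability 1/2 and is then
    mutated to the other allele with probability `mu`.
    """
    probs = {a: 0.0 for a in ALLELES}
    for allele in genotype:
        other = ALLELES[1] if allele == ALLELES[0] else ALLELES[0]
        probs[allele] += 0.5 * (1.0 - mu)
        probs[other] += 0.5 * mu
    return probs

def child_genotype_probs(mother: str, father: str, mu: float = 1e-8) -> dict:
    """P(child genotype | parental genotypes), keyed by 'AA', 'AC', 'CC'."""
    m, f = gamete_probs(mother, mu), gamete_probs(father, mu)
    out = {}
    for a, b in product(ALLELES, repeat=2):
        key = "".join(sorted(a + b))  # unordered genotype
        out[key] = out.get(key, 0.0) + m[a] * f[b]
    return out

print(child_genotype_probs("AC", "AA"))
```

In a full trio caller a prior like this would be combined with each individual's read evidence; that step is omitted here.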

26 SNP calling in trios [Figure: aligned reads aatgtagtaAgtacctac / aatgtagtaCgtacctac for mother, father and child; P=0.79, P=0.86]

27 4. Structural variation discovery: read pair mapping pattern (breakpoint detection)

28 Copy number estimation: depth of read coverage
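A rough sketch of how depth of coverage relates to copy number, assuming coverage scales linearly with copy number and that the expected depth for a normal diploid region is known. Real callers also correct for GC content and mappability, which this ignores; the function name is illustrative.

```python
def estimate_copy_number(observed_depth: float, expected_depth: float,
                         ploidy: int = 2) -> int:
    """Estimate integer copy number from read depth in a window.

    `expected_depth` is the depth expected for a region present in
    `ploidy` copies (an assumed, externally estimated baseline).
    """
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    return round(ploidy * observed_depth / expected_depth)

# Half the expected depth suggests a heterozygous deletion (one copy).
print(estimate_copy_number(15, 30))
```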

29 Deletion: Aberrant positive mapping distance

30 Tandem duplication: negative mapping distance
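The two signatures above can be sketched as a simple classifier on the signed mapping distance of a read pair. It assumes "aberrant positive" means a distance well above the library's expected fragment length, with a threshold of mean plus a few standard deviations. That threshold, the labels, and the function name are assumptions for illustration.

```python
def classify_read_pair(mapping_distance: float, mean_fragment: float,
                       sd_fragment: float, n_sd: float = 3.0) -> str:
    """Label a read pair by its signed mapping distance on the reference.

    Negative distance: mates map in reversed order (tandem duplication
    signature). Distance far above the expected fragment length: the
    sample lacks sequence present in the reference (deletion signature).
    """
    if mapping_distance < 0:
        return "tandem_duplication_candidate"
    if mapping_distance > mean_fragment + n_sd * sd_fragment:
        return "deletion_candidate"
    return "concordant"

print(classify_read_pair(5200, mean_fragment=3000, sd_fragment=300))
```

A single aberrant pair is weak evidence; in practice several pairs supporting the same breakpoint would be required before making a call.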

31 Het deletion “revealed” by normalization (Chip Stewart, Saturday poster session)

32 5. Data visualization: software development; data validation; hypothesis generation

33 Summary: Next-generation sequencing is a boon for large-scale individual human resequencing. Basic data mining tools are getting applied and tested in the 1000 Genomes Project. There is still a lot of fine-tuning to do. A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes.

34 Credits: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby. Several postdoc positions are available… mail marth@bc.edu

35 Software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release

36 Positions: Several postdoc positions are available… mail marth@bc.edu

37 Individual genotype directly from sequence [Figure: reads AACGTTAGCATA and AACGTTCGCATA aligned for individuals 1, 2 and 3; genotype labels A/C, C/C, A/A]
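A minimal sketch of calling a genotype directly from the reads at one site, assuming a biallelic site, a single uniform base error rate, and a flat prior over the three genotypes. These are simplifying assumptions and not the model used in the talk; a real caller would use per-base qualities and population priors.

```python
def genotype_posteriors(n_ref: int, n_alt: int, error: float = 0.01) -> dict:
    """Posterior over diploid genotypes from allele counts in the reads.

    `n_ref` / `n_alt`: reads showing the reference / alternate allele.
    `error`: assumed probability that a read shows the wrong allele.
    The binomial coefficient is the same for every genotype and cancels.
    """
    likelihoods = {
        "ref/ref": (1 - error) ** n_ref * error ** n_alt,
        "ref/alt": 0.5 ** (n_ref + n_alt),
        "alt/alt": error ** n_ref * (1 - error) ** n_alt,
    }
    total = sum(likelihoods.values())
    return {g: lik / total for g, lik in likelihoods.items()}

post = genotype_posteriors(n_ref=4, n_alt=4)
print(max(post, key=post.get))
```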

38 Genotyping from primary sequence data: 100 @ 16x: 0.975 +/- 0.121; 200 @ 8x: 0.968 +/- 0.129; 400 @ 4x: 0.924 +/- 0.151; 800 @ 2x: 0.769 +/- 0.154

39 Most reads contain no or few errors

40 Paired-end reads help unique read placement. PE (fragment amplification): fragment length 100 - 600 bp; fragment length limited by amplification efficiency. MP (circularization): 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity. Korbel et al. Science 2007

41 How many reads needed to call an allele? [Figure: aligned reads alternating aatgtagtaAgtacctac / aatgtagtaCgtacctac; P=0.82, P=0.08]

