1
Informatics challenges and computer tools for sequencing 1000s of human genomes
Gabor T. Marth, Boston College Biology Department
Cold Spring Harbor Laboratory Personal Genomes meeting, October 9-12, 2008
2
Large-scale individual human resequencing
3
Next-gen sequencers offer vast throughput…
[Figure: read length (10 bp to 1,000 bp) versus bases per machine run (1 Mb to 10 Gb)]
ABI capillary sequencer
454 pyrosequencer (100-400 Mb in 200-450 bp reads)
Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
4
The resequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation
[Figure labels: REF, IND]
5
The variation discovery “toolbox”: base callers, read mappers, SNP callers, SV callers, assembly viewers
6
1. Base calling
base sequence, base quality (Q-value) sequence
early manufacturer-supplied base callers were imperfect
third party software made substantial improvements
machine manufacturers are now focusing more on base calling
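The Q-value mentioned above is conventionally a Phred-scaled error probability, Q = -10 log10(P_error). A minimal sketch of the conversion in both directions follows; the function names are illustrative and not taken from any base caller discussed in the talk.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality to the probability the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this convention Q20 corresponds to one expected error in 100 calls and Q30 to one in 1,000.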
7
2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Larger, more unique pieces are easier to place than others…
8
Next-gen reads are generally short
[Figure: read length in bp, axis 0 to 400; per-platform ranges: ~200-450 (variable), 25-70 (fixed), 25-50 (fixed), 20-60 (variable)]
9
Base error rates are low
[Figure panels: Illumina, 454]
10
Strategies to deal with non-unique mapping
11
Mapping probabilities (qualities)
[Figure: one read with candidate placements, probabilities 0.8, 0.19, 0.01]
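One common way to turn placement probabilities like these into a single mapping quality is to Phred-scale the probability that the best placement is wrong. The sketch below assumes that convention; the cap of 60 and the function names are illustrative assumptions, and this is not presented as the scoring used by any particular mapper in the talk.

```python
import math

def mapping_posteriors(likelihoods):
    """Normalize per-placement alignment likelihoods into probabilities that sum to 1."""
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

def mapping_quality(posteriors, cap=60):
    """Phred-scale the probability that the most likely placement is wrong.

    `cap` is an assumed upper bound for reads with a single confident placement.
    """
    p_wrong = 1.0 - max(posteriors)
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

if __name__ == "__main__":
    # The three placement probabilities shown on the slide.
    print(mapping_quality([0.8, 0.19, 0.01]))
```

A read whose best placement has probability 0.8 is wrong one time in five, which is a low mapping quality even though one location clearly dominates.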
12
Error types are very different
[Figure panels: Illumina, 454]
13
Gapped alignments
14
MOSAIK: fast, accurate, gapped, versatile (short + long reads)
15
3. SNP and short-INDEL calling
deep alignments of 100s / 1000s of individuals
trio sequences
16
Allele discovery is a multi-step sampling process
[Figure: Population, Samples, Reads]
17
Capturing the allele in the sample
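The first sampling step can be illustrated with a back-of-the-envelope calculation: the chance that an allele of a given population frequency is carried by at least one chromosome in the sample. The sketch assumes chromosomes are drawn independently from the population; the function name and parameters are hypothetical.

```python
def prob_allele_in_sample(freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele.

    Assumes independent draws of ploidy * n_individuals chromosomes,
    each carrying the allele with probability `freq`.
    """
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - freq) ** n_chromosomes

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_allele_in_sample(0.01, n))
```

Under these assumptions, an allele at 1% frequency is present in a sample of 100 diploid individuals with probability of roughly 0.87.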
18
Allele calling in the reads
[Figure labels: base quality, allele call in read, number of individuals]
19
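How many reads needed to call an allele?
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
[Table: rows are number of reads (1, 2, 3), columns are base quality (Q30, Q40, Q50, Q60). Surviving values, 1 read: 0.01, 0.1, 0.5; 2 reads: 0.82, 1.0]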
20
The need for accurate data…
21
… and realistic base quality values
22
Recalibrated base quality values (Illumina)
23
More samples or deeper coverage / sample?
Shallower read coverage from more individuals…
…or deeper coverage from fewer samples?
(simulation analysis by Aaron Quinlan)
24
Analysis indicates a balance
25
SNP calling in trios
the child inherits one chromosome from each parent
there is a small probability for a mutation in the child
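The inheritance constraint can be written down as a transmission table: each parent passes on one of its two alleles with equal probability. The sketch below enumerates the child genotype probabilities for a given pair of parental genotypes. It covers Mendelian transmission only; the small de novo mutation probability mentioned above is deliberately not modeled, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def child_genotype_probs(mother, father):
    """Mendelian probabilities of the child's genotype at one site.

    `mother` and `father` are two-character genotypes such as "AC".
    Genotypes are returned with alleles sorted, so "CA" is reported as "AC".
    """
    probs = Counter()
    for maternal, paternal in product(mother, father):
        genotype = "".join(sorted(maternal + paternal))
        probs[genotype] += 0.25  # each of the 4 allele combinations is equally likely
    return dict(probs)

if __name__ == "__main__":
    print(child_genotype_probs("AC", "AA"))
    print(child_genotype_probs("AC", "AC"))
```

In a trio caller, a table like this would act as a prior that ties the child's genotype call to the parents' calls, so evidence from all three individuals supports each call.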
26
SNP calling in trios
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
[Figure labels: mother, father, child; P=0.79, P=0.86]
27
4. Structural variation discovery
Read pair mapping pattern (breakpoint detection)
28
Copy number estimation
Depth of read coverage
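The idea behind depth-based copy number estimation can be sketched in a few lines: if a typical diploid region has a known depth, a window's copy number scales with its observed depth. This is a deliberately naive illustration that ignores GC bias, mappability, and sampling noise; the function and parameter names are hypothetical.

```python
def estimate_copy_number(window_depth, baseline_depth, ploidy=2):
    """Estimate integer copy number of a window from its read depth.

    `baseline_depth` is the depth expected for a normal region with
    `ploidy` copies (for example, the genome-wide median depth).
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return round(ploidy * window_depth / baseline_depth)

if __name__ == "__main__":
    baseline = 30.0
    for depth in (0.0, 15.0, 30.0, 46.0):
        print(depth, estimate_copy_number(depth, baseline))
```

A window at half the baseline depth is consistent with a heterozygous deletion, and a window near 1.5 times the baseline with a single-copy gain.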
29
Deletion: Aberrant positive mapping distance
30
Tandem duplication: negative mapping distance
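The two read-pair signatures above can be expressed as a simple rule on the signed mapping distance between mates. The sketch below flags a pair as a deletion candidate when its distance is much larger than the library's fragment length, and as a tandem duplication candidate when the distance is negative. The three-standard-deviation threshold and all names are illustrative assumptions, not parameters from the talk.

```python
def classify_pair(mapping_distance, mean_fragment, sd_fragment, n_sd=3.0):
    """Classify one read pair by its signed mapping distance on the reference.

    mapping_distance: signed distance between the mates as mapped.
    mean_fragment, sd_fragment: fragment length distribution of the library.
    n_sd: assumed number of standard deviations that counts as aberrant.
    """
    if mapping_distance < 0:
        return "tandem duplication candidate"
    if mapping_distance > mean_fragment + n_sd * sd_fragment:
        return "deletion candidate"
    return "concordant"

if __name__ == "__main__":
    for distance in (-500, 310, 1500):
        print(distance, classify_pair(distance, mean_fragment=300, sd_fragment=30))
```

In practice a single aberrant pair is weak evidence; callers typically require several pairs supporting the same breakpoint before reporting an event.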
31
Het deletion “revealed” by normalization (Chip Stewart, Saturday poster session)
32
5. Data visualization
software development
data validation
hypothesis generation
33
Summary
Next-generation sequencing is a boon for large-scale individual human resequencing
Basic data mining tools are being applied and tested in the 1000 Genomes Project
There is still a lot of fine-tuning to do
A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes
34
Credits
Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby
Several postdoc positions are available… mail marth@bc.edu
35
Software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release
36
Positions
Several postdoc positions are available… mail marth@bc.edu
37
Individual genotype directly from sequence
AACGTTAGCATA
AACGTTCGCATA
AACGTTAGCATA
[Figure labels: individual 1, individual 2, individual 3; genotype calls A/C, C/C, A/A]
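Calling a genotype directly from reads can be sketched as a small Bayesian calculation: for each candidate genotype, multiply the probabilities of the observed base calls given their base qualities, then normalize. The model below assumes a biallelic site, a flat prior over the three genotypes, independent reads, and errors spread evenly over the three other bases. These are simplifying assumptions for illustration, not the model used in the talk.

```python
def genotype_posteriors(calls, ref, alt):
    """Posterior probability of each genotype at a biallelic site.

    calls: list of (base, phred_quality) tuples, one per read covering the site.
    ref, alt: the two distinct alleles considered.
    """
    genotypes = {ref + ref: (ref, ref), ref + alt: (ref, alt), alt + alt: (alt, alt)}
    likelihoods = {}
    for name, alleles in genotypes.items():
        likelihood = 1.0
        for base, quality in calls:
            error = 10 ** (-quality / 10)
            # Average over the two chromosomes the read could have come from.
            per_allele = [(1 - error) if base == allele else error / 3 for allele in alleles]
            likelihood *= sum(per_allele) / 2
        likelihoods[name] = likelihood
    total = sum(likelihoods.values())
    return {name: lik / total for name, lik in likelihoods.items()}

if __name__ == "__main__":
    reads = [("A", 30), ("C", 30), ("A", 30)]
    posteriors = genotype_posteriors(reads, ref="A", alt="C")
    print(max(posteriors, key=posteriors.get), posteriors)
```

With two high-quality A reads and one high-quality C read, the heterozygous genotype dominates, because a homozygous genotype would require at least one base-calling error.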
38
Genotyping from primary sequence data
100 @ 16x: 0.975 +/- 0.121
200 @ 8x: 0.968 +/- 0.129
400 @ 4x: 0.924 +/- 0.151
800 @ 2x: 0.769 +/- 0.154
39
Most reads contain no or few errors
40
Paired-end reads help unique read placement
PE (fragment amplification): fragment length 100 - 600 bp; fragment length limited by amplification efficiency
MP (circularization): 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity
Korbel et al. Science 2007
41
How many reads needed to call an allele?
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
[Figure labels: P=0.82, P=0.08]