Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory.

Informatics challenges and computer tools for sequencing 1000s of human genomes. Gabor T. Marth, Boston College Biology Department. Cold Spring Harbor Laboratory Personal Genomes meeting, October 9-12, 2008

Large-scale individual human resequencing

Next-gen sequencers offer vast throughput… [Figure: bases per machine run (1 Mb to 10 Gb) versus read length (10 bp to 1,000 bp). Platforms shown: ABI capillary sequencer; 454 pyrosequencer (100-400 Mb in 200-450 bp reads); Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)]

The resequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation [Figure labels: REF, IND]

The variation discovery “toolbox”: base callers, read mappers, SNP callers, SV callers, assembly viewers

1. Base calling: base sequence and base quality (Q-value) sequence. Early manufacturer-supplied base callers were imperfect; third-party software made substantial improvements; machine manufacturers are now focusing more on base calling.
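The Q-value attached to each called base is conventionally Phred-scaled, Q = -10 log10(P_error). The slide does not spell out the formula, so the sketch below assumes that standard convention; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality Q into an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) into a Phred-scaled quality."""
    return -10 * math.log10(p)

# Q20 corresponds to a 1-in-100 chance that the base call is wrong,
# Q30 to 1-in-1000.
for q in (20, 30, 40):
    print(q, phred_to_error_prob(q))
```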

2. Read mapping. Read mapping is like doing a jigsaw puzzle… you get the pieces… and they give you the picture on the box. Larger, more unique pieces are easier to place than others…

Next-gen reads are generally short [Figure: read length in bp (axis 0-400) per platform: ~200-450 (variable); 25-70 (fixed); 25-50 (fixed); 20-60 (variable)]

Base error rates are low [Figure panels: Illumina, 454]

Strategies to deal with non-unique mapping

Mapping probabilities (qualities) [Figure: one read with three candidate placements, with probabilities 0.8, 0.19, and 0.01]
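One way to read the three numbers on this slide: each candidate location gets a placement probability, and a mapping quality is the Phred-scaled probability that the placement is wrong. The sketch below assumes that interpretation. The weights, the cap of 60, and the function name are hypothetical, and this is not presented as how MOSAIK or any other mapper computes its qualities.

```python
import math

def mapping_qualities(weights, cap=60.0):
    """Normalise per-location alignment weights into placement probabilities
    and convert each one to a Phred-scaled mapping quality."""
    total = sum(weights)
    probs = [w / total for w in weights]
    quals = []
    for p in probs:
        p_wrong = 1.0 - p
        # A placement with no competing locations would have p_wrong == 0,
        # so the quality is capped rather than allowed to go to infinity.
        q = cap if p_wrong <= 0 else min(cap, -10 * math.log10(p_wrong))
        quals.append(q)
    return probs, quals

probs, quals = mapping_qualities([0.8, 0.19, 0.01])
print(probs, quals)
```

Under this model the best placement (probability 0.8) still has a 20% chance of being wrong, which is why a non-unique read carries a low mapping quality even at its best location.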

Error types are very different [Figure panels: Illumina, 454]

Gapped alignments

MOSAIK: fast, accurate, gapped, versatile (short + long reads)

3. SNP and short-INDEL calling: deep alignments of 100s / 1000s of individuals; trio sequences

Allele discovery is a multi-step sampling process [Figure labels: Population, Samples, Reads]

Capturing the allele in the sample
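As a rough illustration of this sampling step: if an allele has population frequency f and n diploid individuals are drawn at random, the chance that at least one of the 2n sampled chromosomes carries it is 1 - (1 - f)^(2n). The independence and random-sampling assumptions are mine, not the slide's, and the function name is illustrative.

```python
def prob_allele_in_sample(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele,
    assuming chromosomes are independent random draws from the population."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes

# A 1% allele is likely, but not certain, to be present among 100 individuals.
print(prob_allele_in_sample(0.01, 100))
```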

Allele calling in the reads [Figure labels: base quality, allele call in read, number of individuals]

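How many reads needed to call an allele? [Figure: aligned reads aatgtagtaAgtacctac / aatgtagtaCgtacctac / aatgtagtaAgtacctac. Table by number of reads (rows 1, 2, 3) and base quality (columns Q30, Q40, Q50, Q60); surviving values: row 1: 0.01, 0.1, 0.5; row 2: 0.82, 1.0; row 3: not recoverable]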
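To illustrate why confidence in an allele grows with both the number of supporting reads and their base quality, here is a minimal Bayesian sketch for a single diploid individual. The priors, the uniform error model (a miscalled base is equally likely to be any of the three other bases), and all names are assumptions made for illustration. It is not the caller behind the slide and is not expected to reproduce the slide's numbers.

```python
def variant_posterior(bases, quals, ref, alt,
                      het_prior=1e-3, hom_alt_prior=5e-4):
    """Posterior probability that a diploid site carries the alt allele,
    given the read bases and their Phred qualities at that site."""
    # Genotypes are keyed by the fraction of chromosomes carrying alt.
    priors = {0.0: 1.0 - het_prior - hom_alt_prior,
              0.5: het_prior,
              1.0: hom_alt_prior}
    weighted = {}
    for alt_frac, prior in priors.items():
        likelihood = 1.0
        for base, q in zip(bases, quals):
            e = 10 ** (-q / 10)  # per-base error probability
            if base == alt:
                likelihood *= alt_frac * (1 - e) + (1 - alt_frac) * e / 3
            elif base == ref:
                likelihood *= (1 - alt_frac) * (1 - e) + alt_frac * e / 3
            else:
                likelihood *= e / 3  # a third base can only be an error
        weighted[alt_frac] = prior * likelihood
    total = sum(weighted.values())
    return 1.0 - weighted[0.0] / total

# One alt-supporting read versus two, all at Q30.
print(variant_posterior("C", [30], ref="A", alt="C"))
print(variant_posterior("CC", [30, 30], ref="A", alt="C"))
```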

The need for accurate data…

… and realistic base quality values

Recalibrated base quality values (Illumina)

More samples or deeper coverage / sample? Shallower read coverage from more individuals… or deeper coverage from fewer samples? (Simulation analysis by Aaron Quinlan)

Analysis indicates a balance

SNP calling in trios: the child inherits one chromosome from each parent; there is a small probability for a mutation in the child
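The two statements on this slide can be turned into a transmission prior for the child's genotype. The sketch below assumes a biallelic site, equal transmission of either parental allele, and a hypothetical per-transmission mutation rate mu. It illustrates the idea only and is not the trio caller used in the talk.

```python
from itertools import product

def transmitted_allele_dist(parent_genotype, mu, alleles=("A", "C")):
    """Distribution over the allele one parent passes on to the child."""
    dist = {a: 0.0 for a in alleles}
    for allele in parent_genotype:  # each parental allele is passed on with prob 1/2
        for a in alleles:
            if a == allele:
                dist[a] += 0.5 * (1.0 - mu)
            else:
                dist[a] += 0.5 * mu / (len(alleles) - 1)
    return dist

def child_genotype_dist(mother, father, mu=1e-8, alleles=("A", "C")):
    """P(child genotype | parental genotypes); genotypes are unordered, e.g. 'AC'."""
    from_mother = transmitted_allele_dist(mother, mu, alleles)
    from_father = transmitted_allele_dist(father, mu, alleles)
    dist = {}
    for a, b in product(alleles, repeat=2):
        genotype = "".join(sorted(a + b))
        dist[genotype] = dist.get(genotype, 0.0) + from_mother[a] * from_father[b]
    return dist

print(child_genotype_dist("AA", "AC"))
```

In a trio caller, a prior of this kind would be combined with the read evidence from all three individuals, so that a child genotype inconsistent with both parents needs strong read support before it is called.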

SNP calling in trios [Figure: aligned reads aatgtagtaAgtacctac / aatgtagtaCgtacctac for mother, father, and child; P=0.79, P=0.86]

4. Structural variation discovery: read pair mapping pattern (breakpoint detection)

Copy number estimation: depth of read coverage
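A minimal sketch of depth-based copy number estimation: scale each window's read depth by a baseline depth that is assumed to correspond to the normal diploid state. Using the median window depth as that baseline is my assumption, as are the function name and the toy depths. Real pipelines also correct for effects such as GC content and mappability, which are omitted here.

```python
from statistics import median

def copy_number_from_depth(window_depths, ploidy=2):
    """Estimate copy number per window as ploidy * depth / baseline depth,
    taking the median window depth as the diploid baseline."""
    baseline = median(window_depths)
    if baseline == 0:
        raise ValueError("baseline depth is zero; cannot normalise")
    return [ploidy * depth / baseline for depth in window_depths]

# Toy depths: the fourth window looks like a heterozygous deletion,
# the sixth like a duplication.
print(copy_number_from_depth([30, 31, 29, 15, 30, 60, 30]))
```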

Deletion: Aberrant positive mapping distance

Tandem duplication: negative mapping distance
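The two read-pair signatures above can be expressed as a simple classifier over the signed mapping distance of each pair. The threshold of k standard deviations above the mean fragment length, the label strings, and the function name are hypothetical choices for illustration; other signatures, such as inversions or insertions, are deliberately left out.

```python
def classify_read_pair(mapping_distance, mean_fragment, sd_fragment, k=3.0):
    """Label a read pair by its signed mapping distance on the reference.

    A distance far larger than the library's fragment length suggests that
    sequence between the mates is missing in the sample (deletion); a negative
    distance means the mates map in the reverse order, as for a tandem duplication.
    """
    if mapping_distance < 0:
        return "tandem_duplication_candidate"
    if mapping_distance > mean_fragment + k * sd_fragment:
        return "deletion_candidate"
    return "unremarkable"

# Library with 300 bp fragments and a 30 bp standard deviation.
for distance in (310, 2500, -400):
    print(distance, classify_read_pair(distance, 300, 30))
```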

Het deletion “revealed” by normalization (Chip Stewart, Saturday poster session)

5. Data visualization: software development, data validation, hypothesis generation

Summary: Next-generation sequencing is a boon for large-scale individual human resequencing. Basic data mining tools are being applied and tested in the 1000 Genomes Project. There is still a lot of fine-tuning to do. A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes.

Credits: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby. Several postdoc positions are available… mail marth@bc.edu

Software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release

Positions: Several postdoc positions are available… mail marth@bc.edu

Individual genotype directly from sequence [Figure: reads AACGTTAGCATA / AACGTTCGCATA / AACGTTAGCATA for individual 1, individual 2, individual 3; genotype labels A/C, C/C, A/A]

Genotyping from primary sequence data: 100 @ 16x: 0.975 +/- 0.121; 200 @ 8x: 0.968 +/- 0.129; 400 @ 4x: 0.924 +/- 0.151; 800 @ 2x: 0.769 +/- 0.154

Most reads contain no or few errors

Paired-end reads help unique read placement. PE (fragment amplification): fragment length 100 - 600 bp; fragment length limited by amplification efficiency. MP (circularization): 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity. (Korbel et al. Science 2007)

How many reads needed to call an allele? [Figure: aligned reads alternating aatgtagtaAgtacctac and aatgtagtaCgtacctac; P=0.82, P=0.08]
