Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008.


1 Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008

2 Next-gen. sequencers offer vast throughput [Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp)] ABI capillary sequencer; 454 pyrosequencer (100-400 Mb in 200-450 bp reads); Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)

3 Next-gen sequencing enables new applications: organismal resequencing & de novo sequencing; transcriptome sequencing for transcript discovery and expression profiling; epigenetic analysis (e.g. DNA methylation) (Meissner et al. Nature 2008; Ruby et al. Cell, 2006; Jones-Rhoades et al. PLoS Genetics, 2007)

4 Large-scale individual human resequencing

5 Technologies

6 Roche / 454 system: pyrosequencing technology; variable read-length; the only new technology with >100 bp reads

7 Illumina / Solexa Genome Analyzer: fixed-length short-read sequencer; very high throughput; read properties are very close to traditional capillary sequences

8 AB / SOLiD system: fixed-length short reads; very high throughput; 2-base encoding system; color-space informatics [Figure: 2-base color-code matrix, 1st base (A, C, G, T) by 2nd base (A, C, G, T), colors 0-3]
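The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described SOLiD color code, in which each color (0-3) equals the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names are hypothetical and are not taken from any tool named in this talk.

```python
# Illustrative sketch of 2-base (color-space) encoding; names are hypothetical.
# Assumption: color = XOR of the 2-bit base codes, with A=0, C=1, G=2, T=3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colors(seq):
    """Return one color per pair of adjacent bases in seq."""
    seq = seq.upper()
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base.upper()]
    for color in colors:
        # XOR is its own inverse, so the next base follows from the previous one.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)
```

Because each decoded base depends on the previous one, a single miscalled color changes every downstream base in this naive decoding, which is one reason color-space reads call for dedicated informatics.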

9 Helicos / Heliscope system: short-read sequencer; single molecule sequencing; no amplification; variable read-length; error rate reduced with 2-pass template sequencing

10 Data characteristics

11 Read length [Figure: read length [bp], axis 0 to 400] ~200-450 (variable); 25-70 (fixed); 25-50 (fixed); 20-60 (variable)

12 Representational biases: “dispersed” coverage distribution; this affects genome resequencing (deeper starting read coverage is needed); will have a major impact on counting applications

13 Amplification errors: many reads from clonal copies of a single fragment; an early amplification error gets propagated into every clonal copy; early PCR errors in “clonal” read copies lead to false positive allele calls

14 Read quality

15 Error rate (Illumina)

16 Error rate (454)

17 Per-read errors (Solexa)

18 Per read errors (454)

19 Base quality values not well calibrated

20 Tools for genome resequencing

21 The resequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation [figure labels: REF, IND]

22 The variation discovery “toolbox”: base callers; read mappers; SNP callers; SV callers; assembly viewers

23 1. Base calling: base sequence; base quality (Q-value) sequence; diverse chemistry & sequencing error profiles
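As a reminder of what a base quality (Q-value) encodes, here is a minimal sketch assuming the standard Phred scale, Q = -10 log10(P_error). The slide does not spell out the scale, so treat this as the conventional definition rather than a statement about any particular base caller.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality for an error probability in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)
```

Under this scale Q20 corresponds to a 1-in-100 error probability and Q30 to 1 in 1,000.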

24 454 pyrosequencer error profile: multiple bases in a homopolymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs
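To make the "single scalar signal" problem concrete, the sketch below computes a posterior over homopolymer length from one flow signal. It is only an illustration under assumed parameters (Gaussian noise whose spread grows with run length, and a geometric prior on run length). It is not the PYROBAYES model, and every name and default value is hypothetical.

```python
import math


def homopolymer_posterior(signal, max_len=8, sigma0=0.15,
                          sigma_per_base=0.05, prior_decay=0.25):
    """Posterior P(n | signal) for run lengths n = 0..max_len."""
    weights = []
    for n in range(max_len + 1):
        # Assumption: signal ~ Normal(n, sigma), with noise growing with n.
        sigma = sigma0 + sigma_per_base * n
        likelihood = math.exp(-0.5 * ((signal - n) / sigma) ** 2) / sigma
        # Assumption: longer homopolymer runs are a priori rarer.
        prior = prior_decay ** n
        weights.append(likelihood * prior)
    total = sum(weights)
    if total == 0:
        raise ValueError("signal is too far from every modeled run length")
    return [w / total for w in weights]
```

With these assumed parameters, a signal of 2.1 puts most of the posterior on a run of two bases. Signals that fall between integers spread the posterior over neighboring lengths, which is where over- and under-calls (INDEL errors) come from.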

25 454 base quality values: the native 454 base caller assigns base quality values that are too low

26 PYROBAYES: determine base number

27 PYROBAYES: Performance. Assigned quality values predict the measured error rate better; a higher fraction of bases are high quality

28 Base quality value calibration

29 Recalibrated base quality values (Illumina)
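One simple way to think about quality value calibration is an empirical lookup table: group bases by their assigned Q-value, measure how often each group is actually wrong (for example against a trusted reference), and convert that error rate back to the Phred scale. The sketch below assumes this binning approach and add-one smoothing. The slides do not describe the actual recalibration procedure, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def recalibration_table(observations):
    """Map each assigned Q to an empirical Q.

    observations: iterable of (assigned_q, is_error) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # assigned_q -> [errors, total]
    for assigned_q, is_error in observations:
        counts[assigned_q][0] += int(is_error)
        counts[assigned_q][1] += 1
    table = {}
    for q, (errors, total) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) in error-free bins.
        rate = (errors + 1) / (total + 2)
        table[q] = round(-10 * math.log10(rate))
    return table
```

For instance, if bases assigned Q20 are wrong about 10% of the time, the table maps Q20 to roughly Q10, flagging the assigned values as over-optimistic.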

30 2. Read mapping. Read mapping is like doing a jigsaw puzzle… you get the pieces… and they give you the picture on the box. Unique pieces are easier to place than others…

31 Non-uniqueness of reads confounds mapping. Reads from repeats cannot be uniquely mapped back to their true region of origin. RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length.

32 Strategies to deal with non-unique mapping. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented). Read mapping to multiple loci requires the assignment of alignment probabilities (figure example: 0.8, 0.19, 0.01).
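A minimal sketch of assigning alignment probabilities to a read with several candidate loci: normalize per-locus alignment likelihoods so they sum to one, then summarize confidence in the best locus as a Phred-scaled mapping quality. The likelihood inputs, the cap of 60, and the function names are assumptions for illustration. The slides do not state how MOSAIK computes these values.

```python
import math


def alignment_probabilities(likelihoods):
    """Normalize per-locus alignment likelihoods into probabilities."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return [x / total for x in likelihoods]


def mapping_quality(probabilities, cap=60):
    """Phred-scaled probability that the best locus is the wrong one."""
    p_wrong = 1.0 - max(probabilities)
    if p_wrong <= 0:
        return cap  # a single, certain placement
    return min(cap, round(-10 * math.log10(p_wrong)))
```

Relative likelihoods of 80 : 19 : 1 reproduce the kind of split shown on the slide, and the best locus then carries a low mapping quality because one time in five the read belongs elsewhere.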

33 Paired-end reads help unique read placement. PE (fragment amplification): fragment length 100-600 bp; fragment length limited by amplification efficiency. MP (circularization): 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity. PE reads are now the standard for genome resequencing. Citation on slide: Korbel et al. Science 2007.

34 MOSAIK

35 INDEL alleles/errors – gapped alignments 454

36 Aligning multiple read types together: ABI/capillary, 454 FLX, 454 GS20, Illumina. Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources and error characteristics

37 Aligner speed

38 3. Polymorphism / mutation detection [figure labels: sequencing error, polymorphism]

39 New challenges for SNP calling: deep alignments of 100s / 1000s of individuals; trio sequences

40 Rare alleles in 100s / 1,000s of samples

41 Allele discovery is a multi-step sampling process: Population → Samples → Reads

42 Capturing the allele in the sample
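The first sampling step can be quantified with a simple model. If chromosomes are drawn at random from the population, an allele at population frequency f appears at least once among n diploid individuals with probability 1 - (1 - f)^(2n). The sketch below assumes this random-sampling model, which is not stated on the slide.

```python
def capture_probability(allele_freq, n_individuals, ploidy=2):
    """P(allele is present in at least one sampled chromosome)."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes
```

Under this model a 1% allele has roughly an 87% chance of being present among 100 sampled individuals, so rare alleles need large samples before read coverage even comes into play.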

43 Allele calling in the reads [figure labels: base call, sample size, individual read coverage, base quality]

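44 Allele calling in deep sequence data [Figure: aligned reads aatgtagtaAgtacctac / aatgtagtaCgtacctac / aatgtagtaAgtacctac; table with columns Q30, Q40, Q50, Q60 and rows 1, 2, 3; recoverable values: row 1: 0.01, 0.1, 0.5; row 2: 0.82, 1.0]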

45 More samples or deeper coverage / sample? Shallower read coverage from more individuals… or deeper coverage from fewer samples? (simulation analysis by Aaron Quinlan)

46 Analysis indicates a balance

47 SNP calling in trios: the child inherits one chromosome from each parent; there is a small probability for a mutation in the child

48 SNP calling in trios [Figure: reads aatgtagtaAgtacctac and aatgtagtaCgtacctac for mother, father, and child; P=0.79, P=0.86]
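The trio constraint can be written down as a transmission probability. The sketch below gives P(child genotype | parental genotypes) under Mendelian inheritance with a small per-allele mutation probability. The mutation rate default, the uniform mutation model, and the function names are assumptions for illustration, not the caller used in the talk. Genotypes are two-letter strings such as "AC".

```python
def transmit(genotype, allele, mu=1e-8, n_alleles=4):
    """P(a parent with `genotype` passes on `allele`), allowing mutation."""
    p = 0.0
    for parental in genotype:  # each parental allele is passed on with prob 1/2
        if parental == allele:
            p += 0.5 * (1.0 - mu)
        else:
            # Assumption: a mutation is equally likely to yield any other base.
            p += 0.5 * mu / (n_alleles - 1)
    return p


def child_genotype_prob(child, mother, father, mu=1e-8):
    """P(unordered child genotype | mother and father genotypes)."""
    a, b = child
    p = transmit(mother, a, mu) * transmit(father, b, mu)
    if a != b:
        # A heterozygous child can receive its alleles in either order.
        p += transmit(mother, b, mu) * transmit(father, a, mu)
    return p
```

A term like this acts as a prior that makes Mendelian-inconsistent genotype combinations very unlikely unless the read evidence is strong, which is how trio data can sharpen individual calls.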

49 Determining genotype directly from sequence [Figure: reads AACGTTAGCATA and AACGTTCGCATA for individuals 1, 2, and 3; genotype labels A/C, C/C, A/A]

50 4. Structural variation discovery

51 SV events from PE read mapping patterns

52 Deletion: Aberrant positive mapping distance

53 Copy number estimation from depth of coverage

54 Alignability – read coverage normalization

55 Het deletion “revealed” by normalization
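The depth-of-coverage idea on slides 53-55 can be sketched as follows: divide the observed read count in a window by the window's alignability (the fraction of positions where a read can be placed uniquely), then compare with the count expected per chromosome copy. The parameter names and the per-window formulation are assumptions for illustration.

```python
def estimate_copy_number(read_count, alignability, expected_count_per_copy):
    """Estimate copy number in a window from alignability-normalized depth.

    alignability: fraction (0..1) of the window that is uniquely alignable.
    expected_count_per_copy: reads expected per fully alignable window
        for one chromosome copy.
    """
    if alignability <= 0:
        return None  # no uniquely alignable sequence: depth is uninformative
    normalized = read_count / alignability
    return normalized / expected_count_per_copy
```

For example, with 100 reads expected per copy, a half-alignable window holding 50 reads comes out at one copy, a heterozygous deletion, rather than being misread from the raw depth.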

56 Tandem duplication: negative mapping distance
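The paired-end signatures on slides 52 and 56 can be sketched as a simple classifier: a pair whose mates map much farther apart than the library fragment length suggests a deletion, and a pair with negative mapping distance (mates in swapped order) suggests a tandem duplication. The z-score cutoff, the labels, and the function name are hypothetical, and this is not a description of how Spanner works.

```python
def classify_pair(mapped_distance, expected_mean, expected_sd, z_cutoff=3.0):
    """Label a read pair by its mapped distance relative to the library."""
    if expected_sd <= 0:
        raise ValueError("expected_sd must be positive")
    if mapped_distance < 0:
        # Mates map in swapped order relative to the reference.
        return "tandem_duplication_candidate"
    z = (mapped_distance - expected_mean) / expected_sd
    if z > z_cutoff:
        # Mates map farther apart than the fragment library allows.
        return "deletion_candidate"
    return "concordant"
```

In practice a single aberrant pair is weak evidence, so a real caller would likely require several pairs supporting the same event before reporting it.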

57 Spanner – a hybrid SV/CNV detection tool [Screenshot panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region]

58 5. Data visualization: 1. aid software development: integration of trace data viewing, fast navigation, zooming/panning; 2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays; 3. promote hypothesis generation: integration of annotation tracks

59 Data visualization

60 Our software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Beta_Release

61 Data mining projects

62 SNP calling in short-read coverage. The goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes; the primary aim was to detect polymorphisms between the Pasadena and the Bristol strain. Data: 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University: Pasadena, CB4858 (1 ½ machine runs); Bristol, N2 strain (3 ½ machine runs). Reference: C. elegans reference genome (Bristol, N2 strain).

63 Polymorphism discovery in C. elegans. MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU); PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster); SNP density: 1 in 1,630 bp (of non-repeat genome sequence). SNP calling error rate very low: Validation rate = 97.8% (224/229); Conversion rate = 92.6% (224/242); Missed SNP rate = 3.75% (26/693). INDEL candidates validate and convert at similar rates to SNPs: Validation rate = 89.3% (193/216); Conversion rate = 87.3% (193/221). [figure labels: SNP, INS]

64 Mutational profiling: deep 454/Illumina/SOLiD data. Pichia stipitis converts xylose to ethanol (bio-fuel production); one mutagenized strain had especially high conversion efficiency; goal: determine where the mutations that caused this phenotype were; we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads; 14 true point mutations in the entire genome. [Figure: Pichia stipitis reference sequence; image from JGI web site]

65 Technology comparisons

66 Thanks

67 Credits: Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Kristen Stoops, Ed Thayer

68 Lab

69 Recruitment

