Next-generation sequencing: the informatics angle


1 Next-generation sequencing: the informatics angle
Gabor T. Marth Boston College Biology Department

2 Next-generation sequencing
[Figure: bases per machine run (axis ticks 1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb) versus read length (axis ticks 10 bp, 100 bp, 1,000 bp). Plotted platforms: Illumina and AB/SOLiD short-read sequencers (5-15 Gb per run), 454 pyrosequencer, ABI capillary sequencer]

3 Individual human resequencing

4 Whole-genome mutational profiling

5 Expression analysis

6 Technologies

7 Roche / 454 system
pyrosequencing technology
variable read-length
the only new technology with >100bp reads

8 Illumina / Solexa Genome Analyzer
fixed-length short-read sequencer
very high throughput
read properties are very close to traditional capillary sequences
low INDEL error rate

9 AB / SOLiD system
fixed-length short reads
very high throughput
2-base encoding system
color-space informatics
[Table: 2-base color code, indexed by 1st base (A, C, G, T) and 2nd base (A, C, G, T)]
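The 2-base encoding on this slide can be illustrated with a small sketch. It assumes the commonly described scheme in which bases are numbered A=0, C=1, G=2, T=3 and each color is the XOR of two adjacent base codes; the function names are hypothetical and not taken from any vendor software.

```python
# Minimal sketch of 2-base (color-space) encoding and decoding.
# Assumption: bases are coded A=0, C=1, G=2, T=3 and the color of a
# dinucleotide is the XOR of its two base codes (values 0-3).
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def encode_colors(seq):
    """Return (first_base, colors) for a base-space sequence."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a color list."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse, so the previous base plus the color
        # determines the next base.
        bases.append(BASES[CODE[bases[-1]] ^ color])
    return "".join(bases)


first, colors = encode_colors("ACGT")    # first == "A", colors == [1, 3, 1]
print(decode_colors(first, colors))      # prints ACGT
print(decode_colors("A", [1, 0, 1]))     # one wrong color: prints ACCA
```

The last line shows why color-space data needs its own informatics: a single miscalled color changes every downstream base when decoded naively, so tools usually align in color space rather than translate reads first.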

10 Helicos / Heliscope system
short-read sequencer
single molecule sequencing
no amplification
variable read-length
error rate reduced with 2-pass template sequencing

11 Data characteristics

12 Read length
[Figure: read length by technology; x-axis: read length [bp], ticks at 100, 200, 300, 400. Labeled ranges: 20-60 bp (variable), 25-50 bp (fixed)]

13 Paired fragment-end reads
fragment amplification: fragment length bp
fragment length limited by amplification efficiency
circularization: 500bp - 10kb (sweet spot ~3kb)
fragment length limited by library complexity
Korbel et al. Science 2007
paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
instrumental for structural variation discovery
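The "rescue" idea on this slide can be sketched as follows: when one end maps uniquely and the other has several candidate positions, keep only the candidates that imply a plausible fragment length. This is a simplified illustration, not the method of any particular mapper; it ignores strand and orientation, and the function name, parameters and numbers are hypothetical.

```python
# Minimal sketch: use the fragment length constraint to place a
# non-uniquely mapping mate. Assumptions: positions are leftmost
# coordinates on the same chromosome, both reads have the same length,
# and read orientation is ignored.
def rescue_mate(anchor_start, candidate_starts, read_len, min_frag, max_frag):
    """Return the single candidate consistent with the library, else None."""
    consistent = []
    for cand in candidate_starts:
        # Implied fragment spans from the leftmost start to the rightmost end.
        left = min(anchor_start, cand)
        right = max(anchor_start, cand) + read_len
        if min_frag <= right - left <= max_frag:
            consistent.append(cand)
    # Only a unique consistent placement counts as a rescue.
    return consistent[0] if len(consistent) == 1 else None


# The uniquely mapped end starts at 1000; the mate has three candidate hits.
print(rescue_mate(1000, [1200, 50000, 900000], 35, 150, 300))  # prints 1200
```

Requiring exactly one consistent candidate keeps the sketch conservative: if two candidates both fit the expected fragment length, the mate stays unplaced.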

14 Representational biases
“dispersed” coverage distribution
this affects genome resequencing (deeper starting read coverage is needed)
will have a major impact on counting applications

15 Amplification errors
an early amplification error gets propagated into every clonal copy
many reads from clonal copies of a single fragment
early PCR errors in “clonal” read copies lead to false positive allele calls
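One common way to limit the effect of clonal copies is to collapse reads that share the same mapped position and strand, keeping a single representative. The sketch below assumes that heuristic; the `Read` record and its fields are hypothetical, and the slide itself does not prescribe a specific method.

```python
# Minimal sketch: collapse putative clonal duplicates.
# Assumption: reads with identical (chrom, start, strand) are treated as
# copies of one fragment, and the copy with the highest summed base
# quality is kept.
from collections import namedtuple

Read = namedtuple("Read", "name chrom start strand quals")


def collapse_duplicates(reads):
    """Keep one read per (chrom, start, strand) key."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in best or sum(read.quals) > sum(best[key].quals):
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, "+", [30, 30]),
    Read("r2", "chr1", 100, "+", [35, 35]),   # same placement as r1
    Read("r3", "chr1", 100, "-", [20, 20]),   # other strand, kept
    Read("r4", "chr1", 250, "+", [30, 30]),
]
print(sorted(r.name for r in collapse_duplicates(reads)))  # ['r2', 'r3', 'r4']
```

At very deep coverage, independent fragments can share a start position by chance, so a position-based filter like this one can also discard genuine data.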

16 Read quality

17 Error rate (Solexa)
Derek: please make the numbers BIGGER so people from the back rows can see them!

18 Error rate (454)

19 Per-read errors (Solexa)
Ask Derek to label axes, and change title: Distribution of reads according to number of errors

20 Per-read errors (454)

21 Applications

22 Genome resequencing for variation discovery
SNPs
short INDELs
structural variations
the most immediate application area

23 Genome resequencing for mutational profiling
Organismal reference sequence
Ask Chip to provide images for this one slide
likely to change “classical genetics” and mutational analysis

24 De novo genome sequencing
Lander et al. Nature 2001
difficult problem with short reads
promising, especially as reads get longer

25 Identification of protein-bound DNA
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)
Transcription factor binding sites (Robertson et al. Nature Methods 2007)
DNA methylation (Meissner et al. Nature 2008)
natural applications for next-gen sequencers

26 Transcriptome sequencing: transcript discovery
Mortazavi et al. Nature Methods 2008
Ruby et al. Cell 2006
high-throughput, but short reads pose challenges

27 Transcriptome sequencing: expression profiling
Cloonan et al. Nature Methods 2008
Jones-Rhoades et al. PLoS Genetics 2007
high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

28 Analysis software

29 Individual resequencing
[Workflow diagram: reads from an individual (IND) compared against the reference (REF)]
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation

30 The variation discovery “toolbox”
base callers
read mappers
SNP callers
SV callers
assembly viewers

31 1. Base calling
base sequence
base quality value sequence

32 Base quality value calibration

33 Recalibrated base quality values (Illumina)
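A minimal sketch of what quality value calibration involves, assuming the standard Phred definition Q = -10 log10(P_error) and a set of base calls at positions where the true base is known. The function names and the pseudocount are illustrative choices, not the calibration procedure used for the slides.

```python
# Minimal sketch: build a recalibration table from observed errors.
# Assumptions: each observation is (reported_quality, is_error), with
# is_error determined at sites where the true base is known; a small
# pseudocount avoids log10(0) in bins with no observed errors.
import math
from collections import defaultdict


def phred(p_error):
    """Phred-scaled quality for an error probability."""
    return -10 * math.log10(p_error)


def calibration_table(observations):
    """Map each reported quality value to its empirically observed quality."""
    counts = defaultdict(lambda: [0, 0])     # quality -> [bases, errors]
    for reported_q, is_error in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_error)
    return {
        q: round(phred((errors + 1) / (bases + 2)))
        for q, (bases, errors) in counts.items()
    }


# 1000 bases reported as Q30, of which 9 are wrong: far below Q30 in practice.
obs = [(30, True)] * 9 + [(30, False)] * 991
print(calibration_table(obs))   # prints {30: 20}
```

In this toy example the instrument reports Q30 (one error in 1,000), but the observed error rate is about one in 100, so the recalibrated value is Q20.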

34 2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
…and they give you the picture on the box
Problem is, some pieces are easier to place than others…

35 Strategies to deal with non-unique mapping

36 Mapping probabilities (qualities)
[Figure: a read with three candidate map positions, with mapping probabilities 0.8, 0.19 and 0.01]
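The probabilities 0.8, 0.19 and 0.01 on this slide can be turned into a single mapping quality. The sketch below assumes the usual Phred-style convention, in which the quality is -10 log10 of the probability that the chosen placement is wrong; the cap of 60 and the function name are arbitrary illustrative choices.

```python
# Minimal sketch: Phred-scaled mapping quality from placement probabilities.
# Assumptions: probs holds the (possibly unnormalized) probabilities of the
# candidate map positions of one read; the cap is an arbitrary ceiling for
# reads with a single candidate placement.
import math


def mapping_quality(probs, cap=60.0):
    """Return -10*log10(P(best placement is wrong)), capped."""
    p_best = max(probs) / sum(probs)
    p_wrong = 1.0 - p_best
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10 * math.log10(p_wrong))


print(round(mapping_quality([0.8, 0.19, 0.01])))   # prints 7
```

A quality near 7 says the best placement is wrong about one time in five, which is why downstream tools often discard or down-weight such reads.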

37 Paired-end read alignments
Paired-end read alignments help unique read placement
PE sequences are now the “norm” for genome sequencing

38 Gapped alignments
Gapped alignments allow mapping of reads with insertion or deletion errors, and of reads with bona fide INDEL alleles
The ability to map reads with INDEL errors also improves the certainty of unique mapping

39 3. SNP and short-INDEL discovery
capillary sequences: either clonal or diploid traces

40 SNP and short-INDEL discovery (II)
New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection
[Figure label: INS]

41 New demands on SNP calling

42 Rare alleles in 100s / 1,000s of samples

43 More samples or deeper coverage / sample?

44 Determining genotype directly from sequence
individual 1: reads AACGTTAGCATA, AACGTTCGCATA; genotype A/C
individual 2: reads AACGTTCGCATA; genotype C/C
individual 3: reads AACGTTAGCATA; genotype A/A
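Calling a genotype directly from reads, as in the A/C, C/C and A/A example on this slide, can be sketched with a simple likelihood model. The sketch assumes a diploid site with two candidate alleles, independent base observations, errors spread evenly over the three other bases, and a flat prior over genotypes; the function names are hypothetical and this is not the caller used for the slides.

```python
# Minimal sketch: diploid genotype likelihoods from the bases covering one site.
# Assumptions: reads are independent, each base has error rate e with the
# error spread evenly over the three other bases, and all three genotypes
# are equally likely a priori.
import math


def genotype_likelihoods(bases, error_rates, alleles=("A", "C")):
    """Return log10 likelihoods for the genotypes a/a, a/b and b/b."""
    a, b = alleles
    result = {}
    for genotype in ((a, a), (a, b), (b, b)):
        loglik = 0.0
        for base, e in zip(bases, error_rates):
            # Each read samples one of the two chromosomes with probability 0.5.
            p = sum(0.5 * ((1 - e) if base == allele else e / 3)
                    for allele in genotype)
            loglik += math.log10(p)
        result["/".join(genotype)] = loglik
    return result


def call_genotype(bases, error_rates, alleles=("A", "C")):
    """Pick the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(bases, error_rates, alleles)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype("ACAC", [0.01] * 4))   # prints A/C
print(call_genotype("CCCC", [0.01] * 4))   # prints C/C
print(call_genotype("AAAA", [0.01] * 4))   # prints A/A
```

With only a few reads per individual the heterozygous and homozygous likelihoods can be close, which is the trade-off behind the earlier question of more samples versus deeper coverage per sample.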

45 4. Structural variation discovery software
[Screenshot of structural variation discovery software; panels: navigation bar, fragment lengths in selected region, depth of coverage in selected region]
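The fragment-length view on this slide reflects a widely used signal: read pairs whose mapped fragment length is far from the library's typical value hint at a structural variant. The sketch below assumes a simple median and median-absolute-deviation cutoff; the threshold, labels and function name are illustrative and not taken from the software shown.

```python
# Minimal sketch: flag read pairs with anomalous mapped fragment lengths.
# Assumptions: frag_lengths are mapped fragment lengths from one library in
# one region; a pair counts as discordant if it lies more than n_mads median
# absolute deviations from the median.
import statistics


def flag_discordant(frag_lengths, n_mads=5.0):
    """Return 'long', 'short' or None for each fragment length."""
    med = statistics.median(frag_lengths)
    mad = statistics.median([abs(x - med) for x in frag_lengths]) or 1.0
    flags = []
    for x in frag_lengths:
        if x > med + n_mads * mad:
            flags.append("long")     # pair spans more reference than expected
        elif x < med - n_mads * mad:
            flags.append("short")    # pair spans less reference than expected
        else:
            flags.append(None)
    return flags


lengths = [200, 205, 195, 210, 190, 198, 202, 1200]
print(flag_discordant(lengths))
# prints [None, None, None, None, None, None, None, 'long']
```

A cluster of "long" pairs in one region is consistent with a deletion in the sampled individual relative to the reference, and the depth-of-coverage track offers an independent check.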

46 5. Data visualization (assembly viewers)
software development
data validation
hypothesis generation

47 New analysis tools are needed
Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing)
Analysis pipelines and viewers that focus on the essential results, e.g. showing the few mutations in a mutant, or comparing 1000 genome sequences (but hiding most details)
Work-bench style tools to support downstream analysis

48 Data storage and data standards

49 What level of data to store?
traces
images
base quality values
base-called reads

50 Data standards
different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data)
even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling)
requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM)

51 Data standards (II)
Sequence Read Format, SRF (Asim Siddiqui, UBC)
Assembly format working group
Genotype Likelihood Format (Richard Durbin, Sanger)

52 Summary

53 Conclusions: next-gen sequencing software
Next-generation sequencing is a boon for mass-scale human resequencing, whole-genome mutational profiling, expression analysis and epigenetic studies
Informatics tools are already effective for basic applications
There is a need both for “generic” analysis tools (e.g. flexible read aligners) and for specialized tools tailored to specific applications (e.g. expression profiling)
Move toward tools that focus on biological analysis
Most challenges are technical in nature (e.g. data storage, useful data formats, fast read mapping)… many of these will be addressed at this conference

54 Credits
Michael Stromberg, Chip Stewart, Aaron Quinlan, Michele Busby, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang

55 Credits
Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Kristen Stoops, Ed Thayer

