Informatics challenges for next-generation sequence analysis


1 Informatics challenges for next-generation sequence analysis
Gabor T. Marth, Boston College Biology Department
PSB 2008, January

2 Read length and throughput
[Figure: bases per machine run (1 Mb, 10 Mb, 100 Mb, 1 Gb) versus read length (10 bp, 100 bp, 1,000 bp)]
Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in bp reads)
454 pyrosequencer ( Mb in bp reads)
ABI capillary sequencer

3 Current and future application areas
Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
[Figure: reference genome with SNP and DEL labels]
De novo genome sequencing
Short-read sequencing will be (at least) an alternative to microarrays for:
DNA-protein interaction analysis (ChIP-Seq)
novel transcript discovery
quantification of gene expression
epigenetic analysis (methylation profiling)

4 Fundamental informatics challenges (I)
1. Interpreting machine readouts: base calling, base error estimation
2. Dealing with non-uniqueness in the genome: resequenceability
3. Alignment of billions of reads

5 Informatics challenges (II)
4. SNP, short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management

6 Challenge 1. Base accuracy and base calling
machine read-outs are quite diverse
read length, read accuracy, and sequencing error profiles are variable (and change rapidly as machine hardware, chemistry, optics, and noise filtering improve)

7 Roche/454 pyrosequencer
insertions and deletions dominate the error profile
error rates are nucleotide-dependent
base quality values are underestimated

8 Illumina/Solexa system
Error rate grows with cycle number
Actual base accuracy for a fixed base quality value is a function of base position within the read
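The gap between nominal and actual accuracy can be made concrete with the standard Phred relation, Q = -10 log10(p). The following is a minimal sketch, not the calibration procedure of any vendor pipeline: it assumes error observations come from reads aligned to a trusted reference, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def phred_to_error_prob(q):
    """Nominal error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def empirical_error_by_position(observations, quality):
    """Observed error rate per read position for one nominal quality value.

    observations: iterable of (position, quality, is_error) tuples, assumed
    to be collected from reads aligned to a trusted reference.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [errors, total]
    for pos, q, is_error in observations:
        if q == quality:
            counts[pos][0] += int(is_error)
            counts[pos][1] += 1
    return {pos: errs / total for pos, (errs, total) in sorted(counts.items())}
```

If the observed rate at late cycles is well above `phred_to_error_prob(quality)`, the quality values at those positions are too optimistic and would need position-aware recalibration.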

9 AB/SOLiD system
dibase encoding: base transitions rather than individual bases are read
conversion between base-space and “color-space”
base quality value assignment is tricky
[Figure: 2-base, 4-color encoding table, 1st base (A, C, G, T) by 2nd base (A, C, G, T): 16 probe combinations]
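The conversion between base space and color space can be sketched in a few lines. This illustration relies on the standard SOLiD dibase table (AA, CC, GG, TT = 0; AC, CA, GT, TG = 1; AG, GA, CT, TC = 2; AT, TA, CG, GC = 3), which is equivalent to XOR-ing two-bit base codes. The function names are hypothetical.

```python
# Two-bit base codes: the color of a dinucleotide is the XOR of its two codes,
# which reproduces the 16-entry, 4-color dibase table.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def to_color_space(seq):
    """Encode a base sequence as the list of colors of adjacent base pairs."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def to_base_space(first_base, colors):
    """Decode colors back to bases, given the known first base."""
    bases = [first_base]
    for c in colors:
        bases.append(BASE[CODE[bases[-1]] ^ c])
    return "".join(bases)
```

Because each decoded base depends on the previous one, a single miscalled color corrupts every downstream base when converted naively. For example, decoding `[3, 2, 0, 3, 1]` instead of `[3, 1, 0, 3, 1]` from a leading `A` yields `ATCCGT` rather than `ATGGCA`, which is one reason per-base quality assignment is tricky.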

10 PYROBAYES: 454 base calling program
[Figure: data likelihoods and priors combine into the posterior base number probability]
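The relation behind this slide is Bayes' rule: the posterior probability of a base number (homopolymer length) n is proportional to the data likelihood times the prior. The sketch below is only an illustration and does not reproduce PYROBAYES's actual distributions: it assumes a Gaussian signal model whose spread grows with n, and the parameters `sd0` and `sd_slope` and the example prior are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def base_number_posterior(signal, prior, sd0=0.15, sd_slope=0.05):
    """Posterior P(n | signal) over base numbers n = 0 .. len(prior) - 1.

    Assumed (hypothetical) likelihood: signal ~ Normal(n, sd0 + sd_slope * n).
    """
    unnorm = [gaussian_pdf(signal, n, sd0 + sd_slope * n) * p
              for n, p in enumerate(prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def call_base_number(signal, prior):
    """Return the most probable base number and the probability it is wrong."""
    post = base_number_posterior(signal, prior)
    n = max(range(len(post)), key=post.__getitem__)
    return n, 1.0 - post[n]
```

With a prior such as `[0.5, 0.3, 0.12, 0.05, 0.03]`, a signal of 1.05 is called as one base and a signal of 2.9 as three bases. The second return value can be converted into a base quality value.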

11 Challenge 2. Resequenceability
Reads from repeats cannot be uniquely mapped back to their true region of origin
RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
Near-perfect micro-repeats can also be a problem because we want to align reads even with a few sequencing errors and/or SNPs

12 Finding perfect and near-perfect micro-repeats
Hash-based methods (fast but only work out to a couple of mismatches)
Exact methods (very slow but find every repeat copy)
Heuristic methods (fast but miss a fraction of the repeats)
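A hash-based search for perfect micro-repeats can be sketched as follows. This is a minimal illustration, assuming the genome fits in memory as a string and considering the forward strand only. The function names are hypothetical, and near-perfect repeats (with mismatches) would need an extension that is not shown.

```python
from collections import defaultdict

def kmer_positions(genome, k):
    """Hash every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def perfect_micro_repeats(genome, k):
    """k-mers that occur more than once, i.e. read-length-scale exact repeats."""
    return {kmer: pos for kmer, pos in kmer_positions(genome, k).items()
            if len(pos) > 1}

def unique_fraction(genome, k):
    """Fraction of k-mer start positions whose k-mer is unique in the genome."""
    n = len(genome) - k + 1
    if n <= 0:
        return 0.0
    unique = sum(1 for pos in kmer_positions(genome, k).values() if len(pos) == 1)
    return unique / n
```

Setting `k` to the read length gives a rough estimate of the fraction of positions that a perfect read of that length can be mapped to unambiguously.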

13 Challenge 3. Read alignment and assembly
resequencing requires reference sequence-guided read alignment
Step 1. initial short-hash based scan for possible read locations
Step 2. evaluation of candidate read locations with the Smith-Waterman (SW) method
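The two steps can be sketched as a small seed-and-extend aligner. This is an illustrative sketch under simplifying assumptions (forward strand only, linear gap penalty, scores only with no traceback), not the aligner described in the talk. The function names, scoring parameters, and the `pad` window margin are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_starts(read, index, k):
    """Step 1: implied read start positions from exact k-mer hits, with vote counts."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            votes[hit - offset] += 1
    return votes

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def align_read(read, reference, index, k, pad=2):
    """Score each candidate window with SW and return (start, score) of the best."""
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman(read, reference[lo:hi])
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

The hash scan keeps the expensive dynamic programming confined to a handful of short reference windows per read, which is what makes the approach feasible for very large read sets.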

14 Alignment to reference in “color-space”
There is no need to ever leave color space: you can convert the reference sequence to color space and then align the two sequences. This is in fact desirable given the error properties of color space, so our analysis software will use color space. (Note: the misalignments shown here are explained in the next two slides.)
Working in color space:
Reverse-complementation becomes simply reversal
Apply color transition rules to remove measurement errors from partial assemblies
If reference or Sanger reads are combined, translate them to color space
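Two of these color-space properties can be checked directly. The sketch below assumes the standard dibase encoding expressed as an XOR of two-bit base codes, and its helper names are hypothetical. It shows that the colors of a reverse-complemented sequence are the reversed colors of the original, and that a true single-base difference changes two adjacent colors, whereas an isolated color mismatch is more likely a measurement error.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def to_color_space(seq):
    """Colors of adjacent base pairs (XOR of two-bit base codes)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def color_mismatches(read_colors, ref_colors):
    """Positions where two equal-length color strings disagree."""
    return [i for i, (r, c) in enumerate(zip(read_colors, ref_colors)) if r != c]
```

For the reference `ACGGTCA` and a read carrying a G-to-A substitution at position 3 (`ACGATCA`), `color_mismatches` reports positions 2 and 3: two adjacent color changes, the signature of a real base difference.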

15 Challenge 4. Polymorphism discovery
shallow and deep read coverage
most candidates will never be “checked”, so only very low error rates are acceptable
we updated PolyBayes to deal with new read types
made the new software (PBSHORT) much more efficient

16 SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
[Figure labels: SNP, INS]
SNP calling error rate very low:
Validation rate = 97.8% (224/229)
Conversion rate = 92.6% (224/242)
Missed SNP rate = 3.75% (26/693)
INDEL candidates validate and convert at similar rates to SNPs:
Validation rate = 89.3% (193/216)
Conversion rate = 87.3% (193/221)

17 Mutational profiling: deep 454/Illumina/SOLiD data
Pichia stipitis reference sequence (image from JGI web site)
collaboration with Doug Smith at Agencourt
Pichia stipitis converts xylose to ethanol (bio-fuel production)
one mutagenized strain had especially high conversion efficiency
goal: determine where the mutations that caused this phenotype were
we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
14 true point mutations in the entire genome
At about 15X nominal coverage, each technology can find every point mutation with essentially no false positives

18 Structural variation discovery
copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage
structural rearrangements (inversions and translocations) require paired-end read data
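Depth-based copy number detection can be sketched as a comparison of windowed read depth against a genome-wide baseline. This is a deliberately simple illustration, not the method used in the talk: it ignores mappability and GC bias, and the window size and the `low`/`high` ratio thresholds are hypothetical.

```python
from statistics import median

def window_depths(read_starts, read_length, genome_length, window):
    """Mean per-base depth in consecutive non-overlapping windows."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return [sum(depth[i:i + window]) / len(depth[i:i + window])
            for i in range(0, genome_length, window)]

def call_copy_number_windows(depths, low=0.5, high=1.5):
    """Label each window by its depth relative to the median window depth."""
    baseline = median(depths)
    calls = []
    for d in depths:
        ratio = d / baseline if baseline > 0 else 0.0
        if ratio <= low:
            calls.append("deletion")
        elif ratio >= high:
            calls.append("amplification")
        else:
            calls.append("normal")
    return calls
```

Rearrangements that preserve copy number, such as inversions, leave these depth ratios unchanged, which is why they require paired-end data.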

19 Challenge 5. Data visualization
aid software development: integration of trace data viewing, fast navigation, zooming/panning
facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays
promote hypothesis generation: integration of annotation tracks

20 Challenge 6. Massive data volumes
two connected working groups to define standard binary data formats:
Short-read format working group (Asim Siddiqui, UBC)
Assembly format working group (Boston College)

21 Our software is available for testing

22 Credits
http://bioinformatics.bc.edu/marthlab
Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
Michael Stromberg, Chip Stewart, Michele Busby, Aaron Quinlan, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang

