Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing.


1 Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis. Gabor Marth, Boston College Biology. Next Generation Sequencing Technologies for Antibody Clone Screening, April 6, 2009.

2 New sequencing technologies…

3 … offer vast throughput. [Figure: bases per machine run (1 Mb to 100 Gb) versus read length (10 bp to 1,000 bp).] ABI capillary sequencer; Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads); Illumina/Solexa and AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads).

4 Roche / 454 pyrosequencing technology: variable read length; the only new technology with >100 bp reads.

5 Illumina / Solexa: fixed-length short-read sequencer; very high throughput; read properties are very close to those of traditional capillary sequences.

6 AB / SOLiD: fixed-length short reads; very high throughput; 2-base encoding system; color-space informatics. [Figure: 2-base encoding matrix, 1st base (A, C, G, T) by 2nd base (A, C, G, T), each cell holding a color code 0-3.]
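The 2-base encoding on this slide can be illustrated with a minimal sketch. It assumes the commonly documented SOLiD color matrix, in which the color of a base pair equals the XOR of the 2-bit codes A=0, C=1, G=2, T=3; the function names are hypothetical and are not taken from the talk or from any vendor software.

```python
# Assumed 2-bit base codes. XOR of two adjacent codes reproduces the
# commonly documented di-base color matrix (AA=0, AC=1, AG=2, AT=3, ...).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colorspace(seq):
    """Return the first base followed by one color digit per adjacent base pair."""
    seq = seq.upper()
    colors = [str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)


def decode_colorspace(cs_read):
    """Rebuild the base sequence from the leading base and the color digits."""
    bases = [cs_read[0].upper()]
    for color in cs_read[1:]:
        # Each color, combined with the previous base, determines the next base.
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    encoded = encode_colorspace("ATGGA")
    print(encoded, decode_colorspace(encoded))
```

Because every color depends on two neighboring bases, one wrong color changes all downstream bases on decoding, which is one reason color-space reads are usually aligned in color space rather than decoded first.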

7 Helicos / Heliscope: short-read sequencer; single-molecule sequencing; no amplification; variable read length.

8 Many applications: organismal resequencing and de novo sequencing; transcriptome sequencing for transcript discovery and expression profiling (Ruby et al. Cell, 2006; Jones-Rhoades et al. PLoS Genetics, 2007); epigenetic analysis, e.g. DNA methylation (Meissner et al. Nature, 2008).

9 Data characteristics

10 Read length. [Figure: read length in bp, axis from 0 to 400.] Ranges shown: ~200-450 bp (variable); 25-100 bp (fixed); 25-50 bp (fixed); 25-60 bp (variable).

11 Error characteristics (Illumina)

12 Error characteristics (454)

13 Coverage bias. [Figure: read coverage along the genome at ~2X and at ~20X read coverage.]

14 Genome re-sequencing

15 Complete human genomes

16 The re-sequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) SNP and short INDEL calling; (iv) SV calling; (v) data viewing, hypothesis generation. [Figure labels: REF, IND.]

17 Read mapping

18 Read mapping … is like a jigsaw puzzle: … you get the pieces … and they give you the picture on the box. Big and unique pieces are easier to place than others …

19 Challenge: non-uniqueness. Reads from repeats cannot be uniquely mapped back to their true region of origin. RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length.
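The idea of repeats at the scale of the read length can be made concrete with a small sketch that counts how many read-length substrings (k-mers) of a reference occur exactly once. It is an illustration only: it assumes exact matching on a single strand, ignores reverse complements and sequencing errors, and uses a hypothetical function name.

```python
from collections import Counter


def unique_kmer_fraction(reference, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    if len(reference) < k:
        raise ValueError("reference is shorter than k")
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


if __name__ == "__main__":
    toy_reference = "ACGTACGTTT"  # hypothetical toy sequence with a short repeat
    for k in (4, 5):
        print(k, unique_kmer_fraction(toy_reference, k))
```

On a real genome the same count would be taken per read length, which shows directly how longer reads shrink the non-uniquely mappable fraction.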

20 Non-unique mapping

21 Single-end (SE) short-read alignments are error-prone. [Figure value: 0.35%.]

22 Paired-end (PE) reads. Fragment length: 100-600 bp; Korbel et al. Science, 2007; fragment length: 1-10 kb.

23 PE alignment statistics (simulated data). [Figure values: 0.00%, 7.6%, 0.09%, 0.35%, 0.03%.]

24 The MOSAIK read mapper/aligner Michael Strömberg

25 Gapped alignments

26 Aligning multiple read types together: ABI/capillary, 454 FLX, 454 GS20, Illumina.

27 SNP / short-INDEL discovery

28 Polymorphism detection. [Figure panels: sequencing error; polymorphism.]

29 Allele calling in multi-individual data
Observed bases per individual: $B_1 = \text{aacc}$, $B_i = \text{aaaac}$, $B_n = \text{cccc}$. [Figure: aligned reads carrying the a and c alleles at the site.]
"Genotype likelihoods", for each genotype $g \in \{aa, cc, ac\}$: $P(B_1 = \text{aacc} \mid G_1 = g)$, $P(B_i = \text{aaaac} \mid G_i = g)$, $P(B_n = \text{cccc} \mid G_n = g)$.
Prior: $\text{Prior}(G_1, \ldots, G_i, \ldots, G_n)$.
"Genotype probabilities", for each genotype $g \in \{aa, cc, ac\}$: $P(G_1 = g \mid B_1 = \text{aacc}; B_i = \text{aaaac}; B_n = \text{cccc})$, $P(G_i = g \mid B_1 = \text{aacc}; B_i = \text{aaaac}; B_n = \text{cccc})$, $P(G_n = g \mid B_1 = \text{aacc}; B_i = \text{aaaac}; B_n = \text{cccc})$.
Combined output: $P(\text{SNP})$.
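The step from genotype likelihoods to genotype probabilities can be sketched in a few lines. This is a simplified illustration, not the method from the talk. Its assumptions: a bi-allelic site with alleles a and c, a single per-base error rate that always flips to the other allele, independent reads, and the same independent prior for every sample, whereas the slide uses a joint prior over all individuals. All function names and the error rate are hypothetical.

```python
import math

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihoods(bases, error_rate=0.01):
    """P(B | G) for each diploid genotype; bases is a string over {'a', 'c'}."""
    likelihoods = {}
    for genotype in GENOTYPES:
        frac_a = genotype.count("a") / 2  # share of chromosomes carrying 'a'
        # Probability that a single read shows 'a' under this genotype.
        p_a = frac_a * (1 - error_rate) + (1 - frac_a) * error_rate
        likelihoods[genotype] = math.prod(p_a if b == "a" else 1 - p_a for b in bases)
    return likelihoods


def genotype_posteriors(bases, prior, error_rate=0.01):
    """P(G | B) by Bayes' rule with a per-sample prior (assumed independent)."""
    lik = genotype_likelihoods(bases, error_rate)
    unnormalized = {g: lik[g] * prior[g] for g in GENOTYPES}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


def snp_probability(samples, prior, error_rate=0.01):
    """P(site is polymorphic) = 1 - P(all samples aa) - P(all samples cc)."""
    posteriors = [genotype_posteriors(b, prior, error_rate) for b in samples]
    p_all_aa = math.prod(p["aa"] for p in posteriors)
    p_all_cc = math.prod(p["cc"] for p in posteriors)
    return 1 - p_all_aa - p_all_cc


if __name__ == "__main__":
    flat_prior = {"aa": 1 / 3, "ac": 1 / 3, "cc": 1 / 3}  # hypothetical prior
    samples = ["aacc", "aaaac", "cccc"]  # B_1, B_i, B_n from the slide
    for bases in samples:
        print(bases, genotype_posteriors(bases, flat_prior))
    print("P(SNP):", snp_probability(samples, flat_prior))
```

With a joint prior over all samples the posteriors no longer factorize per sample, so a full implementation would have to sum over genotype combinations rather than multiply per-sample terms.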

30 SNP calling in deep sample sets. [Figure labels: Population; Samples; Reads; Allele detection.]

31 Capturing the allele in the samples

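32 The ability to call rare alleles. [Table: rows are the number of reads (1, 2, 3), columns are base quality (Q30, Q40, Q50, Q60); surviving values: 1 read: 0.01, 0.1, 0.5; 2 reads: 0.82, 1.0.] [Figure: aligned reads aatgtagtaAgtacctac, aatgtagtaCgtacctac, aatgtagtaAgtacctac.]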

33 Allele calling in 400 samples

34 Detecting de novo mutations: the child inherits one chromosome from each parent; there is a small probability for a de novo (germ-line or somatic) mutation in the child.
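The inheritance model on this slide can be written as a small transmission probability. The sketch assumes a bi-allelic site, one allele passed from each parent with probability 1/2 per chromosome, and a per-transmission mutation rate; the default rate and the function names are hypothetical and serve only as illustration.

```python
def transmitted_allele_probs(parent_genotype, mutation_rate, alleles="ac"):
    """P(allele passed to the child) for a diploid parent, allowing mutation."""
    probs = {allele: 0.0 for allele in alleles}
    for inherited in parent_genotype:  # each parental chromosome has weight 1/2
        for allele in alleles:
            if allele == inherited:
                probs[allele] += 0.5 * (1 - mutation_rate)
            else:
                probs[allele] += 0.5 * mutation_rate / (len(alleles) - 1)
    return probs


def child_genotype_prob(child, mother, father, mutation_rate=1e-8):
    """P(child genotype | parental genotypes); genotypes are strings like 'ac'."""
    from_mother = transmitted_allele_probs(mother, mutation_rate)
    from_father = transmitted_allele_probs(father, mutation_rate)
    first, second = child
    prob = from_mother[first] * from_father[second]
    if first != second:
        # A heterozygous child can receive either allele from either parent.
        prob += from_mother[second] * from_father[first]
    return prob


if __name__ == "__main__":
    # A heterozygous child of two aa parents requires a de novo mutation.
    print(child_genotype_prob("ac", "aa", "aa"))
    print(child_genotype_prob("ac", "ac", "ac"))
```

In a caller, this term would act as the prior on the child's genotype, so a heterozygous call in the child of two homozygous parents needs strong read evidence to overcome the small mutation probability.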

35 Capture sequencing

36 Targeted mammalian re-sequencing. Deep sequencing of complete human genomes is still too expensive. There is a need to sequence target regions, typically genes, to follow up on GWAS studies. Targeted re-sequencing with DNA fragment capture offers a potentially cost-effective alternative. Solid-phase or liquid-phase capture. 454 or Illumina sequencing. The informatics pipeline must account for the peculiarities of capture data.

37 On/off target capture. [Figure labels: target region; SNP (outside target region); ref allele*: 45%; non-ref allele*: 54%.]

38 Reference allele bias. ref allele*: 54%; non-ref allele*: 45%. (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346.

39 SNP example Amit Indap

40 Structural Variation discovery

41 Structural variations

42 SV/CNV detection – SNP chips. Tiling arrays and SNP chips made whole-genome CNV scans possible. Probe density and placement limit resolution. Balanced events cannot be detected.

43 SV/CNV detection – resolution. [Figure: relative numbers of expected CNV events versus CNV event length in bp, with the ranges covered by karyotype, micro-array, and sequencing.]

44 Read depth

45 CNV events found using RD. [Figure: chromosome 2, position in Mb.]
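A read-depth (RD) scan can be sketched as counting read starts in fixed windows and flagging windows whose depth departs from the chromosome-wide median. This is a generic illustration under stated assumptions, not the algorithm used to produce the events on this slide: read starts are 0-based and fall inside the chromosome, no GC or mappability correction is applied, and the window size, threshold, and function names are hypothetical.

```python
import math
from statistics import median


def window_read_counts(read_starts, chrom_length, window_size):
    """Count read start positions (0-based) in consecutive fixed-size windows."""
    counts = [0] * math.ceil(chrom_length / window_size)
    for start in read_starts:
        counts[start // window_size] += 1
    return counts


def flag_cnv_windows(counts, log2_threshold=0.5):
    """Return (window_index, log2_ratio) for windows far from the median depth."""
    baseline = median(counts)
    flagged = []
    for index, count in enumerate(counts):
        # Pseudocount of 0.5 keeps the ratio finite for empty windows.
        ratio = math.log2((count + 0.5) / (baseline + 0.5))
        if abs(ratio) >= log2_threshold:
            flagged.append((index, ratio))
    return flagged


if __name__ == "__main__":
    toy_counts = [100, 100, 100, 200, 100, 50, 100]  # hypothetical window depths
    for index, ratio in flag_cnv_windows(toy_counts):
        print(index, round(ratio, 2))
```

A practical caller would also merge adjacent flagged windows into events and model depth variance, since single-window outliers are common at low coverage.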

46 PE read mapping positions

47 The SV/CNV “event display” (Chip Stewart)

48 Spanner – specificity

49 Data standards

50 Data types with standard formats: SRF/FASTQ, SAM/BAM, GLF.
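As an illustration of one of these formats, the sketch below reads four-line FASTQ records and converts the quality string to Phred scores. It assumes unwrapped records and the Sanger quality offset of 33; some pipelines of this period used an offset of 64, so the offset is left as a parameter. The function names are hypothetical.

```python
import io


def parse_fastq(handle, offset=33):
    """Yield (name, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, [ord(ch) - offset for ch in quality]


def error_probability(phred):
    """Convert a Phred score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")  # hypothetical record
    for name, sequence, scores in parse_fastq(example):
        print(name, sequence, scores, [error_probability(q) for q in scores])
```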

51 Transcriptome sequencing

52 Data highly reproducible Michele Busby

53 Comparative data Michele Busby

54 Biological questions Michele Busby

55 Our software tools for next-gen data http://bioinformatics.bc.edu/marthlab/Software_Release

56 Credits: Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Patrice Milos, John Thompson.

57 Lab Several postdoc positions are available!

58 Mutational profiling

59 Chemical mutagenesis

60 Mutational profiling: deep 454/Illumina/SOLiD data. Pichia stipitis converts xylose to ethanol (bio-fuel production). One mutagenized strain had high conversion efficiency; the goal is to determine which mutations caused this phenotype. 15 Mb genome: 454, Illumina, and SOLiD reads. 14 true point mutations in the entire genome. 10-15X genome coverage required. [Figure: Pichia stipitis reference sequence; image from JGI web site.]

