

Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing Technologies for Antibody Clone Screening April 6, 2009

New sequencing technologies…

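… offer vast throughput
[Figure: read length (axis ticks 10 bp, 100 bp, 1,000 bp) versus bases per machine run (axis ticks 1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb, 100 Gb). Platforms shown: ABI capillary sequencer; Roche/454 pyrosequencer; Illumina/Solexa and AB/SOLiD sequencers. Throughput annotations, with numbers partly lost in extraction: "( Mb in bp reads)" and "(10-30Gb in bp reads)".]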

Roche / 454: pyrosequencing technology; variable read-length; the only new technology with >100bp reads

Illumina / Solexa: fixed-length short-read sequencer; very high throughput; read properties are very close to traditional capillary sequences

AB / SOLiD: fixed-length short reads; very high throughput; 2-base encoding system; color-space informatics
[Figure: 2-base encoding table, 1st base (A, C, G, T) by 2nd base (A, C, G, T)]
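The 2-base encoding on this slide can be sketched in a few lines. This is a minimal illustration, assuming the commonly documented SOLiD scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function names are hypothetical and not taken from the talk.

```python
# Assumed scheme: 2-bit base codes; the color of a dinucleotide is the XOR of its codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colorspace(seq):
    """Return the first base followed by one color digit per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(cs):
    """Rebuild the base sequence from a leading base and its color digits."""
    bases = [cs[0]]
    for digit in cs[1:]:
        # Each color, combined with the previous base, determines the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(digit)])
    return "".join(bases)


encoded = encode_colorspace("ACGT")
decoded = decode_colorspace(encoded)
```

Because every decoded base depends on the previous one, a single wrong color changes all downstream bases when translated naively, which is one reason color-space reads are usually handled with dedicated informatics rather than converted to bases up front.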

Helicos / Heliscope: short-read sequencer; single molecule sequencing; no amplification; variable read-length

Many applications: organismal resequencing & de novo sequencing; transcriptome sequencing for transcript discovery and expression profiling; epigenetic analysis (e.g. DNA methylation). References shown: Ruby et al. Cell, 2006; Jones-Rhoades et al. PLoS Genetics, 2007; Meissner et al. Nature 2008.

Data characteristics

Read length
[Figure: read length [bp] per platform, each marked (variable) or (fixed); numeric values other than 400 were lost in extraction]

Error characteristics (Illumina)

Error characteristics (454)

Coverage bias
[Figure: two panels showing reads placed along the genome and the resulting read coverage, one at ~2X and one at ~20X]

Genome re-sequencing

Complete human genomes

The re-sequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) SNP and short INDEL calling; (iv) SV calling; (v) data viewing, hypothesis generation
[Diagram labels: REF, IND]

Read mapping

Read mapping … is like a jigsaw puzzle: you get the pieces, and they give you the picture on the box. Big and unique pieces are easier to place than others.

Challenge: non-uniqueness. Reads from repeats cannot be uniquely mapped back to their true region of origin. RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length.
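The non-uniqueness problem can be made concrete with a toy exact-match mapper. This is only a sketch under simplifying assumptions (no mismatches, no gaps, one strand, a made-up reference string); it is not how MOSAIK or any production mapper works, and all names are illustrative.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return every reference position where the whole read matches exactly."""
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]


# Hypothetical reference: unique flanks around a short CA micro-repeat.
reference = "ACGTTGCACACACACACAGGATCCTA"
k = 4
index = build_index(reference, k)

unique_hits = map_read("GTTGCA", reference, index, k)   # read from a unique region
repeat_hits = map_read("CACACA", reference, index, k)   # read from inside the repeat
```

The read taken from the flank has a single placement, while the read taken from the micro-repeat matches at several positions, so its true origin cannot be recovered from the sequence alone.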

Non-unique mapping

Single-end (SE) short-read alignments are error-prone [figure value: 0.35%]

Paired-end (PE) reads. Fragment length: 100 – 600 bp. Korbel et al. Science 2007. Fragment length: 1 – 10 kb.

PE alignment statistics (simulated data) [table values without surviving labels: 0.00%, 7.6%, 0.09%, 0.35%, 0.03%]

The MOSAIK read mapper/aligner (Michael Strömberg)

Gapped alignments

Aligning multiple read types together: ABI/capillary, 454 FLX, 454 GS20, Illumina

SNP / short-INDEL discovery

Polymorphism detection [figure labels: sequencing error; polymorphism]

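Allele calling in multi-individual data. Reads $B_1 = aacc$, $B_i = aaaac$, $B_n = cccc$ from individuals $1, \dots, i, \dots, n$ cover a site with alleles a and c. [Diagram residue: -----a c a c-----]

"Genotype likelihoods":
$P(B_1 = aacc \mid G_1 = aa)$, $P(B_1 = aacc \mid G_1 = cc)$, $P(B_1 = aacc \mid G_1 = ac)$
$P(B_i = aaaac \mid G_i = aa)$, $P(B_i = aaaac \mid G_i = cc)$, $P(B_i = aaaac \mid G_i = ac)$
$P(B_n = cccc \mid G_n = aa)$, $P(B_n = cccc \mid G_n = cc)$, $P(B_n = cccc \mid G_n = ac)$

Prior: $\mathrm{Prior}(G_1, \dots, G_i, \dots, G_n)$

"Genotype probabilities":
$P(G_1 = aa \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$, $P(G_1 = cc \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$, $P(G_1 = ac \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$
$P(G_i = aa \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$, $P(G_i = cc \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$, $P(G_i = ac \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$
$P(G_n = aa \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$, $P(G_n = cc \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$, $P(G_n = ac \mid B_1 = aacc; B_i = aaaac; B_n = cccc)$

$P(\mathrm{SNP})$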
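The step from genotype likelihoods to genotype probabilities can be illustrated with a small Bayesian calculation on the slide's three read sets. This is a simplified sketch, not the talk's actual caller: it assumes a fixed per-base error rate, a biallelic site, and an independent per-individual prior (the slide uses a joint prior over all individuals). The prior values and error rate are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def base_likelihood(base, genotype, error=0.01):
    """P(observed base | genotype), averaging over the two chromosomes."""
    # Assumption: at a biallelic site a miscall always yields the other allele.
    return sum((1 - error) if base == allele else error for allele in genotype) / 2


def genotype_likelihoods(bases, error=0.01):
    """P(B | G) for each genotype, treating the reads as independent."""
    return {g: prod(base_likelihood(b, g, error) for b in bases) for g in GENOTYPES}


def genotype_posteriors(bases, prior, error=0.01):
    """P(G | B) by Bayes' rule with a per-individual prior (a simplification)."""
    lik = genotype_likelihoods(bases, error)
    unnorm = {g: lik[g] * prior[g] for g in GENOTYPES}
    total = sum(unnorm.values())
    return {g: value / total for g, value in unnorm.items()}


prior = {"aa": 0.98, "ac": 0.01, "cc": 0.01}             # hypothetical prior
samples = {"B1": "aacc", "Bi": "aaaac", "Bn": "cccc"}    # read sets from the slide

posteriors = {name: genotype_posteriors(b, prior) for name, b in samples.items()}
calls = {name: max(p, key=p.get) for name, p in posteriors.items()}

# Under the independence assumption, the site is monomorphic only if every
# individual is aa or every individual is cc.
p_snp = 1.0 - prod(p["aa"] for p in posteriors.values()) \
            - prod(p["cc"] for p in posteriors.values())
```

With these assumed numbers the single c among five reads in the second individual is attributed to sequencing error, while the other two individuals still make the site polymorphic with high probability.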

SNP calling in deep sample sets [figure labels: Population; Samples; Reads; Allele detection]

Capturing the allele in the samples
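Whether a rare allele is present at all among the sequenced individuals is a sampling question that can be worked out directly. The sketch below assumes chromosomes are drawn independently from a population in which the allele has a given frequency; the function name and the example frequency are illustrative, not from the talk.

```python
def prob_allele_sampled(freq, n_samples, ploidy=2):
    """P(at least one copy of an allele with population frequency `freq` is sampled)."""
    # Each of the ploidy * n_samples chromosomes misses the allele with prob (1 - freq).
    return 1.0 - (1.0 - freq) ** (ploidy * n_samples)


p_one_percent_in_400 = prob_allele_sampled(0.01, 400)
p_common_in_one = prob_allele_sampled(0.5, 1)
```

Under this model a 1% allele is almost certain to be carried by someone in a 400-sample set, so the limiting step becomes detecting it in the reads rather than capturing it in the samples.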

The ability to call rare alleles
[Figure: aligned reads aatgtagtaAgtacctac, aatgtagtaCgtacctac, aatgtagtaAgtacctac; series by base quality Q30, Q40, Q50, and a further Q value lost in extraction]
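The Q values in this slide are Phred-scaled base qualities, which relate to error probability by Q = -10 log10(p). A short conversion sketch (function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


q30_error = phred_to_error_prob(30)
q_for_one_in_ten_thousand = error_prob_to_phred(1e-4)
```

At Q30 roughly one base call in a thousand is wrong, so in a deep sample set a single non-reference base is easier to trust as a rare allele when its quality is Q40 or Q50 than when it is Q30.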

Allele calling in 400 samples

Detecting de novo mutations. The child inherits one chromosome from each parent. There is a small probability for a de novo (germ-line or somatic) mutation in the child.
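The inheritance rule on this slide suggests a simple screen: a child genotype that cannot be formed from one allele of each parent is a candidate de novo mutation. The sketch below works on called genotypes only and ignores genotyping error, which in practice produces many false candidates; the function names are hypothetical.

```python
from itertools import product


def mendelian_child_genotypes(mother, father):
    """All child genotypes obtainable from one maternal and one paternal allele."""
    return {"".join(sorted(m + f)) for m, f in product(mother, father)}


def is_candidate_de_novo(child, mother, father):
    """True if the child's genotype cannot be explained by Mendelian inheritance."""
    return "".join(sorted(child)) not in mendelian_child_genotypes(mother, father)


candidate = is_candidate_de_novo("ac", "aa", "aa")    # c allele absent in both parents
consistent = is_candidate_de_novo("ac", "aa", "cc")   # one allele from each parent
```

A probabilistic version would weigh each candidate against the genotype probabilities of all three family members instead of trusting hard calls.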

Capture sequencing

Targeted mammalian re-sequencing. Deep sequencing of complete human genomes is still too expensive. There is a need to sequence target regions, typically genes, to follow up on GWAS studies. Targeted re-sequencing with DNA fragment capture offers a potentially cost-effective alternative. Solid phase or liquid phase capture. 454 or Illumina sequencing. Informatics pipeline must account for the peculiarities of capture data.

On/off target capture [figure labels: Target region; SNP (outside target region); ref allele*: 45%; non-ref allele*: 54%]

Reference allele bias: ref allele*: 54%; non-ref allele*: 45%. (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346.
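A reference-allele fraction like the one quoted here can be computed by pooling read counts over known heterozygous sites. The sketch below assumes pooled counting, which may differ from how the slide's figures were derived, and the counts are invented for illustration.

```python
def ref_allele_fraction(site_counts):
    """Pooled fraction of reads carrying the reference allele over heterozygous sites."""
    ref_reads = sum(ref for ref, _ in site_counts)
    all_reads = sum(ref + alt for ref, alt in site_counts)
    return ref_reads / all_reads


# Hypothetical (ref_reads, non_ref_reads) counts at three heterozygous sites.
counts = [(12, 9), (7, 8), (15, 11)]
fraction = ref_allele_fraction(counts)
```

An unbiased experiment would give a fraction near 0.5 at heterozygous sites; a consistent excess of reference reads is what a SNP caller for capture data has to account for.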

SNP example (Amit Indap)

Structural Variation discovery

Structural variations

SV/CNV detection – SNP chips. Tiling arrays and SNP-chips made whole-genome CNV scans possible. Probe density and placement limit resolution. Balanced events cannot be detected.

SV/CNV detection – resolution
[Figure: relative numbers of events versus CNV event length [bp]; series: Expected CNVs, Karyotype, Micro-array, Sequencing]

Read depth

CNV events found using RD (read depth)
[Figure: chromosome 2, position [Mb]]
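The idea behind read-depth CNV detection can be sketched as counting reads per fixed-size window and flagging windows whose depth departs from the typical value. This is a deliberately naive illustration: the window size, the thresholds, and the use of the median as baseline are assumptions, and real callers also correct for GC content and mappability.

```python
from statistics import median


def window_depths(read_starts, genome_length, window):
    """Count reads per window, assigning each read by its start position."""
    counts = [0] * ((genome_length + window - 1) // window)
    for start in read_starts:
        counts[start // window] += 1
    return counts


def flag_cnv_windows(counts, low=0.5, high=1.5):
    """Label windows whose depth ratio to the median suggests a loss or a gain."""
    baseline = median(counts)  # assumed to be greater than zero
    calls = []
    for i, count in enumerate(counts):
        ratio = count / baseline
        if ratio <= low:
            calls.append((i, "loss"))
        elif ratio >= high:
            calls.append((i, "gain"))
    return calls


depths = window_depths([0, 5, 10, 15, 19], genome_length=20, window=10)
events = flag_cnv_windows([10, 11, 9, 21, 10, 4, 10])  # hypothetical window counts
```

Adjacent flagged windows would then be merged into events, and the thresholds tuned to the expected ratios for heterozygous deletions (about 0.5) and duplications (about 1.5).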

PE read mapping positions

The SV/CNV “event display” (Chip Stewart)

Spanner – specificity

Data standards

Data types with standard formats: SRF/FASTQ, SAM/BAM, GLF
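As an illustration of the simplest of these formats, a FASTQ record is four lines: a header starting with @, the sequence, a + separator, and a quality string. The reader below is a minimal sketch that assumes unwrapped four-line records and Phred+33 quality encoding; some instruments of that period wrote other offsets, so the offset is an assumption to check against the data.

```python
def parse_fastq(lines, offset=33):
    """Yield (name, sequence, qualities) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored
        qual = next(it).strip()
        # Each quality character encodes a Phred score as an ASCII offset.
        yield header.strip()[1:], seq, [ord(ch) - offset for ch in qual]


record_lines = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]  # hypothetical record
records = list(parse_fastq(record_lines))
```

For real files it is usually safer to rely on an established parser, since this sketch does no validation and will fail on truncated or line-wrapped input.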

Transcriptome sequencing

Data highly reproducible (Michele Busby)

Comparative data (Michele Busby)

Biological questions (Michele Busby)

Our software tools for next-gen data

Credits: Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Patrice Milos, John Thompson

Lab. Several postdoc positions are available!

Mutational profiling

Chemical mutagenesis

Mutational profiling: deep 454/Illumina/SOLiD data. Pichia stipitis converts xylose to ethanol (bio-fuel production). One mutagenized strain had high conversion efficiency. Goal: determine which mutations caused this phenotype. 15 Mb genome: 454, Illumina, and SOLiD reads. 14 true point mutations in the entire genome. 10-15X genome coverage required.
[Figure: Pichia stipitis reference sequence; image from JGI web site]