Error model for massively parallel (454) DNA sequencing
Sriram Raghuraman (working with Haixu Tang and Justin Choi)



Sequencing Preparation
Randomly fragment the entire genome.
Nebulize the fragments.
Add adapters.
Attach the fragments to DNA capture beads in a water-in-oil emulsion.
PCR-amplify the fragments attached to the beads.
Place the beads, each bound to multiple copies of the same fragment, in a PicoTiterPlate.
Add enzymes, including polymerase and luciferase.

Sequencing Process
Place the plates in a sequencer.
Wash nucleotides (A, C, G, T) in series over the plate.
When a complementary nucleotide enters a well, DNA polymerase extends the strand being synthesized along the template.
Addition of the nucleotide releases light, which is recorded by a CCD camera.
Hundreds of thousands of beads are sequenced in parallel.
Source: "Genome sequencing in microfabricated high-density picolitre reactors", Nature 437 (15 September 2005).

Speed of sequencing
~25 million bases at >=99% accuracy in a 4-hour run
~230,000 reads
Average read length: 110 bases

Data Set (Newbler)
Reads aligned by Newbler
Bases
Matches: (98.90%)
Mismatches: 10643 (0.01%)
Inserts: 368332 (0.37%)
Deletes: (0.67%)
'N' terms: 36820 (0.03%)

Data Set (Sanger)
Reads for Staphylococcus aureus subsp. aureus COL from the NCBI Assembly Archive
Bases
Matches: (99.70%)
Mismatches: 71203 (0.26%)
Inserts: 1827 (0.006%)
Deletes: 6223 (0.02%)

Length Distributions
Newbler reads are shorter than Sanger reads.
Newbler: average read length ~100 bases
Sanger: average read length ~545 bases

Accuracy %
Newbler reads show a prevalence of gaps over mismatches.
Newbler mismatches are indirect: a substitution shows up as a pair of adjacent gaps, for example AA-CT aligned against AAG-T.
Sanger reads contain more mismatches than gaps.
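The match, mismatch, insert, and delete percentages reported for the two data sets come from classifying alignment columns. A minimal sketch of that bookkeeping follows, assuming each alignment is given as two equal-length gapped strings. The function name `column_stats` and the convention that a gap in the second row counts as an insert are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def column_stats(read, genome):
    """Count match, mismatch, insert and delete columns of one alignment.

    Assumed convention: a gap in the genome row is an insert (extra base
    in the read); a gap in the read row is a delete.
    """
    if len(read) != len(genome):
        raise ValueError("aligned strings must have equal length")
    counts = Counter()
    for r, g in zip(read, genome):
        if r == "-" and g == "-":
            continue  # padding column, carries no information
        if g == "-":
            counts["insert"] += 1
        elif r == "-":
            counts["delete"] += 1
        elif r == g:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The "indirect mismatch" pattern: no mismatch column, one delete plus one insert.
print(dict(column_stats("AA-CT", "AAG-T")))
```

Summing these counters over all reads and dividing by the total number of columns would give percentages of the kind shown in the data set slides.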

Biases in Substitutions and Gaps

Substitutions

The case for homogeneous gaps

Homogeneous gaps
Newbler reads often exhibit homogeneous gaps, that is, gaps that fall next to a copy of the same base (within a run of identical nucleotides).
Insertions
R:-CGGGATCAGTGATGGCGTACGTTTACCGGGTTAAAAGAGGGCCGG
G:-CGGGATCAGTGATG-CG-A--TT--CCGG-TTAAA-GAGG-C-GG
Deletions
R:-TTTACA-TCGTGGTCGTGACAC-ATCGACACTGTAT-AAAA-CCAT
G:-TTT-CAATC-TGGTCGTGACACCATCGACACTGTATTAAAAACCAT
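One way to flag homogeneous gaps in an alignment like the ones above is sketched here. It assumes a gap is "homogeneous" when the base opposite the gap equals the nearest non-gap base on either side in the gapped row; that working definition and the name `homogeneous_gaps` are assumptions made for illustration, since the slides do not give a formal rule.

```python
def homogeneous_gaps(read, genome):
    """Return (column, kind, is_homogeneous) for every gap column.

    Assumed definition: a gap is homogeneous if the base aligned against
    it matches the nearest non-gap neighbour in the gapped row.
    """
    results = []
    for i, (r, g) in enumerate(zip(read, genome)):
        if (r == "-") == (g == "-"):
            continue  # no gap here, or a gap in both rows
        if g == "-":
            gapped, base, kind = genome, r, "insert"
        else:
            gapped, base, kind = read, g, "delete"
        # Nearest non-gap characters to the left and right in the gapped row.
        left = next((c for c in reversed(gapped[:i]) if c != "-"), None)
        right = next((c for c in gapped[i + 1:] if c != "-"), None)
        results.append((i, kind, base in (left, right)))
    return results


# An extra G inside a run of G's is reported as a homogeneous insert.
print(homogeneous_gaps("ATGGCG", "ATG-CG"))
```

Other definitions are possible (for example, requiring a minimum run length); the rule above is only the simplest one that separates the two cases.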

Insert Transitions

Delete Transitions

Insert Strings

Delete Strings

Some examples
BLAST 1st hit
CTCCGCATC-AAAG....TTT-GATGCGGAG
CTCCGCATCCAAAG....TTTGGATGCGGAG
Newbler alignment
CCTCCGCATC-AAAG....TTTG-ATGCGGAG
C-TCCGCATCCAAAG....TTTGGATGCGGAG
As far as BLAST is concerned, there is no difference between homogeneous and regular gaps.

Markov Model

General Ideas
Incorporate provisions for homogeneous gaps.
Train the model on Newbler data.
A Markov model that accounts for homogeneous gaps should perform better than an approach that does not (e.g. BLAST).
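Training on aligned reads can be sketched as counting transitions between alignment-column states and normalizing. The version below is a simplification under stated assumptions: it collapses columns into four hypothetical states (M match, X mismatch, I insert, D delete) and applies add-one smoothing; the function names and the `pseudocount` parameter are illustrative, and the model in the slides may use a finer state set.

```python
from collections import Counter, defaultdict

STATES = "MXID"  # match, mismatch, insert, delete (assumed state set)


def column_state(r, g):
    """Map one alignment column to a state label."""
    if g == "-":
        return "I"
    if r == "-":
        return "D"
    return "M" if r == g else "X"


def estimate_transitions(alignments, pseudocount=1.0):
    """Estimate P(next state | current state) from gapped alignment pairs."""
    counts = defaultdict(Counter)
    for read, genome in alignments:
        path = [column_state(r, g)
                for r, g in zip(read, genome) if (r, g) != ("-", "-")]
        for prev, cur in zip(path, path[1:]):
            counts[prev][cur] += 1
    probs = {}
    for s in STATES:
        total = sum(counts[s].values()) + pseudocount * len(STATES)
        probs[s] = {t: (counts[s][t] + pseudocount) / total for t in STATES}
    return probs


probs = estimate_transitions([("AA-CT", "AAG-T")])
print(probs["M"])
```

The pseudocount keeps every transition probability above zero, which matters later if the probabilities are used in log space.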

[Figure: Markov model state diagram. States: match (AA, CC, GG, TT); mismatch MM (e.g. AC, AG, AT); gap states A-, C-, G-, T- and -A, -C, -G, -T.]

Procedure
Get initial, transition, and emission probabilities from Newbler reads.
Use the Markov model to perform pairwise alignment of unaligned reads with the Viterbi algorithm.
Compare the results to BLAST alignments of the same reads.
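The alignment step can be illustrated with a small Viterbi recursion over a three-state pair model (M emits an aligned pair, I a read base against a gap, D a genome base against a gap). This is a generic sketch, not the presenters' model: the three states, the `p_match` emission parameter, the uniform 0.25 gap emissions, and the transition table in the usage example are all assumptions, and nothing here treats homogeneous gaps specially.

```python
import math

NEG_INF = float("-inf")
STATES = ("M", "I", "D")
STEP = {"M": (1, 1), "I": (1, 0), "D": (0, 1)}  # (read, genome) bases consumed


def viterbi_align(read, genome, trans, p_match=0.97):
    """Most probable global alignment; trans[a][b] must be > 0 for all a, b."""
    n, m = len(read), len(genome)

    def log_emit(state, i, j):
        if state != "M":
            return math.log(0.25)  # uniform base against a gap
        same = read[i - 1] == genome[j - 1]
        return math.log(p_match / 4 if same else (1 - p_match) / 12)

    score = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STATES}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STATES}
    score["M"][0][0] = 0.0  # the start is treated as a silent match state
    for i in range(n + 1):
        for j in range(m + 1):
            for s in STATES:
                pi, pj = i - STEP[s][0], j - STEP[s][1]
                if pi < 0 or pj < 0:
                    continue
                cands = {k: score[k][pi][pj] + math.log(trans[k][s])
                         for k in STATES}
                prev = max(cands, key=cands.get)
                if cands[prev] > NEG_INF:
                    score[s][i][j] = cands[prev] + log_emit(s, i, j)
                    back[s][i][j] = prev

    # Trace back from the best final state to the origin.
    state = max(STATES, key=lambda k: score[k][n][m])
    i, j, row_r, row_g = n, m, [], []
    while (i, j) != (0, 0):
        di, dj = STEP[state]
        row_r.append(read[i - 1] if di else "-")
        row_g.append(genome[j - 1] if dj else "-")
        state, i, j = back[state][i][j], i - di, j - dj
    return "".join(reversed(row_r)), "".join(reversed(row_g))


# Hypothetical transition table, for illustration only.
trans = {"M": {"M": 0.90, "I": 0.05, "D": 0.05},
         "I": {"M": 0.80, "I": 0.10, "D": 0.10},
         "D": {"M": 0.80, "I": 0.10, "D": 0.10}}
print(viterbi_align("ACGGT", "ACGT", trans))
```

Because the recursion always ends at the last base of both sequences, this produces a global alignment only, which matches the limitation noted at the end of the talk.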


Results

Limitations
Global alignment only.
Local alignment hinges on a good alignment extension metric/method.