Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.


Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Most slides are taken/adapted from Serafim Batzoglou’s lectures

Outline
Practical challenges in genome sequencing
Whole genome sequencing strategies
Sequencing coverage (Lander-Waterman model)
Overlap-Layout-Consensus approach

Challenges with Fragment Assembly
Sequencing errors: ~1-2% of bases are wrong
Repeats: cause false overlaps; they make up ~5% of bacterial genomes and ~50% of mammalian genomes
Computation: ~ $O(N^2)$, where N = # reads

Repeat Types
Low-complexity DNA (e.g. ATATATATACATA…)
Microsatellite repeats: $(a_1 \dots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
Transposons/retrotransposons
– SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, $10^6$ copies)
– LINE: Long Interspersed Nuclear Elements (~ ,000 bp long, 200,000 copies)
– LTR retroposons: Long Terminal Repeats (~700 bp) at each end
Gene families: genes duplicate and then diverge
Segmental duplications: very long, very similar copies

The sequencing errors, repeats, and the complexity of genomes make it necessary to use many heuristics in practice… The Shortest Superstring formulation is an over-simplification of the problem

Strategies for whole-genome sequencing
1. Hierarchical (clone-by-clone): yeast, worm, human
i. Break genome into many long fragments
ii. Map each long fragment onto the genome
iii. Sequence each fragment with shotgun
2. Online version of (1) (walking): rice genome
i. Break genome into many long fragments
ii. Start sequencing each fragment with shotgun
iii. Construct map as you go
3. Whole Genome Shotgun: fly, human, mouse, rat, fugu
One large shotgun pass on the whole genome

Hierarchical Sequencing vs. Whole Genome Shotgun
Hierarchical Sequencing
– Advantages: Easy assembly
– Disadvantages: Must build a library and a physical map; redundant sequencing
Whole Genome Shotgun (WGS)
– Advantages: No mapping, no redundant sequencing
– Disadvantages: Difficult to assemble and resolve repeats
Whole Genome Shotgun appears to be becoming more popular…

Whole Genome Shotgun Sequencing
[Figure: the genome is cut many times at random; forward-reverse paired reads are taken at a known distance from each other; label: ~500 bp]

Fragment Assembly
Cover region with ~7-fold redundancy
Overlap reads and extend to reconstruct the original genomic region

Read Coverage
Length of genomic segment: G
Number of reads: N
Length of each read: L
Definition: Coverage $C = NL/G$

Enough Coverage
How much coverage is enough?
According to the Lander-Waterman model: assuming a uniform distribution of reads, C = 7 leaves about 1 nucleotide in 1,000 unsequenced ($e^{-7} \approx 0.09\%$)

Lander-Waterman Model
Major assumptions
– Reads are randomly distributed in the genome
– The number of times a base is sequenced follows a Poisson distribution
Implications
– G = genome length, L = read length, N = # reads
– Mean of Poisson: $\lambda = LN/G$ (coverage), the average number of times a base is sequenced
– % bases not sequenced: $p(X=0) = e^{-\lambda}$, which is about 0.09% for $\lambda = 7$
– Total gap length: $p(X=0) \cdot G$
– Total number of gaps: $p(X=0) \cdot N$
This model was used to plan the Human Genome Project…
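The quantities in the model above can be evaluated directly. The following is a minimal sketch, not part of the original slides: the function name `lander_waterman` and the example numbers (a 1 Mb segment, 500 bp reads, 14,000 reads) are illustrative assumptions chosen so that the coverage comes out at 7.

```python
import math

def lander_waterman(genome_len, read_len, num_reads):
    """Evaluate the Lander-Waterman quantities listed on the slide.

    The function name and argument names are hypothetical; only the
    formulas come from the lecture.
    """
    coverage = num_reads * read_len / genome_len   # lambda = LN/G
    p_uncovered = math.exp(-coverage)              # p(X = 0) = e^{-lambda}
    return {
        "coverage": coverage,
        "fraction_uncovered": p_uncovered,
        "total_gap_length": p_uncovered * genome_len,  # p(X=0) * G
        "expected_gaps": p_uncovered * num_reads,      # p(X=0) * N
    }

# Illustrative (assumed) numbers: 1 Mb segment, 500 bp reads, 14,000 reads.
stats = lander_waterman(genome_len=1_000_000, read_len=500, num_reads=14_000)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

With these assumed inputs the uncovered fraction is e^{-7}, which matches the 0.09% quoted on the slide.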

Repeats, Errors, and Read Lengths
Repeats shorter than the read length are OK
Repeats with more base-pair differences than the sequencing error rate are OK
To make a smaller portion of the genome appear repetitive, try to:
– Increase read length
– Decrease sequencing error rate
Role of error correction: discards ~90% of single-letter sequencing errors; this decreases the error rate, which decreases the effective repeat content
However, we have only limited read length. Many heuristics have been introduced to handle repeats…

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..)

Overlap
Find the best match between the suffix of one read and the prefix of another
Because of sequencing errors, dynamic programming is needed to find the optimal overlap alignment
Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
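A suffix-prefix ("overlap") alignment can be computed with a small variation of the global alignment recurrence: skipping a prefix of the first read is free, and the best score is read off the last row. The sketch below is one textbook way to do this, not necessarily what any of the assemblers named in the lecture implement; the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions for illustration.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment of a suffix of `a` against a prefix of `b`.

    Returns (score, j), where j is the length of the prefix of `b`
    used by the best overlap. Scoring parameters are illustrative.
    """
    n, m = len(a), len(b)
    # Row 0: no character of `a` consumed yet, so every consumed
    # character of `b` has to be paid for as a gap.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # cur[0] = 0: dropping a prefix of `a` costs nothing.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # a[i-1] against a gap
                         cur[j - 1] + gap)  # b[j-1] against a gap
        prev = cur
    # The alignment must reach the end of `a`; it may stop anywhere in `b`.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j

score, length = overlap_alignment("TAGATTACACAG", "ACACAGATTACTGA")
print(score, length)
```

The table has (len(a)+1) × (len(b)+1) cells, which is why the filtration step matters: running this on every pair of reads would be far too slow.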

Overlapping Reads
Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer
Extend to full alignment; throw away if not >95% similar
[Figure: two reads sharing the region TAGATTACACAGATTAC, with the alignment extended on both sides]
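The k-mer filtration step can be sketched as follows. Note the substitution: the slide sorts all k-mers, whereas this sketch uses a hash index, which finds the same candidate pairs. The function name is hypothetical, reverse complements are ignored for brevity, and k = 6 is used only so the toy reads are short.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-index pairs that share at least one k-mer."""
    index = defaultdict(set)            # k-mer -> indices of reads containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for ids in index.values():
        # Every two reads sharing this k-mer become a candidate pair.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["TAGATTACACAG", "ACACAGATTACTGA", "GGGGCCCCGGGG"]
pairs = candidate_pairs(reads, k=6)
print(pairs)
```

Only the candidate pairs would then be passed to a full alignment and kept if the alignment is more than 95% similar.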

Overlapping Reads and Repeats
A k-mer that appears N times initiates $N^2$ comparisons
For an Alu that appears $10^6$ times, that is $10^{12}$ comparisons: too much
Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
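The repeat filter amounts to counting k-mers and dropping those above a frequency cutoff. This is a minimal sketch under assumptions: the function name is hypothetical, and the tiny reads with k = 3, coverage 1, and t = 2 are chosen only to make the effect visible (the lecture suggests k ~ 24 and t ~ 10).

```python
from collections import Counter

def filter_repeat_kmers(reads, k, coverage, t=10):
    """Keep only k-mers that occur at most t * coverage times across all reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    limit = t * coverage
    return {kmer for kmer, count in counts.items() if count <= limit}

# Toy example: "AAA" occurs 7 times and is discarded as repeat-like.
kept = filter_repeat_kmers(["AAAAAA", "AAAAAT", "CCGGTT"], k=3, coverage=1, t=2)
print(sorted(kept))
```

Only the surviving k-mers would be used to seed pairwise comparisons, which keeps high-copy repeats such as Alu from triggering a quadratic number of alignments.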

Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Finding Overlapping Reads (cont’d)
Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Example columns (base: quality score), before and after correction:
– C: 20, C: 35, T: 30, C: 35, C: 40 becomes C: 20, C: 35, C: 0, C: 35, C: 40
– A: 15, A: 25, A: 40, A: 25, - becomes A: 15, A: 25, A: 40, A: 25, A: 0
Score alignments
Accept alignments with good scores
Multiple alignments will be covered later in the course…

Layout
Repeats are a major challenge
Do two aligned fragments really overlap, or are they from two copies of a repeat?

Merge Reads into Contigs
Merge reads up to potential repeat boundaries
[Figure: reads merged up to the edges of a repeat region]

Merge Reads into Contigs (cont’d)
Ignore non-maximal reads
Merge only maximal reads into contigs
[Figure: maximal reads around a repeat region]

Merge Reads into Contigs (cont’d)
Ignore “hanging” reads when detecting repeat boundaries
[Figure labels: sequencing error; repeat boundary???; reads a and b]

Merge Reads into Contigs (cont’d)
Insert non-maximal reads whenever unambiguous
[Figure labels: ambiguous placement (?????); unambiguous placement]

Link Contigs into Supercontigs
Normal density
Too dense: overcollapsed? (Myers et al. 2000)
Inconsistent links: overcollapsed?

Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
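The "connect if at least 2 links" rule can be sketched with a union-find structure over contigs. This is a simplified illustration, not the procedure of any particular assembler: it only groups contigs and ignores their order, orientation, and the distances implied by the paired reads. The function name, the contig names, and the link list are hypothetical.

```python
from collections import Counter

def build_supercontigs(contigs, links, min_links=2):
    """Group contigs joined by at least `min_links` paired-read links."""
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    support = Counter(tuple(sorted(link)) for link in links)
    for (a, b), n in support.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c3", "c4"), ("c3", "c4"), ("c4", "c3")]
supercontigs = build_supercontigs(contigs, links)
print(supercontigs)
```

Requiring two or more links guards against a single chimeric or misplaced read pair joining unrelated contigs; here the lone c2-c3 link is not enough to merge the two groups.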

Link Contigs into Supercontigs (cont’d)
Fill gaps in supercontigs with paths of overcollapsed contigs

Link Contigs into Supercontigs (cont’d)
Define G = (V, E)
V := contigs
E := (A, B) such that d(A, B) < C
Reason to do so: efficiency; full shortest paths cannot be computed
[Figure: Contig A and Contig B separated by distance d(A, B)]

Link Contigs into Supercontigs (cont’d)
Define T: contigs linked to either A or B
Fill the gap between A and B if there is a path in G passing only through contigs in T
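The gap-filling test is a restricted path search: look for a path from A to B in G whose intermediate nodes all belong to T. The sketch below uses breadth-first search over an adjacency dictionary; the function name, the graph, and the contig names are hypothetical, and a real assembler would also check that the path length is consistent with the estimated gap size.

```python
from collections import deque

def gap_path(graph, a, b, allowed):
    """Return a path from a to b whose intermediate contigs are all in `allowed`.

    `graph` maps each contig to the contigs reachable from it (edges of G).
    Returns None when no such path exists.
    """
    queue = deque([[a]])
    seen = {a}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], ()):
            if nxt == b:
                return path + [b]
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy graph: "x" offers a shortcut but is not linked to A or B, so it is not in T.
graph = {"A": ["r1", "x"], "r1": ["r2"], "x": ["B"], "r2": ["B"]}
path = gap_path(graph, "A", "B", allowed={"r1", "r2"})
print(path)
```

Restricting the search to T keeps it small, in line with the efficiency argument given for the graph definition.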

Consensus
A consensus sequence is derived from a profile of the assembled fragments
A sufficient number of reads is required to ensure a statistically significant consensus
Reading errors are corrected

Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
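Weighted voting over the columns of a multiple alignment can be sketched as below. The assumptions are labeled in the code: rows are already aligned and of equal length, "-" marks an alignment gap, a blank marks a column the read does not cover, and the weights stand in for per-base quality scores. The function name and the weighting scheme are illustrative, not the exact rule used by any assembler in the lecture.

```python
from collections import defaultdict

def consensus(rows, weights=None):
    """Column-wise weighted vote over equal-length aligned rows.

    Assumptions: "-" is an alignment gap, " " means the read does not
    cover the column, and weights[r][c] is the vote weight (for example
    a quality score) of row r at column c. Unweighted if omitted.
    """
    if weights is None:
        weights = [[1] * len(row) for row in rows]
    out = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, w in zip(rows, weights):
            base = row[col]
            if base != " ":
                votes[base] += w[col]
        if votes:
            winner = max(votes, key=votes.get)
            if winner != "-":          # a winning gap emits no base
                out.append(winner)
    return "".join(out)

rows = ["TAGATTACACAGATTACTGA",
        "TAG-TTACACAGATTATTGA",
        "TAGATTACACAGATTACTGA"]
result = consensus(rows)
print(result)
```

With quality weights, a single high-confidence base can outvote a low-confidence one, and a column such as C: 20, C: 35, T: 30, C: 35, C: 40 still resolves to C.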

What You Should Know
The challenges in assembling fragments for whole genome shotgun sequencing
Lander-Waterman model
Main heuristics used in Overlap-Layout-Consensus