1 Genome sizes (sample). 2 Some genomics history 1995: first bacterial genome, Haemophilus influenza, 1.8 Mbp, sequenced at TIGR first use of whole-genome.

Presentation transcript:

1 Genome sizes (sample)

2 Some genomics history
1995: first bacterial genome, Haemophilus influenzae, 1.8 Mbp, sequenced at TIGR; first use of whole-genome shotgun for a bacterium. Fleischmann et al. became the most-cited paper of the year (2869 citations to date).
2nd and 3rd bacteria published by TIGR: Mycoplasma genitalium, Methanococcus jannaschii.
1996: first eukaryote, S. cerevisiae (yeast), 13 Mbp, sequenced by a consortium of (mostly European) labs.
1997: E. coli finished (7th bacterial genome).
T. pallidum (syphilis), B. burgdorferi (Lyme disease), M. tuberculosis, Vibrio cholerae, Neisseria meningitidis, Streptococcus pneumoniae, Chlamydia pneumoniae [all at TIGR].
2000: fruit fly, Drosophila melanogaster.
2000: first plant genome, Arabidopsis thaliana.
2001: human genome, first draft.
2002: malaria genome, Plasmodium falciparum.
2002: anthrax genome, Bacillus anthracis.
TODAY (Sept 4, 2008): 744 complete microbial genomes! 1199 microbial genomes in progress! 476 eukaryotic genomes in progress!

New directions: sequencing ancient DNA (some assembly required)

5 J. P. Noonan et al., Science 309, (2005)

6 Published by AAAS J. P. Noonan et al., Science 309, (2005) Fig. 1. Schematic illustration of the ancient DNA extraction and library construction process

7 Published by AAAS J. P. Noonan et al., Science 309, (2005) Fig. 2. Characterization of two independent cave bear genomic libraries. The predicted origins of 9035 clones from library CB1 (A) and 4992 clones from library CB2 (B) are shown, as determined by BLAST comparison to GenBank and environmental sequence databases. "Other" refers to viral or plasmid-derived DNAs. The distributions of sequence annotation features in 6,775 nucleotides of carnivore sequence from library CB1 (C) and 20,086 nucleotides of carnivore sequence from library CB2 (D) are shown, as determined by alignment to the July 2004 dog genome assembly.

10 Published by AAAS H. N. Poinar et al., Science 311, (2006) Fig. 1. Characterization of the mammoth metagenomic library, including percentage of read distributions to various taxa

12 Journals. The very best: Science, Nature, PLoS Biology.

13 Bioinformatics Journals: Bioinformatics (bioinformatics.oxfordjournals.org), BMC Bioinformatics, PLoS Computational Biology (compbiol.plosjournals.org), Journal of Computational Biology.

14 Radically new journals: PLoS ONE, Biology Direct. Reviewers’ comments are public. Both journals can be annotated by readers. Papers can be negative results, confirmations of other results, or brand new.

15 Genomics Journals: Genome Biology (genomebiology.com), Genome Research, Nucleic Acids Research (nar.oxfordjournals.org), BMC Genomics.

16 Before assembly… we need to cover a basic sequence alignment algorithm.

17 Sequence Alignment
When we have very similar sequences:
  Closely related species
  Very little changed sequence
  Small differences can be very important
  Computationally “easy” to align
  Assembly ONLY deals with these
When sequences are not so similar:
  Distantly related species
  Most positions changed
  Sequences that are most highly conserved are under the strongest selective (evolutionary) pressure. E.g., some genes in humans and E. coli clearly have a common ancestor, and the proteins can be aligned
  Computationally “difficult” to align

18 Sequence Alignment
Algorithms for sequence alignment: choose the best alignment, subject to some mutation model. A common (but overly simplistic) model for DNA mutations is called “edits”, which counts the number of substitutions, insertions, and deletions. The resulting alignment suggests a possible “history” for the sequence.
This slide and subsequent alignment slides courtesy of Nathan Edwards, available at

19 Example Alignments
ACGTCTAG
||*****^
ACTCTAG-
2 matches, 5 mismatches, 1 not aligned

20 Example Alignments
ACGTCTAG
^**|||||
-ACTCTAG
5 matches, 2 mismatches, 1 not aligned

21 Example Alignments
ACGTCTAG
||^|||||
AC-TCTAG
7 matches, 0 mismatches, 1 not aligned
Edit distance here = 1
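The match, mismatch, and gap counts on the three example slides above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `summarize_alignment` and `marker_line` are hypothetical, and it assumes the slides' marker convention (`|` for a match, `*` for a mismatch, `^` for a column that is not aligned).

```python
def summarize_alignment(top: str, bottom: str, gap: str = "-") -> dict:
    """Count matches, mismatches, and gap columns in a gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            counts["gaps"] += 1          # one side is not aligned
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


def marker_line(top: str, bottom: str, gap: str = "-") -> str:
    """Build the middle line of the alignment display, column by column."""
    marks = []
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            marks.append("^")
        elif a == b:
            marks.append("|")
        else:
            marks.append("*")
    return "".join(marks)


if __name__ == "__main__":
    for top, bottom in [("ACGTCTAG", "ACTCTAG-"),
                        ("ACGTCTAG", "-ACTCTAG"),
                        ("ACGTCTAG", "AC-TCTAG")]:
        print(top)
        print(marker_line(top, bottom))
        print(bottom)
        print(summarize_alignment(top, bottom))
```

Under the unit-cost edit model, the cost of an alignment is its mismatch count plus its gap count, which is why the third alignment has cost 1.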

22 Example Alignments
...AACTGAGTTTACGCGCATAGA...
            |^^^||^|^^|
            T---CG-A--G
Many equally good alignments! Even an exactly matching sequence can be found (at random) in long enough sequences.

23 Global Alignment problem
Given two related sequences, S (length n) and T (length m), find an optimal alignment of S and T.
Edit distance: the minimum number of substitutions, insertions, and deletions needed to transform S into T.

24 Dynamic Programming for pairwise alignment

25 Dynamic Programming Formulation
Definition: Let D(i,j) be the edit distance between the prefixes S[1...i] and T[1...j]. The edit distance of S and T, then, is D(n,m).
Dynamic programming solves the global alignment problem by computing D(i,j) for all i = 0...n and j = 0...m.

26 Recurrence Relation for D
Computation of D is a recursive/iterative process: D(i,j) is expressed in terms of D(i’,j’) for i’ ≤ i and j’ ≤ j, with (i’,j’) ≠ (i,j).
Base conditions for D(i,j):
  D(i,0) = i, for all i = 0,...,n
  D(0,j) = j, for all j = 0,...,m

27 Recurrence relation for D
For i > 0, j > 0:
  D(i,j) = min { D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + δ(S(i),T(j)) }
where δ(a,b) = 0 if a = b and 1 otherwise.
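The recurrence and its base conditions can be transcribed almost literally into a recursive function. This is a minimal sketch under the unit-cost model (δ is 0 for equal characters and 1 otherwise); the names `delta` and `edit_distance_recursive` are hypothetical and do not come from the slides. Python strings are 0-indexed, so S(i) in the slides corresponds to `s[i - 1]` here.

```python
def delta(a: str, b: str) -> int:
    """Substitution cost: 0 for a match, 1 for a mismatch (assumed unit cost)."""
    return 0 if a == b else 1


def edit_distance_recursive(s: str, t: str, i: int = -1, j: int = -1) -> int:
    """Evaluate D(i, j) directly from the recurrence (exponential time)."""
    if i < 0:                      # default call: start from D(n, m)
        i, j = len(s), len(t)
    if i == 0:                     # base condition D(0, j) = j
        return j
    if j == 0:                     # base condition D(i, 0) = i
        return i
    return min(
        edit_distance_recursive(s, t, i - 1, j) + 1,
        edit_distance_recursive(s, t, i, j - 1) + 1,
        edit_distance_recursive(s, t, i - 1, j - 1) + delta(s[i - 1], t[j - 1]),
    )


if __name__ == "__main__":
    print(edit_distance_recursive("ACGTCTAG", "ACTCTAG"))
```

This version is only practical for short strings, because every call with i > 0 and j > 0 spawns three more calls.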

28 Dynamic programming
D(i,j) is computed by optimally solving sub-problems.
The optimal solution to D(i,j) is a simple combination of optimally solved subproblems: the minimum over three smaller subproblems, each plus the cost of one edit step.

29 Using the recurrence
We could code this as a recursive function call... but that takes an exponential number of function evaluations, since each position explores 3 alternatives.
There are only (n+1)x(m+1) pairs i and j, so we must be evaluating D(i,j) multiple times.
Why not cache the results?
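To make the caching argument concrete, the sketch below counts how many times D(i,j) is actually evaluated with and without a cache. It is an illustration only: the function name `edit_distance_with_count` and the `use_cache` flag are hypothetical, and the cost model is the unit-cost edit distance assumed throughout these slides.

```python
def edit_distance_with_count(s: str, t: str, use_cache: bool) -> tuple[int, int]:
    """Return (edit distance, number of D(i, j) evaluations performed)."""
    cache: dict[tuple[int, int], int] = {}
    evaluations = 0

    def d(i: int, j: int) -> int:
        nonlocal evaluations
        if use_cache and (i, j) in cache:
            return cache[(i, j)]          # reuse a previously solved subproblem
        evaluations += 1
        if i == 0:
            result = j
        elif j == 0:
            result = i
        else:
            cost = 0 if s[i - 1] == t[j - 1] else 1
            result = min(d(i - 1, j) + 1,
                         d(i, j - 1) + 1,
                         d(i - 1, j - 1) + cost)
        if use_cache:
            cache[(i, j)] = result
        return result

    return d(len(s), len(t)), evaluations


if __name__ == "__main__":
    s, t = "ACGTCTAG", "ACTCTAG"
    print("naive :", edit_distance_with_count(s, t, use_cache=False))
    print("cached:", edit_distance_with_count(s, t, use_cache=True))
```

With the cache, each of the (n+1)(m+1) pairs is evaluated once, which is 9 x 8 = 72 for these two strings; without it, the count is far larger even at this small size.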

30 Using the recurrence Compute D(i,j) bottom up. Store the intermediate results in a table (the table we already saw). Start with smallest (i,j) = (1,1). Compute D(i,j) after D(i-1,j), D(i,j-1), and D(i-1,j-1) have been determined. (n+1)(m+1) cells to fill, so O(nm) time.
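A bottom-up fill of the table might look like the following sketch. The function name `edit_distance_table` is hypothetical; rows are assumed to index prefixes of S and columns prefixes of T, and the cost model is unit-cost edit distance.

```python
def edit_distance_table(s: str, t: str) -> list[list[int]]:
    """Fill the (n+1) x (m+1) table D bottom up and return it."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                       # base condition D(i, 0) = i
    for j in range(m + 1):
        D[0][j] = j                       # base condition D(0, j) = j
    for i in range(1, n + 1):             # row by row, so that D[i-1][j],
        for j in range(1, m + 1):         # D[i][j-1], D[i-1][j-1] are ready
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + cost)
    return D


if __name__ == "__main__":
    table = edit_distance_table("ACGTCTAG", "ACTCTAG")
    for row in table:
        print(" ".join(f"{value:2d}" for value in row))
    print("edit distance:", table[-1][-1])
```

The two nested loops touch each cell once and do constant work per cell, which gives the O(nm) running time stated on the slide.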

31 Traceback
Our dynamic programming table helps us compute the edit distance “score”.
We need the actual alignment corresponding to this edit distance.
The corresponding alignment can be read off by doing a little extra accounting.

32 Traceback
If D(i,j) == D(i-1,j) + 1, Pointer(i,j) = (i-1,j)
If D(i,j) == D(i,j-1) + 1, Pointer(i,j) = (i,j-1)
If D(i,j) == D(i-1,j-1) + δ(S(i),T(j)), Pointer(i,j) = (i-1,j-1)
Break ties arbitrarily, or keep multiple pointers.

33 Traceback
Follow the pointers from cell (n,m). Any path to (0,0) corresponds to the (reverse of the) edits of an optimal alignment.
  “horizontal” pointers: insertion in S
  “vertical” pointers: insertion in T
  “diagonal” pointers: match or substitution
Once the table is filled, an optimal alignment can be read off in O(n+m) time.
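The traceback can be sketched as follows. Instead of storing explicit pointers, this version re-derives each pointer from the table on the way back, which is equivalent for a single optimal path. The function name `align` is hypothetical; rows are assumed to index S and columns T, and ties are broken in favor of the diagonal, which is an arbitrary choice the slides leave open.

```python
def align(s: str, t: str) -> tuple[str, str]:
    """Return one optimal global alignment of s and t under unit edit costs."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)

    # Walk back from (n, m) to (0, 0), collecting columns in reverse order.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else 1):
            top.append(s[i - 1]); bottom.append(t[j - 1])   # diagonal
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(s[i - 1]); bottom.append("-")        # vertical: gap in T
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])        # horizontal: gap in S
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    a, b = align("ACGTCTAG", "ACTCTAG")
    print(a)
    print(b)
```

Each step of the walk decreases i, j, or both, so the loop runs at most n+m times after the O(nm) table fill.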

34 Original references
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J. Molecular Biology (1981), 147(1):
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Molecular Biology (1990), 215(3): ,113 citations!