A new method of finding similarity regions in DNA sequences Laurent Noé Gregory Kucherov LORIA/UHP Nancy, France LORIA/INRIA Nancy, France Corresponding.

Similarities: Identifying similarity regions inside a DNA sequence (repeats) or between two sequences (local alignment) is a fundamental problem in bioinformatics. Detecting similarities is a necessary step in functional prediction, phylogenetic analysis, and many other biological studies.

Heuristic Search: Searching for approximate repeats in whole genomes with exhaustive techniques takes prohibitive time. Many algorithms first find small exact repeats, called seeds (using a suffix tree or a hash function), and then try to extend those repeats into approximate ones. Our method chains together multiple seeds to rapidly form large similarity regions, instead of extending individual seeds. The choice of search parameters is based on a statistical analysis.

A more sensitive approach: Grouping seeds has been used in a very restricted form by later versions of BLAST [2] (the "two-hit" method). Our chaining algorithm allows a smaller seed size and therefore a more sensitive search, without a considerable drop in time efficiency. To illustrate the gain in sensitivity: in two sequences of length 100 with at least 66% similarity, one finds 3 seeds of size 7 more frequently than one single seed of size 11 (see figure).

References:
[1] G. Benson, Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Research, 1999, vol. 27, no. 2, pp.
[2] S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman, Basic Local Alignment Search Tool, Journal of Molecular Biology, 1990, vol. 215, pp.

Method: Parameters used in the chaining algorithm are estimated from probability distributions, assuming a Bernoulli model of the DNA sequence. The maximal distance ρ between seeds inside one repeat copy is computed from the waiting time distribution [1].
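As a rough illustration of how a bound such as ρ could be derived, the sketch below computes, under a Bernoulli model with match probability p, the probability that a run of k consecutive matches (an exact seed of size k) is completed within n positions, and then takes the smallest n for which this probability reaches 1 - eps. The function names, the parameter eps, and this particular reading of the waiting time are assumptions made for illustration; the formulation used in [1] may differ.

```python
def prob_run_within(n, k, p):
    """Probability that n Bernoulli(p) trials contain a run of at least k successes."""
    # state[j] = probability of being in a current run of j matches (j < k)
    # without having completed any run of length k so far.
    state = [0.0] * k
    state[0] = 1.0
    hit = 0.0  # probability that a run of length k has already been completed
    for _ in range(n):
        nxt = [0.0] * k
        for j, mass in enumerate(state):
            if mass == 0.0:
                continue
            nxt[0] += mass * (1.0 - p)      # mismatch resets the run
            if j + 1 == k:
                hit += mass * p             # match completes a seed
            else:
                nxt[j + 1] += mass * p      # match extends the run
        state = nxt
    return hit


def waiting_time_bound(k, p, eps=0.05, n_max=100_000):
    """Smallest n such that a seed of size k appears within n positions
    with probability at least 1 - eps (hypothetical stand-in for rho)."""
    for n in range(k, n_max + 1):
        if prob_run_within(n, k, p) >= 1.0 - eps:
            return n
    raise ValueError("no bound found below n_max")
```

For example, `waiting_time_bound(7, 0.7)` would give a candidate value of ρ for seeds of size 7 in regions with 70% similarity. Recomputing the distribution for every n is wasteful but keeps the sketch short.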
Indels are accounted for by computing a statistical bound δ on a random walk distribution, which models the variation of the distance between corresponding seeds inside a repeat copy [1].

Results: The output gives the start and end positions of each copy, which can be located on either strand. An equivalent BLAST score is also given. Copy alignments can be computed very efficiently because the seeds have already been found.

Experiments: Experiments were carried out on chromosomes V and IX of Saccharomyces cerevisiae. Comparative tests were run against REPuter and BLAST. They showed that our program is more sensitive than BLAST in finding repetitions with a good similarity rate (70%). REPuter tends to keep fragments of one repetition apart, whereas our program assembles fragments provided that indels are smaller than δ.

[Figure: Typical Output]

Chaining Algorithm: The chaining algorithm groups together seeds involved in the same similarity region. A major problem is to efficiently retrieve previously found seeds that should be chained with the current seed. This is done by maintaining a diagonal table of seeds, which stores the last seed found on each diagonal (the diagonal of a seed is the difference between its positions in the two copies). When a new seed is found, the chaining algorithm looks around its diagonal to check whether another seed can be chained to the new one according to the statistical criteria. If no new seed occurs within distance ρ to the right of the last seed of a chain, the chain is complete and is removed from memory. The chaining algorithm is linear in the number of seeds found and can process sequences of millions of bp on a regular PC.

[Figures: Multiple seeds vs single seed; Running Time; Algorithm Structure]

Seeds: A seed consists of two identical words, one from each sequence. Seeds are obtained from a linked list computed using a hash function. The list contains all positions of each word in the sequence.
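The seed construction described above can be sketched as follows. This is a minimal illustration, not the program's actual data structure: a Python dictionary of position lists stands in for the hash function and linked list, and the names `index_words` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every word of length k to the list of its start positions in seq."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def find_seeds(seq_a, seq_b, k):
    """Return all seeds as (position in seq_a, position in seq_b) pairs,
    i.e. all pairs of identical words of length k, one from each sequence."""
    index = index_words(seq_a, k)
    seeds = []
    for j in range(len(seq_b) - k + 1):
        # Look up the word starting at j in seq_b among the words of seq_a.
        for i in index.get(seq_b[j:j + k], ()):
            seeds.append((i, j))
    return seeds
```

To search for repeats inside a single sequence, the same function could be called with `seq_a` and `seq_b` set to the same string, discarding the trivial pairs where both positions are equal.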
Two seeds are chained together if the distance d between them is bounded by ρ (computed according to statistical criteria, see above) and the variation Δd of this distance between the two sequences is bounded by δ. These two criteria allow us to chain seeds occurring within the same similarity region.

[Figure labels: Chaining Algorithm; Alignment Algorithm; Seed pairs; chaining; DNA sequences; Similarities; Parameters]

[Figure: seeds 1 to 4 marked on two sequences, with the distance d and its variation Δd]
Sequence A: CGTCTCCTCCAAGCCCTATTGACTCTTACCCGGAGTTTCAGCTAAAAGCTATACTTACTACCTTTATC
Sequence B: CGTCTCCTCCAAGGCCTGTTGGCATCTTACCCTGATGTTCAGTCAAAAGCTACTTACTACCTTTATC
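The two chaining criteria, combined with the diagonal table, can be sketched roughly as below. Several details are assumptions made for illustration: seeds are (i, j) start positions in sequences A and B, the distance d is measured between start positions in sequence A, Δd is taken as the difference between diagonals j - i, and the function name `chain_seeds` is hypothetical. Unlike the program described here, the sketch keeps finished chains in memory instead of evicting them.

```python
def chain_seeds(seeds, rho, delta):
    """Group seeds (i, j) into chains using the bounds rho (distance)
    and delta (variation of the diagonal)."""
    chains = []
    # Diagonal table: diagonal -> index of the chain whose last seed lies on it.
    last_on_diagonal = {}
    for i, j in sorted(seeds):
        diag = j - i
        best = None
        # Look at the diagonals within delta of the new seed's diagonal.
        for d in range(diag - delta, diag + delta + 1):
            c = last_on_diagonal.get(d)
            if c is None:
                continue
            li, lj = chains[c][-1]
            # The previous seed must lie to the left in both sequences
            # and within distance rho.
            if 0 < i - li <= rho and j > lj:
                if best is None or li > chains[best][-1][0]:
                    best = c
        if best is None:
            chains.append([(i, j)])
            best = len(chains) - 1
        else:
            old_i, old_j = chains[best][-1]
            old_diag = old_j - old_i
            if last_on_diagonal.get(old_diag) == best:
                del last_on_diagonal[old_diag]
            chains[best].append((i, j))
        last_on_diagonal[diag] = best
    return chains
```

Each seed inspects at most 2δ + 1 table entries, so for a fixed δ the work stays proportional to the number of seeds, which matches the linear behaviour stated for the chaining algorithm.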