Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules)
Giulio Pavesi & Giancarlo Mauri
Dept. of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy

CPM Morelia, Giulio Pavesi

Why Is RNA So Interesting?
After the completion of various genome projects, the attention of many researchers has shifted from the coding to the non-coding parts of the genome
More than 95% of our genome is not coding: what about the rest?
Non-coding RNA: RNA that is transcribed from DNA but does not directly encode a protein (tRNA, microRNA, etc.)

A Motivating Example
Post-transcriptional regulation of gene expression

The Problem
Functionally related RNA sequences present structural similarity, at least in some parts
Given two or more RNA molecules, find similar (supposedly functional) structural elements in them
Sequence similarity implies structure similarity, but this is not always true for RNA
Given two or more RNA sequences of unknown structure, find similar structural elements in them (motifs)
Low sequence similarity can still correspond to high structure similarity

“Know Thine Enemy”
RNA secondary structure: the list of base pairs among the nucleotides of the sequence, such that:
– No nucleotide takes part in more than a single base pair (usually Watson-Crick pairs and G-U wobble pairs, written G-T in the DNA alphabet used on these slides, i.e. canonical base pairs)
– Base pairs never cross: if nucleotide i is bound to nucleotide j and k to l (with i < j, k < l and i < k), then either i < j < k < l or i < k < l < j
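The two conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `is_canonical` and `is_valid_structure` are illustrative, positions are 0-based, and T is treated as a synonym of U because the slides write sequences in the DNA alphabet.

```python
from itertools import combinations

# Canonical pairs as listed on the slide: Watson-Crick plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_canonical(x, y):
    """True if nucleotides x and y can form a canonical base pair (T is read as U)."""
    x, y = x.upper().replace("T", "U"), y.upper().replace("T", "U")
    return (x, y) in CANONICAL


def is_valid_structure(seq, pairs):
    """Check a list of 0-based pairs (i, j), i < j, against the two slide conditions."""
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < len(seq)):
            return False
        if i in used or j in used:          # nucleotide already in another pair
            return False
        if not is_canonical(seq[i], seq[j]):
            return False
        used.update((i, j))
    # Two pairs (i, j) and (k, l) with i < k cross exactly when i < k < j < l.
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        if i < k < j < l:
            return False
    return True
```

Nested pairs (i < k < l < j) and side-by-side pairs (i < j < k < l) pass the check; only interleaved pairs are rejected.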

RNA Secondary Structure
.((..(((.((....)))))...(((.(((...)))...)))))
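A bracket string like the one above maps back to a list of base pairs with a single stack pass. This is a small sketch for illustration only (the function name `parse_dot_bracket` is an assumption, and positions are 0-based):

```python
def parse_dot_bracket(structure):
    """Return the base pairs (i, j), 0-based with i < j, encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)               # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # closes the most recent open bracket
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


example = ".((..(((.((....)))))...(((.(((...)))...)))))"
pairs = parse_dot_bracket(example)
```

Because each closing bracket is matched with the most recent unmatched opening bracket, the resulting pairs can never cross.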

Motifs in RNA Secondary Structure
Many functional motifs can be described by secondary structure alone
Two types of similarity:
– sequence similarity (in unpaired nucleotides, mainly)
– structure similarity

Data Structures?
When dealing with DNA or protein sequences, significant advantages have been obtained by using suitable text-indexing structures (e.g. suffix trees)
RNA secondary structure can be described by a string
Is there a “good” structure that will do for RNA sequences, allowing us to consider sequence and structure at the same time?

Affix Trees
Affix tree for string S = ATATC
Suffix and prefix edges
Suffix edges spell the substrings of string S
Prefix edges (dotted) spell the substrings of S^-1 (the reverse of S)
Built in linear time
Takes linear space

Affix Trees
The affix tree of a string S indexes all the substrings of both S and S^-1
Once a substring of S has been located in the tree, we can extend it to the right (by following suffix edges) and to the left (by following prefix edges)
Good if we search for patterns in the sequences with some kind of symmetry
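To make the two kinds of extension concrete, here is a deliberately naive stand-in that answers the same queries by scanning the text. It is not an affix tree and has none of its linear-time guarantees; the class name `NaiveBidirectionalIndex` and its methods are hypothetical and exist only for this sketch.

```python
class NaiveBidirectionalIndex:
    """Stand-in for an affix tree: same kind of queries, but every call rescans the text."""

    def __init__(self, text):
        self.text = text

    def occurrences(self, pattern):
        """Start positions of pattern in the text (overlapping occurrences included)."""
        found, start = [], self.text.find(pattern)
        while start != -1:
            found.append(start)
            start = self.text.find(pattern, start + 1)
        return found

    def right_extensions(self, pattern):
        """Symbols c such that pattern + c occurs (the role of suffix edges)."""
        n, m = len(self.text), len(pattern)
        return sorted({self.text[p + m] for p in self.occurrences(pattern) if p + m < n})

    def left_extensions(self, pattern):
        """Symbols c such that c + pattern occurs (the role of prefix edges)."""
        return sorted({self.text[p - 1] for p in self.occurrences(pattern) if p > 0})


index = NaiveBidirectionalIndex("ATATC")
```

For S = ATATC, the substring AT can be extended to the right with A or C and to the left with T. The affix tree offers both moves from the same node, which is what the hairpin search relies on.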

The Hairpin
The basic element of RNA secondary structure is the hairpin (or stem-loop) structure
The hairpin is symmetric!
(((((        )))))
AGGTC CAGTCA GATCT

First Try
Predict the secondary structure of each input sequence
Build the affix tree for the folded sequences (in bracket notation)
Search exhaustively for patterns describing hairpin structures (possibly with differences)
Report those occurring in at least q sequences

Searching for Hairpins in Affix Trees
For each loop size l:
1. Find l dots in the tree, on suffix edges (hairpin loop)
2. Add a base pair:
   a) Find a ) on suffix edges
   b) Find a ( on prefix edges
3. If the result appears in at least q sequences, go to step 2; otherwise return (backtrack)
4. Add internal loops:
   a) Find a dot on prefix edges, then go to step 2
   b) Find a dot on suffix edges, then go to step 2
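The steps above can be sketched as a recursive enumeration over bracket strings. This is an illustration under stated assumptions, not the authors' implementation: the affix tree is replaced by plain substring tests, and the names `support`, `find_hairpins` and `min_stem` are hypothetical.

```python
def support(pattern, folded):
    """Number of folded sequences (dot-bracket strings) that contain the pattern."""
    return sum(pattern in s for s in folded)


def find_hairpins(folded, q, loop_sizes, min_stem=1):
    """Enumerate hairpin patterns that occur in at least q folded sequences."""
    found = set()

    def add_pair(pattern, stem):
        # Step 2: one ')' on the right (suffix edges), one '(' on the left (prefix edges).
        grown = "(" + pattern + ")"
        # Step 3: go on only while the quorum holds, otherwise backtrack.
        if support(grown, folded) < q:
            return
        stem += 1
        if stem >= min_stem:
            found.add(grown)
        add_pair(grown, stem)           # extend the stem with another pair
        add_pair("." + grown, stem)     # step 4a: unpaired base on the left, then a pair
        add_pair(grown + ".", stem)     # step 4b: unpaired base on the right, then a pair

    for l in loop_sizes:
        loop = "." * l                  # step 1: the hairpin loop
        if support(loop, folded) >= q:
            add_pair(loop, 0)
    return sorted(found, key=lambda p: (len(p), p))


folded = ["..((....))..", ".(.((....)))", "((....))...."]
```

With q = 2 and loop size 4 this reports (....) and ((....)); lowering q to 1 also reports (.((....))), which occurs only in the second string. Each substring test here costs time proportional to the input, which is exactly the cost the affix tree is meant to avoid.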

Recursive Algorithm
1   ....          (ok)
2   (....)        (ok)
2   ((....))      (ok)
2   (((....)))    (no)
4a  .((....))     (ok)
2   (.((....)))   (ok)
On each path, we keep a pointer for the prefix edge and another for the suffix edge
Speed-up: represent each unpaired element with a single symbol describing its type and size, so that two symbols are compared instead of two regions
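A rough sketch of the speed-up: each run of dots is collapsed into one symbol carrying its length, so two unpaired regions are compared in a single step. The slide also encodes the type of the element (hairpin loop, internal loop, bulge); this simplified version records size only, and the function name `compress_unpaired` is an assumption.

```python
from itertools import groupby


def compress_unpaired(structure):
    """Collapse each run of dots into one ('.', run_length) symbol; brackets stay single."""
    symbols = []
    for symbol, run in groupby(structure):
        length = len(list(run))
        if symbol == ".":
            symbols.append((".", length))           # whole unpaired region, one symbol
        else:
            symbols.extend([(symbol, 1)] * length)  # each bracket is kept on its own
    return symbols
```

For example, ((....)) becomes five symbols instead of eight characters, and a loop of size 4 differs from a loop of size 5 in a single comparison.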

Approximate Search
We can allow some approximation:
– Hairpin loops of different size (range value at step 1)
– Internal loops of different size at the same position along the stem
– Internal loops or bulges at different positions along the stem
– Stems of different size (base pairs)
– Any combination of the previous

Complexity
Given a set of k folded sequences of overall length N:
– Construction of the tree: O(N)
– Annotation of the tree: O(kN)
– Search: O(V(m)kN), where m is the length of the longest pattern found
– V(m) depends on the degree of approximation
– In practice, the most time-consuming part is predicting the structure of the sequences

Does It Work?
Test: Iron Responsive Element, located in the UTRs of mRNA coding for proteins involved in iron metabolism (e.g. ferritin, transferrin)
Does it appear in all the predicted structures?
Alas, it does not!

Why?
The “real structure” often does not correspond to the optimal one!
The motif “disappears” from the (supposedly) optimal structure

One Possible Solution
Idea: for each sequence, consider also a number of alternative sub-optimal structures
All the possible structures can be enumerated
Check whether a motif appears in at least one alternative structure per sequence
The affix tree can handle efficiently even hundreds of alternative structures per input sequence
Downside: the number of potential secondary structures for a sequence of length n is O(2^n)
If similarity is not stringent, we have too many candidates

But...
If the same structure has to appear in a set of sequences, then the same pattern of complementary base pairs has to appear in the sequences
(((((        )))))
AGGTC CAGTCA GATCT
GCGAG CAGTCT CTTGC
CCCAG CAGTCA CTGGG

Idea!
Instead of working on folded sequences, build the affix tree for the sequences alone, and find complementary base pairs on the fly
The search can be implemented with the same parameters as in the folded case
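A minimal sketch of the "on the fly" idea, assuming a fixed loop size and an uninterrupted stem. It scans one unfolded sequence directly instead of walking an affix tree, so it illustrates only the pairing test, not the indexing. The names `stem_length` and `hairpins_on_the_fly` are hypothetical, and sequences use the DNA alphabet as on the slides (U is read as T).

```python
# Canonical pairs in the DNA alphabet used on the slides, G-T standing for the wobble pair.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def stem_length(seq, left, right):
    """Count consecutive canonical pairs seq[left - k] / seq[right + k], growing outwards."""
    n = 0
    while left - n >= 0 and right + n < len(seq) and (seq[left - n], seq[right + n]) in PAIRS:
        n += 1
    return n


def hairpins_on_the_fly(seq, loop_size, min_stem):
    """Return (start, end, stem) for each hairpin with this loop size and stem >= min_stem."""
    seq = seq.upper().replace("U", "T")
    hits = []
    for loop_start in range(1, len(seq) - loop_size):
        left, right = loop_start - 1, loop_start + loop_size  # innermost candidate pair
        stem = stem_length(seq, left, right)
        if stem >= min_stem:
            hits.append((left - stem + 1, right + stem - 1, stem))
    return hits
```

On the three example sequences from the previous slide, written without spaces, a loop of size 6 with a stem of at least 5 pairs gives one hit spanning the whole sequence in each case.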

Building Hairpins on the Fly
By working on unfolded sequences, the theoretical time complexity is higher, since different paths correspond to the same structure
In practice it is much faster, since we do not have to run the prediction algorithm on the input sequences
We need to “validate” the candidate structures, e.g. according to their energy

Post-Processing
So far we have considered structure alone
More than a single motif occurrence per sequence is often reported, especially if structural constraints are loose
Post-processing: compare the candidate occurrences by evaluating sequence similarity in unpaired elements
Find the group of instances that are more similar at the sequence level
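One simple way to picture this step, as an assumption rather than the method used by the authors: score each pair of candidate unpaired regions by the fraction of identical positions, then pick one candidate per sequence so that the total pairwise score is highest. The names `identity` and `best_group` are hypothetical, regions of different length simply score 0, and the brute-force search is only practical for a handful of candidates.

```python
from itertools import combinations, product


def identity(a, b):
    """Fraction of matching positions between two equal-length unpaired regions."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_group(candidates):
    """Pick one candidate loop per sequence, maximising total pairwise identity.

    candidates: one list of loop strings per input sequence (brute force over all choices).
    """
    def score(group):
        return sum(identity(a, b) for a, b in combinations(group, 2))

    return max(product(*candidates), key=score)


candidates = [["CAGTCA", "TTTTTT"], ["CAGTCT"], ["GGGGGG", "CAGTCA"]]
```

Here the chosen group is CAGTCA, CAGTCT, CAGTCA: the loops taken from the example hairpins, which differ in a single position.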

Results and Work in Progress
The second approach gave better results, in terms of reliability and efficiency
Candidate hairpins can be validated according to their energy value (more reliable, in this case!)
Good results on “harder” tests
Still too many input parameters
Extend to more complex structures