1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.

Presentation transcript:

1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015

2 Goal
The goal is to use a pattern graph to discover conserved patterns in a set of related protein sequences.

3 Pratt
* Pratt is a tool that lets the user search for patterns conserved in a set of protein sequences.
* The user must specify what kind of patterns to search for and how many sequences a pattern must match to be reported.
* The tool implements an algorithm proposed by Jonassen in 1995 for discovering patterns of the PROSITE type, allowing both ambiguous pattern positions and variable-length gaps.
* Pratt searches for patterns matching at least a specified number of the given sequences, then ranks the discovered patterns by a scoring function, highest score first.

4 PROSITE
PROSITE is a database of protein families containing more than 1,100 entries. For each family it gives a pattern or a profile that can be used to identify new members of the family. The results indicate that Pratt is able to discover useful patterns for some protein families.

5 Efficient Discovery of Conserved Patterns Using a Pattern Graph
In 1996 Jonassen proposed an alternative approach for finding patterns common to at least k out of n given sequences, introducing the concept of a pattern graph. The approach:
- assumes that patterns have a prescribed form;
- defines the transformation operations that generate one pattern from another;
- given a sequence $S = s_1, s_2, \dots, s_l$ of length $l$, represents it as a graph;
- once the graph is constructed, uses depth-first search (DFS) to find all the patterns that can be derived from a path in the graph;
- among all these patterns, selects the most significant ones according to a scoring function.

6 Terminology and definitions (1)
The algorithm finds the most interesting patterns matching at least some minimum number of a given set of sequences. The sequences are strings over an alphabet $\Sigma$, the alphabet of sequence residues (amino acids for the protein sequences considered here).
Definition (class of patterns): a pattern $P$ in the class $C$ has the form
$P = A_1 - x(i_1, j_1) - A_2 - x(i_2, j_2) - \dots - x(i_{p-1}, j_{p-1}) - A_p$   (1)
where:
- $A_1, A_2, \dots, A_p$ are the pattern components of $P$. A component is an identity component if it contains a single symbol and an ambiguous component otherwise.
- $x(i_k, j_k)$ are the wildcard regions, where $i_k \le j_k$ are non-negative integers. A wildcard region is fixed if $i_k = j_k$ and flexible otherwise; its flexibility is $j_k - i_k$.

7 Example
P = A-[DE]-x(3)-G-x(3,4)-L
Here $A_1 = A$, $i_1 = j_1 = 0$, $A_2 = \{D, E\}$, $i_2 = j_2 = 3$, $A_3 = G$, $i_3 = 3$, $j_3 = 4$, $A_4 = L$. The length of the pattern is 6 and $p = 4$.
For a pattern $P_1$ to match $P$, each pattern component of $P_1$ must match the corresponding pattern component of $P$.
The bound values for this pattern are P = 4, L = 6, W = 4, F = 1, N = 1, FP = 2.
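As a rough illustration of what it means for a sequence to match a pattern of form (1), the sketch below translates a pattern into a regular expression. The list representation (component strings alternating with `(i, j)` tuples) and the function names `to_regex` and `matches` are assumptions made for this example; they are not taken from Pratt.

```python
import re

# Assumed representation: components (strings of allowed residues)
# alternate with wildcard regions given as (i, j) tuples.
EXAMPLE = ["A", (0, 0), "DE", (3, 3), "G", (3, 4), "L"]

def to_regex(pattern):
    """Translate a pattern of form (1) into a regular expression string."""
    parts = []
    for item in pattern:
        if isinstance(item, tuple):            # wildcard region x(i, j)
            i, j = item
            if j > 0:                          # x(0, 0) contributes nothing
                parts.append(".{%d}" % i if i == j else ".{%d,%d}" % (i, j))
        elif len(item) == 1:                   # identity component
            parts.append(re.escape(item))
        else:                                  # ambiguous component
            parts.append("[" + "".join(sorted(item)) + "]")
    return "".join(parts)

def matches(pattern, sequence):
    """True if some segment of the sequence matches the pattern."""
    return re.search(to_regex(pattern), sequence) is not None
```

With this representation, `to_regex(EXAMPLE)` gives `A[DE].{3}G.{3,4}L`, so a sequence matches when any of its segments fits that expression.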

8 Definitions (2)
We define a class of patterns $C$ that will be discovered by the algorithm. The class is given by a set of bounds $(A, P, L, W, F, N, FP)$ where:
- $A \subseteq 2^{\Sigma}$ is the set of allowed pattern components
- $P$ is the maximum number of components
- $L$ is the maximum length of a pattern
- $W$ is the maximum length of a wildcard region
- $F$ is the maximum flexibility of a wildcard region
- $N$ is the maximum number of flexible wildcard regions
- $FP$ is the maximum product of flexibilities, $\prod_{k=1}^{p-1} (j_k - i_k + 1)$
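The bound quantities can be computed directly from a pattern. The following is a minimal sketch, assuming a pattern is stored as a list in which component strings alternate with `(i, j)` wildcard tuples; `pattern_stats` and `within_bounds` are hypothetical helper names, and the length bound L is left out because the slides do not spell out how pattern length is measured.

```python
def pattern_stats(pattern):
    """Compute P, W, F, N and FP for a pattern given as
    [component, (i, j), component, ...] (assumed representation)."""
    components = [x for x in pattern if not isinstance(x, tuple)]
    wildcards = [x for x in pattern if isinstance(x, tuple)]
    flexibilities = [j - i for i, j in wildcards]
    product = 1
    for f in flexibilities:
        product *= f + 1                       # factor (j_k - i_k + 1)
    return {
        "P": len(components),
        "W": max((j for _, j in wildcards), default=0),
        "F": max(flexibilities, default=0),
        "N": sum(1 for f in flexibilities if f > 0),
        "FP": product,
    }

def within_bounds(pattern, bounds):
    """True if every computed quantity stays within the given bound."""
    stats = pattern_stats(pattern)
    return all(stats[name] <= bounds[name] for name in stats)
```

For the pattern A-[DE]-x(3)-G-x(3,4)-L this yields P = 4, W = 4, F = 1, N = 1 and FP = 2.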

9 Generalisation of Patterns
Definition: a pattern $P_1$ is a generalisation of another pattern $P_2$ if every sequence matching $P_2$ also matches $P_1$.
The concept of generalisation: given a class of patterns $C$, we define a family of transformation operators $\Rightarrow_C^i$, $i \in \{1, 2, 3\}$, each of which can be applied to a pattern $P$ in $C$ to produce another pattern $P_1$ in $C$. These operators are defined as follows.
1- $P \Rightarrow_C^1 P_1$: $P_1$ is generated from $P$ by the first transformation operator if $P_1$ can be obtained by deleting a pattern component $c$ from $P$. Formally, one of the following holds:
- $P = c - x(i, j) - P_1$
- $P = P_1 - x(i, j) - c$
- $P = P' - x(i_1, j_1) - c - x(i_2, j_2) - P''$ and $P_1 = P' - x(i_1 + i_2 + 1, j_1 + j_2 + 1) - P''$
for $P, P_1, P', P'' \in C$.

10 Generalisation of Patterns (2)
2- $P \Rightarrow_C^2 P_1$: $P_1$ is obtained by substituting a component $c$ in $P$ with a less restrictive one $c'$:
$P = P' - x(i_1, j_1) - c - x(i_2, j_2) - P''$
$P_1 = P' - x(i_1, j_1) - c' - x(i_2, j_2) - P''$
for $P', P'' \in C$ and $c \subset c' \in A$.
3- $P \Rightarrow_C^3 P_1$: $P_1$ is obtained by allowing more flexibility in a wildcard region of $P$:
$P = P' - x(i, j) - P''$
$P_1 = P' - x(i_1, j_1) - P''$
where $i_1 \le i$, $j_1 \ge j$ and $(i_1, j_1) \ne (i, j)$, for some $P', P'' \in C$.
More generally, $P \Rightarrow_C P_1$ if and only if $P \Rightarrow_C^i P_1$ for some $i \in \{1, 2, 3\}$.
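The three operators can be sketched as small functions on a list-based pattern. This is an illustration only: the representation (component strings alternating with `(i, j)` tuples) and the names `delete_component`, `widen_component` and `relax_wildcard` are assumptions, and the check that the result stays inside the class $C$ is omitted.

```python
def delete_component(p, k):
    """Operator 1: remove the k-th component (0-based), merging its
    neighbouring wildcard regions when it is an interior component."""
    idx = 2 * k                                # components sit at even indices
    if len(p) == 1:
        raise ValueError("cannot delete the only component")
    if idx == 0:
        return p[2:]                           # drop c and the wildcard after it
    if idx == len(p) - 1:
        return p[:-2]                          # drop c and the wildcard before it
    (i1, j1), (i2, j2) = p[idx - 1], p[idx + 1]
    return p[:idx - 1] + [(i1 + i2 + 1, j1 + j2 + 1)] + p[idx + 2:]

def widen_component(p, k, wider):
    """Operator 2: replace the k-th component by a strict superset."""
    if not set(p[2 * k]) < set(wider):
        raise ValueError("replacement must be strictly less restrictive")
    return p[:2 * k] + [wider] + p[2 * k + 1:]

def relax_wildcard(p, k, i1, j1):
    """Operator 3: replace the k-th wildcard x(i, j) by x(i1, j1)
    with i1 <= i, j1 >= j and (i1, j1) != (i, j)."""
    i, j = p[2 * k + 1]
    if not (i1 <= i and j1 >= j and (i1, j1) != (i, j)):
        raise ValueError("wildcard must become strictly more flexible")
    return p[:2 * k + 1] + [(i1, j1)] + p[2 * k + 2:]
```

Applying `widen_component`, then `delete_component`, then `relax_wildcard` to A-B-C-D reproduces the chain A-B-C-D, [AB]-B-C-D, [AB]-B-x-D, [AB]-B-x(1,3)-D.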

11 Example
The pattern A-B-C-D can be generalised to [AB]-B-x(1,3)-D:
A-B-C-D $\Rightarrow_C^2$ [AB]-B-C-D $\Rightarrow_C^1$ [AB]-B-x-D $\Rightarrow_C^3$ [AB]-B-x(1,3)-D
Pattern scoring function: the score of a pattern $P$ of form (1) is
$I(P) = \sum_{i=1}^{p} I_1(A_i) - c \cdot \sum_{k=1}^{p-1} (j_k - i_k)$
where $c$ is a constant and $I_1(A_i)$ is the information content of the pattern component $A_i$. A pattern that carries more information scores higher and is ranked nearer the top. This function is used in Pratt to rank all the patterns discovered.
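A small sketch of the scoring function follows. The slides do not define $I_1(A_i)$, so the code uses $\log_2(20 / |A_i|)$ as a stand-in, and the value of the constant $c$ is arbitrary; both are assumptions made only to show how ambiguity and flexibility lower the score.

```python
import math

ALPHABET_SIZE = 20    # assumption: the 20 standard amino acids
C_PENALTY = 0.5       # assumption: illustrative value for the constant c

def component_information(component):
    """Stand-in for I_1(A_i): bits gained by restricting a position
    to the residues in the component (not Pratt's actual definition)."""
    return math.log2(ALPHABET_SIZE / len(component))

def score(pattern, c=C_PENALTY):
    """I(P) = sum of component information minus c times total flexibility,
    for a pattern given as [component, (i, j), component, ...]."""
    info = sum(component_information(x) for x in pattern
               if not isinstance(x, tuple))
    flexibility = sum(x[1] - x[0] for x in pattern if isinstance(x, tuple))
    return info - c * flexibility
```

Under these assumptions the specific pattern A-B-C-D scores higher than its generalisation [AB]-B-x(1,3)-D, which is the ranking behaviour the formula is meant to give.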

12 Pattern Graph
A pattern graph is a directed graph $G = (V, E)$ in which the nodes $V$ represent pattern components and the edges $E$ represent wildcard regions. $\lambda(u)$ is the label of a node $u \in V$. Each edge $e \in E$ is labelled with the minimum and the maximum number of residues that may match its wildcard region.

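13 Example
For the pattern P = A-B-x(0,2)-C-x(3)-D we can construct a pattern graph with nodes $u, v, w, x$ labelled $\lambda(u) = A$, $\lambda(v) = B$, $\lambda(w) = C$, $\lambda(x) = D$.
A path $p = u_1, u_2, \dots, u_n$ in $G$ defines the pattern $\pi(p)$ formed by the node labels along the path, joined by the wildcard regions of the edges between them. For instance, the path $u, w, x$ defines the pattern $\pi(p)$ = A-x(0,1)-C-x(3)-D.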

14 Definition (3)
We define $\Pi(G, C)$ to be the set of all patterns that can be obtained by C-generalisation from the patterns in $C$ defined by the paths in $G$:
$\Pi(G, C) = \bigcup_{p \text{ path in } G,\ \pi(p) \in C} \{ P \mid \pi(p) \Rightarrow_C^{*} P \}$
We define $\Pi'(G, C)$ to be the set of patterns in $C$ that can be derived from the patterns defined by paths in $G$ using only the restricted transformation operators, where $P \Rightarrow'_C P_1$ if and only if $P \Rightarrow_C^i P_1$ for some $i \in \{2, 3\}$:
$\Pi'(G, C) = \bigcup_{p \text{ path in } G,\ \pi(p) \in C} \{ P \mid \pi(p) \, {\Rightarrow'_C}^{*} \, P \}$
The goal is to find $\Pi'(G, C)$ and to prove that $\Pi'(G, C) = \Pi(G, C)$.

15 Constructing a pattern graph
Input:
- a set of sequences $S = s_1, s_2, \dots, s_n$ where $s_i = s^i_1 s^i_2 \dots s^i_{l_i} \in \Sigma^{*}$
- the bounds specifying a class $C$
- the minimum number of sequences $k < n$ that a pattern should match
Output:
1- a pattern graph $G$
2- $\Pi'(G, C)$
3- the highest scoring patterns matching at least $k$ sequences
4- pruning of the search for the highest scoring patterns

16 Constructing a pattern graph from a sequence
Given a set of bounds defining a class of patterns $C$ and a sequence $s = s_1, s_2, \dots, s_l$, the algorithm works in two phases. In the first phase it defines the nodes and in the second phase it defines the edges.
Phase 1
- $G$ contains one node $u_i$ for each character $s_i$ in $s$ that is a pattern component in $A$
- label $u_i$ with $s_i$: $\lambda(u_i) = s_i$
Phase 2
- for each node $u_i$ make an edge to every node $u_j$ with $i < j \le \min(i + W + 1, l)$
- label the edge $(u_i, u_j)$ with $(j - i - 1, j - i - 1)$
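The two phases translate almost line for line into code. The sketch below is one possible rendering, assuming the graph is stored as a dictionary of node labels keyed by 1-based position plus a dictionary of edge labels; the name `graph_from_sequence` and the `allowed` parameter (the single-residue components present in $A$) are hypothetical.

```python
def graph_from_sequence(s, allowed, W):
    """Build a pattern graph from one sequence.

    s       -- the sequence, positions numbered from 1 as in the slides
    allowed -- residues that are pattern components in A
    W       -- maximum wildcard length
    Returns (nodes, edges): nodes maps position -> label, and
    edges maps (i, j) -> (min, max) wildcard length.
    """
    l = len(s)
    # Phase 1: one node per character that is an allowed component.
    nodes = {i: ch for i, ch in enumerate(s, start=1) if ch in allowed}
    # Phase 2: edges to later nodes within reach of a wildcard of length W.
    edges = {}
    for i in nodes:
        for j in range(i + 1, min(i + W + 1, l) + 1):
            if j in nodes:
                edges[(i, j)] = (j - i - 1, j - i - 1)
    return nodes, edges
```

Every edge label is a fixed wildcard because a single sequence leaves no room for variation; flexibility only enters later through generalisation operator 3.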

17 Example
S = ABCDEFG
Algorithm properties:
- $\Pi'(G, C)$ contains all patterns in $C$ matching $S$
- each pattern in $\Pi'(G, C)$ matches $S$
- $\Pi'(G, C) = \Pi(G, C)$

18 Constructing a pattern graph from a multiple alignment
The goal is to construct a pattern graph $G$ with $\Pi'(G, C) = \Pi(G, C)$.
Input: an alignment of the sequences $M_1, \dots, M_m$, where $l$ is the length of the alignment. A sequence is $M_i = M^i_1, \dots, M^i_{l_i}$, where $M^i_j$ is the $j$th character of the sequence $M_i$.
The alignment columns are numbered from left to right. Column $i$ is represented by a vector $c^i_1, \dots, c^i_m$ where $c^i_j = k$ if the $i$th column of the alignment contains the $k$th character of sequence $M_j$, or 0 if the $i$th column contains a gap in that sequence.
The graph is constructed from all the ungapped columns.

19 Constructing a pattern graph from a multiple alignment (2)
The algorithm works in two steps: the first defines the nodes of the graph and the second defines the edges.
Step 1:
- for each ungapped column $c_i$ of the alignment make a node $u_i$
- the set of symbols present in that column determines the allowable pattern components
- label $u_i$ with the smallest set $a \in A$ that contains those symbols
Step 2:
- each pair of nodes $u_i, u_j$ with $i < j$ corresponds to a pair of columns $i, j$ of the alignment; the edge bounds are the minimum and the maximum, over all sequences, of the number of sequence symbols lying between columns $i$ and $j$
- label each edge $u_i \to u_j$ with this (minimum, maximum) pair
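The two steps can be sketched as follows. This is an illustration under assumptions: the alignment is a list of equal-length strings with `-` as the gap character, columns are numbered from 0, a column whose symbols are covered by no allowed component is skipped, and no limit on wildcard length is applied to the edges. The name `graph_from_alignment` is hypothetical.

```python
def graph_from_alignment(rows, components, gap="-"):
    """Build a pattern graph from aligned sequences.

    rows       -- aligned sequences as equal-length strings
    components -- allowed pattern components, each a frozenset of residues
    Returns (nodes, edges) keyed by 0-based column index.
    """
    ncols = len(rows[0])
    ungapped = [c for c in range(ncols) if all(r[c] != gap for r in rows)]

    # Step 1: one node per ungapped column, labelled with the smallest
    # allowed component containing every symbol seen in the column.
    nodes = {}
    for c in ungapped:
        symbols = frozenset(r[c] for r in rows)
        candidates = [a for a in components if symbols <= a]
        if candidates:                         # assumption: skip otherwise
            nodes[c] = min(candidates, key=len)

    # Step 2: label each edge with the min and max number of residues
    # (gaps excluded) that any sequence has between the two columns.
    edges = {}
    cols = sorted(nodes)
    for pos, ci in enumerate(cols):
        for cj in cols[pos + 1:]:
            counts = [sum(1 for ch in r[ci + 1:cj] if ch != gap) for r in rows]
            edges[(ci, cj)] = (min(counts), max(counts))
    return nodes, edges
```

Unlike the single-sequence construction, edges here can already be flexible, because sequences with insertions contribute different residue counts between the same two columns.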

20 Example

21 Simple depth-first search using the graph
So far we have constructed the graph. The next step is to find the conserved patterns in $\Pi'(G, C)$ using DFS. This means constructing a search tree that is rooted in the empty pattern and contains all the k-patterns in $\Pi'(G, C)$ at depth $k$.
Definition: a k-pattern is a pattern obtained from a k-path in $G$ by applying C-generalisation operations of types 2 and 3.
Input:
- a conserved k-pattern $P$
- the k-path $p$ in $G$ from which $P$ has been derived
Output: all the simple extensions of $P$ that are in $C$ and can be derived from an extension of the path $p$. Each generated pattern is then checked to see whether it is conserved.

22 How extensions of P are generated
Let $p = v_1, v_2, \dots, v_k$ and let there be edges from $v_k$ to $w_1, \dots, w_l$. Each path $p_m = v_1, \dots, v_k, w_m$ defines a pattern $P(p_m)$ that is a simple extension of $P$:
$P(p_m) = P - x(i_m, j_m) - A_m$
where $(i_m, j_m)$ is the label of the edge $v_k \to w_m$ and $A_m = \lambda(w_m)$.
From each pattern $P(p_m)$ we can generate further simple extensions by applying the operator of type 2 to $A_m$ and the operator of type 3 to $x(i_m, j_m)$.
Example: let $G$ be a graph and assume $F = 1$ and $A = \{\{A\}, \{B\}, \{C\}, \{D\}, \{E\}, \{F\}, \{G\}\}$. Assume that the pattern P = A-x-C-D was derived from the path $p = A, C, D$.

23 Example (continued)
The path $p$ can be extended along any of the edges leaving its last node.

24 Simple depth-first search using the graph (2)
Running the search procedure recursively generates all the patterns in $\Pi'(G, C)$, and each one is then checked to see whether it is conserved.
Pruning the search:
- finding the highest scoring patterns means that, for every node $u$ in $G$, we need to find the most expressive conserved pattern derived from a path starting in $u$
- the search is considered in three cases:
1- no flexibility and no ambiguity allowed
2- no flexibility, but ambiguity allowed
3- the general case
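For the simplest case (no flexibility, no ambiguity) the recursive search can be sketched in a few lines. This is not Pratt's search procedure: it walks the implicit pattern graph of one seed sequence, writes each pattern as a regular expression of residues and dots, and abandons a branch as soon as fewer than `k` sequences match, which is safe because extending a pattern can never increase the number of sequences it matches. The function name and parameters are hypothetical, and sequences are assumed to consist of letters only.

```python
import re

def conserved_patterns(seed, sequences, k, W, max_components):
    """Depth-first enumeration of fixed-wildcard patterns derived from
    paths in the pattern graph of `seed` that match >= k sequences."""
    l = len(seed)
    found = []

    def support(regex):
        # Number of sequences containing a segment matching the pattern.
        return sum(1 for s in sequences if re.search(regex, s))

    def extend(last, regex, depth):
        found.append(regex)
        if depth == max_components:
            return
        # Follow every edge last -> j (0-based positions, wildcard <= W).
        for j in range(last + 1, min(last + W + 1, l - 1) + 1):
            longer = regex + "." * (j - last - 1) + seed[j]
            if support(longer) >= k:           # prune non-conserved branches
                extend(j, longer, depth + 1)

    for start in range(l):
        if support(seed[start]) >= k:
            extend(start, seed[start], 1)
    return found
```

With the seed `ACDE` and the sequences `ACDE`, `AQDE`, `ACDW`, requiring all three to match keeps only `A`, `A.D` and `D`; the branch through `C` is never expanded.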

25 Pruning the search (1)
No flexibility and no ambiguity allowed: $A = \{\{a\} \mid a \in \Sigma\}$ and $F = 0$. In this case a pattern is defined directly by a path, and the longest path gives the highest scoring pattern.
Property: for a given graph $G = (V, E)$, if a node $u$ has edges to $v$ and $w$ where $u < v < w$, then there is an edge from $v$ to $w$.
We define an ordering relation $<_1$ on the child nodes of a given node $x$ in the search tree such that if $x_i <_1 x_j$, then the node $w_i$ appended to the path for the pattern $P_{x_i}$ precedes the node $w_j$ appended for $P_{x_j}$, that is $w_i < w_j$.
Result: there is no need to explore all the subtrees $x_{i+1}, \dots, x_l$ to find the highest scoring pattern.

26 Pruning the search (2)
No flexibility, but ambiguity allowed: if $x'$ is a child of node $x$ in the search tree that corresponds to the path $p_i = v_1, \dots, v_k, w_i$, the pattern derived from this path is $P_x - x(i, j) - A$, where $(i, j)$ is the label of the edge $v_k \to w_i$. Let $\mathrm{Ind}(x') = \mathrm{index}(w_i)$ and $\mathrm{Amb}(x') = |A|$.
We define a partial order $<_2$ on the children of $x$: $x' <_2 x''$ if $\mathrm{Ind}(x') < \mathrm{Ind}(x'')$, or if $\mathrm{Ind}(x') = \mathrm{Ind}(x'')$ and $\mathrm{Amb}(x') < \mathrm{Amb}(x'')$.
Two nodes $x', x''$ for which $(\mathrm{Ind}(x'), \mathrm{Amb}(x')) = (\mathrm{Ind}(x''), \mathrm{Amb}(x''))$ are ordered arbitrarily.
If the pattern of a child $x'$ matches the same number of segments as $P_x$, the children of $x$ that come after $x'$ are not analysed, because they cannot give a higher scoring pattern.

27 Pruning the search (3)
The general case: each child $x'$ of $x$ defines a pattern $P' = P - x(i, j) - a$, determined by:
- the node $w_i$ that is appended to the path
- a component $a \in A$ such that $\lambda(w_i) \subseteq a$
- the flexibility of the wildcard region defined by the edge $v_k \to w_i$
Given $\mathrm{Ind}(x')$, $\mathrm{Amb}(x')$ and $F(x') = j - i$, we define a partial order $<_3$ on the children of $x$: $x' <_3 x''$ if
- $\mathrm{Ind}(x') < \mathrm{Ind}(x'')$, or
- $\mathrm{Ind}(x') = \mathrm{Ind}(x'')$ and $\mathrm{Amb}(x') < \mathrm{Amb}(x'')$, or
- $(\mathrm{Ind}(x'), \mathrm{Amb}(x')) = (\mathrm{Ind}(x''), \mathrm{Amb}(x''))$ and $F(x') < F(x'')$.
Two nodes $x', x''$ for which $(\mathrm{Ind}(x'), \mathrm{Amb}(x'), F(x')) = (\mathrm{Ind}(x''), \mathrm{Amb}(x''), F(x''))$ are ordered arbitrarily.
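The order $<_3$ is a lexicographic comparison of the triple (Ind, Amb, F), so sorting the children by that tuple visits them in a compatible sequence. The sketch below assumes a hypothetical `Child` record holding the three values plus the pattern text.

```python
from typing import NamedTuple

class Child(NamedTuple):
    ind: int        # Ind(x'): index of the node appended to the path
    amb: int        # Amb(x'): size of the appended component
    flex: int       # F(x'): flexibility j - i of the new wildcard region
    pattern: str    # the extended pattern, kept only for display

def order_children(children):
    """Sort children so that x' comes before x'' whenever x' <_3 x''.
    Children with equal triples keep their input order, which is one
    valid way of ordering them arbitrarily."""
    return sorted(children, key=lambda c: (c.ind, c.amb, c.flex))
```

Dropping `flex` from the key gives the order $<_2$ of the ambiguity-only case, and keeping only `ind` gives $<_1$.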

28 Pruning the search (3)
The general case (continued): let $P'$ be the pattern corresponding to a child $x'$ of $x$. If $P'$ matches at least a certain proportion of the segments matched by $P$, the other children of $x$ are not analysed. The rationale is that, if $P$ is a real conserved pattern and its extension $P'$ matches at least $k$ segments, we would expect only a small proportion of the segments matching $P$ to extend to segments matching $P'$ unless the extension is itself conserved. When a large proportion does extend, $P'$ is taken to be a conserved pattern and no additional extensions of $P$ need to be explored.

29 Complexity analysis
Time complexity: the algorithm searches for all patterns conserved in at least $k$ of $n$ sequences with average length $l$, where the class of patterns $C$ is given by the set of bounds $(A, P, L, W, F, N, FP)$. The time needed to analyse a pattern graph $G = (V, E)$ constructed from the $n - k + 1$ shortest sequences is $O(|E| \cdot P \cdot N)$, where $P = O(L)$ and $L = O(n \cdot l)$ is the total length of all sequences. The worst-case time complexity is exponential in $P$, the maximum number of pattern components, which is the maximum depth of the search tree.
Space complexity: the space needed to store the graph is $O(|E| \cdot \lceil g^2 / 8 \rceil + |V|) \cdot (W + 1 + N \cdot P)$ bytes, where $g$ is the maximum number of generalisations of a pattern component.
