An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan.

Slides:



Advertisements
Similar presentations
Frequent Closed Pattern Search By Row and Feature Enumeration
Advertisements

Space-for-Time Tradeoffs
Pushdown Automata Chapter 12. Recognizing Context-Free Languages Two notions of recognition: (1) Say yes or no, just like with FSMs (2) Say yes or no,
LR-Grammars LR(0), LR(1), and LR(K).
String and Lists Dr. Benito Mendoza. 2 Outline What is a string String operations Traversing strings String slices What is a list Traversing a list List.
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Procedures of Extending the Alphabet for the PPM Algorithm Radu Rădescu George Liculescu Polytechnic University of Bucharest Faculty of Electronics, Telecommunications.
1 Languages. 2 A language is a set of strings String: A sequence of letters Examples: “cat”, “dog”, “house”, … Defined over an alphabet: Languages.
1 CSCI-2400 Models of Computation. 2 Computation CPU memory.
1 Languages and Finite Automata or how to talk to machines...
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
1 Efficient String Matching : An Aid to Bibliographic Search Alfred V. Aho and Margaret J. Corasick Bell Laboratories.
1 Convolution and Its Applications to Sequence Analysis Student: Bo-Hung Wu Advisor: Professor Herng-Yow Chen & R. C. T. Lee Department of Computer Science.
Algorithms and Efficiency of Algorithms February 4th.
Data Structures – LECTURE 10 Huffman coding
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Pattern Matching in Weighted Sequences Oren Kapah Bar-Ilan University Joint Work With: Amihood Amir Costas S. Iliopoulos Ely Porat.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.
Fall 2006Costas Busch - RPI1 Languages. Fall 2006Costas Busch - RPI2 Language: a set of strings String: a sequence of symbols from some alphabet Example:
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
Fall 2004COMP 3351 Languages. Fall 2004COMP 3352 A language is a set of strings String: A sequence of letters/symbols Examples: “cat”, “dog”, “house”,
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan.
Finite-State Machines with No Output
Computer Science Department Data Structure & Algorithms Problem Solving with Stack.
CSC312 Automata Theory Lecture # 2 Languages.
CSC312 Automata Theory Lecture # 2 Languages.
Costas Busch - LSU1 Languages. Costas Busch - LSU2 Language: a set of strings String: a sequence of symbols from some alphabet Example: Strings: cat,
Similarity based Retrieval from Sequence Databases using Automata as Queries 作者 : A. Prasad Sistla, Tao Hu, Vikas howdhry 出處 :CIKM 2002 ACM 指導教授 : 郭煌政老師.
Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu.
Lecture 4 Sequences CSCI – 1900 Mathematics for Computer Science Fall 2014 Bill Pine.
Pushdown Automata (PDAs)
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.
Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science,
Boyer Moore Algorithm Idan Szpektor. Boyer and Moore.
1 Languages. 2 A language is a set of strings String: A sequence of letters Examples: “cat”, “dog”, “house”, … Defined over an alphabet:
Pushdown Automata Part I: PDAs Chapter Recognizing Context-Free Languages Two notions of recognition: (1) Say yes or no, just like with FSMs (2)
Application: String Matching By Rong Ge COSC3100
Mining Frequent Patterns in unaligned Protein Sequences (An Implementation of The Teiresias Algorithm) Fei Shao 1.
Built-in Data Structures in Python An Introduction.
12. Regular Expressions. 2 Motto: I don't play accurately-any one can play accurately- but I play with wonderful expression. As far as the piano is concerned,
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
CS 203: Introduction to Formal Languages and Automata
Chapter 7 - Sequence patterns1 Chapter 7 – Sequence patterns (first part) We want a signature for a protein sequence family. The signature should ideally.
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
1 Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm Speaker: Minghua ZHANG March. 12, 2003 Authors: Isidore Rigoutsos Aris.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Chapter 4 Unordered List. 2 Learning Objectives ● Describe the properties of an unordered list. ● Study sequential search and analyze its worst- case.
Winter 2016CISC101 - Prof. McLeod1 CISC101 Reminders Quiz 3 this week – last section on Friday. Assignment 4 is posted. Data mining: –Designing functions.
String and Lists Dr. José M. Reyes Álamo. 2 Outline What is a string String operations Traversing strings String slices What is a list Traversing a list.
Algebra. Factorials are noted by the expression n! and reads “n factorial”
String and Lists Dr. José M. Reyes Álamo.
Languages.
Languages Prof. Busch - LSU.
Languages Costas Busch - LSU.
Chapter 2 FINITE AUTOMATA.
Fast Fourier Transform
The compareTo interface
String and Lists Dr. José M. Reyes Álamo.
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
CSE 589 Applied Algorithms Spring 1999
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Space-for-time tradeoffs
Languages Fall 2018.
SPL – PS1 Introduction to C++.
Presentation transcript:

An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan

Outline Introduction of Pattern Discovery Basic Definitions Teiresias Algorithm & Implementation Scan Phase Convolution Phase Results & Conclusion Q & A

What is Pattern Discovery? Pattern: A region or portion of a protein sequence. It may has specific structure and functionally significant. Protein family should have similar patterns so they can be characterized. Pattern Discovery: Detect patterns from known protein sequences. The result can be used to detect unknown protein sequences.

Why Pattern Discovery Useful? For some proteins that have similar biological properties on structural or functional features: Group together these protein sequences Discover a set of common sub-sequences Study and observe these sub-sequences The detected patterns may help to classify a protein

Basic Definitions ∑ (Basic alphabet set): The amino-acid with names can be abbreviated as the listed symbols in alphabetical order ∑={A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}  (Character/Equivalency group alphabets): a character corresponding to a subset of Σ. Ex: A-[CJ]-G. (Wild-card or don’t care): a special kind of ambiguous character that matches any character in ∑. Ex: X in protein sequences and N in nucleotide

Basic Definitions (Cont.) Pattern P is a (L, W) pattern iff: P is a string of characters (Σ and wild cards ‘.’). P starts and ends with a character from Σ. Characters in Σ are called residues. Any sub pattern of P (i.e subsequence starting and ending with a character from Σ) containing exactly L non-wildcard characters (residues) has length of at most W. Ex. L=3 and W=5, “CD..E” is a (3, 5) pattern.

TEIRESIAS Algorithm Target: Search for patterns consisting of characters of the alphabet Σ and wild cards ‘.’ Basic Idea: If a pattern P is a (L, W) pattern occurring in at least K sequences, then its sub patterns are also (L, W) patterns occurring in at least K sequences (K>=2). Ex: pattern “A.BC” is more specific than “A..C” Designed for non-aligned sequences

Two Phases Scan Phase: Scan the input sequences Identify all the elementary patterns which satisfy the minimum support Convolution Phase: Sort the element patterns and extract useful prefixes and suffixes Convolve the elementary patterns until find the maximal length and composition of the patterns

Scan Phase Input: Non-aligned sequences Parameter: K, L, W Output: a set of “Elementary Patterns” Are (L, W) patterns Occur in at least K sequences Contain exactly L non-wildcards

Elementary Pattern Examples SDFBASTSLP LFCASTSCP FDASTSNP TGLPACTTCPTFCP AST K=2, L=3, W=5 T.LP

Scan Phase (Cont.) Empty the stack of elementary patterns. (EP) For each letter in the alphabet, count how many sequences contain this letter. If less than K sequences contain this letter, ignore it. Otherwise, extend it until it is ignored or it is accepted.

Convolution Phase (Overall) There is a sub-phase named pre- convolution before the convolution phase. The goal of the convolution phase is to extend a elementary pattern with other elementary patterns.

Pre-convolution Phase Prefix of a pattern p is the unique substring containing the first L-1 residues of p. Suffix of pattern p is the unique substring containing the last L-1 residues of p. Example: for pattern a..c.e, prefix is a..c, suffix is c.e

Pre-convolution Phase (cont.) Given two pattern p and q, p convolve q is empty if the suffix of p is not the prefix of q, Otherwise, it is the pattern p followed by the part of q which remains when its prefix is removed.

Pre-convolution Phase (cont.) Left vector of a pattern q contains all the patterns p that can be left-convolved with it. Right vector of a pattern p contains all the patterns q that can be right-convolved with it. Example: Left a.c.d: ea.c b.a.c right a.c.d: fc.d

Pre-convolution Phase (cont.) Sort elementary patterns by prefix-wise < Construct Left and Right vector for each elementary pattern

Prefix-wise < Examples Example 0 abc 5 bc..f 1 bcd 6 a.cd 2 ab.d 7 b.de 3 cd.f 8 a.c.e 4 ab..e 9 a..de

Convolution Phase For each pattern t of the sorted EPs, firstly it is extended to the left by convolving it with every member in Left[t]. If the result r doesn’t have sufficient support or is redundant, discard it. If r has the same number of occurrences as t, t is non-maximal and is discarded.

Convolution Phase (cont.) If result r has adequate support and is not redundant, r is placed on the stack and becomes the new top. When the pattern at the top of the stack can not be extended to the left any more, extend it to the right. When all the extensions have been done, the current top is popped from the stack. If it is maximal, it is put into the output set.

An Example Scenario Suppose there are three sequences: SDFEASTS LFCASTS FDASTSNP Our goal: to find a maximal pattern, given K = 2, L = 3, W = 5

An Example Scenario (cont.) First, the scan phase is implemented The elementary patterns (EP) found are: AST AS.S A.TS F.AS F.A.T F..ST STS Then these EP are sorted, the result is: AST STS AS.S A.TS F.AS F.A.T F..ST

An Example Scenario (cont.) Then, the Left and Right vectors are constructed: Left Right AST: F.AS AST: STS STS: AST, F..ST STS: AS.S: F.AS AS.S: A.TS: F.A.T A.TS: F.AS: F.AS: AST, AS.S F.A.T: F.A.T: A.TS F..ST: F..ST: STS

An Example Scenario (cont.) Now the convolution phase is executed, and the result is as following: F.ASTS

About Our Implementation The program is written using Visual C Command line arguments are provided to users for them to specify the input data file, the K, L, W value (which have default values (2, 3, 5) if users don’t specify) >Teiresias [ ] [ ] [ ]

Questions?