Algorithms to Search Position Specific Scoring Matrices in Biosequences Cinzia Pizzi Dipartimento di Ingegneria dell’Informazione Università degli Studi.


Algorithms to Search Position Specific Scoring Matrices in Biosequences Cinzia Pizzi Dipartimento di Ingegneria dell’Informazione Università degli Studi di Padova

C.Pizzi, DEI – Univ. of Padova (Italy)

Outline: weighted patterns in biology; the problem of profile matching; the look-ahead method; suffix-based algorithms; Aho-Corasick Expansion (ACE); Look-ahead Filtration Algorithm (LFA); Naive Superalphabet (NS); some experimental results.

What are Motifs? Motifs are biologically significant elements that are responsible for common structures or functions. Statistically, motifs are significant substrings in biosequences. Assumption: if two entities share the same function or structure, common over-represented elements might be responsible for the observed similarity.

Motif Discovery: take a set of co-expressed genes and compare their promoter regions. Common over-represented substrings are good candidates for transcription factor binding sites (TFBS). This requires both the counted and the expected frequency of each substring. [Figure: promoters of co-expressed genes]

Motif Discovery (TFBS, DNA motifs): motifs = binding sites = substrings. Because of the intrinsic variability of biological sequences, motif models must allow for mismatches, indels, wildcards, superalphabets, and so on. [Figure: promoters of co-expressed genes]

Motif Representation: binding sites of the same factor are not exactly the same in all sequences, for example ACATAC, CCGAAT, ATGCAT, GCCTAC, TCCAAA, TTCGAA, ACGGAC, TCCTAT, GCCCAC, TCGGAA. Profile -> matrix representation, with one row per symbol (A, G, C, T).
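The ten aligned sites listed on this slide can be turned into a count matrix column by column. The sketch below does only that; the helper name `count_matrix` is hypothetical, and it produces raw counts (no pseudocounts, no log-odds conversion), which is an assumption about how a profile would be derived rather than something the slides specify.

```python
# Aligned binding sites exactly as listed on the slide.
SITES = ["ACATAC", "CCGAAT", "ATGCAT", "GCCTAC", "TCCAAA",
         "TTCGAA", "ACGGAC", "TCCTAT", "GCCCAC", "TCGGAA"]

def count_matrix(sites, alphabet="ACGT"):
    """Return {symbol: [count in column 0, count in column 1, ...]}."""
    m = len(sites[0])
    counts = {a: [0] * m for a in alphabet}
    for site in sites:
        for j, ch in enumerate(site):
            counts[ch][j] += 1
    return counts

counts = count_matrix(SITES)
for symbol in "ACGT":
    print(symbol, counts[symbol])
```

Column 5 contains only A, and the most frequent symbols per column are T, C, C, G or T, A, C.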

Motif Representation (protein classification): each family is modeled by a matrix (rows A, D, C, ...), built from sequences such as ACDEHNPVAC, CCDEGAMMAT, ATHCATVVST; a query sequence such as WVDEHNPVAC is scored against the family matrices.

Profile: a weighted pattern p of length m defined over an alphabet Σ. A |Σ| x m matrix defines the scores (for DNA, the rows are A, C, G, T).

Segment Score: for a segment S = s_1 s_2 … s_m, the score is the sum of the matrix entries selected by symbol s_j in column j (rows A, C, G, T; columns s_1 to s_6 in the example).
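Written out, and assuming the usual additive definition in which $M(a, j)$ denotes the matrix entry for symbol $a$ in column $j$, the score of a segment is:

```latex
\[
  \mathrm{Score}(S) \;=\; \sum_{j=1}^{m} M(s_j, j),
  \qquad S = s_1 s_2 \dots s_m .
\]
```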

Meaning of the score

Segment Score Example: the segment GTACAC is scored against the matrix (rows A, C, G, T).

Profile Matching Problem: given a text T of length n defined over Σ, a profile p (a |Σ| x m matrix), and a score threshold th, let S_i be the score of the segment of length m starting at position i. Find all positions i in T where S_i ≥ th.
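The problem statement translates directly into a naive O(nm) scan. The following sketch assumes a small integer matrix that is purely hypothetical (the matrix values from the slides are not available in this transcript), and it reports 0-based positions, whereas the slides count from 1.

```python
# Hypothetical integer scoring matrix with m = 3 columns (not the slide matrix).
MATRIX = {
    "A": [2, -1, 0],
    "C": [0, 3, -2],
    "G": [-1, 0, 4],
    "T": [1, -2, 1],
}

def segment_score(matrix, segment):
    """Sum the matrix entry selected by each symbol in its column."""
    return sum(matrix[ch][j] for j, ch in enumerate(segment))

def naive_profile_match(text, matrix, m, th):
    """Return all 0-based start positions i with S_i >= th."""
    return [i for i in range(len(text) - m + 1)
            if segment_score(matrix, text[i:i + m]) >= th]

matches = naive_profile_match("ACGTCGA", MATRIX, 3, 8)
print(matches)
```

With threshold 8, only the windows ACG (score 9) and TCG (score 8) qualify.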

Example: th = 2, text CGTACACTCGGTA, matrix with rows A, C, G, T and m = 6. Sliding the window one position at a time gives:
Position 1: score 0.6, not a match.
Position 2: score 2.1, match.
Position 3: score 1.4, not a match.
Position 4: score 1.8, not a match.
Position 5: score 0.9, not a match.
Position 6: score 1.3, not a match.
Position 7: score 1.4, not a match.
Position 8: score 2.2, match.

Scenarios of applications. Online algorithms (no indexing): a database of profile matrices (e.g. TRANSFAC, JASPAR for TFBS) and an input sequence to be searched. Offline algorithms (indexing): an indexed sequence or set of sequences and an input matrix to search for matches.

Summary of current methods. Look-ahead method, LA (Wu et al., 00). Offline methods based on LA: suffix tree (Dorohonceanu et al., 00), suffix array (Beckstette et al., 04, 06), truncated suffix tree (Pizzi and Favaretto, 10). Online methods based on LA: Aho-Corasick, filtering (Pizzi et al., 07, 09).

Summary of current methods (continued). Pattern matching: Shift-Add (Salmela and Tarhio, 08), KMP (Liefooghe et al., 09). Matrix partitioning (Liefooghe et al., 06; Pizzi et al., 07, 09). FFT based (Rajasekaran et al., 02). Compression based (Freschi et al., 05).

The look-ahead approach: for every column of the matrix (rows A, C, G, T) take the maximum score (max), and from these derive the partial thresholds P_th, that is, the minimum score a prefix must reach for a match to still be possible. While scoring the window CGTACA (score 0.1 after the first symbol), the scan stops as soon as the prefix score falls below the partial threshold: the remaining positions do not need to be compared.
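A minimal sketch of the look-ahead idea, assuming the partial threshold for a prefix of length j is th minus the maximum score obtainable from the remaining columns. The matrix is hypothetical, positions are 0-based, and the function also counts symbol comparisons so the saving over the naive scan is visible.

```python
# Hypothetical integer scoring matrix with m = 3 columns (not the slide matrix).
MATRIX = {"A": [2, -1, 0], "C": [0, 3, -2], "G": [-1, 0, 4], "T": [1, -2, 1]}

def partial_thresholds(matrix, m, th):
    """pth[j] = th minus the best score still obtainable from columns j..m-1."""
    max_rest = [0] * (m + 1)
    for j in range(m - 1, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(row[j] for row in matrix.values())
    return [th - max_rest[j] for j in range(m + 1)]

def lookahead_search(text, matrix, m, th):
    """Return (0-based match positions, number of symbol comparisons)."""
    pth = partial_thresholds(matrix, m, th)
    matches, compared = [], 0
    for i in range(len(text) - m + 1):
        score = 0
        for j in range(m):
            score += matrix[text[i + j]][j]
            compared += 1
            if score < pth[j + 1]:
                break  # no completion of this prefix can reach th
        else:
            matches.append(i)  # all m columns passed, so score >= th
    return matches, compared

matches, compared = lookahead_search("ACGTCGA", MATRIX, 3, 8)
print(matches, compared)
```

On this toy input the scan uses 9 symbol comparisons, against 15 for a full scoring of all five windows.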

The suffix tree of T: the suffix tree, Tree(T), is a compacted trie that represents all the suffixes of the string T. It has linear size, |Tree(T)| = O(|T|), and can be constructed in linear time O(|T|).

Suffix trie and suffix tree. [Figure: Trie(abaab) and Tree(abaab)]

Tree(T) is of linear size: only the internal branching nodes and the leaves are represented explicitly; edges are labeled by substrings of T; v = node(α) if the path from the root to v spells α; leaves and suffixes are in one-to-one correspondence; there are |T| leaves, hence fewer than |T| internal nodes.

Tree(hattivatti): the suffixes are hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i. [Figure: suffix tree of hattivatti with edges labeled by substrings, followed by the same tree with edge labels stored as index pairs such as 2,5]

Tree(T) is a full-text index: P occurs in T if and only if P is a prefix of some suffix of T, that is, if and only if a path for P exists in Tree(T). All occurrences of P are found in time O(|P| + #occ). [Figure: P occurs in T at locations 8, 31, …]

LA over a Suffix Tree (paths CG-T and TCC-G in the figure): Score(CG) = 0.2 > -0.2 = Th(2), but Score(CGT) = 0.2 < 0.3 = Th(3), so the subtree is skipped.

LA over a Suffix Tree (continued): Score(TCC) = 1.9 > 0.3 = Th(3) and Score(TCCG) = 2.2 > 2 = Th(6), so there is a match for the whole subtree.

Suffix array example: the suffix array is the lexicographic order of the suffixes. Suffixes of hattivatti in text order: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε. In lexicographic order: ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti.

Suffix array: SA(T) is an array giving the lexicographic order of the suffixes of T. Practitioners like suffix arrays (simplicity, space efficiency); theoreticians like suffix trees (explicit structure).
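The hattivatti example can be reproduced with a few lines of Python. This is only an illustrative sketch: the construction sorts suffixes directly (quadratic in the worst case, not one of the linear-time constructions), the empty suffix ε is left out, positions are 0-based, and the function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of the non-empty suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose k-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "hattivatti"
sa = suffix_array(text)
print([text[i:] for i in sa])
print(find_occurrences(text, sa, "tti"))
```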

LA over a Suffix Array: in terms of suffix trees, skp[i] is the lexicographically next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of S_suf[i-1] and S_suf[i]: $skp[i] = \min(\{n + 1\} \cup \{ j \in [i + 1, n] \mid lcp[i] > lcp[j] \})$.
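The skp definition can be evaluated directly from the lcp table. The sketch below is a quadratic, literal transcription of the formula and makes two assumptions: indices are 0-based (so the sentinel n + 1 of the formula becomes n), and lcp[i] is the length of the longest common prefix of the suffixes at ranks i-1 and i, with lcp[0] = 0. Function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of the non-empty suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    def common(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))]

def skip_array(lcp):
    """skp[i] = smallest j > i with lcp[j] < lcp[i], or n if there is none."""
    n = len(lcp)
    skp = []
    for i in range(n):
        nxt = n
        for j in range(i + 1, n):
            if lcp[j] < lcp[i]:
                nxt = j
                break
        skp.append(nxt)
    return skp

text = "hattivatti"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
skp = skip_array(lcp)
print(lcp)
print(skp)
```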

LA over Truncated ST: build a truncated suffix tree (TST) with truncation factor h; let L be the maximum length of a matrix in the database. If h = L, the search works exactly as on a suffix tree. If h < L, the TST acts as a filter: when a leaf is reached, take the corresponding positions (p_1, p_2, …, p_t) and, for each p_i, check the positions p_i + j, h < j ≤ m, with look-ahead.
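The filter-then-verify scheme can be illustrated without building a tree. The sketch below swaps the truncated suffix tree for a plain dictionary from each h-mer to its text positions, which keeps the same two phases (score the indexed prefix once, then extend each position with look-ahead) but is not the data structure from the slides. The matrix, text, and function names are hypothetical; positions are 0-based.

```python
from collections import defaultdict

# Hypothetical integer scoring matrix with m = 4 columns.
MATRIX = {"A": [5, 0, 1, 0], "C": [0, 4, -1, 1],
          "G": [1, 0, 0, -1], "T": [0, 1, 1, 0]}

def partial_thresholds(matrix, m, th):
    max_rest = [0] * (m + 1)
    for j in range(m - 1, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(row[j] for row in matrix.values())
    return [th - max_rest[j] for j in range(m + 1)]

def build_hmer_index(text, h):
    """Offline step: map every h-mer to the list of its start positions."""
    index = defaultdict(list)
    for p in range(len(text) - h + 1):
        index[text[p:p + h]].append(p)
    return index

def indexed_search(text, index, matrix, m, th, h):
    pth = partial_thresholds(matrix, m, th)
    matches = []
    for hmer, positions in index.items():
        score, alive = 0, True
        for j, ch in enumerate(hmer):  # score the shared prefix once
            score += matrix[ch][j]
            if score < pth[j + 1]:
                alive = False
                break
        if not alive:
            continue  # every position of this h-mer is discarded together
        for p in positions:
            if p + m > len(text):
                continue
            s = score
            for j in range(h, m):  # verify the remaining m - h symbols
                s += matrix[text[p + j]][j]
                if s < pth[j + 1]:
                    break
            else:
                matches.append(p)
    return sorted(matches)

text = "GACTTACGAC"
index = build_hmer_index(text, 2)
matches = indexed_search(text, index, MATRIX, 4, 7, 2)
print(matches)
```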

[Figure: LA over truncated ST; leaf positions p_1, p_2, p_3 at depth h are extended from p_i + h over the remaining L - h symbols]

[Figure: space occupation, TruST]

[Figure: running time, TruST]

Online Profile Matching. Aho-Corasick Expansion (ACE): pattern matching + LA. Look-ahead Filtration Algorithm (LFA): score of a fixed-length prefix used as a filter + LA. Naive Superalphabet (NS): encode k-mers as superalphabet symbols.

The Aho-Corasick Algorithm. [Figure: a trie for D = {he, she, his, hers}]

The Aho-Corasick algorithm: add failure links to the trie. Time O(n + m), where m is the sum of the word lengths; space depends on D. [Figure: trie with failure links]
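For reference, here is a compact textbook-style sketch of the Aho-Corasick automaton for the dictionary D = {he, she, his, hers}. It is a generic implementation with dictionary-based transitions and failure links computed breadth-first, not the code behind the slides; the text "ushers" and all function names are illustrative choices.

```python
from collections import deque

def build_automaton(words):
    goto, fail, out = [{}], [0], [[]]  # state 0 is the root
    for w in words:  # phase 1: build the trie
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:  # phase 2: failure links in breadth-first order
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit words ending here
            queue.append(t)
    return goto, fail, out

def ac_search(text, words):
    """Return (0-based start position, word) for every occurrence."""
    goto, fail, out = build_automaton(words)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits

hits = ac_search("ushers", ["he", "she", "his", "hers"])
print(hits)
```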

The Fast Aho-Corasick: time O(n); space depends on D and Σ. [Figure: automaton for D = {he, she, his, hers}]

AC and profile matching: build the AC automaton for all the words that are a match for the matrix. The LA partial thresholds limit the number of words to those that actually match. Pre-processing takes O(|D||Σ|m + m|Σ|), where |D| ≤ |Σ|^m depends on the matrix and the threshold. Searching the text with the AC automaton then takes O(n).
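The dictionary D of matching words can be enumerated by a depth-first expansion that prunes every prefix falling below its partial threshold. The sketch below covers only this enumeration step; the automaton itself is omitted, and because all words in D have the same length m, a set lookup per window stands in for the AC search. The matrix and names are hypothetical.

```python
# Hypothetical integer scoring matrix with m = 3 columns.
MATRIX = {"A": [2, -1, 0], "C": [0, 3, -2], "G": [-1, 0, 4], "T": [1, -2, 1]}

def partial_thresholds(matrix, m, th):
    max_rest = [0] * (m + 1)
    for j in range(m - 1, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(row[j] for row in matrix.values())
    return [th - max_rest[j] for j in range(m + 1)]

def matching_words(matrix, m, th):
    """Enumerate every word of length m whose score is >= th."""
    alphabet = sorted(matrix)
    pth = partial_thresholds(matrix, m, th)
    words = []

    def extend(prefix, score):
        j = len(prefix)
        if j == m:
            words.append(prefix)
            return
        for ch in alphabet:
            s = score + matrix[ch][j]
            if s >= pth[j + 1]:  # prune prefixes that cannot reach th
                extend(prefix + ch, s)

    extend("", 0)
    return words

words = matching_words(MATRIX, 3, 8)
text, m = "ACGTCGA", 3
dictionary = set(words)
hits = [i for i in range(len(text) - m + 1) if text[i:i + m] in dictionary]
print(words, hits)
```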

AC-Extension by LA (matrix rows A, C, G, T; partial thresholds P_th). The automaton is expanded one position at a time, keeping each symbol with its score. First position: [C, 0.1], [G, 0.2], [A, 0.3], [T, 0.4]. Second position: [A, 0.1], [G, 0.1], [T, 0.3], [C, 0.9]. Third position: [G, 0.5], [C, 0.6].

ACE Example: the automaton reads the text CGTACACTCGGTA one symbol at a time (steps 1 to 7 in the figure). At step 7 a match is reported at position p - m + 1 = 7 - 6 + 1 = 2. [Figure: ACE automaton]

Minimum Gain for ACE: the dual concept of look-ahead. Compute for every prefix the minimum contribution of the remaining positions in the pattern. If current_score(i) + min_gain(i) > Th, report a match. Advantage: the automaton saves a full subtree of height m - i.
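A small sketch of the minimum-gain test, assuming min_gain for a prefix of length j is the sum of the column minima over the remaining positions. The matrix and function names are hypothetical, and the comparison uses >= to agree with the problem definition S_i ≥ th (the slide writes >).

```python
# Hypothetical integer scoring matrix with m = 4 columns.
MATRIX = {"A": [5, 0, 1, 0], "C": [0, 4, -1, 1],
          "G": [1, 0, 0, -1], "T": [0, 1, 1, 0]}

def min_gains(matrix, m):
    """gains[j] = lowest total score that columns j..m-1 can contribute."""
    gains = [0] * (m + 1)
    for j in range(m - 1, -1, -1):
        gains[j] = gains[j + 1] + min(row[j] for row in matrix.values())
    return gains

def decisive_prefix_length(word, matrix, th):
    """Shortest prefix length after which a match is certain, else None."""
    gains = min_gains(matrix, len(word))
    score = 0
    for j, ch in enumerate(word):
        score += matrix[ch][j]
        if score + gains[j + 1] >= th:  # even the worst completion matches
            return j + 1
    return None

gains = min_gains(MATRIX, 4)
print(gains)
print(decisive_prefix_length("ACGT", MATRIX, 7))
```

Here the prefix AC already guarantees a match, so an automaton would not need to store the subtree below it.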

Example: M0003, MSS = 0.85. Scores along the path: [G, 18500], [C, 37000], [C, 55500]. The prefix GCC (h = 3) is sufficient to detect a match, which saves 5464 nodes out of 5468.

[Figure: Minimum Gain ACE]

Look-ahead Filtration: compute the scores of all words of fixed length k and store them, with O(|Σ|^k) pre-processing. Slide a window of size k over the text; when the score is ≥ P_th[k], check the remaining symbols (up to m - k) with LA. The search takes O(n + (m - k)r), where k is the prefix length and r is the average number of full scorings.
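The filtration step can be sketched as follows. The k-mer table is a plain dictionary keyed by strings and each window is sliced from the text, whereas an optimized implementation would encode k-mers as integers and update them by shifting; that simplification, the matrix, and the names are assumptions. Positions are 0-based and the text is assumed to contain only alphabet symbols.

```python
from itertools import product

# Hypothetical integer scoring matrix with m = 4 columns.
MATRIX = {"A": [5, 0, 1, 0], "C": [0, 4, -1, 1],
          "G": [1, 0, 0, -1], "T": [0, 1, 1, 0]}

def partial_thresholds(matrix, m, th):
    max_rest = [0] * (m + 1)
    for j in range(m - 1, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(row[j] for row in matrix.values())
    return [th - max_rest[j] for j in range(m + 1)]

def lfa_search(text, matrix, m, th, k):
    alphabet = sorted(matrix)
    pth = partial_thresholds(matrix, m, th)
    # Pre-processing: score of every k-mer against the first k columns.
    kmer_score = {"".join(w): sum(matrix[c][j] for j, c in enumerate(w))
                  for w in product(alphabet, repeat=k)}
    matches = []
    for i in range(len(text) - m + 1):
        score = kmer_score[text[i:i + k]]
        if score < pth[k]:
            continue  # filtered out by the prefix score alone
        j = k
        while j < m:  # look-ahead over the remaining m - k symbols
            score += matrix[text[i + j]][j]
            j += 1
            if score < pth[j]:
                break
        else:
            matches.append(i)
    return matches

matches = lfa_search("GACTTACGAC", MATRIX, 4, 7, 2)
print(matches)
```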

Look-ahead Filtration Example (k = 3): the score table has |Σ|^k entries, one per 3-mer from AAA to TTT (for example ATT 0.5, CGT 0.1, CTT 0.3, GTT 0.4, TTT 0.6), and P_th[3] = 0.3. On the text CGTACACTCGGTA, Score(CGT) = 0.1 < P_th[3], so the window is rejected; shift and concatenate to obtain the next 3-mer.

Look-ahead Filtration Example (continued): Score(GTA) = 0.5 > P_th[3], so at most m - k remaining symbols are checked: Score(GTAC) = 0.7 > P_th[4], Score(GTACA) = 1.7 > P_th[5], Score(GTACAC) = 2.1 > th. Match!

More on ACE and LF: it is possible to combine both methods, with the automaton built on qualifying prefixes only. Multi-matrix version.

Super-Alphabet: code words of length k as super-alphabet symbols; |Σ|^k symbols are needed. Code the matrix M into a matrix M' of size |Σ|^k x m/k, then run the naive algorithm on the sequence in O(nm/k).
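A sketch of the super-alphabet recoding, assuming m is a multiple of k. Each block of k columns becomes one table with |Σ|^k entries, so a window is scored with m/k lookups. The tables are keyed by strings for readability rather than by integer codes; the matrix and function names are hypothetical.

```python
from itertools import product

# Hypothetical integer scoring matrix with m = 4 columns.
MATRIX = {"A": [5, 0, 1, 0], "C": [0, 4, -1, 1],
          "G": [1, 0, 0, -1], "T": [0, 1, 1, 0]}

def build_super_matrix(matrix, m, k):
    """M': one {k-mer: score} table per block of k consecutive columns."""
    alphabet = sorted(matrix)
    return [{"".join(w): sum(matrix[c][b + j] for j, c in enumerate(w))
             for w in product(alphabet, repeat=k)}
            for b in range(0, m, k)]

def superalphabet_search(text, matrix, m, th, k):
    """Naive scan using m // k table lookups per window (0-based positions)."""
    blocks = build_super_matrix(matrix, m, k)
    matches = []
    for i in range(len(text) - m + 1):
        score = sum(blocks[b][text[i + b * k:i + (b + 1) * k]]
                    for b in range(m // k))
        if score >= th:
            matches.append(i)
    return matches

blocks = build_super_matrix(MATRIX, 4, 2)
matches = superalphabet_search("GACTTACGAC", MATRIX, 4, 7, 2)
print(len(blocks), len(blocks[0]), matches)
```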

SuperAlphabet Example (k = 2): the table has |Σ|^k = 16 rows (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) and three score columns, for positions 1-2, 3-4, and 5-6. On the text CGTACACTCGGTA, the first window scores 0.6 < Th; the next window scores 2.1, a match.

Experiments: Jaspar database, 123 TFBS matrices (DNA); PRINTS database (proteins). Test sequence of about 50M bases. The p-value defines the threshold. Machine: 3 GHz Intel Pentium IV processor with 2 gigabytes of main memory, running Linux.

[Figure: DNA, average running times per matrix]

[Figure: DNA, matrix length]

[Figure: DNA, window width]

[Figure: proteins, average time per matrix]

[Figure: proteins, matrix length]

[Figure: MOODS, Motif Occurrence Detection Suite]

Conclusions: matrix search is a core step in many bioinformatics applications (searching, discovery, classification, …). Several approaches have been developed in recent years. Online methods based on filtering are currently the most efficient.

References
C. Pizzi, P. Rastas, E. Ukkonen. Fast Search Algorithms for Position Specific Scoring Matrices. In Proc. of the 1st Conference on Bioinformatics Research and Development (BIRD 07), Berlin, Germany, March 2007, LNCS/LNBI 4414.
C. Pizzi, E. Ukkonen. Fast Profile Matching Algorithms - a survey. Theoretical Computer Science, 395(2-3), 2008, Special Issue SAIL: String Algorithms, Information and Learning.
C. Pizzi, P. Rastas, E. Ukkonen. Finding significant matches of position weight matrices in linear time. Accepted for publication in IEEE Transactions on Computational Biology and Bioinformatics, 2009.
J. Korhonen, P. Martinmaki, C. Pizzi, P. Rastas, E. Ukkonen. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics (23).

Thanks

Acknowledgements: Esko Ukkonen, Pasi Rastas, Janne Korhonen, P. Martinmaki. Academy of Finland grant “From Data to Knowledge”. EU Project “Regulatory Networks”. Premio di Ricerca `Avere Trent’Anni’: Univ. Padova, Parco Scientifico Galileo, Il Mattino, Giovani Confindustria, Scuola Galileiana di Studi Superiori.

Length 100 (13 patterns obtained by concatenating Jaspar matrices). NA = Naïve Algorithm; LSA = Look-ahead Search Algorithm; LFA = Look-ahead Filtration Algorithm (k = 7); NS = Naïve Superalphabet (k = 7). MSS: Matrix Similarity Score (% of maximal score).

[Figure: multiple matrices search]

[Figure: running time per matrix]

Length 0 to 15 (108 matrices). NA = Naïve Algorithm; LSA = Look-ahead Search Algorithm; ACE = Aho-Corasick Expansion; LFA = Look-ahead Filtration Algorithm (k = 7); NS = Naïve Superalphabet (k = 7).

[Figure: running time per matrix]

Length 16 to 30 (15 matrices). NA = Naïve Algorithm; LSA = Look-ahead Search Algorithm; LFA = Look-ahead Filtration Algorithm; NS = Naïve Superalphabet.

Length 100 (13 patterns obtained by concatenating Jaspar matrices). [Table: rows NA, LSA, LFA, NS; columns P = 10^-5, P = 10^-4, P = 10^-3, P = 10^-2]

Motif Representation: instances of a biological signal are different, for example ACATAC, CCGAAT, ATGCAT, GCCTAC, TCCAAA, TTCGAA, ACGGAC, TCCTAT, GCCCAC, TCGGAA. Consensus -> pattern representation: TCC(G|T)AC. Profile -> matrix representation (rows A, G, C, T).