
1 Algorithms to Search Position Specific Scoring Matrices in Biosequences Cinzia Pizzi Dipartimento di Ingegneria dell’Informazione Università degli Studi di Padova

2 Outline
- Weighted patterns in Biology
- The problem of profile matching
- The look-ahead method
- Suffix based Algorithms
- Aho-Corasick Extension (ACE)
- Look-ahead Filtration Algorithm (LFA)
- Superalphabet (NS)
- Some experimental results
C.Pizzi, DEI – Univ. of Padova (Italy)

3 What are Motifs?
- Motifs are biologically significant elements that are responsible for common structures or functions
- Motifs are statistically significant substrings in bio-sequences
- Assumption: if two entities share the same function or structure, common over-represented elements might be responsible for the observed similarity

4 Motif Discovery
- Take a set of co-expressed genes
- Compare their promoter regions
- Common over-represented substrings are good candidates for TFBS (transcription factor binding sites)
- Need counted/expected frequency
[Figure: promoters of co-expressed genes]

5 Motif Discovery
- TFBS, DNA motifs
- Motifs = binding sites = substrings
- Intrinsic variability of biological sequences
- Mismatches, indels, wildcards, superalphabets...
[Figure: promoters of co-expressed genes]

6 Motif Representation
- Binding sites of the same factor are not exactly the same in all sequences
- Example instances: ACATAC CCGAAT ATGCAT GCCTAC TCCAAA TTCGAA ACGGAC TCCTAT GCCCAC TCGGAA
- Profile -> matrix representation (rows A, G, C, T; columns 1 to 6)

7 Motif Representation
- Protein classification: each family is modeled by a matrix
- Example sequences: ACDEHNPVAC CCDEGAMMAT ATHCATVVST
[Figure: three family matrices (rows A, D, C, ...; columns 1 to 6) shown with the sequence WVDEHNPVAC]

8 Profile
- Weighted pattern p of length m defined over alphabet Σ
- |Σ| x m matrix defines scores
     1    2    3    4    5    6
A   0.3  0.0  0.1  0.2  1.0  0.3
C   0.1  0.8  0.5  0.2  0.0  0.4
G   0.2  0.0  0.4  0.3  0.0  0.0
T   0.4  0.2  0.0  0.3  0.0  0.3

9 Segment Score
- Segment S = s_1 s_2 … s_m
- Each symbol s_j selects the entry of row s_j in column j of the profile (slide 8); the segment score is the sum of the m selected entries

10 Meaning of the score

11 Segment Score Example
- Segment G T A C A C scored on the profile of slide 8
- Score = 0.2 + 0.2 + 0.1 + 0.2 + 1.0 + 0.4 = 2.1
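A minimal Python sketch of the segment score, added for illustration. The matrix is the example profile of slide 8 stored in integer tenths, which is an assumption made here to avoid floating-point rounding; the name `segment_score` is likewise illustrative and not from the slides.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def segment_score(segment, profile=PROFILE):
    """Sum the matrix entry selected by each symbol at its position."""
    m = len(next(iter(profile.values())))
    if len(segment) != m:
        raise ValueError("segment length must equal the profile length")
    return sum(profile[sym][j] for j, sym in enumerate(segment))


print(segment_score("GTACAC") / 10)  # 2.1
```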

12 Profile Matching Problem
- Text T of length n defined over Σ
- Profile p (|Σ| x m)
- Score threshold th
- Score S_i of the segment of length m starting at position i
- Find all positions i in T where S_i ≥ th

13-20 Example: th = 2, T = CGTACACTCGGTA (profile of slide 8)
pos  segment  score  result
1    CGTACA   0.6    Not a match!
2    GTACAC   2.1    Match at pos 2!
3    TACACT   1.4    Not a match!
4    ACACTC   1.8    Not a match!
5    CACTCG   0.9    Not a match!
6    ACTCGG   1.3    Not a match!
7    CTCGGT   1.4    Not a match!
8    TCGGTA   2.2    Match at pos 8!
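The sliding-window example can be reproduced with a naive scan. This is only a sketch under stated assumptions: scores are kept in integer tenths (so th = 2 becomes 20), and `naive_matches` is a hypothetical helper name.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def naive_matches(text, profile, th):
    """Score every window of length m and keep those with score >= th."""
    m = len(next(iter(profile.values())))
    hits = []
    for i in range(len(text) - m + 1):
        score = sum(profile[text[i + j]][j] for j in range(m))
        if score >= th:
            hits.append((i + 1, score))  # 1-based positions, as on the slides
    return hits


print(naive_matches("CGTACACTCGGTA", PROFILE, 20))  # [(2, 21), (8, 22)]
```

The scan always costs O(nm) symbol comparisons, which is the baseline the later methods try to improve.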

21 Scenarios of applications
- Online algorithms (no indexing)
  - Database of profile matrices (e.g. TRANSFAC, JASPAR for TFBS)
  - Input sequence to be searched
- Offline algorithms (indexing)
  - Sequence or set of sequences
  - Input matrix to search for matches

22 Summary of current methods
- Look-ahead method LA (Wu et al., 00)
- Offline methods based on LA:
  - Suffix tree (Dorohonceanu et al., 00)
  - Suffix array (Beckstette et al., 04, 06)
  - Truncated suffix tree (Pizzi and Favaretto, 10)
- Online methods based on LA:
  - Aho-Corasick, filtering (Pizzi et al., 07, 09)

23 Summary of current methods
- Pattern matching
  - Shift-Add (Salmela and Tarhio, 08)
  - KMP (Liefooghe et al., 09)
- Matrix partitioning (Liefooghe et al., 06; Pizzi et al., 07, 09)
- FFT based (Rajasekaran et al., 02)
- Compression based (Freschi et al., 05)

24-27 The look-ahead approach (th = 2)
       1     2     3     4     5     6
A     0.3   0.0   0.1   0.2   1.0   0.3
C     0.1   0.8   0.5   0.2   0.0   0.4
G     0.2   0.0   0.4   0.3   0.0   0.0
T     0.4   0.2   0.0   0.3   0.0   0.3
max   3.0   2.2   1.7   1.4   0.4   0.0
P_th -1.0  -0.2   0.3   0.6   1.6   2.0
- max[j] is the largest score still obtainable from the positions after j
- P_th[j] = th - max[j] is the partial threshold that a prefix of length j must reach
- Segment C G T A C A: the prefix scores are 0.1, 0.1, 0.1; since 0.1 < P_th[3] = 0.3, the segment is rejected after three symbols
- Don't need to compare the remaining ones!
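A hedged Python sketch of the look-ahead scan: it derives the partial thresholds from the column maxima and abandons a window as soon as its prefix score falls below them. Scores are in integer tenths (an assumption made here), and the function names are illustrative rather than taken from the slides.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def partial_thresholds(profile, th):
    """P_th[j] = th minus the best score obtainable after position j."""
    m = len(next(iter(profile.values())))
    col_max = [max(row[j] for row in profile.values()) for j in range(m)]
    pth = [0] * m
    remaining = 0  # sum of column maxima to the right of column j
    for j in range(m - 1, -1, -1):
        pth[j] = th - remaining
        remaining += col_max[j]
    return pth


def lookahead_matches(text, profile, th):
    """Return (1-based match positions, number of symbols actually compared)."""
    pth = partial_thresholds(profile, th)
    m = len(pth)
    hits, compared = [], 0
    for i in range(len(text) - m + 1):
        score = 0
        for j in range(m):
            score += profile[text[i + j]][j]
            compared += 1
            if score < pth[j]:
                break  # even the best remaining symbols cannot reach th
        else:
            hits.append(i + 1)
    return hits, compared


print(partial_thresholds(PROFILE, 20))  # [-10, -2, 3, 6, 16, 20]
print(lookahead_matches("CGTACACTCGGTA", PROFILE, 20)[0])  # [2, 8]
```

The pruning is exact: a window is discarded only when no completion could reach th, so the result equals that of the naive scan.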

28 The suffix tree of T
- The suffix tree data structure, Tree(T), is a compacted trie that represents all the suffixes of string T
- Linear size: |Tree(T)| = O(|T|)
- Can be constructed in linear time O(|T|)

29 Suffix trie and suffix tree
[Figure: Trie(abaab) and Tree(abaab), the suffix trie and the suffix tree of the string abaab]

30 Tree(T) is of linear size
- Only the internal branching nodes and the leaves are represented explicitly
- Edges are labeled by substrings of T
- v = node(α) if the path from root to v spells α
- One-to-one correspondence of leaves and suffixes
- |T| leaves, hence < |T| internal nodes

31-32 Tree(hattivatti)
- Suffixes: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i
[Figure: suffix tree of hattivatti, first with edges labeled by substrings, then with each edge label stored as a pair of text positions, e.g. 6,10 for vatti and 2,5 for atti]

33 Tree(T) is full text index
- P occurs in T ⇔ P is a prefix of some suffix of T ⇔ path for P exists in Tree(T)
- All occurrences of P in time O(|P| + #occ)
[Figure: the path for P in Tree(T) leads to leaves 31 and 8; P occurs in T at locations 8, 31, …]

34 LA over a Suffix Tree
- Path CG: Score(CG)=0.2 > -0.2 = Th(2)
- Path CGT: Score(CGT)=0.2 < 0.3 = Th(3): skip the subtree

35 LA over a Suffix Tree
- Path TCC: Score(TCC)=1.9 > 0.3 = Th(3)
- Path TCCG: Score(TCCG)=2.2 > 2 = Th(6): match, report all the subtree

36 Suffix array: example
- Suffix array = lexicographic order of the suffixes
- Suffixes of hattivatti in text order: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε
rank  start  suffix
1     11     ε
2     7      atti
3     2      attivatti
4     1      hattivatti
5     10     i
6     5      ivatti
7     9      ti
8     4      tivatti
9     8      tti
10    3      ttivatti
11    6      vatti

37 Suffix array
- Suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
- Practitioners like suffix arrays (simplicity, space efficiency)
- Theoreticians like suffix trees (explicit structure)
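For illustration, the hattivatti suffix array can be obtained with a naive sort of the suffixes. This sketch is quadratic or worse in the text length and is not one of the efficient constructions used in practice; the function name is an assumption.

```python
def suffix_array(text):
    """1-based start positions of all suffixes, including the empty one,
    listed in lexicographic order (naive construction by sorting)."""
    n = len(text)
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])


print(suffix_array("hattivatti"))  # [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
```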

38 LA over a Suffix Array
- In terms of suffix trees, skp[i] is the lexicographically next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of S_suf[i-1] and S_suf[i]
- skp[i] = min({n + 1} ∪ {j ∈ [i + 1, n] | lcp[i] > lcp[j]})
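The skp definition can be checked on a small input with a direct, quadratic Python transcription. This is a sketch only: the empty suffix is left out, lcp of the first entry is taken to be 0 by convention (an assumption), and the helper names are illustrative.

```python
def suffix_array(text):
    """0-based start positions of the non-empty suffixes, lexicographically sorted."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp


def skip_array(lcp):
    """skp per the formula, reported with the 1-based indices it uses."""
    n = len(lcp)
    skp = []
    for i in range(n):
        target = n + 1  # default when no later entry has a smaller lcp
        for j in range(i + 1, n):
            if lcp[j] < lcp[i]:
                target = j + 1  # convert the 0-based j to a 1-based index
                break
        skp.append(target)
    return skp


sa = suffix_array("hattivatti")
lcp = lcp_array("hattivatti", sa)
print(lcp)              # [0, 4, 0, 0, 1, 0, 2, 1, 3, 0]
print(skip_array(lcp))  # [11, 3, 11, 11, 6, 11, 8, 10, 10, 11]
```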

39 LA over Truncated ST
- Build TST with truncation factor h
- L = max length of a matrix in the DB
- If h = L, simply work as ST
- If h < L, filtering:
  - if a leaf is reached, take the corresponding positions (p_1, p_2, …, p_t)
  - for each p_i, check positions p_i + j, h < j <= m, with look-ahead

40 LA over Truncated ST
[Figure: a tree truncated at depth h out of L; each leaf position p_1, p_2, p_3 is extended from p_i + h over the remaining L - h symbols]

41 Space Occupation TruST

42 Running Time TruST

43 Online Profile Matching
- Aho-Corasick Expansion (ACE): pattern matching + LA
- Lookahead Filtration Algorithm (LFA): score for fixed length prefix as a filter + LA
- Naive Superalphabet (NS): encode k-mers in superalphabet symbol

44 The Aho-Corasick Algorithm
[Figure: a trie for D = {he, she, his, hers}]

45 The Aho-Corasick algorithm
- Add failure links
- Time O(n+m), where m = sum of word lengths
- Space depends on D
[Figure: the trie for D with failure links added]

46 The Fast Aho-Corasick
- Time O(n)
- Space depends on D and Σ
[Figure: automaton with states 0 to 9 and explicit transitions for the symbols e, h, i, r, s]

47 AC and profile matching
- Build AC automaton for all the words that are a match for the matrix
  - LA partial threshold limits the number of words to those that actually match
  - O(|D||Σ|m + m|Σ|) pre-processing
  - |D| ≤ |Σ|^m depends on matrix and threshold
- Search the text with AC automaton
  - O(n) search

48-50 AC-Extension by LA (profile and P_th of the look-ahead slides; each node is [symbol, prefix score])
- First position: [C,0.1] [G,0.2] [A,0.3] [T,0.4]
- Second position, children of [C,0.1]: [A,0.1] [G,0.1] [T,0.3] [C,0.9]
- Third position, extensions of a prefix with score 0.1 that reach P_th[3] = 0.3: [G,0.5] [C,0.6]
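The expansion step can be sketched in Python as a recursive enumeration of the words that reach the threshold, pruned by the partial thresholds. Only the construction of the word set D is shown; building the Aho-Corasick automaton over D is left out. Scores are in integer tenths and the names are illustrative assumptions.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def partial_thresholds(profile, th):
    """P_th[j] = th minus the best score obtainable after position j."""
    m = len(next(iter(profile.values())))
    col_max = [max(row[j] for row in profile.values()) for j in range(m)]
    pth, remaining = [0] * m, 0
    for j in range(m - 1, -1, -1):
        pth[j] = th - remaining
        remaining += col_max[j]
    return pth


def expand_matching_words(profile, th):
    """Enumerate every word of length m with score >= th, position by position."""
    pth = partial_thresholds(profile, th)
    m = len(pth)
    words = []

    def extend(prefix, score):
        j = len(prefix)
        if j == m:
            words.append(prefix)
            return
        for sym in sorted(profile):
            s = score + profile[sym][j]
            if s >= pth[j]:  # the prefix can still be completed to a match
                extend(prefix + sym, s)

    extend("", 0)
    return words


D = expand_matching_words(PROFILE, 20)
print("GTACAC" in D, "CGTACA" in D)  # True False
```

Because every surviving prefix can be completed to at least one match, no branch of the enumeration is a dead end.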

51-57 ACE Example
- Text: CGTACACTCGGTA
- The automaton reads the text one symbol at a time (steps 1 to 7)
- At step 7 a final state is reached: match at p-m+1 = 7-6+1 = 2
[Figure: automaton built on the matching words, with the current state highlighted at each step]

58 Minimum Gain for ACE
- Dual concept of look-ahead
- Compute for every prefix the minimum contribution of the remaining positions in the pattern
- If current_score(i) + min_gain(i) ≥ Th, report a match
- Advantage: in the automaton, save a full subtree of height m-i
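A short Python sketch of the minimum-gain test on the example profile of slide 8. It uses ≥ so that the test agrees with the match condition S_i ≥ th; scores are in integer tenths, and the function names are illustrative assumptions.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def min_gains(profile):
    """gains[j] = smallest total contribution of the positions after column j."""
    m = len(next(iter(profile.values())))
    col_min = [min(row[j] for row in profile.values()) for j in range(m)]
    gains, remaining = [0] * m, 0
    for j in range(m - 1, -1, -1):
        gains[j] = remaining
        remaining += col_min[j]
    return gains


def shortest_decisive_prefix(word, profile, th):
    """Length of the shortest prefix that already guarantees a match, or None."""
    gains = min_gains(profile)
    score = 0
    for j, sym in enumerate(word):
        score += profile[sym][j]
        if score + gains[j] >= th:
            return j + 1  # whatever follows, the final score is at least th
    return None


print(min_gains(PROFILE))                             # [2, 2, 2, 0, 0, 0]
print(shortest_decisive_prefix("TCCTAA", PROFILE, 20))  # 4
```

In an automaton, a prefix that is decisive at length i can be made a final state, so the subtree of height m-i below it need not be built.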

59-62 Example: M0003, MSS=0.85
- Path built symbol by symbol: [G,18500] [C,37000] [C,55500]
- GCC is sufficient to detect a match (h=3)
- Save 5464 nodes out of 5468

63 Minimum Gain ACE

64 Look-ahead Filtration
- Compute the scores for all words of fixed length k and store them
  - O(|Σ|^k) pre-processing
- Sliding window of size k
- When score ≥ P_th[k], check the remaining symbols with LA (up to m-k)
- O(n + (m-k)r) search; k is the prefix length, r is the average number of full scorings

65 Lookahead Filtration Example
K=3   SCORE
AAA   0.4
...
ATT   0.5
CAA   0.2
...
CGT   0.1
CTT   0.3
GAA   0.3
...
GTA   0.5
...
GTT   0.4
TAA   0.5
...
TTT   0.6
- |Σ|^k entries
- P_th[3] = 0.3
- Text CGTACACTCGGTA: Score(CGT) = 0.1 < P_th[3]
- Shift and concatenate to obtain the next 3-mer

66 Filtered Lookahead Example
- Same K=3 score table and text; P_th[3] = 0.3
- Score(GTA) = 0.5 > P_th[3]: check at most m-k remaining symbols
- Score(GTAC) = 0.7 > P_th[4]
- Score(GTACA) = 1.7 > P_th[5]
- Score(GTACAC) = 2.1 > th: Match!
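The filtration steps can be sketched in Python as below. The k-mer table is a plain dictionary and each window prefix is obtained by slicing rather than by the shift-and-concatenate encoding mentioned on the slide, which is a simplification of this sketch. Scores are in integer tenths and the names are illustrative assumptions.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def partial_thresholds(profile, th):
    """P_th[j] = th minus the best score obtainable after position j."""
    m = len(next(iter(profile.values())))
    col_max = [max(row[j] for row in profile.values()) for j in range(m)]
    pth, remaining = [0] * m, 0
    for j in range(m - 1, -1, -1):
        pth[j] = th - remaining
        remaining += col_max[j]
    return pth


def kmer_scores(profile, k):
    """Score of every word of length k on the first k columns (|Σ|^k entries)."""
    table = {"": 0}
    for j in range(k):
        table = {w + sym: s + profile[sym][j]
                 for w, s in table.items() for sym in profile}
    return table


def lfa_matches(text, profile, th, k):
    """Filter each window by its k-mer score, then finish with look-ahead."""
    pth = partial_thresholds(profile, th)
    m = len(pth)
    table = kmer_scores(profile, k)
    hits = []
    for i in range(len(text) - m + 1):
        score = table[text[i:i + k]]
        if score < pth[k - 1]:
            continue  # filtered out: the prefix alone rules out a match
        for j in range(k, m):
            score += profile[text[i + j]][j]
            if score < pth[j]:
                break
        else:
            hits.append(i + 1)
    return hits


print(kmer_scores(PROFILE, 3)["GTA"])                 # 5
print(lfa_matches("CGTACACTCGGTA", PROFILE, 20, 3))   # [2, 8]
```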

67 More on ACE and LF
- It is possible to combine both methods
  - Automaton built on qualifying prefixes only
- Multi-matrix version

68 Super-Alphabet
- Code words of length k to super-alphabet symbols
  - |Σ|^k symbols are needed
- Code the matrix M into matrix M' (|Σ|^k x m/k)
- Run the naive algorithm on the sequence
  - O(nm/k)

69-70 SuperAlphabet Example
K=2  SCORE 1-2  SCORE 3-4  SCORE 5-6
AA   0.3        0.3        1.3
AC   1.1        0.3        1.4
AG   0.3        0.4        1.0
AT   0.5        0.4        1.3
CA   0.1        0.7        0.3
CC   0.9        0.7        0.4
CG   0.1        0.8        0.0
CT   0.3        0.8        0.3
GA   0.2        0.6        0.3
GC   1.0        0.6        0.4
GG   0.2        0.7        0.0
GT   0.4        0.7        0.3
TA   0.4        0.2        0.3
TC   1.2        0.2        0.4
TG   0.4        0.3        0.0
TT   0.6        0.3        0.3
- |Σ|^k entries per column
- Text CGTACACTCGGTA, pos 1: CG + TA + CA = 0.1 + 0.2 + 0.3 = 0.6 < Th
- Pos 2: GT + AC + AC = 0.4 + 0.3 + 1.4 = 2.1: match!
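A hedged Python sketch of the super-alphabet encoding: the profile is recoded into m/k tables indexed by k-mers, and the naive scan then adds m/k table entries per window. It assumes that k divides m; scores are in integer tenths and the names are illustrative.

```python
# Example profile of slide 8, in tenths (assumption: integers instead of floats).
PROFILE = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def superalphabet_tables(profile, k):
    """One table per block of k columns, mapping each k-mer to its block score."""
    m = len(next(iter(profile.values())))
    if m % k != 0:
        raise ValueError("this sketch assumes that k divides m")
    tables = []
    for start in range(0, m, k):
        table = {"": 0}
        for j in range(start, start + k):
            table = {w + sym: s + profile[sym][j]
                     for w, s in table.items() for sym in profile}
        tables.append(table)
    return tables


def superalphabet_matches(text, profile, th, k):
    """Naive scan over super-alphabet symbols: m/k lookups per window."""
    tables = superalphabet_tables(profile, k)
    m = k * len(tables)
    hits = []
    for i in range(len(text) - m + 1):
        score = sum(t[text[i + b * k:i + (b + 1) * k]]
                    for b, t in enumerate(tables))
        if score >= th:
            hits.append(i + 1)
    return hits


tables = superalphabet_tables(PROFILE, 2)
print(tables[0]["AT"], tables[1]["AT"], tables[2]["AT"])    # 5 4 13
print(superalphabet_matches("CGTACACTCGGTA", PROFILE, 20, 2))  # [2, 8]
```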

71 Experiments
- JASPAR database: 123 TFBS matrices (DNA); PRINTS database (proteins)
- Test sequence of about 50M bases
- P-value defines threshold
- 3 GHz Intel Pentium IV processor with 2 gigabytes of main memory, running under Linux

72 DNA – avg running times per matrix

73 DNA – matrix length

74 DNA – window width

75 Proteins – avg time per matrix

76 Proteins – matrix length

77 MOODS – Motif Occurrence Detection Suite

78 Conclusions
- Matrix search is a core step for many bioinformatics applications (searching, discovery, classification…)
- Several approaches have been developed in recent years
- Online methods based on filtering are currently the most efficient

79 References
- C. Pizzi, P. Rastas, E. Ukkonen. Fast Search Algorithms for Position Specific Scoring Matrices. In Proc. of the 1st Conference on Bioinformatics Research and Development (BIRD 07), Berlin, Germany, March 2007, LNCS/LNBI 4414, pp. 239-250
- C. Pizzi, E. Ukkonen. Fast Profile Matching Algorithms - a survey. Theoretical Computer Science, 395(2-3), 2008, pp. 137-157, Special Issue SAIL: String Algorithms, Information and Learning
- C. Pizzi, P. Rastas, E. Ukkonen. Finding significant matches of position weight matrices in linear time. Accepted for publication by IEEE Transactions on Computational Biology and Bioinformatics, 2009
- J. Korhonen, P. Martinmaki, C. Pizzi, P. Rastas, E. Ukkonen. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics, 2009, 25(23):3181-3182

80 Thanks

81 Acknowledgements
- Esko Ukkonen, Pasi Rastas, Janne Korhonen, P. Martinmaki
- Academy of Finland grant “From Data to knowledge”
- EU Project “Regulatory Networks”
- Premio di Ricerca `Avere Trent’Anni’: Univ. Padova, Parco Scientifico Galileo, Il Mattino, Giovani Confindustria, Scuola Galileiana di Studi Superiori

82 Length 100
- NA = Naïve Algorithm
- LSA = Look-ahead Search Algorithm
- LFA = Look-ahead Filter Algorithm (k=7)
- NS = Naïve Superalphabet (k=7)
- 13 patterns obtained by concatenating Jaspar matrices
- MSS: Matrix Similarity Score (% of maximal score)

83 Multiple Matrices Search

84 Running Time per matrix

85 Length 0 to 15 (108 matrices)
- NA = Naïve Algorithm
- LSA = Look-ahead Search Algorithm
- ACE = Aho-Corasick Expansion
- LFA = Look-ahead Filtration Algorithm (k=7)
- NS = Naïve Super-alphabet (k=7)

86 Running Time per matrix

87 Length 16 to 30 (15 matrices)
- NA = Naïve Algorithm
- LSA = Look-ahead Search Algorithm
- LFA = Look-ahead Filtration Algorithm
- NS = Naïve Super-alphabet

88 Length 100
- NA = Naïve Algorithm
- LSA = Look-ahead Search Algorithm
- LFA = Look-ahead Filter Algorithm (k=7)
- NS = Naïve Superalphabet (k=7)
- 13 patterns obtained by concatenating Jaspar matrices
      P=10^-5  P=10^-4  P=10^-3  P=10^-2
NA    10.234   10.244   10.434   11.080
LSA   11.835   12.675   13.335   15.118
LFA   9.955    10.347   11.096   12.965
NS    3.576    3.677    4.593    9.918

89 Motif Representation
- Instances of a biological signal are different
- Example instances: ACATAC CCGAAT ATGCAT GCCTAC TCCAAA TTCGAA ACGGAC TCCTAT GCCCAC TCGGAA
- Consensus -> pattern representation: TCC(G|T)AC
- Profile -> matrix representation (rows A, G, C, T; columns 1 to 6)

