Issues in the Discovery and Use of Motif Patterns Alberto Apostolico University of Padova and Purdue University.


1

2 Issues in the Discovery and Use of Motif Patterns Alberto Apostolico University of Padova and Purdue University

3 A. Apostolico - AofA04 General Form of Pattern Discovery
Find and exploit a priori unknown patterns, or associations thereof, in a database:
- with some prior domain-specific knowledge
- without any domain-specific prior knowledge
Tenet: a pattern or association (rule) that occurs more frequently than one would expect is potentially informative and thus interesting.
frequent = interesting

4 A. Apostolico - AofA04 Motifs
A motif is a recurring pattern with some solid and some "don't care" characters or "gaps".
Typical problem. Input: a textstring. Output: its repeated motifs.
[Figure: example textstring; labels: "don't care" characters, solid character]
Motif discovery is beset by the circumstance that there are typically exponentially many candidate motifs in a sequence.

5 A. Apostolico - AofA04 Motifs
A motif is a recurring pattern with some solid and some "don't care" characters or "gaps", together with its list of occurrences.
[Figure: Self-correlation Motifs; labels: "don't care" characters, solid character]
Motif discovery is beset by the circumstance that there are typically exponentially many candidate motifs in a sequence.

6 A. Apostolico - AofA04 Controlling Motif Growth: Redundant Motifs (Parida)
A motif is
- maximal in composition if specifying more solid characters implies an alteration to its occurrence list
- maximal in length if making the motif longer implies an alteration to the cardinality or displacement of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant.

7 A. Apostolico - AofA04 Maximal, Redundant, Irredundant Motifs (examples)
Let s = abcdabcd
m_1 = ab with L_1 = { 1, 5 }
m_2 = bc with L_2 = { 2, 6 }
m_3 = cd with L_3 = { 3, 7 }
m_4 = abc with L_4 = { 1, 5 }
m_5 = bcd with L_5 = { 2, 6 }
m_6 = abcd with L_6 = { 1, 5 }
Notice that L_1 = L_4 = L_6 and L_2 = L_5. Denoting by L + i the list of j + i such that j is in L, L_5 = L_6 + 1 and L_3 = L_6 + 2.
Motif m_6 is maximal, as |m_6| > |m_1|, |m_4| and |m_5| > |m_2|. Motifs m_1, m_2, m_3, m_4 and m_5 are non-maximal.
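The occurrence lists in this example can be checked mechanically. The following is a minimal Python sketch, assuming 1-based positions as on the slide; the helper names `occurrences` and `shift` are illustrative and do not come from the talk.

```python
def occurrences(s, motif):
    """Return the 1-based start positions of a solid (gap-free) motif in s."""
    return [i + 1 for i in range(len(s) - len(motif) + 1)
            if s[i:i + len(motif)] == motif]


def shift(positions, i):
    """Return L + i: every position of the list L moved right by i."""
    return [j + i for j in positions]


s = "abcdabcd"
lists = {m: occurrences(s, m) for m in ["ab", "bc", "cd", "abc", "bcd", "abcd"]}
for motif, positions in lists.items():
    print(motif, positions)

# The lists of bcd and cd are shifted copies of the list of abcd,
# which is why the shorter motifs carry no information beyond abcd.
print(lists["bcd"] == shift(lists["abcd"], 1))
print(lists["cd"] == shift(lists["abcd"], 2))
```

Every shorter motif here has a list that equals the list of abcd up to a shift, which is the sense in which only m_6 is maximal.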

8 A. Apostolico - AofA04 Maximal, Redundant, Irredundant Motifs (examples, cont.)
Let s = aaXbaYdZZZaaVbaWcXXXXaaYbdXc
m_1 = aa.b with L_1 = { 1, 11, 22 }
m_2 = aa.ba with L_2 = { 1, 11 }
m_3 = aa.b..c with L_3 = { 11, 22 }
m_1 = aa.b is redundant, since 1) m_1 is a sub-motif of m_2 and of m_3, and 2) L_1 is the union of L_2 and L_3.
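The redundancy of m_1 can be verified by brute force. This is a sketch under two assumptions: positions are 1-based, and "." stands for a single don't care character. The third motif is spelled aa.b..c, with two don't cares before the c, because that is the spelling under which the stated list { 11, 22 } holds for this s. The function names are illustrative.

```python
def matches_at(s, motif, start):
    """True if motif occurs in s at 0-based index start; '.' matches any character."""
    if start + len(motif) > len(s):
        return False
    return all(m == "." or m == s[start + k] for k, m in enumerate(motif))


def gapped_occurrences(s, motif):
    """Return the 1-based start positions of a motif with don't cares."""
    return [i + 1 for i in range(len(s)) if matches_at(s, motif, i)]


s = "aaXbaYdZZZaaVbaWcXXXXaaYbdXc"
L1 = gapped_occurrences(s, "aa.b")
L2 = gapped_occurrences(s, "aa.ba")
L3 = gapped_occurrences(s, "aa.b..c")
print(L1, L2, L3)

# m_1 is redundant: its list is exactly the union of the lists of
# the two longer motifs that contain it.
print(sorted(set(L2) | set(L3)) == L1)
```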

9 A. Apostolico - AofA04 Controlling Motif Growth: How Many Irredundant Motifs
Recall that a motif is
- maximal in composition if specifying more solid characters implies an alteration to its occurrence list
- maximal in length if making it longer implies an alteration to the cardinality of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant.
A motif that occurs at least k times in the textstring is a k-motif.
Theorem: in any textstring x, the number of irredundant 2-motifs is O(|x|).
(Problem: how to find irredundant motifs as fast as possible.)

10 A. Apostolico - AofA04 Suffix Consensus, Suffix Meet
[Figure: textstring s with its suffixes suf1 and suf4 aligned]
The consensus of suf1 and suf4 is not a motif.
The meet of suf1 and suf4 is a maximal motif.

11 A. Apostolico - AofA04 Suffix Consensus, Suffix Meet
[Figure: textstring s with its suffixes suf1 and suf4 aligned]
The consensus of suf1 and suf4 is not a motif.
The meet of suf1 and suf4 is a maximal motif.
Theorem: every irredundant 2-motif of x is the meet of two suffixes of x.
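The slide does not spell out the two operations, so the sketch below rests on an assumed definition: the consensus of two suffixes keeps a character wherever the aligned suffixes agree and writes a don't care elsewhere, and the meet is the consensus with its leading and trailing don't cares removed. The names `suffix`, `consensus` and `meet` are illustrative, and the code assumes "." does not occur in the text itself.

```python
def suffix(s, i):
    """Return the suffix of s starting at 1-based position i."""
    return s[i - 1:]


def consensus(u, w, dont_care="."):
    """Align u and w at their first characters; keep agreeing symbols only."""
    n = min(len(u), len(w))
    return "".join(u[k] if u[k] == w[k] else dont_care for k in range(n))


def meet(u, w, dont_care="."):
    """Consensus with leading and trailing don't cares stripped (assumed definition)."""
    return consensus(u, w, dont_care).strip(dont_care)


s = "abXabY"
print(consensus(suffix(s, 1), suffix(s, 4)))  # trailing don't care is kept
print(meet(suffix(s, 1), suffix(s, 4)))       # trailing don't care is dropped
```

Under this reading there are only O(|x|^2) pairs of suffixes to consider, so the theorem turns an exponential candidate space into a polynomial one.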

13 A. Apostolico - AofA04 Data Compression by Textual Substitution
1. Detect repeated patterns
2. Set up a dictionary
3. Use pointers to the dictionary to encode replicas
Most schemes are NP-complete (Storer, 78), with few exceptions (LZ is linear).

14 A. Apostolico - AofA04 LZW
Paradigm: build a dictionary trie as you scan the input.
Routine:
- Find the next phrase as the longest matching entry in the trie
- Add to the trie the unit-symbol extension of this phrase
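The routine above can be sketched as a small encoder. This is a textbook-style illustration rather than the code behind the talk: the trie is a nested dictionary, the function name `lzw_encode` is hypothetical, and the alphabet is assumed to be known in advance and to cover every input symbol.

```python
def lzw_encode(text, alphabet):
    """Parse text into LZW phrases and return the list of phrase codes."""
    # Each trie node is {"code": int or None, "children": {symbol: node}}.
    root = {"code": None, "children": {}}
    for code, sym in enumerate(alphabet):
        root["children"][sym] = {"code": code, "children": {}}
    next_code = len(alphabet)

    codes = []
    i = 0
    while i < len(text):
        node = root
        # Walk down the trie along the longest matching entry.
        while i < len(text) and text[i] in node["children"]:
            node = node["children"][text[i]]
            i += 1
        if node is root:
            raise ValueError(f"symbol {text[i]!r} is not in the alphabet")
        codes.append(node["code"])
        # Add the unit-symbol extension of the phrase just emitted.
        if i < len(text):
            node["children"][text[i]] = {"code": next_code, "children": {}}
            next_code += 1
    return codes


print(lzw_encode("abababab", "ab"))
```

On this input the parse is a, b, ab, aba, b, so phrases grow as the repetition is learned.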

15 A. Apostolico - AofA04 LZW

16 A. Apostolico - AofA04 LZW
Paradigm: build a dictionary trie as you scan the input.
Routine:
- Find the next phrase as the longest matching entry in the trie
- Add to the trie the unit-symbol extension of this phrase
Magics:
- It works, with no need to send the trie
- Coding and decoding are symmetric
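The reason the trie never has to be sent is that the decoder can rebuild the same dictionary from the codes alone. A minimal sketch of the standard LZW decoder follows; the name `lzw_decode` is hypothetical, the alphabet is assumed to be shared by both sides, and a flat dictionary stands in for the trie.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the text from LZW codes, growing the same dictionary as the encoder."""
    phrases = {code: sym for code, sym in enumerate(alphabet)}
    if not codes:
        return ""
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in phrases:
            entry = phrases[code]
        elif code == len(phrases):
            # The encoder used a phrase it had only just created:
            # it must be the previous phrase plus its own first symbol.
            entry = prev + prev[0]
        else:
            raise ValueError(f"invalid code {code}")
        out.append(entry)
        # Mirror the encoder: previous phrase extended by one symbol.
        phrases[len(phrases)] = prev + entry[0]
        prev = entry
    return "".join(out)


print(lzw_decode([0, 1, 2, 4, 1], "ab"))
```

The codes used here are those of the string abababab over the alphabet {a, b}; the fourth code refers to an entry the decoder has not yet stored, which exercises the special case in the middle branch.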

17 A. Apostolico - AofA04 Fast and Lossy is Hard
"All universal lossy coding schemes found to date lack the relative simplicity that imbues Lempel-Ziv codes and arithmetic codes with economic viability. Perhaps as a consequence of the fact that approximate matches abound whereas exact matches are unique, it is inherently much faster to look for an exact match than it is to search from a plethora of approximate matches looking for the best, or even nearly the best, among them. The right way to trade off search effort in a poorly understood environment against the degree to which the product of the search possesses desired criteria has long been a human enigma. This suggests it is unlikely that the 'holy grail' of implementable universal lossy source coding will be discovered soon."
T. Berger and J.D. Gibson, "Lossy Source Coding," IEEE Trans. on Inform. Theory, vol. 44, No. 6, pp. 2693--2723, 1998.

18 A. Apostolico - AofA04 Why Fast and Lossy is Hard
Routine: find the longest prefix of the incoming string that matches a past occurrence within some distortion.
Problems:
- Defining the gaps
- Encoding where the gaps are
- Finding the longest match
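To make the cost of the third problem concrete, here is a brute-force sketch of the routine. It assumes that "distortion" means a cap on the number of mismatched characters (Hamming distance), which the slide does not specify; the function name and parameters are hypothetical.

```python
def longest_match_within_distortion(past, incoming, max_mismatches):
    """Return (length, start) of the longest prefix of incoming that matches
    a substring of past with at most max_mismatches mismatched characters.
    start is a 0-based index into past, or -1 if nothing matches."""
    best_len, best_start = 0, -1
    for start in range(len(past)):
        mismatches = 0
        length = 0
        limit = min(len(incoming), len(past) - start)
        for k in range(limit):
            if past[start + k] != incoming[k]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
            length = k + 1
        if length > best_len:
            best_len, best_start = length, start
    return best_len, best_start


print(longest_match_within_distortion("abcabd", "abdx", 0))
print(longest_match_within_distortion("abcabd", "abdx", 1))
```

Every start position in the past text is a candidate once mismatches are allowed, so this search takes time proportional to the product of the two lengths, in contrast with the single trie descent that suffices for an exact match.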

20 A. Apostolico - AofA04 Why Fast and Lossy is Hard: LZW Recap
Paradigm: build a dictionary trie as you scan the input.
Routine:
- Find the next phrase as the longest matching entry in the trie
- Add to the trie the unit-symbol extension of this phrase

21 A. Apostolico - AofA04 LZW

22 A. Apostolico - AofA04 Towards an online lzw using motifs

23 A. Apostolico - AofA04 Towards an online lzw using motifs

24 A. Apostolico - AofA04 lzw

25 A. Apostolico - AofA04 lzw-a

26 A. Apostolico - AofA04 Original LZW parse of

27 A. Apostolico - AofA04 A motif-driven LZW parse of [figure]. Lossless with resolvers, lossy without.

28 A. Apostolico - AofA04 motif lzw, results

29 A. Apostolico - AofA04 Motif Disambiguation By Guessing

DESCRIPTION OF FARMER OAK -- AN INCIDENT When Farmer Oak smile., the corners.f his mouth spread till the. were within an unimportant distance.f his ears, his eye. were reduced to chinks, and...erging wrinkle--red round them, extending upon... countenance li.e the rays in a rudimentary sketch of the rising sun. His Christian name was Gabriel, and on working days he was a young man of sound judgment, easy motions, proper dress, and...eral good character. On Sundays, he was a man of misty views rather given to postponing, and.ampered by his best clotes and umbrella : upon... whole, one who felt himself to occupy morally that... middle space of Laodicean neutrality which... between the Communion people of the parish and the drunken section, -- that... he went to church, but yawned privately by the t.ime the cong.egation reached the Nicene creed,- and thought of what there would be for dinner when he meant to be listening to the sermon.

DESCRIPTION OF FARMER OAK -- AN INCIDENT When Farmer Oak smiled, the corners of his mouth spread till they were within an unimportant distance of his ears, his eyes were reduced to chinks, and diverging wrinkles appeared round them, extending upon his countenance like the rays in a rudimentary sketch of the rising sun. His Christian name was Gabriel, and on working days he was a young man of sound judgment, easy motions, proper dress, and general good character. On Sundays he was a man of misty views, rather given to postponing, and hampered by his best clothes and umbrella : upon the whole, one who felt himself to occupy morally that vast middle space of Laodicean neutrality which lay between the Communion people of the parish and the drunken section, -- that is, he went to church, but yawned privately by the time the congregation reached the Nicene creed,- and thought of what there would be for dinner when he meant to be listening to the sermon.

30 A. Apostolico - AofA04 Motif Resolution By Completion (bilateral contexts better predictors)

31 A. Apostolico - AofA04 Motif Resolution By interpolation at receiver (images and sounds)

32 A. Apostolico - AofA04

33

34

35 Expected match length within distortion

36 A. Apostolico - AofA04 Expected match length within distortion - continued

37 A. Apostolico - AofA04 Expected match length within distortion - continued

38 A. Apostolico - AofA04 Giving up on "longest match"
1 - Expected length with exact matches
2 - Expected length with distortion d =

39 A. Apostolico - AofA04 Giving up on "longest match", continued

40 A. Apostolico - AofA04 …but LZ works because phrases are DISTINCT
A most crowded parse achieves the maximum number of phrases in a parse: #phrases = O(n / log n)
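The bound on the number of distinct phrases can be explored numerically. The sketch below assumes that a "most crowded parse" is one that uses every distinct string of length 1, then every string of length 2, and so on, which packs as many distinct phrases as possible into n symbols; the function name is hypothetical and the slide gives no construction.

```python
from itertools import product
from math import log2


def most_crowded_parse(alphabet, max_len):
    """All distinct phrases of length 1..max_len, shortest first."""
    return ["".join(p)
            for length in range(1, max_len + 1)
            for p in product(alphabet, repeat=length)]


phrases = most_crowded_parse("ab", 3)
c = len(phrases)                    # number of phrases in the parse
n = sum(len(p) for p in phrases)    # length of the parsed text
print(c, n, n / log2(n))
```

With distinct phrases, adding more of them forces them to get longer, so the phrase count can only grow on the order of n / log n. The comparison printed here suggests the bound is best read asymptotically, up to a constant factor, rather than as a strict inequality for small n.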

41 A. Apostolico - AofA04 LZWA is not necessarily better than LZW: x = aaaaaaaaaaaa………a. But compare vocabularies under alphabet compression.

42 A. Apostolico - AofA04 LZW versus Lossy LZWA (Comparing Vocabulary Build-ups)

43 A. Apostolico - AofA04 LZW versus Lossy LZWA (Comparing Vocabulary Build-ups), continued

44 A. Apostolico - AofA04 Conclusions
- Self-correlation motifs give versatile compression schemata for a variety of inputs
- The "Plier la machine" approach bridges lossless and lossy
- Linear-time lossy variant with reasonable performance
- Deeper analysis, broad experimentation with fine-tuned variants, and several extensions are needed; some are under way

45 A. Apostolico - AofA04 Conclusions: "Analyze That", please. Thank you.

46 A. Apostolico - AofA04 Main References
- A. Apostolico, "Pattern Discovery and the Algorithmics of Surprise", Proceedings of the NATO ASI on Artificial Intelligence and Heuristic Methods for Bioinformatics (P. Frasconi and R. Shamir, eds.), IOS Press, 111--127 (2003).
- A. Apostolico and L. Parida, "Incremental Paradigms of Motif Discovery", Journal of Computational Biology 11:1, 15--25 (2004).
- A. Apostolico, M. Comin and L. Parida, "Motifs in Ziv-Lempel-Welch Clef", Proceedings of IEEE DCC Data Compression Conference, pp. 72--81 (2004).
- A. Apostolico, "Fast Gapped Variants for Lempel-Ziv-Welch Compression", in preparation.

