Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery Alberto Apostolico University of Padova and Georgia Inst. Of Tech.

2 Monotony and Surprise Algorithmic and Combinatorial Foundations of Pattern Discovery Alberto Apostolico University of Padova and Georgia Inst. Of Tech.

3 Alberto Apostolico - Erice05
http://www.cc.gatech.edu/~axa/papers
A) Specialized Material
- A. Apostolico and G. Bejerano. "Optimal Amnesic Probabilistic Automata, or How to Learn and Classify Proteins in Linear Time and Space", RECOMB 2000 and Journal of Computational Biology, 7(3/4):381-393, 2000.
- A. Apostolico, M.E. Bock, S. Lonardi and X. Xu. "Efficient Detection of Unusual Words", Proceedings of RECOMB 2002 and Journal of Computational Biology, 7(1/2):71-94, 2000.
- A. Apostolico, F. Gong and S. Lonardi. "Verbumculus and the Detection of Unusual Words", Journal of Computer Science and Technology, 19:1 (Special Issue on Bioinformatics), 22-41 (2004).
- A. Apostolico, L. Parida. "Incremental Paradigms of Motif Discovery", Journal of Computational Biology 11:1, 15-25 (2004).
- A. Apostolico, M.E. Bock and S. Lonardi. "Monotony of Surprise and Large Scale Quest for Unusual Words", Journal of Computational Biology, 10, 3-4, 283-311 (2003).
- A. Apostolico, C. Pizzi. "Monotone Scoring of Patterns with Mismatches", Proceedings of the 4th Workshop on Algorithms in Bioinformatics, Bergen, Norway, Springer Verlag LNCS 3240, 87-98 (2004).

4 http://www.cc.gatech.edu/~axa/papers
B) Introductory Material
- A. Apostolico and M. Crochemore. "String Pattern Matching for a Deluge Survival Kit", Handbook of Massive Data Sets, J. Abello et al., eds., Kluwer Acad. Publishers, to appear.
- A. Apostolico. "General Pattern Matching", Handbook of Algorithms and Theory of Computation, M.J. Atallah, ed., CRC Press, Ch. 13, pp. 1-22 (1999).
- A. Apostolico. "Of Maps Bigger than the Empire", Keynote, SPIRE2001, IEEE Press (2001).
- A. Apostolico. "Pattern Discovery and the Algorithmics of Surprise", Artificial Intelligence and Heuristic Methods for Bioinformatics (P. Frasconi and R. Shamir, eds.), IOS Press, 111-127 (2003).
- A. Apostolico. "Pattern Discovery in the Crib of Procrustes", Imagination and Rigor, Essays on Eduardo R. Caianiello's Scientific Heritage Ten Years after his Death (S. Termini, ed.), Springer-Verlag, to appear 2005.

5 Acknowledgements
- Gill Bejerano, Dept. of Computer Science, The Hebrew University
- Mary Ellen Bock, Dept. of Statistics, Purdue University
- Matteo Comin, Univ. of Padova
- Jianhua Dong, Dept. of Industrial Technology, Purdue University
- S. Lonardi, Dept. of Comp. Science and Eng., UC Riverside
- Fu Lu, Celera
- FangCheng Gong, Celera
- Laxmi Parida, IBM
- Cinzia Pizzi, Univ. of Padova
- Xuyan Xu, CapitalOne

6 A hemoglobin molecule consists of four polypeptide chains: two α globin chains (shown in green and blue) and two β globin chains (shown in yellow and orange). Each globin chain contains a heme (shown in red). Hemoglobin is the protein that carries oxygen from the lungs to the tissues and carries carbon dioxide from the tissues back to the lungs. In order to function most efficiently, hemoglobin needs to bind oxygen tightly in the oxygen-rich atmosphere of the lungs and be able to release oxygen rapidly in the relatively oxygen-poor environment of the tissues.
Form = Function

8 Bioinformatics: the Road Ahead
"... more than any other single factor, the sheer volume of data poses the most serious challenge -- many problems that are ordinarily quite manageable become seemingly insurmountable when scaled up to these extents. For these reasons, it is evident that imaginative new applications of technologies designed for dealing with problems of scale will be required. For example, it may be imagined that data mining techniques will have to supplant manual search, intelligent data base integration will be needed in place of hyperlink browsing, scientific visualization will replace conventional interface to the data, and knowledge-based systems will have to supervise high-throughput annotation of the [sequence] data"
[D.B. Searls, Grand Challenges in Computational Biology, in Salzberg, Searls, Kasif, eds., Elsevier 1998]

9 At a joint EU-US panel meeting on large scientific data bases held in Annapolis in 1999, I was invited, along with the physicists and earth observers, to represent the needs of computational biology. In honesty to my duty, I said time and again that the kind of data available to biology was a tiny fraction of what is produced in earth observation and high energy physics. Just as the others were disposing of me, saying that we did not need money, I said: don't worry, we will make up for it with the data we will generate.
Biology is a natural science: it dissects and multiplies.
Formal sciences synthesize and cluster.
There is no telling where these two will go together.

10 Which Information Anyway
- Greek "[...]" is form, appearance, or, in Latin, species
- information is the modern, quantified version of what the Greeks called "[...]"
- it is a measure of the amount of structure
- the three dimensions of information:
  - syntactic (formal medium without meaning)
  - semantic (dualism of subject and object invented by modern philosophy)
  - pragmatic (attempt to describe the understanding of meaning as a natural process)

11 Linnaeus' Taxonomy (partial)
King Phillip Came Over For Green Soup (Kingdom, Phylum, Class, Order, Family, Genus, Species)
Biologists group organisms by morphology to represent similarities and propose relationships.

12 The "Chinese" Taxonomy
Attributed by a Dr. Franz Kuhn to the Chinese Encyclopedia entitled Celestial Emporium of Benevolent Knowledge.
"Animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from afar"
J.L. Borges, "The Analytical Language of John Wilkins," from Otras Inquisiciones (Other Inquisitions 1937-1952, London: Souvenir Press, 1973)

13 Summary
- Form and Information
- To Classify and Generate
- Of Free Lunches, Ugly Ducklings, and Little Green Men
- Privileging Syntactic Information
- Avoidable and Unavoidable Regularities
- Periods, Palindromes, Squares, etc.
- Theories Bigger than Life
- Motifs, Profiles and Weight Matrices
- The Emperor's New Map

14 Defining "Class"
From Watanabe's "Pattern recognition as information compression", in Frontiers in Pattern Recognition
Class can be defined by
1 intension (= list of properties or predicates) or
2 extension (= list of names of individual members)
Class can also be defined by
3 paradigm (= show a few members and, optionally, a few non-members)
This is what the brain does well (and what pattern recognition does poorly)
Finally, Class can be defined by
4 clustering (= we are not even given paradigms but rather sets of objects and asked to isolate subsets with strong coherence)

15 Class by Intension
From Watanabe's "Pattern recognition as information compression", in Frontiers in Pattern Recognition
Types of class intension:
- Vectorial approach (statistical pattern recognition), which divides into two:
  - In the conventional zone, a class is characterized by a predicate of type: belongs to such and such volume of n-dimensional representation space
  - In the subspace method, a class is characterized by a predicate of type: belongs to such and such subspace in n-dimensional representation space
- Structural or grammatical approach: a class is characterized by a predicate of the type: consists of such and such elementary components which are arranged together in such and such ways
Note: structural and vector descriptions are not uncorrelated; on the contrary. For example, multiple sequence alignment can be considered as a search for the discovery of dimensions along which paradigms of noisy vectors exhibit the same value.

16 Statistical Classification
- A Class is formed by Objects with many Predicates in common
- Theorem of the Ugly Duckling (S. Watanabe): as long as all of the predicates characterizing the objects to be classified are given the same importance or "weight", then a swan will be found to be just as similar to a duck as to another swan.
- Classification as experienced on an empirical basis is only possible to the extent that the various predicates characterizing objects are given non-uniform weights.

17 Statistical Classification
Cannot measure similarity by # of shared features: a member with only the left eye is more similar to one with no eye than to one with only the right eye.
Must measure similarity by # of shared predicates.
But this number is irrespective of the number of objects and the same for all pairs.

18 Statistical Classification
The number of shared predicates is irrespective of the number of objects and the same for all pairs:
Total # of predicates = $\sum_{r=0}^{n} \binom{n}{r} = 2^{n}$
Total # of predicates shared by ANY two patterns = $\sum_{r=2}^{d} \binom{d-2}{r-2} = 2^{d-2}$
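The counting behind the theorem can be checked by brute force. The sketch below is an illustration only and rests on one assumption not spelled out on the slide: a predicate is identified with its extension, that is, with the subset of the d objects for which it holds, and all subsets carry equal weight. The function name `shared_predicate_counts` is hypothetical.

```python
from itertools import combinations


def shared_predicate_counts(d):
    """Count, for every pair among d objects, the predicates true of both.

    Assumption: a predicate is identified with its extension, i.e. the
    subset of objects for which it holds, and every subset is allowed.
    """
    objects = range(d)
    # Enumerate all 2**d subsets of the objects, grouped by rank r.
    predicates = [frozenset(s)
                  for r in range(d + 1)
                  for s in combinations(objects, r)]
    counts = {}
    for i, j in combinations(objects, 2):
        counts[(i, j)] = sum(1 for p in predicates if i in p and j in p)
    return len(predicates), counts


total, counts = shared_predicate_counts(5)
print(total)                 # number of predicates over 5 objects
print(set(counts.values()))  # a single value: every pair shares equally many
```

With uniform weights every pair of objects ends up with the same count, which is why some predicates must be privileged before similarity becomes meaningful.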

19 Inferring Grammars
Grammatical inference problem:
Input: a finite set of symbol strings from some language L and possibly a finite set of strings from the complement of L
Output: a grammar for the language
"Precisely the same problem arises in trying to choose a model or theory to explain a collection of sample data. This is one of the most important information processing problems known and it is surprising that there has been so little work on its formalization." (Biermann-Feldman, 1972)

20 Regular, Anomalous, Entropy, Negentropy
- Shannon: information is entropy
- Brillouin: info is negentropy, entropy is chaos
- Key to the paradox: actual versus potential information
- How can we express gain in information? (difference between two distributions?)
- This measure is global and can be either positive or negative
- A better measure (Alfred Renyi: always positive)

21 Random, Regular, Compressible
- Measuring structure in finite objects presupposes the ability to measure randomness in such objects.
- Defining randomness has been an elusive goal for statisticians since the turn of the last century.
- Kolmogorov's definition of information (note resemblance to molecular evolution): information (alternatively, conditional information) is the length of the recorded sequence of zeroes and ones that constitute a shortest program by which a universal machine produces one string from scratch (alt., from another string).

22 Random, Regular, Compressible
- The programs of length less than k are at most 0, 1, 00, 01, 10, 11, ..., 11...1 (k-1 ones)
- The number of strings with a program of length less than k is at most $1 + 2 + 4 + \dots + 2^{k-1} = 2^{k} - 1 < 2^{k}$
Bad News: there is hardly such a notion as that of a finite random sequence, and yet most very long strings are complex. Any given short sequence seems to exhibit some kind of regularity; however, in the limit, a great many sequences of sufficiently large length are seen to be incompressible and hence to appear as random.
It appears thus that we attribute and measure structure in finite objects only to the extent that we privilege (i.e., assign a high weight to) certain regularities and neglect others (is this the structural classification pendant to the theorem of the ugly duckling?)
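The counting argument can be made concrete with a few lines of arithmetic. This is a minimal sketch, assuming binary programs and counting the empty program as well (which is what makes the sum start at 1); the function names are hypothetical.

```python
def programs_shorter_than(k):
    """Number of binary strings of length 0, 1, ..., k-1, i.e. 2**k - 1."""
    return sum(2 ** length for length in range(k))


def compressible_fraction_bound(n, c):
    """Upper bound on the fraction of n-bit strings that admit a
    description of at most n - c bits (each program yields one string)."""
    return programs_shorter_than(n - c + 1) / 2 ** n


print(programs_shorter_than(4))             # 1 + 2 + 4 + 8
print(compressible_fraction_bound(20, 10))  # below 2**-9
```

Since each program produces at most one string, fewer than one string in 2^(c-1) can be shortened by c or more bits, so most long strings are incompressible.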

23 Summary
- Form and Information
- To Classify and Generate
- Of Free Lunches, Ugly Ducklings, and Little Green Men
- Privileging Syntactic Information
- Avoidable and Unavoidable Regularities
- Periods, Palindromes, Squares, etc.
- Theories Bigger than Life
- Motifs, Profiles and Weight Matrices

24 Privileging Syntactic Regularities in Strings
Syntactic regularities in strings are pervasive notions in Computer Science and its applications. In Molecular Biology, regularities are variously implicated in diverse facets of biological function and structure.
Typical string regularities:
- cadences
- periods
- squares or tandem repeats
- repetitions
- palindromes
- episodes
- motifs
- other exact variants and approximate versions thereof
There are avoidable and unavoidable regularities!

25 Unavoidable Regularities
If N is partitioned into k classes, one of the classes contains arbitrarily long arithmetic progressions (Baudet-Artin-van der Waerden, 1926-27)

26 Avoidable Regularities
Periods, Borders
Periodicities are pervasive notions of string algorithmics, e.g., KMP string searching
abaabaababaabaababaabaabaabaab
A string can have many periods. For abacabacaba:
abac
abacabac
abacabacab
The smallest one is THE period of the string

27 Periods cannot coexist too long
- A string can have many periods: abacabacaba has periods abac, abacabac and abacabacab
- Periodicity Lemma (Lyndon-Schützenberger, 62): if w has two periods of length p and q and |w| is at least p+q, then w has period gcd(p,q)
- Proof: assume wlog p > q and let n = |w|. For any position i, either 1) i-q is not smaller than 1, or 2) i+p is not larger than n (if both failed, n would be smaller than p+q).
  case 1: w[i] = w[i-q] = w[i-q+p]
  case 2: w[i] = w[i+p] = w[i+p-q]
- So p-q is a period; now repeat on q and p-q
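The definitions above are easy to check mechanically. The following is a deliberately naive quadratic-time sketch (not the linear-time KMP computation) that lists every period of a string and tests the lemma on it; `periods` and `lemma_holds` are hypothetical names, and the full length of the string is included as a trivial period.

```python
from math import gcd


def periods(w):
    """All p in 1..len(w) such that w[i] == w[i + p] wherever both exist.

    The full length len(w) is always returned as a trivial period.
    """
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def lemma_holds(w):
    """Check the periodicity lemma on w: whenever p and q are periods
    and p + q <= len(w), gcd(p, q) must be a period as well."""
    ps = set(periods(w))
    return all(gcd(p, q) in ps
               for p in ps for q in ps
               if p + q <= len(w))


print(periods("abacabacaba"))      # the periods abac, abacabac, abacabacab, plus the full length
print(lemma_holds("abacabacaba"))
```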

28 Avoidable Regularities
Palindromes: w = w^R
Once we know how to compute optimally ALL periods of a string, we can also compute all initial palindromes.
- Proof: run the algorithm on w*w^R (abab...*...baba)
(In fact, all palindromes of a string can be computed in serial linear time: Manacher, 76)
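The reduction on this slide can be sketched as follows: a prefix u of w is a palindrome exactly when u is both a prefix and a suffix (a border) of w*w^R, and borders correspond to periods. The sketch below uses the standard KMP failure function to list the borders; the function names are hypothetical, and the separator `*` is assumed not to occur in w.

```python
def border_lengths(s):
    """Lengths of all proper borders of s, via the KMP failure function."""
    fail = [0] * (len(s) + 1)  # fail[m] = longest proper border of s[:m]
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k]
        if s[i] == s[k]:
            k += 1
        fail[i + 1] = k
    lengths = []
    k = fail[len(s)]
    while k > 0:               # follow the chain of ever shorter borders
        lengths.append(k)
        k = fail[k]
    return sorted(lengths)


def prefix_palindromes(w, sep="*"):
    """Lengths of all palindromes that are prefixes of w."""
    if sep in w:
        raise ValueError("separator must not occur in w")
    return border_lengths(w + sep + w[::-1])


print(prefix_palindromes("abacaba"))
print(prefix_palindromes("abab"))
```

The whole computation is linear in |w|, since the failure function is built in linear time.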

29 Squares or Tandem Repeats, or why does genetic code need more than 2 characters
- Square: a string in the form ww with w a primitive string
- Primitive string: a string that cannot be rewritten in the form v^k with k > 1
- Square free string: a string that contains no square
There are about n^2 ways of choosing indices i and j, thus n^2 squares?
Longest squarefree string on two symbols: 010?
Thue (1906): on an alphabet of at least 3 symbols we can write indefinitely long square free strings
Square free morphism:
rew(a) -> abcab
rew(b) -> acabcb
rew(c) -> acbcacb
Istrail's morphism (square free on "a"):
rew(a) -> abc
rew(b) -> ac
rew(c) -> b
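Both rewriting systems can be tried out directly. The sketch below iterates each morphism from the seed "a" and checks the result with a brute-force square detector; it is a finite sanity check of the first few iterates, not a proof, and the names `rewrite` and `has_square` are hypothetical.

```python
THUE = {"a": "abcab", "b": "acabcb", "c": "acbcacb"}
ISTRAIL = {"a": "abc", "b": "ac", "c": "b"}


def rewrite(word, morphism, rounds):
    """Apply the morphism letter by letter, `rounds` times."""
    for _ in range(rounds):
        word = "".join(morphism[ch] for ch in word)
    return word


def has_square(s):
    """Brute force: does s contain a non-empty factor of the form xx?"""
    n = len(s)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                return True
    return False


print(rewrite("a", ISTRAIL, 4))
print(has_square(rewrite("a", ISTRAIL, 8)))
print(has_square(rewrite("a", THUE, 3)))
```

Running `has_square` over all binary strings of length 4 shows that each one contains a square, which is why two symbols are not enough.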

30 Detecting Squares
- How many squares?
- There can be c n log n squares in a string (Crochemore, 81)
- Example: Fibonacci words
  F_0 = a
  F_1 = b
  F_i = F_{i-1} F_{i-2}
  a b ba bab babba babbabab babbababbabba ...
Optimal n log n algorithms since early 80's (Main-Lorentz, AA-Preparata, Rabin, Crochemore)
Parallel (AA, Crochemore-Rytter, AA-Breslauer)
Recent (Kosaraju, Gusfield)
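The recurrence can be run as written to reproduce the listed words and to watch the number of squares grow. The counter below is a brute-force sketch, not one of the optimal n log n algorithms cited on the slide; it counts occurrences of squares ww with primitive root w, and all function names are hypothetical.

```python
def fibonacci_words(count):
    """First `count` Fibonacci words with F_0 = a, F_1 = b, F_i = F_{i-1} F_{i-2}."""
    words = ["a", "b"]
    while len(words) < count:
        words.append(words[-1] + words[-2])
    return words[:count]


def is_primitive(w):
    """w is primitive iff it is not a proper power v^k with k > 1,
    i.e. iff w first reoccurs inside ww only at offset len(w)."""
    return len(w) > 0 and (w + w).find(w, 1) == len(w)


def count_primitive_squares(s):
    """Number of occurrences (start, length) of squares ww in s with w primitive."""
    n = len(s)
    total = 0
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            root = s[i:i + half]
            if root == s[i + half:i + 2 * half] and is_primitive(root):
                total += 1
    return total


for word in fibonacci_words(12):
    print(len(word), count_primitive_squares(word))
```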

31 Tandem Repeats, Repeated Episodes
(Myers '87, Kannan-Myers '92, Landau-Schmidt '93, Benson '98, Ap.-Federico '98, Myers-Sagot '99, Ap.-Atallah '99)
Input: textstring
Output: repeated episode (within constraints)
(worst-case quadratic, or nk with max k errors)
Figure labels: Max 12 pos; Max 30 pos

32 Pattern Discovery in WAKA
alluded to: Kokin-shu #315 (Minamoto-no-Muneyuki)
ya-ma-sa-to-ha fu-yu-so-sa-hi-shi-sa ma-sa-ri-ke-ru hi-to-me-mo-ku-sa-mo ka-re-nu-to-o-mo-he-ha
A hamlet in mountain is the drearier in winter. I feel that there is no one to see and no green around.
allusive variation: shugyoku-shu #3528 (Jien)
ya-to-sa-hi-te hi-to-me-mo-ku-sa-mo ka-re-nu-re-ha so-te-ni-so-no-ko-ru a-ki-no-shi-ra-tsu-yu
My home has been deserted. Now in autumn, there is no one to see and no green around. There is a pearl dew left in my sleeve.

33 Discovering instances of poetic allusion from anthologies of classical Japanese poems
Theoretical Computer Science, Volume 292, Issue 2
Masayuki Takeda, Tomoko Fukuda, Ichiro Nanri, Mayumi Yamasaki, Koichi Tamari
ABSTRACT
Waka is a form of traditional Japanese poetry with a 1300-year history. In this paper, we attempt to semi-automatically discover instances of poetic allusion, or more generally, to find similar poems in anthologies of Waka poems. One reasonable approach would be to arrange all possible pairs of poems in two anthologies in decreasing order of similarity values, and to scrutinize high-ranked pairs by human effort. The means of defining similarity between Waka poems plays a key role in this approach. In this paper, we generalize existing (dis)similarity measures into a uniform framework, called string resemblance systems, and using this framework, we develop new similarity measures suitable for finding similar poems. Using the measures, we report successful results in finding instances of poetic allusion between two anthologies Kokin-Shu and Shin-Kokin-Shu. Most interestingly, we have found an instance of poetic allusion that has never before been pointed out in the long history of Waka research.

34 Cheating by Schoolteachers (the longest substring common to k of n strings)
112a4a342cb214d0000000000
112a4a342cb214d0001acd24a3a12dadbcb4a0000000d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb
1b2a34d4ac42d23b142134141
1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141
dba23dad1abbac1db121db200
dba23dad1abbac1db11acd24a3a12dadbcb4a21db200dbbbd21d3aac11da42dadcc000adcd21c4b4421dd000
121a4a2dcc2cadc11a11da011
121a4a2dcc2cadc11a1acd24a3a12dadbcb4a11da011
1421acbbdba23dad12a000214
1421acbbdba23dad121acd24a3a12dadbcb4aa000214
cacb1dadbc42dd1122cacb1da
cacb1dadbc42dd11221acd24a3a12dadbcb4acacb1dadbbbd21d3aac11da421dadcc000adcd21c4b4421dd00
2baaab3dad2aadca2223421c0
2baaab3dad2aadca221acd24a3a12dadbcb4a23421c01baaab3dcacb1dadbc42ac2cc31012dadbcb4ad40000
From: S.D. Levitt and S.J. Dubner, Freakonomics, Morrow, 2005
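The problem named in the slide title can be stated in a few lines of code. The sketch below is a brute-force illustration with a hypothetical function name and toy inputs; it is not the efficient suffix-tree style solution one would use on real answer sheets.

```python
def longest_common_to_k(strings, k):
    """Longest substring that occurs in at least k of the given strings.

    Brute force: for each start position, extend the candidate while it
    is still shared by k strings. Ties are broken by first occurrence.
    """
    best = ""
    for s in strings:
        for i in range(len(s)):
            # Only candidates longer than the current best are of interest.
            for j in range(i + len(best) + 1, len(s) + 1):
                candidate = s[i:j]
                if sum(candidate in t for t in strings) >= k:
                    best = candidate
                else:
                    break  # any longer extension is shared by even fewer strings
    return best


answers = ["xxHELLOyy", "HELLOzz", "abHELLO", "qqqq"]
print(longest_common_to_k(answers, 3))
```

A long block of identical answers shared by many students in one class is the kind of anomaly this search is meant to surface.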

35 Summary
- Form and Information
- To Classify and Generate
- Of Free Lunches, Ugly Ducklings, and Little Green Men
- Privileging Syntactic Information
- Avoidable and Unavoidable Regularities
- Periods, Palindromes, Squares, etc.
- Theories Bigger than Life
- Motifs, Profiles and Weight Matrices

36 General Form of Pattern Discovery
Find and exploit a priori unknown patterns or associations thereof in a Data Base
- With some prior domain-specific knowledge
- Without any domain-specific prior knowledge
Tenet: a pattern or association (rule) that occurs more frequently than one would expect is potentially informative and thus interesting
frequent = interesting

37 Data Compression by Textual Substitution
1 Detect Repeated Patterns
2 Set up Dictionary
3 Use Pointers to Dictionary to Encode Replicas
- Redundancy (repetitiveness) is sought in order to remove it
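The three steps above can be sketched with one concrete scheme. The slide does not name a specific method; LZ78 is used here purely as a familiar example of textual substitution in which the dictionary is built on the fly and each repeated phrase is replaced by a pointer plus one new character. Function names are hypothetical.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next character) pairs (LZ78 style)."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending a phrase already in the dictionary
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover phrase is already known: emit it as (its prefix, last char).
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text by replaying the dictionary construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


encoded = lz78_encode("abababababa")
print(encoded)
print(lz78_decode(encoded))
```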

38 Consumer Prediction (Data Mining), Intrusion Detection (Security), Protein Classification (Bio-Informatics)
- Infer consistent behavior from protocol of past record
- Use to predict future behavior or detect malicious practices
1) Collect a set of behavioral sequences (normal profile) into a repository or dictionary
2) Define measure(s) of sequence similarity
3) Compare any new sequence to the dictionary, using similarity to past behavior as a basis for classification as normal or anomalous
Anomaly is sought as a carrier of information
Similarity or predictability equals fitness to the model
- Learning from positive & negative samples

39 Of Exactitude in Science
... In that Empire, the craft of Cartography attained such Perfection that the Map of a Single province covered the space of an entire City, and the Map of the Empire itself an entire Province. In the course of Time, these Extensive maps were found somehow wanting, and so the College of Cartographers evolved a Map of the Empire that was of the same Scale as the Empire and that coincided with it point for point. Less attentive to the Study of Cartography, succeeding Generations came to judge a map of such Magnitude cumbersome, and, not without Irreverence, they abandoned it to the Rigours of Sun and Rain. In the western Deserts, tattered Fragments of the Map are still to be found, Sheltering an occasional Beast or beggar; in the whole Nation, no other relic is left of the Discipline of Geography.
From Travels of Praiseworthy Men (1658) by J. A. Suarez Miranda
The piece was written by Jorge Luis Borges and Adolfo Bioy Casares. English translation quoted from J. L. Borges, A Universal History of Infamy, Penguin Books, London, 1975.

40 Detection and Analysis of Gene Regulatory Regions
(Jacques van Helden, http://copan.cifn.unam.mx/Computational_Biology/yeast-tools)
"Starting from the simple knowledge that a set of genes share some regulatory behavior, one can suppose that some elements are shared by their upstream region, and one would like to detect such elements. We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides." J. Mol. Biol. (1998) 281, 827-842.

41 A table of mono-mers only contains 4 lines
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
;seq  observed_freq    occ
a     0.2879006655447  301075
c     0.2120993344553  221805
g     0.2120993344553  221805
t     0.2879006655447  301075

42 A table of 2-mers contains 16 lines
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
;seq  observed_freq    occ
aa    0.0996514874362  103508
ac    0.0516799845961  53680
ag    0.0522951766631  54319
at    0.0840396649658  87292
ca    0.0630865504958  65528
cc    0.0474795417349  49317
cg    0.0490959853663  50996
ct    0.0522951766631  54319
ga    0.0559112351978  58075
gc    0.0573659381920  59586
gg    0.0474795417349  49317
gt    0.0516799845961  53680
ta    0.0692904592279  71972
tc    0.0559112351978  58075
tg    0.0630865504958  65528
tt    0.0996514874362  103508
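Tables of this shape can be produced with a sliding window over the sequence. The sketch below is an assumption-laden illustration, not the RSA-tools implementation: the function names are hypothetical, the input is a lowercase DNA string, and the `both_strands` option is included only because the tables above look strand-symmetric (for instance, aa and tt have identical counts).

```python
from collections import Counter

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_table(seq, k, both_strands=False):
    """Map each observed k-mer to (observed frequency, occurrence count).

    Overlapping occurrences are counted. With both_strands=True the
    reverse-complement strand is scanned as well.
    """
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    counts = Counter(s[i:i + k]
                     for s in strands
                     for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: (counts[kmer] / total, counts[kmer])
            for kmer in sorted(counts)}


for kmer, (freq, occ) in kmer_table("aacgtt", 2, both_strands=True).items():
    print(kmer, freq, occ)
```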

43 With increasing k, a table of k-mers grows rapidly out of proportions
How many k-mers in total, for all k?
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
RSA-tools - menu.htm
;seq  observed_freq    occ
aaa   0.0374140033303  38601
aac   0.0180639045638  18637
aag   0.0163987337723  16919
aat   0.0276555984825  28533
aca   0.0162746698251  16791
acc   0.0114206678905  11783
acg   0.0118180602214  12193
act   0.0121282200894  12513
aga   0.0141936909606  14644
agc   0.0127214008370  13125
agg   0.0133359050756  13759
agt   0.0121282200894  12513
ata   0.0221735228152  22877
atc   0.0172361654160  17783
atg   0.0170946549762  17637
att   0.0276555984825  28533
caa   0.0181356290333  18711
cac   0.0117036887701  12075
cag   0.0161176513919  16629
cat   0.0170946549762  17637
cca   0.0113256814309  11685
ccc   0.0103942325773  10724
ccg   0.0122910540202  12681
cct   0.0133359050756  13759
cga   0.0102721071292  10598
cgc   0.0147384092288  15206
cgg   0.0122910540202  12681
cgt   0.0118180602214  12193
cta   0.0088434332371  9124
ctc   0.0108817651198  11227
ctg   0.0161176513919  16629
ctt   0.0163987337723  16919
gaa   0.0180416118233  18614
gac   0.0096450026461  9951
gag   0.0108817651198  11227
gat   0.0172361654160  17783
gca   0.0166342614221  17162
gcc   0.0133436590723  13767
gcg   0.0147384092288  15206
gct   0.0127214008370  13125
gga   0.0123763479839  12769
ggc   0.0133436590723  13767
ggg   0.0103942325773  10724
ggt   0.0114206678905  11783
gta   0.0123288547541  12720
gtc   0.0096450026461  9951
gtg   0.0117036887701  12075
gtt   0.0180639045638  18637
taa   0.0259671657010  26791
tac   0.0123288547541  12720
tag   0.0088434332371  9124
tat   0.0221735228152  22877
tca   0.0190302464026  19634
tcc   0.0123763479839  12769
tcg   0.0102721071292  10598
tct   0.0141936909606  14644
tga   0.0190302464026  19634
tgc   0.0166342614221  17162
tgg   0.0113256814309  11685
tgt   0.0162746698251  16791
tta   0.0259671657010  26791
ttc   0.0180416118233  18614
ttg   0.0181356290333  18711
ttt   0.0374140033303  38601

44 A table of k-mers grows rapidly out of proportions or out of sight
How many k-mers in total, for all k?
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
RSA-tools - menu.htm
;seq  observed_freq    occ
aaaa  0.0149249217020  15297
aaac  0.0061848126213  6339
aaag  0.0062443288810  6400
aaat  0.0099860478277  10235
aaca  0.0059106475564  6058
aacc  0.0039183163728  4016
aacg  0.0044237167416  4534
aact  0.0037729405911  3867
aaga  0.0044959167943  4608
aagc  0.0039875893963  4087
aagg  0.0042851706946  4392
aagt  0.0036529323954  3744
aata  0.0082503195340  8456
aatc  0.0052569443767  5388
aatg  0.0058403988565  5986
aatt  0.0083478871728  8556
acaa  0.0055476959402  5686
acac  0.0026733533022  2740
acag  0.0038831920229  3980
acat  0.0041339408545  4237
acca  0.0031475320266  3226
accc  0.0024606558497  2522
accg  0.0027182344160  2786
acct  0.0030333778892  3109
acga  0.0026128613661  2678
acgc  0.0039446596353  4043
acgg  0.0025670045759  2631
acgt  0.0026694505966  2736
acta  0.0023845530914  2444
actc  0.0027279911799  2796
actg  0.0033348618930  3418
actt  0.0036529323954  3744
agaa  0.0047310548037  4849
agac  0.0023933341789  2453
agag  0.0031094806475  3187
agat  0.0039202677256  4018
agca  0.0040002731894  4100
agcc  0.0028704399325  2942
agcg  0.0036841540398  3776
agct  0.0021835637556  2238
agga  0.0036792756578  3771
aggc  0.0037583054452  3852
aggg  0.0028811723727  2953
aggt  0.0030333778892  3109
agta  0.0030324022128  3108
agtc  0.0022216151347  2277
agtg  0.0031114320002  3189
agtt  0.0037729405911  3867
ataa  0.0092581932425  9489
atac  0.0033397402749  3423
atag  0.0030704535920  3147
atat  0.0065097128584  6672
atca  0.0061487125950  6302
atcc  0.0040002731894  4100
atcg  0.0031348482335  3213
atct  0.0039202677256  4018
atga  0.0052588957295  5390
atgc  0.0045973871386  4712
atgg  0.0031836320529  3263
atgt  0.0041339408545  4237
atta  0.0072736674700  7455
attc  0.0049993658103  5124
attg  0.0053652444557  5499
attt  0.0099860478277  10235
caaa  0.0064882479779  6650
caac  0.0038051379119  3900
caag  0.0024791937010  2541
caat  0.0053652444557  5499
caca  0.0034704809109  3557
cacc  0.0029543481018  3028
cacg  0.0021738069917  2228
cact  0.0031114320002  3189
caga  0.0043485896598  4457
cagc  0.0036441513079  3735
cagg  0.0048276467661  4948
cagt  0.0033348618930  3418
cata  0.0038753866118  3972
catc  0.0046481223108  4764
catg  0.0027670182354  2836
catt  0.0058403988565  5986
ccaa  0.0022635692194  2320
ccac  0.0023425990068  2401
ccag  0.0035407296108  3629
ccat  0.0031836320529  3263
ccca  0.0019981852419  2048
cccc  0.0026196911009  2685
cccg  0.0028801966964  2952
ccct  0.0028811723727  2953
ccga  0.0022460070444  2302
ccgc  0.0035026782317  3590
ccgg  0.0040002731894  4100
ccgt  0.0025670045759  2631
ccta  0.0015962065702  1636
cctc  0.0026206667772  2686
cctg  0.0048276467661  4948
cctt  0.0042851706946  4392
cgaa  0.0031855834057  3265
cgac  0.0022645448957  2321
cgag  0.0016557228299  1697
cgat  0.0031348482335  3213
cgca  0.0042002868489  4305
cgcc  0.0039397812534  4038
cgcg  0.0029309318685  3004
cgct  0.0036841540398  3776
cgga  0.0030977725308  3175
cggc  0.0036060999288  3696
cggg  0.0028801966964  2952
cggt  0.0027182344160  2786
cgta  0.0027201857688  2788
cgtc  0.0025152937274  2578
cgtg  0.0021738069917  2228
cgtt  0.0044237167416  4534
ctaa  0.0030499643878  3126
ctac  0.0022577151610  2314
ctag  0.0004624706077  474
ctat  0.0030704535920  3147
ctca  0.0031904617876  3270
ctcc  0.0029133696935  2986
ctcg  0.0016557228299  1697
ctct  0.0031094806475  3187
ctga  0.0050071712214  5132
ctgc  0.0036968378328  3789
ctgg  0.0035407296108  3629
ctgt  0.0038831920229  3980
ctta  0.0044656708263  4577
cttc  0.0032207077557  3301
cttg  0.0024791937010  2541
cttt  0.0062443288810  6400
gaaa  0.0069331564107  7106
gaac  0.0028587318158  2930
gaag  0.0032207077557  3301
gaat  0.0049993658103  5124
gaca  0.0031611914960  3240
gacc  0.0017367039700  1780
gacg  0.0025152937274  2578
gact  0.0022216151347  2277
gaga  0.0032968105139  3379
gagc  0.0022313718986  2287
gagg  0.0026206667772  2686
gagt  0.0027279911799  2796
gata  0.0051232767116  5251
gatc  0.0022206394583  2276
gatg  0.0046481223108  4764
gatt  0.0052569443767  5388
gcaa  0.0055896500249  5729
gcac  0.0027260398271  2794
gcag  0.0036968378328  3789
gcat  0.0045973871386  4712
gcca  0.0035651215205  3654
gccc  0.0024147990594  2475
gccg  0.0036060999288  3696
gcct  0.0037583054452  3852
gcga  0.0033680348902  3452
gcgc  0.0039378299006  4036
gcgg  0.0035026782317  3590
gcgt  0.0039446596353  4043
gcta  0.0028333642298  2904
gctc  0.0022313718986  2287
gctg  0.0036441513079  3735
gctt  0.0039875893963  4087
ggaa  0.0039446596353  4043
ggac  0.0014869308148  1524
ggag  0.0029133696935  2986
ggat  0.0040002731894  4100
ggca  0.0042198003766  4325
ggcc  0.0023182070971  2376
ggcg  0.0039397812534  4038
ggct  0.0028704399325  2942
ggga  0.0029016615769  2974
gggc  0.0024147990594  2475
gggg  0.0026196911009  2685
gggt  0.0024606558497  2522
ggta  0.0028021425853  2872
ggtc  0.0017367039700  1780
ggtg  0.0029543481018  3028
ggtt  0.0039183163728  4016
gtaa  0.0050003414867  5125
gtac  0.0017210931478  1764
gtag  0.0022577151610  2314
gtat  0.0033397402749  3423
gtca  0.0034831647039  3570
gtcc  0.0014869308148  1524
gtcg  0.0022645448957  2321
gtct  0.0023933341789  2453
gtga  0.0040032002186  4103
gtgc  0.0027260398271  2794
gtgg  0.0023425990068  2401
gtgt  0.0026733533022  2740
gtta  0.0051515713268  5280
gttc  0.0028587318158  2930
gttg  0.0038051379119  3900
gttt  0.0061848126213  6339
taaa  0.0090659849941  9292
taac  0.0051515713268  5280
taag  0.0044656708263  4577
taat  0.0072736674700  7455
taca  0.0037748919438  3869
tacc  0.0028021425853  2872
tacg  0.0027201857688  2788
tact  0.0030324022128  3108
taga  0.0020537987960  2105
tagc  0.0028333642298  2904
tagg  0.0015962065702  1636
tagt  0.0023845530914  2444
tata  0.0049661928132  5090
tatc  0.0051232767116  5251
tatg  0.0038753866118  3972
tatt  0.0082503195340  8456
tcaa  0.0047778872704  4897
tcac  0.0040032002186  4103
tcag  0.0050071712214  5132
tcat  0.0052588957295  5390
tcca  0.0026372532758  2703
tccc  0.0029016615769  2974
tccg  0.0030977725308  3175
tcct  0.0036792756578  3771
tcga  0.0020567258252  2108
tcgc  0.0033680348902  3452
tcgg  0.0022460070444  2302
tcgt  0.0026128613661  2678
tcta  0.0020537987960  2105
tctc  0.0032968105139  3379
tctg  0.0043485896598  4457
tctt  0.0044959167943  4608
tgaa  0.0062082288547  6363
tgac  0.0034831647039  3570
tgag  0.0031904617876  3270
tgat  0.0061487125950  6302
tgca  0.0042441922863  4350
tgcc  0.0042198003766  4325
tgcg  0.0042002868489  4305
tgct  0.0040002731894  4100
tgga  0.0026372532758  2703
tggc  0.0035651215205  3654
tggg  0.0019981852419  2048
tggt  0.0031475320266  3226
tgta  0.0037748919438  3869
tgtc  0.0031611914960  3240
tgtg  0.0034704809109  3557
tgtt  0.0059106475564  6058
ttaa  0.0086815684974  8898
ttac  0.0050003414867  5125
ttag  0.0030499643878  3126
ttat  0.0092581932425  9489
ttca  0.0062082288547  6363
ttcc  0.0039446596353  4043
ttcg  0.0031855834057  3265
ttct  0.0047310548037  4849
ttga  0.0047778872704  4897
ttgc  0.0055896500249  5729
ttgg  0.0022635692194  2320
ttgt  0.0055476959402  5686
ttta  0.0090659849941  9292
tttc  0.0069331564107  7106
tttg  0.0064882479779  6650
tttt  0.0149249217020  15297

45 Alberto Apostolico - Erice0544 How many distinct substrings in a string of n symbols? A: no more than (n x n)/2 (n ways to choose the beginning i, then n-i ways to choose the end j, with 1 ≤ i ≤ j ≤ n)
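The counting argument on this slide can be checked by brute force on a short string. The sketch below is only illustrative: the function names are made up here, and the exact number of (i, j) pairs is n(n+1)/2, which the slide rounds to (n x n)/2.

```python
def distinct_substrings(x):
    """Set of all distinct nonempty substrings x[i:j] of x (brute force)."""
    n = len(x)
    return {x[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def pair_bound(n):
    """Number of (i, j) pairs with 1 <= i <= j <= n, i.e. n(n+1)/2."""
    return n * (n + 1) // 2


# Example string borrowed from the suffix-tree slide later in the deck.
x = "abaababaabaababaababa"
print(len(x), len(distinct_substrings(x)), pair_bound(len(x)))
```

A string with no repeated letters attains the bound, while a highly repetitive string stays far below it.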

46 Alberto Apostolico - Erice0545 How many surprising substrings in a string of n symbols?
- Agree on a model for the source: e.g., the source emits symbols independently with identical distribution
- Agree on some measure of surprise, e.g., departure from expected number of occurrences exceeds a certain threshold
- For a given observed string of n symbols, how many substrings may turn out to be surprising?
A: possibly, all (n x n)/2 of them!

47 Alberto Apostolico - Erice0546 Source Modeling by Probabilistic Finite State Automata [Figure, three panels: Order-2 Markov Chain (states 00, 01, 10, 11; transition probabilities 0.25, 0.5, 0.75); Probabilistic Suffix Automaton (states 00, 10, 1); Prob Suffix Tree (nodes 0, 1, 00, 10 with next-symbol distributions (0.75, 0.25), (0.25, 0.75), (0.5, 0.5), (0.5, 0.5))]

48 Alberto Apostolico - Erice0547 Approximate Patterns: Finding surprising substrings with mismatches
- Input: a sequence or set of sequences, integers m and k
- Output: all substrings of length m that occur unusually often, up to k mismatches, as a replica of the same pattern
How many patterns should one try?
NOTE: the pattern might never occur exactly in the input

49 Alberto Apostolico - Erice0548 From the Special Issue for the 50th Shannon Anniversary of IEEE Trans. IT: ``Perhaps as a consequence of the fact that approximate matches abound whereas exact matches are unique, it is inherently much faster to look for an exact match than it is to search from a plethora of approximate matches looking for the best, or even nearly the best, among them. The right way to trade off search effort in a poorly understood environment against the degree to which the product of the search possesses desired criteria has long been a human enigma.'' T. Berger and J.D. Gibson, ``Lossy Source Coding,'' IEEE Trans. on Inform. Theory, vol. 44, No. 6, pp. 2693--2723, 1998.

50 Alberto Apostolico - Erice0549 Syntactic Motif: a recurring pattern with some solid characters and some characters that are a subset of the alphabet, or a ``don't care'' or ``gap''
PROBLEM Input: textstring Output: repeated motifs
Motifs may be rigid or extensible (sometimes also called flexible)
[Figure: the textstring T A G A G G T A G A T A G T with a repeated motif highlighted, its positions labeled as solid characters or ``don't care'' characters]

51 Alberto Apostolico - Erice0550 From Syntax to Stat: Extracting a Profile Matrix & Consensus (From Hertz-Stormo 99)
Aligned sequences:
A A T T G A
A G G T C C
A G G A T G
A G G C G T
Alignment Matrix:
A   4 1 0 1 0 1
C   0 0 0 1 1 1
G   0 3 3 0 2 1
T   0 0 1 2 1 1
Consensus (by majority rule): A G G T G ?
n_{i,j} = number of times letter i is observed at the jth position in the alignment
N = number of sequences = 4
NOTE: While each sequence is a ``realization'' of the consensus, the consensus itself might not be any of the sequences
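As a rough illustration of this slide, the alignment matrix and the consensus can be recomputed from the four sequences. This is a sketch under assumptions: the function names are invented here, and ``majority rule'' is read as ``the single most frequent letter in the column, with ? on ties'', which is the reading that reproduces A G G T G ?.

```python
from collections import Counter

ALPHABET = "ACGT"
sequences = ["AATTGA", "AGGTCC", "AGGATG", "AGGCGT"]


def alignment_matrix(seqs):
    """counts[letter][j] = number of times letter is observed at position j."""
    length = len(seqs[0])
    columns = [Counter(s[j] for s in seqs) for j in range(length)]
    return {a: [col[a] for col in columns] for a in ALPHABET}


def consensus(counts):
    """Most frequent letter per column; '?' when the top count is tied."""
    length = len(counts[ALPHABET[0]])
    out = []
    for j in range(length):
        ranked = sorted(ALPHABET, key=lambda a: counts[a][j], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        out.append("?" if counts[best][j] == counts[runner_up][j] else best)
    return "".join(out)


counts = alignment_matrix(sequences)
for letter in ALPHABET:
    print(letter, counts[letter])
print(consensus(counts))
```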

52 Alberto Apostolico - Erice0551 From Syntax to Stat, continued: Computing Weight Matrix
Aligned sequences:
A A T T G A
A G G T C C
A G G A T G
A G G C G T
Alignment Matrix:
A   4 1 0 1 0 1
C   0 0 0 1 1 1
G   0 3 3 0 2 1
T   0 0 1 2 1 1
Consensus (by majority rule): A G G T G ?
Compute $\ln\left[\frac{(n_{i,j} + p_i)/(N + 1)}{p_i}\right] \approx \ln\frac{f_{i,j}}{p_i}$
n_{i,j} = number of times letter i is observed at the jth position in the alignment
N = number of sequences = 4
p_i = a priori probability (.25 in example)
f_{i,j} = frequency of letter i at position j
This is like taking the ratio of the empirical frequencies, compensated by p_i to avoid infinity or zero, to the hypothetical probabilities or flat distribution (a popular measure among statisticians: how much the observed distribution deviates from chance)

53 Alberto Apostolico - Erice0552 From Syntax to Stat, continued: Weighing a Test Sequence
Aligned sequences:
A A T T G A
A G G T C C
A G G A T G
A G G C G T
Alignment Matrix:
A   4 1 0 1 0 1
C   0 0 0 1 1 1
G   0 3 3 0 2 1
T   0 0 1 2 1 1
Weight Matrix, $\ln\left[\frac{(n_{i,j} + p_i)/(N + 1)}{p_i}\right] \approx \ln\frac{f_{i,j}}{p_i}$:
A    1.2   0    -1.6   0    -1.6   0
C   -1.6  -1.6  -1.6   0     0     0
G   -1.6   .96   .96  -1.6   .59   0
T   -1.6  -1.6   0     .59   0     0
Consensus: A G G T G ?
Test sequence: A G G T G C

54 Alberto Apostolico - Erice0553 From Syntax to Stat, continued: Weighing a Test Sequence
Aligned sequences:
A A T T G A
A G G T C C
A G G A T G
A G G C G T
Alignment Matrix:
A   4 1 0 1 0 1
C   0 0 0 1 1 1
G   0 3 3 0 2 1
T   0 0 1 2 1 1
Weight Matrix, $\ln\left[\frac{(n_{i,j} + p_i)/(N + 1)}{p_i}\right] \approx \ln\frac{f_{i,j}}{p_i}$:
A    1.2   0    -1.6   0    -1.6   0
C   -1.6  -1.6  -1.6   0     0     0
G   -1.6   .96   .96  -1.6   .59   0
T   -1.6  -1.6   0     .59   0     0
Test sequence: A G G T G C (score = 4.3)
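The weight matrix and the score of the test sequence can be reproduced from the alignment counts with the formula on these slides. The following is a minimal sketch; the variable and function names are illustrative choices, not part of the original material.

```python
import math

# Alignment matrix from the slides: counts[letter][j].
counts = {
    "A": [4, 1, 0, 1, 0, 1],
    "C": [0, 0, 0, 1, 1, 1],
    "G": [0, 3, 3, 0, 2, 1],
    "T": [0, 0, 1, 2, 1, 1],
}
N = 4                              # number of aligned sequences
p = {a: 0.25 for a in counts}      # a priori probabilities (flat)


def weight(n_ij, p_i, n_seqs):
    """ln[((n_ij + p_i) / (N + 1)) / p_i], the corrected log-ratio."""
    return math.log(((n_ij + p_i) / (n_seqs + 1)) / p_i)


weights = {a: [weight(n, p[a], N) for n in row] for a, row in counts.items()}


def score(seq):
    """Sum of the weights of the letters of seq, position by position."""
    return sum(weights[c][j] for j, c in enumerate(seq))


for letter, row in weights.items():
    print(letter, [round(w, 2) for w in row])
print(round(score("AGGTGC"), 1))
```

Adding p_i to each count is what keeps the weight of an unobserved letter finite (about -1.6 here) instead of minus infinity.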

55 Alberto Apostolico - Erice0554 From Stat to Syntax: extracting a ``full consensus'' from sample (daf-19 binding sites in C. elegans - Peter Swoboda)
Binding sites (position -150; genes osm-1, osm-6, daf-19, che-2, F02D8.3):
GTTGTCATG GTGAC
GTTTCCATG GAAAC
GCTACCATG GCAAC
GTTACCATA GTAAC
GTTTCCATG GTAAC
Partial consensus patterns:
GTT__CATGGT_AC
GTT_CCATGG_AAC
G_T_CCATGG_AAC
GTT_CCAT_ GTAAC
GTT_CCATG GTAAC
Model: G_T__CAT_G__AC
Now the model describes also GATCCCATCGGAAC, which did not belong to the data
Consensus at all costs generates monsters
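The ``monster'' effect on this slide is easy to reproduce. The sketch below assumes that each site is the concatenation of the two blocks shown (for example GTTGTCATG followed by GTGAC) and uses hypothetical helper names; it keeps a column only when all five sites agree on it.

```python
# Assumption: each site is the 9-letter block followed by the 5-letter block.
sites = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]


def full_consensus(seqs, wildcard="_"):
    """Keep a letter only in columns where every sequence agrees."""
    return "".join(
        col[0] if len(set(col)) == 1 else wildcard for col in zip(*seqs)
    )


def matches(model, seq, wildcard="_"):
    """True if seq agrees with model on every non-wildcard position."""
    return len(model) == len(seq) and all(
        m == wildcard or m == c for m, c in zip(model, seq)
    )


model = full_consensus(sites)
print(model)
print(matches(model, "GATCCCATCGGAAC"))
```

The model accepts all five sites, but it also accepts strings such as GATCCCATCGGAAC that were never observed.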

56 Alberto Apostolico - Erice0555 Episodes and extensible motifs (Mannila et al., 95; Das et al., 97) Input: textstring and pattern string Output: episode realization (quadratic worst-case) [Figure annotation: Max 10 pos]

57 Alberto Apostolico - Erice0556 Extensible Motifs
Definition: Extensible Motifs are patterns which allow variable-length don't cares, e.g., Prosite F.....G-(2,4)G.H
- Note that the length of these patterns is variable
- High expressive power
- Huge pattern space

58 Alberto Apostolico - Erice0557 An Example from Prosite
Entry name: HIPIP
Accession number: PS00596
Description: High potential iron-sulfur proteins signature.
Pattern: C-(6,9)[LIVM]...G[YW]C..[FYW]
[Figures: PDB 1PIJ, PDB 1HLQ]

59 Alberto Apostolico - Erice0558 Extensible Motifs (Implications of Variable Gaps)
s = axbcaxxbcaxxxbc, m = a-[1-3]bc at positions 1, 5 and 10
Main Issues
1) A location list corresponds to multiple patterns. E.g., in axbcpdaycbqd (at positions 1 and 7): m1 = a-[1-2]b-[1-2]d and m2 = a-[1-2]c-[1-2]d
2) Multiple occurrences at a location. E.g., in axbbxc (at position 1): m = a-[1-2]b-[1-2]c
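Both issues can be observed directly by translating the slide's gap notation into regular expressions. This is a sketch under assumptions: the translation of `-[lo-hi]` into `.{lo,hi}`, the function names, and the restriction to plain alphanumeric solid characters are all choices made here for illustration.

```python
import re


def motif_to_regex(motif):
    """Translate the slide notation, e.g. 'a-[1-3]bc' -> 'a.{1,3}bc'."""
    return re.sub(r"-\[(\d+)-(\d+)\]", r".{\1,\2}", motif)


def occurrences(motif, s):
    """1-based start positions of motif in s (lookahead allows overlaps)."""
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() + 1 for m in pattern.finditer(s)]


def realizations(motif, s, pos):
    """Number of distinct gap assignments matching motif at 1-based pos."""
    tokens = re.findall(r"-\[(\d+)-(\d+)\]|([^-\[\]]+)", motif)

    def count(t, i):
        if t == len(tokens):
            return 1
        lo, hi, solid = tokens[t]
        if solid:
            return count(t + 1, i + len(solid)) if s.startswith(solid, i) else 0
        return sum(count(t + 1, i + g) for g in range(int(lo), int(hi) + 1))

    return count(0, pos - 1)


print(occurrences("a-[1-3]bc", "axbcaxxbcaxxxbc"))
print(occurrences("a-[1-2]b-[1-2]d", "axbcpdaycbqd"))
print(occurrences("a-[1-2]c-[1-2]d", "axbcpdaycbqd"))
print(realizations("a-[1-2]b-[1-2]c", "axbbxc", 1))
```

The second and third calls return the same location list for two different motifs, and the last call shows two realizations of one motif at a single location.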

60 Alberto Apostolico - Erice0559 Summary
- Form and Information
- To Classify and Generate
- Of Free Lunches, Ugly Ducklings, and Little Green Men
- Privileging Syntactic Information
- Avoidable and Unavoidable Regularities
- Periods, Palindromes, Squares, etc.
- Theories Bigger than Life
- Motifs, Profiles and Weight Matrices
- The Emperor's New Map

61 Alberto Apostolico - Erice0560 Detection and Analysis of Gene Regulatory Regions (Jacques van Helden, http://copan.cifn.unam.mx/Computational_Biology/yeast-tools) ``Starting from the simple knowledge that a set of genes share some regulatory behavior, one can suppose that some elements are shared by their upstream region, and one would like to detect such elements. We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides.'' J. Mol. Biol. (1998) 281, 827-842.

62 Alberto Apostolico - Erice0561 Over-represented sequences in the 800 bps upstream segments of two families of co-regulated genes in the yeast: superposition of circled words yields known motifs TCACGTG AAAACTGTGG TCCGCGGA

63 Alberto Apostolico - Erice0562 Question: how many of the 8-mers in a sequence 10^6 bases long could be surprisingly over-represented? How many k-mers in total, for all k?
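A back-of-the-envelope bound helps size these tables. The sketch below only bounds the number of distinct k-mers (it says nothing about which ones are surprising); it assumes a 4-letter alphabet, and the function names are illustrative. A text of length n has n - k + 1 windows of length k, and there are 4^k possible k-mers, so the number of distinct k-mers is at most the smaller of the two.

```python
def max_distinct_kmers(n, k, sigma=4):
    """Upper bound on the distinct k-mers in a text of n symbols."""
    return min(sigma ** k, n - k + 1)


def max_table_rows(n, sigma=4):
    """Upper bound on the total rows if every k in 1..n gets a table."""
    total, power = 0, 1
    for k in range(1, n + 1):
        power = min(power * sigma, n)   # cap sigma**k to avoid huge integers
        total += min(power, n - k + 1)
    return total


n = 10 ** 6
print(max_distinct_kmers(n, 8))
print(max_table_rows(n))
```

For small k the alphabet term dominates, but summed over all k the bound grows on the order of n^2 / 2.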

64 Alberto Apostolico - Erice0563 Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
Name                       Last modified       Size
1nt_non-coding_Esche..>    24-Dec-2001 06:56   1k
2nt_non-coding_Esche..>    24-Dec-2001 06:56   1k
3nt_non-coding_Esche..>    24-Dec-2001 06:56   2k
4nt_non-coding_Esche..>    24-Dec-2001 06:56   7k
5nt_non-coding_Esche..>    24-Dec-2001 06:56   26k
6nt_non-coding_Esche..>    24-Dec-2001 06:56   108k
7nt_non-coding_Esche..>    24-Dec-2001 06:56   434k
8nt_non-coding_Esche..>    24-Dec-2001 06:57   1.7M
dyads_3nt_sp0-20_non..>    24-Dec-2001 07:11   2.9M
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/

66 Alberto Apostolico - Erice0565 Theories bigger than Life: Assume we wanted to build a statistical table counting occurrences of all surprising substrings in a genome. Q: How many distinct substrings in a string of n symbols? A: no more than (n x n)/2 (n ways to choose the beginning i, then n-i ways to choose the end j, with 1 ≤ i ≤ j ≤ n)

67 Alberto Apostolico - Erice0566 Theories bigger than Life: How many surprising substrings in a string of n symbols?
- Agree on a model for the source: e.g., the source emits symbols independently with identical distribution
- Agree on some measure of surprise, e.g., departure from expected number of occurrences exceeds a certain threshold
- For a given observed string of n symbols, how many substrings may turn out to be surprising?
A: possibly, all (n x n)/2 of them!

68 Alberto Apostolico - Erice0567 Z-scores as measures of surprise

69 Alberto Apostolico - Erice0568 Three easy conditions on surprise 1) always: 2) for absent words: (note asymmetry of surprise) 3) for over-represented words: (longer word = bigger surprise) From 1-3 together

70 Alberto Apostolico - Erice0569 Monotony of Surprise A score such that : will be called monotone

71 Alberto Apostolico - Erice0570 Main point For many monotone scores where ``surprising’’

72 Alberto Apostolico - Erice0571 DAWGs
- The set of words reaching a node is a burst of consecutive suffixes of the same word
- Each state corresponds to a set of strings, the set of all strings that have occurrences ending precisely at the same positions in x
- The sequence of labels on each distinct path from source to sink spells a suffix of x
- Bounds on the number of states Q and edges E: |x| < Q < 2|x| - 1 and |x| - 1 < E < 3|x| - 3
[Figure: DAWG for x = ATAAAATTT]

73 Alberto Apostolico - Erice0572 DAWGs With monotone scores, it suffices to publish scores only at the longest word in each one of the O(n) equivalence classes (Often, however, we still need to compute all O(n^2) scores)
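The equivalence classes behind the DAWG can be made concrete by brute force: two substrings are equivalent when they end at exactly the same positions of x. The quadratic sketch below is for illustration only (a real DAWG is built in linear time), and the function name is an assumption. It lists the longest word of each class, which is where a monotone score would be published.

```python
from collections import defaultdict


def end_position_classes(x):
    """Group the nonempty substrings of x by their set of end positions."""
    ends = defaultdict(set)
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            ends[x[i:j]].add(j)          # j = 1-based end position
    classes = defaultdict(list)
    for word, end_set in ends.items():
        classes[frozenset(end_set)].append(word)
    return classes


x = "ATAAAATTT"
classes = end_position_classes(x)
states = len(classes) + 1                # one extra state for the empty word
longest = sorted(max(words, key=len) for words in classes.values())
print(len(x), states)
print(longest)
```

Each class turns out to be a run of consecutive suffixes of its longest member, in line with the description of DAWG nodes on the previous slide.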

74 Alberto Apostolico - Erice0573 The Size of Tables for Substring Statistics

75 Alberto Apostolico - Erice0574 Substring Statistics with Suffix Trees A partial view (all suffixes starting with ``a'') of the weighted suffix tree for the string x = abaababaabaababaababa: the weight of each internal node reports the number of (possibly overlapping) occurrences in x of the substring having locus at that node. 1) Counts do not change along an arc. 2) If aw ends at a node, so does w (suffix links). ``The Myriad Virtues of Suffix Trees'', A. Apostolico, in Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, eds., Springer 1985
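The first property can be illustrated without building the tree. The sketch below counts overlapping occurrences naively (the function name is illustrative); in the slide's string every ab is followed by a, so ab and aba fall on the same arc and share one count.

```python
def occurrences(w, x):
    """Number of possibly overlapping occurrences of w in x (naive scan)."""
    return sum(x.startswith(w, i) for i in range(len(x) - len(w) + 1))


x = "abaababaabaababaababa"
print(occurrences("ab", x), occurrences("aba", x))
print(occurrences("a", x), occurrences("aa", x))
```

The count drops only where a word can be extended in more than one way, which is where the suffix tree branches.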

76 Alberto Apostolico - Erice0575 Detecting Squares with Suffix Trees There is a square iff there is a node with two consecutive leaves in its subtree too close for comfort: 14 - 12 = 2 < 3 = |aba| (A. Apostolico & F.P. Preparata, 83)

77 Alberto Apostolico - Erice0576 Combining Saturation and Monotony of Scores over ST Arcs yields Surprising Solid Words in Linear Time and Space
Verbumculus (AA, Bock, Gong, Lonardi, Xu, JCB2000, JCB2003, Recomb 2003,..)
- Based on Suffix tree and iid
- Partitions the O(n^2) substrings into O(n) ``equivalence classes of monotone score'', then computes expected frequencies, variances and scores for the most surprising word in each class in time O(n) overall.
- For any word v without a score, there is a scored extension vy which is at least equally surprising.

78 Alberto Apostolico - Erice0577 Z-scores and measures of surprise

79 Alberto Apostolico - Erice0578 Main point For any measure of surprise where and conditions 1-3 are satisfied: ``surprising’’

80 Alberto Apostolico - Erice0579 Exercise: i.i.d. variables We are interested in the expected number of occurrences of y in X, and the corresponding variance.
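For the expectation, the exercise has a short answer: a word y of length m has n - m + 1 possible start positions in X, and under the i.i.d. model each position spells y with probability equal to the product of its symbol probabilities. The sketch below computes this and a z-score. Note the loud assumption: the variance used is the simple approximation E(1 - p) that ignores overlaps between occurrences, whereas the exact variance depends on the periods of y; the function names are hypothetical.

```python
import math


def expected_occurrences(y, n, probs):
    """(E, p): expected count of y in an i.i.d. text of length n, and p(y)."""
    p_y = math.prod(probs[c] for c in y)
    return (n - len(y) + 1) * p_y, p_y


def z_score(y, x, probs):
    """(observed - expected) / sqrt(approximate variance)."""
    observed = sum(x.startswith(y, i) for i in range(len(x) - len(y) + 1))
    expected, p_y = expected_occurrences(y, len(x), probs)
    # Assumption: variance approximated by E * (1 - p), overlaps ignored.
    return (observed - expected) / math.sqrt(expected * (1 - p_y))


probs = {"a": 0.5, "b": 0.5}
print(expected_occurrences("ab", 10, probs))
print(z_score("ab", "abababab", probs))
```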

81 Alberto Apostolico - Erice0580 Over- and Under-represented words: Z-Scores !@#&!!$! !

82 Alberto Apostolico - Erice0581 Under the Hood: Periods and Variance

83 Alberto Apostolico - Erice0582 [4-mer frequency table (;seq, observed_freq, occ), repeated from slide 44] A table of k-mers grows rapidly out of proportions or out of sight. How many k-mers in total, for all k? http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/ RSA-tools - menu.htm

84 Alberto Apostolico - Erice0583 Verbumculus + Dot on first 512 bps of Yeast Mitochondrial DNA

85 Alberto Apostolico - Erice0584 Counting occurrences of gagga in HSV1

86 Alberto Apostolico - Erice0585 Alternate Counting

87 Alberto Apostolico - Erice0586 Counting occurrences of ccgct in HSV1

88 Alberto Apostolico – Cinzia Pizzi – Giorgio Satta
Dyads Detection in Biology: Dyads are the composition of two solid components separated by a variable gap (e.g., ACCG ... TAAG)
Part of Speech Tagging in NLP: ``Although preliminary findings were reported more than a year ago, the latest results appear...'' is tagged IN JJ NNS VBD VBN RBR IN DT NN IN , DT JJS NNS VBP ...
Automatic Tagging: set of correctly classified examples, infer rules, classify new texts
Drawback: ambiguity. A limited-size context centered on a word can fail to give a unique tag assignment
Possible Solution: Barriers [Figure: tagging disambiguation; a word in the text that could be NN or JJ is given the correct classification using barriers B1 and B2]
Goal: efficient counting of subword co-occurrences within distance d, with no interleaving occurrences of one or the other

89 Alberto Apostolico – Cinzia Pizzi – Giorgio Satta
Notation
- X is a string of n symbols over the alphabet Σ
- d is a fixed non-negative integer
- y and z are subwords of X
- Tandem Index I(y,z) is the number of times that z has a closest occurrence within a distance d from a corresponding closest occurrence of y to its left
- Relaxed Tandem Index Î(y,z): all the occurrences of z within distance d are counted
Goal: efficient counting of subword co-occurrences within distance d, with no interleaving occurrences of one or the other
Key Observation: In principle there are O(n^2) substrings in X, and thus O(n^4) distinct pairs of substrings; however, it suffices to consider a family containing only O(n^2) pairs. Then, for any neglected pair (y',z') there is a pair (y,z) in the family such that: (i) y' and z' are prefixes of y and z respectively, and (ii) the tandem index of (y',z') equals the tandem index of (y,z).
Result: O(n^2) algorithm for building a tandem index table (previous results: O(n^3) [Arimura et al., Wang et al.], in the case of two words, from a generalized version of the problem)

90 Alberto Apostolico - Erice0589 Towards a theory of saturated motifs: here a motif is a recurring pattern with some solid and some ``don't care'' characters, together with its set of occurrences
PROBLEM Input: textstring Output: repeated motifs
[Figure: the textstring T A G A G G T A G A T A G T with repeated motifs highlighted, their positions labeled as solid characters or ``don't care'' characters]
Is motif discovery still beset by the circumstance that typically there are exponentially many candidate motifs in a sequence?

91 Alberto Apostolico - Erice0590 Controlling Motif Growth: Irredundant Motifs (L. Parida)
A motif is
- maximal in composition if specifying more solid characters implies an alteration to its occurrence list
- maximal in length if making the motif longer implies an alteration to the cardinality or displacement of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant

92 Alberto Apostolico - Erice0591 Maximal, Redundant, Irredundant Motifs (examples, cont.)
Let s = aaXtaYgZZZaaVtaWcXXXXaaYtgXc
m_1 = aa.t with L_1 = {1, 11, 22}
m_2 = aa.ta with L_2 = {1, 11}
m_3 = aa.t..c with L_3 = {11, 22}
m_1 = aa.t is redundant, since 1) m_1 is a sub-motif of m_2 and of m_3 and 2) L_1 is the union of L_2 and L_3.
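The occurrence lists in this example can be verified mechanically. The sketch below treats X, Y, Z, V, W as ordinary characters of s and '.' as the don't care; the helper name is illustrative. It assumes m_3 = aa.t..c, with two don't cares before the c, which is the reading under which L_3 = {11, 22} holds.

```python
def occurrences(motif, s):
    """1-based start positions where a rigid motif ('.' = don't care) fits."""
    m = len(motif)
    return [
        i + 1
        for i in range(len(s) - m + 1)
        if all(c == "." or c == s[i + j] for j, c in enumerate(motif))
    ]


s = "aaXtaYgZZZaaVtaWcXXXXaaYtgXc"
L1 = occurrences("aa.t", s)
L2 = occurrences("aa.ta", s)
L3 = occurrences("aa.t..c", s)
print(L1, L2, L3)
print(set(L1) == set(L2) | set(L3))   # the list of m_1 adds no information
```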

93 Alberto Apostolico - Erice0592 Controlling Motif Growth: HOW MANY Irredundant Motifs
Recall that a motif is
- maximal in composition if specifying more solid characters implies an alteration to its occurrence list
- maximal in length if making it longer implies an alteration to the cardinality of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant
A motif that occurs at least k times in the textstring is a k-motif
Theorem: In any textstring x the number of irredundant 2-motifs is O(|x|)
(PROBLEM: How to find irredundant motifs as fast as possible)

94 Alberto Apostolico - Erice0593 Suffix Consensus, Suffix Meet
[Figure: a textstring s over the letters a, b, c, with its suffixes suf1 and suf4 aligned position by position]
The consensus of suf1 and suf4 is not a motif
The meet of suf1 and suf4 is a maximal motif
Theorem: Every irredundant 2-motif of x is the meet of two suffixes of x
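The slide does not spell out the two operations, so the following sketch rests on an assumed reading: the consensus of two suffixes keeps a character where they agree and writes a don't care where they differ, and the meet is the consensus with leading and trailing don't cares stripped. The function names and the toy string are illustrative.

```python
def consensus(s, i, j, wildcard="."):
    """Position-wise agreement of the suffixes of s starting at i and j (1-based)."""
    u, v = s[i - 1:], s[j - 1:]
    return "".join(a if a == b else wildcard for a, b in zip(u, v))


def meet(s, i, j, wildcard="."):
    """Assumed definition: consensus stripped of outer don't cares."""
    return consensus(s, i, j, wildcard).strip(wildcard)


s = "abcabd"
print(consensus(s, 1, 4))   # may end in a don't care, hence not a motif
print(meet(s, 1, 4))
```

Under this reading a linear bound on irredundant 2-motifs is plausible at a glance, since there are only O(|x|^2) pairs of suffixes to start from and the theorem restricts candidates to their meets.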

95 Alberto Apostolico - Erice0594 Approximate Patterns: Finding surprising substrings with mismatches
- Input: a sequence or set of sequences, integers m and k
- Output: all substrings of length m that occur unusually often as a replica of the same pattern with up to k mismatches
How many patterns should one try?
NOTE: the pattern might never occur exactly in the input

96 Alberto Apostolico - Erice0595 Problem Statement
- Given a source text X and an error threshold k, extract substrings of X that occur unusually often in X within k substitutions or mismatches.
- Measure of Surprise: compare counts with expectations
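Counting occurrences within k mismatches is simple to state in code, even if doing it efficiently for all candidate patterns is the hard part. The naive sketch below (names are illustrative, running time O(nm) per pattern) uses Hamming distance and also shows a pattern that never occurs exactly yet has approximate occurrences.

```python
def hamming(u, v):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def occurrences_with_mismatches(pattern, x, k):
    """1-based starts of windows of x within k mismatches of pattern."""
    m = len(pattern)
    return [
        i + 1
        for i in range(len(x) - m + 1)
        if hamming(pattern, x[i:i + m]) <= k
    ]


x = "abcabdxbc"
print(occurrences_with_mismatches("abc", x, 0))
print(occurrences_with_mismatches("abc", x, 1))
print(occurrences_with_mismatches("abb", x, 0))   # never occurs exactly
print(occurrences_with_mismatches("abb", x, 1))
```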

97 Alberto Apostolico - Erice0596 SubProblem: Compute Expected Frequencies under I.I.D. Distribution
Two results for expected frequencies:
- O(nk) preprocessing of text, then report expected frequency for any substring in O(k^2)
- Report expected frequency of all substrings of a given length in O(nk)

98 Alberto Apostolico - Erice0597 JACM 50, 1, January 2003, pp 25-26, Special Issue: Problems for the Next 50 Years (page 1, paper 1, problem 1) ``Shannon and Weaver performed an inestimable service by giving us a definition of information and a metric for information as communicated from place to place. We have no theory however that gives us a metric for the information embodied in structure... this is the most fundamental gap in the theoretical underpinning of information and computer science. A young information theory scholar willing to spend years on a deeply fundamental problem need look no further.'' Frederick P. Brooks, Jr., The Great Challenges for Half Century Old Computer Science

