SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004

Motif: A functional regain of a DNA or protein sequence How to discover the functional regains automatically? Amino Acids sequences

Automatic Motif discovery Problem - Use A, B, C, … stands for different amino acids - A protein sequence: ABABAABCDBAA… - Motifs are certain patterns in sequences for example: ABCA Previous Methods: small scale discovery - Several sequences  similar functions  alignment Can we use data mining to generate motifs candidates first?

Automatically discover motifs: What properties should a motif have? It has a specific function  conservative  frequent appearing in sequences Evolution  likely not continually identical For example: ABCBABABA AB--ABAB-  string matching, suffix tree … AB-BAB-B-  how?

Formal problem definition Input: A string of characters: S=s 1 s 2,…,s L Output: A frequent pattern: (∑ U ●)* ●: a wild card to match a single character, ∑: a full character * : repeat arbitrary times Note: NO arbitrary-length gap. ABCD, AED are different

Regular Expression: to describe a certain type of patterns | or : A|B means A or B ● wild card to match any characters A●B means: AAB, ABB, ACB, … * to repeat any times (including 0 times) (AB)* means null, AB, ABAB, ABABAB, … + to repeat any times (not including 0 times) …

Any requirements for output patterns? Can wild card be anywhere? Do we need some constraints on wild cards? What means “frequent”? How long should a qualified pattern be?

Can wild card be anywhere? A pattern can have ●: for example A●BA●●B But, A●●●●●●●●●●●●●●●●BBA ?? Probably, it cannot be too “sparse”… Naïve solution: no more than n ● But, for example n=5 A●●●●●B : 5 ● A●BB●●A●●B●●A : 7●

Given a pattern P, any length l 0 region in P must have k 0 full characters Example: l 0 = 5, k 0 = 3 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 …… Density: how “sparse” do we allow? Two ● at most

Given a pattern P, any length l 0 region in P must have k 0 full characters Example: l 0 = 5, k 0 = 3 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 …… A●●ABB●A √ BA●●A●BB X Density: how “sparse” do we allow?

Frequency and length At least, the patterns have K 0 full characters repeating J 0 times Example J 0 =3 ABCBABABA √ ABCBABABA X Example K 0 =3 ABCBABABA X ABCBABABA √

Summary of parameters for a pattern Sequence S, and its length L Pattern P, K full character, appears J times Length constraints: K ≥ K 0 Frequency constraint: J ≥ J 0 Density constraint: l 0, k 0

Apriori property A constraint has a-priori property means: If a set violates this constraint, any its superset will violate this constraint as well. For example max(S) < 5 Frequency constraint has a-priori property! For example, BA●A●BB appears less than J 0 times, any its super patterns CANNOT appears more than J 0 times!

A whole picture of the algorithm To form longer pattern only from short qualified patterns. - First, to generate candidates/seed (length l 0 ): every seed should repeat at least J 0 times - To generate longer patterns from short patterns, iteratively 1. Two patterns are together 2. Longer patterns repeat at least J 0 ……

Generate the seeds: enumerating … To generate seeds (shortest patterns) first ABAABBCBACBDB… J 0 =4 A: 4, B: 6, C:2, D: 1 Are length 1 seeds too short? How long could those seeds be? - Too long: enumerating costs too much time - Too short: maybe not efficient, also not consider the density constraints Maybe, we should start from the patterns with length l 0.

How to generate seeds with length l 0 ? Give l 0 and k 0, and character sets ABC… Enumerating all possible patterns with length l 0 Scan the sequence the count the frequency For example, l 0 =3, k 0 =2, ABC AAA, AAB, AAC, ABA, … AA●, AB●, AC●, … A●A, A●B, A●C, … …

Can we do it more efficiently? Give l 0 and k 0, 1: full character 0:wild card Enumerating all possible patterns by 1 and 0? Example l 0 =5, k 0 =3, to find comb 11111, 11110, 11101, 11100, 11011, 11010, 11001, 10111, 10110, 10101 10011, 01111, 01110, 01101, 01011, 00111

How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●A●B

How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B B●A●A

How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●B●B

How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●A●B 3, B●A●A 2, A●B●B 2 B●B●A 2, B●B●B 2, A●A●A 1 A●B●A 1, B●A●B 1 J 0 = 3? only A●A●B left By the same way, use others combs to generate other seeds, different combs won’t generate the same patterns

How to get long patterns? Long pattern  two patterns could be merged  need short patterns and their locations - Pattern: A●B●●C {A:0,B:2,C:5} - Locus: the locations where a pattern occurs: Patten AB in string ABBCABAB Its locus {0, 4, 6}

Append operation: to connect two small patterns to a longer pattern Patten S 1 : A●●B●C and S 2 : B●D●  S 1 S 2 : A●●B●CB●D● conditional on: Their locus have intersection S 1 locus: {1, 20, 32, 57 …} S 2 locus: {7, 13, 38, 63 … }  {1,7,32,57,…} S 1 S 2 locus: {1,32,57,…} -6

Add: to make the patterns more “dense” Patten A●●B●C●●●D and ●●●B●CE  A●●B●CE●●D on the conditions: Their locus (with shifting) have intersection

Significance: whether it can be generated by randomly sampling ? hypothesis: A pattern is not randomly generated Given: character set: {A,B,C,D,E} sequence length: L A pattern: A●BA●AA●B Its frequency j Probability to generate this pattern j times pure randomly?

Statistical significance Pure random sampling, the frequency should satisfy normal distribution Z score, (A-E[A])/σ A --- normalized into N(0,1)

Experiments Two questions to answer. - How efficient is this algorithm? - How effective is this algorithm? Baseline algorithm - PRATT(EBI), MEME(UCSD)

Efficiency SPLASH PRATT

Effectiveness Search against SWISS-PROT Rel. 36, 578 GPCR proteins returned, only 4 false positive MEME cannot find it, PRATT program crashed

Conclusions Deterministic algorithm: It can discover all patterns satisfying the requirements Efficient and scalable: It beats PRATT and MEME. More scalable … Effective: It can discover useful patterns.

Problems All problems that A-priori algorithm could have: too many results, cannot really avoid worse-case exponential … Doesn’t really consider the 3D structure of proteins The software crashes sometimes

SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Similar presentations

Presentation on theme: "SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004.

Similar presentations

Presentation on theme: "SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004."— Presentation transcript:

Similar presentations

About project

Feedback