SPLASH: Structural Pattern Localization Analysis by Sequential Histograms
A. Califano, IBM TJ Watson
Presented by Tao Tao, April 14th, 2004



Motif: a functional region of a DNA or protein sequence
- How can functional regions be discovered automatically?
- Amino acid sequences

Automatic motif discovery
Problem
- Let A, B, C, … stand for different amino acids.
- A protein sequence: ABABAABCDBAA…
- Motifs are certain patterns in sequences, for example ABCA.
Previous methods: small-scale discovery
- Several sequences → similar functions → alignment
Can we use data mining to generate motif candidates first?

Discovering motifs automatically: what properties should a motif have?
- It has a specific function → it is conserved → it appears frequently in sequences.
- Evolution → occurrences are likely not identical at every consecutive position.
For example, in ABCBABABA:
- AB--ABAB- → string matching, suffix tree, …
- AB-BAB-B- → how?

Formal problem definition
- Input: a string of characters S = s_1 s_2 … s_L
- Output: frequent patterns of the form (Σ ∪ ●)*
  - ●: a wildcard that matches exactly one character
  - Σ: any full character (a symbol of the alphabet)
  - *: repeat an arbitrary number of times
- Note: there are NO arbitrary-length gaps, so ABCD and AED are different (A●D matches AED but not ABCD).

Regular expressions: a notation for describing a class of patterns
- | (or): A|B means A or B
- ● (wildcard): matches any single character; A●B means AAB, ABB, ACB, …
- * : repeat any number of times, including zero; (AB)* means the empty string, AB, ABAB, ABABAB, …
- + : repeat any number of times, but at least once
- …
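The operators above map directly onto Python's standard `re` module, which writes the single-character wildcard as `.` instead of ●. The sketch below is only an illustration; the helper name `to_regex` is hypothetical and not part of SPLASH.

```python
import re

WILDCARD = "●"  # the slides' single-character wildcard

def to_regex(pattern: str) -> re.Pattern:
    """Translate a ●-pattern such as 'A●B' into a compiled regular expression."""
    parts = ["." if ch == WILDCARD else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))

# A●B accepts ACB, but not AB: each ● consumes exactly one character.
print(bool(to_regex("A●B").fullmatch("ACB")))
print(bool(to_regex("A●B").fullmatch("AB")))

# (AB)* accepts the empty string, (AB)+ does not.
print(bool(re.fullmatch(r"(AB)*", "")))
print(bool(re.fullmatch(r"(AB)+", "")))
```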

Any requirements for output patterns?
- Can a wildcard be anywhere? Do we need constraints on wildcards?
- What does “frequent” mean?
- How long should a qualified pattern be?

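Can a wildcard be anywhere?
- A pattern can contain ●, for example A●BA●●B.
- But what about A●●●●●●●●●●●●●●●●BBA? A pattern probably cannot be too “sparse”.
- Naïve solution: allow no more than n ● per pattern.
- But with n = 5, for example:
  - A●●●●●B has 5 ●
  - A●BB●●A●●B●●A has 7 ●
- So the sparse first pattern passes, while the denser second one is rejected.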

Density: how “sparse” do we allow a pattern to be?
- Given a pattern P, any region of length l_0 in P must contain at least k_0 full characters.
- Example: l_0 = 5, k_0 = 3, over positions s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 s_9 …
- So every window of 5 positions holds at most two ●.

Density: how “sparse” do we allow a pattern to be?
- Given a pattern P, any region of length l_0 in P must contain at least k_0 full characters.
- Example: l_0 = 5, k_0 = 3
  - A●●ABB●A √
  - BA●●A●BB X (the window A●●A● has only two full characters)
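The density rule can be checked by sliding a window of length l_0 over the pattern. This is a minimal sketch; the function name is hypothetical, and treating patterns shorter than l_0 as passing vacuously is an assumption that the slides do not address.

```python
WILDCARD = "●"

def satisfies_density(pattern: str, l0: int, k0: int) -> bool:
    """True if every window of l0 consecutive positions has at least k0 full characters."""
    # Assumption: a pattern shorter than l0 has no complete window and passes vacuously.
    windows = (pattern[i:i + l0] for i in range(len(pattern) - l0 + 1))
    return all(sum(ch != WILDCARD for ch in w) >= k0 for w in windows)

print(satisfies_density("A●●ABB●A", l0=5, k0=3))  # the slide's accepted pattern
print(satisfies_density("BA●●A●BB", l0=5, k0=3))  # the slide's rejected pattern
```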

Frequency and length
- A pattern must have at least K_0 full characters and repeat at least J_0 times.
- Example, J_0 = 3: ABCBABABA √   ABCBABABA X
- Example, K_0 = 3: ABCBABABA X   ABCBABABA √

Summary of parameters for a pattern
- Sequence S and its length L
- Pattern P with K full characters, appearing J times
- Length constraint: K ≥ K_0
- Frequency constraint: J ≥ J_0
- Density constraint: l_0, k_0

Apriori property
- A constraint has the a-priori property if, whenever a set violates the constraint, every superset of it violates the constraint as well. For example: max(S) < 5.
- The frequency constraint has the a-priori property!
- For example, if BA●A●BB appears fewer than J_0 times, none of its super-patterns can appear J_0 or more times.

The whole picture of the algorithm
Form longer patterns only from short qualified patterns.
- First, generate the candidates (seeds) of length l_0: every seed must repeat at least J_0 times.
- Then generate longer patterns from shorter ones, iteratively:
  1. Two patterns are put together.
  2. The longer pattern must repeat at least J_0 times.
  ……

Generating the seeds: enumeration
- Generate the seeds (the shortest patterns) first.
- Example: ABAABBCBACBDB…, J_0 = 4. Counts: A: 4, B: 6, C: 2, D: 1
- Are length-1 seeds too short? How long should the seeds be?
  - Too long: enumeration costs too much time.
  - Too short: possibly inefficient, and the density constraint is not taken into account.
- Maybe we should start from patterns of length l_0.

How to generate seeds of length l_0?
- Given l_0, k_0, and the character set ABC…
- Enumerate all possible patterns of length l_0.
- Scan the sequence to count their frequencies.
- For example, l_0 = 3, k_0 = 2, character set ABC:
  AAA, AAB, AAC, ABA, …
  AA●, AB●, AC●, …
  A●A, A●B, A●C, …
  …
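The brute-force approach can be sketched in a few lines. Both function names are hypothetical, and the enumeration deliberately keeps every candidate with at least k_0 full characters, including ones that begin with a wildcard.

```python
from itertools import product

WILDCARD = "●"

def enumerate_seed_candidates(alphabet: str, l0: int, k0: int) -> list[str]:
    """All length-l0 strings over alphabet + wildcard with at least k0 full characters."""
    candidates = ("".join(chars) for chars in product(alphabet + WILDCARD, repeat=l0))
    return [c for c in candidates if sum(ch != WILDCARD for ch in c) >= k0]

def count_occurrences(pattern: str, sequence: str) -> int:
    """Number of start positions at which the pattern matches the sequence."""
    last_start = len(sequence) - len(pattern)
    return sum(
        all(p == WILDCARD or p == sequence[start + i] for i, p in enumerate(pattern))
        for start in range(last_start + 1)
    )

candidates = enumerate_seed_candidates("ABC", l0=3, k0=2)
print(len(candidates))  # grows exponentially with l0, hence the search for a cheaper method
print(count_occurrences("A●A", "ABAABBCBACBDB"))
```

Every candidate costs one scan of the sequence, which is what makes plain enumeration expensive for realistic alphabets and window lengths.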

Can we do it more efficiently?
- Given l_0 and k_0, write 1 for a full character and 0 for a wildcard.
- Enumerate all possible patterns by their 1s and 0s?
- Example: l_0 = 5, k_0 = 3, to find the combs:
  11111, 11110, 11101, 11100, 11011, 11010, 11001, 10111, 10110, …, 01111, 01110, 01101, 01011, 00111
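Enumerating combs is independent of the alphabet: it only requires listing the binary masks of length l_0 with at least k_0 ones. A minimal sketch, with a hypothetical function name:

```python
from itertools import product

def enumerate_combs(l0: int, k0: int) -> list[str]:
    """All binary masks of length l0 with at least k0 ones (1 = full character, 0 = wildcard)."""
    masks = ("".join(bits) for bits in product("10", repeat=l0))
    return [m for m in masks if m.count("1") >= k0]

combs = enumerate_combs(l0=5, k0=3)
print(len(combs))
print(combs)
```

For l_0 = 5 and k_0 = 3 there are C(5,3) + C(5,4) + C(5,5) = 16 such masks, whatever the alphabet size.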

How to use a comb?
- For example: A B A A B A B B A B A B A B B A A B
- Comb 10101 placed on the window starting at s_1 reads A●A●B.

How to use a comb?
- For example: A B A A B A B B A B A B A B B A A B
- Comb 10101 placed on the window starting at s_2 reads B●A●A.

How to use a comb?
- For example: A B A A B A B B A B A B A B B A A B
- Comb 10101 placed on the window starting at s_3 reads A●B●B.

How to use a comb?
- For example: A B A A B A B B A B A B A B B A A B
- Counts: A●A●B 3, B●A●A 2, A●B●B 2, B●B●A 2, B●B●B 2, A●A●A 1, A●B●A 1, B●A●B 1
- With J_0 = 3, only A●A●B is left.
- In the same way, use the other combs to generate the other seeds; different combs never generate the same pattern.
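The counting step can be reproduced by sliding one comb along the sequence and tallying the masked windows. This sketch uses the slide's sequence with comb 10101; the function name `apply_comb` is hypothetical.

```python
from collections import Counter

WILDCARD = "●"

def apply_comb(sequence: str, comb: str) -> Counter:
    """Slide the comb over the sequence and count each masked window."""
    counts = Counter()
    for start in range(len(sequence) - len(comb) + 1):
        window = sequence[start:start + len(comb)]
        # Keep the character where the comb has a 1, put a wildcard where it has a 0.
        pattern = "".join(ch if bit == "1" else WILDCARD for ch, bit in zip(window, comb))
        counts[pattern] += 1
    return counts

sequence = "ABAABABBABABABBAAB"
counts = apply_comb(sequence, "10101")
seeds = {pattern: n for pattern, n in counts.items() if n >= 3}  # J_0 = 3
print(counts)
print(seeds)
```

A single pass over the sequence per comb replaces one pass per candidate pattern, which is where the saving over plain enumeration comes from.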

How to get long patterns?
- Long pattern → two patterns that can be merged → we need the short patterns and their locations.
- Pattern: A●B●●C, with full characters at offsets {A:0, B:2, C:5}
- Locus: the locations where a pattern occurs. Pattern AB in string ABBCABAB has locus {0, 4, 6}.
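Both representations are easy to compute. In this sketch the function names are hypothetical and positions are 0-based, as in the slide's example.

```python
WILDCARD = "●"

def full_character_offsets(pattern: str) -> list[tuple[str, int]]:
    """Pairs (character, offset) for every non-wildcard position of the pattern."""
    return [(ch, i) for i, ch in enumerate(pattern) if ch != WILDCARD]

def locus(pattern: str, sequence: str) -> list[int]:
    """0-based start positions at which the pattern matches the sequence."""
    last_start = len(sequence) - len(pattern)
    return [
        start
        for start in range(last_start + 1)
        if all(p == WILDCARD or p == sequence[start + i] for i, p in enumerate(pattern))
    ]

print(full_character_offsets("A●B●●C"))
print(locus("AB", "ABBCABAB"))
```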

Append operation: connect two short patterns into a longer pattern
- Patterns S_1: A●●B●C and S_2: B●D● → S_1S_2: A●●B●CB●D●
- Condition: their loci, after shifting, must intersect.
  - S_1 locus: {1, 20, 32, 57, …}
  - S_2 locus: {7, 13, 38, 63, …}, shifted by -6 (the length of S_1) → {1, 7, 32, 57, …}
  - S_1S_2 locus: {1, 32, 57, …}
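The locus arithmetic can be sketched as follows. The function name is hypothetical, and only the listed (non-elided) positions from the slide are used as input.

```python
def append_patterns(p1: str, locus1: list[int], p2: str, locus2: list[int]):
    """Concatenate p1 and p2; keep the starts of p1 that are followed directly by p2."""
    offset = len(p1)  # p2 must begin exactly where p1 ends
    starts2 = set(locus2)
    # Equivalent to shifting locus2 by -offset and intersecting it with locus1.
    joint_locus = [s for s in locus1 if s + offset in starts2]
    return p1 + p2, joint_locus

pattern, joint = append_patterns("A●●B●C", [1, 20, 32, 57], "B●D●", [7, 13, 38, 63])
print(pattern, joint)
```

Because the joint locus is computed from the two stored loci alone, the sequence does not need to be rescanned to count the longer pattern.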

Add operation: make a pattern more “dense”
- Patterns A●●B●C●●●D and ●●●B●CE → A●●B●CE●●D
- Condition: their loci (after shifting) must intersect.
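One way to picture the add operation is as a position-wise overlay of two aligned patterns. This sketch covers only the pattern merge, not the locus intersection; the function name, the `shift` parameter, and the choice to return `None` on conflicting characters are assumptions made for illustration.

```python
from typing import Optional

WILDCARD = "●"

def add_patterns(p1: str, p2: str, shift: int = 0) -> Optional[str]:
    """Overlay p2 on p1, with p2 starting `shift` (>= 0) positions after p1 starts."""
    length = max(len(p1), shift + len(p2))
    a = p1.ljust(length, WILDCARD)
    b = (WILDCARD * shift + p2).ljust(length, WILDCARD)
    merged = []
    for x, y in zip(a, b):
        if x == WILDCARD:
            merged.append(y)      # take whatever the second pattern specifies
        elif y == WILDCARD or x == y:
            merged.append(x)      # first pattern's character stands
        else:
            return None           # two different full characters at one position
    return "".join(merged)

print(add_patterns("A●●B●C●●●D", "●●●B●CE"))
```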

Significance: could the pattern have been generated by random sampling?
- Hypothesis: the pattern is not randomly generated.
- Given: character set {A, B, C, D, E}, sequence length L, a pattern A●BA●AA●B, and its frequency j.
- What is the probability of generating this pattern j times purely at random?

Statistical significance
- Under purely random sampling, the frequency should follow a normal distribution.
- Z-score: (A - E[A])/σ, where A is the pattern's frequency; this normalizes A into N(0,1).
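To make the z-score concrete, the sketch below assumes a deliberately simple null model: characters drawn independently and uniformly from the alphabet, and occurrences at different positions treated as independent (a binomial approximation that ignores overlaps). This is an illustrative assumption, not necessarily the statistic used in SPLASH, and all names are hypothetical.

```python
import math

WILDCARD = "●"

def match_probability(pattern: str, alphabet_size: int) -> float:
    """Chance that a random window matches: each full character matches with probability 1/|alphabet|."""
    k = sum(ch != WILDCARD for ch in pattern)
    return alphabet_size ** -k

def z_score(observed: int, n_positions: int, p_match: float) -> float:
    """(A - E[A]) / sigma under the assumed binomial null model."""
    expected = n_positions * p_match
    sigma = math.sqrt(n_positions * p_match * (1 - p_match))
    return (observed - expected) / sigma

p = match_probability("A●BA●AA●B", alphabet_size=5)  # six full characters
L = 10_000                                           # hypothetical sequence length
n = L - len("A●BA●AA●B") + 1                         # number of possible start positions
print(p, z_score(observed=5, n_positions=n, p_match=p))
```

A large positive z-score means the pattern occurs far more often than the random model predicts, which is the sense in which a pattern is called significant.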

Experiments
Two questions to answer:
- How efficient is this algorithm?
- How effective is this algorithm?
Baseline algorithms:
- PRATT (EBI), MEME (UCSD)

Efficiency
[Figure: SPLASH compared with PRATT]

Effectiveness
- Search against SWISS-PROT Rel. 36: 578 GPCR proteins returned, with only 4 false positives.
- MEME could not find the pattern; the PRATT program crashed.

Conclusions
- Deterministic algorithm: it can discover all patterns satisfying the requirements.
- Efficient and scalable: it beats PRATT and MEME, and is more scalable …
- Effective: it can discover useful patterns.

Problems
- All the problems that the a-priori algorithm can have: too many results, worst-case exponential cost that cannot really be avoided, …
- It doesn't really consider the 3D structure of proteins.
- The software sometimes crashes.