Avrilia Floratou, Sandeep Tata, and Jignesh M. Patel ICDE 2010 Efficient and Accurate Discovery of Patterns in Sequence Datasets.



Outline
Motivation
A powerful new model
The FLAME (FLexible and Accurate Motif Detector) algorithm
Experimental results
Conclusion
(Slides dated 2010/5/27.)

Motivation
Existing sequence mining algorithms mostly focus on mining subsequences, which need not be contiguous. For instance, given the sequence "a, b, a, c, b, a, c", the sequence "a, b, b, c" is a subsequence obtained by choosing the 1st, 2nd, 5th, and 7th elements.
A typical question is: "Are there any frequently recurring patterns in this time series dataset?" The recurring patterns are similar but not identical, so the model must allow for some noise. This is the approximate subsequence mining problem, in which the matches are contiguous.
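The difference between a non-contiguous subsequence and a contiguous match can be checked directly on the example above. This is a minimal sketch; the helper names `is_subsequence` and `is_contiguous` are illustrative and do not come from the paper.

```python
def is_subsequence(pattern, sequence):
    """True if pattern can be obtained by deleting elements of sequence."""
    it = iter(sequence)
    # Each membership test consumes the iterator, so order is preserved.
    return all(symbol in it for symbol in pattern)


def is_contiguous(pattern, sequence):
    """True if pattern appears as an unbroken run inside sequence."""
    m = len(pattern)
    return any(sequence[i:i + m] == pattern
               for i in range(len(sequence) - m + 1))


sequence = list("abacbac")
pattern = list("abbc")

print(is_subsequence(pattern, sequence))  # picks the 1st, 2nd, 5th, 7th elements
print(is_contiguous(pattern, sequence))   # no unbroken run equals a, b, b, c
```

The pattern is a valid subsequence but never occurs contiguously, which is the distinction the slide draws.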

(Cont.)
Computational biology: the task is to detect short sequences, usually of length 6-15, that occur frequently in a given set of DNA or protein sequences. These frequently occurring patterns are called motifs.
Goal: to present a new model for approximate motif mining that applies to many different domains, and to present the FLAME algorithm, which efficiently finds motifs that satisfy this model.

The model
The (L, M, s, k) model:
L is the length of the motif.
M is a distance matrix that is used to compute the distance between two strings.
s is the maximum distance threshold within which two strings are considered similar.
k is the minimum support (min_sup).
A string S is an (L, M, s, k) motif if the database contains at least k strings T_1, ..., T_k, each of length L, such that d(S, T_i) ≤ s, where d is a distance function computed from M.
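The definition can be made concrete with a small sketch. Two assumptions are made here that the slide does not state: the distance d(S, T) is taken to be the sum of the per-position penalties M(S[j], T[j]), and support is counted over every length-L window of every database sequence. The penalty values and the toy alphabet {A, V, G} are hypothetical.

```python
def distance(s, t, M):
    """Assumed form of d: sum of per-position penalties from the matrix M."""
    return sum(M[(x, y)] for x, y in zip(s, t))


def support(model, database, M, s):
    """Count length-L windows in the database within distance s of the model."""
    L = len(model)
    hits = 0
    for seq in database:
        for i in range(len(seq) - L + 1):
            if distance(model, seq[i:i + L], M) <= s:
                hits += 1
    return hits


def is_motif(model, database, M, s, k):
    """The model is an (L, M, s, k) motif if its support reaches k."""
    return support(model, database, M, s) >= k


# Hypothetical penalties: 0 on the diagonal, 1 for the similar pair A/V,
# 3 for every other pair.
symbols = "AVG"
M = {(x, y): 0 if x == y else 3 for x in symbols for y in symbols}
M[("A", "V")] = M[("V", "A")] = 1

database = ["GAVAG", "VVVA", "AGA"]
print(support("AVA", database, M, s=1))      # windows AVA and VVA qualify
print(is_motif("AVA", database, M, s=1, k=2))
```

Raising s admits more distant windows, so the same model string can become a motif under a looser threshold.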

(Cont.)
Protein motif mining: some amino acids are very similar to each other, while others are very different. Because Alanine and Valine are both hydrophobic, the matrix can assign a small penalty M(X, Y) when X and Y are similar in this way. When Alanine is matched to Glycine, a large penalty is assigned, because Glycine is a hydrophilic amino acid.
Related models: the (L, d, k) model and the (L, f, d, k) model.

(L, d, k) model
The (L, d, k) model is a mismatch-based model for finding DNA motifs. The distance between two strings is the Hamming distance: M(X, Y) = 1 if X ≠ Y and M(X, Y) = 0 if X = Y.
The signature is usually a short string of DNA, 6-15 bases long. Occurrences of a signature are seldom identical and differ in a few positions.
For instance, take {ABCD, ACCD, ABCA}. If ABCD is the model sequence, the other two sequences are each within one mismatch of it, so these three sequences constitute a (4, 1, 3) motif.
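The (4, 1, 3) example can be verified with a few lines of code. This is a sketch only; the function names `hamming` and `is_ldk_motif` are illustrative, and the instances are passed in as a ready-made list rather than extracted from longer sequences.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def is_ldk_motif(model, instances, d, k):
    """True if at least k instances are within d mismatches of the model."""
    L = len(model)
    close = [t for t in instances if len(t) == L and hamming(model, t) <= d]
    return len(close) >= k


instances = ["ABCD", "ACCD", "ABCA"]
print([hamming("ABCD", t) for t in instances])    # distances 0, 1, 1
print(is_ldk_motif("ABCD", instances, d=1, k=3))  # a (4, 1, 3) motif
print(is_ldk_motif("ABCD", instances, d=0, k=3))  # fails with no mismatches
```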

(L, f, d, k) model
This model builds on the (L, d, k) model by adding positional constraints on the mismatches: it specifies the number of fixed-position mismatches. The set {ABCD, ACCD, ADCD} forms a (4, 1, 3) motif, but the mismatches, whenever they occur, are always in position two (AcCD, AdCD).
The advantage of this model: consider a DNA database consisting of 5 sequences, each of length 500. Assume that each sequence contains the motif GTGAACAC and that each instance of the motif has a mismatch at the fifth position. This pattern can be described as an (8, 1, 0, 5) motif or as an (8, 1, 5) motif. If we use the (L, d, k) model to retrieve the pattern, we end up with many extraneous hits that might not be meaningful.
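One way to read the (L, f, d, k) model is sketched below. The interpretation is an assumption, since the slide does not define the parameters precisely: up to f positions are treated as wildcards shared by all instances, and each instance may have at most d further mismatches outside those positions. The function names are illustrative.

```python
from itertools import combinations


def mismatch_positions(model, t):
    """Set of 0-based positions where t differs from the model."""
    return {i for i, (a, b) in enumerate(zip(model, t)) if a != b}


def is_lfdk_motif(model, instances, f, d, k):
    """Assumed reading: some set of f shared positions absorbs mismatches,
    and at least k instances have at most d mismatches elsewhere."""
    L = len(model)
    for fixed in combinations(range(L), f):
        fixed = set(fixed)
        hits = sum(
            1 for t in instances
            if len(t) == L and len(mismatch_positions(model, t) - fixed) <= d
        )
        if hits >= k:
            return True
    return False


same_position = ["ABCD", "ACCD", "ADCD"]   # mismatches only in position two
mixed_position = ["ABCD", "ACCD", "ABCA"]  # mismatches in different positions

print(is_lfdk_motif("ABCD", same_position, f=1, d=0, k=3))
print(is_lfdk_motif("ABCD", mixed_position, f=1, d=0, k=3))
```

Both sets satisfy the looser (4, 1, 3) condition, but only the first satisfies the positional constraint, which is how the stricter model filters out extraneous hits.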

FLAME algorithm
FLAME first constructs a data suffix tree and a model suffix tree. It then traverses the nodes of the model space in depth-first order, using two subroutines:
Evaluate_support: computes the list of matches and the support.
Expand_Matches: ensures that the number of mismatches to the model string does not exceed d.
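The depth-first exploration can be illustrated with a much simplified sketch. It is not the FLAME implementation: it replaces both suffix trees with a flat list of candidate windows, uses the Hamming-based (L, d, k) setting, and counts support over all length-L windows. The names `find_motifs` and `expand_matches` are illustrative and only loosely echo the subroutines named on the slide.

```python
def expand_matches(matches, symbol, depth, database, d):
    """Extend each partial match by one model symbol and drop any match
    whose mismatch count would exceed d."""
    extended = []
    for seq_id, start, mismatches in matches:
        observed = database[seq_id][start + depth]
        new_mismatches = mismatches + (observed != symbol)
        if new_mismatches <= d:
            extended.append((seq_id, start, new_mismatches))
    return extended


def find_motifs(database, alphabet, L, d, k):
    """Depth-first walk over model strings, pruning prefixes whose
    support has already fallen below k."""
    motifs = {}

    def dfs(prefix, matches):
        if len(matches) < k:
            return  # no extension of this prefix can reach support k
        if len(prefix) == L:
            motifs[prefix] = len(matches)
            return
        for symbol in alphabet:
            dfs(prefix + symbol,
                expand_matches(matches, symbol, len(prefix), database, d))

    # Every length-L window starts as a match of the empty model string.
    root = [(i, j, 0)
            for i, seq in enumerate(database)
            for j in range(len(seq) - L + 1)]
    dfs("", root)
    return motifs


database = ["ABCD", "ACCD", "ABCA"]
print(find_motifs(database, "ABCD", L=4, d=1, k=3))
```

Pruning is safe because the matches of a longer model string are always a subset of the matches of its prefix, so support can only shrink as the walk goes deeper.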

(Cont.)
The data suffix tree indexes the input sequences.
The model suffix tree is built on the set of all possible model strings. It helps guide the exploration of the model space in a way that avoids redundant work.

(Cont.) The strategy of pruning the model suffix tree
Assume that the dataset consists of sequences over the alphabet {A, B, C, D, E}. All strings of length L that start with the symbol A form a subset of the model space.
[Figure: model suffix tree node A with children AA, AB, AC, AD, AE, ...]

(Cont.)
No mismatches are allowed.
[Figure: model suffix tree node A with children AA, AB, AC, AD, AE, ...]

(Cont.)
Mismatches are allowed.
[Figure: model suffix tree node A with children AA, AB, AC, AD, AE, ..., and node B]

Experimental results


Conclusion
This paper presented a powerful new model, the (L, M, s, k) model. It also presented FLAME, an algorithm for finding (L, M, s, k) motifs.