Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, and Jiawei Han SIGMOD 2002 Presented by: Eddie Date: 2002/12/23.



Outline Introduction A Model of Obscure Patterns A Border Collapsing Approach Experimental Results Some thoughts

Introduction In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact on the biological function of the protein. Such a substitution may prevent an occurrence of a pattern from being recognized and, in turn, slash the support of that pattern. It would be desirable if the occurrence of D in the observation could be related to a possible mutation from N in an appropriate manner.

Introduction To accommodate the above circumstance, it is necessary to allow some flexibility in pattern matching. A so-called compatibility matrix is introduced to give a clear representation of the likelihood of symbol substitutions. – The following matrix shows an example of the compatibility matrix between 5 symbols $d_1$, $d_2$, $d_3$, $d_4$, and $d_5$. An entry $C(d_i, d_j) > 0$ indicates that $d_i$ might be (mis)represented as $d_j$ in the observation, while $C(d_i, d_j) = 0$ implies that the symbol $d_i$ cannot be represented as $d_j$ despite the presence of noise.

Introduction The compatibility matrix provides a meaningful measure of how likely each substitution is, and the assessment of each entry has a great impact on the final result. In practice, this matrix can either be given by a domain expert or learned from a training data set. [Table: compatibility matrix for $d_1, \ldots, d_5$; axes: observed value, true value]

A Model of Obscure Patterns The authors propose a new metric, namely match, to characterize the significance of a pattern in a symbol sequence. In particular, they use conditional probability to assess the match of the pattern. Def. Given a pattern $P = d_1 d_2 \ldots d_l$ and a subsequence $s = d_1' d_2' \ldots d_l'$ of $l$ symbols, the match of $P$ in $s$, written $M(P, s)$, is defined as the conditional probability $\mathrm{Prob}(P \mid s)$: $M(P, s) = \prod_{1 \le i \le l} C(d_i, d_i')$. For example, for $P_1 = d_1 d_2$ and the subsequence $s = d_1 d_3$, $M(P_1, s) = C(d_1, d_1) \times C(d_2, d_3) = 0.9 \times 0.05 = 0.045$, whereas $P_1$ does not match the subsequence $d_1 d_5$ because $C(d_1, d_1) \times C(d_2, d_5) = 0.9 \times 0 = 0$.
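The product in the definition can be sketched in a few lines of Python. This is only an illustration: the dictionary layout keyed by (true symbol, observed symbol), the function name `match_in_subsequence`, and the choice to treat missing entries as 0 are assumptions, and only the three matrix entries quoted on the slide are used.

```python
from math import prod

# Sparse compatibility matrix keyed by (true symbol, observed symbol).
# Only the three entries quoted on the slide are included; treating every
# missing entry as 0 is an assumption made for this sketch.
C = {
    ("d1", "d1"): 0.9,
    ("d2", "d3"): 0.05,
    ("d2", "d5"): 0.0,
}


def match_in_subsequence(pattern, subseq, compat):
    """Return M(P, s): the product of C(d_i, d_i') over aligned positions."""
    if len(pattern) != len(subseq):
        raise ValueError("pattern and subsequence must have the same length")
    return prod(compat.get((p, o), 0.0) for p, o in zip(pattern, subseq))


# Pattern d1 d2 against the observed subsequences d1 d3 and d1 d5.
print(match_in_subsequence(["d1", "d2"], ["d1", "d3"], C))
print(match_in_subsequence(["d1", "d2"], ["d1", "d5"], C))
```

A single zero entry drives the whole product to zero, which is how the model expresses that a pattern cannot have produced a given observation.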

A Model of Obscure Patterns Def. For a symbol sequence $S$ and a pattern $P$, the match of $P$ in $S$ is defined as the maximal match of $P$ over all distinct subsequences of $S$. That is, $M(P, S) = \max_{s \in S} M(P, s)$. Dynamic programming is used to compute this match. – E.g.: the match of $P = d_1 d_2$ in the sequence $d_1 d_3 d_2 d_3 d_4 d_1$. [Table: dynamic programming table with columns $d_1, d_3, d_2, d_3, d_4, d_1$; first entry 0.9]
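One way to picture the dynamic programming step is the sketch below. It assumes that a "subsequence" need not be contiguous, which the slide does not state, so treat it as one possible reading rather than the authors' exact procedure. The entry C(d2, d2) = 0.8 is a hypothetical value added only so the example has something to maximize over; the other two entries come from the slide.

```python
def match_in_sequence(pattern, sequence, compat):
    """Best match of `pattern` over all (possibly non-contiguous) subsequences.

    `compat` maps (true symbol, observed symbol) to a compatibility value;
    missing entries are treated as 0 (an assumption of this sketch).
    """
    # best[j] = highest match of the first j pattern symbols seen so far.
    best = [1.0] + [0.0] * len(pattern)
    for observed in sequence:
        # Walk j downwards so each observed symbol is used at most once.
        for j in range(len(pattern), 0, -1):
            candidate = best[j - 1] * compat.get((pattern[j - 1], observed), 0.0)
            if candidate > best[j]:
                best[j] = candidate
    return best[len(pattern)]


compat = {
    ("d1", "d1"): 0.9,   # from the slide
    ("d2", "d3"): 0.05,  # from the slide
    ("d2", "d2"): 0.8,   # hypothetical value, for illustration only
}
sequence = ["d1", "d3", "d2", "d3", "d4", "d1"]
print(match_in_sequence(["d1", "d2"], sequence, compat))
```

The table has one cell per pattern prefix, so the cost is proportional to the pattern length times the sequence length.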

A Model of Obscure Patterns Def. Given a pattern $P$ and a database $D$ of $N$ sequences, the match of $P$ in $D$ is the average match of $P$ over every sequence in $D$, i.e. $M(P, D) = \frac{1}{N} \sum_{S \in D} M(P, S)$. Similar to the traditional support model, all patterns that meet the min_match threshold are referred to as frequent patterns.
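A small sketch of the database-level match and the threshold test follows. It assumes the per-sequence matches have already been computed; the function names and the sample numbers are made up for illustration, and "meets the threshold" is read here as greater than or equal to min_match.

```python
def database_match(per_sequence_matches):
    """Average of M(P, S) over all sequences S in the database."""
    if not per_sequence_matches:
        raise ValueError("the database must contain at least one sequence")
    return sum(per_sequence_matches) / len(per_sequence_matches)


def frequent_patterns(matches_by_pattern, min_match):
    """Return the patterns whose database match meets min_match."""
    return {
        pattern
        for pattern, matches in matches_by_pattern.items()
        if database_match(matches) >= min_match
    }


# Hypothetical per-sequence matches for two patterns over three sequences.
matches_by_pattern = {
    ("d1", "d2"): [0.72, 0.045, 0.0],
    ("d3", "d4"): [0.01, 0.0, 0.02],
}
print(frequent_patterns(matches_by_pattern, min_match=0.02))
```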

A Model of Obscure Patterns The match metric satisfies the Apriori property: the match of a pattern never exceeds the match of any of its subpatterns. A direct implication is that, given a min_match threshold, the set of frequent patterns occupies a “contiguous portion” of the pattern lattice and can be described using the notion of a border.

A Border Collapsing Approach The Chernoff bound is used to estimate the range of the match of a pattern from a sample of the data with high statistical confidence (e.g., 99.99%). Chernoff bound: Let $X$ be a random variable whose spread is $R$ (e.g., 1), and suppose we have $n$ independent observations of $X$ with mean $\mu$. The Chernoff bound states that, with probability $1 - \delta$, the true mean of $X$ is at least $\mu - \varepsilon$, where $\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$. E.g., if $\mu$ is the mean of the sampled values and $\delta = 0.0001$, the true mean is at least $\mu - \varepsilon$ with 99.99% confidence.
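To get a feel for the size of the error margin, the bound $\varepsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ can be evaluated directly. The function name `chernoff_epsilon` and the sample size below are illustrative assumptions, not values from the paper.

```python
import math


def chernoff_epsilon(spread, n, delta):
    """Error margin epsilon for spread R, n observations, confidence 1 - delta."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(spread ** 2 * math.log(1.0 / delta) / (2.0 * n))


# Hypothetical setting: spread 1, a sample of 10,000 sequences, 99.99% confidence.
print(chernoff_epsilon(spread=1.0, n=10_000, delta=1e-4))
```

The margin shrinks with the square root of the sample size and grows linearly with the spread, so a tighter spread is a much cheaper way to reduce it than a larger sample.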

A Border Collapsing Approach Claim (Chernoff bound estimation) Given a set of sample data and a threshold min_match, a pattern is frequent with probability $1 - \delta$ if $\mu_{match} > \text{min\_match} + \varepsilon$ and is infrequent with probability $1 - \delta$ if $\mu_{match} < \text{min\_match} - \varepsilon$. The remaining patterns stay undecided and are referred to as ambiguous patterns. – For example, assume min_match = 0.02, $\varepsilon = 0.003$, $\mu_{match}(d_1 d_2) = 0.025$, $\mu_{match}(d_3 d_4) = 0.015$, and $\mu_{match}(d_5 d_6) = 0.021$. Then $d_1 d_2$ is frequent, $d_3 d_4$ is infrequent, and $d_5 d_6$ is ambiguous.
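The three-way decision in the claim amounts to two comparisons. A minimal sketch, using the numbers from the slide's example (the function name `classify` is an assumption):

```python
def classify(sample_match, min_match, epsilon):
    """Label a pattern from its sample match, per the Chernoff bound claim."""
    if sample_match > min_match + epsilon:
        return "frequent"
    if sample_match < min_match - epsilon:
        return "infrequent"
    return "ambiguous"


# Values taken from the example on the slide.
sample_matches = {"d1 d2": 0.025, "d3 d4": 0.015, "d5 d6": 0.021}
for pattern, mu in sample_matches.items():
    print(pattern, classify(mu, min_match=0.02, epsilon=0.003))
```

Only the patterns labeled ambiguous need to be checked against the full database, which is what makes the sampling step worthwhile.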

A Border Collapsing Approach Claim (Restricted spread) The restricted spread $R$ for the match of a pattern $d_1 d_2 \ldots d_l$ is $R = \min_{1 \le i \le l} match[d_i]$, where $match[d_i]$ is the match of the symbol $d_i$ in the entire sequence database. – For example, if the matches of $d_1$ and $d_2$ in the database are 0.1 and 0.05 respectively, then the match of $d_1 d_2$ has to be between 0 and 0.05 in the database. Thus, we can use $R = 0.05$ instead of $R = 1$ and, because $\varepsilon$ is proportional to $R$, reduce the value of $\varepsilon$ by 95%.
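The effect of the restricted spread on the error margin can be checked numerically. In this sketch the symbol matches come from the slide's example, while the sample size, the confidence level, and the function names are assumptions chosen for illustration.

```python
import math


def restricted_spread(pattern, symbol_match):
    """R = the smallest database match among the pattern's symbols."""
    return min(symbol_match[symbol] for symbol in pattern)


def chernoff_epsilon(spread, n, delta):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(spread ** 2 * math.log(1.0 / delta) / (2.0 * n))


symbol_match = {"d1": 0.1, "d2": 0.05}      # from the slide's example
R = restricted_spread(["d1", "d2"], symbol_match)

n, delta = 10_000, 1e-4                     # hypothetical sample size and delta
eps_default = chernoff_epsilon(1.0, n, delta)
eps_restricted = chernoff_epsilon(R, n, delta)
print(R, eps_restricted / eps_default)
```

The ratio of the two margins equals $R$ itself, independent of the sample size and confidence level, which is where the 95% reduction comes from.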

A Border Collapsing Approach Hence, the following three-phase algorithm is developed for mining the obscure patterns: – Find the match of individual symbols and take a random sample of sequences – Identify the borders that embrace the set of ambiguous patterns (i.e., those whose match is between min_match − ε and min_match + ε) – Locate the border of frequent patterns via border collapsing

A Border Collapsing Approach

Experimental Results Accuracy and completeness of the two models While the accuracy and completeness of the match model remain very high, the quality of the results from the support model degrades significantly. For example, when $\alpha = 0.6$, the accuracy and completeness of the support model are 61% and 33%, respectively.

Experimental Results Accuracy and completeness of the two models For a given degree of noise (e.g., $\alpha = 0.1$), the figure on this slide compares the accuracy and completeness of the support and match models across different pattern lengths.

Some thoughts They devised a border-collapsing algorithm, which is very helpful for speeding up the reduction of candidates. I learned a novel technique for database sampling. This paper made me aware of the noise problem. I think I should give it more consideration in my thesis.