Verifying and Mining Frequent Patterns from Large Windows. ICDE2008. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. Date: 2008/9/25. Speaker: Li, HueiJyun.

Presentation transcript:

1 Verifying and Mining Frequent Patterns from Large Windows. ICDE2008. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. Date: 2008/9/25. Speaker: Li, HueiJyun. Advisor: Dr. Koh, JiaLing.

2 Outline: Introduction; Mining Large Sliding Windows; Verification; Double-Tree Verifier (DTV); Depth-First Verifier (DFV); Hybrid Version of Our Verifiers; Experiments; Conclusion

3 Introduction Normally, finding new rules requires both machines and domain experts, so delays by the mining algorithms in detecting new frequent itemsets are acceptable.

4 Introduction The paper proposes an algorithm for incremental mining of frequent itemsets that compares favorably with existing algorithms when real-time response is required. The performance of the proposed algorithm improves further when small delays are acceptable.

5 Introduction The on-line verification of old rules is highly desirable in most application scenarios. The paper proposes verifiers that check the frequency of previously frequent itemsets over newly arriving windows.

6 Mining Large Sliding Windows * Problem Statement and Notations
D: the dataset to be mined (a window), containing several transactions
Count(p, D): the frequency of an itemset p in D
sup(p, D): the support of p in D
α: the minimum support threshold
σ_α(D): the set of frequent itemsets in D
n = |W| / |S|: the number of slides in each window (W: the window, S: a slide)
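The notation above can be made concrete with a small brute-force sketch. It is illustrative only and not the paper's miner: the function names are hypothetical, and it assumes that sup(p, D) is the relative frequency Count(p, D) / |D|, which the slide does not spell out.

```python
from itertools import combinations

def count(p, D):
    """Count(p, D): number of transactions in D that contain every item of p."""
    p = frozenset(p)
    return sum(1 for t in D if p <= set(t))

def sup(p, D):
    """sup(p, D), assumed here to be Count(p, D) / |D|."""
    return count(p, D) / len(D)

def frequent_itemsets(D, alpha):
    """sigma_alpha(D): all non-empty itemsets with support >= alpha (brute force)."""
    items = sorted({i for t in D for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            c = count(cand, D)
            if c >= alpha * len(D):
                result[frozenset(cand)] = c
    return result

# A toy window of four transactions.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(count({"a", "b"}, D), sup({"a"}, D))
print(frequent_itemsets(D, alpha=0.5))
```

Enumerating every candidate itemset is exponential in the number of items, so this is only suitable for toy inputs.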

7 Mining Large Sliding Windows * The SWIM Algorithm The Sliding Window Incremental Miner (SWIM) always maintains the union of the frequent patterns of all slides in the current window W. This union, called the Pattern Tree (PT), is guaranteed to be a superset of the frequent patterns over W.

8 Mining Large Sliding Windows * The SWIM Algorithm SWIM mines each new slide and adds its frequent patterns to PT. For each pattern, an auxiliary array, aux_array, stores the pattern's frequency for each window in which that frequency is not yet known. This counting can be done either eagerly (i.e., immediately) or lazily.

9 Mining Large Sliding Windows * The SWIM Algorithm At the end of each slide, SWIM outputs all patterns in PT whose frequency at that time is ≥ α·n·|S|. A few patterns may be missed due to lack of knowledge at the time of output, but they are reported as delayed when other slides expire.
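The superset property and the output threshold can be illustrated with a deliberately simplified sketch. It is not SWIM itself: the class and function names are hypothetical, PT is held as a plain set rather than a tree, the per-slide miner is brute force, and every count is recomputed from scratch at each slide (an eager strategy with no aux_array and no delayed reports).

```python
from collections import deque
from itertools import combinations

def count(p, slide):
    """Number of transactions in the slide that contain pattern p."""
    return sum(1 for t in slide if p <= t)

def mine_slide(slide, alpha):
    """Brute-force stand-in for the slide miner."""
    items = sorted({i for t in slide for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            p = frozenset(cand)
            if count(p, slide) >= alpha * len(slide):
                found.add(p)
    return found

class EagerSwimSketch:
    def __init__(self, alpha, n):
        self.alpha = alpha      # minimum support threshold
        self.n = n              # number of slides per window
        self.window = deque()   # (slide, frequent patterns of that slide)

    def add_slide(self, slide):
        slide = [frozenset(t) for t in slide]
        self.window.append((slide, mine_slide(slide, self.alpha)))
        if len(self.window) > self.n:
            self.window.popleft()  # the oldest slide expires
        # PT: union of the frequent patterns of all slides in the window.
        pt = set().union(*(pats for _, pats in self.window))
        # With n equal-sized slides this equals alpha * n * |S|.
        threshold = self.alpha * sum(len(s) for s, _ in self.window)
        report = {}
        for p in pt:
            freq = sum(count(p, s) for s, _ in self.window)
            if freq >= threshold:
                report[p] = freq
        return report

swim = EagerSwimSketch(alpha=0.5, n=2)
swim.add_slide([{"a", "b"}, {"a"}])
swim.add_slide([{"a", "b"}, {"b"}])
print(swim.add_slide([{"c"}, {"c"}]))
```

Any pattern that is frequent over the whole window must be frequent in at least one slide, which is why scanning only the patterns in PT is enough.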

10 Mining Large Sliding Windows * The SWIM Algorithm Example 1: assume that the input stream is partitioned into slides S_1, S_2, … and that each window contains 3 slides. Consider a pattern p that shows up as frequent in S_4 for the first time.
p.f_i: the frequency of p in the i-th slide
p.freq: p's cumulative frequency in the current window

11 Mining Large Sliding Windows * The SWIM Algorithm
W_4 = {S_2, S_3, S_4}: p.freq = p.f_4; p.aux_array =
W_5 = {S_3, S_4, S_5}: p.freq = p.f_4 + p.f_5; p.aux_array =
W_6 = {S_4, S_5, S_6}: p.freq = p.f_4 + p.f_5 + p.f_6; p.aux_array =

12 Mining Large Sliding Windows * The SWIM Algorithm

13 Mining Large Sliding Windows * SWIM with Adjusted Delay Bound SWIM can easily be modified to allow only a given maximum delay of L slides (0 ≤ L ≤ n-1). Choosing L = 0 guarantees that all frequent patterns are reported immediately once they become frequent in W. Choosing L = n-1 leads to the laziest approach: SWIM waits until a slide expires and only then computes the frequency of new patterns and updates the aux_arrays accordingly.

14 Verification Definition 1: Let D be a transactional database, P a given set of arbitrary patterns, and min_freq a given minimum frequency. A function f is called a verifier if it takes D, P and min_freq as input and, for each p ∈ P, does one of the following: 1. returns p's true frequency in D if p has occurred at least min_freq times; 2. otherwise, reports that p has occurred fewer than min_freq times.

15 Verification In the special case of min_freq = 0, a verifier simply counts the frequency of every p ∈ P. In general, if min_freq > 0, the verifier can skip any pattern whose frequency will be less than min_freq. Verification only checks counts for a given set of patterns; it does not discover additional patterns.
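As a baseline for Definition 1, a verifier can be written as a plain scan with an early exit. This is a naive sketch, not DTV or DFV; the name naive_verifier and the use of None to mean "fewer than min_freq occurrences" are assumptions made for illustration.

```python
def naive_verifier(D, P, min_freq):
    """For each p in P, return its true frequency in D, or None when p
    occurs fewer than min_freq times."""
    D = [frozenset(t) for t in D]
    result = {}
    for p in P:
        p = frozenset(p)
        freq = 0
        for seen, t in enumerate(D, start=1):
            if p <= t:
                freq += 1
            # Skip the pattern once it can no longer reach min_freq,
            # even if every remaining transaction contained it.
            if freq + (len(D) - seen) < min_freq:
                break
        result[p] = freq if freq >= min_freq else None
    return result

D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
P = [{"a"}, {"a", "b"}, {"b", "c"}]
print(naive_verifier(D, P, min_freq=2))
print(naive_verifier(D, P, min_freq=0))  # min_freq = 0: plain counting
```

With min_freq = 0 the early exit never fires, so the function degenerates to counting every pattern, as in the special case described above.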

16 Verification * Background

17 Verification * Background The following notation is used to describe the verifiers:
u.item: the item represented by node u
u.freq: the frequency of u, or NIL when unknown
u.children: the set of u's children
head(c): the set of all nodes holding item c
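The node notation maps naturally onto a small prefix-tree structure. The sketch below is an assumption-laden illustration: Node and build_fp_tree are hypothetical names, children are kept in a dict keyed by item rather than a set, NIL is modelled as None, and items are inserted in lexicographic order for simplicity (a standard fp-tree orders items by descending frequency).

```python
from collections import defaultdict

class Node:
    def __init__(self, item=None, parent=None, freq=None):
        self.item = item      # u.item
        self.freq = freq      # u.freq, None standing in for NIL
        self.children = {}    # u.children, keyed by item
        self.parent = parent

def build_fp_tree(transactions):
    """Insert each transaction as a path; return (root, head table)."""
    root = Node()
    head = defaultdict(list)  # head[c]: all nodes holding item c
    for t in transactions:
        u = root
        for item in sorted(set(t)):
            child = u.children.get(item)
            if child is None:
                child = Node(item, parent=u, freq=0)
                u.children[item] = child
                head[item].append(child)
            child.freq += 1
            u = child
    return root, head

root, head = build_fp_tree([["a", "b"], ["a", "c"], ["a", "b", "c"]])
# The total frequency of an item is the sum over its head list.
print(sum(u.freq for u in head["c"]))
```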

18 Verification * Background The verifiers use another data structure called a pattern tree: an fp-tree in which each node represents a unique pattern. A verifier algorithm computes the frequency of all patterns in a given pattern tree. Initially the frequency of each node in the pattern tree is 0; after verification, each node contains the true frequency of the pattern it represents.
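A pattern tree and its "0 before, true frequency after" contract can be sketched as a trie in which the path from the root to a node spells out that node's pattern. The names PatternNode, insert_pattern and verify_naively are hypothetical, and the verification step is a brute-force scan standing in for DTV or DFV.

```python
class PatternNode:
    def __init__(self, item=None):
        self.item = item
        self.freq = 0        # 0 until the tree has been verified
        self.children = {}

def insert_pattern(root, pattern):
    """Add a pattern as a path of sorted items; return its final node."""
    u = root
    for item in sorted(set(pattern)):
        if item not in u.children:
            u.children[item] = PatternNode(item)
        u = u.children[item]
    return u

def verify_naively(root, D):
    """Fill in the true frequency of every node's pattern by scanning D."""
    D = [frozenset(t) for t in D]

    def visit(u, prefix):
        for item, child in u.children.items():
            pattern = prefix | {item}
            child.freq = sum(1 for t in D if pattern <= t)
            visit(child, pattern)

    visit(root, frozenset())

root = PatternNode()
ab = insert_pattern(root, {"a", "b"})
c = insert_pattern(root, {"c"})
verify_naively(root, [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}])
print(ab.freq, c.freq)
```

Inserting a pattern also creates nodes for its prefixes, so those prefixes are verified as patterns in their own right.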

19 Verification * Double Tree Verifier (DTV)

20 Verification * Double Tree Verifier (DTV) Theoretically, the running time of DTV could be exponential in the worst case, due to the exponential nature of frequent pattern mining. In practice, the number of frequent patterns is small when min_freq is not very small.

21 Verification * Double Tree Verifier (DTV) The number of different paths in the execution tree of DTV is bounded by the total number of different patterns in the pattern tree. The advantage of DTV increases when the minimum support decreases.

22 Verification * Depth-First Verifier (DFV) DFV exploits the following optimizations. Ancestor Failure: if a path in the fp-tree has already been shown not to contain a prefix of the pattern p, then it does not contain p itself either.

23 Verification * Depth-First Verifier (DFV) Smaller Sibling Equivalence: if a path in the fp-tree has already been marked as containing (or not containing) a smaller sibling of the pattern p, then it does (or does not) contain p itself too. Parent Success: if a path in the fp-tree has already been marked as containing the parent pattern of p, then it also contains p.

24

25 Verification * Depth-First Verifier (DFV) Definition 2 (smallest decisive ancestor): for a given pattern node u and an fp-tree node s, the smallest decisive ancestor of s is its lowest ancestor t for which either t.item < u.item or t.item ≠ NIL.

26 Verification * Hybrid Version of Our Verifier DTV is faster than DFV when there are many transactions in the fp-tree and many patterns in the pattern tree. When the trees are small, DFV is more efficient because the conditionalization overhead is high. The hybrid verifier starts with DTV until the conditionalized trees are "small enough" and then switches to DFV.

27 Experiments Setup: a P4 machine running Linux, with 1 GB of RAM. The algorithms are implemented in C. Datasets: the IBM QUEST data generator and the Kosarak real-world dataset.

28 Experiments * Efficiency of Verification Algorithms

29 Experiments * Efficiency of Verification Algorithms * T20I5D50K

30 Experiments * Efficiency of SWIM Algorithms * T20I5D1000K, support = 1%

31 Experiments * Efficiency of SWIM Algorithms

32 Conclusion The paper introduces a very fast algorithm for verifying the frequency of a given set of patterns. Performance improves further by simply allowing a small reporting delay.