

VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science and Technology, Hong Kong, China 2 Northeastern University, China 3 University of Illinois at Chicago, USA

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 2

Motivation Example
 In an intelligent traffic system, many sensors are deployed to collect real-time monitoring data in order to analyze traffic jams.

TID  Location  Weather  Time          Speed  Probability
T1   HKUST     Foggy    8:30-9:00 AM  ...    ...
T2   HKUST     Rainy    5:30-6:00 PM  ...    ...
T3   HKUST     Sunny    3:30-4:00 PM  ...    ...
T4   HKUST     Rainy    5:30-6:00 PM  ...    ...

Motivation Example (cont'd)
 According to the above data, we analyze the causes of traffic jams from the viewpoint of uncertain frequent pattern mining.
 For example, we find that {Time = 5:30-6:00 PM; Weather = Rainy} is a frequent itemset with high probability.
 Therefore, under the condition {Time = 5:30-6:00 PM; Weather = Rainy}, a traffic jam is very likely.

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 5

Deterministic Frequent Itemset Mining
 Itemset: a set of items, such as {abc} in the table below.
 Transaction: a tuple <tid, T>, where tid is the identifier and T is an itemset; e.g., the first row of the table below is a transaction.

A Transaction Database
TID  Transaction
T1   a b c d e
T2   a b c d
T3   a b c f
T4   a b c e

 Support: Given an itemset X, the support of X, sup(X), is the number of transactions containing X, e.g., sup({abc}) = 4.
 Frequent Itemset: Given a transaction database TDB, an itemset X, and a minimum support σ, X is a frequent itemset iff sup(X) ≥ σ. For example, given σ = 2, {abcd} is a frequent itemset.
 The support of an itemset is simply a count in deterministic frequent itemset mining!
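The support count in the deterministic case is a plain containment count. A minimal Python sketch over the example database above (function and variable names are illustrative):

```python
# The example transaction database from the table above.
tdb = {
    "T1": {"a", "b", "c", "d", "e"},
    "T2": {"a", "b", "c", "d"},
    "T3": {"a", "b", "c", "f"},
    "T4": {"a", "b", "c", "e"},
}

def support(itemset, tdb):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in tdb.values() if itemset <= t)

print(support({"a", "b", "c"}, tdb))       # 4
print(support({"a", "b", "c", "d"}, tdb))  # 2
```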

Deterministic FIM vs. Uncertain FIM
 Transaction: a tuple <tid, UT>, where tid is the identifier and UT = {u1(p1), ..., um(pm)} contains m units; each unit has an item ui and an appearance probability pi.

An Uncertain Transaction Database
TID  Transaction
T1   a(0.8) b(0.2) c(0.9) d(0.5) e(0.9)
T2   a(0.8) b(0.7) c(0.9) d(0.5) f(0.7)
T3   a(0.5) c(0.9) f(0.1) g(0.4)
T4   b(0.5) f(0.1)

 Support: Given an uncertain database UDB and an itemset X, the support of X, denoted sup(X), is a random variable.
 How should the concept of a frequent itemset be defined over uncertain databases? There are currently two kinds of definitions:
 Expected support-based frequent itemset.
 Probabilistic frequent itemset.

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 8

Evaluation Goals
 Explain the relationship between the two existing definitions of frequent itemsets over uncertain databases.
– The support of an itemset follows a Poisson binomial distribution.
– When the data size is large, the expected support approximates the frequent probability with high confidence.
 Clarify the contradictory conclusions in existing research.
– Can the FP-growth framework still work in uncertain environments?
 Provide a uniform baseline implementation and an objective experimental evaluation of algorithm performance.
– Analyze the effect of the Chernoff bound on the uncertain frequent itemset mining problem.

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusion 10

Expected Support-based Frequent Itemset
 Expected Support
– Given an uncertain transaction database UDB containing N transactions and an itemset X, the expected support of X is esup(X) = Σ_{i=1}^{N} Pr(X ⊆ Ti).
 Expected Support-based Frequent Itemset
– Given an uncertain transaction database UDB containing N transactions, a minimum expected support ratio min_esup, and an itemset X, X is an expected support-based frequent itemset if and only if esup(X) ≥ N × min_esup.
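A minimal sketch of the expected-support computation, assuming the usual item-level independence within each transaction; the data is the uncertain database from the examples slide, and the function name is illustrative:

```python
from math import prod

# The uncertain transaction database from the examples slide.
udb = [
    {"a": 0.8, "b": 0.2, "c": 0.9, "d": 0.5, "e": 0.9},
    {"a": 0.8, "b": 0.7, "c": 0.9, "d": 0.5, "f": 0.7},
    {"a": 0.5, "c": 0.8, "f": 0.1, "g": 0.4},
    {"b": 0.5, "f": 0.1},
]

def expected_support(itemset, udb):
    # Under item independence, Pr(X ⊆ Ti) is the product of the item
    # probabilities in Ti; a missing item contributes probability 0.
    return sum(prod(t.get(x, 0.0) for x in itemset) for t in udb)

print(round(expected_support({"a"}, udb), 2))  # 2.1
print(round(expected_support({"c"}, udb), 2))  # 2.6
```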

Probabilistic Frequent Itemset
 Frequent Probability
– Given an uncertain transaction database UDB containing N transactions, a minimum support ratio min_sup, and an itemset X, X's frequent probability, denoted Pr(X), is Pr(X) = Pr{sup(X) ≥ N × min_sup}.
 Probabilistic Frequent Itemset
– Given an uncertain transaction database UDB containing N transactions, a minimum support ratio min_sup, and a probabilistic frequent threshold pft, an itemset X is a probabilistic frequent itemset if and only if Pr(X) > pft.
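The frequent probability is defined over possible worlds. The brute-force sketch below enumerates all 2^N worlds, which is only feasible for tiny inputs but makes the semantics concrete; the per-transaction containment probabilities 0.8, 0.8, 0.5 are those of item a in the example database (T4 does not contain a):

```python
from itertools import product
from math import prod

def frequent_probability(probs, min_count):
    """Pr{sup(X) >= min_count}, by enumerating every possible world."""
    total = 0.0
    for world in product([0, 1], repeat=len(probs)):
        # Probability of this world under independent transactions.
        p_world = prod(p if bit else 1 - p for bit, p in zip(world, probs))
        if sum(world) >= min_count:
            total += p_world
    return total

# min_sup = 0.5 over N = 4 transactions gives the threshold 4 * 0.5 = 2.
print(round(frequent_probability([0.8, 0.8, 0.5], 2), 2))  # 0.8
```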

Examples of Problem Definitions

An Uncertain Transaction Database
TID  Transaction
T1   a(0.8) b(0.2) c(0.9) d(0.5) e(0.9)
T2   a(0.8) b(0.7) c(0.9) d(0.5) f(0.7)
T3   a(0.5) c(0.8) f(0.1) g(0.4)
T4   b(0.5) f(0.1)

 Expected Support-based Frequent Itemset
– Given the uncertain transaction database above and min_esup = 0.5, there are two expected support-based frequent itemsets, {a} and {c}, since esup(a) = 2.1 and esup(c) = 2.6 > 2 = 4 × 0.5.
 Probabilistic Frequent Itemset
– Given the uncertain transaction database above, min_sup = 0.5, and pft = 0.7, the frequent probability of {a} is Pr(a) = Pr{sup(a) ≥ 4 × 0.5} = Pr{sup(a) = 2} + Pr{sup(a) = 3} = 0.48 + 0.32 = 0.8 > 0.7 = pft.

The Probability Distribution of sup(a)
sup(a)       0     1     2     3
Probability  0.02  0.18  0.48  0.32

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 14

8 Representative Algorithms

Type                                           Algorithm   Highlights
Expected support-based frequent algorithms     UApriori    Apriori-based search strategy
                                               UFP-growth  UFP-tree index structure; pattern-growth search strategy
                                               UH-Mine     UH-struct index structure; pattern-growth search strategy
Exact probabilistic frequent algorithms        DP          Dynamic programming-based exact algorithm
                                               DC          Divide-and-conquer-based exact algorithm
Approximate probabilistic frequent algorithms  PDUApriori  Poisson distribution-based approximation algorithm
                                               NDUApriori  Normal distribution-based approximation algorithm
                                               NDUH-Mine   Normal distribution-based approximation algorithm; UH-struct index structure

Experimental Evaluation
 Characteristics of Datasets: number of transactions, number of items, average length, and density of the five datasets Connect, Accident, Kosarak, Gazelle, and T20I10D30KP.
 Default Parameters of Datasets: mean, variance, min_sup, and pft for each dataset.

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Existing Problems and Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusion 17

Expected Support-based Frequent Algorithms
 UApriori (C. K. Chui et al., in PAKDD'07 & 08)
– Extends the classical Apriori algorithm from deterministic frequent itemset mining.
 UFP-growth (C. Leung et al., in PAKDD'08)
– Extends the classical FP-tree data structure and FP-growth algorithm from deterministic frequent itemset mining.
 UH-Mine (C. C. Aggarwal et al., in KDD'09)
– Extends the classical H-struct data structure and H-Mine algorithm from deterministic frequent itemset mining.
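As a rough illustration of the UApriori idea (a simplified sketch, not the authors' implementation), the level-wise loop below prunes candidates by expected support, relying on its anti-monotonicity; the database is the one used on the UFP-growth slide, and all names are invented for the example:

```python
from itertools import combinations
from math import prod

udb = [
    {"a": 0.8, "b": 0.2, "c": 0.9, "d": 0.7, "f": 0.8},
    {"a": 0.8, "b": 0.7, "c": 0.9, "e": 0.5},
    {"a": 0.5, "c": 0.8, "e": 0.8, "f": 0.3},
    {"b": 0.5, "d": 0.5, "f": 0.7},
]

def esup(itemset, udb):
    # Expected support under item independence.
    return sum(prod(t.get(x, 0.0) for x in itemset) for t in udb)

def uapriori(udb, min_esup):
    threshold = len(udb) * min_esup
    frequent = {}
    level = [frozenset([x]) for x in sorted({x for t in udb for x in t})]
    while level:
        survivors = {c for c in level if esup(c, udb) >= threshold}
        frequent.update((c, esup(c, udb)) for c in survivors)
        # Generate (k+1)-candidates all of whose k-subsets survived;
        # expected support is anti-monotone, so this pruning is safe.
        candidates = set()
        for a in survivors:
            for b in survivors:
                u = a | b
                if len(u) == len(a) + 1 and all(
                    frozenset(s) in survivors for s in combinations(u, len(a))
                ):
                    candidates.add(u)
        level = candidates
    return frequent

result = uapriori(udb, 0.5)
print(sorted("".join(sorted(c)) for c in result))  # ['a', 'c']
```

With min_esup = 0.5 only {a} and {c} clear the threshold 4 × 0.5 = 2; the candidate {a, c} is generated but rejected, since esup({a, c}) ≈ 1.84 < 2.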

UFP-growth Algorithm

An Uncertain Transaction Database
TID  Transaction
T1   a(0.8) b(0.2) c(0.9) d(0.7) f(0.8)
T2   a(0.8) b(0.7) c(0.9) e(0.5)
T3   a(0.5) c(0.8) e(0.8) f(0.3)
T4   b(0.5) d(0.5) f(0.7)

[Figure: the UFP-tree built from this database]

UH-Mine Algorithm

UDB: An Uncertain Transaction Database
TID  Transaction
T1   a(0.8) b(0.2) c(0.9) d(0.7) f(0.8)
T2   a(0.8) b(0.7) c(0.9) e(0.5)
T3   a(0.5) c(0.8) e(0.8) f(0.3)
T4   b(0.5) d(0.5) f(0.7)

[Figures: the UH-struct generated from UDB, and the UH-struct for the head table of a]

Running Time
(a) Connect (Dense)  (b) Kosarak (Sparse)
Running Time w.r.t. min_esup

Memory Cost
(a) Connect (Dense)  (b) Kosarak (Sparse)
Memory Cost w.r.t. min_esup

Scalability
(a) Scalability w.r.t. Running Time  (b) Scalability w.r.t. Memory Cost

Review: UApriori vs. UFP-growth vs. UH-Mine
 Dense datasets: the UApriori algorithm usually performs very well.
 Sparse datasets: the UH-Mine algorithm usually performs very well.
 In most cases, the UFP-growth algorithm cannot outperform the other algorithms.

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 25

Exact Probabilistic Frequent Algorithms
 DP Algorithm (T. Bernecker et al., in KDD'09)
– Uses the recursion Pr(i, j) = p_j · Pr(i−1, j−1) + (1 − p_j) · Pr(i, j−1), where Pr(i, j) denotes the probability that the itemset occurs in at least i of the first j transactions and p_j is its appearance probability in transaction j.
– Computational complexity: O(N²)
 DC Algorithm (L. Sun et al., in KDD'10)
– Employs a divide-and-conquer framework to compute the frequent probability.
– Computational complexity: O(N log² N)
 Chernoff Bound-based Pruning
– Computational complexity: O(N)
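A sketch of the dynamic-programming recursion above in Python; keeping only the current column reduces memory to O(min_count), while the running time is O(N · min_count), i.e. O(N²) in the worst case. Names are illustrative:

```python
def frequent_probability_dp(probs, min_count):
    """Pr{sup(X) >= min_count} for per-transaction probabilities `probs`.

    P[i] holds Pr{at least i occurrences among the transactions seen so far}.
    """
    if min_count <= 0:
        return 1.0
    P = [1.0] + [0.0] * min_count  # state before any transaction is processed
    for p in probs:
        # Sweep i downward so P[i - 1] still refers to the previous column.
        for i in range(min_count, 0, -1):
            P[i] = p * P[i - 1] + (1 - p) * P[i]
    return P[min_count]

# Item a in the running example: probabilities 0.8, 0.8, 0.5, 0 over N = 4.
print(round(frequent_probability_dp([0.8, 0.8, 0.5, 0.0], 2), 2))  # 0.8
```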

Running Time
(a) Accident (Time w.r.t. min_sup)  (b) Kosarak (Time w.r.t. pft)

Memory Cost
(a) Accident (Memory w.r.t. min_sup)  (b) Kosarak (Memory w.r.t. pft)

Scalability
(a) Scalability w.r.t. Running Time  (b) Scalability w.r.t. Memory Cost

Review: DC vs. DP
 The DC algorithm is usually faster than DP, especially on large data.
– Time complexity of DC: O(N log² N)
– Time complexity of DP: O(N²)
 The DC algorithm trades higher memory cost for efficiency.
 Chernoff bound-based pruning usually enhances efficiency significantly.
– It filters out most infrequent itemsets.
– Time complexity of the Chernoff bound: O(N)
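The pruning step can be sketched as follows. The exact bound used in the paper is not reproduced here; this sketch uses one standard Chernoff form, Pr{S ≥ (1 + δ)μ} ≤ exp(−δ²μ / (2 + δ)) for δ > 0, and all names are illustrative:

```python
from math import exp

def chernoff_upper_bound(esup_x, min_count):
    """Upper bound on Pr{sup(X) >= min_count}; esup_x is the expected support."""
    if min_count <= esup_x:
        return 1.0  # the bound is uninformative in this regime
    delta = min_count / esup_x - 1.0
    return exp(-delta * delta * esup_x / (2.0 + delta))

def can_prune(esup_x, min_count, pft):
    # If even the upper bound falls below pft, the itemset cannot be
    # probabilistic frequent, so the exact computation is skipped.
    return chernoff_upper_bound(esup_x, min_count) < pft

print(can_prune(1.0, 10, 0.7))  # True: pruned in O(N) time
print(can_prune(3.0, 2, 0.7))   # False: the exact algorithm must decide
```

Computing the expected support takes one O(N) pass, which is why the pruning is so much cheaper than the O(N²) or O(N log² N) exact computations.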

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 31

Approximate Probabilistic Frequent Algorithms
 PDUApriori (L. Wang et al., in CIKM'10)
– Approximates the Poisson binomial distribution with a Poisson distribution.
– Uses the algorithmic framework of UApriori.
 NDUApriori (T. Calders et al., in ICDM'10)
– Approximates the Poisson binomial distribution with a normal distribution.
– Uses the algorithmic framework of UApriori.
 NDUH-Mine (our proposed algorithm)
– Approximates the Poisson binomial distribution with a normal distribution.
– Uses the algorithmic framework of UH-Mine.
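The normal approximation shared by NDUApriori and NDUH-Mine can be sketched as below: sup(X) is Poisson binomial, so its mean and variance are Σ p_i and Σ p_i(1 − p_i), and the tail is approximated by the matching normal CDF. The continuity correction is this sketch's assumption, and the names are illustrative:

```python
from math import erf, sqrt

def normal_frequent_probability(probs, min_count):
    """Approximate Pr{sup(X) >= min_count} by a matching normal distribution."""
    mu = sum(probs)                        # mean of the Poisson binomial
    var = sum(p * (1 - p) for p in probs)  # its variance
    if var == 0.0:
        return 1.0 if mu >= min_count else 0.0
    # Continuity correction: "at least min_count" becomes "> min_count - 0.5".
    z = (min_count - 0.5 - mu) / sqrt(var)
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z
    return 1.0 - cdf

# Item a in the running example; the exact frequent probability is 0.8.
approx = normal_frequent_probability([0.8, 0.8, 0.5, 0.0], 2)
print(round(approx, 2))  # close to 0.8; N = 4 is small, so the fit is rough
```

By the central limit theorem the fit tightens as N grows, which matches the observation below that all three approximate algorithms are very accurate on large datasets.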

Running Time
(a) Accident (Dense)  (b) Kosarak (Sparse)
Running Time w.r.t. min_sup

Memory Cost
(a) Accident (Dense)  (b) Kosarak (Sparse)
Memory Cost w.r.t. min_sup

Scalability
(a) Scalability w.r.t. Running Time  (b) Scalability w.r.t. Memory Cost

Approximation Quality
 Accuracy on the Accident dataset and on the Kosarak dataset: precision and recall of PDUApriori, NDUApriori, and NDUH-Mine at varying min_sup.

Review: PDUApriori vs. NDUApriori vs. NDUH-Mine
 When datasets are large, all three algorithms provide very accurate approximations.
 Dense datasets: the PDUApriori and NDUApriori algorithms perform very well.
 Sparse datasets: the NDUH-Mine algorithm usually performs very well.
 Normal distribution-based algorithms outperform Poisson distribution-based algorithms.
– Normal distribution: fits both mean and variance.
– Poisson distribution: fits the mean only.

Outline  Motivations – An Example of Mining Uncertain Frequent Itemsets (FIs) – Deterministic FI Vs. Uncertain FI – Evaluation Goals  Problem Definitions  Evaluations of Algorithms – Expected Support-based Frequent Algorithms – Exact Probabilistic Frequent Algorithms – Approximate Probabilistic Frequent Algorithms  Conclusions 38

Conclusions
 Expected Support-based Frequent Itemset Mining Algorithms
– Dense datasets: the UApriori algorithm usually performs very well.
– Sparse datasets: the UH-Mine algorithm usually performs very well.
– In most cases, the UFP-growth algorithm cannot outperform the other algorithms.
 Exact Probabilistic Frequent Itemset Mining Algorithms
– Efficiency: the DC algorithm is usually faster than DP.
– Memory cost: the DC algorithm trades higher memory cost for efficiency.
– Chernoff bound-based pruning usually enhances efficiency significantly.
 Approximate Probabilistic Frequent Itemset Mining Algorithms
– Approximation quality: on large datasets, the algorithms generate very accurate approximations.
– Dense datasets: the PDUApriori and NDUApriori algorithms perform very well.
– Sparse datasets: the NDUH-Mine algorithm usually performs very well.
– Normal distribution-based algorithms outperform Poisson distribution-based algorithms.

Thank you
Our executable program, data generator, and all data sets can be found: