Frequent itemset mining and temporal extensions Sunita Sarawagi

Association rules

- Given many sets of items, for example:
  - sets of items purchased together
  - sets of pages visited on a website
  - sets of doctors visited
- Find all rules that correlate the presence of one set of items with another
  - Rules are of the form X → Y, where X and Y are sets of items
  - E.g., purchase of books A and B → purchase of C

Parameters: Support and Confidence

- Every rule X → Z has two parameters:
  - Support: the probability that a transaction contains both X and Z
  - Confidence: the conditional probability that a transaction containing X also contains Z
- Two thresholds for association rule mining:
  - minimum support s
  - minimum confidence c
- Example with s = 50% and c = 50%:
  - A → C (support 50%, confidence 66.6%)
  - C → A (support 50%, confidence 100%)
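
A minimal sketch of how these two quantities are computed from a transaction database; the toy transactions and the `support`/`confidence` helpers are illustrative, not from the slides:

```python
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional probability: support(lhs with rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 1.0
print(confidence({"C"}, {"A"}, transactions))  # 0.666...
```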

Applications of fast itemset counting

- Cross-selling in retail and banking
- Catalog design and store layout
- Applications in medicine: finding redundant tests
- Improving the predictive capability of classifiers that assume attribute independence
- Improved clustering of categorical attributes

Finding association rules in large databases

- Number of transactions: in the millions
- Number of distinct items: in the tens of thousands
- Lots of work on scalable algorithms
- Typically two parts to the algorithm:
  1. Find all frequent itemsets with support > S
  2. Find rules with confidence greater than C
- The frequent itemset search is the more expensive step
- Two representative algorithms: Apriori and FP-tree (FP-growth)

The Apriori Algorithm

  L1 = {frequent items of size one}
  for (k = 1; Lk != ∅; k++):
      Ck+1 = candidates generated from Lk:
          join Lk with itself,
          then prune any (k+1)-itemset that has a k-subset not in Lk
      for each transaction t in the database:
          increment the count of every candidate in Ck+1 contained in t
      Lk+1 = candidates in Ck+1 with support >= min_support
  return ∪k Lk
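
A runnable rendering of this loop in Python; it assumes the `generate_candidates` helper developed on the next slide, and the data layout (list of item sets, absolute support counts) is a choice of this sketch:

```python
def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their supports.

    transactions: list of sets of items; min_support: absolute count.
    """
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_support}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        candidates = generate_candidates(set(frequent), k)   # see next slide
        counts = {c: 0 for c in candidates}
        for t in transactions:                               # one DB scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```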

How to Generate Candidates?

- Suppose the items in Lk-1 are listed in some fixed order
- Step 1: self-join Lk-1

      insert into Ck
      select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

      forall itemsets c in Ck do
          forall (k-1)-subsets s of c do
              if (s is not in Lk-1) then delete c from Ck
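
The same join-and-prune logic as a Python helper; this is the `generate_candidates` assumed in the sketch above, written against sets of frozensets rather than an ordered relation:

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets that share k-1 items, then prune any
    (k+1)-candidate having a k-subset that is not frequent."""
    candidates = set()
    frequent_list = list(frequent_k)
    for i in range(len(frequent_list)):
        for j in range(i + 1, len(frequent_list)):
            union = frequent_list[i] | frequent_list[j]
            if len(union) == k + 1:                      # the join step
                candidates.add(union)
    # Prune: every k-subset of a surviving candidate must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}
```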

The Apriori Algorithm: Example

[Figure: repeated scans of database D alternate between candidate sets C1, C2, C3 and the corresponding frequent sets L1, L2, L3]

Improvements to Apriori

- Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case)
- Many enhancements have been proposed:
  - Sampling: count in two passes
  - Invert the database to column-major instead of row-major layout and count support by intersection (see the sketch below)
  - Count itemsets of multiple lengths in one pass
- Reducing the number of passes is of limited use, since I/O is not the bottleneck
- The main bottleneck is candidate generation and counting, which is not optimized for long itemsets
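
A sketch of the column-major (vertical) representation in the style of Eclat-like methods: each item maps to the set of transaction IDs containing it, and the support of an itemset is the size of the intersection of its items' TID sets. The data layout here is an assumption of the sketch, not prescribed by the slides:

```python
def to_vertical(transactions):
    """Map each item to the set of transaction IDs (TIDs) containing it."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def support_by_intersection(itemset, tidsets):
    """Support of an itemset = size of the intersection of its TID sets."""
    items = iter(itemset)
    tids = set(tidsets[next(items)])
    for item in items:
        tids &= tidsets[item]
    return len(tids)
```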

Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - highly condensed, but complete for frequent pattern mining
- Develop an efficient, FP-tree-based frequent pattern mining method
  - a divide-and-conquer methodology: decompose mining tasks into smaller ones
  - avoids candidate generation

Construct FP-tree from Database

min_support = 0.5 (i.e., 3 of the 5 transactions)

  TID   Items bought                (ordered) frequent items
  100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
  300   {b, f, h, j, o}             {f, b}
  400   {b, c, k, s, p}             {c, b, p}
  500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3

1. Scan the DB once to find the frequent 1-itemsets
2. Order the frequent items in each transaction by decreasing frequency
3. Scan the DB again and construct the FP-tree

[Figure: the resulting FP-tree rooted at {}, with main path f:4 - c:3 - a:3 - m:2 - p:2, side branches b:1 (with m:1) under a and b:1 under f, and a separate branch c:1 - b:1 - p:1 for transaction 400]
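
A compact sketch of FP-tree construction under these rules; the `Node` class, dict-based children, and list-based node-links are implementation choices of this sketch, not prescribed by the slides:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> child Node

def build_fptree(transactions, min_count):
    # Pass 1: count items and keep only the frequent ones
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: n for i, n in freq.items() if n >= min_count}

    root = Node(None, None)
    header = {}                     # item -> list of its nodes (the node-links)
    # Pass 2: insert each transaction's frequent items, most frequent first
    # (ties in frequency are broken arbitrarily here)
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq
```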

Step 1: FP-tree to Conditional Pattern Base

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

  Item   Conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

[Figure: the FP-tree from the previous slide, with its header table (f:4, c:4, a:3, b:3, m:3, p:3) linking to the node occurrences]

Step 2: Construct Conditional FP-tree

- For each conditional pattern base:
  - accumulate the count for each item in the base
  - construct the FP-tree for the frequent items of the pattern base
- Example: the m-conditional pattern base is fca:2, fcab:1; the m-conditional FP-tree is the single path {} - f:3 - c:3 - a:3 (b is dropped because its count of 1 is below min_support)
- All frequent patterns involving m: m, fm, cm, am, fcm, fam, cam, fcam

Mining Frequent Patterns by Creating Conditional Pattern Bases

  Item   Conditional pattern base       Conditional FP-tree
  p      {(fcam:2), (cb:1)}             {(c:3)}|p
  m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)}|m
  b      {(fca:1), (f:1), (c:1)}        Empty
  a      {(fc:3)}                       {(f:3, c:3)}|a
  c      {(f:3)}                        {(f:3)}|c
  f      Empty                          Empty

Repeat this recursively for higher items...
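
A recursive FP-growth sketch built on the `build_fptree` helper above; prefix paths stand in for conditional pattern bases, replicated by node count for simplicity. This is a simplified rendering under the node/header layout assumed earlier, not the paper's exact implementation:

```python
def fp_growth(transactions, min_count, suffix=frozenset()):
    """Yield (itemset, support) pairs by recursively mining conditional bases."""
    root, header, freq = build_fptree(transactions, min_count)
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)
        itemset = suffix | {item}
        yield itemset, support
        # Conditional pattern base: the prefix path of every node for `item`,
        # replicated as many times as that node's count
        cond_base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([set(path)] * n.count)
        # Recurse on the conditional base with the extended suffix
        yield from fp_growth(cond_base, min_count, itemset)
```

On the five-transaction example with min_count = 3, this yields the patterns in the table, e.g. {m}: 3 and {f, c, a, m}: 3.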

FP-growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time vs. minimum support on data set T25I20D10K; FP-growth scales far better than Apriori as the support threshold decreases]

Criticism of Support and Confidence

- Pitfall: X and Y may be positively correlated and X and Z negatively correlated, yet the support and confidence of X => Z dominate those of X => Y
- Need to measure departure from the support expected under independence
- For two items: interest(X, Y) = P(X and Y) / (P(X) * P(Y))
- For k items, the expected support is derived from the supports of the (k-1)-itemsets using iterative scaling methods
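
Reusing the `support` helper from earlier, the two-item interest (lift) measure is a one-liner; values near 1 indicate independence, above 1 positive correlation, below 1 negative correlation:

```python
def interest(x, y, transactions):
    """support(X and Y) / (support(X) * support(Y)); 1.0 means independence."""
    return (support(set(x) | set(y), transactions)
            / (support(x, transactions) * support(y, transactions)))
```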

Prevalent correlations are not interesting

- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena ("bedsheets and pillow covers sell together!" ... zzzz)

What makes a rule surprising?

- It does not match prior expectation
  - e.g., the prior is that the correlation between milk and cereal remains roughly constant over time
- It cannot be trivially derived from simpler rules
  - milk 10%, cereal 10%; milk and cereal 10% ... surprising (independence predicts 1%)
  - eggs 10%; milk, cereal and eggs 0.1% ... surprising! (expected 1% from the simpler rules)

Finding surprising temporal patterns

- Algorithms to mine for surprising patterns:
  - encode each itemset's occurrences as a bit stream under two models:
    - Mopt: the optimal model, which allows change along time
    - Mcons: the constrained model, which does not allow change along time
  - surprise = difference in the number of bits needed under Mopt and Mcons

One item: optimal model

- Milk-buying habits are modeled by a biased coin
- The customer tosses this coin to decide whether to buy milk
  - head or "1" denotes "basket contains milk"
  - the coin bias is Pr[milk]
- The analyst wants to study Pr[milk] along time
  - a single coin with fixed bias is not interesting
  - changes in the bias are interesting

The coin segmentation problem

- Two players, A and B
- A has a set of coins with different biases
- A repeatedly:
  - picks an arbitrary coin
  - tosses it an arbitrary number of times
- B observes the head/tail sequence and guesses the transition points and biases

[Figure: A picks a coin, tosses it, and returns it; B watches only the toss outcomes]

How to explain the data

- Given n head/tail observations:
  - we can assume n different coins, each with bias 0 or 1: the data fits perfectly (with probability one), but many coins are needed
  - or we can assume one coin: it may fit the data poorly
- The "best explanation" is a compromise between the two extremes

[Figure: a toss sequence segmented into three runs with biases 1/4, 5/7, and 1/3]

Coding examples

- A sequence of k zeroes:
  - naive encoding takes k bits
  - run-length encoding takes about log k bits
- 1000 bits with 10 randomly placed 1's, the rest 0's:
  - posit a coin with bias 0.01
  - by Shannon's theorem, the data encoding cost is -sum log2 Pr[x] = -10 log2(0.01) - 990 log2(0.99) ≈ 81 bits, far below the naive 1000
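
The same arithmetic checked in a few lines, using the numbers from the slide's example:

```python
from math import log2

def encoding_cost(n_ones, n_zeros, p):
    """Shannon cost in bits of a 0/1 sequence under a coin with bias p."""
    return -n_ones * log2(p) - n_zeros * log2(1 - p)

print(encoding_cost(10, 990, 0.01))   # ~80.8 bits, vs. 1000 bits naively
```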

How to find optimal segments

- Build a graph over the toss sequence: a sequence of 17 tosses yields a derived graph with 18 nodes, one per segment boundary
- Each edge spans one candidate segment: edge cost = model cost + data cost
  - model cost = one node ID + one Pr[head] value
  - data cost, e.g. for a segment with Pr[head] = 5/7 (5 heads, 2 tails), is the Shannon cost of those tosses under that bias
- The optimal segmentation is a shortest path through this graph
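
A sketch of this search as dynamic programming over the boundary nodes (equivalent to a shortest path in the DAG, since all edges point forward); the fixed model cost per segment is a placeholder assumption:

```python
from math import log2

def seg_cost(tosses, i, j, model_cost_bits=16.0):
    """MDL cost of one segment tosses[i:j]: model cost + Shannon data cost."""
    n = j - i
    heads = sum(tosses[i:j])
    p = heads / n
    data = 0.0
    if 0 < p < 1:                        # data cost is 0 for an all-0 or all-1 segment
        data = -heads * log2(p) - (n - heads) * log2(1 - p)
    return model_cost_bits + data

def optimal_segmentation(tosses):
    """best[j] = minimum total cost of encoding tosses[:j]."""
    n = len(tosses)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + seg_cost(tosses, i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    cuts, j = [], n                      # recover the segment boundaries
    while j > 0:
        cuts.append(j)
        j = back[j]
    return best[n], sorted(cuts)
```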

Two or more items

- "Unconstrained" segmentation:
  - k items induce a 2^k-sided coin
  - "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
- The shortest path finds a significant shift in any of the coin face probabilities
- Problem: some of these shifts may be completely explained by the marginals
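
A tiny illustration of the 2^k-sided coin encoding; the item order and basket contents are invented for the example:

```python
def face(basket, items=("milk", "cereal")):
    """Encode a basket as one of 2^k faces: bit i is 1 iff items[i] is present."""
    f = 0
    for item in items:
        f = (f << 1) | (item in basket)
    return f

print(face({"milk", "cereal"}))  # 3, i.e. binary 11
print(face({"milk"}))            # 2, i.e. binary 10
print(face(set()))               # 0, i.e. binary 00
```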

Example

- A drop in the joint sale of milk and cereal is completely explained by a drop in the sale of milk
- Pr[milk & cereal] / (Pr[milk] * Pr[cereal]) remains constant over time
- Call this ratio ρ

Constant-  segmentation l Compute global  over all time l All coins must have this common value of  l Segment as before l Compare with un-constrained coding cost Observed support Independence

Is all this really needed?

- A simpler alternative:
  - aggregate the data into suitable time windows
  - compute support, correlation, ρ, etc. in each window
  - use a variance threshold to choose itemsets
- Pitfalls:
  - arbitrary choices: windows, thresholds
  - may miss fine detail
  - over-sensitive to outliers

Experiments

- Millions of baskets over several years
- Two algorithms:
  - the complete MDL approach
  - MDL segmentation + statistical tests (MStat)
- Data set:
  - 2.8 million transactions
  - 7 years, 1987 to 1993
  - items
  - an average of 2.62 items per basket

Little agreement in itemset ranks

- Simpler methods do not approximate MDL

MDL has high selectivity

- The scores of the best itemsets stand out from the rest under MDL

Three anecdotes (ρ against time)

- High MStat score: small marginals (polo shirts & shorts)
- High correlation: small % variation (bedsheets & pillow cases)
- High MDL score: significant gradual drift (men's & women's shorts)

Conclusion

- A new notion of surprising patterns based on:
  - the joint support expected from the marginals
  - the variation of joint support along time
- A robust MDL formulation
- Efficient algorithms:
  - near-optimal segmentation using shortest paths
  - pruning criteria
- Successful application to real data

References

- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, Santiago, Chile, Sept. 1994.
- S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in this talk are taken from this book.)
- H. Toivonen. Sampling large databases for association rules. VLDB'96, Bombay, India, Sept. 1996.