Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —

Slides:



Advertisements
Similar presentations
Association Rule Mining
Advertisements

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
CSE 634 Data Mining Techniques
FP-Growth algorithm Vasiljevic Vladica,
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Frequent Item Mining.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining: Concepts and Techniques (2nd ed.) — Chapter 5 —
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
FP-growth. Challenges of Frequent Pattern Mining Improving Apriori Fp-growth Fp-tree Mining frequent patterns with FP-tree Visualization of Association.
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Association Rule Mining Instructor Qiang Yang Slides from Jiawei Han and Jian Pei And from Introduction to Data Mining By Tan, Steinbach, Kumar.
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
Business Systems Intelligence: 4. Mining Association Rules Dr. Brian Mac Namee (
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Association Rule Mining Instructor Qiang Yang Thanks: Jiawei Han and Jian Pei.
Chapter 4: Mining Frequent Patterns, Associations and Correlations
Mining Association Rules in Large Databases
Frequent Pattern and Association Analysis (baseado nos slides do livro: Data Mining: C & T)
Mining Association Rules
1 1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
Mining Association Rules
Performance and Scalability: Apriori Implementation.
SEG Tutorial 2 – Frequent Pattern Mining.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
Pattern Recognition Lecture 20: Data Mining 3 Dr. Richard Spillman Pacific Lutheran University.
Chapter 2: Mining Frequent Patterns, Associations and Correlations
Ch5 Mining Frequent Patterns, Associations, and Correlations
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
What Is Association Mining? l Association rule mining: – Finding frequent patterns, associations, correlations, or causal structures among sets of items.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Warehousing 資料倉儲 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University Dept. of Information ManagementTamkang.
Chapter 4 – Frequent Pattern Mining Shuaiqiang Wang ( 王帅强 ) School of Computer Science and Technology Shandong University of Finance and Economics Homepage:
Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Data Mining Find information from data data ? information.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
1 Data Mining: Mining Frequent Patterns, Association and Correlations.
Mining Frequent Patterns, Associations, and Correlations Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Mining Frequent Patterns. What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs.
Chapter 6: Mining Frequent Patterns, Association and Correlations
Dept. of Information Management, Tamkang University
What is Frequent Pattern Analysis?
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.
The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining.
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Association Rule Mining CENG 514 Data Mining July 2,
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Information Management course
Association rule mining
Frequent Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
Department of Computer Science National Tsing Hua University
Association Rule Mining
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
What Is Association Mining?
Presentation transcript:

Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2013 Han, Kamber & Pei. All rights reserved. 1 1

Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary

What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining Motivation: Finding inherent regularities in data What products were often purchased together?— Beer and diapers?! What are the subsequent purchases after buying a PC? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Market Basket Analysis

Why Is Freq. Pattern Mining Important? Freq. pattern: An intrinsic and important property of datasets Foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: frequent pattern analysis Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles Broad applications

Basic Concepts: Frequent Patterns Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold Customer buys diaper buys both buys beer

Basic Concepts: Association Rules Tid Items bought Find all the rules X  Y with minimum support and confidence support, s, probability that a transaction contains X  Y confidence, c, conditional probability that a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper Customer buys beer Association rules: (many more!) Beer  Diaper (60%, 100%) Diaper  Beer (60%, 75%)

Closed Patterns and Max-Patterns A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! Solution: Mine closed frequent itemsets and maximal frequent itemsets instead An itemset X is closed if X is frequent and there exists no super-itemset Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) An itemset X is a maximal if X is frequent and there exists no super-itemset Y כ X and Y is frequent (proposed by Bayardo @ SIGMOD’98) Closed pattern is a lossless compression of freq. patterns

Closed Patterns and Max-Patterns Exercise: Suppose a DB contains only two transactions <a1, …, a100>, <a1, …, a50> Let min_sup = 1 What is the set of closed frequent itemsets? {a1, …, a100}: 1 {a1, …, a50}: 2 What is the set of maximal frequent itemsets? What is the set of all itemsets? {a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1 A big number: 2100 - 1

Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary

Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format

Apriori Algorithm Proposed by R. Agrawal and R. Srikanth in 1994 Name of the algorithm is based on the fact that the algorithm uses prior knowledge of the frequent itemset properties Apriori employs an iterative approach known as a level-wise search where k-itemsets are used to explore (k+1)-itemsets

Apriori Method Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Terminate when no frequent or candidate set can be generated To improve the efficiency of the level-wise generation of frequent itemsets, the apriori property is used to reduce the search space All non-empty subsets of a frequent itemset must also be frequent If {beer, diaper, nuts} is frequent, so is {beer, diaper}

The Apriori Algorithm—An Example Supmin = 2 Itemset sup {A} 2 {B} 3 {C} {D} 1 {E} Database TDB Itemset sup {A} 2 {B} 3 {C} {E} L1 C1 Tid Items 10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E 1st scan C2 Itemset sup {A, B} 1 {A, C} 2 {A, E} {B, C} {B, E} 3 {C, E} C2 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} L2 2nd scan Itemset sup {A, C} 2 {B, C} {B, E} 3 {C, E} C3 L3 Itemset {B, C, E} 3rd scan Itemset sup {B, C, E} 2

The Apriori Algorithm (Pseudo-Code) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk;

Implementation of Apriori How to generate candidates? Step 1: self-joining Lk Step 2: pruning Example of Candidate-generation L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 abcd from abc and abd acde from acd and ace Pruning: acde is removed because ade is not in L3 C4 = {abcd}

Generating Association Rules from Frequent Itemsets Method For each frequent itemset l, generate all nonempty subsets of l For every nonempty subset of l, output the rule s -> (l - s) if (support_coutn(l) / support_count(s)) >=min_conf Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support

Generating Association Rules Example Example: Given the following table and the frequent itemset X = {I1, I2, I5}, generate the association rules. The nonempty subsets of X are: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5} The resulting association rules are: If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong

Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format 19 19

Further Improvement of the Apriori Method Major computational challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates

Partition: Scan Database Only Twice Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski and S. Navathe, VLDB’95

Hash-based Technique: Reduce the Number of Candidates A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95

Sampling for Frequent Patterns Select a sample of original database, mine frequent patterns within sample using Apriori H. Toivonen. Sampling large databases for association rules. In VLDB’96

Dynamic Itemset Counting: Reduce Number of Scans Database is partitioned into blocks marked by start point New candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only before each complete database scan The technique uses count-so-far as the lower bound of the actual count If the count-so-far passes the minimum support, the itemset is added into the frequent itemset S. Brin R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97

Apriori Example Consider the following database with five transactions. Let min_sup = 60% and min_conf = 80%. Find all the frequent itemsets using Apriori method List all the strong association rules matching the following metarule

Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format 26 26

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation Apriori: uses a generate-and-test approach – generates candidate itemsets and tests if they are frequent Generation of candidate itemsets is expensive(in both space and time) Support counting is expensive Subset checking (computationally expensive) Multiple Database scans (I/O) FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two step approach: Step 1: Build a compact data structure called the FP-tree Built using 2 passes over the data-set. Step 2: Extracts frequent itemsets directly from the FP-tree

Step 1: FP-Tree Construction FP-Tree is constructed using 2 passes over the data-set: Pass 1: Scan data and find support for each item. Discard infrequent items. Sort frequent items in decreasing order based on their support. Use this order when building the FP-Tree, so common prefixes can be shared.

Step 1: FP-Tree Construction Pass 2: Nodes correspond to items and have a counter FP-Growth reads 1 transaction at a time and maps it to a path Fixed order is used, so paths can overlap when transactions share items (when they have the same prefix ). In this case, counters are incremented Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines) The more paths that overlap, the higher the compression. FP-tree may fit in memory. Frequent itemsets extracted from the FP-Tree.

Step 2: The Frequent Pattern Growth Mining Method For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree Repeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format

Mining by Exploring Vertical Data Format Minimum support = 2

Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary

Interestingness Measure: Correlations (Lift) play basketball  eat cereal [40%, 66.7%] play basketball  not eat cereal [20%, 33.3%] Measure of dependent/correlated events: lift If lift =1, occurrence of itemset A is independent of occurrence of itemset B, otherwise, they are dependent and correlated Basketball Not basketball Sum (row) Cereal 2000 1750 3750 Not cereal 1000 250 1250 Sum(col.) 3000 5000

Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary

Summary Basic concepts: association rules, support-confident framework, closed and max-patterns Scalable frequent pattern mining methods Apriori (Candidate generation & test) Fpgrowth Vertical format approach (ECLAT, CHARM, ...) Which patterns are interesting? lift