Association Mining Dr. Yan Liu Department of Biomedical, Industrial and Human Factors Engineering Wright State University.


Introduction
What is Association Mining
- Discovering frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, or other information repositories
- Frequent patterns: patterns (such as itemsets, subsequences, or substructures) that occur frequently
Motivation of Association Mining
- Discovering regularities in data: What products are often purchased together? Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents?

Association Rules: Basic Concepts
I = {I1, …, In} is a set of items; D is the task-relevant dataset, a set of transactions where each transaction T is a set of items such that T ⊆ I
Association Rule
- X ⇒ Y, where X and Y are the antecedent and consequent itemsets, respectively (X ⊂ I, Y ⊂ I, X ∩ Y = ∅)
Support
- Probability that a transaction contains both X and Y, i.e. P(X ∪ Y)
- P(X ∪ Y) = (# of transactions that contain both X and Y) / (total # of transactions)
Confidence
- Probability that a transaction that contains X also contains Y, i.e. P(Y|X)
- P(Y|X) = P(X ∪ Y) / P(X) = support(X ∪ Y) / support(X)
Mining Association Rules
- Finding association rules that satisfy the minimum support and confidence thresholds
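A minimal Python sketch of the two measures, computed directly from a list of transactions; the transaction list is the one used in the example on the next slide, and the function names are illustrative, not part of any standard library:

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # P(Y | X) = support(X and Y) / support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666...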

Example (min. support 50%, min. confidence 60%)
I = {A, B, C, D, E, F}

Transaction ID   Items bought
1                A, B, C
2                A, C
3                A, D
4                B, E, F

Frequent Itemset   Support
{A}                3/4 = 75%
{B}                2/4 = 50%
{C}                2/4 = 50%
{A, C}             2/4 = 50%

A ⇒ C: support = 50%, confidence = support({A, C}) / support({A}) = 50% / 75% = 66.7%

Mining Association Rules
Goal
- Discover rules with high support and confidence values
Two-Step Process
1. Find all frequent itemsets: itemsets that occur at least as frequently as the predetermined minimum support
2. Generate strong association rules from the frequent itemsets: rules that satisfy minimum support and minimum confidence
If we have all frequent itemsets, we can compute support and confidence!

Apriori Algorithm
Overview
- First proposed by Agrawal and Srikant (1994) for mining Boolean association rules
- Uses prior knowledge of frequent itemset properties: any subset of a frequent itemset must be frequent (why?), e.g. if the itemset {beer, diaper, nuts} is frequent, so is the itemset {beer, diaper}
- Apriori pruning principle: if an itemset is infrequent, all of its supersets are also infrequent and thus need not be generated
Process of Generating Frequent Itemsets
- Join step: generate the candidate k-itemsets, Ck, by self-joining the frequent (k-1)-itemsets, Lk-1; e.g. self-joining L2 = {ac, bc, be, ce} gives C3 = {bce}
- Prune step: remove candidates with an infrequent (k-1)-subset, then scan the database to count the surviving candidates and determine Lk; e.g. candidates such as {abc}, {ace}, or {abe} would be pruned because {ab} and {ae} are not in L2, and counting confirms L3 = {bce}

Apriori Algorithm Example (minimum support count = 2)

Transaction database:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

Self-join → C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

Self-join → C3: {B, C, E}
3rd scan → count: {B, C, E}: 2
L3: {B, C, E}: 2
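A compact, illustrative Python sketch of the level-wise procedure, run on the transaction database above. The names are my own, and the pairwise-union join used here is simpler than the prefix-based join of the full algorithm but yields the same candidate sets after pruning:

from itertools import combinations

def apriori(transactions, min_count):
    # Level-wise frequent-itemset mining; returns {frozenset: support count}
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # 1st scan: count 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}   # L1
    frequent, k = dict(current), 2
    while current:
        prev = list(current)
        # Join step: unions of frequent (k-1)-itemsets that form k-itemsets
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev)) for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # k-th scan: count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}   # Lk
        frequent.update(current)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_count=2))   # includes frozenset({'B', 'C', 'E'}): 2, as above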

Apriori Algorithm (Cont.)
Generating Association Rules from Frequent Itemsets
- For each frequent k-itemset l (k ≥ 2), generate all nonempty proper subsets of l
- For each nonempty proper subset s of l, output the rule "s ⇒ (l − s)" if the confidence of this rule satisfies the minimum confidence threshold, i.e. support_count(l) / support_count(s) ≥ minimum confidence

Apriori Algorithm Example (Cont.)

Rule         Support        Confidence
A ⇒ C        2/4 = 50%      2/2 = 100%
C ⇒ A        2/4 = 50%      2/3 = 66.7%
B ⇒ C        2/4 = 50%      2/3 = 66.7%
C ⇒ B        2/4 = 50%      2/3 = 66.7%
B ⇒ E        3/4 = 75%      3/3 = 100%
E ⇒ B        3/4 = 75%      3/3 = 100%
C ⇒ E        2/4 = 50%      2/3 = 66.7%
E ⇒ C        2/4 = 50%      2/3 = 66.7%
B ⇒ C, E     2/4 = 50%      2/3 = 66.7%
C ⇒ B, E     2/4 = 50%      2/3 = 66.7%
E ⇒ B, C     2/4 = 50%      2/3 = 66.7%
B, C ⇒ E     2/4 = 50%      2/2 = 100%
B, E ⇒ C     2/4 = 50%      2/3 = 66.7%
C, E ⇒ B     2/4 = 50%      2/2 = 100%
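A sketch of the rule-generation step, assuming the frequent itemsets and their support counts are available as a dictionary; it reuses the apriori() sketch above, and the names are illustrative:

from itertools import combinations

def generate_rules(frequent, min_conf):
    # Emit (antecedent, consequent, confidence) for every rule s => (l - s)
    # whose confidence = count(l) / count(s) meets the threshold
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                conf = l_count / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for antecedent, consequent, conf in generate_rules(apriori(tdb, min_count=2), 0.6):
    print(antecedent, "=>", consequent, round(conf, 3))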

Improve Efficiency of Apriori
Challenges in Mining Frequent Itemsets
- Multiple scans of the transaction database are costly
- Huge number of candidates; e.g. to find the frequent itemset {i1, i2, …, i100}: # of scans: 100; # of candidates: 2^100 − 1 ≈ 1.27 × 10^30
Transaction Reduction
- Reduce the number of transactions scanned in future iterations
- A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset and thus does not need to be considered in future scans
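The transaction-reduction idea in one function (illustrative names; itemsets are frozensets as in the Apriori sketch above):

def reduce_transactions(transactions, frequent_k):
    # A transaction with no frequent k-itemset cannot contribute to any
    # frequent (k+1)-itemset, so it can be skipped in later scans
    return [t for t in transactions if any(s <= t for s in frequent_k)]

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L2 = [frozenset("AC"), frozenset("BC"), frozenset("BE"), frozenset("CE")]
print(reduce_transactions(tdb, L2))   # every transaction in this tiny example still qualifies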

Improve Efficiency of Apriori (Cont.)
Partitioning
- Needs only two database scans to mine frequent itemsets
- Scan 1: divide the database into non-overlapping partitions and find the local frequent itemsets of each partition
- Scan 2: assess the actual support of the local frequent itemsets over the whole database to determine the global frequent patterns
Sampling
- Randomly select a sample of the database and search for frequent itemsets in the sample
- Trades off accuracy against efficiency
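A sketch of the two-scan partitioning scheme; it reuses the apriori() function from the earlier sketch for the local (per-partition) mining, and the partition sizes and names are illustrative:

import math

def partition_mine(transactions, num_partitions, min_support):
    # Scan 1: any globally frequent itemset must be locally frequent in at least
    # one partition, so the union of local results is a complete candidate set
    size = math.ceil(len(transactions) / num_partitions)
    candidates = set()
    for start in range(0, len(transactions), size):
        part = transactions[start:start + size]
        local_min = max(1, math.ceil(min_support * len(part)))
        candidates |= set(apriori(part, local_min))
    # Scan 2: count the candidates over the whole database to drop false positives
    min_count = math.ceil(min_support * len(transactions))
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(tdb, num_partitions=2, min_support=0.5))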

Improve Efficiency of Apriori (Cont.)
Dynamic Itemset Counting (DIC)
- The database is divided into blocks marked by start points
- New candidates can be added at any start point, once all of their subsets are estimated to be frequent (in Apriori, new candidates are added only after a complete database scan)
[Figure: timeline of candidate generation over the transaction stream — DIC begins counting 2- and 3-itemsets part-way through a scan, while Apriori starts each new candidate set only after a full scan]

Frequent-Pattern (FP) Growth
Purpose
- Find frequent itemsets without candidate generation
General Idea
- Compress the database, retaining only the frequent items, into an FP-tree that preserves the itemset association information
- Mine the FP-tree to find the frequent itemsets
Construct the FP-Tree
- 1st scan of the database: derive the set of frequent items and their support counts; sort the frequent items in order of descending support count (the resulting list is denoted L)
- Create the root of the tree, labeled "null"
- 2nd scan of the database: the items in each transaction are processed in L order, and a branch is created for each transaction
- Branches that share a common prefix are combined, and the shared prefix nodes accumulate the counts
- To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links

FP-Tree Growth Example (minimum support count = 3)

TID   Items bought                  (Ordered) frequent items
1     f, a, c, d, g, i, m, p        f, c, a, m, p
2     a, b, c, f, l, m, o           f, c, a, b, m
3     b, f, h, j, o, w              f, b
4     b, c, k, s, p                 c, b, p
5     a, f, c, e, l, p, m, n        f, c, a, m, p

L = {f: 4, c: 4, a: 3, b: 3, m: 3, p: 3}

The tree is built incrementally under the root {}: T1 creates the path f:1–c:1–a:1–m:1–p:1; T2 shares the prefix f, c, a and adds b:1–m:1; T3 adds the child b:1 under f; T4 adds the branch c:1–b:1–p:1; T5 follows the first path, incrementing its counts. The final tree:

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2 ── p:2
│   │       └── b:1 ── m:1
│   └── b:1
└── c:1 ── b:1 ── p:1

The FP-Tree Registers Compressed Frequent-Pattern Information

Header table (item: frequency, each entry linked by node-links to that item's occurrences in the tree):
f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

[Figure: the FP-tree shown above, with each header-table entry chained to its nodes]

Frequent-Pattern (FP) Growth (Cont.)
Mine Frequent Itemsets from the FP-Tree
- Starting from the last item in the header table, for each frequent item construct its conditional pattern base and then its conditional FP-tree
  - The conditional pattern base of an item consists of the set of prefix paths in the FP-tree that co-occur with the suffix pattern
- Repeat the process on each newly created conditional FP-tree
- Stop when the resulting conditional FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

FP-Tree Growth Example (Cont.)
Considering p as the suffix:
- Traverse the FP-tree by following the node-links of item p
- Accumulate all of the transformed prefix paths of p to form p's conditional pattern base: {(f, c, a, m: 2), (c, b: 1)}
- Construct p's conditional FP-tree by eliminating the infrequent items: only c (count 3) survives, giving the single-path tree ⟨c: 3⟩
- Concatenate the items in the conditional FP-tree with p to generate the frequent itemsets containing p: {c, p: 3} (in addition to {p: 3} itself)

Item   Conditional Pattern Base         Conditional FP-Tree     Frequent Patterns Generated
p      {(f c a m: 2), (c b: 1)}         ⟨c: 3⟩                  {c, p: 3}
m      {(f c a: 2), (f c a b: 1)}       ⟨f: 3, c: 3, a: 3⟩      {f, m: 3}, {c, m: 3}, {a, m: 3}, {f, c, m: 3}, {f, a, m: 3}, {c, a, m: 3}, {f, c, a, m: 3}
b      {(f c a: 1), (f: 1), (c: 1)}     empty                   none
a      {(f c: 3)}                       ⟨f: 3, c: 3⟩            {f, a: 3}, {c, a: 3}, {f, c, a: 3}
c      {(f: 3)}                         ⟨f: 3⟩                  {f, c: 3}
f      empty                            empty                   N/A ({f: 4} alone is already frequent)
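A compact FP-growth sketch covering both tree construction and the recursive mining over conditional pattern bases described above. It is run on the five-transaction example with minimum support count 3; node-links are kept as Python lists rather than chained pointers, and ties in the item ordering are broken alphabetically, so the tree shape may differ slightly from the figure while the mined patterns are the same:

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(weighted_transactions, min_count):
    # weighted_transactions: list of (items, count) pairs; returns (root, header table)
    freq = Counter()
    for items, count in weighted_transactions:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted_transactions:
        # keep frequent items only, ordered by descending support count
        ordered = sorted((i for i in items if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return root, header

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    # Yield (frequent itemset, support count) pairs
    root, header = build_tree(weighted_transactions, min_count)
    for item in sorted(header, key=lambda i: sum(n.count for n in header[i])):
        support = sum(n.count for n in header[item])
        itemset = suffix | {item}
        yield itemset, support
        # conditional pattern base: the prefix path of every node carrying `item`
        cond_base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                cond_base.append((path, node.count))
        if cond_base:
            yield from fp_growth(cond_base, min_count, itemset)   # recurse on the conditional tree

tdb = [({"f","a","c","d","g","i","m","p"}, 1), ({"a","b","c","f","l","m","o"}, 1),
       ({"b","f","h","j","o","w"}, 1), ({"b","c","k","s","p"}, 1),
       ({"a","f","c","e","l","p","m","n"}, 1)]
for itemset, count in fp_growth(tdb, min_count=3):
    print(sorted(itemset), count)   # e.g. ['c', 'p'] 3 and ['a', 'c', 'f', 'm'] 3 appear in the output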

Advantages of FP-Growth over Apriori
Divide-and-Conquer
- Decomposes both the mining task and the database according to the frequent patterns obtained so far
- Leads to focused searches of smaller datasets
Other Factors
- No candidate generation, no candidate testing
- Compressed database: the FP-tree structure
- No repeated scans of the entire database
- Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

Mining Various Kinds of Rules or Regularities
Multi-Level Association Rules
- Involve concepts at different levels of abstraction
Multi-Dimensional Association Rules
- Involve more than one dimension or predicate
Quantitative Association Rules
- Involve numeric attributes that have an implicit ordering among values

Mining Multi-Level Association Rules
Mining a Multi-Level Hierarchy
- Top-down strategy: start from the top level of the hierarchy and work downward toward the more specific concept levels; at each level, mine the frequent itemsets and association rules
Variations of the Support Threshold
- Uniform minimum support: the same minimum support threshold is used for all levels
- Reduced minimum support at lower levels: lower-level items usually have lower support
- Group-based minimum support: users or experts set item- or group-specific minimum support thresholds

Example: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% — only Milk and 2% Milk pass the threshold
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% — Milk, 2% Milk, and Skim Milk all pass

Mining Multi-Level Association Rules (Cont.)
Rule Redundancy
- Some rules may be redundant due to "ancestor" relationships between items (e.g. milk is an "ancestor" of 2% milk)
- A rule is redundant if its support is close to the "expected" value based on the rule's ancestor
- e.g. suppose Rule 1: milk ⇒ wheat bread [support = 8%, confidence = 70%], and we know that about 1/4 of milk sales are 2% milk; if Rule 2: 2% milk ⇒ wheat bread [support = 2%, confidence = 72%], then Rule 2 is redundant, since its support (2%) equals the expected value 8% × 1/4 and its confidence is close to Rule 1's

Mining Multi-Dimensional Association Rules
Single-Dimensional Rules
- e.g. buys(X, "milk") ⇒ buys(X, "bread")
Multi-Dimensional Rules: involve ≥ 2 dimensions or predicates
- Inter-dimension association rules (no predicate appears in both the antecedent and the consequent), e.g. age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
- Hybrid-dimension association rules (a predicate can appear in both the antecedent and the consequent), e.g. age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical Attributes
- Finite number of possible values, no ordering among values
Quantitative Attributes
- Numeric, implicit ordering among values

Mining Multi-Dimensional Association Rules (Cont.)
Static Discretization
- Quantitative attributes are discretized before mining using predefined concept hierarchies
- e.g. values of the attribute income can be discretized into the intervals "0…20K", "21K…30K", "31K…40K", …
Dynamic Discretization
- Quantitative attributes are discretized or clustered into "bins" based on the data distribution
- Treats numeric attribute values as quantities rather than as predefined ranges or categories

Static Discretization of Quantitative Attributes
Quantitative attributes are discretized prior to mining using predefined concept hierarchies; numeric values are replaced by intervals
The data cube is well suited for mining multi-dimensional association rules
- The cells of a k-dimensional cuboid correspond to the itemsets
- Aggregates (such as support counts) are stored in the multi-dimensional space
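A small sketch of the idea: support counts aggregated for every cuboid (every subset of dimensions) of already-discretized records; the records and their values are made up purely for illustration:

from collections import Counter
from itertools import combinations

records = [
    {"age": "20-29", "income": "31K-40K", "buys": "laptop"},
    {"age": "20-29", "income": "41K-50K", "buys": "laptop"},
    {"age": "30-39", "income": "41K-50K", "buys": "printer"},
]
dims = ("age", "income", "buys")

cuboids = {}
for k in range(1, len(dims) + 1):
    for dim_subset in combinations(dims, k):     # one cuboid per subset of dimensions
        cells = Counter(tuple(r[d] for d in dim_subset) for r in records)
        cuboids[dim_subset] = cells              # cell -> aggregated support count

print(cuboids[("age", "buys")][("20-29", "laptop")])   # 2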

3-D Data Cube (each cuboid cell representing an item or itemset)
Lattice of cuboids over the dimensions age, income, and buys:
- 0-D (apex) cuboid: ()
- 1-D cuboids: (age), (income), (buys)
- 2-D cuboids: (age, income), (age, buys), (income, buys)
- 3-D cuboid: (age, income, buys)

Quantitative Association Rules
Numeric attributes are dynamically discretized to satisfy some mining criterion, e.g. maximizing the confidence or compactness of the rules mined
2-D Quantitative Association Rules
- Aquan1 ∧ Aquan2 ⇒ Acat, where Aquan1 and Aquan2 are tests on intervals of two quantitative attributes (the intervals are determined dynamically) and Acat tests a categorical attribute
- e.g. age(X, "30…39") ∧ income(X, "42K…48K") ⇒ buys(X, "HDTV")
Association Rule Clustering System
- Map "adjacent" association rules onto a 2-D grid
- Search the grid for clusters of points, from which more general rules are generated

Association Rule Clustering System
Step 1: Binning
- Partition the ranges of the quantitative attributes into intervals
- Equal-width binning: the interval size of each bin is the same
- Equal-frequency binning: each bin holds approximately the same number of records
- Clustering-based binning: clustering is performed on the quantitative attribute to group neighboring points into the same bin
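A short sketch of the three binning strategies on a made-up list of ages; NumPy covers the first two, and scikit-learn's one-dimensional k-means is used only as a stand-in for clustering-based binning:

import numpy as np
from sklearn.cluster import KMeans

values = np.array([21, 22, 24, 25, 27, 33, 35, 38, 44, 60])   # illustrative ages

# Equal-width binning: every bin spans the same range of values
width_edges = np.linspace(values.min(), values.max(), num=4)   # edges of 3 bins

# Equal-frequency binning: every bin holds roughly the same number of records
freq_edges = np.quantile(values, [0, 1/3, 2/3, 1])

# Clustering-based binning: neighboring values grouped by a clustering algorithm
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(values.reshape(-1, 1))

print(width_edges, freq_edges, labels, sep="\n")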

Illustration of the Three Methods of Binning

Association Rule Clustering System (Cont.)
Step 2: Finding Frequent Predicate Sets
- Once the 2-D array containing the count distribution for each category is set up, it is scanned to find the frequent predicate sets (those satisfying minimum support), from which rules that also satisfy minimum confidence are generated
- Use a rule-generation algorithm such as the one discussed for Apriori
Step 3: Clustering Association Rules
- The strong association rules obtained in the previous step are mapped onto a 2-D grid and adjacent rules are combined; e.g.
  age(X, "34") ∧ income(X, "30K-40K") ⇒ buys(X, "HDTV")
  age(X, "34") ∧ income(X, "40K-50K") ⇒ buys(X, "HDTV")
  age(X, "35") ∧ income(X, "30K-40K") ⇒ buys(X, "HDTV")
  age(X, "35") ∧ income(X, "40K-50K") ⇒ buys(X, "HDTV")
  are combined into
  age(X, "34-35") ∧ income(X, "30K-50K") ⇒ buys(X, "HDTV")

Correlation Analysis

                Basketball   Not Basketball   Sum (row)
Cereal          2000         1750             3750
Not Cereal      1000         250              1250
Sum (col)       3000         2000             5000

play basketball ⇒ eat cereal: support = ? confidence = ?
- Support = 2000/5000 = 40%; Confidence = 2000/3000 = 66.7%
- The overall percentage of students eating cereal (regardless of basketball play) is 3750/5000 = 75% > 66.7%, so the rule play basketball ⇒ eat cereal is misleading
play basketball ⇒ not eat cereal: support = ? confidence = ?
- Support = 1000/5000 = 20%; Confidence = 1000/3000 = 33.3%
- The overall percentage of students not eating cereal (regardless of basketball play) is 1250/5000 = 25% < 33.3%, so the rule play basketball ⇒ not eat cereal is more accurate than play basketball ⇒ eat cereal

Correlation Analysis (Cont.)
Why Correlation Analysis
- Support and confidence measures can be insufficient for filtering out uninteresting association rules
- Correlation measures can augment the support-confidence framework for association rules: lift, χ² analysis, all_confidence, cosine

Lift
lift(A, B) = P(A ∪ B) / (P(A) P(B)), where P(A ∪ B) is the probability that a transaction contains both A and B
The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A)P(B); otherwise the two are correlated
- If lift(A, B) < 1, the occurrence of A is negatively correlated with the occurrence of B
- If lift(A, B) > 1, the occurrence of A is positively correlated with the occurrence of B
- If lift(A, B) = 1, the occurrences of A and B are independent

                Basketball   Not Basketball   Sum (row)
Cereal          2000         1750             3750
Not Cereal      1000         250              1250
Sum (col)       3000         2000             5000

play basketball ⇒ eat cereal, lift = ?
- P(play basketball and eat cereal) = 2000/5000 = 40%; P(play basketball) = 3000/5000 = 60%; P(eat cereal) = 3750/5000 = 75%
- lift(play basketball, eat cereal) = 40% / (60% × 75%) ≈ 0.89
play basketball ⇒ not eat cereal, lift = ?
- P(play basketball and not eat cereal) = 1000/5000 = 20%; P(not eat cereal) = 1250/5000 = 25%
- lift(play basketball, not eat cereal) = 20% / (60% × 25%) = 1.33
In conclusion, playing basketball and eating cereal are negatively correlated!

χ² Analysis

                Basketball     Not Basketball   Sum (row)
Cereal          2000 (2250)    1750 (1500)      3750
Not Cereal      1000 (750)     250 (500)        1250
Sum (col)       3000           2000             5000
(expected values, computed from the row and column totals, are shown in parentheses)

χ² = (2000−2250)²/2250 + (1750−1500)²/1500 + (1000−750)²/750 + (250−500)²/500 ≈ 277.8 >> 3.84 (the χ² critical value with 1 degree of freedom at the 0.05 level), so playing basketball and eating cereal are NOT independent
The observed count for (basketball, cereal) is less than its expected count, so playing basketball and eating cereal are negatively correlated

All_Confidence
Given an itemset X = {i1, i2, …, ik}, the all_confidence of X is defined as
all_conf(X) = sup(X) / max_item_sup(X), where max_item_sup(X) = max{ sup(ij) : ij ∈ X } is the maximum single-item support over all the items in X
all_conf(X) is the minimal confidence among the set of rules ij ⇒ X − ij
Example: X = {basketball, cereal}; sup(X) = 2000/5000 = 40%; max{sup(ij)} = max{3000/5000, 3750/5000} = 3750/5000 = 75%; all_conf(X) = 40% / 75% = 53.3%
If X = {A, B}: when all_conf(X) > 0.5, A and B are positively correlated; when all_conf(X) = 0.5, A and B are independent; when all_conf(X) < 0.5, A and B are negatively correlated

Cosine Measure
Given two itemsets A and B, the cosine measure of A and B is defined as
cosine(A, B) = P(A ∪ B) / sqrt(P(A) × P(B)) = sup(A ∪ B) / sqrt(sup(A) × sup(B))
cosine(A, B) > 0.5: A and B are positively correlated; cosine(A, B) = 0.5: A and B are independent; cosine(A, B) < 0.5: A and B are negatively correlated
The cosine measure can be viewed as a harmonized lift measure: the square root is taken of P(A) × P(B), so the cosine value is influenced only by sup(A), sup(B), and sup(A ∪ B), not by the total number of transactions
Example: A = {basketball}, B = {cereal}; sup(A) = 3000/5000, sup(B) = 3750/5000, sup(A ∪ B) = 2000/5000; cosine(A, B) = 2000 / sqrt(3000 × 3750) ≈ 59.6%
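A short sketch that recomputes all four measures for the basketball/cereal contingency table used in the preceding slides (variable names are my own):

import math

n = 5000                                   # total number of students
bb_cereal, nbb_cereal = 2000, 1750         # cereal row
bb_no_cereal, nbb_no_cereal = 1000, 250    # not-cereal row

p_bb = (bb_cereal + bb_no_cereal) / n          # P(basketball) = 0.60
p_cereal = (bb_cereal + nbb_cereal) / n        # P(cereal)     = 0.75
p_both = bb_cereal / n                         # P(both)       = 0.40

lift = p_both / (p_bb * p_cereal)              # ~0.89 (negative correlation)
all_conf = p_both / max(p_bb, p_cereal)        # ~0.533
cosine = p_both / math.sqrt(p_bb * p_cereal)   # ~0.596

# chi-square: sum over the four cells of (observed - expected)^2 / expected
observed = [bb_cereal, nbb_cereal, bb_no_cereal, nbb_no_cereal]
expected = [n * r * c for r, c in [(p_cereal, p_bb), (p_cereal, 1 - p_bb),
                                   (1 - p_cereal, p_bb), (1 - p_cereal, 1 - p_bb)]]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # ~277.8

print(round(lift, 2), round(all_conf, 3), round(cosine, 3), round(chi_square, 1))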

Comparison of Four Correlation Measures

[Table: datasets A1-A4, B1, and C1-C3 with transaction counts mc, ~mc, m~c, ~m~c (m = milk, c = coffee) and the corresponding all_conf, cosine, lift, and χ² values]

- lift and χ² are poor indicators because they are greatly affected by the null transactions (~m~c)
- all_conf and cosine are better indicators because they are not affected by the null transactions
- cosine is better when ~mc and m~c are unbalanced
- Null-invariance (freedom from the influence of null transactions) is an important property for measuring correlations in large transaction databases

Comparison of Four Correlation Measures (Cont.)

[Table: datasets D0, D1, D2 with transaction counts gv, ~gv, g~v, ~g~v and the corresponding all_conf, cosine, lift, and χ² values]

- lift and χ² show the correlation between g and v changing from rather positive to rather negative across the datasets
- all_conf and cosine cannot precisely assert positive/negative correlation when their values are around 0.5
- Rule of thumb: in large transaction databases, perform the all_conf or cosine analysis first; when the result shows only weak positive/negative correlation, lift or χ² can be used to assist the analysis

Constraint-Based Data Mining
Problems of Automatic Data Mining
- The derived patterns can be too many and unfocused
- Users may lack understanding of the derived patterns
- Users' domain knowledge cannot be taken advantage of
Interactive Data Mining
- Users direct the data mining process through queries or graphical user interfaces
Constraint-Based Mining
- Users specify constraints on what "kinds" of patterns are to be mined
- Knowledge type constraints: specify the type of knowledge to be mined (e.g. association or classification rules)
- Data constraints: specify the set of task-relevant data
- Dimension/level constraints: specify the desired dimensions (or attributes) of the data, or the levels of the concept hierarchies, to be used in mining
- Interestingness constraints: specify thresholds on statistical measures of pattern interestingness (e.g. support, confidence, correlation of association rules)
- Rule constraints: specify the forms of the rules to be mined

Metarule-Guided Association Rule Mining
Metarules
- Specify the syntactic form of the rules that users are interested in mining
- The rule forms are used as constraints to help improve the efficiency of the mining process
e.g. You are interested in finding associations between customer traits and the items they purchase. However, rather than finding all of the association rules that reflect these relationships, you are particularly interested in determining which pairs of customer traits promote the sale of office software.
Metarule: P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software")
- P1, P2: predicate variables that are instantiated to attributes from the database during mining
- X: a variable representing a customer
- Y, W: values of the attributes assigned to P1 and P2, respectively
A rule matching this metarule: age(X, "30..39") ∧ income(X, "41K..60K") ⇒ buys(X, "office software")
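A small sketch of metarule-guided filtering: keep only already-mined rules whose form matches the metarule (exactly two antecedent predicates and the fixed consequent). Representing a rule as lists of (predicate, value) pairs is an assumption made here for illustration, not something prescribed by the slides:

def matches_metarule(rule, consequent=("buys", "office software")):
    # rule = (antecedent predicates, consequent predicates)
    antecedent, conseq = rule
    return len(antecedent) == 2 and conseq == [consequent]

mined_rules = [
    ([("age", "30..39"), ("income", "41K..60K")], [("buys", "office software")]),
    ([("age", "20..29")], [("buys", "office software")]),
    ([("age", "30..39"), ("income", "41K..60K")], [("buys", "HDTV")]),
]
print([r for r in mined_rules if matches_metarule(r)])   # only the first rule matches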