Association Rule Mining

CS246 Association Rule Mining

What is the problem? What is an association rule? Junghoo "John" Cho (UCLA Computer Science)

Motivating Problem
If a customer buys Diet Coke, is she likely to buy a nutrition bar?
  Useful for arranging store shelves, etc.
Beer and diapers
  Life as a parent is tough…

Word of Caution
Famous example: J. B. Rhine at Duke
  Tested students for “extrasensory perception”
  Asked them to guess the color (red or black) of 10 cards
  About 1/1000 of them guessed all 10 correctly, which is what chance alone predicts (1/2^10 = 1/1024)
If an experiment is repeated many times, some unlikely events happen for purely statistical reasons
  No physical validity

Problem Definition
Input: transaction records (each a set of items)
  T1: Bread, Milk, Apple
  T2: Beer, Chips
  T3: Pants, Brush, Toothpaste, Chopstick
  …
Output: all “association rules”
  Bread, Milk → Apple
  If a customer buys bread and milk, he is likely to buy an apple.

Confidence
Bread → Apple: if a customer buys bread, he is likely to buy an apple.
What does “likely” mean?
  A large fraction of baskets with bread also have apple.
Formally, P{ I1 | I2, I3 } > c
  c: confidence threshold, say 0.95
  Probability of buying an item given the other items
  If a customer buys I2 and I3, she buys I1 with probability above 95%
  “Strength” of the rule
Identify all association rules satisfying the confidence threshold c

Support
Do we really want to find all association rules?
  If we sell only 5 units of a particular product, who cares what it is sold with?
Find association rules only for sets of items that appear often enough.
Formally, P{ I1, I2, I3 } > s
  s: support threshold
  Fraction of records containing the itemset
  Statistical “significance”
  { I1, I2, I3 }: frequent itemset
Find association rules for frequent itemsets

Problem Definition
Input: transaction records (each a set of items)
Output: all association rules I2, I3 → I1 with
  support: P{ I1, I2, I3 } > s and
  confidence: P{ I1 | I2, I3 } > c
Is the difference between confidence and support clear?
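
The two measures can be estimated directly from a list of baskets. The following is a minimal Python sketch, not code from the lecture: the function names `support` and `confidence` and the toy baskets are illustrative assumptions.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimate P{rhs | lhs} as support(lhs and rhs together) / support(lhs)."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0  # the left-hand side never occurs, so the rule is vacuous
    both = frozenset(lhs) | frozenset(rhs)
    return support(both, transactions) / lhs_support


# Hypothetical toy baskets (not from the slides).
baskets = [
    frozenset({"Bread", "Milk", "Apple"}),
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Milk", "Apple", "Beer"}),
    frozenset({"Beer", "Chips"}),
]

rule_support = support({"Bread", "Milk", "Apple"}, baskets)
rule_confidence = confidence({"Bread", "Milk"}, {"Apple"}, baskets)
print(rule_support, rule_confidence)
```

Here the rule Bread, Milk → Apple is supported by 2 of the 4 baskets, while 2 of the 3 baskets containing bread and milk also contain an apple, so support and confidence differ even on this tiny input.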

Basic Algorithm?
Step 1: Find all frequent itemsets
  P{ I1, I2, I3 } > s
Step 2: From the frequent (large) itemsets, identify high-confidence rules
  P{ I1 | I2, I3 } > c

Step 1: Frequent Itemsets
Find all itemsets { I1, I2, … } with P{ I1, I2, … } > s: frequent itemsets
More informally, find all sets of items appearing in more than k transactions
Is it really difficult?
How can we solve it?

Naïve Approach
Keep counters for all subsets of items
  Items {A, B, C}: counters for {A}, {B}, {C}, {AB}, {BC}, {AC}, {ABC}
Scan all transaction records and increase the counters
  Transaction {A, C}: {A}++, {C}++, {AC}++
What is difficult?

Main Challenge?
Problem: 2^n subsets for n items
  1000 items: 2^1000 ≈ 10^301
  Clearly not feasible
Lesson: when the data size is large, even a simple problem can be very difficult.
What was their main idea?

Main Idea of Apriori Algorithm
If (A, B, C) is a frequent itemset, (A, B) is a frequent itemset
If (A, B) is not a frequent itemset, (A, B, C) cannot be a frequent itemset
Consider (A, B, C) only if all its subsets are frequent itemsets

Apriori Algorithm
1. L1 = { frequent 1-itemsets }, k = 1
2. Candidate set generation
     Candidate set Ck: potentially frequent k-itemsets
     {A, B, C} is a candidate iff all its subsets {A, B}, {B, C} and {A, C} are frequent itemsets
     Generate candidate set Ck+1 using Lk
3. Scanning
     Check whether the candidate sets are actually frequent
4. Increase k by 1, and go to step 2
Questions on candidate set generation:
  If {a}, {b} are large, is {a, b} a candidate?
  If {a, b}, {b, c} are large, is {a, b, c} a candidate?

Example
Items: {A, B, C, D}
Transactions: {A, B}, {A, D}, {A, B, C}, {B}
Support: 0.5 = 2 transactions

Example
[Figure: Apriori passes over the transactions {A, B}, {A, D}, {A, B, C}, {B}. Pass 1 candidates: A, B, C, D. Pass 2 candidate: {A, B}.]
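
The candidate-generation and scanning steps can be sketched in Python and run on the four example transactions. This is a minimal sketch under stated assumptions: an itemset counts as frequent when it appears in at least `min_count` transactions (the example treats 2 of 4 transactions as meeting support 0.5), and the names `apriori` and `min_count` are illustrative, not from the paper.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Candidate generation: join frequent k-itemsets and keep a (k+1)-itemset
        # only if every one of its k-item subsets is frequent.
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in prev for sub in combinations(union, k)
                ):
                    candidates.add(union)

        # Scanning: count each surviving candidate against the transactions.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


transactions = [{"A", "B"}, {"A", "D"}, {"A", "B", "C"}, {"B"}]
frequent_itemsets = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

Only A and B survive the first pass, so {A, B} is the single candidate in the second pass, and no 3-itemset is ever counted.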

Why Does Apriori Work?
Typical grocery-store scenario:
  100,000 different items
  10M baskets with 10 items each (10^8 item occurrences)
  support = 0.01
Q: How many items can Apriori eliminate?
A: At most 1000 items remain (at most 1% of the items)
  A frequent item must appear at least 0.01 × 10^7 = 10^5 times
  10^8 item occurrences in total, so at most 10^8 / 10^5 = 1000 frequent items

Basic Algorithm
Step 1: Find all frequent itemsets
  P{ I1, I2, I3 } > s
  Apriori algorithm
Step 2: From the frequent (large) itemsets, identify high-confidence rules
  P{ I1 | I2, I3 } > c

Step 2: High Confidence Rules
In principle, the second step is straightforward:
  P{ I1 | I2, I3 } = P{ I1, I2, I3 } / P{ I2, I3 }
We already estimated these values in the first step
Piece of cake. Simple division!

More On Step 2
Q: But given a frequent k-itemset, how many potential rules are there?
A: About 2^k (one for each choice of left-hand side)
Any efficient algorithm?
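
Enumerating the rules of one frequent itemset by brute force makes the 2^k growth concrete. The sketch below is an illustration, not the paper's rule-generation algorithm; `rules_from_itemset` and the `support_count` dictionary are hypothetical names, and the counts are those of the earlier four-transaction example.

```python
from itertools import combinations


def rules_from_itemset(itemset, support_count, min_conf):
    """List (lhs, rhs, confidence) for every split of a frequent itemset.

    `support_count` maps frozensets to transaction counts and must contain
    the itemset and every nonempty proper subset of it.
    """
    itemset = frozenset(itemset)
    rules = []
    # Every nonempty proper subset is a possible left-hand side: 2^k - 2 of them.
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = support_count[itemset] / support_count[lhs]  # simple division
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


# Counts from the example: A and B appear in 3 transactions each, {A, B} in 2.
support_count = {frozenset("A"): 3, frozenset("B"): 3, frozenset("AB"): 2}
rules = rules_from_itemset("AB", support_count, min_conf=0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```

Each rule costs one division because the counts were already collected in step 1; the expense comes only from the number of left-hand sides to try.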

Questions (1)
Is support pruning valid?
  What about Castillo de Ygay ($5000 wine) → Caviar?
  Even if we sell only 100 items, the profit is significant…
Technically very challenging
  Finding all association rules without support pruning
  Topic of the next paper

Questions (2)
Is P{Beer | Diaper} > 0.95 really meaningful?
  What if beer appears in 95% of baskets?
Interest:
  P{Beer, Diaper} / ( P{Beer} P{Diaper} )
Implication strength:
  Beer → Diaper is equivalent to ~(Beer, ~Diaper)
  P{Beer} P{~Diaper} / P{Beer, ~Diaper}
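
A small numeric check shows why high confidence alone can mislead. In this Python sketch the probabilities are hypothetical (beer in 95% of baskets, diapers in 10%, bought independently), and the function names are illustrative.

```python
def interest(p_xy, p_x, p_y):
    """P{X, Y} / (P{X} P{Y}); equals 1 when X and Y are independent."""
    return p_xy / (p_x * p_y)


def implication_strength(p_x, p_y, p_x_not_y):
    """For the rule X -> Y: P{X} P{~Y} / P{X, ~Y}; equals 1 under independence."""
    return p_x * (1 - p_y) / p_x_not_y


# Hypothetical numbers: beer and diapers are bought independently.
p_beer, p_diaper = 0.95, 0.10
p_both = p_beer * p_diaper            # 0.095
p_beer_not_diaper = p_beer - p_both   # 0.855

conf_beer_given_diaper = p_both / p_diaper
print(conf_beer_given_diaper)                       # confidence looks high
print(interest(p_both, p_beer, p_diaper))           # but interest is about 1
print(implication_strength(p_beer, p_diaper, p_beer_not_diaper))
```

The confidence P{Beer | Diaper} is 0.95 only because beer is in almost every basket; both interest and implication strength come out at about 1, signalling no real association.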

Follow-up Works
Candidate set generation still costly
Iceberg queries
No candidate set generation stage
Minimizing number of passes

Mining without Support Pruning
What is the Problem?
How can we identify “Castillo de Ygay → Caviar”?
  Apriori is efficient only for frequent items
Problem definition
  Data mining: low support, high correlation
  Finding rare, but very similar items

Matrix Representation
Typical scenario
  100,000 items
  10M baskets with 10 items each
Matrix
  Columns = items
  Rows = baskets
  Entry (i, j) = 1 if item cj is in basket ri
  Very sparse: almost all 0’s (about 0.01% 1’s)

Matrix Example
             a  b  c  d  e  f  g
{a, b}       1  1  0  0  0  0  0
{a, f}       1  0  0  0  0  1  0
{b, d, g}    0  1  0  1  0  0  1
{b, e}       0  1  0  0  1  0  0
{c, d, e}    0  0  1  1  1  0  0
{a}          1  0  0  0  0  0  0
{e, f}       0  0  0  0  1  1  0

Association Rule and Similarity
Think of column Ci as the set of rows with 1
Association rule (confidence)
  P(Cj | Ci) = |Ci ∩ Cj| / |Ci|
Similarity
  Sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|

Example
[Table: 0/1 columns C1 and C2; C1 has four 1’s, two rows have a 1 in both columns, and five rows have a 1 in at least one column]
P(C2|C1) = 2/4
Sim(C1, C2) = 2/5
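
Treating each column as the set of its rows with a 1 turns both measures into set arithmetic. In this Python sketch the row numbers are hypothetical, chosen only so that the counts match the example (two shared rows, four rows in C1, five rows in the union).

```python
def column_confidence(ci, cj):
    """P(Cj | Ci): fraction of Ci's rows that also have a 1 in Cj."""
    return len(ci & cj) / len(ci)


def similarity(ci, cj):
    """Sim(Ci, Cj): rows with a 1 in both columns over rows with a 1 in either."""
    return len(ci & cj) / len(ci | cj)


# Hypothetical row ids; only the overlap counts follow the example.
c1 = {1, 2, 3, 4}
c2 = {3, 4, 5}

print(column_confidence(c1, c2))  # 2/4
print(similarity(c1, c2))         # 2/5
```

Similarity is symmetric while confidence is not: `column_confidence(c2, c1)` is 2/3 for the same pair of columns.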

Problem Definition
Find all highly similar pairs
  All Ci, Cj with Sim(Ci, Cj) > s*
  s*: similarity threshold

Why Similarity (not Confidence)?
A1: The techniques work only for similarity
A2: High similarity implies high confidence
  |C1 ∩ C2| / |C1 ∪ C2| ≤ |C1 ∩ C2| / |C1|
  Same numerator, and the denominator on the left is at least as large
  All similar pairs are of high confidence

Assumption
The matrix does not fit into main memory
The number of columns is relatively small
  We can store some information in main memory for each item
The number of rows can be very big
Sparse data: mostly 0’s in the matrix

Key Idea?
“Compress” the matrix into a smaller one
Load the compressed matrix into main memory
Find high-similarity pairs from the compressed matrix
  Much easier than disk-based computation

Min-Hash? LSH? Hamming?
What are they for?
  Min-Hash: compression
  LSH: similarity pair computation
  Hamming LSH: compression + similarity

How To Compress? (1)
“Hash” each column C to a small signature Sig(C) such that
  Sim(C1, C2) is the same as the “similarity” of Sig(C1) and Sig(C2)
  Sig(C) is small enough that we can store the “compressed” matrix in main memory

How To Compress? (2)
Idea 1
  Pick 100 random rows
  Sig(C) = the 100 bits of C in the selected rows
  Would it work?
Idea 1 does not work
  The matrix is sparse
  Most of the signatures will be “0000…0”
  But the columns are different because their 1’s are in different rows

Min-Hashing
Imagine the rows are permuted randomly
“Hash” function h(C)
  The number of the first row (in the permuted order) with a 1 in column C

Example
[Figure: a 0/1 matrix with columns C1, C2, C3 over rows 1 to 5, the row permutation (45123), and the resulting Min-Hash values S1, S2, S3 (the values 5, 4 and 1 appear in the figure)]

Important Property
The probability that h(C1) = h(C2) is the same as Sim(C1, C2)
Why?

Row Types
Given C1 and C2, rows can be classified as
          C1  C2
  type a   1   1
  type b   1   0
  type c   0   1
  type d   0   0
a = # of rows of type a (likewise b, c, d)
Sim(C1, C2) = a / (a + b + c)
Q: What’s P{ h(C1) = h(C2) }?
A: a / (a + b + c)
  Look down C1 and C2 (in the permuted order) until we see a 1
  If that row is of type a, then h(C1) = h(C2)
  If it is of type b or c, then h(C1) ≠ h(C2)
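
The row-type argument can be checked exactly on a small matrix by enumerating every row permutation. The Python sketch below is only a verification aid: the two columns are hypothetical (2 rows of type a, 2 of type b, 1 of type c, 1 of type d), and the Min-Hash value is taken as the position of the first row with a 1 in the permuted order.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(column, order):
    """Position in `order` (a permutation of row ids) of the first row in `column`."""
    for position, row in enumerate(order):
        if row in column:
            return position
    return None  # empty column: no row has a 1


def agreement_probability(c1, c2, n_rows):
    """Exact P{h(C1) = h(C2)} when all n_rows! permutations are equally likely."""
    perms = list(permutations(range(n_rows)))
    agree = sum(min_hash(c1, p) == min_hash(c2, p) for p in perms)
    return Fraction(agree, len(perms))


# Hypothetical columns over rows 0..5, given as the sets of rows with a 1.
c1 = {0, 1, 2, 3}   # rows 2, 3 are type a; rows 0, 1 are type b
c2 = {2, 3, 4}      # row 4 is type c; row 5 is type d
jaccard = Fraction(len(c1 & c2), len(c1 | c2))

print(agreement_probability(c1, c2, n_rows=6), jaccard)
```

Type d rows never affect the outcome, which is why they are absent from the formula a / (a + b + c).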

Min-Hash Signature
Pick (say) 100 random permutations of the rows
Get the Min-Hash value from each permutation
Sig(C) = the list of 100 Min-Hash values
Sim( Sig(C1), Sig(C2) ) = fraction of permutations for which the Min-Hash values agree
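
Building signatures from explicit random permutations can be sketched as follows. This is a minimal in-memory Python illustration, assuming every column is a nonempty set of row ids; the function names, the seed, and the sample columns are hypothetical, and large-scale implementations typically simulate permutations with hash functions rather than shuffling row lists.

```python
import random


def minhash_signatures(columns, n_rows, n_perms=100, seed=0):
    """Return {name: list of n_perms Min-Hash values} for nonempty columns."""
    rng = random.Random(seed)
    signatures = {name: [] for name in columns}
    for _ in range(n_perms):
        order = list(range(n_rows))
        rng.shuffle(order)                       # one random row permutation
        rank = {row: pos for pos, row in enumerate(order)}
        for name, rows in columns.items():
            # Min-Hash value: position of the column's first 1 in permuted order.
            signatures[name].append(min(rank[r] for r in rows))
    return signatures


def signature_similarity(sig1, sig2):
    """Fraction of permutations on which the two Min-Hash values agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)


# Hypothetical columns over rows 0..7.
columns = {
    "C1": {0, 1, 2, 3},
    "C2": {2, 3, 4},      # Sim(C1, C2) = 2/5
    "C3": {0, 1, 2, 3},   # identical to C1
    "C4": {5, 6},         # disjoint from C1
}
sigs = minhash_signatures(columns, n_rows=8)
print(signature_similarity(sigs["C1"], sigs["C2"]))  # estimate of 2/5
```

Identical columns always agree and disjoint columns never do; for other pairs the estimate fluctuates around the true similarity and tightens as the number of permutations grows.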

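Example
[Figure: a 0/1 matrix with columns C1, C2, C3 over rows 1 to 5 and its signatures S1, S2, S3 under Perm1 = (12345), Perm2 = (54321), Perm3 = (34512), with a table comparing matrix similarity and signature similarity for the column pairs 1-2, 1-3 and 2-3; the values 0.5, 0.25 and 0.67 appear in the table]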

Basic Idea
“Compress” the matrix into a smaller one
  Min-Hash signature
Find high-similarity pairs from the compressed matrix
  How?

Problem
From the signature matrix (which fits into main memory), identify all similar pairs
Assuming 100,000 items
  Potentially 10^10 similar pairs?
  One counter per pair? No way
How?

Locality Sensitive Hashing
A technique to limit the number of pairs to consider
Approach
  Using LSH, identify “candidate similar pairs”
  Scan the Min-Hash signature matrix for verification

Partition the signature matrix into l bands of r rows each
[Figure: signature matrix with columns C1 to C7 and rows h1 to h6, split into l bands (band 1, band 2, …) of r rows each]

Hash each column in each band into buckets
[Figure: signature matrix with columns C1 to C7 and rows h1 to h6, with each band of each column hashed to a bucket]

Two columns are a candidate pair if they hash to the same bucket in any band
[Figure: signature matrix with columns C1 to C7 and rows h1 to h6; two columns that fall into the same bucket in one band are marked as a candidate pair]

Final verification
After identifying candidates, verify each candidate pair (Ci, Cj) by comparing Sig(Ci) and Sig(Cj) for similarity
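
Banding, bucketing, and verification can be sketched together in Python. This is an illustration under stated assumptions: the signatures are made up, the names `lsh_candidates` and `verify` are hypothetical, and each band's tuple of values is used directly as the bucket key, so two columns share a bucket only when they agree on the whole band (a hash into a fixed number of buckets would add some accidental collisions).

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, bands, rows_per_band):
    """Return the set of column pairs that share a bucket in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        start = b * rows_per_band
        for name, sig in signatures.items():
            # Bucket key: the column's values in this band.
            buckets[tuple(sig[start:start + rows_per_band])].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


def verify(candidates, signatures, threshold):
    """Keep candidate pairs whose full signatures agree on >= threshold of entries."""
    kept = []
    for a, b in sorted(candidates):
        sig_a, sig_b = signatures[a], signatures[b]
        sim = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
        if sim >= threshold:
            kept.append((a, b, sim))
    return kept


# Hypothetical signatures of 4 Min-Hash values: 2 bands of 2 rows each.
signatures = {
    "C1": [1, 2, 3, 4],
    "C2": [1, 2, 9, 9],
    "C3": [7, 8, 9, 9],
    "C4": [5, 5, 5, 5],
}
candidates = lsh_candidates(signatures, bands=2, rows_per_band=2)
print(sorted(candidates))
print(verify(candidates, signatures, threshold=0.5))
```

Only pairs that collide in some band are ever compared in full, which is what keeps the work far below one counter per pair.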

Example
100,000 columns
Signature of 100 Min-Hash integers per column
  Total signature table size: 4 bytes × 100 × 100,000 = 40 MB (not bad)
Potential similar pairs
  100,000 × 100,000 / 2 = 5,000,000,000 (too many!)
20 bands of 5 integers per band
Compute the false positive and false negative rates

False Negative: 80% Similar
Probability that C1, C2 are identical in one band
  0.8^5 = 0.328
Probability that C1, C2 are not identical in any of the 20 bands
  (1 - 0.328)^20 ≈ 0.00035
  We miss only about 1/3000 of the 80%-similar column pairs!
  Very few false negatives

False Positive: 40% Similar
Probability that C1, C2 are identical in one band
  0.4^5 ≈ 0.01
Probability that C1, C2 are identical in at least one of the 20 bands
  1 - (1 - 0.01)^20 ≈ 0.18
  Only about 20% of the dissimilar pairs are identified as candidate pairs
  False positives are much rarer when the similarity is << 40%
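
Both calculations are instances of one formula: with r rows per band and l bands, a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^l. A short Python sketch (the function name is illustrative) reproduces the two figures for r = 5 and l = 20.

```python
def candidate_probability(s, rows_per_band, bands):
    """P{two columns with similarity s agree on every row of at least one band}."""
    return 1 - (1 - s ** rows_per_band) ** bands


# 80%-similar pairs: probability of being missed (false negative).
miss_80 = 1 - candidate_probability(0.8, rows_per_band=5, bands=20)
# 40%-similar pairs: probability of becoming a candidate (false positive).
hit_40 = candidate_probability(0.4, rows_per_band=5, bands=20)

print(f"missed at s = 0.8:    {miss_80:.5f}")
print(f"candidate at s = 0.4: {hit_40:.3f}")

# The same function traces the whole S-shaped curve.
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, 5, 20), 4))
```

Without rounding 0.4^5 to 0.01, the false positive rate comes out slightly above 0.18, still in line with the "about 20%" figure.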

LSH Summary
Algorithm for identifying column pairs with similar signatures
  Split the signature matrix into l bands of r rows each
Identifies almost all similar pairs and a small number of dissimilar pairs
  By adjusting r and l

Hamming LSH
Life is simpler if the matrix has about 50% 1’s
  We can take a random collection of rows
Let us make the matrix denser! How?
Construct a series of matrices by OR-ing together pairs of rows
  0’s disappear over time…

Example
[Figure: pairs of rows OR-ed together into a shorter matrix with more 1’s]

Hamming LSH
Construct all matrices
  No more than log n matrices for n rows
  The total number of rows in all matrices is at most 2n
  At most twice as much work as for the original matrix
Identify similar columns from each matrix
  In each matrix, apply LSH to the columns with between 30% and 70% 1’s
  Report similar columns
Note that similar columns have similar densities, so they will be considered together in at least one matrix
  No point ever comparing columns whose numbers of 1’s are very different
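
The series of OR-ed matrices can be sketched in Python on a tiny matrix stored as a list of 0/1 rows. The function names and the sample matrix are hypothetical, and carrying an odd leftover row over unchanged is an assumption of this sketch, not something the slides specify.

```python
def or_fold(matrix):
    """OR together consecutive pairs of rows, roughly halving the row count."""
    folded = []
    for i in range(0, len(matrix) - 1, 2):
        folded.append([x | y for x, y in zip(matrix[i], matrix[i + 1])])
    if len(matrix) % 2:
        folded.append(list(matrix[-1]))  # assumption: odd leftover row kept as is
    return folded


def matrix_series(matrix):
    """The original matrix followed by successive OR-folds down to a single row."""
    series = [matrix]
    while len(series[-1]) > 1:
        series.append(or_fold(series[-1]))
    return series


def column_density(matrix, j):
    """Fraction of 1's in column j."""
    return sum(row[j] for row in matrix) / len(matrix)


# Hypothetical sparse matrix: 8 rows, 2 columns.
m = [[1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]]
series = matrix_series(m)
for level, mat in enumerate(series):
    densities = [column_density(mat, j) for j in range(2)]
    in_range = [j for j, d in enumerate(densities) if 0.3 <= d <= 0.7]
    print(level, len(mat), densities, "columns to compare:", in_range)
```

Both sample columns start at density 1/8 and reach 1/2 together after two folds, which is the level at which they would be handed to LSH.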

Summary
Apriori, Min-Hash, LSH, Hamming LSH
Finding frequent pairs?
  Apriori
Finding similar pairs?
  Min-Hash + LSH, or Hamming LSH
  Min-Hash: sparse matrix compression
  LSH: similar signature identification
  Hamming LSH: amplification of 1’s

Questions
Can we extend the techniques to multiple-column rules C1, C2 → C3?

Any Questions?

AprioriTid (1)
Q: What was the main idea?
A: Some transactions may not need to be checked
  Candidate itemsets: {A, B}, {A, C}
  Transaction: {A, D, E, F}?
  We may eliminate many transactions
Q: How do we know {A, D, E, F} is not necessary?
A: When we check {A, B} and {A, C}, we can tell that {A, D, E, F} does not contain any candidate set

AprioriTid (2)
In each pass, substitute each transaction with the set of candidate itemsets it contains
  Candidate sets: {A, B, C}, {A, C, D}, {A, C, M}
  Transaction T1: {A, B, C, D, F, G} → T1: {{A, B, C}, {A, C, D}}
Candidate itemset {A, C, D} appears in T1 if {A, C} and {A, D} appear in T1

AprioriTid (3)
Q: Advantage?
A: Many transactions/items may be eliminated
  Especially in later passes
Q: Disadvantage?
A: A transaction may be blown up
  T1: {A, B, C, D} → T1: {{A, B, C}, {A, B, D}}
Why not just eliminate “infrequent items”?

AprioriHybrid
In earlier passes, use Apriori
In later passes, use AprioriTid
Switching criterion
  Does the generated set of transactions fit in main memory?

History of the paper
Earlier SIGMOD ’93 paper (AIS algorithm)
  Very difficult to read. Poor organization
  Did not use the “obvious” pruning criteria
  Very naïve and simple heuristics
Techniques in the paper may not be very important
  Much more efficient algorithms were proposed the next year
Even great research starts with small ideas
  As you can see from the history
  Learn how a “simple” idea can change things…
