
1 Lecture 11 (Market Basket Analysis)
CSE 482: Big Data Analysis

2 Motivating Example

3 Problem Definition
[Table: market-basket transactions]
Given: a set of transactions, where each transaction contains a set of items
Goal: extract a set of rules that represent the interesting relationships among items

4 Applications
Items are asymmetric binary attributes: only their presence carries information.
Examples: words in documents, items in transactions, medical conditions, etc.
Non-asymmetric binary attributes: gender (male/female), answer (true/false), etc.

5 Example: Market Basket Data
Market-basket transactions. Association rule: X → Y
Interpretation: customers who buy X will also likely buy Y
X: rule antecedent; Y: rule consequent
Examples of association rules: {Milk, Diaper} → {Beer}, {Bread} → {Milk}, {Milk} → {Bread}

6 Important Considerations
Rule evaluation measure: there is an exponentially large number of possible rules to extract. How do we decide whether one rule is more "interesting" than another? Answer: a rule should be statistically significant.

7 Rule Evaluation Measures
Given an association rule X → Y:
Support (s): measures the rule's prevalence, i.e., how often it can be applied to the data:
s(X → Y) = (1/|T|) · Σ_{t ∈ T} I(X ∪ Y ⊆ t) = (number of transactions that contain both X and Y) / |T|
Confidence (c): measures the strength of the implication:
c(X → Y) = (number of transactions that contain X and Y) / (number of transactions that contain X)
where I(·) is an indicator function and T is the set of transactions (|T| is the number of transactions).
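As a minimal sketch of these two measures in Python (the helper names are mine, and the five transactions below are the classic market-basket example that the slides' tables appear to use):

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = support(X union Y) / support(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# The classic five-transaction market-basket example (assumed here).
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]

print(support({'Milk', 'Diaper', 'Beer'}, transactions))       # 0.4  (2/5)
print(confidence({'Milk', 'Diaper'}, {'Beer'}, transactions))  # 0.666... (2/3)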

8 Rule Evaluation Measures: Examples

9 Problem Definition (Formal)
Given: a set of transactions, where each transaction contains a set of items.
Task: extract a set of rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
where minsup and minconf are user-specified parameters.

10 How many rules are there?
Suppose there are d unique items. The total number of possible association rules is
R = 3^d − 2^(d+1) + 1
If d = 6, R = 602 rules.
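A throwaway snippet to sanity-check this count:

d = 6
R = 3**d - 2**(d + 1) + 1
print(R)  # 602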

11 Mining Association Rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup (these are known as frequent itemsets)
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

12 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.

13 Anti-Monotone Property of Support
Itemset                  Support
{Bread}                  4/5
{Bread, Milk}            3/5
{Bread, Milk, Diaper}    2/5

Support is non-increasing as the size of an itemset increases.

14 Apriori Algorithm
Anti-monotone property of support: the support of an itemset Y will never exceed the support of any of its subsets.
Apriori principle: if an itemset is infrequent, then all of its supersets must also be infrequent.
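Below is a compact Python sketch of Apriori-style frequent-itemset generation built on these two ideas; it is illustrative, not the C implementation used later in this lecture, and the function names are mine:

from itertools import combinations

def apriori_itemsets(transactions, minsup):
    """Return {frozenset: support} for all itemsets with support >= minsup."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]): s for i in items
             if (s := sup(frozenset([i]))) >= minsup}
    frequent = dict(level)

    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): drop any candidate that has an
        # infrequent (k-1)-subset, before counting its support.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: s for c in candidates if (s := sup(c)) >= minsup}
        frequent.update(level)
        k += 1
    return frequent

For example, apriori_itemsets(transactions, 0.6) on the five-transaction example sketched earlier reproduces the support-based pruning illustrated on slide 16.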

15 Illustrating Apriori Principle
[Figure: itemset lattice illustrating the Apriori principle; once an itemset is found to be infrequent, all of its supersets are pruned.]

16 Illustrating Apriori Principle
Items (1-itemsets) → pairs (2-itemsets) → triplets (3-itemsets); minimum support = 0.6 (no need to generate candidates involving Coke or Eggs, which are infrequent).
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

17 From Frequent Itemsets to Rules
Suppose {A, B, C} is a frequent itemset. There are only 6 possible candidate association rules generated from it:
{A, B} → {C}, {A, C} → {B}, {B, C} → {A}, {A} → {B, C}, {B} → {A, C}, {C} → {A, B}
A binary partitioning of an itemset means each item in the itemset is assigned to either the left- or right-hand side of the rule (but not both sides). So the rule {A} → {C} is not generated from {A, B, C}. Ordering within a set is irrelevant, so {A, B} → {C} is equivalent to {B, A} → {C}.

18 From Frequent Itemsets to Rules
Given a frequent itemset Z, find all non-empty subsets f ⊂ Z such that f → Z − f satisfies the minimum confidence requirement.
If {A, B, C, D} is a frequent itemset, we need to calculate the confidence of the rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |Z| = k, then there are 2^k − 2 candidate association rules (ignoring Z → ∅ and ∅ → Z).
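A small sketch that enumerates these 2^k − 2 candidates with itertools (names are illustrative):

from itertools import combinations

def candidate_rules(itemset):
    """Yield (lhs, rhs) for every binary partitioning of the itemset."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):       # size of the antecedent
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            yield lhs, itemset - lhs       # rule: lhs -> itemset - lhs

rules = list(candidate_rules({'A', 'B', 'C', 'D'}))
print(len(rules))  # 2**4 - 2 = 14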

19 Association Rule Generation
In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
But confidence of rules generated from the same itemset does have an anti-monotone property. E.g., suppose {A, B, C, D} is a frequent 4-itemset:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. an increasing number of items on the RHS of the rule.
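A hedged sketch of how rule generation exploits this property: consequents grow level by level, and only the consequents of rules that met minconf are extended (the sup helper mirrors the support function sketched earlier; names are mine):

def sup(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(Z, transactions, minconf):
    """Confidence-pruned rules f -> Z - f from one frequent itemset Z."""
    Z = frozenset(Z)
    s_Z = sup(Z, transactions)
    rules = []
    rhs_level = {frozenset([i]) for i in Z}  # start with 1-item consequents
    while rhs_level:
        next_level = set()
        for rhs in rhs_level:
            lhs = Z - rhs
            if not lhs:                      # rhs == Z is not a valid rule
                continue
            conf = s_Z / sup(lhs, transactions)  # s(Z) / s(lhs), since lhs U rhs = Z
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                # Anti-monotone pruning: only confident consequents grow.
                next_level |= {rhs | {i} for i in lhs}
        rhs_level = next_level
    return rules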

20 Association Rule Generation
[Figure: lattice of rules generated from a single frequent itemset; below a low-confidence rule, all rules with a larger consequent are pruned.]

21 Example: S&P 500 Stock Market Data
The dataset represents daily closing prices of S&P 500 stocks from January 1994 to October 1996.
Examples: Cisco Systems (ticker: CSCO), Applied Materials (ticker: AMAT).

22 Preprocessing
Compute the percentage change in closing price:
d = (x_t − x_{t−1}) / x_{t−1} × 100, where x_t is the closing price on day t.
Discretize the attribute:
Stock-UP if d ≥ 2
Stock-DOWN if d ≤ −2
Why not use d ≥ 0 and d < 0? To ensure the attribute is asymmetric binary (i.e., the data stays sparse).
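A hedged pandas sketch of this preprocessing; the toy prices and variable names are mine (the lecture uses the full S&P 500 data):

import pandas as pd

# Toy closing prices: one column per ticker, one row per trading day.
prices = pd.DataFrame({
    'CSCO': [20.0, 20.6, 20.1],
    'AMAT': [30.0, 29.2, 29.9],
})

# d_t = (x_t - x_{t-1}) / x_{t-1} * 100, per stock
pct = prices.pct_change() * 100

# Build one "transaction" per day; a stock with |d| < 2 contributes
# no item that day, which keeps the data sparse.
baskets = []
for day, row in pct.iloc[1:].iterrows():
    basket = [f'{tkr}-UP' for tkr, d in row.items() if d >= 2]
    basket += [f'{tkr}-DOWN' for tkr, d in row.items() if d <= -2]
    baskets.append(basket)

print(baskets)  # [['CSCO-UP', 'AMAT-DOWN'], ['AMAT-UP', 'CSCO-DOWN']]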

23 Transaction Data
Day  Items
1    AAPL-UP, AGREA-DOWN, AMH-UP, AT-DOWN, AXP-UP, …, WMTT-UP, WMX-UP, WY-UP
2    AA-UP, AAPL-UP, ACV-DOWN, AHC-UP, AL-UP, AMAT-UP, AMR-UP, AT-DOWN, …, WMB-UP, WMX-UP
3    AAPL-DOWN, ACK-UP, ADM-UP, ADSK-UP, AET-DOWN, AR-UP, ASN-UP, …, WH-UP, WMT-DOWN, XRX-UP, Z-DOWN

Number of transactions = 716; minsup = 5%; minconf = 90%
# frequent itemsets = 2397 (size 1: 708, size 2: 907, size 3: 670, size 4: 109, size 5: 3)
# rules = 56

24 Results cpq-up csco-up ==> amat-up s=0.0503, c=0.90, count=36, left=40, right=177 coms-up intc-up ==> lsi-up s=0.0503, c=0.90, count=36, left=40, right=148 csco-up txn-up ==> mu-up s=0.0615, c=0.92, count=44, left=48, right=169 adsk-up amat-up mu-up ==> txn-up s=0.0503, c=0.90, count=36, left=40, right=132 amat-up andw-up lsi-up ==> mu-up s=0.0503, c=0.90, count=36, left=40, right=169 amat-up csco-up sgi-up ==> bay-up s=0.0503, c=0.92, count=36, left=39, right=173 bay-up csco-up txn-up ==> amat-up s=0.0503, c=0.92, count=36, left=39, right=177 count: # transactions that contain all items on both LHS and RHS left: # transactions that contain items on the LHS of the rule right: # transactions that contain items on the RHS of the rule

25 Implementation
There are no standard packages for association rule mining in pandas or numpy, but there are packages developed by independent authors. A good implementation of the Apriori algorithm (written in C) is available online; download the apriori executable to your machine (choose the GNU/Linux or Windows executable).

26 Example
Download Titanic.csv from the class webpage.

27 Example (Frequent Itemset Generation)
Generate frequent itemsets with a support threshold of 5%.
An exclamation mark (!) means you're running a command from the system shell. Make sure you put the apriori executable in the same directory from which you ran your ipython notebook.
Command: apriori -options [inputfile] [outputfile]
Options:
-s5 : support threshold is 5%
-ts : target type = itemsets (i.e., generate the frequent itemsets)
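For example, as a notebook cell (file names follow the lecture; on GNU/Linux the executable usually needs the ./ prefix, on Windows plain apriori works):

# Frequent itemsets with support >= 5%, written to freqiset.txt
!./apriori -ts -s5 Titanic.csv freqiset.txt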

28 Example Output (freqiset.txt)
Itemset (support)

29 Example (Rule Generation)
Generate association rules with a support threshold of 5% and a confidence threshold of 80%.
Command: apriori -options [inputfile] [outputfile]
Options:
-tr : target type = rules (i.e., generate the association rules)
-m2 : minimum number of items per item set/association rule
-n4 : maximum number of items per item set/association rule
-s5 : support threshold is 5%
-c80 : confidence threshold is 80%
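Again as a notebook cell, with the output file name taken from the next slide:

# Rules with 2-4 items, support >= 5%, confidence >= 80%
!./apriori -tr -m2 -n4 -s5 -c80 Titanic.csv rules.txt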

30 Example Output (rules.txt)
Each line has the form: consequent <= antecedent (support, confidence)

31 Summary
Association rule mining is a powerful technique for quickly identifying statistically interesting relationships among items in a large database. The Apriori principle is the key to addressing the computational complexity of the problem.

