
1 Market Basket, Frequent Itemsets, Association Rules, Apriori and Other Algorithms

2 Market Basket Analysis
Market basket analysis reveals which items tend to be bought together, so you can spot what is likely missing from a given customer's basket and offer the right product. Think of Amazon or McDonald's: just try buying only one thing from them. The result is increased sales.

3 Items & Baskets Items are the objects that we are identifying associations between. For an online retailer, each item is a product in the shop. Baskets are instances of groups of items co-occuring together. Items go into baskets. The support of an item or item set is the number of transactions in our data set that contain that item or item set.

4 Support threshold & Frequent Itemset
Trans. ID | Purchased Items
    1     | A, D
    2     | A, C
    3     | A, B, C
    4     | B, E, F
    5     | A, C, F
What is the support of these itemsets? Sup(A,B) = 1, Sup(A,C) = 3. If the support threshold = 2, {A,C} is a frequent itemset.
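
To make the arithmetic concrete, here is a minimal sketch (not from the slides; the helper name is illustrative) that counts support over the five transactions above:

```python
# The five transactions from the table above, as sets of items.
transactions = [{"A", "D"}, {"A", "C"}, {"A", "B", "C"}, {"B", "E", "F"}, {"A", "C", "F"}]

def support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

print(support({"A", "B"}, transactions))  # 1
print(support({"A", "C"}, transactions))  # 3 -> frequent when the support threshold is 2
```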

5 Association Rules Association rules are if/then statements that help uncover relationships between seemingly unrelated data. A common application of association rules is market basket analysis.
Ex. If a customer buys a dozen eggs, he is 80% likely to also purchase milk.
Ex. If a customer buys a mouse, he/she is 95% likely to buy a keyboard as well.
The two main components of an association rule are:
- Antecedent: an item found in the data; can be viewed as the "if"
- Consequent: an item found in combination with the antecedent; can be viewed as the "then"

6 Association Rules Cont’d
Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships.
- Support: how often the items appear together in the dataset
- Confidence: how often the if/then statement has been found to be true
Ex. For a rule A ⇒ B over N transactions:
Support = frq(A, B) / N
Confidence = frq(A, B) / frq(A)
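
Continuing the earlier sketch (the `support` helper and toy `transactions` are illustrative assumptions, not slide content), the two formulas translate directly into code:

```python
def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent: frq(A, B) / frq(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

N = len(transactions)
print(support({"A", "C"}, transactions) / N)   # support of the rule A => C: 3/5
print(confidence({"A"}, {"C"}, transactions))  # confidence of A => C: 3/4
```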

7 Uses Data mining: association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping-basket data analysis, product clustering, catalog design and store layout. Programmers also use association rules to build programs capable of machine learning.

8 Association Rule Generation (Problem definition)
Two sub-problems:
- Finding the frequent itemsets (those whose occurrence counts exceed a predefined minimum support threshold)
- Deriving association rules from those frequent itemsets (subject to a minimum confidence threshold)
Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
If {beer, diaper, nuts} is a frequent itemset, then the itemsets {beer, diaper}, {diaper, nuts} and {beer, nuts} must also be frequent.
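
The Apriori property is what makes pruning possible: a candidate k-itemset can be discarded as soon as one of its (k−1)-subsets is not frequent. A minimal sketch of that check (names and the example data are illustrative):

```python
from itertools import combinations

def all_subsets_frequent(candidate, frequent_smaller):
    """Apriori-property check: every (k-1)-subset of `candidate` must already be frequent."""
    k = len(candidate)
    return all(frozenset(sub) in frequent_smaller for sub in combinations(candidate, k - 1))

# Suppose {beer, diaper} and {diaper, nuts} are frequent but {beer, nuts} is not:
# then {beer, diaper, nuts} cannot be frequent and is pruned.
L2 = {frozenset({"beer", "diaper"}), frozenset({"diaper", "nuts"})}
print(all_subsets_frequent(frozenset({"beer", "diaper", "nuts"}), L2))  # False
```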

9 The Apriori Algorithm Let Ck be the set of candidate itemsets of size k, and Lk be the set of frequent itemsets of size k.
Main steps of each iteration:
- Start from the frequent itemsets Lk-1
- Join step: Ck is generated by joining Lk-1 with itself (cartesian product Lk-1 x Lk-1)
- Prune step (using the Apriori property): any candidate containing a (k−1)-itemset that is not frequent cannot be a frequent k-itemset, hence is removed from Ck
- Count support of the remaining candidates to obtain the frequent itemset Lk, and repeat the steps until Lk = ∅
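
Below is a compact sketch of the whole loop in Python, assuming the transactions fit in memory (function and variable names are illustrative, not from the slides):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent single items
    items = {frozenset({i}) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    all_frequent, k = dict(frequent), 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: s for c, s in count(candidates).items() if s >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```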

10 Apriori Algorithm Example
Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. 2/9 ≈ 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; association rules are then generated using the minimum support and minimum confidence.
TID  | List of Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
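
For illustration, running the `apriori` sketch from slide 9 on this table with a support count of 2 should reproduce the frequent itemsets the later slides work from (this usage example is an assumption, not slide content):

```python
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(D, min_support=2)
# The largest frequent itemsets (L3) come out as {I1, I2, I3} and {I1, I2, I5}, each with support 2.
print({tuple(sorted(s)): c for s, c in frequent.items() if len(s) == 3})
```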

11 Apriori Algorithm Example

12 Apriori Algorithm Example

13 Apriori Algorithm Example

14 Apriori Algorithm Example

15 Apriori Algorithm Example

16 Generating Association Rules
L2 = {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}
For each frequent 2-itemset, both candidate rules and their confidences (counted against the transaction table from slide 10):
Rule     | Confidence
I1 ⇒ I2  | 4/6 = 67%
I1 ⇒ I3  | 4/6 = 67%
I1 ⇒ I5  | 2/6 = 33%
I2 ⇒ I3  | 4/7 = 57%
I2 ⇒ I4  | 2/7 = 29%
I2 ⇒ I5  | 2/7 = 29%
I2 ⇒ I1  | 4/7 = 57%
I3 ⇒ I1  | 4/6 = 67%
I5 ⇒ I1  | 2/2 = 100%
I3 ⇒ I2  | 4/6 = 67%
I4 ⇒ I2  | 2/2 = 100%
I5 ⇒ I2  | 2/2 = 100%
With the minimum confidence of 70%, only I5 ⇒ I1, I4 ⇒ I2 and I5 ⇒ I2 are strong rules from L2.

17 Generating Association Rules
L3 = {I1,I2,I3}, {I1,I2,I5}
Rules generated from the frequent 3-itemsets (again counted against the transaction table from slide 10):
Rule           | Confidence
{I1, I2} ⇒ I3  | 2/4 = 50%
{I1, I3} ⇒ I2  | 2/4 = 50%
{I2, I3} ⇒ I1  | 2/4 = 50%
I1 ⇒ {I2, I3}  | 2/6 = 33%
I2 ⇒ {I1, I3}  | 2/7 = 29%
I3 ⇒ {I1, I2}  | 2/6 = 33%
{I1, I2} ⇒ I5  | 2/4 = 50%
{I1, I5} ⇒ I2  | 2/2 = 100%
{I2, I5} ⇒ I1  | 2/2 = 100%
I1 ⇒ {I2, I5}  | 2/6 = 33%
I2 ⇒ {I1, I5}  | 2/7 = 29%
I5 ⇒ {I1, I2}  | 2/2 = 100%
Rule generation proceeds iteratively until no more new rules emerge; with the minimum confidence of 70%, the strong rules here are {I1,I5} ⇒ I2, {I2,I5} ⇒ I1 and I5 ⇒ {I1,I2}.
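
A sketch of the rule-generation step itself, reusing the `apriori` output (`frequent`) from the earlier example; the 70% threshold comes from the slides, the code and names are illustrative:

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for every rule meeting min_confidence.
    `frequent` maps frozenset itemsets to support counts, as returned by apriori()."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[antecedent]  # subsets are frequent, so the lookup succeeds
                if conf >= min_confidence:
                    yield antecedent, itemset - antecedent, conf

for a, c, conf in generate_rules(frequent, min_confidence=0.70):
    print(set(a), "=>", set(c), f"{conf:.0%}")
```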

18 PCY (Park-Chen-Yu) Algorithm
A hash-based improvement to A-Priori. During Pass 1 of A-Priori, most memory is idle, so PCY uses that memory to keep counts of buckets into which pairs of items are hashed: for each basket, every pair of items in it is hashed and the count of the corresponding bucket is incremented. PCY therefore accomplishes more on the first pass. The bucket counts live in an array disguised as a hash table (the indices act as the keys).
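
A minimal sketch of Pass 1, assuming a fixed bucket count and Python's built-in hash (both illustrative choices):

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 1_000_003  # illustrative; in practice, as many buckets as spare memory allows

def pcy_pass1(baskets):
    """Count single items and, at the same time, hash every pair into a bucket-count array."""
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    return item_counts, bucket_counts
```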

19 PCY (Park-Chen-Yu) Algorithm (Cont'd)
Between the passes, replace the bucket counts by a bit-vector ("bitmap"): 1 means the bucket is frequent (its count reached the support threshold), 0 means it is not. Also decide which single items are frequent and list them for the second pass. The bitmap gives an extra condition that candidate pairs must satisfy on Pass 2: a pair is counted only if both items are frequent and the pair hashes to a frequent bucket.
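
Continuing the same sketch, the between-pass compression and the Pass 2 candidate test could look like this (a real implementation would pack the bitmap into bits rather than a Python list):

```python
def pcy_between_passes(item_counts, bucket_counts, support_threshold):
    """Compress bucket counts into a bitmap and keep only the frequent single items."""
    bitmap = [count >= support_threshold for count in bucket_counts]
    frequent_items = {i for i, c in item_counts.items() if c >= support_threshold}
    return frequent_items, bitmap

def is_candidate_pair(a, b, frequent_items, bitmap):
    """Pass 2 condition: both items are frequent AND the pair hashes to a frequent bucket."""
    pair = tuple(sorted((a, b)))
    return a in frequent_items and b in frequent_items and bitmap[hash(pair) % NUM_BUCKETS]
```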

20 Simple Algorithm Instead of using the entire file of baskets, pick a random subset of the baskets and treat it as if it were the entire dataset. The safest way to pick the sample is to read the entire dataset and, for each basket, select that basket for the sample with some fixed probability p. If there are m baskets in the entire file, the sample will end up with a size very close to p·m baskets.
Ex. If the support threshold for the full dataset is s and we choose a sample of 1% of the baskets, then we examine the sample for itemsets that appear in at least s/100 of the baskets.
Smaller support thresholds will recognize more frequent itemsets but require more memory.
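
A minimal sketch of the sampling step (function names are illustrative):

```python
import random

def sample_baskets(baskets, p):
    """Keep each basket independently with probability p; expected sample size is about p * m."""
    return [b for b in baskets if random.random() < p]

def scaled_threshold(s, p):
    """Scale the full-dataset support threshold s down to the sample, e.g. s/100 for a 1% sample."""
    return s * p
```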

21 SON Algorithm
Part 1: The first pass of the SON algorithm performs the simple algorithm on subsets that together partition the dataset; processing the subsets in parallel is more efficient.
- Scan the data
- Break the data into chunks that can be processed in main memory
- Continuously fill memory with a new batch of data
- Run the sampling algorithm on each batch
- Generate candidate frequent itemsets
Part 2: The second pass counts the output from the first pass and determines whether an itemset is frequent across the whole dataset.
- Validates the candidate itemsets
- Counts all candidate itemsets and determines which are frequent in the entire set
Monotonicity property: if itemset X is frequent overall, then X is frequent in at least one batch, so every truly frequent itemset appears among the candidates.
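
A minimal two-pass sketch of SON, assuming each chunk fits in memory and reusing the `apriori` function from slide 9 (names are illustrative):

```python
def son(baskets, min_support, num_chunks):
    """Two-pass SON: union of per-chunk frequent itemsets, then a full recount."""
    chunk_size = max(1, len(baskets) // num_chunks)
    chunks = [baskets[i:i + chunk_size] for i in range(0, len(baskets), chunk_size)]

    # Pass 1: frequent itemsets within each chunk, at a proportionally lowered threshold.
    candidates = set()
    for chunk in chunks:
        local_support = max(1, round(min_support * len(chunk) / len(baskets)))
        candidates |= set(apriori(chunk, local_support))

    # Pass 2: count every candidate over the entire dataset; keep only the truly frequent ones.
    baskets = [frozenset(b) for b in baskets]
    result = {}
    for c in candidates:
        n = sum(1 for b in baskets if c <= b)
        if n >= min_support:
            result[c] = n
    return result
```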

22 Toivonen's Algorithm
Start as in the simple algorithm, but lower the support threshold for the sample further.
- Example: for a 1% sample, use s/125 rather than s/100.
The goal is to prevent false negatives: itemsets that are truly frequent in the whole dataset should still be found in the sample. An itemset whose sample support is close to, but below, the usual scaled threshold is still treated as frequent under this lower threshold.
An itemset is in the negative border if it is not deemed frequent in the sample, but all of its immediate subsets are.
- Example: if {a,b,c,d} is not frequent in the sample but {a,b,c}, {a,b,d}, {a,c,d} and {b,c,d} all are, then {a,b,c,d} is in the negative border.
In the second pass, count all the frequent itemsets from the first pass plus the negative border over the full dataset. If some member of the negative border turns out to be frequent, you have to start over with a different sample or support threshold level; otherwise the frequent itemsets found in the sample are the frequent itemsets of the whole dataset.
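
A minimal sketch of the negative-border computation over the sample's frequent itemsets (assuming `sample_frequent` is a set of frozensets and `all_items` is the set of all items; both names are illustrative):

```python
from itertools import combinations

def negative_border(sample_frequent, all_items):
    """Itemsets not frequent in the sample whose immediate subsets are all frequent."""
    # Singletons that are not frequent belong to the border (their only proper subset is the empty set).
    border = {frozenset({i}) for i in all_items if frozenset({i}) not in sample_frequent}
    for itemset in sample_frequent:
        for item in all_items - itemset:
            candidate = itemset | {item}
            if candidate in sample_frequent:
                continue
            if all(frozenset(s) in sample_frequent
                   for s in combinations(candidate, len(candidate) - 1)):
                border.add(candidate)
    return border
```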

23 Demonstration

