Association Rules and Sequential Patterns


1 Association Rules and Sequential Patterns
Bamshad Mobasher, DePaul University

2 Market Basket Analysis
The goal of MBA is to find associations (affinities) among groups of items occurring in a transactional database.
- MBA has its roots in the analysis of point-of-sale data, as in supermarkets, but has found applications in many other areas.
Association Rule Discovery
- The most common type of MBA technique: find all rules that associate the presence of one set of items with that of another set of items.
- Example: 98% of people who purchase tires and auto accessories also get automotive services done.
We are interested in rules that are:
- non-trivial (and possibly unexpected)
- actionable
- easily explainable

3 What Is Association Mining?
Association rule mining searches for relationships between items in a data set:
- Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, etc.
Rule form: Body ==> Head [support, confidence]
- Body and Head can be represented as sets of items or as predicates.
Examples:
- {diaper, milk, Thursday} ==> {beer} [0.5%, 78%]
- buys(x, "bread") ==> buys(x, "milk") [0.6%, 65%]
- major(x, "CS") /\ takes(x, "DB") ==> grade(x, "A") [1%, 75%]
- age(X, 30-45) /\ income(X, 50K-75K) ==> buys(X, SUVcar)
- age="30-45", income="50K-75K" ==> car="SUV"

4 Different Kinds of Association Rules
Boolean vs. quantitative
- Associations on discrete and categorical data vs. continuous data.
Single vs. multiple dimensions
- One predicate = single dimension; multiple predicates = multiple dimensions.
- buys(x, "milk") ==> buys(x, "butter")
- age(X, 30-45) /\ income(X, 50K-75K) ==> buys(X, SUVcar)
Single-level vs. multiple-level analysis
- Based on the levels of abstraction involved.
- buys(x, "bread") ==> buys(x, "milk")
- buys(x, "wheat bread") ==> buys(x, "2% milk")
Simple vs. constraint-based
- Constraints can be added on the rules to be discovered.

5 Basic Concepts
We start with a set I of items and a set D of transactions; D is all of the transactions relevant to the mining task. A transaction T is a set of items, i.e., a subset of I: T ⊆ I.
An association rule is an implication on itemsets X and Y, denoted by X ==> Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
- The rule meets a minimum confidence of c, meaning that c% of the transactions in D which contain X also contain Y.
- In addition, a minimum support of s is satisfied: at least s% of the transactions in D contain X ∪ Y.

6 Support and Confidence
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and those who buy both.)
Find all rules X ==> Y with minimum confidence and support:
- Support = probability that a transaction contains {X, Y}, i.e., the ratio of transactions in which X and Y occur together to all transactions in the database.
- Confidence = conditional probability that a transaction containing X also contains Y, i.e., the ratio of transactions in which X and Y occur together to those in which X occurs.
In general, the confidence of a rule LHS ==> RHS can be computed as the support of the whole itemset divided by the support of the LHS:
Confidence(LHS ==> RHS) = Support(LHS ∪ RHS) / Support(LHS)
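A minimal sketch of these two measures in Python; the basket data is hypothetical:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support(LHS u RHS) / Support(LHS)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Hypothetical market-basket data
transactions = [
    {"diaper", "milk", "beer"},
    {"diaper", "beer"},
    {"milk", "bread"},
    {"diaper", "milk", "bread", "beer"},
    {"milk", "bread"},
]
print(support({"diaper", "beer"}, transactions))       # 3/5 = 0.6
print(confidence({"diaper"}, {"beer"}, transactions))  # 3/3 = 1.0
```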

7 Support and Confidence - Example
Itemset {A, C} has a support of 2/5 = 40%.
Rule {A} ==> {C} has a confidence of 50%.
Rule {C} ==> {A} has a confidence of 100%.
Exercises:
- Support for {A, C, E}?
- Support for {A, D, F}?
- Confidence for {A, D} ==> {F}?
- Confidence for {A} ==> {D, F}?

8 Improvement (Lift)
High-confidence rules are not necessarily useful:
- What if the confidence of {A, B} ==> {C} is less than Pr(C)?
Improvement gives the predictive power of a rule compared to just random chance:
Improvement(X ==> Y) = Confidence(X ==> Y) / Support(Y)
Example (using the same transaction data):
- Itemset {A} has a support of 4/5.
- Rule {C} ==> {A} has a confidence of 2/2 = 1.0; Improvement = 1.0 / (4/5) = 5/4 = 1.25.
- Rule {B} ==> {A} has a confidence of 1/2; Improvement = (1/2) / (4/5) = 5/8 = 0.625.
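A short sketch of the improvement computation; the transactions below are a hypothetical dataset chosen to be consistent with the numbers on this slide:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def improvement(lhs, rhs, transactions):
    """confidence(LHS ==> RHS) / support(RHS); > 1.0 beats random chance."""
    conf = support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)
    return conf / support(rhs, transactions)

# Assumed data: supp({A}) = 4/5, conf({C} ==> {A}) = 1, conf({B} ==> {A}) = 1/2
transactions = [{"A", "B", "E"}, {"A", "D"}, {"A", "C"}, {"A", "C", "D"}, {"B", "F"}]
print(improvement({"C"}, {"A"}, transactions))  # 1.0 / 0.8 = 1.25
print(improvement({"B"}, {"A"}, transactions))  # 0.5 / 0.8 = 0.625
```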

9 Steps in Association Rule Discovery
Find the frequent itemsets:
- Frequent itemsets are the sets of items that have at least the minimum support.
- Support is "downward closed": a subset of a frequent itemset must also be a frequent itemset, e.g., if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets.
- This also means that if an itemset doesn't satisfy minimum support, none of its supersets will either (this is essential for pruning the search space).
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.

10 Mining Association Rules - An Example
Min. support = 50%; min. confidence = 50%.
Only these itemsets need to be kept, since {A} and {C} are subsets of {A, C}.
For the rule A ==> C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%

11 Apriori Algorithm
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- Join step: Ck is generated by joining Lk-1 with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

12 Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}
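A minimal sketch of the join and prune steps, with itemsets represented as sorted tuples; the demo reproduces this slide's example:

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Join:  merge two (k-1)-itemsets that agree on their first k-2 items.
    Prune: drop a candidate if any of its (k-1)-subsets is not frequent.
    """
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = set()
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:       # join step
                cand = a + (b[-1],)
                if all(sub in prev_set                   # prune step
                       for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))  # {('a','b','c','d')} -- acde is pruned (ade not in L3)
```

Since candidates are built as sorted tuples, combinations() yields each subset in the same sorted form used in prev_set, so the prune test is a plain membership check.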

13 Apriori Algorithm - An Example
Assume minimum support = 2

14 Apriori Algorithm - An Example
The final "frequent" itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}. These are the only itemsets from which we will generate association rules.
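A sketch of the full level-wise loop. The four transactions are an assumed dataset chosen to be consistent with the itemset counts quoted on these slides:

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_count):
    """Level-wise frequent-itemset mining; returns {itemset tuple: count}."""
    counts = defaultdict(int)
    for t in transactions:                       # C1 -> L1: single items
        for item in t:
            counts[(item,)] += 1
    frequent = {i: n for i, n in counts.items() if n >= min_count}
    all_frequent = dict(frequent)
    while frequent:
        prev = sorted(frequent)
        prev_set, k = set(prev), len(prev[0]) + 1
        candidates = set()                       # join + prune, as sketched above
        for a in prev:
            for b in prev:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cand = a + (b[-1],)
                    if all(sub in prev_set for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        counts = defaultdict(int)                # count candidates in the data
        for t in transactions:
            for cand in candidates:
                if set(cand) <= t:
                    counts[cand] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
    return all_frequent

# Assumed transactions matching the counts on slides 13-14
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(transactions, min_count=2))
# {(1,):2, (2,):3, (3,):3, (5,):3, (1,3):2, (2,3):2, (2,5):3, (3,5):2, (2,3,5):2}
```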

15 Generating Association Rules from Frequent Itemsets
Only strong association rules are generated:
- Frequent itemsets satisfy the minimum support threshold.
- Strong rules are those that satisfy the minimum confidence threshold.
confidence(A ==> B) = Pr(B | A) = support(A ∪ B) / support(A)
For each frequent itemset f, generate all non-empty proper subsets of f.
For every non-empty subset s of f:
  if support(f) / support(s) >= min_confidence
  then output rule s ==> (f - s)
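A sketch of this generation loop in Python; it takes a precomputed support table (here, the counts from the running example) and emits every rule meeting the confidence threshold:

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """supports: {frozenset(itemset): support count (or fraction)}."""
    rules = []
    for f, supp_f in supports.items():
        if len(f) < 2:
            continue
        for r in range(1, len(f)):               # all non-empty proper subsets
            for lhs in combinations(sorted(f), r):
                lhs = frozenset(lhs)
                conf = supp_f / supports[lhs]    # support(f) / support(s)
                if conf >= min_conf:
                    rules.append((set(lhs), set(f - lhs), conf))
    return rules

supports = {frozenset(k): v for k, v in {
    (1,): 2, (2,): 3, (3,): 3, (5,): 3,
    (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}.items()}
for lhs, rhs, conf in generate_rules(supports, min_conf=0.75):
    print(lhs, "==>", rhs, f"(conf = {conf:.2f})")
```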

16 Generating Association Rules (Example Continued)
Itemsets: {1,3} and {2,3,5}
Recall that the confidence of a rule LHS ==> RHS is the support of the whole itemset (i.e., LHS ∪ RHS) divided by the support of the LHS.
Candidate rules for {1,3}:
- {1} ==> {3}, conf. 2/2 = 1.00
- {3} ==> {1}, conf. 2/3 = 0.67
Candidate rules for {2,3,5} (and its subsets):
- {2,3} ==> {5}, conf. 2/2 = 1.00
- {2} ==> {5}, conf. 3/3 = 1.00
- {2,5} ==> {3}, conf. 2/3 = 0.67
- {2} ==> {3}, conf. 2/3 = 0.67
- {3,5} ==> {2}, conf. 2/2 = 1.00
- {3} ==> {2}, conf. 2/3 = 0.67
- {2} ==> {3,5}, conf. 2/3 = 0.67
- {3} ==> {5}, conf. 2/3 = 0.67
- {3} ==> {2,5}, conf. 2/3 = 0.67
- {5} ==> {2}, conf. 3/3 = 1.00
- {5} ==> {2,3}, conf. 2/3 = 0.67
- {5} ==> {3}, conf. 2/3 = 0.67
Assuming a minimum confidence of 75%, the final set of rules reported by Apriori is: {1} ==> {3}, {3,5} ==> {2}, {5} ==> {2}, and {2} ==> {5}.

17 Multiple-Level Association Rules
Items often form a hierarchy:
- Items at the lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels could be quite useful.
- The transaction database can be encoded based on dimensions and levels.
Example hierarchy: Food → Milk, Bread; Milk → Skim, 2%; Bread → Wheat, White.

18 Mining Multi-Level Associations
A top-down, progressive deepening approach:
- First find high-level strong rules: milk ==> bread [20%, 60%]
- Then find their lower-level "weaker" rules: 2% milk ==> wheat bread [6%, 50%]
Support thresholds across levels:
- When one threshold is set for all levels: if support is set too high, it is possible to miss meaningful associations at the low level; if set too low, uninteresting rules may be generated.
- Different minimum support thresholds across levels lead to different algorithms (e.g., decrease min-support at lower levels).
Variations in mining multiple-level association rules:
- Level-crossed association rules: milk ==> Wonder wheat bread
- Association rules with multiple, alternative hierarchies: 2% milk ==> Wonder bread

19 Quantitative Association Rules
Handling quantitative rules may require mapping the continuous variables into Boolean attributes, e.g., by discretizing numeric ranges into intervals, as in the sketch below.
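For illustration, a minimal sketch of such a mapping; the bin boundaries and labels below are assumptions, not taken from the slides:

```python
def discretize(value, bins):
    """Map a numeric value to an interval label, e.g. age 37 -> 'age=30-45'."""
    for low, high, label in bins:
        if low <= value <= high:
            return label
    return "other"

# Assumed bins for two quantitative attributes
age_bins = [(0, 29, "age<30"), (30, 45, "age=30-45"), (46, 120, "age>45")]
income_bins = [(0, 49_999, "income<50K"), (50_000, 75_000, "income=50K-75K"),
               (75_001, 10**9, "income>75K")]

# A numeric record becomes a Boolean "basket" of interval items,
# ready for ordinary (Boolean) association rule mining:
record = {"age": 37, "income": 62_000}
basket = {discretize(record["age"], age_bins),
          discretize(record["income"], income_bins)}
print(basket)  # {'age=30-45', 'income=50K-75K'}
```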

20 MBA in Text / Web Content Mining
Document associations
- Find (content-based) associations among documents in a collection.
- Documents correspond to items, and words correspond to transactions.
- Frequent itemsets are groups of docs in which many words occur in common.
Term associations
- Find associations among words based on their occurrences in documents.
- Similar to the above, but invert the table (terms as items, and docs as transactions).
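A small sketch of the two encodings, using a hypothetical three-document collection:

```python
from collections import defaultdict

docs = {  # hypothetical mini-collection: document id -> set of terms
    "d1": {"web", "mining", "usage"},
    "d2": {"web", "mining", "text"},
    "d3": {"text", "retrieval"},
}

# Term associations: terms are items, each document is a transaction
term_transactions = list(docs.values())

# Document associations: documents are items; each term acts as a
# "transaction" listing the documents it occurs in (the inverted table)
doc_transactions = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        doc_transactions[term].add(doc_id)

print(term_transactions)       # transactions for mining term associations
print(dict(doc_transactions))  # transactions for mining document associations
```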

21 MBA in Web Usage Mining
Association rules in Web transactions:
- Discover affinities among sets of Web page references across user sessions.
Examples:
- 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
- 30% of clients who accessed /special-offer.html placed an online order in /products/software/
- An actual example from the official IBM Olympics site: {Badminton, Diving} ==> {Table Tennis} [conf. 69.7%, sup. 0.35%]
Applications:
- Use rules to serve dynamic, customized content to users.
- Prefetch files that are most likely to be accessed.
- Determine the best way to structure the Web site (site optimization).
- Targeted electronic advertising and increasing cross sales.

22 Web Usage Mining: Example
Association rules from the Cray Research Web site.
Design "suggestions" from rules 1 and 2: there is something in J90.html that should be moved to the page /PUBLIC/product-info/T3E (why?)

23 Sequential / Navigational Patterns
Sequential patterns add an extra dimension to frequent itemsets and association rules: time.
- Items can appear before, after, or at the same time as each other.
- General form: "x% of the time, when A appears in a transaction, B appears within z transactions."
- Note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time); see the sketch after this list.
Examples:
- Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi" in that order.
- A collection of ordered events within an interval.
Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets.
Navigational patterns:
- These can be viewed as a special form of sequential patterns which capture navigational behavior among users of a site.
- In this case, a session is a consecutive sequence of pageview references for a user over a specified period of time.
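A rough sketch of how a rule of the general form above could be checked against data; the rental sequences are hypothetical:

```python
def fraction_b_within_z(sequences, a, b, z):
    """Of the sequences where A appears, in what fraction does B appear
    within the next z transactions (other items may intervene)?"""
    hits = total = 0
    for seq in sequences:               # seq: ordered list of transactions
        pos = [i for i, t in enumerate(seq) if a in t]
        if not pos:
            continue
        total += 1
        i = pos[0]                      # first appearance of A
        if any(b in t for t in seq[i + 1 : i + 1 + z]):
            hits += 1
    return hits / total if total else 0.0

sequences = [  # hypothetical rental histories
    [{"Star Wars"}, {"Empire Strikes Back"}, {"Return of the Jedi"}],
    [{"Star Wars"}, {"Alien"}, {"Empire Strikes Back"}],
    [{"Star Wars"}, {"Alien"}],
]
print(fraction_b_within_z(sequences, "Star Wars", "Empire Strikes Back", 2))  # 2/3
```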

24 Mining Sequences - Example
Given the customer-sequence table on the slide, the sequential patterns with support > 0.25 are: {(C), (H)} and {(C), (DG)}

25 Sequential Pattern Mining: Cases and Parameters
Duration T of a time sequence:
- Sequential pattern mining can then be confined to the data within a specified duration.
- Ex. subsequences corresponding to the year 1999.
- Ex. partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption.
Event folding window w:
- If w = T, time-insensitive frequent patterns are found.
- If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant.
- If 0 < w < T, sequences occurring within the same period w are folded in the analysis.

26 Sequential Pattern Mining: Cases and Parameters
Time interval, int, between events in the discovered pattern:
- int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found. Ex. "Find frequent patterns occurring in consecutive weeks."
- min_int <= int <= max_int: find patterns that are separated by at least min_int but at most max_int. Ex. "If a person rents movie A, it is likely she will rent movie B within 30 days" (int <= 30).
- int = c (an exact, nonzero interval): find patterns carrying an exact interval. Ex. "Every time the Dow Jones drops more than 5%, what will happen exactly two days later?" (int = 2).

27 Mining Navigational Patterns
Approach: build an aggregated sequence tree.
- This is the approach taken by the Web Utilization Miner (WUM) (Spiliopoulou, 1998).
- For each occurrence of a sequence, start a new branch or increase the frequency counts of matching nodes.
- In the example below, note that s6 contains "b" twice, hence the sequence is <(b,1), (d,1), (b,2), (e,1)>.

28 Mining Navigational Patterns
The aggregated sequence tree can be used directly to determine support and confidence for navigational patterns.
- Each node represents a navigational path ending in that node.
- Support = count at the node / count at the root
- Confidence = count at the node / count at the parent
Examples (from the tree on the slide, with 35 sessions at the root):
- Navigation pattern a → b: support = 11/35 = 0.31; confidence = 11/21 = 0.52
- Navigation pattern a → b → e: support = 11/35 = 0.31; confidence = 11/11 = 1.00
- Navigation pattern a → b → e → f: support = 3/35 = 0.086; confidence = 3/11 = 0.27
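A minimal sketch of the aggregated tree and these two measures; the sessions below are hypothetical (the slide's numbers come from its own 35 sessions):

```python
class Node:
    def __init__(self, page):
        self.page, self.count, self.children = page, 0, {}

def build_tree(sessions):
    """Aggregate sessions into a tree; each node counts the sessions
    that reach it along that path."""
    root = Node("root")
    for session in sessions:
        root.count += 1
        node = root
        for page in session:
            node = node.children.setdefault(page, Node(page))
            node.count += 1
    return root

def path_stats(root, path):
    """(support, confidence) for the navigational pattern `path`:
    support = node count / root count; confidence = node count / parent count."""
    node, parent = root, None
    for page in path:
        parent, node = node, node.children[page]
    return node.count / root.count, node.count / parent.count

sessions = [["a", "b", "e"], ["a", "b", "e"], ["a", "c"], ["b", "d"]]
root = build_tree(sessions)
print(path_stats(root, ["a", "b"]))  # support 2/4 = 0.5, confidence 2/3 = 0.67
```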

29 Mining Navigational Patterns
WUM supports a powerful mining query language to extract patterns from the aggregated tree. Example query:

SELECT t
NODES AS X Y Z, TEMPLATE AS t
WHERE X.support >= 20 AND Y.support >= 6 AND Z.support >= 4

(The patterns matching this query with X = "b" are shown on the slide.)

30 Mining Navigational Patterns
Another approach: Markov chains.
- The idea is to model the navigational sequences through the site as a state-transition diagram without cycles (a directed acyclic graph).
- A Markov chain consists of a set of states (pages or pageviews in the site) S = {s1, s2, ..., sn} and a set of transition probabilities P = {p1,1, ..., p1,n, p2,1, ..., p2,n, ..., pn,1, ..., pn,n}.
- A path r from a state si to a state sj is a sequence of states in which the transition probability between every pair of consecutive states is greater than 0.
- The probability of reaching a state sj from a state si via a path r = <si = r1, r2, ..., rk = sj> is the product of all the transition probabilities along the path: Pr(r) = p(r1,r2) × p(r2,r3) × ... × p(rk-1,rk).
- The probability of reaching sj from si is the sum over all paths r from si to sj: Pr(sj | si) = Σr Pr(r).

31 Mining Navigational Patterns
An example Markov chain: what is the probability that a user who visits the welcome page (Home) purchases a product ($)?
- Home -> Search -> PD -> $ = 1/3 × 1/2 × 1/2 = 1/12
- Home -> Cat -> PD -> $ = 1/3 × 1/3 × 1/2 = 1/18
- Home -> Cat -> $ = 1/3 × 1/3 = 1/9
- Home -> RS -> PD -> $ = 1/3 × 2/3 × 1/2 = 1/9
Sum = 1/12 + 1/18 + 1/9 + 1/9 = 13/36
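A short sketch that reproduces this sum; the transition probabilities are read off the slide's worked arithmetic, and the node names (Home, Search, Cat, RS, PD, $) follow the slide's figure:

```python
# Transition probabilities as given in the worked example above
P = {
    ("Home", "Search"): 1/3, ("Home", "Cat"): 1/3, ("Home", "RS"): 1/3,
    ("Search", "PD"): 1/2, ("Cat", "PD"): 1/3, ("Cat", "$"): 1/3,
    ("RS", "PD"): 2/3, ("PD", "$"): 1/2,
}

def path_prob(path):
    """Probability of one path = product of its transition probabilities."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= P[(a, b)]
    return prob

paths = [["Home", "Search", "PD", "$"],
         ["Home", "Cat", "PD", "$"],
         ["Home", "Cat", "$"],
         ["Home", "RS", "PD", "$"]]
print(sum(path_prob(p) for p in paths))  # 13/36, about 0.361
```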

32 Calculating conditional probabilities for transitions
Markov chain example: calculating conditional probabilities for transitions from a Web site hyperlink graph.
Sessions:
- A, B
- A, B, C
- A, B, C, D
- A, B, C, E
- A, C, E
- A, B, D
- A, B, D, E
- B, C
- B, C, D
- B, C, E
- B, E, D
Transition B → C:
- Total occurrences of B: 14
- Total occurrences of B → C: 8
- Pr(C | B) = 8/14 = 0.57
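A sketch of estimating transition probabilities by counting consecutive page pairs. In this sketch only pages followed by another page contribute to the denominator, so the exact figures depend on how occurrences are counted and on the underlying session data:

```python
from collections import Counter

def transition_probs(sessions):
    """Estimate Pr(next | current) from consecutive page pairs."""
    pair_counts, out_counts = Counter(), Counter()
    for s in sessions:
        for a, b in zip(s, s[1:]):
            pair_counts[(a, b)] += 1
            out_counts[a] += 1        # occurrences of `a` with a successor
    return {(a, b): c / out_counts[a] for (a, b), c in pair_counts.items()}

sessions = [list("AB"), list("ABC"), list("ABCD"), list("ABCE"), list("ACE"),
            list("ABD"), list("ABDE"), list("BC"), list("BCD"), list("BCE"),
            list("BED")]
probs = transition_probs(sessions)
print(probs[("B", "C")])  # share of B -> C among all transitions out of B
```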

33 Tools: Weka Package
Weka is a set of Java packages developed at the University of Waikato in New Zealand.
- Includes packages for data filtering, association rules, classification, clustering, and instance-based learning.
- Web site: http://www.cs.waikato.ac.nz/ml/weka/
- Can be used from the command line or through the Java-based GUI.
- Requires the data to be in a standard format called ARFF.

34 Weka: ARFF Format
ARFF files have two main sections.
Attribute section:
- categorical (nominal) attributes along with their values
- integer attributes along with a range
- real attributes
Data section:
- each record has values corresponding to the order in which attributes were specified in the attribute section
Example attribute section:

@RELATION zoo
@ATTRIBUTE animal {aardvark,antelope,bass,bear,boar, }
@ATTRIBUTE hair {false, true}
@ATTRIBUTE feathers {false, true}
@ATTRIBUTE eggs {false, true}
@ATTRIBUTE milk {false, true}
@ATTRIBUTE airborne {false, true}
@ATTRIBUTE aquatic {false, true}
@ATTRIBUTE predator {false, true}
@ATTRIBUTE toothed {false, true}
@ATTRIBUTE backbone {false, true}
@ATTRIBUTE breathes {false, true}
@ATTRIBUTE venomous {false, true}
@ATTRIBUTE fins {false, true}
@ATTRIBUTE legs INTEGER [0,9]
@ATTRIBUTE tail {false, true}
@ATTRIBUTE domestic {false, true}
@ATTRIBUTE catsize {false, true}
@ATTRIBUTE type { mammal, bird, reptile, fish, insect, }
...

35 Weka: ARFF Format
Data portion of the ARFF file for zoo animals. For association rule discovery, we first need to discretize numeric attributes using Weka "Filters".

@DATA
%
% Instances (101):
aardvark,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
antelope,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
bass,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
bear,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
boar,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
buffalo,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
calf,true,false,false,true,false,false,false,true,true,true,false,false,4,true,true,true,mammal
carp,false,false,true,false,false,true,false,true,true,false,false,true,0,true,true,false,fish
catfish,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
cavy,true,false,false,true,false,false,false,true,true,true,false,false,4,false,true,false,mammal
cheetah,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
...

36 Weka Explorer Interface
Can open the native ARFF format or the standard “CSV” format


39 Weka Attribute Filters

40 Weka Attribute Filters

41 After Saving the new relation in ARFF format
We can discretize "children" manually, since it only has a small number of discrete values.


44 Weka Discretization Filter

45 Weka Discretization Filter

46 Weka Discretization Filter

47 Weka Discretization Filter
After Saving the new relation in ARFF format

48 Weka Discretization Filter
After renaming attribute values for “age” and “income”

49 Weka Association Rules

50 Weka Association Rules

51 Weka Association Rules

52 Weka Association Rules
Another try with Lift >= 1.5

53 Weka Association Rules
Another try with Lift >= 1.5


