Association Rule and Sequential Pattern Mining for Episode Extraction Jonathan Yip.


1 Association Rule and Sequential Pattern Mining for Episode Extraction Jonathan Yip

2 Introduction to Association Rule
Associating multiple objects/events together
Example: A customer buying a laptop also buys a wireless LAN card (2-itemset)
Laptop ⇒ Wireless LAN Card

3 Association Rule (cont)
Measures of Rule Interestingness
Support = P(Laptop ∪ LAN card)
  Probability that a transaction contains every item in the studied set (here, both the laptop and the LAN card)
Confidence = P(LAN card | Laptop) = P(Laptop ∪ LAN card) / P(Laptop)
  Conditional probability that a customer who bought a laptop also bought a wireless LAN card
Thresholds: Minimum Support: 25%, Minimum Confidence: 30%
Laptop ⇒ Wireless LAN Card [Support = 40%, Confidence = 60%]

4 Association Rule (eg.)
TID  Items
1    Bread, Coke, Milk
2    Chips, Bread
3    Coke, Eggs, Milk
4    Bread, Eggs, Milk, Coke
5    Coke, Eggs, Milk
Min_Sup = 25%, Min_Conf = 25%
Rule: Milk ⇒ Eggs
Support: P(Milk ∪ Eggs) = 3/5 = 60%
Confidence: P(Eggs | Milk) = P(Milk ∪ Eggs) / P(Milk), with P(Milk) = 4/5 = 80%
P(Eggs | Milk) = 60% / 80% = 75%
(75% confidence that a customer who buys milk also buys eggs)
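The Milk ⇒ Eggs arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names `support` and `confidence` are illustrative choices, not part of the slides, and exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction

# Transactions copied from the example table on this slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"Milk", "Eggs"}, transactions))       # 3/5
print(confidence({"Milk"}, {"Eggs"}, transactions))  # 3/4
```

Both values exceed the 25% thresholds, so Milk ⇒ Eggs would be reported as a rule.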

5 Types of Association
Boolean vs. Quantitative
Single dimension vs. Multiple dimension
Single level vs. Multiple level Analysis
Example:
1.) Gender(X,Male) ^ Income(X,>50K) ^ Age(X,35…50) ⇒ Buys(X, BMW Sedan)
2.) Income(X,>50K) ⇒ Buys(X, BMW Sedan)
3.) Gender(X,Male) ^ Income(X,>50K) ^ Age(X,35…50) ⇒ Buys(X, BMW 540i)

6 Association Rule (DB Miner)

7 Apriori Algorithm
Purpose
  To mine frequent itemsets for boolean association rules
  Uses prior knowledge: the frequent k-itemsets already found are used to generate the candidate (k+1)-itemsets
  An itemset has to be frequent (Support ≥ Min_Sup)
Anti-monotone concept
  If a set cannot pass the min_sup test, all of its supersets will fail as well

8 Apriori Algorithm Pseudo-Code
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
  C_{k+1} = candidates generated from L_k;
  for each transaction t in database do
    increment the count of all candidates in C_{k+1} that are contained in t
  L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
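The pseudo-code above can be turned into a short runnable sketch. The function name `apriori`, the use of `frozenset` keys, and the absolute `min_count` threshold are assumptions made for illustration; the data is the five-transaction example used throughout the slides.

```python
from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    # L_1: count single items and keep those meeting the threshold.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step (anti-monotone property): every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Count step: one pass over the database per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With `min_count=2` (the 25% threshold rounded up for five transactions), the prune step never counts a 3-itemset containing Bread & Eggs, because that pair already failed.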

9 Apriori Algorithm Procedures
Example revisited: 5 distinct items, 5 transactions
TID  Items
1    Bread, Coke, Milk
2    Chips, Bread
3    Coke, Eggs, Milk
4    Bread, Eggs, Milk, Coke
5    Coke, Eggs, Milk
Min_Sup = 25% (Min Support Count = 2 transactions), Min_Conf = 25%
Step 1 Scan & find support of each item (C1):
Item   Support
Bread  3
Coke   4
Milk   4
Chips  1 (fail)
Eggs   3
Step 2 Compare with Min_Sup and eliminate (prune) items with support < Min_Sup (L1):
Item   Support
Bread  3
Coke   4
Milk   4
Eggs   3

10 Apriori Algorithm (cont)
L1 set: Bread, Coke, Milk, Eggs
Step 3 Join (L1 ⋈ L1) to form the candidate 2-itemsets (C2):
Itemset        Support
Bread & Coke   2/5 = 40%
Bread & Milk   2/5 = 40%
Bread & Eggs   1/5 = 20%
Coke & Milk    4/5 = 80%
Coke & Eggs    3/5 = 60%
Milk & Eggs    3/5 = 60%
Repeated Step: Eliminate (prune) itemsets with support < Min_Sup; here only Bread & Eggs (20%) fails

11 Apriori Algorithm (cont)
L2 set: Bread & Coke, Bread & Milk, Coke & Milk, Coke & Eggs, Milk & Eggs
Join (L2 ⋈ L2):
Itemset                      Support count
Bread & Coke & Milk          2
Bread & Coke & Eggs          1 (fail)
Bread & Coke & Milk & Eggs   1 (fail)
Coke & Milk & Eggs           3
Compare with Min_Sup, then eliminate (prune) itemsets with support < Min_Sup
Conclusion:
  Bread & Coke & Milk are strongly associated (frequent 3-itemset)
  Coke & Milk & Eggs are strongly associated (frequent 3-itemset)

12 Sequential Pattern Mining
Introduction
  Mining of frequently occurring patterns related to time or other sequences
Examples
  70% of customers rent Star Wars, then Empire Strikes Back, and then Return of the Jedi
  Star Wars ⇒ Empire Strikes Back ⇒ Return of the Jedi
Application
  Intrusion detection on computers
  Web access pattern
  Predict disease with sequence of symptoms
  Many other areas

13 Sequential Pattern Mining (cont)
Steps:
Sort Phase
  Sort by Cust_ID, Transaction_ID
Litemset Phase
  Find large itemsets
Transform Phase
  Eliminates items < min_sup
Sequence Phase
  Find desired sequences
Maximal Phase
  Find the maximal sequences among the set of large sequences

14 Sequential Pattern Mining (cont)
Example: Database sorted by Cust_ID & Transaction Time (Min_sup = 25%)
Cust ID  Trans. Time  Items Bought
1        June 25 02   3
1        June 30 02   9
2        June 10 02   1, 2
2        June 15 02   3
2        June 20 02   4, 6, 7
3        June 25 02   3, 5, 7
4        June 25 02   3
4        June 30 02   4, 7
4        July 25 02   9
5        June 12 02   9
Organized format with Cust_ID:
Cust ID  Original Sequence
1        {(3) (9)}
2        {(1,2) (3) (4,6,7)}
3        {(3,5,7)}
4        {(3) (4,7) (9)}
5        {(9)}
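The sort phase shown above, from a transaction table to one sequence per customer, can be sketched as follows. The row layout `(cust_id, date, items)`, the function name `customer_sequences`, and reading "02" as the year 2002 are assumptions made for this illustration.

```python
from datetime import date
from itertools import groupby

# (cust_id, transaction date, items bought); "02" is assumed to mean 2002.
rows = [
    (4, date(2002, 7, 25), (9,)),
    (1, date(2002, 6, 25), (3,)),
    (2, date(2002, 6, 20), (4, 6, 7)),
    (1, date(2002, 6, 30), (9,)),
    (2, date(2002, 6, 10), (1, 2)),
    (2, date(2002, 6, 15), (3,)),
    (3, date(2002, 6, 25), (3, 5, 7)),
    (4, date(2002, 6, 25), (3,)),
    (4, date(2002, 6, 30), (4, 7)),
    (5, date(2002, 6, 12), (9,)),
]

def customer_sequences(rows):
    """Sort by (customer, time) and collect each customer's itemsets in order."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    return {
        cust: [frozenset(items) for _, _, items in group]
        for cust, group in groupby(ordered, key=lambda r: r[0])
    }

sequences = customer_sequences(rows)
for cust, seq in sequences.items():
    print(cust, [sorted(itemset) for itemset in seq])
```

The input rows are deliberately shuffled to show that the result depends only on the sort key, not on the order in which transactions arrive.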

15 Sequential Pattern Mining (cont)
Step 1: Sort (examples for several customers):
Cust ID  Original Sequence  Sequences to study           Support Count
1        {(3) (9)}          {(3)}, {(9)}, {(3) (9)}      4, 3, 2
5        {(9)}              {(9)}                        3
Conclusion: sequences with support > 25% (Min_sup): {(3) (9)} and {(3) (4,7)}
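Support counting for sequences can be sketched as below: a customer supports a pattern if the pattern's itemsets appear, in order, inside distinct transactions of that customer. The helper names `contains` and `seq_support` are hypothetical; the data is the five customer sequences from the previous slide.

```python
# Customer sequences from the example (one list of itemsets per customer).
sequences = [
    [{3}, {9}],
    [{1, 2}, {3}, {4, 6, 7}],
    [{3, 5, 7}],
    [{3}, {4, 7}, {9}],
    [{9}],
]

def contains(customer_seq, pattern):
    """True if each itemset of `pattern` is a subset of a later and later
    transaction of `customer_seq` (greedy earliest match)."""
    pos = 0
    for itemset in pattern:
        while pos < len(customer_seq) and not set(itemset) <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True

def seq_support(sequences, pattern):
    """Number of customers whose sequence contains `pattern`."""
    return sum(contains(s, pattern) for s in sequences)

for pattern in ([{3}], [{9}], [{3}, {9}], [{3}, {4, 7}], [{1, 2}, {3}]):
    print(pattern, seq_support(sequences, pattern))
```

Each customer is counted at most once per pattern, however many times the pattern occurs in that customer's history.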

16 Sequential Pattern Mining (cont)
Step 2: Litemset phase
Large itemset  Mapped to
(3)            1
(4)            2
(7)            3
(4 7)          4
(9)            5
Data sequence of each customer:
Cust ID  Original Sequence     Transformed Cust. Sequence       After mapping
1        {(3) (9)}             ⟨{(3)} {(9)}⟩                    ⟨{1} {5}⟩
2        {(1,2) (3) (4,6,7)}   ⟨{(3)} {(4) (7) (4 7)}⟩          ⟨{1} {2 3 4}⟩
3        {(3,5,7)}             ⟨{(3) (7)}⟩                      ⟨{1 3}⟩
4        {(3) (4,7) (9)}       ⟨{(3)} {(4) (7) (4 7)} {(9)}⟩    ⟨{1} {2 3 4} {5}⟩
5        {(9)}                 ⟨{(9)}⟩                          ⟨{5}⟩
The rightmost column gives each customer's buying pattern in terms of the mapped litemsets
Not in the final answer: {(1,2) (3)} (support below min_sup); {(3)}, {(4)}, {(7)}, {(9)}, {(3) (4)}, {(3) (7)}, {(4 7)} (meet min_sup but are not maximal)
Maximal sequences with support > 25%: {(3) (9)}, {(3) (4 7)}

17 Sequential Pattern Mining Algorithm
AprioriAll
  Counts all large sequences, including those that are not maximal
  Pseudo-code:
  C_k: candidate sequences of size k
  L_k: frequent (large) sequences of size k
  L_1 = {large 1-sequences}; // result of litemset phase
  for (k = 2; L_{k-1} != ∅; k++) do begin
    C_k = candidates generated from L_{k-1};
    for each customer sequence c in database do
      increment the count of all candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
  end
  Answer = maximal sequences in ∪_k L_k;
AprioriSome
  Generates every candidate sequence, but skips counting some large sequences (Forward Phase). Then discards candidates that are not maximal and counts the remaining large sequences (Backward Phase).
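A simplified sketch of the AprioriAll loop and the maximal phase, run on the mapped customer sequences from the litemset slide, might look like this. The function names (`occurs`, `apriori_all`, `is_subsequence`, `maximal`) and the tuple representation of candidate sequences are assumptions for illustration, and the candidate generation is a plain join-and-prune rather than an optimized implementation.

```python
# Litemset ids mapped back to itemsets, and the transformed customer sequences.
litemsets = {1: {3}, 2: {4}, 3: {7}, 4: {4, 7}, 5: {9}}
transformed = [
    [{1}, {5}],
    [{1}, {2, 3, 4}],
    [{1, 3}],
    [{1}, {2, 3, 4}, {5}],
    [{5}],
]

def occurs(seq, customer):
    """True if the litemset ids in `seq` appear in order in distinct transactions."""
    pos = 0
    for lid in seq:
        while pos < len(customer) and lid not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True

def apriori_all(transformed, min_count):
    """Return every large sequence (as tuples of litemset ids), level by level."""
    def count(seq):
        return sum(occurs(seq, c) for c in transformed)
    ids = sorted({lid for c in transformed for t in c for lid in t})
    level = [(i,) for i in ids if count((i,)) >= min_count]
    large = list(level)
    while level:
        prev = set(level)
        # Join: extend p with the last element of q when they share a prefix.
        cands = {p + (q[-1],) for p in level for q in level if p[:-1] == q[:-1]}
        # Prune: dropping any one element must leave a large (k-1)-sequence.
        cands = {c for c in cands
                 if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}
        level = sorted(c for c in cands if count(c) >= min_count)
        large.extend(level)
    return large

def is_subsequence(small, big):
    """Containment on unmapped sequences (itemset subset, order preserved)."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True

def maximal(large, litemsets):
    """Keep only sequences not contained in any other large sequence."""
    unmapped = [[litemsets[i] for i in seq] for seq in large]
    return [s for s in unmapped
            if not any(s != o and is_subsequence(s, o) for o in unmapped)]

large = apriori_all(transformed, min_count=2)
answer = maximal(large, litemsets)
print(large)
print(answer)
```

The maximal phase works on the unmapped itemsets, so that {(3) (4)} is recognized as contained in {(3) (4 7)} even though the two use different litemset ids.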

18 Episode Extraction
Episode: a partially ordered collection of events occurring together
Goal: To analyze a sequence of events and to discover recurrent episodes
Approach: first find small frequent episodes, then progressively look for larger episodes
Types of episodes
  Serial: E occurs before F
  Parallel: no constraints on the relative order of A & B
  Non-Serial/Non-Parallel: occurrence of A & B precedes C

19 Episode Extraction (cont)
A sequence of events (time axis from 30 to 65): E D F A B C E F C D B A D C E F C B E A E C F A
S = {(A_1, t_1), (A_2, t_2), …, (A_n, t_n)}
s = {(E,31), (D,32), (F,33), …, (A,65)}
A time window bounds how far apart events may be and still count as one episode
W(s,5): a window of width 5 slides along the whole sequence and takes a snapshot at each position
eg. the window (w,35,40) contains events A, B, C, E; the parallel episode (A, B) occurs in it, but the serial episode (E before F) does not
The user specifies in how many windows an episode has to occur to be frequent
Formula: frequency of an episode = (number of windows in W(s, win) in which the episode occurs) / (total number of windows in W(s, win))
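The sliding-window idea can be sketched as follows. The event sequence here is a small hypothetical one (the exact timestamps of the slide's figure are not recoverable from the transcript), and the window convention, every half-open window of width `win` that overlaps the sequence span, is one common choice assumed for this sketch. All function names are illustrative.

```python
from fractions import Fraction

# Hypothetical event sequence (type, time) spanning the interval [1, 8).
events = [("A", 1), ("B", 2), ("C", 4), ("A", 5), ("B", 7)]

def windows(events, t_start, t_end, win):
    """Yield the events inside each window [start, start + win) that
    overlaps the sequence span [t_start, t_end)."""
    for start in range(t_start - win + 1, t_end):
        yield [(e, t) for e, t in events if start <= t < start + win]

def occurs_parallel(episode, window_events):
    """Parallel episode: all event types present, in any order."""
    return set(episode) <= {e for e, _ in window_events}

def occurs_serial(episode, window_events):
    """Serial episode: event types present in the given order (greedy match)."""
    last = None
    for wanted in episode:
        times = sorted(t for e, t in window_events
                       if e == wanted and (last is None or t > last))
        if not times:
            return False
        last = times[0]
    return True

def frequency(events, t_start, t_end, win, occurs, episode):
    """Fraction of windows in which the episode occurs."""
    hits = total = 0
    for window_events in windows(events, t_start, t_end, win):
        total += 1
        hits += occurs(episode, window_events)
    return Fraction(hits, total)

print(frequency(events, 1, 8, 3, occurs_parallel, ("A", "B")))
print(frequency(events, 1, 8, 3, occurs_serial, ("A", "B")))
print(frequency(events, 1, 8, 3, occurs_serial, ("B", "A")))
```

An episode would then be called frequent when this fraction reaches a user-chosen threshold.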

20 Episode Extraction
Minimal occurrences
  Look at exact occurrences of episodes and the relationships between those occurrences
  The width of the window can be modified
  Eliminates unnecessary repetition of the recognition effort
Example
  mo( ) = {[35,38), [46,48), [57,60)}
Subepisodes: an episode can be a subepisode of another; this relation is used for discovering all frequent episodes
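Minimal occurrences can be illustrated with a brute-force sketch: an interval [ts, te) is a minimal occurrence if the episode occurs inside it but in neither of the two intervals obtained by shrinking it by one time unit on either side. The event data is hypothetical (integer timestamps are assumed), only parallel episodes are handled, and the function names are illustrative.

```python
# Hypothetical event sequence (type, time) with integer timestamps.
events = [("A", 1), ("B", 2), ("C", 4), ("A", 5), ("B", 7)]

def occurs_parallel(episode, window_events):
    """Parallel episode: all event types present, in any order."""
    return set(episode) <= {e for e, _ in window_events}

def minimal_occurrences(events, episode):
    """Return the half-open intervals [ts, te) that are minimal occurrences."""
    times = sorted({t for _, t in events})

    def inside(ts, te):
        return [(e, t) for e, t in events if ts <= t < te]

    result = []
    for ts in times:
        for te in (t + 1 for t in times if t >= ts):
            if (occurs_parallel(episode, inside(ts, te))
                    # Occurrence is monotone in the window, so checking the
                    # two one-step shrinkings is enough to establish minimality.
                    and not occurs_parallel(episode, inside(ts + 1, te))
                    and not occurs_parallel(episode, inside(ts, te - 1))):
                result.append((ts, te))
    return result

print(minimal_occurrences(events, ("A", "B")))
```

Because each minimal occurrence records exact start and end times, the same result can be reused for any window width instead of rescanning the sequence.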

21 Applications of Episode Extraction
Computer Security
Bioinformatics
Finance
Market Analysis
And more…

22 References
Discovery of Frequent Episodes in Event Sequences (Mannila, Toivonen, Verkamo)
Mining Sequential Patterns (Agrawal, Srikant)
Principles of Data Mining (Hand, Mannila, Smyth) 2001
Data Mining: Concepts and Techniques (Han, Kamber) 2001

23 END

