# Sequential Patterns & Process Mining: Current State of Research (Edgar de Graaf, LIACS)


Sequential Patterns & Process Mining Current State of Research Edgar de Graaf LIACS

2/34 Mining Sequential Patterns
- Sequential Patterns
- Sequence Databases
- AprioriAll
- PrefixSpan
- Gap Constraints

3/34 Sequential Patterns contained in not contained in

4/34 Sequential databases The Database with sequences

5/34 Sequential databases Support count 0 A Generated Candidate Pattern

6/34 Sequential databases Support count 0 1

7/34 Sequential databases Support count 1 Not Contained → Not Counted

8/34 Sequential databases Support count 1 Contained 2345 IF Minimal Support ≤ 50% THEN frequent
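The support-counting walk-through on the preceding slides can be sketched in Python. This is a minimal sketch, assuming a sequence is a tuple of item sets and that a pattern is contained when its item sets are subsets of distinct sequence elements in the same order. The names `is_contained`, `support` and `seq`, and the example database, are illustrative and not taken from the slides.

```python
from typing import FrozenSet, Sequence, Tuple

ItemSet = FrozenSet[str]


def seq(*itemsets: str) -> Tuple[ItemSet, ...]:
    """Helper (hypothetical): seq("a", "bc") builds the sequence <(a)(b,c)>."""
    return tuple(frozenset(s) for s in itemsets)


def is_contained(pattern: Sequence[ItemSet], sequence: Sequence[ItemSet]) -> bool:
    """True if every item set of the pattern fits, in order, into the sequence."""
    pos = 0
    for itemset in pattern:
        # Greedily move to the first element that is a superset of this item set.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later sequence element
    return True


def support(pattern: Sequence[ItemSet], database) -> float:
    """Fraction of database sequences that contain the pattern."""
    if not database:
        return 0.0
    return sum(is_contained(pattern, s) for s in database) / len(database)


# Illustrative database of four sequences (made-up data).
db = [seq("a", "bc", "d"), seq("a", "d"), seq("b", "c"), seq("ad", "b")]
pattern = seq("a", "d")
is_frequent = support(pattern, db) >= 0.5  # minimal support of 50%
```

The last sequence does not count: `a` and `d` occur in the same item set there, not one after the other.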

9/34 Lifting order (1) Notation by examples: an ordered list of sets ≡ sequence
- Every set A, B and C is unordered, e.g. A = (x,y,z) = (y,z,x) = (z,y,x) = …
- [x,y,z] is an extension: we ignore the order when counting frequency

10/34 Lifting order (2) and frequent → is frequent. Says: t3 and t2 frequently occur in between t1 and t4, in either order

11/34 Lifting order (3) and infrequent; suppose (t1)[t3,t2](t4) is frequent. Says: t3 and t2 often occur in between t1 and t4
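The effect of lifting the order can be illustrated with a small Python sketch. It assumes, for simplicity, sequences of single items rather than item sets, and it represents a lifted group such as [t3,t2] as a Python `set` inside the pattern. The function names and the three-sequence database are made up for illustration.

```python
from itertools import permutations, product


def contains(sequence, pattern):
    """True if pattern is a (not necessarily contiguous) subsequence of sequence."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def orderings(lifted):
    """Expand a lifted pattern into every concrete ordering of its unordered groups."""
    parts = [
        list(permutations(sorted(part))) if isinstance(part, (set, frozenset)) else [(part,)]
        for part in lifted
    ]
    return [tuple(x for part in combo for x in part) for combo in product(*parts)]


def support(pattern, database):
    """Number of sequences that contain the ordered pattern."""
    return sum(contains(s, pattern) for s in database)


def lifted_support(lifted, database):
    """Number of sequences that contain the pattern in any ordering of its groups."""
    return sum(any(contains(s, o) for o in orderings(lifted)) for s in database)


db = [("t1", "t3", "t2", "t4"), ("t1", "t2", "t3", "t4"), ("t1", "t4")]
lifted = ["t1", {"t2", "t3"}, "t4"]  # stands for (t1)[t3,t2](t4)

per_ordering = {o: support(o, db) for o in orderings(lifted)}
combined = lifted_support(lifted, db)
```

In this toy database each ordering is supported by one sequence only, while the lifted pattern is supported by two, which is the situation slide 11 describes.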

12/34 Existing Algorithms
- AprioriAll: the first algorithm based on the anti-monotone principle
- PrefixSpan: currently the fastest algorithm around; it uses projected databases

13/34 AprioriAll (1)

    AprioriAll(DB, min_sup) {
        L_1 = {frequent sequences of size 1}
        k = 2
        while (L_{k-1} is not empty) {
            C_k = candidateGeneration(L_{k-1}, k)
            C_k = candidatePruning(C_k, k)
            L_k = supportBasedPruning(C_k)
            k++
        }
    }

14/34 AprioriAll (2)

    candidateGeneration(L_{k-1}, k) {
        C_k = ø
        for each a in L_{k-1} {
            for each b in L_{k-1} {
                if (for all n, 1 ≤ n ≤ k-2: a_n = b_n)
                    add to C_k the sequences
                        {a_1 … a_{k-2}, a_{k-1}, b_{k-1}} and
                        {a_1 … a_{k-2}, b_{k-1}, a_{k-1}}
            }
        }
        return C_k
    }
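The pseudocode on the two AprioriAll slides can be turned into a small runnable Python sketch. It assumes sequences of single items (no item sets) and an absolute minimum support count. The helper names and the example database are illustrative. Because the double loop visits both (a, b) and (b, a), emitting `a + (b[-1],)` per pair yields both sequences named on the slide.

```python
from typing import Dict, Set, Tuple

Pattern = Tuple[str, ...]


def contains(sequence, pattern) -> bool:
    """True if pattern is a subsequence of sequence."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support_count(db, pattern) -> int:
    return sum(contains(s, pattern) for s in db)


def candidate_generation(prev: Set[Pattern]) -> Set[Pattern]:
    """Join patterns that agree on their first k-2 elements."""
    return {a + (b[-1],) for a in prev for b in prev if a[:-1] == b[:-1]}


def candidate_pruning(cands: Set[Pattern], prev: Set[Pattern], k: int) -> Set[Pattern]:
    """Anti-monotonicity: every (k-1)-subsequence of a candidate must be frequent."""
    return {c for c in cands if all(c[:i] + c[i + 1:] in prev for i in range(k))}


def apriori_all(db, min_count: int) -> Dict[Pattern, int]:
    items = {x for s in db for x in s}
    level = {(x,) for x in items if support_count(db, (x,)) >= min_count}
    frequent: Dict[Pattern, int] = {}
    k = 2
    while level:
        for p in level:
            frequent[p] = support_count(db, p)
        cands = candidate_pruning(candidate_generation(level), level, k)
        # Support-based pruning: keep only candidates that are frequent in the DB.
        level = {c for c in cands if support_count(db, c) >= min_count}
        k += 1
    return frequent


db = ["abc", "acb", "ab", "bc"]  # made-up sequences of single-character items
result = apriori_all(db, min_count=2)
```

In this example the only size-3 candidate that survives candidate pruning is `abc`, and it is then dropped by support-based pruning because only one sequence contains it.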

15/34 PrefixSpan (1) Assume a given prefix.
1. Scan the projected database to find every frequent item x such that either (a) the prefix with x added to its last item set is frequent, or (b) the prefix with x appended as a new item set is frequent
2. Append x to the prefix and output the pattern
3. Call recursively, e.g. PrefixSpan(new prefix, newProjDB)

16/34 PrefixSpan (2) A projected DB only stores the postfix
- E.g. if prefix = then we store as
- New projected DB = old projected DB minus the sequences without the prefix

17/34 PrefixSpan (3) Faster than AprioriAll
- No non-existing candidates are generated
- Support is tested on a shrinking projected DB
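The recursion described on the PrefixSpan slides can be sketched as follows. This is a simplified illustration, assuming sequences of single items, so only the "append x as a new element" case applies. `prefix_span` and its parameters are names chosen here, not the published implementation.

```python
from collections import Counter
from typing import Dict, Tuple

Pattern = Tuple[str, ...]


def prefix_span(projected_db, min_count: int, prefix: Pattern = ()) -> Dict[Pattern, int]:
    """Return every frequent pattern that extends `prefix`, with its support count."""
    patterns: Dict[Pattern, int] = {}

    # Step 1: scan the projected database; count each item once per postfix.
    counts = Counter()
    for postfix in projected_db:
        counts.update(set(postfix))

    for item, count in counts.items():
        if count < min_count:
            continue
        # Step 2: append the item to the prefix and output the pattern.
        new_prefix = prefix + (item,)
        patterns[new_prefix] = count
        # New projected DB: keep only what follows the first occurrence of the
        # item, and drop postfixes that do not contain the item at all.
        new_projected = [p[p.index(item) + 1:] for p in projected_db if item in p]
        # Step 3: recurse on the smaller projected database.
        patterns.update(prefix_span(new_projected, min_count, new_prefix))
    return patterns


db = ["abc", "acb", "ab", "bc"]  # made-up sequences of single-character items
result = prefix_span(db, min_count=2)
```

No candidate is ever generated that does not occur in the data, and each recursive call works on shorter postfixes, which matches the two reasons given on slide 17.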

18/34 Gap Constraint Simple idea: impose a maximal distance between the item sets of a sequence that match a pattern, e.g. if pattern = and gap = 1 then this sequence is not counted
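A gap-constrained containment test might look like the sketch below. The slide does not define how the distance is measured, so this example assumes `max_gap` is the largest allowed difference in position between two consecutive matched elements (so `max_gap = 1` means adjacent). It also assumes sequences of single items. A greedy leftmost match is not sufficient once a gap limit applies, so the sketch tracks every position at which a partial match can end.

```python
def contains_with_gap(sequence, pattern, max_gap: int) -> bool:
    """True if pattern occurs in sequence with consecutive matches at most max_gap apart."""
    if not pattern:
        return True
    # All positions where the first pattern element can be matched.
    ends = {i for i, x in enumerate(sequence) if x == pattern[0]}
    for item in pattern[1:]:
        # Extend each partial match to positions within the allowed distance.
        ends = {
            j
            for i in ends
            for j in range(i + 1, min(i + max_gap, len(sequence) - 1) + 1)
            if sequence[j] == item
        }
        if not ends:
            return False
    return bool(ends)


def support_with_gap(db, pattern, max_gap: int) -> int:
    """Number of sequences that contain the pattern under the gap constraint."""
    return sum(contains_with_gap(s, pattern, max_gap) for s in db)


# Made-up example: "abc" is not counted for pattern "ac" when max_gap = 1.
count_tight = support_with_gap(["abc", "ac", "axac"], "ac", max_gap=1)
count_loose = support_with_gap(["abc", "ac", "axac"], "ac", max_gap=2)
```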

19/34 Process Mining
- What is process mining?
- Using D/F tables and graphs
- Genetic Algorithms
- Problem areas
- Using sequential patterns

20/34 What is process mining? (1) The ordering of events is known. Process mining constructs a Petri net. [Figure: Petri net with the labels claim, register, to_be_evaluated, pay, send_letter, ready] Source: Workflow Management by W. van der Aalst and K. van Hee (1997).

21/34 What is process mining? (2) Usability of process mining:
- Given the audit trails, what is the workflow network?
- Mined workflow network ≡ original design? (Delta analysis)
- Mined workflow network better than the original design? (Performance analysis)

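22/34 Using D/F tables and graphs (1) For every task A, a D/F table:

| B | #B | B<A | A>B | B<<<A | A>>>B | A→B |
|---|---|---|---|---|---|---|
| T1 | 1000 | 0 | 0 | 687 | 0 | -0.246 |
| T2 | 1994 | 0 | 0 | 1035 | 505 | -0.487 |

Intuition: if A is often followed by B then the probability of A causing B increases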

23/34 Using D/F tables and graphs (2) A D/F graph is constructed: IF ((A→B ≥ N) AND (A>B ≥ σ) AND (B<A ≤ σ)) THEN connect A to B. More complicated rules deal with recursion and short loops
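The connection rule can be sketched in Python. The direct-succession counts (A>B, and B<A read as "B directly precedes A") are computed from audit trails. The causality score A→B is taken as a given input, because the slides do not show its formula. The function names, the example trails, the causality value and the thresholds `n` and `sigma` are all assumptions made for illustration.

```python
from collections import Counter


def direct_succession(trails) -> Counter:
    """Count how often task x is directly followed by task y over all trails."""
    counts = Counter()
    for trail in trails:
        counts.update(zip(trail, trail[1:]))
    return counts


def connect(a, b, succ: Counter, causality: dict, n: float, sigma: int) -> bool:
    """Slide rule: (A→B ≥ N) AND (A>B ≥ σ) AND (B<A ≤ σ)."""
    return (
        causality.get((a, b), 0.0) >= n   # A→B, supplied by the caller
        and succ[(a, b)] >= sigma          # A>B: A directly followed by B
        and succ[(b, a)] <= sigma          # B<A: B directly preceding A
    )


trails = ["ABCD", "ACBD", "ABCD"]          # made-up audit trails
succ = direct_succession(trails)
causality = {("A", "B"): 0.9}              # hypothetical score, not computed here
a_to_b = connect("A", "B", succ, causality, n=0.8, sigma=2)
b_to_a = connect("B", "A", succ, causality, n=0.8, sigma=2)
```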

24/34 Using D/F tables and graphs (3) D/F Graph example:

25/34 Using D/F tables and graphs (4) AND/OR-splits: OR if neither C>B nor B>C is higher than the threshold; AND if both are higher than the threshold. [Figure: tasks A, B, C]
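The split rule reads as a small decision function. In this sketch `b_then_c` and `c_then_b` stand for the counts B>C and C>B, and `threshold` is a user-chosen value. The slide does not say what happens when only one count exceeds the threshold, so the function returns `None` in that case. That choice is an assumption made here.

```python
from typing import Optional


def split_type(b_then_c: int, c_then_b: int, threshold: int) -> Optional[str]:
    """Classify the split after A into branches B and C."""
    if b_then_c > threshold and c_then_b > threshold:
        return "AND"   # B and C occur together, in either order
    if b_then_c <= threshold and c_then_b <= threshold:
        return "OR"    # B and C (almost) never follow each other
    return None        # undecided: the slide gives no rule for this case


both_orders = split_type(b_then_c=480, c_then_b=510, threshold=10)
never_together = split_type(b_then_c=0, c_then_b=2, threshold=10)
```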

26/34 Genetic Algorithms (1)
1. Create an initial population of workflows
2. Calculate their fitness using audit trails
3. Create a child
4. Mutate the child
5. Repeat 3 to 4 to create the new population
6. Go to 2
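The six steps can be sketched as a generic loop in Python. The slides leave the chromosome structure, the fitness measure and the genetic operators open, so everything concrete here is an assumption chosen only to make the loop runnable: a workflow is a set of directed edges between tasks, fitness is the Jaccard overlap with the direct successions seen in the audit trails, the crossover is uniform, and mutation flips one edge.

```python
import random
from itertools import permutations


def fitness(model, trails) -> float:
    """Assumed fitness: overlap between model edges and observed direct successions."""
    observed = {pair for trail in trails for pair in zip(trail, trail[1:])}
    union = model | observed
    return len(model & observed) / len(union) if union else 0.0


def evolve(trails, pop_size: int = 20, generations: int = 30, seed: int = 0):
    rng = random.Random(seed)
    tasks = sorted({task for trail in trails for task in trail})
    edges = list(permutations(tasks, 2))  # every possible directed edge

    def random_model():
        return frozenset(e for e in edges if rng.random() < 0.5)

    def make_child(p1, p2):
        # Step 3: uniform crossover, each edge is inherited from one parent.
        genes = {e for e in edges if e in (p1 if rng.random() < 0.5 else p2)}
        # Step 4: mutation, flip the presence of one randomly chosen edge.
        genes ^= {rng.choice(edges)}
        return frozenset(genes)

    population = [random_model() for _ in range(pop_size)]            # step 1
    for _ in range(generations):                                      # step 6
        ranked = sorted(population, key=lambda m: fitness(m, trails),
                        reverse=True)                                 # step 2
        parents = ranked[: pop_size // 2]
        # Step 5: keep the best model and fill the rest with mutated children.
        population = [ranked[0]] + [
            make_child(rng.choice(parents), rng.choice(parents))
            for _ in range(pop_size - 1)
        ]
    return max(population, key=lambda m: fitness(m, trails))


trails = ["ABCD", "ABCD", "ACBD"]  # made-up audit trails
best = evolve(trails)
best_score = fitness(best, trails)
```

Keeping the best model in every generation is a common design choice that prevents the best fitness from dropping. The slides do not state whether the author uses it.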

27/34 Genetic Algorithms (2)
Advantages:
- Can deal with duplicate tasks and non-free choice.

Disadvantages:
- The structure of the “chromosome”
- How do we measure fitness?
- How do we do cross-over and mutation?

29/34 Problem Areas (2) Mining non-free-choice [Figure: tasks A, B, C, D, E]

30/34 Problem Areas (3) Mining loops: ABCDBCD [Figure: tasks B, C, D, A]

31/34 Problem Areas (4) Delta analysis: how do we compare two models? Other problems: time, dealing with noise and incompleteness.

32/34 Using sequential patterns Mining loops? Fitness measure in a GA? Use in delta analysis? Generate the important frequent subsequences to help the designer

33/34 Further research in sequences How about gaps between items in different item sets? What type of frequent subsequences to use in fitness? Lifting order, is it useful in workflow generation? Further research of lifting order

34/34 The End Thank you for your attention Edgar de Graaf edegraaf@liacs.nl
