1 The Strategies for Mining Fault-Tolerant Patterns Jia-Ling Koh Department of Information and Computer Education National Taiwan Normal University.


1 1 The Strategies for Mining Fault-Tolerant Patterns Jia-Ling Koh Department of Information and Computer Education National Taiwan Normal University

2 2 Two Topics Introduced in This Talk: (1) the strategies for mining fault-tolerant frequent itemsets (patterns) from a transaction database; (2) the strategies for mining fault-tolerant repeating patterns from a data sequence.

3 3 An Efficient Approach for Mining Fault-Tolerant Frequent Patterns based on Bit Vector Representations Jia-Ling Koh and Pei-Wy Yo DASFAA 2005

4 4 Motivation; Related works; Problem definition; Appearing bit vectors; VB_FT_Mine algorithm (Vector-Based Fault-Tolerant frequent patterns Mining); Experiments; Conclusion and future works

5 5 With min-sup = 4, the only frequent pattern is E. With min-sup = 3, the frequent patterns are B, D, E, F, G, BE and DE. With the expected minimum support, few frequent patterns are discovered; with a low min-support, the result carries no general information and no representative frequent patterns. Sample database (TID: items): T1: B D E F; T2: A C D E; T3: B E F G; T4: C F G; T5: A B D E G.

6 6 Fault tolerance: instead of asking whether a transaction contains a pattern exactly, ask whether it contains the pattern with fault tolerance, for example whether it contains 4 out of the 5 items {B, D, E, F, G}. This yields a longer "approximate" pattern (BDEFG) with support count 4. Sample database (TID: items): T1: B D E F; T2: A C D E; T3: B E F G; T4: C F G; T5: A B D E G.

7 7 FT-Apriori algorithm (ACM-SIGMOD, 2001): an Apriori-style approach that applies the "downward closure" property. It suffers from generating a large number of candidates and from repeatedly scanning the database.

8 8 When the fault tolerance is set to 1, a transaction FT-contains BDE if it contains any (|BDE| - 1) = 2 items of BDE. That is, a transaction containing BD, BE or DE FT-contains BDE.

9 9 δ (fault tolerance) = 1, itemset P = {B, D, E}. FT-body_1(P) = {T1, T2, T3, T5}, so FT-sup_1(P) = 4. Item supports within the FT-body: Item-Sup(B) = 3, Item-Sup(D) = 3, Item-Sup(E) = 4. Items of P contained in each transaction of the sample database: T1: B D E; T2: D E; T3: B E; T4: none; T5: B D E.

10 10 A fault-tolerant frequent pattern P satisfies: (1) FT-sup_δ(P) ≥ min-sup_FT; (2) for every p ∈ P, Item-Sup(p) ≥ min-sup_item.

11 11 Example: δ = 1, min-sup_FT = 4, min-sup_item = 3, itemset P = {B, D, E}. FT-sup_1(P) = 4; Item-Sup(B) = 3, Item-Sup(D) = 3, Item-Sup(E) = 4. Therefore BDE is an FT-frequent pattern. Sample database, with the items of P that each transaction contains in brackets: T1: B D E F [B D E]; T2: A C D E [D E]; T3: B E F G [B E]; T4: C F G [none]; T5: A B D E G [B D E].
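The definition on slides 9 to 11 can be checked directly against the sample database. The following is a minimal sketch that applies the definition by brute force; the function names (ft_body, item_sup, is_ft_frequent) are illustrative and do not come from the slides.

```python
# Sample database from the slides: TID -> set of items.
DB = {
    "T1": set("BDEF"),
    "T2": set("ACDE"),
    "T3": set("BEFG"),
    "T4": set("CFG"),
    "T5": set("ABDEG"),
}

def ft_body(pattern, delta):
    """TIDs of transactions containing at least |P| - delta items of P."""
    p = set(pattern)
    return {tid for tid, items in DB.items() if len(p & items) >= len(p) - delta}

def item_sup(item, pattern, delta):
    """Number of transactions in the FT-body of P that contain the item."""
    return sum(1 for tid in ft_body(pattern, delta) if item in DB[tid])

def is_ft_frequent(pattern, delta, min_sup_ft, min_sup_item):
    """Both conditions of slide 10: FT-support and per-item support."""
    body = ft_body(pattern, delta)
    return (len(body) >= min_sup_ft and
            all(item_sup(x, pattern, delta) >= min_sup_item for x in pattern))

print(sorted(ft_body("BDE", 1)))           # FT-body of BDE with delta = 1
print(is_ft_frequent("BDE", 1, 4, 3))      # thresholds used on slide 11
```

This scans every transaction for every pattern, which is exactly the cost the bit-vector representation in the following slides is meant to avoid.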

12 12 Appearing vector table. Sample database (TID: items): T1: B D E F; T2: A C D E; T3: B E F G; T4: C F G; T5: A B D E G. Appearing vector of each item (Appear_P), one bit per transaction T1 to T5: A: 0 1 0 0 1; B: 1 0 1 0 1; C: 0 1 0 1 0; D: 1 1 0 0 1; E: 1 1 1 0 1; F: 1 0 1 1 0; G: 0 0 1 1 1.
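As a small illustration of the table on this slide, the sketch below builds the appearing vectors from the sample database and counts the 1 bits to obtain each item's support. Vectors are kept as strings of '0' and '1' for readability; that representation and the helper names are assumptions made for this example.

```python
TRANSACTIONS = ["BDEF", "ACDE", "BEFG", "CFG", "ABDEG"]  # T1..T5

def appearing_vectors(transactions):
    """Map each item to a bit string with one bit per transaction."""
    items = sorted(set("".join(transactions)))
    return {x: "".join("1" if x in t else "0" for t in transactions)
            for x in items}

def support(vector):
    """Support count = number of bits set to 1."""
    return vector.count("1")

APPEAR = appearing_vectors(TRANSACTIONS)
for item, vec in APPEAR.items():
    print(item, vec, support(vec))
```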

13 13 Appearing vector: the support count of an item is obtained by counting the number of bits set to 1 in its appearing vector.

14 14 Appearing vector: Appear_A = 01001 and I_5 = 11111 (the all-ones vector of length 5), so the support count of A is Vector(Appear_A) · I_5 = 2.

15 15 Appear_A = 01001, Appear_D = 11001. Appear_AD = Appear_A ∧ Appear_D = 01001 ∧ 11001 = 01001, so AD is contained in T2 (A C D E) and T5 (A B D E G) of the sample database.

16 16 Pattern P = AD, δ = 1: T1, T2 and T5 FT-contain AD (T1 contains D; T2 and T5 contain both A and D), so FT-Appear_AD(1) = 11001.

17 17 FT-Appear_AD(1) = 11001. FT-sup_1(AD) = 11001 · 11111 = 3; Item-Sup(A) = 11001 · 01001 = 2; Item-Sup(D) = 11001 · 11001 = 3.

18 18 Itemset AB: FT-Appear_AB(1) = Appear_A ∨ Appear_B. Itemset ABC: FT-Appear_ABC(1) = Appear_AB ∨ Appear_BC ∨ Appear_AC; FT-Appear_ABC(2) = Appear_A ∨ Appear_B ∨ Appear_C. Computed this way, a pattern P needs C(|P|, δ) - 1 OR operations.
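The direct construction on this slide ORs together the appearing vectors of every sub-itemset of size |P| - δ. A minimal sketch, assuming the vectors are stored as Python integers whose leftmost bit is T1; the function name is illustrative only.

```python
from itertools import combinations

N = 5                       # transactions in the sample database
ALL_ONES = (1 << N) - 1     # I_|DB|
# Appearing vectors from the slides; the leftmost bit is T1.
APPEAR = {"A": 0b01001, "B": 0b10101, "C": 0b01010, "D": 0b11001,
          "E": 0b11101, "F": 0b10110, "G": 0b00111}

def ft_appear_by_subsets(pattern, delta):
    """OR, over all (|P| - delta)-item subsets, of the AND of their vectors."""
    k = len(pattern) - delta
    if k <= 0:
        return ALL_ONES     # every transaction FT-contains P
    result = 0
    for subset in combinations(pattern, k):
        conj = ALL_ONES
        for x in subset:
            conj &= APPEAR[x]
        result |= conj
    return result

print(format(ft_appear_by_subsets("ABC", 1), "05b"))
print(format(ft_appear_by_subsets("ABC", 2), "05b"))
```

The number of subsets grows combinatorially with |P|, which is the motivation for the incremental rule given in the theorem that follows.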

19 19 【Theorem】 Let P′ = P ∪ {x}. A transaction T FT-contains P′ with fault tolerance δ if and only if (1) T FT-contains P with fault tolerance δ-1, or (2) T FT-contains P with fault tolerance δ and contains x.

20 20 Example: P = ABD, P′ = P ∪ {G} = ABDG. T FT-contains P′ with fault tolerance 2 in both cases. Case 1: T1 = BDEF FT-contains ABD with fault tolerance 1 (it contains B and D). Case 2: T3 = BEFG FT-contains ABD with fault tolerance 2 (it contains B) and contains G.

21 21 If δ = 0, FT-Appear_P′(δ) = Appear_P′. If |P′| ≤ δ, FT-Appear_P′(δ) = I_|DB| (the all-ones vector). Otherwise, FT-Appear_P′(δ) = FT-Appear_P(δ-1) ∨ (FT-Appear_P(δ) ∧ Appear_x).
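The three cases on this slide translate almost line for line into a recursive function. The sketch below assumes integer bit vectors with the leftmost bit standing for T1; ft_appear, ones and show are illustrative names, not identifiers from the talk.

```python
N = 5                       # transactions in the sample database
ALL_ONES = (1 << N) - 1     # I_|DB|
# Appearing vectors from the slides; the leftmost bit is T1.
APPEAR = {"A": 0b01001, "B": 0b10101, "C": 0b01010, "D": 0b11001,
          "E": 0b11101, "F": 0b10110, "G": 0b00111}

def ft_appear(pattern, delta):
    """FT-appearing vector of the pattern under fault tolerance delta."""
    pattern = list(pattern)
    if len(pattern) <= delta:            # |P'| <= delta: all ones
        return ALL_ONES
    prefix, x = pattern[:-1], pattern[-1]
    if not prefix:                       # single item, delta = 0
        return APPEAR[x]
    if delta == 0:                       # exact containment
        return ft_appear(prefix, 0) & APPEAR[x]
    return ft_appear(prefix, delta - 1) | (ft_appear(prefix, delta) & APPEAR[x])

def ones(v):
    """Number of 1 bits, i.e. the dot product with the all-ones vector."""
    return bin(v).count("1")

def show(v):
    return format(v, "05b")

bd = ft_appear("BD", 1)
print(show(bd), ones(bd))                # FT-appearing vector and FT-support
print(ones(bd & APPEAR["B"]))            # Item-Sup(B) within the FT-body
```

A real implementation would cache the vectors of the prefix instead of recomputing them; the recursion is kept naive here to mirror the slide.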

22 22 δ = 1. Itemset A: FT-Appear_A(1) = I_|DB|; FT-Appear_A(0) = Appear_A. Itemset AB: FT-Appear_AB(1) = FT-Appear_A(0) ∨ (FT-Appear_A(1) ∧ Appear_B) = Appear_A ∨ (I_|DB| ∧ Appear_B) = Appear_A ∨ Appear_B; FT-Appear_AB(0) = Appear_AB = FT-Appear_A(0) ∧ Appear_B = Appear_A ∧ Appear_B.

23 23 δ = 2. Itemset AB: FT-Appear_AB(2) = I_|DB|; FT-Appear_AB(1) = FT-Appear_A(0) ∨ (FT-Appear_A(1) ∧ Appear_B); FT-Appear_AB(0) = Appear_AB. Itemset ABC: FT-Appear_ABC(2) = FT-Appear_AB(1) ∨ (FT-Appear_AB(2) ∧ Appear_C); FT-Appear_ABC(1) = FT-Appear_AB(0) ∨ (FT-Appear_AB(1) ∧ Appear_C); FT-Appear_ABC(0) = Appear_ABC = FT-Appear_AB(0) ∧ Appear_C.

24 24 Step: construct the appearing vector table. Sample database (TID: items): T1: B D E F; T2: A C D E; T3: B E F G; T4: C F G; T5: A B D E G. Item: Appear_P: A: 0 1 0 0 1; B: 1 0 1 0 1; C: 0 1 0 1 0; D: 1 1 0 0 1; E: 1 1 1 0 1; F: 1 0 1 1 0; G: 0 0 1 1 1.

25 25 Step: check the item supports against min-sup_item, with min-sup_item = 3, min-sup_FT = 4 and δ = 1. Item supports: A: 2, B: 3, C: 2, D: 3, E: 4, F: 3, G: 3. Candidate items for constructing frequent FT-patterns: B, D, E, F, G.

27 27 Pattern BD. FT-appearing vectors: FT-Appear_BD(1) = FT-Appear_B(0) ∨ (FT-Appear_B(1) ∧ Appear_D) = 10101 ∨ (11111 ∧ 11001) = 11101; FT-Appear_BD(0) = FT-Appear_B(0) ∧ Appear_D = 10101 ∧ 11001 = 10001. FT-support: FT-sup_1(BD) = Vector(FT-Appear_BD(1)) · I_5 = 11101 · 11111 = 4. Item supports: Item-Sup(B) = Vector(FT-Appear_BD(1)) · Vector(Appear_B) = 11101 · 10101 = 3; Item-Sup(D) = Vector(FT-Appear_BD(1)) · Vector(Appear_D) = 11101 · 11001 = 3.

29 29 Pattern BDE. FT-appearing vectors: FT-Appear_BDE(1) = FT-Appear_BD(0) ∨ (FT-Appear_BD(1) ∧ Appear_E) = 10001 ∨ (11101 ∧ 11101) = 11101; FT-Appear_BDE(0) = FT-Appear_BD(0) ∧ Appear_E = 10001 ∧ 11101 = 10001. FT-support: FT-sup_1(BDE) = Vector(FT-Appear_BDE(1)) · I_5 = 11101 · 11111 = 4. Item supports: Item-Sup(B) = 11101 · 10101 = 3; Item-Sup(D) = 11101 · 11001 = 3; Item-Sup(E) = 11101 · 11101 = 4.

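31 31 Pattern BDEF. FT-appearing vectors: FT-Appear_BDEF(1) = FT-Appear_BDE(0) ∨ (FT-Appear_BDE(1) ∧ Appear_F) = 10001 ∨ (11101 ∧ 10110) = 10101; FT-Appear_BDEF(0) = FT-Appear_BDE(0) ∧ Appear_F = 10001 ∧ 10110 = 10000. FT-support: FT-sup_1(BDEF) = Vector(FT-Appear_BDEF(1)) · I_5 = 10101 · 11111 = 3, which is below min-sup_FT = 4, so BDEF is not an FT-frequent pattern.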

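33 33 Pattern BDEG. FT-appearing vectors: FT-Appear_BDEG(1) = FT-Appear_BDE(0) ∨ (FT-Appear_BDE(1) ∧ Appear_G) = 10001 ∨ (11101 ∧ 00111) = 10101; FT-Appear_BDEG(0) = FT-Appear_BDE(0) ∧ Appear_G = 10001 ∧ 00111 = 00001. FT-support: FT-sup_1(BDEG) = Vector(FT-Appear_BDEG(1)) · I_5 = 3, which is below min-sup_FT = 4, so BDEG is not an FT-frequent pattern.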
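The walk from BD to BDE and then to BDEF and BDEG suggests a depth-first extension of patterns, each step reusing the parent's FT-appearing vectors. The transcript does not preserve the algorithm's pseudocode, so the following is only a sketch under assumptions: items are extended in alphabetical order, a branch is pruned as soon as either threshold fails, and all names (extend, mine, grow) are hypothetical.

```python
N, DELTA, MIN_SUP_FT, MIN_SUP_ITEM = 5, 1, 4, 3
ALL_ONES = (1 << N) - 1
# Appearing vectors from the slides; the leftmost bit is T1.
APPEAR = {"A": 0b01001, "B": 0b10101, "C": 0b01010, "D": 0b11001,
          "E": 0b11101, "F": 0b10110, "G": 0b00111}

def ones(v):
    return bin(v).count("1")

def extend(vecs, size, x):
    """FT-appearing vectors (tolerance 0..DELTA) of P + {x}, given those of P,
    where size = |P|. Follows the three-case rule of slide 21."""
    a = APPEAR[x]
    new = []
    for d in range(DELTA + 1):
        if size + 1 <= d:
            new.append(ALL_ONES)
        elif d == 0:
            new.append(vecs[0] & a)
        else:
            new.append(vecs[d - 1] | (vecs[d] & a))
    return new

def mine():
    # Only items meeting min-sup_item can take part in a pattern (slide 25).
    items = sorted(x for x in APPEAR if ones(APPEAR[x]) >= MIN_SUP_ITEM)
    found = []

    def grow(pattern, vecs, start):
        for i in range(start, len(items)):
            cand = pattern + items[i]
            new = extend(vecs, len(pattern), items[i])
            top = new[DELTA]
            if ones(top) < MIN_SUP_FT:
                continue                      # FT-support too low
            if any(ones(top & APPEAR[p]) < MIN_SUP_ITEM for p in cand):
                continue                      # some item support too low
            found.append(cand)
            grow(cand, new, i + 1)

    # The empty pattern is FT-contained by every transaction.
    grow("", [ALL_ONES] * (DELTA + 1), 0)
    return found

print(mine())
```

Pruning is safe here because adding an item can only shrink the FT-body, so both the FT-support and the item supports are non-increasing along a branch.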

35 35 Experimental environment: Visual C++ 6.0; P4 2.4 GHz CPU; 256 MB main memory; OS: Windows XP Professional. Synthetic data generator: IBM website, http://www.almaden.ibm.com/cs/quest/DEMOS.html

36 36 Experiment 1: min-sup_item is changed. Dataset T10I8D100kN450 (δ = 1)

37 37 Experiment 2: min-sup_FT is changed. Dataset T10I8D10kN1k (δ = 1)

38 38 Experiment 3: fault tolerance δ is changed. Dataset T10I8D100kN450

39 39 Experiment 4: database size is changed. Dataset T10I8N450 (δ = 1)

40 40 Experiment 5: the number of distinct items in the database is changed. Dataset T10I8D100k (δ = 1)

41 41 Conclusion: the VB-FT-Mine algorithm is proposed. It constructs the FT-appearing vectors of candidates and computes FT-support and item support efficiently, giving a significant improvement in execution time over the FT-Apriori algorithm. Future work: extend the VB-FT-Mine algorithm to mining frequent patterns in data streams.

42 42 An Efficient Approach for Mining Top-K Fault-Tolerant Repeating Patterns Jia-Ling Koh and Yu-Ting Kung DASFAA 2006

43 43 Outline: Introduction; Basic Terms; Bit Sequence Representation; Mining Top-K non-trivial FT-RPs with min_len Constraints; Performance Study; Conclusion & Future Works

44 44 Introduction. Repeating patterns are the sub-patterns appearing in a data sequence repeatedly. Applications: music feature extraction, user behavior monitoring. In most studies, only exact matching was considered.

45 45 Introduction (Cont.) For example, take the data sequence ACDE……ACEDE…. Using an exact matching approach, the frequency of "ACDE" is 1. Allowing insertion errors makes it possible to find the implicit repeating pattern "ACDE".

46 46 Introduction (Cont.) Idea: 1. Discover fault-tolerant repeating patterns (FT-RPs for short), and 2. avoid finding "duplicated" information and "short" patterns. Goal: mining "top-K non-trivial FT-RPs with length no less than min_len".

47 47 Outline: Introduction; Basic Terms; Bit Sequence Representation; Mining Top-K non-trivial FT-RPs with min_len Constraints; Performance Study; Conclusion & Future Works

48 48 Data Sequence. E = {A, B, …, Z} is the set of data items. DSeq = D_1 D_2 … D_n is a data sequence, where D_i ∈ E (i = 1…n). For example, DSeq = ABCDABCACDEE has length |DSeq| = 12.

49 49 Contain & Appear. DSeq = ABCDABCDA, P = CDA. DSeq contains P on position 3, and P appears on position 3; the same holds on position 7. Hence freq(P) = 2.

50 50 FT-contain: insertion error. DSeq = ABCDABCA, P = ABCA. DSeq FT-contains P on position 1 with 1 insertion error (ABC, one inserted item D, then A). DSeq FT-contains P on position 5 with 0 insertion errors (ABCA).

51 51 FT-contain: deletion error. DSeq = ABCBCA, P = BCD. DSeq FT-contains P on positions 2 and 4 with 1 deletion error (BC appears, D is deleted).

52 52 IFT-contain & IFT-appear (insertion errors: 0, 1, or 2). DSeq = ABCDABCA, P = ABCA: P is matched as ABC, an inserted item, then A (one insertion error), and as ABCA (no insertion error).

53 53 DFT-contain & DFT-appear (deletion errors: 0, 1, or 2). DSeq = ABCBCA, P = BCD: P is matched as BC with D deleted.

54 54 Fault-Tolerant Frequency. DSeq = ABCDABCAECDAA, P = CA. P is matched starting at the three occurrences of C (positions 3, 7 and 10), so FT-freq_DSeq(P) = 3.

55 55 Fault-tolerant Repeating Patterns (FT-RPs). Given DSeq and P: if FT-freq_DSeq(P) ≥ min_freq, then P is an FT-RP.
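The insertion-error examples on slides 50 and 54 can be reproduced with a brute-force check. The sketch below reads "FT-contains P on position i with at most δ insertion errors" as: the first item of P sits at position i and the rest of P occurs, in order, within the next |P| + δ - 1 items. That reading, and the names ift_positions and ft_freq, are assumptions made for illustration; the talk itself uses bit sequences rather than a scan like this.

```python
def ift_positions(dseq, pattern, delta):
    """1-based positions where dseq contains pattern with <= delta insertions."""
    positions = []
    for i, item in enumerate(dseq):
        if item != pattern[0]:
            continue
        # The match must fit in a window of |P| + delta items starting at i.
        window = dseq[i:i + len(pattern) + delta]
        j = 0
        for ch in window:          # greedy in-order matching inside the window
            if j < len(pattern) and ch == pattern[j]:
                j += 1
        if j == len(pattern):
            positions.append(i + 1)
    return positions

def ft_freq(dseq, pattern, delta):
    return len(ift_positions(dseq, pattern, delta))

def is_ft_rp(dseq, pattern, delta, min_freq):
    return ft_freq(dseq, pattern, delta) >= min_freq

print(ift_positions("ABCDABCA", "ABCA", 1))
print(ft_freq("ABCDABCAECDAA", "CA", 1))
```

Deletion errors are left out on purpose, since the transcript does not say which items of P may be dropped.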

56 56 Outline: Introduction; Basic Terms; Bit Sequence Representation; Mining Top-K non-trivial FT-RPs with min_len Constraints; Performance Study; Conclusion & Future Works

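57 57 Appearing Bit Sequence. DSeq = ABCDABCACDEEABCCDEAC. Appear_A is initially 00000000000000000000; the bit is set to 1 at each of the five positions where A appears, giving Appear_A = 10001001000010000010 and freq("A") = 5.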

58 58 Bit Index Table. DSeq = ABCDABCACDEEABCCDEAC. Appearing bit sequence (Appear_N) of each data item:
A  10001001000010000010
B  01000100000001000000
C  00100010100000110001
D  00010000010000001000
E  00000000001100000100
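Building the bit index table takes one pass per item over the data sequence. A minimal sketch, with bit sequences kept as strings and the function name chosen only for this example:

```python
DSEQ = "ABCDABCACDEEABCCDEAC"

def bit_index_table(dseq):
    """Map each data item to a bit string marking the positions where it occurs."""
    return {x: "".join("1" if c == x else "0" for c in dseq)
            for x in sorted(set(dseq))}

TABLE = bit_index_table(DSEQ)
for item, bits in TABLE.items():
    print(item, bits, bits.count("1"))
```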

59 59 Appearing Bit Sequence of longer patterns: how to obtain Appear_AB? Using the bit index table of DSeq = ABCDABCACDEEABCCDEAC: Appear_A = 10001001000010000010; Appear_B = 01000100000001000000; l_shift(Appear_B, 1) = 10001000000010000000; Appear_AB = Appear_A ∧ l_shift(Appear_B, 1) = 10001000000010000000; freq("AB") = 3.

60 60 Appearing Bit Sequence of longer patterns (Cont.): how to obtain Appear_ABC? Using the bit index table of DSeq = ABCDABCACDEEABCCDEAC: Appear_AB = 10001000000010000000; Appear_C = 00100010100000110001; l_shift(Appear_C, 2) = 10001010000011000100; Appear_ABC = Appear_AB ∧ l_shift(Appear_C, 2) = 10001000000010000000; freq("ABC") = 3.

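61 61 Recursive Function for Appear_P. Let P = P_1P_2…P_m-1P_m, P′ = P_1…P_m-1 and X = P_m. Then Appear_P = Appear_P′ ∧ l_shift(Appear_X, m-1), and for a single item Appear_P is read from the bit index table.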
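The shift-and-AND construction of the previous two slides can be written as a short loop, which is the recursion unrolled. This sketch keeps bit sequences as strings so the left shift is a slice; l_shift follows the slide's name, while bit_and, item_bits and appear are names assumed for this example.

```python
DSEQ = "ABCDABCACDEEABCCDEAC"

def item_bits(dseq, x):
    """Appearing bit sequence of a single data item."""
    return "".join("1" if c == x else "0" for c in dseq)

def l_shift(bits, k):
    """Shift left by k positions, padding with zeros on the right."""
    return bits[k:] + "0" * k

def bit_and(a, b):
    return "".join("1" if x == y == "1" else "0" for x, y in zip(a, b))

def appear(dseq, pattern):
    """Appear_P: bit i is 1 when pattern occurs exactly starting at position i."""
    result = item_bits(dseq, pattern[0])
    for offset, x in enumerate(pattern[1:], start=1):
        result = bit_and(result, l_shift(item_bits(dseq, x), offset))
    return result

for p in ("AB", "ABC"):
    bits = appear(DSEQ, p)
    print(p, bits, bits.count("1"))
```

With machine words instead of strings, the shift and the AND each cost one instruction per word, which is where the efficiency of the representation comes from.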

62 62 Fault-Tolerant Appearing Bit Sequences: represent the positions where the data sequence IFT-contains or DFT-contains P under the given fault tolerance. Two kinds are defined: insertion fault tolerance and deletion fault tolerance.

63 63 Appearing Bit Sequence of Insertion Fault Tolerance: the appearing bit sequence of P with E insertion errors, for E = 0, 1, … up to the insertion fault tolerance.

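64 64 How to get it when |P| > 1 and E > 0? Example: P = ABC and E = 2, with P′ = AB and X = C (x marks an inserted item). 1) A B x x C: 0 insertion errors in P′. 2) A x B x C: 1 insertion error in P′. 3) A x x B C: 2 insertion errors in P′. In every case the bit sequence of X is shifted by |P| + E - 1 = 4 bit positions.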

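65 65 Recursive Function for the insertion fault-tolerant appearing bit sequence: P = P_1P_2…P_m-1P_m, for E = 0 up to the insertion fault tolerance, with P′ = P_1…P_m-1 and X = P_m.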

67 67 Fault-Tolerant Frequency FT-freq_DSeq(P): obtained by counting the number of bits with value "1" in the fault-tolerant appearing bit sequence of P. Whether a pattern P is an FT-RP can therefore be evaluated efficiently.

68 68 Appearing Bit Sequence of Deletion Fault Tolerance: for P = P_1P_2…P_m, the appearing bit sequence of P with E deletion errors. Diagram labels: Y and P′′; 0, 1, … deletion errors in P′′; shift 1 bit position.

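69 69 Recursive Function for the deletion fault-tolerant appearing bit sequence: P′′ = P_2…P_m, for E = 0 up to the deletion fault tolerance, with Q = P_2…P_m-1 and X = P_m.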

71 71 Outline: Introduction; Basic Terms; Bit Sequence Representation; Mining Top-K non-trivial FT-RPs with min_len Constraints; Performance Study; Conclusion & Future Works

72 72 TFTRP-Mine Algorithm. Example: DSeq = ABCDABCACDEEABCCDEAC, min_len = 2 and K = 3; min_freq is set to 3. Step 1: scan DSeq once to construct the bit index table.

73 73 TFTRP-Mine (Cont.) Step 2: generate candidate patterns in depth-first order from the root. A candidate is extended only while its frequency is at least min_freq = 3, it enters Minlen_Set once its length reaches min_len = 2, and each pattern is checked for non-triviality before entering the Temporal Output Set. [Figure: candidate tree under item A. Patterns shown: AB (3), ABC (3), ABCD (3), AC (5), ACD (3); the Temporal Output Set ends with ABCD (3), AC (5), ACD (3).]

74 74 TFTRP-Mine (Cont.) min_freq = 3, K = 3, min_len = 2. [Figure: candidate tree under item B. Minlen_Set: AB (3), BC (3); patterns shown: ABC (3), ABCD (3), BCD (3); Temporal Output Set: ABCD (3), AC (5), ACD (3).]

75 75 TFTRP-Mine (Cont.) Temporal Output Set: ABCD (3), CAC (3), CDAC (3), CDA (4), CDEAC (3), CDE (4), CEA (3), AC (5), ACD (3), CD (5). Sorted Temporal Output Set: AC (5), CD (5), CDA (4), CDE (4), CAC (3), ACD (3), CEA (3), ABCD (3), CDEAC (3), CDAC (3). Results: AC (5), CD (5), CDA (4), CDE (4).
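The final selection step can be sketched as a sort followed by a cut-off. With K = 3 the slide reports four patterns, which is consistent with keeping every pattern whose frequency ties the K-th ranked one; that tie rule is inferred from the slide's result rather than stated in it, and the function name is hypothetical.

```python
# Temporal Output Set from the slide: pattern -> fault-tolerant frequency.
TEMPORAL_OUTPUT = {
    "AC": 5, "CD": 5, "CDA": 4, "CDE": 4, "CAC": 3,
    "ACD": 3, "CEA": 3, "ABCD": 3, "CDEAC": 3, "CDAC": 3,
}

def top_k_with_ties(freqs, k):
    """Patterns whose frequency is at least that of the k-th ranked pattern."""
    ranked = sorted(freqs.items(), key=lambda kv: -kv[1])  # stable sort
    if len(ranked) <= k:
        return ranked
    cutoff = ranked[k - 1][1]       # assumed rule: ties with the k-th are kept
    return [(p, f) for p, f in ranked if f >= cutoff]

print(top_k_with_ties(TEMPORAL_OUTPUT, 3))
```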

76 76 RE-TFTRP-Mine Algorithm. min_len = 2, K = 3, min_freq = 3. Minlen_Set: AC (5), CD (5), AB (3), BC (3), CA (3), CE (3), DA (3), EA (3). Temporal Top-K Set: AC (5), CD (5), AB (3).

77 77 RE-TFTRP-Mine (Cont.) min_freq = 3. Minlen_Set: AC (5), CD (5), AB (3), BC (3), CA (3), CE (3), DA (3), EA (3). Temporal Top-K Set: AC (5), CD (5), AB (3). Extending AC yields ACD (3) and extending CD yields CDA (4) and CDE (4); each is checked for non-triviality. [Figure: frequency counts of the one-item extensions A to E of AC and CD.]

78 78 RE-TFTRP-Mine (Cont.) min_freq = 3. Minlen_Set: AB (3), BC (3), CA (3), CE (3), DA (3), EA (3), ACD (3), CDA (4), CDE (4). Temporal Top-K Set: AC (5), CD (5), AB (3). Extending CDA (4) yields CDAC (3), which is checked for non-triviality. [Figure: frequency counts of the one-item extensions A to E of CDA.]

79 79 Performance Study. Implementation environment: Borland C++ Builder 5.0; 2.4 GHz Intel Pentium IV PC; 512 MB memory; Microsoft Windows XP Professional. Data sequence generator parameters: L = length of the generated data sequence; E = number of distinct data items in the generated data sequence.

80 80 Performance Study (Cont.) Experiment 1: performance evaluation on efficiency, varying one of the five parameters at a time (data parameters: L, E; runtime parameters: the fault tolerance, min_len, K). Experiment 2: performance evaluation on effectiveness, using music objects.

81 81 Experiment 1: changing the size L of the data sequence (min_len = 8, K = 5 and E = 5). Number of candidate patterns:
L      TFTRP-Mine   RE-TFTRP-Mine
1000   10,295       8,705
2000   41,730       24,300
3000   120,075      24,760
4000   348,610      36,090
5000   533,770      41,280

82 82 Experiment 1 (Cont.): changing the insertion fault tolerance (L = 2000, E = 5, min_len = 8, K = 5). Number of candidate patterns:
Insertion fault tolerance   TFTRP-Mine   RE-TFTRP-Mine
1                           5,760        5,465
2                           41,730       24,300
3                           394,515      36,600
4                           3,434,805    76,195

83 83 Experiment 1 (Cont.): changing the setting of min_len (L = 2000, E = 5 and K = 5). Number of candidate patterns:
min_len   TFTRP-Mine   RE-TFTRP-Mine
5         41,730       3,795
10        41,730       25,240
15        41,730       26,375
20        41,730       27,615
25        41,730       29,000
30        41,730       30,500
35        41,730       38,095
40        41,730       39,620
45        41,730
50        41,730

84 84 Experiment 1 (Cont.): changing the setting of K, with K = max_K x 1%, max_K x 20%, …, max_K x 100% (L = 2000, E = 5 and min_len = 8). Number of candidate patterns:
K/max_K   TFTRP-Mine   RE-TFTRP-Mine
1%        41,730       14,320
20%       41,730       24,300
40%       41,730       31,735
60%       41,730       37,325
80%       41,730
100%      41,730

85 85 Experiment 2 (min_freq = 3, K = 2 and min_len = 8). Columns: Music Object | Found FT-RPs under insertion fault tolerance 0 | Found FT-RPs under insertion fault tolerance 1 | Found FT-RPs under insertion fault tolerance 2 | Motif in Music Object. Cells are listed in the order they appear on the slide.
Music object 1 (252 seconds): None | 1. Ecgcegcdcgebbdgbbcaeaecaecba aegegeegecbbggdgacaffbfacfcgee ccgeb 2. None | ecgcegcdcgebbdgbbcaeaecaecba aegegeegecbbggdgacaffbfacfcgee ccgebceaadfd
Music object 2 (270 seconds): None | 1. dcbgddebdefgaabbadgcfbge dcdbaccaaccffaaccccddddcca afdd 2. None | ddcbgddebdefgaabbadgcfbg edcdbaccaaccffaaccccddddc caaffddgend
Music object 3 (256 seconds): None | 1. gggbbgfffdcbeeedccfeeeffgf 2. gggbbgfffdcbfeedccfeeeffgf | cbcgggbbgfffdcbfeedccfeeef fgfccc
Music object 4 (288 seconds): None | 1. deededacededaceded aagggagg 2. deecedacededaceded aagggagg | 1. gedeededacededacededaagg gaggedgggcbbaaagabeeegaaa agedeededacedegedcaageedd eccgacaagegdegedcacdegedc dagagedddedcacccccbbaaaga beegaaaagedeededacededace dedaagggaggedgggc 2. None | baaagabeeegaaaagedeededa cededacededaagggaggedggg cbbaaagabeeegaaaagedeede dacedegedcaageeddeccgaca agegdegedcacdegedcdagage dddedcacccccbbaaagabeeeg aaaagedeededacededaceded aagggaggedgggcbbaaagab
Music object 5 (291 seconds): None | 1. aegcfaecaholdonfdbfebgfba holdonbgfeebfbgefdabfddbfd holdonbfcaccaecegcholdoncd fbdf 2. None | ecgacaegcfaecaholdonfdbfe bgfbaholdonbgfeebfbgefdab fddbfdholdonbfccaccaecegc holdoncdfbdf

86 86 Conclusion and Future Works. Conclusion: fault-tolerant appearing bit sequences, together with the TFTRP-Mine and RE-TFTRP-Mine algorithms, are proposed for efficiently mining top-K non-trivial FT-RPs with length no less than min_len in data sequences. Future works: partition the bit index table into several parts to perform parallel mining.

