Graduate Course DataMining

Jun-Ki Min

Data Mining
Knowledge discovery in databases.
Association rule A ⇒ B: transactions that contain A tend to also contain B.
Confidence: the percentage of transactions containing B among the transactions containing A.
Support: the percentage of transactions that contain both A and B.
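The two measures can be computed directly from their definitions. The sketch below is only an illustration: the function names `support` and `confidence` are assumptions made here, and the four transactions are the ones used in the example slides of this deck.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B together) / support(A) for the rule A => B.

    Assumes the antecedent occurs in at least one transaction.
    """
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(transactions, both) / support(transactions, a)


# The example database from the slides (TIDs 100, 200, 300, 400).
D = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]

print(support(D, "BE"))          # share of transactions with both B and E
print(confidence(D, "BC", "E"))  # how often E appears when B and C do
```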

Fast Algorithms for Mining Association Rules

Problem Statement
I = {i1, i2, …, im} // set of items
General association rule X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
The rule has confidence c if c% of the transactions in D that contain X also contain Y.
The rule has support s if s% of the transactions in D contain X ∪ Y.
Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than minsup and minconf, respectively.

Problem Decomposition
Step 1: Find all sets of items (large itemsets) that have transaction support above minsup.
Step 2: Use the large itemsets to generate the desired rules. For each large itemset l, find all non-empty subsets of l. For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf.
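The rule-generation step can be sketched as follows. This is a minimal illustration, not the authors' code: `generate_rules` and `support_of` are names assumed here, and the loop skips a = l because its consequent l − a would be empty. It relies on every subset of a large itemset being large as well, so each support lookup succeeds.

```python
from itertools import combinations


def generate_rules(support_of, minconf):
    """Return (antecedent, consequent, confidence) for every qualifying rule.

    `support_of` maps each large itemset (a frozenset) to its support.
    """
    rules = []
    for l, sup_l in support_of.items():
        for r in range(1, len(l)):                  # proper, non-empty subsets
            for a in map(frozenset, combinations(sorted(l), r)):
                conf = sup_l / support_of[a]        # support(l) / support(a)
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules


# Supports of the large itemsets in the deck's four-transaction example.
support_of = {
    frozenset("A"): 0.5, frozenset("B"): 0.75,
    frozenset("C"): 0.75, frozenset("E"): 0.75,
    frozenset("AC"): 0.5, frozenset("BC"): 0.5,
    frozenset("BE"): 0.75, frozenset("CE"): 0.5,
    frozenset("BCE"): 0.5,
}
rules = generate_rules(support_of, minconf=1.0)
for a, c, conf in rules:
    print(sorted(a), "=>", sorted(c), conf)
```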

Discovering Large Itemsets
Requires multiple passes over the data.
In the first pass, find all large itemsets of size one.
In each subsequent pass, start with a seed set of itemsets found to be large in the previous pass, use it to generate the candidate set, and then compute the support of each candidate.
Anti-monotonic property: if sup(A) ≥ minsup, then sup(A') ≥ minsup for every A' ⊆ A.

Apriori Algorithm
L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);        // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);        // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk
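The loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `minsup_count` is an absolute transaction count, the candidates contained in each transaction are found by a naive scan rather than a dedicated subset structure, and all function names are chosen here for the example.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Join L(k-1) with itself, then prune by the (k-1)-subset test."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:    # first k-2 items agree, p[-1] < q[-1]
                candidates.add(frozenset(p + (q[-1],)))
    return {c for c in candidates
            if all(frozenset(s) in prev_large
                   for s in combinations(c, k - 1))}


def apriori(transactions, minsup_count):
    """Return {itemset: count} for every large itemset."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                     # pass 1: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # one full scan of D per pass
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
        k += 1
    return answer


D = ["ACD", "BCE", "ABCE", "BE"]               # example database of the deck
result = apriori(D, minsup_count=2)
for itemset, n in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)
```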

AprioriGen
Using Lk-1, generate the candidate k-itemsets (supersets of the large (k-1)-itemsets).
Join step:
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
Prune step:
    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then delete c from Ck;
That is, any c ∈ Ck that has at least one (k-1)-element subset not contained in Lk-1 is removed from Ck.
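The join and prune steps can be tried separately on a small input. In the sketch below the function names `join` and `prune` and the sample L3 over items 1 to 5 are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import combinations


def join(prev_large):
    """Join step: merge pairs of (k-1)-itemsets that share their first k-2 items."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    out = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:            # sorted order gives p[-1] < q[-1]
                out.add(frozenset(p + (q[-1],)))
    return out


def prune(candidates, prev_large):
    """Prune step: drop any candidate with a (k-1)-subset outside L(k-1)."""
    return {c for c in candidates
            if all(frozenset(s) in prev_large
                   for s in combinations(c, len(c) - 1))}


# Hypothetical L3 used only for this illustration.
L3 = {frozenset(s) for s in [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]}
joined = join(L3)
C4 = prune(joined, L3)
print(sorted(sorted(c) for c in joined))
print(sorted(sorted(c) for c in C4))
```

The candidate {1, 3, 4, 5} produced by the join is discarded in the prune step because its subset {1, 4, 5} is not in L3.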

Example
Item set I = {A, B, C, D, E}
min_sup = 0.4 (i.e., ≥ 2 transactions)
D =
TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Pass 1: C1 and L1
itemset   support
{A}       2/4
{B}       3/4
{C}       3/4
{D}       1/4
{E}       3/4
L1 = {{A}, {B}, {C}, {E}}; {D} is dropped because its support is below min_sup.

Pass 2: C2 and L2
itemset   support
{A,B}     1/4
{A,C}     2/4
{A,E}     1/4
{B,C}     2/4
{B,E}     3/4
{C,E}     2/4
L2 = {{A,C}, {B,C}, {B,E}, {C,E}}

Pass 3: C3 and L3
itemset   support
{B,C,E}   2/4
sup({B,C,E}) = 2 and sup({B,C}) = 2. Thus, the rule {B,C} ⇒ {E} holds with confidence 100%.

AprioriTid
The principle of Apriori is simple, but each time the itemset length grows by 1, the whole database must be scanned again.
AprioriTid uses an index C̄k instead.
As the passes proceed, the size of the index C̄k shrinks.

AprioriTid Algorithm
L1 = {large 1-itemsets};
C̄1 = database D;
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);        // new candidates
    C̄k = ∅;
    forall entries t ∈ C̄k-1 do begin
        // determine candidate itemsets in Ck contained
        // in the transaction with identifier t.TID
        Ct = {c ∈ Ck | (c − c[k]) ∈ t.set-of-itemsets ∧ (c − c[k-1]) ∈ t.set-of-itemsets};
        forall candidates c ∈ Ct do
            c.count++;
        if (Ct ≠ ∅) then C̄k += <t.TID, Ct>;
    end
    Lk = {c ∈ Ck | c.count ≥ min_sup}
end
Answer = ∪k Lk
c[k] denotes the k-th item of c. For example, if c = {B,C,D}, then c[3] = {D} and c[2] = {C}.
C̄k denotes the index: each entry has the form <TID, set of candidate k-itemsets contained in that transaction>.
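A compact Python sketch of this procedure follows. It is an illustration under assumptions, not a reference implementation: the per-transaction index is held in a list called `cbar`, `minsup_count` is an absolute count, and the function names are chosen here.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Join L(k-1) with itself, then prune by the (k-1)-subset test."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    joined = {frozenset(p + (q[-1],))
              for i, p in enumerate(prev) for q in prev[i + 1:]
              if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}


def apriori_tid(database, minsup_count):
    """`database` maps TID -> items. Returns {itemset: count} of large itemsets."""
    # First index: each transaction as the set of its 1-itemsets.
    cbar = [(tid, {frozenset([i]) for i in items})
            for tid, items in database.items()]
    counts = {}
    for _, sets in cbar:
        for s in sets:
            counts[s] = counts.get(s, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = {c: 0 for c in candidates}
        next_cbar = []
        for tid, sets in cbar:              # scans the index, not the database
            ct = set()
            for c in candidates:
                items = sorted(c)
                # c - c[k] and c - c[k-1] must both be present in this entry
                if c - {items[-1]} in sets and c - {items[-2]} in sets:
                    ct.add(c)
            for c in ct:
                counts[c] += 1
            if ct:                          # entries with no candidates vanish
                next_cbar.append((tid, ct))
        cbar = next_cbar
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
        k += 1
    return answer


D = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
result = apriori_tid(D, minsup_count=2)
print(sorted((sorted(c), n) for c, n in result.items()))
```

Only the first pass reads the transactions themselves; every later pass works on the index built in the previous one, which loses an entry whenever a transaction contains no candidate.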

Example
C̄1:
TID   Set-of-Itemsets
100   {{A}, {C}, {D}}
200   {{B}, {C}, {E}}
300   {{A}, {B}, {C}, {E}}
400   {{B}, {E}}
L1:
itemset   support
{A}       2/4
{B}       3/4
{C}       3/4
{E}       3/4
C2:
itemset
{A,B}
{A,C}
{A,E}
{B,C}
{B,E}
{C,E}

C̄2:
TID   Set-of-Itemsets
100   {{A,C}}
200   {{B,C}, {B,E}, {C,E}}
300   {{A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}}
400   {{B,E}}
L2:
itemset   support
{A,C}     2/4
{B,C}     2/4
{B,E}     3/4
{C,E}     2/4
C3:
itemset
{B,C,E}

Example
C̄3:
TID   Set-of-Itemsets
200   {{B,C,E}}
300   {{B,C,E}}
L3:
itemset   support
{B,C,E}   2/4

AprioriHybrid
Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets.
In the later passes, the number of candidate itemsets shrinks. However, Apriori still examines every transaction in the database. AprioriTid, on the other hand, uses an index in place of the database.
Thus, AprioriHybrid runs Apriori in the initial passes and then, once the index is small enough to fit in memory, switches to AprioriTid in order to reduce disk I/O. [5]
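The switch decision needs an estimate of how large the AprioriTid index would be. One way to obtain it, sketched below as an assumption rather than as the slides' method, uses the candidate counts of the current pass: every count increment corresponds to one itemset stored in the index, so the sum of the counts plus one TID slot per transaction bounds its size. The names `estimate_index_size`, `should_switch_to_tid`, and the parameter `budget_entries` are hypothetical.

```python
def estimate_index_size(candidate_counts, num_transactions):
    """Stored itemsets (one per count increment) plus one TID per transaction."""
    return sum(candidate_counts.values()) + num_transactions


def should_switch_to_tid(candidate_counts, num_transactions, budget_entries):
    """True when the estimated index fits in the assumed memory budget."""
    return estimate_index_size(candidate_counts, num_transactions) <= budget_entries


# Counts of the pass-2 candidates in the deck's four-transaction example.
c2_counts = {"AB": 1, "AC": 2, "AE": 1, "BC": 2, "BE": 3, "CE": 2}
print(estimate_index_size(c2_counts, num_transactions=4))
print(should_switch_to_tid(c2_counts, 4, budget_entries=20))
```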
