
1 A distributed method for mining association rules
Pham Nguyen Anh Huy*, Department of Information Technology, Vietnam National University of Ho Chi Minh City
Presented by Ho Tu Bao, School of Knowledge Science, Japan Advanced Institute of Science and Technology
(*Work done during the author's three-month JSPS fellowship at JAIST)

2 Outline
- Introduction
- Background
- A distributed Apriori algorithm using mobile agents
- Experimental evaluation
- Conclusion

3 Introduction
- Association analysis is a new and attractive research area in data mining.
- The Apriori algorithm (R. Agrawal, IBM, 1993) is a key technique for association analysis.
- Although the Apriori principle considerably reduces the search space, the technique still requires a huge amount of computation, particularly for large databases.
- This research proposes a distributed version of the Apriori algorithm using mobile agents. The experiments show that computation time can be reduced by using several computers in a distributed computing environment.

4 Outline
- Introduction
- Background
  - Association rules and Apriori algorithm
  - Mobile agents and Aglets
- A distributed Apriori algorithm using mobile agents
- Experimental evaluation
- Conclusion

5 Association rules: Market basket analysis
Analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets". Rules take the form X ⇒ Y, where X and Y are sets of items.
Items: I = {I1 = beer, I2 = cake, I3 = onigiri}
Transactional database:
TID1: {I1, I2, I3}
TID2: {I1, I2}
TID3: {I2, I3}
TID4: {I2}
TID5: {I1, I2}
An association rule: {I1} ⇒ {I3}. How often do people buy onigiri and beer together?

6 Rule measures: Support and Confidence
- Association rule X ⇒ Y
- Support s = probability that a transaction contains both X and Y
- Confidence c = conditional probability that a transaction containing X also contains Y
- A ⇒ C (s = 50%, c = 66.6%)
- C ⇒ A (s = 50%, c = 100%)
[Figure: Venn diagram of customers who buy beer, customers who buy onigiri, and customers who buy both]
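As a small illustration of the two measures, the sketch below computes support and confidence for the rule {I1} ⇒ {I3} on the five-transaction database from slide 5. The function names `support` and `confidence` are chosen here for illustration and do not come from the presentation.

```python
from fractions import Fraction

# Transactions from slide 5 (I1 = beer, I2 = cake, I3 = onigiri).
TRANSACTIONS = [
    {"I1", "I2", "I3"},
    {"I1", "I2"},
    {"I2", "I3"},
    {"I2"},
    {"I1", "I2"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs); None if lhs never occurs."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return None
    return support(set(lhs) | set(rhs), transactions) / lhs_support


rule_support = support({"I1", "I3"}, TRANSACTIONS)
rule_confidence = confidence({"I1"}, {"I3"}, TRANSACTIONS)
print(rule_support, rule_confidence)  # 1/5 1/3
```

Only TID1 contains both beer and onigiri, so the support is 1/5; beer appears in three transactions, so the confidence is 1/3.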

7 Association mining: Apriori algorithm
The algorithm is composed of two steps (Agrawal, R., 1993):
1. Find all frequent itemsets. By definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets. By definition, these rules must satisfy minimum support and minimum confidence.
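The second step can be sketched as follows, assuming the support counts of the frequent itemsets are already known. The counts below are taken from the example on slides 10 and 11; the function name `strong_rules` and the 70% confidence threshold are assumptions made for this illustration only.

```python
from itertools import combinations

# Support counts for {I1, I2, I5} and its subsets, from slides 10 and 11.
SUPPORT_COUNT = {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2,
}


def strong_rules(itemset, support_count, min_conf):
    """List (lhs, rhs, confidence) for every split of `itemset` into
    a non-empty lhs and rhs whose confidence reaches `min_conf`."""
    items = tuple(sorted(itemset))
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            # confidence = count(lhs and rhs together) / count(lhs)
            conf = support_count[items] / support_count[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
    return rules


# min_conf = 0.7 is a hypothetical threshold, not one used in the slides.
rules = strong_rules(("I1", "I2", "I5"), SUPPORT_COUNT, min_conf=0.7)
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, f"{conf:.0%}")
```

Because the itemset is already frequent, every rule derived from it satisfies minimum support, so only the confidence needs to be checked.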

8 Association mining: Apriori principle
Min. support 50%, min. confidence 50%
For rule A ⇒ C:
support = support({A and C}) = 50%
confidence = support({A and C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent (if an itemset is not frequent, none of its supersets is frequent).

9 The Apriori algorithm: Finding frequent itemsets using candidate generation
1. Find the frequent itemsets: the sets of items whose support is at least the minimum support.
   - A subset of a frequent itemset must also be a frequent itemset; i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
   - Iteratively find the frequent itemsets L_k of cardinality 1 to k (k-itemsets) from the candidate itemsets C_k (L_k ⊆ C_k).
2. Use the frequent itemsets to generate association rules.
C_1 → … → L_{i-1} → C_i → L_i → C_{i+1} → … → L_k
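The step from L_{k-1} to C_k can be sketched as a join followed by a prune. The code below is a minimal illustration of that standard candidate-generation scheme, not the presenter's implementation; the function name `apriori_gen` and the tuple representation of itemsets are assumptions. The input L2 is the one shown in the example on slide 11.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    Join: merge two sorted (k-1)-itemsets that share their first k-2 items.
    Prune: drop a candidate if any of its (k-1)-subsets is not frequent.
    """
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:
            # a sorts before b, so appending b's last item keeps the order.
            candidate = a + (b[-1],)
            subsets = combinations(candidate, len(candidate) - 1)
            if all(sub in prev_set for sub in subsets):
                candidates.add(candidate)
    return sorted(candidates)


# L2 from the example on slide 11.
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
C3 = apriori_gen(L2)
print(C3)  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

The join produces six 3-itemsets; four of them, such as {I1, I3, I5}, are pruned because a 2-item subset ({I3, I5}) is not in L2.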

10 Example (min_sup_count = 2)
Transactional data D:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Scan D for the count of each candidate (C1):
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
Compare each candidate's support count with the minimum support count (L1):
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

11 Example (min_sup_count = 2)
Generate candidates C2 from L1 using the Apriori principle, then scan D for the count of each candidate (C2):
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0
Compare each candidate's support count with the minimum support count (L2):
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
Generate candidates C3 from L2 using the Apriori principle, then scan D for the count of each candidate (C3):
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
Compare each candidate's support count with the minimum support count (L3):
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
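The whole example can be reproduced with a short level-wise loop. This is a sketch of sequential Apriori on the nine transactions from slide 10, with names (`count_support`, `apriori`) chosen here for illustration; candidate generation is done by taking unions of frequent itemsets and pruning, which is simpler than, but gives the same candidates as, the usual prefix join.

```python
from itertools import combinations

# Transactions T100..T900 from slide 10.
DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP_COUNT = 2


def count_support(candidates, transactions):
    """Scan the database once and count each candidate itemset."""
    return {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}


def apriori(transactions, min_count):
    """Return [L1, L2, ...], each a dict mapping itemset -> support count."""
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    levels = []
    while candidates:
        counts = count_support(candidates, transactions)
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k = len(candidates[0]) + 1
        # Candidate k-itemsets: unions of two frequent (k-1)-itemsets ...
        unions = {tuple(sorted(set(a) | set(b)))
                  for a, b in combinations(sorted(frequent), 2)}
        # ... kept only if every (k-1)-subset is frequent (Apriori principle).
        candidates = sorted(
            c for c in unions
            if len(c) == k and all(s in frequent for s in combinations(c, k - 1))
        )
    return levels


levels = apriori(DB, MIN_SUP_COUNT)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", level)
```

The loop stops after L3 because the only possible 4-itemset, {I1, I2, I3, I5}, contains the infrequent subset {I1, I3, I5} and is pruned before any scan.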

12 Agents and Mobile agents
An agent is a computational entity that:
- Acts on behalf of other entities in an autonomous fashion.
- Performs its actions with some level of proactivity and reactivity.
- Exhibits some level of the key attributes of co-operation.
Mobile network agents are programs that:
- Can migrate from system to system within a network environment
- Perform some processing at each host
- Decide when and where to move next
How does an agent move?
- Save state
- Transport the saved state to the next system
- Resume execution from the saved state
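The save, transport, resume cycle can be illustrated with a toy example. This is a single-process sketch, not the Aglets API: the class `CountingAgent`, its methods, and the host names are hypothetical, and only the agent's data is serialized (a real mobile agent system would also move program code across the network).

```python
import json


class CountingAgent:
    """Toy agent that visits hosts and accumulates a count at each one."""

    def __init__(self, itinerary, visited=None, total=0):
        self.itinerary = list(itinerary)   # hosts still to visit
        self.visited = list(visited or [])
        self.total = total

    def save_state(self) -> bytes:
        """Save state: serialize everything needed to continue later."""
        state = {"itinerary": self.itinerary,
                 "visited": self.visited,
                 "total": self.total}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def resume(cls, payload: bytes) -> "CountingAgent":
        """Resume execution: rebuild the agent from the transported bytes."""
        return cls(**json.loads(payload.decode("utf-8")))

    def run_at(self, host_name, host_data):
        """Processing performed at one host."""
        self.visited.append(host_name)
        self.total += len(host_data)


# Hypothetical hosts, each with some local data.
HOSTS = {"host-a": [1, 2, 3], "host-b": [4, 5]}

agent = CountingAgent(itinerary=HOSTS)
while agent.itinerary:
    next_host = agent.itinerary.pop(0)     # the agent decides where to go next
    payload = agent.save_state()           # save state
    agent = CountingAgent.resume(payload)  # transport is simulated; resume
    agent.run_at(next_host, HOSTS[next_host])

print(agent.visited, agent.total)  # ['host-a', 'host-b'] 5
```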

13 Distributed Computing using Mobile Programs

14 Mobile agent tools

15 What are Aglets?
- Aglets (Agile Applets) are Java objects that can move from one host on the Internet to another and perform arbitrary operations within the security limits.
- When an aglet moves, it takes along its program code as well as its data.
- The Aglets framework is implemented by the Aglets Software Development Kit (ASDK) from IBM. It is an environment for programming mobile Internet agents in Java.

16 Aglets at Runtime
- Currently, aglets use the Agent Transfer Protocol (ATP) as the default implementation of the communication layer (ATP is modeled after HTTP).
- ATP is used on the Tahiti aglet server.
- The Aglets Server Interface is used to write applications capable of hosting, receiving, and dispatching aglets.

17 Outline
- Introduction
- Background
- A distributed Apriori algorithm using the mobile agents
- Experimental evaluation
- Conclusion

18 A distributed Apriori algorithm
Setup:
(1) Spawn n slave processes.
(2) Divide the database into partitions.
(3) Distribute the partitions to the slave processes.
Steps, alternating between the master process and the slave processes:
1. Master: send the global candidate (k-1)-itemsets C_{k-1} to each slave process.
2. Slaves: receive the global candidate (k-1)-itemsets C_{k-1}.
3. Slaves: count local supports for the global candidate (k-1)-itemsets C_{k-1}, and send the local supports to the master process.
4. Master: wait for and receive the local supports, then count global supports for the global candidate (k-1)-itemsets C_{k-1}.
5. Master: compute the frequent (k-1)-itemsets L_{k-1}, and send clusters of frequent (k-1)-itemsets L_{k-1} to the slave processes.
6. Slaves: receive the frequent (k-1)-itemsets L_{k-1} from the master process.
7. Slaves: generate local candidate k-itemsets and send them to the master process.
8. Master: wait for and receive the local candidate k-itemsets from the slave processes.
9. Master: take the union of the local candidate k-itemsets and prune it to form the global candidate k-itemsets.
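The master/slave exchange can be simulated in a single process to check that it yields the same frequent itemsets as sequential Apriori. The sketch below is not the Aglets-based implementation from the presentation: slaves are plain function calls, and two details the slides do not specify are assumptions made here, namely that partitions and clusters are assigned round-robin, and that each slave joins its cluster against the full L_{k-1}.

```python
from itertools import combinations

# Transactions T100..T900 from slide 10.
DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def slave_count(partition, candidates):
    """Steps 2-3: count local supports on this slave's partition."""
    return {c: sum(1 for t in partition if set(c) <= t) for c in candidates}


def slave_generate(cluster, frequent_prev):
    """Steps 6-7: join the itemsets in this slave's cluster with L_{k-1}."""
    local = set()
    for a in cluster:
        for b in frequent_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                local.add(a + (b[-1],))
    return local


def master(transactions, min_count, n_slaves):
    """Run the loop of steps 1-9 and return {itemset: global support count}."""
    # Setup: divide the database into partitions (round-robin is an assumption).
    partitions = [transactions[i::n_slaves] for i in range(n_slaves)]
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    result = {}
    while candidates:
        # Steps 1-4: send candidates, collect local supports, sum them.
        local = [slave_count(p, candidates) for p in partitions]
        global_count = {c: sum(l[c] for l in local) for c in candidates}
        # Step 5: frequent itemsets, split into one cluster per slave.
        frequent = sorted(c for c in candidates if global_count[c] >= min_count)
        result.update((c, global_count[c]) for c in frequent)
        clusters = [frequent[i::n_slaves] for i in range(n_slaves)]
        # Steps 6-8: each slave generates local candidates from its cluster.
        local_candidates = [slave_generate(cl, frequent) for cl in clusters]
        # Step 9: union the local candidates and prune with the Apriori principle.
        merged = set().union(*local_candidates)
        frequent_set = set(frequent)
        candidates = sorted(
            c for c in merged
            if all(s in frequent_set for s in combinations(c, len(c) - 1))
        )
    return result


result = master(DB, min_count=2, n_slaves=3)
print(len(result), result[("I1", "I2", "I5")])  # 13 2
```

With three simulated slaves the run finds the same 13 frequent itemsets as the sequential example (5 in L1, 6 in L2, 2 in L3), since summing local counts over disjoint partitions gives exactly the global count.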

19 A distributed Apriori algorithm
[Figure: message flow between the master and the slaves, with each slave holding one partition DB 1, DB 2, …, DB n]
- Master: SEND global candidate (k-1)-itemsets C_{k-1}
- Slaves: COUNT and SEND local supports for the global candidate (k-1)-itemsets (counting-support Aglets)
- Master: COUNT global supports for the global candidate (k-1)-itemsets C_{k-1}
- Master: FIND and SEND frequent (k-1)-itemsets L_{k-1}
- Slaves: JOIN and SEND local candidate k-itemsets (Aprio_gen Aglet), e.g., {AB}
- Master: UNIONIZE local candidate k-itemsets and PRUNE to form the global candidate k-itemsets C_k

20 Global support count & Global candidate itemsets
- If X is a candidate itemset, the global support count of X is the sum of the local support counts of X reported by the slaves.
- The set of global candidate k-itemsets GC_k is formed from the local candidate k-itemsets, each of which is produced by Apriori-gen from an ID segment (p, q) of GL_{k-1}.
- GL_k = { X ∈ GC_k | G-Supp(X) ≥ G-Min-Supp }

21 Outline
- Introduction
- Background
- A distributed Apriori algorithm using the mobile agents
- Experimental evaluation
- Conclusion

22 Experiments: Synthetic datasets
Synthetic datasets of varying sizes:
Name          |D|     |T|    Size (MB)
D100k.T30     100K    30     3
D100k.T100    100K    100    10
D320k.T150    320K    150    48
|D|: number of transactions
|T|: average number of items per transaction

23 Experiment environment
Software:
- Database: Oracle server
- Language: Java (JDK 1.3, Sun)
- Mobile agents: Aglets (IBM)
- Protocol: ATP (Agent Transfer Protocol)
- Platform: Windows
Hardware:
- PC: Pentium 3, 300 MHz, 128 MB RAM
- 15 machines (at the Knowledge Science Center, JAIST)

24 Execution time (sec.) with different minimum support thresholds: 35%, 40%, 50%

25 Execution time with min_sup 35%

26 Execution time with min_sup 40%

27 Execution time with min_sup 50%

28 Rate of execution time
The relationship between execution time and the number of slaves is nearly linear.

29 Conclusion
- Proposed a distributed Apriori algorithm for mining association rules.
- Experimental evaluation shows that the execution time decreases nearly linearly as the number of slaves increases.
- Future work:
  - Segment both the master and GL_k for support counts.
  - Develop incremental algorithms for association analysis using the mobile agent (MA) technology.

