A distributed method for mining association rules

Pham Nguyen Anh Huy*, Department of Information Technology, Vietnam National University of Ho Chi Minh City
Presented by Ho Tu Bao, School of Knowledge Science, Japan Advanced Institute of Science and Technology
(*Work done during the author's three-month JSPS fellowship at JAIST)

Outline
- Introduction
- Background
- A distributed Apriori algorithm using mobile agents
- Experimental evaluation
- Conclusion

Introduction
- Association analysis is a new and attractive research area in data mining.
- The Apriori algorithm (R. Agrawal, IBM, 1993) is a key technique for association analysis.
- Although the Apriori principle considerably reduces the search space, the technique still requires heavy computation, particularly for large databases.
- This research proposes a distributed version of the Apriori algorithm using mobile agents.
- The experiments show that computation time can be reduced by spreading the work over computers in a distributed computing environment.

Outline
- Introduction
- Background
  - Association rules and Apriori algorithm
  - Mobile agents and Aglets
- A distributed Apriori algorithm using mobile agents
- Experimental evaluation
- Conclusion

Association rule mining analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. Rules take the form X → Y, where X and Y are sets of items.
- Items: I = {I1 = beer, I2 = cake, I3 = onigiri}
- An association rule: {I1} → {I3}. How often do people buy onigiri and beer together?
- Transactional database:
  TID1: {I1, I2, I3}
  TID2: {I1, I2}
  TID3: {I2, I3}
  TID4: {I2}
  TID5: {I1, I2}

Rule measures: Support and Confidence
- Association rule: X → Y
- Support s: the probability that a transaction contains both X and Y
- Confidence c: the conditional probability that a transaction containing X also contains Y
- Example: A → C (s = 50%, c = 66.6%); C → A (s = 50%, c = 100%)
[Figure: Venn diagram of customers who buy beer, customers who buy onigiri, and customers who buy both]
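The two measures can be checked by hand on the five-basket database from the shopping-basket slide. The following is a minimal sketch, not code from the presentation; the function names `support` and `confidence` are illustrative assumptions.

```python
# Five-basket database from the shopping-basket slide
# (I1 = beer, I2 = cake, I3 = onigiri).
transactions = [
    {"I1", "I2", "I3"},  # TID1
    {"I1", "I2"},        # TID2
    {"I2", "I3"},        # TID3
    {"I2"},              # TID4
    {"I1", "I2"},        # TID5
]


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)


def confidence(antecedent, consequent, db):
    """Estimate of P(consequent | antecedent): support(X and Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)


s = support({"I1", "I3"}, transactions)
c = confidence({"I1"}, {"I3"}, transactions)
print(f"{{I1}} -> {{I3}}: support={s:.0%}, confidence={c:.1%}")
# prints: {I1} -> {I3}: support=20%, confidence=33.3%
```

Beer and onigiri appear together only in TID1, so the rule {I1} → {I3} has support 1/5, and since beer appears in three baskets its confidence is 1/3.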

Association mining: Apriori algorithm
It is composed of two steps (Agrawal, R., 1993):
1. Find all frequent itemsets. By definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets. By definition, these rules must satisfy minimum support and minimum confidence.
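The second step can be illustrated on the five-basket database from the shopping-basket slide. This is a hedged sketch under the assumption that the itemset passed in is already known to be frequent; `support_count` and `strong_rules` are hypothetical helper names, not part of the presentation.

```python
from itertools import combinations

# Five-basket database from the shopping-basket slide.
transactions = [
    {"I1", "I2", "I3"},
    {"I1", "I2"},
    {"I2", "I3"},
    {"I2"},
    {"I1", "I2"},
]


def support_count(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)


def strong_rules(frequent_itemset, db, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf.

    Assumes `frequent_itemset` occurs in at least one transaction, so every
    antecedent has a non-zero support count.
    """
    items = frozenset(frequent_itemset)
    whole = support_count(items, db)
    rules = []
    # Every non-empty proper subset is tried as an antecedent.
    for size in range(1, len(items)):
        for subset in combinations(sorted(items), size):
            antecedent = frozenset(subset)
            conf = whole / support_count(antecedent, db)
            if conf >= min_conf:
                rules.append((set(antecedent), set(items - antecedent), conf))
    return rules


rules = strong_rules({"I1", "I2"}, transactions, min_conf=0.7)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), f"confidence={conf:.0%}")
```

With a 70% confidence threshold only {I1} → {I2} survives (3/3 = 100%); the reverse rule {I2} → {I1} reaches only 3/5 = 60%.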

Association mining: Apriori principle
- Min. support 50%, min. confidence 50%
- For rule A → C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent (if an itemset is not frequent, none of its supersets is frequent).

The Apriori algorithm: Finding frequent itemsets using candidate generation
- Find the frequent itemsets: the sets of items whose support is at least the minimum support.
- Every subset of a frequent itemset must also be a frequent itemset; i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
- Iteratively find the frequent itemsets Lk of cardinality 1 to k (k-itemsets) from the candidate itemsets Ck (Lk ⊆ Ck).
- Use the frequent itemsets to generate association rules.
C1 → … → Li-1 → Ci → Li → Ci+1 → … → Lk

Example (min_sup_count = 2)
Transactional data D:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate to obtain C1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate support count with the minimum support count to obtain L1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Example (min_sup_count = 2)
Generate candidates C2 from L1 using the Apriori principle, then scan D for the count of each candidate:
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

Compare each candidate support count with the minimum support count to obtain L2:
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2

Generate candidates C3 from L2 using the Apriori principle, scan D for the count of each candidate, and compare with the minimum support count. Both candidates meet the threshold, so C3 and L3 contain the same itemsets:
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
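The worked example can be reproduced with a short level-wise loop. This is a minimal sketch rather than the presenters' implementation: `apriori_gen` and `apriori` are illustrative names, and the join step simply unions pairs of frequent (k-1)-itemsets before pruning, which yields the same candidates as the usual prefix-based join once the Apriori-principle prune is applied.

```python
from itertools import combinations

# Nine-transaction database D from the example slides.
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}


def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            # Keep a k-itemset only if all of its (k-1)-subsets are frequent.
            if len(union) == k and all(
                frozenset(sub) in prev_frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(db, min_sup_count):
    """Return {frequent itemset: support count} for every itemset size."""
    candidates = {frozenset([item]) for t in db for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once per level to count each candidate.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent.update(level)
        k += 1
        candidates = apriori_gen(set(level), k)
    return frequent


frequent = apriori(list(transactions.values()), min_sup_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The run ends after the third level: the only possible 4-itemset, {I1, I2, I3, I5}, is pruned because its subset {I1, I3, I5} is not frequent.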

Agents and Mobile agents
Mobile network agents are programs that:
- can migrate from system to system within a network environment;
- perform some processing at each host;
- decide when and where to move next.
How does a mobile agent move?
1. Save its state.
2. Transport the saved state to the next system.
3. Resume execution from the saved state.
An agent is a computational entity that:
- acts on behalf of other entities in an autonomous fashion;
- performs its actions with some level of proactivity and reactivity;
- exhibits some level of the key attributes of cooperation.

Distributed Computing using Mobile Programs

Mobile agent tools

What are Aglets?
- Aglets (Agile Applets) are Java objects that can move from one host on the Internet to another and perform arbitrary operations within the security limits.
- When an aglet moves, it takes along its program code as well as its data.
- The Aglets framework is implemented by the Aglets Software Development Kit (ASDK) from IBM, an environment for programming mobile Internet agents in Java.

Aglets at Runtime
- Aglets currently use the Agent Transfer Protocol (ATP), which is modeled after HTTP, as the default implementation of the communication layer.
- Used on the Tahiti aglet server.
- The Aglets Server Interface is used to write applications capable of hosting, receiving, and dispatching aglets.

A distributed Apriori algorithm
Initialization: (1) spawn n slave processes; (2) divide the database into partitions; (3) distribute the partitions to the slave processes.
Master process:
- send the global candidate (k-1)-itemsets Ck-1 to each slave process;
- wait for and receive the local supports, then count the global supports for the global candidate (k-1)-itemsets Ck-1;
- compute the frequent (k-1)-itemsets Lk-1 and send clusters of Lk-1 to the slave processes;
- wait for and receive the local candidate k-itemsets from the slave processes;
- take the union of the local candidate k-itemsets and prune it to form the global candidate k-itemsets.
Slave processes:
- receive the global candidate (k-1)-itemsets Ck-1;
- count the local supports for the global candidate (k-1)-itemsets Ck-1 and send them to the master process;
- receive the frequent (k-1)-itemsets Lk-1 from the master process;
- generate the local candidate k-itemsets and send them to the master process.
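The master/slave exchange can be simulated in a single process to see why the result matches sequential Apriori. This is a sketch under several assumptions, none of which come from the presentation: the `Slave` class stands in for a remote Aglet, partitions and clusters are split round-robin, each slave receives its cluster together with the full Lk-1 for joining, and the data is the nine-transaction example database.

```python
from itertools import combinations

# Nine-transaction example database, as a list so it can be partitioned.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


class Slave:
    """Hypothetical stand-in for a slave process holding one partition."""

    def __init__(self, partition):
        self.partition = partition

    def count_local_supports(self, candidates):
        """Count each global candidate in this slave's partition only."""
        return {c: sum(1 for t in self.partition if c <= t) for c in candidates}

    def generate_local_candidates(self, cluster, frequent_prev, k):
        """Join this slave's cluster of L(k-1) against all of L(k-1)."""
        return {a | b for a in cluster for b in frequent_prev if len(a | b) == k}


def prune(candidates, frequent_prev, k):
    """Master-side prune: every (k-1)-subset must be globally frequent."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    }


def distributed_apriori(db, n_slaves, min_sup_count):
    # Initialization: spawn slaves and hand each one a partition.
    slaves = [Slave(db[i::n_slaves]) for i in range(n_slaves)]
    candidates = {frozenset([item]) for t in db for item in t}
    frequent, k = {}, 1
    while candidates:
        # Master sends the candidates; each slave returns local supports.
        local = [s.count_local_supports(candidates) for s in slaves]
        # Master sums local supports into global supports.
        global_counts = {c: sum(l[c] for l in local) for c in candidates}
        level = {c: n for c, n in global_counts.items() if n >= min_sup_count}
        frequent.update(level)
        k += 1
        # Master sends one cluster of the frequent itemsets to each slave.
        ordered = sorted(level, key=sorted)
        clusters = [ordered[i::n_slaves] for i in range(n_slaves)]
        local_candidates = [
            s.generate_local_candidates(cluster, set(level), k)
            for s, cluster in zip(slaves, clusters)
        ]
        # Master takes the union of local candidates and prunes it.
        candidates = prune(set().union(*local_candidates), set(level), k)
    return frequent


result = distributed_apriori(transactions, n_slaves=3, min_sup_count=2)
print(len(result), "frequent itemsets found")
```

Because global supports are sums over disjoint partitions, the set of frequent itemsets does not depend on the number of slaves; only the counting and joining work per slave changes.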

A distributed Apriori algorithm
[Figure: message flow between the master (holding DB) and the slaves (holding partitions DB1, DB2, …, DBn)]
- Master: SEND the global candidate (k-1)-itemsets Ck-1.
- Slaves: COUNT and SEND the local supports for the global candidate (k-1)-itemsets (counting-support Aglets).
- Master: COUNT the global supports for the global candidate (k-1)-itemsets Ck-1.
- Master: FIND and SEND the frequent (k-1)-itemsets Lk-1.
- Slaves: JOIN and SEND the local candidate k-itemsets (Aprio_gen Aglet).
- Master: UNIONIZE the local candidate k-itemsets and PRUNE to form the global candidate k-itemsets Ck, e.g., {AB}.

Global support count & Global candidate itemsets
- X is a candidate itemset. The global support count of X, X.G-Supp, is the sum of the local support counts of X over all n partitions.
- The set of global candidate k-itemsets GCk is formed from the local candidate k-itemsets, which are generated by Apriori-gen with an ID segment (p, q) of GLk-1.
- GLk = {X ∈ GCk | X.G-Supp ≥ G-Min-Supp}

Experiments: Synthetic datasets
Using synthetic datasets of varying sizes:
Name         |D|    |T|   Size (MB)
D100k.T30    100K   30    3
D100k.T100   100K   100   10
D320k.T150   320K   150   48
|D|: number of transactions
|T|: average number of items per transaction

Experiment environment
Software
- Database: Oracle server
- Language: Java (Sun JDK 1.3)
- Mobile agents: IBM Aglets
- Protocol traffic: ATP (Agent Transfer Protocol)
- Platform: Windows
Hardware
- PCs: Pentium 3, 300 MHz, 128 MB RAM
- 15 machines (at the Knowledge Science Center, JAIST)

Execution time (sec.) with different minimum support thresholds
[Table: execution times for minimum support thresholds of 35%, 40%, and 50%]

Execution time with min_sup 35%

Execution time with min_sup 40%

Execution time with min_sup 50%

Rate of execution time
The relationship between execution time and the number of slaves is nearly linear: execution time falls as slaves are added.

Conclusion
- Proposed a distributed Apriori algorithm for mining association rules.
- Experimental evaluation shows that execution time decreases nearly linearly as the number of slaves increases.
Future work:
- Segment both the master and GLk for support counts.
- Develop incremental algorithms for association analysis using mobile agent (MA) technology.