Frequent Item Based Clustering. M.Sc. Student: Homayoun Afshar. Supervisor: Martin Ester.


1 Frequent Item Based Clustering
  M.Sc. Student: Homayoun Afshar
  Supervisor: Martin Ester

2 Contents
  Introduction and motivation
  Frequent item sets
  Text data as transactional data
  Cluster set definition
  Our approach
  Test data set, results, challenges
  Related works
  Conclusion

3 Introduction and Motivation
  A huge amount of information is available online
  Much of this information is in text format
    E.g. emails, web pages, newsgroup postings, …
  Related documents need to be grouped
  This is a nontrivial task

4 Frequent Item Sets
  Given a dataset D = {t_1, t_2, …, t_n}
  Each t_i is a transaction, t_i ⊆ I, where I is the set of all items
  Given a threshold min_sup
  i ⊆ I such that |{t | i ⊆ t and t ∈ D}| > min_sup
  i is a frequent item set with respect to minimum support min_sup
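The definition above can be illustrated with a small brute-force sketch. It is not the miner used in the talk (the references point to Apriori [AS94]); the function name `frequent_item_sets`, the `max_size` cap, and the toy transactions are assumptions made for illustration. Following the slide, an item set counts as frequent when its support is strictly greater than `min_sup`.

```python
from itertools import combinations

def frequent_item_sets(transactions, min_sup, max_size=3):
    """Return {item set: support} for item sets with support > min_sup.

    Brute-force, level-by-level enumeration; fine for toy data only.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, max_size + 1):
        found_any = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            # Support = number of transactions that contain the whole item set.
            support = sum(1 for t in transactions if cset <= t)
            if support > min_sup:
                frequent[cset] = support
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none of a larger size
    return frequent

# Hypothetical toy dataset D with four transactions.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = frequent_item_sets(D, min_sup=1)
for item_set, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(item_set), support)
```

With `min_sup=1`, the sets {a}, {b}, {c}, {a,b} and {a,c} qualify, while {b,c} (support 1) does not.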

5 Text Data As Transactional Data
  Treat each word as an item
  And each document as a transaction
  Using a minimum support, find the frequent item sets (frequent word sets)
  Frequent word sets correspond to frequent item sets
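A minimal sketch of this mapping, assuming a simple lowercase alphanumeric tokenizer. The function name `document_to_transaction` and the stop-word set are hypothetical, and stemming is left out to keep the example short.

```python
import re

def document_to_transaction(text, stop_words=frozenset()):
    """Turn one document into a transaction: the set of distinct words it contains."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return frozenset(w for w in words if w not in stop_words)

# Hypothetical two-document corpus; each document becomes one transaction.
docs = ["Net profit rose; net loss fell.", "The profit for the year rose."]
transactions = [document_to_transaction(d, stop_words={"the", "for"}) for d in docs]
print([sorted(t) for t in transactions])
```

Repeated words collapse into a single item, which matches the set-based definition of a transaction.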

6 Cluster Set Definition
  f = {X_1, X_2, …, X_n} is the set of all frequent item sets with respect to some minimum support
  c = {C_1, C_2, …, C_m} is a cluster set, where C_i is the set of documents covered by some X_k ∈ f
  And…

7 Cluster Set Definition
  … Each optimal cluster set has to:
    Cover the whole data set
    Minimize the mutual overlap between the clusters in the cluster set
    Have clusters of roughly the same size

8 Our Approach: Frequent-Item Based Clustering …
  Find all the frequent word sets
  Form cluster sets with just one cluster
    Overlap is zero
    Coverage is the support of the frequent item set representing the cluster
  Form cluster sets with two clusters
    Find the overlap and coverage

9 Our Approach: Frequent-Item Based Clustering …
  Prune the candidate list for cluster sets
    If Cov(c_i) ⊆ Cov(c_j) and Overlap(c_i) > Overlap(c_j)
      c_i and c_j are candidates in the same level
    Remove if Overlap(c_i) >= |Cov(c_i)|
  Generate the next level
  Find overlap and coverage, prune
  Stop when there are no more candidates left
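The slide does not spell out the pruning rule in full, so the following is only a sketch under stated assumptions: the coverage comparison is taken to be set inclusion, the candidate c_i on the dominated side is taken to be the one removed, and coverage is stored as an integer bitmask with one bit per document. The `Candidate` class and the `prune` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    cov: int      # bitmask of covered documents (assumed representation)
    overlap: int  # number of documents covered by more than one cluster

def popcount(mask):
    return bin(mask).count("1")

def prune(candidates):
    """Keep candidates of one level that survive both assumed pruning rules."""
    kept = []
    for ci in candidates:
        # Rule 2 on the slide: drop if Overlap(c_i) >= |Cov(c_i)|.
        if ci.overlap >= popcount(ci.cov):
            continue
        # Rule 1 (assumed reading): drop c_i if some c_j covers at least
        # the same documents with strictly less overlap.
        dominated = any(
            cj is not ci
            and (ci.cov & cj.cov) == ci.cov
            and ci.overlap > cj.overlap
            for cj in candidates
        )
        if not dominated:
            kept.append(ci)
    return kept

level = [
    Candidate("A", 0b1111, 1),
    Candidate("B", 0b0111, 2),  # covered by A, more overlap than A
    Candidate("C", 0b0011, 2),  # overlap is not smaller than coverage
    Candidate("D", 0b1100, 0),  # covered by A, but less overlap than A
]
survivors = [c.name for c in prune(level)]
print(survivors)  # ['A', 'D']
```

The pairwise comparison is quadratic in the number of candidates per level, which is one reason a real implementation would likely need something more refined.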

10 Our Approach: Coverage And Overlap …
  Using a bit matrix
    Each column is a document
    Each row is a frequent word set
  Coverage: OR, then count the 1s
  Overlap: XOR, OR, AND, then count the 1s

11 Our Approach: Coverage And Overlap …
    10110010 (1st)
    10001010 (2nd)
    10101100 (3rd)
    --------
  Coverage: OR all = 10111110, count the 1s -> coverage = 6
  Cost = 2 ORs + counting the 1s
  Cost of counting the 1s = 8 (shifts, ANDs, adds)
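The coverage computation can be reproduced in a few lines. This sketch assumes each row of the bit matrix is held as a Python integer; the function name `coverage` is hypothetical.

```python
def coverage(rows):
    """OR all rows together and count the set bits."""
    covered = 0
    for row in rows:
        covered |= row
    return covered, bin(covered).count("1")

# The three rows from the slide, one bit per document.
rows = [0b10110010, 0b10001010, 0b10101100]
mask, count = coverage(rows)
print(format(mask, "08b"), count)  # 10111110 6
```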

12 Our Approach: Coverage And Overlap …
  Overlap:
    10110010 (1st)
    10001010 (2nd)
    --------
    AND first two         = 10000010 (i)
    XOR first two         = 00111000 (ii)
    10101100 (3rd)
    --------
    AND 3rd with (ii)     = 00101000 (iii)
    --------
    OR (i) and (iii)      = 10101010
  Now count the 1s for the overlap -> overlap = 4
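The three-row overlap example can be checked directly, and generalized to any number of rows. The generalization below is a different formulation from the slide's AND/XOR sequence: it tracks the documents seen at least once and at least twice. It is a sketch with hypothetical names, assuming rows are Python integers.

```python
def overlap(rows):
    """Return the mask and count of documents covered by two or more rows."""
    seen_once = 0   # documents covered by at least one row so far
    seen_twice = 0  # documents covered by at least two rows so far
    for row in rows:
        seen_twice |= seen_once & row
        seen_once |= row
    return seen_twice, bin(seen_twice).count("1")

first, second, third = 0b10110010, 0b10001010, 0b10101100

# The slide's own sequence for three rows: (i) OR (iii).
slide_mask = (first & second) | (third & (first ^ second))

mask, count = overlap([first, second, third])
print(format(slide_mask, "08b"), format(mask, "08b"), count)  # 10101010 10101010 4
```

Both routes give the same mask for the slide's data, with an overlap of 4.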

13 Test Data, Results, Challenges
  Test data set
    Reuters-21578
      21578 documents of Reuters news
      8655 of them have exactly one topic
    Remove stop words
    Stem all the words
  Number of frequent word sets
    min_sup = 5%:  10678
    min_sup = 10%: 1217
    min_sup = 20%: 78

14 Test Data, Results, Challenges
  With 20% minimum support
    Sample 2-cluster candidate set: {(said,reuter)(line,ct,vs)}
      Overlap = 1
      Coverage = 5259
    Sample 5-cluster candidate set: {(reuter)(vs)(net)(line,ct,net)(vs,net,shr)}
      Overlap = 3303
      Coverage = 8609

15 Test Data, Results, Challenges
  More results, with min_sup = 10%
    6-cluster cluster set: {(reuter)(includ)(mln,includ)(mln,profit)(year,ct)(year,mln,net)}
      Coverage = 8616
      Overlap = 2553
    7-cluster cluster set: {(reuter)(loss)(profit)(year,1986)(mln,profit)(year,ct)(year,mln,net)}
      Coverage = 8611
      Overlap = 2705
    8-cluster cluster set: {(reuter)(loss)(profit)(year,1986)(mln,includ)(mln,profit)(year,ct)(year,mln,net)}
      Coverage = 8616
      Overlap = 3033

16 Test Data, Results, Challenges
  Lower support values
    Pruning is very slow
  2-cluster set with min_sup = 20%
    Creating = 0.010 seconds
    Updating = 1.853 seconds (overlap and coverage)
    Pruning = 11.767 seconds
    Sorting = 0.000 seconds
  Number of candidates
    Before prune = 3003
    After prune = 73
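The candidate count before pruning appears to match the number of ways to pick two of the 78 frequent word sets reported for a 20% minimum support. This reading assumes that every unordered pair of frequent word sets forms one 2-cluster candidate; the variable names are hypothetical.

```python
from math import comb

frequent_word_sets = 78                # reported for min_sup = 20%
pairs = comb(frequent_word_sets, 2)    # unordered pairs of frequent word sets
print(pairs)  # 3003
```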

17 Test Data, Results, Challenges
  Hierarchical clustering
  Clustering quality
    In our test data set: entropy
    In real data sets, classes are not known
  Test the pruning more efficiently
    Defining an upper threshold
    Using the following ratios to prune candidates: [formulas missing from transcript]
    or
    Using only max item sets
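The slide names entropy as the quality measure but gives no formula. The sketch below uses the common size-weighted cluster entropy over known class labels, which is an assumption about what was measured; the function names and the toy labels are hypothetical. Documents that fall into several overlapping clusters are simply counted once per cluster here.

```python
from collections import Counter
from math import log2

def cluster_entropy(labels):
    """Entropy (in bits) of the class labels inside one cluster."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def clustering_entropy(clusters):
    """Size-weighted average of the per-cluster entropies."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters if c)

# Hypothetical class labels of the documents in each cluster.
pure = [["earn", "earn"], ["acq", "acq"]]
mixed = [["earn", "acq"], ["earn", "earn"]]
print(clustering_entropy(pure), clustering_entropy(mixed))
```

Lower values mean purer clusters: the first clustering scores 0, the second 0.5.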

18 Related Works
  Similar idea
    Frequent Term-Based Text Clustering [BEX02]
      Florian Beil, Martin Ester, Xiaowei Xu
    Focuses on finding one optimal clustering set (non-overlapping): FTC
    Hierarchical clustering (overlapping): HFTC

19 Conclusion
  To get an optimal clustering
    Reduce the minimum support
    Reduce the number of frequent items
      Introduce a maximum support
      Use only max item sets
  Better pruning (speed)
  Hierarchical clustering

20 References
  [AS94] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 487-499, Santiago, Chile, Sept. 1994.
  [BEX02] F. Beil, M. Ester, X. Xu. Frequent Term-Based Text Clustering.
  J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

