1 Clustering for web documents 1 박흠

2 Contents
- Cluto
- Criterion Functions for Document Clustering: Experiments and Analysis (2002), by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
- Feature selection for web documents (2004)

3 Cluto
- Clustering Toolkit, release 2.1.1
- Department of Computer Science, University of Minnesota, Minneapolis
- http://www-users.cs.umn.edu/~karypis/
- Platforms: Linux 2.4.18, Sun OS 5.7, Win32
- Components: CLUTO's user-callable library and the programs vcluster and scluster

4 Cluto: What is Cluto (1/2)
Clustering algorithms
- partitional clustering
- agglomerative clustering
- graph-partitioning clustering
Clustering criterion functions
- Provides seven different criterion functions that can be used by both the partitional and the agglomerative clustering algorithms.
- Also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering.

5 Cluto: What is Cluto (2/2)
Analyzes the discovered clusters
- relations between the objects assigned to each cluster
- relations between the different clusters
- identifies the features that best describe and/or discriminate each cluster
- relationships between the clusters, objects, and features
Operates on very large datasets
- large in the number of objects
- large in the number of dimensions

6 Cluto: Programs
- vcluster: operates in the objects' feature space
- scluster: operates in the objects' similarity space
Interface
- vcluster [optional parameters] MatrixFile Ncluster
- MatrixFile: an n*m matrix whose rows correspond to objects and whose columns correspond to features
- Ncluster: number of clusters
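To make the interface concrete, here is a minimal sketch that writes a small matrix file and assembles a vcluster command line. It assumes CLUTO's sparse matrix layout (a header line with rows, columns, and non-zero count, then one line per object holding 1-based column/value pairs); the helper names `to_cluto_sparse` and `vcluster_command` and the file name `docs.mat` are hypothetical, and the command is only built, not executed.

```python
def to_cluto_sparse(rows, n_cols):
    """Serialize rows (dicts of 0-based column index -> value) as text.

    Assumed layout: header "n_rows n_cols n_nonzeros", then one line per
    object with "column value" pairs, columns numbered from 1.
    """
    nnz = sum(len(r) for r in rows)
    lines = [f"{len(rows)} {n_cols} {nnz}"]
    for r in rows:
        pairs = (f"{c + 1} {v:g}" for c, v in sorted(r.items()))
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"


def vcluster_command(matrix_file, n_clusters,
                     clmethod="rb", crfun="i2", sim="cos"):
    """Build (but do not run) a vcluster invocation as an argument list."""
    return ["vcluster", f"-clmethod={clmethod}", f"-crfun={crfun}",
            f"-sim={sim}", matrix_file, str(n_clusters)]


# Three objects (rows) described by three features (columns).
rows = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 0.5, 1: 1.5, 2: 2.5}]
text = to_cluto_sparse(rows, n_cols=3)
command = vcluster_command("docs.mat", 2)

print(text)
print(" ".join(command))
```

The argument list can be passed to `subprocess.run` on a machine where CLUTO is installed.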

7 Cluto: Parameters of Algorithms
- rb, rbr: the solution is computed with k-1 repeated bisections (rbr: the result is then optimized with respect to the criterion function)
- direct: the solution is computed by finding all k clusters simultaneously
- agglo: the agglomerative paradigm
- graph: clusters a nearest-neighbor graph
- bagglo

8 Cluto: Parameters of the similarity function
- cos: the cosine function (default)
- corr: the correlation coefficient
- dist: the Euclidean distance; applicable only when -clmethod=graph
- jacc: the extended Jaccard coefficient; applicable only when -clmethod=graph

9 Cluto: Parameters of the criterion function
- i1, i2, e1, g1, g1p, h1, h2

10 Cluto: Parameters of the criterion function (continued)
- slink: single link
- wslink: weighted single link
- clink: complete link
- wclink: weighted complete link
- upgma: UPGMA
Other parameters
- cstype, fulltree, rowmodel, colmodel, showfeatures

12 Criterion Functions for Document Clustering: Experiments and Analysis (2002), by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

13 Data Clustering, by A. K. Jain (Michigan State University), M. N. Murty (Indian Institute of Science), and P. J. Flynn (The Ohio State University). ACM Computing Surveys.

14 Introduction (1/2)
Clustering algorithms
- Agglomerative algorithms: UPGMA, single-link, complete-link, CURE, ROCK, Chameleon
- Partitional algorithms: K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based. They are well suited to large datasets because they are fast.
Seven criterion functions
- Each measures intra-cluster similarity, inter-cluster similarity, or a combination of the two.
- i1, i2, e1, g1, g1p, h1, h2

15 Introduction (2/2)
Datasets
- 15 different data sets

16 Preliminaries (1/3)
Document Representation
- Each document d is represented with the vector space model.
- tf: term frequency; tf_i: frequency of the i-th term in the document.
- Terms are weighted with idf or tf*idf; N: total number of documents.
Similarity Measures
- The similarity between two documents d_i and d_j is computed with the cosine function.
- ||d||: length of the document vector, used to normalize it.
- A value of 1 means the documents are identical; 0 means they have nothing in common.
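The representation and similarity measure above can be illustrated with a small sketch. It assumes the common weighting tf_i * log(N / df_i), where df_i is the number of documents containing term i; the slide's own formula is not preserved, so this choice, the function names, and the toy documents are all illustrative.

```python
import math


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # df[t]: number of documents that contain term t
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        v = [d.count(t) * math.log(n / df[t]) for t in vocab]
        length = math.sqrt(sum(x * x for x in v))
        vectors.append([x / length for x in v] if length > 0 else v)
    return vocab, vectors


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


docs = [
    ["web", "cluster", "cluster"],
    ["web", "graph"],
    ["cluster", "cluster", "web"],
]
vocab, vecs = tfidf_vectors(docs)

# "web" occurs in every document, so its idf (and weight) is zero.
sim_same = cosine(vecs[0], vecs[2])       # same term counts
sim_disjoint = cosine(vecs[0], vecs[1])   # no weighted term in common
print(sim_same, sim_disjoint)
```

The first pair scores 1 and the second scores 0, matching the two ends of the scale described on the slide.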

17 Preliminaries (2/3)
Euclidean function
- For unit-length document vectors: if dis = 0, the documents are identical; if dis = √2, they have nothing in common.
Definitions
- S: the set of documents
- S_1, S_2, …, S_k: the sets of documents in each of the k clusters
- k: number of clusters
- n_1, n_2, …, n_k: sizes of the corresponding clusters
- A: a set of documents
- D_A: composite vector of A, the sum of all document vectors in A
- C_A: centroid vector of A, obtained by averaging the term weights of the documents in A
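The end points of the Euclidean measure, and its link to the cosine, follow from a short expansion. This is a sketch that assumes document vectors are scaled to unit length, as the normalization on the previous slide suggests; the symbol |A| for the number of documents in A is introduced here for illustration.

```latex
\begin{aligned}
\mathrm{dis}(d_i, d_j)^2 &= \|d_i - d_j\|^2
  = \|d_i\|^2 + \|d_j\|^2 - 2\, d_i^t d_j
  = 2 - 2\cos(d_i, d_j), \\
D_A &= \sum_{d \in A} d, \qquad C_A = \frac{D_A}{|A|}.
\end{aligned}
```

With cos = 1 the distance is 0, and with cos = 0 it is √2.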

18 Preliminaries (3/3)
Vector Properties
- S_i, S_j: two sets of documents containing n_i and n_j documents
- D_i, D_j: their composite vectors; C_i, C_j: their centroid vectors
- The sum of the pairwise similarities between the documents in S_i and the documents in S_j is D_i^t D_j.
- The sum of the pairwise similarities between the documents in S_i is ||D_i||^2.
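Both vector properties can be checked numerically. The sketch below assumes unit-length document vectors, so that a dot product equals a cosine similarity; the example vectors and helper names are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Composite vector D: component-wise sum of the document vectors."""
    return [sum(col) for col in zip(*docs)]


S_i = [unit([1, 0, 2]), unit([0, 1, 1])]
S_j = [unit([3, 1, 0]), unit([1, 1, 1]), unit([0, 2, 1])]
D_i, D_j = composite(S_i), composite(S_j)

# Sum of cosine similarities over all pairs (one doc from S_i, one from S_j).
between = sum(dot(a, b) for a in S_i for b in S_j)
# Sum over all ordered pairs inside S_i, including each doc with itself.
within = sum(dot(a, b) for a in S_i for b in S_i)

print(between, dot(D_i, D_j))   # the two numbers agree
print(within, dot(D_i, D_i))    # the two numbers agree
```

This is why the criterion functions can be evaluated from composite vectors alone, without enumerating document pairs.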

19 Criterion Functions (1/5)
Internal Criterion Functions
- I1: maximizes the sum of the average pairwise similarities between the documents assigned to each cluster, using the cosine function. It is similar to the criterion of hierarchical agglomerative clustering that uses the group-average heuristic to decide which clusters to merge.
- I2: the criterion used by the vector-space variant of the K-means algorithm, using the cosine function. C_r: centroid vector of cluster r.
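The formulas on this slide are not preserved. A minimal sketch, assuming the composite-vector forms usually given for these functions (I1 as the sum over clusters of ||D_r||^2 / n_r, I2 as the sum of ||D_r||) and unit-length document vectors:

```python
import math


def composite(docs):
    """Composite vector: component-wise sum of the document vectors."""
    return [sum(col) for col in zip(*docs)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def i1(clusters):
    """Assumed form: sum over clusters of ||D_r||^2 / n_r (to be maximized)."""
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    """Assumed form: sum over clusters of ||D_r|| (to be maximized)."""
    return sum(norm(composite(c)) for c in clusters)


# Three unit-length documents: two identical, one orthogonal to them.
good = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]   # identical docs together
bad = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]    # identical docs split

print(i1(good), i1(bad))
print(i2(good), i2(bad))
```

Both functions score the grouping that keeps the identical documents together higher, which is the behavior an internal criterion is meant to reward.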

20 Criterion Functions (2/5)
External Criterion Functions: E1, E2
- Optimize a function based on how different the clusters are from each other.
- The external function is derived so that the centroid vectors of the different clusters are as orthogonal as possible.
- C: centroid vector of the entire collection; D: composite vector of the entire collection.
- 1/||D|| is constant, so it does not affect the optimization.
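The E1 formula itself is missing from the transcript. A minimal sketch, assuming the form usually given for it (minimize the sum over clusters of n_r * D_r^t D / ||D_r||, which is the size-weighted cosine between each cluster centroid and the collection centroid with the constant 1/||D|| dropped):

```python
import math


def composite(docs):
    """Composite vector: component-wise sum of the document vectors."""
    return [sum(col) for col in zip(*docs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def e1(clusters):
    """Assumed form: sum of n_r * D_r^t D / ||D_r|| (to be minimized)."""
    all_docs = [d for c in clusters for d in c]
    D = composite(all_docs)
    total = 0.0
    for c in clusters:
        D_r = composite(c)
        total += len(c) * dot(D_r, D) / norm(D_r)
    return total


good = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]   # identical docs together
bad = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]    # identical docs split

print(e1(good), e1(bad))
```

Unlike I1 and I2, lower is better here: the grouping whose clusters differ more from the collection as a whole gets the smaller value.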

21 Criterion Functions (3/5)
- An external criterion function can also be defined with the Euclidean distance function.
Hybrid Criterion Functions: H1, H2
- Maximize the similarity of the documents within each cluster while minimizing the similarity between each cluster's documents and the entire collection.
- H1: combines the criterion functions I1 and E1.

22 Criterion Functions (4/5)
- H2: combines the criterion functions I2 and E1.
Graph-Based Criterion Functions
- Another way to view the relations between documents is to use graphs.
- G1: based on the pairwise similarities between the documents
- G2: based on the pairwise similarities between the documents and the terms
- S: the given collection of n documents; G_s: its similarity graph
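How the hybrid functions combine the others is not spelled out in the transcript. A minimal sketch, assuming the ratio forms H1 = I1 / E1 and H2 = I2 / E1 (both maximized) and the composite-vector forms of I1, I2, and E1 for unit-length documents; all helper names are illustrative.

```python
import math


def composite(docs):
    return [sum(col) for col in zip(*docs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def i1(clusters):
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    return sum(norm(composite(c)) for c in clusters)


def e1(clusters):
    D = composite([d for c in clusters for d in c])
    return sum(len(c) * dot(composite(c), D) / norm(composite(c))
               for c in clusters)


def h1(clusters):
    """Assumed form: I1 / E1 (to be maximized)."""
    return i1(clusters) / e1(clusters)


def h2(clusters):
    """Assumed form: I2 / E1 (to be maximized)."""
    return i2(clusters) / e1(clusters)


good = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]   # identical docs together
bad = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]    # identical docs split

print(h1(good), h1(bad))
print(h2(good), h2(bad))
```

Dividing by E1 rewards a solution both for tight clusters (large numerator) and for clusters that stand apart from the collection (small denominator).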

23 Criterion Functions (5/5)
- G1, G2 (formulas not preserved in the transcript)
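As an illustration of the graph-based idea, the sketch below assumes the form usually given for G1: for each cluster, the edge-cut between the cluster and the rest of the collection divided by the sum of similarities inside the cluster, summed over clusters and minimized. With unit-length documents this becomes D_r^t (D - D_r) / ||D_r||^2. G2 is not sketched because its form cannot be recovered from the transcript.

```python
def composite(docs):
    """Composite vector: component-wise sum of the document vectors."""
    return [sum(col) for col in zip(*docs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def g1(clusters):
    """Assumed form: sum of D_r^t (D - D_r) / ||D_r||^2 (to be minimized)."""
    D = composite([d for c in clusters for d in c])
    total = 0.0
    for c in clusters:
        D_r = composite(c)
        rest = [x - y for x, y in zip(D, D_r)]   # composite of the other docs
        total += dot(D_r, rest) / dot(D_r, D_r)  # cut / internal similarity
    return total


good = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]   # identical docs together
bad = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]    # identical docs split

print(g1(good), g1(bad))
```

In the first grouping no similarity crosses a cluster boundary, so the cut term vanishes.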

26 Experimental Results: Direct k-way Clustering

29 Data Sets
- Source: the 'Natural Science' category in the Naver directory (http://dir.naver.com)
- 6 subcategories in the corpus
- 1,215 docs, 17,223 terms, 20 clusters, 5 features per doc, idf weighting
Sub Category     No. of Docs
Physics          102
Earth science    149
Biology          426
Astrology        323
Mathematics      102
Chemistry        113
Total            1,215

30 Experimental parameters: Algorithms
- rb, rbr: the solution is computed with k-1 repeated bisections (rbr: the result is then optimized with respect to the criterion function)
- direct: the solution is computed by finding all k clusters simultaneously
- agglo: the agglomerative paradigm
- graph: clusters a nearest-neighbor graph

31 Experimental parameters
- Criterion functions: i1, i2, e1, g1, g1p, h1, h2, clink, slink
- Similarity function: cosine measure

32 Experimental results: Entropy
         rb     rbr    direct  agglo   graph
I1       .464   .452   .490    .642    .417
I2       .379   .375   .374    .564
E1       .388   .398   .416    .540
G1       .389   .418   .398    .895
G1p      .326   .366   .391    .562
H1       .386   .392   .386    .541
H2       .348   .352   .367    .559
Clink                          .761
slink                          .895

33 Entropy (chart not preserved in the transcript)

34 Experimental results: Purity
         rb     rbr    direct  agglo   graph
I1       .686   .690   .683    .548    .749
I2       .772   .762   .761    .629
E1       .741   .737   .723    .647
G1       .768   .739   .752    .367
G1p      .780   .758   .647    (only three values survive; their columns are uncertain)
H1       .753   .744   .758    .634
H2       .780   .782   .751    .650
Clink                          .458
slink                          .368
The graph column uses cut functions.

35 Purity (chart not preserved in the transcript)
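The slides report entropy and purity without defining them. A minimal sketch, assuming the usual definitions: each cluster's entropy is computed over the class labels of its documents and normalized by log q for q classes, its purity is the fraction of documents in its majority class, and both are averaged over clusters weighted by cluster size. The labels below are made up.

```python
import math
from collections import Counter


def entropy_and_purity(clusters, n_classes):
    """clusters: list of lists of true class labels, one list per cluster.

    n_classes must be at least 2 so that log(n_classes) is non-zero.
    Returns (entropy, purity); lower entropy and higher purity are better.
    """
    n = sum(len(c) for c in clusters)
    entropy = 0.0
    purity = 0.0
    for labels in clusters:
        n_r = len(labels)
        counts = Counter(labels)
        e_r = -sum((c / n_r) * math.log(c / n_r) for c in counts.values())
        e_r /= math.log(n_classes)            # normalize to the range [0, 1]
        p_r = max(counts.values()) / n_r      # share of the majority class
        entropy += (n_r / n) * e_r
        purity += (n_r / n) * p_r
    return entropy, purity


pure = [["physics", "physics"], ["biology", "biology"]]
mixed = [["physics", "biology"], ["physics", "biology"]]

pure_scores = entropy_and_purity(pure, n_classes=2)
mixed_scores = entropy_and_purity(mixed, n_classes=2)
print(pure_scores, mixed_scores)
```

A perfect clustering gives entropy 0 and purity 1, which is why the best entries in the tables are the smallest entropies and the largest purities.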

36 Best results
            rb      rbr     direct  agglo   graph
criterion   g1p     h2      h1      h1      cut
entropy     0.326   0.352   0.386   0.541   0.417
purity      0.780   0.782   0.758   0.634   0.749

