Evaluation of Clustering Techniques on DMOZ Data


1 Evaluation of Clustering Techniques on DMOZ Data
Alper Rifat Uluçınar, Rıfat Özcan, Mustafa Canım

2 Outline
What is DMOZ and why do we use it? What is our aim? Evaluation of partitioning clustering algorithms. Evaluation of hierarchical clustering algorithms. Conclusion.

3 What is DMOZ and why do we use it?
Another name for ODP, the Open Directory Project. The largest human-edited directory on the Internet: 5,300,000 sites, 72,000 editors, 590,000 categories.

4

5

6 What is our aim?
Evaluating clustering algorithms is not easy. We will use DMOZ as a reference point (the ideal cluster structure), run our own clustering algorithms on the same data, and finally compare the results.

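7 Applying clustering algorithms such as C3M, K-Means, etc.
Diagram: all DMOZ documents (websites) are grouped by human evaluation into the DMOZ clusters, and by clustering algorithms such as C3M, K-Means, etc. into computed clusters. How the two results relate is the open question.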

8 A) Evaluation of Partitioning Clustering Algorithms
20,000 documents from DMOZ flat partitioned data (214 folders). We applied HTML parsing, stemming, and stop-word elimination. We will apply two clustering algorithms: C3M and K-Means.
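The three preprocessing steps named above can be sketched roughly as follows. This is only an illustration using the Python standard library: the stop-word list is a tiny placeholder, and `naive_stem` is a crude stand-in for a real stemmer (such as Porter's), since the slides do not say which stemmer or stop-word list was used.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; the slides do not give theirs).
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def naive_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """HTML parsing -> lowercase tokens -> stop-word removal -> stemming."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


page = "<html><body><h1>Clustering the pages</h1><script>var x = 1;</script></body></html>"
print(preprocess(page))  # ['cluster', 'page']
```

The resulting token lists would then be turned into document vectors for C3M or K-Means; that step is not shown here.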

9 Before applying html parsing, stemming, stop word list elimination

10 After applying html parsing, stemming, stop word list elimination

11 Applying C3M
Diagram: 20,000 DMOZ documents. Human evaluation gives 214 clusters; applying C3M gives 642 clusters.

12 How to compare DMOZ Clusters and C3M clusters ?
Answer: Corrected Rand

13 Validation of Partitioning Clustering
Comparison of two clustering structures over N documents. Clustering structure 1: R clusters. Clustering structure 2: C clusters. Metrics [1]: Rand index, Jaccard coefficient, corrected Rand coefficient.

14 Validation of Partitioning Clustering
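Each pair of documents (d1, d2) falls into one of four types:
Type I, frequency a: d1 and d2 are in the same cluster in both structures.
Type II, frequency b: d1 and d2 are in the same cluster in structure 1 but in different clusters in structure 2.
Type III, frequency c: d1 and d2 are in different clusters in structure 1 but in the same cluster in structure 2.
Type IV, frequency d: d1 and d2 are in different clusters in both structures.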

15 Validation of Partitioning Clustering
Rand Index = (a+d) / (a+b+c+d)
Jaccard Coefficient = a / (a+b+c)
Corrected Rand Coefficient: accounts for randomness. It normalizes the Rand index R so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved: CR = (R - E(R)) / (1 - E(R)), where E(R) is the expected value of R under chance.
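As a rough sketch of the Rand index and Jaccard coefficient above, the pair frequencies a, b, c, d can be counted directly from two label lists. The function names and the label-list representation are assumptions made for illustration; here a counts pairs placed together in both structures, b pairs together only in structure 1, c pairs together only in structure 2, and d pairs separated in both.

```python
from itertools import combinations


def pair_counts(labels_1, labels_2):
    """Count document pairs by agreement type (frequencies a, b, c, d).

    labels_1[i] and labels_2[i] are the cluster ids of document i
    in clustering structure 1 and clustering structure 2.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_1)), 2):
        same_1 = labels_1[i] == labels_1[j]
        same_2 = labels_2[i] == labels_2[j]
        if same_1 and same_2:
            a += 1  # together in both structures
        elif same_1:
            b += 1  # together in structure 1 only
        elif same_2:
            c += 1  # together in structure 2 only
        else:
            d += 1  # separated in both structures
    return a, b, c, d


def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(a, b, c):
    return a / (a + b + c)


a, b, c, d = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
print(a, b, c, d)                    # 1 1 2 2
print(rand_index(a, b, c, d))        # 0.5
print(jaccard_coefficient(a, b, c))  # 0.25
```

Enumerating all pairs is quadratic in N, so for 20,000 documents a contingency-table formulation would likely be preferable to this loop.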

16 Validation of Partitioning Clustering
Example. Docs: d1, d2, d3, d4, d5, d6. Clustering structure 1: C1 = {d1, d2, d3}, C2 = {d4, d5, d6}. Clustering structure 2: D1 = {d1, d2}, D2 = {d3, d4}, D3 = {d5, d6}.

17 Validation of Partitioning Clustering
Contingency table (rows: structure 1, columns: structure 2):
       D1  D2  D3  Total
C1      2   1   0      3
C2      0   1   2      3
Total   2   2   2      6
a: (d1, d2), (d5, d6)
b: (d1, d3), (d2, d3), (d4, d5), (d4, d6)
c: (d3, d4)
d: the remaining 8 pairs (15 - 7)
Rand Index = (2+8)/15 = 0.67
Jaccard Coefficient = 2/(2+4+1) = 0.29
Corrected Rand = 0.24
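The corrected Rand value for this example can be checked with a short sketch built on the contingency table. It assumes the slides mean the Hubert-Arabie form of the corrected Rand coefficient (the slides do not name the variant); that form does reproduce the 0.24 quoted for this example. The function name and label encoding are illustrative.

```python
from collections import Counter
from math import comb


def corrected_rand(labels_1, labels_2):
    """Chance-corrected Rand coefficient (Hubert-Arabie form, assumed)."""
    n = len(labels_1)
    if n < 2:
        raise ValueError("need at least two documents")
    cells = Counter(zip(labels_1, labels_2))  # contingency table entries
    rows = Counter(labels_1)                  # row totals (structure 1)
    cols = Counter(labels_2)                  # column totals (structure 2)
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2          # agreement at a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both structures are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# d1..d6 with C1 = {d1, d2, d3}, C2 = {d4, d5, d6}
# and D1 = {d1, d2}, D2 = {d3, d4}, D3 = {d5, d6}
structure_1 = [0, 0, 0, 1, 1, 1]
structure_2 = [0, 0, 1, 1, 2, 2]
print(round(corrected_rand(structure_1, structure_2), 2))  # 0.24
```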

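18 Results
Low corrected Rand and Jaccard values (~0.01); Rand index ~0.77. The Rand index can stay high when most document pairs are separated in both structures, which is why the chance-corrected value is the more informative one. Possible reasons for the low values: noise in the data (e.g. 300 "Document Not Found" pages); the problem is difficult (e.g. the Homepages category).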

19 B) Evaluation of Hierarchical Clustering Algorithms
Obtain a partitioning of DMOZ: determine a depth (by experiment?), then collect the documents of higher (or equal) depth at that level. Documents of lower depths? Ignore them…
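One possible reading of this step, sketched in code: each document at or below the chosen depth is assigned to its ancestor category at that depth, and documents sitting above the cut are dropped. The `doc_paths` mapping and the slash-separated path format are assumptions made for illustration, not something the slides specify.

```python
def partition_at_depth(doc_paths, depth):
    """Flatten a category hierarchy into a partition at a fixed depth.

    doc_paths: dict mapping a document id to its category path,
               e.g. "Top/Arts/Music/Jazz" (hypothetical format).
    depth:     number of path components that define a cluster.
    """
    partition = {}
    for doc_id, path in doc_paths.items():
        parts = path.strip("/").split("/")
        if len(parts) < depth:
            continue  # document is shallower than the cut: ignore it
        key = "/".join(parts[:depth])  # ancestor category at the cut depth
        partition.setdefault(key, []).append(doc_id)
    return partition


docs = {
    "d1": "Top/Arts/Music/Jazz",
    "d2": "Top/Arts/Music",
    "d3": "Top/Arts",
    "d4": "Top/Science/Physics",
}
print(partition_at_depth(docs, 3))
# {'Top/Arts/Music': ['d1', 'd2'], 'Top/Science/Physics': ['d4']}
```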

20 Hierarchical Clustering: Steps
Obtain the hierarchical clusters using single linkage, average linkage, and complete linkage. Then obtain a partitioning on the hierarchical cluster…
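To make the three linkage criteria concrete, here is a deliberately naive agglomerative clustering sketch over points with Euclidean distance. It is meant only to show how single, complete, and average linkage differ in the merge height they assign; the function name and the merge-history format are illustrative, and the brute-force search would be far too slow for 20,000 documents.

```python
from math import dist


def agglomerative(points, linkage="single"):
    """Return the merge history [(members_a, members_b, height), ...]."""
    combine = {
        "single": min,                              # closest pair of members
        "complete": max,                            # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),    # mean over all member pairs
    }[linkage]
    clusters = [(i,) for i in range(len(points))]   # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                ds = [dist(points[i], points[j])
                      for i in clusters[x] for j in clusters[y]]
                height = combine(ds)
                if best is None or height < best[0]:
                    best = (height, x, y)
        height, x, y = best
        merges.append((clusters[x], clusters[y], height))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 0), (5, 1.5)]
for linkage in ("single", "average", "complete"):
    heights = [round(h, 3) for _, _, h in agglomerative(points, linkage)]
    print(linkage, heights)
```

The sequence of merge heights is the dendrogram's fusion levels; cutting it at a chosen height yields the partitioning mentioned above.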

21 Hierarchical Clustering: Steps
One way is to treat DMOZ clusters as “queries”: for each selected DMOZ cluster, find the number of “target clusters” in the computerized partitioning, then take the average (Nt). See if Nt < Ntr. If not, either the choice of partitioning or the hierarchical clustering did not perform well…
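A minimal sketch of the averaging step, assuming a "target cluster" of a DMOZ cluster is any computed cluster that contains at least one of its documents. The data structures and the function name are hypothetical. The random-clustering baseline needed for the comparison against Ntr is not sketched here.

```python
def average_target_clusters(dmoz_clusters, computed_labels):
    """Average number of computed clusters hit by each DMOZ cluster.

    dmoz_clusters:   dict mapping a DMOZ category to its list of document ids.
    computed_labels: dict mapping a document id to its computed cluster id.
    """
    counts = []
    for docs in dmoz_clusters.values():
        # Distinct computed clusters that hold at least one document of this category.
        targets = {computed_labels[d] for d in docs if d in computed_labels}
        counts.append(len(targets))
    return sum(counts) / len(counts)


dmoz = {"Arts": ["d1", "d2", "d3"], "Science": ["d4", "d5"]}
computed = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2}
print(average_target_clusters(dmoz, computed))  # 1.5
```

A lower average means the documents of each DMOZ category are spread over fewer computed clusters.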

22 Hierarchical Clustering: Steps
Another way: Compare the two partitions using an index, i.e. C-RAND…

23 Choice of Partition: Outline
Obtain the dendrogram using single linkage, complete linkage, group average linkage, or Ward’s method.

24 Choice of Partition: Outline
How to convert a hierarchical cluster structure into a partition? Visually inspect the dendrogram? Use tools from statistics?

25 Choice of Partition: Inconsistency Coefficient
At each fusion level, calculate the “inconsistency coefficient”, utilizing statistics from the previous fusion levels. Choose the fusion level for which the inconsistency coefficient is at its maximum.
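The selection rule can be illustrated with a simplified variant of the coefficient. The sketch below assumes the coefficient at level i is the fusion height minus the mean of all heights up to and including level i, divided by their standard deviation, with 0 when the deviation is 0. This is an assumption for illustration only; library routines such as SciPy's `scipy.cluster.hierarchy.inconsistent` use a depth-limited definition over the links below each merge, and the slides may use yet another form.

```python
from statistics import mean, pstdev


def inconsistency_coefficients(heights):
    """Simplified inconsistency coefficient per fusion level (assumed definition)."""
    coeffs = []
    for i, h in enumerate(heights):
        window = heights[: i + 1]   # heights of this level and all earlier ones
        spread = pstdev(window)
        coeffs.append((h - mean(window)) / spread if spread > 0 else 0.0)
    return coeffs


def most_inconsistent_level(heights):
    """1-based fusion level with the largest coefficient."""
    coeffs = inconsistency_coefficients(heights)
    return max(range(len(coeffs)), key=coeffs.__getitem__) + 1


# Hypothetical fusion heights: three tight merges, then a large jump.
heights = [1.0, 1.0, 1.0, 5.0, 5.1]
print([round(c, 2) for c in inconsistency_coefficients(heights)])
print(most_inconsistent_level(heights))  # 4
```

Cutting the dendrogram just below the height of the selected level keeps every merge made before the jump and rejects the inconsistent one.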

26 Choice of Partition: Inconsistency Coefficient
Inconsistency coefficient (I.C.) at fusion level i:

27 Choice of Partition: I.C. Hands on, Objects
Plot of the objects. Distance measure: Euclidean distance.

28 Choice of Partition: I.C. Hands on, Single Linkage

29 Choice of Partition: I.C. Single Linkage Results
Level 1: 0; Level 2: 0; Level 3: 0; Level 4: 0; Level 5: 0; Level 6: [value missing]; Level 7: [value missing] => Cut the dendrogram at a height between levels 5 and 6.

30 Choice of Partition: I.C. Single Linkage Results

31 Choice of Partition: I.C. Hands on, Average Linkage

32 Choice of Partition: I.C. Average Linkage Results
Level 1: 0; Level 2: 0; Level 3: [value missing]; Level 4: 0; Level 5: [value missing]; Level 6: [value missing]; Level 7: [value missing] => Cut the dendrogram at a height between levels 5 and 6.

33 Choice of Partition: I.C. Hands on, Complete Linkage

34 Choice of Partition: I.C. Complete Linkage Results
Level 1: 0; Level 2: 0; Level 3: [value missing]; Level 4: 0; Level 5: [value missing]; Level 6: [value missing]; Level 7: [value missing] => Cut the dendrogram at a height between levels 5 and 6.

35 Conclusion
Our aim is to evaluate clustering techniques on DMOZ data, through analysis of partitioning and hierarchical clustering algorithms. If the experiments are successful, we will apply the same experiments to larger DMOZ data after we download it. Otherwise, we will try other methodologies to improve our experimental results.

36 References
www.dmoz.org
[1] Jain A. K., Dubes R. C. Algorithms for Clustering Data. Prentice Hall, 1988.
[2] Korenius T., Laurikkala J., Juhola M., Jarvelin K. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006.

