Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab.


1 Evaluation of Bipartite-graph-based Web Page Clustering Shim Wonbo M1 Chikayama-Taura Lab

2 Background

3 Open Directory Project
Used by Google, Lycos, etc.
Categorizing Web pages by hand
- Accurate
- Slow to update
- Unscalable

4 World Wide Web
Rapid increase (= # of clusters changes)
Updated daily (= cluster centers move)
Due to these two properties of the Web:
- A Web page clustering system that needs no human effort is required.

5 Purpose
Constructing a Web page clustering system which
- finds clusters without human help
- is scalable
- clusters Web pages at high speed
- clusters Web pages accurately

6 Agenda
Introduction, Related Work, Proposal, Comparison, Conclusion

7 Clustering Algorithm
Text-based clustering
- Uses words as features
- Generally used algorithm
Link-based clustering
- Focuses on link structure
- Used especially for clustering Web pages

8 k-means Algorithm
Example with k = 3
Each point is the vector representation of a document.
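The k-means idea on this slide can be sketched in a few lines. This is a minimal illustration, not the system evaluated in the talk: the helper names `sq_dist`, `mean`, and `kmeans` are hypothetical, documents are assumed to be already encoded as numeric tuples, and the first k points are used as initial centers (an assumption made to keep the example deterministic).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def kmeans(points, k, iters=20):
    """Lloyd-style k-means; the first k points are the initial centers (assumption)."""
    centers = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


points = [(0, 0), (10, 10), (20, 0), (0, 1), (10, 11), (20, 1), (1, 0), (11, 10), (21, 0)]
centers, clusters = kmeans(points, k=3)
```

Note that k has to be supplied by the caller, which is the first of the problems raised on the next slide.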

9 Problems of k-means Algorithm
The appropriate k depends on the data set.
Outliers strongly affect the clustering result.

10 Hierarchical Clustering
BIRCH [Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]

11 Hierarchical Clustering
The # of clusters can be determined by a condition.
Clustering a large number of points (pages) results in many I/O accesses.

12 Use of Link Structure
Web pages include not only text but also links.
People link Web pages to other related pages.
Linked Web pages may share the same topic.

13 Extraction of Web Community based on Link Analysis
An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]

14 Terminology
Fans and Centers
Bipartite Graph
- Complete BG (CBG)
- Dense BG (DBG)
[Figure: fans and centers in (a) a CBG and (b) a DBG, with parameters p and q]

15 An Approach to Find Related Communities Based on Bipartite Graphs
Definition: The set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where
- T: fans
- I: centers
- p: minimum # of out-links from each fan to the centers
- q: minimum # of in-links to each center from the fans
[Figure: example DBG(T, I, 2, 3)]
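The definition can be read as a simple degree check. The sketch below assumes p is a lower bound on each fan's out-links into I and q is a lower bound on each center's in-links from T; the function name `is_dbg` and the edge-set representation are hypothetical choices for illustration.

```python
def is_dbg(fans, centers, links, p, q):
    """Check the DBG(T, I, p, q) condition.

    fans (T) and centers (I) are sets of page IDs; links is a set of
    (source, target) pairs. Assumes p and q are minimum degrees.
    """
    # Every fan must link to at least p centers.
    fans_ok = all(sum((f, c) in links for c in centers) >= p for f in fans)
    # Every center must be linked from at least q fans.
    centers_ok = all(sum((f, c) in links for f in fans) >= q for c in centers)
    return fans_ok and centers_ok


T = {"a", "b", "c"}
I = {"x", "y"}
complete = {(f, c) for f in T for c in I}   # a complete bipartite graph
sparser = complete - {("c", "y")}           # one link removed

dense_2_3 = is_dbg(T, I, complete, p=2, q=3)   # True
broken_2_3 = is_dbg(T, I, sparser, p=2, q=3)   # False
loose_1_2 = is_dbg(T, I, sparser, p=1, q=2)    # True
```

A complete bipartite graph always passes for p up to |I| and q up to |T|; a dense one only needs to meet the thresholds.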

16 DBG Extraction Algorithm (pt = 2, qt = 3)
1. Gathering related nodes (threshold = 1)

17 DBG Extraction Algorithm (pt = 2, qt = 3)
2. Extracting a DBG
[Figure: graph with a numeric label on each node]
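The slides name the extraction step without spelling it out. One plausible reading, shown here purely as an assumption, is iterative pruning: repeatedly drop fans with fewer than pt out-links and centers with fewer than qt in-links until nothing changes. The function name `extract_dbg` is hypothetical, and the node-gathering step from the previous slide is not modeled.

```python
def extract_dbg(fans, centers, links, pt, qt):
    """Prune candidate fans/centers until the DBG thresholds hold (assumed rule)."""
    fans, centers = set(fans), set(centers)
    changed = True
    while changed:
        changed = False
        # Drop fans that link to fewer than pt of the remaining centers.
        weak_fans = {f for f in fans
                     if sum((f, c) in links for c in centers) < pt}
        if weak_fans:
            fans -= weak_fans
            changed = True
        # Drop centers linked from fewer than qt of the remaining fans.
        weak_centers = {c for c in centers
                        if sum((f, c) in links for f in fans) < qt}
        if weak_centers:
            centers -= weak_centers
            changed = True
    return fans, centers


links = {("a", "x"), ("a", "y"), ("a", "z"),
         ("b", "x"), ("b", "y"),
         ("c", "x"), ("c", "y"),
         ("d", "z")}
fans, centers = extract_dbg({"a", "b", "c", "d"}, {"x", "y", "z"}, links, pt=2, qt=3)
```

In this toy graph, fan d and center z are pruned, leaving fans {a, b, c} and centers {x, y}. Each pass only scans the candidate links, which is in the spirit of the O(# of links) cost claimed on the next slide, although this naive version rescans on every pass.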

18 DBG-based Web Community
Pro: High speed (O(# of links))
Pro: Finds topics across the Web
Con: May extract groups of unrelated Web pages

19 Comparison
Text-based clustering
- Accurate
- Difficult to determine the center of a cluster
Community topology based on DBG
- Inaccurate
- Can be used for topic selection
[Figure labels: Refined Web Community, Center of Cluster]

20 Agenda
Introduction, Related Work, Proposal, Comparison, Conclusion

21 Proposal
1. Extract DBGs through link analysis
2. Refine communities and fix centers with DBSCAN
3. Assign the remaining pages to the nearest center

22 Community Extraction
Extract DBGs from the Web graph
- A page may not be included in more than one Web community
[Figure: Web Graph]

23 Cluster Center Refinement
Find meaningful page sets
1. Does each DBG really have a topic?
2. Is there any page in the community that is not related to the topic?
Feature: terms of extracted pages
DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]

24 DBSCAN
Parameters: radius r, minimum number of points (minP) m
[Figure labels: r, Core, Density reachable, Community (Center of cluster)]
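For readers unfamiliar with DBSCAN, a compact sketch follows. It is a textbook-style version under stated assumptions, not the implementation used in the talk: pages are assumed to be numeric vectors, the distance is Euclidean, a point counts as its own neighbor, and the names `dbscan`, `r`, and `min_pts` are illustrative.

```python
import math


def dbscan(points, r, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # All points within radius r of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= r]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1          # not a core point; noise for now
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: density reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbj)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, r=1.5, min_pts=3)
```

Here the two dense groups become clusters 0 and 1 and the isolated point is labeled -1. In the proposal, points left as noise would correspond to pages that do not fit the community's topic.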

25 Partitioning Remaining Pages
Feature: appearance of terms
1. Calculate the distance between a remaining page and each center
2. If the distance to the nearest center is shorter than the threshold, attach the page to that cluster
3. Otherwise, attach the page to the “Unclassified” cluster
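The three steps above can be sketched as follows. The slide does not say which distance is used, so Euclidean distance is an assumption here, as are the function name `assign_pages` and the dictionary-based inputs.

```python
import math


def assign_pages(pages, centers, threshold):
    """Attach each page to its nearest center, or to "Unclassified".

    pages and centers map names to term vectors of equal length.
    Euclidean distance is an assumption; the slides do not specify one.
    """
    result = {}
    for name, vec in pages.items():
        # Step 1: distance from the page to every cluster center.
        nearest = min(centers, key=lambda c: math.dist(vec, centers[c]))
        # Steps 2 and 3: accept the nearest center only if it is close enough.
        if math.dist(vec, centers[nearest]) < threshold:
            result[name] = nearest
        else:
            result[name] = "Unclassified"
    return result


centers = {"A": (0, 0), "B": (10, 0)}
pages = {"p1": (1, 0), "p2": (9, 1), "p3": (5, 20)}
assignment = assign_pages(pages, centers, threshold=3)
```

With these toy vectors, p1 goes to A, p2 goes to B, and p3 is too far from both centers and ends up in "Unclassified".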

26 Agenda
Introduction, Related Work, Proposal, Experimental Result, Conclusion

27 Target
Seed: 3,000 pages categorized under Computer/Software by ODP
70,000 pages within 2 hops of the seed pages

28 Preprocess
Word ID
- Use the words of a dictionary as base vectors
- Assign the same ID to words sharing the same derivation
- Add terms which appear in many documents (IDF <= 8)
- Total: 29,347
Link Extraction
- Eliminate links to pages which were not collected.
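The IDF cutoff can be illustrated with a small sketch. The slide does not give the IDF formula, so this assumes the common form log2(N / df), where N is the number of documents and df is the number of documents containing the term; the names `idf` and `frequent_terms` are hypothetical.

```python
import math


def idf(term, docs):
    """Inverse document frequency, assumed here to be log2(N / df)."""
    df = sum(term in d for d in docs)
    return math.log2(len(docs) / df)


def frequent_terms(docs, max_idf):
    """Terms whose IDF is at most max_idf, i.e. terms that appear in many documents."""
    vocab = set().union(*docs)
    return {t for t in vocab if idf(t, docs) <= max_idf}


# Each document is modeled as the set of terms it contains.
docs = [{"web", "page"}, {"web", "link"}, {"web", "graph"}, {"web", "page"}]
kept = frequent_terms(docs, max_idf=1)
```

In this toy corpus, "web" has IDF 0 and "page" has IDF 1, so both are kept, while "link" and "graph" (IDF 2) are dropped. A low IDF means a term is common, which is why the slide uses an upper bound.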

29 # Communities

30 # Community Members (pt=3, qt=3)

31 # Community Members

32 Variance of Terms

33 After DBSCAN

34 Conclusion

35 Future Work
Applying to larger data sets
- This may need parallel processing
Analyzing with

