Web Document Clustering By Sang-Cheol Seok


1 Web Document Clustering By Sang-Cheol Seok

2 1. Introduction: Web document clustering? Why?
Compare two results for the same query ‘amazon’:
- Google: currently the most powerful search engine.
- Metacrawler: a search engine which clusters retrieved web documents.

3 2. Approaches
- Using contents of documents
- Using user’s usage logs
- Using current search engines
- Using hyperlinks
- Other classical methods

4 (1) Using Contents of Documents
- Create clusters based on the snippets returned by web search engines.
- Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
- Suffix Tree Clustering (STC): an incremental, O(n) time algorithm.
- STC has three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters.
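The three STC steps can be illustrated with a rough Python sketch. This is not the original STC implementation: a plain dictionary of word n-grams stands in for the suffix tree (so it is not O(n)), and the function names, the `max_len` and `min_docs` parameters, and the 0.5 overlap threshold for merging are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def clean(snippet):
    """Step 1: lowercase the snippet and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", snippet.lower())


def base_clusters(snippets, max_len=3, min_docs=2):
    """Step 2: map each shared phrase to the set of snippets containing it.

    A real suffix tree finds shared phrases in linear time; this n-gram
    dictionary is a simpler stand-in used only for illustration.
    """
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # Keep only phrases shared by at least min_docs snippets.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


def merge_clusters(base, threshold=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path compression.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(base[a] & base[b])
            # Assumed rule: merge when the overlap exceeds the threshold
            # relative to both base clusters.
            if (common / len(base[a]) > threshold
                    and common / len(base[b]) > threshold):
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= base[p]
    return sorted(sorted(docs) for docs in merged.values())
```

Because a snippet can share phrases with several groups, the resulting clusters may overlap, which is one of the requirements listed later in the slides.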

5 (2) Using user’s usage logs
- Advantage: relevancy information is objectively reflected by the usage logs.
- An experimental result on www.nasa.gov/:
  Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b, …
  Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images, …
  Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm, …
  …
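The slide does not say how the usage logs were turned into clusters. One plausible reading, sketched below purely as an assumption, is to group pages that tend to be visited in the same sessions. The function name `cluster_by_usage`, the Jaccard measure over session sets, and the 0.5 threshold are illustrative choices, not details from the presentation.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_usage(sessions, threshold=0.5):
    """Group URLs that are visited by largely the same sessions.

    sessions: list of sessions, each a list of URLs taken from the log.
    """
    # Record which sessions visited each URL.
    visitors = defaultdict(set)
    for session_id, urls in enumerate(sessions):
        for url in urls:
            visitors[url].add(session_id)

    urls = sorted(visitors)
    cluster_of = {u: {u} for u in urls}
    for a, b in combinations(urls, 2):
        shared = len(visitors[a] & visitors[b])
        total = len(visitors[a] | visitors[b])
        # Single-link style merge when co-visitation is high enough.
        if shared / total >= threshold and cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for u in merged:
                cluster_of[u] = merged

    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)
```

With a log in which visitors to one shuttle page usually also open another shuttle page, the two pages end up in the same cluster, which matches the path-like grouping shown in the NASA result.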

6 (3) Using current web search engines: Metacrawler
- Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.
- Step 2: It performs sophisticated pruning on the responses returned (pruning 75% of them as irrelevant, outdated, or unavailable).
- Metacrawler at U. of Washington
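The two steps can be sketched as follows. This is only an outline of the idea, not MetaCrawler's actual code: each search engine is modelled as a hypothetical Python callable that takes a query and returns a list of URLs, and `is_relevant` is a hypothetical predicate standing in for the real pruning rules.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, engines, is_relevant):
    """Post a query to several engines in parallel, then prune the results.

    engines: non-empty list of callables, each mapping a query to URLs.
    is_relevant: callable deciding whether a URL survives pruning.
    """
    # Step 1: send the query to every engine at the same time.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: prune duplicates and responses the predicate rejects.
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen and is_relevant(url):
                seen.add(url)
                merged.append(url)
    return merged
```

Threads suit this step because each engine call mostly waits on the network; a real pruning predicate would also have to check whether a page is outdated or unavailable.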

7 (4) Using hyperlinks
- Consider web documents as vertices and the hyperlinks as directed edges in a directed graph.
- A similarity-based clustering method was successfully used in image segmentation.
- Kleinberg’s HITS algorithm is based purely on hyperlink information.
- It finds authority and hub documents for a user query.
- It only covers the most popular topics and leaves out the less popular ones.
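The hub and authority idea behind HITS can be shown with a minimal sketch. The graph representation (a dictionary from page to the pages it links to), the fixed iteration count, and the function name `hits` are assumptions for illustration; the update rule itself is the standard one: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores on a directed link graph.

    links: dict mapping each page to a list of pages it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub update: sum the authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}

    return hub, auth
```

On a graph where two pages both link to a third, the third page gets the top authority score and the two linking pages share the top hub score. The scores concentrate on the most densely linked part of the graph, which is consistent with the slide's remark that less popular topics are left out.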

8 (4) Using Hyperlinks: continued
- Cluster web documents based on both textual and hyperlink information.
- The hyperlink structure is used as the dominant factor in the similarity metric.
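One way to combine the two signals is a weighted sum in which the link term carries more weight. The slide gives no formula, so everything below is an assumption: cosine similarity over word counts for text, Jaccard similarity over out-links for hyperlinks, the document layout (a dict with `text` and `links` keys), and the 0.7 link weight.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(links_a, links_b):
    """Share of linked pages that two documents have in common."""
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_similarity(doc_a, doc_b, link_weight=0.7):
    """Weighted sum of link and text similarity; links dominate by default."""
    text_sim = cosine(doc_a["text"], doc_b["text"])
    link_sim = jaccard(doc_a["links"], doc_b["links"])
    return link_weight * link_sim + (1 - link_weight) * text_sim
```

Any `link_weight` above 0.5 makes the hyperlink structure the dominant factor; the resulting score can then be fed to whichever clustering method is used.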

9 (5) Other classical clustering methods
- K-means method
- HAC (hierarchical agglomerative clustering)
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Single-link, group-average, complete-link, and single-pass methods, as well as Buckshot and Fractionation, have also been used.
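As a reminder of how the first method on this list works, here is a minimal K-means sketch. It assumes documents have already been turned into numeric vectors, uses the first k points as starting centroids for determinism (real implementations usually pick them at random), and runs a fixed number of iterations; those choices and the function name are illustrative assumptions.

```python
def kmeans(points, k, iterations=20):
    """Assign each point to the nearest of k centroids, then re-centre.

    points: list of equal-length numeric tuples (document vectors).
    Returns the cluster index of each point and the final centroids.
    """
    # Assumed initialisation: the first k points serve as centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: pick the closest centroid by squared distance.
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assignment, centroids
```

Each point ends up in exactly one cluster, so K-means on its own does not satisfy the overlap requirement discussed in the next section.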

10 3. Key requirements and future challenges
(1) Key requirements for Web document clustering methods
- Relevance
- Browsable summaries
- Overlap
- Speed
- Incrementality, for some methods

11 3. Key requirements and future challenges: continued
(2) Concerns on current methods
- Each method has pros and cons.
- Using hyperlinks: the best accuracy, with still some room to improve, but its clusters do not overlap.
- STC: best for browsing and for incrementality.
- Metacrawler: best at pruning.

12 3. Key requirements and future challenges: continued
Future challenges
- We cannot take advantage of all the pros of every method at once.
- Some pros work against other pros, so we have to make trade-offs.
- Moreover, we need to find improvements.

