Web Page Clustering using Heuristic Search in the Web Graph IJCAI 07.

2 Motivation - 1/2
The reasons for clustering search results are two-fold:
–Cluster hypothesis: similar documents tend to be relevant to the same requests
–The ranked list is usually too large and contains many irrelevant documents
Successful academic and industrial systems (e.g. vivisimo.com):
–Organize search results into groups (clusters)
–Topical similarity

3 Motivation - 2/2
Clustering problem:
–There is not enough contextual information on a page
  For example: savethejaguar.com
–Web sites are contextually different but actually refer to the same meaning of the query
  Michel Décary:
  –a computer scientist (www.zoominfo.com/MichelDecary),
  –a lawyer (www.stikeman.com/cgi-bin/profile.cfm?P ID=366),
  –and a chansonnier (www.decary.com).

4 Introduction - 1/3
Thematic locality of the Web graph:
–The Web graph is a directed graph in which nodes are Web pages and edges are hyperlinks
–If page A links to page B, pages A and B are likely to be semantically close
–For example, Michel Décary:
  –a computer scientist (www.zoominfo.com/MichelDecary),
  –and a chansonnier (www.decary.com)
  –cogilex.com

5 Introduction - 2/3
Heuristic search:
–The goal is to collect as much useful information as possible while crawling the Web
–A heuristic estimates the amount of information available in a particular Web sub-graph
–This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node. The heuristics are not meant to reduce the search time but to improve the search accuracy.
–Heuristics are used as filters to prune branches of search trees that are likely to establish undesired connections between unrelated Web pages

6 Introduction - 3/3
Multi-agent system:
–Given n Web pages in the ranked list
–n collaborative Web agents
–Initial dataset: each agent is assigned one page
Each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible.
Two applications:
–Web appearance disambiguation
–Search result clustering

7 Multi-agent heuristic search
Two multi-agent heuristic search algorithms
–Sequential Heuristic Search (SHS)
  Frontier:
  –a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)
  Filter: (described later)
  Initialize: [algorithm listing not preserved in the transcript]
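The frontier and filter ideas on this slide can be illustrated with a small sketch. It is not the SHS algorithm from the paper. It assumes each agent does a breadth-first expansion over a toy in-memory graph for a fixed number of hops, and that two agents "meet" when they reach a common URL. `WEB_GRAPH`, `expand_agent`, `agents_that_meet`, `max_hops`, and `keep` are hypothetical names introduced only for illustration.

```python
from itertools import combinations

# Toy Web graph (hypothetical data): URL -> list of URLs it links to.
WEB_GRAPH = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["other.example"],
    "hub.example": [],
    "other.example": [],
}


def expand_agent(source, graph, max_hops=3, keep=lambda url: True):
    """Return every URL one agent reaches from its source page.

    The frontier initially holds only the source URL, as on the slide.
    `keep` is a stand-in filter: URLs it rejects are pruned, not expanded.
    """
    visited = {source}
    frontier = [source]
    for _ in range(max_hops):
        next_frontier = []
        for url in frontier:
            for target in graph.get(url, []):
                if target not in visited and keep(target):
                    visited.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return visited


def agents_that_meet(sources, graph, max_hops=3, keep=lambda url: True):
    """Return pairs of source pages whose agents reached a common URL."""
    reached = {s: expand_agent(s, graph, max_hops, keep) for s in sources}
    return {
        (s, t)
        for s, t in combinations(sorted(sources), 2)
        if reached[s] & reached[t]
    }
```

With the toy graph, the agents for `a.example` and `b.example` meet at `hub.example`. Passing a `keep` predicate that rejects `hub.example` removes that connection, which is the pruning role the slides assign to the heuristics.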

8 Multi-agent heuristic search
The SHS algorithm
–simple and intuitive
One crucial drawback
–there is no way to control the topology of the constructed clusters
–In the worst case, if Page A --> Page B, Page B --> Page C, and Page C --> Page D, pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak
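The chaining problem can be reproduced in a few lines. The sketch below assumes, as a simplification, that clusters are the connected components of the discovered links, with direction ignored. It uses a small union-find structure; `cluster_by_links` is a hypothetical helper, not a routine from the paper.

```python
def cluster_by_links(links):
    """Group pages into connected components, ignoring link direction."""
    parent = {}

    def find(page):
        # Register unseen pages as their own root, then walk up to the root.
        parent.setdefault(page, page)
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    for src, dst in links:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # merge the two components

    clusters = {}
    for page in list(parent):
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())


# The chain from the slide: A --> B, B --> C, C --> D.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
clusters = cluster_by_links(chain)
```

Under this assumption the chain yields a single cluster containing A, B, C, and D, even though A and D are three hops apart and never link to each other directly.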

9 Multi-agent heuristic search Incremental Heuristic Search (IHS)

10 Heuristics - 1/2
Two heuristics
–Topology-driven: high-degree node elimination
  –Remove high out-degree pages and high in-degree pages
–Content-driven: person name heuristic

11 Heuristics - 2/2
To detect high in-degree URLs
–Use Google's link: operator
–Threshold: 1000 incoming or outgoing hyperlinks
Person names consist of two, three, or four words
–This heuristic excludes person names that are too common (again, we use Google's link: operator)
In many cases, an entity wrongly tagged as a person name has millions of Google hits. Examples of such tagger errors are Price Range and Mac OS.
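Both heuristics reduce to simple predicates once the counts are known. The sketch below takes link counts and hit counts as plain integers, so how they are obtained is outside its scope. The 1000 threshold comes from the slide. The strict `>` comparison and the `max_hits` cutoff of one million are assumptions chosen for illustration, and both function names are hypothetical.

```python
DEGREE_THRESHOLD = 1000  # in/out hyperlink threshold quoted on the slide


def is_high_degree(in_links, out_links, threshold=DEGREE_THRESHOLD):
    """True if a page should be eliminated as a high-degree node.

    Assumption: a page is dropped when either count exceeds the threshold.
    """
    return in_links > threshold or out_links > threshold


def is_plausible_person_name(entity, hit_count, max_hits=1_000_000):
    """True if a tagged entity passes the person name heuristic.

    The slide states that names have two to four words and that overly
    common entities are excluded; `max_hits` is an assumed cutoff.
    """
    words = entity.split()
    return 2 <= len(words) <= 4 and hit_count < max_hits
```

For example, `is_plausible_person_name("Price Range", 25_000_000)` is rejected because of its hit count, while a one-word entity is rejected because of its length regardless of how rare it is. The hit counts here are made-up inputs.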

12 Datasets - disambiguation dataset
Web appearance disambiguation dataset
–www.cs.umass.edu/~ronb
–It consists of 1085 Web pages retrieved for 12 names of people from Melinda Gervasio's social network (mostly SRI engineers and university professors).
–The dataset is labeled according to the person's occupation.
The process crawled the Web starting with these 1085 pages (source pages):
–7009 pages at the first hop (a hop is one leg of the journey),
–69,454 pages at the second hop
–592,299 pages at the third hop

13 One-Cluster

14 Datasets - Jaguar dataset - 1/2
Problem of clustering Web search results
–Retrieved and labeled the first 100 Google hits obtained for the query jaguar.

15 Datasets - Jaguar dataset - 2/2
Jaguar dataset
–K = 3 (car, Mac OS, and cats)
–883 pages at the first hop
–8548 pages at the second hop
–56,287 pages at the third hop

17
–Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)

18 Conclusion
This paper is the first study of heuristic search in the Web graph.
Heuristic search:
–Viable in the vast domain of the WWW
–Clustering of Web search results
–Web appearance disambiguation

19 Introduction - 4/4
Topological clustering
–Keep only the k largest clusters: a set C of k clusters
–Initial: each document from the original ranked list is placed into one cluster of C', a set of k' > k topical clusters
–For each cluster $c_i \in C$, find its closest cluster $c'_j$ from C': $j = \arg\max_{j'} |c_i \cap c'_{j'}|$
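The matching rule on this slide picks, for each topological cluster, the topical cluster with which it shares the most documents. A minimal sketch, assuming clusters are represented as Python sets of document identifiers; the function name and the sample clusters are made up for illustration.

```python
def closest_topical_cluster(c_i, topical_clusters):
    """Return the index j of the cluster in C' that overlaps c_i the most.

    Implements j = argmax over j' of |c_i ∩ c'_j'|. On ties, the lowest
    index wins, which is a choice of this sketch and not of the slide.
    """
    return max(
        range(len(topical_clusters)),
        key=lambda j: len(c_i & topical_clusters[j]),
    )


# Hypothetical data: k = 2 topological clusters, k' = 3 topical clusters.
C = [{"d1", "d2", "d3"}, {"d7", "d8"}]
C_prime = [{"d1", "d2", "d4"}, {"d3", "d5"}, {"d7", "d8", "d9"}]

matches = [closest_topical_cluster(c_i, C_prime) for c_i in C]
```

Here the first topological cluster shares two documents with the first topical cluster and one with the second, so it is matched to index 0. The second topological cluster is matched to index 2.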
