Online Clustering of Web Search results

1 Online Clustering of Web Search results
Shixian Chu

2 Two papers:
O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.
Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China

3 Introduction…
The current state of information retrieval is far from satisfactory, for several possible reasons:
Many returned pages are useless or irrelevant;
Users may be interested in only a small part of the information returned, while the search engine returns thousands of pages;
Different users have different requirements and expectations for search results;

4 Sometimes search requests cannot be expressed clearly in just a few keywords;
The phenomena of synonymy (several words may correspond to the same concept) and polysemy (one word may have several different meanings) make things more complicated; ......

5 Search results clustering can help to solve some of these problems
Search results can be viewed as a database composed of thousands of documents. All the results are clustered into hierarchical groups, with the “key phrases” serving as the names of the clusters. With hierarchical clusters, users can get an overview of the whole topic, or select only the clusters that interest them and ignore the non-relevant groups.

6 Example… Clustered Search results of query “Jaguar”

7 “Web Document Clustering: A Feasibility Demonstration”
O. Zamir and O. Etzioni.

8 What’s new?
This paper introduces a linear-time (in the size of the document collection) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. The paper reports that STC is faster and more precise than standard clustering methods such as K-means and Buckshot.

9 Key requirements for Web document clustering methods:
Relevance: relevant and irrelevant documents end up in different clusters
Browsable summaries: key phrases that summarize each cluster
Overlap: one document may belong to several clusters
Snippet-tolerance: high-quality clusters even when only the snippets returned by the search engine are available
Speed: high

10 STC has three logical steps:
(1) document “cleaning”, (2) identifying base clusters using a suffix tree, (3) combining these base clusters into clusters.

11 Step 1 - Document "Cleaning"
Deleting word prefixes and suffixes and reducing plural to singular
Marking sentence boundaries
Stripping non-word tokens (such as numbers, HTML tags and most punctuation)
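The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names `clean` and `naive_singular` are hypothetical, the plural rule is a crude stand-in for a real stemmer, and sentence boundaries are approximated by splitting on `.`, `!` and `?`.

```python
import re

def naive_singular(word):
    # Crude stand-in for stemming (an assumption): drop a trailing "s"
    # from longer words, but leave words ending in "ss" alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Return a list of sentences, each a list of cleaned word tokens."""
    text = re.sub(r"<[^>]+>", " ", snippet)        # strip HTML tags
    sentences = []
    for part in re.split(r"[.!?]+", text):         # mark sentence boundaries
        words = re.findall(r"[a-z]+", part.lower())  # drop numbers and punctuation
        words = [naive_singular(w) for w in words]
        if words:
            sentences.append(words)
    return sentences

cleaned = clean("<b>Cats</b> ate 3 cheeses. Mice too!")
```

Each sentence becomes its own word list, so later phrase extraction never produces a phrase that crosses a sentence boundary.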

12 Step 2 - Identifying Base Clusters
We treat documents as strings of words, not characters, so each suffix contains one or more whole words. In more precise terms, for a string S:
1. A suffix tree is a rooted, directed tree.
2. Each internal node has at least 2 children.
3. Each edge is labeled with a non-empty sub-string of S.

13 Step 2 - Identifying Base Clusters
4. No two edges out of the same node can have edge-labels that begin with the same word (hence it is compact).
5. For each suffix s of S, there exists a suffix-node whose label equals s.
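To make "suffixes contain one or more whole words" concrete, here is a small sketch that lists the word-level suffixes of a snippet. The helper name `word_suffixes` is hypothetical and the snippet is assumed to be whitespace-separated.

```python
def word_suffixes(text):
    # Split into words, then emit every suffix that starts at a word boundary.
    words = text.split()
    return [" ".join(words[i:]) for i in range(len(words))]

suffixes = word_suffixes("cat ate cheese")
# suffixes == ["cat ate cheese", "ate cheese", "cheese"]
```

A suffix tree over words stores exactly these suffixes, sharing common leading words along the same path.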

14 Step 2 - Identifying Base Clusters
The following may be the snippets of three search result documents:
Document 1: "cat ate cheese"
Document 2: "mouse ate cheese too"
Document 3: "cat ate mouse too"

15 Step 2 - Identifying Base Clusters
"cat ate cheese", "mouse ate cheese too", "cat ate mouse too"

16 Step 2 - Identifying Base Clusters
All parent (internal) nodes are base clusters: each one stands for a phrase and the set of documents that contain it.
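The base clusters for the three example snippets can be reproduced with the sketch below. It is deliberately not a suffix tree: it enumerates every word-level phrase in quadratic time and keeps those that would correspond to internal nodes, so it illustrates the result rather than the linear-time construction. The function name `base_clusters`, the `("<end>", doc_id)` end marker, and the "at least two documents" filter are assumptions made for this illustration.

```python
from collections import defaultdict

def base_clusters(docs):
    """docs maps a document id to its cleaned snippet (space-separated words)."""
    doc_ids = defaultdict(set)     # phrase -> documents containing it
    followers = defaultdict(set)   # phrase -> distinct tokens seen right after it
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = " ".join(words[start:end])
                doc_ids[phrase].add(doc_id)
                # A per-document end marker plays the role of a unique terminator.
                nxt = words[end] if end < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    # A phrase labels an internal node when it can be continued in at least
    # two different ways; keep those shared by two or more documents.
    return {p: ids for p, ids in doc_ids.items()
            if len(ids) >= 2 and len(followers[p]) >= 2}

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
clusters = base_clusters(docs)
```

For these snippets the result contains six phrases: "cat ate", "ate", "cheese", "mouse", "too" and "ate cheese". The single word "cat" is absent because it is always followed by "ate", so it never labels a node of its own.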

17 Step 2 - Identifying Base Clusters
Each base cluster is assigned a score $s(B) = |B| \cdot f(|P|)$, where |B| is the number of documents in base cluster B, P is the phrase of cluster B, and |P| is the number of words in P that have a non-zero score. In the Zamir and Etzioni paper, the function f penalizes single-word phrases, is linear for phrases of two to six words, and is constant for longer phrases.
We maintain a stoplist that is supplemented with Internet-specific words (e.g., “previous”, “java”, “frames” and “mail”). Words that appear in the stoplist, or that appear in too few (3 or fewer) or too many (more than 80% of the collection) documents, receive a score of zero.
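A rough sketch of the scoring rule, assuming the score has the form |B| · f(|P|) as in the Zamir and Etzioni paper. The concrete values returned by `phrase_factor` (0.5 for one word, the word count capped at 6 otherwise) are assumptions chosen only to show the shape of f, and all function names are hypothetical. The thresholds are the ones quoted on this slide.

```python
def effective_length(phrase, doc_freq, n_docs, stoplist,
                     min_docs=3, max_fraction=0.8):
    """Count the words of the phrase that receive a non-zero score."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        too_rare = df <= min_docs                  # appears in 3 or fewer documents
        too_common = df > max_fraction * n_docs    # appears in more than 80%
        if word in stoplist or too_rare or too_common:
            continue
        count += 1
    return count

def phrase_factor(n_words):
    # Hypothetical f: penalize single words, grow linearly, flatten after 6.
    if n_words == 0:
        return 0.0
    if n_words == 1:
        return 0.5
    return float(min(n_words, 6))

def score(cluster_docs, phrase, doc_freq, n_docs, stoplist):
    words = effective_length(phrase, doc_freq, n_docs, stoplist)
    return len(cluster_docs) * phrase_factor(words)

doc_freq = {"web": 50, "clustering": 20, "java": 30, "rare": 2, "the": 95}
stoplist = {"java"}
s = score(set(range(10)), ("web", "clustering"), doc_freq, 100, stoplist)
```

Here `s` is 20.0: ten documents times a two-word phrase in which both words count.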

18 Step 3 - Combining Base Clusters
Given two base clusters $B_m$ and $B_n$ with sizes $|B_m|$ and $|B_n|$, let $|B_m \cap B_n|$ be the number of documents common to both. Their similarity is
$$\mathrm{sim}(B_m, B_n) = \begin{cases} 1 & \text{if } |B_m \cap B_n| / |B_m| > 0.5 \text{ and } |B_m \cap B_n| / |B_n| > 0.5, \\ 0 & \text{otherwise.} \end{cases}$$
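The similarity test and the merge it drives can be sketched as below. The slide text only gives the binary similarity; grouping similar base clusters into connected components follows the Zamir and Etzioni paper, and the union-find bookkeeping and the names `similar` and `merge_base_clusters` are illustrative choices. Base clusters are assumed to be non-empty sets of document ids.

```python
def similar(bm, bn, threshold=0.5):
    # Binary similarity: both overlap ratios must exceed the threshold.
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold

def merge_base_clusters(base):
    """base maps a phrase to its set of document ids; returns merged doc sets."""
    names = list(base)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Link every pair of similar base clusters (edges of the base cluster graph).
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar(base[a], base[b]):
                parent[find(a)] = find(b)

    # Each connected component becomes one final cluster.
    merged = {}
    for name in names:
        merged.setdefault(find(name), set()).update(base[name])
    return list(merged.values())

base = {"cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
        "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2}}
final = merge_base_clusters(base)
```

With the six base clusters from the cat/mouse/cheese example, "ate" is similar to every other base cluster, so everything collapses into one cluster covering documents 1, 2 and 3. Note that "cat ate" and "cheese" are not directly similar, because their overlap ratio is exactly 0.5 and the test is strict.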

19 Step 3 - Combining Base Clusters

20 Step 3 - Combining Base Clusters

21 Experiments

22 Experiments

23 “Semantic, Hierarchical, Online Clustering of Web Search Results”
Dell Zhang and Yisheng Dong.

24 What’s new?
A document or snippet is treated as a string of characters, not as a string of words
Web search results are grouped semantically
Works not only for English but also for oriental languages such as Chinese

25 Step 1 - Document "Cleaning"
Deleting word prefixes and suffixes and reducing plural to singular
Marking sentence boundaries
Stripping non-word tokens (such as numbers, HTML tags and most punctuation)

26 Step 2 – Key phrase extraction
Extract phrases of high (1) “completeness”, (2) “stability”, and (3) “significance” as key phrases.

27 DEFINITION: Completeness
Suppose phrase S occurs in k distinct positions p1, p2, …, pk in document D. S is “complete” if and only if it is both left-complete and right-complete:
Left-complete: the (pi−1)th token in D is different from the (pj−1)th token for at least one (i, j) pair, 1≤i<j≤k.
Right-complete: the (pi+|S|)th token is different from the (pj+|S|)th token for at least one (i, j) pair, 1≤i<j≤k.
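The completeness test translates fairly directly into code. The sketch below treats the document as a string of characters; the names `occurrences` and `is_complete` are hypothetical, and treating a missing neighbor at the start or end of the document as its own distinct token (`None`) is an assumption the definition does not spell out.

```python
def occurrences(tokens, phrase):
    # All start positions where the phrase occurs in the token sequence.
    n, m = len(tokens), len(phrase)
    return [i for i in range(n - m + 1) if tokens[i:i + m] == phrase]

def is_complete(tokens, phrase):
    positions = occurrences(tokens, phrase)
    n, m = len(tokens), len(phrase)
    # Tokens immediately before and after each occurrence.
    left = {tokens[p - 1] if p > 0 else None for p in positions}
    right = {tokens[p + m] if p + m < n else None for p in positions}
    # "Different for at least one pair" means at least two distinct neighbors.
    return len(left) >= 2 and len(right) >= 2

doc = "xabyzabw"
ab_complete = is_complete(doc, "ab")   # preceded by x/z, followed by y/w
a_complete = is_complete(doc, "a")     # always followed by "b"
```

In this example "ab" is complete, while "a" is left-complete but not right-complete, since every occurrence is followed by "b". Intuitively, "a" is only a fragment of the longer phrase "ab".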

28 DEFINITION: Stability

29 DEFINITION: Significance

30

31

32 Suffix array: result of Step 2
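For readers unfamiliar with the data structure, a suffix array is the list of suffix start positions sorted by the suffixes they begin. The sketch below builds one by plain sorting, which is simple but far slower than the construction algorithms a real system would use. The names `suffix_array` and `adjacent_lcp` and the "banana" input are illustrative assumptions, not taken from the slides.

```python
def suffix_array(text):
    # Sort suffix start positions by the suffix text itself (naive construction).
    return sorted(range(len(text)), key=lambda i: text[i:])

def adjacent_lcp(text, sa):
    # Length of the common prefix between each pair of neighboring suffixes.
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

sa = suffix_array("banana")      # [5, 3, 1, 0, 4, 2]
lcps = adjacent_lcp("banana", sa)  # [1, 3, 0, 0, 2]
```

Because sorting places suffixes with a shared prefix next to each other, repeated substrings show up as runs of neighbors with a non-zero common prefix length, which is what makes the structure useful for finding candidate phrases.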

33 Step 3 – Organizing Clusters

34 X threshold=0.5, y threshold=0.15

35 Thank you

