Online Clustering of Web Search results

Online Clustering of Web Search Results
Shixian Chu

Two papers:
1. O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.
2. Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, Apr 2004.

Introduction… The current state of information retrieval is far from satisfactory, for several possible reasons:
Many returned pages are useless or irrelevant;
Users may be interested in only a small part of the information returned, while the search engine returns thousands of pages;
Different users have different requirements and expectations for search results;
Sometimes a search request cannot be expressed clearly in just a few keywords;
Synonymy (several words may correspond to the same concept) and polysemy (one word may have several different meanings) make things more complicated;
...

Search results clustering can help to solve some of these problems. Search results can be viewed as a database composed of thousands of documents. All the results are clustered into hierarchical groups, with the “key phrases” as the names of the clusters. With hierarchical clusters, users can get an overview of the whole topic, or select only the clusters that interest them and ignore the non-relevant groups.

Example… Clustered Search results of query “Jaguar”

“Web Document Clustering: A Feasibility Demonstration” O. Zamir and O. Etzioni.

What’s new? This paper introduces a linear-time (in the size of the document collection) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. The authors report that STC is faster and more precise than standard clustering methods such as K-means and Buckshot.

Key requirements for Web document clustering methods:
Relevance: relevant and irrelevant documents end up in different clusters
Browsable summaries: key phrases that summarize each cluster
Overlap: one document may belong to several clusters
Snippet-tolerance: produce high-quality clusters even with access only to the snippets returned by the search engines
Speed: fast enough to cluster results online

STC has three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, (3) combining these base clusters into clusters.

Step 1 - Document "Cleaning"
Deleting word prefixes and suffixes and reducing plural to singular
Marking sentence boundaries
Stripping non-word tokens (such as numbers, HTML tags and most punctuation)
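The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the stemmer used in the paper: the function names `clean` and `naive_singular`, the ASCII-only word pattern, and the crude plural rule are all assumptions made for the example.

```python
import re

def naive_singular(word):
    # Hypothetical stand-in for a real stemmer: drop one trailing "s"
    # from longer words, leaving words such as "class" untouched.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Return a list of sentences, each a list of lower-cased, reduced words."""
    text = re.sub(r"<[^>]+>", " ", snippet)       # strip HTML tags
    sentences = re.split(r"[.!?;:]+", text)       # mark sentence boundaries
    cleaned = []
    for sentence in sentences:
        # Keep alphabetic tokens only; numbers and punctuation are dropped.
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        words = [naive_singular(w) for w in words]
        if words:
            cleaned.append(words)
    return cleaned

print(clean("<b>Cats</b> ate 2 cheeses. Mice too!"))
# [['cat', 'ate', 'cheese'], ['mice', 'too']]
```

Keeping sentences as separate word lists preserves the sentence boundaries, so that later phrase matching does not cross them.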

Step 2 - Identifying Base Clusters We treat documents as strings of words, not characters, so each suffix consists of one or more whole words. In more precise terms, the suffix tree of a string S satisfies:
1. A suffix tree is a rooted, directed tree.
2. Each internal node has at least 2 children.
3. Each edge is labeled with a non-empty sub-string of S.
4. No two edges out of the same node can have edge-labels that begin with the same word (hence it is compact).
5. For each suffix s of S, there exists a suffix-node whose label equals s.

Step 2 - Identifying Base Clusters The following may be the snippets of three search result documents:
Document 1: "cat ate cheese"
Document 2: "mouse ate cheese too"
Document 3: "cat ate mouse too"

Step 2 - Identifying Base Clusters [Figure: the suffix tree of the strings "cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]

Step 2 - Identifying Base Clusters All parent (internal) nodes are base clusters: each one stands for a phrase together with the set of documents that share it.
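To make the base clusters concrete, the sketch below recovers them for the three example snippets. It deliberately does not build a suffix tree: it enumerates every word phrase with a dictionary (quadratic in snippet length, not linear) and keeps the phrases that would label an internal node, meaning phrases found in at least two documents and followed by at least two different continuations. The function name `base_clusters` and the per-document end marker are assumptions of this illustration.

```python
from collections import defaultdict

def base_clusters(docs):
    """Map each shared, branching word phrase to the ids of documents containing it."""
    docs_with = defaultdict(set)   # phrase -> ids of documents containing it
    followers = defaultdict(set)   # phrase -> distinct tokens seen right after it
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = tuple(words[i:j])
                docs_with[phrase].add(doc_id)
                # A per-document end marker plays the role of a unique terminator.
                nxt = words[j] if j < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    return {
        " ".join(phrase): ids
        for phrase, ids in docs_with.items()
        if len(ids) >= 2 and len(followers[phrase]) >= 2
    }

snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in sorted(base_clusters(snippets).items()):
    print(phrase, sorted(ids))
```

For these snippets the sketch yields six base clusters: "ate" (documents 1, 2, 3), "ate cheese" (1, 2), "cat ate" (1, 3), "cheese" (1, 2), "mouse" (2, 3) and "too" (2, 3). The phrase "cat" is not among them because it is always followed by "ate", so it lies inside the edge leading to "cat ate".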

Step 2 - Identifying Base Clusters Each base cluster is assigned a score s(B) = |B| · f(|P|), where |B| is the number of documents in base cluster B, P is the phrase of cluster B, and |P| is the number of words in P that have a non-zero score. The function f penalizes single-word phrases, is linear for phrases two to six words long, and is constant for longer phrases. We maintain a stoplist that is supplemented with Internet-specific words (e.g., “previous”, “java”, “frames” and “mail”). Words that appear in the stoplist, or that appear in too few (3 or less) or too many (more than 80% of the collection) documents, receive a score of zero.
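A minimal sketch of the scoring rule, assuming the score has the form s(B) = |B| · f(|P|) as in Zamir and Etzioni's paper. The concrete values of `phrase_factor` (0.5 for a single word, capped at 6) are assumptions chosen only to show the shape of f; the names `score`, `doc_freq`, `min_df` and `max_ratio` are likewise hypothetical.

```python
def phrase_factor(n_words):
    # Hypothetical f: penalize single words, grow linearly from 2 to 6
    # words, and stay constant beyond that. The exact values are assumed.
    if n_words == 0:
        return 0.0
    if n_words == 1:
        return 0.5
    return float(min(n_words, 6))

def score(base_docs, phrase, doc_freq, n_docs, stoplist, min_df=4, max_ratio=0.8):
    """Score a base cluster: |B| times f(number of phrase words that count)."""
    effective = [
        w for w in phrase
        if w not in stoplist
        and doc_freq.get(w, 0) >= min_df             # not in too few (3 or less) documents
        and doc_freq.get(w, 0) <= max_ratio * n_docs  # not in more than 80% of documents
    ]
    return len(base_docs) * phrase_factor(len(effective))

doc_freq = {"web": 50, "search": 30, "java": 20, "the": 95, "rare": 2}
print(score(set(range(10)), ["web", "search", "java"], doc_freq, 100, {"java"}))
# 20.0
```

In the usage line, "java" is on the stoplist, so only two words count and the cluster of 10 documents scores 10 · f(2).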

Step 3 - Combining Base Clusters Given two base clusters Bm and Bn, with sizes |Bm| and |Bn|, and with |Bm∩Bn| representing the number of documents common to both base clusters:
Similarity of Bm and Bn = 1 if |Bm∩Bn|/|Bm| > 0.5 and |Bm∩Bn|/|Bn| > 0.5
Similarity of Bm and Bn = 0 otherwise
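The merging step can be sketched as below. In Zamir and Etzioni's paper the final clusters are the connected components of the graph whose nodes are base clusters and whose edges join base clusters with similarity 1; the union-find bookkeeping and the names `similar` and `merge_base_clusters` are choices made for this illustration only.

```python
from itertools import combinations

def similar(bm, bn, threshold=0.5):
    """Binary similarity: both overlap ratios must exceed the threshold."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold

def merge_base_clusters(base):
    """base: dict phrase -> non-empty set of doc ids. Returns [(phrases, docs), ...]."""
    names = list(base)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the component representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for m, n in combinations(names, 2):
        if similar(base[m], base[n]):
            parent[find(m)] = find(n)    # join the two components

    groups = {}
    for n in names:
        phrases, docs = groups.setdefault(find(n), ([], set()))
        phrases.append(n)
        docs |= base[n]                  # a merged cluster is the union of its documents
    return list(groups.values())

base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
print(len(merge_base_clusters(base)))
# 1
```

With the six base clusters of the cat/mouse example, every two-document base cluster overlaps "ate" by more than half on both sides, so all of them fall into a single merged cluster containing documents 1, 2 and 3.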

Experiments

“Semantic, Hierarchical, Online Clustering of Web Search Results” Dell Zhang and Yisheng Dong.

What’s new?
A document or snippet is treated as a string of characters, not as a string of words
Web search results are grouped semantically
Works not only for English but also for oriental languages like Chinese

Step 1 - Document "Cleaning"
Deleting word prefixes and suffixes and reducing plural to singular
Marking sentence boundaries
Stripping non-word tokens (such as numbers, HTML tags and most punctuation)

Step 2 – Key phrase extraction Extract phrases of high (1) “completeness”, (2) “stability”, and (3) “significance” as key phrases.

DEFINITION: Completeness Suppose phrase S occurs at k distinct positions p1, p2, …, pk in document D. S is “complete” if and only if the (pi-1)th token in D is different from the (pj-1)th token for at least one (i, j) pair, 1≤i<j≤k (called “left-complete”), and the (pi+|S|)th token is different from the (pj+|S|)th token for at least one (i, j) pair, 1≤i<j≤k (called “right-complete”).
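The completeness test can be sketched directly from the definition. Requiring that some pair of occurrences differs on one side is the same as requiring at least two distinct neighbouring tokens on that side. How to treat an occurrence at the very start or end of the document is not stated on the slide; using `None` as a distinct "no token" value is an assumption of this sketch, as are the function names.

```python
def occurrences(tokens, phrase):
    """Start positions of every occurrence of phrase (a list) in tokens (a list)."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def is_complete(tokens, phrase):
    """True if phrase is both left-complete and right-complete in tokens."""
    positions = occurrences(tokens, phrase)
    n = len(phrase)
    # None stands for "no token" at a document boundary (assumed convention).
    left = {tokens[p - 1] if p > 0 else None for p in positions}
    right = {tokens[p + n] if p + n < len(tokens) else None for p in positions}
    # At least one differing pair on each side needs two or more occurrences.
    return len(positions) >= 2 and len(left) >= 2 and len(right) >= 2

tokens = ["w", "cat", "ate", "x", "y", "cat", "ate", "z"]
print(is_complete(tokens, ["cat", "ate"]))  # True
print(is_complete(tokens, ["cat"]))         # False
```

In the usage lines, "cat ate" is preceded by "w" and "y" and followed by "x" and "z", so it is complete, while "cat" alone is always followed by "ate" and therefore is not right-complete.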

DEFINITION: Stability

DEFINITION: Significance

Suffix array: result of Step 2
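For readers unfamiliar with the data structure, here is a small sketch of a suffix array over a string of characters, together with the lengths of the longest common prefixes (LCP) of neighbouring suffixes, which reveal repeated substrings. The construction is the naive sort-all-suffixes approach, chosen for brevity; it is not claimed to be the construction used in the paper, and the function names are illustrative.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """Length of the common prefix of each pair of adjacent suffixes in sa."""
    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [common_prefix(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]

sa = suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [1, 3, 0, 0, 2]
```

An LCP value of 3 between the suffixes "ana" and "anana" shows that "ana" occurs more than once, which is the kind of repeated phrase that key phrase extraction looks for.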

Step 3 – Organizing Clusters

X threshold=0.5, y threshold=0.15

Thank you