Query Type Classification for Web Document Retrieval. In-Ho Kang, GilChang Kim, KAIST. SIGIR 2003.

Presentation transcript:

Query Type Classification for Web Document Retrieval. In-Ho Kang, GilChang Kim, KAIST. SIGIR 2003.

Introduction
- Different strategies are needed to find the target documents, depending on the query type.
- Web sources of evidence: content information, link information, URL information.
- User queries can be classified into three categories: the topic relevance task, the homepage finding task, and the service finding task.

Proposed Method
- Query classification, followed by a type-specific retrieval strategy:
  - Topic relevance type: content information.
  - Homepage finding type: URL information + link information.
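The split above can be pictured as a small dispatch routine: classify the query, then choose which evidence to rank with. This is only a sketch of the idea, not the authors' code; `classify_query` is a stub, and `index.rank_by_content` / `index.rank_by_url_and_links` are hypothetical stand-ins for the content-based and URL/link-based rankers.

```python
def classify_query(query):
    """Stub classifier; the real one combines the measures described on the
    later slides (term distribution, mutual information, anchor usage, POS)."""
    return "topic"

def retrieve(query, index):
    # Dispatch on the predicted query type.
    if classify_query(query) == "topic":
        # Topic relevance task: plain content (body text) evidence works best.
        return index.rank_by_content(query)        # hypothetical method
    # Homepage finding task: URL type and link evidence are more useful.
    return index.rank_by_url_and_links(query)      # hypothetical method
```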

Model: "indexing", "query matching method", "scoring method"
- Test collection: WT10g, with TREC-2001 queries.
- Measures: P_avg (average precision) and MRR (Mean Reciprocal Rank).
- The plain content-text representation gave better results than the anchor-text representation in the topic relevance task.
- URL information and link information are good for the homepage finding task, but bad for the topic relevance task.
- Conclusion: different retrieval strategies are needed according to the category of a query.
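P_avg and MRR are the standard measures named on the slide; a minimal, self-contained sketch of both (assuming each query comes with a ranked list of document ids and a set of relevant ids) could look like this:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query: the mean of precision@k over the
    ranks k at which a relevant document is retrieved."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_reciprocal_rank(rankings, relevant_sets):
    """MRR over a query set: the average of 1/rank of the first relevant
    document per query (0 when none is retrieved), as used for the
    homepage finding task."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for k, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / k
                break
    return total / len(rankings)
```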

Query Classification
- Presents the method for building a language model for user query classification.
- Queries: for the topic relevance task, TREC-2000 topic relevance task queries are used (QUERY_T-TRAIN); for the homepage finding task, queries for 100 randomly selected homepages are used (QUERY_H-TRAIN).
- Documents: the 10-gigabyte WT10g collection. If the URL type of a document is the 'root' type, the document is put into DB_HOME; all other documents are added to DB_TOPIC.
- Root type: the URL is just a domain name.

Query Classification – Distribution of Query Terms
- Chi-squared value: if the chi-squared value of a word 'w' is high, then 'w' is a special term of DB_TOPIC or DB_HOME.
- General terms tend to have the same distribution regardless of the database.
- If the difference between the distributions is larger than expected, it indicates whether a given query belongs to the topic relevance task class or the homepage finding task class.
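One way to read the chi-squared value on this slide is as the usual Pearson statistic over a 2x2 contingency table of how often a word occurs in DB_TOPIC versus DB_HOME. The sketch below is that textbook version, not the paper's exact estimation code; the count and total arguments are assumed inputs.

```python
def chi_squared(count_topic, count_home, total_topic, total_home):
    """Pearson chi-squared statistic for one word w over a 2x2 table:
    (occurrences of w, occurrences of other words) in DB_TOPIC and DB_HOME.
    A large value means w is 'special' to one of the two collections."""
    table = [
        [count_topic, total_topic - count_topic],  # DB_TOPIC: w / not-w
        [count_home,  total_home - count_home],    # DB_HOME:  w / not-w
    ]
    n = total_topic + total_home
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            row = sum(table[i])
            col = table[0][j] + table[1][j]
            expected = row * col / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```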

Query Classification – Mutual Information
- Compare the dependency between adjacent query terms in the two databases: 'tornadoes formed' vs. 'Fan Club' (similar dependency in both databases vs. high dependency in the DB_HOME set).
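A common way to measure this kind of term dependency is pointwise mutual information estimated from pair and unigram counts; computing it separately over DB_TOPIC and DB_HOME and taking the difference gives the diff_MI signal used later. The sketch below uses simple maximum-likelihood estimates, which may differ from the paper's exact formula.

```python
import math

def pointwise_mi(pair_count, count_w1, count_w2, total_pairs, total_words):
    """PMI of an adjacent word pair in one collection.  A high value means the
    terms co-occur far more often than chance (e.g. 'Fan Club' in DB_HOME);
    a value near zero means little dependency (e.g. 'tornadoes formed')."""
    if pair_count == 0 or count_w1 == 0 or count_w2 == 0:
        return 0.0
    p_pair = pair_count / total_pairs
    p_w1 = count_w1 / total_words
    p_w2 = count_w2 / total_words
    return math.log2(p_pair / (p_w1 * p_w2))
```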

Query Classification – Usage Rate as Anchor Text
- If query terms appear frequently in titles and anchor texts, this indicates that the category of the given query is the homepage finding task.
- C_SITE(w): the number of site entry documents that have w as an index term.
- C_SITE_ANCHOR(w): the number of site entry documents and anchor texts that have w as an index term.
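The slide defines the per-word counts C_SITE(w) and C_SITE_ANCHOR(w) but not how they are combined into a per-query score; averaging the per-term ratio, as below, is only one plausible reading and is an assumption, not the paper's exact definition.

```python
def anchor_usage_rate(query_terms, c_site, c_site_anchor):
    """Average, over the query terms, of C_SITE_ANCHOR(w) / C_SITE(w).
    c_site and c_site_anchor are dicts holding the counts defined on the
    slide; terms never seen in a site entry document are skipped."""
    rates = []
    for w in query_terms:
        if c_site.get(w, 0) > 0:
            rates.append(c_site_anchor.get(w, 0) / c_site[w])
    return sum(rates) / len(rates) if rates else 0.0
```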

Query Classification – POS Information
- Since homepage finding task queries are proper names, they do not usually contain a verb.
- If a query has a verb other than a form of the verb 'be', it is classified into the topic relevance task, e.g. 'tornadoes formed'.
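This heuristic is easy to approximate with an off-the-shelf POS tagger. The paper does not say which tagger was used, so NLTK below is purely an illustration (it assumes the 'punkt' and 'averaged_perceptron_tagger' resources are installed).

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

def has_non_be_verb(query):
    """True if the query contains a verb other than a form of 'be',
    i.e. the slide's cue for the topic relevance task."""
    tagged = nltk.pos_tag(nltk.word_tokenize(query))
    return any(tag.startswith("VB") and word.lower() not in BE_FORMS
               for word, tag in tagged)

# has_non_be_verb("tornadoes formed")        -> True  (topic relevance task)
# has_non_be_verb("Texas Highway Department") -> False
```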

Query Classification – Combination of Measures
- diff_Dist: distribution of query terms.
- diff_MI: mutual information.
- use_Anchor: usage rate as anchor text.
- POS_info: POS information.
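The slide lists the four measures but not the formula that combines them, so the weighted score with a threshold below is only a hypothetical illustration of how such measures could be merged, not the paper's actual rule.

```python
def classify(diff_dist, diff_mi, use_anchor, has_non_be_verb,
             weights=(1.0, 1.0, 1.0, 1.0), threshold=0.0):
    """Hypothetical linear combination of the four measures.  Evidence that a
    query behaves like DB_HOME pushes the score up (homepage finding); a
    non-'be' verb pushes it down (topic relevance).  Weights and threshold
    would have to be tuned on the training queries."""
    w1, w2, w3, w4 = weights
    score = (w1 * diff_dist + w2 * diff_mi + w3 * use_anchor
             - w4 * (1.0 if has_non_be_verb else 0.0))
    return "homepage finding" if score > threshold else "topic relevance"
```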

Experiment – Query Classification
- TREC-2001 topic relevance task queries and TREC-2001 homepage finding task queries (1-145) are used for testing (TEST).
- Main reasons for misclassification:
  - Wrong division of WT10g into DB_TOPIC and DB_HOME.
  - A verb appears in a homepage finding task query, e.g. 'Protect & Preserve' is a homepage finding task query.
  - QUERY_T-TEST queries that look like QUERY_H-TEST queries; for example, 'Dodge Recalls' has no result document that contains all of the query terms.

Experiment – IR Improvement
- Topic relevance task and homepage finding task.
- Lemur Toolkit; URL information and link information were combined to reorder the results.
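Reordering results with URL and link information can be sketched as a simple re-ranking step; the scoring below (a bonus for root-type URLs plus a weight on in-link counts) is only an illustrative assumption, not the paper's actual combination.

```python
def rerank_for_homepage(results, is_root_url, inlink_count,
                        url_bonus=2.0, link_weight=0.5):
    """Re-rank (doc_id, content_score) pairs for the homepage finding task by
    adding a bonus for root-type URLs and a weighted in-link count.
    is_root_url and inlink_count are assumed lookup callables."""
    def combined(doc_id, content_score):
        bonus = (url_bonus if is_root_url(doc_id) else 0.0) \
                + link_weight * inlink_count(doc_id)
        return content_score + bonus
    return sorted(results, key=lambda pair: combined(*pair), reverse=True)
```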

Experiment – IR Improvement
- The default category is used for an unclassified query.