Query Type Classification for Web Document Retrieval In-Ho Kang, GilChang Kim KAIST SIGIR 2003.


1 Query Type Classification for Web Document Retrieval In-Ho Kang, GilChang Kim KAIST SIGIR 2003

2 Introduction
- Different strategies are needed to find target documents, depending on the query type.
- Web sources: content information, link information, URL information.
- User queries can be classified into three categories: topic relevance task, homepage finding task, service finding task.

3 Proposed Method
- Query classification
- Topic relevance type: content information
- Homepage finding type: URL information + link information

4 Model: "indexing" "query matching method" "scoring method"
- P_avg: average precision. MRR: Mean Reciprocal Rank. Collection: WT10g, TREC-2001.
- In the topic relevance task, the common content text representation gave better results than the anchor text representation.
- URL information and link information are good for the homepage finding task but bad for the topic relevance task.
- Conclusion: different retrieval strategies are needed according to the category of a query.
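The two evaluation measures named on this slide can be sketched as follows. This is a minimal illustration using their standard definitions; the function names and the toy document IDs are assumptions and do not come from the presentation.

```python
from typing import Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranking, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision values taken at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

MRR suits the homepage finding task, where a query typically has one correct answer, while average precision suits the topic relevance task, where many documents can be relevant.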

5 Query Classification
- Presents the method for building a language model for user query classification.
- Queries:
  - Topic relevance task: TREC-2000 topic relevance task queries (topics 451-500), denoted QUERY_T-TRAIN.
  - Homepage finding task: queries for 100 randomly selected homepages, denoted QUERY_H-TRAIN.
- Documents: the 10-gigabyte WT10g collection. If the URL type of a document is 'root', the document is put into DB_HOME; all other documents are added to DB_TOPIC.
- Root type: a domain name (e.g. http://trec.nist.gov)
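The split of WT10g into DB_HOME and DB_TOPIC can be sketched as below. This is a rough illustration only: it assumes that a 'root' URL is one with a host and no path, query, or parameters, which may be stricter or looser than the rule the authors actually used. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def is_root_url(url: str) -> bool:
    """Assumed rule: a root URL has a host and nothing after it except an optional '/'."""
    parsed = urlparse(url)
    return (
        bool(parsed.netloc)
        and parsed.path in ("", "/")
        and not parsed.params
        and not parsed.query
    )


def split_collection(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition document URLs into (DB_HOME, DB_TOPIC) by URL type."""
    db_home: List[str] = []
    db_topic: List[str] = []
    for url in urls:
        (db_home if is_root_url(url) else db_topic).append(url)
    return db_home, db_topic
```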

6 Query Classification – Distribution of Query Terms
- Chi-squared value: if the chi-square value of a word 'w' is high, then 'w' is a special term of DB_TOPIC or DB_HOME.
- General terms tend to have the same distribution regardless of the database.
- If the difference in distribution is larger than expected, this indicates whether a given query belongs to the topic relevance task class or the homepage finding task class.
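The formula shown on this slide is not preserved in the transcript. As a hedged sketch, the standard Pearson chi-square statistic for a 2x2 table (database by presence of the word) captures the idea; the use of document frequencies and the function name are assumptions.

```python
def word_chi_square(df_topic: int, n_topic: int, df_home: int, n_home: int) -> float:
    """Pearson chi-square for a 2x2 table of word presence vs. database.

    df_topic / df_home: number of documents containing the word in each database.
    n_topic / n_home:   total number of documents in each database.
    """
    a = df_topic              # DB_TOPIC documents with the word
    b = n_topic - df_topic    # DB_TOPIC documents without the word
    c = df_home               # DB_HOME documents with the word
    d = n_home - df_home      # DB_HOME documents without the word
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denominator
```

A word that occurs at the same rate in both databases scores 0, which matches the slide's remark that general terms have the same distribution regardless of the database.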

7 Query Classification – Mutual Information
- 'tornadoes formed' vs. 'Fan Club': similar dependency vs. high dependency in the DB_HOME set.
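The slide's formula is missing from the transcript. A minimal sketch using the standard pointwise mutual information of two adjacent query terms, computed per database and then compared, might look like this. The count-based estimate and both function names are assumptions.

```python
import math


def pointwise_mi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """log2( P(x, y) / (P(x) * P(y)) ) estimated from raw counts in one database."""
    if count_x <= 0 or count_y <= 0 or total <= 0:
        raise ValueError("count_x, count_y and total must be positive")
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    # Algebraically equal to (count_xy/total) / ((count_x/total) * (count_y/total)).
    return math.log2(count_xy * total / (count_x * count_y))


def mi_difference(mi_home: float, mi_topic: float) -> float:
    """Positive when the word pair is more strongly bound in DB_HOME than in DB_TOPIC."""
    return mi_home - mi_topic
```

Under this reading, a pair such as 'Fan Club' would show a clearly positive difference, while a pair such as 'tornadoes formed' would stay near zero.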

8 Query Classification – Usage Rate as an Anchor Text
- If query terms appear frequently in titles and anchor texts, this indicates that the query belongs to the homepage finding task.
- C_SITE(w): the number of site entry documents that have w as an index term.
- C_SITE_ANCHOR(w): the number of site entry documents and anchor texts that have w as an index term.

9 Query Classification – POS Information
- Since homepage finding task queries are proper names, they do not usually contain a verb.
- If a query has a verb other than the 'be' verb, it is classified into the topic relevance task, e.g. 'tornadoes formed'.
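The POS rule can be sketched as below. The example assumes the query has already been tagged with Penn Treebank tags by some external tagger (the presentation does not say which tagger or tag set was used), and the list of 'be' forms is an assumption.

```python
from typing import List, Optional, Tuple

# Assumed surface forms of the 'be' verb that the rule ignores.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def has_non_be_verb(tagged_query: List[Tuple[str, str]]) -> bool:
    """True if any token is tagged as a verb (Penn tags start with 'VB') and is not 'be'."""
    return any(
        tag.startswith("VB") and token.lower() not in BE_FORMS
        for token, tag in tagged_query
    )


def classify_by_pos(tagged_query: List[Tuple[str, str]]) -> Optional[str]:
    """Return 'topic' when the rule fires, otherwise None (the rule gives no decision)."""
    return "topic" if has_non_be_verb(tagged_query) else None
```

For instance, `classify_by_pos([("tornadoes", "NNS"), ("formed", "VBD")])` returns `'topic'`, while a query tagged only with proper nouns yields no decision.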

10 Query Classification – Combination of Measures
- diff_Dist: distribution of query terms
- diff_MI: mutual information
- use_Anchor: usage rate as an anchor text
- POS_info: POS information
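The combining formula on this slide did not survive extraction. One plausible reading is a weighted sum of the four measures, sketched here. The weights, the sign convention (positive means homepage finding), and the margin that leaves a query unclassified are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Measures:
    diff_dist: float   # distribution of query terms
    diff_mi: float     # mutual information
    use_anchor: float  # usage rate as an anchor text
    pos_info: float    # POS information


def combined_score(m: Measures, weights: Tuple[float, float, float, float]) -> float:
    """Weighted sum of the four measures (hypothetical weights)."""
    w_dist, w_mi, w_anchor, w_pos = weights
    return (
        w_dist * m.diff_dist
        + w_mi * m.diff_mi
        + w_anchor * m.use_anchor
        + w_pos * m.pos_info
    )


def classify(
    m: Measures,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
    margin: float = 0.0,
) -> str:
    """Assumed convention: score above margin is 'homepage', below -margin is 'topic'."""
    score = combined_score(m, weights)
    if score > margin:
        return "homepage"
    if score < -margin:
        return "topic"
    return "unclassified"
```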

11 Experiment – Query Classification
- TREC-2001 topic relevance task queries (topics 501-550) and TREC-2001 homepage finding task queries (1-145) are used for testing (TEST).
- Main reasons for misclassification:
  - Wrong division of WT10g.
  - A verb appears in a homepage finding task query, e.g. 'Protect & Preserve' is a homepage finding task query.
  - Some QUERY_T-TEST queries look like QUERY_H-TEST queries. For example, 'Dodge Recalls' does not have a result document that contains all of its query terms.

12 Experiment – IR Improvement
- Topic Relevance Task
- Homepage Finding Task
- Lemur Toolkit: URL information and link information were combined to reorder results.

13 Experiment – IR Improvement
- [symbol lost in extraction] means the default category for an unclassified query.

