Thoughts (and Research) on Query Intent. Bruce Croft, Center for Intelligent Information Retrieval, UMass Amherst.

1 Thoughts (and Research) on Query Intent
Bruce Croft
Center for Intelligent Information Retrieval, UMass Amherst

2 Overview
Query Representation and Understanding Workshop at SIGIR 2010
Research projects in the CIIR

3 Observations
“Query intent” has become a popular phrase at conferences and at companies
Research with query logs = acceptance of paper
Few standards in these papers about test collections, metrics, even tasks
Query processing has been part of IR for a long time – e.g., stemming, expansion, relevance feedback
Most retrieval models say little about queries
So, what’s going on and what’s interesting?

4 Terminology
Query intent (or search intent) is the same thing as information need
  – The notion of an information need or problem underlying a query has been discussed in the IR literature for many years, and it was generally agreed that query intent is another way of referring to the same idea
Query representation involves modeling the intent or need
  – Query understanding refers to the process of identifying the underlying intent or need based on a particular representation
Intent classes, intent dimensions, and query classes – terms used to talk about the many different types of information needs and problems

5 Terminology
Query rewriting, query transformation, query refinement, query alteration, and query reformulation – names given to the process of changing the original query to better represent the underlying intent (and consequently improve ranking)
Query expansion, substitution, reduction, segmentation – some of the techniques or steps used in the query transformation process
Query – most research assumes the query is the string entered by the user. Transformation can produce many different representations of the query. The difference between explicit and implicit queries is important

6 Research Questions
How to develop a unified and general framework for query understanding?
How to formally define a query representation?
How to develop new system architectures for query understanding?
How to combine query understanding with other components in information retrieval systems?
How to conduct evaluations of query understanding?
How to make effective use of both human knowledge and machine learning in query understanding?

7 Possible Research Tasks
Long query relevance
Query reduction
Similar query finding
Query classification
Named entity recognition in queries
Context-aware search – Intent-aware search

8 Methodology
Must agree on tasks, evaluation metrics, and text collections
TREC-style vs. “black-box” evaluations
Crowdsourcing for annotations
Resources such as query collections, document collections, query logs, etc. differ widely in their availability in academic and industry settings

9 Resources
Document collections – TREC ClueWeb collection preferred
Query collections – need collections of different query types (e.g. long, location, product…) validated by industry
Query logs – critical resource for some approaches, not available in academia. Alternatives include MSN/AOL logs, KDD queries, anchor text logs, logs from other applications (Wikipedia), logs from some restricted environment (e.g. academic library)
N-grams, etc. – corpus and query language statistics from web collections

10 CIIR Projects
Modeling structure in queries
Modeling distributions of queries
Modeling diversity in queries
Transforming long queries
Generating queries from documents
Generating query logs from anchor text
Finding similar queries

11 The Challenge of Query Representation
User inputs a string of characters
Query structure is never explicitly observed and is difficult to infer
  – Short and ambiguous search queries
  – Idiosyncratic grammar
  – No capitalization and punctuation
Example queries:
  talking to heaven movie
  new york times square
  do grover cleveland have kids

12 Structural Query Representation
A query Q has a hierarchical representation
  – A query is a set of structures Σ = {σ_1, …, σ_n}
  – Each structure is a set of concepts σ = {κ_1, κ_2, …}
Hierarchical representation makes it possible to
  – Model arbitrary term dependencies as concepts
  – Group concepts by structures
  – Assign weights to concepts/structures

13 Query: members rock group nirvana
Structures and their concepts:
  Terms: [members] [rock] [group] [nirvana]
  Bigrams: [members rock] [rock group] [group nirvana]
  Chunks: [members] [rock group] [nirvana]
  Key Concepts: [members] [nirvana]
  Dependence: [members nirvana] [rock group]
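
The Terms and Bigrams structures for this query can be derived mechanically from the query string. The following is a minimal sketch, assuming whitespace tokenization; the function name `build_structures` and the dictionary layout are illustrative and not taken from the CIIR work. Chunks, key concepts, and dependence structures would need trained annotators and are left out.

```python
def build_structures(query: str) -> dict[str, list[tuple[str, ...]]]:
    """Return a structure-name -> list-of-concepts mapping for a query.

    Each concept is a tuple of terms. Only the two structures that need
    no trained model (single terms and adjacent bigrams) are produced.
    """
    terms = query.split()  # assumption: whitespace tokenization
    return {
        "terms": [(t,) for t in terms],
        "bigrams": [tuple(terms[i:i + 2]) for i in range(len(terms) - 1)],
    }


structures = build_structures("members rock group nirvana")
for name, concepts in structures.items():
    # Print each structure in the bracketed style used on the slide.
    print(name, " ".join("[" + " ".join(c) + "]" for c in concepts))
```

Keeping each structure as a named list of concepts mirrors the two-level hierarchy (structures, then concepts) described on slide 12, so weights could later be attached at either level.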

14 Encoding Query Structure in a Hypergraph
[Figure labels: Document; Structure 1, …, Structure n; Concepts]

15 Weighted Sequential Dependence Model (WSD)
Allow the parameters of the sequential dependence model to depend on the concept
Assume the parameters take a simple parametric form – maintains reasonable model complexity
w: free parameters; g: concept importance features
[Bendersky, Metzler, and Croft, 2009]

16 Defining Concept Importance in WSD
Features g define the concept importance
  – Depend on the concept (term/bigram)
  – Independent of a specific document/document corpus
Combine several sources for more accurate weighting
  – Endogenous Features – collection dependent features
  – Exogenous Features – collection independent features
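
One way to read the "simple parametric form" together with the features g is that a concept's weight is a linear combination of its importance features with the free parameters w. The sketch below assumes that linear form; the feature names and every numeric value are made up for illustration and are not taken from the paper.

```python
def concept_weight(features: dict[str, float], params: dict[str, float]) -> float:
    """Weight of one concept: sum over features of w_i * g_i(concept)."""
    return sum(params[name] * value for name, value in features.items())


# Hypothetical free parameters w, one per feature.
w = {"endogenous_df": 0.02, "exogenous_ngram_count": -0.01}

# Hypothetical feature values g for two concepts.
g = {
    "civil": {"endogenous_df": 14.0, "exogenous_ngram_count": 17.0},
    "reenactments": {"endogenous_df": 10.0, "exogenous_ngram_count": 11.0},
}

weights = {concept: concept_weight(feats, w) for concept, feats in g.items()}
print(weights)
```

Because the weight depends only on concept-level features, the number of free parameters stays equal to the number of features, no matter how many distinct concepts appear in queries.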

17 WSD Ranking Function Score document D by:
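
A sketch of the ranking function, consistent with slides 15 and 16. It assumes the standard sequential dependence model match features (f_T for single terms, f_O for ordered windows, f_U for unordered windows) and a linear form for the concept weights; the superscripts u and b (unigram and bigram) are notation introduced here.

```latex
\begin{aligned}
P(D \mid Q) \;\overset{\text{rank}}{=}\;&
  \sum_{q_i \in Q} \lambda(q_i)\, f_T(q_i, D)
  + \sum_{q_i, q_{i+1} \in Q} \lambda(q_i, q_{i+1})
    \left[ f_O(q_i, q_{i+1}, D) + f_U(q_i, q_{i+1}, D) \right] \\
\lambda(q_i) \;=\;& \sum_{j} w^{u}_{j}\, g^{u}_{j}(q_i),
\qquad
\lambda(q_i, q_{i+1}) \;=\; \sum_{j} w^{b}_{j}\, g^{b}_{j}(q_i, q_{i+1})
\end{aligned}
```

With constant weights in place of λ(·), this reduces to the unweighted sequential dependence model.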

18 Query: “civil war battle reenactments”

Concept               Importance Features    Weight
                      GF      …      DF
civil                 16.9    …      14.1    0.0619
war                   17.9    …      12.8    0.1947
battle                16.6    …      12.6    0.0913
reenactments          10.8    …      9.7     0.3487
civil war             14.5    …      10.8    0.1959
war battle            9.5     …      7.4     0.2458
battle reenactments   7.6     …      4.7     0.0540

Concept weights may vary even if concept DF is similar
Good segments do not necessarily predict important concepts

19 TREC Description (Long) Queries
[Chart labels: +6.3%, +24.1%, +1.6%]

20 Query Representation
Distribution of Terms (DOT): words + phrases, original or new
  – Relevance Model [Lavrenko and Croft, SIGIR01]
  – Sequential Dependence Model [Metzler and Croft, SIGIR05]
  – Latent Concept Expansion [Metzler and Croft, SIGIR07]
  – Uncertainty in PRF [Collins-Thompson and Callan, SIGIR07]
Single Reformulated Query (SRQ): a single reformulation operation
  – Query Segmentation [Bergsma and Wang, EMNLP-CoNLL07] [Tan and Peng, WWW08]
  – Query Substitution [Jones et al, WWW06] [Wang and Zhai, CIKM08]
DOT does not consider how these terms are fitted into actual queries, thus missing the dependencies between them.
SRQ does not consider combining with other operations, thus missing information about alternative reformulations.
Distribution of Queries (DOQ): each query is the output of applying single or multiple reformulation operations.

21 Example
Original TREC Query: oil industry history
Distribution of Terms (DOT)
  – Relevance Model: {0.44 “industry”, 0.28 “oil”, 0.08 “petroleum”, 0.08 “gas”, 0.08 “county”, 0.04 “history”, ...}
  – Sequential Dependence Model [Metzler, SIGIR05]: {0.28 “oil”, 0.28 “industry”, 0.28 “history”, 0.08 “oil industry”, 0.08 “industry history”, ...}
Single Reformulated Query (SRQ)
  – Query Segmentation: “(oil industry)(history)”
  – Query Substitution: “petroleum industry history”
Distribution of Queries (DOQ)
  – 0.28 “(oil industry)(history)”, 0.24 “(petroleum industry)(history)”, 0.20 “(oil and gas industry)(history)”, 0.18 “(oil)(industrialized)(history)”, …
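
A distribution of queries could be used at retrieval time by scoring a document under each reformulation and averaging with the reformulation probabilities. The sketch below uses the four DOQ entries from the example; the mixing rule and the toy `score` function (fraction of concepts found verbatim in the document) are assumptions for illustration, not the model presented in the talk.

```python
# Reformulated queries (tuples of concepts) with their probabilities,
# copied from the DOQ example; the original list continues after these four.
doq = [
    (0.28, ("oil industry", "history")),
    (0.24, ("petroleum industry", "history")),
    (0.20, ("oil and gas industry", "history")),
    (0.18, ("oil", "industrialized", "history")),
]


def score(concepts, document: str) -> float:
    """Toy stand-in for a retrieval score: fraction of concepts found verbatim."""
    text = document.lower()
    return sum(c in text for c in concepts) / len(concepts)


def doq_score(distribution, document: str) -> float:
    """Expected score of the document under the query distribution."""
    return sum(p * score(concepts, document) for p, concepts in distribution)


doc = "A short history of the petroleum industry in Texas"
print(round(doq_score(doq, doc), 3))
```

The document never mentions "oil", yet it still receives credit through the "petroleum industry" reformulation, which is the kind of evidence a single reformulated query would either use exclusively or miss entirely.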

22 Application I: Reducing Long Queries
[Xue, Huston, and Croft, CIKM2010]
  – A novel CRF-based model learns a distribution of subset queries, which directly optimizes retrieval performance
Legend: (1) using the top 1 subset query; (K) using the top K subset queries. q, d indicate significantly better than QL and DM
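
Query reduction works over subset queries of the original long query. The following is a minimal sketch of candidate generation only, assuming whitespace tokenization and an illustrative cap on subset length; the CRF that assigns probabilities to the candidates is not shown, and `subset_queries` is a hypothetical helper name.

```python
from itertools import combinations


def subset_queries(query: str, min_len: int = 1, max_len: int = 3) -> list[str]:
    """Enumerate subset queries that keep the original word order."""
    terms = query.split()
    out = []
    for k in range(min_len, min(max_len, len(terms)) + 1):
        # combinations() yields index tuples in increasing order,
        # so each subset preserves the order of the original query.
        for idx in combinations(range(len(terms)), k):
            out.append(" ".join(terms[i] for i in idx))
    return out


candidates = subset_queries("civil war battle reenactments", min_len=2, max_len=3)
print(len(candidates), candidates[:3])
```

The number of subsets grows exponentially with query length, so a length cap (or some other pruning step) is needed before any model can rank the candidates for genuinely long queries.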

23 Query Reduction

24 Application II: Substitution

25 Query Substitution
A context of a word is the unigram preceding it
Context distribution: the probability that the term c_i appears in w’s context
The translation model: the KL divergence between the context distributions of w and s
The substitution model: how well the new term fits the context of the current query
  – Q = q_1, …, q_{i-2}, q_{i-1}, q_i, q_{i+1}, q_{i+2}, …, q_n; candidate = s
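
The context distribution and the KL-based comparison of two words can be sketched on a toy query log. This assumes the definition given here (the context of a word is the unigram preceding it), add-one smoothing over the log vocabulary, and a made-up four-query log; the function names are illustrative.

```python
import math
from collections import Counter

log = ["cheap airfare", "cheap flight", "cheap hotel", "book flight"]  # toy data


def context_counts(word: str, queries: list[str]) -> Counter:
    """Count the unigrams that immediately precede `word` in the queries."""
    counts = Counter()
    for q in queries:
        terms = q.split()
        for prev, cur in zip(terms, terms[1:]):
            if cur == word:
                counts[prev] += 1
    return counts


def context_distribution(word: str, queries: list[str], vocab: list[str]) -> dict[str, float]:
    """P(c | word) with add-one smoothing so every context term has mass."""
    counts = context_counts(word, queries)
    total = sum(counts.values()) + len(vocab)
    return {c: (counts[c] + 1) / total for c in vocab}


def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) for two distributions over the same support."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


vocab = sorted({t for q in log for t in q.split()})
p_airfare = context_distribution("airfare", log, vocab)
p_flight = context_distribution("flight", log, vocab)
p_hotel = context_distribution("hotel", log, vocab)
print(kl_divergence(p_airfare, p_flight), kl_divergence(p_airfare, p_hotel))
```

A lower divergence means the two words occur in more similar contexts. In this tiny log "hotel" and "airfare" have identical contexts, which illustrates why the substitution model also checks how well a candidate fits the rest of the current query.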

26 Query Expansion and Stemming
Probabilities are estimated from corpus or query log
  – Using text passages is nearly the same as pseudo-relevance feedback
Query expansion is similar to substitution
  – We add the new term and keep the original term
  substitution: “cheap airfare” → “cheap flight”
  expansion: “cheap airfare” → “cheap airfare flight”
Stemming
  – New terms are restricted to terms that share the original term’s Porter stem
  “drive direction” → “drive driving direction”
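
The difference between substitution and expansion comes down to whether the original term is kept. A minimal sketch, assuming whitespace tokenization and that the candidate term has already been chosen; both helper names are hypothetical.

```python
def substitute(query: str, old: str, new: str) -> str:
    """Replace every occurrence of the term `old` with `new`."""
    return " ".join(new if t == old else t for t in query.split())


def expand(query: str, old: str, new: str) -> str:
    """Keep `old` and insert `new` right after it."""
    out = []
    for t in query.split():
        out.append(t)
        if t == old:
            out.append(new)
    return " ".join(out)


print(substitute("cheap airfare", "airfare", "flight"))  # cheap flight
print(expand("cheap airfare", "airfare", "flight"))      # cheap airfare flight
print(expand("drive direction", "drive", "driving"))     # drive driving direction
```

Stemming in this framing is expansion where the candidate is limited to a morphological variant of the original term, as in the last call.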

27 The Anchor Log
Extract (anchor text, linked page) pairs from the Gov-2 collection to create the anchor log [Dang and Croft, 2009]
The anchor log is very noisy
  – “click here”, “print version”, … don’t represent the linked page
Anchor text gives comparable performance to MSN log for substitution, expansion, stemming

                    MSN Log       Anchor Log
# Total Queries     14 million    526 million
# Unique Queries    6 million     20 million
Avg. Query Length   2.68          2.62
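
Building an anchor log amounts to collecting anchor text together with its link target from each page and discarding anchors that say nothing about the target. The sketch below uses Python's standard `html.parser`; the stop list (seeded with the two noisy examples above), the lowercasing, and the sample HTML are assumptions for illustration, not the procedure from the cited paper.

```python
from html.parser import HTMLParser

NOISE = {"click here", "print version"}  # assumed stop list; extend as needed


class AnchorExtractor(HTMLParser):
    """Collect (anchor text, href) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case before filtering.
            text = " ".join("".join(self._text).split()).lower()
            if text and text not in NOISE:
                self.pairs.append((text, self._href))
            self._href = None


page = '<a href="/forms/w4">Tax withholding form</a> <a href="/p">Click here</a>'
parser = AnchorExtractor()
parser.feed(page)
parser.close()
print(parser.pairs)
```

Each surviving anchor text then plays the role of a query, with the linked page standing in for a clicked result.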

28 Learning to Rank Reformulations [Dang, Bendersky, and Croft, 2010]

29 Using Query Distributions
Reformulating Short Queries [Xue et al, CIKM2010]
  – Passage information used to generate candidate queries and estimate probabilities
Legend (Gov2): o, w, m, a represent different methods to generate candidate queries. q, d, r indicate significantly better than QL, SDM and RM.

30 Example Query Reformulations using Passages

31 Conclusions
Studying query intent is not new, but more data is leading to many new insights
Not just a web search issue, but more obvious in web search
Lots of interesting research to do, but field needs more coherence in terms of research goals, testbeds

