Putting Query Representation and Understanding in Context:
A Decision-Theoretic Framework for Optimal Interactive Retrieval through Dynamic User Modeling

ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
(Including joint work with Xuehua Shen and Bin Tan)
SIGIR 2010 Workshop on Query Representation and Understanding, July 23, 2010, Geneva, Switzerland

What is a query?
Query = a sequence of keywords? (e.g., searching for "iPhone battery")
Rather: a query is a sequence of keywords that describes the information need of a particular user, at a particular time, for finishing a particular task. Rich context!
A query must be put in context
Example: searching for "Jaguar". Mac OS? Car? Animal?
– What queries did the user type in before this query?
– What documents were just viewed by this user?
– What documents were skipped by this user?
– What other users looked for similar information?
– …
Context helps query understanding
"Jaguar": Car? Software? Animal?
Suppose we know:
1. The previous query was "racing cars" vs. "Apple OS"
2. "car" occurs far more frequently than "Apple" in pages browsed by the user in the last 20 days
3. The user just viewed an "Apple OS" document
Questions
– How can we model a query in a context-sensitive way? Generalize query representation to a user model.
– How can we model the dynamics of user information needs? Dynamic updating of user models.
– How can we put query representation into a retrieval framework to improve search? A framework for optimal interactive retrieval.
Rest of the talk: the UCAIR Project
1. A decision-theoretic framework
2. Statistical language models for implicit feedback (personalized search without extra user effort)
3. Open challenges
UCAIR Project
UCAIR = User-Centered Adaptive IR
– user modeling ("user-centered")
– search context modeling ("adaptive")
– interactive retrieval
Implemented as a personalized search agent that
– sits on the client side (owned by the user)
– integrates information around a user (1 user vs. N sources, as opposed to 1 source vs. N users)
– collaborates with other agents
– goes beyond search toward task support
Main Idea: Putting the User in the Center!
Traditional view: one search engine serving many users, each typing "java".
UCAIR view: a personalized search agent sits on the user's side, between the user and many sources (Web search engines, desktop files, viewed Web pages, query history). Such a search agent can know a particular user very well.
1. A Decision-Theoretic Framework for Optimal Interactive Retrieval
IR as Sequential Decision Making
User (information need) vs. System (model of information need):
– User A_1: enter a query. System decides: which documents to present? How to present them? Responds with R_1: results.
– User decides: which documents to view? A_2: view a document. System decides: which part of the document to show? How? Responds with R': document content.
– User decides: view more? A_3: click the "Back" button. … (R_i: results, i = 1, 2, 3, …)
Retrieval Decisions
User U:  A_1, A_2, …, A_{t-1}, A_t
System:  R_1, R_2, …, R_{t-1}, R_t = ?
History H = {(A_i, R_i)}, i = 1, …, t-1; document collection C.
Given U, C, A_t, and H, choose the best R_t from r(A_t), the set of all possible responses to A_t.
Examples:
– A_t = query "Jaguar": r(A_t) = all possible rankings of C; the best R_t is the best ranking for the query.
– A_t = click on the "Next" button: r(A_t) = all possible rankings of the unseen docs; the best R_t is the best ranking of the unseen docs.
A Risk Minimization Framework
Observed: user U, interaction history H, current user action A_t, document collection C.
Inferred: user model M = (θ_U, S, …), where θ_U is the information need and S the seen docs.
All possible responses: r(A_t) = {r_1, …, r_n}; loss function L(r_i, A_t, M).
The optimal response r* minimizes the Bayes risk:

  r* = argmin_{r ∈ r(A_t)} ∫_M L(r, A_t, M) P(M | U, H, A_t, C) dM
A Simplified Two-Step Decision-Making Procedure
Approximate the Bayes risk by the loss at the mode of the posterior distribution:
– Step 1: Compute an updated user model M* = argmax_M P(M | U, H, A_t, C) based on the currently available information.
– Step 2: Given M*, choose the response that minimizes the loss: R_t = argmin_{r ∈ r(A_t)} L(r, A_t, M*).
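To make the two steps concrete, here is a minimal sketch (not from the talk; the function names and toy numbers are hypothetical): enumerate candidate user models, take the posterior mode, then pick the response that minimizes the loss under that single model.

```python
def two_step_decision(candidate_models, posterior, responses, loss):
    """Step 1: posterior-mode user model; Step 2: minimize loss under it.

    candidate_models: iterable of user models M
    posterior: function M -> P(M | U, H, A_t, C)
    responses: iterable of possible responses r(A_t)
    loss: function (r, M) -> L(r, A_t, M)
    """
    # Step 1: M* = argmax_M P(M | U, H, A_t, C)
    m_star = max(candidate_models, key=posterior)
    # Step 2: r* = argmin_r L(r, A_t, M*)
    r_star = min(responses, key=lambda r: loss(r, m_star))
    return m_star, r_star

# Toy example: two candidate information-need models, two possible rankings
models = ["car", "animal"]
post = {"car": 0.8, "animal": 0.2}
resp = ["rank_car_docs", "rank_animal_docs"]
loss_table = {("rank_car_docs", "car"): 0.1, ("rank_car_docs", "animal"): 0.9,
              ("rank_animal_docs", "car"): 0.9, ("rank_animal_docs", "animal"): 0.1}
m, r = two_step_decision(models, post.get, resp, lambda r, m: loss_table[(r, m)])
```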
Optimal Interactive Retrieval
The interaction loop between user, IR system, and collection C:
A_1 → infer M*_1 from P(M_1 | U, H, A_1, C) → respond with R_1 minimizing L(r, A_1, M*_1) → A_2 → infer M*_2 from P(M_2 | U, H, A_2, C) → respond with R_2 minimizing L(r, A_2, M*_2) → A_3 → …
Refinement of Risk Minimization
r(A_t): decision space (A_t-dependent)
– r(A_t) = all possible subsets of C (document selection)
– r(A_t) = all possible rankings of docs in C
– r(A_t) = all possible rankings of unseen docs
– r(A_t) = all possible subsets of C + summarization strategies
M: user model
– Essential component: θ_U = user information need
– S = seen documents
– n = "topic is new to the user"
L(R_t, A_t, M): loss function
– Generally measures the utility of R_t for a user modeled as M
– Often encodes retrieval criteria (e.g., using M to select a ranking of docs)
P(M | U, H, A_t, C): user model inference
– Often involves estimating a unigram language model θ_U
Case 1: Context-Insensitive IR
– A_t = "enter a query Q"
– r(A_t) = all possible rankings of docs in C
– M = θ_U, a unigram language model (word distribution)
– p(M | U, H, A_t, C) = p(θ_U | Q)
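The slide leaves the loss function unspecified; one common instantiation of context-insensitive ranking with a unigram model is query likelihood with Dirichlet-smoothed document models. A minimal sketch with toy documents (the function and data are illustrative, not from the talk):

```python
import math
from collections import Counter

def dirichlet_score(query, doc, collection, mu=2000.0):
    """Log query likelihood log p(Q | theta_D) with Dirichlet smoothing."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_coll = coll_tf[w] / coll_len
        if p_coll == 0:
            continue  # skip words unseen in the collection
        score += math.log((doc_tf[w] + mu * p_coll) / (doc_len + mu))
    return score

# Toy collection: the ambiguous query "jaguar car" should prefer the car doc
docs = {"d1": "jaguar car racing speed".split(),
        "d2": "jaguar animal jungle habitat".split()}
collection = [w for d in docs.values() for w in d]
query = "jaguar car".split()
ranking = sorted(docs, key=lambda d: dirichlet_score(query, docs[d], collection),
                 reverse=True)
```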
Case 2: Implicit Feedback
– A_t = "enter a query Q"
– r(A_t) = all possible rankings of docs in C
– M = θ_U, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, A_t, C) = p(θ_U | Q, H)
Case 3: General Implicit Feedback
– A_t = "enter a query Q", "Back" button, or "Next" button
– r(A_t) = all possible rankings of unseen docs in C
– M = (θ_U, S), S = seen documents
– H = {previous queries} + {viewed snippets}
– p(M | U, H, A_t, C) = p(θ_U | Q, H)
Case 4: User-Specific Result Summary
– A_t = "enter a query Q"
– r(A_t) = {(D, π)}, D ⊆ C, |D| = k, π ∈ {"snippet", "overview"}
– M = (θ_U, n), n ∈ {0, 1}: "topic is new to the user"
– p(M | U, H, A_t, C) = p(θ_U, n | Q, H); M* = (θ*, n*)
Loss for the summary type of each presented document:
                  n* = 1   n* = 0
  π_i = snippet      1        0
  π_i = overview     0        1
Choose the k most relevant docs. If the topic is new to the user (n* = 1), give an overview summary; otherwise, give a regular snippet summary.
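Under this 0/1 loss table, choosing the summary type by expected loss reduces to a threshold on the probability that the topic is new (and with two choices, deciding at the posterior mode n* gives the same answer). A minimal sketch with a hypothetical function name:

```python
def choose_summary(p_new_topic):
    """Pick the summary type minimizing expected loss under the 0/1 table:
    L(snippet, n=1) = 1, L(snippet, n=0) = 0,
    L(overview, n=1) = 0, L(overview, n=0) = 1."""
    expected_loss = {
        "snippet":  p_new_topic * 1 + (1 - p_new_topic) * 0,
        "overview": p_new_topic * 0 + (1 - p_new_topic) * 1,
    }
    # Equivalent to: overview iff p_new_topic > 0.5
    return min(expected_loss, key=expected_loss.get)
```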
Case 5: Active Feedback
– A_t = "enter a query Q", "Back" button, or "Next" button
– r(A_t) = all subsets of k docs in C; r_i = {d_1, …, d_k}
– A_{t+1} = {J_1, …, J_k}: relevance judgments on R_t
– M = θ_U, a unigram language model (word distribution)
– L = utility of R_t for the user + utility of R_t for feedback
  – Tradeoff between relevance and diversity
  – For difficult topics, diversity dominates the loss
2. Statistical Language Models for Implicit Feedback (Personalized Search without Extra User Effort)
Risk Minimization for Implicit Feedback
– A_t = "enter a query Q"
– r(A_t) = all possible rankings of docs in C
– M = θ_U, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, A_t, C) = p(θ_U | Q, H)
Need to estimate a context-sensitive LM.
Estimate a Context-Sensitive LM
User queries: Q_1 (e.g., "Apple software"), Q_2 (e.g., "Apple - Mac OS X"), …, Q_k (e.g., "Jaguar")
User clickthrough: C_1 = {C_1,1, C_1,2, C_1,3, …}, C_2 = {C_2,1, C_2,2, C_2,3, …}, …
e.g., a clicked snippet: "Apple - Mac OS X: The Apple Mac OS X product page. Describes features in the current version of Mac OS X, …"
User model = query history + clickthrough
Short-term vs. long-term implicit feedback
Short-term implicit feedback
– context = current retrieval session
– past queries in the context are closely related to the current query
– clickthroughs reflect the user's current interests
Long-term implicit feedback
– context = all search interaction history
– not all past queries/clickthroughs are related to the current query
"Bayesian interpolation" for short-term implicit feedback
Average the past query models θ_{Q_1}, …, θ_{Q_{k-1}} into a history query model θ_{H_Q}, and the clickthrough models θ_{C_1}, …, θ_{C_{k-1}} into a history clickthrough model θ_{H_C}; then use them as a Dirichlet prior for the current query Q_k:

  p(w | θ_k) = ( c(w, Q_k) + μ p(w | θ_{H_Q}) + ν p(w | θ_{H_C}) ) / ( |Q_k| + μ + ν )

Intuition: trust the current query Q_k more if it is longer.
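A sketch of this interpolation, assuming the averaged history models H_Q (past queries) and H_C (clickthrough snippets) are already estimated; the history models below are hypothetical toy data, not from the talk:

```python
from collections import Counter

def bayes_interpolate(query, hist_query_model, hist_click_model, mu=0.2, nu=5.0):
    """Dirichlet-prior interpolation of the current query with history models:
    p(w | theta_k) = (c(w, Q_k) + mu*p(w|H_Q) + nu*p(w|H_C)) / (|Q_k| + mu + nu).
    The longer the current query, the more the estimate trusts it."""
    q_tf = Counter(query)
    denom = len(query) + mu + nu
    vocab = set(q_tf) | set(hist_query_model) | set(hist_click_model)
    return {w: (q_tf[w]
                + mu * hist_query_model.get(w, 0.0)
                + nu * hist_click_model.get(w, 0.0)) / denom
            for w in vocab}

# Hypothetical history: the user has been searching about Apple software,
# so a short ambiguous query "jaguar" is pulled toward the software sense.
hq = {"apple": 0.6, "software": 0.4}            # averaged past-query model H_Q
hc = {"apple": 0.5, "mac": 0.3, "os": 0.2}      # averaged clickthrough model H_C
theta = bayes_interpolate(["jaguar"], hq, hc)
```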
Overall Effect of Search Context
Improvement of query + history (Q + H_Q + H_C) over the query alone (two reported metrics per method):

             FixInt (α=0.1, β=1.0)  BayesInt (μ=0.2, ν=5.0)  OnlineUp (μ=5.0, ν=15.0)  BatchUp (μ=2.0, ν=15.0)
Q3 Improve:  72.4% / 32.6%          93.8% / 39.4%            67.7% / 20.2%             92.4% / 39.4%
Q4 Improve:  66.2% / 15.5%          78.2% / 19.9%            47.8% / 6.9%              77.2% / 16.4%

Short-term context helps the system improve retrieval accuracy.
BayesInt is better than FixInt; BatchUp is better than OnlineUp.
Using Clickthrough Data Only
BayesInt (μ=0.0, ν=5.0): improvement of Q + H_C over Q alone (two reported metrics):

All docs:           Q3: 81.9% / 37.1%   Q4: 72.6% / 18.1%   (clickthrough is the major contributor)
Unseen docs only:   Q3: 99.7% / 42.4%   Q4: 67.2% / 13.9%
Non-relevant docs:  Q3: 23.8% / 23.0%   Q4: 15.7% / -4.1%   (snippets for non-relevant docs are still useful!)
Mixture model with dynamic weighting for long-term implicit feedback
Each past search session S_i (query q_i, results D_i, clickthrough C_i) yields a session model θ_{S_i}. The history model is a weighted mixture θ_H = Σ_i λ_i θ_{S_i}, and the overall model interpolates the current query model with the history: θ_{q,H} = λ_q θ_q + (1 − λ_q) θ_H.
Select the weights {λ} to maximize P(D_t | θ_{q,H}), using the EM algorithm.
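The weight selection can be sketched as standard EM for a mixture with fixed component models: treat each word of the observed documents D_t as generated by one of the session models and re-estimate the mixing weights. This is a simplified sketch with hypothetical session models, not the exact formulation of the cited KDD 2006 paper:

```python
def em_mixture_weights(doc_words, component_models, iters=50):
    """Estimate mixing weights lambda_i for fixed unigram components
    theta_1..theta_m, maximizing P(D_t | sum_i lambda_i * theta_i)."""
    m = len(component_models)
    lam = [1.0 / m] * m  # start from uniform weights
    for _ in range(iters):
        # E-step: posterior probability that each word token came from component i
        counts = [0.0] * m
        for w in doc_words:
            probs = [lam[i] * component_models[i].get(w, 1e-12) for i in range(m)]
            z = sum(probs)
            for i in range(m):
                counts[i] += probs[i] / z
        # M-step: weights proportional to expected counts
        total = sum(counts)
        lam = [c / total for c in counts]
    return lam

# Hypothetical past sessions: one about cars, one about animals.
theta_sessions = [{"car": 0.7, "speed": 0.3},
                  {"animal": 0.7, "jungle": 0.3}]
doc = "car car speed animal".split()  # words observed for the current query
weights = em_mixture_weights(doc, theta_sessions)
```

Sessions whose models explain the current results better receive larger weights, which is the intuition behind the dynamic weighting on the slide.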
Results: Different Individual Search Models
recurring ≫ fresh
combination ≈ clickthrough > docs > query, contextless
Results: Different Weighting Schemes for the Overall History Model
hybrid ≈ EM > cosine > equal > contextless
Open Challenges
– What is a query?
– How to collect as much context information as possible without infringing user privacy?
– How to store and organize the collected context information?
– How to accurately interpret/exploit context information?
– How to formally represent the evolving information need of a user?
– How to optimize search results for an entire session?
– What's the right architecture (client-side, server-side, and client-server combo)?
References
Framework
– Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modeling for personalized search. In Proceedings of CIKM 2005.
– ChengXiang Zhai and John Lafferty. A risk minimization framework for information retrieval. Information Processing and Management, 42(1), Jan. 2006.
Short-term implicit feedback
– Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-sensitive information retrieval with implicit feedback. In Proceedings of SIGIR 2005.
Long-term implicit feedback
– Bin Tan, Xuehua Shen, and ChengXiang Zhai. Mining long-term search history to improve search accuracy. In Proceedings of KDD 2006.
Thank You!