Query Recommendation Xiaofei Zhu

Query Recommendation Xiaofei Zhu (zhu@l3s.de)
L3S Research Center, Leibniz Universität Hannover 16 minutes

Introduction
User queries are often short (1-2 words), ambiguous (e.g., Java), and written with a lack of domain knowledge.

Query Recommendation
Query recommendation aims to provide users with alternative queries that represent their information needs more clearly, so that the search engine returns better results.
[Figure: original query → recommendation]

Query Recommendation: how to do query recommendation?
Find alternative queries with similar search intent.
How does this differ from recommending documents or images?

Query log
A query log records information about the search actions of a search engine's users. A typical query log is a set of records <q_i, u_i, t_i, V_i, C_i>:
q_i: the submitted query
u_i: an anonymized identifier for the user who submitted the query
t_i: the timestamp at which the query was submitted for search
V_i: the set of results returned for the query
C_i: the set of documents clicked by the user
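As a minimal sketch, one record <q_i, u_i, t_i, V_i, C_i> could be held in a small data class. The class name, field names, timestamp unit, and the example URLs below are illustrative assumptions, not part of any real log format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryLogRecord:
    """One query log record <q_i, u_i, t_i, V_i, C_i> (field names are assumptions)."""
    query: str                                          # q_i: the submitted query
    user_id: str                                        # u_i: anonymized user identifier
    timestamp: float                                    # t_i: submission time (seconds, assumed unit)
    results: List[str] = field(default_factory=list)   # V_i: returned results
    clicks: List[str] = field(default_factory=list)    # C_i: clicked documents


def click_pairs(records):
    """Yield (query, clicked_url) pairs, the raw material of click-through data."""
    for record in records:
        for url in record.clicks:
            yield (record.query, url)


# Hypothetical records; the URLs are placeholders.
example = [
    QueryLogRecord("motorola text messages", "u1", 0.0,
                   ["a.example", "b.example"], ["a.example"]),
    QueryLogRecord("usps", "u2", 60.0, ["usps.example"], []),
]
```

Calling list(click_pairs(example)) keeps only the queries that led to a click, which is the input the click-through methods later in the deck work from.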

Example of query log (AOL, 2006)
Columns: AnonID, Query, QueryTime, ItemRank, ClickURL
(The AnonID values, the date part of QueryTime, and the ClickURL values are missing from this transcript.)
Query                          QueryTime   ItemRank
motorola text messages         :35:31      1
motorola text messages         :35:31      4
motorola t730 text messages    :38:40      2
motorola t730 text messages    :38:40      3
motorola t730 text messages    :38:40      5
motorola t730 text messages    :38:40      7
spike muscle car               :57:43      2
spike muscle car               :57:43      5
spike muscle car               :00:22
usps                           :23:21      1
vc2 auctions                   :31:41
auctions for                   :33:47

Microsoft 2006 RFP dataset
(Several field values are truncated or missing in this transcript.)

Query table. Columns: Time, Query, QueryID, SessionID, ResultCount
:00:01  defination Gravitational  46c13f0705f6436b  19ab975e898d46d1  11
:00:01  kimclement  a3d2cae45e2b4c5b  1b748d1afa9b
:00:01  scientology crazy beliefs  ef33d14ed2  10f477402db84c9a  10
:00:  bdf8834d  eb6bf174c5c  9
:00:04  f92efd ac4  193f9f8442d44c48  0
:00:08  What is May Day?  37afe7af832649d2  21f6a0dfea4348ac  14
:00:10  vikings draft choices suck  b0519e4528d84b44  196b0bb2f1d643f2  10
:00:10  wwwcrownawards.com  9eda4716dfb045e2  04e3a26067a
:00:15  Australian miners  ba6d190cc4cd4fd3  136fd5e571d

Click table. Columns: QueryID, Query, Time, URL, Position
a718649f2  schwab  :07:
d43b549c1  us geography  :23:
d43b549c1  us geography  :23:
aa52e4fbc  wwf  :25:
aa6e27443f  biggercity  :30:
aac1f6423f  studios  :21:
d8afaa459a  :22:
c2848e4a68  north hills school district  :29:12  1

How to use the query log for query recommendation? Click-through data
Click-through data records the documents a user clicks after submitting a query to the search engine.
Basic assumption: if a user clicks a document after issuing a query, the clicked document is more or less relevant to the submitted query, so the query can be represented by its clicked documents.
Query feature representation: if two queries share many co-clicked documents, they have similar search intent.
Query-URL graph: queries and their clicked URLs are linked in a graph.
Related work: [Mei, CIKM'08], [Beeferman, KDD'00]
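The co-click assumption can be sketched as follows. The Jaccard overlap used here is an assumed similarity measure chosen for illustration; the slides do not name one. The function names and the toy click data are hypothetical.

```python
from collections import defaultdict


def build_query_url_graph(clicks):
    """clicks: iterable of (query, url) pairs -> {query: {url: click_count}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, url in clicks:
        graph[query][url] += 1
    return graph


def coclick_similarity(graph, q1, q2):
    """Jaccard overlap of the clicked-URL sets of two queries (assumed measure)."""
    urls1 = set(graph.get(q1, {}))
    urls2 = set(graph.get(q2, {}))
    if not urls1 or not urls2:
        return 0.0
    return len(urls1 & urls2) / len(urls1 | urls2)


# Toy click-through data with placeholder URLs.
toy_clicks = [
    ("aa", "aa.example"),
    ("american airlines", "aa.example"),
    ("american airlines", "wiki.example"),
    ("alcoholics anonymous", "aa-org.example"),
]
toy_graph = build_query_url_graph(toy_clicks)
```

In this toy data "aa" and "american airlines" share one of two distinct URLs, so their similarity is 0.5, while "aa" and "alcoholics anonymous" share none.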

How to use the query log for query recommendation? Query sessions
Query session: a single user submits a sequence of related queries within a time interval for a specific search task.
Basic assumption: if two queries frequently co-occur in the same sessions, they are relevant to each other. Queries submitted consecutively by the same user within a short time interval share similar search intent.
Approaches: association rules [Fonseca, LA-WEB'03]; query graphs [Zhang, WWW'06], [Boldi, CIKM'08, WSCD'09]
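A minimal sketch of session extraction and adjacency counting, assuming a fixed inactivity timeout. The 30-minute value, the function names, and the event tuples are assumptions for illustration only.

```python
from collections import Counter


def split_sessions(events, gap=1800):
    """events: list of (user_id, timestamp_seconds, query) tuples in any order.

    A new session starts when the user changes or when more than `gap`
    seconds pass between two queries of the same user (assumed rule).
    """
    sessions = []
    last_user, last_time = None, None
    for user, ts, query in sorted(events):
        if user != last_user or ts - last_time > gap:
            sessions.append([])
        sessions[-1].append(query)
        last_user, last_time = user, ts
    return sessions


def adjacent_pair_counts(sessions):
    """Count how often query b directly follows query a inside a session."""
    counts = Counter()
    for session in sessions:
        for a, b in zip(session, session[1:]):
            counts[(a, b)] += 1
    return counts


toy_events = [
    ("u1", 0, "motorola text messages"),
    ("u1", 189, "motorola t730 text messages"),
    ("u1", 10000, "usps"),
    ("u2", 5, "spike muscle car"),
]
toy_sessions = split_sessions(toy_events)
toy_counts = adjacent_pair_counts(toy_sessions)
```

The pair counts are the simplest form of session evidence; the query-graph methods later in the deck build weighted edges from counts like these.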

High Relevant Query Recommendation
Query Suggestion Using Hitting Time (CIKM'08): click-through data, query-URL bipartite graph
Query Suggestions Using Query-Flow Graphs (WSCD'09): session data, query-flow graph

Query Suggestion Using Hitting Time (CIKM’08)
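Query-URL bipartite graph
Edges run between V1 and V2; there are no edges inside V1 or inside V2.
Edges are weighted.
Example: V1 = queries, V2 = URLs.
Transition probabilities are defined on this graph.
[Figure: example bipartite graph over V1 and V2 with nodes i, j, and A; the transition probability formula is not preserved in this transcript.]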

Random Walk and Hitting Time
Hitting time: how long does it take a random walk starting at node b to hit node a?
Start at node 1.
Pick a neighbor i according to the transition probability.
Move to i (t=1).
Continue (t=2, ...).
If the random walk hits a node quickly, that node is close to the start node. This is what hitting time measures.
[Figure: random walk on an example graph with nodes 1 to 5, shown at t=1 and t=2.]

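Hitting time from i to A
Let's be more specific about the definition: the hitting time from node i to a set of nodes A is the expected number of steps a random walk starting at i takes to reach A for the first time.
[Figure: graph G with start node i, target set A, and intermediate nodes j and k.]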

Generate Query Suggestion
Step 1: construct a (kNN) subgraph from the query log data, with a predefined number of queries/URLs.
Step 2: compute the transition probabilities p(i → j).
Step 3: compute the hitting time h_iA.
Step 4: rank candidate queries using h_iA.
[Figure: example query-URL subgraph with the queries "aa", "mexiana", and "american airline", URLs such as planner_main.jsp and en.wikipedia.org/wiki/Mexicana, and numeric edge weights (e.g., 300, 15).]
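The hitting-time step can be sketched with the standard recurrence h_iA = 0 for i in A and h_iA = 1 + sum_j p(i → j) h_jA otherwise, solved by fixed-point iteration. This is an illustrative sketch under that assumption, not the implementation from the CIKM'08 paper; the toy graph and the iteration count are made up.

```python
def hitting_times(P, targets, iterations=100):
    """Approximate expected hitting times to the target set.

    P: {node: {neighbor: transition_probability}} with rows summing to 1.
    targets: set of nodes A. Returns {node: approximate h_iA}.
    """
    h = {i: 0.0 for i in P}
    for _ in range(iterations):
        new_h = {}
        for i in P:
            if i in targets:
                new_h[i] = 0.0  # already in A
            else:
                # one step plus the expected remaining time from the next node
                new_h[i] = 1.0 + sum(p * h[j] for j, p in P[i].items())
        h = new_h
    return h


# Toy walk: from x, reach A directly half of the time, otherwise detour via y.
toy_P = {
    "A": {"A": 1.0},
    "x": {"A": 0.5, "y": 0.5},
    "y": {"x": 1.0},
}
toy_h = hitting_times(toy_P, {"A"})
ranked = sorted((n for n in toy_P if n != "A"), key=lambda n: toy_h[n])
```

Here the exact values are h_x = 3 and h_y = 4, so x ranks ahead of y: a smaller hitting time means the node is closer to A.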

Result: Query Suggestion
Query = 'aa'
Yahoo: aa route planner; aa route finder; aa airlines; aa meetings; aa autoroute; aa road map
Live: aa route finder; aa route planner; aa airlines; american airlines; aa meeting; aa road map
Hitting time: alcoholics anonymous; automobile association; theaa; american airlines; american air; american airline ticket reservation


Query Suggestions Using Query-Flow Graphs (WSCD’09)
Session Data Definition: the sequence of queries of one particular user within a specific time limit

Query Graph
Similarity between queries is defined for two cases: two consecutive queries, and queries that are not neighbors in the same session. (The similarity values are not preserved in this transcript.)
The model accumulates many query sessions and adds up the similarity values over all occurrences of the same query pair.
Z. Zhang and O. Nasraoui. Mining search engine query logs for query recommendation. In WWW, pages 1039–1040, 2006.

Query-Flow Graph P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, S. Vigna: “The query-flow graph: model and applications”. CIKM 2008.

Build Query-flow Graph
The key aspect of constructing the query-flow graph is defining the weighting function w. Here the weight of an edge is based on the number of times the transition between the two queries was observed in the same search session.

Query Recommendation
The query recommendation methods are based on the probability of being at a certain node after performing a random walk over a query graph.
Random walk with restart: a random surfer starts at the initial query q. At each step, with probability α the surfer follows one of the outlinks from the current node, and with probability 1 - α the surfer jumps back to q.
M: the transition matrix of the resulting Markov chain
P: the row-normalized weight matrix of the query-flow graph
e_j: the vector whose j-th entry is 1 and whose other entries are 0
Consistent with the process above, M = αP + (1 - α) 1 e_j^T, where 1 is the all-ones column vector and j is the index of q.
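A minimal sketch of random walk with restart on a toy query graph, iterating x ← x M with M = αP + (1 - α) 1 e_q^T. It assumes every node has outgoing edges whose probabilities sum to 1; the graph, the α value, and the iteration count are illustrative assumptions.

```python
def random_walk_with_restart(P, start, alpha=0.85, iterations=100):
    """Return the approximate stationary distribution of a walk that restarts at `start`.

    P: {node: {next_node: prob}}, row-normalized, with no dangling nodes (assumed).
    alpha: probability of following an outlink; 1 - alpha is the restart probability.
    """
    nodes = list(P)
    x = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for i in nodes:
            for j, p in P[i].items():
                nxt[j] += alpha * x[i] * p   # follow an outlink
        nxt[start] += 1.0 - alpha            # jump back to the initial query
        x = nxt
    return x


toy_graph = {
    "apple": {"apple ipod": 0.5, "apple store": 0.5},
    "apple ipod": {"apple": 1.0},
    "apple store": {"apple": 1.0},
}
scores = random_walk_with_restart(toy_graph, "apple", alpha=0.5)
suggestions = sorted((n for n in toy_graph if n != "apple"),
                     key=lambda n: scores[n], reverse=True)
```

Candidate queries are then ranked by their score; in this symmetric toy graph both neighbors of "apple" receive the same score of 1/6.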

Random walks
Random walks on graphs correspond to Markov chains.
The set of states S is the set of nodes of the graph G.
Each entry of the transition probability matrix is the probability of following an edge from one node to another.

Definitions: adjacency matrix A, transition matrix P
[Figure: example graph with edge transition probabilities 1 and 1/2, and a random walk on it shown step by step at t=0, t=1, t=2, and t=3.]

Probability Distributions
x_t(i) = probability that the surfer is on node i at time t
x_{t+1}(i) = ∑_j (probability of being at node j) * Pr(j → i) = ∑_j x_t(j) * P(j, i)
x_{t+1} = x_t P = x_{t-1} P^2 = x_{t-2} P^3 = … = x_0 P^{t+1}

What happens when the surfer keeps walking for a long time?
Stationary distribution
Intuitively, the stationary distribution at a node is related to the amount of time a random walker spends visiting that node.
Mathematically, recall that the probability distribution evolves as x_{t+1} = x_t P. For the stationary distribution v_0 we have v_0 = v_0 P, so v_0 is a left eigenvector of the transition matrix P with eigenvalue 1.
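The update x_{t+1} = x_t P can be run until it stops changing, which gives a simple way to approximate the stationary distribution. This is a sketch that assumes the chain is ergodic (irreducible and aperiodic); the 2-state matrix is a made-up example.

```python
def stationary_distribution(P, iterations=1000):
    """Power iteration x <- x P on a row-stochastic matrix given as a list of rows."""
    n = len(P)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        x = [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]
    return x


toy_P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
v = stationary_distribution(toy_P)
# Check the fixed-point property v = v P.
v_next = [sum(v[j] * toy_P[j][i] for j in range(2)) for i in range(2)]
```

For this matrix the balance condition 0.1 * v[0] = 0.5 * v[1] gives v = (5/6, 1/6): the walker spends five sixths of its time in the first state.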

Interesting questions
Does a stationary distribution always exist? Is it unique?
Yes, if the graph is "well-behaved", i.e., P is ergodic. P is ergodic if it is irreducible and aperiodic.
Irreducible: there is a path from every node to every other node.
Aperiodic: state i is periodic with period k > 1 if all paths from i back to i have a length that is a multiple of k. Otherwise it is aperiodic.
[Figure: example graphs labeled "Aperiodic", "Periodicity is 3", "Irreducible", and "Not irreducible".]

If a Markov chain P is irreducible and aperiodic, then the largest eigenvalue of the transition matrix is equal to 1 and all other eigenvalues are strictly less than 1 in absolute value. Let the eigenvalues of P be {σ_i | i = 0:n-1}, in non-increasing order of |σ_i|. Then σ_0 = 1 > |σ_1| >= |σ_2| >= … >= |σ_{n-1}|.

Result: Query Suggestion (q =“apple” and q =“jeep” )

Why Diversity Query Recommendation
In query recommendation, providing only "relevant" recommendations is far from satisfying users' information needs.
Original query: Apple
Relevance-oriented recommendations: apple ipad 3, apple tree, apple iphone 4s, apple seed, apple computer, …
The queries we recommend should cover multiple potential search intents of users and minimize the risk that users will not be satisfied.

High Diversity Query Recommendation
Diversifying Query Suggestion Results [Hao Ma, AAAI'10]: query-URL graph, hitting time
A Unified Framework for Recommending Diverse and Relevant Queries [Xiaofei Zhu, WWW'11]: manifold, manifold ranking with stop points


Graph Construction Figure 1: Example for Bipartite Graph
(extracted from the clickthrough data)

Determining the First Suggested Query
Initial transition probability
The initial transition probability from node i to node j is built from two quantities: the click frequency between node i and node j, and a normalization term, the total number of times that query node i has been issued in the dataset. (The symbols and the formula are not preserved in this transcript.)

Random jump
In addition to the transition probability, there are random relations among different queries, so the model adds a uniform random relation among queries. A parameter gives the probability of taking a "random jump", i.e., of transiting among different queries. Without any prior knowledge, the jump distribution is set using d, a uniform stochastic distribution vector. (The symbols and the formula are not preserved in this transcript.)

Random walk on the query-URL graph
With the transition probability matrix P defined, we can perform the random walk on the query-URL graph. The probability of transition from node i to node j after a t-step random walk is P_t(i, j) = [P^t]_ij, the (i, j) entry of the t-th power of P.
Explanation:
1) The random walk sums the probabilities of all paths of length t between the two nodes. If there are many paths, the transition probability will be high.
2) The larger the transition probability P_t(i, j) is, the more similar node j is to node i.

After performing a t-step random walk, the query with the largest transition probability from node q is recommended as the first suggested query.
The parameter t determines the resolution of the Markov random walk.
Large t: the random walk depends more on the graph structure.
Small t: the random walk preserves information about the starting node.

Ranking the Rest Queries
Hitting time is used to rank and diversify the rest of the queries.
Hitting time: let S be a subset of the vertex set V. The expected hitting time h(i|S) is the expected number of steps before a random walk starting at node i first visits the set S.
With N(i) denoting the neighbors of node i, the standard recurrence is h(i|S) = 0 for i in S, and h(i|S) = 1 + ∑_{j ∈ N(i)} p_ij h(j|S) otherwise.

Ranking the Rest Queries
Property: nodes strongly connected to s1 receive far fewer visits from the random walk, while nodes far away from s1 still allow the random walk to move among them and thus receive more visits.
Second suggestion: select the node s2 ∈ Q with the largest expected hitting time to the subset S containing the two nodes q and s1.
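The selection of the second suggestion can be sketched as follows: compute the expected hitting time to the already chosen set S and pick the candidate with the largest value. The fixed-point solver, the function names, and the toy graph are assumptions for illustration, not the AAAI'10 implementation.

```python
def expected_hitting_time(P, S, iterations=2000):
    """Approximate h(i|S) by iterating h(i) = 1 + sum_j p_ij * h(j), with h = 0 on S."""
    h = {i: 0.0 for i in P}
    for _ in range(iterations):
        h = {i: 0.0 if i in S else 1.0 + sum(p * h[j] for j, p in P[i].items())
             for i in P}
    return h


def next_diverse_suggestion(P, selected, candidates):
    """Pick the candidate with the largest expected hitting time to the selected set."""
    h = expected_hitting_time(P, set(selected))
    remaining = [c for c in candidates if c not in selected]
    return max(remaining, key=lambda c: h[c])


# Toy query graph: "apple ipad" is tightly linked to "apple iphone",
# while "apple tree" and "apple seed" mostly link to each other.
toy_P = {
    "apple": {"apple iphone": 1.0},
    "apple iphone": {"apple ipad": 1.0},
    "apple ipad": {"apple iphone": 0.8, "apple tree": 0.2},
    "apple tree": {"apple seed": 0.9, "apple ipad": 0.1},
    "apple seed": {"apple tree": 1.0},
}
chosen = ["apple", "apple iphone"]  # q and s1
toy_h = expected_hitting_time(toy_P, set(chosen))
s2 = next_diverse_suggestion(toy_P, chosen, ["apple ipad", "apple tree", "apple seed"])
```

In this toy graph the hitting times are 6 for "apple ipad", 25 for "apple tree", and 26 for "apple seed", so the second suggestion comes from the intent that is farthest from the one already covered.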

Result: Query Suggestion


A novel unified framework: manifold ranking with stop points
Query recommendation needs both relevance and diversity.
Relevance: manifold ranking.
Diversity: introduce stop points.

[Figure: queries query1, query2, ..., queryn and their affinity matrix W.]

Traditional manifold ranking process
Steps 1 to 3 (the formulas are not preserved in this transcript), where W is the affinity matrix and D is a diagonal matrix.
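Since the three steps are not preserved, the sketch below uses the manifold ranking iteration as it is commonly stated in the literature: normalize S = D^{-1/2} W D^{-1/2} with D_ii = ∑_j W_ij, iterate f ← αSf + (1 - α)y, and rank by the converged scores. Treat it as an assumption about what the slide showed. It does not include the stop points, and the toy matrix, α, and iteration count are made up.

```python
import math


def manifold_ranking(W, y, alpha=0.99, iterations=200):
    """Commonly cited manifold ranking iteration (assumed form, without stop points).

    W: symmetric non-negative affinity matrix (list of rows).
    y: initial scores, 1.0 for the query node and 0.0 elsewhere.
    """
    n = len(W)
    d = [sum(row) for row in W]  # D_ii = sum_j W_ij
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) if W[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    f = list(y)
    for _ in range(iterations):
        # Spread scores to neighbors, then mix the initial scores back in
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f


# Path graph 0 - 1 - 2 with the query at node 0.
toy_W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
toy_f = manifold_ranking(toy_W, [1.0, 0.0, 0.0], alpha=0.5)
```

Scores decay with distance from the query node along the path, which is the relevance signal the unified framework starts from.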

Manifold ranking with stop points


Results: Query recommendation (‘abc’, ‘yamaha’)

Evaluation Metrics: Automatic Evaluation
Open Directory Project (ODP) <-> Relevance
Given two queries q and q', with ODP categories
c(q): 'Arts/Television/News'
c(q'): 'Arts/Television/Stations/North America/United States'
l(c, c'): their longest common prefix, e.g., 'Arts/Television' (length 2)
max(|c|, |c'|): the length of the longer of the two category paths, e.g., 5
Relevance is the ratio of the two, |l(c, c')| / max(|c|, |c'|), which is 2/5 in this example.
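The ODP-based relevance can be sketched directly from the definitions above, assuming it is the common-prefix length divided by the longer path length. The function name is hypothetical.

```python
def odp_relevance(category_a, category_b):
    """Length of the longest common category prefix divided by the longer path length."""
    path_a = category_a.split("/")
    path_b = category_b.split("/")
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return common / max(len(path_a), len(path_b))


score = odp_relevance(
    "Arts/Television/News",
    "Arts/Television/Stations/North America/United States",
)
```

For the two categories on the slide, the common prefix 'Arts/Television' has length 2 and the longer path has length 5, so the score is 0.4.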


Evaluation Metrics: Automatic Evaluation
Commercial search engine (e.g., Google) <-> Diversity
Given two queries q and q', o(q, q') is the number of overlapping URLs among the top k search results of q and q'.
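A sketch of the overlap count o(q, q') and a diversity score built from it. The form 1 - o(q, q') / k is an assumption, since the slide's formula is not preserved; the result lists are placeholders rather than real search results.

```python
def result_overlap(results_q, results_q2, k=10):
    """o(q, q'): number of URLs shared by the top-k results of two queries."""
    return len(set(results_q[:k]) & set(results_q2[:k]))


def diversity(results_q, results_q2, k=10):
    """Assumed diversity score: 1 - o(q, q') / k."""
    return 1.0 - result_overlap(results_q, results_q2, k) / k


# Placeholder top-4 result lists for two queries.
top_q = ["u1", "u2", "u3", "u4"]
top_q2 = ["u3", "u4", "u5", "u6"]
overlap = result_overlap(top_q, top_q2, k=4)
score = diversity(top_q, top_q2, k=4)
```

Two queries that return the same top-k pages score 0, and two queries with disjoint results score 1.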


Evaluation Metrics: Automatic Evaluation
Open Directory Project (ODP) <-> Relevance
Commercial search engine (e.g., Google) <-> Diversity
Combined metric: Q-measure, where β is a parameter that controls the tradeoff between relevance and diversity.

Experiments
[Figure: average Q-measure of query recommendation over different recommendation sizes under 5 approaches, with the proposed method highlighted.]

Experiments: Manual Evaluation
A recommendation pool is labeled by 3 human judges.
[Figure: labeling tool showing the recommendation pool and the search results.]

Experiments: Evaluation Metrics
α-nDCG (α-normalized Discounted Cumulative Gain)
Intent-Coverage

Experiments Table 2: Performance of recommendation results over a sample of queries under five different approaches.

Why High Utility Query Recommendation
Existing work focuses on recommending queries that are relevant to the user's initial query, at the query level:
Common query terms (Wen J. et al., WWW 2001)
Same clicked documents (Mei Q. et al., CIKM 2008)
Co-occurrence in the same search sessions (Zhang Z. et al., WWW 2006)
Is recommending only relevant queries enough to find useful search results?
[Figure: an initial query linked to query 1, query 2, and query 3.]

Why High Utility Query Recommendation
Example: for the information need "iphone sell time", the query 'iphone start sell' is relevant, but the high utility query to recommend is 'iphone initial release'.

High Utility Query Recommendation
More Than Relevance: High Utility Query Recommendation By Mining Users' Search Behaviors [X.F. Zhu, CIKM'12]: probabilistic graphical model (Query Utility Model)
Recommending High Utility Query via Session-Flow Graph [X.F. Zhu, ECIR'13]: session-flow graph, two-phase model based on absorbing random walk

A Typical Search Session
[Figure: a typical search session, annotated with examples of bad posterior utility and bad perceived utility. Legend: red = relevant, √ = attractiveness.]

Probabilistic Graphical Model
R_i: whether there is a reformulation at position i
C_i: whether the user clicks on some of the search results of the reformulation at position i
A_i: whether the user is attracted by the search results of the reformulation at position i
S_i: whether the user's information needs have been satisfied at position i

Parameter Estimation
The parameters are estimated by maximum likelihood estimation. The log-likelihood function is maximized with a Lagrange multiplier and a regularization term, and the resulting optimality condition is solved with the Newton-Raphson method. (The formulas are not preserved in this transcript.)

Experimental Results Dataset
Our experiments are based on a publicly available query log, the UFindIt log data. There are 40 search tasks in total, represented by 40 test queries.

Experimental Results: Metrics
QRR (Query Relevant Ratio): measures the probability that a user finds relevant results when she uses query q for her search task.
MRD (Mean Relevant Document): measures the average number of relevant results a user finds when she uses query q for her search task.

Experimental Results: Compared Methods
Adjacency (ADJ): given a test query q, the most frequent queries adjacent to q in the same session are recommended to users [WWW'06].
Co-occurrence (CO): given a test query q, the most frequent queries co-occurring with q in the same session are selected as recommendations [WSDM'10].
Query-Flow Graph (QF): builds a query-flow graph from collective search sessions and performs a random walk on this graph for query recommendation [CIKM'08].
Click-through Graph (CT): builds a query-URL bipartite graph and uses the hitting time as a measure to select queries for recommendation [CIKM'08].
Query Utility Model (QUM): scores a query by the expected information gain users obtain from its search results with respect to their original information needs, which is the product of the two component utilities.
Perceived Utility method (PCU) and Posterior Utility method (PTU): the two component utilities of the QUM method (perceived utility and posterior utility), each used on its own.

Experiments
[Figure: impact of the parameter μ on the performance of QUM.]

Limitation of QUM method
QUM cannot make full use of the click-through information. It only considers whether the search results of a reformulated query have any clicked documents, and does not take the individual clicked documents into consideration. A new method is needed that captures these specific clicked documents when modeling query utility.

Framework of Our Approach
Two-phase model based on Absorbing Random Walk (TARW)
Query-flow graph: reformulation behaviors, random walk.
Session-flow graph: adds document nodes, covers reformulation behaviors + click behaviors, and uses an absorbing random walk with absorbing states.

Session Flow Graph
Query-flow graph: Boldi et al. (CIKM 2008)
Query sessions:
q → q1 → q3
q → q3 → q4
q → q4
⁞

Session Flow Graph
The session-flow graph expands the query-flow graph with document nodes and failure nodes.
Query sessions:
q → q1:u1:u2 → q3:u3
q → q3 → q4:u4:u5
q → q4:u6
⁞

Session Flow Graph Definition: Nodes Adjacency Matrix Edges

Two-phase model based on absorbing random walk (TARW)
Forward utility propagation: the utility score is transferred from the original query node to the reformulation nodes, and is finally absorbed by the document nodes and the failure node.
Backward utility propagation: the utility score is transferred back from the document nodes to the reformulation nodes.
Recommendation: the queries with the highest utilities.

Forward Utility Propagation
Assign a transition probability to each type of node (reformulation, document, failure):
Reformulation node: α1
Document node: α2
Failure node: α3
α1 + α2 + α3 = 1

Parameter setting
Previous work (Sadikov, WWW 2010): uses one shared transition probability setting (α1, α2, α3) for the different types of nodes, with α1 for reformulation nodes, α2 for document nodes, and α3 for the failure node.
Our work: assigns the transition probabilities based on the characteristics of each candidate query, using a prior transition probability, an observed transition probability, and a posterior transition probability.

Transition Probability
Reformulation Nodes Document Nodes: Failure Node:

Computing the Distribution
In the forward utility propagation, the corresponding transition matrix has the block form
P = [ P_Q  P_D  P_S ]
    [ 0    I_D  0   ]
    [ 0    0    I_S ]
P_Q: n × n transition matrix on query nodes
P_D: n × m matrix of transitions from query nodes to document nodes
P_S: n × 1 matrix of transitions from query nodes to the failure node
I_D, I_S: identity matrices, denoting that document nodes and failure nodes are absorbing states
The chain is reducible, so it has no unique stationary distribution.

Computing the Distribution
The absorbing distribution is computed iteratively. P^t[i, j] is the probability of moving from node i to node j after a t-step walk.
We only have to compute the probabilities from queries to documents: O(t n^3 + n^2 m).
In the recommendation scenario, only the probabilities from the original query to the documents are needed, i.e., only the matrix row of the original query has to be computed: O(t n^2 + n m).
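The row-only computation can be sketched as follows: propagate the original query's row through P_Q for t steps while summing the visit mass on each query node, then apply P_D and P_S once. This is an illustrative sketch under those assumptions; the matrices are toy values and the function name is hypothetical.

```python
def absorbing_distribution(PQ, PD, PS, start, steps=100):
    """Probability of being absorbed at each document, and at the failure node.

    PQ: n x n query-to-query transitions, PD: n x m query-to-document,
    PS: length-n query-to-failure. Each row of [PQ | PD | PS] sums to 1.
    Cost: O(steps * n^2) for the walk plus O(n * m) for the final projection.
    """
    n, m = len(PQ), len(PD[0])
    x = [0.0] * n
    x[start] = 1.0
    visits = [0.0] * n  # expected number of visits to each query node
    for _ in range(steps):
        visits = [v + xi for v, xi in zip(visits, x)]
        x = [sum(x[i] * PQ[i][j] for i in range(n)) for j in range(n)]
    docs = [sum(visits[i] * PD[i][d] for i in range(n)) for d in range(m)]
    fail = sum(visits[i] * PS[i] for i in range(n))
    return docs, fail


# Toy session-flow graph: 2 query nodes, 2 document nodes, 1 failure node.
toy_PQ = [[0.0, 0.5], [0.0, 0.0]]
toy_PD = [[0.3, 0.0], [0.0, 0.6]]
toy_PS = [0.2, 0.4]
toy_docs, toy_fail = absorbing_distribution(toy_PQ, toy_PD, toy_PS, start=0)
```

In the toy graph the original query is absorbed at each document with probability 0.3 and at the failure node with probability 0.4, and the three values sum to 1.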

Backward Utility Propagation

Experimental Results
The dataset (UFindIt log data, 40 search tasks represented by 40 test queries) and the metrics (QRR and MRD) are the same as in the QUM experiments above.

Experimental Results: Overall Evaluation Results
The TARW method performs significantly better than all the baseline recommendation methods (p-value <= 0.05).

Evaluation of Document Utility
Baseline methods:
Document Frequency Based Method (DF): the click frequency of a document reflects users' preference for that document when they search with the original query.
Session Document Frequency Based Method (SDF): clicked documents within the same search session convey similar search intent.
Markov-model Based Method (MM): based on the document distribution learned for the original query by a Markov-model based method.

Metrics: Precision at position Normalized Discounted Cumulative Gain(NDCG) Mean Average Precision (MAP)

TARW improves over MM by:
using an adaptive transition probability setting for the different types of nodes;
modeling users' behavior of giving up their search tasks by introducing the failure nodes.

Summary: query recommendation techniques
High Relevant Query Recommendation
High Diversity Query Recommendation
High Utility Query Recommendation
