Statistical Models for Web Search Click Log Analysis

Fan Guo (Carnegie Mellon University) and Chao Liu (Microsoft Research-Redmond)

Prologue Search Results for “CIKM” # of clicks received 3/27/2017
CIKM'09 Tutorial, Hong Kong, China

Prologue: Adapt ranking to user clicks? (figure: # of clicks received)

Prologue: Tools needed for non-trivial cases (figure: # of clicks received)

Motivation – Click Data Are Valuable
One of the most extensive (yet indirect) surveys of user experience.
For researchers:
  Help understand human interaction with IR results
  Design and calibrate novel models and hypotheses
For practitioners:
  Measure, monitor and improve search engine performance
  Attract more page views and clicks, boost profit

Tutorial Goals
Introduce problems and applications in web search click modeling.
Present the latest developments of click models in web search.
Provide examples and discuss trade-offs for model design, implementation and evaluation.

Presenters – Fan Guo
Ph.D. Student (exp. 2011), Computer Science Department, Carnegie Mellon University
Advisor: Christos Faloutsos
Dissertation topic: graph mining for large bioinformatics image databases
2008, M.S., CMU
2005, B.E., Tsinghua University, Beijing, China

Presenters – Chao Liu
Researcher, Internet Services Research Center (ISRC), MSR-Redmond
Research focus: large-scale search/browsing log analysis for effective Web information access
2007, Ph.D., UIUC; 2005, M.S., UIUC
Advisor: Jiawei Han
Dissertation on statistical debugging and automated failure analysis
2003, B.S., Peking University, China

Outline
Introduction
Designing click models
Bayesian click models
Selected topics on click models
Conclusion

Outline
Introduction
  Web search click logs
  Interpret clicks as relevance feedback
  Building statistical models for clicks
  Applications of click models
Designing click models
Bayesian click models
Selected topics on click models
Conclusion

Diverse User Feedback
Click-through
Browser action
Dwelling time
Explicit judgment
Other page elements

Web Search Click Log
Auto-generated data keeping important information about search activity.
Query: cikm
Session ID: f851c5af178384d12f3d
Position  URL                   Click
1         cikm2008.org
2
3
4
5
6         cikmconference.org
7         Ir.iit.edu/cikm2004
8
9
10

Web Search Click Log: a real-world example

Web Search Click Log: How large is the click log?
Search logs: 10+ TB/day
In existing publications:
  [Craswell+08]: 108k sessions
  [Dupret+08]: 4.5M sessions (21 subsets * 216k sessions)
  [Guo+09a]: 8.8M sessions from 110k unique queries
  [Guo+09b]: 8.8M sessions from 110k unique queries
  [Chapelle+09]: 58M sessions from 682k unique queries
  [Liu+09a]: 0.26PB data from 103M unique queries

Web Search Click Log: How large is one ?

Interpret Clicks: an Example
Clicks are good… Are these two clicks equally “good”?
Non-clicks may have excuses:
  Not relevant
  Not examined

Eye-tracking User Study

Click Position-bias
Higher positions receive more user attention (eye fixation) and clicks than lower positions.
This is true even in the extreme setting where the order of positions is reversed.
“Clicks are informative but biased”.
(Figures: percentage by position for the normal impression and for the reversed impression) [Joachims+07]

Clicks as Relative Judgments
“Clicked > Skipped Above” [Joachims02]
Preference pairs: #5>#2, #5>#3, #5>#4.
Use Rank SVM to optimize the retrieval function.
Limitations:
  Confidence of judgments
  Little implication for user modeling
(Figure: a result list with positions 1 to 8)
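The “Clicked > Skipped Above” rule can be sketched in a few lines. This is a minimal illustration, not code from [Joachims02]; the function name and the example click vector (clicks at positions 1 and 5, which is what the listed pairs imply) are assumptions.

```python
def skip_above_pairs(clicks):
    """Return (clicked_position, skipped_position) preference pairs.

    `clicks` is a list of 0/1 flags; index 0 corresponds to position 1.
    A clicked result is preferred over every unclicked result ranked above it.
    """
    pairs = []
    for clicked_pos, clicked in enumerate(clicks, start=1):
        if not clicked:
            continue
        for above_pos in range(1, clicked_pos):
            if not clicks[above_pos - 1]:
                pairs.append((clicked_pos, above_pos))
    return pairs


# Hypothetical session: clicks at positions 1 and 5 out of 8 results.
print(skip_above_pairs([1, 0, 0, 0, 1, 0, 0, 0]))
```

Each pair (a, b) reads "the result at position a is preferred over the result at position b"; such pairs are what a pairwise learner like Rank SVM would consume.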

Problem Definition
Given a set of web search click logs:
Predict clicks: output the probability of click vectors given a new order of URLs. 2^10 possibilities!

The Heart of Solution
Given a set of web search click logs:
Estimate relevance: measures how good a URL is with regard to the information need of the query/user. (Example: relevance score = 0.5)

Measuring Relevance
The probability of a click if the document appears at the top position.
Relevance score = 0.5 indicates that, on average, the document will be clicked once per 2 sessions.
Bayesian click models characterize relevance using a probability distribution.
(Figure: density function over the relevance score)

Desired Properties
Effective: aware of the position-bias and addresses it properly
Scalable: linear complexity for both time and space, easy to parallelize
Incremental: flexible for model updates based on new data

Applications of click models

Optimizing the retrieval function
  Ranking alteration based on clicks [Liu+09b] (figure: example scores 0.90, 0.10, 0.08, 0.05, 0.20, 0.72)
  As a feature to a learning-to-rank system (e.g., RankNet [Burges+05])

Online advertising
  User model for sponsored search auctions
  Click-through rate (CTR) prediction [Zhu+10]
  (Figure: the CTR of ads on a commercial search engine for a typical day; image courtesy of Zeyuan Zhu)

Search engine evaluation
  Pskip [Wang+09]: click-through rate above last clicks; dwelling time features could also be incorporated.
  Search relevance score [Guo+09c]: average relevance score weighted by chance of examination

User behavior analysis
  A preliminary work showing different user behavior patterns for navigational and informational queries [Guo+09c]

Outline
Introduction
Designing click models
  Basic user hypotheses
  Modeling the first click
  Extending to multiple clicks
  Summary of model design
Bayesian click models
Selected topics on click models
Conclusion

Examination Hypothesis [Richardson+07]
A document must be examined before a click.
The (conditional) probability of click upon examination depends on document relevance.
The click probability could be decomposed:
  Global component: the examination probability, which reflects the position-bias
  Local component: depends on the (query, URL) pair only
The building block for every existing model!
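Written out, the decomposition takes the following form. This is a reconstruction from the two statements of the hypothesis (a click requires examination, so the E_i = 0 term vanishes); the brace labels are annotations added here.

```latex
P(C_i = 1)
  = \underbrace{P(E_i = 1)}_{\text{global: position-bias}}
    \cdot
    \underbrace{P(C_i = 1 \mid E_i = 1)}_{\text{local: (query, URL) relevance}}
```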

Cascade Hypothesis [Craswell+08]
The first document is always examined.
First-order Markov property: examination at position (i+1) depends on examination and click at position i only.
Examination follows a strict linear order: Position i → Position (i+1).
Limitation: the examination/click rate monotonically decreases with rank, which is not always true.
Some models do not follow this hypothesis (e.g., UBM).
(Figures: web search data in [Guo+09a]; ads click data in [Zhu+10])

Cascade model
Put together two hypotheses:
Cascade Model [Craswell+08] = examination hypothesis + cascade hypothesis + modeling a single click
Formal model specification:
  P(C_i=1 | E_i=0) = 0,  P(C_i=1 | E_i=1) = r_{u_i}
  P(E_1=1) = 1,  P(E_{i+1}=1 | E_i=0) = 0
  P(E_{i+1}=1 | E_i=1, C_i=0) = 1
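Under this specification the user scans from the top and stops at the first click, so the probability of a click at position i is the relevance at i times the probability that every earlier result was examined and skipped. A minimal sketch follows; the function name and the relevance values are made up for illustration.

```python
def cascade_click_probs(relevances):
    """Per-position click probabilities under the cascade model.

    `relevances[i]` is P(C=1 | E=1) for the URL shown at position i+1.
    Returns (click_probs, no_click_prob).
    """
    click_probs = []
    reach = 1.0  # P(E_1 = 1) = 1: the first document is always examined
    for r in relevances:
        click_probs.append(reach * r)  # examined and clicked here
        reach *= 1.0 - r               # examined, skipped, move to next
    return click_probs, reach


# Hypothetical relevance values for a three-result page.
probs, no_click = cascade_click_probs([0.5, 0.5, 0.2])
print(probs, no_click)
```

Because the model allows at most one click, the per-position probabilities and the no-click probability sum to one.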

Cascade model
The user behavior chart (i is the index for the URL at position i):
  Examine the URL → Click? Yes: Done. No: See Next URL? Yes: examine the URL at the next position.

Alternatives
First click in the Click Chain Model [Guo+09b] as well as the Dynamic Bayesian Network model [Chapelle+09]:
  Same chart, with an added "No" branch at "See Next URL?" that leads to Done: the chance that the user may immediately abandon examination w/o a click.

Alternatives
First click in the User Browsing Model [Dupret+08]:
  Same chart, with i ← i+1 when moving to the next URL and position-dependent parameters.

Dependent Click Model [Guo+09a]
Generalize the cascade model to 1+ clicks:
  P(C_i=1 | E_i=0) = 0,  P(C_i=1 | E_i=1) = r_{u_i}
  P(E_1=1) = 1,  P(E_{i+1}=1 | E_i=0) = 0
  P(E_{i+1}=1 | E_i=1, C_i=0) = 1
  P(E_{i+1}=1 | E_i=1, C_i=1) = λ_i
λ: global parameters characterizing user browsing behavior

DCM Algorithms
Input: for each query session, the query term, with a (URL, clicked) tuple for all top-10 positions.
Output: relevance for each (query, URL) pair; global parameters for user behavior.
Method: approximate* maximum-likelihood estimation.
* Footnote: the algorithm maximizes a lower bound of the log-likelihood function.

Detour: last clicked position
Query: cikm
Session ID: f851c5af178384d12f3d
Position  URL                   Click
1         cikm2008.org
2
3
4
5
6         cikmconference.org
7         Ir.iit.edu/cikm2004
8
9
10
(The slide marks the last clicked position.)

Detour: last clicked position
Query: cikm
Session ID: ab8dee4c4dd21e6aaf03
Position  URL                   Click
1         cikm2008.org
2
3
4
5         cikmconference.org
6
7         Ir.iit.edu/cikm2004
8
9
10
(The slide marks the last clicked position.)

DCM Algorithms [Guo+09a]
The estimation formula for relevance: empirical CTR measured before the last clicked position.
The estimation formula for global (user behavior) parameters: empirical probability of “clicked-but-not-last”.
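The two estimates described in words above can be sketched as simple counting. This is an illustrative reading of the slide text, not the authors' implementation: the function name and session format are hypothetical, "before the last clicked position" is taken to include the last clicked position itself, and sessions without any click are assumed to count every position as examined.

```python
from collections import defaultdict


def fit_dcm(sessions, num_positions=10):
    """Estimate DCM relevance and lambda parameters by counting.

    `sessions` is an iterable of (query, urls, clicks) with 0/1 click flags.
    """
    clicks_qu = defaultdict(int)     # clicks per (query, URL)
    examined_qu = defaultdict(int)   # impressions up to the last click
    clicked_at = [0] * num_positions
    last_click_at = [0] * num_positions

    for query, urls, clicks in sessions:
        clicked = [i for i, c in enumerate(clicks) if c]
        # Assumption: with no click, all shown positions count as examined.
        last = clicked[-1] if clicked else len(urls) - 1
        for i in range(last + 1):
            examined_qu[(query, urls[i])] += 1
            clicks_qu[(query, urls[i])] += clicks[i]
        for i in clicked:
            clicked_at[i] += 1
        if clicked:
            last_click_at[last] += 1

    relevance = {k: clicks_qu[k] / n for k, n in examined_qu.items()}
    # lambda_i: share of clicks at position i that were not the last click.
    lam = [1 - last_click_at[i] / clicked_at[i] if clicked_at[i] else None
           for i in range(num_positions)]
    return relevance, lam


# Hypothetical toy log with three positions.
log = [("cikm", ["a", "b", "c"], [1, 0, 1]),
       ("cikm", ["a", "b", "c"], [1, 0, 0])]
relevance, lam = fit_dcm(log, num_positions=3)
print(relevance, lam)
```

Both quantities are ratios of counters, so new sessions can be folded in by incrementing the same counters.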

DCM Implementation [Guo+09a]
Details: keep 3 counts for each (query, URL) pair.

Click Chain Model [Guo+09b]
The examine-next probability depends on the relevance of the URL clicked:
  “Not what I want, go to examine the next”
  “Aha, this is the right one, and I’m done!”
  P(E_{i+1}=1 | E_i=1, C_i=1) = α_2 (1 - r_{u_i}) + α_3 r_{u_i}
  P(E_{i+1}=1 | E_i=1, C_i=0) = α_1
where 0 < α_1 ≤ 1, 0 ≤ α_3 < α_2 ≤ 1
(Figure: the full picture)
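The CCM transition rule is small enough to state as a function. A minimal sketch: the function name is hypothetical and the α values in the example are invented for illustration, not fitted parameters.

```python
def ccm_examine_next(clicked, relevance, alpha1, alpha2, alpha3):
    """P(E_{i+1}=1 | E_i=1, C_i) under the Click Chain Model."""
    if not (0 < alpha1 <= 1 and 0 <= alpha3 < alpha2 <= 1):
        raise ValueError("need 0 < alpha1 <= 1 and 0 <= alpha3 < alpha2 <= 1")
    if clicked:
        # Interpolates between alpha2 (irrelevant) and alpha3 (fully relevant).
        return alpha2 * (1.0 - relevance) + alpha3 * relevance
    return alpha1


# Illustrative parameter values only.
for r in (0.0, 0.5, 1.0):
    print(r, ccm_examine_next(True, r, alpha1=1.0, alpha2=0.8, alpha3=0.3))
```

Since α_3 < α_2, the more relevant the clicked document, the less likely the user is to keep examining, which matches the two quoted user reactions.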

DBN Model [Chapelle+09]
There is a subtle difference between the relevance of the URL snippet and the landing page.
  “hmmm…, this looks pretty nice”
  “errr…, it’s way out of date”
Conclusion: attractive, but not satisfactory.

DBN Model [Chapelle+09]
The examine-next probability depends on the “satisfaction score”:
  P(E_{i+1}=1 | E_i=1, C_i=1) = γ (1 - s_{u_i}) + 0 · s_{u_i}
  P(E_{i+1}=1 | E_i=1, C_i=0) = γ
where 0 < γ ≤ 1
The click probability is associated with the “attractiveness score”:
  P(C_i=1 | E_i=1) = a_{u_i}
(Figure: the full picture)

User Browsing Model [Dupret+08]
The examine-next probability depends on both the preceding clicked position r and the distance to this position d.
Position  URL                   Click
1         cikm2008.org
2
3
4
5         cikmconference.org
6
…
(For the first position: r = 0, d = 1)
Users would lose patience when they browse through without issuing a click.
The probability monotonically drops as d increases and r remains the same.
  P(E_i=1 | C_{1:i-1}) = β_{r_i, d_i}
where r_i = max{ j | j < i, C_j = 1 }, d_i = i - r_i
55 parameters are needed for top-10 positions (0 ≤ r < r+d ≤ 10).
Cascade hypothesis is not assumed.
(Figure: the full picture)
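The definitions of r_i and d_i, and the count of 55 parameters, can be checked with a short sketch. The function names and the example click vector are illustrative; r = 0 is used when no earlier position was clicked, as in the slide's "r = 0, d = 1" case for the first position.

```python
def ubm_indices(clicks):
    """Return the (r_i, d_i) pair for each position of a 0/1 click vector.

    r_i is the nearest clicked position above i (0 if none); d_i = i - r_i.
    """
    pairs = []
    last_click = 0
    for position, clicked in enumerate(clicks, start=1):
        pairs.append((last_click, position - last_click))
        if clicked:
            last_click = position
    return pairs


def num_ubm_params(m):
    """Count (r, d) combinations with 0 <= r < r + d <= m."""
    return sum(1 for r in range(m) for d in range(1, m - r + 1))


# Hypothetical session with clicks at positions 1 and 5.
print(ubm_indices([1, 0, 0, 0, 1, 0]))
print(num_ubm_params(10))
```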

Summary of Model Design

Probability of examining the first URL
Model    P(E_1=1)
Cascade  1
DCM      1
CCM      1*
DBN      1*
UBM      β_{0,1}
* Footnote: it is flexible to add another parameter to specify this probability.

Probability of click upon examination
Model    P(C_i=1 | E_i=1)
Cascade  r_{d_i}
DCM      r_{d_i}
CCM      r_{d_i}*
DBN      a_{d_i}
UBM
* Footnote: the mean of the relevance distribution, detailed in the next part.

Probability of examine-next w/o a click
Model    P(E_{i+1}=1 | E_i=1, C_i=0)
Cascade  1
DCM      1
CCM      α_1
DBN      γ
UBM      β_{r_{i+1}, d_{i+1}}*
* Footnote: the probability does not depend on E_i.

Probability of examine-next after a click
Model    P(E_{i+1}=1 | E_i=1, C_i=1)
Cascade  --
DCM      λ_i
CCM      α_2 (1 - r_{d_i}) + α_3 r_{d_i}
DBN      γ (1 - s_{d_i})
UBM      β_{i,1}

Size of parameter sets
Model    # of global params
Cascade
DCM      9
CCM      3
DBN      1
UBM      55

Inference and estimation algorithms
Model  Single-pass  Details
DCM    Yes          Maximizing a lower bound of LL, fastest
CCM    Yes          No iteration needed, thanks to the Bayesian framework
DBN    No           EM-based, iterative algorithms
UBM    No           EM-based, usually takes ~30 iterations to converge

Outline
Introduction
Designing click models
Bayesian click models
  Bayesian framework and the rationale
  Bayesian Browsing Model: a case study
  Click Chain Model in a nutshell
Selected topics on click models
Conclusion

Primer: Coin-Toss Example
Frequentist: p(H) = 0.8
Bayesian: a “probability” of p(H), moving from a prior to a posterior over p(H)
Prior density: 1
Posterior density function (not normalized) after each toss:
  x
  x^2
  x^3
  x^3 (1-x)
  x^4 (1-x)
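The coin-toss update amounts to incrementing one of two exponents per toss. The sketch below assumes a uniform prior and the toss sequence H, H, H, T, H (four heads and one tail, consistent with the frequentist estimate of 0.8); the function names are illustrative.

```python
def coin_toss_exponents(tosses):
    """Track the exponents (a, b) of the density x^a (1-x)^b toss by toss."""
    heads = tails = 0
    history = []
    for toss in tosses:
        if toss == "H":
            heads += 1
        else:
            tails += 1
        history.append((heads, tails))
    return history


def posterior_mean(heads, tails):
    """Mean of the normalized density x^heads (1-x)^tails on [0, 1].

    With a uniform prior this is a Beta(heads + 1, tails + 1) distribution.
    """
    return (heads + 1) / (heads + tails + 2)


history = coin_toss_exponents("HHHTH")
heads, tails = history[-1]
print(history)
print("frequentist:", heads / (heads + tails))
print("posterior mean:", posterior_mean(heads, tails))
```

The frequentist answer is a single number, while the Bayesian answer is a full density whose mean is pulled slightly toward 0.5 by the prior.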

The graphical model for coin-toss
(Figure: one node X connected to the toss outcomes C1, C2, C3, C4, C5)

Primer: Click Data Example
Density function (not normalized), starting from the prior and updated after each observation:
  x^1 (1-x)^0 (1-0.6x)^0 (1+0.3x)^1 (1-0.5x)^0 (1-0.2x)^0 …
  x^1 (1-x)^1 (1-0.6x)^0 (1+0.3x)^1 (1-0.5x)^0 (1-0.2x)^0 …
  x^2 (1-x)^1 (1-0.6x)^0 (1+0.3x)^2 (1-0.5x)^0 (1-0.2x)^0 …
  x^3 (1-x)^1 (1-0.6x)^1 (1+0.3x)^2 (1-0.5x)^0 (1-0.2x)^0 …
  x^3 (1-x)^1 (1-0.6x)^1 (1+0.3x)^2 (1-0.5x)^1 (1-0.2x)^0 …

Overview
Representation of relevance:
  A probability distribution on [0,1] for each (query, URL) pair.
  The density function is in a polynomial form over a small set of linear factors.
  The coefficients of such linear factors are shared between different (query, URL) pairs.
  Example: x^3 (1-1x)^1 (1-0.6x)^1 (1+0.3x)^2 (1-0.5x)^1 (1-0.2x)^0 …

Overview
Inference:
  Go over each query session once, and update the exponents for the corresponding (query, URL) pairs impressed.*
  Analytical or numerical integration may be needed to compute the normalization constant.
* Footnote: by virtue of the Bayes theorem and the conditional independence relationship/assumption.
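Given the exponents, a relevance score such as the posterior mean can be obtained by numerical integration over [0, 1]. The sketch below uses a midpoint rule; the function name, the (coefficient, exponent) encoding of each factor (1 - coef·x)^exponent, and the default of 100 bins are assumptions made for illustration.

```python
def relevance_posterior_mean(click_exponent, factors, bins=100):
    """Posterior mean of x under x^click_exponent * prod (1 - coef*x)^exp.

    `factors` is a list of (coef, exponent) pairs; a factor written as
    (1 + 0.3x) is encoded with coef = -0.3.
    """
    width = 1.0 / bins
    weighted = total = 0.0
    for b in range(bins):
        x = (b + 0.5) * width  # midpoint of the bin
        density = x ** click_exponent
        for coef, exponent in factors:
            density *= (1.0 - coef * x) ** exponent
        weighted += x * density
        total += density  # the bin width cancels in the ratio
    return weighted / total


# Exponents taken from the example density shown in this deck.
factors = [(1.0, 1), (0.6, 1), (-0.3, 2), (0.5, 1), (0.2, 0)]
print(relevance_posterior_mean(3, factors))
```

The normalization constant never has to be computed explicitly here, because it cancels when the mean is taken as a ratio of two sums.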

Overview
Key problems:
  Which is the right factor to update?
  How to estimate all the coefficients?

Why Bayesian?
Modeling benefits:
  Confidence for the URL relevance estimate
  Relative judgments: the probability that URL i is more relevant to the query than URL j
  Easy to interpret: coefficients in linear factors reflect position-bias and user browsing patterns
Computational benefits:
  Single-pass, linear algorithms; no iterations
  A parallel version is easy to implement

Variable Definition
For a specific query session, let S_i, E_i and C_i denote the variables at position i, where 1 ≤ i ≤ M = 10.

Graphical Illustration
Relevance:    S_1 S_2 S_3 … S_M
Examination:  E_1 E_2 E_3 … E_M
Click:        C_1 C_2 C_3 … C_M

Inference Algorithms
Compute the posterior distribution, using the conditional independence relationship induced from the graphical model.
The exponents in the posterior count:
  How many times URL j was clicked
  How many times URL j was not clicked when it was at position (r + d) with the preceding click at position r

A Toy Example
Only the top M=3 positions are shown, with 3 query sessions and 4 distinct URLs.
Initialize M(M+1)/2+1 counts for each URL:
  Clicks, (r=0, d=1), (r=0, d=2), (r=0, d=3), (r=1, d=1), (r=1, d=2), (r=2, d=1)
Update counts for URL 4 in each query session:
  If not impressed, do nothing;
  If clicked, increment “clicks” by 1;
  Otherwise, locate the right r and d to increment.
The posterior for URL 4, interpretation:
  The larger the probability of examination, the stronger the penalty for a non-click.
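The three update rules can be sketched as one pass over a session. The session contents below are hypothetical (the toy table's cell values are not reproduced here), a single query is assumed so counts are keyed by URL only, and the function names are illustrative.

```python
from collections import defaultdict


def new_counts():
    """One counter table per URL: 'clicks' plus one entry per (r, d)."""
    return defaultdict(lambda: defaultdict(int))


def update_counts(counts, urls, clicks):
    """Fold one query session into the per-URL counts.

    URLs not impressed in this session are untouched. A clicked URL
    increments 'clicks'; a non-clicked URL increments its (r, d) cell,
    where r is the preceding clicked position (0 if none) and d is the
    distance from r to the URL's position.
    """
    last_click = 0
    for position, (url, clicked) in enumerate(zip(urls, clicks), start=1):
        if clicked:
            counts[url]["clicks"] += 1
            last_click = position
        else:
            counts[url][(last_click, position - last_click)] += 1


counts = new_counts()
# Two hypothetical sessions over the top M=3 positions.
update_counts(counts, ["u4", "u1", "u3"], [1, 0, 0])
update_counts(counts, ["u1", "u4", "u2"], [0, 0, 1])
print({url: dict(cells) for url, cells in counts.items()})
```

Each non-click cell (r, d) later becomes the exponent of a factor whose coefficient is the examination parameter for that (r, d), so a non-click at a likely-examined spot weighs more heavily against the URL.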

Parameter Estimation
Keep 2 counts for each parameter (one for click, and the other one for non-click).
Parameters for M=3: β_{0,1}, β_{0,2}, β_{0,3}, β_{1,1}, β_{1,2}, β_{2,1}
For each position in a query session, locate the right r and d to increment.
Parameter  Click  Non-click
β_{0,1}    1      2
Maximum-Likelihood Estimate:
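The counting step for the parameters can be sketched as below. The estimator at the end is an assumption, since the formula itself is not reproduced in this transcript: it supposes relevance is integrated out under a uniform prior, so a click at (r, d) has marginal probability β/2 and the maximum-likelihood value is twice the empirical click rate, capped at 1. The session data and function names are hypothetical.

```python
from collections import defaultdict


def count_parameter_events(click_vectors):
    """Return {(r, d): [click_count, non_click_count]} over all sessions."""
    counts = defaultdict(lambda: [0, 0])
    for clicks in click_vectors:
        last_click = 0
        for position, clicked in enumerate(clicks, start=1):
            cell = counts[(last_click, position - last_click)]
            cell[0 if clicked else 1] += 1
            if clicked:
                last_click = position
    return counts


def estimate_beta(click_count, non_click_count):
    """Assumed estimator: click probability beta/2, so beta = 2 * rate."""
    total = click_count + non_click_count
    if total == 0:
        return None
    return min(1.0, 2.0 * click_count / total)


# Three hypothetical sessions over the top M=3 positions.
counts = count_parameter_events([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
print({key: estimate_beta(*cell) for key, cell in sorted(counts.items())})
```

With these sessions the (0, 1) cell ends up with 1 click and 2 non-clicks, the same pair of values shown for β_{0,1} in the table.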

Algorithm Complexities
Initializing and updating the counts:
  Time: linear in the size of the click log
  Space: almost constant storage required
Computing relevance scores using numerical integration with B bins:

Summary of Algorithms
Step 1: initialize counting statistics;
Step 2: scan through the click log once and update the counts for both inference and estimation;
Step 3: compute parameter values;
Step 4: use numerical integration to obtain relevance scores.
Step 2 also applies for (linear) incremental computation!

CCM Recap
The user behavior model (figure).

CCM Recap
Graphical model:
Relevance:    S_1 S_2 S_3 … S_M
Examination:  E_1 E_2 E_3 … E_M
Click:        C_1 C_2 C_3 … C_M

CCM Inference
Details (figure).

Comparing with UBM
                                                CCM   UBM
Number of user behavior parameters              3     55
Number of distinct factors for (query, URL)     22    56
Number of counts needed for parameters          5     110

Outline
Introduction
Designing click models
Bayesian click models
Selected topics on click models
  Scaling click models for Petabyte-scale data
  Click model evaluation
  Tailoring user goals to click models
Conclusion

Petabyte-Scale Data
Data collected in 8 weeks.
Job k includes data between week 1 and week k.
Both time and space costs are prohibitive for a single node.

Introducing Map-Reduce [Dean+04]
A Simple Task: counting the # of impressions for each (query, URL) pair
(Diagram: Machines #1 to #4 each run the pipeline Extent → GetPairs → Map → Sort → Count → Output)
“Map” puts all of the same Pairs onto one machine. This allows you to group by various fields in subsequent processes.
Map = Bucket: the intermediate key is the (query, URL) pair.
“Count” carries out standard increment-by-1 over each distinct Pair.
“Count” REDUCES the amount of data since each Pair has only one output value.
Reduce = Count: it accepts a list of (key, value) tuples, and outputs the final result for each distinct key.
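The impression-counting task can be imitated on a single machine to show the roles of the two phases. This is a plain-Python simulation, not code for any particular MapReduce framework; the function names and the toy sessions are hypothetical.

```python
from itertools import groupby


def map_session(query, urls):
    """Map: emit one ((query, URL), 1) record per impression."""
    for url in urls:
        yield (query, url), 1


def reduce_count(key, values):
    """Reduce: sum the values that share one intermediate key."""
    return key, sum(values)


def count_impressions(sessions):
    records = [kv for query, urls in sessions for kv in map_session(query, urls)]
    # Sorting stands in for the shuffle that brings equal keys together.
    records.sort(key=lambda kv: kv[0])
    grouped = groupby(records, key=lambda kv: kv[0])
    return dict(reduce_count(key, [v for _, v in group]) for key, group in grouped)


sessions = [("cikm", ["cikm2008.org", "cikmconference.org"]),
            ("cikm", ["cikm2008.org"])]
print(count_impressions(sessions))
```

On a cluster, the map calls run on whichever machine holds each log extent, and every distinct key is reduced on exactly one machine.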

(Diagram: the same four-machine pipeline, with the MAP and REDUCE phases labeled)

MapReduce for BBM
(Figure: factor indices for M=3, with index 0 for clicks and indices 1 to 6 for the (r, d) factors)

MapReduce for BBM
Map: scan the click log
  Intermediate key: (query, URL)
  Value: the index of linear factors (0~55 for top-10 positions)
Reduce: scan the list of (key, value)
  The key indicates which exponent vector to update
  The value indicates the index of the element in the exponent vector to increment
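A single-machine sketch of this map and reduce pair follows. The way (r, d) is numbered from 1 to 55 is an arbitrary choice made here (the deck's own index layout is not recoverable from this transcript); only index 0 for clicks is taken from the slides. Function names and the example session are hypothetical.

```python
from collections import defaultdict

M = 10  # top-10 positions


def factor_index(r, d, m=M):
    """Map (r, d) with 0 <= r < r + d <= m to a unique index in 1..m(m+1)/2.

    Assumed numbering: all d for r = 0 first, then r = 1, and so on.
    """
    return r * m - r * (r - 1) // 2 + d


def map_session(query, urls, clicks):
    """Map: emit ((query, URL), factor index) for each impression."""
    last_click = 0
    for position, (url, clicked) in enumerate(zip(urls, clicks), start=1):
        if clicked:
            yield (query, url), 0  # index 0 is reserved for clicks
            last_click = position
        else:
            yield (query, url), factor_index(last_click, position - last_click)


def reduce_pairs(pairs, m=M):
    """Reduce: increment one element of the exponent vector per record."""
    size = m * (m + 1) // 2 + 1
    exponents = defaultdict(lambda: [0] * size)
    for key, index in pairs:
        exponents[key][index] += 1
    return dict(exponents)


pairs = list(map_session("cikm", ["a", "b", "c"], [1, 0, 0]))
vectors = reduce_pairs(pairs)
print(pairs)
```

Because the reduce step only adds to counters, logs from later weeks can be merged by summing exponent vectors key by key.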

Empirical Results
Linearly increasing computation load
Near-constant elapsed time: 3 hours for 265 TB of log data and 1.15 billion (query, URL) pairs
(Figures: single-machine computation load; elapsed time on SCOPE)

Evaluation Overview - Training
Inputs: impression data and click data (M=10)
Outputs: global parameters and relevance scores

Evaluation Overview - Test
Inputs: a new impression vector from an existing query, relevance, and global params
Outputs: predicted examination and predicted clicks

Data Preprocessing [Guo+09a, 09b]
Data are collected from a commercial search engine after query term normalization and spam removal.
For each query term, split query sessions evenly into training and test sets according to the timestamp.
The most frequent and the most infrequent query terms are removed.

Evaluation Metrics
Most popular metrics:
  Average test data log-likelihood (LL): probability of accurately predicting the click vector, 2^10 possibilities [Guo+09a, Guo+09b, Liu+09a, Zhu+10]
  Perplexity of prediction for each position: 2^{average entropy} of the click/no-click binary prediction for each position independently [Dupret+08, Guo+09a, Guo+09b, Zhu+10]
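Both metrics reduce to a few lines once a model supplies its predicted probabilities. In this sketch the base-10 logarithm is an assumption, chosen because it reproduces the random-guess value of -3.01 quoted for log(2^-10); the function names and inputs are illustrative.

```python
import math


def average_log_likelihood(session_probs):
    """Mean log10 probability assigned to each observed click vector."""
    return sum(math.log10(p) for p in session_probs) / len(session_probs)


def position_perplexity(predicted, observed):
    """Perplexity at one position over many sessions.

    `predicted[n]` is the predicted click probability in session n and must
    lie strictly between 0 and 1; `observed[n]` is the 0/1 click outcome.
    """
    total = 0.0
    for q, c in zip(predicted, observed):
        total += c * math.log2(q) + (1 - c) * math.log2(1 - q)
    return 2 ** (-total / len(predicted))


# Random guessing: every one of the 2^10 click vectors is equally likely,
# and every position is a fair coin.
print(average_log_likelihood([2 ** -10]))
print(position_perplexity([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

A perfect predictor approaches a log-likelihood of 0 and a perplexity of 1.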

Evaluation Metrics
Other metrics:
  Click-through rate (CTR) prediction (especially for predicting …) [Chapelle+09, Zhu+10]
  Predicting first/last clicked positions [Guo+09a, Guo+09b]
  Position-bias sanity check: plot the click rate curve for top-10 positions vs. the ground truth [Guo+09a, Guo+09b]

Results – Log Likelihood [Guo+09b]
Average log-likelihood. Random guess: log(2^-10) = -3.01. Optimal value: 0.
Model  LL      Improvement ratio (of CCM over the model)
CCM    -1.171
UBM    -1.264  9.7%
DCM    -1.302  14%

Results – Log Likelihood [Guo+09b]
(Figure; the axis is annotated from "Worse" to "Better")

Results – Log Likelihood [Liu+09a]
(Figure; the axis is annotated from "Worse" to "Better")

Results – Perplexity [Guo+09b]
Average perplexity over top 10 positions. Random guess: 2. Optimal value: 1.
Model  Perplexity  Improvement ratio (of CCM over the model)
CCM
UBM    1.1577      7.5%
DCM    1.1590      8.3%

Results – Perplexity [Guo+09b]
(Figure; the axis is annotated from "Better" to "Worse")

Examine/Click Position-Bias [Guo+09b]

Results – Efficiency
For 1M query sessions, the estimated time in seconds:
Model  Time (s)
DCM    80
CCM*   150
BBM*   165
UBM**  5,000
* Time for CCM and BBM includes computing the posterior mean and variance using numerical integration w/ 100 bins.
** UBM converges in 34 iterations.

User Goals in Web Search
Queries could be categorized into 2 sets:
  Navigational: to find the link to an existing website, e.g., bing;
  Informational: more exploration, multiple clicks may arise, e.g., iron man.

Fitting Multiple Click Models
Different user goals result in different browsing and click patterns.
The straightforward mixture-modeling approach is not practical [Dupret+08].
Solution:
  Classify query terms a priori based on user goals.
  Fit and learn 2 sets of model parameters, one for navigational and one for informational queries.

Determining User Goals [Lee+05]
Two-way classification for query terms based on click data using…
  Median position of click distribution
  Mean position of click distribution
  Average # of clicks per query session
  …
Pick the one which has the best click prediction:
  If a position receives 50% of the clicks, then navigational, else informational.
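The 50% rule can be sketched as a small classifier over a query's per-position click counts. The function name, the treatment of "50%" as "at least 50%", and the fallback for queries with no clicks are assumptions; the counts in the example are invented.

```python
def classify_query(clicks_per_position, threshold=0.5):
    """Label a query term from its click distribution over positions."""
    total = sum(clicks_per_position)
    if total == 0:
        # Assumption: with no click evidence, default to informational.
        return "informational"
    top_share = max(clicks_per_position) / total
    return "navigational" if top_share >= threshold else "informational"


# Hypothetical click counts at positions 1 to 4 for two query terms.
print(classify_query([80, 10, 5, 5]))
print(classify_query([30, 25, 25, 20]))
```

Each class then gets its own set of model parameters, as described on the previous slide.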

Empirical Results [Guo+09c]
Improvement of click prediction for DCM:
  Log-likelihood: 4.0%
  Perplexity: 1.3%
Examination/Click position-bias (figure).

Outline
Introduction
Designing click models
Bayesian click models
Selected topics on click models
Conclusion

Summary
Click models:
  A statistical tool to leverage valuable implicit user feedback in terabyte/petabyte search logs.
  Provide click prediction as well as relevance estimates.
  Application domains include learning to rank, measuring search performance, online advertising, user behavior analysis…
  Different model designs reflect various assumptions about user behavior to explain the position-bias.
  The modeling choice may depend on the application scenario.
  Efficient, single-pass, parallelizable algorithms are desired in real-world applications.
  The Bayesian framework could be applied to click models for both modeling benefits and computational benefits.
  Click Chain Model and Bayesian Browsing Model represent state-of-the-art examples.

Future Directions
Bigger context:
  Query reformulations
  Personalization
Richer inputs:
  Universal search
  Diverse user feedback
Click model vs. human judgments

References
[Burges+05]: C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. ICML’05.
[Chapelle+09]: O. Chapelle and Y. Zhang. A dynamic Bayesian network click model for web search ranking. WWW’09.
[Craswell+08]: N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. WSDM’08.
[Dean+04]: J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI’04.
[Dupret+08]: G. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. SIGIR’08.
[Guo+09a]: F. Guo, C. Liu, and Y.-M. Wang. Efficient multiple-click models in web search. WSDM’09.
[Guo+09b]: F. Guo, C. Liu, A. Kannan, T. Minka, M. Taylor, Y.-M. Wang, and C. Faloutsos. Click chain model in web search. WWW’09.
[Guo+09c]: F. Guo, L. Li, and C. Faloutsos. Tailoring click models to user goals. WSCD’09.
[Joachims02]: T. Joachims. Optimizing search engines using clickthrough data. KDD’02.
[Joachims+07]: T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. ACM TOIS, 25(2), 2007.
[Lee+05]: U. Lee, Z. Liu, and J. Cho. Automatic identification of user goals in web search. WWW’05.
[Liu+09a]: C. Liu, F. Guo, and C. Faloutsos. BBM: Deriving click models from petabyte-scale data. KDD’09.
[Liu+09b]: C. Liu, M. Li, and Y.-M. Wang. Post-rank reordering: resolving preference misalignments between search engines and end users. CIKM’09.
[Richardson+07]: M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. WWW’07.
[Zhu+10]: Z. Zhu, W. Chen, T. Minka, C. Zhu, and Z. Chen. A novel click model and its applications to online advertising. To appear in WSDM’10.

Acknowledgements
Nick Craswell (MSR, Cambridge)
Christos Faloutsos (Carnegie Mellon University)
Li-Wei He (MSR, ISRC-Redmond)
Tom Minka (MSR, Cambridge)
Anitha Kannan (MSR, Search Lab)
Yi-Min Wang (MSR, ISRC-Redmond)
Mike Taylor (MSR, Cambridge)
Ethan Tu (MSR, ISRC-Redmond)

End
