Eugene Agichtein Mathematics & Computer Science Emory University


Patterns in Web Search. Eugene Agichtein, Mathematics & Computer Science, Emory University.

Web Search Ranking. Rank pages for a query using hundreds of features: content match (e.g., page terms, anchor text, term weights) and prior document quality (e.g., web topology, spam features). Accuracy is evaluated and ranking functions are tuned on explicit relevance ratings. Millions of users interact with the results.

Query: SIGIR 2006. Users can help indicate the most relevant results (the slide shows example results with their clickthrough counts).

Outline: predicting search result preferences; incorporating user behavior into ranking; behavior-based query segmentation; current research.

User Interactions. Goal: harness rich user interactions with search results to improve the quality of search. Millions of users submit queries daily and interact with the search results (clicks, query refinement, dwell time). User interactions with search engines are plentiful, but require careful interpretation. Task: predict general user preferences, e.g., that a user is likely to prefer Page A over Page B.

Interpreting User Interactions. Clickthrough and subsequent browsing behavior of individual users is influenced by many factors: relevance of a result to the query, visual appearance and layout, result presentation order, context, history, etc. General idea: aggregate interactions across all users and queries, compute the "expected" behavior for any query/page, and recover the relevance signal for a given query.

Case Study: Clickthrough. Clickthrough frequency for all queries in the sample. More generally, the observed behavior can be modeled as a mixture of relevance and other factors: the observed value of a feature is the sum of a position-dependent ("expected") component and a relevance component, i.e., Clickthrough(query q, document d, result position p) = expected(p) + relevance(q, d).

Clickthrough for Queries with Known Position of Top Relevant Result. Relative clickthrough for queries whose top relevant result is known to be at position 1 (PTR = 1). Compare with Joachims' experiments, which reversed the order of the top 10 results for a small set of queries in a controlled setting.

Clickthrough for Queries with Known Position of Top Relevant Result. Higher clickthrough at the top non-relevant result than at the top relevant document. Relative clickthrough for queries with the known relevant result at positions 1 and 3, respectively.

Deviation from Expected. The relevance component is the deviation from the "expected" clickthrough: relevance(q, d) = observed(q, d, p) - expected(p).
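A minimal sketch of how this deviation could be estimated from an aggregated click log; the log format and field names below are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import defaultdict

def relevance_deviation(click_log):
    """Estimate relevance(q, d) = observed(q, d, p) - expected(p).

    click_log: iterable of (query, doc, position, clicked) tuples, where
    `clicked` is 0 or 1. The schema is an illustrative assumption.
    """
    # Expected clickthrough for each position, aggregated over all queries.
    pos_clicks, pos_views = defaultdict(float), defaultdict(int)
    # Observed clickthrough for each (query, doc) pair.
    qd_clicks, qd_views, qd_pos = defaultdict(float), defaultdict(int), {}

    for query, doc, position, clicked in click_log:
        pos_clicks[position] += clicked
        pos_views[position] += 1
        qd_clicks[(query, doc)] += clicked
        qd_views[(query, doc)] += 1
        qd_pos[(query, doc)] = position  # simplification: keep the latest position

    expected = {p: pos_clicks[p] / pos_views[p] for p in pos_views}
    return {
        (q, d): qd_clicks[(q, d)] / qd_views[(q, d)] - expected[qd_pos[(q, d)]]
        for (q, d) in qd_views
    }
```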

Beyond Clickthrough: Rich User Interaction Space. Observed and distributional features. Observed features: values aggregated over all user interactions for each query and result pair. Distributional features: deviations from the "expected" behavior for the query. Represent user interactions as vectors in a "behavior space" spanning presentation (what a user sees before a click), clickthrough (frequency and timing of clicks), and browsing (what users do after the click).

Some User Interaction Features.
Presentation: ResultPosition (position of the URL in the current ranking); QueryTitleOverlap (fraction of query terms in the result title).
Clickthrough: DeliberationTime (seconds between the query and the first click); ClickFrequency (fraction of all clicks landing on the page); ClickDeviation (deviation from the expected click frequency).
Browsing: DwellTime (result page dwell time); DwellTimeDeviation (deviation from the expected dwell time for the query).
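To illustrate the observed vs. distributional split, the sketch below derives a ClickFrequency/ClickDeviation pair for one query-result pair; the feature names mirror the table, but the computation is a simplified assumption rather than the exact definitions used in the papers.

```python
def click_features(query, url, interactions, expected_click_freq):
    """Compute one observed and one distributional clickthrough feature.

    interactions: list of dicts with 'query' and 'clicked_url' keys (assumed schema).
    expected_click_freq: background click frequency expected for this result's
    position, e.g., taken from the aggregate position curve.
    """
    query_events = [e for e in interactions if e["query"] == query]
    if not query_events:
        return {"ClickFrequency": 0.0, "ClickDeviation": 0.0}
    clicks_on_url = sum(1 for e in query_events if e.get("clicked_url") == url)
    click_frequency = clicks_on_url / len(query_events)        # observed feature
    click_deviation = click_frequency - expected_click_freq    # distributional feature
    return {"ClickFrequency": click_frequency, "ClickDeviation": click_deviation}
```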

Predicting Result Preferences. Task: predict pairwise preferences, i.e., that a user will prefer Result A over Result B. Models for preference prediction: current search engine ranking, clickthrough, and the full user behavior model.

Clickthrough Model. SA+N: "Skip Above" and "Skip Next", adapted from Joachims et al. [SIGIR'05] and motivated by gaze tracking. Example (results ranked 1 through 8): clicks on results 2 and 4. Skip Above: 4 > (1, 3) and 2 > 1. Skip Next: 4 > 5 and 2 > 3.
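A minimal sketch of the SA+N strategy, assuming results are given in rank order and the clicked positions are known (1-based, as on the slide).

```python
def skip_above_next(num_results, clicked_positions):
    """Generate (preferred, less_preferred) pairs with Skip Above + Skip Next.

    clicked_positions: 1-based positions of clicked results, e.g., [2, 4].
    """
    clicked = sorted(set(clicked_positions))
    prefs = set()
    for pos in clicked:
        # Skip Above: clicked result preferred over skipped results above it.
        for above in range(1, pos):
            if above not in clicked:
                prefs.add((pos, above))
        # Skip Next: clicked result preferred over the next unclicked result.
        nxt = pos + 1
        if nxt <= num_results and nxt not in clicked:
            prefs.add((pos, nxt))
    return sorted(prefs)

# Example from the slide: clicks on results 2 and 4 out of 8.
# -> [(2, 1), (2, 3), (4, 1), (4, 3), (4, 5)]
print(skip_above_next(8, [2, 4]))
```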

Distributional Model. CD: distributional model, extending SA+N. A clickthrough is considered only if its frequency exceeds the expected frequency by more than ε. In the example (results ranked 1 through 8), the click on result 2 is likely "by chance", so CD infers 4 > (1, 2, 3, 5), but not 2 > (1, 3).
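A sketch of the CD filtering step under the same assumptions, reusing skip_above_next from the previous sketch; the observed and expected frequencies below are illustrative toy values.

```python
def cd_preferences(num_results, click_freq, expected_freq, epsilon=0.1):
    """Distributional (CD) variant: keep only 'significant' clicks, then apply
    Skip Above + Skip Next over the remaining results.

    click_freq / expected_freq: dicts mapping 1-based position to observed and
    expected clickthrough frequency (illustrative inputs).
    """
    significant = [
        p for p in range(1, num_results + 1)
        if click_freq.get(p, 0.0) - expected_freq.get(p, 0.0) > epsilon
    ]
    return skip_above_next(num_results, significant)

# Slide example: the click on result 2 is close to its expected frequency,
# so only result 4 survives the filter -> 4 > (1, 2, 3) and 4 > 5.
observed = {2: 0.12, 4: 0.40}
expected = {1: 0.35, 2: 0.10, 3: 0.07, 4: 0.05, 5: 0.04}
print(cd_preferences(8, observed, expected, epsilon=0.1))
```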

User Behavior Model. Full set of interaction features: presentation, clickthrough, browsing. Train the model with explicit judgments. Input: behavior feature vectors for each query-page pair in the rated results. Use RankNet (Burges et al., ICML 2005) to discover the model weights. Output: a neural net that can assign a "relevance" score to a behavior feature vector. The system is trained on pairs (where the first point is to be ranked higher than or equal to the second), uses a cross-entropy cost (a probabilistic model), and uses gradient descent to train the weights of the neural net.

RankNet for User Behavior. RankNet: a general, scalable, robust neural net training algorithm and implementation. Optimized for ranking, i.e., predicting an ordering of items rather than a score for each (not just basic regression). Trains on pairs (where the first point is to be ranked higher than or equal to the second). Extremely efficient: uses a cross-entropy cost (probabilistic model), uses gradient descent to set the weights, and restarts to escape local minima.

RankNet [Burges et al. 2005]. For query results 1 and 2, present a pair of feature vectors and labels, with label(1) > label(2). Feature Vector 1 is fed through the net to produce NN output 1.

RankNet [Burges et al. 2005]. For query results 1 and 2, present a pair of feature vectors and labels, with label(1) > label(2). Feature Vector 2 is fed through the net to produce NN output 2.

RankNet [Burges et al. 2005]. The error is a function of both outputs (we desire output 1 > output 2).

RankNet [Burges et al. 2005]. Update the feature weights with modified back-propagation; the cost function is f(o1 - o2) (details in the Burges et al. paper). The error is a function of both outputs (we desire output 1 > output 2).

Predicting with RankNet. At prediction time, present an individual feature vector and get a score (NN output).
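To make the pairwise training and scoring steps above concrete, here is a minimal NumPy sketch with a linear scorer standing in for the neural net; it illustrates the pairwise cross-entropy cost and gradient update, not the actual RankNet implementation.

```python
import numpy as np

def train_pairwise_ranker(pairs, n_features, lr=0.1, epochs=20, seed=0):
    """Tiny linear 'RankNet-style' ranker trained on preference pairs.

    pairs: list of (x_preferred, x_other) feature-vector pairs, meaning the
    first item should score higher than the second.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=n_features)
    for _ in range(epochs):
        for x_hi, x_lo in pairs:
            o = w @ x_hi - w @ x_lo  # output difference o1 - o2
            # Pairwise cross-entropy cost C = log(1 + exp(-o)); its gradient
            # w.r.t. w is -sigmoid(-o) * (x_hi - x_lo).
            grad = -1.0 / (1.0 + np.exp(o)) * (np.asarray(x_hi) - np.asarray(x_lo))
            w -= lr * grad
    return w

def score(w, x):
    """Prediction: present an individual feature vector and get a score."""
    return float(w @ np.asarray(x))

# Toy usage: the first feature correlates with being preferred.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_pairwise_ranker(pairs, n_features=2)
assert score(w, [1.0, 0.2]) > score(w, [0.1, 0.9])
```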

Evaluation Metrics. Task: predict user preferences. Metric: pairwise agreement. Precision for a query: fraction of predicted pairs that agree with the preferences derived from human ratings. Recall for a query: fraction of human-rated preferences predicted correctly. Precision and recall are averaged across all queries.
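A sketch of the pairwise agreement metrics, assuming both the predictions and the human-derived preferences are given as sets of (preferred, other) document pairs.

```python
def pairwise_precision_recall(predicted, human):
    """predicted / human: sets of (preferred_doc, other_doc) pairs for one query."""
    if not predicted or not human:
        return 0.0, 0.0  # simplification for the degenerate cases
    agree = len(predicted & human)
    precision = agree / len(predicted)  # predicted pairs that agree with ratings
    recall = agree / len(human)         # human-rated preferences recovered
    return precision, recall

def average_over_queries(per_query):
    """per_query: list of (predicted_set, human_set) tuples, one per query."""
    scores = [pairwise_precision_recall(p, h) for p, h in per_query]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```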

Datasets. Explicit judgments: 3,500 queries with the top 10 results, relevance ratings converted to pairwise preferences for each query. User behavior data: opt-in MSN Toolbar instrumentation (anonymized UserID, time, visited page); detect queries submitted to the MSN Search engine and the subsequently visited pages; 120,000 instances of these 3,500 queries, submitted at least 2 times over 21 days.

Methods Compared. Preferences inferred by: current search engine ranking (Baseline: result i is preferred to result j iff it is ranked above it); clickthrough model (SA+N); clickthrough distributional model (CD); full user behavior model (UserBehavior).

Results: Predicting User Preferences. The full user behavior model is better than the other methods that we and others have tried; browsing features are the most important. Baseline < SA+N < CD << UserBehavior. Rich user behavior features result in a dramatic improvement.

Contribution of Feature Types. Presentation features are not helpful. Browsing features: higher precision, lower recall. Clickthrough features outperform CD: a richer model plus learning.

Amount of Interaction Data. Prediction accuracy for varying amounts of user interaction per query: a slight increase in recall, a substantial increase in precision.

Learning Curve. At a minimum precision of 0.7, recall increases substantially with more days of user interactions.

Experiments Summary: Preferences. The clickthrough distributional model is more accurate than previously published work. Rich user behavior features give a dramatic accuracy improvement. Accuracy increases for frequent queries and longer observation periods.

Outline: predicting result preferences; incorporating behavior into ranking; behavior-based query segmentation; current research directions.

Web Search Ranking. Rank pages relevant to a query using content match (e.g., page terms, anchor text, term weights) and prior document quality (e.g., web topology, spam features), with hundreds of parameters. Ranking functions are tuned on explicit document relevance ratings.

Web Search Ranking: Revisited. Incorporate user behavior information: millions of users submit queries daily, providing rich user interaction features that are complementary to content and web topology. Some challenges: user behavior "in the wild" is not reliable; how to integrate interactions into ranking; what the impact is over all queries; behavior varies with information need!

User Behavior Models for Ranking. Use interactions from previous instances of a query; general-purpose (not personalized); only for queries with past user interactions. Models: rerank using clickthrough only (reorder results by number of clicks); rerank using predicted preferences from all user behavior features (reorder results by predicted preferences); integrate directly into the ranker (incorporate user interactions as features for the ranker).

Rerank, Clickthrough Only. Promote all clicked results to the top of the result list, reorder them by click frequency, and retain the relative ranking of unclicked results.
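A sketch of this clickthrough-only reranking, assuming a ranked result list and a per-query click-count lookup (both inputs are illustrative assumptions).

```python
def rerank_by_clicks(ranked_results, click_counts):
    """Promote clicked results above unclicked ones.

    ranked_results: URLs in the original engine order.
    click_counts: dict URL -> number of past clicks for this query.
    Clicked results are ordered by click frequency (ties broken by original
    rank); unclicked results keep their original relative order below them.
    """
    clicked = [u for u in ranked_results if click_counts.get(u, 0) > 0]
    unclicked = [u for u in ranked_results if click_counts.get(u, 0) == 0]
    clicked.sort(key=lambda u: (-click_counts[u], ranked_results.index(u)))
    return clicked + unclicked

# Example: the third result attracted the most past clicks and moves to the top.
print(rerank_by_clicks(["a.com", "b.com", "c.com", "d.com"],
                       {"c.com": 120, "a.com": 45}))
# -> ['c.com', 'a.com', 'b.com', 'd.com']
```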

Rerank, Preference Predictions. Reorder results by a function of the preference prediction score. Experimented with different variants, e.g., using the inverse of ranks. Intuition: the scores are not comparable, so merge ranks instead.
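A sketch of the inverse-rank merging idea: because the original ranker's scores and the preference predictor's scores are not on a comparable scale, combine the two rankings via their reciprocal ranks. The equal weighting and exact combination rule here are assumptions for illustration, not necessarily the variant reported in the paper.

```python
def merge_by_inverse_ranks(original_order, predicted_order, weight=0.5):
    """Combine two rankings of the same URLs via reciprocal ranks.

    original_order / predicted_order: URL lists, best first.
    Returns URLs sorted by a weighted sum of 1/rank contributions.
    """
    def inv_ranks(order):
        return {url: 1.0 / (i + 1) for i, url in enumerate(order)}

    orig, pred = inv_ranks(original_order), inv_ranks(predicted_order)
    urls = set(original_order) | set(predicted_order)
    combined = {u: weight * orig.get(u, 0.0) + (1 - weight) * pred.get(u, 0.0)
                for u in urls}
    return sorted(urls, key=lambda u: -combined[u])
```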

Enhance Ranker Features with User Behavior Features. For a given query, merge the original feature set with user behavior features when available; the user behavior features are computed from previous interactions with the same query. Train RankNet [Burges et al., ICML'05] on the enhanced feature set.

Feature Merging: Details. Query: SIGIR, with fake results and fake feature values:
  Result URL            BM25   PageRank   ...   Clicks   DwellTime
  sigir2007.org         2.4    0.5              ?
  sigir2006.org         1.4    1.1              150      145.2
  acm.org/sigs/sigir/   1.2    2                60       23.5
Value scaling: binning vs. log-linear vs. linear (e.g., normalized to μ=0, σ=1). Missing values: set to 0? (what does that mean for normalized features with μ=0?). Runtime: significant plumbing problems.
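A sketch of the merging step, under the assumption that behavior features are appended to the content features after z-score normalization (μ=0, σ=1) with missing values set to 0, which is one of the options debated on the slide.

```python
import math

def merge_features(content_rows, behavior_rows, behavior_names):
    """Append (normalized) behavior features to content feature rows.

    content_rows: dict URL -> dict of content features (BM25, PageRank, ...).
    behavior_rows: dict URL -> dict of behavior features (Clicks, DwellTime, ...),
    possibly missing for some URLs.
    """
    # z-score each behavior feature over the URLs where it is observed.
    stats = {}
    for name in behavior_names:
        vals = [row[name] for row in behavior_rows.values() if name in row]
        mean = sum(vals) / len(vals) if vals else 0.0
        var = sum((v - mean) ** 2 for v in vals) / len(vals) if vals else 0.0
        stats[name] = (mean, math.sqrt(var) or 1.0)

    merged = {}
    for url, feats in content_rows.items():
        row = dict(feats)
        for name in behavior_names:
            if url in behavior_rows and name in behavior_rows[url]:
                mean, std = stats[name]
                row[name] = (behavior_rows[url][name] - mean) / std
            else:
                row[name] = 0.0  # missing value -> 0, i.e., the feature mean
        merged[url] = row
    return merged
```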

Evaluation Metrics. Precision at K: fraction of relevant results in the top K. NDCG at K: normalized discounted cumulative gain, which emphasizes the top-ranked results. MAP: mean average precision, where the average precision for a query is the mean of the precision-at-K values computed after each relevant document is retrieved.
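A sketch of the three metrics for a single query, assuming graded gains for NDCG and binary relevance for precision and average precision.

```python
import math

def precision_at_k(relevant, ranked, k):
    """relevant: set of relevant doc ids; ranked: doc ids in ranked order."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def ndcg_at_k(gains, ranked, k):
    """gains: dict doc id -> graded relevance (e.g., 0-4)."""
    def dcg(order):
        return sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(order[:k]))
    ideal = sorted(gains, key=lambda d: -gains[d])
    return dcg(ranked) / dcg(ideal) if dcg(ideal) > 0 else 0.0

def average_precision(relevant, ranked):
    """Mean of the precision values taken at each rank where a relevant doc appears."""
    hits, precisions = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# MAP is the mean of average_precision over all queries.
```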

Datasets. Eight weeks of user behavior data (Winter 2006) from anonymized, opt-in client instrumentation: millions of unique queries and interaction traces. A random sample of 3,000 queries, gathered independently of user behavior: 1,500 train, 500 validation, 1,000 test. Explicit relevance assessments for the top 10 results of each query in the sample.

Methods Compared. Content only: BM25F. Full search engine: RN (hundreds of parameters for content match and document quality, tuned with RankNet). Incorporating user behavior: clickthrough (Rerank-CT); full user behavior model predictions (Rerank-All); all user behavior features integrated directly (+All).

Content, User Behavior: Precision at K (queries with interactions). BM25 < Rerank-CT < Rerank-All < +All.

Content, User Behavior: NDCG. BM25 < Rerank-CT < Rerank-All < +All.

Full Search Engine, User Behavior: NDCG, MAP.
  Method     MAP     Gain
  RN         0.270
  RN+All     0.321   0.052 (19.13%)
  BM25       0.236
  BM25+All   0.292   0.056 (23.71%)

User Behavior Complements Content and Web Topology. BM25 (keyword-based ranking) plus user behavior is better than the full model with hundreds of features (keywords, web structure, etc.).
  Method                    P@1     Gain
  RN (Content + Links)      0.632
  RN+All (User Behavior)    0.693   0.061 (10%)
  BM25                      0.525
  BM25+All                  0.687   0.162 (31%)

Impact: All Queries, Precision at K. Fewer than 50% of test queries have prior interactions; precision improves by 0.06-0.12 over all test queries.

Impact: All Queries, NDCG. NDCG improves by 0.03-0.05 over all test queries.

Which Queries Benefit Most. Most gains are for queries with poor ranking.

Result Summary. Incorporating user behavior into web search ranking dramatically improves relevance. Providing rich user interaction features to the ranker is the most effective strategy. Large improvements are shown for up to 50% of test queries.

Promising Extensions. Backoff (to improve query coverage); modeling user intent/information need; personalization of various degrees; query segmentation.

Identifying “Best Bet” Results by Mining Past User Behavior

How can we get the perfect top result for navigational queries? 7,000 unique queries, 1.2 million searches, 10 million user interactions.

Not Quite a Ranking Problem. The "best bet" problem: select the most appropriate result to display in the top position if user behavior clearly indicates a preference for this result over all other results for the query. "Navigational" behavior is associated with some queries (e.g., google, hotmail). Train a classifier (e.g., a decision tree) on examples rated "Excellent" and "Perfect", and classify <query, result> pairs based on the tree.

Training a Classifier. Feature set: 30+ features. Dataset: 7,000 queries with rated results and more than one click. Label: "Perfect" or "Excellent" maps to 1, otherwise 0. Method: train a WinMine classifier [M. Chickering].
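A hedged sketch of this setup, substituting scikit-learn's DecisionTreeClassifier for the WinMine classifier used in the paper; the feature layout, tree settings, and confidence threshold are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

def train_best_bet_classifier(rows):
    """rows: list of (feature_vector, rating) for <query, result> pairs,
    where rating is a string such as 'Perfect', 'Excellent', 'Good', ...
    """
    X = [features for features, _ in rows]
    y = [1 if rating in ("Perfect", "Excellent") else 0 for _, rating in rows]
    clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20)  # assumed settings
    clf.fit(X, y)
    return clf

def best_bet(clf, candidates, threshold=0.9):
    """Return the candidate URL to promote to the top, or None if no clear winner.

    candidates: list of (url, feature_vector) for one query.
    """
    scored = [(clf.predict_proba([f])[0][1], url) for url, f in candidates]
    prob, url = max(scored)
    return url if prob >= threshold else None
```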

Results.
  Method                  Precision   Recall   Prec Gain (%)
  RankNet                 0.239                -
  RankNet+UserBehavior    0.331                38.5%
  BehaviorClassifier      0.753       0.299    216%
  DomainAlgorithms        0.758       0.185    218%
BehaviorClassifier exhibits significantly higher precision than RankNet and RankNet+UserBehavior, and comparable precision with higher recall than domain algorithms over similar features.

Example Rule

Potential Applications: click spam detection; search abuse detection; personalization; domain-specific ranking; website optimization.

Current Work: understanding searcher and author behavior in online sources; text mining and information extraction for the life sciences; inferring social networks, beyond the blogosphere.

Understanding Searcher and Author Behavior in Online Sources. Searcher behavior: infer models of human inference, decision making, and learning within (and across) query sessions. First pass: adapt collaborative filtering techniques to understand how behavior changes with browsed pages, and adapt information extraction and content presentation accordingly. Author behavior: go beyond statistical language models to information content, updates, and information flow, with implications for ranking, information extraction, and question answering.

Text Mining and Information Extraction for the Life Sciences. Improving automated diagnosis based on text in patient records (with the School of Medicine): add context for expert-system rules; flag possible complications. Public health: early detection and monitoring of epidemics (with the School of Public Health): identify complaints/notes in patient records that tend to co-occur with a syndrome; infer RL social information for more accurate epidemic modeling.

Inferring Social Networks and Information Flow. Extend "blogosphere" diffusion work to entities, facts, and events (with GaTech), and likewise for non-blog and non-text data. Question-answer portals (Y! Answers): infer author quality and identify "experts".

Summary: predicting user preferences; incorporating user behavior into ranking; behavior-based query segmentation. Next: author and searcher understanding.

Primary References (http://www.mathcs.emory.edu/~eugene/):
Improving Web Search Ranking by Incorporating User Behavior, E. Agichtein, E. Brill, and S. Dumais, SIGIR 2006.
Learning User Interaction Models for Predicting Web Search Result Preferences, E. Agichtein, E. Brill, S. Dumais, and R. Ragno, SIGIR 2006.
Identifying "Best Bet" Web Search Results by Mining Past User Behavior, E. Agichtein and Z. Zheng, KDD 2006.
Web Information Extraction and User Modeling: Towards Closing the Gap, E. Agichtein, IEEE Data Engineering Bulletin, Dec. 2006.
This and other work on Information Extraction and Text Mining: http://www.mathcs.emory.edu/~eugene/

Presentation Features (as used in the SIGIR/KDD papers): query terms in the title, summary, and URL; position of the result; length of the URL; depth of the URL.

Clickthrough Features: fraction of clicks on the URL; deviation from the "expected" clickthrough given the result position; time to click; time to first click in the "session"; deviation from the average time for the query.

Browsing Features: dwell time; cumulative time on the URL (CuriousBrowser); deviation from the average time on the URL, averaged over the "user" and over all results for the query; number of subsequent non-result URLs.