
1 Modeling User Interactions in Web Search and Social Media Eugene Agichtein Intelligent Information Access Lab Emory University

2 Intelligent Information Access Lab, http://ir.mathcs.emory.edu/
Research areas:
– Information retrieval & extraction, text mining, and information integration
– User behavior modeling, social networks and interactions, social media
People: Walter Askew (EC '09), Qi Guo (2nd-year Ph.D.), Yandong Liu (2nd-year Ph.D.), Ryan Kelly (Emory '10), Alvin Grissom (2nd-year M.S.), Abulimiti Aji (1st-year Ph.D.), and colleagues at Yahoo! Research, Microsoft Research, Emory Libraries, Psychology, Emory School of Medicine, Neuroscience, and Georgia Tech College of Computing.

3 User Interactions: The 3rd Dimension of the Web
Amount exceeds web content and structure:
– Published: 4 GB/day; social media: 10 GB/day
– Page views: 100 GB/day [Andrew Tomkins, Yahoo! Search, 2007]

4 Talk Outline
Web search interactions:
– Click modeling
– Browsing
Social media:
– Content quality
– User satisfaction
– Ranking and filtering

5 Interpreting User Interactions
Clickthrough and subsequent browsing behavior of individual users is influenced by many factors:
– Relevance of a result to a query
– Visual appearance and layout
– Result presentation order
– Context, history, etc.
General idea:
– Aggregate interactions across all users and queries
– Compute "expected" behavior for any query/page
– Recover the relevance signal for a given query

6 Case Study: Clickthrough
Clickthrough frequency for all queries in the sample:
Clickthrough(query q, document d, result position p) = expected(p) + relevance(q, d)

7 Clickthrough for Queries with Known Position of the Top Relevant Result
Relative clickthrough for queries with the known relevant result in position 1 and 3, respectively.
Clickthrough at the top non-relevant result can be higher than at the top relevant document.

8 Model Deviation from "Expected" Behavior
Relevance component = deviation from "expected" behavior:
relevance(q, d) = observed(q, d, p) - expected(p)
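A minimal sketch of this decomposition, assuming a click log of (query, url, position, clicked) records aggregated over many users; all names are illustrative, not the paper's implementation:

```python
from collections import defaultdict

def relevance_deviation(click_log):
    """Estimate relevance(q, d) as the observed clickthrough rate minus the
    position-expected clickthrough rate, aggregated across users.
    click_log: iterable of (query, url, position, clicked) tuples."""
    pos_clicks, pos_views = defaultdict(int), defaultdict(int)
    qd_clicks, qd_views = defaultdict(int), defaultdict(int)
    for query, url, position, clicked in click_log:
        pos_views[position] += 1
        pos_clicks[position] += clicked
        qd_views[(query, url, position)] += 1
        qd_clicks[(query, url, position)] += clicked
    relevance = {}
    for (query, url, position), views in qd_views.items():
        observed = qd_clicks[(query, url, position)] / views
        expected = pos_clicks[position] / pos_views[position]
        relevance[(query, url)] = observed - expected
    return relevance
```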

9 Predicting Result Preferences Task: predict pairwise preferences – A user will prefer Result A > Result B Models for preference prediction – Current search engine ranking – Clickthrough – Full user behavior model

10 Predicting Result Preferences: Granka et al., SIGIR 2005
SA+N: "Skip Above" and "Skip Next"
– Adapted from Joachims et al. [SIGIR '05]
– Motivated by gaze tracking
Example (results ranked 1–8), clicks on results 2 and 4:
– Skip Above: 4 > (1, 3), 2 > 1
– Skip Next: 4 > 5, 2 > 3

11 Our Extension: Use the Click Distribution
CD: distributional model, extends SA+N
– A clickthrough is considered only if its frequency exceeds the expected frequency by more than ε
Example (results ranked 1–8): the click on result 2 is likely "by chance", so we infer 4 > (1, 2, 3, 5) but not 2 > (1, 3).
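A sketch of the two heuristics above; positions are 1-based and the function names are illustrative:

```python
def skip_above_next(clicked_positions, num_results):
    """Pairwise preferences from clicks, following the Skip Above /
    Skip Next heuristics."""
    clicked = set(clicked_positions)
    prefs = []
    for c in sorted(clicked):
        # Skip Above: clicked result preferred over unclicked results above it
        prefs += [(c, p) for p in range(1, c) if p not in clicked]
        # Skip Next: clicked result preferred over the next unclicked result
        if c + 1 <= num_results and c + 1 not in clicked:
            prefs.append((c, c + 1))
    return prefs

def confident_clicks(observed_ctr, expected_ctr, epsilon=0.1):
    """CD extension: keep only clicks whose observed frequency exceeds the
    position-expected frequency by more than epsilon."""
    return {p for p, ctr in observed_ctr.items()
            if ctr - expected_ctr.get(p, 0.0) > epsilon}

# Example from the slide: clicks on results 2 and 4 of an 8-result page
print(skip_above_next([2, 4], 8))   # [(2, 1), (2, 3), (4, 1), (4, 3), (4, 5)]
```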

12 Results: Click Deviation vs. Skip Above+Next

13 Problem: users click based on result summaries ("captions"/"snippets").
Effect of caption features on clickthrough inversions: C. Clarke, E. Agichtein, S. Dumais, R. White, SIGIR 2007

14 Clickthrough Inversions

15 Relevance is Not the Dominant Factor!

16 Snippet Features Studied

17 Feature Importance

18 Important Words in Snippet

19 Summary
Clickthrough inversions are a powerful tool for assessing the influence of caption features.
Relatively simple caption features can significantly influence user behavior.
Accounting for summary bias can help predict relevance from clickthrough more accurately.

20 Idea: go beyond clickthrough/download counts
Presentation:
– ResultPosition: position of the URL in the current ranking
– QueryTitleOverlap: fraction of query terms in the result title
Clickthrough:
– DeliberationTime: seconds between the query and the first click
– ClickFrequency: fraction of all clicks landing on the page
– ClickDeviation: deviation from the expected click frequency
Browsing:
– DwellTime: result page dwell time
– DwellTimeDeviation: deviation from the expected dwell time for the query
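For concreteness, one behavior feature vector per query-result pair might look like the following sketch; the field names mirror the list above and are illustrative, not the exact feature set used:

```python
from dataclasses import dataclass

@dataclass
class BehaviorFeatures:
    # Presentation
    result_position: int         # position of the URL in the current ranking
    query_title_overlap: float   # fraction of query terms in the result title
    # Clickthrough
    deliberation_time: float     # seconds between query and first click
    click_frequency: float       # fraction of all clicks landing on the page
    click_deviation: float       # deviation from expected click frequency
    # Browsing
    dwell_time: float            # result page dwell time
    dwell_time_deviation: float  # deviation from expected dwell time for query
```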

21 User Behavior Model
Full set of interaction features:
– Presentation, clickthrough, browsing
Train the model with explicit judgments:
– Input: behavior feature vectors for each query-page pair in the rated results
– Use RankNet [Burges et al., ICML 2005] to discover model weights
– Output: a neural net that assigns a "relevance" score to a behavior feature vector

22 RankNet for User Behavior
RankNet: general, scalable, robust neural-net training algorithm and implementation
Optimized for ranking: predicts an ordering of items, not a score for each
Trains on pairs, where the first point is to be ranked higher than or equal to the second:
– Extremely efficient
– Uses a cross-entropy cost (probabilistic model)
– Uses gradient descent to set weights
– Restarts to escape local minima

23 RankNet [Burges et al. 2005] Feature Vector1 Label1 NN output 1 For query results 1 and 2, present pair of vectors and labels, label(1) > label(2)

24 RankNet [Burges et al. 2005] Feature Vector2 Label2 NN output 1 NN output 2 For query results 1 and 2, present pair of vectors and labels, label(1) > label(2)

25 RankNet [Burges et al. 2005] NN output 1 NN output 2 Error is function of both outputs (Desire output1 > output2) For query results 1 and 2, present pair of vectors and labels, label(1) > label(2)

26 RankNet [Burges et al. 2005]
NN output 1, NN output 2; the error is a function of both outputs (desired: output1 > output2).
Update feature weights:
– Cost function: f(o1 - o2); details in the Burges et al. paper
– Modified back-propagation
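A minimal sketch of that pairwise cost, assuming the simplest setting in which the label says result 1 should be ranked above result 2 (target probability 1); o1 and o2 are the network outputs for the two feature vectors:

```python
import numpy as np

def ranknet_pair_cost(o1, o2):
    """Cross-entropy cost for one training pair; it depends only on the
    score difference o1 - o2 under a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(o1 - o2)))   # P(result 1 ranked above result 2)
    return -np.log(p)

def ranknet_pair_gradient(o1, o2):
    """Gradient of the pair cost w.r.t. the score difference; this is the
    signal a modified back-propagation pushes through the shared network."""
    return -(1.0 - 1.0 / (1.0 + np.exp(-(o1 - o2))))
```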

27 Predicting with RankNet Feature Vector1 NN output Present individual vector and get score

28 Example Results: Predicting User Preferences
Baseline < SA+N < CD << UserBehavior
Rich user behavior features result in a dramatic improvement.

29 How to Use Behavior Models for Ranking?
Use interactions from previous instances of the query:
– General-purpose (not personalized)
– Only for queries with past user interactions
Models:
– Rerank, clickthrough only: reorder results by number of clicks
– Rerank, predicted preferences (all user behavior features): reorder results by predicted preferences
– Integrate directly into the ranker: incorporate user interactions as features for the ranker

30 Enhance Ranker Features with User Behavior Features For a given query – Merge original feature set with user behavior features when available – User behavior features computed from previous interactions with same query Train RankNet [Burges et al., ICML’05] on the enhanced feature set

31 Feature Merging: Details
Value scaling:
– Binning vs. log-linear vs. linear (e.g., μ=0, σ=1)
Missing values:
– 0? (what does that mean for features normalized so that μ=0?)
Runtime: significant plumbing problems
Query: SIGIR (fake results with fake feature values)
Result URL            BM25   PageRank   …   Clicks   DwellTime   …
sigir2007.org         2.4    0.5        …   ?        ?           …
sigir2006.org         1.4    1.1        …   150      145.2       …
acm.org/sigs/sigir/   1.2    2          …   60       23.5        …
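A sketch of the merging step, with illustrative names; behavior features are assumed to be z-scored elsewhere, so filling a missing value with 0 amounts to "no deviation from expected", which is one possible answer to the μ=0 question above:

```python
def merge_features(content_features, behavior_features, behavior_names):
    """Merge ranker (content/link) features with user-behavior features
    for one query-URL pair; missing behavior features default to 0.0."""
    merged = dict(content_features)
    for name in behavior_names:
        merged[name] = behavior_features.get(name, 0.0)
    return merged

# Hypothetical example for query "SIGIR": a new page with no recorded behavior
content = {"BM25": 2.4, "PageRank": 0.5}
behavior = {}
print(merge_features(content, behavior, ["Clicks_z", "DwellTime_z"]))
```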

32 Evaluation Metrics
Precision at K: fraction of relevant results in the top K
NDCG at K: normalized discounted cumulative gain; top-ranked results matter most
MAP: mean average precision
– Average precision for each query: the mean of the precision-at-K values computed after each relevant document is retrieved
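Minimal sketches of the two ranking metrics, assuming graded relevance labels for NDCG and binary relevance flags for average precision (the exact gain/discount variant used in the experiments may differ):

```python
import numpy as np

def ndcg_at_k(gains, k):
    """NDCG@K for one query; `gains` are relevance grades in ranked order."""
    gains = np.asarray(gains, dtype=float)
    def dcg(g):
        g = g[:k]
        return (g / np.log2(np.arange(2, g.size + 2))).sum()
    ideal = dcg(np.sort(gains)[::-1])
    return dcg(gains) / ideal if ideal > 0 else 0.0

def average_precision(relevant_flags):
    """Average precision for one query: the mean of precision@K taken at each
    rank K where a relevant document appears; MAP averages this over queries."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0
```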

33 Content, User Behavior: NDCG BM25 < Rerank-CT < Rerank-All < +All

34 Full Search Engine, User Behavior: NDCG, MAP
Method     MAP     Gain
RN         0.270
RN+All     0.321   0.052 (19.13%)
BM25       0.236
BM25+All   0.292   0.056 (23.71%)

35 User Behavior Complements Content and Web Topology
Method                     P@1     Gain
RN (content + links)       0.632
RN + All (user behavior)   0.693   0.061 (10%)
BM25                       0.525
BM25 + All                 0.687   0.162 (31%)

36 Which Queries Benefit Most Most gains are for queries with poor ranking

37 Result Summary
Incorporating user behavior into web search ranking dramatically improves relevance.
Providing rich user interaction features directly to the ranker is the most effective strategy.
Large improvements are shown for up to 50% of test queries.

38 User-Generated Content

39 Some Goals of Mining Social Media
Find high-quality content
Find relevant and high-quality content
Use millions of interactions to:
– Understand complex information needs
– Model subjective information seeking
– Understand cultural dynamics

40

41 http://answers.yahoo.com/question/index;_ylt=3?qid=20071008115118AAh1HdO

42 Lifecycle of a Question in CQA
The asker chooses a category, composes the question, and opens it. Answers arrive, and the asker examines them. If the asker finds the answer, they close the question, choose the best answer, and give ratings; otherwise the question is eventually closed by the system and the best answer is chosen by voters.

43

44

45

46

47

48 Community

49

50

51

52

53

54 Editorial Quality != User Popularity != Usefulness

55 Are editor/judge labels "meaningful"?
Information seeking process: the user wants to find useful information about a topic with incomplete knowledge (N. Belkin: "Anomalous States of Knowledge").
We want to model directly whether the user found satisfactory information.
A specific (amenable) case: CQA.

56 Yahoo! Answers: The Good News
Active community of millions of users in many countries and languages
Has accumulated a great number of questions and answers
Effective for subjective information needs
– Great forum for socialization/chat
(Can be) invaluable for hard-to-find information not available on the web

57

58 Yahoo! Answers: The Bad News
May have to wait a long time to get a satisfactory answer
May never obtain a satisfying answer
Time to close a question (hours) for sample question categories:
1. 2006 FIFA World Cup, 2. Optical, 3. Poetry, 4. Football (American), 5. Scottish Football (Soccer), 6. Medicine, 7. Winter Sports, 8. Special Education, 9. General Health Care, 10. Outdoor Recreation

59 Asker Satisfaction Problem
Given a question submitted by an asker in CQA, predict whether the user will be satisfied with the answers contributed by the community.
– "Satisfied" is defined as: the asker personally has closed the question AND selected the best answer AND provided a rating of at least 3 "stars" for the best answer.
– Otherwise, the asker is "unsatisfied".
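The label definition above can be written as a simple predicate; the field names below are illustrative, not the actual dataset schema:

```python
def is_satisfied(question):
    """True iff the asker closed the question, chose the best answer,
    and rated that best answer with at least 3 stars."""
    return (question["closed_by_asker"]
            and question["best_answer_chosen_by_asker"]
            and question["best_answer_rating"] >= 3)
```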

60 Approach: Machine Learning over Content and Usage Features
Theme: holistic integration of content analysis and usage analysis
Method: supervised (and later partially supervised) machine learning over features
Tools:
– Weka (ML library): SVM, boosting, decision trees, naive Bayes, …
– Part-of-speech taggers, chunkers
– Corpora (Wikipedia, web, queries, …)

61 Satisfaction Prediction Features
Approach: classification algorithms from machine learning.
Feature groups (question, answer, asker history, answerer history, category, textual) feed a classifier (support vector machine, decision tree, boosting, naive Bayes) that predicts whether the asker is satisfied or not.
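A minimal sketch of this setup; the original work used Weka, so scikit-learn here is only for illustration, and X and y are placeholders for the real feature matrix and satisfaction labels:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(1000, 20)        # placeholder feature vectors
y = np.random.randint(0, 2, 1000)   # placeholder labels: 1 = satisfied

for name, clf in [("Decision tree (C4.5-style)", DecisionTreeClassifier()),
                  ("SVM", SVC()),
                  ("Naive Bayes", GaussianNB())]:
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: cross-validated F1 = {f1:.2f}")
```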

62 Prediction Algorithms
– Heuristic: number of answers
– Baseline: simply predicts the majority class (satisfied)
– ASP_SVM: our system with the SVM classifier
– ASP_C4.5: with the C4.5 classifier
– ASP_RandomForest: with the RandomForest classifier
– ASP_Boosting: with the AdaBoost algorithm combining weak learners
– ASP_NaiveBayes: with the naive Bayes classifier

63 Evaluation Metrics
Precision: the fraction of predicted-satisfied asker information needs that were indeed rated satisfactory by the asker.
Recall: the fraction of all rated-satisfied questions that were correctly identified by the system.
F1: the harmonic mean of precision and recall, computed as 2*(precision*recall)/(precision+recall).
Accuracy: the overall fraction of instances classified into the proper class.
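A sketch of these metrics for the "satisfied" class, assuming `predicted` and `actual` are sets of question ids labeled satisfied:

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for the satisfied class."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```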

64 Datasets
Crawled from Yahoo! Answers in early 2008 (thanks to Yahoo! for support):
Questions   Answers     Askers    Categories   % Satisfied
216,170     1,963,615   158,515   100          50.7%
Data is available at http://ir.mathcs.emory.edu/shared

65 Dataset Statistics
Category                   #Q      #A       #A per Q   Satisfied   Avg asker rating   Time to close by asker
2006 FIFA World Cup (TM)   1,194   35,659   29.86      55.4%       2.63               47 minutes
Mental Health              151     1,159    7.68       70.9%       4.30               1 day and 13 hours
Mathematics                651     2,329    3.58       44.5%       4.48               33 minutes
Diet & Fitness             450     2,436    5.41       68.4%       4.30               1.5 days
Asker satisfaction varies significantly across categories; #Q, #A, time to close, etc. are indicative of asker satisfaction.

66 Satisfaction Prediction: Human Performance
Truth: the asker's rating
A random sample of 130 questions, annotated by researchers to calibrate asker satisfaction
– Agreement: 0.82
– F1: 0.45

67 Satisfaction Prediction: Human Performance (Cont'd): Amazon Mechanical Turk
A service provided by Amazon: workers submit responses to a Human Intelligence Task (HIT) for $0.01-0.10 each.
Can usually get thousands of items labeled in hours.

68 Satisfaction Prediction: Human Performance (Cont'd): Amazon Mechanical Turk
Methodology:
– Used the same 130 questions
– For each question, listed the best answer as well as the four other answers, ordered by votes
– Five independent raters for each question
– Agreement: 0.9; F1: 0.61
– Best accuracy was achieved when at least 4 out of 5 raters predicted the asker to be "satisfied" (otherwise, labeled "unsatisfied")

69 Comparison of Human and Automatic Performance (F1 measure)
Classifier          With Text   Without Text   Selected Features
ASP_SVM             0.69        0.72           0.62
ASP_C4.5            0.75        0.76           0.77
ASP_RandomForest    0.70        0.74           0.68
ASP_Boosting        0.67        0.67           0.67
ASP_NB              0.61        0.65           0.58
Best human perf.    0.61
Baseline (naive)    0.66
C4.5 is the most effective classifier for this task.
Human F1 performance is lower than the naive baseline!

70 Features by Information Gain (Satisfied Class)
0.14219  Q: Asker's previous rating
0.13965  Q: Average past rating by asker
0.10237  UH: Member since (interval)
0.04878  UH: Average # answers for past Q
0.04878  UH: Previous Q resolved for the asker
0.04381  CA: Average asker rating for the category
0.04306  UH: Total number of answers received
0.03274  CA: Average voter rating
0.03159  Q: Question posting time
0.02840  CA: Average # answers per Q
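For reference, a sketch of how such an information-gain ranking can be computed for a single (possibly continuous) feature against the binary satisfied label, using simple equal-width binning; this is an illustration, not the exact procedure used:

```python
import numpy as np

def information_gain(feature_values, labels, bins=10):
    """labels: numpy int array of 0/1 (1 = satisfied); feature_values: numeric array."""
    def entropy(y):
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    edges = np.histogram_bin_edges(feature_values, bins=bins)
    binned = np.digitize(feature_values, edges)
    gain = entropy(labels)
    for b in np.unique(binned):
        mask = binned == b
        gain -= mask.mean() * entropy(labels[mask])
    return gain
```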

71 "Offline" vs. "Online" Prediction
Offline prediction:
– All features (question, answer, asker, and category)
– F1: 0.77
Online prediction:
– No answer features
– Only asker history and question features (stars, # comments, sum of votes, …)
– F1: 0.74

72 Feature Ablation
Feature set                     Precision   Recall   F1
Selected features               0.80        0.73     0.77
No question-answer features     0.76        0.74     0.75
No answerer features            0.76        0.75
No category features            0.75        0.76     0.75
No asker features               0.72        0.69     0.71
No question features            0.68        0.72     0.70
Asker and question features are the most important. Answer quality, answerer expertise, and category characteristics may matter less: caring or supportive answers might sometimes be preferred.

73 Satisfaction Varies by Asker Experience
Group together questions from askers with the same number of previous questions.
Prediction accuracy increases dramatically, reaching an F1 of 0.9 for askers with >= 5 previous questions.

74 Personalized Prediction of Asker Satisfaction
The same information != the same usefulness for different users!
A personalized classifier achieves surprisingly good accuracy (even with just 1 previous question).
The simple strategy of grouping users by the number of previous questions is more effective than other methods for users with a moderate amount of history.
For users with >= 20 questions, textual features become more significant.

75 Some Results

76 Some Personalized Models

77 Summary
Asker satisfaction is predictable:
– Can achieve higher-than-human accuracy by exploiting interaction history
The user's experience is important.
General model (one-size-fits-all):
– 2,000 training questions are enough
Personalized satisfaction prediction:
– Helps with sufficient data (>= 1 previous interaction; text patterns become observable with >= 20 previous interactions)

78 Other Tasks in Progress
Subjectivity and sentiment analysis:
– B. Li, Y. Liu, and E. Agichtein, CoCQA: Co-Training over Questions and Answers with an Application to Predicting Question Subjectivity Orientation, in Proc. of EMNLP 2008
Discourse analysis
Cross-cultural comparisons
CQA vs. web search comparison

79 Summary
User-generated content:
– Growing
– Important: impact on mainstream media, scholarly publishing, …
– Can provide insight into information seeking and social processes
– "Training" data for IR, machine learning, NLP, …
– Need to re-think quality, impact, usefulness

80 References
Y. Liu, J. Bian, and E. Agichtein, Predicting Information Seeker Satisfaction in Community Question Answering, in Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2008
Y. Liu and E. Agichtein, You've Got Answers: Towards Personalized Models for Predicting Success in Community Question Answering (short paper), in Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008
B. Li, Y. Liu, and E. Agichtein, CoCQA: Co-Training over Questions and Answers with an Application to Predicting Question Subjectivity Orientation, in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008
E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High Quality Content in Social Media, in Proc. of the ACM Web Search and Data Mining Conference (WSDM), 2008
C. Clarke, E. Agichtein, S. T. Dumais, and R. W. White, The Influence of Caption Features on Clickthrough Patterns in Web Search, in Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2007
P. Jurczyk and E. Agichtein, Discovering Authorities in Question Answer Communities Using Link Analysis (short paper), in Proc. of the ACM Conference on Information and Knowledge Management (CIKM), 2007
E. Agichtein, E. Brill, and S. T. Dumais, Improving Web Search Ranking by Incorporating User Behavior Information, in Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2006
E. Agichtein, E. Brill, S. T. Dumais, and R. Ragno, Learning User Interaction Models for Predicting Web Search Result Preferences, in Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2006

81 Thank you!

82 Question-Answer Features
Q: length, posting time, …
QA: length, KL divergence
Q: votes
Q: terms

83 User Features
U: member since
U: total points
U: # questions
U: # answers

84 Category Features
CA: Average time to close a question
CA: Average # answers per question
CA: Average asker rating
CA: Average voter rating
CA: Average # questions per hour
CA: Average # answers per hour
Category         #Q    #A    #A per Q   Satisfied   Avg asker rating   Time to close by asker
General Health   134   737   5.46       70.4%       4.49               1 day and 13 hours

