From Devices to People: Attribution of Search Activity in Multi-User Settings Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz Microsoft Research,

From Devices to People: Attribution of Search Activity in Multi-User Settings Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz Microsoft Research, USA; ETHZ, Switzerland Contact:

IDs in Behavioral Scale
Search engines use machine ids based on cookies etc.
Assume 1:1 mapping from ids (e.g., FDED432F901D) to people
However, multi-user computer usage is common
2011 Census data: 75% of U.S. households have a computer
In most homes that machine is shared between multiple people

Multi-User Web Search
Analyzed two years of comScore search data (all engines, en-US)
Both machine identifiers and person identifiers (users self-identify)
Takeaway: 56% of machine ids comprise multi-user behavior

Handling Multiple Users
Limited current solutions in search engines (users can sign in)
Some solutions in other domains, e.g., streaming media
Users can be asked to confirm identity (cumbersome)
Our Focus: Can we do this automatically? (in context of search)

Activity Attribution Challenge
Given a stream of data from a machine identifier, attribute observed historic and new behavior to the correct person
Applications for: personalization, advertising, privacy protection
Related work in signal processing and fraud detection; hardly any related work in user behavior analysis
Research on “individual differences” in search activity is relevant
[Figure: historic behavior from a machine id grouped into k user clusters (User 1, User 2, User 3); a new query arrives: which user?]

Three parts to analysis
1. Characterizing differences in behavior from a machine given the presence of one user versus multiple users
2. Predicting:
- Presence of multiple users (1 vs. N problem) (Classification task)
- Number of users on a machine (Regression task)
3. Associating behavior to the correct user (via clustering in our case)
- e.g., New query arrives, which user issued that query?
Focus on characteristics and prediction in this presentation

comScore Search Log Data
Two years of data ( )
Purchased data from comScore (non-proprietary)
Summary statistics:
Person information per machine is ground truth

Characterization
Are there within-id behavioral differences for one user vs. many?

Characterization
Characterize behavior observed from a machine identifier along a number of different dimensions:
- Behavioral: # Queries, # Clicks, # Unique Query Terms, etc.
- Temporal: Times machine used, Variations in time (hour, day)
- Topical: Types of topics, Variation in topics of queries/clicks
- Content: Nature of results viewed, incl. readability level
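
As an illustration of the behavioral dimension, the counts named on this slide could be aggregated per machine id roughly as follows. This is a minimal sketch, not the authors' implementation: the `SearchEvent` record layout and the whitespace tokenization of queries are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SearchEvent:
    """One logged query from a machine id (hypothetical record layout)."""
    query: str
    num_clicks: int


def behavioral_features(events):
    """Aggregate simple per-machine counts from a list of SearchEvent."""
    terms = set()
    for event in events:
        # Assumption: query terms are lowercase whitespace-separated tokens.
        terms.update(event.query.lower().split())
    return {
        "num_queries": len(events),
        "num_clicks": sum(event.num_clicks for event in events),
        "num_unique_query_terms": len(terms),
    }


events = [
    SearchEvent("cheap flights", 2),
    SearchEvent("flights to Paris", 1),
    SearchEvent("lego sets", 0),
]
features = behavioral_features(events)
```

Each count here is computed over everything observed under one machine id, which is the unit the characterization compares across single-user and multi-user machines.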

Behavioral Features
Much more search behavior when there are multiple users
- More searching and clicking, and diversity in queries/clicks
BUT some of this also applies to active searchers …
Increase in behavioral features for many users vs. single user

Temporal Features
Variance in time at which searches are issued, specifically:
- Day of week entropy
- Time of day entropy
Large differences as the number of searchers associated with the machine varies
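
The two entropy features can be illustrated with Shannon entropy over the observed days and hours of a machine's queries. The sketch below is an assumption-laden example: the slides do not say how time of day was bucketed, so one bucket per clock hour is used here, and the function names are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def temporal_entropy_features(timestamps):
    """Day-of-week and time-of-day entropy for one machine id.

    Assumption: time of day is bucketed by clock hour (0-23).
    """
    return {
        "day_of_week_entropy": shannon_entropy(t.weekday() for t in timestamps),
        "time_of_day_entropy": shannon_entropy(t.hour for t in timestamps),
    }


# A machine used on one day at one hour has zero entropy; usage spread
# evenly over two days and two hours has one bit of entropy in each feature.
narrow = temporal_entropy_features([datetime(2013, 1, 7, 9), datetime(2013, 1, 7, 9)])
spread = temporal_entropy_features([datetime(2013, 1, 7, 9), datetime(2013, 1, 8, 21)])
```

Higher entropy means searching is spread over more days or hours, which is the kind of variation the slide associates with more searchers on a machine.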

Topical/Content Features
Observed similar variations in entropy for topics and the readability of content
Topic pair (Ti, Tj) in 4-hr bucket
Topic association: Multi-searcher machines overestimate topic associations for 90% of pairs

Prediction
Can we predict multi-user ids?

Prediction
Two prediction tasks:
1. Classification task. Question: Is a machine identifier composed of multiple people?
2. Regression task. Question: If multiple users are behind a machine identifier, how many?
MART classification and regression
Use all features described so far
10-fold CV (at user level), 10 runs
Can chain models: {Features} -> Classifier (k > 1?) -> if No, k' = 1; if Yes, Regressor -> label k'
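
The chaining of the two models can be sketched as a small function that consults the classifier first and only calls the regressor for machines flagged as multi-user. This is an illustrative sketch: the toy threshold models stand in for the trained MART models, and rounding the regressor output and clamping it to at least 2 are assumptions not stated in the slides.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def estimate_num_users(
    features: Features,
    is_multi_user: Callable[[Features], bool],
    predict_count: Callable[[Features], float],
) -> int:
    """Chain a 1-vs-N classifier with a user-count regressor to get k'."""
    if not is_multi_user(features):
        return 1  # classifier says single user, so k' = 1
    # Assumption: round to an integer and enforce k' >= 2 for multi-user ids.
    return max(2, round(predict_count(features)))


# Toy stand-ins for the trained models (hypothetical; a single feature,
# e.g. a time-of-day entropy value, drives both decisions).
def toy_classifier(features: Features) -> bool:
    return features[0] > 1.0


def toy_regressor(features: Features) -> float:
    return 1.0 + features[0]


single = estimate_num_users([0.2], toy_classifier, toy_regressor)
multi = estimate_num_users([1.8], toy_classifier, toy_regressor)
```

Passing the models in as callables keeps the chaining logic independent of whichever learner produces the predictions.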

Features and Labels
Features from the characterization: Behavioral, Temporal, Topical, and Content
Plus Referential: indications that there is likely to be another member of the household, e.g., reference to spouse, child, roommate, etc. in queries
Labels:
- Classification: Multi-user (1) vs. single user (0)
- Regression: Number of users associated with machine id
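
A referential feature of this kind might be approximated by counting queries that mention a household member. The sketch below is only a guess at the idea: the term list and the "my <relation>" pattern are hypothetical, since the slides give only spouse, child, and roommate as examples.

```python
import re

# Hypothetical term list; the slides only mention spouse, child, roommate.
HOUSEHOLD_TERMS = (
    "wife", "husband", "spouse", "son", "daughter", "child", "kids", "roommate",
)

# Assumption: a reference looks like "my <relation>" as whole words.
_REFERENCE_PATTERN = re.compile(
    r"\bmy\s+(?:" + "|".join(HOUSEHOLD_TERMS) + r")\b", re.IGNORECASE
)


def count_referential_queries(queries):
    """Number of queries that refer to another household member."""
    return sum(1 for query in queries if _REFERENCE_PATTERN.search(query))


queries = [
    "gift ideas for my wife",
    "weather seattle",
    "homework help for my son",
    "Sony headphones",
]
num_referential = count_referential_queries(queries)
```

Requiring the possessive "my" and whole-word matches keeps queries such as "Sony headphones" from being counted as a reference to a son.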

Classification: Results
Temporal features appear important for this task

Classification: Time of Day ONLY
Variant that uses eight features associated only with the time of day (e.g., hour bucket, bucket entropy)
Performance similar to full model
Simpler to implement than the complete range of features highlighted earlier: 8 features vs. 80!

Prediction: Regression
Same features as for classification, different label
NRMSE = RMSE / (k_max - k_min)
Time-of-day features not as useful here (NRMSE = 0.1300)
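
The normalized error on this slide can be computed directly from its definition. In the sketch below, k max and k min are taken to be the largest and smallest true user counts in the evaluation set; that reading is an assumption, as is the function name.

```python
import math


def nrmse(y_true, y_pred):
    """Root mean squared error divided by the range of true user counts."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    # Assumption: k_max and k_min come from the true labels.
    k_range = max(y_true) - min(y_true)
    if k_range == 0:
        raise ValueError("NRMSE is undefined when all true counts are equal")
    return math.sqrt(mse) / k_range


true_counts = [1, 2, 3, 5]
predicted_counts = [1, 2, 4, 4]
score = nrmse(true_counts, predicted_counts)
```

Dividing by the label range puts the error on a 0-to-1 scale, so values such as 0.13 can be read as a fraction of the possible spread in user counts.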

Top Features
Need the additional features for regression task
Features linked to children’s interests are important
Where there is a child, there is at least one adult (=> N ≥ 2)

Assignment
Can we assign to correct user?

Assignment
Given the k' from the regressor, run k-means clustering on the history from each machine identifier
Real-time assignment, given a user session:
- Compare 1st query in session to cluster(s), assign to most similar
- Compute similarity between session/cluster representative
Accuracy = proportion of assignments correct
Purity = proportion of the assigned cluster belonging to the correct user
Baseline = one user
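
The real-time assignment step can be sketched as a nearest-cluster lookup on the first query of a session. This is an illustrative sketch under several assumptions: clusters are taken as already computed, each cluster representative is a bag of words, and cosine similarity is the similarity measure; the slides do not specify the representation or the measure.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def assign_session(first_query, representatives):
    """Index of the cluster most similar to the session's first query."""
    query_vec = bag_of_words(first_query)
    return max(
        range(len(representatives)),
        key=lambda i: cosine(query_vec, representatives[i]),
    )


def accuracy(predicted, actual):
    """Proportion of sessions assigned to the correct user cluster."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical cluster representatives for two users on one machine.
representatives = [
    bag_of_words("nfl scores football fantasy football"),
    bag_of_words("lego minecraft cartoon games"),
]
sessions = ["fantasy football rankings", "minecraft skins", "lego games online"]
true_users = [0, 1, 1]
assigned = [assign_session(q, representatives) for q in sessions]
acc = accuracy(assigned, true_users)
```

When the first query shares no terms with any cluster, every similarity is zero and the first cluster wins the tie; a real system would need an explicit fallback for that case.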

Discussion
comScore data based on self-identification (Errors? Not apparent)
Need to explore:
- Utility of sign-in to search engines as proxy for person identifier
- Different analysis timeframes (e.g., one month vs. two years)

Conclusions and Future Work
Introduced activity attribution challenge
Clear differences in logged behavior for one user vs. many
Possible to accurately:
1. Predict if multiple users are behind a machine id (AUC = 0.94)
2. Estimate number of users behind machine id (NRMSE = 0.092)
3. Assign queries to people (75% accuracy, 56% gain over baseline)
Future work: Apply methods to personalization, advertising, etc.