Observational Approaches to Information Retrieval. SIGIR 2014 Tutorial: Choices and Constraints (Part II). Diane Kelly, Filip Radlinski, Jaime Teevan


Observational Approaches to Information Retrieval SIGIR 2014 Tutorial: Choices and Constraints (Part II) Diane Kelly, Filip Radlinski, Jaime Teevan Slides available at:

Diane Kelly, University of North Carolina, USA Filip Radlinski, Microsoft, UK Jaime Teevan, Microsoft Research, USA

Tutorial Goals 1. To help participants develop a broader perspective of research goals and approaches in IR.  Descriptive, predictive and explanatory 2. To improve participants’ understanding of research choices and constraints.  Every research project requires the researcher to make a series of choices about a range of factors, and usually there are constraints that influence these choices.  By using some of our own research papers, we aim to expose you to the experiential aspects of the research process by giving you a behind-the-scenes view of how we made choices in our own research.

Research Goals & Approaches Describe Report a set of observations and provide benchmarks (e.g., average queries per user, problems a user experiences when engaging in search). Such studies might also present categorizations of the observations. Predict Seek to establish predictable relationships. Take as input some set of features (click-through rate, dwell time) and use these to predict other variables (query abandonment, satisfaction). Explain (why?) Propose a theoretical model that explains how select constructs interact and interrelate. Devise procedures to measure those constructs (that is, translate the constructs into variables that can be controlled and measured). Devise a protocol (usually experimental) to observe the phenomenon of interest. Seek to demonstrate causality, not just show that the variables are related.

Research Goals & Approaches
                                    Describe   Predict   Explain
Afternoon   Field Observation          ✔
            Log Analysis               ✔          ✔
Morning     Laboratory Experiment      ✔          ✔         ✔
            Field Experiment           ✔          ✔         ✔

Example: Search Difficulties Describe A diary study might be used to gain insight into when and how users experience and address search difficulties. Log data might also be analyzed to identify how often these events occur. Predict A model might be constructed using the signals available in a log to predict when users will abandon search result pages without clicking. This model might then be evaluated with other log data. Explain Results from these studies might then be used to create an explanatory/theoretical model of search difficulty, which can be used to generate testable hypotheses. The model can include constructs and variables beyond those available in the log data. An experiment might be designed to test the explanatory power of the theory indirectly by examining the predictive power of the hypotheses.

Overview  Observational log analysis  What we can learn  Collecting log data  Cleaning log data (Filip)  Analyzing log data  Field observations (Diane) Dumais, Jeffries, Russell, Tang & Teevan. “Understanding User Behavior through Log Data and Analysis.”

What We Can Learn Observational Approaches to Information Retrieval

Marginalia examples: David Foster Wallace; Mark Twain. “Cowards die many times before their deaths.” (annotated by Nelson Mandela). Pierre de Fermat (1637): “I have discovered a truly marvelous proof... which this margin is too narrow to contain.” Students prefer used textbooks that are annotated. [Marshall 1998]

Digital Marginalia  Do we lose marginalia with digital documents?  Internet exposes information experiences  Meta-data, annotations, relationships  Large-scale information usage data  Change in focus  With marginalia, interest is in the individual  Now we can look at experiences in the aggregate

Practical Uses for Behavioral Data  Behavioral data to improve Web search  Offline log analysis  Example: Re-finding is common, so add history support  Online log-based experiments  Example: Interleave different rankings to find the best algorithm  Log-based functionality  Example: Boost clicked results in a search result list  Behavioral data on the desktop  Goal: Allocate editorial resources to create Help docs  How to do so without knowing what people search for?

Value of Observational Log Analysis  Focus of observational log analysis  Description: What do people currently do?  Prediction: What will people do in similar situations?  Study real behavior in natural settings  Understand how people search  Identify real problems to study  Improve ranking algorithms  Influence system design  Create realistic simulations and evaluations  Build a picture of human interest

Societal Uses of Behavioral Data  Understand people’s information needs  Understand what people talk about  Impact public policy? (E.g., DonorsChoose.org) Baeza-Yates, Dupret, Velasco. A study of mobile search queries in Japan. WWW 2007

Personal Use of Behavioral Data  Individuals now have a lot of behavioral data  Introspection of personal data popular  My Year in Status  Status Statistics  Expect to see more  As compared to others  For a purpose

Defining Behavioral Log Data  Behavioral log data are:  Traces of natural behavior, seen through a sensor  Examples: Links clicked, queries issued, tweets posted  Real-world, large-scale, real-time  Behavioral log data are not:  Non-behavioral sources of large-scale data  Collected data (e.g., poll data, surveys, census data)  Recalled behavior or subjective impressions

Real-World, Large-Scale, Real-Time  Private behavior is exposed  Example: Porn queries, medical queries  Rare behavior is common  Example: Observe 500 million queries a day  Interested in behavior that occurs 0.002% of the time  Still observe the behavior 10 thousand times a day!  New behavior appears immediately  Example: Google Flu Trends

Drawbacks  Not controlled  Can run controlled log studies  Discussed in morning tutorial (Filip)  Adversarial  Cleaning log data later today (Filip)  Lots of missing information  Not annotated, no demographics, we don’t know why  Observing richer information after break (Diane)  Privacy concerns  Collect and store data thoughtfully  Next section addresses privacy

Query                      Time               User
sigir 2014                 …:41 am   1/15/…   …
goldcoast sofitel          10:44 am  1/15/…   …
learning to rank           10:56 am  1/15/…   …
sigir 2014                 …:21 am   1/15/…   …
ool transportation         11:59 am  1/15/…   …
restaurants brisbane       12:01 pm  1/15/…   …
surf lessons               12:17 pm  1/15/…   …
james allen                12:18 pm  1/15/…   …
daytrips from brisbane     1:30 pm   1/15/…   …
sigir 2014                 1:30 pm   1/15/…   …
sigir program              2:32 pm   1/15/…   …
sigir2014.org              2:42 pm   1/15/…   …
information retrieval      4:56 pm   1/15/…   …
sigir 2014                 5:02 pm   1/15/…   …
xxx clubs on gold coast    10:14 pm  1/15/…   …
sex videos                 1:49 am   1/16/…   …

Query                      Time               User
sigir 2014                 …:41 am   1/15/…   …
goldcoast sofitel          10:44 am  1/15/…   …
teen sex                   10:56 am  1/15/…   …
sigir 2014                 …:21 am   1/15/…   …
ool transportation         11:59 am  1/15/…   …
restaurants brisbane       12:01 pm  1/15/…   …
surf lessons               12:17 pm  1/15/…   …
james allen                12:18 pm  1/15/…   …
daytrips from brisbane     1:30 pm   1/15/…   …
sex with animals           1:30 pm   1/15/…   …
sigir program              2:32 pm   1/15/…   …
sigir2014.org              2:42 pm   1/15/…   …
information retrieval      4:56 pm   1/15/…   …
sigir 2014                 5:02 pm   1/15/…   …
xxx clubs on gold coast    10:14 pm  1/15/…   …
sex videos                 1:49 am   1/16/…   …
cheap digital camera       12:17 pm  1/15/…   …
cheap digital camera       12:18 pm  1/15/…   …
cheap digital camera       12:19 pm  1/15/…   …
社会科学                    11:59 am  11/3/23  …
                           12:01 pm  11/3/23  …
 Porn  Language  Spam  System errors

Query                      Time               User
sigir 2014                 …:41 am   1/15/…   …
goldcoast sofitel          10:44 am  1/15/…   …
learning to rank           10:56 am  1/15/…   …
sigir 2014                 …:21 am   1/15/…   …
ool transportation         11:59 am  1/15/…   …
restaurants brisbane       12:01 pm  1/15/…   …
surf lessons               12:17 pm  1/15/…   …
james allen                12:18 pm  1/15/…   …
daytrips from brisbane     1:30 pm   1/15/…   …
sigir 2014                 1:30 pm   1/15/…   …
sigir program              2:32 pm   1/15/…   …
sigir2014.org              2:42 pm   1/15/…   …
information retrieval      4:56 pm   1/15/…   …
sigir 2014                 5:02 pm   1/15/…   …
kangaroos                  10:14 pm  1/15/…   …
machine learning           1:49 am   1/16/…   …

 Query typology  Query behavior  Long term trends

Uses of Analysis  Ranking – E.g., precision  System design – E.g., caching  User interface – E.g., history  Test set development  Complementary research

Surprises About Query Log Data  From early log analysis  Examples: Jansen et al. 2000, Broder 1998  Scale: Term common if it appeared 100 times!  Queries are not 7 or 8 words long  Advanced operators not used or “misused”  Nobody used relevance feedback  Lots of people search for sex  Navigation behavior common  Prior experience was with library search

Surprises About Microblog Search?

Surprises About Microblog Search?  [Screenshots: Twitter search results ordered by time, showing "8 new tweets", vs. web search results ordered by relevance]

Surprises About Microblog Search?  [Screenshots: Twitter search results ordered by time, showing "8 new tweets", vs. web search results ordered by relevance]  Twitter search: time important; people important; specialized syntax; queries common; repeated a lot; change very little; often navigational  Web search: time and people less important; no syntax use; queries longer; queries develop

Overview  Observational log analysis  What we can learn  Understand and predict user behavior  Collecting log data  Cleaning log data  Analyzing log data  Field observations

Collecting Log Data Observational Approaches to Information Retrieval

How to Get Logs for Analysis  Use existing logged data  Explore sources in your community (e.g., proxy logs)  Work with a company (e.g., FTE, intern, visiting researcher)  Generate your own logs  Focuses on questions of unique interest to you  Examples: UFindIt, Wikispeedia  Construct community resources  Shared software and tools  Client side logger (e.g., VIBE logger)  Shared data sets  Shared platform  Lemur Community Query Log Project

Web Service Logs  [Screenshot labels: Government contractor, Recruiting, Academic field]  Example sources  Search engine  Commercial site  Types of information  Queries, clicks, edits  Results, ads, products  Example analysis  Click entropy  Teevan, Dumais & Liebling. To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent. SIGIR 2008

Controlled Web Service Logs  Example sources  Mechanical Turk  Games with a purpose  Types of information  Logged behavior  Active feedback  Example analysis  Search success  Ageev, Guo, Lagun & Agichtein. Find It If You Can: A Game for Modeling … Web Search Success Using Interaction Data. SIGIR 2011

Public Web Service Content  Example sources  Social network sites  Wiki change logs  Types of information  Public content  Dependent on service  Example analysis  Twitter topic models  Ramage, Dumais & Liebling. Characterizing microblogging using latent topic models. ICWSM 2010

Web Browser Logs  Example sources  Proxy  Logging tool  Types of information  URL visits, paths followed  Content shown, settings  Example analysis  DiffIE  Teevan, Dumais and Liebling. A Longitudinal Study of How Highlighting Web Content Change Affects.. Interactions. CHI 2010

Web Browser Logs  Example sources  Proxy  Logging tool  Types of information  URL visits, paths followed  Content shown, settings  Example analysis  Revisitation  Adar, Teevan and Dumais. Large Scale Analysis of Web Revisitation Patterns. CHI 2008

Rich Client-Side Logs  Example sources  Client application  Operating system  Types of information  Web client interactions  Other interactions – rich!  Example analysis  Stuff I’ve Seen  Dumais et al. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003

A Simple Example  Logging search queries and clicked results  [Diagram: queries such as "dumais", "beijing", "sigir 2014", "vancouver", "chi 2014" sent to a Web Service, which returns a "SERP" (search engine result page)]

A Simple Example  Logging Queries  Basic data:  Which time? time_Client.send, time_Server.receive, time_Server.send, time_Client.receive  Additional contextual data:  Where did the query come from?  What results were returned?  What algorithm or presentation was used?  Other metadata about the state of the system
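To make this concrete, here is a minimal sketch of a query-log record containing the four timestamps and the contextual fields listed above; the JSON-lines format, field names, and helper function are illustrative assumptions, not the schema of any particular system.

```python
# Hypothetical query-log record illustrating the timestamps and context above.
# Field names are illustrative; a real system would define its own schema.
import json
import time

def log_query_event(query, results, algorithm, ui_variant, referrer,
                    time_client_send, time_client_receive,
                    time_server_receive, time_server_send,
                    out_path="queries.log"):
    record = {
        "query": query,
        "time_client_send": time_client_send,        # when the client issued the query
        "time_server_receive": time_server_receive,  # when the server received it
        "time_server_send": time_server_send,        # when the server returned results
        "time_client_receive": time_client_receive,  # when the client rendered the SERP
        "referrer": referrer,                         # where the query came from
        "results": results,                           # URLs shown on the SERP
        "algorithm": algorithm,                       # ranking variant used
        "ui_variant": ui_variant,                     # presentation variant used
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage with made-up values.
now = time.time()
log_query_event("sigir 2014", ["sigir2014.org", "sigir.org"], "ranker-A", "ui-1",
                "homepage", now, now + 0.21, now + 0.05, now + 0.18)
```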

A Simple Example  Logging Clicked Results (on the SERP)  How can a Web service know which SERP links are clicked?  Proxy re-direct: route the click through the service, e.g., a result link carrying a logging parameter such as ...&log=DiFVYj1tRQZtv6e1FF7kltj02Z30eatB2jr8tJUFR...  Script (e.g., JavaScript): function changeImage(){ document.imgC.src="thank_you.gif"; } function backImage(){ document.imgC.src="image.gif"; }  DOM and cross-browser challenges, but can instrument more than link clicks  No download required, but adds complexity and latency, and may influence user interaction  What happened after the result was clicked?  What happens beyond the SERP is difficult to capture  Browser actions (back, open in new tab, etc.) are difficult to capture  To better interpret user behavior, need richer client instrumentation
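A minimal sketch of the proxy re-direct option described above, assuming a hypothetical /click endpoint with q, url, and rank parameters; a production service would feed its real logging pipeline rather than a local file.

```python
# Sketch of redirect-based click logging (illustrative, not any service's
# actual infrastructure). Result links point at /click?q=...&url=...&rank=...;
# the handler appends a log record, then 302-redirects to the clicked result.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ClickLogger(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/click":
            self.send_error(404)
            return
        params = parse_qs(parsed.query)
        target = params.get("url", [""])[0]
        record = {
            "time_server_receive": time.time(),   # server-side click timestamp
            "query": params.get("q", [""])[0],    # query that produced the SERP
            "rank": params.get("rank", [""])[0],  # position of the clicked result
            "url": target,                        # destination the user clicked
            "client": self.client_address[0],     # coarse client identifier
        }
        with open("clicks.log", "a") as f:
            f.write(json.dumps(record) + "\n")
        self.send_response(302)                   # send the user on to the result
        self.send_header("Location", target or "/")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ClickLogger).serve_forever()
```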

A (Not-So-) Simple Example  Logging: Queries, Clicked Results, and Beyond

What to Log  Log as much as possible  Time-keyed events (e.g., queries, clicks)  Ideal log allows the user experience to be fully reconstructed  But … make reasonable choices  Richly instrumented client experiments can provide guidance  Consider the amount of data and storage required  Challenges with scale  Storage requirements  1 KB/record x 10 records/query x 100 million queries/day = ~1 TB/day (1000 GB/day)  Network bandwidth  Client to server; data center to data center
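The storage estimate above as a quick, explicit calculation (the per-record size, records per query, and query volume are the slide's illustrative figures):

```python
# Back-of-the-envelope storage estimate, as a sanity check
# (assumption: 1 KB per record, 10 records per query, 100M queries per day).
bytes_per_record = 1_000
records_per_query = 10
queries_per_day = 100_000_000

bytes_per_day = bytes_per_record * records_per_query * queries_per_day
print(f"{bytes_per_day / 1e9:.0f} GB/day")  # -> 1000 GB/day, i.e. ~1 TB/day
```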

What to Do with the Data  Keep as much raw data as possible (and allowable)  Must consider Terms of Service, IRB  Post-process data to put it into a usable form  Integrate across servers to organize the data  By time  By userID  Normalize time, URLs, etc.  Rich data cleaning
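A sketch of this kind of post-processing, assuming events arrive as JSON lines with hypothetical user_id, client_time, tz_offset_min, and url fields; it normalizes times to UTC, canonicalizes URLs, and groups events by user so later steps can work through each user's stream in order.

```python
# Sketch of post-processing raw events into a per-user organization
# (assumption: events are JSON lines with hypothetical fields "user_id",
# "client_time" in ISO format, "tz_offset_min" = minutes the client clock is
# ahead of UTC, and "url").
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, urlunsplit

def to_utc(client_time_iso, tz_offset_min):
    local = datetime.fromisoformat(client_time_iso)
    return (local - timedelta(minutes=tz_offset_min)).replace(tzinfo=timezone.utc)

def normalize_url(url):
    s = urlsplit(url.strip())
    scheme = (s.scheme or "http").lower()
    return urlunsplit((scheme, s.netloc.lower(), s.path.rstrip("/"), s.query, ""))

def organize(log_lines):
    by_user = defaultdict(list)
    for line in log_lines:
        event = json.loads(line)
        event["utc_time"] = to_utc(event["client_time"], event.get("tz_offset_min", 0))
        event["url"] = normalize_url(event["url"])
        by_user[event["user_id"]].append(event)
    for events in by_user.values():
        events.sort(key=lambda e: e["utc_time"])   # chronological order per user
    return by_user
```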

Practical Issues: Time  Time  Client time is closer to the user, but can be wrong or reset  Server time includes network latencies, but controllable  In both cases, need to synchronize time across multiple machines  Data integration  Ensure that joins of data are all using the same basis (e.g., UTC vs. local time)  Accurate timing data is critical for understanding the sequence of user activities, daily temporal patterns, etc.

Practical Issues: Users  HTTP cookies, IP address, temporary ID  Provide broad coverage and are easy to use, but …  Multiple people use the same machine  The same person uses multiple machines (and browsers)  How many cookies did you use today?  Lots of churn in these IDs  Jupiter Research (39% delete cookies monthly); comScore (2.5x inflation)  Login or download client code (e.g., browser plug-in)  Better correspondence to people, but …  Requires sign-in or download  Results in a smaller and biased sample of people or data (those who remember to log in, decided to download, etc.)  Either way, loss of data

Using the Data Responsibly  What data is collected and how it can be used?  User agreements (terms of service)  Emerging industry standards and best practices  Trade-offs  More data:  More intrusive and potential privacy concerns, but also more useful for understanding interaction and improving systems  Less data:  Less intrusive, but less useful  Risk, benefit, and trust

Example: AOL Search Dataset  August 4, 2006: Logs released to the academic community  3 months, 650 thousand users, 20 million queries  Logs contain anonymized user IDs  August 7, 2006: AOL pulled the files, but they were already mirrored  August 9, 2006: New York Times identified Thelma Arnold  “A Face Is Exposed for AOL Searcher No. 4417749”  Queries for businesses and services in Lilburn, GA (pop. 11k)  Queries for Jarrett Arnold (and others of the Arnold clan)  NYT contacted all 14 people in Lilburn with the Arnold surname  When contacted, Thelma Arnold acknowledged her queries  August 21, 2006: 2 AOL employees fired, CTO resigned  September 2006: Class action lawsuit filed against AOL

AnonID  Query                        QueryTime  ItemRank  ClickURL
…       uw cse                       …:18:18    1         http://…
…       uw admissions process        …:18:18    3         http://admit.washington.edu/admission
…       computer science hci         …:19:…
…       computer science hci         …:20:04    2         http://…
…       seattle restaurants          …:25:50    2         http://seattletimes.nwsource.com/rests…
…       perlman montreal             …:15:14    4         http://oldwww.acm.org/perlman/guide.html
…       uw admissions notification   …:13:13
…

 Other well-known AOL users  User …: “i love alaska”  User …: “how to kill your wife”  User 927  Anonymous IDs do not make logs anonymous  They contain directly identifiable information  Names, phone numbers, credit cards, social security numbers  They contain indirectly identifiable information  Example: Thelma’s queries  Birthdate, gender, and zip code identify 87% of Americans

Example: Netflix Challenge  October 2, 2006: Netflix announces contest  Predict people’s ratings for a $1 million prize  100 million ratings, 480k users, 17k movies  Very careful with anonymity post-AOL  May 18, 2008: Data de-anonymized  Paper published by Narayanan & Shmatikov  Uses background knowledge from IMDB  Robust to perturbations in the data  December 17, 2009: Doe v. Netflix  March 12, 2010: Netflix cancels second competition

Ratings file
1:            [Movie 1 of 17770]
12, 3, …      [CustomerID, Rating, Date]
1234, 5, …    [CustomerID, Rating, Date]
2468, 1, …    [CustomerID, Rating, Date]
…
Movie titles
…
10120, 1982, “Bladerunner”
17690, 2007, “The Queen”
…

Netflix: “All customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy... Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.”

Using the Data Responsibly  Control access to the data  Internally: Access control; data retention policy  Externally: Risky (e.g., AOL, Netflix, Enron, Facebook public)  Protect user privacy  Directly identifiable information  Social security, credit card, driver’s license numbers  Indirectly identifiable information  Names, locations, phone numbers … you’re so vain (e.g., AOL)  Putting together multiple sources indirectly (e.g., Netflix, hospital records)  Linking public and private data  k-anonymity; Differential privacy; etc.  Transparency and user control  Publicly available privacy policy  Give users control to delete, opt-out, etc.

Overview  Observational log analysis  What we can learn  Understand and predict user behavior  Collecting log data  Not as simple as it seems  Cleaning log data – Filip!  Analyzing log data  Field observations

[Filip on data cleaning]

Observational Approaches to Information Retrieval SIGIR 2014 Tutorial: Choices and Constraints (Part II) Diane Kelly, Filip Radlinski, Jaime Teevan

Overview  Observational log analysis  What we can learn  Understand and predict user behavior  Collecting log data  Not as simple as it seems  Cleaning log data  Significant portion of log analysis about cleaning  Analyzing log data  Field observations

Analyzing Log Data Observational Approaches to Information Retrieval

Develop Metrics to Capture Behavior  Summary measures  Query frequency (e.g., queries appear 3.97 times [Silverstein et al. 1999])  Query length (e.g., 2.35 terms [Jansen et al. 1998])  Analysis of query intent  Query types and topics (e.g., navigational, informational, transactional [Broder 2002])  Temporal features  Session length (e.g., sessions 2.20 queries long [Silverstein et al. 1999])  Common re-formulations [Lau and Horvitz, 1999]  Click behavior [Joachims 2002]  Relevant results for query  Queries that lead to clicks
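A sketch of how a few of these summary measures might be computed from a cleaned log, assuming the log is a list of (user_id, query, utc_time) tuples; the 30-minute session timeout is a common convention, not necessarily the one used in the cited studies.

```python
# Sketch of simple summary measures over a query log (log = list of
# (user_id, query, utc_time) tuples with datetime times; all illustrative).
from collections import Counter
from datetime import timedelta

def summary_metrics(log, session_timeout=timedelta(minutes=30)):
    query_freq = Counter(q for _, q, _ in log)
    terms_per_query = [len(q.split()) for _, q, _ in log]

    # Segment each user's query stream into sessions at gaps > session_timeout.
    by_user = {}
    for user, _, t in sorted(log, key=lambda r: (r[0], r[2])):
        by_user.setdefault(user, []).append(t)
    session_lengths = []
    for times in by_user.values():
        length = 1
        for prev, cur in zip(times, times[1:]):
            if cur - prev > session_timeout:
                session_lengths.append(length)
                length = 1
            else:
                length += 1
        session_lengths.append(length)

    return {
        "mean terms per query": sum(terms_per_query) / len(terms_per_query),
        "mean queries per session": sum(session_lengths) / len(session_lengths),
        "mean occurrences per query": len(log) / len(query_freq),
    }
```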

Develop Metrics to Capture Behavior Lee, Teevan, de la Chica. Characterizing multi-click search behavior. SIGIR 2014

Partitioning the Data  Language  Location  Time  User activity  Individual  Entry point  Device  System variant Baeza-Yates, Dupret, Velasco. A study of mobile search queries in Japan. WWW 2007

Partition by Time  Periodicities  Spikes  Real-time data  New behavior  Immediate feedback  Individual  Within session  Across sessions Beitzel, et al. Hourly analysis of a.. topically categorized web query log. SIGIR 2004
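A small illustration of partitioning by time, reusing the hypothetical (user_id, query, utc_time) log from the earlier sketch: bucket query volume by hour of day to look for periodicities, and by calendar day to spot spikes.

```python
# Illustrative time partitions over the same hypothetical log format.
from collections import Counter

def queries_by_hour(log):
    return Counter(t.hour for _, _, t in log)    # hour 0..23 -> query count (periodicities)

def queries_by_day(log):
    return Counter(t.date() for _, _, t in log)  # calendar day -> query count (spikes)
```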

Partition by User  Temporary ID (e.g., cookie, IP address)  High coverage but high churn  Does not necessarily map directly to users  User account  Only a subset of users Teevan, Adar, Jones, Potts. Information re-retrieval: Repeat queries … SIGIR 2007

Partition by System Variant  Also known as controlled experiments  Some people see one variant, others another  Example: What color for search result links?  Bing tested 40 colors  Identified #0044CC  Value: $80 million

Considerations When Partitioning  Choose comparison groups carefully  From the same time period  With comparable users, tasks, etc.  Log a lot because it can be hard to recreate state  Which partition did a particular behavior fall into?  Confirm partitions with metrics that should be the same White, Dumais, Teevan. Characterizing the influence of domain expertise... WSDM 2009

Interpreting Significant Metrics  Often, everything is significant Adar, Teevan, Dumais. Large scale analysis of web revisitation patterns. CHI 2008

Interpreting Significant Metrics  Everything is significant, but not always meaningful  “All differences significant except when noted.”  Choose the metrics you care about first  Look for converging evidence  Look at the data  Beware: Typically very high variance  Large variance by user, task, noise  Calculate empirically

Confidence Intervals  Confidence interval (C.I.):  Interval around the treatment mean that contains the true value of the mean x% (typically 95%) of the time  Gives useful information about the size of the effect and its practical significance  C.I.s that do not contain the control mean are statistically significant (statistically different from the control)  This is an independent test for each metric  Thus you will get 1 in 20 results (for 95% C.I.s) that are spurious  Challenge: You don't know which ones are spurious
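A minimal sketch of a 95% confidence interval around a treatment mean using the normal approximation; the per-user metric values below are made up, and for small samples a t-based interval would be more appropriate.

```python
# 95% confidence interval for a mean via the normal approximation (sketch).
import math

def mean_ci(values, z=1.96):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)                   # z * standard error
    return mean - half_width, mean + half_width

# If the control mean lies outside this interval, the difference is
# statistically significant at (roughly) the 5% level.
lo, hi = mean_ci([2.1, 1.8, 2.6, 2.4, 2.0, 2.2, 1.9, 2.5])
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```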

Confidence Intervals Lee, Teevan, de la Chica. Characterizing multi-click search behavior. SIGIR 2014 Radlinski, Kurup, Joachims. How does clickthrough data reflect retrieval quality? CIKM 2008.

When Significance Is Wrong  Sometimes there is spurious significance  A confidence interval only tells you there is a 95% chance that the difference is real, not 100%  If only a few things are significant, chance is a likely explanation  Sometimes you will miss significance  Because the true difference is tiny/zero, or because you don’t have enough power  If you did your sizing right, you have enough power to see all the differences of practical significance  Sometimes the reason for a change is unexpected  Look at many metrics to get the big picture Chilton, Teevan. Addressing Info. Needs Directly in the Search Result Page. WWW 2011

Be Thoughtful When Combining Metrics  1995 and 1996 performance != combined performance  Simpson’s Paradox  Changes in mix (denominators) make combined metrics (ratios) inconsistent with the yearly metrics

Batting average (hits / at bats):
                 1995              1996              Combined
Derek Jeter      12/48   (.250)    183/582 (.314)    195/630 (.310)
David Justice    104/411 (.253)    45/140  (.321)    149/551 (.270)
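A quick arithmetic check of the paradox, using the widely cited Jeter/Justice batting figures (treat the exact numbers as illustrative):

```python
# Simpson's paradox check: Justice has the higher average in each year,
# but Jeter has the higher average when the years are combined.
jeter   = {"1995": (12, 48),   "1996": (183, 582)}   # (hits, at_bats)
justice = {"1995": (104, 411), "1996": (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

def combined(player):
    hits = sum(h for h, _ in player.values())
    at_bats = sum(ab for _, ab in player.values())
    return avg(hits, at_bats)

for year in ("1995", "1996"):
    print(year, avg(*jeter[year]) < avg(*justice[year]))  # True: Justice wins each year
print("combined", combined(jeter) > combined(justice))    # True: Jeter wins combined
```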

Detailed Analysis  Big Picture  Not all effects will point the same direction  Take a closer look at the items going in the “wrong” direction  Can you interpret them?  E.g., people are doing fewer next-pages because they are finding their answer on the first page  Could they be artifactual?  What if they are real?  What should be the impact on your conclusions? on your decision?  Significance and impact are not the same thing  Looking at % change vs. absolute change helps  Effect size depends on what you want to do with the data

Beware of the Tyranny of the Data  Can provide insight into behavior  Example: What is searched for, how needs are expressed  Can be used to test hypotheses  Example: Compare ranking variants or link color  Can only reveal what can be observed  Cannot tell you what you cannot observe  Example: Nobody uses Twitter to re-find

What Logs Cannot Tell Us  People’s intent  People’s success  People’s experience  People’s attention  People’s beliefs  Behavior can mean many things  81% of search sequences are ambiguous [Viermetz et al. 2006]  Example: the log shows 7:12 – Query, 7:14 – Click Result 1, 7:15 – Click Result 3; what happened next could be 7:16 – Read Result 1, 7:20 – Read Result 3, 7:27 – Save links locally, or it could be 7:16 – Try new engine

HCI Example: Click Entropy  Question: How ambiguous is a query?  Approach: Look at variation in clicks  Measure: Click entropy  Low if no variation (e.g., human computer …)  High if lots of variation (e.g., hci: clicks go to companies, the Wikipedia disambiguation page, etc.) Teevan, Dumais, Liebling. To personalize or not to personalize... SIGIR 2008
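A sketch of the click-entropy measure itself, assuming we have the list of clicked URLs observed for each query; the example queries and URLs below are made up for illustration.

```python
# Click entropy as a query-ambiguity measure: low when most users click the
# same result, high when clicks are spread across many results.
import math
from collections import Counter

def click_entropy(clicked_urls):
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative click data (not real log data).
clicks = {
    "federal government jobs": ["usajobs.gov"] * 9 + ["opm.gov"],
    "hci": ["en.wikipedia.org/wiki/HCI", "hci.com", "chi2014.acm.org",
            "en.wikipedia.org/wiki/Human-computer_interaction"],
}
for q, urls in clicks.items():
    print(q, round(click_entropy(urls), 2))   # low vs. high entropy
```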

Which Has Less Variation in Clicks?  … v. federal government jobs  find phone number v. msn live search  singapore pools v. singaporepools.com  tiffany v. tiffany’s  nytimes v. connecticut newspapers  campbells soup recipes v. vegetable soup recipe  soccer rules v. hockey equipment  Complications: results change (result entropy = 5.7 vs. 10.7); result quality varies (click position = 2.6 vs. 1.6); task impacts # of clicks (clicks/user = 1.1 vs. 2.1)

Supplementing Log Data  Enhance log data  Collect associated information  Example: For browser logs, crawl visited webpages  Instrumented panels  Converging methods  Usability studies  Eye tracking  Surveys  Field studies  Diary studies

Example: Re-Finding Intent  Large-scale log analysis of re-finding  Do people know they are re-finding?  Do they mean to re-find the result they do?  Why are they returning to the result?  Small-scale critical incident user study  Browser plug-in that logs queries and clicks  Pop-up survey on repeat clicks and 1/8 of new clicks  Insight into intent + rich, real-world picture  Re-finding often targeted towards a particular URL  Not targeted when the query changes or within the same session Tyler, Teevan. Large scale query log analysis of re-finding. WSDM 2010

Example: Curious Browser  Browser plug-in to examine relationship between implicit and explicit behavior  Capture many implicit actions (e.g., click, click position, dwell time, scroll)  Probe for explicit user judgments of relevance of a page to the query  Deployed to ~4k people in US and Japan  Learned models to predict explicit judgments from implicit indicators  45% accuracy w/ just click; 75% accuracy w/ click + dwell + session  Used to identify important features; then apply model in open loop setting Fox, et al. Evaluating implicit measures to improve the search experience. TOIS 2005
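A hedged sketch of the general idea (not the Curious Browser's actual model), assuming scikit-learn is available: learn to predict an explicit relevance judgment from implicit signals such as click, click position, and dwell time. The feature layout and data are illustrative only.

```python
# Predicting explicit judgments from implicit indicators (illustrative sketch).
from sklearn.linear_model import LogisticRegression

# Each row: [clicked (0/1), click position (0 = no click), dwell time in seconds]
X = [[1, 1, 120], [1, 3, 5], [0, 0, 0], [1, 2, 60], [1, 5, 3], [0, 0, 0]]
y = [1, 0, 0, 1, 0, 0]   # explicit judgment collected from the user

model = LogisticRegression().fit(X, y)
print(model.predict([[1, 1, 90]]))   # predicted judgment for a new interaction
```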

Overview  Observational log analysis  What we can learn  Partition logs to observe behavior  Collecting log data  Not as simple as it seems  Cleaning log data  Clean and sanity check  Analyzing log data  Big picture more important than individual metrics  Field observations – Diane!

[Diane on field observations]