USING LARGE SCALE LOG ANALYSIS TO UNDERSTAND HUMAN BEHAVIOR. Jaime Teevan, Microsoft Research. JITP 2011.


Famous marginalia:
- David Foster Wallace; Mark Twain
- "Cowards die many times before their deaths." Annotated by Nelson Mandela
- "I have discovered a truly marvelous proof... which this margin is too narrow to contain." Pierre de Fermat (1637)
- Students prefer used textbooks that are annotated. [Marshall 1998]

Digital Marginalia
- Do we lose marginalia with digital documents?
- The Internet exposes information experiences
  - Meta-data, annotations, relationships
  - Large-scale information usage data
- Change in focus: with marginalia, interest is in the individual; now we can look at experiences in the aggregate

Defining Behavioral Log Data
- Behavioral log data are:
  - Traces of human behavior, seen through a sensor
  - Actual, real-world behavior, not recalled behavior or subjective impressions
  - Large-scale, real-time
- Behavioral log data are not:
  - Non-behavioral sources of large-scale data
  - Collected data (e.g., poll data, surveys, census data)
  - Crowdsourced data (e.g., Mechanical Turk)

Real-World, Large-Scale, Real-Time
- Private behavior is exposed
  - Example: porn queries, medical queries
- Rare behavior is common
  - Example: observe 500 million queries a day; a behavior that occurs only 0.002% of the time is still observed 10 thousand times a day
- New behavior appears immediately
  - Example: Google Flu Trends

Overview
- How behavioral log data can be used
- Sources of behavioral log data
  - Challenges with privacy and data sharing
- Example analysis of one source: query logs
  - To understand people's information needs
  - To experiment with different systems
- What behavioral logs cannot reveal
  - How to address limitations

Practical Uses for Behavioral Data
- Behavioral data to improve Web search
  - Offline log analysis. Example: re-finding is common, so add history support
  - Online log-based experiments. Example: interleave different rankings to find the best algorithm
  - Log-based functionality. Example: boost clicked results in a search result list (see the sketch below)
- Behavioral data on the desktop
  - Goal: allocate editorial resources to create Help docs
  - How to do so without knowing what people search for?
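
As an illustration of log-based functionality, here is a minimal sketch of boosting previously clicked results in a ranked list. The click counts, base scores, and the boost weight are made-up placeholders, not any production system's values:

```python
# Minimal sketch of log-based functionality: re-rank a result list by
# boosting results that were previously clicked for the same query.
# Click counts and scores here are hypothetical stand-ins.
from collections import defaultdict

click_counts = defaultdict(int)          # (query, url) -> observed clicks
click_counts[("jitp 2011", "jitp2011.org")] = 42

def rerank(query, results, boost=0.1):
    """Sort results by base score plus a small per-click boost."""
    return sorted(
        results,
        key=lambda r: r["score"] + boost * click_counts[(query, r["url"])],
        reverse=True,
    )

results = [{"url": "example.org", "score": 1.2},
           {"url": "jitp2011.org", "score": 1.0}]
print(rerank("jitp 2011", results))      # the clicked result moves up
```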

Societal Uses of Behavioral Data
- Understand people's information needs
- Understand what people talk about
- Impact public policy? (E.g., DonorsChoose.org) [Baeza Yates et al. 2007]

Generalizing About Behavior
(Diagram: a spectrum of generalization from what is logged to what we want to know. Button clicks and feature use, e.g., on a structured answer shown for the query jitp 2011, generalize to information use, then information needs, what people think, and ultimately human behavior.)

Personal Use of Behavioral Data
- Individuals now have a lot of behavioral data
- Introspection of personal data is popular
  - My Year in Status
  - Status Statistics
- Expect to see more
  - As compared to others
  - For a purpose

Overview
- Behavioral logs give practical, societal, personal insight
- Sources of behavioral log data
  - Challenges with privacy and data sharing
- Example analysis of one source: query logs
  - To understand people's information needs
  - To experiment with different systems
- What behavioral logs cannot reveal
  - How to address limitations

Web Service Logs
- Example sources: search engines, commercial websites
- Types of information
  - Behavior: queries, clicks
  - Content: results, products
- Example analysis: query ambiguity (e.g., the query jitp: Integral Theory and Practice? Parenting? IT & Politics?)
  - Teevan, Dumais & Liebling. To personalize or not to personalize: Modeling queries with variation in user intent. SIGIR 2008

Public Web Service Content
- Example sources: social network sites, wiki change logs
- Types of information
  - Public content
  - Dependent on the service
- Example analysis: Twitter topic models
  - Ramage, Dumais & Liebling. Characterizing microblogging using latent topic models. ICWSM 2010

Web Browser Logs
- Example sources: proxies, toolbars
- Types of information
  - Behavior: URL visits
  - Content: settings, pages
- Example analysis: Diff-IE
  - Teevan, Dumais & Liebling. A longitudinal study of how highlighting Web content change affects people's Web interactions. CHI 2010
- Example analysis: webpage revisitation
  - Adar, Teevan & Dumais. Large scale analysis of Web revisitation patterns. CHI 2008

Client-Side Logs
- Example sources: client applications, operating system
- Types of information
  - Web client interactions
  - Other interactions (rich!)
- Example analysis: Stuff I've Seen
  - Dumais, Cutrell, Cadiz, Jancke, Sarin & Robbins. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003

Types of Logs: Rich and Varied

Sources of log data:
- Web services (search engines, commerce sites)
- Public Web services (social network sites, wiki change logs)
- Web browsers (proxies, toolbars or plug-ins)
- Client applications

Types of information logged:
- Interactions: posts, edits, queries, clicks, URL visits, system interactions
- Context: results, ads, Web pages shown

Public Sources of Behavioral Logs
- Public Web service content
  - Twitter, Facebook, Digg, Wikipedia
  - At JITP: InfoExtractor, Facebook Harvester, scraping tools
- Research efforts to create logs
  - At JITP: Roxy, a research proxy
  - Lemur Community Query Log Project (1 year of data collection = 6 seconds of Google logs)
- Publicly released private logs
  - DonorsChoose.org
  - Enron corpus, AOL search logs, Netflix ratings

Example: AOL Search Dataset
- August 4, 2006: Logs released to the academic community
  - 3 months, 650 thousand users, 20 million queries
  - Logs contain anonymized user IDs
- August 7, 2006: AOL pulled the files, but they were already mirrored
- August 9, 2006: New York Times identified Thelma Arnold
  - "A Face Is Exposed for AOL Searcher No. 4417749"
  - Queries for businesses, services in Lilburn, GA (pop. 11k)
  - Queries for Jarrett Arnold (and others of the Arnold clan)
  - NYT contacted all 14 people in Lilburn with the Arnold surname
  - When contacted, Thelma Arnold acknowledged her queries
- August 21, 2006: 2 AOL employees fired, CTO resigned
- September 2006: Class action lawsuit filed against AOL

Sample records (AnonID | Query | QueryTime | ItemRank | ClickURL; IDs, timestamps, and most URLs were lost in transcription):
… | jitp | …:18:18 | 1 | http://…
… | jipt submission process | …:18:18 | 3 | http://…
… | computational social scinece | …:19:… | |
… | computational social science | …:20:04 | 2 | http://socialcomplexity.gmu.edu/phd.php
… | seattle restaurants | …:25:50 | 2 | http://seattletimes.nwsource.com/rests
… | perlman montreal | …:15:14 | 4 | http://oldwww.acm.org/perlman/guide.html
… | jitp 2006 notification | …:13:13 | |
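
A minimal sketch of reading logs in this released format, assuming the tab-separated fields above (AnonID, Query, QueryTime, ItemRank, ClickURL) with the click fields present only on click events; the helper name parse_log is ours:

```python
# Minimal sketch: parse AOL-style query log lines into records.
# Assumes tab-separated fields (AnonID, Query, QueryTime, ItemRank,
# ClickURL); the last two fields are populated only on clicks.
import csv
from datetime import datetime

def parse_log(path):
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                         # skip the header row
        for row in reader:
            anon_id, query, qtime = row[0], row[1], row[2]
            yield {
                "anon_id": anon_id,
                "query": query,
                "time": datetime.strptime(qtime, "%Y-%m-%d %H:%M:%S"),
                "rank": int(row[3]) if len(row) > 3 and row[3] else None,
                "url": row[4] if len(row) > 4 and row[4] else None,
            }
```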

Example: AOL Search Dataset
- Other well-known AOL users
  - User 927: how to kill your wife
  - User 711391: i love alaska
- Anonymized IDs do not make logs anonymous
  - Logs contain directly identifiable information: names, phone numbers, credit cards, social security numbers
  - Logs contain indirectly identifiable information. Example: Thelma's queries; birthdate, gender, and zip code identify 87% of Americans
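
To make the 87% statistic concrete, here is a minimal sketch of measuring how identifying a quasi-identifier triple is within a dataset; the record fields birthdate, gender, and zip are hypothetical names for illustration:

```python
# Minimal sketch of re-identification risk: count how many records share
# each (birthdate, gender, zip) triple; a group of size 1 is uniquely
# identifiable by those three attributes alone.
from collections import Counter

def unique_fraction(people):
    groups = Counter((p["birthdate"], p["gender"], p["zip"]) for p in people)
    unique = sum(1 for p in people
                 if groups[(p["birthdate"], p["gender"], p["zip"])] == 1)
    return unique / len(people)
```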

Example: Netflix Challenge
- October 2, 2006: Netflix announces contest
  - Predict people's ratings, for a $1 million prize
  - 100 million ratings, 480k users, 17k movies
  - Very careful with anonymity post-AOL
- May 18, 2008: Data de-anonymized
  - Paper published by Narayanan & Shmatikov
  - Uses background knowledge from IMDB
  - Robust to perturbations in the data
- December 17, 2009: Doe v. Netflix
- March 12, 2010: Netflix cancels the second competition

Data format. Ratings are grouped per movie as (CustomerID, Rating, Date) triples, with dates lost in transcription:
1: [Movie 1 of 17770]
12, 3, … [CustomerID, Rating, Date]
1234, 5, … [CustomerID, Rating, Date]
2468, 1, … [CustomerID, Rating, Date]
Movie titles:
10120, 1982, "Bladerunner"
17690, 2007, "The Queen"

Netflix's claim: "All customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy... Even if, for example, you knew all your own ratings and their dates you probably couldn't identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation."

Overview
- Behavioral logs give practical, societal, personal insight
- Sources include Web services, browsers, client apps
  - Public sources limited due to privacy concerns
- Example analysis of one source: query logs
  - To understand people's information needs
  - To experiment with different systems
- What behavioral logs cannot reveal
  - How to address limitations

Sample query log (Query | Time | User; user IDs and some digits were lost in transcription):
jitp 2011 | …:41 am 5/15 | …
social science | 10:44 am 5/15 | …
computational social science | 10:56 am 5/15 | …
jitp 2011 | …:21 am 5/15 | …
crowne plaza seattle | 11:59 am 5/15 | …
restaurants seattle | 12:01 pm 5/15 | …
pikes market restaurants | 12:17 pm 5/15 | …
stuart shulman | 12:18 pm 5/15 | …
daytrips in seattle, wa | 1:30 pm 5/15 | …
jitp 2011 | 1:30 pm 5/15 | …
jitp program | 2:32 pm 5/15 | …
jitp2011.org | 2:42 pm 5/15 | …
computational social science | 4:56 pm 5/15 | …
jitp 2011 | 5:02 pm 5/15 | …
xxx clubs in seattle | 10:14 pm 5/15 | …
sex videos | 1:49 am 5/16 | …

(The same sample log, now with noise highlighted: porn queries (teen sex, sex with animals), non-English queries (社会科学), spam (cheap digital camera issued at 12:17, 12:18, and 12:19 pm), and system errors such as empty queries.)

Data cleaning pragmatics:
- Cleaning is a significant part of data analysis
- Ensure the cleaning is appropriate for the analysis
- Keep track of the cleaning process
- Keep the original data around. Example: ClimateGate
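
A minimal sketch of one such cleaning pass over parsed records, with removals logged so the process is traceable and the raw data untouched; the adult-term lexicon and the letters-based language check are crude illustrative placeholders, not how production pipelines classify:

```python
# Minimal sketch of query-log cleaning over records like those above.
# Removals are logged (keep track of the process); raw data stays on disk.
import re

ADULT_TERMS = {"sex", "xxx", "porn"}            # placeholder lexicon

def clean(records, removal_log):
    for r in records:
        if not r["query"].strip():              # system errors: empty queries
            removal_log.append(("empty", r)); continue
        if not re.search(r"[a-z]", r["query"], re.I):   # crude non-English check
            removal_log.append(("non_english", r)); continue
        if ADULT_TERMS & set(r["query"].lower().split()):
            removal_log.append(("adult", r)); continue
        yield r                                 # keep this record
```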

What the sample log supports studying: query typology, query behavior, and long-term trends.

Uses of analysis:
- Ranking. E.g., precision
- System design. E.g., caching
- User interface. E.g., history
- Test set development
- Complementary research

Things Observed in Query Logs (see the sketch after this list)
- Summary measures
  - Query frequency (queries appear 3.97 times [Silverstein et al. 1999])
  - Query length (2.35 terms [Jansen et al. 1998])
- Analysis of query intent
  - Query types and topics (navigational, informational, transactional [Broder 2002])
- Temporal features
  - Session length (sessions are 2.20 queries long [Silverstein et al. 1999])
  - Common re-formulations [Lau and Horvitz 1999]
- Click behavior
  - Relevant results for a query
  - Queries that lead to clicks [Joachims 2002]
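
A minimal sketch of computing these summary measures from parsed records; the 30-minute session timeout is a common convention in log analysis, not something the slide specifies:

```python
# Minimal sketch: query frequency, mean query length, and session length
# (sessions split on a 30-minute gap, an assumed convention).
from collections import Counter
from datetime import timedelta

def summarize(records, session_gap=timedelta(minutes=30)):
    freq = Counter(r["query"] for r in records)
    mean_terms = sum(len(r["query"].split()) for r in records) / len(records)
    sessions, last_time = [], {}
    for r in sorted(records, key=lambda r: (r["anon_id"], r["time"])):
        last = last_time.get(r["anon_id"])
        if last is None or r["time"] - last > session_gap:
            sessions.append(1)               # new session for this user
        else:
            sessions[-1] += 1                # continue the current session
        last_time[r["anon_id"]] = r["time"]
    return freq, mean_terms, sum(sessions) / len(sessions)
```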

Surprises About Query Log Data
- From early log analysis (examples: Jansen et al. 2000, Broder 1998)
  - Queries are not 7 or 8 words long
  - Advanced operators not used, or "misused"
  - Nobody used relevance feedback
  - Lots of people search for sex
  - Navigation behavior common
- Prior experience was with library search

Surprises About Microblog Search?

Surprises About Microblog Search?
(Screenshots: Twitter search results ordered by time, with an "8 new tweets" update notice, next to Web search results ordered by relevance.)

Surprises About Microblog Search?
(Same screenshots: results ordered by time v. ordered by relevance.)
Microblog search:
- Time important
- People important
- Specialized syntax
- Queries common: repeated a lot, change very little, often navigational
Web search:
- Time and people less important
- No syntax use
- Queries longer
- Queries develop

Generalizing Across Systems (from narrow to broad):
- A particular feature (e.g., Bing experiment #123): build new features
- A web search engine (e.g., Bing): build new tools
- Web search engines (e.g., Bing, Google, Yahoo): build better systems
- Search engines (different corpora)
- Information seeking (browser, search, ...)

Partitioning the Data [Baeza Yates et al. 2007]
- Corpus
- Language
- Location
- Device
- Time
- User
- System variant

Partition by Time [Beitzel et al. 2004]
- Periodicities
- Spikes
- Real-time data
  - New behavior
  - Immediate feedback
- Individual behavior
  - Within session
  - Across sessions
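
A minimal sketch of one time partition: bucketing query volume by hour of day to surface periodicities and spikes, assuming records parsed as above:

```python
# Minimal sketch: partition query volume by hour of day, so daily
# periodicities (and spikes that break them) become visible.
from collections import Counter

def hourly_volume(records):
    return Counter(r["time"].hour for r in records)

# Usage: spikes stand out against the typical daily cycle, e.g.
# for hour, n in sorted(hourly_volume(records).items()): print(hour, n)
```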

Partition by User [Teevan et al. 2007]
- Temporary ID (e.g., cookie, IP address)
  - High coverage but high churn
  - Does not necessarily map directly to users
- User account
  - Only a subset of users

Partition by System Variant
- Also known as controlled experiments
- Some people see one variant, others another
- Example: What color for search result links?
  - Bing tested 40 colors
  - Identified #0044CC
  - Value: $80 million
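
A minimal sketch of how such assignment is often done: hashing a stable user ID together with an experiment name gives a consistent split that is roughly independent across experiments. The names and the 50/50 split are illustrative, not Bing's actual mechanism:

```python
# Minimal sketch: deterministic variant assignment via hashing.
# The same user always lands in the same arm of a given experiment.
import hashlib

def variant(user_id, experiment, arms=("control", "treatment")):
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return arms[h[0] % len(arms)]

print(variant("user-123", "link-color-experiment"))
```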

Everything is Significant
- At scale, everything is statistically significant, but not always meaningful
- Choose the metrics you care about first
- Look for converging evidence
- Choose the comparison group carefully, from the same time period
- Log a lot, because it can be hard to recreate state
- Confirm with metrics that should be the same across groups
- Variance is high: calculate it empirically
- Look at the data
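
One common way to test a difference between variants, sketched minimally as a two-proportion z-test on click-through rate; with a million users per arm even a 0.1% lift is "significant", which is exactly the slide's warning that significance is not the same as meaningfulness:

```python
# Minimal sketch: two-proportion z-test comparing click-through rates.
from math import sqrt, erf

def ctr_z_test(clicks_a, n_a, clicks_b, n_b):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)          # pooled rate
    z = (p_a - p_b) / sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a - p_b, z, p_value

# A tiny 0.1% lift over a million users per arm is already "significant":
print(ctr_z_test(101_000, 1_000_000, 100_000, 1_000_000))
```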

Overview
- Behavioral logs give practical, societal, personal insight
- Sources include Web services, browsers, client apps
  - Public sources limited due to privacy concerns
- Partition query logs to view interesting slices
  - By corpus, time, individual
  - By system variant = experiment
- What behavioral logs cannot reveal
  - How to address limitations

What Logs Cannot Tell Us
- People's intent
- People's success
- People's experience
- People's attention
- People's beliefs about what happens
- Behavior can mean many things
  - 81% of search sequences are ambiguous [Viermetz et al. 2006]
Example: the log shows 7:12 query, 7:14 click on Result 1, 7:15 click on Result 3. It cannot show what happened at 7:16 (reading Result 1? trying a new engine?), at 7:20 (reading Result 3), or at 7:27 (saving links locally).

Example: Click Entropy
- Question: How ambiguous is a query?
- Approach: Look at variation in clicks [Teevan et al. 2008]
- Measure: click entropy
  - Low if no variation (e.g., journal of information…)
  - High if lots of variation (e.g., jitp: Integral Theory and Practice? Parenting? IT & Politics?)
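
A minimal sketch of the measure itself: the entropy of the click distribution over results for a query, low when everyone clicks the same result and high when clicks are spread out (the example URLs are made up):

```python
# Minimal sketch: click entropy of a query, from its observed clicks.
from collections import Counter
from math import log2

def click_entropy(clicked_urls):
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(click_entropy(["jitp.org"] * 9 + ["other.org"]))      # low, ~0.47
print(click_entropy(["a.org", "b.org", "c.org", "d.org"]))  # high, 2.0
```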

Which Has Less Variation in Clicks?
- … v. federal government jobs
- find phone number v. msn live search
- singapore pools v. singaporepools.com
- tiffany v. tiffany's
- nytimes v. connecticut newspapers
- campbells soup recipes v. vegetable soup recipe
- soccer rules v. hockey equipment
Complications: results change; result quality varies; the task impacts the number of clicks (clicks/user = 1.1 v. 2.1; click position = 2.6 v. 1.6; result entropy = 5.7 v. 10.7)

Beware of Adversaries
- Robots try to take advantage of your service
  - Queries too fast or too common to be human
  - Queries too specialized (and repeated) to be real
- Spammers try to influence your interpretation
  - Click fraud, link farms, misleading content
- A never-ending arms race
  - Look for unusual clusters of behavior
  - Beware adversarial use of log data [Fetterly et al. 2004]
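
A minimal sketch of one robot heuristic from the list above: flag IDs whose inter-query gaps are too fast to be human. The 1-second threshold and the majority rule are illustrative choices, not a published standard:

```python
# Minimal sketch: flag an ID as robot-like if most of its inter-query
# gaps are faster than a human could plausibly type and read.
from datetime import timedelta

def looks_like_robot(times, min_gap=timedelta(seconds=1)):
    times = sorted(times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    fast = sum(1 for g in gaps if g < min_gap)
    return bool(gaps) and fast / len(gaps) > 0.5
```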

Beware the Tyranny of the Data
- Logs can provide insight into behavior
  - Example: what is searched for, how needs are expressed
- Logs can be used to test hypotheses
  - Example: compare ranking variants or link color
- But logs can only reveal what can be observed
  - They cannot tell you what you cannot observe
  - Example: nobody uses Twitter to re-find

Supplementing Log Data
- Enhance log data
  - Collect associated information. Example: for browser logs, crawl visited webpages
  - Instrumented panels
- Converging methods
  - Usability studies
  - Eye tracking
  - Surveys
  - Field studies
  - Diary studies

Example: Re-Finding Intent
- Large-scale log analysis of re-finding [Tyler and Teevan 2010]
  - Do people know they are re-finding?
  - Do they mean to re-find the result they do?
  - Why are they returning to the result?
- Small-scale critical-incident user study
  - Browser plug-in that logs queries and clicks
  - Pop-up survey on repeat clicks and 1/8 of new clicks
- Insight into intent + a rich, real-world picture
  - Re-finding is often targeted towards a particular URL
  - Not targeted when the query changes or within the same session

Summary
- Behavioral logs give practical, societal, personal insight
- Sources include Web services, browsers, client apps
  - Public sources limited due to privacy concerns
- Partition query logs to view interesting slices
  - By corpus, time, individual
  - By system variant = experiment
- Behavioral logs are powerful but not a complete picture
  - Can expose small differences and tail behavior
  - Cannot expose motivation, which is often adversarial
  - Look at the logs, and supplement with complementary data

Questions? Jaime Teevan

References
- Adar, E., J. Teevan and S.T. Dumais. Large scale analysis of Web revisitation patterns. CHI 2008.
- Akers, D., M. Simpson, T. Winograd and R. Jeffries. Undo and erase events as indicators of usability problems. CHI 2009.
- Beitzel, S.M., E.C. Jensen, A. Chowdhury, D. Grossman and O. Frieder. Hourly analysis of a very large topically categorized Web query log. SIGIR 2004.
- Broder, A. A taxonomy of Web search. SIGIR Forum, 36(2), 2002.
- Chilton, L. and J. Teevan. Addressing information needs directly in the search result page. WWW 2011.
- Cutrell, E., D.C. Robbins, S.T. Dumais and R. Sarin. Fast, flexible filtering with Phlat: Personal search and organization made easy. CHI 2006.
- Dagon, D. Botnet detection and response: The network is the infection. OARC Workshop.
- Dasu, T. and T. Johnson. Exploratory Data Mining and Data Cleaning. 2003.
- Dumais, S.T., E. Cutrell, J.J. Cadiz, G. Jancke, R. Sarin and D.C. Robbins. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003.
- Fetterly, D., M. Manasse and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam Web pages. Workshop on the Web and Databases 2004.
- Fox, S., K. Karnawat, M. Mydland, S.T. Dumais and T. White. Evaluating implicit measures to improve Web search. TOIS 23(2), 2005.
- Jansen, B.J., A. Spink, J. Bateman and T. Saracevic. Real life information retrieval: A study of user queries on the Web. SIGIR Forum 32(1), 1998.
- Joachims, T. Optimizing search engines using clickthrough data. KDD 2002.
- Kellar, M., C. Watters and M. Shepherd. The impact of task on the usage of Web browser navigation mechanisms. GI 2005.
- Kohavi, R., R. Longbotham, D. Sommerfield and R.M. Henne. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18(1), 2009.
- Kohavi, R., R. Longbotham and T. Walker. Online experiments: Practical lessons. IEEE Computer 43(9), 2010.
- Kotov, A., P. Bennett, R.W. White, S.T. Dumais and J. Teevan. Modeling and analysis of cross-session search tasks. SIGIR 2011.

References
- Kulkarni, A., J. Teevan, K.M. Svore and S.T. Dumais. Understanding temporal query dynamics. WSDM 2011.
- Lau, T. and E. Horvitz. Patterns of search: Analyzing and modeling Web query refinement. User Modeling 1999.
- Marshall, C.C. The future of annotation in a digital (paper) world. GSLIS Clinic 1998.
- Narayanan, A. and V. Shmatikov. Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy 2008.
- Silverstein, C., M. Henzinger, H. Marais and M. Moricz. Analysis of a very large Web search engine query log. SIGIR Forum, 33(1), 1999.
- Tang, D., A. Agarwal and D. O'Brien. Overlapping experiment infrastructure: More, better, faster experimentation. KDD 2010.
- Teevan, J., E. Adar, R. Jones and M. Potts. Information re-retrieval: Repeat queries in Yahoo's logs. SIGIR 2007.
- Teevan, J., S.T. Dumais and D.J. Liebling. To personalize or not to personalize: Modeling queries with variation in user intent. SIGIR 2008.
- Teevan, J., S.T. Dumais and D.J. Liebling. A longitudinal study of how highlighting Web content change affects people's Web interactions. CHI 2010.
- Teevan, J., D.J. Liebling and G.R. Geetha. Understanding and predicting personal navigation. WSDM 2011.
- Teevan, J., D. Ramage and M.R. Morris. #TwitterSearch: A comparison of microblog search and Web search. WSDM 2011.
- Tyler, S.K. and J. Teevan. Large scale query log analysis of re-finding. WSDM 2010.
- Viermetz, M., C. Stolz, V. Gedov and M. Skubacz. Relevance and impact of tabbed browsing behavior on Web usage mining. Web Intelligence 2006.
- Weinreich, H., H. Obendorf, E. Herder and M. Mayer. Off the beaten tracks: Exploring three aspects of Web navigation. WWW 2006.
- White, R.W., S.T. Dumais and J. Teevan. Characterizing the influence of domain expertise on Web search behavior. WSDM 2009.
- Baeza-Yates, R., G. Dupret and J. Velasco. A study of mobile search queries in Japan. Query Log Analysis: Social and Technological Challenges workshop, WWW 2007.