Temporal Dynamics and Information Retrieval. Susan Dumais, Microsoft Research. CCF ADL, Jul 31, 2011.

Presentation transcript:

CCF ADL, Jul 2011. Susan Dumais, Microsoft Research. Temporal Dynamics and Information Retrieval. In collaboration with: Eric Horvitz, Jaime Teevan, Eytan Adar, Jon Elsas, Ed Cutrell, Dan Liebling, Richard Hughes, Merrie Ringel Morris, Evgeniy Gabrilovich, Krysta Svore, Anagha Kulkarni

Change is Everywhere in IR
- Change is everywhere in digital information systems
  - New documents appear all the time
  - Document content changes over time
  - Queries and query volume change over time
  - What's relevant to a query changes over time (e.g., U.S. Open 2011 in May vs. Sept)
  - User interaction changes over time (e.g., tags, anchor text, social networks, query-click streams, etc.)
  - Relations between entities change over time (e.g., the President of the US in 2008 vs. other years)
- Change is pervasive in digital information systems … yet, we're not doing much about it!

Information Dynamics
- Today's browse and search experiences ignore content changes and user visitation/re-visitation

Digital Dynamics
- Digital dynamics are easy to capture
- But … few tools support dynamics

Overview
- Change on the desktop and news
  - Desktop: Stuff I've Seen; Memory Landmarks; LifeBrowser
  - News: analysis of novelty (e.g., NewsJunkie)
- Change on the Web
  - Content changes over time
  - User interaction varies over time (queries, re-visitation, anchor text, query-click stream, "likes")
  - Tools for understanding Web change (e.g., Diff-IE)
- Improving Web retrieval using dynamics
  - Query trends over time
  - Retrieval models that leverage dynamics
  - Task evolution over time

Stuff I've Seen (SIS) [Dumais et al., SIGIR 2003]
- Many silos of information
- SIS: unified access to distributed, heterogeneous content (mail, files, web, tablet notes, RSS, etc.)
- Index full content + metadata
- Fast, flexible search
- Information re-use
- SIS -> Windows Desktop Search

Example Desktop Searches
- Looking for: recent email from Fedor that contained a link to his new demo. Initiated from: Start menu. Query: from:Fedor
- Looking for: the PDF of a SIGIR paper on context and ranking (not sure it used those words) that someone (don't remember who) sent me about a month ago. Initiated from: Outlook. Query: SIGIR
- Looking for: the meeting invite for the last intern handoff. Initiated from: Start menu. Query: intern handoff kind:appointment
- Looking for: a C# program I wrote a long time ago. Initiated from: Explorer pane. Query: QCluster*.*
- Lots of metadata … especially time

Stuff I've Seen: Findings
- Evaluation
  - Internal to Microsoft, ~3000 users in 2004
  - Methods: free-form feedback, questionnaires, usage patterns from log data, in situ experiments, lab studies for richer data
- Personal store characteristics: 5k–1500k items
- Information needs
  - Desktop search != Web search
  - Short queries (1.6 words)
  - Few advanced operators in the initial query (~7%)
  - But … many advanced operators and query iterations in the UI (48%): filters (type, date, people); modify query; re-sort results
  - People know a lot about what they are looking for, and we need to provide a way to express it!

Stuff I've Seen: Findings
- Information needs
  - People are important – 29% of queries involve names/aliases
  - Date is the most common sort order, even with a "best-match" default
  - Few searches for the "best" matching object; many other criteria (e.g., time, people, type), depending on the task
  - Need to support flexible access
  - Abstraction is important – "useful" date, people, pictures
- Age of items retrieved: today (5%), last week (21%), last month (47%)
- Need to support episodic access to memory

Beyond Stuff I've Seen
- Better support for human memory: Memory Landmarks, LifeBrowser, Phlat
- Beyond search
  - Proactive retrieval: Stuff I Should See (IQ), Temporal Gadget
  - Using the desktop index as a rich "user model": PSearch, NewsJunkie, Diff-IE

Memory Landmarks
- Importance of episodes in human memory
  - Memory organized into episodes (Tulving, 1983)
  - People-specific events as anchors (Smith et al., 1978)
  - Time of events often recalled relative to other events, historical or autobiographical (Huttenlocher & Prohaska, 1997)
- Identify and use landmarks to facilitate search and information management
  - Timeline interface, augmented with landmarks
  - Learn Bayesian models to identify memorable events
  - Extensions beyond search, e.g., LifeBrowser

Memory Landmarks [Ringel et al., 2003]
- Search results linked to memory landmarks by time
  - General landmarks (world events, calendar)
  - Personal landmarks (appointments, photos)
- Distribution of results over time

Memory Landmarks: Findings
- Search time (in seconds) compared for timelines with dates only vs. landmarks + dates [results chart]

Memory Landmarks: Learned models of memorability [Horvitz et al., 2004]

LifeBrowser [Horvitz & Koch, 2010]
- Images & videos, appointments & events, desktop & search activity, whiteboard capture, locations

LifeBrowser: Learned models of selective memory

NewsJunkie: Personalized news via information novelty [Gabrilovich et al., WWW 2004]
- News is a stream of information with evolving events, but it's hard to consume it as such
- Personalized news using information novelty
  - Identify clusters of related articles
  - Characterize what a user knows about an event
  - Compute the novelty of new articles relative to this background knowledge (relevant & novel)
  - Novelty = KL-divergence(article || current_knowledge); see the sketch below
  - Use the novelty score and user preferences to guide what, when, and how to show new information
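A minimal sketch of the novelty computation described above, assuming unigram term distributions with simple additive smoothing; the tokenization, smoothing constant, and function names are illustrative and not the published NewsJunkie implementation.

```python
# Sketch: novelty as KL(article || current_knowledge) over smoothed unigram distributions.
import math
from collections import Counter

def unigram_dist(text, vocab, alpha=0.01):
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def novelty(article, background_articles):
    vocab = set(article.lower().split())
    for doc in background_articles:
        vocab.update(doc.lower().split())
    p = unigram_dist(article, vocab)                         # the new article
    q = unigram_dist(" ".join(background_articles), vocab)   # the user's current knowledge
    # KL divergence: high when the article uses language not seen in the background
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

# Articles with high novelty relative to what the user has already seen are
# candidates to surface; low-novelty articles are recaps.
```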

NewsJunkie in Action
- Example story: pizza delivery man with bomb incident
- Novelty score plotted over the article sequence by time, spiking for new developments ("Friends say Wells is innocent", "Looking for two people", "Gun disguised as cane", "Copycat case in Missouri") [chart]

NewsJunkie in Action (continued)

NewsJunkie Evaluation
- Experiment to evaluate algorithms for detecting novelty
- Task: given a background article, select the set of articles you would recommend for a friend who wants to find out what's new about the story
- KL and named-entity algorithms were better than temporal ordering
- But, there are many types of "differences"
  - Recap: review of prior information
  - Elaboration: new information
  - Offshoot: related, but mostly about something else
  - Irrelevant: not related to the main story

NewsJunkie: Types of novelty, via intra-article novelty dynamics
- Novelty score plotted against word position within an article distinguishes:
  - On-topic, recap
  - On-topic, elaboration (e.g., SARS patient's wife held under quarantine)
  - Offshoot (e.g., SARS impact on Asian stock markets; Swiss company develops SARS vaccine)

Overview
- Change on the desktop and news
  - Desktop: Stuff I've Seen; Memory Landmarks; LifeBrowser
  - News: analysis of novelty (e.g., NewsJunkie)
- Change on the Web
  - Content changes over time
  - User interaction varies over time (queries, re-visitation, anchor text, query-click stream, "likes")
  - Tools for understanding Web change (e.g., Diff-IE)
- Improving Web retrieval using dynamics
  - Query trends over time
  - Retrieval models that leverage dynamics
  - Task evolution over time
Questions?

Characterizing Web Change [Adar et al., WSDM 2009]
- Large-scale Web crawls, over time
- Revisited pages: 55,000 pages crawled hourly for 18+ months (unique users, visits/user, time between visits)
- Pages returned by a search engine (for ~100k queries): 6 million pages crawled every two days for 6 months

Measuring Web Page Change
- Summary metrics: number of changes, amount of change, time between changes
- Change curves: fixed starting point; measure similarity over different time intervals
- Within-page changes

Measuring Web Page Change: Summary Metrics
- 33% of Web pages change; 66% of visited Web pages change
- 63% of these change every hour
- Avg. Dice coefficient = 0.80; avg. time between changes = 123 hrs (see the sketch below)
- .edu and .gov pages change infrequently, and not by much
- Popular pages change more frequently, but not by much
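A minimal sketch of the change measure behind these numbers: the Dice coefficient between two snapshots of a page. The study computed it over shingles; using plain word sets here is a simplifying assumption to keep the example short.

```python
# Sketch: Dice coefficient between two crawled snapshots of the same page.
def dice(snapshot_a: str, snapshot_b: str) -> float:
    a, b = set(snapshot_a.lower().split()), set(snapshot_b.lower().split())
    if not a and not b:
        return 1.0  # two empty snapshots are identical
    return 2 * len(a & b) / (len(a) + len(b))

# dice(crawl_t0, crawl_t1) near 1.0 means little change between crawls;
# averaging over successive crawls gives the page's "amount of change".
```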

Measuring Web Page Change: Change Curves
- Fixed starting point; measure similarity over different time intervals
- Change curves show a "knot point" as a function of time from the starting point [chart]

Measuring Within-Page Change
- DOM-level changes
- Term-level changes
  - Divergence from the norm (e.g., cookbooks, salads, cheese, ingredient, bbq, …)
  - "Staying power" of a term in the page over time (e.g., Sep.–Dec.); see the sketch below
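A minimal sketch of term "staying power" as described above, assuming it is measured as the fraction of crawled snapshots of a page in which the term appears; the function name and tokenization are illustrative.

```python
# Sketch: staying power of a term across a page's crawled snapshots.
def staying_power(term, snapshots):
    """snapshots: list of page texts, one per crawl, in time order."""
    term = term.lower()
    present = sum(1 for snap in snapshots if term in snap.lower().split())
    return present / len(snapshots) if snapshots else 0.0

# Long-lived terms score near 1.0; transient terms (news items, comments) score low.
```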

Example Term Longevity Graphs [figures]

Revisitation on the Web [Adar et al., CHI 2009]
- What was the last Web page you visited? Why did you visit (re-visit) the page?
- Revisitation patterns from log analyses: toolbar logs for revisitation, query logs for re-finding
- User survey to understand intent in revisitations

Measuring Revisitation
- 60–80% of the Web pages you visit, you've visited before
- Many motivations for revisits
- Summary metrics: unique visitors, visits/user, time between visits
- Revisitation curves: normalized histogram of revisit intervals (see the sketch below)
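A minimal sketch of a revisitation curve as described above: a normalized histogram of the intervals between successive visits to the same URL. The bin edges (in hours) and data layout are illustrative assumptions.

```python
# Sketch: normalized histogram of revisit intervals across URLs.
from collections import defaultdict

def revisit_curve(visit_times_by_url, bins=(1, 24, 24 * 7, 24 * 30, 24 * 365)):
    """visit_times_by_url: {url: sorted list of visit timestamps, in hours}."""
    counts = defaultdict(int)
    for times in visit_times_by_url.values():
        for prev, cur in zip(times, times[1:]):
            interval = cur - prev
            label = next((f"<= {b}h" for b in bins if interval <= b), f"> {bins[-1]}h")
            counts[label] += 1
    total = sum(counts.values()) or 1
    return {label: n / total for label, n in counts.items()}  # normalized histogram
```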

Four Revisitation Patterns
- Fast: hub-and-spoke; navigation within a site
- Hybrid: high-quality fast pages
- Medium: popular homepages; mail and Web applications
- Slow: entry pages, bank pages; accessed via a search engine

Relationships Between Change and Revisitation
- Interested in change: monitor; effect change (transact)
- Change unimportant: re-find old content
- Change can interfere with re-finding

Revisitation and Search (Re-Finding) [Teevan et al., SIGIR 2007; Tyler et al., WSDM 2010; Teevan et al., WSDM 2011]
- Repeat query (33%), e.g., Q: microsoft research; clicks on same and different URLs
- Repeat click (39%), e.g., Q: microsoft research; msr
- Big opportunity (43%); 24% are "navigational revisits"

Building Support for Web Dynamics
- Content changes and user visitation/re-visitation: Diff-IE and temporal IR

Diff-IE [Teevan et al., UIST 2009; Teevan et al., CHI 2010]
- Diff-IE toolbar: highlights changes to a page since your last visit

Interesting Features of Diff-IE
- Always on; in situ; new to you; non-intrusive
- Try it!

Examples of Diff-IE in Action

Expected New Content

Monitor

Serendipitous Encounters

Unexpected Important Content

Understand Page Dynamics

Diff-IE usage scenarios, from expected to unexpected change: monitor, expected new content, attend to activity, edit, understand page dynamics, serendipitous encounter, unexpected important content, unexpected unimportant content

Studying Diff-IE
- Feedback buttons (in situ)
- Survey, prior to installation and after a month of use (representative)
- Logging: URLs visited, amount of change when revisited (longitudinal)
- Experience interview

People Revisit More
- Perception of revisitation remains constant (How often do you revisit? How often are revisits to view new content?)
- Actual revisitation increases: 39.4% of visits were revisits in the first week vs. 45.0% in the last week (a 14% increase)
- Why are people revisiting more with Diff-IE?

Revisited Pages Change More
- Perception of change increases (What proportion of pages change regularly? How often do you notice unexpected change?)
- Amount of change seen increases: in the first week, 21.5% of revisits were to changed pages (by 6.2%); in the last week, 32.4% (by 9.5%)
- Diff-IE is driving visits to changed pages, and it supports people in understanding change

Change by Page Type
- Perceptions of change are reinforced: pages that change a lot are seen to change more; pages that change a little, to change less
- Page types, roughly from changing a lot to changing a little: news pages; message boards, forums, news groups; search engine results; blogs you read; pages with product information; Wikipedia pages; company homepages; personal home pages of people you know; reference pages (dictionaries, yellow pages, maps)

Other Examples of Dynamics and User Experience
- Content changes: Diff-IE (Teevan et al., 2008); Zoetrope (Adar et al., 2008); Diffamation (Chevalier et al., 2010); temporal summaries and snippets …
- Interaction changes: explicit annotations, ratings, wikis, etc.; implicit interest via interaction patterns; edit wear and read wear (Hill et al., 1992)

Overview
- Change on the desktop and news
  - Desktop: Stuff I've Seen; Memory Landmarks; LifeBrowser
  - News: analysis of novelty (e.g., NewsJunkie)
- Change on the Web
  - Content changes over time
  - User interaction varies over time (queries, re-visitation, anchor text, query-click stream, "likes")
  - Tools for understanding Web change (e.g., Diff-IE)
- Improving Web retrieval using dynamics
  - Query trends over time
  - Retrieval models that leverage dynamics
  - Task evolution over time
Questions?

Leveraging Dynamics for Retrieval
- Content changes and user visitation/re-visitation feed temporal IR

Improving Web Retrieval Using Dynamics
- Query frequency over time
- Retrieval models that incorporate time
  - Ranking algorithms typically look only at a single snapshot in time
  - But both content and user interaction with the content change over time
  - Model content change on a page; model user interactions
- Tasks evolve over time

Query Dynamics
- Queries sometimes mention time, but often don't
  - Explicit time (e.g., World Cup Soccer 2011)
  - Explicit news (e.g., earthquake news)
  - Implicit time (e.g., Harry Potter reviews; implicit "now")
- Queries are not uniformly distributed over time; they are often triggered by events in the world
- Using temporal query patterns to cluster similar queries, and to identify events and find related news

Query Dynamics: Modeling query frequency over time [Vlachos et al., SIGMOD 2004]
- Example: Q: cinema
- Discrete Fourier Transform: the best k components (vs. the first k) significantly reduce reconstruction error
- Burst detection: bursts as deviations from a moving average (see the sketch below)
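A minimal sketch of the two ideas above applied to a daily query-frequency series: keep the k largest-magnitude DFT components for a compact reconstruction, and flag bursts as days that deviate from a moving average by more than a threshold. The window size and threshold are illustrative assumptions, not the parameters used by Vlachos et al.

```python
# Sketch: best-k DFT reconstruction and moving-average burst detection.
import numpy as np

def best_k_dft(series, k=5):
    coeffs = np.fft.fft(series)
    keep = np.argsort(np.abs(coeffs))[-k:]      # k largest-magnitude components
    compressed = np.zeros_like(coeffs)
    compressed[keep] = coeffs[keep]
    return np.fft.ifft(compressed).real         # compact, low-error reconstruction

def detect_bursts(series, window=7, num_std=2.0):
    series = np.asarray(series, dtype=float)
    bursts = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        if series[t] > history.mean() + num_std * history.std():
            bursts.append(t)                     # day t deviates from the moving average
    return bursts
```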

Query Dynamics: Types of query popularity patterns [Kulkarni et al., WSDM 2011]
- Number of spikes (0, 1, multiple)
- Periodic (yes, no)
- Shape of rise and fall (wedge, sail, castle)
- Trend (flat, up, down)
- Relating changes in query popularity and content to changes in user intent (i.e., what is relevant to the query) … more on this later

Using Query Dynamics to Find Similar Queries [Chien & Immorlica, WWW 2005]
- Model query patterns using empirical query frequency (normalized by total queries per day)
- Examples: Q: movies; Q: scott peterson; Q: weather report
- Identify "similar" queries using the correlation coefficient between the normalized time series (see the sketch below)
- A nice use of time to identify semantic similarity between queries/entities, but not predictive
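A minimal sketch of the temporal similarity measure above: normalize each query's daily frequency by total daily query volume, then compare two queries via the Pearson correlation of their normalized time series. The data structures are illustrative assumptions.

```python
# Sketch: temporal similarity between two queries via correlated time series.
import numpy as np

def normalized_series(daily_counts, daily_totals):
    return np.asarray(daily_counts, dtype=float) / np.asarray(daily_totals, dtype=float)

def temporal_similarity(counts_q1, counts_q2, daily_totals):
    s1 = normalized_series(counts_q1, daily_totals)
    s2 = normalized_series(counts_q2, daily_totals)
    return np.corrcoef(s1, s2)[0, 1]   # high correlation => temporally similar queries
```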

Query Dynamics via Document Dates
- News: the date of a document is easy to obtain
  - Li & Croft used the publication date of documents to rank results
  - Jones & Diaz model the period of time relevant to a query (the temporal profile of a query)
  - Three classes of queries: atemporal (0 bursts), temporally unambiguous (1 burst), temporally ambiguous (>1 burst)
- Web: the date of a document is harder to obtain

Using Query Dynamics to Identify "News" Queries [Diaz, 2010]
- Many queries to Web search engines are motivated by events in the world
- Should you show just Web results, or provide an integrated view of news and Web?
- Example: learn a model to predict the "newsworthiness" of a query (i.e., will a user click on news results?)
  - Is the query part of a burst? [content consumption]
  - Are the top-ranked news results very recent? [content production]
  - Improve prediction using ongoing click data for this and related queries

Temporal Retrieval Models 1 [Elsas & Dumais, WSDM 2010]
- Current retrieval algorithms look only at a single snapshot of a page
- But Web page content changes over time; can we leverage this to improve retrieval?
- Pages have different rates of change: different document priors (based on change rather than link structure) give a change prior
- Terms have different longevity (staying power): some are always on the page, some are transient
- Language modeling approach to ranking (see the sketch below)
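A minimal sketch in the spirit of the approach above, assuming a document is scored as a mixture of language models built from long-, medium-, and short-horizon snapshots (term longevity), with a change-based document prior added in log space. The mixture weights, smoothing, and function names are illustrative assumptions, not the published model.

```python
# Sketch: temporally-aware language-model score with a change-based prior.
import math
from collections import Counter

def lm_prob(term, text, mu=2000, collection_prob=1e-6):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Dirichlet-smoothed unigram probability
    return (counts[term] + mu * collection_prob) / (total + mu)

def temporal_score(query, snapshots_long, snapshots_medium, snapshots_short,
                   change_prior, weights=(0.5, 0.3, 0.2)):
    """change_prior: positive prior probability derived from the page's change rate."""
    views = [" ".join(snapshots_long), " ".join(snapshots_medium), " ".join(snapshots_short)]
    score = math.log(change_prior)                   # document prior from change rate
    for term in query.lower().split():
        p = sum(w * lm_prob(term, view) for w, view in zip(weights, views))
        score += math.log(p)
    return score
```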

Relevance and Page Change (the change prior)
- Page change is related to relevance judgments
- Human relevance judgments on a 5-point scale: Perfect/Excellent/Good/Fair/Bad
- Rate of change: 60% for Perfect pages vs. 30% for Bad pages
- Use the change rate as a document prior (vs. priors based on link structure, like PageRank)
- Shingle prints used to measure change

Relevance and Term Change (term longevity)
- Term patterns vary over time
- Represent a document as a mixture of terms with different "staying power": long, medium, short

Evaluation: Queries & Documents
- 18K queries, 2.5M judged documents
- 5-level relevance judgments (Perfect … Bad)
- 2.5M documents crawled weekly for 10 weeks
- Navigational queries: 2K queries identified with a "Perfect" judgment; assume these relevance judgments are consistent over time

Experimental Results
- Compared: baseline static model; dynamic model; change prior; dynamic model + change prior [results chart]

Temporal Retrieval Models 2 [Kulkarni et al., WSDM 2011; Radinsky et al., in prep]
- Initial evaluation: navigational queries, assuming relevance is "static" over time
- But relevance often changes over time
  - E.g., Stanley Cup in 2011 vs. 2010
  - E.g., US Open 2011 in May (golf) vs. in Sept (tennis)
  - E.g., March Madness 2011: before the event, schedules and tickets (e.g., stubhub); during the event, real-time scores (e.g., espn, cbssports); after the event, general sites (e.g., wikipedia, ncaa)
- Ongoing evaluation: collecting explicit relevance judgments, query frequency, interaction data, and page content over time; developing temporal IR models and temporal snippets

Relevance over Time
- Query: march madness [Mar 15 – Apr 4, 2010]; the relevant results differ during vs. after the event [chart]

Relevance over Time
- Query: sigir
- Why is old content ranked higher? User interaction data is more prevalent for older documents (e.g., query-clicks, anchor text, etc.)

Relevance over Time
- Query: nfl (Sept–Jan vs. Feb–Aug)
- Again, the time of the query and the time of the event are not well modeled

Time Series Modeling of Queries and Clicks
- E.g., ny daily news: the query has a weekly period; URL clicks have a consistent ordering
- E.g., gold miner: the query has more spikes; the most-clicked URL varies over time
- Model queries and URL clicks as time series

Time Series Modeling [model details shown in figure; a sketch follows]
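A minimal sketch contrasting the two historical baselines referenced in the results below: predicting a URL's next-day clicks from (1) the historical average and (2) a simple time-series model, here exponential smoothing with a weekly seasonal term. This is an illustrative stand-in under those assumptions; the actual models are not specified in the slides.

```python
# Sketch: historical-average vs. simple time-series forecast of URL clicks.
import numpy as np

def historical_average(daily_clicks):
    return float(np.mean(daily_clicks))

def smoothed_forecast(daily_clicks, alpha=0.3, period=7):
    """Assumes at least one full weekly period of history."""
    clicks = np.asarray(daily_clicks, dtype=float)
    level = clicks[0]
    for x in clicks[1:]:
        level = alpha * x + (1 - alpha) * level        # exponential smoothing
    # add the average deviation for the upcoming weekday (weekly seasonality)
    weekday = len(clicks) % period
    seasonal = clicks[weekday::period].mean() - clicks.mean()
    return level + seasonal
```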

Experimental Results
- Ground truth: actual user behavior over time
- Two search-related prediction tasks: predict the number of URL clicks (prediction error); rank URLs for search (correlation)
- Three types of models: no historical features (text only); historical average; time-series modeling
- Results (preliminary): Text Only vs. Text + Historical Avg vs. Text + Time Series, compared on both tasks [values in table]

Task Evolution over Time
- Most retrieval models look at queries in isolation
- But people often use multiple queries to accomplish a task
- Within-session tasks: use previous actions in the current session to improve understanding of the current query
- Cross-session tasks: develop methods to predict and support task resumption over time

Within-Session Tasks [White et al., CIKM 2010]
- Examples
  - Q: [sigir] … given [information retrieval] vs. [iraq reconstruction]
  - Q: [acl] … given [computational linguistics] vs. [knee injury] vs. [country music]
- 40% of sessions contain multiple queries; 60% of queries have at least one preceding query
- Use previous actions in the current session to improve understanding of the current query
- Model the context using ODP categories (e.g., Health/Medicine, Music, Computer Science, Sports, Business); use it to predict future actions, rank current results, etc. (see the sketch below)
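A minimal sketch of the session-context idea above, assuming each query and prior action is represented as a distribution over ODP-style categories that is blended with the current query's distribution; the category labels, data structures, and mixing weight are illustrative assumptions.

```python
# Sketch: blend the current query's ODP-category distribution with session context.
from collections import defaultdict

def blend(current_query_dist, session_dists, lam=0.6):
    """current_query_dist / session_dists: dicts mapping ODP category -> probability."""
    context = defaultdict(float)
    for dist in session_dists:
        for cat, p in dist.items():
            context[cat] += p / len(session_dists)      # average prior-action distributions
    cats = set(current_query_dist) | set(context)
    return {c: lam * current_query_dist.get(c, 0.0) + (1 - lam) * context[c] for c in cats}

# Example: an ambiguous query like [acl] leans toward Computer Science if the session
# context was [computational linguistics], and toward Sports/Health if it was [knee injury].
```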

Within-Session Results
- Context helps: using any context source improves accuracy, and using more sources improves it further
- Context sources compared: none (current query only, accuracy 0.39); Queries (all previous queries); Queries + SERPClicks (all previous queries / result clicks); Queries + SERPClicks + NavTrails (all previous actions)
- Percentage of queries best, per the original table's three comparison columns:
  - Queries: 25%, 18%, 22%
  - Queries + SERPClicks: 30%, 16%, 25%
  - Queries + SERPClicks + NavTrails: 34%, 11%, 30%
- Differences across queries
  - Query model wins: the current query has specific intent ([espn], [webmd]), or it is the first action after a shift in interests
  - Context model wins: the query is ambiguous ([amazon]) and the session has a consistent intent
  - Intent model wins: the session has a consistent intent throughout

Research Missions [Donato et al., WWW 2010; Jones & Klinkner, CIKM 2008]
- Identify "research missions" on the fly during a search session (following Jones & Klinkner)
- Three general signals: a research_mission(q1, q2) classifier; a same_mission(q1, q2) classifier; sim(topics(q1), topics(q2))
- Many features
  - Textual: similarity of q1 and q2
  - Session-based: queries, clicks, queries since last click, etc.
  - Time-based: time between q1 and q2, total session time, etc.
- Trigger Yahoo! Scratch Pad if a research mission is detected

Cross-Session Tasks [Kotov et al., SIGIR 2011]
- Many tasks extend across sessions, e.g., medical diagnosis and treatment, event planning, how-to advice, shopping research, academic research, etc.
- 10–15% of tasks continue across multiple sessions; 20–25% of queries are from multi-session tasks
- Example: develop methods to support task resumption over time
  - Same task: find (previous) related queries/clicks
  - Task resumption: predict whether the user will resume the task

Cross-Session Tasks: Results [Kotov et al., SIGIR 2011]
- Approach: classification (logistic regression, MART) with query, pair-wise, session-based, and history-based features
- Results for same-task identification and task continuation [charts]
- Develop support for task continuation

Other Examples of Dynamics and Information Systems
- Document dynamics, for crawling and indexing: Adar et al. (2009); Cho & Garcia-Molina (2000); Fetterly et al. (2003)
- Query dynamics: Kulkarni et al. (2011); Jones & Diaz (2004); Diaz (2009); Kotov et al. (2010)
- Temporal retrieval models: Elsas & Dumais (2010); Li & Croft (2004); Efron (2010); Aji et al. (2010)
- Extraction of temporal entities within documents
- Protocol extensions for retrieving versions over time, e.g., Memento (Van de Sompel et al., 2010)

Summary
- Web content changes at the page level and the term level
- People revisit and re-find Web content
- Relating revisitation and change allows us to identify pages for which change is important, and to identify interesting components within a page
- Diff-IE: supports (and influences) interaction and understanding
- Temporal IR: leverages change for improved retrieval
- Time matters on the desktop, in news, and in real-time media

Challenges and Opportunities
- Temporal dynamics are pervasive in information systems
- They influence many aspects of information systems
  - Systems: protocols, crawling, indexing, caching
  - Document representations: metadata generation, information extraction, sufficient statistics at the page and term level
  - Retrieval models: term weights, document priors, etc.
  - User experience and evaluation
- Better supporting the temporal dynamics of information
  - Requires digital preservation and temporal metadata extraction
  - Enables richer understanding of the evolution (and prediction) of key ideas, relations, and trends over time
- Time is one important example of context in IR; others include location, individuals, tasks …

Think Outside the (Search) Boxes
- Search research should look beyond query words and a ranked list, to user context, task/use context, and document context

Thank You! Questions/Comments … More info: Diff-IE … try it!

References
- Desktop and news: Dumais et al. (SIGIR 2003); Ringel et al. (Interact 2003); Horvitz & Koch (2010); Gabrilovich et al. (WWW 2004); Diaz (2009).
- Document dynamics: Adar et al. (WSDM 2009); Cho & Garcia-Molina (VLDB 2000); Fetterly et al. (WWW 2003).
- User interaction: Adar et al. (CHI 2009); Teevan et al. (SIGIR 2007); Tyler et al. (WSDM 2010); Teevan et al. (WSDM 2011); Adar et al. (CHI 2010); Teevan et al. (UIST 2009); Teevan et al. (CHI 2010).
- Query dynamics: Kulkarni et al. (WSDM 2011); Jones & Diaz (TOIS 2007); Vlachos et al. (SIGMOD 2004); Chien & Immorlica (WWW 2005).
- Temporal retrieval models: Elsas & Dumais (WSDM 2010); Radinsky et al. (in prep); Li & Croft (2004); Efron & Golovchinsky (SIGIR 2011).
- Tasks over time: White et al. (CIKM 2010); Donato et al. (WWW 2010); Kotov et al. (SIGIR 2011); Jones & Klinkner (CIKM 2008).
- General: Kleinberg; Van de Sompel et al. (arXiv 2009).