
Slide 1: Personalizing Search via Automated Analysis of Interests and Activities
Jaime Teevan (MIT), Susan T. Dumais (Microsoft), Eric Horvitz (Microsoft). SIGIR 2005.

Slide 2: Problem
– In IR, the same query can mean different things to different people: "IR" may mean information retrieval to one user and Iran to another.
– Users rarely type the long, detailed queries that would disambiguate their intent.

Slide 3: Approach
– Automatically build and maintain a user profile.
– Re-rank retrieved documents using the information in the profile.
– This is not explicit relevance feedback, because no user interaction is required.
– It is not pseudo-relevance feedback either, because the profile captures long-term interests rather than the current result set.

Slide 4: User Profile?
– Web pages the user has viewed
– Email messages viewed or sent
– Calendar items
– Documents (text files) on the user's computer
– The profile is kept on the user's machine, for obvious privacy reasons.
– Each piece of information is treated as an ordinary text document, so the profile acts as a per-user document database.

Slide 5: System Architecture
1. The user's computer (which contains the user profile) sends the query to the search engine.
2. The search engine returns the top-N documents.
3. The client re-ranks those documents locally using the profile.
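A minimal sketch of this flow, assuming hypothetical fetch_top_n and personal_score callables (the deck prescribes no API, only that re-ranking happens on the client so the profile never leaves the machine):

```python
# Sketch of the client-side re-ranking loop from the architecture slide.
# fetch_top_n and personal_score are hypothetical stand-ins, not names
# from the paper.

def personalized_search(query, profile, fetch_top_n, personal_score, n=50):
    results = fetch_top_n(query, n)  # steps 1-2: query out, top-N docs back
    # step 3: re-score each result locally against the user profile
    return sorted(results,
                  key=lambda doc: personal_score(query, doc, profile),
                  reverse=True)
```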

Slide 6: Document Relevance Score
– Ignore the search engine's own ranking.
– Modify the BM25 formula (the "divine formula," as the presenter calls it) with relevance-feedback term weights, in the standard form:

  w_i = log( ((r_i + 0.5)(N − n_i − R + r_i + 0.5)) / ((n_i − r_i + 0.5)(R − r_i + 0.5)) )

where:
– w_i = weight of query term i
– N = total number of documents in the corpus
– n_i = number of documents containing term i
– R = total number of relevant documents identified by feedback
– r_i = number of relevant documents containing term i
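As a sketch, the weight above translates directly into code; note that the formula itself was reconstructed from the slide's variable list, and the 0.5 smoothing constants follow the standard Robertson–Sparck Jones form, which is an assumption:

```python
import math

# Assumed standard BM25 relevance-feedback weight; the slide lists only
# the variables, not the formula.

def term_weight(N, n_i, R, r_i):
    """Weight w_i for query term i given corpus and relevance statistics."""
    return math.log(((r_i + 0.5) * (N - n_i - R + r_i + 0.5)) /
                    ((n_i - r_i + 0.5) * (R - r_i + 0.5)))
```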

Slide 7: Alternate Variable Definition (1)
– Instead of N = total number of documents in the corpus, use N = number of documents retrieved by the search engine.
– Instead of n_i = number of corpus documents containing term i, use n_i = number of retrieved documents containing term i.
– Otherwise the client would have to perform one extra web search per query term just to estimate n_i.
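A sketch of computing these retrieved-set statistics, assuming doc_texts holds the visible text of each result (whitespace tokenization is a simplification for illustration):

```python
def corpus_stats(doc_texts, query_terms):
    """Estimate N and n_i from the retrieved set alone, avoiding extra searches."""
    N = len(doc_texts)
    n = {t: sum(1 for text in doc_texts if t in text.lower().split())
         for t in query_terms}
    return N, n
```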

Slide 8: Alternate Variable Definition (2)
– The value of n_i changes depending on how much of each retrieved document is seen:
  – full text
  – title and search-engine summary

Slide 9: Pseudo-Relevant Documents (1)
– We still need values for R and r_i in the term weight; this is where the user profile comes into play.
– There are two ways of defining R:
  – R = total number of documents in the user profile, independent of the query
  – R = number of profile documents that match the query (a Boolean match, presumably)
– r_i follows from whichever definition of R is used.
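A sketch covering both definitions; the any-term Boolean match is an assumption, since the slide itself hedges with "boolean?":

```python
def feedback_stats(profile_docs, query_terms, query_focused=True):
    """profile_docs: list of token lists. Returns R and r_i per query term."""
    if query_focused:
        # R = profile documents matching the query (assumed any-term match)
        relevant = [d for d in profile_docs
                    if any(t in d for t in query_terms)]
    else:
        # R = the entire profile, independent of the query
        relevant = profile_docs
    R = len(relevant)
    r = {t: sum(1 for d in relevant if t in d) for t in query_terms}
    return R, r
```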

Slide 10: Pseudo-Relevant Documents (2)
– We do not have to use all documents in the user profile.
– A subset can serve instead:
  – past query strings
  – viewed web pages
  – recently viewed documents

Slide 11: Term Frequency
– So far we have seen many definitions of w_i.
– But BM25 also requires a term frequency (tf_i) value for each query term in the document being scored.
– The value of tf_i depends on how much of a document we see:
  – full text
  – title and search-engine summary
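BM25-style models combine these pieces as a sum of tf_i × w_i over the query terms; a sketch, with weights taken from the term_weight function above (that this is the exact combination used is an assumption; the slide only names the two ingredients):

```python
def score_document(doc_tokens, weights):
    """doc_tokens: token list for one result; weights: {term: w_i}."""
    # tf_i is the count of term i in the visible text (full text or snippet)
    return sum(doc_tokens.count(term) * w for term, w in weights.items())
```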

Slide 12: Query Expansion
– Expand the query with all words in the pseudo-relevant documents, or
– Expand the query with only the words surrounding the query terms in the pseudo-relevant documents (sketched below).
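A sketch of the second option; the window size is an assumption, as the slide gives none:

```python
def near_query_expansion(relevant_docs, query_terms, window=5):
    """Collect words within `window` tokens of any query-term occurrence."""
    expansion = set()
    for tokens in relevant_docs:          # each document as a token list
        for i, tok in enumerate(tokens):
            if tok in query_terms:
                expansion.update(tokens[max(0, i - window): i + window + 1])
    return expansion - set(query_terms)   # keep only the new terms
```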

Slide 13: Experiment Setup (1)
– 15 participants (Microsoft employees), with 10 queries each.
– Participants could pick from pre-selected queries and/or make up their own.
– The MSN search engine (MSN.com) was used.
– Participants judged the top 50 results for each of their queries as highly relevant, relevant, or not relevant.

Slide 14: Experiment Setup (2)
– Queries with no retrieved documents or no relevant documents were ignored, leaving 131 queries (53 pre-selected, 78 self-generated).
– Participants provided the documents on their own computers as their user profiles.

Slide 15: Evaluation Measure
– Discounted Cumulative Gain (DCG): higher-ranking documents affect the score more than lower-ranking ones.
– Gain values: G(i) = 0 if the document at rank i is irrelevant, 1 if relevant, 2 if highly relevant.
– Scores are normalized to values between 0 (worst) and 1 (best).
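A sketch of normalized DCG using the gains above; the logarithmic discount is the common DCG convention, which the slide does not spell out:

```python
import math

def ndcg(gains):
    """gains: list of 0/1/2 judgments in ranked order."""
    def dcg(gs):
        # rank 1 is undiscounted; later ranks are discounted by log2(rank)
        return sum(g if i == 0 else g / math.log2(i + 1)
                   for i, g in enumerate(gs))
    ideal = dcg(sorted(gains, reverse=True))  # best possible ordering
    return dcg(gains) / ideal if ideal > 0 else 0.0
```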

Slide 16: Experiment Combinations (1)
– Corpus Representation (N and n_i):
  – Full Text: statistics from the retrieved documents' full text
  – Web: one web search per query term
  – Snippet: title and summary from the retrieved documents
– Query Focus:
  – No: use all documents in the user profile as relevant documents
  – Yes: use only the profile documents that match the query as relevant documents
– User Representation (user profile):
  – No Model: user profile not used
  – Query: past query strings
  – Web: viewed web pages
  – Recent: recently viewed documents
  – Full Index: everything in the user profile

Slide 17: Experiment Combinations (2)
– Document Representation (tf_i):
  – Snippet: title and summary
  – Full Text
– Query Expansion:
  – Near Query: use words near the query terms in the relevant documents
  – All Words: use all words in the relevant documents
– Vary only one variable at a time, holding the others constant.
– Report the average DCG score over all 131 queries.

Slide 18: Experiment Result
[Results table omitted from the transcript; the cells marked as best appear to correspond to a normalized DCG of 0.46.]

Slide 19: Real World Stuff
Legend for the comparison (chart omitted from the transcript):
– Rand = random ranking
– No = pure BM25
– RF = pseudo-relevance feedback
– PS = personalized search
– URL = URL-history boost
– Web = search engine ranking
– Mix = PS + Web, combined either by probability of relevance or by rank position
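One plausible reading of the rank-position variant of Mix, sketched below; the exact merge rule is not on the slide, so summing the two rank positions is purely an assumption:

```python
def mix_by_rank(ps_ranking, web_ranking):
    """Merge personalized and web rankings by (assumed) rank-position sum."""
    pos_ps = {doc: i for i, doc in enumerate(ps_ranking)}
    pos_web = {doc: i for i, doc in enumerate(web_ranking)}
    docs = pos_ps.keys() | pos_web.keys()
    worst = len(docs)                      # docs missing from a list sink down
    return sorted(docs,
                  key=lambda d: pos_ps.get(d, worst) + pos_web.get(d, worst))
```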

Slide 20: Finally
– Only a slight improvement is seen when the personalized ranking is merged with the search engine's results.

