
1 Personalization with user’s local data: Personalizing Search via Automated Analysis of Interests and Activities. Sungjick Lee, Department of Electrical and Computer Engineering, 2008-11-05

2 Introduction: Current Web Search Engines
Query ‘IR’ on Google. We are studying ‘information retrieval’, so we need pages about that. The top results are not about information retrieval (slide callouts: “NO!” and “And what is this?”).

3 Introduction: We want results like this
Query ‘IR’ on Google:
- For the information-retrieval researcher: the SIGIR homepage
- For the financial analyst: stock quotes for Ingersoll-Rand
- For the chemist: pages about infrared light

4 Two methodologies (1/2)
Two methodologies for a Web search engine to incorporate information about a user:
1. A user profile is communicated to the server.
2. The results are downloaded and re-ranked.
Diagram labels: user profile, top-ranked pages, re-ranked.

5 Two methodologies (2/2)
Focusing on the second method for several reasons:
1. It ensures privacy.
2. It makes computationally intensive procedures feasible.
3. Re-ranking methods facilitate straightforward evaluation.

6 Traditional FB vs. Personal Profile FB
n_i: the number of documents in the corpus that contain term i
r_i: the number of documents for which relevance feedback has been provided that contain term i
Traditional FB: relevance information (R, r_i) comes from the corpus.
Personal profile FB: profiles are derived from a personal store.

7 BM25 (Traditional FB)
A well-known probabilistic weighting scheme. Essentially, it sums over query terms the log odds of the query terms occurring in relevant and non-relevant documents. The term weight takes one form without relevance information and another with relevance information.
tf_i: the frequency with which term i appears in the document
N: the number of documents in the corpus
n_i: the number of documents in the corpus that contain term i
R: the number of documents for which relevance feedback has been provided
r_i: the number of documents for which relevance feedback has been provided that contain term i
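
The formulas on this slide are not part of the transcript. As a minimal sketch, assuming the simplified BM25 form in which a document's score is the sum of tf_i times a term weight, and assuming the standard Robertson/Sparck Jones weight with 0.5 smoothing for the relevance-feedback case, the two weights could be computed as follows. The function names are illustrative, not taken from the slides.

```python
import math


def weight_without_relevance(N, n_i):
    """Term weight with no relevance information: log(N / n_i)."""
    return math.log(N / n_i)


def weight_with_relevance(N, n_i, R, r_i):
    """Assumed Robertson/Sparck Jones relevance weight with 0.5 smoothing."""
    numerator = (r_i + 0.5) * (N - n_i - R + r_i + 0.5)
    denominator = (n_i - r_i + 0.5) * (R - r_i + 0.5)
    return math.log(numerator / denominator)


def score(tf, weights):
    """Sum tf_i * w_i over the terms that have a weight."""
    return sum(tf.get(term, 0) * w for term, w in weights.items())
```

A term that appears in a larger share of the feedback documents than of the corpus gets a positive weight; a term that is relatively rarer in the feedback documents gets a negative one.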

8 Personal Profile FB
Using information outside of the Web corpus:
- Pulling the relevant documents outside the document space
- Extending the notion of the corpus
Relevance: substituting N' = (N + R), n_i' = (n_i + r_i)
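
The effect of the substitution can be worked through explicitly. The derivation below assumes the term weight with relevance information has the standard Robertson/Sparck Jones form with 0.5 smoothing (the slide's own formula is not in the transcript), and writes w_i for that weight:

```latex
\begin{align*}
w_i &= \log \frac{(r_i + 0.5)\,(N' - n_i' - R + r_i + 0.5)}
                 {(n_i' - r_i + 0.5)\,(R - r_i + 0.5)} \\
    &= \log \frac{(r_i + 0.5)\,\bigl((N + R) - (n_i + r_i) - R + r_i + 0.5\bigr)}
                 {\bigl((n_i + r_i) - r_i + 0.5\bigr)\,(R - r_i + 0.5)} \\
    &= \log \frac{(r_i + 0.5)\,(N - n_i + 0.5)}
                 {(n_i + 0.5)\,(R - r_i + 0.5)}
\end{align*}
```

Under that assumption, the corpus counts (N, n_i) and the personal-store counts (R, r_i) no longer overlap, so each can be estimated from its own source.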

9 Representation: Corpus (N, n_i) (1/2)
Estimating N, the number of documents on the Web: use the most frequent word in English, “the”, as the query, and take the number of results.
Estimating n_i, the number of documents on the Web that contain term i: probe the Web by issuing one-word queries.
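
The probing procedure can be sketched as below. `hit_count` is a hypothetical callable that returns the number of results a search engine reports for a one-word query; it stands in for whatever engine is used and is not a real API.

```python
def estimate_corpus_stats(terms, hit_count):
    """Estimate (N, n) by issuing one-word queries.

    hit_count: hypothetical function mapping a one-word query to the
    number of results the search engine reports for it.
    """
    # N is approximated by the result count for the most frequent English word.
    N = hit_count("the")
    # n_i is approximated by the result count for each term on its own.
    n = {term: hit_count(term) for term in terms}
    return N, n
```

For example, with a stub such as `{"the": 1000, "ir": 10}.get` in place of a live engine, the call `estimate_corpus_stats(["ir"], stub)` yields `N = 1000` and `n = {"ir": 10}`.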

10 Representation: Corpus (N, n_i) (2/2)
Focusing the corpus representation:
- Corpus statistics can be gathered either from all of the documents on the Web or only from the subset of documents relevant to the query (referred to as a query focus).
- Example: for the query “IR”, a query-focused corpus consists only of documents that contain the term “IR”.
- When the corpus representation is limited to a query focus, the user representation is correspondingly query focused.
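
A minimal sketch of the difference between full and query-focused corpus statistics, computed over a small in-memory collection. Documents are assumed to be pre-tokenized lists of lowercase terms, and a multi-term query is assumed to require every query term; both are illustrative choices rather than details from the slides.

```python
from collections import Counter


def corpus_stats(documents, query_terms=None):
    """Return (N, n): document count and per-term document frequencies.

    If query_terms is given, only documents containing every query term
    are counted (the query focus).
    """
    doc_sets = [set(doc) for doc in documents]
    if query_terms is not None:
        required = set(query_terms)
        doc_sets = [terms for terms in doc_sets if required <= terms]
    N = len(doc_sets)
    n = Counter(term for terms in doc_sets for term in terms)
    return N, n
```

With the query focus applied, terms that never co-occur with the query drop to a document frequency of zero, so the statistics describe only the documents an average searcher would see for that query.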

11 Representation: User (R, r_i) (1/2)
- A rich index of personal content that captures a user’s interests and computational activities can be obtained from desktop indices such as Google Desktop, Mac Tiger, and Windows Desktop Search.
- The paper indexed all of the information created, copied, or viewed by a user: Web pages, email messages, calendar items, and documents stored on the client machine.
- The most straightforward way to use this index is to treat every document in it as a source of the user’s interests:
  R: the number of documents in the index
  r_i: the number of documents in the index that contain term i

12 Representation: User (R, r_i) (2/2)
- Considering limiting the documents used: restricting the document type to Web pages, or limiting the documents to the most recent ones. The paper compares documents indexed in the last month with the full index of documents.
- Two lighter-weight representations: using the query terms that the user had issued in the past, and boosting search results with URLs from domains that the user had visited in the past.
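
The user representation (R, r_i), together with the restrictions by document type and recency, can be sketched as follows. The `IndexedItem` record and its field names are hypothetical; a real desktop index would expose its own schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IndexedItem:
    """Hypothetical record for one item in a personal index."""
    kind: str              # e.g. "web", "email", "calendar", "document"
    indexed_at: datetime   # when the item was indexed
    terms: frozenset       # distinct terms occurring in the item


def user_stats(items, kinds=None, since=None):
    """Return (R, r) over the items that pass the optional filters."""
    selected = [
        item for item in items
        if (kinds is None or item.kind in kinds)
        and (since is None or item.indexed_at >= since)
    ]
    R = len(selected)
    r = Counter(term for item in selected for term in item.terms)
    return R, r
```

Calling `user_stats(items)` uses the full index, `kinds={"web"}` restricts it to Web pages, and `since=now - timedelta(days=30)` keeps roughly the last month.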

13 Representation: Document (tf_i) and Query
- Using the full text of the documents in the result set: accessing the full text of each document takes considerable time.
- The paper also experimented with using only the title and snippet of each document returned by the Web search engine; the snippet is inherently query focused.
- Query expansion: including all of the terms occurring in the relevant documents is a kind of blind or pseudo-relevance feedback in which the top-k documents are considered relevant.
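
A minimal sketch of the title-and-snippet document representation and of blind query expansion from the top-k results. The tokenizer (lowercase alphanumeric runs) and the `(title, snippet)` tuple layout are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(title, snippet):
    """tf_i for a result represented only by its title and snippet."""
    return Counter(tokenize(title) + tokenize(snippet))


def expand_query(query, results, k):
    """Add every term from the top-k (title, snippet) results to the query."""
    expanded = list(dict.fromkeys(tokenize(query)))  # de-duplicate, keep order
    seen = set(expanded)
    for title, snippet in results[:k]:
        for term in tokenize(title) + tokenize(snippet):
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded
```

Because the top-k results are treated as relevant without any user judgment, the quality of the expansion depends entirely on how good the original ranking already is.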

14 Evaluation Framework (1/4)
An evaluation collection:
- 15 participants evaluated the top 50 Web search results for approximately 10 self-selected queries each.
- For each search result, participants were asked to judge whether it was highly relevant, relevant, or not relevant to the query.
- Web search results came from MSN Search.

15 Evaluation Framework (2/4)
Selecting the queries to be evaluated:
1. Users chose a query to mimic a search they had performed earlier that day.
2. Users selected a query from a list formulated to be of general interest (e.g., “cancer”, “Bush”, “Web search”).
A total of 131 queries: 53 were pre-selected and 78 were self-generated.

16 Evaluation Framework (3/4)
- Each participant provided an index of the information on their personal computer, ranging in size from 10,000 to 100,000 items, which was used to compute their personalized term weights.
- All participants were employees of Microsoft: software engineers, researchers, program managers, and administrators.

17 Evaluation Framework (4/4)
- Discounted Cumulative Gain (DCG), from [9], “IR evaluation methods for retrieving highly relevant documents”.
- Cumulative Gain (CG) and DCG are each defined recursively over rank i, with one case for i = 1 and another otherwise.
- Example: G = …, CG = …
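
The piecewise definitions are not part of the transcript. The sketch below assumes the usual recursive form: CG(i) = CG(i-1) + G(i) and DCG(i) = DCG(i-1) + G(i) / log(i), with both equal to G(1) at i = 1. The base-2 logarithm is an assumption, as is the use of plain numeric gains for the relevance judgments.

```python
import math


def cumulative_gain(gains):
    """CG at each rank: running sum of the gain values."""
    total = 0
    result = []
    for gain in gains:
        total += gain
        result.append(total)
    return result


def discounted_cumulative_gain(gains):
    """DCG at each rank: gains after rank 1 are divided by log2(rank)."""
    total = 0.0
    result = []
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
        result.append(total)
    return result
```

The discount means a relevant result at rank 4 contributes half as much as the same result at rank 1 or 2, so a re-ranking that moves relevant results upward raises the score.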

18 Results: Alternative Representations (1/2)
Figure labels: richer model, poorer model.
When only documents related to the query are used to represent the corpus, the term weights represent how different the user is from the average person who submits the query.

19 Results: Alternative Representations (2/2)
- A rich user profile is more important than a rich document representation.
- The best of 67 different combinations:
  Corpus: approximated by the result-set titles and snippets, which are inherently query focused
  User: built from the user’s entire personal index, query focused
  Document and query: documents represented by the title and snippet returned by the search engine, with query expansion based on words that occur near the query term
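
A rough sketch of how client-side re-ranking with title-and-snippet documents could look. It assumes the personalized term weight log((r_i + 0.5)(N - n_i + 0.5) / ((n_i + 0.5)(R - r_i + 0.5))), which follows from the N' = N + R, n_i' = n_i + r_i substitution under the standard relevance weight, and it scores every term in the title and snippet rather than only terms near the query term. All names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def personalized_weight(N, n_i, R, r_i):
    """Assumed personalized term weight; requires n_i <= N and r_i <= R."""
    return math.log(((r_i + 0.5) * (N - n_i + 0.5)) /
                    ((n_i + 0.5) * (R - r_i + 0.5)))


def rerank(results, N, n, R, r):
    """Sort (title, snippet) results by descending personalized score."""
    def score(result):
        tf = Counter(tokenize(result[0] + " " + result[1]))
        return sum(
            count * personalized_weight(N, n.get(term, 0), R, r.get(term, 0))
            for term, count in tf.items()
        )
    return sorted(results, key=score, reverse=True)
```

Here `N` and `n` would come from the result set itself and `R` and `r` from the user's personal index, matching the corpus and user choices listed above.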

20 Results: Baseline Comparisons

21 Thank you.

