1 Ranking

2 Boolean vs. Non-boolean Queries

Until now, we assumed that satisfaction is a Boolean function of a query:
– it is easy to determine whether a document satisfies a query
– all satisfying documents should be returned
– there is no significance to the order of results
– this is similar to query processing in a database

This model is often inappropriate. We now consider a non-Boolean setting: query results are ranked and returned in ranking order.

3 Relevant and Irrelevant

The user has a task, which they formulate as a query.
– A given document may contain all the query words and yet not be relevant.
– A given document may not contain all the query words and yet be relevant.

Relevance is subjective and can only be determined by the user.

4 Evaluating the Quality of Search Results

Goals:
– Return all relevant documents
– Return no non-relevant documents
– Return relevant documents "earlier"

Suppose that a search engine runs a query Q and returns the result set R. Then each document falls into one of four categories:

                  Relevant                Not Relevant
  Retrieved       relevant & retrieved    non-relevant & retrieved
  Not Retrieved   relevant & missed       non-relevant & not retrieved

5 Quality of Search Results

There are many measures. We focus on three:
– Precision: the percentage of retrieved documents that are relevant
– Recall: the percentage of relevant documents that are retrieved
– Precision at k: the percentage of relevant results within the top k

6 Questions

– How can perfect precision be achieved?
– How can perfect recall be achieved?
– Suppose that there are 1000 documents, 50 documents are relevant to the query, and 30 query results are returned, including 20 relevant documents. What is the precision? The recall? (Worked through in the sketch below.)
– Using these scores, how can search engine quality be automatically assessed?
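A minimal sketch of the three measures (Python; the function names and the use of document-ID sets are illustrative, not from the slides):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

# The slide's numbers: 50 relevant documents, 30 retrieved, 20 of them relevant:
# precision = 20/30 ≈ 0.67, recall = 20/50 = 0.40
```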

7 What is Ranking?

Ranking is the problem of returning query answers in the "best" order.
– Note that "best" is subjective.
– Ranking is NOT based on money!!!

To rank well, we want to know which documents best satisfy a query.
– How can we determine this?

8 Types of Ranking

Query dependent versus query independent:
– advantages/disadvantages of each

Ranking can be based on:
– plain text
– HTML markup
– link analysis

9 Query Dependent Ranking: Ranking of Plain Text

TF-IDF and the Vector Space Model

10 Goal

Given a query Q and a document P, we want to find a rank r that indicates how relevant P is for Q.
– Answers will be returned in order of decreasing r.

To simplify, assume that queries are given as free text (without logical operators) and that a document may be relevant if it contains at least one word of the query.

11 Goal

Intuition:
– give each term t in a document d a weight w_t,d
– give each term t in the query a weight w_t
– the score of a document for a query will be a function of the query-term weights in the document

Two questions:
– How should we define the weight of a term in a document?
– How should we combine the term weights?

12 Weights of Terms in Documents

From now on, t is a term and d is a document. Our goal is to define w_t,d, the weight of t in d.

All our weighting schemes will have in common:
– w_t,d = 0 if t does not appear in d

Simplest definition of weight:
– w_t,d = 1 if t appears in d
– Advantages? Disadvantages?

13 Term Frequency

Intuitively, a document that has many occurrences of a term t is more "about" t. This leads to the weight:
– w_t,d = f_t,d
– f_t,d is the number of times that t appears in d, also called the term frequency of t in d

Advantages? Disadvantages? Are 20 occurrences of t 20 times better than 1 occurrence of t?

14 Normalized Term Frequency

The term frequency can be normalized in many ways. The standard normalization:
– w_t,d = 1 + log2(f_t,d) if t appears in d
– w_t,d = 0 otherwise

Are all terms created equal?
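A minimal sketch of this normalized weight (Python; illustrative only):

```python
import math

def tf_weight(f_td):
    """Normalized term frequency: 1 + log2(f_t,d) if t occurs in d, else 0."""
    return 1 + math.log2(f_td) if f_td > 0 else 0.0

# The log dampens repetition: 20 occurrences score about 5.32,
# not 20 times the score of a single occurrence (1.0).
print(tf_weight(1), tf_weight(20))
```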

15 Are All Words Equal?

The occurrence of a very rare term is usually more interesting than the occurrence of a common word.

Example query: Winnie the Pooh. Suppose that:
– document 1 has 300 occurrences of "the"
– document 2 has 1 occurrence of "Pooh"
– Which do you prefer?

16 Inverse Document Frequency

The document frequency of a term t is:
– f_t = the number of documents containing t

We define a term weight:
– w_t = log2(1 + N/f_t)
– N is the total number of documents
– Again, the log is used for normalization

w_t reflects the inverse document frequency of t.

17 Summary

We will use the following weighting scheme:
– w_t,d = 1 + log2(f_t,d) if t appears in d
– w_t,d = 0 otherwise
– w_t = log2(1 + N/f_t)

This is called TF-IDF ranking, since it takes into consideration both the term frequency and the inverse document frequency. A minimal sketch follows.
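A sketch of the two weights just summarized (Python; the function names are illustrative — the slides combine the two weights via the vector space model introduced next):

```python
import math

def tf_weight(f_td):
    """w_t,d = 1 + log2(f_t,d) if t appears in d, 0 otherwise."""
    return 1 + math.log2(f_td) if f_td > 0 else 0.0

def idf_weight(N, f_t):
    """w_t = log2(1 + N/f_t): the rarer the term, the larger the weight."""
    return math.log2(1 + N / f_t)
```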

18 Other TF-IDF Variants

Many different options have been suggested for computing TF and IDF values. All methods comply with two monotonicity constraints:
– a term that appears in many documents should not be more important than a term that appears in only a few
– a document with many occurrences of a term should not be less important than a document with only a few

19 Combining Weights

Suppose a query has several terms t_1,…,t_k. We have defined weights for terms in the query and in the document. How can/should these be combined? Ideas?

20 The Vector Space Model

We model documents and queries as vectors with n dimensions:
– n is the number of words in our lexicon

If t is the k-th term in the lexicon, then:
– the k-th entry in the vector of a document d is w_t,d
– the k-th entry in the vector of the query is w_t if t appears in the query, and 0 otherwise
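A minimal sketch of the vector construction (Python; assumes a fixed lexicon, a document-frequency table df covering the lexicon, and N documents — all names illustrative):

```python
import math
from collections import Counter

def doc_vector(doc_tokens, lexicon):
    """k-th entry is w_t,d = 1 + log2(f_t,d) for the k-th lexicon term, 0 if absent."""
    counts = Counter(doc_tokens)
    return [1 + math.log2(counts[t]) if counts[t] else 0.0 for t in lexicon]

def query_vector(query_tokens, lexicon, N, df):
    """k-th entry is w_t = log2(1 + N/f_t) if t is in the query, 0 otherwise."""
    terms = set(query_tokens)
    return [math.log2(1 + N / df[t]) if t in terms else 0.0 for t in lexicon]
```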

21 Similarity Between Vectors

The similarity between two vectors is measured by the angle between them. If X and Y are vectors, then the angle θ between them satisfies:

cos θ = (X · Y) / (|X| |Y|)

– X · Y is the inner product of X and Y
– |X| is the length of X, i.e., the square root of X · X

22 Cosine Distance for Ranking

Since cos θ increases when θ decreases, a higher value indicates greater similarity.
– We will compute the cosine of the angle between the query vector and each document vector.
– The greater this cosine, the higher the document will rank.

23 Ranking

Since |Q| is the same in the score of every document, we can remove it from the formula without affecting the relative scores of the documents. That is, we rank by the following score (sketched below):

score(Q, d) = (Q · d) / |d|
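A minimal sketch of this score (Python; continues the vector representation above):

```python
import math

def rank_score(q_vec, d_vec):
    """(Q · d) / |d| — cosine similarity with the constant |Q| dropped."""
    dot = sum(q * d for q, d in zip(q_vec, d_vec))
    norm_d = math.sqrt(sum(d * d for d in d_vec))
    return dot / norm_d if norm_d else 0.0
```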

24 Example

Suppose:
– N = 210
– f_a = 30, f_b = 210, f_c = 70
– so w_a = log2(1 + 210/30) = 3, w_b = 1, w_c = 2
– Doc 1: a a b b d
– Doc 2: b c c c c

Which will be ranked higher for the query "a b c"?
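Working it through (a sketch; it assumes the slide's numbers 30, 210, 70 are document frequencies, which is what the given N = 210 suggests):

```python
import math

N = 210
df = {"a": 30, "b": 210, "c": 70}                     # assumed document frequencies
w = {t: math.log2(1 + N / f) for t, f in df.items()}  # {"a": 3.0, "b": 1.0, "c": 2.0}

docs = {"Doc 1": "a a b b d".split(), "Doc 2": "b c c c c".split()}

for name, tokens in docs.items():
    counts = {t: tokens.count(t) for t in set(tokens)}
    d = {t: 1 + math.log2(c) for t, c in counts.items()}   # w_t,d weights
    dot = sum(w[t] * d.get(t, 0.0) for t in ["a", "b", "c"])
    norm = math.sqrt(sum(v * v for v in d.values()))
    print(name, round(dot / norm, 2))

# Doc 1: (3*2 + 1*2) / 3        ≈ 2.67
# Doc 2: (1*1 + 2*3) / sqrt(10) ≈ 2.21  -> Doc 1 ranks higher
```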

25 Markup-Based Ranking

Give a higher score to words appearing in "important" HTML elements, e.g.:
– title
– h1, h2
– bold or italics
– links

How can this be implemented using the vector space model?

26 Using Anchor Text

Consider a page P1 that points to another page P2. The link from P1 to P2 has anchor text. Most search engines use this text to understand the content of page P2.
– Pages often do not contain words describing their own content:
– IBM does not have "computer" on its homepage
– Google does not have "search engine" on its homepage

27 Using Anchor Text (2)

Using the current model, how can anchor text help with ranking? What problems can arise? Example?

28 Query Independent Ranking: Link Analysis

29 Intuition

Pages are given an a priori ranking of importance.
– This ranking is irrespective of any query.

Return pages satisfying the query, in ranking order.

Question: Is it possible for D_1 to score higher than D_2 on one query and lower on another?

30 A Naive Approach to Link-Based Ranking

We can represent the Web as a graph:
– pages are nodes
– there is an edge from P_1 to P_2 if P_1 links to P_2

Intuitively, a link from P_1 to P_2 is a vote of confidence by P_1 for P_2.
– Directed popularity: the grade of P is the number of incoming links to P
– Undirected popularity: the grade of P is the sum of the incoming and outgoing links of P

What problems can you find with these ranking schemes (at least 2 problems)?

31 Random Surfer

Imagine a surfer doing a random walk on web pages:
– start at a random page
– at each step, leave the current page along one of its links, chosen with equal probability

The long-run probability of being at a page can be used for ranking.

32 Transition Matrix

Let out_k be the out-degree of node k. Model the surfer's walk using a transition matrix M:
– m_ik = 1/out_k if k links to i
– m_ik = 0 otherwise

[Figure: an example three-page graph with pages Yahoo (1), Amazon (2), and Microsoft (3); the edge structure is not preserved.]
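A minimal sketch of the construction (Python; the 3-page link structure used here is hypothetical, since the slide's figure is not preserved):

```python
def transition_matrix(links, n):
    """M[i][k] = 1/out_k if page k links to page i, else 0.
    `links` maps each page to the list of pages it links to;
    dead ends (no outgoing links) are simply left as all-zero columns here."""
    M = [[0.0] * n for _ in range(n)]
    for k, targets in links.items():
        for i in targets:
            M[i][k] = 1.0 / len(targets)
    return M

# Hypothetical example: page 0 links to pages 0 and 1,
# page 1 links to pages 0 and 2, page 2 links to page 1.
M = transition_matrix({0: [0, 1], 1: [0, 2], 2: [1]}, 3)
```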

33 Dead-Ends and Spider Traps

The Web is full of dead ends and spider traps.
– A random walk can get stuck in these.
– Then it makes no sense to talk about long-term visit rates.

[Figure: an example graph over Pages 1-6; the edge structure is not preserved.] Which pages are dead ends? Which are spider traps?

34 Solution: Teleporting

At any moment in time:
– with probability d, go out on a random link
– with probability 1-d, jump to a random web page (it may be the same page!)
– d is a parameter called the damping factor, often taken to be 0.85

Teleporting ensures that the transitions form an ergodic Markov chain:
– there is a long-term probability of being at any given state

35 PageRank (PR)

The PageRank formula:

PR(P) = (1-d)/N + d * (PR(P_1)/O_1 + ... + PR(P_n)/O_n)

– P_1,…,P_n are the pages with links to P
– O_i is the number of outgoing links of page P_i
– N is the total number of pages

NOTE: the formula has also been presented with 1-d in place of (1-d)/N.

36 Matrix Formulation of PageRank

Let r be the vector of PageRanks — e.g., r = (y, a, m) for the three-page example with Yahoo (1), Amazon (2), and Microsoft (3) — and let M be the transition matrix. PageRank is then the solution r of:

r = (1-d)/N · 1 + d · M · r

(d is a chosen constant; 1 is the all-ones vector.)

[Figure: the same three-page graph: Yahoo (1), Amazon (2), Microsoft (3).]

37 Computing PR

One can compute PageRank using standard methods from linear algebra, e.g., Gaussian elimination. But the Web has billions of pages:
– we can't solve equations in so many variables

Power method: choose a starting value for each of the PR values (e.g., 1/N), then iteratively recompute all PR values using the formulas.
– This process is guaranteed to converge. A sketch follows.
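A minimal sketch of the power method with teleporting (Python; it assumes no dead ends and uses a fixed iteration count instead of a convergence test):

```python
def pagerank(links, n, d=0.85, iters=50):
    """Iterate PR(P) = (1-d)/N + d * sum(PR(P_i)/O_i) over the pages P_i
    linking to P, starting from the uniform vector (1/N, ..., 1/N)."""
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for k, targets in links.items():
            share = d * pr[k] / len(targets)   # each out-link gets an equal share
            for i in targets:
                new[i] += share
        pr = new
    return pr

# Same hypothetical 3-page web as in the transition-matrix sketch:
print(pagerank({0: [0, 1], 1: [0, 2], 2: [1]}, 3))
```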

38 Very Small Example

Damping factor: 0.8

[Figure: a small example graph; of its node labels, only Amazon (2) and Microsoft (3) are preserved.]

39 Intuitive Examples

Which page will have the highest PageRank? In which of the following graphs will E have a higher PR?

[Figure: small example graphs over pages A, B, C, D (and E, F); the edge structure is not preserved.]

40 Careful with Your References!

Consider the following three URLs:
– http://iew3.technion.ac.il/~mwd
– http://iew3.technion.ac.il/~mwd/
– http://iew3.technion.ac.il/~mwd/index.html

All three URLs point to the same page. However, Google does not know this (why not?).
– If the page is pointed to by different names, its PR will be spread among several "pages".
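Search engines therefore canonicalize URLs before counting links. A minimal sketch of one such normalization (Python; the rules shown cover only the three spellings above):

```python
def canonical_url(url):
    """Map the three spellings of the same page to one canonical form."""
    if url.endswith("/index.html"):
        url = url[: -len("index.html")]   # drop a default document name
    return url.rstrip("/")                # drop a trailing slash

# All three of the slide's URLs map to http://iew3.technion.ac.il/~mwd
```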

41 Topic-Specific PageRank

Suppose that we are building a topic-specific search engine, e.g., one that should help find sports information. Consider the query "batter":
– what pages should get a higher preference?
– what will PageRank give?

42 Simple Solution

The set of pages from which the surfer chooses when jumping to a random page is called the teleport set.
– So far, the teleport set has been the entire Web.

For topic-specific PageRank, choose as the teleport set a set of pages known to be related to the topic of interest.
– What effect will this have? (A sketch follows.)
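A minimal sketch of the change (Python; only the teleport step differs from the ordinary power method sketched earlier):

```python
def topic_pagerank(links, n, teleport_set, d=0.85, iters=50):
    """As before, but random jumps land only on pages in `teleport_set`,
    so the long-run probability mass concentrates around on-topic pages."""
    pr = [1.0 / n] * n
    s = len(teleport_set)
    for _ in range(iters):
        new = [(1 - d) / s if i in teleport_set else 0.0 for i in range(n)]
        for k, targets in links.items():
            share = d * pr[k] / len(targets)
            for i in targets:
                new[i] += share
        pr = new
    return pr
```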

43 Spam

Wikipedia: "Spamdexing (also known as search spam or search engine spam) involves a number of methods, such as repeating unrelated phrases, to manipulate the relevancy or prominence of resources indexed by a search engine, in a manner inconsistent with the purpose of the indexing system."

How can one spamdex, assuming that TF-IDF is used? If PageRank is used?

44 The Google Dance

The nickname given to the periods during which Google updates its index and its different servers return inconsistent results.

The Florida Dance (Nov. 16, 2003): huge changes in rankings over a period of a few months.
– Most likely theories: Google incorporated one of Topic-Sensitive PageRank, TrustRank, or Hilltop.

45 Homework

Read the paper "Hilltop: A Search Engine based on Expert Documents" by Krishna Bharat and George A. Mihaila.
– What are the main problems with PageRank that Hilltop tries to solve? How does it solve them?
– Describe the basic ideas behind this ranking mechanism.
– What are the main disadvantages of Hilltop, and how can they be overcome in practice?

46 Other Types of Ranking

47 Click-Through Ranking: DirectHit

Click popularity: the number of clicks received by each site on a search engine's results page.

Example:
– 20 users search for "wizard of oz"
– if, after scanning the first 10 results, all the users click on the IMDb site, then this site is considered more relevant than the others
– the next time someone searches for "wizard of oz", the IMDb site will have a higher ranking in the results

Is this query dependent or query independent?

48 Click-Through Ranking: DirectHit

Stickiness: the amount of time a user spends at a site.

Example:
– someone searches for "wizard of oz"
– they click on the first result and wander off for 20 minutes
– then they come back to the results page and click on the next site
– after 20 seconds they click on a third
– the first site gets a higher ranking than the second

Is it possible to "spam" this method, or the one on the previous slide?

