
Slide 1: Ranked Retrieval
INST 734, Module 3
Doug Oard

Slide 2: Agenda
- Ranked retrieval
→ Similarity-based ranking
- Probability-based ranking

Slide 3: What's a Model?
- Model: a simplification that describes something complex; a particular way of “looking at things”
- Computational model: a simplification of reality that facilitates computation

Slide 4: Similarity-Based Queries
- Treat the query as if it were a document: create a query bag-of-words
- Find the similarity of each document, using the coordination measure, for example
- Rank order the documents by similarity, most similar to the query first
- Surprisingly, this works pretty well, especially for very short queries
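A minimal Python sketch of this pipeline, using the coordination measure (the number of distinct query terms a document contains) over a made-up three-document corpus; the corpus and names are illustrative, not from the slides.

    docs = {
        1: "the rabbit ate the carrot",
        2: "a carrot is a root vegetable",
        3: "the stock market fell today",
    }

    def coordination(query, doc):
        # Number of distinct query terms that appear in the document.
        return len(set(query.split()) & set(doc.split()))

    def rank(query, docs):
        # Most similar to the query first.
        return sorted(docs, key=lambda d: coordination(query, docs[d]), reverse=True)

    print(rank("rabbit carrot", docs))  # [1, 2, 3]: document 1 matches both terms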

Slide 5: Counting Terms
- Terms tell us about documents: if “rabbit” appears a lot, it may be about rabbits
- Documents tell us about terms: “the” is in every document, so it is not discriminating
- Documents are most likely described well by rare terms that occur in them frequently:
  - Higher “term frequency” is stronger evidence
  - Low “document frequency” makes it stronger still
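A small sketch, in the same hypothetical setup, of computing the two statistics this slide contrasts: term frequency (TF) per document and document frequency (DF) across the collection.

    from collections import Counter

    docs = {1: "the rabbit ate the carrot",
            2: "a carrot is a root vegetable",
            3: "the stock market fell today"}

    tf = {d: Counter(text.split()) for d, text in docs.items()}  # term frequency
    df = Counter()                                               # document frequency
    for counts in tf.values():
        df.update(counts.keys())  # each document counts at most once per term

    # df["the"] == 2 here: common terms are not discriminating, while a rare
    # term used often in one document (high TF, low DF) describes it well.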

Slide 6: A Partial Solution: TF*IDF
- High TF is evidence of meaning
- Low DF is evidence of term importance (equivalently, high “IDF”)
- Multiply them to get a “term weight”
- Add up the weights for each query term
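A sketch of this scoring rule, assuming the simple formulation used in the example on the next slide: IDF = log10(N/DF), weight = TF * IDF, and a document's score is the sum of its weights over the query terms.

    import math
    from collections import Counter

    def tfidf_score(query, doc_text, all_doc_texts):
        n = len(all_doc_texts)
        tf = Counter(doc_text.split())
        score = 0.0
        for term in set(query.split()):
            df = sum(1 for t in all_doc_texts if term in t.split())
            if df == 0:
                continue                  # term occurs in no document
            idf = math.log10(n / df)      # low DF -> high IDF
            score += tf[term] * idf       # TF * IDF term weight
        return score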

Slide 7: TF*IDF Example
[Worked table, garbled in this transcript: TF counts and TF*IDF weights for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across documents 1-4, with base-10 IDF values such as 0.602 = log10(4/1) and 0.301 = log10(4/2).]
Query: contaminated retrieval. Result: 2, 3, 1, 4

Slide 8: The Document Length Effect
- Document lengths vary in many collections
- Long documents have an unfair advantage:
  - They use a lot of terms, so they get more matches than short documents
  - They use the same terms repeatedly, so they have much higher term frequencies
- Two strategies:
  - Adjust term frequencies for document length
  - Divide the documents into equal “passages”

Slide 9: Passage Retrieval
- Break long documents up somehow:
  - On chapter or section boundaries
  - On topic boundaries (“text tiling”)
  - Overlapping 300-word passages (“sliding window”)
- Use the best passage’s rank as the document’s rank
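A sketch of the sliding-window variant. The 150-word step (50% overlap) is an assumption; the slide specifies only overlapping 300-word passages.

    def passages(text, size=300, step=150):
        # Overlapping fixed-length word windows.
        words = text.split()
        if len(words) <= size:
            return [text]
        return [" ".join(words[i:i + size])
                for i in range(0, len(words) - size + step, step)]

    def best_passage_score(text, score_fn):
        # The document is ranked by its best passage.
        return max(score_fn(p) for p in passages(text))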

Slide 10: “Cosine” Normalization
- Compute the length of each document vector:
  - Multiply each weight by itself
  - Add all the resulting values
  - Take the square root of that sum
- Divide each weight by that length
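A direct sketch of these steps (the Euclidean norm), applied to a document's term-weight dictionary; the weights shown are illustrative.

    import math

    def cosine_normalize(weights):
        # Square each weight, sum, take the square root: the vector length.
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length == 0.0:
            return dict(weights)  # empty document: nothing to scale
        return {term: w / length for term, w in weights.items()}

    vec = cosine_normalize({"contaminated": 0.13, "siberia": 0.90, "fallout": 0.63})
    # sum(w * w for w in vec.values()) is now 1.0, up to rounding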

Slide 11: Cosine Normalization Example
[Worked table, garbled in this transcript: the TF*IDF weights from slide 7 divided by each document’s vector length (1.70, 0.97, 2.67, and 0.87 for documents 1-4).]
Query: contaminated retrieval. Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4 without normalization)

Slide 12: Why Call It “Cosine”?
[Figure: two document vectors d1 and d2 with the angle θ between them.]

Slide 13: Formally …

\[ \mathrm{sim}(\vec{d_j}, \vec{q}) \;=\; \frac{\vec{d_j} \cdot \vec{q}}{\lvert \vec{d_j} \rvert \, \lvert \vec{q} \rvert} \;=\; \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}} \]

where \(\vec{d_j}\) is the document vector, \(\vec{q}\) is the query vector, the numerator is their inner product, and the denominator is the length normalization.
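The same formula as a few lines of code, using numpy for the norms; a minimal sketch, not tied to any particular weighting scheme.

    import numpy as np

    def cosine(d, q):
        # Inner product divided by the product of the two vector lengths.
        d, q = np.asarray(d, dtype=float), np.asarray(q, dtype=float)
        return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))

    print(cosine([1, 2, 0], [2, 4, 0]))  # 1.0: same direction, same word-use pattern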

Slide 14: Interpreting the Cosine Measure
- Think of the query and the document as vectors:
  - Query normalization does not change the ranking
  - The square root does not change the ranking
- Similarity is the angle between the two vectors:
  - Small angle = very similar
  - Large angle = little similarity
- Passes some key sanity checks:
  - Depends on the pattern of word use but not on length
  - Every document is most similar to itself

Slide 15: “Okapi BM-25” Term Weights

\[ w_{i,j} \;=\; \underbrace{\frac{(k_1 + 1)\, \mathrm{tf}_{i,j}}{k_1 \left( (1 - b) + b \, \frac{\mathrm{dl}_j}{\mathrm{avdl}} \right) + \mathrm{tf}_{i,j}}}_{\text{TF component}} \times \underbrace{\log \frac{N - \mathrm{df}_i + 0.5}{\mathrm{df}_i + 0.5}}_{\text{IDF component}} \]

where \(\mathrm{tf}_{i,j}\) is the term frequency, \(\mathrm{df}_i\) the document frequency, \(\mathrm{dl}_j\) the document length, \(\mathrm{avdl}\) the average document length, \(N\) the collection size, and \(k_1\), \(b\) are tuning parameters.
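The weight as a function, with the common parameter defaults k1 = 1.2 and b = 0.75 (the slide does not fix these values; they are conventional choices).

    import math

    def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
        # TF component: saturates as tf grows, damped for long documents.
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
        # IDF component: rarer terms (lower df) get higher weight.
        idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
        return tf_part * idf_part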

Slide 16: Summary
- Goal: find the documents most similar to the query
- Compute normalized document term weights: some combination of TF, DF, and length
- Sum the weights for each query term: in linear algebra, this is an “inner product” operation
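A closing sketch of that inner-product view: with documents as rows of a term-weight matrix and the query as a weight vector over the same terms, one matrix-vector product scores the whole collection. All numbers are illustrative.

    import numpy as np

    W = np.array([[0.50, 0.00, 0.63],   # document 1: weights for 3 terms
                  [0.00, 0.90, 0.13],   # document 2
                  [0.60, 0.75, 0.00]])  # document 3
    q = np.array([0.0, 1.0, 1.0])       # query term weights

    scores = W @ q                      # one inner product per document
    ranking = np.argsort(-scores) + 1   # document ids, best first
    print(ranking)                      # [2 3 1]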

Slide 17: Agenda
- Ranked retrieval
- Similarity-based ranking
→ Probability-based ranking

