Comparing and Ranking Documents


1 Comparing and Ranking Documents
Once our search engine has retrieved a set of documents, we may want to:
Rank them by relevance
–Which are the best fit to my query?
–This involves determining what the query is about and how well the document answers it.
Compare them
–Show me more like this.
–This involves determining what the document is about.

2 Determining Relevance by Keyword
The typical web query consists entirely of keywords.
Retrieval can be binary: a keyword is either present or absent.
A more sophisticated approach is to look for degree of relatedness: how much does this document reflect what the query is about?
Simple strategies:
–How many times does the word occur in the document?
–How close is it to the head of the document?
–If there are multiple keywords, how close together are they?

3 Keywords for Relevance Ranking
Count: repetition is an indication of emphasis
–Very fast (usually in the index)
–Reasonable heuristic
–Unduly influenced by document length
–Can be "stuffed" by web designers
Position: lead paragraphs summarize content
–Requires more computation
–Also a reasonable heuristic
–Less influenced by document length
–Harder to "stuff"; can only have a few keywords near the beginning
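
The count and position heuristics can be sketched in a few lines of Python. This is only an illustration: the function names, the simple regular-expression tokenizer, and the particular position formula (1.0 for the first word, falling toward 0.0 later in the text) are assumptions, not something the slides specify.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_score(keyword, text):
    """Count heuristic: number of times the keyword occurs in the text."""
    return tokenize(text).count(keyword.lower())

def position_score(keyword, text):
    """Position heuristic (assumed formula): 1.0 if the keyword is the
    first token, lower the later it first appears, 0.0 if it is absent."""
    tokens = tokenize(text)
    key = keyword.lower()
    if key not in tokens:
        return 0.0
    first = tokens.index(key)
    return 1.0 - first / len(tokens)
```

Note that count_score grows with document length, while position_score is normalized by it, which mirrors the "unduly influenced" versus "less influenced" contrast above.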

4 Keywords for Relevance Ranking
Proximity for multiple keywords
–Requires even more computation
–Obviously relevant only if the query has multiple keywords
–Effectiveness of the heuristic varies with information need; typically either excellent or not very helpful at all
–Very hard to "stuff"
All keyword methods
–Are computationally simple and adequately fast
–Are effective heuristics
–Typically perform as well as in-depth natural language methods for standard search
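
One possible way to measure proximity is the smallest gap, in words, between any occurrence of one keyword and any occurrence of another. The sketch below assumes that definition and a simple regular-expression tokenizer; the name min_distance is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def min_distance(keyword_a, keyword_b, text):
    """Smallest number of token positions separating the two keywords,
    or None if either keyword does not occur in the text."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == keyword_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == keyword_b.lower()]
    if not pos_a or not pos_b:
        return None
    # Compare every pair of positions; fine for short texts.
    return min(abs(i - j) for i in pos_a for j in pos_b)
```

A smaller distance would be read as a stronger sign that the keywords are being discussed together. The pairwise comparison also hints at why this heuristic costs more than a simple count.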

5 Comparing Documents
"Find me more like this one" really means that we are using the document as a query.
This requires that we have some conception of what a document is about overall.
Depends on the context of the query.
We need to
–Characterize the entire content of this document
–Discriminate between this document and others in the corpus

6 Comparing Documents cont
Two very general approaches:
–statistical
–semantic
We will discuss semantic approaches more in text mining.
The statistical approach still focuses on keywords:
–To what extent does each term characterize this document?
–To what extent does each term discriminate this document from other documents?

7 Characterizing a Document: Term Frequency
A document can be treated as a sequence of words.
Each word characterizes that document to some extent.
Once we have eliminated stop words, the most frequent words tend to be what the document is about.
Therefore: f_kd (# of occurrences of word k in document d) will be an important measure.
Also called the term frequency.
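
A minimal sketch of computing term frequencies after stop-word removal follows. The stop-word list here is a tiny made-up sample for illustration, and the tokenizer is an assumption; real systems use much longer lists.

```python
import re
from collections import Counter

# Illustrative stop-word list only; not taken from the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def term_frequencies(text):
    """Return a Counter mapping each non-stop word k to f_kd,
    its number of occurrences in this document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```

Calling term_frequencies(text).most_common(5) would then give a rough first guess at what the document is about.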

8 Characterizing a Document: Document Frequency
What makes this document distinct from others in the corpus?
The terms which discriminate best are not those which occur with high frequency across the corpus!
Therefore: D_k (# of documents in which word k occurs) will also be an important measure.
Also called the document frequency.
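
Document frequency can be sketched the same way: each document contributes at most one count per word, however often the word repeats inside it. The function name and tokenizer below are assumptions for illustration.

```python
import re
from collections import Counter

def document_frequencies(corpus):
    """Return a Counter mapping each word k to D_k, the number of
    documents in the corpus (a list of strings) that contain it."""
    df = Counter()
    for text in corpus:
        # A set ensures a word is counted once per document.
        df.update(set(re.findall(r"[a-z]+", text.lower())))
    return df
```

A word with D_k close to the corpus size appears almost everywhere and so does little to tell documents apart.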

9 TF*IDF
This can all be summarized as:
–Words are the best discriminators when they
 occur often in this document (term frequency), and
 don't occur in a lot of documents (document frequency)
One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency.
There are multiple formulas for actually computing this; the book gives Robertson and Jones. The underlying concept is the same in all of them.
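
As a sketch of the concept, the code below uses one common textbook variant, weight = f_kd * log(N / D_k), where N is the number of documents in the corpus. This is an assumed formula chosen for simplicity; it is not necessarily the Robertson and Jones formula the book gives.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(corpus):
    """Return one {term: weight} dict per document in the corpus,
    using the assumed variant weight = f_kd * log(N / D_k)."""
    docs = [Counter(tokenize(text)) for text in corpus]  # f_kd per document
    n = len(docs)                                        # N
    df = Counter()                                       # D_k
    for tf in docs:
        df.update(tf.keys())
    return [
        {term: f * math.log(n / df[term]) for term, f in tf.items()}
        for tf in docs
    ]
```

With this variant, a word that occurs in every document gets weight 0 no matter how often it is repeated, while a word that is frequent here and rare elsewhere gets a high weight.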

10 Describing an Entire Document
So what is a document about?
TF*IDF: can simply list keywords in order of their TF*IDF values.
The document is about all of them to some degree: it is at some point in some vector space of meaning.
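
Listing keywords in order of their weights is a simple sort. In the sketch below the weights are made-up numbers standing in for the TF*IDF values of a single document, and top_keywords is a hypothetical helper name.

```python
def top_keywords(weights, k=3):
    """Return the k terms with the highest weights, heaviest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]

# Hypothetical weights for one document (illustrative values only).
example_weights = {"retrieval": 2.4, "ranking": 1.9, "document": 0.3, "query": 1.1}
```

Here top_keywords(example_weights, 2) would describe the document as being mostly about "retrieval" and "ranking", and only slightly about "document".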

11 Vector Space
Any corpus has a defined set of terms (the index).
These terms define a knowledge space.
Every document is somewhere in that knowledge space -- it is or is not about each of those terms.
Consider each term as a vector, that is, as one axis of the space. Then
–We have an n-dimensional vector space
–Where n is the number of terms (very large!)
–Each document is a point in that vector space
The document's position in this vector space can be treated as what the document is about.
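
To make the idea concrete, the sketch below places each document in a space with one dimension per term. It uses raw counts as coordinates for simplicity, which is an assumption; TF*IDF weights could be used instead. The function name build_vectors is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vectors(corpus):
    """Return (vocabulary, vectors): the sorted list of all terms in the
    corpus, and one count vector per document over that vocabulary."""
    token_lists = [tokenize(text) for text in corpus]
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = [[toks.count(term) for term in vocabulary] for toks in token_lists]
    return vocabulary, vectors
```

Every vector has n entries, one per term in the vocabulary, and most of them are zero for any single document, which is why real systems store these vectors sparsely.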

12 Similarity Between Documents
How similar are two documents?
–Measures of association
 How much do the feature sets overlap?
 DICE coefficient: size of the intersection compared to the total # of terms in the two documents (overlap modified for length)
 Simple Matching coefficient: also takes exclusions into account (terms absent from both documents)
–Cosine similarity
 Similarity of the angle of the two document vectors
 Not sensitive to vector length
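
The three measures can be sketched as follows, using their standard definitions: Dice as 2|A∩B| / (|A| + |B|) over term sets, simple matching as the share of vocabulary terms on which the two documents agree (present in both or absent from both), and cosine as the normalized dot product of the document vectors. The function names and the choice to return 0.0 for empty inputs are assumptions.

```python
import math

def dice(terms_a, terms_b):
    """Dice coefficient between two collections of terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def simple_matching(terms_a, terms_b, vocabulary):
    """Fraction of vocabulary terms that are in both documents
    or in neither document."""
    a, b, vocab = set(terms_a), set(terms_b), set(vocabulary)
    if not vocab:
        return 0.0
    agree = sum((t in a) == (t in b) for t in vocab)
    return agree / len(vocab)

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Because cosine divides by both vector lengths, a document and a copy of it repeated twice point in the same direction and score as identical, which is the sense in which the measure is not sensitive to vector length.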

13 Bag of Words
All of these techniques are what is known as bag of words approaches.
Keywords are treated in isolation.
The difference between "man bites dog" and "dog bites man" is non-existent.
In text mining we will discuss linguistic approaches which pay attention to semantics.
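
The loss of word order is easy to demonstrate. The sketch below represents a text as a multiset of its words; the whitespace tokenizer and the name bag_of_words are assumptions for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as a multiset of lowercase words, ignoring order."""
    return Counter(text.lower().split())

# Both sentences contain exactly the same words, so their bags are equal.
same = bag_of_words("man bites dog") == bag_of_words("dog bites man")
```

Here same is True: any measure computed from these bags, including all of the ones above, cannot tell the two sentences apart.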

