CS 430: Information Discovery


1 CS 430: Information Discovery
Lecture 4 Ranking

2 Course Administration
• The slides for Lecture 3 have been reposted with slightly revised notation.
• The reading for Discussion Class 2 requires a computer connected to the network with a Cornell IP address.
• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.
• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.

3 Choice of Weights
query q: ant dog

document   text                          terms
d1         ant ant bee                   ant bee
d2         dog bee dog hog dog ant dog   ant bee dog hog
d3         cat gnu dog eel fox           cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
q     ?              ?
d1    ?    ?
d2    ?    ?         ?                   ?
d3              ?    ?    ?    ?    ?

What weights lead to the best information retrieval?

4 Methods for Selecting Weights
Empirical: Test a large number of possible weighting schemes with actual data. (This lecture, based on work of Salton, et al.)
Model based: Develop a mathematical model of word distribution and derive the weighting scheme theoretically. (Probabilistic model of information retrieval.)

5 Weighting 1. Term Frequency
Suppose term j appears fij times in document i. What weighting should be given to a term j? Term Frequency: Concept A term that appears many times within a document is likely to be more important than a term that appears only once.

6 Term Frequency: Free-text Document
Length of document
The simple method (as illustrated in Lecture 3) is to use fij as the term frequency, but in free-text documents terms are likely to appear more often in long documents. Therefore fij should be scaled by some variable related to document length.

7 Term Frequency: Free-text Document
Standard method for free-text documents
Scale fij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.
Let mi be the maximum frequency of any term in document i:
$$m_i = \max_j f_{ij}$$
Term frequency (tf):
$$\mathrm{tf}_{ij} = f_{ij} / m_i \quad \text{when } f_{ij} > 0$$
Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
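The scaled term frequency can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequencies` and the whitespace tokenization are assumptions, and the input is the text of document d2 from the "Choice of Weights" slide.

```python
from collections import Counter


def term_frequencies(text):
    """Return {term: f_ij / m_i} for one document.

    Assumes terms are separated by whitespace; no stop-word removal
    or stemming is applied in this sketch.
    """
    counts = Counter(text.split())      # f_ij for every term j in the document
    if not counts:
        return {}
    m = max(counts.values())            # m_i: highest frequency of any term
    return {term: f / m for term, f in counts.items()}


tf_d2 = term_frequencies("dog bee dog hog dog ant dog")
print(tf_d2)
```

Here "dog" is the most frequent term, so it receives tf = 1, and every other term is scaled relative to it.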

8 Weighting 2. Inverse Document Frequency
Suppose term j appears fij times in document i. What weighting should be given to a term j? Inverse Document Frequency: Concept A term that occurs in only a few documents is likely to be a better discriminator than a term that appears in most or all documents.

9 Inverse Document Frequency
Suppose there are n documents and that the number of documents in which term j occurs is nj.
Simple method: A possible method might be to use n/nj as the inverse document frequency.
Standard method: The simple method over-emphasizes small differences. Therefore use a logarithm.
Inverse document frequency (idf):
$$\mathrm{idf}_j = \log_2 (n / n_j) \quad \text{when } n_j > 0$$
Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.
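To see how the logarithm damps the simple ratio, the short sketch below compares n/nj with log2(n/nj). The function name `idf` and the sample values of nj are hypothetical choices for illustration, not data from the lecture.

```python
import math


def idf(n, n_j):
    """Inverse document frequency log2(n / n_j), defined only for n_j > 0."""
    if n_j <= 0:
        raise ValueError("idf is only defined for terms that occur in some document")
    return math.log2(n / n_j)


n = 1000                                  # collection size (assumed for the example)
for n_j in (1, 10, 100, 500, 1000):       # hypothetical document frequencies
    print(n_j, n / n_j, round(idf(n, n_j), 2))
```

A term that occurs in every document gets idf = 0, while the ratio n/nj for rare terms grows far faster than its logarithm.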

10 Example of Inverse Document Frequency
n = 1,000 documents
[Table with columns term j, nj, idfj for four example terms A, B, C, D; the values did not survive in this transcript.]
From: Salton and McGill

11 Full Weighting: Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
(weight of term j in document i) = (term frequency) * (inverse document frequency)
The standard tf.idf weighting scheme, for free-text documents, is:
$$t_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_j = (f_{ij} / m_i) \cdot (\log_2 (n / n_j) + 1) \quad \text{when } n_j > 0$$
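As a rough sketch, the tf.idf formula on this slide (including its "+ 1" term) can be applied to the three documents d1, d2, d3 from the "Choice of Weights" slide. The function name `tfidf_weights` and the whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter

# Document texts taken from the "Choice of Weights" slide.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}


def tfidf_weights(docs):
    """Return {doc: {term: (f_ij / m_i) * (log2(n / n_j) + 1)}}.

    Assumes every document contains at least one term.
    """
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n = len(counts)                                   # number of documents
    doc_freq = Counter()                              # n_j for each term j
    for c in counts.values():
        doc_freq.update(c.keys())
    weights = {}
    for name, c in counts.items():
        m = max(c.values())                           # m_i for this document
        weights[name] = {
            term: (f / m) * (math.log2(n / doc_freq[term]) + 1)
            for term, f in c.items()
        }
    return weights


weights = tfidf_weights(docs)
for name, vec in weights.items():
    print(name, {t: round(w, 3) for t, w in sorted(vec.items())})
```

With only three documents the idf values are close together; the effect of rare terms such as "hog" is more visible in a larger collection.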

12 Structured Text
Structured texts, e.g., queries or catalog records, have a different distribution of terms from free text. A modified expression for the term frequency is:
$$\mathrm{tf}_{ij} = K + (1 - K) \cdot f_{ij} / m_i \quad \text{when } f_{ij} > 0$$
K is a parameter between 0 and 1 that can be tuned for a specific collection.
Query: To weight terms in the query, Salton and Buckley recommend K equal to 0.5.
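The modified term frequency is simple to compute. In the sketch below the function name `structured_tf` is hypothetical, and returning 0 for a term that does not occur is an assumption, since the slide defines the expression only for fij > 0.

```python
def structured_tf(f, m, k=0.5):
    """Modified term frequency K + (1 - K) * f / m.

    f: frequency of the term in the text (f_ij)
    m: maximum frequency of any term in the text (m_i)
    k: tuning parameter K, between 0 and 1 (0.5 is the value quoted for queries)
    """
    if not 0 <= k <= 1:
        raise ValueError("K must lie between 0 and 1")
    if f <= 0:
        return 0.0          # assumption: absent terms get zero weight
    return k + (1 - k) * f / m


# A term that occurs once in a text whose most frequent term occurs 4 times.
print(structured_tf(1, 4))          # K = 0.5
print(structured_tf(1, 4, k=0.0))   # K = 0 reduces to the free-text form f / m
```

With K = 0.5 every term that occurs at all receives a tf of at least 0.5, which suits short texts such as queries where most terms occur only once.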

13 Similarity
The similarity between query q and document i is given by:
$$\cos(d_q, d_i) = \frac{\sum_{k=1}^{n} t_{qk} \, t_{ik}}{|d_q| \, |d_i|}$$
where dq and di are the corresponding weighted term vectors, with components in the k dimension (corresponding to term k) given by:
$$t_{qk} = (0.5 + 0.5 \cdot f_{qk} / m_q) \cdot (\log_2 (n / n_k) + 1) \quad \text{when } f_{qk} > 0$$
$$t_{ik} = (f_{ik} / m_i) \cdot (\log_2 (n / n_k) + 1) \quad \text{when } f_{ik} > 0$$
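The ranking step can be sketched end to end on the query "ant dog" and documents d1, d2, d3 from the "Choice of Weights" slide. This is an illustration under stated assumptions: the helper names are hypothetical, the query uses K = 0.5, and query terms that occur in no document are simply dropped.

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
query = "ant dog"


def idf_plus_one(n, n_k):
    """The factor log2(n / n_k) + 1 used in both weight formulas."""
    return math.log2(n / n_k) + 1


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


counts = {name: Counter(text.split()) for name, text in docs.items()}
n = len(counts)
doc_freq = Counter(t for c in counts.values() for t in c)   # n_k per term

# Document vectors: t_ik = (f_ik / m_i) * (log2(n / n_k) + 1)
doc_vecs = {
    name: {t: (f / max(c.values())) * idf_plus_one(n, doc_freq[t])
           for t, f in c.items()}
    for name, c in counts.items()
}

# Query vector: t_qk = (0.5 + 0.5 * f_qk / m_q) * (log2(n / n_k) + 1)
q_counts = Counter(query.split())
m_q = max(q_counts.values())
q_vec = {t: (0.5 + 0.5 * f / m_q) * idf_plus_one(n, doc_freq[t])
         for t, f in q_counts.items() if doc_freq[t] > 0}

scores = {name: cosine(q_vec, vec) for name, vec in doc_vecs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
```

Under these assumptions d2 ranks first because it contains both query terms, d1 second, and d3 last because its single matching term is outweighed by four rarer terms that inflate its vector length.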

14 Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor

15 Boolean Diagram
[Venn diagram of two overlapping sets A and B, with regions labeled A and B, A or B, and not (A or B).]

16 Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other, as in the string "abacus actor".
abacus near 4 actor
Terms abacus and actor are near to each other, as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

17 Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near   high
and, not
or          low
Example: A and B or C and B is evaluated as (A and B) or (C and B)
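The effect of precedence can be illustrated with Python sets of document numbers, where `&` plays the role of and, `|` the role of or, and set difference from the whole collection the role of not. Python happens to bind `&` more tightly than `|`, which mirrors the and-before-or rule above; this is an analogy, not a description of how a retrieval system parses queries. The sets and the collection size are hypothetical.

```python
# Hypothetical collection of 40 documents, numbered 1..40.
all_docs = set(range(1, 41))

# Hypothetical sets of documents that contain terms A, B and C.
A = {3, 19, 22}
B = {2, 19, 29}
C = {5, 19, 29}

implicit = A & B | C & B          # relies on & binding tighter than |
explicit = (A & B) | (C & B)      # the grouping stated on the slide
other = A & (B | C) & B           # a different grouping, for contrast

not_b = all_docs - B              # "not B": every document without term B

print(sorted(implicit), sorted(explicit), sorted(other))
print(len(not_b))
```

The first two expressions select the same documents, while the third grouping selects a different set, which is why the precedence must be fixed in advance.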

18 Inverted File
Inverted file: A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?" In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

19 Inverted File -- Basic Concept
Word     Document
abacus   3, 19, 22
actor    2, 19, 29
aspen    5
atoll    11, 34
Stop words are removed and stemming is carried out before building the index.
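A minimal sketch of building such a word-to-documents table is shown below. The document texts and the stop-word list are hypothetical, chosen so that the document numbers match the postings used in the Boolean query example (abacus in 3, 19, 22 and actor in 2, 19, 29); stemming is omitted.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "has", "of"}


def build_inverted_file(docs):
    """Map each indexed word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:        # drop stop words before indexing
                index[word].add(doc_id)
    # Sort words and document numbers so look-ups and merges are predictable.
    return {word: sorted(ids) for word, ids in sorted(index.items())}


# Hypothetical document texts keyed by document number.
docs = {
    2: "the actor",
    3: "the abacus",
    5: "aspen",
    11: "atoll",
    19: "abacus and actor",
    22: "an abacus",
    29: "actor",
    34: "an atoll",
}

index = build_inverted_file(docs)
for word, doc_ids in index.items():
    print(word, doc_ids)
```

Keeping each list of document numbers sorted is a design choice that makes the later merge of two lists a single linear pass.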

20 Inverted List -- Concept
Inverted list: All the entries in an inverted file that apply to a specific word, e.g.,
abacus   3, 19, 22
Posting: An entry in an inverted list, e.g., there are three postings for "abacus".

21 Evaluating a Boolean Query
Example: abacus and actor
Postings for abacus: 3, 19, 22
Postings for actor: 2, 19, 29
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
Document 19 is the only document that contains both terms, "abacus" and "actor".
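One common way to carry out this merge, assuming both inverted lists are kept sorted by document number, is a two-pointer walk. The function name `intersect_postings` is hypothetical; the two lists are the postings from this slide.

```python
def intersect_postings(p1, p2):
    """Return the document numbers present in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # document contains both terms
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance whichever list is behind
            i += 1
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect_postings(abacus, actor))   # [19]
```

Each list is scanned once, so the cost grows with the combined length of the two lists rather than with the size of the collection.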

22 Enhancements to Inverted Files -- Concept
Location: The inverted file holds information about the location of each term within the document.
Uses:
• adjacency and near operators
• user interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings for each term.
Uses:
• term weighting
• query processing optimization

23 Inverted File -- Concept (Enhanced)
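[Table with columns Word, Postings, Document, Location for the words abacus, actor, aspen, atoll. Only two entries survive in this transcript: abacus in document 22 at location 56, and atoll in document 11 at location 70.]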

24 Evaluating an Adjacency Operation
Example: abacus adj actor
Postings for abacus
Postings for actor
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
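With locations stored in the inverted file, adj and near can be sketched as checks on word positions within each shared document. The sketch makes several assumptions: locations are word positions, adj means the second term immediately follows the first, and near ignores order. Only the entries (document 19, locations 212 and 213) and (document 22, location 56) come from the slides; all other postings are hypothetical.

```python
def adjacent(postings_a, postings_b):
    """Return (doc, loc_a, loc_b) where term b immediately follows term a."""
    hits = []
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        locs_b = set(postings_b[doc])
        for loc in postings_a[doc]:
            if loc + 1 in locs_b:
                hits.append((doc, loc, loc + 1))
    return hits


def near(postings_a, postings_b, distance):
    """Return (doc, loc_a, loc_b) where the terms are within `distance` words."""
    hits = []
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        for loc_a in postings_a[doc]:
            for loc_b in postings_b[doc]:
                if loc_a != loc_b and abs(loc_a - loc_b) <= distance:
                    hits.append((doc, loc_a, loc_b))
    return hits


# {document: [locations]}; mostly hypothetical values (see lead-in).
abacus = {3: [94], 19: [7, 212], 22: [56]}
actor = {2: [66], 19: [213], 29: [45]}

print(adjacent(abacus, actor))
print(near(abacus, actor, 4))
```

Only documents that appear in both postings are examined, so the position check is an extra filter applied after the same document-level merge used for the and operator.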

