CS 430: Information Discovery
Lecture 4: Ranking

Course Administration

• The slides for Lecture 3 have been reposted with slightly revised notation.
• The reading for Discussion Class 2 requires a computer connected to the network with a Cornell IP address.
• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.
• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.

Choice of Weights

query q: ant dog

document   text                           terms
d1         ant ant bee                    ant bee
d2         dog bee dog hog dog ant dog    ant bee dog hog
d3         cat gnu dog eel fox            cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
q      ?              ?
d1     ?    ?
d2     ?    ?         ?                   ?
d3               ?    ?    ?    ?    ?

What weights lead to the best information retrieval?

Methods for Selecting Weights

Empirical: Test a large number of possible weighting schemes with actual data. (This lecture, based on the work of Salton et al.)

Model based: Develop a mathematical model of word distribution and derive the weighting scheme theoretically. (Probabilistic model of information retrieval.)

Weighting 1. Term Frequency

Suppose term j appears fij times in document i. What weighting should be given to term j?

Term Frequency: Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.

Term Frequency: Free-text Documents

A simple method (as illustrated in Lecture 3) is to use fij as the term frequency. But in free-text documents, terms are likely to appear more often in long documents. Therefore fij should be scaled by some variable related to the length of document i.

Term Frequency: Free-text Documents

Standard method for free-text documents: scale fij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.

Let mi = max (fij) taken over all terms j in document i, i.e., mi is the maximum frequency of any term in document i.

Term frequency (tf): tfij = fij / mi when fij > 0

Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
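A minimal Python sketch of this calculation (the function name and the tokenized-document representation are illustrative assumptions, not part of the lecture):

```python
from collections import Counter

def term_frequency(doc_tokens):
    """tf_ij = f_ij / m_i, where m_i is the frequency of the most
    frequent term in document i."""
    counts = Counter(doc_tokens)        # f_ij for every term j
    m_i = max(counts.values())          # m_i = max of f_ij over terms j
    return {term: f / m_i for term, f in counts.items()}

# Document d2 from the "Choice of Weights" example: dog appears 4 times.
print(term_frequency("dog bee dog hog dog ant dog".split()))
# {'dog': 1.0, 'bee': 0.25, 'hog': 0.25, 'ant': 0.25}
```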

Weighting 2. Inverse Document Frequency

Suppose term j appears fij times in document i. What weighting should be given to term j?

Inverse Document Frequency: Concept
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Inverse Document Frequency

Suppose there are n documents and that the number of documents in which term j occurs is nj.

A possible method might be to use n/nj as the inverse document frequency.

Standard method: The simple ratio over-emphasizes small differences. Therefore use a logarithm.

Inverse document frequency (idf): idfj = log2 (n/nj) + 1 when nj > 0

Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.

Example of Inverse Document Frequency

n = 1,000 documents

term j    nj       idfj
A         100      4.32
B         500      2.00
C         900      1.15
D         1,000    1.00

From: Salton and McGill
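As a quick check, a few lines of Python (illustrative) reproduce these values:

```python
import math

def idf(n, n_j):
    """idf_j = log2(n / n_j) + 1, defined for n_j > 0."""
    return math.log2(n / n_j) + 1

for n_j in (100, 500, 900, 1000):
    print(n_j, round(idf(1000, n_j), 2))   # 4.32, 2.0, 1.15, 1.0
```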

Full Weighting: Standard Form of tf.idf

Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:

(weight of term j in document i) = (term frequency) * (inverse document frequency)

The standard tf.idf weighting scheme, for free-text documents, is:

tij = tfij * idfj = (fij / mi) * (log2 (n/nj) + 1) when nj > 0
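Combining the two factors is then a single expression; a hedged sketch (parameter names are illustrative):

```python
import math

def tf_idf(f_ij, m_i, n, n_j):
    """t_ij = (f_ij / m_i) * (log2(n / n_j) + 1), for n_j > 0."""
    return (f_ij / m_i) * (math.log2(n / n_j) + 1)

# "dog" in document d2 of the running example: f_ij = 4, m_i = 4,
# and dog occurs in n_j = 2 of the n = 3 documents.
print(round(tf_idf(4, 4, 3, 2), 2))   # 1.58
```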

Structured Text

Structured texts, e.g., queries or catalog records, have a different distribution of terms from free text. A modified expression for the term frequency is:

tfij = K + (1 - K) * fij / mi when fij > 0

K is a parameter between 0 and 1 that can be tuned for a specific collection.

Query: To weight terms in the query, Salton and Buckley recommend K equal to 0.5.
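A sketch of the modified term frequency, with the recommended query setting (variable names as above):

```python
def tf_structured(f_ij, m_i, K=0.5):
    """tf_ij = K + (1 - K) * f_ij / m_i, for f_ij > 0.
    K = 0.5 is the value recommended for query terms."""
    return K + (1 - K) * f_ij / m_i

# Even a term that appears once in a query whose most frequent term
# appears twice keeps a weight of at least K: 0.5 + 0.5 * (1/2) = 0.75.
print(tf_structured(1, 2))   # 0.75
```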

Similarity

The similarity between query q and document i is given by:

cos(dq, di) = Σk=1..n (tqk * tik) / (|dq| |di|)

where dq and di are the corresponding weighted term vectors, with components in the k dimension (corresponding to term k) given by:

tqk = (0.5 + 0.5 * fqk / mq) * (log2 (n/nk) + 1) when fqk > 0
tik = (fik / mi) * (log2 (n/nk) + 1) when fik > 0
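A small Python sketch of this cosine similarity (representing the sparse weighted vectors as dictionaries is an assumption for illustration):

```python
import math

def cosine(vec_q, vec_d):
    """cos(dq, di) = sum_k(tqk * tik) / (|dq| * |di|), where each vector
    is a dict mapping term -> weight, with zero weights omitted."""
    dot = sum(w * vec_d.get(term, 0.0) for term, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    return dot / (norm_q * norm_d)
```

The query weights tqk come from the structured-text form with K = 0.5, and the document weights tik from the free-text tf.idf form, as defined above.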

Boolean Queries

Boolean query: two or more search terms, related by logical operators, e.g., and, or, not.

Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor

Boolean Diagram

[Venn diagram of sets A and B: the overlap is A and B, the two circles together are A or B, and the region outside both circles is not (A or B).]

Adjacent and Near Operators

abacus adj actor
Terms abacus and actor are adjacent to each other, as in the string "abacus actor".

abacus near 4 actor
Terms abacus and actor are near to each other, as in the string "the actor has an abacus".

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Evaluation of Boolean Operators

Precedence of operators must be defined:

adj, near    (highest)
and, not
or           (lowest)

Example: A and B or C and B is evaluated as (A and B) or (C and B).

Inverted File

Inverted file: a list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"

In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

Inverted File -- Basic Concept

Word      Documents
abacus    3, 19, 22
actor     2, 29
aspen     5
atoll     11, 34

Stop words are removed and stemming carried out before building the index.

Inverted List -- Concept

Inverted list: all the entries in an inverted file that apply to a specific word, e.g., for abacus: 3, 19, 22.

Posting: an entry in an inverted list, e.g., there are three postings for "abacus".
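A toy Python sketch of building such an inverted file from already-processed documents (stop words removed, stemming done; the representation is an assumption for illustration):

```python
def build_inverted_file(docs):
    """docs maps document id -> list of index terms. Returns each
    term's inverted list as a sorted list of document ids."""
    index = {}
    for doc_id, terms in docs.items():
        for term in set(terms):            # one posting per document here
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_file({3: ["abacus"], 19: ["abacus", "actor"],
                             22: ["abacus"], 2: ["actor"], 29: ["actor"]})
print(index["abacus"])   # [3, 19, 22] -- the three postings for "abacus"
```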

Evaluating a Boolean Query

Example: abacus and actor

Postings for abacus: 3, 19, 22
Postings for actor: 2, 19, 29

To evaluate the and operator, merge the two inverted lists with a logical AND operation. Document 19 is the only document that contains both terms, "abacus" and "actor".
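Because both inverted lists are kept sorted, the merge is a single linear walk over the two lists rather than a nested scan; a sketch:

```python
def and_merge(list_a, list_b):
    """Intersect two sorted inverted lists in one pass."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])   # document contains both terms
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

print(and_merge([3, 19, 22], [2, 19, 29]))   # [19]
```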

Enhancements to Inverted Files -- Concept

Location: The inverted file holds information about the location of each term within the document. Uses:
• adjacency and near operators
• user interface design: highlight the location of search terms

Frequency: The inverted file includes the number of postings for each term. Uses:
• term weighting
• query processing optimization

Inverted File -- Concept (Enhanced)

Word      Postings    (Document, Location)
abacus    4           (3, 94), (19, 7), (19, 212), (22, 56)
actor     3           (2, 66), (19, 213), (29, 45)
aspen     1           (5, 43)
atoll     3           (11, 3), (11, 70), (34, 40)

Evaluating an Adjacency Operation

Example: abacus adj actor

Postings for abacus: (3, 94), (19, 7), (19, 212), (22, 56)
Postings for actor: (2, 66), (19, 213), (29, 45)

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent to each other.
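With locations stored in the postings, evaluating adj amounts to finding postings in the same document whose locations differ by exactly one. A minimal sketch (the (document, location) tuple representation is an assumption):

```python
def evaluate_adj(postings_a, postings_b):
    """Return (document, location) pairs where a posting from list A is
    immediately followed by a posting from list B."""
    locations_b = set(postings_b)
    return [(doc, loc) for doc, loc in postings_a
            if (doc, loc + 1) in locations_b]

abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]
print(evaluate_adj(abacus, actor))   # [(19, 212)]
```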