Automated Information Retrieval


Automated Information Retrieval
CS 502: Spring 2001, Guest Lecture
William Y. Arms

Types of Information Discovery
[Diagram: discovery methods arranged by media type (text vs. image, video, audio, etc.) and by approach (linking, searching, browsing); searching divides into user-in-loop methods (catalogs and indexes, i.e. metadata) and automatic methods (natural language processing); the cells map these areas to CS 502 and CS 474.]

This Lecture
[The same diagram as the previous slide, indicating the part of the map that this lecture covers.]

Automated Information Discovery Creating metadata records manually is labor-intensive and hence expensive. The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

Resources for Automated Information Discovery
Computer power: brute force computing, ranking methods, automatic generation of metadata.
The intelligence of the user: browsing, relevance feedback, information visualization.

Brute Force Computing
Few people really understand Moore's Law:
-- Computing power doubles every 18 months
-- Increases 100 times in 12 years
-- Increases 10,000 times in 25 years
Simple algorithms + immense computing power may outperform human intelligence.

Contrast with (Old-Fashioned) Boolean Searching
Traditional information retrieval uses Boolean retrieval, where a document either matches a query exactly or not at all.
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
Modern retrieval systems rank documents by the likelihood that they satisfy the user's needs.
• Encourages long queries (to have as many dimensions as possible)
• Benefits from large numbers of index terms
• Permits queries with many terms, not all of which need match the document

Similarity Ranking
Ranking methods using similarity measure the degree of similarity between a query and a document (or between two documents). The basic technique is the vector space model with term weighting.
[Diagram: requests compared against documents by similarity — how similar is a document to a request?]

Document Ranking
Methods using document ranking apply an algorithm that ranks the documents in order of importance. The best-known technique is Google's PageRank algorithm.

Vector Space Methods: Concept
An n-dimensional space, where n is the total number of different terms (words) in a set of documents. Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document. Similarity between two documents is measured by the angle between their vectors. Much of this work was carried out by Gerard Salton and colleagues in Cornell's computer science department.

Example 1: Incidence Array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

       ant  bee  cat  dog  eel  fox  gnu  hog
d1      1    1
d2      1    1         1                   1
d3                1    1    1    1    1

Weights: tij = 1 if document i contains term j and zero otherwise

Three Terms Represented in 3 Dimensions
[Figure: document vectors plotted in a space with one axis per term (t1, ...).]

Vector Space Revision
x = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space.
Length of x is given by (extension of Pythagoras's theorem):
|x|^2 = x1^2 + x2^2 + x3^2 + ... + xn^2
If x1 and x2 are vectors, their inner product (or dot product) is given by:
x1.x2 = x11x21 + x12x22 + x13x23 + ... + x1nx2n
Cosine of the angle θ between the vectors x1 and x2:
cos(θ) = x1.x2 / (|x1| |x2|)

Example 1 (continued)
Similarity of the documents in the example, measured as the cosine of the angle between each pair of incidence vectors:

       d1    d2    d3
d1     1
d2     0.71  1
d3     0     0.22  1
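The cosine similarities for the example documents can be checked directly. The helper below is a minimal sketch (the function name and vector layout are illustrative, not from the lecture):

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors:
    cos(theta) = x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Incidence vectors over (ant, bee, cat, dog, eel, fox, gnu, hog)
d1 = [1, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 0, 0, 0, 1]
d3 = [0, 0, 1, 1, 1, 1, 1, 0]

print(round(cosine(d1, d2), 2))  # 0.71
print(round(cosine(d1, d3), 2))  # 0.0
print(round(cosine(d2, d3), 2))  # 0.22
```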

Reasons for Term Weighting
Similarity using an incidence matrix measures the occurrences of terms, but no other characteristics of the documents. Terms are more useful for information retrieval if they:
• appear several times in one document (weighting by term frequency)
• only appear in some documents (weighting by document frequency)
• appear in short documents (weighting by document length)

Weighting
Very simple approach (basic tf.idf):
wij = fij / dj
where:
wij is the weight given to term j in document i
fij is the frequency with which term j appears in document i
dj is the number of documents that contain term j

Term Frequency Concept A term that appears many times within a document is likely to be more important than a term that appears only once.

Term Frequency
Simple weighting to reflect term frequency:
Suppose term j appears fij times in document i.
Let mi = maxj (fij), i.e., mi is the maximum frequency of any term in document i.
Term frequency (tf): tfij = fij / mi

Example 2: Frequency Array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

       ant  bee  cat  dog  eel  fox  gnu  hog
d1      2    1
d2      1    1         1                   1
d3                1    1    1    1    1

Weights: tij = frequency that term j occurs in document i
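The tf weighting from the previous slide can be sketched in a few lines (the function name is illustrative):

```python
def term_frequencies(doc_terms):
    """tf_ij = f_ij / m_i, where m_i is the count of the most
    frequent term in document i."""
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1
    m = max(counts.values())
    return {term: f / m for term, f in counts.items()}

print(term_frequencies(["ant", "ant", "bee"]))  # {'ant': 1.0, 'bee': 0.5}
```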

Example 2 (continued)
Similarity of the documents in the example, using the frequency weights:

       d1    d2    d3
d1     1
d2     0.67  1
d3     0     0.22  1

Similarity depends upon the weights given to the terms.

Inverse Document Frequency Concept
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Inverse Document Frequency
For term j:
number of documents = n
document frequency (number of documents in which term j occurs) = dj
One possible measure is n/dj, but this over-emphasizes small differences. Therefore a more useful definition is:
Inverse document frequency (idf): idfj = log2(n/dj) + 1

Weighting
Weights of the following form perform well in a wide variety of circumstances:
(weight of term j in document i) = (term frequency) * (inverse document frequency)
The standard weighting scheme is:
wij = tfij * idfj = (fij / mi) * (log2(n/dj) + 1)
Experience shows that effectiveness depends on the characteristics of the collection; there are few general improvements beyond this simple weighting scheme.
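Putting the two factors together, the standard scheme can be sketched as below, applied to the three example documents (the function name and data layout are illustrative):

```python
import math

def tf_idf(docs):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1) for each document."""
    n = len(docs)
    df = {}  # d_j: number of documents containing term j
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        m = max(counts.values())  # m_i: most frequent term's count
        weights.append({t: (f / m) * (math.log2(n / df[t]) + 1)
                        for t, f in counts.items()})
    return weights

docs = [["ant", "ant", "bee"],
        ["bee", "hog", "ant", "dog"],
        ["cat", "gnu", "dog", "eel", "fox"]]
w = tf_idf(docs)
# "ant" in d1: (2/2) * (log2(3/2) + 1) ≈ 1.58
```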

PageRank Algorithm (Google) Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Page Ranks
[Table: link matrix for pages P1-P6. Columns are citing pages, rows are cited pages; cell (i, j) is 1 if page Pj links to page Pi. The number of outgoing links from pages P1-P6 is 2, 1, 4, 2, 2, 2 respectively.]

Normalize by Number of Links from Page
[Table: the same link matrix with each column divided by that page's number of outgoing links (2, 1, 4, 2, 2, 2), so the nonzero entries become 1, 0.5 or 0.25 and every column sums to 1. The resulting matrix is called B.]

Calculating the Weights
Initially all pages have weight 1: w1 = (1, 1, 1, 1, 1, 1)
Recalculate the weights: w2 = Bw1 [in the example, w2 = (1.75, 0.25, 0.50, 1.00, 0.75, …)]
Iterate wk+1 = Bwk until the iteration converges.

Page Ranks (Basic Algorithm)
The iteration converges when w = Bw. This w is the principal eigenvector of B. It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.
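The iteration above is the power method for finding that eigenvector. A minimal sketch, assuming B is supplied as a column-stochastic matrix (the 3-page matrix below is a made-up example, not the one from the slides):

```python
def page_rank(B, iterations=100):
    """Iterate w <- B w starting from the all-ones vector.
    B[i][j] is 1/(outgoing links of page j) if page j links
    to page i, and 0 otherwise."""
    n = len(B)
    w = [1.0] * n
    for _ in range(iterations):
        w = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
    return w

# Example: P1 links to P2 and P3; P2 links to P1; P3 links to P1 and P2.
B = [[0.0, 1.0, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.0, 0.0]]
w = page_rank(B)  # converges to about (1.33, 1.0, 0.67)
```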

PageRank Intuitive Model
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats steps 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with which they are visited. The basic algorithm described on the previous slides corresponds to p = 0.
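The random-surfer model can also be simulated directly, estimating ranks as visit frequencies. This sketch uses a made-up three-page graph and an illustrative jump probability p:

```python
import random

def random_surfer(links, steps=200000, p=0.15, seed=1):
    """Estimate page ranks as visit frequencies of the surfer:
    with probability p jump to a random page, otherwise follow
    a random outgoing link (jump anyway if the page has none)."""
    rng = random.Random(seed)
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        if rng.random() < p or not links[current]:
            current = rng.choice(pages)           # random jump
        else:
            current = rng.choice(links[current])  # follow a link
        visits[current] += 1
    return {page: count / steps for page, count in visits.items()}

# A is cited by both B and C, so it should rank highest.
ranks = random_surfer({"A": ["B"], "B": ["A", "C"], "C": ["A"]})
```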

Information Discovery: 1992 and 2002

                     1992          2002
Content              print         online
Computing            expensive     inexpensive
Choice of content    selective     comprehensive
Index creation       human         automatic
Frequency            one time      monthly
Vocabulary           controlled    not controlled
Query                Boolean       ranked retrieval
Users                trained       untrained