
1 Automated Information Retrieval
CS 502, Spring 2001, Guest Lecture
William Y. Arms

2 Types of Information Discovery
[Diagram: a matrix of discovery methods against media types. Media types: text; image, video, audio, etc. Methods: linking; searching, divided into natural language processing and catalogs/indexes (metadata); browsing, divided into user-in-the-loop and automatic. Regions of the matrix are marked CS 502 and CS 474.]

3 This Lecture
[Diagram: the matrix from the previous slide (discovery methods against media types), highlighting the topics covered in this lecture.]

4 Automated Information Discovery
Creating metadata records manually is labor-intensive and hence expensive. The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

5 Resources for Automated Information Discovery
Computer power:
• brute force computing
• ranking methods
• automatic generation of metadata
The intelligence of the user:
• browsing
• relevance feedback
• information visualization

6 Brute Force Computing
Few people really understand Moore's Law:
• Computing power doubles every 18 months
• Increases 100 times in 12 years
• Increases 10,000 times in 25 years
Simple algorithms + immense computing power may outperform human intelligence.

7 Contrast with (Old-Fashioned) Boolean Searching
Traditional information retrieval uses Boolean retrieval, where a document either matches a query exactly or not at all.
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
Modern retrieval systems rank documents by the likelihood that they satisfy the user's needs.
• Encourages long queries (to have as many dimensions as possible)
• Benefits from large numbers of index terms
• Permits queries with many terms, not all of which need match the document
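The contrast can be sketched with a toy corpus. This is a minimal illustration, not the lecture's own example; the documents and queries are hypothetical:

```python
# Minimal contrast between Boolean and ranked retrieval.
# Documents and queries are illustrative, not from the lecture.

docs = {
    "d1": {"ant", "bee"},
    "d2": {"ant", "bee", "dog", "hog"},
    "d3": {"cat", "dog", "eel", "fox", "gnu"},
}

def boolean_and(query_terms):
    """Boolean AND: a document matches only if it contains every query term."""
    return sorted(d for d, terms in docs.items() if query_terms <= terms)

def ranked(query_terms):
    """Ranked retrieval: score each document by the number of query terms it contains."""
    scores = {d: len(query_terms & terms) for d, terms in docs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(boolean_and({"ant", "dog"}))   # only d2 contains both terms
print(ranked({"ant", "dog"}))        # every document gets a partial-match score
```

The Boolean query rejects d1 and d3 outright, while the ranked query still scores them on their partial matches.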

8 Similarity Ranking
Ranking methods using similarity: measure the degree of similarity between a query and a document (or between two documents). The basic technique is the vector space model with term weighting.
[Diagram: a request compared against a set of documents; the question is how similar each document is to the request.]

9 Document Ranking
Methods using document ranking: use some algorithm to rank the documents in order of importance. The best-known technique is Google's PageRank algorithm.

10 Vector Space Methods: Concept
n-dimensional space, where n is the total number of different terms (words) in a set of documents. Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document. Similarity between two documents is measured by the angle between their vectors. Much of this work was carried out by Gerard Salton and colleagues in Cornell's computer science department.

11 Example 1: Incidence Array
terms in d1: ant ant bee
terms in d2: bee hog ant dog
terms in d3: cat gnu dog eel fox

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     1    1
d2     1    1         1                   1
d3               1    1    1    1    1

Weights: tij = 1 if document i contains term j, and 0 otherwise.
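The incidence array can be built mechanically from the term lists above; a minimal Python sketch:

```python
# Incidence array for Example 1: t_ij = 1 if document i contains term j, else 0.

docs = {
    "d1": "ant ant bee".split(),
    "d2": "bee hog ant dog".split(),
    "d3": "cat gnu dog eel fox".split(),
}

# Collect the vocabulary in alphabetical order, as on the slide.
terms = sorted({t for words in docs.values() for t in words})
# terms == ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']

incidence = {d: [1 if t in words else 0 for t in terms]
             for d, words in docs.items()}

for d, row in incidence.items():
    print(d, row)
```

Note that the double occurrence of "ant" in d1 is invisible here; incidence records only presence or absence.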

12 Three Terms Represented in 3 Dimensions
[Figure: a 3-dimensional space with one axis per term (t1, t2, t3); each document is a vector in this space.]

13 Vector Space Revision
x = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space.
The length of x is given by (an extension of Pythagoras's theorem):
|x|^2 = x1^2 + x2^2 + x3^2 + ... + xn^2
If x1 and x2 are vectors, their inner product (or dot product) is given by:
x1.x2 = x11x21 + x12x22 + x13x23 + ... + x1nx2n
Cosine of the angle θ between the vectors x1 and x2:
cos(θ) = x1.x2 / (|x1| |x2|)
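These definitions translate directly into code; a minimal sketch, using the d1 and d2 vectors from the incidence array of Example 1:

```python
from math import sqrt

def length(x):
    """|x| = sqrt(x1^2 + ... + xn^2), the extension of Pythagoras's theorem."""
    return sqrt(sum(xi * xi for xi in x))

def inner(x, y):
    """Inner (dot) product: x.y = x1*y1 + ... + xn*yn."""
    return sum(xi * yi for xi, yi in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between vectors x and y."""
    return inner(x, y) / (length(x) * length(y))

# d1 and d2 from the incidence array of Example 1 (terms in alphabetical order)
d1 = [1, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 0, 0, 0, 1]
print(round(cosine(d1, d2), 2))  # 0.71
```

The cosine is 1 for identical directions and 0 for documents that share no terms, which is what makes it a convenient similarity measure.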

14 Example 1 (continued)
Similarity of documents in the example:

      d1    d2    d3
d1    1    0.71   0
d2   0.71   1    0.22
d3    0    0.22   1

15 Reasons for Term Weighting
Similarity using an incidence matrix measures the occurrences of terms, but no other characteristics of the documents. Terms are more useful for information retrieval if they:
• appear several times in one document (weighting by term frequency)
• only appear in some documents (weighting by document frequency)
• appear in a short document (weighting by document length)

16 Weighting
Very simple approach (basic tf.idf):
wij = fij / dj
where:
wij is the weight given to term j in document i
fij is the frequency with which term j appears in document i
dj is the number of documents that contain term j

17 Term Frequency Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.

18 Term Frequency
Simple weighting to reflect term frequency:
Suppose term j appears fij times in document i.
Let mi = max_j (fij), i.e., mi is the maximum frequency of any term in document i.
Term frequency (tf): tfij = fij / mi

19 Example 2: Frequency Array
terms in d1: ant ant bee
terms in d2: bee hog ant dog
terms in d3: cat gnu dog eel fox

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     2    1
d2     1    1         1                   1
d3               1    1    1    1    1

Weights: tij = frequency with which term j occurs in document i.

20 Example 2 (continued)
Similarity of documents in the example:

      d1    d2    d3
d1    1    0.67   0
d2   0.67   1    0.22
d3    0    0.22   1

Similarity depends upon the weights given to the terms.
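The effect of weighting on similarity can be checked directly. The sketch below uses the d1 and d2 vectors from Examples 1 and 2 (frequency 1 is assumed for each term in d2, since each appears once in its term list):

```python
# Comparing Example 1 (incidence weights) with Example 2 (frequency weights).
from math import sqrt

def cosine(x, y):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

# Terms in order: ant bee cat dog eel fox gnu hog
d1_inc  = [1, 1, 0, 0, 0, 0, 0, 0]   # incidence: presence/absence only
d2_inc  = [1, 1, 0, 1, 0, 0, 0, 1]
d1_freq = [2, 1, 0, 0, 0, 0, 0, 0]   # frequency: "ant ant bee" has ant twice
d2_freq = [1, 1, 0, 1, 0, 0, 0, 1]

print(round(cosine(d1_inc, d2_inc), 2))    # 0.71 with incidence weights
print(round(cosine(d1_freq, d2_freq), 2))  # 0.67 with frequency weights
```

The same pair of documents scores differently under the two weighting schemes, which is the point of the slide.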

21 Inverse Document Frequency
Concept: A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

22 Inverse Document Frequency
For term j:
number of documents = n
document frequency (number of documents in which term j occurs) = dj
One possible measure is n/dj, but this over-emphasizes small differences. Therefore a more useful definition is:
Inverse document frequency (idf): idfj = log2(n/dj) + 1

23 Weighting
Weights of the following form perform well in a wide variety of circumstances:
(weight of term j in document i) = (term frequency) × (inverse document frequency)
The standard weighting scheme is:
wij = tfij × idfj = (fij / mi) × (log2(n/dj) + 1)
Experience shows that effectiveness depends on characteristics of the collection. There are few general improvements beyond this simple weighting scheme.
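The standard scheme above can be sketched in a few lines, using the toy corpus from the earlier examples:

```python
from math import log2

def tf_idf_weights(docs):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1), the weighting scheme above.
    docs maps a document id to its list of terms."""
    n = len(docs)
    terms = sorted({t for words in docs.values() for t in words})
    # d_j: number of documents that contain term j
    df = {t: sum(1 for words in docs.values() if t in words) for t in terms}
    weights = {}
    for d, words in docs.items():
        freq = {t: words.count(t) for t in set(words)}   # f_ij
        m = max(freq.values())                           # m_i
        weights[d] = {t: (f / m) * (log2(n / df[t]) + 1)
                      for t, f in freq.items()}
    return weights

docs = {
    "d1": "ant ant bee".split(),
    "d2": "bee hog ant dog".split(),
    "d3": "cat gnu dog eel fox".split(),
}
w = tf_idf_weights(docs)
print(round(w["d1"]["ant"], 3))  # tf = 2/2 = 1, idf = log2(3/2) + 1
```

"ant" in d1 gets tf = 1 (it is the most frequent term there) but a modest idf, since it appears in two of the three documents; a term unique to one document, such as "cat", gets the full idf of log2(3) + 1.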

24 PageRank Algorithm (Google)
Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

25 Page Ranks
[Table: link matrix for six pages P1–P6. Columns are citing pages, rows are cited pages; entry (i, j) is 1 if citing page Pj links to cited page Pi.]

26 Normalize by Number of Links from Page
[Table: the link matrix above with each column divided by the number of outgoing links from the citing page. The resulting matrix is called B.]

27 Calculating the Weights
Initially all pages have weight 1: w1 = (1, 1, 1, 1, 1, 1)
Recalculate the weights: w2 = Bw1
Iterate wk+1 = Bwk until the iteration converges.
[The slide shows the numeric values of w1 and w2 for the example.]

28 Page Ranks (Basic Algorithm)
The iteration converges when w = Bw. This w is the principal eigenvector of B. It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.

29 PageRank Intuitive Model
A user:
1. Starts at a random page on the web.
2a. With probability p, selects any random page and jumps to it.
2b. With probability 1 - p, selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats steps 2a and 2b a very large number of times.
Pages are ranked according to the relative frequency with which they are visited. The basic algorithm described on the previous slides corresponds to p = 0.
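The random-surfer model can be sketched as a power iteration. The three-page link graph below is hypothetical, and p = 0.15 is a commonly used value, not one from the lecture:

```python
# Power-iteration sketch of the random-surfer model above.
# The link graph is hypothetical; p is the random-jump probability.

def pagerank(links, p=0.15, iterations=100):
    """links maps each page to the list of pages it links to.
    Returns the stationary visit frequencies (ranks sum to 1)."""
    pages = sorted(links)
    n = len(pages)
    rank = {pg: 1.0 / n for pg in pages}
    for _ in range(iterations):
        new = {pg: p / n for pg in pages}   # step 2a: random jump
        for pg, outs in links.items():
            for target in outs:
                # step 2b: follow a random outgoing link with probability 1 - p
                new[target] += (1 - p) * rank[pg] / len(outs)
        rank = new
    return rank

links = {
    "P1": ["P2", "P3"],
    "P2": ["P1"],
    "P3": ["P1", "P2"],
}
rank = pagerank(links)
print(max(rank, key=rank.get))  # P1, which receives the most link weight
```

With p = 0 this reduces to the basic iteration wk+1 = Bwk from the previous slides; the damping term p guarantees convergence even when the link graph is not strongly connected.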

30 Information Discovery: 1992 and 2002

                    1992          2002
Content             print         online
Computing           expensive     inexpensive
Choice of content   selective     comprehensive
Index creation      human         automatic
Frequency           one time      monthly
Vocabulary          controlled    not controlled
Query               Boolean       ranked retrieval
Users               trained       untrained

