CS 430: Information Discovery


CS 430: Information Discovery Lecture 2 Text Based Information Retrieval

Course Administration
Web site: http://www.cs.cornell.edu/courses/cs430/2003fa
Notices: See the home page of the course Web site.
Sign-up sheet: If you did not sign up at the first class, please sign up now.

Course Administration
Please send all questions about the course to: cs430@cs.cornell.edu
The message will be sent to: William Arms, Pavel Dmitriev, Ariful Gani, Heng-Scheng Chuang.

Course Administration
Discussion class: Wednesday, September 3, Upson B17, 7:30 to 8:30 p.m. Prepare for the class as instructed on the course Web site. Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.
Due date of Assignment 1: this date may be changed. Watch the Notices on the Web site.

Functional View 1: Matching
[Diagram: documents are indexed into an index database; a query is compared against the index by a mechanism for determining whether a document matches the query; the output is a set of hits.]

Matching: Recall and Precision
If information retrieval were perfect ... every hit would be relevant to the original query, and every relevant item in the body of information would be found.
Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.
Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.

Recall and Precision: Example
Collection of 10,000 documents, 50 on a specific topic.
An ideal search finds these 50 documents and rejects all others.
An actual search identifies 25 documents; 20 are relevant but 5 are on other topics.
Precision: 20/25 = 0.8 (80% of the hits were relevant)
Recall: 20/50 = 0.4 (40% of the relevant documents were found)
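
A minimal sketch of the same calculation, with invented document identifiers standing in for the collection (the identifiers and sets below are not from the slides, only the counts are):

```python
# Precision and recall for a single query.
# The numbers mirror the example above: 50 relevant documents exist,
# the search returns 25 hits, and 20 of those hits are relevant.
relevant = set(range(50))                       # the 50 relevant documents
hits = set(range(20)) | set(range(100, 105))    # 25 hits: 20 relevant, 5 not

true_positives = len(hits & relevant)

precision = true_positives / len(hits)      # 20 / 25 = 0.8
recall = true_positives / len(relevant)     # 20 / 50 = 0.4

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```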

Measuring Precision and Recall
Precision is easy to measure: a knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure: to know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria. In the example, all 10,000 documents must be examined.

Ranking Methods
Methods that look for matches assume that a document is either relevant to a query or not relevant.
Ranking methods measure the degree of similarity between a query and a document: how similar is a document to a request?

Functional View 2: Ranking Methods
[Diagram: documents are indexed into an index database; a query is compared against the index by a mechanism for determining the similarity of the request representation to the information item representation; the output is a set of documents ranked by how similar they are to the query.]

Ranking: Recall and Precision
If information retrieval were perfect ... every document relevant to the original query would be ranked above every other document.
Precision and recall are functions of the rank order.
Precision(n): percentage (or fraction) of the n most highly ranked documents that are relevant.
Recall(n): percentage (or fraction) of the relevant items that are in the n most highly ranked documents.

Precision and Recall with Ranking Example "Your query found 349,871 possibly relevant documents. Here are the first eight." Examination of the first 8 finds that 5 of them are relevant.

Graph of Precision with Ranking
Rank:          1    2    3    4    5    6    7    8
Relevant?      Y    N    Y    Y    N    Y    N    Y
Precision(n):  1/1  1/2  2/3  3/4  3/5  4/6  4/7  5/8
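
The precision column can be reproduced mechanically from the ranked relevance judgments. A sketch, assuming the total number of relevant documents is known (the value 5 below is an assumption made only so that recall can also be shown):

```python
# Precision(n) and Recall(n) computed from a ranked list of relevance judgments.
judgments = ["Y", "N", "Y", "Y", "N", "Y", "N", "Y"]   # ranks 1..8, as in the table
total_relevant = 5    # assumed for illustration; needed only for recall

found = 0
for n, judgment in enumerate(judgments, start=1):
    if judgment == "Y":
        found += 1
    print(f"rank {n}: precision = {found}/{n} = {found / n:.2f}, "
          f"recall = {found / total_relevant:.2f}")
```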

Precision and Recall
Precision and recall measure the results of a single query using a specific search system applied to a specific set of documents.
Matching methods: precision and recall are single numbers.
Ranking methods: precision and recall are represented by functions (or graphs) of the rank order.

Text Based Information Retrieval
Most matching methods are based on Boolean operators. Most ranking methods are based on the vector space model. Many practical systems combine features of both approaches.
In their basic form, both approaches treat words as separate tokens, with minimal attempt to interpret them linguistically.

Documents
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. [Methods of markup, e.g., XML, are covered in CS 431.]
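
As a rough illustration (not part of the original slides), free text can be reduced to a sequence of tokens with a simple pattern match; real systems treat punctuation, hyphenation, numbers, and markup with much more care:

```python
import re

def tokenize(text: str) -> list[str]:
    """Very naive tokenizer: lower-case the text and keep runs of letters
    and digits. Only a sketch of the idea, not a production tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("A textual document is a sequence of words and other symbols."))
# ['a', 'textual', 'document', 'is', 'a', 'sequence', 'of', 'words', 'and',
#  'other', 'symbols']
```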

Word Frequency
Observation: some words are more common than others.
Statistics: most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures used to index documents
• underlie many of the retrieval models

Word Frequency Example
The following example is taken from: Jamie Callan, Characteristics of Text, 1997.
Sample of 19 million words.
The next slide shows the 50 commonest words in rank order (r), with their frequency (f).

word       f          word       f         word       f
the        1130021    from       96900     or         54958
of         547311     he         94585     about      53713
to         516635     million    93515     market     52110
a          464736     year       90104     they       51359
in         390819     its        86774     this       50933
and        387703     be         85588     would      50828
that       204351     was        83398     you        49281
for        199340     company    83070     which      48273
is         152483     an         76974     bank       47940
said       148302     has        74405     stock      47401
it         134323     are        74097     trade      47310
on         121173     have       73132     his        47116
by         118863     but        71887     more       46244
as         109135     will       71494     who        42142
at         101779     say        66807     one        41635
mr         101679     new        64456     their      40910
with       101210     share      63925
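
A table of this kind takes only a few lines of code to produce. A sketch, assuming the sample is available as one large string (the tiny corpus below is invented; Callan's sample contained roughly 19 million words):

```python
from collections import Counter
import re

def commonest_words(text: str, top: int = 50):
    """Return the `top` commonest words with their frequencies, in rank order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(top)

sample = "the cat sat on the mat and the dog sat on the cat"
for rank, (word, freq) in enumerate(commonest_words(sample, top=5), start=1):
    print(rank, word, freq)
```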

Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f is the frequency with which w appears
r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Graph: frequency f plotted against rank r; the point for word w has rank r and frequency f.]

Rank Frequency Example
The next slide shows the words in Callan's data, normalized.
In this example:
r is the rank of word w in the sample.
f is the frequency of word w in the sample.
n is the total number of distinct words in the sample.

word   1000*r*f/n    word      1000*r*f/n    word     1000*r*f/n
the        59        from          92        or          101
of         58        he            95        about       102
to         82        million       98        market      101
a          98        year         100        they        103
in        103        its          100        this        105
and       122        be           104        would       107
that       75        was          105        you         106
for        84        company      109        which       107
is         72        an           105        bank        109
said       78        has          106        stock       110
it         78        are          109        trade       112
on         77        have         112        his         114
by         81        but          114        more        114
as         80        will         117        who         106
at         80        say          113        one         107
mr         86        new          112        their       108
with       91        share        114
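
The normalized column can be computed directly from the raw rank-frequency pairs. A sketch, using invented numbers rather than Callan's data:

```python
# Normalize rank-frequency data as 1000 * r * f / n, where n is the number
# of distinct words in the sample. If r * f is roughly constant, the
# normalized values cluster around a single number, as in the table above.
def normalize(ranked_counts, n):
    return [(word, round(1000 * rank * freq / n))
            for rank, (word, freq) in enumerate(ranked_counts, start=1)]

print(normalize([("the", 1000), ("of", 500), ("to", 330)], n=10000))
# [('the', 100), ('of', 100), ('to', 99)]
```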

Zipf's Law
If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:
r * f = c
Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of distinct words in the collection.
For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949.
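
A hedged sketch of how the relation can be checked empirically on a collection of your own (the corpus file name is hypothetical): rank the words by frequency, compute r * f for each, and compare the products with n / 10. Note that when r * f is close to n / 10, the normalized value 1000 * r * f / n on the previous slide comes out near 100, which is what the table shows.

```python
from collections import Counter
import re

def zipf_check(text: str):
    """Print r * f for each word and compare with the constant c ~ n / 10."""
    tokens = re.findall(r"[a-z]+", text.lower())
    ranked = Counter(tokens).most_common()      # (word, f) pairs in rank order
    n = len(ranked)                             # number of distinct words
    for rank, (word, freq) in enumerate(ranked, start=1):
        print(f"{word:>12}  r = {rank:4d}  f = {freq:7d}  r*f = {rank * freq:8d}")
    print(f"n = {n}; Zipf's Law predicts c to be roughly n / 10 = {n / 10:.0f}")

# zipf_check(open("corpus.txt").read())   # hypothetical corpus file
```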

Luhn's Proposal "It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements." Luhn, H.P., The automatic creation of literature abstracts, IBM Journal of Research and Development, 2, 159-165 (1958)

Cut-off Levels for Significant Words
[Figure: word frequency plotted against rank r, with an upper cut-off and a lower cut-off; the words between the two cut-offs are the significant words, and the resolving power of significant words peaks between the cut-offs. From: Van Rijsbergen, Ch. 2]

Methods that Build on Zipf's Law
Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
Significant words: ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.
Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
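
A minimal sketch of the cut-off idea, not from the slides: keep only the words whose collection frequency falls between an upper and a lower threshold. The thresholds passed in are arbitrary and collection-dependent.

```python
from collections import Counter
import re

def significant_words(text: str, upper_cutoff: int, lower_cutoff: int) -> set[str]:
    """Keep only words whose collection frequency lies between the two cut-offs."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}

# Example call with invented thresholds and a hypothetical corpus file:
# significant_words(open("corpus.txt").read(), upper_cutoff=1000, lower_cutoff=5)
```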

Examples of Weighting
Document frequency: a term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
Term frequency: a term that appears several times in a document is weighted more heavily than a term that appears only once.
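
A small sketch of the two counts themselves, using two invented documents (how the counts are turned into weights is the topic of a later lecture):

```python
from collections import Counter

documents = {
    "d1": ["the", "cat", "sat", "on", "the", "mat"],
    "d2": ["the", "dog", "chased", "the", "cat"],
}

# Document frequency: in how many documents does each term occur?
document_frequency = Counter()
for tokens in documents.values():
    document_frequency.update(set(tokens))      # count each document at most once

# Term frequency: how often does a term occur within one document?
term_frequency = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}

print(document_frequency["the"])      # 2: occurs in both documents (poor discriminator)
print(document_frequency["dog"])      # 1: occurs in one document (better discriminator)
print(term_frequency["d1"]["the"])    # 2: occurs twice within d1
```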

Approaches to Weighting
Boolean information retrieval: weight of term i in document j:
w(i, j) = 1 if term i occurs in document j
w(i, j) = 0 otherwise
General weighting methods:
0 < w(i, j) <= 1 if term i occurs in document j
(The use of weighting for ranking is the topic of Lecture 4.)
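
A minimal sketch of the Boolean scheme, with invented documents and terms: the weight of term i in document j is 1 if the term occurs in the document and 0 otherwise.

```python
# Boolean term weighting: w(i, j) = 1 if term i occurs in document j, else 0.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
}
terms = ["cat", "dog", "mat"]

w = {
    (term, doc_id): 1 if term in text.split() else 0
    for doc_id, text in documents.items()
    for term in terms
}

print(w[("dog", "d1")])   # 0: "dog" does not occur in d1
print(w[("dog", "d2")])   # 1: "dog" occurs in d2
```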