CS 430 / INFO 430 Information Retrieval


CS 430 / INFO 430 Information Retrieval Lecture 6 Boolean Methods

Course Administration

CS 430 / INFO 430 Information Retrieval Completion of Lecture 5

Porter Stemmer

A multi-step, longest-match stemmer.

M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp 130-137, July 1980.)
http://www.tartarus.org/~martin/PorterStemmer/def.txt

Notation:
v        one or more vowels
c        one or more consonants
(vc)m    vowels followed by consonants, repeated m times

Any word can be written: [c](vc)m[v]
m is called the measure of the word.
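The measure m can be computed by collapsing a word into its consonant/vowel pattern and counting the vc pairs. A minimal sketch in Python (the function name `measure` is our own; the treatment of y follows Porter's definition, where y counts as a vowel when preceded by a consonant):

```python
def measure(word):
    """Compute the Porter measure m: the number of vc sequences
    when the word is written in the form [c](vc)m[v]."""
    word = word.lower()

    def is_consonant(i):
        ch = word[i]
        if ch in "aeiou":
            return False
        if ch == "y":
            # y counts as a vowel when preceded by a consonant
            return i == 0 or not is_consonant(i - 1)
        return True

    # Collapse the word into its c/v pattern, e.g. "trouble" -> "cvcv"
    pattern = ""
    for i in range(len(word)):
        t = "c" if is_consonant(i) else "v"
        if not pattern or pattern[-1] != t:
            pattern += t
    # m is the number of "vc" pairs in the collapsed pattern
    return pattern.count("vc")

print(measure("tree"))     # 0
print(measure("trouble"))  # 1
print(measure("oaten"))    # 2
print(measure("private"))  # 2
```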

Porter's Stemmer: Complex Suffixes

Complex suffixes are removed bit by bit in the different steps. Thus:

GENERALIZATIONS
becomes GENERALIZATION (Step 1)
becomes GENERALIZE (Step 2)
becomes GENERAL (Step 3)
becomes GENER (Step 4)

Porter Stemmer: Step 1a

Suffix   Replacement   Examples
sses     ss            caresses -> caress
ies      i             ponies -> poni, ties -> ti
ss       ss            caress -> caress
s        null          cats -> cat
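Step 1a can be sketched directly as a longest-match cascade of suffix rules. A minimal Python sketch (the function name `step_1a` is our own):

```python
def step_1a(word):
    # Rules are tried longest suffix first (longest match wins)
    if word.endswith("sses"):
        return word[:-4] + "ss"   # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni, ties -> ti
    if word.endswith("ss"):
        return word               # caress -> caress (ss -> ss)
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("cats"))      # cat
```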

Porter Stemmer: Step 1b

Condition   Suffix   Replacement   Examples
(m > 0)     eed      ee            feed -> feed, agreed -> agree
(*v*)       ed       null          plastered -> plaster, bled -> bled
(*v*)       ing      null          motoring -> motor, sing -> sing

*v* - the stem contains a vowel

Porter Stemmer: Step 5a

Condition            Suffix   Replacement   Examples
(m > 1)              e        null          probate -> probat, rate -> rate
(m = 1 and not *o)   e        null          cease -> ceas

*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)

Stemming in Practice

Evaluation studies have found that stemming can affect retrieval performance, usually for the better, but the results are mixed.
• Effectiveness is dependent on the vocabulary. Fine distinctions may be lost through stemming.
• Automatic stemming is as effective as manual conflation.
• Performance of the various algorithms is similar.

Porter's algorithm is entirely empirical, but has proved to be an effective algorithm for stemming English text with trained users.

Selection of Tokens, Weights, Stop Lists and Stemming

Special-purpose collections (e.g., law, medicine, monographs):
Best results are obtained by tuning the search engine to the characteristics of the collection and the expected queries. It is valuable to use a training set of queries, with lists of relevant documents, to tune the system for each application.

General-purpose collections (e.g., news articles):
The modern practice is to use a basic weighting scheme (e.g., tf.idf), a simple definition of token, a short stop list, and little stemming except for plurals, with minimal conflation.

Web searching:
Combine similarity ranking with ranking based on document importance.

CS 430 / INFO 430 Information Retrieval Lecture 6 Boolean Methods

Exact Matching (Boolean Model)

[Diagram: documents and a query are processed against an index database; a mechanism determines whether each document matches the query, producing the set of hits.]

Boolean Queries

A Boolean query combines two or more search terms with logical operators, e.g., and, or, not.

Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor

Boolean Diagram

[Venn diagram of sets A and B, showing the regions A and B, A or B, and not (A or B).]

Adjacent and Near Operators

abacus adj actor
The terms abacus and actor are adjacent to each other, as in the string "abacus actor".

abacus near 4 actor
The terms abacus and actor are within 4 words of each other, as in the string "the actor has an abacus".

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Evaluation of Boolean Operators

Precedence of operators must be defined:

high:  adj, near
       and, not
low:   or

Example: A and B or C and B is evaluated as (A and B) or (C and B).

Evaluating a Boolean Query

Example: abacus and actor

Postings for abacus: 3, 19, 22
Postings for actor:  2, 19, 29

To evaluate the and operator, merge the two inverted lists with a logical AND operation. Document 19 is the only document that contains both terms, "abacus" and "actor".
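The AND merge is the classic linear-time intersection of two sorted postings lists. A minimal sketch:

```python
def intersect(p1, p2):
    """Merge two sorted postings lists with a logical AND:
    return the document IDs present in both lists."""
    hits, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            hits.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return hits

print(intersect([3, 19, 22], [2, 19, 29]))  # [19]
```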

Evaluating an Adjacency Operation

Example: abacus adj actor

Postings for abacus (document, location): (3, 94), (19, 7), (19, 212), (22, 56)
Postings for actor (document, location):  (2, 66), (19, 213), (29, 45)

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
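The adjacency operator can be evaluated the same way, but over positional postings. A simple sketch (quadratic for clarity; a real system would merge the sorted lists in one pass):

```python
def adjacent(p1, p2):
    """Evaluate 'term1 adj term2' over positional postings:
    lists of (document, location) pairs.  Returns the positions
    where term2 directly follows term1 in the same document."""
    matches = []
    for doc1, loc1 in p1:
        for doc2, loc2 in p2:
            if doc1 == doc2 and loc2 == loc1 + 1:
                matches.append((doc1, loc1))
    return matches

abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor  = [(2, 66), (19, 213), (29, 45)]
print(adjacent(abacus, actor))  # [(19, 212)]
```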

Query Matching: Boolean Methods

Query: (abacus or asp*) and actor

1. From the index file (word list), find the postings file for:
   "abacus"
   every word that begins "asp"
   "actor"
2. Merge these postings lists. For each document that occurs in any of the postings lists, evaluate the Boolean expression to see if it is true or false.

Step 2 should be carried out in a single pass.

Use of Postings File for Query Matching

Word list      Postings (document, location)
1  abacus      (3, 94), (19, 7), (19, 212), (22, 56)
2  actor       (2, 66), (19, 213), (29, 45)
3  aspen       (5, 43)
4  atoll       (3, 70), (34, 40)

Query Matching: Vector Ranking Methods

Query: abacus asp*

1. From the index file (word list), find the postings file for:
   "abacus"
   every word that begins "asp"
2. Merge these postings lists. Calculate the similarity to the query for each document that occurs in any of the postings lists.
3. Sort the similarities to obtain the results in ranked order.

Steps 2 and 3 should be carried out in a single pass.
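The single-pass accumulation can be sketched as follows. This is a simplified illustration, not the full vector model: the similarity is just the sum of matching term weights, whereas a real system would use tf.idf weights and length normalization; the function name `rank` and the data layout are our own.

```python
from collections import defaultdict

def rank(query_terms, postings):
    """Accumulate a score for every document that appears in any
    postings list, then sort.  `postings` maps each term to a
    {document_id: term_weight} dictionary."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc, weight in postings.get(term, {}).items():
            scores[doc] += weight
    # Sort by descending similarity to obtain the ranked results
    return sorted(scores.items(), key=lambda kv: -kv[1])

postings = {
    "abacus": {3: 0.5, 19: 0.8, 22: 0.2},
    "aspen":  {5: 0.9},
}
print(rank(["abacus", "aspen"], postings))
# [(5, 0.9), (19, 0.8), (3, 0.5), (22, 0.2)]
```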

Contrast of Ranking with Matching

With matching, a document either matches a query exactly or not at all:
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)

With retrieval using similarity measures, similarities range from 0 to 1 for all documents:
• Encourages long queries, to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document

Problems with the Boolean Model

Counter-intuitive results:

Query q = A and B and C and D and E
Document d has terms A, B, C and D, but not E.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.

Query q = A or B or C or D or E
Document d1 has terms A, B, C, D and E.
Document d2 has term A, but not B, C, D or E.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.

Problems with the Boolean Model (continued)

Boolean is all or nothing:
• The Boolean model has no way to rank documents.
• The Boolean model allows for no uncertainty in assigning index terms to documents.
• The Boolean model has no provision for adjusting the importance of query terms.

Extending the Boolean Model

Term weighting:
• Give weights to terms in documents and/or queries.
• Combine standard Boolean retrieval with vector ranking of results.

Fuzzy sets:
• Relax the boundaries of the sets used in Boolean retrieval.

Ranking Methods in Boolean Systems

SIRE (Syracuse Information Retrieval Experiment)

Term weights:
• Add term weights to documents. Weights are calculated by the standard method of term frequency * inverse document frequency.

Ranking:
• Calculate the results set by standard Boolean methods.
• Rank results by vector distances.

Relevance Feedback in SIRE

Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded:
• The results set is created by standard Boolean retrieval.
• The user selects one document from the results set.
• Other documents in the collection are ranked by vector distance from this document.

Boolean Model as Sets

[Diagram: an element d and a set A.] d is either in the set A or not in A.

Boolean Model as Fuzzy Sets

[Diagram: an element d near the blurred boundary of set A.] d is more or less in A.

Basic Concept

• A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
• Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
• For a given query, calculate the similarity between the query and each document in the collection.
• This calculation is needed for every document that has a non-zero weight for any of the terms in the query.

MMM: Mixed Min and Max Model

Fuzzy set theory:
dA is the degree of membership of an element in set A.

intersection (and):  dAB = min(dA, dB)
union (or):          dAB = max(dA, dB)

MMM: Mixed Min and Max Model

Fuzzy set theory example:

              standard set theory      fuzzy set theory
dA            1    1    0    0         0.5  0.5  0    0
dB            1    0    1    0         0.7  0    0.7  0
and: dAB      1    0    0    0         0.5  0    0    0
or:  dAB      1    1    1    0         0.7  0.5  0.7  0

MMM: Mixed Min and Max Model

Terms: A1, A2, ..., An
Document: d, with index-term weights: d1, d2, ..., dn

qor = (A1 or A2 or ... or An)

Query-document similarity:
S(qor, d) = λor * max(d1, d2, ..., dn) + (1 - λor) * min(d1, d2, ..., dn)

With regular Boolean logic, λor = 1.

MMM: Mixed Min and Max Model

Terms: A1, A2, ..., An
Document: d, with index-term weights: d1, d2, ..., dn

qand = (A1 and A2 and ... and An)

Query-document similarity:
S(qand, d) = λand * min(d1, ..., dn) + (1 - λand) * max(d1, ..., dn)

With regular Boolean logic, λand = 1.
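The two MMM similarity formulas can be sketched directly. The names `mmm_or`, `mmm_and`, `lam_or` and `lam_and` are our own labels for the softness coefficients; setting them to 1 recovers the strict fuzzy-set union (max) and intersection (min):

```python
def mmm_or(weights, lam_or):
    """MMM similarity for q = (A1 or ... or An), where `weights`
    are the document's index-term weights d1..dn."""
    return lam_or * max(weights) + (1 - lam_or) * min(weights)

def mmm_and(weights, lam_and):
    """MMM similarity for q = (A1 and ... and An)."""
    return lam_and * min(weights) + (1 - lam_and) * max(weights)

d = [0.5, 0.7, 0.0]
print(mmm_or(d, 0.8))   # 0.8*0.7 + 0.2*0.0 ≈ 0.56
print(mmm_and(d, 0.6))  # 0.6*0.0 + 0.4*0.7 ≈ 0.28
```

With the strict settings, a document missing one and-term scores 0 (the min dominates), which reproduces the all-or-nothing behaviour of the Boolean model described above.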

MMM: Mixed Min and Max Model

Experimental values:
λand in range [0.5, 0.8]
λor > 0.2

Computational cost is low. Retrieval performance is much improved.

Other Models

Paice model:
The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights. Computational cost is higher than MMM.

P-norm model:
Document D, with term weights: dA1, dA2, ..., dAn
Query terms are given weights: a1, a2, ..., an
Operators have coefficients that indicate their degree of strictness.
Query-document similarity is calculated by considering each document and query as a point in n-space.

Test Data

          CISI   CACM   INSPEC
P-norm     79    106    210
Paice      77    104    206
MMM        68    109    195

Percentage improvement over standard Boolean model (average best precision). Lee and Fox, 1988.

Reading

E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models. Frakes, Chapter 15.

Methods based on fuzzy set concepts.