LIS618 lecture 3 Thomas Krichel 2003-02-13

Structure of talk –document preprocessing –basic ingredients of query languages –retrieval performance evaluation

document preprocessing There are some operations that may be applied to documents before indexing –lexical analysis –stemming of words –elimination of stop words –selection of index terms –construction of term categorization structures We will look at these in turn. In many cases, document preprocessing is not well documented by the provider, but searchers need to be aware of it.

lexical analysis divides a stream of characters into a stream of words. This seems easy enough, but… should we keep numbers? What about hyphens? Compare "state-of-the-art" with "b-52". Punctuation is usually removed, but consider "333B.C.". And casing: compare "bank" with "Bank".
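
The effect of these choices can be shown with a small Python sketch (purely illustrative, not the rule set of any particular engine): two tokenization policies produce different index terms from the same text.

import re

def tokenize_simple(text):
    # split on any non-alphanumeric character and lowercase everything
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def tokenize_keep_hyphens(text):
    # keep hyphenated forms together and preserve case
    return re.findall(r"[0-9A-Za-z]+(?:-[0-9A-Za-z]+)*", text)

text = 'The state-of-the-art b-52 flew over the Bank in 333B.C.'
print(tokenize_simple(text))
# ['the', 'state', 'of', 'the', 'art', 'b', '52', 'flew', 'over', 'the', 'bank', 'in', '333b', 'c']
print(tokenize_keep_hyphens(text))
# ['The', 'state-of-the-art', 'b-52', 'flew', 'over', 'the', 'Bank', 'in', '333B', 'C']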

stemming In general, users search for the occurrence of a term irrespective of its grammatical form. Plurals, gerunds and past-tense forms can be subject to stemming. The most important algorithm is by Porter. Evidence about the effect of stemming on retrieval performance is mixed, and stemming is relatively rare these days.
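
A quick illustration: Porter's algorithm is available in the NLTK package for Python (this assumes NLTK is installed; it is not part of the lecture's own material).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connections", "ponies"]:
    print(word, "->", stemmer.stem(word))
# connect, connected, connecting and connections all reduce to "connect";
# "ponies" becomes "poni"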

elimination of stop words Some words carry little meaning and can be eliminated. In fact, any word that appears in 80% of all documents is pretty much useless for distinguishing between them. But consider a searcher looking for "to be or not to be". It is often better to reduce the index weight of terms that appear very frequently than to drop them altogether.
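
A minimal sketch of the 80% rule of thumb, with a made-up toy corpus: the stop-word list is derived from document frequency rather than taken from a fixed list.

from collections import Counter

docs = [
    "to be or not to be",
    "the question is to be answered",
    "to index or to search",
    "how to rank documents by relevance",
]

df = Counter()
for doc in docs:
    df.update(set(doc.lower().split()))      # document frequency, not term frequency

threshold = 0.8 * len(docs)
stop_words = {word for word, n in df.items() if n >= threshold}
print(stop_words)   # {'to'} -- "to" appears in all four documents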

index term selection Some engines try to capture nouns only. Nouns that frequently appear together can be treated as a single index term, such as "computer science". Dialog deals with this through phrase indexing. Most web engines, however, index all words, and each of them individually.
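
A rough sketch of how such phrase terms could be detected automatically: count adjacent word pairs and keep the frequent ones as candidates (the corpus and the threshold of 2 are made up for illustration).

from collections import Counter

docs = [
    "the computer science department teaches computer science",
    "information retrieval is part of computer science",
    "science of information",
]

bigrams = Counter()
for doc in docs:
    words = doc.lower().split()
    bigrams.update(zip(words, words[1:]))    # adjacent word pairs

phrases = [pair for pair, n in bigrams.items() if n >= 2]
print(phrases)   # [('computer', 'science')]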

thesauri A thesaurus is a list of words and, for each word, a list of related words –synonyms –broader terms –narrower terms It is used –to provide a consistent vocabulary for indexing and searching –to assist users with locating terms for query formulation –to allow users to broaden or narrow a query

use of thesauri Thesauri are limited to experimental systems or to some high-quality systems. They can be confusing to users. Frequently the relationship between terms in the query is badly served by the relationships in the thesaurus. Thus thesaurus expansion of an initial query, if performed automatically, can lead to bad results.
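
A toy sketch of automatic query expansion with a hand-made thesaurus (all entries are invented) shows why blind expansion can hurt: expanding every term may pull in documents about a different sense of the word.

thesaurus = {
    "car":  {"synonyms": ["automobile"], "broader": ["vehicle"], "narrower": ["sedan"]},
    "bank": {"synonyms": ["financial institution"], "broader": ["organization"],
             "narrower": ["savings bank", "river bank"]},
}

def expand(query_terms, relation="synonyms"):
    expanded = []
    for term in query_terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, {}).get(relation, []))
    return expanded

print(expand(["car", "loan"]))                # ['car', 'automobile', 'loan']
print(expand(["bank"], relation="narrower"))  # ['bank', 'savings bank', 'river bank']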

simple queries single-word queries –one word only –hopefully some word combinations are understood as one word, e.g. "on-line" context queries –phrase queries (be aware of stop words) –proximity queries, which generalize phrase queries Boolean queries
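
For Boolean queries, a minimal sketch over a made-up inverted index: AND is set intersection, OR is set union.

docs = {
    1: "information retrieval and web search",
    2: "boolean retrieval models",
    3: "web page ranking",
}

# build an inverted index: term -> set of document ids
index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result

print(boolean_and("retrieval", "web"))   # {1}
print(boolean_or("boolean", "ranking"))  # {2, 3}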

simple pattern queries prefix queries (e.g. "anal" matches "analogy") suffix queries (e.g. "oral" matches "choral") substring queries (e.g. "al" matches "talk") range queries (e.g. from "held" to "hero") queries within a distance of the query term, usually the Levenshtein distance, i.e. the minimum number of insertions, deletions, and replacements
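
The Levenshtein distance itself can be computed with a short dynamic program; the sketch below is a standard textbook implementation, not tied to any particular retrieval system.

def levenshtein(a: str, b: str) -> int:
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement
        previous = current
    return previous[-1]

print(levenshtein("color", "colour"))    # 1
print(levenshtein("survey", "surgery"))  # 2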

regular expressions come from UNIX computing. They are built from strings in which certain characters are metacharacters. Example: "pro(blem|tein)s?" matches "problem", "problems", "protein" and "proteins". Example: "New.*y" matches "New Jersey", "New York City", and even the misspelling "New Delhy". There is a great variety of dialects, usually very powerful. Regular expressions are extremely important in digital libraries.
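
The two example patterns can be checked with Python's re module:

import re

pattern = re.compile(r"pro(blem|tein)s?")
for word in ["problem", "problems", "protein", "proteins", "process"]:
    print(word, bool(pattern.fullmatch(word)))
# the first four match; "process" does not

places = re.compile(r"New.*y")
for place in ["New Jersey", "New York City", "New Delhy", "New Delhi"]:
    print(place, bool(places.search(place)))
# the first three match; "New Delhi" does not, because it contains no "y"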

structured queries make use of document structure. The simplest example is when the documents are database records: we can then search for terms in a certain field only. If there is sufficient structure to a field's contents, the field can be interpreted as meaning something different from the words it contains. Example: dates.
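
A sketch of fielded search over made-up records; note how the date field can be interpreted as a date rather than as a string of words.

from datetime import date

records = [
    {"author": "Krichel", "title": "Document preprocessing", "date": date(2003, 2, 13)},
    {"author": "Porter", "title": "An algorithm for suffix stripping", "date": date(1980, 7, 1)},
]

def field_search(records, field, value):
    # match a term inside one named field only
    return [r for r in records if value.lower() in str(r.get(field, "")).lower()]

def date_range(records, start, end):
    # interpret the date field as a date, so range queries make sense
    return [r for r in records if start <= r["date"] <= end]

print(field_search(records, "author", "porter"))                    # the Porter record
print(date_range(records, date(2000, 1, 1), date(2005, 12, 31)))    # the Krichel record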

query protocols There are some standard query languages –Z39.50 queries –CCL, the "Common Command Language", a development of Z39.50 –CD-RDx, "Compact Disk Read only Data exchange", supported by US government agencies such as the CIA and NASA –SFQL, the "Structured Full-text Query Language", built on SQL

Thank you for your attention!

retrieval performance evaluation "Recall" and "Precision" are two classic measures of the performance of information retrieval for a single query. Both assume that there is an answer set of documents that contain the answer to the query. Performance is optimal if –the database returns all the documents in the answer set –the database returns only documents in the answer set Recall is the fraction of the relevant documents that the query result has captured. Precision is the fraction of the retrieved documents that are relevant.
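
In code, with the answer (relevant) set and the retrieved set represented as Python sets (the document identifiers are made up):

relevant  = {"d1", "d2", "d3", "d4", "d5"}   # the answer set
retrieved = {"d1", "d3", "d7", "d9"}         # what the query returned

hits = relevant & retrieved
recall    = len(hits) / len(relevant)        # fraction of relevant documents captured
precision = len(hits) / len(retrieved)       # fraction of retrieved documents that are relevant
print(recall, precision)                     # 0.4 0.5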

recall and precision curves Assume that all the retrieved documents arrive at once and are examined one by one. During that process the user discovers more and more relevant documents, so recall increases. At the same time, at least eventually, fewer and fewer of the examined documents are useful, so precision (usually) declines. This can be represented as a curve.

Example Let the answer set be {0,1,2,3,4,5,6,7,8,9}, with non-relevant documents represented by letters. A query returns the following ranked result: 7,a,3,b,c,9,n,j,l,5,r,o,s,e,4. After the first document, (recall, precision) is (10%, 100%); after the third, (20%, 66%); after the sixth, (30%, 50%); after the tenth, (40%, 40%), etc.
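
The numbers on this slide can be reproduced with a few lines of Python:

relevant = set("0123456789")
ranking = list("7a3bc9njl5rose4")   # the ranked result from the slide

found = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        found += 1
        print(f"rank {rank}: recall {found / len(relevant):.0%}, precision {found / rank:.0%}")
# rank 1: recall 10%, precision 100%
# rank 3: recall 20%, precision 67%
# rank 6: recall 30%, precision 50%
# rank 10: recall 40%, precision 40%
# rank 15: recall 50%, precision 33%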

recall/precision curves Such curves can be formed for each query. An average curve over several queries can be calculated at each recall level. Recall and precision levels can also be used to calculate two single-valued summaries –average precision at seen relevant documents –R-precision

average precision at seen relevant documents To find it, sum the precision levels obtained at each new relevant document discovered by the user and divide by the number of relevant documents seen. In our example, this is (1 + 2/3 + 1/2 + 2/5 + 1/3) / 5 ≈ 0.58. This measure favors retrieval methods that get the relevant documents to the top.
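
Continuing the same sketch, the measure can be computed directly from the ranked result:

relevant = set("0123456789")
ranking = list("7a3bc9njl5rose4")

precisions = []
found = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        found += 1
        precisions.append(found / rank)      # precision at each seen relevant document

print(round(sum(precisions) / len(precisions), 2))   # 0.58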

R-precision A more ad hoc measure. Let R be the size of the answer set. Take the first R results of the query, count the relevant documents among them, and divide by R. In our example, the R-precision is 4/10 = 0.4. An average can be calculated over a number of queries.
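
And R-precision for the same example, with R = 10 (the size of the answer set):

relevant = set("0123456789")
ranking = list("7a3bc9njl5rose4")

R = len(relevant)
r_precision = sum(1 for doc in ranking[:R] if doc in relevant) / R
print(r_precision)   # 0.4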

critique of recall & precision Recall has to be estimated by an expert. Recall is very difficult to estimate in a large collection. Both measures focus on a single query, and no serious user works like this. There are other measures, but they belong in a more advanced course in IR.