Current Topics in Information Access: IR Background

Current Topics in Information Access: IR Background Marti Hearst Fall ‘98

Last Time
The problem of information access
Matching task and search type
Why text is tough
Current topics and issues

Today: Background on IR Basics
System architecture
Tokenization
Boolean queries
Term weighting and ranking algorithms
Inverted indices
Evaluation

Some IR History
Roots in the scientific “Information Explosion” following WWII
Interest in computer-based IR from the mid-1950s
H.P. Luhn at IBM (1958)
Probabilistic models at Rand (Maron & Kuhns) (1960)
Boolean system development at Lockheed (’60s)
Vector Space Model (Salton at Cornell, 1965)
Statistical weighting methods and theoretical advances (’70s)
Refinements and advances in application (’80s)
User interfaces, large-scale testing and application (’90s)

Information Retrieval Task Statement: Build a system that retrieves documents that users are likely to find relevant to their information needs.

Structure of an IR System (adapted from Soergel, p. 19)
Search line: interest profiles and queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1).
Storage line: documents and data are indexed (descriptive and subject indexing) and stored as document representations (Store 2).
Rules of the game = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language).
Comparison/matching of the two stores yields potentially relevant documents.

(Diagram) Query side: the user’s information need is expressed as text input and parsed into a query.

(Diagram) Document side: collections are pre-processed and indexed.

(Diagram) Putting the two sides together: the parsed query is ranked or matched against the index built from the pre-processed collections.

(Diagram) With feedback: query reformulation feeds the results of ranking/matching back into the query.

Steps in a “typical IR System”
Document preprocessing: tokenization, stemming/normalizing
Query processing: interpretation of syntax, (query expansion)
Retrieval of documents according to similarity to the query
Presentation of retrieval results
Relevance feedback / query reformulation

Stemming and Morphological Analysis
Goal: “normalize” similar words
Morphology (“form” of words):
Inflectional morphology, e.g., inflected verb endings and noun number; never changes grammatical class (dog, dogs)
Derivational morphology: derives one word from another; often changes grammatical class (build, building; health, healthy)

Automated Methods
Powerful multilingual tools exist for morphological analysis (PCKimmo, Xerox lexical technology); they require a grammar and dictionary and use “two-level” automata.
Stemmers: very dumb rules work well (for English)
Porter Stemmer: iteratively remove suffixes
Improvement: pass results through a lexicon (once considered too expensive because it requires storing a dictionary)
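
To make the “dumb rules” point concrete, here is a minimal sketch of a suffix-stripping stemmer in the spirit of (but much simpler than) the Porter algorithm; the rules and the minimum-stem-length check are illustrative only, not the real Porter rule set.

```python
# Toy suffix-stripping stemmer: illustrative rules only, not the real Porter rules.
SUFFIX_RULES = [
    ("sses", "ss"),   # caresses -> caress
    ("ies",  "i"),    # ponies   -> poni
    ("ing",  ""),     # building -> build (real Porter also repairs doubled letters)
    ("ed",   ""),     # plastered -> plaster
    ("s",    ""),     # dogs     -> dog
]

def stem(word: str) -> str:
    """Apply the first matching suffix rule; real stemmers iterate over several phases."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

if __name__ == "__main__":
    for w in ["dogs", "ponies", "building", "health"]:
        print(w, "->", stem(w))
```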

Errors Generated by Porter Stemmer (Krovetz 93)

Query Languages
A way to express the question (information need)
Types: Boolean, natural language, stylized natural language, form-based (GUI)

Simple query language: Boolean
Terms + connectors
Terms: words, normalized (stemmed) words, phrases, thesaurus terms
Connectors: AND, OR, NOT

Boolean Queries
Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)

Boolean Queries: (Cat OR Dog) AND (Collar OR Leash)
Each of the following combinations works: (table of term-presence patterns in which at least one of Cat/Dog and at least one of Collar/Leash occurs in the document)

Boolean Queries: (Cat OR Dog) AND (Collar OR Leash)
None of the following combinations works: (table of term-presence patterns missing either both of Cat/Dog or both of Collar/Leash)
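
As a sketch of how such a query could be evaluated, the snippet below represents each document as a set of terms and tests the (Cat OR Dog) AND (Collar OR Leash) condition directly; the document contents are made up for illustration.

```python
# Minimal sketch of Boolean retrieval over documents represented as term sets.
docs = {
    "d1": {"cat", "collar"},
    "d2": {"dog", "leash"},
    "d3": {"cat", "dog"},          # matches (Cat OR Dog) but has no Collar/Leash
    "d4": {"collar", "leash"},     # has Collar/Leash but neither Cat nor Dog
}

def matches(terms: set) -> bool:
    """Evaluate (cat OR dog) AND (collar OR leash)."""
    return bool(terms & {"cat", "dog"}) and bool(terms & {"collar", "leash"})

hits = [doc_id for doc_id, terms in docs.items() if matches(terms)]
print(hits)   # ['d1', 'd2']
```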

Boolean Logic (Venn diagram of two sets, A and B)

Boolean Searching
Information need: “Measurement of the width of cracks in prestressed concrete beams”
Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)
Formal query: cracks AND beams AND width_measurement AND prestressed_concrete
Relaxed query (any three of the four concepts): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Boolean Problems
Disjunctive (OR) queries lead to too many results, often off-target
Conjunctive (AND) queries lead to reduced, and commonly zero, results
Not intuitive to most people

Advantages and Disadvantages of the Boolean Model
Advantages: complete expressiveness for any identifiable subset of the collection; exact and simple to program; the whole panoply of Boolean algebra is available
Disadvantages: complex query syntax is often misunderstood (if understood at all); problems of null output and information overload; output is not ordered in any useful fashion

Pseudo-Boolean Queries
A new notation, from web search: +cat dog +collar leash
Does not mean the same thing! Need a way to group combinations.
Phrases: “stray cat” AND “frayed collar”, or +“stray cat” +“frayed collar”

Boolean Extensions
Fuzzy logic: adds weights to each term/concept; t_a AND t_b is interpreted as MIN(w(t_a), w(t_b)); t_a OR t_b is interpreted as MAX(w(t_a), w(t_b))
Proximity/adjacency operators: interpreted as additional constraints on Boolean AND
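
A tiny sketch of the fuzzy interpretation above, with made-up term weights in [0, 1]:

```python
# Fuzzy-Boolean scoring: term weights in [0, 1], AND -> min, OR -> max.
weights = {"cat": 0.9, "dog": 0.2, "collar": 0.7, "leash": 0.0}   # made-up weights

def AND(*scores): return min(scores)
def OR(*scores):  return max(scores)

# (cat OR dog) AND (collar OR leash)
score = AND(OR(weights["cat"], weights["dog"]),
            OR(weights["collar"], weights["leash"]))
print(score)   # 0.7
```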

Ranking Algorithms Assign weights to the terms in the query. Assign weights to the terms in the documents. Compare the weighted query terms to the weighted document terms. Rank order the results.

(Diagram, repeated) The user’s information need becomes a parsed query that is ranked or matched against the index built from the pre-processed collections.

Indexing and Representation: The Vector Space Model
Document represented by a vector of terms: words (or word stems), phrases (e.g. computer science)
Removes words on a “stop list” (documents aren’t about “the”)
Terms are often assumed to be uncorrelated.
Correlations between term vectors imply a similarity between documents.
For efficiency, an inverted index of terms is often stored.

Document Representation: What values to use for terms?
Boolean: term present / absent
tf (term frequency): count of times the term occurs in the document. The more times a term t occurs in document d, the more likely it is that t is relevant to the document. Used alone, it favors common words and long documents.
df (document frequency): the more a term t occurs throughout all documents, the more poorly t discriminates between documents.
tf-idf (term frequency * inverse document frequency): a high value indicates that the word occurs more often in this document than average.
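
A minimal sketch of one common tf-idf formulation (raw term frequency times log of N over document frequency); real systems use many weighting variants, and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the ball".split(),
    "d3": "the dogs played in the park".split(),
}

N = len(docs)
df = Counter()                                 # document frequency of each term
for tokens in docs.values():
    df.update(set(tokens))

def tf_idf(term: str, tokens: list) -> float:
    tf = tokens.count(term)                    # term frequency in this document
    return tf * math.log(N / df[term]) if term in df else 0.0

print(tf_idf("cat", docs["d1"]))               # rare term -> non-zero weight
print(tf_idf("the", docs["d1"]))               # occurs in every document -> weight 0
```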

Document Vectors
Documents are represented as “bags of words”
Represented as vectors when used computationally
A vector is like an array of floating-point numbers; it has direction and magnitude
Each vector holds a place for every term in the collection
Therefore, most vectors are sparse

Vector Representation Documents and Queries are represented as vectors. Position 1 corresponds to term 1, position 2 to term 2, position t to term t

Document Vectors (example table)
Documents A–I represented as weighted term vectors over the terms nova, galaxy, heat, h’wood, film, role, diet, fur; most entries are empty, so the vectors are sparse.

Assigning Weights Want to weight terms highly if they are frequent in relevant documents … BUT infrequent in the collection as a whole

Assigning Weights tf x idf measure: term frequency (tf) inverse document frequency (idf)

tf x idf Normalize the term weights (so longer documents are not unfairly given more weight)

tf x idf normalization
Normalize the term weights (so longer documents are not unfairly given more weight)
To normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive.
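
A sketch of one common way to do this, length-normalizing (cosine-normalizing) a document’s weight vector; the unnormalized weights below are made up.

```python
import math

def normalize(weights: dict) -> dict:
    """Divide every weight by the vector's Euclidean length, giving a unit-length vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

doc = {"nova": 2.0, "galaxy": 1.0, "heat": 0.5}   # unnormalized tf-idf weights (made up)
print(normalize(doc))                              # the vector now has unit length
```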

Vector Space Similarity Measure: combine tf x idf into a similarity measure

Computing Similarity Scores (figure: vectors plotted in a two-dimensional term space, with both axes running from 0 to 1)

Documents in Vector Space

Computing a similarity score

Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient
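
The set-based (binary) forms of these coefficients can be written compactly; the sketch below assumes the query and document are just term sets, whereas the weighted vector-space versions replace set sizes with dot products and vector norms.

```python
import math

def simple_match(a, b): return len(a & b)                         # coordination-level match
def dice(a, b):         return 2 * len(a & b) / (len(a) + len(b))
def jaccard(a, b):      return len(a & b) / len(a | b)
def cosine(a, b):       return len(a & b) / math.sqrt(len(a) * len(b))
def overlap(a, b):      return len(a & b) / min(len(a), len(b))

q = {"cat", "collar"}                        # made-up query terms
d = {"cat", "collar", "leash", "dog"}        # made-up document terms
print(simple_match(q, d), dice(q, d), jaccard(q, d), cosine(q, d), overlap(q, d))
```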

Problems with Vector Space
There is no real theoretical basis for the assumption of a term space; it is more for visualization than having any real basis, and most similarity measures work about the same regardless of model.
Terms are not really orthogonal dimensions: terms are not independent of all other terms.

Probabilistic Models
A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle)
Relies on accurate estimates of probabilities for accurate results

Probabilistic Retrieval
Goes back to the 1960s (Maron and Kuhns)
Robertson’s “Probabilistic Ranking Principle”: retrieved documents should be ranked in decreasing probability that they are relevant to the user’s query.
How to estimate these probabilities? Several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done.

Probabilistic Models: Some Notation
D = all present and future documents
Q = all present and future queries
(D_i, Q_j) = a document-query pair
x = class of similar documents, y = class of similar queries
Relevance is a relation over document-query pairs (a subset of D x Q).

Probabilistic Models: Logistic Regression
The probability of relevance is based on logistic regression; a sample set of documents is used to determine the values of the coefficients.
At retrieval time the probability estimate is obtained from the fitted model, using the 6 X attribute measures shown next.

Probabilistic Models: Logistic Regression attributes
Average absolute query frequency
Query length
Average absolute document frequency
Document length
Average inverse document frequency
Inverse document frequency
Number of terms in common between query and document -- logged

Probabilistic Models: Logistic Regression
Estimates of relevance are based on a log-linear model with various statistical measures of document content as independent variables.
The log odds of relevance is a linear function of the attributes, with the term contributions summed; the probability of relevance is obtained by inverting the log odds (the logistic transform).
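
A sketch of that last step (linear log odds, then inverse logit); the coefficients and attribute values below are entirely made up, not the ones fitted in any real system.

```python
import math

coefficients = [-1.2, 0.8, 0.05, 0.6, -0.01, 1.1, 0.9]   # c0 plus one weight per attribute (made up)
attributes   = [2.0, 3.0, 1.5, 120.0, 4.2, 2.0]          # the six X measures for one (Q, D) pair (made up)

# Log odds of relevance as a linear function of the attributes.
log_odds = coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], attributes))

# Probability of relevance is the inverse logit of the log odds.
prob_relevance = 1.0 / (1.0 + math.exp(-log_odds))
print(prob_relevance)
```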

Probabilistic Models: Advantages and Disadvantages
Advantages: strong theoretical basis; in principle should supply the best predictions of relevance given the available information; can be implemented similarly to the vector model
Disadvantages: relevance information is required -- or is “guestimated”; important indicators of relevance may not be terms -- though terms only are usually used; optimally requires on-going collection of relevance information

Vector and Probabilistic Models
Both support “natural language” queries, treat documents and queries the same, support relevance feedback searching, and support ranked retrieval.
They differ primarily in theoretical basis and in how the ranking is calculated: the vector model assumes relevance, while the probabilistic model relies on relevance judgments or estimates.

Simple Presentation of Results
Order by similarity: decreasing order of presumed relevance; items retrieved early in the search may help generate feedback, via relevance feedback
Select the top k documents
Select documents within a given similarity threshold of the query

Evaluation
Relevance
Evaluation of IR systems
Precision vs. recall
Cutoff points
Test collections / TREC
Blair & Maron study

What to Evaluate? How much learned about the collection? How much learned about a topic? How much of the information need is satisfied? How inviting the system is?

What to Evaluate?
What can be measured that reflects users’ ability to use the system? (Cleverdon 66)
Coverage of information
Form of presentation
Effort required / ease of use
Time and space efficiency
Recall: proportion of relevant material actually retrieved
Precision: proportion of retrieved material actually relevant
(Recall and precision together measure effectiveness.)

Relevance In what ways can a document be relevant to a query? Answer precise question precisely. Partially answer question. Suggest a source for more information. Give background information. Remind the user of other knowledge. Others ...

Standard IR Evaluation (diagram: the retrieved documents as a subset of the whole collection)
Precision = (# relevant retrieved) / (# retrieved)
Recall = (# relevant retrieved) / (# relevant in collection)
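
A sketch of computing both measures for a single query; the relevance judgments and the retrieved list below are made up for illustration.

```python
relevant  = {"d2", "d4", "d7", "d9"}          # all relevant documents in the collection
retrieved = ["d1", "d2", "d3", "d4", "d5"]    # what the system returned

relevant_retrieved = [d for d in retrieved if d in relevant]
precision = len(relevant_retrieved) / len(retrieved)     # 2 / 5 = 0.4
recall    = len(relevant_retrieved) / len(relevant)      # 2 / 4 = 0.5
print(precision, recall)
```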

Precision/Recall Curves
There is a tradeoff between precision and recall, so measure precision at different levels of recall.
(Figure: precision plotted against recall, with measured points along the curve.)

Precision/Recall Curves
It is difficult to determine which of these two hypothetical results is better.
(Figure: two hypothetical precision-recall curves.)

Precision/Recall Curves

Document Cutoff Levels
Another way to evaluate: fix the number of documents retrieved at several levels (top 5, top 10, top 20, top 50, top 100, top 500), measure precision at each of these levels, and take a (weighted) average over the results.
This is a way to focus on high precision.
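
A sketch of precision at fixed cutoffs (precision@k); the ranking and judgments are made up.

```python
relevant = {"d2", "d4", "d7", "d9", "d12"}                                  # made-up judgments
ranked   = ["d2", "d1", "d4", "d3", "d5", "d7", "d6", "d8", "d9", "d10"]   # made-up ranking

def precision_at(k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / k

for k in (5, 10):
    print(k, precision_at(k))    # precision@5 = 2/5, precision@10 = 4/10
```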

The E-Measure
Combines precision and recall into one number (van Rijsbergen 79)
P = precision, R = recall, b = measure of the relative importance of P or R
For example, b = 0.5 means the user is twice as interested in precision as recall
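
A sketch of the measure in the (1 + b^2) form commonly attributed to van Rijsbergen; lower E is better.

```python
# E = 1 - (1 + b^2) * P * R / (b^2 * P + R)
def e_measure(precision: float, recall: float, b: float = 1.0) -> float:
    return 1.0 - (1.0 + b * b) * precision * recall / (b * b * precision + recall)

print(e_measure(0.4, 0.5, b=0.5))   # b = 0.5 weights precision more heavily than recall
```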

TREC
Text REtrieval Conference/Competition, run by NIST (National Institute of Standards & Technology); 1997 was the 6th year.
Collection: 3 gigabytes, >1 million docs; newswire and full-text news (AP, WSJ, Ziff); government documents (Federal Register)
Queries + relevance judgments: queries devised and judged by “information specialists”; relevance judgments done only for those documents retrieved -- not the entire collection!
Competition: various research and commercial groups compete; results judged on precision and recall, going up to a recall level of 1000 documents

Sample TREC queries (topics)
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
<narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

TREC
Benefits: made research systems scale to large collections (pre-WWW); allows for somewhat controlled comparisons
Drawbacks: emphasis on high recall, which may be unrealistic for what most users want; very long queries, also unrealistic; comparisons still difficult to make, because systems are quite different on many dimensions; focus on batch ranking rather than interaction; no focus on the WWW

TREC Results
For the main track: the best systems are not statistically significantly different; small differences sometimes have big effects (how good was the hyphenation model, how was document length taken into account); systems were optimized for longer queries and all performed worse for shorter, more realistic queries.
The excitement is in the new tracks, which differ each year: Interactive, Multilingual, NLP.

Blair and Maron 1985
A highly influential paper; a classic study of retrieval effectiveness (earlier studies were on unrealistically small collections)
Studied an archive of documents for a legal suit: ~350,000 pages of text, 40 queries, a focus on high recall, using IBM’s STAIRS full-text system
Main result: the system retrieved less than 20% of the relevant documents for particular information needs, when the lawyers thought they had 75%; but many queries had very high precision

Blair and Maron, cont.
Why recall was low: users can’t foresee the exact words and phrases that will indicate relevant documents (“accident” referred to by those responsible as “event,” “incident,” “situation,” “problem,” …); differing technical terminology; slang, misspellings
Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

(Diagram, repeated) The information need becomes a parsed query that is ranked or matched against the index built from the pre-processed collections.

Creating a Keyword Index
For each document: tokenize the document (break it up into tokens: words, stems, punctuation; there are many variations on this) and record which tokens occurred in this document
The result is called an inverted index
Dictionary: a record of all the tokens in the collection and their overall frequency
Postings file: a list recording, for each token, which documents it occurs in and how often

Inverted files
Permit fast search for individual terms
The search result for each term is a list of document IDs (and optionally, frequency and/or positional information)
These lists can be used to solve Boolean queries:
country: d1, d2
manor: d2
country AND manor: d2
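
A sketch of building a tiny inverted index and answering a Boolean AND query by intersecting posting lists, mirroring the country/manor example above (the two documents are the same toy texts used in the next slide).

```python
from collections import defaultdict

docs = {
    "d1": "now is the time for all good men to come to the aid of their country",
    "d2": "it was a dark and stormy night in the country manor",
}

index = defaultdict(set)                      # term -> set of document IDs (postings)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

print(sorted(index["country"]))                     # ['d1', 'd2']
print(sorted(index["manor"]))                       # ['d2']
print(sorted(index["country"] & index["manor"]))    # AND = posting-list intersection -> ['d2']
```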

Inverted Files Lots of alternative implementations E.g.: Cheshire builds within-document frequency using a hash table during parsing Document IDs and frequency info are stored in a B-tree index keyed by the term.

How Inverted Files Are Created
Documents are parsed to extract tokens. These are saved with the document ID.
Doc 1: “Now is the time for all good men to come to the aid of their country”
Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight.”

How Inverted Files are Created After all documents have been parsed the inverted file is sorted

How Inverted Files are Created Multiple term entries for a single document are merged and frequency information added

How Inverted Files are Created Then the file is split into a Dictionary and a Postings file
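
A sketch of the whole parse -> sort -> merge -> split pipeline described in the last few slides, using the two example documents from above.

```python
from collections import Counter, defaultdict
from itertools import groupby

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# 1. Parse: one (term, doc_id) pair per token occurrence.
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in text.split()]

# 2. Sort the pairs by term, then by document ID.
pairs.sort()

# 3. Merge duplicate entries into (term, doc_id, within-document frequency).
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(pairs)]

# 4. Split into a dictionary (term -> overall frequency) and a postings file.
dictionary = Counter()
postings = defaultdict(list)                  # term -> [(doc_id, freq), ...]
for term, doc_id, freq in merged:
    dictionary[term] += freq
    postings[term].append((doc_id, freq))

print(dictionary["the"], postings["the"])     # 4 [(1, 2), (2, 2)]
```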

An Example IR System: CHESHIRE (Ray Larson, SIMS)
Full-service full-text search
Client/server architecture
Z39.50 IR protocol
Interprets documents written in SGML
Probabilistic ranking

Next Time: Background on User Interfaces and Information Access