Information Retrieval Review

LBSC 796/INFM 718R

Structure of IR Systems
- IR process model
- System architecture
- Information needs: visceral, conscious, formalized, compromised
- Utility vs. relevance
- Known-item vs. ad hoc search

Supporting the Search Process
[Diagram: the search process as a loop between user and system. User side: Source Selection → Query Formulation → Search → Selection → Examination → Delivery, passing a Query to the system, which returns a Ranked List and, ultimately, a Document. System side: Document Acquisition → Collection → Indexing → Index, which the Search step consults.]

Relevance
- Relevance relates a topic and a document
  - Duplicates are equally relevant, by definition
  - Constant over time and across users
- Pertinence relates a task and a document
  - Accounts for quality, complexity, language, …
- Utility relates a user and a document
  - Accounts for prior knowledge

Taylor’s Model of Question Formation
- Q1: Visceral need
- Q2: Conscious need
- Q3: Formalized need
- Q4: Compromised need (the query)
[Diagram: “End-user Search” and “Intermediated Search” label two paths through these levels: end-user search works directly from the compromised need, while intermediated search can probe back toward the conscious need.]

Evidence from Content and Ranked Retrieval
- Inverted indexing: postings, postings file
- Bag of terms: segmentation, phrases, stemming, stopwords
- Boolean retrieval
- Vector space ranked retrieval: TF, IDF, length normalization, BM25
- Blind relevance feedback

An “Inverted Index”
The dictionary (term index) maps each term to a postings list of the documents that contain it. For the slide’s eight-document example:

  aid:    4, 8
  all:    2, 4, 6
  back:   1, 3, 7
  brown:  1, 3, 5, 7
  come:   2, 4, 6, 8
  dog:    3, 5
  fox:    3, 5, 7
  good:   2, 4, 6, 8
  jump:   3
  lazy:   1, 3, 5, 7
  men:    2, 4, 8
  now:    2, 6, 8
  over:   1, 3, 5, 7, 8
  party:  6, 8
  quick:  1, 3
  their:  1, 5, 7
  time:   2, 4, 6
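
Building such an index is straightforward. A minimal sketch in Python (the two sample sentences are assumptions; the slide’s postings are consistent with alternating copies of two such pangram-style documents):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical two-document collection.
docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "now is the time for all good men to come to the aid of their party",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1]
print(index["good"])   # [2]
```

A real postings file would also store term frequencies and positions, and would live on disk rather than in a dict.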

A Partial Solution: TF*IDF
- High TF is evidence of meaning
- Low DF is evidence of term importance (equivalently, high “IDF”)
- Multiply them to get a “term weight”
- Add up the weights for each query term
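
The slide leaves the exact weighting formula open; one common instantiation, and the one consistent with the IDF values in the cosine example on the next slide (0.301, 0.125, 0.602 match a base-10 logarithm with N = 4), is:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_{10}\frac{N}{\mathrm{df}_t}
\qquad
\mathrm{score}(q,d) = \sum_{t \in q} w_{t,d}
```

where tf_{t,d} counts occurrences of term t in document d, df_t is the number of documents containing t, and N is the collection size.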

Cosine Normalization Example

               raw tf (docs 1-4)    idf     tf*idf weights (docs 1-4)   normalized (docs 1-4)
complicated      -   -   5   2     0.301      -     -   1.51  0.60        -     -   0.57  0.69
contaminated     4   1   3   -     0.125    0.50  0.13  0.38    -       0.29  0.13  0.14    -
fallout          5   -   4   3     0.125    0.63    -   0.50  0.38      0.37    -   0.19  0.44
information      6   3   3   2     0.000      0     0     0     0         0     0     0     0
interesting      -   1   -   -     0.602      -   0.60    -     -         -   0.62    -     -
nuclear          3   -   7   -     0.301    0.90    -   2.11    -       0.53    -   0.79    -
retrieval        -   6   1   4     0.125      -   0.75  0.13  0.50        -   0.77  0.05  0.57
siberia          2   -   -   -     0.602    1.20    -     -     -       0.71    -     -     -

Vector lengths (docs 1-4): 1.70, 0.97, 2.67, 0.87

Query: “contaminated retrieval”
Result with cosine normalization: 2, 4, 1, 3 (compare to 2, 3, 1, 4 without normalization)
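
A short Python sketch that reproduces the example (term frequencies transcribed from the table above; base-10 log assumed, as in the TF*IDF formula):

```python
import math

# Raw term frequencies {term: {doc_id: count}}, transcribed from the table.
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
N = 4

idf = {t: math.log10(N / len(docs)) for t, docs in tf.items()}
weight = {t: {d: f * idf[t] for d, f in docs.items()} for t, docs in tf.items()}

# Euclidean length of each document vector, for cosine normalization.
length = {d: math.sqrt(sum(w[d] ** 2 for w in weight.values() if d in w))
          for d in range(1, N + 1)}

def rank(query, normalize=True):
    score = {d: 0.0 for d in range(1, N + 1)}
    for t in query:
        for d, w in weight.get(t, {}).items():
            score[d] += w / length[d] if normalize else w
    return sorted(score, key=score.get, reverse=True)

print(rank(["contaminated", "retrieval"]))      # [2, 4, 1, 3], as on the slide
# Without normalization, doc 2 leads and docs 1, 3, 4 tie at 0.50;
# the slide breaks that tie as 2, 3, 1, 4.
print(rank(["contaminated", "retrieval"], normalize=False))
```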

Interaction
- Query formulation vs. query by example
- Summarization: indicative vs. informative
- Clustering
- Visualization: projection, starfield, contour maps

Evaluation Criteria
- Measures of effectiveness: recall, precision, F-measure, mean average precision
- User studies: effectiveness, efficiency, usability

Set-Based Effectiveness Measures
- Precision: how much of what was found is relevant?
  - Often of interest, particularly for interactive searching
- Recall: how much of what is relevant was found?
  - Particularly important for law, patents, and medicine

Accuracy and Exhaustiveness
[Venn diagram: within the space of all documents, the Relevant and Retrieved sets overlap; the intersection is “Relevant + Retrieved”, and everything outside both sets is “Not Relevant + Not Retrieved”.]
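
In set notation, the two measures above (plus the F-measure listed under the evaluation criteria) are:

```latex
\mathrm{Precision} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Retrieved}|}
\qquad
\mathrm{Recall} = \frac{|\mathrm{Relevant} \cap \mathrm{Retrieved}|}{|\mathrm{Relevant}|}
\qquad
F_1 = \frac{2PR}{P + R}
```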

Mean Average Precision
- Average of the precision values at each retrieved relevant document; MAP averages this per-query figure over a set of queries
- Relevant documents that are never retrieved contribute zero to the score

Precision at ranks 1-10:  1/1, 1/2, 1/3, 1/4, 2/5, 3/6, 3/7, 4/8, 4/9, 4/10
Precision at ranks 11-20: 5/11, 5/12, 5/13, 5/14, 5/15, 6/16, 6/17, 6/18, 6/19, 6/20
(relevant documents were retrieved at ranks 1, 5, 6, 8, 11, and 16)

Assume 14 relevant documents in total: the 8 relevant documents that were not retrieved contribute eight zeros, so
AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 = 0.2307
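
The computation is mechanical enough to sketch in a few lines of Python (document IDs here are hypothetical; only the ranks of the relevant documents matter):

```python
def average_precision(ranking, relevant):
    """Average the precision at each rank where a relevant document appears.
    Relevant documents never retrieved implicitly contribute zero."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

# The slide's example: relevant documents at ranks 1, 5, 6, 8, 11, and 16
# of a 20-document ranking, with 14 relevant documents in total.
ranking = list(range(1, 21))                            # doc IDs = ranks, for simplicity
relevant = {1, 5, 6, 8, 11, 16} | set(range(101, 109))  # 8 relevant docs never retrieved
print(round(average_precision(ranking, relevant), 4))   # 0.2307
```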

Blair and Maron (1985)
- A classic study of retrieval effectiveness
  - Earlier studies had used unrealistically small collections
- Studied an archive of documents for a lawsuit
  - 40,000 documents, ~350,000 pages of text
  - 40 different queries
  - Used IBM’s STAIRS full-text system
- Approach:
  - Lawyers wanted at least 75% of all relevant documents
  - Precision and recall evaluated only after the lawyers were satisfied with the results

David C. Blair and M. E. Maron (1985). An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System. Communications of the ACM, 28(3), 289-299.

Blair and Maron’s Results
- Mean precision: 79%
- Mean recall: 20% (!!)
- Why was recall so low?
  - Users can’t anticipate the terms used in relevant documents: an “accident” might be referred to as an “event”, “incident”, “situation”, “problem”, …
  - Differing technical terminology
  - Slang, misspellings
- Other findings:
  - Searches by both lawyers had similar performance
  - The lawyers’ recall was not much different from the paralegals’

Web Search
- Crawling
- PageRank
- Anchor text
- Deep Web (i.e., database-generated content)
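
PageRank is the fixed point of a random-surfer process over the link graph; a minimal power-iteration sketch (the three-page graph is a toy example, and the damping factor 0.85 is the conventional choice, not something the slide specifies):

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration over a link graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += d * share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Hypothetical web: A and C link to B; B links back to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))
```

B ends up with the highest rank, since two pages point to it while C, with no inlinks, keeps only the baseline (1 - d)/n share.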

Evidence from Behavior
- Implicit feedback
- Privacy risks
- Recommender systems

Evidence from Metadata
- Standards (e.g., Dublin Core)
- Controlled vocabulary
- Text classification
- Information extraction

Filtering
- Retrieval: information needs differ over a stable collection
- Filtering: the collection changes while the information needs stay stable

Multimedia IR
- Image retrieval: color histograms
- Video, motion detection: camera motion, object motion
- Video, shot structure: boundary detection and classification (sketched below)
- Video, OCR: closed captions, on-screen captions, scene text
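
Shot-boundary detection from color histograms can be sketched compactly: compute a histogram per frame and flag a boundary wherever consecutive histograms differ sharply. A toy Python version (frames as lists of RGB tuples; the bin count and threshold are illustrative choices, not values from the slide):

```python
def color_histogram(pixels, bins=4):
    """Normalized RGB histogram with each channel quantized into `bins` ranges."""
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        hist[idx] += 1
    return [count / len(pixels) for count in hist]

def shot_boundaries(frames, threshold=0.5):
    """Flag frame i as a boundary if its histogram differs sharply from frame i-1."""
    hists = [color_histogram(frame) for frame in frames]
    return [i for i in range(1, len(hists))
            if sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i])) > threshold]

# Two toy shots of solid color: two red frames followed by two blue ones.
red = [(250, 10, 10)] * 16
blue = [(10, 10, 250)] * 16
print(shot_boundaries([red, red, blue, blue]))  # [2]: a cut before the first blue frame
```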