CS/Info 430: Information Retrieval


CS/Info 430: Information Retrieval
Sample Midterm Examination
Notes on the Solutions

Midterm Examination -- Question 1

1(a) Define the terms inverted file, inverted list, posting.

Inverted file: a list of the words in a set of documents, together with the documents in which each word appears.
Inverted list: all the entries in an inverted file that apply to one specific word.
Posting: a single entry in an inverted list.

1(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?
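
These three terms can be illustrated with a minimal Python sketch (my own illustration, not part of the exam solutions; the documents and identifiers are invented):

    from collections import defaultdict

    docs = {
        "D1": "alpha bravo charlie",
        "D2": "bravo bravo delta",
    }

    # Inverted file: every word in the collection, mapped to its inverted list.
    inverted_file = defaultdict(list)

    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            # Posting: one entry recording an occurrence of the term
            # (here the document identifier and word position).
            inverted_file[term].append((doc_id, position))

    # Inverted list: all the postings for one specific word.
    print(inverted_file["bravo"])   # [('D1', 1), ('D2', 0), ('D2', 1)]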

Q1 (continued)

Storage: inverted files are big, typically 10% to 100% of the size of the collection of documents.

Update performance: it must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document.

Retrieval performance: retrieval must be fast enough to satisfy users without using excessive resources.

Q1 (continued)

1(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.

(i) What file structure(s) would you use?
(ii) How well does your design satisfy the criteria listed in Part (b)?

Q1 (continued)

Separate the inverted index from the lists of postings.

[Figure: an inverted index with one entry per term (ant, bee, cat, dog, elk, fox, gnu, hog), each entry holding a pointer to that term's list of postings in a separate postings file.]

Question 1 (continued)

(a) The postings file may be stored sequentially as a linked list.
(b) The index file is best stored as a tree. Binary trees provide fast searching but have problems with updating. B-trees are better, with B+-trees being the best choice.

Note: other answers to this part of the question are possible.
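
The separation of index file and postings file can be sketched in miniature (my own illustration, with invented terms and document IDs): a sorted term dictionary searched by binary search stands in for the tree-structured index, and each entry points into a separate store of postings lists.

    from bisect import bisect_left

    # Sorted term dictionary: a simple stand-in for the tree-structured index file.
    # Each entry is (term, offset of that term's postings list in the postings store).
    dictionary = [("ant", 0), ("bee", 1), ("cat", 2), ("dog", 3)]

    # Postings store: stands in for the separate postings file.
    postings_store = [
        ["D2", "D7"],        # ant
        ["D1"],              # bee
        ["D1", "D3", "D9"],  # cat
        ["D4"],              # dog
    ]

    def lookup(term):
        """Binary-search the dictionary, then follow the pointer to the postings."""
        terms = [t for t, _ in dictionary]
        i = bisect_left(terms, term)
        if i < len(terms) and terms[i] == term:
            return postings_store[dictionary[i][1]]
        return []

    print(lookup("cat"))   # ['D1', 'D3', 'D9']

A real system would replace the sorted list with a B-tree or B+-tree once the dictionary no longer fits in memory and must support frequent insertions.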

Question 1 (continued)

1(c)(ii) How well does your design satisfy the criteria listed in Part (b)?

• A sequential list for each term is efficient for storage and for processing Boolean queries. The disadvantage is slow update time for long inverted lists.
• B-trees combine fast retrieval with moderately efficient updating.
• Bottom-up updating is usually fast, but may require recursive climbing of the tree up to the root.
• The main weakness is poor storage utilization; typically buckets are only about 0.69 full.

Midterm Examination -- Question 2

2(b) You have a collection of documents that contain the following index terms:

D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

Incidence array

D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

        alpha  bravo  charlie  delta  echo  foxtrot  golf    terms
  D1      1      1       1       1     1       1       1       7
  D2      1      0       0       1     0       0       1       3
  D3      0      1       1       0     1       1       0       4
  D4      1      0       0       1     0       1       1       4

Document similarity matrix (no term weighting)

          D1     D2     D3     D4
  D1       -    0.65   0.76   0.76
  D2     0.65    -     0.00   0.87
  D3     0.76   0.00    -     0.25
  D4     0.76   0.87   0.25    -
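
The unweighted similarities can be checked with a short Python sketch (my own, not part of the exam solutions): build the binary incidence vectors and take the cosine of the angle between each pair of document vectors.

    from math import sqrt

    docs = {
        "D1": "alpha bravo charlie delta echo foxtrot golf",
        "D2": "golf golf golf delta alpha",
        "D3": "bravo charlie bravo echo foxtrot bravo",
        "D4": "foxtrot alpha alpha golf golf delta",
    }
    terms = sorted({t for text in docs.values() for t in text.split()})

    # Binary incidence vector: 1 if the term occurs in the document, else 0.
    incidence = {d: [1 if t in text.split() else 0 for t in terms]
                 for d, text in docs.items()}

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

    for d1 in docs:
        print(d1, [round(cosine(incidence[d1], incidence[d2]), 2) for d2 in docs])
    # D1 [1.0, 0.65, 0.76, 0.76]
    # D2 [0.65, 1.0, 0.0, 0.87]
    # D3 [0.76, 0.0, 1.0, 0.25]
    # D4 [0.76, 0.87, 0.25, 1.0]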

Question 2 (continued)

2(b)(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.

Frequency array

D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

        alpha  bravo  charlie  delta  echo  foxtrot  golf
  D1      1      1       1       1     1       1       1
  D2      1      0       0       1     0       0       3
  D3      0      3       1       0     1       1       0
  D4      2      0       0       1     0       1       2

Inverse Document Frequency Weighting

Principle:
(a) The weight is proportional to the number of times that the term appears in the document.
(b) The weight is inversely proportional to the number of documents that contain the term:

    w_ik = f_ik / d_k

where:
    w_ik is the weight given to term k in document i
    f_ik is the frequency with which term k appears in document i
    d_k  is the number of documents that contain term k

Frequency array with weights

D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

        alpha  bravo  charlie  delta  echo  foxtrot  golf    length
  D1    0.33   0.50    0.50    0.33   0.50   0.33    0.33     1.09
  D2    0.33    0       0      0.33    0      0      1.00     1.11
  D3     0     1.50    0.50     0     0.50   0.33     0       1.69
  D4    0.67    0       0      0.33    0     0.33    0.67     1.05
  d_k    3      2       2       3      2      3       3

(length is the Euclidean length of the weighted document vector, used to normalize the similarities on the next slide.)

Document similarity matrix (with term weighting)

          D1     D2     D3     D4
  D1       -    0.46   0.74   0.58
  D2     0.46    -     0.00   0.86
  D3     0.74   0.00    -     0.06
  D4     0.58   0.86   0.06    -
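
A sketch (my own, not part of the exam solutions) that applies the weighting formula w_ik = f_ik / d_k from the previous slides and then computes the cosine similarities; it reproduces both the weighted array above and this similarity matrix.

    from collections import Counter
    from math import sqrt

    docs = {
        "D1": "alpha bravo charlie delta echo foxtrot golf",
        "D2": "golf golf golf delta alpha",
        "D3": "bravo charlie bravo echo foxtrot bravo",
        "D4": "foxtrot alpha alpha golf golf delta",
    }
    terms = sorted({t for text in docs.values() for t in text.split()})

    # Term frequencies f_ik and document frequencies d_k.
    tf = {d: Counter(text.split()) for d, text in docs.items()}
    df = {t: sum(1 for d in docs if tf[d][t] > 0) for t in terms}

    # Weights w_ik = f_ik / d_k (zero when the term does not occur in the document).
    weights = {d: [tf[d][t] / df[t] for t in terms] for d in docs}

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

    for d1 in docs:
        print(d1, [round(cosine(weights[d1], weights[d2]), 2) for d2 in docs])
    # D1 [1.0, 0.46, 0.74, 0.58]
    # D2 [0.46, 1.0, 0.0, 0.86]
    # D3 [0.74, 0.0, 1.0, 0.06]
    # D4 [0.58, 0.86, 0.06, 1.0]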

Question 3

3(a) Define the terms recall and precision.

Precision: the proportion of the retrieved documents that are relevant.
Recall: the proportion of the relevant documents in the collection that are retrieved.

3(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.

(i) How, in a practical experiment, would you calculate the precision?
Have an expert examine each of the 200 documents and decide whether it is relevant. Precision is the number judged relevant divided by 200.

(ii) How, in a practical experiment, would you calculate the recall?
It is not feasible to examine 1,000,000 documents. Therefore sampling must be used ...

Question 3 (continued)

3(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.

(i) What is the precision? 50/200 = 0.25
(ii) What is the recall? 50/100 = 0.5

3(d) Explain in general terms the method used by TREC to estimate the recall.
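
The arithmetic in 3(c) can be written as a small function over sets of document identifiers (my own sketch; the counts match the question, but the identifiers are invented):

    def precision_recall(retrieved, relevant):
        """Precision = |retrieved & relevant| / |retrieved|; recall = |retrieved & relevant| / |relevant|."""
        hits = len(retrieved & relevant)
        return hits / len(retrieved), hits / len(relevant)

    # Sized to match 3(c): 200 retrieved, 100 relevant, 50 in common.
    retrieved = {f"doc{i}" for i in range(200)}
    relevant = {f"doc{i}" for i in range(150, 250)}
    print(precision_recall(retrieved, relevant))   # (0.25, 0.5)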

Question 3 (continued)

3(d) For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant. The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined.

[In TREC-8: 7,100 documents in the pool; 1,736 unique documents (eliminating duplicates); 94 judged relevant.]
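
A minimal sketch (my own illustration, not TREC software) of the pooling idea: take the top-k documents from each participating run, form the union, and have only that pool judged by the assessor; recall is then estimated against the relevant documents found in the pool.

    def build_pool(runs, k=100):
        """Union of the top-k ranked documents from each participating run."""
        pool = set()
        for ranked_docs in runs:      # each run is a ranked list of document IDs
            pool.update(ranked_docs[:k])
        return pool

    # Toy example with invented runs and a small k.
    runs = [
        ["d3", "d1", "d7", "d9"],
        ["d1", "d2", "d3", "d8"],
        ["d5", "d3", "d2", "d6"],
    ]
    print(sorted(build_pool(runs, k=2)))   # ['d1', 'd2', 'd3', 'd5'] -- only these are judged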