1 CS 430: Information Discovery Lecture 3 Inverted Files.

Slides:



Advertisements
Similar presentations
1 Chap 14 Ranking Algorithm 指導教授 : 黃三益 博士 學生 : 吳金山 鄭菲菲.
Advertisements

Chapter 5: Introduction to Information Retrieval
Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
Multimedia Database Systems
CS 430 / INFO 430 Information Retrieval
CS 430 / INFO 430 Information Retrieval
Information Retrieval in Practice
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Database Systems: A Practical Approach to Design, Implementation and Management International Computer Science S. Carolyn Begg, Thomas Connolly Lecture.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Searching Full Text 2.
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
1 CS 502: Computing Methods for Digital Libraries Lecture 12 Information Retrieval II.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
CS/Info 430: Information Retrieval
Evaluating the Performance of IR Sytems
1 CS 430: Information Discovery Lecture 20 The User in the Loop.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
1 CS 430 / INFO 430 Information Retrieval Lecture 6 Vector Methods 2.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
1 CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Lecture 9 Methodology – Physical Database Design for Relational Databases.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
CS 430: Information Discovery
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Web- and Multimedia-based Information Systems Lecture 2.
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Evaluation of Retrieval Effectiveness 1.
Evidence from Content INST 734 Module 2 Doug Oard.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
1 CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Information Retrieval Inverted Files.. Document Vectors as Points on a Surface Normalize all document vectors to be of length 1 Define d' = Then the ends.
1 Midterm Examination. 2 General Observations Examination was too long! Most people submitted by .
Automated Information Retrieval
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Information Retrieval in Practice
Text Based Information Retrieval
CS 430: Information Discovery
Implementation Issues & IR Systems
CS 430: Information Discovery
Multimedia Information Retrieval
Query Languages.
CSCE 561 Information Retrieval System Models
Searching and Indexing
CS 430: Information Discovery
CS 430: Information Discovery
Chapter 5: Information Retrieval and Web Search
CS 430: Information Discovery
Presentation transcript:

1 CS 430: Information Discovery Lecture 3 Inverted Files

2 Course Administration Course team: Bill Matt Doug Mitarotonda Trishul sent to: will go to these four people

3 Course Administration Assignment 1 will be posted early next week. It is a programming assignment and is due on Friday, September 20 at 5 p.m. A text book was left in the Discussion Class yesterday.

4 Inverted File -- Concept (Basic Definition) Inverted file: a list of the words in a set of documents and the documents in which they appear. Word Document abacus actor aspen 5 atoll Stop words are removed and stemming carried out before building the index.

5 Inverted List -- Concept Inverted list: All the entries in an inverted file that apply to a specific word, e.g. abacus Posting: Entry in an inverted list, e.g., the postings for "abacus" are documents 3, 19, 22.

6 Keywords and Controlled Vocabulary Keyword: A term that is used to describe the subject matter in a document. It is sometimes called an index term. Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer. Controlled vocabulary: A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms. Inverted file (more complete definition): A list of the keywords that apply to a set of documents, the documents in which they appear and related information.

7 Enhancements to Inverted Files -- Concept Location: The inverted file holds information about the location of each term within the document. Uses adjacency and near operators user interface design -- highlight location of search term Frequency: The inverted file includes the number of postings for each term. Uses term weighting query processing optimization

8 Inverted File -- Concept (Enhanced) WordPostings DocumentLocation abacus actor aspen atoll

9 Example: Boolean Queries Boolean query: two or more search terms, related by logical operators, e.g., andornot Examples: abacus and actor abacus or actor (abacus and actor) or (abacus and atoll) not actor

10 Boolean Diagram A B A and B A or B not (A or B)

11 Evaluating a Boolean Query To evaluate the and operator, merge the two inverted lists with a logical AND operation. Examples: abacus and actor Postings for abacus Postings for actor Document 19 is the only document that contains both terms, "abacus" and "actor".

12 Adjacent and Near Operators abacus adj actor Terms abacus and actor are adjacent to each other as in the string "abacus actor" abacus near 4 actor Terms abacus and actor are near to each other as in the string "the actor has an abacus" Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

13 Evaluating an Adjacency Operation Examples: abacus adj actor Postings for abacus Postings for actor Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent

14 Evaluation of Boolean Operators Precedence of operators must be defined: adj, nearhigh and, not or low Example A and B or C and B is evaluated as (A and B) or (C and B)

15 SetRecordsUnique Terms A2,6535,123 B38,304c.25,000 Sizes of Inverted Files Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term. Set B has an average of 88 postings per record. Examples from Harman and Candela, 1990

16 Representation of Inverted Files Index (word list, vocabulary) file: Stores list of terms (keywords). Designed for rapid searching and processing range queries (lexicographic index). Often held in memory. Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists. Each list may be stored sequentially. Document file: Stores the documents. Important for user interface design. [Repositories for the storage of document collections are covered in CS 502.]

17 Organization of Inverted Files TermPointer to postings ant bee cat dog elk fox gnu hog Inverted lists Index filePostings file Documents file

18 Decisions in Building Inverted Files Underlying character set, e.g., printable ASCII, Unicode, UTF8. Whether to use a controlled vocabulary. If so, what words to include. List of stopwords. Rules to decide the beginning and end of words, e.g., spaces or punctuation. Character sequences not to be indexed, e.g., sequences of numbers.

19 Efficiency Criteria Storage Inverted files are big, typically 10% to 100% the size of the collection of documents. Update performance It must be possible, with a reasonable amount of computation, to: (a) Add a large batch of documents (b) Add a single document Retrieval performance Retrieval must be fast enough to satisfy users and not use excessive resources.

20 Index Files On disk If an index is held on disk, search time is dominated by the number of disk accesses. In memory Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory.

21 Postings File Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be very long, it is important to match postings efficiently. Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.

22 Efficiency and Query Languages Some query options may require huge computation, e.g., Regular expressions If inverted files are stored in alphabetical order, comp* can be processed efficiently *comp cannot be processed efficiently Boolean terms If A and B are search terms A or B can be processed by comparing two moderate sized lists (not A) or (not B) requires two very large lists

23 SMART System An experimental system for automatic information retrieval automatic indexing to assign terms to documents and queries collect related documents into common subject classes identify documents to be retrieved by calculating similarities between documents and queries procedures for producing an improved search query based on information obtained from earlier searches Gerald Salton and colleagues Harvard Cornell

24 Vector Space Methods Problem: Given two text documents, how similar are they? (One document may be a query.) Vector space methods that measure similarity do not assume exact matches. Benefits of similarity measures rather than exact matches Encourage long queries, which are rich in information. An abstract should be very similar to its source document. Accept probabilistic aspects of writing and searching. Different words will be used if an author writes the same document twice.

25 Vector Space Revision x = (x 1, x 2, x 3,..., x n ) is a vector in an n-dimensional vector space Length of x is given by (extension of Pythagoras's theorem) |x| 2 = x x x x n 2 If x 1 and x 2 are vectors: Inner product (or dot product) is given by x 1.x 2 = x 11 x 21 + x 12 x 22 + x 13 x x 1n x 1n Cosine of the angle between the vectors x 1 and x 2: cos (  ) = x 1.x 2 |x 1 | |x 2 |