Inverted Files, Signature Files, Bitmaps
Jianping Fan, Dept of Computer Science, UNC-Charlotte



Generating Document Representations
- Indexing: use significant terms to build representations of documents
- Manual indexing: professional indexers
  - Assign terms from a controlled vocabulary
  - Typically phrases
- Automatic indexing: the machine selects the terms
  - Terms can be single words, phrases, or other features from the text of documents

Index Languages
- The language used to describe documents and queries
- Exhaustivity: the number of different topics indexed (completeness or breadth)
  - Increased exhaustivity => higher recall, lower precision
  - Retrieved output size increases because documents are indexed by any remotely connected content
- Specificity: the accuracy of indexing (level of detail)
  - Increased specificity => higher precision, lower recall
  - When a document is represented by fewer terms, content may be lost; a query that refers to the lost content will fail to retrieve the document

Index Languages
- Pre-coordinate indexing: combinations of terms (e.g. phrases) are used as an indexing term
- Post-coordinate indexing: combinations are generated at search time
- Faceted classification: group terms into facets that describe the basic structure of a domain; less rigid than a predefined hierarchy
- Enumerative classification: an alphabetic listing; the underlying order is less clear
  - e.g. the Library of Congress class for "socialism, communism and anarchism" sits at the end of the schedule for social sciences, after social pathology and criminology

How Do We Retrieve Information?
1. Search the whole text sequentially (i.e., on-line search)
   - A good strategy if the text is small
   - The only choice if the index space overhead is unaffordable
2. Build data structures over the text (indices) to speed up the search
   - A good strategy if the text collection is large and the text is semi-static

Indexing Techniques
- Inverted files
  - Best choice for most applications
- Signature files & bitmaps
  - Word-oriented index structures based on hashing
- Arrays
  - Faster for phrase searches & less common queries
  - Harder to build & maintain
- Design issues:
  - Search cost & space overhead
  - Cost of building & updating

Inverted List: Most Common Indexing Technique
- Source file: the collection, organized by document
- Inverted file: the collection, organized by term
  - One record per term, listing the locations where the term occurs
- Searching: traverse the lists for each query term
  - OR: the union of the component lists
  - AND: the intersection of the component lists
  - Proximity: the intersection of the component lists, restricted to entries whose positions satisfy the proximity constraint
  - SUM: the union of the component lists; each entry has a score
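The OR, AND, and SUM operations can be sketched over sorted lists of document IDs. This is a minimal illustration, not the slides' own implementation; the function names and the example lists are hypothetical.

```python
def intersect(a, b):
    """AND: merge-intersect two sorted lists of document IDs."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: merge two sorted lists of document IDs, dropping duplicates."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def score_sum(a, b):
    """SUM: union of two {doc_id: score} mappings, adding the scores."""
    out = dict(a)
    for doc_id, score in b.items():
        out[doc_id] = out.get(doc_id, 0) + score
    return out


# Hypothetical inverted lists for two query terms.
print(intersect([1, 4, 7], [2, 4, 7, 9]))         # [4, 7]
print(union([1, 4, 7], [2, 4, 7, 9]))             # [1, 2, 4, 7, 9]
print(score_sum({1: 2, 4: 1}, {4: 3, 5: 1}))      # {1: 2, 4: 4, 5: 1}
```

Both merges read each list once, so the cost is linear in the combined list lengths.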

Inverted Files
- Contain inverted lists, one for each word in the vocabulary
  - Each list identifies the locations of all occurrences of a word in the original text
  - Which 'documents' contain the word
  - Perhaps the locations of occurrence within documents
- Require a lexicon or vocabulary list
  - Provides the mapping between a word and its inverted list
- A single-term query can be answered by:
  1. Scanning the term's inverted list
  2. Returning every doc on the list

Inverted Files
- Index granularity refers to the accuracy with which term locations are identified
  - Coarse-grained indices may identify only a block of text; each block may contain several documents
  - Moderate-grained indices store locations in terms of document numbers
  - Fine-grained indices return a sentence, word number, or byte number (location in the original text)

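The Inverted Lists
Data stored in an inverted list:
- The term, document frequency (df), and list of DocIds
  - e.g. government, 3, followed by the DocIds
- The term, df, and list of pairs of DocId and term frequency (tf)
  - e.g. government, 3, followed by the (DocId, tf) pairs
- The term, df, and list of DocIds with positions
  - e.g. government, 3, followed by the DocIds with their positions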

Inverted Files: Coarse (figure)

Inverted Files: Medium (figure)

Inverted Files: Fine (figure)

Index Granularity
Can you think of any differences between these in terms of storage needs or search effectiveness?
- Coarse: identify a block of text (potentially many docs)
  - Less storage space, but more searching of plain text to find the exact locations of search terms
  - More false matches for multi-word queries. Why?
- Fine: store the sentence, word, or byte number
  - Enables queries to contain proximity information, e.g. "green house" versus green AND house
  - Proximity info increases index size 2-3x
  - Only include doc info if proximity will not be used
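The difference between "green house" and green AND house can be sketched with a fine-grained (word-position) index. The sample documents and function names below are hypothetical, and tokenization is assumed to be a simple whitespace split.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]} (a fine-grained index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def and_query(index, t1, t2):
    """Documents containing both terms, at any positions."""
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))


def phrase_query(index, t1, t2):
    """Documents in which t2 occurs immediately after t1."""
    hits = []
    for doc_id in and_query(index, t1, t2):
        t2_positions = set(index[t2][doc_id])
        if any(p + 1 in t2_positions for p in index[t1][doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical two-document collection.
docs = {
    1: "the green house on the hill",
    2: "a house with a green door",
}
index = build_positional_index(docs)
print(and_query(index, "green", "house"))     # [1, 2]
print(phrase_query(index, "green", "house"))  # [1]
```

A document-level index could answer only the AND query; the phrase would then have to be verified by scanning the text of both documents.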

Indexes: Bitmaps
- Bag-of-words index only: a term x document array
- For each term, allocate a vector with 1 bit per document
- If the term is present in document n, set the n'th bit to 1, else 0
- Boolean operations are very fast
- Extravagant of storage: N*n bits needed (one bit for every term-document pair)
  - 2 Gbytes of text requires a 40 Gbyte bitmap
- Space efficient for common terms, as a high proportion of bits are set
- Space inefficient for rare terms (why?)
- Not widely used
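A minimal sketch of a bitmap index, using one Python integer per term as the bit vector. The sample documents are hypothetical, and documents are numbered from 0 here for convenience.

```python
def build_bitmaps(docs):
    """Return {term: int}; bit n of the int is 1 if document n has the term."""
    bitmaps = {}
    for n, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << n)
    return bitmaps


def docs_of(bitmap):
    """Decode a bitmap into a sorted list of document numbers."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]


# Hypothetical three-document collection.
docs = ["pease porridge hot", "pease porridge cold", "some like it hot"]
bm = build_bitmaps(docs)

# Boolean queries become single bitwise operations on the term vectors.
print(docs_of(bm["pease"] & bm["hot"]))  # AND -> [0]
print(docs_of(bm["cold"] | bm["hot"]))   # OR  -> [0, 1, 2]
```

A rare term such as "cold" still gets one bit per document, almost all of them 0, which is where the space inefficiency comes from.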

Indexes: Signature Files
- Bag-of-words only: probabilistic indexing
- Allocate a fixed-size s-bit vector (signature) per term
- Use multiple hash functions generating values in the range 1..s
  - The values generated by each hash are the bits to set in the signature
- OR the term signatures to form the document signature
- Match query to doc: check whether the bits corresponding to the term signature are set in the doc signature

Indexes: Signature Files
- When a bit is set in a query-term mask but not in the doc mask, the word is not present in the doc
- An s-bit signature may not be unique
  - The corresponding bits can be set even though the word is not present (false drop)
  - The document must be fetched and scanned to ensure a match
- Challenge: design the file to ensure p(false drop) is low, while keeping the signature file as short as possible
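A sketch of signature construction and matching, under stated assumptions: the width S, the number of hash functions K, the use of salted SHA-256 as the hash family, and the sample documents are all illustrative choices, not taken from the slides. Bits are numbered 0..s-1 here rather than 1..s.

```python
import hashlib

S = 64  # signature width in bits (assumed value)
K = 3   # hash functions per term (assumed value)


def term_signature(term, s=S, k=K):
    """Set up to k bits, chosen by k salted hashes of the term."""
    sig = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % s)
    return sig


def doc_signature(text):
    """OR together the signatures of all terms in the document."""
    sig = 0
    for term in set(text.lower().split()):
        sig |= term_signature(term)
    return sig


def candidates(term, signatures):
    """Docs whose signature has every bit of the term signature set."""
    q = term_signature(term)
    return [d for d, sig in signatures.items() if sig & q == q]


def search(term, docs, signatures):
    """Filter candidates by scanning the text, removing false drops."""
    return [d for d in candidates(term, signatures)
            if term in docs[d].lower().split()]


# Hypothetical three-document collection.
docs = {1: "pease porridge hot", 2: "pease porridge cold", 3: "some like it hot"}
signatures = {d: doc_signature(text) for d, text in docs.items()}
print(search("hot", docs, signatures))  # [1, 3]
```

The signature test can never miss a document that contains the term; it can only admit extra candidates, which is why the final scan is needed.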

Signature Files
Table (columns: Term, Hash String) for the terms: cold, days, hot, in, it, like, nine, old, pease, porridge, pot, some, the
What is the descriptor for doc 1?

Indexes: Signature Files
- At query time:
  - Look up the signature for the query term
  - If all corresponding 1-bits are on in the document signature, the document probably contains that term
  - Do false drop checking
- Vary s to control p(false drop) vs space
  - The optimal s changes as the collection grows. Why? A larger vocabulary => more signature overlap
  - Wider signatures => lower p(false drop), but storage increases
  - Shorter signatures => lower storage, but more disk accesses are required to test for false drops
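The trade-off between s and p(false drop) can be made concrete with the usual Bloom-filter-style approximation. This formula is not from the slides; it assumes k independent, uniform hash functions per term, n distinct terms per document, and a single-term query.

```python
def false_drop_probability(s, k, n):
    """Estimate p(false drop) for an s-bit signature.

    Assumes k independent uniform hash functions per term and
    n distinct terms OR-ed into the document signature.
    """
    # Probability that a given bit is set after k * n hash values.
    p_bit_set = 1.0 - (1.0 - 1.0 / s) ** (k * n)
    # An absent term is a false drop if all k of its bits are set.
    return p_bit_set ** k


# Illustrative parameter values only.
for s in (64, 128, 256, 512):
    print(s, false_drop_probability(s, k=3, n=50))
```

Under these assumptions the estimate falls as s grows and rises as n grows, which matches the slide's points about wider signatures and longer documents.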

Indexes: Signature Files
- Many variations, widely studied, not widely used
- Require more space than inverted files
- Inefficient with variable-size documents, since each doc is still allocated the same number of signature bits
  - Longer docs have more terms: more likely to yield false hits
- Signature files are most appropriate for:
  - Conventional databases with short docs of similar lengths
  - Long conjunctive queries
- Compressed inverted indices are almost always superior with respect to storage space and access time

Inverted File
- In general, stores a hierarchical set of addresses
  - At an extreme: word number within sentence number within paragraph number within chapter number within volume number
- Uncompressed inverted files take up considerable space
  - 50-100% of the space the text itself takes up
  - Stopword removal significantly reduces the size
  - Compressing the index is even better
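The slides do not say which compression scheme is meant. As one common possibility, the sketch below stores the gaps between successive document IDs and encodes each gap with a variable-byte code; the choice of scheme, the function names, and the sample list are assumptions.

```python
def to_gaps(doc_ids):
    """Replace sorted doc IDs by differences from the previous ID."""
    return [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out


def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, low-order bits first.

    The high bit of a byte marks the last byte of a number.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


# Hypothetical inverted list of document IDs.
postings = [3, 7, 154, 159, 202]
encoded = vbyte_encode(to_gaps(postings))
print(to_gaps(postings))  # [3, 4, 147, 5, 43]
print(len(encoded))       # 6
assert from_gaps(vbyte_decode(encoded)) == postings
```

Small gaps fit in one byte each, so long lists for common terms shrink the most.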

The Dictionary
- Binary search tree
  - Worst case O(dictionary-size) time: in a degenerate (unbalanced) tree, a search may have to look at every node
  - Average O(lg(dictionary-size)) time: each comparison discards about half of the remaining nodes
  - Needs space for left and right pointers
    - Nodes with smaller values go in the left branch
    - Nodes with larger values go in the right branch
  - A sorted list is generated by an in-order traversal

The Dictionary
- A sorted array
  - Binary search to find a term in the array: O(log(dictionary-size)); each step halves the part of the array still to be searched
  - Insertion is slow: O(dictionary-size)

The Dictionary
- A hash table
  - Search is fast: O(1) expected time
  - Does not generate a sorted dictionary
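The sorted-array and hash-table dictionaries can be contrasted in a few lines, using the standard `bisect` module for binary search and a built-in `dict` as the hash table. The term list and helper names are illustrative.

```python
import bisect

# Illustrative vocabulary, kept in sorted order.
terms = sorted(["cold", "days", "hot", "in", "it", "like", "nine",
                "old", "pease", "porridge", "pot", "some", "the"])


def lookup_sorted(terms, term):
    """Binary search; returns the array index of term, or -1 if absent."""
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1


def prefix_range(terms, prefix):
    """All terms starting with prefix; relies on the sorted order."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]


# Hash table: expected O(1) lookup, but no useful ordering of the keys.
term_ids = {t: i for i, t in enumerate(terms)}

print(lookup_sorted(terms, "hot"))   # 2
print(prefix_range(terms, "po"))     # ['porridge', 'pot']
print(term_ids.get("hot"))           # 2
```

The prefix query is the kind of operation a sorted dictionary supports directly and a plain hash table does not.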

The Inverted File
- Dictionary
  - Stored in memory or in secondary storage
  - Each record contains a pointer to the inverted list, the term, possibly df, and a term number/ID
- Postings file
  - A sequential file with the inverted lists sorted by term ID


Building an Inverted File
1. Initialization
   a. Create an empty dictionary structure S
2. Collect term appearances
   a. For each document D_i in the collection
      i. Scan D_i (parse it into index terms)
      ii. For each index term t
          - Let f_{d,t} be the frequency of term t in document d
          - Search S for t
          - If t is not in S, insert it
          - Append a node storing (d, f_{d,t}) to t's inverted list
3. Create the inverted file
   a. Start a new inverted file entry for each new t
   b. For each (d, f_{d,t}) in the list for t, append (d, f_{d,t}) to its inverted file entry
   c. Compress the inverted file entry if need be
   d. Append this inverted file entry to the inverted file
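The three steps can be sketched in memory as follows. This is a simplified illustration: the compression step is omitted, the "inverted file" is a Python list rather than a file on disk, and the function name and sample documents are hypothetical.

```python
from collections import Counter


def build_inverted_file(docs):
    """Build (dictionary, inverted_file) from {doc_id: text}.

    dictionary:    term -> (df, position of its entry in inverted_file)
    inverted_file: list of entries, each a list of (d, f_dt) pairs
    """
    # Step 1: create an empty dictionary structure S.
    S = {}

    # Step 2: collect term appearances, document by document.
    for d in sorted(docs):
        counts = Counter(docs[d].lower().split())  # f_dt for each term t
        for t, f_dt in counts.items():
            S.setdefault(t, []).append((d, f_dt))  # insert t if missing

    # Step 3: write one inverted file entry per term, in term order.
    dictionary, inverted_file = {}, []
    for t in sorted(S):
        dictionary[t] = (len(S[t]), len(inverted_file))
        inverted_file.append(list(S[t]))
    return dictionary, inverted_file


# Hypothetical two-document collection.
docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
}
dictionary, inverted_file = build_inverted_file(docs)
df, entry = dictionary["pease"]
print(df, inverted_file[entry])  # 2 [(1, 2), (2, 1)]
```

Because documents are processed in ID order, each inverted list comes out sorted by document ID without a separate sort.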

What Are the Challenges?
- The index is much larger than memory (RAM)
  - Can create the index in batches and merge
    - Fill a memory buffer, sort, compress, then write to disk
    - Compressed buffers can be read, uncompressed on the fly, and merge-sorted
  - Compressed indices improve query speed, since the time to uncompress is offset by reduced I/O costs
- The collection is larger than disk space (e.g. the web)
- Incremental updates
  - Can be expensive
  - Build an index for the new docs, then merge the new index with the old one
  - In some environments (the web), docs are only removed from the index when they can't be found
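Batch construction with a final merge can be sketched with sorted runs and `heapq.merge`. In this simplified version the runs stay in memory as lists instead of being compressed and written to disk, and `buffer_size` is an illustrative stand-in for the memory limit.

```python
import heapq
from itertools import groupby


def make_runs(docs, buffer_size):
    """Yield sorted runs of (term, doc_id) pairs, each at most buffer_size long."""
    buffer = []
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            buffer.append((term, doc_id))
            if len(buffer) >= buffer_size:
                yield sorted(buffer)  # a real system would write this run to disk
                buffer = []
    if buffer:
        yield sorted(buffer)


def merge_runs(runs):
    """Merge sorted runs into {term: [doc_ids in increasing order]}."""
    index = {}
    merged = heapq.merge(*runs)  # lazy k-way merge of the sorted runs
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index


# Hypothetical three-document collection and a deliberately tiny buffer.
docs = {1: "pease porridge hot", 2: "pease porridge cold", 3: "some like it hot"}
runs = list(make_runs(docs, buffer_size=4))
index = merge_runs(runs)
print(len(runs))       # 3
print(index["hot"])    # [1, 3]
print(index["pease"])  # [1, 2]
```

The merge only ever needs the head of each run at once, which is what lets the full index exceed available memory.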

What Are the Challenges?
- Time limitations (e.g. incremental updates for 1 day should take < 1 day)
- Reliability requirements (e.g. 24 x 7?)
- Query throughput or latency requirements
- Position/proximity queries

Inverted Files / Signature Files / Bitmaps
- Signature and inverted files consume an order of magnitude less secondary storage than bitmaps do
- Signature files
  - False drops cause unnecessary accesses to the main text
    - Can be reduced by increasing the signature size, at the cost of increased storage
  - Queries can be difficult to process
  - Long or variable-length docs cause problems
  - 2-3x larger than compressed inverted files
  - No need to store the vocabulary separately
  - May be more efficient than an inverted file, which requires 1 more disk access per query term, when:
    1. The dictionary is too large for main memory
    2. The vocabulary is very large and queries contain 10s or 100s of words

Inverted Files / Signature Files / Bitmaps
- Inverted files
  - If the inverted lists are accessed in order of length, they require no more disk accesses than signature files
  - As efficient for typical conjunctive queries as signature files
  - Can be compressed to address storage problems
  - Most useful for indexing large collections of variable-length documents
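Processing a conjunctive query in order of list length can be sketched as below: the shortest list supplies the candidates, and each longer list is only probed for them. The index contents and function names are hypothetical, and binary search stands in for whatever access method a real postings file would use.

```python
import bisect


def contains(postings, doc_id):
    """Binary search for doc_id in a sorted inverted list."""
    i = bisect.bisect_left(postings, doc_id)
    return i < len(postings) and postings[i] == doc_id


def conjunctive_query(index, terms):
    """AND all query terms, taking the inverted lists shortest first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    candidates = list(lists[0])
    for postings in lists[1:]:
        candidates = [d for d in candidates if contains(postings, d)]
        if not candidates:
            break  # no document can match; longer lists need not be read
    return candidates


# Hypothetical index with one very common term.
index = {
    "the": [1, 2, 3, 4, 5, 6, 7, 8],
    "porridge": [1, 2, 5],
    "cold": [2, 5, 9],
}
print(conjunctive_query(index, ["the", "porridge", "cold"]))  # [2, 5]
print(conjunctive_query(index, ["the", "absent"]))            # []
```

The candidate set can only shrink, so starting from the rarest term keeps the work on the long lists small.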