Performance of Compressed Inverted Indexes

Reasons for Compression
- Compression reduces the size of the index
- Compression can increase the performance of query evaluation operations

Factors Affecting Index Performance
- Retrieval time for index lists (index size)
- Complexity of decoding index lists
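One rough way to read these two factors together is a back-of-envelope cost model. This is an illustrative assumption, not a formula from the reviewed articles; the symbols S (stored size of an index list in bytes), B (transfer rate from disk or memory), n (number of entries in the list) and c_dec (average cost of decoding one entry) are introduced here only for the sketch.

```latex
T_{\text{list}} \;\approx\; \underbrace{\frac{S}{B}}_{\text{retrieval time}} \;+\; \underbrace{n \cdot c_{\text{dec}}}_{\text{decoding time}}
```

Under this reading, compression shrinks S but raises c_dec, so it pays off only while the time saved on retrieval exceeds the time added by decoding.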

Standard Techniques
- Translate absolute locations of terms into differences between successive locations
- Use bitwise encoding schemes such as Golomb-Rice or Elias coding
- Usually reduce an index to about 15% of the size of the collection
- Performance is generally equal to or better than that of an uncompressed index
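A minimal sketch of the first two steps, assuming locations are positive and strictly increasing so that every difference is at least 1. The function names are hypothetical, and bit strings are kept as Python strings of '0' and '1' for readability rather than packed into bytes as a real index would do. Elias gamma is used here as one example of Elias coding.

```python
from typing import Iterable, List


def to_gaps(positions: Iterable[int]) -> List[int]:
    """Replace absolute locations by differences from the previous location."""
    gaps = []
    previous = 0
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def from_gaps(gaps: Iterable[int]) -> List[int]:
    """Rebuild absolute locations by summing the differences."""
    positions = []
    total = 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code: (len - 1) zeros, then n in binary (n must be >= 1)."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]  # binary digits without the '0b' prefix
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of gamma codes (assumes well-formed input)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


positions = [3, 7, 8, 20]
gaps = to_gaps(positions)  # [3, 4, 1, 12]
encoded = "".join(elias_gamma_encode(g) for g in gaps)
decoded = from_gaps(elias_gamma_decode(encoded))
```

Small differences get short codes (a gap of 1 costs a single bit), which is why converting to differences first matters: frequent terms have closely spaced locations and therefore small gaps.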

Articles Reviewed
- Compression of Inverted Indexes For Fast Query Evaluation
  Scholer, Williams, Yiannis and Zobel, 2002
  School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
- Index Compression vs. Retrieval Time of Inverted Files for XML Documents
  Fuhr and Gövert, 2002
  University of Dortmund, Germany

Article 1: Improving Performance
- Two techniques were chosen to attempt to improve the performance of compressed indexes:
  - Optimization of existing bitwise compression routines
  - Implementation of bytewise compression routines

Optimized Bitwise Compression Routines
- Improved existing code developed by Williams and Zobel
- Optimized for the Intel / Linux platform
- Decoding speed improved to 60% of that achieved by Williams and Zobel

Bytewise Compression Routines
- Integers are stored in standard binary form using only 7 bits of each byte
- Each integer takes up only as many bytes as are necessary to store it
- 1 bit per byte is used as a flag to indicate whether that byte is the final byte of the integer
- Decoding the integers is much simpler than decoding the complex bitwise encodings
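The scheme described above can be sketched as follows. The slides do not say which bit carries the flag or in what order the 7-bit groups are written, so this sketch assumes the high bit marks the final byte and the most significant group comes first; the function names are hypothetical.

```python
from typing import List


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = []
    while True:
        groups.append(n & 0x7F)  # take the lowest 7 bits
        n >>= 7
        if n == 0:
            break
    groups.reverse()             # assumed order: most significant group first
    groups[-1] |= 0x80           # assumed flag: high bit set on the final byte
    return bytes(groups)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a concatenation of encoded integers."""
    values = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:          # flag set: this integer is complete
            values.append(current)
            current = 0
    return values


stream = b"".join(vbyte_encode(n) for n in [5, 130, 16384])
numbers = vbyte_decode(stream)
```

Decoding needs only a shift, a mask and one test per byte, with no bit-level bookkeeping across byte boundaries, which is the simplicity the last bullet refers to.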

Bitwise vs. Bytewise
- A bytewise-encoded index takes up nearly 20% of the size of the original collection (33% more than bitwise encodings)
- Bytewise encoding provides query performance that is double that of the optimized bitwise encodings
- Even when the index is small enough to be stored in memory, bytewise encoding shows small improvements over uncompressed indexes
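The 33% figure is consistent with the sizes quoted on these slides, taking about 15% of the collection for a bitwise-encoded index and rounding "nearly 20%" to 20% for a bytewise-encoded one:

```latex
\frac{0.20 - 0.15}{0.15} \;=\; \frac{0.05}{0.15} \;\approx\; 0.33
```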

Article 2: Structured Indexes
- Most IR approaches in the past have ignored the structure and formatting of documents
- The widespread adoption of HTML and XML has created the need for improvements in structured IR

Inverted Indexes of XML Documents
- The document structure must be stored or referenced from the inverted index
- Standard schemes use a Path-In-List (PIL) approach; structure data is stored within the inverted list for each term
- Indexes are generally much larger than the original text when uncompressed

Compression of Inverted Lists
- Problem: the uncompressed PIL approach generates an index that is too large
- Two possible solutions were explored:
  - Use bitwise compression schemes to compress the existing PIL representation
  - Store only a pointer in the list that points into another data structure that models the document structure

XML Structure (XS) Tree
- The XS Tree is a compact representation of the structure of an XML document
- Size of XS Tree is generally 1-2% of the original document size
- XS Trees for an entire document collection can usually be kept in memory
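The difference between storing the path in the list and storing a pointer into a separate structure can be illustrated with a small sketch. The slides do not describe the actual layout of an XS Tree, so the parent-pointer tree below, the `StructureNode` type and the `path_of` helper are all hypothetical stand-ins chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StructureNode:
    """One element of a document's structure (hypothetical layout)."""
    tag: str
    parent: Optional[int]  # index of the parent node, None for the root


def path_of(nodes: List[StructureNode], node_id: int) -> Tuple[str, ...]:
    """Rebuild the element path by walking parent pointers up to the root."""
    path = []
    current: Optional[int] = node_id
    while current is not None:
        path.append(nodes[current].tag)
        current = nodes[current].parent
    return tuple(reversed(path))


# Structure of one small document, stored once.
nodes = [
    StructureNode("article", None),  # node 0
    StructureNode("section", 0),     # node 1
    StructureNode("title", 1),       # node 2
    StructureNode("para", 1),        # node 3
]

# PIL style: every posting (doc_id, path) repeats the full path inline.
pil_postings = [
    (1, ("article", "section", "title")),
    (1, ("article", "section", "para")),
]

# Pointer style: every posting (doc_id, node_id) refers into the tree.
pointer_postings = [(1, 2), (1, 3)]

resolved = [(doc, path_of(nodes, node)) for doc, node in pointer_postings]
```

The pointer postings are smaller because the path is stored once per document rather than once per occurrence, but each lookup now needs an extra traversal to recover the path, which is the kind of decoding cost the later performance slides weigh against the size savings.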

Performance of PIL vs. XS Trees
- The XS Tree index, including the XS Trees, is generally 2-3 times smaller than the compressed PIL approach
- Both approaches yield indexes that are smaller than the document collection
- In both cases, compression results in retrieval performance that is far worse than that of uncompressed PIL
- Retrieval performance of the XS Tree approach is times worse than that of the uncompressed PIL

Conclusions
- Retrieval performance is dependent on:
  - the retrieval time of the index (index size)
  - the complexity of decoding the index entries
- Scholer et al. find the ideal balance with bytewise compression, which results in optimal retrieval times

Conclusions
- The XS Tree achieves its goal of reducing the size of the index
- The complexity of decoding the XS Tree structure results in nearly unusable performance
- Future research should be undertaken to find a structure that is quicker to decode than the XS Tree