
1 Performance of Compressed Inverted Indexes

2 Reasons for Compression  Compression reduces the size of the index  Compression can increase the performance of query evaluation operations

3 Factors Affecting Index Performance  Retrieval time for index lists (index size)  Complexity of decoding index lists

4 Standard Techniques  Translate absolute location of terms into differences between locations  Use bitwise encoding schemes such as Golomb-Rice or Elias coding  Usually reduce an index to about 15% of the size of the collection  Performance is generally equal or better than an uncompressed index
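As an illustration of the first two bullets, here is a minimal Python sketch that turns term locations into differences and then codes each difference with the Elias gamma code. The function names are made up for this example, the bit string is kept as text for readability, and locations are assumed to start at 1; a real index would pack the bits into machine words.

```python
from itertools import accumulate

def to_gaps(positions):
    """Turn a strictly increasing list of locations into differences."""
    return [cur - prev for prev, cur in zip([0] + positions[:-1], positions)]

def from_gaps(gaps):
    """Rebuild absolute locations from differences (running sum)."""
    return list(accumulate(gaps))

def gamma_encode(n):
    """Elias gamma code of n >= 1 as a string of '0' and '1' characters."""
    if n < 1:
        raise ValueError("Elias gamma only codes integers >= 1")
    binary = bin(n)[2:]                        # binary digits, leading 1 first
    return "0" * (len(binary) - 1) + binary    # length prefix in unary, then value

def gamma_decode_all(bits):
    """Decode a concatenation of Elias gamma codes."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

positions = [3, 7, 8, 15]
gaps = to_gaps(positions)                      # [3, 4, 1, 7]
encoded = "".join(gamma_encode(g) for g in gaps)
decoded = from_gaps(gamma_decode_all(encoded))
```

Small differences get short codes (a difference of 1 costs a single bit), which is why storing differences instead of absolute locations pays off.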

5 Articles Reviewed  "Compression of Inverted Indexes For Fast Query Evaluation", Scholer, Williams, Yiannis and Zobel, 2002; School of Computer Science and Information Technology, RMIT University, Melbourne, Australia  "Index Compression vs. Retrieval Time of Inverted Files for XML Documents", Fuhr and Govert, 2002; University of Dortmund, Germany

6 Article 1: Improving Performance  Two techniques were chosen to attempt to improve the performance of compressed indexes: (1) optimization of existing bitwise compression routines; (2) implementation of bytewise compression routines

7 Optimized Bitwise Compression Routines  Improved existing code developed by Williams and Zobel  Optimized for the Intel / Linux platform  Decoding speed improved to 60% of that achieved by Williams and Zobel

8 Bytewise Compression Routines  Integers are stored in standard binary form using only 7 bits of a byte  Each integer only takes up as many bytes as necessary to store the integer  1 bit per byte is used as a flag to indicate that a byte is the final byte for the integer  Decoding of the integers is much simpler than the complex bitwise encodings
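The scheme on this slide can be sketched in a few lines of Python. The slide only says that 7 bits per byte carry data and 1 bit flags the final byte; using the high bit as the flag and writing the most significant 7-bit group first are assumptions of this sketch, and the function names are illustrative.

```python
def vbyte_encode(n):
    """Encode a non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    chunks = []
    while True:
        chunks.append(n & 0x7F)    # take the low 7 bits
        n >>= 7
        if n == 0:
            break
    chunks.reverse()               # assumption: most significant group first
    chunks[-1] |= 0x80             # assumption: high bit set marks the final byte
    return bytes(chunks)

def vbyte_decode_all(data):
    """Decode a byte string holding any number of encoded integers."""
    values, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:            # final byte of this integer
            values.append(n)
            n = 0
    return values

gaps = [5, 130, 16384]
encoded = b"".join(vbyte_encode(g) for g in gaps)
decoded = vbyte_decode_all(encoded)
```

Decoding needs only a shift, a mask and one test per byte, with no bit-level bookkeeping across byte boundaries, which is the simplicity the last bullet refers to.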

9 Bitwise vs. Bytewise  Bytewise encoding of indexes takes up nearly 20% of the original document size (33% more than bitwise encodings)  Bytewise encoding provides query performance that is double that of the optimized bitwise encodings  Even when the index is small enough to be stored in memory, bytewise encoding shows small improvements over uncompressed indexes

10 Article 2: Structured Indexes  Most IR approaches in the past have ignored the structure and formatting of documents  The widespread adoption of HTML and XML has created the need for improvements in structured IR

11 Inverted Indexes of XML Documents  The document structure must be stored or referenced from the inverted index  Standard schemes use a Path-In-List (PIL) approach; structure data is stored within the inverted list for each term  Indexes are generally much larger than the original text when uncompressed

12 Compression of Inverted Lists  Problem: the uncompressed PIL approach generates an index that is too large  Two possible solutions were explored: (1) use bitwise compression schemes to compress the existing PIL representation; (2) store only a pointer in the list that points into another data structure that models the document structure

13 XML Structure (XS) Tree  The XS Tree is a compact representation of the structure of an XML document  Size of XS Tree is generally 1-2% of the original document size  XS Trees for an entire document collection can usually be kept in memory
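To make the difference between the two representations concrete, the following Python sketch contrasts a posting that carries its full path with a posting that carries only a pointer into a separate structure tree. This is an illustration of the general idea, not the layout used by Fuhr and Govert: the class names, the parent-pointer representation and the sample document are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class StructureNode:
    tag: str
    parent: int                    # index of the parent node, -1 for the root

@dataclass
class StructureTree:
    """Hypothetical stand-in for a per-document structure tree."""
    nodes: list = field(default_factory=list)

    def add(self, tag, parent=-1):
        self.nodes.append(StructureNode(tag, parent))
        return len(self.nodes) - 1  # node id that a posting can point to

    def path(self, node_id):
        """Rebuild the element path by walking parent pointers to the root."""
        parts = []
        while node_id != -1:
            node = self.nodes[node_id]
            parts.append(node.tag)
            node_id = node.parent
        return "/" + "/".join(reversed(parts))

tree = StructureTree()
book = tree.add("book")
chapter = tree.add("chapter", book)
title = tree.add("title", chapter)

# Path-in-list style: every posting repeats the full path.
pil_postings = {"index": ["/book/chapter/title"]}
# Pointer style: the posting stores one small integer instead.
pointer_postings = {"index": [title]}

resolved = [tree.path(p) for p in pointer_postings["index"]]
```

The pointer posting is smaller, but every lookup pays for the extra walk through the tree, which is the kind of decoding cost that the later performance slides report.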

14 Performance of PIL vs. XS Trees  The XS Tree index, including the XS Trees, is generally 2-3 times smaller than the compressed PIL approach  Both approaches yield indexes that are smaller than the document collection  In both cases, compression results in retrieval performance that is far worse than uncompressed PIL  Retrieval performance of the XS Tree approach is 10-100 times worse than that of the uncompressed PIL

15 Conclusions  Retrieval performance is dependent on: (1) the retrieval time of the index (index size); (2) the complexity of decoding the index entries  Scholer et al. find the ideal balance with bytewise compression, which results in optimal retrieval times

16 Conclusions  The XS Tree’s goal of compressing the size of the index is successful  The complexity of decoding the XS Tree structure results in nearly unusable performance  Future research should be undertaken to find a structure that is quicker to decode than the XS Tree

