© 2004, M. Fontoura VLDB, Toronto, September 2004 High Performance Index Build Algorithms for Intranet Search Engines Marcus Fontoura, Eugene Shekita,


1 © 2004, M. Fontoura VLDB, Toronto, September 2004 High Performance Index Build Algorithms for Intranet Search Engines Marcus Fontoura, Eugene Shekita, Jason Zien, Sridhar Rajagopalan, Andreas Neumann

2 Agenda
Overview and problem description
Global analysis
Major data structures for index build
Index build algorithm

3 Overview and problem description
Trevi's goal is to provide high-quality intranet search capability to corporate portals such as w3.ibm.com
–Scalable text search engine being developed by a joint IBM Research and Software Group team
This talk focuses on how to efficiently incorporate global analysis into the index build process

4 Global analysis (GA)
Duplicate detection
–Computes a fingerprint for each page (64-bit shingle)
–Masters are identified by using the (previous) static rank
Anchor text (D1: Trevi)
–Appends anchor text tokens to documents
Static rank
–Host in-degree, i.e., the number of hosts that point to a page (~ PageRank on the IBM intranet)
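The static rank and master selection described on this slide could be sketched roughly as follows. This is only an illustration, not Trevi's implementation: the function names `host_in_degree` and `pick_masters` are hypothetical, fingerprints are taken as precomputed inputs, links from a page's own host are ignored by assumption, and ties between duplicates are broken by document ID.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host_in_degree(links):
    """Count, for each target page, the distinct hosts that link to it.

    links: iterable of (source_url, target_url) pairs.
    """
    hosts_by_target = defaultdict(set)
    for src, dst in links:
        src_host = urlparse(src).netloc
        dst_host = urlparse(dst).netloc
        # Assumption: a link from the page's own host does not count.
        if src_host != dst_host:
            hosts_by_target[dst].add(src_host)
    return {page: len(hosts) for page, hosts in hosts_by_target.items()}


def pick_masters(fingerprints, prev_rank):
    """Choose one master per group of documents sharing a fingerprint.

    fingerprints: {doc_id: fingerprint}, assumed to be computed elsewhere.
    prev_rank:    {doc_id: static rank from the previous build}.
    """
    groups = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        groups[fp].append(doc_id)
    # Highest previous static rank wins; doc_id breaks ties (assumption).
    return {
        fp: max(docs, key=lambda d: (prev_rank.get(d, 0), d))
        for fp, docs in groups.items()
    }
```

Documents in a fingerprint group other than the master would then be the duplicates to drop from the index.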

5 Index build requires GA
Rebuild the inverted text index and update the global analysis (GA)
–Duplicate documents are deleted from the index
–Anchor text is indexed together with the document’s content
–Static rank gives the index ordering, allowing for early termination during query evaluation
The time to rebuild the index will be dominated by the GA time as analyses get more complex
–Semantic search
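To illustrate why rank-ordered posting lists allow early termination, here is a minimal sketch. It assumes document IDs are assigned in decreasing static-rank order (ID 0 is the best page) and that results are scored by static rank alone; neither assumption is stated in the slides, and `top_k_and` is a hypothetical name.

```python
def top_k_and(postings_a, postings_b, k):
    """Return the first k doc IDs present in both posting lists.

    Both lists are sorted by doc ID, and doc IDs are assumed to follow
    decreasing static rank, so the first k matches are the top k results.
    """
    results = []
    i = j = 0
    # Stop as soon as k matches are found: the rest of the lists is never read.
    while i < len(postings_a) and j < len(postings_b) and len(results) < k:
        a, b = postings_a[i], postings_b[j]
        if a == b:
            results.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return results
```

For example, `top_k_and([0, 2, 3, 7, 9], [2, 3, 4, 9], 2)` returns `[2, 3]` without scanning as far as document 9.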

6 Major data structures
Store
–Storage for the tokenized version of each document
Index
–Inverted text index over the Store
Delta store and delta index
–Small versions of the Store and Index with new and modified documents
–Allow for hourly updates of the Index content
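As a rough picture of how the Store and Index relate, the sketch below models the Store as a dictionary from document ID to token list and builds a positional inverted index over it. The in-memory layout and the name `build_index` are assumptions made for illustration; the slides do not describe the actual on-disk formats.

```python
from collections import defaultdict


def build_index(store):
    """Build an inverted index over a store.

    store: {doc_id: [token, ...]}  (tokenized documents)
    Returns {token: [(doc_id, position), ...]} with postings in doc_id order.
    """
    index = defaultdict(list)
    for doc_id in sorted(store):
        for position, token in enumerate(store[doc_id]):
            index[token].append((doc_id, position))
    return dict(index)


store = {1: ["intranet", "search"], 2: ["search", "engine"]}
index = build_index(store)
```

A delta store and delta index would have the same shape, just restricted to the new and modified documents.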

7 Index build algorithm (1/3)
Index build merges the current version of the Store (Store i) with the current version of the DeltaStore and generates the new versions of the Store and the Index, Store i+1 and Index i+1
[Diagram labels: Index Build; Store i, DeltaStore; Store i+1, Index i+1]
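The merge step can be sketched as below, again with dictionaries standing in for the real structures. It assumes that a DeltaStore entry simply replaces the Store entry with the same document ID; `index_build` is a hypothetical name, and the real algorithm works on disk-resident data.

```python
from collections import defaultdict


def index_build(store, delta_store):
    """Merge Store i with the DeltaStore into Store i+1 and Index i+1."""
    new_store = dict(store)
    # Assumption: the delta version of a document replaces the old one.
    new_store.update(delta_store)

    new_index = defaultdict(list)
    for doc_id in sorted(new_store):
        for token in new_store[doc_id]:
            postings = new_index[token]
            # Record each document at most once per token.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return new_store, dict(new_index)


store_i = {1: ["a", "b"], 2: ["b"]}
delta = {2: ["c"], 3: ["a"]}
store_next, index_next = index_build(store_i, delta)
```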

8 Index build algorithm (2/3)
Index build using global analysis
[Diagram labels: Global Analysis, Index Build, DeltaIndex Build; Store i, DeltaStore, newly crawled documents, DeltaStore j; Dup i+1, AnchorText i+1, Rank i+1; Store i+1, Index i+1, DeltaStore j+1, DeltaIndex j+1]

9 Index build algorithm (3/3)
Index build using lagging global analysis
[Diagram labels: Global Analysis, Index Build, DeltaIndex Build; Store i, DeltaStore, GA i, newly crawled documents, DeltaStore j; GA inputs; Store i+1, Index i+1, DeltaStore j+1, DeltaIndex j+1, GA i+1]
Global Analysis and DeltaIndex build can proceed in parallel
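One possible reading of the lagging scheme is sketched below: the index build orders documents with the previous cycle's analysis (GA i) instead of waiting for a fresh one, and the next analysis (GA i+1) is computed alongside the delta index build. Everything here is an assumption for illustration: the function names are hypothetical, `global_analysis` is a placeholder that ranks by token count, and threads merely stand in for whatever parallelism the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor


def global_analysis(store):
    """Placeholder GA: the 'rank' of a document is its token count."""
    return {doc_id: len(tokens) for doc_id, tokens in store.items()}


def index_build(store, delta_store, ga_prev):
    """Merge the stores and order documents by the lagging rank (GA i)."""
    new_store = {**store, **delta_store}
    # Documents unknown to GA i get rank 0 (assumption).
    order = sorted(new_store, key=lambda d: (-ga_prev.get(d, 0), d))
    return new_store, order


def delta_index_build(new_docs):
    """Build a small inverted index over newly crawled documents."""
    index = {}
    for doc_id in sorted(new_docs):
        for token in set(new_docs[doc_id]):
            index.setdefault(token, []).append(doc_id)
    return index


def run_cycle(store, delta_store, ga_prev, newly_crawled):
    new_store, order = index_build(store, delta_store, ga_prev)
    # GA i+1 and the delta index build run side by side.
    with ThreadPoolExecutor(max_workers=2) as pool:
        ga_next = pool.submit(global_analysis, new_store)
        delta_index = pool.submit(delta_index_build, newly_crawled)
        return new_store, order, ga_next.result(), delta_index.result()
```

The point of the sketch is that `index_build` never waits on `global_analysis`; the result of the latter is only consumed in the following cycle.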

10 Indexing algorithm
Radix sort
–Linear time sorting
–Flexibility in defining the sort criteria
–Bigger sort buffers increase performance
Pipelining load and sort phases
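A minimal least-significant-digit radix sort shows both properties named on this slide: the number of passes is fixed, so the running time is linear in the number of records, and the sort criterion is just a key function. The sketch assumes keys are non-negative integers that fit in `key_bits` bits; the posting layout and the packing of term ID and document ID into one key are illustrative choices, not details from the talk.

```python
def radix_sort(records, key, key_bits=32, radix_bits=8):
    """Stable LSD radix sort of records by a non-negative integer key."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for record in records:
            # Distribute on the current digit; append keeps the pass stable.
            buckets[(key(record) >> shift) & mask].append(record)
        records = [record for bucket in buckets for record in bucket]
    return records


# Hypothetical postings as (term_id, doc_id) pairs, sorted by term then doc.
postings = [(3, 1), (1, 2), (3, 0), (1, 1)]
sorted_postings = radix_sort(postings, key=lambda p: (p[0] << 16) | p[1])
```

Changing the sort order, for example to document-major, only requires a different `key` function.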

11 Experimental results
Lagging global analysis does not degrade quality
–More than 25% performance improvement
–Even more advantageous when analyses are more complex
Indexing algorithm scales linearly with the number of documents
Superior performance when compared to several state-of-the-art indexing algorithms

12 Hardware and software architectures
[Diagram labels: Query Server, Crawler, Index Build; crawled documents; Store, Index, DeltaStore, DeltaIndex (listed twice); data copy; local Gigabit switch; IP sprayer; link to the global IBM intranet]

