Published by Camren Syrett
High Performance Index Build Algorithms for Intranet Search Engines
Marcus Fontoura, Eugene Shekita, Jason Zien, Sridhar Rajagopalan, Andreas Neumann
email@example.com
© 2004, M. Fontoura. VLDB, Toronto, September 2004

Agenda
Overview and problem description
Global analysis
Major data structures for index build
Index build algorithm

Overview and problem description
Trevi's goal is to provide high-quality intranet search capability to corporate portals such as w3.ibm.com
  – Scalable text search engine being developed by a joint IBM Research and Software Group team
This talk focuses on how to efficiently incorporate global analysis into the index build process

Global analysis (GA)
Duplicate detection
  – Computes a fingerprint for each page (64-bit shingle)
  – Masters are identified by using the (previous) static rank
Anchor text (D1: Trevi)
  – Appends anchor text tokens to documents
Static rank
  – Host in-degree, i.e., the number of hosts that point to a page (~ PageRank on the IBM intranet)
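
The duplicate-detection and static-rank steps above can be sketched roughly as follows. The slide does not say how the 64-bit shingle fingerprint is computed, so this sketch assumes one simple scheme: hash every k-token shingle to 64 bits and keep the smallest value. The function names, the choice of `k=4`, the BLAKE2b hash, and the rule of ignoring same-host links are all assumptions made for illustration, not details of Trevi.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def fingerprint64(tokens, k=4):
    """Smallest 64-bit hash over all k-token shingles (assumed scheme)."""
    if len(tokens) < k:
        shingles = [" ".join(tokens)]
    else:
        shingles = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return min(
        int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingles
    )


def host_in_degree(links):
    """Count the distinct hosts linking to each target URL.

    links: iterable of (source_url, target_url) pairs.
    """
    hosts = defaultdict(set)
    for src, dst in links:
        src_host = urlsplit(src).hostname
        if src_host != urlsplit(dst).hostname:  # assumption: same-host links do not count
            hosts[dst].add(src_host)
    return {dst: len(h) for dst, h in hosts.items()}


def pick_masters(docs, rank):
    """Group documents by fingerprint; the master of a group has the highest rank.

    docs: {url: token list}; rank: {url: static rank from the previous build}.
    Ties are broken by URL so the result is deterministic.
    """
    groups = defaultdict(list)
    for url, tokens in docs.items():
        groups[fingerprint64(tokens)].append(url)
    return {fp: max(urls, key=lambda u: (rank.get(u, 0), u)) for fp, urls in groups.items()}
```

Every URL in a group other than its master would be treated as a duplicate. Keeping only the minimum shingle hash is a coarse test; a production system would likely compare more of the shingle set before declaring two pages duplicates.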

Index build requires GA
Rebuild the inverted text index and update the global analysis (GA)
  – Duplicate documents are deleted from the index
  – Anchor text is indexed together with the document's content
  – Static rank gives the index ordering, allowing for early termination during query evaluation
The time to rebuild the index will be dominated by the GA time as the analyses get more complex
  – Semantic search
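
To illustrate why a static-rank index ordering allows early termination, here is a minimal sketch. It assumes document IDs are assigned in descending static-rank order (ID 0 is the highest-ranked page) and that every posting list is sorted by document ID; under those assumptions the first k matches of a conjunctive query are also the k best by static rank, so the scan can stop there. The function name and the dictionary-based index are hypothetical.

```python
def top_k_and(postings, terms, k):
    """Return the first k doc IDs containing all terms, stopping as soon as k are found.

    postings: {term: doc IDs sorted ascending, i.e. best static rank first}.
    """
    lists = [postings.get(t, []) for t in terms]
    if not lists or any(not plist for plist in lists):
        return []
    pos = [0] * len(lists)
    hits = []
    while len(hits) < k:
        if any(pos[i] >= len(lists[i]) for i in range(len(lists))):
            break  # one list is exhausted: no further matches are possible
        # The largest doc ID currently under any cursor is the next candidate.
        candidate = max(lists[i][pos[i]] for i in range(len(lists)))
        aligned = True
        for i, plist in enumerate(lists):
            while pos[i] < len(plist) and plist[pos[i]] < candidate:
                pos[i] += 1
            if pos[i] == len(plist):
                return hits
            if plist[pos[i]] != candidate:
                aligned = False
        if aligned:
            hits.append(candidate)
            pos = [p + 1 for p in pos]
    return hits
```

With `postings = {"index": [0, 2, 3, 5, 8], "build": [1, 2, 5, 7, 8]}`, the call `top_k_and(postings, ["index", "build"], 2)` returns `[2, 5]` without ever reading document 8.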

Major data structures
Store
  – Storage for the tokenized version of each document
Index
  – Inverted text index over the Store
Delta store and delta index
  – Small versions of the Store and Index with new and modified documents
  – Allow for hourly updates of the Index content

Index build algorithm (1/3)
Index build merges the current version of the Store (Store_i) with the current version of the DeltaStore and generates the new versions of the Store and the Index, Store_i+1 and Index_i+1
[Diagram: Store_i and DeltaStore are the inputs to Index Build; Store_i+1 and Index_i+1 are its outputs]
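
The merge described on this slide can be sketched as below. This is a toy in-memory version under stated assumptions: the Store and DeltaStore are modeled as dictionaries from document ID to token list, a DeltaStore entry replaces the Store entry with the same ID, and the index records (document ID, position) pairs. None of these representations come from the talk.

```python
def index_build(store_i, delta_store):
    """Return (store_next, index_next) built from the current Store and DeltaStore."""
    # Delta entries win: they hold the newer version of a document.
    store_next = {**store_i, **delta_store}

    # Inverted index over the merged store: term -> [(doc_id, position), ...].
    index_next = {}
    for doc_id in sorted(store_next):
        for position, term in enumerate(store_next[doc_id]):
            index_next.setdefault(term, []).append((doc_id, position))
    return store_next, index_next
```

Because the index is rebuilt from the merged store, terms that only occurred in a superseded version of a document disappear from the new index without any explicit delete step.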

Index build algorithm (2/3)
Index build using global analysis
[Diagram: three components, Global Analysis, Index Build, and DeltaIndex Build. Labeled data: Store_i, DeltaStore, newly crawled documents, DeltaStore_j; global analysis results Dup_i+1, AnchorText_i+1, Rank_i+1; build outputs Store_i+1, Index_i+1, DeltaStore_j+1, DeltaIndex_j+1]
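
One way to picture how the three global analysis results feed the index build is the sketch below. It assumes the results arrive as a set of duplicate URLs, a mapping from URL to anchor-text tokens, and a mapping from URL to static rank; the function name and these shapes are hypothetical. Duplicates are dropped, anchor tokens are appended to the document, and document IDs are assigned in descending rank order.

```python
def build_with_ga(store, duplicates, anchor_text, rank):
    """Apply global analysis results while building the new store and index.

    store: {url: tokens}; duplicates: set of urls to drop;
    anchor_text: {url: tokens from links pointing at url}; rank: {url: static rank}.
    """
    kept = [url for url in store if url not in duplicates]
    # Highest static rank first, so doc ID order equals rank order.
    kept.sort(key=lambda url: (-rank.get(url, 0), url))

    new_store, index = {}, {}
    for doc_id, url in enumerate(kept):
        tokens = store[url] + anchor_text.get(url, [])
        new_store[doc_id] = (url, tokens)
        for term in sorted(set(tokens)):
            index.setdefault(term, []).append(doc_id)
    return new_store, index
```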

Index build algorithm (3/3)
Index build using lagging global analysis
[Diagram: three components, Global Analysis, Index Build, and DeltaIndex Build. Labeled data: Store_i, DeltaStore, GA_i, newly crawled documents, DeltaStore_j, GA inputs; outputs Store_i+1, Index_i+1, DeltaStore_j+1, DeltaIndex_j+1, and GA_i+1]
Global Analysis and DeltaIndex build can proceed in parallel
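
The dependency structure of a lagging build can be sketched with a thread pool. The three worker functions are deliberately trivial stand-ins (the "analysis" is just a token count, and the "index" is just a document ordering), and the sketch assumes that Index Build consumes the previous results GA_i and therefore does not have to wait for GA_i+1. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def global_analysis(store, delta_store):
    """Stand-in for GA: the 'rank' is just the token count of each document."""
    docs = {**store, **delta_store}
    return {doc_id: len(tokens) for doc_id, tokens in docs.items()}


def index_build(store, delta_store, ga):
    """Merge Store and DeltaStore; order documents by the lagging GA rank."""
    merged = {**store, **delta_store}
    order = sorted(merged, key=lambda d: (-ga.get(d, 0), d))
    return merged, order


def delta_index_build(delta_store_j, new_docs):
    """Fold newly crawled documents into the next DeltaStore and DeltaIndex."""
    delta_next = {**delta_store_j, **new_docs}
    delta_index = {}
    for doc_id in sorted(delta_next):
        for term in sorted(set(delta_next[doc_id])):
            delta_index.setdefault(term, []).append(doc_id)
    return delta_next, delta_index


def build_cycle(store_i, delta_store, ga_i, delta_store_j, new_docs):
    """Run one cycle; GA and the DeltaIndex build are submitted side by side."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ga_future = pool.submit(global_analysis, store_i, delta_store)
        delta_future = pool.submit(delta_index_build, delta_store_j, new_docs)
        # Index build uses GA_i (assumption), so it never blocks on GA_i+1.
        store_next, order_next = index_build(store_i, delta_store, ga_i)
        return store_next, order_next, ga_future.result(), delta_future.result()
```

Threads are used here only to show which steps are independent of each other; CPU-bound work in Python would need separate processes or machines to run truly in parallel.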

Indexing algorithm
Radix sort
  – Linear time sorting
  – Flexibility in defining the sort criteria
  – Bigger sort buffers increase performance
Pipelining load and sort phases
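
As a reminder of how radix sort achieves linear time on postings, here is a generic least-significant-digit radix sort over fixed-width integer keys. It is a textbook sketch, not the talk's implementation. The packing of a posting into one key (`pack_posting`, with a 32-bit term ID in the high bits and a 32-bit document ID in the low bits) is an assumption chosen to show how the sort criteria can be changed simply by changing the key layout.

```python
def radix_sort(keys, key_bits=64, radix_bits=8):
    """Stable LSD radix sort of non-negative integers smaller than 2**key_bits."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # Counting pass: counts[d + 1] holds how many keys have digit d.
        counts = [0] * (mask + 2)
        for key in keys:
            counts[((key >> shift) & mask) + 1] += 1
        # Prefix sums turn counts[d] into the first output slot for digit d.
        for digit in range(mask + 1):
            counts[digit + 1] += counts[digit]
        # Distribution pass: stable, so earlier (lower) digits stay ordered.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & mask
            out[counts[digit]] = key
            counts[digit] += 1
        keys = out
    return keys


def pack_posting(term_id, doc_id):
    """Hypothetical key layout: sort by term first, then by document."""
    return (term_id << 32) | doc_id


def unpack_posting(key):
    return key >> 32, key & 0xFFFFFFFF
```

Each pass touches every key twice, and the number of passes is fixed by the key width, so the total work grows linearly with the number of postings.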

Experimental results
Lagging global analysis does not degrade quality
  – More than 25% performance improvement
  – Even more advantageous when the analyses are more complex
Indexing algorithm scales linearly with the number of documents
Superior performance when compared to several state-of-the-art indexing algorithms

Hardware and software architectures
[Diagram: Crawler, Index Build, and Query Server; Crawled Documents; two sets of Store, Index, DeltaStore, and DeltaIndex with a data copy between them; Local Gigabit Switch; IP Sprayer; link to the global IBM Intranet]