Published by Camren Syrett
© 2004, M. Fontoura VLDB, Toronto, September 2004
High Performance Index Build Algorithms for Intranet Search Engines
Marcus Fontoura, Eugene Shekita, Jason Zien, Sridhar Rajagopalan, Andreas Neumann
Agenda
  Overview and problem description
  Global analysis
  Major data structures for index build
  Index build algorithm
Overview and problem description
  Trevi's goal is to provide high-quality intranet search for corporate portals such as w3.ibm.com
    – Scalable text search engine being developed by a joint IBM Research and Software Group team
  This talk focuses on how to efficiently incorporate global analysis into the index build process
Global analysis (GA)
  Duplicate detection
    – Computes a fingerprint for each page (64-bit shingle)
    – Masters are identified using the (previous) static rank
  Anchor text (D1: Trevi)
    – Appends anchor text tokens to documents
  Static rank
    – Host in-degree, i.e., the number of hosts that point to a page (~ PageRank on the IBM intranet)
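
The host in-degree static rank described above can be sketched roughly as follows. This is a minimal illustration, not Trevi's implementation: the function name `host_in_degree`, the `(source_url, target_url)` link format, and the choice to ignore links from a page's own host are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_in_degree(links):
    """Count, for each target page, the distinct hosts that link to it.

    links: iterable of (source_url, target_url) pairs (assumed format).
    Returns a dict {target_url: number_of_distinct_linking_hosts}.
    """
    hosts_by_target = defaultdict(set)
    for source_url, target_url in links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Assumption: links from the page's own host do not count.
        if source_host is None or source_host == target_host:
            continue
        hosts_by_target[target_url].add(source_host)
    return {target: len(hosts) for target, hosts in hosts_by_target.items()}
```

Counting hosts rather than pages means that many links from a single site contribute only once to a page's rank.
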
Index build requires GA
  Rebuild the inverted text index and update the global analysis (GA)
    – Duplicate documents are deleted from the index
    – Anchor text is indexed together with the document's content
    – Static rank gives the index ordering, allowing for early termination during query evaluation
  The time to rebuild the index will be dominated by the GA time as the analyses get more complex
    – Semantic search
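
To illustrate why a rank-ordered index allows early termination, here is a small sketch of a conjunctive (AND) query over posting lists. It assumes that document IDs are integers assigned in static-rank order (0 is the best-ranked document), so every posting list is already sorted from best to worst. The function name and data layout are hypothetical.

```python
def top_k_conjunctive(postings, k):
    """Return the first k doc IDs that appear in every posting list.

    postings: list of posting lists, each sorted ascending by doc ID.
    Assumption: doc IDs follow static-rank order, so the first k matches
    are also the k best-ranked matches and the scan can stop there.
    """
    if not postings or k <= 0 or any(not plist for plist in postings):
        return []
    positions = [0] * len(postings)
    results = []
    candidate = postings[0][0]
    while True:
        matched = True
        for i, plist in enumerate(postings):
            # Advance list i to its first entry >= candidate.
            while positions[i] < len(plist) and plist[positions[i]] < candidate:
                positions[i] += 1
            if positions[i] == len(plist):
                return results  # one list is exhausted: no more matches
            if plist[positions[i]] > candidate:
                candidate = plist[positions[i]]  # skip ahead and retry
                matched = False
                break
        if matched:
            results.append(candidate)
            if len(results) == k:
                return results  # early termination
            candidate += 1
```

With `k` much smaller than the number of matching documents, the scan touches only a prefix of each posting list.
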
Major data structures
  Store
    – Storage for the tokenized version of each document
  Index
    – Inverted text index over the Store
  Delta store and delta index
    – Small versions of the Store and Index with new and modified documents
    – Allow for hourly updates of the Index content
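
As a rough picture of the relationship between the Store and the Index, the sketch below models the Store as a dictionary from document ID to token list and derives an inverted index from it. The in-memory dictionaries and the name `build_inverted_index` are assumptions made for illustration; the slides do not describe the actual on-disk formats.

```python
from collections import defaultdict


def build_inverted_index(store):
    """Build a simple inverted index over a tokenized document store.

    store: dict {doc_id: [token, ...]} (assumed in-memory stand-in for the Store).
    Returns a dict {token: sorted list of doc_ids containing the token}.
    """
    index = defaultdict(set)
    for doc_id, tokens in store.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in index.items()}
```

Under this model, a delta index is the same structure built over the much smaller delta store of new and modified documents.
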
Index build algorithm (1/3)
  Index build merges the current version of the Store (Store i) with the current version of the DeltaStore and generates the new version of the Store and the new Index, Store i+1 and Index i+1
  [Diagram: Store i and DeltaStore feed Index Build, which produces Store i+1 and Index i+1]
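
The merge step can be sketched as a single pass over two inputs sorted by document ID. This assumes both the Store and the DeltaStore can be read as sorted sequences of `(doc_id, tokens)` pairs and that a DeltaStore entry replaces the Store entry with the same ID; neither detail is stated on the slide.

```python
def merge_stores(store, delta_store):
    """Merge Store i with the DeltaStore to produce the entries of Store i+1.

    store, delta_store: lists of (doc_id, tokens) sorted by doc_id (assumed layout).
    Assumption: when both contain the same doc_id, the delta version wins.
    """
    i = j = 0
    while i < len(store) and j < len(delta_store):
        if store[i][0] < delta_store[j][0]:
            yield store[i]
            i += 1
        elif store[i][0] > delta_store[j][0]:
            yield delta_store[j]
            j += 1
        else:
            yield delta_store[j]  # modified document replaces the old version
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    yield from store[i:]
    yield from delta_store[j:]
```

Because the output is produced in document order, it could be streamed straight into an indexing step without materializing the whole new Store.
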
Index build algorithm (2/3)
  Index build using global analysis
  [Diagram. Components: Global Analysis, Index Build, DeltaIndex Build. Inputs: Store i, DeltaStore, newly crawled documents, DeltaStore j. Outputs: Dup i+1, AnchorText i+1, Rank i+1, Store i+1, Index i+1, DeltaStore j+1, DeltaIndex j+1]
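
One way to picture how the three global analysis outputs (duplicates, anchor text, rank) shape the documents that get indexed is the sketch below. The function name, the dictionary and set inputs, and the default rank of 0 for unranked documents are illustrative assumptions, not details taken from the talk.

```python
def apply_global_analysis(store, duplicates, anchor_text, rank):
    """Prepare documents for indexing using global analysis results.

    store:       dict {doc_id: [token, ...]}
    duplicates:  set of doc_ids flagged as non-master duplicates
    anchor_text: dict {doc_id: [anchor token, ...]}
    rank:        dict {doc_id: static rank score}, higher is better
    Returns a list of (doc_id, tokens) ordered from best to worst rank.
    """
    documents = []
    for doc_id, tokens in store.items():
        if doc_id in duplicates:
            continue  # duplicates are left out of the index
        # Anchor text is indexed together with the document's own content.
        documents.append((doc_id, list(tokens) + anchor_text.get(doc_id, [])))
    # Assumption: documents without a rank score default to 0.
    documents.sort(key=lambda doc: rank.get(doc[0], 0), reverse=True)
    return documents
```
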
Index build algorithm (3/3)
  Index build using lagging global analysis
  [Diagram. Components: Global Analysis, Index Build, DeltaIndex Build. Inputs: Store i, DeltaStore, GA i, newly crawled documents, DeltaStore j, GA inputs. Outputs: Store i+1, Index i+1, DeltaStore j+1, DeltaIndex j+1, GA i+1]
  Global Analysis and DeltaIndex build can proceed in parallel
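
The control flow of the lagging scheme might look roughly like the sketch below: the index build consumes the global analysis results from the previous cycle (GA i), while the next global analysis and the delta index build run side by side. The three callables and the function `run_cycle` are hypothetical placeholders, and Python threads are used only to show the structure, not to claim any particular speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cycle(store, delta_store, ga_prev, new_docs,
              index_build, global_analysis, delta_index_build):
    """Run one build cycle with lagging global analysis.

    index_build(store, delta_store, ga_prev) -> (store_next, index_next)
    global_analysis(store_next)              -> ga_next
    delta_index_build(new_docs)              -> delta_index_next
    All three callables are hypothetical and supplied by the caller.
    """
    # The index build does not wait for a fresh analysis: it uses GA i.
    store_next, index_next = index_build(store, delta_store, ga_prev)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Global analysis and the delta index build are independent,
        # so they can be submitted together and awaited afterwards.
        ga_future = pool.submit(global_analysis, store_next)
        delta_future = pool.submit(delta_index_build, new_docs)
        return store_next, index_next, ga_future.result(), delta_future.result()
```

The returned analysis result would be passed as `ga_prev` to the following cycle, which is what makes the analysis "lag" the index by one build.
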
Indexing algorithm
  Radix sort
    – Linear time sorting
    – Flexibility in defining the sort criteria
    – Bigger sort buffers increase performance
  Pipelining load and sort phases
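
A least-significant-digit radix sort over integer keys shows where the linear running time and the flexible sort criteria come from. This is a generic textbook sketch, not the talk's implementation; the `key` callback, the 32-bit key width, and the 8-bit digit size are assumptions.

```python
def radix_sort(records, key, key_bits=32, radix_bits=8):
    """Stable LSD radix sort of records by a non-negative integer key.

    key(record) must return an int in the range [0, 2**key_bits).
    The number of passes is key_bits / radix_bits, independent of the
    number of records, so the total work is linear in len(records).
    """
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(mask + 1)]
        for record in records:
            buckets[(key(record) >> shift) & mask].append(record)
        # Concatenating buckets in order keeps each pass stable.
        records = [record for bucket in buckets for record in bucket]
    return records


# Example: sort (term_id, doc_id) postings by term, then by document.
# Assumption: both IDs fit in 16 bits, so they pack into one 32-bit key.
postings = [(2, 7), (1, 9), (2, 3), (1, 4)]
sorted_postings = radix_sort(postings, key=lambda p: (p[0] << 16) | p[1])
```

Changing the sort criteria only requires a different `key` function, for example packing a rank position ahead of the term ID.
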
Experimental results
  Lagging global analysis does not degrade quality
    – More than 25% performance improvement
    – Even more advantageous when the analyses are more complex
  Indexing algorithm scales linearly with the number of documents
  Superior performance when compared to several state-of-the-art indexing algorithms
Hardware and software architectures
  [Diagram. Labels: Query Server, Crawler, Index Build, Crawled Documents, Store, Index, DeltaStore, DeltaIndex, Local Gigabit Switch, data copy, IP Sprayer, Link to the global IBM Intranet]