Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.

Presentation transcript:

Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented By Guan

Overview
–INTRODUCTION
–TESTBED ARCHITECTURE
–PIPELINED INDEXER DESIGN
–MANAGING INVERTED FILES IN AN EMBEDDED DATABASE SYSTEM
–COLLECTING GLOBAL STATISTICS
–CONCLUSIONS

Inverted Index An inverted index is similar to a book index.

Steps to build an inverted index
–Processing each page to extract postings
–Sorting the postings first on index terms and then on locations
–Writing out the sorted postings as a collection of inverted lists on disk
Index build time becomes critical for two reasons:
–Web scale and growth rate
–Rate of change
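The three build steps listed on this slide can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the authors' implementation: the function names, the tokenizer, and the choice of (document id, word position) as the "location" are assumptions, and the final step keeps the inverted lists in a dictionary instead of writing them to disk.

```python
import re
from itertools import groupby

def extract_postings(doc_id, text):
    """Yield (term, doc_id, position) postings for one page."""
    for position, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        yield (term, doc_id, position)

def build_inverted_index(pages):
    """pages: dict mapping doc_id -> page text (hypothetical input format)."""
    # Step 1: process each page to extract postings.
    postings = [p for doc_id, text in pages.items()
                for p in extract_postings(doc_id, text)]
    # Step 2: sort first on index term, then on location (doc_id, position).
    postings.sort()
    # Step 3: group the sorted postings into one inverted list per term.
    # A real system would write these lists to disk here.
    return {term: [(d, pos) for _, d, pos in group]
            for term, group in groupby(postings, key=lambda p: p[0])}

pages = {1: "web index for the web", 2: "distributed index"}
index = build_inverted_index(pages)
```

Here `index["web"]` holds two locations in page 1, and `index["index"]` holds one location in each page.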

Purpose of the paper: to optimize index build times for massive (Web) collections (challenges and solutions).
–Propose a pipelined architecture on each indexing node to enhance performance through intra-node parallelism (build performance issues).
–Propose an appropriate format for inverted files that makes optimal use of the features of an embedded database system.
–Examine different strategies for collecting global statistics (e.g., inverse document frequency, IDF) from a distributed collection, an issue that any distributed system for building inverted indexes needs to address.

TESTBED ARCHITECTURE
–Distributors. These nodes store the collection of Web pages to be indexed. Pages are gathered by a Web crawler.
–Indexers. These nodes execute the core of the index-building engine.
–Query servers. Each of these nodes stores a portion of the final inverted index and an associated lexicon. The lexicon lists all the terms in the corresponding portion of the index and their associated statistics.
Overview of indexing process.

PIPELINED INDEXER DESIGN: logical phases. The core of the indexing system is the index-builder process that executes on each indexer.

PIPELINED INDEXER DESIGN: multi-threaded execution. Performance gain through pipelining: saves 1.5 hours for 5 million pages; % in general
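A rough way to picture the multi-threaded pipeline is three stages connected by bounded queues, so that one batch can be loaded while another is processed and a third is flushed. The sketch below is only an illustration under assumptions: the stage names (load, process, flush), the batch format, and the queue sizes are hypothetical and are not taken from the slides.

```python
import queue
import threading

def run_pipeline(batches):
    """Run load -> process -> flush as three concurrent stages."""
    loaded = queue.Queue(maxsize=2)      # bounded buffers between stages
    processed = queue.Queue(maxsize=2)
    flushed = []                         # stands in for sorted runs on disk
    DONE = object()                      # sentinel marking end of input

    def load():
        for batch in batches:            # e.g. read pages from a distributor
            loaded.put(batch)
        loaded.put(DONE)

    def process():
        while (batch := loaded.get()) is not DONE:
            # Extract (term, doc_id) postings and sort them in memory.
            postings = sorted((term, doc_id)
                              for doc_id, text in batch
                              for term in text.split())
            processed.put(postings)
        processed.put(DONE)

    def flush():
        while (run := processed.get()) is not DONE:
            flushed.append(run)          # e.g. write one sorted run to disk

    threads = [threading.Thread(target=stage) for stage in (load, process, flush)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return flushed

runs = run_pipeline([[(1, "b a")], [(2, "c a")]])
```

The benefit of such a design comes from overlapping I/O-bound stages (loading, flushing) with the CPU-bound stage (processing); the bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.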

MANAGING INVERTED FILES IN AN EMBEDDED DATABASE SYSTEM
Challenge 1: custom implementation vs. existing data management systems. Solution: Berkeley DB.
Challenge 2: designing a scheme for storing inverted files that makes optimal use of the storage structures provided by the data management system. Full list, single payload, mixed list:

3 types of storage schemes:
–1. Full list: The key is an index term, and the value is the complete inverted list for that term.
–2. Single payload: Each posting (an index term, location pair) is a separate key.

3. Mixed list:
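To make the three schemes concrete, the sketch below lays out the same postings as key/value pairs, using plain Python dictionaries in place of the embedded database. The full list and single payload layouts follow the slide text. The mixed list layout is an assumption, since the slide gives no definition here: the key is taken to be a posting and the value a bounded run of the postings that follow it in sorted order. The `chunk` parameter and the empty single-payload value are likewise hypothetical.

```python
from itertools import groupby

# Sorted postings as (term, location) pairs.
postings = [("cat", 3), ("cat", 7), ("cat", 9), ("dog", 2), ("dog", 7)]

def full_list(postings):
    # One key per term; the value is the term's complete inverted list.
    return {term: [loc for _, loc in grp]
            for term, grp in groupby(postings, key=lambda p: p[0])}

def single_payload(postings):
    # One key per posting. ASSUMPTION: the value is left empty here.
    return {posting: b"" for posting in postings}

def mixed_list(postings, chunk=2):
    # ASSUMPTION: the key is the first posting of a run and the value holds
    # the following postings in sorted order, so value length is bounded.
    return {postings[i]: postings[i + 1:i + chunk]
            for i in range(0, len(postings), chunk)}
```

With five postings, the full list layout has two keys with values of different lengths, the single payload layout has five keys, and the mixed list layout has three keys whose values never exceed `chunk - 1` postings.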

Comparison of storage schemes
–Index size: With the mixed list scheme, the length of the value field is approximately constant.
–Zig-zag joins: In the full list scheme, the entire list must be retrieved to compute the join, whereas with the mixed list scheme, access to specific portions of the inverted list is available.
–Hot updates: Since we limit the length of the value field, hot updates are faster with mixed lists than with full lists.
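The zig-zag join mentioned above intersects two sorted inverted lists by letting each list skip ahead to the other's current location instead of scanning every entry. The sketch below works on in-memory lists and uses binary search for the skip; in a keyed store the skip would correspond to a seek on a (term, location) key. The function name and inputs are illustrative assumptions.

```python
from bisect import bisect_left

def zigzag_join(a, b):
    """Intersect two sorted location lists, skipping ahead with binary search."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # jump a forward to first entry >= b[j]
        else:
            j = bisect_left(b, a[i], j + 1)   # jump b forward to first entry >= a[i]
    return out

common = zigzag_join([1, 4, 9, 20, 31], [4, 5, 6, 7, 8, 20, 40])
```

In this example the second list jumps over 5 through 8 in a single step, which is the kind of access to "specific portions of the inverted list" that a scheme with postings in the key can support.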

Experimental results: 2 million Web pages, 4.9 million distinct terms, 312 million postings. Optimal mixed list 30% better than full list.

COLLECTING GLOBAL STATISTICS
–ME strategy: sending local information during merging.
–FL strategy: sending local information during flushing.
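Whichever strategy is used, the local information from each indexer has to be combined into collection-wide statistics. The sketch below shows only that aggregation step for document frequencies and a derived IDF value; it does not model when the counts are sent (merging vs. flushing). The function names, the sample data, and the `log(N / df)` form of IDF are assumptions, since the slides do not state which formula is used.

```python
import math
from collections import Counter

def local_document_frequencies(docs):
    """Per-indexer statistics: number of local documents containing each term."""
    df = Counter()
    for text in docs:
        df.update(set(text.split()))     # count each term once per document
    return df

def merge_statistics(local_dfs):
    """Sum the local counts received from every indexer."""
    total = Counter()
    for df in local_dfs:
        total.update(df)
    return total

def idf(term, global_df, num_docs):
    # ASSUMPTION: one common IDF variant, log(N / df).
    return math.log(num_docs / global_df[term])

indexer_a = ["web index", "web crawler"]   # hypothetical pages on indexer A
indexer_b = ["distributed index"]          # hypothetical pages on indexer B
global_df = merge_statistics(
    local_document_frequencies(docs) for docs in (indexer_a, indexer_b))
```

A term such as "index" that appears on both indexers gets a global document frequency of 2, which neither indexer could have computed from its own pages alone.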

Experiments In general, experiments show the FL strategy outperforming ME, although they seem to converge as the collection size becomes large. Furthermore, as the collection size grows, the relative overheads of both strategies decrease.

CONCLUSIONS In this paper we addressed the problem of efficiently constructing inverted indexes over large collections of Web pages. We proposed a new pipelining technique to speed up index construction and demonstrated how to identify the right buffer sizes for maximum performance. We proposed and compared different schemes for storing and managing inverted files using an embedded database system. Finally, we identified the key characteristics of methods for efficiently collecting global statistics from distributed inverted indexes.

Q & A