Indexing The World Wide Web
Authors: Abhishek Das (Google Inc., USA), Ankit Jain (Google Inc., USA)
Presented by: Anamika Mukherji
Date: 3/26/2013



Is Indexing Difficult? Yes!
- Words are not known beforehand
- Content is available in different languages
- Variations in grammar and style
- No structure: riddled with colors, fonts, images, etc.
- Various byte-encoding schemes

Answering The User's Query
Retrieval for a typical query:
1. Find the terms in the dictionary. Start with the least frequent term, since its posting list will be the shortest.
2. Fetch the corresponding posting lists.
3. Intersect the lists on document identifiers to get the relevant documents.
4. Rank and re-order the documents to present them to the user.
To get quality results as fast as possible, the usage of each resource must be understood:
- Disk space
- Disk transfer
- Memory
- CPU time
The choice of data structure impacts CPU and storage:
- A fixed-length array is wasteful if posting lists are kept in memory.
- A singly linked list allows cheap insertions and updates.
- A variable-length array requires less CPU time.
- A linked list of fixed-length arrays can be used for each term.
- Avoid pointers when storing the posting list in memory.
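The retrieval steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the in-memory dictionary `index` and the function names `intersect` and `answer_query` are assumptions, ranking is omitted, and the document IDs are borrowed from the posting-list example later in the deck.

```python
from typing import Dict, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Merge-intersect two posting lists sorted by document ID."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def answer_query(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """Return the IDs of documents that contain every query term."""
    # A term missing from the dictionary means no document can match.
    if not terms or any(t not in index for t in terms):
        return []
    # Start with the least frequent term: its posting list is the shortest.
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, index[term])
        if not result:
            break  # nothing left to intersect
    return result


# Hypothetical toy index: term -> sorted document IDs.
index = {"the": [1, 2, 3, 4, 5, 6], "to": [1, 3, 4, 5, 6], "john": [2, 4, 6]}
print(answer_query(index, ["the", "to", "john"]))  # [4, 6]
```

Processing the shortest list first keeps every intermediate result no longer than that list, which bounds the work done in the later intersections.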

Better Understanding of User Intent
- Check the proximity of different terms.
- A positional index expands storage and slows down query processing.
- Phrase-based indexing is expensive, and there is no accurate mechanism for identifying which phrase might be used.
  - Use a good phrase.
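As a rough illustration of a proximity check, the sketch below assumes a hypothetical positional index that maps each term to the word positions where it occurs in each document. The layout and the name `within_k` are assumptions made for this example; the extra per-occurrence positions are what expand storage.

```python
from typing import Dict, List

# Hypothetical positional index layout: term -> {doc_id: sorted word positions}
PositionalIndex = Dict[str, Dict[int, List[int]]]


def within_k(index: PositionalIndex, t1: str, t2: str, k: int) -> List[int]:
    """Return documents where t1 and t2 occur within k positions of each other."""
    hits = []
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    # Only documents containing both terms can match.
    for doc in sorted(p1.keys() & p2.keys()):
        # Brute-force comparison of positions; fine for a toy example.
        if any(abs(a - b) <= k for a in p1[doc] for b in p2[doc]):
            hits.append(doc)
    return hits


index = {
    "web": {1: [3, 17], 2: [8]},
    "search": {1: [4], 2: [40]},
}
print(within_k(index, "web", "search", 1))  # [1]
```

The nested position comparison hints at why positional queries cost more than a plain document-ID intersection.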

Document vs. Term Based Partitioning

Memory vs. Disk Storage

Compressing The Index
Advantages of a compressed index:
- Faster transfer of data from disk to memory
- Reduces disk seek time
Compression schemes:
- Variable byte encoding
- Bit-level encoding
Using gaps
Each posting is a (document ID, term frequency) pair. With gaps, each document ID is replaced by its difference from the previous document ID in the list.
Original posting lists:
  the:  (1, 9) (2, 8) (3, 8) (4, 5) (5, 6) (6, 9)
  to:   (1, 5) (3, 1) (4, 2) (5, 2) (6, 6)
  john: (2, 4) (4, 1) (6, 4)
With gaps:
  the:  (1, 9) (1, 8) (1, 8) (1, 5) (1, 6) (1, 9)
  to:   (1, 5) (2, 1) (1, 2) (1, 2) (1, 6)
  john: (2, 4) (2, 1) (2, 4)
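The gap transformation in the example can be reproduced with a short sketch. The function names `to_gaps` and `from_gaps` are illustrative, and postings are assumed to be (document ID, term frequency) pairs sorted by document ID.

```python
from typing import List, Tuple

Posting = Tuple[int, int]  # (document ID or gap, term frequency)


def to_gaps(postings: List[Posting]) -> List[Posting]:
    """Replace each document ID with its difference from the previous one."""
    out, prev = [], 0
    for doc_id, tf in postings:
        out.append((doc_id - prev, tf))
        prev = doc_id
    return out


def from_gaps(gapped: List[Posting]) -> List[Posting]:
    """Recover the original document IDs with a running sum of the gaps."""
    out, doc_id = [], 0
    for gap, tf in gapped:
        doc_id += gap
        out.append((doc_id, tf))
    return out


to_list = [(1, 5), (3, 1), (4, 2), (5, 2), (6, 6)]
print(to_gaps(to_list))  # [(1, 5), (2, 1), (1, 2), (1, 2), (1, 6)]
```

Gaps are small for frequent terms, which is what lets the variable-length codes on the following slides store them in fewer bits than full document IDs.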

Variable Byte Encoding
- Uses an integral but adaptive number of bytes, depending on the gap size.
- The first bit of each byte is a continuation bit.
- The remaining 7 bits in each byte encode part of the gap.
To decode a gap:
1. Read a sequence of bytes until the continuation bit flips.
2. Extract and concatenate the 7-bit parts to get the magnitude of the gap.
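A minimal sketch of variable byte encoding follows. The slide does not say which value of the continuation bit marks the end of a gap; this example assumes one common convention, in which the bit is 0 on every byte except the last byte of a gap, where it is 1. The names `vb_encode` and `vb_decode` are illustrative.

```python
from typing import List


def vb_encode(gap: int) -> bytes:
    """Encode one non-negative gap using 7 payload bits per byte."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    parts = [gap & 0x7F]
    gap >>= 7
    while gap:
        parts.append(gap & 0x7F)
        gap >>= 7
    parts.reverse()    # most significant 7-bit group first
    parts[-1] |= 0x80  # assumed convention: high bit set on the final byte
    return bytes(parts)


def vb_decode(data: bytes) -> List[int]:
    """Decode a concatenated stream of variable-byte-encoded gaps."""
    gaps, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)  # append the 7-bit part
        if byte & 0x80:               # continuation bit flipped: gap complete
            gaps.append(n)
            n = 0
    return gaps


stream = b"".join(vb_encode(g) for g in [5, 824, 214577])
print(vb_decode(stream))  # [5, 824, 214577]
```

Small gaps take a single byte, while larger gaps grow one byte at a time, so decoding stays byte-aligned and cheap.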

Bit Level Encoding
- Used when disk space is at a premium.
- These codes adapt the length of the code at a finer-grained bit level.
- The codeword is divided into two parts: a prefix and a suffix.
- The prefix indicates the binary magnitude of the value and tells the decoder how many bits there are in the suffix.
- The suffix indicates the value of the number within the corresponding binary range.
- Query processing is more time consuming.
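The slide does not name a specific code. As one example that fits the prefix/suffix description, the sketch below uses the Elias gamma code, which is an assumption on this editor's part: the prefix is a unary count of the suffix length, and the suffix is the binary value without its leading 1 bit. Bits are modelled as a string of "0" and "1" characters for readability rather than packed into bytes.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    suffix = bin(n)[3:]               # binary digits after the leading 1
    prefix = "1" * len(suffix) + "0"  # unary length of the suffix
    return prefix + suffix


def gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of gamma codewords."""
    out, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # the prefix says how many suffix bits follow
            length += 1
            i += 1
        i += 1                 # skip the terminating 0 of the prefix
        suffix = bits[i:i + length]
        i += length
        out.append(int("1" + suffix, 2))  # restore the implicit leading 1
    return out


print(gamma_encode(13))                    # 1110101
print(gamma_decode("0" + "11001" + "1110101"))  # [1, 5, 13]
```

Because codewords are not byte-aligned, the decoder has to walk the stream bit by bit, which is consistent with the slide's note that query processing becomes more time consuming.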

Ordering by Highest Impact First
Example posting list, as (document ID, term frequency) pairs:
  (12, 2) (17, 2) (29, 1) (32, 1) (40, 6) (78, 1) (101, 3) (106, 1)
When the list is reordered by term frequency, it becomes:
  (40, 6) (101, 3) (12, 2) (17, 2) (29, 1) (32, 1) (78, 1) (106, 1)
The repeated frequency information can then be factored out into a prefix component with a counter that indicates how many documents share the same frequency value (frequency : count : document IDs):
  6 : 1 : 40
  3 : 1 : 101
  2 : 2 : 12, 17
  1 : 4 : 29, 32, 78, 106
Not storing the repeated frequencies gives a considerable saving. Finally, if differences of document identifiers are taken, we get the following:
  6 : 1 : 40
  3 : 1 : 101
  2 : 2 : 12, 5
  1 : 4 : 29, 3, 46, 28
The document gaps within each equal-frequency segment of the list are now on average larger than when the whole list was sorted by document identifier, thereby requiring more encoding bits/bytes.
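The reordering in this example can be sketched as follows. The function name `impact_order` and the tuple layout (frequency, count, gaps) are illustrative choices, and postings are assumed to be (document ID, term frequency) pairs.

```python
from itertools import groupby
from typing import List, Tuple


def impact_order(postings: List[Tuple[int, int]]) -> List[Tuple[int, int, List[int]]]:
    """Group postings by descending term frequency and gap-encode each group."""
    # Highest frequency first; ties are broken by ascending document ID.
    by_tf = sorted(postings, key=lambda p: (-p[1], p[0]))
    segments = []
    for tf, group in groupby(by_tf, key=lambda p: p[1]):
        doc_ids = [doc_id for doc_id, _ in group]
        # First ID is stored as-is, the rest as differences within the segment.
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        segments.append((tf, len(doc_ids), gaps))
    return segments


postings = [(12, 2), (17, 2), (29, 1), (32, 1), (40, 6), (78, 1), (101, 3), (106, 1)]
for tf, count, gaps in impact_order(postings):
    print(f"{tf} : {count} : {', '.join(map(str, gaps))}")
# 6 : 1 : 40
# 3 : 1 : 101
# 2 : 2 : 12, 5
# 1 : 4 : 29, 3, 46, 28
```

The trade-off is visible in the output: each frequency is stored once, but gaps such as 46 and 28 are larger than any gap in the list sorted purely by document ID.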

Managing Multiple Indices
Multiple indices, bucketed by rate of refreshing:
- The large, rarely refreshing pages index
- The small, ever-refreshing pages index
- The dynamic real-time/news pages index
Waterfall approach:
- Pages discovered in one tier can be passed on to the next tier over time.
- Invalidate older index and crawl file entries.

SCALING THE SYSTEM
Web search engines use distributed indexing algorithms for index construction.
Distributed File System
In order to manage large amounts of data across large commodity clusters, a distributed file system that provides efficient remote file access, file transfers, and the ability to carry out concurrent independent operations while being extremely fault tolerant is essential.
Map-Shuffle-Reduce
- Map: The master node chops up the problem into small chunks and assigns each chunk to a worker. The worker either processes the chunk of data with the mapper and returns the result to the master, or further chops up the input data and assigns it hierarchically.
- Shuffle: Group the key-value pairs from the mappers.
- Reduce: Take the sub-answers and combine them to create the final output.
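The three phases can be illustrated with a single-process sketch that builds a small inverted index. This is only a toy model of the data flow: a real deployment distributes the chunks across workers on a cluster, and the function names, the whitespace tokenization, and the sample documents are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(doc_id: int, text: str) -> List[Tuple[str, int]]:
    """Mapper: emit a (term, doc_id) pair for every word in one document."""
    return [(term, doc_id) for term in text.lower().split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group the mappers' key-value pairs by key."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: List[int]) -> Tuple[str, List[int]]:
    """Reducer: combine the sub-answers for one term into a posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    # Each document stands in for one chunk handed to a worker.
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


docs = {1: "john to the store", 2: "the web", 3: "to the web"}
index = build_index(docs)
print(index["the"])  # [1, 2, 3]
print(index["web"])  # [2, 3]
```

Because each mapper call and each reducer call depends only on its own input, both phases can run in parallel on separate machines; only the shuffle needs to move data between them.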

FUTURE RESEARCH DIRECTIONS
Real-Time Data and Search
What can we do with each tweet?
- Create a social graph
- Extract and index links
- Real-time related topics
- Sentiment analysis
Social and Personalized Web Search
- Facebook, Twitter, etc.
- Facebook users post a wealth of information:
  - Static: book and movie interests
  - Dynamic: user locations, status updates, wall posts
- Learning a user's personal information can personalize search results.
- Facebook is impacting the world of search:
  - Opened data to third-party services
  - Search for 2 degrees of user

Pros and Cons
What I liked about it:
- Delves into the history of search engines
- Talks about future enhancements
- Explains how a search engine works
What I didn't like:
- Skims the surface without going deep.
- Includes very few examples, which makes understanding difficult.
- The "Compressing the Index" section lacks structure, which makes it difficult to understand.