Chapter 5: Ranking with Indexes

More Indexing Techniques
- Indexing techniques:
  - Inverted files: the best choice for most applications
  - Suffix trees and arrays: faster for phrase searches and less common queries, but harder to build and maintain
  - Signature files: word-oriented index structures based on hashing
- Design issues:
  - Search cost and space overhead
  - Cost of building and updating

Indexes and Ranking
- Indexes are designed to support search
  - Faster response time, support for updates
- Text search engines use a particular form of search: ranking
  - Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm
- What is a reasonable abstract model for ranking?
  - Enables discussion of indexes without the details of the retrieval model

Abstract Model of Ranking

More Concrete Model

Inverted Index
- Each index term is associated with an inverted list
  - Contains lists of documents, or lists of word occurrences in documents, and other information
  - Each entry is called a posting
  - The part of a posting that refers to a specific document or location is called a pointer
  - Each document in the collection is given a unique number
  - Lists are usually document-ordered (sorted by document number)
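The document-ordered lists described above can be sketched in a few lines of Python (a hypothetical helper, not code from the slides; here a posting is just a document number):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a simple inverted index: term -> document-ordered list of doc numbers.

    `docs` maps document numbers to text. Visiting documents in numeric
    order keeps every inverted list sorted by document number.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # one posting per (term, document), regardless of how often it occurs
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

docs = {1: "tropical fish live in tropical water",
        2: "fish tank"}
index = build_inverted_index(docs)
print(index["fish"])  # [1, 2]
```

Because the lists are document-ordered, two lists can be intersected with a linear merge, which is what makes multi-term queries efficient.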

Example “Collection”

Simple Inverted Index

Inverted Index with Counts
- Supports better ranking algorithms
- (Figure: each posting pairs a document number with the number of times the word occurs in that document)

Inverted Index with Positions
- Supports proximity matches
- (Figure: each posting pairs a document number with a position in the document)

Proximity Matches
- Matching phrases or words within a window
  - e.g., "tropical fish", or "find tropical within 5 words of fish"
- Word positions in inverted lists make these types of query features efficient
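A positional index and a window match along these lines might look like the following (an illustrative sketch; the function names are invented):

```python
from collections import defaultdict

def positional_index(docs):
    """term -> {doc_id: list of word positions in that document}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within_window(index, t1, t2, k):
    """Doc ids where t1 and t2 occur within k words of each other."""
    hits = []
    # only documents containing BOTH terms can match
    for doc_id in index[t1].keys() & index[t2].keys():
        if any(abs(p1 - p2) <= k
               for p1 in index[t1][doc_id]
               for p2 in index[t2][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {1: "find tropical fish in a fish tank",
        2: "tropical islands and fish far apart"}
idx = positional_index(docs)
print(within_window(idx, "tropical", "fish", 1))  # [1]
```

A phrase match like "tropical fish" is the special case where the position of `t2` must be exactly the position of `t1` plus one.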

MapReduce
- A distributed programming framework that focuses on data placement and distribution
- A programming model for processing and generating large data sets
- Mapper (the map function)
  - Transforms a list of items (key/value pairs) into another list of items (intermediate key/value pairs) of the same length
- Reducer (the reduce function)
  - Transforms/merges a list of items (intermediate key/value pairs with the same intermediate key) into a single item
- Many mapper and reducer tasks run on a cluster of machines

Mappers and Reducers
- A Map-Reduce job consists of:
  - A Map function (inputs → key-value pairs)
  - A Reduce function (key and list of values → outputs)
- Map and Reduce tasks apply the Map or Reduce function to (typically) many of their inputs
  - The unit of parallelism
- Mapper = application of the Map function to a single input
- Reducer = application of the Reduce function to a single key-(list of values) pair

MapReduce
- In 2003, a system was built at Google to simplify construction of the inverted index for handling searches
- Example: counting the number of occurrences of each word in a large collection of documents

    Map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    Reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
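The pseudocode above can be rendered as runnable Python, with an in-memory dictionary standing in for the shuffle (a single-process simplification of the real distributed runtime; the function names are invented):

```python
from collections import defaultdict

def map_word_count(doc_name, contents):
    # mirrors Map: emit (word, 1) for every word occurrence
    return [(w, 1) for w in contents.split()]

def reduce_word_count(word, counts):
    # mirrors Reduce: sum the partial counts for one word
    return word, sum(counts)

def run(docs):
    groups = defaultdict(list)  # the "shuffle": group values by intermediate key
    for name, text in docs.items():
        for word, one in map_word_count(name, text):
            groups[word].append(one)
    return dict(reduce_word_count(w, c) for w, c in groups.items())

print(run({"d1": "the cat sat", "d2": "the mat"}))
# {'the': 2, 'cat': 1, 'sat': 1, 'mat': 1}
```

In the real system, the map calls run on many machines, the grouping is done by the framework, and each reduce call may run on yet another machine.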

Inverted Index Creation
- Input: a large number of text documents
- Output: a postings list for every term in the collection
  - For every word: all documents that contain the word, and the positions
- Example input:
  - Doc 1: "I saw the cat on the mat"
  - Doc 2: "I saw the dog on the mat"

Inverted Index Creation
- Solution to the problem:
  - Mapper: for each word in a document, generate (word, [URL, position])
  - Reducer: aggregate all the information for the same word

    // Pseudo-code for "inverted index":
    Map(String key, String value):
      // key: document URL
      // value: document contents
      vector words = tokenize(value);
      for position from 0 to len(words):
        EmitIntermediate(words[position], {key, position});

    Reduce(String key, Iterator values):
      // key: a word
      // values: a list of {URL, position} tuples
      postings_list = [];
      for each v in values:
        postings_list.append(v);
      sort(postings_list);  // sort by URL, position
      Emit(key, AsString(postings_list));

Inverted Index Creation
- Inverted index combiners:
  - A combiner reduces the number of intermediate outputs by aggregating all occurrences of each word within a document before the shuffle
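A combiner can be sketched as local pre-aggregation on the mapper's output (a single-process approximation; in a real framework the combiner runs between the map and shuffle phases, and `map_with_combiner` is an invented name):

```python
from collections import Counter

def map_with_combiner(doc_name, contents):
    """Mapper plus local combiner: instead of shipping ('the', 1) once per
    occurrence across the network, pre-sum the counts within this document."""
    raw = [(w, 1) for w in contents.split()]  # plain mapper output
    combined = Counter()
    for word, one in raw:                     # combiner: local aggregation
        combined[word] += one
    return list(combined.items())             # one pair per distinct word

pairs = map_with_combiner("d1", "the cat sat on the mat the end")
print(sorted(pairs))
# [('cat', 1), ('end', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 3)]
```

Here eight raw pairs shrink to six, and the savings grow with document length: the combined output is bounded by the number of distinct words, not the number of occurrences.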

MapReduce
- MapReduce automatically parallelizes and executes a program on a large cluster of commodity machines
- The runtime system handles:
  - Partitioning the input data
  - Scheduling the program's execution
  - Handling machine failures
  - Managing required inter-machine communication
- A MapReduce computation processes many terabytes of data on hundreds or thousands of machines
- More than 100,000 MapReduce jobs are executed on Google's clusters every day

MapReduce
- Basic process:
  - Map stage: transforms data records into pairs, each with a key and a value
  - Shuffle: uses a hash function so that all pairs with the same key end up next to each other and on the same machine
  - Reduce stage: processes records in batches, where all pairs with the same key are processed at the same time
- Idempotence of the Mapper and Reducer provides fault tolerance
  - Multiple operations on the same input give the same output
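The shuffle's hash-based routing can be illustrated in a few lines (a single-process sketch; `shuffle` and `num_machines` are invented names, and Python's `hash` stands in for the framework's partitioning function):

```python
def shuffle(pairs, num_machines):
    """Route each (key, value) pair to a machine by hashing the key,
    so all pairs sharing a key always land on the same machine."""
    partitions = [[] for _ in range(num_machines)]
    for key, value in pairs:
        machine = hash(key) % num_machines  # same key -> same machine, every time
        partitions[machine].append((key, value))
    return partitions

pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("fish", 1)]
parts = shuffle(pairs, 3)
# both ("cat", 1) pairs are guaranteed to be in the same partition
cat_parts = [i for i, p in enumerate(parts) if ("cat", 1) in p]
print(len(cat_parts) == 1)  # True
```

The guarantee matters: a reducer can process all values for a key without consulting any other machine, which is exactly what makes the reduce stage embarrassingly parallel.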

MapReduce

Example: Natural Join
- The join of R(A, B) with S(B, C) is the set of tuples (a, b, c) such that (a, b) is in R and (b, c) is in S
- Mappers need to send R(a, b) and S(b, c) to the same reducer, so they can be joined there
- Mapper output: key = B-value, value = relation name and the other component (A or C)
  - Example: R(1, 2) → (2, (R, 1)); S(2, 3) → (2, (S, 3))

Mapping Tuples
- Mapper for R(1, 2) emits (2, (R, 1))
- Mapper for R(4, 2) emits (2, (R, 4))
- Mapper for S(2, 3) emits (2, (S, 3))
- Mapper for S(5, 6) emits (5, (S, 6))

Grouping Phase
- There is a reducer for each key
- Every key-value pair generated by any mapper is sent to the reducer for its key

Mapping Tuples
- (2, (R, 1)), (2, (R, 4)), and (2, (S, 3)) go to the reducer for B = 2
- (5, (S, 6)) goes to the reducer for B = 5

Constructing Value-Lists
- The input to each reducer is organized by the system into a pair:
  - The key
  - The list of values associated with that key

The Reduce Function for Join
- Given key b and a list of values that are either (R, a_i) or (S, c_j), output each triple (a_i, b, c_j)
  - Thus, the number of outputs made by a reducer is the product of the number of R's on the list and the number of S's on the list
- Example:
  - Reducer for B = 2: input (2, [(R, 1), (R, 4), (S, 3)]) → outputs (1, 2, 3) and (4, 2, 3)
  - Reducer for B = 5: input (5, [(S, 6)]) → no output
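The whole join can be simulated in Python end to end (a single-process sketch with an in-memory shuffle; the function names are invented):

```python
from collections import defaultdict

def map_join(relation, tuples):
    # key = B-value; value = (relation name, the other component)
    if relation == "R":  # R(A, B): emit (b, ('R', a))
        return [(b, ("R", a)) for a, b in tuples]
    else:                # S(B, C): emit (b, ('S', c))
        return [(b, ("S", c)) for b, c in tuples]

def reduce_join(b, values):
    # pair every R-side a with every S-side c for this key b
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    return [(a, b, c) for a in r_side for c in s_side]

R = [(1, 2), (4, 2)]
S = [(2, 3), (5, 6)]
groups = defaultdict(list)  # the shuffle: group by B-value
for key, val in map_join("R", R) + map_join("S", S):
    groups[key].append(val)
result = sorted(t for b, vals in groups.items() for t in reduce_join(b, vals))
print(result)  # [(1, 2, 3), (4, 2, 3)]
```

Note how B = 5 produces no output: its value-list has an S-side but no R-side, so the product of the two list lengths is zero, exactly as the slide describes.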

The Drug-Interaction Problem
- Data consists of records for 3,000 drugs
  - A list of patients taking the drug, dates, and diagnoses
  - About 1 MB of data per drug
- The problem is to find drug interactions
  - Example: two drugs that, when taken together, increase the risk of heart attack
- Must examine each pair of drugs and compare their data

Initial Map-Reduce Algorithm
- The first attempt used the following plan:
  - Key = a set of two drugs {i, j}
  - Value = the record for one of these drugs
- Given drug i and its record R_i, the mapper generates all key-value pairs ({i, j}, R_i), where j is any other drug besides i
- Each reducer receives its key and a list of the two records for that pair: ({i, j}, [R_i, R_j])
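The pair-generating mapper can be sketched as follows (a single-process illustration; `map_drug` is an invented name, the records are toy strings, and `frozenset` is used so that {i, j} and {j, i} are the same key):

```python
from collections import defaultdict

def map_drug(drug_id, record, all_drug_ids):
    """For drug i, emit ({i, j}, record_i) for every other drug j."""
    return [(frozenset({drug_id, j}), record)
            for j in all_drug_ids if j != drug_id]

drugs = {1: "data1", 2: "data2", 3: "data3"}
groups = defaultdict(list)  # the shuffle: group by the pair key
for i, record in drugs.items():
    for key, value in map_drug(i, record, drugs):
        groups[key].append(value)

# each reducer key now holds exactly the two records for that drug pair
print(sorted(groups[frozenset({1, 2})]))  # ['data1', 'data2']
print(len(groups))                        # 3 pairs: {1,2}, {1,3}, {2,3}
```

The cost of this plan is visible even at toy scale: every record is emitted once per other drug, so with 3,000 drugs each ~1 MB record is copied 2,999 times across the network.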

Example: Three Drugs
- Mapper for Drug 1 emits: ({1, 2}, Drug 1 data) and ({1, 3}, Drug 1 data)
- Mapper for Drug 2 emits: ({1, 2}, Drug 2 data) and ({2, 3}, Drug 2 data)
- Mapper for Drug 3 emits: ({1, 3}, Drug 3 data) and ({2, 3}, Drug 3 data)

Example: Three Drugs
- Each mapper's pairs are routed to the reducers for {1, 2}, {1, 3}, and {2, 3}

Example: Three Drugs
- Reducer for {1, 2} receives: Drug 1 data, Drug 2 data
- Reducer for {1, 3} receives: Drug 1 data, Drug 3 data
- Reducer for {2, 3} receives: Drug 2 data, Drug 3 data