MapReduce and the New Software Stack. Outline  Algorithm Using MapReduce  Matrix-Vector Multiplication  Matrix-Vector Multiplication by MapReduce 

Slides:



Advertisements
Similar presentations
Map Reduce Based on A. Rajaraman and J.D.Ullman. Mining Massive Data Sets. Cambridge U. Press, 2009; Ch. 2. Some figures “stolen” from their slides.
Advertisements

พีชคณิตแบบสัมพันธ์ (Relational Algebra) บทที่ 3 อ. ดร. ชุรี เตชะวุฒิ CS (204)321 ระบบฐานข้อมูล 1 (Database System I)
COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 5A Relational Algebra.
Two-Pass Algorithms Based on Sorting
Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
Near-Duplicates Detection
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
1 Chapter 10 Query Processing: The Basics. 2 External Sorting Sorting is used in implementing many relational operations Problem: –Relations are typically.
ONE PASS ALGORITHM PRESENTED BY: PRADHYUMAN RAOL ID : 114 Instructor: Dr T.Y. LIN.
CPSC-608 Database Systems Fall 2009 Instructor: Jianer Chen Office: HRBB 309B Phone: Notes #2.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture.
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Standard architecture emerging: – Cluster of commodity.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
1 Query Processing: The Basics Chapter Topics How does DBMS compute the result of a SQL queries? The most often executed operations: –Sort –Projection,
Jeffrey D. Ullman Stanford University. 2 Formal Definition Implementation Fault-Tolerance Example: Join.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes 1.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Divide-and-Conquer 7 2  9 4   2   4   7
Chapter 3: The Fundamentals: Algorithms, the Integers, and Matrices
DISTRIBUTED DATABASES IN ADBMS Shilpa Seth
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
CSCE Database Systems Chapter 15: Query Execution 1.
Large-scale file systems and Map-Reduce Single-node architecture Memory Disk CPU Google example: 20+ billion web pages x 20KB = 400+ Terabyte 1 computer.
© 2006 Pearson Addison-Wesley. All rights reserved14 A-1 Chapter 14 Graphs.
DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Relational Algebra (Chapter 7)
Chapter 6 The Relational Algebra Copyright © 2004 Ramez Elmasri and Shamkant Navathe.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 4 Relational Algebra.
Database Management Systems 1 Raghu Ramakrishnan Relational Algebra Chpt 4 Xin Zhang.
Copyright © Cengage Learning. All rights reserved.
CS 257 Chapter – 15.9 Summary of Query Execution Database Systems: The Complete Book Krishna Vellanki 124.
Database Management Systems, R. Ramakrishnan1 Relational Algebra Module 3, Lecture 1.
2 2.1 © 2012 Pearson Education, Inc. Matrix Algebra MATRIX OPERATIONS.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Ver Chapter 13: Graphs Data Abstraction & Problem Solving with C++
CSCE Database Systems Chapter 15: Query Execution 1.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 12 – Introduction to.
Chapter 12 Query Processing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
An Algorithm for the Consecutive Ones Property Claudio Eccher.
Aggregator Stage : Definition : Aggregator classifies data rows from a single input link into groups and calculates totals or other aggregate functions.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Chapter 10 The Basics of Query Processing. Copyright © 2005 Pearson Addison-Wesley. All rights reserved External Sorting Sorting is used in implementing.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Jeffrey D. Ullman Stanford University.  A real story from CS341 data-mining project class.  Students involved did a wonderful job, got an “A.”  But.
MapReduce and Hadoop Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata November 10, 2014.
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
Module 2: Intro to Relational Model
Database Management System
Chapter 2: Relational Model
Chapter 2: Intro to Relational Model
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
One-Pass Algorithms for Database Operations (15.2)
Chapter 2: Intro to Relational Model
Implementation of Relational Operations
Chapter 2: Intro to Relational Model
Example of a Relation attributes (or columns) tuples (or rows)
Chapter 2: Intro to Relational Model
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Presentation transcript:

MapReduce and the New Software Stack

Outline  Algorithm Using MapReduce  Matrix-Vector Multiplication  Matrix-Vector Multiplication by MapReduce  Relational-Algebra Operations  Relational-Algebra by MapReduce  Matrix Multiplication  Matrix Multiplication with one MapReduce

Algorithm Using MapReduce

Matrix-Vector Multiplication Sample application: The ranking of Web pages that goes on at search engines.

Matrix-Vector Multiplication by MapReduce Assume that n is large, but not so large that vector v cannot fit in main memory and thus be available to every Map task. The matrix M and the vector v each will be stored in a file of the DFS. v is first read into main memory at the compute node executing a Map task.  Map:  From each matrix element m ij it produces the key-value pair (i,m ij v j ).  Reduce:  Sum all the values associated with a given key i. The result will be a pair (i, x i ).

Matrix-Vector Multiplication  If the Vector v Cannot Fit in Main MemoryMatrix-Vector

Relational-Algebra Operations

Example Relations Attribute Schema Tuples For instance, the first row of Figure is the pair (url1, url2) that says the Web page url1 has a link to page url2.

Selection  Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C. The result of this selection is denoted σ C (R). sIDNameGradeClass S01Adem5C S02Bob4A S03Cathy6B S04Dave5B Student Result = δ Grade >5 (Students) sIDNameGradeClass S03Cathy6B

Computing Selections by MapReduce  Map: For each tuple t in R, test if it satisfies C. If so, produce the key-value pair (t, t). That is, both the key and value are t.  Reduce: The Reduce function is the identity. It simply passes each key-value pair to the output.

Projection  For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S. The result of this projection is denoted π S (R). Result = π sID, Name (Students) sIDName S01Adem S02Bob S03Cathy S04Dave

Computing Projections by MapReduce  Map: For each tuple t in R, construct a tuple t′ by eliminating from t those components whose attributes are not in S. Output the key-value pair (t′, t′).  Reduce: For each key t′ produced by any of the Map tasks, there will be one or more key-value pairs (t′, t′). The Reduce function turns (t′, [t′, t′,..., t′]) into (t′, t′), so it produces exactly one pair (t′, t′) for this key t′.

Union

Union by MapReduce  Map: Turn each input tuple t into a key-value pair (t, t).  Reduce: Associated with each key t there will be either one or two values. Produce output (t, t) in either case.

Intersection

Intersection by MapReduce  Map: Turn each tuple t into a key-value pair (t, t).  Reduce: If key t has value list [t, t], then produce (t, t). Otherwise, produce nothing.

Difference

Difference by MapReduce  Map: For a tuple t in R, produce key-value pair (t,R), and for a tuple t in S, produce key-value pair (t, S). Note that the intent is that the value is the name of R or S (or better, a single bit indicating whether the relation is R or S), not the entire relation.  Reduce: For each key t, if the associated value list is [R], then produce (t, t). Otherwise, produce nothing.

Example:Form

Natural Join  Given two relations, compare each pair of tuples, one from each relation. If the tuples agree on all the attributes that are common to the two schemas, then produce a tuple that has components for each of the attributes in either schema and agrees with the two tuples on each attribute. Result = Students Departments

Natural Join by MapReduce  Map: For each tuple (a, b) of R, produce the key-value pair (b, (R, a)). For each tuple (b, c) of S, produce the key-value pair (b, (S, c)).  Reduce:  Each key value b will be associated with a list of pairs that are either of the form (R, a) or (S, c). Construct all pairs consisting of one with first component R and the other with first component S, say (R, a) and (S, c). The output from this key and value list is a sequence of key-value pairs.

Grouping and Aggregation  Given a relation R, partition its tuples according to their values in one set of attributes G, called the grouping attributes. Then, for each group, aggregate the values in certain other attributes.  We denote a grouping-and-aggregation operation on a relation R by γ X (R), where X is a list of elements that are either (a) A grouping attribute, or (b) An expression θ(A), where θ is one of the five aggregation operations such as SUM, and A is an attribute not among the grouping attributes.

Example Imagine that a social-networking site has a relation Friends(User, Friend) This relation has tuples that are pairs (a, b) such that b is a friend of a. The site might want to develop statistics about the number of friends members have. This operation can be done by grouping and aggregation, specifically γUser,COUNT(Friend)(Friends) The result will be one tuple for each group, and a typical tuple would look like (Sally, 300), if user “Sally” has 300 friends.

Grouping and Aggregation by MapReduce  Map: For each tuple (a, b, c) produce the key-value pair (a, b).  Reduce: Each key a represents a group. Apply the aggregation operator θ to the list [b 1, b 2,..., b n ] of B-values associated with key a. The output is the pair (a, x), where x is the result of applying θ to the list.

Matrix Multiplication

Matrix Multiplication by MapReduce  Map: For each matrix element m ij, produce the key value pair (j, (M, i,m ij )). Likewise, for each matrix element n jk, produce the key value.  Reduce : For each key j, examine its list of associated values. For each value that comes from M, say (M, i,m ij ), and each value that comes from N, say (N, k, n jk ), produce a key-value pair with key equal to (i, k) and value equal to the product of these elements, m ij n jk.

Matrix Multiplication with one MapReduce  Map: For each element m ij of M, produce all the key-value pairs ((i, k), (M, j,m ij )) for k = 1, 2,..., up to the number of columns of N. Similarly, for each element n jk of N, produce all the key-value pairs ((i, k), (N, j, n jk )) for i = 1, 2,..., up to the number of rows of M. As before, M and N are really bits to tell which of the two relations a value comes from.  Reduce: Each key (i, k) will have an associated list with all the values (M, j,m ij ) and (N, j, n jk ), for all possible values of j. The Reduce function needs to connect the two values on the list that have the same value of j, for each j. An easy way to do this step is to sort by j the values that begin with M and sort by j the values that begin with N, in separate lists. The jth values on each list must have their third components, m ij and n jk extracted and multiplied. Then, these products are summed and the result is paired with (i, k) in the output of the Reduce function.