Kurt Rohloff & Richard E. Schantz
Presented by Nerya Or as part of seminar 67877 at the Hebrew University of Jerusalem, 2011


• RDF data and the SPARQL query language
• Introduction to MapReduce and Hadoop
• Rohloff and Schantz's article

• This lecture has a lot of background material!
• Therefore, we will only touch on the basics of this background material.

• RDF – Resource Description Framework.
• Used to represent information about resources.
• Information is represented using statements about resources.
• Each resource is an entity about which multiple statements can be made.

• A very common way to present RDF data is in XML files. (Slide example: an RDF/XML description of a resource with the name "Damian Steer" and the address "137 Cranbook Road, Bristol".) Example source:
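A minimal sketch of what the slide's RDF/XML might have looked like, reconstructed from the surviving fragments (the namespace URIs and property names are assumptions, not the slide's exact markup):

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/terms#">
      <!-- One resource (the subject), with two statements made about it -->
      <rdf:Description rdf:about="http://example.org/people/damian">
        <ex:name>Damian Steer</ex:name>
        <ex:address>137 Cranbook Road, Bristol</ex:address>
      </rdf:Description>
    </rdf:RDF>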

• The data can be seen as a graph.
• Each directed edge goes from a subject node to an object node, with the edge label being a predicate.
Example source:

• Information in RDF can be seen as a set of triples.
• A triple: subject-predicate-object.
• For example: Barak wears glasses
◦ "Barak" is the subject.
◦ "wears" is the predicate.
◦ "glasses" is the object.
• A triples data store keeps all of the data as a set of such triples.
Image source:

• So, what do we need to know about RDF for today's lecture?
• Understand the concept of a triples data store and how it can be used to model a knowledge base.
• There's much more to RDF – but it is not in the scope of today's lecture.

• SPARQL:
◦ SPARQL Protocol and RDF Query Language (a recursive acronym).
• A query language for RDF data.
• Somewhat similar to SQL, but aimed at RDF data.
• Note to members of the audience who took the course "Data on the Web" last year –
◦ The syntax is very similar to what was used in the course project.

• The '?' prefix denotes a variable.
• In a query's result set, all possible ways to substitute all variables while satisfying the query are returned.

    SELECT ?person WHERE {
      ?person :owns ?car .
      ?car :a :car .
      ?car :madeIn :Detroit .
    }

• Can also be seen as a graph pattern to match against the data graph:

    SELECT ?person WHERE {
      ?person :owns ?car .
      ?car :a :car .
      ?car :madeIn :Detroit .
    }

• The goal: return all possible ways to align nodes in the database with variables in the query.
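As a toy illustration (hypothetical data, not from the slides), consider running the query above against these six triples:

    :Alice :owns :car1 .   :car1 :a :car .   :car1 :madeIn :Detroit .
    :Bob   :owns :car2 .   :car2 :a :car .   :car2 :madeIn :Tokyo .

The only result is ?person = :Alice; :Bob is excluded because :car2 fails the ":madeIn :Detroit" clause.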

• How big can a data graph get? Huge.
• For example, we can model a social network using a triples data store.
◦ A triple to represent (directed) friendship: "Alice follows Bob".
• How much processing power is required to run a query on the entire Twitter network, for example?

• One way to tackle processing of huge amounts of data is to parallelize tasks among multiple machines.
• We can use lots of cheap hardware to easily increase performance.
• This approach requires specialized algorithms for distributed computing.

• Today we'll be focusing on an algorithm which runs under the Hadoop system.
• Hadoop is a framework for distributed applications, meant for work on massive amounts of data.
• It's open source and available for free.

• Hadoop is responsible for managing jobs on a distributed cluster of computers.
◦ Communication between nodes, reliability against node failures, etc. are all handled by the Hadoop system.
• It provides an abstraction for running distributed algorithms, freeing the developer from handling the technical aspects of the cluster.
• Developers implement MapReduce algorithms, and Hadoop handles the rest.

• In addition to providing a distributed computing platform with MapReduce, Hadoop also provides a distributed filesystem called HDFS.
• Since today's lecture focuses specifically on a MapReduce algorithm, we will not go into detail about HDFS.

• Introduced by Google in 2004.
◦ Paper: "MapReduce: Simplified Data Processing on Large Clusters" by J. Dean and S. Ghemawat.
• A way to perform distributed processing on a large number of machines ("nodes"), collectively referred to as a "cluster".

• Input data is split into independent chunks, and each of those is processed in parallel to the others.
• Processing is also divided into two steps: Map and Reduce.
• This results in many small independent problems which can be run in parallel on the cluster.
◦ Note, the Reduce stage is dependent on the Map stage.
• A system (e.g. Hadoop) is in place to manage these jobs across the cluster.

• We shall now go over the basic MapReduce algorithm, from a developer's point of view.
• At minimum, a developer must specify a Map function and a Reduce function. The rest is taken care of by the framework.


• Our initial input is made of many (key, value) pairs.
◦ Each of these is assigned to a single Map function call.
• Input for the Map function: a single (key, value) pair.
• Output of the Map function: a list of (key, value) pairs.
◦ Possibly of a different type than that of the input.

    Map(k_in, v_in) → list(k, v)

• In the Partition stage:
◦ Each Map output is sent to a reducer using some partitioning function (e.g. hash the key, modulo the number of reducers – see the sketch below).
• In the Comparison stage:
◦ Input for Reduce is grouped by key and sorted.
◦ Each Reduce call will be given a single key and its list of values.
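For concreteness, a partitioner equivalent to Hadoop's default hash partitioning might look like this (a sketch; you normally get this behavior without writing any code):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route each key to a reducer by hashing; masking with MAX_VALUE
    // keeps the result non-negative before taking the modulo.
    public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }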

• Input for the Reduce function: a single key and a list of values for that key.
• Output from the Reduce function: a list of values.
◦ The type of these values can be different than that of the input.

    Reduce(k, list(v)) → list(v_out)

• So overall, what do we have?
• Our initial input was a list of (key, value) pairs.
• The final output of the MapReduce operation is the union of all of the lists of elements produced by the Reduce function calls.

    MapReduce(list(k_in, v_in)) → list(v_out)


• With this framework, developers can perform wonders by simply implementing a Java interface for the Map and Reduce functions.
• Check out the Hadoop interface!

• A classic example of MapReduce operations is a word-counting algorithm over a set of documents (see the sketch below).
• Input:
◦ A set of (documentName, documentContents) pairs.
• Output:
◦ A set of (word, numberOfAppearances) pairs.

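A hedged sketch of the word count job in Hadoop's Java API (the canonical example, written against the org.apache.hadoop.mapreduce API; class names are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: for every word in the document, emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the 1s emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }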

• This example may seem silly, but given enough machines in a cluster we can process petabytes of data quickly! That's quite impressive.
• Next in this lecture, we will learn of a more interesting use for MapReduce.

• The article under the spotlight today is called "Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store".
• The "SHARD graph-store" is the name of a system built by the authors on top of Hadoop, using the algorithm from the article.
◦ We focus on the algorithm itself, so we will not mention the system by name again in this lecture.

• The article (presented at DIDC 2011) gives a MapReduce solution to the problem of answering SPARQL queries over a triples data store.

    SELECT ?person WHERE {
      ?person :owns ?car .
      ?car :a :car .
      ?car :madeIn :Detroit .
    }

• The query above contains 3 clauses.
• Two variables are mentioned: "person" and "car".
◦ The variable "car" could potentially be bound to a ":dog" if it weren't for the "?car :a :car" clause!
• Our mission: find all entities in the database that satisfy all restrictions in the query.

• The algorithm makes an important assumption:
◦ The i'th clause in the query shares at least one variable with the preceding clauses 1,…,(i−1).
• This is without loss of generality –
◦ if a query can't be rearranged to satisfy this assumption, then the query can be split into independent sub-queries that can be processed separately and joined later.

• The database we will be working with consists of lines of text, each being a "triple line".
• A triple line consists of a subject entity, followed by its predicates and the objects of those predicates.
◦ "Barak; wears,glasses; owns,car; likes,BarbieDolls; …"

• Throughout the algorithm, variables from the query will be bound to entities in the data store.
• To represent such a binding of variables to entities, a string of text is used.
• Each bound variable's name appears, prefixed by a '?' and followed by the entity assigned in this binding.
◦ "?person Barak ?car car0 …"

• Given a query, this algorithm performs a MapReduce operation for each of the clauses in the query.
◦ Thus for a query with N clauses we will perform N MapReduce operations.
• We incrementally bind query variables to data nodes this way.


• An initial MapReduce operation finds all legal assignments of variables in the first clause of the query.
• We then iterate over the remaining clauses, each time joining the new assignment options with those we already have, using a MapReduce operation.
• A final MapReduce operation keeps only the variables in the SELECT, and removes duplicate results.

• When performing the intermediate MapReduce operations, we always pass both the current variable bindings (from the previous iteration) and the initial triples database to the map operations.
• Inside the intermediate MapReduce, a condition handles triple input and binding input differently.

• We shall now look at each of the three different MapReduce operation types that are performed in the algorithm.
• These are "firstClauseMapReduce", "intermediateClauseMapReduce", and "selectMapReduce".


• Input:
◦ Each map job gets a single triples line:
  "Barak; wears,glasses; owns,car; likes,BarbieDolls; …"
◦ Also note that the first clause in the query is passed to each map job.
• Output:
◦ A set of (binding, null) pairs. The bindings are all the possible ways to satisfy clause 0 of the query on this line.
  A binding: "?person Barak ?car car0 …"

• Input:
◦ Since we only add null as the value in the map job, each reduce job gets a set of (binding, {null}) pairs.
• Output:
◦ (binding, null), i.e. the reduce job simply removes any duplicates.
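A plain-Java sketch of this first job with the Hadoop plumbing stripped away (the class, the method names, and the map-based binding representation are illustrative assumptions, not the paper's code):

    import java.util.*;

    class FirstClauseSketch {
        // A clause is three tokens, e.g. { "?person", "owns", "?car" };
        // tokens starting with '?' are variables, others must match literally.
        static Map<String, String> match(String[] clause, String[] triple) {
            Map<String, String> b = new HashMap<>();
            for (int i = 0; i < 3; i++) {
                if (clause[i].startsWith("?")) {
                    String bound = b.get(clause[i]);
                    if (bound != null && !bound.equals(triple[i])) return null;
                    b.put(clause[i], triple[i]);
                } else if (!clause[i].equals(triple[i])) {
                    return null;
                }
            }
            return b;
        }

        // Map step: parse one triples line ("subject; pred,obj; pred,obj; ...")
        // and emit a binding for every way the first clause matches it.
        static List<Map<String, String>> mapTripleLine(String line, String[] clause) {
            List<Map<String, String>> out = new ArrayList<>();
            String[] parts = line.split(";\\s*");
            for (int i = 1; i < parts.length; i++) {
                String[] po = parts[i].split(",");
                Map<String, String> binding = match(clause,
                    new String[] { parts[0].trim(), po[0].trim(), po[1].trim() });
                if (binding != null) out.add(binding);
            }
            return out;
        }

        // Reduce step: identical bindings share a key, so reducing is just dedup.
        static Set<Map<String, String>> reduceDedup(List<Map<String, String>> bindings) {
            return new LinkedHashSet<>(bindings);
        }

        public static void main(String[] args) {
            String[] clause = { "?person", "owns", "?car" };
            List<Map<String, String>> raw =
                mapTripleLine("Barak; wears,glasses; owns,car0", clause);
            // Prints the single binding {?person=Barak, ?car=car0} (map order may vary).
            System.out.println(reduceDedup(raw));
        }
    }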

• Once the initial MapReduce operation is finished, we have all the possible variable bindings for the first clause in the query.
• We then start iterating on the rest of the clauses.


• Input:
◦ Each map job gets either a triples line or a binding line from the previous iteration.
◦ commVars is supplied – the list of variables that are both currently bound (from past iterations) and appear in the current clause.
  Always non-empty, by the algorithm's assumption!
◦ The current clause in the query is also supplied.

• Output:
◦ If the input was a triples line, the output is a set of bindings satisfying the current clause, each output in the format (bindingOfCommVars, restOfNewBinding).
◦ If the input was a binding line, we simply output a re-organized version of the binding: (bindingOfCommVars, restOfOldBinding).
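A sketch of the key/value split this map step performs (plain Java, illustrative names; the "new"/"old" tag on each value is omitted for brevity, and variables are emitted in sorted order so that equal commVars assignments serialize to identical keys):

    import java.util.*;

    class IntermediateMapSketch {
        // Split a binding into a key over commVars and a value over the
        // remaining variables, each serialized like "?person Barak".
        static String[] splitByCommVars(Map<String, String> binding,
                                        Set<String> commVars) {
            StringBuilder key = new StringBuilder(), rest = new StringBuilder();
            for (Map.Entry<String, String> e : new TreeMap<>(binding).entrySet()) {
                StringBuilder target = commVars.contains(e.getKey()) ? key : rest;
                target.append(e.getKey()).append(' ').append(e.getValue()).append(' ');
            }
            return new String[] { key.toString().trim(), rest.toString().trim() };
        }

        public static void main(String[] args) {
            Map<String, String> b = new HashMap<>();
            b.put("?person", "Barak");
            b.put("?car", "car0");
            // With commVars = {?car}: key is "?car car0", value is "?person Barak".
            String[] kv = splitByCommVars(b, Collections.singleton("?car"));
            System.out.println(kv[0] + " | " + kv[1]);
        }
    }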

(Slide figure: the map's two output forms – values whose variables are in curVars are the "new" bindings; values whose variables are in boundVars are the "old" bindings.)

• Input:
◦ Each reduce job receives as input a key (a binding of the commVars) and the list of values grouped under it.

• A note:
◦ We have 2 categories of bound variables at this stage (the two are not necessarily distinct):
  ▪ Variables that are present in the current clause (curVars).
  ▪ Variables which are already bound from previous clauses (boundVars).

• A note (continued):
◦ Note that because of the map operation, the key in the reducer input always consists of the variables in the intersection of the two categories.
◦ Therefore the value set consists of many bindings, and each binding's variables must be either exclusively in curVars or exclusively in boundVars.

• Output for each reduce job follows this logic:
◦ Mark the input as (key, {newBindings} ∪ {oldBindings}).
◦ We output all possible variations of (key + newBinding + oldBinding), over every newBinding in newBindings and every oldBinding in oldBindings.
• This Cartesian product is implemented with a simple nested loop in the reducer (see the sketch below).
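A plain-Java sketch of that nested-loop join (illustrative class and method names; each binding is a variable-to-entity map, as in the earlier sketches):

    import java.util.*;

    class IntermediateReduceSketch {
        // Join: prefix the key's assignments to every (new, old) combination.
        // If either set is empty, nothing is output – exactly the filtering
        // behavior described in the following slides.
        static List<Map<String, String>> join(Map<String, String> key,
                                              List<Map<String, String>> newBindings,
                                              List<Map<String, String>> oldBindings) {
            List<Map<String, String>> out = new ArrayList<>();
            for (Map<String, String> nb : newBindings) {
                for (Map<String, String> ob : oldBindings) {
                    Map<String, String> merged = new HashMap<>(key);
                    merged.putAll(nb);
                    merged.putAll(ob);
                    out.add(merged);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            Map<String, String> key = Map.of("?car", "car0");
            List<Map<String, String>> news = List.of(Map.of("?city", "Detroit"));
            List<Map<String, String>> olds = List.of(Map.of("?person", "Barak"));
            // Prints one merged binding over ?car, ?city and ?person.
            System.out.println(join(key, news, olds));
        }
    }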

• Each key in the 'reduce' is a binding of the variables in commVars.
• If the binding in the key contains any new assignments (different from the assignments made so far to these variables), then the "oldBindings" set will be empty when the algorithm runs, and no results with this assignment will be output.

• Otherwise, all bindings inside the key have appeared in previous iterations as they are.
• The only case here in which we would want to filter out a result is when this past binding contradicts the current clause's restriction.
• Indeed, in this case the "newBindings" set will be empty and thus no result will be output.

• The only case that's left to look at is when the bindings inside the key have appeared as they are in the past, and in addition they don't contradict the current clause.
• In this case both "newBindings" and "oldBindings" will be non-empty, and we will output the key prefixed to their Cartesian product.

• Therefore, we can always be sure that the following holds –
• After the i'th iteration, all tuples output from the Reduce runs satisfy the query up to and including the i'th clause.
◦ And those are all the tuples that satisfy this property (i.e. none are missing).

• After iterating over all clauses, we run one last MapReduce job to keep only the variables in the query's SELECT.
• We also remove any duplicate output in this final MapReduce.


• Input:
◦ Every Map job gets a binding as its input.
◦ The list of variables in the query's SELECT is also passed.
• Output:
◦ Only the variables in the input binding which also appear in the query's SELECT.

• Input:
◦ Every Reduce's input is a projected binding (as the key) and the values grouped under it.
• Output:
◦ A list of all keys in the input (duplicates removed).
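The whole job amounts to a projection followed by deduplication; a plain-Java sketch (illustrative names):

    import java.util.*;

    class SelectSketch {
        // Keep only the SELECT variables in each binding, then deduplicate.
        static Set<Map<String, String>> selectAndDedup(
                Collection<Map<String, String>> bindings, Set<String> selectVars) {
            Set<Map<String, String>> out = new LinkedHashSet<>();
            for (Map<String, String> b : bindings) {
                Map<String, String> projected = new HashMap<>(b);
                projected.keySet().retainAll(selectVars);
                out.add(projected);
            }
            return out;
        }

        public static void main(String[] args) {
            List<Map<String, String>> in = List.of(
                Map.of("?person", "Barak", "?car", "car0"),
                Map.of("?person", "Barak", "?car", "car1"));
            // Projecting onto {?person} collapses both rows into one result.
            System.out.println(selectAndDedup(in, Set.of("?person")));
        }
    }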

• As we've seen, the algorithm is correct.
◦ Its correctness is not trivial to see…
• Since it runs on Hadoop, the process is parallelized and scalable to extremely large data sets.
• We now present a sample run of the algorithm.
◦ Source for demo:

    SELECT ?person WHERE {
      ?person :owns ?car .
      ?car :a :car .
    }

(Slide figures: three successive stages of the sample run on this query – not preserved in the transcript.)

• The final MapReduce will remove the "?car" bindings.

• The authors have implemented the algorithm over Cloudera Hadoop.
• The code was run on an Amazon EC2 cloud of 20 XL compute nodes.
• Benchmark used: LUBM.
◦ Input data is auto-generated.
◦ Data revolves around students, courses and lecturers in universities, with relationships from that domain.

• Queries 1, 9 and 14 of the LUBM benchmark were run.
◦ Query 1 asks for the students that take a particular course, and returns a very small set of responses.
◦ Query 9 asks for all teachers, students and courses such that the teacher is the adviser of the student, who takes a course taught by the teacher.
◦ Query 14 is relatively simple, as it asks for all undergraduate students (but the response is very large).

• Data: approx. 800 million edges – several GB.
◦ Query 1: 404 sec. (approx. 0.1 hr.)
◦ Query 9: 740 sec. (approx. 0.2 hr.)
◦ Query 14: 118 sec. (approx. 0.03 hr.)
• Results get better as more hardware is added.

• The results were compared against an industry-standard triple store DB: DAMLDB.
◦ DAMLDB runs on a single server.
◦ Other triple store DBs timed out when loading the input data.
• A comparison of the time taken (in hours):

              SHARD    DAMLDB
    Query 1   0.1      —
    Query 9   0.2      —
    Query 14  0.03     —

◦ The results are great!

• Comprehensive information about RDF:
• SPARQL Query Language for RDF (official specification):
• "MapReduce: Simplified Data Processing on Large Clusters", by J. Dean and S. Ghemawat:
• Apache™ Hadoop™:
• "Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store" by K. Rohloff and R. Schantz: systems.bbn.com/people/krohloff/papers/2011/Rohloff_Schantz_DIDC_2011.pdf