MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.


Roadmap
- Motivation
- MapReduce by Google
- Tools: Hadoop, Hbase
- MapReduce for SPARQL: HRdfStore
- References

Motivation
- MapReduce is inspired by the map and reduce primitives present in Lisp and other functional programming languages.
- Managing large amounts of data on clusters of machines.
- Processing this data in a distributed fashion without aggregating it at a single point.
- Minimal user expertise is required to carry out the tasks in parallel on a cluster of machines.

MapReduce by Google
Input: a set of key-value pairs
Output: a set of key-value pairs (not necessarily the same as the input!)

Example: Word Count

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
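The word-count pseudocode above can be mimicked in a few lines of Python; this is a minimal single-process simulation of the map, shuffle, and reduce phases (the function names are illustrative, not part of any MapReduce API):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, "1") for every word, mirroring EmitIntermediate(w, "1").
    return [(w, "1") for w in contents.split()]

def reduce_fn(word, counts):
    # Sum the partial counts for one word, mirroring Emit(AsString(result)).
    return str(sum(int(c) for c in counts))

def word_count(docs):
    # Shuffle phase: group intermediate values by key.
    grouped = defaultdict(list)
    for name, text in docs.items():
        for k, v in map_fn(name, text):
            grouped[k].append(v)
    # Reduce phase: one call per distinct key.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

print(word_count({"d1": "a b a", "d2": "b c"}))
```

In the real system the grouping is done by the framework between the map and reduce phases; the user supplies only map_fn and reduce_fn.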

MapReduce by Google (contd.)
- The architecture has a master server, map-task workers, and reduce-task workers.
- The input is split into M pieces and distributed to the map workers.
- Reduce invocations are distributed over R nodes by partitioning the intermediate key space, e.g. Hash(key) mod R.
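The Hash(key) mod R partitioning named above can be sketched as follows (the hash function here is an illustrative stand-in; Python's built-in hash() is randomized per process for strings, so a stable hash is used instead):

```python
def partition(key, R):
    # Decide which of the R reduce workers receives this intermediate key.
    # A simple polynomial hash over the key's bytes keeps the result stable
    # across processes, which is what the shuffle phase requires.
    h = 0
    for b in key.encode("utf-8"):
        h = (h * 31 + b) % (2 ** 32)
    return h % R

# Every occurrence of the same key lands in the same partition, so a
# single reducer sees all the counts for that word.
print(partition("apple", 5))
```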

MapReduce by Google (contd.)
- Uses the Google File System (GFS).
- Provides fault tolerance.
- Preserves locality by scheduling tasks on machines that hold replicas of the input data.
- There is a trade-off in the selection of the M and R values: the master makes O(M + R) scheduling decisions and keeps O(M * R) states in memory.
- Typically M = 200,000 and R = 5,000 with 2,000 worker machines.

Tools: Hadoop (this talk is not going to detail the Hadoop APIs)
- Uses the Hadoop Distributed File System (HDFS), designed specifically for large, distributed, data-intensive applications running on commodity hardware; inspired by the Google File System (GFS).
- For MapReduce operations: a master JobTracker and one slave TaskTracker per cluster node.
- Applications specify input/output locations and supply map and reduce functions by implementing the appropriate interfaces and abstract classes.
- Hadoop is implemented in Java, but applications can be written in other languages using Hadoop Streaming.
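Hadoop Streaming runs any executables that read lines from stdin and write tab-separated key/value lines to stdout. A sketch of a word-count mapper/reducer pair follows, written as plain generator functions so the stream contract is visible; in an actual job these would be two separate scripts passed via the -mapper and -reducer options:

```python
def stream_map(lines):
    # Mapper: emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reduce(sorted_lines):
    # Reducer: Hadoop delivers input sorted by key, so equal keys are
    # adjacent and can be summed with a single running total.
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# The framework performs the sort between the two phases; sorted() stands in.
mapped = sorted(stream_map(["a b a", "b c"]))
print(list(stream_reduce(mapped)))
```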

Tools (contd.): Hbase
- Inspired by Google's Bigtable architecture for distributed data storage using sparse tables.
- It is like a multidimensional sorted map, indexed by a row key, a column key, and a timestamp.
- A column name has the form family:qualifier.
- A single table enforces its set of column families.
- Column families are stored physically close together on disk to improve locality while searching.
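The multidimensional sorted map described above can be modeled directly; this is a toy model of a Bigtable/Hbase-style store as a dict keyed by (row, column, timestamp), with reads returning the newest version. It illustrates the data model only and is not the Hbase client API:

```python
class SparseTable:
    """Toy Bigtable-style map: (row, "family:qualifier", ts) -> value."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, ts, value):
        # Sparse storage: only cells that exist occupy space.
        self.cells[(row, column, ts)] = value

    def get(self, row, column):
        # Return the most recent version of the cell, or None if absent.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = SparseTable()
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.put("com.cnn.www", "mime:", 6, "text/html")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))
```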

Hbase storage

Hbase table view (one logical row):

Row key       Timestamp   contents:   anchor:cnnsi.com   anchor:my.look.ca   mime:
com.cnn.www   t9                      CNN
              t8                                         CNN.com
              t6                                                             text/html
              t5
              t3

Hbase storage (one physical store per column family)

Contents store:
Row key       Timestamp   Contents
com.cnn.www   t6
              t5
              t3

Anchor store:
Row key       Timestamp   cnnsi.com   my.look.ca
com.cnn.www   t9          CNN
              t8                      CNN.com

Mime store:
Row key       Timestamp   Mime
com.cnn.www   t6          text/html

Hbase architecture
- A table is a list of data tuples sorted by the row key.
- A table is physically broken into HRegions, each identified by the table name plus its start and end row keys.
- Each HRegion is served by an HRegionServer.
- There is an HStore for each column family; HStoreFiles use a B-Tree-like structure.
- An HMaster controls the HRegionServers.
- A META table stores meta-information about HRegions and HRegionServer locations.
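Locating the HRegion that serves a given row key is a range lookup over the sorted region boundaries recorded in the META table. A sketch, where the region layout is a hypothetical example:

```python
import bisect

# Each HRegion covers [start_key, end_key); META maps regions to servers.
REGIONS = [("",  "g",    "HRegionServer1"),
           ("g", "p",    "HRegionServer2"),
           ("p", "\xff", "HRegionServer3")]

def locate(row_key):
    # Binary search on the start keys, then confirm the key is in range.
    starts = [r[0] for r in REGIONS]
    i = bisect.bisect_right(starts, row_key) - 1
    start, end, server = REGIONS[i]
    assert start <= row_key < end, "row key outside the region layout"
    return server

print(locate("com.cnn.www"))  # keys starting with 'c' fall in the first region
```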

Hbase architecture

HMaster
 +-- HRegionServer1: HRegion1, HRegion2
 +-- HRegionServer2: HRegion3, HRegion4
 +-- HRegionServer3: HRegion5, HRegion6

MapReduce for SPARQL (HRdfStore)
- The HRdfStore Data Loader (HDL) reads RDF files and organizes the data in HBase.
- The sparsity of RDF data makes it especially well suited to storage in Hbase, and Hbase's compression techniques are useful here.
- The HRdfStore Query Processor (HQP) executes RDF queries on HBase tables.
- SPARQL query -> parse tree -> logical operator tree -> physical operator tree -> execution.

MapReduce for SPARQL (some more thoughts)
How to organize RDF data in Hbase?
- Subjects/objects as row keys, with a "Predicates" column family in which each predicate is a label, e.g. "Predicates-rdf:type".
- Or predicates as row keys, with subjects/objects as column families.
- Convert each SPARQL query into the associated query over the Hbase tables.
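One concrete reading of the first layout (subjects as row keys, one column per predicate in a "Predicates" family) can be sketched as follows; the schema and names are illustrative, not the actual HRdfStore design:

```python
from collections import defaultdict

def load_triples(triples):
    # Row key = subject; column = "Predicates:<predicate>"; cell = objects.
    table = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        table[s][f"Predicates:{p}"].append(o)
    return table

table = load_triples([
    (":alice", "rdf:type", ":Person"),
    (":alice", ":knows",   ":bob"),
    (":bob",   "rdf:type", ":Person"),
])

# A SPARQL pattern like { ?x rdf:type :Person } becomes a scan of one column.
people = [s for s, cols in table.items()
          if ":Person" in cols["Predicates:rdf:type"]]
print(sorted(people))
```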

MapReduce for SPARQL (some more thoughts)
- Each RDF triple is mapped to one or more keys and stored in Hbase according to these keys, with each cluster node responsible for the triples associated with one or more particular keys.
- Map each triple pattern in the SPARQL query to a key, together with its associated restrictions, e.g. FILTERs.
- Execute the query by mapping the triple patterns to the cluster nodes associated with those keys.
- This is essentially a Distributed Hash Table (DHT)-like system; a different hashing scheme can be employed to avoid the skew in triple distribution experienced in conventional DHT-based P2P systems.
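The DHT-style placement above can be sketched as hashing each triple under its subject, predicate, and object, so that any triple pattern with at least one constant can be routed to a single node. The hashing scheme and node count here are illustrative assumptions:

```python
import hashlib

NODES = 4

def node_for(key):
    # Stable hash so every machine computes the same placement.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return h % NODES

def place(triple):
    # Store the triple under all three of its components; a query pattern
    # with any one constant then maps to exactly one responsible node.
    s, p, o = triple
    return {node_for(s), node_for(p), node_for(o)}

print(place((":alice", ":knows", ":bob")))  # at most three replicas per triple
```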

Map-Reduce-Merge (an application)
- MapReduce does not work well with heterogeneous databases: it does not directly support joins.
- Map-Reduce-Merge (as proposed by Yahoo! and UCLA researchers) supports the features of MapReduce while adding relational algebra operations to the list of database principles it covers.
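The idea of the added merge step can be sketched as a pass that joins two reduced outputs on their shared key; this is a toy illustration of the concept, not the paper's actual API:

```python
def merge(left, right):
    # Merge phase: join two reduced outputs on their common key, the step
    # that plain MapReduce has no built-in support for.
    joined = []
    for key, lval in left:
        for k2, rval in right:
            if key == k2:
                joined.append((key, lval, rval))
    return joined

# e.g. two independently reduced datasets keyed by department id
emp  = [("d1", "alice"), ("d2", "bob")]
dept = [("d1", "Sales"), ("d2", "Ops")]
print(merge(emp, dept))
```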

References
- MapReduce
- BigTable
- Hadoop / Hbase
- HRdfStore
- IBM MapReduce tool for Eclipse
- Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, SIGMOD '07.