CSE 486/586 Distributed Systems: Data Analytics. Steve Ko, Computer Science and Engineering, University at Buffalo

Recap
RPC enables programmers to call functions in remote processes.
IDL (Interface Definition Language) allows programmers to define remote procedure calls.
Stubs are used to make it appear that the call is local.
Semantics
–Cannot provide exactly once
–At least once
–At most once
–Depends on the application requirements

Two Questions We’ll Answer
What is data analytics?
What are the programming paradigms for it?

Example 1: Scientific Data
CERN (European Organization for Nuclear Research), Geneva: Large Hadron Collider (LHC) experiment
–300 GB of data per second
–“15 petabytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer DVDs a year”

Example 2: Web Data
Google
–20+ billion web pages
»~20 KB each = 400 TB
–~4 months to read the web
–And growing…
»1999 vs. 2009: ~100X
Yahoo!
–Ingests a US Library of Congress’ worth of data every day (20 TB/day)
–2 billion photos
–2 billion mail + messenger messages sent per day
–And growing…

Data Analytics
Computations on very large data sets
–How large? TBs to PBs
–Much time is spent on data moving/reading/writing
Shift of focus
–Used to be: computation (think supercomputers)
–Now: data

Popular Environment
Environment for storing TBs ~ PBs of data
Cluster of cheap commodity PCs
–As we have been discussing in class…
–1000s of servers
–Data stored as plain files on file systems
–Data scattered over the servers
–Failure is the norm
How do you process all this data?

Turn to History
Dataflow programming
–Data sources and operations
–Data items go through a series of transformations using operations.
–Very popular concept
Many examples
–Even CPU designs back in the 80’s and 90’s
–SQL, data streaming, etc.
Challenges
–How to efficiently fetch data?
–When and how to schedule different operations?
–What if there’s a failure (both for data and computation)?

Dataflow Programming
This style of programming is now very popular with large clusters.
Many examples
–MapReduce, Pig, Hive, Dryad, Spark, etc.
Two examples we’ll look at
–MapReduce and Pig

What is MapReduce?
A system for processing large amounts of data
Introduced by Google in 2004
Inspired by map & reduce in Lisp
Open-source implementation: Hadoop, by Yahoo!
Used by many, many companies
–A9.com, AOL, Facebook, The New York Times, Last.fm, Baidu.com, Joost, Veoh, etc.

Background: Map & Reduce in Lisp
Sum of squares of a list (in Lisp)
(map square ‘(1 2 3 4))
–Output: (1 4 9 16) [processes each record individually]

Background: Map & Reduce in Lisp
Sum of squares of a list (in Lisp)
(reduce + ‘(1 4 9 16))
–(+ 16 (+ 9 (+ 4 1)))
–Output: 30 [processes the set of all records in a batch]

Background: Map & Reduce in Lisp
Map
–processes each record individually
Reduce
–processes (combines) the set of all records in a batch
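The same two steps can be sketched in Python (a quick analogue of the Lisp example, not MapReduce itself; the square function is written as a lambda):

```python
# Sum of squares: map processes each record independently;
# reduce combines the whole batch into one value.
from functools import reduce

nums = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, nums))          # per-record: [1, 4, 9, 16]
total = reduce(lambda acc, x: acc + x, squares, 0)  # whole batch: 30
print(squares, total)
```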

What Google People Have Noticed
Keyword search
–Find a keyword in each web page individually, and if it is found, return the URL of the web page [Map]
–Combine all results (URLs) and return them [Reduce]
Count of the # of occurrences of each word
–Count the # of occurrences in each web page individually, and return the list of (word, count) pairs [Map]
–For each word, sum up (combine) the counts [Reduce]
Notice the similarities?

What Google People Have Noticed
Lots of storage + compute cycles nearby
Opportunity
–Files are distributed already! (GFS)
–A machine can process its own web pages (map)

Google MapReduce
Took the concept from Lisp, and applied it to large-scale data processing
Takes two functions from a programmer (map and reduce), and performs three steps
Map
–Runs map for each file individually in parallel
Shuffle
–Collects the output from all map executions
–Transforms the map output into the reduce input
–Divides the map output into chunks
Reduce
–Runs reduce (using a map output chunk as the input) in parallel
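The three steps can be sketched in a single process. This is only an illustration of the dataflow; `run_mapreduce`, `wc_map`, and `wc_reduce` are made-up names, not Google’s actual API, and there is no parallelism or fault tolerance here:

```python
# Minimal single-process sketch of the Map -> Shuffle -> Reduce steps.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: run map_fn on each (key, value) input record independently.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group all intermediate values by their key.
    groups = defaultdict(list)
    for out_key, inter_value in intermediate:
        groups[out_key].append(inter_value)
    # Reduce: run reduce_fn once per key, over that key's value list.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Word count expressed as a map function and a reduce function:
def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return sum(counts)

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```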

Programmer’s Point of View
Programmer writes two functions: map() and reduce()
The programming interface is fixed
–map (in_key, in_value) -> list of (out_key, intermediate_value)
–reduce (out_key, list of intermediate_value) -> (out_key, out_value)
Caution: not exactly the same as Lisp

Inverted Indexing Example
Word -> list of web pages containing the word
[Diagram: Input: web pages. Output: for each word (e.g., every, its), the list of URLs containing it.]

Map
Interface
–Input: an (in_key, in_value) pair, here (url, page text)
–Output: list of intermediate (out_key, intermediate_value) pairs, here (word, url)
Example map input, one pair per page: key = url1, value = “every happy family is alike.”; key = url2, value = “every unhappy family is unhappy in its own way.”
Example map output: a list of (word, url) pairs, e.g., (every, url1), (happy, url1), …, (every, url2), (unhappy, url2), …

Shuffle
MapReduce system
–Collects outputs from all map executions
–Groups all intermediate values by the same key
Map output: list of (word, url) pairs
Reduce input: every -> (url1, url2); happy -> (url1); unhappy -> (url2); family -> (url1, url2); …

Reduce
Interface
–Input: (out_key, list of intermediate_value), here (word, list of urls)
–Output: (out_key, out_value), here (word, combined list of urls)
Reduce input: every -> (url1, url2); happy -> (url1); unhappy -> (url2); family -> (url1, url2); …
Reduce output: (every, “url1 url2”), (happy, “url1”), (unhappy, “url2”), (family, “url1 url2”), …
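Putting the last three slides together, the inverted-indexing example can be sketched end to end in plain Python. The page texts are the ones from the slides; `index_map`, `index_reduce`, and the url names are illustrative:

```python
from collections import defaultdict

def index_map(url, text):
    # Map: emit one (word, url) pair per word on the page.
    return [(word, url) for word in text.split()]

def index_reduce(word, urls):
    # Reduce: combine the grouped urls for one word into one value.
    return sorted(set(urls))

pages = {"url1": "every happy family is alike",
         "url2": "every unhappy family is unhappy in its own way"}

# Map phase
pairs = [p for url, text in pages.items() for p in index_map(url, text)]
# Shuffle phase: group intermediate values by key (word)
groups = defaultdict(list)
for word, url in pairs:
    groups[word].append(url)
# Reduce phase
index = {word: index_reduce(word, urls) for word, urls in groups.items()}
# index["every"] == ["url1", "url2"]; index["its"] == ["url2"]
```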

Execution Overview
[Diagram: Map phase -> Shuffle phase -> Reduce phase]

Implementing MapReduce
Externally, for the user
–Write a map function and a reduce function
–Submit a job; wait for the result
–No need to know anything about the environment (Google: 4000 servers, many disks, many failures)
Internally, for the MapReduce system designer
–Run map in parallel
–Shuffle: combine map results to produce the reduce input
–Run reduce in parallel
–Deal with failures

Execution Overview
[Diagram: a master coordinates map workers (M) and reduce workers (R). Input files are sent to map tasks; intermediate keys are partitioned into reduce tasks, which write the output files.]

Task Assignment
[Diagram: the master divides the input into splits and assigns them to map workers (M); reduce workers (R) each pull one partition of the intermediate data and produce the output.]

Fault-tolerance: Re-execution
[Diagram: same setup as before; when a worker fails, the master simply re-executes its task on another worker.]

Machines Share Roles
So far, a logical view of the cluster
In reality
–Each cluster machine stores data
–And runs MapReduce workers (both M and R)
Lots of storage + compute cycles nearby

Problems of MapReduce
Any you can think of?
–There are only two functions you can work with (not expressive enough sometimes)
–Functional style (a barrier for some people)
Turing completeness (being computationally universal)
–A language is Turing-complete if it can simulate a single-taped Turing machine.
–Most general-purpose languages (C/C++, Java, Lisp, etc.) are.
–SQL is.
–MapReduce is not.

Pig
Why Pig?
–MapReduce has limitations: only two functions
–Many tasks require more than one MapReduce job
–Functional thinking: a barrier for some
Pig
–Defines a set of high-level, simple “commands”
–Compiles the commands and generates multiple MapReduce jobs
–Runs them in parallel

Pig Example
visits = load ‘/data/visits’;
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load ‘/data/urlInfo’;
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

Pig Example
[Diagram: the script compiles into three chained MapReduce jobs.
Job 1 (Map 1/Reduce 1): Load Visits -> Group by url -> Foreach url generate count.
Job 2 (Map 2/Reduce 2): Load Url Info -> Join on url.
Job 3 (Map 3/Reduce 3): Group by category -> Foreach category generate top10(urls).]
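As a rough illustration of why this script needs several MapReduce jobs, here is the same pipeline in plain Python, with each group-by or join boundary marked as a separate job (each such boundary is a shuffle). The sample data and names are made up:

```python
from collections import defaultdict

visits = [("alice", "url1"), ("bob", "url1"), ("carol", "url2")]
url_info = [("url1", "news"), ("url2", "sports")]

# Job 1: group visits by url, then count per url (shuffle on url).
counts = defaultdict(int)
for user, url in visits:
    counts[url] += 1

# Job 2: join visit counts with urlInfo on url (shuffle on url again).
category_of = dict(url_info)
joined = [(url, n, category_of[url]) for url, n in counts.items()]

# Job 3: group by category, keep the top-10 urls per category
# (shuffle on category).
by_cat = defaultdict(list)
for url, n, cat in joined:
    by_cat[cat].append((n, url))
top10 = {cat: sorted(rows, reverse=True)[:10] for cat, rows in by_cat.items()}
# top10["news"] == [(2, "url1")]
```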

Summary
Data analytics shifts the focus from computation to data.
Many programming paradigms are emerging.
–MapReduce
–Pig
–Many others

More Details
Papers
–J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004
–C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A Not-So-Foreign Language For Data Processing,” SIGMOD 2008
Slides
–http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

Acknowledgements
These slides contain material developed and copyrighted by Indranil Gupta (UIUC).