Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu, Duke University.

Presentation transcript:

Analysis in the Big Data Era (Starfish, 9/26/2011)
Massive Data, Data Analysis, Insight.
Key to Success = Timely and Cost-Effective Analysis.

Hadoop MapReduce Ecosystem
Popular solution to Big Data Analytics.
Diagram components: Java / C++ / R / Python; Oozie, Hive, Pig; Elastic MapReduce; Jaql; HBase; Hadoop (MapReduce Execution Engine, Distributed File System).

Practitioners of Big Data Analytics
- Who are the users? Data analysts, statisticians, computational scientists… Researchers, developers, testers… You!
- Who performs setup and tuning? The users! Usually lack expertise to tune the system.

Tuning Challenges
- Heavy use of programming languages for MapReduce programs (e.g., Java/Python)
- Data loaded/accessed as opaque files
- Large space of tuning choices
- Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)
- Terabyte-scale data cycles

Starfish: Self-tuning System
Our goal: Provide good performance automatically.
Diagram components: Java / C++ / R / Python; Oozie, Hive, Pig; Elastic MapReduce; Jaql; HBase; Starfish; Analytics System; Hadoop (MapReduce Execution Engine, Distributed File System).

What are the Tuning Problems?
- Job-level MapReduce configuration
- Workload management
- Data layout tuning
- Cluster sizing
- Workflow optimization (diagram: jobs J1, J2, J3, J4)

Starfish's Core Approach to Tuning
- Profiler: Collects concise summaries of execution
- What-if Engine: Estimates impact of hypothetical changes on execution
  1) if Δ(conf. parameters) then what …?
  2) if Δ(data properties) then what …?
  3) if Δ(cluster properties) then what …?
- Optimizers: Search through space of tuning choices (Job, Workflow, Workload, Data layout, Cluster)

Starfish Architecture
Diagram components: Profiler; What-if Engine; Job Optimizer; Workflow Optimizer; Workload Optimizer; Elastisizer; Data Manager (Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.).

MapReduce Job Execution
Diagram: four input splits (split 0 to split 3), each processed by a map task, in Two Map Waves; two reduce tasks producing out 0 and out 1 in One Reduce Wave.
job j =
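
The wave counts in the diagram above can be reproduced with a small calculation. This is only a sketch: it assumes that tasks run in fixed slots and that the number of waves is the task count divided by the total slot count, rounded up. The function name `num_waves` and the 2-node, 1-slot-per-node cluster are hypothetical choices made to match the picture, not values given on the slide.

```python
import math


def num_waves(num_tasks: int, num_nodes: int, slots_per_node: int) -> int:
    """Return how many scheduling waves are needed to run num_tasks tasks."""
    total_slots = num_nodes * slots_per_node
    if total_slots <= 0:
        raise ValueError("cluster must have at least one task slot")
    return math.ceil(num_tasks / total_slots)


# The slide's picture: 4 input splits (4 map tasks) and 2 reduce tasks.
# Hypothetical cluster: 2 nodes with 1 map slot and 1 reduce slot each.
map_waves = num_waves(4, num_nodes=2, slots_per_node=1)
reduce_waves = num_waves(2, num_nodes=2, slots_per_node=1)
print(map_waves, reduce_waves)  # 2 1
```

Under these assumptions, doubling the map slots would let the same job finish its map side in a single wave, which is one reason slot counts and task counts have to be tuned together.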

What Controls MR Job Execution?
Space of configuration choices:
- Number of map tasks
- Number of reduce tasks
- Partitioning of map outputs to reduce tasks
- Memory allocation to task-level buffers
- Multiphase external sorting in the tasks
- Whether output data from tasks should be compressed
- Whether combine function should be used
job j =
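
To make the configuration space more concrete, here is a minimal sketch that maps some of the choices above onto Hadoop parameter names. The slides do not name any parameters; the names and default values below are the ones used by the Hadoop 0.20 release line and are an assumption about the version in use. The helpers `with_overrides` and `as_properties` are hypothetical.

```python
# Assumed mapping from the slide's tuning choices to Hadoop 0.20-era
# parameter names; the values are the stock defaults of that release line.
DEFAULTS = {
    "mapred.reduce.tasks": 1,             # number of reduce tasks
    "io.sort.mb": 100,                    # map-side sort buffer size (MB)
    "io.sort.spill.percent": 0.80,        # buffer fill level that triggers a spill
    "io.sort.factor": 10,                 # streams merged at once in external sort
    "mapred.compress.map.output": False,  # compress intermediate map output
    "mapred.output.compress": False,      # compress final job output
}


def with_overrides(overrides):
    """Return a full configuration: the defaults plus the given overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def as_properties(conf):
    """Render a configuration as sorted key=value strings."""
    def fmt(value):
        # Hadoop writes booleans in lowercase.
        return str(value).lower() if isinstance(value, bool) else str(value)
    return [f"{key}={fmt(value)}" for key, value in sorted(conf.items())]


# Illustrative (not recommended) settings.
tuned = with_overrides({"mapred.reduce.tasks": 20,
                        "mapred.compress.map.output": True})
for line in as_properties(tuned):
    print(line)
```

Even this small subset has six interacting dimensions, which illustrates why the slide calls the space of choices large.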

Effect of Configuration Settings
- Use defaults or set manually (rules-of-thumb)
- Rules-of-thumb may not suffice
Figure: two-dimensional projection of a multi-dimensional surface (Word Co-occurrence MapReduce program), with the rules-of-thumb settings marked.

MapReduce Job Tuning in a Nutshell
Goal:
Challenges: p is an arbitrary MapReduce program; c is high-dimensional; …
- Profiler: Runs p to collect a job profile (concise execution summary)
- What-if Engine: Given a job profile, estimates a virtual profile
- Optimizer: Enumerates and searches through the optimization space S efficiently

Job Profile
- Concise representation of program execution as a job
- Records information at the level of task phases
- Generated by Profiler through measurement or by the What-if Engine through estimation
Diagram labels (map task): split, DFS, map, Memory Buffer; phases Read, Map, Collect, Spill, Merge; Serialize, Partition; Sort, [Combine], [Compress]; Merge.

Job Profile Fields
Dataflow: amount of data flowing through task phases
- Map output bytes
- Number of spills
- Number of records in buffer per spill
Costs: execution times at the level of task phases
- Read phase time in the map task
- Map phase time in the map task
- Spill phase time in the map task
Dataflow Statistics: statistical information about dataflow
- Width of input key-value pairs
- Map selectivity in terms of records
- Map output compression ratio
Cost Statistics: statistical information about resource costs
- I/O cost for reading from local disk per byte
- CPU cost for executing the Mapper per record
- CPU cost for uncompressing the input per byte
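
The four field categories suggest a simple record structure. The sketch below is one possible layout, not the format Starfish actually uses; the class name `JobProfile`, the key names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobProfile:
    """Hypothetical container for the four profile field categories."""
    dataflow: Dict[str, float] = field(default_factory=dict)
    costs: Dict[str, float] = field(default_factory=dict)
    dataflow_statistics: Dict[str, float] = field(default_factory=dict)
    cost_statistics: Dict[str, float] = field(default_factory=dict)

    def map_task_time(self) -> float:
        """Sum the recorded map-task phase times (same unit as the inputs)."""
        return sum(value for key, value in self.costs.items()
                   if key.startswith("map."))


# Made-up example values, one entry per bullet on the slide.
profile = JobProfile(
    dataflow={"map_output_bytes": 64_000_000, "num_spills": 3,
              "records_per_spill": 250_000},
    costs={"map.read_ms": 1200, "map.map_ms": 3400, "map.spill_ms": 900},
    dataflow_statistics={"input_pair_width_bytes": 100,
                         "map_selectivity_records": 2.0,
                         "map_output_compression_ratio": 0.4},
    cost_statistics={"local_read_ns_per_byte": 10,
                     "mapper_cpu_ns_per_record": 500,
                     "uncompress_cpu_ns_per_byte": 4},
)
print(profile.map_task_time())  # 5500
```

Keeping the measured quantities (dataflow, costs) separate from the derived per-unit statistics mirrors the split on the slide and makes it clear which fields would have to be re-estimated for a hypothetical job.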

Generating Profiles by Measurement
Goals:
- Have zero overhead when profiling is turned off
- Require no modifications to Hadoop
- Support unmodified MapReduce programs written in Java or Hadoop Streaming/Pipes (Python/Ruby/C++)
Approach: Dynamic (on-demand) instrumentation
- Event-condition-action rules are specified (in Java)
- Leads to run-time instrumentation of Hadoop internals
- Monitors task phases of MapReduce job execution
- We currently use BTrace (Hadoop internals are in Java)
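
The event-condition-action idea can be illustrated without any instrumentation framework. The following is a plain-Python analogy, not BTrace and not Starfish code: the `Rule` and `Instrumenter` classes, the `spill_done` event, and its fields are all hypothetical. In the real system, turning profiling off means no instrumentation is injected at all; here that is only approximated by an early return.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    event: str                          # which event the rule listens for
    condition: Callable[[Event], bool]  # when the rule applies
    action: Callable[[Event], None]     # what to record when it applies


class Instrumenter:
    def __init__(self) -> None:
        self.rules: List[Rule] = []
        self.enabled = False  # profiling is off by default

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def emit(self, event_name: str, **payload: Any) -> None:
        if not self.enabled:
            return  # nothing is evaluated while profiling is off
        event = {"event": event_name, **payload}
        for rule in self.rules:
            if rule.event == event_name and rule.condition(event):
                rule.action(event)


spill_times: List[int] = []
inst = Instrumenter()
inst.add_rule(Rule(
    event="spill_done",
    condition=lambda e: e["task_type"] == "map",  # only map-side spills
    action=lambda e: spill_times.append(e["millis"]),
))

inst.emit("spill_done", task_type="map", millis=50)      # ignored: profiling off
inst.enabled = True
inst.emit("spill_done", task_type="map", millis=120)     # recorded
inst.emit("spill_done", task_type="reduce", millis=300)  # condition is false
inst.emit("spill_done", task_type="map", millis=80)      # recorded
print(spill_times)  # [120, 80]
```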

Generating Profiles by Measurement (continued)
Diagram: profiling is enabled in the task JVMs through ECA rules; the map tasks (split 0, split 1) and reduce task (out 0) emit raw data, which is turned into a map profile and a reduce profile and combined into the job profile.
Use of Sampling:
- Profile fewer tasks
- Execute fewer tasks
JVM = Java Virtual Machine, ECA = Event-Condition-Action

What-if Engine
Diagram labels: Job Profile; Properties of Hypothetical job (Input Data Properties, Cluster Resources, Configuration Settings; Possibly Hypothetical); What-if Engine (Job Oracle, Task Scheduler Simulator); Virtual Job Profile.

Virtual Profile Estimation
Given profile for job j = estimate profile for job j' =
Inputs: Profile for j (Dataflow Statistics, Dataflow, Cost Statistics, Costs); Input Data d2; Configuration c2; Resources r2.
(Virtual) Profile for j':
- Dataflow Statistics: Cardinality Models
- Dataflow: White-box Models
- Cost Statistics: Relative Black-box Models
- Costs: White-box Models
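
A heavily simplified illustration of the white-box step is sketched below. It assumes that the dataflow statistics and cost statistics measured for job j carry over unchanged to job j' (the slide instead uses cardinality models and relative black-box models for those), and then derives dataflow and phase costs for a hypothetical input size by plain multiplication. The function name, dictionary keys, and all numbers are hypothetical.

```python
def estimate_map_side(stats: dict, input_records: int) -> dict:
    """Estimate map-side dataflow and costs for a hypothetical input size.

    Simplifying assumption: per-record and per-byte statistics measured on
    the profiled job stay the same for the hypothetical job.
    """
    input_bytes = input_records * stats["input_pair_width_bytes"]
    output_records = input_records * stats["map_selectivity_records"]
    return {
        # Dataflow derived from dataflow statistics.
        "map_input_bytes": input_bytes,
        "map_output_records": output_records,
        # Costs derived from dataflow and cost statistics.
        "read_phase_ns": input_bytes * stats["io_cost_ns_per_byte"],
        "map_phase_ns": input_records * stats["cpu_cost_ns_per_record"],
    }


# Statistics as they might appear in the profile of job j (made-up values).
profile_stats = {
    "input_pair_width_bytes": 100,
    "map_selectivity_records": 2.0,
    "io_cost_ns_per_byte": 10,
    "cpu_cost_ns_per_record": 500,
}

# Hypothetical job j': same program, 1,000,000 input records per map task.
virtual = estimate_map_side(profile_stats, input_records=1_000_000)
print(virtual["map_output_records"])  # 2000000.0
print(virtual["map_phase_ns"])        # 500000000
```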

Job Optimizer
Diagram labels: inputs Job Profile, Input Data Properties, Cluster Resources; Just-in-Time Optimizer (Subspace Enumeration, Recursive Random Search) issuing What-if calls; output Best Configuration Settings.
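
The search component can be sketched as follows. This is a simplified variant of recursive random search (sample at random, then re-sample in a shrinking box around the best point found so far) and omits the restart logic of the full algorithm. The cost function `toy_what_if_cost` merely stands in for what-if calls; it, the parameter names, and the bounds are hypothetical.

```python
import random


def recursive_random_search(cost, bounds, samples=20, rounds=5,
                            shrink=0.5, seed=0):
    """Minimise cost over a box by sampling, then shrinking around the best."""
    rng = random.Random(seed)
    box = dict(bounds)
    best, best_cost = None, float("inf")
    for _ in range(rounds):
        # Explore: draw random points from the current box.
        for _ in range(samples):
            point = {k: rng.uniform(lo, hi) for k, (lo, hi) in box.items()}
            point_cost = cost(point)
            if point_cost < best_cost:
                best, best_cost = point, point_cost
        # Exploit: shrink the box around the best point, staying in bounds.
        new_box = {}
        for k, (lo, hi) in box.items():
            half = (hi - lo) * shrink / 2
            global_lo, global_hi = bounds[k]
            new_box[k] = (max(global_lo, best[k] - half),
                          min(global_hi, best[k] + half))
        box = new_box
    return best, best_cost


def toy_what_if_cost(conf):
    """Stand-in for a what-if call: a bowl with its minimum mid-range."""
    return (((conf["reducers"] - 50.5) / 49.5) ** 2
            + ((conf["sort_mb"] - 225.0) / 175.0) ** 2)


BOUNDS = {"reducers": (1.0, 100.0), "sort_mb": (50.0, 400.0)}
best_conf, best_cost = recursive_random_search(toy_what_if_cost, BOUNDS)
print(best_conf, best_cost)
```

Because each evaluation is an estimate rather than a real job run, a search like this can afford many evaluations; that is the practical benefit of pairing an optimizer with a what-if engine.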

Workflow Optimization Space
Diagram labels: Optimization Space; Physical; Logical; Job-level Configuration; Dataset-level Configuration; Join Selection; Partition Function Selection; Vertical Packing; Inter-job.

Optimizations on TF-IDF Workflow
Figure (Logical Optimization): jobs J1 (M1, R1) and J2 (M2, R2) over datasets D0, D1, D2, D4, followed by J3, J4, are shown next to a plan in which J1 and J2 are combined (M1, R1, M2, R2) with Partition: {D} and Sort: {D,W}, followed by J3, J4 (M3, R3, M4).
Figure (Physical Optimization): two sets of job settings are shown: Reducers = 50, Compress = off, Memory = 400; and Reducers = 20, Compress = on, Memory = 300.
Legend: D = docname, f = frequency, W = word, c = count, t = TF-IDF.

New Challenges
What-if challenges:
- Support concurrent job execution
- Estimate intermediate data properties
Optimization challenges:
- Interactions across jobs
- Extended optimization space
- Find good configuration settings for individual jobs
Diagram: workflow of jobs J1, J2, J3, J4.

Cluster Sizing Problem
Use-cases for cluster sizing:
- Tuning the cluster size for elastic workloads
- Workload transitioning from development cluster to production cluster
- Multi-objective cluster provisioning
Goal: Determine cluster resources & job-level configuration parameters to meet workload requirements
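
One way to picture the multi-objective use-case is as a constrained choice among candidate clusters: pick the cheapest one whose estimated running time meets a deadline. The sketch below assumes that running-time estimates are already available (for example from a what-if analysis) and that nodes are billed per started hour. The node type names, prices, and times are invented for illustration and are not taken from the slides.

```python
import math
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    node_type: str         # hypothetical node type name
    nodes: int             # cluster size
    est_hours: float       # estimated workload running time (assumed given)
    cents_per_node_hour: int


def cost_cents(c: Candidate) -> int:
    """Total cost, assuming billing per started node-hour."""
    return c.nodes * math.ceil(c.est_hours) * c.cents_per_node_hour


def cheapest_within_deadline(candidates: List[Candidate],
                             deadline_hours: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the deadline, if any."""
    feasible = [c for c in candidates if c.est_hours <= deadline_hours]
    return min(feasible, key=cost_cents) if feasible else None


candidates = [
    Candidate("type_a", nodes=10, est_hours=3.2, cents_per_node_hour=10),
    Candidate("type_b", nodes=10, est_hours=1.5, cents_per_node_hour=40),
    Candidate("type_b", nodes=15, est_hours=0.9, cents_per_node_hour=40),
]

print(cheapest_within_deadline(candidates, deadline_hours=4))    # type_a, 10 nodes
print(cheapest_within_deadline(candidates, deadline_hours=2))    # type_b, 15 nodes
print(cheapest_within_deadline(candidates, deadline_hours=0.5))  # None
```

With a loose deadline the slow, cheap cluster wins; tightening the deadline makes the larger cluster the best option even though its hourly rate is higher, because it finishes within a single billed hour.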

Multi-objective Cluster Provisioning
Cloud enables users to provision clusters in minutes.

Experimental Evaluation
Starfish (versions 0.1, 0.2) to manage Hadoop on EC2
Different scenarios: Cluster × Workload × Data

EC2 Node Type | CPU: EC2 units | Mem | I/O Perf. | Cost/hour | #Maps/node | #Reds/node | MaxMem/task
m1.small | 1 (1 x 1) | 1.7 GB | moderate | $ | | | MB
m1.large | 4 (2 x 2) | 7.5 GB | high | $ | | | MB
m1.xlarge | 8 (4 x 2) | 15 GB | high | $ | | | MB
c1.medium | 5 (2 x 2.5) | 1.7 GB | moderate | $ | | | MB
c1.xlarge | 20 (8 x 2.5) | 7 GB | high | $ | | | MB
cc1.4xlarge | 33.5 (8) | 23 GB | very high | $ | | | MB

Experimental Evaluation
Starfish (versions 0.1, 0.2) to manage Hadoop on EC2
Different scenarios: Cluster × Workload × Data

Abbr. | MapReduce Program | Domain | Dataset
CO | Word Co-occurrence | Natural Lang Proc. | Wikipedia (10GB – 22GB)
WC | WordCount | Text Analytics | Wikipedia (30GB – 1TB)
TS | TeraSort | Business Analytics | TeraGen (30GB – 1TB)
LG | LinkGraph | Graph Processing | Wikipedia (compressed ~6x)
JO | Join | Business Analytics | TPC-H (30GB – 1TB)
TF | Term Freq. - Inverse Document Freq. | Information Retrieval | Wikipedia (30GB – 1TB)

Job Optimizer Evaluation
Hadoop cluster: 30 nodes, m1.xlarge
Data sizes: GB

Estimates from the What-if Engine
Hadoop cluster: 16 nodes, c1.medium
MapReduce Program: Word Co-occurrence
Data set: 10 GB Wikipedia
Figure: True surface and Estimated surface.

Profiling Overhead vs. Benefit
Hadoop cluster: 16 nodes, c1.medium
MapReduce Program: Word Co-occurrence
Data set: 10 GB Wikipedia

Multi-objective Cluster Provisioning
Instance Type for Source Cluster: m1.large

More info:
- Job-level MapReduce configuration
- Workflow optimization
- Workload management
- Data layout tuning
- Cluster sizing
Diagram: jobs J1, J2, J3, J4.