Presented by Nirupam Roy Starfish: A Self-tuning System for Big Data Analytics Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong,

Slides:



Advertisements
Similar presentations
Barcelona Supercomputing Center. The BSC-CNS objectives: R&D in Computer Sciences, Life Sciences and Earth Sciences. Supercomputing support to external.
Advertisements

Starfish: A Self-tuning System for Big Data Analytics.
Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University.
Herodotos Herodotou Shivnath Babu Duke University.
Three Perspectives & Two Problems Shivnath Babu Duke University.
Presented by Carl Erhard & Zahid Mian Authors: Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University.
Kensington Oracle Edition: Open Discovery Workflow Meets Oracle 10g Professor Yike Guo.
Achieving Elasticity for Cloud MapReduce Jobs Khaled Salah IEEE CloudNet 2013 – San Francisco November 13, 2013.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications Piyush Shivam, Shivnath Babu, Jeffrey Chase Duke University.
Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard.
Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads Jackie.
1 Community (Optimize both Yarn & Non Yarn Hadoop clusters)
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
Introduction to Systems Analysis and Design
Cloud Computing Other Mapreduce issues Keke Chen.
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
New Challenges in Cloud Datacenter Monitoring and Management
Distribution Statement A. Approved for public release; distribution is unlimited. Test and Evaluation/Science and Technology Program Rapid Data Analyzer.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
HADOOP ADMIN: Session -2
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks Íñigo Goiri, Kien Le, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
A Workflow-Aware Storage System Emalayan Vairavanathan 1 Samer Al-Kiswany, Lauro Beltrão Costa, Zhao Zhang, Daniel S. Katz, Michael Wilde, Matei Ripeanu.
Tyson Condie.
A Dynamic MapReduce Scheduler for Heterogeneous Workloads Chao Tian, Haojie Zhou, Yongqiang He,Li Zha 簡報人:碩資工一甲 董耀文.
Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013.
Profiling, What-if Analysis and Cost- based Optimization of MapReduce Programs Oct 7 th 2013 Database Lab. Wonseok Choi.
Emalayan Vairavanathan
DISTRIBUTED COMPUTING
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HAMS Technologies 1
Cluster Reliability Project ISIS Vanderbilt University.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
What are the main differences and commonalities between the IS and DA systems? How information is transferred between tasks: (i) IS it may be often achieved.
20 October 2006Workflow Optimization in Distributed Environments Dynamic Workflow Management Using Performance Data David W. Walker, Yan Huang, Omer F.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
Benchmarking MapReduce-Style Parallel Computing Randal E. Bryant Carnegie Mellon University.
Stochastic DAG Scheduling using Monte Carlo Approach Heterogeneous Computing Workshop (at IPDPS) 2012 Extended version: Elsevier JPDC (accepted July 2013,
Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.
Automated Control in Cloud Computing: Challenges and Opportunities Harold C. Lim, Shivnath Babu, Jeffrey S. Chase, and Sujay S. Parekh ACM’s First Workshop.
MURI: Integrated Fusion, Performance Prediction, and Sensor Management for Automatic Target Exploitation 1 Dynamic Sensor Resource Management for ATE MURI.
GreenSched: An Energy-Aware Hadoop Workflow Scheduler
CARDIO: Cost-Aware Replication for Data-Intensive workflOws Presented by Chen He.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Active Sampling for Accelerated Learning of Performance Models Piyush Shivam, Shivnath Babu, Jeff Chase Duke University.
MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
Web Log Data Analytics with Hadoop
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Software-defined network(SDN)
Data Mining Techniques Applied in Advanced Manufacturing PRESENT BY WEI SUN.
Prediction-Based Multivariate Query Modeling Analytic Queries.
An Introduction To Big Data For The SQL Server DBA.
1 Performance Impact of Resource Provisioning on Workflows Gurmeet Singh, Carl Kesselman and Ewa Deelman Information Science Institute University of Southern.
What is it and why it matters? Hadoop. What Is Hadoop? Hadoop is an open-source software framework for storing data and running applications on clusters.
B ig D ata Analysis for Page Ranking using Map/Reduce R.Renuka, R.Vidhya Priya, III B.Sc., IT, The S.F.R.College for Women, Sivakasi.
18 May 2006CCGrid2006 Dynamic Workflow Management Using Performance Data Lican Huang, David W. Walker, Yan Huang, and Omer F. Rana Cardiff School of Computer.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
BAHIR DAR UNIVERSITY Institute of technology Faculty of Computing Department of information technology Msc program Distributed Database Article Review.
Organizations Are Embracing New Opportunities
Resource Elasticity for Large-Scale Machine Learning
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Department of Computer Science University of California, Santa Barbara
Resource Allocation in a Middleware for Streaming Data
Presentation transcript:

Presented by Nirupam Roy Starfish: A Self-tuning System for Big Data Analytics Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu Department of Computer Science Duke University

The Growth of Data

MAD: Features of Ideal Analytics System M agnetism A gility D epth -- accept all data -- allow complex analysis -- adapt with data, real-time processing

M agnetism A gility D epth -- accept all data -- adapt with data, real-time processing -- allow complex analysis Hadoop is MAD - Blindly loads data into HDFS. - Fine-grained scheduler - End-to-end data pipeline - Dynamic node addition/ dropping - Well integrated with programming languages

Tuning for Good Performance: Challenges - Multiple dimensions of performance -- time, cost, scalability … - Tons of Parameters -- more than 190 parameters in Hadoop. - Multiple levels of abstraction -- job-level, workflow-level, workload-level …

Thumb rule Tuning for Good Performance: Challenges

Thumb rule Tuning for Good Performance: Challenges

Starfish: A Self-tuning System - Builds on Hadoop - Tunes to ‘good’ performance automatically

Starfish Architecture

The “What-if” Engine Model + simulation based prediction algo. Predicted performance Learning from previous job profiles Analytical models to estimate dataflow Simulating the execution of MR workload Profile of a job (P) + New parameter set (S) [Ref:] A What-if Engine for Cost-based MapReduce Optimization. H. Herodotou et.al.

The “What-if” Engine Ground truthEstimated by the What-if engine

Starfish Architecture: Job Level

Just-in-time optimizer -- Searches the parameter space Profiler -- Collects info. on MapReduce job execution through dynamic instrumentation -- Reports timings, data size, and resource utilization Sampler -- Generates profile statistics from training benchmark jobs

Starfish Architecture: Workflow Level

Scheduler to balanced distribution of data Block placement policy for data collocation -- deals with skewed data, add/drop of nodes, tradeoff between balanced data v/s data-locality -- Local-write v/s round-robin

Starfish Architecture: Workflow Level Producer Consumer Wasted production

Starfish Architecture: Workflow Level File level parallelism Block level parallelism

Starfish Architecture: Workflow Level What-if simulation Workflow Aware Optimizer Select best data layout and job parameters MR job execution Task scheduling Block placement Compare cost & benefits Running time? Data layout?

Starfish Architecture: Workload Level

Workload Optimizer Elastisizer Determine best cluster and Hadoop configurations Jumbo operator Cost based estimation for best optimization

Starfish: Summary - Optimizes on different granularities -- Workload, workflow, job (procedural & declarative) - Considers different decision points -- Provisioning, optimization, Scheduling, Data layout

Starfish: Piazza Discussion 1) Limited evaluation: 10 Top criticisms (till 1:30pm, 17 reviews): 2) Not explained well: 7 3) Profiler overhead/better search algo: 5 * What is the effect of wrong prediction? * What-if engine requires prior knowledge.

Thank you. Photo courtesy: Starfish group, Duke University

Going MAD with Big Data M agnetic system A gile system and Analytics D eep Analytics D ata Life Cycle Awareness E lasticity R obustness

Backup: What-if Engine 1

Backup: What-if Engine 2