Using Map-reduce to Support MPMD Peng


Our Motivation
The default job scheduler in Hadoop keeps a first-in-first-out queue of jobs for each priority level. The scheduler always assigns task slots to the first job in the highest-priority queue that still needs tasks.
Problems:
– Difficult to share a MapReduce cluster between users (Multi-tasks)
– Difficult to implement a composite task made of more than one job with inter-dependencies (Composite-tasks)
A strong motivation to improve the Hadoop framework to:
– support Multi-tasks
– support Composite-tasks

Multi-tasks Problem
One solution to this problem is to create separate MapReduce clusters for different user groups with Hadoop On-Demand, but this hurts system utilization because a group's cluster may be mostly idle for long periods of time.
More advanced solutions:
– Facebook Fair Scheduler
– Yahoo Capacity Scheduler

Facebook Fair Scheduler
Jobs are placed into named “pools”. Each pool can have a “guaranteed capacity”, specified through a config file, which gives a minimum number of map slots and reduce slots to allocate to the pool. When there are pending jobs in the pool, it gets at least this many slots, but if it has no jobs, the slots can be used by other pools.
Excess capacity that is not going toward a pool’s minimum is allocated between jobs using fair sharing.
– Fair sharing splits up compute time proportionally between jobs that have been submitted, emulating an "ideal" scheduler that gives each job 1/Nth of the available capacity.
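The allocation rule described on this slide can be illustrated with a small sketch. It is not the Fair Scheduler's actual code: the class `FairShareSketch`, the `Pool` type, and the `allocate` method are hypothetical names, and the sketch assumes a single slot type, ignores how many slots a job can actually use, and splits the excess equally per pending job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FairShareSketch {
    /** Hypothetical pool description: a configured minimum and a count of pending jobs. */
    static class Pool {
        final String name;
        final int minSlots;
        final int pendingJobs;

        Pool(String name, int minSlots, int pendingJobs) {
            this.name = name;
            this.minSlots = minSlots;
            this.pendingJobs = pendingJobs;
        }
    }

    /** Returns the number of slots (possibly fractional) given to each pool. */
    static Map<String, Double> allocate(int totalSlots, Pool... pools) {
        Map<String, Double> share = new LinkedHashMap<>();
        int guaranteed = 0;
        int totalJobs = 0;
        for (Pool p : pools) {
            // A pool only claims its guaranteed minimum while it has pending jobs.
            int min = p.pendingJobs > 0 ? p.minSlots : 0;
            share.put(p.name, (double) min);
            guaranteed += min;
            totalJobs += p.pendingJobs;
        }
        // Whatever is left after the minimums is split equally across all pending jobs.
        double excess = Math.max(0, totalSlots - guaranteed);
        if (totalJobs > 0) {
            for (Pool p : pools) {
                share.merge(p.name, excess * p.pendingJobs / totalJobs, Double::sum);
            }
        }
        return share;
    }

    public static void main(String[] args) {
        Map<String, Double> share = allocate(10,
                new Pool("A", 4, 1),   // has a job, so it keeps its minimum of 4
                new Pool("B", 2, 0),   // idle, so its minimum is released
                new Pool("C", 0, 2));  // no minimum, lives on the excess
        System.out.println(share);
    }
}
```

With 10 slots, pool A keeps its minimum of 4 and receives one third of the 6 excess slots, idle pool B receives nothing, and pool C receives the remaining two thirds of the excess.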

Yahoo Capacity Scheduler
Define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues.
However, within each queue, FIFO scheduling with priorities is used, with one exception: you can place a limit on the percentage of running tasks per user, so that users share a cluster equally.

There is still a Problem!
Both the Yahoo and Facebook schedulers assign dedicated map and reduce slots to tasks, so they do not comply with the “moving computation to data” principle.
Our solution:
– Turn Hadoop into MPMD (multiple program, multiple data; computation resource sharing): different users can submit multiple tasks, which are assigned to different mappers/reducers and run simultaneously. Load balancing is achieved by keeping the computing nodes busy with tasks.

Using the traditional Map-reduce to support MPMD
[Diagram. Labels: Data 1, Data 2, Data 3, …, Data n; RunnerMap (several instances); Output 1, Output 2, …, Output n; Output; "Look up the code for Data"; RunnerReduce; MapProcedure; ReduceProcedure]

Running WordCount and Hadoop Blast using the extended framework
– WordCountMapProcedure extends abstract class MapProcedure
– WordCountReduceProcedure extends abstract class ReduceProcedure
– BlastMapProcedure extends abstract class MapProcedure
– RunnerMap, RunnerReduce
– Mapping entries:
blast_input_1.fa:edu.indiana.cs.b649.BlastMapProcedure
wordcount_input_1.txt:edu.indiana.cs.b649.WordCountMapProcedure
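The "file name : class name" entries suggest that a generic runner looks up which procedure to execute for each input. A minimal sketch of that idea follows. It does not use the Hadoop API and is not the authors' implementation: `RunnerMapSketch`, `parseTable`, `run`, the single-method `MapProcedure`, and the two toy procedures are all assumptions made for illustration, with reflection standing in for whatever lookup the real framework uses.

```java
import java.util.HashMap;
import java.util.Map;

public class RunnerMapSketch {
    /** Hypothetical stand-in for the abstract MapProcedure named on the slide. */
    public abstract static class MapProcedure {
        public abstract String map(String record);
    }

    /** Toy procedure: upper-cases a record. */
    public static class UpperCaseProcedure extends MapProcedure {
        @Override
        public String map(String record) {
            return record.toUpperCase();
        }
    }

    /** Toy procedure: reports the length of a record. */
    public static class LengthProcedure extends MapProcedure {
        @Override
        public String map(String record) {
            return Integer.toString(record.length());
        }
    }

    /** Parses "fileName:className" lines into a lookup table. */
    static Map<String, String> parseTable(String... lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(':');
            table.put(line.substring(0, sep), line.substring(sep + 1));
        }
        return table;
    }

    /** Looks up the procedure registered for a file, instantiates it, and applies it. */
    static String run(Map<String, String> table, String fileName, String record) {
        String className = table.get(fileName);
        if (className == null) {
            throw new IllegalArgumentException("No procedure registered for " + fileName);
        }
        try {
            MapProcedure procedure = (MapProcedure) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            return procedure.map(record);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> table = parseTable(
                "a.txt:" + UpperCaseProcedure.class.getName(),
                "b.txt:" + LengthProcedure.class.getName());
        System.out.println(run(table, "a.txt", "hello")); // prints "HELLO"
        System.out.println(run(table, "b.txt", "hello")); // prints "5"
    }
}
```

Because the runner only depends on the abstract class, two unrelated programs can share the same set of mappers, which is the resource-sharing point of the MPMD approach.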

Composite task problem
To support a composite task made of more than one job with inter-dependencies.

Support Composite-tasks
[Diagram. Labels: in, out, out0; Map, Reduce, Map; File 1, File 2, File 3, File 4; empty_file; Part-r; blast_input_0.fa, blast_input_1.fa; empty_file.out]
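The slides do not show how dependent jobs are ordered. As one possible illustration only, the sketch below orders the jobs of a composite task so that each job runs after the jobs it depends on, using a standard topological sort (Kahn's algorithm). The class `CompositeTaskSketch`, the method `executionOrder`, and the job names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositeTaskSketch {
    /**
     * Takes a map from job name to the jobs it depends on and returns
     * an order in which every job comes after all of its dependencies.
     */
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> unmet = new LinkedHashMap<>();       // job -> unmet dependency count
        Map<String, List<String>> dependents = new HashMap<>();   // job -> jobs waiting on it
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            unmet.putIfAbsent(e.getKey(), 0);
            for (String dep : e.getValue()) {
                unmet.putIfAbsent(dep, 0);
                unmet.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Jobs with no unmet dependencies can start immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : unmet.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.poll();
            order.add(job);
            // Finishing a job may unblock the jobs that were waiting on it.
            for (String next : dependents.getOrDefault(job, Collections.<String>emptyList())) {
                if (unmet.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != unmet.size()) {
            throw new IllegalStateException("Dependency cycle among jobs");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("blast", Collections.<String>emptyList());
        deps.put("wordcount", Collections.singletonList("blast"));
        System.out.println(executionOrder(deps)); // prints "[blast, wordcount]"
    }
}
```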

Demo
Running Hadoop Blast + Advanced WordCount
– Single-node mode: 2 mappers + 2 reducers
– Input files: blast_input_0.fa, blast_input_1.fa, wordcount_input_0.txt, wordcount_input_1.txt, empty_file
– Output files: blast_input_0.fa.out, blast_input_1.fa.out, empty_file.out

Performance Test
Task execution time (ms) = job launching time + job execution time

Roles of team members
– Peng: implemented the framework to support Multi-tasks
– Yuan: improved the framework to support Composite-tasks

Q&A Thanks!