Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect @acmurthy (@hortonworks) Formerly Architect, MapReduce @ Yahoo! 8 years @ Yahoo! © Hortonworks Inc. 2011 June 29, 2011
Hello! I’m Arun… Architect & Lead, Apache Hadoop MapReduce Development Team at Hortonworks (formerly at Yahoo!) Apache Hadoop Committer and Member of PMC −Full-time contributor to Apache Hadoop since early 2006
Hadoop MapReduce Today JobTracker −Manages cluster resources and job scheduling TaskTracker −Per-node agent −Manages tasks
Current Limitations Scalability −Maximum cluster size – 4,000 nodes −Maximum concurrent tasks – 40,000 −Coarse synchronization in JobTracker Single point of failure −Failure kills all queued and running jobs −Jobs need to be re-submitted by users −Restart is very tricky due to complex state Hard partition of resources into map and reduce slots
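The last limitation above can be made concrete with a toy calculation (illustrative only, not Hadoop code): with hard-partitioned map and reduce slots, slots of one type sit idle even when there is plenty of pending work of the other type.

```python
def slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of slots busy when each slot type can only run its own task type."""
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)

# Map-heavy phase on a node with 4 map and 4 reduce slots: 10 maps pending,
# no reduces ready yet, so the 4 reduce slots are wasted.
fixed = slot_utilization(4, 4, pending_maps=10, pending_reduces=0)   # 0.5

# With one pool of 8 generic containers (the next-generation model),
# all 8 can run maps.
pooled = min(8, 10) / 8                                              # 1.0
```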
Current Limitations Lacks support for alternate paradigms −Iterative applications implemented using MapReduce are 10x slower. −Example: K-Means, PageRank Lack of wire-compatible protocols −Client and cluster must be of same version −Applications and workflows cannot migrate to different clusters
Requirements Reliability Availability Scalability - Clusters of 6,000-10,000 machines −Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks −100,000+ concurrent tasks −10,000 concurrent jobs Wire Compatibility Agility & Evolution – Ability for customers to control upgrades to the grid software stack.
Design Centre Split up the two major functions of JobTracker −Cluster resource management −Application life-cycle management MapReduce becomes user-land library
Resource Manager −Global resource scheduler −Hierarchical queues Node Manager −Per-machine agent −Manages the life-cycle of containers −Container resource monitoring Application Master −Per-application −Manages application scheduling and task execution −E.g. MapReduce Application Master
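The control flow among these three components can be sketched as follows. This is a hypothetical toy model, not the real YARN API: the class and method names are illustrative. An Application Master asks the Resource Manager for containers, and the RM grants them from the per-node free capacity reported by Node Managers.

```python
class ResourceManager:
    """Toy global scheduler: tracks free memory per node, grants containers."""

    def __init__(self, node_capacity_mb):
        # node name -> free memory (MB), as reported by Node Manager heartbeats
        self.free = dict(node_capacity_mb)

    def allocate(self, requests_mb):
        """Grant each memory request on the first node with enough free capacity."""
        granted = []
        for mem in requests_mb:
            for node, free in self.free.items():
                if free >= mem:
                    self.free[node] -= mem
                    granted.append((node, mem))
                    break
        return granted

rm = ResourceManager({"node1": 4096, "node2": 2048})
# A MapReduce Application Master requests three 1.5 GB containers;
# two fit on node1, the third spills to node2.
grants = rm.allocate([1536, 1536, 1536])
```

The real Resource Manager layers hierarchical queues, locality preferences, and multiple resource dimensions on top of this basic request/grant loop.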
Improvements vis-à-vis current MapReduce Scalability −Application life-cycle management is very expensive −Partition resource management and application life-cycle management −Application management is distributed −Hardware trends - Currently run clusters of 4,000 machines; 6,000 machines of 2012 > 12,000 machines of 2009
Improvements vis-à-vis current MapReduce Fault Tolerance and Availability −Resource Manager No single point of failure – state saved in ZooKeeper Application Masters are restarted automatically on RM restart Applications continue to progress with existing resources during restart; new resources aren't allocated −Application Master Optional failover via application-specific checkpoint MapReduce applications pick up where they left off via state saved in HDFS
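The checkpoint-based recovery idea can be sketched in miniature (illustrative only; the names and the JSON file stand in for application-specific state in HDFS): the Application Master records completed work as it goes, so a restarted AM resumes from the last checkpoint instead of rerunning everything.

```python
import json
import os

def run_tasks(tasks, checkpoint_path):
    """Run tasks in order, skipping any recorded as complete in the checkpoint."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:        # restarted AM: recover prior state
            done = set(json.load(f))
    executed = []
    for t in tasks:
        if t in done:
            continue                            # finished before the failure
        executed.append(t)                      # (the real task would run here)
        done.add(t)
        with open(checkpoint_path, "w") as f:   # persist progress after each task
            json.dump(sorted(done), f)
    return executed
```

A first run executes every task and checkpoints each one; a second run against the same checkpoint file re-executes nothing.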
Improvements vis-à-vis current MapReduce Wire Compatibility −Protocols are wire-compatible −Old clients can talk to new servers −Rolling upgrades
Improvements vis-à-vis current MapReduce Innovation and Agility −MapReduce now becomes a user-land library −Multiple versions of MapReduce can run in the same cluster (a la Apache Pig) Faster deployment cycles for improvements −Customers upgrade MapReduce versions on their schedule −Users can customize MapReduce e.g. HOP without affecting everyone!
Improvements vis-à-vis current MapReduce Utilization −Generic resource model Memory CPU Disk bandwidth Network bandwidth −Remove fixed partition of map and reduce slots
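In the generic model, a container request is a vector over these resource dimensions, and a node can host the container only if every dimension fits. A minimal sketch (hypothetical names, not the YARN API):

```python
def fits(request, available):
    """A node can host a container only if every requested dimension fits."""
    return all(available.get(dim, 0) >= need for dim, need in request.items())

node = {"memory_mb": 8192, "vcores": 8, "disk_mbps": 400}

fits({"memory_mb": 2048, "vcores": 2}, node)    # True: both dimensions fit
fits({"memory_mb": 2048, "vcores": 16}, node)   # False: not enough cores
```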
Improvements vis-à-vis current MapReduce Support for programming paradigms other than MapReduce −MPI −Master-Worker −Machine Learning −Iterative processing −Enabled by allowing use of paradigm-specific Application Master −Run all on the same Hadoop cluster
Summary MapReduce.Next takes Hadoop to the next level −Scale-out even further −High availability −Cluster utilization −Support for paradigms other than MapReduce
Status – July, 2011 Feature complete Rigorous testing cycle underway −Scale testing at ~500 nodes Sort/Scan/Shuffle benchmarks GridMixV3! −Integration testing Pig integration complete! Coming in the next release of Apache Hadoop! Beta deployments of next release of Apache Hadoop at Yahoo! in Q4, 2011
Questions? http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/ http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen-scheduler/
Thank You. @acmurthy