Next Generation of Apache Hadoop MapReduce Owen

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
A Hadoop Overview. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A.
Hadoop 2.0 and YARN SUBASH D’SOUZA. Who am I?  Senior Specialist Engineer at Shopzilla  Co-Organizer for the Los Angeles Hadoop User group  Organizer.
Wei-Chiu Chuang 10/17/2013 Permission to copy/distribute/adapt the work except the figures which are copyrighted by ACM.
Hadoop YARN in the Cloud Junping Du Staff Engineer, VMware China Hadoop Summit, 2013.
Resource Management with YARN: YARN Past, Present and Future
APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
Hadoop Ecosystem Overview
Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, MapReduce.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Developing and Deploying Apache Hadoop Security Owen O’Malley - Hortonworks Co-founder and © Hortonworks Inc.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
State of the Elephant Hadoop yesterday, today, and tomorrow Page 1 Owen
Our Experience Running YARN at Scale Bobby Evans.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team
MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
Hadoop implementation of MapReduce computational model Ján Vaňo.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light.
A Platform for Fine-Grained Resource Sharing in the Data Center
Hadoop & Neptune Feb 김형준.
Experiments in Utility Computing: Hadoop and Condor Sameer Paranjpye Y! Web Search.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.
Part III BigData Analysis Tools (YARN) Yuan Xue
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
Hadoop Aakash Kag What Why How 1.
Introduction to Distributed Platforms
An Open Source Project Commonly Used for Processing Big Data Sets
Chapter 10 Data Analytics for IoT
Introduction to HDFS: Hadoop Distributed File System
Hadoop Clusters Tess Fulkerson.
Software Engineering Introduction to Apache Hadoop Map Reduce
Central Florida Business Intelligence User Group
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Distributed Systems CS
The Basics of Apache Hadoop
湖南大学-信息科学与工程学院-计算机与科学系
Hadoop Technopoints.
Overview of big data tools
Lecture 16 (Intro to MapReduce and Hadoop)
CS 345A Data Mining MapReduce This presentation has been altered.
Distributed Systems CS
Presentation transcript:

Next Generation of Apache Hadoop MapReduce Owen

What is Hadoop?  A framework for storing and processing big data on lots of commodity machines. - Up to 4,000 machines in a cluster - Up to 20 PB in a cluster  Open Source Apache project  High reliability done in software - Automated failover for data and computation  Implemented in Java  Primary data analysis platform at Yahoo! - 40,000+ machines running Hadoop

What is Hadoop?  HDFS – Distributed File System - Combines cluster’s local storage into a single namespace. - All data is replicated to multiple machines. - Provides locality information to clients  MapReduce - Batch computation framework - Tasks re-executed on failure - User code wrapped around a distributed sort - Optimizes for data locality of input

twice the engagement 3 Personalized for each visitor Result: twice the engagement +160% clicks vs. one size fits all +79% clicks vs. randomly selected +43% clicks vs. editor selected Recommended links News InterestsTop Searches Case Study: Yahoo Front Page

Hadoop MapReduce Today  JobTracker - Manages cluster resources and job scheduling  TaskTracker - Per-node agent - Manage tasks

Current Limitations  Scalability - Maximum Cluster size – 4,000 nodes - Maximum concurrent tasks – 40,000 - Coarse synchronization in JobTracker  Single point of failure - Failure kills all queued and running jobs - Jobs need to be re-submitted by users  Restart is very tricky due to complex state  Hard partition of resources into map and reduce slots

Current Limitations  Lacks support for alternate paradigms - Iterative applications implemented using MapReduce are 10x slower. - Users use MapReduce to run arbitrary code - Example: K-Means, PageRank  Lack of wire-compatible protocols - Client and cluster must be of same version - Applications and workflows cannot migrate to different clusters

MapReduce Requirements for 2011  Reliability  Availability  Scalability - Clusters of 6,000 machines - Each machine with 16 cores, 48G RAM, 24TB disks - 100,000 concurrent tasks - 10,000 concurrent jobs  Wire Compatibility  Agility & Evolution – Ability for customers to control upgrades to the grid software stack.

MapReduce – Design Focus  Split up the two major functions of JobTracker - Cluster resource management - Application life-cycle management  MapReduce becomes user-land library

Architecture

 Resource Manager - Global resource scheduler - Hierarchical queues  Node Manager - Per-machine agent - Manages the life-cycle of container - Container resource monitoring  Application Master - Per-application - Manages application scheduling and task execution - E.g. MapReduce Application Master

Improvements vis-à-vis current MapReduce  Scalability - Application life-cycle management is very expensive - Partition resource management and application life-cycle management - Application management is distributed - Hardware trends - Currently run clusters of 4,000 machines 6, machines > 12, machines v/s

Improvements vis-à-vis current MapReduce  Availability - Application Master Optional failover via application-specific checkpoint MapReduce applications pick up where they left off - Resource Manager No single point of failure - failover via ZooKeeper Application Masters are restarted automatically

Improvements vis-à-vis current MapReduce  Wire Compatibility - Protocols are wire-compatible - Old clients can talk to new servers - Evolution toward rolling upgrades

Improvements vis-à-vis current MapReduce  Innovation and Agility - MapReduce now becomes a user-land library - Multiple versions of MapReduce can run in the same cluster (a la Apache Pig) Faster deployment cycles for improvements - Customers upgrade MapReduce versions on their schedule - Users can use customized MapReduce versions without affecting everyone!

Improvements vis-à-vis current MapReduce  Utilization - Generic resource model Memory CPU Disk b/w Network b/w - Remove fixed partition of map and reduce slots

Improvements vis-à-vis current MapReduce  Support for programming paradigms other than MapReduce - MPI - Master-Worker - Machine Learning and Iterative processing - Enabled by paradigm-specific Application Master - All can run on the same Hadoop cluster

Summary  Takes Hadoop to the next level - Scale-out even further - High availability - Cluster Utilization - Support for paradigms other than MapReduce

Questions?