Published by Daniel Anderson
Modified about 1 year ago
© Hortonworks Inc. 2011
Running Non-MapReduce Applications on Apache Hadoop
  Hitesh Shah & Siddharth Seth, Hortonworks Inc.
Who am I?
  Hitesh Shah
    – Member of Technical Staff at Hortonworks Inc.
    – Apache Hadoop PMC member and committer
    – Apache Tez and Apache Ambari PPMC member and committer
  Siddharth Seth
    – Member of Technical Staff at Hortonworks Inc.
    – Apache Hadoop PMC member and committer
    – Apache Tez PPMC member and committer
Architecting the Future of Big Data
Agenda
  Apache Hadoop v1 to v2
  YARN
  Applications on YARN
  YARN Best Practices
Apache Hadoop v1
  [Diagram: a Job Client submits a job to the JobTracker; TaskTrackers provide Map slots and Reduce slots]
Apache Hadoop v1
  Pros:
    – A framework to run MapReduce jobs that lets you run the same code on anything from a single-node cluster to one spanning thousands of machines.
  Cons:
    – It is a framework to run MapReduce jobs.
Apache Giraph
  Iterative graph processing on a Hadoop cluster.
  An iterative approach on MapReduce would require running multiple jobs.
  To avoid MR overheads, Giraph runs everything as a map-only job.
  [Diagram: a set of map tasks, some acting as Master and the rest as Workers]
Apache Oozie
  Workflow scheduler system to manage Hadoop jobs.
  Running a Pig script through Oozie
  [Diagram, steps 1 to 3: Oozie submits a job to the JobTracker; a map task runs as the Pig script launcher; the launcher submits the subsequent MR jobs]
Apache Hadoop v2
YARN: The Operating System of a Hadoop cluster
The YARN Stack
YARN Glossary
  Installer
    – Application Installer or Application Client
  Client
    – Application Client
  Supervisor
    – Application Master
  Workers
    – Application Containers
YARN Architecture
  [Diagram: a Client submits an application to the ResourceManager; NodeManagers host App Masters and Containers]
YARN Application Flow
  [Diagram: components are the Application Client, Resource Manager, Application Master, NodeManager and App Containers. Labels: YarnClient (Application Client Protocol), AMRMClient (Application Master Protocol), NMClient (Container Management Protocol), App Specific API]
YARN Protocols & Client Libraries
  Application Client Protocol: Client to RM interaction
    – Library: YarnClient
    – Application lifecycle control
    – Access cluster information
  Application Master Protocol: AM to RM interaction
    – Library: AMRMClient / AMRMClientAsync
    – Resource negotiation
    – Heartbeat to the RM
  Container Management Protocol: AM to NM interaction
    – Library: NMClient / NMClientAsync
    – Launching allocated containers
    – Stopping running containers
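The three protocols listed above can be pictured with a toy model. The sketch below is plain Python, not the YARN client libraries: `ResourceManager`, `NodeManager` and `run_toy_application` are hypothetical stand-ins invented for illustration, and the round-robin placement is an assumption, not how a real scheduler behaves. It only shows which party talks to which at each step.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a NodeManager: records the containers it was asked to start."""
    name: str
    running: list = field(default_factory=list)

    def start_container(self, container_id, command):
        # Container Management Protocol step (AM -> NM).
        self.running.append((container_id, command))


@dataclass
class ResourceManager:
    """Toy stand-in for the ResourceManager: tracks apps and hands out containers."""
    nodes: list
    next_id: int = 1
    apps: dict = field(default_factory=dict)

    def submit_application(self, app_name):
        # Application Client Protocol step (client -> RM).
        app_id = f"app_{self.next_id}"
        self.next_id += 1
        self.apps[app_id] = {"name": app_name, "state": "ACCEPTED"}
        return app_id

    def allocate(self, app_id, count):
        # Application Master Protocol step (AM -> RM).
        # Assumption: containers are spread round-robin over the nodes.
        self.apps[app_id]["state"] = "RUNNING"
        return [
            (f"{app_id}_container_{i}", self.nodes[i % len(self.nodes)])
            for i in range(count)
        ]


def run_toy_application(rm, app_name, commands):
    """Walk through the three interactions in order and return the application id."""
    app_id = rm.submit_application(app_name)         # 1. client -> RM
    grants = rm.allocate(app_id, len(commands))      # 2. AM -> RM
    for (container_id, node), command in zip(grants, commands):
        node.start_container(container_id, command)  # 3. AM -> NM
    return app_id
```

In a real application the first step would go through YarnClient, the second through AMRMClient and the third through NMClient, as the slide lists.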
Applications on YARN
YARN Applications
  Categorizing Applications
    – What does the Application do?
    – Application Lifetime
    – How Applications accept work
    – Language
  Application Lifetime
    – Job submit to complete
    – Long-running Services
  Job Submissions
    – One job : One Application
    – Multiple jobs per application
Language considerations
  Hadoop RPC uses Google Protobuf
    – Protobuf bindings: C/C++, Go, Java, Python…
  Accessing HDFS
    – WebHDFS
    – libhdfs for C
    – Python client by Spotify Labs: Snakebite
  YARN Application Logic
    – ApplicationMaster in Java and containers in any language
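WebHDFS is what makes HDFS reachable from any language, because it is a REST API over HTTP. As a minimal sketch, the helper below only builds a request URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`; it sends nothing. The function name `webhdfs_url`, the host name and the port are hypothetical values chosen for illustration and should be replaced with the NameNode HTTP address of an actual cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # 'op' comes first, followed by any extra query parameters such as user.name.
    query = urlencode({"op": op, **(params or {})})
    # quote() keeps '/' intact and escapes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host, port and path, for illustration only.
print(webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt", "OPEN"))
print(webhdfs_url("namenode.example.com", 50070, "/user/alice", "LISTSTATUS",
                  {"user.name": "alice"}))
```

Any HTTP client can then issue a GET against such a URL, which is why non-Java containers do not need a native HDFS binding.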
Tez (App Submission)
  Distributed execution framework: computation is expressed as a DAG.
  Takes MapReduce, where each job was limited to a Map and/or Reduce stage, to the next level.
  [Diagram: components are the Tez Client (job submission, monitoring), the YARN Resource Manager, the Node Manager(s), the Tez AM (DAG execution logic, task co-ordination, local task scheduling) and Tasks. Labels: Launch AM, AM Launched, Submit DAG, Request Resources, Allocated Resources, Launch Tasks, Tasks Launched, Heartbeat]
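To make "computation is expressed as a DAG" concrete, here is a small sketch of how an execution engine could derive a valid run order from a DAG. It uses Kahn's topological sort in plain Python; it is not the Tez API, and the function name `execution_order` and the vertex names are hypothetical.

```python
from collections import deque


def execution_order(dag):
    """Return vertices in an order where each one runs after all of its inputs.

    `dag` maps a vertex name to the list of vertices that consume its output.
    """
    indegree = {v: 0 for v in dag}
    for consumers in dag.values():
        for c in consumers:
            indegree[c] = indegree.get(c, 0) + 1
    # Vertices with no inputs can start immediately (sorted for a stable order).
    ready = deque(sorted(v for v, d in indegree.items() if d == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for c in dag.get(v, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)  # all inputs finished, so this vertex may run
    if len(order) != len(indegree):
        raise ValueError("graph has a cycle; not a DAG")
    return order


# Hypothetical four-vertex job: two scans feed a join, which feeds an aggregation.
job = {
    "scan_orders": ["join"],
    "scan_users": ["join"],
    "join": ["aggregate"],
    "aggregate": [],
}
print(execution_order(job))
# -> ['scan_orders', 'scan_users', 'join', 'aggregate']
```

Expressed as classic MapReduce, the same job would need several chained jobs; as a DAG it is a single application whose stages are scheduled as their inputs complete.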
HOYA (Long Running App)
  On-demand HBase cluster setup
  Share cluster resources: persist and shut down the cluster when not needed
  Dynamically handles node failures
  Allows resizing of a running HBase cluster
Samza on YARN (Failure Handling App)
  Stream processing system: uses YARN as the execution framework
  Makes use of CGroups support in YARN for CPU isolation
  Uses Kafka as underlying store
  [Diagram: components are the YARN Resource Manager and Node Manager(s), the Samza AM, Task Containers running Tasks, and Kafka (Streams). Labels: Container Finished, Get New Containers, Launch Container]
YARN Eco-system
  Powered by YARN
    – Apache Giraph: Graph Processing
    – Apache Hama: BSP
    – Apache Hadoop MapReduce: Batch
    – Apache Tez: Batch/Interactive
    – Apache S4: Stream Processing
    – Apache Samza: Stream Processing
    – Apache Storm: Stream Processing
    – Apache Spark: Iterative/Interactive applications
    – Cloudera Llama
    – DataTorrent
    – HOYA: HBase on YARN
    – RedPoint Data Management
  YARN Utilities/Frameworks
    – Weave by Continuity
    – REEF by Microsoft
    – Spring support for Hadoop 2
YARN Best Practices
Best Practices
  Use provided Client libraries
  Resource Negotiation
    – You may ask, but you may not get what you want immediately.
    – Locality requests may not always be met.
    – Resources like memory/CPU are guaranteed.
  Failure handling
    – Remember, anything can fail (or YARN can pre-empt your containers).
    – AM failures are handled by YARN, but container failures are handled by the application.
  Checkpointing
    – Checkpoint AM state for AM recovery.
    – If tasks are long running, checkpoint task state.
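The negotiation, failure-handling and checkpointing advice above can be combined in one toy loop. The sketch below is plain Python and does not use any YARN API: `run_tasks`, `load_checkpoint`, `save_checkpoint`, the `grants_per_heartbeat` parameter and the JSON checkpoint file are all hypothetical, standing in for an ApplicationMaster that receives partial grants, re-requests containers for failed tasks, and persists progress so a restarted AM can skip finished work.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of task ids already completed, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def save_checkpoint(path, done):
    """Write the checkpoint atomically: write a temp file, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def run_tasks(tasks, run_task, grants_per_heartbeat, checkpoint_path, max_attempts=3):
    """Toy AM loop: ask for containers, accept partial grants, retry failed tasks.

    `run_task(task, attempt)` returns True on success and False on container failure.
    """
    done = load_checkpoint(checkpoint_path)  # recover state after an AM restart
    attempts = {t: 0 for t in tasks}
    pending = [t for t in tasks if t not in done]
    heartbeats = 0
    while pending:
        heartbeats += 1
        # You may ask for everything, but only part of the request is granted per heartbeat.
        granted = pending[:grants_per_heartbeat]
        pending = pending[grants_per_heartbeat:]
        for task in granted:
            attempts[task] += 1
            if run_task(task, attempts[task]):
                done.add(task)
                save_checkpoint(checkpoint_path, done)
            elif attempts[task] < max_attempts:
                pending.append(task)  # container failed: the application asks again
            else:
                raise RuntimeError(f"task {task} failed {max_attempts} times")
    return done, heartbeats
```

Calling `run_tasks` a second time with the same checkpoint path returns immediately with zero heartbeats, which is the behaviour the checkpointing bullet is after.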
Best Practices
  Cluster Dependencies
    – Try to make zero assumptions about the cluster.
    – Your application bundle should deploy everything required using YARN’s local resources.
  Client-only installs if possible
    – Simplifies cluster deployment and multi-version support
  Securing your Application
    – YARN does not secure communications between the AM and its containers.
Testing/Debugging your Application
  MiniYARNCluster
    – Regression tests
  Unmanaged AM
    – Support to run the AM outside of a YARN cluster for manual testing.
  Logs
    – Log aggregation support to push all logs into HDFS
    – Accessible via CLI, UI.
Future work in YARN
  ResourceManager High Availability and Work-preserving restart
    – Work in progress
  Scheduler Enhancements
    – SLA-driven scheduling, gang scheduling
    – Multiple resource types: disk/network/GPUs/affinity
  Rolling upgrades
  Long running services
    – Better support for running services like HBase
    – Discovery of services, upgrades without downtime
  More utilities/libraries for Application Developers
    – Failover/Checkpointing
  http://hadoop.apache.org
Questions?