© Hortonworks Inc.
Running Non-MapReduce Applications on Apache Hadoop
Hitesh Shah & Siddharth Seth, Hortonworks Inc.
Architecting the Future of Big Data
Who Are We?
Hitesh Shah
– Member of Technical Staff at Hortonworks Inc.
– Apache Hadoop PMC member and committer
– Apache Tez and Apache Ambari PPMC member and committer
Siddharth Seth
– Member of Technical Staff at Hortonworks Inc.
– Apache Hadoop PMC member and committer
– Apache Tez PPMC member and committer
Agenda
– Apache Hadoop v1 to v2
– YARN
– Applications on YARN
– YARN Best Practices
Apache Hadoop v1
Job Client --(Submit Job)--> JobTracker --> TaskTrackers, each with Map Slots and Reduce Slots
Apache Hadoop v1
Pros:
– A framework to run MapReduce jobs: the same piece of code runs unchanged on a single-node cluster or on one spanning thousands of machines.
Cons:
– It is only a framework to run MapReduce jobs.
Apache Giraph
Iterative graph processing on a Hadoop cluster.
An iterative approach on MapReduce would require running multiple jobs. To avoid MapReduce overheads, Giraph runs everything as a single map-only job, with one map task acting as the master and the remaining map tasks acting as workers.
Apache Oozie
Workflow scheduler system to manage Hadoop jobs.
Running a Pig script through Oozie: Oozie submits a launcher map task to the JobTracker; that launcher runs the Pig script, which in turn submits the subsequent MapReduce jobs.
Apache Hadoop v2
YARN
The operating system of a Hadoop cluster.
The YARN Stack
YARN Glossary
– Installer: Application Installer or Application Client
– Client: Application Client
– Supervisor: Application Master
– Workers: Application Containers
YARN Architecture
A Client submits an application to the ResourceManager. The ResourceManager launches the App Master in a container on a NodeManager, and the App Master's containers run on NodeManagers across the cluster.
YARN Application Flow
– The Application Client submits the application to the ResourceManager via YarnClient (Application Client Protocol).
– The Application Master negotiates resources with the ResourceManager via AMRMClient (Application Master Protocol).
– The Application Master launches App Containers on NodeManagers via NMClient (Container Management Protocol).
– The client talks to its Application Master and containers over an app-specific API.
YARN Protocols & Client Libraries
Application Client Protocol: client to RM interaction
– Library: YarnClient
– Application lifecycle control
– Access cluster information
Application Master Protocol: AM to RM interaction
– Library: AMRMClient / AMRMClientAsync
– Resource negotiation
– Heartbeat to the RM
Container Management Protocol: AM to NM interaction
– Library: NMClient / NMClientAsync
– Launching allocated containers
– Stopping running containers
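The AM-to-RM negotiation above can be sketched in miniature. This is a toy Python simulation of the protocol flow only; the real client libraries are Java (AMRMClient etc.), and all names here (ToyResourceManager, ToyAppMaster, heartbeat) are invented for illustration.

```python
# Toy sketch of the AM <-> RM resource-negotiation loop.
# Illustrative only: not the YARN Java API.

class ToyResourceManager:
    """Grants container requests from a fixed pool of capacity."""
    def __init__(self, cluster_capacity):
        self.free = cluster_capacity

    def allocate(self, asked):
        # You may ask, but you may not get everything immediately.
        granted = min(asked, self.free)
        self.free -= granted
        return granted

class ToyAppMaster:
    def __init__(self, rm, needed):
        self.rm = rm
        self.needed = needed
        self.containers = 0

    def heartbeat(self):
        # In the AM protocol, the periodic heartbeat doubles as the
        # allocate call that carries outstanding resource requests.
        outstanding = self.needed - self.containers
        self.containers += self.rm.allocate(outstanding)
        return self.containers

rm = ToyResourceManager(cluster_capacity=3)
am = ToyAppMaster(rm, needed=5)
print(am.heartbeat())  # 3: only part of the request granted now
rm.free += 2           # another application releases resources
print(am.heartbeat())  # 5: the remainder arrives on a later heartbeat
```

The point mirrors the best-practices slide later in the deck: allocation is incremental, so the AM must keep re-asking on each heartbeat rather than assuming a single request is filled at once.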
Applications on YARN
YARN Applications
Categorizing applications:
– What does the application do?
– Application lifetime
– How applications accept work
– Implementation language
Application lifetime:
– Job submission to completion
– Long-running services
Job submission models:
– One job : one application
– Multiple jobs per application
Language Considerations
Hadoop RPC uses Google Protocol Buffers
– Protobuf bindings exist for C/C++, Go, Java, Python, and more.
Accessing HDFS
– WebHDFS (REST over HTTP)
– libhdfs for C
– Snakebite, a Python client from Spotify
YARN application logic
– ApplicationMaster in Java, with containers in any language
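WebHDFS is what makes the "any language" point concrete: HDFS becomes reachable from anything with an HTTP client. The helper below just builds request URLs following the WebHDFS REST layout (`http://<host>:<port>/webhdfs/v1/<path>?op=...`); the hostname and port are placeholders for your own NameNode.

```python
# Build WebHDFS request URLs. The host/port below are illustrative.
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Return the WebHDFS URL for an operation on an HDFS path."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt",
                  "OPEN", **{"user.name": "alice"})
print(url)
# http://namenode.example.com:50070/webhdfs/v1/user/alice/data.txt?op=OPEN&user.name=alice
```

An actual read would then be a plain HTTP GET against that URL, which is exactly why clients like Snakebite, or a shell script with curl, can work without any Java on the client side.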
Tez (App Submission)
Distributed execution framework: computation is expressed as a DAG.
Takes MapReduce, where each job was limited to a Map and/or Reduce stage, to the next level.
Flow: the Tez Client submits the DAG to the ResourceManager, which launches the Tez AM on a NodeManager. The AM handles DAG execution logic, task co-ordination, and local task scheduling: it requests resources from the RM, launches tasks on NodeManagers once resources are allocated, and receives heartbeats from the running tasks. The client monitors the job through the AM.
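The core of "computation expressed as a DAG" is that a vertex can run as soon as all of its upstream vertices finish, with no forced Map/Reduce alternation. A minimal sketch of that ordering (Kahn's topological sort); the vertex names are illustrative, not Tez API:

```python
# DAG-ordered execution sketch: run each vertex once all of its
# upstream dependencies have completed.
from collections import deque

def run_dag(edges, vertices):
    """Return the order in which vertices become runnable."""
    indegree = {v: 0 for v in vertices}
    children = {v: [] for v in vertices}
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)  # here a real engine would launch this vertex's tasks
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:  # last upstream vertex just finished
                ready.append(c)
    return order

# A small pipeline: two independent reads feeding a join, then an aggregate.
print(run_dag([("read1", "join"), ("read2", "join"), ("join", "agg")],
              ["read1", "read2", "join", "agg"]))
# ['read1', 'read2', 'join', 'agg']
```

Note that `read1` and `read2` have no edge between them, so a real engine could run them concurrently; expressing the job as a DAG is what exposes that parallelism, instead of chaining separate MapReduce jobs.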
HOYA (Long-Running App)
– On-demand HBase cluster setup
– Shares cluster resources: persist state and shut the cluster down when not needed
– Dynamically handles node failures
– Allows re-sizing of a running HBase cluster
Samza on YARN (Failure-Handling App)
– Stream processing system that uses YARN as its execution framework
– Makes use of CGroups support in YARN for CPU isolation
– Uses Kafka as the underlying store
– When a task container finishes or fails, the Samza AM gets a new container from the ResourceManager and relaunches the task, which resumes consuming its Kafka streams.
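The Samza model above pairs an ordered input stream (a Kafka partition) with local per-key state in each task. A toy Python illustration of that shape, not the Samza API (ToyStreamTask and its message format are invented here):

```python
# Miniature stream task: consume messages one at a time from an ordered
# stream, maintaining local state keyed by message key.
class ToyStreamTask:
    def __init__(self):
        self.counts = {}  # stands in for the task's local key-value state

    def process(self, message):
        """Handle one message, updating and returning the per-key count."""
        key = message["key"]
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

task = ToyStreamTask()
for msg in [{"key": "clicks"}, {"key": "views"}, {"key": "clicks"}]:
    task.process(msg)
print(task.counts)  # {'clicks': 2, 'views': 1}
```

Because the stream is durable in Kafka, a relaunched container can rebuild this state by re-reading its partition, which is what makes the failure handling on this slide workable.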
YARN Eco-system
Powered by YARN:
– Apache Giraph: graph processing
– Apache Hama: BSP
– Apache Hadoop MapReduce: batch
– Apache Tez: batch/interactive
– Apache S4, Apache Samza, Apache Storm: stream processing
– Apache Spark: iterative/interactive applications
– Cloudera Llama, DataTorrent, HOYA (HBase on YARN), RedPoint Data Management
YARN utilities/frameworks:
– Weave by Continuuity, REEF by Microsoft, Spring support for Hadoop 2
YARN Best Practices
Best Practices
Use the provided client libraries.
Resource negotiation:
– You may ask, but you may not get what you want immediately.
– Locality requests may not always be met.
– Resources like memory/CPU are guaranteed.
Failure handling:
– Remember, anything can fail (or YARN can pre-empt your containers).
– AM failures are handled by YARN, but container failures must be handled by the application.
Checkpointing:
– Checkpoint AM state for AM recovery.
– If tasks are long-running, checkpoint task state as well.
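The failure-handling and checkpointing advice above can be sketched as a retry loop that resumes from the last checkpoint instead of restarting from scratch. This is a self-contained toy (in-memory "checkpoint store" standing in for durable storage such as HDFS); every name in it is illustrative.

```python
# Resume-from-checkpoint sketch: a failed attempt restarts at the last
# checkpointed position rather than from item zero.
def run_with_checkpoints(items, process, load_checkpoint, save_checkpoint,
                         max_attempts=3):
    for _ in range(max_attempts):
        start = load_checkpoint()  # where the previous attempt stopped
        try:
            for i in range(start, len(items)):
                process(items[i])
                save_checkpoint(i + 1)  # persist progress after each item
            return True
        except RuntimeError:
            continue  # container lost or pre-empted: retry from checkpoint
    return False

state = {"done": 0}        # stand-in for a durable checkpoint store
processed = []
failures = {"left": 1}     # simulate exactly one container failure

def process(item):
    if failures["left"] and item == "b":
        failures["left"] -= 1
        raise RuntimeError("container lost")
    processed.append(item)

ok = run_with_checkpoints(["a", "b", "c"], process,
                          lambda: state["done"],
                          lambda n: state.__setitem__("done", n))
print(ok, processed)  # True ['a', 'b', 'c'] -- "a" is not re-processed
```

Without the checkpoint, the retry would re-run "a" as well; for long-running tasks that difference is the whole point of the slide.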
Best Practices (continued)
Cluster dependencies:
– Try to make zero assumptions about the cluster.
– Your application bundle should deploy everything it requires using YARN's local resources.
– Prefer client-only installs where possible: this simplifies cluster deployment and multi-version support.
Securing your application:
– YARN does not secure communications between the AM and its containers; that is left to the application.
Testing/Debugging Your Application
– MiniYARNCluster: for regression tests
– Unmanaged AM: support for running the AM outside of a YARN cluster, for manual testing
– Logs: log aggregation pushes all logs into HDFS, accessible via the CLI and UI
Future Work in YARN
– ResourceManager high availability and work-preserving restart (work in progress)
– Scheduler enhancements: SLA-driven scheduling, gang scheduling
– Multiple resource types: disk/network/GPUs/affinity
– Rolling upgrades
– Long-running services: better support for services like HBase, service discovery, upgrades without downtime
– More utilities/libraries for application developers: failover/checkpointing
Questions?