© Hortonworks Inc. 2014 Apache Hadoop 2.0 Migration from 1.0 to 2.0 Vinod Kumar Vavilapalli Hortonworks Inc vinodkv [at] apache.org @tshooter Page 1
© Hortonworks Inc. 2014 Hello! 6.5 Hadoop-years old. Previously at Yahoo!, @Hortonworks now. Last thing at school – a two-node Tomcat cluster. Three months later, first thing on the job, brought down an 800-node cluster ;) Two hats –Hortonworks: Hadoop MapReduce and YARN –Apache: Apache Hadoop YARN lead. Apache Hadoop PMC, Apache Member Worked/working on –YARN, Hadoop MapReduce, HadoopOnDemand, CapacityScheduler, Hadoop security –Apache Ambari: kickstarted the project and its first release –Stinger: high-performance data processing with Hadoop/Hive Lots of random troubleshooting on clusters 99%+ of code in Apache Hadoop Page 2 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Agenda Apache Hadoop 2 Migration Guide for Administrators Migration Guide for Users Summary Page 3 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Apache Hadoop 2 Next Generation Architecture Architecting the Future of Big Data Page 4
© Hortonworks Inc. 2014 Hadoop 1 vs Hadoop 2 Page 5
HADOOP 1.0 – Single Use System: Batch Apps
–HDFS (redundant, reliable storage)
–MapReduce (cluster resource management & data processing)
HADOOP 2.0 – Multi Purpose Platform: Batch, Interactive, Online, Streaming, …
–HDFS2 (redundant, highly-available & reliable storage)
–YARN (cluster resource management)
–MapReduce and others (data processing)
© Hortonworks Inc. 2014 Why Migrate? 2.0 > 2 * 1.0 –HDFS: Lots of ground-breaking features –YARN: Next generation architecture –Beyond MapReduce with Tez, Storm, Spark; in Hadoop! –Did I mention Services like HBase, Accumulo on YARN with HoYA? Return on Investment: 2x throughput on same hardware! Page 6 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Yahoo! On YARN (0.23.x) Moving fast to 2.x Page 7 Architecting the Future of Big Data http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
© Hortonworks Inc. 2014 Twitter Page 8 Architecting the Future of Big Data
© Hortonworks Inc. 2014 HDFS High Availability – NameNode HA Scale further – Federation Time-machine – HDFS Snapshots NFSv3 access to data in HDFS Page 9 Architecting the Future of Big Data
© Hortonworks Inc. 2014 HDFS Contd. Support for multiple storage tiers – Disk, Memory, SSD Finer grained access – ACLs Faster access to data – DataNode Caching Operability – Rolling upgrades Page 10 Architecting the Future of Big Data
© Hortonworks Inc. 2014 YARN: Taking Hadoop Beyond Batch Page 11
Applications run natively in Hadoop, on top of YARN (cluster resource management) and HDFS2 (redundant, reliable storage):
BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …)
Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service.
© Hortonworks Inc. 2014 5 Key Benefits of YARN 1. Scale 2. New Programming Models & Services 3. Improved cluster utilization 4. Agility 5. Beyond Java Page 12
© Hortonworks Inc. 2014 Any catch? I could go on and on about the benefits, but what’s the catch? Nothing major! Major architectural changes, but the impact on user applications and APIs is kept to a minimum –Feature parity –Administrators –End-users Page 13 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Administrators Guide to migrating your clusters to Hadoop-2.x Architecting the Future of Big Data Page 14
© Hortonworks Inc. 2014 New Environment
Hadoop Common, HDFS and MapReduce are installable separately
Env
–HADOOP_HOME is deprecated, but still works
–New environment variables: HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME, HADOOP_YARN_HOME
New Commands
–bin/hadoop works as usual, but some sub-commands are deprecated
–Separate commands for mapred and hdfs
–hdfs dfs -ls
–mapred job -kill
–bin/yarn-daemon.sh etc. for starting YARN daemons
Page 15 Architecting the Future of Big Data
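For instance, several familiar 1.x invocations now have dedicated entry points. A sketch only — it assumes a running cluster, and the job ID shown is illustrative:

```shell
hadoop fs -ls /user          # still works, but deprecated
hdfs dfs -ls /user           # preferred 2.x form

hadoop job -list             # deprecated
mapred job -list             # preferred
mapred job -kill job_1391000000000_0001   # job ID is illustrative

bin/yarn-daemon.sh start resourcemanager  # start YARN daemons individually
bin/yarn-daemon.sh start nodemanager
```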
© Hortonworks Inc. 2014 Wire compatibility Not RPC wire compatible with prior versions of Hadoop Admins cannot mix and match versions Clients must be updated to use the same version of Hadoop client library as the one installed on the cluster. Page 16 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Capacity management
Slots -> dynamic memory-based resources
Total memory on each node
–yarn.nodemanager.resource.memory-mb
Minimum and maximum allocation sizes
–yarn.scheduler.minimum-allocation-mb
–yarn.scheduler.maximum-allocation-mb
MapReduce configs don’t change
–mapreduce.map.memory.mb
–mapreduce.map.java.opts
Page 17 Architecting the Future of Big Data
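Putting those knobs together, a minimal sketch (all values are illustrative, not recommendations):

```xml
<!-- yarn-site.xml: node-level and scheduler-level memory settings -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value> <!-- total memory YARN may allocate on this node -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

<!-- mapred-site.xml: per-task settings, names unchanged from 1.x -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- container size requested for each map task -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value> <!-- JVM heap, kept below the container size -->
</property>
```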
© Hortonworks Inc. 2014 Cluster Schedulers Concepts stay the same –CapacityScheduler: Queues, User-limits –FairScheduler: Pools –Warning: Configuration names now have YARN-isms Key enhancements –Hierarchical Queues for fine-grained control –Multi-resource scheduling (CPU, Memory etc.) –Online administration (add queues, ACLs etc.) –Support for long-lived services (HBase, Accumulo, Storm) (In progress) –Node Labels for fine-grained administrative controls (Future) Page 18 Architecting the Future of Big Data
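As one concrete example of the YARN-ised names, CapacityScheduler hierarchical queues are declared in capacity-scheduler.xml. The queue names and percentages here are made up:

```xml
<!-- capacity-scheduler.xml: a simple two-queue hierarchy (names illustrative) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,adhoc</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value> <!-- percent of cluster resources -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
```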
© Hortonworks Inc. 2014 Configuration Watch those damn knobs!
Should work if you are using the previous configs in Common, HDFS and client-side MapReduce
MapReduce server-side configs are toast
–No migration
–Just use the new configs
Past sins
–From 0.21.x, configuration names changed for better separation of client and server config names
–Cleaning up naming: mapred.job.queue.name → mapreduce.job.queuename
Old user-facing, job-related configs work as before but are deprecated
Configuration mappings exist
Page 19 Architecting the Future of Big Data
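The deprecation mapping means both spellings of a renamed key are honored; the old one just warns. A sketch of the queue-name example:

```xml
<!-- Old 1.x name: still honored via the deprecation mapping, logs a warning -->
<property>
  <name>mapred.job.queue.name</name>
  <value>default</value>
</property>

<!-- Preferred 2.x name: use this going forward -->
<property>
  <name>mapreduce.job.queuename</name>
  <value>default</value>
</property>
```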
© Hortonworks Inc. 2014 Installation/Upgrade Fresh install Upgrading from an existing version Fresh Install –Apache Ambari : Fully automated! –Traditional manual install of RPMs/Tarballs Upgrade –Apache Ambari –Semi automated –Supplies scripts which take care of most things –Manual upgrade Page 20 Architecting the Future of Big Data
© Hortonworks Inc. 2014 HDFS Pre-upgrade
Back up configuration files
Stop users!
Run fsck and fix any errors
–hadoop fsck / -files -blocks -locations > /tmp/dfs-old-fsck-1.log
Capture the complete namespace
–hadoop dfs -lsr / > dfs-old-lsr-1.log
Create a list of DataNodes in the cluster
–hadoop dfsadmin -report > dfs-old-report-1.log
Save the namespace
–hadoop dfsadmin -safemode enter
–hadoop dfsadmin -saveNamespace
Back up NameNode meta-data
–dfs.name.dir/edits
–dfs.name.dir/image/fsimage
–dfs.name.dir/current/fsimage
–dfs.name.dir/current/VERSION
Finalize the state of the filesystem
–hadoop namenode -finalize
Other meta-data backup
–Hive Metastore, HCat, Oozie
–mysqldump
Page 21 Architecting the Future of Big Data
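The checklist above can be sketched as one script. This is a sketch only — it assumes a running 1.x cluster, a superuser shell, writable /tmp, and a placeholder DFS_NAME_DIR; run each step and inspect its output before moving on:

```shell
#!/bin/sh
# Pre-upgrade snapshot of cluster state (Hadoop 1.x commands)
hadoop fsck / -files -blocks -locations > /tmp/dfs-old-fsck-1.log  # health check
hadoop dfs -lsr / > /tmp/dfs-old-lsr-1.log                         # full namespace listing
hadoop dfsadmin -report > /tmp/dfs-old-report-1.log                # DataNode inventory
hadoop dfsadmin -safemode enter                                    # quiesce writes
hadoop dfsadmin -saveNamespace                                     # checkpoint the namespace
# Then copy the NameNode meta-data (edits, fsimage, VERSION) somewhere safe, e.g.:
# tar -czf /backup/namenode-meta.tgz DFS_NAME_DIR     # DFS_NAME_DIR is a placeholder
hadoop namenode -finalize                                          # finalize any previous upgrade
```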
© Hortonworks Inc. 2014 HDFS Upgrade Stop all services Tarballs/RPMs Page 22 Architecting the Future of Big Data
© Hortonworks Inc. 2014 HDFS Post-upgrade
Process liveness
Verify that all is well
–NameNode goes out of safe mode: hdfs dfsadmin -safemode wait
–File-system health
Compare against the pre-upgrade captures
–Node list
–Full namespace
You can start HDFS without finalizing the upgrade. When you are ready to discard your backup, you can finalize the upgrade.
–hadoop dfsadmin -finalizeUpgrade
Page 23 Architecting the Future of Big Data
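A minimal verification pass, mirroring the pre-upgrade captures (a sketch; it assumes the pre-upgrade logs were saved under /tmp as above):

```shell
hdfs dfsadmin -safemode wait                    # block until the NameNode leaves safe mode
hdfs fsck / -files -blocks -locations > /tmp/dfs-new-fsck-1.log
hdfs dfsadmin -report > /tmp/dfs-new-report-1.log
diff /tmp/dfs-old-report-1.log /tmp/dfs-new-report-1.log   # same DataNodes present?
# Only once you are satisfied and no longer need the rollback image:
hdfs dfsadmin -finalizeUpgrade
```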
© Hortonworks Inc. 2014 MapReduce upgrade Ask users to stop their thing Stop the MR sub-system Replace everything Page 24 Architecting the Future of Big Data
© Hortonworks Inc. 2014 HBase Upgrade Tarballs/RPMs HBase 0.95 removed support for HFile v1 –Before the actual upgrade, check whether there are HFiles in the v1 format using HFileV1Detector –/usr/lib/hbase/bin/hbase upgrade -execute Page 25 Architecting the Future of Big Data
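A sketch of the two-step flow, assuming the upgrade tool bundled with HBase 0.96+ and the install path shown on the slide:

```shell
# Detect any remaining HFile v1 files; rewrite them (major compact) before upgrading
/usr/lib/hbase/bin/hbase upgrade -check
# With the cluster down and no v1 files left, upgrade the on-disk layout
/usr/lib/hbase/bin/hbase upgrade -execute
```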
© Hortonworks Inc. 2014 Users Guide to migrating your applications to Hadoop-2.x Architecting the Future of Big Data Page 26
© Hortonworks Inc. 2014 Migrating the Hadoop Stack MapReduce MR Streaming Pipes Pig Hive Oozie Page 27 Architecting the Future of Big Data
© Hortonworks Inc. 2014 MapReduce Applications Binary compatibility of the org.apache.hadoop.mapred APIs –Full binary compatibility for the vast majority of users and applications –Nothing to do! Use the existing MR application jars via bin/hadoop to submit them directly to YARN, with mapreduce.framework.name set to yarn Page 28 Architecting the Future of Big Data
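The one switch that routes job submission to YARN lives in mapred-site.xml:

```xml
<!-- mapred-site.xml: submit to YARN instead of the 1.x JobTracker -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value> <!-- "local" and "classic" are the other accepted values -->
</property>
```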
© Hortonworks Inc. 2014 MapReduce Applications contd. Source compatibility of the org.apache.hadoop.mapreduce API –Affects a minority of users –It proved difficult to ensure full binary compatibility for these existing applications Existing applications using the mapreduce APIs are source compatible –They can run on YARN with no code changes; recompilation only Page 29 Architecting the Future of Big Data
© Hortonworks Inc. 2014 MapReduce Applications contd. MR Streaming applications –work without any changes Pipes applications –will need recompilation Page 30 Architecting the Future of Big Data
© Hortonworks Inc. 2014 MapReduce Applications contd. Examples –Can run with minor tricks Benchmarks –To compare 1.x vs 2.x Things to do –Play with YARN –Compare performance Page 31 Architecting the Future of Big Data http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/
© Hortonworks Inc. 2014 MapReduce feature parity Setup and cleanup tasks are no longer separate tasks –And we dropped the optionality (which was a hack anyway) JobHistory –The JobHistory file format changed to an Avro/JSON-based one –Rumen automatically recognizes the new format –Parsing history files yourselves? You need to move to the new parsers. Page 32 Architecting the Future of Big Data
© Hortonworks Inc. 2014 User logs Putting user logs on DFS –AM logs too! –While the job is running, logs are on the individual nodes –After that, on DFS Pretty-printers and parsers are provided for the various log files – syslog, stdout, stderr User-log directories with quotas, beyond the users’ current directories Logs expire after a month by default and get GCed Page 33 Architecting the Future of Big Data
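Log aggregation onto DFS is driven by a few yarn-site.xml knobs; a sketch with illustrative values (the retention value matches the one-month default expiry mentioned above):

```xml
<!-- yarn-site.xml: aggregate finished containers' logs onto HDFS -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value> <!-- illustrative HDFS path -->
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value> <!-- ~30 days -->
</property>
```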
© Hortonworks Inc. 2014 Application recovery No more lost applications on master restart! –Applications do not lose previously completed work –If the AM crashes, the RM will restart it from where it stopped –Applications can (WIP) continue to run while the RM is down –No need to resubmit if the RM restarts Specifically for MR jobs –Changes to the semantics of OutputCommitter –We fixed FileOutputCommitter, but if you have your own OutputCommitter, you need to take care of application recoverability yourself Page 34 Architecting the Future of Big Data
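RM-side recovery is opt-in via yarn-site.xml in 2.x releases that ship the feature; a sketch using the filesystem-backed state store:

```xml
<!-- yarn-site.xml: preserve application state across ResourceManager restarts -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
```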
© Hortonworks Inc. 2014 JARs No single hadoop-core jar –Common, hdfs and mapred jars are separated Projects are completely mavenized, and YARN has separate jars for API, client and server code –Good: you don’t link against server-side code anymore Some jars like avro, jackson etc. are upgraded to later versions –If they have compatibility problems, you will have them too –You can override that behavior by putting your jars first in the classpath Page 35 Architecting the Future of Big Data
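One way to put your jars first is the per-job classpath-precedence switch; a sketch (set it per job or in mapred-site.xml):

```xml
<!-- Prefer the job's own jars over Hadoop's bundled dependency versions -->
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```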
© Hortonworks Inc. 2014 More features Uber AM –Run small jobs inside the AM itself –No need to launch separate tasks –Seamless: the JobClient automatically determines whether a job is small Speculative tasks –Were not enabled by default in 1.x –Much better and fully supported in 2.x No JVM reuse: feature dropped Netty-based zero-copy shuffle MiniMRCluster → MiniMRYarnCluster Page 36 Architecting the Future of Big Data
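Uber-mode is controlled by a few job properties; a sketch with illustrative thresholds (jobs at or under them run entirely inside the AM):

```xml
<!-- mapred-site.xml or per-job: let small jobs run inside the MRAppMaster JVM -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value> <!-- illustrative threshold -->
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```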
© Hortonworks Inc. 2014 Web UI Web UIs completely overhauled –Rave reviews ;) –And some rotten tomatoes too Functional improvements –Capability to sort tables by one or more columns –Filter rows incrementally in “real time” Any user applications or tools that depend on the Web UI and extract data by screen-scraping will cease to function –Use the web services instead! The AM web UI, History Server UI and RM UI work together Page 37 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Apache Pig One of the two major data processing applications in the Hadoop ecosystem Existing Pig scripts that work with Pig 0.10.1 and beyond will work just fine on top of YARN! Versions prior to Pig 0.10.1 may not run directly on YARN –Please accept my sincere condolences! Page 38 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Apache Hive Queries on Hive 0.10.0 and beyond will work without changes on top of YARN! Hive 0.13 & beyond: Apache TEZ!! –Interactive SQL queries at scale! –Hive + Stinger: Petabyte Scale SQL, in Hadoop – Alan Gates & Owen O’Malley 1.30pm Thu (2/13) at Ballroom F Page 39 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Apache Oozie Existing oozie workflows can start taking advantage of YARN in 0.23 and 2.x with Oozie 3.2.0 and above ! Page 40 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Cascading & Scalding Cascading 2.5 - Just works, certified! Scalding too! Page 41 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Beyond upgrade Where do I go from here? Architecting the Future of Big Data Page 42
© Hortonworks Inc. 2014 YARN Eco-system Page 43
Applications Powered by YARN
–Apache Giraph – graph processing
–Apache Hama – BSP
–Apache Hadoop MapReduce – batch
–Apache Tez – batch/interactive
–Apache S4 – stream processing
–Apache Samza – stream processing
–Apache Storm – stream processing
–Apache Spark – iterative applications
–Elastic Search – scalable search
–Cloudera Llama – Impala on YARN
–DataTorrent – data analysis
–HOYA – HBase on YARN
Frameworks Powered By YARN
–Apache Twill
–REEF by Microsoft
–Spring support for Hadoop 2
© Hortonworks Inc. 2014 Summary Page 44 Architecting the Future of Big Data Apache Hadoop 2 is, at least, twice as good! –No, seriously! Exciting journey with Hadoop for this decade… –Hadoop is no longer just HDFS & MapReduce Architecture for the future –Centralized data and varied applications –Possibility of exciting new applications and types of workloads Admins –A bit of work End-users –Mostly, things should just work as is
© Hortonworks Inc. 2014 YARN Book coming soon! Page 45 Architecting the Future of Big Data
© Hortonworks Inc. 2014 Thank you! Page 46 Download Sandbox: Experience Apache Hadoop – Both 2.x and 1.x Versions Available! http://hortonworks.com/products/hortonworks-sandbox/ Questions?