
1 Apache Hadoop – Virtualization
Winter 2015, Version 1.4
Hortonworks. We do Hadoop.
© Hortonworks Inc. 2011 – 2014. All Rights Reserved

2 Contents
HWX Overview
Fundamentals
VMware Extensions
Virtualization Best Practices

3 HDP delivers a comprehensive data management platform
[Slide diagram: the Hortonworks Data Platform 2.2 stack – HDFS (Hadoop Distributed File System) as storage; YARN as the Data Operating System (cluster resource management); batch, interactive and real-time data access engines (Pig, Hive on Tez, Cascading, HBase, Accumulo, Storm, Solr, Spark, and ISV engines via Slider); governance and data workflow (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); security (authentication, authorization, accounting, data protection across HDFS, YARN, Hive, Falcon, Knox, Ranger); operations (Ambari, ZooKeeper, Oozie); deployment choice of Linux or Windows, on-premises or cloud.]
YARN is the architectural center of HDP:
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered completely in the open

4 HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation.
[Slide diagram: component-version matrix across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014), covering Hadoop & YARN, Tez, Slider, Pig, Hive & HCatalog, HBase, Phoenix, Accumulo, Storm, Solr, Spark, Falcon, Sqoop, Flume, Kafka, Knox, Ranger, Oozie, ZooKeeper and Ambari, grouped into Data Management, Data Access, Governance & Integration, Security, and Operations.]
* Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process

5 Apache HDFS – Hadoop Distributed File System
Very large scale distributed file system: 10K nodes, tens of millions of files, and petabytes of data
Supports large files
Designed to run on commodity hardware; assumes hardware failures
Files are replicated to handle hardware failure
Detects failures and recovers from them automatically
Optimized for batch processing: data locations are exposed so that computations can move to where the data resides
Data coherency: write-once, read-many access pattern; appending is supported for existing files
Files are broken up into chunks called 'blocks'
Blocks are distributed over nodes
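The file-to-block chunking described above can be sketched in a few lines. This is an illustrative model only, not HDFS client code; the helper name is invented, and the 128 MB figure is the HDFS 2.x default block size.

```python
# Illustrative sketch only -- not HDFS client code. Models how HDFS
# chops a file into fixed-size blocks (HDFS 2.x defaults to 128 MB).
BLOCK_SIZE = 128 * 1024 * 1024

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file splits into three blocks: 128 MB + 128 MB + 44 MB;
# each block is then replicated across different DataNodes.
```

Because placement works at block granularity, a file larger than any single disk can still be stored, and different blocks of the same file can be read in parallel from different nodes.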

6 HDFS: Key Services
NameNode (master service)
Manages the file system namespace
Single service across the cluster (HA can be enabled)
Regulates access to files by clients
Maps a file name to a set of blocks
Maps a block to the DataNode where it resides
Replication engine for blocks
DataNode (slave service; runs on slave nodes)
Block server: manages block reads/writes for HDFS, stores data in the local file system
Periodically sends a report of all existing blocks to the NameNode
Pings the NameNode for instructions
If the heartbeat fails, the DataNode is removed from the cluster and its blocks are re-replicated from the surviving replicas
Standby NameNode
Merges the NameNode's file system image and edit logs
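A minimal sketch of the NameNode bookkeeping described above, assuming a toy in-memory model (all names here are hypothetical; the real NameNode in Apache Hadoop is far more involved):

```python
class ToyNameNode:
    """Toy model of NameNode metadata: file -> blocks, block -> DataNodes,
    plus heartbeat tracking. Illustrative only; not the real Hadoop API."""

    def __init__(self, heartbeat_timeout=30.0):
        self.file_blocks = {}      # file name -> list of block ids
        self.block_locations = {}  # block id -> set of DataNode ids
        self.last_heartbeat = {}   # DataNode id -> last heartbeat time
        self.timeout = heartbeat_timeout

    def record_heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def expire_dead_nodes(self, now):
        """Drop DataNodes whose heartbeat is stale. Their blocks stay
        readable from the surviving replicas, which the real NameNode
        would then re-replicate to restore the replication factor."""
        dead = [dn for dn, t in self.last_heartbeat.items()
                if now - t > self.timeout]
        for dn in dead:
            del self.last_heartbeat[dn]
            for replicas in self.block_locations.values():
                replicas.discard(dn)
        return dead
```

The key point the model captures: block data never flows through the NameNode; it only tracks which DataNodes hold which blocks and reacts when heartbeats stop.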

7 HDFS Architecture (Master-Slave)

8 Cluster Topology
HDFS Client
Master services: NameNode, Secondary NameNode, ResourceManager, HBase Master, other master services
Slave services: DataNode, NodeManager, RegionServer
[Slide diagram: master services on dedicated nodes; slave services co-located on the worker nodes of each rack.]

9 HDFS: File create lifecycle
[Slide diagram: an HDFS client writing a file with blocks B1 and B2 across RACK1, RACK2 and RACK3: (1) the client sends a create request to the NameNode, (2) it streams each block to a pipeline of DataNodes across the racks, (3) acks flow back to the client, (4) the client sends complete to the NameNode.]
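The create lifecycle on this slide can be modeled as a toy event sequence (purely illustrative; the function and event names are invented, not HDFS APIs):

```python
# Toy walk-through of the HDFS create lifecycle -- invented names,
# not real HDFS APIs.
def create_file(filename, blocks, pipeline):
    """Simulate: create -> stream each block through the DataNode
    pipeline -> collect the ack -> complete. Returns the event log."""
    events = [("create", filename)]                  # 1. client -> NameNode
    for block in blocks:
        for datanode in pipeline:                    # 2. block streamed DN to DN
            events.append(("write", block, datanode))
        events.append(("ack", block))                # 3. ack returns to client
    events.append(("complete", filename))            # 4. client -> NameNode
    return events

log = create_file("FILE", ["B1", "B2"], ["dn1", "dn2", "dn3"])
```

Note the ordering the model enforces: the client does not start streaming B2 until B1 has been fully written and acknowledged, and the NameNode only marks the file complete at the end.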

10 HDFS
Distributed file system designed to run on commodity hardware.
Key assumptions:
Hardware failure is the norm
Streaming access to data sets is needed
Optimized for high throughput
Data sets are large
Append-only file system: write once, read many times
Moving computation is cheaper than moving data

11 Hadoop at Scale
Yahoo – 34,000 nodes, 478 PB
eBay – 10,000 nodes, 150 PB
LinkedIn – 5,000 nodes
Twitter – 3,500 nodes, 30 to 50 PB
Spotify – 700 nodes, 15 PB of data
Facebook – thousands of nodes

12 Problems
Companies have been focused on virtualization for over a decade
Provisioning bare-metal servers can be difficult and lengthy
Datacenter space is at a premium
Companies have instituted virtualization mandates

13 When to Virtualize
Master nodes are good virtualization candidates due to their HA requirements
Virtualize development environments
Virtualize pilot or POC environments, but caveat emptor
Cloud computing, or when you need compute to be elastic and separate from storage

14 Hadoop Virtualization Extension
The existing replica placement policy:
- Multiple replicas are not placed on the same node
- 1st replica is on the local node of the writer
- 2nd replica is on a remote rack from the 1st replica
- 3rd replica is on the same rack as the 2nd replica
- Remaining replicas are placed randomly across racks to meet minimum restrictions
With awareness of node groups, the extended replica placement policy:
- Multiple replicas are not placed on the same node or on nodes under the same node group
- 1st replica is on the local node or local node group of the writer
- 2nd replica is on a remote rack from the 1st replica
- 3rd replica is on the same rack as the 2nd replica
- Remaining replicas are placed randomly across racks and node groups to meet minimum restrictions
http://www.vmware.com/files/pdf/Hadoop-Virtualization-Extensions-on-VMware-vSphere-5.pdf
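The extended policy above can be sketched as a small placement function. This is an illustrative model of the rules, not the actual HVE code; the `(rack, node_group)` topology encoding and deterministic candidate order are assumptions made for the sketch.

```python
# Illustrative model of node-group-aware replica placement -- not the
# actual HVE implementation. topology maps node -> (rack, node_group);
# in a virtualized cluster, VMs on the same hypervisor share a node group.
def place_replicas(topology, writer):
    rack = lambda n: topology[n][0]
    group = lambda n: topology[n][1]
    chosen = [writer]  # 1st replica: local node of the writer

    def allowed(n):
        # never reuse a node, or a node group that already holds a replica
        return n not in chosen and all(group(n) != group(c) for c in chosen)

    # 2nd replica: a node on a remote rack from the 1st replica
    chosen.append(next(n for n in topology
                       if rack(n) != rack(writer) and allowed(n)))
    # 3rd replica: same rack as the 2nd, but a different node group
    chosen.append(next(n for n in topology
                       if rack(n) == rack(chosen[1]) and allowed(n)))
    return chosen
```

The node-group constraint is the whole point of HVE: since no two replicas ever land in the same node group, the failure of one physical host (which takes down every VM in its group) can never destroy two replicas of the same block.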

15 Virtualization Best Practices
Use the latest chip generations that support a hardware-assisted Memory Management Unit (MMU)
Memory reservations should be large enough to avoid kernel swapping between ESX and the guest OS
Do not disable the balloon driver
Use as few vCPUs as possible and enable hyper-threading on Intel Core i7 processors
For high I/O load, use multiple virtual SCSI controllers; separate I/O traffic types onto their own SCSI controllers
Make sure timekeeping is properly set in virtual machines

16 Next Steps
Refer to vendor documentation, e.g. http://www.vmware.com/files/pdf/Hadoop-Virtualization-Extensions-on-VMware-vSphere-5.pdf
Virtualize pilot or POC environments, but caveat emptor
Consider cloud computing when you need compute to be elastic and separate from storage
Wait… technology is changing and engineering efforts are underway (JIRA HADOOP-8468)

