Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Hadoop - Virtualization Winter 2015 Version 1.4 Hortonworks. We do Hadoop.

Presentation transcript:

Page 2: Contents
- HWX Overview
- Fundamentals
- VMware Extensions
- Virtualization Best Practices

Page 3: HDP delivers a comprehensive data management platform
- YARN is the architectural center of HDP
- Enables batch, interactive and real-time workloads
- Provides comprehensive enterprise capabilities
- The widest range of deployment options
- Delivered completely in the open

[Diagram: Hortonworks Data Platform 2.2]
- Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV engines), with Tez and Slider
- YARN: Data Operating System (Cluster Resource Management)
- HDFS (Hadoop Distributed File System)
- Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
- Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
- Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
- Deployment choice: Linux, Windows, On-Premises, Cloud

Page 4: HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation.

[Diagram: Hortonworks Data Platform 2.2 component releases]
- Areas: Data Management, Data Access, Governance & Integration, Security, Operations
- Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Solr, Tez, Slider 0.60
- Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)

* Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process.

Page 5: Apache HDFS – Hadoop Distributed File System
- Very large scale distributed file system
  - 10K nodes, tens of millions of files and petabytes of data
  - Supports large files
- Designed to run on commodity hardware; assumes hardware failures
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them automatically
- Optimized for batch processing
  - Data locations are exposed so that computations can move to where the data resides
- Data coherency
  - Write-once, read-many-times access pattern
  - Appending is supported for existing files
- Files are broken up into chunks called "blocks"
  - Blocks are distributed over nodes
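
The block idea on this slide can be illustrated with a minimal sketch. It is not HDFS code: the function names are hypothetical, the 128 MiB block size is only an assumed value (block size is configurable), and the round-robin spread stands in for the real placement policy, which also handles replication.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB); configurable in practice


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def spread_over_nodes(blocks, nodes):
    """Toy distribution: assign block i to node i modulo the node count.

    This is a simplification for illustration only; it ignores replicas
    and rack awareness.
    """
    return {index: nodes[index % len(nodes)] for index in range(len(blocks))}


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)
    for index, node in spread_over_nodes(blocks, ["dn1", "dn2"]).items():
        print(index, blocks[index], node)
```

Because each block is an independent unit, a large file can be stored and read across many machines in parallel.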

Page 6: HDFS: Key Services
NameNode
- Master service
- Manages the file system namespace
- Single service across the cluster (HA can be enabled)
- Regulates access to files by clients
- Maps a file name to a set of blocks
- Maps a block to the DataNodes where it resides
- Replication engine for blocks
DataNode
- Slave service; runs on slave nodes
- Block server: manages block reads and writes for HDFS and stores data in the local file system
- Periodically sends a report of all existing blocks to the NameNode
- Pings the NameNode for instructions
- If the heartbeat fails, the DataNode is removed from the cluster and the replicas of its blocks on other nodes take over
Standby NameNode
- Merges the NameNode's file system image and edit logs
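
The NameNode bookkeeping described on this slide can be sketched as two maps plus a heartbeat table. This is a toy model under stated assumptions, not the actual NameNode implementation: the class name, method names, timeout and replication values are all hypothetical.

```python
from collections import defaultdict


class ToyNameNode:
    """Illustrative model of NameNode metadata; all names are hypothetical."""

    def __init__(self, replication=3, heartbeat_timeout=30.0):
        self.replication = replication
        self.heartbeat_timeout = heartbeat_timeout  # seconds, assumed value
        self.file_blocks = {}                       # file name -> list of block ids
        self.block_locations = defaultdict(set)     # block id -> set of DataNode ids
        self.last_heartbeat = {}                    # DataNode id -> time last seen

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def block_report(self, node, block_ids, now):
        """A DataNode reports every block it currently stores."""
        self.heartbeat(node, now)
        for block_id in block_ids:
            self.block_locations[block_id].add(node)

    def expire_dead_nodes(self, now):
        """Drop DataNodes whose heartbeat is overdue and forget their replicas."""
        dead = [node for node, seen in self.last_heartbeat.items()
                if now - seen > self.heartbeat_timeout]
        for node in dead:
            del self.last_heartbeat[node]
            for locations in self.block_locations.values():
                locations.discard(node)
        return dead

    def under_replicated(self):
        """Blocks with fewer live replicas than the target; these need copying."""
        return sorted(block_id
                      for blocks in self.file_blocks.values()
                      for block_id in blocks
                      if len(self.block_locations[block_id]) < self.replication)

    def locate(self, name):
        """Resolve a file name to its blocks and the DataNodes holding each one."""
        return [(block_id, sorted(self.block_locations[block_id]))
                for block_id in self.file_blocks[name]]
```

In this model a client would call `locate` to find where to read each block, and the replication engine would periodically copy whatever `under_replicated` returns to other live DataNodes.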

Page 7: HDFS Architecture (Master-Slave)

Page 8: Cluster Topology
[Diagram: an HDFS client and two racks of nodes]
- Master services: NameNode, Resource Manager, HBase Master, etc.
- Slave services: DataNode, NodeManager, Region Server
- Rack contents shown: NameNode, Secondary NameNode, other master services, DataNodes

Page 9: HDFS: File create lifecycle
[Diagram labels: NameNode; RACK1, RACK2, RACK3; FILE with blocks B1 and B2; HDFS CLIENT; steps Create, ack, Complete]

Page 10: HDFS
Distributed file system designed to run on commodity hardware.
Key assumptions:
- Hardware failure is the norm
- Need streaming access to data sets
- Optimized for high throughput
- Data sets are large
- Append-only file system; write once, read many times
- Moving computation is cheaper than moving data

Page 11: Hadoop at Scale
- Yahoo – nodes, 478 PB
- eBay – nodes, 150 PB
- LinkedIn – 5000 nodes
- Twitter – 3500 nodes, 30 to 50 PB
- Spotify – 700 nodes, 15 PB of data
- Facebook – thousands

Page 12: Problems
- Companies have been focused on virtualizing for over a decade
- Provisioning bare metal servers can be difficult and lengthy
- Datacenter space is at a premium
- Companies have instituted virtualization mandates

Page 13: When to Virtualize
- Master nodes are good virtualization candidates due to the HA requirements
- Virtualize development environments
- Virtualize pilot or POC environments, but caveat emptor
- Cloud computing, or when you need compute to be elastic and separate from storage

Page 14: Hadoop Virtualization Extension
The existing replica placement policy includes:
- Multiple replicas are not placed on the same node
- 1st replica is on the local node of the writer
- 2nd replica is on a remote rack of the 1st replica
- 3rd replica is on the same rack as the 2nd replica
- Remaining replicas are placed randomly across racks to meet the minimum restriction

With awareness of the node group, the extended replica placement policy includes:
- Multiple replicas are not placed on the same node or on nodes under the same node group
- 1st replica is on the local node or local node group of the writer
- 2nd replica is on a remote rack of the 1st replica
- 3rd replica is on the same rack as the 2nd replica
- Remaining replicas are placed randomly across racks and node groups to meet the minimum restriction
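
The extended policy can be sketched as follows. This is an illustrative model only, not Hadoop's placement code: `Node`, `choose_targets` and the fallback behaviour when a rack preference cannot be met are assumptions, the writer is assumed to run on a DataNode, and node group names are assumed to be unique across the cluster (for example, one node group per physical host running several VMs).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    node_group: str  # assumed globally unique, e.g. one per physical host


def choose_targets(writer, nodes, replication=3, rng=random):
    """Pick replica targets following the node-group-aware rules on the slide."""
    chosen = [writer]  # 1st replica: the writer's local node

    def pick(rack_ok):
        # Never reuse a node group (this also rules out reusing a node).
        used_groups = {node.node_group for node in chosen}
        free = [node for node in nodes if node.node_group not in used_groups]
        preferred = [node for node in free if rack_ok(node)]
        # Assumed fallback: relax the rack preference, keep the node group rule.
        candidates = preferred or free
        if not candidates:
            return False
        chosen.append(rng.choice(candidates))
        return True

    # 2nd replica: a rack different from the 1st replica's rack.
    if replication >= 2 and pick(lambda node: node.rack != writer.rack):
        second = chosen[1]
        # 3rd replica: same rack as the 2nd, but a different node group.
        if replication >= 3:
            pick(lambda node: node.rack == second.rack)

    # Remaining replicas: anywhere, still one replica per node group.
    while len(chosen) < replication and pick(lambda node: True):
        pass
    return chosen[:replication]
```

The point of the node group rule is that two VMs on the same physical host share a failure domain, so placing two replicas there would not protect against the loss of that host.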

Page 15: Virtualization Best Practices
- Use the latest chip generations that support a hardware-assisted Memory Management Unit (MMU)
- Memory reservations should be large enough to avoid kernel swapping between ESX and the guest OS
- Do not disable the balloon driver
- Use as few v-CPUs as possible, and enable hyperthreading on Intel Core i7 processors
- For high I/O load, use multiple v-SCSI controllers; separate I/O traffic types and put each on its own SCSI controller
- Make sure timekeeping is properly set in virtual machines

Page 16: Next Steps
- Refer to vendor documentation
- Virtualize pilot or POC environments, but caveat emptor
- Consider cloud computing when you need compute to be elastic and separate from storage
- Wait… technology is changing and engineering efforts are underway (JIRA HADOOP-8468)