Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data.


Hardware Considerations
Original design idea: Hadoop is designed and developed to run on ordinary commodity hardware.
Actual production environment: large production Hadoop clusters do require proper hardware infrastructure planning to achieve minimal latency when storing and processing large volumes of data. The hardware architecture should be carefully designed around the nature of the data, the jobs that will be run, and the agreed SLA.

Hardware Considerations
Key aspects of Hadoop hardware infrastructure:
– Servers (Name Node, Job Tracker, Data Node)
– Racks
– Network switch
– Storage
– Backup
– Number of copies of data

Name Node & Job Tracker Servers
RAM should be sized based on the following:
– Number of data nodes in the cluster
– Approximate number of blocks that will be stored in the cluster
– Number of different Hadoop processes that run on the machine
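The block count drives Name Node memory more than anything else, since the entire namespace is held in RAM. A back-of-the-envelope sketch, using the commonly cited rule of thumb of roughly 150 bytes of heap per namespace object (file, directory, or block); the figure and workload numbers below are illustrative assumptions, not a sizing guarantee:

```python
# Rough Name Node heap estimate. The ~150 bytes/object figure is a
# widely quoted rule of thumb, not an exact measurement.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, avg_blocks_per_file):
    """Estimate Name Node heap (GB) for a given namespace size."""
    # Each file is one namespace object, plus one object per block.
    objects = num_files + num_files * avg_blocks_per_file
    return objects * BYTES_PER_OBJECT / 1024**3

# e.g. 10 million files averaging 2 blocks each -> ~4.19 GB of heap
print(round(namenode_heap_gb(10_000_000, 2), 2))
```

This is why many small files are harder on the Name Node than a few large ones: the heap grows with object count, not with stored bytes.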

I/O adapters – not a critical element, as the Name Node does not participate in bulk data transfers.
Processor – multi-core processors are the minimum requirement.
Standby Node server – should have the same capacity as the primary Name Node.

Data Node Servers
RAM should be sized based on the following:
– Approximate number of blocks that will be stored on the node
– Number of different Hadoop processes that run on the machine

I/O adapters – a high-throughput I/O adapter is needed.
Processor – multi-core, multi-processor systems are needed for parallel execution of more than one MapReduce task.
Virtualization is not recommended.

Racks
Hadoop is rack-aware: configure Hadoop with each node's rack information. Servers should be distributed across at least two racks to prevent data loss from a rack failure.
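Rack information is supplied to Hadoop through a user-provided topology script (the `net.topology.script.file.name` property in Hadoop 2.x): Hadoop invokes the script with one or more host names or IPs and reads one rack path per line from stdout. A minimal sketch; the subnet-to-rack mapping below is a made-up example for illustration:

```python
#!/usr/bin/env python3
# Minimal Hadoop topology script. Hadoop passes hosts as arguments
# and expects one rack path per host on stdout.
import sys

# Hypothetical subnet-to-rack mapping; adapt to your own network.
RACKS = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(host):
    """Map a host/IP to a rack path by longest known subnet prefix."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Without such a script every node falls into `/default-rack`, and the rack-failure protection described above is silently lost.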

Hadoop automatically replicates blocks across servers in two different racks. Servers located in the same rack have lower data-transfer latency, because all transfers occur via the rack's network switch.

Network Switch
It is recommended to run the Hadoop cluster on a separate private network. Both the core and rack switches should support high-bandwidth, full-duplex data transfer. Higher-capacity core and rack switches will be required if the number of data copies exceeds the standard 3.
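One way to reason about switch capacity is the oversubscription ratio of a top-of-rack switch: the aggregate bandwidth of the servers behind it divided by its uplink capacity. A small sketch; the node counts and link speeds are illustrative assumptions, not recommendations:

```python
# Rack-switch oversubscription: aggregate server bandwidth behind a
# top-of-rack switch divided by its uplink capacity to the core.
def oversubscription(nodes_per_rack, nic_gbps, uplink_gbps):
    return nodes_per_rack * nic_gbps / uplink_gbps

# 20 data nodes with 1 Gbps NICs behind a 10 Gbps uplink -> 2.0 (2:1)
print(oversubscription(20, 1, 10))
```

Higher replication factors push more cross-rack traffic onto the uplink, which is why they call for larger core and rack switches.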

Storage
Locally attached storage provides better performance than NFS or SAN storage. Hard disks with higher RPM provide better read/write throughput. Several smaller-capacity hard disks should be used instead of a single large-capacity disk: this allows concurrent reads/writes and reduces disk-level bottlenecks.
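The many-small-disks advice follows from simple arithmetic: sequential throughput adds up across independent spindles, while a single large disk caps out at one head's rate. A sketch with illustrative, assumed throughput figures:

```python
# Aggregate sequential throughput scales with spindle count, since
# each disk has its own independent read/write head.
def aggregate_mb_per_s(disks, per_disk_mb_per_s):
    return disks * per_disk_mb_per_s

one_big    = aggregate_mb_per_s(1, 150)  # one 12 TB disk (assumed 150 MB/s)
many_small = aggregate_mb_per_s(6, 150)  # six 2 TB disks, same total capacity
print(one_big, many_small)               # 150 vs 900 MB/s
```

The same capacity spread over six disks also lets six map tasks read concurrently without seeking against each other.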

Name Node and Job Tracker servers should use a highly fault-tolerant RAID configuration. Data Node RAID configuration is less critical, since the data is already replicated across multiple servers. Using SSDs will improve performance drastically, at the expense of higher setup cost.

Backup
Name Node data is the most critical information and needs the most frequent backup. Name Node data is regularly streamed to a standby node so that Hadoop cluster operation can be restored in case of a primary Name Node failure.

Another backup server is recommended to perform periodic backups and checksum verification of Name Node data. Data Node backup requirements depend on the criticality and availability of the data; frequent backups are not needed, and a regular backup serves only to recover from a data-center-level failure.
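The checksum-verification step mentioned above can be as simple as comparing a stored digest against a freshly computed one for each metadata backup. A minimal sketch; the choice of SHA-256 and the chunked-read helper are illustrative, not part of any Hadoop tool:

```python
# Verify a backed-up file by recomputing its digest and comparing it
# to the digest recorded when the backup was taken.
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(path, expected_digest):
    return sha256_of(path) == expected_digest
```

Running this periodically catches silent corruption of the backup itself, which a simple copy job would never notice.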

Number of Copies of Data
The following determine the number of copies of a block:
– Criticality of the data
– Number of concurrent MapReduce jobs that will be executed on the data set
More replicas allow more jobs to run concurrently on the same data set, and reduce job execution time, since the required data is most often available locally, or at least within the same rack.

– Be aware that more replicas reduce the write performance of the Hadoop cluster.
– Raising the replication factor only for the most frequently used data provides the maximum benefit, rather than raising the general replication factor across the cluster.
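The trade-off on this slide can be quantified: each extra replica multiplies both the raw disk consumed and the bytes every write must push across the cluster. A sketch with assumed figures, comparing a cluster-wide replication factor against raising it only for a "hot" subset of the data:

```python
# Raw storage cost of replication: logical data size times the
# replication factor. All sizes below are illustrative.
def raw_storage_tb(logical_tb, replication):
    return logical_tb * replication

# 100 TB of logical data, everything at replication factor 5:
uniform = raw_storage_tb(100, 5)                      # 500 TB raw

# Same 100 TB, but only the 10 TB of hot data at 5, the rest at 3:
targeted = raw_storage_tb(10, 5) + raw_storage_tb(90, 3)  # 320 TB raw
print(uniform, targeted)
```

Targeted replication buys most of the concurrency and locality benefit for the data that is actually contended, at roughly the cost of the standard factor of 3 elsewhere.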