Hadoop tutorials

Today's agenda
- Hadoop Introduction and Architecture
- Hadoop Distributed File System
- MapReduce
- Spark
- Cluster Monitoring

Hadoop Introduction

What is Hadoop? (1)
- A framework for large-scale data processing
  - Volume
  - Variety
  - Velocity

What is Hadoop? (2)
- Solution for big data processing
- Sequential data access: a brute-force approach
- Simplified data structures (no relational model)
- Ideal for ad-hoc data analytics
- Instead of clever data lookups with indexing etc., where:
  - Data analytics cases have to be known beforehand
  - Data design is complex
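The contrast between indexed lookups and a brute-force scan can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop code: the records and the field names (user, site, bytes) are made up for the example.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"user": "alice", "site": "A", "bytes": 120},
    {"user": "bob", "site": "B", "bytes": 300},
    {"user": "alice", "site": "B", "bytes": 50},
]

# Index-based approach: the lookup key (user) has to be chosen in advance.
index_by_user = {}
for r in records:
    index_by_user.setdefault(r["user"], []).append(r)
alice_bytes = sum(r["bytes"] for r in index_by_user["alice"])

# Brute-force approach: one sequential pass answers a question nobody planned for.
site_b_bytes = sum(r["bytes"] for r in records if r["site"] == "B")

print(alice_bytes, site_b_bytes)
```

The index only helps for the question it was designed for (per-user lookups), while the full scan can answer any ad-hoc question at the cost of reading everything.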

What is Hadoop? (3)
- Data locality (shared nothing): scales out
[Diagram: Node 1 to Node X, each with its own CPU, memory and disks, joined by an interconnect network]
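A minimal sketch of the data locality idea, assuming a made-up block placement: each task is sent to a node that already stores the block it needs, so the data does not cross the interconnect. The node names, block ids and the least-loaded tie-breaking rule are assumptions for illustration; this is not how YARN actually schedules work.

```python
from collections import defaultdict

# Hypothetical block placement: block id -> nodes holding a replica.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node3", "node1"],
    "blk_4": ["node1", "node3"],
}

def assign_tasks(locations):
    """Give each block's task to the least-loaded node that already stores the block."""
    load = defaultdict(int)  # number of tasks assigned to each node so far
    assignment = {}
    for block, nodes in locations.items():
        target = min(nodes, key=lambda n: load[n])
        assignment[block] = target
        load[target] += 1
    return assignment

assignment = assign_tasks(block_locations)
for block, node in assignment.items():
    print(f"{block} -> {node}")
```

Every task in this sketch runs where its data lives, which is why adding nodes adds CPU, memory and disk bandwidth together.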

What is Hadoop? (4)
- Optimized storage access (for HDD)
- Big data blocks (>= 128 MB)
- Sequential IO instead of random IO
- HDD speed at 7200 rpm:
  - Sequential IO: ~120 MB/s
  - Random IO: 0.5 - 50 MB/s
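To put the throughput figures above in perspective, here is a back-of-the-envelope calculation of how long it takes to read 1 TB. The 1 TB size, the decimal units and the 100-disk case are assumptions chosen for the example; only the MB/s figures come from the slide.

```python
# Back-of-the-envelope scan times using the throughput figures from the slide.
MB_PER_TB = 1_000_000  # decimal units, an assumption made for simplicity

def scan_hours(size_tb: float, throughput_mb_s: float, disks: int = 1) -> float:
    """Hours needed to read size_tb terabytes when `disks` drives read in parallel."""
    seconds = size_tb * MB_PER_TB / (throughput_mb_s * disks)
    return seconds / 3600

sequential = scan_hours(1, 120)            # one disk, sequential IO
random_worst = scan_hours(1, 0.5)          # one disk, worst-case random IO
parallel = scan_hours(1, 120, disks=100)   # hypothetical 100 disks reading in parallel

print(f"1 TB, sequential, 1 disk:    {sequential:.1f} h")
print(f"1 TB, random (0.5 MB/s):     {random_worst:.1f} h")
print(f"1 TB, sequential, 100 disks: {parallel * 3600:.0f} s")
```

Under these assumptions a single disk needs a couple of hours for a sequential scan and weeks at the worst random IO rate, while spreading the same scan over many disks brings it down to about a minute and a half.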

Hadoop ecosystem
- HDFS: Hadoop Distributed File System
- HBase: NoSQL columnar store
- YARN: cluster resource manager
- MapReduce
- Hive: SQL
- Pig: scripting
- Flume: log data collector
- Sqoop: data exchange with RDBMS
- Oozie: workflow manager
- Mahout: machine learning
- ZooKeeper: coordination
- Impala: SQL
- Spark: large-scale data processing

Hadoop cluster architecture
- One master and slaves approach
[Diagram: Node 1 to Node X joined by an interconnect network. Every node runs an HDFS DataNode, a YARN NodeManager, and various component agents and daemons. Some nodes additionally host master services: HDFS NameNode, YARN ResourceManager, Hive metastore.]

What not to use Hadoop for?
- Online Transaction Processing (OLTP) systems
  - No transactions
  - No locks
  - No data updates (only appends and overwrites)
  - Response time in seconds rather than milliseconds
- Not good for systems with relational data
- Interactive applications
- Accounting systems
- Etc.

What to use Hadoop for?
- For Big Data!
  - Storing
  - Analysis
- Write once, read many
- Scale-out system (CPU, IO, RAM), transparent to the users (data placement, data analysis)
- Good for data exploration in a batch fashion: statistics, aggregations, correlation
- Data warehouses
- Logs
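The batch-style aggregation mentioned above can be illustrated with a small map-and-reduce sketch in plain Python. It is not the Hadoop MapReduce API; the log lines and the split into two chunks are invented, standing in for a large log file stored as separate blocks.

```python
from collections import Counter
from functools import reduce

# Hypothetical log records, split into chunks the way a large file is split into blocks.
chunks = [
    ["INFO start", "ERROR disk", "INFO ok"],
    ["WARN slow", "ERROR net", "ERROR disk"],
]

def map_chunk(lines):
    """Map step: count log levels within one chunk, independently of other chunks."""
    return Counter(line.split()[0] for line in lines)

def merge(a, b):
    """Reduce step: combine two partial counts into one."""
    return a + b

totals = reduce(merge, map(map_chunk, chunks), Counter())
print(dict(totals))
```

Because each chunk is counted independently, the map step could run on whichever node holds that chunk, and only the small partial counts need to be combined.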

Hadoop @CERN
- 4 main clusters (provided by DSS and DB)
- lxhadoop: mainly for ATLAS activities (EventIndex, etc.)
  - 16 machines, 256 GB namenode, 12 GB worker nodes
- analytix: general purpose (CASTOR, Dashboards, ITmon..)
  - 20 machines, 256 GB namenode, 64 GB worker nodes
- hadalytic: new SQL-oriented installation (ACCLOG, SCADA)
  - 17 machines, 48 GB - 64 GB
- RAC9: our internal one for testing
  - 16 machines, 24 GB RAM

Summary
- Hadoop is a solution for massive data processing
  - Designed to scale out
  - On commodity hardware
  - Optimized for sequential reads
- Hadoop architecture
  - HDFS is the core
  - Many components with multiple functionalities, spread across the hardware

Questions?