INTRODUCTION TO HADOOP. Dr. G Sudha Sadhasivam, Professor, CSE, PSG College of Technology, Coimbatore

Contents: Distributed systems; DFS; Hadoop and why it is needed; issues; mutation order / leases

Operating systems Operating system - software that supervises and controls tasks on a computer. Individual OS types: –Batch processing → jobs are collected and placed in a queue; there is no interaction with a job during processing. –Time shared → computing resources are shared among different users, who can interact with their programs during execution. –Real-time systems → fast response, can be interrupted.

Distributed Systems Consist of a number of computers that are connected and managed so that they automatically share the job processing load among the constituent computers. A distributed operating system is one that appears to its users as a traditional uniprocessor system, even though it is actually composed of multiple processors. It gives a single system view to its users and provides a single service. The location of files is transparent to users, and it provides a virtual computing environment. Examples: the Internet, ATM banking networks, mobile computing networks, Global Positioning Systems and Air Traffic Control. A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE COHERENT SYSTEM.

Network Operating System In a network operating system the users are aware of the existence of multiple computers. The operating system of each computer must provide facilities for communication with the others. Each machine runs its own OS and has its own users. Remote login and file access are explicit. It is less transparent, but gives more independence. Figure: layered comparison of a distributed OS (applications over shared distributed operating system services) versus a networked OS (applications over each machine's own network OS).

DFS Resource sharing is the motivation behind distributed systems; to share files we need a file system. A file system is responsible for the organization, storage, retrieval, naming, sharing, and protection of files. It controls access to the data and performs low-level operations such as buffering frequently used data and issuing disk I/O requests. The goal of a distributed file system is to allow users of physically distributed computers to share data and storage resources by using a common file system.

Hadoop What is Hadoop? It is a framework for running applications on large clusters of commodity hardware, for storing the huge data they produce and processing it. An Apache Software Foundation project, open source; it runs on clusters such as Amazon's EC2, and the alpha (0.18) release is available for download. Hadoop includes: HDFS – a distributed filesystem; Map/Reduce – the programming model Hadoop implements on top of HDFS, an offline (batch) computing engine. Key concept: moving computation is more efficient than moving large data.

Data-intensive applications work with petabytes of data. The web: 20+ billion web pages x 20 KB each = 400+ terabytes. –One computer reading from disk at typical rates (a few tens of MB/sec) would need about four months to read the web –the same problem with 1000 machines takes < 3 hours. Difficulties with a large number of machines: –communication and coordination –recovering from machine failure –status reporting –debugging –optimization –locality

FACTS Single-thread performance doesn't matter much: we have large problems, and total throughput per price is more important than peak performance. Stuff breaks, so we need more reliability: if you have one server it may stay up three years (1,000 days); if you have 10,000 servers, expect to lose ten a day. "Ultra-reliable" hardware doesn't really help: at large scale, super-fancy reliable hardware still fails, albeit less often – software still needs to be fault-tolerant – and commodity machines without fancy hardware give better performance per price. DECISION: COMMODITY HARDWARE. WHICH DFS? WHY HADOOP? WHAT SOFTWARE MODEL?

HDFS Why? Seek vs. transfer: CPU speed, transfer speed, RAM and disk size double roughly every 18-24 months, while seek time stays nearly constant (improving only ~5%/year), so the time to read an entire drive keeps growing because capacity grows faster than transfer rate. Moral: scalable computing must go at transfer rate. B-trees (relational DBs) operate at seek rate, log(N) seeks per access – memory / stream based. Sort/merge over flat files (MapReduce) operates at transfer rate, log(N) transfers per sort – batch based.

Characteristics A fault-tolerant, scalable, efficient and reliable distributed storage system that moves computation to the place of the data: a single cluster holds both computation and data and processes huge amounts of data. Scalable: stores and processes petabytes of data. Economical: –It distributes the data and processing across clusters of commonly available computers. –It clusters PCs into a storage and computing platform. –It minimises the number of CPU cycles, amount of RAM, etc. needed on individual machines. Efficient: –By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely fast. –Computation is moved to the place where the data is present. Reliable: –Hadoop automatically maintains multiple copies of data. –It automatically redeploys computing tasks after failures.

Each cluster node runs both DFS and MapReduce.

Data Model – Data is organized into files and directories. – Files are divided into uniform-sized blocks and distributed across cluster nodes. – Blocks are replicated to handle hardware failure. – Checksums of the data are kept for corruption detection and recovery. – Block placement is exposed so that computation can be migrated to the data. Designed for large streaming reads and small random reads, with a facility for multiple clients to append to a file.

Assumes commodity hardware that fails –Files are replicated to handle hardware failure –Checksums are kept for corruption detection and recovery –Operation continues as nodes / racks are added / removed Optimized for fast batch processing –Data location is exposed to allow computation to move to the data –Data is stored in chunks/blocks spread over the nodes of the cluster –Provides VERY high aggregate bandwidth

Files are broken into large blocks. – Typically 128 MB block size. – Blocks are replicated for reliability: one replica on the local node, another replica on a node in a remote rack, a third replica on a different node in that same remote rack, and any additional replicas are placed randomly. HDFS understands rack locality. – Data placement is exposed so that computation can be migrated to the data. The client talks to both the NameNode and the DataNodes. – Data is not sent through the NameNode; clients access data directly from the DataNodes. – Throughput of the file system therefore scales nearly linearly with the number of nodes.

Block Placement

Hadoop Cluster Architecture:

Components DFS master, the "NameNode" –Manages the file system namespace –Controls read/write access to files –Manages block replication –Checkpoints the namespace and journals namespace changes for reliability NameNode metadata is kept in memory – the entire metadata is in main memory – there is no demand paging of file system metadata Types of metadata: list of files; file and block (chunk) namespaces; list of blocks and the locations of their replicas; file attributes, etc.

DFS SLAVES or DATA NODES Serve read/write requests from clients and perform replication tasks upon instruction from the NameNode. DataNodes act as: 1) A block server – stores data in the local file system – stores metadata of a block (e.g. its CRC) – serves data and metadata to clients 2) Block report: periodically sends a report of all existing blocks to the NameNode 3) Periodically sends a heartbeat to the NameNode (to detect node failures) 4) Facilitates pipelining of data (to other specified DataNodes)

Map/Reduce master, the "JobTracker" –Accepts MR jobs submitted by users –Assigns Map and Reduce tasks to TaskTrackers –Monitors task and TaskTracker status, and re-executes tasks upon failure Map/Reduce slaves, the "TaskTrackers" –Run Map and Reduce tasks upon instruction from the JobTracker –Manage the storage and transmission of intermediate output. A minimal job-submission sketch follows.
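To make the JobTracker's role concrete, here is a minimal sketch (not part of the original slides) that submits a trivial job through the classic org.apache.hadoop.mapred API of that era: JobClient hands the job to the JobTracker, which farms tasks out to TaskTrackers. The class name and the input/output paths are hypothetical, and the job simply copies its input using the built-in identity mapper and reducer.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityJob.class);
    conf.setJobName("identity-copy");            // name shown by the JobTracker

    // Map and reduce classes: identity functions, so the job just copies records.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Key/value types produced by the job (TextInputFormat gives offset + line).
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Hypothetical HDFS paths.
    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

    // Submits the job to the JobTracker and waits for completion.
    JobClient.runJob(conf);
  }
}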

SECONDARY NAME NODE Copies the FsImage and transaction log from the NameNode to a temporary directory. Merges the FsImage and transaction log into a new FsImage in the temporary directory. Uploads the new FsImage to the NameNode – the transaction log on the NameNode is then purged.

HDFS Architecture NameNode: maps (filename, offset) -> block id, and block -> DataNodes. DataNode: maps block -> local disk. Secondary NameNode: periodically merges edit logs. A block is also called a chunk.

JOBTRACKER, TASKTRACKER AND JOBCLIENT

HDFS API The most common file and directory operations are supported: – create, open, close, read, write, seek, list, delete, etc. Files are write-once and have exactly one writer at a time. Some operations are peculiar to HDFS: – set replication, get block locations. There is support for owners and permissions. A short usage sketch follows.
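As an illustration of this API, here is a minimal sketch (not from the original slides) using Hadoop's Java FileSystem class; the class name and path are hypothetical, the cluster configuration is assumed to be on the classpath, and exact signatures vary slightly between Hadoop releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the (distributed) file system
    Path file = new Path("/user/demo/hello.txt");  // hypothetical path

    // Create and write: HDFS files are write-once, single writer.
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello hdfs");
    out.close();

    // HDFS-specific operations: change replication, ask where the blocks live.
    fs.setReplication(file, (short) 3);
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("block hosts: " + String.join(",", b.getHosts()));
    }

    // Read it back, then delete.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
    fs.delete(file, false);                        // false = non-recursive
  }
}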

DATA CORRECTNESS Checksums are used to validate data (CRC32). File creation: – the client computes a checksum per 512 bytes – the DataNode stores the checksums. File access: – the client retrieves the data and checksums from the DataNode – if validation fails, the client tries other replicas. The idea is sketched below.
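The per-chunk checksumming described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not HDFS's actual implementation: split the data into 512-byte chunks, store a CRC32 per chunk, and re-verify on read.

import java.util.zip.CRC32;

public class ChunkChecksums {
  static final int CHUNK = 512;                       // bytes per checksum, as on the slide

  // Compute one CRC32 value per 512-byte chunk of data.
  static long[] checksum(byte[] data) {
    int chunks = (data.length + CHUNK - 1) / CHUNK;
    long[] sums = new long[chunks];
    for (int i = 0; i < chunks; i++) {
      CRC32 crc = new CRC32();
      int off = i * CHUNK;
      crc.update(data, off, Math.min(CHUNK, data.length - off));
      sums[i] = crc.getValue();
    }
    return sums;
  }

  // Re-check the data against stored checksums; a mismatch means corruption
  // (in HDFS the client would then try another replica).
  static boolean verify(byte[] data, long[] stored) {
    long[] now = checksum(data);
    if (now.length != stored.length) return false;
    for (int i = 0; i < now.length; i++) {
      if (now[i] != stored[i]) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2048];
    long[] sums = checksum(block);
    block[700] ^= 1;                                  // simulate a corrupted byte
    System.out.println("valid after corruption? " + verify(block, sums));
  }
}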

MUTATION ORDER AND LEASES A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write. Each mutation is performed at all replicas. Leases are used to maintain a consistent order of mutations: the master grants a chunk lease to one replica (the primary), the primary picks a serial order for all mutations to the chunk, and all replicas follow this order (consistency).

Software Model - ??? Parallel programming improves performance and efficiency. In a parallel program the processing is broken up into parts, each of which can be executed concurrently. First identify whether the problem can be parallelised: a recursive Fibonacci computation has sequential dependencies, whereas matrix operations whose elements are independent parallelise well.

Master/Worker The MASTER: –initializes the array and splits it up according to the number of available WORKERS –sends each WORKER its subarray –receives the results from each WORKER. The WORKER: –receives its subarray from the MASTER –performs processing on the subarray –returns the results to the MASTER. A minimal sketch of this pattern follows.
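The following plain-Java sketch (not from the original slides) implements the master/worker pattern above: the master splits an array among worker threads, each worker sums its subarray, and the master gathers the partial results. The class name and worker count are illustrative choices.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MasterWorkerSum {
  public static void main(String[] args) throws Exception {
    final int[] data = new int[1_000_000];
    for (int i = 0; i < data.length; i++) data[i] = 1;   // MASTER initializes the array

    int workers = 4;
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    List<Future<Long>> partials = new ArrayList<>();

    // MASTER: split the array and send each WORKER its subarray.
    int chunk = (data.length + workers - 1) / workers;
    for (int w = 0; w < workers; w++) {
      final int start = w * chunk;
      final int end = Math.min(start + chunk, data.length);
      partials.add(pool.submit(new Callable<Long>() {
        // WORKER: process the subarray and return the result.
        public Long call() {
          long sum = 0;
          for (int i = start; i < end; i++) sum += data[i];
          return sum;
        }
      }));
    }

    // MASTER: receive the results from each WORKER and combine them.
    long total = 0;
    for (Future<Long> f : partials) total += f.get();
    pool.shutdown();
    System.out.println("total = " + total);              // prints 1000000
  }
}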

CALCULATING PI The area of the square is As = (2r)^2 = 4r^2. The area of the circle is Ac = pi * r^2, so pi = Ac / r^2. Since r^2 = As / 4, pi = 4 * Ac / As, i.e. pi is approximately 4 * (number of random points inside the circle) / (number of points in the square).

Randomly generate points in the square and count how many of them also fall inside the circle → MAP (each map task finds ra = number of points in the circle / number of points in the square). Gather all the ra values and compute PI = 4 * r → REDUCE. The counting of points inside the circle is parallelised (MAP); the partial ratios are then merged to find PI (REDUCE). A sketch of this decomposition appears below.
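Here is a small plain-Java sketch of that decomposition (an illustration only, not a full Hadoop job, with illustrative class and method names): each "map" call samples points and counts those inside the circle, and the "reduce" step merges the counts into the final estimate of pi.

import java.util.Random;

public class MonteCarloPi {
  // MAP: sample n random points in the unit square and count how many
  // fall inside the quarter circle x^2 + y^2 <= 1 (same ratio, pi/4, as the full circle).
  static long mapTask(long n, long seed) {
    Random rnd = new Random(seed);
    long inside = 0;
    for (long i = 0; i < n; i++) {
      double x = rnd.nextDouble();
      double y = rnd.nextDouble();
      if (x * x + y * y <= 1.0) inside++;
    }
    return inside;
  }

  public static void main(String[] args) {
    int tasks = 8;                                // independent map tasks
    long samplesPerTask = 1_000_000L;

    // REDUCE: merge the per-task counts into one ratio and scale by 4.
    long inside = 0;
    for (int t = 0; t < tasks; t++) {
      inside += mapTask(samplesPerTask, t);       // maps are independent -> parallelisable
    }
    double pi = 4.0 * inside / (tasks * samplesPerTask);
    System.out.println("pi ~= " + pi);
  }
}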

Each cluster node runs both DFS and MapReduce.

WHAT IS MAP REDUCE PROGRAMMING A restricted parallel programming model meant for large clusters: the user implements only Map() and Reduce(). It is a parallel computing framework (the Hadoop/HDFS libraries) in which the libraries take care of EVERYTHING else (abstraction): parallelization, fault tolerance, data distribution and load balancing. It is a useful model for many practical tasks; the classic word-count example is sketched below.
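As an illustration (not from the original slides), here are the Map() and Reduce() a user would write for word counting, using the classic org.apache.hadoop.mapred API of the Hadoop 0.18 era; the job itself would be configured and submitted to the JobTracker as in the driver sketch earlier.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // Map(): for each input line, emit (word, 1) for every word.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);               // the framework groups these by word
      }
    }
  }

  // Reduce(): sum the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}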

Conclusion Why commodity hardware? Because it is cheaper, and the system is designed to tolerate faults. Why HDFS? Because of the gap between network/disk transfer bandwidth and seek latency: it is built around transfer rates rather than seeks. Why the MapReduce programming model? It is a parallel programming model for large data sets that moves computation to the data, on a single combined compute + data cluster.