CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.

Presentation transcript:

CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

Objectives Explain why and how Hadoop has become the foundation for virtually all modern data architectures Explain the architecture and design of HDFS and MapReduce

Agenda! What is it? Why should I care? Example Usage Hadoop Core – HDFS – MapReduce Ecosystem, maybe

WHAT IS HADOOP?

You tell me

Distributed Systems Collection of independent computers that appears to its users as a single coherent system* Take a bunch of Ethernet cable, wire together machines with a hefty amount of CPU, RAM, and storage Figure out how to build software to parallelize storage/computation/etc. across our cluster – While tolerating failures and all kinds of other garbage “The hard part” *Andrew S. Tanenbaum

Properties of Distributed Systems Reliability Availability Scalability Efficiency

Reliability Can the system deliver services in face of several component failures?

Availability How much latency is imposed on the system when a failure occurs?
# of Nines  Availability %  Per year        Per month     Per week
1           90%             36.5 days       72 hours      16.8 hours
2           99%             3.65 days       7.20 hours    1.68 hours
3           99.9%           8.76 hours      43.8 minutes  10.1 minutes
4           99.99%          52.56 minutes   4.38 minutes  1.01 minutes
5           99.999%         5.26 minutes    25.9 seconds  6.05 seconds
6           99.9999%        31.5 seconds    2.59 seconds  604.8 ms
7           99.99999%       3.15 seconds    … ms          60.48 ms
8           99.999999%      … ms            … ms          6.048 ms
9           99.9999999%     … ms            … ms          … ms
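The downtime figures for each number of nines follow from simple arithmetic: the allowed unavailability is 10^-n of the period. The sketch below shows that calculation; the class and method names are hypothetical, and the period lengths (a 365-day year and a 7-day week) are assumptions rather than values stated on the slide.

```java
// Sketch: downtime allowed per period for a given number of nines.
// Period lengths (365-day year, 7-day week) are assumptions.
final class DowntimeBudget {
    /** Fraction of time the system may be down, e.g. 3 nines -> 0.001. */
    static double unavailability(int nines) {
        return Math.pow(10, -nines);
    }

    /** Allowed downtime in seconds over a period of the given length. */
    static double downtimeSeconds(int nines, double periodSeconds) {
        return unavailability(nines) * periodSeconds;
    }

    public static void main(String[] args) {
        double year = 365 * 24 * 3600.0;
        double week = 7 * 24 * 3600.0;
        for (int n = 1; n <= 9; n++) {
            System.out.printf("%d nines: %.4f s/year, %.4f s/week%n",
                    n, downtimeSeconds(n, year), downtimeSeconds(n, week));
        }
    }
}
```

For example, three nines over a year works out to 31,536 seconds, which is the 8.76 hours listed for 99.9%.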

Scalability Can the system scale to support a growing number of tasks?

Efficiency How efficient is the system, in terms of latency and throughput?

WHY SHOULD I CARE?

The Short Answer is

The V’s of Big Data! Volume Velocity Variety Veracity Value – New and Improved!

Data Sources Social Media Web Logs Video Networks Sensors Transactions (banking, etc.) , Text Messaging Paper Documents

Value in all of this! Fraud Detection Predictive Models Recommendations Analyzing Threats Market Analysis Others!

How do we extract value? Monolithic Computing! Build bigger faster computers! Limited solution – Expensive and does not scale as data volume increases

Enter Distributed Computing Processing is distributed across many computers Distribute the workload to powerful compute nodes with some separate storage

New Class of Challenges Does not scale gracefully Hardware failure is common Sorting, combining, and analyzing data spread across thousands of machines is not easy

RDBMS is still alive! SQL is still very relevant, with many complex business rules mapping to a relational model But there is much more that can be done by leaving relational behind and looking at all of your data NoSQL can expand what you have, but it has its disadvantages

NoSQL Downsides Increased middle-tier complexity Constraints on query capability No standard semantics for query Complex to set up and maintain

An Ideal Cluster Linear horizontal scalability Analytics run in isolation Simple API with multiple language support Available in spite of hardware failure

Enter Hadoop Hits these major requirements Two core pieces – Distributed File System (HDFS) – Flexible analytic framework (MapReduce) Many ecosystem components to expand on what core Hadoop offers

Scalability Near-linear horizontal scalability Clusters are built on commodity hardware Component failure is an expectation

Data Access Moving data from storage to a processor is expensive Store data and process the data on the same machines Process data intelligently by being local

Disk Performance Disk technology has made significant advancements Take advantage of multiple disks in parallel – 1 disk, 3TB of data, 300MB/s, ~2.8 hours to read – 1,000 disks, same data, ~10 seconds to read Distribution of data and co-location of processing makes this a reality
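The read-time comparison on this slide is a one-line calculation: total bytes divided by aggregate throughput. A minimal sketch follows, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and data spread perfectly evenly across disks; the class and method names are hypothetical.

```java
// Sketch: time to scan a data set when it is spread evenly over N disks read in parallel.
// Decimal units and a perfectly even spread are assumptions.
final class ScanTime {
    /** Seconds to read totalBytes using `disks` disks, each sustaining bytesPerSecondPerDisk. */
    static double secondsToRead(double totalBytes, double bytesPerSecondPerDisk, int disks) {
        return totalBytes / (bytesPerSecondPerDisk * disks);
    }

    public static void main(String[] args) {
        double threeTb = 3e12;      // 3 TB
        double rate = 300e6;        // 300 MB/s per disk
        System.out.printf("1 disk: %.0f s%n", secondsToRead(threeTb, rate, 1));
        System.out.printf("1,000 disks: %.0f s%n", secondsToRead(threeTb, rate, 1000));
    }
}
```

One disk needs 10,000 seconds (a bit under three hours); a thousand disks need about 10 seconds, provided each task reads from a local disk rather than pulling data over the network.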

Complex Processing Code Hadoop framework abstracts complex distributed computing environment – No synchronization code – No networking code – No I/O code MapReduce developer focuses on the analysis – Job runs the same on one node or 4,000 nodes!

Fault Tolerance Failure is inevitable, and it is planned for Component failure is automatically detected and seamlessly handled System continues to operate as expected with minimal degradation

Hadoop History Spun off of Nutch, an open source web search engine Google Whitepapers – GFS – MapReduce Nutch re-architected to birth Hadoop Hadoop is very mainstream today

Hadoop Versions Hadoop is the latest release Two Java APIs, referred to as the old and new APIs Two task management systems – MapReduce v1 JobTracker, TaskTracker – MapReduce v2 – A YARN application MR runs in containers in the YARN framework

Common Use Cases Log Processing Image Identification Extract Transform Load (ETL) Recommendation Engines Time-Series Storage and Processing Building Search Indexes Long-Term Archive Audit Logging

Non-Use Cases Data processing handled by one large server ACID Transactions

LinkedIn Architecture

LinkedIn Applications

HDFS

Hadoop Distributed File System Inspired by Google File System (GFS) High performance file system for storing data Relatively simple centralized management Fault tolerance through data replication Optimized for MapReduce processing – Exposing data locality Linearly scalable Written in Java APIs in all the useful languages

Hadoop Distributed File System Use of commodity hardware Files are write once, read many Leverages large streaming reads vs random Favors high throughput vs low latency Modest number of huge files

HDFS Architecture Split large files into blocks Distribute and replicate blocks to nodes Two key services – Master NameNode – Many DataNodes Backup/Checkpoint NameNode for HA User interacts with a UNIX file system

NameNode Single master service for HDFS Was a single point of failure Stores file to block to location mappings in a namespace All transactions are logged to disk Can recover based on checkpoints of the namespace and transaction logs

NameNode Memory HDFS prefers fewer larger files as a result of the metadata Consider 1GB of data with 64MB block size – Stored as one file Name: 1 item Blocks: 16*3 = 48 items Total items: 49 – Stored as 1,024 1MB files Names: 1024 items Blocks: 1024*3 = 3072 items Total items: 4096 Each item is around 200 bytes
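The item counts above can be reproduced with a short calculation: one item per file name plus one item per block replica. The sketch below follows the slide's counting and its rough 200-bytes-per-item estimate; the class and method names are hypothetical, and real NameNode memory use depends on the Hadoop version and configuration.

```java
// Sketch of the slide's NameNode metadata estimate; not an exact model of NameNode memory.
final class NameNodeMemory {
    static final long BYTES_PER_ITEM = 200; // rough per-item estimate from the slide

    /** Blocks needed for one non-empty file (ceiling division). */
    static long blocksPerFile(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** One item per file name plus one item per block replica. */
    static long metadataItems(long files, long fileBytes, long blockBytes, int replication) {
        return files * (1 + blocksPerFile(fileBytes, blockBytes) * replication);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long oneBigFile = metadataItems(1, 1024 * mb, 64 * mb, 3);
        long manySmallFiles = metadataItems(1024, mb, 64 * mb, 3);
        System.out.println(oneBigFile + " items, " + oneBigFile * BYTES_PER_ITEM + " bytes");
        System.out.println(manySmallFiles + " items, " + manySmallFiles * BYTES_PER_ITEM + " bytes");
    }
}
```

The same gigabyte of data costs 49 items as one file and 4,096 items as small files, which is why HDFS favors a modest number of large files.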

Checkpoint Node (Secondary NN) Performs checkpoints of the namespace and logs Not a hot backup! HDFS 2.0 introduced NameNode HA – Active and a Standby NameNode service coordinated via ZooKeeper

DataNode Stores blocks on local disk Sends frequent heartbeats to NameNode Sends block reports to NameNode Clients connect to DataNodes for I/O

How HDFS Works - Writes Client contacts NameNode to write data NameNode says write it to these nodes Client sequentially writes blocks to DataNode

How HDFS Works - Writes DataNodes replicate data blocks, orchestrated by the NameNode

How HDFS Works - Reads Client contacts NameNode to read data NameNode says you can find it here Client sequentially reads blocks from DataNode

How HDFS Works - Failure Client connects to another node serving that block

HDFS Blocks Default block size was 64 MB, now 128 MB – Configurable – Larger sizes are common, say 256 MB or 512 MB Default replication is three – Also configurable, but rarely changed Stored as files on the DataNode’s local fs – Cannot associate any block with its true file

Replication Strategy First copy is written to the same node as the client – If the client is not part of the cluster, first copy goes to a random node Second copy is written to a node on a different rack Third copy is written to a different node on the same rack as the second copy
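The three placement rules can be sketched as a small selection routine. This is an illustration of the rules as stated on the slide, not HDFS's actual placement code: `Node`, `choose`, and `pick` are hypothetical names, and the sketch ignores factors a real cluster weighs, such as free space and node load.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the three-copy placement rule; all names here are hypothetical.
final class ReplicaPlacement {
    static final class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    /** Returns the three target nodes, in write order, for one block. */
    static List<Node> choose(List<Node> cluster, Node client, Random rnd) {
        // First copy: the client's own node when it is part of the cluster, else a random node.
        final Node first = (client != null && cluster.contains(client))
                ? client : cluster.get(rnd.nextInt(cluster.size()));
        // Second copy: some node on a different rack than the first.
        final Node second = pick(cluster, rnd, n -> !n.rack.equals(first.rack));
        // Third copy: a different node on the same rack as the second.
        final Node third = pick(cluster, rnd, n -> n != second && n.rack.equals(second.rack));
        return Arrays.asList(first, second, third);
    }

    private static Node pick(List<Node> cluster, Random rnd, Predicate<Node> eligible) {
        List<Node> candidates = cluster.stream().filter(eligible).collect(Collectors.toList());
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no eligible node for this replica");
        }
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
                new Node("a1", "rackA"), new Node("a2", "rackA"),
                new Node("b1", "rackB"), new Node("b2", "rackB"));
        for (Node n : choose(cluster, cluster.get(0), new Random())) {
            System.out.println(n.name + " on " + n.rack);
        }
    }
}
```

Placing two of the three copies on one remote rack keeps a full copy available if the writer's rack fails, while only one copy has to cross racks twice.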

Data Locality Key in achieving performance MapReduce tasks run as close to data as possible Some terms – Local – On-Rack – Off-Rack

Data Corruption Use of checksums to ensure block integrity Checksum is calculated on read and compared against the checksum stored when the block was written – Fast to calculate and space-efficient If the checksums differ, client reads the block from another DataNode – Corrupted block will be deleted and replaced by a copy of a non-corrupt block DataNode periodically runs a background thread to do this checksum process – For those files that aren’t read very often
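The read-time check can be illustrated with the JDK's built-in CRC32. This is a minimal sketch of the idea only: the class and method names are hypothetical, and the choice of `java.util.zip.CRC32` over whole blocks is an assumption made for illustration, not a description of how HDFS stores or scopes its checksums.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: compare a checksum computed at read time with the one stored at write time.
final class BlockChecksum {
    /** CRC32 of the given bytes. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** True when the data still matches the checksum recorded at write time. */
    static boolean isIntact(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);   // recorded when the block was written
        block[0] ^= 0x01;                // simulate a flipped bit on disk
        System.out.println(isIntact(block, stored)
                ? "block ok"
                : "corrupt: read the block from another DataNode");
    }
}
```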

Fault Tolerance If no heartbeat is received from DN within a configurable time window, it is considered lost – 10 minute default NameNode will: – Determine which blocks were on the lost node – Locate other DataNodes with valid copies – Instruct DataNodes with copies to replicate blocks to other DataNodes
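The heartbeat rule amounts to tracking the last time each DataNode was heard from and flagging any node whose silence exceeds the window. A minimal sketch under that assumption follows; `HeartbeatMonitor` and its methods are hypothetical names, and time is passed in explicitly so the logic is easy to follow and test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: flag DataNodes that have been silent longer than a configurable window.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat from the named DataNode at the given time. */
    void heartbeat(String dataNode, long nowMillis) {
        lastSeen.put(dataNode, nowMillis);
    }

    /** Names of nodes whose last heartbeat is older than the timeout, sorted for stable output. */
    List<String> lostNodes(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                lost.add(e.getKey());
            }
        }
        Collections.sort(lost);
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10 * 60 * 1000L); // 10-minute window
        monitor.heartbeat("dn1", 0L);
        monitor.heartbeat("dn2", 500_000L);
        System.out.println("lost: " + monitor.lostNodes(700_000L));
    }
}
```

Once a node is flagged, the next steps on the slide apply: look up the blocks it held and schedule copies from the surviving replicas.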

Interacting with HDFS Primary interfaces are CLI and Java API Web UI for read-only access WebHDFS provides RESTful read/write HttpFS also exists snakebite is a Python library from Spotify FUSE allows HDFS to be mounted as a standard file system – Typically used so legacy apps can use HDFS data ‘hdfs dfs -help’ will show CLI usage

HADOOP MAPREDUCE

MapReduce! A programming model for processing data Contains two phases! Map – Perform a map function on key/value pairs Reduce – Perform a reduce function on key/value groups Groups are created by sorting map output Operations on key/value pairs open the door for very parallelizable algorithms
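The two phases and the grouping step between them can be modeled in a few lines of plain Java, with no Hadoop dependency. The sketch below is a single-process illustration of the programming model only: `MiniMapReduce`, `run`, and `wordCount` are hypothetical names, and a `TreeMap` stands in for the sort that forms the key groups.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Sketch: the map -> group-by-sorted-key -> reduce model in one process.
final class MiniMapReduce {
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map phase: collect emitted pairs; the TreeMap keeps keys sorted, forming the groups.
        TreeMap<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> kv : mapper.apply(input)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Reduce phase: one call per key group, in key order.
        Map<K, R> output = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Example job: count occurrences of each whitespace-separated word. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Function<String, List<Map.Entry<String, Integer>>> mapper = line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(w, 1));
                }
            }
            return out;
        };
        BiFunction<String, List<Integer>, Integer> reducer =
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return run(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c")));
    }
}
```

Because each map call sees only its own input and each reduce call sees only its own key group, both loops could be split across machines without changing the result, which is what makes the model parallelizable.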

Hadoop MapReduce Automatic parallelization and distribution of tasks Framework handles scheduling tasks and re-running failed tasks Developer can code many pieces of the puzzle Framework handles a Shuffle and Sort phase between map and reduce tasks Developer need only focus on the task at hand, rather than how to manage where data comes from and where it goes

MRv1 and MRv2 Both manage compute resources, jobs, and tasks Job API the same MRv1 is proven in production – JobTracker / TaskTrackers MRv2 is a new application on YARN – YARN is a generic platform for developing distributed applications

MRv1 - JobTracker Monitors job and task progress Issues task attempts to TaskTrackers Re-tries failed task attempts Four failed attempts of same task = one failed job Default scheduler is FIFO order – CapacityScheduler or FairScheduler is preferred Single point of failure for MapReduce

MRv1 – TaskTrackers Runs on same node as DataNodes Sends heartbeats and task reports to JT Configurable number of map and reduce slots Runs map and reduce task attempts in a separate JVM

How MapReduce Works Client submits job to JobTracker JobTracker submits tasks to TaskTrackers Job output is written to DataNodes w/replication JobTracker reports metrics

How MapReduce Works - Failure JobTracker assigns task to different node

[Figure panels: Map Input, Map Output, Reducer Input Groups, Reducer Output]

Example -- Word Count Count the number of times each word is used in a body of text Uses TextInputFormat and TextOutputFormat
map(byte_offset, line)
    foreach word in line
        emit(word, 1)
reduce(word, counts)
    sum = 0
    foreach count in counts
        sum += count
    emit(word, sum)

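Mapper Code
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Emits (word, 1) for every whitespace-separated token in each input line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text(); // reused across calls to avoid per-token allocation
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}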

Shuffle and Sort Mapper outputs to a single partitioned file Reducers copy their parts Reducer merges partitions
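Which reducer receives a given key is decided by a partition function applied to every map output key. The sketch below shows the common hash-modulo rule: mask off the sign bit of the key's hash, then take it modulo the number of reducers. The class and method names are hypothetical, and `String.hashCode()` stands in for the key type's own hash, so partition numbers here will not necessarily match those of a real Hadoop job.

```java
// Sketch: hash partitioning of map output keys across reducers.
final class HashPartitionSketch {
    /** Reducer index for a key: non-negative hash modulo the reducer count. */
    static int partitionFor(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "hdfs", "mapreduce", "yarn"};
        for (String w : words) {
            System.out.println(w + " -> reducer " + partitionFor(w, 4));
        }
    }
}
```

Since the function depends only on the key, every occurrence of a word lands in the same partition no matter which mapper emitted it, so a single reducer sees the complete group for that key.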

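Reducer Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Sums the counts emitted for each word and writes (word, total).
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}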

Resources, Wrap-up, etc. Supportive community – Hadoop-DC – Data Science MD – Baltimore Hadoop Users Group Plenty of resources available to learn more – Books – lists – Blogs

That Ecosystem I Mentioned Accumulo Avro Geode Flink Twill DataFu Falcon HCatalog Samza Tajo HAWQ Parquet Mahout Oozie Storm ZooKeeper Spark Cassandra Kafka Crunch Azkaban Knox HDFS MapReduce YARN Sqoop Flume NiFi Pig Hive Streaming HBase Greenplum Ambari Mesos Drill Giraph Nutch Sentry Chukwa MRUnit Phoenix Tez Ranger

References Hadoop: The Definitive Guide, Chapter; data-ecosystem-at-linkedin; content/uploads/2013/01/hadoop_ecosystem _full2.png; Google Images