Introduction to Distributed Platforms

Presentation transcript:

Introduction to Distributed Platforms J. H. Wang Apr. 25, 2017

Outline
- Motivation: why distribution?
- Distributed computing
- Popular distributed platforms: Hadoop, Spark

Motivation: Why Distribution?
- Big data: volume, velocity, variety
- We need more storage space (space efficiency)
- We need more computing power (time efficiency)

Distributed Computing
- Distributed computing is the study of distributed systems.
- A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
- Another definition: a distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.

Distributed Systems as Middleware

Three characteristics
- Concurrency of components
- Lack of a global clock (requires synchronization)
- Independent failure of components (requires fault tolerance)
Goals
- Resource availability
- Distribution transparency
- Openness
- Scalability

Example architectures
- Centralized: client-server
- Multi-tiered
- Decentralized: peer-to-peer
- Hybrid

Clients and Servers General interaction between a client and a server

Multi-tiered Architectures
[Figure: alternative client-server organizations (a) – (e)]

Peer-to-Peer vs. Hybrid

Parallel vs. Distributed
- Parallel computing: tightly coupled, shared memory
- Distributed computing: loosely coupled, message passing
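The two coordination styles can be contrasted in a small sketch. This is only an illustration: threads inside one Python process stand in for processors or machines, and the function names are made up for this example.

```python
import queue
import threading


def shared_memory_sum(values, workers=4):
    """Tightly coupled style: all workers update one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # shared state must be protected explicitly
            total += partial

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(values, workers=4):
    """Loosely coupled style: workers share no state and report results as messages."""
    inbox = queue.Queue()  # stands in for the network between nodes

    def work(chunk):
        inbox.put(sum(chunk))  # send a message instead of touching shared state

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(workers))
```

Both functions return the same result; the difference is whether the workers coordinate through shared state or only through messages.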

Distributed Computing Systems
- Cluster computing: homogeneous nodes connected through a LAN
- Grid computing: heterogeneous nodes dispersed across a WAN
- Cloud computing: IaaS, PaaS, SaaS

Cloud Enabling Technology: Virtualization
[Figure: a conventional stack (computer hardware, operating system, applications such as browser, editor, database, player, user) next to a virtualized stack (computer hardware, virtual machine manager, and VM1 to VM3 running OS1 to OS3 with App1 to App4)]

Popular Distributed Platforms
- Hadoop
- Spark

Apache Hadoop Project
- For reliable, scalable, distributed computing
- http://Hadoop.apache.org/
- Current version: 2.8.0 (as of April 2017)

Hadoop Architecture

MapReduce
- A software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
HDFS: Hadoop Distributed File System
- A distributed file system designed to run on commodity hardware
- Fault-tolerant, low-cost, high throughput

Assumptions and Goals of HDFS
- Hardware failure: detection of faults and quick, automatic recovery from them is a core architectural goal
- Streaming data access: batch processing, high throughput
- Large data sets
- Simple coherency model: write-once-read-many access
- Moving computation is cheaper than moving data
- Portability across heterogeneous hardware and software platforms

HDFS Architecture – Master/Slave

NameNode and DataNodes
- Master/slave architecture
- A single NameNode: a master server that manages the file system namespace and regulates access to files by clients
- A number of DataNodes: each manages the storage attached to the node it runs on
- NameNode duties: file system namespace operations (opening, closing, and renaming files and directories) and determining the mapping of blocks to DataNodes
- DataNode duties: serving read and write requests from clients; block creation, deletion, and replication
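The split between metadata and data can be sketched with two toy classes. This is a minimal illustration, not the HDFS API: ToyNameNode, ToyDataNode, read_file, and the block ids are hypothetical names invented for this example.

```python
class ToyNameNode:
    """Holds metadata only: the namespace and the block-to-DataNode mapping."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def create(self, path, block_ids, datanode_names):
        self.namespace[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations[block_id] = list(datanode_names)

    def rename(self, old_path, new_path):
        # A namespace operation: no block data moves.
        self.namespace[new_path] = self.namespace.pop(old_path)

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


class ToyDataNode:
    """Stores block contents and serves read and write requests."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client flow: ask the NameNode where the blocks are, then read from DataNodes."""
    parts = []
    for block_id, locations in namenode.locate(path):
        parts.append(datanodes[locations[0]].read_block(block_id))
    return b"".join(parts)
```

Note that the file contents never pass through the toy NameNode; it only answers the question of where each block lives.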

Data Replication
- Each file is stored as a sequence of blocks, which are replicated for fault tolerance
- Block size and replication factor are configurable
- All blocks except the last are the same size
- Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time
- The NameNode makes all decisions regarding replication of blocks
- It periodically receives a Heartbeat and a Blockreport from each of the DataNodes
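A small sketch of the arithmetic behind blocks and replicas. The round-robin placement is an assumption made to keep the example short (real HDFS placement is rack-aware), and all function names and node names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block: all blocks except the last are the same size."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes (round-robin assumption)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def under_replicated(placement, live_nodes, replication):
    """Blocks the NameNode would re-replicate once some heartbeats stop arriving."""
    return sorted(
        block for block, nodes in placement.items()
        if sum(node in live_nodes for node in nodes) < replication
    )
```

For example, a 300 MB file with a 128 MB block size splits into blocks of 128, 128, and 44 MB. If one DataNode then stops sending heartbeats, under_replicated lists the blocks that need a new copy.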

MapReduce
- Map tasks and reduce tasks
- A single master ResourceManager
- One worker NodeManager per cluster node
- One MRAppMaster per application
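The map and reduce steps can be illustrated with the classic word count, run here in a single Python process. This is a sketch of the programming model only, not the Hadoop API; the function names are chosen for this example, and the shuffle step that Hadoop performs between the two phases is written out explicitly.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce task: combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a cluster, each map task would process one input split and each reduce task one group of keys, which is what makes the model easy to parallelize.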

YARN: from Hadoop 2

Major Differences between Hadoop 1 and 2
- HDFS Federation
- Resource manager: YARN

Hadoop 2: HDFS Federation
- Two major components in HDFS: namespaces and the block storage service
- Hadoop 1: a single NameNode manages the entire namespace
- HDFS federation: multiple NameNode servers manage namespaces
- Benefits: horizontal scaling, performance improvement, multiple namespaces

Hadoop 2: YARN
- Brings significant performance improvements for some applications
- Supports additional processing models
- Implements a more flexible execution engine

What is YARN?
- A resource manager, split out of the MapReduce framework of Hadoop 1
- The "operating system" of Hadoop, responsible for:
  - Managing and monitoring workloads
  - Maintaining a multi-tenant environment
  - Implementing security controls
  - Managing high-availability features of Hadoop
- No longer limited to the I/O-intensive, high-latency MapReduce model

Apache Spark
- For large-scale data processing
- http://spark.apache.org/
- Current version: 2.1.0 (as of April 2017)

Components in Spark Platform

Spark Architecture

Components
- Spark applications: independent sets of processes on a cluster, coordinated by the SparkContext object in the main (driver) program
- Cluster managers: allocate resources across applications
- Executors: processes that run computations and store data
- Spark sends your application code (JAR or Python files) to the executors, and the SparkContext then sends tasks to the executors to run
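The driver/executor relationship can be mimicked with a thread pool. This is a loose analogy and not the Spark API: run_job and its parameters are hypothetical, and a real cluster manager, serialization of application code, and data storage on executors are all left out.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, task, num_executors=2, num_partitions=4):
    """Toy driver: split data into partitions and send one task per partition."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # The pool stands in for the executors the cluster manager has allocated.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # map() returns results in partition order once the tasks complete.
        return list(executors.map(task, partitions))
```

For instance, sum(run_job(list(range(10)), sum)) adds the per-partition sums back together on the driver side.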

Note about the architecture
- Each application has its own executor processes
  - Benefit: applications are isolated from each other
  - Drawback: data cannot be shared across different applications without writing to external storage
- Spark is agnostic to the underlying cluster manager
- The driver program must listen for and accept incoming connections from its executors throughout its lifetime, so it must be network addressable from the workers
- The driver program should run close to the worker nodes, since it schedules tasks on the cluster

Cluster Manager Types
- Standalone: a simple cluster manager included in Spark
  - Faster job startup, but it doesn’t support communication with an HDFS secured with the Kerberos authentication protocol
- Apache Mesos: a general cluster manager that can also run Hadoop MapReduce
  - A scalable and fault-tolerant “distributed systems kernel” written in C++, which also supports C++ and Python applications
  - Effectively a “scheduler of scheduler frameworks” because of its two-level scheduling architecture
- Hadoop YARN: the resource manager in Hadoop 2
  - YARN lets you run different types of Java applications, not just Spark
  - It also provides methods for isolating and prioritizing applications among users and organizations
  - The only cluster type that supports Kerberos-secured HDFS
  - You don’t have to install Spark on all nodes in the cluster

Spark Standalone Cluster
Two deploy modes for a Spark Standalone cluster:
- Client mode: the driver is launched in the same process as the client that submits the application
- Cluster mode: the driver is launched from one of the Worker processes in the cluster, and the client exits as soon as it has submitted the application
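The deploy mode is chosen when the application is submitted. The helper below only builds a spark-submit command line using the --master, --deploy-mode, and --class options; the helper name, master host, class name, and JAR name are placeholders invented for this sketch.

```python
def spark_submit_args(master_url, deploy_mode, main_class, app_jar):
    """Build a spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master_url,        # e.g. spark://<master-host>:7077 for standalone
        "--deploy-mode", deploy_mode,  # where the driver is launched
        "--class", main_class,         # hypothetical application entry point
        app_jar,                       # hypothetical application JAR
    ]


# Hypothetical values, for illustration only.
args = spark_submit_args("spark://master-host:7077", "cluster",
                         "org.example.App", "app.jar")
print(" ".join(args))
```

The resulting list could be passed to subprocess.run on a machine where Spark is installed; switching the second argument to "client" keeps the driver in the submitting process.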

Spark on YARN cluster mode

Spark on YARN client mode

Spark and Hadoop

Thanks for Your Attention!