
1 MapReduce Cloud Platform
CS 595 Lecture 12

2 Hadoop Cloud Architecture topics
Virtual appliance management
Load balancing / consolidation migration
Fault tolerance
Hadoop
HDFS
Component communication
Command line interface
Java API
Virtual appliances
Mapping functions
Shuffle and sort functions
Reduce functions

3 Hadoop
Hadoop includes:
Distributed file system, HDFS – distributes data across cluster nodes
Map/Reduce functions – distribute computation across cluster nodes
Open source, from Apache
Written in Java
Runs on: Linux, Mac OS X, Windows, Solaris
Commodity hardware

4 Hadoop A MapReduce job usually splits the input data into independent chunks which are processed by map tasks in parallel. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, compute nodes and storage nodes are the same. MapReduce framework and HDFS are running on the same set of nodes. This allows Hadoop to effectively schedule tasks on the nodes where data is already present, creating very high aggregate bandwidth. Applications specify input/output data locations in HDFS Supply mapping and reducing function by implementing appropriate interfaces and/or abstract classes

5 Hadoop Previous implementations discussed about Hadoop use bare metal configurations. The following slides will propose a vertical cloud architecture based on virtualized components of Hadoop and HDFS. Components of MapReduce/HDFS can be contained in VMs to give benefits related to IaaS/PaaS cloud computing architectures.

6 Hadoop – bare metal architecture

7 HDFS – Communication protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol stack.
Socket communication uses the <IP, Port#> tuple.
A client establishes a connection to a configurable TCP port on the NameNode server (configured as sketched below).
The DataNodes communicate with the NameNode using the DataNode protocol.
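A hedged sketch of how a client learns the NameNode's <IP, Port#> endpoint: the fs.defaultFS property is normally read from core-site.xml, and the host name and port used below are placeholders rather than values from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class NameNodeConnect {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally taken from core-site.xml; overridden here with a placeholder host/port pair.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(URI.create(conf.get("fs.defaultFS")), conf);
    System.out.println("Connected to NameNode at " + fs.getUri());
    fs.close();
  }
}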

8 HDFS – Communication protocols
A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol.
The NameNode is a server process: it never initiates a request (pull); it only responds to RPC (push) requests issued by DataNodes or clients.

9 HDFS – Communication protocols
Application code ↔ Client
HDFS provides a Java API for applications to use. Fundamentally, the application uses the standard java.io interface.
A C language wrapper for the HDFS Java API is also available.
The client and application code are bound into the same address space.

10 HDFS – Communication protocols
Client ↔ NameNode
Communication between the NameNode and the client is governed by the Client Protocol, documented in hdfs/protocol/ClientProtocol.java.
Major functions of the Client Protocol (see the sketch below):
Create: creates a new file.
Append: after a file is created and closed, allows adding data to the end of the file.
Complete (close): the client has finished writing to a file.
Read: the client wishes to read from a file.
Error reporting: reports bad blocks detected by the client.
Directory management: rename, delete, copy.
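A hedged illustration of how these protocol operations surface through the client-side FileSystem Java API (the host name and paths are placeholders); the client library turns each call into the corresponding Client Protocol RPC.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientProtocolCalls {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:9000"), conf);

    Path file = new Path("/user/hadoop/demo.txt");
    FSDataOutputStream out = fs.create(file);       // Create
    out.writeBytes("first line\n");
    out.close();                                    // Complete (close)

    FSDataOutputStream appendOut = fs.append(file); // Append
    appendOut.writeBytes("more data\n");
    appendOut.close();

    fs.rename(file, new Path("/user/hadoop/renamed.txt"));  // Directory management: rename
    fs.delete(new Path("/user/hadoop/renamed.txt"), false); // Directory management: delete
    fs.close();
  }
}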

11 HDFS – Communication protocols
Client ↔ DataNode
A client communicates with a DataNode directly to transfer (send/receive) data using the DataTransferProtocol, which is defined in DataTransferProtocol.java.
It uses streaming protocols, not RPC, for performance reasons.
DataTransferProtocol functions:
opReadBlock(): read a block
opWriteBlock(): write a block
opReplaceBlock(): replace a block
opCopyBlock(): copy a block
opBlockChecksum(): return a block's checksum

12 HDFS – Communication protocols
NameNode DataNode All communication between the NameNode and the DataNode is initiated by the DataNode. The NameNode never initiates communication to the DataNode, although responses from the NameNode may include commands that cause the DataNode to send further communications. DataNodes send information to the NameNode through four major interfaces defined in the DataNodeProtocol: DataNode Registration: informs the NameNode of its existence. The NameNode computes and returns the DataNode’s unique registration ID. This ID is used to authenticate DataNode functions. DataNode Heartbeats: Every few seconds, the DataNode sends info statistics about its capacity and current activity. The NameNode can respond to the heartbeat with a list of block oriented commands. DataNode Block Reports: Periodically (hourly) sends reports on blocks it contains. DataNode Notification of Block Received: Reports that it has received a new block from a client (new file write) or another DataNode (replication).

13 HDFS
Designed to store large files
Stores files as large blocks
Data is re-replicated when needed
Accessed from the command line, the Java API, or the C API

14 HDFS – Java API
Writing data from the local file system to HDFS:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteToHDFS {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0]; // Source file in the local file system
    String dst = args[1];      // Destination file in HDFS

    // Input stream for the local file to be written to HDFS
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

    // Get the configuration of the Hadoop system (reads the config files)
    Configuration conf = new Configuration();
    System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));

    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst)); // Destination file in HDFS
    IOUtils.copyBytes(in, out, 4096, true);      // Copy the file from local to HDFS
    System.out.println(dst + " copied to HDFS");
  }
}

15 HDFS – Java API
Reading data from HDFS:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadFromHDFS {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // File to read in HDFS

    Configuration conf = new Configuration(); // Get the configuration of the Hadoop system
    FileSystem fs = FileSystem.get(URI.create(uri), conf); // Get the file system – HDFS

    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri)); // Open the given path in HDFS
      IOUtils.copyBytes(in, System.out, 4096, false);
      System.out.println("End of file: HDFS file read complete");
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

16 HDFS – command line
Copying data from HDFS to the local file system: HDFS command: get
Examples:
hdfs dfs -get /user/Hadoop/file localFile
hdfs dfs -get hdfs://example.com/user/Hadoop/file localFile
Writes data from HDFS to the local file system.

17 HDFS – command line
Reading data from HDFS: HDFS command: cat
Examples:
hdfs dfs -cat /user/Hadoop/file
hdfs dfs -cat hdfs://example.com/user/Hadoop/file
Copies source paths to stdout.

18 Virtual Appliances MapReduce functionally, along with the associated HDFS can be contained within a virtualized environment. Allows for isolation of job requests/data sets in a multi-tenant architecture. Each VM can be tailored dynamically based on workload and job requests. Idle Hadoop VMs can decrease their amount of resource consumption to allow other active Hadoop VMs more resources. Adds scalability and elasticity layer.

19 Virtual Appliance – Map functions
Higher-order function that applies a given function to all elements of a list:
Numbers = List(1,2,3,4,5)
Numbers.map(x => x*x) == List(1,4,9,16,25)
Hadoop Mappers: a process that transforms input data into intermediate records (see the sketch below).
Hadoop creates one map task for each InputSplit generated by the InputFormat for the job.
Users can control which keys in the mapper outputs go to which reducers by providing a custom Partitioner.
The number of map tasks is usually driven by the total size of the input, generally 10 – 100 mappers per node.
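A minimal Hadoop Mapper sketch, using the classic word-count example rather than anything from this slide: it transforms each input line into intermediate (word, 1) records.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line text.
// Output: one (word, 1) intermediate record per token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // emit intermediate (word, 1)
    }
  }
}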

20 Virtual Appliance – Reduce functions
Higher-order function that traverses an input data structure and recombines its elements using a given combining operation:
Numbers = List(1,2,3,4,5)
Numbers.reduce(_ + _) == 15
Hadoop Reducers: reduce a set of intermediate mapper outputs that share a key to a smaller set of values (see the sketch below).
The number of reducers is set by Job.setNumReduceTasks(int).
Typically between 0 and the number of map tasks.
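A matching Reducer sketch for the word-count example (again an illustration, not slide content): it reduces the set of 1s that share a key to a single count.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) grouped by key; output: (word, count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // one reduced record per key
  }
}

In a driver like the one sketched earlier, job.setMapperClass(WordCountMapper.class), job.setReducerClass(WordCountReducer.class), and job.setNumReduceTasks(n) would wire these classes in.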

21 Virtual Appliance – RAMDisks
Mappers/reducers can use RAMdisks for performance.
A RAMdisk is a block of RAM that can be treated like an HDD/SSD (secondary storage).
Performance of RAMdisks is, in general, orders of magnitude better than that of other storage media such as HDD/SSD:
File access time is greatly reduced.
Maximum data throughput is limited only by RAM speed, the data bus, and the CPU; other forms of storage are further limited by interface buses such as SATA and USB.
Distributed file systems read and write secondary storage very often to maintain different states.
Data output from map processes can be stored on RAMdisks to be accessed by reduce processes (see the sketch below).
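A hedged sketch of directing intermediate map output at a RAMdisk. The tmpfs mount point is a placeholder, and the property name mapreduce.cluster.local.dir (the local directory for intermediate MapReduce data) is an assumption that can vary by Hadoop version.

import org.apache.hadoop.conf.Configuration;

public class RamdiskLocalDir {
  public static void main(String[] args) {
    // Assumes a tmpfs RAMdisk is already mounted at /mnt/ramdisk, e.g.:
    //   mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
    Configuration conf = new Configuration();
    // Assumed property: local directory where intermediate MapReduce data is written.
    conf.set("mapreduce.cluster.local.dir", "/mnt/ramdisk/mapred/local");
    System.out.println("Intermediate data dir: " + conf.get("mapreduce.cluster.local.dir"));
  }
}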

22 Virtual Appliance Management – load balancing
How load balancing can be used:
Not significant in the execution of a MapReduce algorithm itself, but essential when handling large input data and hardware resource use is critical.
Balancing mechanisms can be implemented based on the HDFS disk-space usage of the different nodes in the cluster.
A cluster is considered balanced if, for each DataNode, the ratio of used space to total capacity on that node (node utilization) differs from the ratio of used space to total capacity across the cluster (cluster utilization) by no more than a predetermined threshold value.
Smaller threshold values create a more balanced cluster, but increase the time it takes to run the balancing algorithm (see the note below).
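As a hedged aside (not from the slides): HDFS ships a balancer tool that implements this threshold-based policy. It is run from the command line as, for example, hdfs balancer -threshold 10, where the argument is the allowed difference between node utilization and cluster utilization in percent; the balancer moves blocks between DataNodes until every node falls within the threshold or no more blocks can be moved.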

23 Fault tolerance
Hadoop system failures: the entire MapReduce process is computed again.
Hadoop failures after a successful mapping process: intermediate key/value pairs generated successfully by the mappers can be saved and sent to another data node upon failure.

24 Fault tolerance
[Pipeline diagram: large input data set from HDFS → mappers → intermediate key/value pairs (with a backup of the intermediate key/value pairs) → reducers → output data to HDFS]

25 Fault tolerance Advantages of “backing up” output from Mappers:
Unnecessary map compute cycles are reduced.
The backup is created as a new state in the MapReduce pipeline.
Disadvantage: cost overhead.
The reduce functions may experience delays, since the data has to be written to backup locations as well as piped into the reduce functions.

26 Applications of MapReduce
Distributed Grep
Linux grep command example:
grep -Eh <regex> <inDir>/* | sort | uniq -c | sort -nr
Counts lines in all files in <inDir> that match <regex> and displays the counts in descending order.
grep -Eh 'X|y' input/* | sort | uniq -c | sort -nr
File 1: X, Y, A, y
File 2: A, B, X, Y
Output:
2 X
1 y

27 Applications of MapReduce
Map function for distributed grep:
Input: a file offset/location and the corresponding line.
Output: the key/value pair list [(line, 1)] if the line matches; an empty list if there is no match.
Reduce function for distributed grep:
(line, [1, 1, ...]) → (line, n), where n is the number of 1s (matches) in the list.
A sketch of both functions follows.
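A compact sketch of these two functions as Hadoop classes. The class names are illustrative, and the regular expression is assumed to be passed through the job configuration under a made-up key, grep.regex.

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedGrep {
  // Emits (line, 1) when the line matches the regex, nothing otherwise.
  public static class GrepMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
      pattern = Pattern.compile(context.getConfiguration().get("grep.regex", ".*"));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (pattern.matcher(line.toString()).find()) {
        context.write(line, ONE); // [(line, 1)] on a match; empty output otherwise
      }
    }
  }

  // Turns (line, [1, 1, ...]) into (line, n).
  public static class GrepReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text line, Iterable<LongWritable> ones, Context context)
        throws IOException, InterruptedException {
      long n = 0;
      for (LongWritable one : ones) {
        n += one.get();
      }
      context.write(line, new LongWritable(n));
    }
  }
}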

28 Applications of MapReduce
Large scale PDF generation: the New York Times needed to generate PDF files for 11 million articles, covering all articles from 1851 to 1980. The articles were images scanned from the original papers, scaled, and glued together.

29 Applications of MapReduce
Large scale PDF generation – technologies used:
Amazon Elastic Compute Cloud (EC2): virtualized computing environment
Amazon Simple Storage Service (S3): scalable, inexpensive storage; can store and retrieve any amount of data from anywhere on the Internet
Hadoop: open source implementation of MapReduce

30 Applications of MapReduce
Large scale PDF generation results:
4 TB of scanned articles were sent to S3.
A cluster of EC2 VMs was configured to distribute the PDF generation using Hadoop.
With 100 large EC2 instances and 24 hours, the NYT was able to convert the scanned articles into 1.5 TB of PDF documents.

31 Applications of MapReduce
Geographical data: large data sets including road, intersection, feature, and terrain data.
Google Maps uses MapReduce to solve problems such as:
Locating roads connected to a given intersection
Rendering Google Map tiles
Finding the nearest feature to a given address or location

32 Applications of MapReduce
Geographical data, example (a sketch follows):
Input: a list of roads and intersections
Map: creates pairs of connected points, (road, intersection) or (road, road)
Reduce output: lists all points that connect to a particular road
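A rough sketch of this example; the input record format (a tab-separated pair of connected element names) and all class names are assumptions made purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RoadConnectivity {
  // Each input line is assumed to name two connected elements, e.g. "Main_St<TAB>5th_Ave".
  // The mapper emits the pair keyed both ways so connectivity can be looked up from either side.
  public static class ConnectionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new Text(parts[1]));
        context.write(new Text(parts[1]), new Text(parts[0]));
      }
    }
  }

  // For each road or intersection, list every point connected to it.
  public static class ConnectionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text element, Iterable<Text> connected, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text c : connected) {
        if (list.length() > 0) {
          list.append(", ");
        }
        list.append(c.toString());
      }
      context.write(element, new Text(list.toString()));
    }
  }
}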

33 Applications of MapReduce
PageRank
An algorithm implemented by Google to rank any type of recursively linked "documents" using MapReduce.
Initially developed at Stanford University by Google founders Larry Page and Sergey Brin in 1995.
Led to a functional prototype named Google in 1998.
Still provides the basis for all of Google's web search tools.

34 MapReduce Deficiencies
Database management:
Not optimized for database integration
Does not provide traditional DBMS features
Lacks support for standard DBMS tools
No schema, and no separation of the schema from the application program
No indexes

