Data processing with Hadoop

Data processing with Hadoop Map/Reduce

Agenda
Map/Reduce General Form
Java Implementation of Map/Reduce
Input Format and its types
Output Format
Counters
Side Data Distribution
Map/Reduce Library classes
Shuffle and Sort
Joins

Map/Reduce General Form
The generic form of the map and reduce functions is as follows:
map(key1, val1) returns list(key2, val2)
reduce(key2, list(val2)) returns list(key3, val3)
The mapper emits intermediate key-value pairs, which are grouped by key and pulled by the reducer (if one is used). The reducer takes each key together with its list of values and aggregates the values. Another layer can be added to the Map/Reduce stack as well: combiners. Combiners are an optimization step and act like reducers, running locally on the map output before it is sent over the network.
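To make the generic form concrete, here is a minimal word-count mapper and reducer sketched against the org.apache.hadoop.mapreduce API; the class names TokenizingMapper and SummingReducer are illustrative and not taken from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(key1, val1) -> list(key2, val2): byte offset + line in, (word, 1) pairs out
public class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // emit (word, 1)
        }
    }
}

// reduce(key2, list(val2)) -> list(key3, val3): aggregate the counts per word
class SummingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
    }
}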

Java implementation of Map/Reduce
The Java implementation uses generics, which allows keys and values of various types to be passed to the mapper and reducer functions. When Hadoop receives a job request, it determines whether the job can run in a single local JVM (Java Virtual Machine); if not, it reserves JVM instances on other nodes to run the map and reduce tasks in parallel. Using streams, the JVM can also pull data from sources outside of the Java environment. The size of the job is determined by the number of mappers required to process the data; this is driven by the number of input splits, while the number of reducers can be configured explicitly (it defaults to a single reduce task).
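A matching driver class, sketched under the assumption that the TokenizingMapper and SummingReducer classes above are on the job's classpath; the class name WordCountDriver and the command-line paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizingMapper.class);
        job.setCombinerClass(SummingReducer.class);   // optional combiner optimization
        job.setReducerClass(SummingReducer.class);

        job.setOutputKeyClass(Text.class);            // key2/key3 type
        job.setOutputValueClass(IntWritable.class);   // val2/val3 type

        job.setNumReduceTasks(2);                     // reducers are configurable; the default is 1

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}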

MapReduce Input Formats
An input split is the chunk of the input that is processed by a single map task; each map processes exactly one split. Each split is divided into records, and the map processes each record, a key-value pair, in turn. Input splits are represented by the Java class InputSplit, which has two methods: getLength() returns the length of the split, and getLocations() returns the locations (hosts) where the split's data is stored.
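For reference, the relevant part of org.apache.hadoop.mapreduce.InputSplit looks roughly like this simplified sketch:

import java.io.IOException;

public abstract class InputSplit {
    // Length of the split in bytes; used to order splits so the largest are processed first
    public abstract long getLength() throws IOException, InterruptedException;

    // Hostnames holding the split's data; used to place map tasks close to the data
    public abstract String[] getLocations() throws IOException, InterruptedException;
}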

Types of Input Format
1. File Input Format - the most commonly used family of input formats. Its main classes are:
FileInputFormat - the base class for all implementations of InputFormat that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.
CombineFileInputFormat - packs many files into each split so that each mapper has more to process. Hadoop handles a small number of large files better than a large number of small files.
WholeFileInputFormat - a format where the keys are not used and the values are the file contents. It takes a FileSplit and converts the whole file into a single record.
2. Multiple Inputs - handles multiple inputs of different formats (and, if needed, a different mapper per input).
3. Database Input - an input format for reading data from a relational database.

4. Text Input
TextInputFormat - the default input format. Every line of input is a record; the key is the byte offset from the beginning of the file and the value is the contents of the line.
KeyValueTextInputFormat - handles files where each line is a key-value pair separated by a delimiter.
NLineInputFormat - handles a fixed number of lines of the input text file per split (and hence per mapper).
5. Binary Input
SequenceFileInputFormat - handles Hadoop's sequence file format, which consists of binary key-value pairs.
SequenceFileAsTextInputFormat - converts the sequence file's keys and values to Text objects, making it suitable for Streaming.
SequenceFileAsBinaryInputFormat - retrieves the sequence file's keys and values as opaque binary objects.
FixedLengthInputFormat - used for reading fixed-width binary records from a file where the records are not separated by delimiters.
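The input format for a job is set on the Job object inside the driver; a brief sketch, where the paths and the LogMapper/CsvMapper classes are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Inside the driver, after Job job = Job.getInstance(conf):

// A single input format for the whole job
job.setInputFormatClass(KeyValueTextInputFormat.class);

// NLineInputFormat: fix the number of input lines each mapper receives
NLineInputFormat.setNumLinesPerSplit(job, 1000);

// MultipleInputs: a different format (and mapper) per input path
MultipleInputs.addInputPath(job, new Path("/data/logs"),
        TextInputFormat.class, LogMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/csv"),
        KeyValueTextInputFormat.class, CsvMapper.class);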

Output Formats
An output format determines the implementation of RecordWriter used for writing the output files. Types:
TextOutputFormat - the default format; writes key-value pairs on individual lines of text.
SequenceFileOutputFormat - writes serialized output that can be deserialized with SequenceFileInputFormat.
SequenceFileAsBinaryOutputFormat - writes keys and values to a sequence file in raw binary format.
MapFileOutputFormat - writes the output as map files (sorted, indexed sequence files).
MultipleOutputs - writes to multiple files whose names are derived from the keys (or values).
LazyOutputFormat - creates an output file only when the first record is emitted for it.
DBOutputFormat - prepares batched SQL statements and writes the output to a relational database.
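Output formats are likewise configured on the Job in the driver; a short sketch, with the output path and the named output "summary" as illustrative assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Plain text output (the default):
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/out/wordcount"));

// Only create output files once a record is actually written:
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

// Register an extra named output that a reducer can write to via MultipleOutputs:
MultipleOutputs.addNamedOutput(job, "summary",
        SequenceFileOutputFormat.class, Text.class, IntWritable.class);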

Counters
Counters keep track of how many times an event occurs.
Built-in counters:
Task counters - keep track of information about tasks (e.g. MAP_INPUT_RECORDS, REDUCE_INPUT_RECORDS)
Filesystem counters - number of bytes read/written (BYTES_READ, BYTES_WRITTEN)
Job counters - number of tasks launched (TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES)
Custom counters - defined by Java enums:
public enum DIAGNOSIS_COUNTER { PULMONARY_DISEASE, BRONCHITIS_ASTHMA };
if (<some condition>) {
    context.getCounter(DIAGNOSIS_COUNTER.PULMONARY_DISEASE).increment(1);
}
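After the job completes, the driver can read these counters back; a small sketch assuming the DIAGNOSIS_COUNTER enum above and the job object from the driver:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.TaskCounter;

// In the driver, after job.waitForCompletion(true):
Counters counters = job.getCounters();

// Custom counter defined by the enum above
Counter pulmonary = counters.findCounter(DIAGNOSIS_COUNTER.PULMONARY_DISEASE);
System.out.println("Pulmonary disease records: " + pulmonary.getValue());

// Built-in task counter
long mapInputRecords =
        counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
System.out.println("Map input records: " + mapInputRecords);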

Side Data Distribution
Side data is the extra read-only data needed by a job to process the main dataset. There are two ways to make side data available:
1. Job configuration
Use the job's getConfiguration() method to set key-value pairs; arbitrary objects can be serialized with Hadoop's Stringifier class.
Great for passing small amounts of data.
Larger data (more than a few kilobytes) can cause memory pressure, since the configuration is read by every daemon and task.
2. Distributed cache
Hadoop copies the specified files and archives to the distributed filesystem and from there to the local disk of each task node - the cache.
The node manager maintains a reference count of the tasks using each localized file. After the tasks that use a file have run, it becomes eligible for deletion and is deleted when the cache exceeds its size limit (10 GB by default).
The distributed cache API uses the Job class.
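A sketch of the distributed cache API on the Job class; the file names and the "stations" symlink are illustrative assumptions:

import java.net.URI;

// Driver side (in a main that throws Exception): register side data with the job
job.addCacheFile(new URI("/lookup/stations.txt#stations"));   // '#stations' names the local symlink
job.addCacheArchive(new URI("/lookup/reference-data.tar.gz"));

// Task side, e.g. in Mapper.setup(): the cached file is available on the local disk
// protected void setup(Context context) throws IOException {
//     try (BufferedReader reader = new BufferedReader(new FileReader("stations"))) {
//         // ... load the lookup table into memory ...
//     }
// }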

MapReduce Library classes
Hadoop comes with a library of ready-made mappers and reducers for commonly used functions, for example TokenCounterMapper (emits each token of the input value with a count of 1), InverseMapper (swaps keys and values), RegexMapper (emits matches of a regular expression), ChainMapper/ChainReducer (run a chain of mappers within a single map or reduce task), and IntSumReducer/LongSumReducer (sum the integer or long values for each key).
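For instance, a word-count job can be wired entirely from library classes; a minimal sketch, reusing a Configuration object named conf as in the driver above:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

Job job = Job.getInstance(conf, "library word count");
job.setMapperClass(TokenCounterMapper.class);   // emits (token, 1) for every token
job.setCombinerClass(IntSumReducer.class);      // local aggregation on the map side
job.setReducerClass(IntSumReducer.class);       // final per-word sums
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);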

Shuffle and Sort
The process by which the system sorts the map outputs and transfers them to the reducers as inputs is known as the shuffle. It has a map side and a reduce side.
Map side:
• writes are buffered in memory
• the buffer contents are spilled to disk
• spills are merged into one output file, partitioned and sorted
Reduce side:
• copy phase: the reducer copies map outputs as each map completes
• sort phase (merge phase): merges the map outputs
• reduce phase: the reduce function runs and the output is written to HDFS
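User code can influence the shuffle through the partitioner and the sort/grouping comparators; a sketch in which KeyComparator and GroupComparator are hypothetical RawComparator implementations:

import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Which reducer each (key, value) pair goes to (HashPartitioner is the default)
job.setPartitionerClass(HashPartitioner.class);

// How keys are ordered within each partition before the reduce phase
job.setSortComparatorClass(KeyComparator.class);

// Which keys are considered equal when grouping values for a single reduce() call
job.setGroupingComparatorClass(GroupComparator.class);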

Shuffle and Sort
(Figure: the shuffle and sort data flow, from Hadoop: The Definitive Guide, Tom White, p. 163.)

Joins
Joins are used in MapReduce to combine large datasets; two different large datasets can be joined in MapReduce programming as well. A join performed in the map phase is called a map-side join, while a join performed at the reduce side is called a reduce-side join.
Why would we need to join data in MapReduce? Suppose dataset A holds master data and dataset B holds transactional data (A and B are just for reference); we need to join them on a common key to produce the result. It is important to realize that if the master dataset is small, we can share it with the side data distribution techniques described above (passing key-value pairs in the job configuration or using the distributed cache). A MapReduce join is only needed when both datasets are too big for those techniques.

Map-side join and reduce-side join
Map-side join:
A map-side join is a feature of MapReduce where the join is performed by the mapper; the join happens before the data reaches the map function. For this to work, the data must be partitioned and sorted in a particular way:
• each input dataset must be divided into the same number of partitions
• each must be sorted by the same (join) key
• all the records for a particular key must reside in the same partition
Reduce-side join:
A reduce-side join, also called a repartition join or repartitioned sort-merge join, is the most commonly used join type. The join is performed on the reduce side, so the data has to go through the sort and shuffle phase, which incurs network overhead. A reduce-side join uses a few terms: data source, tag, and group key.
• Data source refers to the source files of the data, typically taken from an RDBMS.
• Tag is used to tag every record with its source name, so that the source can be identified at any point, whether in the map or the reduce phase.
• Group key refers to the column used as the join key between the two data sources.
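A minimal reduce-side join sketch, assuming two hypothetical inputs (customer records and order records, both keyed by a customer ID in the first comma-separated field): each mapper tags its records with a source name, and the reducer joins the records that share a group key. The two mappers would be attached to their respective input paths with MultipleInputs, as in the input format sketch above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Tags customer records; assumes lines like "custId,name,..."
public class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", 2);
        context.write(new Text(fields[0]), new Text("CUST\t" + fields[1])); // (group key, tagged record)
    }
}

// Tags order records; assumes lines like "custId,orderId,amount"
class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", 2);
        context.write(new Text(fields[0]), new Text("ORDER\t" + fields[1]));
    }
}

// Joins each customer's record with all of that customer's orders
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String customer = null;
        List<String> orders = new ArrayList<>();
        for (Text v : values) {
            String[] tagged = v.toString().split("\t", 2);
            if ("CUST".equals(tagged[0])) {
                customer = tagged[1];
            } else {
                orders.add(tagged[1]);
            }
        }
        if (customer != null) {
            for (String order : orders) {          // inner join: customer record x each order
                context.write(key, new Text(customer + "\t" + order));
            }
        }
    }
}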