Software Systems Development


Software Systems Development: Map-Reduce, Hadoop, HBase

The problem
- Batch (offline) processing of huge data sets using commodity hardware
- Linear scalability
- Need infrastructure that handles all the mechanics, allowing the developer to focus on the processing logic/algorithms

Data Sets
- The New York Stock Exchange: 1 Terabyte of data per day
- Facebook: 100 billion photos, 1 Petabyte (1000 Terabytes)
- Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month
- Can't fit the data on a single node; need a distributed file system to hold it

Batch processing
- Write once (or append), read many times
- Example: analyze log files for the most frequent URL
- Each data entry is self-contained
- At each step, each data entry can be treated individually
- After aggregation, each aggregated data set can be treated individually

Grid Computing
- Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
- Works well for computation-intensive tasks; struggles with huge data sets, as the network becomes a bottleneck
- Programming paradigm: low-level Message Passing Interface (MPI)

Hadoop
- Open-source implementation of 2 key ideas:
  - HDFS: Hadoop Distributed File System
  - Map-Reduce: programming model
- Built based on Google infrastructure (GFS and Map-Reduce papers published in 2003/2004)
- Java/Python/C interfaces; several projects built on top of it

Approach
- A limited but simple model that fits a broad range of applications
- Handle communication, redundancy, and scheduling in the infrastructure
- Move computation to data instead of moving data to computation

Who is using Hadoop?

Distributed File System (HDFS)
- Files are split into large blocks (128 MB or 64 MB)
  - Compare with a typical FS block of 512 bytes
- Blocks are replicated among Data Nodes (DN): 3 copies by default
- Name Node (NN) keeps track of files and their pieces
  - Single master node
- Stream-based I/O, sequential access
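The block-and-replica numbers above can be illustrated with a little arithmetic (a plain-Python sketch, not part of the Hadoop API; the default sizes follow the slide):

```python
# Rough arithmetic for HDFS block storage (illustrative sketch, not Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # default replication factor

def block_count(file_size_bytes):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return -(-file_size_bytes // BLOCK_SIZE)   # ceiling division

one_gb = 1024 ** 3
blocks = block_count(one_gb)     # a 1 GB file -> 8 blocks
copies = blocks * REPLICATION    # 24 block replicas spread across Data Nodes
```

So a 1 GB file costs 3 GB of raw disk across the cluster, which is the price paid for surviving Data Node failures.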

HDFS: File Read

HDFS: File Write

HDFS: Data Node Distance

Map Reduce
- A programming model
- Decompose a processing job into Map and Reduce stages
- The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest

Map-Reduce Model

MAP function
- Map each data entry into a <key, value> pair
- Examples:
  - Map each log file entry into <URL, 1>
  - Map a day's stock trading record into <STOCK, price>
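The log-file example above can be sketched as a plain Python function (a stand-in for a Hadoop mapper, not the Hadoop API; the log line format, with the URL as the first field, is assumed):

```python
def map_log_entry(line):
    """Emit a <URL, 1> pair for one log line (URL assumed to be the first field)."""
    url = line.split()[0]
    return (url, 1)

pairs = [map_log_entry(l) for l in [
    "/index.html 200",
    "/about.html 200",
    "/index.html 304",
]]
# pairs == [("/index.html", 1), ("/about.html", 1), ("/index.html", 1)]
```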

Hadoop: Shuffle/Merge phase
- Hadoop merges (shuffles) the output of the MAP stage into <key, value1, value2, value3>
- Examples:
  - <URL, 1, 1, 1, 1, 1, 1>
  - <STOCK, price on day 1, price on day 2, ...>
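The merge step above amounts to grouping pairs by key; a plain-Python stand-in for Hadoop's shuffle (not the real implementation, which sorts and moves data between nodes):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group <key, value> pairs into <key, [value1, value2, ...]>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

merged = shuffle([("/index.html", 1), ("/about.html", 1), ("/index.html", 1)])
# merged == {"/index.html": [1, 1], "/about.html": [1]}
```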

Reduce function
- Reduce the entries produced by the Hadoop merge processing into a <key, value> pair
- Examples:
  - Reduce <URL, 1, 1, 1> into <URL, 3>
  - Reduce <STOCK, 3, 2, 10> into <STOCK, 10>
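Both reduce examples above, sketched as plain Python functions (stand-ins for Hadoop reducers, not the Hadoop API):

```python
def reduce_count(key, values):
    """Reduce <URL, 1, 1, 1> into <URL, 3> by summing the counts."""
    return (key, sum(values))

def reduce_max(key, values):
    """Reduce <STOCK, 3, 2, 10> into <STOCK, 10> by taking the maximum."""
    return (key, max(values))

url_total = reduce_count("URL", [1, 1, 1])    # ("URL", 3)
stock_max = reduce_max("STOCK", [3, 2, 10])   # ("STOCK", 10)
```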

Map-Reduce Flow

Hadoop Infrastructure
- Replicate/distribute data among the nodes
  - Input
  - Output
  - Map/Shuffle output
- Schedule processing
  - Partition data
  - Assign processing nodes (PN)
  - Move code to the PN (e.g., send Map/Reduce code)
- Manage failures (block CRC; rerun Map/Reduce if necessary)

Example: Trading Data Processing
- Input: historical stock data
  - Records are in a CSV (comma-separated values) text file
  - Each line: stock_symbol, low_price, high_price
  - 1987-2009 data for all stocks, one record per stock per day
- Output: maximum interday delta for each stock
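Before looking at the Hadoop version on the following slides, the whole job can be sketched as a local simulation in plain Python (the three-column CSV layout follows the slide; the sample records and ticker symbols are made up for illustration):

```python
from collections import defaultdict

def map_record(line):
    """Map a CSV line 'stock_symbol,low_price,high_price' to <symbol, high - low>."""
    symbol, low, high = line.split(",")
    return (symbol, float(high) - float(low))

def max_delta(lines):
    """Simulate map -> shuffle -> reduce: maximum interday delta per stock."""
    grouped = defaultdict(list)
    for line in lines:
        symbol, delta = map_record(line)
        grouped[symbol].append(delta)           # shuffle: group deltas by symbol
    return {s: max(deltas) for s, deltas in grouped.items()}  # reduce: take max

records = [
    "IBM,100.0,103.0",
    "IBM,101.0,109.0",
    "GOOG,500.0,510.0",
]
result = max_delta(records)
# result == {"IBM": 8.0, "GOOG": 10.0}
```

On a cluster, Hadoop runs the map calls in parallel near the data blocks and performs the grouping during the shuffle; only the two small functions change hands.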

Map Function: Part I

Map Function: Part II

Reduce Function

Running the Job : Part I

Running the Job: Part II

Inside Hadoop

Datastore: HBASE
- Distributed column-oriented database on top of HDFS
- Modeled after Google's BigTable data store
- Random reads/writes on top of the sequential, stream-oriented HDFS
- Billions of rows * millions of columns * thousands of versions

HBASE: Logical View

Row Key        Time Stamp  Column "contents:"  Column Family "anchor:"          Column "mime:"
"com.cnn.www"  T9                              "cnnsi.com" -> "cnn.com/1"
               T8                              "my.look.ca" -> "cnn.com/2"
               T6          "<html>.."                                           "text/html"
               T5          "<html>.."
               T3          "<html>.."
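The logical view above can be modeled as a sparse, versioned map: row key -> column -> {timestamp: value}. A toy in-memory sketch of the data model (not the HBase API; cell values and timestamps follow the slide):

```python
# Toy model of the HBase data model: row key -> column -> {timestamp: value}.
table = {
    "com.cnn.www": {
        "anchor:cnnsi.com":  {9: "cnn.com/1"},
        "anchor:my.look.ca": {8: "cnn.com/2"},
        "contents:":         {6: "<html>..", 5: "<html>..", 3: "<html>.."},
        "mime:":             {6: "text/html"},
    }
}

def get(row, column, table=table):
    """Return the newest version of a cell, as a default HBase read would."""
    versions = table[row][column]
    return versions[max(versions)]       # highest timestamp wins

latest = get("com.cnn.www", "contents:")
# latest == "<html>.."
```

The table is sparse: a missing column for a row costs nothing, which is what makes "millions of columns" feasible.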

Physical View

Column family "contents:"
Row Key        Time Stamp  Column "contents:"
"com.cnn.www"  T6          "<html>.."
               T5          "<html>.."
               T3          "<html>.."

Column family "anchor:"
Row Key        Time Stamp  Column Family "anchor:"
"com.cnn.www"  T9          "cnnsi.com" -> "cnn.com/1"
               T8          "my.look.ca" -> "cnn.com/2"

Column family "mime:"
Row Key        Time Stamp  Column "mime:"
"com.cnn.www"  T6          "text/html"

HBASE: Region Servers
- Tables are split into horizontal regions
- Each region comprises a subset of rows
- Master/worker roles, by layer:
  - HDFS: NameNode, DataNode
  - MapReduce: JobTracker, TaskTracker
  - HBASE: Master Server, Region Server

HBASE Architecture

HBASE vs RDBMS
- HBase tables are similar to RDBMS tables, with a few differences:
  - Rows are sorted by row key
  - Only cells are versioned
  - Columns can be added on the fly by the client, as long as the column family they belong to preexists
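Two of the differences above, row-key ordering and on-the-fly columns, can be sketched with the same toy in-memory model (not the HBase API; row keys and column names are made up for illustration):

```python
# Toy model: row key -> {column: value}. Illustrates two HBase-vs-RDBMS points:
# rows come back in row-key order, and columns need no schema change.
rows = {}

def put(row_key, column, value):
    """Write a cell; the column springs into existence on first write."""
    rows.setdefault(row_key, {})[column] = value

put("com.cnn.www", "anchor:cnnsi.com", "cnn.com/1")
put("com.apple.www", "contents:", "<html>..")
put("com.cnn.www", "anchor:new.site", "cnn.com/9")  # no ALTER TABLE needed

scan_order = sorted(rows)   # an HBase scan returns rows in row-key order
# scan_order == ["com.apple.www", "com.cnn.www"]
```

Sorted row keys are what make range scans cheap; in a real table, reversed-domain keys like "com.cnn.www" keep pages of one site adjacent on disk.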