Software Systems Development


Presentation on theme: "Software Systems Development"— Presentation transcript:

1 Software Systems Development
Map-Reduce, Hadoop, HBase

2 The problem
Batch (offline) processing of huge data sets on commodity hardware
Linear scalability
Need infrastructure that handles all the mechanics, so the developer can focus on the processing logic/algorithms

3 Data Sets
The New York Stock Exchange: 1 terabyte of data per day
Facebook: 100 billion photos, 1 petabyte (1,000 terabytes)
Internet Archive: 2 petabytes of data, growing by 20 terabytes per month
Data this size cannot fit on a single node; a distributed file system is needed to hold it

4 Batch processing
Single write/append, multiple reads
Example: analyze log files for the most frequent URL
Each data entry is self-contained
At each step, each data entry can be treated individually
After aggregation, each aggregated data set can be treated individually

5 Grid Computing
Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
Works well for computation-intensive tasks; struggles with huge data sets because the network becomes a bottleneck
Programming paradigm: low-level Message Passing Interface (MPI)

6 Hadoop
Open-source implementation of two key ideas
  HDFS: Hadoop Distributed File System
  Map-Reduce: programming model
Built on ideas from Google's infrastructure (GFS and Map-Reduce papers published 2003/2004)
Java/Python/C interfaces; several projects are built on top of it

7 Approach
A limited but simple model that fits a broad range of applications
Communications, redundancy, and scheduling are handled in the infrastructure
Move computation to the data instead of moving data to the computation

8 Who is using Hadoop?

9 Distributed File System (HDFS)
Files are split into large blocks (128 MB, 64 MB)
  Compare with a typical file system block of 512 bytes
Blocks are replicated among Data Nodes (DN)
  3 copies by default
Name Node (NN) keeps track of files and their pieces
  Single master node
Stream-based I/O
  Sequential access
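The block and replication figures on this slide translate into simple arithmetic. The sketch below is a rough illustration only: the function name `hdfs_footprint` is hypothetical, and it assumes the 128 MB block size and the default of 3 copies quoted above.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the block sizes quoted on the slide
REPLICATION = 3                 # default number of copies quoted on the slide


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Hypothetical helper for illustration; it is not part of any Hadoop API.
    """
    # Integer ceiling division: a partial final block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    replicas = blocks * replication
    # A partial final block only takes as much space as the data it holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


one_gib = 1024 ** 3
print(hdfs_footprint(one_gib))  # (8, 24, 3221225472)
```

With these assumptions a 1 GiB file becomes 8 blocks and 24 block replicas for the Name Node to track, which is why large blocks keep the master's bookkeeping small.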

10 HDFS: File Read

11 HDFS: File Write

12 HDFS: Data Node Distance

13 Map Reduce
A programming model
Decompose a processing job into Map and Reduce stages
The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest

14 Map-Reduce Model

15 MAP function
Map each data entry into a <key, value> pair
Examples
  Map each log file entry into <URL, 1>
  Map each day's stock trading record into <STOCK, Price>
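The two map examples can be sketched as plain Python functions. The input formats are assumptions made for illustration (a whitespace-separated log line with the URL in the second field, and a `symbol,price` trading record); the transcript does not specify them.

```python
def map_log_entry(line):
    """Map one log line to a (URL, 1) pair.

    Assumes a hypothetical log format: "<timestamp> <URL> <status>".
    """
    fields = line.split()
    url = fields[1]
    return (url, 1)


def map_trade_record(line):
    """Map one daily record "<symbol>,<price>" (assumed format) to (symbol, price)."""
    symbol, price = line.split(",")
    return (symbol.strip(), float(price))


print(map_log_entry("2011-01-01T00:00:00 /index.html 200"))  # ('/index.html', 1)
print(map_trade_record("IBM, 120.5"))                        # ('IBM', 120.5)
```

Each call looks at exactly one entry and nothing else, which is what lets the framework run many map calls in parallel on different nodes.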

16 Hadoop: Shuffle/Merge phase
Hadoop merges (shuffles) the output of the MAP stage into <key, value1, value2, value3>
Examples
  <URL, 1, 1, 1, 1, 1, 1>
  <STOCK, Price on day 1, Price on day 2, ..>
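The merge step amounts to grouping map output by key. The function below is a single-process sketch of that grouping; the name `shuffle` is illustrative, and in Hadoop the same work is done by the framework across the network rather than by user code.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


map_output = [("/a", 1), ("/b", 1), ("/a", 1), ("/a", 1)]
print(shuffle(map_output))  # [('/a', [1, 1, 1]), ('/b', [1])]
```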

17 Reduce function
Reduce the entries produced by the Hadoop merge step into a <key, value> pair
Examples
  Reduce <URL, 1, 1, 1> into <URL, 3>
  Reduce <Stock, 3, 2, 10> into <Stock, 10>
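Both reduce examples are one-line aggregations over the values grouped under a key. A minimal sketch follows; the function names are hypothetical, and the stock example is read as "keep the maximum price", which matches the <Stock, 10> result.

```python
def reduce_count(key, values):
    """Sum the 1s emitted by the map stage to count occurrences of a key."""
    return (key, sum(values))


def reduce_max(key, values):
    """Keep the largest value seen for a key."""
    return (key, max(values))


print(reduce_count("URL", [1, 1, 1]))  # ('URL', 3)
print(reduce_max("Stock", [3, 2, 10]))  # ('Stock', 10)
```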

18 Map-Reduce Flow

19 Hadoop Infrastructure
Replicate/distribute data among the nodes
  Input
  Output
  Map/Shuffle output
Schedule processing
  Partition data
  Assign processing nodes (PN)
  Move code to PN (e.g. send Map/Reduce code)
Manage failures (block CRC, rerun Map/Reduce if necessary)
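One way to picture the "partition data" step is hashing each key to pick the reducer that will receive it, so that all values for a key end up in the same place. The sketch below is an assumption-laden illustration: the function names are hypothetical, and CRC32 stands in for whatever hash the framework actually applies to keys.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def assign(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("/a", 1), ("/b", 1), ("/a", 1), ("/c", 1)]
for index, bucket in enumerate(assign(pairs, 2)):
    print(index, bucket)
```

Because the hash depends only on the key, rerunning a failed map task sends its output to the same reducers as before.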

20 Example: Trading Data Processing
Input: historical stock data
  Records are in a CSV (comma-separated values) text file
  Each line: stock_symbol, low_price, high_price
  Data for all stocks, one record per stock per day
Output: maximum interday delta for each stock
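The Map and Reduce code for this example sits on slides whose contents are not in the transcript, so the following is only a single-process sketch of the job. It assumes the delta for a record is `high_price - low_price` and that the output is the largest such delta per stock; `map_record`, `reduce_max_delta`, and `run_job` are hypothetical names, and the sample prices are made up.

```python
from collections import defaultdict


def map_record(line):
    """Map "symbol,low,high" to (symbol, high - low). The delta definition is assumed."""
    symbol, low, high = (field.strip() for field in line.split(","))
    return (symbol, float(high) - float(low))


def reduce_max_delta(symbol, deltas):
    """Keep the largest delta observed for one stock."""
    return (symbol, max(deltas))


def run_job(lines):
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the input file
        symbol, delta = map_record(line)
        groups[symbol].append(delta)
    return dict(reduce_max_delta(s, d) for s, d in sorted(groups.items()))


# Made-up sample records, one per stock per day.
sample = ["IBM,100.0,104.0", "IBM,101.0,110.0", "AAPL,50.0,52.5"]
print(run_job(sample))  # {'AAPL': 2.5, 'IBM': 9.0}
```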

21 Map Function: Part I

22 Map Function: Part II

23 Reduce Function

24 Running the Job : Part I

25 Running the Job: Part II

26 Inside Hadoop

27 Datastore: HBASE
Distributed column-oriented database on top of HDFS
Modeled after Google's BigTable data store
Random reads/writes on top of sequential, stream-oriented HDFS
Billions of rows * millions of columns * thousands of versions

28 HBASE: Logical View
Row Key        | Time Stamp | Column: Contents | Column Family: Anchor (Referred by/to) | Column: mime
"com.cnn.www"  | T9         |                  | cnnsi.com   cnn.com/1                  |
               | T8         |                  | my.look.ca  cnn.com/2                  |
               | T6         | "<html>.."       |                                        | text/html
               | T5         |                  |                                        |
               | T3         |                  |                                        |
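The logical view can be read as a nested map: row key, then column, then timestamp, then value. The sketch below models that shape in plain Python. `VersionedTable` is a hypothetical class and not the HBase client API, the timestamps T9, T8, T6, T3 are written as the integers 9, 8, 6, 3, and the T3 value is a made-up placeholder because the transcript does not show it.

```python
class VersionedTable:
    """Toy model of the logical view: row -> column -> timestamp -> value."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, timestamp, value):
        # A write at a new timestamp adds a version instead of replacing older ones.
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, as_of=None):
        """Return the newest value at or before `as_of` (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        eligible = [ts for ts in versions if as_of is None or ts <= as_of]
        return versions[max(eligible)] if eligible else None


table = VersionedTable()
row = "com.cnn.www"
table.put(row, "anchor:cnnsi.com", 9, "cnn.com/1")
table.put(row, "anchor:my.look.ca", 8, "cnn.com/2")
table.put(row, "contents:", 6, "<html>..")
table.put(row, "mime:", 6, "text/html")
table.put(row, "contents:", 3, "<html>(older, made-up placeholder)")

print(table.get(row, "contents:"))           # newest version, written at T6
print(table.get(row, "contents:", as_of=5))  # falls back to the T3 version
print(table.get(row, "mime:", as_of=5))      # None: mime was first written at T6
```

Cells that were never written simply do not exist in the map, which is how a sparse table with mostly empty columns stays cheap to store.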

29 Physical View
Row Key        | Time Stamp | Column: Contents
"com.cnn.www"  | T6         | "<html>.."
               | T5         |
               | T3         |
Row Key        | Time Stamp | Column Family: Anchor
"com.cnn.www"  | T9         | cnnsi.com   cnn.com/1
               | T5         | my.look.ca  cnn.com/2
Row Key        | Time Stamp | Column: mime
"com.cnn.www"  | T6         | text/html

30 HBASE: Region Servers
Tables are split into horizontal regions
Each region comprises a subset of rows
HDFS: NameNode, DataNode
MapReduce: JobTracker, TaskTracker
HBase: Master Server, Region Server
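Because rows are kept in key order, splitting a table into regions comes down to choosing split keys, and finding the region for a row is a binary search over them. The sketch below illustrates that lookup; `RegionMap`, the split keys, and the server names are all hypothetical, and each region is assumed to cover the range from its start key (inclusive) up to the next split key (exclusive).

```python
import bisect


class RegionMap:
    """Toy lookup from row key to the server hosting that key's region."""

    def __init__(self, split_keys, servers):
        # split_keys: sorted row keys at which a new region starts.
        # servers: one server name per region, so one more than split_keys.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        # bisect_right makes a split key belong to the region it starts.
        return self.servers[bisect.bisect_right(self.split_keys, row_key)]


regions = RegionMap(["g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("com.cnn.www"))  # rs1
print(regions.locate("org.apache"))   # rs2
print(regions.locate("zzz"))          # rs3
```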

31 HBASE Architecture

32 HBASE vs RDBMS
HBase tables are similar to RDBMS tables, with some differences
  Rows are sorted by row key
  Only cells are versioned
  Columns can be added on the fly by a client, as long as the column family they belong to already exists
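The column rule can be sketched as a table whose column families are fixed at creation while qualifiers inside a family are free-form. `FamilyTable` below is a hypothetical toy and not the HBase client API, and versioning is left out to keep the focus on the family check and on row ordering.

```python
class FamilyTable:
    """Toy table: fixed column families, free-form columns inside each family."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row, family, qualifier, value):
        # New qualifiers are accepted on the fly; unknown families are rejected.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[(family, qualifier)] = value

    def scan(self):
        """Return rows in row-key order, as a scan over a sorted table would."""
        return sorted(self.rows.items())


t = FamilyTable(["anchor", "contents"])
t.put("com.cnn.www", "anchor", "cnnsi.com", "cnn.com/1")
t.put("com.bbc.www", "anchor", "a.brand.new.column", "x")  # column added on the fly
print([row for row, _ in t.scan()])  # ['com.bbc.www', 'com.cnn.www']

try:
    t.put("com.cnn.www", "mime", "", "text/html")
except KeyError as err:
    print("rejected:", err)
```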

