
1 SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase

2 The problem
- Batch (offline) processing of huge data sets using commodity hardware
- Linear scalability
- Need infrastructure that handles all the mechanics, so the developer can focus on the processing logic/algorithms

3 Data Sets
- The New York Stock Exchange: 1 terabyte of data per day
- Facebook: 100 billion photos, about 1 petabyte (1,000 terabytes)
- Internet Archive: 2 petabytes of data, growing by 20 terabytes per month
- Data of this size cannot fit on a single node; a distributed file system is needed to hold it

4 Batch processing
- Single write/append, multiple reads
- Example: analyze log files for the most frequent URL
- Each data entry is self-contained
- At each step, each data entry can be treated individually
- After aggregation, each aggregated data set can be treated individually

5 Grid Computing
- Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
- Works well for computation-intensive tasks; struggles with huge data sets, as the network becomes a bottleneck
- Programming paradigm: low-level Message Passing Interface (MPI)

6 Hadoop
- Open-source implementation of two key ideas
  - HDFS: Hadoop Distributed File System
  - Map-Reduce: a programming model
- Built on Google's infrastructure designs (the GFS and Map-Reduce papers, published 2003/2004)
- Java/Python/C interfaces; several projects built on top of it

7 Approach
- A limited but simple model that fits a broad range of applications
- The infrastructure handles communication, redundancy, and scheduling
- Move computation to the data instead of moving data to the computation

8 Who is using Hadoop?

9 Distributed File System (HDFS)
- Files are split into large blocks (128 MB or 64 MB; compare with a typical FS block of 512 bytes)
- Blocks are replicated among Data Nodes (DN); 3 copies by default
- The Name Node (NN) keeps track of files and their blocks; single master node
- Stream-based I/O; sequential access
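The block-splitting and replication described above can be sketched locally. This is an illustration of the idea, not the real HDFS API; the function names and the round-robin placement are assumptions (actual HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size
REPLICATION = 3                 # 3 copies by default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the list of (offset, length) blocks for a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def assign_replicas(blocks, data_nodes, replication=REPLICATION):
    """Round-robin placement sketch; real HDFS placement is rack-aware."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [data_nodes[(i + r) % len(data_nodes)]
                            for r in range(replication)]
    return placement
```

A 300 MB file, for instance, becomes three blocks: two full 128 MB blocks and a 44 MB remainder, each stored on three Data Nodes.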

10 HDFS: File Read

11 HDFS: File Write

12 HDFS: Data Node Distance

13 Map Reduce
- A programming model
- Decompose a processing job into Map and Reduce stages
- The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest

14 Map-Reduce Model

15 MAP function
- Map each data entry into a <key, value> pair
- Examples
  - Map each log file entry into <URL, 1>
  - Map a day's stock trading record into <stock_symbol, price_delta>
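The log-file example above can be sketched as a plain function. The log format here is an assumption made for illustration (timestamp, client IP, URL, status); a real Map task would receive lines from HDFS via Hadoop rather than be called directly.

```python
def map_log_entry(line):
    """Sketch of a Map function: emit a (URL, 1) pair for one
    access-log line. Assumed format: 'timestamp client_ip url status'."""
    fields = line.split()
    url = fields[2]
    return (url, 1)
```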

16 Hadoop: Shuffle/Merge phase
- Hadoop merges (shuffles) the output of the MAP stage into <key, <list of values>> pairs
- Example: <URL, <1, 1, 1, ...>>
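The grouping that the shuffle/merge phase performs can be modeled in a few lines. This is a local sketch of the behavior, not Hadoop's actual (distributed, sorted, spilled-to-disk) implementation.

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    """Group Map output (key, value) pairs into key -> [values],
    mimicking what Hadoop's shuffle/merge phase hands to Reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(grouped)
```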

17 Reduce function
- Reduce the <key, <list of values>> entries produced by Hadoop's merge processing into a <key, value> pair
- Example: reduce <URL, <1, 1, 1>> into <URL, 3>
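For the URL-counting example, the Reduce step is just a sum over the grouped values. Again a local sketch; in Hadoop the framework invokes the Reduce function once per key.

```python
def reduce_url_counts(url, counts):
    """Sketch of a Reduce function: sum the per-URL counts
    emitted by Map, yielding <URL, total_count>."""
    return (url, sum(counts))
```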

18 Map-Reduce Flow

19 Hadoop Infrastructure
- Replicates/distributes data among the nodes: input, output, Map/shuffle output
- Schedules processing: partitions data, assigns processing nodes (PN)
- Moves code to the PN (e.g., sends Map/Reduce code)
- Manages failures (block CRC; reruns Map/Reduce if necessary)

20 Example: Trading Data Processing
- Input: historical stock data
  - Records are in CSV (comma-separated values) text files
  - Each line: stock_symbol, low_price, high_price
  - 1987-2009 data for all stocks, one record per stock per day
- Output: maximum interday delta for each stock

21 Map Function: Part I

22 Map Function: Part II

23 Reduce Function

24 Running the Job : Part I

25 Running the Job: Part II
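The Map, Reduce, and job-driver slides above are not reproduced in this transcript. As a stand-in, here is a minimal local Python sketch of the same job, assuming the CSV format from slide 20 and reading "interday delta" as high_price minus low_price for a day's record; the driver below plays the role of Hadoop's map -> shuffle -> reduce flow on a single machine.

```python
from collections import defaultdict

def map_record(line):
    """Map: one CSV record 'stock_symbol,low_price,high_price'
    becomes <symbol, high - low>, the delta for that day."""
    symbol, low, high = line.split(",")
    return (symbol, float(high) - float(low))

def reduce_max_delta(symbol, deltas):
    """Reduce: the maximum delta seen for one symbol."""
    return (symbol, max(deltas))

def run_job(lines):
    """Local driver standing in for Hadoop's map/shuffle/reduce."""
    grouped = defaultdict(list)
    for line in lines:
        symbol, delta = map_record(line)
        grouped[symbol].append(delta)
    return dict(reduce_max_delta(s, d) for s, d in grouped.items())
```

On a real cluster the same Map and Reduce logic would be packaged for Hadoop (e.g., as Java classes or via Hadoop Streaming) and the framework would run the driver's role across the Data Nodes.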

26 Inside Hadoop

27 Datastore: HBASE
- Distributed, column-oriented database on top of HDFS
- Modeled after Google's BigTable data store
- Random reads/writes on top of the sequential, stream-oriented HDFS
- Billions of rows x millions of columns x thousands of versions

28 HBASE: Logical View

Row Key         Time Stamp   Column "contents"   Column family "anchor" (referred by/to)   Column "mime"
"com.cnn.www"   T9                               cnnsi.com = "cnn.com/1"
                T8                               my.look.ca = "cnn.com/2"
                T6           "..."                                                         "text/html"
                T5           "..."
                T3           "..."

29 Physical View

Row Key         Time Stamp   Column "contents"
"com.cnn.www"   T6           "..."
                T5           "..."
                T3           "..."

Row Key         Time Stamp   Column family "anchor"
"com.cnn.www"   T9           cnnsi.com = "cnn.com/1"
                T5           my.look.ca = "cnn.com/2"

Row Key         Time Stamp   Column "mime"
"com.cnn.www"   T6           "text/html"
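The versioned-cell model in the views above can be sketched with a plain dictionary: each cell is addressed by (row key, column, timestamp), and a read with no timestamp returns the newest version. This illustrates the data model only; it is not the HBase client API.

```python
def put(store, row, column, timestamp, value):
    """Write one versioned cell into a dict keyed by (row, column);
    each cell keeps multiple timestamped versions, as HBase does."""
    store.setdefault((row, column), {})[timestamp] = value

def get_latest(store, row, column):
    """A read without a timestamp returns the newest version."""
    versions = store.get((row, column), {})
    return versions[max(versions)] if versions else None
```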

30 HBASE: Region Servers
- Tables are split into horizontal regions; each region comprises a subset of rows
- Analogous master/worker pairs across the stack:
  - HDFS: NameNode, DataNode
  - MapReduce: JobTracker, TaskTracker
  - HBASE: Master Server, Region Server
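Because rows are sorted by row key and each region covers a contiguous key range, finding the region for a row is a binary search over the regions' start keys. A minimal sketch of that lookup (illustrative only; the real client consults HBase metadata):

```python
import bisect

def find_region(start_keys, row_key):
    """Return the index of the region whose key range contains
    row_key. start_keys must be sorted; the first region's start
    key is "" so every row key falls into some region."""
    return bisect.bisect_right(start_keys, row_key) - 1
```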

31 HBASE Architecture

32 HBASE vs RDBMS
- HBase tables are similar to RDBMS tables, with differences:
  - Rows are sorted by row key
  - Only cells are versioned
  - Columns can be added on the fly by the client, as long as the column family they belong to already exists

