Based on the text by Jimmy Lin and Chris Dyer, and on the Yahoo tutorial on MapReduce at http://developer.yahoo.com/hadoop/tutorial/index.html


1 Based on the text by Jimmy Lin and Chris Dyer, and on the Yahoo tutorial on MapReduce at http://developer.yahoo.com/hadoop/tutorial/index.html

2 Namenode responsibilities:
1. Namespace management: file names, locations, access privileges, etc.
2. Coordinating client operations: directing clients to datanodes, garbage collection, etc.
3. Maintaining the overall health of the system: replication factor, replica balancing, etc.
4. The namenode itself does not take part in any computation.
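The division of labour on this slide can be illustrated with a toy, in-memory model. This is only a sketch: the class `ToyNamenode`, its methods, and the file, block and datanode names are hypothetical and are not HDFS APIs. It shows a namenode that holds metadata only (the namespace and the block locations), directs a client to datanodes, and checks replication health, without ever touching file contents.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNamenode:
    """Hypothetical stand-in for a namenode: stores metadata only, never file data."""
    replication_factor: int = 3
    # file name -> ordered list of block ids (the namespace)
    namespace: dict = field(default_factory=dict)
    # block id -> list of datanode names currently holding a replica
    block_locations: dict = field(default_factory=dict)

    def add_file(self, name, blocks):
        """Register a file; `blocks` maps block id -> datanodes holding it."""
        self.namespace[name] = list(blocks)
        for block_id, datanodes in blocks.items():
            self.block_locations[block_id] = list(datanodes)

    def locate(self, name):
        """Direct a client to datanodes: (block id, datanodes) pairs in file order."""
        return [(b, self.block_locations[b]) for b in self.namespace[name]]

    def under_replicated(self):
        """Health check: blocks with fewer replicas than the replication factor."""
        return [b for b, nodes in self.block_locations.items()
                if len(nodes) < self.replication_factor]


namenode = ToyNamenode(replication_factor=3)
namenode.add_file("/logs/day1", {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn4"],  # one replica missing (made-up example data)
})
locations = namenode.locate("/logs/day1")
needs_repair = namenode.under_replicated()
```

In this model the client would use `locations` to fetch each block directly from a datanode, which is consistent with item 4: the namenode is not on the data or computation path.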

3 • A MapReduce job uses individual files as the basic unit for splitting input data.
• Workloads are batch-oriented, dominated by long streaming reads and large sequential writes.
• Applications are aware of the distributed file system.
• The file system is deployed in an environment of cooperative users.
• See Figure 2.6 and understand it.
• Operations: (mapper, reducer) {combiner} [partitioner, shuffle and sort]. These operations have specific meanings in the MR context; you must understand them fully before using them.
• Finally, study the job configuration: the items you can specify declaratively, and how to specify these attributes.

4 • Module 4 in the Yahoo tutorial
• Read every line of the functional programming section
• Understand the mapper, the reducer and, most importantly, the driver method (job configuration)
• Module 5: read the details about the partitioner
• Metrics
• Monitoring: web-based monitoring is possible

5 • Figure 2.1: map and fold
• Map is a “transformation” function that can be carried out in parallel: it can work on the elements of a list independently.
• Fold is an “aggregation” function that has restrictions on data locality: the elements of the list must be brought together before the operation.
• For operations that are associative and commutative, significant performance gains can be achieved through local aggregation and sorting.
• The user specifies the map and reduce operations; the execution framework coordinates the execution of the programs and the data movement.
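Map and fold can be tried out directly with Python's built-in `map` and `functools.reduce`. The sketch below uses made-up numbers and a hypothetical split of the list into two chunks. It shows that, for an associative and commutative operation such as addition, folding each chunk locally and then folding the partial results gives the same answer as one global fold.

```python
from functools import reduce

values = [1, 2, 3, 4, 5, 6]

# Map: apply a transformation to each element independently (parallelizable).
squared = list(map(lambda x: x * x, values))

# Fold: aggregate all elements into one result, starting from an initial value.
total = reduce(lambda acc, x: acc + x, squared, 0)

# Local aggregation: because addition is associative and commutative, each
# chunk can be folded on its own and the partial results folded again.
chunks = [squared[:3], squared[3:]]  # hypothetical split across two workers
partials = [reduce(lambda acc, x: acc + x, chunk, 0) for chunk in chunks]
total_from_partials = reduce(lambda acc, x: acc + x, partials, 0)
```

Only the two partial sums need to be brought together, rather than all six elements, which is the source of the performance gain.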

6 • MapReduce imposes a key-value structure on the data
◦ Example 1:
◦ Example 2:
• map: (k1, v1) → [(k2, v2)]
• reduce: (k2, [v2]) → [(k3, v3)]
• Map generates intermediate key-value pairs; the framework implicitly applies a “group by” on the intermediate keys, and the keys arrive in sorted order within a given reducer.
• Each reducer's output is written to an external file.
• The reduce method is called once for each distinct intermediate key assigned to that reducer.
• A mapper with an identity reducer is essentially a sorter.
• A typical MapReduce job reads its input from the distributed file system and writes its output back to the same file system.
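The two signatures on this slide can be made concrete with a single-process word count. This is a minimal sketch, not Hadoop code: `run_job`, `map_fn`, `reduce_fn` and `identity_reducer` are hypothetical names, and the in-memory dictionary only stands in for the framework's group-by and sort.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """map: (k1, v1) -> [(k2, v2)]; emits (word, 1) for every word."""
    return [(word, 1) for word in text.lower().split()]


def reduce_fn(word, counts):
    """reduce: (k2, [v2]) -> [(k3, v3)]; sums the counts for one word."""
    return [(word, sum(counts))]


def identity_reducer(key, values):
    """Passes every value through unchanged, so the output is just sorted by key."""
    return [(key, v) for v in values]


def run_job(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, sort, reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # implicit "group by" on the intermediate key
    output = []
    for k2 in sorted(groups):  # keys reach the reducer in sorted order
        output.extend(reducer(k2, groups[k2]))  # one call per distinct key
    return output


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
word_counts = run_job(docs, map_fn, reduce_fn)

# A mapper paired with the identity reducer behaves like a sorter.
sorted_pairs = run_job(
    [("r1", "pear apple fig")],
    lambda k, v: [(w, k) for w in v.split()],
    identity_reducer,
)
```

Here `word_counts` holds one pair per distinct word, in key order, and `sorted_pairs` holds the mapper's output unchanged apart from being sorted by key.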

7 • Data storage: output from MR can go into a sparse, multi-dimensional table, called BigTable in Google’s system.
• The Apache open-source version is HBase.
• HBase is a column-oriented table: rows, and column families each with many columns.
• In a relational schema, data is stored normalized.
• Data in HBase is deliberately not normalized, by choice and by design.
• Column families are stored together, and the storage methods are optimized for this.
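One way to picture the data model on this slide is as a nested map: row key, then column family, then column, then value. The sketch below uses plain Python dictionaries. It is not the HBase client API, the `put` and `get` helpers and all row and column names are hypothetical, and it leaves out the timestamp versioning that real cells carry.

```python
from collections import defaultdict

# row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row, family, qualifier, value):
    """Store one cell; rows and families are created on demand."""
    table[row][family][qualifier] = value


def get(row, family, qualifier, default=None):
    """Look up a cell without creating empty entries for missing rows."""
    if row not in table or family not in table[row]:
        return default
    return table[row][family].get(qualifier, default)


# Sparse: each row stores only the columns it actually has.
put("com.example/index.html", "anchor", "cnn.com", "Example")
put("com.example/index.html", "contents", "html", "<html>...</html>")
put("org.sample/about", "contents", "html", "<html>about</html>")
```

Everything about a row is stored under that row's key instead of being split across joined tables, which is the sense in which the data is not normalized.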

8 • Scheduling is very interesting, since there are many tasks to manage.
• Transparent, policy-driven, predictable multi-user scheduling
• Speculative scheduling: because of the barrier between map and reduce, the map phase is only as fast as its slowest map task; this is how stragglers are managed.
• But how should skew in the data be handled? Through better local aggregation.
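The effect of a straggler, and of running a speculative backup copy, can be shown with simple arithmetic. The sketch below is a simplified model with made-up task times; `phase_time` and `with_speculation` are hypothetical helpers and do not describe any real scheduler's policy.

```python
def phase_time(task_times):
    """Because of the map/reduce barrier, the phase ends when the slowest task ends."""
    return max(task_times)


def with_speculation(task_times, backup_start, backup_duration):
    """Launch a backup copy of any task still running at `backup_start`;
    the task completes when either copy finishes first."""
    finished = []
    for t in task_times:
        if t > backup_start:
            t = min(t, backup_start + backup_duration)
        finished.append(t)
    return finished


map_task_times = [10, 11, 9, 40]  # hypothetical seconds; the last task is a straggler
baseline = phase_time(map_task_times)
speculative = phase_time(
    with_speculation(map_task_times, backup_start=12, backup_duration=10)
)
```

In this model the backup helps only when the slowness comes from the machine. If a task is slow because its input is skewed, the backup copy processes the same data and takes just as long, which is why the slide points to local aggregation for skew.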

9 • Data/operation co-location
• Synchronization: intermediate data is copied to the reducers while the map phase is still running, but a barrier exists between map and reduce
• Error and fault tolerance: hardware as well as software

10 • Partitioners divide the intermediate key space and assign the parts to the reducers.
• Combiners are an optimization that allows local aggregation to be done before shuffle and sort.
• Thus a complete MR job consists of a mapper, reducer, combiner, partitioner and job configuration; the rest is taken care of by the execution framework.
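The roles of the combiner and the partitioner can be sketched in a few lines. This is an illustration only, not the Hadoop interfaces: `partition`, `combine` and `shuffle` are hypothetical helpers, and hashing with `zlib.crc32` is just one stable choice made for this example.

```python
from collections import Counter, defaultdict
import zlib


def partition(key, num_reducers):
    """Assign an intermediate key to a reducer with a stable hash (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: sum counts per key on the mapper side, before shuffle and sort."""
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return sorted(local.items())


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to the reducer chosen by the partitioner."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("c", 1)]
combined = combine(mapper_output)  # fewer pairs cross the network
buckets = shuffle(combined, num_reducers=2)
```

The combiner reduces five pairs to three before anything is sent, and the partitioner guarantees that every pair with the same key lands at the same reducer.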

