
1 Clydesdale: Structured Data Processing on MapReduce Jackie

2 Introduction
• Runs on unmodified Hadoop
• Aimed at workloads where the data fits a star schema
• Draws on existing techniques: columnar storage, tailored join plans, block iteration

3 Outline
• Background
• Clydesdale architecture
• Challenges
• Experiments

4 Background
• InputFormats and OutputFormats
• An InputFormat implements two methods: getSplits() and getRecordReader()
• MapRunners
• Scheduling
• JVM reuse
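
To make the InputFormat contract above concrete, here is a minimal sketch using the old org.apache.hadoop.mapred API (the API generation that MapRunner belongs to). The class name and the choice to delegate to FileInputFormat and LineRecordReader are illustrative assumptions, not Clydesdale code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Illustrative InputFormat showing the two methods every InputFormat provides.
public class ExampleInputFormat extends FileInputFormat<LongWritable, Text> {

  // getSplits(): partition the input into InputSplits, one per map task.
  // Here we simply reuse FileInputFormat's default block-based splitting.
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    return super.getSplits(job, numSplits);
  }

  // getRecordReader(): turn one split into a stream of (key, value) records
  // that the map task iterates over.
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new LineRecordReader(job, (FileSplit) split);
  }
}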

5 Clydesdale Architecture

6 Columnar Storage
• Avoids I/O for columns that are not used
• Stores each column in a separate HDFS file
• ColumnInputFormat (CIF) ensures that the different columns of a row are co-located on the same datanode
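
The point of the column-per-file layout is that only the projected columns generate I/O. The fragment below sketches that idea under an assumed naming scheme (one HDFS file named col_<column> per column of a partition); both the layout and the helper are hypothetical, not the actual CIF format.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical column-per-file reader: open only the columns the query needs,
// so unused columns cause no disk or network I/O.
public class ColumnReaderSketch {

  public static List<FSDataInputStream> openColumns(
      Path partitionDir, String[] wantedColumns, Configuration conf) throws IOException {
    FileSystem fs = partitionDir.getFileSystem(conf);
    List<FSDataInputStream> streams = new ArrayList<>();
    for (String col : wantedColumns) {
      // e.g. /fact/part-00000/col_price -- one HDFS file per column (assumed layout)
      streams.add(fs.open(new Path(partitionDir, "col_" + col)));
    }
    return streams;
  }
}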

7 Join Strategy
• SQL-like structured data processing
• The map phase is responsible for joining the fact table with the dimension tables
• The reduce phase is responsible for grouping and aggregation

8 Flow of Clydesdale’s Join Job

9 Examples
• Consider the following query

10 Execution Process
• Map phase
  - Build a hash table for each dimension table, applying its predicates
  - Each map task checks whether the input fact record matches the hash tables
  - Output the records that satisfy the join conditions
  - The key is formed from the subset of columns needed for grouping
• Reduce phase
  - Aggregate the values that share the same key
• Sorting is done at the client

11 Pseudocode for the Query
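
The pseudocode itself was shown as a figure and is not reproduced in the transcript. The sketch below illustrates the general shape described on the previous slide using the old mapred API: the mapper holds a hash table per dimension table, probes it for each fact row, and emits a grouping key with a measure; the reducer aggregates. The column positions, the single hard-coded dimension table, and the class names are illustrative assumptions, not the actual query or Clydesdale's code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StarJoinSketch {

  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    // dimension key -> grouping column value, built once per map task
    private final Map<String, String> dimHash = new HashMap<>();

    @Override
    public void configure(JobConf job) {
      // The real job would scan the (predicate-filtered) dimension table here;
      // a couple of hard-coded rows stand in for it in this sketch.
      dimHash.put("1", "ASIA");
      dimHash.put("2", "EUROPE");
    }

    @Override
    public void map(LongWritable offset, Text factRow,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      String[] cols = factRow.toString().split(",");
      String region = dimHash.get(cols[0]);     // probe the dimension hash table
      if (region == null) return;               // row fails the join/predicate
      long revenue = Long.parseLong(cols[1]);
      out.collect(new Text(region), new LongWritable(revenue)); // grouping key, measure
    }
  }

  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) sum += values.next().get();   // aggregate per group
      out.collect(key, new LongWritable(sum));
    }
  }
}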

12 Optimizing for the Native Implementation
• Exploit multi-core parallelism
  - A single map task per node
  - A custom MapRunner class runs a multi-threaded map task
  - MultiCIF packs multiple input splits into a single multi-split
  - The dimension hash tables are shared across consecutive map tasks that run on the same node
• Task scheduling
• Block iteration

13 Pseudocode for MapRunner
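
The MapRunner pseudocode was likewise a figure. Below is a minimal sketch of what a multi-threaded MapRunnable can look like in the old mapred API: worker threads pull records from a shared RecordReader and do the per-record join work in parallel. The thread-count property name, the synchronization scheme, and the placeholder process() method are assumptions for illustration, not Clydesdale's implementation.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ThreadedMapRunnerSketch
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  private int numThreads;

  @Override
  public void configure(JobConf job) {
    // e.g. one worker per core; "clydesdale.map.threads" is a made-up key
    numThreads = job.getInt("clydesdale.map.threads", 8);
  }

  @Override
  public void run(final RecordReader<LongWritable, Text> input,
                  final OutputCollector<Text, LongWritable> output,
                  final Reporter reporter) throws IOException {
    Thread[] workers = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      workers[i] = new Thread(() -> {
        try {
          LongWritable key = input.createKey();
          Text value = input.createValue();
          while (true) {
            // RecordReaders are not thread-safe, so reads are serialized;
            // the per-record work happens outside the lock.
            synchronized (input) {
              if (!input.next(key, value)) break;
            }
            process(key, value, output);
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      });
      workers[i].start();
    }
    for (Thread w : workers) {
      try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
  }

  // Placeholder for the per-record join work (see the join sketch above).
  // The OutputCollector is shared by all threads, so emission is synchronized.
  private void process(LongWritable key, Text value,
                       OutputCollector<Text, LongWritable> output) throws IOException {
    synchronized (output) {
      output.collect(new Text(value.toString()), new LongWritable(1L));
    }
  }
}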

14 Task Scheduling
• Schedule only one map task from the join job on a given node
• Schedule subsequent map tasks on nodes where the dimension hash table has already been built
• Communicate to the map task the number of slots, i.e. processor cores, it can use on the node

15 Block Iteration
• Row-at-a-time processing incurs high per-row overheads
• B-CIF: each call returns an array of rows over the same input
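
One way to realize block iteration is a RecordReader whose value is a batch of rows rather than a single row, so the framework's per-call overhead is paid once per block. The wrapper below is an illustrative sketch in the old mapred API; the block size, the plain ArrayList value type, and the class name are assumptions, not B-CIF itself.

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

// Wraps a row-at-a-time reader and hands the map task up to BLOCK_SIZE rows per call.
public class BlockRecordReaderSketch implements RecordReader<LongWritable, ArrayList<Text>> {

  private static final int BLOCK_SIZE = 1024;   // illustrative block size
  private final RecordReader<LongWritable, Text> rows;

  public BlockRecordReaderSketch(RecordReader<LongWritable, Text> rows) {
    this.rows = rows;
  }

  @Override
  public boolean next(LongWritable key, ArrayList<Text> block) throws IOException {
    block.clear();
    LongWritable rowKey = rows.createKey();
    for (int i = 0; i < BLOCK_SIZE; i++) {
      Text row = rows.createValue();
      if (!rows.next(rowKey, row)) break;       // underlying reader is exhausted
      block.add(row);
    }
    key.set(rows.getPos());
    return !block.isEmpty();
  }

  @Override public LongWritable createKey() { return new LongWritable(); }
  @Override public ArrayList<Text> createValue() { return new ArrayList<>(); }
  @Override public long getPos() throws IOException { return rows.getPos(); }
  @Override public float getProgress() throws IOException { return rows.getProgress(); }
  @Override public void close() throws IOException { rows.close(); }
}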

16 Hive Background
• Hive supports two join plans: the repartition join and the mapjoin
• The repartition join is a robust technique that works with any combination of table sizes
• The mapjoin is designed for the case where one table is significantly smaller than the other

17 Hive’s Mapjoin plan

18 Hive: SQL-Like Language
• SQL → logical plan → physical plan → MapReduce workflow
• A workflow with six jobs

19 Experimental Setup
• Clusters
  - Cluster A: 9 nodes, each with two quad-core processors, 16 GB of memory, 8 x 250 GB disks, connected by a 1 Gb Ethernet switch
  - Cluster B: 42 nodes, each with two quad-core processors, 32 GB of memory, 5 x 500 GB disks, connected by a 1 Gb Ethernet switch
• Clydesdale runs on Hadoop 0.21, Hive on Hadoop 0.20.2
• Workload
• Storage format: Clydesdale fact tables are stored in Multi-CIF; Hive uses RCFile

20 Comparison with Hive

21 Result Analysis
• Hive joins one dimension table at a time with the fact table
• Hive maintains many copies of the hash table
• Hive builds the hash tables on a single node and pays the cost of disseminating them to the entire cluster
• Each Hive task has to load and deserialize the hash table when it starts

22 Analysis of Clydesdale

23 Limitations
