Clydesdale: Structured Data Processing on MapReduce Jackie.

Presentation transcript:

Introduction
- Runs on unmodified Hadoop
- Targets workloads where the data fits a star schema
- Draws on existing techniques: columnar storage, tailored join plans, block iteration

Outline
- Background
- Clydesdale architecture
- Challenges
- Experiment

Background
- InputFormats and OutputFormats
  - An InputFormat implements two methods: getSplits() and getRecordReader()
- MapRunners
- Schedulers
- JVM reuse

Clydesdale Architecture

Columnar Storage
- Avoids I/O for columns that are not used
- Stores each column in a separate HDFS file
- ColumnInputFormat ensures that the different columns of a row are co-located on the same datanode
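The idea of one file per column can be illustrated with a minimal local sketch. It uses ordinary files in a temporary directory rather than HDFS, and the helper names (`write_columns`, `read_columns`) and the sample table are hypothetical, not part of Clydesdale or ColumnInputFormat.

```python
import os
import tempfile

def write_columns(directory, table):
    """Write each column of `table` (dict: name -> list of values) to its own file."""
    for name, values in table.items():
        with open(os.path.join(directory, name + ".col"), "w") as f:
            f.write("\n".join(str(v) for v in values))

def read_columns(directory, wanted):
    """Open only the requested column files and re-assemble rows by position."""
    columns = []
    for name in wanted:
        with open(os.path.join(directory, name + ".col")) as f:
            columns.append(f.read().split("\n"))
    return list(zip(*columns))

with tempfile.TemporaryDirectory() as d:
    # Hypothetical three-column table; values are stored as text.
    write_columns(d, {"order_id": [1, 2, 3],
                      "price": [10, 20, 30],
                      "comment": ["a", "b", "c"]})
    # The query needs two columns, so the "comment" file is never opened.
    rows = read_columns(d, ["order_id", "price"])
```

Rows are rebuilt by position, which is why the columns of a row need to live together: if they sat on different datanodes, re-assembling a row would require remote reads.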

Join Strategy
- SQL-like structured data processing
- The map phase is responsible for joining the fact table with the dimension tables
- The reduce phase is responsible for grouping and aggregation

Flow of Clydesdale’s Join Job

Examples
- Consider the following query

Execution Process
- Map phase
  - Build a hash table for each dimension table, keeping only the rows that satisfy the predicates
  - Each map task checks whether the join keys of the input record are in the hash tables
  - Output the records that satisfy the join conditions
  - The output key is formed from the subset of columns needed for grouping
- Reduce phase
  - Aggregate the values that share the same key
- Sort at the client
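The steps above can be sketched in a few lines of plain Python. This is an illustration of the map-side hash join followed by reduce-side aggregation, not Clydesdale's code; the star schema, the table contents, and the function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table and two small dimension tables.
fact = [  # (customer_id, date_id, revenue)
    (1, 10, 100), (2, 10, 50), (1, 11, 70), (3, 11, 30),
]
customer = {1: "ASIA", 2: "EUROPE", 3: "ASIA"}  # customer_id -> region
date = {10: 1997, 11: 1998}                     # date_id -> year

def build_hash_table(dimension, predicate):
    """Keep only the dimension rows that satisfy the query predicate."""
    return {k: v for k, v in dimension.items() if predicate(v)}

def map_phase(rows, cust_ht, date_ht):
    """Probe both hash tables; emit (grouping key, value) for matching rows."""
    for cust_id, date_id, revenue in rows:
        if cust_id in cust_ht and date_id in date_ht:
            yield (cust_ht[cust_id], date_ht[date_id]), revenue

def reduce_phase(pairs):
    """Sum the values that share the same grouping key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

cust_ht = build_hash_table(customer, lambda region: region == "ASIA")
date_ht = build_hash_table(date, lambda year: True)
# The final sort happens at the client, after the reduce phase.
result = sorted(reduce_phase(map_phase(fact, cust_ht, date_ht)).items())
```

Because the predicates are applied while building the hash tables, a fact row that fails any dimension predicate is dropped by a single failed lookup in the map phase.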

Pseudocode for the Query

Optimizing for the Native Implementation
- Exploit multi-core parallelism
  - Run a single map task per node
  - Use a custom MapRunner class to run a multi-threaded map task
  - MultiCIF packs multiple input splits into a single multi-split
  - Dimension hash tables are shared across consecutive map tasks that run on the same node
- Task scheduling
- Block iteration
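A rough sketch of the multi-threaded map task, assuming a thread pool stands in for the custom MapRunner and a list of lists stands in for a multi-split. The function name and data are hypothetical; the point is only that all threads probe one shared, read-only hash table instead of each holding a copy.

```python
from concurrent.futures import ThreadPoolExecutor

def run_multithreaded_map(multi_split, hash_table, num_threads):
    """Process every split of a multi-split in parallel against one shared hash table."""
    def process(split):
        # Read-only probes, so the threads need no locking.
        return [(hash_table[k], v) for k, v in split if k in hash_table]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        outputs = pool.map(process, multi_split)  # results keep split order
    return [pair for out in outputs for pair in out]

shared_ht = {1: "a", 3: "c"}                      # built once per node
multi_split = [[(1, 5), (2, 7)], [(3, 1), (1, 2)]]
joined = run_multithreaded_map(multi_split, shared_ht, num_threads=2)
```

In CPython the global interpreter lock limits the real speed-up of such threads; the sketch shows the sharing pattern, not the performance.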

Pseudocode for MapRunner

Task Scheduling
- Schedule only one map task from the join job on a given node
- Schedule subsequent map tasks on the node where the dimension hash table has already been built
- Communicate to the map task the number of slots, or processor cores, it can use on the node
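The three rules can be mimicked with a toy scheduler. This is only a sketch under simplifying assumptions (a `Node` record with a busy flag, no task completion, no failures); the names are hypothetical and it does not reflect Hadoop's scheduler API.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cores: int
    busy: bool = False            # a map task of the join job is running here
    has_hash_table: bool = False  # dimension hash tables were already built here

def pick_node(nodes):
    """Choose an idle node, preferring one that already holds the hash tables."""
    idle = [n for n in nodes if not n.busy]
    if not idle:
        return None               # rule 1: never a second join task on a node
    # False sorts before True, so nodes with a hash table come first.
    return min(idle, key=lambda n: not n.has_hash_table)

def launch(nodes):
    """Start one map task and tell it how many cores it may use."""
    node = pick_node(nodes)
    if node is None:
        return None
    node.busy = True
    node.has_hash_table = True    # the task builds (or reuses) the hash tables
    return node.name, node.cores

cluster = [Node("n1", 8), Node("n2", 8, has_hash_table=True)]
first, second, third = launch(cluster), launch(cluster), launch(cluster)
```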

Block Iteration
- Row-at-a-time processing incurs high per-row overheads
- B-CIF: returns an array of rows over the same input
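A small sketch of the difference between row-at-a-time and block-at-a-time reading, using Python generators. The reader names and the block size are illustrative assumptions, not the B-CIF interface.

```python
def read_rows(values):
    """Row-at-a-time reader: one call (and its overhead) per row."""
    for v in values:
        yield v

def read_blocks(values, block_size):
    """Block reader: one call returns an array of up to `block_size` rows."""
    for start in range(0, len(values), block_size):
        yield values[start:start + block_size]

data = list(range(10))
blocks = list(read_blocks(data, 4))

# The consumer loops over each block in a tight inner loop,
# paying the per-call overhead once per block instead of once per row.
total = 0
for block in blocks:
    total += sum(block)
```

Both readers see the same input; only the number of calls between reader and consumer changes (10 for rows, 3 for blocks of four in this example).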

Hive Background
- Supports two join plans: repartition join and mapjoin
- Repartition join is a robust technique that works with any combination of table sizes
- Mapjoin is designed for joins where one table is significantly smaller than the other
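The two plans can be contrasted with an in-memory sketch. This shows the general techniques, not Hive's implementation; the function names and sample tables are hypothetical, and a dictionary stands in for the shuffle that groups records by join key.

```python
from collections import defaultdict

def repartition_join(left, right):
    """Group both inputs by join key (the shuffle), then join per key (the reduce)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)   # records tagged as coming from `left`
    for key, value in right:
        groups[key][1].append(value)   # records tagged as coming from `right`
    return [(k, l, r)
            for k, (ls, rs) in sorted(groups.items())
            for l in ls for r in rs]

def map_join(big, small):
    """Load the small table into a hash table and stream the big table past it."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(k, v, s) for k, v in big for s in table.get(k, [])]

big = [(1, "x"), (2, "y"), (1, "z")]
small = [(1, "A"), (3, "C")]
via_repartition = repartition_join(big, small)
via_mapjoin = map_join(big, small)
```

The repartition join moves every record of both tables, which is why it works for any sizes; the mapjoin moves only the small table but needs it to fit in memory.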

Hive’s Mapjoin plan

Hive: SQL-Like Language
- SQL -> Logical Plan -> Physical Plan -> MR Workflow
- Workflow with Six Jobs

Experimental Setup
- Cluster
  - Cluster A: 9 nodes, two quad-core processors, 16 GB memory, 8 × 250 GB disks, 1 Gbit Ethernet switch
  - Cluster B: 42 nodes, two quad-core processors, 32 GB memory, 5 × 500 GB disks, 1 Gbit Ethernet switch
- Clydesdale on Hadoop 0.21 and Hive on Hadoop
- Workload
- Storage format: Clydesdale fact tables were stored in Multi-CIF; Hive used RCFile

Comparison with Hive

Result Analysis
- Hive joins one dimension table at a time with the fact table
- Hive maintains many copies of the hash table
- Hive creates the hash tables on a single node and pays the cost of disseminating them to the entire cluster
- Each task in Hive has to load and deserialize the hash table when it starts

Analysis of Clydesdale

Limitations