Bi-Hadoop: Extending Hadoop To Improve Support For Binary-Input Applications Xiao Yu and Bo Hong School of Electrical and Computer Engineering Georgia Institute of Technology

Hadoop  The MapReduce programming model, along with its open-source implementation, Hadoop, has provided a cost effective solution for many data-intensive applications.  Hadoop stores data distributively and exploits data locality by assigning tasks to where data is stored.

Hadoop  Many data-intensive applications require two (or more) input data for each of their tasks.  Significant challenges for Hadoop as the inputs to one task often reside on multiple nodes  Hadoop is unable to discover data locality in this scenario.  Leads to excessive data transfers and significant degradations in application performance.

Bi-Hadoop  An efficient extension of Hadoop to better support binary-input applications.  Integrates an easy-to-use user interface, a binary-input aware task scheduler, and a caching subsystem.  Bi-Hadoop can significantly improve the execution of binary-input applications by reducing the data transfer overhead, and outperforms existing Hadoop by up to 3.3x.

Hadoop  The locality-awareness of Hadoop is based on a relatively strong assumption that a task is expected to work on a single data split.  This is in accordance with the MapReduce programming model, which defines one map task over each logical data split and thus requires users to describe the mapper function as a unary operator, applicable to only one single logical data split.

Example: Pattern Matching  Two inputs: one record of the template data and one record of the stored data.  Hadoop limitations:  Developers need to work around the unary-input requirement, which makes the application less natural to program.  When such a workaround is used, the built-in locality awareness of Hadoop becomes less effective or entirely ineffective.  Because binary-input tasks often share data blocks, these applications offer many unique locality-optimization opportunities that stock Hadoop cannot exploit.
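One common workaround is sketched below: the stored data flows through the normal map() path while the template data is fetched by hand in setup(). The class, configuration key, and structure are hypothetical and not from the paper; the point is that the side read is invisible to Hadoop's locality-aware scheduler.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical workaround: the "stored data" arrives through the normal map()
// path, while the "template" input is read manually in setup(). The scheduler
// never sees this second read, so it cannot place the task near that data.
public class PatternMatchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> templates = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Side read over HDFS -- invisible to Hadoop's locality-aware scheduling.
        Path templatePath = new Path(context.getConfiguration().get("patterns.template.path"));
        FileSystem fs = templatePath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(templatePath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                templates.add(line);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        for (String template : templates) {
            if (record.toString().contains(template)) {
                context.write(new Text(template), record);
            }
        }
    }
}
```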

Bi-Hadoop  Motivations:  Lack of Support for Binary-Input Applications in Hadoop  Data Locality in Binary-Input Applications  Design Challenges:  Programmability 易于编程  Transparency 对用户透明  Non-intrusiveness 不妨碍用户其他进程

Bi-Hadoop Extension System Overview

Incidence Matrix Example

Algorithm for the Static Grouping Phase

Dynamic Task Dispatching  Considers: (1) which replicas each node has, and (2) which splits a node is currently caching.  A task pack is the set of tasks that will be executed together by a compute node.  A task pack is formed in two steps: (1) choose the group from the static task grouping whose splits are most local to the node, then (2) pick a pack from that group so that it fits into the cache of the local compute node.
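A rough sketch of that two-step decision follows. All names and data structures here are illustrative assumptions (the slides do not show the actual implementation): score each static group by how many of its splits the node already holds as HDFS replicas or cached blocks, take the best group, then greedily pull tasks from it while their splits still fit in the node's cache.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only -- names and data structures are assumptions,
// not Bi-Hadoop's actual implementation.
class BinaryTask {
    final int id;
    final String rowSplit;  // first input split
    final String colSplit;  // second input split

    BinaryTask(int id, String rowSplit, String colSplit) {
        this.id = id;
        this.rowSplit = rowSplit;
        this.colSplit = colSplit;
    }
}

public class PackSelector {

    // Step 1: score each statically formed group by how many of its splits the
    // node already holds, either as an HDFS replica or as a cached block.
    static List<BinaryTask> chooseGroup(List<List<BinaryTask>> groups,
                                        Set<String> localReplicas,
                                        Set<String> cachedSplits) {
        List<BinaryTask> best = null;
        int bestScore = -1;
        for (List<BinaryTask> group : groups) {
            Set<String> splits = new HashSet<>();
            for (BinaryTask t : group) {
                splits.add(t.rowSplit);
                splits.add(t.colSplit);
            }
            int score = 0;
            for (String s : splits) {
                if (localReplicas.contains(s) || cachedSplits.contains(s)) score++;
            }
            if (score > bestScore) {
                bestScore = score;
                best = group;
            }
        }
        return best;
    }

    // Step 2: greedily take tasks from the chosen group while the set of splits
    // they need still fits within the node's cache budget (counted in splits).
    static List<BinaryTask> formPack(List<BinaryTask> group, int cacheCapacityInSplits) {
        List<BinaryTask> pack = new ArrayList<>();
        Set<String> needed = new HashSet<>();
        for (BinaryTask t : group) {
            Set<String> tentative = new HashSet<>(needed);
            tentative.add(t.rowSplit);
            tentative.add(t.colSplit);
            if (tentative.size() > cacheCapacityInSplits) break;
            pack.add(t);
            needed = tentative;
        }
        return pack;
    }
}
```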

Criteria when forming task packs  The number of tasks in a pack should not exceed the total number of tasks divided by the number of nodes.  The difference between the number of row splits and the number of column splits should be small.  At most half of the cache may be used for row splits or for column splits.
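The three criteria can be encoded as a simple predicate; the sketch below is a hedged reading of them, and the concrete thresholds (e.g. what counts as a "small" row/column imbalance) are assumptions rather than values from the paper.

```java
// Hedged encoding of the three pack criteria listed above; thresholds are assumptions.
public class PackCriteria {
    static boolean isAcceptable(int tasksInPack, int totalTasks, int numNodes,
                                int rowSplits, int colSplits, int cacheCapacityInSplits) {
        boolean sizeOk = tasksInPack <= Math.max(1, totalTasks / numNodes);      // pack size bound
        boolean balancedOk = Math.abs(rowSplits - colSplits) <= 1;               // "small" difference
        boolean cacheOk = rowSplits <= cacheCapacityInSplits / 2
                       && colSplits <= cacheCapacityInSplits / 2;                // at most half the cache each
        return sizeOk && balancedOk && cacheOk;
    }
}
```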

Task Packs  Task packs will be created such that (1) tasks in a pack share their input splits (2) the input file splits are likely to be already has a local replica or in cache.  Tasks in a pack are then scheduled onto the corresponding compute one by one.

Caching Subsystem  two components: a file handler object and a service daemon  The handler object is constructed when opening files  The service daemon sits on top of the file system abstraction of each compute node (between MapReduce and HDFS systems)  We want the service daemon to remember the caching history among jobs and the service daemon can function for any Hadoop supported file system.

openCachedReadOnly function  The function returns the specialized file handler, and users read data as usual through this handler.  The openCachedReadOnly function accepts an optional versionID parameter in addition to the usual path parameter.  Users are expected to change this versionID whenever the data is modified; if the cached version does not match the user-provided version, the block is re-fetched.
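A minimal usage sketch follows. Only the openCachedReadOnly name and its path/versionID parameters come from the slides; the interface shape, return type, and version-ID type below are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;

// Assumed shape of Bi-Hadoop's caching entry point. Only the method name and
// its path/versionID parameters are taken from the slides; everything else
// here (the interface itself, return type, long version IDs) is an assumption.
interface BiHadoopCachingFileSystem {
    FSDataInputStream openCachedReadOnly(Path path, long versionId) throws IOException;
}

public class CachedReadExample {
    // Reads the whole file through the caching layer; callers are expected to
    // bump versionId whenever the underlying data changes, so a stale cached
    // block is re-fetched rather than served.
    static long countBytes(BiHadoopCachingFileSystem fs, Path path, long versionId) throws IOException {
        long total = 0;
        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.openCachedReadOnly(path, versionId)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }
}
```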

Service Daemon  Serves handlers’ requests  Manages the cached blocks  Reports caching status to the TaskTracker for scheduling.  The daemon uses the usual file system API(such as HDFS API) to read the data and saves blocks into local file system.  The cached blocks are evicted using a least recently used policy if the capacity is reached.

Non-intrusiveness  If the application is unary-input as with existing Hadoop applications, our scheduler will detect the non-existence of the user- supplied input filter, and immediately hand everything over to the existing Hadoop scheduling subsystem.  Bi-Hadoop also consumes certain amount of memory when the caching subsystem attempts to cache file blocks at the nodes

Experiment Results

Pattern Matching

PageRank