Hyracks: A new partitioned-parallel platform for data-intensive computation Vinayak Borkar UC Irvine (joint work with M. Carey, R. Grover, N. Onose, and R. Vernica)

Today’s Presentation  A new platform? Why?  How is Hyracks different from …?  The MapReduce Architecture  Hyracks Architecture  Hyracks User Model  Hyracks Data support  Experimental evaluation  Future Work

Motivation for Hyracks  Data sizes are growing exponentially  Social networks (Facebook: 500M active users, >50M status updates/day)  Web: trillions of URLs  Data is not necessarily flat  Semi-structured data  Unstructured data  Different data models (e.g., arrays)  Non-traditional treatment of data  Need to process data using custom logic

Motivation for Hyracks (contd.)  Large clusters of commodity hardware are feasible  Google claims 10,000s of cheap computers in its clusters  Fast networks are commodity  1G already available  10G available soon  Last but not least: attention from industry  Plenty of new and interesting use-cases

A little bit of history (not so recent)  Parallel databases in the mid-90s  The Gamma database system (DeWitt et al.)  True shared-nothing architecture  Operators process partitions  Split tables perform data re-distribution  However: could process only relational data using relational operators  Fault-tolerant execution was not a goal  Relatively few machines (~ nodes)

A little bit of history (recent)  Google proposed MapReduce in 2004  “Simple” user model  Write two/three sequential functions and let the infrastructure parallelize them on the cluster  Hadoop built soon after as an open-source project (developed largely at Yahoo!)  High-level declarative languages compiled to MapReduce jobs  Pig from Yahoo!  Hive from Facebook  Sawzall from Google  …

Hyracks vs. Other Alternatives  Vs. parallel databases  Extensible data model  Fault tolerance  Vs. Hadoop  More flexible computational model  Generalized runtime processing model  Better support (transparency) for scheduling  Vs. Dryad  Data as a first-class citizen  Commonly occurring data operators  Support for common data communication patterns

The MapReduce Architecture  [architecture diagram: MapReduce nodes over a distributed file system, connected by the network and coordinated by the MapReduce Job Tracker]

MapReduce in a Nutshell  Map: (k1, v1) → list(k2, v2)  Processes one input key/value pair  Produces a set of intermediate key/value pairs  Reduce: (k2, list(v2)) → list(k3, v3)  Combines the intermediate values for one particular key  Produces a set of merged output values (usually one)
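To make these signatures concrete, here is the classic word-count pair written against the standard Hadoop MapReduce API (an illustration added here, not part of the original slides):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: (k1, v1) -> list(k2, v2); one input line yields a (word, 1) pair per word.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (k2, list(v2)) -> list(k3, v3); sums all the counts for one word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }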

MapReduce Parallelism  Hash Partitioning
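Map output is routed to a reduce task by hashing the intermediate key; Hadoop's default HashPartitioner computes essentially the following (a sketch, with the unused value parameter omitted):

    // A record with key k goes to reduce task hash(k) mod R; masking off the
    // sign bit keeps the partition index non-negative.
    int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }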

The Hyracks Architecture  [architecture diagram: a Cluster Controller coordinating Node Controllers across the network]

Hyracks – Usage Scenarios  Designed with three possible uses in mind: (1) as a compilation target for high-level data languages, compiling either directly to native Hyracks jobs or to Hadoop jobs run via the compatibility layer; (2) running Hadoop jobs submitted from existing Hadoop clients; (3) running native Hyracks jobs

Hyracks Jobs  Job = Unit of work submitted by a Hyracks Client  Job = dataflow DAG of operators and connectors  Operators consume/produce partitions of data  Connectors repartition/route data between operators
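As an illustration of the job model (toy classes invented here, not the real Hyracks API), a job is just a DAG whose vertices are operators and whose edges are connectors:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the Hyracks job abstraction (illustrative only).
    class Operator {
        final String name;
        Operator(String name) { this.name = name; }
    }

    // A connector is an edge that says how data moves between the
    // partitions of its producer and consumer operators.
    class Connector {
        final Operator producer, consumer;
        final String pattern; // e.g., "1:1" or "M:N hash-partition"
        Connector(Operator producer, Operator consumer, String pattern) {
            this.producer = producer; this.consumer = consumer; this.pattern = pattern;
        }
    }

    class Job {
        final List<Operator> operators = new ArrayList<>();
        final List<Connector> connectors = new ArrayList<>();
        Operator op(String name) {
            Operator o = new Operator(name);
            operators.add(o);
            return o;
        }
        void connect(Operator producer, Operator consumer, String pattern) {
            connectors.add(new Connector(producer, consumer, pattern));
        }
    }

    class JobExample {
        public static void main(String[] args) {
            Job job = new Job();
            Operator scan = job.op("file-scan");
            Operator sort = job.op("external-sort");
            Operator agg = job.op("preclustered-aggregate");
            job.connect(scan, sort, "M:N hash-partition"); // redistribute by key
            job.connect(sort, agg, "1:1");                 // partitions stay aligned
        }
    }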

Hyracks: Operator Activities  [diagram]

Hyracks: Runtime Task Graph  [diagram]

Hyracks Library (growing…)  Operators  File readers/writers: line files, delimited files, HDFS files  Mappers: native mapper, Hadoop mapper  Sorters: in-memory, external  Joiners: in-memory hash, hybrid hash  Aggregators: hash-based, preclustered  B-Tree bulk load, search, scan  Connectors  M:N hash-partitioner  M:N hash-partitioning merger  M:N range-partitioner  M:N replicator  1:1

Hyracks Data Movement  Fixed-size memory blocks (frames) used to transport data  Cursors to iterate over serialized data  Common operations (comparison/hashing/projection) can be performed directly on data in frames  Goal: keep garbage collection to a minimum and minimize data copies  Anecdotally: ran queries for ~60 mins; total GC time ~1.8 s
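A self-contained sketch of the cursor idea (a toy record layout invented here, not Hyracks' actual frame format): records stay serialized in a fixed-size buffer, and comparisons read fields in place instead of materializing objects:

    import java.nio.ByteBuffer;

    class FrameCursorDemo {
        static final int RECORD_SIZE = 8; // toy layout: one long key per record

        // Compare the keys of records i and j directly in the serialized frame:
        // no deserialization, no allocation, nothing for the GC to collect.
        static int compareKeys(ByteBuffer frame, int i, int j) {
            return Long.compare(frame.getLong(i * RECORD_SIZE),
                                frame.getLong(j * RECORD_SIZE));
        }

        public static void main(String[] args) {
            ByteBuffer frame = ByteBuffer.allocate(32 * 1024); // fixed-size block
            frame.putLong(0 * RECORD_SIZE, 42L);
            frame.putLong(1 * RECORD_SIZE, 7L);
            System.out.println(compareKeys(frame, 0, 1)); // prints 1 (42 > 7)
        }
    }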

Hyracks Operator/Connector Design  Operations on Collections of Abstract Data Types  Data model operations are provided during instantiation  For example:  Sort operator accepts a list of comparators  Hash-based groupby accepts a list of hash-functions and a list of comparators  Hash-Partitioning connector accepts a hash-function
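This pattern might look as follows (a hypothetical sketch of the shape, not the actual Hyracks classes): the operator never interprets the bytes itself; whoever instantiates it supplies the data-model operations:

    import java.nio.ByteBuffer;
    import java.util.List;

    // Hypothetical: compares two serialized records inside a frame.
    interface IRecordComparator {
        int compare(ByteBuffer frame, int leftRecord, int rightRecord);
    }

    // Hypothetical: a sort operator is data-model-agnostic; it is handed
    // comparators over serialized records at instantiation time.
    class SortOperator {
        private final List<IRecordComparator> comparators; // one per sort key

        SortOperator(List<IRecordComparator> comparators) {
            this.comparators = comparators;
        }
        // Sorting logic would call comparators.get(k).compare(...) per key.
    }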

Hyracks Operator/Connector Interface  Uniform push-style iterator interface that accepts ByteBuffer objects  Operators can be connected to run in the same or different threads

    public interface IFrameWriter {
        public void open() throws HyracksDataException;
        public void close() throws HyracksDataException;
        public void nextFrame(ByteBuffer frame) throws HyracksDataException;
    }
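For illustration (a sketch added here, not from the slides; the wrapper class is hypothetical), an operator's output side is just an IFrameWriter that pushes frames downstream:

    import java.nio.ByteBuffer;

    // Hypothetical pass-through writer: observes each frame pushed to it
    // and forwards the frame unchanged to the downstream writer.
    class FrameCountingWriter implements IFrameWriter {
        private final IFrameWriter downstream; // next operator's input
        private long frameCount = 0;

        FrameCountingWriter(IFrameWriter downstream) {
            this.downstream = downstream;
        }

        public void open() throws HyracksDataException {
            downstream.open();
        }

        public void close() throws HyracksDataException {
            downstream.close();
        }

        public void nextFrame(ByteBuffer frame) throws HyracksDataException {
            frameCount++;                // count the frame...
            downstream.nextFrame(frame); // ...and push it along
        }
    }

Because the interface is push-style, adjacent operators can run in the same thread as plain method calls, or be decoupled across threads and machines without changing the operator code.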

Hadoop Compatibility Layer  Goal: run Hadoop jobs unchanged on top of Hyracks  How:  A client-side library converts a Hadoop job spec into an equivalent Hyracks job spec  Hyracks has operators to interact with HDFS  DCache provides distributed-cache functionality

Hyracks Performance  [charts]  K-means (on the Hadoop compatibility layer)  DSS-style query execution (TPC-H based example)

Hyracks Performance (contd.)  [chart]  Fault-tolerant query execution (TPC-H based example)

Hyracks Performance Gains  K-Means  Push-based job activation  Operations on serialized data representation  No extra disk I/O between mapper and reducer  Semantics of the connector exploited at the network level  TPC-H query (in addition to the above)  Hash-based aggregation cheaper than sort-based aggregation  Tagged joins sort |I1| + |I2| amount of data  As an aside: a hash join needs only as much memory as the smaller input to incur zero disk I/O, while a tagged join needs |I1| + |I2| for an in-memory sort
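To make that memory comparison concrete (illustrative numbers, not from the slides): joining a 2 GB input I1 with a 20 GB input I2, a hash join can avoid spilling with roughly 2 GB of memory (enough to hold the smaller input's hash table), whereas a tagged sort-based join would need roughly 22 GB to sort the combined, tagged input in memory.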

Hyracks – Next steps  Fine-grained fault tolerance/recovery  Restart failed jobs in a more fine-grained manner  Exploit operator properties (natural blocking points) to get fault tolerance at marginal (or no) extra cost  Automatic scheduling  Use operator resource needs to decide sites for operator evaluation  Define protocol for interacting with HLL query planners

More info at: