
epiC: an Extensible and Scalable System for Processing Big Data
Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu
School of Computing, National University of Singapore
College of Computer Science and Technology, Zhejiang University

Why do we need another MapReduce-like system? Contemporary big data tools such as MapReduce and graph processing systems have fixed data abstractions and support only a limited set of communication patterns.
MapReduce:
- The M/R framework cannot handle iterative processing efficiently.
- Everything must be expressed as map and reduce functions.
Pregel/GraphLab/Dryad:
- DAG-based data flow.
- Users must design how the graph is constructed and how the different operators are linked.
Can we combine the advantages of both types of systems?

Overview of epiC
- Units work independently.
- Units communicate via messages.
- The master works as a mail server, forwarding messages between units.
- epiC is based on the actor model.
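
To make the actor-style model on this slide concrete, here is a minimal sketch of what a unit and a message could look like. The names (Unit, Message, run) are illustrative assumptions, not the actual epiC API; the later sketches in this transcript reuse them.

```java
import java.util.List;

// A message carries only small control information (e.g. DFS file locations);
// bulk data stays on the distributed file system.
final class Message {
    final String recipient;   // name of the destination unit, resolved by the master
    final String payload;     // e.g. locations of input or intermediate files on the DFS
    Message(String recipient, String payload) {
        this.recipient = recipient;
        this.payload = payload;
    }
}

// A unit runs independently: the runtime invokes it with the mail the master has
// forwarded to it; the unit loads its data, computes locally, and returns new mail.
interface Unit {
    List<Message> run(List<Message> inbox);
}
```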

Compare epiC to MapReduce and Pregel, using PageRank as an example.
MapReduce:
- Requires multiple iterations, one job per iteration.
- The second job loads the output of the first job to continue the processing.
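
As a concrete illustration of this multi-job pattern, here is a hedged sketch of iterative PageRank on Hadoop MapReduce: a driver submits one job per iteration, and each job reads the previous job's output from the DFS. The record layout (pageId TAB rank|n1,n2,...) and the constant damping factor are simplifying assumptions, not the code used in the paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankMR {

    // Assumed record layout: pageId <TAB> rank|n1,n2,...
    public static class PRMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text page, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\\|", 2);
            double rank = Double.parseDouble(parts[0]);
            String links = parts.length > 1 ? parts[1] : "";
            String[] targets = links.isEmpty() ? new String[0] : links.split(",");
            for (String t : targets) {   // spread this page's rank over its out-links
                ctx.write(new Text(t), new Text(Double.toString(rank / targets.length)));
            }
            ctx.write(page, new Text("!" + links));   // pass the graph structure along
        }
    }

    public static class PRReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text page, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String links = "";
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("!")) links = s.substring(1);   // graph structure
                else sum += Double.parseDouble(s);               // rank contributions
            }
            double rank = 0.15 + 0.85 * sum;   // damping, ignoring 1/N normalization
            ctx.write(page, new Text(rank + "|" + links));
        }
    }

    public static void main(String[] args) throws Exception {
        int iterations = Integer.parseInt(args[0]);
        String base = args[1];   // e.g. /pagerank/iter, with iter0 prepared beforehand
        for (int i = 0; i < iterations; i++) {
            Job job = Job.getInstance(new Configuration(), "pagerank-" + i);
            job.setJarByClass(PageRankMR.class);
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setMapperClass(PRMapper.class);
            job.setReducerClass(PRReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Each job re-reads the previous job's output from the DFS:
            FileInputFormat.addInputPath(job, new Path(base + i));
            FileOutputFormat.setOutputPath(job, new Path(base + (i + 1)));
            if (!job.waitForCompletion(true)) System.exit(1);
        }
    }
}
```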

Compare epiC to MapReduce and Pregel
Pregel:
- In each superstep, each vertex computes its new PageRank value and broadcasts it to its neighbors.
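
The same computation in the vertex-centric style, sketched against a hypothetical Pregel-like Vertex API; the class and method names below are illustrative, not Pregel's or any specific system's API.

```java
import java.util.List;

// Hypothetical runtime-provided vertex base class (illustrative only).
abstract class PregelVertex {
    abstract long superstep();
    abstract int numOutEdges();
    abstract double getValue();
    abstract void setValue(double v);
    abstract void sendMessageToAllNeighbors(double msg);
    abstract void voteToHalt();

    abstract void compute(List<Double> messages);
}

class PageRankVertex extends PregelVertex {
    private static final int MAX_SUPERSTEPS = 30;

    @Override
    void compute(List<Double> messages) {
        if (superstep() > 0) {
            double sum = 0.0;
            for (double m : messages) sum += m;   // contributions from in-neighbors
            setValue(0.15 + 0.85 * sum);          // new PageRank value
        }
        if (superstep() < MAX_SUPERSTEPS) {
            if (numOutEdges() > 0) {
                // Broadcast the new value, split evenly over the outgoing edges.
                sendMessageToAllNeighbors(getValue() / numOutEdges());
            }
        } else {
            voteToHalt();   // no more messages: the computation can terminate
        }
    }
}
```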

Compare epiC to MapReduce and Pregel
epiC:
0. Send a message to a unit to activate it.
1. The unit loads a partition of the graph data and the score vector, based on the received message.
2. It computes the new score vector of its vertices.
3. It generates the new score vector files.
4. It sends messages to the master network.
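
One way these five steps could look in code, reusing the hypothetical Unit/Message sketch from the overview slide; the helper methods stand in for DFS I/O and are placeholders, not real epiC calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageRankUnit implements Unit {
    @Override
    public List<Message> run(List<Message> inbox) {   // step 0: activated by incoming messages
        List<Message> outbox = new ArrayList<>();
        for (Message m : inbox) {
            // Step 1: load the assigned graph partition and the current score vector.
            Map<Long, long[]> inNeighbors = loadPartition(m.payload);
            Map<Long, Double> scores = loadScores(m.payload);
            Map<Long, Integer> outDegree = loadOutDegrees(m.payload);

            // Step 2: compute the new score of every local vertex.
            Map<Long, Double> newScores = new HashMap<>();
            for (Map.Entry<Long, long[]> e : inNeighbors.entrySet()) {
                double sum = 0.0;
                for (long src : e.getValue()) {
                    Double s = scores.get(src);
                    Integer d = outDegree.get(src);
                    if (s != null && d != null && d > 0) sum += s / d;
                }
                newScores.put(e.getKey(), 0.15 + 0.85 * sum);   // damping factor 0.85
            }

            // Step 3: generate the new score vector file on the DFS.
            String file = writeScores(newScores);

            // Step 4: tell the master network where the new scores live.
            outbox.add(new Message("master", file));
        }
        return outbox;
    }

    // Placeholders for DFS I/O; real code would read/write partitioned vector files.
    private Map<Long, long[]> loadPartition(String location) { return new HashMap<>(); }
    private Map<Long, Double> loadScores(String location) { return new HashMap<>(); }
    private Map<Long, Integer> loadOutDegrees(String location) { return new HashMap<>(); }
    private String writeScores(Map<Long, Double> scores) { return "scores.next"; }
}
```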

Compare epiC to MapReduce and Pregel
Flexibility:
- MR is not designed for such jobs; Pregel and epiC can express the algorithm more effectively. A unit in epiC is equivalent to a worker in Pregel.
Optimization:
- Both MR and epiC support customized optimizations, e.g., buffering intermediate results on local disk.
Extensibility:
- MR and Pregel have fixed, pre-defined programming models, while in epiC users can create their own.

Using epiC to simulate MapReduce
- Create two basic units: MapUnit and ReduceUnit.
- A MapUnit loads a partition of the data and sends messages to all ReduceUnits.
- A ReduceUnit gets its input from the DFS; the locations of that input are obtained from the MapUnits' messages.
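
A sketch of this simulation in the same hypothetical Unit/Message style: MapUnits spill partitioned intermediate files to the DFS and mail their locations to the ReduceUnits, which then pull and reduce them. All helper methods are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

class MapUnit implements Unit {
    private final int numReducers;
    MapUnit(int numReducers) { this.numReducers = numReducers; }

    @Override
    public List<Message> run(List<Message> inbox) {
        List<Message> outbox = new ArrayList<>();
        for (Message m : inbox) {   // each message names an input split to process
            for (int r = 0; r < numReducers; r++) {
                // Apply the user's map function to the split and spill the r-th
                // partition of the intermediate output to the DFS (placeholder).
                String location = mapAndSpill(m.payload, r);
                outbox.add(new Message("ReduceUnit-" + r, location));
            }
        }
        return outbox;
    }
    private String mapAndSpill(String split, int r) { return split + ".part" + r; }
}

class ReduceUnit implements Unit {
    @Override
    public List<Message> run(List<Message> inbox) {
        // The inbox tells this unit where its intermediate partitions live on the DFS.
        for (Message m : inbox) {
            mergeAndReduce(m.payload);   // pull, merge-sort, and reduce (placeholder)
        }
        return new ArrayList<>();        // final output goes to the DFS; no further mail
    }
    private void mergeAndReduce(String location) { /* placeholder */ }
}
```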

Using epiC to simulate a relational database
Three units are created:
- SingleTableUnit: handles all processing on a single table.
- JoinUnit: joins two or more tables.
- AggregateUnit: applies the group-by operator and computes the aggregation results.
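
A sketch of the first of these, again in the hypothetical Unit/Message style: a SingleTableUnit applies a filter and projection to one partition of one table and mails the location of its result onward. JoinUnit and AggregateUnit are only outlined in comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class SingleTableUnit implements Unit {
    private final Predicate<String[]> filter;             // WHERE predicate on one table
    private final Function<String[], String[]> project;   // projected columns

    SingleTableUnit(Predicate<String[]> filter, Function<String[], String[]> project) {
        this.filter = filter;
        this.project = project;
    }

    @Override
    public List<Message> run(List<Message> inbox) {
        List<Message> outbox = new ArrayList<>();
        for (Message m : inbox) {
            List<String[]> result = new ArrayList<>();
            for (String[] row : readPartition(m.payload)) {   // rows of this table partition
                if (filter.test(row)) result.add(project.apply(row));
            }
            outbox.add(new Message("JoinUnit", writePartition(result)));
        }
        return outbox;
    }

    // Placeholders for DFS I/O over delimited text files.
    private List<String[]> readPartition(String location) { return new ArrayList<>(); }
    private String writePartition(List<String[]> rows) { return "filtered.tbl"; }
}

// JoinUnit: reads the filtered partitions named in its messages, joins them on the
//           join key (e.g. with a hash join), and writes the joined partition.
// AggregateUnit: reads joined partitions, applies GROUP BY and the aggregate
//                functions, and writes partial or final aggregates.
```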

Using epiC to simulate a relational database
Example: TPC-H Q3, which requires 5 steps (shown on the following slides).
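
The figures on the next slides give the actual five-step plan; as a textual stand-in, here is the standard TPC-H Q3 query (with its usual validation parameters) and one plausible, purely illustrative mapping onto the three unit types, which may differ in detail from the plan in the paper.

```java
public final class TpchQ3 {
    // Standard TPC-H Q3 ("shipping priority" query), validation parameters filled in.
    public static final String QUERY =
        "select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, " +
        "       o_orderdate, o_shippriority " +
        "from customer, orders, lineitem " +
        "where c_mktsegment = 'BUILDING' " +
        "  and c_custkey = o_custkey " +
        "  and l_orderkey = o_orderkey " +
        "  and o_orderdate < date '1995-03-15' " +
        "  and l_shipdate > date '1995-03-15' " +
        "group by l_orderkey, o_orderdate, o_shippriority " +
        "order by revenue desc, o_orderdate";

    // A plausible decomposition onto the units (illustrative only):
    //  1. SingleTableUnits filter customer, orders, and lineitem in parallel.
    //  2. A JoinUnit joins the filtered customer and orders partitions on custkey.
    //  3. A JoinUnit joins that result with the filtered lineitem partitions on orderkey.
    //  4. AggregateUnits group by (l_orderkey, o_orderdate, o_shippriority) and sum revenue.
    //  5. A final step merges the partial aggregates and sorts by revenue and order date.
}
```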

TPC-H Q3 (Step 1)

TPC-H Q3 (Steps 2 and 3)

TPC-H Q3 (Steps 4 and 5)

Implementation Details

Fault Tolerance

EXPERIMENTS
A 65-node cluster: each node has a quad-core Intel Xeon 2.4 GHz CPU, 8 GB of memory, and two 500 GB SCSI disks; the nodes are connected by a 10 Gbps cluster switch.

System Settings
- Hadoop settings
- epiC settings
- GPS settings

Benchmark Tasks and Datasets
- Grep
- TeraSort
- TPC-H Q3
- PageRank
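
As an example of the simplest of these tasks, here is a hedged sketch of Grep as a map-only Hadoop mapper that emits every record containing a fixed pattern; the configuration key grep.pattern is an assumed convention, not the paper's code.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private String pattern;

    @Override
    protected void setup(Context ctx) {
        // The search pattern is passed through the job configuration,
        // e.g. -D grep.pattern=XYZ on the command line (assumed convention).
        pattern = ctx.getConfiguration().get("grep.pattern", "XYZ");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        if (line.toString().contains(pattern)) {
            ctx.write(line, NullWritable.get());   // emit matching records only
        }
    }
}
```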

The Grep Task

The TeraSort Task

The TPC-H Q3 Task

The PageRank Task

Comparison with In-Memory Systems