MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.


Matrix Multiplication

Fundamental kernel algorithm used by many applications
Examples: graph theory, physics, electronics
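For reference, the kernel itself is a triple loop; the sketch below is a minimal Python version (the course project uses C#, so this is only an illustration of the algorithm, not project code).

```python
def matmul_naive(A, B):
    """Naive matrix multiply: C[i][j] = sum over k of A[i][k] * B[k][j].

    Runs in O(N^3) time and needs O(N^2) memory for the result,
    which is the scaling discussed on the next slide.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```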

Scalability Issues

Run on a single machine:
  Memory overhead grows as O(N^2)
  CPU overhead grows as O(N^3)
Run on multiple machines:
  Communication overhead grows as O(N^2)

Matrix Multiply Approaches

Programming Model           | Algorithm                                          | Customized Libraries                                    | User Implementation
Sequential                  | Naïve approach, tiled matrix multiply, BLAS dgemm  | Vendor-supplied packages (e.g. Intel, AMD BLAS), ATLAS  | Fortran, C, C++, C#, Java
Shared-memory parallelism   | Row partition                                      | ATLAS                                                   | Multithreading, TPL, PLINQ, OpenMP
Distributed-memory parallelism | Row-column partition, Fox algorithm             | ScaLAPACK                                               | OpenMPI, Twister, Dryad

Why DryadLINQ?

Dryad is a general-purpose runtime that supports processing of data-intensive applications on Windows
DryadLINQ is a high-level programming language and compiler for Dryad
Applicability:
  Dryad transparently deals with parallelism, scheduling, fault tolerance, messaging, and workload-balancing issues
  SQL-like interface, based on the .NET platform, easy to write code with
Performance:
  Intelligent job execution engine, optimized execution plans
  Scales out to thousands of machines

Parallel Algorithms for Matrix Multiplication

MM algorithms can deal with matrices distributed on rectangular grids
No single algorithm always achieves the best performance across different matrix and grid shapes
MM algorithms can be classified into categories according to their communication primitives:
  Row partition
  Row-column partition
  Fox algorithm (BMR): broadcast, multiply, roll up

Row Partition

Heavy communication overhead
Large memory usage per node:
  The full matrix B is copied to every node
  The matrix A row blocks are distributed across nodes
Pseudocode sample:
  Partition matrix A by rows
  Broadcast matrix B
  Distribute the matrix A row blocks
  Compute the matrix C row blocks
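The steps above can be sketched as follows; this is a single-process Python simulation of the scheme (each loop iteration stands in for one worker node), not the project's DryadLINQ implementation.

```python
def matmul_row_partition(A, B, num_workers=4):
    """Row-partition scheme: matrix B is broadcast whole to every worker,
    A's rows are split into blocks across workers, and each worker
    computes the corresponding block of C's rows."""
    n = len(A)
    block = (n + num_workers - 1) // num_workers  # rows per worker
    cols = list(zip(*B))                          # columns of the broadcast B
    C = []
    for w in range(num_workers):                  # each iteration = one worker
        a_rows = A[w * block:(w + 1) * block]     # this worker's A row block
        for row in a_rows:                        # worker holds all of B
            C.append([sum(a * b for a, b in zip(row, col)) for col in cols])
    return C
```

The large per-node memory cost noted above is visible here: every "worker" touches the full `cols` copy of B.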

Row Column Partition

Heavy communication overhead
Scheduling overhead for each iteration
Moderate memory usage
Pseudocode sample:
  Partition matrix A by rows
  Partition matrix B by columns
  For each iteration i:
    Broadcast matrix A row block i
    Distribute the matrix B column blocks
    Compute the matrix C row blocks
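A single-process sketch of the same iteration structure, again in Python for illustration (the per-iteration scheduling overhead corresponds to the outer loop over row blocks):

```python
def matmul_row_col_partition(A, B, p=2):
    """Row-column scheme: A is split into p row blocks, B into p column
    blocks. Iteration i broadcasts A's row block i; each worker holding a
    column block of B computes its slice of C's row block i."""
    n = len(A)
    rb = (n + p - 1) // p              # row-block height
    Bt = list(zip(*B))                 # columns of B
    cb = (len(Bt) + p - 1) // p        # column-block width
    C = []
    for i in range(p):                 # one scheduling round per row block
        a_block = A[i * rb:(i + 1) * rb]   # broadcast A row block i
        for row in a_block:
            out = []
            for j in range(p):         # worker j holds B column block j
                cols = Bt[j * cb:(j + 1) * cb]
                out.extend(sum(a * b for a, b in zip(row, col)) for col in cols)
            C.append(out)
    return C
```

Memory usage is moderate because each worker keeps only one column block of B rather than the whole matrix.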

Fox Algorithm

[Figure: block movement in Stage One and Stage Two of the Fox algorithm]

Fox Algorithm

Less communication overhead than the other approaches
Scales well for large matrix sizes
Pseudocode sample:
  Partition matrices A and B into an N x N grid of blocks
  For each step k:
    1) In each grid row i, broadcast matrix A block (i, (i+k) % N) along the row
    2) Compute the matrix C blocks and add them to the previous result
    3) Roll the matrix B blocks up one row
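The broadcast-multiply-roll (BMR) steps can be simulated in one process; the Python sketch below operates on an N x N grid of blocks, with the roll expressed as a rotation of B's block rows. This is an illustration of the algorithm's communication pattern, not the distributed implementation.

```python
def matmul_fox(A_blocks, B_blocks):
    """Fox (BMR) algorithm simulated on an N x N block grid.
    A_blocks and B_blocks are N x N grids of equally sized square blocks."""
    N = len(A_blocks)

    def bmul(X, Y):  # dense multiply of two blocks
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    def badd(X, Y):  # elementwise add of two blocks
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    b = len(A_blocks[0][0])  # block dimension
    C = [[[[0] * b for _ in range(b)] for _ in range(N)] for _ in range(N)]
    B = [row[:] for row in B_blocks]      # working copy that gets rolled
    for step in range(N):
        for i in range(N):
            a = A_blocks[i][(i + step) % N]  # broadcast along grid row i
            for j in range(N):
                C[i][j] = badd(C[i][j], bmul(a, B[i][j]))  # multiply step
        B = B[1:] + B[:1]                 # roll B's block rows up by one
    return C
```

Each process only ever holds one block of A, one of B, and one of C, which is why the approach scales well for large matrices.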

Performance Analysis of the Fox Algorithm

Cache issues:
  Cache misses (capacity), pollution, conflicts
  Tiled matrix multiply
Memory issues:
  Size (memory paging)
  Bandwidth, latency
Cache-size turning point:
  Absolute performance degrades as the problem size increases in both cases
  Single-node performance is worse than multi-node due to memory issues
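Tiled (blocked) matrix multiply, mentioned above as the remedy for cache misses, restructures the loops so each phase works on sub-blocks that fit in cache. A minimal Python sketch of the technique (the tile size 32 is an illustrative placeholder; in practice it is tuned to the cache):

```python
def matmul_tiled(A, B, tile=32):
    """Tiled (blocked) multiply: iterate over tile-sized sub-blocks so the
    working set of each inner phase fits in cache, reducing cache misses."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # multiply the (ii, kk) tile of A by the (kk, jj) tile of B
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            C[i][j] += a * B[k][j]
    return C
```

The result is identical to the naive version; only the traversal order (and hence cache behavior) changes, which is what moves the turning point observed in the measurements.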

Multicore-Level Parallelism

To use every core on a compute node for a Dryad job, the task must be programmed with a multicore technology (e.g. Task Parallel Library, threads, PLINQ)
Each thread computes one row of matrix C, or several rows, depending on the implementation
With TPL or PLINQ, the thread-level optimization is implicit, which makes them easier to use
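The one-row-of-C-per-task scheme can be sketched with a thread pool; the project uses TPL/PLINQ in C#, so the Python version below is only an analogue of the structure (note that pure-Python threads will not actually speed up this CPU-bound loop because of the GIL; the point is the task decomposition).

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_threaded(A, B, workers=4):
    """Row-per-task multiply: each pool task computes one row of C,
    mirroring the thread decomposition described above."""
    Bt = list(zip(*B))  # transpose B once so each task reads columns cheaply

    def one_row(row):
        return [sum(a * b for a, b in zip(row, col)) for col in Bt]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_row, A))  # map preserves row order
```

As with TPL's `Parallel.For` or a PLINQ query, the scheduling of rows onto worker threads is implicit in the pool.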

Timeline for the Term-Long Project

Stage One:
  Become familiar with the HPC cluster
  Sequential MM in C#
  Multithreaded MM in C#
  Performance comparison of the two approaches
Stage Two:
  Become familiar with the DryadLINQ interface
  Implement the row partition algorithm with DryadLINQ
  Performance study
Stage Three:
  Refine experimental results
  Report and presentation

Backup slides

Input: C# and LINQ data objects (DryadLINQ distributed data objects)
DryadLINQ translates LINQ programs into distributed Dryad computations:
  C# methods become code running on the vertices of a Dryad job
Output: DryadLINQ distributed data objects

[Figure: DryadLINQ job submission flow. A .NET program on the client machine produces a LINQ query expression; DryadLINQ generates a distributed query plan and vertex code; the job manager (JM) submits the Dryad job to the HPC cluster over input tables; results come back as output tables / .NET objects via ToTable and foreach]

Dryad Job Execution Flow

Performance on one Node

Performance on Multiple Nodes

Analysis for three algorithms

Performance for three algorithms Test done on 16 nodes of Tempest, using one core per node.

Performance for Multithreaded MM Test done on one node of Tempest, 24 cores