GAMMA: An Efficient Distributed Shared Memory Toolbox for MATLAB


GAMMA: An Efficient Distributed Shared Memory Toolbox for MATLAB
Rajkiran Panuganti [1], Muthu Baskaran [1], Jarek Nieplocha [2], Ashok Krishnamurthy [3], Atanas Rountev [1], P. Sadayappan [1]
[1] The Ohio State University  [2] Pacific Northwest National Laboratory  [3] Ohio Supercomputer Center

Overview
- Motivation
- GAMMA Programming Model
- Implementation Overview
- Experimental Evaluation
- Conclusions
11/24/2018

High Productivity Computing
Programmer productivity is extremely important.
- C/Fortran: good performance, but poor productivity; parallel programming in C/Fortran is even harder.
- MATLAB, Python, etc.: good programmer productivity, but poor performance and an inability to run large-scale problems (memory limitations).

MATLAB and High Productivity
Numerous features result in high programmer productivity:
- Array-based semantics
- Copy (value-based) semantics
- Debugging and profiling support
- Integrated development environment
- Numerous domain-specific libraries (toolboxes)
- Visualization, and much more
These features must be retained while addressing the performance issues.

The Problem
[Charts: serial MATLAB runs out of memory on larger (Class B) NAS problem sizes, and is far slower than Fortran where it does run (199 sec vs. 10.19 sec on NAS EP)]

ParaM: 'Parallel MATLAB'
[Stack diagram: users program against DParaM, GAMMA, and specialized libraries; library writers use mexMPI and GAMMA; a compiler targets MATLAB; all layers are built on GA + MVAPICH]

Overview
- Motivation
- GAMMA Programming Model
- Implementation Overview
- Experimental Evaluation
- Conclusions

Programming Model
Global shared view of the distributed array: physically, blocks of the 1024x1024 array live on processes P0..P3; logically, the programmer indexes one global array.

  A = GA([1024, 1024], distr);
  Block = A(250:700, 75:610);   % logical sub-block spanning multiple processes

Programming Model (contd.)
Get-Compute-Put computation model: each process Get()s the data it needs from the global array, Compute()s on it locally, then Put()s the results back.
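Using only the GAMMA calls that appear in the FFT2 example later in this deck (Begin, GA, local, Put, Sync, GA_End), the Get-Compute-Put cycle can be sketched as follows; the element-wise sqrt is an arbitrary stand-in for the real local kernel:

```matlab
[rank, nprocs] = Begin();                    % join the parallel job
A = GA([1024, 1024], [1024, 1024/nprocs]);   % column-block distributed array

tmp = local(A);    % Get(): fetch this process's local block
tmp = sqrt(tmp);   % Compute(): any purely local computation
Put(A, tmp);       % Put(): write the block back to the global array
Sync();            % barrier: all Puts visible before subsequent Gets

GA_End();
```

Each process runs the same script; the distribution argument to GA determines which block local() returns on each rank.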

Other Features in the Programming Model Enabling Efficiency
- Pass-by-reference semantics for distributed arrays (intended for library writers)
- Management of data locality (NUMA): distribution information can be retrieved by the programmer; reference-based access to the local data
- Data replication: support for replicating near-neighbor data
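As a sketch of the locality-management features above: assuming a distribution-query accessor such as `distribution(A, rank)` (a hypothetical name; the actual GAMMA accessor may differ, while `Begin`, `GA`, `local`, `Put`, `Sync`, and `GA_End` are taken from the FFT2 example):

```matlab
[rank, nprocs] = Begin();
A = GA([1024, 1024], [1024/nprocs, 1024]);   % row-block distribution

% Hypothetical query: which global row range does this rank own?
[lo, hi] = distribution(A, rank);

tmp = local(A);   % reference-based access to the owned rows: no remote traffic
tmp = 2 * tmp;    % compute only on the locally owned data (rows lo:hi)
Put(A, tmp);
Sync();
GA_End();
```

The point of the sketch is that a library writer can discover ownership and keep computation on local data, avoiding remote gets entirely in the common case.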

Other Features Enabling Efficiency (contd.)
- Asynchronous operations (support for library writers)
- Interoperable with message passing: MPI support via 'mexMPI'
- Interoperable with other 'Parallel MATLAB' projects, e.g., pMATLAB and MathWorks DCT

Illustration by Example (FFT2): 2-D FFT

  [rank, nprocs] = Begin();
  dims = [N N]; distr = [N N/nprocs];
  A = GA(dims, distr);
  tmp = local(A);          % Get()
  tmp = fft(tmp);          % Compute(): 1-D FFTs along the first dimension
  Put(A, tmp);             % Put()
  Sync();
  ATmp = GA(A);
  Transpose(A, ATmp);      % collective operation
  Tmp = local(ATmp);
  Put(ATmp, fft(Tmp));     % FFT along the other dimension
  Transpose(ATmp, A);
  GA_End();

Software Architecture
[Stack diagram: user code runs in the MATLAB front-end; GAMMA and mexMPI sit between the front-end and the MATLAB computation engine; underneath are GA, MPI, and ScaLAPACK]

Overview
- Motivation
- GAMMA Programming Model
- Implementation Overview
- Experimental Evaluation
- Conclusions

Evaluation
- OSC Pentium 4 cluster: two 2.4 GHz Intel P4 processors per node, Linux kernel 2.6.6, 4 GB RAM, MVAPICH 0.9.4 over InfiniBand
- MATLAB version 7.0.1
- Fully distributed environment
- Evaluation using the NAS benchmarks

Programmability
[SLOC comparison charts: only a slight to moderate increase in source lines of code over the serial MATLAB versions]

Performance Analysis
[Performance charts]

Performance Analysis
[Performance charts]

Speedup on Large Problem Sizes
[Speedup charts]

Related Work
- Early '90s: MPI and cluster programming
- 1995: 'Why there isn't a parallel MATLAB' -- Cleve Moler
- Embarrassingly parallel: Paralize ('98); Multi ('00); PLab ('00); Parmatlab ('01)
- Message passing: MultiMatlab ('96); PT ('96); DPToolbox ('99); MATmarks ('99); PMI ('99); MPITB/PVMTB ('00); CMTM ('01)
- Compilation based: Conlab ('93); Falcon ('95); ParAL ('95); Otter ('98); Menhir ('98); MaJIC ('98); MATCH ('00); RTExpress ('00)
- Backend support: Matpar ('98); DLab ('99); NetSolve ('01); Paramat ('01)

Related Work (Currently Active)
- Star-P ('97), MIT
- MatlabMPI ('98) and pMATLAB ('02), MIT Lincoln Laboratory: file-based message-passing communication
- MATLAB D ('00), Rice: telescoping compilation + HPF + JIT compilation
- ParaM ('04), OSU & OSC
- MathWorks ('04): MDCE/MDCT

Conclusions
- Presented an efficient distributed shared memory toolbox for MATLAB
- Described the programming model and the efficiency features of the toolbox
- Demonstrated efficiency using the NAS benchmarks
- Download available upon request

Questions? Contact: panugant@cse.ohio-state.edu

Backup
- NAS FT – A
- NAS EP – A
- Implementation issues

Performance Analysis (contd.)
[Performance charts]

Implementation Issues
- Different memory managers
- Automated bookkeeping
- Data layout inconsistencies
- In-place operations
- Data movement between different workspaces
- Out-of-order and irregular accesses
