MIT Lincoln Laboratory
Achieving Portable Task and Data Parallelism on Parallel Signal Processing Architectures
Hank Hoffmann
5/26/2016


Authors: Hank Hoffmann, Eddie Rutledge, Jim Daly, Glenn Schrader, Jan Matlis, Patrick Richardson

*This work is sponsored by the US Navy, under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the author and not necessarily endorsed by the United States Air Force.

Overview
- Motivation: why write portable software?
- Philosophy: how to achieve portability; how to measure portability
- Overview of the software library
- Example signal processing application
- Conclusion

Motivation
- Take advantage of new processor technology: portable software enables rapid COTS insertion and technology refresh
- Interoperability: a larger choice of platforms is available

[Figure: system development/acquisition timeline over 4 years, showing program milestones (technology development, field demo, engineering/manufacturing, insertion) against COTS evolution through six processor generations and future upgrade systems]

Current Standards for Parallel Coding
- Industry standards (e.g. VSIPL, MPI) represent a significant improvement over coding with vendor-specific libraries; none of the work detailed in this presentation would be possible without the groundwork laid by standards such as VSIPL and MPI
- However, current industry standards still do not provide enough support to write truly portable parallel applications
- How can we build even more portable systems that work in parallel?

[Figure: progression from vendor-supplied libraries to current industry standards, ending in an open question]

Characteristics of Portable Software
Portable software maintains functionality and performance with minimal code changes:
- Single processor, functionality: compile and run on a new platform
- Single processor, performance: preserve performance (e.g. FFTW); take advantage of processor-specific traits (e.g. L1/L2/L3 cache, vector processing, etc.)
- Parallel processor, functionality: scale to a new processor set; handle a new communication network
- Parallel processor, performance: handle everything from the single-processor case; load balancing across processors; exploit algorithm parallelism

Writing Parallel Code Using Current Standards
We need the ability to abstract parallelism away from the code, and to treat distributed objects as a single unit.

Algorithm and mapping code, with the pulse compressor on processors 1-2 and the detector on processors 3-4:

    while (!done) {
      if (rank() == 1 || rank() == 2)
        pulseCompress();
      else if (rank() == 3 || rank() == 4)
        detect();
    }

Scaling the detector to processors 3-6 forces the algorithm code itself to change:

    while (!done) {
      if (rank() == 1 || rank() == 2)
        pulseCompress();
      else if (rank() == 3 || rank() == 4 || rank() == 5 || rank() == 6)
        detect();
    }
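The fragility of this style can be seen even without MPI. A minimal sketch (hypothetical names; the rank is passed in rather than queried from a communication library) in which the processor-to-task assignment is hard-coded into the branch structure, so growing the detector from two to four processors means editing the dispatch logic itself:

```cpp
#include <string>

// Hypothetical stand-in for rank-based dispatch: the task each rank runs
// is baked into the branch structure, as in the slide's example.
std::string taskForRank(int rank) {
    if (rank == 1 || rank == 2)
        return "pulseCompress";   // pulse compressor on procs 1-2
    else if (rank == 3 || rank == 4)
        return "detect";          // detector on procs 3-4
    return "idle";
}

// Scaling the detector to procs 3-6 requires a second, edited copy of
// the same dispatch code: the mapping lives inside the algorithm.
std::string taskForRankScaled(int rank) {
    if (rank == 1 || rank == 2)
        return "pulseCompress";
    else if (rank >= 3 && rank <= 6)
        return "detect";
    return "idle";
}
```

Every scaling decision lands as an edit to the algorithm's control flow, which is exactly the coupling the rest of the talk sets out to remove.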


Philosophy
Separate the job of writing a parallel application from the job of assigning hardware to that application.

The "application developer" converts the algorithm into code, writes the code once, and has an easier time coding because they are concerned only with the mathematics, not the distribution:

    while (!done) {
      pulseCompress();
      detect();
    }

The "mapper" maps the code to hardware and creates new mappings when the code is scaled or ported (e.g. a PC map and a DET map over processors 0-3).
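The developer/mapper split above can be sketched (illustrative names, not PVL's actual API) as algorithm code that consults an externally supplied task-to-processor table; the mapper edits only that table:

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative sketch: a "map" assigns each named task a set of
// processors. The algorithm below never mentions ranks directly, so the
// mapper can re-map or scale it by editing only this table.
using TaskMap = std::map<std::string, std::vector<int>>;

bool rankRunsTask(const TaskMap& m, const std::string& task, int rank) {
    auto it = m.find(task);
    if (it == m.end()) return false;
    for (int p : it->second)
        if (p == rank) return true;
    return false;
}

// Map-independent algorithm step: identical for any mapping.
std::string stepForRank(const TaskMap& m, int rank) {
    if (rankRunsTask(m, "pulseCompress", rank)) return "pulseCompress";
    if (rankRunsTask(m, "detect", rank)) return "detect";
    return "idle";
}
```

Scaling the detector from two processors to four is now a one-line change to the table, while `stepForRank` is untouched.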

Measuring Success
- Code complexity: the number of lines of application code that have to be changed to port or scale (e.g. mapping-dependent lines such as "if (rank() == 0) { ... }")
- Performance: must preserve the performance of a similar application built on lower-level libraries

[Figure: rate (MFLOP/s) versus vector length for the standards-based implementation and our library]


A New Parallel Signal Processing Library
Combining the best of existing standards and STAPL (the Space-Time Adaptive Processing Library) into a new library for parallel signal processing. The standards contribute a familiar interface; STAPL contributes parallel constructs. Benefits include portability, performance, fully parallel operation, increased productivity, and C++ abstraction.

Overview of Principal Library Constructs (Signal Processing & Control Mapping)

    Class          | Description                                                | Parallelism
    ---------------|------------------------------------------------------------|------------
    Map            | Specifies how Tasks, Matrices/Vectors, and Computations    | Both
                   | are distributed on processors                              |
    Atlas          | Container class for Maps                                   |
    Task           | Supports algorithm decomposition into potentially          | Task/control
                   | distributed signal processing tasks                        |
    Conduit        | Provides distributed data flow between tasks (transports   | Task/control
                   | matrices/vectors)                                          |
    Matrix/Vector  | Used to perform matrix/vector algebra, providing VSIPL     | Data
                   | functionality except that data and operations may be       |
                   | distributed                                                |
    Computation    | Used to perform more complicated signal/image processing   | Both
                   | functions (as per VSIPL) on matrices/vectors (e.g. FFT)    |
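The Task and Conduit constructs can be sketched in simplified, single-process form (class and method names here are illustrative, not PVL's actual API) as a typed channel that a source task inserts into and a sink task extracts from:

```cpp
#include <deque>
#include <vector>

// Illustrative single-process sketch of a Conduit: a typed channel that
// transports matrices (here just vectors) from a source task to a sink.
template <typename T>
class Conduit {
public:
    bool insertAvailable() const { return queue_.size() < capacity_; }
    bool extractReady() const { return !queue_.empty(); }
    void insert(T item) { queue_.push_back(std::move(item)); }
    T extract() { T front = queue_.front(); queue_.pop_front(); return front; }
private:
    std::deque<T> queue_;
    std::size_t capacity_ = 2;   // double buffering, as in the DIT example
};

// A task step with "control scope": it runs only when its output conduit
// has room, mirroring the insertAvailable() guard in the PVL code later.
int producerStep(Conduit<std::vector<float>>& c, int pulse) {
    if (!c.insertAvailable()) return 0;   // back-pressure: skip this cycle
    c.insert(std::vector<float>(4, static_cast<float>(pulse)));
    return 1;
}
```

In the real library the queue would span processors, but the control pattern (guard, fill, insert on one side; guard, extract on the other) is the same.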

PVL Concepts
- Each distributed object has a Map consisting of a Grid (binding to the physical machine) and a Distribution (of the object over the Grid). Maps provide portability and performance.
- The Grid (part of a Map) provides machine abstraction.
- Tasks provide a "control scope" for computation; Conduits transport data between tasks.
- Example operations: A = B + C; E = FFT(D).
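A map in this sense can be sketched (illustrative names and a single block distribution, not PVL's actual API) as a grid of processor ids plus a rule for distributing rows over that grid; asking which processor owns a row is then a pure function of the map, so re-mapping never touches algorithm code:

```cpp
#include <vector>

// Illustrative sketch: the Grid is a list of processor ids, and the
// Distribution here is a block distribution of rows over that grid.
struct Map {
    std::vector<int> grid;   // binding to the physical machine

    // Block distribution: rows are split as evenly as possible over the
    // grid; the first (numRows % p) processors each get one extra row.
    // Returns the processor id that owns the given row.
    int ownerOfRow(int row, int numRows) const {
        int p = static_cast<int>(grid.size());
        int base = numRows / p, extra = numRows % p;
        int boundary = extra * (base + 1);   // rows covered by larger blocks
        int idx = (row < boundary) ? row / (base + 1)
                                   : extra + (row - boundary) / base;
        return grid[idx];
    }
};
```

Porting to a different machine or processor count means constructing a different `Map`; the code that consults `ownerOfRow` is unchanged.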


Example of a Task and Data Parallel Application
A signal processing algorithm with 3 steps:
1. Digital Input: generates a 52 channel by 768 range matrix
2. Low Pass Filter: receives the 52 x 768 matrix; applies a coarse filter; 2:1 decimation; applies a fine filter
3. Beamformer and Detector: receives the 52 x 384 matrix; forms beams; applies a detection template; stores results
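The 52 x 384 size of the final stage follows directly from the 2:1 decimation. A toy sketch of the coarse-filter-and-decimate step on one channel (the 2-tap averaging filter and the function name are illustrative, not the actual filter used):

```cpp
#include <vector>

// Toy sketch of the low-pass-filter stage on one channel: a simple
// 2-tap averaging "coarse filter" followed by 2:1 decimation, taking a
// 768-sample range line down to 384 samples as in the pipeline above.
std::vector<float> coarseFilterAndDecimate(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.push_back(0.5f * (in[i] + in[i + 1]));  // average, keep every 2nd
    return out;
}
```

Applied to each of the 52 channels, this turns the 52 x 768 input matrix into the 52 x 384 matrix the beamformer receives.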

Mapping Parallelism in the Algorithm to Library Constructs
- Tasks: Digital Input, Low Pass Filtering, Beamforming and Detection
- Matrices: 52 x 768 and 52 x 384
- Conduits: transport the matrices between tasks
- Computations: FFT, matrix multiply

Implementing the Algorithm
- Examine implementations of the algorithm using our library and VSIPL/MPI
- Distributions: single processor, three processors, six processors
- Compare lines of code for the two different implementations on each mapping

MIT Lincoln Laboratory hch-18 HCH 5/26/2016 void DIT::init( Conduit & ditToLpfConduit) { m_ditToLpfConduit = &ditToLpfConduit; unsigned int outMatDims[2]; outMatDims[0] = NUM_CHANNELS; outMatDims[1] = NUM_RANGES; m_ditToLpfConduit->setupSrc( atlas.retrieveMap2d( DIT_OUTPUT_MAP), DOUBLE_BUFFER, outMatDims); } void DIT::run() { if (!*m_done && m_ditToLpfConduit-> insertAvailable()) { MatrixCplxFlt& outMat( m_ditToLpfConduit-> getInsertHandle()); fillMatrix(outMat); m_ditToLpfConduit->insert(); } while(!done) { DIT_generatePulse(ditLpfMatrix); } Single Processor Mapping PVL VSIPL Lines of Code VSIPL code has no communication overhead Some code compaction because PVL uses C++ instead of C VSIPL code has no communication overhead Some code compaction because PVL uses C++ instead of C

MIT Lincoln Laboratory hch-19 HCH 5/26/2016 void LPF::run() { if ( !*m_done && m_ditToLpfConduit-> extractReady() && m_lpfToBfdetConduit-> insertAvailable() ) { m_inMat = m_ditToLpfConduit-> getExtractHandle()); m_outMat = m_lpfToBfdetConduit-> getInsertHandle()); processPulse(); m_ditToLpfConduit->extract(); m_lpfToBfdetConduit->insert(); } Map Independent Code - Does not change! while(!done) { if( myRank == LPF_RANK ){ int blockStatus; int msgStatus; vsip_scalar_f* inData; vsip_scalar_f* outData; vsip_scalar_f* dummy; blockStatus = vsip_cblockrelease_f( ditLpfBlock, VSIP_FALSE, inData, dummy); MPI_Recv(inData, 2*NUM_CHANNELS*NUM_RANGES, MPI_FLOAT, DIT_RANK, DIT_LPF_TAG, MPI_COMM_WORLD, &msgStatus); vsip_cblockadmit_f(ditLpfBlock, VSIP_TRUE); vsip_cblockadmit_f(lpfBfdetBlock, VSIP_TRUE); LPF_processPulse(&lpf, ditLpfMatrix, lpfBfdetMatrix); blockStatus = vsip_cblockrelease_f( lpfBfdetBlock, VSIP_FALSE, outData, dummy); MPI_Send( outData, 2*NUM_CHANNELS* DECIMATED_RANGES, MPI_FLOAT, BFDET_RANK, LPF_BFDET_TAG, MPI_COMM_WORLD ); } Code must be added and changed to do Communication Three Processor Mapping PVL VSIPL + MPI Lines of Code Code must be added to the VSIPL implementation for communication PVL code does not need to change - it is parallel ready Code must be added to the VSIPL implementation for communication PVL code does not need to change - it is parallel ready

MIT Lincoln Laboratory hch-20 HCH 5/26/2016 while(!done) { if( myRank == BFDET_RANK ){ int blockStatus; int msgStatus; int blockOffset; int i; vsip_scalar_f* inData; vsip_scalar_f* outData; vsip_scalar_f* dummy; blockStatus = vsip_cblockrelease_f( lpfBfdetBlock, VSIP_FALSE, inData, dummy); for( i=0, blockOffset = 0; i < LPF_NODES; i++, blockOffset += 2*NUM_CHANNELS* DECIMATED_RANGES/LPF_NODES) { MPI_Recv(inData+blockOffset, 2*NUM_CHANNELS* DECIMATED_RANGES/LPF_NODES, MPI_FLOAT, LPF_RANKS[i], LPF_BFDET_TAG, MPI_COMM_WORLD, &msgStatus); } vsip_cblockadmit_f(lpfBfdetBlock, VSIP_TRUE); BFDET_processPulse(&bfdet, lpfBfdetMatrix, outputMatrix); } Code must be added and changed to do Communication void BFDET::run() { if(!*m_done && m_lpfToBfdetConduit->extractReady()) { m_inData = m_lpfToBfdetConduit-> getExtractHandle(); processData(); m_lpfToBfdetConduit->extract(); writeOutDetections(); } Map Independent Code - Does not change! Six Processor Mapping PVL VSIPL + MPI Lines of Code Small but tedious changes to the VSIPL/MPI implementation PVL code does not need to change Small but tedious changes to the VSIPL/MPI implementation PVL code does not need to change


System Development Using Current Software Technology
Flow: the application developer writes code; the code is compiled on the target platform; the code is run on the target platform; done. If a new mapping is desired, or a port to new hardware is desired, the cycle repeats from writing code.

Traditional code is: map dependent, inflexible, non-scalable.

System Development Using Our Library and Philosophy
Flow: the application developer writes code; the code is debugged on a workstation; the mapper edits the map file for the target platform; the code is compiled and run on the target platform; done. When a new mapping or a port to new hardware is desired, developers change maps, not code.

PVL code is: map independent, flexible, scalable, and capable of being debugged on a workstation. Traditional code is: map dependent, inflexible, non-scalable.

Conclusion
- Parallel applications written on top of PVL can be fully portable: 0 lines of code changed when scaling the PVL application
- Applications written with VSIPL and MPI are not fully portable: 74 lines of code were added to scale to three processors, and 23 more lines were added to scale from three to six processors
- A high-level signal processing library with task and data parallel constructs provides a huge increase in productivity for engineers developing signal processing applications because:
  - application code is more flexible: complicated changes to maps can be made without changes to code
  - application code is scalable: applications will work on 1 or 100 node systems without code modification
  - application programs can be written in a more natural way
  - ease of portability enables rapid COTS insertion and technology refresh