Mattan Erez The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality, Spring 2015
Lecture 15 – Parallelism in Software III
Mattan Erez, The University of Texas at Austin
EE382N: Parallelism and Locality, Spring 2015 -- Lecture 15 (c) Rodric Rabbah, Mattan Erez

Credits
- Most of the slides courtesy of Dr. Rodric Rabbah (IBM), taken from 6.189 IAP taught at MIT in 2007
- Parallel scan slides courtesy of David Kirk (NVIDIA) and Wen-Mei Hwu (UIUC), taken from EE493-AI taught at UIUC in Spring 2007

4 Common Steps to Creating a Parallel Program
- Partitioning: decomposition, assignment, orchestration, mapping
[Figure: a sequential computation is decomposed into tasks, assigned to units of execution UE0 through UE3, orchestrated, and mapped onto parallel processors]

Data Decomposition Examples
- Molecular dynamics: geometric decomposition
- Merge sort: recursive decomposition
[Figure: merge-sort tree; the problem is split into subproblems, each subproblem is computed, and the results are merged into the solution]

Dependence Analysis
- Given two tasks, how do we determine whether they can safely run in parallel?

Bernstein's Condition
- Ri: set of memory locations read (input) by task Ti
- Wj: set of memory locations written (output) by task Tj
- Two tasks T1 and T2 can run in parallel if:
  - the input to T1 is not part of the output of T2 (R1 ∩ W2 = ∅)
  - the input to T2 is not part of the output of T1 (R2 ∩ W1 = ∅)
  - the outputs of T1 and T2 do not overlap (W1 ∩ W2 = ∅)

Example
- T1: a = x + y
- T2: b = x + z
- R1 = { x, y }, W1 = { a }
- R2 = { x, z }, W2 = { b }
- The sets satisfy all three parts of Bernstein's condition, so T1 and T2 can run in parallel
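Bernstein's condition can be sketched as a small C check; the string-set representation and the overlaps helper are illustrative, not part of the slides:

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative helper: true if the two NULL-terminated arrays of
 * variable names share any element. */
static bool overlaps(const char **a, const char **b) {
    for (int i = 0; a[i]; i++)
        for (int j = 0; b[j]; j++)
            if (strcmp(a[i], b[j]) == 0)
                return true;
    return false;
}

/* Bernstein's condition: tasks are parallel iff
 * R1 ∩ W2 = ∅, R2 ∩ W1 = ∅, and W1 ∩ W2 = ∅. */
bool bernstein_parallel(const char **r1, const char **w1,
                        const char **r2, const char **w2) {
    return !overlaps(r1, w2) && !overlaps(r2, w1) && !overlaps(w1, w2);
}
```

Applied to the example above (R1 = {x, y}, W1 = {a}, R2 = {x, z}, W2 = {b}), the check succeeds; if T2 instead wrote x, the first intersection would be non-empty and the tasks could not run in parallel.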

Patterns for Parallelizing Programs: 4 Design Spaces
- Algorithm expression:
  - Finding Concurrency: expose concurrent tasks
  - Algorithm Structure: map tasks to processes to exploit the parallel architecture
- Software construction:
  - Supporting Structures: code and data structuring patterns
  - Implementation Mechanisms: low-level mechanisms used to write parallel programs
(Patterns for Parallel Programming. Mattson, Sanders, and Massingill, 2005.)

Algorithm Structure Design Space
- Given a collection of concurrent tasks, what's the next step? Map tasks to units of execution (e.g., threads)
- Important considerations:
  - The number of execution units the platform will support
  - The cost of sharing information among execution units
  - Avoid the tendency to over-constrain the implementation: work well on the intended platform, but stay flexible enough to adapt easily to different architectures

Major Organizing Principle
- How to determine the algorithm structure that represents the mapping of tasks to units of execution?
- Concurrency usually implies a major organizing principle:
  - Organize by tasks
  - Organize by data decomposition
  - Organize by flow of data

Organize by Tasks?
- Recursive? Yes → Divide and Conquer; No → Task Parallelism

Task Parallelism
- Example: molecular dynamics, the non-bonded force calculations (some dependencies)
- Common factors:
  - Tasks are associated with iterations of a loop
  - Tasks are largely known at the start of the computation
  - Not all tasks may need to complete to arrive at a solution

Divide and Conquer
- For recursive programs: divide and conquer
- Subproblems may not be uniform
- May require dynamic load balancing
[Figure: a problem is split into subproblems, each subproblem is computed, and the solutions are joined]
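A minimal serial merge sort shows the divide-and-conquer shape; the comments mark where a parallel version would fork and join tasks (the code itself is an illustrative sketch, not from the lecture):

```c
#include <string.h>

/* Serial divide-and-conquer merge sort over a[lo, hi). In a parallel
 * version the two recursive calls are independent subproblems that
 * could be forked as tasks and joined before the merge step. */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

void merge_sort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;        /* base case: solved directly */
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);    /* split: could be forked as a task */
    merge_sort(a, tmp, mid, hi);    /* split: independent subproblem */
    merge(a, tmp, lo, mid, hi);     /* join, then combine results */
}
```

Note that the subproblems here are uniform; with uneven splits (e.g., quicksort partitions) the load-balancing concern from the slide appears immediately.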

Organize by Data?
- Operations on a central data structure:
  - Arrays and linear data structures
  - Recursive data structures
- Recursive? Yes → Recursive Data; No → Geometric Decomposition

Recursive Data
- Computation on a list, tree, or graph
- It often appears that the only way to solve the problem is to move sequentially through the data structure
- There are, however, opportunities to reshape the operations in a way that exposes concurrency

Recursive Data Example: Find the Root
- Given a forest of rooted directed trees, for each node, find the root of the tree containing that node
- Parallel approach (pointer jumping): for each node, replace its successor with its successor's successor; repeat until no changes
- O(log n) steps vs. O(n) sequentially
[Figure: a seven-node forest in which every node converges to its root over steps 1 to 3]
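The pointer-jumping idea can be sketched in C; the sequential loops stand in for what would be a parallel update of all nodes in each round (the double buffer mimics that all updates in a round read the old values):

```c
/* Pointer jumping ("find the root"). succ[i] is node i's successor;
 * a root points to itself. Each round every node links to its
 * successor's successor, halving path lengths, so every node reaches
 * its root in O(log n) rounds. In a parallel version the inner loop
 * runs concurrently over all nodes. */
void find_roots(int *succ, int n) {
    int next[n];                    /* double buffer for one round */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {   /* parallel in principle */
            next[i] = succ[succ[i]];
            if (next[i] != succ[i]) changed = 1;
        }
        for (int i = 0; i < n; i++) succ[i] = next[i];
    }
}
```

After the loop terminates, succ[i] is the root of i's tree for every node i.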

Work vs. Concurrency Tradeoff
- The parallel restructuring of the find-the-root algorithm leads to O(n log n) work vs. O(n) for the sequential approach
- Most strategies based on this pattern similarly trade an increase in total work for a decrease in execution time due to concurrency

Organize by Flow of Data?
- In some application domains, the flow of data imposes an ordering on the tasks
- Regular, one-way, mostly stable data flow → Pipeline
- Irregular, dynamic, or unpredictable data flow → Event-Based Coordination

Pipeline Throughput vs. Latency
- The amount of concurrency in a pipeline is limited by the number of stages
- Works best if the time to fill and drain the pipeline is small compared to the overall running time
- The performance metric is usually throughput: the rate at which data appear at the end of the pipeline per unit time (e.g., frames per second)
- Pipeline latency matters for real-time applications: the time interval from data entering the pipeline to data output
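The fill/drain effect can be quantified with a small sketch, assuming S equal stages of time t each (the formula is standard pipeline arithmetic, not stated on the slide):

```c
/* Total time for N items through S equal stages of t each:
 * (S - 1)*t to fill the pipeline, then one item completes every t.
 * Per-item latency is S*t regardless of N. */
double pipeline_time(int n_items, int n_stages, double t_stage) {
    return (double)(n_stages - 1 + n_items) * t_stage;
}

/* Throughput approaches 1/t as N grows and fill/drain amortizes. */
double pipeline_throughput(int n_items, int n_stages, double t_stage) {
    return (double)n_items / pipeline_time(n_items, n_stages, t_stage);
}
```

For example, with 4 stages of 1 time unit, a single item takes 4 units (pure latency), while 100 items take 103 units, i.e., throughput is already within a few percent of the 1-item-per-unit limit.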

Event-Based Coordination
- In this pattern, the interaction of tasks to process data can vary over unpredictable intervals
- Deadlocks are a danger for applications that use this pattern
- Dynamic scheduling has overhead and may be inefficient; granularity is a major concern
- Another option is one of the various "static" dataflow models, e.g., synchronous dataflow

Code Supporting Structures
- Loop parallelism
- Master/Worker
- Fork/Join
- SPMD
- Map/Reduce
- Task dataflow

Loop Parallelism Pattern
- Many programs are expressed using iterative constructs
- Programming models like OpenMP provide directives to automatically assign loop iterations to execution units
- Especially good when the code cannot be massively restructured

    #pragma omp parallel for
    for (i = 0; i < 12; i++)
        C[i] = A[i] + B[i];

[Figure: iterations i = 0 through i = 11 distributed across execution units]
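The slide's fragment extends to a complete, compilable routine; compilers without OpenMP simply ignore the pragma, so the code behaves identically either way:

```c
/* Each iteration is independent, so OpenMP may split the i-range
 * across threads; built without -fopenmp, the pragma is ignored and
 * the loop runs serially with the same result. */
void vector_add(const int *A, const int *B, int *C, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}
```

This independence of iterations is exactly what makes the pattern safe to apply mechanically; a loop-carried dependence (e.g., C[i] = C[i-1] + A[i]) would not qualify.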

Master/Worker Pattern
[Figure: a master holds a bag of independent tasks A through E and hands them out to a pool of workers]

Master/Worker Pattern
- Particularly relevant for problems using the task-parallelism pattern where tasks have no dependencies (embarrassingly parallel problems)
- The main challenge is determining when the entire problem is complete
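A single-threaded sketch of the master/worker idea, with the concurrency noted in comments (the task body and counts are illustrative, not from the slides):

```c
/* Sketch of the master/worker pattern. The master owns a shared task
 * counter; each worker repeatedly grabs the next task until none
 * remain. In a real implementation the counter increment would be an
 * atomic fetch-add (or lock-protected), and completion is detected
 * when all NTASKS results have been produced. */
#define NTASKS 8

static int results[NTASKS];
static int next_task = 0;            /* the master's shared counter */

static int do_task(int id) { return id * id; }  /* illustrative work */

/* One worker: pull tasks until the bag is empty. Several of these
 * would run concurrently in a threaded version. */
static void worker(void) {
    while (next_task < NTASKS) {
        int id = next_task++;        /* atomic in a real version */
        results[id] = do_task(id);
    }
}
```

Because tasks are independent, any number of workers can drain the same counter; the "is everything done?" question reduces to the counter reaching NTASKS and all in-flight tasks finishing.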

Fork/Join Pattern
- Tasks are created dynamically, and tasks can create more tasks
- Tasks are managed according to their relationships
- A parent task creates new tasks (fork), then waits until they complete (join) before continuing with the computation

SPMD Pattern
- Single Program, Multiple Data: create a single source-code image that runs on each processor
  - Initialize
  - Obtain a unique identifier
  - Run the same program on each processor; the identifier and the input data differentiate behavior
  - Distribute data
  - Finalize
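The SPMD steps above can be sketched in C, with the per-rank runs simulated sequentially (in MPI or a threaded version each rank would run concurrently; all names here are illustrative):

```c
#define NPROCS 4
#define N 100

static long partial[NPROCS];

/* SPMD body: every "processor" runs this same function; only the
 * rank (the unique identifier) differentiates which data it touches. */
static void spmd_body(int rank) {
    int lo = rank * N / NPROCS;       /* distribute data by rank */
    int hi = (rank + 1) * N / NPROCS;
    long sum = 0;
    for (int i = lo; i < hi; i++)
        sum += i;
    partial[rank] = sum;              /* per-rank result */
}

long spmd_run(void) {
    for (int r = 0; r < NPROCS; r++)  /* concurrent in a real SPMD run */
        spmd_body(r);
    long total = 0;                   /* finalize: combine results */
    for (int r = 0; r < NPROCS; r++)
        total += partial[r];
    return total;
}
```

The rank-based index arithmetic is exactly where the "split data correctly" and "even distribution" challenges of the next slide show up.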

SPMD Challenges
- Split the data correctly
- Correctly combine the results
- Achieve an even distribution of the work
- For programs that need dynamic load balancing, an alternative pattern is more suitable

Map/Reduce Pattern
- Two phases in the program
- Map phase: apply a single function to all data; each result is a tuple of a value and a tag
- Reduce phase: combine the results; the values of elements with the same tag are combined into a single value per tag (a reduction)
- The combining function is associative, so the reduction can be done in parallel and can be pipelined with the map
- Google uses this pattern for many of its parallel programs
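A toy map/reduce in C: map emits (tag, value) pairs and reduce combines values with the same tag using an associative operator (the even/odd tagging and the sum operator are purely illustrative):

```c
#define NTAGS 2

typedef struct { int tag; int value; } pair_t;

/* Map function: one (tag, value) pair per input element. All map
 * applications are independent and could run in parallel. */
static pair_t map_fn(int x) {
    pair_t p = { x % NTAGS, x };      /* tag: even (0) or odd (1) */
    return p;
}

/* Reduce: values with the same tag are combined with +, which is
 * associative, so each tag's reduction could itself be a parallel
 * tree and could be pipelined with the map phase. */
void map_reduce(const int *in, int n, int *out /* NTAGS sums */) {
    for (int t = 0; t < NTAGS; t++) out[t] = 0;
    for (int i = 0; i < n; i++) {
        pair_t p = map_fn(in[i]);
        out[p.tag] += p.value;
    }
}
```

Associativity is the load-bearing property: it lets the runtime regroup and reorder the combines freely across execution units.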

Communication and Synchronization Patterns
- Communication: point-to-point; broadcast; reduction; multicast
- Synchronization:
  - Locks (mutual exclusion)
  - Monitors (events)
  - Barriers (wait for all)
  - Split-phase barriers (separate signal and wait), sometimes called "fuzzy barriers"
  - Named barriers allow waiting on a subset

Quick Recap
- Decomposition: high-level and fairly abstract; considers machine scale for the most part; task, data, pipeline; find dependencies
- Algorithm structure: still abstract, but a bit less so; considers communication, synchronization, and bookkeeping; task (collection/recursive), data (geometric/recursive), dataflow (pipeline/event-based coordination)
- Supporting structures: loop, master/worker, fork/join, SPMD, MapReduce

Algorithm Structure and Organization (from the Book)
[Table: the supporting structures (SPMD, Loop Parallelism, Master/Worker, Fork/Join) rated with one to four stars against the algorithm structures (task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination); the individual star ratings did not survive extraction]
- Patterns can be hierarchically composed so that a program uses more than one pattern

Algorithm Structure and Organization (my view)
[Table: same axes as the previous slide, with the instructor's own ratings; recoverable notes: "**** when no dependencies" and "**** SWP to hide comm." for Loop Parallelism, "***" for Master/Worker; the remaining cells did not survive extraction]
- Patterns can be hierarchically composed so that a program uses more than one pattern

ILP, DLP, and TLP in SW and HW
- ILP: within straight-line code (HW: OOO, dataflow, VLIW)
- DLP: parallel loops; tasks operating on disjoint data with no dependencies within the parallelism phase (HW: SIMD, vector)
- TLP: all of DLP plus producer-consumer chains (HW: essentially multiple cores with multiple sequencers)

ILP, DLP, and TLP and Supporting Patterns
[Table: ILP, DLP, and TLP vs. the algorithm-structure patterns (task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination).
ILP row: inline / unroll, inline, unroll.
DLP row: task parallelism: natural or local conditions; divide and conquer: after enough divisions; geometric decomposition: natural; recursive data: after enough branches; pipeline: difficult; event-based coordination: local conditions.
TLP row: entries not recoverable.]

ILP, DLP, and TLP and Implementation Patterns
[Table: ILP, DLP, and TLP vs. the implementation patterns (SPMD, Loop Parallelism, Master/Worker, Fork/Join).
ILP row: pipeline, unroll, inline.
DLP row: SPMD: natural or local-conditional; loop parallelism: natural; master/worker: local-conditional; fork/join: after enough divisions + local-conditional.
TLP row: entries not recoverable.]

Outline
- Molecular dynamics example
  - Problem description
  - Steps to solution: build data structures; compute forces; integrate for new positions; check global solution; repeat
- Finding concurrency: scans, data decomposition, reductions
- Algorithm structure
- Supporting structures

GROMACS
- Highly optimized molecular-dynamics package; popular code
- Specifically tuned for protein folding
- Hand-optimized loops for SSE3 (and other extensions)

GROMACS Components
- Non-bonded forces: water-water with cutoff; protein-protein tabulated; water-water tabulated; protein-water tabulated
- Bonded forces: angles; dihedrals
- Boundary conditions
- Verlet integrator
- Constraints: SHAKE, SETTLE
- Other: temperature-pressure coupling; virial calculation

GROMACS Water-Water Force Calculation
- Non-bonded long-range interactions: Coulomb and Lennard-Jones
- 234 operations per interaction
- Water-water interactions account for ~75% of GROMACS run time

GROMACS Uses a Non-Trivial Neighbor-List Algorithm
- The full non-bonded force calculation is O(n^2)
- GROMACS approximates with a cutoff: molecules located more than rc apart do not interact, giving O(n * rc^3)
- A separate neighbor list is kept for each molecule; neighbor lists have a variable number of elements
- The efficient algorithm leads to variable-rate input streams
[Figure: central molecules and their neighbor molecules within the cutoff radius]
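A naive neighbor-list build illustrates the cutoff test and the variable-length per-molecule lists; this is only a sketch with made-up sizes, not GROMACS's actual algorithm (which avoids the O(n^2) pair scan):

```c
/* Illustrative neighbor-list construction. Each molecule i gets its
 * own variable-length list of neighbors within cutoff rc; molecules
 * farther apart than rc are assumed not to interact. */
#define NMOL 5
#define MAXNB (NMOL - 1)

static int nb[NMOL][MAXNB];   /* neighbor indices per molecule */
static int nnb[NMOL];         /* variable list lengths */

void build_neighbor_lists(double x[NMOL][3], double rc) {
    double rc2 = rc * rc;     /* compare squared distances, no sqrt */
    for (int i = 0; i < NMOL; i++) {
        nnb[i] = 0;
        for (int j = 0; j < NMOL; j++) {
            if (j == i) continue;
            double d2 = 0.0;
            for (int k = 0; k < 3; k++) {
                double d = x[i][k] - x[j][k];
                d2 += d * d;
            }
            if (d2 <= rc2)            /* within cutoff: interacts */
                nb[i][nnb[i]++] = j;
        }
    }
}
```

The varying nnb[i] values are the source of the variable-rate input streams noted on the slide: each central molecule feeds the force loop a different amount of work.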

Other Examples
- More patterns: reductions; scans; search; building a data structure
- More examples: search; sort; FFT as divide and conquer; structured meshes and grids; sparse algebra; unstructured meshes and graphs; trees; collections; particles; rays