Mattan Erez, The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011
Lecture 12 – Parallelism in Software III
Mattan Erez, The University of Texas at Austin
(c) Rodric Rabbah, Mattan Erez

Credits
  Most of the slides are courtesy of Dr. Rodric Rabbah (IBM), taken from the 6.189 IAP course taught at MIT in 2007.

Patterns for Parallelizing Programs: 4 Design Spaces
  Algorithm expression
    Finding Concurrency: expose concurrent tasks
    Algorithm Structure: map tasks to processes to exploit the parallel architecture
  Software construction
    Supporting Structures: code and data structuring patterns
    Implementation Mechanisms: low-level mechanisms used to write parallel programs
  From Patterns for Parallel Programming, Mattson, Sanders, and Massingill (2005)

Quick recap
  Decomposition and dependencies
    Keep things general and simple
    Consider only rough machine properties (10, 1000, 1M, …)
  Task decomposition
    Natural in some programs
    Need to balance the overhead of fine-grained tasks against the degree of parallelism
  Data decomposition
    Natural in some programs, but less general than task decomposition
    Consider the data structure
  Pipeline decomposition
    Overlaps computation and communication
    Reduces the degree of other parallelism needed
  Dependencies
    Equivalent to RAW/WAW/WAR

Patterns for Parallelizing Programs: 4 Design Spaces
  Algorithm expression
    Finding Concurrency: expose concurrent tasks
    Algorithm Structure: map tasks to processes to exploit the parallel architecture
  Software construction
    Supporting Structures: code and data structuring patterns
    Implementation Mechanisms: low-level mechanisms used to write parallel programs
  From Patterns for Parallel Programming, Mattson, Sanders, and Massingill (2005)

Algorithm Structure Design Space
  Given a collection of concurrent tasks, what's the next step?
    Map tasks to units of execution (e.g., threads)
  Important considerations
    The magnitude of the number of execution units the platform will support
    The cost of sharing information among execution units
  Avoid the tendency to over-constrain the implementation
    Work well on the intended platform
    Be flexible enough to easily adapt to different architectures

Major Organizing Principle
  How do we determine the algorithm structure that represents the mapping of tasks to units of execution?
  Concurrency usually implies a major organizing principle
    Organize by tasks
    Organize by data decomposition
    Organize by flow of data

Organize by Tasks?
  Recursive? Yes → Divide and Conquer; No → Task Parallelism

Task Parallelism
  Example: molecular dynamics
    Non-bonded force calculations, with some dependencies
  Common factors
    Tasks are associated with iterations of a loop
    Tasks are largely known at the start of the computation
    Not all tasks need to complete to arrive at a solution

Divide and Conquer
  For recursive programs: divide and conquer
    Split the problem into subproblems, compute each subproblem's solution, and join the results
  Subproblems may not be uniform
  May require dynamic load balancing
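
As a concrete illustration, here is a minimal divide-and-conquer sketch using OpenMP tasks; the function name, the cutoff of 1024 elements, and the summation workload are illustrative choices, not part of the original slides (compile with an OpenMP flag such as -fopenmp).

    #include <stdio.h>

    /* Divide and conquer: split the range [lo, hi) until it is small,
       solve the pieces as independent tasks, then join (sum) the results. */
    static long sum_range(const int *a, int lo, int hi)
    {
        if (hi - lo < 1024) {                /* base case: compute directly */
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;        /* split */
        long left, right;
        #pragma omp task shared(left)        /* one subproblem becomes a task */
        left = sum_range(a, lo, mid);
        right = sum_range(a, mid, hi);       /* the other stays with this task */
        #pragma omp taskwait                 /* join */
        return left + right;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static int a[N];
        for (int i = 0; i < N; i++) a[i] = 1;

        long total;
        #pragma omp parallel
        #pragma omp single                   /* one thread starts the recursion */
        total = sum_range(a, 0, N);

        printf("sum = %ld (expected %d)\n", total, N);
        return 0;
    }

The split, compute, and join steps map directly onto the code: the midpoint computation is the split, the recursive calls compute the subproblem solutions, and the taskwait is the join.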

Organize by Data?
  Operations on a central data structure
    Arrays and linear data structures
    Recursive data structures
  Recursive? Yes → Recursive Data; No → Geometric Decomposition

Recursive Data
  Computation on a list, tree, or graph
  It often appears that the only way to solve the problem is to move sequentially through the data structure
  There are, however, opportunities to reshape the operations in a way that exposes concurrency

Recursive Data Example: Find the Root
  Given a forest of rooted directed trees, for each node, find the root of the tree containing that node
  Parallel approach: for each node, find its successor's successor; repeat until nothing changes
    O(log n) steps vs. O(n) for the sequential approach
  (Figure: a seven-node forest converging to its roots over Step 1, Step 2, and Step 3.)
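
A small sketch of this pointer-jumping idea on an array of parent links; the particular forest, the variable names, and the OpenMP pragma are illustrative additions, not from the slides.

    #include <stdio.h>
    #include <string.h>

    #define N 8   /* nodes 1..7; index 0 unused */

    int main(void)
    {
        /* parent[i] is the parent of node i; roots point to themselves.
           Example forest: 4 -> 3 -> 2 -> 1 (root 1) and 6 -> 5, 7 -> 5 (root 5). */
        int parent[N] = { 0, 1, 1, 2, 3, 5, 5, 5 };
        int next[N];

        int changed = 1;
        while (changed) {                       /* O(log n) rounds */
            changed = 0;
            /* Every iteration of this loop is independent, so each round
               can run in parallel. */
            #pragma omp parallel for reduction(|:changed)
            for (int i = 1; i < N; i++) {
                next[i] = parent[parent[i]];    /* successor's successor */
                if (next[i] != parent[i]) changed = 1;
            }
            memcpy(&parent[1], &next[1], (N - 1) * sizeof(int));
        }

        for (int i = 1; i < N; i++)
            printf("root of node %d is %d\n", i, parent[i]);
        return 0;
    }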

Work vs. Concurrency Tradeoff
  The parallel restructuring of the find-the-root algorithm takes O(n log n) work vs. O(n) for the sequential approach
  Most strategies based on this pattern similarly trade an increase in total work for a decrease in execution time due to concurrency

Organize by Flow of Data?
  In some application domains, the flow of data imposes an ordering on the tasks
    Regular, one-way, mostly stable data flow
    Irregular, dynamic, or unpredictable data flow
  Regular? Yes → Pipeline; No → Event-Based Coordination

Pipeline Throughput vs. Latency
  The amount of concurrency in a pipeline is limited by the number of stages
  Works best when the time to fill and drain the pipeline is small compared to the overall running time
  The performance metric is usually throughput
    The rate at which data appear at the end of the pipeline per unit of time (e.g., frames per second)
  Pipeline latency is important for real-time applications
    The time interval from data input to the pipeline until data output
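
As a made-up numeric illustration: a 4-stage video pipeline in which every stage takes 1 ms needs 4 ms to fill, after which one frame emerges every millisecond. Processing 1,000 frames then takes about 4 ms + 999 ms, so throughput approaches 1,000 frames per second, while the latency of any single frame is still 4 ms; the fill and drain time is negligible only because 1,000 frames is much larger than the 4-frame pipeline depth.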

Event-Based Coordination
  In this pattern, the interaction of tasks to process data can vary over unpredictable intervals
  Deadlocks are a danger for applications that use this pattern
  Dynamic scheduling has overhead and may be inefficient
    Granularity is a major concern
  Another option is one of the various "static" dataflow models
    E.g., synchronous dataflow

Patterns for Parallelizing Programs: 4 Design Spaces
  Algorithm expression
    Finding Concurrency: expose concurrent tasks
    Algorithm Structure: map tasks to processes to exploit the parallel architecture
  Software construction
    Supporting Structures: code and data structuring patterns
    Implementation Mechanisms: low-level mechanisms used to write parallel programs
  From Patterns for Parallel Programming, Mattson, Sanders, and Massingill (2005)

Code Supporting Structures
  Loop parallelism
  Master/Worker
  Fork/Join
  SPMD
  Map/Reduce

Loop Parallelism Pattern
  Many programs are expressed using iterative constructs
  Programming models like OpenMP provide directives to automatically assign loop iterations to execution units
  Especially good when the code cannot be massively restructured
  (Figure: iterations i = 0 through i = 11 distributed across execution units.)

    #pragma omp parallel for
    for (i = 0; i < 12; i++)
        C[i] = A[i] + B[i];
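
For reference, a self-contained version of that fragment might look like the following; the array size of 12 comes from the slide, while the initialization and output are added here, and compilation needs an OpenMP flag such as -fopenmp.

    #include <stdio.h>

    int main(void)
    {
        enum { N = 12 };
        int A[N], B[N], C[N];

        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

        /* OpenMP assigns the 12 iterations to the available execution units. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];

        for (int i = 0; i < N; i++)
            printf("C[%d] = %d\n", i, C[i]);
        return 0;
    }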

Master/Worker Pattern
  (Figure: a master hands out independent tasks A through E to a pool of workers.)

Master/Worker Pattern
  Particularly relevant for problems using the task parallelism pattern where the tasks have no dependencies
    Embarrassingly parallel problems
  The main challenge is determining when the entire problem is complete
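
One minimal way to express the pattern is a shared "bag of tasks" counter from which workers claim work until the bag is empty; the task function, task count, and names below are made up for illustration.

    #include <stdio.h>

    #define NTASKS 20

    /* Stand-in for one independent piece of work. */
    static int do_task(int id) { return id * id; }

    int main(void)
    {
        int results[NTASKS];
        int next = 0;                      /* bag of tasks shared by all workers */

        #pragma omp parallel               /* each thread acts as a worker */
        {
            for (;;) {
                int mine;
                #pragma omp atomic capture /* claim the next unassigned task */
                mine = next++;
                if (mine >= NTASKS) break; /* bag is empty: this worker retires */
                results[mine] = do_task(mine);
            }
        }

        /* The master role is played by the main thread, which set up the bag
           and now collects the results. */
        for (int i = 0; i < NTASKS; i++)
            printf("task %d -> %d\n", i, results[i]);
        return 0;
    }

Because every task is claimed exactly once through the shared counter, detecting completion reduces to the counter passing NTASKS.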

Fork/Join Pattern
  Tasks are created dynamically
    Tasks can create more tasks
  Tasks are managed according to their relationships
    A parent task creates new tasks (fork), then waits until they complete (join) before continuing with the computation
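
A bare-bones fork/join sketch using POSIX threads; the partial-sum work is only a placeholder for whatever the forked task would compute (link with -pthread).

    #include <stdio.h>
    #include <pthread.h>

    /* Each task sums its half of the array. */
    struct range { const int *a; int lo, hi; long sum; };

    static void *partial_sum(void *arg)
    {
        struct range *r = arg;
        r->sum = 0;
        for (int i = r->lo; i < r->hi; i++) r->sum += r->a[i];
        return NULL;
    }

    int main(void)
    {
        enum { N = 1000 };
        int a[N];
        for (int i = 0; i < N; i++) a[i] = 1;

        struct range left  = { a, 0,     N / 2, 0 };
        struct range right = { a, N / 2, N,     0 };

        pthread_t child;
        pthread_create(&child, NULL, partial_sum, &right);  /* fork */
        partial_sum(&left);           /* the parent keeps some work for itself */
        pthread_join(child, NULL);    /* join: wait before using the result */

        printf("total = %ld\n", left.sum + right.sum);
        return 0;
    }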

SPMD Pattern
  Single Program, Multiple Data: create a single source-code image that runs on each processor
    Initialize
    Obtain a unique identifier
    Run the same program on each processor
      The identifier and the input data differentiate each processor's behavior
    Distribute data
    Finalize
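
These steps map naturally onto MPI; a minimal SPMD sketch might look like the following, where the summation is only a stand-in for real work.

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: every process runs this same program; the rank obtained after
       initialization is what differentiates the processes' behavior. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                  /* initialize */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* obtain a unique identifier */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process works on its own slice of the iteration space. */
        long local = 0;
        for (int i = rank; i < 1000; i += size)
            local += i;

        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)                           /* only one process reports */
            printf("sum = %ld\n", total);

        MPI_Finalize();                          /* finalize */
        return 0;
    }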

SPMD Challenges
  Split the data correctly
  Correctly combine the results
  Achieve an even distribution of the work
  For programs that need dynamic load balancing, an alternative pattern is more suitable
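
For the first and third challenges, one common block-distribution rule gives the first n mod p workers one extra item each; a small sketch with arbitrary numbers:

    #include <stdio.h>

    int main(void)
    {
        int n = 100, p = 8;                  /* items and workers (arbitrary) */
        int base = n / p, extra = n % p;

        for (int r = 0; r < p; r++) {
            int count = base + (r < extra ? 1 : 0);
            int start = r * base + (r < extra ? r : extra);
            printf("worker %d: items [%d, %d)\n", r, start, start + count);
        }
        return 0;
    }

No worker ends up with more than one item above the average, which handles the even-distribution challenge for static workloads; workloads that need dynamic load balancing call for a different pattern, such as master/worker.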

Map/Reduce Pattern
  Two phases in the program
    Map phase: applies a single function to all data
      Each result is a tuple of a value and a tag
    Reduce phase: combines the results
      The values of elements with the same tag are combined into a single value per tag (a reduction)
      The combining function is associative
        Can be done in parallel
        Can be pipelined with the map phase
  Google uses this pattern for many of its parallel programs
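
A toy, sequential rendering of the two phases; every name here is made up, and real map/reduce frameworks distribute both phases across machines. Map tags each word with its length, and reduce sums the values that share a tag.

    #include <stdio.h>
    #include <string.h>

    #define MAX_TAG 16

    int main(void)
    {
        const char *words[] = { "map", "reduce", "is", "a", "parallel", "pattern" };
        enum { NWORDS = 6 };

        int tag[NWORDS], value[NWORDS];    /* intermediate (tag, value) tuples */

        /* Map phase: one independent function application per input element. */
        for (int i = 0; i < NWORDS; i++) {
            tag[i]   = (int)strlen(words[i]);
            value[i] = 1;
        }

        /* Reduce phase: combine values that share a tag (associative: +). */
        int count[MAX_TAG] = { 0 };
        for (int i = 0; i < NWORDS; i++)
            count[tag[i]] += value[i];

        for (int t = 0; t < MAX_TAG; t++)
            if (count[t])
                printf("%d word(s) of length %d\n", count[t], t);
        return 0;
    }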

Communication and Synchronization Patterns
  Communication
    Point-to-point
    Broadcast
    Reduction
    Multicast
  Synchronization
    Locks (mutual exclusion)
    Monitors (events)
    Barriers (wait for all)
      Split-phase barriers (separate signal and wait), sometimes called "fuzzy barriers"
      Named barriers allow waiting on a subset
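
Two of the listed mechanisms, a lock and a barrier, in OpenMP form; this is a small illustrative fragment, not from the slides.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int hits = 0;

        #pragma omp parallel
        {
            #pragma omp critical      /* mutual exclusion around the update */
            hits++;

            #pragma omp barrier       /* wait for all threads to check in */

            #pragma omp single        /* one thread reports after the barrier */
            printf("threads that checked in: %d of %d\n",
                   hits, omp_get_num_threads());
        }
        return 0;
    }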

Algorithm Structure and Organization (from the Book)
  A table in the book rates, with one to four stars, how well each supporting structure (SPMD, Loop Parallelism, Master/Worker, Fork/Join) fits each algorithm structure (task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination)
  Patterns can be hierarchically composed, so a program may use more than one pattern