1 Lecture 14 – Parallelism in Software I
EE382N (20): Computer Architecture – Parallelism and Locality, Spring 2015
Mattan Erez, The University of Texas at Austin
(c) Rodric Rabbah, Mattan Erez

2 Credits
- Most of the slides courtesy of Dr. Rodric Rabbah (IBM), taken from the IAP course taught at MIT in 2007
- Parallel scan slides courtesy of David Kirk (NVIDIA) and Wen-Mei Hwu (UIUC), taken from EE493-AI taught at UIUC in Spring 2007

3 Parallel programming from scratch
- Start with an algorithm
  - Formal representation of the problem solution
  - Sequence of steps
- Make sure there is parallelism
  - In each algorithm step
  - Minimize synchronization points
- Don't forget locality
  - Communication is costly: performance, energy, system cost
- More often, you start with existing sequential code

4 Reengineering for Parallelism
- Define a testing protocol
- Identify program hot spots: where is most of the time spent?
  - Look at the code
  - Use profiling tools
- Parallelization
  - Start with the hot spots first
  - Make sequences of small changes, each followed by testing
  - Patterns provide guidance

5 4 Common Steps to Creating a Parallel Program
[Figure: a sequential computation is partitioned in four steps: decomposition into tasks, assignment of tasks to units of execution (UE0-UE3), orchestration, and mapping onto processors, yielding the parallel program.]

6 Decomposition
- Identify concurrency and decide at what level to exploit it
- Break up the computation into tasks to be divided among processes
  - Tasks may become available dynamically
  - The number of tasks may vary with time
- Create enough tasks to keep the processors busy
  - The number of tasks available at a time is an upper bound on the achievable speedup
- Main consideration: coverage and Amdahl's Law

7 Coverage
- Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- A demonstration of the law of diminishing returns

8 Amdahl's Law
- Potential program speedup is defined by the fraction of code that can be parallelized
- Example: a 100-second program with three phases: 25 seconds sequential, 50 seconds parallelizable, 25 seconds sequential
- Using 5 processors for the parallel work shrinks the 50-second phase to 10 seconds, for a total of 25 + 10 + 25 = 60 seconds

9 Amdahl's Law
- Same example: 100 seconds sequentially; 60 seconds when the 50-second parallel portion runs on 5 processors
- Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67 (the parallel version is 1.67 times faster)

10 Amdahl's Law
- p = fraction of work that can be parallelized
- n = the number of processors
- Speedup = old running time / new running time = 1 / ((1 - p) + p/n)
  - (1 - p) is the fraction of time spent completing the sequential work; p/n is the fraction of time spent completing the parallel work
- (A small numeric check of this formula follows below.)
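As a quick check of the formula, here is a tiny illustrative C program (not from the lecture); it plugs in the values from the 100-second example above, p = 0.5 and n = 5.

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
 * p = fraction of work that can be parallelized
 * n = number of processors                       */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* Example from the previous slide: 50 of 100 seconds are
     * parallelizable (p = 0.5) and we use 5 processors.       */
    printf("speedup = %.2f\n", amdahl_speedup(0.5, 5));  /* prints 1.67 */
    return 0;
}
```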

11 Implications of Amdahl's Law
- Speedup tends to 1 / (1 - p) as the number of processors tends to infinity
- Super-linear speedups are possible due to registers and caches
- Typical speedup is less than linear (the linear line on the speedup-vs-processors plot corresponds to 100% efficiency)
- Parallelism is only worthwhile when it dominates execution

12 Assignment
- Specify the mechanism by which work is divided among PEs
  - Balance work and reduce communication
- Structured approaches usually work well
  - Code inspection or understanding of the application
  - Well-known design patterns
- As programmers, we worry about partitioning first
  - Independent of architecture or programming model?
  - Complexity often affects decisions
  - Architectural model affects decisions
- Main considerations: granularity and locality

13 Fine vs. Coarse Granularity
- Coarse-grain parallelism
  - High computation-to-communication ratio
  - Large amounts of computational work between communication events
  - Harder to load balance efficiently
- Fine-grain parallelism
  - Low computation-to-communication ratio
  - Small amounts of computational work between communication stages
  - High communication overhead
  - Potential for HW assist

14 Load Balancing vs. Synchronization
[Figure: fine-grained vs. coarse-grained schedules of the same work on two PEs (PE0, PE1).]

15 Load Balancing vs. Synchronization
[Figure: the same fine vs. coarse schedules on PE0 and PE1.]
- Expensive synchronization -> favor coarse granularity
- Few units of execution + disparity in their execution times -> favor fine granularity

16 Orchestration and Mapping
- Computation and communication concurrency
- Preserve locality of data
- Schedule tasks to satisfy dependences early
- Survey the available mechanisms on the target system
- Main considerations: locality, parallelism, mechanisms (efficiency and dangers)

17 Parallel Programming by Pattern
- Provides a cookbook to systematically guide programmers
  - Decompose, Assign, Orchestrate, Map
  - Can lead to high-quality solutions in some domains
- Provides a common vocabulary for the programming community
  - Each pattern has a name, providing a vocabulary for discussing solutions
- Helps with software reusability, malleability, and modularity
  - Written in a prescribed format to allow the reader to quickly understand the solution and its context
- Otherwise, parallel programming is too difficult for most programmers, and software will not fully exploit parallel hardware

18 History
- Berkeley architecture professor Christopher Alexander
- In 1977, published a collection of patterns for city planning, landscaping, and architecture in an attempt to capture principles for "living" design

19 Example 167 (p. 783): 6ft Balcony
- Not just a collection of patterns, but a pattern language:
  - Patterns lead to other patterns
  - Patterns are hierarchical and compositional
  - Embodies a design methodology and vocabulary

20 Patterns in Object-Oriented Programming
- Design Patterns: Elements of Reusable Object-Oriented Software (1995)
  - Gang of Four (GoF): Gamma, Helm, Johnson, Vlissides
  - Catalogue of patterns: creational, structural, behavioral
- "Design patterns describe successful solutions to recurring design problems, and are organized hierarchically into a pattern language. Their form has proven to be a useful tool to model design experience, in architecture where they were originally conceived" [1, 2]

21 Patterns for Parallelizing Programs
4 design spaces:
- Algorithm expression
  - Finding Concurrency: expose concurrent tasks
  - Algorithm Structure: map tasks to processes to exploit the parallel architecture
- Software construction
  - Supporting Structures: code and data structuring patterns
  - Implementation Mechanisms: low-level mechanisms used to write parallel programs
Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005).

22 Here's my algorithm. Where's the concurrency?
[Figure: MPEG decoder block diagram. The MPEG bit stream enters the VLD, which splits it into frequency-encoded macroblocks and differentially coded motion vectors. Macroblocks flow through ZigZag, IQuantization, IDCT, and Saturation; motion vectors flow through Motion Vector Decode and Repeat. The two streams join at Motion Compensation, and the recovered picture passes through Picture Reorder and Color Conversion to the Display.]

23 Here's my algorithm. Where's the concurrency?
[Same MPEG decoder diagram as the previous slide.]
- Task decomposition
  - Independent coarse-grained computation
  - Inherent to the algorithm
- A task is a sequence of statements (instructions) that operate together as a group
  - Corresponds to some logical part of the program
  - Usually follows from the way the programmer thinks about a problem

24 Here's my algorithm. Where's the concurrency?
[Same MPEG decoder diagram.]
- Task decomposition
  - Parallelism in the application
- Pipeline task decomposition
  - Data assembly lines
  - Producer-consumer chains

25 Here's my algorithm. Where's the concurrency?
[Same MPEG decoder diagram.]
- Task decomposition
  - Parallelism in the application
- Pipeline task decomposition
  - Data assembly lines
  - Producer-consumer chains
- Data decomposition
  - The same computation is applied to small data chunks derived from a large data set

26 Guidelines for Task Decomposition
- Algorithms start with a good understanding of the problem being solved
- Programs often naturally decompose into tasks
  - Two common decompositions are function calls and distinct loop iterations (see the sketch below)
- It is easier to start with many tasks and later fuse them than to start with too few tasks and later try to split them
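As an illustration of the "distinct loop iterations" case, here is a minimal sketch using OpenMP (one of many possible mechanisms; the function name, array names, and per-element operation are made up for this example).

```c
#include <omp.h>

/* Each loop iteration is an independent task: iteration i reads a[i]
 * and writes b[i] only, so the iterations can be divided among
 * threads without synchronization.                                   */
void scale_array(const double *a, double *b, int n, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = factor * a[i];
}
```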

27 Guidelines for Task Decomposition
- Flexibility
  - The program design should afford flexibility in the number and size of tasks generated
  - Tasks should not be tied to a specific architecture
  - Fixed tasks vs. parameterized tasks
- Efficiency
  - Tasks should have enough work to amortize the cost of creating and managing them
  - Tasks should be sufficiently independent that managing dependencies doesn't become the bottleneck
- Simplicity
  - The code has to remain readable and easy to understand and debug

28 Case for Pipeline Decomposition
- Data flows through a sequence of stages (e.g., ZigZag -> IQuantization -> IDCT -> Saturation in the MPEG decoder)
  - An assembly line is a good analogy
- What's a prime example of pipeline decomposition in computer architecture?
  - The instruction pipeline in modern CPUs
- What's an example pipeline you may use in your UNIX shell?
  - Pipes in UNIX: cat foobar.c | grep bar | wc
- Other examples: signal processing, graphics
- (A minimal two-stage pipeline sketch in code follows below.)
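For the data-assembly-line idea, here is a minimal two-stage pipeline sketch in C with POSIX threads and a small bounded queue between the stages; the item count, queue size, and per-stage "work" are placeholders, not code from the lecture.

```c
#include <pthread.h>
#include <stdio.h>

#define N_ITEMS 16
#define Q_SIZE   4

/* A tiny bounded queue connecting the two pipeline stages. */
static int queue[Q_SIZE];
static int q_head = 0, q_tail = 0, q_count = 0;
static pthread_mutex_t q_lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;

static void enqueue(int v)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == Q_SIZE)                 /* wait for space */
        pthread_cond_wait(&q_not_full, &q_lock);
    queue[q_tail] = v;
    q_tail = (q_tail + 1) % Q_SIZE;
    q_count++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_lock);
}

static int dequeue(void)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == 0)                      /* wait for an item */
        pthread_cond_wait(&q_not_empty, &q_lock);
    int v = queue[q_head];
    q_head = (q_head + 1) % Q_SIZE;
    q_count--;
    pthread_cond_signal(&q_not_full);
    pthread_mutex_unlock(&q_lock);
    return v;
}

/* Stage 1: produce items and pass them downstream. */
static void *stage1(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++)
        enqueue(i * i);                       /* stand-in for real per-item work */
    return NULL;
}

/* Stage 2: consume items produced by stage 1. */
static void *stage2(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++)
        printf("item %d -> %d\n", i, dequeue());
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Both stages run concurrently on different items, just as adjacent blocks of the MPEG decoder can process different macroblocks at the same time.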

29 Guidelines for Data Decomposition
- Data decomposition is often implied by task decomposition
- Programmers need to address both task and data decomposition to create a parallel program
  - Which decomposition to start with?
- Data decomposition is a good starting point when
  - The main computation is organized around manipulation of a large data structure
  - Similar operations are applied to different parts of the data structure

30 Common Data Decompositions
- Geometric data structures
  - Decomposition of arrays along rows, columns, or blocks (see the sketch below)
  - Decomposition of meshes into domains
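A minimal sketch of the row-wise case, assuming OpenMP and a made-up smoothing kernel: each thread is assigned a contiguous block of rows, which is the "decomposition along rows" named above.

```c
#include <omp.h>

#define ROWS 1024
#define COLS 1024

/* Row-wise geometric decomposition: each thread is assigned a
 * contiguous block of rows of the 2-D array.                   */
void smooth_rows(double a[ROWS][COLS], double b[ROWS][COLS])
{
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < ROWS - 1; i++)
        for (int j = 1; j < COLS - 1; j++)
            /* simple 3-point vertical average as a stand-in kernel */
            b[i][j] = (a[i-1][j] + a[i][j] + a[i+1][j]) / 3.0;
}
```

A column-wise or block-wise decomposition only changes which loop (or loop pair) is divided among threads.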

31 Common Data Decompositions
- Geometric data structures
  - Decomposition of arrays along rows, columns, or blocks
  - Decomposition of meshes into domains
- Recursive data structures
  - Example: decomposition of trees into sub-trees
[Figure: divide-and-conquer structure: a problem is repeatedly split into subproblems, each subproblem is computed, and the results are merged back into the solution.]

32 Guidelines for Data Decomposition
- Flexibility
  - The size and number of data chunks should support a wide range of executions
- Efficiency
  - Data chunks should generate comparable amounts of work (for load balancing)
- Simplicity
  - Complex data decompositions can get difficult to manage and debug

33 Data Decomposition Examples
- Molecular dynamics
  - Compute forces
  - Update accelerations and velocities
  - Update positions
- Decomposition
  - The baseline algorithm is O(N^2): all-to-all communication
  - The best decomposition is to treat the molecules as a set (see the sketch below)
  - Some advantages of a geometric decomposition are discussed in a future lecture
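A rough sketch of decomposing the O(N^2) force computation over the set of molecules, assuming OpenMP; the 1-D positions and the toy force law are stand-ins, not the actual molecular dynamics code discussed in lecture.

```c
#include <omp.h>
#include <math.h>

/* O(N^2) force computation decomposed over the set of molecules:
 * each thread owns a subset of the outer-loop iterations and
 * accumulates the force on its own molecules only, so no
 * synchronization is needed when writing fx.                    */
void compute_forces(int n, const double *x, double *fx)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double f = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d = x[j] - x[i];
            /* toy pairwise force, roughly sign(d)/d^2 */
            f += d / (fabs(d) * fabs(d) * fabs(d) + 1e-9);
        }
        fx[i] = f;
    }
}
```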

34 Data Decomposition Examples
- Molecular dynamics: geometric decomposition
- Merge sort: recursive decomposition (see the sketch below)
[Figure: the same divide-and-conquer split/compute/merge structure as on slide 31.]
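A minimal sketch of the recursive decomposition for merge sort using OpenMP tasks (the cutoff value and helper names are invented for this example): each split in the figure becomes a task, and the merge runs only after both halves complete.

```c
#include <omp.h>
#include <string.h>

#define CUTOFF 1024   /* below this size, recursion stays sequential */

/* Merge the two sorted halves a[lo..mid) and a[mid..hi) using tmp. */
static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

/* Recursive decomposition: each half becomes an independent task. */
static void msort(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo < 2)
        return;
    int mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a, tmp) if (hi - lo > CUTOFF)
    msort(a, tmp, lo, mid);
    #pragma omp task shared(a, tmp) if (hi - lo > CUTOFF)
    msort(a, tmp, mid, hi);
    #pragma omp taskwait            /* both halves must finish before merge */
    merge(a, tmp, lo, mid, hi);
}

void parallel_merge_sort(int *a, int *tmp, int n)
{
    #pragma omp parallel
    #pragma omp single              /* one thread seeds the task tree */
    msort(a, tmp, 0, n);
}
```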

