
1 Parallel Processing Fundamental Concepts

2 Selection of an Application for Parallelization
Parallel computation can be used for two things:
–Speed up an existing application
–Improve the quality of result for an application: more compute power allows a revolutionary change in algorithm
The application should be compute-intensive; unless significant speedup is achievable, parallelization is not worth the effort.

3 Fundamental Limits: Amdahl's Law
T1 = execution time using 1 processor (serial execution time)
Tp = execution time using P processors
S = serial fraction of the computation (i.e. the fraction that can only be executed on 1 processor)
C = fraction of the computation that can be executed by P processors
Then S + C = 1 and
Tp = S*T1 + (C*T1)/P = (S + C/P)*T1
Speedup = Ψ(p) = T1/Tp = 1/(S + C/P)

4 Fundamental Limits: Amdahl's Law (cont.)
Speedup Ψ(p) = T1/Tp = 1/(S + C/P)
Maximum speedup (using an infinite number of processors) = 1/S
Example: S = 0.05 gives a maximum speedup Smax = 1/0.05 = 20
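As a quick illustration (my own sketch, not from the slides), the C function below evaluates Amdahl's formula; the name amdahl_speedup and the sample processor counts are illustrative assumptions.
#include <stdio.h>

/* Amdahl's law: speedup = 1 / (S + C/P), with C = 1 - S. */
double amdahl_speedup(double serial_fraction, int processors)
{
    double parallel_fraction = 1.0 - serial_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / processors);
}

int main(void)
{
    double s = 0.05;                              /* 5% serial, as in the slide */
    int p_values[] = {1, 2, 4, 8, 16, 1000000};

    /* As P grows, the speedup approaches 1/S = 20. */
    for (int i = 0; i < 6; i++)
        printf("P = %7d  speedup = %.2f\n", p_values[i], amdahl_speedup(s, p_values[i]));
    return 0;
}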

5 Scalability of Multithreaded Applications: Amdahl's Law
Speedup is limited by the amount of serial code.
Ψ(p) ≤ 1 / (s + (1 - s)/p), where 0 ≤ s ≤ 1 is the fraction of serial operations
[Chart: Maximum Theoretical Speedup from Amdahl's Law — speedup vs. number of cores for %serial = 0, 10, 20, 30, 40, 50]

6 Scalability of Multithreaded Applications: Question
If an application is only 25% serial, what is the maximum speedup you can ever achieve, assuming an infinite number of processors? (Ignore parallel overhead.)
A: 1.25   B: 2.0   C: 4.0   D: No speedup

7 Speedup Ψ(p) = T1/Tp = 1/(S + C/P)
Serial fractions appear in "non-obvious" ways.
Example application profile:
1. input: 10%
2. compute setup: 15%
3. computation: 75%
If only part 3 is parallelized, the serial fraction is 25%, so Smax = 1/0.25 = 4.
If parts 2 and 3 are parallelized, the serial fraction is 10%, so Smax = 1/0.10 = 10.
You have to live with Smax if you cannot change the algorithm implemented in your application so that it makes better use of a parallel machine.

8 Speedups with Scaled Problem Sizes
Determine speedup given a constant problem size.
Determine speedup given a constant turnaround time:
–assume perfect speedup and determine the problem size that can be computed in the same turnaround time

9 Type of Parallelization: User-Controlled vs. Automatic
User-controlled:
–Programmer tells all processors what to do at all times
–More freedom, but significant effort from the programmer
–Several problems, for example:
  The programmer may not know the details of the machine as well as the compiler does
  Sophistication is needed to write programs with good locality and grain (i.e. the work size assigned to each processor)
Automatic: performed by compilers

10 Approaches: Exact vs. Inexact
Begin with sequential code.
Exact parallelism:
–Definition: all data dependences remain intact
–Advantage: the answer is guaranteed to be the same as the sequential implementation, independent of the number of processors
–Problem: unnecessary dependences cause inefficiency

11 Approaches: Exact vs. Inexact (continued)
Inexact parallelism:
–Definition: "relax" data dependences: allow "stale" data to be used instead of the most up-to-date values
–Used in both numerical solution techniques and combinatorial optimization
–Reduces synchronization overhead
–Usually applied in the context of iterative algorithms that still converge to the right answer
–May or may not be faster

12 Speculative Parallelism
Do more work than may actually be needed.
Example: execute both branches of an IF statement in parallel, before the condition is known, and keep only the result of the branch actually taken.
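A minimal sketch of this idea (my own illustration, not from the slides), assuming hypothetical side-effect-free functions branch_true(), branch_false(), and slow_condition(): the condition and both branches are evaluated concurrently with OpenMP sections, and only the result of the branch actually taken is kept.
#include <stdio.h>

/* Hypothetical, side-effect-free computations (illustrative only). */
static double branch_true(double x)    { return x * x; }
static double branch_false(double x)   { return x + 1.0; }
static int    slow_condition(double x) { return x > 2.0; }  /* stands in for an expensive test */

double speculative_if(double x)
{
    double r_true = 0.0, r_false = 0.0;
    int take_true = 0;

    /* Speculation: evaluate the condition and BOTH branches concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        take_true = slow_condition(x);
        #pragma omp section
        r_true = branch_true(x);
        #pragma omp section
        r_false = branch_false(x);
    }

    /* Keep only the result of the branch actually taken; the other work is wasted. */
    return take_true ? r_true : r_false;
}

int main(void)
{
    printf("%f\n", speculative_if(3.0));   /* prints 9.000000 */
    return 0;
}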

13 Orthogonal Parallelism
Think about parallelism like slicing an apple:
–Make N cuts on the X axis: N pieces
–Make M cuts on the Y axis: N x M pieces
–Make K cuts on the Z axis: N x M x K pieces

14 Example: Nested Loop
for I = 1,10
  for J = 1,10
    for K = 1,5
      "independent work"
Orthogonal parallelism means creating 500 different threads (10 x 10 x 5), each executing one instance of the "independent work".
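As a hedged C illustration (the slides use Fortran-like pseudocode), the same 10 x 10 x 5 loop nest can be flattened into a single parallel iteration space with OpenMP's collapse clause; the work() function is a stand-in for the "independent work".
#include <stdio.h>

/* Placeholder for the "independent work" done by each (i, j, k) instance. */
static void work(int i, int j, int k, double out[10][10][5])
{
    out[i][j][k] = (double)(i + j + k);
}

int main(void)
{
    double out[10][10][5];

    /* collapse(3) merges the 10 x 10 x 5 = 500 iterations into one
       parallel iteration space: the "orthogonal" decomposition. */
    #pragma omp parallel for collapse(3)
    for (int i = 0; i < 10; i++)
        for (int j = 0; j < 10; j++)
            for (int k = 0; k < 5; k++)
                work(i, j, k, out);

    printf("out[9][9][4] = %.1f\n", out[9][9][4]);
    return 0;
}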

15 Design Tradeoffs in Parallel Decomposition
"Granularity" vs. "Communication/Synchronization" vs. "Load Balance"
Granularity: amount of computation between interprocess communications
Interprocess communication: data transmission or synchronization
Load balance: distributing work evenly between processes (threads)

16 Granularity (continued)
Grain size:
–fine: the program is chopped into many small pieces
–coarse: fewer, larger pieces

17 Impact of the Choice of Granularity
Parallel decomposition overhead:
–As granularity decreases, overhead increases
–e.g. the time taken by each process to obtain a task (serialization if there is a single task queue)
Load balance:
–As granularity decreases, load balance improves
–Better distribution of work between processors

18 Graph: Typical execution time using P processors (when grain size can be varied)
[Figure: execution time vs. granularity — at very fine grain, overhead dominates; at very coarse grain, load imbalance dominates]

19 Execution Time
Execution time is NOT simply S + C/P.
Execution time = S + (C/P)(1 + Kp) + Op
Kp: cost due to load imbalance and communication/synchronization
Op: other overhead

20 Scheduling: Static vs. Dynamic
If the grain size is constant and the number of tasks is known, tasks can be statically assigned to processors (e.g. at compile time):
–reduces the overhead of assigning work to processors
If not, some dynamic scheduling mechanism is needed (e.g. a task queue or a self-scheduled loop).
It is even possible to make a dynamic decision about whether or not to spawn (create an additional process).

21 Static Scheduling of Parallel Loops
One of the most popular constructs for shared-memory programming is the parallel loop.
A parallel loop is a "for" or "do" statement, except that it doesn't iterate sequentially.
–Instead, it says "just get all these things done, in any order, using several processors if possible."
–The number of processors available to the job may be specified or limited.

22 An Example Parallel Loop
c = sin(d)
parallel do i = 1 to 30
  a(i) = b(i) + c
end parallel do
e = a(20) + a(15)

23 Implementation of Parallel Loops Using Static Scheduling
c = sin(d)
start_task sub(a,b,c,1,10)
start_task sub(a,b,c,11,20)
call sub(a,b,c,21,30)
wait_for_all_tasks_to_complete
e = a(20) + a(15)
...
subroutine sub(a,b,c,k,l)
  ...
  for i = k to l
    a(i) = b(i) + c
  end for
end sub
This implements the parallel loop from the previous slide. Notice that, in this program, arrays a and b are shared by the three processors cooperating in the execution of the loop.

24 Implementation of Parallel Loops Using Static Scheduling (cont.)
(Same code as the previous slide.)
This program assigns to each processor a fixed segment of the iteration space. This is called static scheduling.
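For comparison (my own sketch, not part of the original slides), the same static decomposition written in C with OpenMP: schedule(static) hands each thread a fixed, contiguous segment of the iteration space, much like the hand-coded sub(a,b,c,k,l) calls above.
#include <math.h>
#include <stdio.h>

#define N 30

int main(void)
{
    double a[N], b[N], c, d = 0.5, e;

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    c = sin(d);

    /* Static scheduling: the 30 iterations are divided into fixed,
       contiguous chunks, one per thread, decided before the loop runs. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c;    /* a and b are shared by all threads */

    e = a[19] + a[14];      /* a(20) + a(15) in the slide's 1-based notation */
    printf("e = %f\n", e);
    return 0;
}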

25 Implementation of Parallel Loops Using Dynamic Scheduling
c = sin(d)
start_task sub(a,b,c)
call sub(a,b,c)
wait_for_all_tasks_to_complete
e = a(20) + a(15)
...
subroutine sub(a,b,c)
  logical empty
  ...
  call get_another_iteration(empty,i)
  while .not. empty do
    a(i) = b(i) + c
    call get_another_iteration(empty,i)
  end while
end sub
Here, the get_another_iteration() subroutine accesses a pool containing all n iteration numbers, gets one of them, and removes it from the pool. When all iterations have been assigned, and the pool is therefore empty, it returns .true. in the variable empty.
The next slide shows a third approach in which get_another_iteration() returns a range of iterations instead of a single iteration.

26 Another Alternative
get_another_iteration() returns a range of iterations instead of a single iteration:
subroutine sub(a,b,c)
  logical empty
  ...
  call get_another_iteration(empty,i,j)
  while .not. empty do
    for k = i to j
      a(k) = b(k) + c
    end for
    call get_another_iteration(empty,i,j)
  end while
end sub
The driver code is the same as on the previous slide:
c = sin(d)
start_task sub(a,b,c)
call sub(a,b,c)
wait_for_all_tasks_to_complete
e = a(20) + a(15)
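A possible C implementation of this self-scheduling scheme (my own sketch with assumed names; the slides stay at the pseudocode level): an atomic counter serves as the shared iteration pool, and get_another_iteration() hands out chunks of CHUNK iterations until the pool is exhausted.
#include <math.h>
#include <stdatomic.h>
#include <stdio.h>

#define N     30
#define CHUNK 4

/* Shared iteration pool: an atomic counter holding the next unassigned index. */
static atomic_int next_index = 0;

/* Returns 1 and a half-open range [*i, *j) while iterations remain, else 0. */
static int get_another_iteration(int *i, int *j)
{
    int start = atomic_fetch_add(&next_index, CHUNK);
    if (start >= N)
        return 0;                       /* pool is empty */
    *i = start;
    *j = (start + CHUNK < N) ? start + CHUNK : N;
    return 1;
}

static void sub(double *a, const double *b, double c)
{
    int i, j;
    while (get_another_iteration(&i, &j))
        for (int k = i; k < j; k++)
            a[k] = b[k] + c;
}

int main(void)
{
    double a[N], b[N], c, d = 0.5;
    for (int k = 0; k < N; k++)
        b[k] = (double)k;
    c = sin(d);

    /* Each thread runs sub(); the threads self-schedule by pulling chunks. */
    #pragma omp parallel
    sub(a, b, c);

    printf("e = %f\n", a[19] + a[14]);
    return 0;
}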

27 Array Programming Languages
Array operations are written in a compact form that makes programs more readable.
Consider the loop:
s = 0
do i = 1,n
  a(i) = b(i) + c(i)
  s = s + a(i)
end do
It can be written (in Fortran 90 notation) as follows:
a(1:n) = b(1:n) + c(1:n)   ! vector operation
s = sum(a(1:n))            ! reduction function
A popular array language today is MATLAB.

28 Parallelizing Vector Expressions
All the arithmetic operations (+, -, *, /, **) involved in a vector expression can be performed in parallel.
Intrinsic reduction functions can also be performed in parallel.
Vector operations can easily be executed in parallel using almost any form of parallelism, including pipelining and multiprocessing.
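A small C sketch (an illustrative assumption on my part, since the slide's examples are in Fortran 90) of the same vector operation plus sum reduction executed in parallel; the reduction(+:s) clause is what allows the reduction to run concurrently.
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N], c[N], s = 0.0;

    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        c[i] = 2.0;
    }

    /* Element-wise vector operation and sum reduction, both parallel:
       the equivalent of a(1:n) = b(1:n) + c(1:n) and s = sum(a(1:n)). */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
        s += a[i];
    }

    printf("s = %f\n", s);   /* 3000.0 */
    return 0;
}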

29 Array Programming Languages (cont.)
Array languages can be used to express parallelism because array operations can easily be executed in parallel.
Vector programs are easily translated for execution on shared-memory parallel machines. For example,
c = sin(d)
a(1:30) = b(2:31) + c
e = a(20) + a(15)
is translated to
c = sin(d)
parallel do i = 1 to 30
  a(i) = b(i+1) + c
end parallel do
e = a(20) + a(15)

30 Typical Parallel Program Bottlenecks: Task Queues
One central task queue may be a bottleneck.
Use q distributed task queues with p threads, q <= p:
–distributes contention across queues
–Cost: if the first queue a processor checks is empty, the processor has to go and look at the others (see the sketch below)
Task insertion is also an issue:
–If the number of tasks generated by each processor is uniform, each processor is assigned a specific task queue for insertion
–If task generation is non-uniform (e.g. one processor generates all tasks), tasks should be spread uniformly at random among the queues
Priority queues: give more important jobs higher priority so they get executed sooner.
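To make the queue-checking cost concrete, here is a minimal pthreads sketch (my own illustration, with assumed helper names such as try_pop and steal_task): each worker owns a home queue and, when that queue turns out to be empty, scans the other queues for work.
#include <pthread.h>
#include <stdio.h>

#define NQUEUES  4          /* q task queues            */
#define NTHREADS 4          /* p worker threads, q <= p */
#define QCAP     64

/* A simple mutex-protected queue of integer task IDs. */
typedef struct {
    int tasks[QCAP];
    int count;
    pthread_mutex_t lock;
} task_queue;

static task_queue queues[NQUEUES];

static void push_task(task_queue *q, int task)
{
    pthread_mutex_lock(&q->lock);
    q->tasks[q->count++] = task;
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 and the task in *task if the queue was non-empty, else 0. */
static int try_pop(task_queue *q, int *task)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        *task = q->tasks[--q->count];
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Check the home queue first; if it is empty, look at the other queues. */
static int steal_task(int home, int *task)
{
    for (int k = 0; k < NQUEUES; k++)
        if (try_pop(&queues[(home + k) % NQUEUES], task))
            return 1;
    return 0;               /* every queue is empty: no work left (static workload) */
}

static void *worker(void *arg)
{
    int home = (int)(long)arg % NQUEUES;   /* this thread's home queue */
    int task;
    while (steal_task(home, &task))
        printf("thread %d ran task %d\n", home, task);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int q = 0; q < NQUEUES; q++)
        pthread_mutex_init(&queues[q].lock, NULL);

    /* Uniform insertion policy: task i goes to queue i mod NQUEUES. */
    for (int i = 0; i < 20; i++)
        push_task(&queues[i % NQUEUES], i);

    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(long)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(&tid[t], NULL);

    return 0;
}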