Parallelizing Iterative Computation for Multiprocessor Architectures
Peter Cappello

2 What is the problem?
Create programs for multiprocessor units (MPUs):
– multicore processors
– graphics processing units (GPUs)

3 For whom is it a problem? The compiler designer.
[Diagram: Application Program → Compiler → Executable → CPU. EASY]

4 For whom is it a problem? The compiler designer.
[Diagram: Application Program → Compiler → Executable → MPU. HARD]

5 For whom is it a problem? The application programmer.
[Diagram: Application Program → Compiler → Executable → MPU]

6 Complex machine consequences
– The programmer must be highly skilled.
– Programming is error-prone.
These consequences imply: increased parallelism → increased development cost!

7 Amdahl’s Law
The speedup of a program is bounded by its inherently sequential part.
If
– a program needs 20 hours on one CPU, and
– 1 hour of that work cannot be parallelized,
then
– minimum execution time ≥ 1 hour, and
– maximum speedup ≤ 20.
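
Stated as a formula (the standard form, not on the slide): with parallelizable fraction p and N processors,

    S(N) = 1 / ((1 - p) + p/N) ≤ 1 / (1 - p).

In the example, p = 19/20, so no number of processors can push the speedup past 1/(1 - 19/20) = 20.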

9 Parallelization opportunities
Scalable parallelism resides in two sequential program constructs:
– divide-and-conquer recursion
– iterative statements (for loops)

10 Two schools of thought
– Create a general solution (address everything somewhat well).
– Create a specific solution (address one thing very well).

11 Focus on iterative statements (for)

    float[] x = new float[n];
    float[] b = new float[n];
    float[][] a = new float[n][n];
    ...
    for (int i = 0; i < n; i++) {
        b[i] = 0;
        for (int j = 0; j < n; j++)
            b[i] += a[i][j] * x[j];
    }
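
For contrast, a minimal sketch (mine, not from the talk) of how a multicore runtime can exploit this loop: the rows are independent, so the outer loop parallelizes directly with Java's parallel streams, while the inner loop stays sequential because of the accumulation into b[i].

    import java.util.stream.IntStream;

    class MatVec {
        // b = Ax, with the independent outer loop run in parallel.
        static void matVec(float[][] a, float[] x, float[] b, int n) {
            IntStream.range(0, n).parallel().forEach(i -> {
                float sum = 0f;
                for (int j = 0; j < n; j++)
                    sum += a[i][j] * x[j];  // carried dependence on sum: keep sequential
                b[i] = sum;
            });
        }
    }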

12 Matrix-Vector Product
b = Ax, illustrated with a 3×3 matrix A:

    b1 = a11*x1 + a12*x2 + a13*x3
    b2 = a21*x1 + a22*x2 + a23*x3
    b3 = a31*x1 + a32*x2 + a33*x3

13 [Figure: the 3×3 arrays a, x, and b laid out as a grid of index points.]

14 [Figure: the same grid with TIME and SPACE axes overlaid.]

15 [Figure: an alternative TIME/SPACE assignment for the same grid.]

16 [Figure: the pipelined version; each x_j is staggered across successive time steps.]

17 Matrix Product
C = AB, illustrated with 2×2 matrices:

    c11 = a11*b11 + a12*b21
    c12 = a11*b12 + a12*b22
    c21 = a21*b11 + a22*b21
    c22 = a21*b12 + a22*b22
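
For reference, the slide's equations written as the usual triple loop in Java (a sketch, not taken from the talk):

    class MatMul {
        // C = AB for n x n matrices: c[i][j] = sum over k of a[i][k] * b[k][j].
        static float[][] matMul(float[][] a, float[][] b, int n) {
            float[][] c = new float[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }
    }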

18 [Figure: the 2×2 matrix product drawn as a grid of index points with row, col, and k axes.]

19 [Figure: the same grid with T (time) and S (space) axes overlaid.]

20 [Figure: an alternative T/S assignment for the same grid.]

21 Declaring an iterative computation
– Index set
– Data network
– Functions
– Spacetime embedding

22 Declaring an Index set
I1: { (i, j) : 1 ≤ i ≤ j ≤ n }  (upper triangular)
I2: { (i, j) : 1 ≤ i ≤ n, 1 ≤ j ≤ n }  (full square)
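
In code, an index set is just a membership predicate over (i, j); a minimal sketch with hypothetical helper names:

    class IndexSets {
        // I1: the upper-triangular set 1 <= i <= j <= n.
        static boolean inI1(int i, int j, int n) { return 1 <= i && i <= j && j <= n; }
        // I2: the full n x n square.
        static boolean inI2(int i, int j, int n) { return 1 <= i && i <= n && 1 <= j && j <= n; }
    }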

23 Declaring a Data network
D1: x: [-1, 0]; b: [0, -1]; a: [0, 0];
D2: x: [-1, 0]; b: [-1, -1]; a: [0, -1];
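
Each entry pairs a variable with the offset of the neighboring index point it reads from; a sketch using a hypothetical record type:

    // Hypothetical representation: variable name plus a (di, dj) offset.
    record Dep(String var, int di, int dj) {}

    class Networks {
        // D1: matrix-vector product dependencies.
        static final Dep[] D1 = { new Dep("x", -1, 0), new Dep("b", 0, -1), new Dep("a", 0, 0) };
        // D2: convolution / pattern-matching dependencies.
        static final Dep[] D2 = { new Dep("x", -1, 0), new Dep("b", -1, -1), new Dep("a", 0, -1) };
    }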

24 Declaring an Index set + Data network
I1: { (i, j) : 1 ≤ i ≤ j ≤ n }
D1: x: [-1, 0]; b: [0, -1]; a: [0, 0];
[Figure: the points of I1 with the x, b, and a dependence arcs drawn between them.]

25 Declaring the Functions
F1:
    float x' (float x) { return x; }
    float b' (float b, float x, float a) { return b + a*x; }
F2:
    char x' (char x) { return x; }
    boolean b' (boolean b, char x, char a) { return b && a == x; }

26 Declaring a Spacetime embedding
E1 (a 1-D processor array):
– space = -i + j
– time = i + j
E2 (a 2-D processor grid):
– space1 = i
– space2 = j
– time = i + j
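
Both embeddings are affine maps from an index point (i, j) to time and space coordinates; transcribed directly (the method names are mine):

    class Embeddings {
        // E1: schedule onto a 1-D processor array.
        static int e1Time (int i, int j) { return i + j; }
        static int e1Space(int i, int j) { return -i + j; }
        // E2: schedule onto a 2-D processor grid.
        static int e2Time  (int i, int j) { return i + j; }
        static int e2Space1(int i, int j) { return i; }
        static int e2Space2(int i, int j) { return j; }
    }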

27 Declaring an iterative computation
Upper triangular matrix-vector product: UTMVP = (I1, D1, F1, E1)
[Figure: the resulting spacetime diagram (time × space).]
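
To see the embedding act as a schedule: under E1, every index point with i + j = t fires at time step t, on processor -i + j. A sequential emulation of that wavefront order (my sketch, using 1-based arrays of size n+1 to match the slides' indexing):

    class UTMVP {
        // b[i] = sum over j >= i of a[i][j] * x[j], executed wavefront by wavefront.
        static float[] run(float[][] a, float[] x, int n) {
            float[] b = new float[n + 1];            // index 0 unused
            for (int t = 2; t <= 2 * n; t++)         // time = i + j ranges over 2..2n
                for (int i = 1; i <= n; i++) {
                    int j = t - i;
                    if (i <= j && j <= n)            // restrict to index set I1
                        b[i] += a[i][j] * x[j];      // function b' of F1
                }
            return b;
        }
    }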

28 Declaring an iterative computation
Full matrix-vector product = (I2, D1, F1, E1)
[Figure: the resulting spacetime diagram.]

29 Declaring an iterative computation
Convolution (polynomial product) = (I2, D2, F1, E1)
[Figure: the resulting spacetime diagram.]

30 Declaring an iterative computation
String pattern matching = (I2, D2, F2, E1)
[Figure: the resulting spacetime diagram.]

31 Declaring an iterative computation
Pipelined string pattern matching = (I2, D2, F2, E2)
[Figure: the resulting spacetime diagram (space1 × space2 × time).]

32 Iterative computation specification
A declarative specification:
– is a 4-dimensional design space (actually 5-dimensional: the space embedding is independent of the time embedding)
– facilitates reuse of design components.

33 Starting with an existing language…
Can infer:
– index set
– data network
– functions
Cannot infer:
– space embedding
– time embedding

34 Spacetime embedding
– Start with it as a program annotation.
– More advanced: let the compiler optimize the embedding, guided by a figure of merit annotated in the program.

35 Work
– Work out details of the notation.
– Implement in Java, C, Matlab, HDL, …
– Map the virtual processor network to an actual processor network:
  – Java: map processors to Threads, links to Channels (see the sketch below)
  – GPU: map processors to GPU processing elements
(Challenge: the spacetime embedding depends on the underlying architecture.)
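
A minimal sketch (assumptions mine) of the Java mapping mentioned above: each virtual processor becomes a Thread, and each link becomes a BlockingQueue acting as a channel.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class ChannelDemo {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Float> link = new ArrayBlockingQueue<>(1);  // one link of the network
            Thread left = new Thread(() -> {                          // virtual processor p
                try { link.put(3.14f); } catch (InterruptedException e) { }
            });
            Thread right = new Thread(() -> {                         // virtual processor p+1
                try { System.out.println(link.take()); } catch (InterruptedException e) { }
            });
            left.start(); right.start();
            left.join(); right.join();
        }
    }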

36 Work…
The output of one iterative computation is the input to another. Develop a notation for specifying composite iterative computations?

37 Thanks for listening! Questions?