Data Parallel Computations and Pattern ITCS 4/5145 Parallel Computing, UNC-Charlotte, B. Wilkinson, slides6c.ppt, Nov 4. 6c.1

Data Parallel Computations
Same operation performed on different data elements simultaneously, i.e., in parallel and fully synchronously. Particularly convenient because:
- Can scale easily to larger problem sizes.
- Many numeric and some non-numeric problems can be cast in a data parallel form.
- Ease of programming (only one program!).
6c.2

Single Instruction Multiple Data (SIMD) model
Data parallel model used in vector supercomputer designs in the 1970s:
- Synchronism at the instruction level. Each instruction specifies a "vector" operation and the array elements to perform the operation on.
- Multiple execution units, each executing the operation on a different element or pair of elements in synchronism.
- Only one instruction fetch/decode unit.
Subsequently seen in Intel processors as the vector SSE (Streaming SIMD Extensions) instructions.
6c.3

(SIMD) Data Parallel Pattern
[Figure: the same program instruction is sent to all execution units at the same time; each execution unit performs the same operation but on different data in parallel, usually elements of an array.]
Could be described as a computational "pattern".
6c.4

SIMD Example
To add the same constant, k, to each element of an array:

for (i = 0; i < N; i++)
    a[i] = a[i] + k;

The statement a[i] = a[i] + k; could be executed simultaneously by multiple processors, each using a different index i (0 <= i < N). A single vector instruction can express this; its meaning: add k to all elements a[i], 0 <= i < N.
6c.5
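As an aside, a minimal sketch of how this same loop might be vectorized on a CPU with SSE intrinsics (the function name add_k_sse and the assumption that N is a multiple of 4 are illustrative, not from the slides):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add the constant k to every element of a[0..N-1] using 128-bit
   SIMD registers: four 32-bit integers are processed per instruction. */
void add_k_sse(int *a, int N, int k)
{
    __m128i vk = _mm_set1_epi32(k);                      /* k in all 4 lanes */
    for (int i = 0; i < N; i += 4) {                     /* assumes N % 4 == 0 */
        __m128i va = _mm_loadu_si128((__m128i *)&a[i]);  /* load 4 elements */
        va = _mm_add_epi32(va, vk);                      /* 4 adds in one instruction */
        _mm_storeu_si128((__m128i *)&a[i], va);          /* store 4 results */
    }
}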

Using forall construct for data parallel pattern
Could use forall to specify data parallel operations:

forall (i = 0; i < n; i++)
    a[i] = a[i] + k;

However, forall is more general: it states that the n instances of the body can be executed simultaneously or in any order (not necessarily at the same time). We shall see this in the GPU implementation of the data parallel pattern, sketched below. Note that forall does imply synchronism at its end: all instances must complete before continuing, which will be true in GPUs.
6c.6
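As a preview of that GPU implementation, a minimal CUDA sketch in which each thread executes one instance of the forall body (the kernel name addK and the launch configuration are illustrative assumptions, not from the slides):

// GPU realization of: forall (i = 0; i < n; i++) a[i] = a[i] + k;
// One thread per array element.
__global__ void addK(int *a, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard if n is not a multiple of block size
        a[i] = a[i] + k;
}

// Illustrative launch: enough 256-thread blocks to cover n elements.
// All threads complete before a subsequent kernel in the same stream runs,
// matching the synchronism at the end of forall.
//     addK<<<(n + 255) / 256, 256>>>(dev_a, k, n);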

Data Parallel Example: Prefix Sum Problem
Given a list of numbers x0, ..., xn-1, compute all the partial summations, i.e.:
x0 + x1; x0 + x1 + x2; x0 + x1 + x2 + x3; x0 + x1 + x2 + x3 + x4; ...
Can also be defined with associative operations other than addition. Widely studied. Practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.
6c.7

Data parallel method for prefix sum operation
[Figure: diagram of the data parallel (log-step) prefix sum computation.]
6c.8

Sequential pseudocode:

for (j = 0; j < log(n); j++)          // at each step
    for (i = 2^j; i < n; i++)         // accumulate sum
        x[i] = x[i] + x[i - 2^j];

Parallel code using forall notation:

for (j = 0; j < log(n); j++)          // at each step
    forall (i = 0; i < n; i++)        // accumulate sum
        if (i >= 2^j)
            x[i] = x[i] + x[i - 2^j];

6c.9
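A hedged CUDA sketch of one step of this data parallel prefix sum on a GPU, one thread per element; double buffering between xin and xout keeps threads from reading values being overwritten in the same step (the kernel name scanStep, the buffers, and the host loop are illustrative assumptions, not from the slides):

// One step, distance d = 2^j, of the data parallel prefix sum.
// Reads xin and writes xout so no thread sees a partially updated value.
__global__ void scanStep(const int *xin, int *xout, int n, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        xout[i] = (i >= d) ? xin[i] + xin[i - d] : xin[i];
}

// Illustrative host loop, mirroring the outer j loop of the forall version:
//     for (int d = 1; d < n; d *= 2) {
//         scanStep<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n, d);
//         int *tmp = dev_in; dev_in = dev_out; dev_out = tmp;  // swap buffers
//     }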

Low level image processing
Involves manipulating image pixels (picture elements), often applying the same operation to each pixel using neighboring pixel values. The SIMD (single instruction multiple data) model is very applicable. Historically, GPUs were designed for creating image data for displays using this model.
6c.10

Single Instruction Multiple Thread (SIMT) Programming Model
A version of SIMD used in recent GPUs. GPUs use a thread model to achieve very high parallel performance and to hide memory latency:
- Multiple threads, each executing the same instruction sequence.
- A very large number of threads (tens of thousands) can be declared in the program.
- Our GPUs have 448 and 2496 cores per chip (see later), providing that number of simultaneous threads.
- Groups of threads are scheduled to execute at the same time on the execution cores.
- Very low thread overhead.
6c.11

SIMT Example: Matrix Multiplication
Matrix multiplication is easy to cast as a data parallel computation. Change the two outer for's to forall's:

forall (i = 0; i < n; i++)        // for each row of A
    forall (j = 0; j < n; j++) {  // for each column of B
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

Each instance of the body is a separate thread, doing the same calculation but on different elements of the arrays.
6c.12

forall (i = 0; i < n; i++)        // for each row of A
    forall (j = 0; j < n; j++) {  // for each column of B
        ...
    }

One thread for each element of c, doing the same calculation but using different a and b elements:

Thread computing c[0][0]:
    c[0][0] = 0;
    for (k = 0; k < n; k++)
        c[0][0] += a[0][k] * b[k][0];

    ...

Thread computing c[n-1][n-1]:
    c[n-1][n-1] = 0;
    for (k = 0; k < n; k++)
        c[n-1][n-1] += a[n-1][k] * b[k][n-1];

6c.13
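A minimal CUDA sketch of this one-thread-per-c-element mapping (the kernel name matMul, the flattened row-major arrays, and the 2D launch configuration are illustrative assumptions, not from the slides):

// One thread computes one element c[i][j]; matrices stored row-major
// in flat 1D device arrays of length n*n.
__global__ void matMul(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of A and C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of B and C
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
    }
}

// Illustrative launch: 16 x 16 threads per block, enough blocks to cover n x n.
//     dim3 block(16, 16);
//     dim3 grid((n + 15) / 16, (n + 15) / 16);
//     matMul<<<grid, block>>>(dev_a, dev_b, dev_c, n);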

Questions so far?
We will explore programming GPUs for high performance computing next.
6c.14