Data Parallel Pattern


ITCS 4/5145 Parallel Computing, UNC-Charlotte, B. Wilkinson, Oct 22, 2012

Data Parallel Computations

The same operation is performed on different data elements simultaneously, i.e., in parallel. Fully synchronous: all processes operate in synchronism.

Particularly convenient because:
• Ease of programming (essentially only one program).
• Can scale easily to larger problem sizes.
• Many numeric and some non-numeric problems can be cast in a data parallel form.

Used in vector supercomputer designs in the 1970s. Versions appear in Intel processors as the SSE extensions. Currently used as the basis of GPU operations, see later.

Example

To add the same constant to each element of an array:

for (i = 0; i < n; i++)
   a[i] = a[i] + k;

The statement a[i] = a[i] + k; could be executed simultaneously by multiple processors, each using a different index i (0 <= i < n).

Vector supercomputers were designed to operate this way using the single instruction multiple data (SIMD) model.
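As a preview of the GPU implementation discussed at the end of these slides, here is a minimal CUDA sketch of the same computation with one thread per array element. The kernel name addConstant and the device pointer d_a are illustrative assumptions, not part of the slides.

#include <cuda_runtime.h>

// One thread per element: thread i performs a[i] = a[i] + k.
__global__ void addConstant(float *a, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may have more threads than elements
        a[i] = a[i] + k;
}

// Host-side launch (d_a is a device copy of the array; error checks omitted):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   addConstant<<<blocks, threads>>>(d_a, k, n);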

Using the forall construct for the data parallel pattern

Could use forall to specify data parallel operations:

forall (i = 0; i < n; i++)
   a[i] = a[i] + k;

However, forall is more general: it states that the n instances of the body can be executed simultaneously or in any order (not necessarily all at the same time). We shall see that a GPU implementation of data parallel patterns does not necessarily allow all instances to execute at the same time.

Note that forall does imply synchronism at its end: all instances must complete before execution continues, which will be true on GPUs.
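One way to see these forall semantics concretely is a CUDA launch followed by an explicit synchronization: the hardware schedules the instances in warps and blocks in no guaranteed order, and the host only proceeds once all of them have finished. This is a hedged, self-contained sketch; the kernel name body and the array size are made up for illustration.

#include <cstdio>
#include <cuda_runtime.h>

// The "body" of the forall: each instance updates its own element.
__global__ void body(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = a[i] + 1;
}

int main()
{
    const int n = 1 << 20;
    int *d_a;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMemset(d_a, 0, n * sizeof(int));

    // forall (i = 0; i < n; i++) a[i] = a[i] + 1;
    // Instances may run in any order and not all at once.
    body<<<(n + 255) / 256, 256>>>(d_a, n);

    // End-of-forall synchronization: the host does not continue
    // until every instance has completed.
    cudaDeviceSynchronize();

    printf("all %d instances completed\n", n);
    cudaFree(d_a);
    return 0;
}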

Data Parallel Example: The Prefix Sum Problem

Given a list of numbers x0, …, xn-1, compute all the partial summations, i.e.:

x0 + x1;  x0 + x1 + x2;  x0 + x1 + x2 + x3;  x0 + x1 + x2 + x3 + x4;  …

Can also be defined with associative operations other than addition. Widely studied, with practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.

[Figure: data parallel method for the prefix sum operation]

Prefix sum: parallel code using forall notation

Sequential code:

for (j = 0; j < log(n); j++)       // at each step
   for (i = 2^j; i < n; i++)       // accumulate sum
      x[i] = x[i] + x[i - 2^j];

Parallel code using forall notation:

for (j = 0; j < log(n); j++)       // at each step
   forall (i = 0; i < n; i++)      // accumulate sum
      if (i >= 2^j)
         x[i] = x[i] + x[i - 2^j];
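Looking ahead to the GPU version, one way this forall structure could be realized in CUDA is one kernel launch per step j, with two buffers so that every instance reads the old values, which is exactly the synchronous behaviour forall assumes. The names scanStep and prefixSum and the buffer-swapping driver are illustrative assumptions, not the course's implementation.

#include <cuda_runtime.h>

// One step (one value of j): every element i with i >= dist (= 2^j)
// adds in[i - dist] to itself. Reading from 'in' and writing to 'out'
// keeps all reads on the old values, as the synchronous forall assumes.
__global__ void scanStep(const int *in, int *out, int n, int dist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (i >= dist) ? in[i] + in[i - dist] : in[i];
}

// Host driver: one launch per step, log2(n) steps in total.
// Returns the device buffer that ends up holding the prefix sums.
int *prefixSum(int *d_x, int *d_tmp, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int dist = 1; dist < n; dist *= 2) {     // dist plays the role of 2^j
        scanStep<<<blocks, threads>>>(d_x, d_tmp, n, dist);
        int *t = d_x; d_x = d_tmp; d_tmp = t;     // ping-pong the buffers
    }
    cudaDeviceSynchronize();
    return d_x;                                   // buffer written by the last step
}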

Matrix Multiplication

Easy to make a data parallel version: change the two outer for's to forall's:

forall (i = 0; i < n; i++)          // for each row of A
   forall (j = 0; j < n; j++) {     // for each column of B
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

Here the data parallel definition is extended to multiple sequential operations on data items: each instance of the body is a separate thread, and within each instance the statements execute in sequential order.
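A hedged CUDA sketch of the same idea: the two outer forall loops become a two-dimensional grid of threads, each computing one element of C, while the inner k loop remains sequential inside each thread. The kernel name matMul and the row-major flattening of the matrices are illustrative assumptions.

#include <cuda_runtime.h>

// Each thread computes one element c[i][j]; matrices are stored
// row-major in 1-D arrays, so element (i,j) is at index i*n + j.
__global__ void matMul(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of A and C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of B and C
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)                  // sequential inner loop
            sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
    }
}

// Host-side launch sketch (d_a, d_b, d_c are device arrays of n*n floats):
//   dim3 threads(16, 16);
//   dim3 blocks((n + 15) / 16, (n + 15) / 16);
//   matMul<<<blocks, threads>>>(d_a, d_b, d_c, n);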

We will explore the data parallel pattern using GPUs for high performance computing; see the next set of slides.

Questions so far?