© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Programming Massively Parallel Processors.

Slides:



Advertisements
Similar presentations
Distributed Systems CS
Advertisements

1 Chapter 1 Why Parallel Computing? An Introduction to Parallel Programming Peter Pacheco.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 14: Basic Parallel Programming Concepts.
PARALLEL PROCESSING COMPARATIVE STUDY 1. CONTEXT How to finish a work in short time???? Solution To use quicker worker. Inconvenient: The speed of worker.
11Sahalu JunaiduICS 573: High Performance Computing5.1 Analytical Modeling of Parallel Programs Sources of Overhead in Parallel Programs Performance Metrics.
GridFlow: Workflow Management for Grid Computing Kavita Shinde.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.
L8: Memory Hierarchy Optimization, Bandwidth CS6963.
DATA LOCALITY & ITS OPTIMIZATION TECHNIQUES Presented by Preethi Rajaram CSS 548 Introduction to Compilers Professor Carol Zander Fall 2012.
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 – July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.
Lecture 29 Fall 2006 Lecture 29: Parallel Programming Overview.
CC02 – Parallel Programming Using OpenMP 1 of 25 PhUSE 2011 Aniruddha Deshmukh Cytel Inc.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.
ME964 High Performance Computing for Engineering Applications “Part of the inhumanity of the computer is that, once it is competently programmed and working.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Memory Hardware in G80.
Computational issues in Carbon nanotube simulation Ashok Srinivasan Department of Computer Science Florida State University.
Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 1 Programming Massively Parallel Processors Lecture Slides for Chapter 1: Introduction.
Introduction, background, jargon Jakub Yaghob. Literature T.G.Mattson, B.A.Sanders, B.L.Massingill: Patterns for Parallel Programming, Addison- Wesley,
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.
ME964 High Performance Computing for Engineering Applications Parallel Programming Patterns Oct. 16, 2008.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 12: Application Lessons When the tires.
April 26, CSE8380 Parallel and Distributed Processing Presentation Hong Yue Department of Computer Science & Engineering Southern Methodist University.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 9: Memory Hardware in G80.
1. Could I be getting better performance? Probably a little bit. Most of the performance is handled in HW How much better? If you compile –O3, you can.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow/ Thread Execution.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CMPS 5433 Dr. Ranette Halverson Programming Massively.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Basic Parallel Programming Concepts Computational.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt Synchronization These notes will introduce: Ways to achieve.
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.
© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.
GPU Programming Shirley Moore CPS 5401 Fall 2013
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/498AL, University of Illinois, Urbana-Champaign 1 ECE 408/CS483 Applied Parallel Programming.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.
CS- 492 : Distributed system & Parallel Processing Lecture 7: Sun: 15/5/1435 Foundations of designing parallel algorithms and shared memory models Lecturer/
Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University of Seoul) Chao-Yue Lai (UC Berkeley) Slav Petrov (Google Research) Kurt Keutzer (UC Berkeley)
Finding concurrency Jakub Yaghob. Finding concurrency design space Starting point for design of a parallel solution Analysis The patterns will help identify.
Lecture 19 Beyond Low-level Parallelism. 2 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Outline Models for exploiting large grained.
© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Lecture 13: Basic Parallel.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 15: Basic Parallel Programming Concepts.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana Champaign 1 Programming Massively Parallel Processors CUDA Memories.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Spring 2010 Programming Massively Parallel.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Graphic Processing Processors (GPUs) Parallel.
3/12/2013Computer Engg, IIT(BHU)1 INTRODUCTION-1.
A Pattern Language for Parallel Programming Beverly Sanders University of Florida.
3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,
Parallel Computing Presented by Justin Reschke
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign 1 ECE 498AL Programming Massively Parallel Processors.
Department of Computer Science, Johns Hopkins University Lecture 7 Finding Concurrency EN /420 Instructor: Randal Burns 26 February 2014.
Parallel Patterns.
Parallel Programming By J. H. Wang May 2, 2017.
ECE408 Fall 2015 Applied Parallel Programming Lecture 21: Application Case Study – Molecular Dynamics.
Lecture 5: GPU Compute Architecture
EE 193: Parallel Computing
L18: CUDA, cont. Memory Hierarchy and Examples
Mattan Erez The University of Texas at Austin
© David Kirk/NVIDIA and Wen-mei W. Hwu,
Synchronization These notes introduce:
6- General Purpose GPU Programming
Presentation transcript:

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Programming Massively Parallel Processors Lecture 12: Basic Parallel Programming Concepts Computational Thinking 101

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 2 Objective To provide you with a framework based on the techniques and best practices used by experienced parallel programmers for –Thinking about the problem of parallel programming –Discussing your work with others –Addressing performance and functionality issues in your parallel program –Using or building useful tools and environments –understanding case studies and projects

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 3 “If you build it, they will come.” “And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department … and they haven’t come. Programmers haven’t come to program these wonderful machines. … The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?” - Mattson, Sanders, Massingill (2005)

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 4 Fundamentals of Parallel Computing Parallel computing requires that –The problem can be decomposed into sub-problems that can be safely solved at the same time –The programmer structures the code and data to solve these sub-problems concurrently The goals of parallel computing are –To solve problems in less time, and/or –To solve bigger problems, and/or –To achieve better solutions The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 5 A Recommended Reading Mattson, Sanders, Massingill, Patterns for Parallel Programming, Addison Wesley, 2005, ISBN –We draw quite a bit from the book –A good overview of challenges, best practices, and common techniques in all aspects of parallel programming

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 6 Key Parallel Programming Steps 1)To find the concurrency in the problem 2)To structure the algorithm so that concurrency can be exploited 3)To implement the algorithm in a suitable programming environment 4)To execute and tune the performance of the code on a parallel system Unfortunately, these have not been separated into levels of abstractions that can be dealt with independently.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 7 Challenges of Parallel Programming Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle –Computational thinking (J. Wing) Dependences need to be identified and managed –The order of task execution may change the answers Obvious: One step feeds result to the next steps Subtle: numeric accuracy may be affected by ordering steps that are logically parallel with each other Performance can be drastically reduced by many factors –Overhead of parallel processing –Load imbalance among processor elements –Inefficient data sharing patterns –Saturation of critical resources such as memory bandwidth

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 8 Shared Memory vs. Message Passing We will focus on shared memory parallel programming –This is what CUDA is based on –Future massively parallel microprocessors are expected to support shared memory at the chip level The programming considerations of message passing model is quite different! –Look at MPI (Message Passing Interface) and its relatives such as Charm++

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 9 Finding Concurrency in Problems Identify a decomposition of the problem into sub- problems that can be solved simultaneously –A task decomposition that identifies tasks for potential concurrent execution –A data decomposition that identifies data local to each task –A way of grouping tasks and ordering the groups to satisfy temporal constraints –An analysis on the data sharing patterns among the concurrent tasks –A design evaluation that assesses of the quality the choices made in all the steps

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 10 Finding Concurrency – The Process Task Decomposition Data Decomposition Data Sharing Order Tasks Decomposition Group Tasks Dependence Analysis Design Evaluation This is typically a iterative process. Opportunities exist for dependence analysis to play earlier role in decomposition.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 11 Task Decomposition Many large problems can be naturally decomposed into tasks – CUDA kernels are largely tasks –The number of tasks used should be adjustable to the execution resources available. –Each task must include sufficient work in order to compensate for the overhead of managing their parallel execution. –Tasks should maximize reuse of sequential program code to minimize effort. “In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens.” - Mattson, Sanders, Massingill

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 12 Task Decomposition Example - Square Matrix Multiplication P = M * N of WIDTH ● WIDTH –One natural task (sub-problem) produces one element of P –All tasks can execute in parallel in this example. M N P WIDTH

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 13 Task Decomposition Example – Molecular Dynamics Simulation of motions of a large molecular system For each atom, there are natural tasks to calculate –Vibrational forces –Rotational forces –Neighbors that must be considered in non-bonded forces –Non-bonded forces –Update position and velocity –Misc physical properties based on motions Some of these can go in parallel for an atom It is common that there are multiple ways to decompose any given problem.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 14 NAMD SPEC_NAMD 6 Different NAMD Configurations (all independent) SelfComputes Objects PairComputes Objects … iterations (per patch) Independent Iterations …. Force & Energy Calculation Inner Loops 1872 iterations (per patch pair) ….. PatchList Data Structure

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 15 Data Decomposition The most compute intensive parts of many large problem manipulate a large data structure –Similar operations are being applied to different parts of the data structure, in a mostly independent manner. –This is what CUDA is optimized for. The data decomposition should lead to –Efficient data usage by tasks within the partition –Few dependencies across the tasks that work on different partitions –Adjustable partitions that can be varied according to the hardware characteristics

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 16 Data Decomposition Example - Square Matrix Multiplication Row blocks –Computing each partition requires access to entire N array Square sub-blocks –Only bands of M and N are needed M N P WIDTH

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 17 Tasks Grouping Sometimes natural tasks of a problem can be grouped together to improve efficiency –Reduced synchronization overhead – all tasks in the group can use a barrier to wait for a common dependence –All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shard Memory) –Grouping and merging dependent tasks into one task reduces need for synchronization –CUDA thread blocks are task grouping examples.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 18 P Task Grouping Example - Square Matrix Multiplication Tasks calculating a P sub-block –Extensive input data sharing, reduced memory bandwidth using Shared Memory –All synched in execution M N WIDTH

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 19 Task Ordering Identify the data and resource required by a group of tasks before they can execute them –Find the task group that creates it –Determine a temporal order that satisfy all data constraints

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 20 Task Ordering Example: Molecular Dynamics Neighbor List Vibrational and Rotational Forces Non-bonded Force Next Time Step Update atomic positions and velocities

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 21 Data Sharing Data sharing can be a double-edged sword –Excessive data sharing can drastically reduce advantage of parallel execution –Localized sharing can improve memory bandwidth efficiency Efficient memory bandwidth usage can be achieved by synchronizing the execution of task groups and coordinating their usage of memory data –Efficient use of on-chip, shared storage Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires synchronization

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 22 Data Sharing Example – Matrix Multiplication Each task group will finish usage of each sub-block of N and M before moving on –N and M sub-blocks loaded into Shared Memory for use by all threads of a P sub-block –Amount of on-chip Shared Memory strictly limits the number of threads working on a P sub-block Read-only shared data can be more efficiently accessed as Constant or Texture data

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 23 Data Sharing Example – Molecular Dynamics The atomic coordinates –Read-only access by the neighbor list, bonded force, and non-bonded force task groups –Read-write access for the position update task group The force array –Read-only access by position update group –Accumulate access by bonded and non-bonded task groups The neighbor list –Read-only access by non-bonded force task groups –Generated by the neighbor list task group