CS6963 L13: Application Case Study III: Molecular Visualization and Material Point Method.


Administrative
- Proposal due TODAY
- STRSM due March 15
- Midterm coming
  - In class March 30 or April 4, open notes
  - Review notes, readings, and the review lecture (before break)
  - Will post prior exams
- Design Review
  - Intermediate assessment of progress on project, oral and short
  - Tentatively April 11 and 13
- Final projects
  - Poster session, April 27 (dry run April 25)
  - Final report, May 4

Project Proposal (due TODAY)
Proposal logistics:
- Significant implementation, worth 55% of grade
- Each person turns in the proposal (should be the same as other team members')
Proposal:
- 3-4 page document (11pt, single-spaced)
- Submit with handin program: "handin cs6963 prop"

Content of Proposal
I. Team members: name and a sentence on expertise for each member
II. Problem description
- What is the computation and why is it important?
- Abstraction of the computation: equations, graphic, or pseudo-code, no more than 1 page
III. Suitability for GPU acceleration
- Amdahl's Law: describe the inherent parallelism. Argue that it is close to 100% of the computation. Use measurements from CPU execution of the computation if possible.
- Synchronization and communication: discuss which data structures may need to be protected by synchronization, or communicated through the host.
- Copy overhead: discuss the data footprint and the anticipated cost of copying to/from host memory.
IV. Intellectual challenges
- Generally, what makes this computation worthy of a project?
- Point to any difficulties you anticipate at present in achieving high speedup.

Triangular Solve (STRSM)

for (j = 0; j < n; j++)
  for (k = 0; k < n; k++)
    if (B[j*n+k] != 0.0f) {
      for (i = k+1; i < n; i++)
        B[j*n+i] -= A[k*n+i] * B[j*n+k];
    }

Equivalent to:

cublasStrsm('l' /* left operator */,
            'l' /* lower triangular */,
            'N' /* not transposed */,
            'u' /* unit triangular */,
            N, N, alpha, d_A, N, d_B, N);

Approaching Projects / STRSM / Case Studies
1. Parallelism? How do dependences constrain partitioning strategies?
2. Analyze data accesses for different partitioning strategies
   - Start with global memory: coalesced?
   - Consider reuse: within a thread? Within a block? Across blocks?
3. Data placement (adjust partitioning strategy?)
   - Registers, shared memory, constant memory, texture memory, or just leave in global memory
4. Tuning
   - Unrolling, fine-tuning the partitioning, floating point, control flow, ...

Step 1. Simple Partition for STRSM

__global__ void strsm1(int n, float *A, float *B)
{
  int bx = blockIdx.x;
  int tx = threadIdx.x;
  int j = bx * THREADSPERBLOCK + tx;  // one thread per column; columns work independently
  int JN = j * n;
  int i, k;
  for (k = 0; k < n; ++k) {           // row
    int KN = k * n;
    for (i = k + 1; i < n; ++i) {     // also row
      // B[i][j] -= A[i][k] * B[k][j]: each element depends on elements in rows above it in the same column
      B[JN + i] -= A[KN + i] * B[JN + k];
    }
  }
}

Slide source: Mary Hall
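For completeness, a minimal host-side driver for a kernel like strsm1 might look like the following. This is a sketch, not from the slides: it assumes THREADSPERBLOCK is defined, n is a multiple of it, and omits all error checking.

```cuda
#define THREADSPERBLOCK 64

// Host-side driver sketch: copy A and B to the device, launch one
// thread per column of B, and copy the solved B back.
void launch_strsm1(int n, const float *A, float *B)
{
    float *d_A, *d_B;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);
    // One thread per column of B; columns are independent.
    strsm1<<<n / THREADSPERBLOCK, THREADSPERBLOCK>>>(n, d_A, d_B);
    cudaMemcpy(B, d_B, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
}
```

Note that with this partitioning, adjacent threads (consecutive j) touch elements n apart in memory, which is the kind of global-memory access pattern step 2 of the previous slide asks you to examine.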

Outline
Two application case studies:
- Molecular Dynamics Visualization
  - Read Kirk and Hwu, Ch. 9
  - Slide source: John Stone (excerpted)
- Material Point Method
  - Class project
  - Read "GPU Acceleration of the Generalized Interpolation Material Point Method," Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall, Phil Wallstedt, and James Guilkey, Symposium on Application Accelerators for High Performance Computing (SAAHPC), July 2009
  - Slides from SAAHPC 2009