Bandwidth Avoiding Stencil Computations
Kaushik Datta, Sam Williams, Kathy Yelick, Jim Demmel, and others
Berkeley Benchmarking and Optimization Group, UC Berkeley
March 13,

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion

What are stencil codes?
– For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
– A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)
(Figures: a 2D stencil and a 3D stencil.)

Stencil Applications
Stencils are critical to many scientific applications:
– Diffusion, Electromagnetics, Computational Fluid Dynamics
– Both explicit and implicit iterative methods (e.g. Multigrid)
– Both uniform and adaptive block-structured meshes
Many types of stencils:
– 1D, 2D, 3D meshes
– Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, …)
– Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
This talk focuses on the 3D, 7-point, Jacobi iteration.

Naïve Stencil Pseudocode (One iteration)

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
    for all grid indices in x-dim {
        for all grid indices in y-dim {
            for all grid indices in z-dim {
                B[center] = S0 * A[center]
                          + S1 * (A[top] + A[bottom] + A[left] + A[right]
                                + A[front] + A[back]);
            }
        }
    }
}
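To make the pseudocode concrete, here is a minimal runnable sketch of the same sweep. It assumes the grid is stored contiguously with z as the unit-stride dimension (as in the later diagrams) and that S0 and S1 are the two stencil weights; it is an illustration, not the authors' tuned code.

    #include <stddef.h>

    /* One Jacobi sweep of the 3D 7-point stencil: B = stencil(A).
       Point (i,j,k) lives at index i*ny*nz + j*nz + k, so k (the z
       dimension) is unit stride.  Boundary points are left untouched. */
    void stencil3d(const double *A, double *B,
                   int nx, int ny, int nz,
                   double S0, double S1) {
        const size_t sx = (size_t)ny * nz;   /* stride between x-planes */
        const size_t sy = (size_t)nz;        /* stride between y-lines  */
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                for (int k = 1; k < nz - 1; k++) {
                    size_t c = (size_t)i * sx + (size_t)j * sy + (size_t)k;
                    B[c] = S0 * A[c]
                         + S1 * (A[c - sx] + A[c + sx]     /* x neighbors */
                               + A[c - sy] + A[c + sy]     /* y neighbors */
                               + A[c - 1]  + A[c + 1]);    /* z neighbors */
                }
    }

Because this is a Jacobi iteration, the roles of A and B are swapped after each sweep, which is why two grids are needed.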

2D Poisson Stencil- Specific Form of SpMV
Stencil uses an implicit matrix:
– No indirect array accesses!
– Stores a single value for each diagonal
The 3D stencil is analogous (but with 7 nonzero diagonals).
(Figure: the matrix T, with 4 on the diagonal, together with its graph and “stencil”.)
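As a small illustration of the "implicit matrix" point, a matrix-free apply of the 2D Poisson operator can be written directly as a 5-point stencil; the only "matrix" data are the two constants 4 and -1, one value per diagonal. This is a sketch assuming an n-by-n grid with fixed boundary values, not code from the talk.

    /* y = T*x for the 2D Poisson matrix T, applied as a 5-point stencil.
       No index arrays are stored: the diagonals are the constants 4 and -1. */
    void poisson2d_apply(const double *x, double *y, int n) {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++) {
                int c = i * n + j;
                y[c] = 4.0 * x[c]
                     - x[c - n] - x[c + n]    /* north / south neighbors */
                     - x[c - 1] - x[c + 1];   /* west  / east  neighbors */
            }
    }

The same operation written as a general sparse matrix-vector multiply would need explicit row pointers and column indices, which is exactly the indirection the stencil form avoids.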

Reduce Memory Traffic!
Stencil performance usually limited by memory bandwidth
Goal: increase performance by minimizing memory traffic
– Even more important for multicore!
Concentrate on getting reuse both:
– within an iteration
– across iterations (Ax, A²x, …, Aᵏx)
Only interested in the final result
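A rough worked estimate of the traffic involved (illustrative numbers, assuming double precision, a write-allocate cache, and a grid much larger than cache; they are consistent with the 24 B/point figure quoted in the conclusions): a naive Jacobi sweep moves about 8 B to read each point of A, 8 B for the write-allocate fill of the corresponding point of B, and 8 B to write B back, i.e. roughly 24 B/point per iteration, or 24·k B/point for k iterations. If instead the intermediate grids of Ax, A²x, …, Aᵏx can be kept in cache, the whole sequence only has to stream the source grid in and the final result out once, approaching 16 B/point total. That gap is the payoff of reuse across iterations.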

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion

Grid Traversal Algorithms

                          Intra-iteration reuse
                          No*               Yes
Inter-iteration   No*     Naive             Cache Blocking
reuse             Yes     N/A               Time Skewing, Circular Queue

(* under certain circumstances)

One common technique:
– Cache blocking guarantees reuse within an iteration
Two novel techniques:
– Time Skewing and Circular Queue also exploit reuse across iterations

Grid Traversal Algorithms

                          Intra-iteration reuse
                          No*               Yes
Inter-iteration   No*     Naive             Cache Blocking
reuse             Yes     N/A               Time Skewing, Circular Queue

(* under certain circumstances)

One common technique:
– Cache blocking guarantees reuse within an iteration
Two novel techniques:
– Time Skewing and Circular Queue also exploit reuse across iterations

Naïve Algorithm
Traverse the 3D grid in the usual way:
– No exploitation of locality
– Grids that don’t fit in cache will suffer
(Diagram: naive sweep over the grid; y is the unit-stride dimension.)

Grid Traversal Algorithms

                          Intra-iteration reuse
                          No*               Yes
Inter-iteration   No*     Naive             Cache Blocking
reuse             Yes     N/A               Time Skewing, Circular Queue

(* under certain circumstances)

One common technique:
– Cache blocking guarantees reuse within an iteration
Two novel techniques:
– Time Skewing and Circular Queue also exploit reuse across iterations

Cache Blocking- Single Iteration At a Time
Guarantees reuse within an iteration:
– “Shrinks” each plane so that three source planes fit into cache
– However, no reuse across iterations
In 3D, there is a tradeoff between cache blocking and prefetching:
– Cache blocking reduces memory traffic by reusing data
– However, short stanza lengths do not allow prefetching to hide memory latency
Conclusion: when cache blocking, don’t cut in the unit-stride dimension!
(Diagram: a cache-blocked sweep; y is the unit-stride dimension.)
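A minimal sketch of such a cache-blocked sweep, following the slide's advice: the grid is blocked in x and y only, and the unit-stride z dimension is left uncut so hardware prefetchers still see long streams. The block sizes bx and by are tuning parameters; this is an illustration, not the authors' code.

    #include <stddef.h>

    /* Cache-blocked sweep of the 3D 7-point stencil (one iteration).
       Blocks only the x and y loops; the unit-stride z loop runs full length. */
    void stencil3d_blocked(const double *A, double *B,
                           int nx, int ny, int nz,
                           int bx, int by,            /* block sizes in x and y */
                           double S0, double S1) {
        const size_t sx = (size_t)ny * nz;
        const size_t sy = (size_t)nz;
        for (int ii = 1; ii < nx - 1; ii += bx)
            for (int jj = 1; jj < ny - 1; jj += by)
                for (int i = ii; i < ii + bx && i < nx - 1; i++)
                    for (int j = jj; j < jj + by && j < ny - 1; j++)
                        for (int k = 1; k < nz - 1; k++) {   /* full unit-stride stanza */
                            size_t c = (size_t)i * sx + (size_t)j * sy + (size_t)k;
                            B[c] = S0 * A[c]
                                 + S1 * (A[c - sx] + A[c + sx]
                                       + A[c - sy] + A[c + sy]
                                       + A[c - 1]  + A[c + 1]);
                        }
    }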

Grid Traversal Algorithms

                          Intra-iteration reuse
                          No*               Yes
Inter-iteration   No*     Naive             Cache Blocking
reuse             Yes     N/A               Time Skewing, Circular Queue

(* under certain circumstances)

One common technique:
– Cache blocking guarantees reuse within an iteration
Two novel techniques:
– Time Skewing and Circular Queue also exploit reuse across iterations

Time Skewing- Multiple Iterations At a Time
Now we allow reuse across iterations.
Cache blocking now becomes trickier:
– Need to shift the block after each iteration to respect dependencies
– Requires the cache block dimension c as a parameter (or else cache oblivious)
– We call this “Time Skewing” [Wonnacott ’00]
(Figure: a simple 3-point 1D stencil with 4 cache blocks.)
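The 1D, 3-point version of this traversal is small enough to sketch in full. The sketch below shifts each cache block left by one point per time step so the dependencies stay satisfied, clamps blocks at the grid edges, and alternates between the two Jacobi arrays by the parity of the time step. It is an illustration of the parallelogram ("time skewed") traversal under the assumptions noted in the comments, not the authors' implementation; the equal weights of 1/3 are arbitrary.

    /* 1D time skewing for a 3-point Jacobi stencil.
       g0 holds the initial data; g0 and g1 must both contain the fixed
       boundary values at indices 0 and n-1, and c >= 2, T >= 1.
       After the call, the values for time step T are in g0 if T is even,
       in g1 if T is odd. */
    void jacobi1d_timeskew(double *g0, double *g1, int n, int c, int T) {
        double *grid[2] = { g0, g1 };
        /* extra blocks on the right so the skewed blocks still cover the grid */
        int nblocks = (n - 2 + (T - 1) + c - 1) / c;
        for (int b = 0; b < nblocks; b++) {
            for (int t = 1; t <= T; t++) {
                /* block b, shifted left by (t - 1) points at time step t */
                int lo = 1 + b * c - (t - 1);
                int hi = lo + c;
                if (lo < 1)     lo = 1;
                if (hi > n - 1) hi = n - 1;
                const double *src = grid[(t - 1) % 2];
                double       *dst = grid[t % 2];
                for (int i = lo; i < hi; i++)
                    dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0;
            }
        }
    }

Each interior point is computed exactly once per time step, so there is no redundant computation, but the blocks must be processed in order, which is the "inherently sequential" drawback noted on the analysis slide.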

2-D Time Skewing Animation
(Animation: four cache blocks swept through 0, 1, 2, 3, and 4 iterations; y is the unit-stride dimension.)
Since these are Jacobi iterations, we alternate writes between the two arrays after each iteration.

Time Skewing Analysis
Positives:
– Exploits reuse across iterations
– No redundant computation
– No extra data structures
Negatives:
– Inherently sequential
– Need to find the optimal cache block size (can use exhaustive search, a performance model, or a heuristic)
– As the number of iterations increases:
  – Cache blocks can “fall” off the grid
  – Work between cache blocks becomes more imbalanced

Time Skewing- Optimal Block Size Search
(Figure: performance over the search space of cache block sizes; the “GOOD” arrow marks the direction of better performance.)

Reduced memory traffic does correlate with higher GFlop rates.
(Figure: GFlop rate vs. memory traffic; the “GOOD” arrow marks the direction of better performance.)

Grid Traversal Algorithms

                          Intra-iteration reuse
                          No*               Yes
Inter-iteration   No*     Naive             Cache Blocking
reuse             Yes     N/A               Time Skewing, Circular Queue

(* under certain circumstances)

One common technique:
– Cache blocking guarantees reuse within an iteration
Two novel techniques:
– Time Skewing and Circular Queue also exploit reuse across iterations

2-D Circular Queue Animation
(Animation: planes stream from the read array through first- and second-iteration buffers into the write array.)

Parallelizing Circular Queue
– Stream in planes from the source grid
– Stream out planes to the target grid
– Each processor receives a colored block
– Redundant computation when performing multiple iterations

Circular Queue Analysis
Positives:
– Exploits reuse across iterations
– Easily parallelizable
– No need to alternate the source and target grids after each iteration
Negatives:
– Redundant computation (gets worse with more iterations)
– Need to find the optimal cache block size (can use exhaustive search, a performance model, or a heuristic)
– Extra data structure needed (however, minimal memory overhead)
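As an illustration of the circular queue idea (and of the "extra data structure" it needs), here is a minimal single-iteration sketch for the 3D 7-point stencil: three source planes are kept in a small circular buffer, the output plane for the middle one is computed and streamed out, and the oldest slot is then reused for the next incoming plane. The buffer is heap-allocated for simplicity; on a real machine it would be sized to fit in cache or local store. This is a sketch, not the authors' code, and it assumes nx >= 3 with z as the unit-stride dimension.

    #include <stdlib.h>
    #include <string.h>

    /* One sweep of the 7-point stencil using a circular queue of 3 source planes.
       Plane i is the y-z plane at x = i. */
    void stencil3d_cqueue(const double *A, double *B,
                          int nx, int ny, int nz,
                          double S0, double S1) {
        const size_t plane = (size_t)ny * nz;
        double *q[3];
        for (int p = 0; p < 3; p++)
            q[p] = malloc(plane * sizeof(double));

        /* prime the queue with the first two source planes */
        memcpy(q[0], A, plane * sizeof(double));
        memcpy(q[1], A + plane, plane * sizeof(double));

        for (int i = 1; i < nx - 1; i++) {
            /* stream in the next source plane, overwriting the oldest slot */
            memcpy(q[(i + 1) % 3], A + (size_t)(i + 1) * plane, plane * sizeof(double));
            const double *lo  = q[(i - 1) % 3];   /* plane i-1 */
            const double *mid = q[ i      % 3];   /* plane i   */
            const double *hi  = q[(i + 1) % 3];   /* plane i+1 */

            /* compute and stream out the output plane at x = i */
            for (int j = 1; j < ny - 1; j++)
                for (int k = 1; k < nz - 1; k++) {
                    size_t c = (size_t)j * nz + k;
                    B[(size_t)i * plane + c] =
                          S0 * mid[c]
                        + S1 * (lo[c] + hi[c]                 /* x neighbors */
                              + mid[c - nz] + mid[c + nz]     /* y neighbors */
                              + mid[c - 1]  + mid[c + 1]);    /* z neighbors */
                }
        }
        for (int p = 0; p < 3; p++)
            free(q[p]);
    }

For multiple iterations, additional queues of intermediate planes are needed, and neighboring blocks must recompute some planes, which is the redundant computation noted above.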

Algorithm Spacetime Diagrams
(Figure: space-time diagrams of the four cache blocks for Naive, Cache Blocking, Time Skewing, and Circular Queue.)

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion

Serial Performance
(Figures: results on a single core of a 1-socket x 4-core Intel Xeon (Kentsfield) and on a single core of a 1-socket x 2-core AMD Opteron.)

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion

Multicore Performance
(Figures: performance vs. number of cores, for 1 iteration of the problem.)
Left side:
– Intel Xeon (Clovertown)
– 2 sockets x 4 cores
– Machine peak DP: 85.3 GFlop/s
Right side:
– AMD Opteron (Rev. F)
– 2 sockets x 2 cores
– Machine peak DP: 17.6 GFlop/s

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion

Stencil Code Conclusions
Need to autotune!
– Choosing the appropriate algorithm AND block sizes for each architecture is not obvious
– Can be used with a performance model
– My thesis work :)
Appropriate blocking and streaming stores are most important for x86 multicore:
– Streaming stores reduce memory traffic from 24 B/pt. to 16 B/pt.
Getting good performance out of x86 multicore chips is hard!
– Applied 6 different optimizations, all of which helped at some point

Backup Slides

Poisson’s Equation in 1D
Discretize d²u/dx² = f(x) on a regular mesh with u_i = u(i·h) to get:
[ u_{i+1} − 2·u_i + u_{i−1} ] / h² = f(x)
Write this as solving T·u = −h²·f for u, where T is the tridiagonal matrix with 2 on the diagonal and −1 on the off-diagonals.
(Figure: the matrix T, its graph, and the “stencil”.)
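Written out explicitly (a standard restatement of this discretization, reconstructed here rather than copied from the slide), the equation and system are:

\frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} = f(ih),
\qquad
T =
\begin{pmatrix}
 2 & -1 &        &        &    \\
-1 &  2 & -1     &        &    \\
   & -1 &  2     & \ddots &    \\
   &    & \ddots & \ddots & -1 \\
   &    &        & -1     &  2
\end{pmatrix},
\qquad
T\,u = -h^2 f .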

Cache Blocking with Time Skewing Animation
(Animation: time-skewed cache blocking on the 3D grid; z is the unit-stride dimension.)

Cache Conscious Performance
– Cache-conscious performance is measured with the optimal block size on each platform
– Itanium 2 and Opteron both improve

Cell Processor
– PowerPC core that controls 8 simple SIMD cores (“SPEs”)
– Memory hierarchy consists of: registers, local memory, external DRAM
– Application explicitly controls memory:
  – Explicit DMA operations required to move data from DRAM to each SPE’s local memory
  – Effective for predictable data access patterns
– Cell code contains more low-level intrinsics than prior code
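As a rough sketch of what the explicit DMA looks like on an SPE (a simplified example using the Cell SDK's spu_mfcio.h interface; alignment and size restrictions are glossed over, and the real code would double-buffer to overlap DMA with computation rather than wait after every transfer):

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define TAG 3

    /* Pull one plane of the source grid from DRAM into local store,
       compute on it, and push the result plane back out. */
    void process_plane(uint64_t src_ea, uint64_t dst_ea,
                       volatile double *ls_buf, unsigned plane_bytes) {
        mfc_get(ls_buf, src_ea, plane_bytes, TAG, 0, 0);   /* DRAM -> local store */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                         /* wait for completion */

        /* ... apply the stencil to the buffered planes here ... */

        mfc_put(ls_buf, dst_ea, plane_bytes, TAG, 0, 0);   /* local store -> DRAM */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }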

Excellent Cell Processor Performance
Double-Precision (DP) Performance: 7.3 GFlop/s
– DP performance still relatively weak: only 1 floating-point instruction every 7 cycles
– Problem becomes computation-bound when cache-blocked
Single-Precision (SP) Performance: 65.8 GFlop/s!
– Problem now memory-bound even when cache-blocked
If Cell had better DP performance or ran in SP, it could take further advantage of cache blocking.

Summary - Computation Rate Comparison

Summary - Algorithmic Peak Comparison

Outline
– Stencil Introduction
– Grid Traversal Algorithms
– Serial Performance Results
– Parallel Performance Results
– Conclusion