Presentation transcript:

Performance Portable Tracking of Evolving Surfaces
Wei Yu (Citadel Investment Group)
Franz Franchetti, James C. Hoe, José M. F. Moura (Carnegie Mellon University)
Tsuhan Chen (Cornell University)
This work was supported by DARPA, ONR, NSF, Intel, Mercury, and ITRI.

The Future is Parallel and Heterogeneous
Before 2000: conventional single-core CPU platforms.
Multicore and later:
  Core2 Duo, Core2 Extreme
  Cell BE (8+1 cores)
  Sun Niagara (32 threads)
  IBM Cyclops64 (80 cores)
  Tilera TILEPro (64 cores)
  IBM POWER7 (2x8 cores)
  Intel Sandy Bridge (8-way float vectors)
  Intel Larrabee, Intel Haswell
  Nvidia GPUs (240 streaming cores), Nvidia Fermi
  AMD Fusion, BlueGene/Q
  vector coprocessors
Heterogeneous platforms:
  Virtex 5 FPGA + 4 CPUs
  SGI RASC (Itanium + FPGA)
  XtremeData (Opteron + FPGA)
  ClearSpeed (192 cores)
The open questions: Programmability? Performance portability? Rapid prototyping?

Software's Slow Pace of Change
Popular performance programming languages: 1953: Fortran; 1973: C; 1985: C++; 1997: OpenMP; 2007: CUDA
Popular performance libraries: 1979: BLAS; 1992: LAPACK; 1994: MPI; 1995: ScaLAPACK; 1995: PETSc; 1997: FFTW
Popular productivity/scripting languages: 1987: Perl; 1989: Python; 1993: Ruby; 1995: Java; 2000: C#

Compilers: The Fundamental Problem

Matrix-matrix multiplication:

for i=1:N
  for j=1:N
    for k=1:N
      C[i,j] += A[i,k]*B[k,j]

Tiled + unrolled + parallelized + SIMDized + prefetching matrix-matrix multiplication:

#pragma omp parallel for
for i=1:NB:N
  for j=1:NB:N
    for k=1:NB:N
      for i0=i:NU:i+NB
        #pragma prefetch B
        for j0=j:NU:j+NB
          for k0=k:NU:k+NB
            #pragma unroll
            for k00=k0:1:k0+NU
              for j00=j0:1:j0+NU
                #pragma vector always
                for i00=i0:1:i0+NU
                  C[i00,j00] += A[i00,k00]*B[k00,j00];

Problem: which transformation order? What parameter values?
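For concreteness, here is a minimal runnable C sketch of the two extremes above, assuming square row-major matrices and a cache tile size NB that divides N; it omits the unrolling, prefetching, and SIMD pragmas of the full transformation chain, and good values for NB are exactly what autotuning must find.

#include <stddef.h>

#define N  512
#define NB 64   /* cache tile size (assumed; the autotuning parameter) */

/* naive triple loop: C += A*B */
void mmm_naive(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* cache-tiled variant: identical arithmetic, cache-friendly traversal */
void mmm_tiled(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (size_t i = 0; i < N; i += NB)
        for (size_t j = 0; j < N; j += NB)
            for (size_t k = 0; k < N; k += NB)
                for (size_t i0 = i; i0 < i + NB; i0++)
                    for (size_t j0 = j; j0 < j + NB; j0++)
                        for (size_t k0 = k; k0 < k + NB; k0++)
                            C[i0][j0] += A[i0][k0] * B[k0][j0];
}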

Autotuning and Program Synthesis
Compiler-based autotuning:
  Polyhedral framework: IBM XL, Pluto, CHiLL
  Transformation prescription: CHiLL, POET
  Profile-guided optimization: Intel C, IBM XL
Synthesis from domain math:
  Spiral: signal and image processing, SDR
  Tensor Contraction Engine: quantum chemistry code synthesizer
  FLAME: numerical linear algebra (LAPACK)
Autotuning numerical libraries:
  ATLAS BLAS generator, FFTW kernel generator, vendor math libraries, code generation scripts

Level-Set Image Segmentation
Level set algorithm:
  PDE-based image segmentation method
  The image produces a force field on a moving contour
  Embeds the contour into an (n+1)-dimensional level-set function
Narrow band level set algorithm:
  Computes only in a neighborhood of the boundary
  Makes the algorithm computationally tractable, but complicated
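A minimal sketch of one explicit level-set update on a 2-D grid, evolving phi along its normal via phi_t = F*|grad phi|; the grid size, the function names, and the simple central-difference scheme are illustrative assumptions (production level-set codes use upwind differencing and, as above, restrict the loop to the narrow band).

#include <math.h>

#define W 256   /* image width  (assumed) */
#define H 256   /* image height (assumed) */

void levelset_step(const float phi[H][W], float phi_new[H][W],
                   const float F[H][W], float dt)
{
    for (int j = 1; j < H - 1; j++)
        for (int i = 1; i < W - 1; i++) {
            float gx = 0.5f * (phi[j][i+1] - phi[j][i-1]);  /* central differences */
            float gy = 0.5f * (phi[j+1][i] - phi[j-1][i]);
            /* evolve along the normal: phi_t = F * |grad phi| */
            phi_new[j][i] = phi[j][i] + dt * F[j][i] * sqrtf(gx*gx + gy*gy);
        }
}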

Target: Cache-Based Multicores
COTS multicore: shared memory (SMP or NUMA), cache-based memory hierarchy (L1, L2, L3), SIMD vector instructions (SSE, AltiVec, ...).
  Intel Core i7: 8 cores, 2-way SMT
  IBM POWER7: 8 cores, 4-way SMT
  UltraSPARC T3: 16 cores, 8-way SMT
  Intel Atom: 1 core, 2-way SMT

Why a Challenging Problem?
The narrow band level set sits between two well-understood kernels:
  Stencil: dense & regular data structure, O(N) data reuse across iterations, compute bound
  Narrow band level set: sparse data with dense substructure
  Sparse MV: sparse & irregular data structure, O(1) data reuse, memory bound
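For contrast with the dense stencil, here is the textbook CSR sparse matrix-vector product; the indirect, irregular access through the column index array is what makes SpMV memory bound, while the stencil's fixed neighbor pattern permits the compute-bound, time-skewed optimizations used below. (Standard CSR layout assumed.)

/* y = A*x with A in compressed sparse row (CSR) format */
void spmv_csr(int n, const int *rowptr, const int *col,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        float s = 0.0f;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            s += val[k] * x[col[k]];   /* indirect, irregular access */
        y[i] = s;
    }
}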

Approach
Understand the algorithmic trade-offs: cheaper iterations vs. fewer iterations; computational load vs. algorithm regularity.
Understand the target hardware: what are the costs and trade-offs? What are good implementation choices and tricks?
Develop an optimized, parameterized implementation: parameters should expose the machine/algorithm interaction; parameterization may require macros or program generation.
Autotune the algorithm and implementation parameters: the right parameter choice is unclear because of the algorithm/architecture interaction; autotuning is the last step in aggressive performance tuning.
Of a 100x gain, the optimized parameterized program contributes about 50x and autotuning about 2x.

How to Get Level Set Up To Speed?
Techniques: sparse linear algebra (y = Ax), time skewing, autotuning, and code generation.
Code generation replaces the generic band-marking loop

for (int j=lj; j<=hj; j++)
  for (int i=li; i<=hi; i++)
    BI[j][i]=1;

with pattern-specialized code:

switch (case) {
  case (…):
    BI[j-1][i-1]=1;
    …
    BI[j+1][i+1]=1;
    break;
  case (…):
    …
}

How Far Did We Go?
Level 0: simple C program; implements the algorithm cleanly.
Level 1: C macros plus an autotuning search script; uses the C preprocessor for meta-programming (see the sketch below).
Level 2: autotuning plus scripting for code specialization; text-based program generation and specialization.
Level 3: add compiler technology to synthesize parts; internal code representation, standard compiler passes.
Level 4: synthesize the whole program from scratch; high-level representation, domain-specific synthesis.
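A small sketch of what Level 1 looks like in practice, assuming a hypothetical kernel and a search script that rebuilds the program with different -D settings and times each variant; TILE stands in for the kind of preprocessor parameter the script sweeps.

#ifndef TILE
#define TILE 4                 /* overridden by the search script: cc -DTILE=8 ... */
#endif

void update_tile(float *phi, int stride)
{
    /* stand-in body: with TILE a compile-time constant, the compiler
       can fully unroll and vectorize this loop nest */
    for (int j = 0; j < TILE; j++)
        for (int i = 0; i < TILE; i++)
            phi[j * stride + i] += 1.0f;
}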

Narrow Band Level Set in a Nutshell
Main loop: update the level set within the narrow band; check for convergence; if not converged and the zero level set has moved from t to t+1, update the band and iterate (see the sketch below).
Autotuning parameter: narrow band width
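A structural sketch of that flowchart as a driver loop; the helper functions are hypothetical names for the two boxes on the slide.

extern int  update_level_set_in_band(void);   /* one sweep; returns 1 once converged */
extern int  zero_set_near_band_edge(void);    /* has the contour drifted to the band boundary? */
extern void rebuild_narrow_band(int radius);  /* radius = the autotuned band width */

void narrow_band_solver(int band_radius)
{
    rebuild_narrow_band(band_radius);
    for (;;) {
        if (update_level_set_in_band())
            break;                            /* converged */
        if (zero_set_near_band_edge())
            rebuild_narrow_band(band_radius); /* update the band */
    }
}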

Projection of Manifold into Dense Stencil
Autotuning parameter: tile size

Projective Time Skewing
(Figure: standard 1-D dense time skewing vs. projective time skewing; row updates r0 ... r6 over time steps t0 to t0+3 are grouped into polyhedral tiles spanning the 2-D row number and the time iteration, with tile width and height as the polytope dimensions.)
Autotuning parameter: polyhedral tile size
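To make the baseline concrete, here is a minimal sketch of standard 1-D dense time skewing for a 3-point stencil, assuming two time-alternating buffers, a time-tile height TT, and a space-tile width TS; tiles are parallelograms shifted left by one point per time step so that every value read has already been produced. Boundary handling is deliberately simplified, and the projective variant on this slide additionally clips each tile against the narrow band.

#define NX 1024   /* grid points (assumed) */
#define TT 8      /* time-tile height (assumed autotuning parameter) */
#define TS 64     /* space-tile width (assumed autotuning parameter) */

void stencil_time_skewed(float a[2][NX], int t_steps)
{
    for (int tb = 0; tb < t_steps; tb += TT)          /* time tiles  */
        for (int xb = 0; xb < NX; xb += TS)           /* space tiles */
            for (int t = tb; t < tb + TT; t++) {
                int lo = xb - (t - tb);               /* skewed tile bounds */
                int hi = xb + TS - (t - tb);
                if (lo < 1)      lo = 1;              /* crude boundary clip */
                if (hi > NX - 1) hi = NX - 1;
                for (int x = lo; x < hi; x++)
                    a[(t + 1) & 1][x] = 0.25f *
                        (a[t & 1][x-1] + 2.0f * a[t & 1][x] + a[t & 1][x+1]);
            }
}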

In-Core Stencil Optimization
(Figure: data-flow graph of the 4x4-tile level-set kernel, computing the norm from the left/right/up/down neighbors with SSE operations: load, add/sub, mul, shuffle, rsqrt, cmple/cmpge, store; e.g., the vector operation addps xmm0, xmm1 on vector registers xmm0 and xmm1.)
Optimizations:
  Common subexpression elimination
  Transcendental function approximation (cos by a polynomial)
  SIMD vectorization
  Instruction scheduling (intra- and inter-stencil)
Autotuning parameter: tile instruction schedule
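A minimal SSE sketch of the kind of 4-wide update the data-flow graph depicts: a gradient from the four neighbor rows, then the fast approximate reciprocal square root. The pointer names and alignment assumptions (aligned up/down rows, one-element-shifted left/right loads) are illustrative; the generated kernel also uses shuffles, compare masks, and a tuned schedule.

#include <xmmintrin.h>

/* compute 1/|grad phi| for four adjacent pixels at once */
void norm4(const float *up, const float *down,
           const float *left, const float *right, float *out)
{
    __m128 gx = _mm_sub_ps(_mm_loadu_ps(right), _mm_loadu_ps(left));
    __m128 gy = _mm_sub_ps(_mm_load_ps(down),  _mm_load_ps(up));
    __m128 n2 = _mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy, gy));
    _mm_store_ps(out, _mm_rsqrt_ps(n2));   /* rsqrtps: fast approximate 1/sqrt */
}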

Multicore Optimization: Software Pipeline
The per-core work is software-pipelined into a prologue, a steady state, and an epilogue; data transfer between cores (e.g., core 1 to core 2) requires a memory fence (see the sketch below).
Autotuning parameter: prologue/epilogue vs. steady state
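A sketch of such a core-to-core handoff using C11 atomics, where the release/acquire pair provides the required memory fence; the tile buffer and flag layout are hypothetical.

#include <stdatomic.h>

float tile[64];                  /* data produced by core 1, consumed by core 2 */
atomic_int tile_ready;

void producer(void)              /* runs on core 1 */
{
    /* ... compute tile ... */
    atomic_store_explicit(&tile_ready, 1, memory_order_release);
}

void consumer(void)              /* runs on core 2 */
{
    while (!atomic_load_explicit(&tile_ready, memory_order_acquire))
        ;                        /* spin until the tile is published */
    /* ... tile may now be read safely ... */
}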

Narrow Band Update: Program Generation

Generic rubber stamp of the zero level set:

for (i=0;i<TILE;i++)
  for (j=0;j<TILE;j++)
    if (!is_zero(Ti+i,Tj+j)) {
      for (k=-BAND;k<=BAND;k++)
        for (m=-BAND;m<=BAND;m++)
          band[Ti+i+k][Tj+j+m]=1;
    }

Generated 2x4 rubber stamp (TILE=4x4, BAND=+/-1, unrolled 2x4):

for (i=0;i<4;i+=2) {
  b0=&band[Ti+i-1][Tj-1];
  b1=&band[Ti+i  ][Tj-1];
  b2=&band[Ti+i+1][Tj-1];
  b3=&band[Ti+i+2][Tj-1];
  switch (zeropat(Ti+i,Tj)) {
  ...
  case PAT(0,1,1,0,
           0,0,1,0):
    b0[1]=1; b0[2]=1; b0[3]=1; b0[4]=1;
    b1[1]=1; b1[2]=1; b1[3]=1; b1[4]=1;
    b2[1]=1; b2[2]=1; b2[3]=1; b2[4]=1;
    b3[2]=1; b3[3]=1; b3[4]=1;
    break;
  case PAT(...):
  }
}

Autotuning parameter: rubber stamp unrolling

A Fundamental Tradeoff
(Figure: total runtime vs. narrow band radius, split into level set update and band update time. A wider band makes each level set update more expensive but band updates less frequent, so the total runtime has a minimum at an intermediate radius.)

Parameter Space

Tuning parameter   Range
Tile size          {1,2,4,8} x {4,8,16}
Band radius        {1,2,3,4,5,6,7,8}
Polytope width     {2, 4, 8, 16, 24, 32, 48, 64, 80, 96, 128, 160, 192}
Polytope height    {2, 4, 8, 12, 16, 32}
Enable/disable     SIMD, instruction scheduling, approx. cos(), padding, large pages, ...

Autotuning: Iterative Line Search
Iterative line search: search the parameters one by one while holding the others fixed, with multiple refinement iterations; this works because the search space is well-behaved.
Heuristics cut the search space further, for instance in DAG scheduling.
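A compact sketch of such an iterative line search, assuming a hypothetical measure() that builds and times the kernel for one parameter assignment (lower is better); each pass sweeps one parameter's candidate values while the others stay fixed, and passes repeat until no parameter changes.

#define NPARAMS 3                              /* e.g., tile size, band radius, polytope size */

extern double measure(const int p[]);          /* hypothetical: runtime of one configuration */

void line_search(int p[NPARAMS], const int *cand[NPARAMS], const int ncand[NPARAMS])
{
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int i = 0; i < NPARAMS; i++) {    /* one parameter at a time */
            double best   = measure(p);
            int    best_v = p[i];
            for (int c = 0; c < ncand[i]; c++) {
                p[i] = cand[i][c];
                double t = measure(p);
                if (t < best) { best = t; best_v = p[i]; changed = 1; }
            }
            p[i] = best_v;                     /* keep the best value found */
        }
    }
}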

Efficiency on Single Core (2.8 GHz Xeon 5560)
(Figure: single-core performance of the in-core kernel and the full program relative to peak performance, the in-core upper bound, and the C baseline (best 3rd-party code); the annotated gains are 33x and 8x.)

Speed-Up on 2x 2.8 GHz Quad Xeon 5560
(Figure: parallel speed-up curves for 2, 4, and 8 threads.)

Performance Gain Through Autotuning
(Figure: autotuning gains on an Atom 1.6 GHz and on a 2x quad Xeon 2.8 GHz.)

Speedup Breakdown on 2x Xeon 5560
(Figure: contribution of the individual optimizations to the overall speedup.)

Lessons From Optimizing Level Set
A 10x to 100x speed-up is often possible, but it is really hard work and specific to the problem.
It is doable with standard tools: a C compiler, autotuning, and program generation scripts.
Crucial: extract meaningful parameters that capture the machine/algorithm interaction, one algorithm at a time.
Many things must be done right: SIMD, threading, low-level tricks, specialization, tuning.
Autotuning gives the last 50% to 2x of speed-up; the remaining 5x to 50x come from standard optimization.

More Information: