Thinking Outside of the Tera-Scale Box
Piotr Luszczek

Brief History of Tera-flop: ASCI Red, Intel Polaris, GPGPU

Tera-flop: a Dictionary Perspective
Supercomputer tera-flop: belongs to a supercomputer expert; consumes 500 kW; costs over $1M; requires distributed-memory programming; memory: 1 GiB or more.
GPU tera-flop: belongs to a gaming geek; consumes under 500 W; costs under $200; requires only a CUDA kernel; memory: 1 GiB or more.

Double-Dip Recession?
[Chart: programmer productivity over time dips twice: once at the multicore transition, addressed by Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, ..., and again at the hybrid multicore+GPU transition, addressed by HMPP, PGI Accelerator, PathScale?]

Locality Space in HPC Challenge: Bounds
[Chart: the HPC Challenge benchmarks (HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess) placed in a locality space, with latency-bound, bandwidth-bound, and compute-bound regions marked; AORSA2D and PCIe shown for reference]
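
To make the bounds concrete, here is a minimal CUDA sketch (not from the deck; kernel names are illustrative) contrasting two extremes of the locality space: a STREAM-style triad that is bandwidth-bound, and a loop doing many flops per element that is compute-bound.

#include <cuda_runtime.h>

// Bandwidth-bound: the STREAM triad does ~2 flops per 16 bytes moved,
// so memory bandwidth, not the ALUs, sets the speed limit.
__global__ void triad(int n, float alpha,
                      const float *b, const float *c, float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = b[i] + alpha * c[i];
}

// Compute-bound: thousands of flops per element loaded, so memory
// traffic is negligible and arithmetic throughput dominates.
__global__ void compute_heavy(int n, float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];
        for (int k = 0; k < 4096; ++k)
            x = x * 0.999f + 0.001f;
        a[i] = x;
    }
}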

CUDA vs. OpenCL: Similar Software Stacks
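
As a rough illustration of how closely the two stacks mirror each other, the following CUDA host fragment annotates each call with its usual OpenCL counterpart; the mapping is the conventional one, not taken from the slides, and the function copy_in_and_run is illustrative.

#include <cuda_runtime.h>

// Each CUDA runtime call below has a near one-to-one OpenCL
// equivalent, noted in the trailing comment.
void copy_in_and_run(const float *host, size_t bytes)
{
    float *dev = NULL;
    cudaMalloc((void **)&dev, bytes);                 // clCreateBuffer
    cudaMemcpy(dev, host, bytes,
               cudaMemcpyHostToDevice);               // clEnqueueWriteBuffer
    // kernel<<<grid, block>>>(dev, ...);             // clSetKernelArg +
    //                                                //   clEnqueueNDRangeKernel
    cudaDeviceSynchronize();                          // clFinish
    cudaFree(dev);                                    // clReleaseMemObject
}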

Matrix-Matrix Multiply: the Simple Form

for i = 1:M
  for j = 1:N
    for k = 1:T
      C(i, j) += A(i, k) * B(k, j)

[Chart: achieved performance of tuned implementations: CUBLAS, Volkov et al., MAGMA, Nakasato]
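
For reference, a minimal CUDA rendering of that triple loop, one thread per element of C; the kernel name and row-major layout are assumptions for illustration, not the tuned CUBLAS/MAGMA code the chart measures.

__global__ void sgemm_naive(int M, int N, int T,
                            const float *A, const float *B, float *C)
{
    // One thread computes one C(i, j); row-major storage assumed.
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    if (i < M && j < N) {
        float sum = 0.0f;
        for (int k = 0; k < T; ++k)
            sum += A[i * T + k] * B[k * N + j];
        C[i * N + j] += sum;
    }
}

Launched with, say, 16x16 thread blocks on a grid covering M x N, this naive form is the baseline that the tuned implementations above improve on.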

Fermi C2050: from CUDA to OpenCL

Fermi C2050: Simple OpenCL Porting

ATI 5870: Simple OpenCL Porting

Making the Case for Autotuning: Use of Texture
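
A sketch of what "use of texture" meant at the time, assuming the Fermi-era texture-reference API (deprecated and since removed from CUDA); the names texB and sgemm_tex are illustrative.

// Read-only accesses to B go through the texture cache via a
// texture reference, the mechanism available on Fermi-class GPUs.
texture<float, cudaTextureType1D, cudaReadModeElementType> texB;

__global__ void sgemm_tex(int M, int N, int T, const float *A, float *C)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < M && j < N) {
        float sum = 0.0f;
        for (int k = 0; k < T; ++k)
            sum += A[i * T + k] * tex1Dfetch(texB, k * N + j);
        C[i * N + j] += sum;
    }
}

// Host side, before launch:
//   cudaBindTexture(0, texB, dev_B, T * N * sizeof(float));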

Making the Case for Autotuning: Blocking Factors Provided by NVIDIA
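
The blocking factor is exactly the kind of parameter an autotuner sweeps. A minimal sketch, assuming square NB x NB thread blocks and matrix dimensions divisible by NB; an autotuner would recompile and time this for several NB values (e.g., 8, 16, 32) and keep the fastest.

#define NB 16  // blocking factor: the knob the autotuner sweeps

__global__ void sgemm_blocked(int M, int N, int T,
                              const float *A, const float *B, float *C)
{
    // Each NB x NB thread block stages tiles of A and B in shared
    // memory, so each element is read from DRAM once per tile.
    __shared__ float As[NB][NB];
    __shared__ float Bs[NB][NB];
    int i = blockIdx.y * NB + threadIdx.y;
    int j = blockIdx.x * NB + threadIdx.x;
    float sum = 0.0f;
    for (int kb = 0; kb < T; kb += NB) {
        As[threadIdx.y][threadIdx.x] = A[i * T + kb + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(kb + threadIdx.y) * N + j];
        __syncthreads();
        for (int k = 0; k < NB; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[i * N + j] += sum;
}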

Mat-Mul Portability Summary

                         Achieved performance
                         Single precision   Double precision
ATI OpenCL                     52%                57%
ATI IL                         74%                87%
NVIDIA CUDA                    63%                58%
NVIDIA OpenCL                  57%                49%
Multicore (vendor)             79%
Multicore (autotuned)          46%                40%

Now, let’s think outside of the box

Recipe for Performance Before Multicore
- Wider superscalar instruction issue
- Increased clock frequency
- Increased number of transistors
- Load/store units and instruction scheduling

Recipe for Performance After Multicore
- More cores
- Parallel software

Recipe for Future Performance
- More cores
- Better sequential software
- DAG runtime

From Assembly to DAG: PLASMA

Typical loop in assembly (a sequential instruction stream, handled by instruction scheduling):
  load R0, #1000
  divf F1, F2, F3
  load R1, #1000
  divf F2, F4, F5
  dec R1
  jump ...

LU factorization (a sequential task stream, handled by DAG scheduling of GETRF2, TRSM, and GEMM tasks):
  for i = 1:N
    GETRF2(A[i, i])
    for j = i+1:N
      TRSM(A[i, i], A[i, j])
      for k = i+1:N
        GEMM(A[k, i], A[i, j], A[k, j])

Scalability of LU Factorization on Multicore
[Chart: PLASMA vs. MKL 11.0 vs. LAPACK on AMD Istanbul 2.8 GHz; 6 cores = 1 socket, 24 cores = 4 sockets, 48 cores = 8 sockets]

Combining Multicore and GPU for QR Factorization
[Chart: AMD Istanbul 2.8 GHz, 24 cores, and Fermi GTX 480]

Performance on Distributed Memory
[Charts: QR, LU, and Cholesky with DPLASMA vs. Cray LibSci vs. peak on ORNL Kraken, a Cray XT5 with dual-socket 6-core AMD Istanbul 2.6 GHz nodes and a SeaStar 3D-torus interconnect]

People and Projects
MAGMA and PLASMA: Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber
DPLASMA and DAGuE: George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier