Slide 1: GPU Requirements for Large Scale Scientific Applications

"Begin with the end in mind…"

Dr. Mark Seager, Asst. DH for Advanced Technology
UCRL-PRES, August 7, 2004
Presented to the GP² Workshop

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

Slide 2: Overview

- Code characteristics
- Hardware requirements
- Software requirements
- Runtime requirements

Example runs:
- Bringa: 350M atoms, 1,944 × 1 GiB, 50 TB of output, 7 days; 10K LOC, 35% PE efficiency, 95% parallel efficiency
- Gilmer: 10M atoms, … GiB, 40 … (110 GB of output, 48 hr); 10K LOC, 35% PE efficiency, 95% parallel efficiency

Slide 3: Simulation's value depends on the other elements of the integrated program

- Simulations and the experimental program are tightly coupled for overall confidence in the stockpile

Slide 4: Code Characteristics

- Complex multi-physics package applications
  - Typically solving multiple types of PDEs
  - Time-evolution calculations (100K time steps → weeks of runtime)
  - Non-linear solves in each package (100s)
  - Linear solves within each non-linear solve (1,000s); see the sketch after this slide
  - Multiple physical-properties databases
- Languages include C, C++, Fortran90, and Python
  - 50K-1.5M LOC
  - Heavy use of complex structures and C++ templates
  - Need programming-model and platform-architecture stability for horizontal (platform) and vertical (time-dependent) portability
  - Very complex makefiles, controllers (Perl and Python), and pre- and post-processing
- Designed and written from the ground up for MPI- and OpenMP-style parallelism
  - Targeted at hierarchical memory systems
  - Lots of low-level parallelism left to be exploited, but with short vector lengths
- Written by large (5-25 person) teams
  - Core-physics physicists
  - Computer scientists
  - Mathematicians
- Timespans
  - 3-5 years to develop, 10 years of use, 5-10 years as legacy code
  - Constant evolution of the codes to add physics features, debug, and improve validation and databases
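A minimal, hypothetical sketch of the solver nesting described above: an outer time-evolution loop, a Newton-style non-linear solve per package, and an inner linear solve (a stand-in here). The physics, sizes, and names are all invented for illustration; a real package would couple many such solves across MPI ranks.

```c
#include <stdio.h>
#include <math.h>

#define N 1000                 /* unknowns per package (illustrative) */

/* Inner linear solve: a placeholder for A^{-1} * r (1,000s of these). */
static void linear_solve(double *dx, const double *r) {
    for (int i = 0; i < N; ++i)
        dx[i] = r[i] / 4.0;
}

/* One package's non-linear solve: Newton-like iteration (100s per step). */
static void nonlinear_solve(double *u) {
    double r[N], dx[N];
    for (int it = 0; it < 100; ++it) {
        double norm = 0.0;
        for (int i = 0; i < N; ++i) {
            r[i] = 1.0 - u[i] * u[i];    /* invented non-linear residual */
            norm += r[i] * r[i];
        }
        if (sqrt(norm) < 1e-10) break;   /* converged */
        linear_solve(dx, r);
        for (int i = 0; i < N; ++i) u[i] += dx[i];
    }
}

int main(void) {
    static double u[N];
    for (int i = 0; i < N; ++i) u[i] = 0.5;
    /* 1,000 steps here; production runs take ~100K steps (weeks). */
    for (int step = 0; step < 1000; ++step)
        nonlinear_solve(u);              /* one of several physics packages */
    printf("u[0] = %g\n", u[0]);
    return 0;
}
```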

Slide 5: Application Performance Characteristics

- Node code
  - No hot spots, e.g., a package has 20 routines at 5% of the runtime each
  - Compute intensive, with 5-35% performance efficiency
  - 5-20% FMA
  - Random-access and block-access memory patterns
  - Most have no math-library (e.g., BLAS3) usage
  - Typically use … GiB of memory
- MPI
  - Long and short messages, depending on the package
    - Exchanges for FEM
    - Random connections for sparse-matrix ops
  - Highly dependent on Barrier and ALL_REDUCE (see the sketch after this slide)
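A minimal sketch of the MPI pattern named above, with invented sizes and a ring topology standing in for real FEM neighbor lists: non-blocking point-to-point exchanges plus the global MPI_Allreduce (e.g., on a residual norm) that makes these codes so sensitive to reduction and barrier latency at scale.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* ring neighbors stand in   */
    int right = (rank + 1) % size;          /* for FEM face exchanges    */

    double send[1024], recv_l[1024], recv_r[1024];
    for (int i = 0; i < 1024; ++i) send[i] = rank;

    /* Boundary exchange: long/short messages depending on the package. */
    MPI_Request req[4];
    MPI_Irecv(recv_l, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_r, 1024, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send,   1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send,   1024, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* The global reduction every solver iteration is the scaling-sensitive
       collective the slide calls out. */
    double local = recv_l[0] + recv_r[0], global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global = %g\n", global);
    MPI_Finalize();
    return 0;
}
```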

Slide 6: GPU hardware requirements

- 64b arithmetic predominates, but some 32b is acceptable
- Need better IEEE arithmetic
  - Better FP behavior, not full compliance
  - An exception-generation mechanism
- Large memory and access to node memory
  - Streaming access to node memory
  - Random-access and block-access modes
  - Reduced texture-memory restrictions
  - Efficient gather and scatter mechanisms
- Short vectors → need low overhead to start parallelism
  - Conditional execution is essential for vectorization of if-tests (see the sketch after this slide)
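A minimal CUDA sketch (an API that postdates this 2004 talk; all names invented) of two of the requirements above: an indexed gather from memory, and an if-test written as a per-element conditional select that the hardware can execute without a serializing branch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gather_clamp(const double *table, const int *idx,
                             double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double v = table[idx[i]];            /* gather: random-access read  */
    out[i] = (v < 0.0) ? 0.0 : v;        /* if-test as conditional move */
}

int main() {
    const int n = 1 << 20;
    double *table, *out;
    int *idx;
    cudaMallocManaged(&table, n * sizeof(double));
    cudaMallocManaged(&idx,   n * sizeof(int));
    cudaMallocManaged(&out,   n * sizeof(double));
    for (int i = 0; i < n; ++i) {
        table[i] = (i % 3) - 1.0;          /* some negative entries        */
        idx[i]   = (i * 2654435761u) % n;  /* pseudo-random gather indices */
    }
    gather_clamp<<<(n + 255) / 256, 256>>>(table, idx, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %g\n", out[0]);
    cudaFree(table); cudaFree(idx); cudaFree(out);
    return 0;
}
```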

Slide 7: GPU software requirements

- Languages
  - The closer to C and C++, the better
  - Porting to OpenGL is not an option
  - The challenge is to express data parallelism (streams) in portable C (see the sketch after this slide)
- The ability to debug is essential
- How to efficiently utilize multiple GPUs?
  - Multiple levels of parallelism (data parallel, multi-GPU, GPU-CPU, multi-CPU)
- Open source
  - Device drivers, compilers, debuggers, etc.
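For what "data parallelism in portable C" might look like, here is a minimal sketch in CUDA, which arrived after this talk and is one answer to the slide's challenge: each stream element becomes a thread, and the kernel body is ordinary scalar C.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* The stream operation: one element per thread, plain C in the body. */
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %g (expect 4)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```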

Slide 8: Runtime requirements

- Dynamically load programs into the GPU, with dynamically linked libraries
- Need an exception mechanism
- Ability to cleanly map node memory into GPU memory
  - Move data with portable constructs (see the sketch after this slide)
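A minimal sketch of "cleanly mapping node memory into GPU memory" as later CUDA runtimes expose it (this API postdates the talk): pin an existing host buffer and let the GPU address it in place through a device pointer. Names and sizes are illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void sum(const double *data, double *result, int n) {
    double s = 0.0;                      /* single thread, for brevity */
    for (int i = 0; i < n; ++i) s += data[i];
    *result = s;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 4096;
    double *host = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) host[i] = 1.0;

    /* Map the node-memory buffer so the GPU can address it directly. */
    cudaHostRegister(host, n * sizeof(double), cudaHostRegisterMapped);
    double *dev_view;
    cudaHostGetDevicePointer((void **)&dev_view, host, 0);

    double *result;
    cudaMallocManaged(&result, sizeof(double));
    sum<<<1, 1>>>(dev_view, result, n);
    cudaDeviceSynchronize();
    printf("sum = %g (expect %d)\n", *result, n);

    cudaHostUnregister(host);
    cudaFree(result);
    free(host);
    return 0;
}
```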

Slide 9: Possible approaches

- HPC market potential could be used to induce vendors to improve the environment
  - 1K-node clusters have 2-4K slots for GPUs…
  - A large market
- Libraries
  - Not widely used
  - A few key applications could benefit
- Key functions
  - Monte Carlo random-number generation
  - EOS evaluation utilizing "free" interpolation (see the sketch after this slide)
  - FEM element-by-element matrix operations
  - Secondary calculations (diagnostics, visualization)
- Work with early adopters
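One reading of "free" interpolation is the hardware linear interpolation that GPU texture units provide. A minimal sketch (using the CUDA texture-object API, which postdates this talk; the table contents are invented) of an EOS-style table lookup interpolated by the texture unit:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lookup(cudaTextureObject_t table, float x, float *out) {
    /* With cudaFilterModeLinear, tex1D interpolates between neighboring
       entries in hardware; coordinates are texel-centered (0.5 offset). */
    *out = tex1D<float>(table, x);
}

int main() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;  /* stand-in EOS table */

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, n);
    cudaMemcpy2DToArray(arr, 0, 0, host, n * sizeof(float),
                        n * sizeof(float), 1, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModeLinear;
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t table = 0;
    cudaCreateTextureObject(&table, &res, &tex, NULL);

    float *out;
    cudaMallocManaged(&out, sizeof(float));
    lookup<<<1, 1>>>(table, 10.75f, out);   /* samples between entries  */
    cudaDeviceSynchronize();
    printf("table(10.75) = %g\n", *out);    /* 10.25, texel-centered    */

    cudaDestroyTextureObject(table);
    cudaFreeArray(arr);
    cudaFree(out);
    return 0;
}
```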

Slide 10: Conclusions

- Large scientific simulations have enormous computational requirements
- GPUs offer unique capabilities and are becoming more usable
- Widespread adoption awaits more general-purpose usability

Example runs:
- Langer: 6.8 TZone laser-plasma interaction, … GiB, 10 days for 35 picoseconds, 14 TB of vis data
- Woodward: 8 BZone TurbHydro, 2K × 1.5 GiB, 25 days, 25 TB of vis data