CLUSTERS & GPUS FOR LATTICE QUANTUM CHROMODYNAMICS
SciDAC LQCD Software
Fermi National Accelerator Laboratory & Thomas Jefferson National Accelerator Facility


SciDAC LQCD Software

The Department of Energy (DOE) Office of Science provided support for the development of Lattice QCD software under the Scientific Discovery through Advanced Computing programs (SciDAC and SciDAC-2), and will provide support in the future under SciDAC-3 that will be crucial to exploiting exascale resources. USQCD physicists rely on these libraries and applications to perform calculations with high performance on all US DOE and NSF supercomputing hardware, including:
- Infiniband clusters
- IBM BlueGene supercomputers
- Cray supercomputers
- Heterogeneous systems such as Graphics Processing Unit (GPU) clusters
- Purpose-built systems such as the QCDOC

Accelerating LQCD with GPUs

Since 2005, physicists have exploited GPUs for Lattice QCD, first using OpenGL with Cg and more recently using CUDA. The codes achieve performance gains of 5x to 20x over the host CPUs. Communication-avoiding algorithms (e.g. domain decomposition) and novel techniques such as mixed precision (half, single, and double precision) and compressed data structures have been key to maximizing performance gains. Large problems require multiple GPUs operating in parallel, so GPU-accelerated clusters must be carefully designed to provide sufficient memory and network bandwidth.

USQCD Facilities

Fermilab, together with Jefferson Lab and Brookhaven Lab, operates dedicated facilities for USQCD, the national collaboration of lattice theorists, as part of the DOE Office of Science LQCD-ext Project.
- The Fermilab "Ds" cluster uses QDR Infiniband. An earlier version with 7680 cores was #218 on the November 2010 Top500 list. It delivers 21.5 TFlop/sec sustained for LQCD calculations and was commissioned in December.
- The Fermilab "J/Psi" cluster has 6848 cores, uses DDR Infiniband, and was #111 on the June 2009 Top500 list. It delivers 8.4 TFlop/sec sustained for LQCD calculations and was commissioned in January.
- Fermilab has just deployed a GPU cluster with 152 NVIDIA M2050 GPUs in 76 hosts, coupled by QDR Infiniband. It was designed to optimize strong scaling and will be used for large GPU-count problems.
- Jefferson Lab, under a 2009 ARRA grant from the DOE Nuclear Physics Office, deployed and operates GPU clusters for USQCD with over 500 GPUs. These are used for small GPU-count problems.

Computing Requirements

Lattice Quantum Chromodynamics (LQCD) is a formulation of the theory of the strong nuclear force amenable to computer calculation. LQCD codes spend much of their execution time solving linear systems with very large, very sparse matrices. For example, a 48x48x48x144 problem, typical of current simulations, has a complex matrix of size 47.8 million x 47.8 million with 1.15 billion non-zero elements (about one in every 2 million). Iterative techniques such as conjugate gradient are used to perform these inversions. Nearly all of the floating-point operations are matrix-vector multiplications (3x3 and 3x1 complex): the matrices describe gluons, and the vectors describe quarks. Memory bandwidth limits the speed of the calculation on a single computer.

Individual LQCD calculations require many TFlop/sec-years of computation and can only be carried out on large-scale parallel machines. The 4-dimensional lattice is divided across hundreds to many thousands of cores, and on each iteration of the inverter every core exchanges data on the faces of its 4D sub-volume with its nearest neighbors. The codes employ MPI or other message-passing libraries for these communications; networks such as Infiniband provide the required high bandwidth and low latency.
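To make the communication pattern concrete, the following is a minimal MPI sketch (in C++) of the face exchange described above: each rank owns a 4D sub-volume of the lattice and swaps both faces of every dimension with its nearest neighbors on each solver iteration. It is an illustration only, not code from the USQCD libraries; the grid layout, the buffer size, and the blocking MPI_Sendrecv pattern are simplifying assumptions (production codes pack spinor data into the face buffers and overlap these transfers with computation).

```cpp
// Minimal sketch (not USQCD production code) of the nearest-neighbor face
// exchange: each rank owns a 4D sub-volume and swaps the two faces of every
// dimension with its neighbors on each solver iteration.
// The face size and MPI_DOUBLE payload are illustrative placeholders.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Lay the ranks out on a periodic 4D grid (t, z, y, x).
    int nranks = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    int dims[4] = {0, 0, 0, 0};
    MPI_Dims_create(nranks, 4, dims);
    int periods[4] = {1, 1, 1, 1};            // the lattice is periodic
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, /*reorder=*/1, &grid);

    // Illustrative face size: number of doubles on one 3D face of the local
    // sub-volume (in a real code this depends on the local volume and the
    // number of spin/color components per site).
    const int face_doubles = 1 << 16;
    std::vector<double> send_lo(face_doubles), recv_lo(face_doubles);
    std::vector<double> send_hi(face_doubles), recv_hi(face_doubles);

    // One inverter iteration exchanges both faces in every dimension.
    for (int dim = 0; dim < 4; ++dim) {
        int lo = MPI_PROC_NULL, hi = MPI_PROC_NULL;
        MPI_Cart_shift(grid, dim, 1, &lo, &hi);

        // Send the low face "down" while receiving the high neighbor's face,
        // then the reverse; MPI_Sendrecv avoids deadlock on the periodic grid.
        MPI_Sendrecv(send_lo.data(), face_doubles, MPI_DOUBLE, lo, 0,
                     recv_hi.data(), face_doubles, MPI_DOUBLE, hi, 0,
                     grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_hi.data(), face_doubles, MPI_DOUBLE, hi, 1,
                     recv_lo.data(), face_doubles, MPI_DOUBLE, lo, 1,
                     grid, MPI_STATUS_IGNORE);
    }

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```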
Figure captions (from the poster):
- Weak-scaling performance data showing the benefits of mixed-precision algorithms.
- Strong-scaling performance data showing the benefits of a domain-decomposition algorithm.
- Visualization of topological charge in the vacuum.
- Layered architecture for USQCD software.

USQCD has partnerships with IBM (BlueGene/Q), NVIDIA, and Intel. In the layered-architecture diagram, QUDA is an LQCD-specific acceleration library developed in close collaboration with NVIDIA.
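The mixed-precision technique referred to above (and implemented far more elaborately in QUDA) can be sketched as a defect-correction, or iterative-refinement, loop: an inner solver runs in a lower "sloppy" precision, while the true residual is recomputed and the solution accumulated in double precision. The C++ sketch below applies the idea to a small dense symmetric positive-definite system; the matrix, iteration counts, and tolerances are invented for the example, and a real LQCD solver would instead use the sparse lattice Dirac operator, half precision, and reliable updates on the GPU.

```cpp
// Minimal sketch of mixed-precision defect correction: the inner Krylov
// solver runs in single precision, while the true residual and the
// accumulated solution are kept in double precision.  Illustration only,
// on a tiny dense SPD system; not the QUDA implementation.
#include <cstdio>
#include <vector>
#include <cmath>

using std::vector;

// y = A*x for a dense matrix stored row-major (templated so the same
// operator can be applied in float or double).
template <typename T>
void apply(const vector<T>& A, const vector<T>& x, vector<T>& y) {
    const size_t n = x.size();
    for (size_t i = 0; i < n; ++i) {
        T s = 0;
        for (size_t j = 0; j < n; ++j) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

// A few unpreconditioned CG iterations in precision T, approximately
// solving A d = r (the "sloppy" inner solve).
template <typename T>
void inner_cg(const vector<T>& A, const vector<T>& r, vector<T>& d, int iters) {
    const size_t n = r.size();
    d.assign(n, 0);
    vector<T> res = r, p = r, Ap(n);
    T rr = 0;
    for (size_t i = 0; i < n; ++i) rr += res[i] * res[i];
    for (int k = 0; k < iters && rr > 0; ++k) {
        apply(A, p, Ap);
        T pAp = 0;
        for (size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        T alpha = rr / pAp;
        T rr_new = 0;
        for (size_t i = 0; i < n; ++i) {
            d[i]   += alpha * p[i];
            res[i] -= alpha * Ap[i];
            rr_new += res[i] * res[i];
        }
        for (size_t i = 0; i < n; ++i) p[i] = res[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
}

int main() {
    // Tiny diagonally dominant (hence SPD) test matrix and right-hand side.
    const size_t n = 64;
    vector<double> A(n * n), b(n, 1.0), x(n, 0.0), r(n), Ax(n);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            A[i * n + j] = (i == j) ? 4.0
                                    : 1.0 / (1.0 + std::abs((double)i - (double)j) * n);
    vector<float> Af(A.begin(), A.end());   // single-precision copy of A

    // Outer defect-correction loop in double precision.
    for (int outer = 0; outer < 20; ++outer) {
        apply(A, x, Ax);
        double rnorm = 0;
        for (size_t i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; rnorm += r[i] * r[i]; }
        rnorm = std::sqrt(rnorm);
        std::printf("outer %2d  |r| = %.3e\n", outer, rnorm);
        if (rnorm < 1e-10) break;

        // Solve A d = r approximately in single precision, then accumulate
        // the correction into the double-precision solution.
        vector<float> rf(r.begin(), r.end()), df;
        inner_cg(Af, rf, df, /*iters=*/25);
        for (size_t i = 0; i < n; ++i) x[i] += (double)df[i];
    }
    return 0;
}
```

The appeal of this structure on GPUs is that most of the floating-point work and memory traffic occurs at the lower precision, which halves (or quarters) the bandwidth per number moved, while the double-precision outer loop restores full accuracy; this matters because, as noted above, memory bandwidth limits the speed of the calculation on a single device.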