A Tool for Chemical Kinetics Simulation on Accelerated Architectures

Presentation transcript:

A Tool for Chemical Kinetics Simulation on Accelerated Architectures John C. Linford, ParaTools, Inc. NASA Booth, 17 November 2014, SC'14

Chemical Kinetic Simulation: combustion, air and water quality, climate change, wildfires, volcanic eruptions, plastics devolatilization, microorganism growth, cell biology. SC'14, Copyright © ParaTools, Inc.

70% of GEOS-Chem Runtime is Chemistry (eight threads, Intel Core i7-4820K CPU @ 3.70 GHz)

Kppa: The Kinetic PreProcessor Accelerated. A domain-specific language with a lexical parser, analysis, and code generation stages; it generates optimized serial, multi-core, GPGPU, and Intel MIC Architecture code in Fortran, C, Python, or CUDA.

Sparse Matrix/Vector Code Generation
Generates C, C++, CUDA, Fortran, or Python.

Template (excerpt):

  <% d_Decomp.begin()
     piv = lang.Variable('piv', REAL, 'Row element divided by diagonal')
     piv.declare() %>
  ${size_t} idx = blockDim.x*blockIdx.x+threadIdx.x;
  if(idx < ${ncells32}) {
    ${A} += idx;
    <% lang.upindent()
       for i in xrange(1, nvar):
           for j in xrange(crow[i], diag[i]):
               c = icol[j]
               d = diag[c]
               piv.assign(A[j*ncells32] / A[d*ncells32])
               A[j*ncells32].assign(piv)
       ... %>
  }
  <% d_Decomp.end() %>

Generated code (excerpt):

  A[334] = t1;
  A[337] = -A[272]*t1 + A[337];
  A[338] = -A[273]*t1 + A[338];
  A[339] = -A[274]*t1 + A[339];
  A[340] = -A[275]*t1 + A[340];
  t2 = A[335]/A[329];
  A[335] = t2;
  A[337] = -A[330]*t2 + A[337];
  A[338] = -A[331]*t2 + A[338];
  A[339] = -A[332]*t2 + A[339];
  A[340] = -A[333]*t2 + A[340];
  t3 = A[341]/A[59];
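The shape of the generated statements above (a pivot temporary, then unrolled multiply-add updates with literal indices) can be illustrated with a much simpler generator. The sketch below is hypothetical and not Kppa's actual template engine: it emits unrolled C statements for an in-place LU decomposition (Doolittle, no pivoting) of a dense n-by-n matrix in a flat array, where the slide's real generator exploits sparsity via the crow/diag/icol structure.

```python
def gen_lu(n):
    """Emit unrolled C statements for an in-place LU decomposition
    (Doolittle, no pivoting) of a dense n x n matrix stored in a flat
    array A. Hypothetical illustration, not Kppa's generator."""
    lines = []
    t = 0  # counter for pivot temporaries t1, t2, ...
    for k in range(n):             # eliminate column k
        for i in range(k + 1, n):  # for each row below the pivot row
            t += 1
            # pivot = A[i][k] / A[k][k], stored back in place
            lines.append(f"t{t} = A[{i*n+k}]/A[{k*n+k}];")
            lines.append(f"A[{i*n+k}] = t{t};")
            for j in range(k + 1, n):  # update the rest of row i
                lines.append(f"A[{i*n+j}] = -A[{k*n+j}]*t{t} + A[{i*n+j}];")
    return lines

print("\n".join(gen_lu(2)))
```

Because every index is a compile-time literal, the emitted code has no loop overhead or indirect addressing, which is what makes it amenable to vectorization and massive threading on GPUs and Intel MIC.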

Simplify During Code Generation

Before (C, 0-based indexing):
  X[27] = ((A[43]*A[43]*A[43]) + (A[43]*A[43]) - (A[43]) - (5-4)) / (A[43]*A[43] + (8-6)*A[43] + 1)

After (Fortran, 1-based indexing):
  X(28) = A(44) - 1
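Part of this simplification, folding constant subexpressions such as (8-6) before emitting source, can be sketched with Python's standard ast module. Full algebraic cancellation like the rational expression above requires a computer-algebra system; this minimal, illustrative pass only folds literal arithmetic (and assumes Python 3.9+ for ast.unparse):

```python
import ast
import operator

# Map AST operator nodes to the arithmetic they denote.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

class FoldConstants(ast.NodeTransformer):
    """Replace a binary operation on two literals with a single literal."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

def fold(expr):
    """Parse an expression string, fold its constants, and unparse it."""
    tree = FoldConstants().visit(ast.parse(expr, mode="eval"))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

print(fold("(8-6)*x + (5-4)"))  # -> 2 * x + 1
```

Running such passes at generation time means the target compiler never sees the redundant operations, so the savings hold across all generated languages.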

Fewer Operations, Faster Runtimes

Performance Results
Baseline: hand-tuned KPP-generated code
 - Use KPP to generate a serial code
 - A skilled programmer parallelizes the code
Comparison: unmodified Kppa-generated code
 - Same input files as KPP
 - No modification to the source
Results (on identical hardware):
 - Xeon Phi: 140% faster
 - NVIDIA K40: 160% faster
 - Xeon with OpenMP: 60% faster

Performance Results

Chemical Kernel Runtime

SBIR Project Status
NASA SBIR Phase I completed successfully. Phase II will implement:
 - Advanced reaction rate functions
 - Knights Landing support
 - FPGA support
 - Auto-tuning of generated source for hardware parameters
 - User interface and prototyping environment

Downloads, tutorials, resources: http://www.paratools.com/kppa