Presentation transcript:

155 South 1452 East, Room 380, Salt Lake City, Utah 84112, 1-801-585-1233. This research was sponsored by the National Nuclear Security Administration under the Accelerating Development of Retrofitable CO2 Capture Technologies through Predictivity program, through DOE Cooperative Agreement DE-NA. Todd Harman, Department of Mechanical Engineering; Jeremy Thornock, Department of Chemical Engineering; Isaac Hunsaker, Graduate Student, Department of Chemical Engineering.

Year 2: Demonstration of a fully-coupled problem using RMCRT within ARCHES. Scalability demonstration.

CFD: always on the finest level. RMCRT options: 1 level: same grid as the CFD; 2 levels: rays traced on the coarsest level; "Data Onion": finest level inside a Region of Interest (ROI), coarser levels elsewhere. Research Topic: choosing the Region of Interest (ROI).

2 Levels

3 Levels

Implemented. Research Topic: where should the ROI be located? Static option: a user-defined region.

Implemented. Research Topic: where should the ROI be located? Dynamic option: the ROI is recomputed every timestep from the emitted flux abskg * sigma * T^4; should the ROI be proportional to the size of the fine-level patches?
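As a rough illustration of the dynamic option, the sketch below flags cells whose emitted flux abskg * sigma * T^4 exceeds a threshold and takes the bounding box of the flagged cells as the ROI. This is a minimal host-side sketch, not Uintah code; the field names, threshold, and bounding-box policy are assumptions.

#include <cmath>
#include <vector>
#include <algorithm>

// Minimal sketch (not Uintah code): choose a dynamic ROI as the bounding box of
// cells whose emitted flux abskg * sigma * T^4 exceeds a user-chosen threshold.
struct Box { int lo[3], hi[3]; };

Box computeROI(const std::vector<double>& abskg,   // absorption coefficient per cell
               const std::vector<double>& temp,    // temperature per cell [K]
               const int n[3],                     // cells per direction
               double threshold)                   // flux threshold [W/m^2]
{
  const double sigma = 5.670374419e-8;             // Stefan-Boltzmann constant
  Box roi = {{n[0], n[1], n[2]}, {-1, -1, -1}};    // starts as an empty box
  for (int k = 0; k < n[2]; ++k)
    for (int j = 0; j < n[1]; ++j)
      for (int i = 0; i < n[0]; ++i) {
        const size_t c = (size_t(k) * n[1] + j) * n[0] + i;
        const double emittedFlux = abskg[c] * sigma * std::pow(temp[c], 4);
        if (emittedFlux > threshold) {             // flag this cell and grow the box
          const int idx[3] = {i, j, k};
          for (int d = 0; d < 3; ++d) {
            roi.lo[d] = std::min(roi.lo[d], idx[d]);
            roi.hi[d] = std::max(roi.hi[d], idx[d]);
          }
        }
      }
  return roi;  // recomputed every timestep in the dynamic variant
}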

80% complete: Data Onion with dynamic & static regions of interest; in the testing phase, benchmarks needed. 90% complete: integration of RMCRT tasks within ARCHES (2 levels).

Single-level verification: order of accuracy with respect to the number of rays (older results) and grid resolution; scalability studies with the new mixed scheduler. Two-level verification: errors associated with coarsening.

Benchmark: S. P. Burns and M. A. Christon, "Spatial domain-based parallelism in large-scale, participating-media, radiative transport applications," Numerical Heat Transfer, Part B, 31(4), 1997. Initial conditions: uniform temperature field; analytical function for the absorption coefficient.
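For reference, a frequently quoted form of the Burns and Christon benchmark's analytical absorption coefficient, on a unit cube centered at the origin, is sketched below. The exact expression and constants here are an assumption and should be verified against the paper before use.

#include <cmath>

// Commonly quoted analytical absorption coefficient for the Burns & Christon
// benchmark on a unit cube centered at the origin -- treat as an assumption
// and verify against Burns & Christon (1997) before relying on it.
double burnsChristonAbskg(double x, double y, double z)
{
  return 0.9 * (1.0 - 2.0 * std::fabs(x))
             * (1.0 - 2.0 * std::fabs(y))
             * (1.0 - 2.0 * std::fabs(z)) + 0.1;
}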

S. P. Burns and M.A Christon. Spatial domain-based parallelism in large-scale, participating-media, radiative transport applications. Numerical Heat Transfer, Part B, 31(4): , 1997.

4X error from coarsening abskg

Coarsening via a smoothing filter (plots of error and abskg).
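A minimal sketch of the kind of restriction step involved: each coarse cell's abskg is taken as the average of the fine cells it covers, which acts as a box (smoothing) filter. The array layout and refinement-ratio handling are assumptions, not Uintah's actual coarsening code.

#include <vector>
#include <cstddef>

// Minimal sketch (assumed layout, not Uintah's coarsening code): restrict a
// fine-level abskg field to a coarse level by averaging each block of
// rr x rr x rr fine cells -- effectively a box smoothing filter.
void coarsenAbskg(const std::vector<double>& fine, const int nFine[3],
                  std::vector<double>& coarse, int rr /* refinement ratio */)
{
  const int nc[3] = {nFine[0] / rr, nFine[1] / rr, nFine[2] / rr};
  coarse.assign(size_t(nc[0]) * nc[1] * nc[2], 0.0);

  auto fIdx = [&](int i, int j, int k) { return (size_t(k) * nFine[1] + j) * nFine[0] + i; };
  auto cIdx = [&](int i, int j, int k) { return (size_t(k) * nc[1] + j) * nc[0] + i; };

  const double w = 1.0 / (rr * rr * rr);           // equal weights = box filter
  for (int K = 0; K < nc[2]; ++K)
    for (int J = 0; J < nc[1]; ++J)
      for (int I = 0; I < nc[0]; ++I) {
        double sum = 0.0;
        for (int k = 0; k < rr; ++k)               // average the covered fine cells
          for (int j = 0; j < rr; ++j)
            for (int i = 0; i < rr; ++i)
              sum += fine[fIdx(I * rr + i, J * rr + j, K * rr + k)];
        coarse[cIdx(I, J, K)] = w * sum;
      }
}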

Leverage the work of Dr. Berzins' team: Hybrid MPI-threaded Task Scheduler (Qingyu Meng); GPU-RMCRT (Alan Humphrey).

Hybrid MPI-threaded Task Scheduler*: memory reduction from 13.5 GB to 1 GB per node (12 cores/node), for a 2-material CFD problem on Jaguar. Interconnect drivers and MPI software must be thread-safe. RMCRT requires an MPI environment-variable expert! *Q. Meng, M. Berzins, and J. Schmidt, "Using hybrid parallelism to improve memory use in Uintah," in Proceedings of TeraGrid 2011.

Kraken, 100 rays per cell (figure).

Difficult to run on Kraken; crashes inside MVAPICH. Further testing is needed on larger machines.

Motivation: utilize all available hardware. Uintah's asynchronous task-based approach is well suited to taking advantage of GPUs, and RMCRT is ideal for GPUs. Target systems: Keeneland Initial Delivery System (360 GPUs) and DOE Titan (1000s of GPUs). Node hardware: multi-core CPU + NVIDIA Tesla M2070/M2090 GPU.

Offload ray tracing and RNG to the GPU(s), so available CPU cores can perform other computation. The Uintah infrastructure supports GPU task scheduling and execution: it can access multiple GPUs on a node, uses NVIDIA CUDA C/C++, and uses the NVIDIA cuRAND library for GPU-accelerated random number generation (RNG).
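The sketch below shows the general pattern of GPU-side RNG with cuRAND for sampling isotropic ray directions: one curandState per thread, initialized once and then used to draw uniform numbers. It is a generic CUDA illustration, not the actual Uintah GPU-RMCRT kernel; the kernel name and launch shape are assumptions.

#include <curand_kernel.h>
#include <math.h>

// Generic CUDA/cuRAND pattern (not the actual Uintah GPU-RMCRT kernel):
// each thread owns a curandState and samples one isotropic ray direction.
__global__ void sampleRayDirections(curandState* states, double3* dirs,
                                    unsigned long long seed, int nRays)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= nRays) return;

  curand_init(seed, tid, 0, &states[tid]);          // per-thread RNG stream

  double r1 = curand_uniform_double(&states[tid]);  // uniform in (0,1]
  double r2 = curand_uniform_double(&states[tid]);

  double cosTheta = 1.0 - 2.0 * r1;                 // uniform on the unit sphere
  double sinTheta = sqrt(fmax(0.0, 1.0 - cosTheta * cosTheta));
  double phi      = 2.0 * M_PI * r2;

  dirs[tid] = make_double3(sinTheta * cos(phi), sinTheta * sin(phi), cosTheta);
}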

Create and schedule CPU and GPU tasks; this enables Uintah to "pre-fetch" GPU data. The Uintah infrastructure manages queues of CUDA stream and event handles, device memory allocation, and host-device transfers, utilizing all available CPU cores and GPUs.
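The "pre-fetch" idea amounts to asynchronous host-to-device copies issued on a CUDA stream ahead of the kernel that needs the data, with an event marking when the copy has finished. The sketch below shows only that generic pattern; it is not Uintah's scheduler code, and the buffer and kernel names are placeholders.

#include <cuda_runtime.h>

// Placeholder kernel standing in for an RMCRT GPU task.
__global__ void rmcrtKernel(const double* d_in, double* d_out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_out[i] = d_in[i];                    // stand-in for the real ray-trace work
}

// Generic pre-fetch pattern: copy a task's input on a copy stream, record an
// event, and make the compute stream wait on that event before the kernel runs.
// (For true copy/compute overlap, h_in should be pinned via cudaMallocHost.)
void prefetchAndLaunch(const double* h_in, double* d_in, double* d_out, int n)
{
  cudaStream_t copyStream, computeStream;
  cudaEvent_t  copyDone;
  cudaStreamCreate(&copyStream);
  cudaStreamCreate(&computeStream);
  cudaEventCreate(&copyDone);

  cudaMemcpyAsync(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice, copyStream);
  cudaEventRecord(copyDone, copyStream);

  cudaStreamWaitEvent(computeStream, copyDone, 0);  // wait on the copy only, not the whole device
  rmcrtKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d_in, d_out, n);

  cudaStreamSynchronize(computeStream);             // for the sketch; real code overlaps more work here
  cudaEventDestroy(copyDone);
  cudaStreamDestroy(copyStream);
  cudaStreamDestroy(computeStream);
}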

Capability jobs run on: Keeneland Initial Delivery System (NICS), 1440 CPU cores & 360 GPUs simultaneously; Jaguar GPU partition (OLCF), CPU cores & 960 GPUs simultaneously. Development of a GPU RMCRT prototype is underway.

Head-to-head comparison of RMCRT with the Discrete Ordinates Method. Single level: accuracy versus computational cost. 2 levels: coarsening error for variable temperature and radiative properties. Data Onion: serial performance; accuracy versus number of levels, refinement ratio, and dynamic/static ROI. Scalability studies.

Order of accuracy: 0.5 in the number of rays, 1 in the number of grid cells. Accuracy issues are related to coarsening data. Cost = f(# rays, grid cells, communication, ...). Doubling the grid resolution gives roughly a 20X increase in cost. Good scalability characteristics. Year 2: demonstration of a fully-coupled problem using RMCRT within ARCHES; scalability demonstration.
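The ~20X figure is roughly consistent with a simple cost model: halving the cell size gives 8X more cells, and each ray traverses about 2X more cells, so per-timestep ray-tracing work grows by about 16X before communication overhead. The sketch below encodes that back-of-the-envelope model; the exponents and the form of the model are assumptions, not measured fits.

#include <cmath>
#include <cstdio>

// Back-of-the-envelope RMCRT cost model (assumed form, not a measured fit):
// cost ~ (#cells) * (rays per cell) * (cells traversed per ray),
// with #cells ~ N^3 and traversal length ~ N on an N^3 grid.
double relativeCost(double refineFactor /* e.g. 2 = double the resolution */,
                    double raysPerCellFactor)
{
  double cellFactor      = std::pow(refineFactor, 3);  // 8X cells when doubling
  double traversalFactor = refineFactor;               // ~2X cells per ray path
  return cellFactor * traversalFactor * raysPerCellFactor;
}

int main()
{
  // Doubling resolution at a fixed number of rays per cell: ~16X compute, plus
  // communication, broadly consistent with the observed "20-ish X" increase.
  std::printf("relative cost: %.0fX\n", relativeCost(2.0, 1.0));
  return 0;
}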

Acknowledgements: DOE for funding the CSAFE project, DOE NETL, DOE NNSA, INCITE; NSF for funding via SDCI and PetaApps; the Keeneland Computing Facility, supported by NSF under Contract OCI; the Oak Ridge Leadership Computing Facility, DOE Jaguar XK6 system (GPU partition).

Isotropic scattering has been added to the model. Verification testing was performed using an exact solution (Siegel, 1987). A grid convergence analysis was performed; the discrepancy diminishes with increased mesh refinement.

Benchmark case of Siegel (1987): Siegel, R., "Transient Radiative Cooling of a Droplet-Filled Layer," ASME Journal of Heat Transfer, 109, 1987. Cube (1 m³), uniform temperature of 64.7 K, mirror surfaces on all sides, black top and bottom walls. Surface fluxes computed on the top & bottom walls; 10 rays per cell (low).

Radiative flux vs. optical thickness: RMCRT (dots) compared against the exact solution of Siegel (1987) (lines).

Grid convergence of the L1 error norms, where the scattering coefficient is 8 m⁻¹ and the absorption coefficient is 2 m⁻¹.
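For reference, this is the kind of calculation behind a grid-convergence plot like the one shown: an L1 error norm per grid and an observed order of accuracy from two successive refinements. A generic sketch with assumed array names; it is not the actual post-processing used for the figure.

#include <cmath>
#include <vector>
#include <cstddef>

// Generic grid-convergence helpers (assumed names, not the slide's post-processing).

// L1 error norm: mean absolute difference between computed and exact fluxes.
double l1ErrorNorm(const std::vector<double>& computed,
                   const std::vector<double>& exact)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < computed.size(); ++i)
    sum += std::fabs(computed[i] - exact[i]);
  return sum / static_cast<double>(computed.size());
}

// Observed order of accuracy from errors on a coarse and a fine grid, where
// refinementRatio = (coarse cell size) / (fine cell size), e.g. 2.
double observedOrder(double errorCoarse, double errorFine, double refinementRatio)
{
  return std::log(errorCoarse / errorFine) / std::log(refinementRatio);
}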

IFRF burner simulation (production-size run): 1344 processors/cores. Initial conditions taken from a previous run with DOM. Domain: 1 m × 4.11 m × 1 m; resolution: 4.4 mm × 8.8 mm × 4.4 mm; 24 million cells.