Performance Metrics
Inspired by P. Kent at the BES Workshop
Run larger: Length, Spatial extent, #Atoms, Weak scaling
Run longer: Time steps, Optimizations, Strong scaling
Run more/different simulation methods, e.g. DFT or CC, LES or DNS
Run more initial conditions, e.g. molecule, boundaries, Ensembles
Run more algorithms: Convergence, systematic errors due to cutoffs, AMR, etc.
Delivering on Mission Needs is the most important metric
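For reference, the two scaling regimes invoked above have standard definitions (not spelled out on the slide):

\[
S_{\mathrm{strong}}(p) = \frac{T_1}{T_p} \ \text{with total problem size fixed},
\qquad
E_{\mathrm{weak}}(p) = \frac{T_1}{T_p} \ \text{with problem size per process fixed}.
\]

"Run larger" exercises weak scaling (ideal: E_weak stays near 1 as the problem and machine grow together); "run longer" at fixed problem size exercises strong scaling (ideal: S_strong ≈ p).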

Computational Challenges
Abstractions to reveal concurrency, tolerate latency, and tolerate failures
– Domain-specific, e.g. Global Arrays or the Super Instruction Architecture for chemistry (a sketch of this style follows below)
– Libraries or frameworks, e.g. the ACTS Collection
Rethink algorithms for a flop-rich, bandwidth-poor, higher-latency environment
– Incorporate extra physics where it can be added at high computational intensity
– The balance of control logic versus flops will change
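As a hedged illustration of the Global Arrays style of abstraction (revealing concurrency while hiding where data lives), the sketch below expresses the get/compute/accumulate pattern using MPI one-sided operations rather than the actual GA API; the block size and the update applied are invented for the example.

// Get/compute/accumulate over a distributed "global array", sketched
// with MPI one-sided communication (not the actual Global Arrays API).
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1024   // elements owned by each rank (illustrative)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Each rank exposes its block of the global array in a window.
    double *local = (double *)calloc(BLOCK, sizeof(double));
    MPI_Win win;
    MPI_Win_create(local, BLOCK * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double *buf = (double *)malloc(BLOCK * sizeof(double));
    int target = (rank + 1) % nranks;   // operate on a neighbor's block

    MPI_Win_fence(0, win);
    // "Get": fetch a remote block without the owner participating.
    MPI_Get(buf, BLOCK, MPI_DOUBLE, target, 0, BLOCK, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    for (int i = 0; i < BLOCK; i++)     // purely local compute phase
        buf[i] = 2.0 * buf[i] + 1.0;

    // "Accumulate": merge results back into the remote block.
    MPI_Accumulate(buf, BLOCK, MPI_DOUBLE, target, 0, BLOCK,
                   MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(local);
    free(buf);
    MPI_Finalize();
    return 0;
}

The point of the pattern is latency tolerance: the get, the local compute, and the accumulate are decoupled, so the runtime is free to overlap communication with work.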

Role of Accelerators
Accelerators (at least GPUs) are becoming pervasive
– Many computational groups are investigating them: chemistry, accelerator physics, climate
– From desktop GPUs through to Roadrunner
Conventional multicores may adopt features (pseudo-vector, etc.) that provide accelerator advantages without making the programming abstractions harder
Data parallelism is prevalent in simulation codes
– CUDA provides an abstraction to exploit it (see the kernel sketch below)
[Figure: performance of a 7-point stencil on an NVIDIA GTX 280 (double precision), comparing naïve CUDA on the host with naïve CUDA on the device]
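The slide's plot compared naïve host and device implementations of a 7-point stencil in double precision. As a hedged sketch of what such a naïve kernel looks like (the grid sizes NX/NY/NZ, the coefficients c0/c1, and the launch configuration are placeholders, not the values behind the measurement):

// Naive double-precision 7-point stencil: one thread per (i,j) column,
// every load straight from global memory (no shared-memory tiling).
#define NX 256
#define NY 256
#define NZ 256
#define IDX(i, j, k) ((i) + NX * ((j) + NY * (k)))

__global__ void stencil7(const double *in, double *out,
                         double c0, double c1) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= NX - 1 || j < 1 || j >= NY - 1) return;
    for (int k = 1; k < NZ - 1; k++) {   // march up the column
        out[IDX(i, j, k)] =
            c0 * in[IDX(i, j, k)] +
            c1 * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                  in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                  in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
    }
}

// Launch: 16x16 thread blocks tiling the x-y plane.
// dim3 block(16, 16);
// dim3 grid((NX + 15) / 16, (NY + 15) / 16);
// stencil7<<<grid, block>>>(d_in, d_out, -6.0, 1.0);

Even this naïve version exposes massive data parallelism; its performance is dominated by memory bandwidth, and shared-memory tiling is the usual next optimization step.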

Programming Models
For simulations using a large fraction of a system in the next 2-3 years, use of MPI+X will substantially increase (a minimal sketch follows below)
– Replicated data structures grow in importance as memory per core decreases
Auto-tuning to get the best from existing programming models
Existing models will adapt, e.g. MPI and fault tolerance, or OpenMP and data placement, or be incrementally replaced
New models from left field, e.g. map-reduce
[Figure: auto-tuning results for LBMHD on Opteron Socket F; legend: +SIMDization, +SW Prefetching, +Unrolling, +Vectorization, +Padding, Naïve+NUMA]
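A minimal sketch of the MPI+X pattern the slide refers to, with X = OpenMP; the reduction computed here is purely illustrative:

// MPI ranks across nodes, OpenMP threads within a node; only the
// master thread talks to MPI (MPI_THREAD_FUNNELED).
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              // illustrative local problem size
    double local = 0.0;
    // X: threads fill the cores within the rank.
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (double)(i + 1 + rank * n);

    // MPI: combine partial results across ranks.
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %g\n", global);

    MPI_Finalize();
    return 0;
}

Built with e.g. mpicxx -fopenmp. Fewer ranks with more threads is also what makes the slide's point about replicated data structures bite: read-mostly data replicated per rank rather than per core trades memory for reduced communication.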

System Attributes
System architecture trends indicate flop-rich, bandwidth-poorer, and latency-poorer sockets (a worked example follows below)
Intermediate storage between memory and disk, e.g. flash, will become commonplace
The data deluge will increase
– If we have exascale systems nationally, petascale desktops will increase the data-handling demands of even the least data-rich disciplines
– Higher demand for storage and networking
Power will be a major preoccupation
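To make "flop-rich, bandwidth-poorer" concrete, here is a worked example with invented round numbers (not figures from the workshop). A socket with 100 Gflop/s peak and 10 GB/s of memory bandwidth has machine balance

\[
B = \frac{100\ \mathrm{Gflop/s}}{10\ \mathrm{GB/s}} = 10\ \mathrm{flops/byte}.
\]

A kernel with an arithmetic intensity of 0.5 flops/byte (typical of double-precision stencils) can then sustain at most 10 GB/s × 0.5 flops/byte = 5 Gflop/s, i.e. 5% of peak, no matter how well it is tuned. This is the quantitative case for incorporating extra physics at high computational intensity, as noted under Computational Challenges.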