
B5: Exascale Hardware

Capability Requirements

Several different requirements:
– Exaflops/Exascale single application
– Ensembles of petaflop applications requiring exaflop-years
– Streaming/real-time
– I/O intensive (e.g., analysis, data mining)

Capacity computing is not considered here.

Exaflops are Possible

Extrapolation of the Top500 suggests 1 EF in 2019. DOE (through ASCI and the LCF) has contributed to staying on this trajectory.
– Continued investment may be required to stay on this trajectory
– History shows that federal investment has accelerated the top systems
– Usable (non-LINPACK) FLOPS may not be achieved without investment
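
As a rough illustration of that extrapolation, here is a minimal Python sketch (not part of the original slides; the starting point and growth factor are hedged assumptions based on the historical Top500 trend) that reproduces the ~2019 estimate:

    import math

    start_year = 2008        # assumption: the top system reached ~1 petaflops around 2008
    start_pflops = 1.0
    growth_per_year = 1.9    # assumption: roughly the historical Top500 growth rate

    target_pflops = 1000.0   # 1 exaflops
    years = math.log(target_pflops / start_pflops) / math.log(growth_per_year)
    print(f"~1 EF reached around {start_year + years:.0f}")   # prints ~2019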

Components of an Exascale System

It's not just FLOPS. Needed:
– Processors
– Interconnect
– Memory
– I/O (persistent storage)
– Connection to the outside world
– Balance among these

Constraints include:
– Power
– Cooling
– Reliability
– Adoption by applications, particularly legacy codes, including a familiar development environment
– Cost :)

Example Commodity Design

Notes on Commodity Design

Based on Jeff Vetter’s extrapolation of current technology
– Details in the ORNL presentation

Does not preserve the performance ratios (e.g., bytes/flop of interconnect bandwidth) commonly expected (see the example below)
– This is not new; e.g., PC memory/disk size ratios have changed significantly

Most (all?) exascale system designs will mandate some changes in those ratios
– R&D can either reduce the change in a ratio or reduce the impact of the change (e.g., new algorithms)
– E.g., more specialized systems may provide better cost/performance for specific application classes
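
To make the ratio discussion concrete, a minimal sketch (the node figures below are illustrative assumptions, not numbers from the extrapolation) computes the memory and interconnect bytes/flop of a hypothetical node:

    node_peak_flops = 10e12   # assumption: 10 TF/s peak per node
    mem_bandwidth = 1e12      # assumption: 1 TB/s memory bandwidth per node
    nic_bandwidth = 100e9     # assumption: 100 GB/s network injection bandwidth per node

    print("memory bytes/flop:      ", mem_bandwidth / node_peak_flops)   # 0.1
    print("interconnect bytes/flop:", nic_bandwidth / node_peak_flops)   # 0.01
    # The slide's point: these ratios are smaller than applications have
    # historically expected, and exascale designs will shift them further.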

Issues (concerns)

There are possible hazards:
– Interconnect performance: latency, bandwidth
– I/O: density, bandwidth, fault management
– Memory: cost, power (and latency and bandwidth)
– Power: 4M PS3s give 1 EF but would draw 1 GW (see the sketch below)
– Latency/bandwidth/faults/concurrency
– Software and algorithms: working around, or with, latency/bandwidth/faults/concurrency

Non-issue: getting the peak FLOPS.

All of these can (and must) benefit from research and development investment.
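
The power bullet is simple arithmetic; a hedged sketch (the per-console figures are rough assumptions about the PS3's Cell processor, not numbers from the slides):

    num_ps3 = 4_000_000
    sp_gflops_per_ps3 = 250.0   # assumption: ~250 single-precision GF/s per PS3
    watts_per_ps3 = 250.0       # assumption: ~250 W per PS3 under load

    total_ef = num_ps3 * sp_gflops_per_ps3 * 1e9 / 1e18
    total_gw = num_ps3 * watts_per_ps3 / 1e9
    print(f"{total_ef:.1f} EF single precision, {total_gw:.1f} GW")   # ~1.0 EF, ~1.0 GW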

Alternate Directions

Commodity
– GPGPU and STI Cell offer very high compute density relative to commodity CPUs
– Ex.: 4M PS3s = 1 EF (single precision)
– But: not all algorithms can effectively use these systems, and programming complexity is (currently) much greater
– Embedded processors (better FLOPS/Watt)

New architectures
– PIM, FPGA-centric, …

Not in this time frame
– Quantum, molecular, DNA, …

Suggestions

Multiple architectures are needed (there is no one right answer).

Approaches:
– Integrated solution (e.g., BlueGene)
– Component solution (e.g., Cray)
– Not general purpose (e.g., GPGPU, FPGA, GRAPE)

Promising Tech

Technology that can improve balance (ratios) in the system, cost, reliability, etc.:
– Optimizing the use of die space for the CPU (manycore, multicore, stream, vector, heterogeneous, variable-precision arithmetic, etc.)
– Optical networks (faster signaling, cheaper/denser connectors)
– Optics into/out of the processor
– 3-D chips, integrated memory/processor
– Faster development of customized processors
– Hardware-accelerated system verification (e.g., RAMP)
– NAND flash, MRAM, and other non-volatile memory (disk replacements)
– Myriad approaches to power efficiency

Cross-Cutting Issues

Better characterization of algorithm requirements with respect to system ratios.

New algorithms to match system ratios:
– Disk I/O / main memory
– Interconnect bandwidth / flops
– Etc.

New algorithms/software to detect and handle faults.

New approaches to algorithms/software for specialized/disruptive processor architectures
– E.g., good ways to move applications to GPGPUs, PIMs, or FPGAs

Need to accelerate applications and algorithms (especially new ones) to petaflops now, to prepare for exaflops.

Programming languages and environments (a small auto-tuning sketch follows this slide):
– PGAS, domain-specific languages, auto-tuners, hierarchical programming models (built on current models)
– Interaction with hardware (e.g., user-managed caches, remote atomic updates, etc.)
– Performance modeling and debugging
– Productivity, etc.
– System software, OS (e.g., memory management)
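
As one concrete example of the "auto-tuner" item above, a minimal sketch of an empirical tuning loop (the kernel and candidate block sizes are hypothetical; production auto-tuners search far larger parameter spaces):

    import time

    def blocked_sum(n, block):
        # Hypothetical tunable kernel: sums 0.5*j over [0, n) in blocks of size `block`.
        total = 0.0
        for start in range(0, n, block):
            for j in range(start, min(start + block, n)):
                total += 0.5 * j
        return total

    def autotune(n, candidate_blocks):
        # Time each candidate and keep the fastest (block_size, seconds) pair.
        best = None
        for block in candidate_blocks:
            t0 = time.perf_counter()
            blocked_sum(n, block)
            elapsed = time.perf_counter() - t0
            if best is None or elapsed < best[1]:
                best = (block, elapsed)
        return best

    print(autotune(1_000_000, [64, 256, 1024, 4096]))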

Sample Plan Components

Point studies for the future
– Like the petaflops point designs, but with more application/algorithm designer involvement and including the OS. Evaluate the time/cost to get applications running on the system. Ongoing process; contrast with a baseline.

Early simulation and modeling of systems, algorithms, and applications (see open source below), including hardware (e.g., RAMP), particularly with respect to promising technologies.

Evaluate special-purpose architectures and non-MPI programming models for application/algorithm classes (cheaper, faster, better).

Partnerships for disruptive technologies
– Need to understand timelines and costs
– Goal is to accelerate; not required for exaflops

Directed vendor partnerships
– QCDOC is a good example

Support application involvement from the beginning
– With respect to the point designs, with performance understanding
– Must encourage new applications to increase community size

Some principles
– Open source
– Support multiple prototypes (at suitable scale)
– Establish a framework to move from point studies to full systems through multiple stages