Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale
Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Laboratories (ABQ)
Keren Bergman: Columbia University
Bruce Jacob: U. Maryland
John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory
Gilbert Hendry: Sandia National Laboratory
Dan Quinlan, Chunhua Liao: Lawrence Livermore National Laboratory
Sudhakar Yalamanchili: Georgia Tech

Codesign Tools Recap: Architectural Simulation to Accelerate CoDesign
- ROSE Compiler (application analysis): enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.
- ACE (node-level emulation): rapid design synthesis and FPGA-accelerated emulation for rapid prototyping of cycle-accurate models of manycore node designs.
- SST Macro (system-level simulation): enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.
- SST Micro (software simulators): node-level simulation in software.

Funding context:
- CoDEx (CoDesign for Exascale): ASCR-funded simulation infrastructure project.
- SST (Structural Simulation Toolkit): NNSA-funded simulation tools (ASC Program).
- CAL (Computer Architecture Laboratory): joint Sandia/LBL.

Fidelity vs. Scope for Architectural Simulation Methods

ROSE Compiler: Full Program Understanding through Deep Source-Code Analysis

ExaSAT: Exascale Static Analysis Tool (Compiler-Automated Performance Model Extraction)
- Automatically predicts performance for many input codes and software optimizations
- Predicts performance under different architectural scenarios
- Much faster than hardware simulation and manual modeling
[Workflow diagram: combustion codes feed compiler analysis, which produces a dependency graph and, with user parameters, a performance model; the model plus machine parameters and candidate optimizations drive a performance-prediction spreadsheet.]
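
To make the kind of model ExaSAT emits concrete, here is a minimal sketch assuming a simple roofline-style formula; the structures, parameter names, and numbers below are illustrative assumptions, not ExaSAT's actual model.

```cpp
// Minimal sketch of a roofline-style static performance model, the kind of
// output ExaSAT produces. All structures and numbers are illustrative
// assumptions, not ExaSAT's actual model.
#include <algorithm>
#include <cstdio>

struct MachineParams {        // the slide's "machine parameters"
    double peak_gflops;       // peak floating-point rate (GFLOP/s)
    double mem_bw_gbs;        // sustained memory bandwidth (GB/s)
};

struct KernelCounts {         // extracted by compiler analysis of a loop nest
    double flops;             // floating-point operations executed
    double bytes;             // bytes moved to and from DRAM
};

// Predicted time is the larger of compute time and data-movement time.
double predict_seconds(const KernelCounts& k, const MachineParams& m) {
    double t_compute = k.flops / (m.peak_gflops * 1e9);
    double t_memory  = k.bytes / (m.mem_bw_gbs * 1e9);
    return std::max(t_compute, t_memory);
}

int main() {
    MachineParams node{1000.0, 100.0};    // hypothetical node
    KernelCounts  sweep{8.0e9, 64.0e9};   // hypothetical stencil sweep
    std::printf("predicted time: %.3f s\n", predict_seconds(sweep, node));
}
```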

SST/macro: Coarse-Grained Simulation
An application code, with minor modifications, runs against SST/macro's implementation of standard interfaces (e.g., MPI), which simulates its execution and communication.
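
A minimal sketch of what such a skeleton app can look like: the communication pattern is kept as real MPI calls, while the computation is replaced by a stub. The model_compute_cost helper is a hypothetical stand-in for whatever compute-cost model the skeleton uses; SST/macro supplies the MPI implementation the skeleton links against.

```cpp
// Sketch of a skeleton app: the real MPI communication pattern is kept,
// the physics is replaced by a stub. model_compute_cost() is a hypothetical
// stand-in; SST/macro's own MPI implementation is what the skeleton links
// against when simulated.
#include <mpi.h>
#include <cstdio>

static void model_compute_cost(double /*seconds*/) {
    // In the skeleton, the original loop nest is gone; a simulator charges
    // this estimated time to the virtual clock instead of executing code.
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 1.0, global = 0.0;
    for (int step = 0; step < 10; ++step) {
        model_compute_cost(1.5e-3);                    // stand-in for the physics
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,  // the communication we keep
                      MPI_SUM, MPI_COMM_WORLD);
    }
    if (rank == 0) std::printf("skeleton ran on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```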

SST/micro: Cycle-Accurate Framework
- Provides a general simulation framework for integrating models
- Simulation backend is parallel
- Many contributors involved
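
For intuition, a toy sketch of the discrete-event core such a framework is organized around; the class names are invented for illustration and are not SST's actual API.

```cpp
// Toy discrete-event core of the kind a framework like SST/micro is built
// around. Class names are invented for illustration, not SST's actual API.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Event {
    uint64_t time;                    // simulated cycle at which it fires
    std::function<void()> action;
};
struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

class Simulator {
    std::priority_queue<Event, std::vector<Event>, Later> queue_;
    uint64_t now_ = 0;
public:
    uint64_t now() const { return now_; }
    void schedule(uint64_t delay, std::function<void()> action) {
        queue_.push({now_ + delay, std::move(action)});
    }
    void run() {                      // pop and fire events in time order
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            now_ = e.time;
            e.action();
        }
    }
};

int main() {
    Simulator sim;
    // Two models interacting only through timestamped events: a core issues
    // a load, and the memory model replies 100 cycles later.
    sim.schedule(10, [&sim] {
        std::printf("[%llu] core: issue load\n", (unsigned long long)sim.now());
        sim.schedule(100, [&sim] {
            std::printf("[%llu] memory: data returned\n", (unsigned long long)sim.now());
        });
    });
    sim.run();
}
```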

Some Models Currently Integrated
- gem5: a well-known architectural simulator with models for processors, caches, buses, and network components.
- MacSim: models GPU/CPU cores or heterogeneous computing nodes, driven from x86 or PTX (CUDA) traces.
- IRIS: a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.
- PhoenixSim: models photonic networks.

Leveraging Embedded Design Automation for Design Space Exploration
This stuff is essential!

Embedded Design Automation (using FPGA emulation for rapid prototyping)
RAMP: FPGA-accelerated emulation of an ASIC design, or "tape out" to FPGA.

Data Movement Dominates (Sandia, Micron, Columbia, LBL)
Understand the potential of intelligent, stacked DRAM technology. Data movement is projected to account for over 75% of the power budget of an exascale platform. Work to reduce that via:
- Optical interconnects
- 3D stacking (logic + memory + optics)
- New memory protocols
Research questions (see the sketch below for one reading of the second question):
- What is the performance potential of stacked memory (power and speed)?
- How much intelligence should go into the logic layer: atomics, gather/scatter, checksums, a full processor-in-memory?
- What is the memory consistency model for intelligent DRAM?
- How do we program it if we embed more intelligence into DRAM?
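
One way to read the logic-layer question is to ask what the interface to such operations would even look like. The sketch below is purely hypothetical and implied by nothing on the slide; the "DRAM" is just a host-side vector, and only the interface is the point.

```cpp
// Hypothetical interface to an intelligent logic layer under a DRAM stack,
// illustrating the gather and atomics options from the slide. Entirely
// illustrative: no real product API is implied.
#include <cstddef>
#include <cstdio>
#include <vector>

class StackedMemory {
    std::vector<long long> dram_;
public:
    explicit StackedMemory(std::size_t words) : dram_(words, 0) {}

    // Gather executed inside the stack: only the requested words cross the
    // memory link, instead of a full row or cache line per element.
    std::vector<long long> gather(const std::vector<std::size_t>& indices) const {
        std::vector<long long> out;
        out.reserve(indices.size());
        for (std::size_t i : indices) out.push_back(dram_[i]);
        return out;
    }

    // In-memory atomic: read-modify-write completes without a CPU round trip.
    void atomic_add(std::size_t index, long long value) { dram_[index] += value; }
};

int main() {
    StackedMemory mem(1 << 20);
    mem.atomic_add(42, 7);
    std::vector<long long> vals = mem.gather({1, 42, 99});
    std::printf("gathered: %lld %lld %lld\n", vals[0], vals[1], vals[2]);
}
```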

The Cost of Moving Data

Locality Management is Key
What is the best combination of software and hardware mechanisms to maximize data-movement efficiency?
- Vertical locality management (temporal)
- Horizontal locality management (topological)
(Figure credit: Sun Microsystems)
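
A concrete example of vertical (temporal) locality management is loop blocking, sketched below; the tile size is an illustrative guess, and horizontal (topological) locality would instead concern where the blocks are placed across the chip or system.

```cpp
// Vertical (temporal) locality in practice: blocking a transpose-add so a
// TILE x TILE block of the source stays hot in cache while it is reused.
// TILE is an illustrative choice, not a tuned value.
#include <cstdio>
#include <vector>

constexpr int N = 1024, TILE = 64;

void add_transpose_blocked(const std::vector<double>& a, std::vector<double>& b) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; ++i)
                for (int j = jj; j < jj + TILE; ++j)
                    b[i * N + j] += a[j * N + i];   // column reads stay in the tile
}

int main() {
    std::vector<double> a(N * N, 1.0), b(N * N, 0.0);
    add_transpose_blocked(a, b);
    std::printf("b[0] = %.1f\n", b[0]);
}
```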

Why Study Chip Stacking (TSVs)?
Energy = (V² × C) × Overhead + E_comm
DRAM cells: efficient DRAM cells require < 1 pJ to access, yet current DRAM architectures are not power efficient; long distances mean high power. We pay for more than we get at every level:
- Cache: throw away 75-80% of each line fetched
- DRAM row: charge 1024 B for each 64 B access
- DIMM: charge 8-9 chips per access
- ~800 pJ/byte total
DRAM design is driven by packaging constraints: ~50% of DRAM chip cost is packaging, mainly the pins, and DIMMs use multiple chips with a few data pins each to achieve high bandwidth.
TSVs reduce costs, requiring orders of magnitude less energy:
- 250 fJ/bit for reading DRAM
- 5 fJ/bit for the TSV
- 250 fJ/bit for the memory controller
- ~0.5 pJ/bit total (compared to ~30 pJ/bit for a conventional DIMM)
- No need to access more data than needed
This enables lower capacitance (narrower interfaces), lower overhead (smarter interfaces), and in-memory computation. It requires changes to how we view the machine and the memory. The arithmetic is checked in the sketch below.
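
A small worked check of the slide's per-bit figures, taking the three TSV-path numbers at face value:

```cpp
// Worked check of the slide's per-bit energy figures for the TSV path.
#include <cstdio>

int main() {
    const double dram_read_fj = 250.0;   // reading the DRAM array
    const double tsv_fj       = 5.0;     // crossing the TSV
    const double memctrl_fj   = 250.0;   // memory controller
    const double per_bit_pj   = (dram_read_fj + tsv_fj + memctrl_fj) / 1000.0;

    const double dimm_per_bit_pj = 30.0; // conventional DIMM figure from the slide
    std::printf("stacked: %.3f pJ/bit (%.2f pJ/byte)\n", per_bit_pj, 8.0 * per_bit_pj);
    std::printf("DIMM:    %.1f pJ/bit, roughly %.0fx more energy per bit\n",
                dimm_per_bit_pj, dimm_per_bit_pj / per_bit_pj);
}
```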

Why Photonics? Photonics changes the rules for bandwidth-per-watt.
Electronics:
- Buffer, receive, and re-transmit at every router.
- Space parallelism: each bus lane is routed independently (P ∝ N_lanes).
- Off-chip bandwidth requires much more power than on-chip bandwidth.
Photonics:
- Modulate/receive the data stream once per communication event.
- Wavelength parallelism: a broadband switch routes the entire multi-wavelength stream.
- Off-chip bandwidth ≈ on-chip bandwidth for nearly the same power.

Why Optically-Connected Memory?
Traditional memory (CPU to HBDRAM over an electronic bus):
- Large pin-out and complex wiring
- Low bandwidth density
- Distance constrained by electrical limitations
- High power dissipation
- Will not scale to meet the power and bandwidth requirements of future high-performance computing systems
Optically-connected memory (CPU to HBDRAM over an optical link):
- All-optical link, no electronic bus to drive
- Bit-rate-transparent link
- High bandwidth density, fewer pins
- Distance immunity at computer scale
- Low power dissipation
- Enables scaling of high-performance computing through increased memory capacity and bandwidth

Mixed-Model Simulation: cycle-accurate and energy-accurate models
[Diagram: an SST/macro core (C++) runs skeleton apps (C, C++, Fortran) and MPI traces (DUMPI), connecting through workload and address translation to a NoC model (PhoenixSim), memory models (DRAMSim2, FLASHsim, NVRAM), and processor models (SST/micro and Tensilica, with translated kernels and SystemC); fault injection and checkpoint/restart sit alongside.]
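
A sketch of how such heterogeneous models can coexist in one simulation: each implements a common clocked-component interface and is advanced in lockstep. The interface below is invented for illustration and is not SST's actual component API.

```cpp
// Illustrative common interface for mixed-model simulation: processor, NoC,
// and memory models advance on a shared clock and interact only through
// messages behind this interface. This is not SST's real component API.
#include <cstdio>
#include <memory>
#include <vector>

struct Component {
    virtual ~Component() = default;
    virtual void tick(unsigned long long cycle) = 0;  // advance one cycle
};

struct ProcessorModel : Component {      // e.g., an SST/micro core or RTL model
    void tick(unsigned long long cycle) override {
        if (cycle == 0) std::printf("core: begin kernel\n");
    }
};

struct NocModel : Component {            // e.g., a wrapper around PhoenixSim
    void tick(unsigned long long) override { /* route flits one hop */ }
};

struct MemoryModel : Component {         // e.g., a wrapper around DRAMSim2
    void tick(unsigned long long) override { /* drain pending transactions */ }
};

int main() {
    std::vector<std::unique_ptr<Component>> parts;
    parts.push_back(std::make_unique<ProcessorModel>());
    parts.push_back(std::make_unique<NocModel>());
    parts.push_back(std::make_unique<MemoryModel>());
    for (unsigned long long cycle = 0; cycle < 1000; ++cycle)
        for (auto& c : parts) c->tick(cycle);         // lockstep co-simulation
}
```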

Simulator Infrastructure: Interconnects
Cycle-accurate and energy-accurate models, developed by Sandia and collaborators under the CoDEx project.

Simulator Infrastructure: Memory
Cycle-accurate and energy-accurate models, validated against Micron DRAM. An HMC model is coming this summer.
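
Memory models in this style are typically integrated through a cycle-driven, callback-based interface: the host simulator ticks the model every memory cycle and receives completions asynchronously. The sketch below is a self-contained stand-in for that pattern; the names and fixed latency are illustrative, not DRAMSim2's exact API.

```cpp
// Self-contained stand-in for a cycle-driven, callback-based DRAM model in
// the DRAMSim2 style: reads go in, completions come back some cycles later.
// Names and the fixed latency are illustrative, not DRAMSim2's exact API.
#include <cstdint>
#include <cstdio>
#include <deque>
#include <functional>

class DramModel {
    struct Pending { uint64_t done_cycle; uint64_t addr; };
    std::deque<Pending> inflight_;
    std::function<void(uint64_t addr, uint64_t cycle)> on_read_done_;
    uint64_t cycle_ = 0;
    static const uint64_t kLatency = 30;     // assumed queue + access latency
public:
    void register_read_callback(std::function<void(uint64_t, uint64_t)> cb) {
        on_read_done_ = std::move(cb);
    }
    void add_read(uint64_t addr) { inflight_.push_back({cycle_ + kLatency, addr}); }
    void update() {                          // host simulator calls this each cycle
        ++cycle_;
        while (!inflight_.empty() && inflight_.front().done_cycle <= cycle_) {
            on_read_done_(inflight_.front().addr, cycle_);
            inflight_.pop_front();
        }
    }
};

int main() {
    DramModel dram;
    dram.register_read_callback([](uint64_t addr, uint64_t cycle) {
        std::printf("read 0x%llx done at cycle %llu\n",
                    (unsigned long long)addr, (unsigned long long)cycle);
    });
    dram.add_read(0x1000);
    for (int i = 0; i < 64; ++i) dram.update();
}
```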

Simulator Infrastructure: Photonic Networks
Cycle-accurate and energy-accurate models: Columbia's PhoenixSim, rewritten in summer 2011, with the Orion-2 energy model, validated against Cornell test parts.

Simulator Infrastructure: Processor
Cycle-accurate and energy-accurate models: a full gate-level RTL model of the processor with a well-characterized energy model.