Explicit HW and SW Hierarchies: High-Level Abstractions for Giving the System What It Wants
Mattan Erez, The University of Texas at Austin
Salishan 2011

Power and reliability bound performance
– More and more components
– Per-component improvement too slow
[Figure: system power, from 1 KW up through 1 MW to 1 GW, plotted against performance at Tera-, Peta-, and Exa-scale]

What can we do?
Compute less and store less
– Use better algorithms
Specialize more
– But still innovate on algorithms
Waste less
– Minimize movement
– Dynamically rebalance hardware
Efficient resiliency for reliability
– Minimize redundancy
– Trade off inherent reliability and resiliency

Power is a zero-sum game
Trade off control, compute, storage, and communication:
– Dense algebra
– Large sparse data
– Building data structures

Hierarchy enables HW/SW co-tuning and co-design
Hierarchy as common abstraction for HW and SW
– Basic engineering
– Match abstractions
Portability to ensure progress
– Co-design cycle
Portability to ensure efficiency
– Co-tune for proportionality

Hardware hierarchy – locality
Communication and storage dominate energy
Closer and smaller == better
– Amortize cost of global operations
[Figure: approximate energy per operation at 28 nm across a 20 mm die – a 64-bit DP operation and a 256-bit access to an 8 kB SRAM cost tens of pJ (~26–50 pJ); 256-bit on-chip buses cost hundreds of pJ, up to ~1 nJ across the die; an efficient off-chip link ~500 pJ; a 256-bit DRAM Rd/Wr ~16 nJ]
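
As a back-of-the-envelope illustration of why "closer and smaller" wins, the sketch below plugs the slide's approximate energy numbers into a tiny model of energy per FLOP as a function of where operands come from and how often they are reused. The reuse factors and the 4-doubles-per-256-bit-access arithmetic are illustrative assumptions, not part of the talk.

#include <cstdio>

// Rough per-operation energies (picojoules) from the 28 nm figure above.
// Illustrative orders of magnitude only, not measured values.
const double PJ_DP_OP      = 26.0;     // one 64-bit DP arithmetic operation
const double PJ_SRAM_256B  = 50.0;     // 256-bit access to an 8 kB SRAM
const double PJ_ONCHIP_BUS = 1000.0;   // 256-bit bus across the 20 mm die
const double PJ_DRAM_256B  = 16000.0;  // 256-bit DRAM read/write

// Energy per double-precision FLOP when operands come from a given level
// and each fetched value is reused 'reuse' times before being refetched.
double pj_per_flop(double fetch_pj, double reuse) {
    const double doubles_per_fetch = 256.0 / 64.0;   // 4 doubles per 256-bit access
    return PJ_DP_OP + fetch_pj / (doubles_per_fetch * reuse);
}

int main() {
    const double reuse_factors[] = {1.0, 8.0, 64.0};
    for (double reuse : reuse_factors) {
        std::printf("reuse=%2.0f  from SRAM: %7.1f pJ/flop   across chip: %7.1f pJ/flop   from DRAM: %7.1f pJ/flop\n",
                    reuse,
                    pj_per_flop(PJ_SRAM_256B, reuse),
                    pj_per_flop(PJ_ONCHIP_BUS, reuse),
                    pj_per_flop(PJ_DRAM_256B, reuse));
    }
    // With no reuse, a DRAM-fed FLOP costs ~4,000 pJ against ~26 pJ for the
    // arithmetic itself; the locality hierarchy exists to close that gap.
    return 0;
}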

Locality hierarchy minimizes hardware
Efficiency/performance tradeoffs
– Efficiency goes up as BW goes down

Hardware hierarchy – control
Specialization is a form of hierarchy
– Amortize SW control decisions in HW
Sophisticated high-level control
– Dynamic rebalancing
Simple low-level control
– Minimize hardware waste
How far can we push this?

Hierarchical HW, hierarchical SW
Hierarchy is the least abstract common denominator
[Figure: matching machine and task hierarchies – a dual-core PC (main memory, L2 cache, L1 cache, ALUs); a 4-node cluster of PCs (aggregate cluster memory as a virtual level, per-node memory, L2, L1, ALUs); a cluster of dual Cell blades (aggregate cluster memory, per-blade main memory, local stores (LS), ALUs); and a system with a GPU (main memory, GPU memory, per-SM ALUs); shown alongside a matmul task tree in which a large matrix multiply of A, B, C decomposes into 256x256 matmul_L2 tasks, each of which decomposes into 32x32 matmul_L1 tasks]

Task hierarchies

task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  tunable int P, Q, R;
  mappar( int i=0 to M/P, int j=0 to N/R ) {
    mapseq( int k=0 to T/Q ) {
      matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
              B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
              C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
    }
  }
}

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Figure: variant call graph – matmul::inner calls either matmul::inner or matmul::leaf]
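
For readers unfamiliar with the Sequoia-style syntax above, here is a minimal plain C++ sketch of the same decomposition; the tile size, names, and single level of tiling are illustrative choices, not part of the talk. The inner task partitions the problem and recurses, and the leaf task runs the plain triple loop on a tile small enough to sit at the lowest level of the hierarchy.

#include <algorithm>

// C[M][N] += A[M][T] * B[T][N], all row-major with leading dimensions lda/ldb/ldc.
// LEAF is an illustrative tile size chosen to fit the lowest level of the hierarchy.
const int LEAF = 32;

// Leaf task: plain triple loop on a tile that fits in the closest memory.
void matmul_leaf(const float* A, const float* B, float* C,
                 int M, int N, int T, int lda, int ldb, int ldc) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < T; ++k)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

// Inner task: tile the problem and recurse. The i/j tile loops are independent
// (the slide's mappar); the k tile loop accumulates into the same C tile and
// must stay ordered (the slide's mapseq).
void matmul_inner(const float* A, const float* B, float* C,
                  int M, int N, int T, int lda, int ldb, int ldc) {
    if (M <= LEAF && N <= LEAF && T <= LEAF)
        return matmul_leaf(A, B, C, M, N, T, lda, ldb, ldc);

    const int P = std::min(M, LEAF), R = std::min(N, LEAF), Q = std::min(T, LEAF);
    for (int i = 0; i < M; i += P)            // parallelizable
        for (int j = 0; j < N; j += R)        // parallelizable
            for (int k = 0; k < T; k += Q)    // sequential (reduction into C)
                matmul_inner(A + i * lda + k, B + k * ldb + j, C + i * ldc + j,
                             std::min(P, M - i), std::min(R, N - j), std::min(Q, T - k),
                             lda, ldb, ldc);
}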

Task hierarchies (continued)
[Figure: the same matmul code, annotated – the calling task matmul::inner and the callee task matmul::leaf operate on arrays A, B, C located at different levels of the memory hierarchy (level X for the caller, level Y for the callee)]

Hierarchical software enables efficiency
Portability
– Hierarchy is the least abstract common denominator
– It's what systems want
Proportionality
– Co-tune hardware and software
– Path to true efficiency
Co-design cycles
– Maintain efficiency with new technology
How strict is the hierarchy?

Hierarchical software enables co-tuning
Locality profiles drive dynamic rebalancing

Proportional and efficient resiliency
Resiliency principles:
– Detect fault
– Correct erroneous data if possible
– Contain fault
– Repair/reconfigure
– Restore state and re-execute
Each step can be improved with co-tuning:
– Ignore certain faults (allow some errors)
– Detect at coarse granularity
– Contain where cheapest
– Re-map application instead of repairing/reconfiguring hardware
– Preserve and restore minimally and effectively

Hierarchical resiliency – containment domains
Containment domains enable proportionality
Match locality hierarchy with resiliency hierarchy
– Efficient state preservation and restoration
– Predictable (minimal) overhead
Hierarchy provides natural domains for managing faults (and rebalancing)
– Co-tune resiliency scheme in HW and SW
– Range of hardware error detection and correction mechanisms
– Mechanisms introduce minimal overhead when not in use

Containment Domains: a full-system approach to resiliency
Hierarchy provides natural domains for containing faults
Containment domains enable software-controlled resilience
– Preserve data on domain start
– Detect faults before domain commits
– Recover: restore data and re-execute when necessary
Arbitrary nesting
– Tasks
– Functions
– Loop iterations
– Instructions
Amenable to compiler analysis
Constructs for programmer tuning
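
A minimal, purely illustrative sketch of the preserve / detect / recover discipline listed above; the wrapper below and its names are hypothetical, not the actual containment-domains API. A real implementation would also escalate to the enclosing domain when local recovery fails, rather than simply throwing.

#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical containment-domain wrapper: preserve inputs on entry, run the
// body, run a user-supplied detector before "committing", and re-execute from
// the preserved state if a fault is detected.
template <typename State>
void run_contained(State& state,
                   const std::function<void(State&)>& body,
                   const std::function<bool(const State&)>& looks_correct,
                   int max_retries = 3) {
    const State preserved = state;               // preserve on domain start
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        body(state);                             // execute the domain body
        if (looks_correct(state))                // detect before commit
            return;                              // commit: fault-free, continue
        state = preserved;                       // recover: restore and re-execute
    }
    throw std::runtime_error("containment domain failed; escalate to parent");
}

int main() {
    std::vector<double> v(1000, 1.0);
    run_contained<std::vector<double>>(
        v,
        [](std::vector<double>& x) { for (double& e : x) e *= 2.0; },
        [](const std::vector<double>& x) { return x.front() == x.back(); }); // cheap, domain-specific check
    return 0;
}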

Tunable error protection
High AMTTI requires strong error protection
– Global redundancy overhead can be high
– Hardware mechanisms can help
– Can do even better with software control
Containment domains enable specialized protection
– Each domain can have a unique detection routine (may even be scenario specific)
– Redundancy can be added at any granularity
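
As one hedged example of a domain-specific detection routine (the solver setting and all names are illustrative assumptions, not from the talk): for a linear-solve domain, the detector can simply check the algebraic property the algorithm is supposed to establish, which is far cheaper than duplicating the whole computation.

#include <cmath>
#include <vector>

// Illustrative domain-specific detector for a linear-solve domain: instead of
// duplicated execution, check the residual ||Ax - b|| that the algorithm is
// supposed to have driven below 'tolerance'.
bool solve_looks_correct(const std::vector<std::vector<double>>& A,
                         const std::vector<double>& x,
                         const std::vector<double>& b,
                         double tolerance) {
    double norm2 = 0.0;
    for (size_t i = 0; i < A.size(); ++i) {
        double r = -b[i];
        for (size_t j = 0; j < A[i].size(); ++j)
            r += A[i][j] * x[j];                 // residual component (Ax - b)_i
        norm2 += r * r;
    }
    return std::sqrt(norm2) <= tolerance;        // accept the domain only if converged
}

A check like this could fill the detection slot of the containment-domain sketch above, or be replaced by duplicated execution where no cheap algorithmic invariant exists.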

State preservation and restoration
Match storage hierarchy
Utilize NV memory
Explicit software control
Trade off overheads:
– Storage, local and global bandwidth, recomputation, complexity and effort
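
A toy model of the trade-off named in the last bullet; it is a simplification for illustration only, not anything from the slides. Preserving often and into fast local storage costs bandwidth and storage continuously, while preserving rarely or into slow global storage costs more re-executed work when a fault does occur.

#include <cstdio>

// Expected fractional overhead of a preserve/re-execute scheme:
//   preserve_cost - time to preserve the domain's state (depends on where it goes)
//   interval      - useful work between preserves
//   fault_rate    - faults per unit time affecting this domain
// Expected re-executed work per fault is roughly half an interval plus one preserve.
double expected_overhead(double preserve_cost, double interval, double fault_rate) {
    const double preserve_fraction = preserve_cost / interval;
    const double reexec_fraction   = fault_rate * (interval / 2.0 + preserve_cost);
    return preserve_fraction + reexec_fraction;
}

int main() {
    const double fault_rate = 1.0 / 3600.0;            // one fault per hour (illustrative)
    const double intervals[] = {60.0, 600.0, 3600.0};  // seconds of work per preserve
    for (double interval : intervals) {
        std::printf("interval %6.0fs  fast local preserve (1s): %5.2f%%   slow global preserve (30s): %5.2f%%\n",
                    interval,
                    100.0 * expected_overhead(1.0, interval, fault_rate),
                    100.0 * expected_overhead(30.0, interval, fault_rate));
    }
    return 0;
}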

Faults and default behavior encompass current approaches
Soft memory errors
– Detect: hardware ECC
– Recover: retry; if that fails, restore and re-execute
Hard memory fault
– Detect: runtime liveness
– Recover: map out bad memory; if enough space remains, recover and re-execute, else escalate the failure
Soft arithmetic error
– Detect: user-selectable – duplicated execution (HW/SW), other HW techniques, or algorithm-specific assert
– Recover: retry; if that fails, restore and re-execute
Soft control errors
– Detect: user-selectable signatures or implicit exceptions
– Recover: restore and re-execute
Hard compute fault
– Detect: runtime liveness
– Recover: map out bad PE; if the application is OK without the resource or a spare is available, recover and re-execute, else escalate the failure
High-level unhandled faults
– Detect: runtime heartbeat
– Recover: escalate the failure

Containment domains example

// inner
void task SpMV(in matrix, in vec_i, out res_i) {
  forall(…) reduce(…)
    SpMV(matrix[…], vec_i[…], res_i[…]);
}
preserve { preserve_NV(matrix); }
restore_for_child { … }

// leaf
void task SpMV(…) {
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      contain { res_i[r] += data[c] * vec_i[cIdx[c]]; }
      check { fault (c > prevC); }
      prevC = c;
    }
}
preserve { preserve_NV(matrix); }
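
A minimal plain C++ sketch of what the leaf-level pseudocode above is doing; the CSR field names and the retry policy are illustrative assumptions, and where the pseudocode checks the column pointer, this sketch uses an index bounds check as the cheap per-row detector.

#include <stdexcept>
#include <vector>

// Sparse matrix in CSR form: row_start[r]..row_start[r+1] indexes data/col_idx.
struct CsrMatrix {
    std::vector<int>    row_start;
    std::vector<int>    col_idx;
    std::vector<double> data;
};

// Leaf SpMV with a per-row "containment domain": each row is computed into a
// local accumulator (contain), a cheap invariant on the indices is checked
// (detect), and only then is the result committed to res; on a detected fault
// the row is re-executed from the unmodified inputs (restore + re-execute).
void spmv_leaf(const CsrMatrix& m, const std::vector<double>& vec,
               std::vector<double>& res, int max_retries = 3) {
    const int rows = static_cast<int>(m.row_start.size()) - 1;
    for (int r = 0; r < rows; ++r) {
        for (int attempt = 0; ; ++attempt) {
            double acc = 0.0;
            bool ok = true;
            for (int c = m.row_start[r]; c < m.row_start[r + 1]; ++c) {
                const int j = m.col_idx[c];
                if (j < 0 || j >= static_cast<int>(vec.size())) { ok = false; break; } // detect corrupted index
                acc += m.data[c] * vec[j];
            }
            if (ok) { res[r] = acc; break; }    // commit this row's domain
            if (attempt == max_retries)
                throw std::runtime_error("SpMV row domain failed; escalate to parent");
        }
    }
}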

Summary
Hierarchy is a basic engineering approach
– Works for hardware and works for software
Hierarchy is inevitable
– Minimize movement
– Amortize control
Match explicit hierarchies in HW and SW
– Lowest abstract common denominator
Natural domains and boundaries enable:
– Co-design
– Co-tuning
– Dynamic rebalancing
– Resiliency