Explicit HW and SW Hierarchies High-Level Abstractions for giving the system what it wants Mattan Erez The University of Texas at Austin Salishan 2011.


1 Explicit HW and SW Hierarchies High-Level Abstractions for giving the system what it wants Mattan Erez The University of Texas at Austin Salishan 2011

2 Power and reliability bound performance
More and more components
Per-component improvement too slow
[Figure: power axis 1 kW, 10 kW, 100 kW, 1 MW, 10 MW, 100 MW, 1 GW; labels Tera, Peta, Exa]
(c) Mattan Erez, UT Austin

4 What can we do?
Compute less and store less
  – Use better algorithms
Specialize more
  – But still innovate on algorithms
Waste less
  – Minimize movement
  – Dynamically rebalance hardware
Efficient resiliency for reliability
  – Minimize redundancy
  – Trade off inherent reliability and resiliency

5 Power is a zero-sum game
Trade off control, compute, storage, comm.
  – Dense algebra
  – Large sparse data
  – Building data structures

7 Hierarchy enables HW/SW co-tuning and co-design
Hierarchy as common abstraction for HW and SW
  – Basic engineering
  – Match abstractions
Portability to ensure progress
  – Co-design cycle
Portability to ensure efficiency
  – Co-tune for proportionality

8 Hardware hierarchy – locality
Communication and storage dominate energy
Closer and smaller == better
  – Amortize cost of global operations
[Figure: energy costs. Annotations: 28nm, 20mm. Labels: 64-bit DP, 256-bit buses, 256-bit access 8 kB SRAM, efficient off-chip link, DRAM Rd/Wr. Values shown: 20 pJ, 26 pJ, 50 pJ, 256 pJ, 500 pJ, 1 nJ, 16 nJ]

9 Locality hierarchy minimizes hardware
Efficiency/performance tradeoffs
  – Efficiency goes up as BW goes down

10 Hardware hierarchy – control
Specialization is a form of hierarchy
  – Amortize SW control decisions in HW
Sophisticated high-level control
  – Dynamic rebalancing
Simple low-level control
  – Minimize hardware waste
How far can we push this?

11 Hierarchical HW, hierarchical SW
Hierarchy is least abstract common denominator
[Figure: memory hierarchies of four machines.
  Dual-core PC: main memory, L2 cache, L1 cache, ALUs.
  4 node cluster of PCs: aggregate cluster memory (virtual level), node memory, L2 cache, L1 cache, ALUs.
  Cluster of dual Cell blades: aggregate cluster memory (virtual level), main memory, LS.
  System with a GPU: main memory, GPU memory, SM, ALUs.]
[Figure: task tree. matmul (large matrix mult, on A, B, C) calls matmul_L2 (256x256 matrix mult), which calls matmul_L1 (32x32 matrix mult).]

12 Task hierarchies
    task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] ) {
      tunable int P, Q, R;
      // (i, j) blocks are independent and may run in parallel
      mappar( int i=0 to M/P, int j=0 to N/R ) {
        // k blocks accumulate into the same block of C, in order
        mapseq( int k=0 to T/Q ) {
          matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
                  B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
                  C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
        }
      }
    }
    task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] ) {
      for (int i=0; i …

13 Task hierarchies
[Figure: matrices A, B, and C, shown alongside the matmul::inner and matmul::leaf code from slide 12]
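
The inner/leaf split on slides 12 and 13 can be sketched in plain Python. This is a minimal illustration, not Sequoia: the function names, the index-range arguments, and the `tunables` list of per-level (P, Q, R) block sizes are assumptions made for this sketch, and the leaf body is an ordinary triple loop because the leaf code on the slide is truncated.

```python
def matmul_leaf(A, B, C, i0, i1, j0, j1, k0, k1):
    """Leaf task: plain triple loop on one block, accumulating into C."""
    for i in range(i0, i1):
        for j in range(j0, j1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def matmul_inner(A, B, C, i0, i1, j0, j1, k0, k1, tunables):
    """Inner task: split the block by this level's (P, Q, R) and recurse.

    `tunables` holds one (P, Q, R) tuple per hierarchy level, outermost
    first (a hypothetical stand-in for Sequoia's tunable values).
    An empty list means we have reached the leaf level.
    """
    if not tunables:
        matmul_leaf(A, B, C, i0, i1, j0, j1, k0, k1)
        return
    (P, Q, R), rest = tunables[0], tunables[1:]
    # The (i, j) iterations are independent (mappar on the slide).
    for i in range(i0, i1, P):
        for j in range(j0, j1, R):
            # The k iterations update the same C block in order (mapseq).
            for k in range(k0, k1, Q):
                matmul_inner(A, B, C,
                             i, min(i + P, i1),
                             j, min(j + R, j1),
                             k, min(k + Q, k1),
                             rest)


def matmul(A, B, tunables):
    """Entry point: allocate C and start at the outermost level."""
    M, T, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    matmul_inner(A, B, C, 0, M, 0, N, 0, T, tunables)
    return C
```

Calling `matmul(A, B, [(256, 256, 256), (32, 32, 32)])` would mirror the matmul_L2 / matmul_L1 levels named on slide 11. Changing the tuples retargets the same algorithm to a different hierarchy without touching the task bodies, which is the portability point the slides make.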

14 Hierarchical software enables efficiency
Portability
  – Hierarchy is least abstract common denominator
  – It's what systems want
Proportionality
  – Co-tune hardware and software
  – Path to true efficiency
Co-design cycles
  – Maintain efficiency with new technology
How strict is the hierarchy?

15 Hierarchical software enables co-tuning
Locality profiles drive dynamic rebalancing
(c) Mattan Erez, NVIDIA

16 Proportional and efficient resiliency
Resiliency principles:
  – Detect fault
  – Correct erroneous data if possible
  – Contain fault
  – Repair/reconfigure
  – Restore state and re-execute
Each step can be improved with co-tuning
  – Ignore certain faults (allow some errors)
  – Detect at coarse granularity
  – Contain where cheapest
  – Re-map application instead of repairing/reconfiguring hardware
  – Preserve and restore minimally and effectively

17 Hierarchical resiliency – containment domains
Containment domains enable proportionality
Match locality hierarchy with resiliency hierarchy
  – Efficient state preservation and restoration
  – Predictable (minimal) overhead
Hierarchy provides natural domains for managing faults (and rebalancing)
  – Co-tune resiliency scheme in HW and SW
  – Range of hardware error detection and correction mechanisms
  – Mechanisms introduce minimal overhead when not in use

18 Containment Domains: a full-system approach to resiliency
Hierarchy provides natural domains for containing faults
Containment domains enable software-controlled resilience
  – Preserve data on domain start
  – Detect faults before domain commits
  – Recover: restore data and re-execute when necessary
Arbitrary nesting
  – Tasks
  – Functions
  – Loop iterations
  – Instructions
Amenable to compiler analysis
Constructs for programmer tuning
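
The preserve / detect / recover cycle and the nesting described on slide 18 can be sketched as a small Python helper. Everything here is hypothetical and for illustration only: `run_domain`, `FaultDetected`, the dictionary used as domain state, and the `max_retries` parameter are assumptions of this sketch, not the containment-domain constructs from the talk.

```python
import copy


class FaultDetected(Exception):
    """Raised when a domain cannot recover and escalates to its parent."""


def run_domain(state, body, check, max_retries=2):
    """Run `body(state)` inside one containment domain.

    preserve: snapshot `state` (a dict) when the domain starts
    detect:   call `check(state)` before the domain commits
    recover:  restore the snapshot and re-execute, up to `max_retries` times
    escalate: raise FaultDetected so an enclosing domain can recover
    """
    preserved = copy.deepcopy(state)
    for _ in range(max_retries + 1):
        try:
            body(state)
            if check(state):
                return  # commit: the parent now sees the results
        except FaultDetected:
            pass  # a nested domain escalated; treat it as a local fault
        # Restore the preserved data in place before re-executing.
        state.clear()
        state.update(copy.deepcopy(preserved))
    raise FaultDetected("retries exhausted; escalating to parent domain")
```

Because `body` may itself call `run_domain`, domains nest: a fault that an inner domain cannot fix surfaces as `FaultDetected` in the enclosing domain, which restores its own, larger snapshot and re-executes. The deep copy is the simplest possible preservation; the slides argue for choosing what to preserve, and where, per level of the hierarchy.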

19 Tunable error protection
High AMTTI requires strong error protection
  – Global redundancy overhead can be high
  – Hardware mechanisms can help
  – Can do even better with software control
Containment domains enable specialized protection
  – Each domain can have a unique detection routine
      May even be scenario-specific
  – Redundancy can be added at any granularity

20 State preservation and restoration
Match storage hierarchy
Utilize NV memory
Explicit software control
Trade off overheads:
  – Storage, local and global bandwidth, recomputation, complexity and effort

21 Faults and default behavior encompass current approaches
Soft memory errors
  – Detect: hardware ECC
  – Recover: retry; if that fails, restore and re-execute
Hard memory fault
  – Detect: runtime liveness
  – Recover: map out bad memory
      If enough space: recover and re-execute
      Else: escalate failure
Soft arithmetic error
  – Detect: user-selectable
      Duplicated execution (HW/SW)
      Other HW techniques
      Algorithm-specific assert
  – Recover: retry; if that fails, restore and re-execute
Soft control errors
  – Detect: user-selectable
      Signatures
      Implicit exceptions
  – Recover: restore, re-execute
Hard compute fault
  – Detect: runtime liveness
  – Recover: map out bad PE
      If OK without the resource, or a spare is available: recover and re-execute
      Else: escalate failure
High-level unhandled faults
  – Detect: runtime heartbeat
  – Recover: escalate failure

22 Containment domains example
    void task SpMV(in matrix, in vec_i, out res_i) {
      forall(…) reduce(…)
        SpMV(matrix[…], vec_i[…], res_i[…]);
    } preserve { preserve_NV(matrix); } //inner
      restore_for_child { … }

    void task SpMV(…) {
      for r=0..N
        for c=rowS[r]..rowS[r+1] {
          contain { res_i[r] += data[c]*vec_i[cIdx[c]]; }
          check { fault(c > prevC); }
          prevC = c;
        }
    } preserve { preserve_NV(matrix); } //leaf
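
The leaf loop on slide 22 walks a sparse matrix in compressed sparse row (CSR) form, where `rowS` gives the start of each row and `cIdx` the column of each stored value. Below is a minimal runnable sketch in Python that treats each row as a small containment domain. The detection method is duplicated execution, one of the options listed on slide 21; it is swapped in here as an assumption and is not the `check` shown on the slide. The names `spmv`, `spmv_row`, `compute_row`, and `max_retries` are likewise hypothetical.

```python
def spmv_row(row_start, col_idx, data, vec, r):
    """Dot product of CSR row r with the dense vector vec."""
    total = 0.0
    for c in range(row_start[r], row_start[r + 1]):
        total += data[c] * vec[col_idx[c]]
    return total


def spmv(row_start, col_idx, data, vec, compute_row=spmv_row, max_retries=2):
    """CSR sparse matrix-vector product with one domain per row.

    Detect:  compute the row twice and require the results to agree
             (duplicated execution; an assumption of this sketch).
    Recover: re-execute the row. The inputs are read-only, so there is
             no state to restore before retrying.
    """
    n_rows = len(row_start) - 1
    res = [0.0] * n_rows
    for r in range(n_rows):
        for _ in range(max_retries + 1):
            first = compute_row(row_start, col_idx, data, vec, r)
            second = compute_row(row_start, col_idx, data, vec, r)
            if first == second:  # check passed: commit this row
                res[r] = first
                break
        else:
            # Every attempt disagreed: hand the fault to the caller.
            raise RuntimeError(f"row {r}: fault persists, escalating")
    return res
```

The `compute_row` parameter exists so that a test can inject a transient fault; a row whose first evaluation is corrupted is simply recomputed, and only that row pays the cost. Duplicating every row doubles the arithmetic, which is why the slides propose cheaper, algorithm-specific checks where they exist.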

23 Summary
Hierarchy is basic engineering approach
  – Works for hardware and works for software
Hierarchy is inevitable
  – Minimize movement
  – Amortize control
Match explicit hierarchies in HW and SW
  – Lowest abstract common denominator
Natural domains and boundaries enable:
  – Co-design
  – Co-tuning
  – Dynamic rebalancing
  – Resiliency

