†UCSD, ‡UCSB, ETHZ*, UNIBO*

Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing
Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†
†UCSD, ‡UCSB, ETHZ*, UNIBO*
Micrel.deis.unibo.it/MultiTherman
Variability.org

Energy-Efficient GPGPU
Thousands of deep and wide pipelines make GPGPUs high-power-consuming parts
Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of:
  Performance loss
  Increased timing sensitivity in the presence of variations
✓ SIMD
× Conservative guardbands → loss of operational efficiency
Total delay: corner + 3σ stochastic delay (guardband) (Kakoee et al, TCAS-II’12)

Variability is about Cost and Scale
Eliminating guardband → timing error (Bowman et al, JSSC’09)
Wide lanes: error rate × wider width
Deep pipes: costly error recovery for SIMD → recovery cycles increase linearly with pipeline length
Wide lanes and deep pipes together: quadratically expensive
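The scaling argument on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, none of which come from the slides: each lane errs independently with probability `p_lane` per instruction, an error in any lane triggers recovery for the whole SIMD instruction, and recovery costs a number of cycles proportional to pipeline depth. All function and parameter names are hypothetical.

```python
def simd_error_probability(p_lane: float, width: int) -> float:
    """Probability that at least one of `width` independent lanes errs."""
    return 1.0 - (1.0 - p_lane) ** width


def expected_recovery_cycles(p_lane: float, width: int, depth: int,
                             cycles_per_stage: float = 1.0) -> float:
    """Expected recovery cycles per SIMD instruction in the toy model.

    Assumption: one recovery costs depth * cycles_per_stage cycles.
    """
    return simd_error_probability(p_lane, width) * depth * cycles_per_stage


if __name__ == "__main__":
    # Sweep lane width and pipeline depth at a fixed per-lane error rate.
    for width, depth in [(1, 4), (16, 4), (16, 8), (64, 8)]:
        cost = expected_recovery_cycles(1e-3, width, depth)
        print(f"width={width:3d} depth={depth:2d} "
              f"expected recovery cycles={cost:.4f}")
```

For small `p_lane`, the error probability is close to `width * p_lane`, so the expected cost grows roughly with the product of width and depth, which is one way to read "quadratically expensive".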

Taxonomy of SIMD Variability-Tolerance
Guardband
  Adaptive: no timing error
    Predict & prevent
      Hierarchically focused guardbanding and uniform instruction assignment (Rahimi et al, DATE’13; Rahimi et al, DAC’13)
  Eliminating: timing error
    Error recovery (detect-then-correct)
      Independent recovery (exact computing)
        Lane decoupling through private queues (Pawlowski et al, ISSCC’12; Krimer et al, ISCA’12)
      Memoization (exact / approximate computing)
        Recalling recent context of error-free execution, approximately or exactly (Rahimi et al, TCAS’13; Rahimi et al, DATE’14)

Contributions
Efficient spatiotemporal reuse of computation in GPGPUs through collaborative:
  Micro-architectural design
    An associative memristive memory (AMM) module is integrated with FPUs, representing partial functionality
  Compiler profiling
    Fine-grained partitioning of values (searching the space of possible inputs)
    Pre-storing highly frequent sets of values in AMM modules
    Ensuring their resiliency under voltage overscaling
Target: Evergreen GPGPUs

Collaborative compilation framework and memristive-based computing
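1) Profiling (one-off activity): the training datasets and the OpenCL kernel are fed to the profiler, which extracts the highly frequent computations
2) Code generation: a customized clCreateBuffer inserts the AMM contents; the AMMs are programmed before the kernel is launched
3) Runtime: FPU and AMM work together; incoming operands are compared (=?) against the AMM contents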
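The profiling step can be sketched in software as a frequency count over an operand trace. This is a minimal illustration, not the authors' profiler: the trace format, the function name `profile_frequent_operands`, and the use of exact tuple matching are assumptions. The default of 32 entries follows the AMM size quoted in the conclusion.

```python
import operator
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

Operands = Tuple[float, ...]


def profile_frequent_operands(
    trace: Iterable[Operands],
    fpu_op: Callable[..., float],
    entries: int = 32,
) -> Dict[Operands, float]:
    """Return the `entries` most frequent operand tuples with their results.

    `trace` is a hypothetical log of operand tuples seen by one FPU
    operation while running the kernel on a training dataset.
    """
    counts = Counter(trace)
    # Pre-compute the result once for each of the most frequent operand sets.
    return {ops: fpu_op(*ops) for ops, _ in counts.most_common(entries)}


if __name__ == "__main__":
    training_trace = [(1.0, 2.0)] * 3 + [(3.0, 4.0)] * 2 + [(5.0, 6.0)]
    amm_contents = profile_frequent_operands(training_trace, operator.add,
                                             entries=2)
    print(amm_contents)
```

The returned mapping plays the role of the AMM contents that the code-generation step would insert before the kernel is launched.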

AMM with FPU
AMM: software programmable; mimics partial functionality of the FPU
Two pipelined stages:
  1) Search operands: ternary content addressable memory (TCAM)
     TCAM: a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm
  2) Return pre-stored result: crossbar-based 1T-1R memristive memory block
     Memory block: avoids read disturbance
Diagram labels: Error, No Recovery
†Li et al, JSSC’14
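The two-stage behavior (search the operands, then return the pre-stored result) can be mimicked with a small software model. This is a sketch only: the class name `AmmFpu`, the dictionary standing in for the TCAM plus memory block, and the fallback to a Python function as the "FPU" are assumptions, and the model says nothing about timing or circuit behavior.

```python
import math
from typing import Callable, Dict, Tuple

Operands = Tuple[float, ...]


class AmmFpu:
    """Toy software model of one FPU paired with an AMM (illustrative only)."""

    def __init__(self, fpu_op: Callable[..., float],
                 contents: Dict[Operands, float]) -> None:
        self.fpu_op = fpu_op            # stands in for the real FPU datapath
        self.contents = dict(contents)  # operand tuple -> pre-stored result
        self.hits = 0
        self.misses = 0

    def execute(self, *operands: float) -> float:
        key = tuple(operands)
        # Stage 1: search the operands (the TCAM lookup on the slide).
        if key in self.contents:
            self.hits += 1
            # Stage 2: return the pre-stored result from the memory block.
            return self.contents[key]
        # Miss: fall back to the result computed by the FPU stand-in.
        self.misses += 1
        return self.fpu_op(*operands)


if __name__ == "__main__":
    sqrt_unit = AmmFpu(math.sqrt, {(4.0,): 2.0, (9.0,): 3.0})
    print(sqrt_unit.execute(4.0))   # served from the AMM model
    print(sqrt_unit.execute(16.0))  # computed by the FPU stand-in
    print(sqrt_unit.hits, sqrt_unit.misses)
```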

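Programming before launching kernel
OpenCL Sobel: AMM hit rates
Offline: the profiler runs on the train input and extracts one set per operation
  +: {a, b} → {q}
  *: {a, b} → {q}
  √: {a} → {q}
  …
Programming before launching kernel: each set is loaded into the AMM of the matching FPU (FPU+ and AMM+, FPU* and AMM*, FPU√ and AMM√, …)
Runtime: inputs test1, test2, test3, test4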
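The train/test flow on this slide can be approximated as follows: build one operand set per operation from a training trace, then measure how often operands from a different trace hit those sets. This is a hedged sketch; the trace format `(op, operands)`, the function names, and exact matching are assumptions, and no numbers from the slides are reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Operands = Tuple[float, ...]
TraceItem = Tuple[str, Operands]  # (operation name, operand tuple)


def train_amms(trace: Iterable[TraceItem],
               entries: int = 32) -> Dict[str, Set[Operands]]:
    """Pick the `entries` most frequent operand tuples for each operation."""
    per_op: Dict[str, Counter] = defaultdict(Counter)
    for op, operands in trace:
        per_op[op][operands] += 1
    return {op: {ops for ops, _ in counts.most_common(entries)}
            for op, counts in per_op.items()}


def hit_rates(amms: Dict[str, Set[Operands]],
              trace: Iterable[TraceItem]) -> Dict[str, float]:
    """Fraction of executed operations per type whose operands are stored."""
    hits: Counter = Counter()
    total: Counter = Counter()
    for op, operands in trace:
        total[op] += 1
        if operands in amms.get(op, set()):
            hits[op] += 1
    return {op: hits[op] / total[op] for op in total}


if __name__ == "__main__":
    train = [("+", (1.0, 2.0))] * 3 + [("+", (3.0, 4.0))] + [("sqrt", (4.0,))] * 2
    test = [("+", (1.0, 2.0)), ("+", (9.0, 9.0)), ("sqrt", (4.0,))]
    print(hit_rates(train_amms(train, entries=1), test))
```

Evaluating on traces other than the training trace mirrors the slide's separation between the offline train input and the runtime test inputs.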

Efficiency under Voltage Overscaling
[Figure data labels: 33%, 30%, 36%, 19%, 17%, 33%, 28%, 32%, 39%, 29%, 37%, 28%]
Timing errors reduced from 38% to 24%
At 1.0V, without any timing error: 36% average energy saving (7 kernels)
At 0.88V: 39% energy saving on average

Conclusion
Static compiler analysis and coordinated microarchitectural design enable efficient reuse of computations in GPGPUs
Emerging associative memristive modules are coupled with FPUs for fast spatial and temporal reuse
GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% with 32-entry AMMs
