†UCSD, ‡UCSB, ETHZ, UNIBO

Presentation transcript:

Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing
Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†
†UCSD, ‡UCSB, *ETHZ, *UNIBO
Micrel.deis.unibo.it/MultiTherman, Variability.org

Energy-Efficient GPGPU
- Thousands of deep and wide pipelines make GPGPUs power-hungry parts.
- Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of:
  - performance loss
  - increased timing sensitivity in the presence of variations
- ✓ SIMD; × conservative guardbands, which lead to a loss of operational efficiency.
- Guardband: total delay = corner + 3σ stochastic delay (Kakoee et al, TCAS-II’12).

Variability is about Cost and Scale
- Eliminating the guardband leads to timing errors (Bowman et al, JSSC’09).
- Wide lanes: the error rate is multiplied by the wider width.
- Deep pipes: recovery cycles increase linearly with pipeline length.
- Error recovery for SIMD is therefore costly: quadratically expensive.
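The scaling argument on this slide can be sketched with a toy cost model. This is only an illustration under stated assumptions, not the authors' model: lanes are assumed to fail independently with the same per-lane error rate, and a recovery is assumed to cost one cycle per pipeline stage. The function names and the `cycles_per_stage` parameter are hypothetical.

```python
def simd_error_probability(lane_error_rate: float, lanes: int) -> float:
    """Probability that at least one of `lanes` independent lanes has a timing error."""
    return 1.0 - (1.0 - lane_error_rate) ** lanes


def expected_recovery_cycles(lane_error_rate: float, lanes: int,
                             pipeline_depth: int,
                             cycles_per_stage: float = 1.0) -> float:
    """Expected recovery cycles per SIMD instruction in this toy model.

    Assumption: one recovery costs `cycles_per_stage * pipeline_depth` cycles,
    so the cost grows linearly with depth and (for small error rates)
    roughly linearly with the number of lanes.
    """
    recovery = cycles_per_stage * pipeline_depth
    return simd_error_probability(lane_error_rate, lanes) * recovery


if __name__ == "__main__":
    # Compare a narrow/shallow design with wider and deeper ones.
    for lanes, depth in [(1, 4), (16, 4), (16, 8)]:
        cost = expected_recovery_cycles(1e-3, lanes, depth)
        print(f"lanes={lanes:2d} depth={depth} expected recovery cycles={cost:.4f}")
```

Because the expected cost is the product of a width-dependent term and a depth-dependent term, scaling both dimensions together makes recovery grow roughly with their product, which is one way to read "quadratically expensive".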

Taxonomy of SIMD Variability-Tolerance
- Adaptive guardband (no timing error): predict & prevent
  - Hierarchically focused guardbanding and uniform instruction assignment (Rahimi et al, DATE’13; Rahimi et al, DAC’13)
- Eliminating guardband (timing error): error recovery
  - Detect-then-correct, exact computing
    - Independent recovery: lane decoupling through private queues (Pawlowski et al, ISSCC’12; Krimer et al, ISCA’12)
  - Memoization, exact / approximate computing
    - Recalling recent context of error-free execution, approximately / exactly (Rahimi et al, TCAS’13; Rahimi et al, DATE’14)

Contributions
Efficient spatiotemporal reuse of computation in GPGPUs through collaborative:
- Micro-architectural design
  - An associative memristive memory (AMM) module is integrated with FPUs, representing partial functionality.
- Compiler profiling
  - Fine-grained partitioning of values (searching the space of possible inputs).
  - Pre-storing highly frequent sets of values in AMM modules.
- Ensuring their resiliency under voltage overscaling for Evergreen GPGPUs.

Collaborative compilation framework and memristive-based computing
1) Profiling (one-off activity): the profiler runs the OpenCL kernel on training datasets and extracts highly frequent computations.
2) Code generation: a customized clCreateBuffer inserts the AMM contents, so the AMMs are programmed when the kernel is launched.
3) Runtime: operands are compared (=?) against the AMM contents alongside the FPU.
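The profiling step can be sketched in a few lines. This is a minimal illustration, not the authors' profiler: it assumes the profiler produces a trace of (opcode, operands, result) tuples, and it keeps the most frequent operand sets per opcode. The trace format and the name `build_amm_contents` are hypothetical; the default of 32 entries follows the 32-entry AMMs mentioned in the conclusion.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Operands = Tuple[float, ...]
TraceEntry = Tuple[str, Operands, float]  # (opcode, operands, result)


def build_amm_contents(trace: Iterable[TraceEntry],
                       entries_per_amm: int = 32
                       ) -> Dict[str, List[Tuple[Operands, float]]]:
    """Pick the most frequent operand sets per opcode from a profiling trace."""
    counts: Dict[str, Counter] = {}
    results: Dict[Tuple[str, Operands], float] = {}
    for opcode, operands, result in trace:
        counts.setdefault(opcode, Counter())[operands] += 1
        results[(opcode, operands)] = result  # remember the computed result
    return {
        opcode: [(ops, results[(opcode, ops)])
                 for ops, _ in counter.most_common(entries_per_amm)]
        for opcode, counter in counts.items()
    }


if __name__ == "__main__":
    trace = [
        ("ADD", (1.0, 2.0), 3.0),
        ("ADD", (1.0, 2.0), 3.0),
        ("ADD", (0.0, 0.0), 0.0),
        ("SQRT", (4.0,), 2.0),
    ]
    # With one entry per AMM, only the most frequent ADD operands survive.
    print(build_amm_contents(trace, entries_per_amm=1))
```

The resulting per-opcode tables stand in for the "AMM contents" that the code generation step would pass to the device before the kernel runs.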

AMM with FPU
- AMM: software programmable; mimics partial functionality of the FPU.
- Two pipelined stages:
  - Search operands: ternary content addressable memory (TCAM).
  - Return pre-stored result: crossbar-based 1T-1R memristive memory block.
- TCAM: a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm.
- Memory block: avoids read disturbance.
- When the pre-stored result is returned, an error requires no recovery.
†Li et al, JSSC’14
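The search-then-return behaviour described above can be modelled in software. The sketch below is a behavioural toy, not a description of the hardware: a Python dict stands in for the TCAM search plus memory block, and the class name `AmmFpu`, the `timing_error` flag, and the exception on a miss are all assumptions made for illustration. It also assumes that a hit makes an FPU timing error irrelevant, which is how the "error, no recovery" note is read here.

```python
import math
from typing import Callable, Dict, Tuple

Operands = Tuple[float, ...]


class AmmFpu:
    """Toy model of an FPU paired with an associative memory (AMM)."""

    def __init__(self, fpu_op: Callable[..., float],
                 amm_contents: Dict[Operands, float]) -> None:
        self.fpu_op = fpu_op
        self.amm = dict(amm_contents)  # programmed before the kernel runs
        self.hits = 0
        self.misses = 0

    def execute(self, *operands: float, timing_error: bool = False) -> float:
        # Stage 1: search the operands (stands in for the TCAM search).
        if operands in self.amm:
            # Stage 2: return the pre-stored result. A timing error in the
            # FPU does not matter on a hit, so no recovery is triggered.
            self.hits += 1
            return self.amm[operands]
        self.misses += 1
        if timing_error:
            # Miss plus error: the baseline recovery path would be needed.
            raise RuntimeError("timing error on AMM miss: recovery required")
        return self.fpu_op(*operands)


if __name__ == "__main__":
    sqrt_unit = AmmFpu(math.sqrt, {(4.0,): 2.0})
    print(sqrt_unit.execute(4.0, timing_error=True))  # hit: pre-stored result
    print(sqrt_unit.execute(9.0))                     # miss: computed by the FPU
    print(sqrt_unit.hits, sqrt_unit.misses)
```

Keeping hit and miss counters in the model makes it easy to relate the lookup behaviour to the hit rates and error reductions reported on the later slides.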

OpenCL Sobel: AMM Hit Rates
- Offline: the profiler runs on the train input and extracts entries per operation:
  - +: {a, b} → {q}
  - *: {a, b} → {q}
  - √: {a} → {q}
  - …
- Programming before launching the kernel: each FPU is paired with its AMM (FPU+ / AMM+, FPU* / AMM*, FPU√ / AMM√, …).
- Runtime: test1, test2, test3, test4.
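The train/test split on this slide suggests a simple way to measure hit rates: program the entries once from the training input, then count how often each test input finds its operands among them. The sketch below assumes the AMM entries are given as a set of operand tuples and each test input as a list of operand tuples; the function names and the sample data are hypothetical and do not reproduce the measured Sobel results.

```python
from typing import Dict, Iterable, Set, Tuple

Operands = Tuple[float, ...]


def hit_rate(amm_entries: Set[Operands], ops: Iterable[Operands]) -> float:
    """Fraction of operations whose operands are found among the AMM entries."""
    ops = list(ops)
    if not ops:
        return 0.0
    return sum(o in amm_entries for o in ops) / len(ops)


def hit_rates_per_input(amm_entries: Set[Operands],
                        test_inputs: Dict[str, Iterable[Operands]]
                        ) -> Dict[str, float]:
    """Hit rate for each named test input, using entries fixed offline."""
    return {name: hit_rate(amm_entries, ops) for name, ops in test_inputs.items()}


if __name__ == "__main__":
    # Entries assumed to come from profiling the training input.
    entries = {(0.0, 0.0), (1.0, 1.0)}
    tests = {
        "test1": [(0.0, 0.0), (1.0, 1.0), (7.0, 7.0), (0.0, 0.0)],
        "test2": [(5.0, 5.0), (0.0, 0.0)],
    }
    print(hit_rates_per_input(entries, tests))
```

Fixing the entries before measuring mirrors the offline/runtime separation: the test inputs never influence what is stored.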

Efficiency under Voltage Overscaling
- [Chart; data labels as extracted, series not recoverable: 33%, 30%, 36%, 19%, 17%, 33%, 28%, 32%, 39%, 29%, 37%, 28%]
- Timing errors are reduced from 38% to 24%.
- At 1.0V, without any timing error: 36% average energy saving (7 kernels).
- At 0.88V: 39% energy saving on average.

Conclusion
- Static compiler analysis and coordinated microarchitectural design enable efficient reuse of computations in GPGPUs.
- Emerging associative memristive modules are coupled with FPUs for fast spatial and temporal reuse.
- GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% with 32-entry AMMs.