A. Rahimi, A. Marongiu, P. Burgio, R. K. Gupta, L. Benini

Variation-Tolerant OpenMP Tasking on Tightly-Coupled Processor Clusters A. Rahimi, A. Marongiu, P. Burgio, R. K. Gupta, L. Benini UC San Diego and Università di Bologna

Outline
- Device variability
  - Process, voltage, and temperature variations
- Why OpenMP and why tasking?
  - Task-Level Vulnerability (TLV)
- Variation-tolerant architecture
  - Inter- and intra-corner TLV
- Variation-tolerant OpenMP tasking
  - Variation-aware reactive scheduling algorithm
- Experimental results
24-Nov-18 Andrea Marongiu / Università di Bologna

Ever-increasing Process, Voltage, and Temperature Variations
- Variability in transistor characteristics is a major challenge in nanoscale CMOS.
- Static process variation, e.g., 40% in VTH.
- Dynamic variations, e.g., temperature fluctuations of 160°C and supply voltage droops of 10%.
- To handle variations, designers use conservative guardbands, which leads to a loss of operational efficiency.

Approaches to Variability Tolerance
- Design-time conservative guardbanding
- Post-silicon binning
- Runtime tolerance through adaptive techniques, e.g., replaying errant instructions
The runtime approach relies on online measurement of errors and creates runtime overhead [Bowman’11] that should be minimized:
- Latency: up to 28 extra recovery cycles per error
- Energy: an overhead of 26 nJ

Why a Variation-Aware OpenMP?
- Variations are exacerbated in many-core systems:
  - Multiple voltage-temperature islands
  - Cores in different islands display different error rates
- The programming model and runtime environment of MIMD systems should be aware of variations.
- Example: Core1 at 0.81V faces 428K errant instructions, whereas Core0 at 1.1V faces 7.3K errant instructions.
[Figure: frequency variation of a 16-core cluster due to WID and D2D process variation; the labeled per-core frequencies range from 820 MHz to 917 MHz]

Why OpenMP Tasking?
- Steps to build variability abstractions up to the SW layer:
  - Instruction-Level Vulnerability (ILV)
  - Sequence-Level Vulnerability (SLV)
  - Procedure-Level Vulnerability (PLV)
  - Task-Level Vulnerability (TLV)
- TLV serves as metadata to characterize variations.
- TLV is a vertical abstraction: it reflects the manifestation of circuit-level variability in a specific parallel software context.
- Tasks are the right granularity for the OMP scheduler to observe and react.
- Tasking is a convenient abstraction for programmers to express irregular and unstructured parallelism.
[ILV] A. Rahimi, L. Benini, R. K. Gupta, “Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations,” DATE, 2012.
[SLV] A. Rahimi, L. Benini, R. K. Gupta, “Application-Adaptive Guardbanding to Mitigate Static and Dynamic Variability,” IEEE Trans. on Computers, 2013 (to appear).
[PLV] A. Rahimi, L. Benini, R. K. Gupta, “Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters,” ISLPED, 2012.

Instruction-Level Vulnerability (ILV)*
The ILV of each instruction i at every operating condition is quantified as:
$$ ILV_i = \frac{1}{N_i} \sum_{j=1}^{N_i} Violation_j $$
where N_i is the total number of clock cycles in the Monte Carlo simulation of instruction i with random operands, and Violation_j indicates whether there is a violated stage at clock cycle j (1) or not (0).
ILV_i is thus defined as the total number of violated cycles over the total simulated cycles for instruction i. The lower the ILV, the better.
*A. Rahimi, L. Benini, R. K. Gupta, “Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations,” DATE, 2012.
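
The ILV definition (violated cycles over simulated cycles) can be sketched in C as follows. This is only an illustration, not the authors' characterization flow: the function name ilv_from_violations and the representation of the Monte Carlo trace as an array of 0/1 flags are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): computes ILV_i as the number of
 * violated cycles divided by the total number of simulated cycles N_i.
 * violation[j] is assumed to be nonzero if any pipeline stage violated timing
 * at clock cycle j, and 0 otherwise. */
double ilv_from_violations(const unsigned char *violation, size_t n_cycles)
{
    assert(violation != NULL && n_cycles > 0);

    size_t violated = 0;
    for (size_t j = 0; j < n_cycles; j++) {
        if (violation[j]) {
            violated++;
        }
    }
    return (double)violated / (double)n_cycles;
}
```

For instance, a trace of eight simulated cycles in which two cycles are violated yields an ILV of 2/8 = 0.25.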

Task-Level Vulnerability (TLV)
- ILV is a useful variability metric that raises the level of abstraction from the circuit (critical paths) to the ISA level.
- ILV is extended to a more coarse-grained, task-level metric, TLV, towards building an integrated, vertical approach to control variability.
- TLV is a per-core and per-task-type metric:
$$ TLV_{(core_i, task_j)} = \frac{\sum EI_{(core_i, task_j)}}{Length_{task_j}} $$
  where \sum EI is the number of errant instructions during task j on core i, and Length is the total number of executed instructions.
- The lower the TLV, the better.

Variation-Tolerant MP Cluster (1/2)
- Inspired by STM STHORM
- 16x 32-bit RISC cores
- L1 SW-managed Tightly Coupled Data Memory (TCDM)
  - Multi-banked/multi-ported
  - Fast concurrent read access
- Fast logarithmic interconnect
- One clock domain
- Bridge towards NoC
[Figure: cluster block diagram. Cores 0 to M, each with an I$, a variability sensor, VDD-hopping, and replay logic, connect through master ports and a low-latency logarithmic interconnect to the slave ports of the shared L1 TCDM (banks 0 to N, with test-and-set semaphores) and to the L2/L3 bridge]

Variation-Tolerant Architecture (2/2)
Every core is equipped with:
- Error sensing (EDS [Bowman’09]) to detect any timing error due to dynamic delay variation
- Error recovery (multiple-issue replay mechanism [Bowman’11]) to recover the errant instruction without changing the clock frequency
- VDD hopping (semi-static) [Miermont’07] to compensate for the impact of static process variation [Rahimi’12]
Thus, the cluster enables per-core characterization of TLV metadata: online variability measurement feeds TLV metadata characterization.
Fast access to the TLV metadata for each type of task is guaranteed by carefully placing these key data structures in L1 TCDM.

OpenMP Tasking
    #pragma omp parallel
    {
        #pragma omp single
        for (i = 1; i <= N; i++) {
            #pragma omp task   /* first task type */
            FUNC_1 (i);
            #pragma omp task   /* second task type */
            FUNC_2 (i);
        }
    } /* implicit barrier */
- Task directives identify given portions of code (tasks).
- A task type is defined for every occurrence of the task directive in the program; the code above has two task types.
- Task descriptors are created upon encountering a task directive and pushed to the task queue in TCDM.
- Tasks are fetched and executed (FIFO) by any core encountering a barrier.

Intra- and Inter-Corner TLV
- Intra-corner TLV (fixed operating condition of 25°C, 1.1V): the TLV differs across task types (by up to 9×), even within a fixed operating condition on a given core i.
- Inter-corner TLV (across various operating conditions for 45nm): the average TLV of the six types of tasks is an increasing function of temperature. In contrast, decreasing the voltage from the nominal point of 1.1V increases TLV.

Variation-tolerant OpenMP Tasking
Online TLV characterization:
- TLV table: LUT containing the TLV for every core and task type
  - Resides in TCDM
  - Allows parallel inspection from multiple cores
- Distributed scheduler: each core collects TLV information in parallel
- LUT updated at every task execution
    void handle_tasks () {
      while (HAVE_TASKS) { // Task scheduling loop
        task_desc_t *t = EXTRACT_TASK ();
        if (t) {
          /* Remember the previous TLV reading for this core */
          float Otlv = tlv_read_task_metadata (core_id);
          /* Reset counter for this core */
          tlv_reset_task_metadata (core_id);
          /* EXEC! */
          t->task_fn (t->task_data);
          /* We executed. Fetch TLV ... */
          float tlv = tlv_read_task_metadata (core_id);
          /* Update TLV. Average new and old value */
          tlv_table_write (t->task_type_id, core_id, (tlv + Otlv) / 2);
        }
      }
    }
[Figure: TLV table in TCDM, indexed by task type (T0, T1) and core (C0, C1, C2), with example entries such as 0.0211, 0.891, 0.000005, and 0.11]
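
One possible in-memory layout for such a TLV lookup table, together with the "average new and old value" update, is sketched below. The names tlv_sketch_read and tlv_sketch_update, the float-per-entry layout, and the number of task types are assumptions for illustration; they are deliberately distinct from the tlv_table_write call shown on the slide, whose implementation is not given.

```c
#include <assert.h>

#define NUM_CORES      16 /* cluster size stated in the slides */
#define NUM_TASK_TYPES 2  /* assumption for this example */

/* Hypothetical layout: one entry per (task type, core) pair.
 * A static array starts out zero-initialized. */
static float tlv_sketch_table[NUM_TASK_TYPES][NUM_CORES];

float tlv_sketch_read(int task_type_id, int core_id)
{
    assert(task_type_id >= 0 && task_type_id < NUM_TASK_TYPES);
    assert(core_id >= 0 && core_id < NUM_CORES);
    return tlv_sketch_table[task_type_id][core_id];
}

/* Blend a freshly measured TLV into the stored entry by averaging the
 * old and new values, as the comment in the slide's code describes. */
void tlv_sketch_update(int task_type_id, int core_id, float new_tlv)
{
    float old_tlv = tlv_sketch_read(task_type_id, core_id);
    tlv_sketch_table[task_type_id][core_id] = (old_tlv + new_tlv) / 2.0f;
}
```

Because each core only writes its own column, a layout like this lets all cores update the table in parallel without locking, which is consistent with the distributed scheduler described above.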

TLV-aware Extensions
Variation-tolerant OpenMP scheduler:
- Reactive scheduling: idle processors trying to fetch a task check whether their TLV for that task is under a certain threshold, to minimize the number of errant instructions (and costly replay cycles).
- The number of rejects for a given task is limited, to avoid starvation.
[Figure: the OpenMP tasking example with its task queue and task descriptors in TCDM; the FIFO fetch-and-execute step is annotated with a TLV-aware fetch]

Variation-aware Scheduling Algorithm
    /* Executed by core i when it looks for work */
    taskj = PEEK_QUEUE();
    TLV(i,j) = tlv_table_read(corei, taskj);
    if (TLV(i,j) > TLV_THR && corei_escape_cnt < ESCAPE_THR) {
        /* Too vulnerable on this core: leave the task for another core */
        corei_escape_cnt++;
        escape(taskj);
    } else {
        /* Acceptable TLV, or escape budget exhausted: run the task */
        assign_to_corei(taskj);
        corei_escape_cnt = 0;
    }
[Figure: task queue, TLV table in TCDM (task types T0, T1 by cores C0, C1, C2), and the per-core escape counters core_escape_cnt]

Experimental Setup: Arch. + Benchmarks
- Architecture: SystemC-based virtual platform* modeling the tightly-coupled cluster
    ARM v6 cores: 16              TCDM banks: 16
    I$ size: 16KB per core        TCDM latency: 2 cycles
    I$ line: 4 words              TCDM size: 256 KB
    Latency hit: 1 cycle          L3 latency: ≥ 60 cycles
    Latency miss: ≥ 59 cycles     L3 size: 256MB
- Benchmark: seven widely used computational kernels from the image processing domain are parallelized using OpenMP tasking, with 375 dynamic tasks on average.
- The TLV lookup table occupies only 104−448 bytes, depending on the number of task types.
*D. Bortolotti et al., “Exploring instruction caching strategies for tightly-coupled shared-memory clusters,” Proc. Intern. Symposium on System on Chip (SoC), pp. 34-41, 2011.

Experimental Setup: Variability Modeling
- Each core is optimized during P&R with a target frequency of 850MHz.
- At sign-off, die-to-die and within-die process variations are injected using PrimeTime VX and variation-aware 45nm TSMC libraries (derived from PCA).
- Under process variation, six cores (C0, C2, C4, C10, C13, C14) cannot meet the design-time target frequency of 850 MHz.
- Vdd-hopping is applied to compensate for the injected process variation, with VDD = { 1.1V, 0.97V, 0.81V }. All cores can then work at the design-time target frequency of 850 MHz, but at multiple voltage operating points (OpPs).
- To emulate variations, variation models are integrated at the level of individual instructions using the ILV characterization methodology: ILV models of a 16-core LEON-3 for the TSMC 45-nm general-purpose process with normal-VTH cells.
[Figure: per-core maximum frequency in MHz, under process variation and after Vdd-hopping. Legible values under process variation: C0 847, C1 893, C6 877, C7 870, C8 909, C9 855, C10 826, C11 917, C12 901, C13 820, C15 862. After Vdd-hopping, C0 is shown as >850; the remaining legible values are unchanged]

Overhead of Variation-tolerant Scheduler
- Normalized IPC = IPC of the variation-aware scheduler / IPC of the OMP baseline scheduler
- On a variation-immune cluster, the normalized IPC of the cluster drops slightly, to 0.998× on average (about 0.2% overhead).
- The overhead is due to reading the TLV lookup table and checking the conditions.

IPC of Variability-affected Cluster
- M = number of times that the scheduler postpones the execution of the task at the head of the queue. On average, each task is escaped 2.1 times.
- Our scheduler decreases the number of cycles per cluster for each type of task, because cores incur fewer errant instructions and spend fewer cycles on recovery.
- The normalized IPC increases by 1.17× (on average) for all benchmarks executing at 10°C.
- At a temperature of 100°C (ΔT=90°C), the IPC increases by 1.15×.

Conclusion
- Vertical abstraction of circuit-level variations into high-level parallel software execution (OpenMP 3.0 tasking)
- The vulnerability of tasks is characterized by TLV metadata during introspective execution.
- The reactive variation-tolerant runtime scheduler utilizes TLV to match cores with tasks.
- The normalized IPC of the 16-core variability-affected cluster increases by up to 1.51× (on average, 1.15×).
- Future work: multiple clusters at multiple dynamic OpPs in Vdd & f

Thank you for your attention!
ERC MultiTherman, NSF Variability Expedition

Classification of Instructions Based on ILV
ILV at 0.88V while varying temperature, for 65nm.
[Table: ILV per instruction at the corners (0.88V, -40°C), (0.88V, 0°C), and (0.88V, 125°C), for cycle times between 1 ns and 1.18 ns. Rows are grouped as Logical & Arithmetic (add, and, or, sll, sra, srl, sub, xnor, xor), Mem (load, store), and Mul. & Div (mul, div). Legible entries, with column alignment lost: load 0.824, 0.707, 0.796; store 0.847, 0.743, 0.823; mul 0.996, 0.064, 0.027, 0.017, 0.065, 0.018, 0.876, 0.016; div 0.991, 0.989, 0.984, 0.994, 0.973]
Instructions are partitioned into three main classes:
- 1st class: logical & arithmetic instructions
- 2nd class: memory instructions
- 3rd class: hardware multiply & divide instructions
For every operating condition: ILV (3rd class) ≥ ILV (2nd class) ≥ ILV (1st class)