Presentation transcript:

Andrea Marongiu and Luca Benini (ETH Zurich); Daniele Cesarini (University of Bologna)

Key Idea
Efficient error recovery in the runtime software: timing errors are exposed to the OpenMP runtime, which absorbs their recovery cost instead of the application, reducing the overall cost of error recovery.
[Figure: application / OpenMP runtime stack, contrasting the error-recovery cost of a baseline runtime with the reduced cost of the proposed efficient runtime]

Outline
- Background on variability
- Related work
- Contributions
- Target architecture platform
- Online meta-data characterization
- Scheduling: centralized and distributed
- Experimental results
- Summary

Ever-increasing Variability in Semiconductors
Post-silicon, worst-case design does not scale: variability grows with every technology generation, reaching roughly 20× performance variation in near-threshold operation [ITRS]. Guardbanding against this variation leads to loss of operational efficiency.
[Figure: performance spread versus technology generation (nm), growing from below 10% to above 50%, with the guardband highlighted]

Reduced Guardband Causes Timing Errors
Reducing the guardband causes timing errors, and recovering from them is costly: an in-order pipeline pays 3×N recovery cycles per error, where N is the number of pipeline stages [Bowman, JSSC'09][Bowman, JSSC'11].
Grand challenge: low-cost and scalable variability tolerance.
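As a rough worked illustration (assuming the seven-stage in-order pipeline of the cluster cores described later in this deck, not a figure from the original slides): a single timing error already costs 3 × N = 3 × 7 = 21 replay cycles.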

Related Work
Prior variability-tolerance techniques span the hardware/software stack, from the instruction sequence and ISA up to tasks, work units, and the scheduler/allocator:
- Instruction replay in hardware [JSSC'11]
- ISA vulnerability analysis [DATE'12]
- Improved code transformation [ASPLOS'11, TC'13]
- Coarse-grained task/thread scheduling [JSSC'11]
- Task-level tolerance [DATE'13]
- Work-unit tolerance [JETCAS'14]
OpenMP captures variations in various parallel software contexts!

Contributions
I. Reducing the cost of recovery by exposing errors to the OpenMP runtime layer.
II. Online meta-data characterization to capture per-core variation and workload, expressed as a task execution cost.
III. Scheduling policies: centralized and distributed.

Target Architecture Platform
Host: CPU with MMU, L2 cache, and coherent interconnect, attached to main memory, an IO MMU, L2 memory, and a cluster-based many-core accelerator (clusters with local L1 memory and network interfaces connected through an interconnect).
Shared-memory cluster: 16 32-bit in-order RISC cores (Core 0 ... Core 15), each with a private L1 I$ and a 7-stage pipeline, connected through a low-latency interconnect to a DMA engine and a shared L1 tightly-coupled data memory (TCDM) organized in 32 banks.
Each core uses EDAC with multiple-issue instruction replay (per-core ΣI and ΣIR counters).

Online Meta-data Characterization
Application X (OpenMP tasking code):

    #pragma omp parallel
    {
      #pragma omp master
      {
        for (int t = 0; t < T_n; t++) {     /* T_n: number of tasks per type */
          #pragma omp task
          run_add(t, TASK_SIZE);
        }
        #pragma omp taskwait

        for (int t = 0; t < T_n; t++) {
          #pragma omp task
          run_shift(t, TASK_SIZE);
        }
        #pragma omp taskwait

        for (int t = 0; t < T_n; t++) {
          #pragma omp task
          run_mul(t, TASK_SIZE);
        }
        #pragma omp taskwait

        for (int t = 0; t < T_n; t++) {
          #pragma omp task
          run_div(t, TASK_SIZE);
        }
        #pragma omp taskwait
      }
    }

The runtime builds a meta-data lookup table for Application X, with one row per task type (task type 1 ... task type 4) and one column per core (core 1, core 2, ...), holding the task execution cost (TEC) observed online on each core:

    TEC(task_i, core_j) = #I(task_i, core_j) + #RI(task_i, core_j)

i.e. the number of executed instructions plus the number of replayed instructions, read from the per-core ΣI and ΣIR counters of the EDAC circuitry.
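A minimal sketch of how such a table could be maintained (an illustration only, not the actual runtime: the table layout, the helper names, and the counter accessors read_insn_count()/read_replay_count() are hypothetical stand-ins for the per-core ΣI/ΣIR hardware counters):

    #include <stdint.h>

    #define NUM_CORES      16
    #define NUM_TASK_TYPES  4

    /* Per-application meta-data lookup table: task execution cost (TEC)
       observed online for each (task type, core) pair. */
    static uint32_t tec[NUM_TASK_TYPES][NUM_CORES];

    /* Hypothetical accessors for the per-core EDAC counters (executed
       instructions #I and replayed instructions #RI). */
    extern uint32_t read_insn_count(int core);
    extern uint32_t read_replay_count(int core);

    /* Called when a task of a given type finishes on a given core:
       TEC(task, core) = #I(task, core) + #RI(task, core). */
    static void update_tec(int type, int core,
                           uint32_t insn_before, uint32_t replay_before)
    {
        uint32_t insn   = read_insn_count(core)   - insn_before;
        uint32_t replay = read_replay_count(core) - replay_before;
        tec[type][core] = insn + replay;
    }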

Centralized Variability- and Load-Aware Scheduler (CVLS)
For a task_i, CVLS assigns the core_j for which TEC(task_i, core_j) + load_j is minimum across the cluster. CVLS is centralized: it is executed by one master thread.
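A minimal sketch of this selection, reusing the tec[][] table and NUM_CORES from the previous sketch and assuming a per-core load[] array that tracks the work already queued on each core (illustrative names, not the authors' implementation):

    /* Pick the core that minimizes the predicted completion cost for this
       task type: TEC(task, core) + load(core), scanned across the cluster. */
    static int cvls_select_core(int type, const uint32_t load[NUM_CORES])
    {
        int best_core = 0;
        uint64_t best_cost = (uint64_t)tec[type][0] + load[0];

        for (int c = 1; c < NUM_CORES; c++) {
            uint64_t cost = (uint64_t)tec[type][c] + load[c];
            if (cost < best_cost) {
                best_cost = cost;
                best_core = c;
            }
        }
        return best_core;
    }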

Limitations of a Centralized Queue
With a single master thread, CVLS consumes more cycles than a round-robin scheduler (RRS) to find a suitable core for each task assignment (up to 15% slower in the reported comparison).
Solution: reduce this overhead via distributed task queues (a private task queue for each core).

Distributed Variability- and Load-Aware Scheduler (DVLS)
DVLS spreads the computational load of CVLS from the single master thread across multiple slave threads:
1. The master thread simply pushes tasks into a decoupled queue.
2. Slave threads pick up tasks from the decoupled queue and push each one to the best-fitted distributed (per-core) queue.
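A minimal sketch of the two-stage dispatch, reusing cvls_select_core(), tec[][] and NUM_CORES from the earlier sketches; task_t, queue_t, queue_push() and queue_pop() are hypothetical placeholders, not the runtime's actual queue API:

    #include <stddef.h>

    typedef struct task  { int type; /* ... payload ... */ } task_t;
    typedef struct queue queue_t;            /* opaque FIFO, placeholder */
    extern void    queue_push(queue_t *q, task_t *t);
    extern task_t *queue_pop(queue_t *q);    /* returns NULL when empty  */

    /* Stage 1: the master thread only enqueues work, then moves on. */
    void dvls_master_push(queue_t *decoupled_q, task_t *t)
    {
        queue_push(decoupled_q, t);
    }

    /* Stage 2: each slave thread drains the decoupled queue and routes
       every task to the per-core queue that minimizes TEC + load. */
    void dvls_slave_dispatch(queue_t *decoupled_q,
                             queue_t *per_core_q[NUM_CORES],
                             uint32_t load[NUM_CORES])
    {
        task_t *t;
        while ((t = queue_pop(decoupled_q)) != NULL) {
            int core = cvls_select_core(t->type, load);
            load[core] += tec[t->type][core];   /* account for queued work */
            queue_push(per_core_q[core], t);
        }
    }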

Benefits of Decoupling and Distributed Queues
This "decoupling" between queues means:
1. The master thread can proceed quickly, simply pushing tasks.
2. The remaining threads cooperatively schedule tasks among themselves, giving full utilization.
[Chart annotations: 44% faster, 4% faster]

Experimental Setup
- Each core is optimized during P&R with a target frequency of 850 MHz.
- Sign-off: die-to-die and within-die process variations are injected using PrimeTime VX and variation-aware 45nm TSMC libraries (derived from PCA).
- Instruction-level vulnerability (ILV) models are integrated into a SystemC-based virtual platform of the 16-core cluster.
- Eight computation-intensive kernels are accelerated with OpenMP.
- Under the injected process variation, six cores (C0, C2, C4, C10, C13, C14) cannot meet the design-time target frequency of 850 MHz.

Execution Time and Energy Saving
- DVLS: up to 47% (33% on average) faster execution than RRS.
- CVLS: up to 29% (4% on average) faster execution than RRS.
- DVLS: up to 44% (30% on average) energy saving compared to RRS.
- CVLS: up to 38% (17% on average) energy saving compared to RRS.

Summary
- Our OpenMP runtime reduces the cost of error correction in software, through proper task-to-core assignment in the presence of errors.
- Introspective task monitoring characterizes online meta-data, capturing both hardware variability and workload.
- Centralized and distributed dispatching: with decoupled and distributed tasking queues, all threads cooperatively schedule tasks among themselves.
- Distributed scheduling achieves on average 30% energy saving and 33% performance improvement compared to RRS.