WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.



Eliminating Non-Productive Memory Operations in Narrow-Bitwidth Architectures
Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC)
64-bit datapath
64-bit addressing and high-precision computing
Figure labels: 64-bit adder, 64 bit, 64 bit

64-bit datapath
64-bit addressing and high-precision computing
Figure labels: 16-bit adder, 64 bit, 64 bit

16-bit integer datapath
64-bit addressing and high-precision computing
40% of computations need only a 16-bit datapath
Caveat: a 64-bit computation becomes 8 * 16-bit computations, produced by dynamic binary translation (DBT)
Figure label: 16-bit adder
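The caveat above can be made concrete with a small sketch. It is only an illustration, not the authors' translator: it assumes a 64-bit value is split into four 16-bit limbs that are added one by one with a carry, and the helper names (`split16`, `join16`, `add64_narrow`) are hypothetical. How the slide arrives at a count of 8 narrow computations depends on the lowering used by the DBT, which is not modeled here.

```python
MASK16 = 0xFFFF  # one 16-bit limb


def split16(value):
    """Split a 64-bit unsigned value into four 16-bit limbs, lowest first."""
    return [(value >> (16 * i)) & MASK16 for i in range(4)]


def join16(limbs):
    """Reassemble limbs (lowest first) into a single integer."""
    return sum(limb << (16 * i) for i, limb in enumerate(limbs))


def add64_narrow(x, y):
    """Add two 64-bit values using only 16-bit additions plus a carry."""
    carry = 0
    out = []
    for a, b in zip(split16(x), split16(y)):
        total = a + b + carry        # fits in 17 bits
        out.append(total & MASK16)   # low 16 bits are the result limb
        carry = total >> 16          # bit 16 is the carry into the next limb
    return join16(out)               # final carry is dropped (wraps modulo 2**64)


print(hex(add64_narrow(0xFFFF, 1)))
```

When the upper limbs of both operands are zero, only the first iteration does useful work, which is the situation a narrow datapath is meant to exploit.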

What does non-productive mean?


Contributions and conclusions
1. A narrow ISA offers more opportunities to remove non-productive memory operations
2. 50% of dynamic narrow operations are non-productive
3. Memory Productiveness Pruning: a profile-guided, dynamic optimization

Energy Efficient Code Generation for Processors with Exposed Datapath
Dongrui She, Yifan He, Bart Mesman, Henk Corporaal (TUE)
Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)
Register file access reduction

Register Reuse Scheduling
Gergö Barany
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs
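The method line can be illustrated with a minimal sketch. It is not the author's algorithm: the five-instruction DDG, the instruction names, and the helper `add_reuse_arc` are hypothetical. The idea shown is only that an extra arc from the last use of one value to the definition of another forces every valid schedule to keep their live ranges apart, provided the arc does not create a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def schedule(preds):
    """Return one valid instruction order for a DDG given as node -> predecessors."""
    return list(TopologicalSorter(preds).static_order())


def add_reuse_arc(preds, last_use, new_def):
    """Return a copy of the DDG with the arc last_use -> new_def, or None on a cycle."""
    extended = {node: set(p) for node, p in preds.items()}
    extended.setdefault(new_def, set()).add(last_use)
    try:
        schedule(extended)           # raises CycleError if the arc is illegal
    except CycleError:
        return None
    return extended


# Hypothetical basic block:
#   i1: a = load      i2: b = load
#   i3: c = a + 1     i4: d = b * 2     i5: e = c + d
ddg = {"i1": set(), "i2": set(), "i3": {"i1"}, "i4": {"i2"}, "i5": {"i3", "i4"}}

# Let b reuse a's register: a's last use (i3) must precede b's definition (i2).
reuse = add_reuse_arc(ddg, "i3", "i2")
print(schedule(reuse))
```

An arc that would contradict an existing dependence, such as forcing `i5` before `i1`, is rejected because the extended graph is no longer acyclic.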

Register allocation and spilling
Figure labels: virtual registers, physical registers, memory

Register allocation with reuse candidates
Figure labels: basic block; interference graph; definitely overlap; definitely NO overlap; possible overlap; data dependence graph


Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling
Mounira Bachir, Sid-Ahmed-Ali Touati, Albert Cohen
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size affects memory requirements and I-cache performance
Method: Strategically insert move operations, without increasing II, to split meeting graph components into smaller ones
Results: Good when enough functional units are available to perform the additional move operations; acceptable execution time

Periodic register allocation
Rotating register file

Periodic register allocation
Rotating register file
Move operations: d-1 MOVs per iteration, where d is the iteration span of variables

Periodic register allocation
Rotating register file
Move operations
Loop unrolling: 3 * code size

Periodic register allocation
Rotating register file
Move operations
Loop unrolling
Modulo variable expansion: a[i] b[i] c[i], a[i+1] b[i+1] c[i+1], a[i+2] b[i+2] c[i+2], using 9 registers instead of 8 (MAXLIVE = 8)
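The gap between 9 registers and MAXLIVE = 8 can be reproduced with a small sketch. The numbers below are assumptions chosen to match the slide, not values taken from it: an initiation interval of 3 cycles and three variables with a lifetime of 8 cycles each, starting at cycles 0, 1 and 2. The sketch also uses a simplified form of modulo variable expansion in which the kernel is unrolled by the largest per-variable copy count and every variable gets one register per unrolled copy.

```python
from math import ceil

II = 3                                   # assumed initiation interval
STARTS = {"a": 0, "b": 1, "c": 2}        # assumed definition cycles
LIFETIMES = {"a": 8, "b": 8, "c": 8}     # assumed lifetimes in cycles


def mve_unroll_and_registers(lifetimes, ii):
    """Simplified MVE: unroll by max ceil(lifetime / II), one register per copy."""
    unroll = max(ceil(lt / ii) for lt in lifetimes.values())
    return unroll, unroll * len(lifetimes)


def max_live(starts, lifetimes, ii):
    """Largest number of simultaneously live values in the steady state."""
    # Start sampling once every variable has reached its steady-state overlap.
    horizon = max(starts[v] + lifetimes[v] for v in lifetimes)
    best = 0
    for t in range(horizon, horizon + ii):
        live = 0
        for v in lifetimes:
            for i in range(t // ii + 1):              # iteration index
                born = starts[v] + i * ii
                if born <= t < born + lifetimes[v]:
                    live += 1
        best = max(best, live)
    return best


print(mve_unroll_and_registers(LIFETIMES, II), max_live(STARTS, LIFETIMES, II))
```

Under these assumptions the kernel is unrolled 3 times and uses 9 registers, one more than the 8 values that are ever live at once.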

Periodic register allocation
Rotating register file
Move operations
Loop unrolling
Modulo variable expansion
Meeting graph: lifetime in cycles; the lifetime interval of c ends when the interval of b begins

Meeting graph
Figure labels: a[i] b[i] c[i] through a[i+7] b[i+7] c[i+7] (eight consecutive iterations)
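A sketch of how a meeting graph leads to an unrolling factor of 8 follows. It assumes the usual formulation from the periodic register allocation literature: an edge goes from u to v when u's lifetime ends at the kernel cycle where v's begins, each circuit needs (sum of its lifetimes) / II registers, and the kernel is unrolled by the least common multiple of those counts. The lifetimes, start cycles and II are hypothetical values picked to match the slide, and the sketch assumes every variable has exactly one successor.

```python
from math import lcm

II = 3                                   # assumed initiation interval
STARTS = {"a": 0, "b": 1, "c": 2}        # assumed definition cycles
LIFETIMES = {"a": 8, "b": 8, "c": 8}     # assumed lifetimes in cycles


def meeting_successors(starts, lifetimes, ii):
    """Map each variable to the one whose lifetime begins where its own ends."""
    succ = {}
    for u in lifetimes:
        end_slot = (starts[u] + lifetimes[u]) % ii
        matches = [v for v in lifetimes if starts[v] % ii == end_slot]
        if len(matches) != 1:
            raise ValueError(f"sketch needs exactly one successor for {u}")
        succ[u] = matches[0]
    return succ


def circuits(succ):
    """Split a one-successor-per-node graph into its disjoint circuits."""
    seen, result = set(), []
    for start in succ:
        cycle, node = [], start
        while node not in seen:
            seen.add(node)
            cycle.append(node)
            node = succ[node]
        if cycle:
            result.append(cycle)
    return result


def unroll_and_registers(starts, lifetimes, ii):
    """Registers per circuit, then unroll by their least common multiple."""
    weights = [sum(lifetimes[v] for v in c) // ii
               for c in circuits(meeting_successors(starts, lifetimes, ii))]
    return lcm(*weights), sum(weights)


print(meeting_successors(STARTS, LIFETIMES, II))
print(unroll_and_registers(STARTS, LIFETIMES, II))
```

With these values the graph is a single circuit a, c, b of weight 8, so the allocation reaches MAXLIVE = 8 registers but needs 8 kernel copies. Splitting a circuit into smaller ones lowers the least common multiple, which is the lever the paper's move insertion targets.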

Circuit decomposition

2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION Main Conference

MAO – An Extensible Micro-Architectural Optimizer
Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google)
Micro-architectural: not always documented
Proprietary compilers are at an advantage!
Figure labels: SPEC2000 int loop; SPEC2000 int loop with NOP
Adding 1 NOP instruction: 7% less execution time

Micro-architectural: not always documented
Example: instruction decoding in Core 2 proceeds in chunks of 16 bytes
Figure labels: SPEC2000 int loop; SPEC2000 int loop with NOP; 16-byte alignment boundary
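The effect of a single NOP on 16-byte decode chunks can be sketched with plain address arithmetic. The loop address and size below are hypothetical, and the functions are illustrative helpers rather than part of MAO; the sketch only counts how many 16-byte chunks a loop body touches before and after padding.

```python
CHUNK = 16  # decode chunk size in bytes, as stated on the slide for Core 2


def decode_chunks(start, size, chunk=CHUNK):
    """Number of aligned chunks touched by code at [start, start + size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1


def padding_to_boundary(start, chunk=CHUNK):
    """Bytes of NOP padding needed to move start to the next chunk boundary."""
    return -start % chunk


loop_start, loop_size = 0x0F, 14          # hypothetical loop placement
pad = padding_to_boundary(loop_start)
print(decode_chunks(loop_start, loop_size),
      pad,
      decode_chunks(loop_start + pad, loop_size))
```

In this example a one-byte NOP moves the loop from straddling two chunks to fitting in one, so each iteration needs one decode chunk instead of two.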

Contributions and conclusions
1. Extensible assembly-to-assembly optimizer
2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level
3. Discovers micro-architectural details semi-automatically through generation of micro-benchmarks

Dynamic Register Promotion of Stack Variables
Jianjun Li, Chenggang Wu, Wei-Chung Hsu
Use DBT to let x86 binaries use the extra registers on x86-64
Recompiling is not always an option (legacy binaries)
Compute-intensive applications gain speed when using 64-bit
Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)

Language and Compiler Support for Auto-Tuning Variable-Accuracy Algorithms
Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
1. New programming language, toolchain and run-time environment
2. Technique for mapping variable-accuracy code to enable efficient auto-tuning
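The time versus accuracy trade-off can be illustrated without PetaBricks itself. The sketch below is plain Python, not PetaBricks syntax, and its names (`newton_sqrt`, `tune`) are hypothetical: an iterative algorithm exposes its iteration count as a tunable knob, and a toy tuner picks the cheapest setting that meets a requested accuracy on training inputs.

```python
import math


def newton_sqrt(x, iterations):
    """Approximate sqrt(x) for x >= 1; more iterations cost more and err less."""
    guess = x
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def tune(inputs, target_error, candidates):
    """Return the smallest iteration count whose worst relative error is acceptable."""
    for n in sorted(candidates):
        worst = max(abs(newton_sqrt(x, n) - math.sqrt(x)) / math.sqrt(x)
                    for x in inputs)
        if worst <= target_error:
            return n
    return None  # no candidate is accurate enough


training = [2.0, 10.0, 100.0]
loose = tune(training, 1e-2, range(1, 21))
tight = tune(training, 1e-9, range(1, 21))
print(loose, tight)
```

A looser accuracy target never needs more iterations than a tighter one, which is the kind of choice an auto-tuner can make per accuracy level.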

Practical Memory Checking with Dr. Memory
Derek Bruening (Google), Qin Zhao (MIT)
x86
Existing memory checking tools (e.g. Valgrind): slow, many false positives

A Trace-Based Java JIT Compiler Retrofitted from a Method-Based Compiler
Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, Toshio Nakatani (IBM)
Extend the compilation scope from methods to traces
Traces span multiple method invocations
More powerful than method inlining

Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting

Phase-Based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors
Tyler Sondag and Hridesh Rajan
Objective: Design and apply a transparent and fully automatic process, called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among the cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics exhibited by representative sections are then used to map the whole cluster
Results: 36% average process speedup with negligible overheads
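The method can be sketched in two steps. Everything concrete below is an assumption made for illustration and not taken from the paper: the static feature (fraction of memory instructions), the bucket width, the core names `big` and `little`, the IPC numbers, and the speedup threshold. The sketch groups sections by a static feature, then uses the measured behavior of one representative per cluster to place every section in that cluster.

```python
from collections import defaultdict


def cluster_sections(features, bucket=0.25):
    """Group code sections whose static feature falls in the same bucket."""
    clusters = defaultdict(list)
    for name, mem_ratio in features.items():
        clusters[int(mem_ratio / bucket)].append(name)
    return dict(clusters)


def assign_cores(clusters, measured_ipc, threshold=1.5):
    """Map each whole cluster using only its first member's measurements."""
    assignment = {}
    for members in clusters.values():
        representative = members[0]
        ipc = measured_ipc[representative]
        # Use the fast core only when the representative really benefits from it.
        core = "big" if ipc["big"] / ipc["little"] >= threshold else "little"
        for section in members:
            assignment[section] = core
    return assignment


# Hypothetical static features and runtime measurements.
features = {"fft": 0.10, "fir": 0.15, "copy": 0.80, "memset": 0.90}
measured = {"fft": {"big": 2.0, "little": 0.9},
            "copy": {"big": 0.6, "little": 0.55}}

print(assign_cores(cluster_sections(features), measured))
```

Only `fft` and `copy` are measured at run time; `fir` and `memset` inherit the decision of their cluster, which is what keeps the monitoring overhead low.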

Phase-based tuning

Vapor SIMD: Auto-Vectorize Once, Run Everywhere
Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rozen, Albert Cohen, Ayal Zaks
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix and match existing open compilation tools, namely GCC and MONO
Results: Comparable to specialized monolithic offline compilers
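The split between an offline and an online stage can be sketched as follows. This is a toy model and not the Vapor SIMD bytecode or API: the `vadd` operation, the tuple-based IR and both function names are hypothetical. The offline stage emits one target-independent vector operation, and the online stage lowers it for whatever lane count the target provides, with a lane count of 1 standing in for scalarization.

```python
def offline_vectorize():
    """Offline stage: target-independent vector IR for c[i] = a[i] + b[i]."""
    return [("vadd", "c", "a", "b")]


def jit_lower(ir, inputs, lanes):
    """Online stage: execute the IR in strips of `lanes` elements."""
    env = dict(inputs)
    for op, dst, x, y in ir:
        if op != "vadd":
            raise ValueError(f"unsupported op: {op}")
        out = []
        for base in range(0, len(env[x]), lanes):
            xs = env[x][base:base + lanes]      # one vector register's worth
            ys = env[y][base:base + lanes]
            out.extend(p + q for p, q in zip(xs, ys))
        env[dst] = out
    return env


ir = offline_vectorize()
data = {"a": list(range(1, 11)), "b": [10] * 10}
wide = jit_lower(ir, data, lanes=4)["c"]     # e.g. a 4-lane SIMD target
scalar = jit_lower(ir, data, lanes=1)["c"]   # target without SIMD
print(wide == scalar)
```

The same IR produces identical results for every lane count, including the partial final strip, which is the portability property the split scheme relies on.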

Vectorizing for different platforms

Split vectorization scheme

Interoperable compilation flows