Adaptive Reduced Bit-width Instruction Set Architecture (adapt-rISA)

Presentation transcript:

Adaptive Reduced Bit-width Instruction Set Architecture (adapt-rISA)
Sandro Neves Soares – UCS
Ashok Halambi – UCI
Aviral Shrivastava – ASU
Flávio Rech Wagner – UFRGS
Nikil Dutt – UCI
17th International Conference on Very Large Scale Integration
Compiler Microarchitecture Lab, Arizona State University

Introduction
Code size continues to be an extremely important concern for low-end embedded systems
– controllers in cars, TVs, refrigerators and music players
A larger code size can mean:
– the functionality cannot be executed at all
– a significant impact on system power and cost
The problem is growing more complex with the current trend of increasing software content in embedded systems
rISA (reduced bit-width ISA) is a popular solution to this code size problem
– two instruction sets, the "normal" and the "reduced bit-width"

Introduction
The advantages of rISA:
– significant code size reduction
– fewer fetches from the instruction memory
The benefits of rISA depend heavily on the application
– and on the design of the narrow instruction set
A single rISA cannot exploit the dynamically changing "working IS" of today's embedded applications
– a "reduced bit-width" ISA can have only a very limited number of opcodes

Introduction
Previous work suggested techniques to design the best rISA for an embedded application
– but they only solve the problem for architectures with a single "reduced bit-width" ISA
The focus now shifts to developing dual "reduced bit-width" ISAs for architectures such as ARM 11
– the different computational requirements inside a single application should be considered
Our approach, adapt-rISA, is the first effort to design "reduced bit-width" ISAs for multiple-rISA architectures

Outline
Introduction
rISA Architectural Feature
Related Work
Adaptive rISA
– Code Conversion
– Design Space Exploration
– Implementation Details
– Experiments and Results
Conclusion
Future Work

rISA Architectural Feature
A program compiled using rISA is composed of reduced and normal blocks
The role of a rISA compiler is to find the best rISA design and also the best rISA design configuration
Normal code (a) and reduced code (b) of a small section of the CRC32 program; reduction turns (a) into (b)
(a)
    lw $4,12($fp)
    addi $2,$4,-1
    move $4,$2
    sw $4,12($fp)
    lw $4,8($fp)
    addi $2,$4,1
    move $4,$2
    sw $4,8($fp)
(b)
    Change Mode Instruction
    lw_r $4,12($fp) | addi_r $2,$4,-1
    move_r $4,$2 | sw_r $4,12($fp)
    lw_r $4,8($fp) | addi_r $2,$4,1
    move_r $4,$2 | sw_r $4,8($fp)
    rISA_nop | Ch.Mode Instr.
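The CRC32 example above suggests a simple way to count the instruction-memory words fetched for a reduced block. The sketch below is a model read off that example, not a rule stated by the authors. It assumes one normal-width change-mode instruction on entry, two reduced instructions per 32-bit word, a reduced change-mode instruction on exit, and a rISA_nop when the slot count is odd. The function name is hypothetical.

```python
import math

def words_for_reduced_block(n_reduced: int) -> int:
    """32-bit words fetched for a block of n_reduced reduced instructions.

    Assumed layout (inferred from the CRC32 example, not from a specification):
      1 word   : normal-width change-mode instruction on entry
      n slots  : the reduced instructions, two 16-bit slots per word
      1 slot   : reduced change-mode instruction on exit
      padding  : one rISA_nop slot if the slot count is odd
    """
    slots = n_reduced + 1              # reduced instructions + exit change-mode
    return 1 + math.ceil(slots / 2)    # ceil() accounts for the rISA_nop padding

if __name__ == "__main__":
    for n in (2, 4, 5, 8):
        print(n, "normal words become", words_for_reduced_block(n), "words")
```

Under these assumptions the eight-instruction CRC32 fragment shrinks from 8 words to 6, and a run shorter than five instructions saves no fetches at all.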

rISA Architectural Feature
A rISA design specifies the number of bits in each bitfield
– rISA_4444: opcode(4 bits) – rs(4) – rt(4) – imm(4)
A rISA design configuration (rdc) specifies the different opcodes employed
– to increase code density, the rdc must include the most frequently encountered instructions
– for power reduction, the most executed instructions must be selected
addi instruction: normal (a) and reduced (b) using rISA design rISA_4444
(a) addi $2,$4,-1 (normal): opcode(6 bits) – rs(5) – rt(5) – imm(16)
(b) addi_r $2,$4,-1 (reduced): opcode(4 bits) – rs(4) – rt(4) – imm(4), 0000 –
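A minimal sketch of how an instruction might be packed into the rISA_4444 bitfields, and rejected when a field overflows. Only the field widths and their order come from the slide. The function names, the use of opcode value 0 for addi_r, the assumption that registers are already mapped to 4-bit numbers, and the signed two's-complement immediate are all illustrative assumptions; the slides do not say how immediates or registers are actually encoded.

```python
# Field widths for the rISA_4444 design as given on the slide.
WIDTHS = {"opcode": 4, "rs": 4, "rt": 4, "imm": 4}

def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable in `bits` bits of two's complement."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def pack_risa_4444(opcode: int, rs: int, rt: int, imm: int):
    """Return the 16-bit encoding, or None if any field overflows.

    Assumptions (not from the slides): registers are already mapped to
    4-bit numbers, and the immediate is a signed two's-complement value.
    """
    for name, value in (("opcode", opcode), ("rs", rs), ("rt", rt)):
        if not 0 <= value < (1 << WIDTHS[name]):
            return None
    if not fits_signed(imm, WIDTHS["imm"]):
        return None
    return (opcode << 12) | (rs << 8) | (rt << 4) | (imm & 0xF)

if __name__ == "__main__":
    # addi $2,$4,-1 with a placeholder opcode of 0: rs=$4, rt=$2, imm=-1
    print(pack_risa_4444(0, 4, 2, -1))   # fits in 16 bits
    print(pack_risa_4444(0, 4, 2, 40))   # immediate overflows, so None
```

The normal encoding spends 6 + 5 + 5 + 16 = 32 bits on the same instruction, so every instruction that passes this check occupies half the space.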

rdc A: sw, addu, la, sltu
rdc B: sw, addu, lw, move
Figure: the same code listing shown twice, once highlighting the instructions selected by rdc A and once those selected by rdc B
    subu $sp,$sp,40
    sw $31,32($sp)
    sw $fp,28($sp)
    sw $16,24($sp)
    move $fp,$sp
    sw $4,40($fp)
    sw $5,44($fp)
    jal __main
    move $16,$0
    lw $3,44($fp)
    addu $2,$3,4
    move $3,$2
    sw $3,44($fp)
    addu $2,$fp,20
    lw $4,0($3)
    addu $5,$fp,16
    move $6,$2
    jal crc32file
    or $16,$16,$2
    la $4,$LC0
    lw $5,16($fp)
    jal printf
    sltu $3,$0,$16
    move $2,$3
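To make the difference between the two configurations concrete, the following sketch counts how many instructions of the listing each rdc would mark as candidates. The opcode sequence and the two opcode sets are copied from the slide; the function names are hypothetical, and marking is only the first step, since later checks may still discard a marked instruction.

```python
# Opcode sequence of the 24-instruction listing, in program order.
LISTING = """subu sw sw sw move sw sw jal move lw addu move
sw addu lw addu move jal or la lw jal sltu move""".split()

RDC_A = {"sw", "addu", "la", "sltu"}
RDC_B = {"sw", "addu", "lw", "move"}

def mark(listing, rdc):
    """One boolean per instruction: True if its opcode belongs to the rdc."""
    return [op in rdc for op in listing]

def count_marked(listing, rdc):
    """Number of instructions the rdc marks as reduction candidates."""
    return sum(mark(listing, rdc))

if __name__ == "__main__":
    print("rdc A marks", count_marked(LISTING, RDC_A), "of", len(LISTING))
    print("rdc B marks", count_marked(LISTING, RDC_B), "of", len(LISTING))
```

On this fragment rdc A marks 11 of the 24 instructions and rdc B marks 17, even though both configurations contain four opcodes.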

Related Work
Shrivastava et al. present a DSE framework for rISA design aimed at improving code density
– the experiments employed various rISA designs: from 16 to 128 reduced opcodes
– the work shows that the rISA design rISA_4444 presents a good trade-off
– if a normal instruction cannot fit in a reduced instruction, it is discarded from reduction
– some other rISA designs solve this problem by adding special reduced instructions
It is shown that a conversion aimed at improving code density does not achieve the best results in energy reduction

Related Work
Shrivastava et al. detail various aspects of rISA designs:
– there can be only an even number of contiguous rISA instructions
– there should be a mechanism in software to specify the execution mode: the mx and rISA_mx instructions
When the processor is in rISA mode, the fetched code is assumed to contain two rISA instructions
– they are translated into normal instructions before execution
– only the decode logic needs to be modified
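The translation step can be sketched in software as follows, assuming the rISA_4444 field layout. The placement of the first instruction in the upper half-word, the signed immediate, and the tiny opcode table are assumptions made for illustration; none of them is specified in the slides, and all function names are hypothetical.

```python
def split_fetch(word: int):
    """Split a 32-bit fetched word into two 16-bit reduced instructions.

    Assumption: the first instruction sits in the upper half-word.
    """
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def decode_4444(half: int):
    """Unpack (opcode, rs, rt, imm) from a 16-bit rISA_4444 instruction."""
    imm = half & 0xF
    if imm & 0x8:                  # sign-extend the 4-bit immediate (assumed signed)
        imm -= 16
    return (half >> 12) & 0xF, (half >> 8) & 0xF, (half >> 4) & 0xF, imm

# Hypothetical opcode table: reduced opcode -> normal mnemonic.
OPCODES = {0: "addi", 1: "lw"}

def translate(word: int):
    """Expand one fetched word into two (mnemonic, rt, rs, imm) tuples."""
    expanded = []
    for half in split_fetch(word):
        opcode, rs, rt, imm = decode_4444(half)
        expanded.append((OPCODES[opcode], rt, rs, imm))
    return expanded

if __name__ == "__main__":
    print(translate((0x042F << 16) | 0x0421))
```

Because the expansion produces ordinary instruction fields, the rest of the pipeline is untouched, which matches the remark that only the decode logic changes.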

Adaptive rISA
Even a simple application probably includes distinct sections with different requirements
The idea behind adaptive rISA is that a divide-and-conquer rISA approach can be used
– previous work did not consider such granularity
Most of the software and hardware aspects behind the adapt-rISA solution are the same as those in rISA

Figure: an application made of Routine R1, Routine R2, Routine R3, Routine R4 and a Main Routine, shown before and after reduction
– rISA: a unique rdc for the entire application
– adapt-rISA: each routine is reduced using its own rdc (rdc A, rdc B, rdc C, rdc A, rdc C)

Adaptive rISA
A reduced set with fewer opcodes can encompass more instructions (to be reduced) in a given section
– fewer bits may be employed to specify the opcode
– rISA_4444 seems to be a good solution for these cases
Not all the initially marked instructions, as specified by the rdc, are actually reduced
– the main cause of discard is overflow
– the number of contiguous instructions is too small
– branches and jumps between normal and reduced blocks are not allowed
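The second discard cause, runs of contiguous marked instructions that are too small, can be illustrated with a short sketch. The threshold `min_len` and the function name are hypothetical, and the overflow and branch/jump causes are deliberately not modelled here.

```python
from itertools import groupby

def reduced_blocks(flags, min_len):
    """Return (start, length) for each run of marked instructions that is kept.

    flags   : one boolean per instruction, True if its opcode is in the rdc
    min_len : hypothetical threshold; shorter runs are discarded
    """
    blocks, position = [], 0
    for is_marked, run in groupby(flags):
        length = len(list(run))
        if is_marked and length >= min_len:
            blocks.append((position, length))
        position += length
    return blocks

if __name__ == "__main__":
    # Three marked, one unmarked, six marked, one unmarked, one marked.
    flags = [True] * 3 + [False] + [True] * 6 + [False, True]
    print(reduced_blocks(flags, min_len=4))
```

With a threshold of four, only the six-instruction run survives; the four marked instructions in the two shorter runs stay in normal mode.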

Adaptive rISA
rISA_8ops seems to be a good solution for adapt-rISA
Figure: discard of instructions in the qsort program, r_8ops (left) versus r_4444 (right)

Code Conversion
INPUT: application's assembly code produced by gcc
PARAMETERS: rISA design and rISA design configuration
    if (mips.usingRISA()) {
        mips.rISA.mapRegisters();
        mips.rISA.markCandidates();
        mips.rISA.isPossibleToReduceCandidates();
        mips.rISA.discardSmallBlocks();
        while (mips.rISA.treatBranchesAndJumps())
            mips.rISA.discardSmallBlocks();
        mips.rISA.countFinalBlocks();
        mips.rISA.translateToRISAstep1();
        mips.rISA.translateToRISAstep2();
        mips.rISA.generateFinalCode(output);
    }

Design Space Exploration
Our DSE process focuses on the dynamic aspects of the execution
The application is executed with a small dataset to get its execution profile
The different opcodes of the marked instructions are identified and stored
A DSE process is triggered using combinations of these opcodes (8 or 16 each time) to seek improved results for:
– total number of reduced instructions
– average block size
– total number of blocks
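An exhaustive version of this exploration can be sketched as below. It is a simplification, not the authors' tool: the profile is treated as a flat sequence of opcodes, runs shorter than a hypothetical `min_len` are dropped, and candidates are ranked by total reduced instructions first and average block size second. All names are illustrative.

```python
from itertools import combinations, groupby

def score(trace, rdc, min_len=2):
    """Return (total reduced instructions, average block size, number of blocks)."""
    flags = [op in rdc for op in trace]
    runs = [len(list(group)) for is_marked, group in groupby(flags) if is_marked]
    kept = [length for length in runs if length >= min_len]
    total = sum(kept)
    average = total / len(kept) if kept else 0.0
    return total, average, len(kept)

def explore(trace, k):
    """Try every combination of k opcodes and return the best one with its score."""
    opcodes = sorted(set(trace))
    best = max(combinations(opcodes, k),
               key=lambda combo: score(trace, set(combo))[:2])
    return set(best), score(trace, set(best))

TRACE = "lw addi move sw lw addi move sw jal or".split()

if __name__ == "__main__":
    print(explore(TRACE, 4))
```

Exhaustive enumeration is only practical for small opcode sets like this toy trace; the number of combinations of 8 or 16 opcodes grows quickly with the size of the instruction set.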

Design Space Exploration
The most promising combinations are used to form a rISA design and configuration database
– each record of this database is applied to the application using the conversion-to-rISA algorithm
– the application is then executed
The granularity of this DSE is changed to the application's individual routines to support adapt-rISA
– the result is a set of different rISA design configurations
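The effect of moving the granularity from the whole application to individual routines can be illustrated with a toy comparison. Note that this sketch swaps in a simple frequency ranking for the database-driven DSE described above, and the routine names and opcode sequences are invented for the example.

```python
from collections import Counter

def top_k(opcodes, k):
    """The k most frequent opcodes, ties broken alphabetically for determinism."""
    counts = Counter(opcodes)
    ranked = sorted(counts, key=lambda op: (-counts[op], op))
    return set(ranked[:k])

def marked(opcodes, rdc):
    """Number of instructions whose opcode belongs to the rdc."""
    return sum(op in rdc for op in opcodes)

def compare(routines, k):
    """Return (marked with one rdc for everything, marked with one rdc per routine)."""
    everything = [op for body in routines.values() for op in body]
    single = top_k(everything, k)
    single_total = sum(marked(body, single) for body in routines.values())
    adaptive_total = sum(marked(body, top_k(body, k)) for body in routines.values())
    return single_total, adaptive_total

# Invented routines with different "working" instruction sets.
ROUTINES = {
    "R1": ["lw", "lw", "addu", "sw"],
    "R2": ["sll", "sll", "xor", "xor", "or"],
}

if __name__ == "__main__":
    print(compare(ROUTINES, 2))
```

With two opcodes per configuration, a single rdc marks 4 of the 9 instructions while per-routine configurations mark 7, because each routine gets the opcodes it actually uses.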

Implementation Details
Some additional software and hardware aspects are needed for adapt-rISA:
– the mx instruction carries the rISA design configuration identifier as an immediate value
– the translation unit must receive this rISA design configuration identifier as an input
– it may also store the translation information partitioned into smaller, independent sub-units
Adapt-rISA improves power not only by reducing the number of fetches, but also during the translation process
A framework for design space exploration of embedded processors has been used in this work: T&D-Bench
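A toy software model of these two additions, assuming one translation sub-table per rISA design configuration. The class, its method names and the opcode numbering are hypothetical; only the idea that mx passes a configuration identifier to the translation unit comes from the slide.

```python
class TranslationUnit:
    """Toy model of a translation unit with one sub-table per rdc."""

    def __init__(self, tables):
        # tables: {config_id: {reduced_opcode: normal_mnemonic}}
        self.tables = tables
        self.active = None              # None means normal (non-rISA) mode

    def mx(self, config_id):
        """Model of the mode-change instruction; config_id is its immediate."""
        if config_id not in self.tables:
            raise KeyError(f"unknown rISA design configuration {config_id}")
        self.active = config_id

    def leave(self):
        """Model of the switch back to normal mode."""
        self.active = None

    def translate(self, reduced_opcode):
        """Look up a reduced opcode in the sub-table of the active rdc only."""
        if self.active is None:
            raise RuntimeError("not in rISA mode")
        return self.tables[self.active][reduced_opcode]

# Opcode numbering is invented; the opcode sets are rdc A and rdc B.
TABLES = {
    0: {0: "sw", 1: "addu", 2: "la", 3: "sltu"},    # rdc A
    1: {0: "sw", 1: "addu", 2: "lw", 3: "move"},    # rdc B
}

if __name__ == "__main__":
    unit = TranslationUnit(TABLES)
    unit.mx(1)
    print(unit.translate(2))
    unit.mx(0)
    print(unit.translate(2))
```

Only the sub-table selected by the identifier is consulted for each lookup, which is one plausible reading of how partitioned translation information could also help power during translation.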

Experiments and Results
The methods and tools described were used to experiment with the bitcount, CRC32, qsort and stringsearch programs
– first, the experiments were executed using these programs individually and, afterwards, grouped
In the experiments, we present the following metrics:
– number of fetches (main metric)
– percentage of instructions actually reduced
– average size of the reduced blocks
– total number of reduced blocks
– application's code size reduction (informative only)
Each application was reduced using adapt-rISA and also the (one) optimal rdc for each individual program

Experiments and Results
Figure: results for (1) number of fetches, (2) percentage of instructions actually reduced, (3) average size of the reduced blocks, (4) total number of reduced blocks, (5) application's code size reduction

Experiments and Results
In general, adapt-rISA achieves better results: fewer fetches and better values in the code compression metrics
– in four of the six applications there were fewer fetches, with reductions ranging from 2% to 7%
– the total number of reduced instructions was always larger with adapt-rISA: the average improvement was 19%
– in 5 applications, the average size of the reduced blocks was improved by adapt-rISA
These results were obtained using the new design rISA_8ops
Experiments were validated by comparing the result(s) obtained on the host platform with the corresponding result(s) produced by the simulator

Conclusion
adapt-rISA presents better results in almost all the applications, and for most of the metrics
– for the main code compression metric, the average improvement was 19%
– concerning the fetch requests, there were up to 7% fewer fetches
This work also described a new rISA design

Future Work
The work focused mainly on DSE for rISA design configuration
– the path is open for a DSE focused on different rISA designs
The definition of a more robust heuristic to find the best rISA design and configuration
The hardware implementation of the adapt-rISA translation unit
Evaluation using other embedded applications

Thank you! Questions?