Improving Program Efficiency by Packing Instructions Into Registers

Improving Program Efficiency by Packing Instructions Into Registers Hines, Green, Tyson, and Whalley, Florida State University (ISCA '05)

Motivation Code size is crucial for embedded systems. Power: instruction-fetch logic consumes approximately 36% of total processor power on a StrongARM. Reducing code size -> power savings, but there is a trade-off between code size, power, and execution time. Code compression and voltage scaling reduce power requirements, but may increase execution time.

Solution proposed Claim – no increase in power or execution time. Some instructions are referenced much more frequently than others, so the frequently referenced instructions are stored in a special register file inside the pipeline. The original code is then converted into pointers to this instruction register file and needs less space in memory. Two new SRAM structures: an Instruction Register File (IRF) to hold common instructions, and an Immediate Table (IMM) to hold the most common immediate values.

Vision for benefit/performance Observation – instruction redundancy (data from MiBench). Static instruction redundancy -> smaller code size. Dynamic instruction redundancy -> lower power. Another observation – redundancy in immediate values. Introduce the IMM, which adds a further level of indirection for storing the common immediate values.

About the IRF and the IMM The IRF can be integrated into the pipeline either at the end of fetch or at the start of decode. They claim that this doesn't add any latency. They also claim that the IRF can be modified to hold partially decoded instructions and be accessed over multiple cycles. If the I-cache fetch hits a packed instruction, all the instructions it points to are written into the instruction buffer.
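As a rough sketch of that expansion step (this is an illustration, not the paper's hardware; the entry contents and the reserved-index convention are assumptions), a packed instruction's slot indices expand through the IRF into the instruction buffer like this:

```python
# Hypothetical sketch of IRF expansion at fetch/decode (not the paper's RTL).
# The IRF maps small indices to full instructions; index 0 is assumed to be
# reserved for nop, which terminates a pack early.

def expand_packed(irf, slot_indices):
    """Expand a packed instruction's IRF slot indices into the
    instruction buffer, stopping at the reserved nop index (0)."""
    buffer = []
    for idx in slot_indices:
        if idx == 0:  # nop index signals the end of the pack
            break
        buffer.append(irf[idx])
    return buffer

irf = {0: "nop", 1: "addiu $sp,$sp,-32", 2: "sw $ra,28($sp)"}
assert expand_packed(irf, [1, 2, 0, 0, 0]) == ["addiu $sp,$sp,-32",
                                               "sw $ra,28($sp)"]
```

The key point the slide makes is that this lookup sits beside the normal fetch path, so a single I-cache access can yield several instructions.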

Packing of instructions (I) The MIPS ISA is modified to produce two types of packing – 'loose' and 'tight'. Loose – an IRF instruction pointer is added to a standard instruction, eliminating the shamt field and shifting the function field. Impact on original instructions – shamt and lui. For standard MIPS instructions, the inst field points to a nop in the IRF.

Packing of instructions (II) Tight – all fields of the original instruction are replaced with pointers. Can support from one to five instructions from the IRF. A nop is used to signal the hardware to stop executing the pack. Loosely packed instructions were designed because often only one IRF entry is detected in a stretch of code.

Parameterization of immediate values A level of indirection used to decrease code size. Observation – 51.9% of static instructions and 43.4% of dynamic instructions are I-type. A tightly packed instruction can contain up to two parameterized immediate values, and these can point to any entry in the IMM. Any instruction can use any parameter; the opcode plus an S bit decide this. Every instruction also has a default immediate value.
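The default-vs-parameter choice can be sketched as a small lookup (the table contents and function name here are illustrative assumptions, not the paper's):

```python
# Sketch of immediate resolution with the IMM table. Whether an IRF
# entry's immediate comes from a pack parameter or from its default
# value is decided by the opcode plus an S bit.

IMM = [0, 1, 4, 8, -1]  # illustrative contents of the common-immediate table

def resolve_immediate(s_bit, param_index, default_value, imm_table=IMM):
    """S bit set: take the immediate from the IMM table via the pack's
    parameter; otherwise fall back to the entry's default immediate."""
    return imm_table[param_index] if s_bit else default_value

assert resolve_immediate(1, 2, 100) == 4    # parameterized: IMM[2]
assert resolve_immediate(0, 2, 100) == 100  # default value wins
```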

Positional Registers Based on the observation that code often does the same thing but with different registers. E.g., imagine a common code snippet – load <VarReg>, <imm>; add <FixReg>, <VarReg>, 5. They use a scheme where registers can be referred to as s[<pos>] and u[<pos>]. The code now becomes add <FixReg>, s[0], 5. This increases instruction redundancy and allows denser packing.
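The effect can be shown with a toy canonicalizer: rename registers to positional tokens so that code differing only in register names becomes identical. This simplifies the paper's scheme (which distinguishes set `s[]` and use `u[]` positions); here every register gets an `s[]` token in order of first appearance:

```python
# Illustrative positional-register canonicalization: the first register
# seen in a window becomes s[0], the next s[1], and so on, so
# register-renamed code hashes to the same canonical form.

import re

def positionalize(insns):
    mapping, out = {}, []
    for insn in insns:
        def sub(m, mapping=mapping):
            r = m.group(0)
            mapping.setdefault(r, f"s[{len(mapping)}]")
            return mapping[r]
        out.append(re.sub(r"\$\w+", sub, insn))
    return out

a = positionalize(["lw $t0, 0($sp)", "add $t1, $t0, $t0"])
b = positionalize(["lw $t4, 0($sp)", "add $t5, $t4, $t4"])
assert a == b  # register-renamed code canonicalizes identically
```

Two snippets that previously needed two IRF entries now share one, which is exactly the redundancy increase the slide claims.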

Special care for branches Branches can be packed in either the loosely or tightly packed instructions, but need special handling. Observation – 63.2% of static branches and 39.8% of dynamic branches can be represented with a 5-bit displacement. So, instead of picking the jump offset from the IMM, it is directly encoded as a parameter.
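The 5-bit test is simple arithmetic; assuming a signed interpretation (my assumption), the field covers displacements in [-16, 15]:

```python
# A 5-bit signed field covers displacements in [-16, 15], which the
# paper reports suffices for 63.2% of static branches.

def fits_5bit_signed(disp):
    return -16 <= disp <= 15

assert fits_5bit_signed(-16) and fits_5bit_signed(15)
assert not fits_5bit_signed(16)
```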

Compiler Modifications The compiler used is a port of VPO (Very Portable Optimizer) for MIPS. Compilation is done twice on the source program: the first pass compiles and forms a profiling executable, then an analysis tool takes the profile and selects instructions for the IRF. After selection, packing is done. Claim – future work could remove the need for profiling by letting the compiler approximate dynamic frequencies from loop nesting depths and the whole-program CFG.

Instruction selection for IRF (I) Done using a heuristic: a greedy algorithm selects the most frequently occurring instructions as well as the default immediate values.
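A minimal version of that greedy step is just a frequency ranking (the 32-entry size and reserved nop slot match the paper's IRF; the rest of this function is a simplification, ignoring parameterization weighting):

```python
# Toy greedy IRF selection: rank canonicalized instructions by profiled
# frequency and keep the top irf_size - 1, reserving one slot for nop.

from collections import Counter

def select_irf(profile, irf_size=32):
    counts = Counter(profile)
    chosen = [insn for insn, _ in counts.most_common(irf_size - 1)]
    return ["nop"] + chosen

profile = ["addu"] * 5 + ["lw"] * 3 + ["sw"] * 3 + ["jr"]
irf = select_irf(profile, irf_size=4)
assert irf[0] == "nop" and set(irf[1:]) == {"addu", "lw", "sw"}
```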

Instruction selection for IRF (II) Some problems – Commutative operators – the same to software but different to hardware; the compiler always reorders to one canonical format. Equivalent instructions – e.g., mov is the same as add with 0 or or with 0; the compiler always emits addi with zero, for maximum reuse.
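Both canonicalizations above can be sketched in a few lines (the operand-sorting rule and the move rewrite are illustrative choices consistent with the slide, not the compiler's exact code):

```python
# Sketch of the canonicalizations: commutative operands are sorted into
# one fixed order, and move is rewritten as addi with zero, so
# equivalent forms share a single IRF entry.

COMMUTATIVE = {"add", "addu", "and", "or", "xor"}

def canonicalize(op, operands):
    if op == "move":                      # move rd, rs == addi rd, rs, 0
        return "addi", [operands[0], operands[1], "0"]
    if op in COMMUTATIVE:
        dest, src1, src2 = operands
        return op, [dest] + sorted([src1, src2])
    return op, operands

assert canonicalize("move", ["$t0", "$t1"]) == ("addi", ["$t0", "$t1", "0"])
assert canonicalize("add", ["$t0", "$t2", "$t1"]) == \
       canonicalize("add", ["$t0", "$t1", "$t2"])
```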

After profiling (the process of packing) Each instruction is marked as present in the IRF, present in the IRF with parameterization, or not present in the IRF. Then packing is done per basic block by a sliding-window algorithm. Unused slots are filled with nops. The pass is repeated for each block containing an unpacked branch.
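A toy version of that per-block pass makes the mechanics concrete. This greedy variant (my simplification: it emits a tight pack for any run of IRF-resident instructions, whereas the real compiler would prefer a loose pack for short runs) slides over the block grouping up to five IRF indices at a time:

```python
# Toy per-block packing pass: slide over a basic block, greedily
# grouping runs of IRF-resident instructions (up to five per tight
# pack) and padding short packs with nop indices (0).

MAX_SLOTS = 5

def pack_block(block, irf_index):
    """irf_index maps instruction -> IRF slot; block is a list of insns."""
    packed, run = [], []
    def flush():
        if run:
            packed.append(("tight", run + [0] * (MAX_SLOTS - len(run))))
            run.clear()
    for insn in block:
        if insn in irf_index and len(run) < MAX_SLOTS:
            run.append(irf_index[insn])
        else:
            flush()
            if insn in irf_index:
                run.append(irf_index[insn])
            else:
                packed.append(("normal", insn))
    flush()
    return packed

idx = {"a": 1, "b": 2, "c": 3}
out = pack_block(["a", "b", "x", "c"], idx)
assert out == [("tight", [1, 2, 0, 0, 0]),
               ("normal", "x"),
               ("tight", [3, 0, 0, 0, 0])]
```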

An example with everything

Results (I) – Code Size Average code size was reduced to: 83.23% with just packing, 81.70% with parameterization added, 81.09% with positional registers added.

Results (II) – Energy and execution time An average reduction of 37% in fetch energy. Energy savings come from two sources – an increased fetch rate (allowing applications to complete in fewer cycles) and fewer accesses to the I-cache and memory. Average execution-time savings were 5.04%.

Results (III)

Issues pointed out by the paper itself Preservation of state (IRF and IMM) during context switches – can be done using a routine associated with each process; state only needs to be restored, not saved. Handling of exceptions in the middle of a packed instruction when structural/data hazards prevent all instructions from being issued simultaneously – one or more instructions might have completed, and a bitmask of retired instructions can help.
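The bitmask idea can be sketched in a couple of lines (names and bit convention are my assumptions for illustration):

```python
# Sketch of the retired-instruction bitmask: on an exception mid-pack,
# record which slots already retired; on re-execution, skip them.

def resume_pack(slots, retired_mask):
    """Return the slot indices still to execute after an exception,
    given a bitmask where bit i set means slot i already retired."""
    return [s for i, s in enumerate(slots) if not (retired_mask >> i) & 1]

# slots 0 and 1 retired before the exception; only slots 2..4 re-issue
assert resume_pack([10, 11, 12, 0, 0], 0b00011) == [12, 0, 0]
```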

Other questions Not tested on a real pipeline – they've just used "Verilog models for simple pipelines". Latency issues with the integration of the IRF+IMM in the pipeline. What are the reasons for the differences in redundancy and compression across benchmarks? What about compilation time? You have to compile twice.

Thank You. Questions?

Backup

Comparisons with other similar works <Table 4.> Methods are – microcode, procedural abstraction, L0, echo instruction, zero-loop-overhead buffer, codewords/dictionaries, dual instruction sets/augmenting instructions, heads and tails. Many of these techniques are complementary to the IRF.