Stephen Hines, David Whalley and Gary Tyson Computer Science Dept.

Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers Stephen Hines, David Whalley and Gary Tyson Computer Science Dept. Florida State University October 23, 2006

Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)
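To make the tight/loose distinction concrete, here is a minimal sketch of a greedy packer over one basic block. It is not the authors' algorithm: the function name `pack_block`, the tuple encoding of the result, and the greedy policy are all illustrative assumptions. The only figure taken from the talk is the limit of five IRF references per tightly packed instruction. The sketch also ignores parameters and the fact that some instructions cannot carry a loose reference.

```python
MAX_TIGHT_SLOTS = 5  # from the talk: up to 5 IRF references per tight pack


def pack_block(instructions, irf):
    """Greedily pack one basic block (illustrative only).

    instructions: list of instruction strings in program order.
    irf: dict mapping instruction string -> IRF index (hypothetical contents).
    Returns a list of ("tight", [idx, ...]), ("loose", insn, idx)
    or ("plain", insn) tuples.
    """
    out = []
    i = 0
    n = len(instructions)
    while i < n:
        insn = instructions[i]
        if insn in irf:
            # Collect a run of consecutive IRF-resident instructions.
            run = []
            while i < n and instructions[i] in irf and len(run) < MAX_TIGHT_SLOTS:
                run.append(irf[instructions[i]])
                i += 1
            out.append(("tight", run))
        elif i + 1 < n and instructions[i + 1] in irf:
            # Piggyback the following IRF instruction onto this one.
            out.append(("loose", insn, irf[instructions[i + 1]]))
            i += 2
        else:
            out.append(("plain", insn))
            i += 1
    return out
```

For example, with `irf = {"a": 0, "b": 1}`, the block `["x", "a", "a", "b", "y"]` becomes one loosely packed `x`, one tight pack of two references, and a plain `y`: five instructions fetched as three.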

Execution of IRF Instructions
[Figure: Executing a Tightly Packed Param4c Instruction. Labels: Instruction Fetch Stage; First Half of Instruction Decode Stage; PC; Instruction Cache; IF/ID; packed instruction; IRF; IRWP; IMM; insn1, insn2, insn3, imm3, insn4; To Instruction Decoder.]

Outline
- Introduction
- Improved Promotion to the IRF
- Compiler Optimizations
  - Instruction Selection
  - Register Re-assignment
  - Instruction Scheduling
- Experimental Evaluation
- Conclusions & Future Work

Improved Promotion to the IRF
- Different classes of instructions can consume 1–5 slots
- More accurately model the benefits of promoting from one class of instruction to another
- Original IRF papers did not promote multiple I-type instructions with different default immediate values
  - addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the IRF, no matter how frequently they occurred

Mixed Profiling
- Static profiling is best for decreasing code size
- Dynamic profiling is best for reducing energy consumption
- Can simultaneously weight static and dynamic profile data to obtain a mixed result that has both good code compression and reduced energy consumption
- Can obtain most of the benefits of individual static/dynamic profiling
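One way to read "simultaneously weight static and dynamic profile data" is as a linear blend of the two normalized frequency profiles. The sketch below assumes exactly that. The blend formula, the function names `mixed_scores` and `select_for_irf`, the `static_weight` parameter, and the default IRF size of 32 are illustrative assumptions rather than the weighting used in the talk.

```python
def mixed_scores(static_counts, dynamic_counts, static_weight=0.5):
    """Blend normalized static and dynamic frequencies per instruction.

    static_counts:  dict insn -> number of occurrences in the binary.
    dynamic_counts: dict insn -> number of times executed in a profile run.
    static_weight:  hypothetical knob; 1.0 is purely static, 0.0 purely dynamic.
    """
    total_s = sum(static_counts.values()) or 1
    total_d = sum(dynamic_counts.values()) or 1
    scores = {}
    for insn in set(static_counts) | set(dynamic_counts):
        s = static_counts.get(insn, 0) / total_s
        d = dynamic_counts.get(insn, 0) / total_d
        scores[insn] = static_weight * s + (1.0 - static_weight) * d
    return scores


def select_for_irf(scores, irf_size=32):
    """Pick the highest-scoring instructions; ties are broken by name."""
    return sorted(scores, key=lambda k: (-scores[k], k))[:irf_size]
```

Normalizing each profile before blending keeps a long-running benchmark from swamping the static counts. Whether the actual promotion heuristic normalizes this way is not stated on the slide.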

Compiler Optimizations
- Instruction Selection
  - Choose beneficial encodings for increasing redundancy
- Register Re-assignment
  - Attempts to rename registers such that instructions can be accessed via the IRF
- Instruction Scheduling
  - Intra-block: focus on reordering instructions so that dense packs are formed (both tight and loose)
  - Inter-block: attempt to move instructions between blocks to fill up packs ending with branches/jumps
    - Code duplication
    - Predication

Intra-block Instruction Scheduling
[Figure: Instruction Dependence DAG over instructions 1, 2, 3, 4, 4', 5, with the resulting instruction packs shown for two cases: Without Instruction Scheduling and With Instruction Scheduling.]
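The idea of reordering within a block so that IRF-resident instructions end up adjacent can be sketched as a small list scheduler over the dependence DAG. This is an illustrative heuristic, not the scheduler from the talk: the function name `schedule_for_packing`, its arguments, and the "extend the current run" preference are all assumptions.

```python
def schedule_for_packing(insns, preds, in_irf):
    """Topologically reorder a block to keep IRF instructions adjacent.

    insns:  list of instruction ids in original program order.
    preds:  dict id -> set of ids that must execute first (dependence DAG).
    in_irf: set of ids whose instruction is resident in the IRF.
    """
    remaining = list(insns)
    done = set()
    order = []
    last_kind = None  # True if the previously emitted instruction is in the IRF
    while remaining:
        # Instructions whose predecessors have all been emitted.
        ready = [x for x in remaining if preds.get(x, set()) <= done]
        if not ready:
            raise ValueError("dependence graph contains a cycle")
        # Prefer extending the current run (IRF after IRF, non-IRF after
        # non-IRF); otherwise fall back to original program order.
        pick = next((x for x in ready if (x in in_irf) == last_kind), ready[0])
        order.append(pick)
        done.add(pick)
        remaining.remove(pick)
        last_kind = pick in in_irf
    return order
```

With no dependences and IRF-resident instructions 1, 3 and 5 interleaved with 2 and 4, the sketch emits 1, 3, 5, 2, 4, so the three IRF references can share one tight pack instead of being split across three fetches.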

Code Duplication to Reduce Code Size
[Figure: control-flow graph with blocks W, X, Y and Z, showing the packs before and after duplication.]
- 6 slots is too many to fit in a single packed instruction …
- … but we can duplicate a single instruction …
- … resulting in the ability to pack the remaining 5 slots together.

Predication – Forward Branches
[Figure: blocks X, Y and Z with a conditional branch; labels: Cond Branch, Fall-through, Branch taken path.]
- Instructions packed after forward branches will only be executed when the branch is not taken

Predication – Backward Branches
[Figure: loop body with a backward branch; labels: Branch, Branch offset.]
- Instructions packed after backward branches will only be executed when the branch is taken

Predication Advantages with IRF
- IRF facilitates a form of predication for the MIPS, a baseline architecture that traditionally does not support predication
- No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reduced code size with Thumb and Thumb2)
- No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)

Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
  - Out-of-order, single-issue embedded machine with 8KB 4-way set-associative L1 instruction and data caches and a 128-entry bimodal branch predictor
- Wattch/Cacti extensions for modeling energy consumption (inactive portions of the pipeline dissipate only 10% of normal energy when using cc3 clock gating)
- VPO (Very Portable Optimizer) targeted for SimpleScalar MIPS/PISA

Energy Consumption [chart]

Static Code Size [chart]

IRF Promotion with Mixed Profiling [chart]

Conclusions & Future Work
- Compiler optimizations targeted specifically for IRF can further reduce energy (12.2% → 15.8%), code size (16.8% → 28.8%) and execution time
- Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques

Tightly Packed Instruction Format
- New opcodes for this T-format of MISA instructions
- Supports sequential execution of up to 5 RISA instructions from the IRF
  - Unnecessary fields are padded with nop
- Supports up to 2 parameters replacing instruction slots
  - Parameters can come from the 32-entry IMM
  - Each IRF entry also retains a default immediate value
  - Branches use these 5 bits for displacements
  - R-type RISA instructions can use a parameter to replace the RD field
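A bit-level sketch can make the format easier to picture. The slide gives five slots of 5 bits each. The rest of the layout below is an assumption made for illustration: a 6-bit opcode in the top bits, the five fields next, one leftover low bit called `spare`, and IRF entry 0 acting as nop. The names `encode_tight` and `decode_tight` are hypothetical.

```python
FIELD_BITS = 5   # from the talk: each slot is a 5-bit field
NUM_FIELDS = 5   # from the talk: up to 5 slots per tight pack
NOP_INDEX = 0    # assumption: IRF entry 0 holds nop


def encode_tight(opcode, fields, spare=0):
    """Pack an assumed 6-bit opcode, five 5-bit fields and one spare bit."""
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    if len(fields) > NUM_FIELDS:
        raise ValueError("too many slots for one tight pack")
    # Unused slots are padded with nop, as on the slide.
    padded = list(fields) + [NOP_INDEX] * (NUM_FIELDS - len(fields))
    word = opcode
    for f in padded:
        if not 0 <= f < (1 << FIELD_BITS):
            raise ValueError("field must fit in 5 bits")
        word = (word << FIELD_BITS) | f
    return (word << 1) | (spare & 1)


def decode_tight(word):
    """Inverse of encode_tight: returns (opcode, fields, spare)."""
    spare = word & 1
    word >>= 1
    fields = []
    for _ in range(NUM_FIELDS):
        fields.append(word & ((1 << FIELD_BITS) - 1))
        word >>= FIELD_BITS
    fields.reverse()
    return word & 0x3F, fields, spare
```

In this sketch a parameter is simply another 5-bit value placed in a slot (an IMM index, a branch displacement, or a replacement RD), which is why the encoder does not distinguish instruction slots from parameter slots.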

MIPS Instruction Format Modifications
- Creating loosely packed instructions
  - R-type: removed shamt field and merged with rs
  - I-type: shortened immediate values (16-bit → 11-bit)
    - lui now uses 21-bit immediate values, hence no loose packing
  - J-type: unchanged
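These format changes imply a simple eligibility check for loose packing, sketched below. Several points are assumptions rather than statements from the slide: that the 11-bit immediate is a signed two's-complement value, that an unchanged J-type format leaves no room for an IRF reference, and the names `fits_signed` and `can_loosely_pack`.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement field."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def can_loosely_pack(kind, imm=0):
    """Rough eligibility check for carrying a loose IRF reference.

    kind: "R", "I", "lui" or "J" (hypothetical labels for this sketch).
    imm:  immediate operand for I-type instructions.
    """
    if kind == "R":
        return True  # shamt merged with rs frees the needed bits
    if kind == "I":
        return fits_signed(imm, 11)  # immediate shortened from 16 to 11 bits
    return False  # lui (21-bit immediate) and J-type have no room
```

Under these assumptions an `addi` with immediate 4 can carry a loose reference, while one with immediate 5000 cannot.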