A Decompression Architecture for Low Power Embedded Systems Lekatsas, H.; Henkel, J.; Wolf, W.; Computer Design, 2000. Proceedings. 2000 International.


A Decompression Architecture for Low Power Embedded Systems Lekatsas, H.; Henkel, J.; Wolf, W.; Proceedings of the 2000 International Conference on Computer Design (ICCD), IEEE, 2000. Presented by Yi-hsin Tseng, 11/06/2007

Outline
 Introduction & motivation
 Code compression architecture
 Decompression engine design
 Experimental results
 Conclusion & contributions of the paper
 Relation to CSE520
 Our project
 Q & A

Introduction & motivation

For embedded systems Embedded-system architectures are growing more complicated nowadays, while the available memory space stays small. A reduced executable program can also indirectly shrink the chip's…  Size  Weight  Power consumption

Why code compression/decompression? Compress the instruction segment of the executable running on the embedded system…  Reduces memory requirements and bus-transaction overheads  Compression is done offline; decompression happens at runtime

Related work on compressed instructions A logarithmic-based compression scheme in which 32-bit instructions map to fixed- but smaller-width compressed instructions  (considers the memory area only) Frequently appearing instructions are compressed to 8 bits  (fixed lengths of 8 or 32 bits)

The compression method in this paper Gives comprehensive results for the whole system, including  buses  memories (main memory and cache)  decompression unit  CPU

Code Compression Architecture

Architecture in this system (post-cache) Why place the decompression engine after the cache?  Increases the effective cache size  Improves instruction bandwidth

Code Compression Architecture Use SAMC (Semiadaptive Markov Compression) to compress instructions Divide instructions into 4 groups  based on the SPARC architecture  a short 3-bit code is prepended to each compressed instruction
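SAMC is a two-pass scheme: a first pass over the program builds a fixed Markov model of the instruction bits, and a second pass codes the bits against that model. A minimal sketch of the semiadaptive idea, assuming a simplified order-1 bit model and using the ideal (entropy) code length in place of the paper's actual arithmetic coder:

```python
from math import log2

def build_model(bits):
    # First (semiadaptive) pass: count transitions prev_bit -> next_bit,
    # with Laplace smoothing so no probability is ever zero.
    counts = {0: [1, 1], 1: [1, 1]}
    prev = 0
    for b in bits:
        counts[prev][b] += 1
        prev = b
    return counts

def estimated_bits(bits, counts):
    # Second pass: ideal code length (-log2 p per bit) under the fixed model.
    total, prev = 0.0, 0
    for b in bits:
        p = counts[prev][b] / sum(counts[prev])
        total += -log2(p)
        prev = b
    return total

bits = [0, 0, 0, 1] * 64            # a highly regular 256-bit stream
model = build_model(bits)
print(estimated_bits(bits, model))  # noticeably below the raw 256 bits
```

Because the model is built once per program and then frozen, the decompressor only needs the final table, which is what makes a hardware decompression engine feasible.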

4 Groups of Instructions Group 1  instructions with immediates Ex: sub %i1, 2, %g3 ; set 5000, %g2 Group 2  branch instructions Ex: be, bne, bl, bg,... Group 3  instructions with no immediates Ex: add %o1,%o2,%g3 ; st %g1,[%o2] Group 4  Instructions that are left uncompressed
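A hypothetical sketch of how such a grouping could be computed from a mnemonic; the set members and the 3-bit tags are illustrative assumptions, not the paper's actual tables:

```python
# Hypothetical grouping rules for SPARC mnemonics; the sets and the
# 3-bit tags are illustrative, not the paper's actual encoding.
BRANCHES = {"be", "bne", "bl", "bg"}
COMPRESSIBLE = {"add", "sub", "st", "ld", "set"}

def classify(mnemonic, has_immediate):
    if mnemonic in BRANCHES:
        return 2                      # group 2: branch instructions
    if mnemonic not in COMPRESSIBLE:
        return 4                      # group 4: left uncompressed
    return 1 if has_immediate else 3  # group 1: immediates, group 3: none

def tag(group):
    # Short 3-bit code prepended to each compressed instruction.
    return format(group, "03b")

print(classify("sub", True), tag(classify("sub", True)))  # 1 001
```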

Decompression Engine Design ( Approach)

The key idea is… Present an architecture for embedded systems that decompresses offline-compressed instructions at runtime  to reduce power consumption  with a performance improvement (in most cases)

Pipelined Design

Pipelined Design (cont’d)

Pipelined Design – group 1 (stage 1): input compressed instructions, index the decoding table, forward instructions

Pipelined Design – group 1 (stage 2)

Pipelined Design – group 1 (stage 3)

Pipelined Design – group 1 (stage 4)

Pipelined Design – group 2 branch instructions (stage 1)

Pipelined Design – group 2 branch instructions (stage 2)

Pipelined Design – group 2 branch instructions (stage 3)

Pipelined Design – group 2 branch instructions (stage 4)

Pipelined Design – group 3 instructions with no immediates (stage 1) 256-entry table Instructions with no immediates may appear in pairs -> a pair is compressed into one byte The 8 bits serve as an index into the table, whose entries hold the 64-bit instruction pair
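The fast dictionary for group 3 can be pictured as a 256-entry table of 64-bit values: each compressed byte indexes one entry, which expands back into a pair of 32-bit no-immediate instructions. A minimal sketch under that assumption (the instruction words below are made up):

```python
# Sketch of the group-3 fast dictionary: a frequent pair of 32-bit
# no-immediate instructions occupies one 64-bit table entry, so one
# compressed byte (the table index) expands into two instructions.
def build_dictionary(pairs):
    # pairs: up to 256 distinct (insn_a, insn_b) tuples of 32-bit words
    return [(a << 32) | b for a, b in pairs[:256]]

def decompress(byte_index, table):
    entry = table[byte_index]
    return entry >> 32, entry & 0xFFFFFFFF  # the two 32-bit instructions

table = build_dictionary([(0x80102001, 0xC2004000)])  # hypothetical words
a, b = decompress(0, table)
```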

Pipelined Design – group 3 instructions with no immediates (stage 2)

Pipelined Design – group 3 instructions with no immediates (stage 3)

Pipelined Design – group 3 instructions with no immediates (stage 4)

Pipelined Design – group 4 uncompressed instructions

Experimental results

Four applications are used:  an algorithm for computing 3D vectors for a motion picture ("i3d")  a complete MPEG-II encoder ("mpeg")  a smoothing algorithm for digital images ("smo")  a trick animation algorithm ("trick") A simulation tool written in C obtains performance data for the decompression engine

Experimental results (cont’d) The decompression engine is application specific.  for each application, build a decoding table and a fast dictionary table that decompress that particular application only

Experimental results for energy and performance

Why worse performance on smo with a 512-byte instruction cache? - smo does not require much memory (it executes in tight loops) - It generates very few misses at this cache size, so the compressed architecture cannot help an already almost-perfect hit ratio, and the slowdown from the decompression engine prevails
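The effect can be seen from a simple average-memory-access-time model: when the hit ratio is already near perfect, the decompressor's extra latency on every fetch outweighs any savings on misses. The cycle counts below are hypothetical, not measurements from the paper:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time in cycles.
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: decompression adds 1 cycle to every fetch but
# shrinks the miss penalty because compressed code is fetched from memory.
baseline   = amat(hit_time=1, miss_rate=0.001, miss_penalty=20)
compressed = amat(hit_time=2, miss_rate=0.001, miss_penalty=12)
print(baseline, compressed)  # 1.02 2.012 -> compression loses here
```

With a high miss rate the comparison flips, which matches the observation that smaller caches favor the compressed architecture.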

Conclusion & Contributions of the paper This paper designs an instruction decompression engine as a soft IP core for low-power embedded systems. Applications run faster than on systems with no code compression (due to improved cache performance). Power consumption is lower (due to smaller memory requirements for the executable program and fewer memory accesses).

Relation to CSE520 Improves system performance and power consumption by using a pipelined architecture. A different architecture design for lower power consumption in embedded systems. A smaller cache performs better with the compressed architecture; a larger cache performs better with the uncompressed architecture.  Cache hit ratio

Our project Goal:  Improve the efficiency of power management in an embedded multicore system Idea:  Use different power modes within a given power budget under a global power-management policy (from Jun Shen’s presentation)  Use the SAMC algorithm and this decompression architecture as another factor in the simulation (this paper) How?  With the SimpleScalar tool set: try a simple function first, then try the different power modes

Thank you! Q & A

Backup Slides

Critique The decompression engine will slow down the system whenever the cache generates very few misses at a given cache size.

Post-cache & Pre-cache Pre-cache: instructions stored in the I-cache are already decompressed. Post-cache: instructions stored in the I-cache are still compressed.

Problems for the post-cache architecture Memory relocation  Compression changes the instruction locations in memory, so branch-target addresses must be translated. In the pre-cache architecture:  decompression is done before instructions are fetched into the I-cache, so the addresses in the I-cache need not be fixed.
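One way to picture the relocation problem is a translation table, produced offline by the compressor, that maps original branch-target addresses to their locations in the compressed image. This is only a sketch of the idea; the paper's actual mechanism may differ, and the addresses below are made up:

```python
# Hypothetical address map from the offline compressor: every branch
# target in the original binary gets a location in the compressed image.
address_map = {
    0x1000: 0x0A00,
    0x1040: 0x0A18,
}

def translate_target(original_target):
    # Post-cache: the I-cache holds compressed code, so branch targets
    # must be translated to compressed-image addresses before the fetch.
    return address_map[original_target]

print(hex(translate_target(0x1000)))  # 0xa00
```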

SPARC Instruction Set Instruction groups  load/store (ld, st,...) Move data from memory to a register / Move data from a register to memory  integer arithmetic (add, sub,...) Arithmetic operations on data in registers  bit-wise logical (and, or, xor,...) Logical operations on data in registers  bit-wise shift (sll, srl,...) Shift bits of data in registers  integer branch (be, bne, bl, bg,...)  Trap (ta, te,...)  control transfer (call, save,...)  floating point (ldf, stf, fadds, fsubs,...)  floating point branch (fbe, fbne, fbl, fbg,...)

SPARC Instruction Example