A Survey on Low Power Multiplication / Accumulation Speaker : Byoung-Woon Kim.

Presentation transcript:

Contents
Introduction
[1] Interlaced Accumulation Programming
[2] Operand Swapping
[3] Selective Coefficient Negation
[4] Coefficient Optimization
[5] Coefficient Reordering
Conclusion & Future Works

Power Distribution of a DSP
Hirotsugu [ISLPED '96]
[Figure: normalized power consumption (%) for each test program, broken down into Control, Address Generation, Data Op., Memory, Pin, Peripheral, Clocking, Bus, and Misc. Annotation: variation due to data dependency.]

Multiplication and Accumulation: MAC
Major operation in DSP
[Diagram labels: X, Y, MULT, PR, ALU, ACC, CSA, CPA]
MUL > (5 * ALU)
[ Modified Booth Encoding ] One of 0, X, -X, 2X, -2X is selected based on each 2 bits of Y
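
The Booth selection rule above can be sketched in a few lines of Python. This is only an illustration of standard radix-4 (modified) Booth recoding, not the multiplier studied in the talk; the function names `booth_radix4_digits` and `booth_multiply` and the 16-bit default width are assumptions made for the example.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a signed `bits`-wide multiplier y into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2}. It is derived from two new bits of y
    plus the upper bit of the previous pair. Least significant digit first.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    u = y & ((1 << bits) - 1)      # two's-complement bit pattern of y
    digits = []
    prev = 0                       # implicit bit below the LSB is 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, bits=16):
    """Sum the partial products 0, +-X, +-2X selected by the Booth digits."""
    return sum(d * x * (4 ** k)
               for k, d in enumerate(booth_radix4_digits(y, bits)))
```

With this recoding, the pattern 0x8000 yields a single nonzero digit while 0x5555 yields eight, so the two test patterns used in the multiplier measurements sit at opposite ends of the Booth workload.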

Power Consumption by a Multiplier
Power Consumption by Data Dependency
[Figure: plots for a 36-bit ALU and a 16x16 MPY. X: energy per cycle (nJ), Y: # of input transitions.]
Average = 7nJ
Little correlation

Power Consumption by a Multiplier
What is an important input in terms of power?
[Figure: energy (nJ) for two cases]
0x8000 x (random): Average = 5nJ
(random) x 0x: Average = 1nJ

Power Consumption by a Multiplier
Booth encoding is a significant overhead
[Figure: energy (nJ) for two cases]
0x5555 x (random): Average = 6nJ
(random) x 0x: Average = 4nJ

Interlaced Accumulation Programming (1/2)
Hirotsugu [ISLPED '96]
3-tap FIR filter (n=3)
Y(k) = C0 * X(k) + C1 * X(k-1) + C2 * X(k-2)
Y(k+1) = C0 * X(k+1) + C1 * X(k) + C2 * X(k-1)
Y(k+2) = C0 * X(k+2) + C1 * X(k+1) + C2 * X(k)

Interlaced Accumulation Programming (2/2)
More than 40% power is saved by
– Keeping a constant at one operand of the multiplier
   X is kept: 7nJ -> 5 ~ 6nJ
   Y is kept: 7nJ -> 1 ~ 3nJ
– Reducing the number of memory accesses by a half
   Traditional: two memory operands
   Interlaced: one memory operand (data re-use by temporary register)
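
A minimal Python sketch of the interlacing idea, assuming two outputs are computed per pass. The function names and the two-output grouping are assumptions for illustration; the slides do not give the actual instruction sequence.

```python
def fir_traditional(coeffs, x, k):
    """One output Y(k): every multiply fetches a coefficient and a sample."""
    return sum(c * x[k - i] for i, c in enumerate(coeffs))


def fir_interlaced_pair(coeffs, x, k):
    """Compute Y(k) and Y(k+1) together.

    Each coefficient stays on one multiplier input for two consecutive
    multiplies, and each sample is fetched once and then reused from a
    temporary (modelling the temporary register).
    """
    acc0 = acc1 = 0
    newer = x[k + 1]              # extra sample fetched once, before the loop
    for i, c in enumerate(coeffs):
        older = x[k - i]          # one sample fetch per tap
        acc1 += c * newer         # C_i * X(k+1-i), part of Y(k+1)
        acc0 += c * older         # C_i * X(k-i),   part of Y(k)
        newer = older             # reuse the sample in the next tap
    return acc0, acc1
```

For example, with `coeffs = [2, 3, 5]` and `x = [1, 2, 3, 4, 5]`, `fir_interlaced_pair(coeffs, x, 2)` returns the same values as calling `fir_traditional` for k = 2 and k = 3.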

Operand Swapping (1/2)
Weight = how many additions are needed?
By Booth Encoding: X000X0, Y= , Weight = 2
[Table: Current (mW) for operand pairs A and B multiplied as A*B and as B*A, with the saving from swapping. Operands shown: 7FFF, AAAA, 0001. Savings: 54%, 68%, 58%. Labels: Low Weight, High Switching.]

Operand Swapping (2/2)
For filter operations, one operand is usually constant.
=> Operand swapping at compile time.
[Figure: Current (mA) for operands X and Y, grouped by switching (LowS, HighS) and by weight change (LowW -> LowW, HighW -> HighW, LowW -> HighW). One group is marked as the candidate for operand swapping.]
LowS: low switching, HighS: high switching
LowW: low weight, HighW: high weight
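
A compile-time swap decision could look roughly like the sketch below. It assumes that Y is the Booth-recoded input, that weight is the number of nonzero radix-4 Booth digits, and that only weight is considered (the figure suggests switching activity also matters). `booth_weight` and `order_operands` are hypothetical helper names.

```python
def booth_weight(y, bits=16):
    """Number of nonzero radix-4 Booth digits of y, i.e. additions needed."""
    u = y & ((1 << bits) - 1)     # two's-complement bit pattern of y
    weight, prev = 0, 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        if b0 + prev - 2 * b1 != 0:
            weight += 1
        prev = b1
    return weight


def order_operands(a, b, bits=16):
    """Return (x, y) so that the Booth-recoded input y gets the lower weight."""
    if booth_weight(b, bits) <= booth_weight(a, bits):
        return a, b
    return b, a
```

Under these assumptions 0x7FFF has weight 2 and 0xAAAA has weight 8, so `order_operands(0x7FFF, 0xAAAA)` places 0x7FFF on the Y input.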

Selective Coefficient Negation
To reduce the toggle
– store Coeff[i] or -Coeff[i] in memory
According to the negation,
– use 'multiply and add' (MAC+ instruction): ACC = ACC + (X * Y)
– use 'multiply and sub' (MAC- instruction): ACC = ACC - (X * Y)
GSM Vocoder: 11% power reduction
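
One way to picture the selection is the greedy sketch below. The cost used here, the 16-bit Hamming distance to the previously stored word, is an assumption for illustration, and so are the function names; the slides do not state how the negations were chosen.

```python
def hamming16(a, b):
    """Bit toggles between two values viewed as 16-bit two's complement."""
    return bin((a ^ b) & 0xFFFF).count("1")


def choose_negations(coeffs):
    """Greedily store c or -c, whichever toggles fewer bits vs. the previous word.

    Returns (stored_value, op) pairs: op is '+' for MAC+ and '-' for MAC-.
    Overflow of -(-32768) in 16 bits is not handled in this sketch.
    """
    plan, prev = [], 0
    for c in coeffs:
        if hamming16(prev, -c) < hamming16(prev, c):
            plan.append((-c, "-"))    # stored negated, so subtract the product
            prev = -c
        else:
            plan.append((c, "+"))
            prev = c
    return plan


def mac_with_plan(plan, samples):
    """Accumulate with MAC+ or MAC- according to the stored sign choice."""
    acc = 0
    for (stored, op), x in zip(plan, samples):
        acc = acc + stored * x if op == "+" else acc - stored * x
    return acc
```

The accumulated result is unchanged by the negation: for `coeffs = [3, -2, 5, -7]` and `samples = [1, 2, 3, 4]` the plan stores 3, 2, 5, 7 and `mac_with_plan` still returns the plain sum of products.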

Coefficient Optimization
Mahesh [TVLSI '98]
The design of the finite wordlength FIR filter
– Given N coefficients and constraints,
– Find a new set of coefficients such that the total Hamming distance between successive coefficients is minimized.
=> using a coefficient perturbation & an algorithm similar to simulated annealing
But, Hamming distance is not a good cost function!

Coefficient Ordering
MAC operation: commutative, associative
Finding a good ordering
– N! cases for an N-tap filter
Y(k) = C0 * X(k) + C1 * X(k-1) + C2 * X(k-2)
Y(k) = C1 * X(k-1) + C0 * X(k) + C2 * X(k-2)
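
For a small filter the N! orderings can simply be enumerated. The sketch below does so with the total 16-bit Hamming distance between successive coefficients as a stand-in cost; that cost is an assumption (the talk itself notes Hamming distance is not a good cost function), and the function names are hypothetical.

```python
from itertools import permutations


def toggle_cost(order, coeffs):
    """Total 16-bit Hamming distance between successively used coefficients."""
    return sum(bin((coeffs[a] ^ coeffs[b]) & 0xFFFF).count("1")
               for a, b in zip(order, order[1:]))


def best_order(coeffs):
    """Exhaustive search over all N! tap orders; only feasible for small N."""
    return min(permutations(range(len(coeffs))),
               key=lambda order: toggle_cost(order, coeffs))


def fir_in_order(coeffs, x, k, order):
    """Y(k) accumulated in the given tap order; the sum does not depend on it."""
    return sum(coeffs[i] * x[k - i] for i in order)
```

Because the accumulation is commutative and associative, `fir_in_order` returns the same Y(k) for every order; only the sequence of values seen by the multiplier changes.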

Conclusion & Future Works
Power characteristics of a multiplier
Some techniques for low power MACs
– Interlaced accumulation programming
– Operand swapping
– Selective coefficient negation
– Coefficient optimization & ordering
Find an accurate power model for a multiplier
– Cost function for coefficient optimization & instruction-level power optimization
An implementation of a multiplier supporting
– Selective 'operand swapping' & 'negation'