Chapter 12 Software Optimisation

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002

Software Optimisation Chapter This chapter consists of three parts: Part 1: Optimisation Methods. Part 2: Software Pipelining. Part 3: Multi-cycle Loop Pipelining.

Chapter 12 Software Optimisation Part 1 - Optimisation Methods

Objectives Introduction to optimisation and optimisation procedure. Optimisation of C code using the code generation tools. Optimisation of assembly code.

Introduction Software optimisation is the process of manipulating software code to achieve two main goals: Faster execution time. Smaller code size. Note: It will be shown that in general there is a trade-off between faster execution time and smaller code size.

Introduction To implement efficient software, the programmer must be familiar with: Processor architecture. Programming language (C, assembly or linear assembly). The code generation tools (compiler, assembler and linker).

Code Optimisation Procedure


Optimising C Compiler Options The ‘C6x optimising C compiler takes ANSI C source code and currently produces code that reaches up to about 80% of the performance of hand-scheduled assembly. However, to achieve this level of optimisation, knowledge of the different optimisation levels is essential. Optimisation is performed at different stages and levels.

Assembly Optimisation To develop an appreciation of how to optimise code, let us optimise an FIR filter:

$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]$

For simplicity we write:

$y = \sum_{i=0}^{N-1} h[i]\, x[i]$   [1]

Assembly Optimisation To implement Equation 1, we need to perform the following steps: (1) Load the sample x[i]. (2) Load the coefficients h[i]. (3) Multiply x[i] and h[i]. (4) Add (x[i] * h[i]) to the content of an accumulator. (5) Repeat steps 1 to 4 N-1 times. (6) Store the value in the accumulator to y.
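As a reference point before moving to assembly, the six steps above can be sketched in plain C. This is only an illustrative sketch: the function name `fir_mac`, the 16-bit sample type and the 32-bit accumulator are assumptions, not part of the original slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Steps 1 to 6 written out in C (hypothetical helper, for illustration only). */
int32_t fir_mac(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;                 /* the accumulator */
    for (size_t i = 0; i < n; i++) { /* (5) repeat for every tap */
        int32_t xi = x[i];           /* (1) load the sample x[i] */
        int32_t hi = h[i];           /* (2) load the coefficient h[i] */
        acc += xi * hi;              /* (3) multiply, (4) add to the accumulator */
    }
    return acc;                      /* (6) the caller stores this value to y */
}
```

Each C statement maps onto one of the instructions (LDH, LDH, MPY, ADD, SUB/B) in the assembly version.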

Assembly Optimisation Steps 1 to 6 can be translated into the following ‘C6x assembly code:

            MVK  .S2  0,B0       ; Initialise the loop counter
            MVK  .S1  0,A5       ; Initialise the accumulator
    loop    LDH  .D1  *A8++,A2   ; Load the samples x[i]
            LDH  .D1  *A9++,A3   ; Load the coefficients h[i]
            NOP  4               ; Add "nop 4" because the LDH has a latency of 5
            MPY  .M1  A2,A3,A4   ; Multiply x[i] and h[i]
            NOP                  ; Multiply has a latency of 2 cycles
            ADD  .L1  A4,A5,A5   ; Add "x[i] * h[i]" to the accumulator
     [B0]   SUB  .L2  B0,1,B0    ; loop overhead
     [B0]   B    .S1  loop       ; loop overhead
            NOP  5               ; The branch has a latency of 6 cycles

Assembly Optimisation In order to optimise the code, we need to: (1) Use instructions in parallel. (2) Remove the NOPs. (3) Remove the loop overhead (remove SUB and B: loop unrolling). (4) Use word access or double-word access instead of byte or half-word access.

Step 1 - Using Parallel Instructions

    Units: .M1 .M2 .L1 .L2 .S1 .S2 .D1 .D2

    Cycle    Instruction
    1        ldh
    2        ldh
    3-6      nop (x4)
    7        mpy
    8        nop
    9        add
    10       sub
    11       b
    12-16    nop (x5)

Note: Not all instructions can be put in parallel since the result of one unit is used as an input to the following unit.

Step 2 - Removing the NOPs

    Cycle    Instruction
    1        ldh
    2        ldh
    3        sub
    4        b
    5-6      nop (x2)
    7        mpy
    8        nop
    9        add

    loop    LDH  .D1  *A8++,A2
            LDH  .D1  *A9++,A3
     [B0]   SUB  .L2  B0,1,B0
     [B0]   B    .S1  loop
            NOP  2
            MPY  .M1  A2,A3,A4
            NOP
            ADD  .L1  A4,A5,A5

Step 3 - Loop Unrolling The SUB and B instructions consume at least two extra cycles per iteration (this is known as branch overhead).

            LDH  .D1   *A8++,A2   ; Start of iteration 1
         || LDH  .D2   *B9++,B3
            NOP  4
            MPY  .M1X  A2,B3,A4   ; Use of cross path
            NOP
            ADD  .L1   A4,A5,A5
            LDH  .D1   *A8++,A2   ; Start of iteration 2
            MPY  .M1   A2,B3,A4
            NOP
            ;  :
            LDH  .D1   *A8++,A2   ; Start of iteration n
            NOP  4

Step 4 - Word or Double Word Access The ‘C6711 has two 64-bit data buses for data memory access, and therefore up to two 64-bit values can be loaded into the registers at any time (see Chapter 2). In addition, the ‘C6711 devices have variants of the multiplication instruction to support different operations (see Chapter 2). Note: A store can only be up to 32 bits wide.

Step 4 - Word or Double Word Access Using word access, MPY and MPYH, the previous code can be written as:

    loop    LDW   .D1  *A9++,A3   ; 32-bit word is loaded in a single cycle
         || LDW   .D2  *B6++,B1
            NOP   4
     [B0]   SUB   .L2  B0,1,B0
     [B0]   B     .S1  loop
            NOP   2
            MPY   .M1  A3,B1,A4
         || MPYH  .M2  A3,B1,B3
            NOP
            ADD   .L1  A4,B3,A5

Note: By loading words and using the MPY and MPYH instructions the execution time has been halved, since in each iteration two 16x16-bit multiplications are performed.
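The effect of word access can be modelled in C. The sketch below assumes two 16-bit samples packed into each 32-bit word; `mpy_lo` mimics MPY (signed product of the lower halves) and `mpy_hi` mimics MPYH (signed product of the upper halves). All function names are hypothetical, and the narrowing casts rely on the usual two's-complement behaviour of common compilers.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of MPY: signed 16x16 product of the lower halves of a and b. */
static int32_t mpy_lo(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int32_t)(int16_t)(b & 0xFFFFu);
}

/* Model of MPYH: signed 16x16 product of the upper halves of a and b. */
static int32_t mpy_hi(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int32_t)(int16_t)(b >> 16);
}

/* Pack two 16-bit values into one word: lo in bits 15..0, hi in bits 31..16. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Dot product over packed words: one word load per array yields two products. */
int32_t dot_words(const uint32_t *x, const uint32_t *h, size_t nwords)
{
    int32_t sum_lo = 0;  /* accumulates the MPY results */
    int32_t sum_hi = 0;  /* accumulates the MPYH results */
    for (size_t i = 0; i < nwords; i++) {
        sum_lo += mpy_lo(x[i], h[i]);
        sum_hi += mpy_hi(x[i], h[i]);
    }
    return sum_lo + sum_hi;  /* combine the two partial sums once, after the loop */
}
```

Keeping two partial sums and adding them once at the end is one way to keep both multipliers busy in every iteration.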

Optimisation Summary It has been shown that there are four complementary methods for code optimisation: Using instructions in parallel. Filling the delay slots with useful code. Using word or double word load. These three increase performance and reduce code size. Loop unrolling. This increases performance but increases code size.

Chapter 12 Software Optimisation Part 2 - Software Pipelining

Objectives Why use Software Pipelining (SP)? Understand software pipelining concepts. Use the software pipelining procedure. Code the word-wide software pipelined dot-product routine. Determine if your pipelined code is more efficient with or without prolog and epilog.

Why use Software Pipelining (SP)? SP creates highly optimised loop code by: Putting several instructions in parallel. Filling delay slots with useful code. Maximising the use of functional units. SP is implemented by simply using the tools: Compiler options -o2 or -o3. Assembly Optimizer if the source is a .sa file.

Software Pipeline Concept To explain the concept of software pipelining, we will assume that all instructions execute in one cycle.

            LDH
         || LDH
            MPY
            ADD

How many cycles would it take to perform this loop 5 times? (Disregard delay slots.) 5 x 3 = 15 cycles. Let's examine hardware (functional unit) usage ...

Non-Pipelined Code

    Cycle    Instruction
    1        ldh
    2        mpy
    3        add
    4        ldh
    5        mpy
    6        add
    7        ldh
    8        mpy
    9        add

Pipelining Code

    Cycle    Instructions in parallel
    1        ldh
    2        mpy  ldh
    3        add  mpy  ldh
    4        add  mpy  ldh
    5        add  mpy  ldh
    6        add  mpy
    7        add

Pipelining these instructions takes only 7 cycles, about 1/2 the cycles of the non-pipelined code!
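The two cycle counts can be checked with a small calculation. The sketch below assumes, as the slides do, that every stage takes one cycle; the function names are illustrative only.

```c
/* Cycles needed when each iteration runs to completion before the next starts. */
unsigned cycles_sequential(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Cycles needed when a new iteration starts every cycle: the first iteration
 * needs `stages` cycles and each further one finishes one cycle later. */
unsigned cycles_pipelined(unsigned iterations, unsigned stages)
{
    if (iterations == 0)
        return 0;
    return iterations + stages - 1;
}
```

For the 3-stage loop (load, multiply, add) run 5 times this gives 15 cycles sequentially and 7 cycles pipelined.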

Pipelining Code (units used: .M1 .L1 .D1 .D2)

    Cycles 1-2    Prolog         Staging for loop.
    Cycles 3-5    Loop Kernel    Single-cycle "loop" iterated three times.
    Cycles 6-7    Epilog         Completing final operations.

Pipelined Code

    prolog: LDH        ; load 1
         || LDH
            MPY        ; mpy 1
         || LDH        ; load 2
         || LDH
    loop:   ADD        ; add 1
         || MPY        ; mpy 2
         || LDH        ; load 3
         || LDH
            ADD        ; add 2
         || MPY        ; mpy 3
         || LDH        ; load 4
         || LDH
            .

Software Pipelining Procedure 1. Write algorithm in C code & verify. 2. Write ‘C6x Linear Assembly code. 3. Create dependency graph. 4. Allocate registers. 5. Create scheduling table. 6. Translate scheduling table to ‘C6x code.

Software Pipelining Example (Step 1)

    short DotP(short *m, short *n, short count)
    {
        int i;
        short product;
        short sum = 0;

        for (i = 0; i < count; i++) {
            product = m[i] * n[i];
            sum += product;
        }
        return (sum);
    }


Write code in Linear Assembly (Step 2)

    ; for (i=0; i < count; i++)
    ;     prod = m[i] * n[i];
    ;     sum += prod;

    loop:   ldh   *p_m++, m
            ldh   *p_n++, n
            mpy   m, n, prod
            add   prod, sum, sum
    [count] sub   count, 1, count
    [count] b     loop

1. No NOPs required. 2. No parallel instructions required. 3. You don't have to specify: Functional units, or Registers.

Software Pipelining Procedure 1. Write algorithm in C code & verify. 2. Write ‘C6x Linear Assembly code. 3. Create a dependency graph (4 steps). 4. Allocate registers. 5. Create scheduling table. 6. Translate scheduling table to ‘C6x code.

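Dependency Graph Terminology [Figure: a parent node joined by a path to a child node. A node is labelled with its instruction (e.g. LDH), its functional unit (.D or .L) and the path with its latency (e.g. 5). Conditional and non-conditional paths are drawn differently.]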

Dependency Graph Steps (a) Draw the algorithm nodes and paths. (b) Write the number of cycles it takes for each instruction to complete execution. (c) Assign “required” function units to each node. (d) Partition the nodes to A and B sides and assign sides to all functional units.

Dependency Graph (Step a) In this step each instruction is represented by a node. The node is represented by a circle, where: Outside: write instruction. Inside: register where result is written. Nodes are then connected by paths showing the data flow. Note: Conditional paths are represented by dashed lines.

Dependency Graph (Step a) [Figure, built up node by node: LDH (result m) and LDH (result n) feed MPY (result prod), which feeds ADD (result sum). SUB (result count) feeds B (loop).]

Dependency Graph (Step b) In this step the number of cycles it takes for each instruction to complete execution is added to the dependency graph. It is written along the associated data path.

Dependency Graph (Step b) [Figure: the graph from step (a) with the number of cycles written on each path: LDH 5, MPY 2, ADD 1, SUB 1, B 6.]

Dependency Graph (Step c) In this step functional units are assigned to each node. It is advantageous to start allocating units to instructions which require a specific unit: Load/Store. Branch. We do not need to be concerned with multiply as this is the only operation that the .M unit performs. Note: The side is not allocated at this stage.

Dependency Graph (Step c) [Figure: the graph from step (b) with the required units added: both LDH nodes on .D, MPY on .M, B on .S.]

Dependency Graph (Step d) The data path is partitioned into side A and B at this stage. To optimise code we need to ensure that a maximum number of units are used with a minimum number of cross paths. To make the partition visible on the dependency graph a line is used. The side can then be added to the functional units associated with each instruction or node.

Dependency Graph (Step d) [Figure: the graph split by a line into an A side and a B side.]

    A Side                      B Side
    LDH  m      .D1             LDH  n      .D2
    MPY  prod   .M1x            SUB  count  .L2
    ADD  sum    .L1             B    loop   .S2


Step 4 - Allocate Functional Units

    A Side                      B Side
    .L1    sum                  .L2    count
    .M1    prod (.M1x)          .M2
    .D1    m                    .D2    n
    .S1                         .S2    loop

Do we have enough functional units to code this algorithm in a single-cycle loop?

Step 4 - Allocate Registers Registers A0 to A15 and B0 to B15 are available. Content of Register File A: &a, a, prod, sum. Content of Register File B: count, &b, b.


Step 5 - Create Scheduling Table [Table: one row per unit (.L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2), one column per cycle; cycles 1 to 7 are the PROLOG and cycle 8 is the LOOP.] How do we know the loop ends up in cycle 8?

Length of Prolog Answer: Count up the length of the longest path. In this case we have: LDH (5) + MPY (2) + ADD (1) = 8 cycles.
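Counting the longest path can also be done mechanically. The following is a minimal sketch, assuming the dependency graph is given as an array of nodes in topological order, each with a latency and up to two producer nodes. The `struct node` layout, `MAX_NODES` and `longest_path` are hypothetical names introduced for illustration.

```c
#include <stddef.h>

#define MAX_NODES 16

struct node {
    int latency;     /* cycles until this node's result is available */
    int parent[2];   /* indices of producer nodes, -1 if unused */
};

/* Longest latency path through the graph, or -1 if the graph is too large.
 * Nodes must be listed so that every parent appears before its children. */
int longest_path(const struct node *g, size_t n)
{
    int finish[MAX_NODES];   /* finish[i]: cycle count at which node i is done */
    int best = 0;

    if (n > MAX_NODES)
        return -1;
    for (size_t i = 0; i < n; i++) {
        int start = 0;
        for (int k = 0; k < 2; k++) {
            int p = g[i].parent[k];
            if (p >= 0 && finish[p] > start)
                start = finish[p];   /* wait for the slowest producer */
        }
        finish[i] = start + g[i].latency;
        if (finish[i] > best)
            best = finish[i];
    }
    return best;
}
```

With the dot-product graph (LDH m = 5, LDH n = 5, MPY = 2 fed by both loads, ADD = 1 fed by MPY) the function returns 8, matching the slide.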


Scheduling Table Cycles 1 to 7 form the PROLOG and cycle 8 is the LOOP. Each instruction is entered in the cycle where it first issues and then repeats in every following cycle (marked * in the table).

    Unit    First issued in cycle    Instruction
    .D1     1                        ldh m
    .D2     1                        ldh n
    .L2     2                        sub
    .S2     3                        B
    .M1     6                        mpy
    .L1     8                        add

Where do we want to branch? Branch here: to the loop (cycle 8).


Translate Scheduling Table to ‘C6x Code Each column of the scheduling table becomes one execute packet. The prolog (C1 to C7) is built up cycle by cycle:

    C1      ldh  .D1   *A1++,A2
         || ldh  .D2   *B1++,B2

    C2      ldh  .D1   *A1++,A2
         || ldh  .D2   *B1++,B2
         || [B0] sub  .L2  B0,1,B0

    C3      ldh  .D1   *A1++,A2
         || ldh  .D2   *B1++,B2
         || [B0] sub  .L2  B0,1,B0
         || [B0] B    .S2  loop

    C4, C5  identical to C3

    C6      ldh  .D1   *A1++,A2
         || ldh  .D2   *B1++,B2
         || [B0] sub  .L2  B0,1,B0
         || [B0] B    .S2  loop
         || mpy  .M1x  A2,B2,A3

    C7      identical to C6

Single-Cycle Loop:

    loop:   ldh  .D1   *A1++,A2
         || ldh  .D2   *B1++,B2
         || [B0] sub  .L2  B0,1,B0
         || [B0] B    .S2  loop
         || mpy  .M1x  A2,B2,A3
         || add  .L1   A4,A3,A4

See Chapter 14 for practical examples.

Translate Scheduling Table to ‘C6x Code With this method we have only created the prolog and the loop. Therefore, if the filter has 100 taps, we need to repeat the loop 100 times because we need 100 adds. This means that we are performing 107 loads. These 7 extra loads may lead to illegal memory accesses.

Solution: The Epilog We only created the Prolog and Loop … What about the Epilog? The Epilog can be extracted from your results as described below. See example in the next slide.

Dot-Product with Epilog

    Prolog
    p1:   ldh || ldh
    p2:   ldh || ldh || [] sub
    p3:   ldh || ldh || [] sub || [] b
    p4:   ldh || ldh || [] sub || [] b
    p5:   ldh || ldh || [] sub || [] b
    p6:   ldh || ldh || mpy || [] sub || [] b
    p7:   ldh || ldh || mpy || [] sub || [] b

    Loop
    loop: ldh || ldh || mpy || add || [] sub || [] b

    Epilog
    e1:   mpy || add
    e2:   mpy || add
    e3:   mpy || add
    e4:   mpy || add
    e5:   mpy || add
    e6:   add
    e7:   add

Epilog = Loop - Prolog. And there is no sub or b in the epilog.
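The difference in the number of loads can be worked out from the listing above. The sketch below is a derivation under stated assumptions, not code from the slides: every prolog and loop cycle issues one load per input array, the epilog issues none, and with an epilog the loop is assumed to run `taps - prolog_cycles` times so that loop plus epilog still produce `taps` adds. The function names are hypothetical.

```c
#include <assert.h>

/* Loads per input array when only prolog and loop exist:
 * the loop must run `taps` times to perform `taps` adds. */
unsigned loads_without_epilog(unsigned taps, unsigned prolog_cycles)
{
    return prolog_cycles + taps;
}

/* Loads per input array when an epilog drains the pipeline:
 * the epilog supplies the last `prolog_cycles` adds without loading. */
unsigned loads_with_epilog(unsigned taps, unsigned prolog_cycles)
{
    assert(taps >= prolog_cycles);  /* this schedule needs at least that many taps */
    return prolog_cycles + (taps - prolog_cycles);
}
```

For 100 taps and the 7-cycle prolog this gives 107 loads without an epilog and exactly 100 with one, which is why the epilog avoids reading past the end of the arrays.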

Scheduling Table: Prolog, Loop and Epilog

Loop only! Can the code be written as a loop only (i.e. no prolog or epilog)? Yes!

Loop only! Starting from the scheduling table (prolog and loop): (i) Remove all instructions from the prolog except the branch. (ii) Zero the input registers, accumulator and product registers. (iii) Adjust the number of subtractions.

Loop Only - Final Code

    Overhead
            b    loop
            b    loop
         || zero m         ; input register
         || zero n         ; input register
         || zero prod      ; product register
         || zero sum       ; accumulator
         || sub            ; modify count register

    Loop
    loop    ldh || ldh || mpy || add || [] sub || [] b loop

Laboratory exercise Software pipeline the LDW version of the dot-product routine: (1) Write linear assembly. (2) Create dependency graph. (3) Complete scheduling table. (4) Transfer table to ‘C6000 code. To Epilog or Not to Epilog? Determine if your pipelined code is more efficient with or without prolog and epilog.

Lab Solution: Step 1 - Linear Assembly

    ; for (i=0; i < count; i++)
    ;     prod = m[i] * n[i];
    ;     sum += prod;
    ; *** count becomes 20 ***

    loop:   ldw   *p_m++, m
            ldw   *p_n++, n
            mpy   m, n, prod
            mpyh  m, n, prodh
            add   prod, sum, sum
            add   prodh, sumh, sumh
    [count] sub   count, 1, count
    [count] b     loop

    ; Outside of Loop
            add   sum, sumh, sum

Step 2 - Dependency Graph [Figure: LDW m (.D1) and LDW n (.D2), latency 5, feed MPY (prod, .M1x) and MPYH (prodh, .M2x), latency 2. MPY feeds ADD (sum, .L1) and MPYH feeds ADD (sumh, .L2), latency 1. SUB (count, .S2) feeds B (loop, .S1), latency 6. The graph is split into an A side and a B side.]

Step 2 - Functional Units

    A Side                      B Side
    .L1    sum                  .L2    sumh
    .M1    prod (.M1x)          .M2    prodh (.M2x)
    .D1    m                    .D2    n
    .S1    loop                 .S2    count

Do we still have enough functional units to code this algorithm in a single-cycle loop? Yes!

Step 2 - Registers

    Register File A    Reg.    Reg.    Register File B
                       A0      B0      count
                       A1              return address
    &a / ret value     A4      B4      &x
    a                  A5      B5      x
    count / prod       A6      B6      prodh
    sum                A7      B7      sumh

Step 3 - Schedule Algorithm Cycles 1 to 7 form the PROLOG and cycle 8 is the LOOP.

    Unit    Cycles    Instruction
    .L1     8         add
    .L2     8         add
    .S1     3-8       B1 to B6
    .S2     2-8       sub1 to sub7
    .M1     6-8       mpy, mpy2, mpy3
    .M2     6-8       mpyh, mpyh2, mpyh3
    .D1     1-8       ldw m, ldw2 to ldw8
    .D2     1-8       ldw n, ldw2 to ldw8

Step 4 - ‘C6000 Code The complete code is available in the following location: \Links\DotP LDW.pdf

Why Conditional Subtract?

    loop:   ldh   *p_m++, m
            ldh   *p_n++, n
            mpy   m, n, prod
            add   prod, sum, sum
    [count] sub   count, 1, count
    [count] b     loop

    Without Cond. Subtract            With Cond. Subtract
    loop (count = 1)   (B)            loop (count = 1)  (B)
    loop (count = 0)   (B) X          loop (count = 0)  (B) X
    loop (count = -1)  (B)            loop (count = 0)  (B) X
    loop (count = -2)  (B)            loop (count = 0)  (B) X
    loop (count = -3)  (B)            loop (count = 0)  (B) X
    loop (count = -4)  (B)            loop (count = 0)  (B) X
    Loop never ends                   Loop ends

(X marks a branch that is not taken because count is zero.)
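The behaviour in the two columns can be reproduced with a few lines of C. This is a sketch under the assumption that the predicate `[count]` simply means "execute only if count is non-zero"; the function name and its parameters are hypothetical.

```c
/* Value of the loop counter after `steps` further executions of the SUB.
 * conditional != 0 models "[count] sub count,1,count";
 * conditional == 0 models an unconditional "sub count,1,count". */
int count_after(int count, int steps, int conditional)
{
    for (int i = 0; i < steps; i++) {
        if (!conditional || count != 0)
            count -= 1;   /* the subtract executes */
    }
    return count;
}
```

Starting from count = 1, the conditional version settles at 0, so every later `[count] b` is cancelled. The unconditional version runs on to negative values, which are non-zero and therefore re-enable the branch.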

Chapter 12 Software Optimisation Part 3 - Pipelining Multi-cycle Loops

Objectives Software pipeline the weighted vector sum algorithm. Describe four iteration interval constraints. Calculate minimum iteration interval. Convert and optimize the dot-product code to floating point code.

What Requires Multi-Cycle Loops?
Resource Limitations: Running out of resources (Functional Units, Registers, Bus Accesses). The Weighted Vector Sum example requires three .D units.
Live Too Long: Minimum iteration interval defined by the length of time a variable is required to exist.
Loop Carry Path: Latency required between loop iterations. FIR and SP floating-point dot product examples are demonstrated.
Functional Unit Latency > 1: A few ‘C67x instructions require functional units for 2 or 4 cycles rather than one. This defines a minimum iteration interval.

What Requires Multi-Cycle Loops? Four reasons: 1. Resource Limitations. 2. Live Too Long. 3. Loop Carry Path. 4. Double Precision (FUL > 1). Use these four constraints to determine the smallest Iteration Interval (Minimum Iteration Interval or MII).

Resource Limitation: Weighted Vector Sum Step 1 - C Code

    void WVS(short *c, short *b, short *a, short r, short n)
    {
        int i;

        for (i = 0; i < n; i++) {
            c[i] = a[i] + ((r * b[i]) >> 15);
        }
    }

a, b: input arrays. c: output array. n: length of arrays. r: weighting factor. Each iteration needs two loads (.D) and one store (.D): Requires 3 .D units.


Step 2 - ‘C6x Linear Code

    ; c[i] = a[i] + ((r * b[i]) >> 15);

    loop:   LDH   *a++, ai
            LDH   *b++, bi
            MPY   r, bi, prod
            SHR   prod, 15, sum
            ADD   ai, sum, ci
            STH   ci, *c++
     [i]    SUB   i, 1, i
     [i]    B     loop

The full code is available here: \Links\Wvs.sa

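Step 3 - Dependency Graph [Figure: LDH (bi) and r feed MPY (prod), which feeds SHR by 15 (sum). SHR and LDH (ai) feed ADD (ci), which feeds STH (*c++). SUB (i) feeds B (loop). Latencies on the paths: LDH 5, MPY 2, SHR 1, ADD 1, SUB 1, B 6. Units used: .D1 .L1 on the A side, .S2 .M2 .D2 .L2 on the B side, and .S1 for the branch.]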

Step 4 - Allocate Functional Units

    A Side                      B Side
    .L1    ci                   .L2    i
    .M1                         .M2    prod
    .D1    ai, *c               .D2    bi
    .S1    loop                 .S2    sum

This requires 3 .D units, therefore it cannot fit into a single-cycle loop. It may fit into a 2-cycle loop if there are no other constraints.
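The resource bound on the iteration interval follows from a ceiling division. The helper below is an illustrative sketch (the name `resource_mii` is an assumption), and it assumes `units_available` is greater than zero.

```c
/* Smallest iteration interval permitted by one class of functional unit:
 * ceil(uses_per_iteration / units_available). */
unsigned resource_mii(unsigned uses_per_iteration, unsigned units_available)
{
    return (uses_per_iteration + units_available - 1) / units_available;
}
```

The weighted vector sum needs three .D operations per iteration (two loads and one store) on two .D units, so `resource_mii(3, 2)` gives 2. The dot product needs only two, so a single-cycle loop is possible.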

2 Cycle Loop Each loop iteration takes 2 cycles, so every unit (.L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2) offers two slots per iteration. Iteration Interval (II): number of cycles per loop iteration.

Multi-Cycle Loop Iterations [Figure: three consecutive 2-cycle iterations, ending at cycle 2, cycle 4 and cycle 6.]

    Unit    First cycle of each iteration    Second cycle of each iteration
    .D1     ldh                              sth
    .D2     ldh
    .S1     b
    .S2     shr
    .M1     mpy
    .M2
    .L1     add
    .L2     sub

How long is the Prolog? What is the length of the longest path (bi, prod, sum, ci, *c++)? LDH (5) + MPY (2) + SHR (1) + ADD (1) + STH (1) = 10. How many cycles per loop? 2.

Step 5 - Create Scheduling Chart With a 2-cycle loop the chart has two rows per unit (.L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2): one for the even cycles (0, 2, 4, 6, 8) and one for the odd cycles (1, 3, 5, 7, 9). The instructions on the longest path are entered first, in order: LDH bi, MPY mi, SHR sum, ADD ci, STH c[i]. Entering LDH ai next produces a conflict.

Conflict Solution Here are two possibilities for placing LDH ai ... Which is better? Move the LDH to cycle 2 (so you don't have to go back and recheck crosspaths).

Step 5 - Create Scheduling Chart

With LDH ai moved to cycle 2 on .D1, the branch ([i] B on .S1) and the loop counter update ([i] SUB i on .L2) are added to complete the chart. A * marks the same instruction repeating for a later iteration.

Even cycles:
Unit\cycle   0        2        4        6        8
.L1                                              ADD ci
.L2
.S1                            [i] B    *        *
.S2
.M1
.M2
.D1                   LDH ai   *        *        *
.D2          LDH bi   *        *        *        *

Odd cycles:
Unit\cycle   1        3          5        7         9
.L1
.L2                   [i] SUB i  *        *         *
.S1
.S2                                       SHR sum   *
.M1
.M2                              MPY mi   *         *
.D1                                                 STH c[i]
.D2

2 Cycle Loop Kernel

The last even and odd columns of the scheduling chart (cycles 8 and 9), where every instruction is active, form the loop kernel:

Cycle 8:  .L1 ADD ci    .S1 [i] B      .D1 LDH ai    .D2 LDH bi
Cycle 9:  .L2 [i] SUB i .S2 SHR sum    .M2 MPY mi    .D1 STH c[i]

What Requires Multi-Cycle Loops?

Four reasons:
1. Resource Limitations.
2. Live Too Long.
3. Loop Carry Path.
4. Double Precision (FUL > 1).

Live Too Long - Example

c = (a >> 5) + a

Dependency graph: LDH (.D1, 5 cycles) loads ai. SHR (.S1, 1 cycle) shifts ai right by 5 to give x. ADD (.L1) adds ai and x to give ci.

In a 1 cycle loop a new LDH is issued every cycle, so a0 is valid for only one cycle before a1 replaces it. SHR reads a0 while it is valid, and x0 becomes valid one cycle later, in the same cycle that a1 arrives. The ADD executes in that cycle.

Oops, rather than adding a0 + x0 we got a1 + x0.

Let's look at one solution ...

Live Too Long - 2 Cycle Solution

With a 2 cycle loop, a new LDH is issued only every second cycle, so a0 is valid for 2 cycles.

Notice, a0 and x0 are both valid for 2 cycles, which is the length of the Iteration Interval. Adding them ... works!

But what's the drawback? A 2 cycle loop is slower. Here's a better solution ...

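Live Too Long - 1 Cycle Solution

Units: LDH on .D1, SHR on .S1, MV on .S2, ADD on .L1.

While a0 is valid, MV copies it into a temporary register b in the same cycle that SHR computes x. One cycle later a already holds a1, but b and x0 are both valid, so the ADD uses b and x0 and produces the correct result.

Using a temporary register solves this problem without increasing the Minimum Iteration Interval.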
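
The effect can be reproduced with a small cycle-by-cycle model in Python. This is a sketch under stated assumptions, not the slides' own material: run_one_cycle_loop is a hypothetical helper, a loaded value is assumed to become readable 5 cycles after the LDH issues, and SHR, MV and ADD results are assumed readable one cycle after issue.

```python
LDH_LATENCY = 5  # assumption: a value loaded in cycle t is readable in cycle t + 5

def run_one_cycle_loop(data, use_mv):
    """Model c = (a >> 5) + a in a 1 cycle loop; optionally route a through MV."""
    regs = {"a": None, "b": None, "x": None}
    pending = []   # (cycle the write becomes visible, register, value)
    results = []
    for cycle in range(len(data) + LDH_LATENCY + 1):
        # Commit the writes that land at the start of this cycle.
        for when, reg, value in pending:
            if when == cycle:
                regs[reg] = value
        pending = [p for p in pending if p[0] > cycle]

        a, b, x = regs["a"], regs["b"], regs["x"]   # operands as read this cycle

        if cycle < len(data):                          # LDH: one new load per cycle
            pending.append((cycle + LDH_LATENCY, "a", data[cycle]))
        if a is not None:
            pending.append((cycle + 1, "x", a >> 5))   # SHR a, 5, x
            pending.append((cycle + 1, "b", a))        # MV a, b
        if x is not None:
            results.append((b if use_mv else a) + x)   # ADD
    return results

data = [32, 64, 96, 128]
expected = [v + (v >> 5) for v in data]
buggy = run_one_cycle_loop(data, use_mv=False)
fixed = run_one_cycle_loop(data, use_mv=True)

print(expected)  # [33, 66, 99, 132]
print(buggy)     # [65, 98, 131, 132]
print(fixed)     # [33, 66, 99, 132]
```

In this model the last buggy result happens to be correct only because no further load overwrites a once the input is exhausted.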

Loop Carry Path

The loop carry path is a path which feeds one variable from part of the algorithm back to another.

e.g. MPY.M2 (result p2, 2 cycles) followed by STH.D1 (st_y0, 1 cycle): Loop carry path = 3.

Note: The loop carry path is not the code loop.

Loop Carry Path, e.g. IIR Filter

IIR Filter Example: y0 = a0*x0 + b1*y1

IIR.SA

IIR:    ldh   *a_1, A1
        ldh   *x1, A3
        ldh   *b_1, B1
        ldh   *y0, B0            ; y1 is previous y0
        mpy   A1, A3, prod1
        mpy   B1, B0, prod2
        add   prod1, prod2, prod2
        sth   prod2, *y0

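Loop Carry Path - IIR Example

IIR Filter Loop: y0 = a1*x1 + b1*y1

Dependency graph: LDH.D1 and LDH.D2 (5 cycles; inputs include B1 and y1) feed MPY.M1 and MPY.M2 (2 cycles), which feed ADD.L1 (1 cycle, result y0), which feeds STH.D1 (1 cycle, st_y0). The stored y0 is loaded again as y1 in the next iteration.

Min Iteration Interval:
  Resource = 2 (need 3 .D units)
  Loop Carry Path = 9 (9 = 5 + 2 + 1 + 1)
  therefore, MII = 9

Result carries over from one iteration of the loop to the next. Can it be minimized?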

Loop Carry Path - IIR Example (Solution)

IIR Filter Loop: y0 = a1*x1 + b1*y1

Dependency graph: LDH.D1, LDH.D2 (5 cycles), MPY.M1, MPY.M2 (2 cycles), ADD.L1 (1 cycle, result y0), STH.D1 (1 cycle, st_y0).

Since y0 is stored in a CPU register, it can be used directly by MPY (after the first loop iteration). The load and the store drop out of the loop carry path.

Min Iteration Interval:
  Resource = 2 (need 3 .D units)
  New Loop Carry Path = 3 (3 = 2 + 1)
  therefore, MII = 3
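
The arithmetic on these two slides can be checked with a short Python sketch. The helper names resource_bound and loop_carry_bound are hypothetical, the operation counts are read off the slide (3 operations on .D units, 2 on .M, 1 on .L), and two units of each type are assumed, matching the .D1/.D2, .M1/.M2 and .L1/.L2 rows of the scheduling chart.

```python
import math

def resource_bound(op_counts, unit_counts):
    """Smallest iteration interval that gives every unit type enough issue slots."""
    return max(math.ceil(n / unit_counts[kind]) for kind, n in op_counts.items())

def loop_carry_bound(latencies):
    """Length of the loop carry path: the sum of the latencies along it."""
    return sum(latencies)

UNITS = {"D": 2, "M": 2, "L": 2}      # assumed: two units of each type
IIR_OPS = {"D": 3, "M": 2, "L": 1}    # counts as read from the slide

resource = resource_bound(IIR_OPS, UNITS)

# Original path: LDH (5) + MPY (2) + ADD (1) + STH (1)
mii_before = max(resource, loop_carry_bound([5, 2, 1, 1]))

# y0 kept in a register: MPY (2) + ADD (1)
mii_after = max(resource, loop_carry_bound([2, 1]))

print(resource, mii_before, mii_after)  # 2 9 3
```

The minimum iteration interval is the larger of the two bounds, so after the change the loop carry path (3) still dominates the resource bound (2).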

Reminder: Fixed-Point Dot-Product Example

Dependency graph: LDH on .D1 and .D2 (5 cycles, e.g. n), MPY on .M1x (2 cycles, result prod), ADD on .L1 (1 cycle, result sum).

Is there a loop carry path in this example? Yes, but it's only "1".

Min Iteration Interval:
  Resource = 1
  Loop Carry Path = 1
  ⇒ MII = 1

For the fixed-point implementation, the Loop Carry Path was not taken into account because it is equal to 1.

Loop Carry Path IIR Example. Enhancing the IIR. Fixed-Point Dot-Product Example. Floating-Point Dot Product Example.

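Loop Carry Path due to FUL > 1

Floating-Point Dot-Product Example

Dependency graph: LDW on .D1 and .D2 (5 cycles, e.g. n), MPYSP on .M1x (4 cycles, result prod), ADDSP on .L1 (4 cycles, result sum, fed back into the next ADDSP).

Min Iteration Interval:
  Resource = 1
  Loop Carry Path = 4
  ⇒ MII = 4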

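Unrolling the Loop

If the MII must be four cycles long, then use all of them to calculate four results.

The dependency graph (LDW on .D1 and .D2, MPYSP on .M1x, ADDSP on .L1 with a loop carry path of 4) is repeated four times:
  m1, n1 -> prod1 -> sum1
  m2, n2 -> prod2 -> sum2
  m3, n3 -> prod3 -> sum3
  m4, n4 -> prod4 -> sum4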

ADDSP Pipeline (Staggered Results)

ADDSP takes 4 cycles or three delay slots to produce the result.

Cycle   Instruction           Result
0       ADDSP x0, sum, sum    sum = 0
1       ADDSP x1, sum, sum    sum = 0
2       ADDSP x2, sum, sum    sum = 0
3       ADDSP x3, sum, sum    sum = 0
4       ADDSP x4, sum, sum    sum = x0
5       ADDSP x5, sum, sum    sum = x1
6       ADDSP x6, sum, sum    sum = x2
7       ADDSP x7, sum, sum    sum = x3
8       ADDSP x8, sum, sum    sum = x0 + x4
9                             sum = x1 + x5
10                            sum = x2 + x6
11                            sum = x3 + x7
12                            sum = x0 + x4 + x8

ADDSP Pipeline (Staggered Results)

After the last ADDSP is issued in cycle 8, NOPs in cycles 9 to 12 let the results still in the pipeline arrive: sum = x1 + x5, sum = x2 + x6, sum = x3 + x7, and finally sum = x0 + x4 + x8.

There are effectively four running sums:
  sum (i)   = x(i)   + x(i+4) + x(i+8)  + …
  sum (i+1) = x(i+1) + x(i+5) + x(i+9)  + …
  sum (i+2) = x(i+2) + x(i+6) + x(i+10) + …
  sum (i+3) = x(i+3) + x(i+7) + x(i+11) + …

These need to be combined after the last addition is complete...
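
The staggered results can be reproduced with a short Python model. This is a sketch only: addsp_stream is a hypothetical name, and the model assumes that an ADDSP issued in cycle t reads sum in cycle t and that its result becomes visible in cycle t + 4, as in the table.

```python
def addsp_stream(xs, latency=4):
    """Issue 'ADDSP x[k], sum, sum' once per cycle and record sum in every cycle."""
    sum_reg = 0.0
    in_flight = {}   # cycle in which a result becomes visible -> value
    history = []     # value of sum visible in each cycle
    for cycle in range(len(xs) + latency):
        if cycle in in_flight:
            sum_reg = in_flight.pop(cycle)
        history.append(sum_reg)
        if cycle < len(xs):
            # The add reads the value of sum visible now, not the latest issued one.
            in_flight[cycle + latency] = xs[cycle] + sum_reg
    return history

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # x0 .. x8
history = addsp_stream(xs)

print(history[0:4])   # [0.0, 0.0, 0.0, 0.0]
print(history[4:8])   # [1.0, 2.0, 3.0, 4.0]           x0, x1, x2, x3
print(history[8:13])  # [6.0, 8.0, 10.0, 12.0, 15.0]   x0+x4, x1+x5, x2+x6, x3+x7, x0+x4+x8
```

Because each add sees the sum from four cycles earlier, the single register behaves like four interleaved accumulators.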

ADDSP Pipeline (Combining Results)

Cycle   Instruction              Result
0       ADDSP x0, sum, sum       sum = 0
1       ADDSP x1, sum, sum       sum = 0
2       ADDSP x2, sum, sum       sum = 0
3       ADDSP x3, sum, sum       sum = 0
4       ADDSP x4, sum, sum       sum = x0
5       ADDSP x5, sum, sum       sum = x1
6       ADDSP x6, sum, sum       sum = x2
7       ADDSP x7, sum, sum       sum = x3
8       ADDSP x8, sum, sum       sum = x0 + x4
9       MV sum, temp             sum = x1 + x5
10      ADDSP sum, temp, sum1    sum = x2 + x6, temp = x1 + x5
11      MV sum, temp             sum = x3 + x7
12      ADDSP sum, temp, sum2    sum = x0 + x4 + x8, temp = x3 + x7
13      NOP
14      NOP                      sum1 = x1 + x2 + x5 + x6
15      NOP
16      ADDSP sum1, sum2, sum    sum2 = x0 + x3 + x4 + x7 + x8
17      NOP
18      NOP
19      NOP
20      NOP                      sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
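
The combining step can be written out in Python as a minimal sketch. The function names partial_sums and combine are hypothetical, and the model ignores cycle timing: it only checks that pairing the four running sums the way the table does yields the full sum.

```python
def partial_sums(xs, lanes=4):
    """Lane k holds x[k] + x[k + lanes] + x[k + 2*lanes] + ..."""
    return [sum(xs[k::lanes]) for k in range(lanes)]

def combine(lanes):
    """Pair up the four running sums in the same order as the table."""
    s0, s1, s2, s3 = lanes
    sum1 = s2 + s1        # cycle 10: sum = x2 + x6, temp = x1 + x5
    sum2 = s0 + s3        # cycle 12: sum = x0 + x4 + x8, temp = x3 + x7
    return sum1 + sum2    # cycle 16: ADDSP sum1, sum2, sum

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # x0 .. x8
lanes = partial_sums(xs)
total = combine(lanes)

print(lanes)  # [15.0, 8.0, 10.0, 12.0]
print(total)  # 45.0
```

With general floating-point data the result can differ in the last bits from a strictly sequential sum, because the additions are performed in a different order.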

Simple FUL Example

MPYDP on .M1 produces prod. The diagram is annotated with 5, 3 and 10 (4.9), and shows MPYDP issued in cycle 1 followed by cycles 2, 3, 4, 5, 6 ...

MPYDP ties up the functional unit for 4 cycles.

A Better Way to Diagram this ...

Cycles laid out four to a column:
  1  MPYDP    5     9            13
  2           6    10            14
  3           7    11  prod1     15  prod2
  4           8    12            16

Since the MPYDP instruction has a functional unit latency (FUL) of "4", .M1 cannot be used again until the fifth cycle. Hence, MII ≥ 4.

What Requires Multi-Cycle Loops?

1. Resource Limitations.
2. Live Too Long.
3. Loop Carry Path.
4. Double Precision (FUL > 1).

Lab: Converting your dot-product code to Single-Precision Floating-Point.

Chapter 12 Software Optimisation - End -