Chapter 12 Software Optimisation

Chapter 12 Software Optimisation

Software Optimisation Chapter
This chapter consists of three parts: Part 1: Optimisation Methods. Part 2: Software Pipelining. Part 3: Multi-cycle Loop Pipelining.

Chapter 12 Software Optimisation Part 1 - Optimisation Methods

Objectives Introduction to optimisation and optimisation procedure.
Optimisation of C code using the code generation tools. Optimisation of assembly code.

Introduction Software optimisation is the process of manipulating software code to achieve two main goals: Faster execution time. Small code size. Note: It will be shown that in general there is a trade off between faster execution type and smaller code size.

Introduction To implement efficient software, the programmer must be familiar with: Processor architecture. Programming language (C, assembly or linear assembly). The code generation tools (compiler, assembler and linker).

Code Optimisation Procedure

Optimising C Compiler Options
The ‘C6x optimising C compiler uses the ANSI C source code and can perform optimisation currently up-to about 80% compared with a hand-scheduled assembly. However, to achieve this level of optimisation, knowledge of different levels of optimisation is essential. Optimisation is performed at different stages and levels.

Assembly Optimisation
To develop an appreciation of how to optimise code, let us optimise an FIR filter: For simplicity we write: [1]

To implement Equation 1, we need to perform the following steps: (1) Load the sample x[i]. (2) Load the coefficients h[i]. (3) Multiply x[i] and h[i]. (4) Add (x[i] * h[i]) to the content of an accumulator. (5) Repeat steps 1 to 4 N-1 times. (6) Store the value in the accumulator to y.

Steps 1 to 6 can be translated into the following ‘C6x assembly code: MVK .S1 0,B0 ; Initialise the loop counter MVK .S1 0,A5 ; Initialise the accumulator loop LDH .D1 *A8++,A2 ; Load the samples x[i] LDH .D1 *A9++,A3 ; Load the coefficients h[i] NOP 4 ; Add “nop 4” because the LDH has a latency of 5. MPY .M1 A2,A3,A4 ; Multiply x[i] and h[i] NOP ; Multiply has a latency of 2 cycles ADD .L1 A4,A5,A5 ; Add “x [i]. h[i]” to the accumulator [B0] SUB .L2 B0,1,B0 ;  [B0] B .S1 loop ;  loop overhead NOP 5 ;  The branch has a latency of 6 cycles

In order to optimise the code, we need to: (1) Use instructions in parallel. (2) Remove the NOPs. (3) Remove the loop overhead (remove SUB and B: loop unrolling). (4) Use word access or double-word access instead of byte or half-word access.

Step 1 - Using Parallel Instructions
.M1 .M2 .L1 .L2 .S1 .S2 .D1 .D2 Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 NOP ldh ldh nop nop nop nop mpy nop add sub b nop nop nop nop nop

Step 1 - Using Parallel Instructions
.M1 .M2 .L1 .L2 .S1 .S2 .D1 .D2 Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 NOP ldh ldh nop nop nop nop mpy nop add sub b nop nop nop Note: Not all instructions can be put in parallel since the result of one unit is used as an input to the following unit. nop nop

Step 2 - Removing the NOPs
.L1 .L2 .S1 .S2 .D1 .D2 Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 NOP ldh ldh sub b nop nop mpy nop add loop LDH .D1 *A8++,A2 LDH .D1 *A9++,A3 [B0] SUB .L2 B0,1,B0 [B0] B .S1 loop NOP 2 MPY .M1 A2,B3,A4 NOP ADD .L1 A4,A5,A5

Step 3 - Loop Unrolling The SUB and B instructions consume at least two extra cycles per iteration (this is known as branch overhead). LDH .D1 *A8++,A2 ;Start of iteration 1 || LDH .D2 *B9++,B3 NOP 4 MPY .M1X A2,B3,A4 ;Use of cross path NOP ADD .L1 A4,A5,A5 LDH .D1 *A8++,A2 ;Start of iteration 2 MPY .M1 A2,B3,A4 NOP ; : LDH .D1 *A8++,A2 ; Start of iteration n NOP 4 loop LDH .D1 *A8++,A2 LDH .D1 *A9++,A3 [B0] SUB .L2 B0,1,B0 [B0] B .S1 loop NOP 2 MPY .M1 A2,A3,A4 NOP ADD .L1 A4,A5,A5

Step 4 - Word or Double Word Access
The ‘C6711 has two 64-bit data buses for data memory access and therefore up to two 64-bit can be loaded into the registers at any time (see Chapter 2). In addition the ‘C6711 devices have variants of the multiplication instruction to support different operation (see Chapter 2). Note: Store can only be up to 32-bit.

Step 4 - Word or Double Word Access
Using word access, MPY and MPYH the previous code can be written as: loop LDW .D1 *A9++,A3 ; 32-bit word is loaded in a single cycle || LDW .D2 *B6++,B1 NOP 4 [B0] SUB .L2 [B0] B .S1 loop NOP 2 MPY .M1 A3,B1,A4 || MPYH .M2 A3,B1,B3 NOP ADD .L1 A4,B3,A5 Note: By loading words and using MPY and MPYH instructions the execution time has been halved since in each iteration two 16x16-bit multiplications are performed.

These increase performance and reduce code size.
Optimisation Summary It has been shown that there are four complementary methods for code optimisation: Using instructions in parallel. Filling the delay slots with useful code. Using word or double word load. Loop unrolling. These increase performance and reduce code size.

This increases performance but increases code size.
Optimisation Summary It has been shown that there are four complementary methods for code optimisation: Using instructions in parallel. Filling the delay slots with useful code. Using word or double word load. Loop unrolling. This increases performance but increases code size.

Chapter 12 Software Optimisation Part 2 - Software Pipelining

Objectives Why using Software Pipelining, SP?
Understand software pipelining concepts. Use software pipelining procedure. Code the word-wide software pipelined dot-product routine. Determine if your pipelined code is more efficient with or without prolog and epilog.

Why using Software Pipelining, SP?
SP creates highly optimized loop-code by: Putting several instructions in parallel. Filling delay slots with useful code. Maximizes functional units. SP is implemented by simply using the tools: Compiler options -o2 or -o3. Assembly Optimizer if .sa file.

Software Pipeline concept
To explain the concept of software pipelining, we will assume that all instructions execute in one cycle. LDH || LDH MPY ADD How many cycles would it take to perform this loop 5 times? (Disregard delay-slots). ______________ cycles

Software Pipeline Example
LDH || LDH MPY ADD How many cycles would it take to perform this loop 5 times? (Disregard delay-slots). ______________ cycles 5 x 3 = 15 Let’s examine hardware (functional units) usage ...

Non-Pipelined Code 1 Cycle ldh .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 .D1 .D2
mpy 3 add 4 ldh 5 mpy 6 add 7 ldh 8 mpy 9 add

Pipelining Code Pipelining these instructions took 1/2 the cycles! .M1
ldh 2 mpy ldh 3 add mpy ldh 4 add mpy ldh 5 add mpy ldh 6 add mpy 7 add Pipelining these instructions took 1/2 the cycles!

Pipelining Code Pipelining these instructions takes only 7 cycles! .M1
ldh 2 mpy ldh 3 add mpy ldh 4 add mpy ldh 5 add mpy ldh 6 add mpy 7 add Pipelining these instructions takes only 7 cycles!

Single-cycle “loop” iterated three times.
Pipelining Code Prolog Staging for loop. .M1 .L1 .D1 .D2 ldh 1 mpy 2 add 3 4 5 6 7 Loop Kernel Single-cycle “loop” iterated three times. Epilog Completing final operations.

Pipelined Code prolog: LDH ; load 1 || LDH
MPY ; mpy 1 || LDH ; load 2 || LDH loop: ADD ; add 1 || MPY ; mpy 2 || LDH ; load 3 || LDH ADD ; add 2 || MPY ; mpy 3 || LDH ; load 4 || LDH .

Software Pipelining Procedure
1. Write algorithm in C code & verify. 2. Write ‘C6x Linear Assembly code. 3. Create dependency graph. 4. Allocate registers. 5. Create scheduling table. 6. Translate scheduling table to ‘C6x code.

Software Pipelining Example (Step 1)
short DotP(short *m, short *n, short count) { int i; short product; short sum = 0; for (i=0; i < count; i++) { product = m[i] * n[i]; sum += product; } return(sum);

1. Write algorithm in C code & verify. 2. Write ‘C6x Linear Assembly code. 3. Create dependency graph. 4. Allocate registers. 5. Create scheduling table. 6. Translate scheduling table to ‘C6x code.

Write code in Linear Assembly (Step 2)
; for (i=0; i < count; i++) ; prod = m[i] * n[i]; ; sum += prod; loop: ldh *p_m++, m ldh *p_n++, n mpy m, n, prod add prod, sum, sum [count] sub count, 1, count [count] b loop 1. No NOP’s required. 2. No parallel instructions required. 3. You don’t have to specify: Functional units, or Registers.

1. Write algorithm in C code & verify. 2. Write ‘C6x Linear Assembly code. 3. Create a dependency graph (4 steps). 4. Allocate registers. 5. Create scheduling table. 6. Translate scheduling table to ‘C6x code.

Dependency Graph Terminology
b na Child Node Parent Node Path 5 .L .D LDH NOT Conditional Path

Dependency Graph Steps
(a) Draw the algorithm nodes and paths. (b) Write the number of cycles it takes for each instruction to complete execution. (c) Assign “required” function units to each node. (d) Partition the nodes to A and B sides and assign sides to all functional units.

Dependency Graph (Step a)
In this step each instruction is represented by a node. The node is represented by a circle, where: Outside: write instruction. Inside: register where result is written. Nodes are then connected by paths showing the data flow. Note: Conditional paths are represented by dashed lines.

m LDH

m LDH n

m LDH n prod MPY

m LDH n prod MPY sum ADD

m LDH n prod MPY sum ADD count SUB loop B

Dependency Graph (Step b)
In this step the number of cycles it takes for each instruction to complete execution is added to the dependency graph. It is written along the associated data path.

Dependency Graph (Step b)
m LDH n prod MPY sum ADD 5 2 1 count SUB loop B 1 6

Dependency Graph (Step c)
In this step functional units are assigned to each node. It is advantageous to start allocating units to instructions which require a specific unit: Load/Store. Branch. We do not need to be concerned with multiply as this is the only operation that the .M unit performs. Note: The side is not allocated at this stage.

Dependency Graph (Step c)
m LDH n prod MPY sum ADD .D .D 5 5 count SUB 1 .M 1 2 loop B 1 .S 6

Dependency Graph (Step d)
The data path is partitioned into side A and B at this stage. To optimise code we need to ensure that a maximum number of units are used with a minimum number of cross paths. To make the partition visible on the dependency graph a line is used. The side can then be added to the functional units associated with each instruction or node.

A Side B Side m LDH n prod MPY sum ADD .D .D 5 5 count SUB 1 .M 1 2 loop B 1 .S 6

m LDH n prod MPY sum ADD A Side .D1 .D2 5 2 1 count SUB loop B .M1x .L1 .L2 .S2 B Side 6

Step 4 - Allocate Functional Units
.M1 .D1 .S1 x1 .L2 .M2 .D2 .S2 x2 sum prod m .M1x count n loop Do we have enough functional units to code this algorithm in a single-cycle loop? A Side B Side m LDH n .D1 .D2 5 5 prod MPY count SUB 1 .M1x .L2 1 2 sum ADD loop B 1 .L1 .S2 6

Step 4 - Allocate Registers
Content of Register File A &a a prod sum Reg. A Reg. B A0 B0 A1 B1 A2 B2 A3 B3 A4 B4 ... A15 B15 Content of Register File B count &b b

Step 5 - Create Scheduling Table
LOOP PROLOG 1 2 3 4 5 6 7 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 How do we know the loop ends up in cycle 8?

Length of Prolog Answer:
m LDH Answer: Count up the length of longest path, in this case we have: = 8 cycles 5 prod MPY 2 sum ADD 1

Scheduling Table 1 2 3 4 5 6 7 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 LOOP
PROLOG 1 2 3 4 5 6 7 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2

Scheduling Table 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 7 6 5 4 3 2 1 add *
LOOP PROLOG add * B * mpy * ldh m * ldh n Branch here Where do we want to branch?

Scheduling Table .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add * mpy B 8 7 6 5 4
LOOP PROLOG .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add * mpy B 8 7 6 5 4 3 2 1 ldh m ldh n * sub

Translate Scheduling Table to ‘C6x Code
C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 PROLOG LOOP 7 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add sub * * * * * * * B * mpy * ldh m * ldh n

C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 PROLOG LOOP 7 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 sub * * * * * * * B * mpy * ldh m * ldh n

PROLOG C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 LOOP 7 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 sub * * * * * * * B C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop * mpy * ldh m * ldh n

C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 LOOP 7 6 5 4 3 2 1 C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add sub * * * * * * C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop * B * mpy C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop * ldh m * ldh n

C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 LOOP C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 8 7 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop sub * * sub * * * * B C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop * mpy * ldh ldh m C ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop * ldh ldh n

PROLOG LOOP 8 7 6 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add C6 ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop || mpy .M1x A2,B2,A3 sub * * * sub * * * B * mpy * ldh ldh m * ldh ldh n

PROLOG LOOP 8 7 6 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add C7 ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop || mpy .M1x A2,B2,A3 sub * * * sub * * * B * mpy * ldh ldh m * ldh ldh n

PROLOG LOOP 8 7 6 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add * Single-Cycle Loop loop: ldh .D1 *A1++,A2 || ldh .D2 *B1++,B2 || [B0] sub .L2 B0,1,B0 || [B0] B .S2 loop || mpy .M1x A2,B2,A3 || add .L1 A4,A3,A4 sub * * * sub * * * B * mpy * ldh ldh m * ldh ldh n See Chapter 14 for practical examples

With this method we have only created the prolog and the loop. Therefore if the filter has 100 taps, then we need to repeat the loop 100 times as we need 100 adds. This means that we are performing 107 loads. These 7 extra loads may lead to some illegal memory acesses. .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add mpy B 8 7 6 5 4 3 2 1 ldh m ldh n sub LOOP PROLOG

Solution: The Epilog We only created the Prolog and Loop …
What about the Epilog? The Epilog can be extracted from your results as described below. See example in the next slide.

Dot-Product with Epilog
Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add Epilog = Loop - Prolog And there is no sub or b in the epilog

Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add e2: mpy || add Epilog = Loop - Prolog And there is no sub or b in the epilog

Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add e2: mpy || add e3: mpy || add Epilog = Loop - Prolog And there is no sub or b in the epilog

Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add e2: mpy || add e3: mpy || add e4: mpy || add Epilog = Loop - Prolog And there is no sub or b in the epilog

Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add e2: mpy || add e3: mpy || add e4: mpy || add e5: mpy || add Epilog = Loop - Prolog And there is no sub or b in the epilog

Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add e2: mpy || add e3: mpy || add e4: mpy || add e5: mpy || add Epilog = Loop - Prolog And there is no sub or b in the epilog e6: add

Prolog p1: ldh||ldh p2: ldh||ldh || []sub p3: ldh||ldh || []sub || []b p4: ldh||ldh || []sub || []b p5: ldh||ldh || []sub || []b p6: ldh||ldh || mpy || []sub || []b p7: ldh||ldh || mpy || []sub || []b Loop loop: ldh || ldh || mpy || add || [] sub || [] b Epilog e1: mpy || add e2: mpy || add e3: mpy || add e4: mpy || add e5: mpy || add e6: add Epilog = Loop - Prolog And there is no sub or b in the epilog e7: add

Scheduling Table: Prolog, Loop and Epilog

Loop only! Can the code be written as a loop only (i.e. no prolog or epilog)? Yes!

Loop only! (i) Remove all instructions except the branch. .L1 .L2 .S1
.D1 .D2 add * mpy B 8 7 6 5 4 3 2 1 ldh m ldh n sub LOOP PROLOG (i) Remove all instructions except the branch.

Loop only! (i) Remove all instructions except the branch. PROLOG 6 5 4
3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add sub B B mpy ldh m ldh n

Loop only! (i) Remove all instructions except the branch.
PROLOG (i) Remove all instructions except the branch. (ii) Zero input registers, accumulator and product registers. 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 zero a zero sum add sub zero prod zero b B B mpy ldh m ldh n

Loop only! (i) Remove all instructions except the branch.
PROLOG (i) Remove all instructions except the branch. (ii) Zero input registers, accumulator and product registers. (iii) Adjust the number of subtractions. 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 zero a zero sum add sub sub zero prod zero b B B mpy ldh m ldh n

Loop Only - Final Code Overhead Loop b loop b loop
|| zero m ;input register || zero n ;input register || zero prod ;product register || zero sum ;accumulator || sub ;modify count register loop ldh || ldh || mpy || add || [] sub || [] b loop Overhead Loop

Laboratory exercise Software pipeline using the LDW version of the Dot-Product routine: (1) Write linear assembly. (2) Create dependency graph. (3) Complete scheduling table. (4) Transfer table to ‘C6000 code. To Epilogue or Not to Epilog? Determine if your pipelined code is more efficient with or without prolog and epilog.

Lab Solution: Step 1 - Linear Assembly
; for (i=0; i < count; i++) ; prod = m[i] * n[i]; ; sum += prod; *** count becomes 20 *** loop: ldw *p_m++, m ldw *p_n++, n mpy m, n, prod mpyh m, n, prodh add prod, sum, sum add prodh, sumh, sumh [count] sub count, 1, count [count] b loop ; Outside of Loop add sum, sumh, sum 31

Step 2 - Dependency Graph
prodh MPYH m LDW .D1 n .D2 5 2 .M1x prod MPY .M2x 1 sumh ADD .L2 sum .L1 count SUB loop B .S2 .S1 6 A Side B Side 43

Step 2 - Functional Units
.M1 .D1 .S1 x1 .L2 .M2 .D2 .S2 x2 sum prod m loop .M1x sumh prodh n count .M2x Do we still have enough functional units to code this algorithm in a single-cycle loop? Yes ! 45

Step 2 - Registers Register File A # # Register File B A0 B0 count A1
return address &a/ret value A4 B4 &x a A5 B5 x count/prod A6 B6 prodh sum A7 B7 sumh 46

Step 3 - Schedule Algorithm
LOOP PROLOG 8 7 6 5 4 3 2 1 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 add add B6 B5 B4 B3 B2 B1 sub7 sub6 sub5 sub4 sub3 sub2 sub1 mpy3 mpy2 mpy mpyh3 mpyh2 mpyh ldw8 ldw7 ldw6 ldw5 ldw4 ldw3 ldw2 ldw m ldw8 ldw7 ldw6 ldw5 ldw4 ldw3 ldw2 ldw n 52

Step 4 - ‘C6000 Code The complete code is available in the following location: \Links\DotP LDW.pdf 53

Why Conditional Subtract?
loop: ldh *p_m++, m ldh *p_n++, n mpy m, n, prod add prod, sum, sum [count] sub count, 1, count [count] b loop Without Cond. Subtract: Loop (count = 1) (B) With Cond. Subtract: Loop (count = 1) (B) loop (count = 0) (B) X loop (count = 0) (B) X loop (count = -1) (B) loop (count = 0) (B) X loop (count = -2) (B) loop (count = 0) (B) X loop (count = -3) (B) loop (count = 0) (B) X loop (count = -4) (B) loop (count = 0) (B) X Loop never ends Loop ends

Chapter 12 Software Optimisation Part 3 - Pipelining Multi-cycle Loops

Objectives Software pipeline the weighted vector sum algorithm.
Describe four iteration interval constraints. Calculate minimum iteration interval. Convert and optimize the dot-product code to floating point code.

What Requires Multi-Cycle Loops?
Resource Limitations Running out of resources (Functional Units, Registers, Bus Accesses) Weighted Vector Sum example requires three .D units Live Too Long Minimum iteration interval defined by length of time a Variable is required to exist Loop Carry Path Latency required between loop iterations FIR example and SP floating-point dot product examples are demonstrated Functional Unit Latency > 1 A few ‘C67x instructions require functional units for 2 or 4 cycles rather than one. This defines a minimum iteration interval.

Four reasons: 1. Resource Limitations. 2. Live Too Long. 3. Loop Carry Path. 4. Double Precision (FUL > 1). Use these four constraints to determine the smallest Iteration Interval (Minimum Iteration Interval or MII).

Resource Limitation: Weighted Vector Sum
Step 1 - C Code void WVS(short *c, short *b, short *a, short r, short n) { int i; for (i=0; i < n; i++) { c[i] = a[i] + (r * b[i]) >> 15; } a, b: input arrays c: output array n: length of arrays r: weighting factor Store .D Load .D Load .D Requires 3 .D units

1. Write algorithm in C code & verify. 2. Write ‘C6x Linear Assembly code. 3. Create dependency graph. 4. Allocate registers. 5. Create scheduling table. 6. Translate scheduling table to ‘C6x code.  Write algorithm in C & verify. 2. Write ‘C6x Linear Assembly Code.

c[i] = a[i] + (r * b[i]) >> 15;
Step 2 - ‘C6x Linear Code c[i] = a[i] + (r * b[i]) >> 15; loop: LDH *a++, ai LDH *b++, bi MPY r, bi, prod SHR prod, 15, sum ADD ai, sum, ci STH ci, *c++ [i] SUB i, 1, i [i] B loop The full code is available here: \Links\Wvs.sa

Step 3 - Dependency Graph
5 1 2 15 ai LDH bi r prod MPY sum SHR ci ADD *c++ STH A Side B .D1 .L1 .S2 .M2 .D2 .L2 .S1 1 6 i SUB B loop

Step 4 -Allocate Functional Units
ci ai, *c i prod bi sum .L1 .M1 .D1 .S1 .L2 .M2 .D2 .S2 loop This requires 3 .D units therefore it cannot fit into a single cycle loop. This may fit into a 2 cycle loop if there are no other constraints.

2 Cycle Loop 2 cycles per loop iteration .L1 .L2 .S1 .S2 .M1 .M2 .D1
Iteration Interval (II): # cycles per loop iteration.

Multi-Cycle Loop Iterations
.D1 ldh ldh ldh .D2 ldh ldh ldh .S2 shr shr shr .M1 mpy mpy mpy .M2 .L1 add add add .L2 sub sub sub .S1 b b b cycle 2 cycle 4 cycle 6 .D1 sth sth sth .D2 .S1 .S2 .M1 .M2 .L1 .L2

10 How long is the Prolog? What is the length of the longest path? 10
ai bi prod sum ci *c++ 5 1 2 10 What is the length of the longest path? 10 How many cycles per loop? 2

Step 5 - Create Scheduling Chart (0)
2 4 6 8 Unit\cycle .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 1 3 5 7 9

Step 5 - Create Scheduling Chart
Unit\cycle 2 4 6 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2

Unit\cycle 2 4 6 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 .S1 .S2 .M1 .M2 MPY mi * .D1 .D2

Unit\cycle 2 4 6 8 .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 .S1 .S2 SHR sum * .M1 .M2 MPY mi * .D1 .D2

Unit\cycle 2 4 6 8 .L1 ADD ci .L2 .S1 .S2 .M1 .M2 .D1 .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 .S1 .S2 SHR sum * .M1 .M2 MPY mi * .D1 .D2

2 4 6 8 LDH bi * Unit\cycle .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 1 3 5 7 9 MPY mi SHR sum ADD ci STH c[i]

2 4 6 8 LDH bi * Unit\cycle .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 1 3 5 7 9 MPY mi SHR sum ADD ci STH c[i] * LDH ai

2 4 6 8 LDH bi * Unit\cycle .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 1 3 5 7 9 MPY mi SHR sum ADD ci STH c[i] Conflict * LDH ai

Conflict Solution Here are two possibilities ... Which is better? 2 4
2 4 6 8 LDH bi * Unit\cycle .D1 .D2 .L1 .L2 .S1 .S2 .M1 .M2 1 3 5 7 9 MPY mi SHR sum STH c[i] LDH ai LDH ai

Conflict Solution Here are two possibilities ...
Which is better? Move the LDH to cycle 2. (so you don’t have to go back and recheck crosspaths) 2 4 6 8 LDH bi * Unit\cycle .D1 .D2 .L1 .L2 .S1 .S2 .M1 .M2 1 3 5 7 9 MPY mi SHR sum STH c[i] LDH ai

Unit\cycle 2 4 6 8 .L1 ADD ci .L2 .S1 .S2 .M1 .M2 .D1 LDH ai * .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 .S1 .S2 SHR sum * .M1 .M2 MPY mi * .D1 STH c[i] .D2

Unit\cycle 2 4 6 8 .L1 ADD ci .L2 .S1 [i] B * .S2 .M1 .M2 .D1 LDH ai * .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 .S1 .S2 SHR sum * .M1 .M2 MPY mi * .D1 STH c[i] .D2

Unit\cycle 2 4 6 8 .L1 ADD ci .L2 .S1 [i] B * .S2 .M1 .M2 .D1 LDH ai * .D2 LDH bi * Unit\cycle 1 3 5 7 9 .L1 .L2 [i] SUB i * .S1 .S2 SHR sum * .M1 .M2 MPY mi * .D1 STH c[i] .D2

Unit\cycle 2 4 6 8 .L1 ADD ci .L2 .S1 [i] B * * .S2 .M1 .M2 .D1 LDH ai * * * .D2 LDH bi * * * * Unit\cycle 1 3 5 7 9 .L1 .L2 [i] SUB i * * * .S1 .S2 SHR sum * .M1 .M2 MPY mi * * .D1 LDH ai STH c[i] .D2

2 Cycle Loop Kernel Unit\cycle 2 4 6 8 .L1 ADD ci .L2 .S1 [i] B * *
2 4 6 8 .L1 ADD ci .L2 .S1 [i] B * * .S2 .M1 .M2 .D1 LDH ai * * * .D2 LDH bi * * * * Unit\cycle 1 3 5 7 9 .L1 .L2 [i] SUB i * * * .S1 .S2 SHR sum * .M1 .M2 MPY mi * * .D1 LDH ai STH c[i] .D2

Four reasons: 1. Resource Limitations. 2. Live Too Long. 3. Loop Carry Path. 4. Double Precision (FUL > 1).

Live Too Long - Example c = (a >> 5) + a ai 5 x 1 ci LDH ADD SHR
LDH ai 5 a0 valid 6 1 LDH 2 3 4

Live Too Long - Example ai 5 x 1 ci LDH ai 5 a0 valid 6 a1 1 2 3 4 LDH
LDH ai 5 a0 valid 6 a1 1 LDH 2 3 4 ai LDH ci ADD 5 1 x SHR .S1 .L1 .D1

Live Too Long - Example ai 5 x 1 ci 1 2 3 4 5 6 LDH ADD SHR .S1 .L1
1 LDH 2 LDH 3 4 5 6 ai LDH ci ADD 5 1 x SHR .S1 .L1 .D1 LDH ai LDH LDH a0 valid a1 SHR x0 valid

Oops, rather than adding Let’s look at one solution ...
Live Too Long - Example 1 LDH 2 LDH 3 4 5 6 ai LDH ci ADD 5 1 x SHR .S1 .L1 .D1 LDH ai LDH LDH a0 valid a1 SHR x0 valid ADD Oops, rather than adding a0 + x0 we got a1 + x0 Let’s look at one solution ...

Live Too Long - 2 Cycle Solution
LDH ai 2 LDH 4 6 a0 valid 1 3 5 7 a1 With a 2 cycle loop, a0 is valid for 2 cycles.

LDH ai 2 LDH 4 6 a0 valid x0 valid 1 3 5 SHR 7 a1 Notice, a0 and x0 are both valid for 2 cycles which is the length of the Iteration Interval Adding them ...

LDH ai 2 LDH 4 6 a0 valid x0 valid ADD 1 3 5 SHR 7 a1 Works! But what’s the drawback? 2 cycle loop is slower. Here’s a better solution ...

LDH ai 5 a0 valid MV b SHR 1 LDH 2 3 4 6 a1 b valid x0 valid ADD ai LDH ci ADD 5 1 x SHR .S1 .L1 .D1 b MV .S2 Using a temporary register solves this problem without increasing the Minimum Iteration Interval

Loop Carry Path The loop carry path is a path which feeds one variable from part of the algorithm back to another. p2 st_y0 MPY.M2 STH.D1 2 1 e.g. Loop carry path = 3. Note: The loop carry path is not the code loop.

Loop Carry Path, e.g. IIR Filter
IIR Filter Example y0 = a0*x0 + b1*y1

IIR.SA IIR Filter Example y0 = a0*x0 + b1*y1 IIR: ldh *a_1, A1
ldh *x1, A3 ldh *b_1, B1 ldh *y0, B0 ; y1 is previous y0 mpy A1, A3, prod1 mpy B1, B0, prod2 add prod1, prod2, prod2 sth prod2, *y0

Loop Carry Path - IIR Example
B1 y1 y0 st_y0 LDH.D1 LDH.D2 MPY.M1 MPY.M2 ADD.L1 STH.D1 5 2 1 IIR Filter Loop y0 = a1*x1 + b1*y1 Min Iteration Interval Resource = 2 (need 3 .D units) Loop Carry Path = 9 (9 = ) therefore, MII = 9 1 Result carries over from one iteration of the loop to the next. Can it be minimized?

Loop Carry Path - IIR Example (Solution)
B1 y1 y0 st_y0 LDH.D1 LDH.D2 MPY.M1 MPY.M2 ADD.L1 STH.D1 5 2 1 IIR Filter Loop y0 = a1*x1 + b1*y1 Min Iteration Interval Resource = 2 (need 3 .D units) 1 New Loop Carry Path = 3 (3 = 2 + 1) therefore, MII = 3 Since y0 is stored in a CPU register, it can be used directly by MPY (after the first loop iteration).

Reminder: Fixed-Point Dot-Product Example
LDH n prod MPY sum ADD .D1 .D2 5 2 .M1x .L1 Is there a loop carry path in this example? Yes, but it’s only “1” Min Iteration Interval Resource = 1 Loop Carry Path = 1  MII = 1 1 For the fixed-point implementation, the Loop Carry Path was not taken into account because it is equal to 1.

Loop Carry Path IIR Example. Enhancing the IIR.
Fixed-Point Dot-Product Example. Floating-Point Dot Product Example.

Loop Carry Path due to FUL > 1 Floating-Point Dot-Product Example
LDW n prod MPYSP sum ADDSP .D1 .D2 5 4 .M1x .L1 Min Iteration Interval Resource = 1 Loop Carry Path = 4  MII = 4 4

Unrolling the Loop If the MII must be four cycles long, then use all of them to calculate four results. m1 LDW n1 prod1 MPYSP sum1 ADDSP .D1 .D2 4 .M1x .L1 m2 n2 prod2 sum2 m3 n3 prod3 sum3 m4 n4 prod4 sum4

ADDSP Pipeline (Staggered Results)
Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 sum = x1 + x5 10 sum = x2 + x6 11 sum = x3 + x7 12 sum = x0 + x4 + x8 ADDSP takes 4 cycles or three delay slots to produce the result.

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 sum = x1 + x5 10 sum = x2 + x6 11 sum = x3 + x7 12 sum = x0 + x4 + x8

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 NOP sum = x1 + x5 10 NOP sum = x2 + x6 11 NOP sum = x3 + x7 12 NOP sum = x0 + x4 + x8 There are effectively four running sums: sum (i) = x(i) + x(i+4) + x(i+8) + … sum (i+1) = x(i+1) + x(i+5) + x(i+9) + … sum (i+2) = x(i+2) + x(i+6) + x(i+10) + … sum (i+3) = x(i+3) + x(i+7) + x(i+11) + …

There are effectively four running sums: sum (i) = x(i) + x(i+4) + x(i+8) + … sum (i+1) = x(i+1) + x(i+5) + x(i+9) + … sum (i+2) = x(i+2) + x(i+6) + x(i+10) + … sum (i+3) = x(i+3) + x(i+7) + x(i+11) + … These need to be combined after the last addition is complete...

ADDSP Pipeline (Combining Results)
Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 sum = x2 + x6 11 sum = x3 + x7 12 sum = x0 + x4 + x8

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 ADDSP sum, temp, sum1 sum = x2 + x6, temp = x1 + x5 11 sum = x3 + x7 12 sum = x0 + x4 + x8

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 ADDSP sum, temp, sum1 sum = x2 + x6, temp = x1 + x5 11 MV sum, temp sum = x3 + x7 12 sum = x0 + x4 + x8

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 ADDSP sum, temp, sum1 sum = x2 + x6, temp = x1 + x5 11 MV sum, temp sum = x3 + x7 12 ADDSP sum, temp, sum2 sum = x0 + x4 + x8, temp = x3 + x7

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 ADDSP sum, temp, sum1 sum = x2 + x6, temp = x1 + x5 11 MV sum, temp sum = x3 + x7 12 ADDSP sum, temp, sum2 sum = x0 + x4 + x8, temp = x3 + x7 13 NOP 14 NOP sum1 = x1 + x2 + x5 + x6

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 ADDSP sum, temp, sum1 sum = x2 + x6, temp = x1 + x5 11 MV sum, temp sum = x3 + x7 12 ADDSP sum, temp sum2 sum = x0 + x4 + x8, temp = x3 + x7 13 NOP 14 NOP sum1 = x1 + x2 + x5 + x6 15 NOP 16 ADDSP sum1, sum2, sum sum2 = x0 + x3 + x4 + x7 + x8

Cycle Instruction Result 0 ADDSP x0, sum, sum sum = 0 1 ADDSP x1, sum, sum sum = 0 2 ADDSP x2, sum, sum sum = 0 3 ADDSP x3, sum, sum sum = 0 4 ADDSP x4, sum, sum sum = x0 5 ADDSP x5, sum, sum sum = x1 6 ADDSP x6, sum, sum sum = x2 7 ADDSP x7, sum, sum sum = x3 8 ADDSP x8, sum, sum sum = x0 + x4 9 MV sum, temp sum = x1 + x5 10 ADDSP sum, temp, sum1 sum = x2 + x6, temp = x1 + x5 11 MV sum, temp sum = x3 + x7 12 ADDSP sum, temp sum2 sum = x0 + x4 + x8, temp = x3 + x7 13 NOP 14 NOP sum1 = x1 + x2 + x5 + x6 15 NOP 16 ADDSP sum1, sum2, sum sum2 = x0 + x3 + x4 + x7 + x8 17 NOP 18 NOP 19 NOP 20 NOP sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

MPYDP ties up the functional unit for 4 cycles.
Simple FUL Example 5 3 MPYDP prod 10 (4.9) .M1 1 MPYDP 2 3 4 5 6 ... MPYDP ties up the functional unit for 4 cycles.

A Better Way to Diagram this ...
1 MPYDP 5 9 13 2 6 10 14 3 7 11 prod1 15 prod2 4 8 12 16 Since the MPYDP instruction has a functional unit latency (FUL) of “4”, .M1 cannot be used again until the fifth cycle. Hence, MII  4.

1. Resource Limitations. 2. Live Too Long. 3. Loop Carry Path. 4. Double Precision (FUL > 1). Lab: Converting your dot-product code to Single-Precision Floating-Point.

Chapter 12 Software Optimisation - End -

Chapter 12 Software Optimisation

Similar presentations

Presentation on theme: "Chapter 12 Software Optimisation"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chapter 12 Software Optimisation

Similar presentations

Presentation on theme: "Chapter 12 Software Optimisation"— Presentation transcript:

Similar presentations

About project

Feedback