Chapter 12 Software Optimisation

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002

This chapter consists of three parts:
- Part 1: Optimisation Methods.
- Part 2: Software Pipelining.
- Part 3: Multi-cycle Loop Pipelining.

Chapter 12 Software Optimisation Part 1 - Optimisation Methods

Objectives
- Introduction to optimisation and the optimisation procedure.
- Optimisation of C code using the code generation tools.
- Optimisation of assembly code.

Introduction
Software optimisation is the process of manipulating software code to achieve two main goals:
- Faster execution time.
- Smaller code size.
Note: It will be shown that in general there is a trade-off between faster execution time and smaller code size.

Introduction
To implement efficient software, the programmer must be familiar with:
- The processor architecture.
- The programming language (C, assembly or linear assembly).
- The code generation tools (compiler, assembler and linker).

Code Optimisation Procedure
[Figure: code optimisation procedure flowchart]

Optimising C Compiler Options
- The ‘C6x optimising C compiler takes ANSI C source code and can currently produce code that reaches up to about 80% of the performance of hand-scheduled assembly.
- However, to achieve this level of optimisation, knowledge of the different optimisation levels is essential. Optimisation is performed at different stages and levels.

Assembly Optimisation
To develop an appreciation of how to optimise code, let us optimise an FIR filter:
    y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k]
For simplicity we write:
    y = \sum_{i=0}^{N-1} h[i] \, x[i]    [1]

Assembly Optimisation
To implement Equation 1, we need to perform the following steps:
(1) Load the sample x[i].
(2) Load the coefficient h[i].
(3) Multiply x[i] and h[i].
(4) Add (x[i] * h[i]) to the content of an accumulator.
(5) Repeat steps 1 to 4 N-1 times.
(6) Store the value in the accumulator to y.
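The six steps above can be sketched in C before turning to assembly. This is only an illustrative sketch: the function name `fir_sum` and the 32-bit accumulator are assumptions made here, not part of the original slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products y = sum over i of h[i] * x[i], mirroring steps 1 to 6.
 * fir_sum is a hypothetical name chosen for this sketch. */
int32_t fir_sum(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;                        /* accumulator */
    for (size_t i = 0; i < n; i++) {        /* step 5: repeat for every tap */
        int16_t xi = x[i];                  /* step 1: load the sample */
        int16_t hi = h[i];                  /* step 2: load the coefficient */
        int32_t prod = (int32_t)xi * hi;    /* step 3: multiply */
        acc += prod;                        /* step 4: add to the accumulator */
    }
    return acc;                             /* step 6: the caller stores this in y */
}
```

Each statement in the loop body corresponds to one instruction of the assembly version, which makes the later scheduling steps easier to follow.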

Assembly Optimisation
Steps 1 to 6 can be translated into the following ‘C6x assembly code:
            MVK  .S2  0,B0        ; Initialise the loop counter
            MVK  .S1  0,A5        ; Initialise the accumulator
    loop    LDH  .D1  *A8++,A2    ; Load the samples x[i]
            LDH  .D1  *A9++,A3    ; Load the coefficients h[i]
            NOP  4                ; Add "NOP 4" because LDH has a latency of 5
            MPY  .M1  A2,A3,A4    ; Multiply x[i] and h[i]
            NOP                   ; Multiply has a latency of 2 cycles
            ADD  .L1  A4,A5,A5    ; Add x[i]*h[i] to the accumulator
     [B0]   SUB  .L2  B0,1,B0     ; \
     [B0]   B    .S1  loop        ; / loop overhead
            NOP  5                ; The branch has a latency of 6 cycles

Assembly Optimisation
In order to optimise the code, we need to:
(1) Use instructions in parallel.
(2) Remove the NOPs.
(3) Remove the loop overhead (remove SUB and B: loop unrolling).
(4) Use word access or double-word access instead of byte or half-word access.

Step 1 - Using Parallel Instructions
[Figure: cycle-by-cycle diagram of the loop (ldh, ldh, mpy, add, sub, b and the nops between them) with independent instructions placed in parallel.]
Note: Not all instructions can be put in parallel since the result of one unit is used as an input to the following unit.

Step 2 - Removing the NOPs
SUB and B are moved into the load delay slots, so most of the NOPs disappear:
    loop    LDH  .D1  *A8++,A2
            LDH  .D1  *A9++,A3
     [B0]   SUB  .L2  B0,1,B0
     [B0]   B    .S1  loop
            NOP  2
            MPY  .M1  A2,A3,A4
            NOP
            ADD  .L1  A4,A5,A5

Step 3 - Loop Unrolling
The SUB and B instructions consume at least two extra cycles per iteration (this is known as branch overhead):
    loop    LDH  .D1  *A8++,A2
            LDH  .D1  *A9++,A3
     [B0]   SUB  .L2  B0,1,B0
     [B0]   B    .S1  loop
            NOP  2
            MPY  .M1  A2,A3,A4
            NOP
            ADD  .L1  A4,A5,A5
Unrolling the loop removes SUB and B altogether:
            LDH  .D1   *A8++,A2    ; Start of iteration 1
         || LDH  .D2   *B9++,B3
            NOP  4
            MPY  .M1X  A2,B3,A4    ; Use of cross path
            NOP
            ADD  .L1   A4,A5,A5
            LDH  .D1   *A8++,A2    ; Start of iteration 2
         || LDH  .D2   *B9++,B3
            NOP  4
            MPY  .M1X  A2,B3,A4
            NOP
            ADD  .L1   A4,A5,A5
            ; :
            LDH  .D1   *A8++,A2    ; Start of iteration n
         || LDH  .D2   *B9++,B3
            NOP  4
            MPY  .M1X  A2,B3,A4
            NOP
            ADD  .L1   A4,A5,A5
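The same idea can be illustrated at C level. The sketch below unrolls the sum of products by a factor of four, so the counter update and branch are paid once per four taps instead of once per tap. The function name `dotp_unrolled4`, the unroll factor and the assumption that `n` is a multiple of four are choices made for this illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products unrolled by four.
 * Assumption for this sketch: n is a multiple of 4. */
int32_t dotp_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {     /* one branch per four taps */
        acc += (int32_t)x[i]     * h[i];
        acc += (int32_t)x[i + 1] * h[i + 1];
        acc += (int32_t)x[i + 2] * h[i + 2];
        acc += (int32_t)x[i + 3] * h[i + 3];
    }
    return acc;
}
```

The body is four times longer than the rolled loop, which is the code-size cost that accompanies the reduced branch overhead.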

Step 4 - Word or Double Word Access
- The ‘C6711 has two 64-bit data buses for data memory access and therefore up to two 64-bit values can be loaded into the registers at any time (see Chapter 2).
- In addition, the ‘C6711 devices have variants of the multiplication instruction to support different operations (see Chapter 2).
Note: A store can only be up to 32 bits wide.

Step 4 - Word or Double Word Access
Using word access, MPY and MPYH, the previous code can be written as:
    loop    LDW   .D1   *A9++,A3    ; 32-bit word is loaded in a single cycle
         || LDW   .D2   *B6++,B1
     [B0]   SUB   .L2   B0,1,B0
     [B0]   B     .S1   loop
            NOP   2
            MPY   .M1X  A3,B1,A4    ; product of the low half-words
         || MPYH  .M2X  A3,B1,B3    ; product of the high half-words
            NOP
            ADD   .L1   A4,A5,A5    ; accumulate the low products
         || ADD   .L2   B3,B5,B5    ; accumulate the high products (add A5 and B5 after the loop)
Note: By loading words and using the MPY and MPYH instructions the execution time has been halved, since in each iteration two 16x16-bit multiplications are performed.
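A portable C model may help to show what the word-wide version computes. In this sketch each 32-bit word carries two 16-bit values; `mpy_lo` and `mpy_hi` are hypothetical helpers that imitate MPY (low halves) and MPYH (high halves). The names, the packing order and the use of two partial sums are assumptions of this illustration, and the narrowing casts rely on the usual two's-complement behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one word: lo in bits 15..0, hi in bits 31..16. */
uint32_t pack2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Model of MPY: signed product of the two low half-words. */
int32_t mpy_lo(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(uint16_t)(a & 0xFFFFu) *
           (int16_t)(uint16_t)(b & 0xFFFFu);
}

/* Model of MPYH: signed product of the two high half-words. */
int32_t mpy_hi(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(uint16_t)(a >> 16) *
           (int16_t)(uint16_t)(b >> 16);
}

/* Sum of products over nwords packed words, two taps per iteration. */
int32_t dotp_packed(const uint32_t *xw, const uint32_t *hw, size_t nwords)
{
    int32_t sum_lo = 0;                      /* partial sum of low products */
    int32_t sum_hi = 0;                      /* partial sum of high products */
    for (size_t i = 0; i < nwords; i++) {
        sum_lo += mpy_lo(xw[i], hw[i]);
        sum_hi += mpy_hi(xw[i], hw[i]);
    }
    return sum_lo + sum_hi;                  /* combined once, outside the loop */
}
```

Because one load now delivers two samples, the loop runs half as many times for the same number of taps.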

Optimisation Summary
It has been shown that there are four complementary methods for code optimisation:
- Using instructions in parallel.
- Filling the delay slots with useful code.
- Using word or double word load.
- Loop unrolling.
The first three increase performance and reduce code size. Loop unrolling increases performance but increases code size.

Chapter 12 Software Optimisation Part 2 - Software Pipelining

Objectives
- Explain why software pipelining (SP) is used.
- Understand software pipelining concepts.
- Use the software pipelining procedure.
- Code the word-wide software pipelined dot-product routine.
- Determine if your pipelined code is more efficient with or without prolog and epilog.

Why Use Software Pipelining (SP)?
- SP creates highly optimised loop code by:
  - Putting several instructions in parallel.
  - Filling delay slots with useful code.
  - Maximising the use of the functional units.
- SP is implemented by simply using the tools:
  - Compiler options -o2 or -o3.
  - The assembly optimiser, if the source is a .sa file.

Software Pipeline Concept
To explain the concept of software pipelining, we will assume that all instructions execute in one cycle.
       LDH
    || LDH
       MPY
       ADD
How many cycles would it take to perform this loop 5 times? (Disregard delay slots.)
Answer: 5 x 3 = 15 cycles.
Let's examine hardware (functional unit) usage...

Non-Pipelined Code
    Cycle  .M1  .M2  .L1  .L2  .S1  .S2  .D1  .D2
    1                                    ldh  ldh
    2      mpy
    3                add
    4                                    ldh  ldh
    5      mpy
    6                add
    7                                    ldh  ldh
    8      mpy
    9                add

Pipelining Code
    Cycle  .M1  .M2  .L1  .L2  .S1  .S2  .D1  .D2
    1                                    ldh  ldh
    2      mpy                           ldh  ldh
    3      mpy       add                 ldh  ldh
    4      mpy       add                 ldh  ldh
    5      mpy       add                 ldh  ldh
    6      mpy       add
    7                add
Pipelining these instructions takes only 7 cycles, about half of the 15 cycles needed without pipelining!
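The cycle counts quoted above follow from a simple rule under the single-cycle assumption used on these slides: without pipelining every iteration pays for every stage, while with pipelining the stages are filled once and then one iteration completes per cycle. The helper names below are hypothetical and the model ignores delay slots, as the slides do.

```c
/* Cycles for a loop of `stages` single-cycle stages executed back to back. */
unsigned cycles_sequential(unsigned stages, unsigned iterations)
{
    return stages * iterations;
}

/* Cycles for the same loop when software pipelined:
 * fill the pipeline once, then finish one iteration per cycle. */
unsigned cycles_pipelined(unsigned stages, unsigned iterations)
{
    if (iterations == 0) {
        return 0;
    }
    return stages + iterations - 1;
}
```

For the three-stage loop (load, multiply, add) run five times this gives 15 and 7 cycles respectively, matching the two tables.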

Pipelining Code
    Cycle  .M1  .L1  .D1  .D2
    1                ldh  ldh    Prolog: staging for the loop
    2      mpy       ldh  ldh    Prolog
    3      mpy  add  ldh  ldh    Loop kernel
    4      mpy  add  ldh  ldh    Loop kernel
    5      mpy  add  ldh  ldh    Loop kernel
    6      mpy  add              Epilog: completing final operations
    7           add              Epilog
Loop kernel: a single-cycle "loop" iterated three times.

Pipelined Code
    prolog: LDH          ; load 1
         || LDH
            MPY          ; mpy 1
         || LDH          ; load 2
         || LDH
    loop:   ADD          ; add 1
         || MPY          ; mpy 2
         || LDH          ; load 3
         || LDH
            ADD          ; add 2
         || MPY          ; mpy 3
         || LDH          ; load 4
         || LDH
            ...
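The prolog, kernel and epilog structure can also be written out in C for the simplified three-stage case (load, multiply, add, one cycle each). This is a sketch under that assumption; `dotp_pipelined` is a hypothetical name, and on the DSP the three kernel statements would execute in parallel rather than one after the other.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product written in software-pipelined form with three stages. */
int32_t dotp_pipelined(const int16_t *m, const int16_t *n, size_t count)
{
    int32_t sum = 0;
    if (count < 2) {                         /* too short to pipeline */
        return count ? (int32_t)m[0] * n[0] : 0;
    }

    /* Prolog: stage the pipeline. */
    int16_t a = m[0], b = n[0];              /* load 1 */
    int32_t prod = (int32_t)a * b;           /* mpy 1 */
    a = m[1]; b = n[1];                      /* load 2 */

    /* Kernel: one add, one multiply and one load pair per pass. */
    for (size_t i = 2; i < count; i++) {
        sum += prod;                         /* add for element i-2 */
        prod = (int32_t)a * b;               /* mpy for element i-1 */
        a = m[i]; b = n[i];                  /* load element i */
    }

    /* Epilog: drain the operations still in flight. */
    sum += prod;
    prod = (int32_t)a * b;
    sum += prod;
    return sum;
}
```

With `count` equal to 5 the kernel executes three times, as in the loop-kernel table.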

Software Pipelining Procedure
1. Write algorithm in C code & verify.
2. Write ‘C6x Linear Assembly code.
3. Create dependency graph.
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to ‘C6x code.

Software Pipelining Example (Step 1)
    short DotP(short *m, short *n, short count)
    {
        int i;
        short product;
        short sum = 0;

        for (i = 0; i < count; i++)
        {
            product = m[i] * n[i];
            sum += product;
        }
        return (sum);
    }

Write code in Linear Assembly (Step 2)
    ; for (i=0; i < count; i++)
    ;   prod = m[i] * n[i];
    ;   sum += prod;
    loop:     ldh   *p_m++, m
              ldh   *p_n++, n
              mpy   m, n, prod
              add   prod, sum, sum
      [count] sub   count, 1, count
      [count] b     loop
In linear assembly:
1. No NOPs are required.
2. No parallel instructions are required.
3. You don't have to specify:
   - Functional units, or
   - Registers.

Software Pipelining Procedure, step 3: Create a dependency graph (4 steps).

Dependency Graph Terminology
[Figure: example dependency graph labelling a parent node, a child node, a path and a conditional path.]

Dependency Graph Steps
(a) Draw the algorithm nodes and paths.
(b) Write the number of cycles it takes for each instruction to complete execution.
(c) Assign "required" functional units to each node.
(d) Partition the nodes to A and B sides and assign sides to all functional units.

Dependency Graph (Step a)
- In this step each instruction is represented by a node.
- The node is represented by a circle, where:
  - Outside: write the instruction.
  - Inside: write the register where the result is written.
- Nodes are then connected by paths showing the data flow.
Note: Conditional paths are represented by dashed lines.

Dependency Graph (Step a)
[Figure: the dependency graph built up node by node.]
Nodes (result register, instruction): m (LDH), n (LDH), prod (MPY), sum (ADD), count (SUB), loop (B).
Paths: m and n feed prod; prod feeds sum; sum feeds back into sum; count feeds back into count and, through a conditional path, controls loop.

Dependency Graph (Step b)
- In this step the number of cycles it takes for each instruction to complete execution is added to the dependency graph.
- It is written along the associated data path.

Dependency Graph (Step b)
[Figure: dependency graph annotated with latencies.]
Latencies along the paths: LDH to MPY: 5 (both loads); MPY to ADD: 2; ADD to ADD: 1; SUB to SUB: 1; SUB to B: 1; B to loop: 6.

Dependency Graph (Step c)
- In this step functional units are assigned to each node.
- It is advantageous to start allocating units to instructions which require a specific unit:
  - Load/Store.
  - Branch.
- We do not need to be concerned with multiply as this is the only operation that the .M unit performs.
Note: The side is not allocated at this stage.

Dependency Graph (Step c)
[Figure: dependency graph annotated with functional units.]
Units assigned at this stage: LDH (m): .D; LDH (n): .D; MPY (prod): .M; B (loop): .S.

Dependency Graph (Step d)
- The data path is partitioned into side A and side B at this stage.
- To optimise code we need to ensure that a maximum number of units are used with a minimum number of cross paths.
- To make the partition visible on the dependency graph a line is used.
- The side can then be added to the functional units associated with each instruction or node.

Dependency Graph (Step d)
[Figure: dependency graph partitioned into A side and B side.]
A side: m LDH (.D1), prod MPY (.M1x), sum ADD (.L1).
B side: n LDH (.D2), count SUB (.L2), loop B (.S2).

Step 4 - Allocate Functional Units
Do we have enough functional units to code this algorithm in a single-cycle loop?
    A side             B side
    .L1   sum          .L2   count
    .M1   prod         .M2
    .D1   m            .D2   n
    .S1                .S2   loop
    x1    .M1x         x2
Each functional unit and cross path is needed at most once per iteration, so a single-cycle loop is possible.

Step 4 - Allocate Registers
    Register file A        Register file B
    A0                     B0   count
    A1   &a                B1   &b
    A2   a                 B2   b
    A3   prod              B3
    A4   sum               B4
    ...                    ...
    A15                    B15
Here a and b are the values loaded from the arrays m and n.

Step 5 - Create Scheduling Table
The table has one row per functional unit (.L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2) and one column per cycle: cycles 1 to 7 form the prolog and cycle 8 is the loop.
How do we know the loop ends up in cycle 8?

Length of Prolog
Answer: Count up the length of the longest path. In this case the path LDH (5), MPY (2), ADD (1) gives 5 + 2 + 1 = 8 cycles.
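The longest-path rule can be checked mechanically. The sketch below stores the data-flow part of the dependency graph as a small adjacency matrix and accumulates, for each node, its own latency plus the longest path leading into it. The node numbering, the fixed graph size and the function name `longest_path` are assumptions of this illustration; nodes must be listed so that every producer comes before its consumers.

```c
enum { N_NODES = 4 };   /* hypothetical graph size: LDH m, LDH n, MPY, ADD */

/* Longest path (in cycles) through a graph listed in topological order.
 * latency[v] : cycles node v needs to complete
 * edge[u][v] : non-zero when node v consumes the result of node u */
int longest_path(const int latency[N_NODES],
                 const int edge[N_NODES][N_NODES])
{
    int finish[N_NODES];
    int longest = 0;
    for (int v = 0; v < N_NODES; v++) {
        int start = 0;                       /* longest path into v */
        for (int u = 0; u < v; u++) {
            if (edge[u][v] && finish[u] > start) {
                start = finish[u];
            }
        }
        finish[v] = start + latency[v];
        if (finish[v] > longest) {
            longest = finish[v];
        }
    }
    return longest;
}
```

With latencies 5, 5, 2 and 1 and the paths LDH to MPY (twice) and MPY to ADD, the function reproduces the figure of 8 cycles.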

Scheduling Table
           PROLOG                                   LOOP
    Cycle  1      2    3    4    5    6    7        8
    .L1                                             add
    .L2           sub  *    *    *    *    *        *
    .S1
    .S2                B    *    *    *    *        *
    .M1                               mpy  *        *
    .M2
    .D1    ldh m  *    *    *    *    *    *        *
    .D2    ldh n  *    *    *    *    *    *        *
An asterisk means that the instruction is repeated in that cycle.
Where do we want to branch? The branch target is the loop in cycle 8. Because a branch has five delay slots, the first B is issued in cycle 3.

Translate Scheduling Table to ‘C6x Code
Reading the scheduling table one cycle (column) at a time gives the complete code:
    C1:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
    C2:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
    C3:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
        || [B0] B    .S2   loop
    C4:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
        || [B0] B    .S2   loop
    C5:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
        || [B0] B    .S2   loop
    C6:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
        || [B0] B    .S2   loop
        ||      mpy  .M1x  A2,B2,A3
    C7:         ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
        || [B0] B    .S2   loop
        ||      mpy  .M1x  A2,B2,A3
    * Single-Cycle Loop
    loop:       ldh  .D1   *A1++,A2
        ||      ldh  .D2   *B1++,B2
        || [B0] sub  .L2   B0,1,B0
        || [B0] B    .S2   loop
        ||      mpy  .M1x  A2,B2,A3
        ||      add  .L1   A4,A3,A4

Translate Scheduling Table to ‘C6x Code
- With this method we have only created the prolog and the loop.
- Therefore if the filter has 100 taps, we need to repeat the loop 100 times as we need 100 adds.
- This means that we are performing 107 loads: 7 in the prolog plus 100 in the loop. These 7 extra loads may lead to illegal memory accesses.
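The load count above is simple arithmetic, sketched here so that it can be reused for other filter lengths. Both helper names are hypothetical, and the model assumes one load per input array in every prolog cycle and in every pass through the loop, as in the code above.

```c
#include <stddef.h>

/* Loads issued per input array when the loop runs once per tap
 * and no epilog is used. */
size_t loads_without_epilog(size_t taps, size_t prolog_cycles)
{
    return taps + prolog_cycles;
}

/* Elements read beyond the end of an array that holds exactly `taps` items. */
size_t overrun_elements(size_t taps, size_t prolog_cycles)
{
    return loads_without_epilog(taps, prolog_cycles) - taps;
}
```

For 100 taps and a 7-cycle prolog this gives 107 loads, 7 of them past the end of each array.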

Solution: The Epilog
We only created the prolog and the loop. What about the epilog?
The epilog can be extracted from the prolog and loop already obtained, as shown in the example that follows.

Dot-Product with Epilog
Prolog:
    p1: ldh || ldh
    p2: ldh || ldh || [] sub
    p3: ldh || ldh || [] sub || [] b
    p4: ldh || ldh || [] sub || [] b
    p5: ldh || ldh || [] sub || [] b
    p6: ldh || ldh || mpy || [] sub || [] b
    p7: ldh || ldh || mpy || [] sub || [] b
Loop:
    loop: ldh || ldh || mpy || add || [] sub || [] b
Epilog:
    e1: mpy || add
    e2: mpy || add
    e3: mpy || add
    e4: mpy || add
    e5: mpy || add
    e6: add
    e7: add
Epilog = Loop - Prolog, and there is no sub or b in the epilog.

Scheduling Table: Prolog, Loop and Epilog
[Figure: scheduling table showing the prolog, the loop and the epilog]

Loop only!
Can the code be written as a loop only (i.e. no prolog or epilog)? Yes!

Loop only!
The code is reduced to the loop alone in three steps:
(i) Remove all instructions from the prolog except the branch.
(ii) Zero the input registers, the accumulator and the product register.
(iii) Adjust the number of subtractions.
Resulting scheduling table:
- Prolog (cycles 1 to 5): B in every cycle, together with zero a, zero b, zero prod, zero sum and one sub.
- Loop (cycle 6): ldh m, ldh n, mpy, add, sub and B.

Loop Only - Final Code
    ; Loop overhead
              b     loop
        ||    zero  m          ; input register
        ||    zero  n          ; input register
              b     loop
        ||    zero  prod       ; product register
        ||    zero  sum        ; accumulator
              b     loop
        ||    sub              ; modify count register
    loop:     ldh
        ||    ldh
        ||    mpy
        ||    add
        || [] sub
        || [] b     loop
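The loop-only idea can be modelled in C with the simplified three-stage pipeline (load, multiply, add). The registers are zeroed first, so the passes that run before real data arrives only add zeros, and the trip count is increased so that the last products still reach the accumulator. This is a sketch: `dotp_loop_only` is a hypothetical name, and unlike the assembly version the model replaces the extra loads at the end with zeros so that it never reads outside the arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product as a single loop with no prolog or epilog (3-stage model). */
int32_t dotp_loop_only(const int16_t *m, const int16_t *n, size_t count)
{
    int16_t a = 0, b = 0;       /* zero the input registers */
    int32_t prod = 0;           /* zero the product register */
    int32_t sum = 0;            /* zero the accumulator */

    /* Adjusted trip count: two extra passes drain the pipeline. */
    for (size_t i = 0; i < count + 2; i++) {
        sum += prod;                         /* add */
        prod = (int32_t)a * b;               /* mpy */
        if (i < count) {                     /* load, guarded in this model */
            a = m[i];
            b = n[i];
        } else {
            a = 0;
            b = 0;
        }
    }
    return sum;
}
```

The code is shorter than the version with prolog and epilog, at the cost of a few extra passes through the loop.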

Laboratory exercise
- Software pipeline the LDW version of the dot-product routine:
  (1) Write linear assembly.
  (2) Create dependency graph.
  (3) Complete scheduling table.
  (4) Transfer table to ‘C6000 code.
- To epilog or not to epilog? Determine if your pipelined code is more efficient with or without prolog and epilog.

Lab Solution: Step 1 - Linear Assembly
    ; for (i=0; i < count; i++)
    ;   prod = m[i] * n[i];
    ;   sum += prod;
    ; *** count becomes 20 ***
    loop:     ldw   *p_m++, m
              ldw   *p_n++, n
              mpy   m, n, prod
              mpyh  m, n, prodh
              add   prod, sum, sum
              add   prodh, sumh, sumh
      [count] sub   count, 1, count
      [count] b     loop
    ; Outside of Loop
              add   sum, sumh, sum

Step 2 - Dependency Graph
[Figure: dependency graph for the LDW dot product, partitioned into A side and B side.]
A side: m LDW (.D1, 5 cycles), prod MPY (.M1x, 2 cycles), sum ADD (.L1, 1 cycle), loop B (.S1, 6 cycles).
B side: n LDW (.D2, 5 cycles), prodh MPYH (.M2x, 2 cycles), sumh ADD (.L2, 1 cycle), count SUB (.S2, 1 cycle).

Step 2 - Functional Units
Do we still have enough functional units to code this algorithm in a single-cycle loop? Yes!
    A side             B side
    .L1   sum          .L2   sumh
    .M1   prod         .M2   prodh
    .D1   m            .D2   n
    .S1   loop         .S2   count
    x1    .M1x         x2    .M2x

Step 2 - Registers
[Table: register allocation for register files A and B. Recoverable entries: A6 = count/prod, A7 = sum, B6 = prodh, B7 = sumh. Register file A also holds &a/return value and a; register file B also holds count, x, the return address and &x.]

Step 3 - Schedule Algorithm
           PROLOG                                              LOOP
    Cycle  1      2      3      4      5      6      7         8
    .L1                                                        add
    .L2                                                        add
    .S1                  B1     B2     B3     B4     B5        B6
    .S2           sub 1  sub 2  sub 3  sub 4  sub 5  sub 6     sub 7
    .M1                                       mpy    mpy 2     mpy 3
    .M2                                       mpyh   mpyh 2    mpyh 3
    .D1    ldw m  ldw 2  ldw 3  ldw 4  ldw 5  ldw 6  ldw 7     ldw 8
    .D2    ldw n  ldw 2  ldw 3  ldw 4  ldw 5  ldw 6  ldw 7     ldw 8

Step 4 - ‘C6000 Code
The complete code is available in the following location: \Links\DotP LDW.pdf

Why Conditional Subtract?
    loop:     ldh   *p_m++, m
              ldh   *p_n++, n
              mpy   m, n, prod
              add   prod, sum, sum
      [count] sub   count, 1, count
      [count] b     loop
In the pipelined loop a branch is issued in every cycle, and each one is taken only if count is non-zero.
- Without conditional subtract: count keeps decreasing (1, 0, -1, -2, -3, -4, ...). Only the branch issued while count = 0 is cancelled; the following ones see a non-zero count and are taken again, so the loop never ends.
- With conditional subtract: count goes from 1 to 0 and stays there, every later branch is cancelled and the loop ends.
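The difference between the two cases can be reproduced with a few lines of C. The model below only tracks the count register over a number of loop cycles; both function names are hypothetical and the branch delay slots are not modelled.

```c
/* Value of the count register after `cycles` passes through the loop.
 * conditional != 0 models "[count] sub", conditional == 0 a plain "sub". */
int count_after(int count, int cycles, int conditional)
{
    for (int c = 0; c < cycles; c++) {
        if (!conditional || count != 0) {
            count -= 1;
        }
    }
    return count;
}

/* "[count] b loop" is taken whenever count is non-zero. */
int branch_taken(int count)
{
    return count != 0;
}
```

Starting from count = 1, five cycles leave the register at 0 with the conditional subtract (branch not taken) and at -4 without it (branch taken again).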

Chapter 12 Software Optimisation Part 3 - Pipelining Multi-cycle Loops

Objectives
- Software pipeline the weighted vector sum algorithm.
- Describe four iteration interval constraints.
- Calculate the minimum iteration interval.
- Convert and optimise the dot-product code to floating-point code.

What Requires Multi-Cycle Loops?
- Resource limitations: running out of resources (functional units, registers, bus accesses). The weighted vector sum example requires three .D units.
- Live too long: the minimum iteration interval is defined by the length of time a variable is required to exist.
- Loop carry path: latency required between loop iterations. The FIR example and the SP floating-point dot product example are demonstrated.
- Functional unit latency > 1: a few ‘C67x instructions require functional units for 2 or 4 cycles rather than one. This defines a minimum iteration interval.

What Requires Multi-Cycle Loops?
Four reasons:
1. Resource limitations.
2. Live too long.
3. Loop carry path.
4. Double precision (functional unit latency > 1).
Use these four constraints to determine the smallest iteration interval (minimum iteration interval, or MII).

Resource Limitation: Weighted Vector Sum
Step 1 - C Code
    void WVS(short *c, short *b, short *a, short r, short n)
    {
        int i;

        for (i = 0; i < n; i++)
        {
            c[i] = a[i] + ((r * b[i]) >> 15);
        }
    }
where a, b: input arrays; c: output array; n: length of arrays; r: weighting factor.
Each iteration performs two loads and one store, all of which need a .D unit, so the loop requires 3 .D units.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 101 Step 2 - ‘C6x Linear Code
c[i] = a[i] + ((r * b[i]) >> 15);

    loop:   LDH   *a++, ai
            LDH   *b++, bi
            MPY   r, bi, prod
            SHR   prod, 15, sum
            ADD   ai, sum, ci
            STH   ci, *c++
      [i]   SUB   i, 1, i
      [i]   B     loop

- The full code is available here: \Links\Wvs.sa

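Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 102 Step 3 - Dependency Graph
[Figure: dependency graph for the loop. A side: LDH ai (.D1), ADD ci (.L1), STH *c++ (.D1), B loop (.S1). B side: LDH bi (.D2), MPY prod (.M2, with r as the second operand), SHR sum (.S2), SUB i (.L2). Edge latencies: LDH 5, MPY 2, SHR 1, ADD 1, SUB 1, B 6.]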

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 103 Step 4 - Allocate Functional Units

    Unit    .L1   .M1   .D1      .S1    .L2   .M2    .D2   .S2
    Value   ci    -     ai, *c   loop   i     prod   bi    sum

- This requires 3 .D units, therefore it cannot fit into a single cycle loop.
- This may fit into a 2 cycle loop if there are no other constraints.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 104 2 Cycle Loop
Iteration Interval (II): number of cycles per loop iteration.
[Figure: the eight functional units (.L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2) are drawn twice inside the loop, once for Cycle 1 and once for Cycle 2: 2 cycles per loop iteration.]

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 105-109 Multi-Cycle Loop Iterations
[Figure: three overlapping loop iterations (loop 1, loop 2, loop 3), each made up of ldh, ldh, mpy, shr, add, sth, sub and b, drawn against the eight functional units over cycles 1 to 6. With a 2 cycle loop, a new iteration starts every two cycles.]

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 111 How long is the Prolog?
[Figure: dependency graph with nodes ai, bi, prod, sum, ci and *c++, and edge latencies 5, 1, 5, 2, 1, 1.]
What is the length of the longest path? 10
How many cycles per loop? 2
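The longest path can be checked mechanically by computing the earliest issue cycle of every instruction from the edge latencies. The sketch below is an illustration only: the `Edge` type, the function name `schedule_length` and the node numbering are assumptions, and the length is counted as the last issue cycle plus one (cycles 0 to 9 inclusive give 10).

```c
/* One dependency edge: "to" cannot issue until "latency" cycles after "from". */
typedef struct {
    int from;
    int to;
    int latency;
} Edge;

/* Fills start[] with the earliest issue cycle of each node and returns the
 * schedule length in cycles (latest issue cycle + 1).
 * The edges are assumed to form an acyclic graph. */
int schedule_length(const Edge *e, int n_edges, int n_nodes, int *start)
{
    for (int i = 0; i < n_nodes; i++)
        start[i] = 0;

    /* Repeated relaxation: n_nodes passes are enough for an acyclic graph. */
    for (int pass = 0; pass < n_nodes; pass++) {
        for (int k = 0; k < n_edges; k++) {
            int candidate = start[e[k].from] + e[k].latency;
            if (candidate > start[e[k].to])
                start[e[k].to] = candidate;
        }
    }

    int last = 0;
    for (int i = 0; i < n_nodes; i++)
        if (start[i] > last)
            last = start[i];
    return last + 1;
}
```

Numbering the nodes 0 = LDH ai, 1 = LDH bi, 2 = MPY, 3 = SHR, 4 = ADD, 5 = STH and using the latencies from the dependency graph (LDH 5, MPY 2, SHR 1, ADD 1), the store issues in cycle 9, giving a longest path of 10 cycles.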

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 112-119 Step 5 - Create Scheduling Chart
The chart lists the functional units (.L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2) against cycles 0 to 9. Because the loop is 2 cycles long, the even cycles (0, 2, 4, 6, 8) and the odd cycles (1, 3, 5, 7, 9) are drawn as two separate tables. Each * marks a unit that is busy with the same instruction from another iteration. The instructions are placed one at a time:
1. LDH bi, followed by ****
2. MPY mi, followed by **
3. SHR sum, followed by *
4. ADD ci
5. STH c[i]
6. LDH ai: Conflict

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 120-121 Conflict Solution
[Chart: the .D1 and .D2 rows, showing LDH bi ****, STH c[i] and LDH ai.]
Here are two possibilities... Which is better?
Move the LDH to cycle 2 (so you don't have to go back and recheck crosspaths).

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 122-125 Step 5 - Create Scheduling Chart
With LDH ai placed in cycle 2, [i] B and [i] SUB i are added to complete the chart. Each * marks a unit that is busy with the same instruction from another iteration.

    Cycle  .L1     .L2        .S1    .S2      .M1  .M2     .D1       .D2
    0                                                                LDH bi
    1
    2                                                      LDH ai    *
    3              [i] SUB i
    4                         [i] B                        *         *
    5              *                               MPY mi
    6                         *                            *         *
    7              *                 SHR sum       *
    8      ADD ci             *                            *         *
    9              *                 *             *       STH c[i]

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 126 2 Cycle Loop Kernel
[Chart: the completed Step 5 scheduling chart with the loop kernel marked. The kernel is the final pair of cycles, where every instruction of the loop is active.]
Cycle 8: ADD ci (.L1), B (.S1), LDH ai (.D1), LDH bi (.D2)
Cycle 9: SUB i (.L2), SHR sum (.S2), MPY mi (.M2), STH c[i] (.D1)
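The conflict check behind the scheduling chart can be expressed compactly: two instructions collide when they use the same functional unit in cycles that are equal modulo the iteration interval. The sketch below is an illustration under assumptions: the `Slot` type and `first_conflict` are hypothetical names, the unit assignments follow the Step 4 allocation, and the first attempt at LDH ai is assumed to have been cycle 3.

```c
#include <string.h>

/* One placed instruction in the scheduling chart. */
typedef struct {
    const char *name;   /* instruction, for reference only */
    const char *unit;   /* functional unit, e.g. ".D1" */
    int cycle;          /* issue cycle within a single iteration */
} Slot;

/* Returns the index of the first instruction that collides with an earlier
 * entry (same unit, same cycle modulo ii), or -1 if there is no conflict. */
int first_conflict(const Slot *s, int n, int ii)
{
    for (int i = 1; i < n; i++) {
        for (int j = 0; j < i; j++) {
            if (strcmp(s[i].unit, s[j].unit) == 0 &&
                s[i].cycle % ii == s[j].cycle % ii)
                return i;
        }
    }
    return -1;
}
```

With II = 2, an LDH ai in cycle 3 on .D1 collides with STH c[i] in cycle 9 on .D1 (both odd cycles), while an LDH ai in cycle 2 does not. With II = 1 the two always collide, which is the resource limitation that forces the 2 cycle loop.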

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 127 What Requires Multi-Cycle Loops? Four reasons:
1. Resource Limitations.
2. Live Too Long.
3. Loop Carry Path.
4. Double Precision (FUL > 1).

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 128-131 Live Too Long - Example
c = (a >> 5) + a
[Figure: dependency graph. LDH ai (.D1, latency 5) feeds both SHR x (.S1, latency 1) and ADD ci (.L1); SHR x also feeds ADD ci.]
In a single cycle loop a new LDH is issued every cycle:

    Cycle  Issued   Registers
    0      LDH ai
    1      LDH
    2      LDH
    3      LDH
    4      LDH
    5      SHR      a0 valid
    6      ADD      a1, x0 valid

Oops, rather than adding a0 + x0 we got a1 + x0.
Let's look at one solution...

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 132-134 Live Too Long - 2 Cycle Solution
With a 2 cycle loop, a0 is valid for 2 cycles.

    Cycle  Issued   Registers
    0      LDH ai
    2      LDH
    4      LDH
    5      SHR      a0 valid
    6      ADD      a0 valid, x0 valid
    7               a1, x0 valid

Notice, a0 and x0 are both valid for 2 cycles, which is the length of the Iteration Interval. Adding them...
Works! But what's the drawback? A 2 cycle loop is slower. Here's a better solution...

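Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 135 Live Too Long - 1 Cycle Solution
[Figure: the same dependency graph with an extra MV b (.S2, latency 1) that copies ai into a temporary register b, which then feeds ADD ci in place of ai.]
Using a temporary register solves this problem without increasing the Minimum Iteration Interval.

    Cycle  Issued       Registers
    0      LDH ai
    1-4    LDH
    5      MV b, SHR    a0 valid
    6      ADD          a1, b valid, x0 valid

In cycle 6 the ADD reads b (the copy of a0) and x0, so the result is a0 + x0 even though ai already holds a1.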
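The effect can be reproduced with a small cycle-by-cycle model in C. This is only a sketch under simplifying assumptions: the function name `live_too_long` is made up for this illustration, register writes are modelled as taking effect at the end of a cycle, and no further loads are issued after the last element.

```c
/* Model of the single cycle loop for c = (a >> 5) + a.
 * Reads in a cycle see the register values from before that cycle's writes.
 *   c_direct: ADD reads ai directly (the "live too long" problem).
 *   c_copy:   ADD reads b, a copy of ai made by MV one cycle earlier. */
void live_too_long(const short *a, int n, int *c_direct, int *c_copy)
{
    if (n <= 0)
        return;

    int ai = a[0];   /* the first load has landed */
    int x = 0;       /* result of SHR */
    int b = 0;       /* temporary register written by MV */

    for (int k = 0; k <= n; k++) {
        if (k >= 1) {
            /* ADD for element k-1 executes one cycle after its SHR. */
            c_direct[k - 1] = ai + x;
            c_copy[k - 1] = b + x;
        }
        int next_x = ai >> 5;   /* SHR on the value currently in ai */
        int next_b = ai;        /* MV ai, b */
        if (k + 1 < n)
            ai = a[k + 1];      /* the next LDH result overwrites ai */
        x = next_x;
        b = next_b;
    }
}
```

For a = {32, 64, 96} the copy version yields 33, 66, 99, while the direct version yields 65 and 98 for the first two elements because ai has already been overwritten by the next load. The last element happens to come out right in this model only because no later load arrives.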

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 137 Loop Carry Path
- The loop carry path is a path which feeds a variable from one part of the algorithm back to another.
[Figure: MPY (.M2) produces p2 with a latency of 2; STH (.D1) st_y0 follows with a latency of 1.]
e.g. Loop carry path = 3.
Note: The loop carry path is not the code loop.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 138 Loop Carry Path, e.g. IIR Filter IIR Filter Example y0 = a0*x0 + b1*y1

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 139 IIR Filter Example
y0 = a0*x0 + b1*y1
IIR.SA

    IIR:    ldh   *a_1, A1
            ldh   *x1, A3
            ldh   *b_1, B1
            ldh   *y0, B0          ; y1 is previous y0
            mpy   A1, A3, prod1
            mpy   B1, B0, prod2
            add   prod1, prod2, prod2
            sth   prod2, *y0

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 140 Loop Carry Path - IIR Example
IIR Filter Loop: y0 = a1*x1 + b1*y1
[Figure: dependency graph with nodes x1, A1, B1, y1, p1, p2, y0 and st_y0. Units: LDH .D1, LDH .D2, MPY .M1, MPY .M2, ADD .L1, STH .D1. Latencies: LDH 5, MPY 2, ADD 1, STH 1.]
Min Iteration Interval, Resource = 2 (need 3 .D units).
Result carries over from one iteration of the loop to the next.
Loop Carry Path = 9 (9 = 5 + 2 + 1 + 1), therefore MII = 9.
Can it be minimized?

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 141 Loop Carry Path - IIR Example (Solution)
IIR Filter Loop: y0 = a1*x1 + b1*y1
[Figure: the same dependency graph, with y0 fed back directly from ADD .L1 to MPY .M2 instead of through STH and LDH.]
Min Iteration Interval, Resource = 2 (need 3 .D units).
New Loop Carry Path = 3 (3 = 2 + 1), therefore MII = 3.
Since y0 is stored in a CPU register, it can be used directly by MPY (after the first loop iteration).
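At the C level the same idea amounts to carrying the previous output in a local variable instead of reading it back from memory. The two functions below are a sketch for comparison only: the names `iir_reload` and `iir_register`, the `y_init` parameter and the absence of any fixed-point scaling are assumptions of this illustration, and an optimising compiler may well generate the same code for both.

```c
/* y[i] = a1 * x[i] + b1 * y[i-1], with y[-1] taken as y_init.
 * Version 1: the previous output is read back from the output array,
 * which mirrors the store followed by load on the loop carry path. */
void iir_reload(const short *x, short *y, int n, short a1, short b1,
                short y_init)
{
    for (int i = 0; i < n; i++) {
        short y_prev = (i == 0) ? y_init : y[i - 1];   /* memory read */
        y[i] = (short)(a1 * x[i] + b1 * y_prev);
    }
}

/* Version 2: the previous output stays in a local variable (a register),
 * so only the multiply and add remain on the loop carry path. */
void iir_register(const short *x, short *y, int n, short a1, short b1,
                  short y_init)
{
    short y_prev = y_init;
    for (int i = 0; i < n; i++) {
        y_prev = (short)(a1 * x[i] + b1 * y_prev);
        y[i] = y_prev;
    }
}
```

Both versions produce identical outputs; only the dependence between iterations changes.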

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 142 Reminder: Fixed-Point Dot-Product Example
[Figure: LDH m (.D1) and LDH n (.D2), latency 5, feed MPY prod (.M1x), latency 2, which feeds ADD sum (.L1); the ADD feeds back on itself with a latency of 1.]
Is there a loop carry path in this example? Yes, but it's only "1".
Min Iteration Interval: Resource = 1, Loop Carry Path = 1, therefore MII = 1.
For the fixed-point implementation, the Loop Carry Path was not taken into account because it is equal to 1.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 143 Loop Carry Path  IIR Example.  Enhancing the IIR.  Fixed-Point Dot-Product Example.  Floating-Point Dot Product Example.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 144 Loop Carry Path due to FUL > 1: Floating-Point Dot-Product Example
[Figure: LDW m (.D1) and LDW n (.D2), latency 5, feed MPYSP prod (.M1x), latency 4, which feeds ADDSP sum (.L1); the ADDSP feeds back on itself with a latency of 4.]
Min Iteration Interval: Resource = 1, Loop Carry Path = 4, therefore MII = 4.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 145 Unrolling the Loop
If the MII must be four cycles long, then use all of them to calculate four results.
[Figure: four copies of the dependency graph, one per result k = 1 to 4. In each, LDW mk (.D1) and LDW nk (.D2) feed MPYSP prodk (.M1x), which feeds ADDSP sumk (.L1) with a latency of 4.]

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 146-154 ADDSP Pipeline (Staggered Results)

    Cycle  Instruction           Result
    0      ADDSP x0, sum, sum    sum = 0
    1      ADDSP x1, sum, sum    sum = 0
    2      ADDSP x2, sum, sum    sum = 0
    3      ADDSP x3, sum, sum    sum = 0
    4      ADDSP x4, sum, sum    sum = x0
    5      ADDSP x5, sum, sum    sum = x1
    6      ADDSP x6, sum, sum    sum = x2
    7      ADDSP x7, sum, sum    sum = x3
    8      ADDSP x8, sum, sum    sum = x0 + x4
    9      NOP                   sum = x1 + x5
    10     NOP                   sum = x2 + x6
    11     NOP                   sum = x3 + x7
    12     NOP                   sum = x0 + x4 + x8

- ADDSP takes 4 cycles (three delay slots) to produce the result.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 155 ADDSP Pipeline (Staggered Results)
- There are effectively four running sums:
    sum(i)   = x(i)   + x(i+4) + x(i+8)  + …
    sum(i+1) = x(i+1) + x(i+5) + x(i+9)  + …
    sum(i+2) = x(i+2) + x(i+6) + x(i+10) + …
    sum(i+3) = x(i+3) + x(i+7) + x(i+11) + …
- These need to be combined after the last addition is complete...
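The staggered results can be reproduced with a small model of a 4 cycle adder. This is a sketch, not a description of the hardware: `addsp_trace` and `ADDSP_LATENCY` are names invented here, and the model simply delays each result by four cycles before it is written to sum.

```c
#define ADDSP_LATENCY 4   /* cycles from issue to result, as stated in the slides */

/* Issues "ADDSP x[c], sum, sum" in cycles 0..n-1 and records in trace[c]
 * the value held in sum during cycle c, for c = 0..n+ADDSP_LATENCY-1.
 * trace must have room for n + ADDSP_LATENCY entries. */
void addsp_trace(const float *x, int n, float *trace)
{
    float sum = 0.0f;
    float in_flight[ADDSP_LATENCY] = {0};   /* results still in the pipeline */
    int valid[ADDSP_LATENCY] = {0};

    for (int c = 0; c < n + ADDSP_LATENCY; c++) {
        int slot = c % ADDSP_LATENCY;
        if (valid[slot]) {
            /* The result issued ADDSP_LATENCY cycles ago lands now. */
            sum = in_flight[slot];
            valid[slot] = 0;
        }
        trace[c] = sum;
        if (c < n) {
            /* The new ADDSP reads the value of sum visible in this cycle. */
            in_flight[slot] = x[c] + sum;
            valid[slot] = 1;
        }
    }
}
```

Running it on nine inputs gives sum = 0 for cycles 0 to 3, x0 in cycle 4, x0 + x4 in cycle 8 and x0 + x4 + x8 in cycle 12, matching the four interleaved running sums.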

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slides 156-162 ADDSP Pipeline (Combining Results)

    Cycle  Instruction              Result
    0      ADDSP x0, sum, sum       sum = 0
    1      ADDSP x1, sum, sum       sum = 0
    2      ADDSP x2, sum, sum       sum = 0
    3      ADDSP x3, sum, sum       sum = 0
    4      ADDSP x4, sum, sum       sum = x0
    5      ADDSP x5, sum, sum       sum = x1
    6      ADDSP x6, sum, sum       sum = x2
    7      ADDSP x7, sum, sum       sum = x3
    8      ADDSP x8, sum, sum       sum = x0 + x4
    9      MV sum, temp             sum = x1 + x5
    10     ADDSP sum, temp, sum1    sum = x2 + x6, temp = x1 + x5
    11     MV sum, temp             sum = x3 + x7
    12     ADDSP sum, temp, sum2    sum = x0 + x4 + x8, temp = x3 + x7
    13     NOP
    14     NOP                      sum1 = x1 + x2 + x5 + x6
    15     NOP
    16     ADDSP sum1, sum2, sum    sum2 = x0 + x3 + x4 + x7 + x8
    17     NOP
    18     NOP
    19     NOP
    20     NOP                      sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
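In C, the same structure can be written explicitly as a dot product with four independent accumulators that are combined once the loop has finished. The sketch below is an illustration only: the function name `dotp_unrolled` is made up, the length is assumed to be a multiple of four, and the final combination order simply mirrors the pairing in the table (two partial sums at a time).

```c
/* Single-precision dot product with four running sums.
 * Assumes len is a multiple of 4. */
float dotp_unrolled(const float *m, const float *n, int len)
{
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;

    for (int i = 0; i < len; i += 4) {
        /* Four independent accumulations: none waits for another's result. */
        sum0 += m[i] * n[i];
        sum1 += m[i + 1] * n[i + 1];
        sum2 += m[i + 2] * n[i + 2];
        sum3 += m[i + 3] * n[i + 3];
    }

    /* Combine the running sums pairwise after the last addition. */
    return (sum1 + sum2) + (sum0 + sum3);
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum; for small integer-valued inputs the two agree exactly.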

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 164 Simple FUL Example
[Figure: MPYDP prod on .M1, latency 10, labelled (4.9); a timeline shows MPYDP issued in cycle 1 and again in cycle 5.]
MPYDP ties up the functional unit for 4 cycles.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 165 A Better Way to Diagram this...
[Figure: .M1 drawn over cycles 1 to 16 in four rows. MPYDP issues in cycles 1, 5, 9 and 13; prod1 appears at cycle 11 and prod2 at cycle 15.]
Since the MPYDP instruction has a functional unit latency (FUL) of "4", .M1 cannot be used again until the fifth cycle. Hence, MII ≥ 4.

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 12, Slide 166 What Requires Multi-Cycle Loops?
1. Resource Limitations.
2. Live Too Long.
3. Loop Carry Path.
4. Double Precision (FUL > 1).
Lab: Converting your dot-product code to Single-Precision Floating-Point.

Chapter 12 Software Optimisation - End -
