Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of IIR filter, Part 2
Understanding the pipeline
Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada, 11/2/2015

Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- If these pipelines stall, the processor speed goes down
- Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
- Understand what the pipeline viewer tells us in detail
- Avoid having to use the pipeline viewer
- Improve code efficiency
- Excel and Project (Gantt charts) are useful tools
Register File and COMPUTE Units
Simple Example IIR -- Biquad
    For (Stages = 0 to 3) Do
        S0 = Xin * H5 + S2 * H3 + S1 * H4
        Yout = S0 * H0 + S1 * H1 + S2 * H2
        S2 = S1
        S1 = S0
Diagram labels: S0, S1, S2
- Horrible IIR code example, as it can't be re-used in a loop
- Works as a simple example for understanding the TigerSHARC pipeline
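A minimal Python sketch of one pass through the biquad equations on this slide, written only to make the data flow concrete. The function name `biquad_stage`, the `(S1, S2)` state tuple and the list `h` holding H0..H5 are illustrative assumptions; this is not the course's iirASM( ) code, and the coefficient values are made up.

```python
def biquad_stage(x_in, state, h):
    """Run the slide's biquad equations once.

    state is (S1, S2); h is a sequence holding H0..H5 (assumed layout).
    Returns (y_out, new_state).
    """
    s1, s2 = state
    s0 = x_in * h[5] + s2 * h[3] + s1 * h[4]
    y_out = s0 * h[0] + s1 * h[1] + s2 * h[2]
    # S2 = S1, S1 = S0: shift the delay line ready for the next sample
    return y_out, (s0, s1)


# Illustrative coefficients only; not taken from the slides
h = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y, state = biquad_stage(1.0, (1.0, 1.0), h)
print(y, state)  # 20.0 (15.0, 1.0)
```

Each multiply feeds an add, and each add feeds the next add, which is exactly the kind of back-to-back dependency chain the rest of the slides examine in the pipeline.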
Code returns a float when using the XR8 register. NOTE: NOT XFR8
Step 2: Using the C++ code as comments, set up the coefficients
    XFR0 = 0.0;;    Does not exist
    XR0 = 0.0;;     DOES EXIST
- Bit-patterns require integer registers
- Leave what you wanted to do behind as comments
Expect to take 8 cycles to execute
PIPELINE STAGES
- See page 8-34 of the Processor manual
- 10 pipeline stages, but they may be completely desynchronized (happen semi-independently)
- Instruction fetch: F1, F2, F3 and F4
- Integer ALU: PreDecode, Decode, Integer, Access
- Compute Block: EX1 and EX2
Pipeline Viewer Result
XR0 = 1.0
- enters the PD stage at cycle 39025
- enters the EX2 stage
- is stored into XR0
- execution time is counted in cycles
Pipeline Viewer Result
XR6 = 5.5
- enters the PD stage
- enters the EX2 stage
- is stored into XR6
- Each instruction takes 7 cycles, but one new result completes each cycle
- Result: ONCE the pipeline is filled, 8 cycles = 8 register transfer operations
- Key: don't break the pipeline with any jumps
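The difference between the 7-cycle latency of one instruction and the one-result-per-cycle throughput can be checked with a short Python sketch. It assumes an idealised pipeline where one instruction enters per cycle and nothing stalls; the name `completion_cycles` and the cycle-counting convention (completion = entry cycle + latency) are assumptions made for illustration.

```python
LATENCY = 7  # cycles per instruction, as quoted on the slide


def completion_cycles(n_instructions, first_issue=0, latency=LATENCY):
    """Completion cycle of each of n back-to-back instructions.

    Assumes one instruction enters the pipeline per cycle and no stalls.
    """
    return [first_issue + i + latency for i in range(n_instructions)]


done = completion_cycles(8)
print(done)                    # [7, 8, 9, 10, 11, 12, 13, 14]
print(done[-1] - done[0] + 1)  # 8
```

The eight results land on eight consecutive cycles, so once the pipeline is full the cost is one cycle per register transfer even though each individual instruction is in flight for seven.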
Doing filter operations generates different results
- XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored: 7 cycles
- XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored: 7 cycles
- XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored: 8 cycles
WHY? FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL
Instruction 0x17e, XFR8 = R8 + R23, is STALLED (waiting) for instruction 0x17d, XFR23 = R8 * R4, to complete
- Bubble B means that the pipeline is doing "nothing"
- Meaning that the instruction shown is a "place holder" (garbage)
Information on Window Event Icons
Result of Analysis
Can't use a Float result immediately after calculation
Writing
    XFR23 = R8 * R4;;
    XFR8 = R8 + R23;;    // MUST WAIT FOR XFR23 calculation to be completed
is the same as coding
    XFR23 = R8 * R4;;
    NOP;;                // Note DOUBLE ;; -- extra cycle because of stall
    XFR8 = R8 + R23;;
- Proof: write the code with the stalls shown in it
- Writing this way means we don't have to use the pipeline viewer all the time
- Pipeline viewer is only available with the (slow) simulator
    #define SHOW_ALU_STALL nop
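The claim that an explicit NOP costs the same as the hidden stall can be modelled in a few lines of Python. This is a deliberately simplified model based only on the rule stated on this slide (a float result cannot be used by the very next instruction line, at a cost of one extra cycle); the tuple layout `(dest, sources, is_float_op)` and the name `count_cycles` are assumptions for illustration, not a description of the real TigerSHARC interlock logic.

```python
def count_cycles(program):
    """Count cycles for a list of (dest, sources, is_float_op) lines.

    Simplified rule taken from the slide: if a line reads the register
    written by the float operation on the line directly before it, the
    processor inserts one stall cycle.
    """
    cycles = 0
    prev_dest, prev_is_float = None, False
    for dest, sources, is_float in program:
        if prev_is_float and prev_dest in sources:
            cycles += 1  # implicit one-cycle stall
        cycles += 1      # the instruction line itself
        prev_dest, prev_is_float = dest, is_float
    return cycles


NOP = (None, (), False)
mul = ("R23", ("R8", "R4"), True)   # XFR23 = R8 * R4;;
add = ("R8", ("R8", "R23"), True)   # XFR8 = R8 + R23;;

print(count_cycles([mul, add]))       # 3
print(count_cycles([mul, NOP, add]))  # 3
```

Both versions take three cycles in this model, which is why writing the stall out as a visible marker changes nothing about the speed but makes the cost readable in the source.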
Code with stalls shown
- 8 code lines
- 5 expected stalls
- Expect 8 + 5 = 13 cycles to complete if the theory is correct
Analysis approach IS correct
- Same speed with and without nops
Process for coding for improved speed: code re-organization
- Make a copy of the code so that iirASM( ) and iirASM_Optimized( ) can both be tested to make sure we get the correct result
- Make a table of the code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
- Identify data dependencies
- Make all "temp operations" use different registers
- Move instructions "forward" to fill delay slots, BUT don't break data dependencies
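The "identify data dependencies" step can be sketched in Python as a scan for read-after-write pairs. The four-line program below is hypothetical (the real listing is only shown as a screenshot in the slides), and the `(dest, sources)` tuple layout and the name `find_raw_dependencies` are assumptions for illustration.

```python
def find_raw_dependencies(program):
    """Return (producer_index, consumer_index, register) triples.

    program is a list of (dest, sources) tuples in program order. A triple
    means the consumer reads a register last written by the producer.
    """
    last_writer = {}
    deps = []
    for i, (dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps


# Hypothetical multiply/accumulate sequence reusing R23 as the temporary
program = [
    ("R23", ("R9", "R4")),   # XFR23 = R9 * R4
    ("R0", ("R0", "R23")),   # XFR0 = R0 + R23
    ("R23", ("R10", "R3")),  # XFR23 = R10 * R3
    ("R0", ("R0", "R23")),   # XFR0 = R0 + R23
]
print(find_raw_dependencies(program))
# [(0, 1, 'R23'), (1, 3, 'R0'), (2, 3, 'R23')]
```

Each triple is one arrow in the dependency table; an instruction may only be moved forward past lines that do not appear as its producers.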
Copy and paste to make iirASM_Optimized( )
Need to re-order instructions to fill delay slots with useful instructions
- After refactoring code to fill delay slots, must run tests to ensure that we still have the correct result
- Change, then "retest"
- NOT EASY TO DO
- MUST HAVE A SYSTEMATIC PLAN TO HANDLE OPTIMIZATION
- I USE EXCEL
Show resource usage and data dependencies
- All temporary register usage involves the SAME XFR23 register
- This typically stalls out the processor
Change all temporary registers to use different register names, then check that the code produces the correct answer
- All temporary register usage involves a DIFFERENT register
- ALWAYS FOLLOW THIS PROCESS WHEN OPTIMIZING
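Giving every temporary its own register is mechanical enough to sketch in Python. The program and the replacement register names below are hypothetical, and `rename_temporaries` is an illustrative helper, not part of any TigerSHARC tool; the caller is assumed to supply fresh names that are not used elsewhere in the code.

```python
def rename_temporaries(program, temp, fresh_names):
    """Give each write to `temp` its own register name.

    program is a list of (dest, sources) tuples. Reads of `temp` are
    redirected to whichever fresh name holds the most recent write.
    """
    fresh = iter(fresh_names)
    current = temp
    renamed = []
    for dest, sources in program:
        # Reads see the value produced by the latest write
        sources = tuple(current if s == temp else s for s in sources)
        if dest == temp:
            current = next(fresh)
            dest = current
        renamed.append((dest, sources))
    return renamed


# Hypothetical sequence in which R23 is reused as the only temporary
program = [
    ("R23", ("R9", "R4")),
    ("R0", ("R0", "R23")),
    ("R23", ("R10", "R3")),
    ("R0", ("R0", "R23")),
]
renamed = rename_temporaries(program, "R23", ["R23", "R24"])
print(renamed[2], renamed[3])
# ('R24', ('R10', 'R3')) ('R0', ('R0', 'R24'))
```

After renaming, the second multiply no longer overwrites a register that the first add still needs, so it becomes a candidate for moving earlier.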
Move instructions forward, without breaking data dependencies
- What appears possible!
- DO one thing at a time and then check that the code still works
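Whether a single move is legal can be checked before making it. The Python sketch below tests if one instruction may swap places with the line directly above it, using the usual three register hazards (read-after-write, write-after-read, write-after-write). The programs are hypothetical and `can_move_up` is an illustrative name; the check covers register dependencies only, not memory or resource conflicts.

```python
def can_move_up(program, i):
    """True if instruction i may swap places with instruction i - 1.

    program is a list of (dest, sources) tuples in program order.
    """
    if i <= 0 or i >= len(program):
        return False
    prev_dest, prev_sources = program[i - 1]
    dest, sources = program[i]
    raw = prev_dest is not None and prev_dest in sources  # reads its result
    war = dest is not None and dest in prev_sources       # overwrites its input
    waw = dest is not None and dest == prev_dest          # same destination
    return not (raw or war or waw)


# Hypothetical code before and after giving each temporary its own register
reused = [("R23", ("R9", "R4")), ("R0", ("R0", "R23")),
          ("R23", ("R10", "R3")), ("R0", ("R0", "R23"))]
renamed = [("R23", ("R9", "R4")), ("R0", ("R0", "R23")),
           ("R24", ("R10", "R3")), ("R0", ("R0", "R24"))]

print(can_move_up(reused, 2))   # False
print(can_move_up(renamed, 2))  # True
print(can_move_up(renamed, 1))  # False
```

The second multiply is blocked while it shares R23 with the add above it, and becomes movable once the temporaries differ. Applying one such move at a time and re-running the tests matches the "one thing at a time" rule on this slide.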
Check that code still operates
- 1 cycle saved
- Have put "our" marker stall instruction in parallel with the moved instruction, using ; rather than ;;
- Move this instruction up in the code sequence to fill the delay slot
- Check that the code still runs after this optimization stage
Move next multiplication up. NOTE: certain stalls remain, although the reason for each STALL differs from why it was inserted before
Move up the R10 and R9 assignment operations -- check 4 cycle improvement?
CHECK THE PIPELINE AFTER TESTING
Are there still more improvements possible? (I can see 4 more moves)
Problems with approach
- Identifying all the data dependencies
- Keeping track of how the data dependencies change as you move the code around
- Handling all of this "automatically"
- I started the following design tool as something that might work, but it actually turned out very useful.
M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project", "Don't say "CAN'T do it - Say "Gantt it"! The irony of organizing microprocessors with a big business tool", Circuit Cellar magazine, Vol. 184, pp , November 2005.
Using Microsoft Project: Step 1
Add dependencies and resource usage, then activate level
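What "add dependencies, then level" does can be roughly imitated by a small greedy list scheduler in Python. This is only a loose analogue, not Microsoft Project's algorithm: it assumes a single issue slot per cycle and a result that becomes usable `latency` cycles after issue (2 here, matching the one-cycle stall seen earlier). The instruction names, the `deps` layout and `list_schedule` itself are hypothetical.

```python
def list_schedule(order, deps, latency=2):
    """Greedy schedule: each cycle, issue the first ready instruction.

    order: instruction names in original program order.
    deps:  dict mapping a name to the names whose results it reads.
    Returns a dict mapping each name to its issue cycle.
    """
    for name, producers in deps.items():
        if name not in order or set(producers) - set(order):
            raise ValueError("deps refers to an unknown instruction")
    issue = {}
    remaining = list(order)
    cycle = 0
    while remaining:
        if cycle > len(order) * max(latency, 1):
            raise ValueError("dependencies contain a cycle")
        for name in remaining:
            producers = deps.get(name, ())
            if all(p in issue and issue[p] + latency <= cycle for p in producers):
                issue[name] = cycle
                remaining.remove(name)
                break  # only one instruction issues per cycle in this model
        cycle += 1
    return issue


# Hypothetical: two multiplies, each feeding its own add
order = ["m1", "a1", "m2", "a2"]
deps = {"a1": {"m1"}, "a2": {"m2"}}
schedule = list_schedule(order, deps)
print(schedule)                      # {'m1': 0, 'm2': 1, 'a1': 2, 'a2': 3}
print(max(schedule.values()) + 1)    # 4
```

In program order the same four lines would need two stall cycles (6 cycles in total); letting the scheduler pull the second multiply into the first stall slot brings it down to 4.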
Microsoft Project as a microprocessor design tool
- Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays