Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 2 Understanding the pipeline.


1 Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 2 Understanding the pipeline

2 Understanding the TigerSHARC ALU pipeline
Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada, 11/2/2015
TigerSHARC has many pipelines.
If these pipelines stall, then the processor speed goes down.
Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
Understanding what the pipeline viewer tells us in detail
- Avoiding having to use the pipeline viewer
Improving code efficiency
- Excel and Project (Gantt charts) are useful tools

3 Register File and COMPUTE Units

4 Simple Example IIR -- Biquad
For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0
(Figure labels: S0, S1, S2)
Horrible IIR code example, as it can't be re-used in a loop.
Works as a simple example for understanding the TigerSHARC pipeline.
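The biquad update on this slide can be sketched in a few lines of Python to make the data flow explicit. This is only an illustrative model, not the TigerSHARC assembly used in the deck: the function name `biquad_stage` and the tuple layout of the state and coefficients are assumptions made for the example.

```python
def biquad_stage(x_in, state, h):
    """One biquad stage, following the update order on the slide.

    state is (S1, S2); h is (H0, H1, H2, H3, H4, H5).
    Returns (y_out, new_state). Names and layout are hypothetical.
    """
    s1, s2 = state
    h0, h1, h2, h3, h4, h5 = h
    # S0 = Xin * H5 + S2 * H3 + S1 * H4
    s0 = x_in * h5 + s2 * h3 + s1 * h4
    # Yout = S0 * H0 + S1 * H1 + S2 * H2
    y_out = s0 * h0 + s1 * h1 + s2 * h2
    # S2 = S1, then S1 = S0
    return y_out, (s0, s1)


# Example call: feed one sample through a stage with zero initial state.
y, new_state = biquad_stage(1.0, (0.0, 0.0), (1.0, 0.0, 0.0, 0.0, 0.0, 1.0))
```

Note that every multiply feeds an add, and the second line needs S0 from the first. Those are the data dependencies that later slides show stalling the ALU pipeline.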

5 Code returns a float when using the XR8 register -- NOTE: NOT XFR8

6 Step 2 -- Using C++ code as comments, set up the coefficients
    XFR0 = 0.0;;    Does not exist
    XR0 = 0.0;;     DOES EXIST
Bit patterns require integer registers.
Leave what you wanted to do behind as comments.

7 Expect to take 8 cycles to execute

8 PIPELINE STAGES
See page 8-34 of the Processor manual.
10 pipeline stages, but they may be completely desynchronized (happen semi-independently)
- Instruction fetch -- F1, F2, F3 and F4
- Integer ALU -- PreDecode, Decode, Integer, Access
- Compute Block -- EX1 and EX2

9 Pipeline Viewer Result
XR0 = 1.0 enters the PD stage at cycle 39825, enters the EX2 stage at cycle 39830, and is stored into XR0 at cycle 39831 -- 7 cycles execution time

10 Pipeline Viewer Result
XR6 = 5.5 enters the PD stage at cycle 39832, enters the EX2 stage at cycle 39837, and is stored into XR6 at cycle 39838 -- 7 cycles execution time
Each instruction takes 7 cycles, but there is one new result each cycle.
Result -- ONCE the pipeline is filled, 8 cycles = 8 register transfer operations
Key -- don't break the pipeline with any jumps
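The difference between latency (7 cycles per instruction) and throughput (one result per cycle) can be illustrated with a small Python model. It assumes an ideal pipeline with one instruction entering per cycle and no stalls or jumps. The function name `store_cycles` is hypothetical, and cycles are counted from 1 at the moment the first instruction enters PD.

```python
def store_cycles(n_instructions, latency=7):
    """Cycle at which each instruction's result is stored.

    Assumes one instruction enters the pipeline per cycle and nothing stalls,
    so instruction i (0-based) finishes `latency + i` cycles after the start.
    """
    return [latency + i for i in range(n_instructions)]


cycles = store_cycles(8)
# Once the pipeline is full, the 8 results arrive on 8 consecutive cycles.
span = cycles[-1] - cycles[0] + 1
```

Under these assumptions the first result still costs the full 7-cycle latency, which is why a jump that empties the pipeline is expensive.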

11 Doing filter operations -- generates different results
XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored at 39839 -- 7 cycles
XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored at 39840 -- 7 cycles
XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored at 39842 -- 8 cycles
WHY? -- FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL

12 Instruction 0x17e, XFR8 = R8 + R23, is STALLED (waiting) for instruction 0x17d, XFR23 = R8 * R4, to complete.
Bubble B means that the pipeline is doing “nothing”, meaning that the instruction shown is a “place holder” (garbage).

13 Information on Window Event Icons

14 Result of Analysis
Can’t use a float result immediately after calculation.
Writing
    XFR23 = R8 * R4;;
    XFR8 = R8 + R23;;    // MUST WAIT FOR XFR23 calculation to be completed
is the same as coding
    XFR23 = R8 * R4;;
    NOP;;                // Note DOUBLE ;; -- extra cycle because of stall
    XFR8 = R8 + R23;;
Proof -- write the code with the stalls shown in it
- Writing this way means we don’t have to use the pipeline viewer all the time
- Pipeline viewer is only available with the (slow) simulator
- #define SHOW_ALU_STALL nop
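The stall rule on this slide can be checked on paper, or with a small cycle counter such as the Python sketch below. It is a simplified model, not a TigerSHARC simulator. It assumes exactly one stall cycle when an instruction reads a register that a float operation (an `XFR` destination) wrote in the immediately preceding cycle, and no other stall sources. The helper names `registers` and `count_cycles` are hypothetical.

```python
import re


def registers(text):
    """Return the register numbers named in a fragment such as 'R8 * R4'."""
    return [int(n) for n in re.findall(r"R(\d+)", text)]


def count_cycles(program):
    """Count issue cycles for straight-line code under the assumed stall rule."""
    cycles = 0
    pending = None  # register written by a float op in the previous cycle
    for line in program:
        if line.strip().upper() == "NOP":
            cycles += 1
            pending = None  # the NOP gives the earlier result time to complete
            continue
        dest, expr = line.split("=")
        if pending is not None and pending in registers(expr):
            cycles += 1  # stall: the needed float result is not ready yet
        cycles += 1
        is_float = dest.strip().upper().startswith("XFR")
        pending = registers(dest)[0] if is_float else None
    return cycles


stalled = count_cycles(["XFR23 = R8 * R4", "XFR8 = R8 + R23"])
with_nop = count_cycles(["XFR23 = R8 * R4", "NOP", "XFR8 = R8 + R23"])
```

In this model both versions take 3 cycles, which matches the slide's point that writing the stall out as an explicit NOP does not change the speed.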

15 Code with stalls shown
8 code lines
5 expected stalls
Expect 13 cycles to complete if the theory is correct

16 Analysis approach IS correct
Same speed with and without nops

17 Process for coding for improved speed -- code re-organization
- Make a copy of the code so that you can test iirASM( ) and iirASM_Optimized( ) to make sure you get the correct result
- Make a table of the code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
- Identify data dependencies
- Make all “temp operations” use different registers
- Move instructions “forward” to fill delay slots, BUT don’t break data dependencies

18 Copy and paste to make IIRASM_Optimized( )

19 Need to re-order instructions to fill delay slots with useful instructions
After refactoring the code to fill delay slots, you must run tests to ensure that it still gives the correct result.
Change -- and “retest”
NOT EASY TO DO
MUST HAVE A SYSTEMATIC PLAN TO HANDLE OPTIMIZATION
I USE EXCEL

20 Show resource usage and data dependencies
All temporary register usage involves the SAME XFR23 register.
This typically stalls out the processor.

21 Change all temporary registers to use different register names
Then check that the code produces the correct answer.
All temporary register usage now involves a DIFFERENT register.
ALWAYS FOLLOW THIS PROCESS WHEN OPTIMIZING
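The renaming step can be done by hand, but a short Python sketch shows the bookkeeping involved. It is an illustration under assumptions, not the author's tool. Instructions are reduced to `(dest, sources)` tuples of register numbers, the function name `rename_temporaries` is hypothetical, and the spare registers passed in are assumed to be unused elsewhere in the code.

```python
def rename_temporaries(program, temp, spare):
    """Give every new write to `temp` its own register.

    program: list of (dest, sources) tuples of register numbers.
    temp:    the shared temporary register (for example 23 for XFR23).
    spare:   registers assumed to be free, used for the second and later writes.
    """
    spare = iter(spare)
    current = temp       # name currently holding the temporary value
    first_write = True
    renamed = []
    for dest, sources in program:
        # Reads of the temporary must follow its most recent name.
        sources = tuple(current if s == temp else s for s in sources)
        if dest == temp:
            if not first_write:
                current = next(spare)
            first_write = False
            dest = current
        renamed.append((dest, sources))
    return renamed


# Two multiply/add pairs that both go through register 23.
shared = [(23, (8, 4)), (8, (8, 23)), (23, (9, 5)), (8, (8, 23))]
renamed = rename_temporaries(shared, temp=23, spare=[24, 25])
```

After renaming, the second multiply writes register 24, so it no longer has to wait for the first add to finish reading register 23. The real dependencies (each add on its own multiply) are unchanged, so the result must still be retested as the slide insists.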

22 Move instructions forward, without breaking data dependencies
What appears possible!
DO one thing at a time and then check that the code still works.
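Moving instructions forward without breaking dependencies is, in effect, a small list-scheduling problem. The Python sketch below is a simplified model under stated assumptions: one instruction issues per cycle, and every result needs two cycles before it can be read (the deck only demonstrates this for float operations). Instructions are `(dest, sources)` tuples of register numbers, and the function name `schedule` is hypothetical.

```python
def schedule(program, latency=2):
    """Greedy list scheduling of straight-line code.

    Each cycle, issue the earliest instruction (in program order) whose
    dependencies on earlier instructions are satisfied; otherwise stall.
    Returns (issue order as original indices, total cycles).
    """
    issued = {}  # instruction index -> cycle in which it was issued
    order = []
    cycle = 0
    while len(issued) < len(program):
        cycle += 1
        for j, (dest_j, src_j) in enumerate(program):
            if j in issued:
                continue
            ready = True
            for i in range(j):
                dest_i, src_i = program[i]
                raw = dest_i in src_j                    # j reads what i wrote
                war_waw = dest_j in src_i or dest_j == dest_i
                if not (raw or war_waw):
                    continue
                if i not in issued:
                    ready = False                        # must not overtake i
                elif raw and issued[i] + latency > cycle:
                    ready = False                        # result not ready yet
            if ready:
                issued[j] = cycle
                order.append(j)
                break                                    # one issue per cycle
    return order, cycle


shared = [(23, (8, 4)), (8, (8, 23)), (23, (9, 5)), (10, (10, 23))]
renamed = [(23, (8, 4)), (8, (8, 23)), (24, (9, 5)), (10, (10, 24))]
```

In this model `schedule(shared)` cannot reorder anything and needs 6 cycles, while `schedule(renamed)` moves the second multiply into the delay slot and finishes in 4. A real schedule still has to be verified against the processor, as the slide says: change one thing, then retest.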

23 Check that the code still operates
1 cycle saved
Have put “our” marker stall instruction in parallel with the moved instruction, using ; rather than ;;
Move this instruction up in the code sequence to fill the delay slot.
Check that the code still runs after this optimization stage.

24 Move the next multiplication up.
NOTE: certain stalls remain, although the reason for each STALL is now different from the reason it was originally inserted.

25 Move up the R10 and R9 assignment operations -- check 4 cycle improvement?

26 CHECK THE PIPELINE AFTER TESTING

27 Are there still more improvements possible? (I can see 4 more moves)

28 Problems with approach
- Identifying all the data dependencies
- Keeping track of how the data dependencies change as you move the code around
- Handling all of this “automatically”
I started the following design tool as something that might work, but it actually turned out very useful.
M. R. Smith and J. Miller, “Microprocessor Scheduling -- the irony of using Microsoft Project” / “Don’t say ‘CAN’T do it’, say ‘Gantt it’! The irony of organizing microprocessors with a big business tool”, Circuit Cellar magazine, Vol. 184, pp. 26-35, November 2005.

29 Using Microsoft Project -- Step 1

30 Add dependencies and resource usage -- then activate level

31 Microsoft Project as a microprocessor design tool
Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays.

32 Understanding the TigerSHARC ALU pipeline
TigerSHARC has many pipelines.
If these pipelines stall, then the processor speed goes down.
Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
Understanding what the pipeline viewer tells us in detail
- Avoiding having to use the pipeline viewer
Improving code efficiency
- Excel and Project (Gantt charts) are useful tools

