Trying to avoid pipeline delays


1 Trying to avoid pipeline delays
Interleaving two sets of operations; X and Y compute blocks

2 Tackled today
Review of coding a hardware circular buffer.
Roughly understanding where pipeline delays may occur.
"Refactor" the working code to improve the speed, without spending any time examining whether the delays are really there (the "works at the moment" principle).
"Refactor" the working code to perform operations using both the X and Y ALUs; in principle, twice the speed.
12/1/2018 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

3 DCRemoval( )
Not as complex as FIR, but with many of the same requirements: memory intensive, addition intensive, loops for the main code, FIFO implemented as a circular buffer.
Easier to handle than FIR. You will use the same ideas when optimizing FIR over Labs 2 and 3.
Two issues: speed and accuracy. Develop suitable tests for the CPP code and check that the various assembly language versions satisfy the same tests.
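The slide does not show the reference code, so the following is only a sketch of what a DC-removal routine with a data-move FIFO could look like in C. It assumes that the DC level is the average of the last 128 samples (N = 128 appears on later slides) and that the output is the input minus that average. The names `DcState`, `DC_N` and `dc_removal` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DC_N 128   /* FIFO length; assumed from "N = 128" on later slides */

typedef struct {
    int32_t fifo[DC_N];   /* last DC_N input samples, oldest first */
} DcState;

/* Subtract the running average (the DC level) from one input sample. */
int32_t dc_removal(DcState *s, int32_t in)
{
    assert(s != 0);

    /* SHIFT LOOP: data-move FIFO, every sample moves one slot forward. */
    for (int i = 0; i < DC_N - 1; i++) {
        s->fifo[i] = s->fifo[i + 1];
    }
    s->fifo[DC_N - 1] = in;   /* insert the newest value */

    /* SUM LOOP: 64-bit accumulator so the sum cannot overflow. */
    int64_t sum = 0;
    for (int i = 0; i < DC_N; i++) {
        sum += s->fifo[i];
    }

    /* Division is used here instead of a shift; the assembly versions
       on later slides use ASHIFT by -7 for the same purpose. */
    int32_t average = (int32_t)(sum / DC_N);
    return in - average;
}
```

A version like this can serve as the reference for the tests: feed the same input sequence to each assembly version and compare the outputs. With a constant input of 100 and a zeroed FIFO, the first output is 100 and the output after 128 samples is 0.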

4 Alternative approach: move pointers rather than memory values
In principle: 1 memory read, 1 memory write, a pointer addition, and a conditional equate (to wrap the pointer back to the start of the buffer).
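The four operations listed above can be sketched in C as follows. This is an illustration, not the code from the lecture; `SoftFifo`, `fifo_init` and `fifo_push` are hypothetical names, and the buffer length of 128 is assumed from later slides.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_N 128   /* assumed buffer length */

typedef struct {
    int32_t data[FIFO_N];
    int32_t *next;        /* slot holding the oldest sample */
} SoftFifo;

void fifo_init(SoftFifo *f)
{
    assert(f != 0);
    for (int i = 0; i < FIFO_N; i++) {
        f->data[i] = 0;
    }
    f->next = f->data;
}

/* Overwrite the oldest sample with the newest; return the evicted value. */
int32_t fifo_push(SoftFifo *f, int32_t in)
{
    assert(f != 0 && f->next != 0);
    int32_t oldest = *f->next;             /* 1 memory read */
    *f->next = in;                         /* 1 memory write */
    f->next = f->next + 1;                 /* pointer addition */
    if (f->next == f->data + FIFO_N) {     /* conditional equate: wrap */
        f->next = f->data;
    }
    return oldest;
}
```

No sample is ever moved; only the pointer changes. The cost is the compare-and-wrap on every access, which is the overhead the next slide asks about.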

5 Note: Software circular buffer is NOT necessarily more efficient than data moves
Are we now spending more time moving and checking the software circular buffer pointers than we spent moving the data?
[Figure: two code versions, labelled SLOWER and FASTER]

6 Next step – Hardware circular buffer
Do exactly the same pointer calculations as with software circular buffers, but now the calculations are done behind the scenes, at high speed, using specialized pointer features.
Only available with the J0, J1, J2 and J3 registers (on older ADSP processors, all pointer registers).
Jx: the pointer register.
JBx: the BASE register, set to the start of the FIFO array.
JLx: the LENGTH register, set to the length of the FIFO array.
VERY BIG WARNING: reset the length register to zero when finished. On older ADSP processors this was very important; otherwise every other function using that register would suddenly start using circular buffering by mistake. It is still advisable, although a special syntax is now needed to make circular buffer operations occur.
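The roles of the three registers can be modelled in C. This is a host-side sketch of post-modify circular addressing under the assumption that the pointer wraps back into the range [base, base + length) after each update; it is not a description of the exact hardware behaviour. `CircPtr` and `cb_read_postmodify` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int j;    /* models Jx:  current address (index into memory) */
    int jb;   /* models JBx: base, start of the FIFO array */
    int jl;   /* models JLx: length of the FIFO array */
} CircPtr;

/* Model of a post-modify circular read: use the current address,
   then add `modify` and wrap the pointer inside the buffer. */
int32_t cb_read_postmodify(const int32_t *memory, CircPtr *p, int modify)
{
    assert(p->jl > 0);
    assert(p->j >= p->jb && p->j < p->jb + p->jl);
    assert(modify > -p->jl && modify < p->jl);

    int32_t value = memory[p->j];      /* read first (post-modify) */

    int next = p->j + modify;
    if (next >= p->jb + p->jl) {
        next -= p->jl;                 /* ran off the end: wrap to the start */
    } else if (next < p->jb) {
        next += p->jl;                 /* ran off the start: wrap to the end */
    }
    p->j = next;
    return value;
}
```

With a base of 2 and a length of 4, four reads with a modify of 1 return the four buffer entries and leave the pointer back at the base.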

7 Store values into hardware FIFO
The CB instruction ONLY works on POST-MODIFY operations.

8 Next stage in improving code speed: hardware circular buffers
[Table: cycle counts per code stage, each with its previous ("Was") value. Stages: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return.]
N = 128: instructions = 549 cycles; with delay cycles = 879 cycles (was 1011 cycles with delay cycles).
Delays are now >50% of useful time.

9 On TigerSHARC Pipeline Issue
After you issue the command to read from memory, you must then wait for the value to arrive.
Problem: we may be trading memory wait delays for I-ALU delays.
[Code listing, annotated "Memory pipeline delay" and "No memory pipeline delay":]
    XR5 = CB [J0 += 1];;
    XR4 = R4 + R5;;
    XR6 = CB [J1 += 1];;
    XR7 = R7 + R6;;

10 Now perform Math operation using circular buffer operation
Note the possible memory delays. Does the memory cache help?
Wait for the read of R2, use it, then wait for the read of R3 and then use it.

11 Simple interleaving of code: possible saving of memory delays
[Figure: instructions 1, 2, 3, 4 shown in their original order and in a new, interleaved order]
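The idea can be shown in C with two independent load-then-add pairs. The reordering below (1, 3, 2, 4) is one plausible reading of the slide, not a transcription of it, and the names `Sums`, `sum_original` and `sum_interleaved` are hypothetical. A C compiler is free to reorder these statements itself, so the sketch only demonstrates that the two orders give the same result; the timing benefit applies to hand-scheduled assembly.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t left;
    int32_t right;
} Sums;

/* Original order: every load is consumed by the very next statement. */
Sums sum_original(const int32_t *left, const int32_t *right, int n)
{
    assert(n >= 0);
    Sums s = {0, 0};
    for (int i = 0; i < n; i++) {
        int32_t a = left[i];    /* 1: load  */
        s.left += a;            /* 2: use, must wait for load 1 */
        int32_t b = right[i];   /* 3: load  */
        s.right += b;           /* 4: use, must wait for load 3 */
    }
    return s;
}

/* Interleaved order: the second load is issued while the first completes. */
Sums sum_interleaved(const int32_t *left, const int32_t *right, int n)
{
    assert(n >= 0);
    Sums s = {0, 0};
    for (int i = 0; i < n; i++) {
        int32_t a = left[i];    /* 1: load */
        int32_t b = right[i];   /* 3: load, independent of 1 */
        s.left += a;            /* 2: value has had time to arrive */
        s.right += b;           /* 4 */
    }
    return s;
}
```

Both functions execute the same four operations per iteration; only the order differs, which is why the instruction count on the next slide is unchanged while the delay count drops.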

12 Interleaving of code: same instructions, different order
[Table: cycle counts per code stage, each with its previous ("Was") value. Stages: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return.]
N = 128: instructions = 549 cycles; with delay cycles = 594 cycles. Delays are now 10% of useful time.
Was 879 cycles with delay cycles; delays were >50% of useful time.

13 The code is too slow because we are not taking advantage of the available resources
Bring in up to 128 bits (4 instructions) per cycle.
Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K bus (data2).
Perform address calculations in the J and K ALUs: single-cycle hardware circular buffers.
Perform math operations on both the X and Y compute blocks.
Background DMA activity.
Off-load some of the processing to the second processor.

14 Understanding how to use MIMD mode: process left filter in X-Compute, right in Y
    XR6 = 0;;     Puts 0 into the XR6 register
    YR6 = 0;;     Puts 0 into the YR6 register
    XYR6 = 0;;    Puts 0 into XR6 and YR6 at the same time
1 instruction saved.

15 Understanding how to use MIMD mode: process left filter in X-Compute, right in Y
    XR6 = R6 + R2;;     Adds the XR6 and XR2 registers
    YR6 = R6 + R2;;     Adds the YR6 and YR2 registers
    XYR6 = R6 + R2;;    Adds XR6 + XR2 AND YR6 + YR2 at the same time
N instructions saved.

16 Understanding how to use MIMD mode: process left filter in X-Compute, right in Y
    XR6 = ASHIFT R6 BY -7;;     XR6 = XR6 >> 7
    YR6 = ASHIFT R6 BY -7;;     YR6 = YR6 >> 7
    XYR6 = ASHIFT R6 BY -7;;    XR6 = XR6 >> 7 and YR6 = YR6 >> 7 at the same time
1 instruction saved.
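The three dual operations shown on these slides (zero, add, arithmetic shift) can be modelled in C with a pair of values standing in for the X and Y registers. This is a host-side sketch only: C performs the two halves one after the other, whereas the hardware performs them together. `XY`, `ashift_right`, `xy_add`, `xy_sub` and `xy_average` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* One register pair: x models XRn, y models YRn. */
typedef struct {
    int32_t x;
    int32_t y;
} XY;

/* Arithmetic shift right (rounds towards minus infinity), written so it
   does not rely on implementation-defined >> of negative values. */
int32_t ashift_right(int32_t v, int bits)
{
    assert(bits >= 0 && bits < 31);
    return (v >= 0) ? (v >> bits) : ~((~v) >> bits);
}

/* Models XYR6 = R6 + R2;; (no overflow handling in this sketch). */
XY xy_add(XY a, XY b)
{
    XY r = { a.x + b.x, a.y + b.y };
    return r;
}

/* Models a dual subtraction on both compute blocks. */
XY xy_sub(XY a, XY b)
{
    XY r = { a.x - b.x, a.y - b.y };
    return r;
}

/* Left channel summed in x, right channel in y, then both shifted. */
XY xy_average(const int32_t *left, const int32_t *right, int n, int shift)
{
    assert(n >= 0);
    XY acc = { 0, 0 };                          /* models XYR6 = 0;; */
    for (int i = 0; i < n; i++) {
        XY sample = { left[i], right[i] };
        acc = xy_add(acc, sample);
    }
    XY avg = { ashift_right(acc.x, shift),      /* models ASHIFT BY -7 */
               ashift_right(acc.y, shift) };
    return avg;
}
```

Each loop iteration issues one combined add where the single-block version needs two, which is where the "N instructions saved" on slide 15 comes from.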

17 Final operation – dual subtraction

18 MIMD mode
[Table: cycle counts per code stage, each with its previous ("Was") value. Stages: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return.]
N = 128: instructions = 421 cycles; with delay cycles = 590 cycles. Delays are now 50% of useful time.
Was 594 cycles with delay cycles; delays were 10% of useful time.
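The delay percentages quoted on slides 8, 12 and 18 can be checked with a short calculation. The sketch assumes that the second figure on each slide is the total cycle count including delays, so the delay is the difference between the two figures; that reading is an interpretation of the slides. `delay_overhead_percent` is a hypothetical helper.

```c
#include <assert.h>

/* Delay cycles as a percentage of the useful (instruction) cycles. */
double delay_overhead_percent(int instruction_cycles, int total_cycles)
{
    assert(instruction_cycles > 0);
    assert(total_cycles >= instruction_cycles);
    return 100.0 * (total_cycles - instruction_cycles) / instruction_cycles;
}
```

Under that assumption, 549 and 879 cycles give an overhead of roughly 60%, 549 and 594 give roughly 8%, and 421 and 590 give roughly 40%. These are in line with the rounded ">50%", "10%" and "50%" figures on the slides. The total barely moves between the interleaved and MIMD versions (594 to 590 cycles) even though 128 instructions were removed, which is the question the next slide raises.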

19 Why no improvement? Extra delays from where?
We are back to having to wait for R2 to come in from memory before the sum can occur.

20 The code is too slow because we are not taking advantage of the available resources
Bring in up to 128 bits (4 instructions) per cycle.
Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K bus (data2).
Perform address calculations in the J and K ALUs: single-cycle hardware circular buffers.
Perform math operations on both the X and Y compute blocks.
Background DMA activity.
Off-load some of the processing to the second processor.

21 Multiple data busses
Many issues to solve before we can bring in 8 data values per cycle.
Are the data values aligned so that we can access 4 values at once? If they are not aligned, what can you do?
One step at a time (next lecture): let us bring 1 value in along the J data bus and another in along the K data bus.
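The alignment question can be illustrated on a host machine in C. The sketch assumes that "aligned for 4 values at once" means the first of four 32-bit values starts on a 16-byte boundary; it checks a byte address as seen by a C program and is not a statement about how the DSP itself encodes addresses. `is_quad_aligned` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p starts on a 16-byte (four 32-bit values) boundary. */
int is_quad_aligned(const int32_t *p)
{
    assert(p != 0);
    return ((uintptr_t)p % (4 * sizeof(int32_t))) == 0;
}
```

Of any four consecutive 32-bit elements in an array, exactly one passes this check, so a buffer that starts at an arbitrary element will usually need its first few values handled singly before 4-at-a-time accesses can begin.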

22 Exercise on handling interleaving of instructions and X-Y compute operations

23 Tackled today
Review of coding a hardware circular buffer.
Roughly understanding where pipeline delays may occur.
"Refactor" the working code to improve the speed, without spending any time examining whether the delays are really there (the "works at the moment" principle).
"Refactor" the working code to perform operations using both the X and Y ALUs; in principle, twice the speed.

