A first attempt at learning about optimizing the TigerSHARC code


1 A first attempt at learning about optimizing the TigerSHARC code
TigerSHARC assembly syntax

2 What we NOW KNOW!
Can we return from an assembly language routine without crashing the processor?
Return a parameter from an assembly language routine (is it the same for ints and floats?)
Pass parameters into assembly language
Do IF THEN ELSE statements
Read and write values to memory
Read and write values in a loop
Do some mathematics on the values fetched from memory
All this stuff is demonstrated by coding HalfWaveRectifyASM( )
5/11/2019 TigerSHARC assemble code 3, M. Smith, ECE, University of Calgary, Canada

3 Not bad for a first effort
Faster than the compiler in debug mode
Need to learn from the compiler how to speed up the code

4 How does the compiler do it?
Use mixed mode to show
Warning – out-of-order instructions displayed

5 Many new instructions. Many parallel instructions. Ones inside the loop are key
How important is stating whether a jump is predicted or not? BIG: 25%
523 -> 435

6 Many new instructions. Many parallel instructions. Ones inside the loop are key
JMP (NP) -> 435
XR1 not J -> 491
How important is not using J registers when reading from memory?
XR1 rather than J1
Now need:
Condition XALT rather than JLT
XCOMP rather than COMP

7 Many new instructions. Many parallel instructions. Ones inside the loop are key
JMP (NP) -> 435
XR1 not J -> 491
and ++ operator -> 435
How important is not using J registers when reading from memory, and using pointers (*pt++) rather than arrays (pt[count])?
XR1 rather than J1
Now need:
Condition XALT rather than JLT
XCOMP rather than COMP

8 Code to this point. Note new instructions using XR2 and R2
Try a little thing. R2 = 0 is a constant, so move it outside the loop
Already outside loop
Difference about half the time – expect to improve by 12 cycles
Got 491 -> 476 = 15 – timing only accurate to around 10 cycles

9 The IF THEN JUMPS in the loop are killing us
Rewrite the C++ code into optimized form
Reduce loop size from 6 if > 0 and 7 if < 0 to 4 either way
Loop count 24 – expect improvement of 48 cycles
We go from 476 to 250 cycles
That’s 226 cycles, or roughly 9 cycles each time around the loop
The jumps were costing us 9 cycles by disrupting the pipeline
Need to get rid of this jump and counter increment
Blackfin has hardware loops. Does the TigerSHARC? Duh!!
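
The slide does not show the rewritten C++ code, so the following is only a sketch of the idea, assuming the routine is a half-wave rectifier over an int array. The function names HalfWaveRectifyBranchy and HalfWaveRectifySelect are hypothetical. The first version has an IF THEN ELSE inside the loop; the second walks pointers and uses a select expression, which gives a compiler the chance to emit a conditional instruction instead of a jump (whether it does depends on the compiler and optimization level).

```cpp
#include <cstddef>

// Straightforward version: array indexing and a data-dependent branch
// on every pass through the loop.
void HalfWaveRectifyBranchy(const int* in, int* out, std::size_t n) {
    for (std::size_t count = 0; count < n; ++count) {
        if (in[count] > 0) {
            out[count] = in[count];
        } else {
            out[count] = 0;
        }
    }
}

// Restructured version: pointer walk (*pt++) and a single select
// expression, so the loop body has one shape whatever the data is.
void HalfWaveRectifySelect(const int* in, int* out, std::size_t n) {
    const int* end = in + n;
    while (in != end) {
        const int value = *in++;
        *out++ = (value > 0) ? value : 0;
    }
}
```

Both functions produce the same output; only the shape of the loop body differs.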

10 Many new instructions. Many parallel instructions. Ones inside the loop are key
JMP (NP) -> 435
XR1 not J and ++ operator
Remove inner jumps from loop
Hardware loop instructions
LC0 = loop counter 0 – may only be a few hardware loops possible
SHARC ADSP – allows 6, Blackfin ADSP-BF5XX – allows 2, so still need to understand software loops
IF LC0E -> if hardware loop expired; IF NLC0E -> if not expired – MM!!

11 With hardware loops – 166 cycles! Are we cooking or what!
Fine tuning – can we save N cycles (1 each time round the loop) by merging instructions?

12 Merge those two instructions and use our fancy SIGN-BIT trick for float code
We are beating the optimized compiler on the float code by a factor of 2
We need 1 cycle to beat the compiler on the optimized int code
Find out in Assignment 1
I did 138 cycles
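
The slides do not spell out the SIGN-BIT trick. One plausible reading, assumed here, is that a 32-bit IEEE-754 float keeps its sign in the top bit, just as a two's-complement int does, so "is it negative?" can be answered from the raw bits without a floating-point compare. The sketch below illustrates that assumption in C++; SignBitSet and HalfWaveRectifyFloat are hypothetical names, not the course code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when bit 31 (the IEEE-754 sign bit) of a 32-bit float is set.
bool SignBitSet(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bits
    return (bits >> 31) != 0u;
}

// Half-wave rectify floats using only the sign-bit test.
// Note: -0.0f has its sign bit set, so it is written out as +0.0f.
void HalfWaveRectifyFloat(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = SignBitSet(in[i]) ? 0.0f : in[i];
    }
}
```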

13 My code passes the tests in 138 cycles
Extra 11 cycles from outside the loop (not worth the time and effort if the loop was larger, or there were more points to process)
Does turning off the cache make any difference to our code? Find out in Assignment 1

14 Trying to understand what we have done
Most TigerSHARC instructions can be made conditional. WHY?
Because doing a NOP instruction (if the condition is not met) is much less disruptive to the instruction pipeline than doing a JUMP (loss of 9 cycles if the jump is taken – probably more because of code format)

15 Trying to understand what we have done
Use J registers for address operations, but store values from memory in XR1 and YR1. WHY?
Instructions like [J1] = XR1;; have the potential to be put in parallel with more operations

16 Why mostly conditional instructions?
TigerSHARC has a very deep pipeline, so conditional jumps can cause a large disruption of the pipeline
Better to use non-jump instructions, which don’t disrupt the pipeline even if the instruction is not executed (acts as a nop)
    if (N < 1) return_value = NULL; else return_value = value;

17 Why mostly conditional instructions?
    if (N < 1) return_value = NULL; else return_value = value;
Version with jumps:
    COMP(N, 1);;
    IF NJLT, JUMP _ELSE;;
    J5 = NULL;;
    JUMP _END_IF;;
_ELSE:
    J5 = value;;
_END_IF:
Version with conditional instructions:
    COMP(N, 1);;
    IF JLT; DO J5 = NULL;;
    IF NJLT; DO J5 = value;;
Concept is there – we need to check on whether the syntax is correct

18 Hardware – zero overhead loop. About 4 * N cycles better (N is times round the loop)
    LC0 = N;;                                 Load counter 0 with value N
Start_of_loop_LABEL:
    Loop code here ;;
    IF NLC0E, JUMP Start_of_loop_LABEL;;
NLC0E – Not LC0 expired – essentially:
Compare LC0 with 2
If less than 2, continue (don’t jump)
If 2 or more, then decrement LC0 and jump
All sorts of stall issues if not properly aligned – TigerSHARC manual 8-23
CAN’T USE WHEN THERE IS A FUNCTION CALL IN THE LOOP? WHY NOT? WHAT HAPPENS? NEED TO EXPLORE MORE.
Using a software loop when there is a function call is okay, since calling a function is slow anyway – don’t need efficiency

19 Hardware – zero overhead loop. BIG WARNING
    LC0 = N;;                                 Load counter 0 with value N
LC0 uses UNSIGNED ARITHMETIC – MAKE SURE N is not negative, as a negative number has the same bit pattern as a VERY large unsigned number, and the processor will go around the loop for a week
We did a check for N <= 0 before entering the hardware loop as another part of our code, so we lucked out; otherwise we could have had big problems
This issue is so important (and time wasting in the laboratories) that marks will be deducted for it in quizzes and exams
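
The bit-pattern problem can be reproduced in C++ without any TigerSHARC hardware. The sketch below assumes a 32-bit counter; AsLoopCounter and SafeLoopCount are hypothetical helper names used only for illustration. The first shows what an unsigned counter "sees" when handed a negative N, the second is the kind of N <= 0 guard the slide describes.

```cpp
#include <cstdint>

// What an unsigned 32-bit loop counter sees when loaded with a signed N.
// Conversion to unsigned is modulo 2^32, so -1 becomes 4294967295.
std::uint32_t AsLoopCounter(std::int32_t n) {
    return static_cast<std::uint32_t>(n);
}

// Guard applied before loading the counter: any N <= 0 means
// "do not enter the loop at all", reported here as a count of 0.
std::uint32_t SafeLoopCount(std::int32_t n) {
    return (n > 0) ? static_cast<std::uint32_t>(n) : 0u;
}
```

With the guard in place, the caller skips the loop whenever SafeLoopCount returns 0 instead of loading a huge value into the counter.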

20 What’s this XR1, YR1 and R1 stuff
TigerSHARC is designed to do many things at once
So you need appropriate syntax to control it

21 What’s this XR1, YR1 and R1 stuff
XYR1 = R2 + R3;; does 2 adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3
You can add the X values and not the Y values with this syntax:
    XR1 = R2 + R3;;
and NOT with
    XR1 = XR2 + XR3;;
Ugly – but they (ADI) will not change the syntax (DAMY)

22 What’s this XR1, YR1 and R1 stuff
XYR1 = [J0 += 0x1];;
Does a 32-bit fetch and puts the same value into XR1 and YR1. Same as doing
    XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; at the same time
XYR1 = L[J0 +0x2];;
Does a dual 64-bit fetch and is the same as doing
    XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time

23 What’s this XR1, YR1 and R1 stuff
XYR1 = [J0 += 0x1];; means
    XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;
XYR1 = L[J0 +0x2];; means
    XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time
XR1:0 = L[J0 +0x2];; means
    XR0 = [J0 += 1];; AND XR1 = [J0 += 1];;
XYR1:0 = L[J0 +0x2];; means
    XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

24 What’s this XR1, YR1 and R1 stuff
XYR1:0 = L[J0 +0x2];; means
    XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;
XR3:0 = Q[J0 +0x4];; means
    XR0 = [J0 += 1];; AND XR1 = [J0 += 1];; AND XR2 = [J0 += 1];; AND XR3 = [J0 += 1];;
XYR3:0 = Q[J0 +0x4];; means
    XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; AND XR2 = [J0 += 0];; AND YR2 = [J0 += 1];; AND XR3 = [J0 += 0];; AND YR3 = [J0 += 1];;
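
To keep the load forms straight, here is a toy C++ model of three of them exactly as the slides describe them. It is a sketch of the slides' description, not of the hardware manual; the struct ComputeBlocks and the functions LoadBroadcast, LoadSplit and LoadPairX are hypothetical names, and j0 is modelled as an index into a word array.

```cpp
#include <cstddef>
#include <cstdint>

// Toy register file: four X registers and four Y registers.
struct ComputeBlocks {
    std::uint32_t xr[4] = {0u, 0u, 0u, 0u};
    std::uint32_t yr[4] = {0u, 0u, 0u, 0u};
};

// Models XYRn = [J0 += 1];;  one word is copied into both XRn and YRn.
void LoadBroadcast(ComputeBlocks& cb, const std::uint32_t* mem,
                   std::size_t& j0, int n) {
    cb.xr[n] = mem[j0];
    cb.yr[n] = mem[j0];
    j0 += 1;
}

// Models XYRn = L[...];; as described on the slide: the first word goes
// to XRn, the second to YRn, and J0 moves on by two words.
void LoadSplit(ComputeBlocks& cb, const std::uint32_t* mem,
               std::size_t& j0, int n) {
    cb.xr[n] = mem[j0];
    cb.yr[n] = mem[j0 + 1];
    j0 += 2;
}

// Models XR1:0 = L[...];;  two consecutive words go to XR0 and XR1.
void LoadPairX(ComputeBlocks& cb, const std::uint32_t* mem,
               std::size_t& j0) {
    cb.xr[0] = mem[j0];
    cb.xr[1] = mem[j0 + 1];
    j0 += 2;
}
```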

25 Float release – identify new instructions
I see 1 new instruction

26 Difference between integer and float operations
XYR1 = R2 + R3;; does 2 INTEGER adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3
SYNTAX: XR1 = R2 + R3;; and NOT XR1 = XR2 + XR3;;
Use F syntax to make it a float operation
XYFR1 = R2 + R3;; does 2 FLOATING adds: XFR1 = R2 + R3 and YFR1 = R2 + R3
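
Why the F matters: a register just holds 32 bits, and the instruction decides whether they are treated as an integer or as a float. The C++ sketch below, assuming IEEE-754 single precision, adds the same two bit patterns both ways. BitsOf, AddAsInt and AddAsFloat are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>

// Raw 32-bit pattern of a float (assumes IEEE-754 single precision).
std::uint32_t BitsOf(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Integer add of two register contents (like R1 = R2 + R3).
std::uint32_t AddAsInt(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Float add of the same register contents (like FR1 = R2 + R3).
std::uint32_t AddAsFloat(std::uint32_t a, std::uint32_t b) {
    float fa;
    float fb;
    std::memcpy(&fa, &a, sizeof fa);
    std::memcpy(&fb, &b, sizeof fb);
    return BitsOf(fa + fb);
}
```

With the bit patterns of 1.0f and 2.0f in the two registers, the float add gives the pattern of 3.0f, while the integer add gives 0x7F800000, which read back as a float is +infinity.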

27 Exercise 1 – needed for Lab. 1
FIR filter operation -- data and filter coefficients are both integer arrays – write in C++
New_value from Audio A/D, output sent to Audio D/A
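
As a starting point for the exercise, here is one possible shape for the C++ version. It is a sketch, not the lab solution: the name FirFilter, the parameter order, and the choice to shift the delay line on every call are all assumptions, and no output scaling is applied.

```cpp
#include <cstddef>

// One step of an integer FIR filter.
// new_value : latest sample (e.g. from the audio A/D)
// data      : delay line of num_taps samples, data[0] is the newest
// coeffs    : num_taps filter coefficients
// Returns the filtered output (e.g. to send to the audio D/A).
int FirFilter(int new_value, int* data, const int* coeffs,
              std::size_t num_taps) {
    if (num_taps == 0) {
        return 0;  // guard: num_taps - 1 below would wrap around to a huge value
    }

    // Shift the delay line by one sample; the oldest value falls off the end.
    for (std::size_t i = num_taps - 1; i > 0; --i) {
        data[i] = data[i - 1];
    }
    data[0] = new_value;

    // Multiply-accumulate over all taps, using a wide accumulator.
    long long sum = 0;
    for (std::size_t i = 0; i < num_taps; ++i) {
        sum += static_cast<long long>(data[i]) * coeffs[i];
    }
    return static_cast<int>(sum);
}
```

Feeding in a single 1 followed by zeros should return the coefficients one by one, which makes a convenient first test before moving to the assembly version.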

28 Exercise – needed for Lab. 1
FIR filter operation -- data and filter coefficients are both integer arrays -- ASM

29 Insert C++ code – for Lab. 1

30 Insert assembler code version (Lab. 2)

31 What we NOW KNOW: EVERYTHING FOR THE FINAL (REALLY -- ALMOST)!
Can we return from an assembly language routine without crashing the processor?
Return a parameter from an assembly language routine (is it the same for ints and floats?)
Pass parameters into assembly language
Do IF THEN ELSE statements
Read and write values to memory
Read and write values in a loop
Do some mathematics on the values fetched from memory
All this stuff was demonstrated by coding HalfWaveRectifyASM( )

