A first attempt at learning about optimizing the TigerSHARC code – TigerSHARC assembly syntax

Presentation transcript:

A first attempt at learning about optimizing the TigerSHARC code TigerSHARC assembly syntax

2/10/2016 TigerSHARC assemble code 3, M. Smith, ECE, University of Calgary, Canada 2 / 28

What we NOW KNOW!
Can we return from an assembly language routine without crashing the processor?
Return a parameter from an assembly language routine (Is it the same for ints and floats?)
Pass parameters into assembly language (Is it the same for ints and floats?)
Do IF THEN ELSE statements
Read and write values to memory
Read and write values in a loop
Do some mathematics on the values fetched from memory
All this stuff is demonstrated by coding HalfWaveRectifyASM( )

Not bad for a first effort
Faster than the compiler in debug mode
Need to learn from the compiler how to speed up the code

How does the compiler do it?
Look at the source code and use mixed mode to show the instructions the compiler generated
Warning – instructions are displayed out of order

Many new instructions. Many parallel instructions. The ones inside the loop are key.
How important is coding whether the conditional jump is predicted or not predicted (NP)? BIG – 25%
523 → 435

Many new instructions. Many parallel instructions. The ones inside the loop are key.
How important is not using J registers when reading from memory? XR1 rather than J1.
Now need the condition XALT rather than JLT, and XCOMP rather than COMP.
JMP (NP): 523 → 435
XR1 not J1: 435 → 491

Many new instructions. Many parallel instructions. The ones inside the loop are key.
How important is not using J registers as a destination when reading from memory, and using pointers (*pt++) rather than array indexing (pt[count])? XR1 rather than J1.
Now need the condition XALT rather than JLT, and XCOMP rather than COMP.
JMP (NP): 523 → 435
XR1 not J1: 435 → 491
and ++ operator: 491 → 435
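The two C++ styles being compared can be sketched as below. This is only an illustration: the function names and the `(in, out, n)` signature are hypothetical, since the transcript does not show the C++ prototype used in the lecture. The point is that `*pt++` reads a value and steps the pointer in a single expression, which is the shape of a post-modify load such as `XR1 = [J0 += 1];;`.

```cpp
#include <cassert>

// Array-index style: each access is written as base + count.
// (Hypothetical name and signature, for illustration only.)
void HalfWaveRectifyIndexed(const int in[], int out[], int n) {
    assert(n >= 0);
    for (int count = 0; count < n; ++count) {
        if (in[count] > 0) {
            out[count] = in[count];
        } else {
            out[count] = 0;
        }
    }
}

// Pointer style: *in++ fetches the value and advances the pointer together.
void HalfWaveRectifyPointer(const int* in, int* out, int n) {
    assert(n >= 0);
    for (int count = 0; count < n; ++count) {
        const int value = *in++;
        if (value > 0) {
            *out++ = value;
        } else {
            *out++ = 0;
        }
    }
}
```

Both functions produce the same output; any speed difference comes only from the instructions the compiler chooses for each form.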

Redoing our code to this point. Note the new instructions using XR2 and R2.
Try a little thing: R2 = 0 is a constant – move it outside the loop
Found we had already set R2 = 0 outside the loop
The instruction was executed about half the time round the loop – expect to improve by 12 cycles
Got 491 → 476 = 15 – timing is only accurate to around 10 cycles

The IF THEN JUMPS in the loop are killing us. Rewrite the C++ code into optimized form.
Reduce the loop size from 6 instructions if > 0 and 7 if < 0 to 4 either way.
Loop count 24 – expect an improvement of 48 cycles
We go from 476 to 250 cycles
That’s 226 cycles, or roughly 9 cycles saved each time around the loop
The jumps were costing us 9 cycles by disrupting the TigerSHARC pipeline
Need to get rid of this jump and counter increment. Blackfin has hardware loops.
Does the TigerSHARC? – Duh!!
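The transcript does not show the rewritten C++, so the following is only a guess at the kind of change meant: replace the two-way IF/ELSE (two jump targets) with one straight-line path containing a single conditional assignment. The function names are hypothetical.

```cpp
#include <cassert>

// Original shape: two different paths through the loop body,
// each needing a jump. (Hypothetical name, for illustration only.)
void RectifyWithBranches(const int* in, int* out, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        if (in[i] > 0) {
            out[i] = in[i];
        } else {
            out[i] = 0;
        }
    }
}

// Rewritten shape: one path, one store. The conditional assignment is
// a candidate for a conditional instruction rather than a jump.
void RectifyStraightLine(const int* in, int* out, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        int value = *in++;
        value = (value < 0) ? 0 : value;
        *out++ = value;
    }
}
```

Whether a given compiler actually emits a conditional instruction for the second form is not guaranteed; the mixed-mode display is the way to check.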

Many new instructions. Many parallel instructions. The ones inside the loop are key.
Hardware loop instructions
LC0 = loop counter 0 – only a few hardware loops may be possible
SHARC ADSP – allows 6, Blackfin ADSP-BF5XX – allows 2, so we still need to understand software loops
IF LC0E → if hardware loop expired, IF NLC0E → if not expired – MM!!
JMP (NP): 523 → 435
XR1 not J1: 491
and ++ operator: 435
Remove inner jumps from loop: 250

With hardware loops – 166 cycles! Are we cooking or what!
Fine tuning – can we save N cycles (1 each time round the loop) by merging instructions?

Merge those two instructions and use our fancy SIGN-BIT trick for float code
We are beating the optimized compiler on the float code by a factor of 2
We need 1 cycle to beat the compiler on the optimized int code
Find it for Assignment 1 – I did 138 cycles
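The transcript does not spell out the SIGN-BIT trick. One plausible reading, offered here only as an assumption, is that an IEEE-754 single-precision float is negative exactly when bit 31 of its 32-bit pattern is set, so the float rectifier can test that bit with integer operations instead of doing a floating-point compare. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Half-wave rectify one float by inspecting its sign bit.
// Assumes 32-bit IEEE-754 floats. Note that -0.0f and negative NaNs
// also have the sign bit set and are mapped to +0.0f.
float RectifyFloatViaSignBit(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);   // view the float as raw bits
    if (bits & 0x80000000u) {
        bits = 0;                              // all-zero pattern is +0.0f
    }
    float result = 0.0f;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```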

My code passes the tests in 138 cycles
Extra 11 cycles from outside the loop (not worth the time and effort if the loop was larger, or there were more points to process)
Does turning off the cache make any difference to our code?
Find out in Assignment 1

What is the theoretical maximum speed?
This is something I always work out BEFORE optimizing. I have a target to meet – normally to finish all processing before the next sample comes in. If my code (in theory) can’t meet that target, I need to find a different approach, not spend days optimizing useless code.
In theory – if I have written the code with no hidden stalls – 1 cycle per instruction
6 instructions outside the loop
4 instructions inside the loop – N * 4 cycles
Very short loop – I read that getting out of a very short loop stalls the pipeline – let’s add 5 cycles for that
6 + 24 * 4 + 5 = 107 in theory, 138 in practice
Difference 31 – not far from 24, or about 1 stall each time round the loop
Can use the pipeline viewer to find out where the problem is occurring. In a long loop, done 4096 times, it might be worth it.
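The estimate on this slide can be written down as a small calculation. The constants come from the slide (6 instructions outside the loop, 4 inside, 5 cycles allowed for leaving a short loop, 24 passes, 138 cycles measured); the function and constant names are made up for this sketch.

```cpp
#include <cassert>

// Figures taken from the slide; names are illustrative only.
constexpr int kInstructionsOutsideLoop = 6;
constexpr int kInstructionsInsideLoop = 4;
constexpr int kShortLoopExitPenalty = 5;

// Best case: one cycle per instruction and no hidden stalls.
constexpr int TheoreticalCycles(int passes) {
    return kInstructionsOutsideLoop
         + kInstructionsInsideLoop * passes
         + kShortLoopExitPenalty;
}

// Cycles not explained by the best-case model.
constexpr int UnexplainedCycles(int measured, int passes) {
    return measured - TheoreticalCycles(passes);
}

static_assert(TheoreticalCycles(24) == 107, "matches the slide's estimate");
```

With 24 passes and 138 measured cycles, `UnexplainedCycles(138, 24)` is 31, which is a little over one extra cycle per pass.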

Trying to understand what we have done
Most TigerSHARC instructions can be made conditional. WHY?
Because doing a NOP instruction (if the condition is not met) is much less disruptive to the instruction pipeline than doing a JUMP (loss of 9 cycles if the jump is taken – probably more because of code format)

Why mostly conditional instructions?
TigerSHARC has a very deep pipeline, so conditional jumps can cause a large disruption of the pipeline
Better to use non-jump instructions which don’t disrupt the pipeline, even if the instruction is not executed (acts as a nop)
if (N < 1) return_value = NULL; else return_value = value;

Why mostly conditional instructions?
if (N < 1) return_value = NULL; else return_value = value;
With jumps:
    COMP(N, 1);;
    IF NJLT, JUMP _ELSE;;
    J5 = NULL;;
    JUMP _END_IF;;
_ELSE:
    J5 = value;;
_END_IF:
With conditional instructions:
    COMP(N, 1);;
    IF JLT; DO, J5 = NULL;;
    IF NJLT; DO, J5 = value;;
Concept is there – we need to check on whether the syntax is correct

Trying to understand what we have done
Use J registers for address operations, but store values from memory in XR1 and YR1. WHY?
Instructions like [J1] = XR1;; have the potential to be put in parallel with more operations

Hardware – zero overhead loop
About 4 * N cycles better (N is times round the loop)
    LC0 = N;;    // Load counter 0 with value N
Start_of_loop_LABEL:
    Loop code here ;;
    IF NLC0E, JUMP Start_of_loop_LABEL;;
NLC0E – Not LC0 expired – essentially: compare LC0 with 2. If less than 2, continue (don’t jump). If 2 or more, then decrement LC0 and jump.
All sorts of stall issues if not properly aligned – TigerSHARC manual 8-23
CAN’T USE WHEN THERE IS A FUNCTION CALL IN THE LOOP? WHY NOT? – WHAT HAPPENS – NEED TO EXPLORE MORE.
Using a software loop when there is a function call is okay since calling a function is slow anyway – don’t need efficiency
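The description of NLC0E above can be turned into a small C++ model that counts how often the loop body runs. This models only the slide's wording (test at the bottom of the loop, compare with 2, decrement and jump), not the processor itself, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Counts passes through the loop body for LC0 = n, following the
// slide's description of IF NLC0E, JUMP ...;;
std::uint64_t HardwareLoopPasses(std::uint32_t n) {
    std::uint32_t lc0 = n;          // LC0 = N;;
    std::uint64_t passes = 0;
    for (;;) {
        ++passes;                   // "Loop code here"
        if (lc0 < 2) {
            break;                  // expired: fall through, no jump
        }
        --lc0;                      // not expired: decrement and jump back
    }
    return passes;
}
```

Under this model the body runs N times for N of 1 or more, and once for N = 0, because the test sits at the bottom of the loop.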

Hardware – zero overhead loop. BIG WARNING
    LC0 = N;;    // Load counter 0 with value N
LC0 uses UNSIGNED ARITHMETIC – MAKE SURE N is not negative, as a negative number has the same bit pattern as a VERY large unsigned number, and the processor will go around the loop for a week
We did a check for N <= 0 before entering the hardware loop as another part of our code – so we lucked in – otherwise we could have had big problems.
This issue is so important (and time wasting in the laboratories) that I will be deducting marks for it in quizzes and exams
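The warning can be seen in C++ by converting a negative count to a 32-bit unsigned value. This sketch assumes the loop counter behaves like a 32-bit unsigned integer, as the slide describes; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// What a signed count looks like once it is treated as unsigned.
std::uint32_t AsUnsignedLoopCount(int n) {
    return static_cast<std::uint32_t>(n);
}

// The guard the slide describes: only enter the hardware loop for N > 0.
bool SafeToLoadLoopCounter(int n) {
    return n > 0;
}
```

For example, `AsUnsignedLoopCount(-1)` is 4294967295, so a loop loaded with N = -1 would run over four billion times unless the guard rejects it first.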

What’s this XR1, YR1 and R1 stuff
TigerSHARC is designed to do many things at once
So you need appropriate syntax to control it

What’s this XR1, YR1 and R1 stuff
XYR1 = R2 + R3;; does 2 adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3
You can add the X values and not the Y values with this syntax:
    XR1 = R2 + R3;;
And NOT with:
    XR1 = XR2 + XR3;;
Ugly – but they (ADI) will not change the syntax (DAMY)

What’s this XR1, YR1 and R1 stuff
XYR1 = [J0 += 0x1];;
Does a 32-bit fetch and puts the same value into XR1 and YR1. Same as doing XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; at the same time
XYR1 = L[J0 += 0x2];;
Does a 64-bit (long word) fetch and is the same as doing XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time

What’s this XR1, YR1 and R1 stuff
XYR1 = [J0 += 0x1];; means XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;
XYR1 = L[J0 += 0x2];; means XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time
XR1:0 = L[J0 += 0x2];; means XR0 = [J0 += 1];; AND XR1 = [J0 += 1];;
XYR1:0 = L[J0 += 0x2];; means XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

What’s this XR1, YR1 and R1 stuff
XYR1:0 = L[J0 += 0x2];; means XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;
XR3:0 = Q[J0 += 0x4];; means XR0 = [J0 += 1];; AND XR1 = [J0 += 1];; AND XR2 = [J0 += 1];; AND XR3 = [J0 += 1];;
XYR3:0 = Q[J0 += 0x4];; means XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; AND XR2 = [J0 += 0];; AND YR2 = [J0 += 1];; AND XR3 = [J0 += 0];; AND YR3 = [J0 += 1];;

Float release code generated by the C++ compiler – identify the new instructions
I see 1 new instruction

Difference between integer and floating-point operations
XYR1 = R2 + R3;; does 2 INTEGER adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3
SYNTAX: XR1 = R2 + R3;; and NOT XR1 = XR2 + XR3;;
Use the F syntax to make it a float operation
XYFR1 = R2 + R3;; does 2 FLOATING adds: XFR1 = R2 + R3 and YFR1 = R2 + R3
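Why the F matters can be shown in C++: the same two 32-bit register patterns give very different results depending on whether the adder treats them as integers or as IEEE-754 floats. This is a host-side illustration assuming 32-bit IEEE-754 floats, not TigerSHARC code, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw 32-bit pattern of a float (what a register would hold).
std::uint32_t BitsOf(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Integer add of two register patterns (the plain R2 + R3 case).
std::uint32_t IntegerAddPatterns(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Floating-point add of the same values (the F-prefixed case).
float FloatAdd(float a, float b) {
    return a + b;
}
```

Adding the patterns of 1.0f (0x3F800000) and 2.0f (0x40000000) as integers gives 0x7F800000, which is the pattern for +infinity, while the floating add gives 3.0f (0x40400000).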

Exercise 1 – needed for Lab. 1
FIR filter operation – data and filter coefficients are both integer arrays – write in C++
New_value from Audio A/D, output sent to Audio D/A
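One possible starting point for the C++ version is sketched below. It is not the lab solution: the function name, the argument order and the shift-register delay line are all assumptions, and the audio A/D and D/A handling is left out. It takes one new sample, updates the stored history and returns the sum of products.

```cpp
#include <cassert>
#include <cstddef>

// One FIR step on integer data (hypothetical signature).
// delay_line[0] holds the newest sample, delay_line[taps - 1] the oldest.
int FirFilterStep(int new_value, int delay_line[],
                  const int coefficients[], std::size_t taps) {
    assert(taps > 0);

    // Age the stored samples by one position, oldest falls off the end.
    for (std::size_t i = taps - 1; i > 0; --i) {
        delay_line[i] = delay_line[i - 1];
    }
    delay_line[0] = new_value;

    // Multiply-accumulate across all taps.
    int sum = 0;
    for (std::size_t i = 0; i < taps; ++i) {
        sum += delay_line[i] * coefficients[i];
    }
    return sum;
}
```

Feeding in a single 1 followed by zeros should return the coefficients one after another, which makes a convenient first test. Shifting the whole delay line each sample is simple but slow; a circular buffer is the usual alternative.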

Exercise – needed for Lab. 1
FIR filter operation – data and filter coefficients are both integer arrays – ASM

Insert C++ code – for Lab. 1

Insert assembler code version (Lab. 2)

What we NOW KNOW – EVERYTHING FOR THE FINAL (REALLY -- ALMOST)!
Can we return from an assembly language routine without crashing the processor?
Return a parameter from an assembly language routine (Is it the same for ints and floats?)
Pass parameters into assembly language (Is it the same for ints and floats?)
Do IF THEN ELSE statements
Read and write values to memory
Read and write values in a loop
Do some mathematics on the values fetched from memory
All this stuff was demonstrated by coding HalfWaveRectifyASM( )