SquishDSP -- Tutorial: A tool to assist in developing parallel ADSP2106X code


1 SquishDSP -- Tutorial: A tool to assist in developing parallel ADSP2106X code
M. R. Smith, Electrical and Computer Engineering, University of Calgary, Alberta, Canada
ucalgary.ca

2 Material covered
The efficiency of the assembly code produced by the optimizing VisualDSP++ compiler depends on the design and form of the “C/C++” algorithm.
A simple code example is shown in a variety of design formats to compare speed.
Code developed by the optimizing compiler, or through custom development processes, often needs further optimization.
The tool SquishDSP assists in identifying dependencies in your code and in finding possible parallelization of instructions.
Speed improvement is algorithm and design dependent, but we have doubled the speed of code produced by the VisualDSP++ compiler. Further tests are needed to see whether the improvements scale to more complex DSP algorithms.
This tutorial was developed for teaching purposes, and some parts “may provide BGOs” for people familiar with the concepts.
1/16/2019 ENEL SQUISHDSP -- ADSP2106X parallelization tool Copyright M. Smith

3 Typical but simple DSP algorithm
Note the loop: memory intensive, multiplication and addition intensive, with use of constants -- typical DSP stuff.
Note the use of both “dm” and “pm” arrays.
The code uses a “known” constant array size, as that gives the optimizing compiler better opportunities than a “variable” array size passed in as a parameter to the subroutine.

4 VisualDSP++ output
Much more parallel ADSP2106X code than was available from VisualDSP 4.1
1 calculation in each loop
Average 2 cycles/calculation

5 Alternate source code -- larger loops
Approach 1
For (count < N / 2)
Begin
    1; …... 5; 6; …..
End Loop
May lead to more parallel instructions in the ‘middle’ of the new, longer loop.
May lead to “running out of program memory” on the ADSP2106X if the DSP algorithm code is long. (Not just this code is in memory!)
A variation is needed if N is not a multiple of 2.

6 Unroll the loop
Tighter code was anticipated from variant 1 on the ADSP2106X.
The second format was chosen because the approach might be useful on the Hammerhead ADSP2116X in SIMD mode. GOOD for SIMD?

7 Variant 1 -- Double loop using count++
2 calculations in each loop
Average 5 cycles/calculation
VERY POOR OPTIMIZATION
Unexpected software loop increases overhead: 2 cycles per loop

8 Variant 2 -- using index [count + 1]
Very impressed in some ways
6 calculations in each loop
Average 2 cycles/calculation
OPTIMIZATION NO BETTER THAN ORIGINAL SINGLE LOOP EXAMPLE, BUT IT LOOKS EASY TO FURTHER REDUCE THE LOOP CYCLE COUNT AS THE COMPILER HAS PLACED VALUES IN THE CORRECT REGISTERS FOR PARALLEL OPS

9 Variant 2 -- using index [count + 1]
EASY TO REDUCE CYCLES AS THE COMPILER HAS PLACED VALUES IN THE CORRECT REGISTERS FOR PARALLEL SHARC OPS
FOR EXAMPLE
Moving pm(i13, m12) down one cycle allows the parallel operation F12=F0*F4, F1=F11+F12;
One cycle decrease already

10 Further speed improvement?
By playing around with the code, I thought I could get it down to 1 cycle per calculation.
However, even with this simple code, I was not sure whether I was handling all the data dependencies correctly. That would be impossible with a larger code sequence.
I therefore decided to move the code into Microsoft Project, which is a business scheduling tool, rather than write my own scheduler!
Hence the tool SquishDSP.

11 Step 1 -- Develop the initial code -- process.c
Notes
LOOP SIZE -- FIXED as a constant MAXSIZE and not a variable.
Use of both DM and PM data busses in the “C” program.
Double loop of code with index registers. This [count] then [count+1] form of double loop was chosen from several variants tried.

12 Step 2 -- Pass through VisualDSP++
Note in “process.s” that the compiler has unrolled the loop further -- 6 calculations are performed per loop.
Initially, work with the “loop component” only in the next stages.

13 Step 3A -- First Stages of SquishDSP
Pass 1 -- Replace the “commas” in each instruction that are not instruction separators. This was initially done to get the code into a .CSV format, but is currently retained as a reliable way to prepare for Pass 2.
Pass 2 -- Identify all parallel instructions and break them up into single instructions, taking care of “local dependencies”. The original instructions are retained.
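The comma handling in Passes 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not SquishDSP's code: the function name `split_parallel` is hypothetical, and the sketch does not attempt the “local dependencies” handling that Pass 2 performs. It treats a comma as an instruction separator only when it sits outside parentheses, so the comma in `dm(i4,m4)` is left alone.

```python
def split_parallel(instruction):
    """Split one parallel instruction into its single operations.

    A comma inside parentheses, as in dm(i4,m4), belongs to an operand
    and is not an instruction separator, so it is kept.
    """
    text = instruction.strip().rstrip(";")
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # Top-level comma: end of one operation.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


if __name__ == "__main__":
    for op in split_parallel("F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;"):
        print(op)
```

The example instruction is one of the lines from the double-loop slide later in this tutorial.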

14 Step 3B -- First Stages of SquishDSP
Pass 3 -- Add dependency information in a Microsoft Project compatible format.
Pass 4 -- Reformat into a totally Project compatible format, and “pretty format” to restore the original ADSP2106X style of syntax.
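The kind of dependency information Pass 3 needs can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, only read-after-write dependencies are tracked (a real tool would also have to consider write-after-read and write-after-write cases), and only simple `dest=source` operations with post-modify `dm(i,m)` / `pm(i,m)` accesses are understood. F and R registers with the same number are treated as the same register, as the later Pass 6 slide states.

```python
import re

REGISTER = re.compile(r"[a-z]+\d+")
MEMORY = re.compile(r"(?:dm|pm)\((i\d+),(m\d+)\)$")


def normalise(register):
    # F12 and R12 are treated as the same data register.
    return "r" + register[1:] if register[0] == "f" else register


def reads_writes(operation):
    """Return (registers read, registers written) for one single operation."""
    dest, source = operation.lower().replace(" ", "").split("=", 1)
    reads = {normalise(r) for r in REGISTER.findall(source)}
    writes = set()
    for side in (dest, source):
        access = MEMORY.match(side)
        if access:
            reads.update(access.groups())
            writes.add(access.group(1))  # post-modify updates the index register
    if not MEMORY.match(dest):
        writes.add(normalise(dest))
    return reads, writes


def raw_dependencies(operations):
    """For each operation, the indices of earlier operations whose results it reads."""
    last_writer, dependencies = {}, []
    for index, operation in enumerate(operations):
        reads, writes = reads_writes(operation)
        dependencies.append({last_writer[r] for r in reads if r in last_writer})
        for register in writes:
            last_writer[register] = index
    return dependencies


if __name__ == "__main__":
    ops = ["r0=dm(i2,m1)", "F12=F0*F4", "F1=F11+F12", "pm(i13,m12)=r1"]
    for op, deps in zip(ops, raw_dependencies(ops)):
        print(op, "depends on", sorted(deps))
```

In this chain each operation depends on the one before it, which is the predecessor information a scheduling tool needs before it can move tasks.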

15 Steps inside Microsoft Project
Input “microsoftproject.txt” using “txt -- Default Task Information”.
Select TOOLS | Resource Leveling | Clear Leveling -- note the overused resources.
Select TOOLS | Resource Leveling | Level Now -- note the proper allocation of resources.
Click the “Task Name” bar and select SORT | Ascending | START.
Cut and paste the columns “Task Name, Duration, Start” into the Notepad file “rescheduledproject.txt”.
I tried saving the file directly from Project and then sorting the tasks by date etc., but the Project interface was very clumsy for this type of file. (I don’t know how to access “.mpp” formatted files.) In addition, Project did a better job with SORT | Ascending | START.

16 Some side issues
You can model different processor architectures quite easily.
Suppose you have single cycle addition but double cycle multiplication. Simply set the task duration for each use of the MULTIPLIER to 2.
Adjustments to Microsoft Project -- fine detail
Set to “Don’t split tasks to allow activities to occur on different days”. Not applicable at the moment.
Other “fine details” to come.
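The effect of changing a task duration can be seen with a toy scheduler in Python. This is a simple greedy stand-in written for illustration, not Microsoft Project's leveling algorithm; the function name `level`, the task tuples and the resource name “ALU” are assumptions (only “MULTIPLIER” appears on the slide). Each task waits for its predecessors to finish and then takes the earliest slot in which its resource is free for its whole duration.

```python
def level(tasks, durations=None):
    """Greedy resource leveling.

    tasks: list of (name, resource, predecessor_indices), in program order.
    durations: optional {resource: cycles}; resources not listed take 1 cycle.
    Returns the start cycle chosen for each task.
    """
    durations = durations or {}
    busy = {}                  # resource -> set of cycles already occupied
    starts, finishes = [], []
    for _name, resource, predecessors in tasks:
        length = durations.get(resource, 1)
        # Cannot start before every predecessor has finished.
        start = max((finishes[p] for p in predecessors), default=0)
        used = busy.setdefault(resource, set())
        # Slide later until the resource is free for the whole duration.
        while any((start + k) in used for k in range(length)):
            start += 1
        used.update(range(start, start + length))
        starts.append(start)
        finishes.append(start + length)
    return starts


if __name__ == "__main__":
    tasks = [
        ("F12=F0*F4", "MULTIPLIER", []),
        ("F13=F1*F5", "MULTIPLIER", []),
        ("F8=F11+F12", "ALU", [0]),
    ]
    print(level(tasks))                       # single cycle multiplier
    print(level(tasks, {"MULTIPLIER": 2}))    # double cycle multiplier
```

With a double cycle multiplier, the second multiplication and the dependent addition both move later, which is the behaviour the duration setting is meant to model.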

17 Step 4A -- Input into “Microsoft Project”
Select “txt -- Default Task Information” and import the file.

18 Display in ‘non-leveled’ mode
Select TOOLS | Resource Leveling | Clear Leveling -- note the highly overused resources when using SquishDSP V1.0.

19 Display in ‘leveled’ mode
Select TOOLS | Resource Leveling | Level Now -- note the proper allocation of resources, even when using SquishDSP 1.0.

20 Step 4B -- Display in ‘non-leveled’ mode
Select TOOLS | Resource Leveling | Clear Leveling -- note that with SquishDSP 2.0 there are now only a few overused resources, as Project has already been able to resolve most conflicts.

21 Step 4C -- Display in ‘leveled’ mode
Select TOOLS | Resource Leveling | Level Now -- note the proper rescheduling of resources.

22 Step 4D -- Sort the tasks by “Start” date
Click in the “Task Name” bar and select “Sort | Ascending | Start”.

23 Step 4E -- Prepare “rescheduledproject.txt”
Cut and paste “Task Name, Duration, Start” into a Notepad file.

24 Step 5 -- Second Stage of SquishDSP
Pass 6 performs the following operations:
    Based on the ‘Start date information’ from the Microsoft Project files, regroup instructions into parallel instructions.
    Check whether the syntax of the registers is correct for parallel operations on the ADSP2106X.
    If the syntax is not correct, break up the instructions into valid instructions and send out appropriate error messages.
Correct syntax for parallel operations means:
    Post-modify using modify registers on all memory operations
    Multiplication using registers R(0, 1, 2, 3) * R(4, 5, 6, 7)
    Addition/Subtraction using R(8, 9, 10, 11) +/- R(12, 13, 14, 15)
    Float and Integer data registers recognized as equivalent
    Parallel + and - operations are not currently recognized as valid.
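The two main jobs of Pass 6 can be sketched in Python as follows. This is a hedged illustration only: the function names and the `(operation, start)` row format are assumptions, not the layout of “rescheduledproject.txt”, and the register ranges are copied from the rules listed on this slide rather than checked against the processor manual. `regroup` joins operations that share a start slot into one parallel instruction, and `valid_for_parallel` applies the register rule to a simple compute operation.

```python
import re
from itertools import groupby

# Operand classes quoted on the slide for parallel operations.
MULTIPLY_X, MULTIPLY_Y = range(0, 4), range(4, 8)
ADD_X, ADD_Y = range(8, 12), range(12, 16)

# dest = x op y, with F and R prefixes treated as equivalent.
COMPUTE = re.compile(r"^[rf]\d+=[rf](\d+)([*+-])[rf](\d+)$")


def regroup(rows):
    """Join operations with the same start slot into one parallel instruction.

    rows: iterable of (operation, start) pairs (a hypothetical format).
    """
    ordered = sorted(rows, key=lambda row: row[1])  # stable: keeps input order per slot
    return [
        ", ".join(operation for operation, _ in group) + ";"
        for _, group in groupby(ordered, key=lambda row: row[1])
    ]


def valid_for_parallel(operation):
    """Check the slide's register rule for one multiply or add/subtract."""
    match = COMPUTE.match(operation.lower().replace(" ", ""))
    if match is None:
        return False  # not a simple compute operation; this sketch gives up
    x, sign, y = int(match.group(1)), match.group(2), int(match.group(3))
    if sign == "*":
        return x in MULTIPLY_X and y in MULTIPLY_Y
    return x in ADD_X and y in ADD_Y


if __name__ == "__main__":
    rows = [("F12=F0*F4", 0), ("r0=dm(i2,m1)", 1), ("F1=F11+F12", 0)]
    print(regroup(rows))
    print(valid_for_parallel("F12=F0*F4"), valid_for_parallel("F1=F0+F4"))
```

An operation that fails the check would be the point at which the tool breaks the group up again and reports an error, as described above.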

25 Step 5 -- Second Stage of SquishDSP
The original code was a loop of 12 cycles.
This one is a loop of 8 cycles.
The “original” code is available for checking.

26 Current Approach to Optimization
Original starting code
For (count < N)
Begin
    1; 2; 3; 4; 5; 6;
End Loop
Optimized code
For (count < N)
Begin
    1, 2A; 2B, 3A; 3B, 4; 5, 6;
End Loop

27 Alternate source code -- larger loops
Approach 1
For (count < N / 2)
Begin
    1; …... 5; 6; …..
End Loop
May lead to more parallel instructions in the ‘middle’ of the new, longer loop.
May lead to “running out of program memory” on the ADSP2106X if the DSP algorithm code is long. (Not just this code is in memory!)
A variation is needed if N is not a multiple of 2.

28 Double Loop with N != 2 * p
F1=F11+F12, r0=dm(i2,m1);
F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;
.....
F8=F11+F13;
F12=F2*F4, pm(i12,m9)=r8;
lcntr=10, do(pc,_L$ )until lce;
_L$ : //end loop
_L$ ; -- end double loop
_L$ :

29 Adjust ‘lcntr’ values
In this example, the lcntr value was originally 21. We must use lcntr = 10 for the new double loop, and cut and paste the original loop outside the new loop to ensure that the total overall loop count is valid.
You can now see why the task of developing an optimizing compiler is not trivial. The optimizing compiler must be able to handle the general case reliably!
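The loop-count bookkeeping can be checked with a small Python model. This is a sketch under stated assumptions: the per-element calculation (`value * scale + offset`) is a made-up stand-in, since the real loop body is not reproduced in this transcript, and the function names are hypothetical. What it illustrates is the split of 21 passes into 10 passes of a doubled body plus 1 leftover pass of the original body.

```python
def split_loop_count(count, unroll=2):
    """Return (passes of the unrolled loop, leftover passes of the original loop)."""
    return divmod(count, unroll)


def process_single(values, scale, offset):
    # Stand-in for the original loop: one calculation per pass.
    return [value * scale + offset for value in values]


def process_doubled(values, scale, offset):
    # Doubled loop body using index [count] then [count + 1].
    out = [0.0] * len(values)
    pairs, _leftover = split_loop_count(len(values))
    for p in range(pairs):
        count = 2 * p
        out[count] = values[count] * scale + offset
        out[count + 1] = values[count + 1] * scale + offset
    # Leftover pass of the original loop, placed outside the doubled loop.
    for count in range(2 * pairs, len(values)):
        out[count] = values[count] * scale + offset
    return out


if __name__ == "__main__":
    data = [float(n) for n in range(21)]
    print(split_loop_count(21))
    print(process_doubled(data, 2.0, 1.0) == process_single(data, 2.0, 1.0))
```

The same check applies to any odd loop count: the doubled loop alone would miss the final element.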

30 Double loop re-optimized

31 Optimization results
Original code -- loop of 12 cycles with 6 sets of operations per loop
SquishDSP -- loop of 8 cycles with 6 sets of operations per loop -- saving of 33% of the time
Double original loop -- loop of 24 cycles with 12 sets of operations per loop
SquishDSP -- loop of 14 cycles with 12 sets of operations per loop -- saving of 42% of the time
Overall code length 20 cycles (14 in loop and 6 outside)

32 Source code re-arrangement
We can identify that some of the internal stages of the new rescheduled code are running totally parallel -- 4 operations per code. This suggests that rescheduling the loop operations will allow the generation of a highly efficient loop.
Rescheduling the loop means bringing instructions out of the loop and delaying all write operations until late in the loop.
To ensure accurate rearrangement of the code, perhaps we should change the priorities on the “pm” Microsoft Project tasks to be “As Late as Possible” rather than move them by hand, as was done in this example.
Note that the compiler has already done some moving.

33 Alternate starting points
Approach 2
1; 2;
For (count < N)
Begin
    3; 4; 5; 6;
End Loop
Possible adjustment of index registers
This is a valid approach if instructions 1 and 2 do not make any “permanent changes”. Making no “permanent changes” means no WRITING to external memory.
May require adjustment to registers after the loop because of the extra instructions -- particularly index registers that are post-modified.
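One reading of Approach 2 is that the instructions moved in front of the loop are executed once more than the results need, which is why they must not write to memory and why a post-modified index register may need correcting afterwards. The Python model below illustrates that reading only. The names, the single hoisted “read” instruction, the padding element and the multiply-by-`scale` body are all assumptions made for the sketch.

```python
def rotated_loop(source, scale):
    """Model a loop whose leading read has been moved in front of the loop.

    Returns (results, final index). The index is post-modified by every read,
    so the one extra read at the end must be undone after the loop.
    """
    padded = list(source) + [0.0]   # the extra read past the end must be harmless
    index = 0
    results = []

    value = padded[index]           # instruction 1, now outside the loop
    index += 1                      # post-modify of the index register

    for _ in range(len(source)):
        results.append(value * scale)   # rest of the original loop body
        value = padded[index]           # instruction 1 for the next pass
        index += 1

    index -= 1                      # adjust the index register after the loop
    return results, index


if __name__ == "__main__":
    print(rotated_loop([1.0, 2.0, 3.0], 2.0))
```

Without the final adjustment the index would point one element too far, which is the kind of register correction the slide warns about.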

34 Removed code from loop till first “write operation”

35 Moved 3 pm( ) write operations later in loop
These can now be moved outside the loop.

36 How many instructions to move?
It is very easy to make minor changes to the original code “process.s” open in a NotePad window, save the file, reactivate SquishDSP and quickly bring the file into Microsoft Project for examination.
It turned out that bringing “just two” instructions out of the loop was the best solution.

37 Optimum loop configuration

38 Optimum result -- 1 calculation per cycle -- Double VisualDSP++ speed
This loop is now just 6 cycles for 6 calculations.
Speed improvement will be very algorithm dependent.

39 Savings are very algorithm dependent
Original code -- loop of 12 cycles with 6 sets of operations per loop
SquishDSP -- loop of 8 cycles with 6 sets of operations per loop -- saving of 33% of the time
Double original loop -- loop of 24 cycles with 12 sets of operations per loop
SquishDSP -- loop of 14 cycles with 12 sets of operations per loop -- saving of 42% of the time
Overall code length 20 cycles (14 in loop and 6 outside)
Original code with 2 instructions extracted -- loop of 12 cycles with 6 sets of operations
SquishDSP -- loop of 6 cycles with 6 sets of operations per loop -- saving of 50% of the time -- processor at maximum pipeline capability
Overall code length 8 cycles (6 in loop and 2 outside)
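The percentages on this slide follow directly from the loop cycle counts, as this short Python check shows. The function names are illustrative; the numbers are the ones quoted above.

```python
def percent_saved(cycles_before, cycles_after):
    """Reduction in loop cycles, as a whole-number percentage."""
    return round(100 * (cycles_before - cycles_after) / cycles_before)


def cycles_per_calculation(loop_cycles, calculations):
    return loop_cycles / calculations


if __name__ == "__main__":
    # (description, cycles before, cycles after, calculations per loop)
    cases = [
        ("original loop", 12, 8, 6),
        ("double loop", 24, 14, 12),
        ("2 instructions extracted", 12, 6, 6),
    ]
    for name, before, after, calculations in cases:
        print(name,
              percent_saved(before, after), "% saved,",
              cycles_per_calculation(after, calculations), "cycles/calculation")
```

The last case reaches 1 cycle per calculation, half of the 2 cycles per calculation of the compiler output, which is the “double speed” result.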

40 Real life is not as simple as this
Loops from the optimizing compiler already have instructions inside and outside the loop.

41 Real “final” source code (without stack operations)
The code has been adjusted for the original instructions outside the loop.

42 SquishDSP Final Output

43 Conclusions
In SquishDSP we have a simple tool that appears to do a good job of further optimizing the output from the current version of VisualDSP++.
Even when equivalent features are added to a later version of VisualDSP++, SquishDSP will still be useful for optimizing “hand-code”.
Further work means more testing on SquishDSP:
    Is the tool “really” doing the job we think it is, or is it missing vital dependencies?
    Does it give back something useful for larger source files?
    Can we remove the dependency on the intermediate stage using Microsoft Project?

44 SquishDSP
For further information on this ADSP2106X utility, contact Dr. Mike Smith.
Not for general distribution -- under development

