
1 SHARC ECOLOGY 201
Using a Project Management Tool to handle Microprocessor Resources
M. R. Smith, University of Calgary, Canada
smithmr@ucalgary.ca
SHARC2001 Workshop, Boston

2 6/1/2015 SQUISHDSP -- ADSP2106X parallelization tool Copyright M. Smith -- smithmr@ucalgary.ca 2 / 48
Series of Talks and Workshops
- CACHE-DSP – Talk on a simple process tool to identify cache conflicts in DSP code.
- SQUISH-DSP – Talk on using a project management tool to automate identification of parallel DSP processor instructions.
- SHARC Ecology 101 – Workshop showing how to systematically write parallel 2106X code.
- SHARC Ecology 201 – Workshop on SQUISH-DSP and CACHE-DSP tools.

3 Material covered
- Efficiency of the assembly code produced by the optimizing VisualDSP++ compiler depends on the design/form of the “C/C++” algorithm.
  - Simple code example and a variety of design formats for speed
- Need to further improve the speed of code developed by the optimizing compiler or through custom development processes
- Use of the tool SquishDSP to assist in identifying dependencies in your code and possibly finding parallelization of instructions
  - Speed improvement is algorithm and design dependent, but we have doubled the speed of code produced by the VisualDSP++ compiler.
  - Further tests are needed to see if the improvements scale for more complex DSP algorithms.
- This tutorial was developed for teaching purposes and some parts “may provide BGOs” for people familiar with the concepts

4 Typical but simple DSP algorithm
- Note -- loop, memory intensive, multiplication and addition intensive, use of constants -- typical DSP stuff.
- Note use of both “dm” and “pm” arrays
- Uses a “known” constant array size, as that provides better opportunities for the optimizing compiler than a “variable” array size passed in as a parameter to the subroutine.

5 VisualDSP++ output
- Much more parallel ADSP2106X code than was available from VisualDSP 4.1
1 calculation in each loop
Average 2 cycles/calculation

6 Alternate source code -- larger loops
Approach 1
For (count < N / 2)
Begin
  1; …... 5; 6;
  1; ….. 5; 6;
End Loop
- May lead to more parallel instructions in the ‘middle’ of the new, longer loop
- May lead to “running out of program memory” on the ADSP2106X if the DSP algorithm code is long. (Not just this code is in memory!)
- Variation needed if N is not a multiple of 2
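
The "variation needed" for an odd N can be sketched as a doubled loop body followed by one leftover pass. This is only an illustration in Python: the calculation (a gain times one array plus another) and the names `process` and `process_unrolled` are assumptions, since the workshop's own source listing is not reproduced in this transcript.

```python
def process(x, y, gain):
    """Reference single loop: one calculation per pass."""
    return [gain * a + b for a, b in zip(x, y)]


def process_unrolled(x, y, gain):
    """Same calculation with the loop body written twice per pass."""
    n = len(x)
    out = [0.0] * n
    count = 0
    while count + 1 < n:
        # First copy of the body, using index [count]
        out[count] = gain * x[count] + y[count]
        # Second copy of the body, using index [count + 1]
        out[count + 1] = gain * x[count + 1] + y[count + 1]
        count += 2
    if count < n:
        # Leftover pass, needed only when N is odd
        out[count] = gain * x[count] + y[count]
    return out
```

The doubled body gives a scheduler more instructions to overlap, at the cost of roughly twice the program memory for the loop.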

7 Unroll the loop
- Anticipated tighter code from variant 1 on the ADSP2106X
- Chose the second format, as I thought the approach might be useful on the Hammerhead ADSP2116X in SIMD mode.
GOOD for SIMD?

8 Variant 1 -- Double loop using count++
Unexpected software loop increases overhead 2 cycles per loop
2 calculations in each loop
Average 5 cycles/calculation
VERY POOR OPTIMIZATION

9 Variant 2 -- using index [count + 1]
Very impressed in some ways
6 calculations in each loop
Average 2 cycles/calculation
OPTIMIZATION NO BETTER THAN ORIGINAL SINGLE LOOP EXAMPLE, BUT LOOKS EASY TO FURTHER REDUCE LOOP CYCLE COUNT AS COMPILER HAS PLACED VALUES IN CORRECT REGISTERS FOR PARALLEL OPS

10 Variant 2 -- using index [count + 1]
EASY TO REDUCE CYCLES AS COMPILER HAS PLACED VALUES IN CORRECT REGISTERS FOR PARALLEL SHARC OPS
FOR EXAMPLE
Moving pm(i13, m12) down one cycle allows a parallel operation
  F12=F0*F4, F1=F11+F12;
One cycle decrease already

11 Further speed improvement?
- By playing around with the code, I thought I could get the code down to 1 cycle per calculation.
- However, even with this simple code, I was not sure whether I was handling all the data dependencies correctly.
- This would be impossible with a larger code sequence.
- I therefore decided to move the code into Microsoft Project, which is a business scheduling tool, rather than write my own scheduler!
- Hence the tool SquishDSP

12 SquishDSP V1.0 reformatted the ADSP2106X code into something suitable for input into Microsoft Project. The reformatting process identified a few dependencies between instructions. It basically allowed Microsoft Project to work “by default”, knowing that the compiler had already “ordered things in a semi-reasonable way”. It worked extremely well, but a few instructions were out of place and had to be moved by hand. This was okay if you knew what to look for and what to expect, but it was unlikely to work on “long loops” or with hand-written custom code -- my specialty.
SquishDSP V2.0 identifies most of the dependencies before the code is submitted to Microsoft Project. The results with SquishDSP 2.0 are given here.
Contact Mike Smith walthamstow@shaw.ca for further information

13 Step 1 -- Develop the initial code -- process.c
- Notes
  - LOOP SIZE -- FIXED as a constant MAXSIZE and not a variable
  - Use of both DM and PM data busses in the “C” program.
  - Double loop of code with index registers.
  - This [count] then [count+1] form of double loop was chosen from several variants tried.

14 Step 2 -- Pass through VisualDSP++
- Note in “process.s” that the compiler has unrolled the loop further -- 6 calculations performed per loop
- Initially work with the “loop component” only in the next stages

15 Step 3A -- First Stages of SquishDSP
- Pass 1 -- Replace “commas” in an instruction that are not instruction separators. This was initially to get the code into a .CSV format but is currently retained as a reliable approach to prepare for Pass 2.
- Pass 2 -- Identify and break up all parallel instructions into single instructions, taking care of “local dependencies”; retain the original instructions
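
A minimal sketch of what Pass 1 and Pass 2 might look like, assuming that the commas which are not separators are exactly those inside parentheses, as in `dm(i4,m4)`. The function names and the `|` placeholder are hypothetical, not taken from SquishDSP, and the handling of “local dependencies” is not attempted here.

```python
def protect_inner_commas(line, placeholder="|"):
    """Pass 1 (sketch): hide commas that sit inside parentheses."""
    depth = 0
    out = []
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Only commas nested inside ( ) are replaced
        out.append(placeholder if ch == "," and depth > 0 else ch)
    return "".join(out)


def split_parallel(line, placeholder="|"):
    """Pass 2 (sketch): break one parallel instruction into single ones."""
    protected = protect_inner_commas(line.strip().rstrip(";"), placeholder)
    parts = [p.strip() for p in protected.split(",")]
    # Restore the hidden commas in each single instruction
    return [p.replace(placeholder, ",") for p in parts if p]
```

For example, `split_parallel("F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;")` yields the three single instructions `F13=F0*F4`, `r2=dm(i4,m4)` and `pm(i13,m12)=r1`.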

16 Step 3B -- First Stages of SquishDSP
- Pass 3 -- Add dependency information in a Microsoft Project compatible format
- Pass 4 -- Reformat into a totally Project compatible format, and “pretty format” to restore the original ADSP2106X style of syntax
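
One way to picture the dependency information of Pass 3 is as a list of predecessor task numbers for each single instruction. The sketch below is an assumption about the idea, not SquishDSP's actual rules: it looks only at data registers (treating `R` and `F` names as the same register), ignores index registers and memory, and conservatively makes every conflict a predecessor, even though a parallel SHARC instruction may read a register in the same cycle in which another operation overwrites it.

```python
import re

# Data registers only: r0..r15 / f0..f15, case-insensitive
REG = re.compile(r"\b[rf](\d+)\b", re.IGNORECASE)


def reads_writes(instr):
    """Return (registers read, registers written) for 'dest=source'."""
    dest, src = instr.split("=", 1)
    writes = {int(n) for n in REG.findall(dest)}
    reads = {int(n) for n in REG.findall(src)}
    return reads, writes


def predecessors(instrs):
    """For each instruction, list the 1-based numbers of earlier
    instructions it must follow."""
    info = [reads_writes(i) for i in instrs]
    preds = []
    for j, (reads_j, writes_j) in enumerate(info):
        found = []
        for i in range(j):
            reads_i, writes_i = info[i]
            # read-after-write, write-after-read, write-after-write
            if (writes_i & reads_j) or (reads_i & writes_j) or (writes_i & writes_j):
                found.append(i + 1)
        preds.append(found)
    return preds
```

With the instructions `r0=dm(i2,m1)`, `F13=F0*F4`, `pm(i13,m12)=r13` the result is `[[], [1], [2]]`: the multiply waits for the load, and the store waits for the multiply.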

17 Step 4A -- Input into “Microsoft Project”
- Select “txt -- Default Task Information” and import the file

18 Display in ‘non-leveled’ mode
- Select TOOLS | Resource Leveling | Clear Leveling -- Note the highly overused resources using SquishDSP V1.0

19 Display in ‘leveled’ mode
- Select TOOLS | Resource Leveling | Level Now -- Note the proper allocation of resources even when using SquishDSP 1.0

20 Step 4B -- Display in ‘non-leveled’ mode
- Select TOOLS | Resource Leveling | Clear Leveling -- Note there are now only a few overused resources, as Project has already been able to resolve most conflicts with SquishDSP 2.0

21 Step 4C -- Display in ‘leveled’ mode
- Select TOOLS | Resource Leveling | Level Now -- Note the proper rescheduling of resources

22 Step 4D -- Sort the tasks by “Start” date
- Click in the “Task Name” bar and select “Sort | Ascending | Start”

23 Step 4E -- Prepare “rescheduledproject.txt”
- Cut and paste “Task Name, Duration, Start” into a Notepad file

24 Steps inside Microsoft Project
- Input “microsoftproject.txt” using “txt -- Default Task Information”
- Select TOOLS | Resource Leveling | Clear Leveling -- Note the overused resources
- Select TOOLS | Resource Leveling | Level Now -- Note the proper allocation of resources
- Click “Task Name Bar” -- Select SORT | Ascending | START
- Cut and paste columns “Task Name, Duration, Start” into the Notepad file “rescheduledproject.txt”
- Tried saving the file directly from Project, then sorting the tasks by date etc. The Project interface was very clumsy for this type of file. (I don’t know how to access “.mpp” formatted files.) In addition, Project did a better job of SORT | Ascending | START

25 Step 5 -- Second Stage of SquishDSP
- Pass 6 performs the following operations
  - Based on the ‘Start date information’ from the Microsoft Project files, regroup instructions into parallel instructions
  - Check to see if the syntax of the registers is correct for parallel operations on the ADSP2106X
  - If the syntax is not correct, break up the instructions into valid instructions and send out appropriate error messages
- Correct syntax for parallel operations means
  - Post-modify using modify registers on all memory operations
  - Multiplication using registers R(0, 1, 2, 3) * R(4, 5, 6, 7)
  - Addition/Subtraction using R(8, 9, 10, 11) +/- R(12, 13, 14, 15)
  - Float and Integer data registers recognized as equivalent
  - Parallel + and - operations are not currently recognized as valid.
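
The regrouping and register check of Pass 6 can be sketched as follows. This is an illustration under stated assumptions rather than the tool's code: tasks arrive as hypothetical `(instruction, start_cycle)` pairs, only the multiplier and adder operand rules listed above are checked, and a group that fails the check is simply split back into single instructions without verifying that their order is still safe.

```python
import re

MUL = re.compile(r"^[rf]\d+=[rf](\d+)\*[rf](\d+)$", re.IGNORECASE)
ADD = re.compile(r"^[rf]\d+=[rf](\d+)[+-][rf](\d+)$", re.IGNORECASE)


def parallel_operands_ok(instr):
    """Check the operand register ranges required in a parallel instruction."""
    text = instr.replace(" ", "")
    m = MUL.match(text)
    if m:  # R(0..3) * R(4..7)
        return int(m.group(1)) in range(0, 4) and int(m.group(2)) in range(4, 8)
    m = ADD.match(text)
    if m:  # R(8..11) +/- R(12..15)
        return int(m.group(1)) in range(8, 12) and int(m.group(2)) in range(12, 16)
    return True  # other instruction forms are not checked in this sketch


def regroup(tasks):
    """Group instructions that share a start cycle into one source line."""
    by_start = {}
    for instr, start in tasks:
        by_start.setdefault(start, []).append(instr)
    lines = []
    for start in sorted(by_start):
        group = by_start[start]
        if len(group) > 1 and not all(parallel_operands_ok(g) for g in group):
            # Invalid register use: fall back to one instruction per line
            lines.extend(g + ";" for g in group)
        else:
            lines.append(", ".join(group) + ";")
    return lines
```

Two tasks `F12=F0*F4` and `F1=F11+F12` that share a start cycle come back as the single line `F12=F0*F4, F1=F11+F12;`, whereas `F13=F0*F2` paired with `F9=F11+F12` is split because `F2` is not in the R(4..7) group.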

26 Step 5 -- Second Stage of SquishDSP
- Original code was a loop of 12 cycles
- This one is 8 cycles
- “Original” code available for checking

27 Some side issues
- You can model different processor architectures quite easily
  - Suppose you have single cycle addition but double cycle multiplication. Simply set the task duration for each use of the MULTIPLIER to 2.
- Adjustments to Microsoft Project -- Fine detail
  - Set to “Don’t split tasks to allow activities to occur on different days”. Not applicable at the moment.
  - Other “fine details” to come
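
The effect of changing a task duration can be seen with a very small greedy leveler. This is a stand-in written for illustration, not Microsoft Project's leveling algorithm: each resource handles one task at a time, tasks are placed in program order at the earliest cycle that satisfies their predecessors and finds the resource free, and the task list in `example` is hypothetical.

```python
def level(tasks):
    """Greedy resource leveling.

    tasks: list of (name, resource, duration, predecessor_indexes), listed
    in program order so that predecessors always come first.
    Returns the start cycle chosen for each task.
    """
    busy = {}  # resource name -> set of cycles already taken
    start, finish = [], []
    for _name, resource, duration, preds in tasks:
        # Earliest cycle allowed by the predecessors
        cycle = max((finish[p] for p in preds), default=0)
        taken = busy.setdefault(resource, set())
        # Slide later until the resource is free for the whole duration
        while any(cycle + k in taken for k in range(duration)):
            cycle += 1
        taken.update(range(cycle, cycle + duration))
        start.append(cycle)
        finish.append(cycle + duration)
    return start


def example(multiply_cycles):
    """Two loads, two multiplies and one add (hypothetical task list)."""
    return [
        ("load x", "DM", 1, []),
        ("load y", "PM", 1, []),
        ("multiply A", "MULTIPLIER", multiply_cycles, [0, 1]),
        ("multiply B", "MULTIPLIER", multiply_cycles, []),
        ("add", "ALU", 1, [2, 3]),
    ]
```

With single cycle multiplication, `level(example(1))` places the add at cycle 2; with `level(example(2))` the two multiplies can no longer share the multiplier back to back and the add moves out to cycle 5.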

28 Current Approach to Optimization
Original starting code
For (count < N)
Begin
  1; 2; 3; 4; 5; 6;
End Loop
Optimized code
For (count < N)
Begin
  1, 2A; 2B, 3A; 3B, 4; 5, 6;
End Loop

29 Alternate source code -- larger loops
Approach 1
For (count < N / 2)
Begin
  1; …... 5; 6;
  1; ….. 5; 6;
End Loop
- May lead to more parallel instructions in the ‘middle’ of the new, longer loop
- May lead to “running out of program memory” on the ADSP2106X if the DSP algorithm code is long. (Not just this code is in memory!)
- Variation needed if N is not a multiple of 2

30 Double Loop with N != 2 * p
  F1=F11+F12, r0=dm(i2,m1);
  F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;
  .....
  F8=F11+F13;
  F12=F2*F4, pm(i12,m9)=r8;
  lcntr=10, do(pc,_L$2066012-1)until lce;
_L$2000019:
  F1=F11+F12, r0=dm(i2,m1);
  F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;
  .....
  F8=F11+F13;
  F12=F2*F4, pm(i12,m9)=r8;
  F1=F11+F12, r0=dm(i2,m1);
  F13=F0*F4, r2=dm(i4,m4), pm(i13,m12)=r1;
  .....
  F8=F11+F13;
  F12=F2*F4, pm(i12,m9)=r8;
  //end loop _L$2000019; -- end double loop
_L$2066012:

31 Adjust ‘lcntr’ values
- In this example, the lcntr value was originally 21.
- We must use lcntr = 10 for the new double loop, and cut and paste one copy of the original loop body outside the new loop, to ensure that the total overall loop count is valid.
- You can now see why the task of developing an optimizing compiler is not trivial.
  - The optimizing compiler must be able to handle the general case reliably!
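
The lcntr adjustment is a quotient and remainder calculation: 21 passes become 10 passes of the doubled body plus 1 copy of the original body outside the loop. A small sketch, with a hypothetical helper name:

```python
def unrolled_loop_count(lcntr, factor=2):
    """Return (new lcntr for the unrolled loop, leftover copies of the
    original body that must sit outside the loop)."""
    return divmod(lcntr, factor)


new_lcntr, leftover = unrolled_loop_count(21)
# The total number of original passes must be unchanged
assert 2 * new_lcntr + leftover == 21
```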

32 Double loop re-optimized

33 Optimization results
- Original code -- loop of 12 cycles with 6 sets of operations per loop
  - Rescheduled -- loop of 8 cycles with 6 sets of operations per loop -- saving of 33% of the time
- Double original loop -- loop of 24 cycles with 12 sets of operations per loop
  - Rescheduled -- loop of 14 cycles with 12 sets of operations per loop -- saving of 42% of the time
  - Overall code length 20 cycles (14 in loop and 6 outside)

34 Source code re-arrangement
- We can identify that some of the internal stages of the new rescheduled code are running totally parallel -- 4 operations per instruction.
- This suggests that rescheduling the loop operations will allow the generation of a highly efficient loop.
- Rescheduling the loop means bringing instructions out of the loop and delaying all write operations until late in the loop
  - To ensure accurate rearrangement of the code, perhaps we should change the priorities on the “pm” Microsoft Project tasks to be “As Late as Possible” rather than move them by hand, as was done in this example.
- Note that the compiler has already done some moving

35 Alternate starting points
Approach 2
1; 2;
For (count < N)
Begin
  3; 4; 5; 6; 1; 2;
End Loop
Possible adjustment of index registers
- Valid approach if instructions 1 and 2 do not make any “permanent changes”.
- No “permanent changes” means no WRITING to external memory
- May require adjustment to registers after the loop because of the extra instructions -- particularly index registers that are post-modified.
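
Approach 2 can be illustrated by rotating a loop whose first instructions only read. In the sketch below the body is reduced to one read (standing in for instructions 1; 2) and one compute-and-write step (standing in for 3 to 6); the function names and the gain calculation are assumptions made for the illustration. It shows both conditions from the slide: the hoisted instructions make no permanent change, and the post-modified index ends one step too far and has to be corrected after the loop.

```python
def straight(x, gain):
    """Original order: read, then compute and write, in every pass."""
    out = []
    for count in range(len(x)):
        value = x[count]          # instructions 1; 2 (read only)
        out.append(gain * value)  # instructions 3..6 (compute and write)
    return out


def rotated(x, gain):
    """Rotated order: the read for the next pass ends each pass."""
    n = len(x)
    # The one extra read after the last pass must land on readable memory
    padded = x + [0.0]
    out = []
    index = 0
    value = padded[index]   # 1; 2 hoisted ahead of the loop
    index += 1              # post-modify
    for _ in range(n):
        out.append(gain * value)  # 3..6
        value = padded[index]     # 1; 2 for the following pass
        index += 1                # post-modify
    index -= 1  # undo the extra post-modify caused by the hoisted read
    return out, index
```

Both versions write the same results; the rotated one performs one extra read whose value is never used.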

36 Removed code from loop till first “write operation”

37 Moved 3 pm( ) write operations later in loop
These can now be moved outside the loop

38 How many instructions to move?
- It is very easy to make minor changes to the original code “process.s” open in a NotePad window, save the file, rerun the tool and quickly bring the file into Microsoft Project for examination.
- It turned out that bringing “just two” instructions out of the loop was the best solution.

39 Optimum loop configuration

40 Optimum result -- 1 cycle per calculation -- Double VisualDSP++ speed
- This loop is now just 6 cycles for 6 calculations
- Speed improvement will be very algorithm dependent

41 Savings are very algorithm dependent
- Original code -- loop of 12 cycles with 6 sets of operations per loop
  - Rescheduled -- loop of 8 cycles with 6 sets of operations per loop -- saving of 33% of the time
- Double original loop -- loop of 24 cycles with 12 sets of operations per loop
  - Rescheduled -- loop of 14 cycles with 12 sets of operations per loop -- saving of 42% of the time
  - Overall code length 20 cycles (14 in loop and 6 outside)
- Original code with 2 instructions extracted -- loop of 12 cycles with 6 sets of operations
  - Rescheduled -- loop of 6 cycles with 6 sets of operations per loop -- saving of 50% of the time -- processor at maximum pipeline capability.
  - Overall code length 8 cycles (6 in loop and 2 outside)
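
The percentages quoted above follow directly from the loop cycle counts. A short check, using a hypothetical helper name and the figures from the slide:

```python
def saving_percent(cycles_before, cycles_after):
    """Loop time saved, as a whole-number percentage of the original."""
    return round(100 * (cycles_before - cycles_after) / cycles_before)


# (cycles before, cycles after, sets of operations per loop)
cases = {
    "original loop": (12, 8, 6),
    "double loop": (24, 14, 12),
    "2 instructions extracted": (12, 6, 6),
}
for name, (before, after, operations) in cases.items():
    print(name, saving_percent(before, after), "% saved,",
          round(after / operations, 2), "cycles per calculation")
```

The three cases work out to savings of 33%, 42% and 50%, or about 1.33, 1.17 and 1 cycle per calculation.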

42 Real life is not as simple as this
- Loops from the optimizing compiler already have instructions inside and outside the loop

43 Real “final” source code (without stack operations)
- Code has been adjusted for the original instructions outside the loop

44 Final Output

45 Conclusions 1
- Useful with the critical inner loops of DSP algorithms
- Handling Cache Conflicts
  - Come to the Cache-DSP talk tomorrow morning
  - Use of Primavera PV3 tool with special macros
- How to handle instructions inside the delay slots of jump instructions (especially conditional instructions with other parallel instructions)?

46 Conclusions -- 2
- In SquishDSP we have a simple tool that appears to do a good job of further optimizing the output from the current version of VisualDSP++.
- Even when the equivalent features are added into a later version of VisualDSP++, SquishDSP will still be useful for optimizing “hand-code”
- Further work means more testing on
  - Is the tool “really” doing the job we think it is, or is it missing vital dependencies?
  - Does it give back something useful for larger source files?
  - Can we remove the dependency on the intermediate stage using Microsoft Project? The GUI interface is very useful.

47 Acknowledgements
- Financial support of the Natural Sciences and Engineering Research Council (NSERC) of Canada and the University of Calgary
- Financial support from Analog Devices. Dr. Mike Smith is ADI University Professor 2001/2002
- Future financial support from the Alberta Provincial Government through the Alberta Software Engineering Research Consortium (ASERC)

48 For further information on this ADSP2106X utility
Contact -- Dr. Mike Smith
walthamstow@shaw.ca
smithmr@ucalgary.ca

