Systematic development of programs with parallel instructions SHARC ADSP2106X processor M. Smith, Electrical and Computer Engineering, University of Calgary,


1 Systematic development of programs with parallel instructions SHARC ADSP2106X processor M. Smith, Electrical and Computer Engineering, University of Calgary, Canada smithmr@ucalgary.ca

2 6/2/2015 ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061 Copyright smithmr@ucalgary.ca 2 / 37
To be tackled today
  What’s the problem?
  Standard Code Development of “C”-code
  Process for “Code with parallel instruction”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort

3 ADSP-2106x -- Parallelism opportunities
Ability for parallel memory operations
  One each on the pm, dm and instruction cache busses
Memory pointer operations
  Post-modify 2 index registers
  Automatic circular buffer operations
  Automatic bit-reverse addressing
Many parallel operations and register-to-register bus transfers
  Rn = Rx + Ry or Rn = Rx * Ry
  Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
Zero-overhead loops
Instruction pipeline issues
Key issue -- Only 48? bits available in OPCODE to describe
  16 data registers in 3 destinations and 6 sources = 135 bits
  2 * (8 index + 8 modify + 16 data) = 64 bits
  Condition code selection, 32 bit constants etc.

4 Basic code development -- any system
Write the “C” code for the function
  void Convert(float *temperature, int N)
which converts an array of temperatures measured in “Celsius” (Canadian Market) to Fahrenheit (American Market)
Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices

5 Parallel Instruction Code Development
Write the 21k assembly code for the function
  void Convert(float *temperature, int N)
which etc.
Determine the instruction flow through the architecture using a resource usage diagram
Theoretically optimize the code -- a 2 minute counting process
Compare and contrast the amount of time to perform the subroutine before and after customization.

6 Standard “C” code
void Convert(float *temperature, int N) {
    int count;
    for (count = 0; count < N; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}
Standard Warning -- What does an optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
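The warning about 9 / 5 can be tried out directly in C. The sketch below is not from the slides; the three function names are made up for illustration. It compares the slide's left-to-right expression, a version with the integer ratio in parentheses, and a version with an explicit float constant.

```c
#include <assert.h>

/* As written on the slide: (c * 9) is already a float, so "/ 5" is a
   float division and the scale factor is effectively 1.8. */
float c_to_f_left_to_right(float c) { return c * 9 / 5 + 32; }

/* With parentheses, 9 / 5 is an integer division and evaluates to 1,
   so the temperature is only offset by 32. */
float c_to_f_integer_ratio(float c) { return c * (9 / 5) + 32; }

/* Explicit float constant, which is what the assembly versions use. */
float c_to_f_float_constant(float c) { return c * 1.8f + 32.0f; }

/* The two constant expressions side by side. */
void check_constant_ratios(void) {
    assert(9 / 5 == 1);            /* integer division truncates */
    assert(9.0f / 5.0f > 1.79f);   /* float division keeps the fraction */
}
```

For 100 degrees Celsius the parenthesized integer version returns 132 rather than 212, which is the kind of silent error the warning is pointing at.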

7 Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

8 21061-style load/store “C” code
void Convert(register float *temperature, register int N) {
    register int count;
    register float *pt = temperature;
    register float scratch;
    for (count = 0; count < N; count++) {
        scratch = *pt;
        scratch = scratch * (9.0f / 5.0f);   // float constants: (9 / 5) is integer division and gives 1
        scratch = scratch + 32;
        *pt = scratch;
        pt++;
    }
}

9 Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

10 Straight conversion -- PROLOGUE
// void Convert( reg float *temperature, reg int N ) {
.segment/pm seg_pmco;
.global _Convert;
_Convert:
// register int count = GARBAGE;
#define countR1 scratchR1
// register float *pt = temperature;
#define pt scratchDMpt
    pt = INPAR1;
// float scratch = GARBAGE;
#define scratchF2 F2
// For the CURRENT code -- no volatile
// registers are needed -- may not remain true

11 Straight conversion of code
// for (count = 0; count < N; count++) {
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
// scratch = *pt;
    scratchF2 = dm(0, pt);                // Not ++ as pt re-used
// scratch = scratch * (9 / 5);
// INPAR1 (R4) is dead -- can reuse as F4
#define constantF4 F4                     // Must be float
    constantF4 = 1.8;                     // No division, Use register constant
    scratchF2 = scratchF2 * constantF4;
// scratch = scratch + 32;
#define F0_32 F0                          // Must be float
    F0_32 = 32.0;                         // NOT F0 = 32, which gives F0 = 1 * 10^-45
    scratchF2 = scratchF2 + F0_32;
// *pt = scratch; pt++;
    dm(pt, 1) = scratchF2;
LOOP_END:
    5 magic lines of code

12 Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop

13 Speed rules
IF you want adds and multiplies to occur on the same line
  F1 = F2 * F3, F4 = F5 + F6;
Want to do as a single instruction
  Not enough bits in the opcode
  Register description 4 + 4 + 4 + 4 + 4 + 4 (bits)
  Plus bits for describing math operations, conditions and memory ops?
  Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
  Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
Must rearrange register usage with program code for this to be possible
  Register description 4 + 2 + 2 + 4 + 2 + 2 (bits) -- other bits “understood”
Inconvenient rather than limiting, e.g.
  F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
  Not accepted: F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
  Not accepted: F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
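The source-register rule on this slide is easy to check mechanically before writing any parallel instruction. The following C sketch is an illustration only: the function names are invented here, registers are represented by their numbers 0 to 15, and only the source ranges and the bit counts quoted on the slide are modelled. The ordering rule (multiply written before the add) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Register numbers 0..15 stand for F0..F15. */
static bool in_range(int reg, int lo, int hi) { return reg >= lo && reg <= hi; }

/* Multiplier sources: first operand in F0-F3, second in F4-F7. */
bool mul_sources_ok(int fx, int fy) {
    return in_range(fx, 0, 3) && in_range(fy, 4, 7);
}

/* ALU sources: first operand in F8-F11, second in F12-F15. */
bool alu_sources_ok(int fx, int fy) {
    return in_range(fx, 8, 11) && in_range(fy, 12, 15);
}

/* Fn = Fmx * Fmy, Fm = Fax + Fay: destinations may be any register. */
bool multifunction_ok(int mul_dst, int mul_x, int mul_y,
                      int alu_dst, int alu_x, int alu_y) {
    assert(in_range(mul_dst, 0, 15) && in_range(alu_dst, 0, 15));
    return mul_sources_ok(mul_x, mul_y) && alu_sources_ok(alu_x, alu_y);
}

/* Bits spent on register fields: six 4-bit fields when unrestricted,
   4 + 2 + 2 + 4 + 2 + 2 when each source is one of four registers. */
int register_field_bits(bool restricted) {
    return restricted ? 4 + 2 + 2 + 4 + 2 + 2 : 6 * 4;
}
```

With these helpers, F6 = F0 * F4, F7 = F8 + F12 passes, while F6 = F4 * F0 fails because the multiplier operands are in the wrong groups. The restriction saves 8 bits of register description (24 down to 16).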

14 When should we worry about the register assignment?
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);           // Not ++ as to be re-used
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                     // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
#define F0_32 F0                          // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, plus1DM) = scratchF2;
LOOP_END:

15 Check on required register use
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
        // Are there special requirements here on F2 -- becomes source later??
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                     // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
        // Fn = F(0,1,2 or 3) * F(4,5,6 or 7)
#define F0_32 F0                          // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
        // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;

16 Register re-assignment -- Step 1
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2                      // OKAY
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                     // Must be float -- OKAY
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;   // SOURCES okay here
        // Fn = F(0,1,2 or 3) * F(4,5,6 or 7)
#define F0_32 F0                          // Must be float
    F0_32 = 32.0;                         // WRONG to use F0 here -- ADDITION
    scratchF2 = scratchF2 + F0_32;        // WRONG to use F2 as DEST early
        // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;          // OKAY

17 Register re-assignment -- Step 2
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                     // Must be float
    constantF4 = 1.8;
    scratchF8 = scratchF2 * constantF4;   // answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                        // INPAR3 is available
    F12_32 = 32.0;
    scratchF2 = scratchF8 + F12_32;
        // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;

18 Fix poor coding practice -- “C” or assembly
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                     // Must be float
    constantF4 = 1.8;                     // MOVE OUTSIDE LOOP
    scratchF8 = scratchF2 * constantF4;   // answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                        // INPAR3 is available
    F12_32 = 32.0;                        // MOVE OUTSIDE LOOP
    scratchF2 = scratchF8 + F12_32;
        // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;
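The same fix can be seen in “C” form. This is a sketch rather than the author's code: the function name convert_hoisted and the variable names are chosen here to mirror the register names on the slide, with both constants set up once before the loop and the product kept in its own variable (standing in for F8).

```c
#include <assert.h>
#include <stddef.h>

/* Celsius to Fahrenheit in place, with loop-invariant constants hoisted. */
void convert_hoisted(float *temperature, int n) {
    assert(n == 0 || temperature != NULL);
    const float constant_f4 = 1.8f;   /* set once, outside the loop */
    const float f12_32 = 32.0f;       /* set once, outside the loop */
    float *pt = temperature;
    for (int count = 0; count < n; count++) {
        float scratch_f2 = *pt;                       /* read   */
        float scratch_f8 = scratch_f2 * constant_f4;  /* multiply, result in "F8" */
        scratch_f2 = scratch_f8 + f12_32;             /* add    */
        *pt++ = scratch_f2;                           /* write, then advance */
    }
}
```

The loop body is now exactly the four operations (read, multiply, add, write) that the later resource charts count.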

19 Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop

20 Resource Management -- Chart 1 -- Basic code
[Resource usage chart of the basic loop, ending at LOOP_END]
In theory -- if we could find out how to do *, + and dm in parallel
  DATA-BUS (dm) is the limiting resource
  2 cycle loop possible
Before proceeding -- Is a 2 cycle loop needed? Is a 2 cycle loop enough?
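The 2 cycle figure follows from counting how often each resource is needed per loop pass. The C sketch below assumes, as a simplification not stated on the slide, that each resource (dm bus, multiplier, ALU) can start one operation per cycle; the struct and function names are invented for illustration.

```c
#include <assert.h>

/* How many times one loop pass needs each resource. */
struct resource_use {
    int dm_accesses;   /* reads plus writes on the dm data bus */
    int multiplies;    /* multiplier operations */
    int alu_ops;       /* additions or subtractions */
};

/* Lower bound on cycles per pass: the busiest resource sets the limit. */
int min_cycles_per_pass(struct resource_use use) {
    assert(use.dm_accesses >= 0 && use.multiplies >= 0 && use.alu_ops >= 0);
    int bound = use.dm_accesses;
    if (use.multiplies > bound) bound = use.multiplies;
    if (use.alu_ops > bound) bound = use.alu_ops;
    return bound;
}
```

For the temperature loop, one pass needs one read and one write on the dm bus, one multiply and one add, so the bound is 2 cycles and the dm bus is the resource that sets it.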

21 Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
  Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop

22 Resource 2 -- unroll the loop -- 5 times here
Each pass through the loop involves
  Read
  Multiply
  Add
  Write

23 Resource Management 3 -- identify resource usage during decode and writeback stages of each instruction
Model used -- depends on where operands are relative to equals sign
  ‘Reading’ -- fetching things for ALU/FPU -- Like 68K decode phase
  ‘Writeback’ -- storing results from ALU/FPU
THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’

24 Resource Management 4
Check what can be moved in parallel with other instructions
  OKAY TO MOVE -- F2 src freed up before F2 dest occurs
  OKAY TO MOVE -- Empty spot if can move * and + instructions which this instruction MUST follow
  NO !!! or just possible NO? Why a problem?

25 Memory resource availability
Move up F2 = dm(pt, ZERODM) from second loop into first loop
However now we have a possible conflict about which F2 should be used for the dm(pt, plus1DM) = F2 instruction if we further optimize by trying to fill the other empty delay slots -- see next slide

26 Resource management
Overlapping two parts of the loop

27 Resource Management 5 -- What’s up, Doc?
Attempting to fill all unused resource availability
Why spend time on simulating algorithm to see if problem really exists when there is a simple solution -- use different registers
Problem may/may not exist with this simple example but very likely to exist in more complex algorithm

28 Resource 6 -- Solution -- Save and then use F9
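The reason for bringing in F9 can be illustrated in “C”. In the sketch below (an illustration only; the function name and the exact overlap are assumptions, not the schedule from the resource chart), the read for element i + 1 is issued before the write for element i, so the value waiting to be written must live in a different variable from the one being reloaded.

```c
#include <assert.h>
#include <stddef.h>

/* Celsius to Fahrenheit in place, with the next read overlapped
   with the current add and write. */
void convert_overlapped(float *temperature, int n) {
    assert(n == 0 || temperature != NULL);
    if (n <= 0) return;
    const float constant_f4 = 1.8f;
    const float f12_32 = 32.0f;
    float f2 = temperature[0];               /* fill: first read */
    for (int i = 0; i < n - 1; i++) {
        float f8 = f2 * constant_f4;         /* multiply element i */
        f2 = temperature[i + 1];             /* early read of element i + 1 reuses "F2" */
        float f9 = f8 + f12_32;              /* sum goes to "F9", so F2 is not clobbered */
        temperature[i] = f9;                 /* write element i */
    }
    temperature[n - 1] = f2 * constant_f4 + f12_32;   /* empty: last element */
}
```

Had the sum been written back into f2, the early read would either be overwritten or the wrong value would be stored, which is the conflict described on the previous slides.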

29 Resource Management 7 -- Some parallelism possible with Read, Mult, Add and Write mixed across 5 loop components
Problem 1 -- No resource in maximum usage -- code inefficient
Problem 2 -- Worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size is 2048 or more?

30 WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD as we are no longer indexing correctly through the data.
Problem 1 -- No resource in maximum usage -- code inefficient
Problem 2 -- Worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size is 2048 or more?

31 Resource Management 8
Unroll the loop a bit more -- 9 loop components
DM BUS USAGE NOW MAXed OUT (after a while)
CODE PATTERN APPEARING

32 Resource Management 9
Identify the loop components
  FILL ALU/FPU PIPE
  LOOP BODY
  EMPTY ALU/FPU PIPELINE

33 Resource 9 -- Final code version
[Resource usage chart of the final code, loop ending at LOOPEND; sections labelled FILL, USE and EMPTY ALU/FPU PIPE]

34 Speed improvements
BEFORE
  START LOOP EXIT ENTRY
  4 + N * 4 + 5 + 5 = 14 + 4 * N
NOW with 2-fold loop unfolding
  START LOOP EXIT ENTRY
  4 + 7 + (N - 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N
NOW with 3-fold loop unfolding
  START LOOP EXIT ENTRY
  4 + 5 + (N - 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N
Factor of 4 / 2.5 with a little effort -- Factor of 4 / 2 with more effort
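The three cycle counts are simple enough to evaluate for a concrete N. The C sketch below copies the formulas from the slide term by term; the function names are made up, and the divisibility checks are an assumption added here so that the integer arithmetic matches the formulas exactly.

```c
#include <assert.h>

/* Original loop: 4 + N * 4 + 5 + 5 = 14 + 4 * N cycles. */
long cycles_basic(long n) {
    assert(n >= 0);
    return 4 + n * 4 + 5 + 5;
}

/* 2-fold unfolding: 4 + 7 + (N - 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N. */
long cycles_unfold2(long n) {
    assert(n >= 2 && (n - 2) % 2 == 0);   /* assumed: whole number of loop passes */
    return 4 + 7 + (n - 2) * 5 / 2 + 5 + 8 + 5;
}

/* 3-fold unfolding: 4 + 5 + (N - 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N. */
long cycles_unfold3(long n) {
    assert(n >= 2 && (n - 2) % 3 == 0);   /* assumed: whole number of loop passes */
    return 4 + 5 + (n - 2) * 6 / 3 + 5 + 1 + 5;
}
```

For N = 2048 these give 8206, 5144 and 4112 cycles, so the fixed setup cost is negligible at that size and the speedups are close to the quoted factors of 4 / 2.5 and 4 / 2.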

35 Question to Ask
We now know the final code
Should we have made the substitution F2 to F9?
  Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
  No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!
Will the code work?

36 Resource 9 -- Final code version
[Resource usage chart of the final code, loop ending at LOOPEND]
N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
Only works if (N - 2) / 3 is an integer.
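The restriction on N can be turned into a small check. This C sketch is one possible way to handle it, not something shown on the slides: the function names are invented, and the idea of passing leftover elements to the simple loop is an assumption. Whether N = 2 (zero passes of the hardware loop) is actually safe depends on how the loop counter treats a zero count, which the slide does not say.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the 3-fold code can process exactly n elements:
   2 elements in the fill/empty code plus a multiple of 3 in the loop. */
bool fits_final_loop(int n) {
    return n >= 2 && (n - 2) % 3 == 0;
}

/* Largest element count not exceeding n that the 3-fold code can take.
   The remaining n - result elements would need the simple loop. */
int largest_fitting_count(int n) {
    assert(n >= 0);
    if (n < 2) return 0;
    return n - (n - 2) % 3;
}
```

Of the values listed on the slide, only N = 2, 5, 8, 11 and 14 satisfy the condition; for N = 13, for example, 11 elements fit and 2 are left over.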

37 Tackled today
  What’s the problem?
  Standard Code Development of “C”-code
  Process for “Code with parallel instruction”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort
To come -- Tutorial practice of parallel coding
To come -- Optimum FIR filter with parallelism

