 Send in audio signals and use a sharp FIR filter to pick out 42 Hz and 59 Hz signals and send out warning tones
◦ Try an FIR filter of 256 taps, down-sample, and then use a second FIR filter of 256 taps. This is equivalent to one FIR filter of 256 * 256 taps with a bandwidth of 96000 / (256 * 256) Hz, about 1.5 Hz
◦ Use code from Lab 0, Lab 1 and Assignment 1 as much as possible
 Develop a C++ version (show that it fails unless the code is optimized) - Assignment 1
 Modify your Lab 1 assembly code to demonstrate (test and audio) the speed improvement for the following steps
◦ 1) Software loop to hardware loop
◦ 2) Parallel dm, pm access; don't unroll the loop
◦ 3) Parallel dm, pm access; unroll the loop 4 times. Don't move code outside the loop; do the dm, pm accesses in parallel with multiply instructions
◦ 4) Parallel dm, pm access; unroll the loop 4 times. Don't move code outside the loop; do the dm, pm accesses in parallel with multiply and add instructions
 Remember to provide a resource chart and compare your timing to the expected timing

 Can the processor meet the requirements?
 Two forms of the code: which one is needed?
◦ Grab one audio value: process everything before the next individual audio sample arrives
◦ Grab one audio block: collect the next audio block while processing the last audio block, and finish before the next block has been collected
 Real life, worst case
◦ Each channel needs two 256-tap FIR filters
◦ Total channels: 42 Hz plus harmonics, 19 Hz plus harmonics (19 * 3 = 57 Hz), say 8 channels
◦ Need to generate audio warning signals
◦ Modify the FIR filter coefficients to follow the signals, which might not be at a constant frequency
 Do the best-case timing analysis to see whether the algorithm works

 Correlation measures the similarity between one signal and another, and at what locations the similarity occurs
 Have a heart beat signal 000ABcD0000
 Have a signal from a running patient 00000000ABcD0000000ABcD0000000ABcD0000
 Use 0000DcBA000 (the heart beat signal time-reversed) as coefficients in an FIR filter. This slides the heart beat along the patient signal:
00000000ABcD0000000ABcD0000000ABcD0000
000ABcD0000 -- minimum filter output
   000ABcD0000 -- some output
     000ABcD0000 -- max output
       000ABcD0000 -- less output
                000ABcD0000 -- max again
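The sliding comparison above can be sketched in C++. This is a minimal illustration, not course code: the function names `slidingCorrelation` and `bestLag` are hypothetical, and the numeric values standing in for A, B, c and D (1, 2, -1, 3) are invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Slide `pattern` along `signal` and produce one correlation value per
// alignment. This is the same arithmetic as an FIR filter whose
// coefficients are the time-reversed pattern.
std::vector<double> slidingCorrelation(const std::vector<double>& signal,
                                       const std::vector<double>& pattern) {
    std::vector<double> out;
    if (pattern.empty() || signal.size() < pattern.size()) return out;
    for (std::size_t lag = 0; lag + pattern.size() <= signal.size(); ++lag) {
        double sum = 0.0;
        for (std::size_t k = 0; k < pattern.size(); ++k) {
            sum += signal[lag + k] * pattern[k];
        }
        out.push_back(sum);
    }
    return out;
}

// Alignment with the largest output (the first one if there is a tie).
std::size_t bestLag(const std::vector<double>& corr) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < corr.size(); ++i) {
        if (corr[i] > corr[best]) best = i;
    }
    return best;
}
```

With a signal that contains the pattern twice, the output peaks at each position where the pattern lines up, which is the "max output" and "max again" behaviour described on the slide.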

 Draw a picture of the situation
 Known signal sent to ultrasound transmitter A
 Noisy signal picked up at receiver B
◦ Do correlation (strictly a cross-correlation of the received signal with the known signal) to get the best estimate of the delay
 Known signal sent to ultrasound transmitter B
 Noisy signal picked up at receiver A
◦ Do correlation to get the best estimate of the delay
◦ Differences in the delay times are related to the speed of the air flow in the mine shaft

 Simplest step up from doing examples exactly the same as the lab examples
 Many standard formats
 Complex array: real and imaginary components
◦ Components stored alternately in memory: R1, I1, R2, I2, R3, I3 ... Access using dm(IX, MdmX) where MdmX = 2
◦ Components stored in alternate blocks: R1, R2, R3, ... I1, I2, I3 ... Access using dm(I1X, MdmP1) and dm(I2X, MdmP1), or access using dm(IdmX, MdmP1) and pm(IpmX, MpmP1), where MdmP1 and MpmP1 are set to +1 by the compiler
 Speed depends on the format used and what you are doing with the values
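The two storage formats can be compared with a small C++ sketch. The type `SplitComplex` and the function `deinterleave` are hypothetical names used only for illustration; stepping the index by 2 plays the role of the modify value of 2 mentioned above.

```cpp
#include <cstddef>
#include <vector>

// Split (block) layout: all real parts in one array, all imaginary parts
// in another, so each array can be walked with a step of +1.
struct SplitComplex {
    std::vector<float> re;
    std::vector<float> im;
};

// Convert the interleaved layout R1, I1, R2, I2, ... into the split layout.
// The interleaved array is stepped through two elements at a time.
SplitComplex deinterleave(const std::vector<float>& interleaved) {
    SplitComplex out;
    for (std::size_t i = 0; i + 1 < interleaved.size(); i += 2) {
        out.re.push_back(interleaved[i]);
        out.im.push_back(interleaved[i + 1]);
    }
    return out;
}
```

The split layout is the one that lets the real array sit in dm and the imaginary array in pm, which is the arrangement the later slides plan to use.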

complex CalculateComplexCorrelation(complex firstArray[ ], int numPts, int offset) {
    complex correlation = 0 + j0;   // Missing piece of code
    for (int k = 0; k < numPts - offset; k++) {
        // Could be other forms of the algorithm
        // This is more "autocorrelation": comparing the signal to itself
        // Would work best when the information of interest is in the centre of the signals
        correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    }
    return correlation;
}
Repeat many times along firstArray for different offsets
 Auto-correlation, cross-correlation and convolution are all equivalent to FIR operations where the FIR coefficients are data values rather than fixed values
 // How do you return a complex value? Don't know
// Two choices: in R0 (real part) and R1 (imaginary part)
 // More likely (another exam): switch to SIMD mode and use R0 and S0
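For checking an assembly translation against known answers, the pseudocode above can be written as compilable C++. This is a sketch that assumes `std::complex<float>` stands in for the slide's `complex` type and `std::conj` for `Conjugate`; the lower-case function name is my own.

```cpp
#include <complex>
#include <vector>

// Reference version of the correlation loop from the slide, using the
// standard library complex type in place of the slide's `complex`.
std::complex<float> calculateComplexCorrelation(
        const std::vector<std::complex<float>>& firstArray,
        int numPts, int offset) {
    std::complex<float> correlation(0.0f, 0.0f);
    for (int k = 0; k < numPts - offset; ++k) {
        correlation += firstArray[k] * std::conj(firstArray[k + offset]);
    }
    return correlation;
}
```

For the three values 1 + j2, 3 - j1 and 0 + j1 with an offset of 1, the two products are 1 + j7 and -1 - j3, so the function returns 0 + j4.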

 There is absolutely no point trying to optimize a loop that calls a subroutine / function
◦ The cost of setting up the subroutine call (handling incoming parameters and return values) and jumping in and out of the subroutine dominates the loop time
 The question reminded you of this
◦ Assume that the Conjugate function is in-lined for speed
◦ That means you need to go and write out the equation with the in-lined code

 Enter and exit CalculateCorrelation( ): 20 cycles
 Set up pointers, inpar_Rx to Ix: 30 cycles
 Set up and use hardware loop: 20 cycles
 Set up sum: < 10 cycles
 So basically the timing is (numPts - offset) * loop body cycle count
 Write out the loop body with the conjugate in-lined:
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
realCorrelation + j imagCorrelation = realCorrelation + j imagCorrelation + (a + jb) * (c - jd) -- second value is read in as c + jd
RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)

RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)
Means two sets of calculations
    RC_RX0 = RC_RX0 + a*c + b*d          (RX0 does not mean R0)
and
    IC_RX1 = IC_RX1 + (-a*d + c*b)
Looks like 8 memory accesses per tap (point): fetch a, b, c, d TWICE
Actually could optimize to 4 fetches and reuse a, b, c, d IF there are enough registers to store the fetched values and do all the calculations when we unroll the loop and have to cope with memory access delays
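The two real-valued updates can be sketched in plain C++ with the four values fetched once per point and then reused. This assumes the split layout (separate real and imaginary arrays); the names `Corr` and `correlateSplit` are hypothetical.

```cpp
// Result pair standing in for the two return registers.
struct Corr {
    float re;
    float im;
};

// In-lined form of correlation += x[k] * conj(x[k + offset]) using
// separate real and imaginary arrays. Each point does 4 fetches,
// 4 multiplies and 4 adds/subtracts.
Corr correlateSplit(const float* re, const float* im, int numPts, int offset) {
    float rc = 0.0f;
    float ic = 0.0f;
    for (int k = 0; k < numPts - offset; ++k) {
        const float a = re[k];           // fetched once, used twice
        const float b = im[k];
        const float c = re[k + offset];
        const float d = im[k + offset];
        rc = rc + a * c + b * d;         // RC = RC + a*c + b*d
        ic = ic + (-a * d + c * b);      // IC = IC + (-a*d + c*b)
    }
    return Corr{rc, ic};
}
```

Keeping a, b, c and d in named temporaries is the C++ equivalent of holding them in registers, which is what makes the 4-fetch version possible.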

Reference sheet says
MULTIFUNCTION COMPUTE OPERATION: on certain registers only, unlike standard COMPUTE
    Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
    ALU compute FN = FX op FY, with FX = F(8,9,10,11) and FY = F(12,13,14,15)
So when doing RC_RX0 = RC_RX0 + a*c + b*d
    bring a and b into F(0,1,2,3); bring c and d into F(4,5,6,7)
    store the a * c result into F(8,9,10,11) and store the b * d result into F(12,13,14,15)
    store the a * c + b * d result into F(8,9,10,11), which would work if RC_RX0 was in F(12,13,14,15)
Questions to answer
1) Why?
2) How do we handle IC_R1 = IC_R1 + (-a*d + c*b) given the way the registers were being used by the RC_RX0 = RC_RX0 + a*c + b*d calculations?
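The operand rule quoted from the reference sheet is easy to check mechanically while planning register use. The helper names below are hypothetical, and the sketch only encodes the input-operand ranges quoted above, nothing else about the instruction set.

```cpp
// Register numbers are the n in Fn.
constexpr bool inRange(int reg, int lo, int hi) {
    return reg >= lo && reg <= hi;
}

// Multi-function multiply FN = FQ * FR: FQ in F0-F3, FR in F4-F7.
constexpr bool isLegalMultiFunctionMultiply(int fq, int fr) {
    return inRange(fq, 0, 3) && inRange(fr, 4, 7);
}

// Multi-function ALU FN = FX op FY: FX in F8-F11, FY in F12-F15.
constexpr bool isLegalMultiFunctionAlu(int fx, int fy) {
    return inRange(fx, 8, 11) && inRange(fy, 12, 15);
}
```

For example, a multiply with operands in F2 and F5 passes the check, while an add whose operands are F0 and F2 does not, because neither operand is in the ALU input ranges.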

RC_R0 = RC_R0 + a*c + b*d
and
IC_R1 = IC_R1 + (-a*d + c*b)
 Looks like 8 memory accesses per tap (point), but actually could optimize to 4 and reuse (IF there are enough registers)
 4 multiplies and 4 adds
 Can (if we switch into SIMD mode) do 2 multiplications + 2 adds + 4 memory accesses per cycle
 2 cycles per point needed in SIMD mode
 Time is 2 * Numpoints / 500 us, which must be < 50% of 10 us (at 96 kHz)
 Will work provided Numpoints < 5000 / 4
 Problem to solve if working in SIMD mode: make sure that we don't end up with a in register R1 and c in register S1, because then we can't multiply them together
 Could we unroll the loop so that we do the first dm, pm fetch into R1 and R4 and have SIMD do the (hidden) second dm, pm fetch into S1 and S4?
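The budget estimate can be written as a small calculation. This is a sketch under stated assumptions: the division by 500 on the slide is taken to mean a 500 MHz core clock, the sample period is rounded to 10 us as on the slide, and the function names are hypothetical.

```cpp
// Loop time in microseconds: points * cycles per point / clock in MHz.
double loopTimeUs(int numPoints, double cyclesPerPoint, double clockMHz) {
    return numPoints * cyclesPerPoint / clockMHz;
}

// True when the loop fits inside the allowed fraction of one sample period.
bool fitsBudget(int numPoints, double cyclesPerPoint, double clockMHz,
                double samplePeriodUs, double allowedFraction) {
    return loopTimeUs(numPoints, cyclesPerPoint, clockMHz)
           < allowedFraction * samplePeriodUs;
}
```

With 2 cycles per point, a 500 MHz clock, a 10 us period and a 50% budget, 1249 points fit and 1250 points do not, which agrees with the limit Numpoints < 5000 / 4 = 1250.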

 Even the simplest problem is essentially impossible to translate in the time available; that's why I say GPA A- starts around 80%
 You need to demonstrate that
◦ You know what you need to do, so that if you had enough time you could complete it
◦ Really key: you are able to use this knowledge to check that the compiler is doing a good job
 15 marks split across the following (16 as the first error is free)
1. REALLY KEY -- Design the code before translating it
2. Format of assembly language code and course coding requirements
3. Demonstrate understanding of parameter passing and return, in R registers
4. Need to save and recover registers: know what is volatile and what is not
5. KEY -- Need to move passed pointers (in R registers) into I registers
6. How to set up arrays to allow simultaneous dm, pm access
7. Hardware / software loop differences
8. KEY -- Post-modify and pre-modify difference
9. KEY -- Using F registers when doing mults and adds in multi-function mode
10. Complex number theory and format on DSP processors

#include
// How do you return a complex value? Don't know
// Two choices: in R0 (real part) and R1 (imaginary part)
// More likely (Midterm 2): switch to SIMD mode and use R0 and S0
.section seg_pmco;
.global _CalculateComplexCorrelation__NM;
_CalculateComplexCorrelation__NM:
complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
R0, R1 for return values (pretend)
4 parameters is very complex as it uses stack operations
Fake it by pretending the parameters arrive in R4 and R16 (dm and pm pointers), R8 and R12, then move R16 into a real register Rx
R16 is not a real register, it is a fake. The real code would look like Rx = dm(2, SP), but why learn that when you could cut-and-paste it from a C++ code example?

correlation = 0 + j0
    corrReal_F0 = 0.0;
    corrImag_F1 = 0.0;
    maxLoop_R8 = numPts_R8 - offset_R12;     // This sets Z, N flags
    if LE JUMP END;                          // no DB
set up pointers
    realPt_I4 = inPar_R4;
    imagPt_I12 = inPar_R16;
    // Want to handle offset into arrays easily
    Save I5 and I13 to stack
    // need more R registers
    Save R3, R6, R7, R9, R10
    inParR4Offset_R4 = inPar_R4 + offset_R12;
    inParR4Offset_R5 = inPar_R5 + offset_R12;
    realPtOffset_I5 = inParR4Offset_R4;
    imagPtOffset_I13 = inParR4Offset_R5;
// Do a code review and fix the minor bug
There are other ways of doing this using modify registers

set up loop using R8: information should be on the reference sheet
    for (int k = 0; k < numPts - offset; k++) {
Saving I5 and I13 to the stack would look something like this
    Modify(SP, 3);
    R0 = I5;              // Can't save Ix directly to memory
    dm(1, SP) = R0;
    R0 = I13;             // Can't save Ix directly to memory
    dm(2, SP) = R0;
    // Also there is no pm stack implemented

correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
// Use math explained above
// I am just writing code, not trying to optimize
// Read real part of one value and imaginary part of the other
    firstReal_R6 = dm(realPt_I4, DMPLUS1), secondImag_R10 = pm(imagPtOffset_I13, PMPLUS1);
    secondReal_R9 = dm(realPtOffset_I5, DMPLUS1), firstImag_R7 = pm(imagPt_I12, PMPLUS1);
// Valid code BUT these instructions ARE NOT executed in parallel: wrong syntax, wrong registers for multi-function
// real update
    temp_F2 = F6 * F9;
    temp_F3 = F7 * F10;
    real_F0 = F0 + F2;
    real_F0 = F0 + F3;
// imag update: less documented, temp registers used and discarded quickly (okay under exam conditions)
    temp_F2 = F6 * F10;
    temp_F3 = F7 * F9;
    imag_F1 = F1 - F2;
    imag_F1 = F1 + F3;

END:
    Recover registers in reverse order: R10, R9, R7, R6, R3
    Values already in R0 and R1
    5 magic lines to return to C
} return correlation; (R0 and R1)

 Demonstrate unrolling the loop: unroll 2 * p times
◦ Unrolling allows us to overlap (make parallel) parts of the first set of operations and the second set of operations
◦ In real life may unroll up to 8 times to find parallel operations; demonstrate the concept in the midterm (time)
 If switching to SIMD, unroll 4 * p times
 Write the optimization design using C++ syntax
◦ Don't switch to assembly code until the VERY last moment
◦ Write in the simplest possible version of C
 Concentrate on the loop as that is where we get the speed

for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}
Becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
Problem 1: can't switch to SIMD mode if k + offset is not divisible by 2
SIMD mode does R0 = dm[2 * x] and S0 = dm[2 * x + 1]
Meaning it can do the dual fetch dm[1000], dm[1001], but not dm[1001], dm[1002]
Means our speed estimate is out by a factor of 2 since we can't switch to SIMD mode. If we do switch, the code must become more complex, so don't switch to SIMD

for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}
Becomes
If (numPts - offset) is even then the unrolled code becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
Else
for (int k = 0; k < numPts - offset - 1; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
k = numPts - offset - 1;
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
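The even and odd cases can be folded into one routine: run the unrolled pair loop while two points remain, then handle a single leftover point if there is one. The sketch below assumes `std::complex<float>` in place of the slide's `complex` type, and the function name is hypothetical.

```cpp
#include <complex>
#include <vector>

// Correlation loop unrolled by 2, with a clean-up step for an odd count.
std::complex<float> correlateUnrolledBy2(
        const std::vector<std::complex<float>>& x, int numPts, int offset) {
    std::complex<float> correlation(0.0f, 0.0f);
    const int count = numPts - offset;
    int k = 0;
    // Main loop: two points per pass while at least two remain.
    for (; k + 1 < count; k += 2) {
        correlation += x[k] * std::conj(x[k + offset]);
        correlation += x[k + 1] * std::conj(x[k + offset + 1]);
    }
    // Odd count: exactly one point is left over.
    if (k < count) {
        correlation += x[k] * std::conj(x[k + offset]);
    }
    return correlation;
}
```

Because the points are accumulated in the same order as in the simple loop, the result should match the simple loop exactly, which makes the simple loop a convenient test oracle for the unrolled version.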

correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
Writing firstArray[k] as a[k] + jb[k]:
correlation = correlation + (a[k] + jb[k]) * (a[k + offset] - jb[k + offset]);
correlation = correlation + (a[k + 1] + jb[k + 1]) * (a[k + offset + 1] - jb[k + offset + 1]);
Look at the real part only:
correlationRe = correlationRe + (a[k] * a[k + offset]) + (b[k] * b[k + offset])
correlationRe = correlationRe + (a[k + 1] * a[k + offset + 1]) + (b[k + 1] * b[k + offset + 1])

Note the register renaming. Use this approach in case there are unexpected timing delays; then we can interleave the 2 unrolls
Plan to put the imag array on pm access
    Temp1 = a[k];
    Temp2 = a[k + offset];
    Mult3 = temp1 * temp2;
    Temp4 = b[k];
    Temp5 = b[k + offset];
    Mult6 = temp4 * temp5;
    corrRe = corrRe + Mult3;
    corrRe = corrRe + Mult6;
    Temp11 = a[k + 1];
    Temp12 = a[k + offset + 1];
    Mult13 = temp11 * temp12;
    Temp14 = b[k + 1];
    Temp15 = b[k + offset + 1];
    Mult16 = temp14 * temp15;
    corrRe = corrRe + Mult13;
    corrRe = corrRe + Mult16;

Use this order because of the instruction format (on certain registers only, unlike standard COMPUTE)
    Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
    ALU compute FN = FX op FY, with FX = F(8,9,10,11) and FY = F(12,13,14,15)
Resource chart columns: Other, Mult, Add, DM, PM
Starting point, one operation per line:
    Temp1 = a[k];
    Temp2 = a[k + offset];
    Mult3 = temp1 * temp2;
    Temp4 = b[k];
    Temp5 = b[k + offset];
    Mult6 = temp4 * temp5;
    corrRe = corrRe + Mult3;
    corrRe = corrRe + Mult6;
    Temp11 = a[k + 1];
    Temp12 = a[k + offset + 1];
    Mult13 = temp11 * temp12;
    Temp14 = b[k + 1];
    Temp15 = b[k + offset + 1];
    Mult16 = temp14 * temp15;
    corrRe = corrRe + Mult13;
    corrRe = corrRe + Mult16;

Resource chart columns: Other, Mult, Add, DM, PM
Step 1: move the b fetches onto PM so they happen in parallel with the a fetches on DM
    Cycle 1:  DM Temp1 = a[k];  PM Temp4 = b[k];
    Cycle 2:  DM Temp2 = a[k + offset];  PM Temp5 = b[k + offset];
    Cycle 3:  Mult Mult3 = temp1 * temp2
    Cycle 4:  Mult Mult6 = temp4 * temp5
    Cycle 5:  Add corrRe = corrRe + Mult3
    Cycle 6:  Add corrRe = corrRe + Mult6
    Cycle 7:  DM Temp11 = a[k + 1];  PM Temp14 = b[k + 1];
    Cycle 8:  DM Temp12 = a[k + offset + 1];  PM Temp15 = b[k + offset + 1];
    Cycle 9:  Mult Mult13 = temp11 * temp12
    Cycle 10: Mult Mult16 = temp14 * temp15
    Cycle 11: Add corrRe = corrRe + Mult13
    Cycle 12: Add corrRe = corrRe + Mult16

Resource chart columns: Other, Mult, Add, DM, PM
Step 2: interleave the two unrolled points and fill the free slots with the imaginary-part operations
    Cycle 1:  DM Temp1 = a[k];  PM Temp4 = b[k];
    Cycle 2:  DM Temp2 = a[k + offset];  PM Temp5 = b[k + offset];
    Cycle 3:  Mult Mult3 = temp1 * temp2;  DM Temp11 = a[k + 1];  PM Temp14 = b[k + 1];
    Cycle 4:  Mult Mult6 = temp4 * temp5;  DM Temp12 = a[k + offset + 1];  PM Temp15 = b[k + offset + 1];
    Cycle 5:  Mult Mult13 = temp11 * temp12;  Add corrRe = corrRe + Mult3;  DM, PM Imag fetches
    Cycle 6:  Mult Mult16 = temp14 * temp15;  Add corrRe = corrRe + Mult6;  DM, PM Imag fetches
    Cycle 7:  Mult Imag mult 1;  Add corrRe = corrRe + Mult13;  DM, PM Imag fetches
    Cycle 8:  Mult Imag mult 2;  Add corrRe = corrRe + Mult16;  DM, PM Imag fetches
    Cycle 9:  Mult Imag mult 1 (second point);  Add Imag add 1
    Cycle 10: Mult Imag mult 2 (second point);  Add Imag add 2
    Cycle 11: Add Imag add 1 (second point)
    Cycle 12: Add Imag add 2 (second point)
Efficiency 8 in 12

Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute FN = FX op FY, with FX = F(8,9,10,11) and FY = F(12,13,14,15)
Resource chart columns: Other, Mult, Add, DM, PM
Now assign registers to the schedule:
    Cycle 1:  DM Temp1 = a[k];  PM Temp4 = b[k];
    Cycle 2:  DM Temp2 = a[k + offset];  PM Temp5 = b[k + offset];
    Cycle 3:  Mult Mult3 = temp1 * temp2 (F2, F5);  DM Temp11 = a[k + 1];  PM Temp14 = b[k + 1];
              What register for Mult3? What register for Temp11?
    Cycle 4:  Mult Mult6 = temp4 * temp5 (F3, F6);  DM Temp12 = a[k + offset + 1];  PM Temp15 = b[k + offset + 1];
    Cycle 5:  Mult Mult13 = temp11 * temp12 (?, ?);  Add corrRe = corrRe + Mult3 (F0, F0) -- illegal use of F0
    Cycle 6:  Mult Mult16 = temp14 * temp15 (?, ?);  Add corrRe = corrRe + Mult6 (F0, F0)
    Cycle 7:  Mult Imag mult 1;  Add corrRe = corrRe + Mult13
    Cycle 8:  Mult Imag mult 2;  Add corrRe = corrRe + Mult16
    Then:     Add Imag add 1, Imag add 2
