DMA example Video image manipulation


Problem to solve:
Build video images in SDRAM.
Scale all the images (increase the grey scale by a fixed scaling factor).
Determine whether it is more efficient to:
1. work on the images directly in SDRAM;
2. bring the images out of SDRAM (using DMA), scale them, then put them back; or
3. use a multi-threaded version of task 2.
Multiplication and division issues.
Some possible Q9 areas for the final.
Video, Copyright M. Smith, ECE, University of Calgary, Canada, 11/29/2018

Video image layout: blanking information, Frame 1 (luminance + colour information), blanking information, Frame 2 (luminance + colour information), blanking information. We have the ability to manipulate the frame information without touching the blanking information.
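A minimal C sketch of the layout above, assuming fixed, hypothetical sizes for the blanking and frame regions (BLANK_BYTES and FRAME_BYTES are illustrative, not taken from the slides). It shows how an operation can be confined to one frame's bytes so the blanking information is never touched. invert_frame is a hypothetical stand-in for "manipulate the frame".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { BLANK_BYTES = 64, FRAME_BYTES = 1024 };

/* Offset of the first byte of frame k (k = 0, 1, ...) in a buffer laid out as
   blanking, frame 0, blanking, frame 1, blanking, ... */
static size_t frame_offset(size_t k)
{
    return BLANK_BYTES + k * (FRAME_BYTES + BLANK_BYTES);
}

/* Hypothetical example operation: invert every byte of frame k.
   Only the FRAME_BYTES bytes of that frame are written; blanking is untouched. */
static void invert_frame(unsigned char *video, size_t k)
{
    assert(video != NULL);
    unsigned char *frame = video + frame_offset(k);
    for (size_t i = 0; i < FRAME_BYTES; ++i)
        frame[i] = (unsigned char)(255 - frame[i]);
}
```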

Frame information: pixel 1 uses G1 + CB1 + CR1. The image brightness decreases across the frame.
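One way to build a test image whose brightness decreases across the frame, sketched in C. This assumes a luminance-only image with 8-bit samples and a left-to-right gradient. The slides do not state the sample size or gradient direction, and build_test_image is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a width x height grey-scale image (luminance only, 8-bit samples
   assumed) so that brightness falls from 255 at the left edge to 0 at the
   right edge of every line. */
static void build_test_image(uint8_t *img, size_t width, size_t height)
{
    assert(img != NULL && width > 1);
    for (size_t row = 0; row < height; ++row)
        for (size_t col = 0; col < width; ++col)
            img[row * width + col] = (uint8_t)(255 - (255 * col) / (width - 1));
}
```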

Test setup: tasks done one after another, and tasks done with DMA occurring at the same time as other tasks.

Three threads, run sequentially: scaling the intensity by 19.

Task being performed. Note: the instructions appear out of order relative to the C++ code. The loop involves 1 memory read and 1 memory write plus 2 operations that do not access memory, which leaves the DMA operation some bus bandwidth to work with.
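A C sketch of the scaling task, for comparison with the loop described above. It assumes 16-bit pixels and saturation at 0xFFFF. The slides say only "scaling intensity by 19", so both the pixel width and the saturation are assumptions. Each iteration does one read and one write of memory, with the multiply and the clamp in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale every pixel in place by a fixed factor, saturating at UINT16_MAX.
   16-bit pixels and saturation are assumptions for this sketch. */
static void scale_pixels(uint16_t *pixels, size_t count, uint32_t factor)
{
    assert(pixels != NULL || count == 0);
    assert(factor <= 65536u);                       /* keeps the product inside 32 bits */
    for (size_t i = 0; i < count; ++i) {
        uint32_t scaled = (uint32_t)pixels[i] * factor;                /* 1 read + multiply */
        pixels[i] = (uint16_t)(scaled > UINT16_MAX ? UINT16_MAX : scaled); /* 1 write */
    }
}
```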

Three threads in parallel: not the best solution?
Start first DMA transfer; wait.
Start second DMA transfer; start doing the math operation in parallel.
Wait till second DMA done.
Transfer math results back; wait.
Start third DMA transfer.
Wait till third DMA done.
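The overlap in the sequence above is usually organised as double (ping-pong) buffering: while one internal buffer is being scaled, the next block is fetched into the other. This C sketch shows only the ordering. dma_start and dma_wait are hypothetical stand-ins, emulated with memcpy so that each "transfer" completes immediately. On real hardware dma_start would return at once and dma_wait would block until the controller finishes. BLOCK is an arbitrary illustrative size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 4 };  /* hypothetical block size, in pixels */

/* Hypothetical DMA stand-ins: the "transfer" is a plain memcpy. */
static void dma_start(void *dst, const void *src, size_t bytes) { memcpy(dst, src, bytes); }
static void dma_wait(void) { /* nothing to wait for in this emulation */ }

static void scale_block(uint16_t *blk, size_t n, uint32_t factor)
{
    for (size_t i = 0; i < n; ++i)
        blk[i] = (uint16_t)((uint32_t)blk[i] * factor);   /* wraps modulo 2^16 */
}

/* Process nblocks blocks of BLOCK pixels stored in "SDRAM". */
static void process_image(uint16_t *sdram, size_t nblocks, uint32_t factor)
{
    static uint16_t buf[2][BLOCK];                 /* two internal-memory buffers */
    assert(sdram != NULL || nblocks == 0);
    if (nblocks == 0)
        return;
    dma_start(buf[0], sdram, sizeof buf[0]);       /* fetch the first block */
    dma_wait();
    for (size_t k = 0; k < nblocks; ++k) {
        uint16_t *cur = buf[k % 2];
        if (k + 1 < nblocks)                       /* start fetching the next block */
            dma_start(buf[(k + 1) % 2], sdram + (k + 1) * BLOCK, sizeof buf[0]);
        scale_block(cur, BLOCK, factor);           /* math would overlap the fetch */
        dma_wait();                                /* next block has arrived */
        dma_start(sdram + k * BLOCK, cur, sizeof buf[0]);  /* write results back */
        dma_wait();
    }
}
```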

Results of the tests. We need to profile the code to determine where the time is now being wasted.

Multiplication code, 16-bit. Note: the instructions appear out of order relative to the C++ code. IS: integer signed multiplication. FS: fractional signed (a signed fixed-point fraction), available on many processors.
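To make the IS / fractional distinction concrete, here is a C emulation of the two 16-bit multiply modes. It is a sketch under stated assumptions, not the hardware definition. IS is modelled as a 16 x 16 to 32-bit product saturated to 16 bits. The fractional (1.15) mode is modelled as the product shifted left one bit to realign the binary point, rounded half-up to the upper 16 bits, with the single overflow case (-1 * -1) saturated. The rounding rule is an assumption, because the real rounding mode is configurable.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* IS: integer signed 16 x 16 multiply, result saturated to 16 bits. */
static int16_t mul_is16(int16_t a, int16_t b)
{
    return sat16((int32_t)a * b);
}

/* Signed fractional 1.15 multiply (assumed round-half-up). */
static int16_t mul_frac16(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;                    /* -1 * -1 = +1 is not representable */
    int32_t p = ((int32_t)a * b) * 2;        /* 1.31 product, cannot overflow here */
    return (int16_t)((p + 0x8000) >> 16);    /* assumes arithmetic >> (GCC, Clang) */
}
```

Under these assumptions, mul_frac16(6, 7) returns 0 while mul_is16(6, 7) returns 42. The same integers give very different answers depending on the mode.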

Multiplication details

Multiplication possibilities
R1.L = R2.L * R3.L;   // using multiplier 0
R1.H = R2.H * R3.H;   // using multiplier 1
R1.L = R2.L * R3.L, R1.H = R2.H * R3.H;   // using both multipliers in parallel
R2 = [P0++];
R3 = [P1++];
[P2++] = R1;
R1.L = R2.L * R3.L, R1.H = R2.H * R3.H || R4 = [P0++] || R5 = [I1++];
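A C analogue of the "both multipliers in parallel" pattern: each 32-bit load brings in two adjacent 16-bit samples (on little-endian Blackfin the lower-addressed one lands in .L), and both halves are multiplied in the same step. The fractional multiply repeats the same assumptions as a plain emulation (round-half-up, saturation of -1 * -1). vec_mul is a hypothetical name, and n is assumed even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed 1.15 fractional multiply, rounded half-up and saturated (assumed). */
static int16_t frac_mul(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;
    int32_t p = ((int32_t)a * b) * 2;
    return (int16_t)((p + 0x8000) >> 16);      /* assumes arithmetic >> */
}

/* Mirrors: R2 = [P0++]; R3 = [P1++];
            R1.L = R2.L * R3.L, R1.H = R2.H * R3.H; [P2++] = R1; */
static void vec_mul(const int16_t *x, const int16_t *y, int16_t *out, size_t n)
{
    assert(n % 2 == 0);                        /* processes pairs of samples */
    for (size_t i = 0; i < n; i += 2) {
        out[i]     = frac_mul(x[i], y[i]);         /* multiplier 0: .L halves */
        out[i + 1] = frac_mul(x[i + 1], y[i + 1]); /* multiplier 1: .H halves */
    }
}
```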

Multiply and add test

Multiply and add: result and code (in SDRAM). 3-cycle loop. Note the special MAC instruction A0 += R0.L * R1.L (IS), which involves both an add and a multiplication. MAC: multiply and accumulate.

MAC syntax details

Hints at possible advantage
A0 += R2.L * R3.L, A1 -= R2.H * R3.H || R4 = [P0++] || R5 = [I1++];
This involves 2 multiplies, 4 adds (A0 +=, A1+=, P0++ and I1++) and 2 memory reads.

MNOP || R2 = W[P0++] (X) || R3 = W[I1++] (X);   // MNOP = multiplier NOP
P1 = 100 - 2;
LSET (START, FINISH) LC1 = P1 >> 1;   // go round the loop 49 times
START: A0 += R2.L * R3.L, A1 -= R2.H * R3.H || R4 = W[P0++] (X) || R5 = W[I1++] (X);
FINISH: A0 += R4.L * R5.L, A1 -= R4.H * R5.H || R2 = W[P0++] (X) || R3 = W[I1++] (X);

Using R2, R3 and then R4, R5 is an attempt to avoid pipeline issues. This may not be required; we would have to examine the pipeline viewer to see what happens.
FINAL EXAM REVIEW: what is the syntax error?

Multiply and accumulate operation
Filter operation on 16-bit values:
sum = 0;
for count = 0 to N - 1
    sum = sum + value[count] * coeff[count];
sum = sum / N;
It does not take much to overflow a signed 16-bit register. With value1 = value2 = 32000, value1 + value2 = 64000, which wraps to -1536 as a signed 16-bit value. With value1 = value2 = 32000 and coeff1 = coeff2 = 32000, value1 * coeff1 + value2 * coeff2 = 2,048,000,000, already close to the signed 32-bit limit of 2,147,483,647; one more such term overflows a 32-bit value.
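A small C check of the overflow arithmetic above. wrap16 models what a 16-bit register holds after an addition, and it is written with unsigned masking so it does not rely on implementation-defined signed conversions. fits_int32 tests the exact 64-bit sum against the signed 32-bit range. Both names are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value left in a signed 16-bit register after wrap-around. */
static int32_t wrap16(int32_t x)
{
    uint32_t v = (uint32_t)x & 0xFFFFu;
    return v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Does an exact sum fit in a signed 32-bit register? */
static bool fits_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

With these, wrap16(32000 + 32000) is -1536. Two 32000 * 32000 terms (2,048,000,000) still fit in 32 bits, but three (3,072,000,000) do not.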

Multiply and accumulate operation: solving the problem
It does not take much to overflow: 32000 + 32000 wraps to -1536 as a signed 16-bit value, and two products of 32000 * 32000 already bring a 32-bit sum close to overflow. Possible solutions:
Divide all the input values by N. This guarantees that the sum of N values will not overflow the number representation, but it does not give an accurate answer: what if the inputs are 32000, 32000, 16000 today but 1, 3, 5, 7 tomorrow?
Use a special 40-bit register for storing the sum. This makes an overflow less likely.
Do a theoretical calculation to determine how many bits are needed to store the accurate answer.
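The "theoretical calculation" can be sketched as follows. The largest magnitude of a signed 16 x 16 product is (-32768) * (-32768) = 2^30, so the sum of n products is at most n * 2^30. That needs 32 + floor(log2 n) bits as a signed two's-complement value. The C helpers below (hypothetical names) compute that bound and check whether a given exact sum fits in a signed 40-bit accumulator, with int64_t standing in for the accumulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signed bits needed to hold the sum of n worst-case 16 x 16 products. */
static unsigned bits_needed(uint32_t n)
{
    assert(n > 0);
    unsigned log2n = 0;
    while (n >>= 1)
        ++log2n;                 /* floor(log2 n) */
    return 32 + log2n;
}

/* Does an exact sum fit in a signed 40-bit accumulator? */
static bool fits_acc40(int64_t sum)
{
    const int64_t max40 = ((int64_t)1 << 39) - 1;
    return sum >= -max40 - 1 && sum <= max40;
}
```

By this bound, a 40-bit accumulator cannot overflow for up to 511 worst-case products, and 100 products of 32000 * 32000 fit comfortably.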

The multiplier takes 16 x 16 bits to give 32 bits; the adder is 40 bits; the accumulator is 40 bits.

Example: filter 100 values in only 50 instructions
.section data
.byte2 array[100], coeffs[100];
P0.H = hi(array); P0.L = lo(array);
I1.H = hi(coeffs); I1.L = lo(coeffs);
MNOP || R2 = W[P0++] (X) || R3 = W[I1++] (X);   // MNOP = multiplier NOP
P1 = 100 - 2;
LSET (START, FINISH) LC1 = P1 >> 1;   // go round 49 times
START: A0 += R2.L * R3.L, A1 -= R2.H * R3.H || R4 = [P0++] (X) || R5 = [I1++] (X);
FINISH: A0 += R4.L * R5.L, A1 -= R4.H * R5.H || R2 = [P0++] (X) || R3 = [I1++] (X);
R0.L = (A0 += R2.L * R3.L), R0.H = (A1 -= R2.H * R3.H);
R0.L = R0.L + R0.H (NS);
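The idea behind using two accumulators can be checked against a plain C reference. This sketch uses integer (IS-style) products and int64_t in place of the 40-bit A0 and A1. Even-indexed products go to acc0 (the .L halves) and odd-indexed products to acc1 (the .H halves), and the two partial sums are added at the end. dot_dual is a hypothetical name, and n is assumed even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product with two accumulators, two multiply-accumulates per step. */
static int64_t dot_dual(const int16_t *x, const int16_t *c, size_t n)
{
    assert(n % 2 == 0);
    int64_t acc0 = 0, acc1 = 0;          /* stand-ins for A0 and A1 */
    for (size_t i = 0; i < n; i += 2) {
        acc0 += (int32_t)x[i] * c[i];         /* .L halves */
        acc1 += (int32_t)x[i + 1] * c[i + 1]; /* .H halves */
    }
    return acc0 + acc1;                  /* combine the partial sums */
}
```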

Convert the following code using parallel instructions and ensuring maximum accuracy:
#define N 1024
.section data
.byte2 array[N];
// #define N 1024
// short array[N];
// short CalculateAverage( ) {
//     Determine sum;
//     return average;
// }
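A plain C version of CalculateAverage is one possible starting point and reference answer for checking an assembly conversion. The sum is kept in 32 bits, which cannot overflow here because |sum| <= 1024 * 32768 = 2^25. Note that C integer division truncates toward zero, so negative averages round toward zero.

```c
#include <assert.h>
#include <stdint.h>

#define N 1024

short array[N];

/* Reference average of the N 16-bit samples in array[]. */
short CalculateAverage(void)
{
    int32_t sum = 0;                 /* wide enough: |sum| <= 2^25 */
    for (int i = 0; i < N; ++i)
        sum += array[i];
    assert(sum / N >= INT16_MIN && sum / N <= INT16_MAX);
    return (short)(sum / N);         /* truncates toward zero */
}
```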

Options for doing multiplication
R0 = R1 * R2;   32-bit. Mimics C++ multiplication. The user must make sure that the multiplication does not overflow 32 bits: no flags are set on error.
R0.L = R1.L * R2.H (mode);   16-bit. Default: signed fraction. IS: integer signed. IU: integer unsigned. Uses the A0 and A1 multipliers.
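Because the 32-bit multiply reports no error, the programmer has to establish in advance that the product fits. A C sketch of that check computes the exact product in 64 bits and compares it against the 32-bit range. mul32_overflows is a hypothetical helper, not a Blackfin or library function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a * b does not fit in a signed 32-bit result. */
static bool mul32_overflows(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * b;      /* exact product */
    return p > INT32_MAX || p < INT32_MIN;
}
```

For example, 46340 * 46340 fits in 32 bits, but 46341 * 46341 and 65536 * 65536 do not.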

Warning: for more details see the article "When 1 + 1 = 2; but 2 * 2 != 4;", published in Circuit Cellar magazine. Link available from December 415 web-page. Sounds like a good Q9 to me for the final if you add some more details.

Addition and multiplication on Blackfin
If R0 = 0x12345678, then what is the result of R0.L = 0xFFFF, and why?
Math question: what is the result of 0.1 * 10^-2 + 0.2 * 10^-2? Express the answer in the format 0.XYZ * 10^-2.
Math question: what is the result of 0.1 * 10^-2 * 0.2 * 10^-2? Express the answer in the format 0.XYZ * 10^-2.
R0.L = 0x6; R1.L = 0x7;
What is the result of R2.L = R0.L + R1.L (NS); and why, when the values are treated as 2's complement numbers, and when they are treated as signed fractional numbers (format R0.L = 6 * 2^-15)?
What is the result of R2.H = R0.L * R1.L; and why?
What is the result of R2.H = R0.L * R1.L (IS); and why?
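Two of these questions can be explored in C. First, writing a half register (R0.L) replaces only the low 16 bits and leaves R0.H alone. Second, the decimal questions can be modelled as a toy format 0.XYZ * 10^-2 that stores only the three-digit mantissa XYZ and truncates anything smaller. Truncation, rather than rounding, and the absence of renormalisation are assumptions of this sketch, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* R0.L = value: replace the low 16 bits, keep the high 16 bits. */
static uint32_t write_low_half(uint32_t reg, uint16_t value)
{
    return (reg & 0xFFFF0000u) | value;
}

/* Toy format: value = (m / 1000) * 10^-2, mantissa m in 0..999. */
static unsigned dec_add(unsigned m1, unsigned m2)
{
    unsigned m = m1 + m2;
    assert(m <= 999);                /* no renormalisation in this sketch */
    return m;
}

/* (m1 * 10^-5) * (m2 * 10^-5) = (m1 * m2 / 10^5) * 10^-5, truncated. */
static unsigned dec_mul(unsigned m1, unsigned m2)
{
    assert(m1 <= 999 && m2 <= 999);
    return (m1 * m2) / 100000u;
}
```

In this model, 0.100 + 0.200 gives 0.300, but 0.100 * 0.200 (both * 10^-2) truncates to 0.000 * 10^-2. Addition survives the fixed format, while multiplication loses everything.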

Other "multiplication" types
Multiply by 2 or 4:
R0 = (R1 + R2) << 1;   (or << 2) (or Pn)
P0 = P1 + (P2 << 1);   (or << 2) P registers only. Useful when using P2 as the index in a loop.
Multiply by 1/2, 1/4, 1/8, 1/2^N:
R0 >>= 3;   divide by 8 (R0 unsigned number): 0x80000000 / 8 = 0x10000000 (unsigned (+ve) number)
R0 >>>= 3;   divide by 8 (R0 signed number): 0x80000000 / 8 = 0xF0000000 (negative number)
R0 = ASHIFT R1 BY -3;   (a negative shift count divides, a positive one multiplies)
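The difference between the logical (>>=) and arithmetic (>>>=) right shifts, and the scaled-index add, can be reproduced in C. C leaves right shifts of negative signed values implementation-defined, so asr32 builds the arithmetic shift from complements on non-negative values instead. index_address is a hypothetical helper mirroring P0 = P1 + (P2 << 1) for 16-bit elements.

```c
#include <assert.h>
#include <stdint.h>

/* R0 >>= n: logical shift, zeros enter from the left (unsigned divide by 2^n). */
static uint32_t lsr32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x >> n;
}

/* R0 >>>= n: arithmetic shift, copies of the sign bit enter from the left. */
static int32_t asr32(int32_t x, unsigned n)
{
    assert(n < 32);
    return x < 0 ? ~(~x >> n) : x >> n;   /* ~x is non-negative when x < 0 */
}

/* P0 = P1 + (P2 << 1): base address plus index scaled for 16-bit elements. */
static uint32_t index_address(uint32_t base, uint32_t index)
{
    return base + (index << 1);
}
```

One caution: asr32(-7, 1) gives -4, whereas the C expression -7 / 2 gives -3. An arithmetic shift rounds toward minus infinity, while C division truncates toward zero, so the shift is only a drop-in replacement for division when the value is non-negative or exactly divisible.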

Division
Fast divide by 2, 4, 8, 2^N using shifts:
R0 >>= 3;   divide by 8 (R0 unsigned number): 0x80000000 / 8 = 0x10000000 (unsigned (+ve) number)
R0 >>>= 3;   divide by 8 (R0 signed number): 0x80000000 / 8 = 0xF0000000 (negative number)
R0 = ASHIFT R1 BY -3;   (a negative shift count divides, a positive one multiplies)
More flexible: use DIVS and DIVQ. Slow, because they must be performed in a loop.
Example code: 70 / 5

Code example (P10-25)
.global _DivideASM;
_DivideASM:
    R0 = 70;          // Divide(70, 5);
    R1 = 5;
    P0 = 15;          // evaluate the quotient to 15 bits (loop count)
    R0 <<= 1;         // book says "needed for integer division"
    DIVS(R0, R1);     // determines the MSB of the quotient
    LOOP .div_prim LC0 = P0;   // note: different loop syntax
    LOOP_BEGIN .div_prim;
        DIVQ(R0, R1);
    LOOP_END .div_prim;
    R0 = R0.L (X);
    RTS;
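The reason the DIVQ instruction sits in a loop is that division is done one quotient bit per step. The C sketch below shows that idea with classic restoring shift-and-subtract division on unsigned 16-bit values. It is a swapped-in illustration of the general technique. The exact DIVS/DIVQ algorithm (non-restoring, with sign handling and its own register conventions) differs, and divide_u16 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Restoring division: one quotient bit per iteration, MSB first. */
static uint16_t divide_u16(uint16_t dividend, uint16_t divisor, uint16_t *remainder)
{
    assert(divisor != 0);
    uint32_t rem = 0;
    uint16_t quot = 0;
    for (int bit = 15; bit >= 0; --bit) {
        rem = (rem << 1) | ((dividend >> bit) & 1u);  /* bring down the next bit */
        quot = (uint16_t)(quot << 1);
        if (rem >= divisor) {                         /* trial subtraction succeeds */
            rem -= divisor;
            quot |= 1u;
        }
    }
    if (remainder != NULL)
        *remainder = (uint16_t)rem;
    return quot;
}
```

divide_u16(70, 5, &r) returns 14 with r = 0, matching the 70 / 5 example. Each of the 16 iterations plays the role of one division step.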

Problem to solve:
Build video images in SDRAM.
Scale all the images (increase the grey scale by a fixed scaling factor).
Determine whether it is more efficient to:
1. work on the images directly in SDRAM;
2. bring the images out of SDRAM (using DMA), scale them, then put them back; or
3. use a multi-threaded version of task 2.
Multiplication and division issues.

Information taken from Analog Devices On-line Manuals with permission: http://www.analog.com/processors/resources/technicalLibrary/manuals/
Information furnished by Analog Devices is believed to be accurate and reliable. However, Analog Devices assumes no responsibility for its use or for any infringement of any patent or other rights of any third party which may result from its use. No license is granted by implication or otherwise under any patent or patent right of Analog Devices. Copyright © Analog Devices, Inc. All rights reserved.