Working with the Compute Block

Working with the Compute Block
M. R. Smith, ECE, University of Calgary, Canada

Tackled today
Problems with using the I-ALU as an “integer” processor
TigerSHARC processor architecture
What features are available for DSP optimization, and what do we “have to worry about” when using these features?
Moving DCremoval( ) over to the X Compute block
Using test macros: useful to know, but a real time waster for the labs in this class
5/26/2019 Working with the COMPUTE Block, M. Smith, ECE, University of Calgary, Canada

DCremoval( )
Not as complex as FIR, but many of the same requirements:
Memory intensive
Addition intensive
Loops for main code
FIFO implemented as circular buffer
Easier to handle than FIR
You use the same ideas in optimizing FIR over Labs 2 and 3
Two issues: speed and accuracy
Develop suitable tests for the CPP code and check that the various assembly language versions satisfy the same tests

Set up time
In principle, 1 cycle / instruction
2 + 4 instructions

First key element: Sum Loop, Order (N)
4 instructions + N * 5 instructions
Second key element: Shift Loop, Order (log2N)
1 + 2 * log2N instructions

Third key element: FIFO circular buffer, Order (N)
Update outgoing parameters: 6 instructions
Update FIFO: 3 + 6 * N instructions
Function return: 2 instructions

Time in theory
Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 + 2 * log2N
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------
Total                           22 + 11 N + 2 log2N
N = 128: instructions = 1444
1444 cycles + 1100 delay cycles
C++ debug mode: 9500 cycles?
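The total in the table above can be checked with a few lines of C++. This is only a sketch of the arithmetic: the function names `log2_int` and `theory_instructions` are made up for illustration, and the per-stage counts are copied straight from the table.

```cpp
#include <cassert>

// Integer log2 for a power-of-two FIFO length (for example 128 -> 7).
int log2_int(int n) {
    assert(n > 0);
    int bits = 0;
    while (n > 1) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

// Instruction count for the I-ALU version, stage by stage,
// using the counts listed in the "time in theory" table.
int theory_instructions(int n) {
    const int setup_pointers  = 2;
    const int insert_values   = 4;
    const int sum_loop        = 4 + n * 5;
    const int shift_loop      = 1 + 2 * log2_int(n);
    const int update_outputs  = 6;
    const int update_fifo     = 3 + 6 * n;
    const int function_return = 2;
    return setup_pointers + insert_values + sum_loop + shift_loop +
           update_outputs + update_fifo + function_return;
}
```

Calling `theory_instructions(128)` gives 22 + 11 * 128 + 2 * 7 = 1444, the figure quoted on the slide.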

Is the code too slow?
Code is slow IFF (if and only if) you don’t have 2,500 cycles available to perform this part of the software defined radio (SDR) algorithm.
Other components of the SDR, plus other components of the complete system, must complete within the time between 2 samples at 48 kHz.
48,000 interrupts per second
500,000,000 cycles available every second
About 10,400 cycles available per interrupt (500,000,000 / 48,000 ≈ 10,417)
My ball-park: never design code that, at the design stage, takes more than 50% of the available cycles.
From take-home quiz 1: DCremoval( ) is 17% of code time, so we need about 6 * 2,500 cycles = 15,000 for the SDR component alone, which is more than the cycles available per interrupt.
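The budget argument above is plain arithmetic, sketched here in C++. The helper names are hypothetical; the clock rate, sample rate, 50% rule and the 2,500 and 15,000 cycle figures are the ones quoted on the slide.

```cpp
#include <cassert>

// Cycles available between two interrupts (integer division, rounds down).
long cycles_per_interrupt(long clock_hz, long sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;
}

// The "ball-park" rule: at the design stage, code should need
// no more than 50% of the cycles that are available.
bool fits_design_budget(long cycles_needed, long cycles_available) {
    return 2 * cycles_needed <= cycles_available;
}
```

With a 500,000,000 Hz clock and 48,000 interrupts per second, `cycles_per_interrupt` returns 10416. DCremoval( ) alone at 2,500 cycles passes `fits_design_budget`, but the 15,000 cycles estimated for the whole SDR component does not.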

The code is too slow because we are not taking advantage of the available resources
Bring in up to 128 bits (4 instructions) per cycle
Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K bus (data2)
Perform address calculations in the J and K ALUs, with single-cycle hardware circular buffers
Perform math operations on both X and Y compute blocks
Background DMA activity
Off-load some of the processing to the second processor

Version 2: move the algorithm component from the I-ALU over to the Compute Block

Steps for faster code development
Cut and paste old code, changing the name only:
_DCremovalASM_JALU__FPiT1 becomes _DCremovalASM_Compute__FPiT1
Run test to confirm

Add timing and execution tests

Element we want to change
void DCremovalASM(int *, int *)
Setting up the static arrays
Defining and then setting pointers
Moving incoming parameters into the FIFO
Summing the FIFO values
Performing (FAST) division
Returning the correct values
Updating the FIFO in preparation for the next time this function is called: discarding the oldest value, and “rippling” the FIFO to make the “newest” FIFO slot empty
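The steps listed above can be sketched as a small C++ reference model. This is not the course's code: the name `DCremovalModel`, the helper `remove_dc_one_channel`, and the assumption that the function returns each input minus the FIFO average are illustrative guesses. N = 128 is taken from the timing slides.

```cpp
#include <cassert>

const int kFifoLength = 128;  // N on the slides
const int kLog2Length = 7;    // log2(128), used for the fast division

// Static arrays, zero-initialised, one FIFO per channel.
static int left_fifo[kFifoLength];
static int right_fifo[kFifoLength];

// Hypothetical helper: one channel of the DC removal.
static int remove_dc_one_channel(int fifo[], int newest_value) {
    // Move the incoming value into the "newest" FIFO slot.
    fifo[kFifoLength - 1] = newest_value;

    // Sum the FIFO values.
    int sum = 0;
    for (int i = 0; i < kFifoLength; ++i) {
        sum += fifo[i];
    }

    // Fast division by N: an arithmetic shift right by log2(N).
    // (Right-shifting a negative int is only guaranteed to be
    // arithmetic from C++20 on; most compilers do it anyway.)
    int average = sum >> kLog2Length;

    // Ripple the FIFO: discard the oldest value so the newest
    // slot is free for the next call.
    for (int i = 0; i < kFifoLength - 1; ++i) {
        fifo[i] = fifo[i + 1];
    }

    // Assumed behaviour: return the input with the average removed.
    return newest_value - average;
}

void DCremovalModel(int *left, int *right) {
    assert(left != nullptr && right != nullptr);
    *left  = remove_dc_one_channel(left_fifo, *left);
    *right = remove_dc_one_channel(right_fifo, *right);
}
```

A model like this gives the C++ tests something to check the assembly versions against: feed both the same inputs and require the same outputs.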

Perform sum: using the I-ALU

Perform sum: using the Compute Block
    #define left_sum_XR6 XR6
    left_sum_XR6 = 0;;
    #define left_XR2 XR2
    left_XR2 = [left_buffpt_J0 + i_J8];;
    left_sum_XR6 = R6 + R2;;            // NOTE SYNTAX
    left_sum_XR6 = ASHIFT R6 BY -7;;
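The final ASHIFT by -7 divides the sum by 128 in a single instruction, where the earlier timing table charged 1 + 2 * log2N instructions for a shift loop. The C++ sketch below compares the two ideas. It assumes the I-ALU loop shifted one bit per pass, which the slides do not show, and both function names are hypothetical.

```cpp
#include <cassert>

// Assumed shape of the shift loop: one bit per pass, log2(N) passes.
int divide_by_shift_loop(int sum, int log2n) {
    assert(log2n >= 0);
    for (int i = 0; i < log2n; ++i) {
        sum >>= 1;
    }
    return sum;
}

// Compute-block style: one arithmetic shift by log2(N) bits,
// the C++ counterpart of "ASHIFT R6 BY -7" when log2n is 7.
int divide_by_single_shift(int sum, int log2n) {
    assert(log2n >= 0);
    return sum >> log2n;
}
```

For non-negative sums both functions return the same result as `sum / 128` when `log2n` is 7; the single shift just gets there without the loop overhead.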

Final sum code
Don’t use XR6 = J31
J31 is NOT A ZERO if used with the COMPUTE block (condition code reg.)

Other necessary changes

Time in theory
Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 (was 1 + 2 * log2N)
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------
Total                           22 + 11 N (was 22 + 11 N + 2 log2N)
N = 128: instructions = 1430
Was 2,500 cycles (1444 cycles + 1100 delay cycles)

Time in practice
Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 (was 1 + 2 * log2N)
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------
Total                           22 + 11 N (was 22 + 11 N + 2 log2N)
N = 128: instructions = 1430
1430 + 300 delay cycles = 1730 cycles
Was 2,500 cycles (1444 cycles + 1100 delay cycles)
Improved more than expected, as we are accidentally making better use of the available resources

Possible explanation of speed improvement
Must wait for the value to arrive from memory
Must wait for the I-ALU to become available so it can calculate the address or do the add
Remember: we are working in a loop
Wait for I-ALU: expected savings 2 * N = 256
Actual savings: about 700, roughly 6 * N

Next stage in improving code speed
Software and hardware circular buffers
Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 (was 1 + 2 * log2N)
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------
Total                           22 + 11 N (was 22 + 11 N + 2 log2N)
N = 128: instructions = 1430
1430 + 300 delay cycles = 1730 cycles

Making the tests quicker to develop
Is there an alternative to cut-and-paste?
Do you want to bother to learn and then use it?

Develop Call-RETURN test macro

Develop: validate operation test macro
In practice: not as trivial an exercise as it looks
Acts as “1 long C++ line”, so any error message is unspecific
My favourite error: tabs and / or spaces after the final \ on each line
Solution: use the “Home / End” keys to check that \ is at the end of the line

Timing test macro: not trivial
Need a new special loop control function generated for each test
Name must change
Print statement contents must change
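One way to meet the requirements above is a macro that generates a uniquely named loop function for each test. The sketch below is an illustration, not the macro used in the course: `MAKE_TIMING_TEST`, `TimeLoop_<NAME>` and `work_under_test` are invented names, and `std::chrono` stands in for whatever cycle counter the real test framework reads.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical function under test; it only counts its calls.
static int g_calls = 0;
void work_under_test() { ++g_calls; }

// Generates a function named TimeLoop_<NAME> for each test:
// ## changes the function name, # changes the printed text.
#define MAKE_TIMING_TEST(NAME, FUNCTION_CALL, REPEATS)                     \
    long long TimeLoop_##NAME() {                                          \
        assert((REPEATS) > 0);                                             \
        auto start = std::chrono::steady_clock::now();                     \
        for (int i = 0; i < (REPEATS); ++i) {                              \
            FUNCTION_CALL;                                                 \
        }                                                                  \
        auto stop = std::chrono::steady_clock::now();                      \
        long long ns = static_cast<long long>(                             \
            std::chrono::duration_cast<std::chrono::nanoseconds>(          \
                stop - start).count());                                    \
        std::printf("Timing test %s: %lld ns for %d calls\n",              \
                    #NAME, ns, (REPEATS));                                 \
        return ns;                                                         \
    }

// Expands to the definition of TimeLoop_Work().
// No trailing ';' here: the macro already ends with '}'.
MAKE_TIMING_TEST(Work, work_under_test(), 1000)
```

Calling `TimeLoop_Work()` runs the loop once and prints the elapsed time under the test's own name. A second test only needs another `MAKE_TIMING_TEST` line with a different name.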

Some standard “C++” macro issues
A #define must be one line “by definition”
So cheat: use a final \, which says that the newline following the \ is not a “new-line character”
    #define FOO_MACRO(FEE, FUM) \
        /* Must have C like comments */ \
        /* # character means: turn parameter to string array */ \
        puts(#FEE); \
        /* ## character means: concatenate parameter */ \
        DoLoop##FUM( ); \
        /* Watch out for trailing ; and } : may be required / definitely not wanted */
A macro that breaks over 2 lines without a final \ is ILLEGAL
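The # and ## operators described above can be tried out in a few lines of standard C++. This is a minimal sketch: `DoLoopLeft`, `DoLoopRight`, `NAME_OF` and `run_foo_examples` are made-up names, and the do / while (0) wrapper is a common idiom added here so the macro behaves like a single statement.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical loop functions; each records that it ran.
static std::string g_log;
void DoLoopLeft()  { g_log += "L"; }
void DoLoopRight() { g_log += "R"; }

// # turns the argument text into a string literal.
#define NAME_OF(X) #X

// Every line except the last ends in '\' with nothing after it.
#define FOO_MACRO(FEE, FUM)                                        \
    do {                                                           \
        /* #FEE becomes a string literal */                        \
        std::puts(#FEE);                                           \
        /* DoLoop##FUM is pasted into one function name */         \
        DoLoop##FUM();                                             \
    } while (0)

void run_foo_examples() {
    FOO_MACRO(left channel test, Left);    // prints the text, calls DoLoopLeft()
    FOO_MACRO(right channel test, Right);  // prints the text, calls DoLoopRight()
    assert(g_log == "LR");
}
```

The trailing `;` is supplied at the call site rather than inside the macro, which is one way to handle the "may be required / definitely not wanted" problem noted in the comment on the slide.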

Using macros
Learning how to do the concatenation and print formatting macros took me about 10 times as long as just cut-and-pasting
In the labs: you use test macros at your own risk; the T.A.s and I will not help you debug them
In the exams: you can’t use macros
Please note, I have defined macros and am now using them
Exam macro: PLEASE_ANSWER_EXAMQUESTION_FOR_ME( ) causes the marker macro ZERO_OUT_OF_100( ) to be activated
Personal opinion: learn the concept for use at a later time; don’t worry about macros in the labs

Tackled today
Problems with using the I-ALU as an “integer” processor
TigerSHARC processor architecture
What features are available for DSP optimization, and what do we “have to worry about” when using these features?
Moving DCremoval( ) over to the X Compute block
Using test macros: useful to know, but a real time waster for the labs in this class