Software and Hardware Circular Buffer Operations

Presentation transcript:

Software and Hardware Circular Buffer Operations. M. R. Smith, ECE, University of Calgary, Canada

Tackled today: Have moved DCRemoval( ) over to the X compute block. Circular buffer issues: DCRemoval( ), FIR( ). Coding a software circular buffer in C++ and TigerSHARC assembly code. Coding a hardware circular buffer. Where to next? 11/14/2018 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

DCRemoval( ): Not as complex as FIR, but with many of the same requirements: memory intensive, addition intensive, loops for the main code, and a FIFO implemented as a circular buffer. Because it is less complex than FIR it is easier to handle, and you use the same ideas when optimizing FIR over Labs 2 and 3. There are two issues, speed and accuracy: develop suitable tests for the C++ code, then check that the various assembly language versions satisfy the same tests.

Next stage in improving code speed: software circular buffers. Stages of the code: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return. Instruction counts in the order listed on the slide: 2; 4; 4 + 5N; 1 (was 1 + 2 log2(N)); 6; 3 + 6N. Total: 23 + 11N (was 22 + 11N + 2 log2(N)). For N = 128: 1430 instructions; 1430 + 300 delay cycles = 1730 cycles.

DCRemoval( ) FIFO implemented as circular buffer: If there are N points in the circular buffer, then moving the data from memory location to memory location requires N memory reads and N memory writes (with possible data bus conflicts), plus 2N memory address calculations.
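A minimal C++ sketch of the data-moving FIFO update described on this slide. The names (insert_by_shifting, sum_fifo) and the short length N = 4 are assumptions made for illustration and are not taken from the slides, which use N = 128. Every insert touches every stored element, which is where the N reads and N writes come from.

```cpp
#include <cstddef>

// Hypothetical FIFO length, kept small for illustration.
constexpr std::size_t N = 4;

// Insert a new sample by moving every stored value down one slot.
// The newest element ends up latest in the array.
// Cost per call: roughly N memory reads and N memory writes.
void insert_by_shifting(int (&fifo)[N], int new_value) {
    for (std::size_t i = 0; i + 1 < N; ++i) {
        fifo[i] = fifo[i + 1];   // one read and one write per element
    }
    fifo[N - 1] = new_value;     // newest sample goes last
}

// Sum of all FIFO entries, as needed for a DC (average) estimate.
int sum_fifo(const int (&fifo)[N]) {
    int sum = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sum += fifo[i];
    }
    return sum;
}
```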

Alternative approach: Move pointers rather than memory values. In principle this needs 1 memory read, 1 memory write, a pointer addition and a conditional equate.
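The pointer-moving idea can be sketched in C++ as follows. This is only an illustration under stated assumptions: the struct SoftwareCircularBuffer, the function insert_by_pointer and the length kLength = 4 are hypothetical names, not the code used in the lectures.

```cpp
#include <cstddef>

constexpr std::size_t kLength = 4;  // hypothetical FIFO length

// The data never moves; only the "next" pointer does.
// Note: copying this struct would leave "next" pointing into the original.
struct SoftwareCircularBuffer {
    int data[kLength] = {};
    int* next = data;   // location of the oldest sample, overwritten next
};

// Store one sample and return the value it replaced.
int insert_by_pointer(SoftwareCircularBuffer& cb, int new_value) {
    int oldest = *cb.next;               // 1 memory read
    *cb.next = new_value;                // 1 memory write
    ++cb.next;                           // pointer addition
    if (cb.next == cb.data + kLength) {
        cb.next = cb.data;               // conditional equate (wrap to start)
    }
    return oldest;
}
```

The cost per insert no longer depends on the buffer length, but every insert now pays for the pointer check.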

Note: A software circular buffer is NOT necessarily more efficient than data moves. Watch out: my version of FIR uses a different sort of circular buffer. FIR FIFO: newest element earliest in the array (matching the FIR equation). DCremoval FIFO: newest element latest in the array, because that is the way I thought of it.
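To show why "newest element earliest" matches the FIR equation, here is a small C++ sketch. It assumes the usual FIR form y[n] = sum over k of h[k] * x[n - k], which the slide refers to but does not write out. The names (push_newest_first, fir_output, kTaps) are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kTaps = 3;  // hypothetical filter length

// Keep the newest sample at index 0, so that fifo[k] holds x[n - k].
void push_newest_first(int (&fifo)[kTaps], int x) {
    for (std::size_t k = kTaps - 1; k > 0; --k) {
        fifo[k] = fifo[k - 1];    // older samples move towards the end
    }
    fifo[0] = x;
}

// With this ordering the loop index k lines up directly with the
// coefficient index in y[n] = sum_k h[k] * x[n - k].
int fir_output(const int (&coeffs)[kTaps], const int (&fifo)[kTaps]) {
    int y = 0;
    for (std::size_t k = 0; k < kTaps; ++k) {
        y += coeffs[k] * fifo[k];
    }
    return y;
}
```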

Note: A software circular buffer is NOT necessarily more efficient than data moves. Are we now spending more time moving and checking the software circular buffer pointers than we spent moving the data? (Slide labels: SLOWER, FASTER.)

On TigerSHARC: Since we can have multiple instructions on one line, then "perhaps", if we can avoid pipeline delays, the software circular buffer is faster than memory moves. Pipeline delay: XR4 = R4 + R5;; XR4 = R4 + R6;; (the second instruction needs the result of the first). No pipeline delay: XR3 = R4 + R6;; (the second instruction DOES NOT need the result of the first).

Generate the tests for the software circular buffer routine

New static pointers are needed in the software circular buffer code

New sets of register defines: now using many of the TigerSHARC registers

Code for storing a new value into the FIFO requires knowledge of the “next-empty” location. First, you must get the address of where the static variable saved_next_pointer is stored. Second, you must access that address to get the actual pointer. Third, you must use the pointer value. This will be a problem in labs and exams with static variables stored in memory.
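The three steps can be made explicit in C++. A compiler normally hides the first two, but in assembly each one is a separate operation. This is a sketch only: saved_next_pointer is the name from the slide, while fifo, its length of 4 and store_value are assumptions for illustration.

```cpp
// Hypothetical FIFO storage and the static pointer kept in memory.
static int fifo[4];
static int* saved_next_pointer = fifo;

// Store a value at the "next-empty" location, spelling out each step.
void store_value(int value) {
    int** address_of_static = &saved_next_pointer;  // 1: address of the static variable
    int* next_empty = *address_of_static;           // 2: read memory to get the actual pointer
    *next_empty = value;                            // 3: use the pointer value
}
```

Mixing up step 1 and step 2, that is, treating the address of the static variable as if it were the FIFO pointer itself, is the mistake this slide warns about.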

Adjustment of the software circular buffer pointer must be done carefully: get and update the pointer, check the pointer, save the corrected pointer.

Next stage in improving code speed: software circular buffers. Stages of the code: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return. Instruction counts in the order listed on the slide: 2; 8 (was 4); 4 + 5N; 1 (was 1 + 2 log2(N)); 6; 14 (was 3 + 6N). Total: 37 + 5N (was 23 + 11N). For N = 128: 677 instructions; 677 + 360 delay cycles = 1011 cycles (was 1430 + 300 delay cycles = 1730 cycles).

Next step: hardware circular buffer. Do exactly the same pointer calculations as with software circular buffers, but now the calculations are done behind the scenes, at high speed, using specialized pointer features. Only available with the J0, J1, J2 and J3 registers (on the older ADSP-21061, all pointer registers). Jx: the pointer register. JBx: the BASE register, set to the start of the FIFO array. JLx: the length register, set to the length of the FIFO array. VERY BIG WARNING? Reset the length register to zero. On the older ADSP-21061 it was very important that the length register be reset to zero, otherwise all the other functions using this register would suddenly start using circular buffering by mistake. Resetting is still advisable, although special syntax is now needed to cause circular buffer operations to occur.
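A small C++ model can show what the base and length registers contribute. This is a software illustration only and not TigerSHARC code. The struct CircularPointer and the function post_modify are hypothetical names, and the model assumes the modifier is never larger in magnitude than the buffer length.

```cpp
#include <cstdint>

// Software model of one circular-buffer register set.
struct CircularPointer {
    std::uint32_t index;   // models Jx:  current address
    std::uint32_t base;    // models JBx: start of the FIFO array
    std::uint32_t length;  // models JLx: length of the FIFO array
};

// Post-modify access: return the current address, then advance the
// pointer and wrap it back into [base, base + length).
std::uint32_t post_modify(CircularPointer& p, std::int32_t modifier) {
    const std::uint32_t current = p.index;
    std::int64_t next = static_cast<std::int64_t>(p.index) + modifier;
    const std::int64_t end = static_cast<std::int64_t>(p.base) + p.length;
    if (next >= end) {
        next -= p.length;          // ran off the end: wrap to the start
    } else if (next < p.base) {
        next += p.length;          // ran off the start: wrap to the end
    }
    p.index = static_cast<std::uint32_t>(next);
    return current;
}
```

In this model the check and correction that the software version spends instructions on are folded into the address update itself.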

Setting up the circular buffer functions: remember all the tests to start with.

Store values into the hardware FIFO: the CB instruction ONLY works on POST-MODIFY operations.

Now perform the math operation using circular buffer operations. MUST NOT DO: XR2 = CB [J0 + i_J8]; (this form is not a post-modify operation). This saves N cycles, as there is no longer a need to increment the index.

Update the static variables: further special CB instructions. A few cycles saved here.

Next stage in improving code speed: hardware circular buffers. Stages of the code: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return. Instruction counts in the order listed on the slide: 2; 8 (was 4); 3 + 4N (was 4 + 5N); 1 (was 1 + 2 log2(N)); 6; 14 (was 3 + 6N). Total: 37 + 4N (was 37 + 5N). For N = 128: 549 instructions; 549 + 300 delay cycles = 879 cycles (was 677 + 360 delay cycles = 1011 cycles). Delays are now >50% of useful time.
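As a worked check of the instruction totals, the two formulas 37 + 5N (software circular buffer) and 37 + 4N (hardware circular buffer) can be evaluated at N = 128. The function names below are hypothetical. Only instruction counts are modelled, not the delay cycles.

```cpp
// Instruction-count formulas quoted in the slides.
constexpr int software_cb_instructions(int n) { return 37 + 5 * n; }
constexpr int hardware_cb_instructions(int n) { return 37 + 4 * n; }

// Instructions saved by the hardware version: one per loop iteration.
constexpr int instructions_saved(int n) {
    return software_cb_instructions(n) - hardware_cb_instructions(n);
}

// Compile-time checks against the figures quoted for N = 128.
static_assert(software_cb_instructions(128) == 677, "37 + 5 * 128");
static_assert(hardware_cb_instructions(128) == 549, "37 + 4 * 128");
static_assert(instructions_saved(128) == 128, "one instruction per iteration");
```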

Tackle the summation part of FIR: an exercise in using CB (Assignment 2).

Place assembly code here

The code is too slow because we are not taking advantage of the available resources: bring in up to 128 bits (4 instructions) per cycle; bring in 4 32-bit values along the J data bus (data1) and 4 along the K bus (data2); perform address calculations in the J and K ALUs, with single-cycle hardware circular buffers; perform math operations on both the X and Y compute blocks; background DMA activity; off-load some of the processing to the second processor.

Tackled today: Have moved DCRemoval( ) over to the X compute block. Circular buffer issues: DCRemoval( ), FIR( ). Coding a software circular buffer in C++ and TigerSHARC assembly code. Coding a hardware circular buffer. Where to next?