Trying to avoid pipeline delays

Presentation transcript:

Trying to avoid pipeline delays
Interleaving two sets of operations
XY compute block

Tackled today
- Review of coding a hardware circular buffer
- Roughly understanding where pipeline delays may occur
- “Refactoring” the working code to improve its speed without spending any time examining whether the delays are really there (the “works at the moment” principle)
- “Refactoring” the working code to perform operations using both the X and Y ALUs, in principle twice the speed
12/1/2018 Software Circular Buffer Issues, M. Smith, ECE, University of Calgary, Canada

DCRemoval( )
- Not as complex as FIR, but with many of the same requirements: memory intensive, addition intensive, loops for the main code, FIFO implemented as a circular buffer
- Easier to handle
- You use the same ideas in optimizing FIR over Labs 2 and 3
- Two issues: speed and accuracy
- Develop suitable tests for the CPP code and check that the various assembly language versions satisfy the same tests

Alternative approach
- Move pointers rather than memory values
- In principle: 1 memory read, 1 memory write, pointer addition, conditional equate
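The pointer-moving idea can be sketched in Python. This is a behavioural model only, not TigerSHARC code; the class name `SoftwareCircularBuffer` and its `insert` method are hypothetical names chosen for illustration, and the wrap test stands in for the "conditional equate".

```python
class SoftwareCircularBuffer:
    """Fixed-length FIFO that moves an index instead of moving the data."""

    def __init__(self, length):
        self.data = [0] * length   # FIFO storage, initially all zeros
        self.index = 0             # position of the oldest value

    def insert(self, value):
        """Overwrite the oldest value with the newest and return the oldest."""
        oldest = self.data[self.index]       # 1 memory read
        self.data[self.index] = value        # 1 memory write
        self.index += 1                      # pointer addition
        if self.index == len(self.data):     # conditional equate: wrap to start
            self.index = 0
        return oldest
```

Each insert costs the same four steps whatever the FIFO length, whereas shifting the data costs one move per stored value.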

Note: Software circular buffer is NOT necessarily more efficient than data moves
- Are we now spending more time on moving / checking the software circular buffer pointers than on moving the data?
[Figure labels: SLOWER, FASTER]

Next step – Hardware circular buffer
- Do exactly the same pointer calculations as with software circular buffers, but now the calculations are done behind the scenes, at high speed, using specialized pointer features
- Only available with the J0, J1, J2 and J3 registers (on the older ADSP-21061, all pointer registers)
- Jx: the pointer register
- JBx: the BASE register, set to the start of the FIFO array
- JLx: the LENGTH register, set to the length of the FIFO array
- VERY BIG WARNING: reset the length register to zero. On the older ADSP-21061 it was very important that the length register be reset to zero; otherwise all the other functions using this register would suddenly start using circular buffers by mistake. Still advisable, but special syntax is now needed to cause circular buffer operations to occur
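The pointer arithmetic that the hardware performs behind the scenes can be modelled in a few lines of Python. This is a sketch under stated assumptions, not a description of the actual silicon: the function names are hypothetical, `base` and `length` play the roles of JBx and JLx, and treating a zero length as plain linear addressing is an assumption made for the model.

```python
def cb_post_modify(pointer, modifier, base, length):
    """Return the pointer value after a circular post-modify update."""
    if length == 0:
        # Assumption for this model: zero length means no wrap-around.
        return pointer + modifier
    # Wrap the offset from the base back into the range 0 .. length-1.
    return base + (pointer - base + modifier) % length


def cb_read(memory, pointer, modifier, base, length):
    """Model of a read such as XR5 = CB [J0 += 1]: read first, then update."""
    value = memory[pointer]
    return value, cb_post_modify(pointer, modifier, base, length)
```

For example, with `base = 4` and `length = 3`, three successive reads with a modifier of 1 visit addresses 4, 5 and 6 and leave the pointer back at 4.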

Store values into hardware FIFO
- The CB instruction ONLY works on POST-MODIFY operations

Next stage in improving code speed: hardware circular buffers
Stages, in order: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return
Instruction counts as listed: 2; 8 (was 4); 3 + N * 4 (was 4 + N * 5); 1 (was 1 + 2 * log2N); 6; 14 (was 3 + 6 * N)
Total: 37 + 4 N (was 23 + 5 N)
- N = 128: instructions = 549 cycles
- 549 + 300 delay cycles = 879 cycles (was 677 + 360 delay cycles = 1011 cycles)
- Delays are now >50% of useful time

On TigerSHARC – Pipeline issue
- After you issue the command to read from memory, you must then wait for the value to arrive
- Problem: may be trading memory wait delays for I-ALU delays
    XR5 = CB [J0 += 1];;
    XR4 = R4 + R5;;
    XR6 = CB [J1 += 1];;
    XR7 = R7 + R6;;
[Slide annotations on the code: “Memory pipeline delay”, “No memory pipeline delay”]

Now perform the math operation using the circular buffer operation
- Note the possible memory delays
- Does the memory cache help?
- Wait for the read of R2, use it, then wait for the read of R3 and then use it

Simple interleaving of code
- Possible saving of memory delays
[Figure: original order 1 2 3 4, compared with the new order]
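Why re-ordering can save delays may be easier to see with a toy in-order pipeline model in Python. Everything here is an assumption made for illustration: the function name `count_stalls`, the two-cycle load latency (not the real TigerSHARC figure), and the choice of 1, 3, 2, 4 as one possible new order. Only the data registers are tracked; the pointer registers are ignored.

```python
def count_stalls(program, load_latency=2):
    """Count stall cycles for a list of (dest, sources, is_load) instructions.

    An ALU result can be used by the very next instruction; a loaded value
    needs load_latency extra cycles (assumed value) before it can be used.
    """
    ready = {}    # register name -> first cycle in which its value is usable
    cycle = 0     # next free issue slot
    stalls = 0
    for dest, sources, is_load in program:
        # The instruction cannot issue until all of its sources are ready.
        start = max([cycle] + [ready.get(src, 0) for src in sources])
        stalls += start - cycle
        ready[dest] = start + 1 + (load_latency if is_load else 0)
        cycle = start + 1
    return stalls


load_r5 = ("R5", [], True)              # 1: XR5 = CB [J0 += 1];;
add_r4 = ("R4", ["R4", "R5"], False)    # 2: XR4 = R4 + R5;;
load_r6 = ("R6", [], True)              # 3: XR6 = CB [J1 += 1];;
add_r7 = ("R7", ["R7", "R6"], False)    # 4: XR7 = R7 + R6;;

original_order = [load_r5, add_r4, load_r6, add_r7]      # 1 2 3 4
interleaved_order = [load_r5, load_r6, add_r4, add_r7]   # 1 3 2 4
```

Under these assumptions `count_stalls(original_order)` gives 4 stall cycles and `count_stalls(interleaved_order)` gives 1: the second read is issued while the first is still in flight, so most of the waiting overlaps with useful work.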

Interleaving of code
Same instructions, different order
Stages, in order: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return
Instruction counts as listed: 2; 8 (was 4); 3 + N * 4 (was 4 + N * 5); 1 (was 1 + 2 * log2N); 6; 14 (was 3 + 6 * N)
Total: 37 + 4 N (was 23 + 5 N)
- N = 128: instructions = 549 cycles
- 549 + 50 delay cycles = 594 cycles; delays are now 10% of useful time
- Was 549 + 300 delay cycles = 879 cycles; delays were >50% of useful time

The code is too slow because we are not taking advantage of the available resources
- Bring in up to 128 bits (4 instructions) per cycle
- Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K data bus (data2)
- Perform address calculations in the J and K ALUs: single-cycle hardware circular buffers
- Perform math operations on both the X and Y compute blocks
- Background DMA activity
- Off-load some of the processing to the second processor

Understanding how to use MIMD mode
Process left filter in X-Compute, right in Y
    XR6 = 0;;     // Puts 0 into XR6 register
    YR6 = 0;;     // Puts 0 into YR6 register
    XYR6 = 0;;    // Puts 0 into XR6 and YR6 at the same time
1 instruction saved

Understanding how to use MIMD mode
Process left filter in X-Compute, right in Y
    XR6 = R6 + R2;;     // Adds XR6 + XR2 registers
    YR6 = R6 + R2;;     // Adds YR6 + YR2 registers
    XYR6 = R6 + R2;;    // Adds XR6 + XR2, AND YR6 + YR2 at the same time
N instructions saved

Understanding how to use MIMD mode
Process left filter in X-Compute, right in Y
    XR6 = ASHIFT R6 BY -7;;     // XR6 = XR6 >> 7
    YR6 = ASHIFT R6 BY -7;;     // YR6 = YR6 >> 7
    XYR6 = ASHIFT R6 BY -7;;    // XR6 = XR6 >> 7 and YR6 = YR6 >> 7 at the same time
1 instruction saved

Final operation – dual subtraction
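Putting the dual zero, dual add, dual shift and dual subtraction together, the two-channel calculation might look like the Python model below. The slides do not spell out the DCRemoval( ) algorithm, so this is a sketch that assumes the DC estimate is the average of a 128-value FIFO and that the output is the newest sample minus that average. The name `dc_removal_pair` is hypothetical, and the FIFO is updated here by moving data rather than by a circular buffer, purely to keep the sketch short.

```python
def dc_removal_pair(left_in, right_in, left_fifo, right_fifo, shift=7):
    """Process one left/right sample pair; the FIFO lists are updated in place."""
    # Update FIFO: drop the oldest value and append the newest.
    for fifo, value in ((left_fifo, left_in), (right_fifo, right_in)):
        fifo.pop(0)
        fifo.append(value)

    # XYR6 = 0: clear both sums together.
    x_sum = 0
    y_sum = 0

    # SUM LOOP: one pass handles both channels, like XYR6 = R6 + R2.
    for x_val, y_val in zip(left_fifo, right_fifo):
        x_sum += x_val
        y_sum += y_val

    # XYR6 = ASHIFT R6 BY -7: arithmetic shift right divides by 128.
    x_avg = x_sum >> shift
    y_avg = y_sum >> shift

    # Final operation: dual subtraction.
    return left_in - x_avg, right_in - y_avg
```

With both FIFOs holding 128 zeros, feeding a constant pair such as (128, -256) gives outputs that settle to (0, 0) once the FIFOs have filled, which is the behaviour expected when a pure DC offset is removed.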

MIMD mode
Stages, in order: set up pointers to buffers; insert values into buffers; SUM LOOP; SHIFT LOOP; update outgoing parameters; update FIFO; function return
Instruction counts as listed: 2; 8 (was 4); 3 + N * 3 (was 4 + N * 5); 1 (was 1 + 2 * log2N); 6
Total: 37 + 3 N (was 37 + 4 N)
- N = 128: instructions = 421 cycles
- 421 + 180 delay cycles = 590 cycles; delays are now 50% of useful time
- Was 549 + 50 delay cycles = 594 cycles; delays were 10% of useful time
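The instruction-count totals quoted on the slides can be evaluated directly. The short Python sketch below uses hypothetical function names; the formulas 37 + 4 N and 37 + 3 N and the delay-cycle figures passed in are taken from the slides, and the ratio is simply delay cycles divided by instruction cycles.

```python
def hardware_cb_cycles(n):
    """Instruction cycles for the hardware circular buffer version: 37 + 4 N."""
    return 37 + 4 * n


def mimd_cycles(n):
    """Instruction cycles for the MIMD (X and Y compute block) version: 37 + 3 N."""
    return 37 + 3 * n


def delay_fraction(delay_cycles, instruction_cycles):
    """Delay cycles expressed as a fraction of the useful instruction cycles."""
    return delay_cycles / instruction_cycles
```

For N = 128 this reproduces the 549 and 421 instruction cycles on the slides. The per-iteration saving from using both compute blocks is one instruction, so the gain is easily swamped if the delay cycles grow again.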

Why no improvement?
- Extra delays from where?
- Back to having to wait for R2 to come in from memory before the sum can occur

The code is too slow because we are not taking advantage of the available resources
- Bring in up to 128 bits (4 instructions) per cycle
- Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K data bus (data2)
- Perform address calculations in the J and K ALUs: single-cycle hardware circular buffers
- Perform math operations on both the X and Y compute blocks
- Background DMA activity
- Off-load some of the processing to the second processor

Multiple data busses
- Many issues to solve before we can bring in 8 data values per cycle
- Are the data values aligned so that we can access 4 values at once?
- If they are not aligned, what can you do?
- One step at a time (next lecture)
- Let us bring in 1 value along the J data bus and another along the K data bus

Exercise on handling interleaving of instructions and X-Y compute operations

Tackled today
- Review of coding a hardware circular buffer
- Roughly understanding where pipeline delays may occur
- “Refactoring” the working code to improve its speed without spending any time examining whether the delays are really there (the “works at the moment” principle)
- “Refactoring” the working code to perform operations using both the X and Y ALUs, in principle twice the speed