Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 2 Understanding the pipeline.

Presentation transcript:

Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 2 Understanding the pipeline

Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada 2 / 32 11/2/2015
Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- If these pipelines stall, then the processor speed goes down
- Need to understand how the ALU pipeline works
  - Learn to use the pipeline viewer
- Understanding what the pipeline viewer tells in detail
  - Avoiding having to use the pipeline viewer
- Improving code efficiency
  - Excel and Project (Gantt charts) are useful tools

Register File and COMPUTE Units

Simple Example IIR -- Biquad
For (Stages = 0 to 3) Do
  S0 = Xin * H5 + S2 * H3 + S1 * H4
  Yout = S0 * H0 + S1 * H1 + S2 * H2
  S2 = S1
  S1 = S0
(Diagram on the slide labels the state values S0, S1 and S2.)
- Horrible IIR code example, as it can't be re-used in a loop
- Works as a simple example for understanding the TigerSHARC pipeline
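The update equations above can be sketched in Python to check the arithmetic before moving to assembly. This is a minimal sketch of a single pass only. The function name `biquad_stage`, the coefficient list `h` (standing for H0 to H5) and the sample values are illustrative assumptions, not taken from the slides.

```python
def biquad_stage(x_in, s1, s2, h):
    """One pass of the slide's biquad update; h is the list [H0..H5]."""
    s0 = x_in * h[5] + s2 * h[3] + s1 * h[4]
    y_out = s0 * h[0] + s1 * h[1] + s2 * h[2]
    # S2 = S1, S1 = S0: shift the state down ready for the next sample
    return y_out, s0, s1


# Illustrative coefficients and state (assumed values, chosen for easy arithmetic)
h = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_out, s1, s2 = biquad_stage(1.0, 1.0, 1.0, h)
```

Each line of the update maps onto a small group of multiplies and adds, which is why the same equations are a convenient test case for counting ALU cycles.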

Code returns a float when using the XR8 register. NOTE: NOT XFR8

Step 2: Using C++ code as comments, set up the coefficients
  XFR0 = 0.0;;    Does not exist
  XR0 = 0.0;;     DOES EXIST
- Bit-patterns require integer registers
- Leave what you wanted to do behind as comments
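The point that a float constant is loaded as a bit-pattern can be illustrated with a short Python sketch. It assumes the constants are stored as 32-bit IEEE-754 single-precision values. The helper name `float_bits` is hypothetical, and the values 0.0, 1.0 and 5.5 are the constants that appear on these slides.

```python
import struct


def float_bits(value):
    """Return the 32-bit IEEE-754 single-precision pattern of value as an int."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


# Hex bit-patterns that an integer register would hold for each constant
patterns = {v: f"0x{float_bits(v):08X}" for v in (0.0, 1.0, 5.5)}
```

Seen this way, `XR0 = 1.0` is an integer-register load of the pattern 0x3F800000, which is one way to read the slide's note that bit-patterns require integer registers.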

Expect to take 8 cycles to execute

PIPELINE STAGES
See page 8-34 of Processor manual
10 pipeline stages, but may be completely desynchronized (happen semi-independently)
- Instruction fetch: F1, F2, F3 and F4
- Integer ALU: PreDecode, Decode, Integer, Access
- Compute Block: EX1 and EX2

Pipeline Viewer Result
XR0 = 1.0
- enters PD at cycle 39025
- enters E2 stage at cycle [...]
- is stored into XR0 at cycle [...]
- [...] cycles execution time

Pipeline Viewer Result
XR6 = 5.5
- enters PD stage at cycle [...]
- enters E2 stage at cycle [...]
- is stored into XR6 at cycle [...]
- [...] cycles execution time
Each instruction takes 7 cycles, but there is one new result each cycle
Result: ONCE the pipeline is filled, 8 cycles = 8 register transfer operations
Key: don't break the pipeline with any jumps
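The difference between per-instruction latency and throughput can be sketched numerically. This is a minimal model, assuming that an unstalled instruction's result is stored `latency` cycles after it enters PD and that one instruction enters PD per cycle. The function name `completion_cycles` is hypothetical. The start cycle 39025 is the PD entry quoted for `XR0 = 1.0`.

```python
def completion_cycles(first_pd_cycle, n_instructions, latency=7):
    """Cycle at which each instruction's result is stored (no stalls assumed)."""
    return [first_pd_cycle + i + latency for i in range(n_instructions)]


done = completion_cycles(39025, 8)
# Number of consecutive cycles over which the 8 results are delivered
span = done[-1] - done[0] + 1
```

Under these assumptions the eight results arrive on eight consecutive cycles, even though each individual instruction spends seven cycles in flight.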

Doing filter operations generates different results
- XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored [...], 7 cycles
- XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored [...], 7 cycles
- XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored [...], 8 cycles
WHY? FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL
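The extra cycle can be read directly off the pipeline viewer numbers. The sketch below assumes that an unstalled instruction advances one stage per cycle, so EX2 is entered five cycles after PD (PD, D, I, A, EX1, EX2). The cycle numbers are the ones quoted above, and the helper name `stall_cycles` is hypothetical.

```python
STAGES_FROM_PD_TO_EX2 = 5  # PD -> D -> I -> A -> EX1 -> EX2

# (instruction, cycle entering PD, cycle entering EX2) as quoted on the slide
observed = [
    ("XR8 = XR6", 39833, 39838),
    ("XFR23 = R9 * R4", 39834, 39839),
    ("XFR0 = R0 + R23", 39835, 39841),
]


def stall_cycles(pd_cycle, ex2_cycle):
    """Cycles lost between PD and EX2 compared with an unstalled instruction."""
    return (ex2_cycle - pd_cycle) - STAGES_FROM_PD_TO_EX2


stalls = {text: stall_cycles(pd, ex2) for text, pd, ex2 in observed}
```

Only the addition that consumes R23 shows a lost cycle, which points at the data dependency on the preceding multiply.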

Instruction 0x17e XFR8 = R8 + R23 is STALLED (waiting) for instruction 0x17d XFR23 = R8 * R4 to complete
- Bubble B means that the pipeline is doing "nothing"
- Meaning that the instruction shown is a "place holder" (garbage)

Information on Window Event Icons

Result of Analysis
Can't use a float result immediately after calculation
Writing
  XFR23 = R8 * R4;;
  XFR8 = R8 + R23;;    // MUST WAIT FOR XFR23 calculation to be completed
is the same as coding
  XFR23 = R8 * R4;;
  NOP;;                // Note DOUBLE ;; (extra cycle because of stall)
  XFR8 = R8 + R23;;
Proof: write the code with the stalls shown in it
- Writing this way means we don't have to use the pipeline viewer all the time
- Pipeline viewer is only available with the (slow) simulator
- #define SHOW_ALU_STALL nop
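The "stall equals explicit NOP" claim can be checked with a small cycle counter. This is a rough sketch, not the simulator's rule set: it assumes exactly one stall cycle whenever a line reads a register that the line immediately before it computed with `*`, `+` or `-`, and it matches registers by number only (so XFR23 and R23 are treated as the same register). The function name `count_cycles` is hypothetical.

```python
import re


def count_cycles(lines):
    """Return (total cycles, stall cycles) under the assumed one-stall rule."""
    issue_cycles = 0
    stall_cycles = 0
    pending = None  # register number computed by the previous line, if any
    for line in lines:
        issue_cycles += 1
        if "=" not in line:  # e.g. NOP: uses a cycle, produces nothing
            pending = None
            continue
        dest, expr = (part.strip() for part in line.split("=", 1))
        sources = re.findall(r"R(\d+)", expr)
        if pending is not None and pending in sources:
            stall_cycles += 1
        is_compute = any(op in expr for op in "*+-")
        pending = re.search(r"R(\d+)$", dest).group(1) if is_compute else None
    return issue_cycles + stall_cycles, stall_cycles


with_stall = count_cycles(["XFR23 = R8 * R4", "XFR8 = R8 + R23"])
with_nop = count_cycles(["XFR23 = R8 * R4", "NOP", "XFR8 = R8 + R23"])
```

Both versions come out at three cycles in this model: the two-line version pays one hidden stall, while the three-line version spends the same cycle on a visible NOP.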

Code with stalls shown
- 8 code lines
- 5 expected stalls
- Expect 13 cycles to complete if theory is correct

Analysis approach IS correct
Same speed with and without nops

Process for coding for improved speed: code re-organization
- Make a copy of the code so you can test iirASM( ) and iirASM_Optimized( ) to make sure you get the correct result
- Make a table of code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
- Identify data dependencies
- Make all "temp operations" use different registers
- Move instructions "forward" to fill delay slots, BUT don't break data dependencies

Copy and paste to make IIRASM_Optimized( )

Need to re-order instructions to fill delay slots with useful instructions
- After refactoring code to fill delay slots, must run tests to ensure that we still have the correct result
- Change, and then "retest"
- NOT EASY TO DO
- MUST HAVE A SYSTEMATIC PLAN TO HANDLE OPTIMIZATION
- I USE EXCEL

Show resource usage and data dependencies
- All temporary register usage involves the SAME XFR23 register
- This typically stalls out the processor

Change all temporary registers to use different register names
- Then check that the code produces the correct answer
- All temporary register usage involves a DIFFERENT register
- ALWAYS FOLLOW THIS PROCESS WHEN OPTIMIZING

Move instructions forward, without breaking data dependencies
- What appears possible!
- DO one thing at a time and then check that the code still works
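Whether an instruction may be moved above its predecessor can be checked mechanically. The sketch below tests the three classic hazards between two lines (read-after-write, write-after-read, write-after-write) and, as a simplification, matches registers by number only. The helper names `parse` and `can_move_above` and the example instructions are hypothetical, not the code from the slides.

```python
import re


def parse(line):
    """Split 'DEST = EXPR' into (written register number, set of read numbers)."""
    dest, expr = (part.strip() for part in line.split("=", 1))
    writes = re.search(r"R(\d+)$", dest).group(1)
    reads = set(re.findall(r"R(\d+)", expr))
    return writes, reads


def can_move_above(later, earlier):
    """True if 'later' can be hoisted above 'earlier' without changing results."""
    later_writes, later_reads = parse(later)
    earlier_writes, earlier_reads = parse(earlier)
    raw = earlier_writes in later_reads   # later needs earlier's result
    war = later_writes in earlier_reads   # later would clobber earlier's input
    waw = later_writes == earlier_writes  # final register value would change
    return not (raw or war or waw)


# Hypothetical pair: the next multiply re-uses the shared temporary XFR23 ...
shared_temp = can_move_above("XFR23 = R9 * R5", "XFR8 = R8 + R23")
# ... versus the same multiply writing to its own temporary XFR24
own_temp = can_move_above("XFR24 = R9 * R5", "XFR8 = R8 + R23")
```

With the shared temporary the move is blocked by a write-after-read conflict that has nothing to do with the arithmetic. Giving the multiply its own register removes that conflict, which is why the temporaries are renamed before any instructions are moved.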

Check that code still operates
- 1 cycle saved
- Have put "our" marker stall instruction in parallel with the moved instruction, using ; rather than ;;
- Move this instruction up in the code sequence to fill the delay slot
- Check that the code still runs after this optimization stage

Move next multiplication up.
NOTE: certain stalls remain, although the reason for the STALL changes from why they were inserted before

Move up the R10 and R9 assignment operations -- check 4 cycle improvement?

CHECK THE PIPELINE AFTER TESTING

Are there still more improvements possible? (I can see 4 more moves)

Problems with approach
- Identifying all the data dependencies
- Keeping track of how the data dependencies change as you move the code around
- Handling all of this "automatically"
I started the following design tool as something that might work, but it actually turned out very useful.
M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project", "Don't say 'CAN'T do it', say 'Gantt it'! The irony of organizing microprocessors with a big business tool", Circuit Cellar magazine, Vol. 184, pp , November 2005.

Using Microsoft Project – Step 1

Add dependencies and resource usage – then activate level
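What "levelling" does with dependencies and a shared resource can be approximated by a small greedy scheduler. This is only a sketch of the idea, not how Microsoft Project works internally. It assumes one instruction issue per cycle and that a result is usable two cycles after its instruction issues (one stall if used immediately). The function name `level`, the task names and the dependency graph are hypothetical.

```python
def level(tasks, result_delay=2):
    """Assign an issue cycle to each task.

    tasks: dict of name -> list of prerequisite names, listed so that every
    prerequisite appears before the tasks that depend on it.
    """
    start = {}
    used = set()  # cycles already occupied by an issued instruction
    for name, deps in tasks.items():
        earliest = max((start[d] + result_delay for d in deps), default=0)
        cycle = earliest
        while cycle in used:  # single issue slot: take the next free cycle
            cycle += 1
        used.add(cycle)
        start[name] = cycle
    return start


# Hypothetical fragment: two multiplies feeding two dependent additions
schedule = level({
    "mul1": [],
    "mul2": [],
    "add1": ["mul1"],
    "add2": ["mul2", "add1"],
})
total_cycles = max(schedule.values()) + 1
```

In this example the second multiply lands in the cycle that would otherwise be a stall in front of the first addition, which is the same kind of move made by hand on the earlier slides.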

Microsoft Project as a microprocessor design tool
Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays

Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- If these pipelines stall, then the processor speed goes down
- Need to understand how the ALU pipeline works
  - Learn to use the pipeline viewer
- Understanding what the pipeline viewer tells in detail
  - Avoiding having to use the pipeline viewer
- Improving code efficiency
  - Excel and Project (Gantt charts) are useful tools