Understanding the TigerSHARC ALU pipeline

Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of IIR filter – Part 2: Understanding the pipeline

Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines. If these pipelines stall, the processor speed goes down
- Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
- Understanding what the pipeline viewer tells in detail
- Avoiding having to use the pipeline viewer
- Improving code efficiency
- Excel and Project (Gantt charts) are useful tools
1/2/2019 Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

Register File and COMPUTE Units

Simple Example IIR -- Biquad
For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0
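The biquad update above can be sketched in C++ as a reference to test the assembly version against. This is only a minimal sketch of a single stage: the names `BiquadState` and `biquadStage`, and the packing of H0..H5 into one array, are assumptions made for illustration and are not taken from the slides.

```cpp
#include <array>
#include <cassert>

// State of one biquad stage: S1 and S2 are the two delayed intermediate values.
// (Hypothetical type name, not from the slides.)
struct BiquadState {
    float s1;
    float s2;
};

// One update of a single biquad stage, following the slide's equations.
// h[0]..h[5] stand for H0..H5 on the slide (assumed layout).
inline float biquadStage(float xin, const std::array<float, 6>& h, BiquadState& st) {
    const float s0 = xin * h[5] + st.s2 * h[3] + st.s1 * h[4];
    const float yout = s0 * h[0] + st.s1 * h[1] + st.s2 * h[2];
    st.s2 = st.s1;  // S2 = S1
    st.s1 = s0;     // S1 = S0
    return yout;
}
```

Note that `Yout` uses the old values of S1 and S2, so the state update must come after both sums have been computed. The same ordering constraint applies to the assembly code.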

Code returns a float using the XR8 register – NOTE: NOT XFR8

Step 2 – Using C++ code as comments, set up the coefficients
- XFR0 = 0.0;;   Does not exist
- XR0 = 0.0;;    DOES EXIST
- Bit-patterns require integer registers
- Leave what you wanted to do behind as comments

Expect to take 8 cycles to execute

PIPELINE STAGES
See page 8-34 of Processor manual
- 10 pipeline stages, but may be completely desynchronized (happen semi-independently)
- Instruction fetch -- F1, F2, F3 and F4
- Integer ALU -- PreDecode, Decode, Integer, Access
- Compute Block -- EX1 and EX2

Pipeline Viewer Result
- XR0 = 1.0 enters PD stage at cycle 39825, enters EX2 stage at cycle 39830, is stored into XR0 at cycle 39831 -- 7 cycles execution time

Pipeline Viewer Result
- XR6 = 5.5 enters PD stage at cycle 39832, enters EX2 stage at cycle 39837, is stored into XR6 at cycle 39838 -- 7 cycles execution time
- Each instruction takes 7 cycles, but there is one new result each cycle
- Result – once the pipeline is filled, 8 cycles = 8 register transfer operations
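The arithmetic behind "7 cycles each, but one result per cycle" can be written out as a small sketch. It assumes an idealized pipeline with no stalls. The function names are hypothetical, and the only measured values used are the store cycles reported by the viewer (39831 for XR0 and 39838 for XR6).

```cpp
#include <cassert>

// Cycle at which instruction `index` (0-based) stores its result, assuming
// one result per cycle after the first store and no stalls (simplified model).
constexpr long unstalledStoreCycle(long firstStoreCycle, int index) {
    return firstStoreCycle + index;
}

// Cycles from the first instruction entering PD to the last result being
// stored, for `count` back-to-back instructions that never stall.
constexpr long unstalledTotalCycles(int latency, int count) {
    return count <= 0 ? 0 : latency + count - 1;
}
```

With the viewer's numbers, `unstalledStoreCycle(39831, 7)` gives 39838, the cycle at which the eighth register load (XR6) is stored. `unstalledTotalCycles(7, 8)` gives 14: the first 6 cycles fill the pipeline, then the 8 results appear over 8 consecutive cycles.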

Doing filter operations – generates different results
- XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored 39839 – 7 cycles
- XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored 39840 – 7 cycles
- XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored 39842 – 8 cycles
WHY? – FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL

- Instruction 0x17e, XFR8 = R8 + R23, is STALLED (waiting) for 0x17d, XFR23 = R8 * R4, to complete
- Bubble B means that the pipeline is doing “nothing”, meaning that the instruction shown is a “place holder” (garbage)

Information on Window Event Icons

Result of Analysis
- Can’t use Float result immediately after calculation
- Writing
    XFR23 = R8 * R4;;
    XFR8 = R8 + R23;;   // MUST WAIT FOR XFR23 calculation to be completed
  is the same as coding
    XFR23 = R8 * R4;;
    NOP;;               <-- Note DOUBLE ;; -- extra cycle because of stall
    XFR8 = R8 + R23;;
- Proof – write the code with the stalls shown in it
- Writing this way means we don’t have to use the pipeline viewer all the time
- Pipeline viewer is only available with (slow) simulator
    #define SHOW_ALU_STALL nop
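The stall rule found above can be turned into a small cycle counter, so that expected stalls can be predicted without the pipeline viewer. This is a deliberately simplified sketch: it assumes exactly one extra cycle whenever an instruction reads the register written by the instruction immediately before it, and it ignores every other kind of stall. The `Instr` type and `issueCyclesWithStalls` are hypothetical names, not part of any TigerSHARC tool.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A compute instruction reduced to the register it writes and the registers
// it reads (hypothetical model, e.g. XFR23 = R8 * R4 -> dest R23, srcs R8, R4).
struct Instr {
    std::string dest;
    std::vector<std::string> srcs;
};

// Counts issue cycles under the assumed rule: one cycle per instruction,
// plus one stall cycle when an instruction reads the result of the
// instruction directly before it.
inline int issueCyclesWithStalls(const std::vector<Instr>& code) {
    int cycles = 0;
    for (std::size_t i = 0; i < code.size(); ++i) {
        cycles += 1;
        if (i == 0) continue;
        for (const std::string& src : code[i].srcs) {
            if (src == code[i - 1].dest) {
                cycles += 1;  // same effect as an explicit NOP;; line
                break;
            }
        }
    }
    return cycles;
}
```

Under this model the pair `XFR23 = R8 * R4;; XFR8 = R8 + R23;;` costs 3 cycles. Placing one independent instruction between the two also costs 3 cycles, but now all three cycles do useful work.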

Code with stalls shown
- 8 code lines
- 5 expected stalls
- Expect 13 cycles to complete if theory is correct

Analysis approach IS correct

Process for coding for improved speed – code re-organization
- Make a copy of the code so that you can test iirASM( ) and iirASM_Optimized( ) to make sure you get the correct result
- Make a table of code showing ALU resource usage (paper, Excel, Project (Gantt chart))
- Identify data dependencies
- Make all “temp operations” use different registers
- Move instructions “forward” to fill delay slots, BUT don’t break data dependencies
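The "identify data dependencies" and "use different registers" steps can be illustrated with a small check for whether one instruction may be moved ahead of another. This is a sketch under simplifying assumptions: each instruction writes one register, and only register dependencies matter. The names `Op`, `readsReg` and `canMoveAhead` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// An instruction reduced to its destination and source registers
// (hypothetical model for illustration).
struct Op {
    std::string dest;
    std::vector<std::string> srcs;
};

inline bool readsReg(const Op& op, const std::string& reg) {
    return std::find(op.srcs.begin(), op.srcs.end(), reg) != op.srcs.end();
}

// True when `later` can be moved ahead of `earlier` without changing results.
inline bool canMoveAhead(const Op& earlier, const Op& later) {
    const bool needsResult = readsReg(later, earlier.dest);   // later uses earlier's output
    const bool clobbersInput = readsReg(earlier, later.dest); // later overwrites earlier's input
    const bool sameDest = earlier.dest == later.dest;         // final register value would differ
    return !(needsResult || clobbersInput || sameDest);
}
```

For example, an instruction that reuses R23 as a temporary cannot be moved ahead of `XFR8 = R8 + R23`, because it would overwrite R23 before it is read. The same calculation written to a fresh register (say R24) can be moved, which is why the temporaries are renamed first.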

Copy and paste to make IIRASM_Optimized( )

- Need to re-order instructions to fill delay slots with useful instructions
- After refactoring code to fill delay slots, must run tests to ensure that the code still gives the correct result
- Change – and check
- NOT EASY: MUST HAVE A PLAN. I USE EXCEL

Show resource usage and data dependencies

- Change all temporary registers to use different register names
- Then check that the code produces the correct answer

Move instructions forward, without breaking data dependencies
- What appears possible!
- DO one thing at a time and then check that code still works

Check that code still operates
- 1 cycle saved

Move next multiplication up.
NOTE: certain stalls remain, although the reason for the STALL changes

Move up the R10 and R9 assignment operations -- check 4 cycle improvement?

CHECK THE PIPELINE AFTER TESTING

Are there still more improvements possible? (I can see 4 more moves)

Problems with approach
- Identifying all the data dependencies
- Keeping track of how the data dependencies change as you move the code around
- Handling all of this “automatically”
I started the following design tool as something that might work, but it actually turned out very useful:
M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project", "Don’t say “CAN’T do it” - Say “Gantt it”! The irony of organizing microprocessors with a big business tool", Circuit Cellar magazine, Vol. 184, pp 26 - 35, November 2005.

Using Microsoft Project – Step 1

Add dependencies and resource usage – then activate level

Microsoft Project as a microprocessor design tool
- Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays

Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines. If these pipelines stall, the processor speed goes down
- Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
- Understanding what the pipeline viewer tells in detail
- Avoiding having to use the pipeline viewer
- Improving code efficiency
- Excel and Project (Gantt charts) are useful tools