Understanding the TigerSHARC ALU pipeline

Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of an IIR filter, Part 1: Getting the code to work

Understanding the TigerSHARC ALU pipeline
TigerSHARC has many pipelines. If these pipelines stall, the processor speed goes down. We need to understand how the ALU pipeline works and learn to use the pipeline viewer. The answer may be different for floating-point and integer operations.
12/2/2018 Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

Register File and COMPUTE Units

Simple Example IIR -- Biquad
For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0
This is not a great bit of IIR code: as written it cannot be used in a loop over an array of values, which is what is really necessary.
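To make the data flow of one stage concrete, here is a minimal C++ sketch of the pseudocode above. The names BiquadStage and step are hypothetical (they do not appear in the slides), and the sketch covers a single stage only, since the slide does not say how the four stages are chained or where their coefficients come from.

```cpp
#include <array>

// Hypothetical illustration of ONE biquad stage from the slide's pseudocode.
// BiquadStage and step() are assumed names, not part of the original code.
struct BiquadStage {
    std::array<float, 6> h{};  // H0..H5, used in the same roles as on the slide
    float s1 = 0.0f;           // S1: internal state from one sample ago
    float s2 = 0.0f;           // S2: internal state from two samples ago

    // Process one input sample and return the stage output.
    float step(float xin) {
        const float s0 = xin * h[5] + s2 * h[3] + s1 * h[4];
        const float yout = s0 * h[0] + s1 * h[1] + s2 * h[2];
        s2 = s1;  // age the state, exactly as in the pseudocode
        s1 = s0;
        return yout;
    }
};
```

Keeping S1 and S2 inside the struct is one way to let the same code run in a loop over an array of samples, which is the limitation the slide points out.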

Set up the tests
We want to make sure the answer stays correct as the code changes.
    #include <EmbeddedUnit/EmbeddedUnit.h>
    #include <EmbeddedUnit/CommonTests.h>
    #include <EmbeddedUnit/EmbeddedTests.h>

Step 1 – Stub plus return value
Build an assembly language stub for float iirASM(void); and make it return a floating point value of 40.5, to show that we can return a float.
J8 is an INTEGER register, so how can we return 40.5? ANSWER: WE DON’T. We return the “bit pattern” for 40.5, and a register holds that pattern exactly as it would hold an “INTEGER” bit pattern.
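The “bit pattern” idea can be checked on a desktop compiler. The sketch below assumes a 32-bit IEEE 754 float (true on most hosts); the helper names float_to_bits and bits_to_float are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative helpers (assumed names). Assumes float is 32-bit IEEE 754.
std::uint32_t float_to_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bits, no numeric conversion
    return bits;
}

float bits_to_float(std::uint32_t bits) {
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof value);  // reinterpret the same 32 bits as a float
    return value;
}
```

Under that assumption, float_to_bits(40.5f) gives 0x42220000: this is the “integer” a stub would load into the return register to hand back 40.5.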

The code does not work when passing back floats in the J8 register
We are passing back 40.5 in the normal return register, but that is obviously NOT what the C++ compiler was expecting. We used the wrong coding convention.

The code does work when using the XR8 register. NOTE: XR8, NOT XFR8.

Step 2 – Using C++ code as comments -- set up the coefficients
    XFR0 = 0.0;;    DOES NOT EXIST as a float instruction
    XR0 = 0.0;;     DOES EXIST
Bit patterns require integer X registers. Leave what you wanted to do behind as comments.

“ARCHITECTURAL ISSUES”: DON’T NEED SPECIAL FLOAT = CONSTANT INSTRUCTIONS
Initialize X registers to float values via “integer” operations (XR =), then use XFR “float” operations. What I want to do is left behind as comments for the stranger reading my code next week (ME).

Modify the C++ code so that it can be translated into assembly code
There can be only 1 instruction per line. The code must execute sequentially, so remember the ;; at the end of each line.

Start with the S0 = Xin instruction
We can’t use XFR8 = XFR6 to copy a register.

New TigerSHARC architecture issues: SIMD versus SISD
Since XFR8 = XFR6 is not allowed, try XR8 = R6;
SIMD: Single Instruction, Multiple Data. SISD: Single Instruction, Single Data.
R6 means move XR6 and YR6 (a multiple-data move described in 1 instruction).
Try XR8 = XR6 instead (an integer, i.e. bit-pattern, move).

Some operations are FLOAT operations and must have XFR on the left side of the equation BUT only R on the right.
Some operations are SISD operations and must have XR on both sides of the equation (or just R on both sides, which makes them SIMD on X and Y, with garbage happening on Y).
Personally, I think all these problems are “assembler” issues and could be made consistent.

What we have learnt
TigerSHARC has both SISD (single data) and SIMD (multiple data) ability.
    XFR4 = R4 * R5;
The answer (left side) is single data, so the SISD choice is taken on the right: read XR4 and XR5 (bit patterns), treat them as floats when doing the multiplication (F on the left), and store the bit pattern of the answer in XR4.
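The description of XFR4 = R4 * R5 can be modelled on a host machine: registers hold raw 32-bit patterns, and the operation decides how to interpret them. This is only a sketch of that idea, assuming 32-bit IEEE 754 floats; as_float, as_bits and float_multiply_bits are hypothetical names, and the model ignores any rounding or flag differences on the real processor.

```cpp
#include <cstdint>
#include <cstring>

// Host-side model only (assumed names). Assumes float is 32-bit IEEE 754.
static float as_float(std::uint32_t bits) {
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

static std::uint32_t as_bits(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Models the effect described for "XFR4 = R4 * R5": read two bit patterns,
// multiply them as floats, and return the bit pattern of the product.
std::uint32_t float_multiply_bits(std::uint32_t xr4, std::uint32_t xr5) {
    return as_bits(as_float(xr4) * as_float(xr5));
}
```

For example, the patterns for 40.5 (0x42220000) and 2.0 (0x40000000) multiply to the pattern for 81.0 (0x42A20000).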

What we have learnt
TigerSHARC has both SISD (single data) and SIMD (multiple data) ability.
SISD
    XR4 = XR5;;     Move X part of R5 register into X part of R4 register
    XR4 = YR5;;     Move Y part of R5 register into X part of R4 register
SIMD
    XYR4 = R5;;     Move X part of R5 register into X part of R4 register and Y part of R5 register into Y part of R4 register
    R4 = R5;;       Shorthand version of XYR4 = R5, to confuse you
Does YXR4 = R5 also exist? It would move the X part of the R5 register into the Y part of the R4 register and the Y part of the R5 register into the X part of the R4 register.
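The SISD and SIMD moves listed above can be pictured with a small host-side model in which one register number names a pair of 32-bit values, one in the X compute block and one in the Y compute block. RegPair and the function names are hypothetical; the sketch models only the moves the slide states exist.

```cpp
#include <cstdint>

// Hypothetical model: "R4" names a pair of registers, XR4 and YR4.
struct RegPair {
    std::uint32_t x;  // X compute block half, e.g. XR4
    std::uint32_t y;  // Y compute block half, e.g. YR4
};

// SISD, models "XR4 = XR5;;": only the X half of the destination changes.
void move_x_from_x(RegPair& dst, const RegPair& src) { dst.x = src.x; }

// SISD, models "XR4 = YR5;;": Y half of the source into X half of the destination.
void move_x_from_y(RegPair& dst, const RegPair& src) { dst.x = src.y; }

// SIMD, models "XYR4 = R5;;" (shorthand "R4 = R5;;"): both halves move at once.
void move_both(RegPair& dst, const RegPair& src) {
    dst.x = src.x;
    dst.y = src.y;
}
```

The model makes the earlier warning visible: an accidental SIMD form also overwrites the Y half of the destination, which is where the “garbage happening on Y” comes from.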

Disconnect from target and go to simulator

Activate Simulator

Rebuild the project and set breakpoints at start and end of ASM code

Activate the pipeline viewer

Adjust the pipeline window so that you can see all the instruction pipeline stages
I have just located an arrow icon which causes the pipeline window to fill the screen all the way across.

PIPELINE STAGES (see page 8-34 of Processor manual)
There are 10 pipeline stages, but they may be completely desynchronized (they happen semi-independently).
Instruction fetch: F1, F2, F3 and F4
Integer ALU: PreDecode, Decode, Integer, Access
Compute Block: EX1 and EX2

PIPELINE STAGES: Instruction fetch, F1, F2, F3 and F4 (see page 8-34 of Processor manual)
The Fetch Unit pipe is memory driven, not instruction driven. 128 bits are fetched at a time, which may make up 1, 2, 3, or 4 instruction lines (or parts of a couple of instruction lines). Instructions are fetched into the IAB, the instruction alignment buffer.

PIPELINE STAGES: Integer ALU pipe, PD, D, I and A (see page 8-34 of Processor manual)
PreDecode: the next COMPLETE instruction line (1, 2, 3 or 4 instructions) is fetched from the IAB.
Decode: different instructions are dispatched to different execution units (J-IALU, K-IALU, Compute Blocks).
Integer: data memory accesses start in this stage.
A stands for the Access stage.
Results are not available until the EX2 stage, but (by register forwarding) can sometimes be accessed earlier.

PIPELINE STAGES: Compute Block, EX1 and EX2 (see page 8-34 of Processor manual)
The result is always written to the target register on the rising edge of CCLK after stage EX2. The following multiple use of a register (read and store) in one line is guaranteed to pipeline correctly:
    R2 = R0 + R1; R6 = R2 * R3;;
The first instruction stores R2 at the end of the instruction; the second uses the value R2 had at the beginning of the instruction.
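The read-old, write-at-end behaviour of that instruction line can be mimicked with a small host-side model: every read uses the register values from before the line, and all writes land together afterwards. RegFile and execute_line are hypothetical names, and plain int stands in for the register contents.

```cpp
#include <array>

// Hypothetical model of a register file: index n stands for Rn.
using RegFile = std::array<int, 8>;

// Models the single instruction line "R2 = R0 + R1; R6 = R2 * R3;;".
// All reads come from `before`; all writes go to the returned copy.
RegFile execute_line(const RegFile& before) {
    RegFile after = before;
    after[2] = before[0] + before[1];  // new R2, visible only after the line
    after[6] = before[2] * before[3];  // uses the OLD value of R2
    return after;
}
```

With R0 = 1, R1 = 2, R2 = 10 and R3 = 3 before the line, the model leaves R2 = 3 and R6 = 30, not 9, because the multiply saw the old R2.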

We are only interested in the later stages of the pipeline, so adjust the viewer properties.

Run the code until the first ASM breakpoint and note down the cycle number: 39830. Then run again until the second ASM breakpoint is reached, and calculate the execution time. Note that an instruction is in the pipeline for a long time before the simulator stops.

Pipeline during code execution

The pipeline viewer says 26 cycles, but what do we expect to get from our code?
Instructions 1 to 8 in this part of the code: 8 cycles, as we expect 1 instruction per clock cycle.

The pipeline viewer says 26 cycles, but what do we expect? 21.
Instructions 1 to 13 in this part of the code: again 1 instruction per cycle is expected, so 13 cycles expected + 8 from before = 21.
That is roughly a 20% error in timing, which is too much. Where are the extra cycles coming from? How easy is it to code in such a way that the extra cycles can be removed? ANSWER: fairly straightforward to fix in principle, but it can be difficult in practice.
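The cycle bookkeeping above is simple enough to write down as a check. The sketch below uses the slide's numbers (8 and 13 expected cycles, 26 measured); CycleReport and compare_cycles are made-up names. The 5 extra cycles are about 19% of the measured 26 cycles, or about 24% of the expected 21, so the quoted 20% is a round figure.

```cpp
// Hypothetical helper for comparing expected and measured cycle counts.
struct CycleReport {
    int expected;                      // sum of the per-section expectations
    int extra;                         // measured minus expected (stall cycles)
    double extra_percent_of_measured;  // extra cycles relative to the measured total
    double extra_percent_of_expected;  // extra cycles relative to the expected total
};

// Assumes both totals are non-zero (no division-by-zero guard in this sketch).
CycleReport compare_cycles(int first_part, int second_part, int measured) {
    const int expected = first_part + second_part;
    const int extra = measured - expected;
    return CycleReport{expected, extra,
                       100.0 * extra / measured,
                       100.0 * extra / expected};
}
```

Calling compare_cycles(8, 13, 26) reports 21 expected cycles and 5 extra ones, which are the cycles to hunt for in the pipeline viewer.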

Understanding the TigerSHARC ALU pipeline
TigerSHARC has many pipelines. If these pipelines stall, the processor speed goes down. We need to understand how the ALU pipeline works and learn to use the pipeline viewer. The answer may be different for floating-point and integer operations.