Detailed look at the TigerSHARC pipeline: cycle counting for COMPUTE block versions of the DC_Removal algorithm.

Presentation transcript:

Detailed look at the TigerSHARC pipeline: cycle counting for COMPUTE block versions of the DC_Removal algorithm

DC_Removal algorithm performance 2 / 28
To be tackled today:
- Expected and actual cycle counts for the COMPUTE block version of the DC_Removal algorithm
- Understanding why the stalls occur and how to fix them
- Understanding why some operations behave differently the “first time into the function” – cache issues?

DC_Removal algorithm performance 3 / 28
Set-up time
In principle, 1 cycle per instruction – … instructions

DC_Removal algorithm performance 4 / 28
First key element – Sum loop – Order(N): 4 + N * 5 instructions
Second key element – Shift loop – Order(log₂ N): … * log₂ N instructions
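
A minimal C sketch of what these two loops compute, assuming N is a power of two so the divide-by-N can be done with log₂ N arithmetic shifts (the function and variable names here are illustrative, not the course code):

    #include <stdio.h>

    #define N      128   /* buffer length, a power of two */
    #define LOG2_N 7     /* log2(N) shift-loop iterations */

    /* Sum loop -- Order(N): total the last N samples, then
       Shift loop -- Order(log2 N): divide by N one bit at a time */
    static long dc_estimate(const int buffer[N])
    {
        long sum = 0;
        for (int i = 0; i < N; ++i)       /* N passes                */
            sum += buffer[i];

        for (int i = 0; i < LOG2_N; ++i)  /* log2(N) passes          */
            sum >>= 1;                    /* arithmetic shift right
                                             (ASHIFTR on the slide)  */
        return sum;                       /* mean = DC estimate      */
    }

    int main(void)
    {
        int buffer[N] = { 0 };
        buffer[0] = N;                    /* trivial check: mean is 1 */
        printf("DC estimate = %ld\n", dc_estimate(buffer));
        return 0;
    }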

DC_Removal algorithm performance 5 / 28
Third key element – FIFO circular buffer – Order(N): … * N instructions
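
A minimal C sketch of the Order(N) software-FIFO update (illustrative only; the slide's actual code is TigerSHARC assembly): every sample moves down one slot, so the cost is dominated by N reads plus N writes.

    #include <stddef.h>

    /* Software FIFO update -- Order(N): shuffle every sample down one
       place so the newest sample can enter at the head (needs n >= 1) */
    static void fifo_update(int fifo[], size_t n, int new_sample)
    {
        for (size_t i = n - 1; i > 0; --i)  /* n - 1 move operations    */
            fifo[i] = fifo[i - 1];          /* one read + one write each */
        fifo[0] = new_sample;               /* newest sample at the head */
    }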

DC_Removal algorithm performance 6 / 28 TigerSHARC pipeline

DC_Removal algorithm performance 7 / 28
Time in theory
- Set up pointers to buffers
- Insert values into buffers
- SUM LOOP: 4 + N * 5 instructions
- SHIFT LOOP: … * log₂ N instructions
- Update outgoing parameters
- Update FIFO: … * N instructions
- Function return
- Total: … + … N + … log₂ N instructions
N = 128 – … instructions = … cycles + delay cycles
C++ debug mode – 9500 cycles??????? Note: other tests executed before this test, meaning the cache is already “filled”.

DC_Removal algorithm performance 8 / 28
Set-up time
Expected: … instructions
Actual: … instructions + 2 stalls
Why not 4 stalls?

DC_Removal algorithm performance 9 / 28
First time round the sum loop
Expected: 9 instructions
LC0 load – 3 stalls
Each memory fetch – 4 stalls
Actual: … stalls

DC_Removal algorithm performance 10 / 28
Other times around the loop
Expected: 5 instructions
Each memory fetch – 4 stalls
Actual: … stalls
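
Tying the slide-9 and slide-10 numbers together in a quick tally; note that the two-fetches-per-pass figure below is an inference from the N * 8 stall count on slide 12, not something stated on this slide:

    #include <stdio.h>

    int main(void)
    {
        const int N = 128;
        const int STALLS_PER_FETCH = 4;  /* from the slide                     */
        const int FETCHES_PER_PASS = 2;  /* assumed: left + right buffer reads */

        int per_pass = FETCHES_PER_PASS * STALLS_PER_FETCH;   /* 8 stalls */
        printf("steady-state sum-loop stalls: N * %d = %d\n",
               per_pass, N * per_pass);                       /* 1024     */
        printf("first pass adds the LC0 load: +3 stalls\n");
        return 0;
    }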

DC_Removal algorithm performance 11 / 28
Shift loop – 1st time around
Expected: 3 instructions
No stalls on the LC0 load?
4 stalls on ASHIFTR
BTB hit followed by 5 aborts

DC_Removal algorithm performance 12 / 28
Time in theory / practice
- Entry into subroutine: 10 stalls?
- Set up pointers to buffers: 2 instructions, 0 stalls
- Insert values into buffers: 4 instructions, 2 stalls
- SUM LOOP: 4 + N * 5 instructions, N * 8 = 1024 stalls
- SHIFT LOOP: … * log₂ N instructions, 9 stalls
- Update outgoing parameters: 6 instructions, 3 stalls
- Update FIFO: … * N instructions, 3 stalls
- Function return: 2 instructions, --
- Exit from subroutine: 10 stalls?
- Total: … + … N + … log₂ N instructions, 1061 stalls
N = 128 – 1444 instruction cycles + 1061 stall cycles = 2505 cycles; in practice 2507 cycles
C++ debug mode – 9500 cycles??????? Note: other tests executed before this test, meaning the cache is already “filled”.
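
That stall column can be cross-checked mechanically; a minimal C tally using the per-stage figures from this slide (the two 10-stall entry/exit values are the slide's own estimates):

    #include <stdio.h>

    int main(void)
    {
        const int N = 128;
        int stalls = 10        /* entry into subroutine (estimate) */
                   + 0         /* set up pointers to buffers       */
                   + 2         /* insert values into buffers       */
                   + N * 8     /* sum loop                         */
                   + 9         /* shift loop                       */
                   + 3         /* update outgoing parameters       */
                   + 3         /* update FIFO                      */
                   + 10;       /* exit from subroutine (estimate)  */

        printf("total stalls = %d\n", stalls);        /* 1061                */
        printf("total cycles = %d\n", 1444 + stalls); /* 2505; 2507 measured */
        return 0;
    }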

DC_Removal algorithm performance 13 / 28 Final sum code – Using XR registers
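
The slide's code is TigerSHARC assembly; as a rough C model of what it computes, assuming one running sum per channel held in compute-block registers (the XR6 / XR7 accumulators that appear on slide 17):

    /* Rough C model of the XR-register sum loop: two independent
       accumulators, one per audio channel, summed in a single pass */
    static void sum_left_right(const int left[], const int right[], int n,
                               long *sum_l, long *sum_r)
    {
        long acc_l = 0, acc_r = 0;   /* stand-ins for XR6 / XR7   */
        for (int i = 0; i < n; ++i) {
            acc_l += left[i];        /* left-channel fetch + add  */
            acc_r += right[i];       /* right-channel fetch + add */
        }
        *sum_l = acc_l;
        *sum_r = acc_r;
    }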

DC_Removal algorithm performance 14 / 28
Time in practice
- Entry into subroutine: 10 stalls
- Set up pointers to buffers: 2 instructions, 0 stalls
- Insert values into buffers: 4 instructions, 2 stalls
- SUM LOOP: 4 + N * 5 instructions; was 1024 stalls with the JALU
- SHIFT LOOP: … * log₂ N instructions, 9 stalls
- Update outgoing parameters: 6 instructions, 3 stalls
- Update FIFO: … * N instructions, 3 stalls
- Function return: 2 instructions
- Exit from subroutine: 10 stalls
N = 128 – 1444 instruction cycles + 265 delay cycles = 1709 cycles (was 2504 cycles with the JALU)
Predicted stalls with the X compute block = 249 stalls – close enough to 256 = N * 2, or one stall for each memory access
Improved more than expected, as the code was accidentally making better use of the available resources

DC_Removal algorithm performance 15 / 28
Second time into the function, first time around the loop: 2 stalls per loop iteration, as predicted

DC_Removal algorithm performance 16 / 28
2nd time into the function, 9th time around the loop: stalls as expected. Note that sets of 5 quad instructions appear to be fetched in.

DC_Removal algorithm performance 17 / 28
Interpretation
Currently:
XR2 = [J0 + J8];;
XR6 = R6 + R2;; // Must wait 1 cycle for XR2 to be brought in
XR3 = [J1 + J8];;
XR7 = R7 + R3;; // Must wait 1 cycle for XR3?
Next improvement?
XR2 = [J0 + J8];;
XR3 = [J1 + J8];;
XR6 = R6 + R2;; // XR2 and XR3 are now ready when we want to use them?
XR7 = R7 + R3;; // or do we get a DATA / DATA clash along the J-bus?
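
The same reordering idea sketched at the C level, purely for illustration; the real gain comes from the assembly-level schedule above, where the second load fills the first load's wait cycle:

    /* Load-ahead version: both fetches are issued before either value is
       used, mirroring the reordered assembly on this slide */
    static void sum_left_right_reordered(const int left[], const int right[],
                                         int n, long *sum_l, long *sum_r)
    {
        long acc_l = 0, acc_r = 0;
        for (int i = 0; i < n; ++i) {
            int l_val = left[i];     /* XR2 = [J0 + J8];;  issue load 1 */
            int r_val = right[i];    /* XR3 = [J1 + J8];;  issue load 2 */
            acc_l += l_val;          /* XR6 = R6 + R2;;  operand ready  */
            acc_r += r_val;          /* XR7 = R7 + R3;;                 */
        }
        *sum_l = acc_l;
        *sum_r = acc_r;
    }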

DC_Removal algorithm performance 18 / 28
Pipelining the “intermingled” left and right filter operations

DC_Removal algorithm performance 19 / 28
Time in practice (same breakdown as slide 14)
N = 128 – 1444 instruction cycles + 265 delay cycles = 1709 cycles (was 2504 cycles with the JALU)
Predicted stalls with the X compute block = 249 stalls – close enough to 256 = N * 2, or one stall for each memory access
Intermingled code – around 1430 cycles + 30 stalls

DC_Removal algorithm performance 20 / 28
1st time into the function, 1st time round the loop

DC_Removal algorithm performance 21 / 28
1st time into the function; 2nd, 3rd, … time round the loop

DC_Removal algorithm performance 22 / 28
9th, 17th, etc. time round the loop

DC_Removal algorithm performance 23 / 28
From the TigerSHARC manual, p. 9-11: reading in 8 words at a time from “memory” into “cache” MIGHT explain the behaviour
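
A quick C illustration of why 8-word transfers would put the stalls on the 9th, 17th, … passes seen on slide 22: a new block is needed exactly when the word index is a multiple of 8.

    #include <stdio.h>

    int main(void)
    {
        const int WORDS_PER_TRANSFER = 8;   /* 8 words fetched at a time */
        for (int i = 0; i < 24; ++i)
            if (i % WORDS_PER_TRANSFER == 0)
                printf("pass %d starts a new 8-word block\n", i + 1);
        return 0;   /* prints passes 1, 9 and 17 */
    }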

DC_Removal algorithm performance 24 / 28 Again, talking about “8” data values

DC_Removal algorithm performance 25 / 28 Read buffer

DC_Removal algorithm performance 26 / 28
Implications – read buffer
Prefetch buffer: 4 pages
Each page: 64 32-bit words = 64 items
Buffer = 256 items – exactly enough to handle 128 left and 128 right
Does that imply that speed does not scale up – that 256-point arrays are more than 2× as slow as 128-point arrays?
May make sense to process all of left and then all of right?
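
A back-of-envelope check of that capacity claim, using the figures as they appear on this slide:

    #include <stdio.h>

    int main(void)
    {
        int pages = 4, items_per_page = 64;
        int capacity = pages * items_per_page;    /* 256 items */
        printf("read buffer: %d items = %d left + %d right\n",
               capacity, capacity / 2, capacity / 2);
        return 0;
    }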

DC_Removal algorithm performance 27 / 28
Implications – cache
4-way associative cache
128 cache sets
Each cache set has four cache ways
Each cache way – 8 32-bit words
That's 128 × 4 × 8 = 4096 32-bit words
Things break down when the left / right arrays are of size 512; or else do all left then all right – things change at 1024
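
And the matching arithmetic for the cache geometry on this slide:

    #include <stdio.h>

    int main(void)
    {
        int sets = 128, ways = 4, words_per_way = 8;
        int words = sets * ways * words_per_way;          /* 4096 words */
        printf("cache: %d 32-bit words = %d bytes\n", words, words * 4);
        return 0;   /* 4096 words = 16384 bytes */
    }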

DC_Removal algorithm performance 28 / 28
Tackled today:
- Expected and actual cycle counts for the COMPUTE block version of the DC_Removal algorithm
- Understanding why the stalls occur and how to fix them
- Understanding why some operations behave differently the “first time into the function” – cache issues?