Pipelines for Future Architectures in Time Critical Embedded Systems By: R.Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C.Ferdinand EEL.

Presentation transcript:

Pipelines for Future Architectures in Time Critical Embedded Systems
By: R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C. Ferdinand
EEL Embedded Systems, Dept. of Electrical and Computer Engineering, University of Florida
Liza Rodriguez, Aurelio Morales

2 of 23 Outline
Pipelining Review
Timing Analysis
  Anomalies
  Domino Effects
Architecture Classifications
Conclusions

4 of 23 Pipelining Review
Pipelining is an implementation technique in which the execution of multiple instructions is overlapped.
Pipelining takes advantage of the parallelism that exists among the actions needed to execute an instruction.
A pipeline works like an assembly line: each stage operates in parallel with the other stages.
Instructions enter at one end, progress through the stages, and exit at the other end.
Pipelining is the key implementation technique used to make CPUs fast.

5 of 23 Pipelined Example
[Figure: a five-stage pipeline (Fetch, Decode, Execute, Memory, Write Back) executing LD r4, 0(r3); ADD r1, r7, r3; ADD r2, r6, r3. Cycle annotations: LD r4, 0(r3): 5 cycles (5); ADD r1, r7, r3: 1 cycle (4); ADD r2, r6, r3: 1 cycle (4).]
Pipeline registers separate the functional units so that they can operate in parallel.
The pipeline stalls if there is a hazard.
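The cycle annotations in the example (5 cycles for the first instruction, 1 more for each following one) match the usual count for an ideal pipeline. The sketch below assumes a stall-free pipeline in which every instruction passes through every stage; the function names are illustrative and do not come from the slides.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for an ideal, stall-free pipeline.

    The first instruction needs num_stages cycles to fill the pipeline;
    every later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


if __name__ == "__main__":
    # Three instructions (LD, ADD, ADD) on a five-stage pipeline.
    print(pipelined_cycles(3, 5))    # 5 + 1 + 1
    print(unpipelined_cycles(3, 5))  # 3 * 5
```

Any hazard adds stall cycles on top of the ideal count, which is why the timing analysis discussed later cannot rely on this formula alone.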

6 of 23 Further Optimizations
Superscalar: executes more than one instruction per clock cycle by simultaneously dispatching multiple instructions to redundant functional units.
Branch prediction: predicts branches using a predefined static algorithm or dynamic branch history.
Out-of-order execution: instructions are dynamically scheduled to avoid hazards and dependencies that may stall the pipeline.
[Figure: two parallel Fetch, Decode, Execute, Memory, Write Back pipelines, and reservation stations feeding the functional units. Execute station: ADD r1, r2, r3 (wait); SUB r1, r2, r3 (wait); MUL r6, r7, r8 (ready). Memory station: LD r4, (0) r5 (wait); ST r2, (0) r1 (ready); LD r4, (0) r1 (wait).]

8 of 23 Real Time Embedded Systems
Timing Analysis
  Analyzes a set of tasks executing on given hardware to guarantee that their timing constraints will be met.
  Requires known upper and lower bounds on the execution times of tasks: Worst Case Execution Time (WCET) and Best Case Execution Time (BCET).
  Analysis results are highly dependent on the architecture.
  An architecture without accompanying performance analysis technology should not be seriously considered for time-critical embedded applications.
Desired Criteria
  Soundness: valid, reliable, free from random error.
  Obtainable precision: the architecture has predictability properties.
  Analysis effort to reach precision: depends on the solution space to be explored.

9 of 23 Timing Analysis
Non-Pipelined Architecture: Simple
  Add the execution times of the individual instructions to obtain a bound on the execution time of a basic block.
Pipelined Architecture: Complex
  Instructions overlap, so individual instructions cannot be considered in isolation.
  Instructions must be considered collectively to obtain timing bounds.
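For the non-pipelined case, the bound on a basic block is a plain sum. The sketch below illustrates this; the latency table, the opcode names, and the function `block_bounds` are hypothetical values chosen for the example, not data from the presentation.

```python
# Hypothetical (best-case, worst-case) latencies in cycles per opcode.
# The load's worst case assumes a cache miss; all values are illustrative.
LATENCY = {
    "LD": (2, 10),
    "ADD": (1, 1),
    "MUL": (4, 4),
}


def block_bounds(block):
    """Return (BCET, WCET) bounds for a basic block on a non-pipelined CPU.

    With no overlap between instructions, each bound is the sum of the
    per-instruction bounds.
    """
    bcet = sum(LATENCY[op][0] for op in block)
    wcet = sum(LATENCY[op][1] for op in block)
    return bcet, wcet


if __name__ == "__main__":
    print(block_bounds(["LD", "ADD", "ADD", "MUL", "MUL"]))
```

On a pipelined machine this sum is no longer exact, because the time an instruction takes depends on which other instructions are in flight with it.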

10 of 23 Timing Analysis
Pipelined Architecture: Complex
  For WCET analysis, the most costly path through the pipeline should be selected.
  To compute a precise bound, the analysis needs to account for as many "timing accidents" as possible.
  Timing accidents: data hazards, branch mispredictions, occupied functional units, cache misses, etc.
  Issues: timing anomalies and domino effects.
  Because of these, the analysis has to follow all possible successor states.
  The more performance-enhancing features the pipeline has, the larger the search space.

11 of 23 Timing Anomaly
Formal definition: a situation where the local worst case does not contribute to the global worst case.
A better definition: a positive improvement to the architecture that has a negative effect on execution time.
Examples:
  A cache miss may result in a shorter execution time.
  Shortening an instruction may lead to a longer execution time.

12 of 23 Timing Anomaly Example: Cache Hit or Miss
Instructions:
  A  LD  r4, 0(r3)
  B  ADD r5, r4, r4
  C  ADD r1, r6, r6
  D  MUL r2, r1, r1
  E  MUL r3, r2, r2
Latencies: miss penalty 8 cycles; LSU 2 cycles; ALU 1 cycle; multiplier 4 cycles.
The architecture is made up of functional units and reservation stations, similar to Tomasulo's algorithm.
[Figure: two schedules of instructions A to E on the LSU, ALU, and MULT units, one for a cache hit and one for a cache miss.]

13 of 23 Timing Anomaly Example: Reduced Instruction
Instructions:
  A  MUL r2, r1, r1
  B  ADD r3, r2, r2
  C  ADD r4, r5, r5
  D  LD  r6, 0(r4)
  E  ADD r7, r6, r6
Latencies: miss penalty 8 cycles; LSU 4 cycles; ALU 2 cycles; multiplier ? cycles.
The architecture is made up of functional units and reservation stations, similar to Tomasulo's algorithm.
[Figure: two schedules of instructions A to E on the LSU, ALU, and MULT units, one with multiplier = 5 cycles and one with multiplier = 2 cycles.]
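Both anomaly examples can be reproduced with a small scheduling model. The sketch below is a simplified model and not the presenters' simulator. It assumes one instruction is dispatched per cycle in program order, each functional unit runs one instruction at a time, and a free unit picks the oldest ready instruction. It also assumes that a load miss costs the LSU latency plus the miss penalty, and that the load in the second example hits. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str          # functional unit the instruction runs on
    latency: int       # cycles spent on that unit
    deps: tuple = ()   # names of instructions producing its operands


def makespan(program):
    """Last cycle in which any instruction is executing (cycles start at 1)."""
    dispatch = {ins.name: i + 1 for i, ins in enumerate(program)}
    finish = {}      # name -> last cycle of execution
    unit_free = {}   # unit -> first cycle in which the unit is idle again
    pending = list(program)
    cycle = 1
    while pending:
        for ins in list(pending):  # program order, so oldest first
            dispatched = dispatch[ins.name] <= cycle
            operands_ready = all(
                d in finish and finish[d] < cycle for d in ins.deps
            )
            unit_idle = unit_free.get(ins.unit, 1) <= cycle
            if dispatched and operands_ready and unit_idle:
                finish[ins.name] = cycle + ins.latency - 1
                unit_free[ins.unit] = cycle + ins.latency
                pending.remove(ins)
        cycle += 1
    return max(finish.values())


def cache_example(load_latency):
    """Cache hit or miss example: LSU hit 2, ALU 1, multiplier 4 cycles."""
    return [
        Instr("A", "LSU", load_latency),
        Instr("B", "ALU", 1, ("A",)),
        Instr("C", "ALU", 1),
        Instr("D", "MULT", 4, ("C",)),
        Instr("E", "MULT", 4, ("D",)),
    ]


def multiplier_example(mul_latency):
    """Reduced instruction example: LSU 4 (assumed hit), ALU 2 cycles."""
    return [
        Instr("A", "MULT", mul_latency),
        Instr("B", "ALU", 2, ("A",)),
        Instr("C", "ALU", 2),
        Instr("D", "LSU", 4, ("C",)),
        Instr("E", "ALU", 2, ("D",)),
    ]


if __name__ == "__main__":
    print("cache hit :", makespan(cache_example(2)))       # 12 cycles
    print("cache miss:", makespan(cache_example(2 + 8)))   # 11 cycles
    print("MUL = 5   :", makespan(multiplier_example(5)))  # 10 cycles
    print("MUL = 2   :", makespan(multiplier_example(2)))  # 12 cycles
```

In this model the locally better case lets B claim the ALU first, which delays C and the chain of instructions that depend on it. The locally worse case holds B back, so C runs earlier and the whole sequence finishes sooner.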

14 of 23 Domino Effects
Formal definition: a system exhibits a domino effect if there are two hardware states s, t such that the difference in execution time may be arbitrarily high and cannot be bounded by a constant.
A better definition: a minor timing accident can cause an unbounded increase in execution time.
Examples:
  Timing accident in a loop
  PowerPC 755 pipeline (Schneider)
  Pseudo-least-recently-used (PLRU) replacement policy (Berg)
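One way to write the formal definition symbolically is sketched below. The notation is an assumption made for illustration: $T_p(s)$ denotes the execution time of program $p$ when started in hardware state $s$.

```latex
% Domino effect: no constant c bounds the timing difference between
% two start states s and t over all programs p.
\exists\, s, t \;\; \forall c \in \mathbb{N} \;\; \exists\, p :\quad
  \bigl|\, T_p(s) - T_p(t) \,\bigr| > c

% Constant-bounded effect, for contrast: a single constant c suffices.
\forall s, t \;\; \exists\, c \in \mathbb{N} \;\; \forall p :\quad
  \bigl|\, T_p(s) - T_p(t) \,\bigr| \le c
```

A loop is the typical witness: if each iteration costs a fixed number of extra cycles when started in $t$ rather than $s$, the difference grows with the iteration count and exceeds any chosen $c$.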

15 of 23 Domino Effects
[Figure: instructions A (ADD r4, r3, r3) and B (ADD r1, r2, r) alternating on the ALU in two schedules (A B A B ...). Labels: A Dispatch EA +5; A Execute Immdt; B Dispatch DA+4; B Execute DA+6.]
The first A is delayed by one clock cycle due to a dependency on the previous instruction.

17 of 23 Classification of Architectures
Fully Timing Compositional Architectures
  No timing anomalies or domino effects.
  Timing analysis can safely follow worst-case paths only.
  Example: ARM7
Compositional Architectures with Constant Bounded Effects
  Exhibit timing anomalies but no domino effects.
  Timing analysis has to consider all paths, but it can be optimized to safely discard all local non-worst-case paths by adding a constant number of cycles to the worst-case path, trading precision for efficiency.
  Example: Infineon TriCore
Non-Compositional Architectures
  Exhibit timing anomalies and domino effects.
  Timing analysis has to follow all possible paths, since a local effect can influence future execution arbitrarily.
  Example: PowerPC 755
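The difference between the classes can be illustrated on a toy state tree. In this sketch, each node is a list of `(local_cost, subtree)` alternatives. The trees, the costs, and the one-cycle penalty are invented for illustration and do not model any of the processors named above.

```python
def exhaustive_wcet(node):
    """Follow every path and return the largest total cost (always sound)."""
    if not node:
        return 0
    return max(cost + exhaustive_wcet(child) for cost, child in node)


def local_worst_wcet(node):
    """Follow only the locally most expensive alternative at each step."""
    if not node:
        return 0
    cost, child = max(node, key=lambda edge: edge[0])
    return cost + local_worst_wcet(child)


# No anomaly: the local worst case also leads to the global worst case.
compositional = [(2, [(9, [])]), (10, [(9, [])])]

# Anomaly: the locally cheaper alternative (2) leads to the longer path.
anomalous = [(2, [(10, [])]), (10, [(1, [])])]

# Hypothetical constant that covers the anomaly in this particular tree.
PENALTY = 1

if __name__ == "__main__":
    print(exhaustive_wcet(compositional), local_worst_wcet(compositional))
    print(exhaustive_wcet(anomalous), local_worst_wcet(anomalous))
    print(local_worst_wcet(anomalous) + PENALTY >= exhaustive_wcet(anomalous))
```

On the first tree, following local worst cases gives the same bound as exhaustive search, which corresponds to the fully compositional case. On the second tree it underestimates, so it is only safe after a constant is added. If no such constant exists, as with domino effects, exhaustive exploration is the only remaining option.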

19 of 23 Conclusions
Architectural optimizations in embedded systems are necessary to improve performance and to meet critical time constraints.
  Pipelines: multiple issue, out-of-order execution, branch prediction, etc.
However, an architectural optimization may not be worth implementing if effects such as timing anomalies and domino effects will have a negative impact on timing analysis.
  How good is an optimization if you can't measure its effects?
A trade-off exists between the amount of execution time you can save through pipeline optimizations and the amount of precision you lose in timing analysis.

20 of 23 Questions?