Pipelines for Future Architectures in Time Critical Embedded Systems
By: R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C. Ferdinand
EEL 6935 - Embedded Systems
Dept. of Electrical and Computer Engineering, University of Florida
Liza Rodriguez
Aurelio Morales

Outline
- Pipelining Review
- Timing Analysis
- Anomalies
- Domino Effects
- Architecture Classifications
- Conclusions

Pipelining Review
- Pipelining is an implementation technique in which multiple instructions are overlapped in execution
- Pipelining takes advantage of the parallelism that exists among the actions needed to execute an instruction
- Pipelining is like an assembly line: each stage operates in parallel with the other stages
- Instructions enter at one end, progress through the stages, and exit at the other end
- Pipelining is the key implementation technique used to make fast CPUs

Pipelined Example
[Figure: the instructions LD r4, 0(r3); ADD r1, r7, r3; ADD r2, r6, r3 moving through a five-stage pipeline (Fetch, Decode, Execute, Memory, Write Back). Cycle annotations on the slide: LD 5 cycles (5); each ADD 1 cycle (4).]
- Pipeline registers separate functional units to allow parallel operation
- The pipeline will stall if there is a hazard
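
The cycle counts in the example above can be checked with a short calculation. This is a minimal sketch, not taken from the slides: the function names are hypothetical, the pipelined count is the standard ideal figure (fill the pipeline once, then retire one instruction per cycle, plus any stall cycles), and reading the parenthesized slide figures (5), (4), (4) as per-instruction cycle counts without pipelining is an assumption.

```python
def pipelined_cycles(num_instructions, num_stages, stall_cycles=0):
    """Cycles for an ideal in-order pipeline: the first instruction needs
    num_stages cycles, each later one completes one cycle after its
    predecessor, and every stall cycle delays completion by one cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1) + stall_cycles


def unpipelined_cycles(cycles_per_instruction):
    """Cycles when instructions run strictly one after another."""
    return sum(cycles_per_instruction)


# Three instructions on a five-stage pipeline, no hazards.
print(pipelined_cycles(3, 5))                  # 7
# Same three instructions with one stall cycle (hypothetical hazard).
print(pipelined_cycles(3, 5, stall_cycles=1))  # 8
# Assumed reading of the slide: LD takes 5 cycles, each ADD 4, unpipelined.
print(unpipelined_cycles([5, 4, 4]))           # 13
```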

Further Optimizations
- Superscalar: executes more than one instruction per clock cycle by simultaneously dispatching multiple instructions to redundant functional units
- Branch prediction: predicts branches based on a predefined static algorithm or on dynamic branch history
- Out-of-order execution: instructions are dynamically scheduled to avoid hazards and dependencies that may stall the pipeline
[Figure: two parallel five-stage pipelines (Fetch, Decode, Execute, Memory, Write Back), and reservation stations feeding the functional units. Execute station: ADD r1, r2, r3 (wait); SUB r1, r2, r3 (wait); MUL r6, r7, r8 (ready). Memory station: LD r4, (0) r5 (wait); ST r2, (0) r1 (ready); LD r4, (0) r1 (wait).]

Real Time Embedded Systems
- Timing analysis: the analysis of a set of tasks executing on given hardware to guarantee that timing constraints will be met
- Timing analysis requires upper and lower bounds on the execution times of tasks to be known: Worst Case Execution Time (WCET) and Best Case Execution Time (BCET)
- Analysis results are highly dependent on the architecture
- An architecture without accompanying performance analysis technology should not be seriously considered for time critical embedded applications
- Desired criteria:
  - Soundness: valid, reliable, free from random error
  - Obtainable precision: the architecture has predictability properties
  - Analysis effort to reach precision: depends on the solution space to be explored

Timing Analysis
- Non-pipelined architecture: simple
  - Add the execution times of individual instructions to obtain a bound on the execution time of a basic block
- Pipelined architecture: complex
  - Instructions overlap, so individual instructions cannot be considered in isolation
  - Instructions must be considered collectively to obtain timing bounds
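
For the non-pipelined case, the additive rule above can be written down directly. The following is a minimal sketch under stated assumptions: `basic_block_bounds` is a hypothetical helper, each instruction is described by an assumed (best case, worst case) pair of cycle counts, and the numbers are illustrative rather than taken from any real processor. The rule is only valid when instructions do not overlap, which is exactly what a pipeline breaks.

```python
def basic_block_bounds(instruction_bounds):
    """Return (BCET, WCET) bounds for a basic block on a non-pipelined
    machine by summing the per-instruction lower and upper bounds."""
    bcet = sum(low for low, _ in instruction_bounds)
    wcet = sum(high for _, high in instruction_bounds)
    return bcet, wcet


# Hypothetical cycle counts: a load that takes 2 cycles on a cache hit and
# 10 on a miss, followed by a 1-cycle add and a 4-cycle multiply.
block = [(2, 10), (1, 1), (4, 4)]
print(basic_block_bounds(block))  # (7, 15)
```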

Timing Analysis
Pipelined architecture: complex
- To do WCET analysis, the most costly pipeline path should be selected
- To compute a precise bound, the analysis needs to account for as many “timing accidents” as possible
- Timing accidents: data hazards, branch mispredictions, occupied functional units, cache misses, etc.
- Issues: timing anomalies and domino effects
- Thus, timing analysis has to follow all possible successor states
- The more performance-enhancing features the pipeline has, the larger the search space

Timing Anomaly
- Formal definition: a situation where the local worst case does not contribute to the global worst case
- A better definition: a positive improvement to the architecture that has a negative effect on execution time
- Examples:
  - A cache miss may result in a shorter execution time
  - Shortening an instruction may lead to a longer execution time

Timing Anomaly Example: Cache Hit or Miss
Instructions:
  A: LD r4, 0(r3)
  B: ADD r5, r4, r4
  C: ADD r1, r6, r6
  D: MUL r2, r1, r1
  E: MUL r3, r2, r2
Latencies: miss penalty 8 cycles, LSU 2 cycles, ALU 1 cycle, multiplier 4 cycles
- Architecture is made up of functional units and reservation stations, similar to Tomasulo’s algorithm
[Figure: two schedules of instructions A to E on the LSU, ALU, and MULT units over cycles 1 to 13, one for a cache hit and one for a cache miss.]

Timing Anomaly Example: Reduced Instruction
Instructions:
  A: MUL r2, r1, r1
  B: ADD r3, r2, r2
  C: ADD r4, r5, r5
  D: LD r6, 0(r4)
  E: ADD r7, r6, r6
Latencies: miss penalty 8 cycles, LSU 4 cycles, ALU 2 cycles, multiplier varied (5 cycles versus 2 cycles)
- Architecture is made up of functional units and reservation stations, similar to Tomasulo’s algorithm
[Figure: two schedules of instructions A to E on the LSU, ALU, and MULT units over cycles 1 to 13, one with a 5-cycle multiplier and one with a 2-cycle multiplier.]
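
Both anomaly examples can be replayed with a small scheduling simulation. This is a sketch under assumptions that the slides do not spell out, so its cycle counts need not match the original charts. It assumes one instruction is dispatched per cycle in program order, each functional unit runs one instruction at a time, a free unit picks its oldest ready instruction, a result is usable the cycle after it completes, a load miss costs the LSU latency plus the miss penalty, and the load in the second example hits the cache. All names (`Instr`, `simulate`, `cache_example`, `multiplier_example`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str          # functional unit the instruction needs
    deps: tuple = ()   # names of instructions whose results it reads


def simulate(program, latency):
    """Return the cycle in which the last instruction finishes."""
    order = {ins.name: pos for pos, ins in enumerate(program, start=1)}
    finish = {}        # name -> last cycle of execution
    unit_free_at = {}  # unit -> first cycle in which it is free again
    pending = list(program)
    cycle = 0
    while pending:
        cycle += 1
        for unit in {ins.unit for ins in pending}:
            if unit_free_at.get(unit, 1) > cycle:
                continue  # unit is still busy
            for ins in pending:  # program order, so the oldest wins
                if ins.unit != unit or order[ins.name] > cycle:
                    continue  # wrong unit, or not dispatched yet
                if all(finish.get(d, cycle) < cycle for d in ins.deps):
                    finish[ins.name] = cycle + latency[ins.name] - 1
                    unit_free_at[unit] = cycle + latency[ins.name]
                    pending.remove(ins)
                    break
    return max(finish.values())


def cache_example(miss):
    program = [Instr("A", "LSU"), Instr("B", "ALU", ("A",)),
               Instr("C", "ALU"), Instr("D", "MULT", ("C",)),
               Instr("E", "MULT", ("D",))]
    load = 2 + 8 if miss else 2  # LSU latency plus miss penalty
    return simulate(program, {"A": load, "B": 1, "C": 1, "D": 4, "E": 4})


def multiplier_example(mult_latency):
    program = [Instr("A", "MULT"), Instr("B", "ALU", ("A",)),
               Instr("C", "ALU"), Instr("D", "LSU", ("C",)),
               Instr("E", "ALU", ("D",))]
    return simulate(program, {"A": mult_latency, "B": 2, "C": 2,
                              "D": 4, "E": 2})


print(cache_example(miss=False), cache_example(miss=True))  # 12 11
print(multiplier_example(5), multiplier_example(2))         # 10 12
```

Under these assumed rules the cache miss finishes one cycle earlier than the hit, and the faster multiplier finishes two cycles later than the slower one. In both cases the locally better outcome lets B claim the ALU ahead of C, which delays the longer dependency chain that starts at C.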

Domino Effects
- Formal definition: a system exhibits a domino effect if there are two hardware states s, t such that the difference in execution time may be arbitrarily high and cannot be bounded by a constant
- A better definition: a minor timing accident can cause an unbounded increase in execution time
- Examples:
  - Timing accident in a loop
  - PowerPC 755 pipeline (Schneider)
  - Pseudo-least-recently-used (PLRU) replacement policy (Berg)

Domino Effects
Instructions:
  A: ADD r4, r3, r3
  B: ADD r1, r2, r2
[Figure: two ALU schedules of the alternating sequence A B A B ... over cycles 1 to 20. Slide annotations, as extracted: A Dispatch EA+5; A Execute Immdt; B Dispatch DA+4; B Execute DA+6.]
- First A gets delayed one clock cycle due to a dependency with the previous instruction
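
The unbounded growth in the definition of a domino effect can be made concrete with a toy cache. The sketch below deliberately swaps in a different mechanism from the slides: instead of the PowerPC 755 pipeline or PLRU, it uses a 2-way FIFO replacement policy, which is simpler to simulate. The function name and the access pattern are illustrative assumptions. The two starting states hold the same blocks and differ only in insertion order.

```python
from collections import deque


def fifo_misses(initial, accesses, ways=2):
    """Count misses of a single FIFO cache set.

    `initial` lists the cached blocks from oldest to newest. A hit leaves
    the FIFO order unchanged; a miss evicts the oldest block.
    """
    cache = deque(initial, maxlen=ways)
    misses = 0
    for block in accesses:
        if block not in cache:
            misses += 1
            cache.append(block)  # maxlen drops the oldest (leftmost) block
    return misses


for iterations in (1, 10, 100):
    loop = ["c", "b", "a"] * iterations
    state_s = fifo_misses(["a", "b"], loop)  # a is the oldest block
    state_t = fifo_misses(["b", "a"], loop)  # b is the oldest block
    print(iterations, state_s, state_t, state_t - state_s)
```

Starting with `a` as the oldest block, every second access hits. Starting with `b` as the oldest block, every access misses. The two states never converge, so the gap in miss counts grows with the number of loop iterations and no constant bounds it.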

Classification of Architectures
- Fully Timing Compositional Architectures
  - No timing anomalies or domino effects
  - Timing analysis can safely follow worst case paths only
  - Example: ARM7
- Compositional Architectures with Constant Bounded Effects
  - Exhibit timing anomalies but no domino effects
  - Timing analysis has to consider all paths, but can be optimized to safely discard all local non-worst-case paths by adding a constant number of cycles to the worst case path, trading precision for efficiency
  - Example: Infineon TriCore
- Non Compositional Architectures
  - Exhibit timing anomalies and domino effects
  - Timing analysis has to follow all possible paths, since a local effect can influence future execution arbitrarily
  - Example: PowerPC 755

Conclusions
- Architectural optimizations in embedded systems are necessary to improve performance and to meet critical time constraints
  - Pipelines: multiple issue, out-of-order execution, branch prediction, etc.
- However, an architectural optimization may not be worth implementing if effects such as timing anomalies and domino effects will have a negative impact on timing analysis
  - How good is an optimization if you can’t measure its effects?
- A trade-off exists between the amount of execution time you can save by pipeline optimizations and the amount of precision you lose in timing analysis

Questions?
