Timing Anomalies in Dynamically Scheduled Microprocessors
Thomas Lundqvist, Per Stenström (RTSS '99)
Presented by: Kaustubh S. Patil

Assumptions in current methods for finding the Worst Case Execution Time (WCET)
The execution time of an instruction is not fixed
- due to pipeline stalls or cache misses
- due to input-data dependency, e.g. mulhw, mulhwu, mullw in the PowerPC architecture
In such cases, current methods assume the longest instruction latency for every instruction (see the sketch below)
- e.g. if the outcome of a cache access is unknown, a cache miss is assumed
- an intuition-based choice
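To make the per-instruction worst-case assumption concrete, here is a minimal sketch of how such a locally pessimistic bound is accumulated; it is not code from the paper, and the instruction mix and latency numbers are purely hypothetical:

```c
#include <stdio.h>

/* Hypothetical per-instruction worst-case latencies (cycles). */
typedef struct {
    const char *mnemonic;
    int max_latency;      /* longest possible latency for this instruction */
} instr_t;

int main(void)
{
    /* Illustrative sequence only; the numbers are made up. */
    instr_t seq[] = {
        { "lwz",   10 },  /* load: assume a cache miss                    */
        { "mulhw",  5 },  /* multiply: assume the data-dependent worst    */
        { "add",    1 },
        { "stw",   10 },  /* store: assume a cache miss                   */
    };
    int wcet = 0;
    for (size_t i = 0; i < sizeof seq / sizeof seq[0]; i++)
        wcet += seq[i].max_latency;   /* always take the longest latency */
    printf("pessimistic WCET bound: %d cycles\n", wcet);
    return 0;
}
```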

Claim: making such assumptions for dynamically scheduled processors is wrong!
Dynamically scheduled processors execute instructions out of program order.
For such processors, a counter-intuitive increase or decrease in execution time is possible
- e.g. a cache miss can actually reduce the overall execution time
- such situations are termed timing anomalies

Organization of the presentation
- Architectural features that may cause anomalies
- Examples of timing anomalies
- How previous methods handle such anomalies
- Proposed methods to eliminate such anomalies
- Case study of a previous method in the context of the proposed solutions

Terms and definitions
Formal definition of a timing anomaly
- instruction latency is taken to be the same as instruction execution time
- case 1: the latency of the first instruction is increased by i cycles
- case 2: it is decreased by d cycles
- let C be the resulting change in the execution time of the whole sequence
Definition: a timing anomaly is a situation where, in the first case, C > i or C < 0, and in the second case, C < -d or C > 0 (restated as a check in the sketch below).
In-order and out-of-order resources
- if a processor only contains in-order resources, no timing anomalies can occur
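The definition can be restated as a small check; this helper is my own illustration, not code from the paper:

```c
#include <stdbool.h>

/* Returns true if the observed change in total execution time is a
 * timing anomaly under the definition above.
 *   delta_latency: +i if the first instruction got i cycles slower,
 *                  -d if it got d cycles faster
 *   delta_total:   resulting change C in the whole sequence's time  */
bool is_timing_anomaly(int delta_latency, int delta_total)
{
    if (delta_latency > 0)                      /* latency increased by i */
        return delta_total > delta_latency ||   /* C > i  */
               delta_total < 0;                 /* C < 0  */
    else                                        /* latency decreased by d */
        return delta_total < delta_latency ||   /* C < -d */
               delta_total > 0;                 /* C > 0  */
}
```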

Architecture used for illustrating the timing anomalies (figure of the processor model)

Timing anomaly examples
A cache hit results in the worst-case execution time
- B is dependent on A
- in the cache-hit case, B gets priority over C
- in the cache-miss case, D and E execute 1 cycle earlier
- the reason for this anomaly: the IU is an out-of-order resource

Timing anomaly examples (contd.)
The overall miss penalty can be higher than a single cache-miss penalty
- A, B, C have dependencies
- C always results in a miss
- C finishes 11 cycles later instead of the single miss penalty of 8 cycles
- the MCIU allows B and D to execute out of order

Timing anomaly examples (contd.)
Unbounded impact on the WCET
- A and B form a loop body
- fast case: A executes as soon as it is dispatched
- slow case: A is delayed by one cycle
- the old B gets priority over the new A, so A gets delayed in every iteration
- total penalty of k cycles for k iterations

Limitations of previous methods
- such methods make locally safe decisions, at the basic-block or instruction level
- timing anomalies caused by variable-latency instructions and differing pipeline states make such local decisions unsafe
- consider an instruction sequence with n variable-latency instructions, each of which can have k different latencies
- up to k^n possibly different schedules need to be examined (see the enumeration sketch below)
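A toy enumeration, under hypothetical values of n and k, showing why the number of schedules to examine grows as k^n:

```c
#include <stdio.h>

#define N 3   /* variable-latency instructions in the sequence (hypothetical) */
#define K 2   /* possible latencies per instruction, e.g. {hit, miss}         */

static void enumerate(int depth, int choice[N], long *count)
{
    if (depth == N) {                 /* one complete latency assignment      */
        printf("assignment %ld:", *count);
        for (int j = 0; j < N; j++)
            printf(" %d", choice[j]); /* each may yield a distinct schedule   */
        printf("\n");
        ++*count;
        return;
    }
    for (int c = 0; c < K; c++) {
        choice[depth] = c;
        enumerate(depth + 1, choice, count);
    }
}

int main(void)
{
    int choice[N];
    long count = 0;
    enumerate(0, choice, &count);
    printf("schedules to examine: %ld (= K^N)\n", count);  /* 2^3 = 8 */
    return 0;
}
```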

Methods for eliminating anomalies
The pessimistic serial-execution method
- all instructions are executed in order
- all memory references are considered misses
- which instruction sequence is considered?
- a very pessimistic approach
The program modification method
- all unknown events and variable-latency instructions must result in a predictable pipeline state
- if one path is selected as the WCET path among a set of paths, the cache and pipeline state at its end must be the same for all paths

The program modification method (contd.)
Making the pipeline state predictable
- forcing in-order resource use is one solution, needing little processor support
- use the sync instruction of the PowerPC architecture
  - to take care of variable-latency instructions
  - also when cache hits are unpredictable
- sync works for both of the previous conditions (see the sketch below)
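A hedged sketch of what the modification could look like at source level, using GCC inline assembly for PowerPC; the wrapper function is my own illustration, and only the use of sync itself comes from the slide:

```c
/* Force the pipeline into a predictable state after a variable-latency
 * multiply and before any dependent use.                               */
static inline int mulhi_predictable(int a, int b)
{
    int hi;
    /* mulhw: data-dependent, variable-latency multiply (high 32 bits) */
    __asm__ volatile ("mulhw %0,%1,%2" : "=r"(hi) : "r"(a), "r"(b));
    /* sync: completes all prior instructions before later ones issue,
     * leaving the pipeline in a known state at this point             */
    __asm__ volatile ("sync" ::: "memory");
    return hi;
}
```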

The program modification method (contd.)
Making the cache state predictable
- invalidate all cache blocks after each path: poor performance
- invalidate only the differing cache blocks: still poor performance
- preload cache blocks: needs special instruction support, e.g. icbt and dcbt in the PowerPC architecture (see the sketch below)
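Similarly, a minimal sketch of the preload alternative; dcbt is the PowerPC data-cache-block-touch instruction named on the slide (icbt is its instruction-cache counterpart on some cores), while the helper functions and the loop are my own illustration:

```c
/* Touch the data a region will use so that, at analysis time, later
 * references to it can be classified as known hits.                  */
static inline void preload_block(const void *addr)
{
    /* dcbt rA,rB form: effective address = 0 + addr */
    __asm__ volatile ("dcbt 0,%0" :: "r"(addr) : "memory");
}

/* Usage sketch: preload each cache line of a buffer before it is used. */
void preload_buffer(const char *buf, unsigned len, unsigned line_size)
{
    for (unsigned off = 0; off < len; off += line_size)
        preload_block(buf + off);
}
```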

Case study: the symbolic execution method
- instruction-level simulation
- instruction semantics extended to handle unknown operands (see the sketch below), e.g. for add A,B,C:
  - A <- B + C if both B and C are known
  - A <- unknown if either B or C is unknown
- elimination of infeasible paths
- merging of paths to avoid an exponential number of paths
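The extended add semantics, sketched in C; the value representation (a known flag plus a 32-bit value) is an assumption of mine, not the paper's data structure:

```c
#include <stdbool.h>
#include <stdint.h>

/* A register value during symbolic execution: either a concrete
 * 32-bit value or "unknown" (input-dependent).                    */
typedef struct {
    bool     known;
    uint32_t value;
} symval_t;

/* add A,B,C  ->  A = B + C, or unknown if either operand is unknown */
static symval_t sym_add(symval_t b, symval_t c)
{
    symval_t a = { false, 0 };
    if (b.known && c.known) {
        a.known = true;
        a.value = b.value + c.value;
    }
    return a;   /* "unknown" propagates to the result */
}
```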

Changes to this existing method
- a first pass identifies all places where local decisions need to be made, e.g. merging of paths and variable-latency instructions
- sync and preload instructions are added at such sites
- T_serial = the sum of all instruction latencies and miss penalties
- T = T_serial / 2 in the ideal case

Benchmarks used
PSIM, an existing instruction-level simulator, was extended for symbolic execution and the program modification approach.
The benchmarks used were:
- matmult: multiplies two 50×50 matrices
- bsort: bubble sort of 100 integers
- isort: insertion sort of 10 integers
- fib: calculates the nth element of the Fibonacci sequence for n < 30
- DES: encrypts 64-bit data
- jfdctint: discrete cosine transform of an 8×8-pixel image
- compress: compresses 50 bytes of data

Evaluation results
Columns: Program | Actual WCET | Unsafe WCET (ratio) | Serial WCET (ratio) | Modified WCET (ratio) | Modified slowdown
Programs: matmult, bsort, isort, fib, DES, jfdctint, compress (numeric values not preserved in the transcript)

Summary
- Timing anomalies in dynamically scheduled processors can cause previous methods to produce wrong WCET estimates.
- Using architectural support to control the state of the cache and the pipeline, the anomalies can be eliminated, and the previous methods can then be applied to the modified programs.

Thank you! Questions?