
Marco Caccamo Department of Computer Science University of Illinois at Urbana-Champaign Toward the Predictable Integration of Real-Time COTS Based Systems.




1 Marco Caccamo Department of Computer Science University of Illinois at Urbana-Champaign Toward the Predictable Integration of Real-Time COTS Based Systems

2  Part of this research is joint work with Prof. Lui Sha.
 This presentation covers selected research sponsored by:
◦ National Science Foundation
◦ Lockheed Martin Corporation
 Graduate students who led these research efforts:
◦ Rodolfo Pellizzoni
◦ Bach D. Bui
References
 R. Pellizzoni, B. D. Bui, M. Caccamo and L. Sha, "Coscheduling of CPU and I/O Transactions in COTS-Based Embedded Systems," to appear at the IEEE Real-Time Systems Symposium, Barcelona, December 2008.
 R. Pellizzoni and M. Caccamo, "Toward the Predictable Integration of Real-Time COTS Based Systems," Proceedings of the IEEE Real-Time Systems Symposium, Tucson, Arizona, December 2007.
Acknowledgement

3  Embedded systems are increasingly built from Commercial Off-The-Shelf (COTS) components to reduce costs and time-to-market.
 This trend holds even for companies in the safety-critical avionics market, such as Lockheed Martin Aeronautics, Boeing and Airbus.
 COTS components usually provide better performance:
◦ SAFEbus, used in the Boeing 777, transfers data at up to 60 Mbps, while a COTS interconnect such as PCI Express can reach transfer speeds over three orders of magnitude higher.
 COTS components are mainly optimized for average-case performance, not for the worst-case scenario.
COTS HW & RT Embedded Systems

4  Experiment based on an Intel platform running at typical embedded-system speed.
 PCI-X at 133 MHz, 64 bit, fully loaded.
 Task suffers continuous cache misses.
 Up to 44% wcet increase. This is a big problem!
I/O Bus Transactions & WCETs

5  According to the ARINC 653 avionics standard, different computational components should be put into isolated partitions (cyclic time slices of the CPU).
 ARINC 653 does not provide any isolation from the effects of I/O bus traffic: a peripheral is free to interfere with cache fetches while any partition (even one not requiring that peripheral) is executing on the CPU.
 To provide true temporal partitioning, enforceable specifications must address the complex dependencies among all interacting resources.
 See the Aeronautical Radio Inc. ARINC 653 Specification, which defines the Avionics Application Standard Software Interface.
ARINC 653 and unpredictable COTS behaviors

6  Cache-peripheral conflict:
◦ A master peripheral is working on behalf of Task B.
◦ Task A suffers a cache miss.
◦ Processor activity can be stalled due to interference at the Front Side Bus (FSB) level.
 How relevant is the problem?
◦ Four high-performance network cards, saturated bus.
◦ Up to 49% increased wcet for memory-intensive tasks.
[Diagram: CPU on the Front Side Bus; DDRAM and Host PCI Bridge; master and slave peripherals on the PCI bus; Tasks A and B on the CPU]
This effect MUST be considered in wcet computation!
Sebastian Schonberg, "Impact of PCI-Bus Load on Applications in a PC Architecture," RTSS 2003.
Peripheral Integration: Problem Scenario

7  To achieve end-to-end temporal isolation, shared resources (CPU, bus, cache, peripherals, etc.) should either support strong isolation or make temporal interference quantifiable.
 Highly pessimistic assumptions are often made to compensate for the lack of end-to-end temporal isolation on COTS:
◦ one example is accounting for the effect of all peripheral traffic in the wcet of real-time tasks (up to a 44% increment in task wcet)!
 Lack of end-to-end temporal isolation dramatically raises integration costs and is a source of serious concern during the development of safety-critical embedded systems:
◦ at integration time (the last phase of the design cycle), testing can reveal unexpected deadline misses, causing expensive design rollbacks.
Goal: End-to-End Temporal Isolation on COTS

8  It is mandatory to take a closer look at HW behavior and its integration with the OS, middleware, and applications.
 We aim to analyze the temporal interference caused by COTS integration:
◦ if the analyzed performance is not satisfactory, we search for alternative (non-intrusive) HW solutions  see Peripheral Gate.
Goal: End-to-End Temporal Isolation on COTS

9  We introduced an analytical technique that computes safe bounds on the I/O-induced task delay (D).
 To control I/O interference over task execution, we introduced a coscheduling technique for the CPU and I/O peripherals.
 We designed a COTS-compatible peripheral gate and hardware server to enable/disable I/O peripherals (the hw server is in progress!).
Main Contributions

10  Modern COTS-based embedded architectures are multi-master platforms.
 COTS are inherently unpredictable due to:
◦ pipelined, cached CPUs;
◦ master (DMA) peripherals;
◦ etc.
[Diagram: PowerPC clocked at 1000 MHz with multi-level cache; 256 MB DDR SDRAM on a 64-bit memory bus clocked at 125 MHz; system controller with shared memory; 64-bit PCI-X clocked at 100 MHz and 32-bit PCI buses clocked at 66/33 MHz behind PCI-to-PCI and PCI-X-to-PCI bridges, hosting graphics, MPEG/digital video, Fibre Channel, Ethernet, IEEE 1394, RS-485 and discrete I/O interfaces]
 We assume a shared-memory architecture with single-port RAM.
 We will show safe bounds for cache-peripheral interference at the main memory level.
The cache-peripheral interference problem

11  Similar to a network calculus approach.
 E(t): maximum cumulative bus time required by peripherals in any interval of length t.
 How to compute:
◦ measurement;
◦ knowledge of distributed traffic.
 Assumptions:
◦ maximum non-preemptive transaction length: L'
◦ no buffering in bridges (the analysis was extended to handle buffering too!).
Peripheral Burstiness Bound
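The measurement option above can be made concrete: given a recorded trace of peripheral transactions, the burstiness bound is the worst cumulative bus time observed in any sliding window of length t. This is a minimal sketch under our own assumptions (the trace format and function name are illustrative, not from the paper):

```python
def empirical_E(trace, t):
    """Empirical burstiness bound E(t): maximum cumulative bus time
    requested by peripheral transactions in any window of length t.

    trace: list of (arrival, bus_time) pairs, sorted by arrival time.
    Candidate windows are anchored at transaction arrivals, which is
    sufficient when each transaction's bus time is charged at its arrival.
    """
    best = 0.0
    for start, _ in trace:
        window = sum(b for a, b in trace if start <= a < start + t)
        best = max(best, window)
    return best
```

For instance, three 1-unit transactions arriving at times 0, 2 and 3 give E(4) = 3, while E(1) = 1.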

12  c(t): cumulative bus time required to fetch/replace cache lines in [0, t].
 Note: not an upper bound!
 Assumption:
◦ the CPU is stalled while waiting for a level-2 cache line fetch (no hyperthreading).
 How to compute:
◦ static analysis;
◦ profiling.
 Profiling yields multiple traces; run the delay analysis on all of them.
[Plot of bus time vs. t: flat curve while the CPU is executing; increasing curve (slope 1) while the CPU is stalled during a cache line fetch]
Cache Miss Profile
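Given a profiled trace of fetch intervals, c(t) can be evaluated directly: it stays flat while the CPU executes and grows with slope 1 during a stall. A sketch, assuming the profile is recorded as sorted, non-overlapping (start, duration) stall intervals (our own representation, chosen for illustration):

```python
def cache_profile(fetches, t):
    """c(t): cumulative bus time spent fetching/replacing cache lines
    in [0, t].

    fetches: sorted, non-overlapping (start, duration) intervals during
    which the CPU is stalled on a cache line fetch. The returned curve
    is flat between fetches and grows with slope 1 inside a fetch,
    matching the profile shown on the slide.
    """
    total = 0.0
    for start, duration in fetches:
        if t <= start:
            break                      # later fetches cannot contribute
        total += min(duration, t - start)
    return total
```

For a trace with stalls at [1, 3) and [5, 6), c(2) = 1 (halfway through the first fetch) and c(6) = 3 (both fetches complete).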

13  The proposed analysis computes the worst-case increase (D) in task computation time due to cache delays caused by FSB interference.
 Main idea: treat the FSB + CPU cache logic as a switch that multiplexes accesses to system memory.
◦ Inputs: cache line misses over time and peripheral bandwidth.
◦ Output: a curve representing the delayed cache misses.
 Bus arbitration is assumed to be round-robin (RR) or fixed-priority (FP); transactions are non-preemptive.
[Diagram: the CPU cache-miss curve and the peripheral bandwidth curve feed the analysis, yielding the delayed cache-miss curve; the wcet increment (D) is the difference with respect to the wcet with no I/O interference]
Cache Delay Analysis

14  Worst-case situation: a PCI transaction is accepted just before a CPU cache miss.
 Worst-case interference: min(CM, PT/L') * L'
◦ CM: number of cache misses
◦ PT: total peripheral traffic during task execution
◦ assuming RR bus arbitration
[Timeline: CPU cache misses (one cache line length each) interleaved with PCI transactions of maximum length L']
Analysis: Intuition (1/2)
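Since min(CM, PT/L') * L' equals min(CM * L', PT), the intuition reduces to one line: each cache miss can be blocked by at most one non-preemptive transaction of length at most L', and the total blocking can never exceed the total peripheral traffic. A minimal sketch (variable names follow the slide):

```python
def simple_delay_bound(CM, PT, L):
    """Worst-case I/O-induced delay under round-robin arbitration.

    CM: number of cache misses
    PT: total peripheral bus time during task execution
    L:  maximum non-preemptive transaction length (L' on the slide)

    min(CM, PT/L) * L rewritten as min(CM * L, PT): the delay is capped
    both per-miss (one transaction each) and by the available traffic.
    """
    return min(CM * L, PT)
```

With 3 misses, transactions of length 2 and plentiful traffic, the bound is 6; with only 5 units of peripheral traffic available, the traffic itself becomes the cap.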

15  The analysis shown is pessimistic; cache misses exhibit bursty behavior.
 Example: assume one peripheral transaction every T time units.
 Real analysis: compute the exact interference pattern based on the burstiness of cache misses and peripheral transactions.
[Timeline: bursts of CPU memory accesses that cannot be delayed, and peripheral transactions that cannot delay the CPU]
Analysis: Intuition (2/2)

16  Worst-case situation: a peripheral transaction of length L' is accepted just before a CPU cache miss.
[Timeline of CACHE and CPU activity: fetch start times in the cache access function c(t) are unmodified by peripheral activity]
Worst Case Interference Scenario

17  Cache bound: the maximum number of interfering peripheral transactions equals the number of cache misses.
 Let CM be the number of cache misses.
 Then D ≤ CM * L'.
[Timeline of CACHE, CPU and PERIPHERAL activity: each cache miss is delayed by at most one peripheral transaction, accumulating delay D]
Bound: Cache Misses

18  Peripheral bound: the maximum interference D ≤ the maximum bus time requested by peripherals in the interval.
 Let E(t) be the maximum cumulative bus time requested by peripherals in any interval of length t.
 Then, equivalently: in general, given a set of fetches {f_i, ..., f_j} with start times {t_i, ..., t_j}, D ≤ E(t_j - t_i + D).
[Timeline of CACHE, CPU and PERIPHERAL activity, showing the delay D]
Bound: Peripheral Load

19  There is a circular dependency between the amount of peripheral load that interferes with {f_i, ..., f_j} and the delay D(f_i, f_j).
 When peripheral traffic is injected on the FSB, the start time of each fetch is delayed. In turn, this increases the time interval between f_i and f_j, and therefore more peripheral traffic can now interfere with those fetches.
 Our key idea is that we do not need to modify the start times {t_i, ..., t_j} of the fetches when we take the I/O traffic injected on the FSB into account. Instead, we account for it through the equation D ≤ E(t_j - t_i + D), whose right-hand side already covers the stretched interval.
Some Insights about Peripheral Bound
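One way to resolve this circular dependency, sketched here under our own assumptions (E nondecreasing, long-run peripheral bus utilization below 100%), is a fixed-point iteration: start from D = 0 and repeatedly apply D ← E(Δ + D) until the value stabilizes:

```python
def peripheral_delay(E, delta, eps=1e-9, max_iter=100000):
    """Smallest fixed point of D = E(delta + D), where delta = t_j - t_i
    is the undelayed span of the fetch set {f_i, ..., f_j}.

    The iterates form a nondecreasing sequence; they stay bounded (and
    hence converge) when the peripheral load is below the bus capacity.
    """
    D = 0.0
    for _ in range(max_iter):
        D_next = E(delta + D)
        if D_next - D < eps:
            return D_next
        D = D_next
    raise RuntimeError("peripheral load too high: no fixed point found")
```

For example, with E(t) = 2 + 0.5t (a burst of 2 plus half the bus bandwidth) and Δ = 4, the iteration converges to D = 8, the smallest solution of D = E(4 + D).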

20  D represents both the maximum delay suffered by fetches within [0, 36] and the increase in the time interval available to interfering traffic.
[Diagram: fetches in the interval [0, 36]; max interference D]

21  E(t_5 - t_1 + D) = 14, but the real worst-case delay is 13!
 Reason: the cache is too bursty; the interference from one peripheral transaction is "lost" while the cache is not in use. That transaction cannot interfere!
[Timeline of CACHE and PERIPHERAL activity, with the non-interfering transaction highlighted and the delay D marked]
The Intersection is not Tight!

22  Solution: split the set of fetches into multiple intervals and bound the delay of each interval separately.
 How many intervals do we need to consider?
[Timeline of CACHE and PERIPHERAL activity, split into sub-intervals]
The Intersection is not Tight!

23  An iterative algorithm evaluates N(N+1)/2 intervals.
 Each interval is computed in O(1), for an overall complexity of O(N^2).
 The bound is tight (see RTSS'07).
[Diagram: max delay u_1 for miss 1, u_2 for miss 2, u_3 for miss 3, u_4 for miss 4]
Delay Algorithm
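A simplified sketch of the enumeration (not the paper's O(1)-per-interval formulation, which we do not reproduce here): for every interval of consecutive misses, take the smaller of the cache bound and the peripheral bound, and return the maximum over all N(N+1)/2 intervals. The interface (miss start times, maximum transaction length L, load curve E) and the inlined fixed-point helper are our own assumptions:

```python
def delay_bound(miss_times, L, E, eps=1e-9):
    """Simplified sketch of the interval enumeration.

    For each set of consecutive misses {f_i, ..., f_j}, the delay is
    limited both by the cache bound ((j - i + 1) * L: one transaction
    per miss) and by the peripheral bound (fixed point of E over the
    stretched span). The overall bound is the maximum over all
    intervals. The actual RTSS'07 algorithm evaluates each interval
    incrementally in O(1); this sketch recomputes the fixed point.
    """
    def periph(delta, max_iter=100000):
        D = 0.0
        for _ in range(max_iter):
            D_next = E(delta + D)
            if D_next - D < eps:
                return D_next
            D = D_next
        return D  # load too high; caller sees a (loose) partial bound

    best = 0.0
    N = len(miss_times)
    for i in range(N):
        for j in range(i, N):
            cache_bound = (j - i + 1) * L
            span = miss_times[j] - miss_times[i]
            best = max(best, min(cache_bound, periph(span)))
    return best
```

With two misses 10 units apart, L = 3 and E(t) = 2 + 0.25t, the single-miss intervals are limited by the peripheral bound (about 2.67) while the two-miss interval is limited by both bounds meeting at 6, which becomes the overall result.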

24  Multitasking analysis using a cyclic executive (it was extended to EDF with a restricted-preemption model).
1. Analyze the task Control Flow Graph.
2. Build a set of sequential superblocks.
3. The schedule is an interleaving of slots composed of superblocks.
4. Algorithm: compute the number of superblocks in each slot.
5. Account for additional cache misses due to inter-task cache interference.
Multitasking analysis

25  The proposed analysis makes a fairly restrictive assumption: it must know the exact time of each cache miss.
 I/O interference is significant: when added to the wcet of all tasks, the system can suffer a huge waste of bandwidth!
 Key idea: let's coschedule the CPU and I/O peripherals.
 Goal: allow as much peripheral traffic as possible at run time while using CPU reservations that do NOT include the I/O interference (D).
Great! But c(t) is hard to get... and 44% is awful

26  Problem: obtaining an exact cache miss pattern is very hard.
◦ CPU simulation requires simulating all peripherals.
◦ Static analysis scales poorly.
◦ In practice, testing is often the preferred way.
 Our solution:
◦ Split each task into intervals.
◦ Insert a checkpoint at the end of each interval.
◦ Measure the wcet and the worst-case number of cache misses for each interval (with no peripheral traffic).
◦ Checkpoints should not break loops or branches (sequential macroblock boundaries).
Cache Miss Profile is Hard to Get

27  A coscheduling technique for COTS peripherals:
1. Divide each task into a series of sequential superblocks.
2. Run off-line profiling for each task, collecting information on the wcet and the number of cache misses in each superblock (without I/O interference).
3. Compute a safe (wcet + D) bound (including I/O interference) for each superblock by assuming a "critical cache miss pattern".
4. Design a peripheral gate (p-gate) to enable/disable I/O peripherals.
5. Design a new peripheral (on an FPGA board), the reservation controller, which executes the coscheduling algorithm and controls all p-gates.
6. Use the profiling information at run time to coschedule tasks and I/O transactions.
CPU & I/O coscheduling: HOW TO

28  Input: a set of intervals with wcet and cache misses.
 Since we do not know when each cache miss happens within an interval, we need to identify a worst-case pattern.
[Diagram: intervals with wcet_i and CM_i; bus-time profile of the worst-case pattern for an interval with CM_i = 4. This is actually the worst-case pattern!]
 If the peripheral load curve is concave, then we obtain a tight bound on the delay D (details are in a technical report).
 If the peripheral load curve is not concave, the bound on the delay D is not tight. Simulations showed that the upper bound is within 0.2% of the real worst-case delay.
Analysis with Interval Information

29  The on-line algorithm:
◦ Non-safety-critical tasks have CPU reservation = wcet (D NOT included!).
◦ At the beginning of each job, the p-gates are closed.
◦ At run time, at each checkpoint, the OS sends the APMC count of CPU cycles (exec_i) to the reservation controller.
◦ The reservation controller keeps track of the accumulated slack time. If the slack time Σ_i (wcet_i - exec_i) is greater than the delay D for the next interval, it opens the p-gate.
[Timeline: a task's total wcet split into wcet_1 ... wcet_5]
On-line Coscheduling Algorithm
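The slack rule above can be sketched in a few lines. This is our own illustration of the decision logic, not the controller's actual implementation; wcet, D and the measured execution times are per-interval lists:

```python
def pgate_schedule(wcet, D, exec_times):
    """Sketch of the on-line coscheduling rule.

    The p-gate starts closed. At each checkpoint the controller adds
    wcet[i] - exec_times[i] to the accumulated slack and opens the
    p-gate for the next interval only if slack >= D[next], the
    worst-case I/O-induced delay bound for that interval.

    Returns one open/closed decision per interval reached so far.
    """
    slack = 0.0
    decisions = [False]              # p-gate closed for the first interval
    for i, exec_i in enumerate(exec_times):
        slack += wcet[i] - exec_i    # checkpoint: bank the unused budget
        if i + 1 < len(wcet):
            decisions.append(slack >= D[i + 1])
    return decisions
```

Note that when the p-gate is open, the measured exec_i includes any I/O-induced delay actually suffered, so consumed delay is automatically charged against the banked slack.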

30  Initial slack = 0 => p-gate closed.
[Timeline: wcet_1 ... wcet_5; the algorithm is the one on the previous slide]
Coscheduling algorithm: an example

31  Slack += wcet_1 - exec_1. Slack < D_2 => p-gate closed.
[Timeline: exec_1 completed; interval 2 shown as wcet_2 + D_2]
Coscheduling algorithm: an example

32  Slack += wcet_2 - exec_2. Slack >= D_3 => p-gate open.
[Timeline: exec_1 and exec_2 completed; interval 3 shown as wcet_3 + D_3]
Coscheduling algorithm: an example

33  The system is composed of tasks/partitions with different criticalities; each task/partition uses different I/O peripherals.
 The right action depends on the task/partition criticality:
◦ Class A: block all non-relevant peripheral traffic (reservation = wcet + D).
◦ Class B: coschedule tasks and peripherals to maximize I/O traffic (reservation = wcet).
◦ Class C: all I/O peripherals are enabled.
[Timeline with examples: Class A, safety-critical (e.g., flight control); Class B, mission-critical (e.g., radar processing); Class C, non-critical (e.g., display)]
System Integration: example for avionic domain

34 Peripheral Gate
 We designed the peripheral gate (or p-gate for short) for the PCI/PCI-X bus: it allows us to control peripheral access to the bus.
 The peripheral gate is compatible with COTS devices: its use does not require any modifications to either the peripheral or the motherboard.

35  The reservation controller commands the peripheral gates (p-gates).
 The kernel sends scheduling information to the reservation controller.
 Minimal kernel modification (send the PID and exec time of the executing process).
 Class A task: block all non-relevant peripheral traffic.
 Class B task: the reservation controller implements the coscheduling algorithm.
[Diagram: CPU and RAM on the FSB; the reservation controller sits on the peripheral bus and commands the peripheral gates in front of peripherals P#1, P#2, P#3; the CPU schedule (processes #1, #2, #3, all class A) is mirrored in the reservation controller over time]
Peripheral Gate

36  The testbed uses a standard Intel platform.
 The reservation controller is implemented on an FPGA; the p-gate uses a PCI extender card + discrete logic.
[Photo: logic analyzer for debugging and measurement; p-gate; Gigabit Ethernet NIC; reservation controller (Xilinx FPGA)]
Current Prototype

37  Getting this information requires support from the CPU and the OS.
 We used the Architectural Performance Monitor Counters (APMCs) of the Intel Core 2 microarchitecture, but other manufacturers (e.g., IBM) have similar support (the implementation is specific, the lesson is general).
 Two APMCs are configured to count cache misses and CPU cycles in user space.
 The task descriptor is extended with execution time and cache miss fields.
 At context switch, the APMCs are saved/restored in descriptors like any other task-specific CPU registers.
 Implemented under Linux/RK.
Kernel Implementation

38  We compared our adaptive heuristic with other algorithms.
 Assumption: at the beginning of each interval, the algorithm chooses whether to open or close the switch for that interval.
1. Slack-only: baseline comparison; uses only the remaining slack time once the task has finished.
2. Predictive:
◦ also uses measured average execution times;
◦ "predicts" future slack time and optimizes the open intervals at each step;
◦ computing an optimal allocation is NP-hard, so it uses a fast greedy heuristic instead.
3. Optimal:
◦ clairvoyant (not implementable);
◦ provides an upper bound on the performance of any run-time, predictive algorithm.
Other Coscheduling Algorithms

39  All run-time algorithms were implemented on a Xilinx ML505 FPGA.
 The optimal was computed using a Matlab optimization tool.
 We used an MPEG decoder as benchmark:
◦ as a trend, video processing is increasingly used in the avionic domain for mission control;
◦ it simulates a Class B application subject to heavy I/O traffic.
 The task misses its deadline by up to 30% if I/O traffic is always allowed!
 The run-time algorithm is already close to the optimal; there is not much to gain with the improved heuristic.
Results in terms of % of time the p-gate is open: Slack-only 4.89%; Run-time 31.21%; Predictive 36.65%; Optimal 40.85%.
The Test

40  We performed synthetic simulations to better understand the performance of the run-time algorithm.
 20 superblocks per task:
◦ α is the variation between the wcet and the average computation time;
◦ β is the % of time the task is stalled due to cache misses.
[Plot: performance as a function of α and β]
Simulation Results

41  Problem: blocking the peripheral reduces its maximum throughput.
◦ This is OK only if critical tasks/partitions run for a limited amount of time.
 Better solution: implement a hardware server with buffering on a SoC.
◦ Transactions are queued in the hw server's memory during non-relevant partitions.
◦ Interrupts/DMA transfers are delivered only during the execution of the interested tasks/partitions.
◦ Similar to real-time aperiodic servers: a hw server permits aperiodic I/O requests to be analyzed as if they followed a predictable (periodic) pattern.
 FPGA-based SoC design with Linux device drivers.
 Currently in development.
[Diagram: Xilinx FPGA with CPU, memory bridge, OPB, PCI interface and interrupt controller, connected to DRAM, the PCI host bridge, the peripheral and DDRAM]
Improving the P-Gate: Hardware Server (in progress)

42  A major issue in peripheral integration is task delay due to cache-peripheral contention at the main memory level.
 We proposed a framework to: 1) analyze the delay due to cache-peripheral contention; 2) control task execution times.
 The proposed coscheduling technique was tested with the PCI/PCI-X bus; the hw server will be ready soon.
 Future work: extend to multiprocessor and distributed systems.
Conclusions

