Time-Predictable Execution of Embedded Software on Multi-core Platforms. Sudipta Chattopadhyay, under the guidance of A/P Abhik Roychoudhury.

Presentation transcript:

Time-Predictable Execution of Embedded Software on Multi-core Platforms. Sudipta Chattopadhyay, under the guidance of A/P Abhik Roychoudhury.

Embedded Systems

Real-time Constraints. Embedded system: hard real-time or soft real-time.

Timing Analysis. Hard real-time systems require absolute timing guarantees. Two levels: system-level analysis and single-task analysis. Worst-case execution time (WCET) analysis: an upper bound on execution time for all possible inputs. A sound over-approximation is obtained by static analysis.

WCET Analysis. [Figure: analysis flow. Program; control flow graph; micro-architectural modeling; WCET of basic blocks (constraints); infeasible path constraints; loop bound; path analysis; WCET bound.]
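
The path-analysis step can be illustrated with a minimal sketch. It assumes an acyclic control flow graph and replaces the ILP formulation with a plain longest-path search, which is a simplification and not the tool's method. The CFG, the block costs, and the loop bound below are hypothetical values.

```python
from functools import lru_cache

def wcet_bound(cfg, cost, entry, exit_node):
    """Longest entry-to-exit path in an acyclic CFG, weighted by per-block WCET."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        if node == exit_node:
            return cost[node]
        # Take the most expensive successor: a sound choice when the
        # actual path taken is unknown.
        return cost[node] + max(longest_from(s) for s in cfg[node])
    return longest_from(entry)

# Hypothetical CFG: B1 branches to B2 or B3, both rejoin at B4.
cfg = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
cost = {"B1": 5, "B2": 20, "B3": 8, "B4": 3}  # assumed per-block WCET (cycles)
loop_bound = 10  # assumed: this region is a loop body run at most 10 times

body_wcet = wcet_bound(cfg, cost, "B1", "B4")
total_wcet = loop_bound * body_wcet
print(body_wcet, total_wcet)
```

An infeasible-path constraint would remove some paths from the search and could only lower the bound.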

Architecture. [Figure: Core 1 ... Core n with L1 caches, a shared L2 cache, a shared bus, and memory; annotated "Resource sharing".]

Overview. Dissertation work (time-predictable execution in multi-core): unified cache; shared cache + shared bus; a multi-core WCET tool; cache-related preemption delay analysis; coherence miss modeling; shared scratchpad allocation. [Figure 1: Core 1 ... Core n with L1 caches, a shared L2 cache, a shared bus, and memory; annotated "Resource sharing". Figure 2: processor with an L1 instruction cache (instruction accesses) and an L1 data cache (data accesses), a bus, an L2 unified cache, and main memory; annotated "Conflicts with different instruction and data memory blocks".]

Micro-architectural Modeling. Single core: pipeline, cache, branch predictor. Multi-core: shared cache, shared bus.

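Comparison. Columns: work; micro-architectural level technique; program level technique; precision; scalability.
Classical abstract interpretation (AI): AI; precision ×; scalability √.
Classical model checking (MC): MC; precision √; scalability ×.
RTS’00 (aiT, Chronos): AI; integer linear programming; precision can be improved; scalability √.
RTSS’10: AI; MC; precision can be improved; scalability _.
Our approach: (AI+MC); integer linear programming; precision > RTS’00; scalability = RTS’00.
Our approach: (AI+MC); MC; precision > RTSS’10; scalability = RTSS’10.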

Imprecision in Abstract Interpretation. Path p1 reaches the join with cache state C1 and path p2 with cache state C2; the joined cache state is C3. [Figure: abstract cache sets {a, b} on p1 and {b, x} on p2, ordered by age; the joined cache state retains only b.] Path p1 or path p2? The joined cache state loses information about paths p1 and p2.
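
The loss of information at a join can be sketched as follows, assuming the standard must-analysis join for LRU caches: a block survives only if it is present on both paths, and it keeps its larger age bound. The concrete ages are assumptions, since the slide only shows the set contents.

```python
def must_join(c1, c2):
    """Join two abstract cache sets of a must-analysis.

    Each set maps a memory block to an upper bound on its LRU age
    (0 = youngest). Only blocks guaranteed on both paths survive,
    each with the larger (more pessimistic) age bound.
    """
    return {b: max(c1[b], c2[b]) for b in c1.keys() & c2.keys()}

c1 = {"a": 0, "b": 1}  # assumed state after path p1
c2 = {"b": 0, "x": 1}  # assumed state after path p2
c3 = must_join(c1, c2)
print(c3)
```

After the join, neither `a` nor `x` is guaranteed to be cached, even though one of them is cached on every concrete path.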

Model Checking Alone? A path-sensitive search. Path-sensitive search is expensive: path explosion. Worse, it is combined with the possible cache states. [Figure: paths p1 and p2 with cache states C1 and C2.]

Model Checking Alone? A path-sensitive search. Path-sensitive search is expensive: path explosion. Worse, it is combined with the possible cache states. [Figure: separate abstract LRU cache sets ({a, b} and {b, x}, ordered by age) are kept for paths p1 and p2; annotated "State Explosion".]

Cache Analysis. [Figure: WCET analysis flow. Program; micro-architectural modeling (pipeline analysis, branch predictor modeling, cache analysis by abstract interpretation); WCET of basic blocks (constraints); infeasible path constraints; loop bound; path analysis (IPET). The cache analysis outcome is refined by a model checker until all checks are done or a timeout occurs.] Refinement by the model checker can be terminated at any point. Model checker refinement steps are inherently parallel. Each model checker refinement step checks a lightweight assertion property.

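Refinement (Inter-core). [Figure: a task accesses memory block m between start and exit; a conflicting task accesses m1 under x < y and m2 under x == y. Counting both m1 and m2 as conflicts turns the cache hit on m into a cache miss, but the path through both m1 and m2 is infeasible, so the miss is spurious.]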

Refinement (Inter-core). [Figure: the conflicting task is instrumented with C_m++ (increment conflict) at the accesses to m1 (under x < y) and m2 (under x == y), followed by assert (C_m <= 1). The path through both is infeasible, the property is verified, and the access to m is a cache hit.]
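
The conflict-counter idea can be emulated in a few lines. This is a sketch only: exhaustive enumeration over a small, assumed input domain stands in for the model checker, and the function names are hypothetical.

```python
from itertools import product

def conflicts_on_path(x, y):
    """Instrumented conflicting task: C_m counts the conflicts with block m."""
    c_m = 0
    if x < y:
        c_m += 1  # access to m1 (C_m++)
    if x == y:
        c_m += 1  # access to m2 (C_m++)
    return c_m

def property_holds(bound, domain=range(-3, 4)):
    """Check assert(C_m <= bound) on every input in the assumed small domain."""
    return all(conflicts_on_path(x, y) <= bound
               for x, y in product(domain, repeat=2))

path_insensitive_count = 2   # both m1 and m2 assumed to execute
verified = property_holds(1)  # x < y and x == y never hold together
print(path_insensitive_count, verified)
```

Because at most one conflict can occur on any feasible path, the count of 2 from the path-insensitive analysis is spurious, and a cache set with room for m plus one conflicting block would still hold m.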

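Refinement (Why it works?). [Figure: a program with branches x < y and x == y that accesses m and m'. The access to m' conflicts with m and is instrumented with C_m++ (increment conflict); the remaining statements do not affect the value of C_m. The property is assert (C_m <= 0). Path 2 is labeled as a cache miss for m.]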

Experimental Setup (Chronos Toolkit). [Figure: tool flow. C source; GCC (SimpleScalar); binary code; CFG; micro-architectural modeling (cache, pipeline, branch prediction); micro-architectural constraints and flow constraints; ILP; WCET. CBMC (C bounded model checking) is attached to the flow.]

Experimental Result

Experimental Result. [Figure: WCET for the tasks cnt, jfdctint, edn, fir, fdct, and ndes with an L1 cache and a shared L2 cache; cache configurations labeled "4-way associative, 8 KB" and "Direct-mapped, 256 bytes".] Average time = 70 secs.

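Extension Using Symbolic Execution. [Figure: the conflicting task (m1 under x < y, m2 under x == y, each with C_m++ to increment the conflict count, followed by assert (C_m <= 1)) is explored symbolically. The first branch splits into x < y and x ≥ y. On the x < y side, the constraint solver checks x < y ˄ x = y (labels: unknown, NO) and that path is aborted; assert (C_m <= 1) is satisfied.]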

Extension Using KLEE. [Figure: tool flow. C source; GCC (SimpleScalar); binary code; CFG; micro-architectural modeling (cache, pipeline, branch prediction); micro-architectural constraints and flow constraints; ILP; WCET. CBMC/KLEE is attached to the flow.]

A Generic Framework. Three different architectural/application settings, each involving a cache conflict: intra-task (WCET in single core); inter-task (cache-related preemption delay analysis) between a high-priority and a low-priority task on one cache; inter-core (WCET in multi-core) between a task in core 1 and a task in core 2, each with an L1 cache and a shared L2 cache.

Micro-architectural Modeling. Single core: pipeline, cache, branch predictor. Multi-core: shared cache, shared bus.

Task-level Interference. [Figure: tasks T1, T2, T3 on Core 1 ... Core n with L1 caches, a shared L2 cache, and a shared bus; a timeline of T1, T2, T3; the resulting task interference graph.]

Shared Cache + TDMA Shared Bus. [Figure: task graphs with T1, T2, T3, T4 on Core 1 and Core 2, each core with an L1 cache, plus a shared L2 cache and a shared bus. The bus uses Time Division Multiple Access (TDMA) with alternating Core 1 and Core 2 slots. Annotations: L2 miss due to T2; disjoint lifetime; WAIT; bus access.]

Overview of the Framework. [Figure: flow. Per core: L1 cache analysis, filter, L2 cache analysis. Then: L2 conflict analysis (starting from the initial interference), bus-aware analysis, WCRT computation. Interference changes? Yes: repeat. No: estimated WCRT.] Task interference monotonically decreases.
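
The iteration can be sketched under strong simplifying assumptions: each task has a fixed earliest start time, each interfering task adds a fixed penalty to the response time, and two tasks interfere only if they run on different cores with overlapping lifetimes. None of these modeling choices or values come from the slides; they only illustrate why the loop terminates.

```python
def overlap(a, b):
    """True if half-open lifetimes a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]

def estimate_wcrt(tasks, penalty):
    """tasks: name -> (core, earliest_start, base_wcet). Returns (wcrt, interference)."""
    # Initial interference: every task on another core may conflict.
    interf = {t: {u for u in tasks if tasks[u][0] != tasks[t][0]} for t in tasks}
    while True:
        wcrt = {t: tasks[t][2] + penalty * len(interf[t]) for t in tasks}
        life = {t: (tasks[t][1], tasks[t][1] + wcrt[t]) for t in tasks}
        # Keep only interferers whose lifetime still overlaps; sets can only shrink.
        new = {t: {u for u in interf[t] if overlap(life[t], life[u])} for t in tasks}
        if new == interf:
            return wcrt, interf
        interf = new

# Hypothetical task set: (core, earliest start, WCET without interference).
tasks = {"T1": (1, 0, 10), "T2": (2, 0, 10), "T3": (2, 40, 10), "T4": (1, 40, 10)}
wcrt, interf = estimate_wcrt(tasks, penalty=5)
print(wcrt, interf)
```

Since the interference sets never grow, the loop reaches a fixed point after finitely many rounds.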

Evaluation (2-core). One core runs statemate; the other core runs the program under evaluation.

Evaluation (4-core). The four cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdctint, statemate), one program per core.

Micro-architectural Modeling. Single core: pipeline, cache, branch predictor. Multi-core: shared cache, shared bus. Interactions between the two groups.

Timing Anomaly (Shared Cache). [Figure: alternative sequences of cache hits and misses, annotated "May not be the worst-case path".]

Baseline Abstraction: Timing Interval. Each pipeline stage (IF, ID, EX, WB, CM) is represented as a timing interval. [Figure: execution graph for R1 := R2 + 5; R5 := R1 * R7; R3 := R5 * 5, with structural dependency and contention edges.] A fixed-point analysis derives the timing of each stage as an interval. End = start + cache miss latency interval; for example, start [3,7] with latency [1,3] gives finish [4,10].
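
The interval arithmetic from this slide can be written out directly. The function names are hypothetical, and the second operation (taking the latest of several predecessor intervals) is an assumed way of combining dependencies, not something stated on the slide.

```python
def add_interval(start, latency):
    """End = start + latency, computed on (earliest, latest) bounds."""
    return (start[0] + latency[0], start[1] + latency[1])

def latest_of(intervals):
    """A stage can start only after all predecessors finish (assumed rule)."""
    return (max(lo for lo, _ in intervals), max(hi for _, hi in intervals))

finish = add_interval((3, 7), (1, 3))    # values from the slide
ready = latest_of([finish, (2, 12)])     # (2, 12) is a hypothetical predecessor
print(finish, ready)
```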

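TDMA Shared Bus Analysis. Time Division Multiple Access (TDMA); offset abstraction. [Figure: the bus schedule alternates Core 0 and Core 1 slots, and one Core 0 slot plus one Core 1 slot form a round. A request T from core 1 is characterized by its offset within the round and the resulting delay; a request T' from core 0 has delay = 0.]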
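
A minimal sketch of the delay computed from an offset, assuming equal-length slots in core order, and assuming a request is granted only if the whole access fits in the remainder of the core's own slot. The function name and parameters are hypothetical.

```python
def tdma_delay(offset, core, n_cores, slot_len, access_len):
    """Cycles a bus request waits, given its offset within the TDMA round.

    Assumes access_len <= slot_len and that core i owns
    [i * slot_len, (i + 1) * slot_len) in every round.
    """
    round_len = n_cores * slot_len
    offset %= round_len
    slot_start = core * slot_len
    slot_end = slot_start + slot_len
    if slot_start <= offset and offset + access_len <= slot_end:
        return 0                             # fits in the current own slot
    if offset < slot_start:
        return slot_start - offset           # own slot is later in this round
    return round_len - offset + slot_start   # wait for the next round

# Two cores, 10-cycle slots, 4-cycle bus access (all assumed values).
print(tdma_delay(3, 0, 2, 10, 4), tdma_delay(3, 1, 2, 10, 4))
```

Only the offset modulo the round length matters, which is what makes the offset a usable abstraction of absolute time.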

Loop Construct. How do we define bus context? [Figure: pipeline stages IF, ID, EX, WB, CM of the previous iteration and the current iteration, linked by cross-iteration edges.] Property: if the bus offsets of the cross-iteration edges do not change, the WCET of the loop iteration cannot change.

Loop Construct. C_i = bus context of the loop body at the i-th iteration. [Figure: bus context flow graph over C_1, C_2, C_3, C_4, C_5, with C_5 ≡ C_3.] Property: if C_i ≡ C_j, then C_{i+k} ≡ C_{j+k} for any k > 0.
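
The property implies that the sequence of bus contexts becomes periodic once one context repeats. The sketch below assumes a context is just the bus offset at loop-body entry and that the duration of an iteration depends only on that offset; both are simplifications, and the timing table is hypothetical.

```python
def bus_contexts(initial_offset, iteration_time, round_len, loop_bound):
    """Enumerate distinct bus contexts until one repeats or the bound is hit.

    Returns (contexts, back_edge): contexts[i] is the offset at iteration i,
    and back_edge is the index of the context that the next iteration
    repeats (None if no repetition occurs within loop_bound iterations).
    """
    seen = {}
    contexts = []
    offset = initial_offset % round_len
    for i in range(loop_bound):
        if offset in seen:
            return contexts, seen[offset]
        seen[offset] = i
        contexts.append(offset)
        offset = (offset + iteration_time(offset)) % round_len
    return contexts, None

# Hypothetical iteration durations per entry offset, TDMA round of 20 cycles.
durations = {0: 5, 5: 5, 10: 6, 16: 14}
contexts, back_edge = bus_contexts(0, durations.__getitem__, 20, 100)
print(contexts, back_edge)
```

With these values the fifth iteration re-enters the third context, so four contexts need a WCET computation regardless of how large the loop bound is.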

Loop Construct. [Figure: bus context flow graph over C_1, C_2, C_3, C_4.] Compute the WCET for each bus context. E(C_1) = number of times context C_1 is executed. Generate linear constraints: E(C_1) + E(C_2) + E(C_3) + E(C_4) ≤ loop bound; E(C_1) ≥ E(C_2). [Figure: WCET analysis flow. Program; control flow graph; micro-architectural modeling; WCET of basic blocks (constraints); infeasible path constraints; loop bound; path analysis by an ILP solver.] ILP = Integer Linear Programming.

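Branch Prediction + Cache. [Figure: from a branch location, up to the maximum number of speculated instructions may execute and access m', which causes a cache conflict with m. The cache contents of the two cases are combined by a JOIN, leaving the later cache access to m unclear.]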

Experimental Setup (Chronos Toolkit). [Figure: tool flow. C source; GCC (SimpleScalar); binary code; CFG; micro-architectural modeling (private cache, pipeline, branch prediction, shared cache, shared bus); micro-architectural constraints and flow constraints; ILP; WCET.]

Evaluation (Cache + Pipeline). [Figure: results for jfdctint and statemate with the shared cache partitioned vertically or horizontally between Core 1 and Core 2; annotated "Imprecision of shared cache analysis".]

Evaluation (Cache + Pipeline + Speculation). [Figure: results, annotated "Imprecision of modeling speculation".]

Evaluation (Bus + Pipeline). [Figure: results, annotated "Imprecision of shared bus analysis" and "Imprecision of path analysis".]

Recap. Dissertation work (time-predictable execution in multi-core): unified cache; shared cache + shared bus; a multi-core WCET tool; cache-related preemption delay analysis; coherence miss modeling; shared scratchpad allocation. [Figure 1, coherence misses: Core 1 ... Core n with L1 data caches, a shared L2 cache, a shared bus, and memory; annotated "Coherence miss traffic" and "Stale data items". Figure 2, preemption delay: Core 1 ... Core n with L1 caches and a shared L2 cache; a high-priority task and a low-priority task with a cache conflict; label "Task c". Figure 3, scratchpad: PE-0, PE-1, ..., PE-N with SPM-0, SPM-1, ..., SPM-N linked by fast on-chip communication media; an external memory interface and a shared off-chip data bus to off-chip memory.]

Perspective. From time-predictable execution in single-core to time-predictable execution in multi-core. Approaches: testing; static analysis. Resource sharing (cache and bus): shared cache, shared bus. Data sharing (cache coherence). Customized hardware: shared scratchpad. Platforms and mechanisms named: ARM Cortex A9 MPCore, Samsung Exynos, Nvidia Tegra II (smart phones); Time Division Multiple Access; Aethereal network-on-chip; Sony PSP; IBM Cell.

Perspective. [Figure, functionality verification: the concrete domain is abstracted, a verifier checks the property on the abstraction, and a spurious counterexample triggers abstraction refinement until the property is verified. Examples: SLAM (Microsoft), BLAST (UC Berkeley), MAGIC (CMU). Figure, quantitative verification: the abstract domain in abstract interpretation (AI) covers the concrete domain, so its result may be spurious; AI generates a quantitative property, which path-sensitive verification refines.] Anytime verification of quantitative properties.

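Future Work. Static performance analysis + testing: performance testing by symbolic execution. [Figure: the input is a quantitative property (e.g. cache conflict); a program with branches x < y and x == y accessing m1 and m2 is explored along x < y and x ≥ y; path conditions such as x < y ˄ x = y and x < y ˄ x ≠ y are checked against assert (C_m <= 1), with infeasible paths aborted.] Energy analysis of software and energy-aware software testing, motivated by battery life in mobile devices.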

Thank You. My sincere thanks to all the Examiners, and especially to the anonymous Examiner 1 for his comment on symbolic execution.