
1 Modeling shared cache and bus in multi-core platforms for timing analysis
Sudipta Chattopadhyay, Abhik Roychoudhury, Tulika Mitra

2 Timing analysis (basics)
Hard real-time systems need to meet certain deadlines
 System-level (schedulability) analysis
 Single-task analysis (Worst-Case Execution Time analysis)
WCET: an upper bound on a program's execution time over all possible inputs, for a given hardware platform
 Usually obtained by static analysis
Usage of WCET
 Schedulability analysis of hard real-time systems
 Worst-case oriented optimization

3 WCET and BCET
WCET = Worst-Case Execution Time; BCET = Best-Case Execution Time
[Figure: distribution of execution times, marking the estimated, actual, and observed BCET and WCET; the gap between the actual and estimated WCET is the over-estimation]
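Reading the figure left to right, the bounds relate as the following chain of inequalities (a standard formulation; the subscripted names are mine, not the slide's):

\[
\mathrm{BCET}_{est} \le \mathrm{BCET}_{actual} \le \mathrm{BCET}_{obs} \le \mathrm{WCET}_{obs} \le \mathrm{WCET}_{actual} \le \mathrm{WCET}_{est}
\]

A WCET estimate is safe only if it sits at the right end of this chain; the gap between the actual and the estimated WCET is the over-estimation that the later experiments measure.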

4 Timing analysis for multi-cores
Modeling shared cache and shared bus
 Most common form of resource sharing in multi-cores
 Difficulties:
Conflicts in the shared cache arising from other cores
Contention on the shared bus introduced by other cores
Interaction between shared cache and shared bus

5 Commercial multi-core (Intel Core 2 Duo)
[Figure: two processors, each with N cores (private L1 caches) and a shared L2, connected through a crossbar to a shared off-chip bus and off-chip memory]
Presence of both shared cache and shared bus

6 Modeled architecture
Shared cache is accessed through a shared bus
[Figure: Architecture A, where cores 0..N with private L1 caches access a shared L2 through a shared bus; Architecture B, where cores 0..N with private L1 and L2 caches access a shared L3 through a shared bus]

7 Assumptions
Perfect data cache; currently we model only the shared instruction cache
Shared bus is TDMA (Time Division Multiple Access) and TDMA slots are assigned in round-robin fashion (see the sketch below)
 TDMA is chosen for predictability
Separate instruction and data buses
 Bus traffic arising from data memory accesses is ignored
No self-modifying code
 Cache coherence need not be modeled
Non-preemptive scheduling
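The round-robin TDMA assumption is concrete enough to sketch. The Python fragment below (all names are mine; the 50-cycle slot and 2-core values come from the later example slides) computes how long a request waits for its core's slot, ignoring whether the remaining slot length suffices to complete the access:

```python
# Hypothetical sketch of the assumed round-robin TDMA bus: each of
# NUM_CORES cores owns one SLOT_LEN-cycle slot per round.
SLOT_LEN = 50      # cycles per slot (value from the slide-19 example)
NUM_CORES = 2

def bus_delay(core, t):
    """Cycles a request issued by `core` at time t waits for its slot."""
    round_len = SLOT_LEN * NUM_CORES
    slot_start = core * SLOT_LEN          # offset of this core's slot in a round
    offset = t % round_len
    if slot_start <= offset < slot_start + SLOT_LEN:
        return 0                          # request arrives inside own slot
    # otherwise wait for the next occurrence of this core's slot
    return (slot_start - offset) % round_len

# Example: on a 2-core bus with 50-cycle slots, a core-0 request at t = 60
# waits 40 cycles (core 1 owns cycles 50..99 of each round).
assert bus_delay(0, 60) == 40
```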

8 Overview of the framework
[Flowchart: per-core L1 cache analysis and L2 cache analysis produce cache access classifications; an initial interference estimate drives L2 conflict analysis, bus-aware analysis, and WCRT computation; if the interferences change, the analysis repeats, otherwise the estimated WCRT is reported]
Iterative fix-point analysis; termination of our analysis is guaranteed
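The iterative structure of the flowchart can be sketched as a fix-point loop. The pass functions below are hypothetical placeholders supplied by the caller, not the authors' code; only the control flow is the point:

```python
# A minimal sketch of the iterative fix-point loop. Interference relations
# only shrink across iterations, which is why termination is guaranteed.
def fixpoint_wcrt(tasks, cache_pass, conflict_pass, wcrt_pass, interference_pass):
    """Iterate cache/bus analysis until the task interference relation is stable."""
    # Start pessimistically: assume every pair of distinct tasks may interfere.
    interference = {(a, b) for a in tasks for b in tasks if a != b}
    while True:
        classification = cache_pass(tasks)                    # L1 + per-core L2
        classification = conflict_pass(classification, interference)
        wcrt = wcrt_pass(tasks, classification)               # bus-aware WCRT
        refined = interference_pass(tasks, wcrt)              # from new lifetimes
        if refined == interference:                           # fix-point reached
            return wcrt
        interference = refined
```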

9 Framework components
[Flowchart from slide 8, highlighting the component discussed next: L1 cache analysis and cache access classification]

10 L1 cache analysis (Ferdinand et al., RTS '97)
[Figure: abstract cache sets holding blocks such as {a}, {b,c}, {c} at ages from low to high; joins merge two abstract sets, and blocks aged past the high end are evicted]
Must join: intersection, maximum age; finds All Hit (AH) cache blocks
May join: union, minimum age; finds All Miss (AM) cache blocks
Persistence join: union, maximum age; finds Persistent (PS), i.e. never-evicted, cache blocks
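As a rough illustration of the three joins (my own encoding, representing an abstract cache set as a block-to-age map with 0 as the youngest age; this is a sketch, not the paper's formalism):

```python
# Abstract cache set joins in the style of must/may/persistence analysis.
ASSOC = 4  # associativity of the analyzed set (example value)

def must_join(a, b):
    # Intersection, MAXIMUM age: a block is guaranteed cached (All Hit)
    # only if both incoming paths keep it, at its older age.
    return {m: max(a[m], b[m]) for m in a.keys() & b.keys()}

def may_join(a, b):
    # Union, MINIMUM age: a block may be cached if either path keeps it;
    # absence from the may-set means the access is All Miss.
    return {m: min(a.get(m, ASSOC), b.get(m, ASSOC)) for m in a.keys() | b.keys()}

def persistence_join(a, b):
    # Union, MAXIMUM age: a block is Persistent (never evicted) as long as
    # its worst-case age stays below ASSOC.
    return {m: max(a.get(m, -1), b.get(m, -1)) for m in a.keys() | b.keys()}

# Example: joining {a:0, b:1} with {a:2, c:0}.
left, right = {"a": 0, "b": 1}, {"a": 2, "c": 0}
assert must_join(left, right) == {"a": 2}
assert may_join(left, right) == {"a": 0, "b": 1, "c": 0}
```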

11 Framework components
[Flowchart from slide 8, highlighting the component discussed next: per-core L2 cache analysis]

12 Per-core L2 cache analysis (Puaut et al., RTSS 2008)
A memory reference's L1 classification filters whether it accesses L2:
 All Hit in L1: Never accessed (N) in L2, so ACS_out = ACS_in
 All Miss in L1: Always accessed (A) in L2, so ACS_out = U(ACS_in)
 Persistence or NC in L1: Unknown (U) in L2, so ACS_out = Join(ACS_in, U(ACS_in))
(ACS = abstract cache state; U = the abstract cache update function)
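A sketch of this filtering step under the semantics assumed above; the callables `update` and `join` are stand-ins for the real abstract-interpretation operators, and the names are mine:

```python
# L2 access filtering: the L1 classification decides whether the
# reference reaches the L2 cache at all.
def l2_acs_out(acs_in, block, l1_class, update, join):
    """Return the L2 abstract cache state after one memory reference."""
    if l1_class == "AH":            # Never accessed in L2 (N)
        return acs_in
    if l1_class == "AM":            # Always accessed in L2 (A)
        return update(acs_in, block)
    # PS or NC in L1: the L2 access is Unknown (U), so account for both
    # the access-happens and access-filtered-out outcomes.
    return join(acs_in, update(acs_in, block))
```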

13 Framework components
[Flowchart from slide 8, highlighting the component discussed next: L2 conflict analysis]

14 Shared cache conflict analysis (our past work, RTSS 2009)
Exploit task lifetimes to refine shared cache analysis
Task interference graph
 There is an edge between two task nodes if they have overlapping lifetimes
Analyze each cache set C individually

15 Task interference graph
[Figure: lifetimes of tasks T1, T2, T3 on a timeline, and the resulting task interference graph with edges between tasks whose lifetimes overlap]
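Building the interference graph from task lifetimes is a plain interval-overlap test; a small sketch with illustrative lifetimes (the numbers are mine, not the slide's):

```python
# Task interference graph: a task's lifetime is [EarliestReady, LatestFinish];
# two tasks interfere iff their lifetime intervals overlap.
def interference_graph(lifetimes):
    """lifetimes: dict task -> (ready, finish). Returns the set of edges."""
    edges = set()
    tasks = list(lifetimes)
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            (ra, fa), (rb, fb) = lifetimes[a], lifetimes[b]
            if ra <= fb and rb <= fa:          # intervals overlap
                edges.add(frozenset((a, b)))
    return edges

# Example: T1 overlaps T2, T2 overlaps T3, but T1 and T3 are disjoint.
g = interference_graph({"T1": (0, 10), "T2": (5, 20), "T3": (15, 30)})
assert g == {frozenset(("T1", "T2")), frozenset(("T2", "T3"))}
```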

16 Cache conflict analysis
[Figure: cache set C with associativity 4; the task interference graph from slide 15, with conflicting-block counts M(C) = 1 for T1, M(C) = 2 for T2, M(C) = 1 for T3; the ages of blocks m1, m2, m3 are shifted by the conflicts]
Before conflict analysis: m1, m2, m3 are AH; after the shift: m1: AH -> AH, m2: AH -> AH, m3: AH -> AH
All memory blocks remain all hits

17 Cache conflict analysis
[Figure: the same cache set C (associativity 4) and interference graph, but now with M(C) = 1 for T1, M(C) = 3 for T2, M(C) = 1 for T3, and blocks m0, m1 alongside m2 and m3]
Before conflict analysis: m0, m1, m2, m3 are AH; after the shift: m1: AH -> AH, m2: AH -> NC, m3: AH -> AH
m2 may be replaced from the cache due to conflicts from other cores
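The underlying check can be sketched as follows. This is a simplification of the RTSS 2009 analysis, with names of my own: a block that is All Hit in isolation stays AH only if its age plus the conflicting blocks M(C) contributed by interfering tasks still fits within the set's associativity.

```python
# Age-shift check behind slides 16-17 (simplified, hypothetical encoding).
ASSOC = 4

def shifted_class(age, conflicts):
    """age: the block's worst-case age in set C (1-based);
    conflicts: M(C) summed over tasks with overlapping lifetimes
    that map blocks to the same set C."""
    return "AH" if age + conflicts <= ASSOC else "NC"

assert shifted_class(1, 2) == "AH"   # 1 + 2 <= 4: still guaranteed in cache
assert shifted_class(2, 3) == "NC"   # slide 17's m2: pushed past associativity
```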

18 Framework components
[Flowchart from slide 8, highlighting the component discussed next: bus-aware analysis]

19 Example: variable bus delay
Bus slot: 50 cycles; L2 hit: 10 cycles; L2 miss: 20 cycles
[Figure: code executing on Core 0, with a common path and left/right branches; block costs C1 = 20, C2 = 10, C3 = 20, C4 = 20 or 30, C5 = 10, and memory accesses M1 = 10 (L2 hit) and M2 = 20 (L2 miss), laid over a timeline of alternating Core 0 / Core 1 bus slots at t = 0, 50, 100, 150]
First iteration: no bus delay

20 Example: variable bus delay
Bus slot: 50 cycles; L2 hit: 10 cycles; L2 miss: 20 cycles
[Figure: the same code and bus schedule in the second iteration; M1 now suffers a 20-cycle bus delay]
Conclusion: the WCET of different iterations of the same loop can differ

21 Possible solutions
Source of the problem
 Each iteration of a loop may start at a different offset relative to its bus slot
Possible solutions
 Virtually unroll all loop iterations: too expensive
 Do not model the bus, or take the maximum possible bus delay: imprecise results
Our solution
 Assume each loop iteration starts at the same offset relative to its bus slot, and add the necessary alignment cost

22 Key observation
A round-robin TDMA bus schedule follows a repeating pattern
[Figure: bus schedule alternating Core 0 and Core 1 slots; a task T starting on Core 0 at the same offset Δ into the schedule is shown at several points on the timeline]
T must follow the same execution pattern if the offset (Δ) is the same
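The observation reduces to arithmetic on the repeating schedule; a tiny sketch (slot length and core count taken from the running example, names mine):

```python
# Under round-robin TDMA the bus schedule repeats every NUM_CORES * SLOT_LEN
# cycles, so two executions starting at equal offsets into the schedule see
# identical bus behavior.
SLOT_LEN = 50
NUM_CORES = 2

def schedule_offset(t):
    """Offset of time t into the repeating TDMA round."""
    return t % (NUM_CORES * SLOT_LEN)

# Iterations starting at t = 30 and t = 130 see the same bus pattern:
assert schedule_offset(30) == schedule_offset(130) == 30
```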

23 Revisiting the example
Bus slot: 50 cycles; L2 hit: 10 cycles; L2 miss: 20 cycles
[Figure: the same loop, with each iteration aligned to start at the same offset in Core 0's bus slot; alignment cost = 20 cycles]
With this alignment, all iterations follow the same execution pattern, so the WCET of one iteration is <= 100 cycles and there is no need to virtually unroll the loop

24 Partial unrolling
The alignment cost is high if the loop is very small compared to the length of the bus slot; partially unroll such loops until one bus slot is filled
[Figure: a small loop (C1 = 10, M2 = 10 as an L2 hit, C2 = 10) executing on Core 0; without unrolling each iteration (Iter1, Iter2, Iter3) pays the alignment cost, while partial unrolling packs several iterations (Iter1 through Iter4) into one Core 0 bus slot spanning t = 0 to t = 100]
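A sketch of the unroll-factor choice under a deliberately simplified cost model (constant per-iteration time, one L2 access per iteration; the values echo the slide's figure but the heuristic is my reading, not the paper's formula):

```python
# Partial unrolling: pack as many iterations as fit into one bus slot so
# the alignment cost is paid once per slot rather than once per iteration.
def unroll_factor(iter_cycles, slot_len):
    """Number of iterations to unroll so one bus slot is filled."""
    return max(1, slot_len // iter_cycles)

# Slide 24's loop: roughly 30 cycles per iteration (C1 + M2 + C2),
# a 100-cycle Core 0 slot -> unroll three iterations per slot.
assert unroll_factor(30, 100) == 3
```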

25 Extension to full programs
[Figure: the per-iteration WCET of an inner loop, computed with alignment, composes into the WCET of the outer loop, and so on up to the full program]

26 Framework components
[Flowchart from slide 8, highlighting the components discussed next: bus-aware WCET/BCET computation and WCRT computation]

27 WCRT computation
[Figure: task graph with t1 preceding t2 and t3, and both t2 and t3 preceding t4; the parenthesized numbers give each task's assigned core; t2 and t3 are peers, i.e. tasks that may contend for the same core during overlapping lifetimes]
Task lifetime: [EarliestReady, LatestFinish]
Earliest time computation:
 EarliestReady(t1) = 0
 EarliestReady(t4) >= EarliestFinish(t2); EarliestReady(t4) >= EarliestFinish(t3)
 EarliestFinish = EarliestReady + BCET
Latest time computation:
 LatestReady(t4) >= LatestFinish(t2); LatestReady(t4) >= LatestFinish(t3)
 t2 has peers: LatestFinish(t2) = LatestReady(t2) + WCET(t2) + WCET(t3)
 t4 has no peers: LatestFinish(t4) = LatestReady(t4) + WCET(t4)
Computed WCRT = LatestFinish(t4)
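These recurrences are easy to evaluate over the task graph. The sketch below uses hypothetical BCET/WCET numbers (not the slide's) and reduces peer handling to the slide's special case, where a task's latest finish additionally pays every peer's WCET under non-preemptive scheduling:

```python
# WCRT computation for the four-task graph t1 -> {t2, t3} -> t4.
BCET = {"t1": 50, "t2": 10, "t3": 20, "t4": 10}   # illustrative values
WCET = {"t1": 90, "t2": 30, "t3": 30, "t4": 30}   # illustrative values
preds = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
peers = {"t1": [], "t2": ["t3"], "t3": ["t2"], "t4": []}

earliest_finish, latest_finish = {}, {}
for t in ["t1", "t2", "t3", "t4"]:                 # topological order
    ready_e = max((earliest_finish[p] for p in preds[t]), default=0)
    ready_l = max((latest_finish[p] for p in preds[t]), default=0)
    earliest_finish[t] = ready_e + BCET[t]
    # With peers, the task may additionally wait out each peer's WCET.
    latest_finish[t] = ready_l + WCET[t] + sum(WCET[q] for q in peers[t])

wcrt = latest_finish["t4"]                          # computed WCRT
print(f"WCRT = {wcrt} cycles")
```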

28 An example
L2 hit: 10 cycles; L2 miss: 20 cycles; bus slot: 50 cycles
[Figure: Gantt chart of task segments T1.1 = 90, T2.1 = 10, T2.2 = 20, T3.1 = 20, T3.2 = 10, T4.1 = 20, T4.2 = 10 across Core 0 and Core 1, with bus accesses M2.2, M3.2, M4.2 and an alternating Core 0 / Core 1 bus schedule including wait times]
If M2.2 and M3.2 conflict in L2, both are L2 misses (M2.2 = M3.2 = 20) and M4.2 is an L2 hit; the bus schedule based on these misses gives WCRT = 170 cycles
However, T2 and T3 have disjoint lifetimes, so M2.2 and M3.2 cannot conflict: both are L2 hits

29 Example contd.
[Figure: the same Gantt chart recomputed with M2.2 = M3.2 = 10 (both L2 hits); the second bus wait for Core 1 is eliminated]
With the bus schedule based on M2.2 and M3.2 being L2 hits, WCRT = 130 cycles

30 Experimental evaluation
Tasks are compiled into SimpleScalar PISA-compliant binaries
CMP_SIM is used for simulation; it is extended with shared bus modeling and with support for PISA-compliant binaries
Two setups:
 Independent tasks running on different cores
 Task dependencies specified through a task graph

31 Overestimation ratio (2-core)
One core runs statemate; the other core runs the program under evaluation
L1 cache: direct-mapped, 1 KB; block size = 32 bytes; miss latency = 6 cycles
L2 cache: 4-way, 2 KB; block size = 64 bytes; miss latency = 30 cycles
Bus slot length = 80 cycles
Average overestimation = 40%

32 Overestimation ratio (4-core)
The four cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdcint, statemate)
L1 cache: direct-mapped, 1 KB; block size = 32 bytes; miss latency = 6 cycles
L2 cache: 4-way, 2 KB; block size = 64 bytes; miss latency = 30 cycles
Bus slot length = 80 cycles
Average overestimation = 40%

33 Sensitivity to bus slot length (2-core)
[Chart: average overestimation ratio for the program statemate as the bus slot length varies]

34 Sensitivity to bus slot length (4-core)
[Chart: average overestimation ratio for the program statemate as the bus slot length varies]

35 WCRT analysis of a task graph (Debie-test)
Debie is an online space debris monitoring program developed by Space Systems Finland Ltd.
[Figure: extracted task graph for Debie-test, with tasks main-tc (1), main-hm (1), main-tm (1), main-hit (1), main-aq (1), main-su (1), tc-test (3), hm-test (4), tm-test (1), hit-test (2), aq-test (4), su-test (2); the number in parentheses is the assigned core]

36 Experimental evaluation of Debie-test
L1 cache: 2-way, 2 KB; block size = 32 bytes; miss latency = 6 cycles
L2 cache: 4-way, 8 KB; block size = 64 bytes; miss latency = 30 cycles
Bus slot length = 80 cycles
Overestimation ratio ~ 20%
This difference clearly shows that for real-life applications, bus modeling is essential

37 Extension to different multi-core architectures (e.g. Intel Core 2 Duo)
[Figure: the Core 2 Duo style architecture from slide 5, in which two processors, each with per-core L1 caches and a shared L2, connect through a crossbar to a shared off-chip bus and off-chip memory]
Only L2 cache misses appear on the shared bus
The overall framework remains the same; the shared bus waiting time is now computed only for L2 cache misses

