Improving Real-Time Performance on Multicore Platforms Using MemGuard University of Kansas Dr. Heechul Yun 10/28/2013.


1 Improving Real-Time Performance on Multicore Platforms Using MemGuard University of Kansas Dr. Heechul Yun 10/28/2013

2 Multicore: Server, Desktop, Mobile, RT/Embedded

3 Challenges: Shared Resources [Figure: Unicore (T1, T2 on one CPU with its memory hierarchy) vs. Multicore (T1 to T8 on Cores 1 to 4 sharing one memory hierarchy); label: Performance Impact]

4 Case Study. HRT – Synthetic real-time video capture – P=20, D=13ms – Cache-insensitive. X-server – Scrolling text on a gnome-terminal. Hardware platform – Intel Xeon 3530 – 8MB shared L3 cache – 4GB DDR3 1333MHz DIMM (1ch). CPU cores are isolated. [Figure: a desktop PC (Intel Xeon 3530): HRT on Core1, Xsrv. on Core2, L3 (8MB), DRAM]

5 HRT Time Distribution. Solo 99pct: 10.2ms; w/ Xserver 99pct: 14.3ms. 28% deadline violations, due to contention in DRAM.

6 Outline. Motivation. Background – DRAM basics – Worst-case memory performance – MemGuard [RTAS’13]. Improving Real-Time Performance with MemGuard.

7 Background: DRAM Organization – DRAM has multiple banks – Different banks can be accessed in parallel [Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Banks 1 to 4]

8 Best-case: Fast. Peak = 10.6 GB/s – DDR3 1333MHz [Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Banks 1 to 4]

9 Best-case: Fast. Peak = 10.6 GB/s – DDR3 1333MHz – Out-of-order processors [Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Banks 1 to 4]
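
The 10.6 GB/s peak on the best-case slides is consistent with the usual DDR arithmetic (transfers per second times bus width). A minimal sketch follows; the 64-bit bus width and the nominal 1333 MT/s rate are assumptions not stated on the slides, and the single channel is taken from the "(1ch)" note in the case study.

```python
# Theoretical peak bandwidth of one DDR3-1333 channel.
# Assumptions (not on the slide): 64-bit data bus, nominal 1333 MT/s.
TRANSFERS_PER_SEC = 1333e6   # about 1333 million transfers per second
BUS_WIDTH_BYTES = 8          # 64-bit channel
CHANNELS = 1                 # the case study lists a single channel (1ch)


def peak_bandwidth_gbs(transfers_per_sec, bus_width_bytes, channels):
    """Return peak bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    return transfers_per_sec * bus_width_bytes * channels / 1e9


peak = peak_bandwidth_gbs(TRANSFERS_PER_SEC, BUS_WIDTH_BYTES, CHANNELS)
print(f"peak = {peak:.2f} GB/s")
```

This number is only reachable when requests spread across banks and hit open rows, which is the point of the slides that follow.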

10 Most cases: Mess. Performance = ?? (*) Intel® 64 and IA-32 Architectures Optimization Reference Manual [Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Banks 1 to 4]

11 Worst-case: Slow. 1 bank b/w – Less than peak b/w – How much? (*) Intel® 64 and IA-32 Architectures Optimization Reference Manual [Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Banks 1 to 4]

12 Background: DRAM Operation. Stateful per-bank access time – Row miss: 19 cycles – Row hit: 9 cycles (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting) [Figure: Bank 1 with Rows 1 to 5 and a Row Buffer; operations: activate, precharge, read/write; example: READ (Bank 1, Row 3, Col 7)]

13 Real Worst-case: 1 bank & always row miss → ~1.2GB/s. Each core = ¼ x 1.2GB/s = 300MB/s? (*) Intel® 64 and IA-32 Architectures Optimization Reference Manual [Figure: Core 1 to Core 4, L3, Memory Controller (MC), DRAM DIMM with Banks 1 to 4; request order over time: Row 1, Row 2, Row 3, Row 4, Row 1, Row 2, …]
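
Why interleaved requests to one bank always miss can be seen with a toy row-buffer model. The sketch below uses the hit and miss cycle counts from the DRAM Operation slide; the round-robin interleaving and the function names are illustrative assumptions, not the behaviour of any real memory controller.

```python
# Toy single-bank row-buffer model. Cycle counts are the ones quoted on the
# DRAM Operation slide (PC6400 DDR2, 5-5-5); everything else is illustrative.
ROW_HIT_CYCLES = 9
ROW_MISS_CYCLES = 19


def bank_cycles(rows):
    """Total cycles to serve a sequence of row accesses on one bank."""
    open_row = None  # nothing in the row buffer yet
    total = 0
    for row in rows:
        if row == open_row:
            total += ROW_HIT_CYCLES
        else:
            # A different (or no) row is open: count it as a row miss.
            total += ROW_MISS_CYCLES
            open_row = row
    return total


def interleave(streams):
    """Round-robin merge of per-core request streams (hypothetical order)."""
    return [row for group in zip(*streams) for row in group]


# One core streaming through a single row: one miss, then seven hits.
solo = [1] * 8
# Four cores, each using its own row of the same bank, requests interleaved.
shared = interleave([[1, 1], [2, 2], [3, 3], [4, 4]])  # 1,2,3,4,1,2,3,4

print(bank_cycles(solo))    # 19 + 7 * 9 = 82
print(bank_cycles(shared))  # 8 * 19 = 152
```

The same eight requests cost almost twice as many cycles once every access evicts the previous core's row, which is the pattern drawn on the Real Worst-case slide.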

14 Background: Memory Controller (MC). Request queue(s) – Not fair (open-row first → re-ordering) – Unpredictable queuing delay. Source: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig 13.1.

15 Challenges for Real-Time Systems: Unpredictable memory performance – Multiple parallel resources (banks) – Stateful bank access latency – Queuing delay

16 MemGuard [RTAS’13]. Goal: guarantee minimum memory b/w for each core. How: b/w reservation + best effort sharing. [Figure: MemGuard in the Operating System with a Reclaim Manager and four BW Regulators (labels: 0.6GB/s, 0.2GB/s); Multicore Processor with Core1 to Core4 and PMC; Memory Controller; DRAM DIMM]

17 Reservation. Idea – Scheduler regulates per-core memory b/w using h/w counters – Period = 1 scheduler tick (e.g., 1ms). [Figure: budget and core activity (computation, memory fetch) over time, 0 to 2ms; labels: Schedule a RT idle task, Suspend the RT idle task]
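
The reservation idea can be sketched as a small simulation: a per-core budget of memory accesses is replenished every period, and the core is throttled once the budget runs out. This is only an illustration, not MemGuard's kernel code; the 64-byte cache line per counted access and the carry-over of unserved requests are assumptions made for the model.

```python
# Minimal simulation of per-core bandwidth regulation with a 1 ms period.
# Assumption: each counted memory access moves one 64-byte cache line.
CACHE_LINE_BYTES = 64
PERIOD_US = 1000  # 1 ms, i.e. one scheduler tick in the slide's example


def budget_lines(bandwidth_mb_s):
    """Convert a reservation in MB/s into cache-line accesses per period."""
    bytes_per_period = bandwidth_mb_s * PERIOD_US  # MB/s * us = bytes
    return bytes_per_period // CACHE_LINE_BYTES


def regulate(demand_per_period, budget):
    """Return (served, throttled) lists, one entry per period, for one core.

    The budget is replenished at every period boundary. Once it is used up
    the core is throttled for the rest of the period, and in this model the
    unserved accesses are carried over to the next period.
    """
    served, throttled = [], []
    backlog = 0
    for demand in demand_per_period:
        pending = backlog + demand
        done = min(pending, budget)
        backlog = pending - done
        served.append(done)
        throttled.append(backlog > 0)
    return served, throttled


budget = budget_lines(300)  # 300 MB/s -> 4687 cache lines per 1 ms
served, throttled = regulate([2000, 6000, 3000, 1000], budget)
print(budget, served, throttled)
```

In the second period the core asks for more than its budget, so it is throttled and finishes the leftover accesses in the third period.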

18 Reservation

19 Best-Effort Sharing – Spare Sharing [RTAS’13] – Proportional Sharing [Unpublished TR] [Figure: timeline in ms (0, 1, 2) for Core0 (900MB/s) and Core1 (300MB/s); legend: guaranteed b/w, best-effort b/w, throttled, reschedule]
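
The slide names proportional sharing without defining it. One plausible reading, shown here purely as an assumption, is that bandwidth left over after the guaranteed part is split among cores in proportion to their reservations. The function name and the 400 MB/s spare figure are made up for illustration.

```python
def proportional_share(reservations, spare):
    """Split `spare` bandwidth among cores in proportion to reservations.

    This is an illustrative interpretation, not the algorithm from the
    technical report cited on the slide.
    """
    total = sum(reservations.values())
    return {core: spare * r / total for core, r in reservations.items()}


reservations = {"Core0": 900, "Core1": 300}  # MB/s, as on the slide
extra = proportional_share(reservations, spare=400)  # hypothetical spare b/w
print(extra)  # Core0 receives 3/4 of the spare, Core1 receives 1/4
```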

20 Case Study. HRT – Synthetic real-time video capture – P=20, D=13ms – Cache-insensitive. X-server – Scrolling text on a gnome-terminal. Hardware platform – Intel Xeon 3530 – 8MB shared cache – 4GB DDR3 1333MHz DIMM. [Figure: a desktop PC (Intel Xeon 3530): HRT on Core1, Xsrv. on Core2, L3 (8MB), DRAM]

21 w/o MemGuard. HRT (solo): HRT’s 99pct: 10.2ms. HRT (w/ Xserver): HRT’s 99pct: 14.3ms, X’s CPU util: 78%.

22 MemGuard reserve only (HRT=900MB/s, X=300MB/s). HRT (solo): HRT’s 99pct: 10.7ms. HRT (w/ Xserver): HRT’s 99pct: 11.2ms, X’s CPU util: 4%.

23 MemGuard reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing. Panels: HRT (solo), HRT (w/ Xserver). HRT’s 99pct: 10.7ms, X’s CPU util: 48%.

24 MemGuard reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing. HRT (solo): HRT’s 99pct: 10.9ms. HRT (w/ Xserver): HRT’s 99pct: 12.1ms, X’s CPU util: 61%.

25 Real-Time Performance Improvement. Using MemGuard, we can achieve – No deadline miss for HRT – Good X-server performance [Figure panels: HRT, X-server]

26 Conclusion. Unpredictable memory performance – multiple resources (banks), per-bank state, unpredictable queueing delay. MemGuard – Guarantee minimum memory bandwidth for each core – b/w reservation (guaranteed part) + best-effort sharing. Case study – On Intel Xeon multicore platform, using HRT + X-server – MemGuard can improve real-time performance efficiently. Limitations and Future Work – Coarse grain (an OS tick) enforcement – Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS’14). https://github.com/heechul/memguard

27 Thank you.

28 Evaluation on Intel Core2. T1: Synthetic video capture task (HRT) – Period=20ms (50Hz) – Deadline=14ms – Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods). T2: Xserver, update screen (SRT) – Metric: CPU utilization (higher CPU utilization → faster screen update). Platform – Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers – PC6400 DDR2 DRAM DIMM x 1. Three platform configurations – Exp1: Private L2, Prefetch=off – Exp2: Private L2, Prefetch=on – Exp3: Shared L2, Prefetch=on. [Figure: Intel Core2Quad based PC: Core0 to Core3, L2 (pref.), DRAM]
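
The T1 metrics listed above can be computed from per-period execution times roughly as follows. This is a sketch under assumptions: WCET is taken as the observed maximum, the standard deviation is the population form, and the sample values are made up rather than measured.

```python
import statistics


def hrt_metrics(exec_times_ms, deadline_ms):
    """Summarise per-period execution times of a periodic real-time task."""
    misses = sum(1 for t in exec_times_ms if t > deadline_ms)
    return {
        "acet": statistics.mean(exec_times_ms),     # average-case time
        "wcet": max(exec_times_ms),                 # observed worst case
        "stdev": statistics.pstdev(exec_times_ms),  # population std. dev.
        "miss_ratio": misses / len(exec_times_ms),  # fraction past deadline
    }


# Made-up sample; the slide's experiments use 1000 periods per run.
sample_ms = [10.0, 11.0, 12.0, 15.0]
metrics = hrt_metrics(sample_ms, deadline_ms=14.0)
print(metrics)
```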

29 Experiment 1 (Private L2, Prefetch=off). Configurations: Original; MemGuard (Reserve only), 550M/s; MemGuard (reclaim + share), 550M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 38%, 78%, 92%; annotations: Performance guarantee, deadline]

30 Experiment 1 (Private L2, Prefetch=off). Configurations: Original; MemGuard (Reserve only), 550M/s; MemGuard (reclaim + share), 550M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 38%, 78%, 92%; annotations: Performance guarantee, 30% WCET, deadline]

31 Experiment 1 (Private L2, Prefetch=off). Configurations: Original; MemGuard (Reserve only), 550M/s; MemGuard (reclaim + share), 550M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 38%, 78%, 92%; annotations: 550M/s, deadline]

32 Experiment 1 (Private L2, Prefetch=off). Configurations: Original; MemGuard (Reserve only), 550M/s; MemGuard (reclaim + share), 550M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 38%, 78%, 92%; annotation: deadline]

33 Experiment 1 (Private L2, Prefetch=off). Configurations: Original; MemGuard (Reserve only), 550M/s; MemGuard (reclaim + share), 550M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 38%, 78%, 92%; annotation: Performance target]

34 Experiment 2: Prefetcher (Private L2, Prefetch=ON). Configurations: Original; MemGuard (Reserve only), 550M/s; MemGuard (reclaim + share), 550M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 33%, 82%, 94%; annotations: 60%, More slowdown, Not enough reserv., Deadline violation, deadline]

35 Experiment 2-2 (Private L2, Prefetch=ON). Configurations: Original; MemGuard (Reserve only), 900M/s and 200M/s; MemGuard (reclaim + share), 900M/s and 200M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 14%, 69%, 94%; annotations: Enough reserv., 60%, No deadline violation]

36 Experiment 3: Shared Cache (Shared L2, Prefetch=ON). Configurations: Original; MemGuard (Reserve only), 900M/s and 200M/s; MemGuard (reclaim + share), 900M/s and 200M/s. [Figure: DRAM, L2, Core1 (T1), Core2 (T2); values: 11%, 63%, 92%; annotations: 108%, Even more slowdown, Minimum reserv., No deadline violation]

