1 Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

2 Summary
– NUMA multicore systems are unfair to local memory accesses
– Local execution is sometimes suboptimal

3 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

4 NUMA multicores: how it happened
First generation: SMP
[Figure: two quad-core processors (cores 0-3 and 4-7), each behind a bus controller (BusC), share a Northbridge with a memory controller (MC) and a single DRAM memory.]

5 NUMA multicores: how it happened
Next generation: NUMA
[Figure, animation step: moving away from the shared Northbridge toward per-processor memory controllers (MC), local DRAM memories, and a processor interconnect (IC).]

6 NUMA multicores: how it happened
Next generation: NUMA
[Figure: cores 0-3 and 4-7; each processor has its own memory controller (MC) and local DRAM memory, and the processors are connected by an interconnect (IC).]

7 NUMA multicores: how it happened
Next generation: NUMA
[Figure: further animation step of the same NUMA diagram.]

8 Bandwidth sharing
– Frequent scenario: bandwidth shared between cores
– Sharing model for the Intel Nehalem
[Figure: two processors (cores 0-3 and 4-7), each with a memory controller (MC) and local DRAM memory, connected by an interconnect (IC).]

9 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

10 Evaluation system
Intel Nehalem E5520
– 2 x 4 cores
– 8 MB level 3 cache
– 12 GB DDR3 RAM
– 5.86 GT/s QPI
[Figure: Processor 0 and Processor 1, each with four cores, a level 3 cache, a Global Queue, and a memory controller (MC) with local DRAM memory; the two processors are connected by QPI.]

11 Bandwidth sharing: local accesses
[Figure: cores 0 and 3 on Processor 0 access Processor 0's DRAM memory through its Global Queue and memory controller.]

12 Bandwidth sharing: remote accesses
[Figure: cores 4 and 5 on Processor 1 access Processor 0's DRAM memory across QPI, through Processor 0's Global Queue and memory controller.]

13 Bandwidth sharing: combined accesses
[Figure: cores 0 and 3 on Processor 0 and cores 4 and 5 on Processor 1 access Processor 0's DRAM memory at the same time; local and remote requests meet in Processor 0's Global Queue.]

14 The Global Queue is the mechanism that arbitrates between the different types of memory accesses. We look at the fairness of the Global Queue for:
– local memory accesses
– remote memory accesses
– combined memory accesses

15 Benchmark program: STREAM triad
    for (i = 0; i < SIZE; i++) {
        a[i] = b[i] + SCALAR * c[i];
    }
Multiple co-executing triad clones.
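A self-contained sketch of one triad clone in C; the array size, the scalar value, and the bandwidth accounting are illustrative choices rather than values taken from the slides (the STREAM convention counts 24 bytes moved per triad iteration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE   (32L * 1024 * 1024)   /* illustrative: far larger than the 8 MB L3 cache */
    #define SCALAR 3.0

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double *a = malloc(SIZE * sizeof *a);
        double *b = malloc(SIZE * sizeof *b);
        double *c = malloc(SIZE * sizeof *c);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < SIZE; i++) { b[i] = 1.0; c[i] = 2.0; }   /* touch the pages */

        double t0 = seconds();
        for (long i = 0; i < SIZE; i++)        /* the STREAM triad kernel */
            a[i] = b[i] + SCALAR * c[i];
        double t1 = seconds();

        /* 24 bytes per iteration: read b[i], read c[i], write a[i] */
        printf("%.2f GB/s (a[0] = %.1f)\n", 24.0 * SIZE / (t1 - t0) / 1e9, a[0]);

        free(a); free(b); free(c);
        return 0;
    }

In the experiments, several such clones run at the same time, so the interesting quantity is the bandwidth each clone obtains while co-executing with the others.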

16 Multi-clone experiments
All memory is allocated on Processor 0.
Local clones run on Processor 0; remote clones run on Processor 1.
Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R), where L counts local clones and R counts remote clones.
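How one clone might be bound for these experiments, assuming libnuma is used (the slides do not say which mechanism the authors used; numactl on the command line would work as well): the arrays are always placed on node 0, and the clone runs either on node 0 (local clone) or on node 1 (remote clone).

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SIZE   (32L * 1024 * 1024)   /* illustrative */
    #define SCALAR 3.0

    int main(int argc, char **argv)
    {
        if (numa_available() < 0) { fprintf(stderr, "NUMA not supported\n"); return 1; }

        int run_node = (argc > 1) ? atoi(argv[1]) : 0;   /* 0 = local clone, 1 = remote clone */
        numa_run_on_node(run_node);                      /* run on that node's cores */

        /* all memory placed on Processor 0's node, as in the experiments */
        double *a = numa_alloc_onnode(SIZE * sizeof *a, 0);
        double *b = numa_alloc_onnode(SIZE * sizeof *b, 0);
        double *c = numa_alloc_onnode(SIZE * sizeof *c, 0);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < SIZE; i++) { b[i] = 1.0; c[i] = 2.0; }
        for (long i = 0; i < SIZE; i++)                  /* triad kernel from the previous slide */
            a[i] = b[i] + SCALAR * c[i];

        printf("clone on node %d done (a[0] = %.1f)\n", run_node, a[0]);
        numa_free(a, SIZE * sizeof *a);
        numa_free(b, SIZE * sizeof *b);
        numa_free(c, SIZE * sizeof *c);
        return 0;
    }

Compiled with -lnuma. Launching, for example, two copies with argument 0 and three copies with argument 1 gives the (2L, 3R) configuration.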

17 GQ fairness: local accesses
[Chart: total bandwidth (GB/s) as local clones on Processor 0 access Processor 0's DRAM memory.]

18 GQ fairness: remote accesses
[Chart: total bandwidth (GB/s) as remote clones on Processor 1 access Processor 0's DRAM memory across QPI.]

19 Global Queue fairness
The Global Queue is fair when there are only local accesses or only remote accesses in the system.
What about combined accesses?

20 GQ fairness: combined accesses
Execute the clones in all possible configurations (xL, yR), with the number of local clones x and the number of remote clones y each ranging from 0 to 4, for example (2L, 3R).
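For reference, a small sketch that enumerates the configuration grid the experiment sweeps; the empty (0L, 0R) cell runs no clones and is skipped:

    #include <stdio.h>

    int main(void)
    {
        /* every combination of 0..4 local and 0..4 remote clones */
        for (int local = 0; local <= 4; local++)
            for (int remote = 0; remote <= 4; remote++) {
                if (local == 0 && remote == 0)
                    continue;                   /* no clones, nothing to run */
                printf("(%dL, %dR)\n", local, remote);
            }
        return 0;
    }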

21 GQ fairness: combined accesses
Execute the clones in all possible configurations.
[Configuration grid: 0 to 4 local clones by 0 to 4 remote clones.]

22 GQ fairness: combined accesses
[Chart: total bandwidth (GB/s) across the combined-access configurations.]

23 GQ fairness: combined accesses
Execute the clones in all possible configurations.
[Configuration grid: 0 to 4 local clones by 0 to 4 remote clones.]

24 Combined accesses
[Chart: total bandwidth (GB/s) for the combined-access configurations.]

25 Combined accesses
In configuration (4L, 1R) the remote clone gets 30% more bandwidth than a local clone.
Remote execution can be better than local execution.

26 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

27 Bandwidth sharing model
[Figure: the two processors, each with a level 3 cache, Global Queue, integrated memory controller (IMC), and local DRAM memory, connected by QPI; clones access Processor 0's DRAM memory.]

28 Sharing factor
The sharing factor characterizes the fairness of the Global Queue.
How does the sharing factor depend on contention?
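The slides do not spell out the model, so the following is only a hypothetical sketch: assume the Global Queue grants a fraction sigma of the total memory bandwidth B to the remote request stream and the rest to the local stream, and that each stream is divided evenly among its clones (the symbol and the exact form of the authors' model may differ).

    #include <stdio.h>

    /* Hypothetical sharing model, not taken from the slides: the Global Queue
     * hands sigma * B to the remote stream and (1 - sigma) * B to the local
     * stream; each stream is split evenly among its clones. */
    static void per_clone_bw(double B, double sigma, int n_local, int n_remote,
                             double *local_bw, double *remote_bw)
    {
        *local_bw  = n_local  ? (1.0 - sigma) * B / n_local : 0.0;
        *remote_bw = n_remote ? sigma * B / n_remote        : 0.0;
    }

    int main(void)
    {
        /* illustrative numbers only, not measurements: 10 GB/s total,
           sigma = 0.4, configuration (4L, 1R) */
        double local_bw, remote_bw;
        per_clone_bw(10.0, 0.4, 4, 1, &local_bw, &remote_bw);
        printf("per local clone: %.2f GB/s, per remote clone: %.2f GB/s\n",
               local_bw, remote_bw);
        return 0;
    }

Under this reading a remote clone gets more bandwidth than a local clone whenever sigma exceeds n_remote / (n_local + n_remote), which is the kind of effect the (4L, 1R) measurement on slide 25 shows.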

29 Contention affects sharing factor
[Figure: clones accessing Processor 0's DRAM memory across QPI while additional contender clones are added on Processor 0.]

30 Contention affects sharing factor
[Chart: sharing factor under increasing contention.]

31 Combined accesses
[Chart: total bandwidth (GB/s).]

32 Contention affects sharing factor
The sharing factor decreases with contention.
With local contention, remote execution becomes more favorable.

33 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

34 The next generation
Intel Westmere X5680
– 2 x 6 cores
– 12 MB level 3 cache
– 144 GB DDR3 RAM
– 6.4 GT/s QPI
[Figure: two six-core processors, each with a level 3 cache, Global Queue, integrated memory controller (IMC), and local DRAM memory, connected by QPI.]

35 The next generation
[Chart: total bandwidth (GB/s) on the Westmere system.]

36 Conclusions
Optimizing for data locality can be suboptimal.
Applications:
– OS scheduling (see the ISMM'11 paper)
– data placement and computation scheduling

37 Thank you! Questions?

