1 Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

2 Summary
– NUMA multicore systems are unfair to local memory accesses
– Local execution is sometimes suboptimal

3 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

4 NUMA multicores: how it happened
First generation: SMP
[Figure: two quad-core processors (cores 0-3 and 4-7), each behind a bus controller (BusC), share a Northbridge with a memory controller (MC) and a single DRAM memory.]

5 NUMA multicores: how it happened
Next generation: NUMA
[Figure, animation step: moving away from the shared Northbridge toward per-processor memory controllers (MC), local DRAM memories, and a processor interconnect (IC).]

6 NUMA multicores: how it happened
Next generation: NUMA
[Figure: cores 0-3 and 4-7; each processor has its own memory controller (MC) and local DRAM memory, and the processors are connected by an interconnect (IC).]

7 NUMA multicores: how it happened
Next generation: NUMA
[Figure: further animation step of the same NUMA diagram.]

8 Bandwidth sharing
– Frequent scenario: bandwidth shared between cores
– Sharing model for the Intel Nehalem
[Figure: two processors (cores 0-3 and 4-7), each with a memory controller (MC) and local DRAM memory, connected by an interconnect (IC).]

9 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

10 Evaluation system
Intel Nehalem E5520
– 2 x 4 cores
– 8 MB level 3 cache
– 12 GB DDR3 RAM
– 5.86 GT/s QPI
[Figure: Processor 0 and Processor 1, each with four cores, a level 3 cache, a Global Queue, and a memory controller (MC) with local DRAM memory; the two processors are connected by QPI.]

11 Bandwidth sharing: local accesses
[Figure: cores 0 and 3 on Processor 0 access Processor 0's DRAM memory through its Global Queue and memory controller.]

12 Bandwidth sharing: remote accesses
[Figure: cores 4 and 5 on Processor 1 access Processor 0's DRAM memory across QPI, through Processor 0's Global Queue and memory controller.]

13 Bandwidth sharing: combined accesses
[Figure: cores 0 and 3 on Processor 0 and cores 4 and 5 on Processor 1 access Processor 0's DRAM memory at the same time; local and remote requests meet in Processor 0's Global Queue.]

14 The Global Queue is the mechanism that arbitrates between the different types of memory accesses. We look at the fairness of the Global Queue for:
– local memory accesses
– remote memory accesses
– combined memory accesses

15 Benchmark program: STREAM triad
    for (i = 0; i < SIZE; i++) {
        a[i] = b[i] + SCALAR * c[i];
    }
Multiple co-executing triad clones.
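A self-contained sketch of one triad clone in C; the array size, the scalar value, and the bandwidth accounting are illustrative choices rather than values taken from the slides (the STREAM convention counts 24 bytes moved per triad iteration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE   (32L * 1024 * 1024)   /* illustrative: far larger than the 8 MB L3 cache */
    #define SCALAR 3.0

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double *a = malloc(SIZE * sizeof *a);
        double *b = malloc(SIZE * sizeof *b);
        double *c = malloc(SIZE * sizeof *c);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < SIZE; i++) { b[i] = 1.0; c[i] = 2.0; }   /* touch the pages */

        double t0 = seconds();
        for (long i = 0; i < SIZE; i++)        /* the STREAM triad kernel */
            a[i] = b[i] + SCALAR * c[i];
        double t1 = seconds();

        /* 24 bytes per iteration: read b[i], read c[i], write a[i] */
        printf("%.2f GB/s (a[0] = %.1f)\n", 24.0 * SIZE / (t1 - t0) / 1e9, a[0]);

        free(a); free(b); free(c);
        return 0;
    }

In the experiments, several such clones run at the same time, so the interesting quantity is the bandwidth each clone obtains while co-executing with the others.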

16 Multi-clone experiments
All memory is allocated on Processor 0.
Local clones run on Processor 0; remote clones run on Processor 1.
Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R), where L counts local clones and R counts remote clones.
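How one clone might be bound for these experiments, assuming libnuma is used (the slides do not say which mechanism the authors used; numactl on the command line would work as well): the arrays are always placed on node 0, and the clone runs either on node 0 (local clone) or on node 1 (remote clone).

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SIZE   (32L * 1024 * 1024)   /* illustrative */
    #define SCALAR 3.0

    int main(int argc, char **argv)
    {
        if (numa_available() < 0) { fprintf(stderr, "NUMA not supported\n"); return 1; }

        int run_node = (argc > 1) ? atoi(argv[1]) : 0;   /* 0 = local clone, 1 = remote clone */
        numa_run_on_node(run_node);                      /* run on that node's cores */

        /* all memory placed on Processor 0's node, as in the experiments */
        double *a = numa_alloc_onnode(SIZE * sizeof *a, 0);
        double *b = numa_alloc_onnode(SIZE * sizeof *b, 0);
        double *c = numa_alloc_onnode(SIZE * sizeof *c, 0);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < SIZE; i++) { b[i] = 1.0; c[i] = 2.0; }
        for (long i = 0; i < SIZE; i++)                  /* triad kernel from the previous slide */
            a[i] = b[i] + SCALAR * c[i];

        printf("clone on node %d done (a[0] = %.1f)\n", run_node, a[0]);
        numa_free(a, SIZE * sizeof *a);
        numa_free(b, SIZE * sizeof *b);
        numa_free(c, SIZE * sizeof *c);
        return 0;
    }

Compiled with -lnuma. Launching, for example, two copies with argument 0 and three copies with argument 1 gives the (2L, 3R) configuration.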

17 GQ fairness: local accesses
[Chart: total bandwidth (GB/s) as local clones on Processor 0 access Processor 0's DRAM memory.]

18 GQ fairness: remote accesses
[Chart: total bandwidth (GB/s) as remote clones on Processor 1 access Processor 0's DRAM memory across QPI.]

19 Global Queue fairness
The Global Queue is fair when there are only local accesses or only remote accesses in the system.
What about combined accesses?

20 GQ fairness: combined accesses
Execute the clones in all possible configurations (xL, yR), with the number of local clones x and the number of remote clones y each ranging from 0 to 4, for example (2L, 3R).
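For reference, a small sketch that enumerates the configuration grid the experiment sweeps; the empty (0L, 0R) cell runs no clones and is skipped:

    #include <stdio.h>

    int main(void)
    {
        /* every combination of 0..4 local and 0..4 remote clones */
        for (int local = 0; local <= 4; local++)
            for (int remote = 0; remote <= 4; remote++) {
                if (local == 0 && remote == 0)
                    continue;                   /* no clones, nothing to run */
                printf("(%dL, %dR)\n", local, remote);
            }
        return 0;
    }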

21 GQ fairness: combined accesses
Execute the clones in all possible configurations.
[Configuration grid: 0 to 4 local clones by 0 to 4 remote clones.]

22 GQ fairness: combined accesses
[Chart: total bandwidth (GB/s) across the combined-access configurations.]

23 GQ fairness: combined accesses
Execute the clones in all possible configurations.
[Configuration grid: 0 to 4 local clones by 0 to 4 remote clones.]

24 Combined accesses
[Chart: total bandwidth (GB/s) for the combined-access configurations.]

25 Combined accesses
In configuration (4L, 1R) the remote clone gets 30% more bandwidth than a local clone.
Remote execution can be better than local execution.

26 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

27 Bandwidth sharing model
[Figure: the two processors, each with a level 3 cache, Global Queue, integrated memory controller (IMC), and local DRAM memory, connected by QPI; clones access Processor 0's DRAM memory.]

28 Sharing factor
The sharing factor characterizes the fairness of the Global Queue.
How does the sharing factor depend on contention?
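The slides do not spell out the model, so the following is only a hypothetical sketch: assume the Global Queue grants a fraction sigma of the total memory bandwidth B to the remote request stream and the rest to the local stream, and that each stream is divided evenly among its clones (the symbol and the exact form of the authors' model may differ).

    #include <stdio.h>

    /* Hypothetical sharing model, not taken from the slides: the Global Queue
     * hands sigma * B to the remote stream and (1 - sigma) * B to the local
     * stream; each stream is split evenly among its clones. */
    static void per_clone_bw(double B, double sigma, int n_local, int n_remote,
                             double *local_bw, double *remote_bw)
    {
        *local_bw  = n_local  ? (1.0 - sigma) * B / n_local : 0.0;
        *remote_bw = n_remote ? sigma * B / n_remote        : 0.0;
    }

    int main(void)
    {
        /* illustrative numbers only, not measurements: 10 GB/s total,
           sigma = 0.4, configuration (4L, 1R) */
        double local_bw, remote_bw;
        per_clone_bw(10.0, 0.4, 4, 1, &local_bw, &remote_bw);
        printf("per local clone: %.2f GB/s, per remote clone: %.2f GB/s\n",
               local_bw, remote_bw);
        return 0;
    }

Under this reading a remote clone gets more bandwidth than a local clone whenever sigma exceeds n_remote / (n_local + n_remote), which is the kind of effect the (4L, 1R) measurement on slide 25 shows.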

29 Contention affects sharing factor
[Figure: clones accessing Processor 0's DRAM memory across QPI while additional contender clones are added on Processor 0.]

30 Contention affects sharing factor
[Chart: sharing factor under increasing contention.]

31 Combined accesses
[Chart: total bandwidth (GB/s).]

32 Contention affects sharing factor
The sharing factor decreases with contention.
With local contention, remote execution becomes more favorable.

33 Outline
– NUMA multicores: how it happened
– Experimental evaluation: Intel Nehalem
– Bandwidth sharing model
– The next generation: Intel Westmere

34 The next generation
Intel Westmere X5680
– 2 x 6 cores
– 12 MB level 3 cache
– 144 GB DDR3 RAM
– 6.4 GT/s QPI
[Figure: two six-core processors, each with a level 3 cache, Global Queue, integrated memory controller (IMC), and local DRAM memory, connected by QPI.]

35 The next generation
[Chart: total bandwidth (GB/s) on the Westmere system.]

36 Conclusions
Optimizing for data locality can be suboptimal.
Applications:
– OS scheduling (see the ISMM'11 paper)
– data placement and computation scheduling

37 Thank you! Questions?

