Slide 2 – NUMA(YEY), by Jacob Kugler
© Copyright 2015 EMC Corporation. All rights reserved.

Slide 3 – MOTIVATION
The next generation of EMC VPLEX hardware is NUMA based.
– What is the expected performance benefit?
– How to best adjust the code to NUMA?
Gain experience with NUMA tools.

Slide 4 – VPLEX OVERVIEW
A unique virtual storage technology that enables:
– Data mobility and high availability within and between data centers.
– Mission-critical continuous availability between two synchronous sites.
– Distributed RAID1 between two sites.

Slide 5 – UMA OVERVIEW – CURRENT STATE
Uniform Memory Access (diagram): CPU0–CPU5 all share a single pool of RAM.

Slide 6 – NUMA OVERVIEW – NEXT GENERATION
Non-Uniform Memory Access (diagram): two nodes, NODE0 and NODE1, each with its own local RAM; the CPUs are split between the nodes.

Slide 7 – POLICIES
– Allocation of memory on specific nodes
– Binding threads to specific nodes/CPUs
Can be applied to:
– Process
– Memory area

Slide 8 – POLICIES CONT.

Name         Description
default      Allocate on the local node (the node the thread is running on)
bind         Allocate on a specific set of nodes
interleave   Interleave memory allocations on a set of nodes
preferred    Try to allocate on a node first

* Policies can also be applied to shared memory regions.
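As a rough illustration of the process vs. memory-area distinction above, the sketch below uses the Linux policy calls exposed by libnuma's numaif.h. It is not from the presentation; the node masks and sizes are arbitrary example values, assuming a 2-node Linux system with libnuma installed.

```c
/* Minimal sketch: process-wide policy vs. memory-area policy.
 * Build with: gcc policy_demo.c -lnuma */
#include <numaif.h>      /* set_mempolicy(), mbind(), MPOL_* */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    unsigned long both_nodes = (1UL << 0) | (1UL << 1);  /* example: nodes 0 and 1 */
    unsigned long node0      = 1UL << 0;                 /* example: node 0 only */

    /* Process (thread) policy: future allocations by this thread are
     * interleaved over the nodes in the mask. */
    if (set_mempolicy(MPOL_INTERLEAVE, &both_nodes, sizeof(both_nodes) * 8) != 0)
        perror("set_mempolicy");

    /* Memory-area policy: bind one specific mapping to node 0,
     * independently of the process-wide policy. */
    size_t len = 2 * 1024 * 1024;
    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) { perror("mmap"); return 1; }
    if (mbind(area, len, MPOL_BIND, &node0, sizeof(node0) * 8, 0) != 0)
        perror("mbind");

    munmap(area, len);
    return 0;
}
```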

Slide 9 – DEFAULT POLICY
Diagram: a running thread on each node (Node 0 and Node 1) performs local memory accesses against its own node's RAM.

Slide 10 – BIND/PREFERRED POLICY
Diagram: memory is allocated on a chosen node; the thread running on Node 0 accesses it locally, while the thread running on Node 1 accesses it remotely.

Slide 11 – INTERLEAVE POLICY
Diagram: allocations are spread across both nodes, so the running thread performs a mix of local accesses (Node 0) and remote accesses (Node 1).

Slide 12 – NUMACTL
Command-line tool for running a program under a specific NUMA policy.
Useful for programs that cannot be modified or recompiled.

Slide 13 – NUMACTL EXAMPLES
– numactl --cpubind=0 --membind=0,1 <program> – run the program on node 0 and allocate memory from nodes 0 and 1.
– numactl --interleave=all <program> – run the program with memory interleaved on all available nodes.

Slide 14 – LIBNUMA
A library that offers an API for NUMA policy.
Fine-grained tuning of NUMA policies:
– Changing the policy in one thread does not affect other threads.

Slide 15 – LIBNUMA EXAMPLES
– numa_available() – checks whether NUMA is supported on the system.
– numa_run_on_node(int node) – binds the current thread to a specific node.
– numa_max_node() – returns the number of the highest node in the system.
– numa_alloc_interleaved(size_t size) – allocates size bytes of memory, page-interleaved across all available nodes.
– numa_alloc_onnode(size_t size, int node) – allocates memory on a specific node.
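A short sketch combining the calls listed above (assumed usage of the public libnuma API, not code from the presentation; the buffer size is an arbitrary example):

```c
/* Build with: gcc libnuma_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {                    /* NUMA API not supported */
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    printf("highest node number: %d\n", numa_max_node());

    numa_run_on_node(0);                           /* bind this thread to node 0 */

    size_t sz = 1 << 20;
    char *local  = numa_alloc_onnode(sz, 0);       /* memory placed on node 0 */
    char *spread = numa_alloc_interleaved(sz);     /* pages interleaved over all nodes */

    memset(local, 0, sz);                          /* touch the pages */
    memset(spread, 0, sz);

    numa_free(local, sz);
    numa_free(spread, sz);
    return 0;
}
```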

Slide 16 – HARDWARE OVERVIEW
Diagram: two nodes connected by QPI (Quick Path Interconnect). Each node has 6 CPUs (2 hyper-threads each) with per-CPU L1 and L2 caches, a shared L3 cache, and local RAM.

Slide 17 – HARDWARE OVERVIEW

Processor              Intel Xeon Processor E5-2620
# Cores                6
# Threads              12
QPI speed              8.0 GT/s = 64 GB/s
L1 data cache          32 KB
L1 instruction cache   32 KB
L2 cache               256 KB
L3 cache               15 MB
RAM                    62.5 GB

GT/s = gigatransfers per second; GB/s = GT/s * bus bandwidth (8 B).

Slide 18 – LINUX PERF TOOL
Command-line profiler based on perf_events:
– Hardware events – counted by the CPU
– Software events – counted by the kernel
perf list – lists the pre-defined events (to be used with -e):
– instructions [Hardware event]
– context-switches OR cs [Software event]
– L1-dcache-loads [Hardware cache event]
– rNNN [Raw hardware event descriptor]

Slide 19 – PERF STAT
Keeps a running count of selected events during process execution.
perf stat [options] -e [list of events]
Examples:
– perf stat -e page-faults my_exec – counts the page faults that occur during the execution of my_exec.
– perf stat -a -e instructions,r81d0 sleep 5 – system-wide count on all CPUs; counts instructions and L1 d-cache loads.
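perf stat is built on the perf_events kernel interface mentioned on the previous slide. The sketch below (a generic Linux example, not part of the presentation) counts retired instructions for a small code region by calling perf_event_open() directly:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;            /* hardware event, counted by the CPU */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* same event as "instructions" in perf list */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000; i++)     /* the measured region */
        sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```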

Slide 20 – CHARACTERIZING OUR SYSTEM
Linux perf tool, CPU performance counters:
– L1-dcache-loads
– L1-dcache-stores
Test: ran IO for 120 seconds.
Result: RD/WR ratio = 2:1.

Slide 21 – THE SIMULATOR
Measures performance for different memory allocation policies on a 2-node system.
Throughput is measured as the time it takes to complete N iterations.
Threads randomly access a shared memory region.

Slide 22 – THE SIMULATOR CONT.
Config file parameters:
– #Threads
– RD/WR ratio – ratio between the number of read and write operations a thread performs
– Policy – local / interleave / remote
– Size – the size of memory to allocate
– #Iterations
– Node0/Node1 – ratio between threads bound to Node 0 and threads bound to Node 1
– RW_SIZE – size of a read or write operation in each iteration
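For orientation, the sketch below shows roughly what such a simulator's inner loop can look like. It is a hand-written illustration only: the actual EMC simulator and its config-file format are not shown in the slides, so every constant and name here is made up (thread count, iteration count, data size, RD/WR ratio).

```c
/* Build with: gcc -O2 -pthread sim.c -lnuma */
#include <numa.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define N_THREADS   8
#define ITERATIONS  1000000UL
#define DATA_SIZE   (150UL * 1024 * 1024)    /* per-node share of the shared data */
#define RW_SIZE     128                      /* bytes touched per iteration */
#define RD_PER_WR   2                        /* RD/WR ratio 2:1 */

static char  *data;                          /* shared memory all threads access */
static size_t data_size;

struct worker { int node; unsigned seed; };

static void *run(void *arg)
{
    struct worker *w = arg;
    numa_run_on_node(w->node);               /* bind this thread to its node */
    char buf[RW_SIZE] = {0};

    for (unsigned long i = 0; i < ITERATIONS; i++) {
        size_t off = (size_t)rand_r(&w->seed) % (data_size - RW_SIZE);
        if (i % (RD_PER_WR + 1) == RD_PER_WR)
            memcpy(data + off, buf, RW_SIZE);     /* write operation */
        else
            memcpy(buf, data + off, RW_SIZE);     /* read operation */
    }
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) return 1;

    data_size = 2 * DATA_SIZE;
    data = numa_alloc_interleaved(data_size); /* "interleave" policy; swap in
                                                 numa_alloc_onnode() for local/remote */
    pthread_t tid[N_THREADS];
    struct worker w[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {     /* balanced setup: half the threads per node */
        w[i].node = i % 2;
        w[i].seed = (unsigned)i + 1;
        pthread_create(&tid[i], NULL, run, &w[i]);
    }
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);           /* throughput = wall time until the last join */

    numa_free(data, data_size);
    return 0;
}
```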

Slide 23 – EXPERIMENT #1
Compare the performance of 3 policies:
– Local – threads access memory on the node they run on.
– Remote – threads access memory on a different node from the one they run on.
– Interleave – memory is interleaved across nodes (threads access both local and remote memory).

Slide 24 – EXPERIMENT #1
3 policies – local, interleave, remote.
#Threads varies from 1 to 24 (the maximal number of concurrent threads in the system).
2 setups – balanced / unbalanced workload (diagram).

Slide 25 – EXPERIMENT #1
Configurations:
– #Iterations = 100,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 2:1
– RW_SIZE = 128 bytes

Slide 26 – RESULTS – BALANCED WORKLOAD
Chart: time it took until the last thread finished working, per policy and thread count. Annotated differences: -37%, +69%, -46%, +83%.

Slide 27 – RESULTS – UNBALANCED WORKLOAD
Chart: same measurement for the unbalanced workload. Annotated differences: -35%, +73%, -45%, +87%.

Slide 28 – RESULTS – COMPARED
Chart: local, remote, and interleave policies compared for the balanced and unbalanced workloads.

Slide 29 – CONCLUSIONS
The more concurrent threads in the system, the more impact memory locality has on performance.
In applications whose number of concurrent threads is at most the number of cores in one node, the best solution is to bind the process and allocate its memory on the same node.
In applications whose number of concurrent threads is up to the number of cores in the 2-node system, disabling NUMA (interleaving memory) gives performance similar to binding the process and allocating memory on the same node.

Slide 30 – EXPERIMENT #2
Local access is significantly faster than remote access.
Our system uses RW locks to synchronize memory access.
→ Does maintaining read locality by mirroring the data on both nodes perform better than the current interleave policy?

Slide 31 – EXPERIMENT #2
Purpose: find the RD/WR ratio for which maintaining read locality is better than memory interleaving.
Setup 1: Interleaving
– Single RW lock.
– Data is interleaved across both nodes.
Setup 2: Mirroring data
– RW lock per node.
– Each read operation accesses local memory.
– Each write operation is done to both local and remote memory.
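A compact sketch of the two setups (again an illustration under assumed names and sizes, not the presentation's code): setup 1 keeps one interleaved copy behind a single RW lock; setup 2 keeps a copy per node, each behind its own RW lock, so reads stay local while writes must update both copies.

```c
/* Build with: gcc -pthread mirror_demo.c -lnuma */
#include <numa.h>
#include <pthread.h>
#include <string.h>

#define NODES 2

/* Setup 1: one interleaved copy, one global RW lock. */
static pthread_rwlock_t global_lock = PTHREAD_RWLOCK_INITIALIZER;
static char *interleaved;

/* Setup 2: one copy per node, one RW lock per node. */
static pthread_rwlock_t node_lock[NODES] = {
    PTHREAD_RWLOCK_INITIALIZER, PTHREAD_RWLOCK_INITIALIZER
};
static char *mirror[NODES];

/* Mirrored read: only the local copy (and the local lock) is touched. */
static void mirrored_read(int my_node, size_t off, char *dst, size_t len)
{
    pthread_rwlock_rdlock(&node_lock[my_node]);
    memcpy(dst, mirror[my_node] + off, len);
    pthread_rwlock_unlock(&node_lock[my_node]);
}

/* Mirrored write: every copy is updated, so both locks are taken
 * (in a fixed order to avoid deadlock). */
static void mirrored_write(size_t off, const char *src, size_t len)
{
    for (int n = 0; n < NODES; n++) pthread_rwlock_wrlock(&node_lock[n]);
    for (int n = 0; n < NODES; n++) memcpy(mirror[n] + off, src, len);
    for (int n = NODES - 1; n >= 0; n--) pthread_rwlock_unlock(&node_lock[n]);
}

/* Interleaved read: single lock, single copy spread over both nodes. */
static void interleaved_read(size_t off, char *dst, size_t len)
{
    pthread_rwlock_rdlock(&global_lock);
    memcpy(dst, interleaved + off, len);
    pthread_rwlock_unlock(&global_lock);
}

int main(void)
{
    if (numa_available() < 0) return 1;
    size_t size = 1 << 20;                       /* arbitrary example size */
    interleaved = numa_alloc_interleaved(size);
    for (int n = 0; n < NODES; n++)
        mirror[n] = numa_alloc_onnode(size, n);

    char buf[512] = {0};
    mirrored_write(0, buf, sizeof(buf));
    mirrored_read(0, 0, buf, sizeof(buf));
    interleaved_read(0, buf, sizeof(buf));
    return 0;
}
```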

Slide 32 – EXPERIMENT #2
Configurations:
– #Iterations = 25,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 12 : i, for 1 <= i <= 12
– #Threads = 8 ; 12
– RW_SIZE = 512 ; 1024 ; 2048 ; 4096 bytes

Slide 33 – RW LOCKS – MIRRORING VS. INTERLEAVING
Chart: results for 8 threads.

Slide 34 – RW LOCKS – MIRRORING VS. INTERLEAVING
Chart: results for 12 threads.

Slide 35 – CONCLUSIONS
Memory-op size and the percentage of write operations both play a role in deciding which memory allocation policy is better.
In applications with a small mem-op size (512 B) and up to 50% write operations, mirroring is the better option.
In applications with a mem-op size of 4 KB or more, mirroring is worse than interleaving the memory and using a single RW lock.

Slide 36 – SUMMARY
Fine-grained memory allocation can lead to performance improvements for certain workloads.
– More investigation is needed in order to configure a suitable memory policy that utilizes NUMA's capabilities.
