Slide 1: NUMA(YEY) by Jacob Kugler

Slide 2: MOTIVATION
The next generation of EMC VPLEX hardware is NUMA based.
– What is the expected performance benefit?
– How to best adjust the code to NUMA?
Gain experience with NUMA tools.

Slide 3: VPLEX OVERVIEW
A unique virtual storage technology that enables:
– Data mobility and high availability within and between data centers.
– Mission-critical continuous availability between two synchronous sites.
– Distributed RAID1 between two sites.

Slide 4: UMA OVERVIEW – CURRENT STATE
Uniform Memory Access
[Diagram: CPU0–CPU5 all sharing a single RAM]

Slide 5: NUMA OVERVIEW – NEXT GENERATION
Non-Uniform Memory Access
[Diagram: CPU0–CPU5 split between NODE0 and NODE1, each node with its own local RAM]

Slide 6: POLICIES
Allocation of memory on specific nodes.
Binding threads to specific nodes/CPUs.
Can be applied to:
– Process
– Memory area

Slide 7: POLICIES CONT.
Name – Description
default – Allocate on the local node (the node the thread is running on)
bind – Allocate on a specific set of nodes
interleave – Interleave memory allocations on a set of nodes
preferred – Try to allocate on a given node first, falling back to other nodes if that fails
* Policies can also be applied to shared memory regions.

Slide 8: DEFAULT POLICY
[Diagram: a running thread on each node allocates from its own node's memory – local memory access only]

Slide 9: BIND/PREFERRED POLICY
[Diagram: both threads allocate from one designated node; the thread on Node 0 gets local access, the thread on Node 1 gets remote access]

Slide 10: INTERLEAVE POLICY
[Diagram: the running thread's allocations are spread across both nodes, so it performs a mix of local and remote memory accesses]

Slide 11: NUMACTL
Command-line tool for running a program under a specific NUMA policy.
Useful for programs that cannot be modified or recompiled.

Slide 12: NUMACTL EXAMPLES
numactl --cpubind=0 --membind=0,1
– Runs the program on node 0 and allocates memory from nodes 0 and 1.
numactl --interleave=all
– Runs the program with memory interleaved on all available nodes.

Slide 13: LIBNUMA
A library that offers an API for NUMA policy.
Fine-grained tuning of NUMA policies.
– Changing the policy in one thread does not affect other threads.

Slide 14: LIBNUMA EXAMPLES
numa_available() – checks whether NUMA is supported on the system.
numa_run_on_node(int node) – binds the current thread to a specific node.
numa_max_node() – returns the number of the highest node in the system.
numa_alloc_interleaved(size_t size) – allocates size bytes of memory, page-interleaved on all available nodes.
numa_alloc_onnode(size_t size, int node) – allocates memory on a specific node.
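A minimal sketch combining these calls (illustrative glue code, not part of the original slides; it assumes a Linux system with libnuma installed):

/* Illustrative libnuma usage sketch, not from the original slides. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() == -1) {          /* NUMA not supported: bail out */
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int last_node = numa_max_node();       /* highest node number, e.g. 1 on a 2-node box */
    printf("nodes: 0..%d\n", last_node);

    /* Pin the current thread to node 0 and allocate 150 MB there. */
    numa_run_on_node(0);
    size_t size = 150UL * 1024 * 1024;
    char *local = numa_alloc_onnode(size, 0);

    /* Allocate another region page-interleaved across all nodes. */
    char *spread = numa_alloc_interleaved(size);

    if (!local || !spread) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    memset(local, 0, size);                 /* touch the memory so pages are actually placed */
    memset(spread, 0, size);

    numa_free(local, size);
    numa_free(spread, size);
    return EXIT_SUCCESS;
}

Compile with gcc example.c -lnuma.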

Slide 15: HARDWARE OVERVIEW
[Diagram: two nodes, each with six cores (cpu0–cpu5, 2 hyper-threads per core), per-core L1 and L2 caches, a shared L3 cache, and local RAM; the nodes are connected by QPI (Quick Path Interconnect)]

Slide 16: HARDWARE OVERVIEW
Processor – Intel Xeon Processor E
# Cores – 6
# Threads – 12
QPI speed – 8.0 GT/s = 64 GB/s
L1 data cache – 32 KB
L1 instruction cache – 32 KB
L2 cache – 256 KB
L3 cache – 15 MB
RAM – 62.5 GB
GT/s = gigatransfers per second; GB/s = GT/s * bus width (8 B)

Slide 17: LINUX PERF TOOL
Command-line profiler based on perf_events:
– Hardware events – counted by the CPU
– Software events – counted by the kernel
perf list – lists the pre-defined events (to be used with -e):
– instructions [Hardware event]
– context-switches OR cs [Software event]
– L1-dcache-loads [Hardware cache event]
– rNNN [Raw hardware event descriptor]

Slide 18: PERF STAT
Keeps a running count of selected events during process execution.
perf stat [options] -e [list of events]
Examples:
– perf stat -e page-faults my_exec
  Counts the page faults that occurred during execution of my_exec.
– perf stat -a -e instructions,r81d0 sleep 5
  System-wide count on all CPUs for 5 seconds; counts #instructions and L1 dcache loads.

Slide 19: CHARACTERIZING OUR SYSTEM
Linux perf tool, CPU performance counters:
– L1-dcache-loads
– L1-dcache-stores
Test: ran IO for 120 seconds.
Result: RD/WR = 2:1

Slide 20: THE SIMULATOR
Measures performance of different memory allocation policies on a 2-node system.
Throughput is measured as the time it takes to complete N iterations.
Threads randomly access a shared memory region.

Slide 21: THE SIMULATOR CONT.
Config file parameters:
– #Threads
– RD/WR ratio – ratio between the number of read and write operations a thread performs
– Policy – local / interleave / remote
– Size – the size of memory to allocate
– #Iterations
– Node0/Node1 – ratio between threads bound to Node 0 and threads bound to Node 1
– RW_SIZE – size of a read or write operation in each iteration
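To make the parameters concrete, here is a minimal sketch of what one simulator thread might look like, assuming libnuma and POSIX threads; the names (sim_arg, run_thread) and the exact read/write scheduling are illustrative assumptions, not the original EMC simulator code:

/* Illustrative sketch of one simulator thread; not the original EMC code. */
#include <numa.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int      node;        /* node this thread is bound to (Node0/Node1)      */
    char    *mem;         /* shared memory, allocated per the chosen Policy  */
    size_t   mem_size;    /* Size                                            */
    long     iterations;  /* #Iterations                                     */
    int      rd_per_wr;   /* RD/WR ratio, e.g. 2 means 2 reads per write     */
    size_t   rw_size;     /* RW_SIZE                                         */
    unsigned seed;        /* per-thread seed for rand_r                      */
} sim_arg;

static void *run_thread(void *p)
{
    sim_arg *a = p;
    char buf[4096];                            /* large enough for the RW_SIZEs used */
    memset(buf, 0, sizeof buf);
    numa_run_on_node(a->node);                 /* bind this thread to its node */

    for (long i = 0; i < a->iterations; i++) {
        size_t off = rand_r(&a->seed) % (a->mem_size - a->rw_size);
        if (i % (a->rd_per_wr + 1) < a->rd_per_wr)
            memcpy(buf, a->mem + off, a->rw_size);    /* read  */
        else
            memcpy(a->mem + off, buf, a->rw_size);    /* write */
    }
    return NULL;
}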

Slide 22: EXPERIMENT #1
Compare performance of 3 policies:
– Local – threads access memory on the node they run on.
– Remote – threads access memory on a different node from the one they run on.
– Interleave – memory is interleaved across nodes (threads access both local and remote memory).

Slide 23: EXPERIMENT #1
3 policies – local, interleave, remote.
#Threads varies from 1 to 24 (the maximal number of concurrent threads in the system).
2 setups – balanced / unbalanced workload.
[Diagram: balanced vs. unbalanced placement of threads across the two nodes]

Slide 24: EXPERIMENT #1
Configurations:
– #Iterations = 100,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 2:1
– RW_SIZE = 128 Bytes

Slide 25: RESULTS – BALANCED WORKLOAD
Metric: time it took until the last thread finished working.
[Chart: annotated deltas of -37%, +69%, -46%, +83%]

Slide 26: RESULTS – UNBALANCED WORKLOAD
[Chart: annotated deltas of -35%, +73%, -45%, +87%]

Slide 27: RESULTS – COMPARED
[Chart: local, remote, and interleave results for the balanced and unbalanced setups, side by side]

Slide 28: CONCLUSIONS
The more concurrent threads in the system, the more impact memory locality has on performance.
In applications where the number of concurrent threads is up to the number of cores in 1 node, the best solution is to bind the process and allocate memory on the same node.
In applications where the number of concurrent threads is up to the number of cores in the 2-node system, disabling NUMA (interleaving memory) has performance similar to binding the process and allocating memory on the same node.

Slide 29: EXPERIMENT #2
Local access is significantly faster than remote access.
Our system uses RW locks to synchronize memory access.
→ Does maintaining read locality by mirroring the data on both nodes give better performance than the current interleave policy?

Slide 30: EXPERIMENT #2
Purpose: find the RD/WR ratio for which maintaining read locality is better than memory interleaving.
Setup 1: Interleaving
– Single RW lock
– Data is interleaved across both nodes
Setup 2: Mirroring data
– RW lock per node
– Each read operation accesses local memory.
– Each write operation is done to both local and remote memory.
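A minimal sketch of how Setup 2 could be structured, assuming POSIX rwlocks and libnuma allocations; the names (mirror_t, mirror_read, mirror_write) are illustrative assumptions, not the original implementation:

/* Illustrative sketch of the mirrored-data setup; not the original implementation. */
#include <numa.h>
#include <pthread.h>
#include <string.h>

typedef struct {
    char            *copy[2];   /* one full copy of the data per node */
    pthread_rwlock_t lock[2];   /* one RW lock per node               */
    size_t           size;
} mirror_t;

static void mirror_init(mirror_t *m, size_t size)
{
    for (int n = 0; n < 2; n++) {
        m->copy[n] = numa_alloc_onnode(size, n);  /* copy lives on node n */
        pthread_rwlock_init(&m->lock[n], NULL);
    }
    m->size = size;
}

/* Reads only touch the local node's copy. */
static void mirror_read(mirror_t *m, int node, size_t off, void *dst, size_t len)
{
    pthread_rwlock_rdlock(&m->lock[node]);
    memcpy(dst, m->copy[node] + off, len);
    pthread_rwlock_unlock(&m->lock[node]);
}

/* Writes update both copies (always in node order 0 then 1 to avoid deadlock). */
static void mirror_write(mirror_t *m, size_t off, const void *src, size_t len)
{
    for (int n = 0; n < 2; n++) {
        pthread_rwlock_wrlock(&m->lock[n]);
        memcpy(m->copy[n] + off, src, len);
        pthread_rwlock_unlock(&m->lock[n]);
    }
}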

Slide 31: EXPERIMENT #2
Configurations:
– #Iterations = 25,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 12 : i, where 1 <= i <= 12
– #Threads = 8 ; 12
– RW_SIZE = 512 ; 1024 ; 2048 ; 4096 Bytes

Slide 32: RW LOCKS – MIRRORING VS. INTERLEAVING
[Chart: mirroring vs. interleaving results with 8 threads]

Slide 33: RW LOCKS – MIRRORING VS. INTERLEAVING
[Chart: mirroring vs. interleaving results with 12 threads]

Slide 34: CONCLUSIONS
Memory-op size and % of write operations both play a role in deciding which memory allocation policy is better.
In applications with small mem-op size (512 B) and up to 50% write operations, mirroring is the better option.
In applications with mem-op size of 4 KB or more, mirroring is worse than interleaving the memory and using a single RW lock.

Slide 35: SUMMARY
Fine-grained memory allocation can lead to performance improvements for certain workloads.
– More investigation is needed in order to configure a suitable memory policy that utilizes NUMA capabilities.