1
Oracle Database Virtualization and Linux Best Practices
RMOUG Technology Day – Oracle Database Virtualization and Linux Best Practices
3
Charles Kim @racdba Oracle ACE Director
Founder and President of Viscosity
Over 25 years of Oracle expertise: mission-critical databases, RAC, Data Guard, ASM, RMAN, SharePlex / GoldenGate
Specializing in "Automation of Everything" done more than once!
President, IOUG Cloud Computing (and Virtualization) SIG
Blog Sites:
Oracle Exadata Certified Implementation Specialist
Certified RAC Expert
4
Nitin Vengurlekar CTO & Chief Architect at Viscosity
Responsible for service delivery, with a focus on virtualization and engineered and converged systems
Developed white papers and best practices for application/database high availability and consolidation
18 years with Oracle: 6 years with Oracle Support, 9 years with RAC Product Management, 3 years as a "Private Database Cloud" evangelist
Follow me on Twitter: @dbcloudshifu
5
Viscosity – Who We Are and What We Do
Oracle Platinum Partner; experts in Exadata, Cloud, and Virtualization
Premier Oracle services, with a focus on full-stack services: E-Business Suite and OBIEE to RAC
VMware Partner: selected as one of four premier VMware Select Partners for global U2vL migration
Converted dozens of legacy systems to virtualized systems; in 2016 we migrated/virtualized 5 mission-critical systems
Work with VMware engineering on strategic projects, e.g. the Intel 3D XP project
6
Staff Augmentation: Workforce Capacity on Demand
Oracle License Management: get the most out of your Oracle investment. Zero-Downtime Migrations. Professional Services where you need them most. Custom Application Development. Performance Health Checks: how's it running?
Here's a brief portfolio of what we do, and I'll highlight a couple of them for you.
Performance Health Checks: popular for customers ready to upgrade or migrate to a new platform. Example: we worked with a customer where we found major performance issues in their application and database; we identified the key issues and fixed them in time for the client to run their month-end processing. If you're experiencing pain at any level of the stack, let us come help you.
Zero-Downtime Migrations and Upgrades: this has been our most popular service and what we're best known for. For example, we've helped many customers complete AIX-to-Linux migrations with zero downtime.
Managed Services: we provide 24x7 support through our managed services, currently supporting dozens of Exadata and ZFS systems. Our DBA services offer full-stack architects who specialize in everything from E-Business implementation, OBI data implementation, and analytics to data security.
App Dev: while we're well known for our database services, we're also regarded as leaders in rapid application development, doing everything from building custom apps with Oracle APEX to mobile and web design.
Lastly, Oracle License Management: we know how hard it is, and we have experts who worked in that space at Oracle. We want to provide you with the most up-to-date knowledge and best practices to manage, and more importantly maximize, your investment. You definitely want to do it right; no one wants an audit.
We provide all these services across industries, from banking to healthcare to retail and the public sector.
DBA Services: remote and on-site, on-call support, managed services. Staff augmentation: workforce capacity on demand.
7
The World According to Me
Got Married in 7.3 Son #1 was born in 8.1.7 Son #2 was born in Son #3 was born in Wrote first book in Beta Formed Viscosity in / Got Married in 7.1.3 Son was born in 8.1.6 Daughter was born in Joined Oracle Product Development in 10.1a (10i) Tinkering with forming Viscosity in Formed Viscosity in /
8
Agenda A Philosophical Discussion about Why
What to Expect… When You're Expecting to Virtualize (not platform specific, but VMware will be used as context)
Considerations and Best Practices for Virtualizing Oracle Databases
Linux Best Practices
Summary
Q & A
9
NO… NO, THERE WILL BE NO DISCUSSION ON ORACLE-VMWARE LICENSING OR SUPPORTABILITY!
10
What to Expect, When You Are Expecting…..to Virtualize
Considerations for Virtualizing Oracle Databases
11
Virtualization Objective… It’s all in the audience
vAdmins – "Virtualize First" policy: show that the performance impact can be minimized and stays within about 5% of bare metal. You have virtualized other applications/tiers… now it's time to virtualize your database systems.
DBAs – justify why I should let *you* virtualize my application. Virtualize all you want, just don't impact my performance. Show me improvement in HA, provisioning, and setup time.
12
Before you Virtualize …Rationalize, Standardize, and then Consolidate
Determine what consolidation means: there is no big ROI in simple server consolidation; drive higher and smarter consolidation densities by combining server and component consolidation and reducing the instance count.
Rationalize: reduce the distinct number of Linux and database versions; upgrade and standardize the stack.
We have seen lots of cases where virtualization is just a platform for retiring old hardware; consolidating without rationalization only reduces the server footprint.
Make sure you have some level of buy-in from cross-organizational leaders/SMEs on virtualization.
13
Considerations for Virtualizing Oracle Databases
14
Considerations - VMware Cluster
What database configurations will you be running: RAC, non-RAC, or mixed? Use good judgment on mixing workloads in the same VMware cluster.
Size the VMware cluster appropriately: create VMware clusters large enough to house several RAC clusters as well as a collection of non-RAC (single-instance) databases. A 3-node VMware cluster seems to be the sweet spot for customers.
If non-prod and prod workloads (or non-prod and prod ESXi servers) need to be part of the same cluster, thus allowing larger/denser clusters, then precise resource pools need to be defined so that prod VMs have appropriate shares allotted. Use DRS for non-prod.
15
Considerations - VMware Cluster
If more aggressive consolidation is needed, then memory reservations (for the SGA footprint) should be considered. Nevertheless, a suitable VMware slot size should be defined, as this affects VM placement policies for HA and DRS. A slot size is a logical representation of the memory and CPU resources used to satisfy the requirements for any VM that will be powered on in the cluster; thus any reservations, limits, or shares defined at the VM level will impact the HA slot calculation.
This describes the deployment of a set of RAC databases, each of which is evenly distributed across ESXi hosts using anti-affinity rules.
16
Considerations - VMware Cluster
What features will you be leveraging: vMotion, HA, DRS, shares/reservations/limits, FT, etc.? Some features aren't meaningful for RAC. Consider affinity and anti-affinity rules.
Develop a strategy for small, medium, and large VMs; it makes consolidation definitions and density much easier.
Consider monster VMs with many CPUs: consolidate many databases on one VM, consolidate many databases into one single database (schema or PDB), or combine the two models.
17
Considerations - VMware Cluster
As usual, there's a trade-off: there is no "free virtualized lunch". Weigh higher performance/low latency against increased hypervisor overhead, and decide up front whether you will maintain good average response times or strive for low response times at peaks.
18
When Consolidation is the same as Juggling
Juggling 3 balls is fairly easy; nobody wants to see that Vegas act. Juggling 10 balls is much more difficult: you need either faster hands or more hands. Juggling different-sized objects requires even more concentration; now that's a Vegas act I'll go see.
Sharing by its very nature will have an impact on everyone; the idea is to share properly, using proven best practices. The goal is to eliminate "noisy neighbors".
19
Virtualization of Databases – The Time Has Come for the Final Frontier
Application performance requirements: processors are getting faster, caches are bigger, NUMA awareness is better, hypervisors are smarter, and storage and networks are faster. All of this makes a perfect case for virtualizing almost any application.
With each version, VMware has increased performance and scalability by leaps and bounds. Any lingering performance concerns about VMware virtual machines are largely a lagging perception from very early generations that had limited capabilities.
20
POINT / COUNTERPOINT: TO RAC OR NOT
RAC: scalable CPU and memory spread wide across nodes; application-level HA; better failure detection and faster instance recovery.
NOT RAC: easier to create a single monster VM; Oracle Restart (GI Standalone) detects GI resource failures; cheaper.
21
Design Considerations for Sizing and Best Practices Hardware and BIOS
22
Key points about the Best Practices discussion…
You will hear a lot of best practices recommendations Some of the recommendations may not apply to your specific environment There are no brownie points given, nor points taken away, for any side in any discussion because… Perceptions differ
23
Let's keep it in perspective - where are my settings?
ESXi Host (server) VM Linux Guest BIOS
24
Building an Optimal Oracle Architecture on vSphere
Starts at the ESXi BIOS level
25
Hardware BIOS Settings - Best Practices
Ensure all hardware in the system is on the HCL.
Hardware-assisted virtualization (HV): CPU virtualization (Intel VT-x / AMD-V), memory virtualization (Intel EPT / AMD RVI), I/O virtualization (Intel VT-d / AMD-Vi/IOMMU). In ESXi, set the CPU/MMU Virtualization option to Automatic.
Hardware node interleaving disabled = NUMA enabled; node interleaving enabled = NUMA disabled.
Most out-of-the-box hardware is not VT enabled. However, if you order "virtualization ready" servers/blades (for example from Dell or Cisco UCS), all drivers and BIOS settings are already set.
Hardware-assisted Virtualization (HV) in detail:
CPU virtualization (Intel VT-x and AMD-V) gives the option of using either HV or binary translation (BT) for the VM. While HV outperforms BT for the vast majority of workloads, there are a few workloads where the reverse is true. For a 64-bit guest operating system to run on an Intel processor, the processor must have hardware-assisted CPU virtualization.
Memory Management Unit (MMU) virtualization: Intel Extended Page Tables (EPT) and AMD Rapid Virtualization Indexing (RVI) / Nested Page Tables (NPT) address the overhead of MMU virtualization by providing hardware support to virtualize the MMU.
I/O MMU virtualization (Intel VT-d and AMD-Vi/IOMMU): Intel Virtualization Technology for Directed I/O and AMD I/O Virtualization are I/O MMUs that remap I/O DMA transfers and device interrupts, allowing VMs to have direct access to hardware I/O devices such as network cards, HBAs, and GPUs.
UP vs. SMP HALs/kernels: there are two types of hardware abstraction layers (HALs) and kernels, UP and SMP. UP historically stood for "uniprocessor" but should now be read as "single-core"; SMP historically stood for "symmetric multi-processor" but should now be read as "multi-core". Although some recent operating systems (including Windows Vista, Windows Server 2008, and Windows 7) use the same HAL or kernel for both UP and SMP installations, many operating systems can be configured to use either a UP or an SMP HAL/kernel. To obtain the best performance on a single-vCPU virtual machine running an operating system that offers both, configure the operating system with a UP HAL or kernel. The UP operating system versions are for single-core machines; if used on a multi-core machine, a UP operating system version will recognize and use only one of the cores. The SMP versions, while required in order to fully utilize multi-core machines, can also be used on single-core machines. Due to their extra synchronization code, however, SMP operating system versions used on single-core machines are slightly slower than UP operating system versions used on the same machines.
Hyper-Threading (HT) does not provide the full power of a physical core, but it provides anywhere from a slight to a significant increase in system performance by keeping the processor pipeline busier.
Low-latency, high-performance workloads requiring stability, predictability, and less jitter: set the BIOS Power Management setting to Static High, disable Turbo Boost, and disable processor C-states / the C1E halt state.
Low-latency, high-performance workloads in general: set the ESXi Power Management Policy to "High Performance", set the BIOS power setting to OS controlled, enable Turbo Boost, and disable processor C-states / the C1E halt state.
Run the latest version of the BIOS and enable all cores and sockets. Remember, Turbo Boost can over-clock portions of the CPU.
Leaving C-states enabled can increase memory latency and is therefore not recommended for low-latency workloads. Even the enhanced C-state known as C1E introduces longer latencies to wake the CPUs from halt (idle) states to full power, so disabling C1E in the BIOS can further lower latencies. Intel Turbo Boost, on the other hand, will step up the internal frequency of the processor should the workload demand more power, and should be left enabled for low-latency, high-performance workloads. However, since Turbo Boost can over-clock portions of the CPU, it should be left disabled if the applications require stable, predictable performance and low latency with minimal jitter.
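A quick way to confirm that the hardware-assisted virtualization features named above are present and enabled is to check the CPU flags from a Linux boot of the physical server; this is only a sketch, and flag names can vary slightly by kernel:

# Intel VT-x shows up as "vmx", AMD-V as "svm"
grep -m1 -oE 'vmx|svm' /proc/cpuinfo
# Intel EPT shows up as "ept", AMD RVI/NPT as "npt"
grep -m1 -oE 'ept|npt' /proc/cpuinfo
# No output usually means the feature is absent or disabled in the BIOS.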
26
Best Practices – Power Savings vs Performance
Data center managers are looking to improve power utilization rates (Energy or Performance Bias). Servers come with power-management schemes to throttle down CPUs/cores.
P-states/C-states put cores to sleep, essentially a "core nap" of a millisecond or so, and it can take time for a CPU to leave these states and return to a running condition.
Sleep states are a great option for web, app, and mid-tier servers, but not for high-CPU-consuming DB servers: power-management schemes cause inconsistent database performance.
Systems requiring deterministic performance/low latency should set a Balanced profile at the BIOS, disabling all C-states/P-states and any power throttling, and use the High Performance profile for the VM's host.
Leaving C-states enabled can increase memory latency and is therefore not recommended for low-latency workloads. Even the enhanced C-state known as C1E introduces longer latencies to wake the CPUs from halt (idle) states to full power, so disabling C1E in the BIOS can further lower latencies.
Power Management BIOS settings: VMware ESXi includes a full range of host power-management capabilities that can save power when a host is not fully utilized. We recommend that you configure your BIOS settings to allow ESXi the most flexibility in using (or not using) the power-management features offered by your hardware, then make your power-management choices within ESXi. To allow ESXi to control CPU power-saving features, set power management in the BIOS to "OS Controlled Mode" or equivalent. Even if you don't intend to use these power-saving features, ESXi provides a convenient way to manage them.
Availability of the C1E halt state typically provides a reduction in power consumption with little or no impact on performance. When Turbo Boost is enabled, the availability of C1E can sometimes even increase the performance of certain single-threaded workloads. We therefore recommend that you enable C1E in BIOS. However, for a very few workloads that are highly sensitive to I/O latency, especially those with low CPU utilization, C1E can reduce performance; in these cases, you might obtain better performance by disabling C1E in BIOS, if that option is available.
C-states deeper than C1/C1E (i.e., C3, C6) allow further power savings, though with an increased chance of performance impact. We recommend, however, that you enable all C-states in BIOS, then use ESXi host power management to control their use.
Turbo Boost can dynamically scale up the clock speed of a processor depending on the thermal headroom available. Intel Turbo Boost will step up the internal frequency of the processor should the workload demand more power, and should be left enabled for low-latency, high-performance workloads. However, since Turbo Boost can over-clock portions of the CPU, it should be left disabled if the applications require stable, predictable performance and low latency with minimal jitter.
Enable all cores and sockets, and run the latest version of the BIOS.
Intel SpeedStep Technology is designed to save energy by adjusting the CPU clock frequency up or down depending on how busy the system is. Intel Turbo Boost Technology provides the capability for the CPU to adjust itself to run higher than its stated clock speed if it has enough power to do so. One new feature in the Intel Xeon processor E v3 CPUs is the capability for each core to run at a different speed, using Intel SpeedStep. Intel Turbo Boost depends on Intel SpeedStep: if you want to enable Intel Turbo Boost, you must enable Intel SpeedStep first; if you disable Intel SpeedStep, you lose the ability to use Intel Turbo Boost. Intel Turbo Boost is especially useful for latency-sensitive applications.
C3 and C6 are power-saving halt and sleep states that a CPU can enter when it is not busy. Unfortunately, it can take some time for the CPU to leave these states and return to a running condition. If you are concerned about performance (for all but latency-sensitive single-threaded applications), and if you have the option, disable anything related to C-states.
Energy or Performance Bias: you can use the power-saving mode to reduce system power consumption when turbo mode is enabled. The mode can be set to Maximum Performance, Balanced Performance, Balanced Power, or Power Saver. Testing has shown that most applications run best with the Balanced Performance setting.
27
Hardware Best Practices
General recommendations (test, dev, or non-critical database systems): set BIOS power management to OS-controlled mode and manage power from ESXi; enable Turbo Boost; enabling processor C-states and the C1E halt state is acceptable.
Low-latency, high-performance workloads: BIOS power management in OS-controlled mode with the ESXi Power Management Policy set to "High Performance"; enable Turbo Boost; disable processor C-states / the C1E halt state.
Low-latency, high-performance workloads requiring stability, predictable performance, and low latency with less jitter: BIOS power management set to Static High (no OS-controlled mode); disable Turbo Boost; disable processor C-states / the C1E halt state.
Run the latest version of the BIOS and enable all cores and sockets. Remember, Turbo Boost can over-clock portions of the CPU.
References: "Performance Best Practices for VMware vSphere" and "Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs".
28
Understanding NUMA
NUMA (non-uniform memory access) defines how each logical CPU can access each part of memory. In a 2-socket system, each CPU (socket) has its own memory, which it can access directly, but it must also be able to access memory attached to the other socket, and this of course takes more CPU cycles than accessing local memory. NUMA nodes specify which part of system memory is local to which CPU, and you can have more layers of topology.
If node interleaving is disabled, ESXi detects the system as NUMA and applies NUMA optimizations. If node interleaving is enabled, ESXi does not detect the system as NUMA.
You can configure NUMA on your system to give the best possible performance for your workload. For example, you can allow all CPUs to access all memory, or only local memory, which changes how the Linux scheduler distributes processes among the available logical CPUs. NUMA-aware scheduling will assign a virtual machine's vCPUs to a NUMA node as a NUMA client.
If you have many processes that do not require much memory, using only local memory can be a benefit, but if you have large processes (an Oracle database with its shared memory), spreading memory among all CPUs might be better. You can use commands such as numastat or numactl --hardware to check the NUMA status on your system.
Intel Nehalem and AMD Opteron architectures are NUMA capable. A NUMA node is essentially a CPU socket (package). The Intel 2630 v4 is a Broadwell-microarchitecture part and contains 4 memory channels.
29
Basics of NUMA
In a non-NUMA [SMP] system, all memory and cores are in a uniform pool and all CPU cores can access any memory area. This was fine until CPU speeds became extremely fast and the interconnect became the bottleneck.
When a thread is scheduled and runs but the required memory area is in the other node, a memory lookup is required and QPI access is necessary. This lookup and remote memory access is the NUMA penalty.
To mitigate this penalty, the goal is to keep your VM's memory and CPU cores as local as possible, i.e. bring memory nearer to the processor cores; that is what NUMA does. The NUMA scheduler wants to schedule a thread on the CPU where the thread's memory is allocated.
30
Understanding NUMA and Topology
Enabling NUMA (node interleaving disabled): the ACPI BIOS builds a System Resource Allocation Table (SRAT). The SRAT exposes the physical CPU and memory configuration, specifically which CPU and memory ranges belong to a single NUMA node, and maps the memory of each node into a single sequential block of the memory address space. Use numactl --hardware and numactl --show to inspect this from Linux (see the sketch below).
The hypervisor uses the SRAT to understand which memory bank is local to a physical CPU and attempts to allocate local memory to each vCPU of the virtual machine.
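The numactl commands referenced above can be run on any Linux host or guest to see the topology the OS was given; the comments describe typical output rather than a specific system:

numactl --hardware     # e.g. "available: 2 nodes (0-1)", plus per-node cpus and memory sizes
numactl --show         # NUMA policy and the cpus/nodes the current shell may use
numastat               # per-node numa_hit / numa_miss / numa_foreign counters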
31
Understanding NUMA Each NUMA node contains a CPU/memory
A pCPU uses its onboard memory controller to access its own "local" memory and connects to the remaining "remote" memory via an interconnect. This disparate memory access (remote vs. local) results in "non-uniform" memory access times.
The premise for the best, lowest-latency performance: keep memory access local rather than remote. Remote memory access, memory migration, and cache snooping/validation traffic all put load on the QuickPath Interconnect (QPI) architecture, so when designing and configuring a system it is imperative that attention be given to the QPI configuration.
A host consists of a CPU package (the physical CPU with the pins) inserted in a socket (pSocket); together with its local memory, they form a NUMA node. Within the CPU package, cores exist; in this example the CPU package contains four cores, each core has hyper-threading (HT) enabled, and all cores (and thus HT threads) share the same cache architecture. At the ESXi layer, the PCPU exists: a PCPU is an abstraction inside the ESXi kernel that can consume a full core or leverage HT. At the VM layer, a virtual socket and a vCPU exist. A virtual socket can map to a single PCPU or span multiple PCPUs, depending on the number of vCPUs and the cores-per-socket setting in the UI (cpuid.coresPerSocket). The vCPU is the logical representation of the PCPU inside the virtual machine. The vCPU and cores-per-socket configuration impacts the ability of applications (and operating systems) to optimize for cache usage.
32
Basics of NUMA - Example of a NUMA architecture
Diagram: physical server with 4 sockets, 4 cores each. A wide VM with 8 vCPUs is split into 2 NUMA clients; each NUMA client is a virtual NUMA node backed by a physical NUMA node. Use the numactl (Linux) / coreinfo (Windows) command to see vNUMA in the guest operating system (GOS).
33
Basics of NUMA
vSphere NUMA management determines where best to place a virtual machine's memory given how busy each NUMA node is. If a VM is on a busy NUMA node, the ESXi kernel will automatically migrate the virtual machine to a less busy NUMA node to get better performance. There are two main decision points for migrating a VM from one NUMA node to another: "balance migrations", where ESXi migrates to reduce CPU contention within a NUMA node, and "locality migrations", which occur when ESXi detects that most of the VM's memory is in a remote node; in that case it is generally better to move/schedule the virtual machine to run from the NUMA node where most of its memory is, as long as doing so does not cause CPU contention in that node.
ESXi keeps a constant eye on your virtual machines and their NUMA placement and performs NUMA migrations, for either balance or locality reasons, to optimize overall system performance. ESXi generally does a great job at this, but there can be hiccups. If you notice performance problems caused by either unbalanced NUMA nodes (high ready time) or poor virtual machine NUMA locality (high memory latency), there are a few things you can try.
First and most importantly, try to size your VMs as a multiple of your physical server's NUMA node size. For instance, if your physical server has 6 cores per NUMA node, size your VMs as either 2-, 3-, or 6-way. The related KB article explains the potential impact of mismatched VM sizes versus NUMA node sizes in more detail.
You might also tweak the advanced NUMA settings for the VM. For instance, you can manually set the NUMA node affinity for the VM on the resources tab of the virtual machine Edit Settings screen in vCenter. Administrators should be cautious about manually setting VM-NUMA affinity, as it may become difficult to manually balance NUMA resources as the environment grows.
Another setting that might be beneficial is numa.vcpu.maxPerMachineNode. By default ESXi tries to place a single VM in as few NUMA nodes as possible; this generally provides the best performance because it gives the best memory locality and reduces memory latency. But some applications are memory-bandwidth sensitive rather than memory-latency sensitive, and may benefit from the increased memory bandwidth that comes from using more NUMA nodes and thus more paths to memory. For these applications you may want to modify the VM advanced parameter numa.vcpu.maxPerMachineNode; setting it to a lower value will split the VM's vCPUs across more NUMA nodes.
New in vSphere 5.0 is the vNUMA feature, which presents the physical NUMA topology to the guest operating system. vNUMA is enabled by default on VMs larger than 8-way, but if you have VMs that are not larger than 8-way yet are still larger than your physical server's NUMA node size, you might want to enable vNUMA on those VMs by modifying the numa.vcpu.min setting. See "Advanced Virtual NUMA Attributes" in the vSphere 5.0 Resource Management guide for more details.
Lastly, if you notice a significant and unexplained performance problem, it might be best to call in the experts and notify VMware Support; they can take a detailed look at the NUMA client stats for your VM.
The NUMA client stats can show detailed VM NUMA counters like the number of Balance Migrations and Locality Migrations a VM has had, if these values are roughly the same then it might indicate NUMA migration thrashing.
34
ESXTOP and NUMA Metrics
NHN: current home node for the virtual machine.
NMIG: number of NUMA migrations between two snapshots; includes balance migrations and inter-node VM swaps performed for locality balancing and load balancing.
NRMEM (MB): current amount of remote memory being accessed by the VM.
NLMEM (MB): current amount of local memory being accessed by the VM.
N%L: current percentage of memory being accessed by the VM that is local.
GST_NDx (MB): the guest memory allocated for the VM on NUMA node x ("x" is the node number).
OVD_NDx (MB): the VMM overhead memory allocated for the VM on NUMA node x.
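One way to watch these counters, assuming standard esxtop key bindings, is the memory screen of esxtop on the ESXi host; a rough sketch:

esxtop                            # press 'm' for the memory view, then 'f' to toggle the
                                  # NUMA STATS field group (NHN, NMIG, NRMEM, NLMEM, N%L)
esxtop -b -d 5 -n 12 > numa.csv   # batch mode: 12 samples, 5 seconds apart, for offline review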
35
Design Considerations for Sizing and Best Practices ESXi and NUMA
36
Basics of vNUMA - Example of a NUMA architecture
Diagram: physical server with 4 sockets, 4 cores each. A wide VM with 8 vCPUs is split into 2 NUMA clients; each NUMA client is a virtual NUMA node backed by a physical NUMA node. Use the numactl (Linux) / coreinfo (Windows) command to see vNUMA in the guest operating system (GOS).
37
ESXi NUMA & vNUMA Concepts
ESXi exposes the vNUMA topology to the GOS to facilitate GOS- and application-level NUMA optimizations. The underlying physical NUMA architecture is exposed to the GOS; vNUMA is enabled by default for VMs with more than 8 vCPUs, and you can adjust numa.vcpu.min to enable vNUMA for smaller VMs (see the sketch below).
ESXi NUMA scheduling is enabled when a server has at least 4 cores and at least 2 cores per physical NUMA node.
Example: a VM with 8 vCPUs – Oracle sees 8 CPUs (CPU_COUNT = 8). Hot-add 4 more vCPUs and Oracle will report 12 CPUs. Caution: enabling Hot-Add CPU disables vNUMA at the VM level! (see VMware KB)
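The vNUMA-related knobs above live in the VM's advanced configuration; a minimal sketch of the .vmx entries (parameter names per VMware's documented advanced options, values illustrative):

numa.vcpu.min = "4"      # expose vNUMA to VMs with 4 or more vCPUs (default threshold is 9)
vcpu.hotadd = "FALSE"    # leave CPU hot-add off, since enabling it disables vNUMA for the VM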
38
Wide Virtual Machine & vNUMA Best Practices
# vCPUs <= # cores in each physical NUMA node: memory access stays local, giving the lowest memory access latencies and better performance.
# vCPUs > # cores in each physical NUMA node (a "wide" VM, where vCPUs exceed the cores per socket): may experience higher average memory access latencies because of remote memory access.
You can affect the virtual NUMA topology with two settings in the vSphere Client: the number of virtual sockets and the number of cores per socket for a virtual machine. If the number of cores per socket is less than or equal to one, virtual NUMA nodes are created to match the topology of the first physical host where the virtual machine is powered on. If the number of cores per socket (cpuid.coresPerSocket) is greater than one and the number of virtual cores in the virtual machine is greater than 8, the virtual NUMA node size matches the virtual socket size.
Pinning virtual CPUs to fixed processors: Edit Settings > Advanced CPU > Scheduling Affinity panel, set the CPU affinity to the preferred processors.
Memory allocation from specific NUMA nodes using memory affinity (set after CPU affinity): Edit Settings > NUMA Memory Affinity panel. To associate virtual machines with specified NUMA nodes, use e.g. numa.nodeAffinity = 0,1.
NUMA parameters: numa.vcpu.maxPerMachineNode is the maximum number of vCPUs belonging to the same VM that can be scheduled on a NUMA node at the same time; numa.vcpu.maxPerClient defaults to numa.vcpu.maxPerMachineNode.
From "Performance Best Practices for VMware vSphere 5.5" and "Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs".
39
NUMA - Memory Bandwidth v/s Memory Latency Sensitive Apps
Some applications may be Memory Bandwidth driven instead of Memory Latency Sensitive. Memory Bandwidth Sensitive Applications May benefit from the increased memory bandwidth that comes from using more NUMA nodes and thus more paths to memory You may want to modify the VM advanced parameter numa.vcpu.maxPerMachineNode Setting that parameter to a lower value will split the VM’s vCPUs up across more NUMA nodes Avoid creating a single VM with more vCPUs than physical cores. Having multiple VMs that, when combined, exceed the number of physical cores is typically OK – this is called overcommitment.
40
CPU & Memory Best Practices
Size the VM for only the minimal/appropriate number of CPUs and memory; if the exact workload is not known, start with fewer vCPUs and increase as needed. Use a small/medium/large model for vCPU and memory settings for easier provisioning. Capacity Planner can analyze the current environment and provide resource utilization metrics for sizing.
Maintain a 1:1 ratio of physical cores to vCPUs for Tier 1 production databases. For non-prod/light workloads, reasonable overcommitment can improve consolidation ratios, increase aggregate throughput, and maximize license savings, but the general recommendation is to avoid overcommitment of processor resources.
41
Summary - CPU & Memory Best Practices
Do not over-commit CPU for RAC databases Do not over-commit memory for production databases If you choose to overcommit memory, ensure there’s sufficient swap space on your ESXi system. Set the memory reservation equal to the size of the Oracle SGA Ensure VM is sized to NUMA boundaries NUMA should be enabled in server hardware BIOS and at the guest OS Disable unnecessary devices; eg, on-board audio, modem, serial ports, etc. At the time a virtual machine is first powered on, ESXi creates a swap file for that virtual machine equal in size to the difference between the virtual machine's configured memory size and its memory reservation DirectPath I/O - leverages Intel VT-d and AMD-Vi hardware support to allow guest operating systems to directly access hardware devices. On hyper-threaded systems, virtual machines with a number of vCPUs greater than the number of cores in a NUMA node but lower than the number of logical processors in each physical NUMA node might benefit from using logical processors with local memory instead of full cores with remote memory. This behavior can be configured for a specific virtual machine with the numa.vcpu.preferHT flag.
42
vNUMA Best Practices
Enable Hyper-Threading: the benefit is around 20-25% (less for CPU-intensive batch jobs, based on OLTP workload tests). Use numa.vcpu.preferHT when a VM needs more vCPUs than cores to get the HT benefit (see the sketch below). Recommendation: size for cores, not hyper-threads.
Align VMs with physical NUMA boundaries; use ESXTOP to monitor NUMA performance at the vSphere level.
Avoid CPU affinity combined with Hyper-Threading (performance degrades): pinning vCPUs from different (or a single) SMP VM to both threads of one core gives poor performance because the threads share processor resources, and it also affects the ability of the NUMA scheduler to rebalance VMs across NUMA nodes for fairness.
NUMA VM categories:
VMs with # vCPUs <= # cores in each physical NUMA node: memory accesses are local to the NUMA node, resulting in the lowest memory access latencies and better performance.
VMs with # vCPUs > # cores in each physical NUMA node (wide VMs): may experience higher average memory access latencies because they may need to access memory outside their own NUMA node (the potential increase in average memory access latency can be mitigated by appropriately configuring vNUMA).
Some memory-bandwidth-bound workloads can benefit from the increased aggregate memory bandwidth when a non-wide VM is split across multiple NUMA nodes using the numa.vcpu.maxPerMachineNode flag.
VMs with # vCPUs greater than the # of cores per NUMA node but lower than the # of logical processors in each physical NUMA node might benefit from using logical processors with local memory instead of full cores with remote memory; use the numa.vcpu.preferHT flag.
Wide VM: because vCPUs in these wide virtual machines might sometimes need to access memory outside their own NUMA node, they might experience higher average memory access latencies than virtual machines that fit entirely within a NUMA node.
Beware of negative NUMA impact after vMotion if the ESXi hosts have different NUMA topologies.
ESXi NUMA scheduling and optimizations are enabled for systems with a total of at least 4 cores and at least 2 cores per NUMA node (default).
Virtual NUMA at the GOS level (vNUMA) is enabled by default only for VMs with more than 8 vCPUs (the numa.vcpu.min flag can be used to lower this bar for smaller VMs).
Exposing the vNUMA topology yields performance gains for memory-intensive applications.
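For the hyper-threading case above, the per-VM flag can be sketched as a .vmx entry (illustrative values):

numa.vcpu.preferHT = "TRUE"   # count hyper-threads as scheduling capacity so the VM
                              # stays inside one NUMA node instead of spanning nodes
# The host-wide equivalent is the advanced VMkernel setting Numa.PreferHT = 1.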
43
Recommendations : vNUMA and Cores Per Socket
Two settings determine the virtual NUMA topology: the number of virtual sockets and the number of cores per socket (cpuid.coresPerSocket).
If cpuid.coresPerSocket <= 1, the vNUMA topology matches the pNUMA topology of the server where the VM is powered on first.
If cpuid.coresPerSocket > 1 and the VM has more than 8 vCPUs, the vNUMA node size matches the virtual socket size (see VMware KB and the sketch below).
Some O/S SKUs are hard-limited to run on a fixed number of CPUs (sockets), e.g. Windows Server 2003 SE is limited to 4 CPUs; this is an advantage of multi-core virtual sockets.
Pinning virtual CPUs to fixed processors: Edit Settings > Advanced CPU > Scheduling Affinity panel, set the CPU affinity to the preferred processors.
Memory allocation from specific NUMA nodes using memory affinity (set after CPU affinity): Edit Settings > NUMA Memory Affinity panel. To associate virtual machines with specified NUMA nodes, use e.g. numa.nodeAffinity = 0,1.
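As a sketch of the cores-per-socket guidance, a 20-vCPU VM on a host with 10-core NUMA nodes could be presented as 2 virtual sockets of 10 cores each (illustrative .vmx values):

numvcpus = "20"
cpuid.coresPerSocket = "10"   # coresPerSocket > 1 and > 8 vCPUs: vNUMA node size follows the virtual socket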
44
Hot-Add CPU - Oracle CPU Count increase on the fly..
A VM with 8 vCPUs: Oracle sees 8 CPUs (CPU_COUNT = 8). Hot-add 4 more vCPUs and Oracle will report 12 CPUs (see the check below).
Caution: enabling Hot-Add CPU disables vNUMA at the VM level! (see VMware KB)
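A simple way to confirm what the database sees before and after a hot-add (a sketch, run inside the guest as a privileged OS user):

sqlplus -s / as sysdba <<'EOF'
show parameter cpu_count
select value from v$osstat where stat_name = 'NUM_CPUS';
EOF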
45
Basics of vNUMA - Example of a NUMA architecture
Diagram: physical server with 4 sockets, 4 cores each. A wide VM with 8 vCPUs is split into 2 NUMA clients; each NUMA client is a virtual NUMA node backed by a physical NUMA node. Use the numactl (Linux) / coreinfo (Windows) command to see vNUMA in the guest operating system (GOS).
Note: Oracle's recommendation is to leave the database's own NUMA support turned off (it is disabled by default in 11g).
46
Virtual Machine NUMA Design Considerations
What is the size of the building-block VM for this new data platform, assuming you choose a NUMA-optimized configuration?
Host = 4 sockets, 40 cores (10 cores per socket), 1024 GB RAM. The most optimal configuration on this host would be a 4-VM setup: in the formula below, nVMs = 4 (as indicated earlier by the vendor), NumberOfSockets = 4, and total RAM on the host = 1024 GB.
In vSphere 6 the overhead is higher; approximating overhead at 15%, (1024 * 0.85) / 4 gives roughly 217 GB per VM as a first cut.
NUMA-local memory = [1024 - ((1024 * 4 * 0.01) + 1 GB)] / 4, roughly 245 GB; this amount of memory is the absolute ceiling that guarantees NUMA memory locality.
NUMA-local memory with a safety buffer = 0.95 * 245 GB, roughly 233 GB.
Result: 4 VMs per host, each VM with 10 vCPUs and about 230 GB RAM (see the arithmetic sketch below).
RAM overhead: 2 GB (ESXi 5.1), 4 GB (ESXi 5.5), approximately 15% overhead with vSphere 6.0.
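The same arithmetic as a small shell sketch, using the example's assumptions (4 NUMA nodes, 1024 GB host RAM, 1% per-node overhead plus 1 GB, 5% safety buffer):

hostRAM=1024; nodes=4
ceiling=$(echo "($hostRAM - ($hostRAM * $nodes * 0.01 + 1)) / $nodes" | bc -l)   # ~245 GB
buffered=$(echo "$ceiling * 0.95" | bc -l)                                       # ~233 GB
printf "Per-VM memory: ceiling %.0f GB, with safety buffer %.0f GB\n" "$ceiling" "$buffered"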
47
Latency Sensitive Setting
Recommendation: change the default (Normal) Latency Sensitivity setting to High for latency-sensitive VMs, with 1:1 vCPU-to-pCPU sizing, no hyper-threading, a full VM memory reservation to avoid swapping/ballooning, interrupt coalescing and LRO disabled for VMXNET3 vNICs, monitor_control.halt_desched=FALSE (ensures the VMkernel does not deschedule the VM when a vCPU is idle), and the optimized interrupt delivery path for VM DirectPath I/O and SR-IOV passthrough devices (see the sketch below).
Latency-sensitive apps: examples could be VoIP, media-player apps, and apps that require frequent access to mouse or keyboard devices.
New in vSphere 5.5 is a VM option called Latency Sensitivity, which defaults to Normal. Setting this to High can yield significantly lower latencies and jitter as a result of the following mechanisms that take effect in ESXi: halting in the VM Monitor when the vCPU is idle, leading to faster vCPU wake-up from halt and bypassing the VMkernel scheduler for yielding the pCPU (this also conserves power, since halting makes the pCPU enter a low-power mode compared to spinning in the VM Monitor with the monitor_control.halt_desched=FALSE option); full memory reservation, which eliminates ballooning or hypervisor swapping, leading to more predictable performance with no latency overhead from those mechanisms; exclusive access to physical resources, including pCPUs dedicated to vCPUs with no contending threads executing on those pCPUs; an optimized interrupt delivery path for VM DirectPath I/O and SR-IOV passthrough devices, using heuristics to derive hints from the guest OS about optimal placement of physical interrupt vectors on physical CPUs; and automatically disabling interrupt coalescing and LRO for VMXNET3 virtual NICs.
If you want to ensure that the VMkernel does not deschedule your VM when the vCPU is idle (most systems have brief periods of idle time unless the application runs a tight CPU loop without yielding), go to VM Settings > Options > Advanced > General > Configuration Parameters and add monitor_control.halt_desched with the value false. Consider this option carefully: it effectively forces the vCPU to consume all of its allocated pCPU time, so that when the vCPU idles, the VM Monitor spins on the CPU without yielding to the VMkernel scheduler until the vCPU needs to run again. However, for extremely latency-sensitive VMs that cannot tolerate the latency of being descheduled and rescheduled, this option has been seen to help.
DirectPath I/O vs. SR-IOV: SR-IOV is a specification that allows a single PCI Express (PCIe) physical device under a single root port to appear as multiple separate physical devices to the hypervisor or the guest operating system. SR-IOV uses physical functions (PFs) and virtual functions (VFs) to manage global functions for the SR-IOV devices. PFs are full PCIe functions that include the SR-IOV Extended Capability used to configure and manage the SR-IOV functionality; PCIe devices can be configured or controlled through PFs, and the PF has the full ability to move data in and out of the device. VFs are lightweight PCIe functions that contain all the resources necessary for data movement but have a carefully minimized set of configuration resources. SR-IOV-enabled PCIe devices present multiple instances of themselves to the guest OS instance and hypervisor. The number of virtual functions presented depends on the device. For SR-IOV-enabled PCIe devices to function, you must have the appropriate BIOS and hardware support, as well as SR-IOV support in the guest driver or hypervisor instance.
SR-IOV offers performance benefits and tradeoffs similar to those of DirectPath I/O; the two have similar functionality but are used to accomplish different things. SR-IOV is beneficial in workloads with very high packet rates or very low latency requirements. Like DirectPath I/O, SR-IOV is not compatible with certain core virtualization features such as vMotion. SR-IOV does, however, allow a single physical device to be shared among multiple guests. This lets you virtualize low-latency (less than 50 microseconds) and high-PPS (greater than 50,000 packets per second, such as network appliances or purpose-built solutions) workloads on a VM. With DirectPath I/O you can map only one physical function to one virtual machine; SR-IOV lets you share a single physical device, allowing multiple virtual machines to connect directly to the physical function.
The following features are not available for virtual machines configured with SR-IOV: vMotion, Storage vMotion, vShield, High Availability, Virtual Wire, NetFlow, DRS, Fault Tolerance, snapshots, suspend and resume, DPM, MAC-based VLAN for passthrough virtual functions, participation in a cluster environment, and hot addition and removal of virtual devices, memory, and vCPU.
Related advanced options: vsish -e set /config/Mem/intOpts/SamplePeriod 0 (disable memory sampling); sched.mem.prealloc = "TRUE" (memory preallocation).
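A sketch of the per-VM settings discussed above, expressed as .vmx entries (the Latency Sensitivity option itself is normally set in the vSphere Web Client; parameter names are assumptions based on VMware's published guidance):

sched.cpu.latencySensitivity = "high"
monitor_control.halt_desched = "FALSE"   # keep the idle vCPU from being descheduled
sched.mem.prealloc = "TRUE"              # pre-allocate VM memory
# A full memory reservation ("Reserve all guest memory") must also be set for the VM.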
48
Design Considerations for Sizing and Best Practices ESXi and Guest OS - Memory
49
Memory Reclamation Techniques in ESXi
Transparent Page sharing VM’s with identical memory content maintain just a single copy while reclaiming the redundant ones Ballooning GOS aware of the host's low memory status (balloon driver part of VMware Tools) hypervisor sets balloon size based on the amount of physical pages it needs to reclaim balloon driver then pins the pages in the guest free list hypervisor reclaims the host physical pages which are backing them Memory compression memory pages are compressed and stored in a cache on the main memory itself can be accessed again just by a decompression rather than through disk I/O Hypervisor swapping last resort to reclaim memory creates a separate swap file for each VM when it is powered on swaps out guest physical memory thus freeing up host memory Transparent Page Sharing Optimizes use of memory on the host by “sharing” memory pages that are identical between VMs More effective with similar VMs (OS, Application, configuration) Very low overhead Ballooning Allows the ESXi host to “borrow” memory from one VM to satisfy requests from other VMs on that host exerts artificial memory pressure to the VM via the “balloon driver” and returns to the pool usable by other VMs host’s last option before being forced to swap only effective if VMs have “idle” memory DON’T TURN THESE OFF
50
Swapping is Bad! Swapping on the host happens when
The host is trying to service more memory than it physically has AND TPS and ballooning are insufficient to provide relief. Host swapping slows down the I/O performance of disks for other VMs.
Swapping occurs in two places: the guest VM and the ESXi host.
Two ways to keep swapping from affecting your workload (see the check below):
At the VM: set the memory reservation equal to the allocated memory (avoids ballooning/swapping); use the "active memory" counter with caution.
At the host: do not overcommit memory until vCenter reports that steady-state usage is less than the amount of RAM on the server.
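Inside a Linux guest with VMware Tools / open-vm-tools installed, the balloon and hypervisor-swap counters can be checked directly (subcommand names as implemented in open-vm-tools):

vmware-toolbox-cmd stat balloon    # MB currently claimed by the balloon driver (want 0)
vmware-toolbox-cmd stat swap       # MB of this VM swapped by the hypervisor (want 0)
# Any sustained non-zero value on a database VM points to a sizing or reservation problem.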
51
Memory Reservations
Guarantees memory for a VM, even when there is contention. The VM is only allowed to power on if the CPU and memory reservation is available (strict admission control), so the amount of memory can be guaranteed even under heavy loads. In many cases, the configured size and the reservation size could be the same; if allocated RAM = reserved RAM, you avoid swapping.
Memory Reservation = [ SGA + PGA + background processes (BG) + active OS memory ]
Note that a reservation does impact vMotion and other mobility features.
52
Oracle Memory Best Practices
Memory Reservation = [ SGA + PGA + background processes (BG) + active OS memory ]; if the VM is right-sized, the memory reservation should equal the VM's allocated memory (see the worked example below).
Do not disable the balloon driver, Transparent Page Sharing (TPS), or memory compression.
Use ESXi large pages (2 MB): they improve performance by significantly reducing TLB misses (for applications with large active memory working sets), ESXi does not share large pages unless there is memory pressure (see VMware KBs), and they slightly reduce the per-virtual-machine memory space overhead.
For systems with hardware-assisted virtualization, we recommend using guest-level large memory pages as well. ESXi will use large pages to back the GOS memory even if the GOS does not use large memory pages, but the full benefit of huge pages comes when the GOS uses them as well as ESXi.
Lose AMM: no more MEMORY_TARGET and MEMORY_MAX_TARGET; lose PGA_AGGREGATE_TARGET.
Speaker note: talk about the customer who had excessive swapping with the 4 KB kernel page size; HugePages do not swap.
Example memory layout (diagram): PGA (client sessions and context) 2 GB; Oracle SGA (DB buffer cache and others) 10 GB; Oracle background processes (PMON, SMON, DBWR, LGWR, CKPT, others) 1 GB; operating system; the memory reservation versus the VM's configured memory.
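A worked example of the reservation formula, using the illustrative figures from this slide (the OS figure is an assumption for the example):

sga=10; pga=2; bg=1; os=2                                  # GB
echo "Memory reservation = $((sga + pga + bg + os)) GB"    # => 15 GB
# Set the VM memory reservation to at least this value (ideally equal to the configured VM memory).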
53
Summary - Best Practices for Memory design
Do not over-commit memory for production databases. If you choose to overcommit memory, ensure there is sufficient swap space on your ESXi system.
For the GOS: use Linux HugePages, which significantly reduce TLB misses and reduce VM exits. Set vm.swappiness = 0 (or 1) to avoid memory swapping of GOS pages.
In cases where host memory is overcommitted, ESXi may have to swap out pages. Since ESXi will not swap out large pages, during host swapping a large page is broken into small pages; ESXi tries to share those small pages using the pre-generated hashes before they are swapped out. The motivation is that the overhead of breaking a shared page is much smaller than the overhead of swapping a page back in if it is accessed again in the future.
54
Linux Huge Pages Linux kernel 2.6 Feature
OS support for memory pages larger than the default (4 KB). The huge page size ranges from 2 MB to 256 MB depending on kernel version and hardware architecture: x86: 4K and 2M; ia64: 4K, 8K, 64K, 256K, 1M, 4M, 16M, 256M; ppc64: 4K and 16M.
Improves performance by increasing the Translation Lookaside Buffer (TLB) hit ratio.
Pages are locked in memory and never swapped out, which guarantees that shared memory (e.g. the SGA) remains in RAM.
Pages are pre-allocated and contiguous, used only for System V shared memory (e.g. the SGA), and mean less bookkeeping work for the OS. (See the configuration sketch below.)
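A minimal HugePages sizing sketch for a 10 GB SGA with 2 MB pages (values are illustrative; add headroom for multiple instances):

# 10 GB / 2 MB = 5120 pages; configure a little extra
echo "vm.nr_hugepages = 5200" >> /etc/sysctl.conf && sysctl -p
grep Huge /proc/meminfo                  # verify HugePages_Total / HugePages_Free
# /etc/security/limits.conf: let the oracle user lock the SGA (values in KB)
#   oracle soft memlock 10700000
#   oracle hard memlock 10700000
# Optionally force the instance to use them: ALTER SYSTEM SET use_large_pages=ONLY SCOPE=SPFILE;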
55
Best Practices for Memory design
For systems with hardware-assisted virtualization, we recommend using GOS huge memory pages. ESXi backs GOS memory with large pages even if the GOS does not use huge pages, but the full benefit of huge pages comes when the GOS uses them as well as ESXi.
Set vm.swappiness = 0 (or 1) to avoid memory swapping of GOS pages.
Disable Linux Transparent Huge Pages (THP), a RHEL 6 feature: THP causes performance issues with Oracle databases (including node reboots in RAC). In /etc/default/grub, add transparent_hugepage=never to GRUB_CMDLINE_LINUX (see the sketch below).
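A sketch of the THP and swappiness changes on RHEL/OL 7, following the grub approach named above:

# /etc/default/grub: append transparent_hugepage=never to GRUB_CMDLINE_LINUX, then rebuild and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg && reboot
cat /sys/kernel/mm/transparent_hugepage/enabled    # expect: always madvise [never]
echo "vm.swappiness = 1" >> /etc/sysctl.conf && sysctl -p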
56
Design Considerations for Sizing and Best Practices ESXi and Guest OS - Storage
57
ESXi - Storage Use Paravirtualized SCSI adapters (PVSCSI)
Enable UUID (Universally Unique Identifier) for the ASM disks: set the disk.EnableUUID parameter to TRUE, which allows scsi_id to retrieve the unique SCSI identifier.
For I/O distribution (multipathing policy) across vHBAs use Round-Robin. This is especially important when using RDM against all-flash arrays; confirm your storage array supports RR.
ESXi also supports third-party path selection plugins (PSPs), e.g., Windows MPIO.
Cover ASM instance architecture; cover ASM communication via ASMB. The database communicates with the ASM instance using the ASMB (umbilicus) process. Once the database obtains the necessary extents from the extent map, all database I/O from that point on is performed directly by the database processes, bypassing ASM. Thus ASM is not really in the I/O path - so how do we make ASM go faster? You don't have to.
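A hedged sketch of what this looks like in practice; the device name /dev/sdc, the scsi_id path, and the SCSI ID in the udev rule are placeholders (on some distributions scsi_id lives in /lib/udev or /sbin):
# In the VM's advanced configuration (.vmx), with the VM powered off:
disk.EnableUUID = "TRUE"
# Inside the Linux guest, scsi_id can then return a stable identifier for ASM udev rules:
/usr/lib/udev/scsi_id -g -u -d /dev/sdc
# Example udev rule keyed on that ID (36000c29... is a placeholder):
# KERNEL=="sd?", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$name", \
#   RESULT=="36000c29...", SYMLINK+="oracleasm/data01", OWNER="oracle", GROUP="dba", MODE="0660"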
58
ESXi - Para Virtualized SCSI (PVSCSI) Adapters
Latest vSphere SCSI controller drivers. Recommended for workloads with a high performance requirement; less CPU overhead. Requires VMware Tools. Drivers are not native to OEL and cannot be used for the OS partition without some workaround.
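A quick way to confirm the guest actually picked up the paravirtual adapter (a sketch; output varies by kernel and distribution):
# Confirm the PVSCSI driver is loaded in the guest
lsmod | grep vmw_pvscsi
# Confirm the adapter is presented to the guest
lspci | grep -i pvscsi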
59
How Go-eth my Disk IO
60
ESXi – Storage Data Stores
Need careful consideration for datastore setup:
Create separate datastores for DATA and REDO.
Multiple databases can share the same datastore for the same type, i.e., all DATA diskgroups on one datastore - but this does cause issues for snapshotting at the storage layer.
Be cognizant of workload profiles; isolate datastores where necessary.
Align VMFS partitions - use VMware vCenter to create VMFS partitions because it automatically aligns them.
Use Oracle ASM over VMDK.
Use eager zeroed thick disks (see the vmkfstools sketch below).
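If you are creating VMDKs for ASM from the ESXi shell rather than through vCenter, an eager-zeroed thick disk can be created as in this sketch; the datastore path, folder, and size are illustrative:
# Create a 100 GB eager-zeroed thick VMDK for an ASM disk
vmkfstools -c 100G -d eagerzeroedthick /vmfs/volumes/DATA_DS01/oradb01/asm_data01.vmdk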
61
Multiple Queue Model – Improves Concurrency
Multiple vSCSI controllers against the storage subsystem: additional vSCSI controllers improve concurrency.
Per-device / per-adapter queue depth maximums by vSCSI adapter (KB 1267): LSI Logic SAS = 32, PVSCSI = 64/254.
Default queue depths are NOT ENOUGH; PVSCSI supports a larger queue depth per device (256, actually 254) and per adapter (1024) [KB ].
Increase the queue depth in the GOS via kernel parameters - /etc/grub.conf on RHEL 6.x, /etc/default/grub on RHEL 7.x: vmw_pvscsi.cmd_per_lun=254 vmw_pvscsi.ring_pages=32 (sketch below).
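For example, on a RHEL/OEL 7 guest the PVSCSI queue depths above can be raised via kernel boot parameters; a sketch assuming GRUB2 with BIOS boot (verify the module accepts these options on your kernel with modinfo vmw_pvscsi):
# /etc/default/grub
GRUB_CMDLINE_LINUX="... vmw_pvscsi.cmd_per_lun=254 vmw_pvscsi.ring_pages=32"
# Rebuild the grub config and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
# Verify after reboot
cat /sys/module/vmw_pvscsi/parameters/cmd_per_lun
cat /sys/module/vmw_pvscsi/parameters/ring_pages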
62
ESXi - Optimizing Performance – Increase the Queues
Smaller or larger datastores? Datastores have queue depths, determined by the LUN queue depth.
VMkernel admittance: the VMkernel admittance policy affects shared datastores [Disk.SchedNumReqOutstanding, KB 1268]; VMkernel admittance changes dynamically when SIOC is enabled.
Physical HBAs: follow the vendor recommendation on maximum queue depth per LUN [KB ] and on HBA execution throttling; these settings are global if the host is connected to multiple storage arrays. Consult your vendor for the right multipathing policy.
63
I/O Queues (Deep Dive) - ESXi 5.5, OS, PVSCSI Adapter
[Slide diagram: the I/O queue path from the VM's PVSCSI adapter, through the VMkernel and the host HBAs (HBA0/HBA1), across redundant SAN switches (ports 0/1), to the array front-end ports and the array LUN queue - LUN1 mapped to 2 storage ports with X IOPS.]
PVSCSI adapter and device queue depth (4 x devices per adapter) - see kb.vmware.com/kb/1267, kb.vmware.com/kb/ and kb.vmware.com/kb/1268. Best practice: vmw_pvscsi.cmd_per_lun=254, vmw_pvscsi.ring_pages=32.
Oracle and GOS queues: correlate Oracle v$ wait events with OS iostat and sar output.
LUN queue-depth tuning (KB 1267 / KB ): queue depth is per LUN; the HBA itself supports many commands (typically 2,000 to 4,000 per port). If the ESXi host generates more commands to a LUN than the LUN queue depth, the excess commands are queued in the VMkernel, which increases latency.
Emulex documentation for ESXi 5.5: lpfc_lun_queue_depth max = 512, lpfc_hba_queue_depth max = 8192.
Disk.SchedNumReqOutstanding (DSNRO, KB 1268): with 1 VM active on a LUN, effective queue depth = LUN queue depth; with more than 1 VM active on a LUN, effective queue depth = min(LUN queue depth, DSNRO). Anything above DSNRO is queued in the VMkernel. Best practice: set DSNRO = LUN queue depth. DSNRO is set per LUN from ESXi 5.5; default is 32 (ESXi 3.5 onwards).
Setting the maximum outstanding disk requests for virtual machines (KB 1268): to check the current value for a device, run esxcli storage core device list -d naa.xxx (the value appears under "No of outstanding IOs with competing worlds"); to modify it, run esxcli storage core device set -d naa.xxx -O <value>, where <value> is between 1 and 256.
Example: esxcli storage core device list naa bab93b6826fe411
Has Settable Display Name: true; Display Name: DGC Fibre Channel Disk (naa bab93b6826fe411); Size: ; Device Type: Direct-Access; Multipath Plugin: NMP; Vendor: DGC; Devfs Path: /vmfs/devices/disks/naa bab93b6826fe411; Model: VRAID; Revision: 0533; SCSI Level: 4; Status: on; Is Pseudo: false; Is RDM Capable: true; Is Local: false; Is Removable: false; Is VVOL PE: false; Is SSD: false; Is Offline: false; Is Perennially Reserved: false; Queue Full Sample Size: 0; Thin Provisioning Status: unknown; Queue Full Threshold: 0; Attached Filters: VAAI_FILTER; VAAI Status: supported; Other UIDs: vml bab93b6826fe; Is Local SAS Device: false; Is Shared Clusterwide: true; Is SAS: false; Is USB: false; Is Boot USB Device: false; Device Max Queue Depth: 64; Is Boot Device: false; No of outstanding IOs with competing worlds: 32; Drive Type: unknown; RAID Level: unknown; Protection Enabled: false; Number of Physical Drives: unknown; PI Activated: false; PI Type: 0; PI Protection Mask: NO PROTECTION; Supported Guard Types: NO GUARD SUPPORT; DIX Enabled: false; DIX Guard Type: NO GUARD SUPPORT; Emulated DIX/DIF Enabled: false
64
Multipathing – NMP , PSA, SATP
Pluggable Storage Architecture (PSA): an open, modular framework that coordinates the simultaneous operation of multiple multipathing plug-ins (MPPs), providing load-balancing techniques and failover mechanisms.
Native Multipathing Plug-In (NMP): the generic VMware multipathing module; it manages two sub plug-ins:
Path Selection Plug-In (PSP), a.k.a. Path Selection Policy - handles path selection for a given device.
Storage Array Type Plug-In (SATP), a.k.a. Storage Array Type Policy - handles path failover for the storage array.
VMW_PSP_MRU: the host selects the path it used most recently. When the path becomes unavailable, the host selects an alternative path and does not revert to the original path when it becomes available again. There is no preferred-path setting with the MRU policy. MRU is the default policy for most active-passive storage devices. The VMW_PSP_MRU ranking capability allows you to assign ranks to individual paths; to set ranks, use the esxcli storage nmp psp generic pathconfig set command (for details, see the VMware knowledge base article at ). Displayed in the client as the Most Recently Used (VMware) path selection policy.
VMW_PSP_FIXED: the host uses the designated preferred path if it has been configured; otherwise it selects the first working path discovered at system boot time. If you want the host to use a particular preferred path, specify it manually. Fixed is the default policy for most active-active storage devices. NOTE: if the host uses a default preferred path and that path's status turns to Dead, a new path is selected as preferred; however, if you explicitly designate the preferred path, it remains preferred even when it becomes inaccessible. Displayed in the client as the Fixed (VMware) path selection policy.
VMW_PSP_RR: the host uses an automatic path selection algorithm, rotating through all active paths when connecting to active-passive arrays, or through all available paths when connecting to active-active arrays. RR is the default for a number of arrays and can be used with both active-active and active-passive arrays to implement load balancing across paths for different LUNs. Displayed in the client as the Round Robin (VMware) path selection policy.
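To change the path selection policy for a single device from the ESXi shell, a sketch like the following can be used (the naa identifier is a placeholder; confirm Round Robin is supported by your array first):
# Show the current multipathing policy for a device
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx
# Set Round Robin for that device
esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR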
65
More on Multipathing
For I/O distribution (multipathing policy) across vHBAs use Round-Robin. This is especially important when using RDM against all-flash arrays; confirm your storage array supports RR. ESXi also supports third-party path selection plugins (PSPs), e.g., Windows MPIO.
Example - RDM with all-flash arrays: the Round Robin policy should be used, the noop I/O scheduler should be used in the guest, and the IO Operation Limit should be set to 1. The following command creates a rule that achieves this for Pure Storage FlashArray devices only:
esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "PURE" -M "FlashArray" -P "VMW_PSP_RR" -O "iops=1"
Not all storage arrays support the Round Robin path policy. Switching to an unsupported or undesirable path policy can result in connectivity issues to the LUNs or even a storage outage. Check your array documentation or with your storage vendor to see whether Round Robin is supported and/or recommended for your array and configuration.
66
SDRS & SIOC with Array Based Auto Tiering
Pre-6.0, SDRS generates migration recommendations that prevent out-of-space situations or resolve imbalances in performance. The SIOC workload-injector utility, which profiles the capabilities of the datastores for Storage DRS, might not understand the capabilities of tiered storage - e.g., if the injector hits the SSD tier it concludes the datastore is very high performance, and if it hits the SATA tier it concludes it is a lower-performance datastore.
Recommendation: use SDRS for initial placement of VMs and for load balancing of VMs based on space usage only; disable the I/O metrics feature.
vSphere 6.0: Storage DRS is capable of understanding array-based auto-tiering via the vSphere APIs for Storage Awareness (VASA). By default Storage DRS is invoked every 8 hours and requires more than 16 hours of performance data to generate I/O load-balancing decisions.
67
GOS - IO schedulers Recommendation
Deadline: default for all SCSI devices except SATA drives (RHEL 7). Attempts to provide guaranteed latency for requests from the point at which they reach the I/O scheduler; suitable for most use cases, especially where reads outnumber writes.
CFQ (Completely Fair Queuing): default scheduler for SATA disks. Divides processes into three separate classes - real time, best effort, and idle. The real-time class is serviced before best effort, which is serviced before idle; real-time processes can starve both best-effort and idle processes of processor time. The default class is best effort.
NOOP: implements a simple FIFO (first-in, first-out) scheduling algorithm; requests are merged at the generic block layer through a simple last-hit cache. Best scheduler for CPU-bound systems using fast storage.
Red Hat Enterprise Linux 7 uses deadline as the default I/O scheduler for all SCSI devices except SATA drives, so no additional steps are required to select it for SAN devices in RHEL 7 (see tuned-adm for changing performance profiles). CFQ is the default algorithm in Red Hat Enterprise Linux 4, 5, and 6 and is suitable for a wide variety of applications, providing a good compromise between throughput and latency, but for database systems the deadline I/O scheduler is generally recommended.
Recommendation: set the GOS disk scheduling algorithm to noop - the noop scheduler lets ESXi perform the I/O optimization and queueing (sketch below).
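A sketch of setting noop for the ASM devices in the guest; device names are placeholders, and on RHEL/OEL 6 you could instead pass elevator=noop on the kernel command line:
# Change at runtime for one device
echo noop > /sys/block/sdc/queue/scheduler
# Persist with a udev rule so matching virtual disks get noop at boot
# /etc/udev/rules.d/60-io-scheduler.rules
# ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="noop"
# Verify - the active scheduler is shown in brackets
cat /sys/block/sdc/queue/scheduler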
68
Virtual Disk Provisioning Policies
Thick Provision Lazy Zeroed - creates a virtual disk in the default thick format; space is allocated when the disk is created, but existing data on the physical device is not erased during creation - it is zeroed out on demand at the first write from the VM.
Thick Provision Eager Zeroed - same as above, except the existing data is zeroed out when the virtual disk is created; it may take longer to create virtual disks in this format.
Thin Provision - saves storage space; the disk starts small and uses only as much space as it needs, and can grow to its maximum capacity if needed.
Use Storage vMotion or cross-host Storage vMotion to transform virtual disks from one format to another. NFS datastores with Hardware Acceleration and VMFS datastores support all of the above provisioning types; NFS datastores without Hardware Acceleration support the thin format only.
69
Partition Alignment – Guest Operating System
Configure storage presented to vSphere hosts using vCenter to ensure VMFS partition alignment. Misalignment adds significant latency to high-end workloads because a single I/O can cross physical boundaries; partition alignment reduces the I/Os sent to disk by the controller and thus reduces latency.
On the GOS, align new disks using parted or fdisk (Linux) or diskpart.exe (Windows). Align on a track boundary: unaligned partitions result in additional I/O, aligned partitions reduce I/O.
To ensure proper alignment, VMFS partitions should always be created from within the vCenter client. Guest OS partitions are also susceptible to misalignment; Windows 2008 and higher will attempt to align partitions at a 1 MB offset, while pre-Windows 2008 systems must use the command-line tool diskpart.exe. vCenter 5 aligns VMFS3/VMFS5 partitions along the 1 MB boundary; vCenter 4 aligns VMFS3 partitions along the 64 KB boundary.
Enterprise-level arrays will always be deployed with multiple host bus adapters as well as multiple storage processors; to ensure the highest level of redundancy, vSphere hosts must follow this practice by deploying a minimum of two HBA ports per ESX host.
VMware Infrastructure 3 recommendations for aligning VMFS partitions: to manually align your VMware VMFS partitions, first check your storage vendor's recommendations for the partition starting block. For example, in the EMC CLARiiON Best Practices for Fibre Channel Storage guide available at EMC Powerlink, EMC recommends a starting block of 128 to align the partition to the 64 KB boundary. If your storage vendor makes no specific recommendation, use a starting block that is a multiple of 8 KB.
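For a new ASM disk inside the guest, parted can create a partition that starts on the 1 MiB boundary; a sketch with a placeholder device name:
# Create a GPT label and a single partition starting at 1 MiB
parted -s /dev/sdc mklabel gpt mkpart primary 1MiB 100%
# Confirm the start sector (2048 sectors x 512 bytes = 1 MiB)
parted -s /dev/sdc unit s print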
70
Partition Alignment Concepts – EMC VMAX Storage (Example)
A hyper (hypervolume) can have a 32/64/128 KB track size depending on the VMAX hardware and code level (the VMAX3/VMAX AF track size is 128 KB).
VMFS-3 uses MBR; ESXi adjusts the MBR so that it occupies the first 64 KB. VMFS-5 uses GPT, which occupies the first 1 MB, so no adjustment is required. The block size is a multiple of the track size, and file allocations are in even multiples of the track size. By default, the stripe size of a striped meta is 960 KB.
The decision to place Oracle database server data on VMFS versus RDM is no longer tied to performance requirements - VMFS has been proven to provide, and in certain cases exceed, native performance. Considerations include: operational procedures or policies of the end user; support for hardware-based VSS; the requirement to swing LUNs between virtual and physical servers; and the requirement to use VMware features such as Clone, Storage vMotion, VMware Data Recovery, or Snapshots.
For random workloads, VMFS and RDM produce similar I/O throughput. For sequential workloads with small I/O block sizes, RDM provides a small increase in throughput compared to VMFS, but the performance gap decreases as the I/O block size increases. For all workloads, RDM has slightly better CPU cost. The test results described in this study show that VMFS and RDM provide similar I/O throughput for most of the workloads tested; the small differences observed were with the virtual machine running CPU-saturated and would be minimized in real-life workloads, because most applications do not usually drive virtual machines to their full capacity. Most enterprise applications can therefore use either VMFS or RDM for configuring virtual disks in a virtual machine. However, a few cases require raw disks: backup applications that use inherent SAN features such as snapshots, and clustering applications (for both data and quorum disks). RDM is recommended for these cases - not for performance reasons, but because these applications require lower-level disk control.
EMC VMAX track mapping of a hypervolume with VMFS: virtual disks created by ESXi 5/6 are always aligned.
71
Linux fdisk partition alignment - Example
~]# fdisk -lu
Disk /dev/sda: 53.7 GB, bytes, sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000f131c
Device Boot Start End Blocks Id System
/dev/sda1 * Linux
/dev/sda e Linux LVM
/dev/sda e Linux LVM
Disk /dev/sdb: 53.7 GB, bytes, sectors
Disk identifier: 0x1e60ac7b
/dev/sdb Linux
/dev/sda is the Linux O/S root disk; /dev/sdb is the Oracle ASM disk. Partitions start at sector 2048: 2048 sectors * 512 bytes/sector = 1 MB partition offset.
72
Storage – ASM LUNs on XtremIO (thx Kevin Closson)
73
Storage – ASM LUNs on XtremIO
74
Design Considerations for Sizing and Best Practices ESXi and Guest OS - Network
75
ESXi – Network Best Practices
For best throughput use VMXNET3 network drivers. Do not use the NetIO limiter.
Use NICs that include a TCP offload engine (TOE): checksum offload, TCP segmentation offload, high-memory DMA.
Use jumbo frames for the RAC private interconnect network and for network storage (iSCSI or NFS).
If low latency is required, implement VMDirectPath I/O. Do enable network interrupt coalescing unless you need extremely low latency - and in those cases you probably shouldn't be virtualizing the application anyway.
Consider SR-IOV. ** Standard vSwitch vs. vDS?
Be aware that with software-initiated iSCSI and NFS the network protocol processing takes place on the host system, so these might require more CPU resources than other storage options.
76
vSphere Distributed Switch (vDS)
Network design with vSphere Distributed Switch (vDS) and port groups: use 10 GbE, NIC teaming, and enable jumbo frames for the Oracle interconnect.
77
Best Practices for Network design
Allocate separate NICs for: vMotion, FT logging traffic, ESXi console access/management, and the Oracle RAC interconnect. Alternatively, use VLAN trunking to separate production user traffic, management traffic, the VM network, and iSCSI storage traffic.
vSphere 5.0 supports the use of more than one NIC for vMotion, allowing more simultaneous vMotions; this was added specifically for memory-intensive applications like databases.
Use NIC load-based teaming (route based on physical NIC load) for availability, load balancing, and improved vMotion speeds.
Have a minimum of 4 NICs per host to ensure performance and redundancy of the network.
78
Best Practices for Network design
Use VMXNET3 paravirtualized adapter drivers to increase performance - they reduce overhead versus vlance/E1000 emulation; VMware Tools must be installed to enable VMXNET3.
Recommended NIC capabilities: checksum offload, TCP segmentation offload (TSO), jumbo frames (JF), large receive offload (LRO), the ability to handle high-memory DMA (i.e., 64-bit DMA addresses), the ability to handle multiple scatter/gather elements per Tx frame, and offload of encapsulated packets (with VXLAN).
vSphere 5.0 and newer support multi-NIC, concurrent vMotion operations - Multiple-NIC vMotion in vSphere ( ).
Use Distributed Virtual Switches for cross-ESX network convenience.
Optimize IP-based storage (iSCSI and NFS): enable jumbo frames; use a dedicated VLAN for the ESXi host's vmknic and the iSCSI/NFS server to minimize network interference from other packet sources; exclude iSCSI NICs from Windows Failover Cluster use.
Be mindful of converged networks: storage load can affect the network and vice versa as they share the same physical hardware; ensure there are no bottlenecks in the network between the source and destination.
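Inside the guest you can confirm the VMXNET3 driver and its offload features are actually active; a quick check with a placeholder interface name:
# Confirm the interface is using the vmxnet3 driver
ethtool -i eth0 | grep driver
# List offload features (TSO, LRO, checksum offload, scatter-gather)
ethtool -k eth0 | egrep "tcp-segmentation-offload|large-receive-offload|scatter-gather|checksum"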
79
Network - Jumbo Frames: confirm there is no MTU mismatch anywhere along the path when using jumbo frames (see the checks below).
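A sketch of an end-to-end MTU check; addresses and interface names are placeholders, and 8972 bytes is a 9000-byte MTU minus 28 bytes of ICMP/IP headers:
# From the ESXi host: test a vmkernel path (vMotion / IP storage) without fragmentation
vmkping -d -s 8972 192.168.10.20
# From the Linux guest (e.g., RAC interconnect): -M do forbids fragmentation
ping -M do -s 8972 -c 3 192.168.20.12
# Confirm the guest interface MTU
ip link show eth1 | grep mtu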
80
Best Practices for Network design
Do not use the NetIO limiter. Tune Guest OS network buffers and maximum ports.
For Oracle GOS virtual machines, use an external NTP source and disable VMware Tools time synchronization (KB 1189) - leave the time-synchronization checkbox unselected and set the following .vmx options (0 = disabled, 1 = enabled):
tools.syncTime = "0"
time.synchronize.continue = "0"
time.synchronize.restore = "0"
time.synchronize.resume.disk = "0"
time.synchronize.shrink = "0"
time.synchronize.tools.startup = "0"
time.synchronize.tools.enable = "0"
time.synchronize.resume.host = "0"
81
Best Practices for Network design
For the Oracle RAC interconnect:
Disable virtual interrupt coalescing for the VMXNET3 private NIC [KB ]: ethernetX.coalescingScheme=disabled (for a particular VM), or Net.CoalesceDefaultOn=0 for all vNICs on the ESXi host.
Disable interrupt throttling at the ESXi pNIC level (?) - this helps reduce latency for latency-sensitive VMs and automatically disables Large Receive Offload (LRO) at the pNIC level: esxcli system module parameters set -m <> -p "InterruptThrottleRate=0"
Disable LRO for the VMXNET3 private NIC (for TCP-dependent apps). LRO aggregates multiple received TCP segments into a larger TCP segment before delivering it up to the guest TCP stack: ethtool -K ethX lro off, set in /etc/rc.d/rc.local.
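Putting the interconnect settings above together, a hedged per-VM sketch; the vNIC numbering (ethernet1/eth1) is illustrative, and the latency/CPU trade-off should be tested before applying broadly:
# .vmx advanced setting for the vNIC used as the RAC private interconnect (VM powered off)
ethernet1.coalescingScheme = "disabled"
# In the guest: disable LRO on the private NIC now and persist it across reboots
ethtool -K eth1 lro off
echo "ethtool -K eth1 lro off" >> /etc/rc.d/rc.local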
82
Best Practices for Network design - SRIOV
If low latency is required, implement VMDirectPath I/O or SR-IOV. Use SR-IOV for apps with very high packet rates / very low latency requirements; it allows a single physical device to be shared among multiple guests (as opposed to VMDirectPath I/O).
You need to use certified [NIC] adapters and drivers (Intel ixgbe and Emulex adapters). Enable SR-IOV on a physical adapter and define the number of virtual functions (VFs).
SR-IOV disables a lot of VMware feature functionality - you're trading performance for functionality. Usage of SR-IOV disables the following features: vSphere vMotion, Storage vMotion, NetFlow, VXLAN Virtual Wire, vSphere High Availability, vSphere Fault Tolerance, vSphere DRS, vSphere DPM, virtual machine suspend and resume, virtual machine snapshots, MAC-based VLAN for passthrough virtual functions, hot addition and removal of virtual devices, memory, and vCPU, participation in a cluster environment, and network statistics for a virtual machine NIC using SR-IOV passthrough.
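As a sketch of the host side of this, on an Intel ixgbe 10 GbE adapter the virtual functions are typically enabled through a driver module parameter; the VF counts and the module option vary by NIC and driver version, so check your adapter's documentation:
# Enable 8 VFs on each of two ixgbe ports, then reboot the host
esxcli system module parameters set -m ixgbe -p "max_vfs=8,8"
# Confirm the parameter took effect
esxcli system module parameters list -m ixgbe | grep max_vfs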
83
Best Practices for Network design
Use NIC load-based teaming (route based on physical NIC load) for availability, load balancing, and improved vMotion speeds.
84
Best Practices – Oracle RAC Networking
Each host must provide a minimum of 2 Ethernet NICs for public and private (interconnect) communication. In an Oracle RAC configuration, separate the networks for public traffic, the private network (cluster interconnect), and the storage network (if applicable).
Run ntpd against an external NTP server.
Use Live Migration carefully: set the CSS timeout conservatively to accommodate migration.
Live Migration basically consists of iteratively pre-copying the contents of a source VM's memory from one physical host to another target VM on a remote host with minimal interruption to running services. This iterative pre-copy continues until the dirty page rate on the source machine is consistently higher than the memory transfer rate to the target machine; at that point the VM is stopped and all remaining dirty pages are copied from the source VM to the target VM. This stop-and-copy operation is a suspend-time blackout of the VM. For most applications the dirty page rate will be low and the suspend time will measure in milliseconds to seconds, but for large, highly active applications with potential dirty page sets of greater than 2.7 GB the suspend time could be considerably longer, potentially violating clusterware heartbeat thresholds and triggering node fencing at the Oracle Clusterware level. To avoid such failures during a Live Migration of VMs, observe the following best practices:
1. Prior to initiating a Live Migration in an Oracle RAC production environment, redefine the Oracle Clusterware misscount from the default of 30 seconds to 60. Issue the following command as root: `crsctl set css misscount 60`. This allows a Live Migration for an application with a potential dirty page set of greater than 2.7 GB; assuming a conservative maximum throughput of 90 MB/sec for a single shared 1 GbE link, a 60-second misscount allows roughly 5.4 GB (90 MB/sec * 60 sec) to be copied during the suspend window.
2. When the Live Migration is complete, set the Oracle Clusterware misscount back to the default. Issue the following command as root: `crsctl unset css misscount`
85
Network – A Quick Word on SRIOV
Use SR-IOV for applications with very high packet rates / very low latency requirements. It allows a single physical device to be shared among multiple guests (as opposed to VMDirectPath I/O). You need to use certified [NIC] adapters and drivers (Intel ixgbe and Emulex adapters): enable SR-IOV on a physical adapter and define the number of virtual functions (VFs). SR-IOV disables a lot of VMware feature functionality (see the list above) - you're trading performance for functionality.
86
RAC – VMware Best Practices
RAC itself is latency sensitive, especially on the network - disable network interrupt coalescing. Applications may also be latency sensitive. For these configurations there needs to be a balance between lowering overall latency and accepting higher CPU overhead: latency and CPU utilization move in opposite directions.
87
VMware – Things to look for
Run load tests before virtualizing, so you know what to expect.
Run RAC where it makes sense.
Use and get comfortable with esxtop.
Most importantly, make sure organizational boundaries are well defined: who has VM Linux root access, and who can clone and create templates.
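A few esxtop starting points worth memorizing; the batch-mode file name is a placeholder:
esxtop            # start interactive mode on the ESXi host
# Single-key views: c = CPU, m = memory, n = network, d = disk adapter, u = disk device, v = disk VM
# Watch %RDY (CPU ready) on the CPU screen and DAVG/KAVG latencies on the disk screens
esxtop -b -d 5 -n 720 > esxtop_capture.csv    # batch mode: 5-second samples for 1 hour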
88
A Word on Oracle UEK - it's your ace in the hole. If you run UEK on VMware, you get much better support from Oracle. UEK has loads of benefits for database workloads, even in virtualized environments: improved memory management and better I/O scheduling.
89
Performance Best Practices – At a glance
90
Summary Performance is not an issue anymore
Get immediate ROI by deploying Oracle databases on vSphere. DBAs should not fear virtualization; their skills are fully transferable, and both production DBAs and database architects should include VMware virtualization in their skill set.
Run load tests before virtualizing, so you know what to expect.
Run RAC where it makes sense.
Use and get comfortable with esxtop.
Most importantly, make sure organizational boundaries are well defined: who has VM Linux root access, and who can clone and create templates.