AIX CPU & Virtual Processors
Steve Nasypany
© 2012 IBM Corporation

Utilization, Simultaneous Multi-threading & Virtual Processors
Review: POWER6 vs POWER7 SMT Utilization

Simulating a single-threaded process on 1 core and 1 Virtual Processor, the utilization values change by architecture. In each of these cases, physical consumption can be reported as 1.0:
- POWER6 SMT2: one hardware thread (Htc0) busy, one (Htc1) idle; reported 100% busy
- POWER7 SMT2: one hardware thread busy, one idle; reported ~70% busy
- POWER7 SMT4: one hardware thread busy, three (Htc1-Htc3) idle; reported ~65% busy

(busy = user% + system%. POWER5/6 utilization does not account for SMT; POWER7 utilization is calibrated in hardware.)

Real-world production workloads involve dozens to thousands of threads, so many users may not notice any difference at the macro scale.

See: Simultaneous Multi-Threading on POWER7 Processors by Mark Funk; Processor Utilization in AIX by Saravanan Devendran: https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Power%20Systems/page/Understanding%20CPU%20utilization%20on%20AIX
Review: POWER6 vs POWER7 Dispatch

There is a difference in how workloads are distributed across cores in POWER7 versus earlier architectures:
- In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded.
- In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used. Only once the secondary threads are loaded will the tertiary threads be dispatched. This is referred to as Raw Throughput mode.
- Why? Raw Throughput provides the highest per-thread throughput and best response times, at the expense of activating more physical cores.

(Diagram: POWER6 SMT2 loads Htc0/Htc1 to ~80% busy before activating another Virtual Processor; POWER7 SMT4 loads Htc0 to ~50% busy, with Htc1-Htc3 idle, before the next VP is activated.)
Review: POWER6 vs POWER7 Dispatch

Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number. Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be on POWER7.

(Diagram: four cores, proc0-proc3. POWER6 shows primary and secondary threads per core; POWER7 shows primary, secondary and tertiary threads per core.)
POWER7 Consumption: A Problem?

POWER7 will activate more cores at lower utilization levels than earlier architectures when excess VPs are present:
- Customers may complain that the physical consumption metric (reported as physc or pc) is equal to, or possibly even higher than, before migrating to POWER7 from an earlier architecture.
- Every POWER7 customer with this complaint has also had significantly higher idle% percentages than on earlier architectures.
- Consolidation of workloads may result in many more VPs assigned to the POWER7 partition.

Customers may also note that CPU capacity planning is more difficult on POWER7. If they will not reduce VPs, they may need to subtract %idle from the physical consumption metrics for more accurate planning:
- On POWER5 & POWER6, 80% utilization was closer to 1.0 physical core.
- On POWER7 with excess VPs, in theory, all of the VPs could be dispatched while the system is 40-50% idle.
- Thus, you cannot reach higher utilization of larger systems if you have lots of VPs that are only exercising the primary SMT thread: you will have high Physical Consumption with lots of idle capacity.
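The planning adjustment described above (subtracting %idle from physical consumption) amounts to simple arithmetic; a minimal sketch, with hypothetical physc and idle% figures rather than real lparstat output:

```shell
# Sketch: discount reported physical consumption by partition idle% to get a
# truer "busy cores" figure for planning. physc and idle% are hypothetical.
physc=4.0      # physical consumption (e.g. as lparstat or vmstat would report)
idle_pct=40    # partition-wide %idle over the same interval

# effective busy cores ~= physc * (100 - idle%) / 100
effective=$(awk -v p="$physc" -v i="$idle_pct" 'BEGIN { printf "%.1f", p * (100 - i) / 100 }')
echo "effective busy cores: $effective"
```

With the sample figures above, 4.0 consumed cores at 40% idle plan out to about 2.4 genuinely busy cores.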
POWER7 Consumption: Solutions

- Apply the APARs in the backup section; these can be causal for many of the high-consumption complaints.
- Beware of allocating many more Virtual Processors than were sized.
- Reduce Virtual Processor counts to activate secondary and tertiary SMT threads: utilization percentages will go up, while physical consumption will remain equal or drop.
- Use nmon, topas, sar or mpstat to look at logical CPUs. If only primary SMT threads are in use with a multi-threaded workload, then excess VPs are present.
- A new alternative is Scaled Throughput, which allows increases in per-core utilization by a Virtual Processor.
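The "only primary SMT threads in use" check above can be rough-scripted; a sketch assuming an SMT4 layout where logical CPUs 0, 4, 8, ... are the primary threads of each VP, fed hypothetical per-CPU busy% pairs (real sar/mpstat output has a different layout and needs different parsing):

```shell
# Sketch: flag the "only primary SMT threads busy" pattern from lcpu/busy%
# pairs. Assumes SMT4 numbering where lcpu 0, 4, 8, ... are primary threads.
# The figures below are hypothetical sample data.
result=$(awk '
  { busy[$1] = $2 }
  END {
    prim = sec = 0
    for (cpu in busy) {
      if (cpu % 4 == 0) { prim += busy[cpu] } else { sec += busy[cpu] }
    }
    if (prim > 0 && sec < prim / 10)
      print "primary threads only: likely excess VPs"
    else
      print "secondary/tertiary threads in use"
  }
' <<'EOF'
0 85
1 2
2 0
3 0
4 90
5 1
6 0
7 0
EOF
)
echo "$result"
```

Here both primary threads are busy while their siblings are nearly idle, so the sketch reports the excess-VP pattern.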
Scaled Throughput
What is Scaled Throughput?

Scaled Throughput is an alternative to the default Raw AIX scheduling mechanism:
- It is an alternative for some customers, at the cost of partition performance.
- It is not a substitute for addressing AIX and pHyp defects, partition placement issues, unrealistic entitlement settings, or excessive Virtual Processor assignments.
- It dispatches more SMT threads to a VP/core before unfolding additional VPs.
- It can be considered more like the POWER6 folding mechanism, but this is a generalization, not a technical statement.
- Supported on POWER7/POWER7+ with AIX 6.1 TL08 & AIX 7.1 TL02.

Raw vs Scaled performance:
- Raw provides the highest per-thread throughput and best response times, at the expense of activating more physical cores.
- Scaled provides the highest core throughput at the expense of per-thread response times and throughput. It also provides the highest system-wide throughput per VP, because tertiary thread capacity is not left on the table.
POWER7 Raw vs Scaled Throughput

Once a Virtual Processor is dispatched, physical consumption will typically increase to the next whole number.

(Diagram: four cores, proc0-proc3, each with 4 logical CPUs. Raw (default) mode spreads work across the primary threads of all four cores; Scaled Mode 2 packs primary and secondary threads; Scaled Mode 4 packs all four threads per core. Per-core utilization calibration steps of 63%, 77%, 88% and 100% are shown as threads are added.)
Scaled Throughput: Tuning

Tunings are not restricted, but anyone experimenting with this without understanding the mechanism may suffer significant performance impacts:
- It is a dynamic schedo tunable.
- The actual thresholds used by these modes are not documented and may change at any time.

schedo -p -o vpm_throughput_mode=<N>
  0  Legacy Raw mode (default)
  1  Scaled or Enhanced Raw mode with a higher threshold than legacy
  2  Scaled mode, use primary and secondary SMT threads
  4  Scaled mode, use all four SMT threads

The schedo tunable vpm_throughput_core_threshold sets a core count at which to switch from Raw to Scaled mode:
- It allows fine-tuning for workloads depending on utilization level.
- VPs will ramp up quicker to a desired number of cores, and then be more conservative under the chosen Scaled mode.
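The two tunables above could be applied together as follows; a dry-run sketch (the run stub only echoes, so it is safe outside AIX; drop the stub to invoke schedo for real), with the threshold value of 4 chosen purely for illustration, not as a recommendation:

```shell
# Dry-run sketch of the Scaled Throughput tunables described above.
# "run" only echoes; on a real AIX LPAR you would invoke schedo directly.
run() { echo "would run: $*"; }

cmd1=$(run schedo -p -o vpm_throughput_mode=2)            # Scaled mode, primary + secondary threads
cmd2=$(run schedo -p -o vpm_throughput_core_threshold=4)  # stay Raw until ~4 cores are in use
printf '%s\n%s\n' "$cmd1" "$cmd2"
```

Both tunables are dynamic, so no reboot is required when they are applied on a live partition.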
Scaled Throughput: Workloads

Workloads:
- Workloads with many light-weight threads with short dispatch cycles and low I/O (the same types of workloads that benefit well from SMT).
- Customers who are easily meeting network and I/O SLAs may find the trade-off between higher latencies and lower core consumption attractive.
- Customers who will not reduce over-allocated VPs and prefer to see behavior similar to POWER6.

Performance:
- It depends; we can't guarantee what a particular workload will do.
- Mode 1 may see little or no impact, but higher per-core utilization with lower physical consumption.
- Workloads that do not benefit from SMT and use Mode 2 or Mode 4 will see double-digit per-thread performance degradation (higher latency, slower completion times).
Raw Throughput: Default and Mode 1

- AIX will typically allocate 2 extra Virtual Processors as the workload scales up and is more instantaneous in nature.
- VPs are activated and deactivated one second at a time.
- Mode 1 is a modification of the Raw (Mode 0) throughput mode, using a higher utilization threshold and a moving average to reduce VP oscillation.
- It is less aggressive about VP activations. Many workloads may see little or no performance impact.
Scaled Throughput: Modes 2 & 4

- Mode 2 utilizes the primary and secondary SMT threads. Somewhat like POWER6 SMT2, eight threads are collapsed onto four cores. Physical Busy, the utilization percentage, reaches ~80% of Physical Consumption.
- Mode 4 utilizes the primary, secondary and tertiary SMT threads. Eight threads are collapsed onto two cores, and Physical Busy reaches an even higher percentage of Physical Consumption.
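The thread-packing figures above follow from ceiling division of runnable threads by SMT width; a sketch of that arithmetic (it ignores dispatch and affinity detail):

```shell
# Sketch: cores needed to host a number of runnable software threads at a
# given SMT width (ceiling division only; no dispatch or affinity detail).
cores_needed() {   # usage: cores_needed <threads> <smt_width>
  awk -v t="$1" -v w="$2" 'BEGIN { print int((t + w - 1) / w) }'
}

echo "8 threads at SMT2: $(cores_needed 8 2) cores"   # Mode 2 style packing
echo "8 threads at SMT4: $(cores_needed 8 4) cores"   # Mode 4 style packing
```

Eight threads at SMT2 need four cores, and at SMT4 only two, matching the collapse described in the bullets above.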
Tuning (other)

- Never adjust the legacy vpm_fold_threshold without L3 Support guidance.
- Remember that Virtual Processors activate and deactivate on 1-second boundaries.
- The legacy schedo tunable vpm_xvcpus allows enablement of more VPs than the workload requires. This is rarely needed, and it is overridden when a Scaled mode is active.
- If you use the RSET or bindprocessor functions and bind a workload:
  - To a secondary thread, that VP will always stay in at least SMT2 mode.
  - To a tertiary thread, that VP cannot leave SMT4 mode.
  - These functions should only be used to bind to primary threads, unless you know what you are doing or are an application developer familiar with the RSET API.
  - Use bindprocessor -s to list primary, secondary and tertiary threads.
- A recurring question is "How do I know how many Virtual Processors are active?":
  - There is no tool or metric that shows the active Virtual Processor count.
  - There are ways to guess it: looking at physical consumption (if folding is active), the physc count should roughly equal the number of active VPs.
  - nmon Analyser makes a somewhat accurate representation, but over long intervals (with a default of 5 minutes) it does not provide much resolution.
  - For an idea at a given instant with a consistent workload, you can use: echo vpm | kdb
Virtual Processors

> echo vpm | kdb

The VSD Thread State output lists one logical CPU per line, with columns CPU, CPPR, VP_STATE, FLAGS, SLEEP_STATE, PROD_TIME (SECS and NSECS) and CEDE_LAT. Dispatched logical CPUs show VP_STATE ACTIVE with SLEEP_STATE AWAKE; folded ones show DISABLED with SLEEPING.

With SMT4, each core has 4 logical CPUs, which equal 1 Virtual Processor. This method is only useful for steady-state workloads.
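One way to turn that kdb output into an active-VP estimate; a sketch using hypothetical, simplified lines (real kdb output carries more columns, so the field position used here is an assumption):

```shell
# Sketch: estimate active Virtual Processors by counting ACTIVE logical CPUs
# in "echo vpm | kdb" style output and dividing by the SMT width.
# The sample lines are hypothetical and simplified; in real kdb output the
# VP_STATE column position differs, so "$2" is an assumption.
active_lcpus=$(awk '$2 == "ACTIVE" { n++ } END { print n + 0 }' <<'EOF'
0 ACTIVE AWAKE
1 ACTIVE AWAKE
2 ACTIVE AWAKE
3 ACTIVE AWAKE
4 DISABLED SLEEPING
5 DISABLED SLEEPING
6 DISABLED SLEEPING
7 DISABLED SLEEPING
EOF
)
smt=4
echo "active VPs: $(( active_lcpus / smt ))"
```

Four ACTIVE logical CPUs at SMT4 correspond to one unfolded Virtual Processor in this sample.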
Variable Capacity Weighting
Variable Capacity Weighting - Reality

Do I use a partition's Variable Capacity Weight to get uncapped capacity?
- NO: PowerVM has no mechanism to distribute uncapped capacity when the pool is not constrained.
- YES: It is the mechanism to arbitrate shared-pool contention.

Uncapped weight is only used when there are more virtual processors ready to consume unused resources than there are physical processors in the shared processor pool. If no contention exists for processor resources, the virtual processors are immediately distributed across the logical partitions, independent of their uncapped weights. This can result in situations where the uncapped weights of the logical partitions do not exactly reflect the amount of unused capacity.

This behavior is only supported for the default (global) shared pool:
- Unused capacity is spread by the hypervisor across all active pools, and is not managed on a per-pool basis.
- Cycles are distributed evenly in virtual shared (sub-)pools, not based on Variable Capacity Weight ratios.
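When contention does exist, spare pool cycles are arbitrated in proportion to the uncapped weights; a sketch of that proportional split, with hypothetical weights and spare capacity:

```shell
# Sketch: under shared-pool contention, spare capacity is split in proportion
# to uncapped weights. The weights and spare-core figure are hypothetical.
spare=2.0          # unused pool cores both partitions are contending for
wA=128; wB=64      # uncapped weights of partitions A and B
shareA=$(awk -v s="$spare" -v a="$wA" -v b="$wB" 'BEGIN { printf "%.2f", s * a / (a + b) }')
shareB=$(awk -v s="$spare" -v a="$wA" -v b="$wB" 'BEGIN { printf "%.2f", s * b / (a + b) }')
echo "partition A receives $shareA cores, partition B receives $shareB cores"
```

With a 128:64 weight ratio, partition A receives roughly twice the spare cycles of partition B, but only while the pool is actually constrained.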
Constraining Workloads

Variable Capacity Weighting is not a partition or job management utility:
- Environments that are out of pool resources cannot be actively controlled with high fidelity by fighting over scraps of unused cycles.
- It is only a failsafe when an environment is running out of capacity, typically providing a last option for the VIOS.
- If you are out of CPU resources, you need to use other methods to manage workloads.
Constraining Workloads

How do I protect critical workloads? By constraining other partitions' access to CPU resources.

PowerVM methods:
- Capping Entitlement, or setting Variable Capacity Weight to 0.
- Dynamically reducing Virtual Processors. Entitlement can also be reduced dynamically, which reduces guaranteed resources, but adjusting VPs has a more direct impact. You must establish a practical range of minimum and maximum entitlement values to allow flexibility in dynamic changes.
- Virtual Shared Pools (or sub-pools): a workload can be constrained by dynamically setting the Maximum Processing Units of a sub-pool, but this is effectively the same as reducing VPs.
- Use Capacity-on-Demand feature(s) to grow the pool.

Operating System methods:
- Process priorities.
- Execute flexible workloads at different times.
- Workload Manager Classes or Workload Partitions.
- Rebalance workloads between systems.
Constraining Workloads

If you're out of capacity, you need to dial back VPs, cap partitions, move workloads (in time or placement), or use Capacity-on-Demand.
Redbook & APAR Updates
Power Systems Performance Guide

This is an outstanding Redbook for new and experienced users.
POWER7 Optimization & Tuning Guide

- A single, definitive first-stop source for a wide variety of general information and guidance, referencing other, more detailed sources on particular topics.
- Exploitable by IBM, ISV and customer software developers.
- Covers hypervisor, OS (AIX & Linux), Java, compiler and memory details.
- Guidance/links for DB2, WAS, Oracle, Sybase, SAS, SAP Business Objects.
Performance APARs

The most problematic performance issues with AIX were resolved early on. Surprisingly, many customers are still running with these defects:
- Memory Affinity Domain Balancing
- Scheduler/dispatch defects
- Wait process defect
- TCP Retransmit
- Shared Ethernet defects

Do not run with a firmware level below 720_101; a hypervisor dispatch defect exists below that level.

The next slide provides the APARs that resolve the major issues:
- We strongly recommend updating to these levels if you encounter performance issues. AIX Support will likely push you to these levels before wanting to do detailed research on performance PMRs.
- All customer Proof-of-Concept or other tests should use these as the minimum recommended levels to start with.
Performance APARs - MUST HAVE

WAITPROC IDLE LOOPING CONSUMES CPU
  7.1 TL1: IV10484, SP2 (IV09868)
  6.1 TL7: IV10172, SP2 (IV09929)
  6.1 TL6: IV06197, U bos.mp or SP7
  6.1 TL5: IV01111, U bos.mp or SP8

SRAD load balancing issues on shared LPARs
  7.1 TL1: IV10802, SP2 (IV09868)
  6.1 TL7: IV10173, SP2 (IV09929)
  6.1 TL6: IV06196, U bos.mp or SP7
  6.1 TL5: IV06194, U bos.mp or SP8

Miscellaneous dispatcher/scheduling performance fixes
  7.1 TL1: IV10803, SP2 (IV09868)
  6.1 TL7: IV10292, SP2 (IV09929)
  6.1 TL6: IV10259, U bos.mp or SP7
  6.1 TL5: IV11068, U bos.mp or SP8

Address space lock contention issue
  7.1 TL1: IV10791, SP2 (IV09868)
  6.1 TL7: IV10606, SP2 (IV09929)
  6.1 TL6: IV03903, U bos.mp or SP7
  6.1 TL5: n/a

TCP Retransmit Processing is slow (HIPER)
  7.1 TL1: IV13121, SP4
  6.1 TL7: IV14297, U bos.net.tcp.client or SP8
  6.1 TL6: IV18483

SEA lock contention and driver issues
  FP25 SP02
Early 2013 Paging Defect

A new global_numperm tunable was enabled with AIX 6.1 TL7 SP4 / 7.1 TL1 SP4. Customers may experience early paging due to a failed pin check on 64K pages.

What:
- The system fails to steal from 4K pages when 64K pages are near the maximum pin percentage (maxpin) and 4K pages are available.
- This scenario is not properly checked for all memory pools when global numperm is enabled.
- vmstat -v shows that the number of 64K pages pinned is close to maxpin%.
- svmon shows that 64K pinned pages are approaching the maxpin value.

Action:
- Apply the APAR.
- Alternatively, if the APAR cannot be applied immediately, disable numperm_global: # vmo -p -o numperm_global=0
- The tunable is dynamic, but workloads paged out will have to be paged back in, and performance may suffer until that completes or a reboot is performed.

APARs:
- IV26272 AIX 6.1 TL7
- IV26735 AIX 6.1 TL8
- IV26581 AIX 7.1 TL0
- IV27014 AIX 7.1 TL1
- IV26731 AIX 7.1 TL2
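The vmstat -v / svmon check described above reduces to comparing pinned 64K pages against the maxpin ceiling; a sketch with hypothetical page counts, not real command output:

```shell
# Sketch: the defect scenario shows pinned 64K pages approaching maxpin.
# Page counts below are hypothetical, not real vmstat -v / svmon output.
pinned_64k=180000    # pinned 64K pages (as svmon would report)
maxpin_64k=200000    # maxpin% expressed as an equivalent 64K page count
pct=$(( pinned_64k * 100 / maxpin_64k ))
echo "pinned 64K pages at ${pct}% of maxpin"
if [ "$pct" -ge 90 ]; then
  echo "near maxpin: early-paging defect scenario possible"
fi
```

The 90% threshold here is an illustrative cutoff for "close to maxpin%", not a documented value.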
© Copyright IBM Corporation 2015 Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission.