
1 Practical Performance Management for Oracle Real Application Clusters

2 Practical Performance Management for Oracle Real Application Clusters
Michael Zoll, Consulting Member of Technical Staff; Barb Lundhild, Product Manager, Oracle Real Application Clusters
Learn strategies for managing the performance of your Oracle Real Application Clusters database. We will review the common problems that customers have faced and how to resolve them. This session provides hints and tips for performance problem solving when using a cluster. Join the experts to uncover the best practices for getting the best performance from your application.

3 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

4 Agenda
Oracle RAC Infrastructure and Technical Fundamentals; Application and Database Design; Common Problems and Symptoms; Diagnostics and Problem Determination; Appendix

5 Objective
Convey a few simple and fundamental concepts of Oracle RAC performance. Summarize application-level performance and scalability information. Provide some simple sizing hints. Give an exemplary overview of common problems and solutions. Builds on a similar presentation from OOW 2008.

6 Oracle RAC Infrastructure: Technical Fundamentals, Sizing and Configuration

7 Oracle RAC Architecture
[Architecture diagram: each node runs Oracle Clusterware, ASM and a database instance with a listener and a SCAN listener; VIPs and services sit on the public network; all nodes attach to shared storage that holds the database and control files, the redo and archive logs of all instances, and the OCR and voting disks, managed by ASM.] This graphic focuses on the interconnect fabric. It is a network like any other, but dedicated to a single specific (private) purpose: cluster communication. People can get creative with their network design, with VLANs, various switches, and different topologies. Latency is the key factor to minimize, along with the reliability of the switches used for cluster communication.

8 Global Cache and Global Enqueue Service Processes and Functions
[Diagram: the SGA (library cache, buffer cache, dictionary cache, log buffer, Global Resource Directory) is shared by the foreground Oracle processes and the background processes DBW0, LGWR, the Global Enqueue Service (LMON, LMD0) and the Global Cache Service (LMSx, which runs at real-time priority); the instances communicate over the cluster private high-speed network.] From the point of view of process architecture, one or more block server processes, called LMS, handle the bulk of the message traffic. The LMS processes are Oracle background processes. When a shadow process makes a request for data, it sends a message directly to an LMS process on another node, which in turn returns either the data or a grant (permission to read from disk, or to write to the data block) directly to the requester. The state objects used for globally cached data are maintained in the SGA and are accessed by all processes in an instance that need to maintain and manipulate global data consistently. LMS runs at real-time priority by default since 10gR2, because predictable scheduling is needed for predictable runtime cache fusion performance and broadcast-on-commit performance. VKTM is a new fatal background process that keeps updating a timer variable in the SGA; it considerably reduces the CPU overhead of getting timing information and needs to run in real time for correctness.
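As a rough illustration (not part of the original slide), these background processes can be listed from the data dictionary; the views are standard Oracle views, the filter is only a sketch:
-- GCS/GES background processes (LMON, LMD0, LMSn) on all instances.
-- GV$BGPROCESS lists background processes; PADDR <> '00' keeps the active ones.
SELECT inst_id, name, description
FROM   gv$bgprocess
WHERE  (name LIKE 'LMS%' OR name IN ('LMON', 'LMD0'))
AND    paddr <> HEXTORAW('00')
ORDER  BY inst_id, name;
-- Number of LMS (GCS server) processes configured per instance.
SELECT inst_id, value AS gcs_server_processes
FROM   gv$parameter
WHERE  name = 'gcs_server_processes';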

9 Cache Hierarchy: Local Cache, Global Cache and Disk
[Flow: scan the local buffer cache; if the block is not in the local cache, look it up in the global buffer cache; if it is cached there, transfer the buffer, otherwise grant access and read from disk.] The Global Cache Service (GCS) manages data cached in the buffer caches of all instances that are part of the database cluster. In conjunction with an IPC transport layer, it initiates and handles the memory transfers for write access (CURRENT) or read access (CR) for all block types (e.g. data, index, undo, headers), the globally managed access permissions to cached data, and the global state of the block. The GCS can determine if and where a data block is cached and forwards data requests to the appropriate instances. It minimizes the access time to data, as the response time on a private network is faster than a read from disk. The message protocol scales and involves at most 3 hops in a cluster of more than 2 nodes. In fact, the total number of messages is determined by the probability of finding the global state information for a data block on the local node or a remote node, and by whether the data is cached in the instance that also masters the global state for that data. Oracle RAC attempts to colocate buffered data and its global state as much as possible to minimize the impact of the message cost. Cache Fusion and the GCS constitute the infrastructure that allows scale-out of the database tier by adding commodity servers.
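To see how often this hierarchy resolves to a remote cache rather than to disk, the standard instance statistics can be compared; a minimal sketch (statistic names as found in V$SYSSTAT, not taken from the slide):
-- Blocks served from remote caches vs. blocks read from disk (cumulative since startup).
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('gc cr blocks received',
                'gc current blocks received',
                'physical reads');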

10 Global Cache Access
[Diagram: a shadow process sends a block request to an LMS process on the instance holding the block; if the block was changed and its redo has not yet been written, LMS posts LGWR to flush the redo before sending the block back to the requester. Immediate direct send: > 96%; log write and send: < 4%.] In its simplest case, a data request involves a message to the instance where the data block is cached. The request message is usually small, approximately 200 bytes. The requesting shadow process initiates the send and then waits until the response arrives. The message is sent to an LMS process on a remote instance, which receives the message, executes a handler, processes the message, and eventually sends either the data block or a grant message. The minimum roundtrip time involving an 8K data block is about 400 microseconds; the pure wire time consumes only an insignificant portion of the total. The key factors for performance are the times to send, receive and process the data, which makes the responsiveness of LMS under load a critical factor. The actual cost is determined by message propagation delay, IPC CPU, operating system scheduling, block server process load, and interconnect stability.

11 Basic Performance Facts
Global Cache access takes a few hundred microseconds (roundtrip). Data is served immediately from remote instances via the private, high-speed interconnect. Redo may have to be written to the log file before the send if the data was changed and has not been committed yet. Performance varies with the network infrastructure and network protocol. The maximum number of network hops is 3 messages, for clusters with more than 2 nodes, independent of total cluster size. The CPU cost per OLTP transaction depends on the locality of access, i.e. on the number of messages per transaction, and can be between 3-20% (empirically).
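A rough way to gauge locality of access is to relate global cache and enqueue messaging to the transaction rate; the sketch below uses cumulative V$SYSSTAT counters, so it is only meaningful when differenced between two points in time (or taken from AWR snapshots):
-- Approximate global cache/enqueue messages per committed transaction (cumulative).
SELECT ROUND(
         (SELECT SUM(value) FROM v$sysstat
          WHERE name IN ('gcs messages sent', 'ges messages sent'))
         /
         NULLIF((SELECT value FROM v$sysstat
                 WHERE name = 'user commits'), 0), 1) AS msgs_per_commit
FROM dual;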

12 Basic Performance Facts: Latency (UDP/GbE and RDS/IB )
Minimum roundtrip latency by block size (ms):
Block size    2K     4K     8K     16K
UDP/GbE       0.30   0.31   0.36   0.46
RDS/IB        0.12   0.13   0.16   0.20
RDS/IB also has a lower CPU cost relative to other protocols and network infrastructures. These are the minimum roundtrip latencies measured at low to medium load (50% CPU utilization). The processing cost is affected by several factors, as just explained; hot database blocks may incur extra processing cost in user space. Most average values presented in AWR reports come from a large distribution with some amount of variance, i.e. higher values can skew the average, and one often sees 1 or 2 ms average latency although the majority of accesses complete in less than 1 ms. The main purpose of this table is to serve as a reference for expected values. In 11g, latency probes for small and large messages allow you to correlate system load and average access time at run time; the results of the probes are stored in the AWR repository and can therefore be accessed and regressed easily. Actual interconnect latency is generally not the problem unless you have exceeded capacity or are experiencing errors.
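Rather than relying only on the averages in an AWR report, the wait-time distribution of the 2-way/3-way events can be checked directly; a small sketch against the standard event histogram view:
-- Distribution of global cache roundtrip times; most waits should fall in the <= 1 ms bucket.
SELECT event, wait_time_milli, wait_count
FROM   v$event_histogram
WHERE  event IN ('gc cr block 2-way', 'gc current block 2-way', 'gc cr block 3-way')
ORDER  BY event, wait_time_milli;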

13 Causes of Latency
CPU cycles for protocol processing and process scheduling make up 80% of the latency. LMS is a critical resource; process concurrency and context switching increase the CPU path length. Other influences on latency and CPU cost: load factors (CPU utilization), total bandwidth, network frame size, and NIC offload capabilities.

14 Private Interconnect
The network between the nodes of an Oracle RAC cluster MUST be private/dedicated to traffic between Oracle RAC nodes. Large (jumbo) frames are recommended for GbE: they avoid fragmentation and reassembly (8K / 1500 MTU = 6 fragments). Interconnect bandwidth should be tested with non-Oracle utilities (e.g. iperf); expect no packet loss at 75%-80% of bandwidth utilization. The private network is important for performance and stability, and its bandwidth needs to be kept exclusive to keep variation low. Dual-ported or multiple NICs are good to have for failover, but rarely needed for performance in OLTP systems, as the utilized bandwidth is usually lower than the total capacity of a GbE link. For DSS and DW environments it is very likely that the bandwidth of a single GbE NIC is not sufficient, so other options such as NIC bonding, IB or 10GbE should be considered. It is difficult to predict the actual interconnect requirements without historical data, so planning should include a large tolerance. For data shipping in OLTP and DSS, larger MTUs are more efficient because they reduce interrupt load, save CPU and avoid fragmentation, and therefore lower the probability of “losing blocks” when a fragment is dropped due to congestion control, buffer overflows in switches, or similar incidents in the IPC and network layers. Jumbo frames need to be supported by drivers, NICs and switches, and usually require a certain amount of additional configuration.
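One quick sanity check that the database is actually using the intended private network (and not falling back to the public one) is the cluster interconnect view; a minimal sketch:
-- Interfaces Oracle RAC is using for the interconnect and where that setting came from
-- (OCR, the cluster_interconnects parameter, or an OS dependent default).
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;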

15 Interconnect Bandwidth
Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP. DSS/DW systems should categorically be designed with > 1 Gb/sec capacity. Prediction of interconnect traffic is difficult; it depends on the transaction instruction length per message. Empirical rule of thumb: 1 Gb/sec per 32 CPU cores. InfiniBand and 10GbE are supported for scale-out. In most known OLTP configurations to date, the bandwidth of 1 GbE is sufficient. The actual utilization depends on the size of the cluster nodes in terms of CPU power, the number of nodes accessing the same data, and the size of the working set for the application. Most applications have good cache locality, and there are no increasing interconnect requirements when scaling the application out by adding cluster nodes and distributing the work over more instances or adding additional load. For small working sets that fit into a small percentage of the available global buffer cache, the interconnect traffic may increase while the set remains constant. The actual utilization is difficult to predict, but in most cases is no reason for concern in the OLTP world when it comes to providing adequate bandwidth; typical OLTP utilizations are usually much lower than the total available network capacity of 1 GbE. As a rule of thumb, a total disk IO rate of … IOs/sec in a cluster with 4 nodes will require about 7.5 MB/sec of network bandwidth, given that the IOs read data into the buffer cache and are not direct reads (for a read-mostly workload it will be only a small fraction of that as long as the read-mostly state is active). Direct reads and read-mostly (11g) do not require any messages for global cache synchronization. For DSS queries that use inter-instance communication between slaves, the size of the data sets and the distribution of work between query slaves suggest using multiple GbE NICs, 10GbE or IB; the rule of thumb here is that it is good design practice to provide more bandwidth than 1 GbE. For OLTP, a general rule is that if the number of CPUs in a cluster node exceeds 16-20, multiple NICs may be required to provide sufficient bandwidth.
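Actual interconnect traffic can be estimated from the block transfer statistics; the sketch below assumes an 8K block size and uses cumulative counters, so it should be differenced over an interval (AWR does this per snapshot):
-- Rough payload received by this instance via cache fusion block transfers, in MB since
-- startup (assumes 8K blocks; divide the delta between two samples by elapsed seconds).
SELECT ROUND(SUM(value) * 8192 / 1024 / 1024) AS gc_mb_received
FROM   v$sysstat
WHERE  name IN ('gc cr blocks received', 'gc current blocks received');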

16 Performance and Scalability of Applications and Database Design with RAC

17 General Scalability
DML-intensive OLTP workloads scale well if contention is low and the database/working set size scales (i.e. add a node when demand grows). Read-intensive workloads scale predictably and linearly: a bigger cache when adding more nodes, and faster read access to the global cache than to disk, hence less disk IO. If cluster size and database size growth are balanced, the system will perform and scale well.

18 Performance and Scaling in Application and Database Design
Response time impact: index contention on INSERTs when the index is right-growing, with system-generated “artificial” keys such as consecutive order numbers or “natural” keys such as dates; UPDATEs or DELETEs to rows in a small working set, e.g. session logging and tracking, first-in first-out queues, and the state of messages in queues; bulk INSERTs of large amounts of data; LOBs.

19 DML Contention and Serialization
[Diagram: modification-intensive operations on a small set of (cached) blocks create “busy blocks” in table T and index I across instances, e.g. INSERT INTO I with a sequence-generated key, or UPDATE T SET … WHERE the affected rows fall in blocks[1..n] and n is a small number.]

20 Performance and Scaling in Application and Database Design
CPU cost due to inter-instance messaging and non-linear scaling: in-memory databases whose working set spans multiple buffer caches, with frequent modifications and reads of recent modifications, or whose working set fits into the memory of one instance. Locality of access worsens when nodes are added and users are load balanced. Such workloads scale as long as sufficient CPU power is available, following a logarithmic scaling curve as the number of nodes increases.

21 DML on Small Working Set
Frequent modification of a non-scaling data set means blocks move around often. The working set could be cached in 1 instance but is modified on all instances. This is CPU intensive: when nodes are added, the rate of block transfers may increase because the working set does not scale with the cluster. RAC best practices have accumulated a wealth of knowledge learned from real-life environments; following those practices most of the time eliminates any tuning effort.

22 Read-intensive
[Diagram: two instances, each with a 32GB buffer cache, read a 64GB working set from disk; blocks are served either by disk transfer or by cache transfer.] Eventually all blocks are cached, giving a larger effective read cache, and in 11g no global cache messages are required.

23 Performance and Scalability
Good linear or near-linear scaling out of the box: IO- and CPU-intensive applications with large working sets and low proximity of access, such as self-service web applications (shopping carts etc.), CRM, document storage and retrieval, and business analytics and data warehousing.

24 Performance and Scalability
Partitioning or load direction may optimize performance for workloads with high proximity of access, e.g. adding to and removing from message queues (Advanced Queuing and Workflow), batch and bulk processes, order processing and inventory, and payroll processing.

25 Identifying Performance and Scaling Bottlenecks in Database Design
The Golden Rules: #1: For a first approximation, disregard read-mostly objects and focus on the INSERT-, UPDATE- and DELETE-intensive indexes and tablespaces. #2: If DML access to data is random, there is nothing to worry about as long as CPU is not an issue. #3: Standard SQL and schema tuning solves > 80% of performance problems; there are usually only a few problem SQL statements and tables. #4: Almost everything can be scaled out quickly with load direction and load balancing. Corollary 1: DML and data modification are more restrictive than reads and take longer to process. Corollary 2: Concurrent reads on modified data require consistent read generation and undo segment lookups from other nodes.

26 Identifying Performance and Scaling Bottlenecks in Database Design
Look for indexes with right-growing characteristics, i.e. keys comprising DATE columns or keys generated from sequence numbers. Find frequent updates of “small” and compact tables (“small” = fits into a single buffer cache). Identify frequently and concurrently modified LOBs.
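Such candidates usually show up directly in the segment statistics; a sketch of a query against the standard view (statistic names as documented for 10g/11g):
-- Segments with the most global cache contention and transfers in this instance.
SELECT * FROM (
  SELECT owner, object_name, object_type, statistic_name, value
  FROM   v$segment_statistics
  WHERE  statistic_name IN ('gc buffer busy',
                            'gc cr blocks received',
                            'gc current blocks received')
  ORDER  BY value DESC
)
WHERE ROWNUM <= 20;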

27 How? Look at segment and SQL statistics in the Automatic Workload Repository. Use Oracle Enterprise Manager to access the advisories and the Automatic Database Diagnostic Monitor (ADDM). Instrumentation with MODULE and ACTION helps identify and quantify components of the workload.
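Instrumentation is typically done through DBMS_APPLICATION_INFO (or the equivalent OCI/JDBC attributes); a minimal sketch with hypothetical module and action names:
-- Tag the work done by this session so AWR/ASH can break it down by MODULE and ACTION.
BEGIN
  DBMS_APPLICATION_INFO.SET_MODULE(module_name => 'ORDER_ENTRY',    -- hypothetical module
                                   action_name => 'INSERT_ORDER');  -- hypothetical action
END;
/
-- ... run the unit of work, then clear or change the action:
BEGIN
  DBMS_APPLICATION_INFO.SET_ACTION(action_name => NULL);
END;
/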

28 Quick Fixes Without Modifying the Application
Indexes with right-growing characteristics: cache sequence numbers per instance; hash- or range-partition the table with LOCAL indexes. Frequent updates of “small” and compact tables: reduce the block size (2K) and the row density of the blocks (PCTFREE 99). Frequently modified LOBs: hash partitions (128-256) and FREEPOOLS. These are well-known best practices covered in many notes, papers and presentations; a DDL sketch follows below.
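A sketch of what these fixes can look like in DDL; all object names are hypothetical and the partition counts and settings are examples, not values from the slide:
-- Right-growing index: spread inserts by caching sequence values per instance
-- and by hash partitioning the table with LOCAL indexes.
ALTER SEQUENCE order_seq CACHE 10000 NOORDER;
CREATE TABLE orders (
  order_id   NUMBER       NOT NULL,
  order_date DATE         NOT NULL,
  status     VARCHAR2(10)
)
PARTITION BY HASH (order_id) PARTITIONS 16;
CREATE INDEX orders_id_ix ON orders (order_id) LOCAL;
-- Small, frequently updated table: lower the row density per block
-- (a 2K block size additionally needs a 2K tablespace and db_2k_cache_size).
CREATE TABLE session_tracking (
  session_id NUMBER,
  last_seen  DATE
) PCTFREE 99 PCTUSED 1;
-- Frequently modified LOBs: hash partition and use multiple freepools.
CREATE TABLE documents (
  doc_id NUMBER,
  body   CLOB
)
LOB (body) STORE AS (FREEPOOLS 4)
PARTITION BY HASH (doc_id) PARTITIONS 128;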

29 Quick Fixes
Application modules that may not scale or cannot be quickly reorganized can be directed to particular nodes via cluster-managed services. For administrator-managed databases and older releases, create the service with 1 preferred node and the rest available; for policy-managed databases, use a singleton service. Some large-scale, high-performance applications may be optimized by data partitioning (range, hash, or composite) and by routing per partitioning key in the application server tier, e.g. hashing by CLIENT_ID, REGION etc., as sketched below.
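For the data-partitioning approach, a hypothetical sketch of a hash-partitioned table keyed on CLIENT_ID, which the application tier could also use as its routing key:
-- Hash partitioning by the routing key; the application server routes requests
-- for a given CLIENT_ID hash range to a preferred instance or service.
CREATE TABLE billing (
  client_id  NUMBER       NOT NULL,
  item_id    NUMBER       NOT NULL,
  amount     NUMBER(12,2),
  created_at DATE
)
PARTITION BY HASH (client_id) PARTITIONS 32;
CREATE INDEX billing_client_ix ON billing (client_id, item_id) LOCAL;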

30 Leverage Connection Pools UCP: Load Balancing and Affinity
[Diagram: the application’s UCP connection pool spreads work across three RAC instances (30%, 60% and 10% of the work) according to how busy each instance reports itself to be.] Configure the load balancing advisory (LBA) via the database service; there are no client-side knobs to enable runtime connection load balancing (RCLB). Pool gravitation is gradual and depends on user requests and connection distribution, which prevents oscillation. Integrated with Fast Connection Failover (FCF).

31 Performance and Scalability Enhancements in 11.1 and 11.2
Read Mostly: an automatic policy detects read- and disk-IO-intensive tables; no interconnect messages are sent when the policy kicks in, which saves CPU. Direct reads for large (serial and parallel) scans: no locks and no buffer cache contention; good when the IO subsystem is fast or IO processing is offloaded to storage caches or servers (e.g. Exadata). Fusion compression reduces message sizes and therefore CPU cost. Dynamic policies trade off disk IO against global cache transfers.

32 Performance Diagnostics and Checks: Metrics and Method

33 Normal Behaviour
It is normal to see time consumed in: CPU; db file sequential/scattered read; direct read; gc cr/current block 2-way/3-way (transfer from a remote cache); gc cr/current grant 2-way (correlates with buffered disk IOs). Average latencies should be within baseline parameters. Most problems boil down to CPU, IO or network capacity, or to application issues.

34 Normality, Baselines and Significance
[AWR Top Timed Events excerpt: the most significant response time components are db file sequential read, CPU time, gc current block 3-way, gc buffer busy acquire, gc current block 2-way and gc current block busy; the averages of the 2-way/3-way transfers should be < 1 ms, while the busy events indicate contention.] GC waits are influenced by interconnect or remote effects that are not always obvious.

35 Distributed Cause and Effect
Example: cluster-wide impact of a log file IO problem. [Diagram: the root cause, a disk or controller bottleneck (disk capacity), sits on one node, while the symptoms surface on the other node.]

36 Global Metrics View
Local symptom (WORKLOAD REPOSITORY report for instance OOW8, host oowdb8): high waits on gc current block busy. Cause (WORKLOAD REPOSITORY report for instance OOW4, host oowdb4): slow log file parallel write, a high average global cache current block flush time, and a high busy percentage for data blocks in the Global Cache Transfer Stats of the remote instance. The root cause is often not on the node where the symptom is observed.

37 Investigate Serialization
[AWR excerpt: waits on gc buffer busy and gc current block busy with high average times - not OK! The Global Cache Transfer Stats show a high percentage of busy data blocks received from instances 4 and 7, and a high average global cache current block flush time, pointing to log file IO.]

38 Example: Segment Statistics
[AWR excerpt: ES_BILLING tops both “Segments by Global Cache Buffer Busy” and “Segments by Current Blocks Received”.] ANALYSIS: table ES_BILLING is frequently read and modified on all nodes; the majority of global cache accesses and serialization can be attributed to it.

39 Comprehensive Cluster-wide Analysis via Global ADDM
Courtesy of Cecilia Gervasio, Oracle Server Technologies, Diagnostics and Manageability

40 Common Problems and Symptoms

41 Common Problems and Symptoms
Interconnect or switch problems; slow or bottlenecked disks; high log file sync latency; system load and scheduling. In the following slides we present the most common issues you are likely to encounter with RAC and the global cache, their symptoms and possible solutions, and a guideline on how to diagnose the different problems. A highly visible issue in 10g is the loss of messages due to network errors or congestion; these problems are usually visible as “lost blocks”. The disk subsystem may impact performance in RAC significantly: loads such as queries scanning large amounts of data, backups and other concurrent load may affect the same disks or disk groups and cause bottlenecks. When these extra loads run on a particular node, other nodes may be affected even though they may not show any particular symptoms except for higher average log write and disk read times. A high CPU utilization or context switching load can affect the performance of the global cache by adding run queue wait time to the access latencies; it is important to ensure that the LMS processes can run predictably and that interconnect messages and clusterware heartbeats can be processed predictably. Avoiding negative feedback when the servers slow down under load and existing connections are busy is an important best practice: unconstrained dumping of new connections onto the database instance can aggravate a performance issue and render a system unstable. Application contention, such as frequent access to the same blocks, can cause serialization on latches, in the buffer cache of an instance, and in the global cache; if the serialization is on globally accessed data, the response time impact can be significant. When such symptoms become dominant, regular application and schema tuning will take care of most of these bottlenecks. Unexpectedly high latencies for data access should be rare, but can occur in some cases of network configuration problems, high system load, process spins or other extreme events.

42 Symptoms of Interconnect Problems
[Diagram: the root cause, dropped packets, leads to congestion and hitting the capacity limit, which shows up as high latencies and serialization.]

43 Symptoms of an Interconnect Problem
[AWR Top 5 Timed Events excerpt: log file sync (Commit), gc buffer busy (Cluster), gc cr block busy (Cluster), gc cr block lost (Cluster) and cr request retry (Other).] These are the symptoms of a misconfigured, faulty, or saturated interconnect or of exhausted IPC resources. The lost-block and retry events should never appear here; they always indicate a severe performance problem.
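The lost-block symptom can also be confirmed from the instance statistics on every node; a minimal sketch (the LIKE pattern is deliberately broad because statistic names vary slightly between releases):
-- Any non-zero, growing value here points at the interconnect / IPC layer.
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name LIKE 'gc%lost%'
ORDER  BY inst_id, name;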

44 Interconnect or IPC problems
[Diagram: problems can surface at every layer of the stack. Applications: Oracle “gc blocks lost”. UDP: packet receive errors, socket buffer overflows. Protocol processing (IP, UDP), checked with netstat -s: fragments dropped after timeout, reassembly failures, incoming packets discarded. Device drivers, checked with ifconfig -a: TX errors and drops, RX errors and drops, overruns. Below that: NIC1/NIC2, ports, queues and the switch.]

45 Causes and Diagnostics
ifconfig -a:
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:  Bcast:  Mask:
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:  errors:135  dropped:0  overruns:0  frame:95
      TX packets:  errors:0  dropped:27  overruns:0  carrier:0
$ netstat -s
Ip:
    total packets received
    1201 fragments dropped after timeout
    3384 packet reassembles failed
The number of LMS processes depends on the CPU count; the default value is max(2, #CPU/2) but no more than 10.

46 Cluster-wide Impact of a Database File IO Problem
[Diagram: IO-intensive queries on one node saturate disk capacity and create a disk or controller bottleneck, the root cause of the slowdown observed on the other node.]

47 Cluster-Wide Disk I/O Impact
[AWR excerpts: on node 1, the top timed events are log file sync, gc buffer busy and gc cr block busy; on node 2, the load profile shows very high redo, logical read and physical read rates per second.] CAUSE: an expensive query on node 2 causes an IO bottleneck. 1. IO on the disk group containing the redo logs is slow. 2. Block shipping for frequently modified blocks is delayed by the log flush IO; busy events mean serialization. Shipping blocks between instances can be impacted by waiting for the log flush; here the log flush is waiting for the I/O and is stalled by the heavy query on node 2. 3. Serialization builds up. NOTE: you must look at the entire cluster; this is easy with ADDM in 11g.

48 Log File Sync Latency: Causes and Symptoms
Courtesy of Vinay Srihari, Oracle Server Technologies, Recovery

49 Causes of High Commit Latency
Symptoms of slow log writes: the I/O service time spike may last only seconds or minutes; a threshold-based warning message appears in the LGWR trace file (“Warning: log write elapsed time xx ms, size xxKB”), dumped when the write latency is >= 500 ms; a large log_buffer makes a bad situation worse. Fixes: smooth out the log file IO on the primary system and the standby redo apply I/O pattern; the primary and standby storage subsystems should be configured for peaks; apply the bug fixes listed in the appendix. Courtesy of Vinay Srihari, Oracle Server Technologies, Recovery

50 Block Server Process Busy or Starved
[Diagram: the root cause on the serving node is too few LMS processes, LMS not running at high priority, or memory problems (swapping); the waits appear on the requesting node.]

51 Block Server Process Busy or Starved
[AWR Top 5 Timed Events excerpt: gc cr grant congested, gc current block congested, gc cr grant 2-way, gc current block 2-way and gc buffer busy, all in the Cluster wait class; on the remote node, the average message sent queue time (ms) is elevated.] “Congested” means that LMS could not dequeue messages fast enough.

52 Block Server Processes Busy
Increase the number of LMS processes based on the occurrence of “congested” wait events. Heuristics: 75-80% busy is OK; an average send queue time > 1 ms is not. Caveat: the number of CPUs should always be >= the number of LMS processes to avoid starvation. On NUMA architectures and CMT, bind LMS to a NUMA board or to cores in a processor set, and fence off hardware interrupts from those processor sets. As seen earlier, a lot of cycles for block access are actually spent in the OS on process wakeup and scheduling as well as network stack processing. The LMS (block server) processes are a crucial component and should always be scheduled immediately when they need to run; on a very busy system with many concurrent processes, the system load may affect how predictably LMS can be scheduled. The default number of LMS processes is based on the number of available CPUs, and the goal is to minimize their number to keep individual LMS processes busy; fewer LMS processes have the additional advantage of allowing better message aggregation and therefore more CPU-efficient processing. The default is computed as MIN(MAX(1/4 * cpu_count, 2), 10), i.e. a quarter of the number of CPUs but no more than 10 and no less than 2 (with only 1 CPU on the system it is 1); so with fewer than 8 CPUs or cores per node you still get a minimum of 2 LMS processes. You can use the gcs_server_processes parameter to change the number of LMS processes. In 10gR2, significant waits on events like gc cr block congested or gc current block congested likely mean that the LMS processes were starved for CPU. Depending on the size of the buffer cache, multiple LMS processes can speed up instance reconfiguration, recovery and startup; this should be borne in mind when configuring machines with large SGAs - with a large buffer cache you want more than one LMS, especially if you want fast failover. On most platforms, the block server processes run at high priority by default in order to minimize delays due to scheduling; the priority for LMS is set at startup time.
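If the diagnosis points at too few LMS processes, the count is adjusted with the gcs_server_processes parameter; it is static, so the sketch below writes it to the spfile and takes effect only after an instance restart (the value 4 is only an example):
-- Check the current setting on all instances.
SELECT inst_id, value FROM gv$parameter WHERE name = 'gcs_server_processes';
-- Raise the number of LMS processes cluster-wide (requires restart; keep it <= #CPUs).
ALTER SYSTEM SET gcs_server_processes = 4 SCOPE = SPFILE SID = '*';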

53 High Latencies in Global Cache

54 High Latencies
[AWR excerpt: gc cr block 2-way and gc current block 2-way show average wait times above 1 ms.] It is unexpected to see averages > 1 ms (the average should be around 1 ms or less). Additional diagnostics: V$SESSION_WAIT_HISTOGRAM for these events; check the network configuration (is it private? is there enough bandwidth?); check for high CPU consumption and for runaway or spinning processes. It is not always the busy events that matter most: a transfer should take less than a millisecond, and here we see high latency on the transfer of blocks. This is not really a RAC problem; RAC is the victim of either a network problem or high system load.

55 Transient Problems and Hangs

56 Temporary Slowness and Hang
Can affect one or more instances in the cluster. Can be related to IO issues at log switch time (checkpoint or archiver slow), a process stuck waiting for IO, or a connection storm. Causality is hard to establish with AWR statistics; use Oracle Enterprise Manager and Active Session History.

57 Temporary Cluster Wait Spike
Spike in global cache response time; SQL with high global cache wait time. Courtesy of Cecilia Gervasio, Oracle Server Technologies, Diagnostics and Manageability

58 Active Session History
[Diagram: session state objects (V$SESSION, V$SESSION_WAIT) are sampled into a circular buffer in the SGA (2MB per CPU), exposed as V$ACTIVE_SESSION_HISTORY; MMON Lite (MMNL) writes 1 out of 10 samples to AWR (DBA_HIST_ACTIVE_SESS_HISTORY) every hour or when the buffer runs out of space, using direct-path INSERTs and variable-length rows.] In 10.2, ASH can identify the local blocker for a hang; in 11g it can identify the global blocker. Courtesy of Graham Wood, Oracle Server Technologies, Architect
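Outside of Enterprise Manager, the in-memory ASH buffer can also be queried directly; a small sketch that breaks down the last 30 minutes of cluster-class samples by wait event and SQL_ID:
-- Which events and statements dominated Cluster waits recently (in-memory ASH).
SELECT event, sql_id, COUNT(*) AS samples
FROM   gv$active_session_history
WHERE  sample_time > SYSTIMESTAMP - INTERVAL '30' MINUTE
AND    wait_class  = 'Cluster'
GROUP  BY event, sql_id
ORDER  BY samples DESC;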

59 Temporary Slowness or Hang
Example: for a slowdown from 5:00-5:30, generate an Active Session History report for that interval with $ORACLE_HOME/rdbms/admin/ashrpt.sql.

60 Additional Diagnostics
For any slowdown with high averages in gc wait time: run an Active Session History report on all nodes; set the event on selected processes (“trace name context forever, level 7”) and collect the trace files, or set the event system-wide (threshold based, i.e. no cost); collect continuous OS statistics with Cluster Health Monitor (IPD/OS); review the LMS, LMD and LGWR trace files, the DIA0 trace files, and the hang analysis.

61 Conclusions

62 Golden Rules For Performance and Scalability in Oracle RAC
Thorough configuration and testing of the infrastructure is the basis for stable performance. Anticipating application and database bottlenecks and their possibly magnified impact in Oracle RAC is relatively simple. Enterprise Manager provides monitoring and quick diagnosis of cluster-wide issues. Basic intuitive and empirical guidelines for approaching performance problems suffice for all practical purposes.

63 Q & A

64 Visit us in the Moscone West Demogrounds, Booth W-037
Recommended sessions: Tuesday, October 13, 1:00 PM - Next Generation Database Grid, Moscone South 104. Tuesday, October 13, :30 PM - Single Instance Oracle Real Application Clusters: Better Virtualization for Databases, Moscone South 300. Wednesday, October 14, :45 AM - Understanding Oracle Real Application Clusters Internals, Moscone South 104. Thursday, October 15, 9:00 AM - Oracle ACFS: The Awaited Missing Feature, Moscone South 305.

65 Appendix

66 References

67 Log File Sync Issues
Missed post from LGWR to the foreground: the commit then depends on the log file sync timeout value (the timeout was 1s; it is 100ms in 11.2 and in backports, with one-offs/bundles available for 10.2.X). Broadcast On Commit (BOC) ack delays: missed post of LGWR by the BOC ack receiver (LMS), or incorrect bookkeeping of multiple outstanding acks. SYNC standby log write latency bugs.

68

