1 Diagnosing the “Bottlenecks” in your Streams Environment. Brian Keating and Chris Lawson. May 17, 2007.
2 What’s our Goal for Today? “We will divulge some of the complications of Streams, and show you how to deal with them.”
3 Our Agenda
Overview of Streams replication
What are the parts of Streams? How does it work?
Troublesome parts of Streams
Troubleshooting techniques
A fun trivia question
Types of bottlenecks & how to resolve them
Some helpful SQL scripts
Questions?
4 A Few Caveats
Unless otherwise stated, our examples assume Oracle 10g, Release 2.
Streams set-up is not the real problem; thus we focus on monitoring and troubleshooting.
Set-up is documented in the Oracle 10g Release 2 Streams Concepts and Administration guide.
Also, the subject of “conflict resolution” is a complex subject of its own, and will not be covered.
5 Is Streams Difficult or Easy? How difficult is it to figure out all the pieces and keep Streams working well? Here’s one opinion . . .
6 Gosh, It’s Really Easy! Your Life will be so Peaceful. “You don’t need rocket scientists on your staff!” * Marketing executive, on the simplicity of Streams/CDC.
8 Streams is like a Whirlpool. “Is Streams really a proven product?” * DBA/Programmer, noting the numerous bug fixes.
9 Streams Replication: A View from 10,000 Feet
Streams propagates both DML & DDL changes to another database (or elsewhere in the same database).
Streams is based on LogMiner, which has been around for years.
You decide, via “rules,” which changes are replicated.
You can optionally “transform” data before you apply it at the destination.
10 Overview of Streams: Three Main Processes
Capture (c0, c1), Propagate (j00, j01), and Apply (a001) background processes.
Note: “p” processes are also used in capture & apply.
11 What’s in the Redo Log? LogMiner Processing
LogMiner continuously reads DML & DDL changes from the redo logs.
It converts those changes into logical change records (LCRs).
There is at least 1 LCR per row changed.
The LCR contains the actual change, as well as the original data.
12 What’s in the Redo Log?
Recall that the change records in the redo log buffer are written to disk after anyone’s commit, not just your own.
This means that changes will often be captured and propagated (but not applied) even though you haven’t committed!
13 The Capture Process: Capture Change
(Capture Change → Generate LCR → Enqueue Message)
The capture change step reads changes from the redo logs.
All changes (DML and DDL) in the redo logs are captured, regardless of whether Streams is configured to propagate any given change.
Observe the “capture_message_number” value (in v$streams_capture). It regularly increments, even if there are no application transactions.
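The increment described above is easy to watch for yourself. A minimal sketch, using columns from the 10g v$streams_capture view:

```sql
-- Run this a few times, a minute or so apart:
-- capture_message_number should keep climbing even on an
-- idle system, because LogMiner scans all redo.
select capture_name,
       state,
       capture_message_number
from   v$streams_capture;
```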
14 The Capture Process: Generate LCR
(Capture Change → Generate LCR → Enqueue Message)
First, examine the change & determine whether Streams is configured to handle it.
If so, convert it into one or more “logical change records”, or “LCRs”.
An LCR only contains changes for one row. DML statements that affect multiple rows will cause multiple LCRs to be generated.
So, if one statement updates 1,000 rows, at least 1,000 LCRs will be created!
16 The Capture Process: Enqueue Message
Example of the queue background processes:

select program, sid
from   v$session
where  upper(program) like '%(Q%'
and    osuser = 'oracle'
order  by 1;

PROGRAM   SID
(QMNC)
(q000)
(q001)
(q002)
(q003)
(q004)
17 The Propagate Process
Source Queue → Propagate Process → Destination Queue
Streams copies LCRs from the source queue to the destination queue.
These transfers are done by the j00n background processes.
How do we know it’s the “j” processes? Let’s look at session statistics and see who’s doing all the db link communicating.
18 The Propagate Process: Who’s doing the Sending?

select sid, name, value
from   v$sesstat one, v$statname two
where  one.statistic# = two.statistic#
and    name like '%link%'
and    value > 1000;

SID  NAME                                     VALUE
467  bytes sent via SQL*Net to dblink         E+10
947  bytes sent via SQL*Net to dblink         E+10
467  bytes received via SQL*Net from dblink
947  bytes received via SQL*Net from dblink
467  SQL*Net roundtrips to/from dblink
947  SQL*Net roundtrips to/from dblink

SIDs 467 and 947 are indeed the “j00” processes.
19 The Propagate Process: What’s in the Queue?
Sample of the queue content:

select queue_name, num_msgs
from   v$buffered_queues;

QUEUE_NAME       NUM_MSGS
APP_NY_PROD_Q
CAP_LA_PROD_Q
20 The Apply Process: Queue and Dequeue
LCRs usually wait in the destination queue until their transaction is committed or rolled back.
There are some exceptions, due to queue “spilling” (covered later).
The LCRs also remain on the capture queue until they are applied at the destination.
21 Commits & Rollbacks
When a transaction commits, Streams sends over a single commit LCR.
After the destination applies the LCRs, it sends back a confirming LCR, indicating that it’s okay for the capture queue to remove those LCRs.
For a rollback, the source sends 1 rollback LCR for each LCR in that transaction.
And now to the troubleshooting . . .
22 Trivia Question
Question: how does Streams avoid “infinite loop” replication?
For example, after a change is applied at a destination database, why doesn’t Streams “re-capture” that change, and then propagate it back to the original source database?
The answer to this question will be provided later in the presentation.
23 LCR Concepts
An LCR is one type of Streams message, the type that is formatted by the capture process.
There are two types of LCRs: DDL LCRs and row LCRs.
A DDL LCR contains information about a single DDL change.
A row LCR contains information about a change to a single row.
As a result, transactions that affect multiple rows will cause multiple LCRs to be generated.
24 LCR Concepts: LOB Issues
Tables that contain LOB datatypes can cause multiple LCRs to be generated per row affected!
Inserts into tables with LOBs will always cause multiple LCRs to be generated per row.
Updates to those tables might cause multiple LCRs, if any of the LOB columns are updated.
Deletes will never generate multiple LCRs; deletes always generate 1 LCR per row.
25 LCR Concepts: Other Items to Note
All of the LCRs that are part of a single transaction must be applied by one apply server.
If a transaction generates 10,000 LCRs, then all 10,000 of them must be applied by just one apply server, no matter how many servers are running.
For each transaction, there is one additional LCR generated at the end of the transaction.
So if a single transaction deletes 2,000 rows, then 2,001 LCRs will be generated. This is important to note for the “spill threshold” value.
26 “Spill” Concepts
Normally, outstanding messages are held in buffered queues, which reside in memory.
In some cases, messages can be “spilled”; that is, they can be moved into tables on disk. There are 3 reasons:
1. The total number of outstanding messages is too large to fit in the buffered queue;
2. A given message has been in memory too long;
3. The number of messages in a single transaction is larger than the “LCR spill threshold” parameter.
27 “Spill” Concepts: Types of Spill Tables
There are two types of tables that can hold “spilled” messages:
“Queue” tables: each buffered queue has a “queue table” associated with it. Queue tables are created at the same time as buffered queues.
Name format for queue tables: AQ$_<name that you specified>_P
Example: AQ$_CAPTURE_QUEUE_TABLE_P
28 “Spill” Concepts: Types of Spill Tables (cont.)
The “spill” table: there is one (and only one) “spill” table in any database that uses Streams.
The name of the spill table is SYS.STREAMS$_APPLY_SPILL_MSGS_PART.
As the name implies, the spill table can only be used by apply processes, not by capture processes.
29 “Spill” Concepts: Where are the spilled messages?
Tables where spilled messages are placed:
If a message is spilled because the total number of outstanding messages is too large, then that message is placed in the associated queue table.
If a message has been held in memory too long, then that message is placed in the queue table.
If the number of LCRs in a single transaction exceeds the “LCR spill threshold”, then all of those LCRs are placed in the spill table, SYS.STREAMS$_APPLY_SPILL_MSGS_PART.
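A quick way to see whether either kind of spilling is occurring is simply to count rows in the two destinations just described. A minimal sketch; the queue-table name follows the AQ$_<queue table name>_P pattern, so the schema and name shown here are hypothetical and site-specific:

```sql
-- Spill caused by queue size or message age (queue table):
select count(*) from strmadmin.aq$_capture_queue_table_p;

-- Spill caused by exceeding the LCR spill threshold
-- (the one-and-only apply spill table):
select count(*) from sys.streams$_apply_spill_msgs_part;
```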
30 Troubleshooting
Two basic ways to troubleshoot Streams issues:
Query the internal Streams tables and views;
Search the alert log for messages. (This is particularly useful for capture process issues.)
31 Capture Process Tables and Views
streams$_capture_process: lists all defined capture processes
dba_capture: basic status, error info
v$streams_capture: detailed status info
dba_capture_parameters: configuration info
32 Propagate Process Tables and Views
streams$_propagation_process: lists all defined propagate processes
dba_propagation: basic status, error info
v$propagation_sender: detailed status
v$propagation_receiver: detailed status
33 Apply Process Tables and Views
streams$_apply_process: lists all defined apply processes
dba_apply: basic status, error info
v$streams_apply_reader: status of the apply reader
v$streams_apply_server: status of apply server(s)
v$streams_apply_coordinator: overall status, latency info
dba_apply_parameters: configuration info
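The latency info in v$streams_apply_coordinator can be pulled out directly. A minimal sketch; the HWM columns are DATEs, so their difference is in days and must be scaled to seconds:

```sql
-- Rough apply latency: when the coordinator's high-water-mark
-- message was applied, minus when it was created at the source.
select apply_name,
       state,
       round((hwm_time - hwm_message_create_time) * 86400)
         as latency_secs
from   v$streams_apply_coordinator;
```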
34 “Miscellaneous” Tables and Views
v$buffered_queues: view that displays the current and cumulative number of messages, for each buffered queue.
sys.streams$_apply_spill_msgs_part: table that the apply process uses to “spill” messages from large transactions to disk.
system.logmnr_restart_ckpt$: table that holds capture process “checkpoint” information.
35 Types of Bottlenecks
Two main types of “bottlenecks”:
Type 1: Replication is completely stopped; i.e., no changes are being replicated at all.
Type 2: Replication is running, but it is slower than the rate of DML; i.e., it is “falling behind”.
36 Capture Process Bottlenecks
Capture process bottlenecks typically have to do with a capture process being unable to read necessary online or (especially) archive logs.
This results in a “Type 1” bottleneck; that is, no changes will be replicated at all (because no changes can be captured).
Almost all of the “Type 1” bottlenecks that I have ever encountered have been due to archive log issues with capture processes.
37 Capture Process Bottlenecks: Capture “Checkpoints”
The capture process writes its own “checkpoint” information to its data dictionary tables.
This checkpoint info keeps track of the SCN values that the capture process has scanned. That information is used to calculate the “required_checkpoint_scn” value.
Capture process checkpoint information is primarily stored in the table SYSTEM.LOGMNR_RESTART_CKPT$.
38 Capture Process Bottlenecks: Capture “Checkpoints” (cont.)
By default, the capture process writes checkpoints very frequently, and stores their data for a long time. On a very write-intensive system, this can cause the LOGMNR_RESTART_CKPT$ table to become extremely large, very quickly.
The amount of data stored in that table can be modified with these capture process parameters:
_checkpoint_frequency: number of megabytes captured which will trigger a checkpoint
checkpoint_retention_time: number of days to retain checkpoint information
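Both knobs above are changed through DBMS_CAPTURE_ADM. A minimal sketch, where the capture process name and the two values shown are hypothetical:

```sql
begin
  -- Checkpoint only after ~1000 MB of redo has been mined:
  dbms_capture_adm.set_parameter(
    capture_name => 'STREAMS_CAPTURE_PROCESS',
    parameter    => '_checkpoint_frequency',
    value        => '1000');

  -- Retain checkpoint data for 7 days:
  dbms_capture_adm.alter_capture(
    capture_name              => 'STREAMS_CAPTURE_PROCESS',
    checkpoint_retention_time => 7);
end;
/
```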
39 Capture Process Bottlenecks: SCN Checks
When a capture process starts up, it calculates its “required_checkpoint_scn” value. This value determines which redo logs must be scanned before any new transactions can be captured.
The capture process needs to do this check in order to ensure that it does not “miss” any transactions that occurred while it was down.
As a result, when a capture process starts up, the redo log that contains that SCN value, and every subsequent log, must be present in the log_archive_dest directory (or in the online logs).
40 Capture Process Bottlenecks: SCN Checks (cont.)
If any required redo logs are missing during a capture process restart, then the capture process will permanently “hang” during its startup.
This issue will completely prevent the capture process from capturing any new changes.
If the required redo log(s) cannot be restored, then the only way to resolve this situation is to completely rebuild the Streams environment.
As a result, it is extremely important to “keep track” of the redo logs that the capture process needs, before deleting any logs.
41 Capture Process Bottlenecks: Which archive logs are needed?
The following query can be used to determine the oldest archive log that will need to be read during the next restart of a capture process.

select a.sequence#, b.name
from   v$log_history a, v$archived_log b
where  a.first_change# <= (select required_checkpoint_scn
                           from   dba_capture
                           where  capture_name = '<capture process name>')
and    a.next_change#  >  (select required_checkpoint_scn
                           from   dba_capture
                           where  capture_name = '<capture process name>')
and    a.sequence# = b.sequence#(+);

If no rows are returned from that query, then the SCN in question resides in an online log.
42 Capture Process Bottlenecks: Flow Control
In 10g, the capture process is configured with “automatic flow control”. This feature prevents the capture process from spilling many messages.
If a large number of messages build up in the capture process’s buffered queue, then it will “pause” capturing new messages until some messages are removed from the queue.
A message cannot be removed from the capture process’s queue until one of these items occurs:
The message is applied at the destination;
The message is spilled at the destination.
43 Propagate Process Bottlenecks
I have heard of a few “Type 2” propagate bottlenecks in some “extreme” environments.
Parameter changes to consider, to try to avoid propagate-related bottlenecks:
Set the propagate parameter LATENCY to 0
Set the propagate parameter QUEUE_TO_QUEUE to TRUE (10.2 only)
Set the init.ora parameter _job_queue_interval to 1
Set the init.ora parameter job_queue_processes to 4 or more
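The LATENCY change above is made on the propagation schedule, via DBMS_AQADM. A minimal sketch, where the queue name and destination database link are hypothetical:

```sql
begin
  -- latency => 0 tells the propagation job to send changes as
  -- soon as they arrive, instead of waiting for the default
  -- polling interval.
  dbms_aqadm.alter_propagation_schedule(
    queue_name  => 'STRMADMIN.CAPTURE_QUEUE',
    destination => 'DEST_DB.WORLD',
    latency     => 0);
end;
/
```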
44 Apply Process Bottlenecks
Apply process bottlenecks usually deal with the apply process not “keeping up” with the rate of DML being executed on the source tables.
I have generally only seen apply bottlenecks with “batch” DML (as opposed to “OLTP” DML).
The three main areas to be concerned with regarding apply process bottlenecks are:
Commit frequency;
Number of apply servers;
Other Streams parameters.
45 Apply Process Bottlenecks: Commit Frequency
The number of rows affected per transaction has an enormous impact on apply performance:
It should not be too large, due to spill, and to the “one apply server per transaction” restriction.
It should not be too small either, due to DML degradation, and to apply transaction overhead.
From my experience, the optimal commit frequency is around 500 rows.
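On the source side, batch jobs can be restructured to commit near that sweet spot. A minimal sketch in PL/SQL, where the table name and predicate are hypothetical:

```sql
-- Purge in transactions of ~500 rows each, so that no single
-- transaction floods one apply server with thousands of LCRs.
begin
  loop
    delete from app_history
    where  purge_flag = 'Y'
    and    rownum <= 500;
    exit when sql%rowcount = 0;
    commit;
  end loop;
  commit;
end;
/
```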
46 Apply Process Bottlenecks: Number of Apply Servers
The number of parallel apply servers also has a large impact on the apply process’s performance.
The number of parallel apply servers is set by the apply “PARALLELISM” parameter.
From my experience, for best throughput, the PARALLELISM parameter should be three times the number of CPUs on the host machine.
Note: setting PARALLELISM that high can potentially “eat up” all of the CPU cycles on the host during intensive DML jobs.
47 Apply Process Bottlenecks: Other Streams Parameters
Several other Streams-related parameters can also have an effect on apply performance:
Apply process parameter _HASH_TABLE_SIZE: set this relatively high, to minimize “wait dependency” bottlenecks.
Apply process parameter TXN_LCR_SPILL_THRESHOLD: set this a little bit higher than the maximum number of LCRs generated per transaction, to prevent spill.
Init.ora parameter aq_tm_processes: set to 1.
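Apply parameters such as these (and PARALLELISM from the previous slide) are set with DBMS_APPLY_ADM.SET_PARAMETER. A minimal sketch, where the apply process name and both values are hypothetical:

```sql
begin
  dbms_apply_adm.set_parameter('STREAMS_APPLY_PROCESS',
                               'parallelism', '12');
  -- A shade above the largest expected LCR count per transaction:
  dbms_apply_adm.set_parameter('STREAMS_APPLY_PROCESS',
                               'txn_lcr_spill_threshold', '3000');
end;
/
```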
48 Apply Process Bottlenecks: “Turning Off” Streams
As mentioned previously, apply bottlenecks generally occur with “batch”-style DML.
It is possible to configure a session so that Streams will not capture any DML from that session; therefore, that session’s DML will not be propagated or applied.
This technique allows batch-style DML to be run at each Streams database individually, rather than running the batch DML in one database and then having that DML get replicated to all of the other databases.
49 Apply Process Bottlenecks: “Turning Off” Streams (cont.)
“Turning off” Streams in a session is done by setting the session’s “Streams tag” to a non-NULL value. Here is an example of turning Streams off before a batch statement, and then back on:

exec dbms_streams.set_tag (tag => hextoraw('01'));
<batch DML statement>
commit;
exec dbms_streams.set_tag (tag => NULL);

Note: this technique will only work if the “include_tagged_lcr” parameter, in the Streams capture rules, is set to FALSE. (FALSE is the default value for that parameter in Streams rules.)
Incidentally, tags are also the answer to our trivia question: an apply process marks the redo it generates with a non-NULL tag, so a capture process (with include_tagged_lcr set to FALSE) will not re-capture applied changes and loop them back to the source.
50 Apply Process Bottlenecks: Apply Process Aborts!
If there are any triggers on any replicated tables, then the apply process can sometimes abort, and not restart, during periods of intense DML activity.
These aborts happen most frequently when one particular apply server becomes “overloaded” and has to run too much recursive SQL (triggers cause recursive SQL).
This issue is caused by Oracle bug #<number>. The bug is apparently fixed in <release>, and there are “backport” patches available for <earlier releases>.
51 Streams “Spot Check” Script
The script called “ssc.sql” performs a “spot check” on a variety of Streams objects. Here is some sample output from it:

********** Streams report for dbprod on 03/28/ :33:54 **********
Capture process  : STREAMS_CAPTURE_PROCESS   CREATE LCR
Propagate process: STREAMS_PROPAGATE_PROCESS ENABLED
Apply reader     : STREAMS_APPLY_PROCESS     DEQUEUE MESSAGES
Apply server     : STREAMS_APPLY_PROCESS     EXECUTE TRANSACTION
Apply server     : STREAMS_APPLY_PROCESS     IDLE
Apply coordinator: STREAMS_APPLY_PROCESS     APPLYING   Latency: 2
Buffered queue   : STREAMS_CAPTURE_QUEUE     Messages:    Spill msgs: 21
Buffered queue   : STREAMS_APPLY_QUEUE       Messages:    Spill msgs: 0
Spill table rows :
Total streams pool memory :
Free streams pool memory :
*************** End of report ***************
52 Contact Information
Brian Keating:
Chris Lawson:
This presentation, and the script mentioned in it, are available for download on the Oracle Magician website.