1 Diagnosing the “Bottlenecks” in your Streams Environment. Brian Keating and Chris Lawson. May 17, 2007.
2 What’s our Goal for Today? “We will divulge some of the complications of Streams, and show you how to deal with them.”
3 Our Agenda
Overview of Streams replication
What are the parts of Streams? How does it work?
Troublesome parts of Streams
Troubleshooting techniques
A fun trivia question
Types of bottlenecks & how to resolve them
Some helpful SQL scripts
Questions?
4 A Few Caveats
Unless otherwise stated, our examples assume Oracle 10g, Release 2.
Streams set-up is not the real problem; thus we focus on monitoring and troubleshooting.
Set-up is documented in the Oracle 10g Release 2 Streams Concepts and Administration guide.
Also, the subject of “conflict resolution” is a complex subject of its own, and will not be covered.
5 Is Streams Difficult or Easy? How difficult is it to figure out all the pieces and keep Streams working well? Here’s one opinion . . .
6 Gosh, It’s Really Easy! Your Life will be so Peaceful. “You don’t need rocket scientists on your staff!” * Marketing executive, on the simplicity of Streams/CDC.
8 Streams is like a Whirlpool. “Is Streams really a proven product?” * DBA/Programmer, noting the numerous bug fixes.
9 Streams Replication: A View from 10,000 Feet
Streams propagates both DML & DDL changes to another database (or elsewhere in the same database).
Streams is based on LogMiner, which has been around for years.
You decide, via “rules,” which changes are replicated.
You can optionally “transform” data before you apply it at the destination.
10 Overview of Streams: Three Main Processes
Capture (c0, c1), Propagate (j00, j01), and Apply (a001) background processes.
Note: “p” processes are also used in capture & apply.
11 What’s in the Redo Log? LogMiner Processing
LogMiner continuously reads DML & DDL changes from the redo logs.
It converts those changes into logical change records (LCRs).
There is at least 1 LCR per row changed.
The LCR contains the actual change, as well as the original data.
12 What’s in the Redo Log?
Recall that the change records in the redo log buffer are written to disk after anyone’s commit, not just your own.
This means that changes will often be captured and propagated (but not applied) even though you haven’t committed!
13 The Capture Process: Capture Change
(Capture Change → Generate LCR → Enqueue Message)
The capture change step reads changes from the redo logs.
All changes (DML and DDL) in the redo logs are captured, regardless of whether Streams is configured to propagate any given change.
Observe the “capture_message_number” value (in v$streams_capture). It regularly increments, even if there are no application transactions.
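The increment described above is easy to watch for yourself. A minimal sketch, using columns from the 10g v$streams_capture view:

```sql
-- Run this a few times, a minute or so apart:
-- capture_message_number should keep climbing even on an
-- idle system, because LogMiner scans all redo.
select capture_name,
       state,
       capture_message_number
from   v$streams_capture;
```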
14 The Capture Process: Generate LCR
(Capture Change → Generate LCR → Enqueue Message)
First, examine the change & determine whether Streams is configured to handle it.
If so, convert it into one or more “logical change records”, or “LCRs”.
An LCR only contains changes for one row. DML statements that affect multiple rows will cause multiple LCRs to be generated.
So, if one statement updates 1,000 rows, at least 1,000 LCRs will be created!
16 The Capture Process: Enqueue Message
Example of the queue background processes:

select program, sid
from   v$session
where  upper(program) like '%(Q%'
and    osuser = 'oracle'
order  by 1;

PROGRAM   SID
(QMNC)
(q000)
(q001)
(q002)
(q003)
(q004)
17 The Propagate Process
Source Queue → Propagate Process → Destination Queue
Streams copies LCRs from the source queue to the destination queue.
These transfers are done by the j00n background processes.
How do we know it’s the “j” processes? Let’s look at session statistics and see who’s doing all the db link communicating.
18 The Propagate Process: Who’s doing the Sending?

select sid, name, value
from   v$sesstat one, v$statname two
where  one.statistic# = two.statistic#
and    name like '%link%'
and    value > 1000;

SID  NAME                                     VALUE
467  bytes sent via SQL*Net to dblink         E+10
947  bytes sent via SQL*Net to dblink         E+10
467  bytes received via SQL*Net from dblink
947  bytes received via SQL*Net from dblink
467  SQL*Net roundtrips to/from dblink
947  SQL*Net roundtrips to/from dblink

SIDs 467 and 947 are indeed the “j00” processes.
19 The Propagate Process: What’s in the Queue?
Sample of the queue content:

select queue_name, num_msgs
from   v$buffered_queues;

QUEUE_NAME       NUM_MSGS
APP_NY_PROD_Q
CAP_LA_PROD_Q
20 The Apply Process: Queue and Dequeue
LCRs usually wait in the destination queue until their transaction is committed or rolled back.
There are some exceptions, due to queue “spilling” (covered later).
The LCRs also remain on the capture queue until they are applied at the destination.
21 Commits & Rollbacks
When a transaction commits, Streams sends over a single commit LCR.
After the destination applies the LCRs, it sends back a confirming LCR, indicating that it’s okay for the capture queue to remove those LCRs.
For a rollback, the source sends 1 rollback LCR for each LCR in that transaction.
And now to the troubleshooting . . .
22 Trivia Question
Question: how does Streams avoid “infinite loop” replication?
For example, after a change is applied at a destination database, why doesn’t Streams “re-capture” that change, and then propagate it back to the original source database?
The answer to this question will be provided later in the presentation.
23 LCR Concepts
An LCR is one type of Streams message, the type that is formatted by the capture process.
There are two types of LCRs: DDL LCRs and row LCRs.
A DDL LCR contains information about a single DDL change.
A row LCR contains information about a change to a single row.
As a result, transactions that affect multiple rows will cause multiple LCRs to be generated.
24 LCR Concepts: LOB Issues
Tables that contain LOB datatypes can cause multiple LCRs to be generated per row affected!
Inserts into tables with LOBs will always cause multiple LCRs to be generated per row.
Updates to those tables might cause multiple LCRs, if any of the LOB columns are updated.
Deletes will never generate multiple LCRs; deletes always generate 1 LCR per row.
25 LCR Concepts: Other Items to Note
All of the LCRs that are part of a single transaction must be applied by one apply server.
If a transaction generates 10,000 LCRs, then all 10,000 of them must be applied by just one apply server, no matter how many servers are running.
For each transaction, there is one additional LCR generated at the end of the transaction.
So if a single transaction deletes 2,000 rows, then 2,001 LCRs will be generated. This is important to note for the “spill threshold” value.
26 “Spill” Concepts
Normally, outstanding messages are held in buffered queues, which reside in memory.
In some cases, messages can be “spilled”; that is, they can be moved into tables on disk. There are 3 reasons:
1. The total number of outstanding messages is too large to fit in the buffered queue;
2. A given message has been in memory too long;
3. The number of messages in a single transaction is larger than the “LCR spill threshold” parameter.
27 “Spill” Concepts: Types of Spill Tables
There are two types of tables that can hold “spilled” messages:
“Queue” tables: each buffered queue has a “queue table” associated with it. Queue tables are created at the same time as buffered queues.
Name format for queue tables: AQ$_<name that you specified>_P
Example: AQ$_CAPTURE_QUEUE_TABLE_P
28 “Spill” Concepts: Types of Spill Tables (cont.)
The “spill” table: there is one (and only one) “spill” table in any database that uses Streams.
The name of the spill table is SYS.STREAMS$_APPLY_SPILL_MSGS_PART.
As the name implies, the spill table can only be used by apply processes, not by capture processes.
29 “Spill” Concepts: Where are the spilled messages?
Tables where spilled messages are placed:
If a message is spilled because the total number of outstanding messages is too large, then that message is placed in the associated queue table.
If a message has been held in memory too long, then that message is placed in the queue table.
If the number of LCRs in a single transaction exceeds the “LCR spill threshold”, then all of those LCRs are placed in the spill table, SYS.STREAMS$_APPLY_SPILL_MSGS_PART.
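A quick way to see whether either kind of spilling is occurring is simply to count rows in the two destinations just described. A minimal sketch; the queue-table name follows the AQ$_<queue table name>_P pattern, so the schema and name shown here are hypothetical and site-specific:

```sql
-- Spill caused by queue size or message age (queue table):
select count(*) from strmadmin.aq$_capture_queue_table_p;

-- Spill caused by exceeding the LCR spill threshold
-- (the one-and-only apply spill table):
select count(*) from sys.streams$_apply_spill_msgs_part;
```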
30 Troubleshooting
Two basic ways to troubleshoot Streams issues:
Query the internal Streams tables and views;
Search the alert log for messages. (This is particularly useful for capture process issues.)
31 Capture Process Tables and Views
streams$_capture_process: lists all defined capture processes
dba_capture: basic status, error info
v$streams_capture: detailed status info
dba_capture_parameters: configuration info
32 Propagate Process Tables and Views
streams$_propagation_process: lists all defined propagate processes
dba_propagation: basic status, error info
v$propagation_sender: detailed status
v$propagation_receiver: detailed status
33 Apply Process Tables and Views
streams$_apply_process: lists all defined apply processes
dba_apply: basic status, error info
v$streams_apply_reader: status of the apply reader
v$streams_apply_server: status of apply server(s)
v$streams_apply_coordinator: overall status, latency info
dba_apply_parameters: configuration info
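The latency info in v$streams_apply_coordinator can be pulled out directly. A minimal sketch; the HWM columns are DATEs, so their difference is in days and must be scaled to seconds:

```sql
-- Rough apply latency: when the coordinator's high-water-mark
-- message was applied, minus when it was created at the source.
select apply_name,
       state,
       round((hwm_time - hwm_message_create_time) * 86400)
         as latency_secs
from   v$streams_apply_coordinator;
```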
34 “Miscellaneous” Tables and Views
v$buffered_queues: view that displays the current and cumulative number of messages, for each buffered queue.
sys.streams$_apply_spill_msgs_part: table that the apply process uses to “spill” messages from large transactions to disk.
system.logmnr_restart_ckpt$: table that holds capture process “checkpoint” information.
35 Types of Bottlenecks
Two main types of “bottlenecks”:
Type 1: Replication is completely stopped; i.e., no changes are being replicated at all.
Type 2: Replication is running, but it is slower than the rate of DML; i.e., it is “falling behind”.
36 Capture Process Bottlenecks
Capture process bottlenecks typically have to do with a capture process being unable to read necessary online or (especially) archive logs.
This results in a “Type 1” bottleneck; that is, no changes will be replicated at all (because no changes can be captured).
Almost all of the “Type 1” bottlenecks that I have ever encountered have been due to archive log issues with capture processes.
37 Capture Process Bottlenecks: Capture “Checkpoints”
The capture process writes its own “checkpoint” information to its data dictionary tables.
This checkpoint info keeps track of the SCN values that the capture process has scanned. That information is used to calculate the “required_checkpoint_scn” value.
Capture process checkpoint information is primarily stored in the table SYSTEM.LOGMNR_RESTART_CKPT$.
38 Capture Process Bottlenecks: Capture “Checkpoints” (cont.)
By default, the capture process writes checkpoints very frequently, and stores their data for a long time. On a very write-intensive system, this can cause the LOGMNR_RESTART_CKPT$ table to become extremely large, very quickly.
The amount of data stored in that table can be modified with these capture process parameters:
_checkpoint_frequency: number of megabytes captured which will trigger a checkpoint
checkpoint_retention_time: number of days to retain checkpoint information
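Both knobs above are changed through DBMS_CAPTURE_ADM. A minimal sketch, where the capture process name and the two values shown are hypothetical:

```sql
begin
  -- Checkpoint only after ~1000 MB of redo has been mined:
  dbms_capture_adm.set_parameter(
    capture_name => 'STREAMS_CAPTURE_PROCESS',
    parameter    => '_checkpoint_frequency',
    value        => '1000');

  -- Retain checkpoint data for 7 days:
  dbms_capture_adm.alter_capture(
    capture_name              => 'STREAMS_CAPTURE_PROCESS',
    checkpoint_retention_time => 7);
end;
/
```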
39 Capture Process Bottlenecks: SCN Checks
When a capture process starts up, it calculates its “required_checkpoint_scn” value. This value determines which redo logs must be scanned before any new transactions can be captured.
The capture process needs to do this check in order to ensure that it does not “miss” any transactions that occurred while it was down.
As a result, when a capture process starts up, the redo log that contains that SCN value, and every subsequent log, must be present in the log_archive_dest directory (or in the online logs).
40 Capture Process Bottlenecks: SCN Checks (cont.)
If any required redo logs are missing during a capture process restart, then the capture process will permanently “hang” during its startup.
This issue will completely prevent the capture process from capturing any new changes.
If the required redo log(s) cannot be restored, then the only way to resolve this situation is to completely rebuild the Streams environment.
As a result, it is extremely important to “keep track” of the redo logs that the capture process needs, before deleting any logs.
41 Capture Process Bottlenecks: Which archive logs are needed?
The following query can be used to determine the oldest archive log that will need to be read during the next restart of a capture process.

select a.sequence#, b.name
from   v$log_history a, v$archived_log b
where  a.first_change# <= (select required_checkpoint_scn
                           from   dba_capture
                           where  capture_name = '<capture process name>')
and    a.next_change#  >  (select required_checkpoint_scn
                           from   dba_capture
                           where  capture_name = '<capture process name>')
and    a.sequence# = b.sequence#(+);

If no rows are returned from that query, then the SCN in question resides in an online log.
42 Capture Process Bottlenecks: Flow Control
In 10g, the capture process is configured with “automatic flow control”. This feature prevents the capture process from spilling many messages.
If a large number of messages build up in the capture process’s buffered queue, then it will “pause” capturing new messages until some messages are removed from the queue.
A message cannot be removed from the capture process’s queue until one of these items occurs:
The message is applied at the destination;
The message is spilled at the destination.
43 Propagate Process Bottlenecks
I have heard of a few “Type 2” propagate bottlenecks in some “extreme” environments.
Parameter changes to consider, to try to avoid propagate-related bottlenecks:
Set the propagate parameter LATENCY to 0
Set the propagate parameter QUEUE_TO_QUEUE to TRUE (10.2 only)
Set the init.ora parameter _job_queue_interval to 1
Set the init.ora parameter job_queue_processes to 4 or more
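The LATENCY change above is made on the propagation schedule, via DBMS_AQADM. A minimal sketch, where the queue name and destination database link are hypothetical:

```sql
begin
  -- latency => 0 tells the propagation job to send changes as
  -- soon as they arrive, instead of waiting for the default
  -- polling interval.
  dbms_aqadm.alter_propagation_schedule(
    queue_name  => 'STRMADMIN.CAPTURE_QUEUE',
    destination => 'DEST_DB.WORLD',
    latency     => 0);
end;
/
```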
44 Apply Process Bottlenecks
Apply process bottlenecks usually deal with the apply process not “keeping up” with the rate of DML being executed on the source tables.
I have generally only seen apply bottlenecks with “batch” DML (as opposed to “OLTP” DML).
The three main areas to be concerned with regarding apply process bottlenecks are:
Commit frequency;
Number of apply servers;
Other Streams parameters.
45 Apply Process Bottlenecks: Commit Frequency
The number of rows affected per transaction has an enormous impact on apply performance:
It should not be too large, due to spill, and to the “one apply server per transaction” restriction.
It should not be too small either, due to DML degradation, and to apply transaction overhead.
From my experience, the optimal commit frequency is around 500 rows.
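On the source side, batch jobs can be restructured to commit near that sweet spot. A minimal sketch in PL/SQL, where the table name and predicate are hypothetical:

```sql
-- Purge in transactions of ~500 rows each, so that no single
-- transaction floods one apply server with thousands of LCRs.
begin
  loop
    delete from app_history
    where  purge_flag = 'Y'
    and    rownum <= 500;
    exit when sql%rowcount = 0;
    commit;
  end loop;
  commit;
end;
/
```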
46 Apply Process Bottlenecks: Number of Apply Servers
The number of parallel apply servers also has a large impact on the apply process’s performance.
The number of parallel apply servers is set by the apply “PARALLELISM” parameter.
From my experience, for best throughput, the PARALLELISM parameter should be three times the number of CPUs on the host machine.
Note: setting PARALLELISM that high can potentially “eat up” all of the CPU cycles on the host during intensive DML jobs.
47 Apply Process Bottlenecks: Other Streams Parameters
Several other Streams-related parameters can also have an effect on apply performance:
Apply process parameter _HASH_TABLE_SIZE: set this relatively high, to minimize “wait dependency” bottlenecks.
Apply process parameter TXN_LCR_SPILL_THRESHOLD: set this a little bit higher than the maximum number of LCRs generated per transaction, to prevent spill.
Init.ora parameter aq_tm_processes: set to 1.
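Apply parameters such as these (and PARALLELISM from the previous slide) are set with DBMS_APPLY_ADM.SET_PARAMETER. A minimal sketch, where the apply process name and both values are hypothetical:

```sql
begin
  dbms_apply_adm.set_parameter('STREAMS_APPLY_PROCESS',
                               'parallelism', '12');
  -- A shade above the largest expected LCR count per transaction:
  dbms_apply_adm.set_parameter('STREAMS_APPLY_PROCESS',
                               'txn_lcr_spill_threshold', '3000');
end;
/
```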
48 Apply Process Bottlenecks: “Turning Off” Streams
As mentioned previously, apply bottlenecks generally occur with “batch”-style DML.
It is possible to configure a session so that Streams will not capture any DML from that session; therefore, that session’s DML will not be propagated or applied.
This technique allows batch-style DML to be run at each Streams database individually, rather than running the batch DML in one database and then having that DML get replicated to all of the other databases.
49 Apply Process Bottlenecks: “Turning Off” Streams (cont.)
“Turning off” Streams in a session is done by setting the session’s “Streams tag” to a non-NULL value. Here is an example of turning Streams off before a batch statement, and then back on:

exec dbms_streams.set_tag (tag => hextoraw('01'));
<batch DML statement>
commit;
exec dbms_streams.set_tag (tag => NULL);

Note: this technique will only work if the “include_tagged_lcr” parameter, in the Streams capture rules, is set to FALSE. (FALSE is the default value for that parameter in Streams rules.)
Incidentally, tags are also the answer to our trivia question: an apply process marks the redo it generates with a non-NULL tag, so a capture process (with include_tagged_lcr set to FALSE) will not re-capture applied changes and loop them back to the source.
50 Apply Process Bottlenecks: Apply Process Aborts!
If there are any triggers on any replicated tables, then the apply process can sometimes abort, and not restart, during periods of intense DML activity.
These aborts happen most frequently when one particular apply server becomes “overloaded” and has to run too much recursive SQL (triggers cause recursive SQL).
This issue is caused by Oracle bug #<number>. The bug is apparently fixed in <release>, and there are “backport” patches available for <earlier releases>.
51 Streams “Spot Check” Script
The script called “ssc.sql” performs a “spot check” on a variety of Streams objects. Here is some sample output from it:

********** Streams report for dbprod on 03/28/ :33:54 **********
Capture process  : STREAMS_CAPTURE_PROCESS   CREATE LCR
Propagate process: STREAMS_PROPAGATE_PROCESS ENABLED
Apply reader     : STREAMS_APPLY_PROCESS     DEQUEUE MESSAGES
Apply server     : STREAMS_APPLY_PROCESS     EXECUTE TRANSACTION
Apply server     : STREAMS_APPLY_PROCESS     IDLE
Apply coordinator: STREAMS_APPLY_PROCESS     APPLYING   Latency: 2
Buffered queue   : STREAMS_CAPTURE_QUEUE     Messages:    Spill msgs: 21
Buffered queue   : STREAMS_APPLY_QUEUE       Messages:    Spill msgs: 0
Spill table rows :
Total streams pool memory :
Free streams pool memory :
*************** End of report ***************
52 Contact Information
Brian Keating:
Chris Lawson:
This presentation, and the script mentioned in it, are available for download on the Oracle Magician website.