Practical Performance Tuning

Presentation on theme: "Practical Performance Tuning"— Presentation transcript:

1 Practical Performance Tuning
For Your Progress OpenEdge Database

2 A Few Words About The Speaker
Tom Bascom, Roaming DBA & Progress User since 1987.
President, DBAppraise, LLC: remote database management service, simplifying the job of managing and monitoring the world’s best business applications.
VP, White Star Software, LLC: expert consulting services related to all aspects of Progress and OpenEdge.

3 Tom’s Top Twenty Tuning Tips
In No Particular Order.

4 {inc/disclaimer.i} Your kilometerage will vary.
These tips are in no special order, but sometimes order does matter. In an ideal world you will change one thing at a time, test, and remove changes that do not have the desired effect. The real world isn’t very ideal.

5 #20 Dump & Load You shouldn’t need to routinely dump & load.
That is, if you’re on OE10, using type 2 areas that have been well defined, and running a well-behaved application. The rest of us benefit from an occasional D&L. “Occasional” is subjective.

6 #20 Dump & Load See session DB-20, “Highly Parallel Dump & Load”: a 4GB database dumped and loaded live during a 45 minute demo, using old, crappy hardware…

7 #19 Stay Current Up to date releases of Progress, your OS, and your application are essential components of a well tuned system. You cannot take advantage of the best techniques, improved algorithms, or new hardware without staying up to date. Failure to stay up to date may mean poor performance, increased costs, and uncompetitiveness. Example: 9.1E brought major performance improvements in 4GL execution speed. This is not just about Progress – hardware is almost always cheaper than people’s time. You may be able to tweak things or rewrite some code to get results, but it may be cheaper and faster to “throw hardware” at a lot of problems – “throwing hardware” is only bad when you aren’t aiming it…

8 #19 Stay Current TRX throughput in v8 -> v9 (CCLP)
Major 4GL execution speed improvements in 9.1E. 64 bit platform support and large memory. 10.1A Automatic Defragmentation. 10.1B Workgroup/Multicore bug addressed. Significant improvements in DB read performance in 10.1C (-spin enhancements). These improvements require no effort on your part – no dump & load, no migration to type 2 storage areas…

9 #19 The Impact of an Upgrade
conv89 (the V8-to-V9 database conversion). These improvements require no effort on your part – no dump & load, no migration to type 2 storage areas…

10 #18 Parallelize Step outside of the box and consider what portions of your system could benefit from being parallelized: MRP runs, nightly processing, reports, data extracts, data imports.

11 #18 Parallelize

$ mbpro dbname -p exp.p -param "01|0,3000"

/* exp.p -- export a range of customers; the range arrives via -param */
define variable startCust as integer no-undo.
define variable endCust   as integer no-undo.

startCust = integer( entry( 1, entry( 2, session:parameter, "|" ))).
endCust   = integer( entry( 2, entry( 2, session:parameter, "|" ))).

output to value( "export." + entry( 1, session:parameter, "|" )).

for each customer no-lock
    where customer.custNum >= startCust and customer.custNum < endCust:
  export customer.
end.

output close.
quit.
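One way to actually parallelize this extract, sketched with assumed ranges and log file names: launch several batch-mode clients at once, each covering a different custNum range, and wait for them all to finish.

$ mbpro dbname -p exp.p -param "01|0,3000"    -b > exp01.log 2>&1 &
$ mbpro dbname -p exp.p -param "02|3000,6000" -b > exp02.log 2>&1 &
$ mbpro dbname -p exp.p -param "03|6000,9000" -b > exp03.log 2>&1 &
$ wait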

12 #17 Update Statistics SQL-92 uses a cost based optimizer…
But it cannot optimize without knowledge of the cost (data distribution). Weekly or monthly “update statistics” is appropriate for most people – or whenever roughly 20% of your data has changed. This is a data intense process: run it during off hours if you can, and you might want to only do a few tables/indexes at a time. How do I know when 20% of my data has changed? Table growth rates, experience, or a WAG.

13 #17 Update Statistics

$ cat customer.sql
UPDATE TABLE STATISTICS AND INDEX STATISTICS AND ALL COLUMN STATISTICS FOR PUB.Customer;
COMMIT WORK;

$DLC/bin/sqlexp \
  -db dbname \
  -S portnum \
  -infile customer.sql \
  -outfile customer.out \
  -user username \
  -password passwd >> customer.err 2>&1

For more information: OpenEdge Data Management: SQL Development, Chapter 10, Optimizing Query Performance. Test before doing this in production! The negative performance impact while it is running could be very large, and it might take a very long time to complete. 4GL people aren’t used to having to explicitly commit a transaction…

14 #16 Progress AppServer Used to reduce network traffic and latency.
When properly implemented it will minimize the path length between business logic and data persistence layers. IOW, for best performance, the AppServer should live on the same server as the database and use a self-service connection. Exception: An AppServer which is dedicated to making Windows API calls.
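A minimal sketch of calling business logic on an AppServer instead of dragging records across the network; the AppService name, host, port, procedure name and parameters are all assumptions for illustration.

define variable hSrv as handle  no-undo.
define variable ok   as logical no-undo.

create server hSrv.
ok = hSrv:connect( "-AppService asbroker1 -H dbhost -S 5162" ) no-error.

if ok then
do:
  /* asproc.p runs next to the database, ideally with a self-service connection */
  run asproc.p on server hSrv ( input 0, input 3000 ).
  hSrv:disconnect().
end.

delete object hSrv.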

15 #16 Progress AppServer run asproc on … procedure asproc … for each …
end. How many network layers do we want to go through? procedure asproc … for each … end.

16 #15 fork() & exec() Very expensive system calls.
Multiple context switches, process creation & tear-down, IO to load executables, etc. Common culprits: complex shell scripts (the old UNIX “lp” subsystem), long PATH variables, and 4GL usage of UNIX, DOS, OS-COMMAND and INPUT THROUGH.

17 #15 fork() & exec()

/* fork() & exec(): spawn "ls" 1,000 times to get a file's size */
define variable i     as integer no-undo.
define variable fSize as integer no-undo.
etime( yes ).
do i = 1 to 1000:
  input through value( "ls -ld .." ).
  import ^ ^ ^ ^ fSize.
  input close.
end.
display etime fSize.

/* stat(): ask the OS directly via the FILE-INFO handle */
define variable i     as integer no-undo.
define variable fSize as integer no-undo.
etime( yes ).
do i = 1 to 1000:
  file-info:file-name = "..".
  fSize = file-info:file-size.
end.
display etime fSize.

Not that stat() is an especially good thing – but it is way better than fork() & exec(), and it does happen to be the right tool for this particular job. The first version: 3140ms, with at least 1,000 calls to each of open(), close(), fork(), exec() and read(), complete with multiple context switches per invocation. The second: 16ms and 1,000 stat() calls.

18 #14 -spin Almost all new machines, even desktops & laptops, are now multi-core. Do NOT use the old X * # of CPUs rule to set -spin; it is bogus. Bigger is not always better with -spin! Modest values (5,000 to 10,000) generally provide the best and most consistent results for the vast majority of people. Use readprobe.p to explore. Check out Rich Banville’s superb Exchange 2008 presentation!
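For instance (the value is an assumption to be tuned by measurement), -spin is simply a broker startup parameter:

$ proserve dbname -spin 10000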

19 #14 -spin OPS-28 A New Spin on Some Old Latches

20 #13 bi cluster size The idea is to reduce the number and frequency of checkpoints, giving APWs plenty of time to work. Larger bi clusters permit spikes in the workload to take place without ambushing the APWs. Easy to overlook when building a new db via prostrct create… 512 KB is the default OE 10 bi cluster size. 8192 is good for small systems. 16384 is “a good start” for larger systems. Larger clusters mean a longer REDO phase on startup, so don’t get crazy. NOT a good idea for “Workgroup” database licenses – for WG, small values (512 or 1024) are better.

21 #13 bi cluster size

$ grep '(4250)' dbname.lg
Before-Image Cluster Size:

$ proutil dbname -C truncate bi -bi 16384
(1620) Before-image cluster size set to 16384 kb.
Before-Image Cluster Size:

$ proutil dbname -C bigrow 8

22 #12 APW, AIW, BIW & WDOG Always start a BIW. Always start an AIW.
Start a WDOG. Two APWs are usually enough: too many is just a (small) waste of CPU cycles. If you are consistently flushing buffers at checkpoints, increase the bi cluster size and add an APW (one at a time, until the buffer flushes stop). (You are running after-imaging? Right?)
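A sketch of a typical startup sequence for the helper processes (the database name is a placeholder):

$ probiw dbname      # before-image writer
$ proaiw dbname      # after-image writer (requires after-imaging)
$ proapw dbname      # first asynchronous page writer
$ proapw dbname      # second asynchronous page writer
$ prowdog dbname     # watchdog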

23 #12 APW, AIW, BIW & WDOG APW Data BIW & AIW Data

24 #11 Larger db Blocks Larger blocks result in much more efficient IO.
Fewer IO ops mean less contention for disk. Moving from 1k to 4k is huge. 4k to 8k is relatively less huge but still very valuable. 8k works best in most cases, especially read-heavy workloads. Better disk space efficiency (tighter packing, less overhead). Don’t forget to adjust -B and Rows Per Block!

25 #11 Larger db Blocks Large blocks reduce IO: fewer operations are needed to move the same amount of data. More data can be packed into the same space because there is proportionally less overhead. Because a large block can contain more data, it has improved odds of being a cache “hit”. Large blocks enable HW features to be leveraged, especially SAN HW. Use the largest DB block size available (there are those who would say “except for Windows”).

Larger block sizes improve performance by reducing the number of IO operations. This occurs because more records can be packed into one block than will fit into a single, smaller block. Furthermore, more records can be packed into one larger block than into two smaller blocks, because the combined overhead of two smaller blocks is greater than that of the larger block and the excess space is combined more effectively. Likewise, larger blocks fill less frequently than either one or two smaller blocks, and so need to be written less often. Finally, larger blocks have a slight advantage in the buffer pool replacement process since they are somewhat more likely to contain data needed by future data references. Additionally, slightly better compression of index values is obtained by having larger blocks available.

In older releases of Progress there were several reasons why people sometimes preferred a smaller block. To reduce record fragmentation (especially prior to OpenEdge 10.1), Progress suggested leaving very large amounts of free space in data blocks; using smaller blocks could, sometimes, decrease the amount of “wasted” space. Since 10.1A Progress automatically defragments records during forward processing whenever possible, which greatly reduces the potential fragmentation from tightly packing record blocks. Matching the database block size to the OS and disk subsystem block size avoided the possibility of “torn pages” – a situation where a system crash might result in one half of a data block being written to disk and the other half being lost; the “torn page” would then not be detected until much later, when a user attempted to reconstruct the block. (Actual instances of this problem occurring have been very rare and, so far as we know, isolated to Linux systems.) Starting with OpenEdge 10.1C Progress has introduced several integrity checks into the database that detect these, and many similar, conditions and prevent the corruption from propagating.

In summary – choosing an 8k block over a 4k block can reduce IO operations by 50% (or more), can have a significant impact on the time required to dump and load or rebuild indexes, and is beneficial to OLTP performance as well.

26 #11 Larger db Blocks 8k Blocks 8k Blocks New Server

27 #10 Type 2 Storage Areas Data clusters – contiguous blocks of data that are homogeneous (just one table). 64 bit ROWID. Variable (by area) rows per block. All data should be in type 2 areas – until you prove otherwise. See also: Storage Optimization Strategies!
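A minimal sketch of a structure file that puts data and indexes into type 2 areas and creates the database with an 8k block size; the area names, numbers, rows per block and cluster sizes are assumptions to be adjusted for your schema.

$ cat dbname.st
b .
d "Schema Area":6,64 .
d "Data":7,128;512 .
d "Indexes":8,32;8 .

$ prostrct create dbname dbname.st -blocksize 8192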

28 #10 Type 2 Storage Areas Storage areas

29 #9 Transactions Distinguish between a “business transaction” and a “database transaction”. Do not try to abuse a database transaction to enforce a business rule: You may need to create “reversing (business) transactions”. Or restartable transactions. For large database operations “chunk” your transactions.

30 #9 “Chunking” Transactions

define variable i as integer no-undo.

outer: do for customer transaction while true:
  inner: do while true:
    i = i + 1.
    find next customer exclusive-lock no-error.
    if not available customer then leave outer.
    customer.discount = 0.
    /* every 100 records, end this transaction and start a new one */
    if i modulo 100 = 0 then next outer.
  end.
end.

31 #8 Minimize Network Traffic
Use FIELD-LIST in queries. Use -cache and -pls. NO-LOCK queries pack multiple records into a request and eliminate lock downgrade requests. Watch out for client-side sorting and selection on queries. Remember that CAN-DO is evaluated on the CLIENT (yet another reason not to use it). Use -noautoresultlist/FORWARD-ONLY. -Mm = message buffer size. -pls = prolib swap: when accessing r-code files stored in a standard library, the AVM generally reads the files directly from the library instead of swapping them into temporary sort files; over a network, however, direct reads might actually take longer than reading r-code files once and storing them locally in temporary sort files. If you specify -pls with memory-mapped libraries, the AVM ignores it.
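A minimal sketch of a field-list query against the sports Customer table (the field choice and WHERE clause are just examples): only the listed fields cross the network, and NO-LOCK lets the server pack many records per message.

for each customer fields ( custNum name salesRep ) no-lock
    where customer.salesRep = "bbb":
  display customer.custNum customer.name.
end.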

32 #8 Minimize Network Traffic
Use a secondary broker to isolate high activity clients (such as reports). Consider setting -Mm to 8192 or larger. Use -Mn to keep the number of clients per server low (3 or less). Use -Mi 1 to spread connections across servers.
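A sketch of a startup along those lines (the ports and values are assumptions): a primary broker for interactive users plus a secondary broker dedicated to reports.

$ proserve dbname -S 20001 -Mm 8192 -Mi 1 -Mn 8
$ proserve dbname -S 20002 -Mm 8192 -m3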

33 #7 Runaways, Orphans, Traps & Kills
Consume entire cores doing nothing useful. These are sometimes caused by bugs. But that is rare. More likely is a poorly conceived policy of restricting user logins. The UNIX “trap” command is often at the bottom of these problems.

34 #7 Runaways, Orphans, Traps & Kills

35 #6 The Buffer Cache The cure for disk IO is RAM.
Use RAM to buffer and cache IO ops. The efficiency of -B is loosely measured by the hit ratio. Changes follow an inverse square law, so to make a noticeable change in the hit ratio you must make a large change to -B. 100,000 is “a good start” (with 8k blocks).
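One way to watch the hit ratio is to read the buffer activity VST; a minimal sketch (the VST and field names are as I recall them – check them against your release):

find first _ActBuffer no-lock.
display
  _ActBuffer._Buffer-LogicRds label "Logical Reads"
  _ActBuffer._Buffer-OSRds    label "OS Reads"
  ( 1.0 - ( _ActBuffer._Buffer-OSRds / _ActBuffer._Buffer-LogicRds ) ) * 100.0
      label "Hit %" format ">>9.99".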

36 #6 The Buffer Cache In Big B You Should Trust!

Layer                Time    # of Recs   # of Ops   Cost per Op   Relative
Progress to -B         0.96    100,000     203,473                       1
-B to FS Cache        10.24                 26,711                      75
FS Cache to SAN        5.93                                              45
-B to SAN Cache*      11.17                                             120
SAN Cache to Disk    200.35                                            1500
-B to Disk           211.52                                            1585

Multiple runs of the same query were made under carefully controlled conditions, with various db restarts, system reboots and SAN restarts in order to clear the various caches. Some very expensive hardware instrumentation was employed to ensure that data really did flow between the layers as expected. The ultimate outcome of the testing was to show that poor performance was due to poor SAN configuration rather than some exotic Progress misbehavior.

* Used concurrent IO to eliminate the FS cache

37 #5 Rapid Readers Similar to a runaway – consumes a whole CPU
But it is actually doing db IO. Usually caused by: table scans, poor index selection, unmonitored batch processes and app-servers, or really bad algorithm choices.
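A minimal sketch (VST field names assumed – verify against your release) that lists sessions by database reads; rapid readers float to the top:

for each _UserIO no-lock
    by _UserIO._UserIO-DbRead descending:
  display
    _UserIO._UserIO-Usr    label "Usr"
    _UserIO._UserIO-Name   label "Name"
    _UserIO._UserIO-DbRead label "DB Reads".
end.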

38 #5 Rapid Readers High Read Rate Suspicious user IO Code being run!

39 #4 Balance & Isolate IO Use more than one disk:
A fast disk can do 150 or so random IO Ops/sec. Kbytes/sec is a measure of sequential IO. OLTP is mostly random. Don’t waste time trying to “manually stripe”. Instead, use “hardware” striping and mirroring. Isolate AI extents for safety, not performance. Isolate temp-file, application, OS and “other” IO.

40 #4 Balance & Isolate IO

fillTime = cacheSize / ( requestRate - serviceRate )

Typical Production DB Example (4k db blocks):
4GB / ( 200 io/sec - 800 io/sec ) = cache doesn’t fill!

Heavy Update Production DB Example:
4GB / ( 1200 io/sec - 800 io/sec ) = 2621 seconds (RAID10)
4GB / ( 1200 io/sec - 200 io/sec ) = 1049 seconds (RAID5)

Maintenance Example:
4GB / ( 5000 io/sec - 3200 io/sec ) ≈ 583 seconds (RAID10)
4GB / ( 5000 io/sec - 200 io/sec ) ≈ 218 seconds (RAID5)

Time until saturated is interesting and good to know – but the impact post saturation is the real killer! Post saturation RAID10 write performance is significantly faster than RAID5: it takes longer to saturate and, once saturated, RAID10 performs better… Has anyone ever been told “don’t worry, the array will handle it” with regards to concerns about IO on a SAN? Did you (or your boss) believe it?

Requests = “os reads” + “os writes”. 200 represents a well configured and reasonably tuned system humming along doing mostly reads; 1,200 represents a system that is probably in its batch window doing big reports and updates; 5,000 is a dump & load, an index rebuild or a restore. Service is based on a 4 disk stripe: 800 is for random reads (200 x 4); 3,200 assumes that significant sequential behavior is being recognized by the disk subsystem; 200 is RAID5 – when writing it acts like a single disk with no sequential behavior advantages. 4GB = 4 * 1024 * 1024 KB = 1,048,576 4K blocks.
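The same arithmetic as a throwaway 4GL calculation; the rates are the assumed example values from above.

define variable cacheBlocks as integer no-undo initial 1048576.  /* 4 GB of 4 KB blocks */
define variable requestRate as integer no-undo initial 1200.     /* io/sec arriving      */
define variable serviceRate as integer no-undo initial 800.      /* io/sec being drained */

if requestRate > serviceRate then
  display cacheBlocks / ( requestRate - serviceRate ) label "Seconds to fill".
else
  display "Cache never fills at this workload." format "x(40)".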

41 #3 Manage Temp File IO Temp-file IO can exceed db IO.
Sometimes by 2:1, 3:1 or more! -T isolates temp file IO. -t helps you to crudely diagnose the source of IO. -y provides some detail regarding r-code swapping. -mmax buffers r-code: 4096 is a good start for ChUI, larger for GUI. Memory mapped procedure libraries cache r-code. Use -Bt & -tmpbsize to tune 4GL temp-tables.
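For example (the program name, path and values are assumptions), a batch client started with its temp files isolated on their own disk and tuned:

$ mbpro dbname -p nightly.p -b -T /protmp -mmax 4096 -Bt 1024 -tmpbsize 8 -y -t > nightly.log 2>&1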

42 #3 Manage Temp File IO

-rw-r--r--  1 VEILLELA  users  Oct 19 15:16  srtrAyhEb
-rw-r--r--  1 wrightb   users  Oct 19 15:16  srtH6miqb
-rw-r--r--  1 STEELEJL  users  Oct 19 15:16  srtz37kyb
-rw-r--r--  1 THERRIKS  users  Oct 19 07:12  srt--Elab
-rw-r--r--  1 root      users  Oct 19 15:16  lbiV6Qp7a
-rw-r--r--  1 root      users  Oct 19 15:16  lbi-TymMa
-rw-r--r--  1 wrightb   users  Oct 19 15:16  DBIHDmiqc
-rw-r--r--  1 BECKERLM  users  Oct 19 11:06  DBI--Abac
-rw-r--r--  1 CALUBACJ  users  Oct 19 09:16  DBI--Abyc

CLIENT.MON (-y) Program access statistics (Times / Bytes):
  Reads from temp file:
  Writes to temp file:
  Loads of .r programs:
  Saves of compilation .r's:
  Compilations of .p's:
  Checks of files with stat:

Srt file = 0: good! Srt file <> 0: bad.
DBI file = 8192: good! DBI file <> the minimum for your system: bad.
Lbi file: not much that can be done – worse yet, root owns it, so tying it back to a real user is difficult.
Client.mon “writes to temp file” = -mmax too low. “Checks of files with stat” = lack of -q.

43 #2 Index Compact Compacts Indexes.
Removes deleted record placeholders. Improves “utilization” = fewer levels & blocks and more index entries per read. Runs online or offline. Available since version 9.

44 #2 Index Compact Do NOT set target % for 100!
proutil dbname -C idxcompact table.index target%

Consider compacting when utilization < 70% … and blocks > 1,000. The default compact target is 80%; it is a minimum target – actual results are usually better.

INDEX BLOCK SUMMARY FOR AREA "APP_FLAGS_Idx" : 96
Table           Index             Fields  Levels  Blocks  Size  %Util  Factor
PUB.APP_FLAGS   AppNo
                FaxDateTime
                FaxUserNotified

INDEX BLOCK SUMMARY FOR AREA "Cust_Index" : 10
Table           Index             Fields  Levels  Blocks  Size  %Util  Factor
PUB.Customer    Comments
                CountryPost
                CustNum
                Name
                SalesRep

45 #1 Stupid 4GL Tricks Bad code will defeat any amount of heroic tuning and excellent hardware. Luckily bad code is often advertised by the perpetrator as having been developed to “improve performance”. Just because a feature of the 4GL can do something doesn’t mean that it should be used to do it.

46 #1 Stupid 4GL Tricks

/* SR#1234 – enhanced lookup to improve performance! */
update cName.
find first customer
  where cName matches customer.name
  use-index custNum no-error.

Or –

find first customer where can-do( cName, name ).

The “pattern” is on the wrong side of the MATCHES example, which will prevent wild-cards from working (if that was the intent). USE-INDEX is obviously a stupid 4GL trick. FIRST is being used to obscure the performance problems being caused by MATCHES, USE-INDEX and CAN-DO. Note the lack of a LOCK specification too… CAN-DO is a security function: it is always evaluated by the client, and it is sensitive to the data in both arguments – comma, splat and bang can all change your results. Being “root” will too: when you are “root”, tests that should fail will succeed unless root is specifically excluded (i.e. !root) in the idlist argument (idlist is the 1st argument). CAN-FIND is also easily abused…
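For contrast, a minimal sketch of a saner version of that lookup: let the AVM choose an index, use an explicit lock, and handle the not-found case.

find first customer no-lock
  where customer.name begins cName
  no-error.
if available customer then
  display customer.custNum customer.name.
else
  message "No customer found matching" cName view-as alert-box.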

47 Questions?

48 Thank You!

