Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology

Presentation on theme: "Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology"— Presentation transcript:

Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

There are Two Kinds of Optimizer Statistics >Table/Index level- describes a table and its index(es) >Page/row counts, cluster ratios, deleted and forwarded rows >Some are updated dynamically as DML occurs page/ row counts, deleted rows, forwarded rows, cluster ratios >Stored in systabstats >Column level - describes the data to the optimizer >Histogram (distribution), density values, default selectivity values >Static, need to be updated or written directly >Stored in sysstatistics >This presentation deals with the column level statistics

What Are These Column Level Statistics Used For? >The Histogram Values >Describes the distribution of values in the column >Belongs to a column, not an index >Used in costing SARGs >A step is the point in the column where a value is read to obtain a boundary value >A cell represents the rows that fall between two steps >Each cell has a weight which is the fraction of rows in the column it represents - read as a percentage of all rows >There are approximately the same number of rows in each cell - except Frequency count cells

Some Quick Definitions Range cell density: 0.0037264745412389 Total density: 0.3208892191740000 Range selectivity: default used (0.33) In between selectivity: default used (0.25) Histogram for column: A" Column datatype: integer Requested step count: 20 Actual step count: 10 Step Weight Value 1 0.00000000 <= 154143023 2 0.05263700 <= 154220089 3 0.05245500 <= 171522361 4 0.00000000 < 800000000 5 0.34489399 = 800000000 6 0.04968300 <= 859388217 7 0.00000000 < 860000000

What Are These Column Level Statistics Used For? cont. >A cell can represent either a single value or multiple values of the column >Range cell - more than one value 2 0.05263982 <= 31423 3 0.05263316 <= 63045 All values between 31424 and 63045 are in cell 3 >Frequency count cell - only one value (very accurate) 5 0.00000000 < 170016 6 0.15908001 = 170016 7 0.05263815 <= 201861 8 0.10263316 = 201862 9 0.05264576 <= 317462 Cells 6 and 8 represent only one value >Cell (step) 1 represents the NULL values in the column

Statistics On Inner Columns of Composite Indexes Stats on inner columns of composite indexes Think of a composite index as a 3D object, columns with statistics are transparent, those without statistics are opaque >Columns with statistics give the optimizer a clearer picture of an index – sometimes good, sometimes not >This is a fairly common practice >Does add maintenance >update index statistics most commonly used to do this update index statistics tab_name [ind_name]

Statistics On Inner Columns of Composite Indexes cont. Index on columns E and B – No statistics on column b select * from TW4 where E = "yes" and b >= 959789065 and id >= 600000 and F > "May 14, 2002 and A_A = 959000000 Beginning selection of qualifying indexes for table TW4', varno = 0, objectid 464004684. The table (Allpages) has 1000000 rows, 24098 pages, Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436. No statistics available for B, using the default range selectivity to estimate selectivity. Estimated selectivity for B, selectivity = 0.330000.

Statistics On Inner Columns of Composite Indexes cont. The best qualifying index is E_B' (indid 7) costing 49264 pages, with an estimate of 191 rows to be returned per scan of the table FINAL PLAN (total cost = 481960): varno=0 (TW4) indexid=0 () path=0xfbccc120 pathtype=sclause method=NESTED ITERATION Table: TW4 scan count 1, logical reads:(regular=24098 apf=0 total=24098) physical reads: (regular=16468 apf=0 total=16468), apf IOs used=0

Statistics On Inner Columns of Composite Indexes cont. Statistics are now on column B Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436. Estimated selectivity for B, selectivity = 0.022199, upper limit = 0.074835. The best qualifying index is E_B' (indid 7) costing 3317 pages,with an estimate of 13 rows to be returned per scan of the table FINAL PLAN (total cost = 55108): varno=0 (TW4) indexid=7 (E_B) path=0xfbd1da08 pathtype=sclause method=NESTED ITERATION Table: TW4 scan count 1, logical reads:(regular=4070 apf=0 total=4070), physical reads: (regular=820 apf=0 total=820),

Statistics On Non-Indexed Columns and Joins Stats on non-indexed columns Cant help with index selection but can affect join ordering >Columns with statistics give the optimizer a clearer picture of the column – no hard coded assumptions have to be used >When costing joins of non-indexed columns having statistics may result in better plans than using the default values >Without statistics there will be no Total density or histogram that the optimizer can use to cost the column in the join >Yes, in some circumstances histograms can be used in costing joins – if there is a SARG on the joining column and that column is also in the join table then the SARG from the joining table can be used to filter the join table >If there is no SARG on the join column or on the joining column the Total density value (with stats) or the default value (w/o stats) will be used

Statistics On Non-Indexed Columns and Joins cont. SARG example select....from TW1, TW4 where TW1.A = TW4.A and TW1.A = 10 Selecting best index for the JOIN CLAUSE: TW4.A = TW1.A TW4.A = 10 Estimated selectivity for a, selectivity = 0.003726,upper limit = 0.049683. Histogram values used select....from TW1, TW4 where TW1.A = TW4.A and TW1.B = 10 Selecting best index for the JOIN CLAUSE: TW4.A = TW1.A Estimated selectivity for a, selectivity = 0.320889. Total density value used

Statistics On Non-Indexed Columns and Joins - Example select * from TW1,TW2 where TW1.A=TW2.A and TW1.A =805975090 A simple join with a SARG on the join column of one table Table TW2 column A has no statistics, TW1 column A does Selecting best index for the JOIN CLAUSE: (for TW2.A) TW2.A = TW1.A TW2.A = 805975090 Inherited from SARG on TW1 But, cant help…no stats Estimated selectivity for A, selectivity = 0.100000. The best qualifying access is a table scan, costing 13384 pages, with an estimate of 50000 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement Join selectivity is 0.100000. Inherited SARG from other table doesnt help in this case

Statistics On Non-Indexed Columns and Joins – Example cont. Without statistics on TW2.A the plan includes a reformat with TW1 as the outer table FINAL PLAN (total cost = 2855774): varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION varno=1 (TW2) indexid=0 () path=0xfbd0bb10 pathtype=join method=REFORMATTING >Not the best plan – but the optimizer had little to go on

Statistics On Non-Indexed Columns and Joins – Example cont. Table TW2 column A now has statistics. The inherited SARG on TW1.A can now be used to help filter the join on TW2.A Selecting best index for the JOIN CLAUSE: TW2.A = TW1.A TW2.A = 805975090 Estimated selectivity for A, selectivity = 0.001447, upper limit = 0.052948. The best qualifying access is a table scan, costing 13384 pages, with an estimate of 724 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement Join selectivity is 0.001447.

Statistics On Non-Indexed Columns and Joins – Example cont. With statistics on TW2.A reformatting is not used and the join order has changed FINAL PLAN (total cost = 1252148): varno=1 (TW2) indexid=0 () path=0xfbd0b800 pathtype=sclause method=NESTED ITERATION varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION

The Effects of Changing the Number of Steps (Cells) The Number of Cells (steps) Affects SARG Costing – As the Number Of Steps Changes Costing Does Too Cell weights and Range cell density are used in costing SARGs >Cell weight is used as columns upper limit Range cell density is used as selectivity for Equi-SARGs – as seen in 302 output >Result(s) of interpolation is used as column selectivity for Range SARGs >Increasing the number of steps narrows the average cell width, thus the weight of Range cells decreases >Can also result in more Frequency count cells and thus change the Range cell density value >More cells means more granular cells

The Effects of Changing the Number of Steps (Cells) cont. Average cell width = # of rows/(# of requested steps –1) >Table has 1 million rows, requested 20 steps - >1,000,000/19 = 52,632 rows per cell >1,000,000/199 = 5,025 rows per cell >What does this mean? >As you increase the number of steps (cells) they become narrower – representing fewer values >Well see that this has an effect on how the optimizer estimates the cost of a SARG >update statistics ……. using X values create index ….. using X values

The Effects of Changing the Number of Steps (Cells) cont. Changing the number of steps – effects on Equi-SARGs select A from TW2 where B = 842000000 With 20 cells (steps) in the histogram Range cell density: 0.0012829768785739 9 0.05263200 <= 825569337 10 0.05264200 <= 842084405 SARG value falls into cell 10 Estimated selectivity for B, selectivity = 0.001283, upper limit = 0.052642. Range cell weight of density qualifying cell

The Effects of Changing the Number of Steps (Cells) cont. With 200 cells (steps) in the histogram Range cell density: 0.0002303825911991 77 0.00507200 <= 839463989 78 0.00506000 <= 842019895 >SARG value falls into cell 78 Estimated selectivity for B, selectivity = 0.000230, upper limit = 0.005060. In this case more cells result in a lower estimated selectivity >Increasing the number of steps has decreased the average width and lowered the Range cell density and the average cell weight. >Range cell density decreased because Frequency count cells appeared in the histogram

The Effects of Changing the Number of Steps (Cells) cont. Changing the number of steps – effects on Range SARGs - select * from TW2 where B between 825570000 and 830000000 With 20 cells (steps) in the histogram Range cell density: 0.0012829768785739 9 0.05263200 <= 825569337 10 0.05264200 <= 842084405 >SARG values fall into cell 10 Estimated selectivity for B, selectivity = 0.014121, upper limit = 0.052642. >Here selectivity is the product of interpolation, upper limit is the weight of the qualifying cell. >Interpolation estimates how much of cell will qualify for the range SARG

The Effects of Changing the Number of Steps (Cells) cont. select * from TW2 where B between 825570000 and 830000000 With 200 cells (steps) in the histogram Range cell density: 0.0002303825911991 67 0.00505200 <= 825505843 68 0.00503000 <= 825570611 69 0.00508000 <= 825635378 70 0.00504000 <= 825690418 71 0.00506400 <= 825702450 72 0.00503200 <= 825767218 73 0.00510200 <= 825831945 74 0.00425800 <= 825833785 75 0.00598400 <= 839462921 Estimated selectivity for B, selectivity = 0.029624, upper limit = 0.034606. >The SARG values now span multiple cells >Interpolation estimates amount of cells 68 and 75 to use since not all of those two cells qualify

Some Statistics Related FAQs cont. How many steps should I request? >It will depend on your data and your queries >Increase requested steps to get Frequency count cells when there are highly duplicated values >FC only represents one value - very accurate weight >Range SARGs will estimate what portion of a cell qualifies for the SARG >More cells means narrower cells (represent fewer values) >Narrower cells mean more accurate estimates >Can have an affect on equi-SARGs - lower selectivity

Removing Statistics Can Effect Query Plans Sometimes no statistics are better then having them This will usually be an issue when very dense columns are involved Histogram for column: E" Step Weight Value 1 0.00000000 < "no" 2 0.47256401 = "no" 3 0.00000000 < "yes" 4 0.52743602 = "yes This can also show up when you have spikes (Frequency count cells) in the distribution

Removing Statistics Can Effect Query Plans cont. select count(*) from TW4 where E = yes and C = 825765940 The table…has 1000000 rows, 24098 pages, Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436. Estimating selectivity of index E_AA_B', indid 6 scan selectivity 0.52743602,filter selectivity 0.527436 527436 rows, 174107 pages The best qualifying index is E_AA_B' (indid 6) costing 174107 pages, with an estimate of 526 rows FROM TABLE TW4 Nested iteration. Table Scan.

Removing Statistics Can Effect Query Plans cont. delete statistics TW4(E) Estimated selectivity for E, selectivity = 0.100000. Estimating selectivity of index E_AA_B', indid 6 scan selectivity 0.100000,filter selectivity 0.100000 100000 rows, 20584 pages The best qualifying index is E_AA_B (indid 6) costing 20584 pages, with an estimate of 92 rows FROM TABLE TW4 Nested iteration. Index : E_AA_B Forward scan. Positioning by key.

Maintaining Tuned Statistics Tuned statistics will add to your maintenance Any statistical value you write to sysstatistics either via optdiag or sp_modifystats will be overwritten by update statistics >Keep optdiag input files for reuse >If needed get an optdiag output file, edit it and read it in >Keep scripts that run sp_modifystats >Rewrite tuned statistics after running update statistics that affects the column with the modified statistics

Monitoring Table/Index Level Fragmentation Using The Statistics Can Be Both An Optimizer and Space Management Concern The more fragmentation the less efficient page reads are >Deleted rows – fewer rows per page, affects costing >Forwarded rows – 2 I/O each, optimizer adds to costing >Empty data/leaf pages – more reads may be necessary >Clustering can get worse >Watch the DPCR of the table or APL clustered index >In general the Cluster Ratios are not a good indicator of fragmentation since they are often normally low >Use optdiag outputs to monitor these values

Monitoring Table/Index Level Fragmentation Using The Statistics cont. >ASE 12.0 and above check the Space utilization value >Large I/O efficiency is another value to watch Empty data page count: 0 Forwarded row count: 0.0000000000000000 Deleted row count: 0.0000000000000000 Derived statistics: Data page cluster ratio: 0.9994653835872761 Space utilization: 0.9403543288085808 Large I/O efficiency: 1.0000000000000000 >Space utilization and Large I/O efficiency are not used by the optimizer >The further from 1 the more fragmentation there is

Maintaining the Statistics When data changes the statistics become out of date In general up to date statistics are needed to get the best query plans >Statistics are usually updated using update statistics commands >The more statistics you have the more maintenance >Its a trade off between the gain in query performance and the increased statistics maintenance >Theres no point in updating statistics if the table is static

Update Statistics >Update statistics has been extended to allow for placement of statistics on columns >update statistics table_name (col_name) >update index statistics table_name [ind_name] >update all statistics table_name >Specify the requested number of steps (cells) to use when building the columns histogram >update statistics table_name (col_name) using 200 values

How Update Statistics Works Column and table/index values have to be read in order to gather the statistics >What does it do? >Reads the column to gather information for density and histogram, writes the column level statistics >While reading the column it gathers index/table level statistics – row & page count, forwarded rows, deleted rows, the cluster ratios, etc. >Takes a sample value every X rows for a histogram boundary value - (based on the number of rows and requested steps) If same value for multiple steps save it to make an FC

How Update Statistics Works cont. >Values have to be in sorted order for statistics gathering >If its the leading column of an index no sort is necessary >Just scan index leaf for statistics >If not the leading column of an index - create a worktable, read values in, sort and scan for statistics update statistics tab_name (col_name)- a table scan will be done to read the column update index statistics (ind_name)- then only an index scan (with a sort of the inner columns) >The sort is done in a worktable in tempdb. update index and update all statistics will use a lot of tempdb space unless sampling is used

Some Statistics Related Myths & Legends Update statistics will result in improved performance >Only guarantees up to date statistics >Due to distribution statistics may not give a pretty picture of the column Always use update all statistics >Rarely need statistics on all columns of a table >Can take a VERY long time to run, makes maintenance difficult at best >Should consider adding stats to composite index columns

Statistics Tools >Some useful tools for working with the statistics >Some are by Sybase some are by users >Optdiag - read, write and simulate the statistics >Well known and documented >sp_modifystats - make modifications to density values (more functionality coming soon - 11.9.2.4, 12.0.0.4, 12.5) >sp__optdiag (thats a double underscore) - >by Kevin Sherlock >Displays the statistics ala optdiag output - very handy >http://www.sypron.nl/download.html

Sampling for Update Statistics A new feature in 12.5.0.3 Designed to dramatically reduce the time it takes to update statistics – can dramatically speed up the running of update statistics >Opens your maintenance window >Decreases the cost of using ASE Randomly selected pages are read instead of reading all pages to gather the column level statistics – less I/O >The percentage of pages to be sampled can be specified update statistics tab_name with sampling = X percent >X is the percentage of pages you want to sample Can be between 1 and 100

Definitions >Column Level Statistics – those statistics that describe the values in the column to the optimizer – an attribute of a colum (i.e.; the histogram and density values) >Sampling – randomly reading rows from a specified percentage (subset) of pages rather than all pages of the table in order to gather column level statistics >Sampling Rate – the specified percentage of pages to read >Full Scan – to gather statistics by reading all pages of the object (table or index) >Major Attribute of an Index – the leading column of an index as listed in the create index command

Sampling for Update Statistics cont. >Unofficial tests show that a sampling rate of 10% on a 1 million row numeric column reduces the time for update statistics to run from 9 minutes to 30 seconds

Sampling for Update Statistics cont. >The Resulting histogram will be based on the values that are sampled It will differ from a histogram obtained from a full scan update statistics The lower the specified percentage of sampling the more the histogram will differ from a full scan histogram Test your queries against sampled statistics. In most cases you wont see any major changes Density values not updated by sampling >In most cases this wont be an issue.

Why Sampling for Update Statistics? >As datasets have grown the time it takes to run update statistics has also grown – Dramatically!! >This became more of an issue with update index statistics introduced in 11.9.x due to extra sort in worktable >TCO and auto-tuning/admin require a faster way to run update statistics >Without a faster update statistics neither efforts would succeed >Speeding up update statistics is a long standing Customer feature request >Random page sampling is the most I/O efficient method >Dramatically decreased the time to run update statistics

Why Sampling for Update Statistics? cont. >Some time test results – Not official, not for general release >your mileage may vary >Timings are from tests run by Sybase QA >1 million row int colum – timings based on elapsed time 20% sampling rate – Full scan time :2465850 Sampling time : 398783 Percentage of savings time(elapsed time):83% 10% sampling rate – Full scan time :2139013 Sampling time : 153130 Percentage of savings time(elapsed time):92% >Variations in full scan time are taken into account

How Does It Work? >Specify the percentage of pages to read via update statistics >with sampling = X percent Percent value can be between 1 and 100 >with extensions must follow using – with sampling = x percent and/or with consumers = x must follow using X values update statistics authors(auth_id) using 40 values with percent = 10 >Sampling reads all rows from each page read >Row values are moved to the worktable to be sorted and the statistics gathered >This saves tempdb space since the sampled sets of values are smaller than if the whole column was read into the worktable

How Does It Work? cont. Specific update statistics syntax and their affects update statistics table_name [index_name] with sampling = X percent >Will full scan index pages to update/create statistics on the major attribute(s) of the specified index or all indexes on the table ignoring the specified sampling rate – sampling will not be done

How Does It Work? cont. update index statistics tab_name [ind_name] with sampling = X percent >Will full scan index pages to update/create statistics for the major attribute(s) of the indexes or specified index on the table, ignoring sampling. >For minor index attributes will use sampling to scan the requested percentage of pages, read those values into a worktable, sort and gather statistics from there. >The space used in tempdb will decrease as the sampling rate decreases update statistics tab_name (col_name) with sampling = X percent >Will use sampling to update/create statistics for the specified column using the specified sampling rate. This applies to all columns whether major attributes of an index or not >Will not affect multi-column density values

How Does It Work? cont. update all statistics table_name with sampling = X percent >Will full scan index pages to gather statistics for the major attribute of all indexes – will not use sampling on these columns >Will use sampling to gather statistics for all columns that are not the major attribute of an index >The space used in tempdb will decrease as the sampling rate decreases

How Does It Work? cont. >Sampling is not used for create index >Since a full scan is required to build an index there is no additional cost for building the statistics

Trade Offs >A sampled set of anything is not as accurate as examining the most effective sampling rate for a given dataset >A histogram created with sampling is not likely to match a histogram created via a full scan >Histogram boundary values will vary >Cell weights will vary >Minimum and maximum histogram boundary values will vary >Since cell weight(s) and Range cell density are used to cost all SARGs a histogram from a sampled set will have an affect on SARG costing >Variations in the upper and lower histogram values may result in out of bounds costing by the optimizer >The smaller the sampling rate the greater the variance is likely to be

Trade Offs cont. >If there are existing density values they will not be overwritten. If there are no density values a default value of 0.100000 will be used for both Range cell and Total density values >There is currently no information saved about the use of sampling (whether or not it was used and the sampling rate) >Different cell types may appear >As the sampling rate decreases it is possible that Frequency count and/or Range cells may appear where they didnt exist prior to sampling >The same pages will be resampled if the dataset is static and the same sampling rate used

Examples of Variations in the Histogram Full scan histogram - Step Weight Value 1 0.00000000 <= 154218543 2 0.05315000 <= 805909305 3 0.05305000 <= 808793353 4 0.05311000 <= 822687028 5 0.05304000 <= 825700873 6 0.05314000 <= 839464505 7 0.05292000 <= 842544649 8 0.05305000 <= 858863369 20 0.04621000 <= 960051465 >Note boundary values, cell weights and the upper and lower boundary values >Variations within the histogram are the main issue that needs to be tested

Examples of Variations in the Histogram cont. 10% sampled histogram - Step Weight Value 1 0.00000000 <= 154218799 2 0.05253968 <= 805909300 3 0.05253968 <= 808728585 4 0.05253968 <= 822686772 5 0.05269841 <= 825636617 6 0.05349206 <= 839464498 7 0.05253968 <= 842543113 8 0.05253968 <= 858797321 20 0.04888889 <= 960050979 >Note variations in the boundary values, cell weights and the upper and lower boundary values

Tuning and Troubleshooting >Trial-and-error testing/tuning will need to be done to determine the most optimal sampling rate for a given dataset >In most cases variations in the statistics will have no affect In other cases small variations may change query plans >There is no rule of thumb on what sampling rate to use >In some cases the same sampling rate may be fine across all or most tables/columns. >In some cases sampling may not result in efficient plans >Use showplan and traceon 302/310 outputs to track changes to the query plan as the sampling rate changes >Using sample queries get above outputs from statistics gathered by a full scan. Update statistics with the sampling rate, rerun query and compare outputs

Tuning and Troubleshooting cont. >Use optdiag to monitor changes to the histogram >Check optdiag of full scan histogram for upper and lower boundary values these can be inserted into the histogram if needed >Keep a copy of optdiag output file as a backup of statistics in case old values need to be reloaded

Future Enhancements >This first implementation of sampling will require some enhancements >Scale density values gathered by sampling so that they are more accurate >Track the min/max values in the column in order to maintain the upper and lower boundary values of the histogram >Sampling index pages >Will help decrease the time of running update statistics even further >Add a mechanism to record if sampling was used and what sampling rate was last used >Add this information to optdiag and traceon 302 (and future optimizer diagnostics)

Where To Get More Information >The Sybase Customer newsgroups >http://support.sybase.com/newsgroups >The Sybase list server >SYBASE-L@LISTSERV.UCSB.EDU >The external Sybase FAQ >http://www.isug.com/Sybase_FAQ/ >Join the ISUG, ISUG Technical Journal, feature requests >http://www.isug.com

Where To Get More Information >The latest Performance and Tuning Guide >Dont be put off by the ASE 12.0 in the title, it covers the 11.9.2 features/functionality too >http://sybooks.sybase.com/onlinebooks/group-as/asg1200e >Any Whats New docs for a new ASE release >Tech Docs at Sybase Support >http://techinfo.sybase.com/css/techinfo.nsf/Home >Upgrade/Migration help page >http://www.sybase.com/support/techdocs/migration

Sybase Developer Network (SDN) Additional Resources for Developers/DBAs >Single point of access to developer software, services, and up-to-date technical information: >White papers and documentation >Collaboration with other developers and Sybase engineers >Code samples and beta programs >Technical recordings >Free software

Similar presentations