Unlocking the Mysteries Behind Update Statistics

1/25/2020 Unlocking the Mysteries Behind Update Statistics John F. Miller III

The Dice Problem
Throw dice: how many will come up 1?

Questions about the Dice
How many dice are you throwing? How many sides does each die have? Are all the dice the same? The better the information, the more accurate the estimate.

What does Update Statistics do?
Collects information for the optimizer
  Statistics: LOW
  Distributions: MEDIUM and HIGH
Drops distributions
Compiles stored procedures

Statistics Collected
systables
  Number of rows
  Number of pages to store the data
syscolumns
  Second largest value for a column
  Second smallest value for a column
sysindexes
  Number of unique values for the lead key
  How highly clustered the values for the lead key are

Update Statistics Low Basic Algorithm
Walk the leaf pages in each index
Submit btree cleaner requests when deleted items are found, causing re-balancing of indexes
Collect the following information
  Number of unique items
  Number of leaf pages
  How clustered the data is
  Second highest and second lowest value

How to Read Distributions
Each entry under --- DISTRIBUTION --- describes one bin:
  bin number: ( # of rows represented in this bin, # of unique values, highest value in this bin )
To get the range of values for a bin, look at the highest value in the previous bin.
Each entry under --- OVERFLOW --- describes one heavily duplicated value:
  entry number: ( # of rows for this value, the value )

Example - Approximating a Value
The first bin holds the rows containing a value between -1 and 75
There are 70 unique values in this range
The optimizer will deduce (rows in the bin) / 70 = 12,404 records for each value between -1 and 75
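The division above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is hypothetical, and the bin row count of 868,280 is an assumed value chosen so that the division reproduces the 12,404 figure quoted on the slide.

```python
def rows_per_value(bin_rows, unique_values):
    """Uniform-spread estimate: rows in the bin divided by distinct values in it."""
    if unique_values <= 0:
        raise ValueError("a bin must contain at least one unique value")
    return bin_rows / unique_values


# Assumed bin row count (not taken from the slide), 70 unique values as stated.
estimate = rows_per_value(bin_rows=868_280, unique_values=70)
print(round(estimate))
```

The estimate is the same for every value inside the bin, which is why heavily skewed values need separate treatment.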

Example - Dealing with Data Skew
Data skew
For the value 43, how many records will the optimizer estimate exist?
Answer: the number of rows recorded for that value in its overflow entry
Any value whose row count exceeds 25% of the bin size will be placed in an overflow bin

Basic Algorithm for Distributions
Develop scan plan based on available resources
Scan table
  High = all rows
  Medium = sample of rows
Sort each column
Build distributions
Begin transaction
  Delete old column distributions
  Insert new column distributions
Commit transaction

Scan
The table is scanned in its entirety for update statistics high, while it is only sampled for update statistics medium (see Sample Size). Rows are read in dirty read isolation, regardless of the isolation level the user has set.

Scan
The table may be scanned several times, depending on the amount of sort memory available and the number of columns for which statistics are collected. The approximate number of table scans is (size of the data to sort) / (amount of sort memory).
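A minimal sketch of that approximation follows. Rounding up to a whole scan and the 40 MB sort size are assumptions made for illustration; the slide gives only the ratio.

```python
import math


def approx_table_scans(sort_data_mb, sort_memory_mb):
    """Approximate scans = data to sort / sort memory, rounded up (assumed)."""
    if sort_memory_mb <= 0:
        raise ValueError("sort memory must be positive")
    return max(1, math.ceil(sort_data_mb / sort_memory_mb))


# Hypothetical 40 MB of column data to sort.
print(approx_table_scans(40.0, 4.0))   # 10
print(approx_table_scans(40.0, 15.0))  # 3
```

More sort memory lowers the ratio, so fewer passes over the table are needed.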

Sort
The rows processed by the scan phase are passed directly to the sort package. Each column in the row for which statistics are being generated is passed to its own invocation of a sort.

Build
After the sort is completed, the sorted column data is read to count duplicates and unique values, creating approximately 200 range bins by default. Any duplicated value whose count exceeds 25% of the bin size is placed in an overflow bin.
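The build step can be sketched as follows. This is a simplified model, not the server's implementation: it assumes the bin size is the row count times the resolution percentage, and that rows left after removing overflow values are cut into equal-sized bins. The function name and the sample data are hypothetical.

```python
from collections import Counter


def build_distribution(values, resolution_pct=0.5):
    """Return (bins, overflow) for one column.

    bins:     list of (rows_in_bin, unique_values, highest_value)
    overflow: list of (rows_for_value, value)
    """
    values = sorted(values)
    if not values:
        return [], []
    bin_size = max(1, int(len(values) * resolution_pct / 100))
    counts = Counter(values)
    # A value whose count exceeds 25% of the bin size goes to overflow.
    threshold = 0.25 * bin_size
    overflow = [(n, v) for v, n in sorted(counts.items()) if n > threshold]
    skewed = {v for _, v in overflow}
    rest = [v for v in values if v not in skewed]
    bins = []
    for start in range(0, len(rest), bin_size):
        chunk = rest[start:start + bin_size]
        bins.append((len(chunk), len(set(chunk)), chunk[-1]))
    return bins, overflow


# Hypothetical column: 0..99 once each, plus 60 extra rows holding 43.
sample = list(range(100)) + [43] * 60
bins, overflow = build_distribution(sample, resolution_pct=25)
print(bins)
print(overflow)
```

With this sample the skewed value 43 lands in the overflow list with its exact row count, and the remaining rows are spread over three range bins.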

Insert
The old distributions are now deleted and the new distributions inserted. As long as the user is not already in a transaction, this is done as its own transaction. The transaction lasts less than 1 second and holds NO locks on the tables, only locks on the system catalogs while the update occurs.

Sample Size
HIGH
  All rows in the table
MEDIUM
  A common misconception is that the number of rows sampled is based on the number of rows in the table; this is incorrect.
  The number of samples depends on the Confidence and Resolution.
  If the sample size is greater than the number of rows in the table, Medium turns into High mode.

Update Statistics Medium Sample Size

Update Statistics Medium Memory Requirements

Update Statistics High Memory Requirements
In-memory sort
Approximate memory = number of rows * sum(column widths + 2 * sizeof(pointer))

Memory Rules
Estimated update statistics memory is below 100 MB
  Hard-coded limit of 4 MB
  Attempts to minimize the scans by fitting as many columns as possible into 4 MB
Estimated update statistics memory is above 100 MB
  Memory is requested from MGM
  Attempts to minimize the scans by fitting as many columns as possible into the MGM memory
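One way to picture these rules is a budget choice followed by packing columns into passes. The sketch below is an assumption-laden illustration: the greedy smallest-first packing order, the handling of an estimate of exactly 100 MB, the function names, and the column sizes are all hypothetical and are not taken from the server.

```python
HARD_LIMIT_MB = 4      # fixed budget when the estimate is below the threshold
MGM_THRESHOLD_MB = 100


def sort_budget_mb(estimated_mb, mgm_grant_mb):
    """Pick the sort budget: 4 MB below the threshold, else the MGM grant."""
    # Assumption: an estimate of exactly 100 MB is treated as an MGM request.
    return HARD_LIMIT_MB if estimated_mb < MGM_THRESHOLD_MB else mgm_grant_mb


def plan_passes(column_mb, budget_mb):
    """Greedily group columns (smallest first) so each pass fits the budget."""
    passes, current, used = [], [], 0.0
    for name, size in sorted(column_mb.items(), key=lambda kv: kv[1]):
        if current and used + size > budget_mb:
            passes.append(current)
            current, used = [], 0.0
        # A column larger than the budget still gets a pass of its own.
        current.append(name)
        used += size
    if current:
        passes.append(current)
    return passes


# Hypothetical per-column sort sizes in MB.
columns = {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 6.0}
small = plan_passes(columns, sort_budget_mb(12.0, mgm_grant_mb=50))
large = plan_passes(columns, 15)
print(small)
print(large)
```

Each inner list is one table scan, so a larger budget means fewer scans for the same columns.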

Examples
Customer Table
Number of rows: 500,000
  Cust_id   integer
  Fname     char(50)
  Lname     char(50)
  Address1  char(200)
  Address2  char(200)
  State     char(2)
  zipcode   integer

Examples Memory for Incore Sort
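The figures from this slide are not present in the transcript. As a rough sketch, the approximate formula given for update statistics high can be applied to the customer table. Two assumptions are made here: pointers are 8 bytes, and the two-pointer overhead is counted once per column. The constant and function names are hypothetical.

```python
POINTER_BYTES = 8  # assumption: 64-bit pointers; the slide does not give a size

# Column widths in bytes for the customer table (INTEGER is 4 bytes).
CUSTOMER_COLUMNS = {
    "cust_id": 4,
    "fname": 50,
    "lname": 50,
    "address1": 200,
    "address2": 200,
    "state": 2,
    "zipcode": 4,
}
ROWS = 500_000


def high_sort_memory_bytes(rows, column_widths, pointer_bytes=POINTER_BYTES):
    """rows * sum(column width + 2 * sizeof(pointer)) over the chosen columns."""
    return rows * sum(width + 2 * pointer_bytes for width in column_widths)


total = high_sort_memory_bytes(ROWS, CUSTOMER_COLUMNS.values())
print(f"{total:,} bytes, about {total / 2**20:.1f} MiB")
```

Under these assumptions the estimate is well above 100 MB, so by the memory rules this table would request its sort memory from MGM.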

Examples Number of Table Scans

Confidence
A factor in the number of samples used by update statistics medium

Resolution
Percentage of data that is represented in a distribution bin
Example
  100,000 rows in the table
  Resolution of 2%
  Each bin will represent 2,000 rows
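The resolution arithmetic can be checked with a short sketch. The function name is hypothetical, and the bin count is an assumption derived as 100 divided by the resolution, rounded up.

```python
import math


def bin_layout(total_rows, resolution_pct):
    """Return (rows represented per bin, number of bins) for a resolution."""
    rows_per_bin = total_rows * resolution_pct / 100
    number_of_bins = math.ceil(100 / resolution_pct)
    return rows_per_bin, number_of_bins


print(bin_layout(100_000, 2))    # (2000.0, 50)
print(bin_layout(100_000, 0.5))  # (500.0, 200)
```

A resolution of 0.5 yields 200 bins, which matches the default of approximately 200 range bins mentioned in the build step.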

Improvements in Update Statistics
Update statistics could not allocate between 4 MB and 100 MB of sort memory
  The default has been raised from 4 MB to 15 MB
  The user can now configure the amount of memory
  DBUPSPACE has been augmented to include memory
  Format of DBUPSPACE: {max disk space}:{default memory}
  To increase the memory to 35 MB, set DBUPSPACE=0:35
Allow update statistics to use light scans when scanning a table
  Implemented light scans
  Set-oriented reads
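To make the two-field format concrete, here is a small parsing sketch. It is not part of any Informix tool: the function name is hypothetical, only the {max disk space}:{default memory} form from the slide is handled, and the 15 MB fallback is the new default stated above. The unit of the first field is not given on the slide, so it is returned as a plain number.

```python
DEFAULT_MEMORY_MB = 15  # new default stated on the slide


def parse_dbupspace(value):
    """Split 'max_disk:memory_mb' into two integers, applying defaults."""
    parts = value.split(":")
    disk_part = parts[0].strip()
    memory_part = parts[1].strip() if len(parts) > 1 else ""
    max_disk = int(disk_part) if disk_part else 0
    memory_mb = int(memory_part) if memory_part else DEFAULT_MEMORY_MB
    return max_disk, memory_mb


print(parse_dbupspace("0:35"))  # (0, 35)
```

Under this reading, DBUPSPACE=0:35 leaves the disk-space field at 0 and raises the sort memory to 35 MB.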

Improvements in Update Statistics
Information about building data distributions was not viewable by the DBA
  Set explain will now print the scan path and resource usage when building data distributions
Update statistics low on fragmented tables did not run in parallel
  With PDQ turned on, each index fragment will be scanned in parallel
  PDQ at 1 means 10% of the index fragments are scanned in parallel, while PDQ at 10 means all the index fragments are scanned in parallel

Improvements in Update Statistics
Various errors (126, 312, 100,…) when executing update statistics
  Errors occurred when trying to insert the distributions because set lock mode to wait was not handled properly inside update statistics
Range scanning a fragmented index is slow
  Replaced the nested loop merge with a binary search merge when ordering items from index fragments
  Most noticeable when the number of fragments in an index is large
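The idea behind a binary search merge can be sketched as below. This is one interpretation, not the server's code: the current head item of each fragment is kept in a sorted list, and each new head is placed by binary search rather than by comparing against every fragment. The function name and the sample fragments are hypothetical.

```python
import bisect

_DONE = object()  # sentinel marking an exhausted fragment


def merge_fragments(fragments):
    """Merge already-sorted fragments into one sorted list."""
    iterators = [iter(fragment) for fragment in fragments]
    heads = []  # sorted list of (value, fragment_index)
    for index, iterator in enumerate(iterators):
        first = next(iterator, _DONE)
        if first is not _DONE:
            bisect.insort(heads, (first, index))
    merged = []
    while heads:
        value, index = heads.pop(0)  # smallest head across all fragments
        merged.append(value)
        following = next(iterators[index], _DONE)
        if following is not _DONE:
            # Binary search finds the slot in O(log K) comparisons.
            bisect.insort(heads, (following, index))
    return merged


result = merge_fragments([[1, 4, 9], [2, 3, 10], [], [5]])
print(result)
```

With K fragments, each output item costs about log2(K) comparisons instead of K, which is consistent with the gain being most noticeable when an index has many fragments.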

Example
The following examples use:
  Table size: 215,000 rows
  Row size: 445 bytes
  Uniprocessor

Example of the current update statistics
Table: jmiller.t9
Mode: HIGH
Number of Bins:
Bin size 1082
Sort data MB
Sort memory granted 4.0 MB
Estimated number of table scans 10
PASS #1 c9
PASS #2 c5
PASS #3 c7
PASS #4 c6
…..
PASS #10 c4
Completed pass 1 in 0 minutes 24 seconds
Completed pass 2 in 0 minutes 20 seconds
Completed pass 3 in 0 minutes 17 seconds
Completed pass 4 in 0 minutes 17 seconds
Completed pass 5 in 0 minutes 17 seconds
Completed pass 6 in 0 minutes 15 seconds
Completed pass 7 in 0 minutes 14 seconds
Completed pass 8 in 0 minutes 15 seconds
Completed pass 9 in 0 minutes 16 seconds
Completed pass 10 in 0 minutes 14 seconds
Total Time 146 seconds

The New Defaults
Table: jmiller.t9
Mode: HIGH
Number of Bins:
Bin size
Sort data MB
Sort memory granted MB
Estimated number of table scans 7
PASS #1 c9,c8,c10,c5,c7
PASS #2 c6,c1
PASS #3 c3
PASS #4 c2
PASS #5 c4
Completed pass 1 in 0 minutes 34 seconds
Completed pass 2 in 0 minutes 19 seconds
Completed pass 3 in 0 minutes 16 seconds
Completed pass 4 in 0 minutes 14 seconds
Completed pass 5 in 0 minutes 15 seconds
Total Time 98 seconds
New Memory Default

Enabling PDQ with Update Statistics
Table: jmiller.t9
Mode: HIGH
Number of Bins:
Bin size
Sort data MB
PDQ memory granted MB
Estimated number of table scans 1
PASS #1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
Index scans disabled
Light scans enabled
Completed pass 1 in 0 minutes 29 seconds
PDQ Memory Features Enabled
Total Time 29 seconds

Tuning with the New Statistics
Turn on PDQ when running update statistics, but only for tables
Avoid PDQ when updating statistics for procedures
When running high or medium, increase the memory update statistics has to work with
Enable parallel sorting (i.e. PSORT_NPROCS)

Considerations
Change the RESOLUTION to 1.5
  Increases the number of bins for the distributions
  Increases the sample size for update statistics medium

Old Recommendations
Start one update statistics for each column of a table
  Fname
  Lname
  Address
Three sequential scans of the table

New Recommendations
Start one update statistics for ALL columns, giving it more resources (memory)
Requires only one scan of the table to produce distributions on several columns
  Fname
  Lname
  Address
One scan of the table
