2 DAT353 Analysis Services: Server Internals Tom Conlon Program Manager SQL Server Business Intelligence Unit Microsoft Corporation

3 Purpose of this Session
Remove some of the mystery
Explain how it is that we do some things so much better than our competitors
Things are easier to understand when the internals are understood
Requirements:
– You already know the basics – this is for the experienced

4 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

5 Architecture – Single Server
[Diagram: a client application talks through ADO MD and the PivotTable Service (OLE DB for OLAP) to the Analysis Server, which handles processing and querying against the OLAP store; Analysis Manager administers the server via DSO; source data arrives over OLE DB from the SQL Server data warehouse or other OLE DB providers.]

6 Component Architecture – Query
[Diagram: MDX flows from the client-side MSOLAP80.DLL (formula engine, cache, metadata manager, agent) to the server process MSMDSRV.EXE, which contains the server storage engine, cache, and metadata manager.]

7 Component Architecture – Management
[Diagram: in addition to the MDX query path through MSOLAP80.DLL and MSMDSRV.EXE, DDL flows through MSMDGD80.DLL (DCube parser) and MSMDCB80.DLL (DCube storage engine) to the server’s metadata manager.]

8 Component Architecture – Distributed
[Diagram: the client-side components (MSOLAP80.DLL with MSMDGD80.DLL DCube parser and MSMDCB80.DLL DCube storage engine) send MDX to one MSMDSRV.EXE server, which can draw on a second MSMDSRV.EXE instance, each with its own server storage engine, cache, and metadata manager.]

9 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

10 Why Aggregations?
Aggregations can result in orders-of-magnitude improvements in performance
– Don’t have to access every fact table record to determine a query result
– Further savings with data compression
– Biggest savings: reduced disk scan

11 Aggregations – Overview
Dimensions: Customers (All Customers, Country, State, City, Name) and Products (All Products, Category, Brand, Name, SKU)
Facts (custID, SKU, Units Sold, Sales), e.g. 345-23, 13512, 32, $45.67; 563-01, 45123, 634, $67.32; …
Highest-level aggregation (Customer, Product, Units Sold, Sales): All, All, 347814123, $345,212,301.3
Intermediate aggregation (countryCode, productID, Units Sold, Sales), e.g. Can, sd452, 9456, $23,914.30; US, yu678, 4623, $57,931.45; …

12 Partial Aggregation
Don’t want to create all possible aggregations
– Data explosion!
What if a query is made to a combination of levels where no aggregation exists?
– Can compute from lower-level aggregations
– Don’t need to compute every possible aggregation
Example (Customers: All Customers, Country, State, City, Name; Products: All Products, Category, Brand, Name, SKU): queries including a combination of Country and Brand can be answered if the aggregation Country by Name exists.
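That roll-up can be sketched in a few lines of Python. The data, member names, and function below are purely illustrative, not Analysis Services code: a (Country, Brand) query is answered by summing a stored (Country, Name) aggregation through a Name-to-Brand mapping.

```python
# Hypothetical stored (Country, Name)-level aggregation: (country, product name) -> units sold.
country_by_name = {
    ("USA", "Alpha Cola"): 120,
    ("USA", "Alpha Diet"): 80,
    ("Canada", "Alpha Cola"): 40,
}

# Hypothetical mapping from each product name up to its brand.
brand_of = {"Alpha Cola": "Alpha", "Alpha Diet": "Alpha"}

def rollup_to_country_brand(agg, brand_of):
    """Answer a (Country, Brand) query by summing the finer (Country, Name) aggregation."""
    result = {}
    for (country, name), units in agg.items():
        key = (country, brand_of[name])
        result[key] = result.get(key, 0) + units
    return result
```

Because Name rolls up to Brand, no fact-table access is needed: the coarser answer is computed entirely from the stored, lower-level aggregation.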

13 Aggregations
[Diagram: a pyramid rising from the fact table through the most detailed aggregations (m,m,m,…) to the highest level of aggregation (1,1,1,1,…) – “Show me all sales for all products for all…”.]

14 Partial Aggregation
[Diagram: the same pyramid, with only a subset of the aggregations between the fact table and the highest level actually materialized.]

15 Aggregation Design
[Diagram: candidate aggregations above the fact table, e.g. Month by Product, Quarter by Product Family, Quarter by Product.]

16 Aggregation Design Results
Result: aggregations designed in waves from the top of the pyramid
At 100% aggregation, the ‘waves’ all touch: overkill
20–30% is generally adequate (0% for the smaller cubes)

17 Aggregation Design
Which aggregations are more important than others?
– All are equal
– From a design perspective, select the ones that result in overall improved query performance
Usage-Based Optimization: weightings on each aggregation based on usage frequency

18 Flexible and Rigid Aggregations
‘Flexible’ aggregations are deleted when a changing dimension is incrementally processed
‘Rigid’ aggregations remain valid
Changing dimensions allow members to be moved, added, and deleted
After members move, only an incremental process of the dimension is required
[Diagram: when member X is moved from a child of A to a child of C, all aggregations involving A or C are invalidated.]

19 Aggregation Data Storage
No impact on fact data or rigid aggregation data when a changing dimension is incrementally processed
Flexible aggregations are invalidated when a changing dimension is incrementally processed
Data is in three files:
– partitionName.fact.data
– partitionName.agg.rigid.data
– [partitionName.agg.flex.data]
[Diagram: aggregations including the (All) level are rigid (if all other levels in the agg are rigid); aggregations including a level whose members can change are always flexible; aggregations including a non-changing level are rigid (if all other levels in the agg are rigid).]
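The rigid-versus-flexible rule above reduces to a single predicate. A sketch under the slide’s stated rule, with invented level names and without the server’s actual metadata model:

```python
def aggregation_kind(levels_used, flexible_levels):
    """Classify an aggregation by the levels it includes.

    Rule from the slide: an aggregation is rigid only if every level it
    uses is rigid; the (All) level counts as rigid, and any level of a
    changing dimension (a 'flexible' level) makes the whole aggregation
    flexible.
    """
    if any(level in flexible_levels for level in levels_used):
        return "flexible"
    return "rigid"
```

So one flexible level anywhere in the aggregation is enough to land it in partitionName.agg.flex.data rather than the rigid file.
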

20 Incremental Dimension Processing (Changing Dimension)
Query and process dimension data: keys, member names, member properties
For each cube using this dimension: delete flexible aggs and indexes, then start lazy indexing and aggregating
Potential resource competition during lazy processing after a changing dimension is incrementally processed
Fewer aggregations!
Result: query performance degradation

21 Flexible Aggregation (demo)

22 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

23 Data Storage
No data stored for empty member combinations
With compression, data storage is approximately 1/3 of the space required in the RDBMS source
Data is stored by record in pages
Each record contains all measures at an intersection of dimension members
[Page layout: Record 1 … Record 256, each holding the dimension members (mbr d1, mbr d2, … mbr dn) and the measures (m1, m2, … mn).]

24 Data Structures
Partition data is stored in a file divided into segments
Each segment contains 256 pages (each with 256 records) = 64K records
Only the last segment has fewer than 256 pages
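With 256 records per page and 256 pages per segment, locating a record is plain arithmetic. A minimal sketch of the mapping (the function and numbering convention are ours, not the on-disk format):

```python
RECORDS_PER_PAGE = 256
PAGES_PER_SEGMENT = 256
RECORDS_PER_SEGMENT = RECORDS_PER_PAGE * PAGES_PER_SEGMENT  # 64K records per segment

def locate(record_no):
    """Map a 0-based record number to its 0-based (segment, page, slot)."""
    segment, within_segment = divmod(record_no, RECORDS_PER_SEGMENT)
    page, slot = divmod(within_segment, RECORDS_PER_PAGE)
    return segment, page, slot
```

Record 65,536 is thus the first record of the second segment, which is why only the final segment can be partially filled.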

25 Clustering
Physical order of the records in each page and segment is organized to improve performance
– Keeps records with the same or close members together
– Similar in concept to a SQL clustered index, where data is sorted by key values
Try to minimize distribution of records with the same member across segments and pages
– Optimized, but no algorithm can keep all records for the same member together (unless the cube contains a single dimension)
– Similarly, SQL can only have a single clustered index
Records with identical dimension members can be in multiple segments
– Data is read and processed in chunks (more on this later…)

26 Indexing
How is the data retrieved?
– Cubes can be in the terabyte range
– Scanning data files is not an option
Need an index by dimension member
– Answers the question “Where is the data associated with this combination of dimension members?”
Map files provide this

27 Map Files
There is a map for each dimension which indicates the pages where the member is included in a data record
[Diagram: Dimension 1 and Dimension 2 maps list, per member, the pages of Segment 1 that contain it.]
To resolve a query containing a member from each dimension, get the list of pages containing all the members
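The page lookup described above amounts to a set intersection across the per-dimension maps. A sketch with invented dimension names, members, and page numbers:

```python
# Hypothetical per-dimension maps: member -> set of pages containing it.
maps = {
    "Customers": {"Canada": {1, 2, 5}},
    "Products": {"Drink": {2, 5, 6}},
}

def candidate_pages(maps, query_members):
    """Return only the pages that contain every queried member, so just
    those pages need to be read from the data file."""
    page_sets = [maps[dim][member] for dim, member in query_members.items()]
    return set.intersection(*page_sets)
```

A query for Canada and Drink reads only the pages both members share, rather than scanning the whole segment.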

28 Other Approaches
Array based
– Normally allocates a cell for every combination
– Result: data explosion – much more disk space and longer processing times
Mix of record and array
– ‘Dense’ dimensions are record-like; sparse ones are array-like
– A bit is used per empty cell, so sparsity explodes database sizes
– User decides whether a dimension is dense or sparse

29 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

30 Processing Buffer Memory Settings
‘Read-Ahead Buffer Size’ is the buffer containing data read from the source database
– Defined in the Server Properties dialog. Default: 4 MB
– Rarely important: little effect when changed
Data is processed in chunks of ‘Process Buffer Size’
– Defined in the Server Properties dialog
– Data is clustered within the process buffer
The bigger the Process Buffer Size the better – make it as big as possible
– Data for dimension members is clustered to keep data for ‘close’ members close together
– The larger these memory settings, the more effective the clustering

31 Incremental Processing
Two-step process:
First, a partition is created with the incremental data
Second, that partition is merged with the original
Complete segments of both partitions are left intact; incomplete ones are merged
After many incremental processes, data becomes distributed: degraded performance
Reprocessing (if you have a large Process Buffer Size) can provide improved performance

32 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

33 Querying a Cube
CLIENT:
Select {[North America],[USA],[Canada]} on rows, Measures.members on columns from myCube
Need two things:
– getting the dimension members – the axes
– getting the data

34 Resolve Axis – ‘Christmas Trees’
Dimension members are cached on the client in the ‘Client Member Cache’
– Levels with fewer members than the Large Level Threshold are sent as a group
– Levels with more members than the Large Level Threshold are retrieved as needed
– Large Level Threshold default value: 1000; can be changed in server properties and in the connection string
Where members are not cached, members and their descendants are retrieved to the client until the needed member is retrieved
Levels whose members have 1000s of siblings result in degraded performance
The member cache is not cleaned except on disconnect or when the cube structure changes
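The Large Level Threshold decision comes down to a single comparison. A sketch (the function name is ours; the equality case is not specified on the slide and is treated here as “retrieve as needed”):

```python
LARGE_LEVEL_THRESHOLD = 1000  # server default; overridable per connection

def member_fetch_strategy(level_member_count, threshold=LARGE_LEVEL_THRESHOLD):
    """Small levels are shipped to the client member cache in one group;
    large levels are retrieved member-by-member as needed."""
    if level_member_count < threshold:
        return "send whole level"
    return "retrieve as needed"
```
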

35 Client Data Cache
The client retains data from previous queries in the client data cache
The Client Cache Size property controls how much data is in the client cache
– When 0: unlimited
– 1–99 (inclusive): percent of physical memory
– >99: use up to the value in KB
Default value: 25
When exceeded, the client cache is cleaned at cube granularity
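The three-way interpretation of the property can be written out directly. A sketch assuming exactly the rules on the slide (the function name is ours):

```python
def client_cache_limit_kb(setting, physical_memory_kb):
    """Interpret the Client Cache Size property (default 25):
    0 means unlimited, 1-99 a percentage of physical memory,
    and anything above 99 an absolute limit in KB."""
    if setting == 0:
        return None  # unlimited
    if 1 <= setting <= 99:
        return physical_memory_kb * setting // 100
    return setting
```

Note the unit switch at the boundary: a setting of 99 means 99% of physical memory, while 100 means just 100 KB.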

36 How Cubes Are Queried
[Diagram: the query processor serves client queries from the client data cache, the server’s dimension memory and cache memory, and the data on disk – partitions (Canada, Mexico, USA), each made up of segments.]

37 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

38 Service Start Up
Minimum Allocated Memory defines the amount of memory completely dedicated to the server
All dimensions in the database are retained in memory
– Tip: invalidate a dimension if it is not used in a cube
Dimension requirements: ~125 bytes per member plus member properties
– 1M members: 125 MB
– With a 25-char member property (e.g., Address): 175 MB
Large dimensions can migrate to a separate process space
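The slide’s two figures are consistent with roughly 2 bytes per member-property character (Unicode) on top of the ~125 bytes per member. A back-of-envelope estimator under that assumption (the constants and function are ours):

```python
BYTES_PER_MEMBER = 125        # rough per-member cost from the slide
BYTES_PER_PROPERTY_CHAR = 2   # assumption: Unicode member-property characters

def dimension_memory_bytes(member_count, property_chars_per_member=0):
    """Estimate dimension memory: ~125 bytes per member plus member properties."""
    per_member = BYTES_PER_MEMBER + BYTES_PER_PROPERTY_CHAR * property_chars_per_member
    return member_count * per_member
```

One million members costs about 125 MB bare, and about 175 MB once each carries a 25-character property such as an address.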

39 During Processing
Shadow dimensions
– Two copies of dimensions are stored in memory while processing
Processing buffers
– Read-Ahead Buffer Size
– Process Buffer Size
If dimension and processing-buffer memory requirements exceed the Memory Conservation Threshold, there is no room for the data cache

40 During Querying
The data cache stores query data for reuse
– Faster than retrieving from storage
If dimension memory requirements > Memory Conservation Threshold: no data cache
The ‘cleaner’ wakes up periodically to reclaim memory from the data cache
– BackgroundInterval registry setting. Default value: 30 seconds
Cleaning tiers, by allocated memory:
– up to 0.5 × (Minimum Allocated Memory + Memory Conservation Threshold): no cleaning
– above 0.5 × (Minimum Allocated Memory + Memory Conservation Threshold) and below the Memory Conservation Threshold: mild cleaning
– at or above the Memory Conservation Threshold: aggressive cleaning
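The cleaner’s tiers can be expressed as a small function. Purely illustrative: the names are ours, and the boundary comparisons follow the reading of the slide above (the original comparison symbols did not survive transcription):

```python
def cleaner_action(allocated, minimum_allocated, conservation_threshold):
    """Pick the cleaner's behavior from current allocation vs. the two settings:
    up to the midpoint of the settings, do nothing; between the midpoint and
    the Memory Conservation Threshold, clean mildly; above that, aggressively."""
    midpoint = 0.5 * (minimum_allocated + conservation_threshold)
    if allocated <= midpoint:
        return "none"
    if allocated < conservation_threshold:
        return "mild"
    return "aggressive"
```

For example, with Minimum Allocated Memory of 100 and a Memory Conservation Threshold of 300 (any units), the midpoint is 200, so cleaning only begins once allocation passes 200.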

41 Setting Server Properties (demo)

42 Agenda
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

43 Distinct Count
Business Problem: Sales Manager wants to know:
– “How many customers are buying Computers?”
– “How many active customers do I have?”
[Table: sales per product-tree node – All products 8000; Hardware 3300 (Computers 2000, Monitors 800, Printers 500); Software 4700 (Home 1500, Business 2500, Games 700) – shown alongside the number of distinct customers at each node (30, 80, 100, 150, 60, 70, 80, 200).]
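What makes this measure hard is that distinct counts are not additive up the hierarchy. With small invented customer sets:

```python
# Hypothetical sets of customer IDs per leaf category.
customers = {
    "Business": {"c1", "c2", "c3"},
    "Home": {"c2", "c4"},
}

def distinct_customers(categories):
    """A parent's distinct count is the size of the UNION of its children's
    customer sets; the child counts (3 and 2 here) cannot simply be summed."""
    seen = set()
    for category in categories:
        seen |= customers[category]
    return len(seen)
```

Summing the children would report 5 customers for the parent, but because one customer bought in both categories, the true distinct count is 4; this is why ordinary aggregation machinery does not suffice.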

44 Distinct Count: Changes to Data Structure
The DC measure is stored with each fact and aggregation record
– Just like a new dimension
Data is ordered by the DC measure
– An “order by” is included in the SQL statement during processing
The number of records can be increased by orders of magnitude
– Dependent on the number of distinct values per record
[Sample aggregate record without distinct count (countryCode, productID, Units Sold, Sales): Can, sd452, 9456, $23,914.30. With a distinct count on customers, the record count grows: one row per CustID (132-45, 432-39, 639-53, 430-30, 964-90, …), each carrying its own units and sales.]
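The record growth follows from the DC measure joining the grouping key, exactly as if it were another dimension. An illustrative sketch with invented rows (not the server’s processing code):

```python
# Hypothetical fact rows: (custID, countryCode, productID, units).
facts = [
    ("132-45", "Can", "sd452", 130),
    ("432-39", "Can", "sd452", 232),
    ("132-45", "Can", "sd452", 40),
]

def aggregate(rows, key):
    """Group rows by `key` and sum the units, like the processing query does."""
    out = {}
    for row in rows:
        out[key(row)] = out.get(key(row), 0) + row[3]
    return out

# Without distinct count: one stored record per (countryCode, productID).
plain = aggregate(facts, lambda r: (r[1], r[2]))

# With a distinct count on customers, custID joins the grouping key, so the
# number of stored records grows with the number of distinct customers.
with_dc = aggregate(facts, lambda r: (r[0], r[1], r[2]))
```

Here a single aggregate record fans out into one record per distinct customer, which is how real cubes can grow by orders of magnitude.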

45 Distinct Count: Changes to Query
A single thread per partition instead of per segment
– Unlike regular cubes, cannot do a single aggregation of results from each segment, as a single value of the DC measure can cross segments
– Consequently: a performance impact
A dimension slice requires much more disk scan than before
– Segments are clustered by the DC measure
– Expensive

46 Distinct Count Tips
Keep DC measures in their own cube
– All measures are retrieved on query – even if some are not asked for
– Create a virtual cube to merge DC with other measures
Incrementally processing DC cubes is very expensive
– Segments are restructured and reordered to keep records ordered by the DC measure
– Time and memory intensive

47 Distinct Count Tips
Unlike regular cubes, it is best to distribute DC values evenly across each partition
– Most effective use of multiple threads for query processing
If you have a dimension that corresponds to the distinct count measure
– Aggregations are recommended only on the lowest level
– (Example: Customer dimension in the cube, Customer as the distinct count measure)

48 Summary
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count

49 Don’t forget to complete the on-line Session Feedback form on the Attendee Web site: https://web.mseventseurope.com/teched/
