Using Columnstore indexes in Azure DevOps Services. Lessons learned.

1 Using Columnstore indexes in Azure DevOps Services. Lessons learned.
Konstantin Kosinsky

2 Thanks To Our Sponsors

3 About
Principal Software Engineer in Azure DevOps Analytics service
SQL Server / Data Platform MVP before joining Microsoft in 2012
@kkosinsky

4 Agenda
What am I working on, and how do Columnstore indexes help?
Columnstore indexes internals
Lessons learned 1..9

5 Azure DevOps Analytics Service
Reporting platform for Azure DevOps. Includes data from:
Pipelines (CI/CD)
Work Item Tracking (stories, bugs, etc.)
Automated and Manual Tests
Code (Azure Repos and GitHub) – coming soon

6 Analytics Service: In Product

7 Analytics Service: Power BI

8 Analytics Service: OData
OData v4.0 with Aggregation Extensions

10 Query Engine Requirements
Must support:
Huge amounts of data
Queries that aggregate data across millions of records
Arbitrary filters, aggregations, and groupings
Both updates and deletes (due to late-arriving data and re-transformations)
Near real-time ingestion and availability of new data
On-premises installations
Subsecond query response times
Online DB maintenance operations
...all within a reasonable cost structure when deployed in large multi-tenant environments.

11 Columnstore Indexes
10x+ data compression
Good performance for data warehouse queries
No need to create and maintain indexes for each report
Still support updates and trickle inserts
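
A minimal sketch of the setup this enables – table and column names are illustrative, not from the talk:

CREATE TABLE dbo.TestResult
(
    TestResultId    BIGINT  NOT NULL,
    CompletedDateSK INT     NOT NULL,  -- surrogate date key, YYYYMMDD
    DurationSeconds INT     NOT NULL,
    Outcome         TINYINT NOT NULL
);

-- A single clustered columnstore index serves all reporting queries,
-- so there is no per-report B-Tree index to create and maintain.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_TestResult ON dbo.TestResult;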

12 Columnstore Internals
[Diagram: rows are stored in row groups; each column segment carries min/max metadata. For column c4 the three row groups have ranges 1–10, 11–20, and 21–30. SELECT sum(c1), sum(c2) FROM MyTable reads the c1 and c2 segments of every row group; adding WHERE c4 > 22 lets the engine skip the row groups whose c4 range cannot match, reading only the 21–30 one.]

13 Why Columnstore Indexes Are Performant
Segment elimination
Predicate pushdown
Local aggregation (aka aggregate pushdown)
Compression
Batch mode
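
Segment elimination relies on per-segment min/max metadata, which you can inspect yourself. A sketch (for numeric columns min_data_id/max_data_id hold the raw values; for dictionary-encoded columns they are dictionary ids):

SELECT c.name AS column_name, s.segment_id, s.row_count,
       s.min_data_id, s.max_data_id
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p ON s.partition_id = p.partition_id
JOIN sys.columns AS c ON p.object_id = c.object_id
                     AND c.column_id = s.column_id
WHERE p.object_id = OBJECT_ID('dbo.TestResult')
ORDER BY c.name, s.segment_id;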

14 Lesson #1: Data Types Are Important
Not all data types support aggregate pushdown:

SELECT SUM(DurationSeconds) FROM AnalyticsModel.tbl_TestResultDaily
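
Aggregate pushdown is documented to work only for types of 8 bytes or less (and never for strings), so the declared type of the measure decides whether the SUM above runs inside the scan. A hedged sketch of the distinction:

-- DECIMAL(27, 9) or DATETIMEOFFSET are wider than 8 bytes: SUM over
-- them cannot be pushed into the columnstore scan and runs row by row.
-- Stored as INT or BIGINT, the same SUM is computed inside the scan:
SELECT SUM(DurationSeconds)               -- pushdown applies only if
FROM AnalyticsModel.tbl_TestResultDaily;  -- DurationSeconds is INT/BIGINT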

15 Lesson #2: Cardinality Matters
The number of distinct values affects segment size:

CompletedDate DATETIMEOFFSET for 1B rows -> 5.8 GB
CompletedDate DATETIMEOFFSET(0) for 1B rows -> 2.2 GB
CompletedDateSK INT (YYYYMMDD) or DATE for 1B rows -> 900 KB

SELECT SUM(on_disk_size)
FROM sys.column_store_segments s
JOIN sys.partitions p ON s.partition_id = p.partition_id
JOIN sys.columns c ON p.object_id = c.object_id AND c.column_id = s.column_id
WHERE p.object_id = OBJECT_ID('AnalyticsModel.tbl_TestResult')
  AND c.name = 'CompletedDateSK';

16 Lesson #2: Cardinality Matters
The number of distinct values per row group affects aggregate pushdown for GROUP BY.
[Table from the slide: 1M rows per row group; pushdown behavior differs between 16K and 17K unique values in the grouped column.]
The number of distinct values isn't the only criterion.

17 Lesson #3: Predicate Pushdown
Avoid predicates that touch multiple columns:

ColumnA = 1 OR ColumnB = 2   -- no pushdown
ColumnA > ColumnB            -- no pushdown
ColumnA = 1 AND ColumnB = 2  -- pushdown allowed

Consider creating custom columns that combine the logic (see the sketch below)
Pushdown for string predicates has limitations
No string pushdown before SQL Server 2016
Consider replacing strings with numeric codes or surrogate keys
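
One way to get a pushdown-friendly predicate (a sketch with hypothetical names, not the exact Analytics schema) is to materialize the multi-column condition into a single column during ETL:

-- Maintained by ETL as:
--   IsInteresting = CASE WHEN ColumnA = 1 OR ColumnB = 2 THEN 1 ELSE 0 END
ALTER TABLE dbo.MyTable ADD IsInteresting BIT NOT NULL DEFAULT 0;

-- Single-column comparison: eligible for predicate pushdown.
SELECT COUNT(*) FROM dbo.MyTable WHERE IsInteresting = 1;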

18 Lesson #3.1: Strings + Segment Elimination
Segment elimination works only for numeric and datetime types
It doesn't work for string and GUID types
In most cases this isn't a problem
When it is, consider replacing them with numeric codes or surrogate keys (sketched below)
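
A sketch of the surrogate-key replacement (hypothetical names): keep the GUID in a small rowstore dimension and store only the INT key in the columnstore fact table:

CREATE TABLE dbo.dim_Test
(
    TestSK   INT IDENTITY(1, 1) PRIMARY KEY,
    TestGuid UNIQUEIDENTIFIER NOT NULL UNIQUE
);

-- The fact table stores TestSK (INT); segment elimination and predicate
-- pushdown work on the INT key where they would not on the GUID.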

19 Lesson #4: Segment Elimination
Row groups should be aligned with your filters
Try to insert data in a way that helps segment elimination (one approach sketched below)
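
One common way to get filter-aligned row groups (a general technique, not necessarily how Analytics loads data) is to sort the table with a B-Tree first, then convert it to a columnstore index with MAXDOP 1, which preserves the sort order:

CREATE CLUSTERED INDEX CI_TestResult
    ON dbo.TestResult (CompletedDateSK);

-- DROP_EXISTING converts the sorted B-Tree into a columnstore index;
-- MAXDOP = 1 avoids parallel scans that would interleave the order.
CREATE CLUSTERED COLUMNSTORE INDEX CI_TestResult
    ON dbo.TestResult
    WITH (DROP_EXISTING = ON, MAXDOP = 1);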

20 Lesson #4: Segment Elimination
[Diagram: UPDATE MyTable SET c6 += 1 WHERE c4 < 9 OR c4 > 25. An update is a DELETE plus an INSERT, so the c4 ranges of the old row groups (1–10, 11–20, 21–30) and the newly inserted row group (1–30) overlap. A query such as SELECT sum(c1), sum(c2) FROM MyTable WHERE c4 > 22 now has to read both the old and the new row groups.]

21 Lesson #4: Segment Elimination (cont.)
Avoid updates if you can
Consider splitting the table:
Current – small, changes can still happen
History – records graduate here once they are done
Do periodic maintenance of the index

22 Lesson #5: Watch the Delta Store
The delta store is a HEAP without compression
To be compressed, a delta store row group needs to reach ~1M rows
Each query reads the entire delta store
The delta store can be larger than you expect
To compress it on demand (index name illustrative):

ALTER INDEX CCI_TestResult ON dbo.TestResult
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
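
A quick way to see how much data is sitting in delta stores (a sketch using the same DMV as the maintenance query later in the deck):

SELECT OBJECT_NAME(object_id) AS table_name, state_desc, total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE state_desc IN ('OPEN', 'CLOSED')  -- delta store row groups
ORDER BY total_rows DESC;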

23 Lesson #5: Watch the Delta Store
Can we avoid the delta store?
Yes, for CCIs: insert at least 102,400 rows per batch
Keep statistics up to date: a low row-count estimate leads to skipping that optimization
No for NCCIs
Can the delta store be more than 1M rows?
Many parallel writes can lead to multiple delta stores
When the Tuple Mover is busy
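
A sketch of a delta-store-free load into a CCI (names reused from the earlier sketch; 102,400 rows is the documented bulk-load threshold):

-- Each batch of >= 102,400 rows compresses directly into a row group,
-- bypassing the delta store.
INSERT INTO dbo.TestResult WITH (TABLOCK)
SELECT TestResultId, CompletedDateSK, DurationSeconds, Outcome
FROM staging.TestResult;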

24 Lesson #6: Physical Partitioning
In a multi-tenant environment you may have a mix of small and large tenants – the partitioning strategy matters.
All tenants in one large physical partition:
Small tenants must scan all records from big tenants
Locks are at the row group level
Column cardinality can be high, which means less segment elimination
One physical partition per tenant:
SQL Server limit of 15K partitions per table
Small tenants may never reach 1M rows and stay in the delta store forever
Compromise: group small tenants in one partition to help them compress, and give huge tenants dedicated partitions (sketched below)
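
A sketch of that compromise with hypothetical tenant keys: the long tail of small tenants lands in the first partition, and each big tenant gets its own:

-- TenantSK values below 1000 (small tenants) share partition 1;
-- boundary values carve out dedicated partitions for big tenants.
CREATE PARTITION FUNCTION pf_Tenant (INT)
    AS RANGE RIGHT FOR VALUES (1000, 1001, 1002);

CREATE PARTITION SCHEME ps_Tenant
    AS PARTITION pf_Tenant ALL TO ([PRIMARY]);

-- The fact table is then created ON ps_Tenant (TenantSK).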

25 Lesson #7: Physical Partition Maintenance
ALTER INDEX REORGANIZE – cleans up deleted rows and merges row groups:
Will not touch a row group if its trim reason is DICTIONARY_SIZE
Can merge old and new row groups and mess up segment elimination
ALTER INDEX REBUILD:
Removes all deleted rows
Does not guarantee insert order
May hurt segment elimination
To inspect row group health:

SELECT i.type_desc, CSRowGroups.state_desc, total_rows, deleted_rows,
       size_in_bytes, trim_reason_desc, transition_to_compressed_state_desc,
       100 * ISNULL(deleted_rows, 0) / total_rows AS Fragmentation
FROM sys.indexes AS i
JOIN sys.dm_db_column_store_row_group_physical_stats AS CSRowGroups
    ON i.object_id = CSRowGroups.object_id
   AND i.index_id = CSRowGroups.index_id
WHERE total_rows > 0
ORDER BY object_name(i.object_id), i.name, row_group_id;

26 Lesson #7: Physical Partition Maintenance
We built a custom solution that periodically restores the proper order:
A stored procedure sorts partitions that are in bad shape:
Clones the table structure
Copies data from the affected partition in the desired order (1M-row batches)
Stops ETL for the affected partition
Applies modifications that happened since the process started
Switches the partitions
Restarts ETL
SPLITs and MERGEs use the same approach (switch step sketched below)
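
The switch step at the end of that flow might look like this (a simplified sketch; partition numbers and table names are hypothetical):

-- 1. Switch the fragmented partition out of the live table...
ALTER TABLE dbo.TestResult
    SWITCH PARTITION 5 TO dbo.TestResult_Old PARTITION 5;

-- 2. ...and switch the freshly sorted copy in. Both are metadata-only
--    operations, so the table is unavailable only briefly.
ALTER TABLE dbo.TestResult_Sorted
    SWITCH PARTITION 5 TO dbo.TestResult PARTITION 5;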

27 Lesson #8: Schema Updates
Index and column modifications are mostly offline operations
Adding a NULL column, or a NOT NULL column with a DEFAULT, is online and fast
Adding a column and then issuing an UPDATE to set the new value leads to fragmentation: each row becomes a delta store insert plus a columnstore delete
Analytics Service solution: create a new table, copy data with the required modifications, switch the tables
Uses the same approach as maintenance to minimize data latency
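
A sketch of the difference (hypothetical column names):

-- Online, metadata-only, fast:
ALTER TABLE dbo.TestResult ADD Priority INT NULL;
ALTER TABLE dbo.TestResult ADD IsFlaky BIT NOT NULL DEFAULT 0;

-- Fragmenting: every touched row becomes a columnstore DELETE plus a
-- delta store INSERT. Avoid on large columnstore tables.
UPDATE dbo.TestResult SET Priority = 2;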

28 Lesson #9: Paging
Power BI needs raw data with arbitrary filters and projections
Power BI can request a lot of data, so we need to enforce server-side paging
The OFFSET – FETCH approach:
Needs sorting
Reads all the data and throws most of it away
Page N + 1 is more expensive than page N
Use a skip token (identity column) as a pointer to the next page:
Decreases the amount of data that needs to be thrown away
Still needs sorting
Page N + 1 is cheaper than page N
(Both shapes sketched below.)
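
Roughly how the two shapes compare (a sketch; @skipToken would be the last id returned on the previous page):

DECLARE @skipToken BIGINT = 0;

-- OFFSET-FETCH: page N + 1 scans and discards everything before it.
SELECT TestResultId, Outcome
FROM dbo.TestResult
ORDER BY TestResultId
OFFSET 100000 ROWS FETCH NEXT 10000 ROWS ONLY;

-- Skip token: the predicate bounds the scan, so later pages get cheaper.
SELECT TOP (10000) TestResultId, Outcome
FROM dbo.TestResult
WHERE TestResultId > @skipToken
ORDER BY TestResultId;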

29 Lesson #9: Paging (cont.)
To make sorting cheap we could use a B-Tree index, but we need arbitrary filters
Sorting a wide SELECT requires a lot of memory
Analytics solution: two queries, leveraging Columnstore behaviors:
The first query gets the page boundaries: eliminating most of the columns, using aggregation
The second query gets the page: eliminating most of the segments, no sorting
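
A sketch of the shape of the two queries (not the exact Analytics implementation):

DECLARE @skipToken BIGINT = 0;
DECLARE @pageStart BIGINT, @pageEnd BIGINT;

-- Query 1: page boundaries. Only the id column is scanned, and the
-- MIN/MAX aggregation keeps the memory footprint small.
SELECT @pageStart = MIN(TestResultId), @pageEnd = MAX(TestResultId)
FROM (SELECT TOP (10000) TestResultId
      FROM dbo.TestResult
      WHERE TestResultId > @skipToken  -- plus the arbitrary user filters
      ORDER BY TestResultId) AS page;

-- Query 2: the wide rows for exactly that id range. The range predicate
-- eliminates most segments and no sort is required.
SELECT *
FROM dbo.TestResult
WHERE TestResultId BETWEEN @pageStart AND @pageEnd;  -- plus the same filters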

30 Questions?
Contact me: @kkosinsky

