Using Columnstore Indexes in Azure DevOps Services: Lessons Learned
Konstantin Kosinsky
Please Thank our Sponsors:
About Me
Principal Software Engineer on the Azure DevOps Analytics service. SQL Server / Data Platform MVP before joining Microsoft.
Azure DevOps Analytics Service
Reporting platform for Azure DevOps (formerly Visual Studio Team Services). Includes data from:
- Work Item Tracking (stories, bugs, etc.)
- Builds
- Automated Tests
Analytics Service: In Product
Analytics Service: Power BI
Analytics Service: OData
OData v4.0 with Aggregation Extensions
Query Engine Requirements
Must support:
- Huge amounts of data
- Queries that aggregate data across millions of records
- Arbitrary filters, aggregations, and groupings
- Both updates and deletes, due to late-arriving data and re-transformations
- Near real-time ingestion and availability of new data
- On-premises installations
- Subsecond query response times
- Online DB maintenance operations
...all within a reasonable cost structure when deployed in large multi-tenant environments.
Columnstore Indexes
- 10x+ data compression
- Good performance for data warehouse queries
- No need to create and maintain indexes for each report
- Still support updates and trickle inserts
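For context, this is what turning a table into a columnstore looks like; a minimal sketch with made-up table and column names:

-- A rowstore table becomes columnstore by creating a clustered columnstore index.
CREATE TABLE dbo.BuildDemo (
    BuildId     int    NOT NULL,
    FinishedSK  int    NOT NULL,  -- YYYYMMDD date key
    DurationSec bigint NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_BuildDemo ON dbo.BuildDemo;

-- Updates and trickle inserts still work; small inserts land in the
-- delta store until a row group fills up (see Lesson #5).
INSERT INTO dbo.BuildDemo VALUES (1, 20190315, 542);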
Columnstore Internals
[Diagram: a table's columns stored as segments per row group, each segment carrying min/max metadata, e.g. c4 min = 1 / max = 10, min = 11 / max = 20, min = 21 / max = 30. For SELECT sum(c1), sum(c2) FROM MyTable, only the c1 and c2 segments are read; adding WHERE c4 > 22 lets the engine skip every row group whose c4 range cannot match.]
Why Columnstore Indexes Are Performant
- Segment elimination
- Predicate pushdown
- Local aggregation (aka aggregation pushdown)
- Compression
- Batch mode
Lesson #1: Data Types Are Important
Not all data types support aggregate pushdown:

SELECT SUM(DurationSeconds)
FROM AnalyticsModel.tbl_TestResultDaily
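The slide doesn't say which type DurationSeconds uses; aggregate pushdown generally requires a type of 8 bytes or fewer, so a wide DECIMAL is a plausible culprit. A hypothetical repro comparing the two (table name made up):

CREATE TABLE dbo.TestResultDemo (
    DurationWide decimal(38, 9) NOT NULL,  -- more than 8 bytes: no pushdown
    DurationSec  bigint         NOT NULL   -- 8 bytes: pushdown-eligible
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_TestResultDemo ON dbo.TestResultDemo;

-- Compare the actual execution plans: for DurationSec the columnstore scan
-- should report locally aggregated rows; for DurationWide the rows flow up
-- to the aggregate operator instead.
SELECT SUM(DurationWide) FROM dbo.TestResultDemo;
SELECT SUM(DurationSec)  FROM dbo.TestResultDemo;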
Lesson #2: Cardinality Matters
Number of distinct values affects segment size (sizes for 1B rows):
- CompleteDate DATETIMEOFFSET -> 5.8 GB
- CompleteDate DATETIMEOFFSET(0) -> 2.2 GB
- CompleteDateSK INT (YYYYMMDD) -> 900 KB

Measuring a column's segments on disk:

SELECT SUM(s.on_disk_size)
FROM sys.column_store_segments s
JOIN sys.partitions p ON s.partition_id = p.partition_id
JOIN sys.columns c ON p.object_id = c.object_id AND c.column_id = s.column_id
WHERE p.object_id = OBJECT_ID('AnalyticsModel.tbl_TestResult')
  AND c.name = 'CompletedDateSK';
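One plausible way to derive that YYYYMMDD integer key (the exact expression isn't shown in the deck):

-- Style 112 renders a date as 'yyyymmdd'; casting the string yields the INT key.
SELECT CAST(CONVERT(char(8), SYSDATETIMEOFFSET(), 112) AS int) AS CompletedDateSK;

Besides being small, the INT key gives each segment a narrow, date-aligned min/max range, which also helps segment elimination.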
Lesson #2: Cardinality Matters (cont.)
Number of distinct values per row group affects aggregate pushdown for GROUP BY.

[Table from the slide compared row count against unique values in the GROUP BY column: 1M rows with 16K vs. 17K unique values.]

The number of distinct values isn't the only criterion.
Lesson #3: Predicate Pushdown
Avoid predicates that touch multiple columns:

ColumnA = 1 OR ColumnB = 2   -- no pushdown
ColumnA > ColumnB            -- no pushdown
ColumnA = 1 AND ColumnB = 2  -- allows pushdown

- Consider creating custom columns that combine the logic (see the sketch below)
- Pushdown for string predicates has limitations; there is no string pushdown before SQL Server 2016
- Consider replacing strings with numeric codes or surrogate keys
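A sketch of the custom-column idea: materialize the multi-column condition into one column during load (table and column names are made up; the deck doesn't show this code):

CREATE TABLE dbo.MyTableDemo (
    ColumnA int NOT NULL,
    ColumnB int NOT NULL,
    c1      int NOT NULL,
    -- Populated by ETL as: CASE WHEN ColumnA = 1 OR ColumnB = 2 THEN 1 ELSE 0 END
    IsInteresting bit NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_MyTableDemo ON dbo.MyTableDemo;

-- A single-column comparison is pushdown-eligible, unlike the OR across two columns.
SELECT SUM(c1) FROM dbo.MyTableDemo WHERE IsInteresting = 1;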
Lesson #3.1: Strings + Segment Elimination
- Segment elimination works only for numeric and datetime types
- Segment elimination doesn't work for string and GUID types
- In most cases this isn't a problem
- When it is, consider replacing them with numeric codes or surrogate keys (see the sketch below)
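A minimal sketch of swapping a GUID out for an INT surrogate key (schema and names are assumptions):

-- Each GUID is mapped to a compact integer once, in a small dimension table.
CREATE TABLE dbo.DimTestRun (
    TestRunSK   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    TestRunGuid uniqueidentifier  NOT NULL UNIQUE
);

-- Fact tables store TestRunSK instead of the GUID, so a filter like
-- WHERE TestRunSK = 42 can benefit from segment elimination; the same
-- filter on a uniqueidentifier column could not.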
Lesson #4: Segment Elimination
- Row groups should be aligned with filters
- An UPDATE is a DELETE + INSERT, so the value range of the old segment and the new segment will overlap, and both must be read

[Diagram: after UPDATE MyTable SET c6 += 1 WHERE c4 < 9 OR c4 > 25, the rewritten rows land in a new row group with c4 min = 1 / max = 30, overlapping the original groups (c4 ranges 1-10, 11-20, 21-30). SELECT sum(c1), sum(c2) FROM MyTable WHERE c4 > 22 now has to read the new wide row group as well.]
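Per-segment ranges can be inspected to spot this kind of overlap. Note that min_data_id/max_data_id hold raw values only for some encodings; for dictionary-encoded columns they are dictionary IDs:

SELECT p.partition_number, s.segment_id, s.row_count,
       s.min_data_id, s.max_data_id
FROM sys.column_store_segments s
JOIN sys.partitions p ON s.partition_id = p.partition_id
JOIN sys.columns c ON p.object_id = c.object_id AND c.column_id = s.column_id
WHERE p.object_id = OBJECT_ID('dbo.MyTable')   -- the example table above
  AND c.name = 'c4'
ORDER BY p.partition_number, s.segment_id;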
Lesson #4: Segment Elimination (cont.)
- Avoid updates if you can
- Consider splitting the table (see the sketch below):
  - Current – small, where changes can still happen
  - History – records graduate there once they are done
- Custom "rebuild" to maintain the order; verify the result via sys.column_store_segments
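A sketch of the current/history split (schema and names are assumptions; the deck doesn't show one):

-- Hot rows that can still change live in a small rowstore table.
CREATE TABLE dbo.TestResult_Current (
    TestResultId    bigint NOT NULL PRIMARY KEY,
    CompletedDateSK int    NULL,
    DurationSec     bigint NULL
);

-- Finished rows graduate into an append-only columnstore table.
CREATE TABLE dbo.TestResult_History (
    TestResultId    bigint NOT NULL,
    CompletedDateSK int    NOT NULL,
    DurationSec     bigint NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_History ON dbo.TestResult_History;
GO

-- Readers see one logical table.
CREATE VIEW dbo.vw_TestResult AS
SELECT TestResultId, CompletedDateSK, DurationSec FROM dbo.TestResult_Current
UNION ALL
SELECT TestResultId, CompletedDateSK, DurationSec FROM dbo.TestResult_History;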
Lesson #5: Watch the Delta Store
- The delta store is a heap without compression
- Each query reads the entire delta store
- The delta store can be larger than you expect
- A delta row group has to reach 1M records (1,048,576 rows) before compression kicks in
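Open delta row groups can be monitored with a DMV available since SQL Server 2016:

SELECT OBJECT_NAME(object_id) AS table_name,
       partition_number, row_group_id, state_desc,
       total_rows, size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE state_desc IN ('OPEN', 'CLOSED')   -- delta store row groups
ORDER BY size_in_bytes DESC;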
Lesson #6: Physical Partitioning
In a multi-tenant environment you may have a mix of small and large tenants, so the partitioning strategy matters.

All tenants in one large physical partition:
- Small tenants must scan all records from big tenants
- Locks are at the row group level
- Column cardinality can be high, which means less segment elimination

One physical partition per tenant:
- SQL Server limit of 15K partitions per table
- Small tenants may never reach 1M rows and stay in the delta store forever

Compromise: group small tenants into one partition to help them compress, and give huge tenants dedicated partitions (see the sketch below).
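A sketch of that grouping with a range partition function on a tenant key (boundary values, column name, and filegroup layout are made up):

-- Tenants with TenantSK <= 999 (the small ones) share the first partition;
-- each big tenant gets a dedicated partition.
CREATE PARTITION FUNCTION pf_Tenant (int)
AS RANGE LEFT FOR VALUES (999, 1000, 1001, 1002);

CREATE PARTITION SCHEME ps_Tenant
AS PARTITION pf_Tenant ALL TO ([PRIMARY]);

-- Fact tables are then created ON ps_Tenant(TenantSK).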
Lesson #6.1: Physical Partition Maintenance
- INDEX REBUILD does not guarantee order
- Built a custom solution that periodically restores the proper order: a stored procedure that sorts partitions that are in bad shape. It:
  - Clones the table structure
  - Copies data from the affected partition in the desired order (100K+ row batches)
  - Stops ETL for the affected partition
  - Applies modifications that happened since the process started
  - Switches partitions (see the sketch below)
  - Restarts ETL
- SPLITs and MERGEs use the same approach
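The switch step is a metadata-only operation; a minimal sketch of the swap, assuming partition 7 is being rebuilt and both side tables share the main table's partition scheme:

-- 1. Move the fragmented partition out to an empty, identically partitioned table.
ALTER TABLE dbo.TestResult SWITCH PARTITION 7 TO dbo.TestResult_Old PARTITION 7;
-- 2. Move the freshly sorted copy in.
ALTER TABLE dbo.TestResult_Staging SWITCH PARTITION 7 TO dbo.TestResult PARTITION 7;
-- 3. Discard the old data.
TRUNCATE TABLE dbo.TestResult_Old;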
Lesson #7: Schema Updates
- All servicing in Azure DevOps must be an online operation
- Index and column modifications are mostly offline operations
- Adding a NULL column, or a NOT NULL column with a DEFAULT, is online and fast
- Adding a column and then issuing an UPDATE to set the new value will lead to fragmentation (each updated row is a delta store insert plus a columnstore delete)
- Analytics Service solution: create a new table, copy the data with the required modifications, switch tables (see the sketch below)
- Uses the same approach as maintenance to minimize data latency
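A sketch of the copy-and-switch pattern, reusing the hypothetical schema from earlier lessons (the real procedure also coordinates with ETL, as in Lesson #6.1):

-- The new table already has the schema change: CompletedDateSK added.
CREATE TABLE dbo.TestResult_New (
    TestResultId    bigint            NOT NULL,
    CompletedDate   datetimeoffset(0) NOT NULL,
    CompletedDateSK int               NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_New ON dbo.TestResult_New;

-- The new column is computed during the copy, avoiding a fragmenting
-- UPDATE against the columnstore (in practice, copy in batches).
INSERT INTO dbo.TestResult_New (TestResultId, CompletedDate, CompletedDateSK)
SELECT TestResultId, CompletedDate,
       CAST(CONVERT(char(8), CompletedDate, 112) AS int)
FROM dbo.TestResult;

-- Swap the tables with quick metadata operations.
BEGIN TRAN;
EXEC sp_rename 'dbo.TestResult', 'TestResult_Old';
EXEC sp_rename 'dbo.TestResult_New', 'TestResult';
COMMIT;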
Lesson #8: Paging

- Power BI needs raw data with arbitrary filters and projections
- Power BI can request a lot of data, so we need to force server-side paging
- The OFFSET/FETCH approach:
  - Needs sorting
  - Reads all the data and throws most of it away
  - Page N+1 is more expensive than page N
- Using a skip token (identity column) as a pointer to the next page (see the sketch below):
  - Decreases the amount of data that needs to be thrown away
  - Still needs sorting
  - Page N+1 is cheaper than page N
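A sketch of skip-token paging (the Id identity column, the Outcome filter, and the parameters are assumptions, not the deck's actual schema):

DECLARE @pageSize int = 1000;
DECLARE @skipToken bigint = 0;   -- last Id the caller saw; 0 for the first page

SELECT TOP (@pageSize) Id, Outcome, DurationSeconds
FROM AnalyticsModel.tbl_TestResultDaily
WHERE Outcome = 'Failed'         -- arbitrary caller filter
  AND Id > @skipToken            -- the token bounds the scan to newer rows
ORDER BY Id;

-- The service returns the last Id of this page as the next skip token.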
Lesson #8: Paging (cont.)
- To make sorting cheap we could use a B-tree index, but we need arbitrary filters
- Sorting a wide SELECT requires a lot of memory
- Analytics solution: two queries, leveraging columnstore behaviors (see the sketch below)
  - First query gets the page boundaries: eliminates most of the columns, uses aggregation
  - Second query gets the page: eliminates most of the segments, no sorting
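A sketch of the two-query approach (the deck doesn't show the actual queries; names and the filter carry over from the previous sketch):

DECLARE @pageSize int = 1000, @skipToken bigint = 0;
DECLARE @pageStart bigint, @pageEnd bigint;

-- Query 1: find the Id range of the page. Only the narrow Id column is
-- scanned, and the aggregation benefits from columnstore batch mode.
SELECT @pageStart = MIN(page.Id), @pageEnd = MAX(page.Id)
FROM (
    SELECT TOP (@pageSize) Id
    FROM AnalyticsModel.tbl_TestResultDaily
    WHERE Id > @skipToken AND Outcome = 'Failed'
    ORDER BY Id
) AS page;

-- Query 2: fetch the wide rows for just that range. The tight Id predicate
-- enables segment elimination, and no ORDER BY is required.
SELECT Id, Outcome, DurationSeconds
FROM AnalyticsModel.tbl_TestResultDaily
WHERE Id BETWEEN @pageStart AND @pageEnd
  AND Outcome = 'Failed';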
Questions?
Contact info: @kkosinsky
Feedback about the Analytics Service: