
1 =tg= Thomas Grohser Column Store

2 select * from =tg= where topic =
=tg= Thomas Grohser, NTT Data
Senior Director Technical Solutions Architecture
Focus on SQL Server Security, Performance Engineering, Infrastructure and Architecture
Wrote some of …
Close Relationship with SQLCAT (SQL Server Customer Advisory Team), SCAN (SQL Server Customer Advisory Network), TAP (Technology Adoption Program), Product Teams in Redmond
Active PASS member and PASS Summit Speaker
Remark:
SQL 4.21 First SQL Server ever used (1994)
SQL 6.0 First Log Shipping with failover
SQL 6.5 First SQL Server Cluster (NT4.0 + Wolfpack)
SQL 7.0 2+ billion rows / month in a single table
SQL 2000 938 days with 100% availability
SQL 2000 IA64 First SQL Server on Itanium IA64
SQL 2005 IA64 First OLTP long distance database mirroring
SQL 2008 IA64 First Replication into mirrored databases
SQL 2008R2 IA64 / x64 First 256 CPUs & > STMT/sec; First Scale out > STMT/sec; First time 1.2+ trillion rows in a table
SQL 2012 > Transactions per second; > 1.3 trillion rows in a table
SQL 2014 > Transactions per second; Fully automated deploy and management
SQL 2016 AlwaysOn Automatic HA and DR, crossed the PB in storage
SQL 2017 In production 3 minutes after RTM, HA for Replication, 3+ trillion rows and > 1 PB of data in a single DB
SQL 2019 Can't wait to push the limits even further
24 Years with SQL Server

3 All slides with a green background are “borrowed” from the presentation
The full original slide deck will be provided as well.

4 How Row Store Stores Data
[Diagram: each row is stored as a whole (Row Header + Col A, Col B, Col C, Col D, …), and rows 1–4 are packed together into 8 KB pages.]

5 How Column Store Stores Data
[Diagram: values are stored per column (Col A, Col B, Col C, …); the values of rows 1 through n form Row Group 1 and the values of rows n+1 through n+m form Row Group 2, with each column's values for a row group stored together.]

6 Column Store History
SQL 2012 One non-clustered column store index on top of a regular table, read only
SQL 2014 One clustered column store index that becomes the table; Insert/Update/Delete are possible on this one
SQL 2016 Changes possible also on the secondary (non-clustered) version of the index; Filtered Secondary Index; Non Clustered Indexes on top of Clustered Column Store
SQL 2017 Computed Columns
SQL 2019 Too early to tell
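A minimal T-SQL sketch of the two index flavors listed above, assuming a hypothetical Audit.Data table and column names; only one of the two would exist on a given table:

    -- SQL 2012 flavor: read-only nonclustered columnstore on top of a rowstore table
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_AuditData
        ON Audit.Data (EventTimeKey, ClientKey, QueryKey, DurationMs);

    -- SQL 2014+ flavor: the clustered columnstore index *is* the table
    CREATE CLUSTERED COLUMNSTORE INDEX cci_AuditData
        ON Audit.Data;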

7 The goal Audit all activities on our fleet of SQL Servers and let developers query the result to find bad queries and unused database components. Simple, except it's not when there are more than 6 billion queries per day.

8 Getting Started Enable SQL Audit
Augment the missing information (like client IP) via login triggers. Collect the audit files and store them on a fast file share. So far so good. One little problem: more than 5 TB of audit files every 24 hours!
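A hedged sketch of the audit setup described above; the audit name, file path, sizes, and the action group are illustrative assumptions, not the deck's actual configuration:

    -- Server audit writing rollover files to a fast file share
    CREATE SERVER AUDIT [QueryAudit]
    TO FILE (FILEPATH = N'\\fastshare\sqlaudit\', MAXSIZE = 512 MB)
    WITH (QUEUE_DELAY = 1000, ON_FAILURE = CONTINUE);

    -- Capture completed batches server-wide (BATCH_COMPLETED_GROUP is available from SQL 2017)
    CREATE SERVER AUDIT SPECIFICATION [QueryAuditSpec]
    FOR SERVER AUDIT [QueryAudit]
        ADD (BATCH_COMPLETED_GROUP)
    WITH (STATE = ON);

    ALTER SERVER AUDIT [QueryAudit] WITH (STATE = ON);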

9 Note The order in which I present what we "fixed" to make it work is not chronological. I decided to group the fixes by topic so as not to jump back and forth all the time.

10 First of many not working solutions
Create the Clustered Column Store table and fill it with data using:
foreach filename in GetListOfFiles
    INSERT INTO Audit.Data (columns …) SELECT columns … FROM fn_read_audit('filename')
Slow to insert. Good but not great compression. Slow to query. Performance and compression got worse when running multiple insert jobs at the same time. With a single insert job we could load less than 10% of the data we needed.
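For reference, the documented TVF for reading *.sqlaudit files is sys.fn_get_audit_file; a hedged sketch of one iteration of the per-file insert loop (column list shortened, Audit.Data columns assumed):

    DECLARE @filename nvarchar(260) = N'\\fastshare\sqlaudit\QueryAudit_0_1.sqlaudit';  -- one file per loop iteration

    INSERT INTO Audit.Data (event_time, session_id, server_principal_name, database_name, statement /* , ... */)
    SELECT event_time, session_id, server_principal_name, database_name, statement
    FROM sys.fn_get_audit_file(@filename, DEFAULT, DEFAULT);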

11 What happened We basically ignored all best practices for column store:
Insert data in the sweet-spot number of rows at a time
Avoid data types that use more than 8 bytes
Partition by your most selective search criteria
Don't rely on column store to do normalization via compression for you

12 How Column Store Stores Data
[Diagram: values are stored per column (Col A, Col B, Col C, …); the values of rows 1 through n form Row Group 1 and the values of rows n+1 through n+m form Row Group 2, with each column's values for a row group stored together.]

13 Insert 1,048,576 rows at a time
For all of you not good at math: 1024 x 1024 = 1,048,576, in other words the sweet spot is to insert 1,048,576 rows at a time, no more, no less. This is because SQL Server stores data in a column store in row groups. Every time you insert data a new row group is created (simplified). The maximum size of a row group is 1,048,576 rows, therefore inserting exactly this many rows in one go is best.

14 If you insert fewer rows …
You either create smaller row groups, which slows down processing and reduces compression, or, if the number of rows is very small (fewer than 102,400), SQL stores them in row store format (a delta store) and later might convert them during an index rebuild.

15 If you insert more rows If it's a multiple of 1,048,576 you are fine, SQL will split the insert into full row groups. Your DBA will still hate you because you create very large transactions that will impact transaction log maintenance and HA/DR. If not, SQL will create as many full row groups as possible and then treat the remaining rows the same as any smaller number of rows.

16 The Tuple Mover and Delta Stores
[Diagram: delta rowgroups move from OPEN (< 1,048,576 rows) to CLOSED (= 1,048,576 rows) to a compressed rowgroup (COMPRESSED, = 1,048,576 rows), after which the delta rowgroup becomes a TOMBSTONE. The Tuple Mover runs every 5 minutes; COMPRESSION_DELAY is configurable at the index level; compression can be forced with ALTER INDEX.]
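The two knobs called out in the diagram, as a hedged sketch against the assumed cci_AuditData index:

    -- Keep closed delta rowgroups around for a while before the tuple mover compresses them
    ALTER INDEX cci_AuditData ON Audit.Data
        SET (COMPRESSION_DELAY = 10);          -- minutes

    -- Force the tuple mover's work right now: compress the closed (and open) delta rowgroups
    ALTER INDEX cci_AuditData ON Audit.Data
        REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);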

17 Rowgroup loading magic
Read the merge policies.
[Diagram: bulk loads of >= 102,400 rows (and < 1,048,576) are compressed directly into compressed rowgroups (LOBs); loads of < 102,400 rows land in delta rowgroups (b-trees). ALTER INDEX later compacts undersized compressed rowgroups and delta rowgroups into full 1,048,576-row rowgroups.]

18 The Delete Bitmap
Deletes are logical: they are just tracked, nothing is actually deleted
Persisted in a "hidden" clustered index (compressed pages)
Cached in the Columnstore Object Pool
Flushed by the Tuple Mover
Cost: more deleted rows = slower queries
Cleaned by reorganizing, rebuilding, or truncating the table
TODO: What about memory pressure?
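One way to watch the delete bitmap grow is the rowgroup DMV; a hedged sketch against the assumed Audit.Events table:

    SELECT partition_number, row_group_id, state_desc, total_rows, deleted_rows
    FROM sys.dm_db_column_store_row_group_physical_stats
    WHERE object_id = OBJECT_ID(N'Audit.Events')
    ORDER BY partition_number, row_group_id;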

19 Use the audit.dll in C# instead of fn_get_audit
The parsing of the audit files can be done in a multi-threaded C# component on one or more application servers (no license fees !!!)

20 Agent job waits for 1048576 rows and batch inserts
The architecture:
[Diagram: on the App Server, multiple agents process audit files and push rows via the Bulk Insert API into an In-Memory OLTP table on the SQL Server; an agent job waits for 1,048,576 rows and batch-inserts them into the column store.]
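A hedged sketch of the agent job's batch move, assuming a memory-optimized staging table Staging.Events_InMemory with an EventKey column and a target Audit.Events columnstore table; it flushes exactly one full rowgroup's worth of rows per run:

    DECLARE @maxKey bigint;

    -- Find the key that closes off a batch of exactly 1,048,576 rows (NULL if fewer rows are waiting)
    SELECT @maxKey = MAX(EventKey)
    FROM (SELECT TOP (1048576) EventKey
          FROM Staging.Events_InMemory WITH (SNAPSHOT)
          ORDER BY EventKey) AS batch
    HAVING COUNT(*) = 1048576;

    IF @maxKey IS NOT NULL
    BEGIN
        BEGIN TRANSACTION;
            INSERT INTO Audit.Events (EventKey, SessionKey, QueryKey, ObjectKey /* , ... */)
            SELECT EventKey, SessionKey, QueryKey, ObjectKey
            FROM Staging.Events_InMemory WITH (SNAPSHOT)
            WHERE EventKey <= @maxKey;

            DELETE FROM Staging.Events_InMemory WITH (SNAPSHOT)
            WHERE EventKey <= @maxKey;
        COMMIT;
    END;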

21 Partition your column store tables
There is only one access method for column stores: full table scan! To be precise: full table scan with partition elimination if the table is partitioned. Always partition by a common search argument.
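A hedged sketch of daily partitioning, with made-up boundary values and an assumed integer DayKey column; real code would generate one boundary per day of the retention window:

    CREATE PARTITION FUNCTION pf_AuditDay (int)
        AS RANGE RIGHT FOR VALUES (20190101, 20190102, 20190103 /* ...one boundary per day... */);

    CREATE PARTITION SCHEME ps_AuditDay
        AS PARTITION pf_AuditDay ALL TO ([PRIMARY]);

    CREATE CLUSTERED COLUMNSTORE INDEX cci_Events
        ON Audit.Events
        ON ps_AuditDay (DayKey);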

22 Partition by Day – Almost worked …
We created a partition scheme with one partition per day, and we planned to have up to 2 years of data in the system (so we always have at least one year of the special end-of-year processing data captured). This gives us about 750 active partitions, which is well below the number of partitions where, according to the internet, issues start (~4000 and more). While this reduced the data for each query to a single day, it still took about 40 minutes to process the roughly 6 billion rows of that day in every query.

23 Keep just some of the data
As much as we would have loved to keep all data, we could not do this with a reasonable budget. We analyzed the queries against our database (yes, we audited ourselves – and no, we filtered out our own inserts before capturing them to avoid an infinite loop, one of the few mistakes we never made). The result: more than 95% of all searches were for the last three months of data, with spikes around month end and end of quarter. So: let's keep 100 days of data on the primary servers and move the rest to an archive server.

24 Keep a daily aggregate of the data
As a kind of index into the old data we kept an aggregated version of the data, grouped by day, access (login + client), and query, storing the aggregated execution count and total resource consumption. This aggregate is in an indexed row store table and produces about 30 million rows per day (instead of the 6 billion).
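A hedged sketch of that daily rollup; the table and column names (Audit.DailyAggregate, CpuMs, LogicalReads) are assumptions:

    DECLARE @DayKey int = 20190501;

    INSERT INTO Audit.DailyAggregate (DayKey, ClientKey, QueryKey, ExecutionCount, TotalCpuMs, TotalReads)
    SELECT DayKey, ClientKey, QueryKey,
           COUNT_BIG(*)      AS ExecutionCount,
           SUM(CpuMs)        AS TotalCpuMs,
           SUM(LogicalReads) AS TotalReads
    FROM Audit.Events
    WHERE DayKey = @DayKey
    GROUP BY DayKey, ClientKey, QueryKey;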

25 Partition Per Day and Instance Group
Way later in the game we changed the partition key to day and instance group. An instance group is the list of instances/servers that hold the replicas of an availability group. In general we want to get all the queries for a database regardless of the replica they are executed on. We have about 40 of these groups. This was possible after we split the data and moved all data older than 100 days to an archive server: 40 x 100 = 4000 partitions. On the archive server we only partition by day and a group of groups (4): 4 x 750 = 3000 partitions.

26 Partition Elimination Gains
Since almost every query is limited to a specific day and instance group (indirectly via the database, which is a member of an AG, which runs on an instance group), we cut the data touched down to about 1/10 of before (depending on the group); this results in about 600 million rows processed for each query. But wait, it gets much better.

27 Normalize
Reference all data that does not fit in 8 bytes with a key <= 8 bytes wide:
Server/Instance Name
Client Workstation
The Query Itself
The Session
The Object

28 Normalizing the Query Step 1 Remove all literals
SELECT A, B, C FROM Table WHERE DT='2018/12/24' AND ID=123 and City = 'DC';
INSERT INTO TableB (ColA, ColB, ColC) VALUES ('2017/1/3', 456, 'abc');
INSERT INTO TableB (ColA, ColB, ColC) VALUES ('2016/3/5', 789, 'abc');
becomes
SELECT A, B, C FROM Table WHERE DT='#' AND ID=# and City='#';
INSERT INTO TableB (ColA, ColB, ColC) VALUES ('#', #, '#');

29 Use the audit.dll in C# PART 2
C# is much faster at removing literals (a state machine parsing the string) and at normalizing small strings with a low number of variations (LINQ and PLINQ queries on in-memory cached tables).

30 Normalizing the Query Step 2 create a hash of the query
Step 3: use the COMPRESS() function to store the text. We have many queries where the query text is several MB, so doing a zip-like operation on the text makes sense. This resulted in about 3 million entries in the query table, with a few hundred new ones every day (out of 3 trillion entries). We experimented with converting all text to lower case and stripping remarks; it did not make a substantial difference.
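A hedged sketch of steps 2 and 3 in T-SQL: HASHBYTES gives a stable hash for de-duplicating the normalized text and COMPRESS (SQL 2016+) gzips the full text; the fact tables would then carry only a small integer surrogate key (<= 8 bytes), not the 32-byte hash itself:

    DECLARE @normalized nvarchar(max) =
        N'SELECT A, B, C FROM Table WHERE DT=''#'' AND ID=# and City=''#'';';

    SELECT HASHBYTES('SHA2_256', @normalized) AS QueryHash,    -- 32-byte lookup/dedup key
           COMPRESS(@normalized)              AS QueryTextGz;  -- varbinary(max), read back with DECOMPRESS()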

31 The client table Holds the client server/workstation name, IP address, account used for login, and application name. This table is a row store with a few thousand entries and only a few new ones every week (application names contain a version number, so we get new ones after each rollout).

32 Object Contains one row for each database object ever touched by any query (database, schema, table | view, column)

33 Partition Dimension Is a standard time dimension for days, with the addition of the instance group. The key is (days since 2000/1/1) x 10,000 + (instance group id), which allows us to store over 500 years of data with 9999 instance groups (we have about 40 today). Every month we start a new filegroup; this allows easier transfer to the archive system.
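One hedged reading of the key formula (multiplying the day number by 10,000 is an assumption that leaves room for the 9,999 instance groups mentioned above):

    DECLARE @day date = '2019-05-01', @instanceGroupId int = 37;

    SELECT DATEDIFF(DAY, '2000-01-01', @day) * 10000 + @instanceGroupId AS PartitionKey;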

34 Sessions and Events These are the two tables that are implemented as clustered column stores. Session contains all the info like who logged in when, from where, and how many resources they consumed in that session. Events has an entry for each query and each object touched by the query. Every column in these tables uses <= 8 bytes !!!!

35 New Schema: Query, Event, Object, Client, Session, Partition Dimension

36 Normalizing enabled a lot of Magic
Aggregate push down Filter push down Row Group elimination

37 Aggregate Push Down SELECT SUM(x), AVG(x) FROM MyWideTable
Row Store: load all data pages from disk into the buffer pool (even all data from columns not used in the query), then calculate the aggregate.
Column Store, if x uses more than 8 bytes: load column x from disk to memory (ignore all other columns).
Column Store, if x uses <= 8 bytes: the Storage Engine returns the aggregate for each row group (no data transferred to memory); the query just has to process a few values for the overall result.

38 Meet your new best friends
bigint, int, smallint, tinyint, datetime, datetime2(), time(), float, binary(1…8), char(1…8), nchar(1…4), smallmoney, money, date

39 Filter Push Down Aggregates are great, but what if I use filters and GROUP BY? If all the columns involved in the query use <= 8 bytes and the WHERE/GROUP BY is simple (=, <, >, … but no OR, IIF, CASE WHEN, …), the filter is pushed down as well and the storage engine returns the aggregates for each group that matches the filter, for each row group. All the query has to do is combine the per-rowgroup results and return the result.
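A hedged example of a query shape that keeps both pushdowns in play: every referenced column is assumed to be <= 8 bytes and the predicates are simple equality comparisons (names and values are illustrative):

    DECLARE @PartitionKey int = 70600037, @ObjectKey bigint = 42;

    SELECT QueryKey,
           COUNT_BIG(*) AS Executions,
           SUM(CpuMs)   AS TotalCpuMs
    FROM Audit.Events
    WHERE PartitionKey = @PartitionKey     -- also drives partition elimination
      AND ObjectKey    = @ObjectKey        -- simple comparison: eligible for filter pushdown
    GROUP BY QueryKey;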

40 Row Group Elimination This one is even bigger than the aggregate and filter pushdown. Imagine all rows in a row group have a date from 2017 and we filter for date >= '2018/1/1': with filter push down the engine would return no data, but wouldn't it be even better if it did not look at the data at all? Guess what: if your column uses <= 8 bytes, the column store keeps a min and max value for each row group, and if the filter or group-by criteria don't overlap them, the row group is ignored (we just processed over a million rows with a single if statement …).
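The per-segment min/max metadata that drives this elimination is visible in the catalog (the values are encoded data ids, not raw column values); a hedged sketch against the assumed Audit.Events table:

    SELECT p.partition_number, s.column_id, s.segment_id, s.row_count,
           s.min_data_id, s.max_data_id
    FROM sys.column_store_segments AS s
    JOIN sys.partitions AS p ON p.partition_id = s.partition_id
    WHERE p.object_id = OBJECT_ID(N'Audit.Events')
    ORDER BY p.partition_number, s.column_id, s.segment_id;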

41 Helping Row Group Elimination
Column store does not sort the rows in any form, size, or shape; it just ingests them as they come. But we can use this to our advantage and sort and group the data before we insert. The trick is to sort the data by the most selective filter criteria that are not in the partition key (which should already have the most selective one). WARNING: This is the current implementation and can change. REMEMBER: before SQL 7.0 all results were always sorted … When you rebuild the clustered column store it tries to optimize, but unfortunately does not do as good a job as you can, so avoid index rebuilds.
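One commonly used trick for getting pre-sorted rowgroups, as a sketch that relies on the current (not contractual) behavior described above: build a rowstore clustered index on the most selective non-partition-key column, then convert it to a clustered columnstore single-threaded so the sort order largely survives into the rowgroups (index and column names are assumptions):

    -- Step 1: sort the data with a rowstore clustered index on the search column(s)
    CREATE CLUSTERED INDEX cx_Events ON Audit.Events (ObjectKey, DayKey);

    -- Step 2: convert it to a columnstore; MAXDOP = 1 keeps one sorted stream feeding the rowgroups
    CREATE CLUSTERED COLUMNSTORE INDEX cx_Events ON Audit.Events
        WITH (DROP_EXISTING = ON, MAXDOP = 1);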

42 Batch Mode Instead of processing row by row, the engine works on about 900 rows at a time (up to 936 in the currently observed implementation – don't ask me why that number, it makes no sense). Not all operations can run in batch mode, but new ones are added in every version. This speeds things up a lot.

43 The magic of batch mode processing
Contiguous column data allows for vectorized execution that leverages advanced CPU features (SIMD): the same operation is applied to a vector of values, using 128-bit and 256-bit operations.

44 Problems not solved
Unknown sessions when the login happened a long time ago or audit files are missing.
Log file rollover on not-busy servers: normally log files roll over every few minutes, on the busiest servers even multiple times per minute, but on idle servers it can take days. This requires reprocessing of several days of partitions.

