
1 Designing Highly Scalable OLTP Systems
Thomas Kejser: Principal Program Manager
Ewan Fairweather: Program Manager
Microsoft

2 Agenda
- Windows Server 2008R2 and SQL Server 2008R2 improvements
- Scale architecture
- Customer Requirements
- Hardware setup
- Transaction log essentials
- Getting the code right
  - Application Server Essentials
  - Database Design
- Tuning Data Modification
  - UPDATE statements
  - INSERT statements
  - Management of LOB data
- The problem with NUMA and what to do about it
- Final results and Thoughts

3 Top statistics

Category                                  Metric
Largest single database                   80 TB
Largest table                             20 TB
Biggest total data, 1 customer            2.5 PB
Highest transactions per second, 1 db     36,000
Fastest I/O subsystem in production       18 GB/sec
Fastest "real time" cube                  15 sec latency
Data load for 1 TB                        20 minutes
Largest cube                              4.2 TB

4 Upping the Limits
- Before Windows Server 2008 R2, Windows was limited to 64 cores
  - The kernel was tuned for this configuration
- With Windows Server 2008 R2 this limit is raised to 1024 cores
  - New concept: Kernel Groups
  - A bit like NUMA, but an extra layer in the hierarchy
- SQL Server generally follows suit – but for now, 256 cores is the limit on R2
- Currently, the largest x64 machine is 128 cores
  - And the largest IA-64 is 256 hyperthreads (at 128 cores)

5 The Path to the Sockets
[Diagram: the Windows OS divides the machine into Kernel Groups 0-3; each Kernel Group contains eight NUMA nodes (0-7, 8-15, 16-23, 24-31); each hardware NUMA node holds CPU sockets, each socket holds CPU cores, and each core exposes hyperthreads.]

6 And we measure it like this
- Sysinternals CoreInfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
- Nehalem-EX: every socket is a NUMA node
- How fast is your interconnect?

7 And it Looks Like This...

8 Customer Scenarios

Core Banking
- Workload: Credit card transactions from ATMs and branches
- Scale requirement: 10,000 business transactions/sec
- Technology: App tier .NET 3.5/WCF; SQL 2008R2; Windows 2008R2
- Server: HP Superdome

Healthcare System
- Workload: Sharing patient information across multiple healthcare trusts
- Scale requirement: 37,500 concurrent users
- Technology: App tier .NET (virtualized); SQL 2008R2; Windows 2008R2
- Server: HP DL785 G6

POS
- Workload: World record deployment of an ISV POS application across 8,000 US stores
- Scale requirement: Handle peak holiday load of 228 checks/sec
- Technology: App tier COM+, Windows 2003; SQL 2008, Windows 2008
- Server: IBM 3950 and HP DL980, DL785

9 Hardware Setup – Database Files
- Database files
  - The file count should be at least 25% of CPU cores
  - This alleviates PFS contention – PAGELATCH_UP
  - There is no significant point of diminishing returns up to 100% of CPU cores
  - But manageability is an issue... though Windows 2008 R2 makes it much easier
- TempDb
  - PFS contention is a larger problem here because it is an instance-wide resource
  - Deallocations and allocations, RCSI version store, triggers, temp tables
  - The file count should be exactly 100% of CPU threads
  - Presize at 2 x physical memory
- Data files and TempDb on the same LUNs
  - It's all random anyway – don't sub-optimize
  - IOPS is a global resource for the machine; the goal is to avoid PAGEIOLATCH on any data file
  - Example: dedicated XP24K SAN, ~500 spindles in 64 LUNs (RAID5 7+1), no more than 4 HBAs per LUN via MPIO
- Key takeaway: Script it! At this scale, manual work WILL drive you insane (a small sketch follows below)
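A minimal sketch of the scripted file layout, assuming hypothetical file names, paths and sizes – one tempdb data file per CPU thread, all equally sized and with autogrow disabled so they stay balanced:

USE master;
GO
-- Resize the existing tempdb data file; FILEGROWTH = 0 keeps the files equally sized
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 0);
-- Add files until the file count equals the CPU thread count (paths and sizes are placeholders)
ALTER DATABASE tempdb ADD FILE (NAME = tempdev02, FILENAME = 'T:\TempDb\tempdev02.ndf', SIZE = 8GB, FILEGROWTH = 0);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev03, FILENAME = 'T:\TempDb\tempdev03.ndf', SIZE = 8GB, FILEGROWTH = 0);
-- ...repeat (or generate the statements in a loop) up to the number of CPU threads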

10 Special Consideration: Transaction Log
- The transaction log is a set of 127 linked buffers with a maximum of 32 outstanding I/Os
  - Each buffer is 60 KB
  - Multiple transactions can fit in one buffer
  - BUT: the buffer must flush before the log manager can signal a commit OK
- Pre-allocate the log file
  - Use DBCC LOGINFO for existing systems
- Transaction log throughput was ~80 MB/sec
  - But we consistently got <1 ms latency, no spikes!
  - Initial setup: 2 x HBA on a dedicated storage port, RAID10 with 4+4
  - When tuning for peak: SSD on the internal PCI bus (latency: a few µs)
- Key takeaway: for the transaction log, dedicate storage components and optimize for low latency
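A hedged example of pre-allocating the log and inspecting the VLF layout; the database and logical file names are placeholders:

-- Grow the log to its working size in one step rather than relying on autogrow
ALTER DATABASE MyOltpDb MODIFY FILE (NAME = MyOltpDb_log, SIZE = 64GB);

-- One row per virtual log file (VLF); a very large row count means the log grew in many small increments
USE MyOltpDb;
DBCC LOGINFO;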

11 Network Cards – Rule of Thumb
- At scale, network traffic generates a LOT of interrupts for the CPU
  - These must be handled by CPU cores
  - Packets must be distributed across cores for processing
- Rule of thumb (OLTP): 1 NIC per 16 cores
- Watch the DPC activity in Task Manager
- On Windows 2003, remove SQL Server (with the affinity mask) from the cores handling NIC traffic

12 Lab: Network Tuning Approaches
1. Tune configuration options of a single NIC to provide the maximum throughput.
2. Improve the application code to compress LOB data before sending it to SQL Server.
3. Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
4. Add multiple NICs (better for scale).

13 Tuning a Single NIC Card – POS System
- Enable RSS so multiple CPUs can process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
- The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney Offload
- Be careful with Chimney Offload as per KB 942861

14 Before and After Tuning Single NIC
1. Before any network changes the workload was CPU bound on CPU0.
2. After tuning RSS, disabling the Base Filtering Service, and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.

15 Teaming NICs
- Workload bound by network throughput
- Teaming 2 network adapters realized no additional aggregate throughput
  - Application blade servers shared a 1 Gb/s network connection
  - Left until the next episode
- Consider 10 Gb/s NICs for this throughput

16 SQL Server Memory Setup
- For a large CPU/memory box, Lock Pages in Memory really matters
  - We saw more than double the performance
  - Use gpedit.msc to grant it to the SQL Server service account
- Consider TF834 (large page allocations)
  - On Windows 2008 R2 previous issues with this TF are fixed
  - Around 5-10% throughput increase
  - Increases startup time
- Beware of NUMA node memory distribution
- Set max server memory close to the box max if a dedicated box is available
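A hedged sketch of the memory setting; the value is a placeholder for "close to box max". Lock Pages in Memory is granted via gpedit.msc and TF834 is a -T834 startup parameter, so neither appears in T-SQL:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Leave headroom for the OS and other processes; 1048576 MB is an example for a dedicated ~1 TB+ server
EXEC sp_configure 'max server memory (MB)', 1048576;
RECONFIGURE;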

17 SQL Server Configuration Changes
- As we increased the number of connections to around 6,000 (users had think time) we started seeing waits on THREADPOOL
  - Solution: increase sp_configure 'max worker threads'
  - Probably don't want to go higher than 4096
  - Gradually increase it; the default max is 980
  - Avoid killing yourself in thread management – the bottleneck is likely somewhere else
- Use the affinity mask to keep SQL Server off the cores running NIC traffic
- Well-tuned, pure-play OLTP: no need to consider parallel plans
  - sp_configure 'max degree of parallelism', 1
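A sketch of the sp_configure changes discussed above; the worker thread value is illustrative – raise it gradually and only while THREADPOOL waits persist:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max worker threads', 2048;       -- step up gradually; avoid going past 4096
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 1;   -- pure-play OLTP: no parallel plans
RECONFIGURE;

-- 'affinity mask' can then be used to keep SQL Server off the cores that service NIC interrupts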

18 Getting the Code Right – Designing Highly Scalable OLTP Systems

19 Lessons from ISV Applications
- Parameterize, or pay the CPU cost and potentially hit the gateway limits for compilations (RESOURCE_SEMAPHORE_QUERY_COMPILE)
- Watch out for cursors
  - They tie up worker threads, and if they consume workspace memory you could see blocking (RESOURCE_SEMAPHORE)
  - Consume those results as quickly as possible (watch for ASYNC_NETWORK_IO)
- Schema design
  - For an insert-heavy workload, referential integrity can be very expensive
  - If performance is key, enforce RI outside the database and "trust" your app
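Two hedged helpers for the parameterization point: a query to spot single-use ad hoc plans bloating the cache, and database-level forced parameterization as a last-resort mitigation when the application cannot be changed (the database name is a placeholder):

-- Plans used exactly once usually mean unparameterized SQL
SELECT TOP (20) cp.usecounts, cp.size_in_bytes, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.objtype = 'Adhoc'
ORDER BY cp.usecounts ASC, cp.size_in_bytes DESC;

-- Heavy-handed mitigation: force parameterization at the database level (test carefully)
ALTER DATABASE MyOltpDb SET PARAMETERIZATION FORCED;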

20 To DTC or not to DTC: POS System
- COM+ transactional applications are still prevalent today
  - This results in all database calls enlisting in a DTC transaction
  - 45% performance overhead
- The scenario in the lab involved two resource managers, MSMQ and SQL Server
- Tuning approaches:
  1. Optimize the DTC TM configuration (transparent to the app)
  2. Remove DTC transactions (requires app changes)
     - Utilize System.Transactions, which will only promote to DTC if more than one RM is involved
     - See Lightweight transactions: http://msdn.microsoft.com/en-us/magazine/cc163847.aspx#S5

wait_type                  total_wait_time_ms   total_waiting_tasks_count   average_wait_ms
DTC_STATE                  5,477,997,934        4,523,019                   1,211
PREEMPTIVE_TRANSIMPORT     2,852,073,282        3,672,147                   776
PREEMPTIVE_DTC_ENLIST      2,718,413,458        3,670,307                   740

21 Optimizing DTC Configuration
- By default, application servers use a local TM (MSDTC coordinator)
  - This introduces RPC communication between the SQL Server TM and the app server TM
  - The app virtualization layer incurs 'some' delay
- Configuring application servers to use a remote coordinator removes the RPC communication
- See Mike Ruthruff's paper on SQLCAT.COM: http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

22 Things to Double Check
- Connection pooling enabled?
- How much connection memory are we using?
  - Monitor perfmon: MSSQL: Memory Manager
- Obvious memory or handle leaks?
  - Check the Process counters in perfmon for the .NET app
  - Server-side processes will keep memory unless under pressure
- Can the application handle the load?
  - Call into dummy procedures that do nothing (see the sketch below)
  - Check measured application throughput
  - Typical case: the application breaks before SQL Server
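A minimal sketch of the "dummy procedure" idea – a no-op procedure (hypothetical name) the application can call to measure its own ceiling without doing real database work:

CREATE PROCEDURE dbo.usp_NoOp
AS
BEGIN
    SET NOCOUNT ON;
    RETURN 0;   -- does nothing: any throughput limit seen here is in the app tier or the network
END;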

23 Remote Calling from WCF
- Original client code: synchronous calls in WCF
  - Each thread must wait for network latency before proceeding
  - Around 1 ms of waiting
  - Very similar to disk I/O – the thread will fall asleep
  - Lots of sleeping threads
  - Limited to around 50 client simulations per machine
- Instead, use the asynchronous call interface (IAsyncResult-style Begin/End pattern)

24 Fully Qualified Calls to Stored Procedures
- A developer uses EXEC myproc instead of EXEC dbo.myproc
- SQL Server acquires an exclusive lock (LCK_M_X) and prepares to compile the procedure; this includes calculating the object ID
- dm_exec_requests revealed almost all the sessions were waiting on LCK_M_X to compile a stored procedure
- SOS_CACHESTORE spins – GetOwnerBySID
- Workaround: make the app user db_owner
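The difference in one line (the procedure name mirrors the slide): the unqualified call makes the engine resolve the name per caller, which is where the LCK_M_X and SOS_CACHESTORE contention came from; schema-qualifying the call avoids it:

EXEC myproc;        -- unqualified: name resolution per user, cache lookup contention at scale
EXEC dbo.myproc;    -- fully qualified: resolves straight to the dbo schema object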

25 Tuning Data Modification – Designing Highly Scalable OLTP Systems

26 Database Schema – Credit Cards
Transaction (10^10 rows): Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount, ...
Account (10^5 rows): Account_ID, LastUpdateDate, Balance, ...
ATM (10^3 rows): ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID, ...

Per business transaction:
- INSERT .. VALUES (@amount) and INSERT .. VALUES (-1 * @amount) into Transaction
- UPDATE ATM .. SET LastTransaction_ID = @ID + 1, LastTransactionDate = GETDATE()
- UPDATE Account .. SET Balance

27 Summary of Concerns
- The Transaction table is hot
  - Lots of INSERTs
  - How to handle ID numbers?
  - Allocation structures in the database
- The Account table must be transactionally consistent with Transaction
  - Do I trust the developers to do this?
  - Cannot release locks until BOTH are in sync
  - What about the latency of round trips for this?
  - Potentially hot rows in Account – are some accounts touched more than others?
- The ATM table has hot rows
  - Each row is on average touched at least ten times per second
  - E.g. 10^3 rows with 10^4 transactions/sec

28 Generating a Unique ID
- Why won't this work?

CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID

29 Concurrency is Fun
ATM row: ID_ATM = 13, LastTransaction_ID = 42

Session A: SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13   (@LastTransaction_ID = 42)
Session B: SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13   (@LastTransaction_ID = 42)
Session A: SET @ID = @LastTransaction_ID + 1; UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13
Session B: SET @ID = @LastTransaction_ID + 1; UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13

Both sessions read 42, so both compute 43 – the same ID is handed out twice.

30 Generating a Unique ID – The Right Way

CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID

- The read, increment and write happen in a single atomic statement
- And it is simple too...

31 Hot Rows in ATM
- Initial runs with a few hundred ATMs show excessive waits for LCK_M_U
  - Diagnosed in sys.dm_os_wait_stats
  - Drilling down to individual locks using sys.dm_tran_locks
  - Inventive readers may wish to use XEvents
    - Event objects: sqlserver.lock_acquired and sqlos.wait_info
    - Bucketize them
- As concurrency increases, lock waits keep increasing
  - While throughput stays constant
  - Until...
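A hedged version of that drill-down – aggregate waits first, then the individual lock resources currently being waited on:

-- Top waits on the instance (cumulative since the counters were last cleared)
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Which resources the blocked requests are queued on right now
SELECT resource_type, resource_description, request_mode, COUNT(*) AS waiting_requests
FROM sys.dm_tran_locks
WHERE request_status = 'WAIT'
GROUP BY resource_type, resource_description, request_mode
ORDER BY waiting_requests DESC;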

32 Spinning Around
- Diagnosed using sys.dm_os_spinlock_stats
  - Pre SQL 2008 this was DBCC SQLPERF(spinlockstats)
- Can dig deeper using XEvents with the sqlos.spinlock_backoff event
- We are spinning for LOCK_HASH

33 LOCK_HASH – What Is It?
[Diagram: many threads queuing on the lock manager's hash bucket for a single row; the spins show up as LOCK_HASH in front of the LCK_M_U lock itself. Why not go to sleep?]

34 Locking at Scale
- The ratio between ATM machines and transactions generated was too low
  - Only a limited number of lock/unlock operations per second can be sustained on a single row
  - Depends a LOT on NUMA hardware, memory speeds and CPU caches
  - Each ATM was generating 200 transactions/sec in the test harness
- Solution: increase the number of ATM machines
- Key takeaway: if a locked resource is contended – create more of it
- Notice: this is not SQL Server specific; any piece of code will be bound by memory speeds when access to a region must be serialized

35 Hot Rows in Account
- Three ways to update the Account table:
  1. Let application servers invoke a transaction that both inserts into Transaction and updates Account
  2. Set a trigger on Transaction
  3. Create a stored proc that handles the entire transaction
- Option 1 has two issues:
  - App developers may forget it in some code paths
  - Latency of the roundtrip: around 1 ms – i.e. no more than ~1,000 locks/sec possible on a single row
- Option 2 is the better choice! (a sketch follows below)
- Option 3 must be used in all places in the app to be better than option 2
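A minimal sketch of option 2, using the table and column names from the schema slide; the real trigger body would carry whatever business rules the application needs, so treat this purely as an illustration:

CREATE TRIGGER dbo.trg_Transaction_Insert
ON dbo.[Transaction]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Keep Account transactionally consistent with the inserted Transaction rows
    UPDATE a
    SET a.Balance = a.Balance + i.TotalAmount,
        a.LastUpdateDate = GETDATE()
    FROM dbo.Account AS a
    JOIN (SELECT Account_ID, SUM(Amount) AS TotalAmount
          FROM inserted
          GROUP BY Account_ID) AS i
      ON i.Account_ID = a.Account_ID;
END;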

36 Hot Latches!
- LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX
  - High = more than 1 ms
- What are we contending on?
- Latch – a lightweight semaphore
  - Locks are logical (transactional consistency)
  - Latches are internal to the SQL engine (memory consistency)
- Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH
[Diagram: an 8 KB page holding many rows – the row-level LCK_U locks all funnel into a single PAGELATCH_EX on the page]
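A hedged way to answer "what are we contending on?" while the load is running; resource_description is db_id:file_id:page_id, which can then be tied back to an object:

SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
ORDER BY wait_duration_ms DESC;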

37 Row Padding
- In the case of the ATM table, our rows are small and few
- We can "waste" a bit of space to get more performance
- Solution: pad rows with a CHAR column so each row takes a full page
  - 1 LCK = 1 PAGELATCH

ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')

38 INSERT Throughput
- The Transaction table is by far the most active table
- Fortunately, only INSERTs
  - No need to lock rows
- But several rows must still fit on a single page
  - Cannot pad pages – there are 10^10 rows in the table
- A new page will eventually be allocated, but until it is, every insert goes to the same page
- Expect: PAGELATCH_EX waits
  - And this is the observation

39 Hot page at the end of the B-tree with an increasing index key

40 Waits & Latches
- Dig into details with:
  - sys.dm_os_wait_stats
  - sys.dm_os_latch_stats
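For the latch side, a small sketch against sys.dm_os_latch_stats (the wait-stats query shown earlier covers the other DMV; the TOP count is arbitrary):

SELECT TOP (10) latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_latch_stats
ORDER BY wait_time_ms DESC;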

41 How to Solve the INSERT Hotspot
- Hash partition the table
  - Creates multiple B-trees
  - Round-robin between the B-trees: more resources and less contention
- Or: do not use a sequential key
  - Distributes the inserts all over the B-tree
[Diagram: inserts hashed on ID across 8 buckets (0-7), each bucket receiving every 8th value (0,8,16 / 1,9,17 / ...), contrasted with range partitions 0-1000, 1001-2000, 2001-3000, 3001-4000]

42 Design Pattern: Table "Hash" Partitioning
- Create a new filegroup or use an existing one to hold the partitions
  - Equally balance over LUNs using an optimal layout
- Use the CREATE PARTITION FUNCTION command
  - Partition the tables into #cores partitions
- Use the CREATE PARTITION SCHEME command
  - Bind the partition function to filegroups
- Add a hash column to the table (tinyint or smallint)
- Calculate a good hash distribution
  - For example, use HASHBYTES with modulo or BINARY_CHECKSUM

43 Table Partitioning Example

--Create the partition function and scheme
CREATE PARTITION FUNCTION [pf_hash16] (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

CREATE PARTITION SCHEME [ps_hash16]
AS PARTITION [pf_hash16] ALL TO ([ALL_DATA])

-- Add the computed column to the existing table (this is an OFFLINE operation if done the simple way).
-- Consider using bulk loading techniques to speed it up.
ALTER TABLE [dbo].[Transaction]
ADD [HashValue] AS (CONVERT(tinyint, ABS(BINARY_CHECKSUM([uidMessageID]) % 16), 0)) PERSISTED NOT NULL

--Create the index on the new partitioning scheme
CREATE UNIQUE CLUSTERED INDEX [IX_Transaction_ID]
ON [dbo].[Transaction] ([Transaction_ID], [HashValue])
ON ps_hash16(HashValue)

- Note: Requires application changes
- Ensure SELECT/UPDATE/DELETE have appropriate partition elimination (see the hedged example below)
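A hedged usage example of the partition-elimination point, reusing the hypothetical names from the script above; it assumes the caller knows the same key value ([uidMessageID]) that the hash column is computed from, and the declared types and values are illustrative only:

DECLARE @Transaction_ID BIGINT = 123456789,
        @uidMessageID  INT    = 42;

SELECT t.[Transaction_ID], t.[Amount]
FROM dbo.[Transaction] AS t
WHERE t.[Transaction_ID] = @Transaction_ID
  AND t.[HashValue] = CONVERT(tinyint, ABS(BINARY_CHECKSUM(@uidMessageID) % 16));
-- Without the HashValue predicate the seek has to touch all 16 partitions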

44 Query Engine: Good Plan

45 Query Engine: Bad Plan

46 Lab Example: Before Partitioning
Latch waits of approximately 36 ms at a baseline of 99 checks/sec.

47 Lab Example: After Partitioning*
*Other optimizations were applied
Latch waits of approximately 0.6 ms at the highest throughput of 249 checks/sec.

48 Pick The Right Number of Buckets

49 B-Tree Root Split
[Diagram: during a root split the virtual root is protected by an ACCESS_METHODS_HOBT_VIRTUAL_ROOT latch (SH) while the splitting pages take PAGELATCH EX/SH and the Next/Prev page pointers are re-linked; concurrent sessions queue behind the latch and the row lock.]

50 Management of LOB Data
- Resolving latch contention required rebuilding indexes into a new filegroup
- This resulted in PFS contention (PAGELATCH_UP):
  - The engine uses a proportional fill algorithm
  - Moving indexes from one filegroup to another resulted in an imbalance between the underlying data files in the PRIMARY filegroup
  - Resolution: move the hot table to a dedicated filegroup
- Neither ALTER TABLE nor any method of index rebuild supports moving LOB data. Technique used:
  1. Create the new filegroup and files.
  2. SELECT/INTO from the existing table into a new table.
     - Change the default filegroup first, as specifying a target filegroup is not supported
     - INSERT...WITH (TABLOCK) SELECT has similar behaviour without the need to change the default filegroup
  3. Drop the original table and rename the newly created table to the original name.
- As a general best practice we advised the partner/customer to use dedicated filegroups for LOB data
  - Don't use the PRIMARY filegroup
  - See Paul Randal's post: http://www.sqlskills.com/BLOGS/PAUL/post/Importance-of-choosing-the-right-LOB-storage-technique.aspx
(a minimal sketch of the move follows below)
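A minimal sketch of that move, assuming placeholder database, filegroup, file and table names; indexes, constraints and permissions would need to be re-created on the new table before the swap:

USE MyOltpDb;
-- 1. Create the new filegroup and files
ALTER DATABASE MyOltpDb ADD FILEGROUP LOB_FG;
ALTER DATABASE MyOltpDb ADD FILE
    (NAME = LOB_FG_01, FILENAME = 'L:\Data\LOB_FG_01.ndf', SIZE = 100GB)
TO FILEGROUP LOB_FG;

-- 2. Make it the default so SELECT INTO lands there, copy the data, then restore the default
ALTER DATABASE MyOltpDb MODIFY FILEGROUP LOB_FG DEFAULT;
SELECT * INTO dbo.HotLobTable_New FROM dbo.HotLobTable;
ALTER DATABASE MyOltpDb MODIFY FILEGROUP [PRIMARY] DEFAULT;

-- 3. Swap the tables
DROP TABLE dbo.HotLobTable;
EXEC sp_rename 'dbo.HotLobTable_New', 'HotLobTable';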

51 NUMA and What to Do
- Remember those PAGELATCH waits for UPDATE statements?
  - Our solution: add more pages
  - Improvement: get out of the PAGELATCH fast so the next one can work on it
- On NUMA systems, going to a foreign memory node is at least 4-10 times more expensive
- Use the SysInternals CoreInfo tool

52 How Does NUMA Work in SQL Server?
- The first NUMA node to request a page will "own" that page
  - Ownership continues until the page is evicted from the buffer pool
  - Every other NUMA node that needs that page has to do foreign memory access
- An additional (SQL 2008) feature is the SuperLatch
  - Useful when a page is read a lot but written rarely
  - Only kicks in on 32 cores or more
  - The "this page is latched" information is copied to all NUMA nodes
    - Acquiring a PAGELATCH_SH only requires local NUMA access
    - But: acquiring a PAGELATCH_EX must signal all NUMA nodes
  - Perfmon object: MSSQL:Latches
    - Number of SuperLatches
    - SuperLatch demotions / sec
    - SuperLatch promotions / sec
  - See the CSS blog post
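A hedged way to read those counters from inside SQL Server rather than Perfmon (the object_name prefix varies with the instance name):

SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%'
  AND counter_name LIKE '%SuperLatch%';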

53 Effect of UPDATE on NUMA Traffic
[Diagram: app servers issue UPDATE ATM SET LastTransaction_ID statements for ATM_IDs 0-3; the requests land on NUMA nodes 0-3 indiscriminately, so every node keeps touching pages owned by the other nodes.]

54 Using NUMA Affinity
[Diagram: the same UPDATE ATM SET LastTransaction_ID statements, but each app server connects to a dedicated port (8000-8003) affinitized to one NUMA node, so ATM_IDs 0-3 are always served by nodes 0-3 respectively.]
How to: Map TCP/IP Ports to NUMA Nodes
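As a hedged sketch of that How-to (verify the exact syntax against the Books Online topic "Map TCP/IP Ports to NUMA Nodes"): in SQL Server Configuration Manager the TCP Port value for an IP address can carry a NUMA node bitmask in square brackets, so a value such as

8000[0x1],8001[0x2],8002[0x4],8003[0x8]

would, under that syntax, bind port 8000 to node 0, 8001 to node 1, 8002 to node 2 and 8003 to node 3 – matching the port-per-node layout in the diagram above (the port numbers mirror the slide; the masks are illustrative).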

55 Final Results and Thoughts
- 120,000 batch requests / sec
- 100,000 SQL transactions / sec
- 50,000 SQL write transactions / sec
- 12,500 business transactions / sec
- CPU load: 34 CPU cores busy
- Given more time, we would get the CPUs to 100%, tune the NICs more, and work on balancing NUMA more
  - And on NICs: we only had two, and they were loading two CPUs at 100%

56 Q & A


58 © 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

