Slide 2: About Joe Chang
- SQL Server Execution Plan Cost Model
- True cost structure by system architecture
- Decoding statblob (distribution statistics)
- SQL Clone – statistics-only database
Tools:
- ExecStats – cross-references index use by SQL execution plan
- Performance Monitoring, Profiler/Trace aggregation
Slide 4: Organization Structure
- In many large IT departments, DB and Storage are in separate groups
- Storage usually has its own objectives: bring all storage into one big system under full management (read: control)
- Storage as a Service, in the Cloud – one size fits all needs
- Usually has zero DB knowledge: "Of course we do high bandwidth; is 600MB/sec good enough for you?"
Slide 5: Data Warehouse Storage
- OLTP – throughput with fast response
- DW – flood the queues for maximum throughput
- Do not use shared storage for a data warehouse!
- Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs: "Just tell us how much capacity you need and don't worry about anything else."
- My advice: stay away from shared storage controlled by a different team.
Slide 6: Nominal and Net Bandwidth
- PCI-E Gen 2 – 5 Gbit/s signaling; x8 = 5GB/s nominal, 4GB/s net; x4 = 2GB/s net
- SAS 6 Gbit/s – x4 port: 3GB/s nominal, 2.2GB/s net?
- Fibre Channel 8 Gbit/s nominal – 780MB/s point-to-point, 680MB/s from host through SAN to back-end loop
- SAS RAID controller, x8 PCI-E Gen 2, 2 x4 6G ports: 2.8GB/s; depends on the controller, will change!
Slide 7: Storage – SAS Direct-Attach
- Many fat pipes, very many disks
- Option A: 24 disks in one enclosure for each x4 SAS port; two x4 SAS ports per controller
- Option B: split the enclosure over 2 x4 SAS ports, 1 controller
- Balance by pipe bandwidth
- Don't forget fat network pipes (2 x 10GbE in PCI-E x4 slots)
[Diagram: several RAID controllers in PCI-E x8 slots, each with x4 SAS ports to disk enclosures]
Slide 8: Storage – FC/SAN
- PCI-E x8 Gen 2 slot with quad-port 8Gb FC HBA
- If 8Gb quad-port is not supported, consider a system with many x4 slots, or consider SAS!
- SAN systems typically offer 3.5in 15-disk enclosures; difficult to get high spindle count with density
- Disk enclosures per 8Gb FC port: 20-30MB/s per disk?
[Diagram: multiple 8Gb FC HBAs in PCI-E x8 and x4 slots, plus 2 x 10GbE in PCI-E x4 slots]
Slide 9: Storage – SSD / HDD Hybrid
- No RAID w/ SSD?
- Storage enclosures typically hold 12 disks per channel, which can only support the bandwidth of a few SSDs. Use the remaining bays for extra HDD storage; no point expending valuable SSD space on backups and flat files.
- Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multi DB – SSD, otherwise too many RAID 1 pairs for logs
[Diagram: SSDs on SAS x4 ports from PCI-E x8 controllers, HDDs on a PCI-E x4 RAID controller, 2 x 10GbE in PCI-E x4 slots]
Slide 15: SSD
- Current: mostly 3Gbps SAS/SATA SSD; some 6Gbps SATA SSD
- Fusion-io – direct PCI-E Gen 2 interface; 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s
- No RAID? HDD is fundamentally a single point of failure; SSD could be built with redundant components
- HP reports problems with SSD on RAID controllers; Fujitsu did not?
Slide 16: Big DW Storage – iSCSI
- Are you nuts? Well, maybe if you like frequent long coffee-cigarette breaks.
Slide 17: Storage Configuration – Arrays
- Shown: two 12-disk arrays per 24-disk enclosure
- Options: between 6-16 disks per array
- SAN systems may recommend R or R5 7+1
- Very many spindles
- Comment on MetaLUN
Slide 20: Data Consumption Rate – Opteron
TPC-H Query 1, Lineitem scan: ~1GB per SF, 875MB on 2K8

Processors   GHz  Cores  Mem GB  SQL    Q1 sec    SF  Total MB/s  MB/s per core
4 Opt 8220   2.8      8     128  5 rtm   309.7   300       868.7          121.1
8 Opt 8360   2.5     32     256  8 rtm    91.4   300     2,872.0           89.7  (Barcelona)
8 Opt 8384   2.7     32     256  8 rtm    72.5   300     3,620.7          113.2  (Shanghai)
8 Opt 8439   2.8     48     256  8 sp1    49.0   300     5,357.1          111.6  (Istanbul)
8 Opt 8439   2.8     48     512  8 rtm   166.9  1000     5,242.7          109.2  (Istanbul)
2 Opt 6176   2.3     24     192  8 r2     20.2   100     4,331.7          180.5  (Magny-Cours)
4 Opt 6176   2.3     48     512  8 r2     31.8   300     8,254.7          172.0  (Magny-Cours)

Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8GHz for Istanbul), or is this Win/SQL 2K8 R2?
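The MB/s column is just scale factor times the lineitem size, divided by the Query 1 time; a quick sketch of the arithmetic, using the slide's ~875MB-per-SF figure for SQL Server 2008:

```python
# Data consumption rate for TPC-H Query 1: lineitem is roughly
# 875MB per unit of scale factor on SQL Server 2008.
LINEITEM_MB_PER_SF = 875

def consumption_rate(sf, q1_sec, cores):
    total_mb_s = sf * LINEITEM_MB_PER_SF / q1_sec
    return round(total_mb_s, 1), round(total_mb_s / cores, 1)

# 8 x Opt 8360 (Barcelona), 32 cores, Q1 in 91.4s at SF 300:
barcelona = consumption_rate(300, 91.4, 32)  # (2872.0, 89.7)
# 8 x Opt 8439 (Istanbul, sp1), 48 cores, Q1 in 49.0s at SF 300:
istanbul = consumption_rate(300, 49.0, 48)   # (5357.1, 111.6)
```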
Slide 22: Storage Targets

Processors            Cores  BW/core MB/s  Target MB/s  SAS HBAs  Units - Disks  Actual BW
2 Xeon X5680             12           350        4,200         2         2 - 48     5 GB/s
4 Opt 6176               48           175        8,400         4         4 - 96    10 GB/s
4 Xeon X7560             32           250        8,000         6         4 - 96    15 GB/s
8 Xeon X7560             64           225       14,400       11†                   26 GB/s

PCI-E slots (x8 - x4): 5 - 1, 6 - 4, 9 - 5
† 8-way: 9 controllers in x8 slots, 24 disks per x4 SAS port; 2 controllers in x4 slots, 12 disks each

- 24 15K disks per enclosure
- 12 disks per x4 SAS port requires 100MB/sec per disk: possible but not always practical
- 24 disks per x4 SAS port requires 50MB/sec per disk: more achievable in practice
Think: shortest path to metal (iron oxide)
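The target row is simply cores times the per-core consumption rate from the previous slide; a small sketch of that sizing step:

```python
# Target bandwidth = total cores x per-core data consumption rate (MB/s),
# using the per-core figures from the slide.
systems = {
    "2 Xeon X5680": (12, 350),
    "4 Opt 6176":   (48, 175),
    "4 Xeon X7560": (32, 250),
    "8 Xeon X7560": (64, 225),
}
targets = {name: cores * mb_per_core
           for name, (cores, mb_per_core) in systems.items()}
# e.g. targets["2 Xeon X5680"] -> 4200 MB/s, targets["8 Xeon X7560"] -> 14400 MB/s
```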
Slide 23: Your Storage and the Optimizer

Model      Disks  BW (KB/s)  Sequential IOPS  "Random" IOPS  Seq-Rand IO ratio
Optimizer      -     10,800            1,350            320               4.22
SAS 2 x4      24  2,800,000          350,000          9,600               36.5
SAS 2 x4      48  2,800,000          350,000         19,200               18.2
FC 4G         30    360,000           45,000         12,000               3.75
SSD            8  2,800,000          350,000        280,000               1.25

Assumptions:
- 2.8GB/sec per SAS 2 x4 adapter; could be 3.2GB/sec per PCI-E G2 x8
- HDD: 400 IOPS per disk (big-query key lookup, loop join at high queue depth, short-stroked, possible skip-seek)
- SSD: 35,000 IOPS

The SQL Server query optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has a much higher ratio; about 30 disks per 4G FC port roughly matches the QO, and SSD is in the other direction.
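The 4.22 figure falls straight out of the optimizer's cost constants (1,350 sequential page reads/sec versus 320 random); the same arithmetic applied to the hardware rows, assuming 8KB pages:

```python
# Sequential-to-random IO ratio: sequential IOPS (bandwidth / 8KB page)
# divided by aggregate random IOPS (disks x IOPS per disk).
PAGE_KB = 8

def seq_rand_ratio(bw_kb_per_sec, disks, rand_iops_per_disk):
    seq_iops = bw_kb_per_sec / PAGE_KB
    return seq_iops / (disks * rand_iops_per_disk)

optimizer = 1350 / 320                            # ~4.22, the QO's model
sas_2x4_24 = seq_rand_ratio(2_800_000, 24, 400)   # ~36.5
fc_4g_30 = seq_rand_ratio(360_000, 30, 400)       # 3.75, close to the QO
ssd_8 = seq_rand_ratio(2_800_000, 8, 35_000)      # 1.25, the other direction
```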
Slide 26: Fast Track Reference Architecture – My Complaints
- Several expensive SAN systems (11 disks each), each of which must be configured independently; $1,500-2,000 amortized per disk
- Too many 2-disk arrays; 2 LUNs per array means too many data files
- Builds indexes with MAXDOP 1: is this brain dead?
- Designed around 100MB/sec per disk, but not all DW work is a single scan, or sequential
- Scripting?
Slide 27: Fragmentation
Weak storage system:
1) Fragmentation could degrade IO performance.
2) Defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.
Powerful storage system:
3) Fragmentation has very little impact.
4) Defragmenting has mild impact, and completes within the night-time window.
What is the correct conclusion?
[Diagram: File → Partition → LUN → Disk]
Slide 29: Operating System Disk View
- Controller 1, Ports 0/1 → Disks 2 and 3 (Basic, 396GB, Online)
- Controller 2, Ports 0/1 → Disks 4 and 5
- Controller 3, Ports 0/1 → Disks 6 and 7
- Additional disks not shown; Disk 0 is the boot drive, Disk 1 the install source?
Slide 30: File Layout
Each file group is distributed across all data disks:
- Disk 2, Partition 0: File 1 of the file group for the big table; Partition 1: file group for all others; Partition 2: tempdb; Partition 4: backup and load
- Disk 3, Partition 0: File 2; small file group
- Disk 4, Partition 0: File 3
- Disk 5, Partition 0: File 4
- Disk 6, Partition 0: File 5
- Disk 7, Partition 0: File 6
Log disks not shown; tempdb shares a common pool with data.
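One way to script this layout: a hypothetical sketch that emits one data file per disk so the big table's file group is striped across all six data disks. The database name, drive letters, and sizes below are assumptions for illustration, not from the slides:

```python
# Generate ALTER DATABASE statements placing one file of a file group
# on each data disk. Drive letters D:-I: stand in for disks 2-7,
# partition 0; the 'DW' database name and file size are hypothetical.
def stripe_filegroup(db, fg, drives, size_mb):
    stmts = [f"ALTER DATABASE {db} ADD FILEGROUP {fg};"]
    for i, d in enumerate(drives, start=1):
        stmts.append(
            f"ALTER DATABASE {db} ADD FILE "
            f"(NAME = {fg}_{i}, FILENAME = '{d}:\\{fg}_{i}.ndf', "
            f"SIZE = {size_mb}MB) TO FILEGROUP {fg};"
        )
    return stmts

ddl = stripe_filegroup("DW", "BigTableFG", "DEFGHI", size_mb=102_400)
```

The same helper can stamp out the "all other tables" and tempdb layouts by pointing it at different partitions of the same disks.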
Slide 31: File Groups and Files
- Dedicated file group for the largest table: never defragment
- One file group for all other regular tables
- Load file group?
- Rebuild indexes to a different file group
Slide 32: Partitioning – Pitfalls
- Common partitioning strategy: the partition scheme maps partitions to file groups (Table Partitions 1-6 → File Groups 1-6 on Disks 2-7)
- What happens in a table scan? Read first from Partition 1, then 2, then 3, ...?
- SQL 2008 hotfix to read from each partition in parallel?
- What if partitions have disparate sizes?
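The scan-ordering question matters because partition-per-file-group places each partition on its own set of spindles; a back-of-envelope sketch under assumed numbers (six equal 100GB partitions and 400MB/s per file group, neither figure from the slides):

```python
# If a table scan reads partitions one at a time, it sees only one
# file group's bandwidth at any moment; reading all partitions in
# parallel keeps every file group busy at once.
part_gb = [100] * 6        # assumed: six equal 100GB partitions
fg_mb_per_sec = 400        # assumed: bandwidth of each file group's disks

secs = [gb * 1024 / fg_mb_per_sec for gb in part_gb]
serial_sec = sum(secs)     # one partition after another: 1536s
parallel_sec = max(secs)   # all file groups at once: 256s
# With disparate partition sizes, the parallel scan is only as fast
# as the largest partition's file group.
```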