Published by Ferdinand Booth. Modified over 9 years ago.
1
System Architecture: Big Iron (NUMA) Joe Chang jchang6@yahoo.com www.qdpma.com
2
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring, Profiler/Trace aggregation
3
Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session “Designing High Scale OLTP Systems”
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session “The Data Loading Performance Guide”
Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar
4
Server Systems and Architecture
5
Symmetric Multi-Processing
[Diagram: CPUs on a shared system bus, connected to the MCH, ICH and PXH]
SMP: processors are not dedicated to specific tasks (unlike ASMP), single OS image, each processor can access all memory.
SMP makes no reference to memory architecture?
Not to be confused with Simultaneous Multi-Threading (SMT). Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT).
6
Non-Uniform Memory Access
[Diagram: CPUs with memory controllers and a node controller, connected by a shared bus or crossbar]
NUMA architecture: the path to memory is not uniform
1) Node: processors, memory, separate or combined memory + node controllers
2) Nodes connected by shared bus, cross-bar, or ring
Traditionally, 8-way+ systems
Local memory latency ~150ns, remote node memory ~300-400ns; can cause erratic behavior if OS/code is not NUMA aware
7
AMD Opteron
[Diagram: Opteron processors with HT2100 and HT1100 I/O chips]
Local memory latency ~50ns, 1 hop ~100ns, two hop 150ns?
Actual: more complicated because of snooping (cache coherency traffic)
Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior!
For practical purposes: behaves like an SMP system
8
8-way Opteron Sys Architecture
Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links.
Note: 8-way top and bottom right processors use 2 HT to connect to other processors, 3rd HT for IO. CPU 1 & 7 require 3 hops to each other.
[Diagram: CPU 0 through CPU 7]
9
http://www.techpowerup.com/img/09-08-26/17d.jpg
10
Nehalem System Architecture
Intel Nehalem generation processors have Quick Path Interconnect (QPI)
Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4
8-way glue-less is possible
11
NUMA Local and Remote Memory
Local memory is closer than remote, so physical access time is shorter.
What is the actual access time, with the cache coherency requirement?
12
HT Assist – Probe Filter: part of L3 cache used as directory cache (source: ZDNet)
13
Source Snoop Coherency From HP PREMA Architecture whitepaper: All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
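A minimal sketch of why source snooping can hide the benefit of local memory: if the memory controller must hold the data until every snoop response is in, the read completes at the slowest of those events. The function name and the snoop response times below are hypothetical; only the ~150ns local figure is borrowed from the earlier NUMA slide.

```python
def source_snoop_read_ns(memory_ns, snoop_responses_ns):
    """Completion time of a read under source snooping (toy model).

    The data cannot be returned until the memory access AND every snoop
    response have arrived, so the slowest of them sets the latency.
    """
    return max(memory_ns, max(snoop_responses_ns, default=0.0))


# Assumed figures: local DRAM at 150 ns, three other sockets answering snoops.
local_read = source_snoop_read_ns(150.0, [120.0, 180.0, 310.0])
print(f"local read completes after {local_read:.0f} ns")
```

In this toy model the read of local memory is gated by the farthest cache, which is the case the node-controller designs on the following slides are meant to avoid.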
14
DL980G7
From HP PREMA Architecture whitepaper: Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory (*only cache tags, not cache data)
15
HP ProLiant DL980 Architecture
Node Controllers reduce effective memory latency
16
Superdome 2 – Itanium, sx3000 Agent – Remote Ownership Tag + L4 cache tags 64M eDRAM L4 cache data
17
IBM x3850 X5 (Glue-less) Connect two 4-socket Nodes to make 8-way system
18
Fujitsu R900 4 IOH 14 x8 PCI-E slots, 2 x4, 1x8 internal
20
OS Memory Models
SUMA: Sufficiently Uniform Memory Access. Memory is interleaved across nodes.
NUMA: memory is (1) first interleaved within a node, then (2) the memory stripe is spanned across nodes.
[Diagram: 64 memory units, numbered 0-63, on four nodes of 16 units each (every node is labelled "Node 0" in the extracted text).
SUMA layout: blocks of four consecutive units rotate through the nodes, so the first node holds 0-3, 16-19, 32-35, 48-51 and the last holds 12-15, 28-31, 44-47, 60-63.
NUMA layout: each node holds one contiguous range, 0-15, 16-31, 32-47 and 48-63.]
21
OS Memory Models
SUMA: Sufficiently Uniform Memory Access. Memory is interleaved across nodes.
NUMA: memory is (1) first interleaved within a node, then (2) the memory stripe is spanned across nodes.
[Diagram: 32 memory units, numbered 0-31, on Node 0 through Node 3.
SUMA layout: Node 0 holds 0-1, 8-9, 16-17, 24-25; Node 1 holds 2-3, 10-11, 18-19, 26-27; Node 2 holds 4-5, 12-13, 20-21, 28-29; Node 3 holds 6-7, 14-15, 22-23, 30-31.
NUMA layout: Node 0 holds 0-7, Node 1 holds 8-15, Node 2 holds 16-23, Node 3 holds 24-31.]
22
Windows OS NUMA Support
Memory models:
SUMA – Sufficiently Uniform Memory Access: memory is striped across NUMA nodes
NUMA – separate memory pools by node
[Diagram: the same 32 memory units on Node 0 through Node 3. SUMA: Node 0 holds 0-1, 8-9, 16-17, 24-25, and so on across the nodes. NUMA: Node 0 holds 0-7, Node 1 holds 8-15, Node 2 holds 16-23, Node 3 holds 24-31.]
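The two layouts in the 4-node diagram can be sketched as a mapping from memory unit number to node. This is only an illustration read off the slide's numbering: the constants and function names are assumptions, and real systems interleave at cache-line or page granularity set by firmware, not in "units" of this kind.

```python
NODES = 4           # nodes in the 4-node diagram
UNITS_PER_NODE = 8  # memory units drawn on each node
STRIPE = 2          # consecutive units kept on one node in the SUMA layout


def suma_node(unit):
    """SUMA: stripes of consecutive units rotate across all nodes."""
    return (unit // STRIPE) % NODES


def numa_node(unit):
    """NUMA: each node owns one contiguous range of units."""
    return unit // UNITS_PER_NODE


def layout(node_of):
    """Group every unit number under the node that holds it."""
    nodes = {n: [] for n in range(NODES)}
    for unit in range(NODES * UNITS_PER_NODE):
        nodes[node_of(unit)].append(unit)
    return nodes


for name, node_of in (("SUMA", suma_node), ("NUMA", numa_node)):
    print(name)
    for node, units in layout(node_of).items():
        print(f"  Node {node}: {units}")
```

Under the SUMA mapping any sizeable allocation lands on every node; under the NUMA mapping it can stay on one node, which is what makes locality possible at all.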
23
Memory Model Example: 4 Nodes
SUMA Memory Model: memory access is uniformly distributed, so 25% of memory accesses are local and 75% are remote.
NUMA Memory Model: the goal is better than 25% local node access.
True local access time also needs to be faster.
Cache coherency may increase local access time.
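A rough worked example of what the local fraction is worth, using the ~150ns local and ~300-400ns remote latencies quoted earlier for traditional NUMA systems. The 350ns remote value (the midpoint of that range) and the 50% and 90% local fractions are assumptions for illustration, and the model ignores cache coherency overhead.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Expected memory latency for a given share of local accesses."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


LOCAL_NS = 150.0   # from the earlier NUMA slide
REMOTE_NS = 350.0  # assumed: midpoint of the quoted 300-400 ns range

cases = (
    ("SUMA, 4 nodes (25% local)", 0.25),
    ("NUMA, 50% local (assumed)", 0.50),
    ("NUMA, 90% local (assumed)", 0.90),
)
for label, fraction in cases:
    print(f"{label}: {average_latency_ns(fraction, LOCAL_NS, REMOTE_NS):.0f} ns")
```

With these numbers the average falls from 300ns at 25% local to 170ns at 90% local, but only if the true local access time really is the faster one.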
24
Architecting for NUMA
Web tier determines the port for each user by group (but grouping should not be by geography!)
Affinitize port to NUMA node
Each node accesses localized data (partition?)
OS may allocate substantial chunk from Node 0?
End-to-End Affinity:

App Server     TCP Port   CPU      Memory     Table
North East     1440       Node 0   0-0, 0-1   NE
Mid Atlantic   1441       Node 1   1-0, 1-1   MidA
South East     1442       Node 2   2-0, 2-1   SE
Central        1443       Node 3   3-0, 3-1   Cen
Texas          1444       Node 4   4-0, 4-1   Tex
Mountain       1445       Node 5   5-0, 5-1   Mnt
California     1446       Node 6   6-0, 6-1   Cal
Pacific NW     1447       Node 7   7-0, 7-1   PNW
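The end-to-end affinity idea can be sketched as a routing table in the application tier: each user group gets its own TCP port, and each port is tied to one NUMA node. The group names and ports are the slide's; the function names and the `port[mask]` string are assumptions for illustration (SQL Server's port-to-NUMA-node mapping uses a bracketed node bitmask of this general shape, but check the product documentation for the exact format).

```python
GROUPS = [
    "North East", "Mid Atlantic", "South East", "Central",
    "Texas", "Mountain", "California", "Pacific NW",
]
BASE_PORT = 1440  # first port on the slide; one port per group and node

# group -> (TCP port, NUMA node), in the order shown on the slide
ROUTES = {group: (BASE_PORT + i, i) for i, group in enumerate(GROUPS)}


def route(group):
    """Return the (port, node) pair an app server should use for a group."""
    return ROUTES[group]


def listener_spec(group):
    """Format the port with a one-bit node mask, e.g. '1444[0x10]' (assumed format)."""
    port, node = route(group)
    return f"{port}[{1 << node:#x}]"


for group in GROUPS:
    print(f"{group:<13} -> {listener_spec(group)}")
```

As the slide warns, geography is only a convenient picture; a real grouping key should spread load evenly and match how the tables are partitioned.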
26
HP-UX LORA
HP-UX – not Microsoft Windows
Locality-Optimized Resource Alignment
12.5% interleaved memory
87.5% NUMA node local memory
27
System Tech Specs

Memory pricing:
8GB, $400 each: 18 x 8G = 144GB, $7200; 64 x 8G = 512GB, $26K
16GB, $1100 each: 12 x 16G = 192GB, $13K; 64 x 16G = 1TB, $70K

Processors         Cores   DIMM   PCI-E G2        Max memory   Total cores   Base
2 x Xeon X56x0     6       18     5 x8+, 1 x4     192G*        12            $7K
4 x Opteron 6100   12      32     5 x8, 1 x4      512G         48            $14K
4 x Xeon X7560     8       64     4 x8, 6 x4 †    1TB          32            $30K
8 x Xeon X7560     8       128    9 x8, 5 x4 ‡    2TB          64            $100K

* Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB
† Dell R910 and HP DL580G7 have different PCI-E
‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots
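The memory pricing above is plain multiplication, sketched here so the rounded figures can be checked. The per-DIMM prices are the slide's (and long out of date); the function name is an assumption.

```python
DIMM_PRICE_USD = {8: 400, 16: 1100}  # price per DIMM by size in GB, from the slide


def memory_config(dimm_count, dimm_gb):
    """Return (total capacity in GB, total cost in USD) for a DIMM population."""
    return dimm_count * dimm_gb, dimm_count * DIMM_PRICE_USD[dimm_gb]


for count, size in ((18, 8), (64, 8), (12, 16), (64, 16)):
    capacity_gb, cost_usd = memory_config(count, size)
    print(f"{count} x {size}G = {capacity_gb}GB, ${cost_usd:,}")
```

The exact products are $7,200, $25,600, $13,200 and $70,400, which the slide rounds to $7200, $26K, $13K and $70K.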
29
Software Stack
30
Operating System
Windows Server 2003 RTM, SP1: network limitations (default); Scalable Networking Pack (912222)
Windows Server 2008
Windows Server 2008 R2 (64-bit only): breaks 64 logical processor limit; NUMA IO enhancements?
Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server
Don’t try to do DW on SQL Server 2000
Impacts OLTP
Search: MSI-X
31
SQL Server version
SQL Server 2000: serious disk IO limitations (1GB/sec ?); problematic parallel execution plans
SQL Server 2005 (fixed most S2K problems): 64-bit on X64 (Opteron and Xeon); SP2 – performance improvement 10%(?)
SQL Server 2008 & R2: Compression, Filtered Indexes, etc; Star join, Parallel query to partitioned table
32
Configuration
SQL Server Startup Parameter: -E
Trace Flags 834, 836, 2301
Auto_Date_Correlation: Order date A; Implied: Order date > A-C, Ship date < A+C
Port Affinity – mostly OLTP
Dedicated processor ? for log writer ?