Hardware Overview: System p & POWER5


1 Hardware Overview: System p & POWER5

2 Agenda
Overview of systems hardware
POWER5 processors
  The chip
  The processors
  The memory
System p5-575

3 Hardware Overview
Building blocks, smallest to largest: cores → processors → nodes → clusters

4 IBM Product Naming

New Name  Old Names            Market             Processor
System i  iSeries, AS/400      Commercial         RS64, POWER5
System p  RS/6000 SP, pSeries  Server, technical  POWER3, POWER4
System x  xSeries              Server, technical  Intel, AMD (IA-32), PowerPC
System z  zSeries, ES/9000     Mainframe

5 Power Processor Progression

Processor  Years        Clock Rate     Feature
POWER2                  20 – 60 MHz    RISC
P2SC                    60 – 150 MHz   Bandwidth
POWER3     1998 – 2002  200 – 450 MHz  Single chip
POWER4     2001 – 2005  1 – 1.9 GHz    Dual core
POWER5     2004 –       1.5 – 1.9 GHz  Multi-thread
POWER5+    2006 –       1.9 – 2.2 GHz  Speed bump

6 POWER5 Systems
POWER5 processors: single- and dual-processor chips
Modules: Dual Chip Modules (DCM) and Multi Chip Modules (MCM)
Nodes: multiple modules (p5-575, p5-595); SMP within a node
Cluster: multiple nodes connected with the High Performance Switch (HPS)

7 System p5 “Nodes” – partial list

Model    Processors  Clock Rate (GHz)  Max Memory (x 2^30 byte)
p5 505   1, 2        1.5, 1.65*        32
p5 520               1.65, 1.9*
p5 560Q  4 – 16      1.5*              128
p5 570   2 – 16      1.9, 2.2*         512
p5 575   8 – 16                        256
p5 590   8 – 32      1.65              1000
p5 595   16 – 64     1.65, 1.9         2000

* POWER5+

8 POWER5 Processor Systems
[Figure: processor chip → DCM/MCM → p5-575 node → cluster]

9 Cluster 1600
[Figure: physical view vs. logical view of a cluster of multi-processor nodes with network and disk system]

10 POWER5 Processors
Multi-processor chip
High clock rate: multiple GHz
Three cache levels
Bandwidth and latency hiding
Shared memory, large memory size

11 POWER5 Chip
IBM CMOS 130 nm chip, copper and SOI
8 layers of metal
389 mm², 276M transistors
I/Os: 2,313 signal, 3,057 power

12 POWER5 Multi-chip Module
95 mm × 95 mm
Four POWER5 chips
Four cache chips
4,491 signal I/Os
89 layers of metal

13 Multi Chip and Dual Chip Modules
Dual Chip Module (DCM): p5-570, p5-575
Multi Chip Module (MCM): p5-590, p5-595

14 Dual Chip Module
Each module: 1 processor chip, 1 L3 cache (36 Mbyte), 1 memory card
Each processor chip: 2 processor cores (each with L1 caches, registers, and functional units), 1 shared L2 cache, 1 path to memory

15 POWER5 Dual Chip Module
One POWER5 chip (single or dual core)
One L3 cache chip

16 POWER5 Features
Private L1 caches
Shared L2 cache
Shared L3 cache
Interleaved memory
Hardware prefetch
Multiple page size support

17 Processor Characteristics
High-frequency clocks, deep pipelines, high asymptotic rates
Superscalar: speculative, out-of-order instruction execution
Up to 8 outstanding cache-line misses
Large number of instructions in flight
Branch prediction
Hardware prefetching
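The hardware prefetch engines detect streaming access on their own, but compilers also expose explicit prefetch hints. A minimal sketch in C, assuming a GCC/Clang-style `__builtin_prefetch` (IBM XL C offers similar `__dcbt` intrinsics); the prefetch distance of 64 elements is an illustrative tuning value, not a POWER5 recommendation:

```c
#include <stddef.h>

/* Sum an array while issuing software prefetch hints ahead of the
 * current position.  Arguments to __builtin_prefetch: address,
 * rw (0 = read), locality (1 = low temporal locality). */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 1);
        s += a[i];
    }
    return s;
}
```

The hint is purely advisory: the result is identical with or without it, only the cache-miss behavior changes.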

18 POWER5 Chip Enhancements
Caches and translation resources: larger caches, enhanced associativity
Resource pools: rename registers (GPRs, FPRs) increased to 120 each
L2 cache coherency engines: increased by 100%
Memory controller moved on chip
Dynamic power management added

19 New Instructions in POWER5
Enhanced data prefetch (eDCBT)
Floating point: non-IEEE mode of execution for divide and square root; reciprocal estimate (double precision); reciprocal square-root estimate (single precision)
Local data cache block flush
Population count
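To make the last item concrete, here is a portable C sketch of what a population-count instruction computes in hardware: the number of set bits in a word. Compilers expose the hardware version through builtins (e.g. GCC's `__builtin_popcountll`); the loop below is only a reference implementation:

```c
#include <stdint.h>

/* Count set bits by repeatedly clearing the lowest one
 * (Kernighan's method); one iteration per set bit. */
unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear lowest set bit */
        n++;
    }
    return n;
}
```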

20 POWER5 Chip Layout

Chip     Line Width  Area     Transistors
POWER4+  130 nm      267 mm²  184 million
POWER5   130 nm      389 mm²  268 million

21 POWER5 Chip Size Increase Relative to POWER4+

Component                 Percentage  Reason
Core                      10          SMT
L2                        11          1.5 Mbyte → 1.9 Mbyte
L3 directory and control  7           Larger L3 capacity, smaller L3 cache line
Memory controller         3           New to POWER5
Fabric controller         5           Enhanced distributed switch
I/Os                                  Increased bus speeds
Other
Total                     46

22 Simultaneous Multi-Threading in POWER5
Each chip appears as a 4-way SMP to software
Processor resources optimized for enhanced SMT performance
Software-controlled thread priority
Dynamic feedback of runtime behavior to adjust priority
Dynamic switching between single-threaded and multi-threaded mode
[Figure: execution units FX0/FX1/FP0/FP1/LS0/LS1/BRX/CRL with thread 0 and thread 1 active simultaneously]
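Because each core presents two hardware threads, the operating system simply reports twice as many logical processors. A small sketch of how software observes this, using the `sysconf` query available on AIX and Linux (the count returned naturally depends on the machine and on whether SMT is enabled):

```c
#include <unistd.h>

/* Number of logical CPUs visible to software.  On a dual-core POWER5
 * chip with SMT enabled this reports 4 per chip; with SMT disabled, 2. */
long visible_cpus(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```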

23 Multi-threading Evolution
[Figure: per-cycle usage of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) under single-thread, coarse-grain, fine-grain, and simultaneous multi-threading; each issue slot shows thread 0 executing, thread 1 executing, or no thread executing]

24 Chip Structure
Each processor core: L1 caches, registers, functional units
Each chip (shared): L2 cache, L3 cache, path to memory
[Figure: two processor cores sharing the L2 cache, with fabric controller, L3 directory, GX control, and L3/memory control]

25 Chip Structure
[Figure: two processor cores and the L2 cache connected through the fabric controller to other chips, the MCM, the L3 directory, the GX bus, and the L3/memory bus]

26 Floating Point Paths
[Figure: both floating-point units (FPU 0, FPU 1) share the register file r0 – r31, with load/store paths to cache and memory]

27 Multiple Functional Units
Symmetric functional units:
Two Floating Point Units (FPU), each executing fused multiply-add (FMA)
Three Fixed Point Units (FXU): two integer, one control (CR)
Two Load/Store Units (LSU)
One Branch Processing Unit (BPU)

28 Instruction-level Parallelism
Speculative superscalar organization
Out-of-order execution with large rename pools
8-instruction issue, 5-instruction completion
Large instruction window for scheduling
8 execution pipelines: 2 load/store units, 2 fixed-point units, 2 double-precision multiply-add units, 1 branch resolution unit, 1 CR execution unit
Aggressive branch prediction: target address and outcome prediction; static prediction / branch hints used
Fast, selective flush on branch mispredict
[Figure: processor core pipeline — IFAR, I-cache, branch scan/predict, decode/crack and group formation, GCT, issue queues feeding the FX, FP, LD/ST, BR, and CR units, D-cache, and store queue]

29 Block Diagram
Eight-way issue, eight independent functional units
[Figure: IFAR → I-cache → instruction buffer → decode/crack and group formation → issue queues (BR/CR, FX/LD1, FX/LD2, FP1, FP2) → execution units (BR, CR, FX1, FX2, LD1, LD2, FP1, FP2) → D-cache]

30 Processor Features

Feature           POWER4               POWER5
Clock             1.0 – 1.9 GHz        1.5 – 2.3 GHz
Caches            Three levels         Three levels
L3 speed          1/3 clock frequency  1/2 clock frequency
Virtualization    Up to 32 partitions  Up to 254 partitions
Partitions        Unit processor       Fractional
Power management  Static               Dynamic
Thread execution  Single thread        Multi-threading
Memory store      Single buffer        Double buffer
Rename registers  GP: 72, FP: 80       GP: 120, FP: 120

31 Caches and Memory

                  POWER4                 POWER5
L1 cache          Data: 32 kbyte         Data: 32 kbyte
                  Instruction: 64 kbyte  Instruction: 64 kbyte
                  2-way assoc., FIFO     4-way assoc., LRU
L2 cache          1.5 Mbyte              1.9 Mbyte
                  8-way assoc., FIFO     10-way assoc., LRU
L3 cache          32 Mbyte               36 Mbyte
                  8-way assoc., LRU      12-way assoc., LRU
                  120 cycles             ~80 cycles
Memory bandwidth  4 Gbyte/s/chip*        16 Gbyte/s/chip*

* if all memory DIMM slots are occupied

32 POWER4 – POWER5 Comparison

Metric                             POWER4  POWER5
Frequency (GHz)                    1.7     1.9
L2 latency (cycles)                12      12
L3 latency (cycles)                120     80
Memory latency (cycles)            351     220
Copy bandwidth, 4 proc. (Gbyte/s)  8       18
Linpack rate, N=1000 (Gflop/s)     3.9     5.6
SPECint_base2000                   1077    1398
SPECfp_base2000                    1598    2576

33 Memory and Cache
Three levels of cache: L1 32 kbyte (data), L2 1.9 Mbyte (shared), L3 36 Mbyte (shared)
Memory: approximately uniform access, on module or across modules, up to 2 Gbyte

34 L1 Cache
D (data): 32 kbyte, 4-way associative, LRU replacement policy; write-through ("new" data goes directly to L2)
I (instruction): 64 kbyte
Bandwidth: 2 words × 8 bytes × clock rate ≈ 32 Gbyte/s
Latency: 4 – 5 clock periods
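Cache latencies like the 4 – 5 cycles above are typically measured with a pointer chase, where each load depends on the previous one so nothing can overlap. A minimal sketch (timing harness omitted): time the traversal, divide by the number of steps, and you get the load-to-use latency of whichever cache level holds the buffer.

```c
#include <stddef.h>

/* Follow a chain of indices for n steps.  Each load's address comes
 * from the previous load, so the loop is fully latency-bound.
 * Returning the final index keeps the chain from being optimized away. */
size_t chase(const size_t *next, size_t start, size_t n) {
    size_t i = start;
    for (size_t k = 0; k < n; k++)
        i = next[i];
    return i;
}
```

If `next` fits in the 32 kbyte L1, the per-step time approximates L1 latency; larger buffers expose L2, L3, and memory latency in turn.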

35 POWER5: Memory
One or two memory cards per MCM
Best bandwidth with two memory cards
25 Gbyte/s bandwidth per chip

36 Memory Subsystem
Memory controller on the processor chip
Synchronous Memory Interface (SMI) chips: DDR-I or DDR-II
Two or four SMI chips per DIMM set
Up to two DIMMs can be attached behind each of an SMI's two ports
Two-SMI configuration: 8-byte read, 2-byte write
Four-SMI configuration: 4-byte read, 2-byte write

37 Memory Subsystem
POWER5 – SMI link: 16-byte read + 8-byte write at 2× DRAM frequency
SMI – DRAM: 4 or 8 64-bit channels at DDR frequency
[Figure: memory controller on the POWER5 chip driving address, read (16-byte), and write (8-byte) buses to four SMI chips, each with DIMMs]

38 Interleaved Memory
Interleaved 4-way within an MCM (only if the two memory cards match in size)
Pages interleaved within an MCM
Consecutive pages can be on any MCM
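As a purely hypothetical illustration of 4-way page interleaving, the sketch below maps consecutive pages round-robin over four modules. Both the 4096-byte page size and the simple modulo policy are assumptions for the example, not the documented POWER5 hash:

```c
#include <stdint.h>

/* Toy interleave function: page number modulo the number of modules.
 * Consecutive pages land on modules 0, 1, 2, 3, 0, 1, ... */
unsigned page_to_mcm(uint64_t addr) {
    const uint64_t page_size = 4096;   /* assumed page size */
    return (unsigned)((addr / page_size) % 4);
}
```

The point of such a scheme is that a streaming access spreads its bandwidth demand evenly across all memory cards.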

39 Memory Page Placement
Default is random page placement
Small pages: local placement optional with the "first touch" policy; "round robin" option available with AIX 5.2
Large pages are also available: loader option or "tag" binary
Large pages are statically allocated: placed at allocation time, not at first reference
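Under the "first touch" policy, a page is physically placed on the memory local to the thread that first writes it. The sequential sketch below mimics what a threaded code does: each slice of the array is initialized by its owning thread, so its pages end up on that thread's module. In a real code each slice's loop runs on its own thread (e.g. one OpenMP thread per module); here the owner id is stored only to make the slicing visible, and any remainder elements (n % nthreads) are ignored for brevity.

```c
#include <stddef.h>

/* Each "thread" t touches (and thus places) its own contiguous slice. */
void first_touch_init(double *a, size_t n, int nthreads) {
    size_t slice = n / (size_t)nthreads;
    for (int t = 0; t < nthreads; t++) {
        for (size_t i = (size_t)t * slice; i < (size_t)(t + 1) * slice; i++)
            a[i] = (double)t;   /* illustrative: record the owner */
    }
}
```

The same loop structure must then be reused for the compute phase, so each thread works on the data it placed locally.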

40 Memory Allocation
Pages are allocated by module
Approximately uniform distribution, approximately round robin

41 Memory Allocation: Memory Affinity
Allocate pages on memory local to the module

42 Memory Performance
Bandwidth: 3 – 8 Gbyte/s per single processor; 200 Gbyte/s per p5-595
Latency: ~200 nanoseconds

43 POWER5 Design: Summary
More gates: 170 million → 260 million transistors
Enhancements: increased cache associativity, more rename registers, reduced L3 and cache latency
New features: simultaneous multi-threading, dynamic power management

44 System Architecture: p5 575
Configuration: single motherboard (planar) with eight POWER5 Dual Chip Modules (DCM)
  2.2 GHz single core, or 1.9 GHz dual core
Each DCM: eight Synchronous Memory Interfaces (SMI), eight DIMM slots (four or eight used), two address buses
The p5 575 is NOT address-fabric limited

45 p5 575: Inter-Module Buses
Eight bytes wide
Half processor frequency
50-bit real address

46 p5 575 Memory Interface
POWER5 – SMI link: 16-byte read + 8-byte write at 2× DRAM frequency
SMI – DRAM: 4 or 8 64-bit channels at DDR frequency
[Figure: POWER5+ chip connected to four SMI chips, each with DIMMs]

47 p5 575 POWER5 DDR1 Memory Performance

Construct     Large Pages  Small Pages
DAXPY         7.1          4.8
DDOT          8.1          3.8
STREAM Triad  5.2          4.2

Results are in Gbyte/s. Node bandwidth: ~56 Gbyte/s.
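For reference, the three benchmark kernels in the table are the standard ones below; the bandwidth figure is bytes moved divided by elapsed time (24 bytes per iteration for DAXPY and Triad, 16 for DDOT). The timing harness is omitted from this sketch.

```c
#include <stddef.h>

/* y[i] += alpha * x[i]  (reads x and y, writes y: 24 bytes/iter) */
void daxpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* dot product (reads x and y: 16 bytes/iter) */
double ddot(size_t n, const double *x, const double *y) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* STREAM triad: a[i] = b[i] + alpha * c[i]  (24 bytes/iter) */
void triad(size_t n, double alpha, const double *b, const double *c,
           double *a) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + alpha * c[i];
}
```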

