Cluster Hardware Overview (IA-32 Pentium)
Kent Milfeld, milfeld@tacc.utexas.edu
10/31/2002
The University of Texas at Austin, Texas Advanced Computing Center

1 Cluster Hardware Overview (IA-32 Pentium 4)
Outline: Cluster Systems
Cluster Architecture
– Nodes: 2-way SMP (Dell Xeon Pentium 4)
– Motherboard: 2-way SMP (ServerWorks)
– Interconnect: Switch (Myrinet)

2 Cluster Architecture
[Diagram labels: internet, Switch, Server, PC, PC+, GigE, Myrinet, Switch, File Server, PC+, ethernet, Myrinet, …, FCAL, SCSI, …]

3 Node
– Processors: Two 2.4GHz Intel® Xeon processors (2U)
– Chipset: ServerWorks Grand Champion LE chipset
– Memory: 2GB, 2:1 memory interleave (200MHz DDR SDRAM)
– FSB: 400MHz (Front Side Bus)
– Cache: 512KB L2 Advanced Transfer Cache
– Disk: Dual-channel integrated Ultra3 (Ultra160) SCSI, Adaptec® AIC-7899 controller (160MB/s)
Dell PowerEdge 2650 2U

4 Motherboard
[Block diagram labels: Pentium 4 Xeon Processors, GC-LE, CIOB-X, CSB5, BCM5701, Gigabit NIC, Myrinet Adapter, PCI-X, PCI64C, 32-bit PCI, Legacy I/O, IMBus, thin IMBus, 8B, 3.2GB/s, DDR 200, Interleaved Memory]
Memory Subsystem
– Memory: dual-channel, up to 16GB of DDR200 memory
– Bandwidth: 3.2GB/s of memory bandwidth
– RAS: ECC, redundant spare memory support, memory scrubbing & chipkill
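The 3.2GB/s figure can be cross-checked with simple arithmetic. The sketch below is a rough check, not a measurement. It assumes an 8-byte (64-bit) data path for both the front side bus and each DDR channel, and it reads 400MHz FSB and DDR200 as 400 and 200 million transfers per second. The function names are hypothetical.

```c
#include <stdio.h>

/* Peak bandwidth in GB/s (1 GB = 1e9 bytes) for `channels` identical
 * channels, each moving `bytes_per_transfer` bytes on every one of
 * `mega_transfers_per_s` million transfers per second. */
double peak_bandwidth_gbs(double mega_transfers_per_s,
                          double bytes_per_transfer,
                          int channels)
{
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9;
}

/* Print the two peak figures for the node described on the slides.
 * The 8-byte widths are assumptions, see the text above. */
void print_node_bandwidths(void)
{
    printf("FSB, 400 MT/s x 8 B:          %.1f GB/s\n",
           peak_bandwidth_gbs(400.0, 8.0, 1));
    printf("DDR200, 2 ch x 200 MT/s x 8 B: %.1f GB/s\n",
           peak_bandwidth_gbs(200.0, 8.0, 2));
}
```

Under these assumptions both paths peak at 3.2 GB/s, so neither the front side bus nor the dual-channel memory is the narrower of the two.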

5 Interconnect (Myrinet Bandwidth)
Source: http://www.myri.com/myrinet/performance/index.html

6 Interconnect (MPI Bandwidth)
[Plot series: GigE (IBM), Myrinet]

7 Outline: Pentium 4 Microarchitecture
– Features
– Block Diagrams (data flow / hardware)
– Out-of-Order (OO) execution
– Speeds & Feeds
– Floating Point & Memory Performance
– Registers / Caches
– SIMD
– Compiler Design
– Optimizations

8 Architecture Features
– NetBurst Microarchitecture
– Instruction Cache (Execution Trace Cache)
– Out-of-Order (OO) execution engine
– Double-pumped Arithmetic Logic Unit
– Memory Subsystem (L1 access in 2 CP)
– Floating Point/Multi-Media performance

9 Basic Features
– 42 million transistors (0.18µm), 217 mm², 55 watts @ 1.5GHz, 6 levels of aluminum interconnect
– Up to 3.0GHz
– 400/533 MHz FSB
– 144 128/64-bit SIMD instructions
  – SSE2 (Streaming SIMD Extensions 2)

10 Data and Instruction Flow
[Block diagram. Sections: Front End, Out-of-Order Engine, Int & FP Execution, Memory Subsystem. Labeled blocks: Fetch/Decode, Trace Cache, code ROM, BTB, Branch Prediction, Execution Out-of-Order Core, Retirement, Branch History Update, Execution Units, Registers, Level 1 Data Cache, Level 2 Cache, Bus Unit, Bus 100MHz, Memory]

11 Block Diagram
From Tom’s Hardware

12 Out-of-Order Execution
– Non-deterministic because of out-of-order execution.
– Stalls are overcome by parallel execution, buffering, and speculation.
– In-Order Issue, Out-of-Order Execution, In-Order Retirement

13 Out-of-Order Execution: Pipeline
Pentium III processor misprediction pipeline (10 stages):
  1-2 Fetch | 3-5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec
Pentium 4 processor misprediction pipeline (20 stages), stage labels:
  TC Fetch, Drive, Alloc, Rename, Que, Sch, Disp, FR, Ex, Flgs, BrCk, Drive

14 Pentium 4 Speeds & Feeds
Legend: W = FP word (64 bit); Int = integer (64 bit); CP = clock period
– Regs. and L1 Data (8KB): 1 W (load)/CP, 1 W (store)/CP
– L1 and L2 (256/512KB): 4 W/CP, 32B wide, on die
– L2 and Memory (PC800 RDRAM): 1 W per 6 CP @ 400MHz FSB, 2.4GHz CPU
– Trace Cache to Exec: 3 uops/CP stream
– Latencies: 2 CP, ~4 CP, 2-7 CP, ~90 CP
– Line size L1/L2 = 8/16 W

15 Performance Comparison
Source: Scott Wasson, “Intel’s Pentium 4 Processor, Radical Chic,” www.tech-report.com/reviews/2001q3/pentium4-2ghz/

16 Processor Speed vs Memory Bandwidth
Source: Scott Wasson, “Intel’s Pentium 4 Processor, Radical Chic,” www.tech-report.com/reviews/2001q3/pentium4-2ghz/

17 Registers
Name    Count  Width    Description
GPR     8      32-bit   General Purpose Registers
SEG     6      16-bit   Segment Registers
FPU     8      80-bit   Floating Point Registers
MMX     8      64-bit   MMX/SSE Registers
XMM     8      128-bit  SSE2 Registers (FP/Int…)
Also shown (1 each): EFLAGS Register, Control Register

18 Pentium 4 Cache
Level   Capacity          Associativity  Line Size (bytes)   Latency int/float (clocks)  Write Update Policy
First   8KB               4              64                  2/9                         write through
TC      12K uops          8              N/A                 N/A                         N/A
Second  256KB, 512KB      8              128 read, 64 write  7/7                         write back
Third   0, 512KB or 1MB   8              128 read, 64 write  14/14                       write back
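The capacity, associativity and line-size columns determine how many sets a cache has, and therefore which addresses compete for the same ways. The sketch below assumes the conventional set-associative layout, with the set chosen by the address bits just above the line offset. The slides do not state the indexing scheme, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of sets = capacity / (ways * line size). */
size_t cache_sets(size_t capacity_bytes, size_t ways, size_t line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

/* Set selected by a byte address: drop the offset within the line,
 * then wrap around the number of sets (assumed indexing scheme). */
size_t cache_set_index(uintptr_t addr, size_t line_bytes, size_t sets)
{
    return (size_t)((addr / line_bytes) % sets);
}
```

For the first-level data cache in the table (8KB, 4-way, 64-byte lines), cache_sets(8 * 1024, 4, 64) gives 32 sets. Under the assumed indexing, addresses that are a multiple of 32 x 64 = 2048 bytes apart fall into the same set, so more than four such streams would evict one another.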

19 SIMD
Beginning with Pentium II, SIMD technology was integrated into the hardware & instruction set. SSE2 was implemented in Pentium 4.
Instructions  Packed Data        MMX 64-bit Registers  XMM 128-bit Registers  Apps
MMX           INT B,W,Q          Yes                   ---                    Imaging, MM, comm.
SSE           SP Float           Yes                   ---                    3-D geo/rendering, video en/decode
SSE2          INT, SP/DP Float                         Yes                    4-D graphics, Scientific Comp
Intel Hyper-Threading Technology
– Use OpenMP pragmas & directives with the Intel compiler.
– Higher performance realized with Multi-Entry Threading (MET).
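As an illustration of SSE2 packed double-precision arithmetic in the 128-bit XMM registers, here is a minimal sketch using the compiler intrinsics from <emmintrin.h>. The function name is hypothetical. The code falls back to a plain scalar loop when the compiler does not define __SSE2__, and a vectorizing compiler may generate similar code from the scalar loop on its own.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#endif

/* c[i] = a[i] + b[i] for n doubles. */
void add_f64(const double *a, const double *b, double *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Two doubles fit in one 128-bit XMM register. Unaligned loads and
     * stores are used so the arrays need no special alignment. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail, or the whole loop when SSE2 is not available. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

With 16-byte aligned arrays, _mm_load_pd and _mm_store_pd could replace the unaligned variants.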

20 Compiler Design
“all for one” with SIMD
[Compiler structure diagram, top to bottom:]
– C++/C Front End, Fortran 95 Front End
– Code Restructuring & IPO
– OpenMP/Automatic Parallelization & Vectorization
– HLO & Scalar Opt.
– Lower Level Code Gen. & Optimization
– IA-32, IA-64
[Annotations: outlining; Multi-Entry Threading]
Uses guide-based multi-threaded run-time library from Intel KAI Software Laboratory (KSL)

21 OPT: Avoid Unpredictable Branches
Simple, often traversed, loops can be corrected by compiler:
  C = (A<B) ? C1 : C2
or
  IF(A.LT.B) C=C1; ELSE C=C2; ENDIF
Example 1. Optimization Eliminates Branch
Assembly Branch:
  cmp A,B
  jge L0
  mov ebx, C1
  jmp L1
  L0: mov ebx, C2
  L1:
Assembly No-Branch (pseudo-assembly):
  Compare A and B
  C3 = C1-C2
  Set register to 0 or 1 (according to compare), then widen it to a mask of all 0s or all 1s
  AND C3 with register
  ADD C2 to register
Case   register (mask)  after AND with C3  after ADD of C2
A>=B   0000000000       0                  C2
A<B    1111111111       C3                 C1 (result)
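The compare, mask, AND and ADD steps above can be written in C as follows. This is a sketch of the idea rather than what any particular compiler emits. The function names are hypothetical, and a modern compiler may already turn the plain ternary into a branch-free instruction sequence by itself.

```c
#include <stdint.h>

/* Reference version with a conditional: c1 when a < b, otherwise c2. */
int32_t select_branch(int32_t a, int32_t b, int32_t c1, int32_t c2)
{
    return (a < b) ? c1 : c2;
}

/* Branch-free version following the slide's steps. Unsigned arithmetic
 * is used so that wrap-around in c1 - c2 is well defined. */
int32_t select_no_branch(int32_t a, int32_t b, int32_t c1, int32_t c2)
{
    uint32_t c3   = (uint32_t)c1 - (uint32_t)c2;  /* C3 = C1 - C2 */
    uint32_t mask = 0u - (uint32_t)(a < b);       /* all 0s or all 1s */
    uint32_t r    = (c3 & mask) + (uint32_t)c2;   /* AND, then ADD C2 */
    /* Converting back to int32_t relies on the usual two's-complement
     * behavior of mainstream compilers. */
    return (int32_t)r;
}
```

When a < b the mask is all 1s, so the result is (C1 - C2) + C2 = C1. Otherwise the mask is 0 and the result is C2.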

22 OPT: Make code consistent with static prediction algorithm
– Predict backward conditional branches to be taken (loops).
– Predict forward conditional branches to be NOT taken.
– Predict indirect branches to be NOT taken.
[Diagram:]
– if { … }: forward conditional branches, not taken (fall through)
– for { … }, loop { … }: backward conditional branches, taken
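The three rules on this slide can be restated as a small predicate. The sketch below only models the rules as listed. The type and function names are hypothetical, and the sketch does not model dynamic prediction from branch history.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BR_CONDITIONAL, BR_INDIRECT } branch_kind;

/* Returns true when the static rules predict "taken":
 *   - indirect branches: not taken
 *   - conditional branch to a lower or equal address (backward): taken
 *   - conditional branch to a higher address (forward): not taken */
bool static_predict_taken(branch_kind kind,
                          uintptr_t branch_addr,
                          uintptr_t target_addr)
{
    if (kind == BR_INDIRECT)
        return false;
    return target_addr <= branch_addr;
}
```

In source terms this suggests putting the common case on the fall-through path of an if, and leaving rarely executed code, such as error handling, behind the forward branch.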

23 OPT: Make code consistent with static prediction algorithm
– Inline functions with branch structure: a mispredicted branch can lead to larger performance penalties inside a small function than if that function is inlined.
– Be careful not to increase the “working set” beyond what will fit in the trace cache.
– Indirect branches degrade performance if they are non-predictable (switches, computed GOTOs, calls through pointers).

24 OPT: Unrolling
– If the loop count is small, unroll Pentium 4 code so that no more than 16 iterations are performed. They will all be predicted. (Only 4 suggested for PII and PIII.)
– Other concerns: registers, working set size in trace cache, and prefetching may be more important.
OPT: Memory
– Inappropriate alignment and forwarding are the sources of large delays.
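A minimal sketch of the unrolling advice, assuming a fixed-length loop of 64 elements (a made-up size): unrolling by 4 leaves 16 loop iterations, the limit quoted on the slide. The macro and function names are hypothetical, and an optimizing compiler may unroll the simple loop without help.

```c
#include <stddef.h>

#define VEC_LEN 64   /* hypothetical trip count */
#define UNROLL  4    /* VEC_LEN / UNROLL = 16 loop iterations */

/* Straightforward loop: 64 iterations, one loop branch each. */
long sum_rolled(const int x[VEC_LEN])
{
    long s = 0;
    for (size_t i = 0; i < VEC_LEN; ++i)
        s += x[i];
    return s;
}

/* Unrolled by 4: 16 iterations. Four independent partial sums also
 * shorten the dependency chain through the accumulator. */
long sum_unrolled(const int x[VEC_LEN])
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < VEC_LEN; i += UNROLL) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The unrolled body is larger, which is the trade-off against trace cache working set and register use that the slide mentions.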

25 OPT: General Optimization Concerns
– Instruction decoding is less important (than with Pentium III).
– Some latencies of simple arithmetic ops have decreased (2x faster local clock).
– Memory latency hiding is better (hardware prefetching).
– New cacheability instructions (streamline stores and manage cache usage).
– Fewer prefetches required (64-byte cache lines compared to 32 bytes on PII, PIII), but false sharing is more important.
– L2 code misses should be fewer (Trace Cache is used in lieu of L1 code cache).
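One common way to avoid false sharing between threads is to give each thread's data its own cache line. The sketch below assumes C11 and the 64-byte line size quoted on this slide. The type and function names are hypothetical, and a larger padding value may be appropriate where the hardware fetches wider sectors.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* assumption: line size quoted on the slide */

/* One counter per cache line: the member alignment also rounds the
 * struct size up to a full line. */
typedef struct {
    _Alignas(CACHE_LINE_BYTES) long value;
} padded_counter;

_Static_assert(sizeof(padded_counter) == CACHE_LINE_BYTES,
               "each counter should occupy exactly one cache line");

/* Byte distance between neighbouring counters in an array. */
size_t counter_stride(const padded_counter *counters)
{
    return (size_t)((uintptr_t)&counters[1] - (uintptr_t)&counters[0]);
}
```

With an array such as padded_counter per_thread[2], each thread updates per_thread[id].value, and updates from different threads land in different lines.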

26 OPT: X87/SSE2 Instructions
– Avoid changing between 3 (or more) floating-point modes. [FLDCW changes the mode (precision & rounding control, etc.), e.g. when converting to int.] A mode change must flush the instruction pipe.
– Masked floating-point exceptions require “assistance” from slower microcode operations to handle the masked exception.
– Avoid propagation of overflow, underflow and denormalized operands.

27 OPT: X87/SSE2 Instructions
– Set mode to convert underflows to zero (FTZ mode).
– Set mode to convert denormalized floats to zero (DAZ mode).
– Use FTZ and DAZ when speed is important and a slight loss in precision is acceptable.
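FTZ and DAZ are bits in the MXCSR control register, so they affect SSE/SSE2 arithmetic and not x87 instructions. The sketch below shows one way to set them from C, assuming a compiler that provides <xmmintrin.h>. The function names and the MXCSR_DAZ macro are hypothetical. The code does not check whether the processor supports DAZ, and it does nothing when built without SSE.

```c
#include <float.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE */
#endif

#define MXCSR_DAZ 0x0040u   /* denormals-are-zero bit (bit 6) of MXCSR */

/* Turn on FTZ and DAZ for the calling thread.
 * Returns 1 when the mode was set, 0 when built without SSE support. */
int enable_ftz_daz(void)
{
#if defined(__SSE__)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);   /* FTZ: bit 15 */
    _mm_setcsr(_mm_getcsr() | MXCSR_DAZ);         /* DAZ: bit 6  */
    return 1;
#else
    return 0;
#endif
}

/* Half of the smallest normal float: a denormal result by default,
 * exactly zero when FTZ is active and the multiply uses SSE.
 * volatile keeps the compiler from folding the product at build time. */
float half_of_flt_min(void)
{
    volatile float tiny = FLT_MIN;
    volatile float half = 0.5f;
    return tiny * half;
}
```

MXCSR is per-thread state, so each worker thread would need to call enable_ftz_daz() itself.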

28 References
– http://www.tomshardware.com/
– http://www.myri.com
– http://www.serverworks.com
– http://developer.intel.com/design/pentium4/manuals/index2.htm
– http://www3.sk.sympatico.ca/jbayko/cpu.html