
1 ® * Other brands and names may be claimed as the property of others. Architecture of: Intel® Pentium® 4, Intel® Xeon™, Intel® Xeon™ MP Architecture Rev. 1.0 HM 10/7/2010 Derived from Herbert G. Mayer’s 2003 Presentation for: Intel Software College

2 ® * Other brands and names may be claimed as the property of others. 2 Agenda  Assumptions  Speed Limitations  x86 Architecture Progression  Architecture Enhancements  Intel ® x86 Architectures

3 ® * Other brands and names may be claimed as the property of others. 3 Assumptions  Audience: Understands generic x86 architecture  Knows some assembly language –Flavor used here: gas, as used in ddd disassembly –Result is on the right-hand side: –mov [temp], %eax ; is a load into register a –add %eax, %ebx ; new integer sum is in register b –Different from Microsoft* masm and tasm  Understands some architectural concepts: –Caches, multi-level caches, (some MESI) –Threading, multi-threaded code –Blocking (cache), blocking (aka tiling), blocking (thread synch.)  Causes of pipeline stalls –Control flow change –Data dependence, registers and data  NOT discussed: asm, VTune, CISC vs. RISC

4 ® * Other brands and names may be claimed as the property of others. 4 Speed Limitations

5 ® * Other brands and names may be claimed as the property of others. 5 Agenda  Performance Limiters  Register Starvation  Processor-Memory Gap  Processor Stalls  Store Forwarding  Misc Limitations: –Spin-Lock in Multi Thread –Misaligned Data –Denorm Floats

6 ® * Other brands and names may be claimed as the property of others. 6 Performance Limiters  Architectural limitations the programmer or compiler can overcome: –Indirect limitations: stall via branch, call, return –Incidental limits: resource constraint –Historical limits: register starved x86 –Technological: ALU speed vs. memory access speed –Logical limits: data- and resource dependence

7 ® * Other brands and names may be claimed as the property of others. 7 Register Starvation  How many regs needed (compiler or programmer)? –Infinite is perfect –1024 is very good –64 acceptable –16 is crummy –4+4 is x86 –1 is saa (single-accumulator architecture)  Formally on x86: 16 regs. Quick test: –ax, bx, cx, dx –si, di –bp, sp, ip –cs, ds, ss, es, fs, gs, flags  Of which ax, bx, cx, dx are GPRs, almost  Rest can be used as better temps  ax & dx used for * and /, cx for loop

8 ® * Other brands and names may be claimed as the property of others. 8 Register Starvation  Absence of regs causes –Spurious memory spills and loads –False data dependences -- not dependencies  Except for the single-accumulator architecture: no other arch is more register starved than x86  Instruction stream (added ops, mem latency): mov %eax, [mem1] / use stuff, %eax / mov [mem1], %eax  Instruction stream (false DD): mov %eax, [tmp] / add %ebx, %eax / imul %ecx / mov %eax, [prod] / mov [tmp], %eax

9 ® * Other brands and names may be claimed as the property of others. 9 And the Programmer?  No solution in ISA, x86 had 4 GPRs since 8086  Improved via internal register renaming –Pentium ® Pro has hundreds of internal regs  Added registers in mmx –Visible to you, programmer and compiler –fp(0).. fp(7), 80-bits as FP, 64 bits as mmx, but note: context switch  Added registers in SSE –xmm(0).. xmm(7) 128 bits

10 ® * Other brands and names may be claimed as the property of others. 10 Processor-Memory Gap  [Chart: performance vs. time, "Moore's Law"] CPU performance grows ~60%/yr while DRAM performance grows ~7%/yr, so the processor-memory performance gap grows ~50% per year. Source: David Patterson, UC Berkeley

11 ® * Other brands and names may be claimed as the property of others. 11 Bridging the Gap: Trend  [Chart: CPU vs. DRAM performance over time] The gap is bridged first by caches and multilevel caches, then by instruction-level and thread-level parallelism.  Intel® Pentium II Processor: out-of-order execution, ~30%  Intel® Xeon™ Processor: Hyperthreading Technology, ~30%  Hyperthreading Technology feeds two threads to exploit shared execution units

12 ® * Other brands and names may be claimed as the property of others. 12 Impact of Memory Latency  Memory speed has NOT kept up with advance in processor speed –Avg. integer add ~ 0.16 ns (Xeon), but memory accesses take ~10 ns or more  CPU hardware resource utilization is only 35% on average –Limited due to memory stalls and dependencies  Possible solutions to memory speed mismatch? Memory speed mismatch is a major source of CPU stalls

13 ® * Other brands and names may be claimed as the property of others. 13 And the Programmer?  Cache provided  Methods to manipulate cache  Tools provided to pre-fetch data –At risk of superfluous fetch, if control-flow change
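A minimal sketch of the data pre-fetch idea, using the SSE _mm_prefetch intrinsic (declared in xmmintrin.h). The prefetch distance of 16 elements is an assumption that would need tuning, and, as noted above, the fetch is superfluous if control flow changes before the data is used:

    #include <xmmintrin.h>   // _mm_prefetch

    constexpr int kPrefetchDistance = 16;   // tuning assumption, not from the course

    float sum_with_prefetch(const float* a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) {
            if (i + kPrefetchDistance < n)
                // Hint: pull a future cache line toward the caches ahead of its use.
                _mm_prefetch(reinterpret_cast<const char*>(&a[i + kPrefetchDistance]),
                             _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }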

14 ® * Other brands and names may be claimed as the property of others. 14 Processor Stalls  A stalled cycle is a cycle in which the processor cannot receive or schedule new instructions –Total Cycles = Total Stall Cycles + Productive Cycles –Stalls waste processor cycles –Perfmon, Linux ps, top, and other system tools show stalled cycles as busy CPU cycles –Intel® VTune Analyzer used to monitor stalls (HP* PFMon)

15 ® * Other brands and names may be claimed as the property of others. 15 Why Stalls Occur!  Stalls occur because: –An instruction needs a resource that is not available –Dependences (control- or data-) between instructions –The processor / instruction waits for some signal or event  Sample resource limitations: –Registers –Execution ports –Execution units –Load / store ports –Internal buffers (ROBs, WOBs, etc.)  Sample events: –Exceptions, cache misses, TLB misses, etc. –Common thing: they hold up compute progress

16 ® * Other brands and names may be claimed as the property of others. 16 Control Dependences (CD)  Change in flow of control causes stalls  Processors handle control dependences: –Via branch prediction hardware –Conditional move to avoid branch & pipeline stall  Instruction stream 1 (barrier at the predicted branch): mov [%ebp+8], %eax / cmp 1, %eax / jg bigger / mov 1, %eax / ... / bigger:  Instruction stream 2 (barrier at the predicted call): dec %ecx / push %eax / call rfact / mov %ecx, [%ebp+8] / mul %ecx

17 ® * Other brands and names may be claimed as the property of others. 17 Data Dependences (DD)  Data dependence limits performance  Programmer / compiler cannot solve it alone: –Xeon has register renaming to avoid false data dependences –Supports out-of-order execution to hide effects of dependences  Instruction stream 1 (mem latency): mov eax, [ebp+8] / cmp eax, 1  Instruction stream 2 (false DD): mov [temp], eax / add eax, ebx / mul ecx / mov [prod], eax / mov eax, [temp]

18 ® * Other brands and names may be claimed as the property of others. 18 Xeon Processor Stalls  D-side –DTLB Misses –Memory Hierarchy  L1, L2 and L3 misses  Core –Store Buffer Stalls –Load/Store splits –Store forwarding hazard –Loading partial/misaligned data –Branch Mispredicts  I-side –Streaming Buffer Misses –ITLB Misses –TC misses –64K Aliasing conflicts  Misc –Machine Clears

19 ® * Other brands and names may be claimed as the property of others. 19 And the Programmer?  Reduce processor stalls by prefetching data  Reduce control-flow changes by using conditional moves  Reduce false dependences by using register temps from the mmx (fp) and xmm pools
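A hedged C++ sketch of the conditional-move point: the branchy form mirrors the cmp/jg stream on slide 16, while the branchless form gives the compiler the opportunity to emit cmov, so there is no branch to mispredict. Whether cmov is actually generated depends on the compiler and flags.

    // Branchy form: the conditional jump is subject to branch prediction.
    int at_least_one_branchy(int x) {
        if (x <= 1)
            x = 1;
        return x;
    }

    // Branchless form: typically compiled to cmp + cmov, removing the control dependence.
    int at_least_one_branchless(int x) {
        return (x > 1) ? x : 1;
    }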

20 ® * Other brands and names may be claimed as the property of others. 20 Partial Writes: WC buffers  An incomplete write-combining (WC) buffer is evicted as several (e.g. 3) 8B "partial" bus transactions; a complete WC buffer is written as 1 bus transaction to the second-level cache / memory over the FSB  Detection (VTune event-based sampling): Ext. Bus Partial Write Trans., L2 Cache Request, Ext. Bus Burst Read Trans., Ext. Bus RFO Trans.  Causes: 1) Too many WC streams 2) WB loads/stores contending for fill-buffers to access L2 cache or memory  Partial writes reduce actual front-side bus bandwidth: –~3x lower for PIII –~7x lower for the Pentium 4 processor due to its longer cache line

21 ® * Other brands and names may be claimed as the property of others. 21 Store Forwarding Guidelines  Store forwarding: a load from an address recently stored to can get its data forwarded from the store, more quickly than via a memory access  A) Will forward: load aligned with the store; load contained in a single store; 128-bit forwards must be 16-byte aligned (16-byte boundaries)  B) Forwarding penalty: load not aligned with, or not contained in, the prior store  Large penalty for non-forwarding cases ( x)  MSVC < 7.0 generates these; the Intel Compiler doesn't.

22 ® * Other brands and names may be claimed as the property of others. 22 And the Programmer?  Pick the right compiler for HLL programs  Use VTune to check asm code  In asm programs, ensure loads after stores are: –Contained in the stored data, subset or proper subset –In a single previous store, not in the sum of multiple stores –Thus do store-combining: assemble together, then store –Both load and store start at the same address
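An illustrative sketch of the store-combining guideline (the access pattern is an assumption chosen to show the rule, not code from the course): a wide load that spans several narrow stores cannot be forwarded, while a load fully contained in a single prior store can.

    #include <cstdint>
    #include <cstring>

    // Penalized: a 32-bit load that spans two prior 16-bit stores does not forward.
    uint32_t split_stores(uint16_t lo, uint16_t hi) {
        uint16_t half[2];
        half[0] = lo;                        // 16-bit store
        half[1] = hi;                        // 16-bit store
        uint32_t v;
        std::memcpy(&v, half, sizeof v);     // 32-bit load spans both stores
        return v;
    }

    // Store-combining: assemble the value in a register, store it once,
    // so the later 32-bit load is contained in a single aligned store.
    uint32_t combined_store(uint16_t lo, uint16_t hi) {
        uint32_t whole = (uint32_t(hi) << 16) | lo;
        uint32_t v;
        std::memcpy(&v, &whole, sizeof v);
        return v;
    }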

23 ® * Other brands and names may be claimed as the property of others. 23 Misc Limitations  Spin-lock in multi-thread –Don't use busy waiting just because you have (almost) a second processor for the second thread  Misaligned data –Don't place data on arbitrary boundaries just because the architecture can fetch from any address  Dumb errors –Failing to use the proper tool (library, compiler, performance analyzer) –Failing to use tiling (aka blocking) or SW pipelining  Denormalized floats

24 ® * Other brands and names may be claimed as the property of others. 24 And the Programmer?  Use pause, when applicable! –New NetBurst instruction  Use compiler switches to align data on an address divisible by the largest individual data object –Who cares about wasting 7 bytes to force 8-byte alignment?  Be smart, pick the right tools –Instruct the compiler to SW pipeline –In asm, manually SW pipeline; note: easier on EPIC than on VLIW, sometimes needing no prologue / epilogue –Enable the compiler to partition larger data structures into smaller suitable blocks, for improved locality – cache-parameter dependent
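A minimal spin-wait sketch using the pause hint just mentioned, via the _mm_pause intrinsic (emmintrin.h); it encodes as rep nop, i.e. a plain NOP, on older IA-32 parts. The atomic flag is a hypothetical synchronization variable:

    #include <atomic>
    #include <emmintrin.h>   // _mm_pause

    // Spin until another thread sets the flag; PAUSE hints that this is a
    // spin-wait, reducing power and helping the sibling logical processor.
    void spin_until_set(const std::atomic<int>& flag) {
        while (flag.load(std::memory_order_acquire) == 0)
            _mm_pause();
    }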

25 ® * Other brands and names may be claimed as the property of others. 25 And the Programmer?  Exercise: first of 2 labs, this one being a "two-minute" exercise:  Turn on your computer, verify Linux is alive  Verify you have available: –Editor to modify the program –Intel C++ compiler, command icc, with -g –Debugger ddd, with disassembly ability  Source program vscal.cpp  Linux commands: ls, vi, icc, mkdir, etc.

26 ® * Other brands and names may be claimed as the property of others. 26 Module Summary Covered: key causes that render execution slower than possible:  More registers are at your disposal than it seems  The von Neumann bottleneck can be softened via cache use and data pre-fetch  Stalls can be reduced by conditional moves and by avoiding false dependences  Use (time-limited) capabilities, such as proper store forwarding  Note the new pause instruction

27 ® * Other brands and names may be claimed as the property of others. 27 x86 Architecture Progression

28 ® * Other brands and names may be claimed as the property of others. 28 Agenda: x86 Arch. Progression  Abstract & Objectives  x86 Nomenclature & Notation  Intel® Architecture Progress  Pentium 4 Abstract

29 ® * Other brands and names may be claimed as the property of others. 29 Abstract & Objectives: x86 Architecture Progression  Abstract: High-level introduction to history and evolution of increasingly powerful 16-bit and 32-bit x86 processors that are backwards compatible.  Objectives: understand processor generations and architectural features, by learning –Progressive architectural capabilities –Names of corresponding Intel processors –Explanation, description of capabilities –FP incompatibility, minor

30 ® * Other brands and names may be claimed as the property of others. 30 Non-Objectives  Objective is not introduction of: –x86 assembly language, assumed known –Itanium ® processor family now in 3 rd generation –Intel tools (C++, VTune) –Performance tools: MS Perfmon, Linux ps, emon, HP PFMon, etc. –Performance benchmarks, performance counters –Differentiation Intel vs. competitor products –CISC vs. RISC

31 ® * Other brands and names may be claimed as the property of others. 31 x86 Nomenclature & Notation  Example entry: Pentium ® II, 2H98, 450 MHz / MMX, BX chipset / Dynamic branch prediction enhanced  Line 1: processor name, initial launch date, final clock speed  Line 2: architecturally visible enhancement list, can be empty  Line 3: architectural speedup technique, invisible except for higher speed

32 ® * Other brands and names may be claimed as the property of others. 32 Intel® Architecture Progress  8086, 2H80, 4 MHz  80486, 2H85, 10 MHz, FP integrated  Pentium ®, 1988, 40 MHz, D+I caches, static branch prediction  Pentium ® Pro, 2H95, 100 MHz, dynamic branch prediction  Pentium ® II, 2H98, 450 MHz, MMX, BX chipset, dynamic branch prediction enhanced  Pentium ® III, 2H99, 733 MHz, SSE, XMM regs, large cache, L2 on-chip  Pentium ® 4, 2H00, 3.06 GHz, SSE2, 144 WNI, NetBurst ®, L3 on-chip cache

33 ® * Other brands and names may be claimed as the property of others. 33 Intel ® Pentium ® 4 Processors  Northwood (Pentium ® 4): Willamette shrink. Consumer and business desktop processor. HT not enabled, though capable.  NW E-Step (Pentium 4): HT errata corrected. Desktop processor.  Prescott (Pentium 4): Consumer and business desktop processor. Replaces NW. Offers 6 PNI: Prescott New Instructions. First processor with LaGrande technology (trusted computing).  Prestonia (DP Xeon TM): DP slated for workstations and entry-level servers. Based on NW core. HT enabled. 512 kB L2 cache. No L3. 3 GHz processor.  Nocona (DP Xeon): DP based on Prescott core. Targeted for 3.06 GHz. 533 MHz (quad-pumped) bus, i.e. bus speed is 133 MHz. 1 MB L2 cache. HT enabled. About to be launched.  Foster (MP Xeon): MP based on Willamette core. 1 MB L3 cache, 256 kB L2, HT enabled. For higher-end servers.  Gallatin (MP Xeon): MP based on NW core. 1 or 2 MB L3 cache, 512 kB L2 cache. For high-end servers. See 8-way HP DL 760 and IBM x440. HT enabled.  Potomac (MP Xeon): MP based on Prescott core. 533 MHz (quad-pumped) bus. 1 MB L2 cache, 8 MB L3 cache. HT enabled, yet to be launched.  Note: lower clock rates for MP versions, due to higher circuit complexity and bus load.

34 ® * Other brands and names may be claimed as the property of others. 34 Processor Generation Comparison  [Table: Pentium® III Processor vs. Pentium® 4 Processor (Willamette / Northwood)]  Execution type: dynamic vs. Intel® NetBurst™ arch  Clock: 600 MHz – 1.13 GHz vs. 1.5 GHz (Willamette) / 2+ GHz (Northwood)  System bus: 100 / 133 MHz vs. 400 MHz (4x100 MHz) / 400/533 MHz (4x100/133 MHz)  MMX™ Technology: yes on all  Streaming SIMD Extensions: yes on all  Streaming SIMD Extensions 2: no vs. yes  L2 cache: 512k off-die / 256k on-die vs. 256k on-die / 512k on-die  Manufacturing process: .25 / .18 micron vs. .18 / .13 micron  Chipset I/O hub: ICH-1 / ICH-2 vs. ICH-2

35 ® * Other brands and names may be claimed as the property of others. 35 Intel® Architecture Progress  8087 co-processor of 8086: off-chip FP computation, extended 80-bit FP format for DP  MMX: multi-media extensions –Mmx regs aliased w. FP register stack –needs context switch –FP regs also called ST(I) regs  SSE: Streaming SIMD extension already since Pentium III  WNI: 144 new instructions, using additional data types for existing opcodes, using previously reserved opcodes

36 ® * Other brands and names may be claimed as the property of others. 36 Intel® Architecture Progress  XMM: 8 new 128-bit registers, in addition to MMX  SSE2: multiple integer ops and multiple DP FP ops: part of 144 WNI –Regs unchanged in Pentium ® 4 from P III –Ops added  NetBurst: generic term for: HyperThreading & quad-pumped bus & new Trace Cache & etc. Note: architectural feature ages with next generation, but survives, due to compatibility requirement. Hence is interesting not only for historical reasons: You need to know it!

37 ® * Other brands and names may be claimed as the property of others. 37 Xeon TM MP Abstract  Xeon™ MP Processor "Gallatin", core frequency 2.0+ GHz  Hyperthreading Technology: 2 logical CPUs  Pipeline stages: 20  Registers: 126  Instructions/clock-cycle: 3  Execution units: 8 integer, 1 multimedia, 2 floating point; 2x double-pumped ALU  On-die cache: L1 = 12K µop TC + 8K data, L2 = 512 KB; external cache: L3 = 1 or 2 MB  System bus bandwidth: 3.2 GB/s (400 MHz quad-pumped)  Physical addressing: 36-bit (PAE-36, since Pentium Pro), up to 64 GB

38 ® * Other brands and names may be claimed as the property of others. 38 Xeon TM Memory Hierarchy  Xeon ™ Processor MP:  TC: 12KB, 64B lines, 2 CLKS  L1 (DL0): 8KB, 64B lines, 2 CLKS  L2 (unified): 512KB, 8-way, 128B lines, 7+ CLKS, 12.8 GB/s to the core  L3: 2MB, 8-way, 128B lines, 21+ CLKS  External memory: up to 64GB, 3.2 GB/s system bus  Note: Physical Address Extension, 36-bit PAE addresses, since Pentium ® Pro

39 ® * Other brands and names may be claimed as the property of others. 39 Architecture Enhancements

40 ® * Other brands and names may be claimed as the property of others. 40 Agenda: Architecture Enhancements  Abstract & Objectives  Faster Clock  Caches: Advantage, Cost, Limitation  Multi-Level Cache-Coherence in MP  Register Renaming  Speculative, Out of Order Execution  Branch Prediction, Code Straightening

41 ® * Other brands and names may be claimed as the property of others. 41 Abstract & Objectives: Architecture Enhancements  Abstract: Outline generic techniques that overcome performance limitations  Objectives: understand the cost of architectural techniques (tricks) in terms of resources (mil space) and of lost performance if incorrectly guessed –Caches: cost silicon, can slow down –Branch prediction: costs silicon, can be wrong –Prefetch: costs an instruction, may be superfluous –Superscalar: may not find a second op

42 ® * Other brands and names may be claimed as the property of others. 42 Non-Objectives  Objective is not to explain detail of Intel processor architecture  Not to claim Intel invented techniques; academia invented many  Not to show all techniques; some apply mainly to EPIC or VLIW architectures  No hype, no judgment, just the facts please!

43 ® * Other brands and names may be claimed as the property of others. 43 Faster Clock  CISC: –Decompose circuitry into multiple simple, sequential modules  Resulting modules are smaller and thus can be fast: –high clock rate –Shorter speed-paths  That's what we call: pipelined architecture  More modules -> simpler modules -> faster clock -> super-pipelined  Super-pipelining NOT goodness per-se: –Saves no silicon –Execution time per instruction does not improve –May get worse, due to delay cycles  But: –Instructions retired per unit time improves –Especially in absence of (large number of) control-flow stalls

44 ® * Other brands and names may be claimed as the property of others. 44 Faster Clock  Xeon TM processor pipeline has 20 stages  The beautiful model breaks upon control transfer  Intel ® NetBurst TM µarchitecture, 20-stage pipeline: TC Nxt IP (2), TC Fetch (2), Drive, Alloc, Rename (2), Que, Sch (3), Disp (2), RF (2), Ex, Flgs, Br Ck, Drive  Classic per-instruction phases, overlapped across instructions in the pipeline: I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store

45 ® * Other brands and names may be claimed as the property of others. 45 Intel ® x86 Architectures

46 ® * Other brands and names may be claimed as the property of others. 46 Agenda: Intel x86 Architectures  Abstract & Objectives  High Speed, Long Pipe  Multiprocessing  MMX Operations  SSE Operations  SSE2 Operations  Willamette New Instructions WNI  Cacheability Instructions  Pause Instruction  NetBurst, Hyperthreading  SW Tools

47 ® * Other brands and names may be claimed as the property of others. 47 Abstract & Objectives: Intel ® x86 Architectures  Abstract: Emphasizing Pentium ® 4 processors, show progressively more powerful architectural features introduced in Intel processors. Refer to speed problems solved from module 2 and general solutions explained in module 3.  Objective: you not only understand the various processor product names and supported features (Intel marketing names), but understand how they work, and what their limitations and costs are.

48 ® * Other brands and names may be claimed as the property of others. 48 Non-Objectives  Objective is not to show Intel's techniques are the only ones, or best possible. They are just good trade-off in light of conflicting constraints: –Clock speed vs. small # of pipes –Small transistor count vs. high performance –Large caches vs. small mil. Space –Grandiose architecture vs. backward compatibility –Need for large register file vs. register-starved x86 –Wish to have two full on-die processors vs. preserving silicon space

49 ® * Other brands and names may be claimed as the property of others. High Speed, Long NetBurst TM Pipe  Basic Pentium ® Pro pipeline (10 stages): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec; intro at 733 MHz, .18µ  Basic NetBurst™ micro-architecture pipeline (20 stages): TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Que, Sch, Disp, RF, Ex, Flgs, Br Ck, Drive; intro at 1.4 GHz .18µ, then 2.2 GHz .13µ  Hyper-pipelined technology enables industry-leading performance and clock rate

50 ® * Other brands and names may be claimed as the property of others. 50 Check Your Progress  Match pipe functions to clocks/stages:  Execute: execute the µops on the correct port; 1 clk  Flags: compute flags (0, negative, etc.); 1 clk  Trace Cache Fetch: read decoded µops from the TC; 2 clks  Register File: read the register file; 2 clks  Drive: drive µops to the Allocator; 1 clk  Trace Cache/Next IP: read from the Branch Target Buffer; 2 clks  Dispatch: send µops to the appropriate execution unit; 2 clks  Rename: rename logical regs to physical regs; 2 clks  Drive: drive the branch result to the BTB at the front; 1 clk  Allocate: allocate resources for execution; 1 clk  Branch Check: compare actual branch to predicted; 1 clk  Queue: write µop into the µop queue to wait for scheduling; 1 clk  Schedule: write to schedulers; compute dependencies; 3 clks

51 ® * Other brands and names may be claimed as the property of others. 51 Multiprocessing, SMP  Def: Execution of 1 task by >= 2 processors  Flynn's taxonomy (1966): –Single-Instruction, Single-Data Stream (SISD) Architecture (PDP-11) –Single-Instruction, Multiple-Data Stream (SIMD) Architecture (array processors: Solomon, Illiac IV, BSP, TMC) –Multiple-Instruction, Single-Data Stream (MISD) Architecture (possibly: pipelined, VLIW, EPIC) –Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture (possibly: EPIC when SW-pipelined, true multiprocessor)

52 ® * Other brands and names may be claimed as the property of others. 52 MP Scalability Caveat  [Chart: performance gain from doubling processors vs. number of processors]  Gain follows the law of diminishing returns

53 ® * Other brands and names may be claimed as the property of others. 53 Intel® Xeon™ Processor Scaling 1.39x Frequency Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other source of information to evaluate the performance of systems or components they are considering purchasing. Source: Intel Corporation Based on Intel internal projections. System configuration assumptions: 1) two Intel® Xeon™ processor 2.8GHz with 512KB L2 cache in an E7500 chipset-based server platform, 16GB memory, Hyperthreading enabled; 2) Four Intel® Xeon™ processor MP 1.6GHz with 1MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 3) Four Intel® Xeon™ processor MP 2.0GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 4) Four Intel® Xeon™ processor MP 2.8GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled OLTPSPECint_rate_base2000 Frequency Scale more visible with large cache

54 ® * Other brands and names may be claimed as the property of others. 54 Intel® Xeon™ MP vs. Xeon™ Relative OLTP Performances Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other source of information to evaluate the performance of systems or components they are considering purchasing. Source: TPC.org Which processor is better? Xeon processor MP Targeted for OLTP

55 ® * Other brands and names may be claimed as the property of others. 55 MMX Integer Operations  Add (saturation): paddusw mm0, mm3 – packed add with unsigned saturation on words; e.g. F000h + 3000h saturates to FFFFh  Add (wrap around): paddw mm0, mm3 – packed add on words; e.g. F000h + 3000h wraps to 2000h
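A sketch of the two add flavors through the corresponding MMX intrinsics (mmintrin.h); the input words mirror the F000h + 3000h case above and are otherwise arbitrary:

    #include <mmintrin.h>   // MMX intrinsics (__m64)

    void mmx_add_demo() {
        __m64 a = _mm_set_pi16(0x7000, (short)0xF000, 0x0002, 0x0003);
        __m64 b = _mm_set_pi16(0x7000, 0x3000, 0x0004, 0x0005);

        __m64 sat  = _mm_adds_pu16(a, b);  // paddusw: F000h + 3000h saturates to FFFFh
        __m64 wrap = _mm_add_pi16(a, b);   // paddw:   F000h + 3000h wraps to 2000h

        (void)sat; (void)wrap;
        _mm_empty();                       // emms: release the aliased x87/MMX state
    }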

56 ® * Other brands and names may be claimed as the property of others. 56 MMX Arithmetic Operations  Multiply-low: pmullw mm0, mm3 – multiply words, keep the low 16 bits of each product  Multiply-high: pmulhw mm1, mm4 – multiply words, keep the high 16 bits of each product

57 ® * Other brands and names may be claimed as the property of others. 57 MMX Arithmetic Operations  Multiply-Add: pmaddwd mm1, mm4 – packed multiply and add: the 4 word products are summed pairwise into 2 doublewords (a3*b3+a2*b2, a1*b1+a0*b0)  Note: this instruction does not have a saturation option.

58 ® * Other brands and names may be claimed as the property of others. 58 MMX Convert Operations  Unpack, interleaved merge: punpcklwd mm0, mm1 – unpack low words into doublewords; punpckhwd mm0, mm1 – unpack high words into doublewords  Zero-extend from small data elements to bigger data elements by using the unpack instruction with zeros in one of the operands.

59 ® * Other brands and names may be claimed as the property of others. 59 MMX Convert Operations  Pack: packusdw mm0, mm1 – pack (signed) doublewords into words with unsigned saturation

60 ® * Other brands and names may be claimed as the property of others. 60 MMX Shift Operations  psllw MM0, 8 – packed shift left logical, words: each word shifted by 8 (e.g. 81DBh, 007Fh, 703Fh, DF00h becomes DB00h, 7F00h, 3F00h, 0000h)  psllq MM0, 8 – packed shift left logical, quadword: the whole 64-bit value shifted by 8 (e.g. 703F0000FFD94364h becomes 3F0000FFD9436400h)

61 ® * Other brands and names may be claimed as the property of others. 61 MMX Compare Operations  pcmpgtw – compare packed words for greater-than; generates a mask (all ones where the destination word is greater, else zero)

62 ® * Other brands and names may be claimed as the property of others. 62 SSE Registers  Streaming SIMD Extension registers (128-bit integer): eight 128-bit registers, referred to as XMM0–XMM7; single-precision / double-precision / 128-bit integer data arrays only; direct access to registers; usable simultaneously with FP / MMX™ Technology  IA-INT registers: fourteen 32-bit registers (EAX … EDI); direct register access; scalar data only  MMX™ Technology / IA-FP registers: eight 64-bit registers xor eight 80-bit FP regs (FP0/MM0 … FP7/MM7); direct access to regs; FP data / data arrays; x87 remains aliased with the SIMD integer registers, so a context switch is required

63 ® * Other brands and names may be claimed as the property of others. 63 SSE Arithmetic Operations  Full precision: ADD, SUB, MUL, DIV, SQRT – floating point (packed/scalar), full 23-bit precision  Approximate precision: RCP (reciprocal), RSQRT (reciprocal square root) – perspective correction / projection, vector normalization; very fast; return at least 11 bits of precision

64 ® * Other brands and names may be claimed as the property of others. 64 SSE Arithmetic Operations  MULPS: Multiply Packed Single-FP  mulps xmm1, xmm2 – each of the four SP elements of xmm1 is multiplied by the corresponding element of xmm2 (X4*Y4, X3*Y3, X2*Y2, X1*Y1)

65 ® * Other brands and names may be claimed as the property of others. SSE Compare Operation  CMPPS: Compare Packed Single-FP  cmpps xmm0, xmm1, 1 – predicate 1 is "less than"; each element of xmm0 is set to all ones where xmm0 < xmm1, else to all zeros
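The mulps and cmpps operations from the last two slides, expressed with SSE intrinsics (xmmintrin.h); the element values are arbitrary assumptions:

    #include <xmmintrin.h>   // SSE intrinsics (__m128)

    void sse_mul_cmp_demo() {
        __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 y = _mm_set_ps(0.5f, 2.0f, 2.0f, 8.0f);

        __m128 prod = _mm_mul_ps(x, y);     // mulps: four element-wise products
        __m128 mask = _mm_cmplt_ps(x, y);   // cmpps ..., 1: all ones where x < y, else zero

        (void)prod; (void)mask;
    }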

66 ® * Other brands and names may be claimed as the property of others. 66 SSE2 Registers (the same register file as SSE)  Streaming SIMD Extension registers (scalar / packed SIMD-SP, SIMD-DP, 128-bit integer): eight 128-bit registers, referred to as XMM0–XMM7; data arrays only; direct access to registers; usable simultaneously with FP / MMX™ Technology  IA-INT registers: fourteen 32-bit registers (EAX … EDI); direct register access; scalar data only  MMX™ Technology / IA-FP registers: eight 64-bit registers xor eight 80-bit FP regs (FP0/MM0 … FP7/MM7); direct access to regs; FP data / data arrays; x87 remains aliased with the SIMD integer registers, so a context switch is required

67 ® * Other brands and names may be claimed as the property of others. 67 SSE2 Register Use  Backward compatible with all existing MMX™ & SSE code  New 64-bit double-precision floating point instructions  New / enhanced 128-bit wide SIMD integer – superset of the MMX™ technology instruction set  Cache management (memory streaming / prefetch)  No forced context switching on SSE registers (unlike MMX™/x87 registers)  [Table: instruction types – 64-bit SIMD int (4x16, 8x8), single-precision SIMD FP (4x32), and standard x87 (SP, DP, EP) on both the Pentium® III and Willamette processors; double-precision SIMD FP (2x64) and 128-bit SIMD int (8x16, 16x8) on the Willamette processor only]

68 ® * Other brands and names may be claimed as the property of others. 68 Willamette New Instructions  New Instructions  Extended SIMD Integer Instructions  New SIMD Double-precision FP Instructions  New Cacheability Instructions  Fully Integrated into Intel Architecture –Use previously reserved opcodes –Same addressing modes as MMX™ / SSE ops –Several MMX™ / SSE mnemonics are repeated –New Extended SIMD functionality is obtained by specifying 128-bit registers (xmm0-xmm7) as src/dst.

69 ® * Other brands and names may be claimed as the property of others. 69 SIMD Double-Precision FP Ops  Same instruction categories as the SIMD single-precision FP instructions  Operate on both elements of packed data in parallel -> SIMD  Some instructions have scalar or packed versions  IEEE 754 compliant FP arithmetic –Not bit exact with x87: 80-bit internal vs. 64-bit memory format  Usable in all modes: real, virtual x86, SMM, and protected (16-bit & 32-bit)  [Format: two 64-bit elements X2 | X1 per register, each with sign, exponent, significand; scalar ops use only X1]

70 ® * Other brands and names may be claimed as the property of others. 70 FP Instruction Syntax  Arithmetic FP instructions can be: –Packed or Scalar –Single-Precision or Double-Precision  ASM / Intrinsics:  addps / _mm_add_ps() – Add Packed Single  addpd / _mm_add_pd() – Add Packed Double  addss / _mm_add_ss() – Add Scalar Single  addsd / _mm_add_sd() – Add Scalar Double

71 ® * Other brands and names may be claimed as the property of others. 71 New SSE2 Data Types  Packed & scalar FP instructions operate on packed single- or double-precision floating point elements –Packed instructions operate on 4 (sp) or 2 (dp) floats –Scalar instructions operate only on the right-most field  addps: X4opY4, X3opY3, X2opY2, X1opY1  addpd: X2opY2, X1opY1  addss: upper three elements unchanged, lowest = X1opY1  addsd: upper element unchanged, lowest = X1opY1
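A sketch of the four add flavors via their intrinsics (emmintrin.h, which covers both SSE and SSE2); the data values are illustrative:

    #include <emmintrin.h>   // SSE2 intrinsics (includes SSE)

    void add_flavors() {
        __m128  s4 = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128  t4 = _mm_set1_ps(10.0f);
        __m128d d2 = _mm_set_pd(2.0, 1.0);
        __m128d e2 = _mm_set1_pd(100.0);

        __m128  ps = _mm_add_ps(s4, t4);   // addps: four single-precision adds
        __m128d pd = _mm_add_pd(d2, e2);   // addpd: two double-precision adds
        __m128  ss = _mm_add_ss(s4, t4);   // addss: lowest element only, upper three from s4
        __m128d sd = _mm_add_sd(d2, e2);   // addsd: lowest element only, upper one from d2

        (void)ps; (void)pd; (void)ss; (void)sd;
    }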

72 ® * Other brands and names may be claimed as the property of others. Extended SIMD Integer Ops  All MMX™/SSE integer instructions operate on 128-bit wide data in XMM registers  Additionally, some new functionality –MOVDQA, MOVDQU: 128-bit aligned/unaligned moves –PADDQ, PSUBQ: 64-bit Add/Subtract for mm & xmm regs –PMULUDQ: Packed 32 * 32 bit Multiply –PSLLDQ, PSRLDQ: 128-bit byte-wise Shifts –PSHUFD: Shuffle four double-words in xmm register –PSHUFL/HW: Shuffle four words in upper/lower half of xmm reg –PUNPCKL/HQDQ: Interleave upper/lower quadwords –Full 128-bit Conversions: 4 Ints vs. 4 SP Floats
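A brief sketch of a few of the 128-bit integer operations listed above, through their intrinsics; the pointer and data are assumptions:

    #include <emmintrin.h>
    #include <cstdint>

    void sse2_int_demo(const uint32_t* p) {      // at least 4 dwords, any alignment
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p)); // movdqu
        __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));        // pshufd: reverse the dwords
        __m128i q = _mm_add_epi64(v, r);                                  // paddq: two 64-bit adds
        (void)q;
    }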

73 ® * Other brands and names may be claimed as the property of others. 73  New 128-bit data-types for fixed-point integer data –16 Packed bytes –8 Packed words –4 Packed doublewords –2 Quadwords New SIMD Integer Data Formats

74 ® * Other brands and names may be claimed as the property of others. 74 New DP Instruction Categories  Computation: ADD, SUB, MUL, DIV, SQRT, MAX, MIN – full 52-bit precision mantissa (packed & scalar)  Logic: AND, ANDN, OR, XOR – operate uniformly on the entire 128-bit register – must use DP instructions for double-precision data  Data formatting: MOVAPD, MOVUPD – 128-bit DP moves (aligned/unaligned); MOVH/LPD, MOVSD – 64-bit DP moves; SHUFPD – shuffle packed doubles, select data using a 2-bit immediate operand

75 ® * Other brands and names may be claimed as the property of others. 75 DP Packed & Scalar Operations  The new packed & scalar FP instructions operate on packed double-precision floating point elements –Packed instructions operate on 2 numbers: addpd gives X2opY2, X1opY1 –Scalar instructions operate on the least-significant number: addsd leaves the upper element unchanged and gives X1opY1 in the lower element

76 ® * Other brands and names may be claimed as the property of others. 76 SHUFPD: Shuffle Packed Double-FP  With XMM1 = {x1, x2} and XMM2 = {y1, y2} (low element first):  SHUFPD XMM1, XMM2, 3 // binary 11: XMM1 becomes {x2, y2}  SHUFPD XMM1, XMM2, 2 // binary 10: XMM1 becomes {x1, y2}
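The same shuffle written with the _mm_shuffle_pd intrinsic; the register contents follow the slide (x1/x2 in XMM1, y1/y2 in XMM2, low element listed first) and the numeric values are arbitrary:

    #include <emmintrin.h>

    void shufpd_demo() {
        __m128d a = _mm_set_pd(2.0 /* x2 */, 1.0 /* x1 */);
        __m128d b = _mm_set_pd(20.0 /* y2 */, 10.0 /* y1 */);

        __m128d r3 = _mm_shuffle_pd(a, b, 3);  // imm = 11b: result is { x2, y2 }
        __m128d r2 = _mm_shuffle_pd(a, b, 2);  // imm = 10b: result is { x1, y2 }
        (void)r3; (void)r2;
    }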

77 ® * Other brands and names may be claimed as the property of others. 77 New DP Instruction Categories, Cont'd  Branching: CMPPD, CMPSD – compare & mask (packed/scalar); COMISD – scalar compare and set status flags; MOVMSKPD – store a 2-bit mask of the DP sign bits in a reg32  Type conversion: CVT – convert DP to SP & 32-bit integer w/ rounding (packed/scalar); CVTT – convert DP to 32-bit integer w/ truncation (packed/scalar)

78 ® * Other brands and names may be claimed as the property of others. 78 Compare & Mask Operation  CMPPD: Compare Packed Double-FP  CMPPD XMM0, XMM1, 1 // predicate 1 = less than  Each DP element of XMM0 is set to all ones where XMM0 < XMM1, else to all zeros

79 ® * Other brands and names may be claimed as the property of others. 79 Cache Enhancements  On-die trace cache for decoded uops (TC) –Holds 12K uops  8K on-die, 1 st level data cache (L1) –64-byte line size –Pentium Pro was 32 bytes –Ultrafast, multiple accesses per instruction  256K on-die, 2 nd level write-back, unified data and instruction cache (L2) –128-byte line size –operates at full processor clock frequency  PREFETCH instructions return 128 bytes to L2 Faster

80 ® * Other brands and names may be claimed as the property of others. 80 New Cacheability Instructions  MMX™/SSE cacheability instructions preserved  New Functionality: –CLFLUSH: Cache line flush –LFENCE / MFENCE: Load Fence / Memory Fence –PAUSE: Pause execution –MASKMOVDQU: Mask move 128-bit integer data –MOVNTPD: Streaming store with 2 64-bit DP FP data –MOVNTDQ: Streaming store with 128-bit integer data –MOVNTI: Streaming store with 32-bit integer data

81 ® * Other brands and names may be claimed as the property of others. 81 Streaming Stores  Willamette implementation supports: –Writing to an uncacheable buffer (e.g. AGP) with full line-writes –Re-reading the same buffer with full line-reads –New in WNI, compared to Katmai/CuMine  Integer streaming store –Operates on integer registers (i.e., EAX, EBX) –Useful for the OS, by avoiding the need to save FP state; just move raw bits
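A small sketch of one way to use the streaming (non-temporal) stores listed above; dst is assumed to be 16-byte aligned and not needed in the caches again soon:

    #include <emmintrin.h>
    #include <cstddef>

    void stream_fill(__m128i* dst, std::size_t n128, __m128i value) {
        for (std::size_t i = 0; i < n128; ++i)
            _mm_stream_si128(dst + i, value);   // movntdq: write-combining store, bypasses the caches
        _mm_sfence();                           // make the WC stores globally visible before later stores
    }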

82 ® * Other brands and names may be claimed as the property of others. 82 Detail: Cache Line Flush  CLFLUSH: the cache line containing m8 is flushed and invalidated from all caches in the coherency domain  Linear-address based; allowed by user code  Potential usage: –Allows incoherent (AGP) I/O data to be mapped as WB for high read performance and flushed when updated –Example: video encode stream –Precise control of dirty-data eviction may increase performance by exploiting idle memory cycles
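A hedged sketch of the video-encode style usage: flush the cache line holding a just-written buffer so an incoherent consumer sees current data. The buffer pointer and the trailing fence are assumptions; mfence is used here conservatively to order the flush against later memory operations.

    #include <emmintrin.h>

    void publish_line(const void* line_in_wb_buffer) {
        _mm_clflush(line_in_wb_buffer);  // clflush: evict and invalidate the line across the coherency domain
        _mm_mfence();                    // order the flush before subsequent memory operations
    }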

83 ® * Other brands and names may be claimed as the property of others. 83 Detail: Fences  Capabilities introduced over time to enable software managed coherence: –Write combining with the Pentium Pro processor –SFence and memory streaming with Streaming SIMD Extensions  New Willamette Fences completes the tool set to enable full software coherence management –LFence, strong load order –Blocks younger loads from passing a prior load instruction –All loads preceding an LFence will be completed before loads coming after the LFence –MFence –Achieves effect of LFence and SFence instructions executed at same time –Necessary, as issuing an SFence instruction followed by an LFence instruction does not prevent a load from passing a prior store

84 ® * Other brands and names may be claimed as the property of others. 84 Pause Instruction  PAUSE is architecturally a NOP on all IA-32 processor generations  Usable since Willamette! Not necessary to check the processor type.  PAUSE is a hint to the processor that the code is a spin-wait or other non-performance-critical code. A processor that uses the hint can: –Significantly improve the performance of spin-wait loops without negative performance impact, by inserting an implementation-dependent delay that helps processors with dynamic execution (a.k.a. out-of-order execution) exit the spin-loop faster –Significantly reduce power consumption during spin-wait loops

85 ® * Other brands and names may be claimed as the property of others. 85 NetBurst TM µArchitecture Overview  [Block diagram: front end (fetch/decode, trace cache, microcode ROM, BTBs/branch prediction), out-of-order execution core, retirement, 1st-level data cache (4-way), 2nd-level cache (8-way), bus unit, system bus; frequently used vs. less frequently used paths]

86 NetBurst TM µArchitecture  [Block diagram: front end with BTB & I-TLB, decoder, trace cache, µcode ROM, BTB, rename/alloc and µop queues; out-of-order core with schedulers, integer and FP register files, execution units (2x double-pumped ALUs, load AGU, store AGU, FP move, FP store, FMul, FAdd, MMX, SSE); L1 D-cache and D-TLB, L2 cache and control, 3.2 GB/s system interface]

87 ® * Other brands and names may be claimed as the property of others. 87 NetBurst TM µArchitecture Summary  Quad Pumps bus to keep the Caches loaded  Stores most recent instructions as µops in TC to enhance instruction issue  Improves Program Execution –Issues up to 3 µops per Clock –Dispatches up to 6 µops to Execution Units per clock –Retires up to 3 µops per clock  Feeds back branch and data information to have required instructions and data available

88 ® * Other brands and names may be claimed as the property of others. 88 What is Hyperthreading?  Ability of processor to run multiple threads –Duplicate architecture state creates illusion to SW of Dual Processor (DP) –Execution unit shared between two threads, but dedicated if one stalls  Effect of Hyperthreading on Xeon Processor: –CPU utilization increases to 50% (from ~35%) –About 30% performance gain for some applications with the same processor frequency Hyperthreading Technology Results: 1. More performance with enabled applications 2. Better responsiveness with existing applications

89 ® * Other brands and names may be claimed as the property of others. 89 Hyperthreading Implementation  Almost two Logical Processors  Architecture state (registers) and APIC duplicated  Share execution units, caches, branch prediction, control logic and buses Processor Execution Resource Adv. Programmable Interrupt Control Architecture State Adv. Programmable Interrupt Control Architecture State On-Die Caches System Bus *APIC: Advanced Programmable Interrupt Controller. Handles interrupts sent to a specified logical processor

90 ® * Other brands and names may be claimed as the property of others. 90 Benefits to Xeon™ Processor Hyperthreading Technology Performance for Dual Processor Servers Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit or call (U.S.) or Source: Veritest (Sep, 2002). Comparisons based on Intel internal measurements w/pre- production hardware 1) HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v MB EDO RAM- Dell PowerVault 210S disk array. 2) HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v MB EDO RAM- Dell PowerVault 210S disk array.  Enhancements in bandwidth, throughput and thread-level parallelism with Hyperthreading Technology deliver an acceleration of performance Hyper Threading Technology Performance Gains Intel® Xeon™ processor 2.8GHz with 512KB cache, Microsoft Windows* 2000 Hyperthreading Technology increases performance by ~20% on Some Server Applications

91 ® * Other brands and names may be claimed as the property of others. 91 Hyperthreading for Workstation Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit or call (U.S.) or Source: Intel Corporation. With and without Hyperthreading Technology on the following system configuration: Intel Xeon Processor 2.80 GHz/533 MHz system bus with 512KB L2 cache, Intel® E7505 chipset-based Pre-Release platform, 1GB PC2100 DDR CL2 CAS2-2-2, (2) 18GB Seagate* Cheetah ST318452LW 15K Ultra160 SCSI hard drive using Adaptec SCSI adapter BIOS , nVidia* Quadro4 Pro 980XGL 128MB AGP 8x graphics card with driver version 40.52, Windows XP* Professional build Intel® Xeon™ processor 2.8GHz with 512KB cache Hyperthreading Technology performance gains Performance gains whether running: Performance gains whether running:  Multiple tasks within one application  Multiple applications running at once Multi- Threaded Applicatio n CHARMm*3DSM*5D2cluster*BLAST* Lightwave 3D*75 Multi- Tasking Applicatio n Patran* + Nastran * Multiple Compiles 3ds max* + Photoshop Compile + Regressio n Maya* multiple renderings + Animation Hyperthreading Technology increases performance by % on Workstation Applications

92 ® * Other brands and names may be claimed as the property of others. 92 Hyperthreading Resources  Shared – each logical processor can use, evict or allocate any part of the resource. Examples: cache, WC buffers, VTune registers, MS-ROM  Duplicated – each logical processor has its own set of resources. Examples: APIC, registers, TSC, IP  Split – resources are hard-partitioned in half. Examples: load/store buffers, ITLB, ROB, IAQ  Tagged – resource entries are tagged with the logical processor ID. Examples: trace cache, DTLB

93 ® * Other brands and names may be claimed as the property of others. 93 Xeon Processor Pipeline Simplified  Buffering queues separate the major pipeline logic blocks (Fetch, Decode, TC/MSROM, Rename/Allocate, OOO Execute, Retirement)  Buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block

94 ® * Other brands and names may be claimed as the property of others. 94 HT in NetBurst  Front End –Execution Trace Cache –Microcode Store ROM (MSROM) –ITLB and Branch Prediction –IA-32 Instruction Decode –Micro-op Queue  [Block diagram: bus unit, optional 3rd-level cache (server product), 2nd-level cache, 4-way 1st-level cache, fetch/decode, trace cache, MS ROM, BTBs/branch prediction, OOO execution, retirement, system bus]

95 ® * Other brands and names may be claimed as the property of others. 95 Front End  Responsible for delivering instruction to the later pipe stages  Trace Cache Hit –When the requested instruction trace is present in trace cache  Trace cache miss –Requested instruction is brought in the trace cache from L2 cache

96 ® * Other brands and names may be claimed as the property of others. 96 Trace Cache Hit Front End  Two separate instruction pointers  Two logical processors arbitrate for access to the TC each cycle  If one logical processor stalls, the other uses the full bandwidth of the TC (IP -> Trace Cache -> Micro-Op Queue)

97 ® * Other brands and names may be claimed as the property of others. 97 Programming Models  Two major types of parallel programming models –Domain decomposition –Functional decomposition  Domain decomposition –Multiple threads working on subsets of the data  Functional decomposition –Different computation on the same data –E.g. motion estimation vs. color conversion, etc.  Both models can be implemented on HT processors
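A minimal OpenMP sketch of the two models; the function bodies and data are placeholders, not code from the course:

    #include <omp.h>

    // Domain decomposition: each thread works on a subset of the same data.
    void scale_frame(float* frame, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            frame[i] *= 0.5f;
    }

    // Functional decomposition: different computations on the same data run
    // as separate sections (e.g. motion estimation vs. color conversion).
    void process_frame(const float* frame, int n) {
        (void)frame; (void)n;
        #pragma omp parallel sections
        {
            #pragma omp section
            { /* motion estimation over frame[0..n) */ }
            #pragma omp section
            { /* color conversion over frame[0..n) */ }
        }
    }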

98 ® * Other brands and names may be claimed as the property of others. 98 Threading Implementation  O/S thread implementations may differ  Microsoft Win32 –NT threads (supports 1-1 O/S level threading) –Fibers (supports M-N user level threading)  Linux –Native Linux Thread (severely broken & inefficient) –IBM Next Generation Posix Threads (NGPT) – IBM’s attempt to fix Linux native thread –Redhat Native Posix Thread Model for Linux (NPTL) - supports 1-1 O/S level threading that is to be Posix compliant  Others –Pthreads (generic Posix compliant thread) –Sun Solaris Light Weight Processes (lwp), Sun Solaris user level threads Thread Model Issues Somewhat Orthogonal to HT

99 ® * Other brands and names may be claimed as the property of others. 99 OS Implications of HT  UP OS / Legacy MP OS: backward compatible, will not take advantage of Hyperthreading  Enabled MP OS: OS with basic Hyperthreading Technology functionality  Optimized MP OS: OS with optimized Hyperthreading Technology support  Fully compatible with ALL existing O/S… but only an optimized O/S enables the most benefits

100 ® * Other brands and names may be claimed as the property of others. 100 HT Optimized OS  Windows XP –Windows XP –Windows XP Professional  Windows 2003 –Enterprise –Data Center  Enabled –RedHat Enterprise Server (version 7.3, 8.0) –RedHat Advanced Server 2.1 –Suse (8.0, 9.0)

101 ® * Other brands and names may be claimed as the property of others. 101 OS Scheduling  An HT-enabled O/S sees two logical processors for each HT physical processor –Enumerates the first logical processor of all physical processors first (via CPUID)  Schedules processors almost the same as regular SMP –Thread priority determines the schedule, but CPU dispatch matters –The O/S independently submits a code stream for a thread to each logical processor and can independently interrupt or halt each logical processor (no change)

102 ® * Other brands and names may be claimed as the property of others. 102 Thread Management  Avoid coding practices that disable hyperthreaded processors, e.g. –Avoid 64KB aliasing –Avoid processor serializing events (e.g. FP denormals, self-modifying code, etc.)  Avoid spin locks –Minimize lock contention to less than two threads per lock –Use "pause" and O/S synchronization when spin-wait loops must be implemented  In addition, follow multi-threading best practices: –Use O/S services to block waiting threads –Spin as briefly as possible before yielding to the O/S –Avoid false sharing –Avoid unintended synchronizations (C Runtime, C++ Template Library implementations)

103 ® * Other brands and names may be claimed as the property of others. 103 Threading Tools  Intel ThreadChecker Tool –Itemization of parallelization bugs and source –ThreadChecker class  OpenMP –Thread model in which programmer introduces parallelism or threading via directives or pragmas  Intel Vtune Analyzer –Provides analysis and drills down to source code –ThreadChecker Integration  GuideView –Parallel performance tuning

104 ® * Other brands and names may be claimed as the property of others. 104 Software Tools  Intel C/C++ Compiler –Support for SSE and SSE2 using C++ classes, intrinsics, and assembly –Improved vectorization and prefetch insertion –Profile-guided optimizations –G7 compiler switch for Pentium® 4 optimizations  Register Viewing Tool (RVT) –Shows contents of XMM registers as they are updated –Plugs into Microsoft* Visual Studio*  Microsoft* Visual Studio* 6.0 Processor Pack* –Support for SSE and SSE2 instructions, including intrinsics –Available for free download from Microsoft*  Microsoft* Visual Studio* .NET –Provides improved support for Intel® NetBurst™ micro-architecture –Recognizes XMM registers

105 ® * Other brands and names may be claimed as the property of others. 105 Hyperthreading is NOT:  Hyperthreading is not a full, dual-core processor  Hyperthreading does not deliver multi-processor scaling  [Diagram: dual processor (two dies, each with its own core, APIC, architecture state and on-die cache) vs. dual core (two cores on one die) vs. Hyper-Threading (one core with duplicated APIC and architecture state sharing one on-die cache)]

106 ® * Other brands and names may be claimed as the property of others. 106 Backup

107 ® * Other brands and names may be claimed as the property of others. 107 TERMS  Branch: transfer of control to an address different from the next instruction. Unconditional or conditional.  Branch Prediction: ability to guess the target of a conditional branch. Can be wrong, in which case we have a mispredict.  CISC: complex instruction set computer  Compiler: tool translating high-level instructions into low-level machine instructions. Output can be asm source (ASCII) or binary machine code.  EPIC (Explicitly Parallel Instruction Computing): new architecture jointly defined by Intel ® and HP. It is the foundation of the new 64-bit Instruction Set Architecture

108 ® * Other brands and names may be claimed as the property of others. 108 TERMS  Explicit parallelism: Intended ability of two tasks to be executed by design (explicitly) at the same time. Task can be as simple as an instruction, or as complex as a complete program.  Implicit parallelism: Incidental ability of two or more tasks to be executed at the same time. Example: sequence of integer add and FP convert instructions without common registers or memory addresses, executed on a target machine that happens to have respective HW modules available.

109 ® * Other brands and names may be claimed as the property of others. 109 TERMS  Instruction Set Architecture (ISA): architecturally visible instructions that perform software functions and direct operations within the processor. HP and Intel ® jointly developed a new 64-bit ISA. This ISA integrates technical concepts from the EPIC technology.  Memory latency: time to move data from memory to the processor, at the request of the processor.  Mispredict: a wrong guess of where the new flow of control will continue as a result of a branch (or similar control flow instruction).

