Chapter 6 Intel® x86 Microprocessor Architecture


1 Chapter 6 Intel® x86 Microprocessor Architecture
ECE 485/585 Microprocessors. Derived from Dr. Herbert G. Mayer's 2003 presentation to Intel's Software College. Status: 9/1/2016

2 Agenda
Assumptions
Speed Limitations
x86 Architecture Progression
Architecture Enhancements
Intel® x86 Architectures

3 Assumptions
Audience understands x86 architecture, RISC, CISC
Knows some assembly language. Flavor used here: GNU assembler gas, where the result is the right-hand-side operand:
mov [temp], %eax   ; is a load into register eax
add %eax, %ebx     ; new integer sum is in register ebx
Different from Microsoft* masm and tasm
Understands some architectural concepts:
Caches, multi-level caches, (some MESI)
Threading, multi-threaded code
Blocking (cache), blocking (aka tiling), blocking (thread synch.)
Causes of pipeline stalls: control flow change; data dependence (registers and data)
NOT discussed: VTune

4 Speed Limitations

5 Agenda
Performance Limiters
Register Starvation
Processor-Memory Gap
Processor Stalls
Store Forwarding
Misc Limitations: Spin-Lock in Multi Thread, Misaligned Data, Denorm Floats

6 Performance Limiters
Architectural limitations the programmer or compiler can overcome:
Indirect limitations: stalls via branch, call, return
Incidental limits: resource constraints
Historical limits: the register-starved x86
Technological limits: ALU speed vs. memory access speed
Logical limits: data and resource dependences

7 Register Starvation
How many regs are needed (by compiler or programmer)?
Infinite is perfect; 1024 is very good; 64 acceptable; 16 is crummy; 4+4 is x86; 1 is SAA (single-accumulator architecture)
Formally on x86: 16 regs. Quick test: ax, bx, cx, dx, si, di, bp, sp, ip, cs, ds, ss, es, fs, gs, flags
Of which ax, bx, cx, dx are GPRs, almost; the rest can be used as better temps
ax & dx are used for * and /, cx for the loop counter
Immaterial whether you use the extended size (32-bit data size), with nomenclature eax, ebx, ecx, etc., or the 16-bit data size, with nomenclature ax, bx, cx, dx, etc.
Note that register starvation is made worse by the few GPRs being reserved for integer multiply, divide, and the loop counter

8 Register Starvation
Absence of regs causes:
Spurious memory spills and loads
False data dependences -- not "dependencies" :-)
Except for the single-accumulator architecture, no other arch is more register starved than x86
Instruction stream with added spill ops and memory latency:
mov %eax, [tmp]    ; spill -- added op
add %ebx, %eax
imul %ecx
mov %eax, [prod]
mov [tmp], %eax    ; reload -- mem latency
Instruction stream with a false data dependence through memory:
mov %eax, [mem1]
use stuff, %eax
mov [mem1], %eax   ; false DD on mem1

9 And the Programmer?
No solution in the ISA; x86 has had 4 GPRs since the 8086
Improved via internal register renaming: the Pentium® Pro has hundreds of internal regs
Added registers in MMX, visible to you, the programmer, and the compiler: fp(0)..fp(7), 80 bits as FP, 64 bits as MMX -- but note: context switch
Added registers in SSE: xmm(0)..xmm(7), 128 bits
Note: MMX came with the Pentium® II, SSE (aka SSE1) with the Pentium III, and additional instructions for SSE with the Pentium® 4

10 Processor-Memory Gap
[Chart, 1980-2002: CPU performance ("Moore's Law") grows ~60%/yr while DRAM grows ~7%/yr; the processor-memory performance gap grows ~50% per year. Source: David Patterson, UC Berkeley]

11 Bridging the Gap: Trend
[Chart: CPU vs. DRAM performance over time; caches, multilevel caches, out-of-order execution (Intel® Pentium II processor, ~30%), and Hyperthreading Technology (Intel® Xeon™ processor, ~30%) narrow the gap]
Memory latency is the driving force for architecture change
Two approaches to fix it: out-of-order execution (instruction level) and hyperthreading technology (thread level); HT is just the first step
Hyperthreading Technology: feeds two threads to exploit shared execution units

12 Impact of Memory Latency
Memory speed has NOT kept up with advances in processor speed
Avg. integer add ~0.16 ns (Xeon), but memory accesses take ~10 ns or more
CPU hardware resource utilization is only ~35% on average, limited by memory stalls and dependences
Possible solutions to the memory speed mismatch?
The memory speed mismatch is a major source of CPU stalls

13 And the Programmer?
Cache is provided, plus methods to manipulate the cache
Tools are provided to prefetch data, at the risk of a superfluous fetch if control flow changes (see the sketch below)
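A minimal sketch of software prefetch via the SSE intrinsic _mm_prefetch (xmmintrin.h); the array, loop, and prefetch distance are illustrative assumptions, and the fetched line is wasted if control flow leaves the loop early:

    #include <xmmintrin.h>

    float sum_with_prefetch(const float* a, int n) {
        const int DIST = 16;                 // illustrative tuning parameter
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            if (i + DIST < n)                // request a future line early
                _mm_prefetch((const char*)&a[i + DIST], _MM_HINT_T0);
            sum += a[i];                     // prefetch hides part of the latency
        }
        return sum;
    }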

14 Processor Stalls
A stalled cycle is a cycle in which the processor cannot receive or schedule new instructions
Total Cycles = Total Stall Cycles + Productive Cycles
Stalls waste processor cycles
Perfmon, Linux ps, top, and other system tools show stalled cycles as busy CPU cycles
The Intel® VTune Analyzer is used to monitor stalls (HP* PFMon)

15 Why Stalls Occur
Stalls occur because:
An instruction needs a resource that is not available
Dependences (control or data) exist between instructions
The processor / instruction waits for some signal or event
Sample resource limitations: registers, execution ports, execution units, load/store ports, internal buffers (ROBs, WOBs, etc.)
Sample events: exceptions, cache misses, TLB misses, etc.
What they have in common: they hold up compute progress

16 Control Dependences (CD)
A change in the flow of control causes stalls
Processors handle control dependences:
Via branch prediction hardware
Via conditional move, to avoid the branch & pipeline stall (a hedged sketch follows below)
Instruction streams, each with a prediction barrier at the control transfer:
mov [%ebp+8], %eax
cmp 1, %eax
jg bigger           ; barrier (predict)
mov 1, %eax
. . .
bigger:
dec %ecx
push %eax
call rfact          ; barrier (predict)
mov %ecx, [%ebp+8]
mul %ecx
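As a hedged illustration of the conditional-move idea (not code from the deck), this C++ fragment uses GCC/ICC-style inline assembly; the function name and operands are invented:

    // Branchless maximum: cmovl replaces a jg/jle pair, so no prediction
    // barrier is created and a mispredict penalty cannot occur.
    int branchless_max(int a, int b) {
        int result = a;
        __asm__("cmpl %1, %0\n\t"   // flags from result - b
                "cmovl %1, %0"      // if result < b, result = b
                : "+r"(result)
                : "r"(b)
                : "cc");
        return result;
    }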

17 Data Dependences (DD)
Data dependence limits performance; the programmer / compiler cannot always solve it
Xeon has register renaming to avoid false data dependences, and supports out-of-order execution to hide the effects of dependences
Instruction stream with memory latency and a false DD through [temp]:
mov eax, [ebp+8]    ; mem latency
cmp eax, 1
. . .
bigger:
mov [temp], eax
add eax, ebx
mul ecx
mov [prod], eax
mov eax, [temp]     ; false DD on temp

18 Xeon Processor Stalls
D-side (back end):
DTLB misses
Memory hierarchy: L1, L2 and L3 misses
Core:
Store buffer stalls
Load/store splits
Store forwarding hazards
Loading partial/misaligned data
Branch mispredicts
I-side (front end):
Streaming buffer misses
ITLB misses
TC misses
64K aliasing conflicts
Misc:
Machine clears

19 And the Programmer?
Reduce processor stalls by prefetching data
Reduce control flow changes by conditional move
Reduce false dependences by using register temps from the MMX (FP) and XMM pool (see the sketch below)
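One hedged way to use the xmm pool as extra integer temps, as the slide suggests; the intrinsics are real SSE2 (emmintrin.h), but the scenario and name are illustrative:

    #include <emmintrin.h>

    int keep_in_xmm(int live_value) {
        __m128i stash = _mm_cvtsi32_si128(live_value);  // movd reg32 -> xmm
        // ... register-hungry computation may now reuse eax/ebx/ecx/edx
        // without spilling live_value (no false DD on a stack spill slot)
        return _mm_cvtsi128_si32(stash);                // movd xmm -> reg32
    }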

20 Partial Writes: WC Buffers
Causes: 1) too many WC streams; 2) WB loads/stores contending for fill buffers to access the L2 cache or memory
An incomplete WC buffer produces several 8B "partial" bus transactions to the FSB/memory; a complete WC buffer produces one full-line bus transaction
Detection (VTune), via event-based sampling: Ext. Bus Partial Write Trans., L2 Cache Request, Ext. Bus Burst Read Trans., Ext. Bus RFO Trans.
Partial writes reduce the usable front-side bus bandwidth: ~3x lower for the PIII, ~7x lower for the Pentium 4 processor due to its longer cache line

21 Store Forwarding Guidelines
Store forwarding: loading from an address recently stored can deliver the data more quickly than via a memory access. There is a large penalty for non-forwarding cases.
Will forward:
Load aligned with the store, same address and size
Load contained in a single prior store
Forwarding penalty:
Load overlapping but misaligned with the store
Load spanning two stores (A and B), i.e. not contained in a single store
128-bit forwards must be 16-byte aligned (16-byte boundaries)
MSVC < 7.0 will generate these penalty cases; the Intel Compiler doesn't.

22 And the Programmer?
Pick the right compiler for HLL programs; use VTune to check asm code
In asm programs, ensure loads after stores are:
Contained in the stored data (subset or proper subset) of a single previous store, not in the sum of multiple stores
Thus do store-combining: assemble the data together, then store once (see the sketch below)
Or have both load and store data start on the same address
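A hedged C++ sketch of the store-combining advice; variable names are invented, and the compiler is assumed not to optimize the bad case away:

    #include <cstdint>
    #include <cstring>

    // Bad: a 32-bit load spanning two 16-bit stores cannot forward from the
    // store buffer; the load stalls until both stores retire.
    uint32_t spans_two_stores(uint16_t lo, uint16_t hi) {
        uint16_t buf[2];
        buf[0] = lo;
        buf[1] = hi;
        uint32_t v;
        std::memcpy(&v, buf, 4);     // not contained in a single prior store
        return v;
    }

    // Good: assemble in a register, store once; a later load of the same
    // address and size (or a contained subset) forwards.
    void store_combined(uint32_t* dst, uint16_t lo, uint16_t hi) {
        *dst = ((uint32_t)hi << 16) | (uint32_t)lo;
    }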

23 Misc Limitations
Spin-lock in multi-threading: don't use a busy wait just because you have (almost) a second processor for the second thread
Misaligned data: don't align data on an arbitrary boundary just because the architecture can fetch from any address
Dumb errors :-)
Failing to use the proper tool (library, compiler, performance analyzer)
Failing to use tiling (aka blocking) or SW pipelining
Denormalized floats

24 And the Programmer?
Use pause, when applicable! A new NetBurst instruction (see the sketch below)
Use compiler switches to align data on an address divisible by the size of the greatest individual data object. Who cares about wasting 7 bytes to force 8-byte alignment?
Be smart, pick the right tools
Instruct the compiler to SW pipeline. In asm, manually SW pipeline; note this is easier on EPIC than on VLIW, which sometimes lacks prologue and epilogue support
Enable the compiler to partition larger data structures into smaller suitable blocks for improved locality -- cache-parameter dependent
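A hedged sketch of a PAUSE-based spin-wait; the lock protocol and the GCC atomic builtins are assumptions for illustration, not deck content:

    #include <xmmintrin.h>

    void spin_acquire(volatile int* lock) {
        for (;;) {
            if (__sync_lock_test_and_set(lock, 1) == 0)
                return;              // atomic exchange won the lock
            while (*lock != 0)       // spin read-only to limit bus traffic
                _mm_pause();         // PAUSE hint: yields resources to the
        }                            // sibling logical CPU, reduces power
    }

    void spin_release(volatile int* lock) {
        __sync_lock_release(lock);   // store 0 with release semantics
    }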

25 And the Programmer?
Executes the first of 2 labs, this one being a "two-minute" exercise:
Turn on your computer, verify Linux is alive
Verify you have available:
An editor to modify the program
The Intel C++ compiler, text command icc, with -g
The debugger ddd, with disassembly ability
Source program vscal.cpp
Linux commands: ls, vi, icc, mkdir, etc.

26 Module Summary
Covered key causes that render execution slower than possible:
More registers are at your disposal than it seems
The von Neumann bottleneck can be softened via cache use and data prefetch
Stalls can be reduced by conditional move and by avoiding false dependences
Use (time-limited) capabilities, such as proper store forwarding
Note the new pause instruction

27 x86 Architecture Progression

28 Agenda: x86 Arch. Progression
Abstract & Objectives
x86 Nomenclature & Notation
Intel® Architecture Progress
Pentium 4 Abstract

29 Abstract & Objectives: x86 Architecture Progression
Abstract: a high-level introduction to the history and evolution of the increasingly powerful 16-bit and 32-bit x86 processors, which are backwards compatible.
Objectives: understand processor generations and architectural features, by learning:
Progressive architectural capabilities
Names of the corresponding Intel processors
Explanation and description of the capabilities
FP incompatibility (minor)

30 Non-Objectives
The objective is not an introduction to:
x86 assembly language (assumed known)
The Itanium® processor family, now in its 3rd generation
Intel tools (C++, VTune)
Performance tools: MS Perfmon, Linux ps, emon, HP PFMon, etc.
Performance benchmarks, performance counters
Differentiation of Intel vs. competitor products
CISC vs. RISC

31 x86 Nomenclature & Notation
Each entry below gives: processor name, initial launch date, final clock speed; then an architecturally visible enhancement list (can be empty); then an architectural speedup technique, invisible except for higher speed. Example:
Pentium® II, 2H98, 450 MHz
MMX, BX chipset
Dynamic branch prediction enhanced

32 Intel® Architecture Progress
8086, 2H80, 4 MHz: 8087 FP co-processor
80486, 2H85, 10 MHz: FP integrated
Pentium®, 1993, 60 MHz: D+I caches, static branch prediction
Pentium® Pro, 2H95, 100 MHz: dynamic branch prediction
Pentium® II, 2H98, 450 MHz: MMX, BX chipset; dynamic branch prediction enhanced
Pentium® III, 2H99, 733 MHz: large L2 cache on chip; SSE, XMM regs
Pentium® 4, 2H00, 3.06 GHz: SSE2, 144 WNI, NetBurst®; L3 on-chip cache

33 Intel® Pentium® 4 Processors
Northwood: Pentium® Willamette shrink. Consumer and business desktop processor. HT not enabled, though capable.
NW E-Step: Pentium with the HT errata corrected. Desktop processor.
Prescott: consumer and business desktop processor. Replaces NW. Offers 6 PNI (Prescott New Instructions). First processor with LaGrande technology (trusted computing).
Prestonia: Xeon™ DP slated for workstations and entry-level servers. Based on the NW core. HT enabled. 512 kB L2 cache, no L3. 3 GHz processor.
Nocona: Xeon DP based on the Prescott core. Targeted for 3.06 GHz. 533 MHz (quad-pumped) bus, i.e. the bus clock is 133 MHz. 1 MB L2 cache. HT enabled. About to be launched.
Foster: Xeon MP based on the Willamette core. 1 MB L3 cache, 256 kB L2, HT enabled. For higher-end servers.
Gallatin: Xeon MP based on the NW core. 1 or 2 MB L3 cache, 512 kB L2 cache. For high-end servers; see the 8-way HP DL760 and IBM x440. HT enabled.
Potomac: Xeon MP based on the Prescott core. 533 MHz (quad-pumped) bus. 1 MB L2 cache, 8 MB L3 cache. HT enabled, yet to be launched.
Note: lower clock rates for the MP versions, due to higher circuit complexity and bus load.

34 Processor Generation Comparison
Feature | Pentium® III | Pentium® III | Pentium® 4 (Willamette) | Pentium® 4 (Northwood)
Frequency | ... MHz | 600 MHz - 1.13 GHz | 1.5 GHz | 2+ GHz
L2 Cache | 512k off-die | 256k on-die | 256k on-die | 512k on-die
Execution Type | Dynamic | Dynamic | Intel® NetBurst™ µArch | Intel® NetBurst™ µArch
System Bus | 100 MHz | 133 MHz | 400 MHz (4x100 MHz) | 400/533 MHz (4x100/133 MHz)
MMX™ Technology | Yes | Yes | Yes | Yes
Streaming SIMD Extensions | Yes | Yes | Yes | Yes
Streaming SIMD Extensions 2 | No | No | Yes | Yes
Manufacturing Process | .25 micron | .18 micron | .18 micron | .13 micron
Chipset | ICH-1 | ICH-2 | ICH-2 | ICH-2
Northwood differences: chipsets based around ICH2; USB 2.0 will initially be a discrete chip on the motherboard
Northwood IS: a Pentium® 4 processor (Willamette) die shrink (.18 to .13); minor µarch tuning; higher frequency (2+ GHz); larger L2 cache (512K vs. 256K); 400/533 MHz FSB (533 available 2H 2002); MMX, SSE, SSE2 support

35 Intel® Architecture Progress
8087, the co-processor of the 8086: off-chip FP computation, extended 80-bit FP format for DP
MMX: multi-media extensions. MMX regs are aliased with the FP register stack, which needs a context switch; the FP regs are also called ST(i) regs
SSE: Streaming SIMD Extensions, already since the Pentium III
WNI: 144 new instructions, using additional data types for existing opcodes and using previously reserved opcodes

36 Intel® Architecture Progress
XMM: 8 new 128-bit registers, in addition to MMX
SSE2: multiple integer ops and multiple DP FP ops, part of the 144 WNI; the regs are unchanged in the Pentium® 4 from the PIII, only ops were added
NetBurst: generic term for HyperThreading & the quad-pumped bus & the new trace cache & etc.
Note: an architectural feature ages with the next generation but survives, due to the compatibility requirement. Hence it is interesting not only for historical reasons: you need to know it!

37 Xeon™ MP Abstract
Xeon™ MP Processor "Gallatin", with Hyperthreading Technology:
Physical addressing: 64 GB (PAE-36; 36-bit since the Pentium Pro)
Execution units: 8 integer, 1 multimedia, 2 floating point (2x ALU)
Core frequency: 2.0+ GHz
Logical CPUs: 2x
Pipeline stages: 20
Registers: (126)
Instructions / clock cycle: 3
External cache: L3, 1 or 2 MB
On-die caches: L2 512 KB (see slide 33); L1 12K-µop TC + 8K data
System bus bandwidth: 3.2 GB/s (400 MHz)

38 Xeon™ Memory Hierarchy
Note: Physical Address Extension, 36-bit PAE addresses, since the Pentium® Pro
Xeon™ Processor MP hierarchy:
TC: 12K µops
L1 (DL0): 8 KB, 64B lines, 2 clks
L2 (unified): 512 KB, 8-way, 128B lines, 7+ clks; 12.8 GB/s to L1
L3: 2 MB, 21+ clks
External memory: 64 GB, 3.2 GB/s

39 Architecture Enhancements

40 Agenda: Architecture Enhancements
Abstract & Objectives
Faster Clock
Caches: Advantage, Cost, Limitation
Multi-Level Cache-Coherence in MP
Register Renaming
Speculative, Out of Order Execution
Branch Prediction, Code Straightening

41 Abstract & Objectives: Architecture Enhancements
Abstract: outline generic techniques that overcome performance limitations
Objectives: understand the cost of architectural techniques (tricks) in terms of resources (silicon space) and of lost performance when a guess is wrong:
Caches: cost silicon, can slow down
Branch prediction: costs silicon, can be wrong
Prefetch: costs an instruction, may be superfluous
Superscalar: may not find a second op

42 Non-Objectives
The objective is not to explain the details of Intel processor architecture
Not to claim Intel invented these techniques; academia invented many
Not to show all techniques; some apply mainly to EPIC or VLIW architectures
No hype, no judgment, just the facts please!

43 Faster Clock
CISC: decompose the circuitry into multiple simple, sequential modules; the resulting modules are smaller and thus can be fast: high clock rate, shorter speed-paths
That's what we call a pipelined architecture
More modules -> simpler modules -> faster clock -> super-pipelined
Super-pipelining is NOT goodness per se:
Saves no silicon
Execution time per instruction does not improve; it may get worse, due to delay cycles
But: instructions retired per unit time improves, especially in the absence of (a large number of) control-flow stalls

44 Faster Clock: Intel® NetBurst™ µArchitecture, 20-Stage Pipeline
The Xeon™ processor pipeline has 20 stages; the beautiful model breaks upon control transfer
Textbook pipeline: I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store
NetBurst™ 20-stage pipeline: 1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

45 Intel® x86 Architectures

46 Agenda: Intel x86 Architectures
Abstract & Objectives
High Speed, Long Pipe
Multiprocessing
MMX Operations
SSE Operations
SSE2 Operations
Willamette New Instructions (WNI)
Cacheability Instructions
Pause Instruction
NetBurst, Hyperthreading
SW Tools

47 Abstract & Objectives: Intel® x86 Architectures
Abstract: emphasizing the Pentium® 4 processors, show the progressively more powerful architectural features introduced in Intel processors. Refer back to the speed problems from module 2 and the general solutions explained in module 3.
Objective: not only understand the various processor product names and supported features (Intel marketing names), but understand how they work and what their limitations and costs are.

48 Non-Objectives
The objective is not to show that Intel's techniques are the only ones, or the best possible. They are just good trade-offs in light of conflicting constraints:
Clock speed vs. a small number of pipe stages
Small transistor count vs. high performance
Large caches vs. small silicon space
Grandiose architecture vs. backward compatibility
The need for a large register file vs. the register-starved x86
The wish for two full on-die processors vs. preserving silicon space

49 High Speed, Long NetBurst™ Pipe
Basic Pentium® Pro pipeline (10 stages): Fetch, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec
Basic NetBurst™ micro-architecture pipeline (20 stages): TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Que, Sch, Disp, RF, Ex, Flgs, Br Ck, Drive
Hyper-pipelined technology enables industry-leading performance and clock rate: intro at 733 MHz on .18µ; 1.4 GHz at .18µ; 2.2 GHz at .13µ
Note: this is the branch-mispredict pipeline, just one of the key (and more interesting and important) pipelines in the machine. The rest of the presentation refers to 1.4 GHz; expect to see this even faster as the design matures on this process.

50 Check Your Progress
Match the pipe functions to clocks/stages 1-20 (answers in pipeline order):
Trace Cache / Next IP: read from the Branch Target Buffer; 2 clks (1-2)
Trace Cache Fetch: read decoded µops from the TC; 2 clks (3-4)
Drive: drive µops to the Allocator; 1 clk (5)
Allocate: allocate resources for execution; 1 clk (6)
Rename: rename logical regs to physical regs; 2 clks (7-8)
Queue: write each µop into the µop queue to wait for scheduling; 1 clk (9)
Schedule: write to the schedulers, compute dependencies; 3 clks (10-12)
Dispatch: send µops to the appropriate execution unit; 2 clks (13-14)
Register File: read the register file; 2 clks (15-16)
Execute: execute the µops on the correct port; 1 clk (17)
Flags: compute flags (zero, negative, etc.); 1 clk (18)
Branch Check: compare the actual branch to the predicted one; 1 clk (19)
Drive: drive the branch result to the BTB at the front; 1 clk (20)

51 Multiprocessing, SMP
Def: execution of 1 task by >= 2 processors
Flynn Model (1960s):
Single-Instruction, Single-Data Stream (SISD) Architecture (PDP-11)
Single-Instruction, Multiple-Data Stream (SIMD) Architecture (array processors: Solomon, Illiac IV, BSP, TMC)
Multiple-Instruction, Single-Data Stream (MISD) Architecture (possibly: pipelined, VLIW, EPIC)
Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture (possibly: EPIC when SW-pipelined; true multiprocessor)

52 MP Scalability Caveat
The performance gain from doubling the number of processors follows the law of diminishing returns
[Chart: gain vs. number of processors]

53 Intel® Xeon™ Processor Scaling
1.39x frequency scaling; the scaling is more visible with a large cache
Benchmarks: SPECint_rate_base2000, OLTP
Source: Intel Corporation, based on Intel internal projections. System configuration assumptions: 1) two Intel® Xeon™ processors 2.8 GHz with 512KB L2 cache in an E7500 chipset-based server platform, 16GB memory, Hyperthreading enabled; 2) four Intel® Xeon™ MP processors 1.6 GHz with 1MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 3) four Intel® Xeon™ MP processors 2.0 GHz with 2MB L3 cache, same platform; 4) four Intel® Xeon™ MP processors 2.8 GHz with 2MB L3 cache, same platform.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

54 Intel® Xeon™ MP vs. Xeon™: Relative OLTP Performance
Which processor is better? The Xeon processor MP is targeted for OLTP. Source: TPC.org
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

55 MMX Integer Operations
Add (wraparound): paddw mm0, mm3 -- packed add on words
Add (saturation): paddusw mm0, mm3 -- packed add with unsigned saturation on words
Both instructions operate the same way: add the 1st word of the source packed word to the 1st word of the destination, the 2nd word to the 2nd word, etc.
The difference is saturation: paddusw saturates, so the result of F000h + 3000h is clamped to the upper limit, FFFFh for an unsigned word. paddw wraps around, so the same addition F000h + 3000h wraps to 2000h. (See the sketch below.)

56 MMX Arithmetic Operations
Multiply-low: pmullw mm0, mm3 -- multiply words, keep the low-order 16 bits of each product
Multiply-high: pmulhw mm1, mm4 -- multiply words, keep the high-order 16 bits of each product
Use multiply-low to keep the relevant low-order bits of your calculations; if you need the higher bits, use multiply-high, and vice versa.

57 MMX Arithmetic Operations
Multiply-add: pmaddwd mm1, mm4 -- packed multiply and add, 4 words to 2 doublewords
The multiply operations are available only for signed words. This instruction multiplies the four words of mm1 with the four words of mm4, then adds the two pairs of adjacent products to produce one packed doubleword: mm1 = {a3*b3 + a2*b2, a1*b1 + a0*b0}
Note: pmaddwd wraps around in only one situation: when all four numbers of one calculation (e.g. a0, a1, b0, and b1) are the smallest negative number, 8000h. In that case the result is 0x80000000 rather than saturating to 0x7fffffff; that is, you've overflowed by exactly one. If this overflow is possible in your program, check for this case.
Note: this instruction does not have a saturation option.

58 MMX Convert Operations
Unpack, interleaved merge:
punpcklwd mm0, mm1 -- unpack low words into doublewords: mm0 = {b1, a1, b0, a0}
punpckhwd mm0, mm1 -- unpack high words into doublewords: mm0 = {b3, a3, b2, a2}
Zero-extend from small data elements to bigger data elements by using the unpack instruction with zeros in one of the operands.

59 MMX Convert Operations
Pack: packusdw mm0, mm1 -- pack with unsigned saturation: (signed) doublewords into words
This example packs the doublewords in mm0 and mm1 into words. The source of a pack is always signed; the U in a pack instruction always refers to the destination. If the data doesn't fit in one word, the word saturates:
If A is larger than 2^16 - 1, then A' = 2^16 - 1
If A is smaller than 0, then A' = 0

60 MMX Shift Operations
psllq mm0, 8 -- packed shift left logical, quadword:
mm0 = 703F 0000 FFD9 4364h  ->  3F00 00FF D943 6400h
psllw mm0, 8 -- packed shift left logical, words:
mm0 = {703Fh, DF00h, 81DBh, 007Fh}  ->  {3F00h, 0000h, DB00h, 7F00h}

61 MMX Compare Operations
pcmpgtw -- compare greater-than on words (generates a mask)
This operation compares the words of the two packed operands, e.g. {51, 3, 5, 23} > {73, 2, 5, 6}. The result is a packed word in which all the bits of each word are either set or null: set if the first word was greater than the second, null otherwise.
You can use this operation to create a branch effect; the technique is discussed in the Coding Techniques section.
Q: What about equal words? A: The second words are equal (5), so that result word is null. To set the bits in that case as well, combine pcmpgtw with pcmpeqw (compare equal words).

62 SSE Registers
Streaming SIMD Extension registers: eight 128-bit registers, referred to as XMM0-XMM7; single-precision / double-precision / 128-bit integer data arrays; direct access to the registers; usable simultaneously with FP / MMX™ technology
IA-INT registers: fourteen 32-bit registers (EAX .. EDI); direct register access; scalar data only
MMX™ technology / IA-FP registers: eight 64-bit registers, xor eight 80-bit FP regs (FP0/MM0 .. FP7/MM7); direct access; FP data / data arrays; x87 remains aliased with the SIMD integer registers -- context switch
Notes: no new physical registers were added by SSE2; it uses the 128-bit xmm registers introduced on the Pentium® III processor. The MMX/x87 context-switching penalty still exists, but double-wide MMX can be done using the new SIMD integer instructions, so the MMX/x87 conflict should no longer be a performance limiter; 90+% of MMX code can be ported to SSE2. There is no penalty (no context switch) for moving MMX integer data to the SSE/SSE2 registers.

63 SSE Arithmetic Operations
Full precision: ADD, SUB, MUL, DIV, SQRT -- floating point (packed/scalar), full 23-bit mantissa precision
Approximate precision: RCP (reciprocal), RSQRT (reciprocal square root) -- very fast, return at least 11 bits of precision
Uses: perspective correction / projection, vector normalization

64 SSE Arithmetic Operations
MULPS: Multiply Packed Single-FP
mulps xmm1, xmm2 -- with xmm1 = {x4, x3, x2, x1} and xmm2 = {y4, y3, y2, y1}: xmm1 = {x4*y4, x3*y3, x2*y2, x1*y1} (see the sketch below)
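The same MULPS operation expressed with the SSE intrinsics (xmmintrin.h); a hedged sketch in which the function name and 16-byte alignment are assumptions:

    #include <xmmintrin.h>

    void mul4(float* x, const float* y) {     // both 16-byte aligned
        __m128 vx = _mm_load_ps(x);           // {x4, x3, x2, x1}
        __m128 vy = _mm_load_ps(y);           // {y4, y3, y2, y1}
        _mm_store_ps(x, _mm_mul_ps(vx, vy));  // one mulps: {x4*y4 .. x1*y1}
    }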

65 SSE Compare Operation
CMPPS: Compare Packed Single-FP
cmpps xmm0, xmm1, 1 -- predicate 1 = less-than; each result field is all ones (111…11) where the xmm0 element is less than the xmm1 element, all zeros (000…00) otherwise (example operands: 1.1, 7.3, 2.3, 5.6 vs. 8.6, 3.5, 1.2, …)

66 SSE2 Registers -- they look like SSE registers because they are the same registers
Streaming SIMD Extension registers (scalar / packed SIMD-SP, SIMD-DP, 128-bit integer): eight 128-bit registers, XMM0-XMM7; single-precision array / double-precision array / 128-bit integer; direct access; usable simultaneously with FP / MMX™ technology
IA-INT registers and MMX™ / IA-FP registers as on slide 62; x87 remains aliased with the SIMD integer registers -- context switch
SSE2 adds no new physical registers; it uses the 128-bit xmm registers introduced on the Pentium® III processor (see the notes on slide 62).

67 SSE2 Register Use
Backward compatible with all existing MMX™ & SSE code
Instruction types by processor: the Pentium® III supports standard x87 (SP, DP, EP), 64-bit SIMD integer (4x16, 8x8), single-precision SIMD FP (4x32), and cache management (memory streaming / prefetch); Willamette adds double-precision SIMD FP (2x64) and 128-bit SIMD integer (8x16, 16x8)
New 64-bit double-precision floating point instructions: 2 x 64-bit DP SIMD FP, or 4 x 32-bit SP SIMD FP (SSE), or scalar operations (SP or DP)
New / enhanced 128-bit wide SIMD integer, a superset of the MMX™ technology instruction set: 8 x 16-bit (vs. 4 for MMX™/SSE), or 16 x 8-bit (vs. 8), or scalar operations (8- or 16-bit); a true 128-bit implementation (vs. 64-bit on the Pentium® III processor)
3 instructions added for encryption / authentication; easy port from MMX™ technology and Pentium® III technologies
No forced context switching on the SSE registers (unlike the MMX™/x87 registers): MMX/x87 should not be mixed, due to forced context switching, but the XMM registers are like a scratch pad and can hold all data types simultaneously without penalty -- e.g. xmm0 extended SIMD integer data, xmm1 single-precision FP, xmm2 double-precision FP. This costs no added latencies if all operations on a register are consistent with its data type; ANDPS, ANDPD, and PAND all appear to do the same thing, but they are needed to maintain a consistent data type.

68 Willamette New Instructions
Extended SIMD integer instructions
New SIMD double-precision FP instructions
New cacheability instructions
Fully integrated into the Intel Architecture:
Use previously reserved opcodes
Same addressing modes as the MMX™ / SSE ops
Several MMX™ / SSE mnemonics are repeated; the new extended SIMD functionality is obtained by specifying 128-bit registers (xmm0-xmm7) as src/dst
The programming model is the same as MMX/SSE (assembly, intrinsics, or C++ classes). Some MMX mnemonics are repeated at the assembly level, but the intrinsics are all unique between WNI and MMX.

69 SIMD Double-Precision FP Ops
Same instruction categories as the SIMD single-precision FP instructions
Operate on both elements of the packed data in parallel -> SIMD; some instructions have scalar or packed versions
IEEE 754 compliant FP arithmetic, but not bit-exact with x87: 80-bit internal vs. 64-bit memory format
DP format: sign in bit 63, exponent in bits 62-52, significand in bits 51-0
Usable in all modes: real, virtual x86, SMM, and protected (16-bit & 32-bit)
x87 vs. SSE/WNI: x87 internally represents all data in 80-bit extended precision, rounding to the specified precision (64 or 32 bits) only upon saving to memory. This reduces some rounding error for x87 compared to SSE/WNI, where data is never represented internally beyond the precision specified by the programmer. Which means: there are some/few incompatibilities between old x86+x87 code and newer PIII and P4 code.

70 FP Instruction Syntax
Arithmetic FP instructions can be packed or scalar, and single-precision or double-precision:
ASM    | Intrinsic       | Meaning
addps  | _mm_add_ps()    | Add Packed Single
addpd  | _mm_add_pd()    | Add Packed Double
addss  | _mm_add_ss()    | Add Scalar Single
addsd  | _mm_add_sd()    | Add Scalar Double

71 New SSE2 Data Types
Packed & scalar FP instructions operate on packed single- or double-precision floating point elements
Packed instructions operate on 4 (SP) or 2 (DP) floats; scalar instructions operate only on the right-most field:
addps: {x4 op y4, x3 op y3, x2 op y2, x1 op y1}
addpd: {x2 op y2, x1 op y1}
addss: {y4, y3, y2, x1 op y1}
addsd: {y2, x1 op y1}

72 Extended SIMD Integer Ops
All MMX™/SSE integer instructions now also operate on 128-bit wide data in the XMM registers
Additionally, some new functionality:
MOVDQA, MOVDQU: 128-bit aligned/unaligned moves
PADDQ, PSUBQ: 64-bit add/subtract for mm & xmm regs
PMULUDQ: packed 32 * 32 bit multiply
PSLLDQ, PSRLDQ: 128-bit byte-wise shifts
PSHUFD: shuffle the four doublewords in an xmm register (see the sketch after this list)
PSHUFLW/PSHUFHW: shuffle the four words in the lower/upper half of an xmm reg
PUNPCKL/HQDQ: interleave the lower/upper quadwords
Full 128-bit conversions: 4 ints vs. 4 SP floats
The new 128-bit SIMD integer set is a superset of the MMX capabilities; all MMX code is portable to WNI. Key new functionality: the packed 32*32 multiplies are intended for RSA authentication and RC5 encryption.
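A hedged sketch of PSHUFD through the SSE2 intrinsic _mm_shuffle_epi32 (emmintrin.h); the helper name is invented:

    #include <emmintrin.h>

    // _MM_SHUFFLE(3,2,1,0) selects the identity order; (0,1,2,3) reverses
    // the four doublewords of the xmm register -- one pshufd instruction.
    __m128i reverse_dwords(__m128i v) {
        return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    }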

73 New SIMD Integer Data Formats
New 128-bit data types for fixed-point integer data:
16 packed bytes (8-bit)
8 packed words (16-bit)
4 packed doublewords (32-bit)
2 quadwords (64-bit)

74 New DP Instruction Categories
Computation: ADD, SUB, MUL, DIV, SQRT, MAX, MIN -- full 52-bit precision mantissa (packed & scalar)
Logic: AND, ANDN, OR, XOR -- operate uniformly on the entire 128-bit register; must use the DP instructions for double-precision data
Data formatting: MOVAPD, MOVUPD -- 128-bit DP moves (aligned/unaligned); MOVH/LPD, MOVSD -- 64-bit DP moves; SHUFPD -- shuffle packed doubles, selecting data with a 2-bit immediate operand

75 DP Packed & Scalar Operations
The new packed & scalar FP instructions operate on packed double-precision floating point elements
Packed instructions operate on 2 numbers: addpd -> {x2 op y2, x1 op y1}
Scalar instructions operate on the least-significant number: addsd -> {y2, x1 op y1}

76 SHUFPD Instruction
SHUFPD: Shuffle Packed Double-FP. With XMM1 = {x2, x1} and XMM2 = {y2, y1}:
SHUFPD XMM1, XMM2, 3 // binary 11 -> XMM1 = {y2, x2}
SHUFPD XMM1, XMM2, 2 // binary 10 -> XMM1 = {y2, x1}
(See the sketch below.)
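The same selection via the SSE2 intrinsic _mm_shuffle_pd (emmintrin.h); a hedged sketch with invented names:

    #include <emmintrin.h>

    // imm = 3 (binary 11): take the high element of each source operand,
    // matching SHUFPD XMM1, XMM2, 3 above.
    __m128d pick_highs(__m128d x /* {x2,x1} */, __m128d y /* {y2,y1} */) {
        return _mm_shuffle_pd(x, y, 3);   // -> {y2, x2}
    }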

77 New DP Instruction Categories, Cont'd
Branching: CMPPD, CMPSD -- compare & mask (packed/scalar); COMISD -- scalar compare and set status flags; MOVMSKPD -- store a 2-bit mask of the DP sign bits in a reg32
Type conversion: CVT -- convert DP to SP & 32-bit integer with rounding (packed/scalar); CVTT -- convert DP to 32-bit integer with truncation (packed/scalar)

78 Compare & Mask Operation
CMPPD: Compare Packed Double-FP
CMPPD XMM0, XMM1, 1 // predicate 1 = less-than
With XMM0 = {8.6, 3.5} and XMM1 = {1.1, 12.3}: result mask = {000…0, 111…1} (8.6 < 1.1 is false; 3.5 < 12.3 is true)

79 Cache Enhancements
On-die trace cache for decoded µops (TC): holds 12K µops. It replaces the L1 instruction cache and effectively removes the µop decoder from the pipeline for tight, performance-critical loops; the processor forms traces without help from the programmer (the trace formation process is explained in the Willamette optimization guide).
8K on-die, 1st-level data cache (L1): 64-byte line size (the Pentium Pro used 32 bytes); 4-way set associative; ultrafast, multiple accesses per CPU clock
256K on-die, 2nd-level write-back, unified data and instruction cache (L2): 128-byte line size; 8-way set associative; operates at full processor clock frequency (one access per CPU clock)
PREFETCH instructions return 128 bytes to L2
Why is L1 smaller than on previous processors? A cache design can improve performance by being bigger or by being faster; this L1 has bigger lines (64 bytes) and runs faster than previous generations of L1. The smaller caches also allow everything to fit on one die.

80 New Cacheability Instructions
MMX™/SSE cacheability instructions preserved. New functionality:
CLFLUSH: cache line flush
LFENCE / MFENCE: load fence / memory fence (mainly for driver-level optimizations, e.g. 3D device drivers)
PAUSE: pause execution -- essentially a NOP that uses no resources but causes a finite delay in the execution stream; used to create efficient spin-locks
MASKMOVDQU: masked move of 128-bit integer data -- selectively write bytes to memory based on a bit mask held in an integer register
MOVNTPD: streaming store of 2 x 64-bit DP FP data
MOVNTDQ: streaming store of 128-bit integer data
MOVNTI: streaming store of 32-bit integer data
The new streaming stores cover the two new packed data types and the original 32-bit integer registers; the overall concept is the same as the Pentium III streaming stores. CLFLUSH lets the programmer control precisely when a line is flushed, so that in cases where flushing is inevitable it can be done during idle memory cycles rather than waiting for a burst of dirty writebacks to interfere with the next memory-intensive sequence.

81 Streaming Stores
The Willamette implementation supports:
Writing to an uncacheable buffer (e.g. AGP) with full line-writes
Re-reading the same buffer with full line-reads
New in WNI, compared to Katmai/CuMine: the integer streaming store
Operates on integer registers (i.e., EAX, EBX)
Useful for the OS, by avoiding the need to save FP state: just move raw bits (see the sketch below)

82 Detail: Cache Line Flush
CLFLUSH: the cache line containing m8 is flushed and invalidated from all caches in the coherency domain
Linear-address based; allowed in user code
Potential usage: allows incoherent (AGP) I/O data to be mapped as WB for high read performance and flushed when updated. Example: a video encode stream.
Precise control of dirty-data eviction may increase performance by exploiting idle memory cycles

83 Detail: Fences
Capabilities introduced over time to enable software-managed coherence:
Write combining, with the Pentium Pro processor
SFENCE and memory streaming, with the Streaming SIMD Extensions
The new Willamette fences complete the tool set for full software coherence management:
LFENCE, strong load order: blocks younger loads from passing a prior load instruction; all loads preceding an LFENCE complete before the loads following it
MFENCE: achieves the effect of LFENCE and SFENCE executed at the same time. Necessary because issuing an SFENCE followed by an LFENCE does not prevent a load from passing a prior store. (See the sketch below.)
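A hedged producer/consumer sketch of the fence intrinsics (emmintrin.h); the flag protocol and variable names are illustrative:

    #include <emmintrin.h>

    volatile int data = 0, flag = 0;

    void producer() {
        data = 42;
        _mm_mfence();    // all prior loads AND stores complete before...
        flag = 1;        // ...the flag becomes visible to the consumer
    }

    int consumer() {
        while (flag == 0)
            _mm_pause(); // spin-wait hint (see the Pause slides)
        _mm_lfence();    // no younger load may pass the flag load
        return data;
    }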

84 Pause Instruction
PAUSE is architecturally a NOP on all IA-32 processor generations, so it is usable since Willamette without checking the processor type
PAUSE is a hint to the processor that the code is a spin-wait or other non-performance-critical code. A processor that uses the hint can:
Significantly improve the performance of spin-wait loops, without negative performance impact, by inserting an implementation-dependent delay that helps processors with dynamic execution (a.k.a. out-of-order execution) exit the spin loop faster
Significantly reduce power consumption during spin-wait loops
3rd-party libraries planned: Direct3D, OpenGL

85 NetBurst™ µArchitecture Overview
Front end: fetch/decode, trace cache, microcode ROM, BTBs/branch prediction
Out-of-order execution core; retirement
Bus unit, system bus; 2nd-level cache (8-way); 1st-level data cache (4-way)
[Diagram distinguishes frequently used paths from less frequently used paths]

86 NetBurst™ µArchitecture
Blocks: 3.2 GB/s system interface; L2 cache and control; decoder; BTB & I-TLB; trace cache; rename/alloc; µop queues; BTB; µcode ROM; 3 schedulers; integer RF and ALUs; L1 D-cache and D-TLB; load and store AGUs; FP RF; FMul, FAdd, MMX, SSE; FP move; FP store
The Pentium® 4 processor can do a memory load and a store operation each clock cycle: it first calculates the memory address in the address generation units ("AGUs") for the load and the store, then completes the load or store to the L1 D-cache. Programs have lots of loads and stores, so having both a load port and a store port keeps this from being a bottleneck. Overlapping load and store addresses are not addressed here, but the obvious solution is to serialize them in original order.
The processor also has an on-chip L2 cache that stores both code and data, and a fast system bus.
Floating point and multimedia: the goal was to be significantly faster than a P6. The processor can start a new FP or 128-bit SSE execution operation each clock cycle. During early performance work, two full FP/SSE execution units were considered, but a second, much simpler 128-bit data-movement unit that does moves and stores buys most of the performance of a full execution port at much lower silicon cost. The processor can also do a 128-bit load and a 128-bit store from the L1 D-cache each clock cycle, keeping the FP execution units fed with data.
FP programs have lots of long-latency operations. The processor has a very deep instruction window -- over 100 instructions in flight -- which lets the machine examine a large section of the program at once to find lots of independent FP/SSE instructions to stay busy; this deep window actually buys more performance than additional FP execution units would.
Often FP/multimedia programs don't fit in the L1 cache, so they need high bandwidth to and from the L2 cache; a high-bandwidth L2-to-L1 path enables this. Lastly, many FP and stream-oriented multimedia applications stream data from main memory and need a very fast system bus: the 3.2 GB/s bus and the 128-byte lines keep this FP/multimedia engine well fed. In summary, the Pentium® 4 processor delivers significantly higher x87 and SSE performance than previous IA-32 machines through a well-balanced, high-performance FP/multimedia engine.

87 NetBurst™ µArchitecture Summary
Quad-pumps the bus to keep the caches loaded
Stores the most recent instructions as µops in the TC to enhance instruction issue
Improves program execution:
Issues up to 3 µops per clock
Dispatches up to 6 µops to the execution units per clock
Retires up to 3 µops per clock
Feeds back branch and data information to have the required instructions and data available

88 What is Hyperthreading?
The ability of a processor to run multiple threads: duplicated architecture state creates the illusion to SW of a dual processor (DP)
The execution units are shared between the two threads, but dedicated to one if the other stalls
Effect of Hyperthreading on the Xeon processor:
CPU utilization increases to ~50% (from ~35%)
About 30% performance gain for some applications, at the same processor frequency
Hyperthreading Technology results: 1. more performance with enabled applications; 2. better responsiveness with existing applications

89 Hyperthreading Implementation
Almost two logical processors:
Architecture state (registers) and APIC duplicated
Execution units, caches, branch prediction, control logic, and buses shared
*APIC: Advanced Programmable Interrupt Controller; handles interrupts sent to a specified logical processor
The two logical processors are on the same processor core and share resources like the caches, execution units, branch predictors, and bus, while the architecture state and APIC are duplicated. The architecture state includes the eight general-purpose registers, control registers, and machine state registers.

90 Benefits to the Xeon™ Processor: Hyperthreading Performance for Dual-Processor Servers
Enhancements in bandwidth, throughput, and thread-level parallelism with Hyperthreading Technology deliver an acceleration of performance
Hyperthreading Technology increases performance by ~20% on some server applications
Source: Veritest (Sep 2002). Comparisons based on Intel internal measurements with pre-production hardware: HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI MegaRAID* controller, Dell PowerVault 210S disk array.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

91 Hyperthreading for Workstations
Intel® Xeon™ processor 2.8 GHz with 512KB cache
Hyperthreading Technology performance gains, whether running multiple tasks within one application or multiple applications at once
Multi-threaded applications: CHARMm*, 3DSM*5, D2cluster*, BLAST*, Lightwave3D* 7.5
Multi-tasking applications: Patran* + Nastran*, multiple compiles, 3ds max* + Photoshop, compile + regression, Maya* multiple renderings + animation
Hyperthreading Technology increases performance by 15-37% on workstation applications
Source: Intel Corporation. With and without Hyperthreading Technology on the following system configuration: Intel Xeon processor 2.80 GHz / 533 MHz system bus with 512KB L2 cache, Intel® E7505 chipset-based pre-release platform, 1GB PC2100 DDR CL2 CAS2-2-2, (2) 18GB Seagate* Cheetah ST318452LW 15K Ultra160 SCSI hard drives with an Adaptec SCSI adapter, nVidia* Quadro4 Pro 980XGL 128MB AGP 8x graphics card with driver version 40.52, Windows XP* Professional build 2600.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

92 Hyperthreading Resources
Shared: each logical processor can use, evict, or allocate any part of the resource -- e.g. caches, WC buffers, VTune regs, MS-ROM
Duplicated: each logical processor has its own set of resources -- e.g. APIC, registers, TSC, IP
Split: resources hard-partitioned in half -- e.g. load/store buffers, ITLB, ROB, IAQ
Tagged: resource entries tagged with a logical-processor ID -- e.g. trace cache, DTLB
Most resources on the physical processor are fully shared to improve their dynamic utilization; there is no static partitioning, allocation is done dynamically. Duplicated resources include the general-purpose registers, some machine registers, and the time stamp counter (which together form the architecture state), plus the APIC; the IP is duplicated to simultaneously track the execution and state changes of the two logical processors. The load/store buffers are partitioned to ensure fairness and allow forward progress for the two independent logical processors: partitioning prevents a stalled logical processor from using all entries. Tagged resources are almost the same as shared ones, but entries carry a processor-ID tag bit: a logical processor can only use entries matching its ID, yet any logical processor can allocate or evict any entry, which is not possible with split resources.

93 Xeon Processor Pipeline, Simplified
Buffering queues separate the major pipeline logic blocks: Fetch -> Decode -> TC/MSROM -> Rename/Allocate -> OOO Execute -> Retirement
The buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block: the queues between fetch and decode and between decode and TC are duplicated; the rest are partitioned

94 HT in NetBurst
Bus unit; 3rd-level cache (optional server product); 2nd-level cache; 1st-level cache, 4-way
Front end: fetch/decode, execution trace cache, microcode store ROM (MSROM), ITLB and branch prediction, IA-32 instruction decode, µop queue
OOO execution; retirement; BTBs/branch prediction; system bus
The following slides look in detail at the front end, the OOO execution unit, and the memory subsystem.

95 Front End
Responsible for delivering instructions to the later pipe stages
Trace cache hit: the requested instruction trace is present in the trace cache
Trace cache miss: the requested instruction is brought into the trace cache from the L2 cache

96 Trace Cache Hit
Two separate instruction pointers; the two logical processors arbitrate for access to the TC each cycle
If one logical processor stalls, the other uses the full bandwidth of the TC
On a non-HT machine there is one IP, the TC holds the processor's entries, and the requested instructions are delivered to the µop queue. With HT enabled, on a TC hit the instructions are simply taken from the TC and delivered to the µop queue; most instructions in a program are fetched and executed from the TC. Two sets of instruction pointers track the progress of the two s/w threads on the 2 logical processors. If both processors want access to the TC, one is granted access first and the other gets it in the alternating clock cycle. If one logical processor is stalled or waiting on some event, the other can use the full bandwidth of the trace cache every clock cycle (full bandwidth: 1 micro-op/clock).

97 Programming Models
Two major types of parallel programming models:
Domain decomposition: multiple threads working on subsets of the data (see the sketch below)
Functional decomposition: different computations on the same data, e.g. motion estimation vs. color conversion, etc.
Both models can be implemented on HT processors
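A hedged sketch of domain decomposition using an OpenMP pragma, one way to put both logical processors to work; the loop body and function name are illustrative:

    void scale(float* a, int n, float s) {
        // Each thread receives a subset (a contiguous chunk by default)
        // of the iteration space -- domain decomposition over a[].
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }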

98 Threading Implementation
O/S thread implementations differ:
Microsoft Win32: NT threads (supports 1-1 O/S-level threading); fibers (supports M-N user-level threading)
Linux: native Linux threads (severely broken & inefficient); IBM Next Generation Posix Threads (NGPT), IBM's attempt to fix the native threads; Red Hat's Native POSIX Thread Library for Linux (NPTL), 1-1 O/S-level threading intended to be Posix compliant
Others: Pthreads (generic Posix-compliant threads); Sun Solaris lightweight processes (LWP) and Sun Solaris user-level threads
Thread-model issues are somewhat orthogonal to HT

99 OS Implications of HT
Operating system support falls into 3 categories:
All UP OSs and legacy MP OSs: run unmodified on Hyperthreading-capable processors (backward compatible) but will not take advantage of the technology
Enabled MP OS: basic Hyperthreading functionality -- recognizes systems with Hyperthreading-enabled processors and uses both logical processors; functional support, not optimized for performance; uses the PAUSE instruction in busy-wait loops; no execution-based timing loops
Optimized MP OS: tuned to get the best performance from Hyperthreading -- a programming interface lets apps bind themselves to a specific logical processor; includes the HLT instruction in idle loops
Fully compatible with ALL existing O/Ss... but only an optimized O/S enables the most benefits

100 HT OS Support
Optimized: Windows XP Professional; Windows 2003 Enterprise and Data Center
Enabled: RedHat Enterprise Server (versions 7.3, 8.0); RedHat Advanced Server 2.1; SuSE (8.0, 9.0)

101 OS Scheduling
An HT-enabled O/S sees two processors for each HT physical processor
It enumerates the first logical processor of all physical processors first
It schedules processors almost the same as regular SMP; thread priority determines the schedule, but CPU dispatch matters
The O/S independently submits a code stream for a thread to each logical processor and can independently interrupt or halt each logical processor (no change)

102 Thread Management
Avoid coding practices that disadvantage hyperthreaded processors, e.g.:
Avoid 64KB aliasing
Avoid processor-serializing events (e.g. FP denormals, self-modifying code, etc.)
Avoid spin locks; minimize lock contention to fewer than two threads per lock
Use "pause" and O/S synchronization when spin-wait loops must be implemented
In addition, follow multi-threading best practices:
Use O/S services to block waiting threads
Spin as briefly as possible before yielding to the O/S
Avoid false sharing
Avoid unintended synchronizations (C runtime, C++ template library implementations)

103 Threading Tools
Intel ThreadChecker: itemization of parallelization bugs and their source
OpenMP: a thread model in which the programmer introduces parallelism or threading via directives or pragmas
Intel VTune Analyzer: provides analysis and drills down to source code; ThreadChecker integration; GuideView parallel performance tuning

104 Software Tools
Intel C/C++ Compiler: support for SSE and SSE2 via C++ classes, intrinsics, and assembly; improved vectorization and prefetch insertion; profile-guided optimizations; the -G7 compiler switch for Pentium® 4 optimizations
Register Viewing Tool (RVT): shows the contents of the XMM registers as they are updated; plugs into Microsoft* Visual Studio*
Microsoft* Visual Studio* 6.0 Processor Pack: support for SSE and SSE2 instructions, including intrinsics; available for free download from Microsoft*
Microsoft* Visual Studio* .NET: improved support for the Intel® NetBurst™ micro-architecture; recognizes the XMM registers
Notes: Vectorizing compilers strive to enable SIMD processing automatically; the Intel® C compiler's vectorizer creates code matched to the underlying processor capabilities (use intrinsics, let the compiler vectorize, and use compiler switches or cpuid dispatch to generate optimal code for Pentium® II, III, 4, and potentially Itanium™-based architectures).
Profile-guided optimization (PGO) monitors processor-specific performance statistics and adjusts the executable; in the Intel® C/C++ & Fortran compilers it is a 3-step process: compile -> execute & gather performance stats -> re-compile. Benefits: 1) improved instruction-cache usage -- frequently accessed code segments are moved adjacent to one another and seldom-accessed code to the end of the module, which can eliminate branches and shrink code size for more efficient instruction fetching; 2) improved branch prediction -- PGO generates branch hints for the Pentium® 4 processor during the re-compile. Targeted applications: apps with multiple execution paths where only a handful of paths are frequently executed, and large apps with many function calls or branches (which benefit further from IPO). PGO works best for code with many frequently executed branches that are difficult to predict at compile time, such as code heavy with error checking where the error conditions are false most of the time: the "cold" error-handling code can be placed so the branch is rarely mispredicted, and eliminating the interleaving of "hot" and "cold" code improves instruction-cache behavior. PGO also lets the compiler make better decisions about function inlining, increasing the effectiveness of interprocedural optimizations.

105 Hyperthreading is NOT:
Hyperthreading is not a full, dual-core processor
Hyperthreading does not deliver multi-processor scaling
Dual processor: two packages, each with its own processor core, architecture state, APIC, and on-die cache
Hyper-threading: one processor core and one on-die cache shared by two duplicated architecture states and APICs
Dual core: two processor cores (each with its own architecture state and APIC) in one package, with shared cache

106 Backup

107 TERMS
Branch: transfer of control to an address different from the next instruction. Unconditional or conditional.
Branch Prediction: the ability to guess the target of a conditional branch. Can be wrong, in which case we have a mis-predict.
CISC: complex instruction set computer.
Compiler: a tool translating high-level instructions into low-level machine instructions; the output can be asm source (ASCII) or binary machine code.
EPIC (Explicitly Parallel Instruction Computing): a new architecture jointly defined by Intel® and HP. It is the foundation of the new 64-bit Instruction Set Architecture.

108 TERMS
Explicit parallelism: the intended ability of two tasks to be executed by design (explicitly) at the same time. A task can be as simple as an instruction or as complex as a complete program.
Implicit parallelism: the incidental ability of two or more tasks to be executed at the same time. Example: a sequence of integer add and FP convert instructions without common registers or memory addresses, executed on a target machine that happens to have the respective HW modules available.

109 TERMS
Instruction Set Architecture (ISA): the architecturally visible instructions that perform software functions and direct operations within the processor. HP and Intel® jointly developed a new 64-bit ISA; this ISA integrates technical concepts from the EPIC technology.
Memory latency: the time to move data from memory to the processor, at the request of the processor.
Mispredict: a wrong guess about where the new flow of control will continue as a result of a branch (or similar control flow instruction).

