ECE 4100/6100 Advanced Computer Architecture Lecture 12 P6 and NetBurst Microarchitecture Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering.

1 ECE 4100/6100 Advanced Computer Architecture Lecture 12 P6 and NetBurst Microarchitecture Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

2 P6 System Architecture
Diagram: the P6 core (host processor, with L1 cache in SRAM) connects to the L2 cache (SRAM) over a dedicated back-side bus, and over the front-side bus to the MCH of the chipset. The MCH links system memory (DRAM), the graphics processor/GPU with its local frame buffer (AGP, later PCI Express), and the ICH, which provides PCI, USB, and other I/O.

3 P6 Microarchitecture
Block diagram, organized into clusters across the chip boundary:
– Instruction fetch cluster: bus interface unit, instruction fetch unit, BTB/BAC (control flow)
– Issue cluster: instruction decoders, register alias table, allocator, microcode sequencer
– Out-of-order cluster: reservation station, ROB & retirement RF, execution units (AGU, MMX, IEU/JEU, FEU) — restricted data flow
– Memory cluster: memory interface unit (MIU), memory order buffer, data cache unit (L1)
– Bus cluster: external bus

4 Pentium III Die Map
– IFU: Instruction Fetch Unit and L1 I-cache
– ID: Instruction Decode
– BTB: Branch Target Buffer
– BAC: Branch Address Calculator
– MS: Micro-instruction Sequencer
– RAT: Register Alias Table
– RS: Reservation Station
– ROB: Reorder Buffer
– IEU: Integer Execution Unit
– FAU: Floating-Point Arithmetic Unit
– Packed FPU / SIMD: packed floating-point unit for SSE
– MIU: Memory Interface Unit
– MOB: Memory Order Buffer
– DCU: Data Cache Unit (L1)
– DTLB: Data TLB
– PMH: Page Miss Handler
– EBL/BBL: External/Back-side Bus Logic
– TAP: Test Access Port

5 P6 Basics
– One implementation of the IA32 architecture
– Deeply pipelined processor
– In-order front-end and back-end
– Dynamic execution engine (restricted dataflow)
– Speculative execution
– P6 microarchitecture family processors include:
  – Pentium Pro
  – Pentium II (PPro + MMX + 2x caches)
  – Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
  – Pentium 4 (not P6; discussed separately)
  – Pentium M (+SSE2, SSE3, µop fusion)
  – Core (PM + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion, 4-µop retire rate vs. 3 in the previous proliferations)

6 P6 Pipelining
Pipeline diagram, stages IFU1-IFU2-IFU3-DEC1-DEC2-RAT-ROB-DIS-EX and RET1-RET2. In-order front end: Next IP → I-cache → ILD → rotate → Dec1 → Dec2 / Br Dec → IDQ → RAT → RS write; the FE in-order boundary is at RS write, and the retirement in-order boundary is at retirement (Ret ptr wr, Ret ROB rd, RRF wr). Out-of-order section: RS schedule → RS dispatch → Exec/WB (single-cycle pipeline); Exec2…Exec-n for multi-cycle operations; AGU → DCache1 → DCache2 for the non-blocking memory pipeline; AGU → MOB wr → MOB blk → MOB disp → DCache1 → DCache2 (MOB wakeup) for the blocking memory pipeline. Scheduling delays accrue separately in the RS, ROB, and MOB. Writeback labels in the diagram: 81: Mem/FP WB, 82: Int WB, 83: Data WB.

7 Instruction Fetching Unit
– IFU1: initiate fetch, requesting 16 bytes at a time (streaming buffer, instruction cache, victim cache, instruction TLB)
– IFU2: instruction length decoder (ILD) marks instruction boundaries; BTB makes its prediction (2 cycles)
– IFU3: align instructions to the 3 decoders
Fetch path: the next-PC mux selects a linear address (BTB target or other fetch requests); the instruction rotator fills the instruction buffer with length and prediction marks, advancing by the number of bytes the decoder consumed.

8 Static Branch Prediction (stage Br Dec on slide 6)
Used when the branch misses the BTB; on a BTB hit, the BTB's dynamic prediction decides. Decision tree on a BTB miss:
– Not PC-relative: a return is predicted taken; other indirect jumps are predicted taken
– PC-relative, unconditional: predicted taken
– PC-relative, conditional, backwards: predicted taken (loop heuristic)
– PC-relative, conditional, forwards: predicted not taken
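The decision tree above can be sketched as a small function (a toy model; the predicate names are mine, not Intel's, and returns/indirect jumps fall into the not-PC-relative, predicted-taken bucket):

```python
def static_predict(pc_relative, conditional, backwards):
    """BAC static prediction on a BTB miss: only forward conditional
    branches are predicted not-taken; returns, indirect jumps,
    unconditional and backward conditional branches are predicted taken."""
    if pc_relative and conditional and not backwards:
        return "not-taken"
    return "taken"

# Backward conditional branch (a loop back-edge): predicted taken.
print(static_predict(pc_relative=True, conditional=True, backwards=True))
# Forward conditional branch: predicted not-taken.
print(static_predict(pc_relative=True, conditional=True, backwards=False))
```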

9 Dynamic Branch Prediction
– Similar to a 2-level PAs design: a branch history register (BHR) indexes pattern history tables (PHT) of 2-bit saturating counters, associated with each entry of the 512-entry BTB (ways W0-W3)
– With a 16-entry Return Stack Buffer
– 4 branch predictions per cycle (due to the 16-byte fetch per cycle)
– Speculative update: the new (speculative) history is shifted into a second copy of the BHR, which is repaired from the branch result Rc on a mispredict
– Static prediction is provided by the Branch Address Calculator when the BTB misses (see prior slide)
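A minimal sketch of such a per-address 2-level predictor, assuming a 4-bit BHR and 2-bit counters (sizes and the exact update policy are illustrative, not Intel's disclosed design):

```python
class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                          # speculative branch history
        self.pht = [2] * (1 << history_bits)  # 2-bit counters, weakly taken

    def predict(self):
        idx = self.bhr
        taken = self.pht[idx] >= 2
        # Speculatively shift the prediction into the history register.
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask
        return taken, idx

    def resolve(self, idx, taken):
        # Train the counter that made the prediction; a real design also
        # repairs the speculative BHR from the retired copy on a mispredict.
        if taken:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
```

For example, the first prediction comes from counter 0 (weakly taken), and resolving it not-taken decrements that counter.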

10 X86 Instruction Decode
– Decode rate depends on instruction alignment
– DEC1: three decoders translate x86 instructions from the 16-byte instruction buffer into micro-operations (µops) — a complex decoder (1-4 µops) and simple decoders (1 µop each)
– DEC2: move decoded µops to the 6-µop instruction decoder queue, then on to the RAT/ALLOC
– The micro-instruction sequencer (MS) performs long translations either by generating the entire µop sequence from the microcode ROM, or by receiving 4 µops from the complex decoder and the rest from the microcode ROM
– Instructions behind one that needs the MS are flushed
Decode-group table (S = simple, C = complex) — next 3 instructions → number decoded this cycle:
S,S,S → 3; S,S,C → first 2; S,C,S → first 1; S,C,C → first 1; C,S,S → 3; C,S,C → first 2; C,C,S → first 1; C,C,C → first 1.
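The decode-group table reduces to one rule — a complex instruction must sit in the first decoder slot, and a complex instruction in a later slot ends the group. A sketch (simplified: ignores µop-count limits within the complex decoder):

```python
def insts_decoded(window):
    """window: the next three pending instructions, each 'S' (simple)
    or 'C' (complex). Returns how many decode this cycle."""
    count = 0
    for slot, kind in enumerate(window[:3]):
        if kind == 'C' and slot > 0:
            break          # complex op must wait for decoder slot 0
        count += 1
    return count

# Reproduces the slide's table, e.g.:
print(insts_decoded(['S', 'C', 'S']))  # 1: the complex op waits
print(insts_decoded(['C', 'S', 'S']))  # 3: complex op is in slot 0
```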

11 Register Alias Table (RAT)
– Register renaming for the 8 integer registers, 8 floating-point (stack) registers, and flags: 3 µops per cycle
– 40 80-bit physical registers are embedded in the ROB (hence 6 bits specify a physical source, PSrc)
– The RAT looks up physical ROB locations for renamed sources based on the RRF bit (whether the value lives in the Retirement Register File or the ROB)
– Override logic handles dependent µops decoded in the same cycle
– On a misprediction, all pointers revert to the Retirement Register File (RRF)
Structure: an in-order queue, an FP RAT array with FP top-of-stack adjust, and an integer RAT array; logical sources plus the integer and FP overrides yield physical sources (PSrc), and the allocator supplies the physical ROB pointers. Renaming example: ECX maps to a ROB entry (PSrc 25) while the other registers still point into the RRF.
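A minimal renaming sketch in the spirit of this slide — the map defaults to the RRF, writes claim ROB entries in order, and later reads see the ROB pointer (all names are illustrative; the same-cycle override bypass is not modeled):

```python
class RAT:
    def __init__(self):
        self.table = {}     # logical reg -> ('RRF', reg) or ('ROB', idx)
        self.next_rob = 0   # allocator hands out ROB entries in order

    def rename(self, srcs, dst):
        # Sources read the current mapping, defaulting to the RRF.
        psrcs = [self.table.get(r, ('RRF', r)) for r in srcs]
        pdst = ('ROB', self.next_rob)   # Pdst token from the allocator
        self.table[dst] = pdst
        self.next_rob += 1
        return psrcs, pdst

rat = RAT()
# ADD EAX, EBX : both sources from the RRF, EAX now lives in ROB entry 0
print(rat.rename(['EAX', 'EBX'], 'EAX'))
# ADD ECX, EAX : EAX is now read from ROB entry 0
print(rat.rename(['ECX', 'EAX'], 'ECX'))
```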

12 Partial Stalls due to RAT
Partial register stalls occur when a write to a smaller register (e.g. 8/16-bit) is followed by a read of a larger one (e.g. 32-bit), because the read must assemble pieces from multiple physical registers:
  MOV AX, m16
  ADD EAX, m32 ; stall — the EAX read follows the AX write
Idiom fix (1):
  XOR EAX, EAX
  MOV AL, m8
  ADD EAX, m32 ; no stall
Idiom fix (2):
  SUB EAX, EAX
  MOV AL, m8
  ADD EAX, m32 ; no stall
Partial flags stalls occur when a subsequent instruction reads more flags than a prior unretired instruction touched:
  CMP EAX, EBX
  INC ECX
  JBE XX ; stall — JBE reads both ZF and CF, but INC writes (ZF, OF, SF, AF, PF) and not CF
  TEST EBX, EBX
  LAHF ; stall — LAHF loads the low byte of EFLAGS, while TEST writes only part of those flags

13 Partial Register Width Renaming
– The integer RAT array is split into an INT low bank (32-bit/16-bit/low-byte: 8 entries) and an INT high bank (AH/BH/CH/DH: 4 entries); each entry carries Size (2 bits), RRF (1 bit), and PSrc (6 bits)
– 32/16-bit accesses read from the low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP) and write to both banks
– 8-bit RAT accesses read and update only the particular bank being written
Example µop sequence: µop0: MOV AL ← (a); µop1: MOV AH ← (b); µop2: ADD AL ← (c); µop3: ADD AH ← (d) — the AL and AH chains rename independently.

14 Allocator (ALLOC)
– The interface between the in-order and out-of-order pipelines
– Allocates into the ROB, MOB, and RS:
  – "3-or-none" µops per cycle into the ROB and RS — there must be 3 free ROB entries or no allocation occurs
  – "All-or-none" policy for the MOB — allocation stalls unless all the valid memory µops can be allocated
– Generates a physical destination token (Pdst) from the ROB and passes it to the Register Alias Table (RAT) and the RS
– Stalls upon shortage of resources

15 Reservation Stations (RS)
– Gateway to execution: binds at most 5 µops, one per port, per cycle
– Port binding happens at dispatch time (certain µops can only be bound to one port)
– A 20-entry µop buffer bridging the in-order and out-of-order engines (32 entries in Core)
– RS fields include the µop opcode, data-valid bits, Pdst, PSrc, source data, branch prediction, etc.
– Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
Port map: Port 0 — IEU0, Fadd, Fmul, Imul, Div, Pfadd, Pfmul, Pfshuf (WB bus 0); Port 1 — IEU1, JEU (WB bus 1); Port 2 — AGU0, load address to the MOB/DCU; Ports 3/4 — AGU1 store address (STA) and store data (STD). Loaded data returns on the writeback buses; retired data flows from the ROB to the RRF.
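The oldest-first selection rule can be sketched in a few lines — each entry carries its allocation ordinal, and among ready entries the smallest (oldest) ordinal wins (entry format is illustrative):

```python
def pick_oldest_ready(entries):
    """entries: list of (alloc_order, ready) tuples for one port.
    Return the alloc_order of the dispatched micro-op, or None."""
    ready = [age for age, ok in entries if ok]
    return min(ready) if ready else None    # oldest-first among ready

# Entry 3 dispatches: it is ready and older than ready entry 5.
print(pick_oldest_ready([(5, True), (2, False), (3, True)]))
```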

16 ReOrder Buffer (ROB)
– A 40-entry circular buffer (96 entries in Core), 157 bits wide, providing the 40 aliased physical registers
– Out-of-order completion; exceptions are deposited in each entry
– Retirement (de-allocation):
  – After resolving prior speculation
  – Exceptions are handled through the MS (µcode assist)
  – OOO state is cleared when a mispredicted branch or exception is detected
  – 3 µops per cycle, in program order
  – Multi-µop x86 instructions retire none-or-all (atomically)
Flow: ALLOC and RAT feed the RS and ROB; the ROB retires into the RRF, with the MS providing µcode assists on exceptions.

17 Memory Execution Cluster
– Manages data memory accesses: load (LD), store-address (STA), and store-data (STD) µops flow from the RS/ROB through the DTLB to the DCU, alongside the load buffer and store buffer, out to the EBL
– Address translation
– Detects violations of access ordering
– Fill buffers (FB) in the DCU, similar to MSHRs, provide non-blocking cache support
Example the RS cannot disambiguate, and so could dispatch both memory µops in the same cycle even though the load may alias the earlier store through the address computation:
  movl ecx, edi
  addl ecx, 8
  movl -4(edi), ebx
  movl eax, 4(ecx)

18 Memory Order Buffer (MOB)
– Allocated by ALLOC; a second-order RS for memory operations: 1 µop for a load, 2 µops for a store — Store Address (STA) and Store Data (STD)
– 16-entry load buffer (LB) (32 entries in Core, 64 in Sandy Bridge)
– 12-entry store address buffer (SAB) (20 entries in Core, 36 in Sandy Bridge); the SAB works in unison with the Store Data Buffer (SDB) in the MIU and the Physical Address Buffer (PAB) in the DCU — together, Store Buffer (SB) = SAB + SDB + PAB
– Senior stores: once its STD/STA retire from the ROB, the SB marks the store "senior"; senior stores are committed to memory in program order when the bus is idle or the SB is full
– Prefetch instructions in P-III have senior-load behavior, since they have no explicit architectural destination
– Core adds a memory dependence predictor to predict store-to-load dependencies

19 Store Coloring
– ALLOC assigns each store a Store Buffer ID (SBID) in program order
– ALLOC tags each load with the most recent SBID — its store color
– Loads are checked against stores of equal or smaller SBID for potential address conflicts
– The SDB forwards the data if a conflict is detected
x86 instruction       µops                     store color
mov (0x1220), ebx     std ebx / sta 0x1220     2
mov (0x1110), eax     std eax / sta 0x1110     3
mov ecx, (0x1220)     ld 0x1220                3
mov edx, (0x1280)     ld 0x1280                3
mov (0x1400), edx     std edx / sta 0x1400     4
mov edx, (0x1380)     ld 0x1380                4
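The conflict check can be sketched as follows — a toy model simplified to exact-address matches (a real MOB compares overlapping ranges), using the table's values:

```python
def may_forward(load_addr, load_color, store_buffer):
    """store_buffer: list of (sbid, addr) for buffered stores.
    A load may conflict only with stores at or before its color;
    return the youngest such matching SBID, or None (go to cache)."""
    hits = [sbid for sbid, addr in store_buffer
            if sbid <= load_color and addr == load_addr]
    return max(hits) if hits else None

stores = [(2, 0x1220), (3, 0x1110), (4, 0x1400)]
# The slide's "mov ecx,(0x1220)" (color 3) hits the SBID-2 store:
print(may_forward(0x1220, 3, stores))
# "mov edx,(0x1380)" (color 4) matches nothing: read the cache.
print(may_forward(0x1380, 4, stores))
```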

20 Memory Type Range Registers (MTRR)
– Control registers written by the system (OS)
– Supported memory types:
  – UnCacheable (UC)
  – Uncacheable Speculative Write-Combining (USWC or WC) — uses a fill-buffer entry as the WC buffer
  – WriteBack (WB)
  – Write-Through (WT)
  – Write-Protected (WP) — e.g. supports copy-on-write in UNIX, saving memory by letting child processes share pages with their parents; new pages are created only when a child attempts to write
– Page Miss Handler (PMH): looks up the MTRRs while supplying physical addresses, and returns the memory type and physical address to the DTLB
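Conceptually the PMH's lookup is a range match over the configured registers. An illustrative sketch (the ranges and types below are made up for the example, not real MTRR programming):

```python
# Hypothetical MTRR configuration: (base, limit, memory type).
MTRRS = [
    (0x00000000, 0x0009FFFF, "WB"),    # low DRAM: writeback
    (0x000A0000, 0x000BFFFF, "USWC"),  # legacy frame buffer: write-combining
    (0xFFF00000, 0xFFFFFFFF, "UC"),    # boot ROM: uncacheable
]

def memory_type(phys_addr, default="UC"):
    """Return the memory type the PMH would attach to this address."""
    for lo, hi, mtype in MTRRS:
        if lo <= phys_addr <= hi:
            return mtype
    return default

print(memory_type(0x000A1234))   # frame-buffer write lands in the WC range
```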

21 Intel NetBurst Microarchitecture
– Pentium 4's microarchitecture
– Original target market: graphics workstations, but…
– Design goals:
  – Performance, performance, performance, …
  – Unprecedented multimedia/floating-point performance: Streaming SIMD Extensions 2 (SSE2); SSE3 introduced in the Prescott Pentium 4 (90nm)
  – Reduced CPI: low-latency instructions, high-bandwidth instruction fetching, rapid execution of arithmetic & logic operations
  – Reduced clock period: a new pipeline designed for scalability

22 Innovations Beyond P6
– Hyperpipelined technology
– Streaming SIMD Extension 2
– Hyper-Threading Technology (HT)
– Execution trace cache
– Rapid execution engine
– Staggered adder unit
– Enhanced branch predictor
– Indirect branch predictor (also in Banias Pentium M)
– Load speculation and replay

23 Pentium 4 Fact Sheet
– IA-32, fully backward compatible
– Available at speeds ranging from 1.3 to ~3.8 GHz
– Hyperpipelined (20+ stages)
– 125 million transistors in Prescott (1.328 billion in the 16MB on-die-L3 Tulsa, 65nm)
– 0.18µm for 1.3-2 GHz; 0.13µm for 1.8-3.4 GHz; 90nm for 2.8-3.6 GHz
– Die size of 122 mm² (Prescott, 90nm); 435 mm² (Tulsa, 65nm)
– Consumes 115 W at 3.6 GHz
– 1066 MHz system bus
– Prescott L1: 16KB, 8-way vs. the previous P4's 8KB, 4-way
– 1MB, 512KB, or 256KB 8-way full-speed on-die L2 (bandwidth example: 89.6 GB/s to L1)
– 2MB L3 cache (in the P4 HT Extreme Edition, 0.13µm only); 16MB in Tulsa
– 144 new 128-bit SIMD instructions (SSE2), and the SSE3 instructions in Prescott
– Hyper-Threading Technology (not in all versions)

24 Building Blocks of NetBurst
Front-end: fetch/decode, execution trace cache (ETC), µROM, and BTB/branch prediction. Out-of-order engine: OOO logic and retire, with branch-history updates fed back to the front end. Execution: INT and FP execution units with the L1 data cache. Memory subsystem: L2 cache and the bus unit to the system bus.

25 Pentium 4 Microarchitecture (Prescott)
Front end: BTB (4K entries), I-TLB/prefetcher, IA32 decoder, execution trace cache (12K µops) with its own trace-cache BTB (2K entries), µcode ROM, and µop queue. OOO engine: allocator / register renamer feeding an INT/FP µop queue and a memory µop queue into the memory scheduler and the fast, slow/general, and simple-FP schedulers. Execution: INT register file / bypass network with 2 AGUs (load address, store address), 2x-pumped ALUs for simple instructions, and a slow ALU for complex instructions; FP register file / bypass network with FP move and FP/MMX/SSE/2/3 units. L1 data cache: 16KB, 8-way, 64-byte lines, write-through, 1 read + 1 write port. Unified L2: 1MB, 8-way, 128-byte lines, writeback, 108 GB/s over a 256-bit path. Quad-pumped 800 MHz, 6.4 GB/s, 64-bit system bus via the BIU.

26 Pipeline Depth Evolution
– P5 microarchitecture: PREF, DEC, EXEC, WB
– P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
– NetBurst microarchitecture (Willamette), 20 stages: TC Next IP, TC Fetch, Drive, Alloc, Queue, Rename, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
– NetBurst microarchitecture (Prescott): > 30 stages

27 Execution Trace Cache
– The primary first-level I-cache, replacing a conventional L1:
  – Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  – The branch misprediction penalty is considerable
– Advantages:
  – Caches post-decode µops (think of a fill unit)
  – High-bandwidth instruction fetching
  – Eliminates x86 decoding overheads
  – Reduces branch recovery time on a TC hit
– Holds up to 12,000 µops: 6 µops per trace line, many (?) trace lines in a single trace

28 Execution Trace Cache
– Delivers 3 µops per cycle to the OOO engine when branch prediction is good
– x86 instructions are read from L2 on a TC miss (7+ cycle latency)
– TC hit rate is comparable to an 8-16KB conventional I-cache
– Simplified x86 decoder: only one complex instruction per cycle; an instruction of more than 4 µops is executed from the micro-code ROM (P6's MS)
– Branch prediction in the TC: a 512-entry BTB + 16-entry RAS; together with the BP in the x86 IFU, mispredictions are reduced 33% compared to P6; Intel did not disclose the details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)

29 Out-Of-Order Engine
Similar design philosophy to P6; uses:
– Allocator
– Register Alias Table
– 128 physical registers
– 126-entry ReOrder Buffer
– 48-entry load buffer
– 24-entry store buffer

30 Register Renaming Schemes
P6 renaming: the RAT maps the logical registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) either into the 40-entry ROB (data + status, allocated sequentially) or to the RRF. NetBurst instead keeps values in a separate physical register file, with a front-end RAT and a retirement RAT both pointing into it.

31 Micro-op Scheduling
– µop FIFO queues: a memory queue for loads and stores, and a non-memory queue
– µop schedulers: several schedulers fire instructions from the 2 µop queues to execution (the analogue of P6's RS); 4 distinct dispatch ports; maximum dispatch of 6 µops per cycle (2 each from the fast ALUs on ports 0 and 1 per cycle, plus 1 each from the load/store ports)
Port map:
– Exec port 0: fast ALU, 2x pumped (add/sub, logic, store data, branches); FP move (FP/SSE move, FP/SSE store, FXCH)
– Exec port 1: fast ALU, 2x pumped (add/sub); INT exec (shift, rotate, LEA); FP exec (FP/SSE add, FP/SSE mul, FP/SSE div, MMX)
– Load port: memory loads, prefetch
– Store port: memory stores

32 Data Memory Accesses
– Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128-byte lines
– Load-to-use speculation: dependent instructions are dispatched before the load finishes, because at this frequency and pipeline depth the trip from the load scheduler to execution is longer than the execution itself; the scheduler assumes loads always hit L1
– On an L1 miss, dependent instructions that already left the scheduler temporarily receive incorrect data — a mis-speculation
– Replay logic: the load's mis-speculated dependants are placed into a replay queue and re-dispatched when the load re-executes; all trailing independent instructions are allowed to proceed; a "tornado breaker" limits runaway replay
– Up to 4 outstanding load misses (= the 4 fill buffers of the original P6)
– Store-to-load forwarding buffer: 24 entries; forwarding requires the same starting physical address and load data size <= store data size
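A toy model of the hit-speculation step: dependants are dispatched assuming an L1 hit, and on a miss they are diverted to a replay queue instead of completing (purely illustrative; the real machine replays in stages through checker logic):

```python
def schedule_load(l1_hit, dependants):
    """One speculative dispatch of a load's dependants.
    Returns (completed, replay_queue)."""
    if l1_hit:
        # Speculation was correct: dependants complete with good data.
        return list(dependants), []
    # Dependants already left the scheduler with stale data:
    # queue them for re-dispatch once the load returns.
    return [], list(dependants)

done, replay = schedule_load(False, ["add", "sub"])
print(done, replay)   # nothing completes; both µops await replay
```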

33 Fast Staggered ALU
– For frequent ALU instructions (no multiply, shift, rotate, or branch processing)
– Double-pumped clocks
– Each operation finishes in 3 fast cycles:
  – Lower-order 16 bits (bits [15:0]) and bypass
  – Higher-order 16 bits (bits [31:16]) and bypass
  – ALU flags generation
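The staggered 32-bit add can be illustrated functionally — low half first, carry into the high half next, flags last (the flag set shown is abbreviated):

```python
def staggered_add(a, b):
    """32-bit add split as the staggered ALU computes it."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)              # fast cycle 1: bits [15:0]
    carry = lo >> 16                              # carry into the high half
    hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry  # fast cycle 2
    result = ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
    flags = {"CF": hi >> 16, "ZF": int(result == 0)}          # fast cycle 3
    return result, flags

res, flags = staggered_add(0x0001FFFF, 0x00000001)
print(hex(res), flags)   # carry propagates from the low to the high half
```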

34 Branch Predictor
P4 uses the same hybrid predictor as the Pentium M: a bimodal predictor, a local predictor, and a global predictor feed a mux; the local prediction (Pred_L) is chosen on a local hit (L_hit), the global prediction (Pred_G) on a global hit (G_hit), and the bimodal prediction (Pred_B) otherwise.

35 Indirect Branch Predictor
In Pentium M and Prescott Pentium 4: prediction of indirect branch targets based on global history.

36 New Instructions over Pentium
– CMOVcc / FCMOVcc r, r/m: conditional (predicated) move instructions, based on the condition code (cc)
– FCOMI/P: compare FP stack values and set integer flags
– RDPMC/RDTSC instructions — performance-monitoring counters (PMC): P6 has 2, NetBurst (P4) has 18
– Uncacheable Speculative Write-Combining (USWC): a weakly ordered memory type for graphics memory

37 New Instructions
– SSE2 in Pentium 4 (not in the P6 microarchitecture): double-precision SIMD FP
– SSSE3 in Core 2: supplemental instructions for shuffle, align, add, subtract
– Intel 64 (EM64T): 64-bit support with new registers (8 more on top of the existing 8); in Celeron D and Core 2 (and P4 Prescott, Pentium D); almost compatible with AMD64
– AMD's NX bit / Intel's XD bit for preventing buffer-overflow attacks

38 Streaming SIMD Extension 2
P-III SSE (Katmai New Instructions, KNI):
– Eight 128-bit wide xmm registers (new architectural state)
– Single-precision 128-bit SIMD FP: four 32-bit FP operations in one instruction, broken into 2 µops for execution (the ROB holds only 80-bit data)
– 64-bit SIMD MMX (uses the 8 mm registers, which map to the FP stack)
– Prefetch (nta, t0, t1, t2) and sfence
P4 SSE2 (Willamette New Instructions, WNI):
– Double-precision 128-bit SIMD FP: two 64-bit FP operations in one instruction
– Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD at 69 cycles, non-pipelined)
– Enhanced 128-bit SIMD MMX using the xmm registers

39 Examples of Using SSE
Diagram: a packed SP FP operation (e.g. ADDPS xmm1, xmm2) applies the operation to all four 32-bit lanes, producing X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0. A scalar SP FP operation (e.g. ADDSS xmm1, xmm2) applies it only to the low lane (X0 op Y0), leaving X3..X1 unchanged. A shuffle FP operation with an 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8) selects two lanes from xmm1 and two from xmm2; with imm8 = 0xF1 the result lanes are (Y3, Y3, X0, X1).

40 Examples of Using SSE and SSE2
SSE (as on the previous slide): ADDPS operates on all four single-precision lanes, ADDSS only on the low lane, and SHUFPS selects lanes by an 8-bit immediate. SSE2 adds the double-precision forms: a packed DP FP operation (e.g. ADDPD xmm1, xmm2) computes X1 op Y1 and X0 op Y0; a scalar DP FP operation (e.g. ADDSD xmm1, xmm2) computes only X0 op Y0, leaving X1 unchanged; a shuffle DP operation with a 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2) picks X1 or X0 for the low lane and Y1 or Y0 for the high lane.
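The lane semantics of these examples can be modeled with Python lists standing in for xmm registers (element 0 is the low lane; the SHUFPS field decoding follows Intel's documented encoding):

```python
def addps(x, y):                  # packed single: all four lanes
    return [a + b for a, b in zip(x, y)]

def addss(x, y):                  # scalar single: low lane only
    return [x[0] + y[0]] + x[1:]

def addpd(x, y):                  # packed double: two 64-bit lanes
    return [x[0] + y[0], x[1] + y[1]]

def shufps(x, y, imm8):
    """Low two result lanes select from x, high two from y,
    each by a 2-bit field of the immediate."""
    sel = lambda reg, field: reg[(imm8 >> (2 * field)) & 3]
    return [sel(x, 0), sel(x, 1), sel(y, 2), sel(y, 3)]

xmm1 = [1.0, 2.0, 3.0, 4.0]       # X0..X3
xmm2 = [10.0, 20.0, 30.0, 40.0]   # Y0..Y3
print(addps(xmm1, xmm2))          # every lane added
print(addss(xmm1, xmm2))          # only lane 0 added
print(shufps(xmm1, xmm2, 0xF1))   # lanes X1, X0, Y3, Y3
```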

41 HyperThreading
– In the Intel Xeon processor and Intel Xeon MP processor
– Enables Simultaneous Multi-Threading (SMT): exploits the ILP hardware through TLP (thread-level parallelism), issuing and executing multiple threads in the same snapshot
– A single P4 with HT appears to be 2 logical processors
– The logical processors share the same execution resources; the dTLB is shared, with entries tagged by logical-processor ID; some other shared resources are partitioned (next slide)
– Architectural state and some microarchitectural state are duplicated: IPs, iTLB, streaming buffer, architectural register file, return stack buffer, branch history buffer, Register Alias Table

42 Multithreading (MT) Paradigms
Diagram comparing, across function units FU1-FU4 over time (threads 1-5, with unused execution slots shown empty): a conventional single-threaded superscalar; fine-grained multithreading (cycle-by-cycle interleaving); coarse-grained multithreading (block interleaving); simultaneous multithreading (Intel's HT); and a chip multiprocessor (CMP), today called a multi-core processor.

43 HyperThreading Resource Partitioning
– The TC (and µROM) is accessed in alternate cycles by each logical processor, unless one is stalled on a TC miss
– The µop queue after the TC is split in half
– ROB (126/2), LB (48/2), SB (24/2) (32/2 for Prescott)
– The general µop queue and memory µop queue are each split in half
– TLB (½?), as there is no PID
– Retirement: alternating between the 2 logical processors
