Advanced CISC Implementations: Pentium 4

Slides:



Advertisements
Similar presentations
Massachusetts Institute of Technology
Advertisements

Asanovic/Devadas Spring VLIW/EPIC: Statically Scheduled ILP Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.
Asanovic/Devadas Spring Advanced Superscalar Architectures Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.
CPU Structure and Function
Topics Left Superscalar machines IA64 / EPIC architecture
Chapter 3 General-Purpose Processors: Software
1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
Computer Organization and Architecture
CSCI 4717/5717 Computer Architecture
Krste Asanovic Electrical Engineering and Computer Sciences
CS6290 Pentiums. Case Study1 : Pentium-Pro Basis for Centrinos, Core, Core 2 (We’ll also look at P4 after this.)
CS 7810 Lecture 22 Processor Case Studies, The Microarchitecture of the Pentium 4 Processor G. Hinton et al. Intel Technology Journal Q1, 2001.
Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.
Instruction Level Parallelism 2. Superscalar and VLIW processors.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
Fall EE 333 Lillevik 333f06-l20 University of Portland School of Engineering Computer Organization Lecture 20 Pipelining: “bucket brigade” MIPS.
THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
Computer Organization and Architecture
Computer Organization and Architecture
1 Microprocessor-based Systems Course 4 - Microprocessors.
Intel Labs Labs Copyright © 2000 Intel Corporation. Fall 2000 Inside the Pentium ® 4 Processor Micro-architecture Next Generation IA-32 Micro-architecture.
Room: E-3-31 Phone: Dr Masri Ayob TK 2123 COMPUTER ORGANISATION & ARCHITECTURE Lecture 4: Computer Performance.
IA- 32 Architecture Richard Eckert Anthony Marino Matt Morrison Steve Sonntag.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
Chapter 12 CPU Structure and Function. Example Register Organizations.
7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.
COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.
Intel Pentium 4 Processor Presented by Presented by Steve Kelley Steve Kelley Zhijian Lu Zhijian Lu.
CS 152 Computer Architecture and Engineering Lecture 23: Putting it all together: Intel Nehalem Krste Asanovic Electrical Engineering and Computer Sciences.
Computer performance.
Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,
Basic Microcomputer Design. Inside the CPU Registers – storage locations Control Unit (CU) – coordinates the sequencing of steps involved in executing.
Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.
Chapter 5: Computer Systems Design and Organization Dr Mohamed Menacer Taibah University
Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,
Pentium III Instruction Stream. Introduction Pentium III uses several key features to exploit ILP This part of our presentation will cover the methods.
ALPHA 21164PC. Alpha 21164PC High-performance alternative to a Windows NT Personal Computer.
Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,
PowerPC 604 Superscalar Microprocessor
Architecture & Organization 1
Dr. George Michelogiannakis EECS, University of California at Berkeley
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Flow Path Model of Superscalars
Introduction to Pentium Processor
Pipelining: Advanced ILP
Instruction Level Parallelism and Superscalar Processors
The Microarchitecture of the Pentium 4 processor
Architecture & Organization 1
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Lecture 11: Memory Data Flow Techniques
Comparison of Two Processors
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Krste Asanovic Electrical Engineering and Computer Sciences
Advanced Computer Architecture
* From AMD 1996 Publication #18522 Revision E
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
Presentation transcript:

Advanced CISC Implementations: Pentium 4 Asanovic/DevadasSpring 2002 6.823 Advanced CISC Implementations: Pentium 4 Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology

Intel Pentium Pro (1995) During decode, translate complex x86 Asanovic/DevadasSpring 2002 6.823 Intel Pentium Pro (1995) During decode, translate complex x86 instructions into RISC-like micro-operations (uops) – e.g., “R  R op Mem” translates into load T, Mem # Load from Mem into temp reg R  R op T # Operate using value in temp Execute uops using speculative out-of-order superscalar engine with register renaming Pentium Pro family architecture (P6 family) used on Pentium-II and Pentium-III processors

Intel Pentium 4 (2000) Deeper pipelines than P6 family Asanovic/DevadasSpring 2002 6.823 Intel Pentium 4 (2000) Deeper pipelines than P6 family – about half as many levels of logic per pipeline stage as P6 Trace cache holds decoded uops – only has a single x86->uop decoder Decreased latency in same process technology – aggressive circuit design – new microarchitectural tricks This lecture contains figures and data taken from: “The microarchitecture of the Pentium 4 processor”, Intel Technology Journal, Q1, 2001

Integer and FP Execution Units BTB/Branch Prediction Pentium 4 Block Diagram Asanovic/DevadasSpring 2002 6.823 System Bus Bus Unit Level 1 Data Cache Level 2 Cache Execution Units Memory Subsystem Integer and FP Execution Units Out-of-order execution logic Trace Cache Microcode ROM Fetch/Decode Retirement Branch History Update BTB/Branch Prediction Front End Out-of-order Engine

Asanovic/DevadasSpring 2002 6.823 P-III vs. P-4 Pipelines Basic Pentium lll Processor Misprediction Pipeline Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/ Sch Dispatch Exec Basic Pentium 4 Processor Misprediction Pipeline TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Disp Disp RF RF Ex Flgs Br Ck Drive In same process technology, ~1.5x clock frequency Performance Equation: Time = Instructions * Cycles * Time Program Program Instruction Cycle

Apple Marketing for G4 Shorter data pipeline Asanovic/DevadasSpring 2002 6.823 Apple Marketing for G4 Shorter data pipeline The performance advantage of the PowerPC G4 starts with its data pipeline. The term “processor pipeline” refers to the number of processing steps, or stages, it takes to accomplish a task. The fewer the steps, the shorter — and more efficient — the pipeline. Thanks to its efficient 7-stage design (versus 20 stages for the Pentium 4 processor) the G4 processor can accomplish a task with 13 fewer steps than the PC. You do the math.

Relative Frequency of Intel Designs Asanovic/DevadasSpring 2002 6.823 Relative Frequency of Intel Designs Relative Frequency Over time, fewer logic levels per pipeline stage and more advanced circuit design Higher frequency in same process technology

Deep Pipeline Design Greater potential throughput but: Asanovic/DevadasSpring 2002 6.823 Deep Pipeline Design Greater potential throughput but: Clock uncertainty and latch delays eat into cycle time budget – doubling pipeline depth gives less than twice frequency improvement Clock load and power increases – more latches running at higher frequencies More complicated microarchitecture needed to cover long branch mispredict penalties and cache miss penalties – from Little’s Law, need more instructions in flight to cover longer latencies larger reorder buffers P-4 has three major clock domains – Double pumped ALU (3 GHz), small critical area at highest speed – Main CPU pipeline (1.5 GHz) – Trace cache (0.75 GHz), save power

Pentium 4 Trace Cache Holds decoded uops in predicted program flow Asanovic/DevadasSpring 2002 6.823 Holds decoded uops in predicted program flow order, 6 uops per line Code in memory cmp Code packed in trace cache (6 uops/line) Trace cache fetches one 6 uop line every 2 CPU clock cycles (runs at ½ main CPU rate) br T1 ... T1: sub br T2 T2: mov sub br T3 T3: add sub mov br T4 T4: cmp br T1 sub br T2 mov sub br T3 add sub mov br T4 T4:...

Trace Cache Advantages Asanovic/DevadasSpring 2002 6.823 Trace Cache Advantages Removes x86 decode from branch mispredict penalty – Parallel x86 decoder took 2.5 cycles in P6, would be 5 cycles in P-4 design Allows higher fetch bandwidth fetch for correctly predicted taken branches – P6 had one cycle bubble for correctly predicted taken branches – P-4 can fetch a branch and its target in same cycle Saves energy – x86 decoder only powered up on trace cache refill

Pentium 4 Front End L2 Cache Front End BTB Inst. Prefetch & TLB Asanovic/DevadasSpring 2002 6.823 Pentium 4 Front End L2 Cache x86 instructions, 8 Bytes/cycle Fetch Buffer Front End BTB (4K Entries) Inst. Prefetch & TLB Single x86 instruction/cycle x86 Decoder 4 uops/cycle Trace Cache Fill Buffer 6 uops/line Trace Cache (12K uops) Translation from x86 instructions to internal uops only happens on trace cache miss, one x86 instruction per cycle. Translations are cached in trace cache.

16-entry subroutine return address stack Asanovic/DevadasSpring 2002 6.823 P-4 Trace Cache Fetch 1 2 TC Next IP (BTB) Trace Cache (12K uops, 2K lines of 6 uops) 3 4 5 6 7 8 9 TC Fetch Drive Alloc Rename Queue Schedule 1 Schedule 2 Schedule 3 Dispatch 1 Dispatch 2 Register File 1 Register File 2 Execute Flags Branch Check Drive Trace IP Trace BTB (512 entries) 16-entry subroutine return address stack Microcode ROM 6 uops every two CPU cycles uop buffer 3 uops/cycle 10 11 12 13 14 15 16 17 18 19 20

P-III vs. P-4 Renaming 1 2 TC Next IP (BTB) 3 4 5 6 7 8 9 TC Fetch Asanovic/DevadasSpring 2002 6.823 P-III vs. P-4 Renaming 1 2 TC Next IP (BTB) 3 4 5 6 7 8 9 TC Fetch Drive Alloc Rename Queue Schedule 1 Schedule 2 Schedule 3 Dispatch 1 Dispatch 2 Register File 1 Register File 2 Execute Flags Branch Check Drive 10 11 12 13 14 15 16 17 18 19 20 P-4 physical register file separated from ROB status. ROB entries allocated sequentially as in P6 family. One of 128 physical registers allocated from free list. No data movement on retire, only Retirement RAT updated.

P-4 µOp Queues and Schedulers Asanovic/DevadasSpring 2002 6.823 1 2 TC Next IP (BTB) Allocated/Renamed uops 3 uops/cycle 3 4 5 6 7 8 9 TC Fetch Drive Alloc Uop queues are in-order within each queue Memory uop Queue Arithmetic uop Queue Rename Queue Schedule 1 Schedule 2 Schedule 3 Dispatch 1 Dispatch 2 Register File 1 Register File 2 Execute Flags Branch Check Drive Fast Scheduler(x2) Memory Scheduler General Scheduler Simple FP Scheduler 10 11 12 13 14 15 16 17 18 19 20 Ready uops compete for dispatch ports (Fast schedulers can each dispatch 2 ALU operations per cycle)

P-4 Execution Ports Schedulers compete for access to execution ports Asanovic/DevadasSpring 2002 6.823 P-4 Execution Ports Exec Port 0 Exec Port 01 Load Port Store Port ALU (Double Speed) ALU (Double Speed) Integer Operation FP execute Memory Load Memory Store FP Move Add/Sub Logic Store Data Branches Shift/rotate FP/SSE-Add FP/SSE Mul FP/SSE Div MMX FP/SSE Move FP/SSE Store FXCH Add/Sub All loads LEA SW prefetch Store Address Schedulers compete for access to execution ports Loads and stores have dedicated ports ALUs can execute two operations per cycle Peak bandwidth of 6 uops per cycle – load, store, plus four double-pumped ALU operations

P-4 Fast ALUs and Bypass Path Asanovic/DevadasSpring 2002 6.823 P-4 Fast ALUs and Bypass Path Register File and Bypass Network L1 Data Cache Fast ALUs and bypass network runs at double speed All “non-essential” circuit paths handled out of loop to reduce circuit loading (shifts, mults/divs, branches, flag/ops) Other bypassing takes multiple clock cycles

P-4 Staggered ALU Design Asanovic/DevadasSpring 2002 6.823 P-4 Staggered ALU Design Staggers 32-bit add and flag compare into three ½ cycle phases – low 16 bits – high 16 bits – flag checks Bypass 16 bits around every ½ cycle – back-back dependent 32-bit adds at 3GHz in 0.18µm L1 Data Cache access starts with bottom 16 bits as index, top 16 bits used as tag check later

Long delay from schedulers to load hit/miss Asanovic/DevadasSpring 2002 6.823 P-4 Load Schedule Speculation 1 2 TC Next IP (BTB) 3 4 5 6 7 8 9 TC Fetch Drive Alloc Rename Queue Schedule 1 Schedule 2 Schedule 3 Dispatch 1 Dispatch 2 Register File 1 Register File 2 Load Execute 1 Load Execute 2 Branch Check Drive 10 11 12 13 14 15 16 17 18 19 20 Long delay from schedulers to load hit/miss P-4 guesses that load will hit in L1 and schedules dependent operations to use value If load misses, only dependent operations are replayed

P-4 Branch Penalty 20 cycle branch mispredict penalty 1 2 TC Next IP Asanovic/DevadasSpring 2002 6.823 P-4 Branch Penalty 1 2 TC Next IP (BTB) 3 4 5 6 7 8 9 TC Fetch Drive Alloc 20 cycle branch mispredict penalty Rename Queue Schedule 1 Schedule 2 Schedule 3 Dispatch 1 Dispatch 2 Register File 1 Register File 2 Execute Flags Branch Check Drive 10 11 12 13 14 15 16 17 18 19 20 P-4 uses new “trade secret” branch prediction algorithm Intel claims 1/3 fewer mispredicts than P6 algorithm

Integer Register File / Bypass Network L1 Data Cache (8kbyte 4-way) Asanovic/DevadasSpring 2002 6.823 P-4 Microarchitecture Front-End BTB (4K Entries) Instruction TLB/Prefetcher System Bus Instruction Decoder Microcode ROM Trace Cache BTB (512 Entries) Trace Cache (12 uops) Quad Pumped 3.2 GB/s Uop Queue Allocator / Register Renamer Memory Uop Queue Interger/Floating Point uop Queue Bus Interface Unit Memory Scheduler Fast Slow/General FP Scheduler Simple FP Integer Register File / Bypass Network FP Register / Bypass L2 Cache (256k Byre 8-way) AGU AGU 2x ALU 2x ALU Slow ALU FP MMX SSE SSE2 FP Move Store Address Simple lnstr Load Address Simple lnstr Complet lnstr L1 Data Cache (8kbyte 4-way)

Instruction Fetch Unit: Pentium-III Die Photo Asanovic/DevadasSpring 2002 6.823 Programmable External and Backside Packed FP Datapaths Interrupt Control Clock Bus Logic Page Miss Handler 16KB 4-way s.a. D$ Integer Datapaths Floating-Point Datapaths Memory Order Buffer Memory Interface Unit (convert floats to/from memory format) MMX Datapaths Register Alias Table Allocate entries (ROB, MOB, RS) Reservation 256KB 8-way s.a. Instruction Fetch Unit: 16KB 4-way s.a. I-cache Station Branch Address Calc Reorder Buffer (40-entry physical regfile + architect. regfile) Microinstruction Sequencer Instruction Decoders: 3 x86 insts/cycle

Scaling of Wire Delay Asanovic/DevadasSpring 2002 6.823 Over time, transistors are getting relatively faster than long wires – wire resistance growing dramatically with shrinking width and height – capacitance roughly fixed for constant length wire – RC delays of fixed length wire rising Chips are getting bigger – P-4 >2x size of P-III Clock frequency rising faster than transistor speed – deeper pipelines, fewer logic gates per cycle – more advanced circuit designs (each gate goes faster) ⇒ Takes multiple cycles for signal to cross chip

Visible Wire Delay in P-4 Design driving signals across chip! Asanovic/DevadasSpring 2002 6.823 Visible Wire Delay in P-4 Design 1 2 TC Next IP (BTB) 3 4 5 6 7 8 9 TC Fetch Drive Alloc Rename Queue Schedule 1 Schedule 2 Schedule 3 Dispatch 1 Dispatch 2 Register File 1 Register File 2 Execute Flags Branch Check Drive 10 11 12 13 14 15 16 17 18 19 20 Pipeline stages dedicated to just driving signals across chip!

Instruction Set Translation Asanovic/DevadasSpring 2002 6.823 Instruction Set Translation Convert a target ISA into a host machine’s ISA Pentium Pro (P6 family) – translation in hardware after instruction fetch Pentium-4 family – translation in hardware at level 1 instruction cache refill Transmeta Crusoe – translation in software using “Code Morphing”