Pentium Pro Case Study Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.



Pentium Pro Case Study
Microarchitecture
– Order-3 superscalar
– Out-of-order execution
– Speculative execution
– In-order completion
Design Methodology
Performance Analysis
Retrospective

Goals of P6 Microarchitecture
IA-32 Compliant
Performance (Frequency - IPC)
Validation
Die Size
Schedule
Power

P6 – The Big Picture

Memory Hierarchy
Level 1 instruction and data caches - 2-cycle access time
Level 2 unified cache - 6-cycle access time
Separate level 2 cache and memory address/data bus

Instruction Fetch
[Figure: instruction fetch datapath. Legible labels: next-address logic and other fetch requests select the fetch address; instruction cache (8 KB, 2-cycle), victim cache, instruction stream buffer, instruction TLB (physical address), L2 cache (256 KB), branch target buffer (512 entries) supplying prediction and branch target; instruction data mux, instruction buffer, instruction rotator, and instruction length decoder producing length marks and prediction marks; 16 bytes plus marks are delivered to decode.]

Instruction Cache Unit

Branch Target Buffer
[Figure: set-associative BTB. Each entry in each way holds a fetch address tag, a 4-bit BHR, a branch offset, a 4-bit speculative BHR, and a target address. The fetch address drives the tag compare; a PHT with 16 entries per set, a return stack, and prediction control logic produce the prediction and target address.]
Pattern History Table (PHT) is not speculatively updated
A speculative Branch History Register (BHR) and prediction state are maintained
Uses the speculative prediction state if it exists for that branch

Branch Prediction Algorithm
Current prediction updates the speculative history prior to the next instance of the branch instruction
Branch History Register (BHR) is updated during branch execution
Branch recovery flushes front-end and drains the execution core
Branch mis-prediction resets the speculative branch history state to match BHR
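The interplay between the committed and speculative histories can be sketched as a toy model. This is a minimal sketch, not the P6 implementation: the class name, the initial counter values, the use of 2-bit saturating counters, and the choice to train the PHT with the committed history as index are all assumptions made for illustration. Only the 4-bit history, the 16-entry PHT, and the update/reset rules come from the slides.

```python
class TwoLevelPredictor:
    """Toy model of one BTB entry: a 4-bit history selects one of 16 counters."""

    HISTORY_BITS = 4

    def __init__(self):
        # 2-bit saturating counters, initialised to weakly taken (assumption).
        self.pht = [2] * (1 << self.HISTORY_BITS)
        self.bhr = 0       # committed history, updated at branch execution
        self.spec_bhr = 0  # speculative history, updated at prediction time

    def _shift(self, history, taken):
        mask = (1 << self.HISTORY_BITS) - 1
        return ((history << 1) | int(taken)) & mask

    def predict(self):
        # Predict with the speculative history, then fold the prediction
        # into it so the next instance of the branch sees it.
        taken = self.pht[self.spec_bhr] >= 2
        self.spec_bhr = self._shift(self.spec_bhr, taken)
        return taken

    def resolve(self, predicted, actual):
        # The PHT is trained only with real outcomes (never speculatively).
        idx = self.bhr
        if actual:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        self.bhr = self._shift(self.bhr, actual)
        if predicted != actual:
            # Misprediction: speculative history is reset to match the BHR.
            self.spec_bhr = self.bhr


if __name__ == "__main__":
    p = TwoLevelPredictor()
    wrong = 0
    for i in range(40):
        actual = (i % 2 == 0)  # alternating taken / not-taken branch
        guess = p.predict()
        wrong += guess != actual
        p.resolve(guess, actual)
    print("mispredictions:", wrong)
```

In this model each prediction is resolved before the next one is made, so the two histories diverge only briefly; in a real pipeline several predictions can be in flight, which is why the separate speculative copy matters.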

Instruction Decode - 1
Branch instruction detection
Branch address calculation - Static prediction and branch always execution
One branch decode per cycle (break on branch)

Instruction Decode - 2
Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled
Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0

What is a uop?
Small two-operand instruction - Very RISC like.
IA-32 instruction:
  add (eax),(ebx)      MEM(eax) <- MEM(eax) + MEM(ebx)
Uop decomposition:
  ld  guop0,(eax)      guop0 <- MEM(eax)
  ld  guop1,(ebx)      guop1 <- MEM(ebx)
  add guop0,guop1      guop0 <- guop0 + guop1
  sta eax
  std guop0            MEM(eax) <- guop0
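To check that the five uops really compute the same result as the single memory-to-memory add, the sequence can be stepped through on a toy machine state. This is a sketch under stated assumptions: registers and memory are plain Python dicts, the function name `run_add_uops` is hypothetical, and arithmetic wraps at 32 bits.

```python
def run_add_uops(regs, mem):
    """Execute the uop sequence for `add (eax),(ebx)` on dict-based state."""
    temps = {}
    temps["guop0"] = mem[regs["eax"]]   # ld  guop0,(eax)
    temps["guop1"] = mem[regs["ebx"]]   # ld  guop1,(ebx)
    # add guop0,guop1 (32-bit wraparound is an assumption of this sketch)
    temps["guop0"] = (temps["guop0"] + temps["guop1"]) & 0xFFFFFFFF
    store_addr = regs["eax"]            # sta eax   (store address)
    mem[store_addr] = temps["guop0"]    # std guop0 (store data)
    return mem


if __name__ == "__main__":
    regs = {"eax": 0x100, "ebx": 0x200}
    mem = {0x100: 5, 0x200: 7}
    print(run_add_uops(regs, mem))
```

Splitting the store into `sta` and `std` lets the address be computed and checked against later loads before the data is ready.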

Instruction Dispatch
Register Renaming
Allocation requirements
– “3-or-none”
– Reorder buffer entries
– Reservation station entry
– Load buffer or store buffer entry
Dispatch buffer “probably” dispatches all 3 uops before re-fill
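One way to read the "3-or-none" rule is as an all-or-nothing resource check: either every uop in the group gets its entries, or the whole group stalls. The sketch below assumes that reading; the function name, the dict keys, and the uop kinds (`"alu"`, `"load"`, `"store"`) are hypothetical.

```python
def try_allocate(free, uops):
    """Allocate entries for a group of uops, or nothing at all.

    free: dict of free-entry counts with keys 'rob', 'rs', 'load', 'store'.
    uops: list of uop kinds, each 'alu', 'load', or 'store'.
    Returns True and updates `free` on success; returns False otherwise.
    """
    need = {
        "rob": len(uops),            # every uop needs a reorder buffer entry
        "rs": len(uops),             # and a reservation station entry
        "load": uops.count("load"),  # loads also need a load buffer entry
        "store": uops.count("store") # stores also need a store buffer entry
    }
    if any(free[k] < need[k] for k in need):
        return False                 # stall: allocate none of the group
    for k in need:
        free[k] -= need[k]
    return True


if __name__ == "__main__":
    free = {"rob": 40, "rs": 2, "load": 16, "store": 12}
    print(try_allocate(free, ["load", "alu", "store"]), free)
```

The all-or-nothing choice keeps the allocator simple at the cost of occasionally stalling uops that could have been allocated on their own.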

Register Renaming - 1
Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags
The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register
Execution results are stored in the ROB
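The RAT's role can be sketched as a small lookup table that maps each architectural register either to a ROB entry number or to the retirement register file. The class and method names below are hypothetical, and the ROB size is a parameter of the sketch rather than a claim about the hardware.

```python
class Renamer:
    """Toy RAT: architectural register -> ROB entry holding its newest value."""

    def __init__(self, rob_size=40):
        self.rat = {}            # a register absent here lives in the RRF
        self.next_tag = 0
        self.rob_size = rob_size

    def rename(self, dest, srcs):
        # Sources are looked up first, so `add eax, eax` reads the old eax.
        src_tags = [
            ("ROB", self.rat[s]) if s in self.rat else ("RRF", s)
            for s in srcs
        ]
        tag = self.next_tag      # the ROB entry number doubles as the tag
        self.next_tag = (self.next_tag + 1) % self.rob_size
        self.rat[dest] = tag     # RAT now points at the most recent producer
        return tag, src_tags


if __name__ == "__main__":
    r = Renamer()
    print(r.rename("eax", ["eax", "ebx"]))  # eax <- eax + ebx
    print(r.rename("ecx", ["eax"]))         # ecx <- eax (reads the ROB copy)
```

The sketch does not model retirement clearing RAT entries or the draining behaviour on branch recovery.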

Register Renaming - Example

Challenges to Register Renaming

Out-of-Order Execution Engine
In-order branch issue and execution
In-order load/store issue to address generation units
Instruction execution and result bus scheduling
Is the reservation station “truly” centralized & what is “binding”?

Reservation Station
Cycle 1
– Order checking
– Operand availability
Cycle 2
– Writeback bus scheduling

Memory Ordering Buffer (MOB)
Load buffer retains loads until completed, for coherency checking
Store forwarding out of store buffers
2 cycle latency through MOB
“Store Coloring” - Load instructions are tagged by the last store
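Store coloring and store forwarding can be illustrated together: a load carries the ID of the last store ahead of it in program order, and only stores up to that color are candidates for forwarding. This is a simplified sketch; it assumes every older store already has a known address and data, matches whole addresses only, and uses hypothetical names throughout.

```python
def load_value(load_addr, color, store_buffer, memory):
    """Return the value a load should observe.

    color: ID of the last store preceding the load in program order
           (-1 if there is none).
    store_buffer: list of (store_id, addr, data) in program order.
    memory: dict mapping address -> value for committed state.
    """
    # Scan from youngest to oldest so the most recent matching store wins.
    for store_id, addr, data in reversed(store_buffer):
        if store_id <= color and addr == load_addr:
            return data  # forwarded from the store buffer
    return memory.get(load_addr, 0)


if __name__ == "__main__":
    stores = [(0, 0x10, 1), (1, 0x10, 2), (2, 0x10, 3)]
    print(load_value(0x10, 1, stores, {0x10: 99}))
```

Stores younger than the load's color are ignored even if their addresses match, which is what keeps the load from seeing a value written after it in program order.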

Instruction Completion
Handles all exception/interrupt/trap conditions
Handles branch recovery
– OOO core drains out right-path instructions, commits to RRF
– In parallel, front end starts fetching from target/fall-through
– However, no renaming is allowed until OOO core is drained
– After draining is done, RAT is reset to point to RRF
– Avoids checkpointing RAT, recovering to intermediate RAT state
Commits execution results to the architectural state in-order
– Retirement Register File (RRF)
– Must handle hazards to RRF (writes/reads in same cycle)
– Must handle hazards to RAT (writes/reads in same cycle)
“Atomic” IA-32 instruction completion
– uops are marked as 1st or last in sequence
– exception/interrupt/trap boundary
2 cycle retirement
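In-order commit with atomic IA-32 completion can be sketched as a loop over the head of the ROB that retires one whole instruction at a time. The sketch below is an illustration only: each uop is a dict with hypothetical keys `done`, `last`, `dest`, and `value`, and the per-cycle retirement width is deliberately not modeled.

```python
from collections import deque


def retire(rob, rrf):
    """Retire completed IA-32 instructions from the head of the ROB, in order.

    rob: deque of uop dicts with keys 'done', 'last', 'dest', 'value'.
    rrf: dict standing in for the retirement register file.
    Returns the number of IA-32 instructions retired.
    """
    retired = 0
    while rob:
        # Collect the uops of the oldest instruction, up to its 'last' marker.
        group = []
        for uop in rob:
            group.append(uop)
            if uop["last"]:
                break
        else:
            break  # the tail of this instruction is not in the ROB yet
        if not all(u["done"] for u in group):
            break  # in-order: nothing younger may commit either
        for u in group:
            rob.popleft()
            if u["dest"] is not None:
                rrf[u["dest"]] = u["value"]  # commit to architectural state
        retired += 1
    return retired


if __name__ == "__main__":
    rob = deque([
        {"done": True, "last": False, "dest": None, "value": None},
        {"done": True, "last": True, "dest": "eax", "value": 12},
        {"done": False, "last": True, "dest": "ebx", "value": None},
    ])
    rrf = {}
    print(retire(rob, rrf), rrf, len(rob))
```

Because commit happens only at instruction boundaries, an exception or interrupt taken between calls always sees a state in which every IA-32 instruction has either fully completed or not started.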

Pentium Pro Design Methodology - 1

Pentium Pro Performance Analysis
Observability
– On-chip event counters
– Dynamic analysis
Benchmark Suite
– BAPco Sysmark bit Windows NT applications
– Winstone bit Windows NT applications
– Some SPEC95 benchmarks

Performance – Run Times

Performance – IPC vs. uPC

Performance – Cache Misses

Performance – Branch Prediction

Conclusions
IA-32 Compliant
Performance (Frequency - IPC)
– ISpec FSpec SPECint SPECfp95
Validation
Die Size - Fabable
Schedule - 1 year late
Power -

Retrospective
Most commercially successful microarchitecture in history
Evolution
– Pentium II/III, Xeon, etc.
    Derivatives with on-chip L2, ISA extensions, etc.
– Replaced by Pentium 4 as flagship in 2001
    High frequency, deep pipeline, extreme speculation
– Resurfaced as Pentium M in 2003
    Initially a response to Transmeta in laptop market
    Pentium 4 derivative (90nm Prescott) delayed, slow, hot
– Core Duo, Core 2 Duo, Core i7 replaced Pentium 4

Microarchitectural Updates
Pentium M (Banias), Core Duo (Yonah)
– Micro-op fusion (also in AMD K7/K8)
    Multiple uops in one: (add eax,[mem] => ld/alu), sta/std
    These uops decode/dispatch/commit once, issue twice
– Better branch prediction
    Loop count predictor
    Indirect branch predictor
– Slightly deeper pipeline (12 stages)
    Extra decode stage for micro-op fusion
    Extra stage between issue and execute (for RS/PLRAM read)
– Data-capture reservation station (payload RAM)
    Clock gated for 32 (int), 64 (fp), and 128 (SSE) operands

Microarchitectural Updates
Core 2 Duo (Merom)
– 64-bit ISA from AMD K8
– Macro-op fusion
    Merge uops from two x86 ops
    E.g. cmp, jne => cmpjne
– 4-wide decoder (Complex + 3x Simple)
    Peak x86 decode throughput is 5 due to macro-op fusion
– Loop buffer
    Loops that fit in 18-entry instruction queue avoid fetch/decode overhead
– Even deeper pipeline (14 stages)
– Larger reservation station (32), instruction window (96)
– Memory dependence prediction
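The peak figure of 5 x86 instructions per cycle from a 4-wide decoder follows from one fused pair occupying a single decode slot. A minimal sketch of that accounting, assuming at most one fusion per cycle, treating instructions as bare mnemonic strings, and ignoring the complex/simple decoder distinction (the function name is hypothetical):

```python
def decode_cycle(instrs, width=4):
    """Fill up to `width` decode slots from a list of x86 mnemonics.

    Returns (slots, consumed): the contents of the decode slots and the
    number of x86 instructions consumed this cycle.
    """
    slots = []
    i = 0
    fused = False
    while i < len(instrs) and len(slots) < width:
        pair = instrs[i:i + 2]
        if not fused and pair == ["cmp", "jne"]:
            slots.append("cmpjne")  # two x86 ops share one slot
            i += 2
            fused = True            # one fusion per cycle (assumption)
        else:
            slots.append(instrs[i])
            i += 1
    return slots, i


if __name__ == "__main__":
    print(decode_cycle(["add", "cmp", "jne", "mov", "sub", "xor"]))
```

Without a fusible pair in the window, the same function consumes at most `width` instructions, which matches the unfused peak of 4.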

Microarchitectural Updates
Nehalem (Core i7/i5/i3)
– RS size 36, ROB 128
– Loop cache up to 28 uops
– L2 branch predictor
– L2 TLB
– I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M
– Simultaneous multithreading
– RAS now renamed (repaired)
– 6 issue, 48 load buffers, 32 store buffers
– New system interface (QPI) – finally dropped front-side bus
– Integrated memory controller (up to 3 channels)
– New STTNI instructions for string/text handling

Microarchitectural Updates
Sandy Bridge (2nd generation Core i7)
– On-chip integrated graphics (GPU)
– Decoded uop cache up to 1.5K uops, handles loops, but more general
– 54-entry RS, 168-entry ROB
– Physical register file: 144 FP, 160 integer
– 256-bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle
– 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128-bit load path from L1 D$
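The AVX throughput figures can be reproduced with a short calculation. The sketch assumes the peak comes from two 256-bit floating-point pipes issuing in the same cycle (one multiply and one add); that pipe count is an assumption of this example, not something the slide states, and the function name is hypothetical.

```python
def peak_flops_per_cycle(vector_bits, element_bits, pipes=2):
    """Peak FLOPs per cycle: vector lanes times independent FP pipes."""
    lanes = vector_bits // element_bits
    return lanes * pipes


if __name__ == "__main__":
    print("DP:", peak_flops_per_cycle(256, 64))  # double precision
    print("SP:", peak_flops_per_cycle(256, 32))  # single precision
```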