CS6290 Pentiums. Case Study 1: Pentium-Pro, the basis for Centrino, Core, and Core 2. (We’ll also look at the P4 after this.)


Case Study 1: Pentium-Pro, the basis for Centrino, Core, and Core 2. (We’ll also look at the P4 after this.)

Hardware Overview
– RS: 20 entries, unified (issue/alloc)
– ROB: 40 entries (commit)

Speculative Execution & Recovery (FE = front end, OOO = out-of-order core; the numbering distinguishes pre- and post-flush instruction streams):
1. FE1 / OOO1: normal execution; speculatively fetch and execute instructions.
2. FE1 / OOO1: the OOO core detects a misprediction; flush the FE and start refetching.
3. FE2 / OOO1: new instructions are fetched, but the OOO core still contains wrong-path uops.
4. FE2 / OOO1: the OOO core has drained; retire the bad branch and flush the rest of the OOO core.
5. FE2 / OOO2: normal execution resumes; speculatively fetch and execute instructions.

Branch Prediction
The BTB holds a tag, target, and branch history per entry, with 2-bit counters. If the PC hits in the BTB, use the dynamic predictor. On a miss, stall until decode and use a static predictor:
– Return? Yes → predict taken
– PC-relative? No → indirect jump → predict taken
– Conditional? No → predict taken
– Backwards? Yes → predict taken; No → predict not taken
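
The slide’s static-prediction fallback can be sketched as a short decision chain (a sketch only; the predicate names and their ordering are my reading of the slide’s flowchart, not Intel’s documented logic):

```python
def static_predict(is_return, is_pc_relative, is_conditional, is_backwards):
    """Static prediction on a BTB miss, following the slide's decision
    chain (predicate names are illustrative)."""
    if is_return:
        return "taken"        # returns: predict taken
    if not is_pc_relative:
        return "taken"        # indirect jump: predict taken
    if not is_conditional:
        return "taken"        # unconditional PC-relative: predict taken
    if is_backwards:
        return "taken"        # backward conditional: likely a loop branch
    return "not taken"        # forward conditional: predict not taken
```

Only forward PC-relative conditional branches come out "not taken"; every other case on the slide predicts taken.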

Micro-op Decomposition: CISC → RISC
– Simple x86 instructions map to a single uop. Ex.: INC, ADD (r-r), XOR, MOV (r-r, load).
– Moderately complex instructions map to a few uops. Ex.: store → STA/STD; ADD (r-m) → LOAD/ADD; ADD (m-r) → LOAD/ADD/STA/STD.
– More complex instructions make use of the UROM. Ex.: PUSHA → STA/STD/ADD, STA/STD/ADD, …
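
The decomposition can be sketched as a lookup table (uop names STA/STD follow the slide; the per-form sequences are illustrative, not Intel’s exact uop flows):

```python
# Illustrative x86-form -> uop-sequence map (STA = store address,
# STD = store data; sequences are a sketch).
UOP_MAP = {
    "INC r":    ["INC"],                        # simple: single uop
    "ADD r,r":  ["ADD"],
    "MOV r,m":  ["LOAD"],                       # load form: single uop
    "MOV m,r":  ["STA", "STD"],                 # a store splits in two
    "ADD r,m":  ["LOAD", "ADD"],                # load-op
    "ADD m,r":  ["LOAD", "ADD", "STA", "STD"],  # read-modify-write
}

def uop_count(inst_form):
    """How many uops a given instruction form expands into."""
    return len(UOP_MAP[inst_form])
```

The read-modify-write form is the "moderately complex" boundary case: exactly 4 uops, the most the first decoder can handle before the UROM takes over.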

Decoder Limitation
– Decode up to three instructions per cycle: three decoders, but asymmetric.
– Only the first decoder can handle moderately complex instructions (those that can be encoded with up to 4 uops).
– Instructions needing more than 4 uops go to the UROM.
(The slide’s diagram walks through example instruction sequences A–E showing how they group under this restriction.)
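
The asymmetric restriction can be sketched as a grouping function (a toy model: the widths and the one-group-per-cycle policy are assumptions, not the exact P6 decode rules):

```python
def decode_group(uop_counts, widths=(4, 1, 1)):
    """Group a stream of per-instruction uop counts into decode cycles
    under an asymmetric decoder restriction (sketch): only decoder 0
    (width 4) handles multi-uop instructions; decoders 1 and 2 take
    single-uop instructions only."""
    assert all(c <= widths[0] for c in uop_counts), \
        "instructions of more than 4 uops go to the UROM instead"
    cycles, i = [], 0
    while i < len(uop_counts):
        group = []
        for w in widths:
            if i >= len(uop_counts) or uop_counts[i] > w:
                break  # this instruction must wait for decoder 0 next cycle
            group.append(uop_counts[i])
            i += 1
        cycles.append(group)
    return cycles
```

Note how a multi-uop instruction in the second or third slot ends the group early: ordering instructions so complex ones land on decoder 0 was a real scheduling concern for compilers targeting this core.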

“Simple” Core
– After decode, the machine deals only with uops until commit: rename, RS, ROB, …
– It looks just like a RISC-based OOO core.
– A couple of changes deal with x86: flags and partial register writes.

Execution Ports
– Unified RS, multiple ALUs. Ex.: two adders. What if multiple ADDs are ready at the same time? We would need to choose 2-of-N and make assignments.
– To simplify, each ADD is assigned to an adder during the Alloc stage, and each ADD can only attempt to execute on its assigned adder: if my assigned adder is busy, I can’t go even if the other adder is idle.
– This reduces the selection problem to choosing 1-of-N (easier logic).
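
The alloc-time binding can be sketched as follows (round-robin assignment is an assumption for illustration; the point is only that select becomes a per-adder 1-of-N choice):

```python
import itertools

def bind_at_alloc(add_uops, adders=("adder0", "adder1")):
    """Bind each ADD uop to one adder at alloc time (sketch: round-robin).
    At select time, each adder then picks among only its own uops."""
    rr = itertools.cycle(adders)
    return {uop: next(rr) for uop in add_uops}

def select(binding, ready_uops, adder):
    """1-of-N select: pick one ready uop bound to this adder, if any."""
    for uop in ready_uops:
        if binding[uop] == adder:
            return uop
    return None
```

If two ready ADDs happen to be bound to the same adder, one of them stalls even while the other adder sits idle, which is exactly the simplification/performance tradeoff the slide describes.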

Execution Ports (con’t)
RS entries issue to five ports (reconstructing the slide’s diagram):
– Port 0: IEU0, Fadd, Fmul, Imul, Div
– Port 1: IEU1, JEU
– Port 2: load AGU (LDA)
– Port 3: store-address AGU (STA)
– Port 4: store data (STD)
Memory uops go through the Memory Ordering Buffer (a.k.a. LSQ) to the data cache. In theory, the core can execute up to 5 uops per cycle… assuming they match the ALUs exactly.

RISC → CISC Commit
– The external world doesn’t know about uops, so instruction commit must be all-or-nothing: either commit all uops from an instruction or none.
– Ex.: ADD [EBX], ECX becomes
  LOAD tmp0 = [EBX]
  ADD tmp1 = tmp0, ECX
  STA EBX
  STD tmp1
– If the load takes a page fault, if the store takes a protection fault, if … then none of the uops may commit.
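
All-or-nothing retirement can be sketched as a toy check over the oldest instruction’s uops (flag names are illustrative):

```python
def try_commit(uop_states):
    """uop_states: (completed, faulted) flags for every uop of the
    oldest x86 instruction, oldest first. Retire all or none (sketch)."""
    if any(faulted for _, faulted in uop_states):
        return 0   # e.g. load page fault or store protection fault:
                   # squash the whole instruction, take the exception
    if not all(completed for completed, _ in uop_states):
        return 0   # some uop still in flight: retire nothing this cycle
    return len(uop_states)  # retire the whole instruction atomically
```

A fault on any one uop (say, the STA’s protection check) blocks the LOAD and ADD from committing too, so the architectural state never reflects a half-executed x86 instruction.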

Case Study 2: Intel P4
Primary objectives:
– Clock speed: implies performance (true only if CPI does not increase too much).
– Marketability (GHz sells!): again, clock speed.

Faster Clock Speed → Less Work per Cycle
– Traditional single-cycle tasks may become multi-cycle: more pipeline bubbles, idle resources.
– More pipeline stages: more control logic (each stage needs control) and more circuits to design (more engineering effort).
– More critical paths: more timing paths are at or close to the clock period, so there is less benefit from tuning the worst paths.
– Higher power: P = ½CV²f.
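
The last bullet’s linear dependence on f can be made concrete with a tiny numeric check (the capacitance and voltage values below are made up for illustration):

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Dynamic switching power, P = 1/2 * C * V^2 * f (the slide's formula)."""
    return 0.5 * c_farads * v_volts ** 2 * f_hz

# Doubling frequency alone doubles dynamic power:
p_2ghz = dynamic_power(1e-9, 1.5, 2e9)   # 2.25 W with these toy values
p_4ghz = dynamic_power(1e-9, 1.5, 4e9)
```

In practice the penalty is worse than linear, because sustaining a higher f typically also requires raising V, which enters squared.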

Extra Delays Needed
– The branch-mispredict pipeline has 2 “Drive” stages: extra delay because the P4 can’t get a signal from point A to point B in less than a cycle.
– Side note: the P4 does not have a “20-stage pipeline”; it’s much longer!

Make the Common Case Fast
Fetch:
– Usually an I$ hit.
– Branches are frequent, and often taken.
– Branch mispredictions are not that infrequent; even if their frequency is low, their cost is high (a pipe flush).
The P4 uses a “Trace Cache”: it caches the dynamic instruction stream, in contrast to an I$, which caches the static instruction image.

Traditional Fetch/I$
– Fetch from only one I$ line per cycle: if the fetch PC points to the last instruction in a line, all you get is one instruction. This is potentially worse for x86, since arbitrary byte-aligned instructions may straddle cache lines.
– Can only fetch instructions up to a taken branch.
– A branch misprediction causes a pipeline flush; its cost in cycles is roughly the number of stages from fetch to branch execute.

Trace Cache
– Multiple “I$ lines” per cycle.
– Can fetch past a taken branch, and even past multiple taken branches.
(Slide diagram: basic blocks A–F are spread across several traditional I$ lines, but stored contiguously as “ABCDEF” in a single trace-cache line.)
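
Fetching past taken branches amounts to following the predicted dynamic path when building a trace; a sketch (the data layout and names here are invented for illustration):

```python
def build_trace(code, start_pc, predicted_taken, max_len=6):
    """Build one trace line by walking the predicted path (sketch).
    code: {pc: (fallthrough_pc, taken_target)}, taken_target is None
    for non-branches; predicted_taken: set of branch pcs predicted taken."""
    trace, pc = [], start_pc
    while pc is not None and len(trace) < max_len:
        trace.append(pc)
        fallthrough, target = code[pc]
        # follow the taken target only for branches predicted taken
        pc = target if (target is not None and pc in predicted_taken) else fallthrough
    return trace

# Toy program: pc 1 is a branch to 10.
code = {0: (1, None), 1: (2, 10), 2: (3, None), 3: (None, None),
        10: (11, None), 11: (None, None)}
```

With the branch predicted taken, the trace jumps straight from 1 to 10; a traditional I$ fetch would have stopped at the taken branch.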

Decoded Trace Cache
The trace cache holds decoded x86 instructions (uops) instead of raw bytes, so on a branch mispredict the decode stage is not exposed in the pipeline depth. (Slide diagram: L2 → decoder → trace builder → trace cache → dispatch, renamer, allocator, etc.; the decoder sits off the mispredict loop, so it does not add to the mispredict pipeline depth.)

Less Common Case Slower
The trace cache is big:
– Decoded instructions take more room: raw x86 instructions take 1–15 bytes, while decoded uops all take the same amount of space.
– Instruction duplication: instruction “X” may be redundantly stored across traces (ABX, CDX, XYZ, EXY).
Tradeoffs:
– No I$: a Trace$ miss requires going to the L2.
– Decoder width = 1: a Trace$ hit delivers 3 uops fetched per cycle, but on a Trace$ miss only 1 uop is decoded (and therefore fetched) per cycle.

Addition
Common case: adds and simple ALU instructions. Typically an add must occur in a single cycle. The P4 “double-pumps” its adders for 2 adds per cycle: a 2.0 GHz P4 has 4.0 GHz adders. Slide example, for X = A + B followed by the dependent Y = X + C:
– Cycle 0: A[0:15] + B[0:15] → X[0:15]
– Cycle 0.5: A[16:31] + B[16:31] → X[16:31], while the dependent add starts on X[0:15] + C[0:15]
– Cycle 1: X[16:31] + C[16:31] finishes Y
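
The half-width staggering can be modeled functionally (a sketch of the dataflow only, not the circuit; 16-bit halves as on the slide):

```python
def staggered_add(a, b, half_bits=16):
    """Sketch of a double-pumped 32-bit add: low half first, then the
    high half plus the carry, one fast-clock phase apart. Returning the
    halves separately mirrors how a dependent add can consume the low
    half a half-cycle early."""
    mask = (1 << half_bits) - 1
    low_sum = (a & mask) + (b & mask)            # phase 0
    carry = low_sum >> half_bits
    low = low_sum & mask
    high = ((a >> half_bits) + (b >> half_bits) + carry) & mask  # phase 1
    return low, high
```

Reassembling the halves gives the ordinary 32-bit sum, so the staggering changes timing, not results.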

Common Case Fast
– So long as it is executing only simple ALU ops, the core can execute two dependent ops per cycle.
– With 2 ALUs, peak = 4 simple ALU ops per cycle. This can’t be sustained, since the T$ only delivers 3 uops per cycle, but it is still useful (e.g., after a D$ miss returns).

Less Common Case Slower
– An extra cycle of bypass is required when not doing only simple ALU ops: an operation may need an extra half-cycle to finish.
– Shifts are relatively slower in the P4 (compared to their latencies in the P3), which can reduce the performance of code optimized for older machines.

Common Case: Cache Hit
– Cache hits and misses complicate the dynamic scheduler, which needs to know instruction latency to schedule dependent instructions.
– The common case is a cache hit, so to keep the scheduler pipelined, just assume loads always hit.

Pipelined Scheduling
  A: MOV ECX, [EAX]
  B: XOR EDX, ECX
In cycle 3, start scheduling B assuming A hits in the cache; at cycle 10, A’s result bypasses to B, and B executes.

Less Common Case is Slower
  A: MOV ECX, [EAX]
  B: XOR EDX, ECX
  C: SUB EAX, ECX
  D: ADD EBX, EAX
  E: NOR EAX, EDX
  F: ADD EBX, EAX
The same assumption applied to a longer dependence chain: if A misses, B through F have all been scheduled as if it hit.

Replay
On a cache miss, dependents have been speculatively mis-scheduled:
– Wastes execution slots: other “useful” work could have executed instead.
– Wastes a lot of power.
– Adds latency: the miss is not known until cycle 9, so rescheduling of dependents only starts at cycle 10. The code could have executed faster had the miss been known earlier.
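
The latency cost can be sketched with toy numbers (the hit-bypass and miss-discovery cycles echo the slide’s example; the miss latency is an assumption for illustration, not a measured P4 value):

```python
def dependent_ready_cycle(hit, hit_bypass=10, miss_known=9, miss_latency=8):
    """When the load's dependent can actually execute (sketch).
    On a hit, the dependent was pre-scheduled and fires at the bypass
    cycle; on a miss, it is replayed only after the miss is discovered,
    then waits out the (assumed) miss latency."""
    if hit:
        return hit_bypass
    # rescheduling of dependents starts the cycle after the miss is known
    return miss_known + 1 + miss_latency
```

The gap between the two return values is the replay penalty on top of the unavoidable miss latency itself.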

P4 Philosophy Overview
Amdahl’s Law… make the common case fast!!!
– Trace cache
– Double-pumped ALUs
– Assuming cache hits
– There are other examples…
This resulted in very high frequency.

P4 Pitfall
Making the less common case too slow:
– Performance is determined by both the common case and the uncommon case.
– If the uncommon case is too slow, it can cancel out the gains from making the common case faster.
– Common by what metric? (It should be time.)
Lesson: beware of Slhadma (“Amdahl’s” spelled backwards): don’t screw over the less common case.

Tejas Lessons
The next-gen P4 (P5?) was cancelled in spring 2004:
– Complexity of a super-duper-pipelined processor; time-to-market slipping; performance goals slipping.
– Complexity became unmanageable; power and thermals were out of control.
“Performance at all costs” is no longer true.

Lessons to Carry Forward
– Performance is still king, but it is restricted by power, thermals, complexity, design time, cost, etc.
– Future processors are more balanced: Centrino, Core, Core 2; Opteron.