

Pentium III Instruction Stream

Introduction
Pentium III uses several key features to exploit ILP (instruction-level parallelism). This part of our presentation covers the methods that the third-generation P6/IA-32 architecture uses, along with their advantages and disadvantages.

Features
• Completely speculative execution
• Superscalar issue
• Speculative register renaming
• Deeply pipelined execution
• Large branch prediction unit

Pentium III Execution
• Deeply pipelined
  – Over 30 stages for many ops (without miss penalties)
  – Several tradeoffs for deeply pipelined models
    • Stall penalties
    • Clock rate

Pentium III Execution Model
• Consists of
  – In-order front end/issue
  – Out-of-order execution core
  – In-order retirement unit (non-speculative)

Front End Execution
• ICache access
• Branch prediction
• Decode
• Issue

ICache
• ICache is
  – 16 KB, 4-way set associative, 32-byte cache lines
• L2 (unified)
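The cache geometry above fixes how an address splits into tag, set index, and line offset. A minimal sketch of that arithmetic follows; the function name `split_address` and the use of plain integer addresses are assumptions for illustration, not details from the slides.

```python
# Geometry taken from the slide: 16 KB, 4-way set associative, 32-byte lines.
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 32

# 16384 / (4 * 32) = 128 sets
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 7 bits select a set


def split_address(addr):
    """Split an address into (tag, set index, line offset). Hypothetical helper."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

With these numbers, any two addresses that differ by a multiple of 4 KB (128 sets × 32 bytes) map to the same set and compete for its four ways.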

Branch Prediction
• BTB (branch target buffer) decides the address of the next executed instruction
• Speculative state advantages
  – Less complicated recovery
  – Lower mispredict cost
• BTB runs off of prefetch

Branch Prediction (Cont.)
• Dynamic predictor
  – Yeh's algorithm
  – Last 4 directions available per branch address
  – One-cycle penalty on taken branches
  – RSB (return stack buffer)
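A two-level adaptive predictor in the style of Yeh's algorithm can be sketched as follows. This is an illustration under assumptions, not the actual BTB design: the class name, the dictionary-based tables, and the 2-bit saturating counters are hypothetical choices; only the 4-bit per-branch history comes from the slide.

```python
class TwoLevelPredictor:
    """Per-branch history selects a 2-bit saturating counter (assumed layout)."""

    HISTORY_BITS = 4  # "last 4 directions available per branch address"

    def __init__(self):
        self.history = {}   # branch address -> last 4 outcomes packed as bits
        self.counters = {}  # (address, history) -> counter in 0..3

    def predict(self, addr):
        h = self.history.get(addr, 0)
        # Counters start at 1 (weakly not taken); 2 or 3 means predict taken.
        return self.counters.get((addr, h), 1) >= 2

    def update(self, addr, taken):
        h = self.history.get(addr, 0)
        c = self.counters.get((addr, h), 1)
        c = min(3, c + 1) if taken else max(0, c - 1)
        self.counters[(addr, h)] = c
        mask = (1 << self.HISTORY_BITS) - 1
        self.history[addr] = ((h << 1) | int(taken)) & mask
```

Because the history pattern picks the counter, a loop branch that is taken three times and then falls through becomes fully predictable once each 4-bit pattern has been seen, which a single counter per branch cannot achieve.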

Branch Prediction (Cont.)
• Static predictor
  – 6-cycle penalty
  – Forward branches predicted not taken
  – Backward branches predicted taken
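The static rule reduces to a comparison of the branch address with its target. A minimal sketch, where the function name and integer addresses are assumptions for illustration:

```python
def static_predict(branch_addr, target_addr):
    """Return True if the static rule predicts the branch taken.

    Backward branches (target at or before the branch) are usually loop
    back-edges, so they are predicted taken; forward branches are not.
    """
    return target_addr <= branch_addr
```

For example, `static_predict(0x100, 0x80)` predicts taken, while `static_predict(0x100, 0x140)` predicts not taken.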

Decode
• Three decode units
  – Two simple, one complex
• Micro-ops
  – RISC-type operations
    • Can be 1-4 per CISC operation

Decode (Cont.)
• Issue problems arise
  – Program instruction ordering very important
• Tradeoff
  – Issue of 4-wide instructions improves compiler performance by allowing more optimization
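Why instruction ordering matters can be seen with a simplified model of the three decoders. The sketch below assumes that, each cycle, the complex decoder takes the next instruction (1-4 micro-ops) and each simple decoder accepts a following instruction only if it is a single micro-op, with decoding strictly in program order. The function name and this exact policy are assumptions for illustration.

```python
def decode_cycles(uop_counts):
    """Count decode cycles for a sequence of per-instruction micro-op counts."""
    for c in uop_counts:
        if not 1 <= c <= 4:
            raise ValueError("this sketch only models 1-4 micro-ops per instruction")
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1                      # complex decoder: any 1-4 micro-op instruction
        for _ in range(2):          # two simple decoders: single micro-op only
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break               # in-order decode stops at the first misfit
    return cycles
```

Under this model, the sequence `[2, 1, 1, 2, 1, 1]` decodes in 2 cycles, while the same six instructions ordered as `[1, 2, 1, 2, 1, 2]` need 4, because multi-micro-op instructions keep landing on a simple decoder.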

Decode (Cont.)
• Willamette (the latest IA-32 architecture) has
  – Execution trace cache
    • Immediately accessible (no cache hit delay)
    • Exploits temporal locality

Execution
• Micro-ops follow distinct trails
  – RAT (register alias table)
  – ROB (re-order buffer)
  – Reservation station
  – Execution units

RAT
• Register mappings (source, destination)
  – Eliminates false dependencies
• In-order retirement
  – Allows out-of-order execution from ROB
• Issues up to 3 micro-ops to ROB per cycle
  – See any throughput problems?

RAT (Cont.)
• Can access either ROB or RRF (retirement register file)
  – Solves true dependencies
  – State bits required
• Branch mispredicts?
  – Flush all state (mappings) younger than the branch
  – No new mappings until all current instructions retired
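The renaming step can be sketched as a small table from architectural registers to ROB entries. This is a simplified illustration: the class and method names are hypothetical, ROB entry numbers grow without wrapping, and the state bits and mispredict recovery are left out.

```python
class RAT:
    """Maps each architectural register to the ROB entry of its newest writer."""

    def __init__(self):
        self.map = {}        # architectural register -> ROB entry index
        self.next_entry = 0  # assumption: entries never wrap in this sketch

    def rename(self, dest, srcs):
        # A source reads from the ROB if its producer is still in flight,
        # otherwise from the retirement register file (RRF).
        renamed = [("ROB", self.map[s]) if s in self.map else ("RRF", s)
                   for s in srcs]
        entry = self.next_entry
        self.next_entry += 1
        self.map[dest] = entry   # later readers of dest now see this entry
        return entry, renamed

    def retire(self, dest, entry):
        # Drop the mapping only if no newer writer has replaced it.
        if self.map.get(dest) == entry:
            del self.map[dest]
```

Renaming `eax = ebx + ecx`, then `edx = eax + 1`, then `eax = esi` gives the two writes to `eax` different ROB entries (0 and 2), so the WAW and WAR hazards disappear while the true dependence of the second micro-op on entry 0 is preserved.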

ROB
• ROB is the temporary location of queued micro-ops
• 40 entries
  – Contain micro-ops, state, and results

ROB States
• SD
  – Scheduled for execution
• DP
  – Micro-op is at head of dispatch queue
• EX
  – Currently being executed
• WB
  – Completed execution; waiting for results
• RR, RT
  – Ready for retirement, being retired

Reservation Station

Reservation Station (Cont.)
• 5 ports for different ops
  – FP, Int, MMX, SSE, LSQ ops
  – More throughput problems?
• 20-entry queue
  – Organization not specified

Execution
• Scheduling
  – One scheduler for each port
  – 20-entry queue optimized by priority algorithm
• Dispatch
  – All 5 ports can be dispatched every clock cycle
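Per-port scheduling can be sketched as picking at most one ready micro-op for each port per cycle. The priority algorithm is not specified in the slides, so the oldest-first policy here is an assumption, as are the function name and the dictionary layout of each micro-op.

```python
def dispatch(reservation_station, ready_regs, num_ports=5):
    """Pick at most one ready micro-op per port, oldest first (assumed policy).

    reservation_station: list of dicts {"id", "port", "srcs"} in age order.
    ready_regs: set of source operands whose values are available.
    Returns the ids dispatched this cycle and removes them from the station.
    """
    chosen = {}
    for uop in reservation_station:
        port = uop["port"]
        if not 0 <= port < num_ports:
            raise ValueError("unknown port %r" % port)
        operands_ready = all(s in ready_regs for s in uop["srcs"])
        if port not in chosen and operands_ready:
            chosen[port] = uop
    for uop in chosen.values():
        reservation_station.remove(uop)
    return [uop["id"] for uop in chosen.values()]
```

Two ready micro-ops that need the same port cannot both leave in one cycle, which is one way the 5-port limit caps throughput even when the 20-entry queue holds plenty of ready work.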

Execution (Cont.)
• Dispatch
  – Dcache misses, hazards resolved
  – Results written back to ROB
    • Resolves dependency chain

Retirement
• Results written to RRF
  – Non-speculative state
  – Register maps deleted, if possible
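In-order retirement from the ROB can be sketched with a queue whose head must complete before anything behind it commits. The 40-entry capacity comes from the ROB slide; the class name, the entry layout, and the retire width of 3 per call are assumptions for illustration.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, capacity=40):
        self.capacity = capacity
        self.entries = deque()  # oldest micro-op at the left
        self.rrf = {}           # retirement register file: committed state

    def allocate(self, uop_id, dest):
        """Add a micro-op in program order; False means the front end stalls."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(
            {"id": uop_id, "dest": dest, "value": None, "done": False})
        return True

    def write_back(self, uop_id, value):
        """Record a result; execution may finish in any order."""
        for e in self.entries:
            if e["id"] == uop_id:
                e["value"], e["done"] = value, True
                return
        raise KeyError(uop_id)

    def retire(self, width=3):
        """Commit completed micro-ops from the head only (assumed width)."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < width):
            e = self.entries.popleft()
            self.rrf[e["dest"]] = e["value"]  # result becomes architectural
            retired.append(e["id"])
        return retired
```

If micro-ops 1 and 2 finish before micro-op 0, `retire()` returns an empty list until 0 writes back, after which all three commit in program order.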

Throughput

Area Considerations
• As it turns out
  – IA-32 architecture doesn't scale entirely well
    • Die area is a large problem
    • Bus / logical complexity grows in a non-linear fashion

Finally
• It seems that
  – IA-32 is at an end
  – VLIW is next