EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.


1 EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12

2 Optimizing CPU Performance Golden Rule: t_CPU = N_inst × CPI × t_CLK. Given this, what are our options? –Reduce the number of instructions executed –Reduce the cycles to execute an instruction –Reduce the clock period Our next focus: further reducing CPI –Approach: superscalar execution –Capable of initiating multiple instructions per cycle –Possible to implement for in-order or out-of-order pipelines
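The golden rule can be tried out with a few lines of arithmetic. The sketch below uses made-up instruction counts, CPI values, and clock periods (illustrative assumptions, not figures from the lecture) to show how a lower CPI can outweigh a somewhat longer clock period.

```python
def cpu_time(n_inst, cpi, t_clk_ns):
    """Execution time in nanoseconds: t_CPU = N_inst * CPI * t_CLK."""
    return n_inst * cpi * t_clk_ns

# Hypothetical numbers, chosen only for illustration.
scalar = cpu_time(n_inst=1_000_000, cpi=1.0, t_clk_ns=1.0)
# Assumed superscalar design: lower CPI, but a 25% longer clock period.
superscalar = cpu_time(n_inst=1_000_000, cpi=0.6, t_clk_ns=1.25)

speedup = scalar / superscalar
print(f"scalar: {scalar:.0f} ns, superscalar: {superscalar:.0f} ns, "
      f"speedup: {speedup:.2f}x")
```

With these assumed values the superscalar design still comes out ahead, because the CPI reduction is larger than the clock period increase.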

3 Why Superscalar? Pipelining vs. Superscalar + Pipelining Optimization results in more complexity –Longer wires, more logic → higher t_CLK and t_CPU –Architects must strike a balance with reductions in CPI

4 Implications of Superscalar Execution Instruction fetch? –Taken branches, multiple branches, partial cache lines Instruction decode? –Simple for fixed length ISA, much harder for variable length Renaming? –Multi-port RT, inter-inst dependencies must be recognized Dynamic Scheduling? –Requires multiple results buses, smarter selection logic Execution? –Multiple functional units, multiple result buses Commit? –Multiple ROB/ARF ports, dependencies must be recognized

5 P4 Overview Latest IA-32 processor from Intel –Equipped with the full set of IA-32 SIMD operations –First flagship architecture since the P6 microarchitecture –Pentium 4 ISA = Pentium III ISA + SSE2 –SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating point operations + prefetch

6 Comparison Between Pentium III and Pentium 4

7 Execution Pipeline

8 Front End Predicts branches Fetches/decodes code into trace cache Generates µops for complex instructions Prefetches instructions that are likely to be executed

9 Branch Prediction Dynamically predict the direction and target of branches based on PC using BTB If no dynamic prediction available, statically predict –Taken for backwards looping branches –Not taken for forward branches –Implemented at decode Traces built across (predicted) taken branches to avoid taken branch penalties Also includes a 16-entry return address stack predictor
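The static fallback rule and the return address stack from this slide can be sketched as follows. This is a simplified model, not the Pentium 4's actual logic: the function and class names are hypothetical, and dropping the oldest entry when the stack overflows is an assumption.

```python
from collections import deque

def static_predict_taken(branch_pc, target_pc):
    """Static rule: backward branches (likely loop back-edges) are
    predicted taken, forward branches are predicted not taken."""
    return target_pc < branch_pc

class ReturnAddressStack:
    """Small return-address predictor; 16 entries as stated on the slide."""

    def __init__(self, entries=16):
        # Assumption: on overflow the oldest entry is silently discarded.
        self.stack = deque(maxlen=entries)

    def on_call(self, return_pc):
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted return target, or None if the stack is empty.
        return self.stack.pop() if self.stack else None

print(static_predict_taken(branch_pc=0x400, target_pc=0x3F0))  # backward branch
print(static_predict_taken(branch_pc=0x400, target_pc=0x420))  # forward branch
```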

10 Decoder Single decoder available –Operates at a maximum of 1 instruction per cycle Receives instructions from L2 cache 64 bits at a time Some complex instructions must enlist the micro-ROM –Used for very complex IA-32 instructions (> 4 µops) –After the microcode ROM finishes, the front-end resumes fetching µops from the Trace Cache

11 Execution Pipeline

12 Trace Cache Primary instruction cache in P4 architecture –Stores 12k decoded µops On a miss, instructions are fetched from L2 Trace predictor connects traces Trace cache removes –Decode latency after mispredictions –Decode power for all pre-decoded instructions

13 Branch Hints P4 software can provide hints to branch prediction and trace cache –Specify the likely direction of a branch –Implemented with conditional branch prefixes –Used for decode-stage predictions and trace building

14 Execution Pipeline


16 Execution 126 µops can be in flight at once –Up to 48 loads / 24 stores Can dispatch up to 6 µops per cycle 2x trace cache and retirement µop bandwidth –Provides additional B/W for scheduling misspeculation

17 Execution Units

18 Register Renaming

19 8-entry architectural register file 128-entry physical register file 2 RAT (Front-end RAT and Retirement RAT) Retirement RAT eliminates register writes into ARF
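A toy model can make the two-RAT scheme more concrete. The sketch below assumes a simple free list and an initial identity mapping; the class and method names are hypothetical and the real Pentium 4 allocator is more involved. It shows why retirement needs no data copy: retiring only updates a pointer in the retirement RAT.

```python
class Renamer:
    """Toy renamer: 8 architectural registers, 128 physical registers."""

    def __init__(self, n_arch=8, n_phys=128):
        # Assumption: architectural register i starts in physical register i.
        self.frontend_rat = list(range(n_arch))
        self.retire_rat = list(range(n_arch))
        self.free_list = list(range(n_arch, n_phys))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (phys_dst, phys_srcs)."""
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        # Read sources before updating the destination mapping.
        phys_srcs = [self.frontend_rat[s] for s in srcs]
        phys_dst = self.free_list.pop(0)
        self.frontend_rat[dst] = phys_dst
        return phys_dst, phys_srcs

    def retire(self, dst, phys_dst):
        """Commit a mapping: only a pointer changes, no value is copied."""
        old = self.retire_rat[dst]
        self.retire_rat[dst] = phys_dst
        self.free_list.append(old)  # previous committed register is reusable

r = Renamer()
print(r.rename(dst=0, srcs=[1, 2]))  # r0 <- r1 op r2
print(r.rename(dst=1, srcs=[0]))     # reads the renamed r0 from the line above
```

The second instruction picks up the new physical register for its source, which is the inter-instruction dependence a multi-instruction renamer has to recognize within a single cycle.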

20 Store and Load Scheduling Out of order store and load operations Stores are always in program order 48 loads and 24 stores could be in flight Store/load buffers are allocated at the allocation stage –Total 24 store buffers and 48 load buffers

21 Execution Pipeline

22 Retirement Can retire 3 µops per cycle Implements precise exceptions Reorder buffer used to organize completed µops Also keeps track of branches and sends updated branch information to the BTB

23 Data Stream of Pentium 4 Processor

24 On-chip Caches L1 instruction cache (Trace Cache) L1 data cache L2 unified cache –All caches use a pseudo-LRU replacement algorithm Parameters:

25 L1 Data Cache Non-blocking –Support up to 4 outstanding load misses Load latency –2-clock for integer –6-clock for floating-point 1 Load and 1 Store per clock Load speculation –Assume the access will hit the cache –“Replay” the dependent instructions when miss detected

26 L2 Cache Non-blocking Load latency –Net load access latency of 7 cycles Bandwidth –1 load and 1 store in one cycle –New cache operations may begin every 2 cycles –256-bit wide bus between L1 and L2 –48Gbytes per second @ 1.5GHz
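The 48 Gbytes per second figure follows from the bus width and clock rate. The short check below assumes one full 256-bit transfer per clock, which is the peak case; the helper name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits, clock_hz):
    """Peak bandwidth, assuming one full-width transfer every clock."""
    return (bus_width_bits // 8) * clock_hz

l2_bw = peak_bandwidth_bytes_per_s(bus_width_bits=256, clock_hz=1_500_000_000)
print(l2_bw / 1e9, "GB/s")  # 32 bytes per clock at 1.5 GHz
```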

27 L2 Cache Data Prefetcher Hardware prefetcher monitors the reference patterns Bring cache lines automatically Attempts to fetch 256 bytes ahead of current access Prefetch for up to 8 simultaneous independent streams

28 System Bus Delivers data at 3.2 Gbytes/s 64-bit wide bus Four data phases per clock cycle (quad pumped) 100 MHz clocked system bus
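The bus numbers on this slide are mutually consistent, as a quick calculation shows. The function name and return shape below are illustrative choices, not anything defined in the lecture.

```python
def quad_pumped_bus(bus_clock_hz=100_000_000, width_bits=64, phases_per_clock=4):
    """Return (transfers per second, bytes per second) for a multi-pumped bus."""
    transfers_per_s = bus_clock_hz * phases_per_clock
    bytes_per_s = transfers_per_s * (width_bits // 8)
    return transfers_per_s, bytes_per_s

transfers, bandwidth = quad_pumped_bus()
print(transfers / 1e6, "MT/s")  # effective transfer rate
print(bandwidth / 1e9, "GB/s")  # peak data bandwidth
```

Four data phases on a 100 MHz clock give an effective 400 million transfers per second, and 8 bytes per transfer yields the quoted 3.2 Gbytes/s.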

29 Execution on MPEG4 Benchmarks @ 1 GHz

30 Performance Trends [Figure labels: Moore's Law Speedup; Performance Gap; Real-time speech; 10k SPECInt2000]

31 Power Trends [Figure labels: Real-time Speech; 500 mW; Power; Power Gap; Hot Plate; Nuclear Reactor; Rocket Nozzle]
