The Pentium 4 CPSC 321 Andreas Klappenecker. Today’s Menu Advanced Pipelining Brief overview of the Pentium 4.


The Pentium 4 CPSC 321 Andreas Klappenecker

Today’s Menu Advanced Pipelining Brief overview of the Pentium 4

Instruction Level Parallelism Pipelining exploits the potential parallelism among instructions. There are two main methods to increase the potential amount of parallelism: Increase the depth of the pipeline to overlap more instructions Replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage

Washer-Dryer Example Suppose that the washer cycle is longer than the other cycles. We can divide our washer into three machines that perform the wash, rinse, and spin steps of a traditional washer (moving from four to six pipeline stages). A multiple issue laundry would replace our household washer and dryer with, say, three washers and three dryers.
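The effect of splitting the long washer stage can be sketched with a small timing model. This is only an illustration: the stage durations are made up, and the model assumes an ideal pipeline with no stalls, where the slowest stage sets the cycle time.

```python
def pipeline_time(stage_times, n_items):
    """Total time to push n_items through an ideal pipeline (no hazards)."""
    cycle = max(stage_times)      # the slowest stage sets the clock
    depth = len(stage_times)
    # The first item needs `depth` cycles; each later item finishes one cycle after.
    return cycle * (depth + n_items - 1)

# Hypothetical durations in minutes: wash, dry, fold, store.
four_stage = [60, 30, 30, 30]
# Same work with the washer split into wash, rinse, and spin (20 minutes each).
six_stage = [20, 20, 20, 30, 30, 30]

loads = 10
t4 = pipeline_time(four_stage, loads)
t6 = pipeline_time(six_stage, loads)
print(t4, t6)
```

With these assumed numbers the deeper pipeline finishes sooner because the cycle time drops from 60 to 30 minutes, even though each load now passes through more stages.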

Multiple-Issue Processors We have two different approaches to multiple-issue processors: The approach to decide at compile time which instructions should be issued is called static multiple issue The approach to decide at execution time which instructions should be issued is called dynamic multiple issue

Multiple Issues with Multiple-Issue
1. Packaging instructions into issue slots: How does the processor determine how many instructions and which instructions can be issued in a given clock cycle?
2. Dealing with data and control hazards: In static issue processors, some or all consequences of these hazards are handled statically by the compiler. Dynamic issue processors attempt to alleviate at least some classes of hazards using hardware techniques.
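The packaging question can be illustrated with a toy greedy packer. This is a sketch under simplifying assumptions, not how any particular processor or compiler does it: instructions are issued in order, the issue width is 2, and an instruction may not share a packet with an earlier instruction whose destination register it reads or overwrites. The `Instr` type and register names are hypothetical.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str      # register written
    srcs: tuple    # registers read

def pack(instrs, width=2):
    """Greedy in-order packing into issue packets of at most `width` slots."""
    packets, current, written = [], [], set()
    for ins in instrs:
        # Conflict: the instruction reads or rewrites a register that an
        # instruction already in the current packet writes.
        conflict = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (len(current) == width or conflict):
            packets.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        packets.append(current)
    return packets

program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),   # independent of the first: same packet
    Instr("r7", ("r1", "r4")),   # packet is full (and it reads r1, r4): new packet
    Instr("r8", ("r7",)),        # reads r7: must wait for the next packet
]
packets = pack(program)
print([len(p) for p in packets])
```

A static multiple-issue design would run this kind of analysis in the compiler; a dynamic one would make the equivalent decision in hardware at execution time.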

Speculation The most important method to exploit more ILP is speculation. The compiler or the processor guesses about the properties of an instruction, to enable execution of instructions that depend on the current instruction. For example, a compiler can use speculation to reorder instructions and move instructions beyond a branch.

Recovery from Wrong Speculations Speculation in software: the compiler inserts additional instructions that check the accuracy of a speculation and provide a fix-up routine for when the speculation was incorrect. Speculation in hardware: the processor usually buffers the results until it knows that they are no longer speculative. If the speculation was correct, then the instructions are completed by allowing the contents of the buffers to be written to registers or memory; otherwise the buffers are flushed and the correct instruction sequence is re-executed.
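The hardware approach can be sketched as a buffer that keeps speculative results away from the register file until the guess is resolved. This is a minimal model for illustration only; the class and method names are invented, and a real reorder buffer tracks far more state.

```python
class SpeculativeBuffer:
    """Holds results of speculatively executed instructions until resolved."""

    def __init__(self, registers):
        self.registers = registers   # architectural (committed) state
        self.pending = []            # buffered (register, value) results

    def execute(self, reg, value):
        # The result is buffered, not yet visible in the registers.
        self.pending.append((reg, value))

    def resolve(self, speculation_correct):
        if speculation_correct:
            for reg, value in self.pending:   # commit in program order
                self.registers[reg] = value
        # Either way the buffer is emptied; after a wrong guess this is the flush.
        self.pending.clear()

regs = {"r1": 0, "r2": 0}
buf = SpeculativeBuffer(regs)
buf.execute("r1", 42)
buf.resolve(speculation_correct=False)   # wrong guess: r1 keeps its old value
buf.execute("r2", 7)
buf.resolve(speculation_correct=True)    # right guess: r2 is written
print(regs)
```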

Register Renaming A compiler can get more performance from loops by so-called loop unrolling; this is a technique where multiple copies of the loop body are made => more ILP by overlapping instructions from different iterations. During loop unrolling, the compiler will usually introduce additional registers to eliminate dependencies that are not true data dependencies (just name dependences). This process is called register renaming.
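The idea can be shown at source level, with the caveat that Python variables only stand in for registers here and a real compiler performs this on machine code. In the rolled loop every iteration reuses the temporary `t`; the unrolled version (by a factor of 4, assuming the length is a multiple of 4) gives each copy of the body its own temporary, so the four multiplications no longer share a name.

```python
def scale_rolled(x, s):
    out = list(x)
    for i in range(len(x)):
        t = out[i] * s          # the same temporary is reused every iteration
        out[i] = t
    return out

def scale_unrolled(x, s):
    """Unrolled by 4; assumes len(x) is a multiple of 4."""
    out = list(x)
    for i in range(0, len(x), 4):
        t0 = out[i] * s         # t0..t3 play the role of renamed registers
        t1 = out[i + 1] * s
        t2 = out[i + 2] * s
        t3 = out[i + 3] * s
        out[i], out[i + 1], out[i + 2], out[i + 3] = t0, t1, t2, t3
    return out

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(scale_unrolled(data, 3) == scale_rolled(data, 3))
```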

Pentium 4

Intel’s History [Timeline slide, courtesy of Intel.] Milestones shown: Intel Founded; First DRAM; 8086 Microprocessor; Intel286™ Processor; Intel386™ Processor; Intel486™ Processor; Intel Pentium® Processor; Intel Pentium® Processor with MMX™ technology; Intel Pentium® II Processor; First EPROM; Intel Pentium® Pro Processor; DRAM Exit; Flash Memory Intro; Intel Inside® Launch; ProShare® Introduced; 100 Mbit E-Net Card; First Intel Inside® Brand TV Ad; First Microprocessor 4004; First Intel Motherboard; 1998: Intel Celeron™ Processor, Intel Pentium® II Xeon™ Processor; Gbit E-Net Card; Intel Pentium® III and Xeon™ Processors; Internet Exchange Architecture; Pentium® 4 Processor; 1st Pb-Free Devices.

The Pentium4 Architecture Graphic courtesy of Tom’s hardware guide

A Glance at a Pentium 4 Chip Picture courtesy of Tom’s hardware guide

Pentium 4 The Pentium 4 was first released in 2000. Some of its features are: a fast system bus; an advanced transfer cache; advanced dynamic execution (execution trace cache and enhanced branch prediction); “hyper” pipeline technology; a rapid execution engine; enhanced floating point and multimedia (SSE2).

Some Features The processor uses micro-operations/operands: simple instructions of uniform length; easier to sequence than variable-length x86 instructions; understood by the execution units; the length is not exactly small.

System Bus The system bus is clocked at 100 MHz, 64 bits wide, and “quad-pumped”, meaning that it can transfer 8 bytes * 100 million/s * 4 = 3,200 MB/s (this is about 3 times the speed of the system bus of the Pentium 3). Intel introduced the 850 chipset to sustain high data exchange rates between processor and system.
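The bandwidth figure is simple arithmetic and can be checked directly. The Pentium 4 numbers come from the slide; the Pentium III comparison assumes a 133 MHz, 64-bit bus with one transfer per clock, which is an assumption made here to reproduce the "about 3 times" claim.

```python
BUS_WIDTH_BYTES = 8            # 64-bit bus
BUS_CLOCK_HZ = 100_000_000     # 100 MHz
TRANSFERS_PER_CLOCK = 4        # "quad-pumped"

p4_mb_per_s = BUS_WIDTH_BYTES * BUS_CLOCK_HZ * TRANSFERS_PER_CLOCK // 1_000_000

# Assumed Pentium III bus: 133 MHz, 64 bits wide, one transfer per clock.
p3_mb_per_s = 8 * 133_000_000 * 1 // 1_000_000

ratio = p4_mb_per_s / p3_mb_per_s
print(p4_mb_per_s, p3_mb_per_s, round(ratio, 2))
```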

Data Caches Data passes a level 2 cache (256 KB, 8-way associative; 128 byte cache lines that are divided into 64 byte blocks that are read in one burst; read latency is 7 clock cycles; we come back later to such issues). Data passes a small level 1 cache (8 KB). A hardware pre-fetch unit allows the processor to guess and fetch some data that is presumably used next (good for streaming video applications).

Execution Pipeline: The Trace Cache The Pentium 4 does not use an L1 instruction cache, but rather an “execution trace cache”. Note that the decoding of x86 instructions is much more complex than on MIPS. The execution trace cache is basically an instruction cache after the decoding unit (which generates the micro-operations), so that decoding does not have to be repeated. It supplies the next pipeline stage with 6 micro-operations every 2 clock cycles.
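The benefit of caching after the decoder can be sketched as memoizing decoded micro-operations. This is a deliberately simplified model: the real trace cache stores sequences of micro-operations along the predicted execution path, whereas this sketch keys on a single address. The decoder and its "+"-separated input format are hypothetical.

```python
class TraceCache:
    """Caches decoded micro-operations so each instruction is decoded once."""

    def __init__(self, decoder):
        self.decoder = decoder
        self.traces = {}          # address -> tuple of micro-operations
        self.decode_count = 0     # how often the (expensive) decoder ran

    def fetch(self, address, raw_instruction):
        if address not in self.traces:
            self.decode_count += 1
            self.traces[address] = self.decoder(raw_instruction)
        return self.traces[address]

def toy_decoder(raw):
    # Hypothetical format: micro-operation names separated by "+".
    return tuple(raw.split("+"))

tc = TraceCache(toy_decoder)
for _ in range(3):                # e.g. three iterations of a loop
    uops = tc.fetch(0x400, "load+add+store")
print(uops, tc.decode_count)
```

The instruction is fetched three times but decoded only once, which is the point of placing the cache behind the decoding unit.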

The Trace Cache Actual program instructions Trace cache can contain instructions of both branches

The Pipeline The branch prediction aids the execution trace cache; it has a fairly large branch target buffer. The hyper pipeline has 20 stages. The pipeline can keep up to 126 instructions in flight.

The Pipeline Trace cache

Rapid Execution Engine The rapid execution engine consists of two ALUs and two AGUs that run at twice the clock speed. Not every instruction can be processed by the rapid execution engine; those instructions need to use, e.g., the slower ALU. AGU = address generation unit; it computes the correct address for a load or store (used whenever you have indirect addressing such as a[i]).
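What an address generation unit computes for an access like a[i] can be written out as a short formula. The sketch below follows the general x86 form base + index * scale + displacement; the base address and element size are made-up values.

```python
def effective_address(base, index, scale, displacement=0):
    """Address of an indexed access: base + index * scale + displacement."""
    return base + index * scale + displacement

a_base = 0x1000      # hypothetical start address of array a
element_size = 4     # assuming 4-byte elements
addr = effective_address(a_base, index=5, scale=element_size)
print(hex(addr))
```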

Streaming SIMD Extensions SSE2
The Pentium 4 can operate on 128 bit data as
- 4 single precision FP values (SSE)
- 2 double precision FP values (SSE2)
- 16 byte values (SSE2)
- 8 word values (SSE2)
- 4 double word values (SSE2)
- 2 quad word values
- 128 bit values
using single instruction multiple data (SIMD) instructions.
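The element counts follow from dividing the 128-bit register width by the element size. A short check, using the usual x86 sizes (byte = 8 bits, word = 16, double word = 32, quad word = 64):

```python
REGISTER_BITS = 128

element_bits = {
    "single precision FP": 32,
    "double precision FP": 64,
    "byte": 8,
    "word": 16,
    "double word": 32,
    "quad word": 64,
}

# Number of elements that one SIMD instruction processes at a time.
lanes = {name: REGISTER_BITS // bits for name, bits in element_bits.items()}
for name, count in lanes.items():
    print(f"{count:2d} x {name}")
```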

Pentium 4 Pipeline
1. Trace cache access, predictor: 5 clock cycles (followed by the microoperation queue)
2. Reorder buffer allocation, register renaming: 4 clock cycles (followed by the functional unit queues)
3. Scheduling and dispatch unit: 5 clock cycles
4. Register file access: 2 clock cycles
5. Execution: 1 clock cycle (followed by the reorder buffer)
6. Commit: 3 clock cycles
(total: 20 clock cycles)
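The stated total can be verified by adding up the per-stage latencies from the slide:

```python
stages = [
    ("Trace cache access, predictor", 5),
    ("Reorder buffer allocation, register renaming", 4),
    ("Scheduling and dispatch unit", 5),
    ("Register file access", 2),
    ("Execution", 1),
    ("Commit", 3),
]

total_cycles = sum(cycles for _, cycles in stages)
print(total_cycles)
```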

Pentium 4 Generations Willamette Northwood (smaller transistors, later hyper-threading) Extreme Edition (added 2MB level 3 cache) Prescott (90 nm process, new micro architecture) Irwindale (as Prescott, but with doubled L2 cache) Dual Core

Hyper-Threading A typical thread of code of the IA-32 architecture uses about 35% of the microarchitecture execution resources. Intel added a little bit of hardware to schedule and control two threads. The operating system sees two logical processors

To Probe Further Read Chapter 6 Hennessy and Patterson, Computer Architecture: A Quantitative Approach Intel website AMD website