The original MIPS I CPU ISA has been extended forward three times. The practical result is that a processor implementing MIPS IV can also run MIPS I, MIPS II, or MIPS III binary programs without change.

MIPS is implemented in the following sets of CPU designs:
MIPS I, implemented in the R2000 and R3000
MIPS II, implemented in the R6000
MIPS III, implemented in the R4400
MIPS IV, implemented in the R8000 and R10000

R4400 Pipeline and Superpipeline Architecture Previous MIPS processors had linear pipeline architectures; an example of such a linear pipeline is the R4400 superpipeline. In the R4400 superpipeline architecture, an instruction is executed each cycle of the pipeline clock (PCycle), or each pipe stage.

What is a Superscalar Processor? A superscalar processor is one that can fetch, execute, and complete more than one instruction in parallel. At each stage, four instructions are handled in parallel. The structure of a 4-way superscalar pipeline is shown in the figure.

The R10000 processor has the following major features:
It implements the 64-bit MIPS IV instruction set architecture (ISA)
It can decode four instructions each pipeline cycle, appending them to one of three instruction queues
It has five execution pipelines connected to separate internal integer and floating-point execution (or functional) units
It uses dynamic instruction scheduling and out-of-order execution

It has individually-optimized secondary cache and System interface ports
It has an internal controller for the external secondary cache
It has an internal System interface controller with multiprocessor support
It uses a precise exception model (exceptions can be traced back to the instruction that caused them)
It uses non-blocking caches
It has separate on-chip 32-Kbyte primary instruction and data caches

It uses speculative instruction issue (also termed “speculative branching”). Speculation is a means of hiding latency: modern superscalar processors rely heavily on speculative execution for performance. Speculation hides branch latencies and thereby boosts performance by executing the likely branch path without stalling. Branch predictors, which provide accuracies of up to 96%, are the key to effective speculation.
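
The idea behind such predictors can be sketched with the classic 2-bit saturating-counter scheme. This is a generic illustration, not the R10000's actual predictor design; the table size and the `TwoBitPredictor` name are assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counters: 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, entries=512):
        # Start every counter at 1 (weakly not-taken); size is illustrative.
        self.counters = [1] * entries
        self.entries = entries

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        # Saturate the counter toward the actual outcome.
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken 9 times then falling through: the predictor learns
# "taken" quickly and mispredicts only at warm-up and at loop exit.
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x40) != taken:
        mispredicts += 1
    p.update(0x40, taken)
```

The 2-bit hysteresis is what makes loop branches cheap: a single loop exit does not flip the prediction for the next loop entry.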

R10000 Superscalar Pipeline The R10000 superscalar processor fetches and decodes four instructions in parallel each cycle (or pipeline stage). Each pipeline includes stages for fetching (stage 1), decoding (stage 2), issuing instructions (stage 3), reading register operands (stage 3), executing instructions (stages 4 through 6), and storing results (stage 7).

Superscalar Pipeline Architecture in the R10000

Instruction Queues As shown in the figure, each instruction decoded in stage 2 is appended to one of three instruction queues: the integer queue, the address queue, or the floating-point queue.
Execution Pipelines The three instruction queues can issue one new instruction per cycle to each of the five execution pipelines:
the integer queue issues instructions to the two integer ALU pipelines
the address queue issues one instruction to the Load/Store Unit pipeline
the floating-point queue issues instructions to the floating-point adder and multiplier pipelines
A sixth pipeline, the fetch pipeline, reads and decodes instructions from the instruction cache.
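
The stage-2 routing above can be sketched as a dispatch step that appends each decoded instruction to the matching queue. The queue names follow the text; the opcode classification table is a simplified assumption for illustration.

```python
# Illustrative opcode classes; a real decoder derives these from opcode bits.
INTEGER_OPS = {"add", "sub", "and", "or", "sll", "beq", "mult", "div"}
MEMORY_OPS = {"lw", "sw", "ld", "sd"}
FP_OPS = {"add.d", "mul.d", "div.d", "sqrt.d"}

def dispatch(decoded, queues):
    """Append up to four decoded instructions per cycle to the three queues."""
    for op in decoded[:4]:          # at most four instructions per cycle
        if op in MEMORY_OPS:
            queues["address"].append(op)
        elif op in FP_OPS:
            queues["floating-point"].append(op)
        elif op in INTEGER_OPS:
            queues["integer"].append(op)
        else:
            raise ValueError(f"unknown opcode {op}")
    return queues

queues = {"integer": [], "address": [], "floating-point": []}
dispatch(["add", "lw", "mul.d", "beq"], queues)
```

Note that branches go to the integer queue: in the R10000, branch instructions execute in ALU1.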

64-bit Integer ALU Pipeline The 64-bit integer pipeline has the following characteristics:
It has a 16-entry integer instruction queue that dynamically issues instructions
It has a 64-bit 64-location integer physical register file, with seven read and three write ports
It has two 64-bit arithmetic logic units:
- ALU1 contains an arithmetic-logic unit, shifter, and integer branch comparator
- ALU2 contains an arithmetic-logic unit, integer multiplier, and divider

Load/Store Pipeline The load/store pipeline has the following characteristics:
It has a 16-entry address queue that dynamically issues instructions, and uses the integer register file for base and index registers
It has a 16-entry address stack for use by non-blocking loads and stores
It has a 44-bit virtual address calculation unit
It has a 64-entry fully associative Translation-Lookaside Buffer (TLB), which converts virtual addresses to physical addresses, using a 40-bit physical address. Each entry maps two pages, with sizes ranging from 4 Kbytes to 16 Mbytes, in powers of 4.

64-bit Floating-Point Pipeline The 64-bit floating-point pipeline has the following characteristics:
It has a 16-entry instruction queue, with dynamic issue
It has a 64-bit 64-location floating-point physical register file, with five read and three write ports (32 logical registers)
It has a 64-bit parallel multiply unit (3-cycle pipeline with 2-cycle latency) which also performs move instructions
It has a 64-bit add unit (3-cycle pipeline with 2-cycle latency) which handles addition, subtraction, and miscellaneous floating-point operations
It has separate 64-bit divide and square-root units which can operate concurrently (these units share their issue and completion logic with the floating-point multiplier)

Block Diagram of the R10000 Processor

Functional Units The five execution pipelines allow overlapped instruction execution by issuing instructions to the following five functional units:
Two integer ALUs (ALU1 and ALU2)
The Load/Store unit (address calculate)
The floating-point adder
The floating-point multiplier
There are also three “iterative” units to compute more complex results:
Integer multiply and divide operations are performed by an Integer Multiply/Divide execution unit; these instructions are issued to ALU2, which remains busy for the duration of the divide.
Floating-point divides are performed by the Divide execution unit; these instructions are issued to the floating-point multiplier.
Floating-point square roots are performed by the Square-root execution unit; these instructions are issued to the floating-point multiplier.

Primary Instruction Cache (I-cache) The primary instruction cache has the following characteristics:
It contains 32 Kbytes, organized into 16-word blocks, is 2-way set associative, and uses a least-recently-used (LRU) replacement algorithm
It reads four consecutive instructions per cycle, beginning on any word boundary within a cache block, but cannot fetch across a block boundary
Its instructions are predecoded: their fields are rearranged and a 4-bit unit select code is appended
It checks parity on each word
It permits non-blocking instruction fetch

Primary Data Cache (D-cache) The primary data cache has the following characteristics:
It has two interleaved arrays (two 16-Kbyte ways)
It contains 32 Kbytes, organized into 8-word blocks, is 2-way set associative, and uses an LRU replacement algorithm
It handles 64-bit load/store operations
It handles 128-bit refill or write-back operations
It permits non-blocking loads and stores
It checks parity on each byte
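
The 2-way set-associative LRU organization used by both primary caches can be sketched as follows. The set count is scaled down for illustration (the real caches are 32 Kbytes); only the tag-matching and replacement logic is modeled.

```python
class TwoWayLRUCache:
    """Toy 2-way set-associative cache; each set is a list ordered LRU-first."""

    def __init__(self, num_sets=4):
        self.sets = [[] for _ in range(num_sets)]
        self.num_sets = num_sets

    def access(self, block_addr):
        """Return True on a hit; on a miss, fill the least-recently-used way."""
        s = self.sets[block_addr % self.num_sets]   # set index bits
        tag = block_addr // self.num_sets           # remaining bits are the tag
        if tag in s:
            s.remove(tag)
            s.append(tag)          # move to most-recently-used position
            return True
        if len(s) == 2:
            s.pop(0)               # evict the least-recently-used way
        s.append(tag)
        return False

c = TwoWayLRUCache()
# Three blocks that all map to set 0: two ways are not enough, so the
# blocks keep evicting each other except when re-touched soon enough.
hits = [c.access(a) for a in [0, 4, 0, 8, 4, 8]]
```

Two ways with LRU already remove most conflict misses relative to a direct-mapped cache, which is why both R10000 primary caches use this organization.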

Instruction Decode and Rename Unit The instruction decode and rename unit has the following characteristics:
It processes 4 instructions in parallel
It replaces logical register numbers with physical register numbers (register renaming)
- It maps integer registers into a 33-word-by-6-bit mapping table that has 4 write and 12 read ports
- It maps floating-point registers into a 32-word-by-6-bit mapping table that has 4 write and 16 read ports
It has a 32-entry active list of all instructions within the pipeline.
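
Register renaming can be sketched as a mapping table plus a free list: each new destination gets a fresh physical register, so repeated writes to the same logical register no longer conflict. Sizes follow the text (33 logical, 64 physical integer registers); the free-list discipline shown is a simplifying assumption.

```python
class RenameTable:
    """Maps logical register numbers to physical register numbers."""

    def __init__(self, num_logical=33, num_physical=64):
        self.map = list(range(num_logical))              # logical -> physical
        self.free = list(range(num_logical, num_physical))  # unused physicals

    def rename(self, dest, src1, src2):
        """Look up the sources, then allocate a new physical destination."""
        p1, p2 = self.map[src1], self.map[src2]
        pd = self.free.pop(0)
        self.map[dest] = pd
        return pd, p1, p2

rt = RenameTable()
# add r1, r2, r3  followed by  add r1, r1, r4:
# the second write to r1 gets a fresh physical register, and the second
# add reads the physical register produced by the first.
d1, _, _ = rt.rename(1, 2, 3)
d2, s1, _ = rt.rename(1, 1, 4)
```

In the real processor a physical register is returned to the free list only when the instruction that overwrote it graduates, which is what the active list tracks.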

Branch Unit The branch unit has the following characteristics:
It allows one branch per cycle
Conditional branches can be executed speculatively, up to 4-deep
It has a 44-bit adder to compute branch addresses
It has a 4-quadword branch-resume buffer, used for reversing mispredicted speculatively-taken branches

Instruction Queues The processor keeps decoded instructions in three instruction queues, which dynamically issue instructions to the execution units. The queues allow the processor to fetch instructions at its maximum rate, without stalling because of instruction conflicts or dependencies. Each queue uses instruction tags to keep track of the instruction in each execution pipeline stage. These tags set a Done bit in the active list as each instruction is completed.

Integer Queue The integer queue issues instructions to the two integer arithmetic units: ALU1 and ALU2. The integer queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded integer instructions are written into empty entries in no particular order. Instructions remain in this queue only until they have been issued to an ALU. Branch and shift instructions can be issued only to ALU1. Integer multiply and divide instructions can be issued only to ALU2. Other integer instructions can be issued to either ALU. The integer queue controls six dedicated ports to the integer register file: two operand read ports and a destination write port for each ALU.
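
The issue restrictions above can be sketched as a steering function: branch and shift instructions go only to ALU1, integer multiply/divide only to ALU2, and everything else to whichever ALU is free. The opcode sets are illustrative assumptions.

```python
ALU1_ONLY = {"beq", "bne", "sll", "srl", "sra"}   # branches and shifts
ALU2_ONLY = {"mult", "div"}                        # integer multiply/divide

def steer(op, alu1_busy, alu2_busy):
    """Return the ALU this instruction issues to, or None if it must wait."""
    if op in ALU1_ONLY:
        return "ALU1" if not alu1_busy else None
    if op in ALU2_ONLY:
        return "ALU2" if not alu2_busy else None
    # Other integer instructions may issue to either free ALU.
    if not alu1_busy:
        return "ALU1"
    if not alu2_busy:
        return "ALU2"
    return None
```

Restricting rarely-duplicated hardware (the branch comparator, the multiplier) to one ALU keeps both datapaths small while still letting common instructions use either.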

Releasing Register Dependency in the Integer Queue In one cycle the queue must issue two instructions: it finds out which operands will be ready and requests the dependent instructions.

Floating-Point Queue The floating-point queue issues instructions to the floating-point multiplier and the floating-point adder. The floating-point queue contains 16 instruction entries. Up to four instructions may be written during each cycle; newly-decoded floating-point instructions are written into empty entries in random order. Instructions remain in this queue only until they have been issued to a floating-point execution unit. The floating-point queue controls six dedicated ports to the floating-point register file: two operand read ports and a destination port for each execution unit. The floating-point queue uses the multiplier’s issue port to issue instructions to the square-root and divide units. These instructions also share the multiplier’s register ports.

Address Queue The address queue issues instructions to the load/store unit. The address queue contains 16 instruction entries. Unlike the other two queues, the address queue is organized as a circular First-In First-Out (FIFO) buffer. A newly decoded load/store instruction is written into the next available sequential empty entry; up to four instructions may be written during each cycle. The FIFO order maintains the program’s original instruction sequence so that memory address dependencies may be easily computed. Instructions remain in this queue until they have graduated; they cannot be deleted immediately after being issued, since the load/store unit may not be able to complete the operation immediately. The address queue contains more complex control logic than the other queues. An issued instruction may fail to complete because of a memory dependency, a cache miss, or a resource conflict; in these cases, the queue must continue to reissue the instruction until it is completed.
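
The address queue's FIFO-with-reissue discipline can be sketched as follows: instructions enter in program order, an issued instruction stays queued until it completes, and entries leave only at graduation. The `collections.deque` stands in for the 16-entry circular buffer; the completion flag is an illustrative assumption.

```python
from collections import deque

class AddressQueue:
    """FIFO of load/store instructions; entries persist until graduation."""

    def __init__(self, size=16):
        self.fifo = deque(maxlen=size)

    def append(self, instr):
        self.fifo.append({"instr": instr, "done": False})

    def issue(self, completes):
        """Issue the oldest unfinished entry. If it fails (cache miss,
        dependency, resource conflict) it stays queued for reissue."""
        for e in self.fifo:
            if not e["done"]:
                e["done"] = completes
                return e["instr"]
        return None

    def graduate(self):
        """Remove the oldest entry, but only once it has completed."""
        if self.fifo and self.fifo[0]["done"]:
            return self.fifo.popleft()["instr"]
        return None

q = AddressQueue()
q.append("lw r1, 0(r2)")
q.append("sw r3, 8(r2)")
first_try = q.issue(completes=False)   # e.g. a cache miss: stays queued
retry = q.issue(completes=True)        # reissued until it completes
```

Keeping the queue in program order is what makes memory-address dependencies between loads and stores easy to check.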

Register Files The integer and floating-point register files each contain 64 physical registers. Execution units read operands from the register files and write results directly back.
Integer Register File: 7 read ports and 3 write ports
2 dedicated read ports and one dedicated write port for each ALU
2 dedicated read ports for the Address Calculation Unit
The seventh read port handles store, jump-register, and move-to-floating-point instructions
The third write port handles load, branch-and-link, and move-from-floating-point instructions
Floating-Point Register File: 5 read and 3 write ports
The adder and multiplier each have two dedicated read ports and one dedicated write port
The fifth read port handles store and move instructions
The third write port handles load and move instructions

ALU Block Diagram Each of the two integer ALUs contains a 64-bit adder and a logic unit. ALU1 also contains a 64-bit shifter and branch-conditional logic; ALU2 contains partial integer multiplier and integer divide logic.

Floating Point Execution Unit Block Diagram

Floating-Point Adder The adder performs floating-point addition, subtraction, compare, and conversion operations. The first stage subtracts the operand exponents, selects the larger operand, and aligns the smaller mantissa in a 55-bit right shifter. The second stage adds or subtracts the mantissas, depending on the operation and the signs of the operands.
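
The two stages can be sketched on simplified operands (plain integer mantissas rather than the real 55-bit datapath, positive values only):

```python
def fp_add(exp_a, man_a, exp_b, man_b):
    """Toy two-stage floating-point add: align, then add mantissas."""
    # Stage 1: select the larger operand and right-shift the smaller
    # mantissa by the exponent difference so both share one exponent.
    if exp_a < exp_b:
        exp_a, man_a, exp_b, man_b = exp_b, man_b, exp_a, man_a
    man_b >>= (exp_a - exp_b)
    # Stage 2: add the aligned mantissas (real hardware also subtracts,
    # then normalizes and rounds the result).
    return exp_a, man_a + man_b

# 16*2^3 + 16*2^1: the smaller operand's mantissa is shifted right by 2.
exp, man = fp_add(3, 16, 1, 16)
```

Normalization and rounding are omitted here; the sketch only shows why the exponent difference determines the alignment shift.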

Floating-Point Multiplier During the first cycle, the unit Booth-encodes the 53-bit mantissa of the multiplier and uses it to select 27 partial products. A compression tree uses an array of (4,2) carry-save adders (CSAs), each of which sums 4 bits into 2 sum and carry outputs. During the second cycle, the resulting 106-bit sum and carry values are combined using a 106-bit carry-propagate adder. A final 53-bit adder rounds the result.

Memory Hierarchy To run large programs effectively, the R10000 implements a non-blocking memory hierarchy with two levels of set-associative caches. The chip also controls a large external secondary cache. All caches use a least-recently-used (LRU) replacement algorithm. Both primary caches use a virtual index and a physical address tag. To minimize latency, the processor can access each primary cache concurrently with address translation in its TLB. This technique simplifies the cache design. Disadvantage: it works well only as long as the program uses a consistent virtual index to reference the same page. The secondary cache controller detects any violation and ensures that the primary caches retain a single copy of each cache line.

Address calculation unit and data cache block diagram

Load and Store Unit The address queue issues load and store instructions to the address calculation unit and the data cache. When the cache is not busy, a load instruction simultaneously accesses the TLB, the cache tag array, and the cache data array.
Address Calculation The R10000 calculates a virtual memory address as the sum of two 64-bit registers or the sum of a register and a 16-bit immediate field. Results from the data cache or the ALUs can bypass the register files into the operand registers. The TLB translates the virtual address to a physical address.
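
The two addressing forms can be sketched directly; the masking models 64-bit wraparound, and the sign extension handles negative 16-bit immediates (the helper names are illustrative):

```python
MASK64 = (1 << 64) - 1

def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a full-width signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def calc_address(base, index=None, imm=None):
    """Virtual address = base + index register, or base + 16-bit immediate."""
    offset = index if index is not None else sign_extend16(imm)
    return (base + offset) & MASK64

addr = calc_address(0x1000, imm=0xFFF8)   # immediate 0xFFF8 encodes -8
```

This is the sum the 44-bit address calculation unit produces before the TLB translates it.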

Memory Address Translation The TLB translates the 44-bit virtual address to a 40-bit physical address. It has 64 entries. Each entry maps a pair of virtual pages and independently selects a page size of any power of 4 between 4 Kbytes and 16 Mbytes. The TLB consists of a content-addressable memory (CAM) section, which compares virtual addresses, and a RAM section, which contains the corresponding physical addresses.
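
A lookup in such a paired-entry TLB can be sketched as follows. The fixed 4-Kbyte page size and the dictionary standing in for the CAM section are simplifying assumptions; the real TLB selects a page size per entry.

```python
PAGE_SHIFT = 12                      # 4-Kbyte pages for this sketch

class TLB:
    """Each entry maps a pair of adjacent virtual pages to two frames."""

    def __init__(self):
        self.entries = {}            # virtual-page-pair -> (even, odd) frames

    def insert(self, vpn_pair, even_frame, odd_frame):
        self.entries[vpn_pair] = (even_frame, odd_frame)

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        frames = self.entries.get(vpn >> 1)   # CAM match on the page pair
        if frames is None:
            raise LookupError("TLB miss")
        frame = frames[vpn & 1]               # low VPN bit picks even/odd page
        return (frame << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))

tlb = TLB()
tlb.insert(vpn_pair=0x10, even_frame=0x80, odd_frame=0x81)
paddr = tlb.translate((0x21 << PAGE_SHIFT) | 0x123)   # odd page of pair 0x10
```

Mapping page pairs halves the number of CAM entries needed for a given coverage, which is why each of the 64 entries maps two pages.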

Clock and Output Drivers

Clocks An on-chip phase-locked loop generates all timing synchronously with an external system interface clock.
System Interface The R10000 communicates with the outside world using a 64-bit split-transaction system bus with multiplexed address and data.

Performance An aggressive superscalar microprocessor, the R10000 features fast clocks and a non-blocking, set-associative memory subsystem. Its design emphasizes concurrency and latency-hiding techniques to effectively run large real-world applications.


Thank You