UltraSPARC III
Hari P. Ananthanarayanan, Anand S. Rajan

Presentation Outline
- Background
- Introduction to the UltraSPARC
- Instruction Issue Unit
- Integer Execute Unit
- Floating Point Unit
- Memory Subsystem

Introduction
- 3rd generation of Sun Microsystems' 64-bit SPARC V9 architecture
- Design targets:
  - 600 MHz clock
  - 70 W power at 1.8 V
  - 0.25-micron process with 6 metal layers
  - Transistor count: 12 million (RAM), 4 million (logic)
  - Die size of 360 mm²

A Tour of the UltraSPARC
- 14-stage pipeline
- Instruction Issue Unit: stages A through J
- Integer Execution Unit: stages R through D
- Data Cache Unit: stages E through W
- Floating Point Unit: stages E through D
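For orientation, the stage letters named in these slides can be strung into the full 14-stage sequence. A minimal sketch in C; the ordering below follows Sun's published pipeline description and should be read as an assumption wherever these slides leave it implicit:

    #include <stdio.h>

    /* The 14 pipeline stages as one sequence.  The slides name A..J
     * (issue), R..D (integer), and E..W (data cache); the letters and
     * ordering in between are an assumption from Sun's published
     * description of the part. */
    static const char *stages[14] = {
        "A", "P", "F", "B", "I", "J",   /* Instruction Issue Unit */
        "R", "E", "C", "M", "W",        /* execute and data cache */
        "X", "T", "D"                   /* extend, trap, done     */
    };

    int main(void) {
        for (int i = 0; i < 14; i++)
            printf("stage %2d: %s\n", i + 1, stages[i]);
        return 0;
    }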

Design Goals
- Minimum latency on the integer execution path, which determines the cycle time: limit each stage to approximately 8 logic gates
- Minimize performance degradation due to clock overhead; e.g., the on-chip caches are wave pipelined
- Minimize branch misprediction latency: use a miss queue

Instruction Pipeline

Instruction Issue Unit

- The UltraSPARC III is a static-speculation machine: the compiler lays out the speculated path sequentially, which places fewer requirements on the fetch unit
- Stage A contains a small, 32-byte buffer to support sequential prefetching into the instruction cache (sketched below)
- The I-cache access takes 2 cycles (stages P and F) and is wave pipelined
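The 32-byte buffer supports one-line-ahead sequential fetch. A minimal sketch of that behavior; the structure, the stub fill function, and the line granularity are all assumed for illustration:

    #include <stdint.h>
    #include <string.h>

    #define PBUF_BYTES 32   /* the 32-byte buffer named in the slide */

    /* One-line-ahead sequential prefetch: while the pipeline consumes
     * the current instruction-cache line, the next sequential line is
     * staged here.  fetch_line() stands in for the fill path. */
    typedef struct {
        uint64_t addr;               /* address of the staged line */
        uint8_t  bytes[PBUF_BYTES];
        int      valid;
    } prefetch_buf_t;

    static void fetch_line(uint64_t addr, uint8_t *dst) {
        (void)addr;                  /* illustrative stub fill */
        memset(dst, 0, PBUF_BYTES);
    }

    /* Called with each new fetch PC: keep the buffer one line ahead. */
    void advance(prefetch_buf_t *pb, uint64_t fetch_pc) {
        uint64_t line = fetch_pc & ~(uint64_t)(PBUF_BYTES - 1);
        if (!pb->valid || pb->addr != line + PBUF_BYTES) {
            fetch_line(line + PBUF_BYTES, pb->bytes);
            pb->addr  = line + PBUF_BYTES;
            pb->valid = 1;
        }
    }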

Instruction Issue Unit – Contd.
- The ITLB and the branch prediction mechanism are overlapped with the I-cache access
- The branch target address is generated only in stage B and redirected to stage A if the branch is taken
- A 20-entry instruction queue and a 4-entry miss queue; the latter stores the alternate execution path to mitigate the effects of misprediction
- Stages I and J decode and dispatch instructions; scoreboarding checks for operand dependences (see the sketch below)
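The scoreboarding used at dispatch can be pictured as a per-register pending bit: an instruction issues only when none of its operands is still being produced. A minimal sketch, not the actual UltraSPARC III logic; names and the one-bit-per-register encoding are assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    /* 32 integer registers: one pending bit each in a 32-bit word. */
    static uint32_t pending;   /* bit r set: register r has a result in flight */

    /* Dispatch check: both sources ready and no WAW on the destination. */
    bool can_dispatch(int rs1, int rs2, int rd) {
        uint32_t need = (1u << rs1) | (1u << rs2) | (1u << rd);
        return (pending & need) == 0;
    }

    void on_dispatch(int rd)  { pending |=  (1u << rd); }  /* result in flight */
    void on_writeback(int rd) { pending &= ~(1u << rd); }  /* result available */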

Branch Prediction Mechanism
- A slightly modified Gshare algorithm with 16K saturating 2-bit counters; the three low-order bits of the predictor index come from the PC only
- 8-cycle misprediction penalty, since the front-end stages must be drained
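A gshare predictor with 16K two-bit counters implies a 14-bit index; the modification keeps the three low-order index bits history-free. A minimal sketch of that indexing, where the exact placement of the history bits is an assumption:

    #include <stdint.h>
    #include <stdbool.h>

    #define PRED_BITS  14                 /* 16K two-bit counters */
    #define TABLE_SIZE (1u << PRED_BITS)

    static uint8_t  counters[TABLE_SIZE]; /* 2-bit saturating counters */
    static uint16_t ghist;                /* global branch history     */

    /* Modified gshare index: the three low-order bits come from the PC
     * only; the remaining bits are PC XOR global history. */
    static uint32_t pred_index(uint32_t pc) {
        uint32_t pcbits = (pc >> 2) & (TABLE_SIZE - 1);  /* drop byte offset */
        uint32_t idx = pcbits ^ ((uint32_t)ghist << 3);  /* history off the low bits */
        idx = (idx & ~7u) | (pcbits & 7u);               /* low 3 bits: PC only */
        return idx & (TABLE_SIZE - 1);
    }

    bool predict(uint32_t pc) {
        return counters[pred_index(pc)] >= 2;            /* taken if 2 or 3 */
    }

    void update(uint32_t pc, bool taken) {
        uint8_t *c = &counters[pred_index(pc)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghist = (uint16_t)((ghist << 1) | (taken ? 1 : 0));
    }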

Integer Execute Unit
- Executes loads, stores, shifts, arithmetic, logical, and branch instructions
- Up to 4 integer instructions per cycle: 2 arithmetic/logical/shift, 1 load/store, and 1 branch
- The entire datapath (the E stage) uses dynamic precharge circuits
- A future-file technique handles exceptions: separate working and architectural register files (WARF)

Integer Execute Unit – Contd.
- Integer execution reads the WRF in the R stage and writes it in the C stage
- Results are committed to the ARF at the end of the pipe; on an exception, the ARF is copied back into the WRF (sketched below)
- Integer multiply and divide are not pipelined and execute in the ASU; the strategy is to decouple less frequently executed instructions
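The WRF/ARF future-file discipline reduces to three operations: speculative write, commit, and exception recovery. A minimal sketch; structure and names are illustrative, not Sun's design:

    #include <stdint.h>
    #include <string.h>

    #define NREGS 32

    /* Working/architectural register files (WARF): results update the
     * working file speculatively, commit copies them into the
     * architectural file, and an exception restores the working file
     * from the architectural one. */
    typedef struct {
        uint64_t wrf[NREGS];   /* working file: read in R, written in C */
        uint64_t arf[NREGS];   /* architectural file: committed state   */
    } warf_t;

    void exec_write(warf_t *rf, int rd, uint64_t value) {
        rf->wrf[rd] = value;                 /* speculative result */
    }

    void commit(warf_t *rf, int rd) {
        rf->arf[rd] = rf->wrf[rd];           /* instruction retires cleanly */
    }

    void take_exception(warf_t *rf) {
        /* squash speculative state: ARF copied into WRF */
        memcpy(rf->wrf, rf->arf, sizeof rf->wrf);
    }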

Floating Point Unit
- Executes floating-point and partitioned fixed-point (graphics) instructions
- 3 datapaths:
  - 4-stage divide/multiply
  - 4-stage add/subtract/compare
  - Unpipelined divide/square root
- The FPU is pushed back by one stage to keep the integer unit compact (countering the effect of wire delays)

Data Cache Unit

Memory – L1 Data Cache
- 64 KB, 4-way set associative, 32-byte lines
- 2-cycle access time, wave pipelined
- Sum-addressed memory (SAM) combines the address addition with the word-line decode (see the sketch below)
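Sum-addressed memory relies on deciding whether base + offset matches a given word line without a carry-propagate add. A minimal sketch of that carry-free equality test; in hardware the test is folded into each word line's decoder rather than computed as a function like this:

    #include <stdint.h>
    #include <stdio.h>

    /* Carry-free test of whether a + b == k: compare the carry vector
     * the equality would require against the carries actually
     * generated, with no carry-propagate adder on the path. */
    static int sum_equals(uint32_t a, uint32_t b, uint32_t k) {
        uint32_t want = a ^ b ^ k;                        /* carry each bit needs    */
        uint32_t have = ((a & b) | ((a ^ b) & ~k)) << 1;  /* carry each bit receives */
        return want == have;
    }

    int main(void) {
        /* each SAM word line k would evaluate sum_equals(base, offset, k) */
        printf("%d\n", sum_equals(0x1234, 0x0FF0, 0x1234 + 0x0FF0)); /* prints 1 */
        printf("%d\n", sum_equals(0x1234, 0x0FF0, 0x2000));          /* prints 0 */
        return 0;
    }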

Memory – Prefetch Cache
- 2 KB, 2-way set associative, 64-byte lines
- Multi-ported SRAM
- Supports streaming data (similar to stream buffers)
- Detects striding loads; hardware prefetches are issued independently of software prefetch (sketched below)
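Stride detection of this kind is commonly built as a per-PC table of last address and stride, confirming a stride before prefetching. A minimal sketch under that assumption; the table size and indexing are illustrative, not UltraSPARC III's actual design:

    #include <stdint.h>
    #include <stdio.h>

    #define STRIDE_ENTRIES 64   /* table size is an assumption */

    typedef struct {
        uint64_t last_addr;
        int64_t  stride;
        int      confirmed;     /* same stride seen twice in a row */
    } stride_entry_t;

    static stride_entry_t table[STRIDE_ENTRIES];

    /* On every load, update the entry indexed by the load PC; return
     * the address to prefetch, or 0 if no stable stride is observed. */
    uint64_t observe_load(uint64_t pc, uint64_t addr) {
        stride_entry_t *e = &table[(pc >> 2) % STRIDE_ENTRIES];
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confirmed = (stride != 0 && stride == e->stride);
        e->stride    = stride;
        e->last_addr = addr;
        return e->confirmed ? addr + stride : 0;
    }

    int main(void) {
        /* striding loads from one PC: prefetches start once confirmed */
        for (int i = 0; i < 4; i++) {
            uint64_t p = observe_load(0x400, 0x10000 + 8 * i);
            if (p) printf("prefetch 0x%llx\n", (unsigned long long)p);
        }
        return 0;
    }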

Memory – Write Cache
- 2 KB, 4-way set associative, 64-byte lines
- Reduces the bandwidth consumed by store traffic
- Sole source of on-chip dirty data, which makes on-chip cache consistency easy to handle
- Write-validate scheme: loads multiplex byte-wise between L2 bytes and write-cache bytes (sketched below)
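Write-validate merging can be modeled as a per-byte valid mask: a load takes write-cache bytes where a store has written them and L2 bytes everywhere else. A minimal sketch with illustrative structure names:

    #include <stdint.h>

    #define LINE 64

    /* Per-byte valid mask behind the write-validate merge: stores set
     * only the bytes they write; nothing is fetched to fill the rest. */
    typedef struct {
        uint8_t data[LINE];
        uint8_t valid[LINE];    /* 1 = byte written by a store */
    } wc_line_t;

    void wc_store(wc_line_t *wc, int off, const uint8_t *src, int n) {
        for (int i = 0; i < n; i++) {
            wc->data[off + i]  = src[i];
            wc->valid[off + i] = 1;     /* validate written bytes only */
        }
    }

    /* Merge for a load: byte-wise mux between write cache and L2. */
    void wc_load(const wc_line_t *wc, const uint8_t *l2_line, uint8_t *dst) {
        for (int i = 0; i < LINE; i++)
            dst[i] = wc->valid[i] ? wc->data[i] : l2_line[i];
    }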

External Memory Interface
- L2 cache: direct-mapped, unified instruction and data, 12-cycle access time
- The cache controller provides programmable support for 4 MB or 8 MB
- On-chip main memory controller
- On-chip tags allow an associative L2 cache design without a latency penalty
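Keeping the L2 tags on chip means the hit/miss decision is made locally while the off-chip data access proceeds. A minimal sketch of a direct-mapped lookup under that split; the sizes assume the 8 MB configuration and a 64-byte line, both illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define L2_LINE  64
    #define L2_SIZE  (8u << 20)            /* 8 MB configuration (assumed) */
    #define L2_SETS  (L2_SIZE / L2_LINE)

    static uint32_t l2_tags[L2_SETS];      /* on-chip tag array */

    /* The off-chip data SRAM access can be launched with set_out while
     * the on-chip tag compare decides hit or miss in parallel. */
    bool l2_lookup(uint64_t paddr, uint32_t *set_out) {
        uint32_t set = (uint32_t)((paddr / L2_LINE) % L2_SETS);
        uint32_t tag = (uint32_t)(paddr / ((uint64_t)L2_LINE * L2_SETS));
        *set_out = set;                    /* start the data access now  */
        return l2_tags[set] == tag;        /* on-chip compare: hit/miss  */
    }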

Layout of UltraSPARC III