Niagara: A 32-Way Multithreaded Sparc Processor (Kongetira, Aingaran, Olukotun). Presentation by Mohamed Abuobaida Mohamed for COE502: Parallel Processing Architectures.

Outline
- Introduction
- Motivation for Niagara
- Niagara Overview
- Sparc Pipeline
- Thread Selection Policy
- Integer Register File
- Memory
- Conclusion

Introduction
A multithreaded processor allows more than one thread of execution to be resident on the CPU at the same time.
Types of multithreading:
- Coarse-grained: the hardware switches threads on a long-latency event (e.g., a cache miss)
- Fine-grained: the hardware switches threads every cycle
- Simultaneous multithreading (SMT): instructions from several threads issue in the same cycle, sharing most pipeline resources
- Chip multiprocessing (CMP): more than one CPU core on the same chip
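The two switch policies above can be contrasted with a small scheduling sketch. This is an illustrative software model, not hardware: fine-grained rotates threads every cycle, while coarse-grained runs one thread until it stalls.

```python
# Illustrative scheduler traces for the two switch policies (not hardware).
def fine_grained(threads, cycles):
    """Fine-grained: rotate to the next thread every cycle."""
    return [threads[c % len(threads)] for c in range(cycles)]

def coarse_grained(threads, stall_at, cycles):
    """Coarse-grained: run the current thread until it stalls, then switch."""
    trace = []
    cur = 0
    for c in range(cycles):
        if c == stall_at:          # stall event (e.g., cache miss)
            cur = (cur + 1) % len(threads)
        trace.append(threads[cur])
    return trace

assert fine_grained(["T0", "T1"], 4) == ["T0", "T1", "T0", "T1"]
assert coarse_grained(["T0", "T1"], stall_at=2, cycles=4) == ["T0", "T0", "T1", "T1"]
```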

Motivation for Niagara
ILP does not provide enough parallelism, for two reasons:
- Memory latency dominates execution time
- Applications have inherently low ILP
Niagara is instead designed to improve throughput: the total work completed across multiple threads.
Another target is power consumption, addressed by:
- A reduced clock frequency
- Sharing of pipelines among threads

Motivation for Niagara
- Designed for high-performance server applications, which exhibit client request-level parallelism (TLP)
- For such workloads, several simple single-issue CPUs sharing memory outperform a complex multiple-issue CPU
- Niagara therefore combines CMP with fine-grained multithreading

Niagara Overview
- Supports 32 hardware threads
- Threads are grouped in fours; each group shares a pipeline called a Sparc pipe (8 Sparc pipes in total)
- Each Sparc pipe has separate L1 I- and D-caches
- Memory latency is hidden by switching threads at a cost of zero cycles
- Shared 3 MB L2 cache, 12-way set-associative
- A crossbar interconnect with 200 GB/s of bandwidth connects the pipes, the L2 cache, and the other shared CPU resources
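A back-of-envelope sketch shows why four threads per pipe are enough to hide memory latency. The run/stall cycle counts below are illustrative numbers, not figures from the paper; the model assumes threads stall independently and other threads' work fills the gap.

```python
# Illustrative model: pipe utilization with N threads interleaved.
# run_cycles = useful cycles a thread runs before stalling,
# stall_cycles = cycles it then waits on memory.
def pipe_utilization(n_threads, run_cycles, stall_cycles):
    """Fraction of cycles the pipe issues an instruction (saturates at 1.0)."""
    busy = n_threads * run_cycles
    return min(1.0, busy / (run_cycles + stall_cycles))

# One thread that runs 25 cycles then stalls 75 leaves the pipe 75% idle;
# four such threads keep it fully busy.
assert pipe_utilization(1, 25, 75) == 0.25
assert pipe_utilization(4, 25, 75) == 1.0
```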

Niagara Block Diagram

Sparc Pipeline
- Single-issue, six-stage pipeline: fetch, thread select, decode, execute, memory, write back
- Each pipeline supports 4 threads
- Each thread has its own instruction buffer, store buffer, and register set
- The L1 caches, TLBs, and functional units are shared
- Pipeline registers are shared, except in the fetch and thread-select stages

Sparc Pipeline
- Fetch: 64-entry ITLB access; two instructions fetched per cycle, each with a predecode bit
- Thread select: instructions are inserted into the instruction buffer if resources are busy
- Decode: instruction decode and register file access; results are forwarded to data-dependent instructions
- Execute: ALU and shift operations complete in a single cycle; mul and div are long-latency and cause a thread switch
- Memory: DTLB access; can raise a late trap, which flushes all subsequently fetched instructions from the same thread and triggers a thread switch
- Write back
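The late-trap behavior in the memory stage can be sketched as a toy model. The pipeline is a list of per-stage instruction slots; on a trap, younger instructions (those in earlier stages) from the trapping thread are squashed into bubbles, while other threads' instructions proceed. This is an illustrative model, not the actual RTL.

```python
# Toy model of the six-stage pipe: one instruction slot per stage,
# index 0 = fetch (youngest), index 5 = write back (oldest).
STAGES = ["fetch", "thread_select", "decode", "execute", "memory", "writeback"]

def flush_same_thread(pipeline, trap_stage):
    """On a late trap at trap_stage, squash younger instructions
    (earlier stages) that belong to the trapping thread."""
    tid = pipeline[trap_stage]["thread"]
    for i in range(trap_stage):
        if pipeline[i] is not None and pipeline[i]["thread"] == tid:
            pipeline[i] = None  # bubble
    return pipeline

# Instructions from threads 0-3 in flight; thread 0's load in the memory
# stage (index 4) misses in the D-cache and raises a late trap.
pipe = [{"thread": 0}, {"thread": 1}, {"thread": 0},
        {"thread": 2}, {"thread": 0}, {"thread": 3}]
pipe = flush_same_thread(pipe, trap_stage=4)
assert pipe[0] is None and pipe[2] is None          # thread 0's younger insts squashed
assert pipe[1] == {"thread": 1} and pipe[3] == {"thread": 2}  # other threads unaffected
```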

Sparc Pipeline

Thread Selection Policy
- A different thread can be selected every cycle
- The least recently used ready thread has the highest priority
- Speculative instructions (those fetched after a load) have low priority
- Threads are deselected because of long-latency instructions such as mul and div, or because of traps discovered late in the pipeline, as on a cache miss
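The policy above can be sketched as a small selector: least-recently-used ordering among the four threads of a pipe, with blocked threads skipped. The class and method names are ours; this is a behavioral sketch, not Sun's implementation.

```python
# Behavioral sketch of LRU thread selection among 4 hardware threads.
class ThreadSelector:
    def __init__(self, n_threads=4):
        self.lru = list(range(n_threads))   # front = least recently used
        self.ready = set(range(n_threads))  # threads not blocked

    def deselect(self, tid):
        """Block a thread (long-latency mul/div, or a late trap)."""
        self.ready.discard(tid)

    def wake(self, tid):
        """Unblock a thread when its long-latency op or miss completes."""
        self.ready.add(tid)

    def select(self):
        """Pick the least recently used ready thread; None if all blocked."""
        for tid in self.lru:
            if tid in self.ready:
                self.lru.remove(tid)
                self.lru.append(tid)   # now most recently used
                return tid
        return None

sel = ThreadSelector()
assert sel.select() == 0   # LRU order starts 0, 1, 2, 3
assert sel.select() == 1
sel.deselect(2)            # thread 2 issues a div and is deselected
assert sel.select() == 3   # thread 2 is skipped
sel.wake(2)
assert sel.select() == 2
```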

Integer Register File
- 3 read ports: enough for single issue, stores, and the few 3-source instructions
- 2 write ports: for single issue and long-latency operations
- Register-window implementation:
  - A window holds ins, outs, and locals
  - Each thread has 8 windows
  - A procedure call slides the window up (allocating a new window); a procedure return slides it down
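The sliding-window behavior can be modeled with a current-window pointer per thread. The 8-windows-per-thread figure is from the slide; the wrap-around note and class structure are an illustrative simplification (real SPARC hardware traps to spill/fill windows on overflow).

```python
# Toy model of per-thread SPARC register windows (8 windows per thread).
N_WINDOWS = 8

class WindowedRegs:
    def __init__(self):
        self.cwp = 0  # current window pointer

    def call(self):
        """Procedure call: slide the window up."""
        self.cwp = (self.cwp + 1) % N_WINDOWS  # wrap -> spill trap in real HW

    def ret(self):
        """Procedure return: slide the window down."""
        self.cwp = (self.cwp - 1) % N_WINDOWS  # wrap -> fill trap in real HW

r = WindowedRegs()
r.call(); r.call()     # two nested calls
assert r.cwp == 2
r.ret()                # return from the inner one
assert r.cwp == 1
```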

Integer Register File
- Divided into a working set and an architectural set
- The working set is implemented with fast register file cells; the architectural set consists of SRAM cells
- The two sets are linked by a transfer port
- A window change triggers a thread switch
- Threads share the read circuitry but not the registers

Integer Register File

Memory
- L1 I-cache: 16 KB, 4-way set-associative, 32 B blocks, random replacement policy
- L1 D-cache: 8 KB, 4-way set-associative, 16 B blocks, write-through policy
- A simple coherence protocol keeps the L1 caches consistent
- The L2 cache implements a write-back policy
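The geometry of these caches follows from the standard relation sets = size / (associativity x block size). The sizes and associativities come from the slide; the helper function and the assumed 64 B L2 line size (not stated here) are ours.

```python
# Deriving set counts from the cache parameters on this slide.
def cache_sets(size_bytes, assoc, block_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (assoc * block_bytes)

# L1 I-cache: 16 KB, 4-way, 32 B blocks
assert cache_sets(16 * 1024, 4, 32) == 128
# L1 D-cache: 8 KB, 4-way, 16 B blocks
assert cache_sets(8 * 1024, 4, 16) == 128
# L2: 3 MB, 12-way, assuming 64 B lines (line size is an assumption)
assert cache_sets(3 * 1024 * 1024, 12, 64) == 4096
```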

Conclusion
- Designed for low power
- Requires no special compiler support
- Exploits TLP rather than ILP
- Well suited to client-server applications
- Implements fine-grained multithreading combined with CMP