COMP25212 Coarse Grain Multithreading

Learning Objectives:
– To be able to describe a coarse-grain multithreading implementation
– To be able to estimate the performance of this implementation
– To be able to state the important assumptions of this performance model

CPU Support for Multithreading

[Figure: five-stage pipeline (Fetch, Decode, Exec, Mem, Write logic) with a shared instruction cache, data cache and address translation; per-thread state is duplicated: PC A / PC B, VA Mapping A / VA Mapping B, GPRs A / GPRs B]

Design Issue: when to switch threads
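The duplicated state in the figure can be made concrete with a minimal sketch, assuming a two-thread design. This is illustrative Python only; the names (ThreadContext, asid, a 32-entry register file) are my assumptions, not from the slides:

```python
# Minimal sketch of the per-thread state the diagram duplicates.
# Names (ThreadContext, asid) are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                    # per-thread program counter (PC A / PC B)
    gprs: list = field(default_factory=lambda: [0] * 32)  # per-thread GPRs
    asid: int = 0                  # selects this thread's VA mapping

# Two hardware threads, A and B; the pipeline stages, caches and
# address-translation logic are shared between them.
contexts = [ThreadContext(asid=0), ThreadContext(asid=1)]
active = 0   # the design issue: when to change this index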

Coarse-Grain Multithreading

Switch thread on an “expensive” operation:
– E.g. I-cache miss
– E.g. D-cache miss

Some are easier than others!

Switch Threads on I-cache miss

Inst a:  IF  ID  EX  MEM WB
Inst b:      IF  ID  EX  MEM WB
Inst c:          IF-MISS …(miss serviced)… ID  EX  MEM WB
Inst X:              IF  ID  EX  MEM …
Inst Y:                  IF  ID  EX  …
Inst Z:                      IF  ID  …

On the miss, fetch switches to the other thread: the slots that would have issued instructions d, e and f carry X, Y and Z instead, and instruction c completes once the miss has been serviced.
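A toy model of this switch decision may help. The sketch below is not the lecture's design; the ICache stub and the one-instruction-per-slot fetch loop are assumptions made purely to show how, after instruction c misses, thread B's instructions occupy the fetch slots that d, e and f would have used:

```python
# Toy sketch of switch-on-I-cache-miss at the fetch stage.
# The ICache stub and fetch loop are illustrative assumptions.
class ICache:
    def __init__(self, missing):
        self.missing = set(missing)     # "addresses" that miss, for illustration

    def hit(self, pc):
        return pc not in self.missing

def fetch_slots(thread_a, thread_b, icache, n_slots):
    """Yield (thread, inst) for each fetch slot under switch-on-miss."""
    streams, active = [iter(thread_a), iter(thread_b)], 0
    for _ in range(n_slots):
        inst = next(streams[active])
        if not icache.hit(inst):
            active = 1 - active         # miss: hand the fetch slot to the other thread
            inst = next(streams[active])
        yield active, inst

icache = ICache(missing=["c"])          # instruction c misses in IF
for thread, inst in fetch_slots("abcdef", "XYZUVW", icache, 6):
    print(f"thread {'AB'[thread]} issues inst {inst}")
# Issues a, b, then X, Y, Z, U: thread B hides c's miss shadow.
# (Thread A would resume at c once the miss is serviced - not modelled here.)
```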

Performance of Coarse Grain

Assume (conservatively):
– 1GHz clock (1ns clock tick!), 20ns memory (= 20 clocks)
– 1 I-cache miss per 100 instructions
– 1 instruction per clock otherwise

Then, time to execute 100 instructions without multithreading:
– 100 + 20 = 120 clock cycles
– Instructions per Clock = 100 / 120 = 0.83

With multithreading, time to execute 100 instructions:
– 100 [+ 1 for the thread switch] clock cycles
– Instructions per Clock = 100 / 101 = 0.99
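The slide's arithmetic can be written out as a short script. This just restates the model above; all inputs are the slide's stated assumptions, with the bracketed "[+ 1]" read as a one-cycle thread-switch cost:

```python
# The slide's performance model, restated. Inputs are the slide's
# assumptions: 1GHz clock, 20-clock miss penalty, 1 I-cache miss per
# 100 instructions, 1 instruction per clock otherwise, 1-cycle switch.
INSTRUCTIONS = 100      # instructions between I-cache misses
MISS_PENALTY = 20       # 20ns memory at 1GHz = 20 clocks
SWITCH_COST  = 1        # the "[+ 1]" thread-switch cycle

# Without multithreading the pipeline stalls for the whole miss.
cycles_single = INSTRUCTIONS + MISS_PENALTY           # 120 cycles
ipc_single = INSTRUCTIONS / cycles_single             # ~0.83

# With coarse-grain multithreading another thread covers the miss,
# leaving only the switch overhead visible.
cycles_cgmt = INSTRUCTIONS + SWITCH_COST              # 101 cycles
ipc_cgmt = INSTRUCTIONS / cycles_cgmt                 # ~0.99

print(f"no multithreading: IPC = {ipc_single:.2f}")   # 0.83
print(f"coarse-grain MT:   IPC = {ipc_cgmt:.2f}")     # 0.99
```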

Switch Threads on D-cache miss

Inst a:  IF  ID  EX  M-MISS …(miss serviced)… WB
Inst b:      IF  ID  EX  MEM WB  \
Inst c:          IF  ID  EX  MEM  |
Inst d:              IF  ID  EX   |  abort these
Inst e:                  IF  ID   |
Inst f:                      IF  /
Inst X:                          IF  ID  …
Inst Y:                              IF  …

Performance: similar calculation (STATE ASSUMPTIONS!)
Where to restart after the memory cycle? I suggest instruction “a” – why?
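One hedged version of that "similar calculation", with the assumptions stated as the slide demands. The miss rate and penalty match the I-cache model; the accounting of the aborted shadow instructions is my assumption, not the slide's:

```python
# A similar calculation for switch-on-D-cache-miss. Extra assumptions
# (mine, not the slide's): 1 D-cache miss per 100 instructions, and the
# 5 shadow instructions b..f are aborted and re-issued after restarting
# at instruction a, so each miss wastes 5 issue slots plus a 1-cycle switch.
INSTRUCTIONS = 100
SWITCH_COST  = 1
ABORTED      = 5        # instructions killed in the shadow of the miss

cycles_cgmt = INSTRUCTIONS + ABORTED + SWITCH_COST    # 106 cycles
ipc_cgmt = INSTRUCTIONS / cycles_cgmt                 # ~0.94
print(f"coarse-grain MT, D-cache miss: IPC = {ipc_cgmt:.2f}")  # 0.94
```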

Coarse Grain Multithreading

Minimal pipeline changes:
– Need to abort instructions in the “shadow” of a miss
– Need to resume the instruction stream to recover

Good to compensate for infrequent but expensive pipeline disruption.

But…

Performance problems with multithreading?
a) ………………..
b) ………………..
c) ………………..