TRIPS Primary Memory System
Simha Sethumadhavan


2 Motivation

Trends
– Wire delay drives a partitioned microarchitecture, which in turn calls for a partitioned memory system
– The memory wall drives large instruction windows, which need a high-bandwidth memory system

Challenges
– Maintain sequential memory semantics (an inherently centralized problem)
– Low latency despite communication delays
– High bandwidth despite sequential memory semantics

Solutions
– Distributed LSQ
– Memory-side dependence predictor
– Address-interleaved LSQs, caches, and MSHRs

3 TRIPS Architectural Features

Load/store ordering (sequential semantics)
– Load/store dependences are tracked using LSIDs
– The LSID is encoded in a 5-bit field of the instruction

TRIPS block-atomic execution
– Architectural state is updated only when all register and memory outputs of a block have been generated
– The block header encodes the number of store outputs per block

Weak consistency
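To make the 5-bit LSID encoding concrete, here is a minimal C sketch of extracting an LSID from an instruction word. The bit position and the 32-bit instruction width are assumptions for illustration only; the slide specifies just the 5-bit field width, not its location.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed layout: the LSID occupies bits 8..12 of a 32-bit
     * memory instruction. Only the 5-bit width comes from the
     * slide; the shift amount is hypothetical. */
    #define LSID_SHIFT 8
    #define LSID_MASK  0x1Fu   /* 5 bits: LSIDs 0..31 per block */

    static uint32_t extract_lsid(uint32_t insn) {
        return (insn >> LSID_SHIFT) & LSID_MASK;
    }

    int main(void) {
        uint32_t insn = 0x00000B00;                 /* LSID field = 11 */
        printf("LSID = %u\n", extract_lsid(insn));  /* prints "LSID = 11" */
        return 0;
    }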

4 Major D-Tile Responsibilities

– Provide D-cache access for arriving loads and stores
– Perform address translation with the DTLB
– Handle cache misses with MSHRs
– Perform dynamic memory disambiguation with load/store queues
– Perform dependence prediction for aggressive load/store issue
– Detect per-block store completion
– Write stores to caches/memory upon commit
– Merge stores on L1 cache misses

5 Load Execution Scenarios

TLB  | Dep. Pred. | LSQ  | Cache | Response
-----+------------+------+-------+-----------------------------------------
Miss | -          | -    | -     | Report TLB exception
Hit  | Wait (hit) | -    | -     | Defer load until all prior stores are
     |            |      |       | received (non-deterministic latency)
Hit  | Miss       | Miss | Hit   | Forward data from L1 cache
Hit  | Miss       | Miss | Miss  | Forward data from L2 cache (or memory);
     |            |      |       | issue cache fill request (if cacheable)
Hit  | Miss       | Hit  | Hit   | Forward data from LSQ
Hit  | Miss       | Hit  | Miss  | Forward data from LSQ; issue cache fill
     |            |      |       | request (if cacheable)
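The table above translates directly into priority logic: the TLB is checked first, then the dependence predictor, then the LSQ, and the cache only decides where the data comes from and whether a fill is needed. The C sketch below encodes that reading of the table; the enum and function names are invented for illustration.

    #include <stdbool.h>

    /* Possible load responses, paraphrasing the scenario table. */
    typedef enum {
        RESP_TLB_EXCEPTION,  /* report TLB exception */
        RESP_DEFER,          /* wait for all prior stores to arrive */
        RESP_FROM_L1,        /* forward data from L1 cache */
        RESP_FROM_L2,        /* forward from L2/memory; issue fill */
        RESP_FROM_LSQ,       /* forward from an older store in the LSQ */
        RESP_FROM_LSQ_FILL   /* forward from LSQ; also issue fill */
    } load_resp_t;

    static load_resp_t load_response(bool tlb_hit, bool dp_wait,
                                     bool lsq_hit, bool cache_hit) {
        if (!tlb_hit) return RESP_TLB_EXCEPTION;
        if (dp_wait)  return RESP_DEFER;
        if (lsq_hit)  return cache_hit ? RESP_FROM_LSQ : RESP_FROM_LSQ_FILL;
        return cache_hit ? RESP_FROM_L1 : RESP_FROM_L2;
    }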

6 Load Pipeline

Common case: 2-cycle load-hit latency.

Cache hit (TLB hit, DP miss, LSQ miss):
– Cycle 1: access cache, TLB, DP, and LSQ
– Cycle 2: load reply

LSQ hit (TLB hit, DP miss, cache hit or miss):
– Cycle 1: access cache, TLB, DP, and LSQ
– Cycle 2: identify the matching store in the LSQ
– Cycle 3: read store data (pipeline stalls)
– Cycle 4: load reply

For LSQ hits, load latency varies from 4 to n+3 cycles, where n is the size of the load in bytes
– Stalls all pipelines except the forwarding stages (cycles 2 and 3)
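The latency figures above amount to a one-line model, sketched here with an invented function name. For example, an 8-byte load that forwards from the LSQ takes 3 + 8 = 11 cycles.

    /* Load-latency model from this slide: a cache hit takes 2 cycles;
     * an LSQ forward takes 3 + n cycles for an n-byte load, i.e. 4
     * cycles for a 1-byte load and n+3 in general. */
    static int load_latency(int lsq_hit, int size_bytes) {
        return lsq_hit ? 3 + size_bytes : 2;
    }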

7 Deferred Load Pipeline

Deferred load processing (a.k.a. the replay pipe):
– Cycle 1: access cache, TLB, DP, and LSQ (dependence predictor hit)
– Cycle 2: mark the load as waiting
– Cycle X: all prior stores have arrived
– Cycle X+1: prepare the load for replay
– Cycle X+2: re-execute as if it were a new load

Deferred loads are woken up when all prior stores have arrived at the D-tiles
– Loads between two consecutive stores can be woken up out of order

Deferred loads are re-injected into the main pipeline as if they were new loads
– During this phase, loads get their value from dependent stores, if any

Pipeline stalls occur
– When the deferred load in cycle X+2 cannot be re-injected into the main pipeline
– When the load cannot be prepared in cycle X+1 because of resource conflicts in the LSQ
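The wakeup condition ("all prior stores have arrived") can be sketched as a per-load check against arrived-store state. The bitmask representation below is an illustration, not the hardware mechanism; it also shows why loads between two consecutive stores can wake out of order: they see the same set of older stores.

    #include <stdbool.h>
    #include <stdint.h>

    /* One bit per LSID (5-bit LSIDs give at most 32 per block). */
    struct block_state {
        uint32_t arrived_stores; /* bit i set: store with LSID i arrived */
        uint32_t store_mask;     /* bit i set: block contains store LSID i */
    };

    /* True when no store older than load_lsid is still outstanding,
     * i.e. the deferred load can be prepared for replay. */
    static bool can_replay(const struct block_state *b, unsigned load_lsid) {
        uint32_t older = (load_lsid == 0) ? 0 : ((1u << load_lsid) - 1);
        uint32_t older_stores = b->store_mask & older;
        return (b->arrived_stores & older_stores) == older_stores;
    }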

8 Store Commit Pipeline

Stores are committed in LSID and block order at each D-tile
– Written into the L1 cache, or bypassed to the merge buffer on a cache miss
– One store committed per D-tile per cycle

Up to four stores total (one per D-tile) can be committed every cycle, possibly out of global order
– T-morph: store commits from different threads are not interleaved

On a cache miss, stores are inserted into the merge buffer

The commit pipe stalls when cache ports are busy
– Fills from missed loads take up cache write ports

Store commit pipeline stages:
– Cycle 1: pick a store to commit
– Cycle 2: read store data from the LSQ
– Cycle 3: access cache tags
– Cycle 4: check for hit or miss (possible stall)
– Cycle 5: write to cache or merge buffer (possible stall)
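Cycle 1's "pick a store to commit" respects block order first and LSID order within a block. A minimal C sketch of that selection for a single D-tile follows; the structures and naming are invented for illustration.

    /* One arrived store awaiting commit. */
    struct store_entry {
        int valid;  /* entry holds an arrived, committable store */
        int block;  /* architectural block (frame) number */
        int lsid;   /* load/store ID within the block */
    };

    /* Return the index of the oldest committable store (block order,
     * then LSID order), or -1 if none is ready this cycle. */
    static int pick_store_to_commit(const struct store_entry *q, int n) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!q[i].valid) continue;
            if (best < 0 ||
                q[i].block < q[best].block ||
                (q[i].block == q[best].block && q[i].lsid < q[best].lsid))
                best = i;
        }
        return best;
    }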

9 Store Tracking

Store arrival information must be shared between D-tiles for:
– Block completion detection
– Waking up deferred loads

The Data Status Network (DSN) communicates store arrival information
– Multi-hop broadcast network
– Each link carries the store's frame ID and LSID

DSN operation
– A store arrival initiates a transfer on the DSN
– All tiles know about the store arrival after 3 cycles

GSN operation
– Completion can be detected at any D-tile, but special processing is done in D-tile 0 before the signal is sent to the G-tile

[Figure: a store arriving at DT3 propagates over the DSN through DT2, DT1, and DT0 across cycles 0-2; each D-tile updates its store-count counter, and DT0 sends the block completion signal to the G-tile over the GSN.]
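Block completion then reduces to counting DSN store-arrival messages against the store count carried in the block header. A sketch with illustrative names:

    #include <stdbool.h>

    /* Per-frame completion state at a D-tile. */
    struct frame_status {
        int expected_stores; /* store-output count from the block header */
        int arrived_stores;  /* DSN store-arrival messages seen so far */
    };

    /* Called when a DSN message for this frame reaches the tile (up to
     * 3 cycles after the store arrives, per the slide). Returns true
     * when the block's stores are complete, at which point D-tile 0
     * signals the G-tile over the GSN. */
    static bool dsn_store_arrival(struct frame_status *f) {
        f->arrived_stores++;
        return f->arrived_stores == f->expected_stores;
    }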

10 Load and Store Miss Handling

Load miss operation
– Cycle 1: miss determination
– Cycle 2: MSHR allocation and merging
– Cycle 3: request sent out on the OCN, if available

Load miss return
– Cycle 1: wake up all loads waiting on the incoming data
– Cycle 2: select one load; data for the load miss starts arriving
– Cycle 3: prepare the load and re-inject it into the main pipeline, if unblocked

Store merging
– Attempts to merge consecutive writes to the same line
– Snoops all incoming load misses for coherency

Resources are shared across threads
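Cycle 2's "MSHR allocation and merging" pairs a missing load with an already-outstanding miss to the same cache line when possible, so only one request per line goes out. A minimal sketch, assuming 64-byte lines and an invented MSHR layout:

    #include <stdint.h>

    #define NUM_MSHRS 8                    /* illustrative size */
    #define LINE_MASK (~(uint64_t)63)      /* assumed 64-byte lines */

    struct mshr {
        int      valid;       /* a fill request is outstanding */
        uint64_t line_addr;   /* cache-line address being fetched */
        int      num_waiters; /* loads merged onto this miss */
    };

    /* Returns the MSHR index the load merged into or allocated, or -1
     * if all MSHRs are busy (the miss must stall and retry). */
    static int mshr_alloc_or_merge(struct mshr *m, uint64_t addr) {
        uint64_t line = addr & LINE_MASK;
        int free_idx = -1;
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (m[i].valid && m[i].line_addr == line) {
                m[i].num_waiters++;        /* merge with in-flight miss */
                return i;
            }
            if (!m[i].valid && free_idx < 0) free_idx = i;
        }
        if (free_idx >= 0) {               /* allocate a new MSHR */
            m[free_idx].valid = 1;
            m[free_idx].line_addr = line;
            m[free_idx].num_waiters = 1;
        }
        return free_idx;
    }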

11 D-Tile Pipelines Block Diagram

[Figure: block diagram of the D-tile pipelines: missed-load, line-fill, store-commit, replay-load, and store-miss/store-hit pipelines share the cache + LSQ load and store ports (with port arbitration), spill management, miss handling, and an arbitrated L2 request channel; load input arrives from and load output returns to the E-tile.]

Key points:
– Coherence checks (CC), e.g. a load and a store missing to the same cache line
– Bypassing: an unfinished store can bypass its data to a load in a previous stage; bypassing means fewer stalls and therefore higher throughput
– Costly resources are shared, e.g. cache and LSQ ports and L2 bandwidth

To cite: Design and Implementation of the TRIPS Primary Memory System, ICCD.