RAMP Gold: Architecture and Timing Model
Andrew Waterman, Zhangxi Tan, Rimas Avizienis, Yunsup Lee, David Patterson, Krste Asanović
Parallel Computing Laboratory, University of California, Berkeley

RAMP Gold Overview
– Tiled CMP simulator for the Par Lab InfiniCore target
– ISA: SPARC V8 (ARM/Thumb-2 later?)
– Split timing and function (both on FPGA)
– Host-multithreaded
– Runs on a Virtex-5 LX110T (XUP board)
[diagram: functional model pipeline holds architectural state; timing model pipeline holds timing state]

RAMP Gold Target Machine
[diagram: 64 SPARC V8 cores with I$ and D$, connected through a shared L2$ / interconnect to DRAM]

RAMP Gold v1 Target Features
– 64 single-issue, in-order SPARC V8 processors: simple 5-stage pipeline, with FPU
– Cache timing model: configurable size, line size, associativity, miss penalty, shared/private
– Change parameters without resynthesis (a parameter sketch follows below)
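
The runtime-configurable cache parameters above amount to a small set of control registers. A minimal C sketch of that parameter set (names hypothetical; RAMP Gold itself is FPGA hardware, so these would be registers written over the host interface):

```c
/* Hypothetical timing-model parameter block, mirroring the list above.
 * In hardware these would be control registers set at runtime, which
 * is why the cache model can change without resynthesis. */
struct cache_timing_cfg {
    unsigned size_bytes;    /* total capacity, e.g. 32 KB */
    unsigned line_bytes;    /* line size, e.g. 64 B */
    unsigned assoc;         /* associativity (number of ways) */
    unsigned miss_penalty;  /* target cycles charged on a miss */
    int      shared;        /* 1 = shared among cores, 0 = private */
};
```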

RAMP Gold Architecture
Mapping the target machine directly to an FPGA is inefficient. Solution: split timing and function, plus multithreading.
– The timing logic decides how many target cycles an instruction sequence should take
– Simulating the functionality of an instruction might take multiple host cycles
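
As a host-software analogy of this split (illustrative only, not the actual FPGA logic): functional execution may take a variable number of host cycles, while target time is charged solely by the timing model.

```c
#include <stdint.h>

typedef struct { uint32_t raw; } inst_t;
typedef struct { uint32_t next_pc; int is_load; int miss; } result_t;

/* Stubs standing in for the functional model (hypothetical names). */
static inst_t   fm_fetch(uint32_t pc) { inst_t i = { pc }; return i; }
static result_t fm_execute(inst_t i)  { result_t r = { i.raw + 4, 0, 0 }; return r; }

/* Timing model: decides how many *target* cycles the instruction takes,
 * regardless of how many host cycles fm_execute() burned. */
static uint64_t tm_cost(result_t r)
{
    return (r.is_load && r.miss) ? 1 + 50 : 1;  /* 50 = assumed miss penalty */
}

uint64_t simulate(uint32_t pc, uint64_t n_insts)
{
    uint64_t target_cycles = 0;
    while (n_insts--) {
        inst_t   inst = fm_fetch(pc);      /* functional: may take many host cycles */
        result_t r    = fm_execute(inst);  /* e.g. FDIV iterates on the host */
        target_cycles += tm_cost(r);       /* timing: target cycles only */
        pc = r.next_pc;
    }
    return target_cycles;
}
```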

Function/Timing Split Advantages
– Flexibility: configure the target at runtime; synthesize the design once, then change target model parameters at will
– Efficient FPGA resource usage:
  – Example 1: model a 2-cycle FPU in 10 host cycles
  – Example 2: model a 16 MB L2$ using only 256 KB of host BRAM to store tags/metadata
– Enables multithreading
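
A back-of-the-envelope check of Example 2, assuming the 64-byte lines used elsewhere in this talk: only tags and metadata need fast host BRAM, while the 16 MB of target data can live in host DRAM.

\[
\frac{16\,\mathrm{MB}}{64\,\mathrm{B/line}} = 2^{18} \approx 262\,\mathrm{K\ lines},
\qquad
\frac{256\,\mathrm{KB}}{2^{18}\,\mathrm{lines}} = 8\,\mathrm{bits/line}
\]

So roughly one byte per line suffices for a short tag plus state bits under these assumed parameters.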

Split Timing and Function
– Functional model executes the ISA correctly
– Timing model determines how long a program takes to run
[diagram: Target Machine (CPU, L1 D$, MEM) = Functional Model (CPU FM, L1 D$ FM, MEM FM) + Timing Model (CPU TM, L1 D$ TM, MEM TM)]

TM + FM from 30,000 ft
[diagram: CPU timing models exchange instructions and instruction-complete signals with CPU functional models; ld/st addresses and stalls flow through the L1 D$ and memory timing models, while ld/st addresses, store data, and load data flow through the memory functional models]
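
The arrow labels in this diagram define a small message interface between the timing and functional models. A hypothetical C rendering of those signals (field names taken from the diagram; struct names invented):

```c
#include <stdint.h>

/* What the CPU timing model hands the CPU functional model. */
typedef struct {
    uint32_t instruction;          /* instruction to execute functionally */
} cpu_tm_to_fm;

/* What flows toward the memory models on a load or store. */
typedef struct {
    uint32_t ldst_address;         /* ld/st address */
    uint32_t store_data;           /* data for stores */
} mem_request;

/* What comes back. */
typedef struct {
    int      stall;                /* memory TM: hold the CPU TM this cycle */
    uint32_t load_data;            /* memory FM: value returned for loads */
    int      instruction_complete; /* CPU FM: instruction done, retire it */
} model_response;
```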

TM + FM from 3,000 ft
[diagram: the CPU TM is a pipeline (IF, CTRL, DEC, EX, MEM, WB) coupled to the CPU FM; the L1 D$ TM has stages TM1 and TM2; the same instruction / ld/st address / store data / stall / load data / instruction complete signals connect to the memory TM and FM]

Example: Target Load Miss
[same diagram as above, tracing a load miss: the L1 D$ TM stalls the CPU TM while the memory models supply the load data]

Timing-Driven Host Pipeline
[diagram: a CPU/D$ timing model pipeline (TS, IF, DE, EX, MEM1, MEM2, WB) drives the CPU functional model; the L1 D$ TM (TM1, TM2, TM3) connects to the target memory TM/FM through a store buffer and a load result buffer; pipeline tokens are tagged {TID,INST} and {TID,ADDR}, with threads T0, T1, T2 (e.g. ADD, LD, ST) interleaved in consecutive stages]
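
Host multithreading is what makes this pipeline efficient: each host cycle a different target core's instruction occupies each stage, so tokens carry a thread ID as in the {TID,INST} and {TID,ADDR} labels above. A simplified software analogy of the thread rotation (assumed round-robin; the real scheduler is hardware):

```c
#include <stdint.h>

#define NTHREADS 64   /* one host hardware thread per simulated target core */

typedef struct { uint32_t pc; uint32_t regs[32]; } thread_ctx;

static thread_ctx ctx[NTHREADS];  /* per-target-core architectural state */

/* Each host cycle, a different thread enters the pipeline. By the time
 * the same TID comes around again, its long-latency operation (cache
 * access, FPU op) has completed, so the host pipeline keeps busy. */
void host_pipeline_step(void)
{
    static unsigned tid = 0;            /* the {TID,...} tag on each token */
    thread_ctx *t = &ctx[tid];

    /* fetch/decode/execute for target core `tid` would go here */
    t->pc += 4;

    tid = (tid + 1) % NTHREADS;         /* assumed round-robin rotation */
}
```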

Cache Modeling
– The cache model maintains tag, state, and protocol bits internally
– Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall
[diagram: the address splits into tag / index / offset; the indexed set's tag+state entries are compared in parallel, one comparator per way of associativity, to produce hit/miss]
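
The tag/index/offset split and per-way comparators translate directly into a lookup that returns a stall count. A minimal sketch, with every parameter value assumed (the real model reads them from runtime-configurable registers):

```c
#include <stdint.h>

#define LINE_BYTES   64    /* assumed */
#define NWAYS        2     /* assumed */
#define NSETS        256   /* assumed: 32 KB / (64 B * 2 ways) */
#define MISS_PENALTY 50    /* assumed target-cycle miss penalty */

#define OFFSET_BITS  6     /* log2(LINE_BYTES) */
#define INDEX_BITS   8     /* log2(NSETS) */

typedef struct { uint32_t tag; uint8_t valid; } tag_entry;

/* Only tags and state are stored -- no data, exactly as on the slide. */
static tag_entry tags[NSETS][NWAYS];

/* Return how many target cycles this memory operation should stall. */
unsigned cache_tm_access(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NSETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    /* one comparator per way, as in the diagram */
    for (int w = 0; w < NWAYS; w++)
        if (tags[index][w].valid && tags[index][w].tag == tag)
            return 0;                    /* hit: no stall */

    /* miss: install the line (naive way-0 replacement) and charge it */
    tags[index][0].tag   = tag;
    tags[index][0].valid = 1;
    return MISS_PENALTY;
}
```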

Multithreaded, Pipelined Cache TM
[diagram: the tag/state comparison of the previous slide, pipelined across host cycles and shared by all hardware threads; the indexed entries feed the per-way comparators to produce hit?]
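
Extending the previous sketch: because the host pipeline is multithreaded, one physical tag store can hold the tag state of all 64 target caches, indexed by thread ID as well as set (an assumed layout, not the actual BRAM organization):

```c
/* Reuses tag_entry, NSETS, NWAYS, OFFSET_BITS from the sketch above and
 * NTHREADS from the host-pipeline sketch. One shared host BRAM holds
 * every target core's tags, indexed by {TID, set index}. */
static tag_entry host_tags[64 /* NTHREADS */][NSETS][NWAYS];

static tag_entry *lookup_set(unsigned tid, uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NSETS - 1);
    return host_tags[tid][index];   /* NWAYS entries, one per comparator */
}
```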

Quick & Dirty Validation
– 32 KB, 2-way L1 D$, 64 B lines
– 256 KB, 4-way L2$, 64 B lines
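
These two configurations map directly onto the hypothetical cache_timing_cfg struct sketched earlier; setting them at runtime, rather than resynthesizing, might look like this (miss penalties are not given on the slide, so the values below are placeholders):

```c
/* The validation configurations, expressed with the earlier sketch's
 * struct. Miss-penalty values are placeholders, not from the talk. */
struct cache_timing_cfg l1_dcache = {
    .size_bytes = 32 * 1024,  .line_bytes = 64, .assoc = 2,
    .miss_penalty = 20,       .shared = 0
};
struct cache_timing_cfg l2_cache = {
    .size_bytes = 256 * 1024, .line_bytes = 64, .assoc = 4,
    .miss_penalty = 100,      .shared = 1
};
```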

Status
– Functional model + simple timing model working in hardware, running real programs (e.g. SPLASH-2)
– Near-term future work:
  – Move from the current “functional-first + stall” configuration to the timing-driven one described here
  – More interesting memory-system timing model
  – Functional potpourri (FDIV, MMU, …)

Demo
Run OCEAN (SPLASH-2) with different L1 D$ parameters

Questions? Thank you!