COMPUTER ARCHITECTURE CS 6354 Main Memory Samira Khan University of Virginia Nov 5, 2018 The content and concept of this course are adapted from CMU ECE 740

AGENDA Logistics Review from the last class Branch Prediction Main Memory

LOGISTICS Nov 7: Third Student Paper Presentation Session Two papers: Memory Dependence Prediction using Store Sets, ISCA 1998. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, ASPLOS 2018. Presenters do not need to submit reviews

INSIGHTS USED IN BRANCH PREDICTORS Last-time and 2BC predictors exploit “last-time” predictability Realization 1: A branch’s outcome can be correlated with other branches’ outcomes Global branch correlation Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch “last-time” it was executed) Local branch correlation

TWO LEVEL GLOBAL BRANCH PREDICTION First level: Global branch history register (GHR, N bits) records the direction of the last N branches Second level: Pattern History Table (PHT) of saturating counters, one entry per history value, giving the direction the branch took the last time the same history was seen [Figure: the GHR (global history register) indexes into the PHT of 2-bit counters] Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.
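
To make the two-level mechanism concrete, below is a minimal sketch (in C) of a GAg-style global predictor: an N-bit GHR indexes a PHT of 2-bit saturating counters. The history length, table size, and the toy branch trace are illustrative assumptions rather than parameters from this lecture.

```c
/* Minimal two-level global (GAg-style) branch predictor sketch.
 * An N-bit global history register (GHR) indexes a pattern history
 * table (PHT) of 2-bit saturating counters. Sizes and the toy branch
 * trace below are illustrative assumptions, not from the lecture. */
#include <stdio.h>

#define GHR_BITS 8
#define PHT_SIZE (1 << GHR_BITS)

static unsigned char pht[PHT_SIZE]; /* 2-bit counters, all start at 0 (strongly not-taken) */
static unsigned ghr;                /* global history: 1 = taken, 0 = not taken            */

int predict(void) {
    return pht[ghr & (PHT_SIZE - 1)] >= 2;   /* counter >= 2 means predict taken */
}

void update(int taken) {
    unsigned idx = ghr & (PHT_SIZE - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;  /* saturate at strongly taken     */
    if (!taken && pht[idx] > 0) pht[idx]--;  /* saturate at strongly not-taken */
    ghr = (ghr << 1) | (taken ? 1u : 0u);    /* shift the outcome into history */
}

int main(void) {
    /* Toy trace: a loop-like pattern T T T N, repeated; the PHT learns it. */
    int trace[] = {1,1,1,0, 1,1,1,0, 1,1,1,0, 1,1,1,0, 1,1,1,0};
    int n = sizeof(trace) / sizeof(trace[0]), correct = 0;
    for (int i = 0; i < n; i++) {
        int p = predict();
        correct += (p == trace[i]);
        update(trace[i]);
    }
    printf("correct: %d / %d\n", correct, n);
    return 0;
}
```

Compiled and run, the counters learn the repeating taken/taken/taken/not-taken pattern as soon as the same history values start to recur.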

CAN WE DO BETTER? Last-time and 2BC predictors exploit “last-time” predictability Realization 1: A branch’s outcome can be correlated with other branches’ outcomes Global branch correlation Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch “last-time” it was executed) Local branch correlation

LOCAL BRANCH CORRELATION McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.

MORE MOTIVATION FOR LOCAL HISTORY To predict a loop branch “perfectly”, we want to identify the last iteration of the loop By having a separate PHT entry for each local history, we can distinguish different iterations of a loop Works for “short” loops

CAPTURING LOCAL BRANCH CORRELATION Idea: Have a per-branch history register Associate the predicted outcome of a branch with the “T/NT history” of the same branch Make the prediction based on the outcome of the branch the last time the same local branch history was encountered Called the local history/branch predictor Uses two levels of history (per-branch history register + history at that history register value)

TWO LEVEL LOCAL BRANCH PREDICTION First level: A set of local history registers (N bits each); select the history register based on the PC of the branch Second level: Pattern History Table (PHT) of saturating counters, one entry per history value, giving the direction the branch took the last time the same history was seen [Figure: the selected local history register indexes into the PHT of 2-bit counters] Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.
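
The local scheme differs only in the first level: the PC selects a per-branch history register, and that history indexes the counter table. A minimal sketch follows, again with illustrative table sizes and a made-up loop branch address.

```c
/* Sketch of a two-level local predictor: a table of per-branch history
 * registers (selected by PC bits) whose contents index a PHT of 2-bit
 * saturating counters. Table sizes and the branch address are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define LHT_SIZE 1024            /* local history table entries (indexed by PC) */
#define HIST_BITS 10
#define PHT_SIZE (1 << HIST_BITS)

static uint16_t lht[LHT_SIZE];   /* per-branch local histories   */
static uint8_t  pht[PHT_SIZE];   /* 2-bit saturating counters    */

int predict(uint32_t pc) {
    uint16_t hist = lht[(pc >> 2) % LHT_SIZE] & (PHT_SIZE - 1);
    return pht[hist] >= 2;
}

void update(uint32_t pc, int taken) {
    uint32_t l = (pc >> 2) % LHT_SIZE;
    uint16_t hist = lht[l] & (PHT_SIZE - 1);
    if (taken  && pht[hist] < 3) pht[hist]++;
    if (!taken && pht[hist] > 0) pht[hist]--;
    lht[l] = (uint16_t)((lht[l] << 1) | (taken ? 1 : 0));
}

int main(void) {
    /* A loop branch at a hypothetical PC with a 4-iteration pattern:
     * taken, taken, taken, not-taken. The local history distinguishes
     * the iterations, so the loop exit becomes predictable. */
    uint32_t pc = 0x400080;      /* illustrative address */
    int correct = 0, total = 0;
    for (int rep = 0; rep < 50; rep++) {
        for (int it = 0; it < 4; it++) {
            int actual = (it < 3);           /* last iteration falls through */
            correct += (predict(pc) == actual);
            update(pc, actual);
            total++;
        }
    }
    printf("correct: %d / %d\n", correct, total);
    return 0;
}
```

Because the 4-iteration pattern fits in the local history, the loop-exit (not-taken) outcome becomes predictable after warmup, which is exactly the point made on the earlier loop-branch slide.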

TWO-LEVEL LOCAL HISTORY PREDICTOR [Figure: fetch-stage datapath — the program counter (address of the current instruction) indexes a direction predictor (2-bit counters trained on which directions earlier instances of *this branch* went) and a cache of target addresses (BTB: Branch Target Buffer); on a taken prediction with a BTB hit, the target address becomes the next fetch address, otherwise PC + inst size is used]

CAN WE DO EVEN BETTER? Predictability of branches varies Some branches are more predictable using local history Some using global history For others, a simple two-bit counter is enough Yet for others, a single bit is enough Observation: There is heterogeneity in the predictability behavior of branches No one-size-fits-all branch prediction algorithm works for all branches Idea: Exploit that heterogeneity by designing heterogeneous branch predictors

HYBRID BRANCH PREDICTORS Idea: Use more than one type of predictor (i.e., multiple algorithms) and select the “best” prediction E.g., hybrid of 2-bit counters and global predictor Advantages: + Better accuracy: different predictors are better for different branches + Reduced warmup time (faster-warmup predictor used until the slower-warmup predictor warms up) Disadvantages: -- Need “meta-predictor” or “selector” -- Longer access latency McFarling, “Combining Branch Predictors,” DEC WRL Tech Report, 1993.
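
As a sketch of the selection idea, the code below combines a bimodal table and a global-history table with a 2-bit “chooser” (meta-predictor) per branch; the chooser is trained only when the two components disagree. Table sizes, index functions, and the synthetic branch pattern are assumptions for illustration, not the configuration of any particular machine.

```c
/* Sketch of a hybrid ("tournament") predictor: a bimodal predictor and a
 * global-history predictor run in parallel, and a table of 2-bit chooser
 * counters (the meta-predictor) selects which one to trust per branch. */
#include <stdio.h>
#include <stdint.h>

#define SIZE 1024
static uint8_t bimodal[SIZE];   /* 2-bit counters indexed by PC             */
static uint8_t global [SIZE];   /* 2-bit counters indexed by global history */
static uint8_t chooser[SIZE];   /* meta-predictor: >= 2 means "use global"  */
static uint32_t ghr;

static void bump(uint8_t *c, int up) {
    if (up  && *c < 3) (*c)++;
    if (!up && *c > 0) (*c)--;
}

int predict(uint32_t pc) {
    int pb = bimodal[(pc >> 2) % SIZE] >= 2;
    int pg = global [ ghr      % SIZE] >= 2;
    return (chooser[(pc >> 2) % SIZE] >= 2) ? pg : pb;
}

void update(uint32_t pc, int taken) {
    uint32_t bi = (pc >> 2) % SIZE, gi = ghr % SIZE;
    int pb = bimodal[bi] >= 2, pg = global[gi] >= 2;
    if (pb != pg) bump(&chooser[bi], pg == taken);   /* train the selector */
    bump(&bimodal[bi], taken);
    bump(&global [gi], taken);
    ghr = (ghr << 1) | (taken ? 1u : 0u);
}

int main(void) {
    uint32_t pc = 0x400100;               /* illustrative branch address  */
    int correct = 0, total = 200;
    for (int i = 0; i < total; i++) {
        int actual = (i % 4) != 3;        /* pattern the global side learns */
        correct += (predict(pc) == actual);
        update(pc, actual);
    }
    printf("correct: %d / %d\n", correct, total);
    return 0;
}
```

The chooser gradually shifts toward whichever component has been more accurate for this branch, which is the “select the best prediction” behavior described above.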

ALPHA 21264 TOURNAMENT PREDICTOR Minimum branch penalty: 7 cycles Typical branch penalty: 11+ cycles 48K bits of target addresses stored in I-cache Predictor tables are reset on a context switch Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.

BRANCH PREDICTION ACCURACY (EXAMPLE) Bimodal: table of 2-bit counters indexed by branch address

ARE WE DONE W/ BRANCH PREDICTION? Hybrid branch predictors work well E.g., 90-97% prediction accuracy on average Some “difficult” workloads still suffer, though! E.g., gcc Max IPC with tournament prediction: 9 Max IPC with perfect prediction: 35

ARE WE DONE W/ BRANCH PREDICTION? Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999.

SOME OTHER BRANCH PREDICTOR TYPES Loop branch detector and predictor Loop iteration count detector/predictor Works well for loops, where iteration count is predictable Used in Intel Pentium M Perceptron branch predictor Learns the direction correlations between individual branches Assigns weights to correlations Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001. Hybrid history length based predictor Uses different tables with different history lengths Seznec, “Analysis of the O-Geometric History Length branch predictor,” ISCA 2005.
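
A hedged sketch of the perceptron idea (in the spirit of Jimenez and Lin, HPCA 2001, but not their exact configuration): each branch maps to a vector of small signed weights, the prediction is the sign of the dot product of those weights with the +1/-1 global history, and the weights are trained when the prediction is wrong or the output magnitude is below a threshold. The history length, table size, and threshold below are illustrative assumptions.

```c
/* Perceptron branch predictor sketch: per-branch weight vectors learn the
 * correlation between each global-history bit and the branch outcome. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define HIST  16                 /* global history length             */
#define TABLE 256                /* number of perceptrons             */
#define THETA 30                 /* training threshold (illustrative) */

static int8_t w[TABLE][HIST + 1];   /* weights; w[i][0] is the bias      */
static int hist[HIST];              /* global history as +1 / -1 values  */

int output(uint32_t pc) {
    int8_t *wi = w[(pc >> 2) % TABLE];
    int y = wi[0];
    for (int j = 0; j < HIST; j++) y += wi[j + 1] * hist[j];
    return y;                       /* y >= 0 predicts taken */
}

void update(uint32_t pc, int taken, int y) {
    int8_t *wi = w[(pc >> 2) % TABLE];
    int t = taken ? 1 : -1;
    if ((y >= 0) != taken || abs(y) <= THETA) {          /* perceptron rule */
        if (wi[0] + t >= -128 && wi[0] + t <= 127) wi[0] += t;
        for (int j = 0; j < HIST; j++) {
            int d = t * hist[j];          /* strengthen agreeing history bits */
            if (wi[j + 1] + d >= -128 && wi[j + 1] + d <= 127) wi[j + 1] += d;
        }
    }
    for (int j = HIST - 1; j > 0; j--) hist[j] = hist[j - 1];  /* shift history */
    hist[0] = t;
}

int main(void) {
    uint32_t pc = 0x400200;              /* illustrative branch address */
    int correct = 0, total = 400;
    for (int i = 0; i < HIST; i++) hist[i] = -1;
    for (int i = 0; i < total; i++) {
        int actual = (i % 5) != 0;       /* periodic pattern to learn   */
        int y = output(pc);
        correct += ((y >= 0) == actual);
        update(pc, actual, y);
    }
    printf("correct: %d / %d\n", correct, total);
    return 0;
}
```

The period-5 pattern is linearly separable given 16 bits of history (the outcome repeats every five branches), so a single weight quickly dominates and the predictor converges.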

MAIN MEMORY BASICS

THE MAIN MEMORY SYSTEM Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits [Figure: processor and caches → main memory → storage (SSD/HDD)]

MEMORY SYSTEM: A SHARED RESOURCE VIEW [Figure: shared-resource view of the memory system, down to storage]

STATE OF THE MAIN MEMORY SYSTEM Recent technology, architecture, and application trends lead to new requirements and exacerbate old requirements DRAM and memory controllers, as we know them today, are (or will be) unlikely to satisfy all requirements Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging We need to rethink the main memory system to fix DRAM issues and enable emerging technologies to satisfy all requirements

MAJOR TRENDS AFFECTING MAIN MEMORY (I) Need for main memory capacity, bandwidth, QoS increasing Main memory energy/power is a key system design concern DRAM technology scaling is ending

MAJOR TRENDS AFFECTING MAIN MEMORY (II) Need for main memory capacity, bandwidth, QoS increasing Multi-core: increasing number of cores Data-intensive applications: increasing demand/hunger for data Consolidation: cloud computing, GPUs, mobile Main memory energy/power is a key system design concern DRAM technology scaling is ending

EXAMPLE TREND: MANY CORES ON CHIP Simpler and lower power than a single large core Large scale parallelism on chip AMD Barcelona 4 cores Intel Core i7 8 cores IBM Cell BE 8+1 cores IBM POWER7 8 cores Nvidia Fermi 448 “cores” Intel SCC 48 cores, networked Tilera TILE Gx 100 cores, networked Sun Niagara II 8 cores

CONSEQUENCE: THE MEMORY CAPACITY GAP Memory capacity per core expected to drop by 30% every two years Trends worse for memory bandwidth per core! Core count doubling ~ every 2 years DRAM DIMM capacity doubling ~ every 3 years

MAJOR TRENDS AFFECTING MAIN MEMORY (III) Need for main memory capacity, bandwidth, QoS increasing Main memory energy/power is a key system design concern ~40-50% energy spent in off-chip memory hierarchy [Lefurgy, IEEE Computer 2003] DRAM consumes power even when not used (periodic refresh) DRAM technology scaling is ending

MAJOR TRENDS AFFECTING MAIN MEMORY (IV) Need for main memory capacity, bandwidth, QoS increasing Main memory energy/power is a key system design concern DRAM technology scaling is ending ITRS projects DRAM will not scale easily below X nm Scaling has provided many benefits: higher capacity (density), lower cost, lower energy

THE DRAM SCALING PROBLEM DRAM stores charge in a capacitor (charge-based memory) Capacitor must be large enough for reliable sensing Access transistor should be large enough for low leakage and high retention time Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009] DRAM capacity, cost, and energy/power hard to scale

SOLUTIONS TO THE DRAM SCALING PROBLEM Two potential solutions Tolerate DRAM (by taking a fresh look at it) Enable emerging memory technologies to eliminate/minimize DRAM Do both Hybrid memory systems

SOLUTION 1: TOLERATE DRAM Overcome DRAM shortcomings with System-DRAM co-design Novel DRAM architectures, interface, functions Better waste management (efficient utilization) Key issues to tackle Reduce refresh energy Improve bandwidth and latency Reduce waste Enable reliability at low cost

SOLUTION 2: EMERGING MEMORY TECHNOLOGIES Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile) Example: Phase Change Memory Expected to scale to 9nm (2022 [ITRS]) Expected to be denser than DRAM: can store multiple bits/cell But, emerging technologies have shortcomings as well Can they be enabled to replace/augment/surpass DRAM? Even if DRAM continues to scale there could be benefits to examining and enabling these new technologies.

HYBRID MEMORY SYSTEMS [Figure: CPU with both a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, high active energy.] Hardware/software manage data allocation and movement to achieve the best of multiple technologies
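
As a purely illustrative sketch of the “hardware/software manage data allocation and movement” idea (not a policy from this lecture or any specific paper), the toy simulation below tracks per-page access counts and, at the end of each epoch, promotes hot pages into the small fast DRAM while cold pages stay in the large slow PCM. The latencies, capacities, and skewed workload are all assumptions.

```c
/* Toy hot-page placement policy for a DRAM+PCM hybrid memory.
 * All parameters and the access pattern are illustrative assumptions. */
#include <stdio.h>

#define NPAGES     1024
#define DRAM_PAGES  128           /* capacity of the small, fast DRAM      */
#define DRAM_LAT     50           /* assumed DRAM access latency (ns)      */
#define PCM_LAT     150           /* assumed PCM access latency (ns)       */

static long access_count[NPAGES];
static int  in_dram[NPAGES];      /* 1 if the page currently lives in DRAM */

int main(void) {
    long total_ns = 0, accesses = 0;
    for (int epoch = 0; epoch < 100; epoch++) {
        /* Skewed synthetic workload: the first DRAM_PAGES pages are "hot". */
        for (int p = 0; p < NPAGES; p++) {
            int n = (p < DRAM_PAGES) ? 20 : 1;
            for (int i = 0; i < n; i++) {
                access_count[p]++;
                total_ns += in_dram[p] ? DRAM_LAT : PCM_LAT;
                accesses++;
            }
        }
        /* End-of-epoch migration: promote pages whose access count exceeds a
         * threshold. With this workload exactly the DRAM_PAGES hot pages
         * qualify, so the DRAM capacity is respected. */
        long threshold = 10L * (epoch + 1);
        for (int p = 0; p < NPAGES; p++)
            in_dram[p] = (access_count[p] >= threshold);
    }
    printf("average access latency: %.1f ns\n", (double)total_ns / accesses);
    return 0;
}
```

Once the hot pages migrate after the first epoch, the average latency settles well below the all-PCM latency, which is the “best of multiple technologies” goal stated above.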

MAIN MEMORY IN THE SYSTEM [Figure: chip layout — four cores (CORE 0-3), each with a private L2 cache, a shared L3 cache, and an on-chip DRAM memory controller with a DRAM interface connecting to off-chip DRAM banks]

IDEAL MEMORY Zero access time (latency) Infinite capacity Zero cost Infinite bandwidth (to support multiple accesses in parallel)

THE PROBLEM Ideal memory’s requirements oppose each other Bigger is slower Bigger → takes longer to determine the location Faster is more expensive Memory technology: SRAM vs. DRAM Higher bandwidth is more expensive Need more banks, more ports, higher frequency, or faster technology

MEMORY TECHNOLOGY: DRAM Dynamic random access memory Capacitor charge state indicates stored value Whether the capacitor is charged or discharged indicates storage of 1 or 0 1 capacitor, 1 access transistor Capacitor leaks through the RC path DRAM cell loses charge over time DRAM cell needs to be refreshed [Figure: 1T-1C DRAM cell — the access transistor, gated by the row enable line, connects the capacitor to the bitline]

DRAM CELL OPERATION 1. A DRAM cell stores data as charge 2. A DRAM cell is refreshed every 64 ms [Figure: logical view and vertical cross section of a DRAM cell — transistor, capacitor, bitline, and bitline contact] A DRAM cell stores data as charge in a capacitor; the amount of charge indicates data zero or one. A transistor acts as a switch connecting the capacitor to a line called the bitline. When a DRAM cell is read, the transistor is turned on and the capacitor charge is sensed through the bitline.
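
Since refresh is the recurring cost of that leaky capacitor, a back-of-the-envelope calculation is useful: only the 64 ms window comes from the slide, while the number of refresh operations per window and the bank-busy time per refresh below are assumptions.

```c
/* Rough refresh-overhead estimate for one DRAM bank.
 * 64 ms window is from the slide; the other parameters are assumptions. */
#include <stdio.h>

int main(void) {
    double refresh_ops = 8192.0;    /* assumed refresh operations per window      */
    double window_ms   = 64.0;      /* retention/refresh window (from the slide)  */
    double trfc_ns     = 350.0;     /* assumed bank-busy time per refresh (ns)    */

    double busy_ms  = refresh_ops * trfc_ns * 1e-6;
    double overhead = busy_ms / window_ms;

    printf("refresh keeps the bank busy %.2f ms of every %.0f ms (%.1f%% of time)\n",
           busy_ms, window_ms, 100.0 * overhead);
    return 0;
}
```

Under these assumed numbers the bank spends a few percent of all time refreshing, which is why the later slides list refresh energy and bandwidth loss as key issues to tackle.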

MEMORY TECHNOLOGY: SRAM Static random access memory Two cross-coupled inverters store a single bit Feedback path enables the stored value to persist in the “cell” 4 transistors for storage, 2 transistors for access [Figure: 6T SRAM cell — the row select line gates the access transistors connecting the cell to the bitline and its complement]

AN ASIDE: PHASE CHANGE MEMORY Phase change material (chalcogenide glass) exists in two states: Amorphous: Low optical reflectivity and high electrical resistivity Crystalline: High optical reflectivity and low electrical resistivity PCM is resistive memory: High resistance (0), Low resistance (1) Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.

WHY MEMORY HIERARCHY? We want both fast and large But we cannot achieve both with a single level of memory Idea: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)

MEMORY HIERARCHY Fundamental tradeoff (latency, cost, size, bandwidth): fast memory is small; large memory is slow Idea: Memory hierarchy [Figure: CPU with register file (RF) and cache, backed by main memory (DRAM) and a hard disk]

CACHING BASICS: EXPLOIT TEMPORAL LOCALITY Idea: Store recently accessed data in automatically managed fast memory (called cache) Anticipation: the data will be accessed again soon Temporal locality principle Recently accessed data will be again accessed in the near future This is what Maurice Wilkes had in mind: Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. On Electronic Computers, 1965. “The use is discussed of a fast core memory of, say 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory.”

CACHING BASICS: EXPLOIT SPATIAL LOCALITY Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory Logically divide memory into equal size blocks Fetch to cache the accessed block in its entirety Anticipation: nearby data will be accessed soon Spatial locality principle Nearby data in memory will be accessed in the near future E.g., sequential instruction access, array traversal This is what IBM 360/85 implemented 16 Kbyte cache with 64 byte blocks Liptay, “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968.
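
A minimal direct-mapped cache model makes both forms of locality visible: fetching whole 64-byte blocks captures spatial locality during a sequential array sweep, and keeping those blocks resident captures temporal locality when the sweep repeats. The cache geometry and access pattern below are illustrative assumptions.

```c
/* Minimal direct-mapped cache with multi-byte blocks, used to show how
 * caching exploits temporal and spatial locality. Sizes are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES  64
#define NUM_LINES   256                    /* 16 KB cache, direct-mapped */

static uint64_t tags[NUM_LINES];
static int      valid[NUM_LINES];

int access_cache(uint64_t addr) {          /* returns 1 on hit, 0 on miss */
    uint64_t block = addr / BLOCK_BYTES;
    uint64_t line  = block % NUM_LINES;
    uint64_t tag   = block / NUM_LINES;
    if (valid[line] && tags[line] == tag) return 1;
    valid[line] = 1;                        /* miss: fetch the whole block */
    tags[line]  = tag;
    return 0;
}

int main(void) {
    long hits = 0, total = 0;
    /* Sequential traversal of a 4-byte-element array (spatial locality),
     * repeated twice (temporal locality). */
    for (int pass = 0; pass < 2; pass++)
        for (uint64_t addr = 0; addr < 8192; addr += 4) {
            hits += access_cache(addr);
            total++;
        }
    printf("hit rate: %.1f%%\n", 100.0 * hits / total);
    return 0;
}
```

In the first pass only one access per block misses; the second pass hits entirely, so the printed hit rate is close to 97%.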

A NOTE ON MANUAL VS. AUTOMATIC MANAGEMENT Manual: Programmer manages data movement across levels -- too painful for programmers on substantial programs; “core” vs. “drum” memory in the 1950s; still done in some embedded processors (on-chip scratch pad SRAM in lieu of a cache) Automatic: Hardware manages data movement across levels, transparently to the programmer ++ programmer’s life is easier; a simple heuristic: keep most recently used items in cache; the average programmer doesn’t need to know about it You don’t need to know how big the cache is or how it works to write a “correct” program! (What if you want a “fast” program?)

AUTOMATIC MANAGEMENT IN MEMORY HIERARCHY Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. On Electronic Computers, 1965. “By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again.”

A MODERN MEMORY HIERARCHY [Figure: Register file (32 words, sub-nsec) → L1 cache (~32 KB, ~nsec) → L2 cache (512 KB-1 MB, many nsec) → L3 cache ... → main memory (DRAM; GBs, ~100 nsec) → swap disk (~100 GB, ~10 msec). Registers are managed by manual/compiler register spilling; the caches form an automatic, HW-managed memory abstraction; main memory to disk is handled by automatic demand paging.]
