CSE 586 Computer Architecture Lecture 6

1 CSE 586 Computer Architecture Lecture 6
Jean-Loup Baer, CSE 586, Spring 2000

2 Highlights from last week
- Memory hierarchies “work” because of the principle of locality: temporal and spatial locality
- Two main interfaces in the memory hierarchy:
  - Caches – main memory (the topic of the lecture)
  - Main memory – disk (secondary memory)
- The same questions arise at both interfaces: size, placement, retrieval, replacement, and timing of the information being transferred

3 Highlights from last week (c’ed)
- Cache organizations: direct-mapped, fully-associative, set-associative
- Decomposition of the address for hit/miss detection
- Write-through vs. write-back
- The 3 C’s
- Cache performance metrics: CPIc, average memory access time
- Examples of naïve analysis

4 Highlights from last week (c’ed)
- Improving performance by giving more “associativity”: victim caches; column-associative caches; skewed-associative caches
- Reducing conflict misses:
  - Interaction with the O.S.: page coloring
  - Interaction with the compiler: code placement
- Improving performance by tolerating memory latency: prefetching; write buffers
- Today’s lecture

5 Critical Word First
- Recall that for a cache miss we need to: send the address + DRAM access + send the data on the bus
- Optimization: send first, from the next level in the memory hierarchy, the word for which there was a miss
- Send that word directly to the CPU register (or IF buffer if it’s an I-cache miss) as soon as it arrives
- Need a one-block buffer to hold the incoming block (and shift it) before storing it in the cache
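A minimal sketch of the resulting wrap-around fill order (not from the slides; block size and the missed word offset are assumed values): the missed word goes to the CPU first, and the buffer is filled wrapping around before the block is written into the cache.

```c
/* Wrap-around fill order under critical-word-first (assumed parameters). */
#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* assumed block size in words */

int main(void) {
    unsigned missed = 5;    /* hypothetical word offset that caused the miss */
    printf("fill order:");
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf(" %u", (missed + i) % WORDS_PER_BLOCK);
    printf("\n");           /* prints: fill order: 5 6 7 0 1 2 3 4 */
    return 0;
}
```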

6 Sectored (or subblock) Caches
- First cache ever (IBM 360/85 in the late 60’s) was a sector cache
- On a cache miss, bring in only a subblock, change the tag, and invalidate all other subblocks
- Saves on memory bandwidth
- Reduces the number of tags, but requires good spatial locality in the application
- Requires status bits (valid, dirty) per subblock
- Might reduce false sharing in multiprocessors
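A minimal sketch of the per-line state this implies, with assumed sector and subblock sizes (type and function names are illustrative, not from the slides):

```c
/* One sector-cache line: a single tag for the whole sector, plus valid and
 * dirty status bits per subblock. Sizes are assumed for illustration. */
#include <stdint.h>

#define SUBBLOCKS_PER_SECTOR 4
#define WORDS_PER_SUBBLOCK   4

typedef struct {
    uint32_t tag;                          /* one tag per sector       */
    uint8_t  valid[SUBBLOCKS_PER_SECTOR];  /* status bits per subblock */
    uint8_t  dirty[SUBBLOCKS_PER_SECTOR];
    uint32_t data[SUBBLOCKS_PER_SECTOR][WORDS_PER_SUBBLOCK];
} sector_line_t;

/* Miss to a new sector, as described above: install the new tag, mark only
 * the fetched subblock valid, and invalidate all the others. */
static void sector_miss(sector_line_t *line, uint32_t new_tag, int subblock) {
    line->tag = new_tag;
    for (int i = 0; i < SUBBLOCKS_PER_SECTOR; i++)
        line->valid[i] = 0;
    line->dirty[subblock] = 0;
    line->valid[subblock] = 1;   /* only the requested subblock is fetched */
}
```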

7 Sector Cache
[Figure: a sector-cache line with a single tag, per-subblock status bits, and data subblocks 1..n]

8 Lock-up Free Caches
- Proposed in the early 1980’s but implemented only recently because quite complex
- Allow the cache to have several outstanding miss requests (hit under miss)
- A cache miss “happens” during the EX stage, i.e., longer (unpredictable) latency
- Important not to slow down operations that don’t depend on the results of the load
- Single hit under miss (HP PA 1700) is relatively simple
- Several outstanding misses require the use of MSHR’s (Miss Status Holding Registers)

9 MSHR’s
- The outstanding misses do not necessarily come back in the order they were detected
  - For example, miss 1 can percolate from L1 to main memory while miss 2 is resolved at the L2 level
- Each MSHR must hold information about the particular miss it will handle, such as:
  - Info relative to its placement in the cache
  - Info relative to the “missing” item (word, byte) and where to forward it (CPU register)

10 Implementation of MSHR’s
- Quite a variety of alternatives: MIPS R10000, Alpha 21164, Pentium Pro
- One particular way of doing it:
  - Valid (busy) bit (limited number of MSHR’s – structural hazard)
  - Address of the requested cache block
  - Index in the cache where the block will go
  - Comparator (to prevent using the same MSHR for a miss to the same block)
  - If data is to be forwarded to the CPU at the same time as to the cache, addresses of the destination registers (one per possible word/byte)
  - Valid bits (for writes)
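A minimal sketch of one such MSHR entry in C, with assumed field widths and block size; the names are illustrative and not taken from any particular processor:

```c
/* One MSHR entry holding the per-miss state listed above. */
#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_BLOCK 8      /* assumed L1 block size in words */

typedef struct {
    bool     busy;                          /* valid bit; all entries busy => structural hazard */
    uint64_t block_addr;                    /* address of the requested cache block             */
    uint32_t cache_index;                   /* index in the cache where the block will go       */
    bool     fwd_valid[WORDS_PER_BLOCK];    /* forward this word to the CPU when it arrives?    */
    uint8_t  dest_reg[WORDS_PER_BLOCK];     /* destination register, one per possible word/byte */
    bool     write_valid[WORDS_PER_BLOCK];  /* valid bits for pending writes to merge           */
} mshr_t;

/* The comparator mentioned above: a second miss to the same block must reuse
 * the already-allocated MSHR instead of allocating a new one. */
static bool mshr_matches(const mshr_t *m, uint64_t block_addr) {
    return m->busy && m->block_addr == block_addr;
}
```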

11 Cache Hierarchy
- Two, and even three, levels of caches are quite common now
- L2 (or L3, i.e., board-level) is very large, but since L1 filters many references its “local” hit rate might appear low (maybe 50%) (compulsory misses still happen)
- In general, L2 caches have longer cache blocks and larger associativity
- In general, L2 caches are write-back, write-allocate

12 Characteristics of Cache Hierarchy
- Multi-level inclusion (MLI) property between the off-board cache (L2 or L3) and the on-chip cache(s) (L1 and maybe L2)
  - L2 contents must be a superset of L1 contents (or at least have room to store those contents if L1 is write-back)
  - If L1 and L2 are on chip, they could be mutually exclusive (and inclusion will be with L3)
- MLI is very important for cache coherence in multiprocessor systems (shields the on-chip caches from unnecessary interference)
- Prefetching at the L2 level is an interesting challenge (made easier if the L2 tags are kept on-chip)

13 “Virtual” Address Caches
- Will get back to this after we study TLB’s
- Virtually addressed, virtually tagged caches
  - Main problem to solve is the synonym problem (2 virtual addresses corresponding to the same physical address)
- Virtually addressed, physically tagged caches
  - Advantage: cache and TLB can be accessed concurrently

14 Miscellaneous Techniques
- Improving on write time: pipeline the write, with a buffer to delay the data write by one cycle
  - Note: not needed for reads, where tag and data are processed in parallel
- For superscalar machines: duplicate the L1 cache(s) (could be cheaper than multiple ports?)
- For (highly) associative caches: keep for each set the MRU index so that it is checked first (cf. the MIPS R10000 L2, which has an 8K x 1 prediction table to that effect)
- Etc.

15 Impact of Branch Prediction on Caches
- If we are on the predicted path and:
  - An I-cache miss occurs, what should we do: stall or fetch?
  - A D-cache miss occurs, what should we do: stall or fetch?
- If we fetch and we are on the right path, it’s a win
- If we fetch and we are on the wrong path, it is not necessarily a loss
  - Could be a form of prefetching (if the branch was mispredicted, there is a good chance that that path will be taken later)
  - However, the channel between the cache and the higher level of the hierarchy is occupied while something more pressing could be waiting for it

16 Recall: Anatomy of a Predictor
[Figure: anatomy of a generic predictor – execution, event selection, prediction index, prediction mechanism, feedback, and recovery]

17 Anatomy of a Cache Predictor
[Figure: the same anatomy applied to a cache predictor – execution, event selection, prediction index, prediction mechanism, and feedback; no need for recovery]

18 Anatomy of a Cache Predictor
[Figure: cache-predictor anatomy – the prediction trigger is a load/store or a cache miss]

19 Anatomy of a Cache Predictor
[Figure: cache-predictor anatomy – the prediction index can be the PC, the effective address, or global/local history]

20 Anatomy of a Cache Predictor
[Figure: cache-predictor anatomy – the predictor structures can be additional metadata, associative buffers, or specialized caches]

21 Anatomy of a Cache Predictor
[Figure: cache-predictor anatomy – the prediction mechanism can use counters, stride predictors, or finite-context Markov predictors]

22 Anatomy of a Cache Predictor
[Figure: cache-predictor anatomy – the feedback is often imprecise]

23 Main Memory
- The last level in the cache – main memory hierarchy is the main memory, made of DRAM chips
- DRAM parameters (memory latency at the DRAM level):
  - Access time: time between when the read is requested and when the desired word arrives
  - Cycle time: minimum time between requests to memory (cycle time > access time because of the need for stabilization of the address lines)

24 DRAM’s
- Address lines are split into row and column addresses; a read operation consists of:
  - RAS (row access strobe)
  - CAS (column access strobe)
- If the device has been precharged, access time = RAS + CAS; if not, the precharge time must be added
- RAS, CAS, and precharge are of the same order of magnitude
- In DRAM, data needs to be written back after a read, hence cycle time > access time
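To make the access-time arithmetic concrete, a small sketch with assumed, purely illustrative timings (only the relative magnitudes matter):

```c
/* Read access time with and without a precharged bank. */
#include <stdio.h>

int main(void) {
    double ras = 30.0, cas = 30.0, precharge = 30.0;   /* assumed values, ns */
    printf("precharged bank : %.0f ns (RAS + CAS)\n", ras + cas);
    printf("not precharged  : %.0f ns (precharge + RAS + CAS)\n", precharge + ras + cas);
    return 0;
}
```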

25 DRAM array
[Figure: a DRAM array addressed by a row address and a column address; the selected row (“page”) is read into the page buffer]

26 DRAM and SRAM
- D stands for “dynamic”
  - Each bit is a single transistor (plus a capacitor; hence the need to rewrite the information after a read)
  - Needs to be recharged periodically, hence refreshing; all bits in a row can be refreshed concurrently (just read the row), and for each row it takes RAS time
- S stands for “static”
  - Uses 6 transistors/bit (some use 4)
  - No refresh and no need to write after a read (i.e., information is not lost by reading; very much like a F/F in a register)

27 DRAM vs. SRAM
- Cycle time of SRAM is 10 to 20 times faster than that of DRAM
- For the same technology, the capacity of DRAM is 5 to 10 times that of SRAM
- Hence:
  - Main memory is DRAM
  - On-chip caches are SRAM
  - Off-chip caches: it depends
- DRAM growth:
  - Capacity: factor of 4 every 3 years (60% per year)
  - Cycle time: improvement of 20% per generation (7% per year)

28 How to Improve Main Memory Bandwidth
- It’s easier to improve on bandwidth than on latency
- Sending the address can’t be improved (and this is latency), although a split-transaction bus allows some overlap
- Make memory wider (assume monolithic memory)
  - Sending one address yields a transfer of more than one word if the bus width allows it (and it does nowadays)
  - But less modularity (buy bigger increments of memory)

29 Interleaving (introducing parallelism at the DRAM level)
- Memory is organized in banks
- Bank i stores all words whose address j satisfies j mod (number of banks) = i
- All banks can read a word in parallel
- Ideally, the number of banks should match (or be a multiple of) the L2 block size (in words)
- The bus does not need to be wider (there is a buffer in the DRAM bank)
- Writes to individual banks for different addresses can proceed without waiting for the preceding write to finish (great for write-through caches)
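A minimal sketch of low-order interleaving under assumed parameters (8 banks, 4-byte words): word j lives in bank j mod 8, so the words of a block can be read from the banks in parallel.

```c
/* Map each word of a (hypothetical) block address to its bank and the row
 * within that bank under low-order interleaving. */
#include <stdio.h>

#define NUM_BANKS  8          /* assumed: matches an 8-word L2 block */
#define WORD_BYTES 4

int main(void) {
    unsigned long block_addr = 0x1000;             /* hypothetical block address */
    for (int w = 0; w < NUM_BANKS; w++) {
        unsigned long word = block_addr / WORD_BYTES + w;
        printf("word %d -> bank %lu, row %lu\n",
               w, word % NUM_BANKS, word / NUM_BANKS);
    }
    return 0;
}
```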

30 Banks of Banks
- Superbanks interleaved on some bits other than the lower bits
- Superbanks composed of banks interleaved on low-order bits for sequential access
- Superbanks allow parallel access to memory
- Great for lock-up free caches, for concurrent I/O, and for multiprocessors sharing main memory

31 Limitations of Interleaving (sequential access)
- The number of banks is limited by increasing chip capacity
  - With 1M x 1 bit chips, it takes 64 x 8 = 512 chips to get 64 MB (easy to put 16 banks of 32 chips)
  - With 16M x 1 chips, it takes only 32 chips (only one bank)
  - More parallelism in using 4M x 4 chips (32 chips in 4 banks)
- In the N x m organization (N the number of locations, m the width in bits out of each chip), m is limited by electronic constraints to about 8 or maybe 16
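The chip-count arithmetic above, reproduced as a small sketch (64 MB of total memory and 32-bit-wide banks are assumed, to match the slide's numbers):

```c
/* Denser chips mean fewer chips and therefore fewer banks to interleave. */
#include <stdio.h>

static void banks(const char *chip, unsigned chip_mbits, unsigned chip_width) {
    const unsigned total_mbits = 64 * 8;          /* 64 MB = 512 Mbit        */
    const unsigned bank_width  = 32;              /* bits delivered per bank */
    unsigned n_chips        = total_mbits / chip_mbits;
    unsigned chips_per_bank = bank_width / chip_width;
    printf("%-8s: %3u chips -> %2u bank(s) of %u chips\n",
           chip, n_chips, n_chips / chips_per_bank, chips_per_bank);
}

int main(void) {
    banks("1M x 1", 1, 1);     /* 512 chips -> 16 banks of 32 chips */
    banks("16M x 1", 16, 1);   /*  32 chips ->  1 bank  of 32 chips */
    banks("4M x 4", 16, 4);    /*  32 chips ->  4 banks of  8 chips */
    return 0;
}
```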

32 Example Memory Path of a Workstation
[Figure: CPU + L1 connected via the processor bus to a data switch and the L2; the data switch connects via the memory bus to DRAM banks 0..n and to/from the I/O bus; bus widths of 16 B and 32 B are shown]

33 Page-mode and Synchronous DRAMs
- Introduce a page buffer
- In page mode, no need for a RAS
- But on a (page-buffer) miss, need precharge + RAS + CAS
- In SDRAM, same as page mode but subsequent accesses are even faster (burst mode)

34 Analysis of “Enhanced” DRAM’s
- Analysis: let p be the precharge time, r the RAS time, a the CAS time, h the hit ratio in the page buffer, and b the burst time in SDRAM
- Assume we need 4 accesses to transfer a cache line
- In page-mode DRAM, it takes:
  - r + 4a if the bank was precharged
  - 4a if the bank was in page mode and we have a hit
  - p + r + 4a if the bank was in page mode and we have a miss
- The access time depends on whether we keep the DRAM in page mode all the time [(p + r)(1 − h) + 4a] or not [r + 4a] (assuming that we have time to precharge between accesses)
- Same analysis for SDRAM, replacing 4a by a + 3b
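A minimal sketch of this analysis; only the formulas come from the slide, the timings and the page-buffer hit ratio are assumed, illustrative values:

```c
/* Expected time to transfer a cache line (4 accesses) under the two
 * policies, plus the SDRAM burst-mode variant (4a replaced by a + 3b). */
#include <stdio.h>

int main(void) {
    double p = 30.0, r = 30.0, a = 30.0, b = 10.0;  /* assumed, same order of magnitude (ns) */
    double h = 0.6;                                 /* assumed page-buffer hit ratio         */

    double page_mode_dram  = (p + r) * (1.0 - h) + 4 * a;  /* keep DRAM in page mode   */
    double precharged_dram = r + 4 * a;                    /* precharge between access */
    double page_mode_sdram = (p + r) * (1.0 - h) + a + 3 * b;

    printf("page-mode DRAM : %.0f ns\n", page_mode_dram);
    printf("precharged DRAM: %.0f ns\n", precharged_dram);
    printf("page-mode SDRAM: %.0f ns\n", page_mode_sdram);
    return 0;
}
```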

35 Cached DRAM and Processor in Memory
- Put some SRAM on the DRAM chip
  - More flexibility in buffer size than page mode
  - Can precharge the DRAM while accessing the SRAM
  - But fabrication is different
- Go one step further (1 billion transistors/chip): put a “simple” processor and SRAM and DRAM on the chip
  - Great bandwidth for the processor–memory interface
  - Cache with very large block size, since parallel access to many banks is possible
  - Can’t have too complex a processor
  - Need to invest in new fabs

36 Processor in Memory (PIM)
- Generality depends on the intended applications
- IRAM: vector processor; data-stream apps; low power
- FlexRAM: memory chip = host + simple multiprocessor + banks of DRAM; memory-intensive apps
- Active Pages: co-processor paradigm; reconfigurable logic in memory
- FBRAM: graphics in memory

37 Rambus
- Specialized memory controller (scheduler), channel, and RDRAM’s
- Parallelism and pipelining, e.g.:
  - Independent row, column, and data buses (narrow – 2 bytes)
  - Pipelined memory subsystem (several packets/access; packets are 4 cycles = 10 ns)
  - Parallelism within the RDRAM’s (many banks with 4 possible concurrent operations)
  - Parallelism among RDRAM’s (a large number of them)
- Great for “streams of data” (graphics, games)

38 Direct Rambus
- Extremely fast bus (400 MHz clock, 800 MHz transfer rate)
- Great bandwidth for stream data, but still high latency for random reads/writes
[Figure: the memory controller drives Row[2:0], Column[4:0], and Data[15:0] buses to RDRAM 0 through RDRAM n (n up to 31), each RDRAM containing banks 0..15 with page buffers 0..15]

39 Split-transaction Bus
- Allows transactions (address, control, data) for different requests to occur simultaneously
- Required for efficient Rambus operation
- Great for SMP’s sharing a single bus

40 Evolution in Memory Management Techniques
- In the early days, a single program ran on the whole machine and used all the memory available
- Even so, there was often not enough memory to hold the data and program for the entire run
  - Use of overlays, i.e., static partitioning of program and data so that parts that were not needed at the same time could share the same memory addresses
- Soon, it was noticed that I/O was much more time consuming than processing, hence the advent of multiprogramming

41 Multiprogramming
- Several programs are resident in main memory at the same time
- When one program executes and needs I/O, it relinquishes the CPU to another program
- Some important questions from the memory management viewpoint:
  - How does one program ask for (more) memory?
  - How is one program protected from another?

42 Virtual Memory: Basic idea
- The idea was first proposed and implemented at the University of Manchester in the early 60’s
- The basic idea is to compile/link a program in a virtual space as large as the addressing space permits
- Then, divide the virtual space into “chunks” and bring those “chunks” on demand into physical memory
- Provide a general (fully-associative) mapping between virtual “chunks” and physical “chunks”

43 Virtual Memory Implementations
- When the virtual space is divided into chunks of the same size, called pages, we have a paging system
- If chunks are of different sizes, we have segments
  - Segments correspond to semantic objects (a good thing), but the implementation is more difficult (memory allocation of variable-size segments, checks for out of bounds, etc.)
- Paging (and segmented) systems predate caches
  - But the same questions arise (mapping, replacement, writing policy)
  - An enormous difference: the penalty for a miss
  - Requires hardware assists for translation and protection

44 Paging
- Allows a virtual address space larger than physical memory
- Allows sharing of physical memory between programs (multiprogramming) without much fragmentation
  - Physical memory allocated to a program does not need to be contiguous; only an integer number of pages
- Allows sharing of pages between programs (not always simple)

45 Two Extremes in the Memory Hierarchy
[Figure: table contrasting the two extremes of the memory hierarchy, L1 caches and the paging system; the differences are detailed on the next slide]

46 Other Extreme Differences
- Mapping: restricted (L1) vs. general (paging)
- Hardware assist for virtual address translation (TLB)
- Miss handler:
  - Hardware only for caches
  - Software only for the paging system (context switch)
  - Hardware and/or software for the TLB
- Replacement algorithm: not important for caches, very important for the paging system
- Write policy: always write-back for paging systems

47 Illustration of Paging
[Figure: virtual pages 0..n of Program A and 0..q of Program B mapped through a mapping device onto physical frames 0..m]
- Note: in general n, q >> m
- Programs A and B share frame 0, but with different virtual page numbers
- Not all virtual pages of a program are mapped at a given time

48 Mapping Device: Page Tables
- Page tables contain page table entries (PTE’s): virtual page number (implicit or explicit), physical page number, valid, protection, dirty, and use bits (for LRU-like replacement), etc.
- A hardware register points to the page table of the running process
- Earlier systems: contiguous (in virtual space) page tables; now, multi-level page tables
- In some systems, inverted page tables (with a hash table)
- In all modern systems, page table entries are cached in a TLB
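A minimal sketch of a PTE and a single-level page table with an assumed, illustrative layout (the field widths are not those of any real machine; the virtual page number is kept implicit as the index into the table):

```c
/* The PTE fields listed above, packed into one 32-bit word. */
#include <stdint.h>

typedef struct {
    uint32_t frame : 20;   /* physical frame number                    */
    uint32_t valid : 1;    /* page is resident in memory               */
    uint32_t dirty : 1;    /* page has been written                    */
    uint32_t use   : 1;    /* referenced bit for LRU-like replacement  */
    uint32_t prot  : 3;    /* read/write/execute protection            */
} pte_t;

/* A single-level (earlier-style, contiguous) page table is then just an
 * array indexed by virtual page number; the hardware page-table register
 * of the running process points at it. */
typedef struct {
    pte_t   *base;         /* what the hardware register holds */
    unsigned num_pages;
} page_table_t;
```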

49 Illustration of Page Table
[Figure: page tables for Program A and Program B, one entry per virtual page; each entry holds a valid bit and, if valid, the number of the physical frame (0..m) holding that page]

50 Virtual Address Translation
[Figure: the virtual page number indexes the page table to obtain the physical frame number; the page offset is carried over unchanged to form the physical address]

51 From Virtual Address to Memory Location (highly abstracted)
[Figure: the ALU generates a virtual address, the page table translates it into a physical address, and the physical address accesses the memory hierarchy]

52 Translation Look-aside Buffers (TLB)
- Keeping page tables in memory defeats the purpose of caches: it needs one memory reference to do the translation
- Hence, the introduction of caches to cache page table entries: these are the TLB’s
- There have been attempts to use the cache itself instead of a TLB, but it has been proven not to be worthwhile
- Nowadays, a TLB for instructions and a TLB for data
- Some part of the TLB’s is reserved for the system
- On the order of 128 entries, quite associative

53 TLB’s
- A TLB miss is handled by hardware or by software (e.g., PAL code in the Alpha)
- A TLB miss takes few enough cycles that there is no context switch
- Addressed in parallel with the access to the cache
  - Since it is smaller, it is faster
  - It’s on the critical path
- For a given TLB size (number of entries), a larger page size gives a larger mapping range
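A small sketch of the last point, using the entry count from the previous slide and assumed page sizes:

```c
/* For a fixed number of TLB entries, the range of addresses the TLB can map
 * grows linearly with the page size. */
#include <stdio.h>

int main(void) {
    unsigned entries = 128;                    /* "of the order of 128 entries" */
    unsigned page_kb[] = { 4, 8, 64 };         /* assumed page sizes (KB)       */
    for (int i = 0; i < 3; i++)
        printf("%u entries x %2u KB pages -> maps %4u KB\n",
               entries, page_kb[i], entries * page_kb[i]);
    return 0;
}
```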

54 TLB organization
[Figure: the virtual page number is split into a tag and an index into the TLB; each TLB entry holds a copy of the PTE (valid, dirty, and protection bits) and the physical frame number; the page offset bypasses the TLB]

55 From Virtual Address to Memory Location (highly abstracted; revisited)
[Figure: the virtual address from the ALU is translated by the TLB; on a TLB hit, the physical address accesses the cache; on a TLB miss or a cache miss, main memory is accessed]

56 Address Translation
- At each memory reference, the hardware searches the TLB for the translation
- On a TLB hit with a valid PTE, the physical address is passed to the cache
- On a TLB miss, either hardware or software (depending on the implementation) searches the page table in memory; if the PTE is valid, the contents of the PTE are loaded into the TLB and we go back to the step above
  - In hardware, the TLB miss takes a few cycles
  - In software, it takes up to 100 cycles
  - In either case, no context switch
- If the PTE is invalid, we have a page fault (even on a TLB hit)
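A minimal, self-contained sketch of this flow, with a tiny direct-mapped TLB and a one-level page table; all sizes and the 4 KB page size are assumed for illustration:

```c
/* TLB hit -> use the cached PTE; TLB miss -> walk the page table and refill
 * the TLB; invalid PTE -> page fault (even on a TLB hit). */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12         /* assumed 4 KB pages            */
#define TLB_ENTRIES 16         /* assumed, direct-mapped        */
#define NUM_VPAGES  256        /* assumed tiny address space    */

typedef struct { uint32_t frame; int valid; } pte_t;

static pte_t page_table[NUM_VPAGES];                 /* the in-memory page table */
static struct { uint32_t vpn; pte_t pte; int valid; } tlb[TLB_ENTRIES];

static int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT, offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    if (vpn >= NUM_VPAGES)                           /* outside the tiny assumed space */
        return -1;
    unsigned idx = vpn % TLB_ENTRIES;
    if (!(tlb[idx].valid && tlb[idx].vpn == vpn)) {  /* TLB miss: a few cycles in HW,  */
        tlb[idx].vpn = vpn;                          /* up to ~100 in SW; no context   */
        tlb[idx].pte = page_table[vpn];              /* switch either way              */
        tlb[idx].valid = 1;
    }
    if (!tlb[idx].pte.valid)                         /* page fault, even on a TLB hit  */
        return -1;
    *paddr = (tlb[idx].pte.frame << PAGE_SHIFT) | offset;
    return 0;
}

int main(void) {
    page_table[3] = (pte_t){ .frame = 42, .valid = 1 };   /* map virtual page 3 */
    uint32_t pa;
    if (translate(0x3ABC, &pa) == 0)
        printf("0x3ABC -> 0x%X\n", (unsigned)pa);    /* prints 0x2AABC */
    if (translate(0x9000, &pa) != 0)
        printf("0x9000 -> page fault\n");
    return 0;
}
```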

57 Speeding up L1 Access
- The cache can be (speculatively) accessed in parallel with the TLB if its indexing bits are not changed by the virtual-to-physical translation
- Cache access (for reads) is pipelined:
  - Cycle 1: access the TLB and access the L1 cache (read data at the given index)
  - Cycle 2: compare tags and, if hit, send the data to the register
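A small sketch of that condition under assumed parameters (4 KB pages and a hypothetical 32 KB, 8-way L1): the index-plus-offset bits must lie entirely within the page offset, i.e., capacity/associativity must not exceed the page size.

```c
/* Check whether the L1 can be indexed in parallel with the TLB. */
#include <stdio.h>

int main(void) {
    unsigned page_size  = 4096;               /* assumed 4 KB pages    */
    unsigned cache_size = 32 * 1024;          /* hypothetical 32 KB L1 */
    unsigned assoc      = 8;                  /* hypothetical 8-way    */

    unsigned index_span = cache_size / assoc; /* bytes covered by index + block-offset bits */
    if (index_span <= page_size)
        printf("index bits fit in the page offset: parallel TLB/cache access OK\n");
    else
        printf("index uses translated bits: need page coloring or a smaller/more associative L1\n");
    return 0;
}
```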

58 Virtually Addressed Cache
[Figure: the virtual address is split into page number and offset, viewed by the cache as tag / index / displacement; step 1: the TLB and the cache are accessed in parallel; step 2: the physical tag from the PTE is compared with the cache tag]

59 “Virtual” Caches
- Previous slide: virtually addressed, physically tagged
  - Can be done for a small L1, i.e., capacity < (page size x associativity)
  - Can be done for larger caches if the O.S. does a form of page coloring such that the “index” is the same for synonyms (see below)
  - Can also be done more generally (complicated, but can be elegant)
- Virtually addressed, virtually tagged caches
  - Synonym problem (2 virtual addresses corresponding to the same physical address): inconsistency, since the same physical location can be mapped into two different cache blocks
  - Can be handled by software (disallow it) or by hardware (with “pointers”)
  - Use of PID’s to only partially flush the cache
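A minimal sketch of the synonym problem under an assumed geometry (virtually indexed, 64 KB direct-mapped cache with 64 B blocks and 4 KB pages); the two virtual addresses are hypothetical and assumed to be mapped by the O.S. to the same physical page:

```c
/* With 4 KB pages, index bits 15:12 come from the virtual page number, so
 * two synonyms with the same page offset can index different cache sets. */
#include <stdio.h>

int main(void) {
    unsigned block_bits = 6, index_bits = 10;   /* 64 B blocks, 1024 sets       */
    unsigned va1 = 0x00012340;                  /* hypothetical synonym 1       */
    unsigned va2 = 0x00093340;                  /* hypothetical synonym 2, same page offset 0x340 */
    unsigned set1 = (va1 >> block_bits) & ((1u << index_bits) - 1);
    unsigned set2 = (va2 >> block_bits) & ((1u << index_bits) - 1);
    printf("VA1 indexes set %u, VA2 indexes set %u -> two cached copies of one physical line\n",
           set1, set2);
    return 0;
}
```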

