© Karen Miller, 2011

Presentation transcript:

What do we want from our computers?
- correct results: we assume this feature, but consider... who defines what is correct?
- fast: fast at what? (easy answer: fast at my programs)

[Figure: price vs. performance. Slow memory is cheap (¢); fast memory is expensive ($$$).]

Architectural features: ways of increasing speed generally fall into 2 categories:
1. parallelism
2. memory hierarchies

Parallelism: suppose we have 3 tasks: t1, t2, and t3. A serial implementation on 1 computer runs t1, then t2, then t3. If the tasks are independent, a parallel implementation (given that we have 3 computers) runs t1, t2, and t3 at the same time.

Memory woes: with the processor (P) and memory (M) physically separate, memory accesses are SLOW! P and M co-located? Very expensive! Or the memory is too small!

A HW design technique to make some memory accesses complete faster is the implementation of hierarchical memory (also known as caching).

Recall the fetch and execute cycle:
- fetch instruction  *
- PC update
- decode
- get operands  (* for a load)
- do operation
- store result  (* for a store)
(* requires a memory access)

Now look at the memory access patterns of lots of programs. In general, memory access patterns are not random. They exhibit locality:
1. temporal
2. spatial

Temporal locality: recently referenced memory locations are likely to be referenced again (soon!). Consider a loop of four instructions at addresses A1, A2, A3, and A4, where the instruction at A4 is a branch (b) back to the top. The instruction stream references are: A1 A2 A3 A4 A1 A2 A3 A4 A1 A2 A3 ... Note that the same memory locations are repeatedly read (for the fetch).

Spatial locality: memory locations near to referenced locations are likely to also be referenced. Consider an array in memory: code must do something to each element of the array, so it must load each element, one after the other...

The fetch of the code exhibits a high degree of spatial locality. Instructions I1 I2 I3 I4 I5 ... are adjacent in memory (I2 is next to I1). If these instructions are not branches, then we fetch I1, I2, I3, etc.
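
As a small illustration (not from the original slides; the program and its names are my own), the C sketch below shows both kinds of locality: the loop instructions are re-fetched on every iteration (temporal locality in the instruction stream), and the array elements are read from adjacent addresses (spatial locality in the data stream).

#include <stdio.h>

#define SIZE 1024

int main(void) {
    int array[SIZE];
    int sum = 0;

    for (int i = 0; i < SIZE; ++i)    /* the loop's instructions are fetched    */
        array[i] = i;                 /* over and over: temporal locality       */

    for (int i = 0; i < SIZE; ++i)    /* array[0], array[1], ... live at        */
        sum += array[i];              /* adjacent addresses: spatial locality   */

    printf("sum = %d\n", sum);
    return 0;
}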

A cache is designed to hold copies of a subset of memory locations:
- smaller (in terms of bytes) than main memory
- faster than main memory
- co-located: processor and cache are on the same chip

[Image: Intel 386 chip (1985)]

[Image: Pentium II (1997)]

P sends a memory request to C (the cache, on the same chip as P).
- hit: the requested location's copy is in C
- miss: the requested location's copy is NOT in C; so, send the memory access to M

Needed terminology:
miss ratio = (# of misses) / (total # of accesses)
hit ratio = (# of hits) / (total # of accesses), or 1 - miss ratio
You already assumed that: total # of accesses = # of misses + # of hits

So, when designing a cache, keep the bytes likely to be referenced (again), and their neighbors, in the cache... So, what is in the cache is different for each different program. On average, for a given program:
Average Memory Access Time (AMAT) = Tc + (miss ratio)(Tm)

For example: Tc = 1 nsec, Tm = 20 nsec. A specific program has 98% hits...
AMAT = 1 + (0.02)(20) = 1.4 nsec
Each individual memory access takes 1 nsec (hit) or 21 nsec (miss).
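
A minimal C sketch of the same arithmetic (the numbers are the ones from the example above; the variable names are my own): it derives the hit and miss ratios from counts and then applies the AMAT formula.

#include <stdio.h>

int main(void) {
    /* Example figures from the slide: Tc = 1 ns, Tm = 20 ns, 98% hits. */
    double t_cache  = 1.0;    /* ns, time for a cache hit              */
    double t_memory = 20.0;   /* ns, extra time to go to main memory   */

    long hits   = 98;         /* out of 100 accesses                   */
    long misses = 2;
    long total  = hits + misses;

    double miss_ratio = (double)misses / total;     /* 0.02            */
    double hit_ratio  = 1.0 - miss_ratio;           /* 0.98            */

    double amat = t_cache + miss_ratio * t_memory;  /* 1 + 0.02 * 20   */

    printf("hit ratio  = %.2f\n", hit_ratio);
    printf("miss ratio = %.2f\n", miss_ratio);
    printf("AMAT       = %.1f ns\n", amat);         /* prints 1.4 ns   */
    return 0;
}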

Divide all of memory up into fixed-size blocks... Copy the entire block into the cache. Make the block size greater than 1 word.

An unrealistic cache, with 4 block frames: frame 00, frame 01, frame 10, and frame 11.

Each main memory block maps to a specific block frame... With 4 frames, 2 bits of the address define this mapping.

Take advantage of spatial locality by making the block size greater than 1 word. On a miss, copy the entire block into the cache, and then keep it there as long as possible. (Why?) How the cache uses the address to do a look up: the index # selects which block frame, and the low-order bits select the byte/word within the block.

© Karen Miller,  Which block frame is known as index # or (sometimes) line #  But, many main memory blocks map to the same cache block frame... only one may be in the frame at a time!  We must distinguish which one is in the frame right now.

Tag:
- the most significant bits of the block's address
- used to distinguish which main memory block is in the cache block frame
- the tag is kept in the cache together with its data block

How the address is utilized by the cache (so far):
address = | tag | index # | byte w/i block |
The cache stores tags alongside the data blocks.
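
To make the field breakdown concrete, here is a small C sketch with parameters chosen purely for illustration (16-byte blocks and 4 frames are assumptions, not values from the slides): it splits a 32-bit address into tag, index #, and byte-within-block.

#include <stdio.h>
#include <stdint.h>

/* Illustrative parameters (not from the slides). */
#define OFFSET_BITS 4                      /* 2^4 = 16 bytes per block      */
#define INDEX_BITS  2                      /* 2^2 = 4 block frames          */

int main(void) {
    uint32_t address = 0x12345678;

    uint32_t offset = address & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);

    printf("address = 0x%08x\n", address);
    printf("tag     = 0x%x\n", tag);       /* most significant bits         */
    printf("index   = %u\n", index);       /* selects the block frame       */
    printf("offset  = %u\n", offset);      /* byte within the block         */
    return 0;
}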

© Karen Miller,  Still missing... must distinguish block frames that have nothing in them from ones that have a block from main memory (consider power up for a computer system: nothing is in the cache)  We need 1 bit per block, most often called a valid bit (sometimes called a present bit)

Cache access (or cache lookup):
- The index # is used to find the correct block frame.
- Is the block frame valid?
  - YES: compare the address tag to the block frame's tag. Match: HIT. No match: MISS.
  - NO: MISS.

Completed diagram of the cache:
address = | tag | index # | byte w/i block |
Each block frame holds a valid bit, a tag, and a data block.
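
Putting the pieces together, here is a hedged C sketch of the lookup procedure just described (toy sizes and names of my own choosing, not the slides' design): use the index # to pick a frame, check the valid bit, then compare tags.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define OFFSET_BITS 4                       /* 16-byte blocks (illustrative) */
#define INDEX_BITS  2                       /* 4 block frames (illustrative) */
#define NUM_FRAMES  (1u << INDEX_BITS)

struct frame {
    bool     valid;                         /* valid bit                     */
    uint32_t tag;                           /* tag of the block held here    */
    uint8_t  data[1u << OFFSET_BITS];       /* the data block                */
};

static struct frame cache[NUM_FRAMES];

/* Returns true on a hit, false on a miss. */
bool lookup(uint32_t address) {
    uint32_t index = (address >> OFFSET_BITS) & (NUM_FRAMES - 1);
    uint32_t tag   = address >> (OFFSET_BITS + INDEX_BITS);

    struct frame *f = &cache[index];
    if (f->valid && f->tag == tag)
        return true;                        /* valid and tags match: HIT     */
    return false;                           /* invalid or tag mismatch: MISS */
}

int main(void) {
    memset(cache, 0, sizeof cache);         /* power up: nothing is valid    */
    printf("0x100 -> %s\n", lookup(0x100) ? "HIT" : "MISS");   /* MISS        */

    /* Pretend a miss brought the block containing 0x100 into the cache.     */
    cache[(0x100 >> OFFSET_BITS) & (NUM_FRAMES - 1)] =
        (struct frame){ .valid = true, .tag = 0x100 >> (OFFSET_BITS + INDEX_BITS) };

    printf("0x104 -> %s\n", lookup(0x104) ? "HIT" : "MISS");   /* HIT: same block */
    return 0;
}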

This cache is called direct mapped, or 1-way set associative, or set associative with a set size of 1. Each index # maps to exactly 1 block frame.

Compare (each frame holding V, Tag, Data): a direct mapped cache uses 3 bits for the index #, while a 2-way set associative cache with the same amount of data uses 2 bits for the index #.
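
A short sketch of why the index shrinks as associativity grows (8 block frames is an assumption chosen to match the 3-bit / 2-bit figures above): the index # selects a set, and the number of sets is the number of frames divided by the number of ways.

#include <stdio.h>

int main(void) {
    int frames = 8;                              /* assumed frame count        */
    for (int ways = 1; ways <= 8; ways *= 2) {
        int sets = frames / ways;                /* sets = frames / ways       */
        int index_bits = 0;
        while ((1 << index_bits) < sets)         /* index bits = log2(sets)    */
            index_bits++;
        printf("%d-way: %d sets, %d index bit(s)\n", ways, sets, index_bits);
    }
    return 0;                                    /* 1-way: 3 bits, 2-way: 2 bits, ... */
}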

How about 4-way set associative, or 8-way set associative? For a fixed number of block frames:
- larger set size tends to lead to higher hit ratios
- larger set size means that the amount of HW (circuitry) goes up, and Tc increases

Implementing writes:
1. Write through: change the data in the cache, and send the write to main memory. Slow, but very little circuitry.

2. Write back: at first, change the data only in the cache; write to memory only when necessary. A dirty bit is set on a write, to identify blocks to be written back to memory. When a program completes, all dirty blocks must be written to memory...

Write back (continued):
- faster: multiple stores to the same location result in only 1 main memory access
- more circuitry: must maintain the dirty bit
- dirty miss: a miss caused by a read or write to a block not in the cache, where the required block frame has its dirty bit set. So, there is a write of the dirty block, followed by a read of the requested block.
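
A hedged C sketch of the write-back behaviour just described (direct-mapped frames, toy sizes, and helper names are my own assumptions): a store marks the frame dirty, and a dirty miss writes the old block back before the new block is read in.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4
#define INDEX_BITS  2
#define NUM_FRAMES  (1u << INDEX_BITS)

struct frame {
    bool     valid;
    bool     dirty;                          /* set when the block is modified   */
    uint32_t tag;
    uint8_t  data[1u << OFFSET_BITS];
};

static struct frame cache[NUM_FRAMES];

/* Stand-ins for the main memory traffic; here they just log the access.       */
static void write_block_to_memory(uint32_t tag, uint32_t index) {
    printf("  write back dirty block (tag 0x%x, frame %u)\n", tag, index);
}
static void read_block_from_memory(struct frame *f, uint32_t tag) {
    printf("  read block (tag 0x%x) from memory\n", tag);
    f->valid = true;
    f->dirty = false;
    f->tag   = tag;
}

/* Write-back store of one byte. */
void store_byte(uint32_t address, uint8_t value) {
    uint32_t offset = address & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (address >> OFFSET_BITS) & (NUM_FRAMES - 1);
    uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);
    struct frame *f = &cache[index];

    if (!f->valid || f->tag != tag) {        /* miss                             */
        if (f->valid && f->dirty)            /* dirty miss: old block must go    */
            write_block_to_memory(f->tag, index);   /* back to memory first      */
        read_block_from_memory(f, tag);
    }
    f->data[offset] = value;                 /* change data only in the cache    */
    f->dirty = true;                         /* remember it must be written back */
}

int main(void) {
    store_byte(0x100, 1);   /* miss: block read from memory                      */
    store_byte(0x104, 2);   /* hit: same block, no main memory access            */
    store_byte(0x200, 3);   /* dirty miss: write old block back, then read new   */
    return 0;
}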

How about 2 separate caches?
- I-cache: for instructions only; can be rather small, and still have excellent performance.
- D-cache: for data only; needs to be fairly large.

We can send memory accesses to the 2 caches independently... (increased parallelism). The diagram shows P sending fetches to the I-cache and loads/stores to the D-cache, with both caches connected to M.

The cache between P and M is called an L1 cache (level 1). This hierarchy works so well that most systems have 2 levels of cache: P talks to L1, L1 to L2, and L2 to M.
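
The average-access-time formula from earlier extends naturally to two levels; the C sketch below is only an illustration with made-up timings and miss ratios (none of these numbers come from the slides): an L1 miss pays the L2 lookup time, and only an L2 miss pays the full trip to main memory.

#include <stdio.h>

int main(void) {
    /* Illustrative numbers only (not from the slides). */
    double t_l1 = 1.0;      /* ns, L1 hit time                            */
    double t_l2 = 5.0;      /* ns, extra time for an L2 lookup            */
    double t_m  = 60.0;     /* ns, extra time for main memory             */

    double miss_l1 = 0.05;  /* fraction of accesses that miss in L1       */
    double miss_l2 = 0.20;  /* fraction of L1 misses that also miss in L2 */

    /* AMAT = T_L1 + (miss ratio L1) * (T_L2 + (miss ratio L2) * T_M)     */
    double amat = t_l1 + miss_l1 * (t_l2 + miss_l2 * t_m);

    printf("AMAT with L1 + L2 = %.2f ns\n", amat);  /* 1 + 0.05*(5 + 0.2*60) = 1.85 */
    return 0;
}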