
1 © Karen Miller, 2011
What do we want from our computers?
- correct results: we assume this feature, but consider... who defines what is correct?
- fast: fast at what? (easy answer: fast at my programs)

2 © Karen Miller, 2011
[Figure: price vs. performance plot — performance runs from slow to fast, price from ¢ to $$$]

3 © Karen Miller, 2011
Architectural features: ways of increasing speed generally fall into 2 categories:
1. parallelism
2. memory hierarchies

4 © Karen Miller, 2011
Parallelism. Suppose we have 3 tasks: t1, t2, and t3, and they are independent. A serial implementation on 1 computer runs t1, then t2, then t3, one after another. A parallel implementation (given that we have 3 computers) runs t1, t2, and t3 at the same time.
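
A minimal sketch of this idea in C (assuming POSIX threads; the task functions are hypothetical stand-ins for real work): the three independent tasks run back-to-back, then again as three threads standing in for three computers.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-ins for three independent tasks. */
void *t1(void *arg) { puts("task 1 done"); return NULL; }
void *t2(void *arg) { puts("task 2 done"); return NULL; }
void *t3(void *arg) { puts("task 3 done"); return NULL; }

int main(void) {
    /* Serial: 1 computer runs the tasks one after another. */
    t1(NULL); t2(NULL); t3(NULL);

    /* Parallel: 3 threads (standing in for 3 computers) run them at once. */
    pthread_t a, b, c;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_create(&c, NULL, t3, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_join(c, NULL);
    return 0;
}
```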

5 © Karen Miller, 2011
Memory woes: the processor (P) and memory (M) are physically separate, which makes memory accesses SLOW! Co-locate P and M? Very expensive! Or the memory is too small!

6 © Karen Miller, 2011
A HW design technique to make some memory accesses complete faster is the implementation of hierarchical memory, also known as caching.

7 © Karen Miller, 2011
Recall the fetch and execute cycle:
- fetch instruction (*)
- PC update
- decode
- get operands (* for a load)
- do operation
- store result (* for a store)
(*) requires a memory access
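
The cycle can be sketched as a toy simulator in C (every opcode and encoding here is made up for illustration); note which steps touch memory.

```c
#include <stdio.h>

/* A toy machine; all opcodes and encodings are hypothetical. */
typedef enum { LOAD, ADD, STORE, HALT } op_t;
typedef struct { op_t op; int rd, rs, addr; } instr_t;

instr_t imem[] = {              /* instruction memory */
    { LOAD,  0, 0, 0 },         /* r0 = mem[0]        */
    { ADD,   1, 0, 0 },         /* r1 = r1 + r0       */
    { STORE, 0, 1, 1 },         /* mem[1] = r1        */
    { HALT,  0, 0, 0 },
};
int dmem[16] = { 42 };          /* data memory        */
int regs[4];

int main(void) {
    for (int pc = 0; ; ) {
        instr_t i = imem[pc];   /* fetch instruction: a memory access        */
        pc = pc + 1;            /* PC update                                 */
        switch (i.op) {         /* decode; get operands; do operation; store */
        case LOAD:  regs[i.rd] = dmem[i.addr];            break; /* 2nd memory access */
        case ADD:   regs[i.rd] = regs[i.rd] + regs[i.rs]; break;
        case STORE: dmem[i.addr] = regs[i.rs];            break; /* 2nd memory access */
        case HALT:  printf("dmem[1] = %d\n", dmem[1]);    return 0;
        }
    }
}
```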

8 © Karen Miller, 2011
Now look at the memory access patterns of lots of programs. In general, memory access patterns are not random. They exhibit locality:
1. temporal
2. spatial

9 © Karen Miller, 2011
Temporal locality: recently referenced memory locations are likely to be referenced again (soon!)

loop: instr 1  @ A1
      instr 2  @ A2
      instr 3  @ A3
      b loop   @ A4

Instruction stream references: A1 A2 A3 A4 A1 A2 A3 A4 A1 A2 A3...
Note that the same memory location is repeatedly read (for the fetch).
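
For instance (a hypothetical loop in C): the loop body's few instruction addresses, and the variable sum, are referenced on every iteration.

```c
#include <stdio.h>

int main(void) {
    int sum = 0;
    /* The loop body's instructions, and the variable sum, are
       referenced on every one of the 1000 iterations: temporal locality. */
    for (int i = 0; i < 1000; i++)
        sum += i;
    printf("%d\n", sum);   /* 499500 */
    return 0;
}
```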

10 © Karen Miller, 2011
Spatial locality: memory locations near referenced locations are likely to also be referenced.

[Figure: an array laid out contiguously in memory]

Code must do something to each element of the array. It must load each element...
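
For instance (hypothetical C): consecutive array elements sit at adjacent addresses, so each access lands near the previous one.

```c
#include <stdio.h>

#define N 1000

int main(void) {
    int a[N];
    /* a[i] and a[i+1] sit at adjacent addresses, so each access is
       near the previous one: spatial locality. */
    for (int i = 0; i < N; i++)
        a[i] = 2 * i;
    printf("%d\n", a[N - 1]);   /* 1998 */
    return 0;
}
```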

11 © Karen Miller, 2011
The fetch of the code exhibits a high degree of spatial locality: I2 is next to I1. If these instructions are not branches, then we fetch I1, I2, I3, etc.

12 © Karen Miller, 2011
A cache is designed to hold copies of a subset of memory locations. It is:
- smaller (in terms of bytes) than main memory
- faster than main memory
- co-located: processor and cache are on the same chip

13 © Karen Miller, 2011
[Image: Intel 386 chip (1985)]

14 © Karen Miller, 2011
[Image: Pentium II (1997)]

15 © Karen Miller, 2011
P sends a memory request to C (the cache).
- hit: the requested location's copy is in C
- miss: the requested location's copy is NOT in C, so send the memory access to M
[Figure: P and C together on one chip; M separate]

16 © Karen Miller, 2011
Needed terminology:

miss ratio = (# of misses) / (total # of accesses)
hit ratio  = (# of hits) / (total # of accesses), or 1 - miss ratio

You already assumed that: total # of accesses = # of misses + # of hits

17 © Karen Miller, 2011
So, when designing a cache, keep the bytes likely to be referenced (again), and their neighbors, in the cache... So, what is in the cache is different for each different program. On average, for a given program:

Average Memory Access Time (AMAT) = Tc + (miss ratio)(Tm)

18 © Karen Miller, 2011
For example: Tc = 1 nsec, Tm = 20 nsec, and a specific program has 98% hits...

AMAT = 1 + (.02)(20) = 1.4 nsec

Each individual memory access takes 1 nsec (hit) or 21 nsec (miss).
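
That arithmetic can be checked with a short C function (the name amat and its parameters are just for illustration):

```c
#include <stdio.h>

/* AMAT = Tc + (miss ratio)(Tm), from the previous slide. */
static double amat(double tc_ns, double tm_ns, double miss_ratio) {
    return tc_ns + miss_ratio * tm_ns;
}

int main(void) {
    /* Tc = 1 nsec, Tm = 20 nsec, 98% hits => miss ratio 0.02 */
    printf("AMAT = %.1f nsec\n", amat(1.0, 20.0, 0.02));   /* 1.4 nsec */
    return 0;
}
```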

19 © Karen Miller, 2011
Divide all of memory up into fixed-size blocks... Copy the entire block into the cache. Make the block size greater than 1 word.

[Figure: memory divided into blocks, with 1 block highlighted]

20 © Karen Miller, 2011
An unrealistic cache, with 4 block frames:

[Figure: 4 block frames, labeled block 00, block 01, block 10, and block 11]

21 © Karen Miller, 2011
Each main memory block maps to a specific block frame. 2 bits of the address define this mapping.

[Figure: main memory blocks mapping to cache frames 00, 01, 10, 11]

22 © Karen Miller, 2011
Take advantage of spatial locality by making the block size greater than 1 word. On a miss, copy the entire block into the cache, and then keep it there as long as possible. (Why?)

How the cache uses the address to do a lookup:

| ? | index # | byte/word within block |

The index # gives which block frame.
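
A sketch of splitting an address into those fields in C; the field widths (2 offset bits, 2 index bits) are assumptions matching the 4-frame example:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2   /* 4 bytes per block (assumption) */
#define INDEX_BITS  2   /* 4 block frames    (assumption) */

int main(void) {
    uint32_t addr   = 0x1A7;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                  /* byte w/i block */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* which frame    */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);                /* all the rest   */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);           /* 26, 1, 3       */
    return 0;
}
```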

23 © Karen Miller, 2011
- Which block frame is known as the index # or (sometimes) the line #.
- But many main memory blocks map to the same cache block frame... only one may be in the frame at a time!
- We must distinguish which one is in the frame right now.

24 © Karen Miller, 2011
tag:
- the most significant bits of the block's address
- used to distinguish which main memory block is in the cache block frame
- the tag is kept in the cache together with its data block

25 © Karen Miller, 2011
How the address is utilized by the cache (so far):

| tag | index # | byte w/i block |

[Figure: frames 00-11, each holding a tag and a data block]

26 © Karen Miller, 2011
- Still missing... we must distinguish block frames that have nothing in them from ones that hold a block from main memory (consider power-up for a computer system: nothing is in the cache).
- We need 1 bit per block, most often called a valid bit (sometimes called a present bit).

27 © Karen Miller, 2011
Cache access (or cache lookup):
- The index # is used to find the correct block frame.
- Is the block frame valid?
  - YES: compare the address tag to the block frame's tag:
    - match: HIT
    - no match: MISS
  - NO: MISS
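
Putting slides 22 through 27 together, a minimal direct-mapped lookup sketch in C (all sizes, names, and the pre-filled frame in main are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2
#define INDEX_BITS  2
#define BLOCK_BYTES (1u << OFFSET_BITS)
#define NFRAMES     (1u << INDEX_BITS)

struct frame {
    bool     valid;               /* valid bit                      */
    uint32_t tag;                 /* which memory block is here now */
    uint8_t  data[BLOCK_BYTES];   /* copy of that block             */
};

static struct frame cache[NFRAMES];

/* Returns true on a HIT and fills *byte; false means MISS (go to M). */
static bool lookup(uint32_t addr, uint8_t *byte) {
    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NFRAMES - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    struct frame *f = &cache[index];    /* index # finds the frame    */
    if (f->valid && f->tag == tag) {    /* valid, and tags match: HIT */
        *byte = f->data[offset];
        return true;
    }
    return false;                       /* invalid, or no match: MISS */
}

int main(void) {
    /* Pretend a miss already copied in the block holding 0x1A4..0x1A7. */
    cache[1] = (struct frame){ .valid = true, .tag = 0x1A7 >> 4,
                               .data = { 10, 20, 30, 40 } };
    uint8_t b;
    if (lookup(0x1A7, &b)) printf("HIT: %d\n", b);   /* prints HIT: 40 */
    else                   printf("MISS\n");
    return 0;
}
```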

28 © Karen Miller, 2011
Completed diagram of the cache:

| tag | index # | byte w/i block |

[Figure: frames 00-11, each now holding a valid bit, a tag, and a data block]

29 © Karen Miller, 2011
This cache is called direct mapped, or 1-way set associative, or set associative with a set size of 1. Each index # maps to exactly 1 block frame.

30 © Karen Miller, 2011
[Figure: direct mapped — a single column of V/Tag/Data frames, 3 bits for index #; 2-way set associative — two columns of V/Tag/Data frames, 2 bits for index #; same amount of data in both]

31 © Karen Miller, 2011
How about 4-way set associative, or 8-way set associative? For a fixed number of block frames:
- larger set size tends to lead to higher hit ratios (good)
- larger set size means that the amount of HW (circuitry) goes up, and Tc increases (bad)
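
For comparison, a sketch of how the lookup changes when each set holds several frames (hypothetical 2-way sizes; real hardware compares all the ways in parallel, so the loop here is only a software analogy):

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 2
#define INDEX_BITS  2                /* 4 sets (assumption)   */
#define WAYS        2                /* 2-way set associative */
#define NSETS       (1u << INDEX_BITS)
#define BLOCK_BYTES (1u << OFFSET_BITS)

struct frame { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct frame cache[NSETS][WAYS];

static bool lookup(uint32_t addr, uint8_t *byte) {
    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NSETS - 1);   /* selects a SET now */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Check every frame in the set; HW does all the tag compares at once. */
    for (int way = 0; way < WAYS; way++) {
        struct frame *f = &cache[index][way];
        if (f->valid && f->tag == tag) { *byte = f->data[offset]; return true; }
    }
    return false;   /* MISS in every way of the set */
}

int main(void) {
    cache[1][1] = (struct frame){ .valid = true, .tag = 0x1A7 >> 4,
                                  .data = { 1, 2, 3, 4 } };
    uint8_t b;
    return (lookup(0x1A7, &b) && b == 4) ? 0 : 1;   /* HIT in way 1 of set 1 */
}
```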

32 © Karen Miller, 2011
Implementing writes:

1. write through — change the data in the cache, and also send the write to main memory. Slow (bad), but very little circuitry (good).

33 © Karen Miller, 2011
2. write back — at first, change the data in the cache only; write to memory only when necessary. A dirty bit is set on a write, to identify blocks to be written back to memory. When a program completes, all dirty blocks must be written to memory...

34 © Karen Miller, 2011
2. write back (continued)
- faster (good): multiple stores to the same location result in only 1 main memory access
- more circuitry (bad): must maintain the dirty bit
- dirty miss: a miss caused by a read or write to a block not in the cache, where the required block frame has its dirty bit set. So, there is a write of the dirty block, followed by a read of the requested block.
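
A sketch of a write-back store in C, adding a dirty bit to the direct-mapped frame; the block-transfer helpers and the fake memory array are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 2
#define INDEX_BITS  2
#define BLOCK_BYTES (1u << OFFSET_BITS)
#define NFRAMES     (1u << INDEX_BITS)

struct frame { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct frame cache[NFRAMES];

static uint8_t memory[4096];   /* fake main memory, for the sketch only */

/* Assumed memory interface: move one whole block at a time. */
static void read_block(uint32_t tag, uint32_t index, uint8_t *data) {
    uint32_t base = ((tag << INDEX_BITS) | index) << OFFSET_BITS;
    for (uint32_t i = 0; i < BLOCK_BYTES; i++) data[i] = memory[base + i];
}
static void write_block(uint32_t tag, uint32_t index, const uint8_t *data) {
    uint32_t base = ((tag << INDEX_BITS) | index) << OFFSET_BITS;
    for (uint32_t i = 0; i < BLOCK_BYTES; i++) memory[base + i] = data[i];
}

static void store_byte(uint32_t addr, uint8_t byte) {
    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NFRAMES - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    struct frame *f = &cache[index];

    if (!f->valid || f->tag != tag) {      /* MISS                              */
        if (f->valid && f->dirty)          /* dirty miss: write old block first */
            write_block(f->tag, index, f->data);
        read_block(tag, index, f->data);   /* then read the requested block     */
        f->valid = true;
        f->tag   = tag;
        f->dirty = false;
    }
    f->data[offset] = byte;   /* change data in the cache only...       */
    f->dirty = true;          /* ...and remember to write it back later */
}

int main(void) {
    store_byte(0x1A7, 42);    /* repeated stores to one location cost   */
    store_byte(0x1A7, 43);    /* no extra main memory writes            */
    return 0;
}
```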

35 © Karen Miller, 2011
How about 2 separate caches?

I-cache:
- for instructions only
- can be rather small, and still have excellent performance

D-cache:
- for data only
- needs to be fairly large

36 © Karen Miller, 2011
We can send memory accesses to the 2 caches independently... (increased parallelism)

[Figure: P sends instruction fetches to the I-cache and loads/stores to the D-cache; both caches connect to M]

37 © Karen Miller, 2011
- The cache next to the processor is called an L1 cache (level 1).
- This hierarchy works so well that most systems have 2 levels of cache.

[Figure: P - C - M becomes P - L1 - L2 - M]
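
A standard extension of the slide-17 formula (not stated on these slides, but widely used): with two levels, a miss in L1 pays the L2 access time, and a miss in L2 pays the trip to M. A quick check in C with made-up numbers:

```c
#include <stdio.h>

int main(void) {
    /* Made-up numbers: L1 hit 1 nsec, L2 hit 5 nsec, memory 20 nsec,
       5% L1 miss ratio, 20% L2 miss ratio (of accesses that reach L2). */
    double t_l1 = 1.0, t_l2 = 5.0, t_m = 20.0;
    double m_l1 = 0.05, m_l2 = 0.20;

    /* AMAT = T_L1 + (m_L1)(T_L2 + (m_L2)(T_m)) */
    double amat = t_l1 + m_l1 * (t_l2 + m_l2 * t_m);
    printf("AMAT = %.2f nsec\n", amat);   /* 1 + 0.05 * (5 + 0.2 * 20) = 1.45 */
    return 0;
}
```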

