
1 Cache Extras & Branch Prediction

2 TLB and Cache Operation (See Figure 7.26 on page 594) On a memory access, the following operations occur.

3 Virtual Memory And Cache [Flowchart: the Virtual Address goes to a TLB Access; on a TLB miss the Page Table is accessed, on a TLB hit the cache is accessed with the resulting Physical Address; on a Cache Hit the Data is returned, on a cache miss Memory is accessed.] TLB access is serial with cache access.

4 Overlapped TLB & Cache Access [Figure: one physical address shown two ways. Virtual Memory view: Physical page number (bits 29..12) | Page offset (bits 11..0). Cache view: Tag | #Set | Disp, with the #Set field extending above bit 11.] In the above example #Set is not contained within the Page Offset → the #Set is not known until the Physical page number is known → the cache can be accessed only after address translation is done.

5 Overlapped TLB & Cache Access (cont) [Figure: the same physical address. Virtual Memory view: Physical page number (bits 29..12) | Page offset (bits 11..0). Cache view: Tag | #Set | Disp, with the #Set field contained within bits 11..0.] In the above example #Set is contained within the Page Offset → the #Set is known immediately → the cache can be accessed in parallel with address translation → first the tags from the appropriate set are brought; tag comparison takes place only after the Physical page number is known (after address translation is done).

6 Overlapped TLB & Cache Access (cont) First Stage –Virtual Page Number goes to TLB for translation –Page offset goes to cache to get all tags from appropriate set Second Stage –The Physical Page Number, obtained from the TLB, goes to the cache for tag comparison. Limitation: Cache ≤ (page size * associativity) How can we overcome this limitation ? –Fetch 2 sets and mux after translation –Increase associativity
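
A minimal sketch of the arithmetic behind this limitation; the page size and associativity below are illustrative assumptions. For the set index to fit inside the page offset, the cache can hold at most page_size × associativity bytes.

#include <stdio.h>

/* Sketch: for overlapped TLB/cache access the set index must lie inside
 * the page offset, so cache_size <= page_size * associativity.
 * 4 KB pages and 8-way associativity are illustrative values only. */
int main(void) {
    unsigned page_size = 4096;
    unsigned assoc     = 8;
    unsigned max_cache = page_size * assoc;
    printf("Max cache size for overlapped access: %u KB\n", max_cache / 1024);  /* 32 KB */
    return 0;
}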

7 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 7 Pseudo LRU  We will use a 4-way set-associative cache as an example.  Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on).  Pseudo LRU (PLRU) records only a partial order, using 3 bits per set: – Bit 0 specifies whether the LRU way is one of {0, 1} or one of {2, 3} – Bit 1 specifies which of ways 0 and 1 was least recently used – Bit 2 specifies which of ways 2 and 3 was least recently used  For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit 0 = 1, bit 1 = 1, bit 2 = 1. [Figure: a binary tree over ways 0-3, with bit 0 at the root and bits 1 and 2 at the leaves.]
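
A minimal C sketch of the 3-bit tree PLRU described above, for a single 4-way set. The bit meanings follow the slide (bit 0 selects which pair holds the LRU way, bit 1 covers ways 0/1, bit 2 covers ways 2/3); the exact 0/1 encoding is an assumption, chosen so that the access sequence 3, 0, 2, 1 ends with bit 0 = bit 1 = bit 2 = 1 as in the slide's example.

#include <stdint.h>

/* One 4-way set's PLRU state, 3 bits (encoding assumed):
 *   bit 0 = 1 -> the LRU way is in {2,3}, 0 -> in {0,1}
 *   bit 1 = 1 -> way 0 is the LRU of {0,1}, 0 -> way 1 is
 *   bit 2 = 1 -> way 3 is the LRU of {2,3}, 0 -> way 2 is */
typedef struct { uint8_t bits; } plru4_t;

/* On an access to 'way', point every bit on its tree path away from it. */
void plru4_touch(plru4_t *s, int way) {
    if (way < 2) {
        s->bits |= 1u << 0;                    /* next victim pair: {2,3}   */
        if (way == 0) s->bits &= ~(1u << 1);   /* way 1 becomes LRU of pair */
        else          s->bits |=  (1u << 1);   /* way 0 becomes LRU of pair */
    } else {
        s->bits &= ~(1u << 0);                 /* next victim pair: {0,1}   */
        if (way == 2) s->bits |=  (1u << 2);   /* way 3 becomes LRU of pair */
        else          s->bits &= ~(1u << 2);   /* way 2 becomes LRU of pair */
    }
}

/* Follow the bits toward the (approximate) least recently used way. */
int plru4_victim(const plru4_t *s) {
    if (s->bits & (1u << 0))
        return (s->bits & (1u << 2)) ? 3 : 2;
    return (s->bits & (1u << 1)) ? 0 : 1;
}

Touching ways 3, 0, 2, 1 in that order starting from an all-zero state leaves bits = 111 and plru4_victim() returns 3, matching the example on the slide.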

8 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 8 3Cs Absolute Miss Rate (SPEC92) [Figure: absolute miss rate vs. cache size for SPEC92, broken down into conflict, capacity, and compulsory components; the compulsory component is vanishingly small.]

9 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 9 How Can We Reduce Misses?  3 Cs: Compulsory, Capacity, Conflict  In all cases, assume total cache size is not changed:  What happens if: 1) Change Block Size: Which of 3Cs is obviously affected? 2) Change Associativity: Which of 3Cs is obviously affected? 3) Change Compiler: Which of 3Cs is obviously affected?

10 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 10 Reduce Misses via Larger Block Size

11 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 11 Reduce Misses via Higher Associativity  We have two conflicting trends here:  Higher associativity improves the hit ratio, BUT it also  increases the access time  slows down replacement  increases complexity  Most modern cache memory systems use at least 4-way set-associative caches

12 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 12 Example: Avg. Memory Access Time vs. Miss Rate  Example: assume Cache Access Time = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CAT of direct mapped

Effective access time to cache, by cache size and associativity:

  Cache Size (KB)   1-way   2-way   4-way   8-way
        1           2.33    2.15    2.07    2.01
        2           1.98    1.86    1.76    1.68
        4           1.72    1.67    1.61    1.53
        8           1.46    1.48    1.47    1.43
       16           1.29    1.32    1.32    1.32
       32           1.20    1.24    1.25    1.27
       64           1.14    1.20    1.21    1.23
      128           1.10    1.17    1.18    1.20

(On the slide, red entries mark cases not improved by more associativity.) Note this is for a specific example.
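
The table is a specific worked example from the source; as a hedged sketch, such effective (average) access times are conventionally built from AMAT = hit time + miss rate × miss penalty. The hit times, miss rates, and miss penalty below are illustrative assumptions, not the parameters behind the slide's numbers.

#include <stdio.h>

/* Standard AMAT formula; all inputs below are illustrative assumptions. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("direct mapped: %.2f\n", amat(1.00, 0.050, 25.0));  /* ~2.25 */
    printf("2-way        : %.2f\n", amat(1.10, 0.040, 25.0));  /* ~2.10 */
    printf("4-way        : %.2f\n", amat(1.12, 0.035, 25.0));  /* ~2.00 */
    return 0;
}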

13 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 13 Reducing Miss Penalty by Critical Word First  Don't wait for the full block to be loaded before restarting the CPU – Early restart  As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution – Critical Word First  Request the missed word first from memory and send it to the CPU as soon as it arrives  Let the CPU continue execution while filling in the rest of the words in the block  Also called wrapped fetch and requested word first  Example: – 64-bit = 8-byte bus, 32-byte cache line → 4 bus cycles to fill the line – Fetch data from 95H [Figure: the line 80H-9FH split into four 8-byte chunks (80H-87H, 88H-8FH, 90H-97H, 98H-9FH), filled out of order starting with the chunk containing 95H.]
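
A small sketch of the wrapped-fetch order for the slide's example (a request to 95H in the 32-byte line 80H-9FH over an 8-byte bus). It assumes a simple wrap-around policy starting from the chunk that contains the requested address; the exact cycle numbering in the original figure is garbled in the transcript.

#include <stdio.h>

/* Critical word first / wrapped fetch: start with the 8-byte chunk that
 * holds the requested address, then wrap around the 32-byte line. */
int main(void) {
    unsigned addr = 0x95, line = 32, chunk = 8;
    unsigned base  = addr & ~(line - 1);            /* 0x80              */
    unsigned first = (addr & (line - 1)) / chunk;   /* chunk 2 = 90H-97H */
    unsigned n     = line / chunk;

    for (unsigned i = 0; i < n; i++) {
        unsigned c = (first + i) % n;
        printf("bus cycle %u: %02X-%02X\n",
               i + 1, base + c * chunk, base + c * chunk + chunk - 1);
    }
    return 0;   /* prints 90-97, 98-9F, 80-87, 88-8F */
}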

14 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 14 Prefetchers  In order to avoid compulsory misses, we need to bring the information in before it is requested by the program  We can exploit the locality-of-reference behavior – Space -> bring in the surrounding data as well – Time -> the same "patterns" repeat themselves  Prefetching relies on having extra memory bandwidth that can be used without penalty  There are hardware and software prefetchers.

15 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 15 Hardware Prefetching  Instruction Prefetching – Alpha 21064 fetches 2 blocks on a miss  The extra block is placed in a stream buffer in order to avoid possible cache pollution in case the prefetched instructions are not required  On a miss, check the stream buffer – Branch-predictor-directed prefetching  Let the branch predictor run ahead  Data Prefetching – Try to predict future data accesses  Next sequential  Stride  General pattern

16 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 16 Software Prefetching  Data Prefetch – Load data into register (HP PA-RISC loads) – Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) – Special prefetching instructions cannot cause faults; a form of speculative execution  How it is done – Special prefetch intrinsic in the language – Automatically by the compiler  Issuing Prefetch Instructions takes time – Is cost of prefetch issues < savings in reduced misses? – Higher superscalar reduces difficulty of issue bandwidth
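
A hedged illustration of the "special prefetch intrinsic" approach mentioned above, using the GCC/Clang __builtin_prefetch builtin as one concrete example (the builtin is real, but it is my choice of example and the prefetch distance is a tuning assumption).

/* Software prefetching with a compiler intrinsic (GCC/Clang builtin).
 * PF_DIST is how far ahead we prefetch, in elements; it is a tuning
 * assumption, not a universally right value. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 1 /* low locality */);
        s += a[i];                 /* the prefetch overlaps with this work */
    }
    return s;
}

The prefetch instructions themselves cost issue bandwidth, which is exactly the trade-off the slide raises (cost of issuing prefetches vs. savings in reduced misses).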

17 CS252/Culler Lec 4.17 1/31/02 Reducing Misses by Hardware Prefetching of Instructions & Data E.g., Instruction Prefetching –Alpha 21064 fetches 2 blocks on a miss –Extra block placed in "stream buffer" –On miss check stream buffer Works with data blocks too: –Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43% –Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches Prefetching relies on having extra memory bandwidth that can be used without penalty

18 CS252/Culler Lec 4.18 1/31/02 Reducing Misses by Software Prefetching Data Data Prefetch –Load data into register (HP PA-RISC loads) –Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) –Special prefetching instructions cannot cause faults; a form of speculative execution Prefetching comes in two flavors: –Binding prefetch: Requests load directly into register. »Must be correct address and register! –Non-Binding prefetch: Load into cache. »Can be incorrect. Faults? Issuing Prefetch Instructions takes time –Is cost of prefetch issues < savings in reduced misses? –Higher superscalar reduces difficulty of issue bandwidth

19 CS252/Culler Lec 4.19 1/31/02 Reducing Misses by Compiler Optimizations McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software Instructions –Reorder procedures in memory so as to reduce conflict misses –Profiling to look at conflicts (using tools they developed) Data –Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays –Loop Interchange: change nesting of loops to access data in the order stored in memory –Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap –Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

20 CS252/Culler Lec 4.20 1/31/02 Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality

21 CS252/Culler Lec 4.21 1/31/02 Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

22 CS252/Culler Lec 4.22 1/31/02 Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improve spatial locality

23 CS252/Culler Lec 4.23 1/31/02 Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j];
    }
    x[i][j] = r;
  }

Two Inner Loops: –Read all NxN elements of z[] –Read N elements of 1 row of y[] repeatedly –Write N elements of 1 row of x[] Capacity Misses a function of N & Cache Size: –2N³ + N² => (assuming no conflict; otherwise …) Idea: compute on a BxB submatrix that fits

24 CS252/Culler Lec 4.24 1/31/02 Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1) {
          r = r + y[i][k]*z[k][j];
        }
        x[i][j] = x[i][j] + r;
      }

B is called the Blocking Factor Capacity Misses go from 2N³ + N² to N³/B + 2N² Conflict Misses Too?

25 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 25 Other techniques

26 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 26 Multi-ported Cache and Banked Cache  An n-ported cache enables n cache accesses in parallel – Parallelize cache accesses in different pipeline stages – Parallelize cache accesses in a superscalar processor  But it effectively doubles the cache die size  Possible solution: banked cache – Each line is divided into n banks – Can fetch data from k ≤ n different banks (in possibly different lines)
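
A minimal sketch of how a banked data array might select a bank from an address; the line size, number of banks, and the selection function are illustrative assumptions.

#include <stdint.h>

/* Illustrative banked-cache bank selection: each 64-byte line is split
 * into 4 banks of 16 bytes. Two accesses can proceed in the same cycle
 * as long as they fall into different banks. All widths are assumptions. */
enum { LINE_SIZE = 64, N_BANKS = 4, BANK_WIDTH = LINE_SIZE / N_BANKS };

static inline unsigned bank_of(uint32_t addr) {
    return (addr % LINE_SIZE) / BANK_WIDTH;     /* which slice of the line */
}

static inline int bank_conflict(uint32_t a, uint32_t b) {
    return bank_of(a) == bank_of(b);            /* same bank -> serialize  */
}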

27 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 27 Separate Code / Data Caches  Enables parallelism between data accesses (done in the memory access stage) and instruction fetch (done in the fetch stage of a pipelined processor)  The code cache is a read-only cache – No need to write the line back to memory when it is evicted – Simpler to manage  What about self-modifying code? (x86 only) – Whenever executing a memory write, we need to snoop the code cache – If the code cache contains the written address, the line containing that address is invalidated – Now the code cache is accessed both in the fetch stage and in the memory access stage  Tags need to be dual-ported to avoid stalling

28 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 28 Increasing the size with minimum latency loss - L2 cache  L2 is much larger than L1 (256K-1M compared to 32K-64K)  It used to be an off-chip cache (between the cache and the memory bus). Now, most implementations are on-chip (but some architectures have an off-chip level-3 cache) – If L2 is on-chip, why not just make L1 larger?  Can be inclusive: – All addresses in L1 are also contained in L2 – Data in L1 may be more up to date than in L2 – L2 is unified (code / data)  Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are)

29 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 29 Victim Cache  Problem: the per-set load may be non-uniform – some sets may have more conflict misses than others  Solution: allocate ways to sets dynamically, according to the load  When a line is evicted from the cache it is placed in the victim cache – If the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1  On a cache lookup, a victim cache lookup is also performed (in parallel)  On a victim cache hit, – the line is moved back into the cache – the evicted line is moved to the victim cache – Same access time as a cache hit  Especially effective for a direct-mapped cache – Combines the fast hit time of a direct-mapped cache while still reducing conflict misses
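
A hedged sketch of the L1 + victim cache interaction described above. Every type and helper here (l1_lookup, vc_lookup, l1_evict_for, vc_insert, fetch_from_l2) is a hypothetical placeholder, not a real simulator's API; in hardware the L1 and victim-cache probes happen in parallel rather than sequentially.

#include <stdbool.h>
#include <stdint.h>

typedef struct line line_t;

/* Hypothetical helpers (placeholders only). */
extern bool    l1_lookup(uint32_t addr, line_t **hit);
extern bool    vc_lookup(uint32_t addr, line_t **hit);
extern line_t *l1_evict_for(uint32_t addr);   /* free an L1 slot, return the victim */
extern void    vc_insert(line_t *l);          /* may push the VC's LRU line to L2   */
extern line_t *fetch_from_l2(uint32_t addr);

line_t *cache_access(uint32_t addr) {
    line_t *l;
    if (l1_lookup(addr, &l))                  /* L1 hit                              */
        return l;
    if (vc_lookup(addr, &l)) {                /* victim-cache hit: swap the lines    */
        vc_insert(l1_evict_for(addr));        /* evicted L1 line goes to the VC      */
        return l;                             /* hit line is installed back into L1  */
    }
    vc_insert(l1_evict_for(addr));            /* plain miss: L1 victim goes to VC    */
    return fetch_from_l2(addr);               /* line comes from L2 into L1          */
}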

30 Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005 30 Stream Buffers  Before inserting a new line into the cache, put it in a stream buffer  The line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed again in the future  Example: – Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once – If we insert the array into the cache it will thrash the entire cache – If we detect that this is just a scan-once operation (e.g., using a hint from the software) we can avoid putting the array lines into the cache

31 Virtual memory and process switch Whenever we switch the execution between processes we need to: –Save the state of the process –Load the state of the new process –Make sure that the P0BPT and P1BPT registers are pointing to the new translation tables –Clean the TLB –We do not need to clean the cache (Why????)

32

33 Prediction: Branches, Dependencies, Data New era in computing? Prediction has become essential to getting good performance from scalar instruction streams. We will discuss predicting branches, data dependencies, actual data, and results of groups of instructions: –At what point does computation become a probabilistic operation + verification? –We are pretty close with control hazards already… Why does prediction work? –Underlying algorithm has regularities. –Data that is being operated on has regularities. –Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems. Prediction  Compressible information streams?

34 Dynamic Branch Prediction Is dynamic branch prediction better than static branch prediction? –Seems to be. Still some debate to this effect –Josh Fisher had a good paper on "Predicting Conditional Branch Directions from Previous Runs of a Program," ASPLOS '92. In general, good results if allowed to run the program for lots of data sets. »How would this information be stored for later use? »Still some difference between the best possible static prediction (using a run to predict itself) and a weighted average over many different data sets –The paper by Young et al., "A Comparative Analysis of Schemes for Correlated Branch Prediction," notices that there are a small number of important branches in programs which have dynamic behavior.

35 Dynamic Branch Prediction Branch-Prediction Buffer (branch history table): The simplest thing to do with a branch is to predict whether or not it is taken. This helps in pipelines where the branch delay is longer than the time it takes to compute the possible target PCs. –If we can save the decision time, we can branch sooner. Note that this scheme does NOT help with the MIPS we studied. –Since the branch decision and target PC are computed in ID, assuming there is no hazard on the register tested.

36 Basic Idea Keep a buffer (cache) indexed by the lower portion of the address of the branch instruction. –Along with some bit(s) to indicate whether or not the branch was recently taken. If the prediction is incorrect, the prediction bit is inverted and stored back. The branch direction could be incorrect because: Of misprediction OR Instruction mismatch (a different branch aliased to the same entry) In either case, the worst that happens is that you have to pay the full latency for the branch.
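
A minimal sketch of such a buffer as a 1-bit branch history table indexed by the low bits of the branch PC; the table size, index function, and lack of tags are assumptions that mirror the description above.

#include <stdbool.h>
#include <stdint.h>

/* 1-bit branch history table. No tag is stored, so two branches can alias
 * to the same entry (the "instruction mismatch" case above). */
#define BHT_ENTRIES 4096

static uint8_t bht[BHT_ENTRIES];                 /* 0 = not taken, 1 = taken */

static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_ENTRIES; }

bool bht_predict(uint32_t pc)             { return bht[bht_index(pc)] != 0; }
void bht_update (uint32_t pc, bool taken) { bht[bht_index(pc)] = taken; }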

37 Need Address at Same Time as Prediction Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken) –Note: must check for a branch match now, since we can't use a wrong branch's address Return instruction addresses are predicted with a stack [Figure: the BTB is a table of (Branch PC, Predicted PC) pairs; at fetch, the PC of the instruction is compared (=?) against the stored Branch PC, and the matching entry also says whether to predict taken or untaken.]

38 Avi Mendelson 3/2005 © 38 Basic MIPS  Arch, pipeline Dynamic Branch Prediction  Add a Branch Target Buffer (BTB) that predicts (at fetch)  whether the instruction is a branch  branch taken / not-taken  the taken-branch target [Figure: the PC of the instruction in fetch is looked up in the BTB, which holds (Branch PC, Target PC, History) entries; a hit means the instruction is predicted to be a branch and yields a predicted target and direction, a miss means it is not predicted to be a branch.]

39 Avi Mendelson 3/2005 © 39 Basic MIPS  Arch, pipeline BTB  Allocation  Allocate instructions identified as branches (after decode)  Both conditional and unconditional branches are allocated  Not taken branches need not be allocated  BTB miss implicitly predicts not-taken  Prediction  BTB lookup is done parallel to IC lookup  BTB provides  Indication that the instruction is a branch (BTB hits)  Branch predicted target  Branch predicted direction  Branch predicted type (e.g., conditional, unconditional)  Update (when branch outcome is known)  Branch target  Branch history (taken / not-taken)

40 Avi Mendelson 3/2005 © 40 Basic MIPS  Arch, pipeline BTB (cont.)  Wrong prediction  Predict not-taken, actual taken  Predict taken, actual not-taken, or actual taken but wrong target  In case of a wrong prediction – flush the pipeline  Reset latches (same as turning all the instructions into NOPs)  Select the PC source to be from the correct path  Need to keep the fall-through address with the branch  Start fetching instructions from the correct path  Assuming a correct-prediction rate of P: CPI new = 1 + (0.2 × (1-P)) × 3  For example, if P=0.7: CPI new = 1 + (0.2 × 0.3) × 3 = 1.18
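
A small sketch that evaluates the slide's CPI formula for a few prediction rates, under the slide's assumptions (20% of instructions are branches, 3-cycle misprediction penalty).

#include <stdio.h>

/* CPI_new = 1 + (branch_freq * (1 - P)) * penalty, per the slide. */
int main(void) {
    const double branch_freq = 0.2, penalty = 3.0;
    const double p[] = { 0.70, 0.90, 0.95 };
    for (int i = 0; i < 3; i++)
        printf("P = %.2f -> CPI = %.2f\n",
               p[i], 1.0 + branch_freq * (1.0 - p[i]) * penalty);
    return 0;   /* P = 0.70 gives CPI = 1.18, as on the slide */
}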

41 Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table: lower bits of the PC address index a table of 1-bit values –Says whether or not the branch was taken last time –No address check (saves HW, but may not be the right branch) Problem: in a loop, a 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit): –End of loop case, when it exits instead of looping as before –First time through the loop on the next time through the code, when it predicts exit instead of looping –Only 80% accuracy even though the branch is taken 90% of the time

42 Dynamic Branch Prediction (Jim Smith, 1981) Solution: 2-bit scheme where we change the prediction only if we get a misprediction twice. Red: stop, not taken. Green: go, taken. Adds hysteresis to the decision-making process. [State diagram: two Predict Taken states and two Predict Not Taken states; taken (T) outcomes move toward the strongly-taken state, not-taken (NT) outcomes move toward the strongly-not-taken state, so the prediction flips only after two consecutive mispredictions.]
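
A minimal sketch of one common implementation of the 2-bit scheme: a saturating counter per entry. The counter encoding and table size are assumptions, and the slide's exact state diagram may differ slightly in how the weak states transition, but the key property is the same: the prediction flips only after two consecutive mispredictions.

#include <stdbool.h>
#include <stdint.h>

#define BHT2_ENTRIES 4096

/* 2-bit saturating counter per entry: 0,1 predict not taken; 2,3 predict taken. */
static uint8_t bht2[BHT2_ENTRIES];

static unsigned idx2(uint32_t pc) { return (pc >> 2) % BHT2_ENTRIES; }

bool bht2_predict(uint32_t pc) { return bht2[idx2(pc)] >= 2; }

void bht2_update(uint32_t pc, bool taken) {
    uint8_t *c = &bht2[idx2(pc)];
    if (taken)  { if (*c < 3) (*c)++; }   /* saturate at strongly taken     */
    else        { if (*c > 0) (*c)--; }   /* saturate at strongly not taken */
}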

43 BHT Accuracy Mispredict because either: –Wrong guess for that branch –Got the branch history of the wrong branch when indexing the table With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% 4096 entries are about as good as an infinite table (in the Alpha 21164)

44 Case for Branch Prediction when Issue N instructions per clock cycle 1.Branches will arrive up to n times faster in an n- issue processor 2.Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor

45 7 Branch Prediction Schemes 1. 1-bit Branch-Prediction Buffer 2. 2-bit Branch-Prediction Buffer 3. Correlating Branch Prediction Buffer 4. Tournament Branch Predictor 5. Branch Target Buffer 6. Integrated Instruction Fetch Units 7. Return Address Predictors

46 Accuracy v. Size (SPEC89)

47 Special Case: Return Addresses Register-indirect branches are hard to predict the target address of In SPEC89, 85% of such branches are procedure returns Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate
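
A small sketch of such a return-address stack; the depth (16, within the 8-to-16 range mentioned) and the wrap-around overflow behavior are assumptions.

#include <stdint.h>

#define RAS_DEPTH 16

static uint32_t ras[RAS_DEPTH];
static unsigned ras_top;                    /* number of pushes so far */

/* On a predicted call: remember where the matching return should go. */
void ras_push(uint32_t return_pc) {
    ras[ras_top % RAS_DEPTH] = return_pc;   /* overflow overwrites the oldest entry */
    ras_top++;
}

/* On a predicted return: pop the most recent return address (0 = no prediction). */
uint32_t ras_predict_return(void) {
    if (ras_top == 0) return 0;
    ras_top--;
    return ras[ras_top % RAS_DEPTH];
}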

48 Avi Mendelson 3/2005 © 48 Basic MIPS  Arch, pipeline Adding a BTB to the Pipeline [Datapath figure: the basic MIPS pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB latches, register file, ALU, data memory) with a BTB added next to the instruction memory in the fetch stage. The BTB is looked up with the fetch PC in parallel with the instruction memory and supplies a predicted direction and predicted target; a mux selects the next PC from PC+4 (the not-taken target), the predicted target, or the taken target. A Mispredict Detection Unit compares the prediction against the resolved branch, flushes the pipeline on a wrong prediction, and allocates/updates the BTB with the branch address, target, and direction.]

49 Avi Mendelson 3/2005 © 49 Basic MIPS  Arch, pipeline Using The BTB [Flowchart, IF stage: the PC moves to the next instruction; the instruction memory fetches the new instruction and the IF/ID latch is loaded with it, while the BTB looks up the same PC. If the BTB hits and the branch is predicted taken, PC ← predicted address and the IF/ID latch is loaded with the predicted-path instruction; otherwise PC ← PC + 4 and the IF/ID latch is loaded with the sequential instruction. ID/EXE: the instruction is later checked to see whether it really is a branch.]

50 Avi Mendelson 3/2005 © 50 Basic MIPS  Arch, pipeline Using The BTB (cont.) [Flowchart, EXE/MEM/WB stages: if the instruction is a branch, calculate the branch condition and target; if the prediction was correct, update the BTB and continue; otherwise flush the pipe, update the PC, and load the IF/ID latch with the correct instruction. Non-branch instructions simply continue through MEM and WB.]

51 Dynamic Branch Prediction Summary Prediction becoming important part of scalar execution Branch History Table: 2 bits for loop accuracy Correlation: Recently executed branches correlated with next branch. –Either different branches –Or different executions of same branches Tournament Predictor: more resources to competitive solutions and pick between them Branch Target Buffer: include branch address & prediction Predicated Execution can reduce number of branches, number of mispredicted branches Return address stack for prediction of indirect jump

52 Conclusion 1985-2000: 1000X performance –Moore's Law transistors/chip => Moore's Law for Performance/MPU Hennessy: the industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore's Law to get 1.55X/year –Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, … ILP limits: to make performance progress in the future, we need explicit parallelism from the programmer vs. the implicit parallelism of ILP exploited by the compiler and HW –Otherwise drop to the old rate of 1.3X per year? –Less than 1.3X because of the processor-memory performance gap? Impact on you: if you care about performance, better to think about explicitly parallel algorithms vs. relying on ILP

53 Sample Question
Intel engineers ask you to compute by what percentage the CPU's work efficiency would improve if the Branch Prediction unit improved its prediction accuracy by one percent and by two percent. To do so, answer the following questions. All calculations are based on the following assumptions:
– One in five (1:5) machine instructions is a conditional branch
– The cost of a wrong prediction is 12 stalls (assume there are no stalls other than those caused by wrong predictions)
– Before the improvement, the Branch Prediction Unit predicts correctly 95% of the branches
(4%) What percentage of the cycles are stalls when the unit predicts 95% of the conditional branches correctly (over a long run, what percentage of the clock cycles will be stalls)?
(2%) What percentage of the cycles are stalls when the unit predicts 96% of the conditional branches correctly?
(2%) What percentage of the cycles are stalls when the unit predicts 97% of the conditional branches correctly?
Explain your calculations. What about a penalty of 25 stalls?
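
A hedged sketch of one way to set up the calculation the question asks for, assuming a base CPI of 1 so that the stall percentage is stall cycles divided by (useful cycles + stall cycles). The formula and this interpretation are my reading of the question, not an official solution.

#include <stdio.h>

/* stall fraction = (branch_freq * (1 - P) * penalty)
 *                / (1 + branch_freq * (1 - P) * penalty)   [base CPI of 1 assumed] */
int main(void) {
    const double branch_freq = 1.0 / 5.0;
    const double penalties[] = { 12.0, 25.0 };
    const double rates[]     = { 0.95, 0.96, 0.97 };

    for (int j = 0; j < 2; j++)
        for (int i = 0; i < 3; i++) {
            double stall = branch_freq * (1.0 - rates[i]) * penalties[j];
            printf("penalty %2.0f, P = %.2f -> stalls = %.1f%% of cycles\n",
                   penalties[j], rates[i], 100.0 * stall / (1.0 + stall));
        }
    return 0;
}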

54 Virtually Addressed Caches On MIPS R2000, the TLB translated the virtual address to a physical address before the address was sent to the cache => physically addressed cache. Another approach is to have the virtual address be sent directly to the cache=> virtually addressed cache –Avoids translation if data is in the cache –If data is not in the cache, the address is translated by a TLB/page table before going to main memory. –Often requires larger tags –Can result in aliasing, if two virtual addresses map to the same location in physical memory. With a virtually indexed cache, the tag gets translated into a physical address, while the rest of the address is accessing the cache.

55 Virtually Addressed Cache Only requires address translation on a cache miss! Synonym problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address! A nightmare for updates: must update all cache entries with the same physical address or memory becomes inconsistent. Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits. [Figure: the CPU sends the virtual address (VA) to the cache; on a hit, data returns directly; on a miss, the address goes through translation to a physical address (PA) before accessing main memory.]

56 Memory Protection With multiprogramming, a computer is shared by several programs or processes running concurrently –Need to provide protection –Need to allow sharing Mechanisms for providing protection –Provide both user and supervisor (operating system) modes –Provide CPU state that the user can read, but cannot write »user/supervisor bit, page table pointer, and TLB –Provide a method to go from user to supervisor mode and vice versa »system call or exception: user to supervisor »system or exception return: supervisor to user –Provide permissions for each page in memory –Store page tables in the operating system's address space - they can't be accessed directly by the user.

57 Handling TLB Misses and Page Faults When a TLB miss occurs, either –the page is present in memory and we just update the TLB »occurs if the valid bit of the page table entry is set –or the page is not present in memory and the O.S. gets control to handle a page fault If a page fault occurs, the operating system –Accesses the page table to determine the physical location of the page on disk –Chooses a physical page to replace - if the replaced page is dirty it is written to disk –Reads the page from disk into the chosen physical page in main memory. Since the disk access takes so long, another process is typically allowed to run during a page fault.

58 Pitfall: Address space too small One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address –the address size limits the size of virtual memory –it is difficult to change, since many components depend on it (e.g., PC, registers, effective-address calculations) As program sizes increase, larger and larger address sizes are needed – 8 bit: Intel 8080 (1975) –16 bit: Intel 8086 (1978) –24 bit: Intel 80286 (1982) –32 bit: Intel 80386 (1985) –64 bit: Intel Merced (1998)

59 Virtual Memory Summary Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk). Page tables and TLBs are used to translate the virtual address to a physical address The large miss penalty of virtual memory leads to different strategies from caches –Fully associative –LRU or LRU approximation –Write-back –Done by software

