OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview


1 OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
Chris Gilmore Portland State University/OMSE

2 Today Caches DLX Assembly CPU Overview

3 Computer System (Idealized)
Components (idealized): CPU, memory, disk, and disk controller, connected by a system bus. Today we visit the disk and try to work our way back to the CPU. Today's lecture is mostly not represented in textbooks, so pay attention and ask questions!

4 The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output. Next topic: simple caching techniques, and the many ways to improve cache performance. So where are we in the overall scheme of things? Well, we just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath.

5 Recap: Levels of the Memory Hierarchy
Upper level (faster) to lower level (larger): Processor, Cache, Memory, Disk, Tape. Data moves between adjacent levels as instructions/operands (processor-cache), blocks (cache-memory), pages (memory-disk), and files (disk-tape).

6 Recap: exploit locality to achieve fast memory
Two Different Types of Locality: Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon. By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time. The key idea of memory system design is to present the user with as much memory as possible in the cheapest technology while, by taking advantage of the principle of locality, creating the illusion that the average access time is close to that of the fastest technology. A small example of both kinds of locality is sketched below.
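To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array size and the loops are arbitrary illustrations): the running sum is re-referenced on every iteration (temporal locality), and the array elements are read from consecutive addresses (spatial locality), so each cache block that gets fetched is fully used.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    long sum = 0;                  /* touched every iteration: temporal locality */

    for (int i = 0; i < N; i++)
        a[i] = i;                  /* sequential writes: spatial locality */

    for (int i = 0; i < N; i++)
        sum += a[i];               /* sequential reads: each fetched block is fully used */

    printf("sum = %ld\n", sum);
    return 0;
}
```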

7 Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X). Hit Rate: the fraction of memory accesses found in the upper level. Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss. Miss: data needs to be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor. Hit Time << Miss Penalty. A HIT is when the data the processor wants to access is found in the upper level (Blk X); the fraction of memory accesses that hit is the hit rate. Hit time is the time to access the upper level where the data is found: (a) the time to access this level, and (b) the time to determine whether this is a hit or a miss. If the data the processor wants cannot be found in the upper level, we have a miss and need to retrieve the data (Blk Y) from the lower level. By definition, the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (bringing Blk Y into the upper level) and (b) the time it takes to deliver the new block to the processor. It is very important that your hit time be much, much smaller than your miss penalty; otherwise there is no reason to build a memory hierarchy.

8 The Art of Memory System Design
Workload or benchmark programs drive a processor reference stream <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . . (where op is i-fetch, read, or write) into the memory system. Optimize the memory system organization to minimize the average memory access time for typical workloads. Idea of a cache: keep copies of the data we use close to the processor; it is reactive: as we reference data, we bring it into the cache.

9 Example: Fully Associative
Fully Associative Cache: no cache index. Compare the cache tags of all cache entries in parallel. Example: with 32 B blocks, we need N 27-bit comparators. By definition, Conflict Miss = 0 for a fully associative cache. (Address split: bits 31-5 are the Cache Tag, 27 bits long; bits 4-0 are the Byte Select, e.g. 0x01. Each entry holds a Valid Bit, Cache Tag, and Cache Data.) We just store all the upper bits of the address (except the byte select) as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the entry that matches is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive; usually a fully associative cache is limited to 64 or fewer entries. Two types of misses remain: compulsory misses and capacity misses. Assume we have 64 entries: the first 64 items we access can fit, but when we try to bring in the 65th item, we need to throw one of them out to make room for the new item. That is a capacity miss.

10 Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2**N byte cache: the uppermost (32 - N) bits are always the Cache Tag, and the lowest M bits are the Byte Select (Block Size = 2**M). (Address split for this example: bits 31-10 are the Cache Tag, e.g. 0x50, stored as part of the cache "state" alongside the Valid Bit and the Cache Data; bits 9-5 are the Cache Index, e.g. 0x01; bits 4-0 are the Byte Select, e.g. 0x00.) While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end. Let's use a specific example with realistic numbers: assume we have a 1 KB direct mapped cache with a block size equal to 32 bytes. In other words, each block associated with a cache tag holds 32 bytes. With a block size of 32 bytes, the 5 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10 bits, or 22 bits, of the address are stored as the cache tag. The remaining address bits in the middle, bits 5 through 9, are used as the cache index to select the proper cache entry. The sketch below walks through this address split in code.
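As a quick check of the field widths above, here is a small C sketch (not part of the original slides; the example address is made up) that splits a 32-bit address for this hypothetical 1 KB direct mapped cache with 32-byte blocks: 5 byte-select bits, 5 index bits, 22 tag bits.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x0000A041;                 /* arbitrary example address            */
    uint32_t byte_select = addr & 0x1F;         /* bits [4:0]: block size 32 B = 2^5    */
    uint32_t index       = (addr >> 5) & 0x1F;  /* bits [9:5]: 1 KB / 32 B = 32 blocks  */
    uint32_t tag         = addr >> 10;          /* bits [31:10]: stored with the block  */

    printf("tag = 0x%x, index = %u, byte select = %u\n", tag, index, byte_select);
    return 0;
}
```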

11 Set Associative Cache : : : :
N-way set associative: N entries for each cache index; N direct mapped caches operate in parallel. Example: two-way set associative cache. The cache index selects a "set" from the cache; the two tags in the set are compared to the input in parallel; data is selected based on the tag result. This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct mapped caches working in parallel. This is how it works: the cache index selects a set from the cache, and the two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss; otherwise we have a cache hit and select the data on the side where the tag match occurs. Simple enough; what are its disadvantages? (Diagram: valid bits, cache tags, and cache data for both ways; two comparators, a mux controlled by Sel0/Sel1, an OR gate producing Hit, and the selected Cache Block.)
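Here is a sketch of the lookup described above for a small 2-way set associative cache, written in C (the structure layout, sizes, and function names are invented for illustration; real hardware compares both ways truly in parallel, which appears here as two back-to-back comparisons).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5                       /* 32-byte blocks */
#define SET_BITS   4                       /* 16 sets        */
#define NUM_SETS   (1u << SET_BITS)

struct way { bool valid; uint32_t tag; uint8_t data[1 << BLOCK_BITS]; };
struct set { struct way way[2]; };

static struct set cache[NUM_SETS];

/* Returns a pointer to the hit block's data, or NULL on a miss. */
static uint8_t *lookup(uint32_t addr) {
    uint32_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + SET_BITS);
    struct set *s  = &cache[index];

    if (s->way[0].valid && s->way[0].tag == tag) return s->way[0].data;
    if (s->way[1].valid && s->way[1].tag == tag) return s->way[1].data;
    return NULL;                           /* miss: fetch from the next lower level */
}

int main(void) {
    cache[3].way[1].valid = true;          /* pretend one block is already resident */
    cache[3].way[1].tag   = 0x12;
    printf("hit? %s\n", lookup((0x12u << 9) | (3u << 5)) ? "yes" : "no");
    return 0;
}
```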

12 Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache: N comparators vs. 1; extra MUX delay for the data; data comes AFTER the hit/miss decision and set selection. In a direct mapped cache, the cache block is available BEFORE hit/miss: it is possible to assume a hit and continue, then recover later on a miss. First, an N-way set associative cache needs N comparators instead of just one. It is also slower than a direct mapped cache because of the extra multiplexer delay. Finally, for an N-way set associative cache the data is available only AFTER the hit/miss signal becomes valid, because hit/miss is needed to control the data MUX. For a direct mapped cache, the cache block is available BEFORE the hit/miss signal (the AND gate output) because the data does not have to go through the comparator. This can be an important consideration: the processor can go ahead and use the data without knowing whether it is a hit or a miss, simply assuming a hit. Since cache hit rates are in the upper 90% range, you are ahead of the game most of the time, and for the small fraction of the time you are wrong, you just need to be able to recover. You cannot play this speculation game with an N-way set associative cache because, as noted, the data is not available until the hit/miss signal is valid.

13 Increased Miss Penalty
Block Size Tradeoff: a larger block size takes advantage of spatial locality, BUT a larger block size also means a larger miss penalty (it takes longer to fill the block), and if the block size is too big relative to the cache size, the miss rate goes up (too few cache blocks). In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate. As I said earlier, block size is a tradeoff. In general, a larger block size reduces the miss rate because it takes advantage of spatial locality. But miss rate is NOT the only cache performance metric; you also have to worry about miss penalty. As you increase the block size, the miss penalty goes up because it takes longer to fill the block. Even looking at miss rate by itself (which you should not), a bigger block size does not always win: keeping the cache size constant, the miss rate drops rapidly at first due to spatial locality, but past a certain point it actually goes up. As a result of these two curves, the Average Access Time (the more important performance metric) initially goes down because the miss rate drops much faster than the miss penalty rises, but eventually it can rise rapidly because both the miss penalty and the miss rate are increasing. (Chart annotations: exploits spatial locality; increased miss penalty and miss rate; fewer blocks compromises temporal locality; x-axis: block size.) A worked use of the average-access-time formula follows below.
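Plugging hypothetical numbers into the average-access-time formula from this slide (only the formula is from the lecture; the 1-cycle hit time, 5% miss rate, and 40-cycle miss penalty are made up for illustration):

```c
#include <stdio.h>

int main(void) {
    double hit_time = 1.0;          /* cycles                    */
    double miss_rate = 0.05;
    double miss_penalty = 40.0;     /* cycles to fetch the block */

    double avg = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
    printf("average access time = %.2f cycles\n", avg);   /* 0.95 + 2.00 = 2.95 */
    return 0;
}
```

Doubling the block size would typically raise the miss penalty and, past some point, the miss rate as well, which is exactly the tradeoff the curves on the slide describe.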

14 A Summary on Sources of Cache Misses
Compulsory (cold start, process migration, first reference): the first access to a block. A "cold" fact of life: not a whole lot you can do about it. Note: if you are going to run "billions" of instructions, compulsory misses are insignificant. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity. Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size. Coherence (invalidation): another process (e.g., I/O) updates memory. Capacity misses are due to the cache simply not being large enough to contain all the blocks accessed by the program; the solution is simple: increase the cache size. To summarize the other types: compulsory misses are the misses we cannot avoid; they occur when we first start the program. Conflict misses are caused by multiple memory locations being mapped to the same cache location; the two ways to reduce them are to increase the cache size or to increase the associativity (for example, using a 2-way set associative cache instead of a direct mapped cache). But keep in mind that cache miss rate is only one part of the equation; you also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache misses we will not cover today: invalidation misses, caused by another process (such as I/O) updating main memory, so the cache must be flushed to avoid inconsistency between memory and cache.

15 Source of Cache Misses Quiz
Assume constant cost. Columns: Direct Mapped / N-way Set Associative / Fully Associative. Rows to fill in: Cache Size (Small, Medium, Big?), Compulsory Miss, Conflict Miss, Capacity Miss, Coherence Miss. Choices: Zero, Low, Medium, High, Same. Let me see whether you can help me fill out this table. All things being equal, which organization should we use if we want to build the biggest cache (Big, Medium, Small)? Answer: the direct mapped cache is the simplest, so we can make it the biggest; the fully associative cache is the most complex, so it will be the smallest. How about compulsory misses: which cache has the highest compulsory miss rate (High, Medium, Low)? Answer: the direct mapped cache, with the biggest cache size, will probably take the longest to warm up, so it will probably have the highest compulsory miss rate. But the good thing about compulsory misses is that if you are going to run billions of instructions, you really don't care about the initial misses before the cache is warm. Conflict misses: which has the lowest conflict miss rate? Answer: simple; by definition the fully associative cache has zero conflict misses, so you can't get any lower than that. For the direct mapped cache, each memory location can go to only ONE cache location, so its conflict misses will probably be highest. Capacity misses: whoever has the biggest cache wins in this category, so the direct mapped cache will have the lowest capacity miss rate while the fully associative cache will have the highest. Finally, coherence (invalidation) misses probably affect all three caches the same way.

16 Sources of Cache Misses Answer
Columns: Direct Mapped / N-way Set Associative / Fully Associative.
Cache Size: Big / Medium / Small
Compulsory Miss: Same / Same / Same
Conflict Miss: High / Medium / Zero
Capacity Miss: Low / Medium / High
Coherence Miss: Same / Same / Same
Note: if you are going to run "billions" of instructions, compulsory misses are insignificant. (This is a hidden slide.)

17 Recap: Four Questions for Caches and Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

18 Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative. Set-associative mapping = block number modulo number of sets. Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4).

19 Q2: How is a block found if it is in the upper level?
Block Address = Tag | Index, followed by the block offset. The index selects the set, the tag comparison identifies the block within the set, and the offset selects the data within the block. Lookup uses direct indexing (via index and block offset), tag compares, or a combination. Increasing associativity shrinks the index and expands the tag.

20 Q3: Which block should be replaced on a miss?
Easy for direct mapped. For set associative or fully associative: Random, FIFO, LRU (Least Recently Used), LFU (Least Frequently Used). Data-cache miss rates, LRU / Random, by associativity (2-way, 4-way, 8-way):
  16 KB:  5.2% / 5.7%,   - / 5.3%,   - / 5.0%
  64 KB:  1.9% / 2.0%,   - / 1.7%,   - / 1.5%
  256 KB: 1.15% / 1.17%, - / -,      - / -
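As a companion to the replacement policies listed above, here is a minimal C sketch of LRU victim selection for a single 4-way set, using a per-way "last used" timestamp (the data structure and the example values are invented; this is a common software approximation, while real hardware often uses pseudo-LRU bits instead).

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 4

struct way { uint32_t tag; uint64_t last_used; int valid; };

/* Pick the way to evict: any invalid way first, otherwise the least recently used. */
static int lru_victim(const struct way set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (set[w].last_used < set[victim].last_used) victim = w;
    }
    return victim;
}

int main(void) {
    struct way set[WAYS] = {
        {0x10, 7, 1}, {0x22, 3, 1}, {0x31, 9, 1}, {0x44, 5, 1}
    };
    printf("evict way %d\n", lru_victim(set));   /* way 1: smallest timestamp */
    return 0;
}
```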

21 Q4: What happens on a write?
Write through: the information is written to both the block in the cache and the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?). Pros and cons of each? WT: read misses cannot result in writes, and coherency is easier. WB: repeated writes to the same block do not each go to memory. WT is always combined with write buffers so writes don't wait for the lower-level memory.

22 Write Buffer for Write Through
(Diagram: Processor, Cache, Write Buffer, DRAM.) A write buffer is needed between the cache and memory. Processor: writes data into the cache and the write buffer. Memory controller: writes the contents of the buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4. Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle. Memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, i.e. write buffer saturation. Once the data is written into the write buffer (and assuming a cache hit), the CPU is done with the write; the memory controller then moves the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Note that this is frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait.

23 Write Buffer Saturation
(Diagram: Processor, Cache, Write Buffer, DRAM.) Store frequency (w.r.t. time) > 1 / DRAM write cycle: if this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it (CPU cycle time <= DRAM write cycle time). Solutions for write buffer saturation: use a write back cache, or install a second level (L2) cache (does this always work?). A memory system designer's nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time; we call this write buffer saturation. In that case it does NOT matter how big you make the write buffer: it will still overflow because you are simply feeding it faster than you can empty it. I have seen this happen in simulation, and when it happens your processor ends up running at DRAM cycle time, which is very, very slow. The first solution is to get rid of the write buffer and replace the write-through cache with a write-back cache. Another solution is to install a second-level cache between the write buffer and memory and make the second level write back. (Diagram: Processor, Cache, Write Buffer, L2 Cache, DRAM.)

24 Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 that causes a miss. Do we read in the rest of the block? Yes: Write Allocate. No: Write Not Allocate. (Address split as before: bits 31-10 Cache Tag, e.g. 0x00; bits 9-5 Cache Index, e.g. 0x00; bits 4-0 Byte Select, e.g. 0x00.) Let's look at our 1 KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x0 that causes a cache miss in our 1 KB direct mapped cache with 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ..., Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second: is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access it soon, but the type of access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway, and it is therefore common practice NOT to read in the rest of the block on a write miss. If you don't bring in the rest of the block (the more technical term is write not allocate), you need some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.

25 Impact on Cycle Time Average Memory Access time =
(Pipeline diagram with I-cache and D-cache; don't worry about it for now, we'll discuss it later.) Hit Time: directly tied to clock rate; increases with cache size; increases with associativity. Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty. Time = IC x CT x (ideal CPI + memory stalls), where IC = instruction count and CT = cycle time.

26 What happens on a Cache miss?
For in-order pipeline, 2 options: Freeze pipeline in Mem stage (popular early on: Sparc, R4000) IF ID EX Mem stall stall stall … stall Mem Wr IF ID EX stall stall stall … stall stall Ex Wr Use Full/Empty bits in registers + MSHR queue MSHR = “Miss Status/Handler Registers” (Kroft) Each entry in this queue keeps track of status of outstanding memory requests to one complete memory line. Per cache-line: keep info about memory address. For each word: register (if any) that is waiting for result. Used to “merge” multiple requests to one memory line New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline. Attempt to use register before result returns causes instruction to block in decode stage. Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures. Out-of-order pipelines already have this functionality built in… (load queues, etc).

27 Improving Cache Performance: 3 general options
Time = IC x CT x (ideal CPI + memory stalls) Average Memory Access time = Hit Time + (Miss Rate x Miss Penalty) = (Hit Rate x Hit Time) + (Miss Rate x Miss Time) 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.
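To see how memory stalls feed into execution time, here is the slide's Time = IC x CT x (ideal CPI + memory stalls) with hypothetical numbers (every value below is made up purely for illustration: 1.3 memory references per instruction, a 2% miss rate, and a 50-cycle miss penalty).

```c
#include <stdio.h>

int main(void) {
    double ic = 1e9;                    /* instruction count             */
    double ct = 1e-9;                   /* cycle time: 1 ns              */
    double ideal_cpi = 1.0;
    double mem_refs_per_instr = 1.3;    /* one fetch plus ~0.3 data refs */
    double miss_rate = 0.02;
    double miss_penalty = 50.0;         /* cycles                        */

    double stalls = mem_refs_per_instr * miss_rate * miss_penalty;   /* 1.3 cycles/instr */
    double time   = ic * ct * (ideal_cpi + stalls);                  /* 2.3 seconds      */
    printf("stalls/instr = %.2f, execution time = %.2f s\n", stalls, time);
    return 0;
}
```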

28 Improving Cache Performance
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

29 3Cs Absolute Miss Rate (SPEC92)
(Plot: 3Cs absolute miss rate for SPEC92, broken into conflict, capacity, and compulsory components; the compulsory misses are vanishingly small.)

30 2:1 Cache Rule: the miss rate of a 1-way associative cache of size X = the miss rate of a 2-way associative cache of size X/2. (Plot shows the conflict-miss component.)

31 3Cs Relative Miss Rate (plot of the conflict component, relative scale)
Flaws: for fixed block size. Good: insight => invention.

32 1. Reduce Misses via Larger Block Size

33 2. Reduce Misses via Higher Associativity
2:1 Cache Rule: the miss rate of a DM (direct mapped) cache of size N is roughly the miss rate of a 2-way cache of size N/2. Beware: execution time is the only final measure! Will clock cycle time increase? Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache, +2% for an internal cache.

34 Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT (clock cycle time) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, relative to the CCT of a direct mapped cache. (Table: AMAT by cache size in KB and associativity 1-way, 2-way, 4-way, 8-way; green entries mean AMAT is not improved by more associativity. AMAT = Average Memory Access Time.)

35 3. Reducing Misses via a “Victim Cache”
How do we combine the fast hit time of a direct mapped cache yet still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache. Used in Alpha and HP machines. (Diagram: the victim cache sits beside the main cache tags/data; each of its four entries holds a tag with comparator and one cache line of data, with a path to the next lower level in the hierarchy.)

36 4. Reducing Misses by Hardware Prefetching
E.g., instruction prefetching: the Alpha fetches 2 blocks on a miss; the extra block is placed in a "stream buffer"; on a miss, check the stream buffer. Works with data blocks too: Jouppi [1990] found that 1 data stream buffer caught 25% of the misses from a 4 KB cache, and 4 streams caught 43%. Palacharla & Kessler [1994], for scientific programs, found 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set associative caches. Prefetching relies on having extra memory bandwidth that can be used without penalty.

37 5. Reducing Misses by Software Prefetching Data
Data Prefetch Load data into register (HP PA-RISC loads) Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) Special prefetching instructions cannot cause faults; a form of speculative execution Issuing Prefetch Instructions takes time Is cost of prefetch issues < savings in reduced misses? Higher superscalar reduces difficulty of issue bandwidth

38 6. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software. Instructions: reorder procedures in memory so as to reduce conflict misses; use profiling to look at conflicts (using tools they developed). Data: Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays. Loop Interchange: change the nesting of loops to access data in the order it is stored in memory. Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables. Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows.

39 Improving Cache Performance (Continued)
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

40 0. Reducing Penalty: Faster DRAM / Interface
New DRAM Technologies RAMBUS - same initial latency, but much higher bandwidth Synchronous DRAM Better BUS interfaces CRAY Technique: only use SRAM

41 1. Reducing Penalty: Read Priority over Write on Miss
(Diagram: Processor, Cache, Write Buffer, DRAM.) A write buffer allows reads to bypass writes. Processor: writes data into the cache and the write buffer. Memory controller: writes the contents of the buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4. Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle. Memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, i.e. write buffer saturation. Secret: we never really write to memory directly; we write to the write buffer, and the memory controller moves its contents to memory behind the scenes (see the write-buffer discussion on the earlier slide).

42 1. Reducing Penalty: Read Priority over Write on Miss
Write-buffer issues: write through with write buffers can cause RAW conflicts between main memory reads on cache misses and buffered writes. If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000). Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue. Write back? On a read miss replacing a dirty block: normally, write the dirty block to memory and then do the read. Instead, copy the dirty block to a write buffer, then do the read, and then do the write; the CPU stalls less since it restarts as soon as the read is done.

43 2. Reduce Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU. Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first. Generally useful only with large blocks. Spatial locality is a problem: the CPU tends to want the next sequential word soon, so it is not clear how much early restart helps.

44 3. Reduce Penalty: Non-blocking Caches
Non-blocking cache or lockup-free cache allow data cache to continue to supply cache hits during a miss requires F/E bits on registers or out-of-order execution requires multi-bank memories “hit under miss” reduces the effective miss penalty by working during miss vs. ignoring CPU requests “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses Requires multiple memory banks (otherwise cannot support) Pentium Pro allows 4 outstanding memory misses

45 Value of Hit Under Miss for SPEC
(Chart: value of hit under miss for SPEC integer and floating-point programs, "hit under n misses".) AMAT = Average Memory Access Time. FP programs on average: AMAT drops with each additional outstanding miss allowed, down to 0.26. Integer programs on average: down to 0.19. Configuration: 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty.

46 4. Reduce Penalty: Second-Level Cache
Equations: AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1); Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2); so AMAT = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)). Definitions: Local miss rate = misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L2)). Global miss rate = misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1) x Miss Rate(L2)). The global miss rate is what matters. (Diagram: Processor, L1 Cache, L2 Cache.)
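A worked use of the two-level equations above, with hypothetical values (1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate, 100-cycle main-memory penalty; only the equations themselves come from the slide).

```c
#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_l1 = 0.05;      /* L1 hit time, L1 miss rate       */
    double hit_l2 = 10.0, miss_l2 = 0.20;      /* L2 hit time, local L2 miss rate */
    double mem_penalty = 100.0;                /* cycles to reach main memory     */

    double miss_penalty_l1 = hit_l2 + miss_l2 * mem_penalty;     /* 30 cycles              */
    double amat            = hit_l1 + miss_l1 * miss_penalty_l1; /* 2.5 cycles             */
    double global_l2_miss  = miss_l1 * miss_l2;                  /* 1% of all CPU accesses */

    printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
           amat, 100.0 * global_l2_miss);
    return 0;
}
```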

47 Reducing Misses: which apply to L2 Cache?
Reducing Miss Rate 1. Reduce Misses via Larger Block Size 2. Reduce Conflict Misses via Higher Associativity 3. Reducing Conflict Misses via Victim Cache 4. Reducing Misses by HW Prefetching Instr, Data 5. Reducing Misses by SW Prefetching Data 6. Reducing Capacity/Conf. Misses by Compiler Optimizations

48 L2 cache block size & A.M.A.T.
32KB L1, 8 byte path to memory

49 Improving Cache Performance (Continued)
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache: - Lower Associativity (+victim caching)? - 2nd-level cache - Careful Virtual Memory Design

50 Summary #1/ 3: The Principle of Locality:
A program is likely to access a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space. Three (+1) major categories of cache misses: Compulsory misses: sad facts of life; example: cold start misses. Conflict misses: increase cache size and/or associativity; nightmare scenario: the ping-pong effect! Capacity misses: increase cache size. Coherence misses: caused by external processors or I/O devices. Cache design space: total size, block size, associativity, replacement policy, write-hit policy (write-through, write-back), write-miss policy. To summarize: the memory hierarchy works because of the principle of locality, which says a program accesses a relatively small portion of the address space at any instant of time, in its two forms, temporal locality (locality in time) and spatial locality (locality in space). Compulsory misses are cache misses due to cold start; you cannot avoid them, but if you are going to run billions of instructions anyway, they usually don't bother you. Conflict misses are caused by multiple memory locations being mapped to the same cache location; the nightmare scenario is the ping-pong effect, where a block is read into the cache but is forced out by another conflict miss before we have a chance to use it. You can reduce conflict misses by increasing the cache size, increasing the associativity, or both. Capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program; you can reduce this miss rate by making the cache larger. Finally, there are two write policies: write through requires a write buffer, and the nightmare scenario is stores occurring so frequently that you saturate the write buffer; with write back you write only to the cache, and only when the cache block is replaced do you write it back to memory.

51 Summary #2 / 3: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation. The optimal choice is a compromise: it depends on access characteristics (workload; use as I-cache, D-cache, or TLB) and on technology and cost. Simplicity often wins. (Design-space sketch: cache size, associativity, block size, with good and bad regions along each factor.) No fancy replacement policy is needed for a direct mapped cache; as a matter of fact, that is what causes direct mapped caches trouble to begin with: there is only one place a block can go, which causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics show that after an engine failure, a pilot is more likely to get killed in a light twin than in a single-engine airplane. The joke among flight instructors is: sure, when the engine quits in a single-engine airplane you have one option: sooner or later, you land, probably sooner. But in a twin with one engine out you have a lot of options, and it is the need to make a decision that kills people.

52 Summary #3 / 3: Cache Miss Optimization
Lots of techniques people use to improve the miss rate of caches (MR = miss rate, MP = miss penalty, HT = hit time):
Larger Block Size: MR +, MP -, complexity 0
Higher Associativity: MR +, HT -, complexity 1
Victim Caches
Pseudo-Associative Caches
HW Prefetching of Instr/Data
Compiler Controlled Prefetching
Compiler Reduce Misses: MR +, complexity 0

53 Onto Assembler! What is assembly language?
A machine-specific programming language with a one-to-one correspondence between statements and native machine language; it matches the machine's instruction set and architecture.

54 What is an assembler? Systems Level Program
Usually works in conjunction with the compiler translates assembly language source code to machine language object file - contains machine instructions, initial data, and information used when loading the program listing file - contains a record of the translation process, line numbers, addresses, generated code and data, and a symbol table

55 Why learn assembly? Learn how a processor works
Understand basic computer architecture Explore the internal representation of data and instructions Gain insight into hardware concepts Allows creation of small and efficient programs Allows programmers to bypass high-level language restrictions Might be necessary to accomplish certain operations Setjmp/longjmp, buffer overflows

56 Machine Representation
A language of numbers, called the processor's instruction set: the set of basic operations a processor can perform. Each instruction is coded as a number; instructions may be one or more bytes; every number corresponds to an instruction.

57 Assembly vs Machine Machine Language Programming
Writing a list of numbers representing the bytes of machine instructions to be executed and data constants to be used by the program Assembly Language Programming Using symbolic instructions to represent the raw data that will form the machine language program and initial data constants

58 Assembly Mnemonics represent Machine Instructions
Each mnemonic used represents a single machine instruction The assembler performs the translation Some mnemonics require operands Operands provide additional information register, constant, address, or variable Assembler Directives

59 Instruction Set Architecture: a Critical Interface
software instruction set hardware Portion of the machine that is visible to the programmer or the compiler writer.

60 Good ISA Lasts through many implementations (portability, compatibility) Can be used for many different applications (generality) Provide convenient functionality to higher levels Permits an efficient implementation at lower levels

61 Von Neumann Machines Von Neumann “invented” stored program computer in 1945 Instead of program code being hardwired, the program code (instructions) is placed in memory along with data Control ALU Program Data

62 Basic ISA Classes Memory to Memory Machines
Every instruction contains a full memory address for each operand. Maybe the simplest ISA design; however, memory is slow, and memory is big (lots of address bits).

63 Memory-to-memory machine
Assumptions: two operands per operation; the first operand is also the destination; memory address = 16 bits (2 bytes); operand size = 32 bits (4 bytes); instruction code = 8 bits (1 byte). Example: A = B+C (hypothetical code): mov A, B ; A <= B, then add A, C ; A <= B+C. Each instruction takes 5 bytes, plus 4 bytes to fetch each source operand and 4 bytes to store the result, so add needs 17 bytes and mov needs 13 bytes: 30 bytes of memory traffic in total.

64 Why CPU Storage? A small amount of storage in the CPU Simplest Case
To reduce memory traffic by keeping repeatedly used operands in the CPU Avoid re-referencing memory Avoid having to specify full memory address of the operand This is a perfect example of “make the common case fast”. Simplest Case A machine with 1 cell of CPU storage: the accumulator

65 Accumulator Machine Assumptions Two operands per operation
1st operand in the accumulator 2nd operand in the memory accumulator is also the destination (except for store) Memory address = 16 bits (2 bytes) Operand size = 32 bits (4 bytes) Instruction code = 8 bits (1 byte) Example A = B+C (hypothetical code) Load B ; acc <= B Add C ; acc <= B+C Store A ; A <= acc 3 bytes for instruction 4 bytes to load or store the second operand 7 bytes per instruction 21 bytes total memory traffic
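The two byte counts above (30 bytes for the memory-to-memory version of A = B + C, 21 bytes for the accumulator version) follow directly from the stated assumptions; the small C program below just reproduces that arithmetic (it is an illustration of the counting, not anything from the slides).

```c
#include <stdio.h>

int main(void) {
    /* Assumptions from the slides: 1-byte opcode, 2-byte addresses, 4-byte operands. */

    /* Memory-to-memory: mov A,B ; add A,C */
    int mov_bytes = (1 + 2 + 2) + 4 + 4;        /* instruction + fetch B + store A = 13    */
    int add_bytes = (1 + 2 + 2) + 4 + 4 + 4;    /* instruction + fetch A, C + store A = 17 */

    /* Accumulator: load B ; add C ; store A -- each is 3 instruction bytes + 4 data bytes */
    int acc_bytes = 3 * ((1 + 2) + 4);          /* = 21 */

    printf("memory-to-memory: %d bytes, accumulator: %d bytes\n",
           mov_bytes + add_bytes, acc_bytes);
    return 0;
}
```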

66 Stack Machines Instruction sets are based on a stack model of execution. Aimed for compact instruction encoding Most instructions manipulate top few data items (mostly top 2) of a pushdown stack. Top few items of the stack are kept in the CPU Ideal for evaluating expressions (stack holds intermediate results) Were thought to be a good match for high level languages Awkward Become very slow if stack grows beyond CPU local storage No simple way to get data from “middle of stack”

67 Stack Machines Binary arithmetic and logic operations
Operands: top 2 items on stack Operands are removed from stack Result is placed on top of stack Unary arithmetic and logic operations Operand: top item on the stack Operand is replaced by result of operation Data move operations Push: place memory data on top of stack Pop: move top of stack to memory

68 General Purpose Register Machines
With stack machines, only the top two elements of the stack are directly available to instructions. In general purpose register machines, the CPU storage is organized as a set of registers which are equally available to the instructions Frequently used operands are placed in registers (under program control) Reduces instruction size Reduces memory traffic

69 General Purpose Registers Dominate
From 1975 to the present, all machines use general purpose registers. Advantages of registers: registers are faster than memory; registers are easier for a compiler to use (e.g., (A*B) - (C*D) - (E*F) can do the multiplies in any order); registers can hold variables, so memory traffic is reduced and the program speeds up (since registers are faster than memory), and code density improves (since a register is named with fewer bits than a memory location).

70 Classifying General Purpose Register Machines
General purpose register machines are sub-classified based on whether or not memory operands can be used by typical ALU instructions Register-memory machines: machines where some ALU instructions can specify at least one memory operand and one register operand Load-store machines: the only instructions that can access memory are the “load” and the “store” instructions DLX is a load/store machine

71 Comparing number of instructions
Code sequence for A = B+C for five classes of instruction sets: Register (Register-memory) load R1 B add R1 C store A R1 Register (Load-store) Load R1 B Load R2 C Add R1 R1 R2 Store A R1 Memory to Memory mov A B add A C Accumulator load B add C store A Stack push B push C add pop A DLX/MIPS is one of these

72 Instruction Set Definition
Objects = architecture entities = machine state Registers General purpose Special purpose (e.g. program counter, condition code, stack pointer) Memory locations Linear address space: 0, 1, 2, …,2^s -1 Operations = instruction types Data operation Arithmetic Logical Data transfer Move (from register to register) Load (from memory location to register) Store (from register to memory location) Instruction sequencing Branch (conditional) Jump (unconditional)

73 Topic: DLX Great links to learn more DLX:
Instructional Architecture Much nicer and easier to understand than x86 (barf) The Plan: Teach DLX, then move to x86/y86 DLX: RISC ISA, very similar to MIPS Great links to learn more DLX:

74 DLX Architecture Based on observations about instruction set architecture Emphasizes: Simple load-store instruction set Design for pipeline efficiency Design for compiler target DLX registers 32 32-bit GPRS named R0, R1, ..., R31 32 32-bit FPRs named F0, F2, ..., F30 Accessed independently for 32-bit data Accessed in pairs for 64-bit (double-precision) data Register R0 is hard-wired to zero Other status registers, e.g., floating-point status register Byte addressable in big-endian with 32-bit address Arithmetic instructions operands must be registers

75 MIPS: Software conventions for Registers
0: zero, constant 0
1: at, reserved for assembler
2-3: v0-v1, expression evaluation and function results
4-7: a0-a3, arguments
8-15: t0-t7, temporaries: caller saves (callee can clobber)
16-23: s0-s7, callee saves (callee must save)
24-25: t8-t9, temporaries (cont'd)
26-27: k0-k1, reserved for OS kernel
28: gp, pointer to global area
29: sp, stack pointer
30: fp, frame pointer
31: ra, return address (HW)

76 Addressing Modes This table shows the most common modes.
Mode; example instruction; meaning; when used:
Register; Add R4, R3; R[R4] <- R[R4] + R[R3]; when a value is in a register.
Immediate; Add R4, #3; R[R4] <- R[R4] + 3; for constants.
Displacement; Add R4, 100(R1); R[R4] <- R[R4] + M[100 + R[R1]]; accessing local variables.
Register deferred; Add R4, (R1); R[R4] <- R[R4] + M[R[R1]]; using a pointer or a computed address.
Absolute; Add R4, (1001); R[R4] <- R[R4] + M[1001]; used for static data.
DLX only does the first three, and only load/store can do displacement; register deferred is just displacement with offset 0, and absolute is just displacement off r0.

77 Memory Organization Viewed as a large, single-dimension array, with an address. A memory address is an index into the array "Byte addressing" means that the index points to a byte of memory. 8 bits of data 1 8 bits of data 2 8 bits of data 3 8 bits of data 4 8 bits of data 5 8 bits of data 6 8 bits of data

78 Memory Addressing Bytes are nice, but most data items use larger "words" For DLX, a word is 32 bits or 4 bytes. 2 questions for design of ISA: Since one could read a 32-bit word as four loads of bytes from sequential byte addresses or as one load word from a single byte address, How do byte addresses map to word addresses? Can a word be placed on any byte boundary? Usually can’t be put on a word boundary, saves wires! (Next slide)

79 Addressing Objects: Endianess and Alignment
Big Endian: the address of the most significant byte = the word address (xx00 = big end of word); IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA. Little Endian: the address of the least significant byte = the word address (xx00 = little end of word); Intel 80x86, DEC Vax, DEC Alpha (Windows NT). Alignment: require that objects fall on an address that is a multiple of their size. (Diagram: a word laid out little endian and big endian, with byte 0 holding the lsb or the msb respectively, and aligned vs. not-aligned placements.)
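A common C idiom (not from the slides) for checking at run time whether the machine you are on stores words little-endian or big-endian:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t word = 0x01020304;
    uint8_t *bytes = (uint8_t *)&word;   /* inspect the word one byte at a time */

    if (bytes[0] == 0x04)
        printf("little endian: byte 0 holds the least significant byte\n");
    else
        printf("big endian: byte 0 holds the most significant byte\n");
    return 0;
}
```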

80 Assembly Language vs. Machine Language
Assembly provides a convenient symbolic representation: much easier than writing down numbers, e.g., destination first. Machine language is the underlying reality, e.g., the destination is no longer first. Assembly can provide 'pseudoinstructions': e.g., "move r10, r11" exists only in assembly; it would be implemented using "add r10, r11, r0". When considering performance you should count real instructions.

81 Stored Program Concept
Instructions are bits Programs are stored in memory — to be read or written just like data Fetch & Execute Cycle Instructions are fetched and put into a special register Bits in the register "control" the subsequent actions Fetch the “next” instruction and continue Processor Memory memory for data, programs, compilers, editors, etc.

82 DLX arithmetic ALU instructions can have 3 operands
add $R1, $R2, $R3; sub $R1, $R2, $R3. Operand order is fixed (destination first). Example: C code: A = B + C. DLX code: add r1, r2, r3 (registers are associated with variables by the compiler).

83 DLX arithmetic Design Principle: simplicity favors regularity. Why?
Of course this complicates some things... C code: A = B + C + D; E = F - A; MIPS code: add r1, r2, r3 add r1, r1, r4 sub r5, r6, r1 Operands must be registers, only 32 registers provided Design Principle: smaller is faster Why?

84 Executing assembly instructions
Program counter holds the instruction address CPU fetches instruction from memory and puts it onto the instruction register Control logic decodes the instruction and tells the register file, ALU and other registers what to do An ALU operation (e.g. add) data flows from register file, through ALU and back to register file

85 ALU Execution Example

86 ALU Execution example

87 Memory Instructions Load and store instructions lw r11, offset(r10);
sw r11, offset(r10); Example: C code: A[8] = h + A[8]; assume h in r2 and base address of the array A in r3 DLX code: lw r4, 32(r3) add r4, r2, r4 sw r4, 32(r3) Store word has destination last Remember arithmetic operands are registers, not memory!

88 Memory Operations - Loads
Load data from memory lw R6, 0(R5) # R6 <= mem[0x14]

89 Memory Operations - Stores
Storing data to memory works essentially the same way sw R6 , 0(R5) R6 = 200; let’s assume R5 =0x18 mem[0x18] <-- 200

90 So far we’ve learned: DLX — loading words but addressing bytes — arithmetic on registers only Instruction Meaning add r1, r2, r3 r1 = r2 + r3 sub r1, r2, r3 r1 = r2 – r3 lw r1, 100(r2) r1 = Memory[r2+100] sw r1, 100(r2) Memory[r2+100] = r1

91 Use of Registers Example: a = ( b + c) - ( d + e) ; // C statement
# r1 – r5 : a - e add r10, r2, r3 add r11, r4, r5 sub r1, r10, r11 a = b + A[4]; // add an array element to a var // r3 has address A lw r4, 16(r3) add r1, r2, r4

92 Use of Registers : load and store
Example A[8]= a + A[6] ; // A is in r3, a is in r2 lw r1, 24(r3) # r1 gets A[6] contents add r1, r2, r1 # r1 gets the sum sw r1, 32(r3) # sum is put in A[8]

93 load and store Ex: a = b + A[i]; // A is in r3, a,b, i in // r1, r2, r4 add r11, r4, r4 # r11 = 2 * i add r11, r11, r11 # r11 = 4 * i add r11, r11, r3 # r11 =addr. of A[i] (r3+(4*i)) lw r10, 0(r11) # r10 = A[i] add r1, r2, r10 # a = b + A[i]

94 Example: Swap Swapping words r2 has the base address of the array v
lw r10, 0(r2) lw r11, 4(r2) sw r10, 4(r2) sw r11, 0(r2) temp = v[0] v[0] = v[1]; v[1] = temp;

95 DLX Instruction Format
I-type R-type J-type I-type Instructions R-type Instructions Briefly, there are 3 types of instructions, and we’ll see how we fit all our instructions into these 3 types J-type Instructions

96 Machine Language Instructions, like registers and words of data, are also 32 bits long Example: add r10, r1, r2 registers have numbers, 10, 1, 2 Instruction Format: R-type Instructions 0 opcode for ALU, only lowest 6 bits of func used, means add

97 Machine Language Consider the load-word and store-word instructions, What would the regularity principle have us do? New principle: Good design demands a compromise Introduce a new type of instruction format I-type for data transfer instructions other format was R-type for register Example: lw r10, 32(r2) I-type Instructions (Loads/stores)

98 Machine Language: Jump instructions
Example: j .L1. J-type Instructions (Jump, Jump and Link, Trap, return from exception): the opcode (000010) followed by the offset to .L1.

99 DLX Instruction Format
I-type R-type J-type I-type Instructions R-type Instructions J-type Instructions

100 Instructions for Making Decisions
beq reg1, reg2, L1: go to the statement labeled L1 if the value in reg1 equals the value in reg2. bne reg1, reg2, L1: go to the statement labeled L1 if the value in reg1 does not equal the value in reg2. j L1: unconditional jump. jr r10: "jump register"; jump to the instruction address held in register r10.

101 Making Decisions if ( a != b) goto L1; // x,y,z,a,b mapped to r1-r5
Example if ( a != b) goto L1; // x,y,z,a,b mapped to r1-r5 x = y + z; L1 : x = x – a; bne r4, r5, L1 # goto L1 if a != b add r1, r2, r3 # x = y + z (ignored if a=b) L1:sub r1, r1, r4 # x = x – a (always ex) Always ex = always executed

102 if-then-else Example: if ( a==b) x = y + z; else x = y – z ;
bne r4, r5, Else # goto Else if a!=b add r1, r2, r3 # x = y + z j Exit # goto Exit Else : sub r1,r2,r3 # x = y – z Exit :

103 Example: Loop with array index
Loop: g = g + A [i]; i = i + j; if (i != h) goto Loop r1, r2, r3, r4 = g, h, i, j, array base = r5 LOOP: add r11, r3, r3 #r11 = 2 * i add r11, r11, r11 #r11 = 4 * i add r11, r11, r5 #r11 = adr. Of A[i] lw r10, 0(r11) #load A[i] add r1, r1, r10 #g = g + A[i] add r3, r3, r4 #i = i + j bne r3, r2, LOOP

104 Other decisions Set R1 on R2 less than R3 slt R1, R2, R3
Compares two registers, R2 and R3: R1 = 1 if R2 < R3, else R1 = 0 if R2 >= R3. Example: slt r11, r1, r2. Branch if less than, example: if (A < B) goto LESS: slt r11, r1, r2 # r11 = 1 if A < B, then bne r11, $0, LESS. slt = set on less than.

105 Loops Example : while ( A[i] == k ) // i,j,k in r3. r4, r5
i = i + j; // A is in r6 Loop: sll r11, r3, 2 # r11 = 4 * i add r11, r11, r6 # r11 = addr. of A[i] lw r10, 0(r11) # r10 = A[i] bne r10, r5, Exit # goto Exit if A[i]!=k add r3, r3, r4 # i = i + j j Loop # goto Loop Exit: sll = shift left logical

106 Addresses in Branches and Jumps
Instructions: bne r14,r15,Label Next instruction is at Label if r14≠r15 beq r14,r15,Label Next instruction is at Label if r14=r15 j Label Next instruction is at Label Formats: Addresses are not 32 bits — How do we handle this with large programs? — First idea: limitation of branch space to the first 216 bits I J op rs rt 16 bit address op bit address

107 Addresses in Branches Instructions: Formats:
bne r14,r15,Label Next instruction is at Label if r14≠r15 beq r14,r15,Label Next instruction is at Label if r14=r15 Formats: Treat the 16 bit number as an offset to the PC register – PC-relative addressing Word offset instead of byte offset, why?? most branches are local (principle of locality) Jump instructions just use the high order bits of PC – Pseudodirect addressing 32-bit jump address = 4 Most Significant bits of PC concatenated with 26-bit word address (or 28- bit byte address) Address boundaries of 256 MB op rs rt 16 bit address I
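The two target calculations described on this slide can be sketched in C as follows (the PC value and offsets are hypothetical; field widths follow the slide: a sign-extended 16-bit word offset for branches, a 26-bit word address for jumps, with the upper 4 bits taken from the PC).

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t pc = 0x00400018;                        /* hypothetical branch/jump address */

    /* PC-relative branch: sign-extended 16-bit word offset, scaled to bytes. */
    int16_t  branch_off_words = -3;
    uint32_t branch_target = (pc + 4) + (int32_t)branch_off_words * 4;

    /* Pseudodirect jump: upper 4 bits of the PC, then the 26-bit word address as bytes. */
    uint32_t jump_field  = 0x0100002;                /* 26-bit word address from the instruction */
    uint32_t jump_target = (pc & 0xF0000000u) | (jump_field << 2);

    printf("branch target = 0x%08x, jump target = 0x%08x\n", branch_target, jump_target);
    return 0;
}
```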

108 Conditional Branch Distance
• 65% of integer branches are 2 to 4 instructions

109 Conditional Branch Addressing
PC-relative since most branches are relatively close to the current PC At least 8 bits suggested (128 instructions) Compare Equal/Not Equal most important for integer programs (86%)

110 PC-relative addressing
For larger distances: Jump register jr required.

111 Example LOOP: mult $9, $19, $10 # R9 = R19*R10 lw $8, 1000($9) # R8 bne $8, $21, EXIT add $19, $19, $20 #i = i + j j LOOP EXIT: ... Assume address of LOOP is 0x8000 Offset 0 = PC + 4 2 0x8000

112 Procedure calls Procedures or subroutines:
Needed for structured programming Steps followed in executing a procedure call: Place parameters in a place where the procedure (callee) can access them Transfer control to the procedure Acquire the storage resources needed for the procedure Perform desired task Place results in a place where the calling program (caller) can access them Return control to the point of origin

113 Resources Involved Registers used for procedure calling:
$a0 - $a3 : four argument registers in which to pass parameters $v0 - $v1 : two value registers in which to return values r31 : one return address register to return to the point of origin Transferring the control to the callee: jal ProcedureAddress jump-and-link to the procedure address the return address (PC+4) is saved in r31 Example: jal 20000 Returning the control to the caller: jr r31 instruction following jal is executed next Jr = jump register

114 Memory Stacks Useful for stacked environments/subroutine call & return even if operand stack not part of architecture Stacks that Grow Up vs. Stacks that Grow Down: High address inf. Big 0 Little a grows up grows down Memory Addresses SP b c 0 Little inf. Big Low address

115 Calling conventions int func(int g, int h, int i, int j) { int f;
f = ( g + h ) - ( i + j ) ; return ( f ); } // g,h,i,j - $a0,$a1,$a2,$a3, f in r8 func : addi $sp, $sp, -12 # make room in stack for 3 words sw r11, 8($sp) # save the regs we want to use sw r10, 4($sp) sw r8, 0($sp) add r10, $a0, $a1 # r10 = g + h add r11, $a2, $a3 # r11 = i + j sub r8, r10, r11 # r8 has the result add $v0, r8, r0 # return reg $v0 has f

116 Calling (cont.) lw r10, 4($sp) # restore r10
addi $sp, $sp, 12 # restore sp jr $ra we did not have to restore r10-r19 (caller save) we do need to restore r1-r8 (must be preserved by callee)

117 Nested Calls Stacking of Subroutine Calls & Returns and Environments:
(Diagram: the stack of environments grows and shrinks as A calls B, B calls C, and each call returns.) Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM); sometimes stacks are implemented via software convention.

118 Compiling a String Copy Proc.
void strcpy ( char x[ ], y[ ]) { int i=0; while ( x[ i ] = y [ i ] != 0) i ++ ; } // x and y base addr. are in $a0 and $a1 strcpy : addi $sp, $sp, -4 # reserve 1 word of space in stack sw r8, 0($sp) # save r8 add r8, $zero, $zero # i = 0 L1 : add r11, $a1, r8 # addr. of y[ i ] in r11 lb r12, 0(r11) # r12 = y[ i ] add r13, $a0, r8 # addr. of x[ i ] in r13 sb r12, 0(r13) # x[ i ] = y [ i ] beq r12, $zero, L2 # if y [ i ] = 0 goto L2 addi r8, r8, 1 # i ++ j L1 # go to L1 L2 : lw r8, 0($sp) # restore r8 addi $sp, $sp, 4 # restore $sp jr $ra # return

119 IA - 32 1978: The Intel 8086 is announced (16 bit architecture) 1980: The 8087 floating point coprocessor is added 1982: The increases address space to 24 bits, +instructions 1985: The extends to 32 bits, new addressing modes : The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance) 1997: 57 new “MMX” instructions are added, Pentium II 1999: The Pentium III added another 70 instructions (SSE) 2001: Another 144 instructions (SSE2) 2003: AMD extends the architecture to increase address space to 64 bits, widens all registers to 64 bits and other changes (AMD64) 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions “This history illustrates the impact of the “golden handcuffs” of compatibility “adding new features as someone might add clothing to a packed bag” “an architecture that is difficult to explain and impossible to love”

120 IA-32 Overview Complexity: Saving grace:
Instructions from 1 to 17 bytes long one operand must act as both a source and destination one operand can come from memory complex addressing modes e.g., “base or scaled index with 8 or 32 bit displacement” Saving grace: the most frequently used instructions are not too difficult to build compilers avoid the portions of the architecture that are slow “what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective”

121 IA32 Registers Oversimplified Architecture Fun fact:
Four 32-bit general purpose registers: eax, ebx, ecx, edx. al is the name for "the lower 8 bits of eax". Stack pointer: esp. Fun fact: once upon a time, x86 was only a 16-bit CPU; when they upgraded x86 to 32 bits, they added an "e" in front of every register name and called it "extended".

122 Intel 80x86 Integer Registers
GPR0 EAX: accumulator
GPR1 ECX: count register; string and loop operations
GPR2 EDX: data register; multiply, divide
GPR3 EBX: base address register
GPR4 ESP: stack pointer
GPR5 EBP: base pointer, for the base of the stack segment
GPR6 ESI: index register
GPR7 EDI: index register
CS: code segment pointer; SS: stack segment pointer; DS: data segment pointer; ES: extra data segment pointer; FS: data seg. 2; GS: data seg. 3
PC EIP: instruction counter
EFLAGS: condition codes

123 x86 Assembly mov <dest>, <src>
Move the value from <src> into <dest> Used to set initial values add <dest>, <src> Add the value from <src> to <dest> sub <dest>, <src> Subtract the value from <src> from <dest>

124 x86 Assembly push <target>
Push the value in <target> onto the stack Also decrements the stack pointer, ESP (remember stack grows from high to low) pop <target> Pops the value from the top of the stack, put it in <target> Also increments the stack pointer, ESP

125 x86 Assembly jmp <address> Jump to an instruction (like goto)
Change the EIP to <address>. call <address>: a function call; pushes the address of the next instruction (the return address) onto the stack, and jumps to <address>.

126 x86 Assembly lea <dest>, <src>
Load Effective Address of <src> into register <dest>. Used for pointer arithmetic (not actual memory reference) int <value> interrupt – hardware signal to operating system kernel, with flag <value> int 0x80 means “Linux system call”

127 x86 Assembly Condition Codes:
Condition codes: CF: Carry Flag, overflow detection (unsigned). ZF: Zero Flag. SF: Sign Flag. OF: Overflow Flag, overflow detection (signed). Condition codes are usually accessed through conditional branches (not directly).

128 Interrupt convention: int 0x80 is the system call interrupt
eax: system call number (e.g., 1 = exit, 2 = fork, 3 = read, 4 = write); ebx: argument #1; ecx: argument #2; edx: argument #3

129 CISC vs RISC RISC = Reduced Instruction Set Computer (DLX)
CISC = Complex Instruction Set Computer (x86) Both have their advantages. Will go into more detail next lecture

130 RISC Not very many instructions
All instructions are the same length, in both execution time and bit length. Results in simpler CPUs (easier to optimize). Usually takes more instructions to express a useful operation.

131 CISC Very many instructions (x86 instruction set over 700 pages)
Variable length instructions. More complex CPUs. Usually takes fewer instructions to express a useful operation (lower memory requirements).

132 Some links Great stuff on x86 asm:

133 Instruction Architecture Ideas
If code size is most important, use variable length instructions If simplicity is most important, use fixed length instructions Simpler ISA’s are easier to optimize! Recent embedded machines (ARM, MIPS) added optional mode to execute subset of 16-bit wide instructions (Thumb, MIPS16); per procedure decide performance or density

134 Summary Instruction complexity is only one variable Design Principles:
lower instruction count vs. higher CPI / lower clock rate Design Principles: simplicity favors regularity smaller is faster good design demands compromise make the common case fast Instruction set architecture a very important abstraction indeed!

