
Memory Hierarchy Design




1 Memory Hierarchy Design
Unit IV: Memory Hierarchy Design

2 Cache Performance A better measure of memory hierarchy performance is the average memory access time:
Average memory access time = Hit time + Miss rate * Miss penalty,
where hit time is the time to hit in the cache. The components of average memory access time can be measured either in absolute time or in the number of clock cycles that the CPU waits for memory.
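As a quick check of the formula, here is a minimal C sketch; the numbers (1-cycle hit time, 5% miss rate, 100-cycle miss penalty) are illustrative assumptions, not figures from the slides.

#include <stdio.h>

int main(void) {
    double hit_time = 1.0;        /* cycles to hit in the cache (assumed) */
    double miss_rate = 0.05;      /* fraction of accesses that miss (assumed) */
    double miss_penalty = 100.0;  /* cycles to service a miss (assumed) */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("Average memory access time = %.1f cycles\n", amat);  /* prints 6.0 */
    return 0;
}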

3 Average Memory Access time and processor performance
An obvious question is whether average memory access time due to cache misses predicts processor performance. i) There are other reasons for stalls, such as contention due to I/O devices using memory; designers often assume that stalls caused by the memory hierarchy dominate these other sources. ii) The answer also depends on the CPU. For an in-order CPU that stalls during misses, the memory stall time is strongly correlated with average memory access time:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) * Clock cycle time.
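A small C sketch of the CPU time equation; the cycle counts and clock period below are assumed values chosen only to make the arithmetic concrete.

#include <stdio.h>

int main(void) {
    double exec_cycles  = 1.0e9;   /* CPU execution clock cycles (assumed) */
    double stall_cycles = 4.0e8;   /* memory stall clock cycles (assumed) */
    double cycle_time   = 0.5e-9;  /* 0.5 ns clock cycle (assumed) */

    double cpu_time = (exec_cycles + stall_cycles) * cycle_time;
    printf("CPU time = %.3f s\n", cpu_time);  /* prints 0.700 s */
    return 0;
}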

4 Miss Penalty and Out-of-Order Execution Processors
For an out-of-order execution processor, how do you define miss penalty? One approach is to redefine memory stalls, leading to a new definition of miss penalty as nonoverlapped latency:
Memory stall cycles / Instruction = Misses / Instruction * (Total miss latency - Overlapped miss latency).
We then have to decide the following. Length of memory latency: what to consider as the start and the end of a memory operation in an out-of-order processor. Length of latency overlap: what to consider as the start of overlap with the processor.
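The nonoverlapped-latency definition can be expressed directly; a minimal C sketch follows, and the example numbers in the comment are assumptions.

double stalls_per_instruction(double misses_per_instr,
                              double total_miss_latency,
                              double overlapped_latency) {
    /* Memory stall cycles per instruction = misses/instruction *
       (total miss latency - overlapped miss latency). */
    return misses_per_instr * (total_miss_latency - overlapped_latency);
}
/* Example (assumed): 0.02 misses/instr * (100 - 30) cycles = 1.4 stall cycles/instr. */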

5 Improving Cache performance
The following slides summarize the techniques and serve as a handy reference; they are organized by whether they reduce the miss penalty, the miss rate, or the hit time.

6 Reducing Cache Miss Penalty
Multilevel caches. Critical word first and early restart. Giving priority to read misses over writes. Merging write buffers. Victim caches.

7 Multilevel Caches This technique ignores the CPU, concentrating instead on the interface between the cache and main memory. Adding another level of cache between the original cache and memory simplifies the design decision: the first-level cache can be small enough to match the clock cycle time of the fast CPU, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory.

8 Multilevel Caches
Average memory access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2
so
Average memory access time = Hit time_L1 + Miss rate_L1 * (Hit time_L2 + Miss rate_L2 * Miss penalty_L2).

9 Multilevel Caches To avoid ambiguity, these terms are adopted here for a two-level cache system. Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache; for the second-level cache this is Miss rate_L2. Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the CPU; for the second-level cache this is Miss rate_L1 * Miss rate_L2.
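A short C sketch tying the two-level formulas and the two miss rates together; the reference and miss counts (1000 CPU references, 40 L1 misses, 20 L2 misses) and the timing parameters are assumed for illustration.

#include <stdio.h>

int main(void) {
    double refs = 1000.0, l1_misses = 40.0, l2_misses = 20.0;   /* assumed counts */
    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;     /* assumed cycles */

    double miss_rate_l1        = l1_misses / refs;       /* 4%: local and global rates match for L1 */
    double local_miss_rate_l2  = l2_misses / l1_misses;  /* 50% of the accesses reaching L2 miss there */
    double global_miss_rate_l2 = l2_misses / refs;       /* 2% of all CPU references miss in L2 */

    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2);
    printf("Global L2 miss rate = %.3f, AMAT = %.1f cycles\n", global_miss_rate_l2, amat);
    return 0;  /* prints 0.020 and 3.4 */
}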

10 Critical Word first and Early Restart
Multilevel caches require extra hardware to reduce the miss penalty, but this second technique does not. It is based on the observation that the CPU normally needs just one word of the block at a time. Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled in. This is also called wrapped fetch or requested word first. Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

11 Giving priority to read misses over writes
With a write-through cache the most important optimization is a write buffer, but a write buffer complicates memory accesses because it might hold the updated value of a location needed on a read miss. The simplest way out of this dilemma is for the read miss to wait until the write buffer is empty. The alternative is to check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, the read miss can proceed. The cost of writes by the processor can also be reduced in a write-back cache. Suppose a read miss will replace a dirty block. Instead of writing the dirty block to memory and then reading memory, we could copy the dirty block to a buffer, then read memory, and only then write the dirty block back; this way the CPU read finishes sooner. If a read miss occurs, the processor can either stall until the buffer is empty or check the addresses of the words in the buffer for conflicts.
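A minimal C sketch of the conflict check described above, assuming a simple array-based write buffer; all structure names and sizes here are hypothetical, not part of any real hardware interface.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                       /* assumed buffer depth */

struct wb_entry { bool valid; uint64_t addr; uint64_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, scan the buffer: if the address matches a valid entry,
   forward its data instead of waiting for the buffer to drain. */
bool forward_from_write_buffer(uint64_t miss_addr, uint64_t *data_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == miss_addr) {
            *data_out = write_buffer[i].data;
            return true;   /* conflict resolved by forwarding */
        }
    }
    return false;          /* no conflict: the read miss may proceed to memory */
}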

12 Merging Write Buffer This technique also involves write buffers, this time improving their efficiency. If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective; the CPU continues working while the write buffer prepares to write the word to memory. If the buffer already contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry; this is called write merging. If the buffer is full and there is no address match, the cache (and the CPU) must wait until the buffer has an empty entry.
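A C sketch of the merging check, assuming block-wide buffer entries with per-word valid bits; the entry count, block size, and the use of word addresses are all illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES      4   /* assumed buffer depth */
#define WORDS_PER_ENTRY 4   /* each entry covers one block of 4 words (assumed) */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;                  /* block-aligned word address */
    bool     word_valid[WORDS_PER_ENTRY];
    uint64_t word[WORDS_PER_ENTRY];
};
static struct wb_entry wb[WB_ENTRIES];

/* addr is a word address (assumed). Returns true if the write was absorbed,
   either by merging into a matching entry or by taking a free one; false
   means the buffer is full and the CPU must wait. */
bool buffer_write(uint64_t addr, uint64_t data) {
    uint64_t block  = addr / WORDS_PER_ENTRY;
    unsigned offset = (unsigned)(addr % WORDS_PER_ENTRY);

    for (int i = 0; i < WB_ENTRIES; i++)          /* try write merging first */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].word[offset] = data;
            wb[i].word_valid[offset] = true;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)          /* otherwise claim a free entry */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].word[offset] = data;
            wb[i].word_valid[offset] = true;
            return true;
        }
    return false;                                 /* full: stall until an entry drains */
}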

13 Victim Caches [Block diagram: the victim cache, a small buffer of tags and data, sits between the CPU's cache (with its write buffer) and the lower-level memory, on the cache's refill path.]

14 Victim Caches One approach to lowering the miss penalty is to remember what was discarded, in case it is needed again. Since the discarded data have already been fetched, they can be used again at small cost. Such recycling requires a small, fully associative cache between a cache and its refill path. A victim cache contains only blocks that are discarded from the cache because of a miss (the "victims") and is checked on a miss to see whether it holds the desired data before going to the next lower-level memory. If the block is found there, the victim block and the cache block are swapped.
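A C sketch of the miss-path check, assuming a four-block, fully associative victim cache; the structures and names are hypothetical and the data fields are omitted for brevity.

#include <stdbool.h>
#include <stdint.h>

#define VC_BLOCKS 4   /* assumed victim cache size */

struct vc_block { bool valid; uint64_t tag; /* ... block data ... */ };
static struct vc_block victim_cache[VC_BLOCKS];

/* On a cache miss, probe the victim cache before going to the next level.
   Returns the victim-cache index on a hit (the caller then swaps the victim
   block with the cache block), or -1 if the next lower level must be used. */
int victim_cache_lookup(uint64_t block_tag) {
    for (int i = 0; i < VC_BLOCKS; i++)
        if (victim_cache[i].valid && victim_cache[i].tag == block_tag)
            return i;
    return -1;
}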

15 Reducing Miss Rate The classical approach to improving cache behavior is to reduce miss rates. We first start with a model that sorts all misses into three simple categories.
i) Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
ii) Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
iii) Conflict: if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses or interference misses.

16 To show the benefit of associativity, conflict misses are divided into misses caused by each decrease in associativity. Here are the four divisions of conflict misses and how they are calculated:
Eight-way: conflict misses due to going from fully associative (no conflicts) to eight-way associative.
Four-way: conflict misses due to going from eight-way associative to four-way associative.
Two-way: conflict misses due to going from four-way associative to two-way associative.
One-way: conflict misses due to going from two-way associative to one-way associative (direct mapped).

17 Five techniques: Larger block size. Larger caches. Higher associativity. Way prediction and pseudo-associative caches. Compiler optimizations.

18 Larger Block Size The simplest way to reduce the cache miss rate is to increase the block size. Larger block sizes reduce compulsory misses. This reduction occurs because of the principle of locality, which has two components. Temporal locality: recently accessed items are likely to be accessed again soon. Spatial locality: items whose addresses are near a recently accessed item are likely to be accessed soon; larger blocks take advantage of spatial locality.

19 Larger Block Size At the same time, larger blocks increase the miss penalty. Since they reduce the number of blocks in the cache, larger blocks may also increase conflict misses and even capacity misses if the cache is small. There is no benefit to reducing the miss rate if doing so increases the average memory access time.

20 Larger Caches The obvious way to reduce capacity misses is to increase the capacity of the cache. The drawback is a longer hit time and higher cost. The technique has been especially popular in off-chip caches: the size of second- and third-level caches in 2001 matched the size of main memory in desktop computers of roughly a decade earlier.

21 Higher Associativity Miss rates improve with higher associativity. There are two general rules of thumb. The first is that eight-way set associative is, for practical purposes, as effective at reducing misses for these cache sizes as fully associative. The second observation, called the 2:1 cache rule of thumb, is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.

22 Way prediction and Pseudo Associative caches
Another approach reduces conflict misses and yet maintains the hit speed of a direct-mapped cache. In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexer is set early to select the desired block, and only a single tag comparison is performed in that clock cycle; a misprediction results in checking the other blocks in subsequent cycles.

23 Pseudo Associative or column associative
Accesses proceed as in a direct-mapped cache on a hit. On a miss, however, before going to the next lower level of the memory hierarchy, a second cache entry is checked to see whether the block is there. A simple way is to invert the most significant bit of the index field to find the other block in the "pseudo set". Pseudo-associative caches thus have one fast and one slow hit time, corresponding to a regular hit and a pseudo hit.
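A C sketch of the index inversion; the cache geometry (512 sets of 64-byte blocks) is an assumption chosen only so the "most significant index bit" is concrete.

#include <stdint.h>

#define NUM_SETS    512u                 /* must be a power of two (assumed) */
#define BLOCK_BYTES 64u                  /* assumed block size */
#define INDEX_MSB   (NUM_SETS >> 1)      /* mask selecting the index's top bit */

uint32_t cache_index(uint64_t addr) {    /* primary, direct-mapped index */
    return (uint32_t)((addr / BLOCK_BYTES) % NUM_SETS);
}

uint32_t pseudo_index(uint64_t addr) {   /* checked only after a primary miss */
    return cache_index(addr) ^ INDEX_MSB;  /* invert the most significant index bit */
}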

24 Compiler optimization
This final technique reduces miss rates without any hardware changes; the reduction comes from optimized software. The increasing performance gap between processors and main memory has inspired compiler writers to scrutinize the memory hierarchy. One code optimization aims for better efficiency from long cache blocks: aligning basic blocks so that the entry point is at the beginning of a cache block decreases the chance of a cache miss for sequential code. Goal: try to improve the temporal and spatial locality of the data.

25 Loop Interchange Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order in which they are stored. Example, before (the inner loop strides through memory with a 100-word gap between successive accesses to x):
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

26 Example, after (the inner loop now sweeps sequentially through each row, using all the words in a cache block before moving on):
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

27 Blocking This optimization tries to reduce misses via improved temporal locality. We are again dealing with multiple arrays, with some arrays accessed by rows and some by columns. Storing the arrays row by row or column by column does not solve the problem, because both rows and columns are used in every loop iteration. Such orthogonal accesses mean that transformations such as loop interchange are not helpful. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices, or blocks. Goal: maximize accesses to the data loaded into the cache before the data are replaced.
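A minimal C sketch of a blocked (tiled) matrix multiply in the spirit of the classic example; the matrix size N and block size B are illustrative assumptions, and x must be zeroed before the first call.

#define N 512   /* assumed matrix dimension */
#define B 32    /* assumed block size: three B x B tiles should fit in the cache */

void matmul_blocked(double x[N][N], const double y[N][N], const double z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];  /* reuse the y and z tiles while they are cached */
                    x[i][j] += r;
                }
}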

28 Reducing Hit Time Hit time is critical because it affects the clock rate of the processor: in many processors today the cache access time limits the clock cycle time, even for processors that take multiple clock cycles to access the cache.

29 Techniques Small and Simple caches.
Avoiding Address translation during the indexing of the cache. Pipelined Cache Access. Trace Caches.

30 Small and simple caches
A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare the tag to the address. 1) One guideline from Chapter 1 suggests that smaller hardware is faster, and a small cache certainly helps the hit time; it is also critical to keep the cache small enough to fit on the same chip as the processor to avoid the time penalty of going off chip. 2) The second guideline is to keep the cache simple, for example by using direct mapping. The main benefit of direct-mapped caches is that the designer can overlap the tag check with the transmission of the data.

31 For second-level caches, some designs strike a compromise by keeping the tags on chip and the data off chip. One approach to determining the impact on hit time in advance of building a chip is to use CAD tools.

32 Avoiding address translation during indexing of the cache
The guideline of making the common case fast suggests that we use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses. It is important to distinguish two tasks: indexing the cache and comparing addresses.

33 There are several reasons why virtual caches are not more widely used. 1. Protection: page-level protection is checked as part of the virtual-to-physical address translation, and it must still be enforced. 2. Every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. 3. A third reason is that operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses, called synonyms or aliases, could result in two copies of the same data in a virtual cache.

34 Pipelined Caches This technique simply pipelines cache access, so that the effective latency of a first-level cache hit can be multiple clock cycles, giving a fast clock cycle time but slow hits. The split increases the number of pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of the data.

35 Trace Caches A challenge in the effort to find instruction-level parallelism beyond four instructions per cycle is to supply enough instructions every cycle without dependencies. One solution is called a trace cache. It holds dynamic traces of the executed instructions as determined by the CPU, rather than static sequences of instructions as laid out in memory. Branch prediction is folded into the cache, so it must be validated along with the addresses to have a valid fetch.

36 Trace caches have much more complicated address mapping mechanisms, as the addresses are no longer aligned to power-of-2 multiples of the word size. On the other hand, they store instructions only from the branch entry point to the exit of the trace, thereby avoiding header and trailer overhead.

37 Main memory and organization for improving performance
Main memory is the next level down in the hierarchy. It satisfies the demands of caches and serves as the I/O interface, since it is the destination of input as well as the source of output. Performance measures of main memory emphasize both latency and bandwidth. Traditionally, main memory latency (which affects the cache miss penalty) is the primary concern of the cache, while main memory bandwidth is the primary concern of I/O and multiprocessors.

38 Techniques Wider Main Memory. Simple Interleaved Memory.
Independent Memory Banks.

39 Wider Main Memory First-level caches are often organized with a physical width of one word because most CPU accesses are that size. Doubling or quadrupling the width of the cache and the memory doubles or quadruples the memory bandwidth. With a main memory width of two words, the miss penalty in our example would drop from 4 x 64, or 256, clock cycles as calculated earlier to 2 x 64, or 128, clock cycles. There is a cost in the wider connection between the CPU and memory, typically called the memory bus. CPUs will still access the cache a word at a time, so there now needs to be a multiplexer between the cache and the CPU; second-level caches can help, since the multiplexing can be placed between the first- and second-level caches. Another drawback of wide memory is that the minimum memory increment is doubled or quadrupled when the width is doubled or quadrupled.

40 Simple Interleaved Memory.
Increasing width is one way to improve bandwidth, but another is to take advantage of the potential parallelism of having many chips in a memory system. Memory chips can be organized in banks to read or write multiple words at a time rather than a single word. In general, the purpose of interleaved memory is to try to take advantage of the potential memory bandwidth of all chips containing the needed words.

41 The mapping of addresses to banks is referred to as the interleaving factor. Interleaved memory normally means banks of memory that are word interleaved, which optimizes sequential memory accesses. A cache read miss is an ideal match to word-interleaved memory, because the words in a block are read sequentially.
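A tiny C sketch of word-interleaved bank mapping; the four-bank interleaving factor is an assumed example.

#define NUM_BANKS 4   /* interleaving factor (assumed) */

unsigned bank_of(unsigned word_addr)        { return word_addr % NUM_BANKS; }
unsigned offset_in_bank(unsigned word_addr) { return word_addr / NUM_BANKS; }

/* Sequential words 0, 1, 2, 3 land in banks 0, 1, 2, 3, so a four-word cache
   block can be supplied by all four banks working in parallel. */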

42 Independent Memory Banks.
The original motivation for memory banks was higher memory bandwidth through the interleaving of sequential accesses. This hardware is not much more difficult, since the banks can share address lines, with a memory controller enabling each bank in turn to use the data portion of the memory bus. A generalization of interleaving is to allow multiple independent accesses, where multiple memory controllers allow banks to operate independently; each bank then needs separate address lines and possibly a separate data bus. Independent of the memory technology, higher bandwidth is available by using memory banks, by making the memory and its bus wider, or by doing both.

43 Memory Technology This section describes the technology inside the memory chips. Memory latency is traditionally quoted using two measures. Access time: the time between when a read is requested and when the desired word arrives. Cycle time: the minimum time between requests to memory.

44 DRAM Technology DRAM is the main memory of virtually every desktop or server computer. As DRAMs grew in capacity, the cost of a package with all the necessary address lines became an issue. The solution was to multiplex the address lines, thereby cutting the number of address pins in half: one half of the address is sent first, signalled by the row access strobe (RAS), followed by the other half, signalled by the column access strobe (CAS). An additional requirement is signified by the D for dynamic: to pack more bits per chip, DRAMs use only a single transistor to store each bit, so to prevent the loss of information every bit must be refreshed periodically.

45 Every DRAM in the memory system must have every row accessed within a certain time window. This requirement means that the memory system is occasionally unavailable because it is sending a signal telling every chip to refresh. The time for a refresh is typically a full memory access (RAS and CAS) for each row of the DRAM.

46 SRAM Technology In contrast to DRAMs are SRAMs, the first letter standing for static. The dynamic nature of the circuits in a DRAM requires data to be written back after being read, hence the difference between access time and cycle time as well as the need to refresh. SRAMs typically use six transistors per bit to prevent the information from being disturbed when read. This difference in refresh alone can matter for embedded applications: SRAM needs only minimal power to retain its contents in standby mode, but DRAM must continue to be refreshed periodically. In DRAM designs the emphasis is on cost per bit and capacity, while SRAM designs are concerned with speed and capacity.

47 Buses In a computer system, the various subsystems must have interfaces to one another. For instance, the memory and CPU need to communicate, and so do the CPU and the I/O devices. This communication is commonly done using a bus, which serves as a shared communication link between the subsystems. Advantages: low cost and versatility. The cost of a bus is low because it is a single set of shared wires, and new devices can be added easily; peripherals can even be moved between computer systems that use a common bus.

48 Disadvantage: a bus creates a communication bottleneck, possibly limiting the maximum I/O throughput. When all I/O must pass through a central bus, this bandwidth limitation is as real as, and sometimes more severe than, the memory bandwidth limitation. To avoid the bus bottleneck, some I/O devices are connected to computers via storage area networks (SANs). One reason bus design is so difficult is that the maximum bus speed is largely limited by physical factors: the length of the bus and the number of devices.

49 Buses were traditionally classified as CPU-memory buses or I/O buses. I/O buses may be lengthy, may have many types of devices connected to them, have a wide range in the data bandwidth of the attached devices, and normally follow a bus standard. CPU-memory buses, on the other hand, are short, generally high speed, and matched to the memory system to maximize memory-CPU bandwidth. During the design phase, the designer of a CPU-memory bus knows all the types of devices that must connect together, while the I/O bus designer must accept devices varying in latency and bandwidth. To lower costs, some computers have a single bus for both memory and I/O devices. In the quest for higher I/O performance, some buses are a hybrid of the two.

50 Bus Transactions A bus transaction includes two parts: sending the address and receiving or sending the data. A read transaction transfers data from memory; a write transaction writes data to memory.

51 Bus Design Decisions The design of a bus presents several options. For any computer system, the decisions depend on cost and performance goals. The first three options are clear: separate address and data lines, wider data lines, and multiple-word transfers all give higher performance at higher cost. The next item concerns the number of bus masters, the devices that can initiate a read or write transaction; the CPU, for instance, is always a bus master. A bus has multiple masters when there are multiple CPUs or when I/O devices can initiate a bus transaction.

52 If there are multiple masters, an arbitration scheme is required among the masters to decide which one gets the bus next. Arbitration is often a fixed priority for each device. With multiple masters, a bus can also offer higher bandwidth by using packets rather than holding the bus for a full transaction. This technique is called split transactions.

53 The figure shows a split-transaction bus. The idea is to divide bus events into requests and replies, so that the bus can be used in the time between a request and its reply. Clocking concerns whether a bus is synchronous or asynchronous. If a bus is synchronous, it includes a clock in the control lines and a fixed protocol for sending addresses and data relative to the clock. Since little or no logic is needed to decide what to do next, these buses can be both fast and inexpensive. They have two major disadvantages: because of clock skew problems, synchronous buses cannot be long, and everything on the bus must run at the same clock rate. Some buses allow devices of multiple speeds on a bus, but they all run at the rate of the slowest device.

54 An asynchronous bus, on the other hand, is not clocked. Instead, self-timed handshaking protocols are used between the bus sender and receiver. Asynchrony makes it much easier to accommodate a wide variety of devices and to lengthen the bus without worrying about clock skew or synchronization problems. If a synchronous bus can be used, it is usually faster than an asynchronous bus because it avoids the overhead of synchronizing the bus for each transaction.

55 RAID (Redundant Array of Inexpensive Disks)
An innovation that improves both the dependability and the performance of storage systems is the disk array. One argument for arrays is that potential throughput can be increased by having many disk drives: simply spreading data over multiple disks, called striping, automatically forces accesses to several disks. The drawback of arrays is that with more devices, dependability decreases: N devices generally have 1/N the reliability of a single device. Although a disk array has more faults than a smaller number of larger disks, dependability can be improved by adding redundant disks to the array to tolerate faults: if a single disk fails, the lost information can be reconstructed from the redundant information.

56 These systems have become known by the acronym RAID, standing originally for Redundant Array of Inexpensive Disks, although some have renamed it Redundant Array of Independent Disks. Each RAID level specifies how the eight disks of user data must be supplemented by redundant or check disks. One problem is discovering when a disk faults; magnetic disks provide information about their correct operation, and extra check information is recorded in each sector to discover errors. Another issue in the design of RAID systems is decreasing the mean time to repair. This reduction is done by adding hot spares to the system: extra disks that are not used in normal operation. When a failure occurs on an active disk in the RAID, an idle hot spare is pressed into service, and the data missing from the failed disk are reconstructed onto it. Hot swapping: systems with hot swapping allow components to be replaced without shutting down the computer.

57 No Redundancy (RAID 0) This notation refers to a disk array in which data are striped but there is no redundancy to tolerate disk failure. Striping across a set of disks makes the collection appear to software as a single large disk, which simplifies storage management. It also improves performance for large accesses, for example in video editing systems.

58 RAID 1 (Mirroring) The traditional scheme for tolerating disk failure, called mirroring or shadowing, uses twice as many disks as RAID 0. Whenever data are written to one disk, they are also written to a redundant disk, so that there are always two copies of the information. If a disk fails, the system goes to the mirror to get the desired information. Mirroring is the most expensive RAID solution. One issue is how mirroring interacts with striping: RAID terminology has evolved to call striping of mirrored pairs RAID 1+0 or RAID 10 (striped mirrors) and mirroring of two striped sets RAID 0+1 or RAID 01 (mirrored stripes).

59 Bit interleaved parity - RAID 3
RAID 3 is popular in applications with large data sets, such as multimedia and some scientific codes. Parity is one such redundancy scheme: readers unfamiliar with parity can think of the redundant disk as holding the sum of all the data on the other disks. When a disk fails, you subtract the data on the good disks from the parity disk; the remaining information must be the missing data. Parity is simply this sum modulo 2. The assumption behind the technique is that failures are so rare that taking longer to recover from a failure, in exchange for less redundant storage, is a good trade-off. Mirroring (RAID 1) can be seen as the special case in which the "parity" is accomplished by simply duplicating the data.
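A C sketch of the parity ("sum modulo 2") idea: parity is the bytewise XOR of the data sectors, and a failed disk is rebuilt by XORing the parity with the surviving disks. The disk count and sector size are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>

#define DATA_DISKS  4     /* assumed */
#define SECTOR_SIZE 512   /* bytes, assumed */

/* Parity sector = XOR of the corresponding data sectors. */
void compute_parity(uint8_t data[DATA_DISKS][SECTOR_SIZE], uint8_t parity[SECTOR_SIZE]) {
    for (size_t b = 0; b < SECTOR_SIZE; b++) {
        uint8_t p = 0;
        for (int d = 0; d < DATA_DISKS; d++)
            p ^= data[d][b];
        parity[b] = p;
    }
}

/* Rebuild a failed disk by "subtracting" (XORing) the good disks from the parity. */
void reconstruct(uint8_t data[DATA_DISKS][SECTOR_SIZE],
                 const uint8_t parity[SECTOR_SIZE], int failed_disk) {
    for (size_t b = 0; b < SECTOR_SIZE; b++) {
        uint8_t p = parity[b];
        for (int d = 0; d < DATA_DISKS; d++)
            if (d != failed_disk)
                p ^= data[d][b];
        data[failed_disk][b] = p;
    }
}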

60 Block interleaved parity - RAID 4
RAID 4 efficiently supports a mixture of large reads, large writes, small reads, and small writes. One drawback of the scheme is that the parity disk must be updated on every write, so it becomes the bottleneck for back-to-back writes. To fix the parity-write bottleneck, the parity information is distributed across all the disks; this distributed-parity organization is RAID 5.

61 Distributed Block interleaved parity - RAID 5
Data and parity are distributed differently in RAID 4 and RAID 5. In RAID 5 the parity associated with each row of data blocks is no longer confined to a single disk; the parity stripe units are spread across all the disks. For example, a write to block 8 on the right (RAID 5) must also access its parity block P2, thereby occupying the first and third disks. A second write, to block 5 on the right, implies an update to its parity block P1 and accesses the second and fourth disks, so it could occur at the same time as the write to block 8. Those same writes in the RAID 4 organization on the left would both result in changes to blocks P1 and P2 on the fifth disk, which would be a bottleneck.

62 RAID 6 (P+Q Redundancy) Parity-based schemes protect against a single self-identifying failure. When a single failure correction is not sufficient, parity can be generalized to a second calculation over the data and another check disk of information. This second check block allows recovery from a second failure. Thus the storage overhead is twice that of RAID 5.

63 I/O Performance Issues I/O performance has measures that have no counterparts in CPU design. Diversity: which I/O devices can connect to the computer system. Capacity: how many I/O devices can connect to the computer system. In addition, the standard measures of response time and throughput (I/O bandwidth) also apply. Response time: the time a task takes from the moment it is placed in the buffer until the server finishes the task. Throughput: the average number of tasks completed by the server over a time period.

64 Throughput Vs Response Time
The figure shows throughput versus response time for a typical I/O system. How does the architect balance these conflicting demands? Consider two studies of interactive environments: one keyboard oriented and one graphical. An interaction, or transaction, with a computer is divided into three parts. Entry time: the time for the user to enter the command; the graphical system required 0.25 seconds on average to enter a command versus 4.0 seconds for the keyboard system.

65 System response time: the time between when the user enters the command and when the complete response is displayed. Think time: the time from the completion of the response until the user begins to enter the next command. The sum of the three parts is called the transaction time.

66 Response time Vs Throughput
I/O benchmarks offer another perspective on the response time versus throughput trade-off. These benchmarks report the maximum throughput achievable given either that 90% of the response times must be less than a limit or that the average response time must be less than a limit.

67 Designing an I/O System The goal is to find a design that meets goals for cost, dependability, and variety of devices while avoiding bottlenecks in I/O performance. Avoiding bottlenecks means that components must be balanced between main memory and the I/O devices, because performance, and hence effective cost-performance, can only be as good as the weakest link in the I/O chain. Finally, storage must be dependable, adding new constraints on proposed designs.

68 The steps in designing an I/O system:
1. List the different types of I/O devices to be connected to the machine.
2. List the physical requirements for each I/O device.
3. List the cost of each I/O device, including the portion of the cost of any controller needed for the device.
4. List the reliability of each I/O device.
5. Record the CPU resource demands of each I/O device: clock cycles for the instructions used to initiate an I/O, to support operation of the device, and to complete the I/O; CPU clock stalls due to waiting for I/O; and CPU clock cycles to recover from an I/O activity.

69 6. List the memory and I/O bus resource demands of each I/O device.
7. The final step is to assess the performance and availability of the different ways to organize these I/O devices.

