1 Reducing Hit Time Small and simple caches Way prediction Trace caches
Sections 5.2, 5.3

2 Small and simple caches
Small hardware is faster, so it is desirable to keep the L2 cache small enough to fit on the processor chip.
The cache needs to be fast to match the fast clock cycle.
The time-consuming part of a hit is using the index to read the tags and comparing them to the tag bits of the address.
A simpler cache (direct mapped) lets the tag check be overlapped with transmission of the data.
Compromise: keep the tags on chip (fast tag comparison) and the data off chip (larger cache).
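To make the index/tag split concrete, here is a minimal sketch of how a direct-mapped cache decomposes an address. The parameters (32 KB cache, 64-byte blocks, 32-bit addresses) are illustrative assumptions, not values from the slides.

#include <stdio.h>

#define CACHE_SIZE  (32 * 1024)                 /* bytes                      */
#define BLOCK_SIZE  64                          /* bytes per block            */
#define NUM_BLOCKS  (CACHE_SIZE / BLOCK_SIZE)   /* 512 blocks (direct mapped) */
#define OFFSET_BITS 6                           /* log2(BLOCK_SIZE)           */
#define INDEX_BITS  9                           /* log2(NUM_BLOCKS)           */

int main(void) {
    unsigned addr = 0x12345678u;                /* example 32-bit address     */

    unsigned offset = addr & (BLOCK_SIZE - 1);
    unsigned index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* In a direct-mapped cache the index selects exactly one block, so the
       single tag comparison can be overlapped with reading the data out.    */
    printf("tag=0x%x  index=%u  offset=%u\n", tag, index, offset);
    return 0;
}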

3 Shows impact of size and associativity on hit time
Figure 5.4

4 Way Prediction
Stored with each cache set are extra bits used to predict which block (called the way) in the set will be accessed by the next access to the cache.
The tag in the address is compared first to the predicted block's tag (fast hit if the prediction is correct).
If the prediction is wrong, the tag is compared to the other tags in the set (slower hit, or possibly a miss).
This allows the hit time to usually be as fast as the hit time of a direct-mapped cache.
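The lookup logic can be sketched in a few lines of C. The structure below is a toy model of a 2-way set-associative cache with per-set prediction bits; the names and sizes are assumptions for illustration only.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256
#define WAYS     2

typedef struct {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    int      predicted_way;   /* extra prediction bits stored with the set */
} cache_set_t;

cache_set_t sets[NUM_SETS];

/* Returns the matching way, or -1 on a miss. */
int lookup(uint32_t set_index, uint32_t tag) {
    cache_set_t *s = &sets[set_index];

    int w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag)
        return w;                        /* fast hit: one comparison, like direct mapped */

    for (int i = 0; i < WAYS; i++) {     /* slow path: check the remaining ways          */
        if (i != w && s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = i;        /* update the prediction                        */
            return i;                    /* slower hit                                   */
        }
    }
    return -1;                           /* miss                                         */
}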

5 Trace Caches
Contain dynamic traces of executed instructions rather than static sequences of instructions as determined by their layout in memory.
More difficult to access than a regular cache (branch prediction is part of the addressing scheme).
Branches can cause the same instruction to appear in different traces and thus in more than one place in the trace cache.
Expensive in area, power, and complexity; in spite of these drawbacks, used in the Intel NetBurst architecture for storing micro-operations.

6 Increasing cache bandwidth
Pipelined caches
Multi-banked caches
Non-blocking caches

7 Pipelined caches
Pipeline cache access so that multiple accesses can be in progress simultaneously.
Provides a fast clock cycle, because the stages can be small.
Provides high bandwidth.
But hit time (in cycles) is slower.
Example: the Pentium III takes 2 clock cycles to access the cache; the Pentium 4 takes 4.

8 Multi-banked caches
Main memory is typically organized as a collection of banks that can be accessed in parallel.
Banks are now starting to appear in cache organizations as well.
See Figure 5.6 (next slide).
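The usual mapping is sequential interleaving: the bank number is the block address modulo the number of banks, so consecutive blocks land in different banks and can be accessed in parallel. A small sketch, assuming 64-byte blocks and four banks:

#include <stdio.h>

#define BLOCK_SIZE 64   /* bytes per cache block (assumed) */
#define NUM_BANKS  4    /* four-way interleaving           */

int main(void) {
    for (unsigned addr = 0; addr < 8 * BLOCK_SIZE; addr += BLOCK_SIZE) {
        unsigned block         = addr / BLOCK_SIZE;
        unsigned bank          = block % NUM_BANKS;   /* which bank holds this block */
        unsigned index_in_bank = block / NUM_BANKS;   /* position within that bank   */
        printf("block %u -> bank %u, index %u\n", block, bank, index_in_bank);
    }
    return 0;
}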

9 Figure 5.6: four-way interleaved cache banks

10 Nonblocking Caches
The CPU need not be stalled during a cache miss.
Processors with separate memory units could continue with one memory operation while another is stalled because of a cache miss.
A non-blocking cache can service other requests even after a miss has occurred (the hit-under-miss optimization).
See Figure 5.5 on the next slide.
Note that what is really going on is that the miss penalty is being overlapped with another memory access.
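One way to picture the bookkeeping is a small table of outstanding misses (often called miss status holding registers, MSHRs): as long as a free entry exists, a miss is recorded and the cache keeps servicing other requests. The sketch below is a toy model with assumed sizes, not a description of any particular processor.

#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4

typedef struct {
    bool     valid;
    uint32_t block_addr;   /* block being fetched from the next level */
} mshr_t;

mshr_t mshrs[MAX_OUTSTANDING];

/* Record a miss; returns false if no MSHR is free (the cache must then stall). */
bool record_miss(uint32_t block_addr) {
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (!mshrs[i].valid) {
            mshrs[i].valid = true;
            mshrs[i].block_addr = block_addr;
            return true;   /* miss outstanding; later hits can still be serviced */
        }
    }
    return false;
}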

11 Figure 5.5

12 Reducing the miss penalty
Critical word first
Merging write buffers

13 Critical word first
Critical word first – request the missed word first and send it to the CPU as soon as it arrives; the rest of the block arrives while the CPU continues execution.
Early restart – fetch the words in address order within the block, but as soon as the requested word arrives, send it to the CPU.
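A short sketch may help contrast the two schemes. It assumes an 8-word block and a miss on word 5; both values are illustrative.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int missed = 5;   /* the CPU asked for word 5 of the block */

    /* Critical word first: start the fetch at the missed word and wrap around. */
    printf("critical word first: ");
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", (missed + i) % WORDS_PER_BLOCK);   /* word 5 arrives first */
    printf("\n");

    /* Early restart: fetch in address order, but restart the CPU as soon as
       word 5 arrives (words 0..4 are fetched before the CPU can continue).  */
    printf("early restart:       ");
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d%s ", i, i == missed ? "<-CPU restarts" : "");
    printf("\n");
    return 0;
}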

14 Merging Write Buffer
A typical write buffer uses one entry per write, no matter how much data is written (a word or multiple words).
A merging write buffer uses the addresses of previously written words to merge newly written words into an existing buffer entry.
This makes more efficient use of the space in the write buffer and uses memory more efficiently, since multiword writes are faster than multiple single-word writes.
See the figure on the next slide (Figure 5.7).
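A toy sketch of the merging logic, assuming a 4-entry buffer holding 4-word (16-byte) blocks; the structure and sizes are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define BUF_ENTRIES      4
#define WORDS_PER_ENTRY  4
#define BLOCK_BYTES      (WORDS_PER_ENTRY * 4)

typedef struct {
    bool     valid;
    uint32_t block_addr;                  /* aligned address of the block */
    uint32_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
} wb_entry_t;

wb_entry_t wb[BUF_ENTRIES];

/* Returns false if the buffer is full and the CPU must stall. */
bool buffer_write(uint32_t addr, uint32_t value) {
    uint32_t block = addr & ~(uint32_t)(BLOCK_BYTES - 1);
    uint32_t word  = (addr & (BLOCK_BYTES - 1)) / 4;

    for (int i = 0; i < BUF_ENTRIES; i++)           /* try to merge first */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[word] = value;
            wb[i].word_valid[word] = true;
            return true;
        }

    for (int i = 0; i < BUF_ENTRIES; i++)           /* otherwise allocate */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].data[word] = value;
            wb[i].word_valid[word] = true;
            return true;
        }
    return false;
}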

15 Figure 5.7

16 Reducing the miss rate
Compiler optimizations

17 Compiler Optimizations
The compiler can reorder or transform code to reduce the number of cache misses.
Example: align basic blocks so that the entry point is at the beginning of a cache block.
Another example: interchange loops so that data is accessed in the order in which it is stored.

18 Loop Interchange
/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

19 Blocking
Some algorithms access arrays by both row and column, so loop interchange won't improve locality.
Instead, accesses are blocked to maximize accesses to a portion of a row or column before continuing to the next portion.
See Figures 5.8 and 5.9.
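As a concrete illustration, here is a sketch of blocked matrix multiplication in the spirit of the textbook example, written so that x and y are read to compute z as in the figure descriptions that follow. The matrix size N and blocking factor B are assumptions chosen for illustration.

#define N 512
#define B 32          /* blocking factor, chosen so three B x B tiles fit in the cache */

double x[N][N], y[N][N], z[N][N];   /* globals, so z starts out zero-initialized */

void blocked_matmul(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double sum = z[i][j];
                    for (int k = kk; k < kk + B; k++)
                        sum += x[i][k] * y[k][j];   /* reuse the x and y tiles while cached */
                    z[i][j] = sum;
                }
}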

20 Accesses to the three arrays x, y, and z. White locations have not yet been accessed; light gray indicates older accesses; dark gray indicates newer accesses. x and y are read repeatedly to calculate new elements of z.
Figure 5.8

21 How the arrays are accessed after blocking. The idea is that these smaller portions of the arrays will fit into the cache and be used repeatedly before being discarded.
Figure 5.9

22 Reducing Miss Penalty or Miss Rate via Parallelism
Hardware prefetching
Compiler-controlled prefetching

23 Hardware Prefetching of Instructions and Data
Instruction prefetch:
On a cache miss, the hardware fetches two blocks.
The requested block goes into the instruction cache.
The prefetched block goes into an instruction buffer.
On the next miss, the instruction buffer is checked for the desired block.
Data prefetch:
Similar to instruction prefetching, but more buffers are needed, since data does not exhibit the same locality as instructions.

24 Speedup due to hardware prefetching on Pentium 4
Figure 5.10

25 Compiler Controlled Prefetching
The compiler inserts prefetch instructions that request data before it is needed.
Register prefetch – the value is loaded into a register.
Cache prefetch – the data is loaded only into the cache, not into a register.
Nonfaulting prefetch instruction – if the instruction would cause an exception, the prefetch is turned into a no-op.
The cache does not stall during a prefetch but continues to supply instructions and data (a non-blocking cache).

26 Compiler Controlled Prefetching
prefetch(a[0]);          /* assume a cache block holds 8 elements of a: first block */
prefetch(a[8]);          /* second block                                            */
for (i = 0; i < 1000; i++) {
    prefetch(a[i + 16]); /* stay two blocks ahead of the element being processed    */
    a[i] = a[i] * 1000;
}

27 Memory technology Two performance issues
Latency (time between the start and completion of an access) – impacts the cache miss penalty.
  Access time – time between when a read is requested and when the desired word arrives.
  Cycle time – minimum time between requests to memory.
Bandwidth (amount of data transferred in a given time period) – impacts multiprocessor performance; increased by using memory banks.

28 SRAM Technology
SRAM (static RAM) – no refresh is needed, so access time is very close to cycle time.
Six transistors per bit is typical.
Minimal power is needed to retain the data.
Designed for speed and capacity (rather than cost per bit).
8 to 16 times faster than DRAM, and 8 to 16 times more expensive.

29 DRAM Technology
DRAM (dynamic RAM) – data has to be written back after being read.
An increase in capacity causes an increase in the number of address lines.
Solution: multiplex the address lines (cuts the number of address pins in half).
First, one half of the address is sent (RAS – row access strobe).
Then, the other half of the address is sent (CAS – column access strobe).
See Figure 5.12.
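A small sketch of the multiplexing, assuming a hypothetical DRAM with a 22-bit address split evenly into row and column halves:

#include <stdio.h>

#define ADDR_BITS 22                     /* 4M locations in this hypothetical DRAM */
#define ROW_BITS  (ADDR_BITS / 2)
#define COL_BITS  (ADDR_BITS - ROW_BITS)

int main(void) {
    unsigned addr = 0x2ABCDEu;                     /* example 22-bit address */

    unsigned row = addr >> COL_BITS;               /* sent first, with RAS   */
    unsigned col = addr & ((1u << COL_BITS) - 1);  /* sent second, with CAS  */

    /* Only max(ROW_BITS, COL_BITS) = 11 address pins are needed instead of 22. */
    printf("addr=0x%x -> row=0x%x, col=0x%x\n", addr, row, col);
    return 0;
}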

30 Figure 5.12

31 Conventional DRAM Organization
A d x w DRAM stores d·w total bits, organized as d supercells of w bits each.
[Figure: a 16 x 8 DRAM chip organized as a 4 x 4 array of supercells (rows 0-3, cols 0-3); the memory controller sends a 2-bit row/column address over the addr pins and transfers 8 bits over the data pins to the CPU; an accessed row is first copied into the internal row buffer; supercell (2,1) is highlighted.]

32 Reading DRAM Supercell (2,1)
[Figure, step 1: the memory controller sends RAS = 2 over the 2-bit addr lines; row 2 of the supercell array is copied into the internal row buffer.]

33 Reading DRAM Supercell (2,1)
[Figure, step 2: the memory controller sends CAS = 1; supercell (2,1) is read out of the internal row buffer and returned to the CPU over the 8-bit data lines.]

34 DRAM technology
A single transistor is used to store a bit.
Reading a bit destroys the information (as does the passage of time), so the bit must be refreshed.
Bits are refreshed by reading every row within a certain time frame.
Memory is occasionally unavailable due to this refresh.
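A back-of-the-envelope calculation shows why the unavailability is small. The numbers below (8192 rows, a 64 ms refresh window, 50 ns to refresh one row) are common illustrative values, not figures from the slides.

#include <stdio.h>

int main(void) {
    double rows          = 8192;
    double window_s      = 0.064;   /* every row must be refreshed within 64 ms */
    double row_refresh_s = 50e-9;   /* assumed time to refresh one row          */

    double busy = rows * row_refresh_s;   /* time spent refreshing per window   */
    printf("refresh overhead: %.2f%% of the time\n", 100.0 * busy / window_s);
    return 0;
}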

35 DIMM
Dual inline memory modules (DIMMs) – small boards that contain a collection of DRAM chips (typically 4 to 16).
Normally organized to be 8 bytes wide for desktop machines.

36 Memory Modules
[Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs (DRAM 0 through DRAM 7). To fetch the 64-bit doubleword at main memory address A, the memory controller sends addr (row = i, col = j) to all eight chips; each chip supplies one byte of the doubleword – bits 0-7 from DRAM 0 up through bits 56-63 from DRAM 7 – and the controller assembles the bytes into the 64-bit doubleword.]

37 DRAM performance issues
DRAM capacity needs to increase at 55% every three years to keep up with processor performance, but it is not doing so.
Latency is decreasing at an even slower rate.

38 Improving DRAM performance
Fast page mode – supports repeated accesses to the same row without an intervening RAS.
Synchronous DRAM (SDRAM) – a clock signal is added to the DRAM interface so that accesses can be synchronous.
Double data rate (DDR) SDRAM – supports transfer of data on both the rising and falling edges of the DRAM clock signal.

39 Naming DDRs
DDR DRAMs are named by the number of millions (M) of transfers per second.
A clock rate of 133 MHz gives two transfers per cycle (rising and falling edges), i.e., 266M transfers per second, so the DDR name is DDR266.

40 Naming DIMMs
DIMMs are named according to their peak bandwidth.
133 MHz x 2 (rising and falling edges) x 8 bytes (width of the DIMM) ≈ 2100 MB/sec, so the DIMM name is PC2100.
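Both naming rules can be captured in a few lines. The 133 MHz clock is the slides' example; the note about rounding 2128 down to PC2100 reflects common marketing practice.

#include <stdio.h>

int main(void) {
    int clock_mhz = 133;
    int transfers = clock_mhz * 2;   /* rising and falling edge -> DDR266      */
    int bandwidth = transfers * 8;   /* 8-byte-wide DIMM -> 2128 MB/sec peak   */

    printf("DDR%d, peak %d MB/sec (marketed as PC2100)\n", transfers, bandwidth);
    return 0;
}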

