EECC722 - Shaaban #1 lec # 12 Fall 2001 10-29-2001 Computer System Components SDRAM PC100/PC133 100-133MHZ 64-128 bits wide 2-way interleaved ~ 900 MBYTES/SEC.


1 EECC722 - Shaaban #1 lec # 12 Fall 2001 10-29-2001 Computer System Components
CPU Core: 500 MHz - 2.0 GHz, 4-way superscalar RISC or RISC-core (x86): deep instruction pipelines, dynamic scheduling, multiple FP and integer FUs, dynamic branch prediction, hardware speculation.
Caches (L1, L2, L3), all non-blocking:
–L1: 16-64K, 1-2 way set associative (on chip), separate or unified.
–L2: 128K-1M, 4-16 way set associative (on chip), unified.
–L3: 1-16M, 8-16 way set associative (off chip), unified.
System Bus examples: Alpha, AMD K7: EV6, 200-266 MHz. Intel PII, PIII: GTL+, 100 MHz. Intel P4: 400 MHz.
Memory (through the Memory Controller and Memory Bus):
–SDRAM PC100/PC133: 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MBytes/sec (64-bit).
–Double Data Rate (DDR) SDRAM PC2100: 266 MHz, 64-128 bits wide, 4-way interleaved, ~2.1 GBytes/sec (64-bit).
–Rambus DRAM (RDRAM): 400-800 MHz, 16 bits wide, ~1.6 GBytes/sec.
I/O Buses: Example: PCI, 33 MHz, 32 bits wide, 133 MBytes/sec.
I/O Devices: Controllers and adapters (including NICs) for disks, displays, keyboards, and networks.
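The peak rates quoted on this slide follow from transfer rate times bus width. The sketch below is only an illustration: the function name is hypothetical, and it assumes the quoted MHz figures are effective transfer rates (one transfer per quoted cycle).

```python
def peak_bandwidth_mbytes(transfers_mhz, width_bits):
    """Peak rate in MBytes/sec: million transfers per second times bytes per transfer.

    Assumption: the quoted MHz value is the effective transfer rate.
    """
    return transfers_mhz * width_bits / 8


pci = peak_bandwidth_mbytes(33, 32)     # 132.0 (the slide's 133 uses the nominal 33.33 MHz clock)
ddr = peak_bandwidth_mbytes(266, 64)    # 2128.0, i.e. ~2.1 GBytes/sec
rdram = peak_bandwidth_mbytes(800, 16)  # 1600.0, i.e. ~1.6 GBytes/sec
print(pci, ddr, rdram)
```

These are upper bounds. The ~900 MBytes/sec quoted for PC133 SDRAM is below its 133 x 8 = 1064 MBytes/sec peak.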

2 Main Memory
Main memory generally utilizes Dynamic RAM (DRAM), which uses a single transistor to store a bit but requires a periodic data refresh by reading every row (~every 8 msec).
Static RAM may be used if the added expense, low density, power consumption, and complexity are acceptable (e.g. Cray vector supercomputers).
Main memory performance is affected by:
–Memory latency: Affects cache miss penalty. Measured by:
Access time: The time between when a memory access request is issued to main memory and when the requested information is available to cache/CPU.
Cycle time: The minimum time between requests to memory (greater than access time in DRAM to allow address lines to be stable).
–Memory bandwidth: The sustained data transfer rate between main memory and cache/CPU.

3 Processor-Memory (DRAM) Performance Gap
Current memory access latency: 100 or more CPU cycles.

4 X86 CPU Cache/Memory Performance Gap Example: AMD Athlon T-Bird vs. Intel PIII vs. P4
–AMD Athlon T-Bird, 1 GHz: L1: 64K INST, 64K DATA (3 cycle latency), both 2-way. L2: 256K, 16-way, 64 bit, latency: 7 cycles. L1, L2 on-chip.
–Intel PIII, 1 GHz: L1: 16K INST, 16K DATA (3 cycle latency), both 4-way. L2: 256K, 8-way, 256 bit, latency: 7 cycles. L1, L2 on-chip.
–Intel P4, 1.5 GHz: L1: 8K INST, 8K DATA (2 cycle latency), both 4-way. 96KB Execution Trace Cache. L2: 256K, 8-way, 256 bit, latency: 7 cycles. L1, L2 on-chip.
High L1, L2 data miss rates: main memory is accessed for data.
Source: http://www1.anandtech.com/showdoc.html?i=1360&p=15

5 Logical DRAM Organization
Figure: address lines A0…A10 (11 bits) select a word line in a 2,048 x 2,048 memory array of storage cells. A column decoder and sense amps & I/O connect the array to the data pins D and Q. Square root of bits per RAS/CAS.
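With a 2,048 x 2,048 array, each cell is selected by an 11-bit row address (sent with RAS) and an 11-bit column address (sent with CAS) over the same address lines. The sketch below illustrates that split. The function name is hypothetical, and the choice of the high bits for the row and the low bits for the column is an assumption, not something the slide specifies.

```python
ROW_BITS = 11  # 2**11 = 2,048 rows
COL_BITS = 11  # 2**11 = 2,048 columns


def split_address(cell_address):
    """Split a 22-bit cell address into (row, column).

    Assumption: the high 11 bits form the row (RAS) address and the
    low 11 bits form the column (CAS) address.
    """
    if not 0 <= cell_address < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address out of range for a 2,048 x 2,048 array")
    row = cell_address >> COL_BITS
    col = cell_address & ((1 << COL_BITS) - 1)
    return row, col


print(split_address(2049))  # (1, 1)
```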

6 CPU/Memory Performance Gap Reduction Techniques
Latency Reduction/Higher Bandwidth:
–Memory hierarchy with one or more levels of cache + software/hardware cache performance enhancement techniques (high level of locality essential).
–Low latency, wider, faster system bus.
–Independent memory bank interleaving.
–Lower latency, high-bandwidth memory interfaces: Current: DDR SDRAM, Direct Rambus (DRDRAM). Future: Magnetic RAM (MRAM).
–Chip-level integration of memory controller/main memory (IRAM-like).
Latency-tolerant Architectures:
–Simultaneous Multithreaded (SMT) architectures.
–Decoupled architectures: Separate memory access from normal processor operation (e.g. HiDISC).

7 The Memory Hierarchy (Latency Reduction/Higher Bandwidth)
–Registers: Part of the on-chip CPU datapath, 16-256 registers.
–Cache: One or more levels (Static RAM): Level 1: on-chip, 16-64K. Level 2: on- or off-chip, 128-512K. Level 3: off-chip, 128K-8M.
–Main Memory: DRAM, RDRAM, 16M-16G.
–Magnetic Disk: Interface: SCSI, RAID, IDE, 1394. 4G-100G.
–Optical Disk or Magnetic Tape.
Farther away from the CPU: lower cost/bit, higher capacity, increased access time/latency, lower throughput.

8 A Typical Memory Hierarchy (With Two Levels of Cache)
Levels, from faster to larger capacity:
–Processor (Control, Datapath) Registers: speed 1s ns.
–On-Chip Level One Cache (L1) and Second Level Cache (SRAM) (L2): speed 10s ns, size Ks bytes.
–Main Memory (DRAM): speed 100s ns, size Ms bytes.
–Virtual Memory, Secondary Storage (Disk): speed 10,000,000s ns (10s ms), size Gs bytes.
–Tertiary Storage (Tape): speed 10,000,000,000s ns (10s sec), size Ts bytes.

9 Improving Cache Performance
Miss Rate Reduction Techniques:
*Increased cache capacity
*Larger block size
*Higher associativity
*Victim caches
*Hardware prefetching of instructions and data
*Pseudo-associative caches
*Compiler-controlled prefetching
*Compiler optimizations
*Trace cache
Cache Miss Penalty Reduction Techniques:
*Giving priority to read misses over writes
*Sub-block placement
*Early restart and critical word first
*Non-blocking caches
*Second-level cache (L2)
Cache Hit Time Reduction Techniques:
*Small and simple caches
*Avoiding address translation during cache indexing
*Pipelining writes for fast write hits

10 Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time. + improves the factor, – hurts it.)
Technique                            MR   MP   HT   Complexity
Miss rate:
Larger Block Size                    +    –         0
Higher Associativity                 +         –    1
Victim Caches                        +              2
Pseudo-Associative Caches            +              2
HW Prefetching of Instr/Data         +              2
Compiler Controlled Prefetching      +              3
Compiler Reduce Misses               +              0
Trace Cache                          +              3
Miss penalty:
Priority to Read Misses                   +         1
Subblock Placement                        +    +    1
Early Restart & Critical Word 1st         +         2
Non-Blocking Caches                       +         3
Second Level Caches                       +         2
Hit time:
Small & Simple Caches                –         +    0
Avoiding Address Translation                   +    2
Pipelining Writes                              +    1

11 Latency Reduction/Higher Bandwidth
Wider Main Memory, System Bus: Memory width is increased to a number of words (usually the size of a cache block). Memory bandwidth is proportional to memory width, e.g. doubling the width of cache and memory doubles memory bandwidth.
Simple Interleaved Memory: Memory is organized as a number of banks, each one word wide.
–Simultaneous multiple-word memory reads or writes are accomplished by sending memory addresses to several memory banks at once.
–Interleaving factor: Refers to the mapping of memory addresses to memory banks, e.g. using 4 banks, bank 0 has all words whose address satisfies: (word address) mod 4 = 0
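The address-to-bank mapping described above can be sketched as follows. The function names are hypothetical, and the default of 4 banks simply mirrors the slide's example.

```python
def bank_of(word_address, num_banks=4):
    """Bank holding a word under simple interleaving: word address mod number of banks."""
    return word_address % num_banks


def words_in_bank(bank, num_words, num_banks=4):
    """Word addresses below num_words that map to the given bank."""
    return [a for a in range(num_words) if bank_of(a, num_banks) == bank]


print([bank_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(words_in_bank(0, 16))            # [0, 4, 8, 12]
```

Consecutive word addresses land in different banks, which is what lets the words of one cache block be accessed at the same time.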

12 Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth
–Simplest design: Everything is the width of one word.
–Wider memory, bus and cache.
–Narrow bus and cache with interleaved memory.

13 Memory Interleaving

14 Memory Width, Interleaving: An Example
Given the following system parameters with a single cache level L1:
Block size = 1 word
Memory bus width = 1 word
Miss rate = 3%
Miss penalty = 32 cycles (4 cycles to send address, 24 cycles access time/word, 4 cycles to send a word)
Memory access/instruction = 1.2
Ideal CPI (ignoring cache misses) = 2
Miss rate (block size = 2 words) = 2%
Miss rate (block size = 4 words) = 1%
The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 32) = 3.15
Increasing the block size to two words gives the following CPI:
–32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 32) = 3.54
–32-bit bus and memory, interleaved = 2 + (1.2 x 0.02 x (4 + 24 + 8)) = 2.86
–64-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 1 x 32) = 2.77
Increasing the block size to four words; resulting CPI:
–32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 32) = 3.54
–32-bit bus and memory, interleaved = 2 + (1.2 x 0.01 x (4 + 24 + 16)) = 2.53
–64-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 2 x 32) = 2.77
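The CPI figures in this example can be reproduced with a short script. This is a sketch under the slide's stated parameters. The function and constant names are hypothetical, and it assumes a word is 32 bits, so the 64-bit bus carries 2 words per transfer.

```python
IDEAL_CPI = 2.0
ACCESSES_PER_INSTR = 1.2
SEND_ADDR, ACCESS, SEND_WORD = 4, 24, 4  # cycles, from the slide


def cpi(miss_rate, miss_penalty):
    """CPI = ideal CPI + memory accesses per instruction x miss rate x miss penalty."""
    return IDEAL_CPI + ACCESSES_PER_INSTR * miss_rate * miss_penalty


def penalty_no_interleave(block_words, bus_words=1):
    """Each bus-wide transfer pays the full address + access + transfer cost."""
    return (block_words // bus_words) * (SEND_ADDR + ACCESS + SEND_WORD)


def penalty_interleaved(block_words):
    """Banks are accessed in parallel; only the word transfers are serialized."""
    return SEND_ADDR + ACCESS + block_words * SEND_WORD


print(round(cpi(0.03, penalty_no_interleave(1)), 2))     # 3.15
print(round(cpi(0.02, penalty_no_interleave(2)), 2))     # 3.54
print(round(cpi(0.02, penalty_interleaved(2)), 2))       # 2.86
print(round(cpi(0.02, penalty_no_interleave(2, 2)), 2))  # 2.77
print(round(cpi(0.01, penalty_no_interleave(4)), 2))     # 3.54
print(round(cpi(0.01, penalty_interleaved(4)), 2))       # 2.53
print(round(cpi(0.01, penalty_no_interleave(4, 2)), 2))  # 2.77
```

Without interleaving or a wider bus, the lower miss rate of larger blocks is outweighed by the longer miss penalty. Interleaving gives the lowest CPI in both cases.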

15 Simplified Asynchronous DRAM Read Timing
Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html

16 DRAM memory interfaces: Page Mode DRAM: Operation

17 DRAM memory interfaces: Simplified Asynchronous Fast Page Mode (FPM) DRAM Read Timing
Typical timing at 66 MHz: 5-3-3-3
For bus width = 64 bits = 8 bytes and cache block size = 32 bytes:
It takes 5+3+3+3 = 14 memory cycles, or 15 ns x 14 = 210 ns, to read a 32 byte block.
Read miss penalty for a CPU running at 1 GHz = 15 x 14 = 210 CPU cycles.
FPM DRAM speed is rated using tRAC ~ 50-70 ns.
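The block read time above is the sum of the burst timing pattern multiplied by the memory cycle time. A minimal sketch, with a hypothetical function name and the slide's rounded 15 ns cycle for a 66 MHz bus:

```python
def burst_read(timing, mem_cycle_ns, cpu_clock_ghz=1.0):
    """Return (memory cycles, read time in ns, CPU cycles) for one burst.

    timing: cycles per transfer in the burst, e.g. (5, 3, 3, 3).
    At cpu_clock_ghz = 1.0 one CPU cycle is 1 ns, so CPU cycles equal nanoseconds.
    """
    cycles = sum(timing)
    time_ns = cycles * mem_cycle_ns
    cpu_cycles = time_ns * cpu_clock_ghz
    return cycles, time_ns, cpu_cycles


print(burst_read((5, 3, 3, 3), 15))  # (14, 210, 210.0)
```

Four 8-byte transfers fill the 32 byte block, which is why the pattern has four entries.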

18 Simplified Asynchronous Extended Data Out (EDO) DRAM Read Timing
Extended Data Out DRAM operates in a similar fashion to Fast Page Mode DRAM, except that the data from one read is on the output pins at the same time the column address for the next read is being latched in.
Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html
Typical timing at 66 MHz: 5-2-2-2
For bus width = 64 bits = 8 bytes: Max. bandwidth = 8 x 66 / 2 = 264 MBytes/sec
It takes 5+2+2+2 = 11 memory cycles, or 15 ns x 11 = 165 ns, to read a 32 byte cache block.
Read miss penalty for a CPU running at 1 GHz = 11 x 15 = 165 CPU cycles.
EDO DRAM speed is rated using tRAC ~ 40-60 ns.
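The maximum bandwidth figure counts only the steady 2-cycle transfers within a burst. As a rough illustration, the sketch below also derives the bandwidth seen over a whole 32 byte block read, including the initial 5-cycle access. That second figure is not on the slide, and both function names are hypothetical.

```python
def peak_burst_bandwidth(bus_bytes, clock_mhz, cycles_per_transfer):
    """Peak MBytes/sec when every transfer takes cycles_per_transfer bus cycles."""
    return bus_bytes * clock_mhz / cycles_per_transfer


def effective_bandwidth(block_bytes, read_time_ns):
    """MBytes/sec over a whole block read (bytes per ns is GBytes/sec, hence x 1000)."""
    return block_bytes / read_time_ns * 1000


print(peak_burst_bandwidth(8, 66, 2))         # 264.0
print(round(effective_bandwidth(32, 165), 1)) # 193.9
```

The gap between the two numbers is the cost of the first access in each burst.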

19 Characteristics of Synchronous DRAM Interface Architectures

20 Synchronous Dynamic RAM, SDRAM Organization
DDR SDRAM: Similar organization but using four banks to allow data transfer on both rising and falling edges of the clock.

21 Simplified SDRAM Read Timing
Typical timing at 133 MHz (PC133 SDRAM): 5-1-1-1
For bus width = 64 bits = 8 bytes: Max. bandwidth = 133 x 8 = 1064 MBytes/sec
It takes 5+1+1+1 = 8 memory cycles, or 7.5 ns x 8 = 60 ns, to read a 32 byte cache block.
Read miss penalty for a CPU running at 1 GHz = 7.5 x 8 = 60 CPU cycles.

