Eng. Mohammed Timraz, Electronics & Communication Engineer, University of Palestine, Faculty of Engineering and Urban Planning, Software Engineering Department


1 Eng. Mohammed Timraz, Electronics & Communication Engineer, University of Palestine, Faculty of Engineering and Urban Planning, Software Engineering Department. Computer System Architecture (ESGD2204), Saturday, 3rd April 2010. Chapter 7, Lecture 12.

2 Chapter 7 ADVANCED COMPUTER ORGANIZATION AND DESIGN

3 Tracking Technology Performance Trends
Drill down into 4 technologies: disks, memory, networks, processors.
Compare ~1980 (old) vs. ~2009 (modern/new):
– Performance milestones in each technology.
– Compare bandwidth vs. latency improvements in performance over time.
Bandwidth: number of events per unit time.
– E.g., Mbits/second over a network, Mbytes/second from a disk.
Latency: elapsed time for a single event.
– E.g., one-way network delay in microseconds, average disk access time in milliseconds.

4 Disks: Archaic (Old) vs. Modern (Newfangled)
CDC Wren I, 1980: 3600 RPM; 0.03 GBytes capacity; Tracks/Inch: 800; Bits/Inch: 9,550; three 5.25" platters; bandwidth: 0.6 MBytes/sec; latency: 48.3 ms; cache: none.
Seagate 373453, 2009: 15,000 RPM (4X); 73.4 GBytes (2500X); Tracks/Inch: 64,000 (80X); Bits/Inch: 533,000 (60X); four 2.5" platters (in a 3.5" form factor); bandwidth: 86 MBytes/sec (140X); latency: 5.7 ms (8X); cache: 8 MBytes.
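The improvement factors in parentheses follow directly from the raw numbers on the slide; a minimal sketch reproducing them (the slide rounds 143X down to 140X and 8.5X to 8X):

```python
# Raw figures from the slide: CDC Wren I (1980) vs. Seagate 373453 (2009).
old = {"bw_mb_s": 0.6, "latency_ms": 48.3}
new = {"bw_mb_s": 86.0, "latency_ms": 5.7}

bw_gain = new["bw_mb_s"] / old["bw_mb_s"]         # bandwidth: higher is better
lat_gain = old["latency_ms"] / new["latency_ms"]  # latency: lower is better

print(f"Bandwidth improved {bw_gain:.0f}X, latency only {lat_gain:.1f}X")
```

The asymmetry of these two ratios is exactly the "latency lags bandwidth" trend the following slides track across all four technologies.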

5 Latency Lags Bandwidth (for last ~20 years)
Performance milestones:
– Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x latency, 143x BW).
(Latency = simple operation w/o contention, BW = best-case.)

6 Memory: Archaic (Old) vs. Modern (New)
1980 DRAM (asynchronous): 0.06 Mbits/chip; 64,000 xtors, 35 mm²; 16-bit data bus per module, 16 pins/chip; 13 Mbytes/sec; latency: 225 ns; no block transfer.
2009 Double Data Rate Synchronous (clocked) DRAM: 256 Mbits/chip (4000X); 256,000,000 xtors, 204 mm²; 64-bit data bus per DIMM, 66 pins/chip (4X); 1600 Mbytes/sec (120X); latency: 52 ns (4X); block transfers (page mode).

7 Latency Lags Bandwidth (for last ~20 years)
Performance milestones:
– Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x latency, 120x BW).
– Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x).
(Latency = simple operation w/o contention, BW = best-case.)

8 LANs: Archaic (Old) vs. Modern (New)
Ethernet 802.3 — year of standard: 1978; 10 Mbits/s link speed; latency: 3000 μsec; shared media; coaxial cable.
Ethernet 802.3ae — year of standard: 2003; 10,000 Mbits/s link speed (1000X); latency: 190 μsec (15X); switched media; Category 5 copper wire.
Coaxial cable: copper core, insulator, braided outer conductor, plastic covering.
Twisted pair ("Cat 5, 6, 7, 8"): 4 twisted pairs in a bundle; copper, 1 mm thick, twisted to avoid the antenna effect.

9 Latency Lags Bandwidth (for last ~20 years)
Performance milestones:
– Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x latency, 1000x BW).
– Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x).
– Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x).
(Latency = simple operation w/o contention, BW = best-case.)

10 CPUs: Archaic (Old) vs. Modern (New)
1982 Intel 80286: 12.5 MHz; 2 MIPS (peak); latency 320 ns; 134,000 xtors, 47 mm²; 16-bit data bus, 68 pins; microcode interpreter, separate FPU chip; no caches.
2001-9 Intel Pentium 4 / Core: 1500 MHz = 1.5 GHz (120X); 4500 MIPS (peak) (2250X); latency 15 ns (20X); 42,000,000 xtors, 217 mm²; 64-bit data bus, 423 pins; 3-way superscalar; dynamic translation to RISC; superpipelined (22 stages); out-of-order execution; on-chip 8 KB data cache, 96 KB instruction trace cache, 256 KB L2 cache.

11 Latency Lags Bandwidth (for last ~20 years)
Performance milestones:
– Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x BW).
– Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x).
– Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x).
– Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x).
CPU improvement high, memory improvement low: the "Memory Wall".
(Latency = simple operation w/o contention, BW = best-case.)

12 Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X).
– Hit Rate: the fraction of memory accesses found in the upper level.
– Hit Time: time to access the upper level = RAM access time + time to determine hit/miss.
Miss: data must be retrieved from a block in the lower level (Block Y).
– Miss Rate = 1 − Hit Rate.
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the upper level.
Hit Time << Miss Penalty (≈500 instructions on the 21264!).
[Figure: processor exchanging data with upper-level memory (Blk X), which in turn exchanges blocks with lower-level memory (Blk Y).]

13 Cache Measures
Hit rate: fraction of accesses found in that level.
– Usually so high that we talk about the miss rate instead.
– Miss-rate fallacy: miss rate relates to average memory access time the way MIPS relates to CPU performance — a convenient but potentially misleading proxy.
Average memory-access time = Hit time + Miss rate × Miss penalty (in ns or clocks).
Miss penalty: time to replace a block from the lower level, including time to replace it in the CPU:
– Replacement time: time to make room in the upper level for the block.
– Access time: time to reach the lower level = f(latency to lower level).
– Transfer time: time to transfer the block = f(BW between upper and lower levels).
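The average memory-access time formula above is a simple weighted sum; a minimal sketch with hypothetical numbers (1-cycle hit, 2% miss rate, 100-cycle penalty — illustrative values, not from the slide):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays hit_time,
    and the missing fraction additionally pays miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical single-level cache: 1-cycle hit, 2% misses, 100-cycle penalty.
print(amat(1, 0.02, 100))  # 1 + 0.02 * 100 = 3.0 cycles
```

Note how a small miss rate still dominates the average when the penalty is large, which is why the optimizations on the later slides target miss rate and miss penalty separately.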

14 4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

15 Q1: Where can a block be placed in the upper level?
Memory block 12 placed in an 8-block cache:
– Fully associative: anywhere in the cache ("full mapped").
– Direct mapped: only in cache block (12 mod 8) = 4.
– 2-way set associative: anywhere in set (12 mod 4) = 0.
Set-associative mapping: set = block number mod (number of sets).
[Figure: 8-block cache (blocks 0-7) above a 32-block memory (blocks 0-31); the allowed cache blocks for memory block 12 shown in blue.]
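The placement rule can be sketched in a few lines, using the slide's geometry (8 cache blocks; associativity ranging from direct mapped to fully associative):

```python
def candidate_blocks(block_num, num_blocks, ways):
    """Cache block indices where a memory block may be placed.
    ways=1 is direct mapped; ways=num_blocks is fully associative."""
    num_sets = num_blocks // ways
    s = block_num % num_sets              # set-associative mapping rule
    return [s * ways + w for w in range(ways)]

print(candidate_blocks(12, 8, 1))  # direct mapped: set (12 mod 8) = 4 -> [4]
print(candidate_blocks(12, 8, 2))  # 2-way: set (12 mod 4) = 0 -> [0, 1]
print(candidate_blocks(12, 8, 8))  # fully associative -> all 8 blocks
```

The three calls reproduce the three placements shown in the slide's figure for memory block 12.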

16 Q2: How is a block found if it is in the upper level (the cache)?
Bit fields in the memory address used to access the cache (example: one-way / direct-mapped, data capacity 16 KB = 256 entries × 512 bits / 8):
– Tag: 18 bits.
– Index: 8 bits (selects one of 256 cache entries).
– Block offset: 6 bits (64 Bytes/block), viewable as 4 bits for one of 16 words/block plus 2 bits for one of 4 Bytes/word.
The index selects the cache set; a tag is stored and compared for each block — no need to check the index or offset bits.
Increasing associativity shrinks the index and expands the tag.
(Virtual memory is analogous: the "cache block" is the page, addressed by block (page) address bits plus an in-page offset.)
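The field extraction is just shifts and masks; a sketch using the slide's geometry (6 offset bits, 8 index bits, 18 tag bits in a 32-bit byte address):

```python
OFFSET_BITS = 6   # 64 Bytes per block
INDEX_BITS = 8    # 256 cache entries

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)
```

The cache uses the index to pick the entry, compares the stored tag against the extracted tag to decide hit or miss, and uses the offset only to pick the byte within the block.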

17 Q3: Which block to replace after a miss?
(After start-up, the cache is nearly always full.)
Easy if direct mapped: only 1 block ("1 way") per index.
If set associative or fully associative, must choose:
– Random ("Ran"): easy to implement (1 bit/way if only 2-way), but not the best.
– LRU (Least Recently Used): best, but hard to implement if > 8-way; other LRU approximations also beat Random.
Miss rates for 3 cache sizes and associativities:

  Data size   2-way LRU   2-way Ran   4-way LRU   4-way Ran   8-way LRU   8-way Ran
  16 KB       5.2%        5.7%        4.7%        5.3%        4.4%        5.0%
  64 KB       1.9%        2.0%        1.5%        1.7%        1.4%        1.5%
  256 KB      1.15%       1.17%       1.13%       1.13%       1.12%       1.12%

Random picks give essentially the same low miss rate as LRU for large caches.
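A minimal LRU sketch, modeled with Python's OrderedDict (a hardware cache would instead track recency with a few bits per way; this is only an illustrative model of a fully associative cache):

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with LRU replacement (illustrative sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # ordered: least -> most recently used

    def access(self, block):
        """Return True on a hit; on a miss, evict the least recently used."""
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[block] = None
        return False

c = LRUCache(2)
hits = [c.access(b) for b in ["A", "B", "A", "C", "B"]]
print(hits)  # the final access to B misses: B was the LRU block when C arrived
```

Replaying the same trace with random eviction can produce either a hit or a miss on the last access, which is why LRU's edge over Random shows up in the miss-rate table above, especially for small caches.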

18 Q4: Write policy: what happens on a write?
Policy: Write-Through — data written to the cache block is also written to the next lower-level memory. Write-Back — new data is written only to the cache; the lower level is updated just before a written (dirty) block leaves the cache, since the cache then holds the only true copy.
Debugging: Write-Through is easier; Write-Back is harder.
Can read misses force writes? Write-Through: no. Write-Back: yes (this used to slow some reads; now hidden by a write buffer).
Do repeated writes reach the lower level? Write-Through: yes (memory busier). Write-Back: no.
Additional option: let writes to an uncached address allocate a new cache line ("write-allocate"); otherwise such writes just go through to the lower level.

19 Write Buffers for Write-Through Caches
[Figure: Processor → Cache → Lower-Level Memory, with a write buffer between cache and memory holding addresses & data awaiting write-through to the lower levels.]
Q: Why a write buffer? A: So the CPU does not stall on writes.
Q: Why a buffer — why not just one register? A: Bursts of writes are common.
Q: Are Read-After-Write (RAW) hazards an issue for the write buffer? A: Yes! Either drain the buffer before the next read, or check the buffer's addresses on a read miss.
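The RAW check can be sketched as a small FIFO that is searched by address before a read miss goes to memory (an illustrative software model, not a hardware description; the depth of 4 is a hypothetical choice):

```python
from collections import deque

class WriteBuffer:
    """FIFO of pending (address, data) write-throughs (illustrative sketch)."""
    def __init__(self, depth=4):
        self.entries = deque()
        self.depth = depth

    def add_write(self, addr, data):
        if len(self.entries) >= self.depth:
            self.drain_one()                 # a real CPU would stall here
        self.entries.append((addr, data))

    def check_read(self, addr):
        """RAW check: return buffered data if addr has a pending write."""
        for a, d in reversed(self.entries):  # newest pending write wins
            if a == addr:
                return d
        return None                          # safe to read lower-level memory

    def drain_one(self):
        addr, data = self.entries.popleft()  # oldest write reaches lower level
        # lower-level memory update omitted in this sketch

wb = WriteBuffer()
wb.add_write(0x100, 42)
print(wb.check_read(0x100))  # pending write found -> 42
print(wb.check_read(0x200))  # no pending write -> None, go to memory
```

Searching the buffer on every read miss is the cheaper alternative the slide mentions; draining the whole buffer first is simpler but stalls the read longer.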

20 5 Basic Cache Optimizations
Reducing miss rate:
1. Larger block size (reduces compulsory, "cold", misses).
2. Larger cache size (reduces capacity misses).
3. Higher associativity (reduces conflict misses).
(... and multiprocessors also have cache coherence misses — the "4 Cs".)
Reducing miss penalty:
4. Multilevel caches: total miss rate = ∏_k (local miss rate of level k), the product over all levels k.
Reducing hit time (minimal cache latency):
5. Giving reads priority over writes, since the CPU is waiting: a read completes before earlier writes still sitting in the write buffer.
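The multilevel product rule in item 4 can be checked numerically; a sketch with hypothetical local miss rates (5% for L1, and 20% of the accesses that reach L2):

```python
from math import prod

# Hypothetical two-level hierarchy (illustrative values, not from the slide):
# L1 misses 5% of all accesses; L2 misses 20% of the accesses it receives.
local_miss = [0.05, 0.20]

global_miss = prod(local_miss)   # total miss rate = product of local rates
print(global_miss)               # 0.05 * 0.20 = 0.01: 1% of accesses reach memory
```

The product form holds because each level only sees the accesses that missed in all levels above it, so the fractions multiply.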

