COMP381 by M. Hamdi 1 Final Exam Review

COMP381 by M. Hamdi 2 Exam Format
It will cover material after the mid-term (cache to multiprocessors)
It is similar in style to the mid-term exam
We will have 6-7 questions in the exam
–One question: true/false or short questions covering general topics
–5-6 other questions requiring calculation

COMP381 by M. Hamdi 3 Memory Systems

COMP381 by M. Hamdi 4 Memory Hierarchy - the Big Picture
Problem: memory is too slow and/or too small
Solution: memory hierarchy
(Figure: the hierarchy runs from processor registers through L1 on-chip cache, L2 off-chip cache, and main memory (DRAM) down to secondary storage (disk); moving down, speed goes from fastest to slowest, size from smallest to biggest, and cost per byte from highest to lowest)

COMP381 by M. Hamdi 5 Why Hierarchy Works
The principle of locality
–Programs access a relatively small portion of the address space at any instant of time
–Temporal locality: recently accessed instructions/data are likely to be used again
–Spatial locality: instructions/data near recently accessed instructions/data are likely to be used soon
Result: the illusion of large, fast memory
(Figure: probability of reference across the address space 0 to 2^n - 1)

COMP381 by M. Hamdi 6 Cache Design & Operation Issues
Q1: Where can a block be placed in the cache? (Block placement strategy & cache organization)
–Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the cache? (Block identification)
–Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement)
–Random, LRU
Q4: What happens on a write? (Cache write policy)
–Write through, write back

COMP381 by M. Hamdi 7 Q1: Block Placement
Where can a block be placed in the cache?
–In one predetermined place - direct-mapped
Use a fragment of the address to calculate the block location in the cache
Compare the stored tag with the address tag to test if the block is present
–Anywhere in the cache - fully associative
Compare the tag to every block in the cache
–In a limited set of places - set-associative
Use an address fragment to calculate the set
Place in any block in the set
Compare the tag to every block in the set
Hybrid of direct mapped and fully associative
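As a concrete sketch of these placement schemes, the Python fragment below splits an address into tag, set index, and block offset. The cache geometry (32-bit addresses, 64-byte blocks, 256 sets) is an assumed example, not an exam parameter.

```python
# Sketch: splitting an address into tag / set index / block offset.
# Geometry (64-byte blocks, 256 sets) is an illustrative assumption.
BLOCK_SIZE = 64          # bytes per block
NUM_SETS = 256           # sets in the cache

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1   # log2(64) = 6
INDEX_BITS = NUM_SETS.bit_length() - 1      # log2(256) = 8

def decompose(addr):
    offset = addr & (BLOCK_SIZE - 1)                # byte within the block
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)  # which set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)        # identifies the block
    return tag, index, offset

tag, index, offset = decompose(0x12345678)
# Direct-mapped: the block can live only in set `index` (one block per set).
# Set-associative: the block may go in any way of set `index`.
# Fully associative: INDEX_BITS = 0; the tag is compared against every block.
```

The same decomposition covers all three organizations; only the number of sets (and therefore index bits) changes.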

COMP381 by M. Hamdi 8 Q2: Block Identification
Every cache block has an address tag and index that identify its location in memory
Hit when the tag and index of the desired word match (comparison done by hardware)
Q: What happens when a cache block is empty?
A: Mark this condition with a valid bit
Example entry: Tag/index 0x00001C0, Valid 1, Data 0xff083c2d

COMP381 by M. Hamdi 9 Cache Replacement Policy
Random
–Replace a randomly chosen line
LRU (Least Recently Used)
–Replace the least recently used line
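A minimal sketch of LRU replacement for a single cache set, using Python's `OrderedDict` to track recency; the class name and interface are illustrative, not from the course.

```python
from collections import OrderedDict

# Minimal sketch of LRU block replacement for one cache set.
# `ways` is the set's associativity; tags stand in for whole blocks.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, least recent first

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark most recently used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.ways:      # miss in a full set:
            self.blocks.popitem(last=False)    # evict least recently used
        self.blocks[tag] = None                # bring the new block in
        return False

s = LRUSet(ways=2)
s.access('A'); s.access('B'); s.access('A')
s.access('C')    # set is full: evicts B (least recently used), keeps A
```

A random policy would simply evict an arbitrary resident tag instead of the oldest one; LRU costs extra bookkeeping but exploits temporal locality.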

COMP381 by M. Hamdi 10 Write-through Policy
(Figure: on a write, the processor updates both the cache and main memory, so the two always hold the same value)

COMP381 by M. Hamdi 11 Write-back Policy
(Figure: on a write, the processor updates only the cache; main memory is updated later, when the modified block is replaced)

COMP381 by M. Hamdi 12 Cache Performance: Average Memory Access Time (AMAT), Memory Stall Cycles
The Average Memory Access Time (AMAT): the number of cycles required to complete an average memory access request by the CPU
Memory stall cycles per memory access: the number of stall cycles added to CPU execution cycles for one memory access
For an ideal memory: AMAT = 1 cycle, which results in zero memory stall cycles
Memory stall cycles per average memory access = (AMAT - 1)
Memory stall cycles per average instruction = memory stall cycles per average memory access x number of memory accesses per instruction = (AMAT - 1) x (1 + fraction of loads/stores)
(The "1 +" term counts the instruction fetch; loads/stores add the data accesses.)
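These identities can be checked numerically; the parameter values below (5% miss rate, 40-cycle miss penalty, 30% loads/stores) are assumed examples, not exam figures.

```python
# Checking the AMAT identities with assumed example numbers.
hit_time = 1                 # cycles for a hit
miss_rate = 0.05             # fraction of accesses that miss
M = 40                       # miss penalty in cycles
ls_fraction = 0.3            # fraction of instructions that are loads/stores

amat = hit_time + miss_rate * M                  # 1 + 0.05*40 = 3 cycles
stalls_per_access = amat - 1                     # 2 cycles
stalls_per_instr = stalls_per_access * (1 + ls_fraction)  # 2 * 1.3 = 2.6
```

With an ideal memory (miss_rate = 0) the same formulas give AMAT = 1 and zero stall cycles, matching the slide.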

COMP381 by M. Hamdi 13 Cache Performance
Unified cache: for a CPU with a single level (L1) of cache for both instructions and data and no stalls for cache hits:
CPUtime = IC x (CPI execution + mem stall cycles per instruction) x clock cycle time
CPUtime = IC x [CPI execution + memory accesses/instruction x miss rate x miss penalty] x clock cycle time
Split cache: for a CPU with separate (split) level one (L1) caches for instructions and data and no stalls for cache hits:
CPUtime = IC x (CPI execution + mem stall cycles per instruction) x clock cycle time
Mem stall cycles per instruction = instruction fetch miss rate x miss penalty + data memory accesses per instruction x data miss rate x miss penalty
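A worked sketch of the split-cache form of this formula; all parameter values below are assumed examples.

```python
# Split-cache CPU time, with assumed example parameters.
IC = 1_000_000           # instruction count
CPI_exec = 1.5           # base CPI with no memory stalls
clock = 1e-9             # 1 ns clock cycle time
M = 50                   # miss penalty, cycles
i_miss, d_miss = 0.02, 0.05     # instruction / data miss rates
d_per_instr = 0.4               # data accesses per instruction

# Mem stall cycles per instruction = I-fetch misses + data misses
stalls = 1 * i_miss * M + d_per_instr * d_miss * M   # 1.0 + 1.0 = 2.0
cpu_time = IC * (CPI_exec + stalls) * clock          # 3.5e-3 s
```

The unified-cache form is the same calculation with a single miss rate applied to all (1 + d_per_instr) accesses per instruction.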

COMP381 by M. Hamdi 14 Memory Access Tree for Unified Level 1 Cache
H1 = level 1 hit rate; 1 - H1 = level 1 miss rate; M = miss penalty
CPU memory access:
–L1 hit: fraction = hit rate = H1, access time = 1, stalls = H1 x 0 = 0 (no stall)
–L1 miss: fraction = (1 - hit rate) = (1 - H1), access time = M + 1, stall cycles per access = M x (1 - H1)
AMAT = H1 x 1 + (1 - H1) x (M + 1) = 1 + M x (1 - H1)
Stall cycles per access = AMAT - 1 = M x (1 - H1)

COMP381 by M. Hamdi 15 Memory Access Tree for Separate Level 1 Caches
CPU memory access splits into instruction and data accesses:
–Instruction L1 hit: access time = 1, stalls = 0
–Instruction L1 miss: access time = M + 1, stalls per access = % instructions x (1 - instruction H1) x M
–Data L1 hit: access time = 1, stalls = 0
–Data L1 miss: access time = M + 1, stalls per access = % data x (1 - data H1) x M
Stall cycles per access = % instructions x (1 - instruction H1) x M + % data x (1 - data H1) x M
AMAT = 1 + stall cycles per access

COMP381 by M. Hamdi 16 Cache Performance (various factors)
Cache impact on performance
–With and without cache
–Processor clock rate
Which one performs better: unified or split?
–Assuming the same total size
What is the effect of cache organization on cache performance: 1-way vs. 8-way set associative
–Tradeoffs between hit time and hit rate

COMP381 by M. Hamdi 17 Cache Performance (various factors)
What is the effect of write policy on cache performance: write back or write through, write allocate vs. no-write allocate
–Write through: stall cycles per memory access = % reads x (1 - H1) x M + % writes x M
–Write back: stall cycles per memory access = (1 - H1) x (M x % clean + 2M x % dirty)
What is the effect of cache levels on performance:
–Two levels: stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x M
–Three levels: stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x H3 x T3 + (1 - H1) x (1 - H2) x (1 - H3) x M
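The two-level formula can be evaluated directly; the hit rates and times below are assumed examples (note that H2 here is the fraction of L1 misses that hit in L2, as in the slide's formula).

```python
# Two-level cache stall cycles, with assumed example parameters.
H1 = 0.90     # L1 hit rate
H2 = 0.80     # fraction of L1 misses that hit in L2
T2 = 10       # L2 access time, cycles
M = 100       # main memory miss penalty, cycles

# (1-H1)*H2 of accesses are served by L2; (1-H1)*(1-H2) go to memory.
stalls = (1 - H1) * H2 * T2 + (1 - H1) * (1 - H2) * M
# = 0.1*0.8*10 + 0.1*0.2*100 = 0.8 + 2.0 = 2.8 cycles per access
amat = 1 + stalls   # 3.8 cycles
```

Compare with a single level: the same L1 alone would stall (1 - H1) x M = 10 cycles per access, so the L2 cuts average stalls by more than 3x in this example.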

COMP381 by M. Hamdi 18 Performance Equation
To reduce CPUtime, we need to reduce the cache miss rate

COMP381 by M. Hamdi 19 Reducing Misses (3 Cs)
Classifying cache misses: 3 Cs
–Compulsory — (misses even in an infinite-size cache)
–Capacity — (misses due to the size of the cache)
–Conflict — (misses due to the associativity and size of the cache)
How to reduce the 3 Cs (miss rate):
–Increase block size
–Increase associativity
–Use a victim cache
–Use a pseudo-associative cache
–Use a prefetching technique

COMP381 by M. Hamdi 20 Performance Equation
To reduce CPUtime, we need to reduce the cache miss penalty

COMP381 by M. Hamdi 21 Memory Interleaving – Reduce Miss Penalty
Default: CPU, cache, 4-byte bus, single memory
–Must finish accessing one word before starting the next access: (1+25+1) x 4 = 108 cycles for a 4-word block
Interleaving: CPU, cache, 4-byte bus, four memory banks (Memory 0 to Memory 3)
–Begin accessing one word and, while waiting, start accessing the other three words (pipelining): 1 + 25 + 4 x 1 = 30 cycles
–Requires 4 separate memories, each 1/4 size
–Spread out addresses among the memories
Interleaving works perfectly with caches

COMP381 by M. Hamdi 22 Memory Interleaving: An Example
Given the following system parameters with a single cache level L1:
–Block size = 1 word
–Memory bus width = 1 word
–Miss rate = 3%
–Miss penalty = 27 cycles (1 cycle to send the address, 25 cycles access time/word, 1 cycle to send a word)
–Memory accesses/instruction = 1.2
–Ideal CPI (ignoring cache misses) = 2
–Miss rate (block size = 2 words) = 2%
–Miss rate (block size = 4 words) = 1%
The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 27) = 2.97
Increasing the block size to two words gives the following CPI:
–32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 27) = 3.29
–32-bit bus and memory, interleaved = 2 + (1.2 x 0.02 x 28) = 2.67
Increasing the block size to four words gives the following CPI:
–32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 27) = 3.29
–32-bit bus and memory, interleaved = 2 + (1.2 x 0.01 x 30) = 2.36
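The slide's CPI arithmetic can be reproduced directly (all parameters come from the slide; values in comments are before the slide's rounding).

```python
# Reproducing the interleaving example's CPI arithmetic.
CPI_base = 2      # ideal CPI, ignoring cache misses
APM = 1.2         # memory accesses per instruction
M = 27            # 1-word miss penalty: 1 (address) + 25 (access) + 1 (word)

cpi_1w = CPI_base + APM * 0.03 * M               # 2.972
cpi_2w_plain = CPI_base + APM * 0.02 * (2 * M)   # 3.296 (two full accesses)
cpi_2w_int = CPI_base + APM * 0.02 * 28          # 2.672 (1 + 25 + 2x1 cycles)
cpi_4w_plain = CPI_base + APM * 0.01 * (4 * M)   # 3.296 (four full accesses)
cpi_4w_int = CPI_base + APM * 0.01 * 30          # 2.36  (1 + 25 + 4x1 cycles)
```

Note that without interleaving, the lower miss rate of larger blocks is outweighed by the multiplied miss penalty; with interleaving, larger blocks win.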

COMP381 by M. Hamdi 23 Cache vs. Virtual Memory
Motivation for virtual memory (physical memory size, multiprogramming)
The concept behind VM is almost identical to the concept behind cache, but with different terminology:
–Cache: Block ↔ VM: Page
–Cache: Cache Miss ↔ VM: Page Fault
Caches are implemented completely in hardware; VM is implemented in software, with hardware support from the CPU
Cache speeds up main memory access, while main memory speeds up VM access
Translation Look-aside Buffer (TLB)
How to calculate the size of page tables for a given memory system
How to calculate the size of pages given the size of the page table
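As a sketch of the page-table sizing calculation mentioned above, with assumed parameters (32-bit virtual addresses, 4 KB pages, 4-byte page-table entries):

```python
# Flat page-table size for assumed parameters.
VA_BITS = 32                  # virtual address width
PAGE_SIZE = 4 * 1024          # 4 KB pages
PTE_SIZE = 4                  # bytes per page-table entry

num_pages = 2 ** VA_BITS // PAGE_SIZE     # 2^20 = 1,048,576 pages
table_bytes = num_pages * PTE_SIZE        # 4 MB of page table per process
```

The reverse direction works the same way: given a table size budget, divide by the PTE size to get the number of entries, and the page size follows from 2^VA_BITS / entries.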

COMP381 by M. Hamdi 24 Virtual Memory: Definitions
Key idea: simulate a larger physical memory than is actually available
General approach:
–Break the address space up into pages
–Each program accesses a working set of pages
–Store pages: in physical memory as space permits, on disk when no space is left in physical memory
–Access pages using virtual addresses
(Figure: virtual memory map, with individual pages mapping to physical memory or to disk)

COMP381 by M. Hamdi 25 I/O Systems

COMP381 by M. Hamdi 26 I/O Systems

COMP381 by M. Hamdi 27 I/O Concepts
Disk performance
–Disk latency = average seek time + average rotational delay + transfer time + controller overhead
Interrupt-driven I/O
Memory-mapped I/O
I/O channels:
–DMA (Direct Memory Access)
–I/O communication protocols
Daisy chaining
Polling
I/O buses
–Synchronous vs. asynchronous
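A numeric sketch of the disk latency formula above; the disk parameters (seek time, RPM, transfer rate, request size, controller overhead) are assumed example values.

```python
# Disk latency = seek + rotational delay + transfer + controller overhead.
# All parameters below are illustrative assumptions.
seek_ms = 5.0                  # average seek time
rpm = 7200                     # spindle speed
transfer_rate = 100e6          # bytes per second
request_bytes = 4096           # one 4 KB request
controller_ms = 0.2            # controller overhead

rotational_ms = 0.5 * (60_000 / rpm)               # half a revolution ~ 4.17 ms
transfer_ms = request_bytes / transfer_rate * 1e3  # ~ 0.04 ms
latency_ms = seek_ms + rotational_ms + transfer_ms + controller_ms  # ~ 9.41 ms
```

The average rotational delay is half a revolution because, on average, the target sector is halfway around the track when the head arrives.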

COMP381 by M. Hamdi 28 RAID Systems
Examined various RAID architectures, RAID 0 to RAID 5: cost, performance (BW, I/O request rate)
–RAID 0: no redundancy
–RAID 1: mirroring
–RAID 2: memory-style ECC
–RAID 3: bit-interleaved parity
–RAID 4: block-interleaved parity
–RAID 5: block-interleaved distributed parity

COMP381 by M. Hamdi 29 Storage Architectures
Examined various storage architectures (pros and cons):
–DAS - Directly-Attached Storage
–NAS - Network-Attached Storage
–SAN - Storage Area Network

COMP381 by M. Hamdi 30 Multiprocessors

COMP381 by M. Hamdi 31 Motivation
Application needs
Amdahl's law (s = serial fraction, p = parallel fraction)
–T(n) = 1 / (s + p/n)
–As n → ∞, T(n) → 1/s
Gustafson's law
–T'(n) = s + n x p; T'(n) → ∞ as n → ∞ !!!!
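The two laws can be compared numerically; the serial fraction s = 0.05 below is an assumed example.

```python
# Amdahl vs. Gustafson speedup, with an assumed serial fraction.
s = 0.05          # serial fraction
p = 1 - s         # parallel fraction

def amdahl_speedup(n):
    # Fixed problem size: speedup = 1 / (s + p/n), capped at 1/s as n -> inf.
    return 1 / (s + p / n)

def gustafson_speedup(n):
    # Problem size grows with n: scaled speedup = s + n*p, unbounded.
    return s + n * p

amdahl_speedup(1000)      # ~19.6, already near the 1/s = 20 ceiling
gustafson_speedup(1000)   # 950.05
```

This is why Amdahl's law looks pessimistic for massive parallelism while Gustafson's looks optimistic: they model different assumptions about whether the workload scales with the machine.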

COMP381 by M. Hamdi 32 Flynn's Taxonomy of Computing
SISD (Single Instruction, Single Data):
–Typical uniprocessor systems that we've studied throughout this course
SIMD (Single Instruction, Multiple Data):
–Multiple processors simultaneously executing the same instruction on different data
–Specialized applications (e.g., image processing)
MIMD (Multiple Instruction, Multiple Data):
–Multiple processors autonomously executing different instructions on different data

COMP381 by M. Hamdi 33 Shared Memory Multiprocessors
(Figure: processor/cache (P/C) nodes, each with a memory bus (MB) and NIC, connected through a bus or custom-designed network to a shared memory)

COMP381 by M. Hamdi 34 MPP (Massively Parallel Processing) Distributed Memory Multiprocessors
(Figure: processor/cache (P/C) nodes, each with local memory (LM) and a NIC on its memory bus, connected by a custom-designed network)
MB: Memory Bus; NIC: Network Interface Circuitry

COMP381 by M. Hamdi 35 Cluster
(Figure: commodity nodes, each with processor/cache (P/C), memory (M), local disk, and a NIC on the I/O bus behind a bridge, connected by a commodity network such as Ethernet, ATM, or Myrinet)
LD: Local Disk; IOB: I/O Bus

COMP381 by M. Hamdi 36 Grid
(Figure: sites with processor/cache (P/C), shared memory (SM), local disk (LD), NIC, and an I/O controller (IOC) attached to a hub/LAN, interconnected over the Internet)

COMP381 by M. Hamdi 37 Multiprocessor Concepts
SIMD
–Applications (image processing)
MIMD
–Shared memory
Cache coherence problems
Bus scalability problems
–Distributed memory
Interconnection networks
Cluster of workstations

COMP381 by M. Hamdi 38 Preparation Strategy
Read this review to focus your preparation
–1 general question
–5-6 other questions
Around 50% for memory systems
Around 50% for I/O and multiprocessors
Go through the lecture notes
Go through the "training problems"
We will have more office hours for help
Good luck