Computer Architecture Key Points John Morris Electrical & Computer Engineering / Computer Science, The University of Auckland Iolanthe II drifts off Waiheke Island

Memory Bottleneck State-of-the-art processor: f = 3 GHz, t_clock = 330 ps, 1-2 instructions per cycle, ~25% of instructions reference memory → a memory response is needed every ~4 instructions x 330 ps ≈ 1.3 ns! Bulk semiconductor RAM: 100 ns+ for a 'random' access! → Processor will spend most of its time waiting for memory!

Memory Bottleneck Assume: * Clock speed f = 3 GHz, cycle time τ_cyc = 1/f = 330 ps * 32-bit = 4-byte machine word, so internal bandwidth = (bytes per word) x f = 4 x f = 12 GB/s * 64-bit PCI bus, f_bus = 32 MHz. Arrow width (roughly) indicates data bandwidth. Clearly a bottleneck! The bus would need to supply 3 GB/s for 25% load/store instructions!
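A quick arithmetic check of those figures (a sketch only; every number is taken from the slide above):

    /* Back-of-the-envelope bandwidth check; all figures come from the slide above. */
    #include <stdio.h>

    int main(void) {
        double f_cpu   = 3e9;    /* processor clock, Hz                    */
        double word    = 4;      /* bytes per 32-bit machine word          */
        double ls_frac = 0.25;   /* fraction of load/store instructions    */
        double f_bus   = 32e6;   /* PCI bus clock, Hz                      */
        double bus_w   = 8;      /* 64-bit PCI bus = 8 bytes per bus cycle */

        double internal  = word * f_cpu;           /* 12 GB/s processor-side demand */
        double needed    = ls_frac * word * f_cpu; /* ~3 GB/s of data traffic       */
        double available = bus_w * f_bus;          /* ~0.26 GB/s over the PCI bus   */

        printf("internal  = %.1f GB/s\n", internal / 1e9);
        printf("needed    = %.1f GB/s\n", needed / 1e9);
        printf("available = %.2f GB/s\n", available / 1e9);
        return 0;
    }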

Cache Small, fast memory Typically ~50 kbytes (1998) 2-cycle access time Same die as processor "Off-chip" cache possible: custom cache chip closely coupled to processor Use fast static RAM (SRAM) rather than slower dynamic RAM Several levels possible 2nd level of the memory hierarchy "Caches" most recently used memory locations "closer" to the processor closer = closer in time

Memory Bottleneck Assume: * Clock speed f = 3 GHz, cycle time τ_cyc = 1/f = 330 ps * 32-bit = 4-byte machine word, so internal bandwidth = (bytes per word) x f = 4 x f = 12 GB/s * 64-bit PCI bus, f_bus = 32 MHz. Arrow width (roughly) indicates data bandwidth. Cache provides small, very fast memory 'close' to the processor. 'Close' = close in time, ie with a high bandwidth connection

Very large  very slow Memory Bottleneck Assume * Clock speed, f = 3GHz Cycle time,  cyc = 1/f = 330ps * 32-bit = 4 byte machine word Internal bandwidth = (bytes per word) * f = 4 * f = 12 GB/s * 64-bit PCI bus, f bus = 32 MHz Arrow width (roughly) indicates data bandwidth Small  fast Larger  slower

Memory hierarchy & performance Usual metric is machine cycle time, τ_cyc = 1/f. Visible to programmer: Registers, < 1 cycle latency (respond in same cycle). Transparent to programmer: Level 1 (L1) cache, 2-cycle latency; L2 cache, 5-6 cycles; L3 cache, about 10 cycles; Main memory, 100+ cycles for a random access; Disc, > 1 ms or > 10^6 cycles. Effective memory access time: t_eff = Σ_i f_i t_i, where f_i = fraction of hits at level i, t_i = access time at level i
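A worked example of that formula (the hit fractions below are invented for illustration; the latencies follow the slide above):

    /* Effective access time t_eff = sum_i f_i * t_i. */
    #include <stdio.h>

    int main(void) {
        double t[] = { 2,    6,    10,   100  };  /* L1, L2, L3, main memory (cycles)   */
        double f[] = { 0.90, 0.06, 0.03, 0.01 };  /* assumed fraction of hits per level */

        double t_eff = 0.0;
        for (int i = 0; i < 4; ++i)
            t_eff += f[i] * t[i];

        printf("t_eff = %.2f cycles\n", t_eff);   /* 1.8 + 0.36 + 0.3 + 1.0 = 3.46 */
        return 0;
    }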

Cache - organisation Direct-mapped cache Each word in the cache has a tag. Assume: cache size = 2^k words, machine words = p bits, byte-addressed memory. m = log2(p/8) bits are not used to address words (m = 2 for 32-bit machines). Address format (p bits): tag (p-k-m bits) | cache address (k bits) | byte address (m bits)

Cache - organisation Direct-mapped cache (diagram: the memory address splits into tag | cache address | byte address; the k-bit cache address selects one of 2^k cache lines, each holding a (p-k-m)-bit tag and a p-bit data word; the stored tag is compared with the address tag to generate the Hit? signal between CPU and memory)

Cache - Conflicts Two addresses separated by 2^(k+m) bytes will hit the same cache location: addresses in which the k cache-address bits are the same map to the same cache line
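A small sketch of that address split (k = 12 and m = 2 are assumed example widths, not values fixed by the slides):

    /* Split a byte address into tag | cache address (k bits) | byte address (m bits). */
    #include <stdio.h>
    #include <stdint.h>

    #define K 12   /* log2(number of cache lines) - assumed example value */
    #define M 2    /* log2(bytes per word); m = 2 for a 32-bit word       */

    int main(void) {
        uint32_t addr = 0x12345678;

        uint32_t byte_addr  = addr & ((1u << M) - 1);
        uint32_t cache_addr = (addr >> M) & ((1u << K) - 1);
        uint32_t tag        = addr >> (M + K);

        printf("tag=0x%x cache=0x%x byte=0x%x\n", tag, cache_addr, byte_addr);
        /* An address 2^(k+m) bytes away has the same cache address: a conflict. */
        printf("conflicts with 0x%x\n", addr + (1u << (K + M)));
        return 0;
    }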

Cache - Conflicts When a word is modified in cache: Write-back cache - only writes data back when needed; a miss that evicts a modified line needs two memory accesses (write the modified word back, then read the new word). Write-through cache - a low-priority write to main memory is queued; the processor is delayed by reads only; the memory write occurs in parallel with other work; instruction and necessary data fetches take priority

Cache - Write-through or write-back? Write-through seems a good idea! but... multiple writes to the same location waste memory bus bandwidth → typical programs do better with write-back caches. However, often you can easily predict which will be best → some processors (eg PowerPC) allow you to classify memory regions as write-back or write-through

Cache - more bits Cache lines need some status bits as well as the tag bits: Valid - all set to false on power-up, set to true as words are loaded into the cache; Dirty - needed by a write-back cache (a write-through cache always queues the write, so lines are never 'dirty'). Cache line layout: Tag (p-k-m bits) | V (1 bit) | M (1 bit) | Data (p bits)
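One way to picture those fields in software (a sketch only; the field widths and the 32-byte line size are assumptions):

    /* A direct-mapped cache line with the status bits described above. */
    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 32            /* assumed line size */

    struct cache_line {
        uint32_t tag;                /* the p-k-m tag bits                 */
        bool     valid;              /* false at power-up                  */
        bool     dirty;              /* set on write in a write-back cache */
        uint8_t  data[LINE_BYTES];   /* the cached data                    */
    };

    /* At power-up every line is marked invalid (and clean). */
    void cache_reset(struct cache_line *lines, int n_lines) {
        for (int i = 0; i < n_lines; ++i) {
            lines[i].valid = false;
            lines[i].dirty = false;
        }
    }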

Cache – Improving Performance Conflicts (addresses 2^(k+m) bytes apart) degrade cache performance: lower hit rate. Murphy's Law operates - addresses are never random! Some locations 'thrash' in cache, continually replaced and restored. Put another way: ideal cache performance depends on uniform access to all parts of memory, which never happens in real programs!

Cache - Fully Associative All tags are compared at the same time Words can use any cache line

Cache - Fully Associative Each tag is compared at the same time; any match → hit. Avoids 'unnecessary' flushing. Replacement: Least Recently Used (LRU) - needs extra status bits (cycles since last accessed). Hardware cost is high: extra comparators and wider tags (p-m bits vs p-k-m bits)

Cache - Set Associative Each line - two words, two comparators only → 2-way set associative

Cache - Set Associative n-way set associative caches: n can be small (2, 4, 8) for the best performance at reasonable hardware cost - used by most high-performance processors. Replacement policy: LRU choice from n; a reasonable LRU approximation uses 1 or 2 bits per line, set on access and cleared / decremented by a timer - choose a cleared line for replacement
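A sketch of that lookup and the 1-bit LRU approximation (the 4 ways, 64 sets and 32-byte lines are assumed example parameters; the periodic clearing of the 'referenced' bits by a timer is not shown):

    /* n-way set-associative lookup with a 1-bit 'referenced' LRU approximation. */
    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS        4
    #define SETS        64
    #define OFFSET_BITS 5          /* 32-byte lines */

    struct way {
        uint32_t tag;
        bool     valid;
        bool     referenced;       /* set on access, periodically cleared by a timer */
    };

    static struct way cache[SETS][WAYS];   /* zero-initialised: all lines invalid */

    /* Returns true on a hit; on a miss, *victim is the way to refill. */
    bool lookup(uint32_t addr, int *victim) {
        uint32_t set = (addr >> OFFSET_BITS) % SETS;
        uint32_t tag = addr >> OFFSET_BITS;   /* index bits kept in the tag for simplicity */

        for (int w = 0; w < WAYS; ++w) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                cache[set][w].referenced = true;
                return true;
            }
        }
        *victim = 0;                          /* fall back to way 0 if all look 'recent' */
        for (int w = 0; w < WAYS; ++w) {
            if (!cache[set][w].valid || !cache[set][w].referenced) { *victim = w; break; }
        }
        return false;
    }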

Cache - Locality of Reference  Temporal Locality Same location will be referenced again soon Access same data again Program loops - access same instruction again Caches described so far exploit temporal locality  Spatial Locality Nearby locations will be referenced soon Next element of an array Next instruction of a program

Cache - Line Length Spatial locality → use very long cache lines: fetch one datum and its neighbours are fetched too. PowerPC 601 (Motorola/Apple/IBM), first of the single-chip Power processors: 64 sets, 8-way set associative, 32 bytes per line; 32 bytes (8 instructions) fetched into the instruction buffer in one cycle; 64 x 8 x 32 = 16 Kbyte total

Cache - Separate I- and D-caches Unified cache: instructions and data in the same cache. Two caches - * Instructions * Data → increases total bandwidth. MIPS R10000: 32 Kbyte Instruction, 32 Kbyte Data; the instruction cache is pre-decoded! (32 → 36 bits); Data: 8-word (64-byte) line, 2-way set associative, 256 sets. Replacement policy?

COMPSYS 304 Computer Architecture Memory Management Units Reefed down - heading for Great Barrier Island

Memory Management Unit Virtual Address Space Each user has a "private" address space (diagram: each user's address space, e.g. User D's, mapped separately)

Virtual Addresses Mappings between user space and physical memory created by OS

Memory Management Unit (MMU) Responsible for VIRTUAL → PHYSICAL address mapping. Sits between CPU and cache; the cache operates on physical addresses (mostly - some research on VA caches). (Diagram: CPU →(VA)→ MMU →(PA)→ Cache → Main Memory, carrying data or instructions)

MMU - operation (diagram: the upper q-k bits of the virtual address - the virtual page number - index the page table, which supplies the physical page address)

MMU - Virtual memory space Page Table Entries can also point to disc blocks Valid bit Set: page in memory address is physical page address Cleared: page “swapped out” address is disc block address MMU hardware generates page fault when swapped out page is requested Allows virtual memory space to be larger than physical memory Only “working set” is in physical memory Remainder on paging disc
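A minimal software sketch of that valid-bit test (the page size, field names and widths are assumptions for illustration; a real MMU does this in hardware and raises the fault to the OS):

    /* Page-table lookup: valid bit set -> physical page; cleared -> page fault. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_BITS 12                 /* 4 KB pages, assumed */

    struct pte {
        unsigned valid : 1;              /* 1: page in memory, 0: swapped out          */
        uint32_t frame;                  /* physical page number OR disc block address */
    };

    uint32_t translate(const struct pte *page_table, uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        const struct pte *e = &page_table[vpn];

        if (!e->valid) {
            /* Page fault: the OS would read disc block e->frame into a free page,
             * update the PTE (possibly writing back an old dirty page) and retry. */
            fprintf(stderr, "page fault: vpn %u is on disc block %u\n", vpn, e->frame);
            exit(EXIT_FAILURE);
        }
        return (e->frame << PAGE_BITS) | offset;
    }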

Page Fault (diagram: a request for a page whose valid bit is cleared raises a page fault; the page table entry holds the disc block address of the swapped-out page)

MMU – Page faults Very expensive! Gap in access times: main memory ~100+ ns, disc ~1+ ms - a factor of 10^4 slower!! + May require write-back of old (but modified) page + May require reading of Page Table Entries from disc! A good way to make a system thrash!

MMU – Access control Provides additional protection to the programmer: pages can be marked Read only or Execute only. Can prevent wayward programmes from corrupting their own programme code or vital data. Protection is hardware! The MMU will raise an exception if an illegal access is attempted; the OS traps the exception and processes it

MMU Inverted page tables A scheme which saves memory for page tables: one PTE per page of physical memory, with a hash function used to locate entries → collisions probable → possibly slower. Sharing → map virtual pages for several users to the same physical page; good for sharing program code, and data too (read/write control provided by OS) → saves physical memory, reduces pressure on main memory

MMU TLB Cache for page table entries. Enables the MMU to translate VA → PA in time! Can be quite small (typically a few dozen to a few hundred entries); often fully associative - the small size avoids one 'cost' of an FA cache, since only a few comparators are needed. TLB Coverage: the amount of memory covered by TLB entries - the size of a program for which VA → PA translation will be fast
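For example, TLB coverage is just entries x page size (the 64 entries and 4 KB pages below are assumed figures, not from the slides):

    /* TLB coverage = number of entries x page size. */
    #include <stdio.h>

    int main(void) {
        unsigned entries   = 64;          /* assumed TLB size   */
        unsigned page_size = 4 * 1024;    /* assumed 4 KB pages */
        printf("coverage = %u KB\n", entries * page_size / 1024);   /* 256 KB */
        return 0;
    }

A program whose working set fits within that coverage sees fast translations; beyond it, TLB misses add page-table accesses to every new page touched.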

Memory Hierarchy - Operation

System Interface Unit Tasks Control the bus: match cache line length to bus width, follow the bus protocol (Request / Grant / Data cycles), manage 'burst' transactions. Burst transactions → greater bus efficiency: more 'work' (data cycles) per transaction, so the overhead (request | grant | address) is a smaller fraction of the total bus cycles per transaction. Maintain transaction queues: read (high priority), write (low priority); reads check the write queue for the latest copy of data

System Interface Unit: Bus efficiency Split-phase transactions: separate address and data buses, separate address and data phases. Overlap → greater bus utilisation: multiple transactions 'in flight' at any time, and slow peripheral devices don't 'hog' the bus and prevent fast transactions (eg memory) from accessing it. (Diagram: overhead cycles vs 'work' cycles; the 2nd transaction starts before the 1st completes)

System Interface Unit: Bus efficiency Single-purpose bus (graphics, memory): simpler, faster. Single direction (CPU → graphics buffer); single device (eg memory) → simpler protocol (only one type of device); point-to-point wiring → shorter, faster; single driver (no need for a delay when switching from read to write)

Superscalar Processors Superpipelined Deep pipeline (>5 stages). Hazards and dependencies limit depth; each stage has overhead - registers needed → larger circuit → speed reduction. >8 stages → decrease in efficiency. vs Superscalar → next slide

Superscalar Processors Superscalar Multiple functional units Integer ALUs, FPUs, branches, load/store Floating point typically 3 internal stages Usually several integer ALUs per FPU Addressing, loop calcs need integer ALU Instruction issue unit is now more complex Determines which instructions can be issued in each cycle What data is ready? Which functional units are free? Typically tries to issue 4 instructions / cycle Achieves 2-3 instructions / cycle on average Out of order execution Instructions executed when data is available Dependent instructions may stall while later ones execute Number of functional units > instruction issue width eg 6 FUs, max 4 instructions / cycle

Speculation Data prefetch Try to get data into cache well in advance → no stall for a memory read when the data is actually needed. PowerPC: dcbt – data cache block touch, advice for the system – a low-priority read. Pentium: prefetchTx (x = 0,1,2); semantics vary between Pentium 3 and Pentium 4 (Pentium 4 fetches into L2 cache only). The compiler can detect many patterns, eg sequential access of array elements: for( j=0; j<n; j++ ) sum = sum + x[j]; The programmer can also insert pre-fetch instructions. Speculative because the data may not be needed
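One way a programmer (or compiler) might insert such a prefetch into that loop, using the GCC/Clang __builtin_prefetch() builtin, which typically compiles to instructions like dcbt or prefetchTx; the prefetch distance of 16 elements is an arbitrary assumption, not a tuned value:

    /* Sum a vector, prefetching data a fixed distance ahead of its use. */
    double sum_with_prefetch(const double *x, int n) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j) {
            if (j + 16 < n)
                __builtin_prefetch(&x[j + 16], 0, 1);  /* read access, low temporal-locality hint */
            sum = sum + x[j];
        }
        return sum;
    }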

Speculation - branching Branches are expensive Stall pipeline More expensive as pipeline depth increases! Fetching useless instructions wastes bandwidth!  Couple Branch unit with Instruction Issue unit Conditional branches if ( cond ) s1 else s2 Execute both s1 and s2 If functional units and data available Use idle resources! Squash results from wrong branch when value of cond known MIPS allows 4 streams of speculative execution Pentium 4: Up to 126 ‘in flight’ ? From a web article by an obvious Intel fan Starts with “The Pentium still kicks butt.” Not a good flag for an objective article! Probably counts instruction issue unit buffers + system interface transactions too!

Parallel Processing

Communications bottleneck! (Again!) Limits the ability to write efficient parallel systems. Exception: a small group of embarrassingly parallel systems with very high computation : communication ratios - long computation on small data sets, and the results communicated to the master PE are small. Ideal: n PEs → time t_n = t_1/n. Actual: t_n > t_1/n. Eventually t_n > t_(n-1): adding PEs slows things down! Communications and thread management overhead
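A toy model of that effect (a sketch; the single-PE time and the per-PE communication overhead constant are invented numbers):

    /* Ideal t_n = t_1/n versus a simple model with per-PE communication overhead. */
    #include <stdio.h>

    int main(void) {
        double t1    = 100.0;   /* single-PE run time, arbitrary units      */
        double comms = 2.0;     /* assumed overhead added per additional PE */

        for (int n = 1; n <= 16; n *= 2) {
            double ideal  = t1 / n;
            double actual = t1 / n + comms * (n - 1);  /* eventually grows with n */
            printf("n=%2d  ideal=%6.2f  actual=%6.2f\n", n, ideal, actual);
        }
        return 0;
    }

Past some point the overhead term dominates and the run time starts to rise again - exactly the t_n > t_(n-1) case above.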

Parallel Processing Flynn’s Taxonomy Simple, but useful starting point Classification based on I (instruction stream) and D (data stream) 4 classes SISD (sequential PEs) SIMD (many simple PEs, vector machines, MMX, Altivec), MISD (no known examples), MIMD (general parallel processor)

Parallel Processing – Programming Models  Shared Memory Model All PEs see a common address space Trivial data distribution (none!) Threads of computation need explicit synchronization Synchronization is an overhead!  Dataflow or Functional  Message Passing Details follow …

Parallel Processing – Programming Models  Dataflow Model Execution is data-driven Used as model for both hardware and software Dataflow machines Functional languages Theoretically important Produce provably correct programs Slow in practice Cilk – Hybrid dataflow/imperative

Parallel Processing – Programming Models  Message Passing Execution is control-driven Threads run in their own address spaces on each PE Data transferred by sending and receiving messages Available as libraries of functions Can be invoked from any language Commonly used Message Passing Interface (MPI) C, FORTRAN, … libraries readily available Parallel Virtual Machine (PVM) First, generally considered less efficient than MPI Two basic primitive operations Send send( destination_PE, data_address, n_bytes ) Receive receive( destination_PE, data_address, n_bytes )

Parallel Processing – Programming Models  Message Passing Two basic primitive operations Send send( destination_PE, data_address, n_bytes ) Receive receive( destination_PE, data_address, n_bytes ) Data distributed explicitly by programmer Using send’s Synchronisation is implicit Receive waits for sender to complete computation and send data Considerer ‘low level’ as a programming language Programmer does everything!

Parallel Processing – Programming Models  Message Passing Efficient Runs faster than shared memory Cilk (hybrid dataflow/imperative) faster though Programmer usually knows more about problem Codes minimum data distribution and synchronization Libraries are easy to implement Can use any communications network Ethernet ATM Myrinet … etc Popular Most used in practice Libraries are widely available Programming concept is simple Even though it requires slightly more work!

Architectures Overriding messages  Communication overhead is the killer If communication patterns do not match needs of problem to be solved Then parallel overheads will swamp benefit from adding PEs Main overhead is sending and receiving data Message overheads Synchronization dead time  Coarse grain is the key Give away low level parallelism Minimize overheads Larger messages Longer running threads Reduce communication : computation ratio

Architectures SIMD Large numbers of small PEs connected in a grid: easy to build, and can solve certain problems efficiently. Variations: PEs are trivial – just an ALU; simple PE, eg a μprocessor; complex PE, eg Pentium, with local memory. Systolic arrays: linear communication patterns, very limited range of problems. The idea appears in the ALUs of modern processors - MMX (Intel), Altivec (Motorola), … - useful for graphics operations

Architectures Vector machines Three main components: Address generation unit - streams vector data efficiently to and from memory, handles the address computation 'overhead'; Vector registers - fast FIFO queues for data; ALUs - very fast floating point. Efficient for a wide range of problems requiring vector and matrix computations, including sparse matrices (eg diagonal). Expensive: $n x 10^6 each

Architectures Dataflow machines Data driven, not control driven. The dataflow graph exposes all possible parallelism - originally expected to be able to extract maximum parallelism and therefore maximum speedup! Fine-grain dataflow died because of communication overhead. Coarse-grain dataflow has potential, but it is difficult to attract interest in non-mainstream (= non-Pentium) architectures - new processor development is expensive: $n x 10^8 for each new 10^8-transistor chip

Architectures Dataflow machines Data driven not control driven Idea survives in instruction issue unit of high performance superscalars It issues instructions as Data is available Functional units are available Checks dependencies, hazards Finds instruction level parallelism (ILP) Limited parallelism Issues maximum ~4-8 instructions in each cycle

Architectures Network architectures Crossbar - ideal: any PE ↔ any PE direct communication, but only possible with low orders. Ethernet - essentially a linear common bus; switches and world-wide 'grids' provide additional paths and increase useful inter-PE bandwidth. Rectangular grids - easily implemented on 2-D circuit boards. Hypercubes - a reasonable compromise: effective bandwidth between arbitrary PEs, low-order interconnection nodes, useful theoretical properties, simple definition of sub-cubes. A match between the interconnection pattern and the problem shape is vital - otherwise gains from additional PEs are lost in comms overhead!