Memory System Performance Chapter 3


Memory System Performance (Chapter 3)
Dagoberto A. R. Justo, PPGMAp, UFRGS

Introduction
- The clock rate of your CPU alone does not determine its performance.
- Nowadays, memory speed is becoming the limiting factor.
- Hardware designers create architectures that try to overcome memory speed limitations.
- However, the hardware is efficient only under certain assumptions about how the programs running on it are written.
- Thus, careful program design is essential to obtain high performance.
- We introduce these issues by looking at simple matrix operations and modeling their performance, given certain characteristics of the CPU architecture.

Memory Issues: Definitions
- Consider an architecture with several caches.
- How long does it take for a particular cache to receive a word, once requested?
- How many words can be retrieved per unit of time, once the first one has arrived?
- Assume we request a word of memory and receive a block of b words containing the desired word.
- Latency: the time l to receive the first word after the request (usually in nanoseconds).
- Bandwidth: the number of words received per unit of time with one request (usually in millions of words per second).
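
A simple timing model follows from these two definitions. As a hedged sketch (the symbols t_b, l, and B are ours, not the slides'), the time to deliver a block of b words is roughly

$$ t_b \;\approx\; l + \frac{b-1}{B}, \qquad l = \text{latency},\quad B = \text{bandwidth (words per unit time)}. $$

This matches the machines below: with b = 1 the cost is just the latency, and with a pipelined bus the remaining words arrive one per cycle after the first.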

Hypothetical Machine 1 (no cache)
- Clock rate: 1 GHz (cycle time 1 nanosecond, 1 ns).
- Two multiply-add units can perform 2 multiplies and 2 adds per cycle.
- Fast CPU: 4 flops per cycle, so the peak performance is 4 Gflops.
- Latency: 100 ns, i.e., it takes 100 cycles x 1 ns to obtain a word once requested.
- Block size: 1 word (= 8 bytes = 64 bits).
- Thus, the bandwidth is 10 megawords per second (Mwords/s).
- However, the machine is very slow in practice. Consider a dot product:
  - Each step is a multiply and an add, accessing 2 elements of 2 vectors and accumulating the result in a register, say s.
  - 2 elements arrive every 200 ns, and the 2 operations are performed in 1 cycle.
  - Thus, the machine takes about 100.5 ns per flop, i.e. roughly 10 Mflops, a factor of 400 slower than the peak.
(Timeline figure: two 100 ns word fetches, then 2 flops, repeated.)
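
For reference, a minimal Fortran sketch of the dot product being modeled (the function and variable names are ours); each iteration reads one element of a and one of b and performs one multiply-add, so on Machine 1 every iteration pays two full 100 ns latencies:

   ! Dot product s = sum(a(i)*b(i)); memory-bound on Machine 1.
   function dot(a, b, n) result(s)
      integer, intent(in) :: n
      real(kind=8), intent(in) :: a(n), b(n)
      real(kind=8) :: s
      integer :: i
      s = 0.0d0
      do i = 1, n
         s = s + a(i)*b(i)   ! 2 memory reads and 2 flops per iteration
      end do
   end function dot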

Hypothetical Machine 2 (cache added, but the problem is also different)
- Consider the matrix multiplication problem.
- Cache size: 32 Kbytes; block size: 1 word (= 8 bytes).
- Two things are different here. First, the problem is different:
  - The dot product performs on average 1 operation per data item: 2n operands, 2n operations.
  - Matrix multiplication has data reuse: O(n^3) operations for O(n^2) data.
- Second, this machine has a cache: an intermediate storage area that the CPU can access in 1 cycle (low latency) and that holds enough data to take advantage of data reuse, the important property of matrix multiplication.
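
The reuse difference can be made explicit (a back-of-the-envelope note in our notation, consistent with the counts on this slide): the ratio of operations to data moved is what lets a cache help.

$$ \text{dot product: } \frac{2n\ \text{ops}}{2n\ \text{words}} = O(1), \qquad \text{matrix multiply: } \frac{2n^3\ \text{ops}}{3n^2\ \text{words}} = O(n). $$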

Hypothetical Machine 2 (continued)
- Same processor, clock rate 1 GHz (1 ns); latency 100 ns (from memory to cache); block size 1 word.
- Let n = 32 and let A, B, C be 32x32 matrices. Consider the multiplication C = A*B.
- Each matrix needs 1024 words; 3 matrices x 1024 words x 8 bytes per word = 24 KB, which fits entirely in the 32 KB cache.
- Time to fetch the 2 input matrices into the cache: 2048 words x 100 ns = 204.8 microseconds.
- The matrix multiply performs 2n^3 operations (2 x 32^3 = 64K ops).
- Compute time: 2 x 32^3 / 4 ns, about 16.4 microseconds (4 flops per cycle).
- Thus the flop rate is 64K ops / (204.8 + 16.4) microseconds, roughly 296 Mflops.
- A lot better than 10 Mflops, but nowhere near the peak of 4 Gflops.
- Reusing data in this way is called temporal locality: many operations on the same data occur close together in time.
(Timeline figure: 100 ns word fetches followed by repeated bursts of 4 flops per cycle.)
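
A minimal Fortran sketch of the triple loop being modeled (a naive ordering; the slides do not prescribe a particular loop order). The comment notes why the cache helps: O(n^3) flops are performed on O(n^2) data.

   ! Naive matrix multiply C = A*B for n x n matrices.
   ! Each element of A and B is reused n times, so a cache that holds
   ! all three matrices turns most accesses into 1-cycle hits.
   subroutine matmul_naive(a, b, c, n)
      integer, intent(in) :: n
      real(kind=8), intent(in)  :: a(n,n), b(n,n)
      real(kind=8), intent(out) :: c(n,n)
      integer :: i, j, k
      do j = 1, n
         do i = 1, n
            c(i,j) = 0.0d0
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)   ! 2 flops per inner iteration
            end do
         end do
      end do
   end subroutine matmul_naive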

Hypothetical Machine 3 (increase the memory/cache bandwidth)
- Increase the block size b from 1 word to 4 words.
- As stated, this implies that the data path from memory to cache is 128 bits wide (4 x 32 bits per word).
- For the dot product algorithm:
  - A request for a(1) brings in a(1:4). The element a(1) takes 100 ns, but a(2), a(3), and a(4) arrive at the same time.
  - Similarly, a request for b(1) brings in b(1:4). The request for b(1) is issued one cycle after the one for a(1), but the bus is busy bringing a(1:4) into the cache.
  - Thus, after 201 ns the dot product computation starts and proceeds 1 cycle at a time, completing with a(4) and b(4).
  - Next, the request for a(5) brings in a(5:8), and so on.
- Thus, the CPU performs approximately 8 flops every roughly 204 ns, about 1 operation per 25 ns, or 40 Mflops.
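
As a quick check of the quoted rate (simple arithmetic using the slide's rough figure of 8 flops per 204 ns):

$$ \frac{8\ \text{flops}}{204\ \text{ns}} \approx 3.9\times 10^{7}\ \text{flops/s} \approx 40\ \text{Mflops}, \quad \text{i.e. roughly } 25\ \text{ns per flop}. $$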

Hypothetical Machine 3, analyzed in terms of cache-hit ratio
- Cache hit ratio = the number of memory accesses found in the cache / the total number of memory accesses.
- In this case, the first of every 4 accesses is a miss and the remaining 3 are hits, so the cache hit ratio is 75%.
- Assume the dominant overhead is the misses. Then 25% of the memory latency is the average overhead per access, or 25 ns (25% of the 100 ns memory latency).
- Because the dot product performs one operation per word accessed, this also works out to 40 Mflops.
- A more accurate estimate is (75% x 1 + 25% x 100) ns per word, i.e. 25.75 ns, or about 38.8 Mflops.
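
This is the standard average-memory-access-time calculation; written out (with h the hit ratio, t_c the cache access time, and t_m the memory latency, notation ours):

$$ t_{\text{avg}} = h\,t_c + (1-h)\,t_m = 0.75 \times 1\ \text{ns} + 0.25 \times 100\ \text{ns} = 25.75\ \text{ns per access}. $$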

Actual Implementations
- 128-bit-wide buses are expensive.
- The usual implementation pipelines a 32-bit bus so that the remaining words in the line (block) arrive one per clock cycle after the first item is received.
- That is, instead of 4 words arriving together after the 100 ns latency, the 4 items arrive after 100, 101, 102, and 103 ns.
- However, a multiply-add can start as each item arrives, so the result is essentially the same: 40 or so Mflops.
(Timeline figure: one wide 128-bit fetch followed by a burst of 4 flops, versus four pipelined 32-bit fetches each followed by 4 flops.)

Spatial Locality Issue
- Clearly, the dot product takes advantage of the consecutiveness of the vector elements.
- This is called spatial locality: the elements are close together in memory.
- Consider a different problem: the matrix-vector multiplication y = A*x.
- In C (row-major storage), the elements of a column are not consecutive; consecutive elements of a column are separated in memory by an entire row.
- In this case the accesses are not spatially local, and essentially every access to every element of every column is a cache miss.

In Fortran
- The matrix A is stored by columns in memory.
(Figure: consecutive memory addresses hold consecutive elements of a column, e.g. address #800A holds A(3,3) and the next address #800B holds A(4,3).)
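
The layout can be summarized by the usual column-major addressing formula (our notation; for an m x n array with element size w bytes and base address "base"):

$$ \text{address}(A(i,j)) = \text{base} + \big((i-1) + (j-1)\,m\big)\,w. $$

So stepping down a column (i to i+1) moves w bytes, while stepping along a row (j to j+1) jumps m*w bytes.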

Sum All Elements of a Matrix
Consider the problem of computing the sum of all elements of a 1000x1000 matrix B:

   S = 0.D0
   do i = 1, 1000
      do j = 1, 1000
         S = S + B(i,j)
      end do
   end do

- This code performs very poorly.
- S itself fits in cache.
- Since the inner loop varies the column index j, moving along a row of B, consecutive accesses are far apart in memory and unlikely to be in the same cache line.
- Every access experiences the maximum latency delay (100 ns).

Sum All Elements of a Matrix (loop order changed)
Changing the order of the loops:

   S = 0.D0
   do j = 1, 1000
      do i = 1, 1000
         S = S + B(i,j)
      end do
   end do

- Now the inner loop accesses B by columns.
- The elements within a column are consecutive in memory, so each memory access brings in multiple useful elements at a time (4 for our model machine), and the performance is reasonable on this machine.

Peak Floating-Point Performance vs. Peak Memory Bandwidth
- The performance issue: the peak floating-point performance is bounded by the peak memory bandwidth.
- For fast microprocessors, the ratio is about 100 MFLOPS per MByte/s of bandwidth.
- For large-scale vector processors, it is about 1 MFLOP per MByte/s of bandwidth.
- The problem is addressed by modifying algorithms to hide the large memory latencies.
- Some compilers can transform simple codes to obtain better performance.
- These modifications are typically unnecessary, but they do not hurt the computation and sometimes help.
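
The bound can be written roughly as follows (a hedged sketch in our notation, where q is the number of flops the algorithm performs per byte moved from memory and BW is the sustained memory bandwidth):

$$ \text{attainable flop rate} \;\lesssim\; q \times BW \quad \text{(flops/s)}. $$

An algorithm with little reuse (small q, like the dot product) is limited by BW; one with high reuse (like matrix multiplication) can approach the CPU's peak.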

Hiding Memory Latency
- Consider the example of getting information from the Internet using a browser. What can you do to reduce the wait time?
- While reading one page, we anticipate the next pages we will read and begin fetching them in advance. This corresponds to pre-fetching pages in anticipation of their being read.
- We open multiple browser windows and start accesses in each window in parallel. This corresponds to multiple threads running in parallel.
- We request many pages in order. This corresponds to pipelining with spatial locality.

Multi-threading to Hide Memory Latency
Consider the matrix-vector multiplication c = A*b. Each row-by-vector inner product is an independent computation, so create a different thread for each one, as in this pseudocode:

   do k = 1, n
      c(k) = create_thread( dot_product, A(k,:), b )
   end do

Running as separate threads:
- On the first cycle, the first thread requests the first pair of data elements for the first row and waits for the data to arrive.
- On the second cycle, the second thread requests the first pair of elements for the second row and waits for the data to arrive.
- And so on, for l units of time (the latency).
- Then the first thread performs its computation and requests more data.
- Then the second thread performs its computation and requests more data.
- And so on, so that after the first latency of l cycles, a computation is performed on every cycle.
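
For a concrete, hedged version of the same idea, one could use OpenMP instead of the slide's hypothetical create_thread; a minimal sketch, assuming an OpenMP-capable Fortran compiler:

   ! Thread-per-row matrix-vector product c = A*b, expressed with OpenMP.
   ! Each row's dot product is independent, so rows proceed in parallel and
   ! one row's memory stalls overlap another row's computation.
   subroutine matvec_threads(a, b, c, n)
      integer, intent(in) :: n
      real(kind=8), intent(in)  :: a(n,n), b(n)
      real(kind=8), intent(out) :: c(n)
      integer :: k
      !$omp parallel do private(k)
      do k = 1, n
         c(k) = dot_product(a(k,:), b)   ! Fortran intrinsic dot product of row k with b
      end do
      !$omp end parallel do
   end subroutine matvec_threads

On real hardware the overlap comes from the many outstanding memory requests issued by the concurrent threads, exactly as in the cycle-by-cycle picture above.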

Multithreading, block size = 1
(Timeline figure: the successive threads issue fetches for A(1,1), A(1,2), A(2,1), A(2,2), ..., A(10,1) on successive cycles; once the latency has elapsed, each thread's flop overlaps the other threads' outstanding fetches, so a flop completes nearly every cycle.)

Pre-fetching to Hide Memory Latency
- Advance the loads ahead of when the data is needed.
- The problem is that the cache space may be needed for other computation between the pre-fetch and the use of the pre-fetched data.
- This is no worse than not pre-fetching at all, because the pre-fetch is typically handled by an independent memory functional unit.
- The dot product (or a vector sum) again provides an example:
  - a(1) and b(1) are requested in a loop.
  - The processor sees that a(2) and b(2) are needed for the next iteration and requests them on the next cycle, and so on.
  - Assume the first item takes 100 ns to obtain and the requests for the others are issued on consecutive cycles.
  - The processor waits about 101 cycles for the first pair, performs the computation, and the next pair is already there on the following cycle, ready for computation, and so on.
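
Pre-fetching is usually done by the hardware or the compiler, but the effect can be imitated in source code. A minimal sketch of the idea (manual fetch-ahead of the next operand pair; this is our own illustration, not something prescribed by the slides, and a real compiler may reorder the loads anyway):

   ! Dot product with the next pair of operands loaded one iteration early,
   ! so the loads for iteration i+1 can overlap the multiply-add of iteration i.
   function dot_fetch_ahead(a, b, n) result(s)
      integer, intent(in) :: n
      real(kind=8), intent(in) :: a(n), b(n)
      real(kind=8) :: s, acur, bcur, anext, bnext
      integer :: i
      s = 0.0d0
      anext = a(1)
      bnext = b(1)
      do i = 1, n - 1
         acur  = anext
         bcur  = bnext
         anext = a(i+1)       ! issue the next loads early ("pre-fetch")
         bnext = b(i+1)
         s = s + acur*bcur    ! compute while the next loads are in flight
      end do
      s = s + anext*bnext     ! last pair
   end function dot_fetch_ahead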

Impact on Memory Bandwidth
- Pre-fetching and multithreading increase the bandwidth demanded of memory. Compare:
  - A 32-thread computation experiencing a cache hit ratio of 25% (because all threads share the same cache and memory system): the memory bandwidth requirement is estimated to be 3 GB/s.
  - A single-thread computation experiencing a 90% cache hit ratio: the memory bandwidth requirement is estimated to be 400 MB/s.
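
These figures are consistent with a simple estimate, assuming (our assumption, not stated on this slide) that the processor touches one 4-byte word per cycle at 1 GHz and that only misses generate memory traffic:

$$ BW \approx (1-h) \times 4\ \text{bytes} \times 10^{9}/\text{s}: \qquad 0.75 \times 4 \times 10^{9} = 3\ \text{GB/s}, \qquad 0.10 \times 4 \times 10^{9} = 0.4\ \text{GB/s}. $$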