
1 Memory System Performance Chapter 3
Dagoberto A. R. Justo, PPGMAp UFRGS, 4/29/2019

2 Introduction
The clock rate of your CPU does not alone determine its performance.
Nowadays, memory speed is becoming the limiting factor.
Hardware designers create architectures that try to overcome memory speed limitations.
However, the hardware is efficient only for certain patterns of program behavior.
Thus, careful program design is essential to obtain high performance.
We introduce these issues by looking at simple matrix operations and modeling their performance, given certain characteristics of the CPU architecture.

3 Memory Issues -- Definitions
Consider an architecture with several caches.
How long does it take to receive a word in a particular cache, once requested?
How many words can be retrieved per unit of time, once the first one is received?
Assume we request a word of memory and receive a block of data of size b words containing the desired word.
Latency: the time l to receive the first word after the request (usually in nanoseconds).
Bandwidth: the number of words per unit of time received with one request (usually in millions of words per second).
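As a rough illustration of how these two quantities combine (a standard simple model, not stated on the slide): with latency l and bandwidth BW in words per unit time, the time to receive a block of b words is approximately

  T_{\text{block}} \approx l + \frac{b-1}{BW}

For the hypothetical machines below with l = 100 ns and b = 1, this is simply 100 ns per word requested.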

4 Hypothetical Machine 1 (no Cache)
Clock rate: 1 GHz (1 cycle = 1 nanosecond, 1 ns).
Two multiply-add units can perform 2 multiplies and 2 adds per cycle.
Fast CPU: 4 FLOPs per cycle, so the peak performance is 4 GFLOPS.
Latency: 100 ns (100 cycles x 1 ns) to obtain a word once requested.
Block size: 1 word (= 8 bytes = 64 bits).
Thus, the bandwidth is 10 megawords per second (Mwords/s).
However, the machine is very slow in practice. Consider a dot product operation:
Each step is a multiply and an add, accessing one element of each of the 2 vectors and accumulating the result in a register, say s.
The 2 operands arrive every 200 ns, and the 2 operations take 1 cycle.
Thus, the machine spends 100.5 ns per FLOP, about 10 MFLOPS, a factor of 400 slower than the peak.
[Timing diagram: fetch 1 word (100 ns), fetch 1 word (100 ns), then 2 FLOPs in 1 cycle, repeated.]
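A minimal Fortran sketch of the dot product being modeled (names are mine; the timing in the comments is the slide's model for Machine 1, not a measurement):

  ! Dot product s = sum of a(i)*b(i).  On hypothetical Machine 1 (no cache,
  ! 100 ns latency, 1-word blocks) each iteration waits ~200 ns for a(i)
  ! and b(i) and then spends 1 cycle on the multiply-add: ~100.5 ns per FLOP.
  function dot(a, b, n) result(s)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8) :: s
    integer :: i
    s = 0.0d0
    do i = 1, n
      s = s + a(i) * b(i)   ! 2 memory fetches, then 1 multiply-add cycle
    end do
  end function dot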

5 Hypothetical Machine 2 (Cache BUT The Problem is Different)
Consider the matrix multiplication problem.
Cache size: 32 Kbytes.
Block size: 1 word (= 8 bytes).
Two things are different here:
The problem is different: the dot product performs one operation per data element on average (2n operands, 2n operations), while matrix multiplication has data reuse, with O(n^3) operations on O(n^2) data.
This machine has a cache: an intermediate storage area that the CPU accesses in 1 cycle (low latency) and that holds enough data to take advantage of data reuse -- the important aspect of matrix multiplication.
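To make the reuse comparison concrete (my arithmetic, using the operation and data counts above):

  \text{dot product: } \frac{2n \text{ ops}}{2n \text{ words}} = 1 \text{ op/word}
  \qquad
  \text{matrix multiply: } \frac{2n^3 \text{ ops}}{3n^2 \text{ words}} \approx \frac{2n}{3} \text{ ops/word}

For n = 32 this is roughly 21 operations per word, so a cache that can hold the data pays off.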

6 Same processor, clock rate: 1 GHz (1 ns)
Latency: 100 ns (from memory to cache).
Block size: 1 word.
Let n = 32 and let A, B, C be 32x32 matrices. Consider multiplying C = A*B.
Each matrix needs 1024 words; 3 matrices x 1024 words x 8 bytes per word = 24 KB, which fits entirely in the 32 KB cache.
Time to fetch the 2 input matrices into the cache: 2048 words x 100 ns = 204.8 microseconds.
Perform 2n^3 operations for the matrix multiply (2 x 32^3 = 64K operations).
Compute time: 2 x 32^3 / 4 ns, about 16.4 microseconds (4 FLOPs per cycle).
Thus, the flop rate is 64K / (204.8 + 16.4) microseconds, about 296 MFLOPS.
A lot better than 10 MFLOPS, but nowhere near the peak of 4 GFLOPS.
The notion of reusing data in this way is called temporal locality (many operations on the same data occur close in time).
[Timing diagram: one 100 ns fetch per word, then groups of 4 FLOPs per cycle.]
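A minimal Fortran sketch of the multiplication being timed (names and loop order are mine; this is the naive algorithm with 2n^3 operations):

  ! Naive C = A*B for n = 32.  All three matrices (24 KB) fit in the 32 KB
  ! cache, so after the initial fetches every access is a cache hit.
  subroutine matmul32(A, B, C, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: A(n,n), B(n,n)
    real(8), intent(out) :: C(n,n)
    integer :: i, j, k
    C = 0.0d0
    do j = 1, n
      do k = 1, n
        do i = 1, n
          C(i,j) = C(i,j) + A(i,k) * B(k,j)   ! 2*n**3 operations in total
        end do
      end do
    end do
  end subroutine matmul32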

7 Hypothetical Machine 3 (Increase the Memory/Cache Bandwidth)
Increase the block size b from 1 word to 4 words.
As stated, this implies that the data path from memory to cache is 128 bits wide (4 words x 32 bits per word).
For the dot product algorithm:
A request for a(1) brings in a(1:4); a(1) takes 100 ns, but a(2), a(3), and a(4) arrive at the same time.
Similarly, a request for b(1) brings in b(1:4). The request for b(1) is issued one cycle after that for a(1), but the bus is busy bringing a(1:4) into the cache.
Thus, after 201 ns the dot product computation starts and proceeds 1 cycle at a time, completing with a(4) and b(4).
Next, the request for a(5) brings in a(5:8), and so on.
Thus, the CPU performs approximately 8 flops in roughly 204 ns, or 1 operation per ~25 ns, or about 40 MFLOPS.

8 Hypothetical Machine 3 -- Analyzed In Terms of Cache-Hit Ratio
Cache hit ratio = the number of memory accesses found in cache / the total number of memory accesses.
In this case, the first of every 4 accesses is a miss and the remaining 3 are hits.
Thus, the cache hit ratio is 75%.
Assume the dominant overhead is the misses. Then 25% of the memory latency is the average overhead per access, or 25 ns (25% of the 100 ns memory latency).
Because the dot product has one operation per word accessed, this also works out to 40 MFLOPS.
A more accurate estimate is (75% x 1 + 25% x 100) ns per word, or 25.75 ns per word, i.e. 38.8 MFLOPS.

9 Actual Implementations
128-bit-wide buses are expensive.
The usual implementation is to pipeline a 32-bit bus so that the words in the line (block) arrive one per clock cycle after the first item is received.
That is, instead of 4 words all arriving after a 100 ns latency, the 4 items arrive at roughly 100, 101, 102, and 103 ns.
However, a multiply-add can start as each item arrives, so the result is the same -- roughly 40 MFLOPS.
[Timing diagram: one 128-bit fetch followed by groups of FLOPs, versus four pipelined 32-bit fetches each followed by FLOPs.]

10 Spatial Locality Issue
It is clear that the dot product takes advantage of the consecutiveness of the elements of the vectors.
This is called spatial locality: the elements are close together in memory.
Consider a different problem: the matrix-vector multiplication y = A*x.
In C (row-major storage), the elements of a column are not consecutive: they are separated by a stride equal to the row length.
In this case the accesses are not spatially local, and essentially every access to every element of every column is a cache miss.
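For contrast, a minimal Fortran sketch of y = A*x written so that the inner loop walks down a column (names are mine); in Fortran's column-major storage, discussed on the next slide, this access pattern is spatially local:

  ! Column-oriented (saxpy) form of y = A*x: the inner loop walks down a
  ! column of A, so consecutive accesses are adjacent in memory in Fortran.
  subroutine matvec_saxpy(A, x, y, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: A(n,n), x(n)
    real(8), intent(out) :: y(n)
    integer :: i, j
    y = 0.0d0
    do j = 1, n
      do i = 1, n
        y(i) = y(i) + A(i,j) * x(j)
      end do
    end do
  end subroutine matvec_saxpy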

11 In Fortran, the matrix A is stored by columns in memory:
[Memory-layout diagram: consecutive addresses hold consecutive elements of a column, e.g. address #800A holds A(3,3) and the next address #800B holds A(4,3).]
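A tiny Fortran sketch (mine, not from the slide) that makes the column-by-column storage visible:

  ! Fortran stores A column by column: A(1,1), A(2,1), ..., A(n,1), A(1,2), ...
  ! are adjacent in memory.
  program column_major
    implicit none
    integer :: a(3,3), flat(9), i
    a = reshape([(i, i = 1, 9)], [3, 3])   ! fill in storage (column) order
    flat = reshape(a, [9])                  ! linear view of the same elements
    print *, 'first column A(:,1) =', a(:,1)   ! 1 2 3  -> a column is contiguous
    print *, 'memory order        =', flat     ! 1 2 3 4 5 6 7 8 9
  end program column_major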

12 Sum All Elements of a Matrix
Consider the problem of computing the sum of all elements of a 1000x1000 matrix B:

  S = 0.D0
  do i = 1, 1000
    do j = 1, 1000
      S = S + B(i,j)
    end do
  end do

This code performs very poorly:
S fits in cache, but the inner loop varies the column index j, so consecutive accesses B(i,j) and B(i,j+1) are far apart in memory (separated by a whole column of 1000 elements).
They are therefore unlikely to be in the same cache line, and essentially every access experiences the maximum latency delay (100 ns).

13 Sum All Elements of a Matrix
Changing the order of the loops:

  S = 0.D0
  do j = 1, 1000
    do i = 1, 1000
      S = S + B(i,j)
    end do
  end do

Now the inner loop accesses B down the columns.
The elements of a column are consecutive in memory, so each memory access brings in multiple useful elements at a time (4 for our model machine), and the performance is reasonable for this machine.

14 Peak Floating Point Performance vs. Peak Memory Bandwidth
The performance issue: peak floating point performance is bounded by the peak memory bandwidth.
For fast microprocessors, the ratio is about 100 MFLOPS per MByte/s of bandwidth; for large-scale vector processors it is about 1 MFLOP per MByte/s of bandwidth.
One way to address the problem is to modify the algorithms to hide the large memory latencies.
Some compilers can transform simple codes to obtain better performance; such modifications are then typically unnecessary, but they don't hurt the computation and sometimes help.

15 Hiding Memory Latency
Consider the example of getting information from the Internet using a browser. What can you do to reduce the wait time?
While reading one page, we anticipate the next pages we will read and begin fetching them in advance; this corresponds to pre-fetching data in anticipation of its use.
We open multiple browser windows and begin accesses in each window in parallel; this corresponds to multiple threads running in parallel.
We request many pages in order; this corresponds to pipelining with spatial locality.

16 Multi-threading To Hide Memory Latency
Consider the matrix-vector multiplication c = A*b.
Each row-by-vector inner product is an independent computation, so create a different thread for each one (pseudocode):

  do k = 1, n
    c(k) = create_thread( dot_product, A(k,:), b )
  end do

As separate threads:
On the first cycle, the first thread requests the first pair of data elements for the first row and waits for the data to arrive.
On the second cycle, the second thread requests the first pair of elements for the second row and waits for the data to arrive.
And so on, for l units of time (the latency).
Then the first thread performs a computation and requests its next data; then the second thread performs a computation and requests its next data; and so on.
Thus, after the first latency of l cycles, a computation is performed on every cycle.
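One concrete way to realize the thread-per-row idea is OpenMP (my sketch; create_thread above is pseudocode, and OpenMP is not the slide's stated mechanism):

  ! Each row's inner product is independent, so the rows can be divided
  ! among threads; here OpenMP stands in for explicit thread creation.
  subroutine matvec_threads(A, b, c, n)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: A(n,n), b(n)
    real(8), intent(out) :: c(n)
    integer :: k
    !$omp parallel do private(k)
    do k = 1, n
      c(k) = dot_product(A(k,:), b)   ! independent row-by-vector product
    end do
    !$omp end parallel do
  end subroutine matvec_threads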

17 Multithread, block size=1
[Timing diagram: the threads' fetches are interleaved -- Fetch A(1,1), Fetch A(1,2), Fetch A(2,1), Fetch A(2,2), ..., Fetch A(10,1) -- with each thread performing its floating-point operation once its data arrives.]

18 Pre-fetching To Hide Memory Latency
Advance the loads ahead of when the data is needed.
The problem is that the cache space may be needed for other computation between the pre-fetch and the use of the pre-fetched data.
This is no worse than not performing the pre-fetch, because the pre-fetch is typically handled by an independent memory functional unit.
The dot product (or a vector sum) again provides an example:
a(1) and b(1) are requested in a loop.
The processor sees that a(2) and b(2) are needed for the next iteration and requests them on the next cycle, and so on.
Assume the first item takes 100 ns to obtain and the requests for the others are issued on consecutive cycles.
The processor waits about 101 cycles for the first pair, performs the computation, and the next pair is already there on the following cycle, ready for computation, and so on.
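A very loose Fortran illustration of the idea (mine; real pre-fetching is issued by the hardware or compiler, not written by hand -- this sketch only mimics "request the next pair while computing on the current one"):

  ! Hand-rotated loads: the next iteration's operands are loaded into
  ! temporaries before the current multiply-add, overlapping load and compute.
  function dot_prefetch(a, b, n) result(s)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8) :: s, a_cur, b_cur, a_nxt, b_nxt
    integer :: i
    s = 0.0d0
    a_cur = a(1); b_cur = b(1)          ! first pair: pay the full latency once
    do i = 1, n - 1
      a_nxt = a(i+1); b_nxt = b(i+1)    ! request the next pair early
      s = s + a_cur * b_cur             ! compute while those loads are in flight
      a_cur = a_nxt; b_cur = b_nxt
    end do
    s = s + a_cur * b_cur               ! last pair
  end function dot_prefetch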

19 Impact On Memory Bandwidth
Pre-fetching and multithreading increase the bandwidth requirements to memory. Compare:
A 32-thread computation experiencing a cache hit ratio of 25% (because all threads share the same cache and memory system): the memory bandwidth requirement is estimated to be 3 GB/s.
A single-threaded computation experiencing a 90% cache hit ratio: the memory bandwidth requirement is estimated to be 400 MB/s.

