NVIDIA Memory Hierarchy
Global memory: large, high latency.
Shared memory: on-chip memory shared by each set of processors (one multiprocessor, SM).
Constant/texture memory: read-only regions of global memory, each with an on-chip cache.
  – Constant memory is faster, but has only one port, so it works best when all threads read the same address.
  – Texture memory doesn't suffer greatly from irregular access, and it benefits from 2D spatial locality.
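As a rough illustration of how these memory spaces appear in CUDA source, the sketch below applies a 3-point weighted average. The kernel name `smooth3`, the symbol `c_weights`, the block size, and the test data are all hypothetical choices made for this example. The weights live in constant memory (every thread reads the same address, which suits the single port), each block stages its slice of the input in shared memory, and the input and output arrays sit in global memory. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int BLOCK = 4;            // assumption: tiny block so two blocks are used

__constant__ float c_weights[3];    // constant memory: read-only, cached on chip

__global__ void smooth3(const float* in, float* out, int n)
{
    // Shared memory: one tile per block, with a one-element halo on each side.
    __shared__ float tile[BLOCK + 2];

    const int gid = blockIdx.x * BLOCK + threadIdx.x;  // index into global memory
    const int lid = threadIdx.x + 1;                   // index into the tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == BLOCK - 1)
        tile[BLOCK + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();   // the whole tile must be loaded before anyone reads it

    if (gid < n)
        out[gid] = c_weights[0] * tile[lid - 1]
                 + c_weights[1] * tile[lid]
                 + c_weights[2] * tile[lid + 1];
}

int main()
{
    const int n = 8;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i + 1);

    const float h_w[3] = {0.25f, 0.5f, 0.25f};
    cudaMemcpyToSymbol(c_weights, h_w, sizeof(h_w));   // fill constant memory

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));             // global memory
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    smooth3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("%g ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Texture memory is left out here; it would replace the plain global loads of `in` when the access pattern is irregular or has 2D locality.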
Tuning SMVM for GPU (GTX 280)
Use multiple threads per row; synchronize with __syncthreads() and combine the partial results.
Access memory at a stride.
  – The threads of a half-warp access sequential addresses.
  – These accesses coalesce, so fewer reads from global memory are needed.
Align rows.
  – This also reduces the number of reads from global memory.
Use texture memory for the input vector.
  – The input vector is reused.
  – Texture reads are cached and benefit from spatial locality.
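The tuning steps above can be sketched as a CSR kernel. This is a minimal sketch under stated assumptions, not the implementation the slides measured: the kernel name `spmv_csr_multi`, the CSR array names, the choice of 16 threads per row (one half-warp), and the small test matrix are all hypothetical. Two substitutions are worth naming plainly. The input vector is read with `__ldg`, which goes through the read-only data cache on compute capability 3.5 and later; it stands in for the texture fetch that would have been used on a GTX 280. Row alignment (padding each row to a multiple of the half-warp size) is not shown.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS_PER_ROW = 16;  // assumption: one half-warp cooperates on a row
constexpr int ROWS_PER_BLOCK  = 8;   // assumption: 128 threads per block

__global__ void spmv_csr_multi(int num_rows,
                               const int*   __restrict__ row_ptr,
                               const int*   __restrict__ col_idx,
                               const float* __restrict__ vals,
                               const float* __restrict__ x,
                               float*       __restrict__ y)
{
    __shared__ float partial[ROWS_PER_BLOCK][THREADS_PER_ROW];

    const int lane      = threadIdx.x % THREADS_PER_ROW;
    const int local_row = threadIdx.x / THREADS_PER_ROW;
    const int row       = blockIdx.x * ROWS_PER_BLOCK + local_row;

    // Strided walk over the row: at each step, neighbouring lanes read
    // neighbouring entries of vals and col_idx (sequential addresses).
    float sum = 0.0f;
    if (row < num_rows) {
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += THREADS_PER_ROW)
            sum += vals[j] * __ldg(&x[col_idx[j]]);  // cached read of the reused vector
    }
    partial[local_row][lane] = sum;
    __syncthreads();

    // Combine the partial results of each row with a tree reduction.
    for (int s = THREADS_PER_ROW / 2; s > 0; s >>= 1) {
        if (lane < s)
            partial[local_row][lane] += partial[local_row][lane + s];
        __syncthreads();
    }
    if (lane == 0 && row < num_rows)
        y[row] = partial[local_row][0];
}

int main()
{
    // Hypothetical 4x4 test matrix in CSR form:
    // [1 0 2 0; 0 3 0 0; 4 0 5 6; 0 0 0 7]
    const int num_rows = 4;
    std::vector<int>   h_row_ptr = {0, 2, 3, 6, 7};
    std::vector<int>   h_col_idx = {0, 2, 1, 0, 2, 3, 3};
    std::vector<float> h_vals    = {1, 2, 3, 4, 5, 6, 7};
    std::vector<float> h_x       = {1, 2, 3, 4};
    std::vector<float> h_y(num_rows, 0.0f);

    int *d_row_ptr, *d_col_idx;
    float *d_vals, *d_x, *d_y;
    cudaMalloc(&d_row_ptr, h_row_ptr.size() * sizeof(int));
    cudaMalloc(&d_col_idx, h_col_idx.size() * sizeof(int));
    cudaMalloc(&d_vals,    h_vals.size()    * sizeof(float));
    cudaMalloc(&d_x,       h_x.size()       * sizeof(float));
    cudaMalloc(&d_y,       h_y.size()       * sizeof(float));
    cudaMemcpy(d_row_ptr, h_row_ptr.data(), h_row_ptr.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, h_col_idx.data(), h_col_idx.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals,    h_vals.data(),    h_vals.size()    * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x,       h_x.data(),       h_x.size()       * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (num_rows + ROWS_PER_BLOCK - 1) / ROWS_PER_BLOCK;
    spmv_csr_multi<<<blocks, ROWS_PER_BLOCK * THREADS_PER_ROW>>>(
        num_rows, d_row_ptr, d_col_idx, d_vals, d_x, d_y);
    cudaMemcpy(h_y.data(), d_y, h_y.size() * sizeof(float), cudaMemcpyDeviceToHost);

    for (float v : h_y) printf("%g ", v);   // y = A*x is 7 6 43 28 for this data
    printf("\n");

    cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_vals);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```

Using several threads per row pays off when rows hold many nonzeros; for very short rows most lanes sit idle, so the thread count per row is a tuning parameter rather than a fixed choice.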
Improvements in Fermi (GTX 580)
General L1/L2 cache structure.
  – L1 cache and shared memory share 64 KB per SM, configurable as 48 KB / 16 KB in either direction.
  – L2 is 768 KB.
Improved support for double-precision floating point.
Added support for 32-bit integer multiplication.
32 SPs per SM.
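The configurable L1/shared-memory split is exposed through the CUDA runtime call `cudaFuncSetCacheConfig`. The sketch below is only an illustration: the kernel `scale` is a hypothetical placeholder, and the preference is a hint that later architectures may honour differently or ignore. It also prints the SM count and L2 size reported by the driver, so the figures above can be compared with whatever device is installed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses no shared memory, so a larger L1 is the better split.
__global__ void scale(float* data, int n, float a)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    printf("%s: %d SMs, L2 = %d KB, shared memory per block = %zu KB\n",
           prop.name, prop.multiProcessorCount,
           prop.l2CacheSize / 1024, prop.sharedMemPerBlock / 1024);

    // Ask for the larger-L1 configuration (48 KB L1 / 16 KB shared on Fermi).
    // cudaFuncCachePreferShared requests the opposite split.
    cudaError_t err = cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1);
    printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));

    const int n = 1024;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    err = cudaDeviceSynchronize();
    printf("kernel: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

A kernel that leans on shared memory, such as one that reduces partial sums per row, would normally request `cudaFuncCachePreferShared` instead.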