
1 Analyzing Memory Access Intensity in Parallel Programs on Multicore
Lixia Liu, Zhiyuan Li, Ahmed Sameh
Department of Computer Science, Purdue University, USA
22nd ACM International Conference on Supercomputing (ICS), June 7-12, 2008

2 Findings
Prefetching hardware in modern single processors successfully hides memory latency when accessing memory in a stride that can be predicted by the hardware.
The effectiveness of prefetching diminishes when multiple cores access memory simultaneously, straining the shared memory bandwidth.

3 Findings (continued)
Memory latency cannot be hidden by creating numerous concurrent threads – the added threads aggravate the bandwidth problem.
For a given parallel program, when the number of executing cores exceeds a certain threshold, performance degrades due to the bandwidth problem.

4 Motivations
Programmers
– Performance bottleneck identification
– Algorithm comparison
Compiler designers
– Optimization design: latency vs. bandwidth

5 Intel Quadcore Processor Q6600 block diagram

6 Intel Quadcore Processor Q6600
Four 2.4 GHz cores (0.417 ns cycle time)
– UJ: 2 × four 2.5 GHz cores (Intel Xeon 5400 processors)
Two L2 caches (each of 4 MB size)
– UJ: two L2 caches (each of 6 MB size)
1066 MHz FSB
– UJ: 1333 MHz FSB
64-bit wide bus
– FSB peak bandwidth = 8.5 GB per second
– UJ: FSB peak bandwidth = 10.6 GB per second
– Bandwidth between MCH and main memory is 12.8 GB per second
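The peak FSB bandwidth figures above follow directly from the bus width times the effective bus clock. A quick arithmetic check (clock and width values are from the slide; the script only redoes the multiplication):

```python
def fsb_peak_bandwidth(bus_clock_hz, bus_width_bytes=8):
    """Peak FSB bandwidth in bytes/second: effective bus clock x width (64-bit = 8 bytes)."""
    return bus_clock_hz * bus_width_bytes

q6600_bw = fsb_peak_bandwidth(1066e6)   # 8.528e9 B/s, i.e. ~8.5 GB/s
xeon_bw = fsb_peak_bandwidth(1333e6)    # 10.664e9 B/s, i.e. ~10.6 GB/s
print(f"Q6600 FSB peak:     {q6600_bw / 1e9:.2f} GB/s")
print(f"Xeon 5400 FSB peak: {xeon_bw / 1e9:.2f} GB/s")
```

The slide's 8.5 and 10.6 GB/s figures are these products rounded down, as vendor datasheets typically quote them.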

7 Intel Quadcore processors
Each core performs
– out-of-order execution
– up to four completed instructions per cycle
Each 4 MB (UJ: 6 MB) L2 cache is shared by two cores.
To reduce the cache miss rate, each core has
– one hardware instruction prefetcher
– many data prefetchers, all prefetching independently

8 Intel Quadcore processors (cont.)
Two of the prefetchers can bring data from memory to the L2 cache.
Depending on memory reference patterns, the prefetchers dynamically adjust their parameters (stride, look-ahead distance) according to:
– bus bandwidth
– number of pending requests
– prefetch history

9 Matrix Multiply Code
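The code on this slide did not survive the transcript. A minimal sketch of the cache-blocking (tiling) transformation the talk compares against the plain triple loop – block size `bs` is a tunable, hypothetical parameter, not the value used in the paper:

```python
def matmul_blocked(A, B, n, bs=64):
    """C = A x B for n x n matrices (lists of lists), computed in
    bs x bs tiles so each tile of A, B and C is reused while it is
    still resident in cache."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):            # tile row of C
        for kk in range(0, n, bs):        # tile of the reduction dimension
            for jj in range(0, n, bs):    # tile column of C
                for i in range(ii, min(ii + bs, n)):
                    Ci = C[i]
                    for k in range(kk, min(kk + bs, n)):
                        a_ik = A[i][k]
                        Bk = B[k]
                        for j in range(jj, min(jj + bs, n)):
                            Ci[j] += a_ik * Bk[j]
    return C
```

Tiling reduces off-chip traffic: each tile of B is reloaded once per tile row of A rather than once per row, which is exactly the bandwidth saving the later slides quantify.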

10 Matrix Multiplication Execution on a Single Core
Speedup = non-blocked execution time / blocked execution time
Blocking is not so helpful on a single core: <10% improvement
Prefetch hardware is perfect
Is this also true on multicore?

11 BOTTLENECK OF THE MEMORY BUS
On multicore machines, the effectiveness of prefetching diminishes when it cannot proceed at full speed due to the bandwidth constraint, despite predictable strides.

12 Multicore Execution Results
Four cores: 70% performance gap
Observations
– Efficient concurrency
– No inter-thread communication
– Bandwidth problem?

13 Matrix Multiplication Execution on a Single Core: revisited
Speedup = non-blocked execution time / blocked execution time
Blocking is not so helpful on a single core: <10% improvement
Prefetch hardware is perfect AND available memory bandwidth is adequate
For a given application, is there a metric that quantitatively identifies whether the memory bandwidth is adequate?

14 Metric Definition: Memory Access Intensity
Zero-latency instruction issue rate IR_Z: instructions/cycle that could be issued supposing the operands are always available on-chip
α: average number of bytes accessed off-chip per instruction
The memory access intensity of an application is
β_A = α × IR_Z (bytes/cycle)
bytes accessed per cycle = (bytes/instruction) × (instructions/cycle)

15 When will there be a memory bottleneck?
Peak memory bandwidth PMB: bytes/sec
β_M (bytes per cycle) = PMB / frequency
For an application, there is a memory bottleneck if β_A > β_M
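The two definitions combine into a simple bottleneck test. A sketch that just encodes the formulas above, using the Q6600 figures from these slides (8.5 GB/s FSB peak, 2.4 GHz cores):

```python
def beta_M(peak_mem_bw_bytes_per_sec, cpu_freq_hz):
    """Machine limit: off-chip bytes deliverable per CPU cycle (PMB / frequency)."""
    return peak_mem_bw_bytes_per_sec / cpu_freq_hz

def beta_A(alpha_bytes_per_inst, ir_z_inst_per_cycle):
    """Memory access intensity: off-chip bytes demanded per cycle (alpha x IR_Z)."""
    return alpha_bytes_per_inst * ir_z_inst_per_cycle

def memory_bound(b_a, b_m):
    """True when the application demands more bandwidth than the machine can sustain."""
    return b_a > b_m

bm = beta_M(8.5e9, 2.4e9)
print(f"beta_M = {bm:.2f} bytes/cycle")  # ~3.54, matching slide 16
```

The same `beta_M` call with the Xeon 5400 figures answers the take-home question on the next slide.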

16 Is there a memory bottleneck for an application?
β_M for the Intel Core 2 Quad Q6600 = 3.54 B/cycle
There is a memory bottleneck for an application if β_A > β_M
Take-home midterm question: compute β_M for the UJ 2× quad-core Intel Xeon 5400.

17 Revisit Matrix Multiply Results

18 Explanation
β_M: 3.54 B/cycle
Methods to compute β_A:
– program analysis
– Intel VTune: measure hardware counters

Config             Cores  MEM (bytes)  INST      β_A
Non-blocked        1      6.89E+10     6.01E+10  2.29
Non-blocked        2      3.83E+10     6.02E+10  2.55
Non-blocked        4      3.73E+10     6.00E+10  4.97
Blocked (256×256)  1      9.05E+08     7.75E+10  0.02
Blocked (256×256)  2      9.45E+08     7.74E+10  0.05
Blocked (256×256)  4      1.19E+09     7.74E+10  0.12

4.97 > 3.54, thus blocking is necessary when 4 cores are executing.
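The β_A column can be reproduced from the MEM and INST counters. The table values are consistent with a zero-latency issue rate of 2 instructions/cycle per core, summed over the active cores; note that this IR_Z value is an inference from the numbers, not stated on the slide:

```python
def beta_A_from_counters(mem_bytes, instructions, cores, ir_z_per_core=2.0):
    """beta_A = alpha x IR_Z, with alpha taken from measured hardware
    counters and IR_Z aggregated across cores (2 inst/cycle/core is an
    assumption consistent with the table, not a documented value)."""
    alpha = mem_bytes / instructions        # off-chip bytes per instruction
    return alpha * ir_z_per_core * cores    # aggregate bytes per cycle

# 4-core non-blocked run from the table:
print(round(beta_A_from_counters(3.73e10, 6.00e10, 4), 2))  # 4.97 > beta_M = 3.54
```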

19 Other Results
Benchmarks: diagonally dominant banded linear systems
– ScaLAPACK (version implemented in the Intel Math Kernel Library (MKL))
– Spike (not discussed here; see paper)
– A revised Spike (not discussed here; see paper)
Configurations
– β_M: 3.54 bytes / CPU cycle
– Band: narrow (11), medium (99), and wide (399)
– Large matrix

20 ScaLAPACK: performance for narrow banded system

21 ScaLAPACK: performance for medium banded system

22 ScaLAPACK: performance for wide banded system

23 Results of the factorization step
For all three matrix bandwidths, little speedup is achieved by using multiple cores.
– Reason (VTune): the parallelized code significantly increases the number of instructions.
Solution: change the algorithm to reduce the number of extra instructions introduced by parallelization.

24 Results of the solve step
Flat speedup
– Reason: VTune shows β_A > β_M
– Solution: remove the memory bottleneck
VTune shows that the factorization step dominates the total execution time as the band gets wider.
– Improvement of the factorization step is therefore more critical to total performance than improvement of the solve step.

