Memory Performance and Scalability of Intel’s and AMD’s Dual-Core Processors: A Case Study
Lu Peng (1), Jih-Kwon Peir (2), Tribuvan K. Prakash (1), Yen-Kuang Chen (3) and David Koppelman (1)
(1) Louisiana State University, (2) University of Florida, (3) Intel Corporation

04/11/2007 IPCCC’07 Peng, Louisiana State University

Motivation
Dual-core processors are popular.
We want to understand the impact of the memory hierarchy on overall performance.
Which factors are important for memory hierarchy performance?
How much speedup do two threads achieve?

Three Selected Dual-Core Processors
Processors: Intel Core 2 Duo, Intel Pentium D, AMD Athlon 64 X2.
Shared cache vs. private cache
On-chip vs. off-chip memory controller
On-chip vs. off-chip inter-core communication
[Figure labels: Off-Chip, On-Chip, Shared]

Three Selected Dual-Core Processors: Intel Core 2 Duo
Shared L2 cache: no L2 coherence is needed and a single active core benefits from the whole cache, but access latency is higher and sharing raises a fairness issue.
On an L1 miss, the L2 and the other core's L1 are searched simultaneously, giving fast cache-to-cache transfer and L1 coherence (like a bus).
The memory controller is off-chip; the processor uses aggressive memory dependence prediction.

Three Selected Dual-Core Processors: Intel Pentium D
Two Pentium 4 cores on one chip, using a technology-remap approach (SMP).
Private L2 caches with MESI coherence: an M → S transition requires a memory update, which goes over the off-chip FSB; L1 coherence traffic also goes through the FSB.
The memory controller is off-chip, which means longer delay but easy adaptation to new DRAM.

Three Selected Dual-Core Processors: AMD Athlon 64 X2
Private L2 caches, connected through HyperTransport.
A system request queue handles internal communication between the two cores.
The MOESI coherence protocol allows a shared modified block in the O state, so no memory update is needed when a core reads a remote Modified block.
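The difference between the two coherence protocols on a remote read of a Modified block can be sketched as a tiny state model. This is only an illustration of the single transition described on the slides, not a model of either processor; the function name `remote_read` and its return tuple are hypothetical.

```python
def remote_read(protocol, owner_state):
    """Return (new_owner_state, reader_state, memory_updated) when another
    core reads a block that the owner holds in `owner_state`.

    Only the case discussed on the slides (owner in M) is modeled.
    """
    if owner_state != "M":
        raise ValueError("this sketch only models a remote read of a Modified block")
    if protocol == "MESI":
        # M -> S: the dirty data has to be written back to memory
        return ("S", "S", True)
    if protocol == "MOESI":
        # M -> O: the owner keeps the dirty data and supplies it to the reader,
        # so memory is not updated at this point
        return ("O", "S", False)
    raise ValueError(f"unknown protocol: {protocol}")


if __name__ == "__main__":
    for proto in ("MESI", "MOESI"):
        owner, reader, mem = remote_read(proto, "M")
        print(f"{proto}: owner M -> {owner}, reader -> {reader}, memory update: {mem}")
```

The third element of the result is the point of the comparison: under MESI the transition costs a memory update (over the FSB on Pentium D), while under MOESI it does not.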

Specifications of the selected processors

Methodology
Same platform: SUSE Linux 10.1 with kernel 2.6.16-smp.
Micro-benchmarks: memory bandwidth and latency measured by lmbench; a lockless program [19] measuring cache-to-cache latency.
Real workloads: single-threaded SPEC CPU2000 and CPU2006; multi-threaded blastp, hmmpfam, SPECjbb2005 and SPLASH2.

Memory operations from lmbench
Memory read: measures the time to read every 4-byte word from memory.
Memory write: measures the time to write every 4-byte word to memory.
Other operations include memory bzero; refer to the paper for details.

Lockless program measuring cache-to-cache latency
It does not employ expensive read-modify-write atomic primitives.
It maintains a lockless counter for each thread.
*pPong is in a different cache line from *pPing.
Cache-to-cache (C2C) latency for Core 2 Duo, Pentium D and Athlon 64 X2: 33 ns, 133 ns and 68 ns, respectively.
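The structure of such a ping-pong test can be sketched as follows. This is not the program from [19]: it is a minimal Python illustration of the idea that each thread writes only its own counter and reads only the other's, so no lock or atomic read-modify-write is needed. The names `ping_pong`, `ping` and `pong` are hypothetical stand-ins for `*pPing` and `*pPong`. Python cannot control cache-line placement and interpreter overhead dominates the timing, so the number it returns is not a hardware cache-to-cache latency.

```python
import threading
import time


def ping_pong(rounds=200):
    """Bounce a value between two threads using one counter per thread.

    Returns (average round-trip seconds, final ping value, final pong value).
    """
    ping = [0]  # written only by the main thread (stands in for *pPing)
    pong = [0]  # written only by the echo thread (stands in for *pPong)

    def echo():
        for i in range(1, rounds + 1):
            while ping[0] != i:   # spin until the next ping arrives
                time.sleep(0)     # yield so the other thread can run
            pong[0] = i           # reply with a plain store to our own counter

    worker = threading.Thread(target=echo)
    worker.start()
    start = time.perf_counter()
    for i in range(1, rounds + 1):
        ping[0] = i               # plain store, no atomic increment
        while pong[0] != i:       # spin until the echo comes back
            time.sleep(0)
    elapsed = time.perf_counter() - start
    worker.join()
    return elapsed / rounds, ping[0], pong[0]


if __name__ == "__main__":
    avg, last_ping, last_pong = ping_pong()
    print(f"average round trip: {avg * 1e6:.1f} us over {last_ping} rounds")
```

In a native implementation the two counters would sit in different cache lines, and half of the round-trip time would approximate the one-way cache-to-cache latency.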

Memory bandwidth collected from the lmbench suite
[Figure callouts: "Doubled!!", "Private cache is faster!"]
1. In general, Core 2 Duo and Athlon 64 X2 have better bandwidth than Pentium D.
2. Pentium D shows the best memory read bandwidth when the array size is less than its L2 size.
3. Athlon 64 X2 provides doubled memory read bandwidth when two copies of lmbench run, benefiting from its on-chip memory controller.

SPEC CPU2000 and CPU2006 benchmarks’ execution time
1. Core 2 Duo runs fastest for almost all workloads, especially art and mcf.
2. Athlon 64 X2 shows the best performance for ammp, which has a large working set and therefore a high L2 miss rate.
3. When mixed with another program, a memory-intensive program’s execution time increases substantially.
4. When mixed with another program, a CPU-bound program’s execution time increases only slightly.

Multi-programmed speedup of mixed SPEC CPU2000/2006 benchmarks
1. Athlon 64 X2 achieves the best speedup for all workloads.
2. CPU-bound programs show the best speedup.
3. Memory-bound programs show the worst speedup.
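The slides do not give the speedup formula. As an assumption, one common definition divides the time to run the two programs back to back by the time to run them together on the two cores. The sketch below uses that definition; the function name and all timing values are hypothetical and are not measurements from the study.

```python
def multiprogrammed_speedup(t_a_alone, t_b_alone, t_a_mixed, t_b_mixed):
    """Speedup of running programs A and B together versus one after another.

    Assumed definition (not taken from the slides):
        (T_A alone + T_B alone) / max(T_A mixed, T_B mixed)
    """
    serial_time = t_a_alone + t_b_alone
    mixed_time = max(t_a_mixed, t_b_mixed)
    return serial_time / mixed_time


if __name__ == "__main__":
    # Made-up timings in seconds, chosen only to show the trend on the slide.
    cpu_bound = multiprogrammed_speedup(100, 100, 102, 103)
    memory_bound = multiprogrammed_speedup(100, 100, 150, 160)
    print(f"CPU-bound pair:    {cpu_bound:.2f}")
    print(f"memory-bound pair: {memory_bound:.2f}")
```

Under this definition a pair that does not interfere at all reaches a speedup of 2, and contention for the memory hierarchy pulls the value toward 1.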

Multithreaded program behaviors
1. Core 2 Duo’s single-thread performance is boosted by its larger L2 cache.
2. Core 2 Duo shows the best speedup for ocean because of a high cache-to-cache transfer ratio, as verified with the Intel VTune Analyzer.
3. Pentium D shows the best speedup for barnes because of the low cache miss rate.

Conclusions
We analyzed the memory hierarchy of selected Intel and AMD dual-core processors.
For the best performance and scalability, the following factors are important:
fast cache-to-cache communication;
large L2 or shared capacity;
fast front-side bus;
on-chip memory controller;
fair resource (cache) sharing.

Thank you! Questions?

Backup Slides (Memory load latency collected from the lmbench suite)

Memory latency collected from the lmbench suite (continued)
Latencies for all configurations jump once the array size exceeds the L2 size.
When the stride size is equal to 128 bytes, Pentium D still benefits partially, but the L2 prefetchers of Core 2 Duo and Athlon 64 X2 are not triggered.
When the stride size is larger than 128 bytes, Athlon 64 X2’s on-die memory controller and separate I/O HyperTransport show their advantage.
Running two copies of the lmbench suite puts more pressure on Pentium D.
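The role of array size and stride in such a latency test can be sketched with a chain of dependent loads, in the spirit of lmbench's memory read latency test. This is a simplified illustration, not lmbench's code: `build_chain`, `walk` and `ns_per_load` are hypothetical names, an 8-byte element size is assumed, and a Python list does not map onto raw memory the way a C array does, so the timings it produces are not hardware latencies.

```python
import time


def build_chain(array_bytes, stride_bytes, word_bytes=8):
    """Build an index chain where each visited slot holds the index of the
    slot one stride further on, wrapping back to slot 0 at the end."""
    n = array_bytes // word_bytes
    step = max(1, stride_bytes // word_bytes)
    chain = [0] * n
    visited = list(range(0, n, step))
    for here, nxt in zip(visited, visited[1:] + visited[:1]):
        chain[here] = nxt
    return chain


def walk(chain, loads):
    """Follow the chain; every load depends on the result of the previous one."""
    p = 0
    for _ in range(loads):
        p = chain[p]
    return p


def ns_per_load(array_bytes, stride_bytes, loads=100_000):
    """Average time per dependent load in nanoseconds."""
    chain = build_chain(array_bytes, stride_bytes)
    start = time.perf_counter()
    walk(chain, loads)
    return (time.perf_counter() - start) / loads * 1e9


if __name__ == "__main__":
    for stride in (64, 128, 256):
        print(f"stride {stride:4d} B: {ns_per_load(4 * 1024 * 1024, stride):.1f} ns/load")
```

Because each load's address comes from the previous load, the accesses cannot overlap, which is why the array size (cache capacity) and the stride (prefetcher behavior) show up directly in the measured latency of a native implementation.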

Backup Slides (Bandwidth for STREAM / STREAM2)
The add operation is a loop of c[i] = a[i] + b[i], which can easily take advantage of SSE2 packed operations and therefore shows higher bandwidth.
Intel Core 2 Duo shows the best bandwidth for all operations because of its L1 data prefetchers and faster front-side bus.
Athlon 64 X2 has better bandwidth than Pentium D because of its faster on-chip memory controller.
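The add kernel and the way its bandwidth is counted can be sketched as below. This is an illustrative Python version, not the STREAM source: the function name `stream_add` and the array length are hypothetical, and the interpreted loop is far slower than the compiled, SSE2-vectorized kernel, so the reported figure says nothing about the processors in the study. The byte count follows the STREAM convention for add of two reads and one write of an 8-byte double per iteration.

```python
import time
from array import array


def stream_add(n=200_000):
    """Run c[i] = a[i] + b[i] over n doubles and return (c, MB/s)."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    start = time.perf_counter()
    for i in range(n):
        c[i] = a[i] + b[i]
    elapsed = time.perf_counter() - start
    bytes_moved = 3 * 8 * n          # read a, read b, write c; 8 bytes each
    return c, bytes_moved / elapsed / 1e6


if __name__ == "__main__":
    result, mb_per_s = stream_add()
    print(f"add: {mb_per_s:.1f} MB/s, c[0] = {result[0]}")
```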
