Memory Performance and Scalability of Intel's and AMD's Dual-Core Processors: A Case Study


Memory Performance and Scalability of Intel's and AMD's Dual-Core Processors: A Case Study
Lu Peng (1), Jih-Kwon Peir (2), Tribuvan K. Prakash (1), Yen-Kuang Chen (3) and David Koppelman (1)
(1) Louisiana State University  (2) University of Florida  (3) Intel Corporation

04/11/2007 IPCCC'07 Peng, Louisiana State University

Motivation
- Dual-core processors are popular.
- Understand the impact of the memory hierarchy on overall performance.
- What are the important factors for memory hierarchy performance?
- How well do dual-thread workloads speed up?

Three Selected Dual-Core Processors
- Intel Core 2 Duo, Intel Pentium D, AMD Athlon 64 X2
- Design contrasts: shared vs. private L2 cache; on-chip vs. off-chip memory controller; on-chip vs. off-chip inter-core communication.

Intel Core 2 Duo
- Shared L2 cache: no L2 coherence traffic, and beneficial when only one core is active, but it has higher latency and raises fairness issues.
- On an L1 miss, the L2 and the other core's L1 are searched simultaneously, giving fast cache-to-cache transfers and bus-like L1 coherence.
- Off-chip memory controller; aggressive memory dependence prediction.

Intel Pentium D
- Two Pentium 4 cores on one chip: a technology remap of a traditional SMP.
- Private L2 caches with MESI coherence: an M-to-S transition requires a memory update, and both memory updates and L1 coherence traffic go over the off-chip front-side bus.
- Off-chip memory controller: longer latency, but adaptable to new DRAM technologies.

AMD Athlon 64 X2
- Private L2 caches; the chip connects to the outside through HyperTransport.
- A system request queue handles communication between the two cores on chip.
- The MOESI coherence protocol keeps a shared-modified block in the Owned (O) state, so no memory update is needed when a remote core reads a Modified block.

Specifications of the Selected Processors (table)

Methodology
- Same platform: SUSE Linux 10.1 with an SMP kernel.
- Micro-benchmarks: memory bandwidth and latency measured with lmbench; a lockless program [19] measuring cache-to-cache latency.
- Real workloads: single-threaded SPEC CPU2000 and CPU2006; multi-threaded blastp, hmmpfam, SPECjbb2005 and SPLASH-2.

Memory Operations from lmbench
- Memory read: measures the time to read every 4-byte word from memory.
- Memory write: measures the time to write every 4-byte word to memory.
- Other operations (memory bzero, etc.) are described in the paper.

Lockless Program Measuring Cache-to-Cache Latency
- Avoids expensive read-modify-write atomic primitives.
- Maintains a lockless counter for each thread; *pPong is placed in a different cache line from *pPing.
- Measured cache-to-cache latency: 33 ns (Core 2 Duo), 133 ns (Pentium D) and 68 ns (Athlon 64 X2).

Memory Bandwidth Collected from the lmbench Suite
1. In general, Core 2 Duo and Athlon 64 X2 have better bandwidth than Pentium D.
2. Pentium D shows the best memory read bandwidth when the array size is smaller than its L2 size: the private cache is faster.
3. Athlon 64 X2 provides double the memory read bandwidth when two copies of lmbench run, benefiting from its on-chip memory controller.

SPEC CPU2000 and CPU2006 Benchmark Execution Times
1. Core 2 Duo runs fastest for almost all workloads, especially art and mcf.
2. Athlon 64 X2 performs best on ammp, whose large working set results in a high L2 miss rate.
3. When mixed with another program, a memory-intensive program's execution time increases substantially.
4. When mixed with another program, a CPU-bound program's execution time increases only slightly.

Multi-Programmed Speedup of Mixed SPEC CPU2000/2006 Benchmarks
1. Athlon 64 X2 achieves the best speedup for all workloads.
2. CPU-bound programs show the best speedup.
3. Memory-bound programs show the worst speedup.

Multithreaded Program Behaviors
1. Core 2 Duo's single-thread performance is boosted by its larger L2 cache.
2. Core 2 Duo shows the best speedup for ocean due to its high cache-to-cache transfer ratio (verified with the Intel VTune Analyzer).
3. Pentium D shows the best speedup for barnes because of that workload's low cache miss rate.

Conclusions
Analyzed the memory hierarchies of selected Intel and AMD dual-core processors. For the best performance and scalability, the important factors are:
- fast cache-to-cache communication;
- large L2 or shared cache capacity;
- a fast front-side bus;
- an on-chip memory controller;
- fair sharing of resources (cache).

Thank you! Questions?

Backup Slides: Memory Load Latency Collected from the lmbench Suite

Memory Latency Collected from the lmbench Suite (continued)
- Latencies for all configurations jump once the array size exceeds the L2 size.
- At a 128-byte stride, Pentium D still benefits partially from prefetching, but the L2 prefetchers of Core 2 Duo and Athlon 64 X2 are not triggered.
- At strides larger than 128 bytes, Athlon 64 X2's on-die memory controller and separate I/O HyperTransport show their advantage.
- Running two copies of the lmbench suite puts more pressure on Pentium D.

Backup Slides: Bandwidth for STREAM / STREAM2
- The add operation is a loop of c[i] = a[i] + b[i], which can easily take advantage of SSE2 packed operations, so it shows higher bandwidth.
- Intel Core 2 Duo shows the best bandwidth for all operations because of its L1 data prefetchers and faster front-side bus.
- Athlon 64 X2 has better bandwidth than Pentium D due to its faster on-chip memory controller.