Analyzing Memory Access Intensity in Parallel Programs on Multicore. Lixia Liu, Zhiyuan Li, Ahmed Sameh. Department of Computer Science, Purdue University.

Presentation transcript:

Analyzing Memory Access Intensity in Parallel Programs on Multicore. Lixia Liu, Zhiyuan Li, Ahmed Sameh. Department of Computer Science, Purdue University, USA. 22nd ACM International Conference on Supercomputing (ICS), June 7-12, 2008.

Findings
Prefetching hardware in modern single processors successfully hides memory latency when memory is accessed in a stride the hardware can predict.
The effectiveness of prefetching diminishes when multiple cores access memory simultaneously, straining the shared memory bandwidth.

Findings (continued)
Memory latency cannot be hidden by creating numerous concurrent threads; the added threads aggravate the bandwidth problem.
For a given parallel program, when the number of executing cores exceeds a certain threshold, performance degrades due to the bandwidth problem.

Motivations
Programmers
– performance bottleneck identification
– algorithm comparison
Compiler designers
– optimization design: latency vs. bandwidth

Intel Quadcore Processor Q6600 block diagram

Intel Quadcore Processor Q6600
Four 2.4 GHz cores (0.417 ns cycle time)
– UJ: 2 × four 2.5 GHz cores (Intel Xeon 5400 processors)
Two L2 caches (each of 4 MB size)
– UJ: two L2 caches (each of 6 MB size)
1066 MHz FSB
– UJ: 1333 MHz FSB
64-bit wide bus
– FSB peak bandwidth = 8.5 GB per second
– UJ: FSB peak bandwidth = 10.6 GB per second
– Bandwidth between the MCH and main memory is 12.8 GB per second
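As a sanity check (standard width × clock arithmetic, not shown on the slide), the peak FSB figures follow from the 64-bit (8-byte) bus width:

$$8\ \text{bytes} \times 1066\ \text{MHz} \approx 8.5\ \text{GB/s}, \qquad 8\ \text{bytes} \times 1333\ \text{MHz} \approx 10.6\ \text{GB/s}$$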

Intel Quadcore processors
Each core performs
– out-of-order execution
– completion of up to four full instructions per cycle
Each 4 MB (UJ: 6 MB) L2 cache is shared by two cores.
To reduce the cache miss rate, each core has
– one hardware instruction prefetcher
– many data prefetchers, all prefetching independently

Intel Quadcore processors (cont.)
Two of the prefetchers can bring data from memory into the L2 cache.
Depending on the memory reference patterns, the prefetchers dynamically adjust their parameters (stride, look-ahead distance) according to:
– bus bandwidth
– number of pending requests
– prefetch history

Matrix Multiply Code
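The transcript does not capture the code shown on this slide. A minimal C sketch of the two versions the talk compares, a plain triple loop and a blocked (tiled) variant, is given below; the matrix dimension N and the use of static arrays are assumptions for illustration, and the 256×256 tile size matches the blocked configuration reported later.

```c
#define N 1024     /* assumed matrix dimension (must be a multiple of TILE) */
#define TILE 256   /* tile size; the talk reports 256x256 blocking */

/* Non-blocked version: plain i-j-k triple loop. */
void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Blocked version: operate on TILE x TILE sub-blocks so each block of A, B,
   and C stays resident in cache while it is reused. Assumes C is
   zero-initialized, since every kk iteration accumulates into C. */
void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + TILE; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```

Blocking reduces off-chip traffic because each tile is fetched once and reused many times, which is exactly what lowers the memory access intensity defined later in the talk.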

Matrix Multiplication Execution on a Single Core
Speedup = non-blocked execution time / blocked execution time
Blocking is not very helpful on a single core: <10% improvement, because the prefetch hardware is nearly perfect.
Is this also true on multicore?

BOTTLENECK OF THE MEMORY BUS
On multicore machines, the effectiveness of prefetching diminishes when prefetching cannot proceed at full speed due to the bandwidth constraint, despite predictable strides.

Multicore Execution Results
Four cores: 70% performance gap
Observations
– efficient concurrency
– no inter-thread communication
– bandwidth problem?

Matrix Multiplication Execution on a Single Core: revisited
Speedup = non-blocked execution time / blocked execution time
Blocking is not very helpful on a single core (<10% improvement) because the prefetch hardware is perfect AND the available memory bandwidth is adequate.
For a given application, is there a metric that quantitatively identifies whether the memory bandwidth is adequate?

Metric Definition: Memory Access Intensity
Zero-latency instruction issue rate IR_Z: instructions per cycle that could be issued supposing the operands are always available on-chip.
α: average number of bytes accessed off-chip per instruction.
The memory access intensity of an application is then
β_A = α × IR_Z (bytes/cycle)
i.e., bytes accessed per cycle = (bytes/instruction) × (instructions/cycle).

When will there be a memory bottleneck?
Peak memory bandwidth PMB: bytes/sec
β_M (bytes per cycle) = PMB / frequency
An application has a memory bottleneck if β_A > β_M.
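Plugging in the Q6600 numbers quoted earlier (8.5 GB/s FSB peak bandwidth, 2.4 GHz cores) reproduces the β_M used on the next slide:

$$\beta_M = \frac{8.5 \times 10^9\ \text{bytes/s}}{2.4 \times 10^9\ \text{cycles/s}} \approx 3.54\ \text{bytes/cycle}$$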

Is there a memory bottleneck for an application?
β_M for the Intel Core 2 Quad Q6600 = 3.54 bytes/cycle
There is a memory bottleneck for an application if β_A > β_M.
Take-home midterm question: compute β_M for the UJ 2 × quad-core Intel Xeon.
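As an illustration of the test (a hypothetical helper, not code from the paper), the comparison can be packaged in a few lines of C; the inputs are the machine's peak bandwidth and core frequency plus a measured β_A:

```c
#include <stdio.h>

/* beta_M: bytes the memory system can deliver per CPU cycle. */
static double beta_M(double peak_bw_bytes_per_sec, double core_freq_hz) {
    return peak_bw_bytes_per_sec / core_freq_hz;
}

/* Bottleneck condition from the slide: beta_A > beta_M. */
static int has_memory_bottleneck(double beta_A, double beta_M_val) {
    return beta_A > beta_M_val;
}

int main(void) {
    double bm = beta_M(8.5e9, 2.4e9); /* Q6600: 8.5 GB/s FSB, 2.4 GHz cores */
    double ba = 4.97;                 /* non-blocked matmul on 4 cores (from the VTune table) */
    printf("beta_M = %.2f bytes/cycle -> bottleneck: %s\n",
           bm, has_memory_bottleneck(ba, bm) ? "yes" : "no");
    return 0;
}
```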

Revisit Matrix Multiply Results

Explanation
β_M: 3.54 bytes/cycle
Methods to compute β_A
– program analysis
– Intel VTune: measure hardware counters
4.97 > 3.54, thus blocking is necessary when 4 cores are executing.
(Table: memory traffic in bytes (MEM), instruction count (INST), and β_A, per core count, for the non-blocked and the blocked (256×256) versions.)
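Following the metric definition slide, the β_A column can be derived from the VTune counters, with IR_Z obtained separately (e.g., by program analysis); this decomposition is implied by the table's columns rather than stated on the slide:

$$\alpha = \frac{\text{MEM}}{\text{INST}}\ \text{bytes/instruction}, \qquad \beta_A = \alpha \times IR_Z\ \text{bytes/cycle}$$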

Other Results
Benchmarks: solvers for diagonally dominant banded linear systems
– ScaLAPACK (the version implemented in the Intel Math Kernel Library (MKL))
– Spike (not discussed here; see paper)
– a revised Spike (not discussed here; see paper)
Configurations
– β_M: 3.54 bytes/CPU cycle
– band: narrow (11), medium (99), and wide (399)
– large matrix

ScaLAPACK: performance for narrow banded system

ScaLAPACK: performance for medium banded system

ScaLAPACK: performance for wide banded system

Results of the factorization step
For all three matrix bandwidths, little speedup is achieved by using multiple cores.
– Reason (VTune): the parallelized code significantly increases the number of instructions.
Solution: change the algorithm to reduce the number of extra instructions introduced by parallelization.

Results of the solve step
Flat speedup
– Reason: VTune shows β_A > β_M
Solution: remove the memory bottleneck.
VTune also shows that the factorization step dominates the total execution time as the band gets wider.
– Improving factorization is therefore more critical to total performance than improving the solve step.