Memory Opportunity in Multicore Era


1 Memory Opportunity in Multicore Era

2 Achievement Highlight
In the fall 2007 semester:
Wrote the paper "Automatic Selection of Compiler Options for Performance Optimization on the Control Software of Embedded Systems."
About to finish writing the paper "Performance Profiling, Modeling and Simulation for I/O-Intensive Server-Side Applications."
Performed experiments for the memory opportunity project; preliminary results are presented later.

3 Outline
Project Goal
Problem Description
Observation
Approach
Current Progress

4 Project Goal
To develop a tool that assists users (developers) in dealing with memory utilization issues while migrating programs onto a multicore platform:
For multiple copies of a serial program running on a multicore platform
For parallelized code running on a multicore platform

5 Problem Description
Why is the memory utilization issue important?
Obstacles: as the number of cores (and hardware threads) per chip increases, memory traffic among cores within a chip and among sockets grows rapidly.
Limited per-core on-chip memories
Off-chip memory bandwidth
Memory latency hiding
Arranging data smartly within a chip boosts performance.

6 Observation I – Multiple Copies of a Serial Program
Multiple cores in a chip accessing memory via a single bus are like an N-headed straw in one milkshake: the number of pins available to reach memory from a multicore chip is fixed. Michael and a colleague did this for a SPEC benchmark and got a speedup on a 2-core CPU (vs. essentially no speedup when the code was optimized for instruction counts). "Data-Streaming Compilers for Multi-Core CPUs," Michael, Supercomputing 2007.
(Figure: two copies of SWIM on CPU cores #1 and #2, with scheduled memory access.)

7 Observation II – Parallelized Program (NPB)
In the paper "Characterization of Scientific Workloads on Systems with Multi-Core Processors" (IISWC 2006), the authors demonstrate that performance improves with task and memory placement techniques alone, while performance degrades when even the MPI-version NPB benchmarks* run on a system of 8 dual-core processors without a processor and memory affinity policy.
★ This shows that memory is a key factor when we migrate programs onto a multicore platform (memory contention in MPI-based programs). Further, the maximum performance improvement is about 3-4x on the 8 dual-core system (16 physical execution contexts). This is our opportunity.
*The NAS Parallel Benchmarks (NPB) suite consists of several small programs derived from computational fluid dynamics applications.
*Input dataset = Class B

8 Observation III – NPB
In the paper "Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System" (CCGrid 2007), the authors show that performance degrades when processes run on cores in the same chip. For example, they run 2 processes on 2 cores of the same chip, on 2 cores across chips, and on 2 cores across nodes, respectively. The first configuration (2 processes on 2 cores of the same chip) does not perform as well as the other two (MPI-version NPB benchmarks*). The experiment suggests that to take full advantage of a multicore platform, cache and memory contention are the major concerns.
Machine types:
1. Woodcrest: 2 sets of dual-core (4 cores per node)
2. Dual Intel Xeon per node
Notation p×q: p is the number of nodes, q the number of processors per node (1x4, 2x2, 4x1).
Within a dual-core chip: shared L2 cache. Within a node: shared memory.
*Input dataset = Class B

9 Observation IV – NPB
Poor scalability when NPB runs on a multicore platform, due to memory bandwidth limitation and memory contention. In the technical report from NUS, "Multicore Parallel Computing with OpenMP" (TR-NUS, 2007), the memory contention phenomenon is reported: NPB does not scale as well on multicore as it does on an SMP platform. Even parallel programs from the HPC domain perform poorly on a multicore platform (OpenMP-version NPB benchmarks*). As the workload size increases, speedup decreases.
*Input dataset = Class A

10 Observation V – Two Copies of NPB
In the technical report by the National Energy Research Scientific Computing Center (NERSC), "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture" (TR-NERSC, 2007): they see 10% to 45% performance degradation on a dual-core platform relative to a single-core platform, measuring execution on core 0 and core 1 separately (single-core runs) and on cores 0 and 1 simultaneously (dual-core runs). (Serial-mode NPB benchmarks.) Speedups (single-core runs vs. dual-core runs) were measured for BT, SP, LU, CG, and MG.

11 Check Point
Problem: memory bandwidth.
Current solution: this issue is going to be addressed in Intel's next-generation CPU, codename Nehalem. However, there is always a need for more memory bandwidth; the number of cores grows far faster than the number of pins available to access memory from a chip. (HPC programs are potential customers.)
Problem: memory and cache contention.
Current solution: identify memory objects to guide program writing for better cache and memory utilization. Runtime profiling is required because memory utilization varies with job size, number of software threads, and number of hardware execution contexts.

12 Approach
Procedure:
Step 1 (we're here): Find target applications that expose the memory contention effect on a multicore machine. Expected result: poor scalability when NPB runs on a multicore platform, due to memory bandwidth limitation and memory contention.
Step 2: Analyze the critical data structures and the hardware configuration, and propose a tool that exploits the analyzed information to guide code rewriting. (This is also useful for embedded systems, e.g. placing data structures in scratchpad memory.)
Step 3 (scenario): Schedule the memory access order when memory contention happens at runtime.

13 Current Progress
Compiler flags have little effect on the NPB benchmarks.
Machine type: Intel Q6600 CPU, 2 GB main memory, ASUS P5B V-M motherboard, Debian Linux, GCC.
uname -a = Linux debian #1 SMP Mon Dec 24 16:41:07 UTC 2007 i686 GNU/Linux

14 Summary
Programs with poor scalability have now been identified. It may be time to read the memory-access-pattern work in the literature, and to use PIN to identify the memory objects used in the programs.
Performance comparison: identify the programs with scalability issues (both multiple-copies and threaded versions).

                        Threading lib    OpenMP-version    MPI-version
# of threads            MAX: 4           MAX: 8            MAX: 4
Machine                 mine             [1]               [2]
Problematic benchmarks  CG, SP, FT, MG   IS (not running)  IS

[1] Multicore Parallel Computing with OpenMP (TR-NUS, 2007)
[2] Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System (CCGrid 2007)

15 Backup Slides – Experimental Setup
Tool: PIN, used to identify memory contention and critical data structures.
Platform: a quad-core machine and a multiprocessor system with quad-cores.
Configurations:
Compiler flags (O2, O3)
Thread numbers (1-4)
Serial mode, OpenMP version, MPI version, and multiple-copies version
Problem size (Class A to E)
Target applications (communities): NAS Parallel Benchmarks (HPC), PhysicsBench (games).
Why NPB? It is an important benchmark suite with representative kernels and real applications.
Why PhysicsBench? It seems the main reason is that this benchmark requires a large amount of memory access. On average, PhysicsBench is composed of 34% floating-point calculations, 25% integer calculations, 6% branches, 5% stores, and 30% loads.

16 Benchmarks Description
BT is a simulated CFD application that uses an implicit algorithm to solve 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-differences solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal of 5×5 blocks and are solved sequentially along each dimension.
SP is a simulated CFD application that has a similar structure to BT. The finite-differences solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension.
LU is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system, resulting from finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems.
FT contains the computational kernel of a 3-D fast Fourier transform (FFT)-based spectral method. FT performs three one-dimensional (1-D) FFTs, one for each dimension.
MG uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. The algorithm works continuously on a set of grids that range between coarse and fine. It tests both short- and long-distance data movement.
CG uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries.
EP is an Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish the reference point for peak performance of a given platform.

17 Memory Usage – NPB (Class A)
Memory size (MB): BT: 45, SP: 47, LU: 43, FT: 293, MG: 433, CG: 48, EP: 3544
