Application performance and communication profiles of M3DC1_3D on NERSC Babbage KNC with 16 MPI ranks. Thanh Phung, Intel TCAR; Woo-Sun Yang, NERSC.

M3DC1 performance and MPI communication profiling method

Serial performance
– Using Intel VTune
– Hot functions
– Vectorization and compilation options
– Bandwidth
– HW performance details: L1, L2, TLB misses, memory latency, …

Parallel performance with MPI
– MPI trace using Intel ITAC
– Breakdown of performance vs. communication
– Performance of all MPI calls
– Messaging profile and statistics
– Identify areas for improving communication performance

Overall performance summary with the general-exploration metric: high CPI, low L1 cache hit ratio, significant average memory latency impact per cache miss, good VPU usage, and a non-negligible L1 TLB miss ratio

Load imbalance across the 16 MPI ranks – binding each MPI rank to a physical core could be used to prevent process migration from core to core

Core performance profile with the general-exploration metric and the VPU-elements-active event highlighted – small L1 and L2 data cache reuse

Small number of active VPU elements on each MPI rank, with the run time zoomed in to a … sec window – good VPU usage, but the number of VPU-active phases is small.

Hot functions with the most run time – MPI ~44%, followed by MKL BLAS ~21% and the functions int4, eval_ops, and int3

Hot function "int4" shown with a reduction on "ksum" with multiply ops on 5 RHS vectors – the L1 cache miss rate is within range, but the L1 cache miss latency is large, ~220 us
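As a rough illustration of the pattern described above (a sketch, not the actual M3DC1 source; only the name ksum comes from the slide, while npts and g1..g5 are hypothetical), the reduction presumably has this shape:

! Minimal sketch of an int4-style reduction: a scalar ksum accumulates
! products of five RHS vectors.  Array names g1..g5 and npts are hypothetical.
program int4_sketch
  implicit none
  integer, parameter :: npts = 1024
  real(8) :: g1(npts), g2(npts), g3(npts), g4(npts), g5(npts)
  real(8) :: ksum
  integer :: k

  call random_number(g1); call random_number(g2); call random_number(g3)
  call random_number(g4); call random_number(g5)

  ! Each iteration updates the same scalar accumulator, so the compiler has to
  ! treat the loop as a vector reduction (or leave it scalar).
  ksum = 0.0d0
  do k = 1, npts
     ksum = ksum + g1(k)*g2(k)*g3(k)*g4(k)*g5(k)
  end do
  print *, 'ksum = ', ksum
end program int4_sketch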

Hot function "eval_ops" shown with a DAXPY operation – good vectorization, but a very poor L1 hit rate due to non-stride-1 access of the 3D array nu79(:,op,i)
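A minimal sketch of this access pattern and of the loop interchange suggested in the performance summary at the end, assuming eval_ops accumulates nu79(:,op,i)*dofs(i) with the op loop outermost; only nu79 and dofs are named in the slides, while npts, nops, ndofs and temp are hypothetical:

! Minimal sketch (not M3DC1 source) of the access pattern and the suggested
! loop interchange.  Only nu79 and dofs are named in the slides; npts, nops,
! ndofs and temp are hypothetical.
program eval_ops_sketch
  implicit none
  integer, parameter :: npts = 79, nops = 6, ndofs = 12
  real(8) :: nu79(npts, nops, ndofs), dofs(ndofs), temp(npts, nops)
  integer :: op, i

  call random_number(nu79)
  call random_number(dofs)

  ! Original ordering (op outermost): for a fixed op, successive i jump by
  ! npts*nops elements through nu79, so each DAXPY starts in a distant memory
  ! region and L1 reuse is poor.
  temp = 0.0d0
  do op = 1, nops
     do i = 1, ndofs
        temp(:, op) = temp(:, op) + nu79(:, op, i) * dofs(i)
     end do
  end do

  ! Interchanged ordering (i outermost): for a fixed i, the slab nu79(:, 1:nops, i)
  ! is contiguous in memory, improving L1/L2 reuse at the small cost of
  ! re-reading dofs(i) in the inner loop.
  temp = 0.0d0
  do i = 1, ndofs
     do op = 1, nops
        temp(:, op) = temp(:, op) + nu79(:, op, i) * dofs(i)
     end do
  end do
  print *, 'checksum = ', sum(temp)
end program eval_ops_sketch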

Hot function "int3" shown with a reduction sum with multiply ops on 4 RHS vectors – noticeably large L1 cache miss rate, ~9%

Timing statistics, CPU loads of all 16 MPI ranks, and time distribution of hot functions using the "advanced-hotspots" metric (thread profiling) – overall CPU loads are very imbalanced

Performance profile summary over the total run time – no thread spinning (only 1 MPI rank per core); hot functions with the most run time

HW event sampling for all 16 MPI ranks, and hot functions and libraries with the most run time

Overall BW (r+w/r/w) utilization is ~9.5 GB/s for 16 MPI ranks, or ~0.6 GB/s per rank (thread). Scaling this per-thread rate to 4 threads per core on 59 cores (236 threads, 0.6 GB/s x 236 ≈ 140 GB/s) gives a rough estimate of ~140 GB/s or more of total BW usage for an optimized, hybrid M3DC1

Communication profile and statistics: MPI tracing using the Intel ITAC tool

Application and communication profile shown with top MPI calls

M3DC1: application and communication profile versus run time with 16 MPI ranks (blue: application, red: MPI) – the goal is to reduce the MPI cost (eliminate the red regions)

Large cost of MPI_Barrier() detected – note the zoomed-in run time showing the MPI communication in this time window

Large cost of MPI_Barrier() found – note the further zoom-in on the run time showing the application and MPI communication pattern and the MPI calls in this time window

Application and communication profile over a different time window – highly load-imbalanced due to stalls on the MPI communication side

Details of all MPI calls, with MPI_Iprobe() and MPI_Test() before MPI_Recv() and MPI_Issend() – large communication cost
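In the spirit of the tuning suggestions at the end, a minimal sketch (not M3DC1 code) of pre-posting non-blocking receives/sends and overlapping them with computation, instead of polling with MPI_Iprobe()/MPI_Test() before a blocking receive; the buffer size, tag and ring-neighbor exchange are hypothetical:

! Minimal sketch of pre-posting a non-blocking receive, overlapping it with
! computation, and completing with MPI_Waitall only when the data is needed.
! Buffer size, tag and the ring-neighbor pattern are hypothetical.
program overlap_sketch
  use mpi
  implicit none
  integer, parameter :: n = 100000
  real(8) :: sendbuf(n), recvbuf(n), local
  integer :: ierr, rank, nprocs, left, right, k
  integer :: reqs(2), stats(MPI_STATUS_SIZE, 2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  left  = mod(rank - 1 + nprocs, nprocs)
  right = mod(rank + 1, nprocs)
  sendbuf = real(rank, 8)

  ! Post the receive and the send before doing local work ...
  call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(2), ierr)

  ! ... overlap with computation that does not need recvbuf ...
  local = 0.0d0
  do k = 1, n
     local = local + sendbuf(k)**2
  end do

  ! ... and only wait when the received data is actually needed.
  call MPI_Waitall(2, reqs, stats, ierr)
  print *, 'rank', rank, 'local', local, 'recv(1)', recvbuf(1)
  call MPI_Finalize(ierr)
end program overlap_sketch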

Communication pattern and total time (seconds) spent in all MPI calls by each MPI rank over the entire run

Communication pattern and total message volume (bytes) transferred by all MPI calls for each MPI rank over the entire run

Communication pattern and transfer rate (bytes/sec) achieved by all MPI calls for each MPI rank over the entire run

Total time (sec) taken by collective communication per rank

MPI calls, number of calls, and time (sec) required for the MPI calls

Performance summary and code tuning suggestions

Overall serial performance on KNC:
– Vectorization is good (6 out of a maximum of 8 VPU lanes)
– Improving the L1 cache hit ratio is a must. Suggestion: interchange the op and i index loops to improve the access of the 3D array nu79(:,op,i), at the smaller cost of the vector dofs(i) (see the sketch after the "eval_ops" slide above)
– The reduction sums found in the hot functions int4 and int3, using multiply ops on either 4 or 5 RHS vectors, can be further vectorized by using partial sums ksumm(i) instead of a single ksum, plus an extra reduction loop over all ksumm(i) (see the sketch after this list)
– An estimate of the overall BW usage shows that M3DC1 will require more than 140 GB/s if we tune the code, add hybrid MPI+OpenMP, and run with 3 or 4 threads per core on all 59 cores

Overall parallel performance with MPI:
– Large MPI communication cost (32% of the overall time) due to probe, wait, and collective calls that block and/or synchronize all MPI ranks
– Schedule and post non-blocking (asynchronous) receives/sends ahead of time and overlap them with computation, avoiding blocking on the wait
– Reduce the number of collective calls where possible
– On Xeon Phi (KNC/KNL), hybrid programming with a small number of MPI ranks (1 to 4) and a large number of OpenMP threads will make use of all HW threads per core and certainly improve parallel performance
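A minimal sketch of the ksumm(i) partial-sum suggestion above, under the assumption that the int4/int3 reductions look like the scalar loop sketched earlier; the chunk width nlanes and the array names g1..g5 are hypothetical, and only ksum and ksumm come from the slides:

! Minimal sketch (not M3DC1 source) of replacing the scalar accumulator ksum
! with partial sums ksumm(i), plus a final reduction loop over ksumm(i).
! The chunk width nlanes and the array names g1..g5 are hypothetical.
program ksumm_sketch
  implicit none
  integer, parameter :: npts = 1024, nlanes = 8   ! 8 real(8) VPU lanes on KNC
  real(8) :: g1(npts), g2(npts), g3(npts), g4(npts), g5(npts)
  real(8) :: ksumm(nlanes), ksum
  integer :: k, i

  call random_number(g1); call random_number(g2); call random_number(g3)
  call random_number(g4); call random_number(g5)

  ! Accumulate into nlanes independent partial sums so each vector lane has
  ! its own accumulator and the loop vectorizes without a serial dependence.
  ! (npts is assumed to be a multiple of nlanes in this sketch.)
  ksumm = 0.0d0
  do k = 1, npts, nlanes
     do i = 0, nlanes - 1
        ksumm(i+1) = ksumm(i+1) + g1(k+i)*g2(k+i)*g3(k+i)*g4(k+i)*g5(k+i)
     end do
  end do

  ! Extra reduction loop over all ksumm(i), as suggested above.
  ksum = 0.0d0
  do i = 1, nlanes
     ksum = ksum + ksumm(i)
  end do
  print *, 'ksum = ', ksum
end program ksumm_sketch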

Using the VTune CLI (command line interface) amplxe-cl and ITAC on Babbage KNC

#!/bin/bash -l
#PBS -N 3D-NV=1
#PBS -l nodes=1
#PBS -l walltime=12:00:00
#PBS -q regular
#PBS -j oe

cd $PBS_O_WORKDIR
###module swap intel intel/
module swap impi impi/4.1.3

get_micfile
firstmic=$(head -1 micfile.$PBS_JOBID)

export ProjDir=/chos/global/scratch2/sd/tnphung/M3DC1_3D/m3dc1_3d
export SrcDir=$ProjDir/unstructured
export BinDir=$SrcDir/_bint-3d-opt-60
export MPIRanks=16

date
### time mpirun.mic -n $MPIRanks -host $firstmic $BinDir/m3dc1_3d -ipetsc -options_file options_bjacobi
date

Using the VTune CLI (command line interface) amplxe-cl and ITAC on Babbage KNC (continued)

module load vtune/2015.update2
export MPSSDir=/opt/mpss/3.4.1/sysroots/k1om-mpss-linux/lib64

### three choices for VTune preset metrics: general-exploration, bandwidth and advanced-hotspots
export VTuneMetric=general-exploration
### three corresponding result directories: GeneralExploration, AdvancedHotSpots and BandWidth
export VTuneResultDir=$ProjDir/3D-NV=1/VTuneResultDir/GeneralExploration

date
### same CLI for both bandwidth and advanced-hotspots, without any knob options
###echo 'Running amplxe-cl with ' $VTuneMetric
###amplxe-cl -collect $VTuneMetric -r $VTuneResultDir -target-system=mic-host-launch \
###    -source-search-dir $SrcDir \
###    -- mpirun.mic -n $MPIRanks -host $firstmic $BinDir/m3dc1_3d -ipetsc -options_file options_bjacobi
date

Using the VTune CLI (command line interface) amplxe-cl and ITAC on Babbage KNC (continued)

date
### knobs are only for the knc-general-exploration metric
###amplxe-cl -collect $VTuneMetric -r $VTuneResultDir \
###    -knob enable-vpu-metrics=true -knob enable-tlb-metrics=true \
###    -knob enable-l2-events=true -knob enable-cache-metrics=true \
###    -target-system=mic-host-launch -source-search-dir $SrcDir \
###    -- mpirun.mic -n $MPIRanks -host $firstmic $BinDir/m3dc1_3d -ipetsc -options_file options_bjacobi
date

Using the VTune CLI (command line interface) amplxe-cl and ITAC on Babbage KNC (continued)

#############################################################################################
#####                          ITAC FOR MPI TRACE SETTING                                #####
#############################################################################################
module load itac/8.1.update4
export VT_LOGFILE_FORMAT=stfsingle
export VT_LOGFILE_PREFIX=$ProjDir/3D-NV=1/MPITraceDir
echo 'Running ITAC and saving a single trace file to the path: ' $VT_LOGFILE_PREFIX

export I_MPI_MIC=1
echo 'MIC_LD_LIBRARY_PATH ' $MIC_LD_LIBRARY_PATH

date
time mpirun.mic -trace -n $MPIRanks -host $firstmic $BinDir/m3dc1_3d -ipetsc -options_file options_bjacobi
date

VTune CLI: final step to add the search dir path to the VTune results before analyzing them on a Windows or Mac laptop or desktop

amplxe-cl -finalize -r vtuneresultsdir -search-dir /opt/mpss/3.3.2/sysroots/k1om-mpss-linux/lib64

Help commands:

amplxe-cl -help
amplxe-cl -help collect
...