Jacobi solver status
Lucian Anton, Saif Mulla, Stef Salvini
CCP_ASEARCH meeting, October 8, 2013, Daresbury

Outline
- Code structure
  - Front end
  - Numerical kernels
  - Data collection
- Performance data
  - Intel SB
  - Xeon Phi
  - BlueGeneQ
  - GPU
8/10/13 Jacobi test program 2

Code structure
- Read input from command line
  - Grid sizes, length of iteration block, # of iteration blocks, ...
  - Algorithm to use
  - Output format (header, test iterations, ...)
- Initialize grid with an eigenvector of the Jacobi smoother
- Run several iteration blocks
- Collect min, max, average times

Build model
Uses a generic Makefile + platform/*.inc files:

F90 := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
       source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && mpiifort
CC  := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
       source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && icc
LANG = C

ifdef USE_MIC
  FMIC = -mmic
endif
ifdef USE_MPI
  FMPI = -DUSE_MPI
endif
ifdef USE_DOUBLE_PRECISION
  DOUBLE = -DUSE_DOUBLE_PRECISION
endif
ifdef USE_VEC1D
  VEC1D = -DUSE_VEC1D
endif
#FC = module add intel/comp intel/mpi && mpiifort

Command line parameters
arcmport01:~/Projects/HOMB> ./homb_c_gcc_debug_gpu.exe -help
Usage: [-ng ] [-nb ] [-np ] [-niter ] [-biter ] [-malign ] [-v] [-t] [-pc]
       [-model [num-waves] [threads-per-column]] [-nh] [-help]
arcmport01:~/Projects/HOMB> ./homb_c_gcc_debug_gpu.exe -model help
possible values for model parameter:
  baseline
  baseline-opt
  blocked
  wave num-waves threads-per-column
  basegpu
  optgpu
Note for wave model: if threads-per-column == 0, the diagonal wave kernel is used.

README file
A full explanation of the command line options is provided in README:

The following flags can be used to set the grid sizes and other run parameters:
-ng set the global grid sizes
-nb set the computational block size, relevant only for the blocked model.
Notes:
1) no sanity checks are done, you are on your own.
2) for the blocked model the OpenMP parallelism is done over computational blocks. One must ensure that there is enough work for all threads by setting suitable block sizes.

Correctness check
The -t flag checks that the norm ratios are close to the Jacobi smoother eigenvalue:

arcmport01:~/Projects/HOMB> ./homb_c_gcc_debug_gpu.exe -t -niter 7
Correctness check
iteration, norm ratio, deviation from eigenvalue
[seven rows of numeric output, garbled in the transcript]
# Last norm ...
# NThs Nx Ny Nz NITER minTime meanTime maxTime
[timing row, garbled in the transcript]

Algorithms
- Basic 3-loop iteration over the grid
  - OpenMP parallelism applied to the external loop
  - If condition eliminated from the inner loop
- Blocked iterations
- Wave iterations

Algorithms: wave details
[Diagram: wave sweep in the Z-Y plane, showing "Old" and "New" grid regions]

Algorithms: helping vectorisation
The inner loop can be replaced with an easier-to-vectorize function:

// 1D loop that helps the compiler to vectorize
static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                          const Real uWest[], const Real uEast[],
                          const Real uBottom[], const Real uTop[], Real w[])
{
  int i;
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
#ifdef __IBMC__
#pragma ibm independent_loop
#endif
  for (i = 0; i < n; ++i)
    w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] + uEast[i]
                  + uBottom[i] + uTop[i]);
}

Algorithms: CUDA
- Base laplace3D (from Mike's lecture notes)
- Shared memory in the XY plane
- ... more to come

Data collection
With such a large parameter space we have a big-ish data problem. Bash script + gnuplot:

index=0
for exe in $exe_list
do
  for model in $model_list
  do
    for nth in $threads_list
    do
      export OMP_NUM_THREADS=$nth
      for ((linsize=10; linsize <= max_linsize; linsize += step))
      do
        biter=$(((10*max_linsize)/linsize))
        niter=5
        if [ "$model" = wave ]
        then
          nwave="$biter $((nth<biter?nth:biter))"
          echo "model $model $nwave"
        else
          nwave=""
        fi
        if [ "$blk_x" -eq 0 ] ; then blk_xt=$linsize ; else blk_xt=$blk_x ; fi
        if [ "$blk_y" -eq 0 ] ; then blk_yt=$linsize ; else blk_yt=$blk_y ; fi
        if [ "$blk_z" -eq 0 ] ; then blk_zt=$linsize ; else blk_zt=$blk_z ; fi
        echo "./"$exe" -ng $linsize $linsize $linsize -nb $blk_xt $blk_yt $blk_zt -model $model $nwave
[script truncated in the transcript]

SandyBridge baseline

SB: blocked and wave

BGQ

Xeon Phi vs SandyBridge

Fermi data

Conclusions & To do
- We have an integrated set of Jacobi smoother algorithms
  - OpenMP, CUDA, MPI (almost)
  - Flexible build system
  - Run parameters can be selected from the command line and preprocessor flags
  - Correctness check
  - Scripted data collection
  - README file
- Tested on several systems (iDataPlex, BGQ, Emerald, ..., Mac OS laptop)
- GPU needs further improvements ...