High Performance Computing – CISC 811
Dr Rob Thacker, Dept of Physics (308A), thacker@physics

Today’s Lecture: HPC Libraries
- Part 1: Motivations and benefits, serial libraries
- Part 2: Parallel libraries, ACTS collection
- Part 3: Netlib, HPL

Part 1: HPC Libraries
- Motivations, benefits
- Serial HPC libraries

Is our current programming model viable?
"We need to move away from a coding style suited for serial machines, where every macrostep of an algorithm needs to be thought about and explicitly coded, to a higher-level style, where the compiler and library tools take care of the details. And the remarkable thing is, if we adopt this higher-level approach right now, even on today's machines, we will see immediate benefits in our productivity."
W. H. Press and S. A. Teukolsky, 1997, "Numerical Recipes: Does This Paradigm Have a Future?"

Motivations, concerns
In developing large applications, three significant issues must be addressed:
- Productivity: time to the first solution (prototype) and time to solution (production)
- Complexity: increasingly sophisticated models, which may need to link to other solvers
- Performance: increasingly complex algorithms and architectures
What strategies should be applied?
- Some goals appear mutually exclusive: chasing the best performance by tailoring every single part of the code reduces productivity

Unavoidable tension
- Scientists frequently need the highest performance
- Algorithms have long lifetimes (longer than hardware)
- Low-level programming vs. high-level programming

Library approach
Why not use libraries? (provided they suit your problem)
- Optimization: many library functions are assembly optimized
- Well tested: libraries are used by far more people than your local research group
- Support: commercial packages frequently come with online forums or email support
- Main drawback: loss of understanding of the code's inner workings
  - Is this really an issue? 99.9% of the software you use you didn’t write
  - You are also forced into using the library interface, but usually this is not a significant concern
- Secondary drawback may be cost

Library ownership
Three main possibilities:
- Public domain: most common for numerical software
- Commercial: becoming more common as universities attempt to gain from intellectual property
- Vendor specific: many of the big vendors release platform-specific optimized versions of the larger public domain packages

Potential benefits of libraries
- Allows easier collaboration (provided the library is freely available to everyone!)
- Software using GPL’d libraries can be released publicly as source code
- You can contribute improvements back to the user community
- Source-based libraries can be adapted to your needs
- Bottom line: your time to solution is reduced!

Bugs are a serious issue…
On June 4, 1996, an Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off from Kourou, French Guiana. The rocket was on its first voyage, after a decade of development costing $7 billion. The problem was a software error in the inertial reference system. Specifically, a 64 bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16 bit signed integer. The number was larger than 32,767, the largest value a 16 bit signed integer can hold, so the conversion failed.
On August 23, 1991, the first concrete base structure for the Sleipner A platform sprang a leak and sank under a controlled ballasting operation during preparation for deck mating in Gandsfjorden outside Stavanger, Norway. The post-accident investigation traced the error to an inaccurate finite element approximation of the linear elastic model of the tricell (using the popular finite element program NASTRAN). The shear stresses were underestimated by 47%, leading to insufficient design. In particular, certain concrete walls were not thick enough.
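The Ariane 5 failure mode, narrowing a 64 bit floating point value into a 16 bit signed integer, can be guarded against with an explicit range check. The following is a minimal sketch in C, not the flight software's logic; the function name double_to_int16_checked is hypothetical and chosen only for illustration.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert a double to int16_t only when the value,
 * truncated toward zero, fits in the 16 bit signed range.
 * Returns false (and leaves *out untouched) for NaN or out-of-range input. */
bool double_to_int16_checked(double value, int16_t *out)
{
    if (isnan(value))
        return false;
    /* Anything in the open interval (-32769, 32768) truncates into range. */
    if (value <= (double)INT16_MIN - 1.0 || value >= (double)INT16_MAX + 1.0)
        return false;
    *out = (int16_t)value;
    return true;
}
```

A caller would test the return value and take a recovery path on failure, rather than letting an unchecked conversion raise an error or silently produce garbage.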

Something to think about…
- ~20 years ago: 1x10^6 floating point ops/sec (Mflop/s)
  - Scalar based
- ~10 years ago: 1x10^9 floating point ops/sec (Gflop/s)
  - Vector & shared memory computing, bandwidth aware
  - Block partitioned, latency tolerant
- ~Today: 1x10^12 floating point ops/sec (Tflop/s)
  - Highly parallel, distributed processing, message passing, network based
  - Data decomposition, communication/computation
- Coming soon: 1x10^15 floating point ops/sec (Pflop/s)
  - Many more levels of memory hierarchy, combination of grids & HPC
  - More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
- Application codes will need to address these issues

The Evolving Performance Gap
Peak performance is skyrocketing
- In the 1990s, peak performance increased 100x; in the 2000s, it will increase 1000x
But
- Efficiency for many science applications declined from 40-50% on the vector supercomputers of the 1990s to as little as 5-10% on the parallel supercomputers of today
Need research on
- Mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
- More efficient programming models for massively parallel supercomputers
[Figure: peak performance vs. real performance in Teraflops (log scale, 0.1 to 1,000), 1996 to 2004, showing the widening performance gap]
We don’t want everyone working on the same problem though!

Notable Public Domain Numerical Libraries
- LAPACK: linear equations, eigenproblems
- BLAS: fast linear algebra kernels
- LINPACK: linear equation solving (now incorporated in LAPACK)
- ODEPACK: ordinary differential equation solving (see also the DASSL toolkit)
- QUADPACK: numerical quadrature
- ITPACK: sparse problems
- PIM: linear systems
Check out mathtools.net for a vast list of libraries

Basic Linear Algebra Subprograms (BLAS)
- FORTRAN library of simple subroutines which can be used to build more sophisticated linear algebra programs (dates back to the 1970s)
- BLAS is divided into four types and three levels
  - Types: single, double, complex and double complex
  - Level 1 (vector-vector operations)
  - Level 2 (matrix-vector operations)
  - Level 3 (matrix-matrix operations)
- Functions are prefixed with the type of the variables: s, d, c, or z for single, double, complex, or double complex

BLAS routines
Some of the BLAS 1 subprograms are:
- xCOPY - copy one vector to another
- xSWAP - swap two vectors
- xSCAL - scale a vector by a constant
- xAXPY - add a multiple of one vector to another
- xDOT - inner product
- xASUM - 1-norm of a vector
- xNRM2 - 2-norm of a vector
- IxAMAX - find the index of the entry of largest absolute value in a vector
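To make the Level 1 operations concrete, here is a plain C sketch of what xAXPY and xDOT compute for double precision data. These are unoptimized reference loops, not the library routines themselves; the names ref_daxpy and ref_ddot are hypothetical, and only non-negative strides are handled (the real routines also accept negative increments).

```c
#include <stddef.h>

/* y := alpha*x + y, the operation performed by xAXPY (doubles here).
 * incx and incy are the strides between consecutive vector elements. */
void ref_daxpy(size_t n, double alpha, const double *x, size_t incx,
               double *y, size_t incy)
{
    for (size_t i = 0; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}

/* Inner product of x and y, the operation performed by xDOT. */
double ref_ddot(size_t n, const double *x, size_t incx,
                const double *y, size_t incy)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
}
```

An optimized BLAS performs the same arithmetic but with unrolling, vector instructions and cache-aware scheduling, which is where the speed advantage comes from.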

Levels 2 & 3
Some of the BLAS 2 subprograms are:
- xGEMV - general matrix-vector multiplication
- xGER - general rank-1 update
- xSYR2 - symmetric rank-2 update
- xTRSV - solve a triangular system of equations
Some of the BLAS 3 subprograms are:
- xGEMM - general matrix-matrix multiplication
- xSYMM - symmetric matrix-matrix multiplication
- xSYRK - symmetric rank-k update
- xSYR2K - symmetric rank-2k update
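As an illustration of a Level 3 operation, the sketch below spells out the update computed by xGEMM when neither matrix is transposed: C := alpha*A*B + beta*C, with matrices stored column-major as FORTRAN expects. It is a naive triple loop for reference only; the name ref_dgemm_nn is hypothetical and the real routine takes additional transpose arguments.

```c
#include <stddef.h>

/* C := alpha*A*B + beta*C for column-major matrices:
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
 * Element (i, j) of a column-major matrix M lives at M[i + j*ld]. */
void ref_dgemm_nn(size_t m, size_t n, size_t k, double alpha,
                  const double *a, size_t lda,
                  const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * sum + beta * c[i + j * ldc];
        }
    }
}
```

Level 3 routines do O(n^3) arithmetic on O(n^2) data, which gives an optimized implementation room to reuse data in cache; this naive loop does not attempt that.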

Tuning Advantages
PHiPAC: C = A * B
Linear algebra is always faster using an optimized library!
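One common tuning technique behind optimized matrix multiply is cache blocking (tiling). The sketch below is an assumption about the kind of transformation such libraries apply, not code from PHiPAC or any vendor library; the name blocked_matmul and the block-size parameter bs are hypothetical. It accumulates C += A*B for square column-major matrices, one tile at a time.

```c
#include <stddef.h>

/* C += A*B for n x n column-major matrices, processed in bs x bs tiles
 * so that each tile of A, B and C can stay in cache while it is reused.
 * bs must be at least 1; n need not be a multiple of bs. */
void blocked_matmul(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    for (size_t jj = 0; jj < n; jj += bs) {
        size_t j_end = (jj + bs < n) ? jj + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = (kk + bs < n) ? kk + bs : n;
            for (size_t ii = 0; ii < n; ii += bs) {
                size_t i_end = (ii + bs < n) ? ii + bs : n;
                /* Multiply one tile of A by one tile of B. */
                for (size_t j = jj; j < j_end; j++)
                    for (size_t k = kk; k < k_end; k++) {
                        double bkj = b[k + j * n];
                        for (size_t i = ii; i < i_end; i++)
                            c[i + j * n] += a[i + k * n] * bkj;
                    }
            }
        }
    }
}
```

The best block size depends on the cache sizes of the target machine, which is exactly the kind of parameter that automatically tuned libraries search for at build time.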

BLAS and C
- CBLAS is a C version of the libraries
  - Available from Netlib
- However, you can still call the FORTRAN versions from C
  - You will need to declare the involved BLAS routine as “extern”
  extern void dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda, double *x, int *incx, double *beta, double *y, int *incy);
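Two details matter when calling a FORTRAN BLAS routine from C: every argument is passed by address (hence the pointers in the declaration), and matrices must be laid out column-major. The trailing underscore in the routine name is a common compiler convention rather than a guarantee. The sketch below covers only the layout step; the helper name to_column_major is hypothetical.

```c
#include <stddef.h>

/* Copy a rows x cols matrix stored row-major (the usual C layout) into
 * column-major order (the FORTRAN layout the BLAS routines expect).
 * Element (i, j) moves from row_major[i*cols + j] to col_major[i + j*rows]. */
void to_column_major(size_t rows, size_t cols,
                     const double *row_major, double *col_major)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            col_major[i + j * rows] = row_major[i * cols + j];
}
```

The converted array can then be passed as the matrix argument with the leading dimension set to the number of rows. An alternative that avoids the copy is to leave the data row-major and ask the routine for the transposed operation.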

VSIPL
www.vsipl.org
- Vector Signal and Image Processing Library
- Origins in defence contracts to produce an API for embedded programming
- Developed in C, bindings for C++ under development
- Main functionality
  - Vector-based frequency domain analysis routines

LAPACK
- BLAS is used as the building block for the Linear Algebra Package, LAPACK
- Website describing and distributing a portable version of the library: http://www.netlib.org/lapack/
  - Includes online manual: http://www.netlib.org/lapack/lug/index.html
- Vendors frequently distribute their own assembly-level optimized versions of the library (e.g. Intel MKL and AMD ACML)
- The library consists of a set of higher-level linear algebra functions, with interfaces described at: http://www.netlib.org/lapack/individualroutines.html

LAPACK
- There are a very large number of linear algebra subroutines available in LAPACK
- All follow an XYYZZZ format, where X denotes the datatype, YY the type of matrix and ZZZ the computation performed (ZZZ may be two or three characters). For example:
  - dgetrf computes the LU factorization of a matrix (d=double, ge=general, trf=triangular factorization)
  - dgetrs uses the LU factorization from dgetrf to solve a system
  - dgetri uses the LU factorization from dgetrf to compute the inverse of a matrix
  - dgesv is essentially a combined call to dgetrf and dgetrs
  - dgeev computes the eigenvalues (and, optionally, eigenvectors) of a general matrix
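The XYYZZZ naming scheme is regular enough to decode mechanically. The following sketch splits a routine name into its three fields; it is an illustration of the convention only, the type lapack_name and function parse_lapack_name are hypothetical, and it does not check that the resulting routine actually exists in LAPACK.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for the three fields of an XYYZZZ routine name. */
typedef struct {
    char type;       /* X: s, d, c or z */
    char matrix[3];  /* YY: matrix type, e.g. "ge" for general */
    char op[4];      /* ZZZ: computation, two or three characters */
} lapack_name;

/* Split a name such as "dgetrf" into its fields (lower-cased).
 * Returns 0 on success, -1 if the name does not fit the pattern. */
int parse_lapack_name(const char *name, lapack_name *out)
{
    size_t len = strlen(name);
    if (len < 5 || len > 6)
        return -1;
    char type = (char)tolower((unsigned char)name[0]);
    if (type != 's' && type != 'd' && type != 'c' && type != 'z')
        return -1;
    out->type = type;
    out->matrix[0] = (char)tolower((unsigned char)name[1]);
    out->matrix[1] = (char)tolower((unsigned char)name[2]);
    out->matrix[2] = '\0';
    size_t oplen = len - 3;
    for (size_t i = 0; i < oplen; i++)
        out->op[i] = (char)tolower((unsigned char)name[3 + i]);
    out->op[oplen] = '\0';
    return 0;
}
```

For example, "dgetrf" decodes to type d, matrix ge, operation trf, matching the breakdown given on the slide.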

GNU Scientific Library
http://www.gnu.org/software/gsl/
- GSL is a numerical library for C and C++ programmers
- Free software, available under the GNU GPL
- The library provides a wide range of mathematical routines, e.g.
  - random number generators
  - special functions
  - least-squares fitting
- There are over 1000 functions in total
- The project was conceived in 1996 by Dr M. Galassi and Dr J. Theiler of Los Alamos National Laboratory

GSL Features
- The library uses an object-oriented design
  - Different algorithms can be plugged in easily or changed at run-time without recompiling the program
- It is intended for ordinary scientific users
  - Users with a knowledge of C programming will be able to use the library quickly
- Interface is designed to be simple to link into very high-level languages, such as GNU Guile or Python
- Library is thread-safe
- Many of the routines are C reimplementations of FORTRAN routines (e.g. FFTPACK)
  - Modern coding conventions and optimizations have been applied

Full list of functions
Complex Numbers, Roots of Polynomials, Special Functions
Vectors and Matrices, Permutations, Sorting
BLAS Support, Linear Algebra, Eigensystems
Fast Fourier Transforms, Quadrature, Random Numbers
Quasi-Random Sequences, Random Distributions, Statistics
Histograms, N-Tuples, Monte Carlo Integration
Simulated Annealing, Differential Equations, Interpolation
Numerical Differentiation, Chebyshev Approximation, Series Acceleration
Discrete Hankel Transforms, Root-Finding, Minimization
Least-Squares Fitting, Physical Constants, IEEE Floating-Point

Compiling and Linking
- The library header files are installed in their own `gsl' directory
- Include statements need the `gsl/' directory prefix, e.g. #include <gsl/gsl_math.h>
- Compile objects first: gcc -c example.c
- Then link: gcc example.o -lgsl -lgslcblas -lm

FFTW
http://www.fftw.org
- “Fastest Fourier Transform in the West”
- Authored by Frigo and Johnson at MIT
- C subroutine library for discrete Fourier transforms
- Portable
- Multiple dimensions
- Arbitrary input sizes, real and complex transforms
  - Small prime factors are best though
- Discrete cosine and sine transforms
- Parallel versions available (both shared memory (pthreads) and distributed memory (MPI))
- C and FORTRAN API
- Supports SIMD extensions (e.g. SSE)
- Self-tuning
  - Contains many different FFT algorithms and the optimal one is chosen at runtime
- Has undergone a number of evolutions, and is now at version 3.0
- Won the 1999 J. H. Wilkinson Prize for Numerical Software

Using FFTW
- Need to include header files
  - C: #include <fftw3.h>
  - FORTRAN: include “fftw3.f”
- Must also link to libraries
  - -lfftw3 -lm, but you may also need to specify the library path (installation dependent)
- Having created arrays (“in” and “out”), you must create a “plan”
  - C: plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
  - FORTRAN: call dfftw_plan_dft_1d(plan, N, in, out, FFTW_FORWARD, FFTW_ESTIMATE)
  - The precise plan routine depends on the FFT operation you wish to perform
  - The call to the planner allows the system to evaluate the architecture and the transform, and then optimize the algorithm to be used in the FFT
- Having created the plan, the transform is executed by calling fftw_execute(plan)
- See the FFTW website for precise details

2 GHz Opteron speeds

Parallel FFTW: Shared Memory
- FFTW includes a pthreads-based SMP library and can also be compiled with OpenMP support on platforms where it is available
  - On HPCVL it is compiled with OpenMP support
- Threaded version requires additional memory
- Call fftw_init_threads() before using the threaded version
- SMP parallel plans require knowledge of how many threads are going to be used
  - Call fftw_plan_with_nthreads(nthreads)
  - Plans are specific to the number of threads: if you change the number of threads you must create a new plan
- When work is completed you must call fftw_cleanup_threads() to deallocate memory for threads
- At the linking stage you must also include the parallel library: -lfftw3_threads
- Note that only fftw_execute uses a parallel region

Parallel FFTW: MPI
- Only available for the older 2.x libraries, which have a different API
- MPI data decomposition is “slab” based
  - For 3d arrays this is potentially limiting: you can use at most L processors for an L^3 array
  - However, communication costs are high, so this is not often a significant barrier
- Uses the MPI_Alltoall primitive, which can occasionally lead to poor performance (depends on the MPI implementation)
- Must enable support when FFTW is compiled and must also link with -lfftw_mpi
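The processor limit of a slab decomposition follows from simple arithmetic: the array is cut along one dimension only, so each process owns a contiguous range of whole planes. The sketch below shows one even split; it is an illustration under that assumption and not FFTW's own distribution routine, and the names slab and slab_for_rank are hypothetical.

```c
/* Hypothetical description of the planes owned by one process. */
typedef struct {
    int start;  /* index of the first plane owned */
    int count;  /* number of planes owned (may be zero) */
} slab;

/* Split nplanes planes as evenly as possible over nprocs processes.
 * The first (nplanes % nprocs) ranks receive one extra plane. */
slab slab_for_rank(int nplanes, int nprocs, int rank)
{
    int base = nplanes / nprocs;
    int extra = nplanes % nprocs;
    slab s;
    s.count = base + (rank < extra ? 1 : 0);
    s.start = rank * base + (rank < extra ? rank : extra);
    return s;
}
```

With more processes than planes, the surplus ranks receive a count of zero and sit idle, which is why an L^3 array cannot usefully occupy more than L processes.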

Example code
#include <fftw_mpi.h>
int main(int argc, char **argv)
{
    const int NX = ..., NY = ...;
    fftwnd_mpi_plan plan;
    fftw_complex *data;
    MPI_Init(&argc, &argv);
    /* create a plan for a 2d forward transform distributed over all processes */
    plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY, FFTW_FORWARD, FFTW_ESTIMATE);
    ...allocate and initialize data...
    /* execute the distributed transform */
    fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);
    ...
    fftwnd_mpi_destroy_plan(plan);
    MPI_Finalize();
}

(Old) Performance results on T3D for MPI transform (3d, complex)

AMD Core Math Library (ACML)
- Developed in collaboration with the Numerical Algorithms Group (NAG)
- Latest version = 3.1
- Distribution via registration (but they have never sent me spam!)
- 32 bit (Athlon) and 64 bit (Opteron) versions
- Cannot be linked with the Intel 8.1 compiler though (turf war!)
  - Forces you to use Intel MKL
- Exploits knowledge of cache architecture to improve execution speed

ACML Components: Linear Algebra, FFTs
- Basic Linear Algebra Subroutines (BLAS)
  - Level 1 (vector-vector operations)
  - Level 2 (matrix-vector operations)
  - Level 3 (matrix-matrix operations)
  - Plus routines for sparse vectors
- Linear Algebra PACKage (LAPACK)
  - 28 (threaded) routines
  - Use BLAS to perform complex operations
- Scalable LAPACK (ScaLAPACK, MPI parallel LAPACK) also included
  - Must provide your own MPI implementation (see part 2)
- FFTs
  - 1D, 2D, single and double precision, plus all combinations of real-to-complex etc.
- C and FORTRAN APIs

Intel Math Kernel Library
- Version 9.0 recently released
- Free for non-commercial use
  - Students come under this banner, but faculty do not!
  - Graduate students are becoming a grey area…
- Online support forum
- Library functions:
  - Linear Algebra - BLAS and LAPACK
  - Linear Algebra - PARDISO Sparse Solver
  - Discrete Fourier Transforms
  - Vector Math Library
  - Vector Statistical Library (random number generators)

Cluster Math Kernel Library
- Adds ScaLAPACK and parallel BLAS routines to MKL
- Roughly 20% performance improvement over the Netlib distribution of ScaLAPACK

PESSL, SCSL & CXML
- SGI provide their SCSL library free
  - “Scientific Computing Software Library”
  - Provides the same basic features as ACML (linear algebra)
  - Ported to Altix systems, but compare its speed to Intel MKL before using it
- PESSL is IBM’s parallel library
  - “Parallel Engineering and Scientific Subroutine Library”
  - Again, same basic features as ACML, and also includes a random number generator
- CXML is Compaq’s library for the AlphaServer

Random Number Generators
- Numerical Recipes RAN2 and RAN3 are both reasonable RNGs
  - Note RAN3 does fail some of the more esoteric tests
- GSL library provides over 40 different generators
  - Includes Knuth’s algorithms
  - Mersenne Twister as well

Mersenne Twister
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
- Developed by Matsumoto and Nishimura
- Period is 2^19937-1 (approximately 4 x 10^6001)
- 623-dimensional equidistribution property is assured
- Fast generation
  - The algorithm behind C rand() has since been replaced, so there is now not much difference in speed
- Efficient use of memory
  - The reference C code mt19937.c consumes only 624 words of working area
- Currently the generator of choice for most problems (except crypto)
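The 624-word working area is easiest to appreciate in code. Below is a compact sketch of the 32 bit MT19937 generator written from the published algorithm; it is not a copy of mt19937.c, the names mt19937_state, mt19937_seed and mt19937_next are hypothetical, and for real work a library implementation such as the one in GSL is the safer choice.

```c
#include <stdint.h>

#define MT_N 624  /* words of state */
#define MT_M 397  /* offset used in the recurrence */

typedef struct {
    uint32_t state[MT_N];
    int index;  /* next word to output; MT_N means "regenerate first" */
} mt19937_state;

/* Fill the state from a 32 bit seed using the standard initializer. */
void mt19937_seed(mt19937_state *mt, uint32_t seed)
{
    mt->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = mt->state[i - 1];
        mt->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    mt->index = MT_N;
}

/* Return the next 32 bit output, regenerating the state when exhausted. */
uint32_t mt19937_next(mt19937_state *mt)
{
    if (mt->index >= MT_N) {
        for (int i = 0; i < MT_N; i++) {
            /* Combine the top bit of word i with the low 31 bits of word i+1. */
            uint32_t y = (mt->state[i] & 0x80000000u)
                       | (mt->state[(i + 1) % MT_N] & 0x7fffffffu);
            uint32_t next = mt->state[(i + MT_M) % MT_N] ^ (y >> 1);
            if (y & 1u)
                next ^= 0x9908b0dfu;
            mt->state[i] = next;
        }
        mt->index = 0;
    }
    /* Tempering improves the equidistribution of the output bits. */
    uint32_t y = mt->state[mt->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

Keeping the state in a struct rather than in globals lets each thread or simulation stream own an independent generator.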

Summary Part 1
Libraries offer a number of benefits:
- Optimization
- Robustness
- Portability
- Time to solution improvements

Part 2: Parallel Libraries
- BLACS
- ACTS collection
- ScaLAPACK

BLACS
- Basic Linear Algebra Communication Subprograms
- Conceptual aid in design and coding (design tool)
- Associates widely known mnemonic names with communication operations
  - Improves readability and provides a standard interface
  - “Self documentation”

BLACS data decomposition
2d processor grid (process numbers by grid row and column):
         col 0  col 1  col 2
  row 0    0      1      2
  row 1    3      4      5
  row 2    6      7      8
Types of BLACS routines: point-to-point communication, broadcast, combine operations and support routines.
Communication modes:
- All processes in a row
- All processes in a column
- All grid processes
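The mapping between a process number and its position in a 2d processor grid is plain integer arithmetic. The sketch below assumes a row-major numbering of the processes; the names grid_coord, grid_coord_row_major and grid_rank_row_major are hypothetical and are not BLACS routines (in a real program BLACS_GRIDINFO reports the coordinates).

```c
/* Hypothetical pair of grid coordinates for one process. */
typedef struct {
    int row;
    int col;
} grid_coord;

/* Row-major numbering: processes fill each grid row before the next.
 * npcol is the number of process columns in the grid. */
grid_coord grid_coord_row_major(int rank, int npcol)
{
    grid_coord g;
    g.row = rank / npcol;
    g.col = rank % npcol;
    return g;
}

/* Inverse mapping from grid coordinates back to the process number. */
int grid_rank_row_major(int row, int col, int npcol)
{
    return row * npcol + col;
}
```

On a grid with three process columns, process 5 sits at row 1, column 2, and a row-scoped operation from that process involves processes 3, 4 and 5.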

Communication Routines
- Send/Receive: send a (sub)matrix from one process to another:
  _xxSD2D(ICTXT, [UPLO,DIAG], M, N, A, LDA, RDEST, CDEST)
  _xxRV2D(ICTXT, [UPLO,DIAG], M, N, A, LDA, RSRC, CSRC)
- _ denotes datatype: I (integer), S (single), D (double), C (complex), Z (double complex)
- xx denotes matrix type: GE = general, TR = trapezoidal

Point-to-Point example
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL,
     &                     MYROW, MYCOL )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         CALL DGESD2D( ICTXT, 5, 1, X, 5, 1, 0 )
      ELSE IF( MYROW.EQ.1 .AND. MYCOL.EQ.0 ) THEN
         CALL DGERV2D( ICTXT, 5, 1, Y, 5, 0, 0 )
      END IF

Contexts
- The concept of a communicator is embedded within BLACS as a “context”
- Contexts are thus the mechanism by which you:
  - Create arbitrary groups of processes upon which to execute
  - Create an indeterminate number of overlapping or disjoint grids
  - Isolate each grid so that grids do not interfere with each other
- Initialization routines return a context (integer) which is then passed to the communication routines
  - Equivalent to specifying COMM in MPI calls

ID-less communication
- Messages with BLACS are tagless
  - Message IDs are generated internally within the library
- Why is this an issue?
  - If tags are not unique it is possible to create non-deterministic behaviour (race conditions on message arrival)
- BLACS allows the user to specify what range of IDs the BLACS can use
  - This ensures it can be used with other packages

ACTS Collection
- “Advanced CompuTational Software”
  - Set of software tools
- US Department of Energy program, run in conjunction with NSF and DARPA
- Extended support for experimental software
  - Provide technical support (acts-support@nersc.gov)
  - Maintain ACTS information center (http://acts.nersc.gov)
  - Coordinate efforts with US supercomputing centers
  - Enable large scale scientific applications
  - Educate and train
- Unclear how far support extends beyond US borders, although there are registered users across the globe

ACTS is a guided project

ACTS Motivation
Large Scientific Codes: A Common Programming Practice
Code components: algorithmic implementations, application data layout, control, I/O, tuned and machine-dependent modules
Maintenance costs noted on the diagram:
- New architecture or S/W: extensive tuning, may require new programming paradigms, difficult to maintain!
- New architecture: extensive rewriting
- New or extended physics: extensive rewriting or increased overhead
- New architecture: may or may not need rewriting
- New developments: difficult to compare
- New architecture: minimal to extensive rewriting

The ACTS (“ideal”) Approach
[Diagram: the USER's APPLICATION CODE (Main Control) sits on top; Algorithmic Implementations, Application Data Layout, I/O, and Tuned and machine-dependent modules are each supplied by AVAILABLE LIBRARIES & PACKAGES.]

ACTS Tools and functions
(Category / Tool: Functionalities)

Numerical
  Aztec: Algorithms for the iterative solution of large sparse linear systems.
  Hypre: Algorithms for the iterative solution of large sparse linear systems, intuitive grid-centric interfaces, and dynamic configuration of parameters.
  PETSc: Tools for the solution of PDEs that require solving large-scale, sparse linear and nonlinear systems of equations.
  OPT++: Object-oriented nonlinear optimization package.
  SUNDIALS: Solvers for the solution of systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
  ScaLAPACK: Library of high performance dense linear algebra routines for distributed-memory message-passing.
  SuperLU: General-purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations.
  TAO: Large-scale optimization software, including nonlinear least squares, unconstrained minimization, bound constrained optimization, and general nonlinear optimization.
Code Development
  Global Arrays: Library for writing parallel programs that use large arrays distributed across processing nodes and that offers a shared-memory view of distributed arrays.
  Overture: Object-oriented tools for solving computational fluid dynamics and combustion problems in complex geometries.
Code Execution
  CUMULVS: Framework that enables programmers to incorporate fault-tolerance, interactive visualization and computational steering into existing parallel programs.
  Globus: Services for the creation of computational Grids and tools with which applications can be developed to access the Grid.
  PAWS: Framework for coupling parallel applications within a component-like model.
  SILOON: Tools and run-time support for building easy-to-use external interfaces to existing numerical codes.
  TAU: Set of tools for analyzing the performance of C, C++, Fortran and Java programs.
Library Development
  ATLAS and PHiPAC: Tools for the automatic generation of optimized numerical software for modern computer architectures and compilers.
  PETE: Extensible implementation of the expression template technique (C++ technique for passing expressions as function arguments).

ATLAS
Automatically Tuned Linear Algebra Software
Another University of Tennessee project!
  Largely an unsupported project though
http://math-atlas.sourceforge.net/
Provides a subset of both BLAS and LAPACK functionality
Provided the foundation for work on BLAS and LAPACK in AMD's ACML
Takes optimization a step further by giving the computer itself possibilities for optimization at compile time
  “AEOS”: Automated Empirical Optimization of Software
Similar motivation to FFTW
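The idea of empirical optimization can be illustrated with a very small sketch: time a blocked matrix multiply for several candidate block sizes on the machine at hand and keep the fastest. This is not ATLAS code; the function names (`matmul_blocked`, `time_candidate`, `pick_best_block`), the candidate list and the use of `clock()` as the timer are all assumptions made for illustration, and a real tuner searches a much larger space of kernels and parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* C += A * B for n x n row-major matrices, processed in nb x nb tiles. */
void matmul_blocked(size_t n, size_t nb,
                    const double *A, const double *B, double *C)
{
    assert(nb > 0);
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t kk = 0; kk < n; kk += nb)
            for (size_t jj = 0; jj < n; jj += nb) {
                size_t imax = ii + nb < n ? ii + nb : n;
                size_t kmax = kk + nb < n ? kk + nb : n;
                size_t jmax = jj + nb < n ? jj + nb : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}

/* Time one candidate block size; returns CPU seconds for one multiply. */
double time_candidate(size_t n, size_t nb,
                      const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    clock_t t0 = clock();
    matmul_blocked(n, nb, A, B, C);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Empirical search: try every candidate and keep the fastest one.
 * Returns 0 if the work arrays could not be allocated. */
size_t pick_best_block(size_t n, const size_t *cands, size_t ncands)
{
    assert(ncands > 0);
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = malloc(n * n * sizeof *C);
    size_t best = 0;
    if (A && B && C) {
        for (size_t i = 0; i < n * n; i++) {   /* arbitrary test data */
            A[i] = (double)(i % 7);
            B[i] = (double)(i % 5);
        }
        double best_t = -1.0;
        for (size_t c = 0; c < ncands; c++) {
            double t = time_candidate(n, cands[c], A, B, C);
            if (best_t < 0.0 || t < best_t) {
                best_t = t;
                best = cands[c];
            }
        }
    }
    free(A);
    free(B);
    free(C);
    return best;
}
```

A call such as `pick_best_block(512, cands, 4)` with `cands = {16, 32, 64, 128}` returns whichever block size ran fastest on that machine; the answer is expected to differ between architectures, which is the point of doing the search at build time.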

ATLAS Benchmarks
DGEMM performance:

ARCH                      ATLAS   COMP         % Peak   PEAK (Gflop)
900MHz Itanium2           3.6.0   icc          90%      3.6
1.6GHz Opteron            3.6.0   gcc          88%      3.2
1062MHz UltraSPARC III    3.7.8   gcc 3.3      82%      2.124
600MHz Athlon             3.5.7   gcc 2.95.3   80%      1.2
2.8GHz Pentium4E          3.7.3   gcc 3.3.2    77%      5.6
2.6GHz Pentium4           3.6.0   gcc          77%      5.2
1GHz PentiumIII           3.7.7   gcc 2.95.3   76%      1
1GHz Efficeon             3.7.7   gcc 3.2      60%      2

PETSc
Portable, Extensible Toolkit for Scientific Computation
http://www-unix.mcs.anl.gov/petsc/petsc-as/
Argonne lab development
Suite of data structures and routines for the scalable (parallel) solution of PDEs
Intended for use in large-scale application projects
  Not a black box solution though
Easily interfaces with solvers written in C, FORTRAN and C++
All components are designed to be interoperable
Works in distributed memory environment using MPI

Levels of Abstraction in Mathematical Software
Application-specific interface
  Programmer manipulates objects associated with the application
High-level mathematics interface
  Programmer manipulates mathematical objects
    Weak forms, boundary conditions, meshes
Algorithmic and discrete mathematics interface
  Programmer manipulates mathematical objects
    Sparse matrices, nonlinear equations
  Programmer manipulates algorithmic objects
    Solvers
Low-level computational kernels
  BLAS-type operations
  FFT
(Slide label: PETSc emphasis)

Features
Parallel vectors
  scatters
  gathers
Parallel matrices
  several sparse storage formats
  easy, efficient assembly
Scalable parallel preconditioners
Krylov subspace methods
Parallel Newton-based nonlinear solvers
Parallel timestepping (ODE) solvers
Complete documentation
Automatic profiling of floating point and memory usage
Consistent interface
Intensive error checking
Portable to UNIX and Windows
Over one hundred examples
PETSc is supported and will be actively enhanced for the next several years.

Structure of PETSc – Layered Approach
[Diagram of the PETSc software layers. Labels on the slide:]
  PETSc PDE Application Codes
  ODE Integrators; Visualization Interface
  Nonlinear Solvers
  Linear Solvers: Preconditioners + Krylov Methods
  Object-Oriented Matrices, Vectors, Indices; Grid Management
  Profiling Interface
  Computation and Communication Kernels: MPI, MPI-IO, BLAS, LAPACK

Functionality example: selected vector operations

A Complete PETSc Program

#include "petscvec.h"

/* Calls follow the PETSc release current when the slides were written;
   argument order and some type names differ in later releases. */
int main(int argc, char **argv)
{
  Vec         x;
  int         n = 20, ierr;
  PetscTruth  flg;
  PetscScalar one = 1.0, dot;

  PetscInitialize(&argc, &argv, 0, 0);                  /* also starts MPI */
  PetscOptionsGetInt(PETSC_NULL, "-n", &n, PETSC_NULL); /* optional -n <size> */
  VecCreate(PETSC_COMM_WORLD, &x);                      /* parallel vector */
  VecSetSizes(x, PETSC_DECIDE, n);                      /* global length n */
  VecSetFromOptions(x);
  VecSet(&one, x);                                      /* x[i] = 1 */
  VecDot(x, x, &dot);                                   /* dot = n */
  PetscPrintf(PETSC_COMM_WORLD, "Vector length %d\n", (int)dot);
  VecDestroy(x);
  PetscFinalize();
  return 0;
}

TAO
Toolkit for Advanced Optimization
http://www-unix.mcs.anl.gov/tao/
Another Argonne project
Aimed at the solution of large-scale optimization problems on high-performance architectures
Suitable for both single-processor and massively-parallel architectures
Object-oriented approach
Interoperable with other toolkits (PETSc for example)

Functionality
Systems of nonlinear equations
Nonlinear least squares
Bound-constrained optimization
Linear and quadratic programming
Nonlinearly constrained optimization
Combinatorial optimization
Stochastic optimization
Global optimization

Example program

TAO            tao;          /* optimization solver */
Mat            H;            /* Hessian matrix */
Vec            x, g;         /* solution and gradient vectors */
double         f;            /* function to minimize */
int            n;            /* number of variables */
ApplicationCtx usercontext;  /* user-defined context */

MatCreate(MPI_COMM_WORLD, n, n, &H);
VecCreate(MPI_COMM_WORLD, n, &x);
VecDuplicate(x, &g);

TaoCreate(MPI_COMM_WORLD, &tao);
TaoSetFunction(tao, x, EvaluateFunction, usercontext);
TaoSetGradient(tao, g, EvaluateGradient, usercontext);
TaoSetHessian(tao, H, EvaluateHessian, usercontext);
TaoSolve(tao);
TaoDestroy(tao);

ScaLAPACK
Scalable LAPACK
Development team
  University of Tennessee
  University of California at Berkeley
  ORNL, Rice U., UCLA, UIUC etc.
Support in commercial packages
  NAG Parallel Library (including Intel MKL and AMD ACML)
  IBM PESSL
  CRAY Scientific Library and SGI SCSL
  VNI IMSL
  Fujitsu, HP/Convex, Hitachi, NEC

Important details
Web page http://www.netlib.org/scalapack
  Includes ScaLAPACK User’s Guide
Language: Fortran
Dense Matrix Problem Solvers
  Linear Equations
  Least Squares
  Eigenvalue
Package dependencies

Components of the API
Drivers
  Solve a complete problem
Computational Components
  Perform tasks: LU factorization, etc.
Auxiliary Routines
  Scaling, matrix norm, etc.
Matrix Redistribution/Copy Routine
  Matrix on PE grid1 -> Matrix on PE grid2

API (cont.)
LAPACK names with P prefix: PXYYZZZ
  X    Data type
  YY   Matrix type
  ZZZ  Computation performed
Data types
  X   Data type
  S   real
  D   double
  C   cmplx
  Z   dble cmplx
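The naming scheme can be made concrete with a small decoder that splits a routine name into its P, X, YY and ZZZ parts. This is only a sketch for reading names: `struct pname` and `decode_pname` are hypothetical helpers, not part of ScaLAPACK, and the decoder handles only names that fit the pattern above (redistribution routines, for example, are named differently).

```c
#include <assert.h>
#include <string.h>

/* Decoded pieces of a routine name of the form P X YY ZZZ. */
struct pname {
    const char *data_type;   /* spelled-out meaning of X */
    char matrix_type[3];     /* YY, e.g. "GE" */
    char computation[4];     /* ZZZ, e.g. "SV" or "TRF" */
};

/* Returns 0 on success, -1 if the name does not fit the pattern. */
int decode_pname(const char *name, struct pname *out)
{
    assert(name != NULL && out != NULL);
    size_t len = strlen(name);
    if (len < 5 || len > 7 || name[0] != 'P')
        return -1;
    switch (name[1]) {
    case 'S': out->data_type = "real"; break;
    case 'D': out->data_type = "double"; break;
    case 'C': out->data_type = "complex"; break;
    case 'Z': out->data_type = "double complex"; break;
    default:  return -1;
    }
    memcpy(out->matrix_type, name + 2, 2);
    out->matrix_type[2] = '\0';
    strcpy(out->computation, name + 4);   /* at most 3 characters remain */
    return 0;
}
```

For example, decoding `"PDGESV"` gives data type `double`, matrix type `GE` (general) and computation `SV` (solve), while `"PZPOTRF"` gives `double complex`, `PO` and `TRF`.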

TAU
Tuning and Analysis Utilities
University of Oregon development
http://www.cs.uoregon.edu/research/paracomp/tau/tautools/
Program and performance analysis tool framework for high-performance parallel and distributed computing
TAU provides a suite of tools for analysis of C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java programs

Usage
Instrument the program by inserting TAU macros into the program (this can be done automatically).
Run the program. Files containing information about the program performance are automatically generated.
View the results with TAU's pprof, the TAU visualizer racy (or paraprof), or a third-party visualizer (such as VAMPIR).

Additional facilities
TAU collects much more information than what is available through prof or gprof, the standard Unix utilities. Also available through TAU are:
  Per-process, per-thread and per-host information (supports pthreads)
  Inclusive and exclusive function times
  Profiling groups that allow you to organize data collection
  Access to hardware counters on some systems
  Per-class and per-instance information
  Separate data for each template instantiation
  Start/stop timers for profiling arbitrary sections of code
  Support for collection of statistics on user-defined events
TAU is designed so that when you turn off profiling (by disabling TAU macros) there is no overhead
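The difference between inclusive and exclusive times, and the way start/stop timers nest, can be shown with a minimal timer stack. This is a sketch of the bookkeeping only and does not use TAU's macros or API: `struct profiler`, `prof_init`, `prof_start` and `prof_stop` are hypothetical names, and timestamps are passed in by the caller so the arithmetic is easy to follow.

```c
#include <assert.h>

#define MAX_REGIONS 16
#define MAX_DEPTH   32

struct profiler {
    double inclusive[MAX_REGIONS];  /* time in region including nested regions */
    double exclusive[MAX_REGIONS];  /* time in region excluding nested regions */
    int    stack_id[MAX_DEPTH];     /* currently open regions */
    double stack_start[MAX_DEPTH];  /* when each open region started */
    double stack_child[MAX_DEPTH];  /* time spent in regions nested inside it */
    int    depth;
};

void prof_init(struct profiler *p)
{
    for (int i = 0; i < MAX_REGIONS; i++)
        p->inclusive[i] = p->exclusive[i] = 0.0;
    p->depth = 0;
}

/* Open region `id` at time `now` (seconds, from any monotonic clock). */
void prof_start(struct profiler *p, int id, double now)
{
    assert(id >= 0 && id < MAX_REGIONS && p->depth < MAX_DEPTH);
    p->stack_id[p->depth] = id;
    p->stack_start[p->depth] = now;
    p->stack_child[p->depth] = 0.0;
    p->depth++;
}

/* Close the innermost open region at time `now`. */
void prof_stop(struct profiler *p, double now)
{
    assert(p->depth > 0);
    p->depth--;
    int id = p->stack_id[p->depth];
    double elapsed = now - p->stack_start[p->depth];
    p->inclusive[id] += elapsed;
    p->exclusive[id] += elapsed - p->stack_child[p->depth];
    if (p->depth > 0)   /* the enclosing region sees this as child time */
        p->stack_child[p->depth - 1] += elapsed;
}
```

If region 0 runs from t = 0 to t = 10 and region 1 is nested inside it from t = 2 to t = 5, region 0 has an inclusive time of 10 and an exclusive time of 7, while region 1 has 3 for both. The sketch does not handle recursive regions, which would be counted twice in the inclusive total.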

CACTUS
http://www.cactuscode.org/
Developed as a response to the needs of large scale projects (initially developed for General Relativity calculations, which have a large computation to communication ratio)
Numerical/computational infrastructure to solve PDEs
Freely available, Open Source community framework
Cactus is divided into “Flesh” (core) and “Thorns” (modules or collections of subroutines)
Multilingual: user apps in Fortran, C, C++; automated interface between them
Abstraction: Cactus Flesh provides API for virtually all CS type operations
  Storage, parallelization, communication between processors, etc
  Interpolation, Reduction
  IO (traditional, socket based, remote viz and steering…)
  Checkpointing, coordinates
“Grid Computing”: Cactus team and many collaborators worldwide, especially NCSA, Argonne/Chicago, LBL

Modularity of Cactus...
[Diagram. Labels on the slide: Application 1, Application 2, ..., Sub-app, Legacy App 2, Symbolic Manip App; Cactus Flesh; AMR (GrACE, etc), MPI layer 3, I/O layer 2, Unstructured, ..., Remote Steer 2, MDS/Remote Spawn; Globus Metacomputing Services. Captions: User selects desired functionality… Code created... Abstractions...]

Cactus & the Grid
[Layer diagram. Labels on the slide:]
Cactus Application Thorns
  Distribution information hidden from programmer
  Initial data, Evolution, Analysis, etc
Grid Aware Application Thorns
  Drivers for parallelism, IO, communication, data mapping
  PUGH: parallelism via MPI (MPICH-G2, grid enabled message passing library)
Grid Enabled Communication Library
  MPICH-G2 implementation of MPI, can run MPI programs across heterogeneous computing resources
Standard MPI
Single Proc

The Flesh
Abstract API
  Evolve the same PDE with unigrid or AMR (MPI or shared memory, etc) without having to change any of the application code.
Interfaces
  The set of data structures that a thorn exports to the world (global), to its friends (protected) and to nobody (private), and how these are inherited.
Implementations
  Different thorns may implement e.g. the evolution of the same PDE, and we select the one we want at runtime.
Scheduling
  Call the routines of every thorn in a certain order, and handle their interdependencies.
Parameters
  Many types of parameters, with all of their essential consistency checked before running.

Summary Part 2
ACTS is a collection of software for HPC that includes a number of useful tools
  Numerical libraries
  Code development software
  Profiling software
ScaLAPACK extends LAPACK to distributed memory architectures
  Built on top of PBLAS which uses BLACS

Part 3: Odds and ends
Netlib and other useful websites
HPL library
VTK

Netlib
The Netlib repository contains
  freely available software, documents, databases of interest to the numerical & scientific computing communities
The repository is maintained by
  AT&T Bell Laboratories
  University of Tennessee
  Oak Ridge National Laboratory
The collection is mirrored at several sites around the world
  Kept synchronized
Effective search engine to help locate software of potential use

High Performance LINPACK
Portable and freely available implementation of the LINPACK Benchmark, used for Top500 ranking
Developed at UTK Innovative Computing Laboratory
  A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary
HPL solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers
Requires MPI 1.1 be installed
Also requires an implementation of either the BLAS or the Vector Signal Image Processing Library VSIPL
Provides a testing and timing program
  Quantifies the accuracy of the obtained solution as well as the time it took to compute it
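What "quantifying accuracy and time" means can be sketched serially: solve a dense system by Gaussian elimination with partial pivoting, then report a scaled residual and a flop rate. This is not HPL code. The function names are hypothetical, and the residual scaling and the operation count 2/3 n^3 + 3/2 n^2 are assumptions based on how HPL results are commonly described rather than anything stated on the slides; the benchmark itself does this in parallel over MPI.

```c
#include <assert.h>
#include <float.h>

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Solve A x = b by Gaussian elimination with partial pivoting.
 * A (n x n, row-major) and b are overwritten; x receives the solution.
 * Returns 0 on success, -1 if a zero pivot is met. */
int lu_solve(int n, double *A, double *b, double *x)
{
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        int piv = k;                       /* largest entry in column k */
        for (int i = k + 1; i < n; i++)
            if (absd(A[i * n + k]) > absd(A[piv * n + k]))
                piv = i;
        if (A[piv * n + k] == 0.0)
            return -1;
        if (piv != k) {                    /* swap rows k and piv */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[piv * n + j];
                A[piv * n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        for (int i = k + 1; i < n; i++) {  /* eliminate below the pivot */
            double m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    for (int i = n - 1; i >= 0; i--) {     /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
    return 0;
}

/* ||Ax - b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n),
 * computed from unmodified copies of A and b (assumed scaling). */
double scaled_residual(int n, const double *A, const double *b, const double *x)
{
    double rnorm = 0.0, anorm = 0.0, xnorm = 0.0, bnorm = 0.0;
    for (int i = 0; i < n; i++) {
        double r = -b[i], rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            r += A[i * n + j] * x[j];
            rowsum += absd(A[i * n + j]);
        }
        if (absd(r) > rnorm) rnorm = absd(r);
        if (rowsum > anorm) anorm = rowsum;
        if (absd(x[i]) > xnorm) xnorm = absd(x[i]);
        if (absd(b[i]) > bnorm) bnorm = absd(b[i]);
    }
    return rnorm / (DBL_EPSILON * (anorm * xnorm + bnorm) * n);
}

/* Rate in Gflop/s, assuming the operation count 2/3 n^3 + 3/2 n^2. */
double linpack_gflops(int n, double seconds)
{
    assert(seconds > 0.0);
    double dn = (double)n;
    return (2.0 / 3.0 * dn * dn * dn + 1.5 * dn * dn) / seconds / 1.0e9;
}
```

A residual of order one indicates a solution accurate to roughly machine precision for that problem size, and the rate is simply the assumed operation count divided by the measured wall-clock time.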

Rice University HPC software
Center for High Performance Software Research (HiPerSoft)
  established in October 1998
http://www.hipersoft.rice.edu/
Rice has a strong history of innovative HPC tools
HPCToolkit is an open-source suite of multi-platform tools for profile-based performance analysis of applications

HPCToolkit
The toolkit components include:
  hpcrun: a tool for profiling executions of unmodified application binaries using statistical sampling of hardware performance counters.
  hpcprof & xprof: tools for interpreting sample-based execution profiles and relating them back to program source lines.
  bloop: a tool for analyzing application binaries to recover program structure; namely, to identify where loops are present and what program source lines they contain.
  hpcview: a tool for correlating program structure information, multiple sample-based performance profiles, and program source code to produce a performance database.
  hpcviewer: a Java-based GUI for exploring databases consisting of performance information correlated with program source.
Supported platforms: Pentium+Linux, Opteron+Linux, Athlon+Linux, Itanium+Linux, Alpha+Tru64 and MIPS+Irix.
HPCToolkit is open-source software released with a BSD-like license.

CALGO
Collected algorithms of the ACM
http://www.acm.org/pubs/calgo/
All software is refereed for originality, accuracy, robustness, completeness, portability, and lasting value
Use of ACM Algorithms is subject to the ACM Software Copyright and License Agreement
Available on CD

MGnet
www.mgnet.org
Site devoted to multigrid and adaptive mesh refinement algorithms
Run by Craig Douglas
Has links to a number of packages for multigrid
  Some are public domain
  Others are copyrighted
Very useful resource for MG methods

NCSA
National Center for Supercomputing Applications
www.ncsa.uiuc.edu
Their application repository is a very useful guide to what software is available in a given field

NHSE
National HPC Software Exchange
www.nhse.org
Numerous reports, libraries
Unfortunately it has been suspended owing to a lack of funding (2004)
Access to the meta-repository is still available (and the links therein)

VTK
The Visualization Toolkit
http://public.kitware.com/VTK/what-is-vtk.php
Portable open-source software system for 3D computer graphics, image processing, and visualization
Object-oriented approach
VTK is at a higher level of abstraction than rendering libraries like OpenGL
VTK applications can be written directly in C++, Tcl, Java, or Python
Large user community
  Many source code contributions

Summary Part 3
When looking for a library, the first place to stop is Netlib!

Next lecture (Last lecture!) Productivity crisis, future of HPC
