
1 JLab SciDAC Activities
QCD-API design and other activities at JLab include:
–Message-passing design and code (Level 1) [Watson, Edwards]
  First implementation over MPI is available in C [Edwards]
  GM implementation underway [Chen]
–Basic linear algebra routines (Level 1)
  Prototype design specification available [Edwards]
  Prototype code available [M4 version – Pochinsky]
  New (not yet accepted) implementation [Pochinsky]
  Subset optimized for Pentium 4 [Edwards, McClendon]
–Data-parallel layer (Level 2)
  Prototype design specification available [Edwards]
  Example implementation – SZIN [Edwards]
  C interface in a first version; a C++ version is under consideration
–Optimized Dirac operators, CG (Level 3)
  Parallel-threaded Pentium 4 version available [Edwards, McClendon]

2 Lattice QCD – extremely uniform
–Periodic or very simple boundary conditions
–The problem is uniform across the lattice
–Discretize on a lattice – the Dirac operator becomes a finite-difference operator
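Schematically, the gauge-covariant central difference at the heart of the lattice Dirac operator looks as follows (shown in lattice units and without the mass or Wilson terms, purely for illustration):

  \[
  (D\psi)(x) = \tfrac{1}{2}\sum_{\mu} \gamma_\mu \left[ U_\mu(x)\,\psi(x+\hat\mu) - U_\mu^\dagger(x-\hat\mu)\,\psi(x-\hat\mu) \right]
  \]

Each term couples a site only to its nearest neighbors through the gauge links U_mu, which is why the shift-and-multiply primitives described in the following slides cover the required communication pattern.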

3 Data-Parallel Programming Model
–Basic uniform operations across the lattice: C(x) = A(x)*B(x)
–Distribute the problem grid across a machine grid
–Want the API to hide the subgrid layout and the communications between nodes (bundling of subgrid faces)
–The basic idea is something like Fortran 90 data-parallel support
–However, given the uniform structure, the API can be implemented without directly writing a compiler – in effect providing a meta-language of sorts (see the layout sketch below)
(Figure: data layout over processors)
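As an illustration of the bookkeeping the API is meant to hide, the sketch below shows one possible lexicographic decomposition of a global lattice onto a machine grid. The names (latt, geom, layout) and the particular layout are assumptions for illustration only; real implementations are free to choose other layouts, which is exactly why the layout is kept internal to the API.

/* Hypothetical layout sketch - not part of the API specification. */
#define Nd 4

static int latt[Nd] = {16, 16, 16, 32};  /* global lattice size      */
static int geom[Nd] = { 2,  2,  2,  4};  /* machine (processor) grid */

/* Which node owns global site x[], and what is its linear index
   within that node's subgrid? */
void layout(const int x[Nd], int *node, int *linear)
{
  int mu, n = 0, l = 0;
  for (mu = Nd - 1; mu >= 0; --mu) {
    int sub = latt[mu] / geom[mu];   /* subgrid extent in direction mu */
    n = n * geom[mu] + x[mu] / sub;  /* coordinate on the machine grid */
    l = l * sub      + x[mu] % sub;  /* coordinate within the subgrid  */
  }
  *node = n;
  *linear = l;
}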

4 QCD-API
Level 3: Dirac operators, CG routines, etc., written in C, C++, etc. (organized by MILC, SZIN, CPS, etc.)
Level 2: Data-parallel QCD lattice operations (overlapping algebra and messaging), e.g. A = SHIFT(B, mu) * C; global sums, etc.
  –Lattice-wide linear algebra (no communication), e.g. A = B * C
  –Lattice-wide data movement (pure communication, non-blocking), e.g. Atemp = SHIFT(A, mu)
Level 1:
  –Single-site linear algebra API: SU(3), gamma algebra, etc.
  –Message-passing API (knows about the mapping of the lattice onto the network geometry)

5 API Design Criteria
–The design must allow for overlapping of computation and communications.
–Hide the data layout. Internally, a query can be made to return layout information; only a very limited number of layouts is supported.
–Allow different underlying layouts so implementations can be optimized on a wide range of machines – notably (super)scalar microprocessors with primary and secondary caches, but also classic vector machines.
–Support for multi-process or multi-threaded computation is hidden from user control.
–Operations do not (by default) convert objects from one layout into another in order to perform the operation. An error is generated if objects are in incompatible layouts.

6 Motivation
Consider the equation
  dest(x) = src1(x) * src2(x+dir)   (for all x)
This can be decomposed as
  tmp1(x) = src2(x+dir)
  dest(x) = src1(x) * tmp1(x)
Implementation 1: This can be implemented as two functions as follows:
  Shift(tmp1, src2, FORWARD, dir);
  Multiply(dest, src1, tmp1, REPLACE);
where FORWARD and dir give the sign of the shift and the direction, and REPLACE is an option indicating that dest is overwritten (allowing other options, like +=, later).
Implementation 2: If Shift is changed to also return its result, this can be rewritten as:
  Multiply(dest, src1, Shift(src2, FORWARD, dir), REPLACE);

7 Performance Considerations
–Predefined channels are opened between neighboring nodes of a d-dimensional grid.
–Overlapping communications and computations, e.g. C(x) = A(x)*shift(B,d):
  –On clusters of scalar nodes, the face of a subgrid is sent non-blocking to a neighboring node, say in the forward direction.
  –The neighboring node in the corresponding backward direction sends its face into a preallocated buffer.
  –While this is going on, the specified operation is performed on the interior sites.
  –A wait is issued and the operation is performed on the face.
(Figure: data layout over processors)

8 Level 2 Example

/* Nd = number of dimensions
   Nc = number of colors
   volume = Nd-volume
   FORWARD is an enum; there is also a BACKWARD.
   The type specifiers below are suggestive, but not fleshed out yet */

Lattice_Gauge u[Nd], tmp_0, tmp_1;
Lattice_Real wplaq_tmp;
double tmp, w_plaq;
int mu, nu;

w_plaq = 0;
for(mu=1; mu < Nd; ++mu)
  for(nu=0; nu < mu; ++nu)
  {
    /* tmp_0 = u(x+mu,nu)*u_dag(x+nu,mu) */
    Multiply(tmp_0, Shift(u[nu],FORWARD,mu), Shift(Conjugate(u[mu]),FORWARD,nu), REPLACE);

    /* tmp_1 = tmp_0*u_dag(x,nu) = u(x+mu,nu)*u_dag(x+nu,mu)*u_dag(x,nu) */
    Multiply(tmp_1, tmp_0, Conjugate(u[nu]), REPLACE);

    /* tmp_0 = u(x,mu)*tmp_1 = u(x,mu)*u(x+mu,nu)*u_dag(x+nu,mu)*u_dag(x,nu) */
    Multiply(tmp_0, u[mu], tmp_1, REPLACE);

    /* wplaq_tmp = tr(tmp_0) = tr(u(x,mu)*u(x+mu,nu)*u_dag(x+nu,mu)*u_dag(x,nu)) */
    Trace(wplaq_tmp, tmp_0, REAL_PART);

    /* w_plaq += sum(wplaq_tmp); */
    Sum(w_plaq, wplaq_tmp, ADD);
  }
w_plaq = 2.0*w_plaq/(float)(volume*Nd*(Nd-1)*Nc);
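For reference, the quantity the loop above accumulates is the average plaquette; in the usual notation the final w_plaq equals

  \[
  W = \frac{2}{V\,N_d(N_d-1)\,N_c} \sum_x \sum_{\mu>\nu} \mathrm{Re}\,\mathrm{tr}\!\left[ U_\mu(x)\,U_\nu(x+\hat\mu)\,U_\mu^\dagger(x+\hat\nu)\,U_\nu^\dagger(x) \right]
  \]

with V the lattice volume, which matches the normalization in the last line of the code.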

9 Data Types
Fields have various types (indices): lattice fields or scalars (the same on all sites).

Index type (at a site):
  Gauge fields:   Product(Matrix(Nc), Scalar)
  Fermions:       Product(Vector(Nc), Vector(Ns))
  Scalars:        Scalar
  Propagators:    Product(Matrix(Nc), Matrix(Ns))

Support compatible operations on types:
  Matrix(color) * Matrix(spin) * Vector(color,spin)
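A minimal sketch, in C, of how these index types might map onto concrete site structures. The type and member names (Complex, ColorMatrix, DiracFermion, Propagator) are illustrative assumptions, not part of the API specification:

#define Nc 3                      /* number of colors           */
#define Ns 4                      /* number of spin components  */

typedef struct { float re, im; } Complex;

/* Gauge field at a site: Product(Matrix(Nc), Scalar) */
typedef struct { Complex m[Nc][Nc]; } ColorMatrix;

/* Fermion at a site: Product(Vector(Nc), Vector(Ns)) */
typedef struct { Complex v[Ns][Nc]; } DiracFermion;

/* Propagator at a site: Product(Matrix(Nc), Matrix(Ns)) */
typedef struct { Complex p[Ns][Ns][Nc][Nc]; } Propagator;

A lattice field is then, conceptually, an array of one of these site types over the local subgrid, in whatever layout the implementation chooses.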

10 C++ Interface for Level 2

Unary operations: operate on one source and return (or store into) a target
  Lattice_Field Shift(Lattice_Field source, enum sign, int direction);
  void Copy(Lattice_Field dest, Lattice_Field source, enum option);
  void Trace(double dest, Lattice_Field source, enum option);

Binary operations: operate on two sources and return into a target
  void Multiply(Lattice_Field dest, Lattice_Field src1, Lattice_Field src2, enum option);
  void Compare(Lattice_Bool dest, Lattice_Field src1, Lattice_Field src2, enum compare_func);

Broadcasts: broadcast throughout the lattice
  void Fill(Lattice_Field dest, float val);

Reductions: reduce over the lattice
  void Sum(double dest, Lattice_Field source);
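A short usage sketch combining the motivation example with a fill and a reduction; the declarations are assumptions for illustration, and the calls simply follow the signatures listed on this slide:

  Lattice_Field a, b, c;
  double total;

  Fill(a, 1.0);                                    /* broadcast constants to every site */
  Fill(b, 2.0);
  Multiply(c, a, Shift(b, FORWARD, 0), REPLACE);   /* c(x) = a(x) * b(x + 0hat)          */
  Sum(total, c);                                    /* reduce c over the whole lattice    */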

11 C Interface for Level 2
Binary: multiplication is a fundamental operation and quite important for speed.

  void QCDF_mult_T1T2T3_op3(Type3 *r, const Type1 *a, const Type2 *b)

Multiply operations, where T1, T2, T3 are shortened type names for the types Type1, Type2, Type3 (LatticeGaugeF, LatticeDiracFermionF, LatticeHalfFermionF, LatticePropagatorF) and op3 is an option such as:

  nnrr = a*b                      nnnr = -a*b
  ncrr = a*conj(b)                ncnr = -a*conj(b)
  cnrr = conj(a)*b                cnnr = -conj(a)*b
  ccrr = conj(a)*conj(b)          ccnr = -conj(a)*conj(b)
  nnar = r + a*b                  nnsr = r - a*b
  ncar = r + a*conj(b)            ncsr = r - a*conj(b)
  cnar = r + conj(a)*b            cnsr = r - conj(a)*b
  ccar = r + conj(a)*conj(b)      ccsr = r - conj(a)*conj(b)

12 C Example of Level 2 calling Level 1

Scalar processor:

void QCDF_mult_T1T2T3_op3(Type3 *r, const Type1 *a, const Type2 *b)
{
  for(I=0; I < volume; ++I)
    QCDF_mult_T1T2T3_op3_S(r[I], a[I], b[I]);
}

Parallel threaded:

void QCDF_mult_T1T2T3_op3(Type3 *r, const Type1 *a, const Type2 *b)
{
  thread_dispatch(&foo, subgrid_volume, r, a, b);
}

void foo(int start, int end, va_list args)
{
  /* r, a, b are unpacked from args (via va_arg) in a full implementation */
  for(I=start; I < end; ++I)
    QCDF_mult_T1T2T3_op3_S(r[I], a[I], b[I]);
}
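For concreteness, the single-site Level 1 routine called in the loops above could look like the following sketch for the gauge-field case with the plain nnrr option (r = a*b): a 3x3 complex matrix multiply. The routine name and the Complex/ColorMatrix layout (from the Data Types sketch earlier) are assumptions for illustration, not the specified API:

/* Sketch only: single-site r = a*b for 3x3 complex (SU(3)) matrices */
static void site_mult_nnrr(ColorMatrix *r, const ColorMatrix *a, const ColorMatrix *b)
{
  int i, j, k;
  for (i = 0; i < Nc; ++i)
    for (j = 0; j < Nc; ++j) {
      float re = 0.0f, im = 0.0f;
      for (k = 0; k < Nc; ++k) {
        /* complex multiply-accumulate: (a_ik)(b_kj) */
        re += a->m[i][k].re * b->m[k][j].re - a->m[i][k].im * b->m[k][j].im;
        im += a->m[i][k].re * b->m[k][j].im + a->m[i][k].im * b->m[k][j].re;
      }
      r->m[i][j].re = re;
      r->m[i][j].im = im;
    }
}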

13 Communications-Computations Overlap

/* T = shift(A, d): send the face in direction d and receive from -d */
Type *QCDF_shift_T(Type *a, int d)
{
  for(I=0; I < size_of_face; ++I)
    QCDF_copy_T1T2_repl(msgmem_f[d][I], a[offset[I]]);
  QMP_start(msgh_b[d]);   // start receive
  QMP_start(msgh_f[d]);   // start send
}

/* D = B * T: waits on the send/receive before touching the face */
void QCDF_mult_T1T2T3_op3(Type3 *d, const Type1 *b, const Type2 *t)
{
  thread_dispatch(&interior, interior_volume, d, b, t);
  QMP_wait(msgh_b);   // wait on receive
  thread_dispatch(&face, face_size, d, b, t);
  QMP_wait(msgh_f);   // wait on send
}

void interior(int start, int end, va_list args)   // similar for face
{
  /* d, b, t and the offset tables are unpacked from args in a full implementation */
  for(I=start; I < end; ++I)
    QCDF_mult_T1T2T3_op3_S(d[off3[I]], b[off1[I]], t[off2[I]]);
}

14 SZIN: Data-Parallel
LHPC (Lattice Hadron Physics Collaboration) uses SZIN:
–M4-preprocessed, object-oriented, data-parallel programming interface
–Base code is C
–Supports scalar nodes, multi-process SMP, multi-threaded SMP, vector machines, the CM-2, the CM-5, and clusters of SMPs
–The Level 2 routines are generated at M4 time into a repository
–Supports overlapping communications/computations
–No exposed Level 1 layer
–Can (and does) use any Level 1 or Level 2 routines, transparently to the user
–Optimized Level 3 routines (Dirac operator) available
–Code publicly available by tar file and CVS
Plan to move to a conventional OO language.

15 Linear Algebra (Level 1): Intel Pentium with Streaming SIMD Extensions (SSE)
–Streaming SIMD (Single Instruction Multiple Data) Extensions are length-4 vector instructions
–Main benefit is lower instruction bandwidth and easy scheduling
–First introduced in the Pentium 2 as MMX instructions (32-bit integer)
–In the Pentium 3, SSE has eight 128-bit registers used in single precision, so each vector register holds and operates on four 32-bit values
–In the Pentium 4, SSE2 was introduced, which allows use of the same registers in 64-bit (double precision) mode
–The Pentium 4 is capable of four 32-bit operations per clock (e.g., two multiply-adds) or two 64-bit operations (e.g., one multiply-add) per clock
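As a flavor of what the SSE path looks like, here is a minimal sketch using compiler intrinsics (the optimized routines actually use hand-scheduled code; the function and array names are assumptions, and the pointers are assumed 16-byte aligned with n a multiple of 4):

#include <xmmintrin.h>

/* y[i] += a[i]*x[i], four single-precision elements per iteration */
void vmuladd4(float *y, const float *a, const float *x, int n)
{
  int i;
  for (i = 0; i < n; i += 4) {
    __m128 va = _mm_load_ps(a + i);             /* load 4 floats          */
    __m128 vx = _mm_load_ps(x + i);
    __m128 vy = _mm_load_ps(y + i);
    vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));    /* 4 multiplies + 4 adds  */
    _mm_store_ps(y + i, vy);                    /* store 4 results        */
  }
}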

16 Level 1 Multiply Performance (32-bit)

17 Level 1 Multiply Performance (64-bit)

18 Single Node

19 Parallel Version on Single Node

20 Typical steps in an LQCD calculation
–Some code generates gauge configurations
  –Largish binary image
  –Need a header describing the context
–Read a configuration and compute quark propagators (solve the Dirac equation)
  –Even larger binary image
  –Carry the gauge header and a new propagator header – nested headers
–Read propagators and the configuration to compute matrix elements (hadron decays, etc.)
  –Generates important but much smaller (often ASCII) data sets
  –Usually a large number of quantum numbers (channels)

21 I/O and Data Formats
–Currently using the Namelist (from Fortran) data format, which supports nested groups and resources (tags) within groups.
–Switching now to a new nested header format for configurations, propagators, input files, etc.
–Will propose using XML for ASCII-formatted headers
  –These headers can be used for a user-extensible replica(ted) catalog of the locations of configurations and propagators.
  –Search user data using SQL to find locations.
–Use XML input and output data files
  –Parameter input for program execution
  –Output for cooked data can be sent over the net to a receiver
  –Want an SQL-like database for storing and retrieving data
–Problems:
  –Long computation time for typical SQL commits or rollbacks
  –SQL data types are too primitive – want multi-dimensional arrays and complex numbers

