
1 “Matrix Distributed Processing” for Lattice/Grid Parallel Computations. Realized by: Massimo Di Pierro. Presented at: ACAT 2000, Fermilab.

2 Motivation
Every fundamental interaction in Nature is local (as far as we know).
Its effects can be described in terms of a system of local differential equations of some kind.
Every local differential equation can be solved numerically by discretizing the space on which the equation itself is defined.
Discretized differential equations can be solved, efficiently, in parallel!
Examples: QCD, electromagnetism, general relativity, fluid dynamics, thermodynamics, etc.
Examples outside physics: the Black-Scholes equation.

3 History
mdpqcd (old!) @ Southampton Univ.
FermiQCD @ Fermilab (Lattice QCD application)
MDP 1.0 & hep-lat/9811036, hep-lat/0004007, hep-lat/0009001
Matrix, Random, JackBoot, mpi (on top of MPI)
generic_lattice, site, generic_field
Scalar_field, Gauge_field, Fermi_field, Fermi_propagator, Staggered_field, Staggered_propagator + Algorithms...
Thanks Theory Group

4 Matrix Distributed Processing 1.2
Standard ANSI C++, fully Object Oriented (no global vars)
Communications based on MPI (but no knowledge of MPI required); MPI wrapped in class mpi
Matrix + Random + JackBoot
User can define multiple lattices
One random generator per site
Arbitrary lattice size and dimension
Arbitrary lattice partitioning
Arbitrary lattice topology
User can define multiple fields on each lattice (arbitrary structure per site)
Algorithms are platform independent
Results are independent of the number of processes
Automatic optimization of communication without explicit calls to MPI
Standardized I/O and analysis tools.

5 Example: 4D Laplace Equation
Problem: Solve iteratively (in 100 iterations) the non-linear Laplace equation, where U(x) and V(x) are 3 x 3 matrices. V(x) is a given external field initialized with random SU(3) matrices.
Solve the equation on a 4-dimensional “space” approximated with an 8^4 lattice distributed over N parallel processes.

6 Discretization
Continuum: Local + Derivatives.  Discrete: Quasi-Local.
(figure: the stencil of nearest neighbours x+0, x-0, x+1, x-1, ... of a site x in the four directions 0, 1, 2, 3)
U(x) = 0.125*(cos(U(x)+V(x))+
              U(x+0)+U(x-0)+
              U(x+1)+U(x-1)+
              U(x+2)+U(x-2)+
              U(x+3)+U(x-3))
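Read backwards, the update rule above is a Jacobi relaxation step: with unit lattice spacing in d = 4 dimensions the coefficient is 1/(2d) = 0.125. The continuum equation reconstructed this way (an inference from the update rule, not a formula copied from the slides) is:

% Hedged reconstruction; assumes lattice spacing a = 1 and d = 4 dimensions.
\nabla^2 U(x) + \cos\big(U(x)+V(x)\big) = 0,
\qquad
\nabla^2 U(x) \approx \sum_{\mu=0}^{3}\big[U(x+\hat\mu)+U(x-\hat\mu)\big] - 8\,U(x)

% Solving the discretized equation for U(x) gives the iteration on the slide:
U(x) \;\leftarrow\; \frac{1}{8}\Big(\cos\big(U(x)+V(x)\big)
      + \sum_{\mu=0}^{3}\big[U(x+\hat\mu)+U(x-\hat\mu)\big]\Big)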

7 Example Program using MDP 1.2

#define PARALLEL                      // otherwise single process
#include "MDP_Lib2.h"
#include "MDP_MPI.h"

int main(int argc, char **argv) {
  mpi.open_wormholes(argc,argv);      // open communications
  int box[4]={8,8,8,8};
  generic_lattice space(4,box);
  Matrix_field U(space,3,3);
  Matrix_field V(space,3,3);
  site x(space);
  forallsites(x) {
    V(x) = space.random(x).SU(3);
    U(x) = 0;
  };
  V.update();
  U.update();
  for(int i=0; i<100; i++) {
    forallsites(x)
      U(x) = 0.125*(cos(U(x)+V(x))+
                    U(x+0)+U(x-0)+
                    U(x+1)+U(x-1)+
                    U(x+2)+U(x-2)+
                    U(x+3)+U(x-3));
    U.update();
  };
  V.save("V_field.dat");
  U.save("U_field.dat");
  mpi.close_wormholes();              // close communications
  return 0;
};

Compile with mpiCC filename, run with mpirun -np N a.out

8 Program Output

=================================================================
Starting [ Matrix Distributed Processing ]...
This program is using the packages: MDP_Lib2 and MDP_MPI
Created by Massimo Di Pierro (mdp@FNAL.GOV) version 1.2
=================================================================
Going parallel... YES
Initializing a generic_lattice...
Lattice dimension: 8 x 8 x 8 x 8
Communicating...
Initializing random per site... Done. Let's begin to work!
Saving file V_field.dat from process 0 (buffer = 1024 sites)...
Saving time: 0.076129 (sec)
Saving file U_field.dat from process 0 (buffer = 1024 sites)...
Saving time: 0.019184 (sec)
=================================================================
Fractional time spent in communications by processor 0 is 0.27
Fractional time spent in communications by processor 1 is 0.23
Fractional time spent in communications by processor 2 is 0.21
Fractional time spent in communications by processor 3 is 0.19
=================================================================

Checking convergence:
if(ME==on_which_process(0,0,0,0)) {
  x.set(0,0,0,0);
  printf("%f\n", real(U(x)(0,0)));
};
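The convergence check above can also run as a small stand-alone program that re-reads the saved field; a minimal sketch, assuming that load() (listed on the Basic Syntax slide) mirrors save() and accepts a file name:

#define PARALLEL
#include "MDP_Lib2.h"
#include "MDP_MPI.h"

// Hedged sketch: re-reads the field written by the example program above.
// Assumes load() mirrors save(); the print reuses the snippet on this slide.
int main(int argc, char **argv) {
  mpi.open_wormholes(argc,argv);      // open communications
  int box[4]={8,8,8,8};
  generic_lattice space(4,box);       // same lattice as when the field was saved
  Matrix_field U(space,3,3);
  site x(space);
  U.load("U_field.dat");              // parallel read of the stored field
  if(ME==on_which_process(0,0,0,0)) { // only the process owning site (0,0,0,0) prints
    x.set(0,0,0,0);
    printf("%f\n", real(U(x)(0,0)));
  };
  mpi.close_wormholes();              // close communications
  return 0;
};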

9 Visualization Tool

10 Hierarchy of Classes
C++ (Linux, Sun)
MPI
MDP 1.2: mpi, Matrix, Random, JackBoot, generic_lattice, site, generic_field
User applications: ... QCD, physics, geology, biology, etc...
generic_lattice: random(x)
generic_field: save(), load(), update()
JackBoot: Average, Jackknife, Bootstrap
Matrix is Object Oriented:
  Matrix A, B(10,10);
  ...
  B(2,3) = 3 + 5*I;
  ...
  A = exp(inv(B))+B;

11 Basic Syntax and Notation

int box[] = {100,100,100};           // define the box that contains the lattice
generic_lattice space(3, box,        // define a 3D lattice in the box, "space"
                      partitioning,  // specify a user defined partitioning (opt.)
                      topology,      // specify a user defined topology (opt.)
                      seed,
                      boundary);     // specify the thickness of the boundary (opt.)
site x(space);                       // define a site on the lattice "space"
struct Tensor {                      // define an arbitrary Tensor structure
  float component[5][5][5];
};
generic_field<Tensor> T(space);      // define a Tensor field "T"
forallsites(x)                       // example of parallel loop
  ...T(x).component[i][j][k]...      // example of how to access the field
T.update();                          // parallel communications
T.save(); T.load();                  // parallel I/O
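Putting the lines above together, a minimal sketch of a complete program built only from calls shown in this transcript; the 100^3 box is taken from the slide, while the i+j+k initialization and the output file name are illustrative choices, not part of the original:

#define PARALLEL
#include "MDP_Lib2.h"
#include "MDP_MPI.h"

struct Tensor { float component[5][5][5]; };

int main(int argc, char **argv) {
  mpi.open_wormholes(argc,argv);           // open communications
  int box[] = {100,100,100};
  generic_lattice space(3, box);           // optional arguments (partitioning,
                                           // topology, seed, boundary) omitted
  generic_field<Tensor> T(space);
  site x(space);
  forallsites(x)                           // parallel loop over local sites
    for(int i=0; i<5; i++)
      for(int j=0; j<5; j++)
        for(int k=0; k<5; k++)
          T(x).component[i][j][k] = i+j+k; // arbitrary local initialization
  T.update();                              // parallel communications
  T.save("T_field.dat");                   // parallel I/O
  mpi.close_wormholes();                   // close communications
  return 0;
};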

12 Arbitrary Topology
(figure: example of a user defined lattice topology; the label "nowhere" marks links without a neighbour)

13 Arbitrary Partitioning
(figure: a 2D lattice split among processes 0, 1 and 2, with the hidden boundary sites of the given boundary size shaded)

int p(int *x, int ndim, int *nc) {
  if (x[0]<3) return 0;
  if (x[1]<4) return 1;
  return 2;
};
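A sketch of how such a partitioning function might be plugged in, based on the constructor signature shown on the Basic Syntax slide; that the trailing optional arguments (topology, seed, boundary) can simply be omitted, and the 8 x 8 box, are assumptions made for illustration:

// Hedged sketch: the user defined partitioning p() assigns each site to a
// process by looking at its coordinates, and is passed as the third
// constructor argument (the position shown on the Basic Syntax slide).
int p(int *x, int ndim, int *nc) {
  if (x[0]<3) return 0;   // sites with x[0] < 3 live on process 0
  if (x[1]<4) return 1;   // remaining sites with x[1] < 4 live on process 1
  return 2;               // everything else lives on process 2
}

int box[] = {8,8};                // illustrative 2D lattice
generic_lattice space(2, box, p); // lattice distributed according to p()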

14 Efficient Communication Patterns
(figures: communication diagrams for update() and save()/load(), and a plot of efficiency and total time versus storage, showing an optimum)
Different communication patterns are used to optimize communications:
If proc. A has information to send to proc. B, it does it in a single transfer.
Each process is always engaged in a single send and a single receive.
All processes communicate at the same time, without cross communication.
Two processes communicate only if they know they have sites in common (this is automatically determined when each lattice is declared).
The process that sends rearranges sites; the process that receives does not, therefore it does not need to allocate a buffer.
A schematic sketch of this pairwise exchange pattern follows below.
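A schematic illustration of the pattern just described, written directly in MPI for clarity; this is a sketch of the idea, not MDP's actual implementation, and the n_send/n_recv bookkeeping is a stand-in for what MDP computes when the lattice is declared:

// Illustration only (NOT MDP code): every pair of processes that shares sites
// exchanges exactly one message in each direction, via a combined send/receive,
// so no process is ever engaged in more than one send and one receive at a time.
#include <mpi.h>

// n_send[p] / n_recv[p]: number of boundary values this process exchanges with
// process p (assumed precomputed from the lattice partitioning).
void exchange_boundaries(int nproc, int me, const int *n_send, const int *n_recv,
                         double **send_buf, double **recv_buf) {
  for(int other=0; other<nproc; other++) {
    if(other==me || (n_send[other]==0 && n_recv[other]==0)) continue; // no sites in common
    MPI_Sendrecv(send_buf[other], n_send[other], MPI_DOUBLE, other, 0,
                 recv_buf[other], n_recv[other], MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);                  // one transfer per pair
  }
}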

15 Standard I/O
(figure: layout of an MDP file: Standard Header, Data ..., Metadata ...)
Sites stored in a fixed order
Ordering is architecture independent
Checks on endianness
No need for parallel I/O
No need to NFS mount disks
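As an illustration of the endianness check mentioned above, one common technique (a generic sketch, not necessarily the test MDP performs):

// Generic endianness check: write a known multi-byte pattern and inspect its
// first byte.  A file format can store such a marker in its header and
// byte-swap the data on read when the marker does not match.
#include <stdio.h>

int machine_is_little_endian() {
  unsigned int marker = 0x01020304;
  return *(unsigned char*)&marker == 0x04;  // little endian stores the low byte first
}

int main() {
  printf("little endian: %d\n", machine_is_little_endian());
  return 0;
}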

16 Local Parameterization (even and odd sites)
(figure: local numbering of the sites stored by one process, split into even and odd sites, together with the HIDDEN boundary copies whose depth is given by the boundary size)
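The even/odd split in the title is the usual checkerboard classification of lattice sites; a minimal sketch of how a parity is computed from the grid coordinates (an illustration, not MDP's API):

// Illustration of the even/odd classification of lattice sites:
// a site is even or odd according to the parity of the sum of its coordinates.
// Grouping sites by parity lets checkerboard-style algorithms update all even
// sites, communicate, then update all odd sites.
int site_parity(const int *x, int ndim) {
  int sum = 0;
  for(int mu=0; mu<ndim; mu++) sum += x[mu];
  return sum % 2;   // 0 = even site, 1 = odd site
}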

17 Optimizations in update()
(figure: the HIDDEN boundary sites exchanged between processes 0, 1 and 2 during update())

18 One Application: FermiQCD
All properties from MDP 1.2: Standard I/O, arbitrary lattice dimensions, arbitrary lattice topology, arbitrary lattice partitioning
Arbitrary SU(N) gauge fields
Wilson, Clover and D234 Fermions
Anisotropic Clover Action
Lepage O(a^2) Improved KS Fermions
Minimum Residue Inversion
Stabilized BiConjugate Gradient Inversion
Reads CANOPY / MILC / UKQCD data files
Speed comparable with MILC code
(figure: timing chart comparing 1 CG iteration in MILC with 1 BiCGStab iteration in FermiQCD)

Notation:
A = (Gamma[mu]+Gamma5-1)*exp(3*I*Gamma5);
forallsites(x)
  psi(x) = (Gamma[mu] - m)*psi(x);
forallsites(x)
  for(alpha=0; alpha<psi.nspin; alpha++)
    psi(x,alpha) = U(x,mu)*psi(x+mu,alpha);

19 Download MDP and FermiQCD
http://home.dencity.com/massimo_dipierro/
Free download
Registration
Licence

20 Site Parameterizations
(figure: a 2D grid with three marked sites A = (0,0), B = (0,3), C = (5,5), distributed over processes 0, 1 and 2)
Grid parametrization:
A = (0,0) -> A(0) = 0, A(1) = 0;
B = (0,3) -> B(0) = 0, B(1) = 3;
C = (5,5) -> C(0) = 5, C(1) = 5;
Global (unique) parametrization: A = 0, B = 3, C = 35
Local (process 0) parametrization: A = 0, B = 22, C = 27
Local (process 1) parametrization: A = 0, B = 16, C = 25
Local (process 2) parametrization: A = ???, B = 7, C = 11
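The global (unique) parametrization is consistent with a row-major index over the grid coordinates; a small check, assuming a 6 x 6 grid, which is not stated on the slide but reproduces its numbers:

// Hedged sketch: the global parametrization on the slide matches a row-major
// index over the grid coordinates.  The grid extent n1 = 6 is an assumption,
// chosen because it reproduces A = 0, B = 3, C = 35.
#include <stdio.h>

int global_index(int x0, int x1, int n1) { return x0*n1 + x1; }

int main() {
  int n1 = 6;                               // assumed grid extent in direction 1
  printf("A = %d\n", global_index(0,0,n1)); // prints A = 0
  printf("B = %d\n", global_index(0,3,n1)); // prints B = 3
  printf("C = %d\n", global_index(5,5,n1)); // prints C = 35
  return 0;
}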

21 Simple Grid Topology

