
1 Parco 2005 Auto-optimization of linear algebra parallel routines: the Cholesky factorization
Luis-Pedro García, Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, Spain, luis.garcia@sait.upct.es
Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain, javiercm@ditec.um.es
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain, domingo@dif.um.es

2 Parco 2005 Outline
- Introduction
- Parallel routine for the Cholesky factorization
- Experimental results
- Conclusions

3 Parco 2005 Introduction
- Our goal: to obtain linear algebra parallel routines with auto-optimization capacity.
- The approach: model the behavior of the algorithm.
- This work: improve the model for the communication costs when:
  - the routine uses different types of MPI communication mechanisms,
  - the system has more than one interconnection network,
  - the communication parameters vary with the volume of the communication.

4 Parco 2005 Introduction
- Theoretical and experimental study of the algorithm, leading to the selection of the algorithmic parameters (AP).
- In linear algebra parallel routines, typical AP and SP are:
  - AP: the block size b, the process grid p = r x c, and the basic library used,
  - SP: k1, k2, k3, t_s and t_w.
- An analytical model of the execution time: T(n) = f(n, AP, SP).
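As an illustration of how such a model can drive the AP selection, the sketch below enumerates candidate block sizes and process grids and keeps the combination with the smallest predicted time. The function names, the toy model and its coefficients are assumptions made for this example; they are not the authors' implementation.

```python
# Hypothetical sketch of model-driven AP selection: enumerate block sizes b
# and process grids p = r x c and keep the combination that minimizes the
# modeled execution time T(n, AP, SP).  The toy model and its coefficients
# are placeholders, not the model of the paper.
from itertools import product


def grid_shapes(p):
    """All factorizations p = r x c of the available processes."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def select_ap(n, procs, block_sizes, model, sp):
    """Return (predicted_time, b, r, c) minimizing the modeled time."""
    best = None
    for b, (r, c) in product(block_sizes, grid_shapes(procs)):
        t = model(n, b, r, c, sp)
        if best is None or t < best[0]:
            best = (t, b, r, c)
    return best


def toy_model(n, b, r, c, sp):
    """Crude stand-in for T(n, AP, SP): dominant flops plus a broadcast term."""
    p = r * c
    t_arit = sp["k3"] * n ** 3 / (3 * p)
    t_com = (n / b) * (sp["ts"] + n * b * sp["tw"] / c)
    return t_arit + t_com


if __name__ == "__main__":
    sp = {"k3": 0.0005, "ts": 55.0, "tw": 0.84}   # µs, in the spirit of Tables 1, 5 and 8
    print(select_ap(4096, 4, [32, 64, 128, 256], toy_model, sp))
```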

5 Parco 2005 Parallel Cholesky factorization
- The n x n matrix is mapped through a block-cyclic 2-D distribution onto a two-dimensional mesh of p = r x c processes (in ScaLAPACK style).
- Figure 1. Work distribution in the first three steps, with n/b = 6 and p = 2 x 3: (a) first step, (b) second step, (c) third step.
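A small sketch of the ScaLAPACK-style 2-D block-cyclic mapping may help here: block (I, J) is assigned to process (I mod r, J mod c), assuming the distribution starts at process (0, 0). This only illustrates the mapping; it is not code from the paper.

```python
# Sketch of the 2-D block-cyclic distribution (ScaLAPACK style): the n x n
# matrix is cut into b x b blocks and block (I, J) goes to process
# (I mod r, J mod c) of the r x c grid, assuming the first block sits on
# process (0, 0).

def block_owner(I, J, r, c):
    """Grid coordinates (pr, pc) of the process owning block (I, J)."""
    return I % r, J % c


def element_owner(i, j, b, r, c):
    """Owner of matrix element (i, j), derived from its block indices."""
    return block_owner(i // b, j // b, r, c)


if __name__ == "__main__":
    # The slide's example: n/b = 6 blocks per dimension on a p = 2 x 3 grid.
    for I in range(6):
        print([block_owner(I, J, 2, 3) for J in range(6)])
```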

6 Parco 2005 Parallel Cholesky factorization
- The general model: t(n) = f(n, AP, SP).
- Problem size: n, the matrix size.
- Algorithmic parameters (AP): block size b; p = r x c processes.
- System parameters (SP), with SP = g(n, AP):
  - k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the costs of the basic arithmetic operations,
  - t_s(p): start-up time,
  - t_ws(n,p), t_wd(n,p): word-sending times for the different types of communications,
  - t_com(n,p) = t_s(p) + n * t_w(n,p).
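A minimal sketch of this communication model, with the split between predefined-type and derived-type word-sending times, could look as follows; treating the measured values as plain scalar inputs is an assumption made for illustration.

```python
# Minimal sketch of the communication model on this slide,
# t_com(n, p) = t_s(p) + n * t_w(n, p).  The word-sending time tw is t_ws
# when a predefined MPI data type is broadcast and t_wd when a derived
# data type is broadcast; both, like ts, come from measured tables for the
# given p (and message or block size).

def t_com(n_words, ts, tw):
    """Predicted time (µs) of one communication of n_words words."""
    return ts + n_words * tw
```

A numeric evaluation of this formula with the measured P4net values appears after Table 8 below.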

7 Parco 2005 Parallel Cholesky factorization
- Theoretical model: T = t_arit + t_com.
- Arithmetic cost: t_arit.
- Communication cost: t_com.
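The detailed cost expressions on this slide are not reproduced in the transcript. As a hedged sketch of the general shape such a model usually takes for a blocked parallel Cholesky factorization (an orientation aid only, not the authors' exact formula):

```latex
\[
T(n,b,r,c) \;=\; t_{\mathrm{arit}} + t_{\mathrm{com}},
\qquad
t_{\mathrm{arit}} \;\approx\; \frac{n^{3}}{3p}\,k_{3}
\;+\; \text{lower-order terms involving } k_{2,\mathrm{potf2}},\ k_{3,\mathrm{trsm}},\ k_{3,\mathrm{syrk}},
\]
\[
t_{\mathrm{com}} \;\approx\; \alpha\,\frac{n}{b}\,t_{s}(p)
\;+\; \beta\, n^{2}\, t_{wd}(n,p)
\;+\; \gamma\, n^{2}\, t_{ws}(n,p),
\]
```

where p = r c and α, β, γ are constants that depend on the grid shape and on how the panels are broadcast along rows and columns.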

8 Parco 2005 Experimental Results
- Systems:
  - P4net: a network of four Intel Pentium 4 nodes connected through a FastEthernet switch, enabling parallel communications between them. The MPI library used is MPICH.
  - HPC160: a network of four quad-processor HP AlphaServer nodes, using shared memory (HPC160smp), MemoryChannel (HPC160mc) or both (HPC160smp-mc) for the communications between processes. An MPI library optimized for shared memory and for MemoryChannel has been used.

9 Parco 2005 Experimental Results
- How to estimate the arithmetic SPs: with routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data access scheme used in the algorithm.
- How to estimate the communication SPs: with routines that communicate rows or columns in the logical mesh of processes:
  - with a broadcast of an MPI derived data type between processes in the same column,
  - with a broadcast of an MPI predefined data type between processes in the same row.
- In both cases the experiments are repeated several times to obtain an average value.
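As a sketch of the arithmetic side of this estimation (assuming, as the magnitudes in Tables 1 and 2 suggest, that k3 is a per-flop cost in µs), one can time a matrix product with panel-shaped operands like those the factorization uses and divide by the flop count; numpy's matmul dispatches to the underlying BLAS dgemm. The communication parameters t_s, t_ws and t_wd would be obtained analogously by timing repeated broadcasts of the corresponding row or column messages and averaging. This is an illustration, not the authors' measurement code.

```python
# Hedged sketch: estimate k3,dgemm by timing C = A @ B with panel-update
# shapes (an (m x b) by (b x m) product) and dividing by the 2*m*b*m flop
# count.  Assumes k3 is a per-flop cost in µs.
import time
import numpy as np


def estimate_k3_dgemm(m, b, reps=10):
    a = np.random.rand(m, b)
    bt = np.random.rand(b, m)
    times = []
    for _ in range(reps):                       # repeat and average, as on the slide
        t0 = time.perf_counter()
        a @ bt                                  # calls the underlying BLAS dgemm
        times.append(time.perf_counter() - t0)
    flops = 2.0 * m * b * m
    return 1e6 * (sum(times) / reps) / flops    # µs per flop


if __name__ == "__main__":
    for b in (32, 64, 128, 256):
        print(f"b = {b:4d}  k3,dgemm ~ {estimate_k3_dgemm(1024, b):.6f} µs/flop")
```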

10 Parco 2005 Experimental Results
- The lowest execution time is obtained with the optimized versions of BLAS and LAPACK for Pentium 4 and for Alpha.

Table 1. Values of arithmetic system parameters (in µsec) in Pentium 4 with BLASopt
Block size    32        64        128       256
k3,dgemm      0.001862  0.000937  0.000572  0.000467
k3,dsyrk      0.003492  0.001484  0.001228  0.000762
k3,dtrsm      0.011719  0.006527  0.003785  0.002325

Table 2. Values of arithmetic system parameters (in µsec) in Alpha with CXML
Block size    32        64        128       256
k3,dgemm      0.000824  0.000658  0.000610  0.000580
k3,dsyrk      0.001628  0.001164  0.000807  0.000688
k3,dtrsm      0.001617  0.001110  0.000841  0.000706

11 Parco 2005 Experimental Results
- But other SPs can depend on n and b, for example k2,potf2:

Table 3. Values of k2,potf2 (in µsec) in Pentium 4 with BLASopt
b \ n    512     1024    2048
32       0.0045  0.0054  0.0067
64       0.0034  0.046   0.0049
128      0.0063  0.0077  0.0076
256      0.0086  0.0103  0.0100

Table 4. Values of k2,potf2 (in µsec) in Alpha with CXML
b \ n    1024    2048    4096
32       0.0028  0.0147  0.0101
64       0.0024  0.0082  0.0034
128      0.0033  0.0052  0.0025
256      0.0027  0.0040  0.0023
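When an SP depends on both n and b, the model needs a rule for choosing a value at problem sizes that were not measured. One simple possibility, shown below with the Table 3 data, is to take the nearest measured grid point; this lookup strategy is an assumption of this example, not something stated on the slides.

```python
# Simple nearest-measured-point lookup for an SP that depends on n and b,
# filled with the k2,potf2 values of Table 3 (Pentium 4 with BLASopt, in µs).
# Using nearest-point lookup (rather than, say, interpolation) is an
# assumption made for this sketch.

K2_POTF2_P4 = {
    (32, 512): 0.0045, (32, 1024): 0.0054, (32, 2048): 0.0067,
    (64, 512): 0.0034, (64, 1024): 0.046,  (64, 2048): 0.0049,
    (128, 512): 0.0063, (128, 1024): 0.0077, (128, 2048): 0.0076,
    (256, 512): 0.0086, (256, 1024): 0.0103, (256, 2048): 0.0100,
}


def k2_potf2(b, n, table=K2_POTF2_P4):
    """Value at the measured (b, n) point closest to the requested sizes."""
    bs = sorted({key[0] for key in table})
    ns = sorted({key[1] for key in table})
    b_near = min(bs, key=lambda x: abs(x - b))
    n_near = min(ns, key=lambda x: abs(x - n))
    return table[(b_near, n_near)]


if __name__ == "__main__":
    print(k2_potf2(100, 1500))    # falls back to the measured point (128, 1024)
```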

12 Parco 2005 Experimental Results
- Communication system parameters: broadcast cost for an MPI predefined data type, t_ws.

Table 5. Values of t_ws (in µsec) in P4net
         Message size
p        1500    2048    > 4000
2        0.61    0.77    0.84
4        1.22    1.45    1.68

Table 6. Values of t_ws (in µsec) in HPC160
p        Shared Memory    MemoryChannel
2        0.011            0.072
4        0.025            0.14

13 Parco 2005 Experimental Results
- Communication system parameters: word-sending time of a broadcast for an MPI derived data type, t_wd.

Table 7. Values of t_wd (in µsec) obtained experimentally for different b and p
              P4net                    HPC160smp                   HPC160mc
b:            32    64    128   256    32     64     128    256    32     64     128    256
p = 2         0.97  0.84  1.00  1.10   0.019  0.024  0.020  0.019   0.095  0.091  0.089  0.090
p = 4         1.60  1.90  1.60  1.64   0.047  0.048  0.045  0.041   0.190  0.176  0.179  0.183

14 Parco 2005 Experimental Results
- Communication system parameters: start-up time of an MPI broadcast, t_s.
- It can be considered that t_s(n,p) ≈ t_s(p).

Table 8. Values of t_s (in µsec) obtained experimentally for different numbers of processes
p      P4net    HPC160smp    HPC160mc
2      55       4.88         4.88
4      121      9.77         9.77
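Putting Tables 5 and 8 into the model t_com(n, p) = t_s(p) + n * t_ws(n, p) from slide 6 gives, for example, the following predicted broadcast time on P4net. Whether the measured t_ws is per word or per byte is not stated in the transcript, so that is treated as an assumption here.

```python
# Illustrative evaluation of t_com = t_s(p) + n * t_ws(n, p) with the
# measured P4net values (p = 2): t_s from Table 8 and t_ws from Table 5
# for messages larger than 4000 elements.  The unit of n (words) is an
# assumption of this example.
ts_p2 = 55.0      # µs, Table 8, P4net, p = 2
tws_p2 = 0.84     # µs per word, Table 5, P4net, p = 2, message size > 4000
n = 8192          # example message length in words
print(f"predicted broadcast time: {ts_p2 + n * tws_p2:.0f} µs")   # about 6936 µs
```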

15 Parco 2005 Experimental Results: P4net.

16 Parco 2005 Experimental Results: HPC160smp.

17 Parco 2005 Experimental Results: parameter selection in P4net.
Table 9. Parameters selection for the Cholesky factorization in P4net.

18 Parco 2005 Experimental Results: parameter selection in HPC160.
Table 10. Parameters selection for the Cholesky factorization in HPC160 with shared memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc).

19 Parco 2005 Conclusions
- The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines.
- It is necessary to use different costs for the different types of MPI communication mechanisms, and different values of the communication parameters in systems with more than one interconnection network.
- It is also necessary to decide the optimal allocation of processes per node according to the speed of the interconnection networks (hybrid systems).

