Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France
Motivation
- solution of large-scale scientific and engineering problems, possibly with hundreds of millions of DOFs
- linear problems
- non-linear problems
- non-overlapping FETI methods with up to tens of thousands of subdomains
- usage of PRACE Tier-1 and Tier-0 HPC systems
PETSc (Portable, Extensible Toolkit for Scientific computation)
- developed by Argonne National Laboratory
- data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
- coded primarily in C, with good Fortran support; can also be called from C++ and Python codes
- current version is 3.2 (www.mcs.anl.gov/petsc)
- petsc-dev (development branch) is intensively evolving
- code and mailing lists open to anybody
PETSc components (figure: sequential / parallel)
Trilinos
- developed by Sandia National Laboratories
- collection of relatively independent packages
- toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
- object-oriented design, high modularity, use of modern C++ features (templating)
- mainly in C++ (Fortran and Python bindings)
- current version: trilinos.sandia.gov
Both PETSc and Trilinos…
- are parallelized on the data level (vectors & matrices) using MPI
- use BLAS and LAPACK, the de facto standard for dense LA
- have their own implementation of sparse BLAS
- include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
- can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
- support CUDA and hybrid parallelization
- are licensed as open-source
Problem of elastostatics f
Primal discretized formulation
The FEM discretization with a suitable numbering of nodes results in the QP problem:
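The slide's formula is not reproduced in the transcript. For orientation, the primal TFETI problem is usually written as below. This is a sketch following the standard TFETI literature (such as the Kozubek et al. paper listed in the references); the symbols K, f, B, c and the subdomain count s are assumed notation, not taken from the slide.

```latex
\min_{u} \; \tfrac{1}{2}\, u^{T} K u - u^{T} f
\quad \text{subject to} \quad B u = c,
\qquad
K = \operatorname{diag}(K_{1}, \dots, K_{s})
```

Here K is the block-diagonal stiffness matrix of the s floating subdomains (each block is symmetric positive semidefinite), f is the load vector, and B u = c collects both the gluing conditions between subdomains and the Dirichlet conditions, which Total-FETI also enforces through multipliers.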
Dual discretized formulation (homogenized)
QP problem again, but with lower dimension and simpler constraints
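The dual problem is likewise missing from the transcript. Assuming a primal problem of the form min ½ uᵀKu − uᵀf subject to Bu = c, the usual derivation gives the following sketch, in which K⁺ denotes a generalized inverse of K and the columns of R span the kernel of K (all notation here is assumed, not taken from the slide):

```latex
% Dual problem obtained by eliminating u from the Lagrangian
\min_{\lambda} \; \tfrac{1}{2}\, \lambda^{T} F \lambda - \lambda^{T} d
\quad \text{subject to} \quad G \lambda = e,
\\[4pt]
F = B K^{+} B^{T}, \qquad d = B K^{+} f - c, \qquad
G = R^{T} B^{T}, \qquad e = R^{T} f.
\\[8pt]
% Homogenization: split off a particular solution of G \lambda = e
\lambda = \mu + \tilde{\lambda}, \qquad
\tilde{\lambda} = G^{T} (G G^{T})^{-1} e,
\\[4pt]
\min_{\mu} \; \tfrac{1}{2}\, \mu^{T} P F P \mu - \mu^{T} P (d - F \tilde{\lambda})
\quad \text{subject to} \quad G \mu = o,
\qquad
P = I - G^{T} (G G^{T})^{-1} G.
```

The dual dimension equals the number of rows of B, which is much smaller than the primal dimension, and the only remaining constraint is the homogeneous equality Gμ = o.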
Primal data distribution, F action
- straightforward matrix distribution, given by a decomposition
- very sparse, block diagonal
- embarrassingly parallel
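The block structure can be illustrated with a small serial sketch of the action of F = B K⁺ Bᵀ. Everything concrete here is an assumption for illustration: the 1D floating subdomains, the two-row constraint matrix, the function names, and the regularization K + ρ R Rᵀ (the slides only say that a regularized K is factorized). The loop over blocks stands in for the independent work of the MPI ranks.

```python
import numpy as np

def subdomain_stiffness(n):
    """1D stiffness matrix of a floating subdomain: singular, kernel = constants."""
    K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    K[0, 0] = K[-1, -1] = 1.0
    return K

def apply_F(K_blocks, B_blocks, lam, rho=1.0):
    """Apply F = B K^+ B^T one subdomain block at a time.

    K_blocks[i] and B_blocks[i] are the data of subdomain i; in a parallel
    code each rank would own one pair and the sum would be a reduction.
    """
    result = np.zeros_like(lam)
    for K, B in zip(K_blocks, B_blocks):
        n = K.shape[0]
        R = np.ones((n, 1)) / np.sqrt(n)        # orthonormal basis of ker K (assumed known)
        K_reg = K + rho * (R @ R.T)             # regularized, nonsingular matrix
        w = np.linalg.solve(K_reg, B.T @ lam)   # local generalized-inverse action
        result += B @ w
    return result

# Two subdomains with 3 nodes each: row 0 glues the interface nodes,
# row 1 enforces a Dirichlet condition on the first node of subdomain 1.
K_blocks = [subdomain_stiffness(3), subdomain_stiffness(3)]
B_blocks = [np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]),
            np.array([[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])]
F = np.column_stack([apply_F(K_blocks, B_blocks, e) for e in np.eye(2)])
```

Because each block touches only its own K and B, no communication is needed until the final sum, which is what makes this step embarrassingly parallel.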
Coarse projector action
- can easily take 85 % of computation time if not properly parallelized!
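A minimal serial sketch of the projector action, assuming the usual TFETI form P = I − Gᵀ(GGᵀ)⁻¹G with a full-row-rank G. The function names and the random test matrix are illustrative only; the parallelization variants discussed in the talk are not reproduced here.

```python
import numpy as np

def make_projector(G):
    """Return a function that applies P = I - G^T (G G^T)^{-1} G.

    Preprocessing: assemble and factorize the small coarse matrix G G^T once.
    Action: two triangular solves plus two products with G per call.
    """
    GGt = G @ G.T
    L = np.linalg.cholesky(GGt)                # G G^T = L L^T (G has full row rank)

    def apply_P(lam):
        y = np.linalg.solve(L, G @ lam)        # solve L y = G lam
        alpha = np.linalg.solve(L.T, y)        # solve L^T alpha = y
        return lam - G.T @ alpha

    return apply_P

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 6))                # stand-in for R^T B^T
apply_P = make_projector(G)
lam = rng.standard_normal(6)
mu = apply_P(lam)                              # mu satisfies G mu = 0
```

The coarse solve with GGᵀ couples all subdomains, so it is the part of each iteration that does not parallelize trivially.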
G preprocessing and action (figure panels: preprocessing, action)
Coarse problem preprocessing and action (figure panels: preprocessing, action)
Currently used variant: B2 (PPAM 2011)
HECToR phase 3 (XE6)
- the UK's largest, fastest and most powerful supercomputer
- supplied by Cray Inc., operated by EPCC
- uses the latest AMD "Bulldozer" multicore processor architecture
- 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
- each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node
- total core count: 2816 nodes × 32 cores per node
- each 16-core processor shares 16 GB of memory, 60 TB in total
- theoretical peak performance over 800 Tflops
HECToR Benchmark
- K^+ implemented as direct solve (LU) of regularized K
- built-in CG routine used (PETSc.KSP, Trilinos.Belos)
- E = 1e6, ν = 0.3, g = 9.81 m s^-2
Results
Table rows: # subds = # cores; Prim. dim; Dual dim; Solution time (Trilinos, PETSc); # iterations (Trilinos, PETSc); iter. time (Trilinos, PETSc). Only the iter. time values are preserved:
iter. time  Trilinos  4.48e-2  4.76e-2  5.00e-2  5.95e-2  9.81e-2  2.75e-1
iter. time  PETSc     3.46e-2  3.92e-2  4.42e-2  4.52e-2  4.69e-2  5.73e-2
stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning
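The stopping criterion can be made concrete with a small unpreconditioned conjugate gradient loop. This is a textbook sketch in NumPy, not the PETSc KSP or Trilinos Belos routine used in the benchmark; the function name, the zero initial guess and the test matrix are assumptions for illustration.

```python
import numpy as np

def cg(apply_A, b, rtol=1e-5, max_iter=1000):
    """Unpreconditioned CG; stops when ||r_k|| / ||r_0|| < rtol."""
    x = np.zeros_like(b)
    r = b - apply_A(x)                 # initial residual r_0
    p = r.copy()
    rr = r @ r
    r0_norm = np.sqrt(rr)
    if r0_norm == 0.0:                 # b = 0: the zero vector is the solution
        return x, 0
    for k in range(1, max_iter + 1):
        Ap = apply_A(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) / r0_norm < rtol:
            return x, k
        p = r + (rr_new / rr) * p      # new search direction
        rr = rr_new
    return x, max_iter

# Small SPD test system: 1D Laplacian with Dirichlet ends.
A = 2.0 * np.eye(20) - np.eye(20, k=1) - np.eye(20, k=-1)
b = np.ones(20)
x, iterations = cg(lambda v: A @ v, b)
```

The operator is passed as a function so that a matrix-free action, such as a projected dual operator, can be plugged in without assembling a matrix.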
Application to image registration
- process of integrating information from two (or more) different images
- images from different sensors, different angles and/or times
In medicine:
- monitoring of growth of a tumour
- therapy evaluation
- comparison of patient data with an anatomical atlas
- data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)
Elastic registration
The task is to minimize the distance between two images.
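As an illustration of "distance between two images", the sketch below uses the sum of squared differences (SSD). The slides do not state which distance measure is used, so SSD is an assumption, as are the function name and the synthetic images. Elastic registration searches over a whole displacement field; the search here is restricted to horizontal shifts to keep the example short.

```python
import numpy as np

def ssd_distance(reference, template):
    """Sum of squared differences between two images of equal shape."""
    diff = reference.astype(float) - template.astype(float)
    return 0.5 * np.sum(diff ** 2)

# Synthetic data: a bright square, and the same square moved 3 pixels to the right.
reference = np.zeros((16, 16))
reference[4:8, 4:8] = 1.0
template = np.roll(reference, 3, axis=1)

# Try a few horizontal shifts of the template and keep the one with the smallest distance.
candidate_shifts = range(-5, 6)
distances = {s: ssd_distance(reference, np.roll(template, s, axis=1))
             for s in candidate_shifts}
best_shift = min(distances, key=distances.get)
```

In the elastic setting the shift is replaced by a displacement defined at every pixel, and a regularizing elastic energy is added to the distance.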
Elastic registration
Parallelization using the TFETI method.
Results
# of subdomains: 1416
Table rows: Primal variables; Dual variables; Solution time [s]; # of iterations; Time/iteration [s]
stopping criterion: ||r_k|| / ||r_0|| < 1e-5
Conclusion and future work
- to consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
- to further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
- to extend image registration to 3D data
References
- KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publishing in Advances in Engineering Software.
- HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publishing in the proceedings of PPAM 2011, Springer LNCS.
- ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp.