
1 Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France

2 Motivation
• solution of large-scale scientific and engineering problems
• possibly hundreds of millions of DOFs
• linear problems
• non-linear problems
• non-overlapping FETI methods with up to tens of thousands of subdomains
• usage of PRACE Tier-1 and Tier-0 HPC systems

3 PETSc (Portable, Extensible Toolkit for Scientific computation)
• developed by Argonne National Laboratory
• data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
• coded primarily in C, with good Fortran support; can also be called from C++ and Python codes
• current version is 3.2 (www.mcs.anl.gov/petsc)
• petsc-dev (development branch) is evolving intensively
• code and mailing lists open to anybody

4 PETSc components (seq. / par.)

5 Trilinos
• developed by Sandia National Laboratories
• collection of relatively independent packages
• toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
• object-oriented design, high modularity, use of modern C++ features (templating)
• mainly in C++ (Fortran and Python bindings)
• current version: trilinos.sandia.gov

6 Trilinos components

7 Both PETSc and Trilinos…
• are parallelized on the data level (vectors & matrices) using MPI
• use BLAS and LAPACK, the de facto standard for dense linear algebra
• have their own implementations of sparse BLAS
• include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
• can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
• support CUDA and hybrid parallelization
• are licensed as open source

8 Problem of elastostatics

9 TFETI decomposition

10 Primal discretized formulation
The FEM discretization with a suitable numbering of nodes results in the QP problem:
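The QP itself appears only as an image on the original slide; a minimal sketch, assuming the usual TFETI notation (K the block-diagonal stiffness matrix assembled from the subdomain matrices, f the load vector, B the constraint matrix enforcing the gluing and Dirichlet conditions, c its right-hand side):

\min_{u} \; \tfrac{1}{2} u^{T} K u - f^{T} u
\quad \text{subject to} \quad B u = c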

11 Dual discretized formulation (homogenized)
QP problem again, but with a lower dimension and simpler constraints
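Again the formula is only an image on the slide; a sketch of the usual TFETI dual QP, assuming K⁺ is a generalized inverse of K and R a basis of ker K:

\min_{\lambda} \; \tfrac{1}{2} \lambda^{T} F \lambda - \lambda^{T} d
\quad \text{subject to} \quad G \lambda = e,
\qquad
F = B K^{+} B^{T}, \quad d = B K^{+} f - c, \quad G = R^{T} B^{T}, \quad e = R^{T} f

Here "homogenized" refers to shifting λ by a particular solution of Gλ = e, so that the equality constraint becomes Gλ = o.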

12 Primal data distribution, F action
• straightforward matrix distribution, given by the decomposition
• very sparse, block diagonal
• embarrassingly parallel
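The slide shows the data distribution only as a picture; below is a minimal PETSc sketch (in C) of how the F = B K⁺ Bᵀ action can be composed from the distributed objects. The names ApplyF, B, kspK and the work vectors are illustrative assumptions, not the authors' code; kspK is assumed to be a KSP configured as an LU solve with the regularized K (as described on the benchmark slide).

#include <petscksp.h>

/* y = F x = B K^+ B^T x, composed from the distributed pieces:
   B is the very sparse constraint matrix, K^+ is realized by solving
   with the regularized, factorized K held in kspK (assumption).      */
static PetscErrorCode ApplyF(Mat B, KSP kspK, Vec x, Vec y,
                             Vec primalTmp1, Vec primalTmp2)
{
  PetscErrorCode ierr;
  ierr = MatMultTranspose(B, x, primalTmp1); CHKERRQ(ierr);    /* B^T x         */
  ierr = KSPSolve(kspK, primalTmp1, primalTmp2); CHKERRQ(ierr); /* K^+ (B^T x)   */
  ierr = MatMult(B, primalTmp2, y); CHKERRQ(ierr);              /* B K^+ B^T x   */
  return 0;
}

The K⁺ solves are independent per subdomain, which is why this part is embarrassingly parallel; only the B and Bᵀ multiplications exchange data across subdomain interfaces.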

13 Coarse projector action
• can easily take 85 % of the computation time if not properly parallelized!
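The projector is not written out in the transcript; in the usual TFETI notation it is the orthogonal projector onto the null space of G,

P = I - G^{T} (G G^{T})^{-1} G, \qquad G = R^{T} B^{T},

so every application of P requires one solve with the coarse matrix G Gᵀ, which is the coarse problem parallelized on the following slides.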

14 G preprocessing and action

15 Coarse problem preprocessing and action
Currently used variant: B2 (PPAM 2011)

16 Coarse problem

17 HECToR phase 3 (XE6)
• the UK's largest, fastest and most powerful supercomputer, supplied by Cray Inc. and operated by EPCC
• uses the latest AMD "Bulldozer" multicore processor architecture
• 704 compute blades
• each blade with 4 compute nodes, giving a total of 2816 compute nodes
• each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node
• total of 90,112 cores (2816 × 32)
• each 16-core processor shares 16 GB of memory, 60 TB in total
• theoretical peak performance over 800 Tflops

18 Benchmark
• K⁺ implemented as a direct solve (LU) of the regularized K
• built-in CG routine used (PETSc KSP, Trilinos Belos)
• E = 1e6, ν = 0.3, g = 9.81 m·s⁻²
• HECToR
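On the PETSc side, the setup described on this and the following slide (built-in CG, relative residual tolerance 1e-5, no preconditioning, K⁺ via LU of the regularized K) could look roughly like the sketch below; the function and object names are assumptions for illustration, not the authors' actual code, and F is assumed to be a (shell) Mat whose multiplication applies the dual operator.

#include <petscksp.h>

/* Configure the two solvers used in the benchmark (sketch):
   - kspK realizes K^+ as a direct LU solve with the regularized,
     sequential per-subdomain stiffness matrix Kreg (assumption),
   - kspDual is the built-in CG on the dual operator F, without
     preconditioning, stopped at ||r_k|| / ||r_0|| < 1e-5.         */
static PetscErrorCode SetupSolvers(Mat Kreg, Mat F, KSP *kspK, KSP *kspDual)
{
  PC             pc;
  PetscErrorCode ierr;

  /* K^+ : one LU factorization per subdomain stiffness matrix.
     Note: with PETSc 3.2 (the version mentioned in the talk)
     KSPSetOperators takes an additional MatStructure argument.    */
  ierr = KSPCreate(PETSC_COMM_SELF, kspK); CHKERRQ(ierr);
  ierr = KSPSetOperators(*kspK, Kreg, Kreg); CHKERRQ(ierr);
  ierr = KSPSetType(*kspK, KSPPREONLY); CHKERRQ(ierr);
  ierr = KSPGetPC(*kspK, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCLU); CHKERRQ(ierr);

  /* dual CG, no preconditioner, relative tolerance 1e-5 */
  ierr = KSPCreate(PETSC_COMM_WORLD, kspDual); CHKERRQ(ierr);
  ierr = KSPSetOperators(*kspDual, F, F); CHKERRQ(ierr);
  ierr = KSPSetType(*kspDual, KSPCG); CHKERRQ(ierr);
  ierr = KSPGetPC(*kspDual, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCNONE); CHKERRQ(ierr);
  ierr = KSPSetTolerances(*kspDual, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT); CHKERRQ(ierr);
  return 0;
}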

19 Results
(Table: # subdomains = # cores, primal dimension, dual dimension, solution time, # of iterations, time per iteration, for both Trilinos and PETSc)
Time per iteration [s], Trilinos: 4.48e-2, 4.76e-2, 5.00e-2, 5.95e-2, 9.81e-2, 2.75e-1
Time per iteration [s], PETSc: 3.46e-2, 3.92e-2, 4.42e-2, 4.52e-2, 4.69e-2, 5.73e-2
Stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning

20 Application to image registration
• process of integrating information from two (or more) different images
• images from different sensors, from different angles and/or at different times

21 In medicine:
• monitoring of tumour growth
• therapy evaluation
• comparison of patient data with an anatomical atlas
• data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)

22 Elastic registration
• the task is to minimize the distance between the two images
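The distance measure and the elastic regularizer appear only as images on the original slides; a sketch of the standard elastic registration functional, assuming the usual notation (R the reference image, T the template image, u the displacement field, μ and λ the Lamé parameters, up to sign conventions):

J[u] = \frac{1}{2}\int_{\Omega} \big( T(x - u(x)) - R(x) \big)^{2}\,dx
     + \int_{\Omega} \frac{\mu}{4}\sum_{j,k}\big(\partial_{x_j} u_k + \partial_{x_k} u_j\big)^{2}
     + \frac{\lambda}{2}\,(\nabla\!\cdot u)^{2}\,dx

Minimizing J leads to a linear elasticity problem driven by the image-difference forces, which is the problem decomposed by TFETI on the next slide.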

23 Elastic registration
• parallelization using the TFETI method

24 Results
(Table: 1, 4 and 16 subdomains; rows: primal variables, dual variables, solution time [s], # of iterations, time/iteration [s])
Stopping criterion: ||r_k|| / ||r_0|| < 1e-5

25 Solution

26 Conclusion and future work
• consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
• further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
• extend image registration to 3D data

27 References
• Kozubek, T. et al.: Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
• Horak, D.; Hapla, V.: TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS.
• Zitova, B.; Flusser, J.: Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003.

28 Thank you for your attention!

