Tomographic mammography parallelization
Juemin Zhang (NU), Tao Wu (MGH), Waleed Meleis (NU), David Kaeli (NU)

Parallelization of SSI Applications
We have developed profile-guided parallelization techniques that rapidly characterize program control flow and data flow, and we use this information to guide parallelization.
We have already sped up a number of CenSSIS applications, including:
– finite-difference time domain
– steepest descent fast multipole method
– photo simulation
– ellipsoid algorithm
We target Beowulf clusters running Linux.
We utilize MPICH as our middleware.

Tomographic mammography
3D image reconstruction from x-ray projections
– Used to detect and diagnose breast cancer
– Based on well-developed mammography techniques
– Exposes tissue structure using multiple projections from different angles
Advantages
– Accuracy: provides at least as much useful information as x-ray film
– Flexibility: digital image manipulation, digital storage
– Provides structural information: using layered images
– Safe: low-dose x-ray
– Lower cost: compared to MRI

Image acquisition and reconstruction process
Acquisition: 11 uniform angular samples along Y-axis
X-ray projection: breast tissue density absorption radiograph
Algorithm: constrained non-linear convergence and iterative process
[Figure: acquisition geometry showing the X-ray source, the detector, the X/Y/Z axes, and the x-ray projections]
[Figure: reconstruction flowchart. Initialization: set 3D volume -> Forward: compute projections -> Backward: correct 3D volume -> Satisfied? No: repeat; Yes: exit with 3D volume]

Reconstruction and Parallelization
Reconstruction algorithm: Maximum likelihood expectation maximization (ML-EM)
High resolution image
Computationally intensive: 3 hours serial execution on a 2.2GHz Pentium 4 workstation, using 2GB memory
The need for speed:
– Large number of medical cases
– Execution time increases as a function of breast size
– Real-time application: computer-guided needle biopsy breast surgery
Research motivation
– Computation vs. communication
– Platforms vs. parallelization methods
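The forward/backward loop behind ML-EM can be illustrated with a toy example. The sketch below uses the textbook emission-style multiplicative update on a tiny dense system matrix; this is an assumption made for illustration, and the actual reconstruction code may use a different (for example transmission) likelihood model and a ray-driven projector. The names `mlem`, `A`, and `y` are hypothetical.

```python
def mlem(A, y, iterations=50):
    """Toy ML-EM. A[i][j] is the (assumed) weight of voxel j in measurement i,
    y[i] is the measured value for projection ray i."""
    n_meas, n_vox = len(A), len(A[0])
    # Sensitivity of each voxel: the column sums of A.
    sens = [sum(A[i][j] for i in range(n_meas)) for j in range(n_vox)]
    x = [1.0] * n_vox  # Initialization: flat, strictly positive volume
    for _ in range(iterations):
        # Forward step: project the current volume estimate.
        proj = [sum(A[i][j] * x[j] for j in range(n_vox)) for i in range(n_meas)]
        # Compare measured and computed projections.
        ratio = [y[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(n_meas)]
        # Backward step: back-project the ratios and correct the volume.
        back = [sum(A[i][j] * ratio[i] for i in range(n_meas)) for j in range(n_vox)]
        x = [x[j] * back[j] / sens[j] if sens[j] > 0 else 0.0 for j in range(n_vox)]
    return x


# Two voxels seen by three rays; the data are consistent with x = [2, 3].
toy_A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_y = [2.0, 3.0, 5.0]
estimate = mlem(toy_A, toy_y, iterations=60)
print(estimate)  # approaches [2.0, 3.0]
```

Every iteration touches the whole volume in both the forward and backward steps, which is why the serial run time grows with volume size and why the volume is a natural unit to partition across processors.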

Parallelization approaches
Reduce communication data
– Segmentation along Y-axis
– Using redundant computation to replace communication
– Segmenting along x-ray beam
First approach: No inter-node communication (more computation, no communication)
Second approach: Overlap with inter-node communication
Third approach: Non-overlapped with inter-node communication (no redundant computation, more communication)
[Figure labels: exchange data; overlap area]
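The trade-off between redundant computation and communication can be made concrete with a small partitioning sketch. It splits the Y extent into contiguous slabs, one per process, and optionally widens each slab by an overlap region. The function name `partition_y`, the overlap width, and the choice of 2034 as the Y extent are assumptions for illustration, not values taken from the implementation.

```python
def partition_y(ny, nprocs, overlap=0):
    """Split ny rows into nprocs contiguous slabs.

    Returns one tuple per rank: (own_start, own_stop, load_start, load_stop).
    The 'own' range is the part of the volume the rank is responsible for;
    the 'load' range adds up to `overlap` extra rows on each side.
    """
    base, extra = divmod(ny, nprocs)
    parts = []
    start = 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)  # spread the remainder rows
        stop = start + size
        load_start = max(0, start - overlap)      # clamp at the volume edges
        load_stop = min(ny, stop + overlap)
        parts.append((start, stop, load_start, load_stop))
        start = stop
    return parts


def redundant_fraction(parts, ny):
    """Extra rows held across all ranks, relative to the ny rows of the volume."""
    loaded = sum(load_stop - load_start for _, _, load_start, load_stop in parts)
    return loaded / ny - 1.0


# Hypothetical example: 2034 rows, 32 processes, 8 overlap rows per side.
slabs = partition_y(2034, 32, overlap=8)
print(slabs[0], slabs[-1])
print(redundant_fraction(slabs, 2034))
```

With `overlap=0` the slabs are disjoint, which corresponds to the non-overlapped approach where boundary data must be exchanged between neighbors. A positive overlap models the approaches in which each process holds, and possibly recomputes, rows owned by its neighbors; the redundant fraction grows with the number of processes because the slabs get thinner while the overlap width stays fixed.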

Implementation and tests
Serial code provided by T. Wu at MGH
Programming model
– C++ and Message Passing Interface (MPI)
– Globus Toolkit: MPICH-G2 over NPACI Grid, in progress
Test input data set
– Phantom data set: 1600x2034x45
– A large patient data set: 1040x2034x70
Test platforms
Platform                     Processor                          Interconnection
MGH cluster                  2.5GHz Pentium 4                   100Mb interconnect switch
UIUC NCSA Titan cluster      800MHz Itanium 1 dual-processor    1Gb Myrinet, shared L3 cache
UIUC NCSA IBM p690 server    1.3GHz Power4                      1Gb Ethernet, shared memory system
SGI Altix 3300 system        1.3GHz Itanium 2 dual-processor    NUMAlink interconnect, shared memory system
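To give a feel for the size of the two test volumes, the sketch below counts voxels and estimates the storage for one copy of each volume. The 4 bytes per voxel (single-precision float) is an assumption; the slides do not state the voxel data type, and the real code also has to hold the projection data and working buffers.

```python
def voxel_count(dims):
    """Number of voxels in a volume given as (nx, ny, nz)."""
    nx, ny, nz = dims
    return nx * ny * nz


def volume_gib(dims, bytes_per_voxel=4):
    """Storage for one copy of the volume in GiB (bytes_per_voxel is assumed)."""
    return voxel_count(dims) * bytes_per_voxel / 2**30


phantom = (1600, 2034, 45)
patient = (1040, 2034, 70)
for name, dims in (("phantom", phantom), ("patient", patient)):
    print(name, voxel_count(dims), round(volume_gib(dims), 2))
```

Under that assumption both volumes hold roughly 146 to 148 million voxels, or about 0.55 GiB per copy, so the two data sets are of comparable size even though their shapes differ.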

Partitioning methods comparison
Input data set
– phantom 1600x2034x45
Platform:
– UIUC NCSA Titan cluster
Non-overlap method outperforms the other two methods
The best parallel runtime is under 3 minutes using 64 processors
Three methods show very similar speedup trends
Given additional processors, the non-overlap method yields a higher performance increase than the other methods

Platform performance comparison using non-overlap method
Input data set: phantom 1600x2034x45
Platforms:
– SGI Altix system
– UIUC NCSA Titan cluster
– UIUC NCSA IBM p690
– Pentium 4 cluster at MGH
Number of processors: 32
Algorithm: Non-overlap with inter-node communication partition method
Computation: SGI Altix with Itanium 2 processor outperforms the other CPUs
Communication: shared memory platforms have very low communication overhead
More than a 2x performance difference between the SGI Altix and the Pentium 4 cluster

Platform performance comparison using no inter-node communication
Input data set: phantom 1600x2034x45
Platforms:
– SGI Altix system
– UIUC NCSA Titan cluster
– UIUC NCSA IBM p690
– Pentium 4 cluster at MGH
Number of processors: 32
Algorithm: overlap without inter-node communication
Computation: significant differences between the Titan, IBM p690, and Pentium 4 clusters
Synchronization: more waiting time accumulates at the end of iterations
SGI Altix performance remains similar to the non-overlap method

Platform and parallel partitioning method performance comparison
Input data set:
– phantom 1600x2034x45
Platforms:
– Pentium 4 cluster at MGH
– UIUC NCSA IBM p690
– UIUC NCSA Titan cluster
– SGI Altix
Number of processors: 32
Computation power dominates performance
Inter-node communication and non-overlap methods lead to higher performance on some platforms

Summary and future work
Over 180X speedup vs. serial implementation
1. Phantom data set: 1600x2034x45
– 1 minute using 64 processors on SGI Altix
2. A large patient data set: 1040x2034x70
– 1.5 minutes using 64 processors on SGI Altix
Joint SPIE paper with T. Wu at MGH: “A parallel reconstruction method for digital tomosynthesis mammography,” 2004 SPIE Workshop on Medical Imaging
Future work:
– Real-time application: computer-guided needle biopsy
  Goal: delay of 5 to 10 seconds or less
  Evaluation of computation reduction effects on image quality
– Move code to a Grid environment (underway)
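The headline numbers can be checked with simple arithmetic. The sketch below assumes that the speedup baseline is the 3-hour serial run on the Pentium 4 workstation reported earlier in the deck; since the parallel time was measured on a different machine (SGI Altix), the ratio reflects both faster hardware and parallelization, which is why the per-processor figure exceeds 1. The variable names are illustrative.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel wall-clock time."""
    return serial_seconds / parallel_seconds


serial_s = 3 * 3600      # 3 hours serial execution (assumed baseline)
parallel_s = 60          # 1 minute on 64 processors, phantom data set
processors = 64

s = speedup(serial_s, parallel_s)
per_processor = s / processors   # above 1.0 because the platforms differ

# Further reduction needed to reach the real-time goal of 5 to 10 seconds.
needed_for_10s = parallel_s / 10
needed_for_5s = parallel_s / 5

print(s, per_processor, needed_for_10s, needed_for_5s)
# prints: 180.0 2.8125 6.0 12.0
```

Under these assumptions, reaching the real-time goal would require a further 6x to 12x reduction in run time for the phantom data set.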
