
Experiences Accelerating MATLAB Systems Biology Applications Heart Wall Tracking Lukasz Szafaryn, Kevin Skadron University of Virginia.


1 Experiences Accelerating MATLAB Systems Biology Applications Heart Wall Tracking Lukasz Szafaryn, Kevin Skadron University of Virginia

2 Outline
- MATLAB
- Optimizations to MATLAB
- GPU Acceleration with CUDA
- Applications (Heart Wall Tracking and Myocyte Simulation)
  - Problem
  - Algorithm
  - Optimization and performance
  - Lessons
- Conclusions
- Future Research

3 MATLAB
- Convenient but inefficient programming language of choice for scientists
  - Interpreted language
  - Most of the existing code and libraries are single-threaded
- MATLAB Parallel Toolbox
  - requires an understanding of parallel programming
- Jacket and GPUmat
  - require large parallelism to justify the overhead

4 MATLAB contd.
- Interpreted language optimized by JIT compiler: 2x slower than C
- MATLAB Embedded Compiler has limited support: 1.2-1.4x slower than C
- MEX interface to link C code
  - requires translating to C
  - many functions have to be written from scratch
  - no support for the convenient OpenMP standard; thread libraries have to be used instead

5 Acceleration
1. Translation:
   - convert MATLAB to C
2. Parallelization:
   - C for multi-core CPU
   - CUDA for GPU
Experimental Setup
- CPU: 3.2 GHz quad-core Intel Core 2 Extreme
- GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)
- MS Windows, MS C Compiler
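As a hedged illustration of the translation step, the sketch below shows how one vectorized MATLAB statement might be rewritten as an explicit C loop. The MATLAB statement `out = a .* b + 1` and the function name `mul_add_one` are hypothetical and are not taken from the heart wall code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C translation of the MATLAB statement
 *     out = a .* b + 1;
 * MATLAB applies the operation to whole arrays; in C the loop over the
 * n elements has to be written out by hand. */
void mul_add_one(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i] + 1.0f;
    }
}
```

Every iteration is independent of the others, so a loop of this shape is a natural candidate for the parallelization step, whether across CPU threads or GPU threads.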

6 Acceleration with GPU (CUDA)
[Diagram: a C program on the CPU drives a CUDA kernel on the GPU]
1. Allocate GPU memory
2. Transfer inputs
3. Launch kernel (the CUDA kernel runs on the GPU)
4. Return to CPU
5. Transfer results
6. Free GPU memory

7 Entire Heart Wall Application
Stage                                             Time [s]  Share [%]
read first frame from input file, display image      0.533       0.28
crop image, display image                            0.089       0.05
SRAD, display image                                  3.224       1.72
detect edges, display image                          0.448       0.24
morphological transformation, display image          0.275       0.15
dilate image, display image                          0.285       0.15
inner and outer ellipse parameter setup              0.001       0.00
create ellipse sample points                         0.726       0.39
track movement of sample points in all frames      129.276      68.99
display movement of sample points through frames    36.63       19.55
save outputs into file                               0.035       0.02
Hough Search, display images                        15.872       8.47
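The profile above implies a ceiling on whole-application speedup if only the tracking stage is accelerated. The sketch below applies Amdahl's law, which the slides do not mention and which is brought in here as an assumption. `amdahl_speedup` is a hypothetical helper name, and all other stages are assumed to keep their measured times.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s and the remaining (1 - p) is unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.6899 (the tracking share), the result approaches 1 / (1 - 0.6899), roughly 3.2x, no matter how large s becomes. Under this model, the display and Hough search stages would limit the end-to-end gain unless they are accelerated or removed as well.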

8 Heart Wall Tracking Description
- Speed and shape of contractions provide important information about the body's response to a stimulus
- Measured by tracking inner and outer heart walls through multiple frames
[Figure: Input, Tracking, Output]

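9 Heart Wall Tracking Algorithm
- Processing 20 inner and 30 outer heart wall points, 50 points in total (task-level parallelism, TLP)
- Processing of each point: a sequence of operations on the surrounding area and template (data-level parallelism, DLP)
[Diagram, along a time axis: Update templates; Read next frame; Track inner point (x20); Track outer point (x30); Save point locations. The frame loop runs 10 times and the whole sequence repeats # of frames / 10 times. Points 1 to 50 are labeled task-level parallelism (TLP); the work within each point is labeled data-level parallelism (DLP).]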
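One possible reading of the diagram above, written as a loop nest. The names `TrackingStats`, `track_point`, and `run_tracking` are hypothetical, the per-point work is reduced to a counter, and the placement of the template update once per batch of 10 frames is an assumption taken from the diagram.

```c
#include <assert.h>

enum { INNER_POINTS = 20, OUTER_POINTS = 30, FRAMES_PER_BATCH = 10 };

/* Counters that stand in for the real work, so the control flow is visible. */
typedef struct {
    long points_tracked;    /* one increment per (frame, point) pair */
    long template_updates;  /* one increment per batch of frames */
} TrackingStats;

/* Placeholder for the per-point sequence of operations on the surrounding
 * area and template (the data-parallel part, DLP). */
static void track_point(TrackingStats *stats)
{
    stats->points_tracked++;
}

TrackingStats run_tracking(int n_frames)
{
    assert(n_frames >= 0 && n_frames % FRAMES_PER_BATCH == 0);
    TrackingStats stats = {0, 0};
    for (int batch = 0; batch < n_frames / FRAMES_PER_BATCH; ++batch) {
        for (int f = 0; f < FRAMES_PER_BATCH; ++f) {
            /* read next frame (omitted) */
            /* The 50 calls below do not depend on one another within a
             * frame: this is the task-level parallelism (TLP). */
            for (int p = 0; p < INNER_POINTS; ++p) track_point(&stats);
            for (int p = 0; p < OUTER_POINTS; ++p) track_point(&stats);
            /* save point locations (omitted) */
        }
        stats.template_updates++;  /* update templates once per batch */
    }
    return stats;
}
```

For the 300-frame recording used in the performance slide, this structure gives 300 x 50 = 15000 point-tracking tasks. Frames must still be processed in order, so the parallelism available at any moment is the 50 points of the current frame plus the data parallelism inside each point.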

10 Heart Wall Tracking Performance
- Times reported for processing of 300 frames (10 s of ultrasound recording)
[Chart: speedup bars labeled 1.21x, 2.71x, 10.86x, 2.37x, 3.39x, 5.94x, 6.29x, 33.20x, 36.89x, 41.23x]

11 Porting Tradeoffs
Convenience
- Developing modular code
  - replacing each MATLAB function with an equivalent parallelized GPU function
  - common routines can possibly be reused in other applications
- Hiding specific aspects of GPU programming inside each module
  - each module sets up its own GPU execution parameters
  - each module performs its own I/O with the GPU transparently
Performance
- Restructuring and combining code
  - writing code based on algorithm tasks rather than MATLAB statements
  - overlapping parallel (often unrelated) tasks by executing them in the same kernel call
- Exposing specific aspects of GPU programming
  - doing more work inside each GPU kernel call to fully exploit parallelism and avoid overhead
  - performing I/O with the GPU manually to eliminate redundant data transfers
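The cost of the modular approach can be made concrete with a simple counting model. This is a sketch under stated assumptions, not a measurement. It assumes each module performs one upload and one download per frame, and that the merged version keeps intermediates in GPU memory. The function names are hypothetical.

```c
#include <assert.h>

/* Modular structure (assumption): every module uploads its inputs and
 * downloads its outputs on each frame. */
long transfers_modular(long frames, long modules_per_frame)
{
    assert(frames >= 0 && modules_per_frame >= 0);
    return 2 * frames * modules_per_frame;
}

/* Merged structure (assumption): intermediates stay in GPU memory, so each
 * frame needs one upload (the frame) and one download (the results). */
long transfers_merged(long frames)
{
    assert(frames >= 0);
    return 2 * frames;
}
```

Under these assumptions the number of CPU-GPU transfers shrinks by a factor equal to the number of modules per frame. The model counts transfers only; it ignores transfer sizes and kernel launch overhead.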

12 [Diagram: kernel structure, with axes for time and SMs. Each of the 30 outer and 20 inner heart points runs the same sequence: multiple add, sub, mul, div; SYNC; reduction; SYNC; multiple add, sub, mul, div; SYNC; convolution. A global SYNC separates Frame 1, Frame 2, ...]
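The Reduction stages in the diagram sit between SYNC points because a parallel reduction proceeds in passes, and each pass reads values written by the previous one. The following is a serial C emulation of a tree reduction, not the authors' kernel; `tree_reduce_sum` is a hypothetical name. In a CUDA kernel the inner loop would be spread across the threads of a block, with a barrier such as `__syncthreads()` at the end of each pass.

```c
#include <assert.h>
#include <stddef.h>

/* Sums buf[0..n-1] in place by repeated pairwise addition and returns the
 * total. The contents of buf are overwritten with partial sums. */
float tree_reduce_sum(float *buf, int n)
{
    assert(buf != NULL && n > 0);
    int active = n;                      /* elements still holding partial sums */
    while (active > 1) {
        int half = (active + 1) / 2;     /* rounds up so odd counts are handled */
        /* One pass: on a GPU, each i would be handled by a separate thread. */
        for (int i = 0; i + half < active; ++i) {
            buf[i] += buf[i + half];
        }
        /* A barrier belongs here: the next pass reads what this one wrote. */
        active = half;
    }
    return buf[0];
}
```

For n elements the loop runs about log2(n) passes, which is why each reduction in the diagram costs several synchronizations rather than one.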

13 Heart Wall Tracking Lessons
- Typical MATLAB code written by a scientist has room for optimization (1.3x)
- Conversion to C requires significant coding effort
- Selective offloading results in multiple CPU-GPU data transfer overheads
- Iterative codes require merging kernels and reusing variables to avoid overhead
- CUDA libraries cannot be used as a part of GPU code
- Good performance requires significant changes to the structure of the code, which make it difficult for a scientist to understand

14 Conclusions
- Limited availability of C libraries necessitates time-consuming coding
- Many systems biology applications (even those with limited parallelism) benefit from GPU
- GPU overheads are significant (should be eliminated in new CPU-GPU architectures)
- Real-time processing feasible in the near future
- Ultimately, acceleration of applications should be automated!

15 Future Research
- Automatic acceleration with the use of a compiler
  - via use of architecture-specific libraries
  - via compiling for target architecture
- Merging of workloads
  - based on resource needs
  - based on dependency
- Acceleration with alternative architectures
  - well suited for fine-grained parallelism
  - esp. FPGA

16 Acknowledgements
Funding provided by:
- NSF grant IIS-0612049
- SRC grant 1607.001
Equipment donated by NVIDIA

17 Software
Source code will soon be available at: http://lava.cs.virginia.edu/wiki/rodinia

18 Questions

19 Backup

