Published by Simon Butler. Modified over 8 years ago.
1
Heterogeneous many cores for medical control: Performance, Scalability, and Accuracy. Madhurima Pore, Arizona State University. October 10, 2014. #GHC14
2
Acknowledgements. Inputs and guidance: Dr. Sandeep Gupta (Advisor), Dr. Ayan Banerjee, Hari K Tadepalli. This work has been partly funded by CNS grant #1218505 and Intel Corp.
3
System Model. Components: Predictive Analytics and Control, Patient Data Management, Notification.
4
Motivation. Model predictive controllers (MPC) have to compute a human physiology model within a tight time constraint. Human physiological models vary in complexity, computation time, and accuracy. For multiple patients, many-core devices can be used to meet real-time constraints. Goal: evaluate the different MPCs on many-core platforms for performance and energy.
5
Model predictive controllers in medical devices. [Figure: closed-loop MPC block diagram. Labels: Control Algorithm, Control Error, Programmed Infusion Rate, Model of Human Physiology, Target, Open Loop System, Control Action, Feedback from Human Body, Reference Drug Concentration, Infusion Rate, Perturbation, Bolus Rate, Infusion Pump, Wireless Channel.] MPCs use human physiology models that differ in: computational complexity; proportion of serial and parallel parts; computation time; accuracy.
6
MPC applications. Control steps: (1) Assume a certain infusion rate. (2) Estimate the drug content using physiological models. (3) Increment the infusion rate. (4) Estimate the drug content again using physiological models. (5) Based on the two estimates, decide the final infusion rate. The pharmacokinetic model involves: 1. Complex math operations. 2. Serial computations, where the output of one function is used as the input of the next. The spatiotemporal model involves: 1. Discretization using the finite-difference time-domain method. 2. Grid computation that estimates the drug over tissue. MPC applications exhibit: 1. Data parallelism within the application. 2. Parallelism due to multiple patients. Hence the need for many-core devices.
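The five control steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-compartment model in `estimate_concentration`, its parameters (`k_elim`, `volume`, `horizon`), and the rule "keep whichever rate lands closer to the target" are all hypothetical stand-ins, since the slide does not say how the final rate is decided or which physiological model is used.

```python
import math

def estimate_concentration(rate, k_elim=0.1, volume=10.0, horizon=60.0):
    """Hypothetical one-compartment model (not the models from the talk):
    concentration after `horizon` minutes of constant infusion at `rate`,
    starting from zero drug in the body."""
    return rate / (k_elim * volume) * (1.0 - math.exp(-k_elim * horizon))

def choose_infusion_rate(rate, step, target, model=estimate_concentration):
    """Steps 1-5: estimate at the assumed rate, estimate at the incremented
    rate, then keep the rate whose estimate is closer to the target
    (assumed decision rule)."""
    c_base = model(rate)          # step 2: estimate at the assumed rate
    c_inc = model(rate + step)    # steps 3-4: estimate at the incremented rate
    if abs(c_inc - target) < abs(c_base - target):
        return rate + step        # step 5: the increment helps
    return rate                   # step 5: keep the assumed rate

if __name__ == "__main__":
    print(choose_infusion_rate(1.0, 0.5, target=1.5))
```

Each call evaluates the physiological model twice, which is why the cost of the model dominates the controller's time budget.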
7
Performance. Metrics for individual applications: FLOPS, FLOPs per Joule. Metric for multiple patients: number of patients monitored simultaneously without exceeding the time constraint. Characteristics of the pharmacokinetic model: computation is divided into at most 8 threads; memory requirement is low; communication overhead is low. Characteristics of the spatiotemporal model: the entire grid can be processed in parallel; memory requirement is high; data transfer adds overhead. Application parameters such as the maximum number of threads, data transfer overhead, and memory accesses and size change the performance depending on the architecture.
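The metrics named on this slide are simple to compute once timings and power are known. The sketch below is illustrative only: the function names are hypothetical, and the timing table in the usage example is made up for the demonstration, not taken from the measurements in this talk.

```python
def flops(flop_count, seconds):
    """Floating-point operations per second."""
    return flop_count / seconds

def flops_per_joule(flop_count, seconds, avg_power_watts):
    """Operations per unit of energy, with energy = average power x time."""
    return flop_count / (avg_power_watts * seconds)

def max_patients_within_deadline(exec_time_by_patients, deadline_s):
    """Largest patient count whose measured execution time still meets the
    deadline; returns 0 if even the smallest configuration misses it."""
    feasible = [n for n, t in exec_time_by_patients.items() if t <= deadline_s]
    return max(feasible, default=0)

if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by number of patients.
    timings = {1: 2.0, 2: 3.5, 4: 6.0, 8: 11.0}
    print(max_patients_within_deadline(timings, deadline_s=6.0))
    print(flops_per_joule(1e9, seconds=2.0, avg_power_watts=50.0))
```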
8
Architecture. i7 (diagram: Core 1 to Core 4, Processor Graphics, shared L3 cache, System and Memory Controller): 8 threads; L1 32 kB data and instruction; L2 256 kB per core; L3 8 MB. GPU (GTX 680; diagram: Graphics Processor Clusters, CUDA cores, L2 cache, Memory Controller, PCI Express): 1536 CUDA cores, 1536 FMA and 256 special function units, 512 KB L2 cache. MIC (Intel® Xeon Phi™ Coprocessor 3120P; diagram: cores with L2, GDDR MC, Tag Directory, Bidirectional Ring Interconnect): 57 cores with 4 threads each, 28.5 MB L2 cache in total (512 KB per core), and max memory size 6 GB. Architectures vary in: compute power, i.e. number of threads; shared and individual memory for every core; data transfer overhead. Applications should exploit these resources to maximize throughput.
9
Pharmacokinetic Algorithm. Steps: (1) Compute the eigenvalues of A (4 x 4). (2) Obtain the modal matrix M of A. (3) Obtain the Jordan form J using M. (4) Compute exp(-J(t - t0)). (5) Perform a series of nine matrix multiplications to solve for the output y(t). i7 and MIC implementation (Intel Math Kernel Library, MKL): LAPACKE_dgeev, cblas_sgemm, serial for loops, cblas_dgemm. At most 8 parallel threads run on the 4 cores of the i7, and at most 16 parallel threads run on MIC as opposed to the 224 threads available; the rest is serial execution. GPU implementation (MAGMA library): magma_zheevr_gpu, cublasSgemm, serial for loops, cublasSgemm. Only 16 threads run in parallel as opposed to 786 K; the rest is serial execution. A maximum of 16 computations can be done in parallel, leaving the rest of the cores unutilized.
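The reason the eigen-decomposition comes first can be written out as a short identity. This is a sketch under the assumption that A is diagonalizable, so that the Jordan form J is diagonal with the eigenvalues on its diagonal; the slide does not give the state equation or how the nine multiplications assemble y(t), so those are left out.

```latex
\begin{aligned}
A &= M J M^{-1}, \qquad J = \operatorname{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4),\\
e^{-A (t - t_0)} &= M \, e^{-J (t - t_0)} \, M^{-1},\\
e^{-J (t - t_0)} &= \operatorname{diag}\!\left(e^{-\lambda_1 (t - t_0)}, \dots, e^{-\lambda_4 (t - t_0)}\right).
\end{aligned}
```

With J diagonal, the matrix exponential reduces to four scalar exponentials, and the remaining work is small dense matrix products. Each step consumes the result of the previous one, which matches the serial character described above.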
10
Spatiotemporal Algorithm. Grid size N x N, e.g. N = 256. Maintain 3 matrices of size N x N: tdc1, tdc2, tdc3. Compute each point using eq. (12), then set tdc1 = tdc2 and tdc2 = tdc3. Parallelism in i7: 8 threads; all the code runs on the host i7. Parallelism in MIC: 224 threads; all the code runs on the MIC card. Parallelism in GPU: 8 threads on the host machine and 257k threads on the GPU; the grid update runs on the GPU cores using a kernel function, where each block has block_dim^2 threads and runs on one SM, and the remaining code runs on the host machine. Computation of the entire grid is spread onto the available cores, which improves performance; however, the serial parts of the application limit it.
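The three-buffer update and rotation can be sketched as follows. Eq. (12) is not reproduced on the slide, so the 5-point stencil and the coefficient `alpha` below are placeholders chosen for illustration, not the authors' update rule; only the buffer names `tdc1`, `tdc2`, `tdc3` and the rotation come from the slide.

```python
def step(tdc1, tdc2, alpha=0.1):
    """Compute the next time level tdc3 from the two earlier levels.
    Placeholder for eq. (12): a 5-point explicit stencil on tdc2.
    tdc1 is accepted for interface parity but unused by this placeholder."""
    n = len(tdc2)
    tdc3 = [row[:] for row in tdc2]  # boundary points keep their values
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (tdc2[i - 1][j] + tdc2[i + 1][j]
                   + tdc2[i][j - 1] + tdc2[i][j + 1]
                   - 4.0 * tdc2[i][j])
            tdc3[i][j] = tdc2[i][j] + alpha * lap
    return tdc3

def simulate(tdc1, tdc2, steps, alpha=0.1):
    """Advance the grid by `steps` time steps, rotating the buffers as on
    the slide: tdc1 = tdc2, tdc2 = tdc3."""
    for _ in range(steps):
        tdc3 = step(tdc1, tdc2, alpha)
        tdc1, tdc2 = tdc2, tdc3
    return tdc2

if __name__ == "__main__":
    n = 5
    zeros = [[0.0] * n for _ in range(n)]
    start = [[0.0] * n for _ in range(n)]
    start[2][2] = 1.0  # a single unit of drug at the grid centre
    for row in simulate(zeros, start, steps=1):
        print(row)
```

Within one time step every interior point reads only the earlier levels, so all points can be computed independently; that is the data parallelism the slide maps onto threads. The rotation between steps is the part that stays sequential.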
11
Performance Results. Execution time for Spatiotemporal Model for multiple patients. Columns: #patients, i7, MIC, GPU, GPU computation only (s). Rows: 110.212.7833 280.868.12946 482.228.2929312 882.413554124 1683.83136.299347 32176.05142.01195395 647621933823190
12
Performance Results. Execution time for Pharmacokinetic Model for multiple patients. Columns: #patients, GPU, MIC, i7 (s). Rows: 121384.855.58 299289.026.26 4208194.927.48 82584118.3613.11 163552240.7324.78 322066.9953.86 64336.8
13
Performance Energy Model. [Figures: Pharmacokinetic Model; Spatiotemporal Model]
14
Methodology. Goal: extract maximum performance and power efficiency out of the multicore. Design space: define the resource constraints, e.g. design parameters such as number of threads and memory. Exploration: map the MPC applications onto the multicores to maximize throughput. Evaluation: different performance and energy savings are obtained for each application. Hypothesis: for a mix of MPC applications monitoring patients with different medical needs, a heterogeneous platform, e.g. MIC plus i7, may be more efficient.
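The exploration step can be pictured as a small search over application-to-platform mappings. The sketch below is an assumption-laden illustration, not the procedure used in this work: it brute-forces every mapping and picks the one with the shortest overall finish time, assuming applications on the same platform run back to back while platforms run concurrently. The function name and the timings in the usage example are hypothetical.

```python
from itertools import product

def best_mapping(times):
    """times[app][platform] -> execution time in seconds (hypothetical input;
    every app needs a time for every platform).
    Returns (mapping, makespan) for the mapping with the smallest makespan."""
    apps = list(times)
    platforms = sorted({p for per_app in times.values() for p in per_app})
    best = None
    for choice in product(platforms, repeat=len(apps)):
        load = {p: 0.0 for p in platforms}
        for app, platform in zip(apps, choice):
            load[platform] += times[app][platform]
        makespan = max(load.values())  # platforms work concurrently
        if best is None or makespan < best[1]:
            best = (dict(zip(apps, choice)), makespan)
    return best

if __name__ == "__main__":
    # Illustrative numbers only, not measurements from this talk.
    times = {
        "pharmacokinetic": {"i7": 5.0, "mic": 80.0},
        "spatiotemporal": {"i7": 10.0, "mic": 3.0},
    }
    print(best_mapping(times))
```

Exhaustive search is only practical for a handful of applications and platforms, which is the scale discussed here.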
15
Sample Results:
16
Conclusion. The pharmacokinetic model, with mostly serial code, makes use of the powerful i7 cores for fast performance. The performance of the spatiotemporal model application, with highly parallel sections, is limited by the data transfer overhead in the serial sections. In such cases, a large memory, as in MIC, and high compute capability are better. No single platform is suitable for both applications, but heterogeneous platforms such as MIC-i7 work well for the combination of MPC applications. Reference: Madhurima Pore, Ayan Banerjee, Sandeep K.S. Gupta, and Hari K Tadepalli, "Performance trends of multicore system for throughput computing in medical application," International Conference on High Performance Computing (HiPC13), Hyderabad, India, December 2013.
17
Got Feedback? Rate and review the session using the GHC Mobile App. To download, visit www.gracehopper.org