Heterogeneous Many Cores for Medical Control: Performance, Scalability, and Accuracy. Madhurima Pore, Arizona State University. October 10, 2014. #GHC14


Acknowledgements
Inputs and guidance: Dr. Sandeep Gupta (advisor), Dr. Ayan Banerjee, Hari K. Tadepalli.
This work has been partly funded by CNS grant # and Intel Corp.

System Model
Predictive analytics and control; patient data management; notification.

Motivation
Model predictive controllers (MPCs) must compute a human physiology model within a tight time constraint. Human physiological models vary in complexity, computation time, and accuracy. To meet real-time constraints for multiple patients, many-core devices can be used. Goal: evaluate different MPCs on many-core platforms for performance and energy.

Model Predictive Controllers in Medical Devices
Closed-loop components: control algorithm, control error, programmed infusion rate, model of human physiology, target, open-loop system, control action, feedback from the human body, reference drug concentration, infusion rate, perturbation, bolus rate, infusion pump, wireless channel.
MPCs use human physiology models that differ in: computational complexity, the mix of serial and parallel parts, computation time, and accuracy.

MPC Applications
Control loop: assume a certain infusion rate; estimate the drug content using physiological models; increment the infusion rate; estimate again; based on the two estimates, decide the final infusion rate.
Pharmacokinetic model: complex math operations; serial computations, where the output of one function is the input of the next.
Spatiotemporal model: discretization using the finite-difference time-domain (FDTD) method; grid computations that estimate drug concentration over tissue.
MPC applications exhibit data parallelism within the application and parallelism across multiple patients, hence the need for many-core devices.
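The two-estimate decision step above can be sketched as follows. This is a minimal illustration only: the one-compartment model, rates, target, and tolerance are hypothetical placeholders standing in for the deck's pharmacokinetic and spatiotemporal models.

```python
import math

def estimate_drug_content(rate, minutes=30, clearance=0.1):
    # Hypothetical one-compartment stand-in for the physiological model:
    # concentration approaches rate/clearance exponentially over time.
    return (rate / clearance) * (1.0 - math.exp(-clearance * minutes))

def mpc_step(current_rate, target, delta=0.5):
    # Step 1: estimate drug content at the current infusion rate.
    est_now = estimate_drug_content(current_rate)
    # Step 2: estimate again with the rate incremented.
    est_up = estimate_drug_content(current_rate + delta)
    # Step 3: keep whichever rate brings the estimate closer to the target.
    if abs(est_up - target) < abs(est_now - target):
        return current_rate + delta
    return current_rate

# Repeated steps walk the infusion rate toward the target concentration.
rate = 1.0
for _ in range(10):
    rate = mpc_step(rate, target=40.0)
```

Running the same two-estimate step for every patient is what creates the cross-patient parallelism the slide points out.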

Performance
Metrics for individual applications: FLOPS and FLOPs per joule. For multiple patients: the number of patients monitored simultaneously without exceeding the time constraint.
Pharmacokinetic model characteristics: computation divided into at most 8 threads; low memory requirement; low communication overhead.
Spatiotemporal model characteristics: the entire grid can be processed in parallel; high memory requirement; data transfer overhead.
Application parameters such as the maximum number of threads, data transfer overhead, and memory access pattern and size vary the performance with respect to the architecture.
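The three metrics above are simple ratios; a sketch with hypothetical operation counts, timings, and power draw (not measurements from the talk):

```python
def flops(op_count, seconds):
    # Floating-point operations per second.
    return op_count / seconds

def flops_per_joule(op_count, seconds, watts):
    # Energy efficiency: operations per joule, where joules = watts * seconds.
    return op_count / (watts * seconds)

def max_patients(per_patient_time, deadline):
    # Patients that can be served within the real-time deadline.
    return int(deadline // per_patient_time)
```

A platform can lead on raw FLOPS yet lose on FLOPs per joule, which is why the talk reports both.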

Architectures
i7: four cores plus processor graphics, shared L3 cache, and a system and memory controller; 8 threads; 32 kB L1 data and instruction caches and 256 kB L2 per core; 8 MB L3.
GPU (GTX 680): graphics processor clusters of CUDA cores with an L2 cache, memory controller, and PCI Express link; 1536 CUDA cores with 1536 FMA and 256 special function units; 512 kB L2 cache.
MIC (Intel Xeon Phi coprocessor 3120P): cores with per-core L2 caches joined by a bidirectional ring interconnect with a tag directory and GDDR memory controllers; 57 cores with 4 threads each; 28.5 MB total L2 cache; maximum memory size 6 GB.
The architectures vary in compute power (number of threads), shared vs. per-core memory, and data transfer overhead. An application should exploit these resources to maximize throughput.

Pharmacokinetic Algorithm
Steps: compute eigenvalues of A (4 × 4); obtain the modal matrix M of A; obtain the Jordan form J; compute exp(-J(t - t0)); then a series of nine matrix multiplications to solve for the output y(t).
i7/MIC implementation (Intel Math Kernel Library, MKL): LAPACKE_dgeev for the eigenvalues; cblas_sgemm / cblas_dgemm for the multiplications; serial for loops elsewhere.
GPU implementation (MAGMA library): magma_zheevr_gpu for the eigenvalues; cublasSgemm for the multiplications; serial for loops elsewhere.
Utilization: at most 8 parallel threads on the 4 i7 cores; at most 16 parallel threads as opposed to 224 hardware threads on the MIC; on the GPU only 16 threads run in parallel as opposed to 786 K, with serial execution elsewhere, leaving the rest of the cores unutilized.
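When A is diagonalizable, the step sequence above reduces to exp(-A(t - t0)) = M · exp(-J(t - t0)) · M⁻¹ with J diagonal, so the matrix exponential is elementwise on the diagonal. A minimal pure-Python sketch on a hypothetical 2 × 2 example (the deck's A is 4 × 4 and is solved with the MKL/MAGMA routines named above):

```python
import math

def matmul(X, Y):
    # Plain O(n^3) matrix multiplication.
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

# Hypothetical diagonalizable system with eigenvalues 1 and 2: A = M J M^-1.
M     = [[1.0, 1.0], [0.0, 1.0]]      # modal matrix (eigenvectors)
M_inv = [[1.0, -1.0], [0.0, 1.0]]
J     = [[1.0, 0.0], [0.0, 2.0]]      # Jordan form (here diagonal)

def exp_neg_At(t, t0=0.0):
    # exp(-J (t - t0)) is elementwise on the diagonal when J is diagonal.
    eJ = [[math.exp(-J[i][i] * (t - t0)) if i == j else 0.0
           for j in range(2)] for i in range(2)]
    # Transform back: M * exp(-J dt) * M^-1.
    return matmul(matmul(M, eJ), M_inv)
```

The small fixed matrix size is exactly why the slide reports so few usable parallel threads: each multiplication exposes only 4 × 4 elements of parallel work.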

Spatiotemporal Algorithm
Grid size N × N, e.g. N = 256. Maintain three N × N matrices tdc1, tdc2, tdc3; compute each point using eq. (12); then rotate: tdc1 = tdc2, tdc2 = tdc3.
Parallelism: on the i7, 8 threads, with all code on the host; on the MIC, 224 threads, with all code on the MIC card; on the GPU, 257k threads, with the grid update run as a kernel function (each block has block_dim^2 threads and runs on an SM) while the serial parts run on the host i7 with 8 threads.
The computation of the entire grid is spread over the available cores, improving performance; however, the serial parts of the application limit it.
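The three-matrix rotation above can be sketched as follows. Eq. (12) is not reproduced in the transcript, so a generic three-time-level explicit stencil stands in for it; the grid size, coefficient, and initial bolus are hypothetical.

```python
def step(tdc1, tdc2, alpha=0.1):
    # One time step over the interior of the grid. A generic explicit
    # stencil (two previous time levels plus a Laplacian) stands in for
    # eq. (12); every interior point is independent, hence the parallelism.
    n = len(tdc1)
    tdc3 = [row[:] for row in tdc2]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (tdc2[i-1][j] + tdc2[i+1][j] +
                   tdc2[i][j-1] + tdc2[i][j+1] - 4 * tdc2[i][j])
            tdc3[i][j] = 2 * tdc2[i][j] - tdc1[i][j] + alpha * lap
    return tdc3

def simulate(n=8, steps=5):
    tdc1 = [[0.0] * n for _ in range(n)]
    tdc2 = [[0.0] * n for _ in range(n)]
    tdc2[n // 2][n // 2] = 1.0          # hypothetical drug bolus at the center
    for _ in range(steps):
        tdc3 = step(tdc1, tdc2)
        tdc1, tdc2 = tdc2, tdc3         # rotate: tdc1 = tdc2, tdc2 = tdc3
    return tdc2
```

On a GPU each interior (i, j) update would map to one thread; the rotation between steps is the serial part that stays on the host.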

Performance Results
Execution time (s) for the spatiotemporal model for multiple patients: #patients vs. i7, MIC, GPU, and GPU computation only.

Performance Results
Execution time (s) for the pharmacokinetic model for multiple patients: #patients vs. GPU, MIC, and i7.

Performance Energy Model
Pharmacokinetic model; spatiotemporal model.

Methodology
Goal: extract maximum performance and power efficiency from the many-core platforms.
Design space: define the resource constraints, i.e. design parameters such as the number of threads and memory.
Exploration: map the MPC application onto the many cores to maximize throughput.
Evaluation: different performance and energy savings are obtained for each application.
Hypothesis: for a mix of MPC applications monitoring patients with different medical needs, a heterogeneous platform (e.g. MIC + i7) may be more efficient.
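The design-space/exploration/evaluation loop can be sketched as a small exhaustive search. The platforms, constraints, and throughput figures below are hypothetical stand-ins, not measurements from the talk.

```python
# Hypothetical design points: (platform, threads, memory_mb, patients_per_sec).
design_space = [
    ("i7",  8,    8_000, 12.0),
    ("MIC", 224,  6_000, 30.0),
    ("GPU", 1536, 2_000, 25.0),
]

def explore(max_threads, max_memory_mb):
    # Design space: keep only points inside the resource constraints.
    feasible = [p for p in design_space
                if p[1] <= max_threads and p[2] <= max_memory_mb]
    # Exploration: pick the mapping that maximizes throughput.
    return max(feasible, key=lambda p: p[3]) if feasible else None
```

Evaluation then repeats this per application; the hypothesis is that different applications pick different winners, favoring a heterogeneous mix.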

Sample Results (figure)

Conclusion
The pharmacokinetic model, with mostly serial code, makes use of the powerful i7 cores for fast performance. The spatiotemporal model, despite its highly parallel sections, is limited by data transfer overhead and by its serial sections; in such cases, large memory (as in the MIC) and high compute capability are better. No single platform is suitable for both applications, but a heterogeneous platform such as MIC + i7 works well for the combination of MPC applications.
Madhurima Pore, Ayan Banerjee, Sandeep K. S. Gupta, and Hari K. Tadepalli, "Performance trends of multicore system for throughput computing in medical application," International Conference on High Performance Computing (HiPC 2013), Hyderabad, India, December 2013.
