Modeling Ion Channel Kinetics with High-Performance Computation
Allison Gehrke, Dept. of Computer Science and Engineering, University of Colorado Denver


Agenda
- Introduction
- Application Characterization, Profile, and Optimization
- Computing Framework
- Experimental Results and Analysis
- Conclusions
- Future Research

Introduction
- Target application: Kingen
  - Simulates ion channel activity (kinetics)
  - Optimizes kinetic model rate constants against biological data
- Ion channel kinetics
  - Transition states
  - Reaction rates
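To make the modeling task concrete, here is a minimal sketch (not Kingen's actual code) of how a kinetic scheme becomes a system of differential equations: a hypothetical two-state channel, Closed <-> Open, with forward rate k1 and backward rate k2, yields one ODE per state occupancy.

    // Hypothetical two-state scheme: Closed <-> Open, rates k1 (C->O), k2 (O->C).
    // Occupancies sum to 1; the reaction rates define the derivatives:
    //   dC/dt = -k1*C + k2*O ,  dO/dt = k1*C - k2*O
    struct TwoStateChannel {
        double k1, k2;  // transition rates (1/s): the parameters a fitter tunes
        void derivatives(const double s[2], double ds[2]) const {
            ds[0] = -k1 * s[0] + k2 * s[1];  // d(Closed)/dt
            ds[1] =  k1 * s[0] - k2 * s[1];  // d(Open)/dt
        }
    };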

Computational Complexity

AMPA Receptors

Kinetic Scheme

Introduction: Why study ion channel kinetics?
- Protein function
- Implement accurate mathematical models
- Neurodevelopment
- Sensory processing
- Learning/memory
- Pathological states

Modeling Ion Channel Kinetics with High-Performance Computation
- Introduction
- Application Characterization, Profile, and Optimization
- Computing Framework
- Experimental Results and Analysis
- Conclusions
- Future Research

Adapting Scientific Applications to Parallel Architectures (roadmap figure): profiling with Intel VTune and Intel Pin feeds system-level and application-level optimization, targeting parallel architectures: multicore CPU (Intel TBB, Intel Compiler & SSE2) and GPU (NVIDIA CUDA).

System Level: Thread Profile
- Fully utilized: 93%
- Underutilized: 4.8%
- Serial: 1.65%
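That serial fraction already bounds scaling: by Amdahl's law, with s = 0.0165 the speedup on p cores is at most 1 / (s + (1 - s)/p), roughly 61x even with unlimited cores.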

Hardware Performance Monitors
- Processor utilization drops
- Available memory stays constant
- Context switches/sec increase
- Privileged time increases

Adapting Scientific Applications to Parallel Architectures (roadmap figure): profiling with Intel VTune and Intel Pin feeds system-level and application-level optimization, targeting parallel architectures: multicore CPU (Intel TBB, Intel Compiler & SSE2) and GPU (NVIDIA CUDA).

Application-Level Analysis
- Hotspots
- CPI
- FP operations

Hotspots (% of execution time, two profiles)

Function           Profile 1   Profile 2
calc_funcs_ampa    59.51%      30.45%
runAmpaLoop        40.04%      40.99%
calc_glut_conc      0.45%       2.16%
operator[]          0%         25.92%
get_delta           0%          0.48%

FP-Impacting Metrics (CPI, FP assist, FP instructions ratio)
- CPI: 0.75 is good, 4 is poor; a high CPI indicates instructions require more cycles to execute than they should
- FP assist: 0.2 is low, 1 is high
- Compiler upgrade: ~9.4x speedup
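For reference, CPI is cycles per instruction: CPI = core clock cycles / instructions retired. A region that retires 1 billion instructions in 3 billion core cycles runs at CPI = 3, well into the poor range above.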

Post-Compiler-Upgrade Analysis
- Improved CPI and FP operations
- Hotspot analysis: the same three functions are still hot
- FP operations in the AMPA function optimized with SIMD
- Remaining overheads: the STL vector operator[], a get function on a class object, and redundant calculations in the hotspot region

Manual Tuning
- Reduced function overhead
- Used arrays instead of STL vectors
- Reduced redundant calculations
- Eliminated the get function
- Eliminated the STL vector operator[] (see the sketch below)
- ~2x speedup
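As an illustration of the operator[] elimination (a hypothetical sketch, not Kingen's source), hoisting a raw pointer out of the hot loop removes the per-iteration STL call:

    #include <cstddef>
    #include <vector>

    // Before (illustrative): operator[] on the vector inside the hot loop.
    double sum_states_before(const std::vector<double>& states) {
        double acc = 0.0;
        for (std::size_t i = 0; i < states.size(); ++i)
            acc += states[i];            // STL operator[] every iteration
        return acc;
    }

    // After (illustrative): raw pointer fetched once, plain indexing inside.
    double sum_states_after(const std::vector<double>& states) {
        const double* p = &states[0];    // hoisted out of the loop
        const std::size_t n = states.size();
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            acc += p[i];
        return acc;
    }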

Application Analysis Conclusions (% of execution time)

Function         %
runAmpaLoop      91.83%
calc_glut_conc    4.4%
ge                0.02%
libm_sse2_exp     0.02%
All others        3.73%

Observations (roadmap figure): profiling with Intel VTune and Intel Pin feeds system-level and application-level optimization across parallel architectures: multicore CPU (Intel TBB, Intel Compiler & SSE2) and GPU (NVIDIA CUDA).

Computer Architecture Analysis
- DTLB miss ratios
- L1 cache miss rate
- L1 data cache miss performance impact
- L2 cache miss rate
- L2 modified-lines eviction rate
- Instruction mix

Computer Architecture Analysis: Results
- FP instructions dominate
- Small instruction footprint fits in the L1 cache
- L2 is handling typical workloads
- Strong GPU potential

Modeling Ion Channel Kinetics with High-Performance Computation
- Introduction
- Application Characterization, Profile, and Optimization
- Computing Framework
- Experimental Results and Analysis
- Conclusions
- Future Research

Computing Framework
- Multicore: coarse-grain TBB implementation
- GPU acceleration in progress
- Distributed multicore in progress (192-core cluster)

TBB Implementation
- Template library that extends C++
- Includes algorithms for common parallel patterns and parallel interfaces
- Abstracts CPU resources

tbb::parallel_for
- Template function
- Loop iterations must be independent
- The iteration space is broken into chunks
- TBB runs each chunk on a separate thread

tbb::parallel_for

    parallel_for(blocked_range<size_t>(0, GeneticAlgo::NUM_CHROMOS),
                 ParallelChromosomeLoop(tauError, ec50PeakError, ec50SteadyError,
                                        desensError, DRecoverError, ar, thetaArray),
                 auto_partitioner());

    // Replaces the serial loop:
    for (int i = 0; i < GeneticAlgo::NUM_CHROMOS; i++) {
        // call ampa macro 11 times
        // calculate error on the chromosome (rate constant set)
    }

tbb::parallel_for: The Body Object
- Needs member fields for all local variables defined outside the original loop but used inside it
- The body object's constructor usually initializes the member fields
- The copy constructor is invoked to create a separate copy for each worker thread
- operator() must not modify the body, so it is declared const
- Recommendation: make local copies inside operator() (see the sketch below)
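A minimal, self-contained sketch of such a body object (hypothetical names, not Kingen's ParallelChromosomeLoop):

    #include <cstddef>
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"

    // Hypothetical body object squaring an array in parallel.
    class SquareBody {
        const double* in_;   // member fields for state defined outside the loop...
        double* out_;        // ...but used inside it
    public:
        SquareBody(const double* in, double* out) : in_(in), out_(out) {}
        // const: TBB copy-constructs one body per worker thread and must be
        // able to run them without the body mutating itself.
        void operator()(const tbb::blocked_range<std::size_t>& r) const {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                out_[i] = in_[i] * in_[i];   // iterations are independent
        }
    };

    void square_all(const double* in, double* out, std::size_t n) {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
                          SquareBody(in, out));
    }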

AMPA Macro
- calc_bg_ampa: defines the differential equations that describe AMPA kinetics, based on the rate-constant set from the GA
- runAmpaLoop: solves the system of equations with the Runge-Kutta method

Coarse-grained parallelism (figure): chromosomes are initialized, then each generation (Gen 0 through Gen N) computes the error of every chromosome (Chromo 0 through Chromo N, each invoking the AMPA macro) in parallel, with serial execution between generations; the population fit improves on average each generation until convergence at Gen N.
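In outline, that structure corresponds to a loop like the following sketch (the helpers are hypothetical stand-ins for Kingen's GA steps):

    // Hypothetical stand-ins for Kingen's GA steps (stubs for illustration).
    static void initialize_chromosomes() {}
    static void evaluate_all_chromosomes_in_parallel() {}  // e.g., tbb::parallel_for
    static void select_and_recombine() {}
    static bool converged() { return false; }

    // Generation loop: chromosome evaluation is the parallel region;
    // selection/recombination between generations stays serial.
    void run_ga(int num_generations) {
        initialize_chromosomes();
        for (int gen = 0; gen < num_generations; ++gen) {
            evaluate_all_chromosomes_in_parallel();  // independent per chromosome
            select_and_recombine();                  // serial: next, better-fit population
            if (converged()) break;
        }
    }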

Genetic Algorithm Convergence

Runge-Kutta 4th-Order Method (RK4)

runAmpaLoop: numerical integration of the differential equations describing our kinetic scheme.

RK4 formulas:

    x(t + h) = x(t) + (1/6)(F1 + 2*F2 + 2*F3 + F4), where
    F1 = h*f(t, x)
    F2 = h*f(t + h/2, x + F1/2)
    F3 = h*f(t + h/2, x + F2/2)
    F4 = h*f(t + h, x + F3)
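A direct transcription of these formulas for a system of n equations might look like the following sketch (generic, not Kingen's runAmpaLoop; here k1..k4 hold f evaluated at each stage and the factor h is applied in the update, so F_i = h*k_i):

    #include <cstddef>
    #include <vector>

    // Derivative function for x' = f(t, x): fills dx with derivatives at (t, x).
    typedef void (*Deriv)(double t, const std::vector<double>& x,
                          std::vector<double>& dx);

    // One RK4 step, advancing x from t to t + h in place.
    void rk4_step(Deriv f, double t, double h, std::vector<double>& x) {
        const std::size_t n = x.size();
        std::vector<double> k1(n), k2(n), k3(n), k4(n), tmp(n);
        f(t, x, k1);                                              // stage 1
        for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * h * k1[i];
        f(t + 0.5 * h, tmp, k2);                                  // stage 2
        for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * h * k2[i];
        f(t + 0.5 * h, tmp, k3);                                  // stage 3
        for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + h * k3[i];
        f(t + h, tmp, k4);                                        // stage 4
        for (std::size_t i = 0; i < n; ++i)                       // combine
            x[i] += (h / 6.0) * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    }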

RK4
- The hotspot is the function that computes RK4
- Need finer-grained parallelism to alleviate the hotspot bottleneck
- How to parallelize RK4? (see the sketch below)
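One answer, as a hedged sketch: the four stages are sequentially dependent (F2 needs F1, and so on), so parallelism has to come from inside a stage, e.g. evaluating the n derivative equations of one stage concurrently, which pays off only when n is large. Assuming a hypothetical per-equation right-hand side deriv_i:

    #include <cstddef>
    #include <vector>
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"

    // Hypothetical right-hand side of equation i (placeholder definition;
    // the real model supplies this from the kinetic scheme).
    static double deriv_i(std::size_t i, double t, const std::vector<double>& x) {
        (void)t;
        return -x[i];  // stand-in: simple decay
    }

    // The four RK4 stages must run in order, but within one stage the n
    // derivative evaluations are independent and can run concurrently.
    void stage_parallel(double t, const std::vector<double>& x,
                        std::vector<double>& k) {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, x.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    k[i] = deriv_i(i, t, x);
            });
    }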

Modeling Ion Channel Kinetics with High-Performance Computation
- Introduction
- Application Characterization, Profile, and Optimization
- Computing Framework
- Experimental Results and Analysis
- Conclusions
- Future Research

Experimental Results and Analysis
- Hardware and software setup
- Domain-specific metrics?
- Parallel speedup
- Verification

Configuration

              Machine 1                           Machine 2
CPU           Intel Xeon, 2.66 GHz                Intel Core 2 Quad, 2.40 GHz
Cores         8                                   4
Memory        3 GB                                8 GB
OS            Windows XP Pro                      Fedora
Compiler      Intel C++ Compiler (11.1, 10.1)     Intel C++ Compiler (11.1)
Intel TBB     Version 2.1                         Version 2.1

Computational Complexity

Parallel Speedup
- Baseline: 2 generations, after the compiler upgrade, prior to manual tuning
- The generation number magnifies any performance improvement

Verification
- MKL and a custom Gaussian elimination routine sometimes get different results
- A small variation in a given parameter changed the error significantly
- Non-deterministic
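The nondeterminism is consistent with floating-point arithmetic not being associative: different elimination orders (MKL's blocked routine versus a straightforward custom one) accumulate rounding differently, and an error function that is steep in some parameters amplifies those last-bit differences. A minimal demonstration:

    #include <cstdio>

    int main() {
        double big = 1.0, tiny = 1e-16;  // tiny is below half an ulp of big
        std::printf("%g\n", (big + tiny) - big);   // prints 0: tiny absorbed
        std::printf("%g\n", tiny + (big - big));   // prints 1e-16: order matters
        return 0;
    }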

Conclusions
- A process that uncovers key application characteristics is important
- Kingen needs cores/threads, lots of them
- Need the ability to (semi-)automatically identify opportunities for parallelism in code
- Need better validation methods

Future Research
- 192-core cluster
- GPU acceleration
- Programmer-led optimization
- Verification
- Model validation
- Techniques to simplify porting to massively parallel architectures