Mathew Alvino, Travis McBee, Heather Nelson, Todd Sullivan GPGPU: GPU Processing of Protein Structure Comparisons

The Protein Folding Problem
- Proteins are the essential building blocks of life
- They fold into complicated 3D structures
  - Structure often determines function
- The goal of researchers is to determine the 3D structure from the amino acid sequence
  - Prediction and retrieval algorithms are very time consuming

Index-based Protein Substructure Alignments (IPSA)
- Builds a large index of database proteins
  - Maps the query into the index using several data structures
- Pharmaceuticals are affected by protein interactions
  - Substructure alignments are useful to researchers
- Time consuming, but still faster than competitors
  - Around 20 minutes per query
  - Over 80,000 protein chains in the Protein Data Bank (PDB)
  - Growing dataset
- Provides a real-time search engine

Response Time of IPSA versus Competitors

    Speedup
    DALI    37.66
    CE       2.78

Accuracy of IPSA versus Competitors

Market Analysis
- Bioinformatics research
- Pharmaceutical industry
  - Nearly $1 billion per year (Tufts Center for the Study of Drug Development)
- General-Purpose computing on the GPU (GPGPU) is a fast-growing field
  - 1 of 5 disruptive technologies for 2007 (InformationWeek)

Goals and Objectives
- Gain experience with GPGPU
- Evaluate the feasibility of a GPU-based IPSA algorithm
- The team's ultimate goal is to port portions of IPSA to run on the GPU
  - Faster (better average response time)
  - More scalable as the dataset size increases

Costs
- Computer 1 hardware
  - NVIDIA 8800 GTX: $575
  - Other machine costs: $1,200
- Computer 2 hardware
  - ATI x800 XT PE: $250
  - Other machine costs: $950
- Time
  - Average of eight hours a week per team member
  - 32 hours a week total
  - Ten weeks total
  - 320 hours total
  - 320 hours × $50 per hour = $16,000

Operating Environment Requirements
- Computer 1
  - NVIDIA 8800 GTX video card (128 processing cores, 768 MB of memory)
  - Intel Pentium CPU
  - 4 GB of RAM
  - Linux operating system
- Computer 2
  - ATI x800 XT PE video card (256 MB of memory)
  - AMD CPU
  - 3 GB of RAM
  - Windows XP / Cygwin

Environmental Constraints
- Stand-alone system
  - The user sends data and the system handles the rest
- Quality
  - Needs to produce responses faster than they can be produced on the CPU
- Reliability
  - The system needs to be able to handle multiple requests at once
- Coding
  - IPSA is written in Java
  - GPGPU code needs to be written in C

Project Schedule

GPGPU Technologies
- Base technologies: OpenGL Shading Language, DirectX, Cg
- Commercial products: RapidMind, Peakstream
- Other languages/extensions: Sh, Shallows, Accelerator, Brook, CUDA

BrookGPU Performance
Buck, I., et al., "Brook for GPUs: Stream Computing on Graphics Hardware," ACM SIGGRAPH 2004 Papers, Aug. 2004.

GPU Pipeline

Mapping CPU Algorithms to the GPU
- Arrays = Textures
- Memory read = Texture sample
- Loop = Fragment program (kernel)
- Array write = Render to texture
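
To make the correspondence concrete, here is a hypothetical CPU loop (not taken from IPSA; the names are made up) with each piece annotated with its GPU-side equivalent under this mapping:

```c
#include <stddef.h>

/* Hypothetical CPU kernel: out[i] = a[i] * b[i].
 * Comments mark the GPU-side equivalent of each piece under the
 * classic GPGPU mapping listed above. */
void multiply_arrays(const float *a,   /* array -> input texture A        */
                     const float *b,   /* array -> input texture B        */
                     float *out,       /* array -> render-target texture  */
                     size_t n)
{
    for (size_t i = 0; i < n; i++) {   /* loop -> one fragment per element */
        float x = a[i];                /* memory read -> texture sample    */
        float y = b[i];                /* memory read -> texture sample    */
        out[i] = x * y;                /* array write -> render to texture */
    }
}
```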

Basic GPGPU Operations
- Map
  - Applying a function to a given set of data elements
- Reduce
  - Reducing the size of a data stream, usually until only one element remains
- Example: given an array A of data in the range [0.0, 1.0)
  - Map the data to the range [0, 255] with the function f(x) = floor(x * 256)
  - Reduce the array to one element with the summation Σ f(x_i) over all x_i in A
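
A minimal CPU reference for the example above (a plain C sketch, not the team's GPU code; the array contents are made up):

```c
#include <math.h>
#include <stdio.h>

#define N 8

int main(void)
{
    /* Example data in the range [0.0, 1.0) */
    float a[N] = {0.05f, 0.12f, 0.33f, 0.49f, 0.50f, 0.75f, 0.90f, 0.99f};

    /* Map: f(x) = floor(x * 256), taking [0.0, 1.0) to [0, 255] */
    int mapped[N];
    for (int i = 0; i < N; i++)
        mapped[i] = (int)floorf(a[i] * 256.0f);

    /* Reduce: sum the mapped values down to a single element */
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += mapped[i];

    printf("sum = %d\n", sum);
    return 0;
}
```

On the GPU, the map step becomes a fragment program applied to every pixel of the input texture, while the reduce is typically done in several passes, each pass shrinking the texture until one element remains.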

Issues and Limitations
- Generally impossible to directly translate CPU algorithms to the GPU
  - 2D textures (arrays) are most efficient; translate 1D/3D data into 2D (see the index sketch below)
  - Branching is very costly
  - No random access memory – avoid lookups
  - Often must divide code into multiple shaders for even the simplest computations
- Data transfer from the CPU to the GPU is very costly
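
The "1D into 2D" translation, for instance, is mostly index arithmetic. A minimal sketch, assuming a fixed texture width; the helper names and the width are illustrative, not from IPSA:

```c
/* Map a 1D array index to a 2D texture coordinate and back,
 * assuming the data is packed row by row into a TEX_W-wide texture. */
#define TEX_W 1024

typedef struct { int x; int y; } texcoord;

texcoord index_to_texcoord(int i)
{
    texcoord t;
    t.x = i % TEX_W;   /* column within the row     */
    t.y = i / TEX_W;   /* which row of the texture  */
    return t;
}

int texcoord_to_index(texcoord t)
{
    return t.y * TEX_W + t.x;
}
```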

Issues and Limitations (cont.)
- Highly computational, parallelizable code has the most potential
  - Even parallel algorithms can be inefficient if they overuse branching and memory lookups
- Limit the number of passes over textures
  - Do as much as possible at one time
- GPUs use single-precision floating-point numbers
  - IPSA uses double-precision floating-point numbers

Implementing GPGPU using Cg
- Initialize data and libraries
- Create a frame buffer object for off-screen rendering
- Create textures
  - Generate, set up, and transfer data from the CPU
- Initialize Cg
  - Create the fragment profile, bind the fragment program to the shader, and load the program
- Perform the computation
  - Enable the profile, bind the program, draw the buffer, and enable textures as necessary
- Transfer the result texture back from the GPU (these steps are sketched in code below)
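
A condensed sketch of those steps in C, using the OpenGL framebuffer-object extension and the Cg runtime. This is not the team's implementation: the texture size, the kernel file name kernel.cg, its entry point, and the parameter name inputTex are assumptions, and error checking and most state setup are reduced to the bare minimum.

```c
#include <GL/glew.h>
#include <GL/glut.h>
#include <Cg/cg.h>
#include <Cg/cgGL.h>

#define W 256
#define H 256

static float input[W * H * 4];    /* RGBA data uploaded to the GPU     */
static float result[W * H * 4];   /* RGBA data read back from the GPU  */

int main(int argc, char **argv)
{
    /* Initialize data and libraries (an OpenGL context is required) */
    glutInit(&argc, argv);
    glutCreateWindow("gpgpu");
    glewInit();

    /* Create a frame buffer object for off-screen rendering */
    GLuint fbo;
    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);

    /* Create textures: one holds the input data, one is the render target */
    GLuint tex[2];
    glGenTextures(2, tex);
    for (int i = 0; i < 2; i++) {
        glBindTexture(GL_TEXTURE_RECTANGLE_ARB, tex[i]);
        glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB, W, H, 0,
                     GL_RGBA, GL_FLOAT, i == 0 ? input : NULL);
    }
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_RECTANGLE_ARB, tex[1], 0);

    /* Initialize Cg: create the fragment profile, load the fragment program */
    CGcontext ctx = cgCreateContext();
    CGprofile profile = cgGLGetLatestProfile(CG_GL_FRAGMENT);
    CGprogram prog = cgCreateProgramFromFile(ctx, CG_SOURCE, "kernel.cg",
                                             profile, "main", NULL);
    cgGLLoadProgram(prog);

    /* Perform the computation: enable the profile, bind the program and the
       input texture, then draw a quad so the kernel runs once per output pixel */
    cgGLEnableProfile(profile);
    cgGLBindProgram(prog);
    CGparameter texParam = cgGetNamedParameter(prog, "inputTex");
    cgGLSetTextureParameter(texParam, tex[0]);
    cgGLEnableTextureParameter(texParam);

    glViewport(0, 0, W, H);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0.0, W, 0.0, H, -1.0, 1.0);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(0, 0);
    glTexCoord2f(W, 0); glVertex2f(W, 0);
    glTexCoord2f(W, H); glVertex2f(W, H);
    glTexCoord2f(0, H); glVertex2f(0, H);
    glEnd();

    /* Transfer the result texture back from the GPU */
    glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
    glReadPixels(0, 0, W, H, GL_RGBA, GL_FLOAT, result);

    cgDestroyProgram(prog);
    cgDestroyContext(ctx);
    return 0;
}
```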

Cg Implementations
- Mathematical computation over each element
  - y = alpha*y + (alpha+y)/(alpha*y)*alpha
  - 175 times faster than the CPU on a 4096x4096 dataset
- Average random walk distance
  - Useful in protein folding problems
  - 40 times faster on the GPU for a 4096x4096 dataset
  - Multiple shaders necessary
  - The summation of each element diminishes performance
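
For reference, the first benchmark's per-element update on the CPU is just the loop below (a sketch of the formula from the slide; the function name and signature are made up):

```c
#include <stddef.h>

/* Apply y = alpha*y + (alpha + y) / (alpha * y) * alpha to every element.
 * On the GPU this loop body becomes a fragment program run over a
 * 4096x4096 texture that holds y. */
void update(float *y, size_t n, float alpha)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * y[i] + (alpha + y[i]) / (alpha * y[i]) * alpha;
}
```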

IPSA Profile Layer 1

IPSA Profile Layer 2

IPSA: High Level Process Diagram

GPU Modified IPSA: High Level Process Diagram (Compute D1)

Array Translation
- Each pixel of a texture contains four floats – red, green, blue, and alpha
- Need to convert all arrays of floats into arrays of float4's
- IPSA calculates most values in groups of three – leave the alpha float empty (unused)
- Figure: an array of 3x3 matrices in CPU memory becomes a 1D array of float4's; each square (float4) contains the three values of the same color from the array of 3x3 matrices
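
A sketch of that packing in plain C, with one float4 per matrix row and the alpha slot left unused, consistent with the tR translation shown on a later slide; the struct and function names are illustrative, not from IPSA:

```c
#include <stddef.h>

typedef struct { float r, g, b, a; } float4;   /* one texture pixel */

/* Pack an array of 3x3 matrices (row-major, 9 floats each) into an
 * array of float4's: one float4 per matrix row, alpha left unused.
 * 'out' must hold 3 * count elements. */
void pack_matrices(const float *matrices, size_t count, float4 *out)
{
    for (size_t m = 0; m < count; m++) {
        for (size_t row = 0; row < 3; row++) {
            const float *src = matrices + m * 9 + row * 3;
            float4 *dst = &out[m * 3 + row];
            dst->r = src[0];
            dst->g = src[1];
            dst->b = src[2];
            dst->a = 0.0f;   /* unused alpha slot */
        }
    }
}
```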

Chain Translation
- Need to compute many chain comparisons at once
  - Solution: pack the chains into one giant texture
- Chains are either 30 or 45 floats long
  - Each chain fits into a 4x4 block of float4's
  - Each pixel in a block carries three floats in its RGB values; alpha values are set to zero
  - White blocks (padding in the figure) contain zeroes in RGBA
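
A sketch of how a single chain could be packed into its 4x4 block of float4's under this scheme (a hypothetical helper, not the team's code): a 4x4 block provides 16 pixels x 3 RGB slots = 48 floats, so a 30- or 45-float chain fits with room for zero padding.

```c
#include <stddef.h>

typedef struct { float r, g, b, a; } float4;

/* Pack one chain (30 or 45 floats) into a 4x4 block of float4's.
 * Three floats go into each pixel's RGB; alpha is always zero and
 * any leftover slots are zero-padded, as in the slide's figure. */
void pack_chain(const float *chain, size_t len, float4 block[16])
{
    for (size_t p = 0; p < 16; p++) {
        block[p].r = (3 * p + 0 < len) ? chain[3 * p + 0] : 0.0f;
        block[p].g = (3 * p + 1 < len) ? chain[3 * p + 1] : 0.0f;
        block[p].b = (3 * p + 2 < len) ? chain[3 * p + 2] : 0.0f;
        block[p].a = 0.0f;
    }
}
```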

Code Translation: Floats to Float4's

Calculation of tR in Compute D1:

    for (i = 0; i < AA_size; i++) {
        a1 = i * 3;
        a2 = a1 + 1;
        a3 = a2 + 1;
        tR[0][0] += chain1[a1] * chain2[a1];
        tR[0][1] += chain1[a1] * chain2[a2];
        tR[0][2] += chain1[a1] * chain2[a3];
        tR[1][0] += chain1[a2] * chain2[a1];
        tR[1][1] += chain1[a2] * chain2[a2];
        tR[1][2] += chain1[a2] * chain2[a3];
        tR[2][0] += chain1[a3] * chain2[a1];
        tR[2][1] += chain1[a3] * chain2[a2];
        tR[2][2] += chain1[a3] * chain2[a3];
    }

Translation to an array of float4's:

    for (i = 0; i < AA_size; i++) {
        tR[0].r += chain1[i].r * chain2[i].r;
        tR[0].g += chain1[i].r * chain2[i].g;
        tR[0].b += chain1[i].r * chain2[i].b;
        tR[1].r += chain1[i].g * chain2[i].r;
        tR[1].g += chain1[i].g * chain2[i].g;
        tR[1].b += chain1[i].g * chain2[i].b;
        tR[2].r += chain1[i].b * chain2[i].r;
        tR[2].g += chain1[i].b * chain2[i].g;
        tR[2].b += chain1[i].b * chain2[i].b;
    }

Code Translation: For Loop to Fragment Program

For loop in float4 format:

    for (i = 0; i < AA_size; i++) {
        chain1[i].r = chain8[i].r - mean.r;
        chain1[i].g = chain8[i].g - mean.g;
        chain1[i].b = chain8[i].b - mean.b;
    }

Pseudocode fragment program:

    kernel subtract(float4 c8pixel, float4 mean, out float4 c1pixel)
    {
        c1pixel.r = c8pixel.r - mean.r;
        c1pixel.g = c8pixel.g - mean.g;
        c1pixel.b = c8pixel.b - mean.b;
        c1pixel.a = 0.0;
    }

Operation: chain1 = chain8 - mean;

Results
- The GPU version of Compute D1 calculates 102,400 chain comparisons simultaneously
- GPU-based Compute D1 is times faster than Java-based Compute D1
- GPU IPSA is times faster than IPSA, giving an average response time of seconds
- Cut 84 seconds off the total processing time

Improvements/Future Work
- GPU performance gain is limited by:
  - The small percentage of total processing time spent in the functions with GPU potential: Compute D2 (2%) and Matrix Multiply (0.3%)
  - Using only three of the four floats in each pixel
  - Unused float4's from the texture packing strategy
- Additional work:
  - Compute D1 calculates eigenvalues and eigenvectors – an extremely complicated task that was removed from the prototype and performance testing
  - Modify IPSA to use GPU-calculated values; the GPU's single-precision floats may affect IPSA accuracy

Questions?