
Acceleration of the Smith–Waterman algorithm using single and multiple graphics processors
Authors: Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot
Publisher: Journal of Computational Physics, 2010
Speaker: De Yu Chen
Date: 2010/11/24

Outline
1. Introduction to CUDA
2. Basic algorithm
3. Parallel scan Smith-Waterman algorithm
4. Multiple GPUs
5. Results

Introduction to CUDA

CUDA (Compute Unified Device Architecture) allows the programmer to program NVIDIA graphics cards with an extension of the C programming language (called CUDA C). In CUDA, parallel programs are launched as a kernel of code. The kernel code is run by several thousand threads. Each thread runs all of the code in the kernel, but on different data. The kernel can be very small, or it can contain an entire program.
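
A minimal CUDA C example of this model (illustrative, not from the paper): every thread executes the same kernel body on a different array element.

    #include <cuda_runtime.h>

    // Kernel: runs once per thread; each thread scales one element.
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            x[i] = a * x[i];  // same code, different data for each thread
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        // Launch enough 256-thread blocks to cover all n elements.
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }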

Introduction to CUDA (cont.)

[Figure: CUDA architecture]

In CUDA, the CPU is also called the host and the GPU the device. When programming, you have to choose:
1. the number of grids;
2. the number of blocks per grid;
3. the number of threads per block of the same grid.
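
As a sketch, the launch configuration is written in CUDA C as follows (myKernel and args are hypothetical; each kernel launch uses one grid):

    dim3 blocksPerGrid(64, 64);    // 4096 blocks in a 2-D grid
    dim3 threadsPerBlock(16, 16);  // 256 threads per block
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(args);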

Introduction to CUDA (cont.)

[Figure: CUDA architecture with the memory hierarchy — device memory, constant cache, texture cache]

On the GPU, a hierarchy of memory architectures is available for the programmer to utilize:
1. Registers: read-write, per thread
2. Local memory: read-write, per thread
3. Shared memory: read-write, per block
4. Global memory: read-write, per grid
5. Constant memory: read-only, per grid
6. Texture memory: read-only, per grid
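
A short sketch (illustrative kernel, not from the paper) of where these memory spaces appear in CUDA C:

    __constant__ float coeffs[16];           // constant memory: read-only, per grid

    __global__ void smooth(const float *in,  // in/out live in global memory: per grid
                           float *out, int n)
    {
        __shared__ float tile[256];          // shared memory: read-write, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // i is held in a register, per thread
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // stage data in shared memory
        __syncthreads();                     // whole block has written its tile
        if (i < n)
            out[i] = tile[threadIdx.x] * coeffs[threadIdx.x % 16];
    }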


Bank conflicts problem

Shared memory has 16 banks, organized such that successive 32-bit words are assigned to successive banks, i.e. interleaved (bank = word index mod 16). Each bank has a bandwidth of 32 bits per two clock cycles. Assume the program has one block and 16 threads per block:

    __global__ void kernel()
    {
        __shared__ int shared[16];
        int id = threadIdx.x;
        int data = shared[id];  // each thread reads its own 32-bit word,
                                // so each thread hits a different bank: conflict-free
    }

Bank conflicts problem (cont.)

8-bit and 16-bit accesses typically generate bank conflicts. For example, there are bank conflicts if an array of char is accessed the following way:

    __global__ void kernel()
    {
        __shared__ char shared[32];
        int id = threadIdx.x;
        char data = shared[id];  // threads 0..3 read bytes of the same 32-bit word
    }

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

    char data = shared[4 * id];  // one 32-bit word per thread: a different bank
                                 // for each thread (requires the array to hold
                                 // 4 * blockDim.x bytes)

Basic algorithm

The Smith-Waterman algorithm uses an affine gap penalty, where Gs is the gap start penalty and Ge is the gap extension penalty.
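
In one standard form (a reconstruction consistent with the notation above; W_{i,j} is the substitution score for cell (i,j), and E and F track gaps along a row and a column respectively):

    E_{i,j} = max(E_{i,j-1}, H_{i,j-1} - G_s) - G_e
    F_{i,j} = max(F_{i-1,j}, H_{i-1,j} - G_s) - G_e
    H_{i,j} = max(0, E_{i,j}, F_{i,j}, H_{i-1,j-1} + W_{i,j})

so that opening a gap costs G_s + G_e and each further extension costs G_e.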

Parallel scan SW algorithm

Basic idea: in the recurrence above, the F term and the diagonal term of cell (i,j) depend only on the previous row, but the E term depends on the cell immediately to the left in the same row, which serializes each row. Let H̃ denote the E-independent part of H:

    H̃_{i,j} = max(0, F_{i,j}, H_{i-1,j-1} + W_{i,j})

Parallel scan SW algorithm (cont.)

The new form of the recurrence is obtained by unrolling the E equation and substituting it into the equation for H, which expresses E as a maximum over all cells to the left of (i,j):

    E_{i,j} = max over 0 <= k < j of ( H̃_{i,k} - G_s - (j-k) G_e )

In this form the E values of an entire row can be computed with a parallel maximum-scan.

Parallel scan SW algorithm (cont.)

Every cell in the same row then passes through three steps:
Step 1: compute the E-independent part H̃_{i,j} (and F_{i,j}); these depend only on row i-1, so all columns are independent.
Step 2: compute E_{i,j} for every column with a parallel max-scan along the row.
Step 3: combine the two parts: H_{i,j} = max(H̃_{i,j}, E_{i,j}).

Parallel scan SW algorithm (cont.)

My implementation:

    for (i = 1 to query sequence length) {
        Kernel<<<1, database sequence length>>>(arg1, arg2, ...);
    }

The kernel executes Steps 1-3, with a __syncthreads() instruction inserted between Step 1 and Step 2 to make sure that every thread completes Step 1 before any thread enters Step 2. (Note that __syncthreads() only synchronizes the threads of a single block, which is why the kernel is launched with one block.) A sketch of such a row kernel follows.
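
A minimal single-block sketch of one such row kernel, under stated assumptions rather than the paper's actual code: all identifiers are invented, n (the database length) must not exceed the block size, the row-0 boundary arrays are assumed pre-initialized (Hprev all zero, Fprev all NEG_INF), and the gap convention matches the recurrence given earlier.

    #define NEG_INF (-(1 << 29))  // "minus infinity" that cannot underflow below

    // One row of the scan-based Smith-Waterman update; launch as
    //   sw_row_kernel<<<1, n, n * sizeof(int)>>>(...)
    __global__ void sw_row_kernel(const int *Hprev,  // H of row i-1, columns 0..n-1
                                  const int *Fprev,  // F of row i-1
                                  const int *W,      // substitution scores for row i
                                  int *H, int *E, int *F,
                                  int n, int Gs, int Ge)
    {
        extern __shared__ int s[];  // scan workspace, n ints
        int j = threadIdx.x;        // one thread per column

        int Htmp = 0, Fcur = NEG_INF;
        if (j < n) {
            // Step 1: the E-independent part H~ of H.
            Fcur = max(Fprev[j], Hprev[j] - Gs) - Ge;      // gap down a column
            int diag = (j > 0 ? Hprev[j - 1] : 0) + W[j];  // match/mismatch term
            Htmp = max(0, max(diag, Fcur));
            s[j] = Htmp + j * Ge;  // seed so a plain max-scan yields E below
        }
        __syncthreads();  // all of Step 1 must finish before Step 2

        // Step 2: inclusive max-scan over the seeds (Hillis-Steele).
        for (int d = 1; d < n; d <<= 1) {
            int v = (j >= d && j < n) ? s[j - d] : NEG_INF;
            __syncthreads();                  // all reads before any writes
            if (j < n && v > s[j]) s[j] = v;
            __syncthreads();
        }

        if (j < n) {
            // Step 3: E_{i,j} = max over k<j of (H~_{i,k} - Gs - (j-k)Ge), then combine.
            int Ecur = (j > 0 ? s[j - 1] : NEG_INF) - Gs - j * Ge;
            E[j] = Ecur;
            F[j] = Fcur;
            H[j] = max(Htmp, Ecur);  // Htmp >= 0, so H stays clamped at zero
        }
    }

Each launch advances one row i; only the previous row's H and F are read, so the per-row buffers can simply be ping-ponged between launches.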

Multiple GPUs

It is possible to trivially use the previous algorithm on up to four GPUs installed on a single motherboard using Nvidia's SLI technology, which provides an additional physical connection between the GPUs. However, for large collections of GPUs operating on different motherboards and communicating via the network, a more complex parallelization and implementation strategy is necessary.

Multiple GPUs (cont.)

To use N GPUs, the Smith–Waterman table is decomposed into N large, slightly overlapping blocks, one per GPU. The idea is sketched below.
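
A hedged sketch of dispatching the blocks to N GPUs (all names are illustrative; launch_sw_on_block stands in for the single-GPU algorithm above and is assumed to run asynchronously):

    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical wrapper around the single-GPU scan-based algorithm.
    void launch_sw_on_block(const char *db, size_t len,
                            const char *query, size_t qLen);

    void multi_gpu_sw(const char *db, size_t dbLen,
                      const char *query, size_t qLen,
                      int nGpus, size_t overlap)
    {
        size_t chunk = dbLen / nGpus;
        for (int g = 0; g < nGpus; ++g) {
            cudaSetDevice(g);  // subsequent CUDA calls target GPU g
            size_t begin = g * chunk;
            // Extend each block by `overlap` columns so alignments that
            // cross a block boundary are still found by some GPU.
            size_t end = (g == nGpus - 1) ? dbLen : begin + chunk + overlap;
            launch_sw_on_block(db + begin, end - begin, query, qLen);
        }
        for (int g = 0; g < nGpus; ++g) {  // wait for every GPU to finish
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
    }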

Results

Hardware environment:
1. The Nvidia 9800 GT has 112 cores and a memory bandwidth of 57.6 GB/s.
2. The Nvidia 9800 GTX has 128 cores and a memory bandwidth of 70.4 GB/s.
3. The Nvidia 295 GTX comes as two GPU cards that share a PCIe slot. When we refer to a single 295 GTX we mean one of these two cards, which has 240 cores and a memory bandwidth of 111.9 GB/s. For the parallel timings, a PC containing two Nvidia 295 GTX cards (4 GPUs in total) was used.
4. The CPU is an AMD quad-core Phenom II X4 operating at 3.2 GHz, with 512 KB of L2 cache per core and 6 MB of L3 cache.

Results (cont.)

Software environment: all code was written in C++ with Nvidia's CUDA language extensions for the GPU, and compiled using Microsoft Visual Studio 2005 (VS 8) under Windows XP Professional x64. The bulk of the Nvidia SDK examples use this configuration.

The SSCA #1 benchmark was developed as part of the High Productivity Computing Systems (HPCS) benchmark suite, which was designed to evaluate the true performance capabilities of supercomputers. It consists of 5 kernels that are various permutations of the Smith–Waterman algorithm.

Results (cont.)

In the first kernel the Smith–Waterman table is computed, and the largest table values (200 of them) and their locations in the table are saved. These are the end points of well-aligned sequences.
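
One way to realize this step (a sketch using Thrust, not the paper's code; it assumes candidate scores and their flattened table positions have already been collected into device vectors) is to sort by score and keep the first 200 entries:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/functional.h>

    // Keep the 200 best (score, position) pairs, best first.
    void keep_top_200(thrust::device_vector<int> &scores,
                      thrust::device_vector<int> &positions)
    {
        thrust::sort_by_key(scores.begin(), scores.end(),
                            positions.begin(), thrust::greater<int>());
        scores.resize(200);    // the largest scores are now at the front
        positions.resize(200);
    }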

Results (cont.)

Timings for kernel 3 of the SSCA #1 benchmark are shown in Fig. 7(b). Kernel 3 is similar to kernel 1 in that it is another Smith–Waterman calculation. However, in this case the sequences are required to be computed at the same time (in a single pass) as the table evaluation.

Results (cont.)

The parallel performance (for kernels 1 and 3) of using up to four GPUs is shown below:

[Figure: parallel performance of kernels 1 and 3 on up to four GPUs]
