CUDA Library and Demo
Yafeng Yin, Lei Zhou, Hong Man
07/21/2010

Outline
Basic CUDA computation libraries: GPULib, CUBLAS, CUFFT
Advanced CUDA computation libraries: CULA/MAGMA, VSIPL
CUDA FIR Demo (UMD)
Discussion and future work

Basic lib - GPULib
GPULib provides a library of mathematical functions: addition, subtraction, multiplication, and division; unary functions, including sin(), cos(), gamma(), and exp(); and interpolation, array reshaping, array slicing, and reduction operations.

Basic lib - CUBLAS
BLAS: Basic Linear Algebra Subprograms.
CUBLAS provides a set of functions for basic vector and matrix operations, such as copy, dot product, Euclidean norm, and matrix-vector products.
Real data: Level 1 (vector-vector, O(N)), Level 2 (matrix-vector, O(N^2)), Level 3 (matrix-matrix, O(N^3)).
Complex data: Level 1.
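
As a point of reference (not code from the slides), a minimal sketch of a Level-1 call through the handle-based cuBLAS v2 API; the original 2010 demo would have used the older cublas.h interface, so treat the initialization style here as an assumption:

    // sdot_example.cu - cuBLAS Level-1 dot product (illustrative sketch)
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 1 << 20;
        std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

        float *dx, *dy;
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        float result = 0.0f;
        cublasSdot(handle, n, dx, 1, dy, 1, &result);   // Level-1: dot(x, y)
        printf("dot = %f\n", result);                   // expected 2 * 2^20 = 2097152

        cublasDestroy(handle);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }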

CUBLAS Level 2 functions
cublasSgbmv()   y = alpha * op(A) * x + beta * y   (banded matrix)
cublasSgemv()   y = alpha * op(A) * x + beta * y
cublasSger()    A = alpha * x * y^T + A
cublasSsbmv()   y = alpha * A * x + beta * y   (symmetric banded)
cublasSspmv()   y = alpha * A * x + beta * y   (symmetric packed)
cublasSspr()    A = alpha * x * x^T + A
cublasSspr2()   A = alpha * x * y^T + alpha * y * x^T + A
cublasSsymv(), cublasSsyr(), cublasSsyr2()   (symmetric variants)
cublasStbmv()   x = op(A) * x   (triangular banded)
cublasStbsv()   solves op(A) * x = b, output x
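
For illustration (again assuming the v2 API rather than the 2010-era interface), a hedged sketch of the Level-2 routine cublasSgemv(), which computes y = alpha * op(A) * x + beta * y on a column-major matrix:

    // sgemv_example.cu - cuBLAS Level-2 y = alpha*A*x + beta*y (illustrative sketch)
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int m = 4, n = 3;                          // A is m x n, column-major
        std::vector<float> hA(m * n, 1.0f), hx(n, 1.0f), hy(m, 0.0f);

        float *dA, *dx, *dy;
        cudaMalloc((void**)&dA, m * n * sizeof(float));
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, m * sizeof(float));
        cudaMemcpy(dA, hA.data(), m * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy.data(), m * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        // y = alpha * A * x + beta * y, no transpose, leading dimension m
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);

        cudaMemcpy(hy.data(), dy, m * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", hy[0]);                    // expected 3.0 for all-ones A and x

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dx); cudaFree(dy);
        return 0;
    }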

Basic lib - CUFFT
CUFFT is the CUDA FFT library.
Provides a simple interface for computing parallel FFTs on an NVIDIA GPU.
Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a GPU-based FFT implementation.
cufftPlan1d(), cufftPlan2d(), cufftPlan3d(): create a 1D, 2D, or 3D FFT plan configuration for a specified signal size.
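
A minimal 1D complex-to-complex transform using the plan interface named above; the signal length and batch count are arbitrary placeholder values, not numbers from the demo:

    // fft_example.cu - 1D C2C transform with CUFFT (illustrative sketch)
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int NX = 1024;        // signal length (placeholder)
        const int BATCH = 1;        // number of transforms in the batch

        cufftComplex *data;
        cudaMalloc((void**)&data, sizeof(cufftComplex) * NX * BATCH);
        cudaMemset(data, 0, sizeof(cufftComplex) * NX * BATCH);   // load real input here in practice

        cufftHandle plan;
        cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);        // 1D plan for a C2C transform
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);   // in-place forward FFT
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }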

Advanced lib - CULA and MAGMA
CULA: GPU Accelerated Linear Algebra. Provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs.
MAGMA: Matrix Algebra on GPU and Multicore Architectures. Develops a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures and "multicore + GPU" systems.

Advanced lib - CULA functions
Linear equation routines: solve a general system of linear equations AX = B.
Orthogonal factorizations: LQ and RQ factorizations.
Least squares routines.
Symmetric and non-symmetric eigenvalue routines.
Singular Value Decomposition (SVD) routines.
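
A hedged sketch of solving AX = B with CULA's host-memory interface; the routine name culaSgesv and its argument order are assumed to mirror LAPACK's sgesv, and the header name and integer types vary between CULA releases:

    // cula_solve.cpp - solve AX = B with CULA (sketch; interface assumed to mirror LAPACK sgesv)
    #include <cstdio>
    #include <cula.h>   // may be cula_lapack.h in later CULA releases

    int main() {
        const int n = 3, nrhs = 1;
        // Column-major 3x3 diagonal system and one right-hand side.
        float A[9] = { 2, 0, 0,   0, 3, 0,   0, 0, 4 };
        float B[3] = { 2, 6, 12 };
        int ipiv[3];                                          // culaInt in the real headers

        culaInitialize();                                     // bind to the GPU and set up CULA

        culaStatus s = culaSgesv(n, nrhs, A, n, ipiv, B, n);  // LU solve; B is overwritten with X
        if (s != culaNoError)
            printf("CULA error: %d\n", (int)s);
        else
            printf("x = %f %f %f\n", B[0], B[1], B[2]);       // expected 1 2 3

        culaShutdown();
        return 0;
    }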

Advanced lib - MAGMA
LAPACK on CUDA GPUs:
LU, QR, and Cholesky factorizations in both real and complex arithmetic (single and double)
Linear solvers based on LU, QR, and Cholesky in real arithmetic (single and double)
Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky in real arithmetic
Reduction to upper Hessenberg form in real arithmetic (single and double)
MAGMA BLAS in real arithmetic (single and double)
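
A hedged sketch of MAGMA's hybrid LU factorization through the CPU-memory interface routine magma_sgetrf; initialization calls and header names differ across MAGMA versions, so this reflects a recent release rather than the 2010-era library:

    // magma_lu.cpp - LU factorization via MAGMA's CPU interface (illustrative sketch)
    #include <cstdio>
    #include <vector>
    #include <magma_v2.h>    // older releases use magma.h

    int main() {
        magma_init();                                        // start the MAGMA runtime

        const magma_int_t n = 512;
        std::vector<float> A(n * n);                         // column-major test matrix
        for (magma_int_t j = 0; j < n; ++j)
            for (magma_int_t i = 0; i < n; ++i)
                A[i + j * n] = (i == j) ? (float)n : 1.0f;   // diagonally dominant

        std::vector<magma_int_t> ipiv(n);
        magma_int_t info = 0;

        // Hybrid CPU+GPU LU factorization; A is overwritten with its L and U factors.
        magma_sgetrf(n, n, A.data(), n, ipiv.data(), &info);
        printf("magma_sgetrf info = %lld\n", (long long)info);   // 0 on success

        magma_finalize();
        return 0;
    }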

Advanced lib - VSIPL
VSIPL: Vector Signal Image Processing Library
Generalized matrix product
Fast FIR filtering
Correlation
Fast Fourier Transform
QR decomposition
Random number generation
Elementwise arithmetic, logical, and comparison operators; linear algebra procedures

CUDA Library Summary
Basic vector and matrix computation: GPULib, CUBLAS, CUFFT
  vector/matrix addition, subtraction, multiplication, and division; sin(), cos(); dot product
Libraries that can be used for signal processing: CULA/MAGMA, VSIPL
  LU, QR, and Cholesky factorizations; singular value decomposition (SVD)

CUDA Demo (FIR)
GPU: NVIDIA GeForce 8600 GT
CPU: Intel Duo CPU, 2.33 GHz
Software: Visual Studio 2005

CUDA Demo (FIR) Output
All times in msec.
Output No.      GPU Run       Memory       Total (CPU+GPU)   CPU Only
1000            0.312121      0.166641     0.482184          0.391251
10000           0.667264      0.284254     0.955568          4.646471
100000          4.210870      1.489784     5.704915          43.831200
1000000         39.460812     5.597150     45.062572         421.615448
10000000        391.816345    48.080204    439.901794        4310.153320
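
The slides do not include the demo source, so purely as a reference point, here is a minimal naive CUDA FIR kernel (one thread per output sample; the tap count, filter coefficients, and launch configuration are placeholder assumptions, not the UMD implementation):

    // fir_sketch.cu - naive FIR filter, NOT the UMD demo code (illustrative sketch)
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void fir(const float* x, const float* h, float* y, int nOut, int nTaps) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output sample
        if (i >= nOut) return;
        float acc = 0.0f;
        for (int k = 0; k < nTaps; ++k)                  // y[i] = sum_k h[k] * x[i + k]
            acc += h[k] * x[i + k];
        y[i] = acc;
    }

    int main() {
        const int nOut = 1000, nTaps = 32;               // placeholder sizes
        std::vector<float> hx(nOut + nTaps - 1, 1.0f), hh(nTaps, 1.0f / nTaps), hy(nOut);

        float *dx, *dh, *dy;
        cudaMalloc((void**)&dx, hx.size() * sizeof(float));
        cudaMalloc((void**)&dh, hh.size() * sizeof(float));
        cudaMalloc((void**)&dy, hy.size() * sizeof(float));
        cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dh, hh.data(), hh.size() * sizeof(float), cudaMemcpyHostToDevice);

        fir<<<(nOut + 255) / 256, 256>>>(dx, dh, dy, nOut, nTaps);
        cudaMemcpy(hy.data(), dy, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);

        printf("y[0] = %f\n", hy[0]);                    // expected 1.0 for all-ones input
        cudaFree(dx); cudaFree(dh); cudaFree(dy);
        return 0;
    }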

CUDA Demo (FIR)

Discussion and future work
How to connect CUDA to the SSP re-hosting demo.
How to convert the sequentially executed code in the signal processing system into CUDA code.
How to translate the XML code into CUDA code to generate the CUDA input.

References
CUDA Zone: http://www.nvidia.com/object/cuda_home_new.html
Wikipedia, CUDA: http://en.wikipedia.org/wiki/CUDA