GPU VSIPL: Core and Beyond

Andrew Kerr (1), Dan Campbell (2), and Mark Richards (1)
(1) Georgia Institute of Technology
(2) Georgia Tech Research Institute

Goal

An application development environment for embedded high performance computing that achieves:
- Portability: the same code is usable with different processors, processor generations, and vendors
- Productivity: a disciplined programming model that leverages highly optimized libraries for signal processing and linear algebra
- Performance: employs highly advanced processors, ~1 TFLOPS

Approach

- Adopt the VSIPL API for open-standard portability and productivity
- Develop a state-of-the-art GPU-VSIPL library to leverage CUDA-enabled GPU performance

GPU-VSIPL Functional Coverage

What's covered from VSIPL Core
- Data types: real, complex, integer, boolean, index
- View types: matrix, vector
- Element-wise operators: arithmetic, trigonometric, transcendental, scatter/gather, logical, and comparison
- Signal processing:
  - FFT (in-place, out-of-place, batched)
  - Fast FIR filter, window creation, 1D correlation
  - Random number generation, histogram
- Linear algebra:
  - Generalized matrix product
  - QR decomposition, least-squares solver

What's not (yet)
- Linear algebra: LU, Toeplitz, least-squares solvers

What's added beyond VSIPL Core
- Scalar and matrix versions of element-wise vector operators
- Matrix utility functions

[Diagram labels: VSIPL API, Core Profile, GPU VSIPL, Core Lite Profile]

Performance Examples: Signal Processing

[Plots: 1D FFT, In-Place; 1D Correlation]
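For readers unfamiliar with the 1D correlation operation benchmarked on this slide, the following is a minimal reference sketch in plain Python. It is not GPU VSIPL code and does not use the VSIPL API. The function name `correlate_full` is hypothetical, and the sketch assumes real-valued inputs, full support (every lag with any overlap), and no normalization; the exact support and scaling options used in the benchmark are not stated in the slides.

```python
def correlate_full(ref, x):
    """Direct-form cross-correlation of a reference sequence with a signal.

    Assumptions (not taken from the slides): real inputs, full support,
    no bias correction or normalization.
    Output index k corresponds to lag k - (len(ref) - 1).
    """
    m, n = len(ref), len(x)
    out = [0.0] * (n + m - 1)
    for k in range(n + m - 1):
        lag = k - (m - 1)
        acc = 0.0
        for i in range(m):
            j = i + lag
            # Only accumulate where the shifted reference overlaps the signal.
            if 0 <= j < n:
                acc += ref[i] * x[j]
        out[k] = acc
    return out


print(correlate_full([1.0, 2.0], [1.0, 2.0, 3.0]))  # [2.0, 5.0, 8.0, 3.0]
```

The direct form costs on the order of M*N multiply-adds for reference length M and signal length N. Optimized libraries commonly compute long correlations with FFTs instead, though the slides do not say which method GPU VSIPL uses.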

Performance Examples: Linear Algebra

[Plots: Matrix-Vector Product; QR Decomposition]
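As an illustration of how a QR decomposition supports a least-squares solver, here is a small sketch in plain Python. It uses modified Gram-Schmidt, chosen here only for brevity; the slides do not say which QR algorithm GPU VSIPL implements. The names `qr_mgs` and `lstsq_qr` are hypothetical, and the input matrix is assumed to be real, tall or square, and of full column rank.

```python
import math


def qr_mgs(a):
    """Thin QR factorization by modified Gram-Schmidt (illustrative only).

    a is a list of m rows with n columns, m >= n, full column rank assumed.
    Returns (q, r) with q of size m x n (orthonormal columns) and
    r of size n x n (upper triangular), such that a = q * r.
    """
    m, n = len(a), len(a[0])
    q = [row[:] for row in a]          # working copy, orthonormalized column by column
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = math.sqrt(sum(q[i][j] ** 2 for i in range(m)))
        if norm == 0.0:
            raise ValueError("matrix is rank-deficient")
        r[j][j] = norm
        for i in range(m):
            q[i][j] /= norm
        # Remove the component along column j from all later columns.
        for k in range(j + 1, n):
            dot = sum(q[i][j] * q[i][k] for i in range(m))
            r[j][k] = dot
            for i in range(m):
                q[i][k] -= dot * q[i][j]
    return q, r


def lstsq_qr(a, b):
    """Solve min ||a x - b||_2 via a = q r, then back-substitute r x = q^T b."""
    q, r = qr_mgs(a)
    m, n = len(a), len(a[0])
    y = [sum(q[i][j] * b[i] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    for j in reversed(range(n)):
        s = y[j] - sum(r[j][k] * x[k] for k in range(j + 1, n))
        x[j] = s / r[j][j]
    return x


# Fit b = c0 + c1 * t through the points (0, 1), (1, 3), (2, 5).
coeffs = lstsq_qr([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0])
print(coeffs)  # approximately [1.0, 2.0]
```

Householder reflections are generally preferred over Gram-Schmidt for numerical stability in production libraries; this sketch only shows the structure of the computation.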