Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram

Big Picture
[Layered-stack diagram. Application layer: a consumer face-search application (CBIR application framework; feature extraction & classifier application patterns), built by a face search developer for a consumer searcher. SW infrastructure: a MapReduce programming framework (MapReduce programming pattern), built by a programming framework developer, on top of a CUDA computation & communication framework (barrier/reduction computation & communication patterns), built by a CUDA framework developer. Platform: NVIDIA G80 hardware, from the hardware architect. A pattern language ties the application, SW infrastructure, and platform layers together.]

GPUs as a proxy for manycore
GPUs are interesting architectures to program, transitioning from highly specialized pipelines to general-purpose compute. The only way to get performance from a GPU is through parallelism (no caching, branch prediction, or prefetching). A single call can launch millions of threads.

GPUs are not for everyone
Memory coalescing is really important. Irregular memory access, even to local stores, is discouraged: local-memory bank conflicts cost up to 30% of performance on some apps. You cannot forget that it is a SIMD machine. Memory consistency is non-existent and inter-SM synchronization is absent. Threads are hardware scheduled. Kernel calls carry about 20 us of overhead (roughly 20,000 cycles at 1 GHz).

NVIDIA G80 Architecture
[Architecture block diagram from the original slide.]

NVIDIA GeForce 8800 GTX Specifications
Number of streaming multiprocessors: 16
Multiprocessor width: 8
Local store size: 16 KB
Total number of stream processors: 128
Peak SP floating-point rate: 346 Gflops (128 SPs x 1.35 GHz x 2 flops per multiply-add)
Clock: 1.35 GHz
Device memory: 768 MB
Peak memory bandwidth: 86.4 GB/s
Connection to host CPU: PCI Express
CPU -> GPU bandwidth: 2.2 GB/s*
GPU -> CPU bandwidth: 1.7 GB/s*
* measured values
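As an aside (our sketch, not part of the original slides): the corresponding numbers for whatever GPU is installed can be queried at run time through the CUDA runtime API. The printed fields are standard cudaDeviceProp members.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);  // device 0
    printf("Device: %s\n", p.name);
    printf("Streaming multiprocessors: %d\n", p.multiProcessorCount);
    printf("Clock: %.2f GHz\n", p.clockRate / 1e6);          // clockRate is in kHz
    printf("Device memory: %zu MB\n", p.totalGlobalMem >> 20);
    printf("Local store (shared memory) per block: %zu KB\n", p.sharedMemPerBlock >> 10);
    printf("Max threads per block: %d\n", p.maxThreadsPerBlock);
    return 0;
}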

GPU programming: CUDA
Each block can have up to 512 threads, which can synchronize with one another. Millions of blocks can be issued, but there is no synchronization between blocks and no control over scheduling. (A minimal launch example follows.)
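As an illustrative sketch (ours, not from the talk): a minimal CUDA program that launches a grid of blocks with 512 threads each, the G80-era per-block maximum.

#include <cuda_runtime.h>

__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    int threads = 512;                        // up to 512 threads per block (G80)
    int blocks = (n + threads - 1) / threads; // grids can hold millions of blocks
    square<<<blocks, threads>>>(d, n);
    // Threads within a block may call __syncthreads(); blocks cannot
    // synchronize with each other, and scheduling is hardware-controlled.
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}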

Support Vector Machines
A hugely popular machine learning technique for classification. It tries to find a hyperplane separating the different classes with maximum margin. Non-linear decision surfaces can be generated through non-linear kernel functions. Training uses Quadratic Programming, and the QP's specific set of constraints admits a wide variety of solution techniques.

SVM Training Quadratic Program

\[
\max_{\alpha}\;\; \sum_{i=1}^{l} \alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\,\alpha_j\, y_i\, y_j\, K(x_i, x_j)
\qquad \text{subject to}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{l} y_i\,\alpha_i = 0
\]

Some kernel functions: linear \(K(x_i,x_j) = x_i \cdot x_j\); polynomial \(K(x_i,x_j) = (a\,x_i \cdot x_j + r)^d\); Gaussian \(K(x_i,x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)\); sigmoid \(K(x_i,x_j) = \tanh(a\,x_i \cdot x_j + r)\).

Variables:
α : weight for each training point (determines the classifier)
Data:
l : number of training points
C : trades off error on the training set against generalization performance
y : label (+/-1) for each training point
x : training points

Choice of parallel algorithm (among chunking algorithms): Sequential Minimal Optimization (SMO)

Fitting SMO on a GPU
The GPU's shared-memory constraints suit the algorithm, since only two vectors need to be shared among all the threads. Performance depends strongly on the choice of the working set. Several selection heuristics have been proposed; two are popular (1st and 2nd order). The 2nd-order heuristic is almost twice as costly per iteration, but saves on the number of iterations. (The update step is written out below.)
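For concreteness, the first-order SMO step referred to above can be written out. This is the standard formulation (after Keerthi et al.), sketched here rather than taken verbatim from the talk. With \(f_i = \sum_j \alpha_j y_j K(x_i, x_j) - y_i\), and \(I_{up}\), \(I_{low}\) the index sets whose \(\alpha\) may still increase or decrease without violating \(0 \le \alpha_i \le C\):

\[
i_{up} = \arg\min_{i \in I_{up}} f_i, \qquad i_{low} = \arg\max_{i \in I_{low}} f_i,
\]
\[
\alpha'_{i_{low}} = \alpha_{i_{low}} + \frac{y_{i_{low}}\,(f_{i_{up}} - f_{i_{low}})}{\eta},
\qquad
\eta = K(x_{i_{up}}, x_{i_{up}}) + K(x_{i_{low}}, x_{i_{low}}) - 2\,K(x_{i_{up}}, x_{i_{low}}),
\]
clipped to \([0, C]\), followed by
\[
\alpha'_{i_{up}} = \alpha_{i_{up}} + y_{i_{up}}\, y_{i_{low}}\,(\alpha_{i_{low}} - \alpha'_{i_{low}}).
\]

The two arg-selections, and the update of every \(f_j\) afterwards, are precisely the map-reduce stages the next slide describes.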

Adaptive heuristic
Both heuristics can be expressed as a series of map-reduce stages, and a map-reduce code generator was used to generate the code. We sample periodically and adapt, switching to whichever heuristic is converging fastest at any given time. Tightly coupled map-reduces are essential for machine learning algorithms: you cannot afford the overhead of a general library call when it is invoked millions of times. (A sketch of one reduction stage follows.)
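As an illustration (our sketch, not the generated code from the talk), here is one such reduce stage in CUDA: a block-level argmin over the f values, of the kind the working-set selection needs. The second stage is finished on the host for brevity; fusing it on the GPU is what keeps the map-reduces tightly coupled.

#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Each block reduces its slice of f to a (min value, index) pair in
// shared memory; a second stage combines the per-block results.
__global__ void block_argmin(const float *f, int n, float *vals, int *idxs) {
    __shared__ float sv[256];  // sized to match the 256-thread launch below
    __shared__ int   si[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sv[tid] = (i < n) ? f[i] : FLT_MAX;
    si[tid] = (i < n) ? i : -1;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sv[tid + s] < sv[tid]) {
            sv[tid] = sv[tid + s];
            si[tid] = si[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) { vals[blockIdx.x] = sv[0]; idxs[blockIdx.x] = si[0]; }
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (float)((i * 2654435761u) % 1000);  // synthetic f
    float *df, *dv; int *di;
    cudaMalloc(&df, n * sizeof(float));
    cudaMalloc(&dv, blocks * sizeof(float));
    cudaMalloc(&di, blocks * sizeof(int));
    cudaMemcpy(df, h, n * sizeof(float), cudaMemcpyHostToDevice);
    block_argmin<<<blocks, threads>>>(df, n, dv, di);
    float *hv = new float[blocks]; int *hi = new int[blocks];
    cudaMemcpy(hv, dv, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hi, di, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    float best = FLT_MAX; int arg = -1;
    for (int b = 0; b < blocks; ++b)
        if (hv[b] < best) { best = hv[b]; arg = hi[b]; }
    printf("argmin: index %d, value %f\n", arg, best);
    return 0;
}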

Results
[Chart from the original slide, normalized to the 1st-order heuristic.]

Overall speedup compared to LIBSVM
[Speedup chart from the original slide.]

SVM Classification
The SVM classification task involves finding which side of the hyperplane a point lies on. Specifically, for a test point \(z\),

\[
\hat{y}(z) = \operatorname{sign}\Big( b + \sum_{i=1}^{l} y_i\,\alpha_i\, K(x_i, z) \Big),
\]

where only the support vectors (points with \(\alpha_i > 0\)) contribute. Insight: instead of doing this serially for all points, note that for the common kernels \(K(x_i, z)\) depends on its arguments only through dot products, so evaluating all test points against all support vectors at once is a dense matrix-matrix product.

Restructuring the Classification Problem
[Diagram: instead of multiplying the support-vector (SV) matrix against one test vector at a time to produce each output, multiply the full SV matrix by the full test-data matrix to produce the entire output at once.]
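A hedged sketch of that restructuring (our illustration; the paper used an optimized SGEMM, and every name here is ours). Stage 1 computes all support-vector/test-point dot products as one matrix product; stage 2 turns them into Gaussian-kernel values and accumulates the decision value per test point.

#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Stage 1: all pairwise dot products between support vectors (row-major,
// nsv x dim) and test points (ntest x dim) -- one dense matrix product.
__global__ void pairwise_dot(const float *sv, const float *test, float *dots,
                             int nsv, int ntest, int dim) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // support-vector index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // test-point index
    if (i < nsv && j < ntest) {
        float acc = 0.f;
        for (int k = 0; k < dim; ++k)
            acc += sv[i * dim + k] * test[j * dim + k];
        dots[i * ntest + j] = acc;
    }
}

// Stage 2: per test point, accumulate b + sum_i y_i alpha_i K(x_i, z_j),
// using ||x_i - z_j||^2 = ||x_i||^2 + ||z_j||^2 - 2 x_i.z_j for the
// Gaussian kernel (norms are precomputed on the host).
__global__ void classify(const float *dots, const float *svnorm,
                         const float *tnorm, const float *alpha_y, float b,
                         float gamma, float *out, int nsv, int ntest) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < ntest) {
        float acc = b;
        for (int i = 0; i < nsv; ++i) {
            float d2 = svnorm[i] + tnorm[j] - 2.f * dots[i * ntest + j];
            acc += alpha_y[i] * expf(-gamma * d2);
        }
        out[j] = acc;  // sign(out[j]) is the predicted label
    }
}

int main() {
    const int nsv = 256, ntest = 512, dim = 32;
    float *sv = new float[nsv * dim], *test = new float[ntest * dim];
    float *svn = new float[nsv], *tn = new float[ntest], *ay = new float[nsv];
    for (int i = 0; i < nsv * dim; ++i) sv[i] = rand() / (float)RAND_MAX;
    for (int i = 0; i < ntest * dim; ++i) test[i] = rand() / (float)RAND_MAX;
    for (int i = 0; i < nsv; ++i) {
        ay[i] = rand() / (float)RAND_MAX - 0.5f;  // stands in for y_i * alpha_i
        svn[i] = 0.f;
        for (int k = 0; k < dim; ++k) svn[i] += sv[i * dim + k] * sv[i * dim + k];
    }
    for (int j = 0; j < ntest; ++j) {
        tn[j] = 0.f;
        for (int k = 0; k < dim; ++k) tn[j] += test[j * dim + k] * test[j * dim + k];
    }
    float *dsv, *dt, *ddots, *dsvn, *dtn, *day, *dout;
    cudaMalloc(&dsv, nsv * dim * sizeof(float));
    cudaMalloc(&dt, ntest * dim * sizeof(float));
    cudaMalloc(&ddots, nsv * ntest * sizeof(float));
    cudaMalloc(&dsvn, nsv * sizeof(float));
    cudaMalloc(&dtn, ntest * sizeof(float));
    cudaMalloc(&day, nsv * sizeof(float));
    cudaMalloc(&dout, ntest * sizeof(float));
    cudaMemcpy(dsv, sv, nsv * dim * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dt, test, ntest * dim * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dsvn, svn, nsv * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dtn, tn, ntest * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(day, ay, nsv * sizeof(float), cudaMemcpyHostToDevice);
    dim3 tb(16, 16), grid((ntest + 15) / 16, (nsv + 15) / 16);
    pairwise_dot<<<grid, tb>>>(dsv, dt, ddots, nsv, ntest, dim);
    classify<<<(ntest + 255) / 256, 256>>>(ddots, dsvn, dtn, day,
                                           0.1f, 0.5f, dout, nsv, ntest);
    float first;
    cudaMemcpy(&first, dout, sizeof(float), cudaMemcpyDeviceToHost);
    printf("decision value for test point 0: %f\n", first);
    return 0;
}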

Results
[Classification results charts from the original slides.]

Is this compute or memory bound?
GPUs are better for memory-bound jobs (observed 7 GB/s here, vs. 1 GB/s for other streaming-like apps).

Importance of memory coalescing
To avoid non-coalesced memory accesses, we carried both Data and Data^T into GPU memory. Letting even 0.05% of memory accesses be non-coalesced led to a 21% drop in performance in one case. Well-written code should scale with GPU size: parallelism should be limited by problem size, not machine size. (The sketch below illustrates the access-pattern difference.)
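A hedged illustration of the coalescing point (our sketch, not the paper's code): summing a row-major matrix along different axes. In coalesced_read, neighbouring threads touch neighbouring addresses at every step; in strided_read they are `cols` floats apart, which G80-class hardware cannot coalesce. Keeping both Data and Data^T lets every kernel read whichever layout gives it the coalesced pattern.

#include <cuda_runtime.h>

__global__ void coalesced_read(const float *m, float *out, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per column
    if (c < cols) {
        float acc = 0.f;
        // at each step the warp reads consecutive addresses -> coalesced
        for (int r = 0; r < rows; ++r) acc += m[r * cols + c];
        out[c] = acc;
    }
}

__global__ void strided_read(const float *m, float *out, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (r < rows) {
        float acc = 0.f;
        // neighbouring threads are `cols` floats apart -> non-coalesced
        for (int c = 0; c < cols; ++c) acc += m[r * cols + c];
        out[r] = acc;
    }
}

int main() {
    const int rows = 1024, cols = 1024;  // square, so one output buffer serves both
    float *dm, *dout;
    cudaMalloc(&dm, rows * cols * sizeof(float));
    cudaMalloc(&dout, rows * sizeof(float));
    cudaMemset(dm, 0, rows * cols * sizeof(float));
    coalesced_read<<<(cols + 255) / 256, 256>>>(dm, dout, rows, cols);
    strided_read<<<(rows + 255) / 256, 256>>>(dm, dout, rows, cols);
    cudaDeviceSynchronize();
    cudaFree(dm); cudaFree(dout);
    return 0;
}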

Is SIMD becoming ubiquitous?
SIMD is already important for performance on uniprocessor systems. Task vs. data parallelism. Intel's new GPU has wide SIMD. The CUDA lesson: runtime SIMD binding is easier for programmers. Non-SIMD code incurs a performance penalty rather than producing incorrect programs, which prevents premature optimization and keeps code flexible.

Conclusion
GPUs and manycore CPUs are on a collision course: data parallelism on GPUs vs. task parallelism on CPUs. Rethink serial control and data structures; sequential optimizations may harm parallelism. Machine learning can use a lot of parallel hardware if the software is engineered properly.