
1 Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram

2 Big Picture
[Layered pattern-language diagram. Application layer: CBIR application framework, feature extraction & classifier application patterns (roles: application framework developer, face search developer, consumer searcher). SW Infrastructure layer: Map Reduce programming framework and pattern, CUDA computation & communication framework, barrier/reduction computation & communication patterns (roles: Map Reduce programming framework developer, CUDA framework developer). Platform layer: Nvidia G80 (role: hardware architect).]

3 GPUs as proxy for manycore
GPUs are interesting architectures to program
Transitioning from highly specialized pipelines to general purpose
The only way to get performance from GPUs is through parallelism (no caching, branch prediction, prefetching, etc.)
Can launch millions of threads in one call
5/14/08 CS258 Parallel Computer Architecture

4 GPUs are not for everyone
Memory coalescing is really important
Irregular memory accesses, even to local stores, are discouraged: local memory bank conflicts cost up to 30% of performance on some apps
Cannot forget that it is a SIMD machine
Memory consistency is non-existent and inter-SM synchronization is absent
Threads are hardware scheduled
20 us overhead per kernel call (20,000 instructions @ 1 GHz)

5 NVIDIA G80 Architecture
[Architecture block diagram]

6 NVIDIA GeForce 8800 GTX Specifications
Number of Streaming Multiprocessors: 16
Multiprocessor Width: 8
Local Store Size: 16 KB
Total Number of Stream Processors: 128
Peak SP Floating Point Rate: 346 Gflops
Clock: 1.35 GHz
Device Memory: 768 MB
Peak Memory Bandwidth: 86.4 GB/s
Connection to Host CPU: PCI Express
CPU -> GPU bandwidth: 2.2 GB/s*
GPU -> CPU bandwidth: 1.7 GB/s*
* measured values

7 GPU programming - CUDA
Each block can have up to 512 threads that synchronize
Millions of blocks can be issued
No synchronization between blocks
No control over scheduling

8 Support Vector Machines
A hugely popular machine learning technique for classification
Tries to find a hyperplane separating the different classes with "maximum margin"
Non-linear decision surfaces can be generated through non-linear kernel functions
Training uses Quadratic Programming (the specific set of constraints admits a wide variety of solution techniques)

9 SVM Training
Quadratic Program (dual form):
maximize over α:  Σᵢ αᵢ − ½ ΣᵢΣⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ)
subject to:  0 ≤ αᵢ ≤ C,  Σᵢ αᵢyᵢ = 0
Some kernel functions: linear K(x, z) = x·z; polynomial K(x, z) = (x·z + c)^d; Gaussian (RBF) K(x, z) = exp(−γ‖x − z‖²); sigmoid K(x, z) = tanh(a x·z + r)
Variables:
α: weight for each training point (determines the classifier)
Data:
l: number of training points
C: trades off error on the training set against generalization performance
y: label (+/- 1) for each training point
x: training points
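As an illustrative sketch of the dual objective above (the function names and the RBF `gamma` value are hypothetical, not from the slides), the objective can be evaluated directly in plain Python:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def dual_objective(alpha, y, X, kernel=rbf_kernel):
    """SVM dual objective: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    l = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad
```

SMO maximizes this objective by updating only two of the αᵢ at a time, which is what makes it attractive for the GPU's limited shared memory.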

10 Choice of parallel algorithm (among chunking algorithms)
Chosen algorithm: Sequential Minimal Optimization (SMO)

11 Fitting SMO on a GPU
The GPU's shared memory constraints fit the algorithm, as only two vectors need to be shared among all the threads
Performance is strongly dependent on the choice of the working set
Several heuristics have been proposed; two are popular (1st and 2nd order)
The 2nd order heuristic is almost twice as costly per iteration, but saves on the number of iterations
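A minimal sketch of 1st order working-set selection, assuming standard SMO bookkeeping (the function name and the convention that `grad` holds the gradient of the dual objective are illustrative, not taken from the slides):

```python
def select_working_set_first_order(grad, y, alpha, C):
    """First-order SMO heuristic (sketch): pick the pair of points with the
    largest KKT violation, using only gradient information."""
    n = len(y)
    # I_up: indices whose alpha may still increase; I_low: indices whose alpha may decrease.
    i_up = [t for t in range(n)
            if (y[t] > 0 and alpha[t] < C) or (y[t] < 0 and alpha[t] > 0)]
    i_low = [t for t in range(n)
             if (y[t] > 0 and alpha[t] > 0) or (y[t] < 0 and alpha[t] < C)]
    # Most violating pair: max and min of -y_t * grad_t over the two sets.
    i = max(i_up, key=lambda t: -y[t] * grad[t])
    j = min(i_low, key=lambda t: -y[t] * grad[t])
    return i, j
```

On the GPU, each of the two extremal searches is exactly the kind of map-then-reduce stage the next slide describes.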

12 Adaptive heuristic
Both heuristics can be expressed as a series of "Map Reduce" stages
A Map Reduce code generator was used to generate the code
Sample periodically and switch to whichever heuristic is converging faster at any given time
Tightly coupled map-reduces are essential for machine learning algorithms
Cannot afford the overhead of a general library call when it is invoked millions of times
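The reduce stage of such a Map Reduce pair is a log-depth tree reduction on the GPU. A plain-Python sketch of the pattern (a hypothetical helper, not the authors' generated code):

```python
def tree_reduce(op, xs):
    """Tree-style reduction: repeatedly combine element i with element i + half,
    mirroring the log-depth barrier/reduction pattern used on the GPU."""
    xs = list(xs)
    while len(xs) > 1:
        half = (len(xs) + 1) // 2
        xs = [op(xs[i], xs[i + half]) if i + half < len(xs) else xs[i]
              for i in range(half)]
    return xs[0]

# Map stage: per-point optimality measure; Reduce stage: find its maximum.
scores = [abs(g) for g in [-3, 1, -4, 1, 5]]   # map
best = tree_reduce(max, scores)                 # reduce
```

Each round halves the number of live elements, so n inputs take about log2(n) synchronization barriers rather than n sequential steps.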

13 Results
[Chart: performance normalized to the 1st order heuristic]

14 Overall speedup compared to LIBSVM

15 SVM Classification
The SVM classification task involves finding which side of the hyperplane a point lies on
Specifically, ŷ(z) = sign( Σᵢ αᵢyᵢ K(xᵢ, z) + b ), where the sum runs over the support vectors xᵢ
Insight: instead of doing this serially for all points, evaluate all test points at once, restructuring the computation as a single matrix-matrix operation between the support vectors and the test data
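A sketch of the batched evaluation, with plain Python loops standing in for the GPU matrix product (the function name, the RBF kernel choice, and passing the products αᵢyᵢ as one array are illustrative assumptions):

```python
import math

def classify_batch(support_vectors, alpha_y, b, test_points, gamma=0.5):
    """Classify every test point in one pass: build the kernel 'matrix' between
    support vectors and test data, then take a weighted sum per test point
    (a matrix-matrix / matrix-vector product on the GPU)."""
    labels = []
    for z in test_points:
        s = b
        for sv, w in zip(support_vectors, alpha_y):   # w = alpha_i * y_i
            sq_dist = sum((a - c) ** 2 for a, c in zip(sv, z))
            s += w * math.exp(-gamma * sq_dist)       # kernel entry K(sv, z)
        labels.append(1 if s >= 0 else -1)
    return labels
```

Batching all test points turns many small dot products into one large, regular computation, which is what lets the GPU approach its peak memory bandwidth.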

16 Restructuring the Classification problem
[Diagram: per-point evaluation (one support vector against one test point at a time) vs. a single matrix-matrix product of the support vectors (SV) with the whole test data set to produce all outputs at once]

17 Results
[Classification results charts]

18 Results
[Classification results charts, continued]

19 Is this compute or memory bound?
GPUs are better for memory bound jobs (observed 7 GB/s here vs 1 GB/s for other streaming-like apps)

20 Importance of memory coalescing
To avoid non-coalesced memory accesses, carried both Data and Data^T (its transpose) in GPU memory
Letting just 0.05% of memory accesses be non-coalesced led to a 21% drop in performance in one case
Well written code should scale with GPU size (parallelism should be limited by problem size, not machine size)
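The Data/Data^T trick can be sketched in plain Python (illustrative only; on the GPU the point is that adjacent threads then read consecutive addresses, so every traversal direction touches a row contiguously):

```python
def transpose(matrix):
    """Store both the data and its transpose so that column traversals of the
    original become contiguous row reads of the copy (the coalescing-friendly
    layout), at the cost of doubling the memory footprint."""
    return [list(col) for col in zip(*matrix)]

data = [[1, 2, 3],
        [4, 5, 6]]
data_t = transpose(data)
# Reading column j of `data` becomes reading row j of `data_t`.
```

Trading memory for access regularity is worthwhile here because, as the slide notes, even a tiny fraction of non-coalesced accesses carries a large performance penalty.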

21 Is SIMD becoming ubiquitous?
SIMD is already important for performance on uniprocessor systems
Task vs. data parallelism
Intel's new GPU has wide SIMD
CUDA lesson: runtime SIMD binding is easier for programmers
Non-SIMD code incurs a performance penalty rather than producing incorrect programs, which prevents premature optimization and keeps code flexible

22 Conclusion
GPUs and manycore CPUs are on a collision course
Data parallelism on GPUs vs. task parallelism on CPUs
Rethink serial control and data structures
Sequential optimizations may harm parallelism
Machine learning can use a lot of parallel hardware if the software is engineered properly
