Low-Latency Accelerated Computing on GPUs

1 Low-Latency Accelerated Computing on GPUs
Han Vanholder, Peter Messmer, Dr. Christoph Angerer (DevTech, NVIDIA)

2 Accelerated Computing
High performance and high energy efficiency for throughput tasks: the CPU handles serial tasks, the GPU accelerator handles parallel tasks.

3 Accelerating Insights
GOOGLE DATACENTER: 1,000 CPU servers, 2,000 CPUs / 16,000 cores, 600 kW, $5,000,000
STANFORD AI LAB: 3 GPU-accelerated servers, 12 GPUs / 18,432 cores, 4 kW, $33,000

The next obvious question is whether that success could scale out. In 2012, Google and Andrew Ng at Stanford published a paper showing the ability to extract high-level features from a dataset of 10 million images harvested from YouTube, using a large neural network model with 1B parameters. The community was impressed yet sad: only Google had the resources to build a 16-thousand-core cluster for these sorts of experiments, at a cost of millions of dollars.

The following year, the same group at Stanford, collaborating with NVIDIA, showed that they could replicate the Google results from the prior year using only 12 GPUs. Total cost: $33K, with a smaller footprint and lower power.

Details: same time to complete, same setup of 10M 200x200 images from YouTube, 1B parameters (we actually scaled to 11B parameters). The 25 EFLOP model recognizes 1,000 classes. Humans recognize somewhere around 10,000 to 100,000 classes: the average adult vocabulary is 20-35,000 words (kids' vocabularies are much smaller), and we each can recognize about 10,000 faces. We can recognize places as well. Add them all up, and I'd guess most people are somewhere between 10K and 100K classes.

These servers for Stanford and Google were for *training* the model. In addition, there are also the computational requirements for doing the actual *classification* of every new image. Training one model today takes tens of EFLOPs (the NYU model took ~25 EFLOPs to train). Processing an image costs only ~TFLOPs per image, but since over 1 billion photos are uploaded to the internet every day, the collective cost is huge: about 46 PFLOP/s (ORNL Titan is only 27 PFLOP/s). At ~46 PFLOP/s, roughly 23,000 GPUs are required to keep up in real time with images uploaded to the internet, and that's just with today's models. For a larger "vocabulary" (more classes), the training set grows and the computational requirements for training increase. Also, classification isn't the only thing you want to know: you also want to know things like "where is the object in the image" (localization), "what are all of the objects in the image" (detection), or "who is that" (recognition). All of this drives the cost of real-time image processing above 46 PFLOP/s. These emerging requirements will continue to drive up the demand for computational resources to keep up with the data deluge.

A few caveats: the 46 PFLOP/s (SP) number calculated for these 23,148 GPUs depends on how many FLOP/s you assume are achieved on the GPU for this operation, so it might be off by ~2x. The CEO math is reasonable for 46 PFLOP/s if you assume 2 TFLOP/s sustained single precision. Most people will think of 17.6 PFLOP/s double precision for LINPACK on ORNL Titan; the 27 PFLOP/s quoted here is peak (not LINPACK). In single precision, Titan might be ~2x that number, so ~35 PFLOP/s single precision.

"Now You Can Build Google's $1M Artificial Brain on the Cheap". Deep learning with COTS HPC systems, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013.
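A rough reconstruction of the "CEO math" above, assuming ~4 TFLOP per image (an assumption consistent with the "~TFLOPs per image" figure in the notes) and 2 TFLOP/s sustained single precision per GPU:

    1e9 images/day × ~4e12 FLOP/image ÷ 86,400 s/day ≈ 4.6e16 FLOP/s ≈ 46 PFLOP/s
    46 PFLOP/s ÷ 2 TFLOP/s sustained per GPU ≈ 23,000 GPUs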

4 From HPC to Enterprise Data Center
Oil & Gas, Higher Ed, Government, Supercomputing, Finance, Consumer Web
Air Force Research Laboratory, Tokyo Institute of Technology, Naval Research Laboratory

5 Tesla: Platform for Accelerated Datacenters
Data Center Infrastructure: Optimized Systems, Communication Solutions, Infrastructure Management. Development: Programming Languages, Development Tools, Software Applications. Partner Ecosystem.

Tesla is more than just great GPU accelerator boards. It combines the world's fastest GPU accelerators, the widely used CUDA parallel computing model, enterprise-class support and maintenance services, and a comprehensive ecosystem of software developers, software vendors, and datacenter system OEMs to deliver unrivalled performance. It combines fast GPU accelerators, advanced system management features, and accelerated communication technology, and is supported by popular infrastructure management software. The Tesla platform provides computing professionals the tools to easily build, test, and deploy accelerated applications in the datacenter: GPU accelerators, interconnect, system management, compiler solutions, profile and debug, libraries, enterprise services. The Tesla Accelerated Computing Platform.

6 Common Programming Models Across Multiple CPUs
Libraries (AmgX, cuBLAS / ...), programming languages, compiler directives, x86.

The other primary goal for us is to provide an open platform that supports all CPU architectures and industry standards. NVIDIA is the only processor company in the world with an open platform that works with any CPU architecture: OpenPOWER, ARM, or x86. Intel can't do this; they are tied to x86. In fact, Intel and x86 are becoming more "closed", with their own network stack. Our vision is that OEMs will build what is right for their customers, with the right CPU, and GPUs will be there to accelerate their work. We want to make sure that when our customers write GPU code, it will run without change on all three platforms.

7 GPU Roadmap
[Chart: normalized performance (0-20) by year, 2008-2016: Tesla (CUDA, 2008), Fermi (FP64, 2010), Kepler (Dynamic Parallelism, 2012), Maxwell (DX12, 2014), Pascal (Unified Memory, 3D Memory, NVLink, 2016).]

8 GPUDirect

9 Multi-GPU: Unified Virtual Addressing Single Partitioned Address Space
[Diagram: a single virtual address space spanning 0x0000-0xFFFF, partitioned across system memory, GPU0 memory, and GPU1 memory; CPU, GPU0, and GPU1 connected over PCIe.]
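A minimal sketch of what UVA means for application code (allocation size and names are illustrative, not from the slide): host and device pointers share one address space, so the runtime can infer copy direction and report where a pointer lives.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t N = 1 << 20;                       // illustrative buffer size
        float *host = nullptr, *dev = nullptr;
        cudaMallocHost(&host, N * sizeof(float));       // pinned host memory, mapped into the unified address space
        cudaMalloc(&dev, N * sizeof(float));

        // With UVA the runtime infers the copy direction from the pointers.
        cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyDefault);

        // A pointer's location (host or device, and which device) can be queried.
        cudaPointerAttributes attr;
        cudaPointerGetAttributes(&attr, dev);
        printf("pointer resides on device %d\n", attr.device);

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }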

10 GPUDirect Technologies
GPUDirect Shared GPU-Sysmem: for inter-node copy optimization. How: use GPUDirect-aware 3rd-party network drivers.
GPUDirect P2P Transfers: for on-node GPU-GPU memcpy. How: use CUDA APIs directly in the application, or use a P2P-aware MPI implementation.
GPUDirect P2P Access: for on-node inter-GPU LD/ST access. How: access remote data by address directly in GPU device code.
GPUDirect RDMA: for inter-node copy optimization. What: 3rd-party PCIe devices can read and write GPU memory. How: use GPUDirect RDMA-aware 3rd-party network drivers and MPI implementations, or custom device drivers for other hardware.

11 GPUDirect P2P: GPU-GPU Direct Access
[Diagram: two GPUs and their memories attached to the CPU/IOH and CPU memory over PCIe.]
GPUDirect v2.0 (a CUDA 4.0 feature) enables GPUs to communicate directly, both during device-to-device cudaMemcpy and from within CUDA kernels.
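A minimal sketch of what "direct access from within CUDA kernels" looks like in practice; the device IDs, buffer names, and kernel are illustrative, and it assumes both GPUs report P2P capability to each other:

    #include <cuda_runtime.h>

    // Illustrative kernel: GPU 0 reads directly from a buffer resident in GPU 1's memory.
    __global__ void readPeer(const float *peerData, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = peerData[i];              // load over PCIe via P2P access
    }

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);         // GPU 0 may now map GPU 1's allocations
        }
        // ... allocate peerData on GPU 1 and out on GPU 0, then launch readPeer from device 0 ...
        return 0;
    }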

12 P2P Goal: Improve intra-node programming model
Improve the CUDA programming model. How? Transfer data between two GPUs quickly and easily.

    // Without P2P, data moves GPU 0 -> host -> GPU 1 through a staging buffer.
    // setup(), kernel, and N are placeholders from the slide.
    int main() {
        double *cpuTmp = (double *)malloc(N * sizeof(double));
        double *gpu0Data, *gpu1Data;
        setup(&gpu0Data, &gpu1Data);
        cudaSetDevice(0);
        kernel<<< ... >>>(gpu0Data);
        cudaMemcpy(cpuTmp, gpu0Data, N * sizeof(double), cudaMemcpyDeviceToHost);
        cudaMemcpy(gpu1Data, cpuTmp, N * sizeof(double), cudaMemcpyHostToDevice);
        cudaSetDevice(1);
        // ... continue on GPU 1 with gpu1Data ...
    }
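With P2P enabled, the same transfer collapses to a single call; a minimal sketch reusing the placeholder names from the slide code above:

    // With GPUDirect P2P, the host staging buffer and the double copy disappear.
    cudaMemcpyPeer(gpu1Data, 1 /*dstDevice*/, gpu0Data, 0 /*srcDevice*/, N * sizeof(double));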

13 GPUDirect P2P: Common use cases
MPI implementation optimization for intra-node communication.
HPC applications that fit on a single node: you get much better efficiency with GPUDirect P2P (compared to staging through MPI), and it can be used between ranks with the cudaIpc*() APIs.
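A minimal sketch of the cudaIpc*() pattern mentioned above, for sharing one rank's device allocation with another process on the same node; the function names exportBuffer/importBuffer and the handle exchange are illustrative:

    #include <cuda_runtime.h>

    // Rank A: export a handle to an existing device allocation.
    void exportBuffer(void *devPtr, cudaIpcMemHandle_t *handle) {
        cudaIpcGetMemHandle(handle, devPtr);
        // ... send the opaque, fixed-size handle to the peer rank, e.g. via MPI_Send ...
    }

    // Rank B: map the peer rank's allocation into this process.
    void *importBuffer(cudaIpcMemHandle_t handle) {
        void *peerPtr = nullptr;
        cudaIpcOpenMemHandle(&peerPtr, handle, cudaIpcMemLazyEnablePeerAccess);
        return peerPtr;   // usable in cudaMemcpy and kernels; release with cudaIpcCloseMemHandle
    }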

14 GPUDirect RDMA
[Diagram: a third-party PCIe device reads and writes GPU memory directly over PCIe, bypassing the CPU/IOH and CPU memory.]
GPUDirect v3.0 (a CUDA 5.0 feature) enables third-party devices to directly access GPU memory without CPU or chipset involvement. Any vendor can implement it with no hardware modification (only software device driver changes): video capture devices, NICs (InfiniBand, Ethernet, and others), and storage devices (future SSD + database applications).

15 GPUDirect RDMA Goal: Inter-node Latency and Bandwidth
How? Transfer data between the GPU and a third-party device (e.g. a NIC) with possibly zero host-side copies.

    // setup(), kernel, N, and dest are placeholders from the slide.
    int main() {
        double *cpuData = (double *)malloc(N * sizeof(double));
        double *cpuTmp  = (double *)malloc(N * sizeof(double));
        double *gpuData;
        setup(&gpuData);
        kernel<<< ... >>>(gpuData);
        cudaDeviceSynchronize();
        cudaMemcpy(cpuTmp, gpuData, N * sizeof(double), cudaMemcpyDeviceToHost);  // host-side staging copy that GPUDirect RDMA removes
        memcpy(cpuData, cpuTmp, N * sizeof(double));                              // second host-side copy
        MPI_Send(gpuData, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);                // with a CUDA-aware MPI, the device pointer goes straight to MPI_Send
    }

16 GPUDirect RDMA: What does it get you?
Latency reduction.
With Shared GPU-Sysmem*: MPI_Send latency of 25 µs; no overlap possible; bidirectional transfer is difficult.
With GPUDirect RDMA: MPI_Send latency of 5 µs; does not affect running kernels; unlimited concurrency; RDMA possible; MPI-3 one-sided latency of 3 µs.

17 GPUDirect RDMA: Common use cases
Inter-node MPI communication: transfer data between a GPU and a remote node; use a CUDA-aware MPI.
Interface with third-party hardware: requires adopting the GPUDirect-Interop API in the vendor driver stack.
Limitation: GPUDirect RDMA does not work with CUDA Unified Memory today.
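A minimal sketch of the CUDA-aware MPI pattern referred to above; the buffer size, rank layout, and tag are illustrative, and it assumes an MPI library built with CUDA support:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1 << 20;                          // illustrative message size
        double *gpuBuf;
        cudaMalloc(&gpuBuf, N * sizeof(double));

        // With a CUDA-aware MPI, device pointers go straight into MPI calls;
        // GPUDirect RDMA lets the NIC read/write GPU memory without host staging.
        if (rank == 0)
            MPI_Send(gpuBuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(gpuBuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(gpuBuf);
        MPI_Finalize();
        return 0;
    }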

18 NVLINK

19 Pascal GPU Features NVLINK and Stacked Memory
NVLink: GPU high-speed interconnect ( GB/s).
3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.

20 NVLink Unleashes Multi-GPU Performance
Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe.
[Chart: speedup vs. a PCIe-based server for GPUs interconnected with NVLink (CPU, PCIe switch, Tesla GPUs); NVLink is 5x faster than PCIe Gen3 x16.]
We want to show that NVLink provides immediate application speed-up, simply by allowing GPU-to-GPU communication to go faster. Why faster? Most HPC applications are multi-GPU enabled, and for multi-GPU applications GPU-to-GPU communication matters.
Benchmarks: 3D FFT, ANSYS: 2-GPU configuration; AMBER Cellulose (256x128x128), FFT problem size (256^3): 4-GPU configuration.

21 NVLink High-Speed GPU Interconnect
Example: 8-GPU server with NVLink. The important point is the NVLink connection between the GPUs, so GPUDirect functions at 10x PCIe speed.

22 Machine Learning

23 Machine Learning using Deep Neural Networks
[Diagram: input image passed through a deep neural network to produce a result.]
Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008, 2009. "Visual Object Recognition Using Deep Convolutional Neural Networks", Rob Fergus (New York University / Facebook).

24 3 Drivers for Deep Learning
Powerful GPU Accelerators More Data Better Models

25 Broad use of GPUs in Deep learning
Early adopters and use cases: image detection, face recognition, gesture recognition, video search & analytics, speech recognition & translation, recommendation engines, indexing & search, image analytics for Creative Cloud, speech/image recognition, image classification, Hadoop, recommendation, search rankings.

There's already quite a body of research in the public domain describing the use of GPUs for machine learning tasks. NVIDIA hosts our annual GPU Technology Conference in San Jose every spring, and this year we had a fantastic number of talks about using CUDA for machine learning and data sciences. We encourage you to attend: there are lots of opportunities for cross-pollination across domains, especially with machine learning.

26 What is Next? Analyzing Unstructured Data
Anomaly detection, behavior prediction, diagnostic support, language analysis, ...
"Any product that excites you over the next five years and makes you think: 'That is magical, how did they do that?', is probably based on this [deep learning]." Steve Jurvetson, Partner, DFJ Venture

27 Database Acceleration
Beyond deep learning: graph analytics, database acceleration, real-time analytics.
Visualization of social network analysis by Martin Grandjean is licensed under CC BY-SA 3.0.

28 Deep Learning Sessions
Adobe, Google, Alibaba, iFlytek Ltd, Baidu, Nuance, Carnegie Mellon, Stanford University, Facebook, UC Berkeley, Flickr / Yahoo, University of Toronto.
GTC 2015 had many deep learning sessions; check GTC on-demand.

29 NVIDIA in OCP

30 Unlocking Access to the Tesla Platform
Engaging with OEMs, end customers, and technology partners to include NVIDIA accelerators in the OCP (Open Compute Project) platform: NVLink-based designs for maximum performance, standard PCIe designs for scale-out.

31 Enabling NVLink GPU-CPU connections
[Diagram: in 2014, a Kepler GPU connects to x86, ARM64, or POWER CPUs over PCIe; in 2016, a Pascal GPU connects to a POWER CPU over NVLink (high-speed GPU interconnect) and to x86, ARM64, or POWER CPUs over PCIe.]

32 Thanks

