Low-Latency Accelerated Computing on GPUs

Presentation transcript:

Low-Latency Accelerated Computing on GPUs
Han Vanholder, Peter Messmer, Dr. Christoph Angerer (DevTech, NVIDIA)

Accelerated Computing: High Performance & High Energy Efficiency for Throughput Tasks
CPU: serial tasks. GPU accelerator: parallel tasks.

Accelerating Insights
Google Datacenter: 1,000 CPU servers, 2,000 CPUs / 16,000 cores, 600 kW, $5,000,000.
Stanford AI Lab: 3 GPU-accelerated servers, 12 GPUs / 18,432 cores, 4 kW, $33,000.
In 2012, Google and Andrew Ng at Stanford published a paper showing the ability to extract high-level features from a dataset of 10 million images harvested from YouTube, using a large neural network model with 1B parameters. The community was impressed yet discouraged: only Google had the resources to build a 16-thousand-core cluster for these sorts of experiments, at a cost of millions of dollars. The obvious next question was whether that success could scale out. The following year, the same group at Stanford, collaborating with NVIDIA, showed that they could replicate the prior year's Google results using only 12 GPUs: total cost $33K, with a smaller footprint and lower power. Details: same time to complete, same setup of 10M 200x200 images from YouTube, 1B parameters (we actually scaled to 11B parameters). The 25 EFLOP model recognizes 1,000 classes; humans recognize somewhere around 10,000-100,000. The average adult vocabulary is 20,000-35,000 words (children's are much smaller), and we can each recognize about 10,000 faces, plus places. Adding it all up, most people are probably somewhere between 10K and 100K classes. These servers for Stanford and Google were for *training* the model.
In addition, there are the computational requirements for doing the actual *classification* of every new image. Training one model today takes tens of EFLOPs (note the NYU model took ~25 EFLOPs to train). Processing an image costs only ~TFLOPs per image, but since over 1 billion photos are uploaded to the internet every day, the collective cost is huge: ~46 PFLOP/s (ORNL Titan is only 27 PFLOP/s). At 46 PFLOP/s, roughly 23,000 GPUs are required to keep up in real time with images uploaded to the internet. And that is just with today's models. For a larger "vocabulary" (more classes), the training set grows and the computational requirements for training increase. Classification also isn't the only thing you want to know: you also want "where is the object in the image" (localization), "what are all of the objects in the image" (detection), or "who is that" (recognition). All of this drives the cost of real-time image processing above 46 PFLOP/s, and these emerging requirements will continue to drive up the demand for computational resources to keep up with the data deluge. A few caveats: the 46 PFLOP/s (single precision) number calculated for these ~23,000 GPUs depends on how many FLOP/s you assume are achieved on the GPU for this operation, so it might be off by ~2x; CEO math gives 46 PFLOP/s if you assume 2 TFLOP/s sustained single precision. Most people will cite 17.6 PFLOP/s double precision for LINPACK on ORNL Titan; the 27 PFLOP/s quoted above is peak (not LINPACK), and in single precision Titan might be ~2x that, so ~35 PFLOP/s.
"Now You Can Build Google's $1M Artificial Brain on the Cheap"
Deep learning with COTS HPC systems, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013.

From HPC to Enterprise Data Center
Oil & Gas, Higher Ed, Government, Supercomputing, Finance, Consumer Web
Air Force Research Laboratory, Tokyo Institute of Technology, Naval Research Laboratory

Tesla: Platform for Accelerated Datacenters
Data Center Infrastructure: Development, Optimized Systems, Communication Solutions, Infrastructure Management, Programming Languages, Development Tools, Software Applications, Partner Ecosystem.
Tesla is more than just great GPU accelerator boards. It combines the world's fastest GPU accelerators, the widely used CUDA parallel computing model, enterprise-class support and maintenance services, and a comprehensive ecosystem of software developers, software vendors, and datacenter system OEMs to deliver unrivalled performance. It combines fast GPU accelerators, advanced system management features, and accelerated communication technology, and is supported by popular infrastructure management software. The Tesla platform gives computing professionals the tools to easily build, test, and deploy accelerated applications in the datacenter.
Tesla Accelerated Computing Platform: GPU Accelerators, Interconnect, System Management, Compiler Solutions, Profile and Debug, Libraries, Enterprise Services.

Common Programming Models Across Multiple CPUs: Libraries, Programming Languages, Compiler Directives (AmgX, cuBLAS, ...)
The other primary goal for us is to provide an open platform that supports all CPU architectures and industry standards. NVIDIA is the only processor company in the world with an open platform that works with any CPU architecture: OpenPOWER, ARM, or x86. Intel can't do this; they are tied to x86. In fact, Intel and x86 are becoming more "closed," with their own network stack. Our vision is that OEMs will build what is right for their customers, with the right CPU, and GPUs will be there to accelerate their work. We want to make sure that when our customers write GPU code, it will run without change on all three platforms.

GPU Roadmap (normalized performance by year):
Tesla (2008): CUDA
Fermi (2010): FP64
Kepler (2012): Dynamic Parallelism
Maxwell (2014): DX12
Pascal (2016): Unified Memory, 3D Memory, NVLink

GPUDirect

Multi-GPU: Unified Virtual Addressing. A single partitioned address space: system memory, GPU0 memory, and GPU1 memory are all mapped into one virtual range (0x0000...0xFFFF), with the CPU, GPU0, and GPU1 connected via PCIe.
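To make UVA concrete, here is a minimal sketch (buffer names are illustrative, error handling is omitted, and a CUDA-capable GPU is assumed): because host and device allocations share one virtual address space, the runtime can infer a pointer's location, so a single cudaMemcpyDefault works in any direction and cudaPointerGetAttributes can classify an arbitrary pointer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *hostBuf = nullptr, *devBuf = nullptr;
    cudaMallocHost(&hostBuf, 1024 * sizeof(float));  // pinned host memory, in the unified space
    cudaMalloc(&devBuf, 1024 * sizeof(float));       // device memory, in the same space

    // With UVA there is no need to spell out cudaMemcpyHostToDevice:
    // the runtime deduces the direction from the pointer values.
    cudaMemcpy(devBuf, hostBuf, 1024 * sizeof(float), cudaMemcpyDefault);

    // The runtime can also tell you where a pointer lives.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, devBuf);
    printf("devBuf resides on GPU %d\n", attr.device);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```

This same property is what lets P2P copies and CUDA-aware MPI accept a raw pointer and figure out the rest.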

GPUDirect Technologies
GPUDirect Shared GPU-Sysmem: for inter-node copy optimization. How: use GPUDirect-aware 3rd-party network drivers.
GPUDirect P2P Transfers: for on-node GPU-GPU memcpy. How: use CUDA APIs directly in the application, or use a P2P-aware MPI implementation.
GPUDirect P2P Access: for on-node inter-GPU LD/ST access. How: access remote data by address directly in GPU device code.
GPUDirect RDMA: for inter-node copy optimization. What: 3rd-party PCIe devices can read and write GPU memory. How: use GPUDirect RDMA-aware 3rd-party network drivers and MPI implementations, or custom device drivers for other hardware.

GPUDirect P2P: GPU-GPU Direct Access
GPUDirect v2.0 enables GPUs to communicate directly over PCIe, both during cudaMemcpy device-to-device transfers and from within CUDA kernels. CUDA 4.0 feature.

P2P Goal: improve the intra-node programming model
Improve the CUDA programming model. How? Transfer data between two GPUs quickly and easily. The baseline, without P2P, stages the copy through a host buffer:

int main() {
    double *cpuTmp, *gpu0Data, *gpu1Data;
    setup(gpu0Data, gpu1Data);
    cudaSetDevice(0);
    kernel<<< ... >>>(gpu0Data);
    cudaMemcpy(cpuTmp, gpu0Data, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(gpu1Data, cpuTmp, size, cudaMemcpyHostToDevice);
    cudaSetDevice(1);
}
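With GPUDirect P2P the host staging buffer disappears. A minimal sketch of the same transfer (buffer names are illustrative; assumes two P2P-capable GPUs under one PCIe root complex, with error checks omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    double *gpu0Data, *gpu1Data;

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) return 0;             // fall back to host staging

    cudaSetDevice(0);
    cudaMalloc(&gpu0Data, bytes);
    cudaDeviceEnablePeerAccess(1, 0);     // device 0 may now access device 1

    cudaSetDevice(1);
    cudaMalloc(&gpu1Data, bytes);
    cudaDeviceEnablePeerAccess(0, 0);     // and vice versa

    cudaSetDevice(0);
    // kernel<<<grid, block>>>(gpu0Data); // produce data on GPU 0

    // One call, no cpuTmp: the copy goes GPU-to-GPU over PCIe.
    cudaMemcpyPeer(gpu1Data, 1, gpu0Data, 0, bytes);
    return 0;
}
```

Once peer access is enabled, device code on GPU 0 can also dereference gpu1Data directly (the P2P Access case above).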

GPUDirect P2P: common use cases
MPI implementation optimization for intra-node communication. HPC applications that fit on a single node get much better efficiency with GPUDirect P2P (compared to MPI). It can also be used between ranks with the cudaIpc*() APIs.
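A sketch of the cudaIpc*() path between two ranks (the function names export_buffer and import_and_copy are illustrative; assumes both processes run on the same node with P2P-capable GPUs):

```cuda
#include <cuda_runtime.h>

// Rank 0: export an IPC handle for a device allocation. The handle is
// plain bytes and can be shipped to another process, e.g. with MPI_Send.
cudaIpcMemHandle_t export_buffer(float *devBuf) {
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, devBuf);
    return handle;
}

// Rank 1: map rank 0's allocation into this process and copy from it.
// The copy is a direct P2P transfer; no host bounce buffer is involved.
void import_and_copy(cudaIpcMemHandle_t handle, float *localDevBuf, size_t bytes) {
    void *peerBuf = nullptr;
    cudaIpcOpenMemHandle(&peerBuf, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(localDevBuf, peerBuf, bytes, cudaMemcpyDefault);
    cudaIpcCloseMemHandle(peerBuf);
}
```

P2P-aware MPI implementations do essentially this under the hood for intra-node sends between GPU buffers.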

GPUDirect RDMA
GPUDirect v3.0 enables third-party devices to directly access GPU memory over PCIe, without CPU or chipset involvement. CUDA 5.0 feature. Any vendor can implement it with no hardware modification (only software device-driver changes): video capture devices, NICs (InfiniBand, Ethernet, and others), storage devices (future SSD + database applications).

GPUDirect RDMA Goal: inter-node latency and bandwidth
How? Transfer data between the GPU and a third-party device (e.g. a NIC) with possibly zero host-side copies. The baseline, without RDMA, stages the send through host memory:

int main() {
    double *cpuData, *cpuTmp, *gpuData;
    setup(gpuData);
    kernel<<< ... >>>(gpuData);
    cudaDeviceSynchronize();
    cudaMemcpy(cpuTmp, gpuData, size, cudaMemcpyDeviceToHost);
    memcpy(cpuData, cpuTmp, size);
    MPI_Send(cpuData, ...);
}
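With a CUDA-aware MPI the staging copies go away: the device pointer is passed straight to MPI_Send, and with GPUDirect RDMA the NIC reads GPU memory directly. A minimal sketch (assumes a CUDA-aware MPI such as Open MPI or MVAPICH2-GDR, two ranks, and error checks omitted):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;
    double *gpuData;
    cudaMalloc(&gpuData, n * sizeof(double));

    if (rank == 0) {
        // kernel<<<grid, block>>>(gpuData);  // produce data on the GPU
        cudaDeviceSynchronize();
        // Device pointer passed directly; no cpuTmp, no memcpy.
        MPI_Send(gpuData, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpuData, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(gpuData);
    MPI_Finalize();
    return 0;
}
```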

GPUDirect RDMA: what does it get you? Latency reduction.
With Shared GPU-Sysmem*, MPI_Send latency is 25µs: no overlap is possible and bidirectional transfer is difficult. With RDMA, MPI_Send latency is 5µs: it does not affect running kernels, allows unlimited concurrency, and makes true RDMA possible. MPI-3 one-sided latency: 3µs.

GPUDirect RDMA: common use cases
Inter-node MPI communication: transfer data between a GPU and a remote node using a CUDA-aware MPI. Interfacing with third-party hardware requires adopting the GPUDirect-Interop API in the vendor driver stack. Limitation: GPUDirect RDMA does not work with CUDA Unified Memory today.

NVLINK

Pascal GPU Features: NVLink and Stacked Memory
NVLink: high-speed GPU interconnect, 80-200 GB/s.
3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.

NVLink Unleashes Multi-GPU Performance: over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe. NVLink is 5x faster than PCIe Gen3 x16. The point is that NVLink provides immediate application speedup simply by letting GPUs communicate with each other faster; most HPC applications are already multi-GPU enabled. Benchmarks (speedup vs. a PCIe-based server): 3D FFT and ANSYS in a 2-GPU configuration; AMBER Cellulose (256x128x128) and FFT (problem size 256^3) in a 4-GPU configuration.

NVLink High-Speed GPU Interconnect. Example: 8-GPU server with NVLink. The important point is the NVLink connection between the GPUs, so GPUDirect functions at 10x PCIe speed.

Machine Learning

Machine Learning using Deep Neural Networks (input to result). Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008, 2009. See "Visual Object Recognition Using Deep Convolutional Neural Networks," Rob Fergus (New York University / Facebook): http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php#2985

3 Drivers for Deep Learning Powerful GPU Accelerators More Data Better Models

Broad Use of GPUs in Deep Learning: early adopters, use cases, talks at GTC.
Use cases: image detection, face recognition, gesture recognition, video search & analytics, speech recognition & translation, recommendation engines, indexing & search; image analytics for Creative Cloud, speech/image recognition, image classification, Hadoop recommendation, search rankings.
There's already quite a body of research in the public domain describing the use of GPUs for machine learning tasks. NVIDIA hosts its annual GPU Technology Conference in San Jose every spring, and this year we had a fantastic number of talks about using CUDA for machine learning and data science. We encourage you to attend: there are lots of opportunities for cross-pollination across domains, especially with machine learning.

What is Next? Analyzing unstructured data, anomaly detection, behavior prediction, diagnostic support, language analysis, ...
"Any product that excites you over the next five years and makes you think: 'That is magical, how did they do that?', is probably based on this [deep learning]."
Steve Jurvetson, Partner, DFJ Venture

Beyond Deep Learning: Graph Analytics, Database Acceleration, Real-Time Analytics
Visualization of social network analysis by Martin Grandjean is licensed under CC BY-SA 3.0

Deep Learning Sessions
GTC 2015 had many deep learning sessions, from Adobe, Google, Alibaba, iFlytek Ltd, Baidu, Nuance, Carnegie Mellon, Stanford University, Facebook, UC Berkeley, Flickr/Yahoo, and the University of Toronto. Check GTC on-demand: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php

NVIDIA in OCP

Unlocking Access to the Tesla Platform
Engaging with OEMs, end customers, and technology partners to include NVIDIA accelerators in the Open Compute Project (OCP) platform: NVLink-based designs for maximum performance, standard PCIe designs for scale-out.

Enabling NVLink GPU-CPU Connections
2014: Kepler GPU connects to x86, ARM64, or POWER CPUs via PCIe.
2016: Pascal GPU adds NVLink, the high-speed GPU interconnect: NVLink directly to the POWER CPU, and PCIe to x86, ARM64, or POWER CPUs.

Thanks cangerer@nvidia.com