Large-scale Deep Unsupervised Learning using Graphics Processors
Rajat Raina, Anand Madhavan, Andrew Y. Ng
Stanford University
Learning from unlabeled data
[Figure: classify car vs. motorcycle in the input space; use unlabeled examples to learn a higher-level representation, and classify in that higher-level representation instead.]
Methods: Deep Belief Networks, Sparse Coding.
The promise of unsupervised learning
Use large amounts of unlabeled data to learn complex/deep models, possibly with many parameters.
Some recent work on DBNs (similar situation for sparse coding):
  Published source           Domain                  Number of free parameters
  Hinton et al.              Handwritten digits      1.6 million
  Hinton & Salakhutdinov     Face images             3 million
  Salakhutdinov & Hinton     Information retrieval   2.6 million
  Ranzato & Szummer          Text documents          3.6 million
  Our DBN model over images                          100 million
Large-scale learning [Banko & Brill, 2001]
Large-scale unsupervised learning
Current models: 1000s of input dimensions, 1000s of hidden units, about 10^6 parameters.
Our desired model: 10^8 parameters.
Graphics Processors
[Figure: motherboard schematic showing the CPU and RAM alongside the graphics card (GPU).]
Why graphics processors?
[Chart: peak Gflops (billion ops/sec), 2003-2008. NVIDIA GPUs climb toward 1000 Gflops over this period, far ahead of Intel CPUs. Source: NVIDIA CUDA Programming Guide.]
Why graphics processors?
IBM ASCI White supercomputer: cost $110 million, space of 2 basketball courts.
Comparable peak compute today: 13 graphics cards.
GPU Schematic
[Figure: 30 multiprocessors (MPs), each with scalar processors (SPs), registers, and 16 KB of shared memory accessed at about 1000 GB/s; a global memory of about 1 GB accessed at about 100 GB/s with coalesced access; and a slow transfer path from host RAM. Note: some additional features not displayed.]
GPU Programming
Two-level parallelism: split the task into blocks, and blocks into threads. Threads access global memory (not RAM), with restrictions on memory access patterns.
Main bottleneck: getting data into GPU memory, and accessing it in efficient ways.
NVIDIA CUDA: high-level routines to allocate/copy GPU memory, and good GPU matrix libraries that suffice for many machine learning tasks.
Unsupervised learning on GPUs
  Initialize parameters in global memory.
  while convergence criterion is not satisfied:
    Periodically transfer a large number of unlabeled examples into global memory.
    Pick a few of the unlabeled examples at a time, and compute the updates in parallel using the GPU's two-level parallelism (blocks and threads) or GPU matrix libraries.
  end
  Transfer learnt parameters from global memory.
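A minimal CUDA C sketch of this host-side loop, under stated assumptions: the kernel computeUpdate, the helper nextChunk, and all sizes are hypothetical placeholders, not the paper's code.

#include <cuda_runtime.h>
#include <stdlib.h>

#define DIM       1024        /* input dimensionality (hypothetical) */
#define CHUNK     4096        /* unlabeled examples per host-to-GPU copy */
#define BATCH     192         /* examples used per parameter update */
#define N_PARAMS  (1 << 20)   /* number of model parameters (hypothetical) */
#define N_EPOCHS  10          /* stand-in for the convergence criterion */

/* Placeholder kernel: the model-specific update (e.g. a contrastive-divergence
   or sparse-coding step) would be computed here by the grid's blocks/threads. */
__global__ void computeUpdate(float* params, const float* examples, int batch, int dim) {
    /* ... model-specific update over `batch` examples ... */
}

/* Fills hostBuf with n fresh unlabeled examples (e.g. read from disk). */
void nextChunk(float* hostBuf, int n) { /* ... */ }

void train(float* hostParams) {
    float *d_params, *d_examples;
    float *hostBuf = (float*)malloc((size_t)CHUNK * DIM * sizeof(float));
    cudaMalloc((void**)&d_params, N_PARAMS * sizeof(float));
    cudaMalloc((void**)&d_examples, (size_t)CHUNK * DIM * sizeof(float));

    /* Initialize parameters in global memory. */
    cudaMemcpy(d_params, hostParams, N_PARAMS * sizeof(float), cudaMemcpyHostToDevice);

    for (int epoch = 0; epoch < N_EPOCHS; epoch++) {   /* "while not converged" */
        /* Periodically transfer a large number of unlabeled examples into global memory. */
        nextChunk(hostBuf, CHUNK);
        cudaMemcpy(d_examples, hostBuf, (size_t)CHUNK * DIM * sizeof(float),
                   cudaMemcpyHostToDevice);
        /* Use a few examples at a time; the updates themselves stay on the GPU. */
        for (int i = 0; i + BATCH <= CHUNK; i += BATCH)
            computeUpdate<<<32, 128>>>(d_params, d_examples + (size_t)i * DIM, BATCH, DIM);
        cudaDeviceSynchronize();
    }

    /* Transfer learnt parameters back from global memory. */
    cudaMemcpy(hostParams, d_params, N_PARAMS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_params);
    cudaFree(d_examples);
    free(hostBuf);
}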
Deep Belief Networks
Restricted Boltzmann Machine (RBM)
[Figure: a bipartite undirected model with a layer of hidden units h = (h1, h2, ...) over a layer of visible units v = (v1, v2, v3, ...).]
Contrastive divergence learning via the conditional distributions:
  P(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i)
  P(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
Weight update: delta w_ij is proportional to <v_i h_j>_data - <v_i h_j>_reconstruction.
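As a rough illustration (not the paper's implementation, which relies on optimized GPU matrix routines), a CUDA kernel computing the hidden-unit probabilities for a minibatch might look as follows; the names and launch sizes are hypothetical.

#include <cuda_runtime.h>
#include <math.h>

/* P(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i) for every example in a minibatch.
   One block per example; each thread handles a strided subset of hidden units. */
__global__ void hiddenProbs(const float* W,   /* numHid x numVis weights, row-major */
                            const float* c,   /* hidden biases, length numHid */
                            const float* V,   /* minibatch of visible vectors, batch x numVis */
                            float* H,         /* output probabilities, batch x numHid */
                            int numVis, int numHid) {
    const float* v = V + (size_t)blockIdx.x * numVis;   /* this block's example */
    float* h = H + (size_t)blockIdx.x * numHid;
    for (int j = threadIdx.x; j < numHid; j += blockDim.x) {
        float act = c[j];
        for (int i = 0; i < numVis; i++)
            act += W[(size_t)j * numVis + i] * v[i];
        h[j] = 1.0f / (1.0f + expf(-act));               /* logistic sigmoid */
    }
}

/* Example launch for a batch of 192 examples:
   hiddenProbs<<<192, 128>>>(d_W, d_c, d_V, d_H, numVis, numHid); */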
Experimental setup
Single graphics card: NVIDIA GTX 280, with 1 GB on-board memory and 240 cores. Current price: US $250.
CPU: two cores, each @ 3.16 GHz.
Learning Large RBMs
[Chart: learning time for 10 million examples (log scale) vs. number of parameters (1, 18, 36, and 45 million). Dual-core CPU times grow from a few hours to more than 2 weeks; GPU times stay around an hour or less. Up to 72x faster.]
Overlapping patches DBN
[Figure: the input image is covered by overlapping patches; patch A and patch B each connect to their own hidden units (hidden units A, hidden units B) through separate parameters (W_A, b_A, c_A) and (W_B, b_B, c_B).]
Overlapping patches DBN example: 110 million parameters.
Input image of 144x144 pixels (20,736 units); a first hidden layer of 32,768 units (128 units per 24x24 patch); then layers of 15,680, 8,192, and 2,048 units.
All layers can be learnt in about 1 day on a GPU.
Sparse Coding
Sparse coding
Example decomposition of an input: x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411 (basis vectors b, activations a).
Given unlabeled data x^(i), obtain the basis vectors b by solving:
  minimize over b, a:  sum_i || x^(i) - sum_j a_j^(i) b_j ||^2 + beta * sum_{i,j} | a_j^(i) |,  subject to ||b_j|| <= 1.
Alternating minimization:
  Keep a fixed, find optimal b.
  Keep b fixed, find optimal a.
Parallel Sparse Coding
Alternating minimization:
  Keep a fixed, find optimal b. Easy on GPU (projected gradient descent; the projection step is sketched below).
  Keep b fixed, find optimal a. Not as straightforward. Need to parallelize, for each example:
    minimize over a:  || x - sum_j a_j b_j ||^2 + beta * ||a||_1.
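A minimal sketch of the projection used in the basis update, assuming column-major storage of the bases and a power-of-two block size; the gradient step itself is a dense matrix operation handled by a GPU matrix library. The kernel name and layout are assumptions, not the paper's code.

/* One block per basis vector; threads cooperate to compute ||b_j||^2 and,
   if needed, rescale b_j back onto the constraint set ||b_j|| <= 1. */
__global__ void projectBases(float* B, int dim) {        /* B: dim x numBases, column-major */
    extern __shared__ float partial[];                   /* blockDim.x floats */
    float* b = B + (size_t)blockIdx.x * dim;             /* this block's basis vector b_j */
    float s = 0.0f;
    for (int i = threadIdx.x; i < dim; i += blockDim.x)
        s += b[i] * b[i];
    partial[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {  /* tree-reduce ||b_j||^2 */
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    float norm = sqrtf(partial[0]);
    if (norm > 1.0f)                                      /* project back onto ||b_j|| <= 1 */
        for (int i = threadIdx.x; i < dim; i += blockDim.x)
            b[i] /= norm;
}

/* Launch: projectBases<<<numBases, 128, 128 * sizeof(float)>>>(d_B, dim); */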
Parallel Sparse Coding
Easy to optimize for one coordinate a_j, keeping the others fixed: the optimum has a closed form given by soft thresholding (Friedman et al., 2007).
One iteration of our algorithm: compute each coordinate's optimum a_j* in parallel (with the other coordinates held fixed), take the descent direction d = a* - a, and move along d with a line search.
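A sketch of the per-coordinate step under these assumptions: the residual r = x - B*a is maintained with GPU matrix routines, bases are stored column-major with precomputed squared norms, and each thread handles one coordinate. Names and layout are hypothetical, not the paper's kernel.

/* With the other coordinates fixed, the minimizer of
       || x - sum_k a_k b_k ||^2 + beta * |a_j|
   over a_j is a soft-thresholded inner product. Each thread writes one a_j*;
   the descent direction is then d = a* - a, followed by a line search. */
__global__ void coordinateOptima(const float* B,         /* dim x numBases, column-major */
                                 const float* r,         /* residual x - B*a, length dim */
                                 const float* a,         /* current activations */
                                 const float* colNormSq, /* ||b_j||^2 for each basis */
                                 float* aStar,           /* output: coordinate-wise optima */
                                 int dim, int numBases, float beta) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= numBases) return;
    const float* bj = B + (size_t)j * dim;
    float z = 0.0f;
    for (int i = 0; i < dim; i++)
        z += bj[i] * r[i];
    z += a[j] * colNormSq[j];                 /* = b_j' * (x - sum_{k != j} a_k b_k) */
    float mag = fabsf(z) - 0.5f * beta;       /* soft threshold at beta / 2 */
    aStar[j] = (mag > 0.0f) ? copysignf(mag, z) / colNormSq[j] : 0.0f;
}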
Sparse coding with 10^6 parameters
[Chart: learning time with 10 million examples, at 3% and 10% nonzero sparsity. Dual-core CPU: about 19 days; GPU: about 6 hours to 1 day. Roughly 15x faster.]
Summary
Large-scale unsupervised learning: ten times more data might transform an OK algorithm into a good algorithm, and working at smaller scale risks confounding the effects of the model itself with the effects of scale.
GPUs are a powerful tool for machine learning: easy to program (no low-level programming), and especially useful for stochastic learning methods. Learning algorithms for DBNs and sparse coding can be an order of magnitude faster.
THE END
Why graphics processors?
In modern computing, sometimes the bottleneck is getting data into the processor from the memory.
[Chart: bandwidth from memory to processor (GB/s), 2003-2007. NVIDIA GPUs reach well over 100 GB/s while Intel CPUs remain far lower. Source: NVIDIA CUDA Programming Guide.]
GPU Programming: A = A + B

#include <cuda_runtime.h>

#define SIZE        (32 * 128)               /* 32 blocks of 128 threads, one element each */
#define SIZE_BYTES  (SIZE * sizeof(float))

/* GPU: each thread adds one element. */
__global__ void vecAdd(float* A, float* B) {
    int my = threadIdx.x + blockIdx.x * 128;
    A[my] = A[my] + B[my];
}

/* CPU: allocate GPU memory, copy the data over, launch the kernel, copy the result back. */
int main(int argc, char** argv) {
    float A[SIZE], B[SIZE];
    float *d_A, *d_B;
    cudaMalloc((void**)&d_A, SIZE_BYTES);
    cudaMalloc((void**)&d_B, SIZE_BYTES);
    cudaMemcpy(d_A, A, SIZE_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, SIZE_BYTES, cudaMemcpyHostToDevice);
    vecAdd<<<32, 128>>>(d_A, d_B);
    cudaThreadSynchronize();
    cudaMemcpy(A, d_A, SIZE_BYTES, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}

(Adapted from http://www.cs.technion.ac.il/~marks/docs/LinuxClubGPGPU.pdf)