Presentation transcript:

Stanford University

CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks. Yuanfang Li and Ardavan Pedram*. Stanford University, Cerebras Systems.

Compute Infrastructure Required for Training. [Figure: an example deep learning stack. End application: smart camera / mobile app; high-level service: photo recognition API access; model and data: trained photo CNN/DNN; computation: the compute infrastructure required for training.]

Research Efforts as of July 2017. [Figure: survey of DNN accelerator research efforts; source: ScaleDeep, ISCA 2017.]

The Neural Network Zoo, by the Asimov Institute: http://www.asimovinstitute.org/neural-network-zoo/

RNN, CNN, DFF (MLP)

Multilayer Perceptron: several fully connected layers. Basic operation: matrix-vector multiplication (GEMV).
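
As a point of reference, here is a minimal NumPy sketch of this forward pass, one GEMV per fully connected layer (the layer sizes and the sigmoid nonlinearity are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer sizes; the slides do not fix a particular network here.
sizes = [784, 256, 128, 10]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_out, n_in)) * 0.01
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, weights):
    """Forward pass of a fully connected MLP: one GEMV (W @ h) per layer."""
    activations = [x]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))  # GEMV
    return activations

acts = forward(rng.standard_normal(sizes[0]), weights)
print([a.shape for a in acts])  # (784,), (256,), (128,), (10,)
```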

Backpropagation: training the MLP with backpropagation. [Figure: forward and backward passes through the network.]

Backpropagation. Basic operations: GEMV to propagate deltas; rank-1 update (outer product) to update the gradient; rank-1 update (outer product) to update the weights.
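
A hedged sketch of where each of these kernels appears in one per-sample training step: a GEMV with the transposed weights propagates the delta, an outer product (rank-1 update) forms the gradient, and another rank-1 update applies it to the weights. Squared-error loss, sigmoid units, and the layer sizes are assumptions made for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    acts = [x]
    for W in weights:
        acts.append(sigmoid(W @ acts[-1]))            # GEMV per layer
    return acts

def backprop_step(weights, x, target, lr=0.01):
    """One per-sample backpropagation step (squared-error loss assumed)."""
    acts = forward(x, weights)
    y = acts[-1]
    delta = (y - target) * y * (1.0 - y)              # output-layer delta
    for l in reversed(range(len(weights))):
        grad = np.outer(delta, acts[l])               # rank-1 update: gradient
        if l > 0:                                     # GEMV with W^T: propagate delta
            delta = (weights[l].T @ delta) * acts[l] * (1.0 - acts[l])
        weights[l] -= lr * grad                       # rank-1 update: weights
    return weights

rng = np.random.default_rng(0)
sizes = [784, 256, 10]
weights = [rng.standard_normal((o, i)) * 0.01 for i, o in zip(sizes[:-1], sizes[1:])]
weights = backprop_step(weights, rng.standard_normal(784), np.eye(10)[3])
```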

Gradient Descent. [Figure: the GEMV kernel C += A×B (m×n) applied across the network x → h1 → h2 → h3 → ŷ; forward activations h and backward deltas δ laid out over time.]

Stochastic Gradient Descent. GEMV is inherently inefficient. Requirements: broadcast (systolic/non-systolic) and reduction (systolic/tree-based). [Figure: per-sample timeline of forward activations h and backward deltas δ.]

Batched Gradient Descent. Data parallelism: GEMV → GEMM. GEMM is a memory-efficient kernel, and the number of weight updates is divided by the batch size. [Figure: timeline with batches of samples processed together.]
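
A small NumPy sketch of the GEMV-to-GEMM transition: stacking a minibatch of activations as columns turns the per-sample matrix-vector products into one matrix-matrix product, and the sum of per-sample rank-1 gradient updates into a single GEMM as well (shapes and the batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 256, 50
W = rng.standard_normal((n_out, n_in)) * 0.01

X = rng.standard_normal((n_in, batch))            # one sample per column

# Per-sample view: `batch` separate GEMVs.
H_gemv = np.stack([W @ X[:, b] for b in range(batch)], axis=1)

# Batched view: a single GEMM over the whole minibatch.
H_gemm = W @ X
assert np.allclose(H_gemv, H_gemm)

# Weight gradient: a sum of rank-1 updates equals one GEMM with the batched deltas.
Delta = rng.standard_normal((n_out, batch))       # stand-in for backpropagated deltas
grad_rank1 = sum(np.outer(Delta[:, b], X[:, b]) for b in range(batch))
grad_gemm = Delta @ X.T
assert np.allclose(grad_rank1, grad_gemm)
```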

Direct Feedback Alignment. Dependence elimination: parallelism in the backward pass; effective for smaller networks. [Figure: timeline with backward deltas computed in parallel.]
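
A hedged NumPy sketch of the direct feedback alignment idea: each hidden layer receives the output error through its own fixed random feedback matrix, so the layer deltas no longer depend on one another and can be computed in parallel. The shapes, loss, and sigmoid nonlinearity are illustrative assumptions, not details from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
weights = [rng.standard_normal((o, i)) * 0.01 for i, o in zip(sizes[:-1], sizes[1:])]
# One fixed random feedback matrix per hidden layer: it maps the output error
# directly to that layer's delta, removing the layer-to-layer dependence.
feedback = [rng.standard_normal((n, sizes[-1])) for n in sizes[1:-1]]

def dfa_step(weights, feedback, x, target, lr=0.01):
    acts = [x]
    for W in weights:
        acts.append(sigmoid(W @ acts[-1]))
    err = acts[-1] - target                              # output error
    # Hidden-layer deltas depend only on the output error -> parallel backward pass.
    deltas = [(B @ err) * acts[l + 1] * (1.0 - acts[l + 1])
              for l, B in enumerate(feedback)]
    deltas.append(err * acts[-1] * (1.0 - acts[-1]))     # output layer
    for l, d in enumerate(deltas):
        weights[l] -= lr * np.outer(d, acts[l])          # rank-1 updates as before
    return weights

weights = dfa_step(weights, feedback, rng.standard_normal(784), np.eye(10)[3])
```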

Pipelined Continuous Propagation*. Layer parallelization by pipelining inputs. Layer locality: more efficient GEMVs and a smaller reduction tree. Weight temporal locality: update and consume immediately. [Figure: pipelined timeline across layers.] *Continuous Propagation (CP), Cerebras Systems, patent pending.

What Do We Need to Support? GEMM and GEMV; parallelization between cores; collective communications (gather, reduce, all-gather, all-reduce, broadcast); efficient transpose.

CATERPILLAR Architecture. [Figure: 16 rows × 16 columns of PEs per core with an SRAM scratchpad, and links to/from the cores in the same row and same column.]

CATERPILLAR Architecture: PE (the linear algebra core PE). Native support for inner product; three levels of memory hierarchy (accumulator, Mem B with 2 ports, Mem A with 1 port); distributed-memory programming model; reprogrammable state machine.

CATERPILLAR Architecture: Core. GEMM optimized for rank-1 updates over a broadcast bus; GEMV systolic between neighboring PEs to accelerate the reduction.

CATERPILLAR Architecture: Multicore. Ring of cores; reconfigurable support for collective communications (all-gather, reduce, all-reduce), systolic/parallel.
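
To make the collective concrete, here is a minimal software simulation of a ring all-reduce over per-core partial results (reduce-scatter followed by all-gather). It illustrates the communication pattern only; it is not the CATERPILLAR hardware implementation:

```python
import numpy as np

def ring_all_reduce(chunks):
    """Simulate an all-reduce over a ring of len(chunks) cores.

    Phase 1 (reduce-scatter): each core ends up owning the full sum of one segment.
    Phase 2 (all-gather): the reduced segments circulate until every core has them all.
    """
    p = len(chunks)
    parts = [list(np.array_split(c.astype(float), p)) for c in chunks]

    # Reduce-scatter: at each step, core i passes one segment to core i+1, which adds it.
    for step in range(p - 1):
        sends = [(i, (i - step) % p, parts[i][(i - step) % p].copy()) for i in range(p)]
        for i, seg, data in sends:
            parts[(i + 1) % p][seg] += data

    # All-gather: circulate the fully reduced segments around the ring.
    for step in range(p - 1):
        sends = [(i, (i + 1 - step) % p, parts[i][(i + 1 - step) % p].copy()) for i in range(p)]
        for i, seg, data in sends:
            parts[(i + 1) % p][seg] = data

    return [np.concatenate(parts[i]) for i in range(p)]

rng = np.random.default_rng(0)
partials = [rng.standard_normal(12) for _ in range(4)]   # per-core partial results
reduced = ring_all_reduce(partials)
assert all(np.allclose(r, sum(partials)) for r in reduced)
```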

GEMV

Forward Path. [Figure: the current layer's weights are partitioned across cores; the input activation from the previous layer is broadcast to each core's partition, the partial results are reduced, and the output activation is transposed and sent in time to the next layer.]
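
One way to read this figure, sketched in NumPy: the layer's weight matrix is split column-wise across two cores, each core multiplies its block by the matching slice of the broadcast input activation, and a reduction sums the partial results into the output activation. The two-core split and the shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256
W = rng.standard_normal((n_out, n_in))
x = rng.standard_normal(n_in)                     # input activation from previous layer

# Column-wise partition of the current layer's weights across two "cores".
W1, W2 = W[:, : n_in // 2], W[:, n_in // 2 :]
x1, x2 = x[: n_in // 2], x[n_in // 2 :]           # matching slices of the activation

partial1 = W1 @ x1                                # each core computes a partial GEMV
partial2 = W2 @ x2

y = partial1 + partial2                           # reduction -> output activation
assert np.allclose(y, W @ x)
```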

Delta Path. [Figure: the output delta coming back from the next layer is broadcast to each core's partition, the partial results are reduced, and the input delta is transposed and sent back to the previous layer.]

Multicore GEMM (forward). [Figure: weights distributed in off-core memory; batched samples kept on chip; an all-gather collects the results before moving to the next layer.]

Multicore GEMM (weight update). [Figure: batched samples kept on chip; partial results combined with an all-reduce; weights distributed in off-core memory.]
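
A NumPy sketch of the weight-update side of this picture: each core computes a GEMM over its share of the batched samples kept on chip, and an all-reduce (here just a sum) combines the partial weight gradients. Two cores and random stand-in deltas are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 256, 8
X = rng.standard_normal((n_in, batch))            # batched input activations
Delta = rng.standard_normal((n_out, batch))       # batched backpropagated deltas

# Each core keeps half of the batched samples on chip.
X1, X2 = X[:, : batch // 2], X[:, batch // 2 :]
D1, D2 = Delta[:, : batch // 2], Delta[:, batch // 2 :]

G1 = D1 @ X1.T                                    # local GEMM -> partial weight gradient
G2 = D2 @ X2.T

G = G1 + G2                                       # all-reduce: every core gets the full gradient
assert np.allclose(G, Delta @ X.T)
```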

The Bucket Algorithm (source: Robert van de Geijn). [Figure: step-by-step animation of the bucket algorithm for distributed matrix multiplication.]

Methodology. Networks, dataset, and algorithms: MNIST; batch sizes 2, 4, 8, 50, 100; 4, 5, and 6 layers; deep & wide network 2500-2000-1500-1000-500-10. Architecture: half-precision FPU; 16 KB of local memory; 512 KB private SRAM per core; 45 nm at 1 GHz; 2×16 cores with 16×16 PEs (103.2 mm²); 2×4 cores with 4×4 PEs (178.9 mm²).

Pure Convergence Analyses. [Figures: convergence plots for the different training algorithms.] (CP: Cerebras Systems, patent pending)

Hardware Analyses. Combine epochs to convergence with the hardware: energy to convergence and time to convergence. Network size: fits / does not fit on the cores; bigger networks converge faster but need more compute. Batched algorithms use GEMM: faster per epoch, but converge more slowly.

Energy to Convergence (32 4×4 cores). 500-500-500-10 fits on the cores; 2500-2000-1500-1000-500-10 does not. For large networks, MBGD can perform better in terms of energy than SGD even when there is enough local memory to store the entire network. Further, CP consistently outperforms all other training methods. (CP: Cerebras Systems, patent pending)

Time to Accuracy. Going off-core is expensive; minibatched algorithms converge faster than non-minibatched ones if the network does not fit. (CP: Cerebras Systems, patent pending)

Conclusion. Training algorithms for MLP DNNs and their effect on convergence; exploration of the design space of accelerators for various BP algorithms. CATERPILLAR supports both GEMV and GEMM kernels and collective communications. If the network fits on the cores, pipelined backpropagation (CP) consistently performs best; if it does not fit, minibatched algorithms have comparable performance to pipelined backpropagation. (CP: Cerebras Systems, patent pending)
