
1 Stanford University

2 CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks
Yuanfang Li and Ardavan Pedram*
Stanford University, Cerebras Systems

3 CATERPILLAR

4 Compute Infrastructure Required for Training
The deep learning stack, with an example stack alongside:
- End application: smart camera mobile app
- High-level service: photo recognition API access
- DNN model and data: trained photo CNN
- Computation: compute infrastructure required for training

5 Research Efforts as of July 2017
[Figure: research efforts as of July 2017; source: ScaleDeep, ISCA 2017]

6 The Neural Network Zoo
[Figure: "The Neural Network Zoo" chart, Asimov Institute]

7 RNN, CNN, DFF (MLP)

8 Multilayer Perceptron
Several fully connected layers. The basic operation is matrix-vector multiplication (GEMV); a sketch follows.
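A minimal NumPy sketch of the forward pass, where each fully connected layer is one GEMV followed by a nonlinearity. The layer sizes, sigmoid activation, and initialization are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """Forward pass of an MLP: each layer is a GEMV (W @ h) plus a
    nonlinearity. Returns every layer's activation, which the backward
    pass will need."""
    activations = [x]
    h = x
    for W in weights:
        h = sigmoid(W @ h)          # the GEMV named on the slide
        activations.append(h)
    return activations

# Illustrative 784-100-10 network (MNIST-sized input).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((100, 784)) * 0.01,
           rng.standard_normal((10, 100)) * 0.01]
acts = forward(weights, rng.standard_normal(784))
print([a.shape for a in acts])      # [(784,), (100,), (10,)]
```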

9 Backpropagation
Training an MLP with backpropagation.

10 Backpropagation
The basic operation is again GEMV. The gradient is formed by a rank-1 update (outer product), which is then applied to update the weights. A sketch follows.
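A matching sketch of the backward pass, under the same illustrative assumptions (sigmoid activations, squared-error loss) and using the `forward` sketch above: deltas propagate through a GEMV with the transposed weights, and each layer's gradient is exactly the rank-1 update named on the slide.

```python
import numpy as np

def backward(weights, activations, target, lr=0.1):
    """Backprop for the sigmoid MLP sketched above; updates weights in place.

    The two kernels from the slide:
      * delta propagation: GEMV with the transposed weights (W.T @ delta)
      * gradient: rank-1 update, np.outer(delta, previous activation)
    """
    y = activations[-1]
    delta = (y - target) * y * (1.0 - y)        # output-layer delta
    for i in reversed(range(len(weights))):
        grad = np.outer(delta, activations[i])  # rank-1 update (outer product)
        if i > 0:                               # propagate before updating W[i]
            a = activations[i]
            delta = (weights[i].T @ delta) * a * (1.0 - a)  # GEMV
        weights[i] -= lr * grad                 # update weights
```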

11 Gradient Descent
[Figure: GEMV as C += A × B (m × n), and a timeline of forward activations h and backward deltas δ through layers x, h1, h2, h3, ŷ]

12 Stochastic Gradient Descent
GEMV is inherently inefficient (little data reuse per weight fetch). Hardware requirements:
- Broadcast (systolic or non-systolic)
- Reduction (systolic or tree-based)
[Figure: per-sample timeline; each sample's forward and backward passes run one after another]

13 Batched Gradient Descent
Data parallelism turns GEMV into GEMM, a memory-efficient kernel. The trade-off: the number of weight updates per epoch drops by the batch size. A batched sketch follows.
[Figure: timeline with samples grouped into minibatches (h1:4, h5:8, ...)]
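A minimal sketch of the batched variant, continuing the illustrative network above: stacking a batch of b samples as columns turns every GEMV into a GEMM, and every rank-1 gradient update into a single rank-b update, which is why the kernel becomes memory-efficient while producing one weight update per batch instead of per sample.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_batched(weights, X):
    """X has shape (n_in, b), one column per sample: each layer is a GEMM."""
    activations = [X]
    H = X
    for W in weights:
        H = sigmoid(W @ H)                       # GEMM instead of GEMV
        activations.append(H)
    return activations

def backward_batched(weights, activations, T, lr=0.1):
    """Deltas are (n_out, b) matrices; each gradient is one GEMM
    (delta @ activation.T), i.e., a rank-b update replacing b rank-1 updates."""
    Y = activations[-1]
    b = Y.shape[1]
    delta = (Y - T) * Y * (1.0 - Y)
    for i in reversed(range(len(weights))):
        grad = (delta @ activations[i].T) / b    # rank-b update via GEMM
        if i > 0:
            A = activations[i]
            delta = (weights[i].T @ delta) * A * (1.0 - A)
        weights[i] -= lr * grad                  # one update per batch
```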

14 Direct Feedback Alignment
- Eliminates the layer-to-layer dependence of the backward pass
- Exposes parallelism in the backward pass
- Effective for smaller networks
A sketch follows this list.
[Figure: timeline in which all layers' deltas are computed in parallel]
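A minimal sketch of direct feedback alignment in the same setting. Following Nøkland's formulation, each hidden layer's delta is computed directly from the output error through a fixed random feedback matrix, so no layer waits on the one above it; the feedback scale (0.1) and sigmoid derivative here are illustrative assumptions.

```python
import numpy as np

def make_feedback(hidden_sizes, n_out, rng):
    """One fixed random feedback matrix per hidden layer (illustrative scale)."""
    return [rng.standard_normal((n, n_out)) * 0.1 for n in hidden_sizes]

def backward_dfa(weights, feedback, activations, target, lr=0.1):
    """DFA: every hidden delta comes straight from the output error e via a
    fixed random projection, so the loop body has no layer-to-layer
    dependence and all weight updates could run in parallel."""
    y = activations[-1]
    e = (y - target) * y * (1.0 - y)                 # output error signal
    for i in range(len(weights)):                    # order no longer matters
        if i == len(weights) - 1:
            delta = e                                # output layer uses e itself
        else:
            a = activations[i + 1]
            delta = (feedback[i] @ e) * a * (1.0 - a)
        weights[i] -= lr * np.outer(delta, activations[i])

# For the 784-100-10 example above: feedback = make_feedback([100], 10, rng)
```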

15 Pipelined Continuous Propagation*
- Layer parallelization: inputs are pipelined through the layers
- Layer locality: more efficient GEMVs, smaller reduction tree
- Weight temporal locality: update and consume immediately
*Continuous Propagation (CP), Cerebras Systems, patent pending
[Figure: pipelined timeline; successive samples occupy successive layers at the same time]

16 What Do We Need to Support?
- GEMM and GEMV
- Parallelization between cores
- Collective communications: gather, reduce, all-gather, all-reduce, broadcast (a ring all-reduce sketch follows this list)
- Efficient transpose
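As one concrete example of the collectives listed above, here is a minimal simulation of a textbook ring all-reduce over P "cores". This is the generic algorithm (a reduce-scatter phase, then an all-gather phase, 2(P-1) neighbor messages of size N/P each), not CATERPILLAR's hardware implementation.

```python
import numpy as np

def ring_allreduce(bufs):
    """Ring all-reduce over P equal-length vectors, one per simulated core.

    Phase 1 (reduce-scatter): after P-1 steps, core p holds the fully
    summed segment (p+1) mod P. Phase 2 (all-gather): after P-1 more
    steps, every core holds the complete summed vector.
    """
    P = len(bufs)
    segs = [list(np.array_split(b.astype(float), P)) for b in bufs]

    for t in range(P - 1):                           # reduce-scatter
        msgs = [(p, (p - t) % P, segs[p][(p - t) % P].copy()) for p in range(P)]
        for p, s, data in msgs:                      # all sends, then all receives
            segs[(p + 1) % P][s] += data

    for t in range(P - 1):                           # all-gather
        msgs = [(p, (p - t + 1) % P, segs[p][(p - t + 1) % P].copy()) for p in range(P)]
        for p, s, data in msgs:
            segs[(p + 1) % P][s] = data

    return [np.concatenate(s) for s in segs]

out = ring_allreduce([np.arange(8.0) for _ in range(4)])
assert all(np.allclose(o, 4 * np.arange(8.0)) for o in out)
```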

17 CATERPILLAR Architecture
[Figure: a core is a 16-row × 16-column grid of PEs, labeled (0,0) through (15,15), with an SRAM scratchpad (SPAD) and links to/from cores in the same row and same column]

18 CATERPILLAR Architecture
The PE (the Linear Algebra Core PE):
- Native support for inner product
- Three levels of memory hierarchy: accumulator, Mem B (2 ports), Mem A (1 port)
- Distributed-memory programming model
- Reprogrammable state machine

19 CATERPILLAR Architecture
The core:
- GEMM: optimized for rank-1 updates via a broadcast bus
- GEMV: systolic links between neighboring PEs accelerate the reduction

20 CATERPILLAR Architecture
Multicore: a reconfigurable ring of cores with support for collective communications: all-gather, reduce, all-reduce (systolic or parallel).


22 GEMV

23 Forward Path
The current layer's weights are partitioned across cores (Core 1 partition, Core 2 partition). The input activation arriving from the previous layer is broadcast to the partitions, each core computes its piece of the GEMV, and the partial results are combined by reduce steps; the output activation is transposed and sent in time to the next layer. A sketch follows.
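A minimal simulation of this partitioned forward-path GEMV on two "cores". The column-wise split of the weights, with matching input slices broadcast to each core and a plain sum standing in for the hardware reduce, is an assumption consistent with the broadcast and reduce stages in the figure.

```python
import numpy as np

def partitioned_gemv(W, x, n_cores=2):
    """Forward path across cores: split the layer's weights column-wise so
    each core holds a slice, broadcast the matching slice of the input
    activation, GEMV locally, then reduce (sum) the partial outputs."""
    W_parts = np.array_split(W, n_cores, axis=1)
    x_parts = np.array_split(x, n_cores)
    partials = [Wp @ xp for Wp, xp in zip(W_parts, x_parts)]  # local GEMVs
    return np.sum(partials, axis=0)                           # reduce step

rng = np.random.default_rng(1)
W, x = rng.standard_normal((8, 6)), rng.standard_normal(6)
assert np.allclose(partitioned_gemv(W, x), W @ x)             # matches one big GEMV
```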

24 Delta Path
The output delta comes back from the next layer, is broadcast to each core's partition, multiplied against the (transposed) partitioned weights, and reduced into the input delta sent to the previous layer.

25 Multicore GEMM: All-Gather
Off-core memory distribution with an all-gather; batched samples stay on chip, then proceed to the next layer.
[Figure: operand distribution for the multicore GEMM]

26 Multicore GEMM: All-Reduce
Off-core memory distribution with an all-reduce; batched samples stay on chip.
[Figure: operand distribution for the multicore GEMM]

27–37 The Bucket Algorithm
Source: Robert van de Geijn
[Figure sequence: eleven animation steps of the bucket algorithm for distributed GEMM; a ring-style sketch follows]
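Without the animation frames, here is a minimal simulation of a ring-style distributed GEMM in the spirit of van de Geijn's bucket algorithm: each core keeps its own block row of A and of C resident, while the block rows of B circulate around the ring, one neighbor hop per step. The exact blocking in the slides may differ; this is the generic textbook pattern.

```python
import numpy as np

def ring_gemm(A, B, P=4):
    """C = A @ B on P simulated cores. Core p owns block row p of A and C;
    B's block rows travel around the ring, and each core multiplies the
    matching block column of its A panel against whichever B panel it
    currently holds. After P steps every C block row is complete."""
    A_rows = np.array_split(A, P, axis=0)            # resident A panels
    B_rows = list(np.array_split(B, P, axis=0))      # circulating B panels
    k_splits = np.cumsum([b.shape[0] for b in B_rows])[:-1]
    C_rows = [np.zeros((a.shape[0], B.shape[1])) for a in A_rows]

    holder = list(range(P))                          # which B panel each core holds
    for _ in range(P):
        for p in range(P):
            q = holder[p]
            A_pq = np.split(A_rows[p], k_splits, axis=1)[q]
            C_rows[p] += A_pq @ B_rows[q]            # local block GEMM
        holder = [holder[(p - 1) % P] for p in range(P)]  # pass panels along the ring

    return np.vstack(C_rows)

rng = np.random.default_rng(2)
A, B = rng.standard_normal((8, 12)), rng.standard_normal((12, 6))
assert np.allclose(ring_gemm(A, B), A @ B)
```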

38 Methodology
Networks, dataset, and algorithms:
- Dataset: MNIST
- Batch sizes: 2, 4, 8, 50, 100
- Number of layers: 4, 5, 6; deep and wide network architectures

Hardware configuration:
- Half-precision FPUs
- 16 KB of local memory; 512 KB of private SRAM per core
- 45 nm, 1 GHz
- 2×16 cores with 4×4 PEs: 103.2 mm²
- 2×4 cores with 16×16 PEs: 178.9 mm²

39 Pure Convergence Analyses
[Figure: convergence results. CP: Cerebras Systems, patent pending]

40 Pure Convergence Analyses
[Figure: convergence results, continued]

41 Pure Convergence Analyses
[Figure: convergence results, continued]

42 Hardware Analyses
Combine epochs-to-convergence with hardware costs:
- Energy to convergence and time to convergence
- Network size: fits / does not fit on the cores
- Bigger networks converge faster but need more compute
- Batched algorithms use GEMM: faster per iteration, but slower to converge

43 Energy to Convergence (32 4×4 cores)
For large networks, MBGD can perform better in terms of energy than SGD even when there is enough local memory to store the entire network. Further, CP consistently outperforms all other training methods.
[Figure: energy to convergence, split into networks that fit on the cores and networks that do not. CP: Cerebras Systems, patent pending]

44 Time to Accuracy
- Going off-core is expensive
- Minibatched algorithms converge faster than non-minibatched ones when the network does not fit
[Figure: time to accuracy. CP: Cerebras Systems, patent pending]

45 Conclusion
- Training algorithms for MLP DNNs and their effect on convergence
- Exploration of the design space of accelerators for the various backpropagation algorithms
- CATERPILLAR: both GEMV and GEMM kernels, plus collective communications
- If the network fits on the cores, pipelined backpropagation (CP) consistently performs best
- If the network does not fit, minibatched algorithms have performance comparable to pipelined backpropagation
CP: Cerebras Systems, patent pending

46 CATERPILLAR

