1 FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision
Mohsen Imani, Saransh Gupta, Yeseong Kim, Tajana Rosing
University of California San Diego, System Energy Efficiency Lab

2 Deep Learning
Deep learning is the state-of-the-art approach for video analysis:
- Videos are 70% of today's internet traffic
- Over 300 hours of video are uploaded to YouTube every minute
- Over 500 million hours of video surveillance are collected every day
"Training a single AI model can emit as much carbon as five cars in their lifetimes" (MIT Technology Review)
Slide from: V. Sze presentation, MIT '17

3 Computing Challenges
Data movement is very expensive!
Slide from: V. Sze, et al., "Hardware for Machine Learning: Challenges and Opportunities," 2017

4 DNN Challenges in Training
Existing platforms (TFLite, Apple AI, Huawei NPU, Nervana) don't support full training due to energy inefficiency. How about using existing PIM architectures?
DNN/CNN training requires:
1. A highly parallel architecture
2. High-precision computation
3. Handling large data movement

5 Digital-based Processing In-Memory
Operations:
- Bitwise: NOR, AND, XOR, …
- Arithmetic: addition, multiplication
- Search-based: exact/nearest search
Advantages:
- Works on digital data: no ADC/DAC
- In-place computation where the big data is stored: eliminates data movement
- Simultaneous computation in all memory blocks: high parallelism
- Flexible operations: fixed- or floating-point

6 Digital PIM Operations
[Figure: in-memory operations on the crossbar — bitwise NOR(A, B); row-parallel arithmetic (addition C = A + B, multiplication C = A × B) using row drivers and detectors; search-based exact search of a query Q.]

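Since NOR is functionally complete, the crossbar can compose any arithmetic from row-parallel NOR passes. Below is a minimal Python sketch of a NOR-only ripple-carry adder operating on one bit position of every row at once; the helper names and bit layout are illustrative assumptions, not the paper's circuit.

```python
import numpy as np

def NOR(a, b):
    # The only primitive: one row-parallel NOR pass over the crossbar.
    return 1 - (a | b)

# Everything else is composed from NOR (NOR is functionally complete).
def NOT(a):    return NOR(a, a)
def OR(a, b):  return NOT(NOR(a, b))
def AND(a, b): return NOR(NOT(a), NOT(b))
def XOR(a, b):
    n = NOR(a, b)
    return NOT(NOR(NOR(a, n), NOR(b, n)))

def ripple_add(a_bits, b_bits):
    """Bit-serial addition: each list entry is one bit column (LSB first),
    holding that bit for every memory row, so all rows add in parallel."""
    carry = np.zeros_like(a_bits[0])
    out = []
    for a, b in zip(a_bits, b_bits):
        s = XOR(XOR(a, b), carry)
        carry = OR(AND(a, b), AND(carry, XOR(a, b)))
        out.append(s)
    return out + [carry]

# Two rows adding in parallel: row 0 computes 5 + 3, row 1 computes 2 + 1.
a = [np.array([1, 0]), np.array([0, 1]), np.array([1, 0]), np.array([0, 0])]
b = [np.array([1, 1]), np.array([1, 0]), np.array([0, 0]), np.array([0, 0])]
print(ripple_add(a, b))  # row 0: 8 (00010, LSB first), row 1: 3 (11000)
```

The key property the sketch mirrors is that the cost of one NOR pass is independent of the number of rows, which is where the row-parallelism of digital PIM comes from.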

8 Neural Networks
[Figure: one neuron in feed-forward and back-propagation — input Zi is weighted by Wij to form aj; the activation function g gives Zj = g(aj); the derivative activation g'(aj) is kept for back-propagation.]
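For reference, the standard feed-forward step in the slide's symbols:

```latex
a_j = \sum_i W_{ij}\, z_i, \qquad z_j = g(a_j)
```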

9 Vector-Matrix Multiplication
Digital PIM computes row-parallel, so it doesn't support addition across rows. FloatPIM therefore stores the transposed weight matrix, row-parallel copies the transposed input (a1 a2 a3 a4) into every row, multiplies element-wise in place, and then reduces each row with in-memory addition.
[Figure: input vector and weight matrix, their transposed layout in memory, and the row-parallel copy → multiplication → addition steps.]
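A functional sketch of this dataflow in Python/NumPy; it reproduces the layout and step order, not the circuit (the function name and shapes are assumptions):

```python
import numpy as np

def pim_matvec(W, z):
    """Row-parallel matrix-vector product, FloatPIM style.

    The transposed weights put each output neuron's weights in one row;
    the input is copied into every row, multiplied in place, and each
    row's products are then reduced with in-memory addition."""
    Wt = W.T                           # transposed weight matrix in the block
    Z = np.tile(z, (Wt.shape[0], 1))   # row-parallel copy of the transposed input
    P = Wt * Z                         # row-parallel element-wise multiplication
    return P.sum(axis=1)               # addition within each row (never across rows)

W = np.random.randn(4, 3)              # 4 inputs, 3 outputs
z = np.random.randn(4)
assert np.allclose(pim_matvec(W, z), W.T @ z)
```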

10 Neural Network: Convolution Layer
How do the convolution windows move in memory? Expanding the weights into shifted copies (w1 w2 w3 w4 replicated per window) requires memory writes, and writing to memory is too slow. FloatPIM instead keeps the weights in place and uses an input shifter to align the stored inputs (Z1 … Z9) with the weights before the row-parallel multiplication and addition steps.
[Figure: 3×3 input Z1–Z9 convolved with 2×2 weights w1–w4 using the input shifter.]
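The same idea in NumPy: shift views of the input under fixed weights and accumulate, rather than writing shifted weight copies. A sketch under those assumptions (like most deep-learning "convolutions," it computes a valid cross-correlation):

```python
import numpy as np

def pim_conv2d(Z, w):
    """Convolution by shifting the input instead of rewriting the weights."""
    kH, kW = w.shape
    oH, oW = Z.shape[0] - kH + 1, Z.shape[1] - kW + 1
    out = np.zeros((oH, oW))
    for p in range(kH):
        for q in range(kW):
            # One shifter step: a shifted view of the input aligned with the
            # in-place weight w[p, q], followed by multiply-accumulate.
            out += w[p, q] * Z[p:p + oH, q:q + oW]
    return out

Z = np.arange(9.0).reshape(3, 3)          # Z1 … Z9
w = np.array([[1.0, 2.0], [3.0, 4.0]])    # w1 … w4
print(pim_conv2d(Z, w))                   # 2×2 output over the valid windows
```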

11 Neural Network: Back Propagation
Back-propagation adds two phases to the feed-forward pass: error backward, which propagates δk through the weights Wjk to obtain δj, and weight update, which adjusts each weight Wij by ηδjZi (η is the learning rate).
[Figure: the feed-forward neuron (weight Wij, activation g, derivative activation g'(aj)) alongside the error-backward and weight-update phases.]
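Written out in the slide's symbols, this is the standard back-propagation rule:

```latex
\delta_j = g'(a_j) \sum_k W_{jk}\, \delta_k,
\qquad
W_{ij} \leftarrow W_{ij} - \eta\, \delta_j\, z_i
```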

12 Memory Layout: Back Propagation
Each block stores, during feed-forward, everything back-propagation will read in place: the transposed weights WTjk, the derivative activations g'(aj), and the scaled activations ηZj, next to PIM-reserved scratch rows. The error δk arrives through the switch as copies, the block computes δj and updates its weights, and δj then moves on to update the next layer's weights (likewise WTij, g'(ai), ηZi, and δi in the following block).
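A software analogue of that layout: the forward pass stashes, per block, exactly what the backward pass reads, so training needs no extra data movement. A minimal sketch; the class name `Block` and its interface are assumptions for illustration:

```python
import numpy as np

class Block:
    """One FloatPIM-style memory block holding a layer's weights."""
    def __init__(self, W, g, g_prime, lr=0.01):
        self.W, self.g, self.g_prime, self.lr = W, g, g_prime, lr

    def forward(self, z):
        self.z = z                    # stored during feed-forward
        a = self.W.T @ z
        self.gp = self.g_prime(a)     # g'(a), stored during feed-forward
        return self.g(a)

    def backward(self, err):
        # err: error arriving from the next block through the switch
        delta = self.gp * err                          # δ_j = g'(a_j) · incoming error
        back = self.W @ delta                          # error backward (via stored W^T)
        self.W -= self.lr * np.outer(self.z, delta)    # weight update: ΔW_ij = η δ_j z_i
        return back
```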

13 Digital PIM Architecture
How does data move between the blocks? Each layer of the example network maps to its own memory block, with switches connecting adjacent blocks. The architecture alternates between a computing mode, in which all blocks compute in parallel, and a data-transfer mode, in which the switches pass a block's outputs (z, g) to the next block.
[Figure: example network mapped onto Blocks 1–4 with switches between them, alternating computing and data-transfer modes.]
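Chaining the `Block` sketch from the previous slide shows the alternation: computing steps happen inside each block, transfer steps between blocks (a toy example, not a performance model):

```python
import numpy as np

tanh_p = lambda a: 1.0 - np.tanh(a) ** 2
blocks = [Block(0.1 * np.random.randn(4, 4), np.tanh, tanh_p) for _ in range(3)]

z = np.random.randn(4)
for blk in blocks:            # computing mode: the block evaluates its layer
    z = blk.forward(z)        # data-transfer mode: the switch forwards z

err = z - np.ones(4)          # toy output error for the backward pass
for blk in reversed(blocks):
    err = blk.backward(err)   # error flows back block by block
```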

14 FloatPIM Parallelism
[Figure: serialized computation, where blocks wait on one another, versus FloatPIM's parallel computation across blocks.]

15 FloatPIM Architecture
- 32 tiles, 256 blocks/tile, 1K×1K block size
- Crossbar arrays (1K×1K): 99% of area, 89% of power
- Controller (one per tile): 11.5% of area, 9.7% of power
- 6-level barrel shifter: 0.5% of area, ~10% of power
- Switches: 6.3% of area, 0.9% of power

16 Deep Learning Acceleration
Four popular networks evaluated over the large-scale ImageNet dataset.

Accelerator                      | Training Stability
Analog PIM: ISAAC [ISCA'16]      | N/A
Analog PIM: PipeLayer [HPCA'17]  | Unstable
Digital PIM: FloatPIM            | Stable training / high accuracy

Classification error by training precision:
Network    | Float-32 | bFloat | Fixed-32 | Fixed-16
AlexNet    | 27.4%    | 29.6%  | 31.3%    | –
GoogleNet  | 15.6%    | 18.5%  | 21.4%    | –
VGGNet     | 17.5%    | 17.7%  | 23.1%    | –
SqueezeNet | 25.9%    | 26.1%  | 32.1%    | –

17 FloatPIM: Fixed vs. Floating Point
FloatPIM efficiency relative to Float-32:
- bFloat: 2.9× speedup and 2.5× energy savings
- Fixed-32: 1.5× speedup and 1.42× energy savings

18 FloatPIM Efficiency
FloatPIM vs. an NVIDIA GTX 1080 GPU: 303× faster and 48× more energy efficient.
FloatPIM vs. analog PIM (PipeLayer [HPCA'17]): 4.3× faster and 16× more energy efficient.
FloatPIM's efficiency comes from:
- Higher density
- Lower data movement
- Faster computation at a lower bitwidth

19 Conclusion
Analog-based computing faces several challenges in today's PIM technology. We proposed a digital-based PIM architecture that:
- Exploits the analog characteristics of NVMs to support row-parallel NOR operations
- Extends them to row-parallel arithmetic (addition and multiplication)
- Maps the entire DNN training/inference flow onto crossbar memory with minimal changes to the memory
Results: 303× faster and 48× more energy efficient than an NVIDIA GTX 1080 GPU; 4.3× faster and 16× more energy efficient than analog PIM (PipeLayer [HPCA'17]).

