Power-Efficient Machine Learning using FPGAs on POWER Systems


1 Power-Efficient Machine Learning using FPGAs on POWER Systems
Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx. Join the conversation: #OpenPOWERSummit

2 Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
[Chart: ILSVRC top-5 accuracy by year. Human accuracy is ~95% (Russakovsky et al., 2014); results above that line are super-human.]

3 Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
[Same chart, annotated: CNNs far outperform non-AI methods, and CNNs now deliver super-human (> ~95%) accuracy (Russakovsky et al., 2014).]

4 CNNs Explained

5 The Computation

6 The Computation

7 Convolution Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights

8 Convolution Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights

9 Convolution Continue along the row ...

10 Convolution Before moving down to the next row

11 Convolution The first output feature map is complete

12 Convolution Move onto the next output feature map by switching weights, and repeat

13 Convolution Pattern repeats as before: same input volumes, different weights

14 Convolution Complete the second output feature map plane

15 Convolution Finally, after 256 weight sets have been used, the output feature map is complete
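For concreteness, the nested loops of slides 7–15 look like this in NumPy (a sketch: the 13×13 spatial size is an assumption consistent with the 3x3x384 / 256-map shapes quoted above):

```python
import numpy as np

# Assumed sizes from the slides: 3x3 kernels, 384 input channels,
# 256 output feature maps; 13x13 spatial output is an assumption.
C_IN, C_OUT, K, H, W = 384, 256, 3, 13, 13

x = np.random.randn(C_IN, H + K - 1, W + K - 1)  # padded input volume
w = np.random.randn(C_OUT, C_IN, K, K)           # 256 sets of 3x3x384 weights
y = np.zeros((C_OUT, H, W))

for f in range(C_OUT):       # switch to the next weight set per output map
    for r in range(H):       # move down to the next row
        for c in range(W):   # continue along the row
            # One output pixel: dot product of an overlapping 3x3x384
            # input sub-volume with the 3x3x384 weight set for map f.
            y[f, r, c] = np.sum(x[:, r:r + K, c:c + K] * w[f])
```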

16 Fully Connected Layers

17 Fully Connected Layers

18 CNN Properties
Compute: dominated by convolution (CONV) layers [chart: GOPs per layer].
Memory bandwidth: dominated by fully-connected (FC) layers [chart: G reads per layer].
Source: Yu Wang, Tsinghua University, Feb 2016
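To see why CONV layers dominate compute while FC layers dominate memory traffic, a back-of-the-envelope tally (the layer shapes are AlexNet-like assumptions, chosen for illustration):

```python
# CONV: each output pixel needs K*K*C_in multiply-accumulates (2 ops each),
# and every weight is reused across all H*W output positions.
K, C_in, C_out, H, W = 3, 384, 256, 13, 13
conv_ops = 2 * K * K * C_in * C_out * H * W   # ~0.30 GOPs
conv_weights = K * K * C_in * C_out           # ~0.88 M values

# FC: a matrix-vector product touches every weight exactly once per image,
# so at batch size 1 there is no weight reuse -- memory-bandwidth bound.
N_in, N_out = 9216, 4096
fc_ops = 2 * N_in * N_out                     # ~0.075 GOPs
fc_weights = N_in * N_out                     # ~37.7 M values

print(f"CONV: {conv_ops/1e9:.2f} GOPs over {conv_weights/1e6:.2f} M weights")
print(f"FC:   {fc_ops/1e9:.3f} GOPs over {fc_weights/1e6:.1f} M weights")
```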

19 Humans vs Machines
Humans are six orders of magnitude more energy-efficient than machines (reference machine: IBM Watson, ca. 2012).
Source: Yu Wang, Tsinghua University, Feb 2016

20 Cost of Computation Source: William Dally, “High Performance Hardware for Machine Learning,” Cadence ENN Summit, 2/9/2016.

21 Cost of Computation
Stay in on-chip memory (1/100× power).
Use smaller multipliers (8-bit vs 32-bit: 1/16× power).
Fixed-point vs float (don’t waste bits on dynamic range).
Source: William Dally, “High Performance Hardware for Machine Learning,” Cadence ENN Summit, 2/9/2016.

22 Improving Machine Efficiency
Model pruning.
Right-sizing precision.
Custom CNN processor architecture.

23 Pruning
Remove low-contribution weights (synapses).
Retrain the remaining weights.
Source: Han et al., “Learning both Weights and Connections for Efficient Neural Networks”
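A minimal sketch of that prune-then-retrain recipe (the sparsity value is illustrative, chosen to mirror the ~9× reduction on the next slide; the mask is reused during retraining):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.89):
    """Zero out the lowest-magnitude weights (synapses).

    The threshold is the |weight| quantile at the target sparsity;
    sparsity=0.89 is an illustrative value matching a ~9x reduction.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# After pruning, the surviving weights are retrained with the zeroed
# positions held at zero (the mask doubles as a gradient mask).
```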

24 Pruning Results: AlexNet
9× reduction in number of weights; most of the reduction is in the FC layers.
Source: Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”

25 Pruning Results: AlexNet
< 0.1% accuracy loss.
Source: Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”

26 Inference with Integer Quantization

27 Right-Sizing Precision
Network: VGG16

                   Float     16-bit    8-bit    8-bit    8/4-bit
Data bits          float     16        8        8        8
Weight bits        float     16        8        8        8 or 4
Data precision     N/A       2^-2      2^-5     2^-1     dynamic
Weight precision   N/A       2^-15     2^-7     2^-7     dynamic
Top-1 accuracy     68.1%     68.0%     53.0%    28.2%    67.0%
Top-5 accuracy     88.0%     87.9%     76.6%    49.7%    87.6%

Dynamic: variable-format fixed point (Qm.n chosen per layer), < 1% accuracy loss.
Source: Yu Wang, Tsinghua University, Feb 2016

28 Right-Sizing Precision
Fixed point is sufficient for deployment (INT16, INT8).
No significant loss in accuracy (< 1%).
> 10× compute energy efficiency in OPs/J (INT8 vs FP32).
4× memory energy efficiency in transfers/J (INT8 vs FP32).
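A sketch of the per-layer dynamic fixed-point idea behind these numbers, assuming the fractional length n of Qm.n is picked from the layer's value range (the real flow searches n to minimize quantization error):

```python
import numpy as np

def quantize_per_layer(x, bits=8):
    """Dynamic (per-layer) fixed-point quantization, as in the table above.

    The fractional length n of the Qm.n format is derived from the
    layer's actual value range, so no bits are wasted on unused
    dynamic range. A sketch, not the production calibration flow.
    """
    max_val = float(np.max(np.abs(x)))
    int_bits = int(np.floor(np.log2(max_val))) + 1 if max_val > 0 else 0
    n = bits - 1 - int_bits                    # fractional bits of Qm.n
    scale = 2.0 ** n
    lo, hi = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)   # integer codes
    return q / scale, n                        # dequantized values and n
```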

29 Improving Machine Efficiency
Flow: CNN model → model pruning → pruned floating-point model → data/weight quantization → pruned fixed-point model → compilation → instructions → run on the FPGA-based neural network processor.
Modified from: Yu Wang, Tsinghua University, Feb 2016

30 Xilinx Kintex® UltraScale™ KU115 (20 nm)
5520 DSP slices at up to 500 MHz: 5.5 T OPs INT16 (peak).
4 GB DDR at 38 GB/s.
55 W TDP: 100 G OPs/W.
Single-slot, low-profile form factor; OpenPOWER CAPI; AlphaData ADM-PCIE-8K5.
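The headline figures cross-check against the DSP count and clock, assuming one multiply-accumulate (two ops) per DSP slice per cycle:

```python
# Back-of-the-envelope check of the peak numbers quoted above.
dsps, f_clk = 5520, 500e6          # DSP slices, clock (Hz)
ops_per_mac = 2                    # one multiply + one add per cycle
peak = dsps * f_clk * ops_per_mac  # 5.52e12 -> ~5.5 T OPs (INT16, peak)
print(peak / 1e12, "TOPs;", peak / 55 / 1e9, "GOPs/W at 55 W")  # ~100
```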

31 FPGA Architecture
[Diagram: 2D array of RAM, CLB, and DSP columns.]
2D array architecture (scales with Moore’s Law).
Memory-proximate computing (minimizes data movement).
Broadcast-capable interconnect (data sharing/reuse).

32 FPGA Arithmetic & Memory Resources
[Diagram: custom-width memories feeding a 16-bit multiplier and 48-bit accumulator; formats include INT4/INT8/INT16/INT32, FP16/FP32, and Q8.8, Q2.14, Qm.n.]
Native 16-bit multiplier (or reduced-power 8-bit).
On-chip RAMs store INT4, INT8, INT16, …
Custom quantization formatting (Qm.n).
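For reference, Qm.n stores m integer and n fractional bits; a minimal Q8.8 encode/decode pair, assuming the m+n=16 (sign included) convention:

```python
def to_q8_8(x: float) -> int:
    """Encode x as 16-bit Q8.8: 8 integer bits (incl. sign), 8 fractional."""
    q = round(x * 256)                     # shift left by n = 8 bits
    return max(-2**15, min(2**15 - 1, q))  # saturate to int16 range

def from_q8_8(q: int) -> float:
    return q / 256.0

# Example: 3.14159 -> code 804 -> 3.140625 (error below 2^-9).
```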

33 Convolver Unit
[Diagram: data and weight buffers feed 9 data inputs and 9 weight inputs, through n- and m-stage delay lines and a MUX, into a 3×3 multiplier array and adder tree producing the output.]
Source: Yu Wang, Tsinghua University, Feb 2016

34 Memory-Proximate Compute
[Same convolver diagram, annotated.]
Serial-to-parallel input; data reuse of 8/9.
Memory-proximate compute: 2D parallel memory feeding a 2D operator array (INT16).
Ping/pong buffering.
Source: Yu Wang, Tsinghua University, Feb 2016
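A behavioral sketch of where the 8/9 reuse figure comes from: two row-length delay lines plus a 3×3 register window mean only one fresh pixel per output is fetched from memory (structure and names are illustrative, not the actual RTL):

```python
def windows_3x3(pixels, width):
    """Yield (row, col, 3x3 window) from a raster-scan pixel stream.

    Two line buffers plus a 3x3 shift-register window supply 8 of the
    9 window values on-chip, so only one new pixel per output comes
    from memory: the 8/9 data reuse on the slide.
    """
    buf = [[0] * width for _ in range(2)]    # two row-length line buffers
    win = [[0] * 3 for _ in range(3)]        # 3x3 shift-register window
    for i, p in enumerate(pixels):
        r, c = divmod(i, width)
        column = [buf[0][c], buf[1][c], p]   # rows r-2, r-1, r at column c
        buf[0][c], buf[1][c] = buf[1][c], p  # age the line buffers
        for k in range(3):                   # shift the window left
            win[k] = win[k][1:] + [column[k]]
        if r >= 2 and c >= 2:                # window fully valid
            yield r, c, [row[:] for row in win]
```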

35 Processing Engine (PE)
[Diagram: input buffer (data, bias, weights) → convolver complex → adder tree with bias shift → nonlinearity (NL) → pooling → output buffer; a controller manages intermediate data.]
Source: Yu Wang, Tsinghua University, Feb 2016

36 Processing Engine (PE)
[Same PE diagram, annotated.]
Custom quantization; memory sharing; broadcast weights.
Source: Yu Wang, Tsinghua University, Feb 2016

37 Top Level
[Diagram: POWER CPU and external memory (processing system) connect over a data & instruction bus to a DMA with compression; in the programmable logic, input buffers feed the PE computing complex and output buffers through FIFOs, with a controller on a configuration bus.]
Source: Yu Wang, Tsinghua University, Feb 2016

38 Top Level
[Same diagram, annotated.]
SW-scheduled dataflow; weights decompressed on the fly.
Ping-pong buffers: transfers overlap with compute.
Multiple PEs: block-level parallelism.
Source: Yu Wang, Tsinghua University, Feb 2016
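A software model of the ping-pong scheme: while the PE computes on one buffer, the DMA fills the other, hiding transfer time behind compute (dma_load and compute are hypothetical stand-ins for the DMA engine and PE complex):

```python
import threading

def run_tiles(tiles, dma_load, compute):
    """Ping-pong buffering: fill one buffer while computing on the other.

    A thread models the hardware concurrency of DMA and PE; in the real
    design both run simultaneously in programmable logic.
    """
    bufs = [dma_load(tiles[0]), None]        # preload the "ping" buffer
    for i in range(len(tiles)):
        loader = None
        if i + 1 < len(tiles):
            # Start filling the other buffer while this one is consumed.
            def fill(j=i + 1):
                bufs[j % 2] = dma_load(tiles[j])
            loader = threading.Thread(target=fill)
            loader.start()
        compute(bufs[i % 2])                 # PE consumes the ready buffer
        if loader:
            loader.join()                    # transfer hidden behind compute
```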

39 FPGA Neural Net Processor
Tiled architecture (parallelism & scaling).
Semi-static dataflow (pre-scheduled data transfers).
Memory reuse (data sharing across convolvers).

40 OpenPOWER CAPI
[Diagram: the POWER8 CAPP unit coherently attached to the PSL on the FPGA.]
Shared virtual memory; system-wide memory coherency; low-latency control messages.
Peer programming model and interaction efficiency.

41 OpenPOWER CAPI
[Diagram: the POWER host, running Caffe, TensorFlow, etc., loads the CNN model and calls the AuvizDNN library, which drives the AuvizDNN kernel on the Xilinx FPGA over CAPI.]
Scalable and fully parameterized; plug-and-play library.

42 OpenPOWER CAPI
14 images/s/W (AlexNet) at batch size 1, within a low-profile TDP.

43 Takeaways
FPGA: an ideal dataflow CNN processor.
POWER/CAPI: elevates accelerators to peers of the CPU.
FPGA CNN libraries are available.

44 Thank You!

