Power-Efficient Machine Learning using FPGAs on POWER Systems

Presentation transcript:

Power-Efficient Machine Learning using FPGAs on POWER Systems
Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx
Join the conversation: #OpenPOWERSummit

Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)**
CNNs far outperform non-AI methods
CNNs deliver super-human accuracy (humans: ~95%***)
* http://image-net.org/challenges/LSVRC/
** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
*** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf

CNNs Explained

The Computation

Convolution Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights

Convolution Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights

Convolution Continue along the row ...

Convolution Before moving down to the next row

Convolution The first output feature map is complete

Convolution Move on to the next output feature map by switching weights, and repeat

Convolution Pattern repeats as before: same input volumes, different weights

Convolution Complete the second output feature map plane

Convolution Finally, after all 256 weight sets have been used, the full set of output feature maps is complete
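
The walkthrough above is exactly a six-deep loop nest. A minimal sketch, assuming stride 1, no padding, and the slide's dimensions (3x3 kernels, 384 input channels, 256 output maps); the 13x13 output size and all names are illustrative:

```cpp
// Sketch of the convolution walk above (illustrative sizes and names).
// Each output pixel is one dot product of a 3x3xIC input sub-volume
// with that output map's 3x3xIC weight set.
const int IC = 384, OC = 256, K = 3;   // input channels, output maps, kernel
const int OH = 13, OW = 13;            // output height/width (illustrative)

void conv_layer(const float in[IC][OH + K - 1][OW + K - 1],
                const float w[OC][IC][K][K],
                float out[OC][OH][OW]) {
    for (int oc = 0; oc < OC; ++oc)           // switch weight set per output map
        for (int oh = 0; oh < OH; ++oh)       // move down to the next row
            for (int ow = 0; ow < OW; ++ow) { // continue along the row
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ++ic)       // the 3x3xIC sub-volume
                    for (int kh = 0; kh < K; ++kh)
                        for (int kw = 0; kw < K; ++kw)
                            acc += in[ic][oh + kh][ow + kw] * w[oc][ic][kh][kw];
                out[oc][oh][ow] = acc;
            }
}
```

The overlapping windows (the oh+kh, ow+kw indexing) are what the convolver hardware later in the talk exploits for data reuse.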

Fully Connected Layers
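
Computationally, a fully-connected layer is a matrix-vector product: every output neuron reads every input activation through its own weight. A minimal sketch (sizes and names are illustrative):

```cpp
// Fully-connected layer as a matrix-vector product plus bias.
// w is [OUT][IN] in row-major order; each weight is used exactly once
// per inference, which is why FC layers are memory-bandwidth bound.
void fc_layer(const float* in, const float* w, const float* bias,
              float* out, int IN, int OUT) {
    for (int o = 0; o < OUT; ++o) {
        float acc = bias[o];
        for (int i = 0; i < IN; ++i)
            acc += w[o * IN + i] * in[i];
        out[o] = acc;
    }
}
```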

CNN Properties
Compute: dominated by convolution (CONV) layers [chart: GOPs per layer]
Memory BW: dominated by fully-connected (FC) layers [chart: G reads per layer]
Source: Yu Wang, Tsinghua University, Feb 2016
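
A back-of-envelope tally shows why the balance tips this way; the layer shapes below are AlexNet-like but illustrative:

```cpp
#include <cstdio>

int main() {
    // CONV3-like layer: 13x13 output, 384 maps, 3x3x256 input window
    long long conv_macs    = 13LL * 13 * 384 * 3 * 3 * 256; // ~150 M MACs
    long long conv_weights = 384LL * 3 * 3 * 256;           // ~0.9 M weights
    // FC6-like layer: 4096 outputs, 9216 inputs
    long long fc_macs    = 4096LL * 9216;                   // ~38 M MACs
    long long fc_weights = 4096LL * 9216;                   // ~38 M weights

    std::printf("CONV: %lld MACs over %lld weights (each reused %lld times)\n",
                conv_macs, conv_weights, conv_macs / conv_weights);
    std::printf("FC:   %lld MACs over %lld weights (each used once)\n",
                fc_macs, fc_weights);
    return 0;
}
```

CONV does the arithmetic with heavy weight reuse, while FC streams a huge weight matrix through exactly once, so FC dominates memory reads.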

Humans vs Machines
Humans are six orders of magnitude more efficient*
* IBM Watson, ca 2012
Source: Yu Wang, Tsinghua University, Feb 2016

Cost of Computation
Stay in on-chip memory (1/100x power)
Use smaller multipliers (8 bits vs 32 bits: 1/16x power)
Fixed-point vs float (don't waste bits on dynamic range)
Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016
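
As a sanity check on these ratios (using the widely cited 45 nm energy figures from Horowitz's ISSCC 2014 keynote, which Dally's talks draw on; treat the exact numbers as illustrative): a 32-bit float multiply costs roughly 3.7 pJ versus roughly 0.2 pJ for an 8-bit integer multiply, about the 1/16x above, since multiplier energy grows roughly with the square of operand width; and a 32-bit DRAM read costs roughly 640 pJ versus roughly 5 pJ for an on-chip SRAM read, about the 1/100x above.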

Improving Machine Efficiency
Model pruning
Right-sizing precision
Custom CNN processor architecture

Pruning Elements
Remove low-contribution weights (synapses)
Retrain the remaining weights
Source: Han, et al, "Learning both Weights and Connections for Efficient Neural Networks", http://arxiv.org/pdf/1506.02626v3.pdf
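
A minimal sketch of the magnitude-pruning idea described above (the threshold choice and the full retraining loop are omitted; names are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Zero out low-contribution weights and record a mask so the pruned
// synapses stay removed during retraining.
void prune(float* w, bool* keep, std::size_t n, float threshold) {
    for (std::size_t i = 0; i < n; ++i) {
        keep[i] = std::fabs(w[i]) >= threshold;
        if (!keep[i]) w[i] = 0.0f;
    }
}

// One retraining update: gradients for pruned weights are discarded so
// the removed connections cannot grow back.
void sgd_step(float* w, const float* grad, const bool* keep,
              std::size_t n, float lr) {
    for (std::size_t i = 0; i < n; ++i)
        if (keep[i]) w[i] -= lr * grad[i];
}
```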

Pruning Results: AlexNet
9x reduction in #weights; most of the reduction is in the FC layers
Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

Pruning Results: AlexNet
< 0.1% accuracy loss
Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

Inference with Integer Quantization

Right-Sizing Precision
Network: VGG16

Data bits          Single-float   16      8           8           8
Weight bits        Single-float   16      8           4           4
Data precision     N/A            2^-2    2^-5/2^-1   2^-5/2^-1   Dynamic
Weight precision   N/A            2^-15   2^-7        2^-7        Dynamic
Top-1 accuracy     68.1%          68.0%   53.0%       28.2%       67.0%
Top-5 accuracy     88.0%          87.9%   76.6%       49.7%       87.6%

Dynamic: variable-format fixed-point (Qm.n chosen per layer)
< 1% accuracy loss
Source: Yu Wang, Tsinghua University, Feb 2016

Right-Sizing Precision
Fixed-point is sufficient for deployment (INT16, INT8)
No significant loss in accuracy (< 1%)
>10x compute energy efficiency, OPs/J (INT8 vs FP32)
4x memory energy efficiency, transfers/J (INT8 vs FP32)
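
A minimal sketch of the Qm.n fixed-point conversion behind these numbers, with the fractional bit count chosen per layer (the "dynamic" scheme in the table above); names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize to 8-bit Qm.n: scale by 2^frac_bits, round, saturate.
int8_t to_q(float x, int frac_bits) {
    float s = std::round(x * std::ldexp(1.0f, frac_bits));  // x * 2^n
    return static_cast<int8_t>(std::min(std::max(s, -128.0f), 127.0f));
}

// Dequantize: multiply by 2^-frac_bits.
float from_q(int8_t q, int frac_bits) {
    return std::ldexp(static_cast<float>(q), -frac_bits);
}
```

Choosing frac_bits per layer lets each layer spend its few bits on the value range it actually uses, which is what recovers the accuracy in the last column of the table.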

Improving Machine Efficiency
Flow: CNN model → (model pruning) → pruned floating-point model → (data/weight quantization) → pruned fixed-point model → (compilation) → instructions → run on the FPGA-based neural network processor
Modified from: Yu Wang, Tsinghua University, Feb 2016

Xilinx Kintex® UltraScale™ KU115 (20nm)
5,520 DSP cores, up to 500 MHz
5.5 TOPs INT16 (peak)
4 GB DDR4-2400, 38 GB/s
55 W TDP, 100 GOPs/W
Single-slot, low-profile form factor
OpenPOWER CAPI
AlphaData ADM-PCIE-8K5

FPGA Architecture
[Diagram: 2D fabric of interleaved RAM, CLB, and DSP columns]
2D array architecture (scales with Moore's Law)
Memory-proximate computing (minimize data moves)
Broadcast-capable interconnect (data sharing/reuse)

FPGA Arithmetic & Memory Resources
Native 16-bit multiplier with 48-bit accumulator (or reduced-power 8-bit)
Custom-width on-chip RAMs store INT4, INT8, INT16, INT32, FP16, FP32, …
Custom quantization formatting (Qm.n, e.g. Q8.8, Q2.14)
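
In effect, each DSP slice contributes one iteration of the loop below per cycle: a native 16-bit multiply feeding a wide accumulator. A minimal sketch, with int64_t standing in for the 48-bit accumulator register:

```cpp
#include <cstddef>
#include <cstdint>

// INT16 dot product with a wide accumulator: each 16x16 product fits in
// 32 bits, and a 48-bit accumulator leaves ~2^17 products of headroom,
// so even a 3x3x384 window sums without overflow or rounding.
int64_t dot_int16(const int16_t* a, const int16_t* b, std::size_t n) {
    int64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<int32_t>(a[i]) * b[i];
    return acc;
}
```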

Convolver Unit
[Diagram: data and weight buffers feed 9 data inputs and 9 weight inputs through n- and m-stage delay lines into a 3x3 multiplier array and adder tree]
Source: Yu Wang, Tsinghua University, Feb 2016

Memory-Proximate Compute: Convolver Unit
Serial-to-parallel input staging; data reuse: 8/9
2D parallel memory feeding a 2D operator array (INT16)
Ping/pong buffering
Source: Yu Wang, Tsinghua University, Feb 2016
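
The "n delays" and serial-to-parallel staging amount to a classic line buffer: pixels stream in one per cycle, two row-deep delay lines plus a small shift window expose a full 3x3 patch to the nine multipliers each cycle, and 8 of the 9 pixels are reused from previous cycles (the 8/9 reuse above). A minimal behavioral sketch, not the RTL:

```cpp
#include <cstdint>
#include <vector>

// Behavioral model of a 3x3 line buffer. push() accepts one pixel per
// "cycle" and emits the current 3x3 window; the window becomes valid
// once two full rows plus three pixels have been streamed in.
class LineBuffer3x3 {
    std::vector<int16_t> row0_, row1_;  // one- and two-row delay lines
    int16_t win_[3][3] = {};
    int col_ = 0;
    const int width_;
public:
    explicit LineBuffer3x3(int width)
        : row0_(width, 0), row1_(width, 0), width_(width) {}

    void push(int16_t pixel, int16_t out[3][3]) {
        for (int r = 0; r < 3; ++r)      // shift window left: the 8/9 reuse
            for (int c = 0; c < 2; ++c)
                win_[r][c] = win_[r][c + 1];
        win_[0][2] = row1_[col_];        // pixel from two rows ago
        win_[1][2] = row0_[col_];        // pixel from one row ago
        win_[2][2] = pixel;              // the newly arrived pixel
        row1_[col_] = row0_[col_];       // advance the delay lines
        row0_[col_] = pixel;
        col_ = (col_ + 1) % width_;
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c) out[r][c] = win_[r][c];
    }
};
```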

Processing Engine (PE)
[Diagram: input buffer (data, bias, weights) → convolver complex → adder tree with bias shift → non-linearity (NL) → pooling → output buffer, sequenced by a controller; intermediate data is fed back]
Custom quantization, memory sharing, broadcast weights
Source: Yu Wang, Tsinghua University, Feb 2016

Top Level
[Diagram: POWER CPU and external memory in the processing system; a DMA with compression drives the data & instruction bus into the input buffer, PE computing complex, output buffer, FIFOs, and controller in the programmable logic]
SW-scheduled dataflow
Decompress weights on the fly
Ping-pong buffers: transfers overlap with compute
Multiple PEs: block-level parallelism
Source: Yu Wang, Tsinghua University, Feb 2016
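
A minimal sketch of the ping-pong buffering: while the PE complex computes on one buffer, the DMA fills the other, so transfers hide behind compute. dma_load and compute are hypothetical stand-ins for the engines on the slide:

```cpp
#include <cstdint>
#include <future>
#include <vector>

struct Buffer { std::vector<int16_t> data = std::vector<int16_t>(4096); };

// Hypothetical stand-ins for the DMA engine and the PE computing complex.
static void dma_load(Buffer& b, int tile) {
    b.data.assign(b.data.size(), static_cast<int16_t>(tile));
}
static void compute(const Buffer& b) {
    volatile int16_t sink = b.data[0]; (void)sink;
}

void run_tiles(int num_tiles) {
    Buffer bufs[2];
    dma_load(bufs[0], 0);                       // prime the "ping" buffer
    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        std::future<void> dma;
        if (t + 1 < num_tiles)                  // fill "pong" ...
            dma = std::async(std::launch::async, dma_load,
                             std::ref(bufs[nxt]), t + 1);
        compute(bufs[cur]);                     // ... while computing "ping"
        if (dma.valid()) dma.get();             // both done before swapping
    }
}
```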

FPGA Neural Net Processor
Tiled architecture (parallelism & scaling)
Semi-static dataflow (pre-scheduled data transfers)
Memory reuse (data sharing across convolvers)

OpenPOWER CAPI
[Diagram: the CAPI PSL on the FPGA coupled coherently to the CAPP unit in the POWER8 processor]
Shared virtual memory
System-wide memory coherency
Low-latency control messages
Peer programming model and interaction efficiency

OpenPOWER CAPI
[Diagram: Caffe, TensorFlow, etc. on the POWER host load the CNN model and call the AuvizDNN library, which invokes the AuvizDNN kernel on the Xilinx FPGA over CAPI]
Scalable & fully parameterized
Plug-and-play library

OpenPOWER CAPI
14 images/s/W (AlexNet), batch size 1, low-profile TDP

Takeaways
FPGA: an ideal dataflow CNN processor
POWER/CAPI: elevates accelerators to peers of the CPU
FPGA CNN libraries (e.g. AuvizDNN)

Thank You!