1
Designing Neural Network SoC Architectures Using Configurable Cache Coherent Interconnect IP
CHIPEX 2018 1 May 2018 Regis Gaillard Application Engineer, Arteris IP
2
Agenda
1 Neural Networks & Deep Learning
2 Chips for NN
3 Technologies for NN Chips
4 Technologies for AI Chips
5 The Future
3
Automotive is driving Machine Learning
(Diagram: a closed loop in which new data for training flows from the vehicle's CPUs + HW accelerators to the training side, and an updated model flows back down to the same CPUs + HW accelerators in the vehicle.)
4
Mimicking a Brain Cell
(Diagram: an artificial neuron. An axon from a different neuron delivers an input value x0, which is multiplied by a weight w0 at the synapse; the dendrites carry the weighted inputs x0w0, x1w1, x2w2 into the cell body, which sums them with a bias b and applies the activation function f; the result leaves on the output axon.)
output = f( Σ_i x_i·w_i + b ), where f is the activation function
Sources: MIT, Stanford
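A minimal sketch in C of this single-neuron computation; the ReLU activation and the input/weight values are illustrative assumptions, not taken from the slide:

```c
#include <stdio.h>

/* Activation function: ReLU is used here purely as an example. */
static float relu(float x) { return x > 0.0f ? x : 0.0f; }

/* One artificial neuron: weighted sum of inputs plus bias, then activation. */
static float neuron(const float *x, const float *w, int n, float b) {
    float acc = b;
    for (int i = 0; i < n; i++)
        acc += x[i] * w[i];          /* x_i * w_i */
    return relu(acc);                /* f( sum_i x_i*w_i + b ) */
}

int main(void) {
    float x[3] = {0.5f, -1.0f, 2.0f};   /* example inputs  (assumed values) */
    float w[3] = {0.8f,  0.2f, -0.5f};  /* example weights (assumed values) */
    printf("neuron output = %f\n", neuron(x, w, 3, 0.1f));
    return 0;
}
```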
5
Many cells become a neural network
(Diagram: neurons arranged as a fully connected network with an Input Layer, a Hidden Layer, and an Output Layer. Source: Stanford)
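Building on the neuron sketch above, a hedged example of how layers compose: a layer simply repeats the neuron computation for every output, and the hidden layer's outputs become the output layer's inputs. The sizes, weights, and ReLU activation below are illustrative assumptions:

```c
#include <stdio.h>
#include <stddef.h>

/* One fully connected layer: each of the n_out outputs is the single-neuron
   computation over all n_in inputs. W is row-major: W[o*n_in + i] is the
   weight from input i to output o. ReLU is the assumed activation. */
void dense_layer(const float *x, const float *W, const float *b,
                 float *y, size_t n_in, size_t n_out) {
    for (size_t o = 0; o < n_out; o++) {
        float acc = b[o];
        for (size_t i = 0; i < n_in; i++)
            acc += x[i] * W[o * n_in + i];
        y[o] = acc > 0.0f ? acc : 0.0f;
    }
}

int main(void) {
    /* 4 inputs -> 5 hidden neurons -> 2 outputs; all values are illustrative. */
    float x[4]      = {1.0f, 0.5f, -0.5f, 2.0f};
    float W1[5 * 4] = {0};  float b1[5] = {0.1f, 0.1f, 0.1f, 0.1f, 0.1f};
    float W2[2 * 5] = {0};  float b2[2] = {0.0f, 0.0f};
    float hidden[5], out[2];

    dense_layer(x, W1, b1, hidden, 4, 5);   /* input layer  -> hidden layer */
    dense_layer(hidden, W2, b2, out, 5, 2); /* hidden layer -> output layer */
    printf("outputs: %f %f\n", out[0], out[1]);
    return 0;
}
```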
6
A neural network in operation
(Figure not reproduced here. Source: Lee et al., Communications of the ACM, 2011)
7
Many types with specialized processing requirements
8
Training vs. Inference
(Figure not reproduced here. Source: Nvidia)
9
Why is Deep Learning Happening Now?
Algorithms, and research in GPU utilization
Compute Power: we helped make chips more powerful and easier to design
The Internet: large, organized "Big Data" datasets are easier to gather
Money: financial backing, a "gold rush" for first-mover advantage
Datacenters: software & infrastructure created to support search & cloud computing
10
A Silicon Design Renaissance For AI
Today's hype is driving a funding rush that is allowing many different approaches to AI processing to be explored, including new IP offerings:
Approaches: systolic, memory oriented, on-chip storage focused, big cores, massively parallel tiny cores, analog, optical, quantum?
Hardware accelerator IP: Cadence, Synopsys, Ceva, Nvidia, AiMotive
11
Deep Learning Startups Are Happening
Startup activity is unusually high, despite the well-known historical VC bias against chip startups:
AIMotive: portable software for automated driving (Hungary)
Axis Semi: massive array of compute cores (USA)
Bitmain: coin miner builds training ASIC (China)
BrainChip: Spiking Neuron Adaptive Processor
Cambricon: device and cloud processors for AI
Cerebras Systems: specialized next-generation chip for deep-learning applications
Deep Vision: low-power computer vision
Deephi: compressed CNN networks and processors
Esperanto: massive array of RISC-V cores
Graphcore: graph-oriented processors for deep learning (UK)
Groq: Google spinout deep learning chip
Horizon Robotics: smart home, automotive, and public safety
IntelliGo: hardware and software for image and speech processing
Mythic-AI: ultra-low-power NN inference IC design based on flash + analog + digital (USA)
Novumind: AI for IoT
Preferred Networks: real-time data analytics with deep learning and the Chainer library (Japan)
Reduced Energy Microsystems: lowest-power silicon for deep learning and machine vision
SenseTime: computer vision (China)
Tenstorrent: deep learning processor designed for faster training and adaptability to future algorithms (Canada)
Syntient: customized analog neural networks
ThinCI: vision processing chips
Thinkforce: AI chips
Unisound: AI-based speech and text
Vathys: deep learning supercomputers
Wave Computing: deep learning computers based on custom silicon
Source: Chris Rowen, Cognite Ventures
12
A Wide Range of Applications and Markets
High Performance: data center; high bandwidth, many compute engines, flexible computing
Mid-Range: not battery powered; thermal constraints, functional safety
Low Power: edge or consumer; battery powered, focused application
13
Technologies for Advanced AI Chips
Power Efficiency
Scalability
Hardware Acceleration
14
Interconnects enable SoC architectures
(Block diagram: a complete SoC built around Arteris interconnect IP. The Ncore™ cache coherent interconnect ties together the CPU subsystem (A72 and A53 cores with L2 cache), while the FlexNoC® non-coherent interconnect connects the GPU subsystem (2D/3D graphics), the DSP subsystem (A/V, MPEG), design-specific and application IP subsystems with their own FlexWay® interconnects, the security subsystem (AES, CRI crypto firewall PCF+, RSA-PSS certification engine), the memory subsystem (memory scheduler, memory controller, LPDDR/DDR3/Wide IO PHYs), the wireless subsystem (WiFi, GSM, LTE, LTE Advanced), high-speed wired peripherals (PCIe, USB 3.0/2.0, Ethernet, HDMI, MIPI, display), I/O peripherals, PMU, JTAG, and interchip links. Legend: Arteris IP FlexNoC non-coherent interconnect IP; Arteris IP Ncore cache coherent interconnect IP.)
15
Custom Hardware Acceleration
Deep Learning Needs:
Specialized processing: custom matrix operations, multiply-accumulate (MAC), fixed-point operations, flexible operand bit widths (8→4→2 bits)
Specialized dataflow, specific to the algorithm(s): the goal is to optimize data reuse, local accumulation, and processing of algorithm-specific data formats
Hardware Acceleration Delivers:
The best performance and energy efficiency, because the hardware is customized for the application
The more focused the application, the more likely the use of multiple types of tightly integrated custom processors
Hardware accelerators enable low power, low latency, data reuse, and data locality
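As a rough software illustration of the MAC-with-reduced-precision pattern listed above (a generic sketch, not Arteris or any vendor's accelerator code): narrow 8-bit operands are multiplied and accumulated in a wider local register, and only the final sum is written back, which is exactly the data-reuse and local-accumulation behavior a hardware MAC array provides.

```c
#include <stdio.h>
#include <stdint.h>

/* Dot product with 8-bit fixed-point operands and a wide local accumulator.
   Accumulating in 32 bits avoids overflow for realistic vector lengths and
   mirrors how a hardware MAC unit keeps partial sums in local registers. */
int32_t dot_q8(const int8_t *data, const int8_t *weights, int n) {
    int32_t acc = 0;                                        /* local accumulation */
    for (int i = 0; i < n; i++)
        acc += (int32_t)data[i] * (int32_t)weights[i];      /* one MAC per element */
    return acc;                                             /* single write-back */
}

int main(void) {
    int8_t data[4]    = { 10, -3,  7, 2 };   /* example values (assumed) */
    int8_t weights[4] = {  5,  4, -2, 9 };
    printf("dot product = %d\n", (int)dot_q8(data, weights, 4));
    return 0;
}
```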
16
Neural Network Architecture Revolution
Hardware accelerators outpace CPU-only solutions (chart range: from datacenter CPUs up to low-precision analog designs).
Performance & power efficiency come from:
High parallelism
Structured, specialized architectures
Excellent results with low precision
Appropriate memory bandwidth
→ 100x energy benefit over CPU
Source: Chris Rowen, Cognite Ventures
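A minimal sketch of what "low precision" means in practice: floating-point weights are mapped to 8-bit integers with a per-tensor scale. The symmetric scheme below is one common choice, shown purely for illustration and not tied to any specific accelerator mentioned here.

```c
#include <stdio.h>
#include <math.h>
#include <stdint.h>

/* Symmetric per-tensor quantization: map floats in [-max_abs, +max_abs] to int8.
   real_value ~= scale * quantized_value */
void quantize_q8(const float *w, int8_t *q, int n, float *scale_out) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        float v = roundf(w[i] / scale);
        if (v >  127.0f) v =  127.0f;        /* clamp to the int8 range */
        if (v < -128.0f) v = -128.0f;
        q[i] = (int8_t)v;
    }
    *scale_out = scale;
}

int main(void) {
    float w[4] = {0.9f, -0.31f, 0.02f, -1.2f};   /* example weights (assumed) */
    int8_t q[4]; float scale;
    quantize_q8(w, q, 4, &scale);
    printf("scale = %f, q = %d %d %d %d\n", scale, q[0], q[1], q[2], q[3]);
    return 0;
}
```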
17
Anatomy of a Neural Network Processor Example: Tensilica Vision C5 neural network DSP
Multi-processor neural network DSP: each DSP is a complete processor; less data movement means lower energy; general vision DSP instruction set (8b & 16b); the tensor unit sustains 1024 MACs/cycle.
Keys to success:
Scale to many 1000s of multiply-add (MAC) units
High MAC density and high MAC utilization
High programmability
High data bandwidth across registers, local memory, and off-chip DDR (with compression/decompression)
(Block diagram: multiple NN DSPs with distributed on-chip memory. Each core combines an instruction cache with 4-way instruction decode, scalar op execution with 32b scalar registers, 1024b vector data registers, 3072b vector accumulators, vector and tensor execution units, on-the-fly decompression, and a memory access unit with a 512b memory port at 64 GB/s plus a 64 GB/s DMA engine; 64KB–256KB of on-chip memory per core sustains 500 GB/s of data, 120 GB/s of weights, and 750 GB/s of accumulator traffic.)
Source: Chris Rowen, Cognite Ventures
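A back-of-envelope check of why those local bandwidth figures matter. Only the 1024 MACs/cycle and the 500/120 GB/s numbers come from the slide; the 1 GHz clock and one-byte-per-operand assumption are illustrative. Without operand reuse, the MAC array would demand on the order of 2 TB/s, so the quoted data and weight paths only suffice because registers and local accumulators provide substantial reuse.

```c
#include <stdio.h>

int main(void) {
    const double macs_per_cycle = 1024;    /* from the Vision C5 slide               */
    const double clock_hz       = 1.0e9;   /* ASSUMED 1 GHz clock, for illustration  */
    const double bytes_per_mac  = 2;       /* one 8-bit data byte + one 8-bit weight */
    const double data_bw_gbs    = 500;     /* on-chip data bandwidth (slide)         */
    const double weight_bw_gbs  = 120;     /* on-chip weight bandwidth (slide)       */

    /* Raw operand demand if every MAC fetched fresh operands every cycle. */
    double demand_gbs = macs_per_cycle * bytes_per_mac * clock_hz / 1e9;

    printf("raw operand demand  : %.0f GB/s\n", demand_gbs);   /* ~2048 GB/s */
    printf("implied data reuse  : %.1fx\n",
           (macs_per_cycle * clock_hz / 1e9) / data_bw_gbs);
    printf("implied weight reuse: %.1fx\n",
           (macs_per_cycle * clock_hz / 1e9) / weight_bw_gbs);
    return 0;
}
```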
18
How Do You Feed It? Each new AI chip has a unique memory access profile:
High-performance chips are optimized for maximum bandwidth: Google TPU (attached DDR3), Google TPU2 (HBM), Intel Nervana (HBM2), Graphcore (on-chip local memory)
Mid-range chips (automotive, vision aggregation) rely on standard I/O like PCIe or Ethernet
Low-end chips (mobile, low power) may have only inputs from sensors and a low-power DDR
Once on chip, data flow on the on-chip interconnect between custom hardware elements must also be optimized:
Bandwidth: wide paths when needed, narrower when not
Coherency: available where needed to simplify software development and portability
Flexibility: meet needs from high performance to minimal cost, and everything in between
The key is to optimize the available bandwidth for the algorithm of choice, so that:
The next data is ready when it is needed and where it is needed
Utilization of the compute resources is maximized
19
Using Proxy Caches to Integrate HW Accelerators
Proxy caches allow existing non-coherent cores to participate fully in a coherent system.
(Block diagram: the Ncore transport interconnect, with a directory and snoop filter(s), connects a CHI CPU cache through a CHI coherent agent interface, an ACE CPU through an ACE coherent agent interface, a CCIX port, and a system memory interface with a coherent memory cache (CMC) in front of DRAM. Non-coherent accelerators Acc1–Acc4 attach through AXI/ACE-Lite non-coherent agent interfaces, an SMMU, and proxy caches, and a peripheral access port reaches peripheral memory.)
Integrate non-coherent HW accelerators as fully coherent peers:
Proxy caches
Association and configurability for machine learning use cases
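To make the software benefit concrete, here is a hedged sketch of how a driver's buffer hand-off changes when an accelerator sits behind a proxy cache and is hardware-coherent. The function names (cache_clean, cache_invalidate, accel_run) are hypothetical stand-ins, not an Arteris or operating-system API:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical platform hooks -- dummy stand-ins for whatever the real BSP
   and accelerator driver provide. Names and signatures are assumptions. */
static void cache_clean(const void *buf, size_t len)      { (void)buf; (void)len; }
static void cache_invalidate(const void *buf, size_t len) { (void)buf; (void)len; }
static void accel_run(const void *in, void *out, size_t len) { memcpy(out, in, len); }

/* Without hardware coherency: software must bracket every hand-off with
   explicit cache maintenance, which costs latency and is easy to get wrong. */
void run_noncoherent(const uint8_t *in, uint8_t *out, size_t len) {
    cache_clean(in, len);            /* make CPU writes visible to the accelerator */
    accel_run(in, out, len);
    cache_invalidate(out, len);      /* make accelerator writes visible to the CPU */
}

/* With the accelerator integrated behind a proxy cache (a hardware-coherent peer),
   the interconnect keeps all copies consistent and the hand-off is just a call. */
void run_coherent(const uint8_t *in, uint8_t *out, size_t len) {
    accel_run(in, out, len);         /* no explicit cache maintenance required */
}
```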
20
Coherent Read Example – Cache Hit
(Block diagram: ❶ the consumer issues a coherent read; ❷ the directory/snoop filter lookup in the transport interconnect finds the line held by the producer; ❸ the data is supplied from the producer's cache, so no CMC or DRAM access is needed. Agents shown: CHI CPU cache, ACE CPU cache, accelerators Acc1–Acc4 behind proxy caches and an SMMU, a peripheral access interface, and two system memory interfaces with CMCs in front of DRAM and peripheral memory.)
21
Coherent Read Example – Cache Misses (to CMC)
(Block diagram: ❶ the consumer issues a coherent read; ❷ the directory/snoop filter lookup finds no peer cache holding the line; ❸ the request is forwarded through the system memory interface to the coherent memory cache (CMC); ❹ the data is returned from the CMC or, on a CMC miss, from DRAM. The agents shown are the same as in the cache-hit example.)
22
The (Really) Hard Stuff – Safety and Reliability
How do you verify a deep learning system?
How do you debug the neural network black box?
What are the ethics and biases of these systems?
What does it mean to make a neural network "safe"?
23
Data flow protection for functional safety
Data protection (at rest & in transit):
Parity for data path protection
ECC memory protection
Intelligent Ncore hardware unit duplication:
Don't duplicate protected memories or links
Do duplicate hardware that affects packets
Integrate checkers, ECC/parity generators & buffers
Fault controller with BIST
(Block diagram: CPU1 and CPU2 caches and accelerators Acc1–Acc5 connect to the transport interconnect through coherent agent interfaces and non-coherent bridges with proxy caches; the interconnect contains a directory, snoop filters, and a fault controller, and duplicated coherent memory interfaces with CMCs sit in front of DRAM.)
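As a small illustration of the data-path parity idea (a generic even-parity check over a 64-bit flit; real Ncore checkers and the ECC codes used for memory protection are more involved and are not shown here):

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* Even parity over a 64-bit payload: the sender appends one parity bit,
   the receiver recomputes it and flags any single-bit upset on the link. */
static bool parity64(uint64_t x) {
    x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
    x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
    return (bool)(x & 1u);               /* 1 if an odd number of bits are set */
}

bool check_flit(uint64_t payload, bool parity_bit_received) {
    /* true = payload and parity agree; false = a transmission error was detected */
    return parity64(payload) == parity_bit_received;
}

int main(void) {
    uint64_t flit = 0xDEADBEEFCAFEF00DULL;        /* example payload (assumed) */
    bool p = parity64(flit);                      /* parity bit sent alongside */
    printf("parity ok: %d\n", check_flit(flit, p));   /* prints 1 */
    return 0;
}
```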
24
Neural Network SoCs: Takeaways
Deep learning is being implemented with neural networks in custom SoCs.
Hardware acceleration and optimized dataflow, to both on-chip and off-chip resources, are required to optimize performance and power efficiency.
The more specific the use case, the more opportunity there is to use many different types of hardware accelerators: general-purpose systems will have few types, while more specialized systems can have five or more types of hardware accelerators implementing a custom pipeline.
As the number of hardware accelerators increases, managing the data flow in software becomes more difficult, which drives the need to integrate accelerators in hardware-coherent systems.
Functional safety is a must for autonomous-driving edge inference.
25
Thank You regis.gaillard@arteris.com