1
ADVANCED COMPUTER ARCHITECTURE
ML Accelerators: Why
Samira Khan, University of Virginia, Feb 4, 2019
The content and concept of this course are adapted from CMU ECE 740
2
AGENDA Review from last lecture
Single core -> multi-core -> accelerator. ML accelerators: why?
3
LOGISTICS Project list
Posted in Piazza. Be prepared to spend time on the project. Sample project proposals from many different years. Project Proposal due on Feb 11, 2019. Project Proposal Presentations: Feb 13, 2019. You can present using your own laptop. Groups: 1 or 2 students
4
Project Proposal Problem: Clearly define what is the problem you are trying to solve. Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings? Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from the prior work? Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What are the experiments you need to do to prove/disprove your idea? Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What are the things you absolutely must finish? Can you do more? If you finish, can you submit it to a conference? Which conference do you think is a better fit for the work?
5
LITERATURE SURVEY Goal: Critically analyze work related to your project. Pick 2-3 papers related to your project. Use the same format as the reviews: What is the problem the paper is solving? What is the key insight? What are the advantages and disadvantages? How can you do better? This will become the related work in your proposal.
6
FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966 SISD: Single instruction operates on single data element SIMD: Single instruction operates on multiple data elements Array processor Vector processor MISD: Multiple instructions operate on single data element Closest form: systolic array processor, streaming processor MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) Multiprocessor Multithreaded processor
7
WHY SYSTOLIC ARCHITECTURES?
Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory. Similar to an assembly line of processing elements: different people work on the same car, and many cars are assembled simultaneously. Why? Special purpose accelerators/architectures need: a simple, regular design (keep the number of unique parts small and regular); high concurrency -> high performance; balanced computation and I/O (memory) bandwidth.
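To make the rhythm concrete, here is a minimal Python sketch (an illustration under simplifying assumptions, not any particular accelerator's design) of a 1-D chain of PEs: each PE holds one weight, partial sums march one PE per cycle, every PE performs one multiply-accumulate per cycle, and one finished output emerges from the end of the chain per cycle once the pipeline fills. For simplicity the activations are indexed from a list; a real systolic array would also stage them through delay registers.

```python
def systolic_conv1d(w, x):
    """Pipelined 1-D convolution y[j] = sum_i w[i] * x[j + i] on a PE chain."""
    num_pe, num_out = len(w), len(x) - len(w) + 1
    psums = [0.0] * num_pe        # partial sum currently held by each PE
    tags = [None] * num_pe        # which output index each partial sum belongs to
    outputs = [0.0] * num_out

    for cycle in range(num_out + num_pe - 1):
        # Partial sums advance one PE to the right each cycle (the "systole").
        psums[1:], tags[1:] = psums[:-1], tags[:-1]
        # A fresh partial sum for the next output enters at PE 0.
        psums[0], tags[0] = 0.0, (cycle if cycle < num_out else None)
        # Every busy PE performs exactly one multiply-accumulate this cycle.
        for i in range(num_pe):
            if tags[i] is not None:
                psums[i] += w[i] * x[tags[i] + i]
        # The rightmost PE retires one finished output per cycle once full.
        if tags[-1] is not None:
            outputs[tags[-1]] = psums[-1]
    return outputs

print(systolic_conv1d([1, 2, 3], [1, 0, 2, 1, 3]))   # [7.0, 7.0, 13.0]
```

Note the balance the slide asks for: each PE only touches its own weight and the streaming inputs, while partial sums never leave the chain until they are complete.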
8
SYSTOLIC ARRAYS: PROS AND CONS
Advantage: Specialized (computation needs to fit the PE organization/functions) -> improved efficiency, simple design, high concurrency/performance; good at doing more with a lower memory bandwidth requirement. Downside: Specialized -> not generally applicable, because the computation needs to fit the PE functions/organization.
9
MULTI-CORE Idea: Put multiple processors on the same die.
Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area What else could you do with the die area you dedicate to multiple processors? Have a bigger, more powerful core Have larger caches in the memory hierarchy Integrate platform components on chip (e.g., network interface, memory controllers)
10
WHY MULTI-CORE? Alternative: Bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc. + Improves single-thread performance transparently to programmer, compiler - Very difficult to design (scalable algorithms for improving single-thread performance remain elusive) - Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why? - Diminishing returns on performance - Does not significantly help memory-bound application performance (scalable algorithms for this remain elusive)
11
MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages + Simpler cores -> more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures) + Higher system throughput on multiprogrammed workloads -> reduced context switches + Higher system throughput in parallel applications Multi-core disadvantages - Requires parallel tasks/threads to improve performance (parallel programming) - Resource sharing can reduce single-thread performance - Shared hardware resources need to be managed - Number of pins limits data supply for increased demand
12
WHY MULTI-CORE? Alternative: Bigger caches
+ Improves single-thread performance transparently to programmer, compiler + Simple to design - Diminishing single-thread performance returns from cache size. Why? - Multiple levels complicate memory hierarchy
13
CACHE VS. CORE
14
WHY MULTI-CORE? Alternative: Integrate platform components on chip instead + Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller) - Not all applications benefit (e.g., CPU intensive code sections)
15
Multicore Decade? We have relied on multicore scaling for over five years (Pentium Extreme Dual-Core -> Core 2 Quad-Core -> i7 980x Hex-Core). How much longer will it be our primary performance scaling technique?
16
Finding Optimal Multicore Designs
Comprehensive design space: fixed area budget, fixed power budget, two sets of CMOS scaling projections, optimal core and diverse multicore organizations, parallel benchmarks. For the next 5 technology generations, find the best-performing multicore from a comprehensive design space search for each of the PARSEC benchmarks.
17
Symmetric Multicore Projections
Projected speedup: only 3.4x in 10 years, versus the 18x the historical trend would deliver. Symmetric multicores alone will not sustain the multicore era.
18
Asymmetric Topologies
Multicore Solutions: asymmetric topologies reach 3.5x.
19
Multicore Solutions: dynamic topologies reach 3.5x.
[Chakraborty (2008), Suleman et al (2009)]
20
Composed/Fused Topologies
Multicore Solutions: composed/fused topologies reach 3.7x. [Ipek et al (2007), Kim et al (2007)]
21
Multicore Solutions 2.7x
22
Multicore Era Projections
Projected speedup: at best 3.7x, versus the 18x the historical trend would deliver. The best designs speed up 14% per year rather than the recent trend of 34% per year.
23
WITH MULTIPLE CORES ON CHIP
What we want: N times the performance with N times the cores when we parallelize an application on N cores What we get: Amdahl’s Law (serial bottleneck) Bottlenecks in the parallel portion
24
CAVEATS OF PARALLELISM
Amdahl's Law: f = parallelizable fraction of a program, N = number of processors. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
Speedup = 1 / ((1 - f) + f / N)
Maximum speedup is limited by the serial portion: the serial bottleneck. The parallel portion is usually not perfectly parallel: synchronization overhead (e.g., updates to shared data), load imbalance overhead (imperfect parallelization), resource sharing overhead (contention among N processors).
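A quick numeric check of the formula (a small illustrative Python helper, not part of the original slides) shows how even a small serial fraction caps the speedup:

```python
def amdahl_speedup(f, n):
    """Amdahl's Law: f = parallelizable fraction, n = number of processors."""
    return 1.0 / ((1.0 - f) + f / n)

# Even with 95% of the program parallelized, 1024 cores give under 20x.
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))   # 3.5, 9.1, 15.4, 19.6
```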
25
THE PROBLEM: SERIALIZED CODE SECTIONS
Many parallel programs cannot be parallelized completely Causes of serialized code sections Sequential portions (Amdahl’s “serial part”) Critical sections Barriers Serialized code sections Reduce performance Limit scalability Waste energy
26
Why Diminishing Returns?
Transistor area is still scaling, but voltage and capacitance scaling have slowed. Result: designs are power-limited, not area-limited.
27
Dark Silicon. Sources of dark silicon: power + limited parallelism. The dark (unusable) fraction of the chip grows from roughly 17%-26% at 22 nm to roughly 51%-71% at 8 nm, depending on the scaling projection.
28
Multicore performance gains are limited
Conclusions: Multicore performance gains are limited. Architecture alone needs to deliver at least 18%-40% performance per generation, without additional power.
29
Specialization Innovation Efficiency
30
NN Accelerators
31
How Does the Brain Work? The basic computational unit of the brain is a neuron; there are ~86B neurons in the brain. Neurons are connected with nearly 10^14 – 10^15 synapses. Neurons receive input signals from dendrites and produce output signals along the axon, which interact with the dendrites of other neurons via synaptic weights. Synaptic weights are learnable and control the strength of the influence. Image Source: Stanford
32
Neural Networks: Weighted Sum
Image Source: Stanford
33
Many Weighted Sums Image Source: Stanford
34
What is Deep Learning? “Volvo XC90” Image
Image Source: [Lee et al., Comm. ACM 2011]
35
Why is Deep Learning Hot Now?
Big data availability (350M images uploaded per day, 2.5 petabytes of customer data per hour, 300 hours of video uploaded every minute), GPU acceleration, and new ML techniques.
36
ImageNet Challenge Image Classification Task:
1.2M training images • 1000 object categories Object Detection Task: 456k training images • 200 object categories
37
ImageNet: Image Classification Task
Top-5 classification error (%): a large error rate reduction due to deep CNNs. Hand-crafted feature-based designs had much higher error; deep CNN-based designs have since dropped below the human error rate. [Russakovsky et al., IJCV 2015]
38
GPU Usage for ImageNet Challenge
39
Established Applications
Image: Classification (image to object class); Recognition (same as classification, except for faces); Detection (assigning bounding boxes to objects); Segmentation (assigning an object class to every pixel). Speech & Language: Speech Recognition (audio to text); Translation; Natural Language Processing (text to meaning); Audio Generation (text to audio). Games.
40
Deep Learning on Games Google DeepMind AlphaGo
41
Emerging Applications
Medical (cancer detection, pre-natal), Finance (trading, energy forecasting, risk), Infrastructure (structure safety and traffic), Weather forecasting and event detection
42
Deep Learning for Self-driving Cars
43
DNN Terminology 101: Neurons. Image Source: Stanford
44
DNN Terminology 101: Synapses
Image Source: Stanford
45
DNN Terminology 101: Each synapse has a weight for neuron activation: Y_j = activation( sum_{i=1..3} W_ij * X_i ). In the diagram, inputs X1-X3 connect to outputs Y1-Y4 through weights W11-W34. Image Source: Stanford
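A minimal NumPy sketch of this weighted sum (array names and the ReLU activation are illustrative choices, not from the slides): each output Y_j applies an activation to the sum of inputs X_i weighted by W_ij.

```python
import numpy as np

def weighted_sum_layer(W, X, activation=lambda v: np.maximum(v, 0.0)):
    """Y_j = activation(sum_i W_ij * X_i); ReLU used here as an example activation."""
    return activation(W.T @ X)

X = np.array([0.5, -1.0, 2.0])      # 3 input neurons (X1..X3)
W = np.random.randn(3, 4)           # weights W_ij: 3 inputs x 4 outputs (W11..W34)
Y = weighted_sum_layer(W, X)        # 4 output activations (Y1..Y4)
print(Y.shape)                      # (4,)
```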
46
DNN Terminology 101: Weight Sharing: multiple synapses use the same weight value in Y_j = activation( sum_{i=1..3} W_ij * X_i ). Image Source: Stanford
47
DNN Terminology 101: Layer 1. L1 neuron inputs (e.g., image pixels); L1 neuron outputs, a.k.a. activations. Image Source: Stanford
48
DNN Terminology 101: Layer 2. L2 input activations; L2 output activations.
Image Source: Stanford
49
DNN Terminology 101: Fully-connected (all input neurons connected to all output neurons) vs. sparsely-connected. Image Source: Stanford
50
DNN Terminology 101: Feed forward vs. feedback. Image Source: Stanford
51
Deep Convolutional Neural Networks
Modern Deep CNN: 5 – 1000 layers. A chain of CONV layers extracts low-level and then high-level features, followed by 1 – 3 FC layers that produce the classes.
52
Deep Convolutional Neural Networks
CONV layer: a convolution followed by an activation.
53
Deep Convolutional Neural Networks
FC layer: a fully-connected weighted sum followed by an activation.
54
Deep Convolutional Neural Networks
Optional layers in between CONV and/or FC layers: normalization (NORM) and pooling (POOL).
55
Deep Convolutional Neural Networks
Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption.
56
Convolution (CONV) Layer
A plane of input activations, a.k.a. an input feature map (fmap), of size H × W, and a filter (weights) of size R × S.
57
Convolution (CONV) Layer
The filter is overlaid on the input fmap and element-wise multiplication is performed between the filter weights and the overlapped activations.
58
Convolution (CONV) Layer
The element-wise products are accumulated into a partial sum (psum), producing one output activation of the E × F output fmap.
59
Convolution (CONV) Layer
The filter slides over the input fmap (sliding window processing) to produce the entire E × F output fmap.
60
Convolution (CONV) Layer
Both the input fmap and the filter have many input channels (C); products are accumulated across all C channels.
61
Convolution (CONV) Layer
Many filters (M), each of size C × R × S, are applied to the same input fmap, producing an output fmap with many output channels (M).
62
Convolution (CONV) Layer
Many input fmaps (N), i.e., a batch of size N, are processed with the same filters, producing many output fmaps (N).
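To tie the dimensions together, here is a naive loop-nest sketch of the CONV layer in the notation of these slides (N input fmaps, C input channels, M filters/output channels, H × W inputs, R × S filters, E × F outputs). It is an illustration with stride 1 and no padding; a real implementation would call an optimized library.

```python
import numpy as np

def conv_layer(inp, filters):
    """Naive CONV layer: inp is N x C x H x W, filters is M x C x R x S."""
    N, C, H, W = inp.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1                      # output fmap height/width
    out = np.zeros((N, M, E, F))
    for n in range(N):                                # each input fmap in the batch
        for m in range(M):                            # each filter -> one output channel
            for e in range(E):                        # slide the window vertically...
                for f in range(F):                    # ...and horizontally
                    window = inp[n, :, e:e + R, f:f + S]           # C x R x S
                    out[n, m, e, f] = np.sum(window * filters[m])  # MACs over C, R, S
    return out

out = conv_layer(np.random.randn(2, 3, 8, 8), np.random.randn(4, 3, 3, 3))
print(out.shape)   # (2, 4, 6, 6) -> N x M x E x F
```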
63
How should we build an accelerator for the convolution layer?
64
What are the ways to take advantage of parallelism?
SIMD Dataflow
65
Highly-Parallel Compute Paradigms
Temporal architecture (SIMD/SIMT): a memory hierarchy and a centralized register file feed a grid of ALUs under central control. Spatial architecture (dataflow processing): a memory hierarchy feeds an array of ALUs that pass data directly to one another.
66
Memory Access is the Bottleneck
Each MAC* requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum). *multiply-and-accumulate
67
Memory Access is the Bottleneck
Worst case: all memory reads/writes of the MAC* are DRAM accesses. Example: AlexNet [NIPS 2012] has 724M MACs -> 2896M DRAM accesses required. *multiply-and-accumulate
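The 2896M figure follows directly from four DRAM accesses per MAC (three reads and one write, as on the previous slide); a quick sanity check, assuming the slide's 724M MAC count:

```python
macs = 724e6                 # AlexNet MACs (from the slide)
accesses_per_mac = 3 + 1     # 3 reads (weight, activation, psum) + 1 write
print(macs * accesses_per_mac / 1e6, "M DRAM accesses")   # 2896.0 M DRAM accesses
```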
69
What are the ways to address the memory bottleneck?
Memory Hierarchy Exploit locality
70
Memory Access is the Bottleneck
Add extra levels of local memory hierarchy between the ALU and DRAM (DRAM -> local memory -> ALU -> local memory -> DRAM).
71
Memory Access is the Bottleneck
With extra levels of local memory hierarchy, opportunity (1): data reuse.
72
Types of Data Reuse in DNN
(1) Convolutional reuse (CONV layers only, sliding window): both activations and filter weights are reused across overlapping window positions of the input fmap.
73
Types of Data Reuse in DNN
(2) Fmap reuse (CONV and FC layers): the same input fmap activations are reused across multiple filters.
74
Types of Data Reuse in DNN
(3) Filter reuse (CONV and FC layers, batch size > 1): the same filter weights are reused across multiple input fmaps.
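One way to see convolutional reuse is to count how many sliding-window positions touch each input activation. A small 1-D, stride-1 sketch (illustrative only):

```python
def reuse_counts(input_len, filter_len):
    """How many output positions (sliding windows) read each input element."""
    counts = [0] * input_len
    for j in range(input_len - filter_len + 1):   # each output position
        for i in range(filter_len):               # reads filter_len consecutive inputs
            counts[j + i] += 1
    return counts

print(reuse_counts(input_len=10, filter_len=3))
# [1, 2, 3, 3, 3, 3, 3, 3, 2, 1]: interior activations are read by 3 windows,
# so caching them locally avoids repeated DRAM reads.
```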
75
Memory Access is the Bottleneck
Opportunity (1), data reuse, can reduce DRAM reads of filters/fmaps by up to 500×** (**AlexNet CONV layers).
76
Memory Access is the Bottleneck
Opportunity (2), local accumulation: partial sum accumulation does NOT have to access DRAM (in addition to the up to 500× reduction in filter/fmap DRAM reads from data reuse).
77
Memory Access is the Bottleneck
Combining data reuse (up to 500× fewer filter/fmap DRAM reads) and local psum accumulation, DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case).
78
Spatial Architecture for DNN
Local memory hierarchy: DRAM -> global buffer (100 – 500 kB) -> direct inter-PE network -> PE-local register file (RF, 0.5 – 1.0 kB) inside each processing element (PE), next to its ALU.
79
Low-Cost Local Data Access
Normalized energy cost of fetching data to run a MAC at the ALU (*measured from a commercial 65nm process):
ALU operation (reference): 1×
RF (0.5 – 1.0 kB), PE-local: 1×
NoC (between PEs): 2×
Global buffer (100 – 500 kB): 6×
DRAM: 200×
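These relative costs are why data placement dominates efficiency. A back-of-the-envelope sketch using the normalized costs from the table (the access counts are hypothetical, purely for illustration):

```python
# Normalized energy per data access, in units of one ALU operation (table above).
COST = {"RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def data_movement_energy(accesses):
    """Total data-movement energy for a dict of {memory level: access count}."""
    return sum(COST[level] * count for level, count in accesses.items())

# Hypothetical example: feeding 100 MACs (3 operands each) from the RF vs. DRAM.
print(data_movement_energy({"RF": 300}))     # 300
print(data_movement_energy({"DRAM": 300}))   # 60000 -> 200x more energy
```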
80
Low-Cost Local Data Access
How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage? (Normalized energy costs as in the table above: RF 1×, NoC 2×, buffer 6×, DRAM 200×.)
81
Low-Cost Local Data Access
How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage? A specialized processing dataflow is required!
82
Dataflow Taxonomy: Weight Stationary (WS), Output Stationary (OS), No Local Reuse (NLR). [Chen et al., ISCA 2016]
83
Weight Stationary (WS)
Weights (W0 – W7) are held stationary in the register file of each PE. Goal: minimize weight read energy consumption by maximizing convolutional and filter reuse of weights. Activations are broadcast from the global buffer, and psums are accumulated spatially across the PE array.
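A rough 1-D sketch of the WS idea (a simplification for illustration, not the exact design evaluated in [Chen et al., ISCA 2016]): each PE keeps its weight stationary, every activation is broadcast once to all PEs, and each product is accumulated into the psum of the output it belongs to.

```python
def weight_stationary_conv1d(w, x):
    """1-D convolution y[j] = sum_i w[i] * x[j + i] with a weight-stationary dataflow."""
    num_pe, num_out = len(w), len(x) - len(w) + 1
    psums = [0.0] * num_out
    for t, activation in enumerate(x):    # broadcast x[t] once to all PEs
        for i in range(num_pe):           # PE i reuses its stationary weight w[i]
            j = t - i                     # output this product contributes to
            if 0 <= j < num_out:
                psums[j] += w[i] * activation
    return psums

print(weight_stationary_conv1d([1, 2, 3], [1, 0, 2, 1, 3]))   # [7.0, 7.0, 13.0]
```

Each weight is fetched from the buffer only once and then reused for every activation, which is exactly the reuse WS is designed to maximize.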
84