Samira Khan University of Virginia Feb 4, 2019

1 Samira Khan University of Virginia Feb 4, 2019
ADVANCED COMPUTER ARCHITECTURE ML Accelerators: Why Samira Khan University of Virginia Feb 4, 2019 The content and concept of this course are adapted from CMU ECE 740

2 AGENDA Review from last lecture
Single core->multi core->accelerator ML accelerators: why?

3 LOGISTICS Project list
Posted on Piazza
Be prepared to spend time on the project
Sample project proposals from many different years
Project proposal due on Feb 11, 2019
Project proposal presentations: Feb 13, 2019
You can present using your own laptop
Groups: 1 or 2 students

4 Project Proposal
Problem: Clearly define the problem you are trying to solve
Novelty: Did any other work try to solve the problem? How did they solve it? What are the shortcomings?
Key Idea: What is the initial idea? Why do you think it will work? How is your approach different from the prior work?
Methodology: How will you test and evaluate your idea? What tools or simulators will you use? What experiments do you need to run to prove or disprove your idea?
Plan: Describe the steps to finish your project. What will you accomplish at each milestone? What must you finish? Can you do more? If you finish, can you submit the work to a conference? Which conference do you think is the best fit?

5 LITERATURE SURVEY Goal: Critically analyze related work to your project Pick 2-3 papers related to your project Use the same format as the reviews What is the problem the paper is solving What is the key insight What are the advantages and disadvantages How can you do better Will become the related work in your proposal

6 FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
  Array processor
  Vector processor
MISD: Multiple instructions operate on single data element
  Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor
  Multithreaded processor

7 WHY SYSTOLIC ARCHITECTURES?
Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements
  Different people work on the same car
  Many cars are assembled simultaneously
Why? Special purpose accelerators/architectures need
  Simple, regular design (keep # unique parts small and regular)
  High concurrency → high performance
  Balanced computation and I/O (memory) bandwidth

8 SYSTOLIC ARRAYS: PROS AND CONS
Advantage: Specialized (computation needs to fit PE organization/functions) → improved efficiency, simple design, high concurrency/performance → good to do more with less memory bandwidth requirement
Downside: Specialized → not generally applicable because computation needs to fit the PE functions/organization

9 MULTI-CORE Idea: Put multiple processors on the same die.
Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area What else could you do with the die area you dedicate to multiple processors? Have a bigger, more powerful core Have larger caches in the memory hierarchy Integrate platform components on chip (e.g., network interface, memory controllers)

10 WHY MULTI-CORE? Alternative: Bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc + Improves single-thread performance transparently to programmer, compiler - Very difficult to design (Scalable algorithms for improving single-thread performance elusive) - Power hungry – many out-of-order execution structures consume significant power/area when scaled. Why? - Diminishing returns on performance - Does not significantly help memory-bound application performance (Scalable algorithms for this elusive)

11 MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand

12 WHY MULTI-CORE? Alternative: Bigger caches
+ Improves single-thread performance transparently to programmer, compiler + Simple to design - Diminishing single-thread performance returns from cache size. Why? - Multiple levels complicate memory hierarchy

13 CACHE VS. CORE

14 WHY MULTI-CORE? Alternative: Integrate platform components on chip instead + Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller) - Not all applications benefit (e.g., CPU intensive code sections)

15 Multicore Decade?
We have relied on multicore scaling for over five years. How much longer will it be our primary performance scaling technique?
[Figure: timeline of Pentium Extreme (Dual-Core), Core 2 (Quad-Core), i7 980x (Hex-Core), followed by "?"]

16 Finding Optimal Multicore Designs
Comprehensive design space: Fixed area budget Fixed power budget Two sets of CMOS scaling projections Optimal core and diverse multicore organizations Parallel benchmarks For next 5 technology generations, find the best performing multicore from a comprehensive design space search for each of the PARSEC benchmarks

17 Symmetric Multicore Projections
Symmetric multicores alone will not sustain the multicore era.
[Figure: speedup projections in 10 years; labels: 18x, 3.4x]

18 Asymmetric Topologies
Multicore Solutions Asymmetric Topologies 3.5x

19 Multicore Solutions Dynamic Topologies 3.5x
[Chakraborty (2008), Suleman et al (2009)]

20 Composed/Fused Topologies
Multicore Solutions Composed/Fused Topologies 3.7x [Ipek et al (2007), Kim et al (2007)]

21 Multicore Solutions 2.7x

22 Multicore Era Projections
18x 3.7x The best designs speed up 14% per year rather than the recent trend of 34% per year
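The two annual rates on this slide can be checked against its 18x and 3.7x labels by compounding them. The sketch below assumes a 10-year horizon (taken from the "in 10 years" label on the symmetric multicore projection slide); the function name is illustrative only.

```python
def compound_speedup(annual_rate: float, years: int) -> float:
    """Cumulative speedup after `years` of improving by `annual_rate` each year."""
    return (1.0 + annual_rate) ** years


# Assumption: a 10-year horizon, matching the earlier projection slide.
YEARS = 10
projected = compound_speedup(0.14, YEARS)   # best multicore designs
historical = compound_speedup(0.34, YEARS)  # recent trend

print(f"14% per year over {YEARS} years: {projected:.1f}x")
print(f"34% per year over {YEARS} years: {historical:.1f}x")
```

Under that assumption, 14% per year compounds to about 3.7x and 34% per year to a little over 18x, which is consistent with the labels on the slide.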

23 WITH MULTIPLE CORES ON CHIP
What we want: N times the performance with N times the cores when we parallelize an application on N cores What we get: Amdahl’s Law (serial bottleneck) Bottlenecks in the parallel portion

24 CAVEATS OF PARALLELISM
Amdahl’s Law
f: Parallelizable fraction of a program
N: Number of processors
$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{N}}$$
Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
Maximum speedup limited by serial portion: Serial bottleneck
Parallel portion is usually not perfectly parallel
  Synchronization overhead (e.g., updates to shared data)
  Load imbalance overhead (imperfect parallelization)
  Resource sharing overhead (contention among N processors)
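A minimal sketch of Amdahl's Law, Speedup = 1 / ((1 - f) + f / N), with f the parallelizable fraction and N the number of processors as defined on the slide. The function name and the sample value f = 0.95 are illustrative choices, and the model ignores the synchronization, load imbalance, and resource sharing overheads listed above.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Ideal speedup on n processors when a fraction f of the work is parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial_time = 1.0 - f      # portion that cannot be sped up
    parallel_time = f / n      # portion divided evenly across n processors
    return 1.0 / (serial_time + parallel_time)


# Illustrative value: 95% of the program is parallelizable.
for n in (2, 16, 1024):
    print(f"N = {n:5d}: speedup = {amdahl_speedup(0.95, n):.2f}")
```

As N grows, the speedup approaches 1 / (1 - f), so with f = 0.95 no number of cores gets past 20x. This is the serial bottleneck.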

25 THE PROBLEM: SERIALIZED CODE SECTIONS
Many parallel programs cannot be parallelized completely
Causes of serialized code sections
  Sequential portions (Amdahl’s “serial part”)
  Critical sections
  Barriers
Serialized code sections
  Reduce performance
  Limit scalability
  Waste energy

26 Why Diminishing Returns?
Transistor area is still scaling Voltage and capacitance scaling have slowed Result: designs are power, not area, limited
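The link between slowed voltage scaling and power-limited designs can be seen from the standard CMOS dynamic power relation. This is background, not something stated on the slide; the symbols α (switching activity factor) and f_clk (clock frequency) are introduced here for illustration.

```latex
P_{\text{dyn}} = \alpha \, C \, V^{2} \, f_{\text{clk}}
```

While the supply voltage V shrank with every technology generation, the V² term offset the growing number of transistors per unit area. Once V stops shrinking, more transistors still fit on the die, but a fixed power budget runs out before the area does.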

27 Dark Silicon Sources of Dark Silicon: Power + Limited Parallelism
[Figure: dark silicon at 22 nm and at 8 nm; percentages shown in the figure: 71%, 51%, 26%, 17%]

28 Multicore performance gains are limited
Conclusions Multicore performance gains are limited Need at least 18%-40% per generation from architecture alone without additional power ? Unicore Era Multicore Era

29 Specialization Innovation Efficiency

30 NN Accelerators

31 How Does the Brain Work?
The basic computational unit of the brain is a neuron → 86B neurons in the brain
Neurons are connected with nearly 10^14 – 10^15 synapses
Neurons receive input signals from dendrites and produce an output signal along the axon, which interacts with the dendrites of other neurons via synaptic weights
Synaptic weights: learnable, and control influence strength
Image Source: Stanford

32 Neural Networks: Weighted Sum
Image Source: Stanford

33 Many Weighted Sums Image Source: Stanford

34 What is Deep Learning? “Volvo XC90” Image
Image Source: [Lee et al., Comm. ACM 2011]

35 Why is Deep Learning Hot Now?
Big Data Availability
  350M images uploaded per day
  2.5 Petabytes of customer data hourly
  300 hours of video uploaded every minute
GPU Acceleration
New ML Techniques

36 ImageNet Challenge
Image Classification Task: 1.2M training images • 1000 object categories
Object Detection Task: 456k training images • 200 object categories

37 ImageNet: Image Classification Task
[Figure: Top 5 Classification Error (%), axis from 5 to 30, comparing hand-crafted feature-based designs, deep CNN-based designs, and human performance; annotation: large error rate reduction due to Deep CNN] [Russakovsky et al., IJCV 2015]

38 GPU Usage for ImageNet Challenge

39 Established Applications
Image
  Classification: image to object class
  Recognition: same as classification (except for faces)
  Detection: assigning bounding boxes to objects
  Segmentation: assigning object class to every pixel
Speech & Language
  Speech Recognition: audio to text
  Translation
  Natural Language Processing: text to meaning
  Audio Generation: text to audio
Games

40 Deep Learning on Games Google DeepMind AlphaGo

41 Emerging Applications
Medical (Cancer Detection, Pre-Natal)
Finance (Trading, Energy Forecasting, Risk)
Infrastructure (Structure Safety and Traffic)
Weather Forecasting and Event Detection

42 Deep Learning for Self-driving Cars

43 DNN Terminology 101
Neurons
Image Source: Stanford

44 DNN Terminology 101
Synapses
Image Source: Stanford

45 DNN Terminology 101
Each synapse has a weight for neuron activation
$$Y_j = \text{activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$$
[Figure: inputs X1 to X3 connected to outputs Y1 to Y4 through weights W11 to W34]
Image Source: Stanford
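The weighted sum on this slide, Y_j = activation(sum over i of W_ij × X_i), can be sketched in a few lines for the 3-input, 4-output layer in the figure. The slide does not name the activation function, so ReLU is used here as an assumption, and the weight and input values are made up for illustration.

```python
def relu(value: float) -> float:
    """Assumed activation function; the slide only says "activation"."""
    return max(0.0, value)


def layer_forward(x, w, activation=relu):
    """Compute Y_j = activation(sum_i W_ij * X_i).

    x: list of inputs X_i
    w: w[i][j] is the weight of the synapse from input i to output j
    """
    num_outputs = len(w[0])
    return [
        activation(sum(w[i][j] * x[i] for i in range(len(x))))
        for j in range(num_outputs)
    ]


# Illustrative values: 3 inputs (X1..X3), 4 outputs (Y1..Y4).
x = [1.0, 2.0, 3.0]
w = [
    [0.5, -1.0, 0.0, 1.0],   # W11..W14
    [0.0, 0.5, 1.0, -1.0],   # W21..W24
    [1.0, 0.0, -0.5, 0.0],   # W31..W34
]
y = layer_forward(x, w)
print(y)
```

Each output needs one multiply-and-accumulate per input, so this fully connected layer performs 3 × 4 = 12 MACs.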

46 DNN Terminology 101
Weight Sharing: multiple synapses use the same weight value
$$Y_j = \text{activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$$
[Figure: inputs X1 to X3 connected to outputs Y1 to Y4 through weights W11 to W34]
Image Source: Stanford

47 DNN Terminology 101 Layer 1 L1 Neuron outputs a.k.a. Activations
L1 Neuron inputs e.g. image pixels Image Source: Stanford

48 DNN Terminology 101 Layer 2 L2 Input Activations L2 Output Activations
Image Source: Stanford

49 DNN Terminology 101 Fully-Connected: all i/p neurons connected to all o/p neurons Sparsely-Connected Image Source: Stanford

50 DNN Terminology 101 Feedback Feed Forward Image Source: Stanford

51 Deep Convolutional Neural Networks
Modern Deep CNN: 5 – 1000 Layers
[Figure: CONV Layer → Low-Level Features → CONV Layer → High-Level Features → FC Layer → Classes; label near the FC Layer: 1 – 3 Layers]

52 Deep Convolutional Neural Networks
[Figure: CONV Layer → Low-Level Features → CONV Layer → High-Level Features → FC Layer → Classes; CONV Layer detail: Convolution, Activation]

53 Deep Convolutional Neural Networks
[Figure: CONV Layer → Low-Level Features → CONV Layer → High-Level Features → FC Layer → Classes; FC Layer detail: Fully Connected, Activation]

54 Deep Convolutional Neural Networks
Optional layers in between CONV and/or FC layers: Normalization, Pooling
[Figure: CONV Layer → NORM Layer → POOL Layer → CONV Layer → High-Level Features → FC Layer → Classes]

55 Deep Convolutional Neural Networks
Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption
[Figure: CONV Layer → NORM Layer → POOL Layer → CONV Layer → High-Level Features → FC Layer → Classes]

56 Convolution (CONV) Layer
Input fmap: a plane of input activations, a.k.a. input feature map (fmap), of size H × W
Filter (weights): size R × S

57 Convolution (CONV) Layer
Element-wise multiplication of the filter (weights, R × S) with a window of the input fmap (H × W)

58 Convolution (CONV) Layer
Element-wise multiplication of the filter (weights, R × S) with a window of the input fmap (H × W)
Partial sum (psum) accumulation produces one output activation in the output fmap (E × F)

59 Convolution (CONV) Layer
Sliding window processing: the filter (weights, R × S) moves across the input fmap (H × W), producing one output activation per position in the output fmap (E × F)

60 Convolution (CONV) Layer
Many input channels (C): the input fmap (C × H × W) and the filter (C × R × S) each have C channels; the output fmap is E × F

61 Convolution (CONV) Layer
Many output channels (M): many filters (M, each C × R × S) applied to the input fmap (C × H × W) produce an output fmap with M channels (M × E × F)

62 Convolution (CONV) Layer
Many input fmaps (N, each C × H × W) and M filters (each C × R × S) produce many output fmaps (N, each M × E × F)
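The dimensions built up over the last few slides (N fmaps, C input channels, M filters, H × W inputs, R × S filters, E × F outputs) fit together as a seven-deep loop nest. The sketch below is a naive reference version that assumes stride 1, no padding, and no bias, none of which the slides specify; the function name is illustrative.

```python
def conv_layer(ifmaps, filters):
    """Naive CONV layer (assumes stride 1, no padding, no bias).

    ifmaps[n][c][h][w]  : N input fmaps, C channels, H x W each
    filters[m][c][r][s] : M filters, C channels, R x S each
    returns ofmaps[n][m][e][f], with E = H - R + 1 and F = W - S + 1
    """
    N, C = len(ifmaps), len(ifmaps[0])
    H, W = len(ifmaps[0][0]), len(ifmaps[0][0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1

    ofmaps = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                      # many input fmaps
        for m in range(M):                  # many output channels
            for e in range(E):              # sliding window, vertical
                for f in range(F):          # sliding window, horizontal
                    psum = 0.0
                    for c in range(C):      # many input channels
                        for r in range(R):
                            for s in range(S):
                                # one multiply-and-accumulate (MAC)
                                psum += ifmaps[n][c][e + r][f + s] * filters[m][c][r][s]
                    ofmaps[n][m][e][f] = psum
    return ofmaps


# Tiny example: one 3 x 3 single-channel fmap, one 2 x 2 filter of ones.
ifmaps = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
filters = [[[[1, 1], [1, 1]]]]
print(conv_layer(ifmaps, filters))
```

The loop nest performs N × M × E × F × C × R × S MACs in total, and every MAC touches a weight, an activation, and a partial sum, which is why data movement becomes the central concern.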

63 How should we build an accelerator for the convolution layer?

64 What are the ways to take advantage of parallelism?
SIMD Dataflow

65 Highly-Parallel Compute Paradigms
Temporal Architecture (SIMD/SIMT): Memory Hierarchy, Register File, Control, array of ALUs
Spatial Architecture (Dataflow Processing): Memory Hierarchy, array of ALUs

66 Memory Access is the Bottleneck
Memory Read → MAC* → Memory Write
Memory Read: filter weight, fmap activation, partial sum
Memory Write: updated partial sum
* multiply-and-accumulate

67 Memory Access is the Bottleneck
Memory Read (DRAM) → MAC* (ALU) → Memory Write (DRAM)
* multiply-and-accumulate
Worst Case: all memory R/W are DRAM accesses
Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required
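The 2896M figure follows from counting four memory accesses per MAC: three reads (filter weight, fmap activation, partial sum) and one write (updated partial sum), as shown on the previous slide. A quick check, with illustrative constant and function names:

```python
MACS_ALEXNET = 724_000_000  # from the slide

READS_PER_MAC = 3   # filter weight, fmap activation, partial sum
WRITES_PER_MAC = 1  # updated partial sum


def worst_case_dram_accesses(macs: int) -> int:
    """DRAM accesses if every operand read and result write goes to DRAM."""
    return macs * (READS_PER_MAC + WRITES_PER_MAC)


print(worst_case_dram_accesses(MACS_ALEXNET) // 1_000_000, "M")  # prints "2896 M"
```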

68

69 What are the ways to address memory bottleneck?
Memory Hierarchy Exploit locality

70 Memory Access is the Bottleneck
Memory Read MAC Memory Write DRAM Mem ALU Mem DRAM Extra levels of local memory hierarchy

71 Memory Access is the Bottleneck
[Figure: Memory Read (DRAM → Mem) → MAC (ALU) → Memory Write (Mem → DRAM)]
Extra levels of local memory hierarchy
Opportunities: (1) data reuse

72 Types of Data Reuse in DNN
Convolutional Reuse: CONV layers only (sliding window); one input fmap, one filter
  Reuse: activations, filter weights

73 Types of Data Reuse in DNN
Convolutional Reuse: CONV layers only (sliding window); one input fmap, one filter
  Reuse: activations, filter weights
Fmap Reuse: CONV and FC layers; one input fmap, multiple filters
  Reuse: activations

74 Types of Data Reuse in DNN
Convolutional Reuse: CONV layers only (sliding window); one input fmap, one filter
  Reuse: activations, filter weights
Fmap Reuse: CONV and FC layers; one input fmap, multiple filters
  Reuse: activations
Filter Reuse: CONV and FC layers (batch size > 1); multiple input fmaps, one filter
  Reuse: filter weights
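The three kinds of reuse can be turned into rough per-value reuse counts using the CONV layer dimensions from the earlier slides. This is a sketch under assumptions the slides do not state (stride 1, and counting an interior activation that is covered by every filter position); the function name, dictionary keys, and sample dimensions are illustrative.

```python
def reuse_factors(N: int, M: int, E: int, F: int, R: int, S: int) -> dict:
    """Approximate number of times one data value can be reused in a CONV layer.

    N: batch size (input fmaps), M: number of filters,
    E x F: output fmap size, R x S: filter size.
    """
    return {
        # Convolutional reuse: a weight is applied at every output position
        # of one fmap plane as the window slides.
        "convolutional_weight": E * F,
        # Convolutional reuse: an interior activation is covered by up to
        # R * S window positions (assumes stride 1; fewer at the borders).
        "convolutional_activation_max": R * S,
        # Fmap reuse: each input activation is used by all M filters.
        "fmap_activation": M,
        # Filter reuse: each weight is used for all N fmaps in the batch.
        "filter_weight": N,
    }


# Illustrative dimensions, not taken from any specific network.
factors = reuse_factors(N=4, M=8, E=5, F=5, R=3, S=3)
for kind, count in factors.items():
    print(f"{kind}: {count}")
```

Every reuse that is served from local storage instead of DRAM removes one DRAM read, which is the opportunity the following slides quantify.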

75 Memory Access is the Bottleneck
[Figure: Memory Read (DRAM → Mem) → MAC (ALU) → Memory Write (Mem → DRAM)]
Extra levels of local memory hierarchy
Opportunities: (1) data reuse
1) Can reduce DRAM reads of filter/fmap by up to 500×**
** AlexNet CONV layers

76 Memory Access is the Bottleneck
[Figure: Memory Read (DRAM → Mem) → MAC (ALU) → Memory Write (Mem → DRAM)]
Extra levels of local memory hierarchy
Opportunities: (1) data reuse, (2) local accumulation
1) Can reduce DRAM reads of filter/fmap by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

77 Memory Access is the Bottleneck
[Figure: Memory Read (DRAM → Mem) → MAC (ALU) → Memory Write (Mem → DRAM)]
Extra levels of local memory hierarchy
Opportunities: (1) data reuse, (2) local accumulation
1) Can reduce DRAM reads of filter/fmap by up to 500×
2) Partial sum accumulation does NOT have to access DRAM
Example: DRAM access in AlexNet can be reduced from 2896M to 61M (best case)
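For a sense of scale, the best-case AlexNet numbers on this slide can be expressed as an overall reduction factor and as DRAM accesses per MAC. The arithmetic below uses only the slide's figures (2896M, 61M, and the 724M MACs quoted earlier); the constant names are illustrative.

```python
WORST_CASE_DRAM_M = 2896  # millions of DRAM accesses, every access goes to DRAM
BEST_CASE_DRAM_M = 61     # millions of DRAM accesses, best case with local memory
ALEXNET_MACS_M = 724      # millions of MACs

# Overall reduction in DRAM traffic across all data types.
reduction = WORST_CASE_DRAM_M / BEST_CASE_DRAM_M

# Average DRAM accesses still needed per MAC in the best case.
dram_per_mac = BEST_CASE_DRAM_M / ALEXNET_MACS_M

print(f"overall reduction: {reduction:.1f}x")
print(f"DRAM accesses per MAC (best case): {dram_per_mac:.3f}")
```

The overall reduction comes out near 47x, compared with 4 DRAM accesses per MAC in the worst case. The "up to 500×" figure is a different quantity: it applies only to filter/fmap reads in the CONV layers.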

78 Spatial Architecture for DNN
DRAM → Global Buffer (100 – 500 kB) → array of Processing Elements (PEs)
Local Memory Hierarchy
  Global Buffer
  Direct inter-PE network
  PE-local memory (RF)
Processing Element (PE): ALU, Reg File (0.5 – 1.0 kB), Control

79 Low-Cost Local Data Access
[Figure: DRAM → Global Buffer → PE array; a PE fetches data to run a MAC at its ALU]
Normalized Energy Cost*
  ALU: 1× (Reference)
  RF (0.5 – 1.0 kB) → ALU
  PE → ALU (NoC: 200 – PEs)
  Buffer (100 – 500 kB) → ALU
  DRAM → ALU: 200×
* measured from a commercial 65nm process

80 Low-Cost Local Data Access
How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage?
Normalized Energy Cost*
  ALU: 1× (Reference)
  RF (0.5 – 1.0 kB) → ALU
  PE → ALU (NoC: 200 – PEs)
  Buffer (100 – 500 kB) → ALU
  DRAM → ALU: 200×
* measured from a commercial 65nm process

81 Low-Cost Local Data Access
How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage?
Specialized processing dataflow required!
Normalized Energy Cost*
  ALU: 1× (Reference)
  RF (0.5 – 1.0 kB) → ALU
  PE → ALU (NoC: 200 – PEs)
  Buffer (100 – 500 kB) → ALU
  DRAM → ALU: 200×
* measured from a commercial 65nm process

82 Dataflow Taxonomy
Weight Stationary (WS)
Output Stationary (OS)
No Local Reuse (NLR)
[Chen et al., ISCA 2016]

83 Weight Stationary (WS)
Minimize weight read energy consumption: maximize convolutional and filter reuse of weights
Broadcast activations and accumulate psums spatially across the PE array
[Figure: Global Buffer exchanging activations and psums with a row of PEs that hold weights W0 to W7]
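A toy model can make the weight stationary idea concrete. The sketch below simplifies the dataflow to a 1D convolution with one PE per filter weight; it is a functional illustration under those assumptions, not a description of any specific accelerator, and all names in it are hypothetical.

```python
def weight_stationary_conv1d(weights, activations):
    """1D convolution in a weight stationary style (simplified model).

    Each PE loads one weight from the global buffer once and keeps it.
    Every step, one activation is broadcast to all PEs; each PE multiplies
    it by its stationary weight and adds the product to the psum of the
    output position it currently contributes to.
    Returns (outputs, number of weight reads from the global buffer).
    """
    num_pes = len(weights)
    num_outputs = len(activations) - num_pes + 1

    pe_weight = list(weights)   # weight held in each PE's register file
    weight_reads = num_pes      # each weight is fetched exactly once

    psums = [0.0] * num_outputs
    for t, act in enumerate(activations):      # broadcast activation t
        for pe in range(num_pes):
            out = t - pe                       # output this PE updates now
            if 0 <= out < num_outputs:
                psums[out] += pe_weight[pe] * act  # psum accumulates across PEs
    return psums, weight_reads


outputs, weight_reads = weight_stationary_conv1d([1, 2, 3], [1, 2, 3, 4, 5])
print(outputs, weight_reads)
```

In this model the number of weight reads stays equal to the number of PEs no matter how long the activation stream is. That is the property the dataflow is built around, and it comes at the price of moving activations and psums more often.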

