Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer
Outline: Motivation and Vision; Related Work and Background; Linear Algebra Processor; Power/Performance Analysis; Conclusion and Future Work. 9/30/2014 ©Ardavan Pedram 2012
Processor Trends: Technology scaling has reached physical limits, and power now limits performance. Chips may contain dark silicon: only a fraction of the chip can be active at any time.
Heterogeneous Solutions: Increase power efficiency (GFLOPS/W) with more cores at lower frequency and power, plus specialized cores. Specialized cores offer orders-of-magnitude better power efficiency (GFLOPS/W), but are expensive and have a long time to market. (Figure: Nvidia Tegra System on Chip.)
Linear Algebra Processor Design Goals: the efficiency of full-custom hardware (orders-of-magnitude improvement, approaching the upper limits of the power/performance ratio), combined with the flexibility to execute a whole class of coarse-grain operations; co-optimized and co-designed across all layers; targeting linear algebra applications. Source: Andreas Olofsson.
Linear Algebra Routines: Linear Algebra Package (LAPACK) level: Cholesky and QR factorization. Basic Linear Algebra Subroutines (BLAS) level: general matrix-matrix multiplication (GEMM). Inner kernels: hand-optimized GEMM is often what delivers high performance to many crucial applications.
Outline: Motivation and Vision; Related Work and Background; Linear Algebra Processor; Power/Performance Analysis; Conclusion and Future Work.
GEMM Implementations
CPUs: 95% of peak [Goto et al. 2008][Intel MKL]; Intel quad-core: 40 GFLOPS @ 2.6 GHz
GPUs: 70% of peak ([Nath et al. 2010] Nvidia Fermi, [Volkov et al. 2008] Nvidia Tesla); Nvidia Fermi: 350 GFLOPS @ 1.15 GHz
FPGAs: 99% of peak [Zikari et al. 2007][Zhuo et al. 2008]; Altera Stratix IV: 100 GFLOPS @ 0.4 GHz
Specialized architectures: ClearSpeed CSX: 78% of peak; CSX 700: 75 GFLOPS @ 0.25 GHz; systolic arrays [Lippert et al. 2001]
Common Sources of Inefficiency in Conventional Architectures: CPUs and GPUs: instruction handling, multi-ported register files, cache overheads (tags and coherency), thread scheduling. FPGAs: low area efficiency. Specialized architectures: data communication overheads.
Outline: Motivation and Vision; Related Work and Background; Linear Algebra Processor; Power/Performance Modeling; Generalization; Conclusion and Future Work.
Matrix Multiplication Hierarchy: the fastest general-purpose implementation of GEMM [GotoBLAS]. (Figure: blocking of C += A·B across the memory hierarchy.)
Rank-1 Update: updates a matrix by adding the outer product of two vectors to it. Matrix multiplication as a series of rank-1 updates: let C, A, and B be 4x4, 4xk_c, and k_c x4 matrices. Then C += AB can be computed as: for p = 0 to k_c - 1: C += a_p b_p^T, where a_p is column p of A and b_p^T is row p of B.
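The rank-1-update formulation can be sketched in plain Python (a didactic sketch of the loop above, not the LAC hardware; the function name gemm_rank1 is illustrative):

```python
# GEMM via rank-1 updates: C += A*B, accumulated one outer product at a time.
# Didactic sketch of the slide's loop, not the LAC implementation.

def gemm_rank1(C, A, B):
    """C (m x n) += A (m x kc) * B (kc x n) as a sum of kc rank-1 updates."""
    m, n, kc = len(C), len(C[0]), len(A[0])
    for p in range(kc):          # one rank-1 update per iteration
        for i in range(m):
            for j in range(n):
                # outer product of column p of A with row p of B
                C[i][j] += A[i][p] * B[p][j]
    return C
```

Note that the innermost arithmetic is identical to the classic triple loop; only the loop order changes, exposing one outer product per step, which is what the LAC's PE array consumes.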
Linear Algebra Core (LAC) Design: customized for the rank-1 update, with a 2D arrangement of PEs and broadcast buses; integrates into the memory hierarchy.
Memory Hierarchy
GEMM blocked across the memory hierarchy (figure sequence; main memory → on-chip memory → core local stores):
C += A_0 B_0 + … + A_{K-1} B_{K-1}  (panels of A and B streamed from main memory)
C_i += A_{i,p} B_p  (row blocks staged in on-chip memory)
C_{i,j} += A_{i,p} B_{p,j}  (blocks computed by a core from its local stores)
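The three memory levels above correspond to a blocked loop nest. A minimal Python sketch, assuming illustrative block sizes mc, kc, nc rather than the paper's tuned values:

```python
# Blocked GEMM mirroring the slide's memory hierarchy:
#   p0-loop: C += A_p * B_p           (panels, main-memory level)
#   i0-loop: C_i += A_{i,p} * B_p     (row blocks, on-chip-memory level)
#   j0-loop: C_{i,j} += A_{i,p} * B_{p,j}  (blocks, core local-store level)
# Block sizes mc, kc, nc are illustrative defaults, not the paper's values.

def blocked_gemm(C, A, B, mc=2, kc=2, nc=2):
    m, n, k = len(C), len(C[0]), len(A[0])
    for p0 in range(0, k, kc):            # panels of A and B
        for i0 in range(0, m, mc):        # row blocks of C and A
            for j0 in range(0, n, nc):    # blocks of C handed to a core
                # inside a block: plain rank-1 updates (the LAC kernel's job)
                for p in range(p0, min(p0 + kc, k)):
                    for i in range(i0, min(i0 + mc, m)):
                        for j in range(j0, min(j0 + nc, n)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

Every (i, j, p) triple is visited exactly once, so the result matches unblocked GEMM; the blocking only changes the order of traversal to match where operands live.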
Design of the Linear Algebra Core (LAC): a distributed-memory architecture with broadcast buses.
Data Mapping on LAC: mapping of a 16x16 matrix A onto a 4x4 2D arrangement of PEs:
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
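One common way to realize such a mapping is a 2D cyclic (round-robin) distribution, in which element (i, j) lives on PE(i mod 4, j mod 4), so each PE holds 16 of the 256 elements. A small Python sketch under that assumption (the slide's figure is consistent with it, but the exact distribution is not spelled out here):

```python
# 2D cyclic mapping of a 16x16 matrix onto a 4x4 PE array (an assumption,
# sketched for illustration): element (i, j) is stored in the local store
# of PE(i mod 4, j mod 4), so each PE owns a 4x4 subset of the matrix.

def pe_of(i, j, rows=4, cols=4):
    """Return the (row, col) of the PE that owns matrix element (i, j)."""
    return (i % rows, j % cols)

def elements_of_pe(r, c, n=16, rows=4, cols=4):
    """List all matrix elements stored locally at PE(r, c)."""
    return [(i, j) for i in range(n) for j in range(n)
            if pe_of(i, j, rows, cols) == (r, c)]
```

With this distribution, each rank-1 update touches one element of A per PE row and one element of B per PE column, which is what makes the broadcast buses sufficient.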
Rank-1 Update on the PE array: in step i, PE(r,c) computes c_rc += a_ri × b_ic, with the operands of A and B shared over the broadcast buses:
c_11 += a_1i × b_i1   c_12 += a_1i × b_i2   c_13 += a_1i × b_i3   c_14 += a_1i × b_i4
c_21 += a_2i × b_i1   c_22 += a_2i × b_i2   c_23 += a_2i × b_i3   c_24 += a_2i × b_i4
c_31 += a_3i × b_i1   c_32 += a_3i × b_i2   c_33 += a_3i × b_i3   c_34 += a_3i × b_i4
c_41 += a_4i × b_i1   c_42 += a_4i × b_i2   c_43 += a_4i × b_i3   c_44 += a_4i × b_i4
(Orange: elements of A; green: elements of B; blue: elements of C.)
GEMM on the LAP (figure).
Multi-LAC on Chip: the same panel of B is shared by all cores; on-chip memory stores a complete n×n block of C; each core computes a different panel of C.
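The panel split can be sketched as follows (a hypothetical helper, not the paper's scheduler; it assigns each core a contiguous column panel of C while B is shared by all):

```python
# Multi-LAC work partitioning sketch: all cores read the same panel of B,
# and each core computes a different column panel of C. The helper below
# is illustrative: it splits n columns into num_cores contiguous panels
# whose widths differ by at most one column.

def partition_panels(n, num_cores):
    """Return [(j_start, j_end), ...], one half-open column range per core."""
    base, extra = divmod(n, num_cores)
    panels, j = [], 0
    for c in range(num_cores):
        width = base + (1 if c < extra else 0)  # spread the remainder
        panels.append((j, j + width))
        j += width
    return panels
```

Because the panels are disjoint and cover all n columns, the cores never write the same element of C, so no inter-core synchronization is needed within one block update.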
Outline: Motivation and Vision; Related Work and Background; Linear Algebra Core; Power/Performance Analysis; Conclusion and Future Work.
Performance and Power Analysis: analytical formulae for utilization, bandwidth, and the size of the local stores; a cycle-accurate simulator for matrix multiplication and Cholesky factorization. Component selection: MAC units in 45nm [Galal et al. 2010]; storage modeled with CACTI 6.0 (pure SRAM model); interconnect modeled as an AMBA AHB bus [Lahiri 2004][Wolkotte 2009]; component activity based on GEMM; leakage modeled as 25-30% of dynamic power.
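The bandwidth side of this analysis can be illustrated with a toy roofline-style model (these are not the paper's formulae; the 3·n²·8-byte traffic estimate is a simplifying assumption):

```python
# Toy roofline-style model of GEMM utilization vs. off-chip bandwidth.
# Assumption (not the paper's model): an n x n GEMM does 2*n^3 flops and
# moves roughly three n x n operands over the off-chip link once each.

def gemm_utilization(peak_gflops, bw_gbs, n, bytes_per_elem=8):
    flops = 2.0 * n ** 3
    bytes_moved = 3.0 * n ** 2 * bytes_per_elem
    intensity = flops / bytes_moved          # flops per byte, grows ~ n/12 for DP
    attainable = min(peak_gflops, bw_gbs * intensity)
    return attainable / peak_gflops          # fraction of peak sustained
```

The model captures the qualitative tradeoff the slides discuss: larger blocks raise arithmetic intensity, so less external bandwidth is needed to stay near peak.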
Core Utilization Trade-offs: bandwidth vs. local memory size at 100% utilization; core dimension trade-off.
Multi-LAC Solution Trade-off 9/30/201428 On-chip memory limits performance On-chip Bandwidth requirement grows exponentially to maintain peak performance ©Ardavan Pedram 2012
Performance vs. External Bandwidth (for sizes 256x256, 512x512, 768x768, and 1024x1024): with 33 GB/s of off-chip bandwidth, over 600 DP-GFLOPS at over 90% utilization.
PE Efficiency at Different Frequencies: area is mostly occupied by SRAM, while power is mostly consumed by the MAC units. Upper limits: 120 GFLOPS/W for an SP PE, 60 GFLOPS/W for a DP PE; 1 GHz is the sweet spot of performance vs. efficiency. At low voltages, SRAM power consumption limits efficiency.
LAP vs. Intel Core 2 Duo (Penryn) power breakdown [V. George et al. 2007]: the out-of-order machinery and frontend account for 40% of core power (over 5 W); the execution logic and register file are further major consumers.
LAP vs. Nvidia GTX280 (Tesla): single-precision GEMM.
LAP vs. Nvidia GTX480 (Fermi).
Summary of the LAP: 600/1200 DP/SP-GFLOPS; one to two orders of magnitude improvement vs. GPUs and CPUs, respectively.
GEMM performance and efficiency on different platforms:

Platform               | GFLOPS | W/mm² | GFLOPS/mm² | GFLOPS/W | Utilization
Cell BE (SP)           | 200    | 0.3   | 1.5        | 5        | 88%
NVidia GTX480 SM (SP)  | 780    | 0.2   | 0.9        | 5.2      | 70%
NVidia GTX480 SM (DP)  | 390    | 0.2   | 0.5        | 2.6      | 70%
Intel Core-i7 960 (SP) | 9188.8.131.525%
Intel Core-i7 960 (DP) | 4184.108.40.2065%
Altera Stratix IV (DP) | 100    | 0.02  | 0.05       | 3.5      | 90+%
ClearSpeed CSX700 (DP) | 75     | 0.02  | 0.2        | 12.5     | 78%
LAP (SP)               | 1200   | 0.2   | 6-11       | 55       | 90+%
LAP (DP)               | 600    | 0.2   | 3-5        | 25       | 90+%
Outline: Motivation and Vision; Related Work and Background; Linear Algebra Core; Power/Performance Analysis; Conclusion and Future Work.
Conclusion: a Linear Algebra Processor built through algorithm/architecture co-design, with power and efficiency estimation, generalized to more complex algorithms (Cholesky). Results @ 1 GHz (DP): 32 GFLOPS at 47 GFLOPS/W, 0.6 W and 2.8 mm² in 45nm, with 4 GB/s external bandwidth; orders of magnitude improvement.
Conclusion (continued): studied conventional architectures and their sources of power consumption.
Future Work: implementation (hardware synthesis); generalization to the level-3 BLAS and to LU and QR factorization; integration within a general-purpose framework; design space exploration (picking the right algorithm variant).