Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements."> Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.">

Presentation is loading. Please wait.

Presentation is loading. Please wait.

Energy-Efficiency Memory Hierarchy for Multi-core Architectures Yen-Kuang Chen, Ph.D., IEEE Fellow Principal Engineer, Intel Corporation Associate Director,

Similar presentations


Presentation on theme: "Energy-Efficiency Memory Hierarchy for Multi-core Architectures Yen-Kuang Chen, Ph.D., IEEE Fellow Principal Engineer, Intel Corporation Associate Director,"— Presentation transcript:

1 Energy-Efficiency Memory Hierarchy for Multi-core Architectures Yen-Kuang Chen, Ph.D., IEEE Fellow Principal Engineer, Intel Corporation Associate Director, Intel-NTU CCC Center With help from a long list of collaborators: Guangyu Sun, Jishen Zhao, Cong Xu, Yuan Xie (PSU), Christopher Hughes, Changkyu Kim (Intel)

2 Notice and Disclaimers  Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order.  INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice.  All products, dates, and figures are preliminary for planning purposes and are subject to change without notice.  Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.  Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.  The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.  Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com.  Intel® Itanium®, Xeon™, Pentium®, Intel SpeedStep®, Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2011, Intel Corporation. All rights reserved.  *Other names and brands may be claimed as the property of others.

3 Optimization Notice – Please read Optimization Notice Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.

4 High carbon emissions are hard to solve. J.T. Wang: data centers are not a good business; don't welcome them  2012-02-11 01:12 China Times [Kang Wen-jou / Taipei]  Google is about to set up a data center in the Changbin Industrial Park. J.T. Wang (王振堂), chairman of the Taipei Computer Association and chairman of Acer, said yesterday (the 10th) that in developing a cloud industry, what matters most for Taiwan is software, services, and applications, definitely not building data centers.  Because data centers consume large amounts of power and produce large carbon emissions, he said, "This is not a good business!" The government should stop welcoming foreign companies to build data centers in Taiwan.

5 Dark Silicon  "Dark Silicon and the End of Multicore Scaling," in International Symposium on Computer Architecture (ISCA), 2011, by H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger  Regardless of chip organization and topology, multicore scaling is power limited  At 22 nm, 21% of a fixed-size chip must be powered off  At 8 nm, more than 50% [Slide callout: Energy Efficiency]

6 Memory is the Key in Multi-Core  We should spend 90%+ of energy in memory  "Memory" = memory/cache hierarchy  Cache coherence (or even non-coherent)  Data management: placement/replacement  Compiler or hardware assist

7 Outline  Motivation  Moguls Memory Model  Energy-efficient Memory Hierarchy Design with Moguls  Future Prediction  Memory hierarchy  Processor designs  Conclusion

8 “Bandwidth Wall”  Performance depends on two resources  Compute does the work  Bandwidth feeds the compute  Processors (through ILP, DLP, and TLP) are getting faster  Memory becomes relatively slower Source: D. A. Patterson, "Latency Lags Bandwidth," Communications of the ACM, Vol. 47, no. 10, pp. 71-75, Oct. 2004. The current memory hierarchy may not be good enough

9 How to Alleviate Bandwidth Problems  Software techniques  Cache blocking, data compression, memory management, data structure re-arrangement, etc.  But, not always applicable  Hardware techniques  “New” memory technologies provide opportunities  3D stacking [Madan, HPCA 2009, Sun HPCA 2009, Sun ISLPED 2009]  eDRAM [Thoziyoor, ISCA 2008, Wu, ISCA 2009]  MRAM [Sun, HPCA 2009, Wu, ISCA 2009]  PCM [Lee, ISCA 2009, Qureshi, ISCA 2009, Wu, ISCA 2009]

10 Trade-off Between Bandwidth & Power  High bandwidth means high power  GPUs use GDDR  higher bandwidth  Some find GPUs perform better because of higher bandwidth (details in Lee, et al., “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU,” ISCA, June 2010.)  GDDR burns more power  1GB GDDR at 128GB/s can burn roughly 4x the power of 4GB DDR at 16GB/s  Recall:  Why is multi-core everywhere? Isn't it because of power?  Who burns more power? Will it be memory?  Challenge:  How to provide an energy-efficient memory hierarchy
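As a back-of-the-envelope check on the slide's GDDR vs. DDR numbers (the ~4x power ratio is the slide's claim; the bandwidth-per-watt framing below is our illustration, not a measured result):

```python
# Slide's example configurations: 1GB GDDR at 128 GB/s vs. 4GB DDR at 16 GB/s,
# with the GDDR configuration burning roughly 4x the power.
gddr_bw, ddr_bw = 128.0, 16.0   # GB/s
power_ratio = 4.0               # slide's claim: ~4x the power

bw_ratio = gddr_bw / ddr_bw                 # 8x the bandwidth
bw_per_watt = bw_ratio / power_ratio        # ~2x bandwidth per watt
print(f"{bw_ratio:.0f}x bandwidth at ~{power_ratio:.0f}x power "
      f"(~{bw_per_watt:.0f}x bandwidth/watt, but much higher absolute power)")
```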

11 Our Research Statement  What should the memory hierarchy look like?  Does a memory hierarchy provide enough bandwidth?  How many levels in the memory hierarchy?  What are the capacity and bandwidth of each level?  Which memory technologies are chosen for different levels?  Can we explore the design space quickly?  Simulation-based design evaluation is slow  Not feasible for a large design space  Exploration using an analytical model (Moguls)  Fast and accurate (with proper approximations)

12 Sneak preview A new level of cache every 5-7 years

13 Outline  Motivation  Moguls Memory Model  Energy-efficient Memory Hierarchy Design with Moguls  Future Prediction  Memory hierarchy  Processor designs  Conclusion

14 Cache: Smaller, Faster Buffer  Smaller, faster buffer  E.g., 10ns access time (vs. DRAM 100ns)  E.g., 32KB vs. 2GB  Stores the data most likely to be used  Based on temporal and spatial locality  Filters memory accesses  E.g., the cache satisfies 90% of requests  Only 10% of requests go to the next level  Significantly improves performance  E.g., 90%×10ns + 10%×110ns = 20ns [Diagram: CPU → Cache → Memory]
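The slide's average-access-time arithmetic as a minimal sketch (the function and variable names are illustrative):

```python
def amat(hit_time_ns, miss_penalty_ns, hit_rate):
    """Average memory access time: hits pay the cache lookup; misses pay
    the cache lookup plus the next level's latency."""
    return hit_rate * hit_time_ns + (1 - hit_rate) * (hit_time_ns + miss_penalty_ns)

print(amat(10, 100, 0.90))   # 0.9*10 + 0.1*110 = 20.0 ns
```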

15 Bandwidth Requirement [Diagram: cores backed by an N-level hierarchy M_1, M_2, …, M_n and main memory. The cores issue a bandwidth demand BR_C(T) decided by the throughput T; each level i passes a residual demand BR_i(T) downward, while each level provides a bandwidth BP_i and main memory provides BP_M.] Question: How to abstract this relationship?

16 Capacity-Bandwidth (CB) Coordinate  A point (C, B) gives a cache capacity C and a demanded/provided bandwidth B, both on log scales  Both provided and demanded bandwidths can be described in the same coordinate system  Origin = (C_O, B_M), where C_O is the minimum capacity and B_M is the bandwidth provided by main memory

17 Provided CB Curve [Plot: bandwidth vs. cache capacity, log-log. The first-level cache provides (C_1, BP_1) and the second-level cache provides (C_2, BP_2), forming a staircase that steps down toward the origin (C_O, B_M).]

18 Demand CB Curve  The demand CB curve is continuous and decided by T [Plot: demand CB curves for two throughputs, starting at BR_C(T_1) and BR_C(T_2) at the minimum capacity and falling as capacity grows, relative to the origin (C_O, B_M); (C_x, B_x) marks a point on a curve.]

19 Combine Demand & Provided CB Curves  The demand CB curve is satisfied by the provided CB curve [Plot: the demand curve under T_2, starting at BR_C(T_2), lies below the provided staircase through (C_1, BP_1) and (C_2, BP_2) down to (C_O, B_M).]
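A minimal sketch of this satisfaction test, modeling the provided curve as a staircase of (capacity, bandwidth) levels and the demand curve as any callable (the names and numbers are illustrative):

```python
import math

def provided_bw(levels, main_mem_bw, capacity):
    """Bandwidth provided for a working set of `capacity`.
    `levels` is a list of (C_i, BP_i) pairs sorted by increasing capacity;
    anything larger than the last level spills to main memory."""
    for c_i, bp_i in levels:
        if capacity <= c_i:
            return bp_i
    return main_mem_bw

def satisfies(levels, main_mem_bw, demand, capacities):
    """True if the provided bandwidth meets the demand at every sample."""
    return all(provided_bw(levels, main_mem_bw, c) >= demand(c)
               for c in capacities)

levels = [(32 * 2**10, 400.0), (2 * 2**20, 100.0)]      # (bytes, GB/s), illustrative
demand = lambda c: 150.0 * math.sqrt(16 * 2**10 / c)    # slope -1/2 in log-log
print(satisfies(levels, 10.0, demand, [2**k for k in range(14, 32)]))  # True
```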

20 Combine Demand & Provided CB Curves (cont.)  The demand CB curve is NOT satisfied by the provided CB curve [Plot: the demand curve under T_1, starting at BR_C(T_1), rises above the provided staircase through (C_1, BP_1) and (C_2, BP_2) down to (C_O, B_M).] Question: How to modify the provided CB curve?

21 Increase Capacity [Plot: the first level grows from (C_1, BP_1) to (C'_1, BP_1), widening its step until the provided staircase covers the demand curve BR_C(T_1); the second level (C_2, BP_2) and origin (C_O, B_M) are unchanged.]

22 Increase Bandwidth [Plot: the second level's bandwidth grows from (C_2, BP_2) to (C_2, BP'_2), raising its step until the provided staircase covers the demand curve BR_C(T_1); the first level (C_1, BP_1) and origin (C_O, B_M) are unchanged.]

23 Add an Extra Level [Plot: a new level (C_3, BP_3) adds a step to the provided staircase of (C_1, BP_1) and (C_2, BP_2) so that the staircase covers the demand curve BR_C(T_1) down to (C_O, B_M).]

24 Why named Moguls? Wikipedia: Moguls are a series of bumps on a trail formed when skiers push the snow into piles as they ski.

25 Why named Moguls?

26 Recall the Research Statements  Does a memory hierarchy provide enough bandwidth?  How many levels in the memory hierarchy?  What are the capacity and bandwidth of each level?  Which memory technologies are chosen for different levels?

27 Outline  Motivation  Moguls Memory Model  Energy-efficient Memory Hierarchy Design with Moguls  Future Prediction  Memory hierarchy  Processor designs  Conclusion

28 Approximations Used to Apply the Moguls Model  Approximation-1: The demand CB curve is represented as a straight line with slope -1/2 in log-log space (y = -x/2) [Plot: the line runs from the core demand BR_C(T) at the minimum capacity down to the main-memory bandwidth B_M at capacity C_O(B_S/B_M)^2, relative to the origin (C_O, B_M).]
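Under Approximation-1 the demand curve has a simple closed form: bandwidth demand falls as the inverse square root of capacity. A sketch, assuming B_S denotes the demand at the minimum capacity C_O (that reading is our assumption, consistent with the slide's axis labels):

```python
import math

# Approximation-1: slope -1/2 in log-log space means B(C) = B_S * sqrt(C_O / C).
# Assumption: B_S is the bandwidth demand at the minimum capacity C_O.
def demand_bw(capacity, c_o, b_s):
    return b_s * math.sqrt(c_o / capacity)

# The line meets the main-memory bandwidth B_M exactly at C_O * (B_S / B_M)^2,
# matching the x-axis label on the slide:
c_o, b_s, b_m = 16 * 2**10, 300.0, 10.0
assert abs(demand_bw(c_o * (b_s / b_m) ** 2, c_o, b_s) - b_m) < 1e-9
```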

29 Approximations Used to Apply the Moguls Model  Approximation-2: The access power of a cache is approximately proportional to its bandwidth times the square root of its capacity, so lines of equal power (iso-power lines) have slope -1/2 in log-log space [Plot: an iso-power line in the CB coordinate; every point on the line has the same power.]

30 After Applying the Two Approximations  The iso-power line is parallel to the demand CB curve [Plot: the iso-power line and the demand curve BR_C(T) both have slope -1/2 in the CB coordinate, between the origin (C_O, B_M) and the capacity C_O(B_S/B_M)^2.]
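A sketch of that power form and why the iso-power lines are parallel to the demand curve (the constant k and the function names are illustrative):

```python
import math

# Approximation-2 as a formula: power ~ k * bandwidth * sqrt(capacity),
# with k an arbitrary technology constant.
def cache_power(capacity, bandwidth, k=1.0):
    return k * bandwidth * math.sqrt(capacity)

# Along a slope -1/2 line in log-log space (4x the capacity, half the
# bandwidth), the power stays constant -- an iso-power line:
assert math.isclose(cache_power(1 * 2**20, 200.0),
                    cache_power(4 * 2**20, 100.0))
```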

31 Starting Point: Two-Level Cache Design [Plot: a two-level provided staircase with levels (C_1, BP_1) and (C_2, BP_2) and corner point (C_1, BP_2), set against the demand curve BR_C(T) between the origin (C_O, B_M) and the capacity C_O(B_S/B_M)^2.]

32 Energy-Efficient Two-Level Cache Design [Plot: the two-level staircase with levels (C_1, BP_1) and (C_2, BP_2) and corner (C_1, BP_2), placed against the demand curve BR_C(T) together with an iso-power line; B_S marks the demand at the minimum capacity, and C_O(B_S/B_M)^2 marks where demand meets B_M.]

33 Extension to N-Level Cache Design [Plot: an N-level staircase with levels (C_1, BP_1), …, (C_{n-1}, BP_{n-1}), (C_n, BP_n) and corner points such as (C_2, BP_1) and (C_n, BP_{n-1}), fitted against the demand curve BR_C(T) between the origin (C_O, B_M) and the capacity C_O(B_S/B_M)^2.]
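One way to picture the N-level construction: because the demand curve is a straight line in log-log space, spacing level capacities evenly on the log axis yields a staircase whose every step touches the demand curve. This is our illustrative construction, not the paper's exact optimization:

```python
import math

def staircase(n_levels, c_min, c_max, b_s):
    """Illustrative N-level staircase between the core demand point
    (c_min, b_s) and the main-memory capacity c_max: capacities spaced
    evenly in log space, with each level's bandwidth set to the demand
    B(C) = b_s * sqrt(c_min / C) at the left edge of its step, so every
    step just covers the demand curve."""
    ratio = (c_max / c_min) ** (1.0 / n_levels)
    return [(c_min * ratio**i, b_s * math.sqrt(1.0 / ratio**(i - 1)))
            for i in range(1, n_levels + 1)]

for c, bp in staircase(3, 16 * 2**10, 256 * 2**20, 300.0):
    print(f"capacity {c / 2**20:8.2f} MB, bandwidth {bp:6.1f} (relative)")
```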

34 Design under a Power Constraint [Plot: two N-level staircases with levels (C_1, BP_1), …, (C_n, BP_n) and corners such as (C_2, BP_1) and (C_n, BP_{n-1}): one satisfies the demand curve BR_C(T_1); under a tighter power budget the staircase shifts and only satisfies BR_C(T_2).] Throughput is degraded from T_1 to T_2

35 For More Details  Mixing different memory technologies  Simulation results to validate the model  Accurately predicts the number of levels (>90%)  Accurately predicts the size/BW of every level (>80%)  "Moguls: a Model to Explore the Memory Hierarchy for Throughput Computing," G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, in International Symposium on Computer Architecture (ISCA), June 2011.

36 Recall the Research Statements  Does a memory hierarchy provide enough bandwidth?  How many levels in the memory hierarchy?  What are the capacity and bandwidth of each level?  Which memory technologies are chosen for different levels?

37 Different Memory Technologies [Chart: energy efficiency of MRAM, eDRAM, Memristor, and PCM across cache capacities from 256KB to 256MB (log scale), assuming a write:read ratio of 1:9; cross points mark the capacities where the most efficient technology changes.]

38 Different Memory Technologies [Chart: over cache capacities from 256KB to 256MB (log scale), SRAM is more energy-efficient below a crossover capacity, and eDRAM is more energy-efficient above it.]
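Charts like these translate into a simple selection rule: at each capacity, pick the technology with the lowest modeled energy per access. The energy models below are placeholders with made-up constants, purely to show the shape of the logic:

```python
import math

# Hypothetical per-access energy models (made-up constants, for shape only):
# a fixed cost plus a term that grows with sqrt(capacity).
TECH = {
    "SRAM":  lambda c: 0.2 + 3.0e-4 * math.sqrt(c),
    "eDRAM": lambda c: 0.8 + 1.0e-4 * math.sqrt(c),
}

def best_tech(capacity_bytes):
    """Technology with the lowest modeled energy at this capacity."""
    return min(TECH, key=lambda t: TECH[t](capacity_bytes))

for exp in range(18, 29, 2):              # 256KB, 1MB, ..., 256MB
    c = 2**exp
    print(f"{c / 2**20:8.2f} MB -> {best_tech(c)}")
# With these toy constants the crossover lands near ~8MB: SRAM below, eDRAM above.
```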

39 Outline  Motivation  Moguls Memory Model  Energy-efficient Memory Hierarchy Design with Moguls  Future Prediction  Memory hierarchy  Processor designs  Conclusion

40 Our Research Statement  What should the memory hierarchy look like?  How to use the latest technology components?  What is the proper number of levels?  What should the capacity/bandwidth of each level be?  Can we explore the design space quickly?  Simulation-based design evaluation is slow  Not feasible for a large design space  Exploration using an analytical model (Moguls)  Fast and accurate (with proper approximations)

41 Historical Trend  Growing bandwidth gap  Processor speed increases at 50% per year  Memory bandwidth increases at 27% per year  Intel processors introduced on-die  L1 cache in 1990  L2 cache in 1998  L3 cache in 2005
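A quick calculation of what those growth rates imply (the 50% and 27% rates are the slide's; the doubling-time arithmetic is ours):

```python
import math

# Processor speed grows ~50%/year, memory bandwidth ~27%/year, so the
# compute-to-bandwidth gap grows by their ratio each year.
gap_per_year = 1.50 / 1.27                                # ~1.18x per year
years_to_double = math.log(2) / math.log(gap_per_year)
print(f"gap doubles every {years_to_double:.1f} years")   # ~4.2 years
```

A gap that doubles roughly every four years is consistent with the observed cadence of a new on-die cache level every 5-7 years.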

42 Optimal Performance per Watt A new level of cache every 5-7 years

43 Takeaway Messages  It is time to add another level of memory to the hierarchy to alleviate the bandwidth bottleneck  Need a new memory level roughly every 5-7 years  Mathematically, with some assumptions about miss-rate curves and power consumption, we can solve for:  Optimal # of levels, capacity & BW of each level  Our study shows that an L4 helps significantly

44 Outline  Motivation  Moguls Memory Model  Energy-efficient Memory Hierarchy Design with Moguls  Future Prediction  Memory hierarchy  Processor designs  Conclusion

45 Programming GPUs Becomes Popular  Many papers show significant GPU performance gains  However, GPUs are NOT orders of magnitude faster than CPUs  Architecture-specific optimizations are important Source: Google Scholar hits for “GPGPU” and “GPGPU GPU CPU performance speedup”

46 Compute Bound vs. Bandwidth Bound  Performance depends on two resources  Compute does the work  Bandwidth feeds the compute  Well-optimized applications are compute bound or bandwidth bound  For compute-bound applications: Performance = Efficiency × Peak Compute Capability  For bandwidth-bound applications: Performance = Efficiency × Peak Bandwidth Capability

47 With the Same Efficiency  Core i7 960  Four OoO superscalar cores, 3.2GHz  Peak SP flops: 102 GF/s  Peak BW: 30 GB/s  GTX 280  30 SMs (w/ 8 in-order SPs each), 1.3GHz  Peak SP flops: 933 GF/s  Peak BW: 141 GB/s Assuming both Core i7 and GTX 280 have the same efficiency, the max speedup of GTX 280 over Core i7 960:  Compute-bound apps (SP): 933/102 = 9.1x  Bandwidth-bound apps: 141/30 = 4.7x Many GPU performance claims are myths
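The same bound arithmetic as a small helper (a sketch; the peak figures are the slide's):

```python
# Upper bound on speedup when two machines run at the same efficiency:
# the ratio of whichever peak resource bounds the application.
def max_speedup(machine_a, machine_b, bound):
    return machine_b[bound] / machine_a[bound]

core_i7 = {"compute": 102.0, "bandwidth": 30.0}   # GF/s, GB/s (slide values)
gtx_280 = {"compute": 933.0, "bandwidth": 141.0}

print(f"compute-bound:   {max_speedup(core_i7, gtx_280, 'compute'):.1f}x")    # 9.1x
print(f"bandwidth-bound: {max_speedup(core_i7, gtx_280, 'bandwidth'):.1f}x")  # 4.7x
```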

48 Debunking the Myths  Bordawekar, et al. (IBM), “Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!” IBM Technical Report RC24982, April 2010.  Vuduc, et al. (Georgia Tech), “On the Limits of GPU Acceleration,” HotPar, June 2010.  Lee, et al. (Intel), “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU,” ISCA, June 2010.

49 * Source: Vuduc, et al. (Georgia Tech), “On the Limits of GPU Acceleration,” HotPar, June 2010.

50 Software Optimization  CPU  Multi-threading  SIMDification  Cache blocking  Memory management  Data structure re-arrangement  GPU  Multi-threading  Branch divergence reduction  Coalescing memory accesses  Synchronization avoidance  Local shared buffer optimization Need both for a fair comparison

51 Case Study: Sparse Matrix-Vector Multiplication  Source: V. W. Lee, et al., “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” ISCA 2010. 10x becomes 2x after optimizations

52 Case Study: Fast Motion Estimation  Source: N.-M. Cheung, et al., "Video Coding on Multicore Graphics Processors: The Challenges and Advantages of the GPU Implementation," IEEE Signal Processing Magazine, March 2010. [Charts: quality vs. parallelism trade-offs for two motion-estimation algorithms.] More algorithm research is needed to exploit the parallelism of multi-core processors

53 GPUs Are Against Future Trends  GPU memory is not energy-efficient  The number of levels and the size of each level must be adjusted according to the throughput of the processors  CUDA-like GPU programming models are not scalable in the future  To get good performance on a GPU, explicit memory management is often required  It will become a nightmare if the number of levels and the size of each level keep changing Trust me: reduce your bets on GPUs

54 Outline  Motivation  Moguls Memory Model  Energy-efficient Memory Hierarchy Design with Moguls  Future Prediction  Memory hierarchy  Processor designs  Conclusion

55 Conclusions  The winning multi-core architecture will have the most energy-efficient memory hierarchy  GPU (or GDDR) is good, but not that good  Expect to have more levels  3D stacking, eDRAM, and other emerging technologies  "Moguls: a Model to Explore the Memory Hierarchy for Throughput Computing," G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, in International Symposium on Computer Architecture, June 2011.  On-die adaptive/reconfigurable caches can save energy  "Performance and Energy Implications of Caches For Throughput Computing," C. Hughes, C. Kim, Y.-K. Chen, IEEE Micro Magazine, vol. 30, no. 6, pp. 25-35, Nov.-Dec. 2010.  Software can help too  Part of our ongoing work Discussion: how about cloud computing?

56 Energy-Efficiency Memory Hierarchy for Multi-core Architectures Yen-Kuang Chen, Ph.D., IEEE Fellow Principal Engineer, Intel Corporation Associate Director, Intel-NTU CCC Center With help from a long list of collaborators: Guangyu Sun, Jishen Zhao, Cong Xu, Yuan Xie (PSU), Christopher Hughes, Changkyu Kim (Intel)

