1
Energy-Efficiency Memory Hierarchy for Multi-core Architectures Yen-Kuang Chen, Ph.D., IEEE Fellow Principal Engineer, Intel Corporation Associate Director, Intel-NTU CCC Center With help from a long list of collaborators: Guangyu Sun, Jishen Zhao, Cong Xu, Yuan Xie (PSU), Christopher Hughes, Changkyu Kim (Intel)
2
Notice and Disclaimers Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. All products, dates, and figures are preliminary for planning purposes and are subject to change without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The Intel products discussed herein may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://www.intel.com. Intel® Itanium®, Xeon™, Pentium®, Intel SpeedStep® and Intel NetBurst®, Intel®, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Copyright © 2011, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
3
Optimization Notice – Please read Optimization Notice Intel® Compiler includes compiler options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel® Compiler are reserved for Intel microprocessors. For a detailed description of these compiler options, including the instruction sets they implicate, please refer to "Intel® Compiler User and Reference Guides > Compiler Options." Many library routines that are part of Intel® Compiler are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® Compiler offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for Intel® Compiler, with respect to Intel's compilers and associated libraries as a whole, Intel® Compiler may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other compilers to determine which best meet your requirements.
4
High Carbon Emissions Hard to Solve; J.T. Wang (王振堂): Data Centers Are Not a Good Business, Don't Welcome Them. 2012-02-11 01:12, China Times (康文柔/Taipei report). Google is about to build a data center in the Changbin Industrial Zone. J.T. Wang (王振堂), chairman of the Taipei Computer Association and of Acer, said yesterday (the 10th) that what matters most for Taiwan's cloud industry is software, services, and applications, absolutely not building data centers. Because data centers consume a lot of power and emit a lot of carbon, he said, "This is not a good business!" and urged the government to stop welcoming foreign firms to build data centers in Taiwan.
5
Dark Silicon. "Dark Silicon and the End of Multicore Scaling," in International Symposium on Computer Architecture (ISCA), 2011, by H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Regardless of chip organization and topology, multicore scaling is power limited: at 22 nm, 21% of a fixed-size chip must be powered off; at 8 nm, more than 50%. The takeaway: energy efficiency.
6
Memory is the Key in Multi-Core. We should spend 90%+ of our energy on memory, where "memory" means the whole memory/cache hierarchy: cache coherence (or even non-coherent designs), data placement/replacement management, and compiler or hardware assists.
7
Outline Motivation Moguls Memory Model Energy-efficient Memory Hierarchy Design with Moguls Future Prediction Memory hierarchy Processor designs Conclusion
8
"Bandwidth Wall." Performance depends on two resources: compute does the work, and bandwidth feeds the compute. Processors (through ILP, DLP, and TLP) are getting faster, while memory becomes relatively slower. Source: D. A. Patterson, "Latency Lags Bandwidth," Communications of the ACM, vol. 47, no. 10, pp. 71-75, Oct. 2004. The current memory hierarchy may not be good enough.
9
How to Alleviate Bandwidth Problems. Software techniques: cache blocking, data compression, memory management, data structure re-arrangement, etc.; but they are not always applicable. Hardware techniques: "new" memory technologies provide opportunities, such as 3D stacking [Madan, HPCA 2009; Sun, HPCA 2009; Sun, ISLPED 2009], eDRAM [Thoziyoor, ISCA 2008; Wu, ISCA 2009], MRAM [Sun, HPCA 2009; Wu, ISCA 2009], and PCM [Lee, ISCA 2009; Qureshi, ISCA 2009; Wu, ISCA 2009].
10
Trade-off Between Bandwidth and Power. High bandwidth means high power. GPUs use GDDR for higher bandwidth, and some find GPUs perform better because of that higher bandwidth (details in Lee, et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA, June 2010). But GDDR burns more power: 1GB of GDDR at 128GB/s can burn roughly 4x the power of 4GB of DDR at 16GB/s. Recall why multi-core is everywhere: isn't it because of power? Who will burn more power next? Will it be memory? Challenge: how to provide an energy-efficient memory hierarchy.
11
Our Research Statement. What should the memory hierarchy look like? Does a memory hierarchy provide enough bandwidth? How many levels in the memory hierarchy? What are the capacity and bandwidth of each level? Which memory technologies are chosen for different levels? Can we explore the design space quickly? Simulation-based design evaluation is slow and not feasible for a large design space; instead, we explore using an analytical model (Moguls), which is fast and accurate (with proper approximations).
12
Sneak preview A new level of cache every 5-7 years
13
Outline Motivation Moguls Memory Model Energy-efficient Memory Hierarchy Design with Moguls Future Prediction Memory hierarchy Processor designs Conclusion
14
Cache: a Smaller, Faster Buffer. A smaller, faster buffer: e.g., 10ns access time (vs. 100ns for DRAM), 32KB vs. 2GB. It stores the data most likely to be used, based on temporal and spatial locality, and filters memory accesses: e.g., the cache satisfies 90% of requests, so only 10% go to the next level. This significantly improves performance: e.g., 90%*10ns + 10%*110ns = 20ns. [Diagram: CPU - Cache - Memory.]
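The filtering arithmetic on this slide is the standard average memory access time (AMAT) calculation; a minimal sketch using the slide's example numbers (10ns cache, 100ns DRAM, 90% hit rate):

```python
def amat(hit_time_ns, miss_penalty_ns, hit_rate):
    """AMAT = hit_rate * hit_time + miss_rate * (hit_time + miss_penalty)."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time_ns + miss_rate * (hit_time_ns + miss_penalty_ns)

# Slide's example: 90% of requests hit a 10ns cache; the rest pay 10 + 100 ns.
print(round(amat(10, 100, 0.90), 3), "ns")
```

Plugging in the slide's numbers gives 0.9*10 + 0.1*110 = 20 ns, a 5x improvement over going to DRAM every time.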
15
Bandwidth Requirement. [Diagram: cores connect through cache levels M_1, M_2, ..., M_n to main memory.] The cores have a bandwidth demand BR_C(T), decided by the throughput T, and each level i of the N-level hierarchy sees a demand BR_i(T); in the other direction, each level provides a bandwidth BP_i, and main memory provides BP_M. Question: how to abstract this relationship?
16
Capacity-Bandwidth (CB) Coordinate. Plot bandwidth (log scale) against cache capacity (log scale). A point (C, B) gives a cache capacity C and a demanded/provided bandwidth B, so both demanded and provided bandwidths can be described in the same coordinate system. The origin is (C_O, B_M), where C_O is the minimum capacity and B_M is the bandwidth provided by main memory.
17
Provided CB Curve. [Plot: the first-level cache appears as the point (C_1, BP_1) and the second-level cache as (C_2, BP_2), stepping down toward the origin (C_O, B_M).]
18
Demand CB Curve. The demand CB curve is continuous and decided by the throughput T. [Plot: under throughput T_1 the curve starts at BR_C(T_1) and passes through a point (C_x, B_x); under T_2 a different curve starts at BR_C(T_2). Origin (C_O, B_M).]
19
Combining Demand and Provided CB Curves. [Plot: the provided points (C_1, BP_1) and (C_2, BP_2) lie above the demand curve starting at BR_C(T_2).] Here the demand CB curve is satisfied by the provided CB curve.
20
Combining Demand and Provided CB Curves (cont.). [Plot: under the higher throughput T_1, the demand curve starting at BR_C(T_1) rises above the provided points (C_1, BP_1) and (C_2, BP_2).] Here the demand CB curve is NOT satisfied by the provided CB curve. Question: how to modify the provided CB curve?
21
Increase Capacity. [Plot: the first-level point moves right from (C_1, BP_1) to (C'_1, BP_1), growing capacity at the same bandwidth until the provided curve covers the demand curve BR_C(T_1).]
22
Increase Bandwidth. [Plot: the second-level point moves up from (C_2, BP_2) to (C_2, BP'_2), raising bandwidth at the same capacity.]
23
Add an Extra Level. [Plot: a new level (C_3, BP_3) is inserted beyond (C_2, BP_2), extending the provided curve to cover the demand curve BR_C(T_1).]
24
Why the Name "Moguls"? Wikipedia: moguls are a series of bumps on a trail formed when skiers push the snow into piles as they ski.
26
Recall the Research Statements Does a memory hierarchy provide enough bandwidth? How many levels in the memory hierarchy? What are the capacity and bandwidth of each level? Which memory technologies are chosen for different levels?
27
Outline Motivation Moguls Memory Model Energy-efficient Memory Hierarchy Design with Moguls Future Prediction Memory hierarchy Processor designs Conclusion
28
Approximations Used to Apply Moguls. Approximation-1: the demand CB curve is represented as a straight line with slope -1/2 in log-log space (y = -x/2). [Plot: the line drops from the cores' demand BR_C(T) at capacity C_O down to the main-memory bandwidth B_M at capacity C_O(B_S/B_M)^2.]
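Written out, Approximation-1 pins the demand curve down in closed form. A sketch, taking B_S (following the slide's labels) to be the bandwidth demanded at the minimum capacity C_O:

```latex
% Straight line of slope -1/2 in log-log space through (C_O, B_S):
\log B = \log B_S - \tfrac{1}{2}\left(\log C - \log C_O\right)
\quad\Longleftrightarrow\quad
B(C) = B_S \sqrt{\frac{C_O}{C}}.
% Setting B(C) = B_M recovers the intercept shown on the slide:
% C = C_O \left(B_S / B_M\right)^2.
```

In other words, each 4x increase in cache capacity halves the demanded bandwidth, and the line meets the main-memory bandwidth B_M exactly at the capacity C_O(B_S/B_M)^2 marked on the axis.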
29
Approximations Used to Apply Moguls. Approximation-2: the access power of a cache is approximately proportional to its bandwidth times the square root of its capacity, so an iso-power line, a set of (C, B) points that burn the same power, also has slope -1/2 in log-log space. [Same plot as before, with the iso-power line added.]
30
After Applying the Two Approximations. The iso-power line is parallel to the demand CB curve (both have slope -1/2 in log-log space). [Plot: demand line BR_C(T) between C_O and C_O(B_S/B_M)^2, with a parallel iso-power line.]
31
Starting Point: Two-Level Cache Design. [Plot: the two levels (C_1, BP_1) and (C_2, BP_2) form a staircase above the demand curve BR_C(T); the corner (C_1, BP_2) marks where the second level takes over.]
32
Energy-Efficient Two-Level Cache Design. [Plot: the two-level staircase (C_1, BP_1), (C_2, BP_2) with corner (C_1, BP_2) is slid along iso-power lines until it just touches the demand curve BR_C(T); B_S marks the demanded bandwidth at the minimum capacity C_O.]
33
Extension to N-Level Cache Design. [Plot: an n-level staircase with level points (C_1, BP_1), ..., (C_{n-1}, BP_{n-1}), (C_n, BP_n) and corners such as (C_2, BP_1) and (C_n, BP_{n-1}), hugging the demand curve BR_C(T) between C_O and C_O(B_S/B_M)^2.]
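One plausible way to turn the staircase construction into numbers: space the level capacities evenly in log scale and read each level's bandwidth off the Approximation-1 demand line B(C) = B_S * sqrt(C_O/C). This is an illustrative sketch, not the paper's algorithm; the function name and example values are made up:

```python
import math

def moguls_levels(c_min, c_max, b_s, n_levels):
    """Sketch: place n cache levels evenly in log-capacity space between
    c_min and c_max, with each level's bandwidth taken from the demand
    line B(C) = b_s * sqrt(c_min / C) (slope -1/2 in log-log space)."""
    levels = []
    for i in range(1, n_levels + 1):
        # geometric interpolation between c_min and c_max
        c_i = c_min * (c_max / c_min) ** (i / n_levels)
        b_i = b_s * math.sqrt(c_min / c_i)
        levels.append((c_i, b_i))
    return levels

# Illustrative numbers only: a 32 KB to 32 MB span, 1 TB/s demand at the top.
for cap, bw in moguls_levels(32e3, 32e6, 1e12, 3):
    print(f"capacity {cap / 1e3:>8.0f} KB, bandwidth {bw / 1e9:>6.1f} GB/s")
```

With these inputs each level grows capacity 10x and sheds about 3.2x of bandwidth, which is the staircase shape the plot shows.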
34
Design under a Power Constraint. [Plot: the same n-level staircase of points (C_1, BP_1) through (C_n, BP_n), shifted down to a lower demand curve when the power budget is insufficient.] Throughput is degraded from T_1 to T_2.
35
For More Details. Mixing different memory technologies; simulation results to validate the model: it accurately predicts the number of levels (>90%) and the size/BW of every level (>80%). "Moguls: a Model to Explore the Memory Hierarchy for Throughput Computing," G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, in International Symposium on Computer Architecture (ISCA), June 2011.
36
Recall the Research Statements Does a memory hierarchy provide enough bandwidth? How many levels in the memory hierarchy? What are the capacity and bandwidth of each level? Which memory technologies are chosen for different levels?
37
Different Memory Technologies. [Plot: energy efficiency of MRAM, eDRAM, Memristor, and PCM caches across capacities from 256KB to 256MB (log scale); the cross points mark where the most efficient technology changes. Write:read ratio = 1:9.]
38
Different Memory Technologies (cont.). [Plot: same capacity axis, 256KB to 256MB (log scale); below a crossover capacity SRAM is more energy-efficient, above it eDRAM is more energy-efficient.]
39
Outline Motivation Moguls Memory Model Energy-efficient Memory Hierarchy Design with Moguls Future Prediction Memory hierarchy Processor designs Conclusion
40
Our Research Statement. What should the memory hierarchy look like? How do we use the latest technology components? What is the proper number of levels? What should the capacity/bandwidth of each level be? Can we explore the design space quickly? Simulation-based design evaluation is slow and not feasible for a large design space; instead, we explore using an analytical model (Moguls), which is fast and accurate (with proper approximations).
41
Historical Trend. Growing bandwidth gap: processor speed increases at 50% per year, while memory bandwidth increases at 27% per year. Intel processors introduced an on-die L1 cache in 1990, an L2 cache in 1998, and an L3 cache in 2005.
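A back-of-envelope check on these growth rates (the 50% and 27% figures are the slide's; framing the gap as a doubling time is ours):

```python
import math

# Assumed growth rates from the slide: processor throughput +50%/year,
# memory bandwidth +27%/year. The compute-to-bandwidth ratio then grows
# by 1.50/1.27 per year; solve for how long the gap takes to double.
proc_growth, mem_growth = 1.50, 1.27
gap_per_year = proc_growth / mem_growth
years_to_double = math.log(2) / math.log(gap_per_year)
print(f"gap grows {gap_per_year:.3f}x/year; doubles every {years_to_double:.1f} years")
```

With these rates the gap doubles roughly every 4.2 years, the same ballpark as the L1/L2/L3 introduction cadence above; the talk's 5-7 year figure comes from the full model, not from this arithmetic alone.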
42
Optimal Performance per Watt A new level of cache every 5-7 years
43
Takeaway Messages It is time to add another level of memory into the hierarchy to alleviate the bandwidth bottleneck. Need new memory level roughly every 5-7 years Mathematically, with some assumptions about miss-rate curves and power consumption, we can solve for: Optimal # of levels, capacity & BW of each level Our study shows that L4 helps significantly
44
Outline Motivation Moguls Memory Model Energy-efficient Memory Hierarchy Design with Moguls Future Prediction Memory hierarchy Processor designs Conclusion
45
Programming GPUs Becomes Popular. Many show GPUs have significant performance gains. However, GPUs are NOT orders of magnitude faster than CPUs; architecture-specific optimizations are important. Source: Google Scholar hit counts for "GPGPU" and "GPGPU GPU CPU performance speedup".
46
Compute Bound vs. Bandwidth Bound. Performance depends on two resources: compute does the work, and bandwidth feeds the compute. Well-optimized applications are either compute bound or bandwidth bound. For compute-bound applications: Performance = Efficiency * Peak Compute Capability. For bandwidth-bound applications: Performance = Efficiency * Peak Bandwidth Capability.
47
With the Same Efficiency. Core i7 960: four OoO superscalar cores at 3.2GHz; peak SP flops: 102 GF/s; peak BW: 30 GB/s. GTX 280: 30 SMs (with 8 in-order SPs each) at 1.3GHz; peak SP flops: 933 GF/s; peak BW: 141 GB/s. Max speedup of GTX 280 over Core i7 960, assuming both have the same efficiency: compute-bound apps (SP): 933/102 = 9.1x; bandwidth-bound apps: 141/30 = 4.7x. Many GPU performance claims are myths.
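The slide's bounds are just ratios of peak numbers; sketched below with the quoted specs (the helper name is illustrative):

```python
# Upper-bound speedup from peak ratios, per the slide's argument that a
# well-optimized kernel is limited by either compute or bandwidth.
def max_speedup(peak_a, peak_b):
    """Best case for machine B over machine A at equal efficiency."""
    return peak_b / peak_a

# Peak numbers from the slide: Core i7 960 vs. GTX 280.
compute_bound = max_speedup(102, 933)    # SP GFLOP/s
bandwidth_bound = max_speedup(30, 141)   # GB/s
print(f"compute-bound: {compute_bound:.1f}x, bandwidth-bound: {bandwidth_bound:.1f}x")
```

This reproduces the slide's 9.1x and 4.7x upper bounds; any claimed speedup beyond these implies the CPU code was under-optimized rather than the GPU being inherently faster.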
48
Debunking the Myths. Bordawekar, et al. (IBM), "Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!" IBM Technical Report RC24982, April 2010. Vuduc, et al. (Georgia Tech), "On the Limits of GPU Acceleration," HotPar, June 2010. Lee, et al. (Intel), "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA, June 2010.
49
[Figure from the cited talk.] Source: Vuduc, et al. (Georgia Tech), "On the Limits of GPU Acceleration," HotPar, June 2010.
50
Software Optimization. CPU: multi-threading, SIMDification, cache blocking, memory management, data structure re-arrangement. GPU: multi-threading, branch divergence reduction, coalescing memory accesses, synchronization avoidance, local shared buffer optimization. Need both for a fair comparison.
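As an illustration of one CPU technique from the list, cache blocking restructures loops into tiles whose working set fits in cache. A minimal, illustrative sketch (a blocked matrix transpose; the tile size is arbitrary here):

```python
def blocked_transpose(a, n, tile=64):
    """Transpose an n x n matrix (list of lists) tile by tile, so the
    read and write working sets of each tile stay resident in cache."""
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            # process one tile x tile block before moving on
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out

m = [[i * 4 + j for j in range(4)] for i in range(4)]
print(blocked_transpose(m, 4, tile=2)[0])  # -> [0, 4, 8, 12]
```

A naive row-by-row transpose strides through the destination with poor locality; tiling keeps both source rows and destination columns hot, which is exactly the bandwidth-filtering effect the talk's cache model captures.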
51
Case Study: Sparse Matrix-Vector Multiplication Source: V. W. Lee, et al., “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” ISCA 2010. 10x becomes 2x after optimizations
52
Case Study: Fast Motion Estimation. Source: N.-M. Cheung, et al., "Video Coding on Multicore Graphics Processors --- The challenges and advantages of the GPU implementation," IEEE Signal Processing Magazine, March 2010. [Figure: quality vs. parallelism trade-offs for two motion-estimation algorithms.] More algorithm research is needed to utilize parallelism from multi-core processors.
53
GPUs Are Against Future Trends. GPU memory is not energy-efficient; the number of levels and the size of each level must be adjusted according to the throughput of the processors. CUDA-like GPU programming models are not scalable in the future: to get good performance on a GPU, explicit memory management is often required, and it will become a nightmare if the number of levels and the size of each level keep changing. Trust me: reduce your bets on GPUs.
54
Outline Motivation Moguls Memory Model Energy-efficient Memory Hierarchy Design with Moguls Future Prediction Memory hierarchy Processor designs Conclusion
55
Conclusions. The winning multi-core architecture will have the most energy-efficient memory hierarchy. GPUs (or GDDR) are good, but not that good. Expect to have more levels: 3D stacking, eDRAM, and other emerging technologies. "Moguls: a Model to Explore the Memory Hierarchy for Throughput Computing," G. Sun, C. J. Hughes, C. Kim, J. Zhao, C. Xu, Y. Xie, Y.-K. Chen, in International Symposium on Computer Architecture, June 2011. On-die adaptive/reconfigurable caches can save energy: "Performance and Energy Implications of Caches For Throughput Computing," C. Hughes, C. Kim, Y.-K. Chen, IEEE Micro Magazine, vol. 30, no. 6, pp. 25-35, Nov.-Dec. 2010. Software can help too; this is one of our current research directions. Discussion: how about cloud computing?