PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS. ANIL KRISHNA. Advisor: Dr. YAN SOLIHIN. PhD Defense Examination, August 6th, 2013.




1 PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS. ANIL KRISHNA. Advisor: Dr. YAN SOLIHIN. PhD Defense Examination, August 6th, 2013. Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu

2 Good Morning!

3 (Title slide repeated) PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS. ANIL KRISHNA. Advisor: Dr. YAN SOLIHIN. PhD Defense Examination, August 6th, 2013.

4 AGENDA: How this talk is organized
o RESEARCH OVERVIEW: Questions I have been researching all these years
o SUMMARY (Motivation, Problem, Contribution): A quick overview of my latest research
o DETAILS of ReSHAPE: A performance estimation tool
o VALIDATION: Does this tool work?
o USE CASES: Where can it be used?
o CONCLUSIONS and FUTURE DIRECTION: Where are we? Where to next?

5 RESEARCH OVERVIEW: In the context of processor chip design trends

Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Brian Rogers, Anil Krishna, Gordon Bell, Ken Vu, Xiaowei Jiang, Yan Solihin
International Symposium on Computer Architecture, ISCA 2009

Motivation
o Off-chip bandwidth is pin limited, pins are area limited, and area is not growing
Problem Statement
o To what extent does the bandwidth wall restrict future multi-core scaling?
o To what extent can bandwidth conservation techniques help?
Contributions and Findings
o Developed a simple but effective analytical performance model
o Core-to-cache ratio changes from 50:50 to 10:90 in 4 generations
o Core scaling is only 3x vs. 16x in 4 generations
o Different bandwidth conservation techniques have different benefits
o Combining techniques can delay this problem significantly
o 3D-stacked DRAM caches + link and cache compression give >16x scaling

6 RESEARCH OVERVIEW: In the context of processor chip design trends

Data sharing in multi-threaded applications and its impact on chip design
Anil Krishna, Ahmad Samih, Yan Solihin
Intl. Symp. on Performance Analysis of Systems and Software, ISPASS 2012

Motivation
o Parallel applications are moving from SMP systems to a single chip, but with no change in chip design
o No analytical models exist that can capture the effect of data sharing
Problem Statement
o What is the right way to quantify the impact of data sharing on miss rates?
o How can this be incorporated into an analytical performance model?
o Does data sharing impact optimal on-chip core vs. cache ratios?
Contributions and Findings
o Developed a novel approach to quantifying the true impact of data sharing
o Developed an analytical performance model that incorporates data sharing
o Showed that core area increases 33% to 49%; throughput increases 58%
o Presence of data sharing encourages larger cores over smaller ones

7 RESEARCH OVERVIEW: In the context of processor chip design trends
[Chip design trend: Multi-Core Homogeneous to Multi-Core Hybrid]

Hardware acceleration in the IBM PowerEN processor: architecture and performance
Anil Krishna, Timothy Heil, Nicholas Lindberg, Farnaz Toussi, Steven VanderWiel
International conference on Parallel Architectures and Compilation Techniques, PACT 2012

Motivation
o Understand the driving forces, architectural tradeoffs and performance advantages of hardware accelerators via a detailed case study
Problem Statement
o How were the hardware accelerators in IBM's PowerEN selected and designed? How well do they perform?
o How did the presence of hardware accelerators impact the architecture of the rest of the chip?
Contributions and Findings
o Analyzed the design and performance of each hardware accelerator in PowerEN (Crypto, XML, Compression, RegX, HEA) in detail
o Identified tradeoffs in what to accelerate (vs. execute on a general-purpose core) and when to accelerate (large vs. small packets)
o Found that reducing communication overhead and easing programmability requires supporting many new features: a shared memory model between cores and accelerators, direct cache injection of data from accelerators, ISA extensions

8 RESEARCH OVERVIEW: In the context of processor chip design trends
[Chip design trend: Multi-Core Homogeneous to Multi-Core Hybrid to Multi-Core Heterogeneous]

ReSHAPE: Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Anil Krishna, Ahmad Samih, Yan Solihin
being submitted to Intl. Symposium on High Performance Computer Architecture, HPCA 2013

Large design space
o How many cores / core types?
o What cache hierarchy?
o Heterogeneity in caches too?
Large configuration space
o How to schedule applications?
o What DVFS settings to use?
o Which cores and caches to power-gate?

9 SUMMARY: Motivation

Design and configuration space explosion with multi-core chips
o As the number and types of cores grow, many more designs need to be evaluated
o n! static schedules for a single design with n core types
o Very large configuration space with per-core DVFS, even in a single design with a single core type

Detailed simulation is too slow
o Be it trace or execution driven, be it cycle-by-cycle or discrete-event simulation

Analytical models are fast, but existing models are lacking
o Too abstract and lacking sufficient fidelity
o Not flexible enough to handle shared caches, heterogeneity across cores, multi-program mixes

10 SUMMARY: Problem, Contribution

Problem: Need a tool for early design space exploration
o Fast: At least 1000x faster than detailed simulation
o Accurate: < 20% error in performance projection
o Flexible: Able to model shared cache hierarchies, shared memory bandwidth, heterogeneity across cores and caches on chip, and multi-programmed workload mixes

Contribution: ReSHAPE (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
o Hybrid tool: detailed simulation for key statistics + analytical model + iterative solver
o Flexible
o Typically runs in under a second (10,000x faster than detailed simulation)
o Accuracy is promising: IPC error < 5% and cache miss rate error < 15% (validated up to 4 cores)

11 ReSHAPE – Inputs and Outputs
[Diagram: two chips, each with cores (Core 0, Core 1) holding private L1I/L1D caches over a shared L2]

Inputs
o Chip configuration: core counts, core types, frequencies, cache hierarchy, memory bandwidth, application schedule
o App-core pair profile: base IPC, cache accesses per instruction, hit rate profiles

ReSHAPE (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator): an iterative solver of an underlying analytical model

Output
o Throughput (instructions per second)
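The profile inputs listed above (base IPC, cache accesses per instruction, hit-rate profiles) are the natural inputs to a multi-level cache stall model. The sketch below only illustrates that style of model; the function name, argument layout, and additive-penalty assumption are mine, not ReSHAPE's published equations:

```python
def effective_ipc(base_ipc, refs_per_inst, hit_rates, miss_penalties):
    """Fold multi-level cache stalls into a base (perfect-cache) IPC.

    base_ipc         : IPC assuming every reference hits in L1
    refs_per_inst    : cache references per instruction reaching L1
    hit_rates[i]     : hit rate at cache level i (L1 first)
    miss_penalties[i]: extra cycles per miss at level i (the last
                       entry stands in for the memory latency)
    """
    cpi = 1.0 / base_ipc
    refs = refs_per_inst
    for hit, penalty in zip(hit_rates, miss_penalties):
        misses = refs * (1.0 - hit)   # traffic spilling past this level
        cpi += misses * penalty       # charge each miss its penalty
        refs = misses                 # misses become the next level's refs
    return 1.0 / cpi
```

A chip-level throughput estimate then sums IPC times frequency over all app-core pairs, which is the "instructions per second" output named on the slide.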

12 ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
o Chip configuration: core counts, core types, frequencies, cache hierarchy (sizes, latencies), memory bandwidth, application schedule
o App-core pair profiles (one per application-core pair): base IPC, cache accesses per instruction, hit rate profiles

13 ReSHAPE – The Analytical Component (chip configuration and app-core profile inputs as on the previous slide) [Diagram build-up: Core 0 with L1I/L1D, backed by an L2]

14 ReSHAPE – The Analytical Component [Diagram build-up: Core 0 with L1I/L1D and L2, with an L3 added]

15 ReSHAPE – The Analytical Component [Diagram build-up: Core 0 with L1I/L1D, L2, L3]

16 ReSHAPE – The Analytical Component [Diagram build-up: Core 0 with L1I/L1D, L2, L3]

17 ReSHAPE's Novelty (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
Novelty 1
o Separate the chip into vertical silos
[Diagram: Core 0 with L1I/L1D and L2 over a shared L3, with ReSHAPE's partition optimizer dividing the shared cache]

18 ReSHAPE's Novelty (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
Novelty 1
o Separate the chip into vertical silos
Novelty 2
o Use the newly computed IPC as the base IPC
o Re-evaluate traffic and partitions
o Iterate until convergence (IPC change < 1%)
After convergence
o Use the final IPCs to compute throughput
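Novelty 2 is a classic fixed-point iteration. A minimal driver loop, reconstructed only from the slide's "< 1% IPC change" criterion (the function name and the `reevaluate` callback are illustrative, not ReSHAPE's code):

```python
def solve_to_convergence(initial_ipcs, reevaluate, tol=0.01, max_iters=100):
    """Iterate the model until per-core IPCs stabilize.

    reevaluate(ipcs) stands in for one ReSHAPE pass: recompute traffic
    and cache partitions from the current IPCs, return updated IPCs.
    Converged when every IPC changes by <= tol (1% per the slide).
    """
    ipcs = list(initial_ipcs)
    for _ in range(max_iters):
        new_ipcs = reevaluate(ipcs)
        if all(abs(new - old) <= tol * old for new, old in zip(new_ipcs, ipcs)):
            return new_ipcs           # converged: use these for throughput
        ipcs = new_ipcs
    return ipcs                       # bail out after max_iters
```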

19 ReSHAPE's Cache Partitioning Strategy (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
[Chart: hits per second as a function of cache size, for each sharer of the L3]
Greedy Approach
o O(n·k) for n cache slices and k sharers
o May be sub-optimal, but does quite well in practice
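The greedy O(n·k) allocation hands out slices one at a time to whichever sharer's hit-rate curve promises the largest marginal gain. A sketch under an assumed interface (the curve layout and function name are mine, not ReSHAPE's):

```python
def greedy_partition(hit_curves, n_slices):
    """Greedy shared-cache partitioning: O(n * k) marginal-gain picks.

    hit_curves[s][j]: hits/sec sharer s achieves when owning j slices
                      (index 0 means zero slices, so hit_curves[s][0] == 0).
    Returns the number of slices allocated to each sharer.
    """
    alloc = [0] * len(hit_curves)
    for _ in range(n_slices):
        # marginal hits each sharer would gain from one more slice
        gains = [curve[alloc[s] + 1] - curve[alloc[s]]
                 for s, curve in enumerate(hit_curves)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc
```

With concave hit curves each greedy pick is locally safe, which is why the approach does well in practice as the slide notes; it can be sub-optimal when a curve has a plateau followed by a jump.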

20 ReSHAPE's Cache Partitioning Strategy (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
[Chart: hits per second as a function of cache size, for each sharer of the L3]
Minimize Misses Strategy
o O(log2(n) · 2^k) for n cache slices and k sharers
o May be too slow for large k
o We use this strategy for all evaluations presented here

21 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator
Step 1: Analyze benchmark applications
[Chart: benchmarks grouped by cache locality: Loose, Medium, Tight]

22 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator
Step 1: Analyze benchmark applications
Step 2: Construct workload mixes
[Table: 2-core: 12 mixes; 4-core: 12 mixes; 9-core: 7 mixes]

23 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator
Step 1: Analyze benchmark applications
Step 2: Construct workload mixes
Step 3: Construct configurations to be validated
[Diagrams of validated configurations: 32K L1 caches; lower-level caches from 128KB to 2MB, shared and private; memory bandwidths from 10MB/s to 10Gb/s]

24 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator
Step 1: Analyze benchmark applications
Step 2: Construct workload mixes
Step 3: Construct configurations to be validated
Step 4: Set up identical configurations in SIMICS and ReSHAPE
Step 5: Compare projections from SIMICS and ReSHAPE
o Each mix is checkpointed (under SIMICS) after running for 100 billion instructions per application
o At least 1 billion instructions beyond this are used for the validation run

25 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. Average 1-core IPC error: 1.5% (std. dev. = 1.4%)

26 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 1MB, 10Gb/s] Average 2-core IPC error: 2.7% (std. dev. = 2.1%)

27 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 1MB, 10Gb/s] Average miss rate projection error: 13.4% (std. dev. = 12.6%)

28 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 1MB, 10Gb/s] Average partition size projection error: 3.7% (std. dev. = 4.5%)

29 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 2MB, 10Gb/s] Average 4-core IPC error: 2.5% (std. dev. = 1.8%)

30 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 2MB, 10Gb/s] Average miss rate projection error: 12.8% (std. dev. = 13.1%)

31 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 2MB, 10Gb/s] Average partition size projection error: 20.9% (std. dev. = 12.8%)

32 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 2MB; bandwidths swept: 10Gb/s, 1Gb/s, 0.1Gb/s, 0.01Gb/s] Average IPC error: 17.3% (std. dev. = 5.4%)

33 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 128KB, 2MB, 10Gb/s] Private caches: average 4-core IPC error: 3.1% (std. dev. = 1.6%)

34 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. [Config: 32K, 128KB, 2MB, 10Gb/s] Average miss rate projection error: 7.5% (std. dev. = 7.1%)

35 USE CASES: Putting ReSHAPE to use
Designs compared: Homogeneous | Heterogeneous Caches | Heterogeneous Cores | Heterogeneous Both
Does increasing the sources of heterogeneity buy us performance?

36 USE CASES: Putting ReSHAPE to use
Does increasing the sources of heterogeneity buy us performance?
o Up to 4! unique schedules for a 4-application workload mix (App0-App3 across cores C0-C3)
[Table: all 24 permutations of applications A, B, C, D across the four cores]
What one might expect to see
o Small improvement with heterogeneous caches; some loss for bad schedules
o Larger improvement with heterogeneous cores
o Even larger improvement with heterogeneous cores + heterogeneous caches
[Chart: max/min/mean weighted speedup normalized to the Homogeneous design, for Het. Cache, Het. Core, and Het. Both; designs: Homogeneous | Heterogeneous Caches | Heterogeneous Cores | Heterogeneous Both]
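The 4! schedules in the table can be searched exhaustively once a fast estimator prices each app-to-core mapping. This sketch assumes hypothetical inputs (an `ipc_on_core` table of the kind a ReSHAPE-like tool could emit, plus stand-alone IPCs for normalization) and uses the standard weighted-speedup definition:

```python
from itertools import permutations

def best_schedule(ipc_on_core, alone_ipc):
    """Exhaustively score all n! static schedules by weighted speedup.

    ipc_on_core[(app, core)]: projected IPC of app on that core in the mix
    alone_ipc[app]          : IPC of the app running alone (normalizer)
    Weighted speedup = sum over apps of ipc_in_mix / ipc_alone.
    """
    apps = sorted(alone_ipc)
    best_ws, best_map = float("-inf"), None
    for cores in permutations(range(len(apps))):      # n! schedules
        ws = sum(ipc_on_core[(app, core)] / alone_ipc[app]
                 for app, core in zip(apps, cores))
        if ws > best_ws:
            best_ws, best_map = ws, dict(zip(apps, cores))
    return best_ws, best_map
```

This brute force is fine at n = 4 (24 schedules) but motivates the pruning question the later slides raise as core counts scale to 9.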

37 USE CASES: Putting ReSHAPE to use
Does increasing the sources of heterogeneity buy us performance?
o Smaller cores hurt more than the larger cores help
o Heterogeneous caches are better than heterogeneous cores in this case
[Designs: Homogeneous | Heterogeneous Caches | Heterogeneous Cores | Heterogeneous Both]

38 USE CASES: Putting ReSHAPE to use
[Designs: Homogeneous | Heterogeneous Caches | Heterogeneous Cores | Heterogeneous Both]
9-core designs: > 350,000 ReSHAPE simulations; the chart represents > 10 million ReSHAPE simulations
o As core count scales (4 to 9), the benefit of heterogeneity increases significantly
o Heterogeneous cores are better than heterogeneous caches in this case, but the schedule is still crucial

39 USE CASES: Putting ReSHAPE to use
[Designs: Homogeneous | Heterogeneous Caches | Heterogeneous Cores | Heterogeneous Both]
9-core designs with 3 core/cache types
o 3 core types and 3 cache sizes do not buy any more performance
o How much and what form of heterogeneity is needed requires careful analysis, depending on the design being evaluated

40 USE CASES: Putting ReSHAPE to use
Best DVFS setting per core (c0-c3) for each workload mix (m00-m11), under three objectives
[Config: 32K, 2MB, 10Gb/s. Legend: 1 = 250MHz/0.5W, 2 = 1GHz/2W, 3 = 4GHz/16W]

Mix | Weighted Speedup | Perf/Watt | 1/(Energy*Delay)
m00 | 3311 | 1111 | 3311
m01 | 3331 | 1111 | 3331
m02 | 1333 | 1111 | 1333
m03 | 1113 | 1112 | 1113
m04 | 1333 | 1111 | 1111
m05 | 1331 | 1111 | 1111
m06 | 1113 | 1112 | 1113
m07 | 1311 | 1111 | 1111
m08 | 3111 | 1111 | 1111
m09 | 3311 | 2111 | 3311
m10 | 1133 | 1121 | 1133
m11 | 1331 | 1111 | 1111

o Different settings for different workload mixes, and not always the fastest setting!
o Not always the slowest setting when optimizing performance/watt
o Somewhere in between when optimizing the Energy x Delay product
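The three objectives in the table rank DVFS settings differently. As a toy illustration (the metric definitions are standard, but treating the chip as one aggregate instruction stream is a simplifying assumption of mine, not how ReSHAPE scores settings):

```python
def evaluate_setting(throughputs, powers):
    """Score one DVFS setting under the table's three objectives.

    throughputs: per-core instructions/sec at the chosen settings
    powers     : per-core watts at the chosen settings
    Returns (performance, perf-per-watt, 1/(energy*delay)); a search
    keeps the setting maximizing whichever objective is being optimized.
    """
    perf = sum(throughputs)            # aggregate instructions/sec
    power = sum(powers)                # aggregate watts
    delay = 1.0 / perf                 # seconds per instruction
    energy = power * delay             # joules per instruction
    return perf, perf / power, 1.0 / (energy * delay)
```

Ranking all 3^4 per-core settings for each mix with such a helper is how a table like the one above could be produced from a fast estimator.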

41 CONCLUSIONS + FUTURE DIRECTION
o Rich design/configuration space for multi-core chips; analytical modeling can be a promising approach to tackling these large search spaces
o ReSHAPE extends the classical analytical performance model in novel ways
o Accuracy + speed make ReSHAPE a useful tool for early exploration
Future direction: extend ReSHAPE
o Validate across unique microarchitectures
o Extend key parameters and the model: memory-level parallelism, writeback traffic, prefetching
o Evaluate more use cases: best power-gating strategy based on workload mix; dynamic schedules based on per-phase application statistics
o Explore the rich constrained optimization problem of cache partitioning

42 Thank you!

43 RELATED WORK
Analytical modeling of multi-core chips
o Wentzlaff et al. (MIT Tech Report 2010), Li et al. (ISPASS 2005), and Yavits et al. (CAL 2013) all tackle different aspects of multicore chip design, but only consider homogeneous cores.
o Wu et al. (ISCA 2013) use locality profiles to identify how an application's cache locality degrades as the application is spread across more threads; they consider multi-threaded applications.
Several works relate to heterogeneous design/scheduling
o Navada et al. (PACT 2010, PACT 2013) consider simulation-based, criticality-driven design space exploration and mechanisms for selecting the best way to schedule a single application across multiple cores.
o Kumar et al. (Micro 2003, PACT 2006, ISCA 2004) did most of the seminal work in the area of heterogeneous multi-core. However, they have typically relied on detailed simulations, private cache hierarchies, and single-application scheduling.

44 VALIDATION: Comparing ReSHAPE's projections against the SIMICS full system simulator. Average miss rate projection error: 7.6% (std. dev. = 12.4%)

