Sandeep Navada © 2013 A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors Sandeep Navada, Niket.


1 Sandeep Navada © 2013 A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg Department of Electrical and Computer Engineering North Carolina State University 1

2 Single-ISA HCMP
Same ISA
Different microarchitectures
  – Superscalar width
  – Structure sizes
  – Frequency
Cores have different performance and power
New run-time optimization lever

3 Monotonic HCMP
Cores can be ranked independent of application
Core 1 is faster than Core 2 for any application

4 Monotonic HCMP example

5 HCMP literature
Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
Single thread
  – Minimize energy for a given performance degradation threshold w.r.t. the highest-ranked core
Multiple threads
  – Maximize throughput/Watt/mm²

6 Going beyond monotonic HCMP
Cores can’t be ranked independent of application
Cores designed from the ground up, not pre-existing

7 Non-monotonic HCMP

8 Optimize latency
Performance = IPC × frequency
Complexity↑ => IPC↑, frequency↓
This tradeoff plays out differently for different apps and depends on the ILP characteristics of the app
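The tradeoff on this slide can be illustrated with a small sketch. The two cores, the two applications, and every IPC value below are hypothetical, chosen only to show how IPC × frequency can rank the same pair of cores differently for different applications; none of the numbers come from the presentation.

```python
def bips(ipc, clock_period_ns):
    """Performance = IPC x frequency; with the period in ns this is in BIPS."""
    return ipc / clock_period_ns

# Hypothetical cores and IPC values (assumptions, not measured data).
cores = {
    "narrow_fast": {"clock_period_ns": 0.5,
                    "ipc": {"low_ilp_app": 0.9, "high_ilp_app": 1.2}},
    "wide_slow": {"clock_period_ns": 0.7,
                  "ipc": {"low_ilp_app": 1.0, "high_ilp_app": 2.4}},
}

def best_core(app):
    """Return the core name with the highest BIPS for the given app."""
    return max(
        cores,
        key=lambda name: bips(cores[name]["ipc"][app],
                              cores[name]["clock_period_ns"]),
    )

for app in ("low_ilp_app", "high_ilp_app"):
    print(app, "->", best_core(app))
```

With these assumed values the low-ILP application runs fastest on the higher-frequency core and the high-ILP application runs fastest on the wider core, so neither core can be ranked above the other independent of the application.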

9 Non-monotonic HCMP challenges

10 CORE SELECTION

11 Core design space
Parameter                     Value range                                              Number
Front end width               2, 3, 4, 5, 6, 7, 8                                      7
Issue width                   2, 3, 4, 5, 6, 7, 8                                      7
Physical register file size   64, 128, 192, 256, 384, 512                              6
Issue queue size              16, 24, 32, 48, 64, 96, 128                              7
Load queue/Store queue size   8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64     8
L1 I$ size                    8, 16, 32, 64, 128 KB                                    5
L1 D$ size                    8, 16, 32, 64, 128 KB                                    5
L2$ size                      2 MB                                                     1
Clock period                  0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns                8
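To get a feel for the size of this design space, the sketch below multiplies the number of values per parameter. It assumes every combination of values counts as a distinct design point, which the slides do not state; the dictionary keys are illustrative names, not identifiers from the authors' tools.

```python
from math import prod

# Value ranges copied from the design-space table; key names are illustrative.
design_space = {
    "front_end_width": [2, 3, 4, 5, 6, 7, 8],
    "issue_width": [2, 3, 4, 5, 6, 7, 8],
    "physical_register_file": [64, 128, 192, 256, 384, 512],
    "issue_queue": [16, 24, 32, 48, 64, 96, 128],
    "load_store_queue": [8, 16, 24, 32, 40, 48, 56, 64],
    "l1_icache_kb": [8, 16, 32, 64, 128],
    "l1_dcache_kb": [8, 16, 32, 64, 128],
    "l2_cache_mb": [2],
    "clock_period_ns": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
}

# Unpruned count, assuming all combinations are allowed.
total_design_points = prod(len(values) for values in design_space.values())
print(total_design_points)  # 3292800
```

Under that assumption the unpruned space has roughly 3.3 million points, which helps explain why the flow on the next slide prunes the space before simulating.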

12 Core selection
[Flowchart; labels in slide order]
Core design space, pruning script, pruned design space
SPEC benchmarks, SimPoint tool, 39 10M phases
FabScalar toolset: IPC, frequency, power
Performance of every phase on every design point
Search for N = 1, 2, 3, 4, yielding the optimal 1-, 2-, 3-, and 4-core-type HCMP
N: number of core types

13 Search for optimal 4-core-type HCMP
BIPS of each core type on each phase:
Phase  A    B    C    D    E    F    G    H
1      1.5  3.2  1.3  2.2  1.6  1.7  1.3  2.0
2      0.5  2.3  2.5  1.9  3.1  1.8  2.0  1.2
Candidate 4-core-type HCMPs:
Core 1  Core 2  Core 3  Core 4  Performance
A       B       C       D       HMEAN(3.2, 2.5) = 2.81
E       B       C       D       HMEAN(3.2, 3.1) = 3.15
A       F       C       D       HMEAN(2.2, 2.5) = 2.34
E       F       C       D       HMEAN(2.2, 3.1) = 2.57
E       F       G       H       HMEAN(2.0, 3.1) = 2.43
…
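The search on this slide can be sketched as an exhaustive enumeration: each phase is assumed to run on the best core type in the candidate set, and a candidate is scored by the harmonic mean of those per-phase BIPS values. The code below uses only the two example phases and eight core types shown on the slide; the function names are illustrative, and this is a reading of the slide's example rather than the authors' actual search code.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS per core type for phase 1 and phase 2, copied from the slide's example.
bips = {
    "A": [1.5, 0.5], "B": [3.2, 2.3], "C": [1.3, 2.5], "D": [2.2, 1.9],
    "E": [1.6, 3.1], "F": [1.7, 1.8], "G": [1.3, 2.0], "H": [2.0, 1.2],
}
NUM_PHASES = 2

def hcmp_performance(core_types):
    """Harmonic mean over phases of the best BIPS available in the set."""
    best_per_phase = [
        max(bips[core][phase] for core in core_types)
        for phase in range(NUM_PHASES)
    ]
    return harmonic_mean(best_per_phase)

def search(n):
    """Return one highest-scoring set of n core types (ties broken by order)."""
    return max(combinations(sorted(bips), n), key=hcmp_performance)

for candidate in ("ABCD", "EBCD", "AFCD", "EFCD", "EFGH"):
    print(candidate, round(hcmp_performance(candidate), 2))
print("one optimal 4-core-type set:", search(4))
```

The printed scores match the slide (2.81, 3.15, 2.34, 2.57, 2.43). In this two-phase example any set containing both B and E reaches the maximum of 3.15, so the specific set returned by `search(4)` is just the first such set in enumeration order.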

14 Kiviat diagram
Visualize core parameters
[Figure: Kiviat diagram; annotations: larger structures, higher frequency, increase superscalar width]

15 Optimal 1-core-type HCMP

16 Optimal 1-core-type HCMP

17 Optimal 2-core-type HCMP

18 Optimal 2-core-type HCMP
“A” core is still selected!

19 Optimal 2-core-type HCMP

20 Optimal 3-core-type HCMP

21 Optimal 3-core-type HCMP
“A” core is still selected!!

22 Optimal 3-core-type HCMP
“LW” core is still selected.

23 Optimal 3-core-type HCMP
“N” core targets frequency bottleneck.

24 Optimal 4-core-type HCMP

25 Optimal 4-core-type HCMP
“A” and “N” are selected, again.
“LW” got split into “L” and “W”, addressing each bottleneck better!

26 LW split

27 Optimal HCMP
The optimal HCMP consists of
1. Average core, which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core
Core Type  Clock Period  ILP-extracting buffers  Widths  Caches
A          0.6           32, 128, 128            3, 4    64, 64
N          0.5           32, 64, 64              2, 2    16, 16
L          0.7           48, 128, 384            4, 4    128, 128
W          0.7           32, 128, 128            6, 6    128, 32

28 APPLICATION STEERING

29 Bottleneck-driven steering
Application is continuously diagnosed for bottlenecks on the current core using performance counters
Migrate to a different core when bottlenecks change
  – To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if there are no bottlenecks

30 Bottleneck-driven steering
Track performance counters => Diagnose bottlenecks => Steer phase

31 Track performance counters
Counter     Description
Width_ctr   Ready instruction not issued due to limited issue width.
Window_ctr  Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr      Instruction stalled due to instruction cache miss.
D$_ctr      Load instruction stalled due to data cache miss.
Misp_ctr    Mispredicted branch.
L2_ctr      Instruction stalled due to L2 cache miss.
Cycle_ctr   Number of cycles.

32 Diagnose bottlenecks
Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
Performance counters are normalized with respect to the cycle count
If the normalized performance counter value is above threshold, then the corresponding resource is a bottleneck

33 Diagnose bottlenecks
Bottleneck      Expression
bool Width      Width = (Width_ctr > Width_thresh)
bool Window     Window = (Window_ctr > Window_thresh)
bool Frequency  Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$         I$ = (I$_ctr > I$_thresh)
bool D$         D$ = (D$_ctr > D$_thresh)
Thresholds are determined empirically using a training process
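A minimal sketch of the diagnosis step follows, assuming the counters are read once per 10K-instruction segment and divided by the cycle count before the comparison, as the previous slide describes. The counter names follow the slides; the threshold values and the sample counter readings are made-up placeholders, since the slides only say that thresholds come from a training process.

```python
# Placeholder thresholds on normalized counters (assumptions, not trained values).
HYPOTHETICAL_THRESHOLDS = {
    "Width": 0.30, "Window": 0.25, "Misp": 0.01,
    "L2": 0.05, "I$": 0.05, "D$": 0.10,
}

def diagnose(counters, thresholds=HYPOTHETICAL_THRESHOLDS):
    """Return bottleneck flags for one segment of execution."""
    cycles = counters["Cycle_ctr"]
    if cycles <= 0:
        raise ValueError("Cycle_ctr must be positive")
    # Normalize each event counter with respect to the cycle count.
    above = {
        name: counters[name + "_ctr"] / cycles > limit
        for name, limit in thresholds.items()
    }
    return {
        "Width": above["Width"],
        "Window": above["Window"],
        # Per the slide, mispredictions or L2 misses flag a frequency bottleneck.
        "Frequency": above["Misp"] or above["L2"],
        "I$": above["I$"],
        "D$": above["D$"],
    }

# Made-up counter readings for one segment.
sample = {"Width_ctr": 4000, "Window_ctr": 500, "I$_ctr": 100, "D$_ctr": 300,
          "Misp_ctr": 50, "L2_ctr": 200, "Cycle_ctr": 10000}
flags = diagnose(sample)
print(flags)
```

With these placeholder numbers only the width counter exceeds its threshold, so the segment is diagnosed as width-bound.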

34 Steer phase
Core  Bottlenecks relieved  Bottlenecks worsened  Steering logic
W     Width                 Frequency             if (Width && !Frequency) W
L     Window                Frequency             else if (Window && !Frequency) L
N     Frequency             Width, Window         else if (Frequency && !(Width || Window)) N
A     n/a                   n/a                   else A
Paper shows full steering logic with I$ and D$ bottlenecks included.
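The steering logic in this table translates directly into a small function. The sketch below covers only the simplified logic shown on the slide; it omits the I$ and D$ cases, which the slide defers to the paper, and the function name is illustrative.

```python
def steer(width, window, frequency):
    """Pick a core type (W, L, N, or A) from three bottleneck flags."""
    if width and not frequency:
        return "W"  # W relieves width but worsens frequency
    elif window and not frequency:
        return "L"  # L relieves window but worsens frequency
    elif frequency and not (width or window):
        return "N"  # N relieves frequency but worsens width and window
    else:
        return "A"  # no bottleneck, or no accelerator helps without hurting

print(steer(width=True, window=False, frequency=False))   # W
print(steer(width=True, window=False, frequency=True))    # A
print(steer(width=False, window=False, frequency=False))  # A
```

Note that a segment with both a width and a frequency bottleneck falls through to the average core, because the W core would worsen the frequency bottleneck and the N core would worsen the width bottleneck.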

35 RESULTS

36 Methodology
Benchmarks: SPEC 2000
  – Simulate first 4 billion instructions
Metrics
  – Performance: BIPS
  – Efficiency: BIPS³/Watt
Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles

37 Steering algorithms
Algorithm   Description
Baseline    Run the entire 4B instructions on the average core
Sampling    Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck  Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal     Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle      Run every 10K instruction segment on the best core type
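The difference between the "Optimal" and "Oracle" rows can be sketched as follows. All per-segment BIPS values are hypothetical, the first segment of the prior-segment policy is assumed to start on the average core A, and migration overhead is ignored; none of these choices are stated in the slides.

```python
from statistics import harmonic_mean

# Hypothetical BIPS of each core type on four consecutive 10K-instruction segments.
segment_bips = [
    {"A": 2.0, "N": 2.6, "L": 1.8, "W": 1.9},
    {"A": 2.0, "N": 2.5, "L": 1.7, "W": 1.9},
    {"A": 2.1, "N": 1.6, "L": 2.9, "W": 2.0},
    {"A": 2.1, "N": 1.5, "L": 3.0, "W": 2.2},
]

def best(segment):
    """Core type with the highest BIPS on one segment."""
    return max(segment, key=segment.get)

def oracle_schedule(segments):
    """Oracle: run every segment on its own best core type."""
    return [best(segment) for segment in segments]

def prior_best_schedule(segments, start_core="A"):
    """'Optimal' row: run each segment on the prior segment's best core type."""
    return [start_core] + [best(prev) for prev in segments[:-1]]

def overall_bips(segments, schedule):
    """Equal-length segments, so overall BIPS is the harmonic mean."""
    return harmonic_mean([seg[core] for seg, core in zip(segments, schedule)])

oracle = oracle_schedule(segment_bips)
prior = prior_best_schedule(segment_bips)
print(oracle, round(overall_bips(segment_bips, oracle), 2))
print(prior, round(overall_bips(segment_bips, prior), 2))
```

In this made-up trace the prior-segment policy loses performance only on the first segment and on the segment where the best core type changes, which is the gap that separates it from the oracle.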

38 4-core-type HCMP
4-core HCMP outperforms homogeneous CMP by up to 76%, and by 15% on average
Our steering algorithm is able to capture most of this gain

39 Sampling vs. bottleneck steering
Sampling performs 8.9% better than the average core
Bottleneck steering performs 12% better than the average core

40 Occupancy
Occupancy pattern varies dramatically across different applications

41 Efficiency
Sampling performs 25% better than the average core
Bottleneck steering performs 33% better than the average core

42 SUMMARY

43 Summary
First proposal to architect and orchestrate multiple core types for latency reduction.
With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.
In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type that relieves the bottlenecks.

44 Future work
HCMPs open up a whole new direction of microarchitecture research.
Many microarchitecture optimizations don’t provide universal benefits.
As each core type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.

