A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors
Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg
Department of Electrical and Computer Engineering, North Carolina State University
Sandeep Navada © 2013

HCMP literature
Focus
– Monotonic cores
– Cores are preordained
– Scheduling
Single thread
– Minimize energy for given performance degradation threshold w.r.t. highest ranked core
Multiple threads
– Maximize throughput/Watt/mm²

Optimize latency
Performance = IPC × frequency
Complexity↑ => IPC↑, frequency↓
This tradeoff plays out differently for different apps and depends on the ILP characteristics of the app.
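As a rough illustration of the tradeoff above, the sketch below evaluates Performance = IPC × frequency for two cores on two apps. Every number and name in it (the `narrow` and `wide` cores, the IPC values, the clock periods) is a hypothetical assumption chosen for illustration, not data from the talk.

```python
def bips(ipc: float, clock_period_ns: float) -> float:
    """Billions of instructions per second: IPC x frequency (GHz = 1 / ns)."""
    return ipc / clock_period_ns


# Hypothetical cores: a simple core with a short clock period and a
# more complex core with a longer clock period.
CLOCK_NS = {"narrow": 0.5, "wide": 0.7}

# Hypothetical IPCs: the high-ILP app gains a lot from complexity,
# the low-ILP app gains very little.
IPC = {
    "high_ilp_app": {"narrow": 1.2, "wide": 2.4},
    "low_ilp_app": {"narrow": 0.8, "wide": 0.9},
}


def best_core(app: str) -> str:
    """Return the core with the highest BIPS for the given app."""
    return max(CLOCK_NS, key=lambda core: bips(IPC[app][core], CLOCK_NS[core]))


for app in IPC:
    print(app, "->", best_core(app))
```

With these made-up numbers the high-ILP app prefers the complex core and the low-ILP app prefers the fast, simple one, which is the kind of non-monotonic ranking the slide describes.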

Core design space
Parameter                        Value range                                                Number
Front end width                  2, 3, 4, 5, 6, 7, 8                                        7
Issue width                      2, 3, 4, 5, 6, 7, 8                                        7
Physical register file size      64, 128, 192, 256, 384, 512                                6
Issue queue size                 16, 24, 32, 48, 64, 96, 128                                7
Load queue / Store queue size    8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64       8
L1 I$ size                       8, 16, 32, 64, 128 KB                                      5
L1 D$ size                       8, 16, 32, 64, 128 KB                                      5
L2$ size                         2 MB                                                       1
Clock period                     0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns                  8
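To get a feel for why the design space has to be pruned, the sketch below counts the raw cross product of the parameter values in the table. It assumes every parameter can be varied independently of the others, which the slide does not state; the dictionary keys are illustrative names only.

```python
from math import prod

# Value lists copied from the design-space table; key names are illustrative.
DESIGN_SPACE = {
    "front_end_width": [2, 3, 4, 5, 6, 7, 8],
    "issue_width": [2, 3, 4, 5, 6, 7, 8],
    "phys_reg_file": [64, 128, 192, 256, 384, 512],
    "issue_queue": [16, 24, 32, 48, 64, 96, 128],
    "load_store_queue": [8, 16, 24, 32, 40, 48, 56, 64],  # LQ and SQ sized together
    "l1_icache_kb": [8, 16, 32, 64, 128],
    "l1_dcache_kb": [8, 16, 32, 64, 128],
    "l2_cache_mb": [2],
    "clock_period_ns": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
}

# Size of the unpruned space, assuming all combinations are allowed.
design_points = prod(len(values) for values in DESIGN_SPACE.values())
print(design_points)
```

Under that independence assumption the unpruned space has about 3.3 million points, so simulating every phase on every point is impractical without pruning first.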

Core selection
[Flow diagram]
– Core design space → Pruning script → Pruned design space
– SPEC bench → SimPoint tool → 39 10M phases
– FabScalar toolset (IPC, freq, power) → Performance of every phase on every design point
– Search N=1 HCMP → Optimal 1-core-type HCMP
– Search N=2 HCMP → Optimal 2-core-type HCMP
– Search N=3 HCMP → Optimal 3-core-type HCMP
– Search N=4 HCMP → Optimal 4-core-type HCMP
N: Number of core types

Search for Optimal 4-core-type HCMP

BIPS of each phase on each core type
Phase   A     B     C     D     E     F     G     H
1       1.5   3.2   1.3   2.2   1.6   1.7   1.3   2.0
2       0.5   2.3   2.5   1.9   3.1   1.8   2.0   1.2

Core 1  Core 2  Core 3  Core 4  Performance
A       B       C       D       HMEAN(3.2, 2.5) = 2.81
E       B       C       D       HMEAN(3.2, 3.1) = 3.15
A       F       C       D       HMEAN(2.2, 2.5) = 2.34
E       F       C       D       HMEAN(2.2, 3.1) = 2.57
E       F       G       H       HMEAN(2.0, 3.1) = 2.43
…
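The search illustrated above can be sketched as a brute-force loop: each phase runs on whichever core type in the candidate HCMP gives it the highest BIPS, and the candidate is scored by the harmonic mean of those per-phase bests. The BIPS values are the ones on the slide; the function names and the exhaustive enumeration are assumptions for illustration, and the actual search procedure may differ.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of (phase 1, phase 2) on each core type, taken from the slide.
BIPS = {
    "A": (1.5, 0.5), "B": (3.2, 2.3), "C": (1.3, 2.5), "D": (2.2, 1.9),
    "E": (1.6, 3.1), "F": (1.7, 1.8), "G": (1.3, 2.0), "H": (2.0, 1.2),
}
NUM_PHASES = 2


def hcmp_performance(cores) -> float:
    """Harmonic mean over phases of the best BIPS any core in `cores` achieves."""
    best_per_phase = [max(BIPS[c][p] for c in cores) for p in range(NUM_PHASES)]
    return harmonic_mean(best_per_phase)


def search(n: int):
    """Exhaustively pick the n core types with the highest HCMP performance."""
    return max(combinations(sorted(BIPS), n), key=hcmp_performance)


for combo in ("ABCD", "EBCD", "AFCD", "EFCD", "EFGH"):
    print(combo, round(hcmp_performance(combo), 2))
print("best 4-core-type HCMP:", search(4))
```

In this toy table several combinations tie for the best score, because any set containing both B and E already captures the best core for each phase; `max` simply returns the first such set it encounters.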

Optimal HCMP
The optimal HCMP consists of
1. Average core which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

Core Type   Clock Period   ILP-extracting buffers   Widths   Caches
A           0.6            32, 128, 128             3, 4     64, 64
N           0.5            32, 64, 64               2, 2     16, 16
L           0.7            48, 128, 384             4, 4     128, 128
W           0.7            32, 128, 128             6, 6     128, 32

Bottleneck-driven steering
Application is continuously diagnosed for bottlenecks on the current core using performance counters
Migrate to a different core when bottlenecks change
– To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
– To the average core if no accelerator meets this condition, or if there are no bottlenecks

Track performance counters
Counter      Description
Width_ctr    Ready instruction not issued due to limited issue width.
Window_ctr   Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr       Instruction stalled due to instruction cache miss.
D$_ctr       Load instruction stalled due to data cache miss.
Misp_ctr     Mispredicted branch.
L2_ctr       Instruction stalled due to L2 cache miss.
Cycle_ctr    Number of cycles.

Diagnose bottlenecks
Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
Performance counters are normalized with respect to the cycle count
If the normalized performance counter value is above its threshold, then the corresponding resource is a bottleneck

Diagnose bottlenecks
Bottleneck       Expression
bool Width       Width = (Width_ctr > Width_thresh)
bool Window      Window = (Window_ctr > Window_thresh)
bool Frequency   Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$          I$ = (I$_ctr > I$_thresh)
bool D$          D$ = (D$_ctr > D$_thresh)
Thresholds are determined empirically using a training process
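The diagnosis expressions above, combined with normalization by cycle count, could be sketched roughly as follows. The `Counters` class, the field names, and all threshold values are hypothetical; the slides only say that the real thresholds come from an empirical training process.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw counter values collected over one interval (illustrative names)."""
    width: int
    window: int
    icache: int
    dcache: int
    misp: int
    l2: int
    cycles: int  # assumed to be > 0


# Hypothetical thresholds on the cycle-normalized counters.
THRESH = {
    "width": 0.30, "window": 0.20, "icache": 0.05,
    "dcache": 0.10, "misp": 0.01, "l2": 0.05,
}


def diagnose(c: Counters) -> dict:
    """Normalize each counter by the cycle count and compare with its threshold."""
    over = {name: getattr(c, name) / c.cycles > THRESH[name] for name in THRESH}
    return {
        "Width": over["width"],
        "Window": over["window"],
        # A branch-misprediction or L2-miss problem is attributed to frequency.
        "Frequency": over["misp"] or over["l2"],
        "I$": over["icache"],
        "D$": over["dcache"],
    }


sample = Counters(width=4000, window=500, icache=100, dcache=200,
                  misp=50, l2=900, cycles=10000)
print(diagnose(sample))
```

For the made-up sample, the width counter (0.40 per cycle) and the L2 counter (0.09 per cycle) exceed their thresholds, so `Width` and `Frequency` are flagged and the other bottlenecks are not.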

Steer phase
Core   Bottlenecks relieved   Bottlenecks worsened   Steering logic
W      Width                  Frequency              if (Width && !Frequency) W
L      Window                 Frequency              else if (Window && !Frequency) L
N      Frequency              Width, Window          else if (Frequency && !(Width || Window)) N
A      n/a                    n/a                    else A
Paper shows full steering logic with I$ and D$ bottlenecks included.
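The simplified steering logic in the table translates almost directly into code. The sketch below covers only the Width, Window, and Frequency bottlenecks shown on the slide; the I$ and D$ cases from the paper are not reproduced, and the function name `steer` is an illustrative choice.

```python
def steer(width: bool, window: bool, frequency: bool) -> str:
    """Pick a core type from the diagnosed bottlenecks (simplified slide logic)."""
    if width and not frequency:
        return "W"  # wide core relieves Width but worsens Frequency
    elif window and not frequency:
        return "L"  # large-window core relieves Window but worsens Frequency
    elif frequency and not (width or window):
        return "N"  # narrow, fast core relieves Frequency but worsens Width and Window
    return "A"      # average core: no accelerator qualifies, or no bottleneck


print(steer(width=True, window=False, frequency=False))  # W
print(steer(width=True, window=False, frequency=True))   # A
```

Note that conflicting diagnoses, such as Width together with Frequency, fall through to the average core, since every accelerator that relieves one of them would worsen the other.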

Methodology
Benchmarks: SPEC 2000
– Simulate first 4 billion instructions
Metrics
– Performance: BIPS
– Efficiency: BIPS³/Watt
Migration overhead
– Default: 100 cycles
– Sensitivity study: 1K, 10K cycles

Steering algorithms
Algorithm    Description
Baseline     Run the entire 4B instructions on the average core
Sampling     Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck   Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal      Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle       Run every 10K instruction segment on the best core type
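The difference between the Oracle and Optimal rows can be made concrete with a small sketch that produces the core schedule each would choose. The per-segment BIPS values are invented, and starting the Optimal schedule on the average core `A` is an assumption, since the table does not say what happens for the first segment.

```python
def oracle_schedule(perf: list) -> list:
    """Oracle: run every segment on the core type that is best for that segment."""
    return [max(segment, key=segment.get) for segment in perf]


def optimal_schedule(perf: list, start_core: str = "A") -> list:
    """Optimal: run every segment on the core type that was best for the prior one.

    The first segment has no prior segment, so it runs on `start_core`
    (an assumption made for this sketch).
    """
    return [start_core] + oracle_schedule(perf)[:-1]


# Hypothetical BIPS of three consecutive 10K-instruction segments on two core types.
perf = [
    {"A": 2.0, "W": 2.5},
    {"A": 2.1, "W": 1.8},
    {"A": 1.9, "W": 2.6},
]

print(oracle_schedule(perf))   # ['W', 'A', 'W']
print(optimal_schedule(perf))  # ['A', 'W', 'A']
```

In this deliberately adversarial example the best core alternates every segment, so a policy that relies on the prior segment is wrong each time, while the oracle is right by construction.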

Sampling vs. bottleneck steering
Sampling performs 8.9% better than the average core
Bottleneck steering performs 12% better than the average core

Summary
First proposal to architect and orchestrate multiple core types for latency reduction.
With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.
In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type which relieves the bottlenecks.

Future work
HCMPs open up a whole new direction of microarchitecture research.
Many microarchitecture optimizations don’t provide universal benefits.
As each core type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.
