Presentation on theme: "Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures J. Winter and D. Albonesi, Cornell University International Conference on Dependable."— Presentation transcript:
Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures J. Winter and D. Albonesi, Cornell University International Conference on Dependable Systems and Networks, 2008
Paper Overview “Uniform cores” are not uniform. E.g., an 8-core Intel Xeon processor is heterogeneous in the sense that its cores do not perform identically, due to hard errors, process variations, etc. It would be nice to schedule applications on the cores with the heterogeneity in mind, to match the capabilities of degraded cores with the applications. Three algorithms: Hungarian, Global Search, Local Search. Goal: reduce ED² relative to naïve assignment.
Why can’t we make uniform processors? There’s Not So Much Room at the Bottom (with apologies to R. Feynman). As transistors and wires shrink, the number of hard errors per die increases, and the devices also wear out faster. Yields would be too low if every core with a (non-fatal) fault were thrown out. Processors will therefore ship with “unpredictably heterogeneous” cores.
What can we do about unpredictably heterogeneous cores? Hardware solutions: redundancy, fault diagnosis, defect tolerance. These are good solutions to certain aspects of the problem, but they do not address scheduling. Hardware/software solutions (this paper): hardware can provide feedback on performance and power dissipation; the operating system handles global balancing requirements.
Assumptions and Methodology Assumptions: application behavior changes slowly; interaction between applications is limited. Methodology: reduce the scheduling problem to the Assignment Problem; solve it with the Hungarian Algorithm or Iterative Optimization.
Related Work Permanent failure toleration techniques: redundancy to tolerate hard errors, and fault isolation and diagnosis leading to reconfiguration. Mitigation of manufacturing process variations: system-level and fabrication techniques. Using the operating system to improve CMP energy efficiency: Dynamic Voltage and Frequency Scaling based on workload; thermal control. Most previous work deals with homogeneous chip systems.
Scheduling Algorithms Methodology: assign applications to cores over a fixed, short period of time, and reassess periodically. The algorithms use the sampling data for the decision. Hungarian Algorithm: solves the “Assignment Problem” by assuming no interactions between threads and static program performance. Uses normalized energy-delay-squared (ED²) sample results. O(N³) complexity.
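To make the assignment step concrete, here is a minimal pure-Python sketch of the O(N³) Hungarian method (the common potentials-based formulation, not code from the paper). It assumes a square matrix in which cost[app][core] holds a normalized ED² sample; the function name hungarian and the 3x3 numbers are illustrative only.

```python
def hungarian(cost):
    """Minimum-cost assignment for a square matrix.

    cost[i][j] is the cost of placing application i on core j.
    Returns (assignment, total) where assignment[i] is the core for app i.
    """
    n = len(cost)
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (n + 1)   # column potentials (1-indexed)
    p = [0] * (n + 1)     # p[j] = row currently matched to column j (0 = none)
    way = [0] * (n + 1)   # back-pointers along the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so at least one new zero-reduced-cost edge appears.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break

    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical normalized ED^2 samples: rows are applications, columns are cores.
cost = [
    [1.00, 1.30, 1.10],
    [1.00, 1.05, 1.40],
    [1.00, 1.20, 1.15],
]
assignment, total = hungarian(cost)
# assignment == [2, 1, 0]: app 0 -> core 2, app 1 -> core 1, app 2 -> core 0
```

Every application prefers core 0 in this toy matrix, so the best overall schedule gives core 0 to the application that loses the most elsewhere, which is the kind of trade-off a per-application greedy choice would miss.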
Scheduling Algorithms (continued) Iterative Optimization Algorithms (using an AI approach): simple to implement, greedy. Global Search: a random schedule is tried each interval, and the OS keeps track of the best configuration. Plus: fast exploration. Minus: does not always provide the optimal solution.
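A minimal sketch of the global search idea, assuming the OS can obtain one ED² reading per interval for whatever schedule is running. The callback measure_ed2, the toy cost table, and the default of 25 intervals are stand-ins chosen for illustration, not the paper's implementation.

```python
import random


def global_search(num_cores, measure_ed2, intervals=25, seed=0):
    """Try the initial schedule plus (intervals - 1) random ones; keep the best.

    A schedule is a list where schedule[app] is the core that app runs on.
    measure_ed2 is a hypothetical callback standing in for one sampling interval.
    """
    rng = random.Random(seed)
    current = list(range(num_cores))          # initial configuration
    best, best_cost = list(current), measure_ed2(current)
    for _ in range(intervals - 1):
        rng.shuffle(current)                  # a fresh random permutation
        cost = measure_ed2(current)
        if cost < best_cost:                  # OS remembers the best seen so far
            best, best_cost = list(current), cost
    return best, best_cost


# Toy stand-in for hardware feedback: fixed per-(app, core) normalized ED^2.
TOY_COST = [
    [1.00, 1.30, 1.10, 1.20],
    [1.00, 1.05, 1.40, 1.10],
    [1.00, 1.20, 1.15, 1.30],
    [1.00, 1.10, 1.25, 1.05],
]


def toy_measure(schedule):
    return sum(TOY_COST[app][core] for app, core in enumerate(schedule))


best, best_cost = global_search(4, toy_measure)
```

Because each trial is independent of the previous one, the search covers the space quickly but has no guarantee of reaching the optimum within a fixed number of intervals.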
Scheduling Algorithms (continued) Local Search: uses a “neighborhood” of assignments that are closely related to the current configuration (using pair-wise swaps). During exploration, assignments do not change much, and the scheduler reverts if the previous configuration was better. Plus: a more gradual search that steadily improves.
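The neighborhood idea can be sketched as follows. This is a simplified illustration under stated assumptions: each interval applies swaps_per_step random pair-wise swaps to the best schedule so far and keeps the result only if the measured ED² improves. The names local_search, measure_ed2 and the toy cost table are hypothetical, and accepting or reverting the whole set of swaps at once is a simplification.

```python
import random


def local_search(initial, measure_ed2, swaps_per_step=1, intervals=25, seed=0):
    """Hill-climb over schedules using pair-wise swaps.

    initial[app] is the core that app starts on. measure_ed2 is a hypothetical
    callback standing in for one sampling interval on the real chip.
    """
    rng = random.Random(seed)
    best = list(initial)
    best_cost = measure_ed2(best)
    for _interval in range(intervals - 1):
        candidate = list(best)
        for _swap in range(swaps_per_step):
            a, b = rng.sample(range(len(candidate)), 2)   # two distinct apps
            candidate[a], candidate[b] = candidate[b], candidate[a]
        cost = measure_ed2(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost             # keep the improvement
        # otherwise revert: the next candidate starts from `best` again
    return best, best_cost


# Toy stand-in for hardware feedback: fixed per-(app, core) normalized ED^2.
TOY_COST = [
    [1.00, 1.30, 1.10, 1.20],
    [1.00, 1.05, 1.40, 1.10],
    [1.00, 1.20, 1.15, 1.30],
    [1.00, 1.10, 1.25, 1.05],
]


def toy_measure(schedule):
    return sum(TOY_COST[app][core] for app, core in enumerate(schedule))


start = [0, 1, 2, 3]
best, best_cost = local_search(start, toy_measure, swaps_per_step=2)
```

Since a worse candidate is never adopted, the cost of the retained schedule can only stay the same or fall from one interval to the next.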
Simulation SESC simulator (“a microprocessor architectural simulator”) as the base, augmented with CACTI, Wattch, HotSpot, and HotLeakage. 4 GHz clock frequency, supply voltage of 1.0 V. Single-threaded applications from SPEC CPU2000.
Simulation (continued) Direct interaction among applications on different cores is limited (as per one assumption). Inter-core heating effects are limited by the L2 caches surrounding each core, which act as heat sinks. Off-chip memory bandwidth is statically partitioned among cores. Bottom line: the simulation is of “a multi-core processor using single-core simulations to obtain performance, power, and thermal statistics that are then combined by a higher level chip-wide simulator that performs the role of the OS scheduler”.
Simulation (continued) Advantage: scales to CMPs with a large number of cores. Baseline: 8-core homogeneous chip multiprocessor with no degradation. Processor degradation types: pipeline component disabled (ALU, ROB entries, etc.); frequency degradation from process variations; leakage current variations.
Simulation (continued) Four different workloads (Table 4). Each benchmark is used evenly among the four workloads. The OS switches between exploration and steady-state.
Results Comparisons are made using the ED² metric against a baseline with no errors or variations and perfect scheduling. ED² was chosen to balance performance with power dissipation.
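As a small worked example of the metric, ED² is energy multiplied by the square of the delay, and degradation is reported relative to a baseline. The helper names and the numbers below are illustrative assumptions, not measurements from the paper.

```python
def ed2(energy_joules, delay_seconds):
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


def ed2_degradation(measured, baseline):
    """Fractional ED^2 increase over a baseline (0.22 means 22% worse)."""
    return measured / baseline - 1.0


# Hypothetical numbers chosen only to show the arithmetic.
baseline = ed2(energy_joules=10.0, delay_seconds=2.0)   # 10 * 2.0^2 = 40
degraded = ed2(energy_joules=9.0, delay_seconds=2.2)    # 9 * 2.2^2 = 43.56
increase = ed2_degradation(degraded, baseline)          # about 0.089, i.e. 8.9% worse
```

In this example a 10% energy saving does not make up for a 10% slowdown, because the delay term is squared.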
Results (simple scheduling) Round Robin and Randomized algorithms degrade ED² by 22% on average. In the worst case, ED² degrades by up to 45%.
Results (advanced scheduling) Hungarian: 12.5 million cycle intervals. Each of the eight apps is executed on each core, with the assignment rotated seven times to fill the 8x8 cost matrix. 7.3% increase in ED². 200k cycles to solve the cost matrix.
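The rotation scheme can be sketched as below, assuming that rotation r places application i on core (i + r) mod N, so the initial placement plus seven rotations visits every (app, core) pair exactly once. The function names and the sample_ed2 callback are hypothetical, and the cycle total assumes each of the eight sampling intervals lasts 12.5 million cycles.

```python
def rotation_schedules(n):
    """Return n schedules; in rotation r, app i runs on core (i + r) % n."""
    return [[(app + r) % n for app in range(n)] for r in range(n)]


def build_cost_matrix(n, sample_ed2):
    """Fill an n x n matrix where cost[app][core] is one sampled ED^2 value.

    sample_ed2(app, core) is a hypothetical stand-in for reading hardware
    counters after running `app` on `core` for one interval.
    """
    cost = [[None] * n for _ in range(n)]
    for schedule in rotation_schedules(n):
        for app, core in enumerate(schedule):
            cost[app][core] = sample_ed2(app, core)
    return cost


N_CORES = 8
schedules = rotation_schedules(N_CORES)   # initial placement + 7 rotations
pairs = {(app, core) for s in schedules for app, core in enumerate(s)}

# Toy sampler: any deterministic function works for the illustration.
matrix = build_cost_matrix(N_CORES, lambda app, core: 1.0 + 0.01 * ((app + core) % 8))

INTERVAL_CYCLES = 12_500_000
SAMPLING_CYCLES = INTERVAL_CYCLES * len(schedules)   # 8 * 12.5M = 100M cycles
```

Each rotation is itself a valid schedule (one application per core), so sampling never leaves a core idle or doubly occupied.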
Results (advanced scheduling) Global and local search: 25 intervals of 4 million cycles. Global: tries the initial configuration and 24 other random configurations. 19.5% degradation over baseline.
Results (advanced scheduling) Local: three versions, N=1, 2, 4, with N pair-wise swaps. For N=2, 4: beneficial pair-wise swaps are kept, others discarded. 15%, 12.6%, 7.8% degradation.
Results (overall) Comparison between degraded and non-degraded systems. The Offline Oracle performs better on ED² because some degraded cores operate at lower power. Hungarian and Local Search 4 perform at almost baseline.
Conclusions CMPs will be affected by more and more variations and hard errors as the technology scales down, creating heterogeneity in otherwise uniform cores. Naïve scheduling on such cores leads to poor ED² results. Under limited core-to-core interaction, the scheduling problem reduces to the Assignment Problem and can be solved with the Hungarian Algorithm. Certain AI schedulers work well, too.