1 Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space
Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle
Presented by: Caghan Demirci, Dept. of Computer & Information Sciences, University of Delaware

2 Goal
Performance depends on both the architecture and the compiler optimizations. The current methodology designs the compiler and the architecture in isolation, which gives sub-optimal performance. The proposed methodology designs them simultaneously by exploring the co-design space, but doing so is extremely time-consuming. Solution: predict the performance of the optimizing compiler on each architecture without tuning the compiler.
Typically, an architecture is selected under the assumption that the optimizing compiler can deliver a certain level of performance. A compiler is then built and tuned for that architecture, which will hopefully deliver the performance levels assumed.
Sub-optimal: the compiler team may not be able to deliver a compiler that achieves the architect’s expectations. If we could predict the performance of the eventual optimizing compiler on any architecture, an entirely different architecture might have been chosen.
Time-consuming: for each architecture, an optimizing compiler must be built and tuned.

3 Functionality
Training: a small sample (less than 0.01%) of the architecture and optimization space.
Input: an architecture configuration, plus information gained from a non-optimizing baseline compiler on that architecture.
Output: a performance prediction for a yet-to-be-tuned optimizing compiler on the architecture, with an error rate of 1.6%.
Previous uses of predictors to reduce simulation time: fixed program and architecture, exploring the optimization space; fixed program and optimizations, exploring the architecture space; fixed program, exploring both spaces, which fails to predict the performance of an optimizing compiler accurately.

4 Methodology
Benchmarks: the full MiBench suite (35 benchmarks). For each benchmark, choose the inputs leading to at least 100 million executed instructions.
Metrics: execution time (cycles), energy, and ED (energy-delay product). ED presents the trade-off between performance and energy consumption in a single value; lower is better.
Optimizing compiler: a compiler that uses iterative compilation over 1000 randomly-selected flag combinations. Iterative compilation can out-perform an optimizing compiler tuned for a specific configuration, so it can be considered an upper bound on the performance an optimizing compiler can achieve.
Co-design space: the combined space of all architectural configurations and compiler optimizations.
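The ED metric on this slide can be made concrete with a minimal sketch; the cycle and energy values below are hypothetical, not from the paper.

```python
# Energy-delay product: ED = energy * delay; lower is better.
def energy_delay(cycles, energy):
    return energy * cycles

def relative_ed(cycles, energy, base_cycles, base_energy):
    """Normalise against the baseline: values below 1.0 beat the baseline."""
    return energy_delay(cycles, energy) / energy_delay(base_cycles, base_energy)

baseline = (100e6, 2.0)   # (cycles, joules) -- hypothetical baseline run
candidate = (90e6, 1.9)   # a faster, slightly more frugal configuration
print(relative_ed(*candidate, *baseline))  # ~0.855: better than the baseline
```

Because ED multiplies energy by delay, a configuration must improve the combined product, not just one of the two, to score below 1.0.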

5 Methodology
Baseline architecture: Intel XScale. Simulator: Xtrem. Cache and branch predictor configurations are critical components in an embedded processor; varying them yields a large number of possible architecture combinations.

6 Methodology
Baseline compiler: gcc 4.1.0 with -O1. There are 642 million possible optimization combinations. On average, -O2 and -O3 produce the same execution time as -O1 but consume more energy and have worse ED than -O1, so -O1 is the best choice of baseline optimization level.
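The sampling step behind iterative compilation (slide 4) can be sketched as below. The flag names are real gcc flags, but the six-flag set and the on/off sampling are purely illustrative; the paper's ~642 million combinations come from far more flags and multi-valued parameters.

```python
import random

# Illustrative subset of gcc optimization flags.
FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer",
         "-finline-functions", "-fgcse", "-fpeel-loops"]

def random_flag_combination(rng):
    """Draw one random on/off setting for each flag."""
    return [f for f in FLAGS if rng.random() < 0.5]

rng = random.Random(42)
samples = [tuple(random_flag_combination(rng)) for _ in range(1000)]

# With 6 binary flags there are only 2**6 = 64 distinct combinations,
# so 1000 random draws necessarily repeat some of them.
print(len(set(samples)))
```

Each sampled combination would be compiled and measured, and the best result kept as the upper bound on optimizing-compiler performance.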

7 Architecture Exploration
Fixed compiler optimization level (-O1). Design space: 200 randomly-selected architecture configurations. Each graph is independently ordered from lowest to highest. The baseline is a very good choice (XScale is already highly tuned), but selecting a better architecture leads to an ED value of 0.93 relative to the baseline.

8 Architecture Exploration
Most benchmarks gain performance, but some lose performance. The configuration that is best overall is not necessarily the best for each program. Select the architecture configuration that gives the best performance over the whole MiBench suite.
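The selection rule on this slide can be sketched as a reduction over a benchmark-by-configuration matrix of relative ED values; the matrix below is synthetic, not measured data.

```python
import numpy as np

# Rows = benchmarks, columns = architecture configurations.
rng = np.random.default_rng(0)
ed = rng.uniform(0.5, 1.5, size=(35, 200))   # 35 benchmarks x 200 configs

# Pick the single configuration with the best average ED over the suite.
best_config = int(ed.mean(axis=0).argmin())

# Each program's own best configuration, for comparison.
per_program_best = ed.argmin(axis=1)

# The overall winner rarely matches every program's individual best.
print(best_config, float((per_program_best == best_config).mean()))
```

This is exactly why some benchmarks lose performance under the suite-wide winner: the argmin of the mean is not the mean of the argmins.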

9 Optimization Exploration
Fixed architecture (the baseline). Design space: 1000 randomly-selected optimizations. For some benchmarks there is significant improvement; for others, optimizations have little impact, and picking the wrong optimizations can significantly degrade performance. The best-case flags for each benchmark give the performance of the optimizing compiler. On average, this leads to an ED value of 0.72 relative to the baseline.

10 Co-design Exploration
Design space: 200 architecture configurations × 1000 optimizations. Each graph is independently ordered from lowest to highest. The best compiler optimizations for each benchmark are compared against the worst. There is large room for improvement, and picking the wrong optimizations can significantly degrade performance.

11 Co-design Exploration
Most benchmarks gain performance, but some lose performance. The results are more balanced than when performing only architecture exploration. Select the architecture/optimization configuration that performs best. On average, this leads to an ED value of 0.67 compared with the baseline.

12 Co-design Exploration
The best compiler optimizations vary across the architecture space: good optimizations for one architecture are not suitable for others, so it is important to explore both spaces simultaneously.
toast benchmark: the optimizations that are best on the baseline architecture are actually worse than compiling with -O1 on other configurations.
All benchmarks: "baseline good optimizations" are found by running 1000 optimizations on the baseline architecture and selecting those within 5% of the best found for each benchmark. Then, on every other architecture, the benchmarks are re-run compiled with these optimizations to determine the average ED value, which is compared with the best ED value achievable on that configuration. For some architectures, the baseline good optimizations are actually worse than compiling with -O1.
In total: 35 benchmarks × 200 configurations × 1000 optimizations = 7 million simulations.

13 Predictor
It is not desirable to conduct such a costly co-design space exploration. Solution: build a machine-learning model to predict the performance of an optimizing compiler on any architecture. A new model needs to be created for each benchmark.

14 Predictor
Step 1: run the program compiled with -O1 on 200 randomly-selected architectures. Gather performance counters to characterize its behavior (IPC, cache utilization, branch predictor utilization, ALU utilization, register utilization, cache miss rate, branch misprediction rate). Use PCA to summarize the 9 features into 2 principal components. Example shown on fft.
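Step 1's dimensionality reduction can be sketched with a minimal PCA via SVD on mean-centred data; the counter values below are synthetic stand-ins for the real performance counters.

```python
import numpy as np

rng = np.random.default_rng(1)
counters = rng.normal(size=(200, 9))   # 200 architectures x 9 counters

def pca_2d(x):
    """Project onto the two directions of highest variance."""
    centred = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

points = pca_2d(counters)
print(points.shape)   # (200, 2): one 2-D point per architecture
```

Each architecture is now a single point in a 2-D space, which makes the grid-based training-point selection of the next step straightforward.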

15 Predictor
Step 2: select a number of architectures for training (15 on average). For each, run the program using randomly-selected optimizations to estimate the best performance achievable. To select the architectures, divide the principal-component space into a 5 × 4 grid and pick one training point per tile. Darker points lead to better performance.
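The 5 × 4 grid selection in Step 2 can be sketched as below; the 2-D points are synthetic stand-ins for the PCA output, and picking the first point in each tile is an illustrative choice, not necessarily the paper's rule.

```python
import numpy as np

rng = np.random.default_rng(2)
points = rng.uniform(-1, 1, size=(200, 2))   # stand-in PCA coordinates

def grid_sample(points, nx=5, ny=4):
    """Keep one representative architecture per non-empty grid tile."""
    xs = np.linspace(points[:, 0].min(), points[:, 0].max(), nx + 1)
    ys = np.linspace(points[:, 1].min(), points[:, 1].max(), ny + 1)
    chosen = []
    for i in range(nx):
        for j in range(ny):
            in_tile = ((points[:, 0] >= xs[i]) & (points[:, 0] <= xs[i + 1]) &
                       (points[:, 1] >= ys[j]) & (points[:, 1] <= ys[j + 1]))
            idx = np.flatnonzero(in_tile)
            if idx.size:
                chosen.append(int(idx[0]))
    return chosen

training = grid_sample(points)
print(len(training))   # at most 20 tiles; the paper averages ~15 points
```

Spreading training points over tiles rather than sampling uniformly ensures the model sees architectures from every corner of the behaviour space.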

16 Predictor Step 3: Use SVM on the training data to create the model.
The model learns the difference between architectures based on the performance an optimizing compiler can achieve on them. Architectures that lie in the same color region are predicted to have similar optimizing compiler behavior. The model predicts that the optimizing compiler has little effect in light areas and large effect in dark areas.
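Step 3 can be sketched with scikit-learn's SVM regressor; the training points, the made-up speedup target, and the kernel settings below are all illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
# ~15 training architectures as 2-D PCA points, one model per benchmark.
train_points = rng.uniform(-1, 1, size=(15, 2))
# Made-up target: the best speedup found by random optimizations there.
train_speedup = 1.0 + 0.5 * np.abs(train_points).sum(axis=1)

model = SVR(kernel="rbf", C=10.0).fit(train_points, train_speedup)

# Predict the optimizing compiler's effect on the unseen architectures.
test_points = rng.uniform(-1, 1, size=(185, 2))
predictions = model.predict(test_points)
print(predictions.shape)   # (185,)
```

The fitted model plays the role of the coloured map on the slide: regions of the PCA plane where predicted speedup is high correspond to the dark areas.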

17 Predictor Predictions can now be made for the entire space (200 architectures). For any new architecture: Run -O1 and collect performance counters. Use PCA to reduce the number of features to 2. Make a prediction based on the color of its region.

18 Evaluation Used 15 architectures for training and 185 architectures for validation. The average error rate is 1.6%.
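The 1.6% figure is a mean relative error over the held-out architectures; a minimal sketch of that computation, on synthetic data with noise injected at roughly that level, looks like this.

```python
import numpy as np

rng = np.random.default_rng(4)
measured = rng.uniform(0.6, 1.0, size=185)                    # "true" best ED
predicted = measured * (1 + rng.normal(0, 0.016, size=185))   # ~1.6% noise

def mean_relative_error(pred, true):
    return float(np.mean(np.abs(pred - true) / true))

print(round(mean_relative_error(predicted, measured), 3))
```

Relative (rather than absolute) error is the natural choice here because ED values are themselves ratios against the baseline.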

19 Evaluation
Comparison with Vaswani's model, which uses an artificial neural network to predict the performance of a set of compiler flags on each architecture. That model was used to predict the performance of the optimizations, and the best one was picked as the predicted value, averaged over the whole of MiBench. The proposed predictor performs much better than Vaswani's model.

20 Evaluation
It is possible to determine the best optimizing compiler / architecture configuration. The predictor was run on 200 randomly-selected architectures. This leads to the minimum ED value relative to the baseline, and the prediction is accurate and validated to be the true minimum. In the chosen configuration, the instruction and data caches have high associativity to avoid conflicts.

21 Conclusion There is the potential for significant improvement over the baseline architecture and compiler by exploring the combined co-design space. It is possible to automatically and accurately predict the performance of an optimizing compiler on any architecture, without tuning the compiler first. It is possible to determine the best possible optimizing compiler / architecture configuration, leading to significant performance improvements.

