Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle

Presentation transcript:

Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space
Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle
Presented by: Caghan Demirci, Dept of Computer & Information Sciences, University of Delaware

Goal
Performance depends on both the architecture and the compiler optimizations. The current methodology designs the compiler and the architecture in isolation, which yields sub-optimal performance. The proposed methodology designs them simultaneously by exploring the co-design space, but exploring that space exhaustively is extremely time-consuming. The solution: predict the performance of the optimizing compiler on each architecture without actually tuning the compiler.

Typically, an architecture is selected under the assumption that the optimizing compiler can deliver a certain level of performance. A compiler is then built and tuned for that architecture, in the hope that it delivers the assumed performance. Why sub-optimal? The compiler team may not be able to deliver a compiler that meets the architects’ expectations; if the performance of the eventual optimizing compiler could be predicted on any architecture, an entirely different architecture might have been chosen. Why time-consuming? An optimizing compiler must be built and tuned for every candidate architecture.

Functionality
Training input: a small sample (less than 0.01%) of the combined architecture and optimization space.
Prediction input: an architecture configuration, together with information gained by running a non-optimizing baseline compiler on that architecture.
Output: a performance prediction for the yet-to-be-tuned optimizing compiler on that architecture, with an average error rate of 1.6%.

Previous uses of predictors to reduce simulation time fixed the program and explored either the optimization space (fixed architecture) or the architecture space (fixed optimizations). A predictor over both the optimization and architecture spaces for a fixed program fails to predict the performance of the optimizing compiler accurately.

Methodology
Benchmarks: the full MiBench suite (35 benchmarks); for each benchmark, the inputs leading to at least 100 million executed instructions are chosen.
Metrics: execution time (cycles), energy, and ED (energy-delay product). ED captures the trade-off between performance and energy consumption in a single value; lower is better.
Optimizing compiler: modeled by iterative compilation over 1000 randomly-selected flag combinations. Iterative compilation can outperform an optimizing compiler tuned for a specific configuration, so it can be considered an upper bound on the performance an optimizing compiler can achieve.
Co-design space: the combined space of all architectural configurations and compiler optimizations.
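The iterative-compilation procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the flag names are arbitrary gcc examples and the measurement is a toy stand-in for compiling and simulating a benchmark.

```python
import random

# Hypothetical sketch of iterative compilation: sample random flag
# combinations, "measure" each one, and keep the best. The flags and the
# cost model below are illustrative stand-ins, not from the paper.
FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer",
         "-finline-functions", "-fschedule-insns"]

def measure_ed(flag_set):
    # Stand-in for compile + simulate: returns a fake energy-delay value
    # that is reproducible per flag set within one run.
    random.seed(hash(flag_set) % (2 ** 32))
    return 1.0 - 0.05 * len(flag_set) + random.uniform(-0.1, 0.1)

def iterative_compile(n_samples=1000, seed=0):
    rng = random.Random(seed)
    best_flags, best_ed = None, float("inf")
    for _ in range(n_samples):
        # Each flag is independently on/off: a random point in the flag space.
        chosen = frozenset(f for f in FLAGS if rng.random() < 0.5)
        ed = measure_ed(chosen)
        if ed < best_ed:
            best_flags, best_ed = chosen, ed
    return best_flags, best_ed
```

The best ED found over the sampled combinations serves as the upper bound on optimizing-compiler performance described above.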

Methodology
Baseline architecture: Intel XScale. Simulator: XTREM. The cache and branch predictor configurations, critical components in an embedded processor, are varied, giving 288,000 possible architecture configurations.

Methodology
Baseline compiler: gcc 4.1.0 with -O1, out of 642 million possible optimization combinations. On average, -O2 and -O3 produce the same execution time as -O1 but consume more energy and have a higher ED, so -O1 is the best choice of baseline optimization.

Architecture Exploration
Fixed compiler optimization (-O1). Design space: 200 randomly-selected architecture configurations; each graph is independently ordered from lowest to highest. The baseline is already a very good choice, since XScale is highly tuned, but selecting a better architecture leads to an ED value of 0.93 relative to the baseline.

Architecture Exploration
Most benchmarks gain performance, but some lose performance: the configuration that is best overall is not necessarily the best for each individual program. The architecture configuration leading to the best performance over the whole MiBench suite is selected.

Optimization Exploration
Fixed architecture (the baseline). Design space: 1000 randomly-selected optimization settings. Some benchmarks improve significantly, while for others the optimizations have little impact, and picking the wrong optimizations can significantly degrade performance. The best-case flags for each benchmark define the performance of the optimizing compiler; on average, this leads to an ED value of 0.72 relative to the baseline.

Co-design Exploration
Design space: 200 architecture configurations combined with 1000 optimizations; each graph is independently ordered from lowest to highest. Comparing the best compiler optimizations for each benchmark against the worst shows that there is large room for improvement: picking the wrong optimizations can significantly degrade performance.

Co-design Exploration
Most benchmarks gain performance, but some lose performance; the results are more balanced than with architecture exploration alone. Selecting the architecture/optimization configuration that performs best leads, on average, to an ED value of 0.67 relative to the baseline.

Co-design Exploration
The best compiler optimizations vary across the architecture space: good optimizations for one architecture are not suitable for others, so it is important to explore both spaces simultaneously. For the toast benchmark, the optimizations that are best on the baseline architecture are actually worse than compiling with -O1 on other configurations.

Across all benchmarks, "baseline good optimizations" are determined by running the 1000 optimizations on the baseline architecture and selecting those within 5% of the best found for each benchmark. The benchmarks compiled with these optimizations are then run on each other architecture to determine the average ED value, which is compared with the best ED achievable on that configuration. For some architectures, the baseline good optimizations are again worse than simply compiling with -O1. This exhaustive exploration costs 35 benchmarks x 200 configurations x 1000 optimizations = 7 million simulations.
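The "baseline good optimizations" experiment above can be sketched numerically. The ED matrix here is synthetic; in the real study each entry would come from a simulation of one benchmark under one architecture/optimization pair.

```python
import numpy as np

# Sketch of the experiment described above (values are synthetic):
# ed[a, o] is the ED of optimization o on architecture a, with
# architecture 0 taken as the baseline.
rng = np.random.default_rng(1)
n_archs, n_opts = 200, 1000
ed = rng.uniform(0.5, 1.5, (n_archs, n_opts))

# Optimizations within 5% of the best found on the baseline architecture.
baseline = ed[0]
good = np.flatnonzero(baseline <= 1.05 * baseline.min())

# Evaluate those optimizations on every architecture and compare their
# average ED with the best ED achievable on that architecture.
avg_good = ed[:, good].mean(axis=1)
best = ed.min(axis=1)
gap = avg_good - best   # how far baseline-good flags fall short of the best
```

A large `gap` on some architecture means the flags tuned on the baseline transfer poorly there, which is exactly the effect the slide reports.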

Predictor
Conducting such a costly co-design space exploration is undesirable. Solution: build a machine-learning model that predicts the performance of the optimizing compiler on any architecture. A new model is created for each benchmark.

Predictor
Step 1: Run the program compiled with -O1 on 200 randomly-selected architectures and gather performance counters to characterize its behavior (IPC, cache utilization, branch predictor utilization, ALU utilization, register utilization, cache miss rate, branch misprediction rate). Use PCA to summarize the 9 features into 2 components. An example is shown for fft.
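Step 1's feature reduction can be sketched in plain NumPy (the counter values below are synthetic stand-ins for real performance-counter readings):

```python
import numpy as np

# Standardize the per-architecture performance counters from the -O1 runs,
# then keep the top two principal components (PCA via SVD).
def pca_2d(counters):
    # Center and scale each counter, then project onto the two directions
    # of largest variance (the top right singular vectors).
    Z = (counters - counters.mean(axis=0)) / counters.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T

rng = np.random.default_rng(0)
counters = rng.random((200, 9))   # 200 architectures x 9 counters (synthetic)
coords = pca_2d(counters)         # shape: (200, 2)
```

Each architecture is now a point in a 2-D space, which is what the grid-based training selection in the next step operates on.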

Predictor
Step 2: Select a number of architectures for training (15 on average). For each one, run the program under 1000 randomly-selected optimizations to estimate the best achievable performance. To select the architectures, divide the principal-component space into a 5 x 4 grid and pick one training point per tile. Darker points correspond to better performance.
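The grid-based selection can be sketched as below; the exact tiling rule is an assumption, since the slide only says one training point is picked per tile of a 5 x 4 grid (empty tiles yield fewer than 20 points, matching the ~15 used in practice):

```python
import numpy as np

# Split the 2-D principal-component space into an nx-by-ny grid and keep
# one architecture per occupied tile (here: the first one encountered).
def select_training_points(coords, nx=5, ny=4):
    x, y = coords[:, 0], coords[:, 1]
    xi = np.minimum((nx * (x - x.min()) / max(x.max() - x.min(), 1e-12)).astype(int), nx - 1)
    yi = np.minimum((ny * (y - y.min()) / max(y.max() - y.min(), 1e-12)).astype(int), ny - 1)
    chosen = {}
    for idx, tile in enumerate(zip(xi, yi)):
        chosen.setdefault(tile, idx)   # first architecture seen per tile
    return sorted(chosen.values())

rng = np.random.default_rng(0)
coords = rng.random((200, 2))              # synthetic PCA coordinates
training = select_training_points(coords)  # at most 5 * 4 = 20 indices
```

Spreading the training points over the grid ensures the expensive 1000-optimization runs cover dissimilar architectures rather than clustering in one region.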

Predictor
Step 3: Use an SVM on the training data to create the model. The model learns to distinguish architectures by the performance an optimizing compiler can achieve on them: architectures lying in the same color region are predicted to show similar optimizing-compiler behavior, with the compiler having little effect in light areas and a large effect in dark areas.
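The idea of Step 3 can be illustrated with kernel ridge regression in plain NumPy; the paper uses an SVM, so this is a simplified stand-in showing the same learn-then-predict structure, and all training values below are hypothetical:

```python
import numpy as np

# Learn a map from 2-D architecture coordinates to the best ED an
# optimizing compiler achieves there, using an RBF kernel.
def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fit(train_x, train_y, lam=1e-3):
    # Solve (K + lam*I) alpha = y for the dual weights alpha.
    K = rbf(train_x, train_x)
    return np.linalg.solve(K + lam * np.eye(len(train_x)), train_y)

def predict(train_x, alpha, query_x):
    return rbf(query_x, train_x) @ alpha

train_x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
train_y = np.array([1.0, 0.9, 0.8, 0.7])    # hypothetical best-ED values
alpha = fit(train_x, train_y)
```

Once fitted on the ~15 training architectures, such a model can be queried at any point of the 2-D space, which is how the colored regions on the slide are produced.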

Predictor
Predictions can now be made for the entire space of 200 architectures. For any new architecture: run -O1 and collect the performance counters, use PCA to reduce the features to 2 components, and make a prediction based on the color of the region the resulting point falls in.
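The deployment step above hinges on one detail worth making explicit: the new architecture's counters must be projected with the statistics and principal directions learned from the training runs, not with a fresh PCA. A sketch with synthetic counters:

```python
import numpy as np

# Counters from the 200 original -O1 runs (synthetic stand-ins).
rng = np.random.default_rng(3)
train = rng.random((200, 9))

# Save the training-time statistics and principal directions.
mu, sigma = train.mean(axis=0), train.std(axis=0)
Z = (train - mu) / sigma
_, _, Vt = np.linalg.svd(Z, full_matrices=False)

def project(new_counters):
    # Standardize with the *training* mean/std, then project to 2-D.
    return ((new_counters - mu) / sigma) @ Vt[:2].T

point = project(rng.random(9))   # 2-D coordinates for a new architecture
```

The resulting 2-D point is then fed to the trained model to read off the predicted optimizing-compiler performance.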

Evaluation
15 architectures were used for training and the remaining 185 for validation. The average error rate is 1.6%.

Evaluation
Vaswani’s model, based on an artificial neural network, predicts the performance of a given set of compiler flags on each architecture. For comparison, it was used to predict the performance of the 1000 optimizations, taking the best as the predicted value, averaged over the whole of MiBench. The proposed model is much more accurate than Vaswani’s.

Evaluation
The predictor also makes it possible to determine the best optimizing compiler / architecture configuration. Run on 200 randomly-selected architectures, it finds a minimum ED value of 0.677 relative to the baseline; this prediction is accurate and validated to indeed be the minimum. In the chosen configuration, the instruction and data caches have high associativity to avoid conflicts.

Conclusion There is the potential for significant improvement over the baseline architecture and compiler by exploring the combined co-design space. It is possible to automatically and accurately predict the performance of an optimizing compiler on any architecture, without tuning the compiler first. It is possible to determine the best possible optimizing compiler / architecture configuration, leading to significant performance improvements.