Learning A Better Compiler: Predicting Unroll Factors Using Supervised Classification and Integrating CPU and L2 Cache Voltage Scaling Using Machine Learning

Presentation transcript:

Learning A Better Compiler: Predicting Unroll Factors Using Supervised Classification and Integrating CPU and L2 Cache Voltage Scaling Using Machine Learning

Predicting Unroll Factors
Loop unrolling is sensitive to the unroll factor
Current solution: expert design
– Difficult: hand-tuned heuristics
– Must be rewritten frequently
Instead, predict the parameters with machine learning
– Easy: data collection takes ~1 week, with no human time
– The algorithm does not change with the compiler

Loop Unrolling
Combines multiple iterations into one loop body
Fewer iterations → less branching
Allows other transformations:
– Exposes adjacent memory locations
– Allows instruction reordering across iterations
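For concreteness (this example is not from the slides), here is a simple C loop unrolled by a factor of 4; the second loop handles trip counts that are not a multiple of the unroll factor:

```c
/* Original loop: one add and one branch per element. */
float sum_rolled(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by a factor of 4: one branch per 4 elements, and the
 * four adjacent loads a[i..i+3] are exposed to the scheduler. */
float sum_unrolled(const float *a, int n) {
    float s = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)   /* remainder loop for leftover iterations */
        s += a[i];
    return s;
}
```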

Unroll Factors
How many iterations should be combined?
Too few:
– Provides little benefit
Too many:
– Increased cache pressure
– Increased live ranges → register pressure (illustrated below)
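One way to see the register-pressure cost: an unrolled reduction typically uses one accumulator per unrolled iteration, and all of them are live across the loop body. A hypothetical factor-4 example:

```c
/* Unroll factor 4 with separate accumulators: s0..s3 are all live
 * across the loop body, so this version needs four registers where
 * the rolled loop needed one. A factor of 16 would need 16. */
float dot4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```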

Optimal Unroll Factors

Classification Problems
Input: a vector of features
– E.g., nest depth, # of branches, # of ops
Output: a class
– E.g., an unroll factor from 1–8
No prior knowledge required about:
– The meaning of features/classes
– The relevance of each feature
– Relationships between features
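In code, an instance of this problem is just a feature vector paired with a label. A minimal sketch of that encoding (the names and feature choices are illustrative, not from the paper):

```c
#define NUM_FEATURES 3

/* One training example: a loop described by numeric features,
 * labeled with the unroll factor that performed best for it. */
struct example {
    double features[NUM_FEATURES]; /* e.g., nest depth, # branches, # ops */
    int    unroll_factor;          /* class label, 1..8 */
};
```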

Nearest Neighbors
The paper describes a kernel density estimator
All dimensions are normalized to [0, 1]
Given a test point p:
– Consider training points "close" to p, i.e., within a fixed distance such as 0.3
– Take a majority vote among the qualifying training points
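A minimal sketch of this fixed-radius voting scheme, assuming features are already normalized to [0, 1] (the radius and the fallback class are illustrative choices, not the paper's):

```c
#include <math.h>

#define NUM_FEATURES 3
#define MAX_CLASS    8   /* unroll factors 1..8 */

struct example {
    double features[NUM_FEATURES];
    int    unroll_factor;   /* 1..8 */
};

static double distance(const double *a, const double *b) {
    double d2 = 0.0;
    for (int i = 0; i < NUM_FEATURES; i++) {
        double diff = a[i] - b[i];
        d2 += diff * diff;
    }
    return sqrt(d2);
}

/* Majority vote among training points within `radius` of the query.
 * Falls back to class 1 (no unrolling) if no neighbors qualify. */
int classify(const struct example *train, int n,
             const double *query, double radius) {
    int votes[MAX_CLASS + 1] = {0};
    for (int i = 0; i < n; i++)
        if (distance(train[i].features, query) <= radius)
            votes[train[i].unroll_factor]++;
    int best = 1;
    for (int c = 2; c <= MAX_CLASS; c++)
        if (votes[c] > votes[best])
            best = c;
    return best;
}
```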

Nearest Neighbors

Support Vector Machine
Assume two classes (easily generalized to more)
Transform the data
– To make the classes linearly separable
Find the line (hyperplane) that maximizes the separating margin
For a test point:
– Perform the same transformation
– Classify based on the learned line
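The prediction side of this is compact: map the point into the feature space, then take the sign of a linear function. A hedged sketch, treating the transform phi and the learned weights as given (the quadratic phi below is just one illustrative choice, not the paper's kernel):

```c
#define D_IN  2   /* raw feature dimensions */
#define D_OUT 5   /* dimensions after the (illustrative) transform */

/* Example non-linear transform phi: the raw features plus their
 * squares and product, which can make some data that is not linearly
 * separable in the original space separable in the new one. */
static void phi(const double x[D_IN], double out[D_OUT]) {
    out[0] = x[0];
    out[1] = x[1];
    out[2] = x[0] * x[0];
    out[3] = x[1] * x[1];
    out[4] = x[0] * x[1];
}

/* Classify with the learned hyperplane (w, b): sign of w . phi(x) + b.
 * Training (choosing w and b to maximize the margin) happens offline. */
int svm_predict(const double w[D_OUT], double b, const double x[D_IN]) {
    double f[D_OUT];
    phi(x, f);
    double s = b;
    for (int i = 0; i < D_OUT; i++)
        s += w[i] * f[i];
    return s >= 0.0 ? +1 : -1;
}
```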

Maximal Margin

Non-Linear SVM

Some Features
– # operands
– Live range size
– Critical path length
– # operations
– Known trip count
– # floating-point ops
– Loop nest level
– # branches
– # memory ops
– Instruction fan-in in DAG
– # instructions
– Language: C, Fortran
– # implicit instructions
– …and more (38 total)

Results: No Software Parallelism

Results: With Software Parallelism

Big Idea: Easy Maintenance
Performance improvements are modest
– Sometimes worse, sometimes much better
– Usually little change
Requires no re-tuning when the compiler changes
– Gathering data takes ~1 week, with no human time
General mechanism
– Can be applied to all parameters
– No model of the system is needed
– Can be applied to new transformations where expert knowledge is unavailable

Integrated CPU and L2 Cache Voltage Scaling using Machine Learning

Dynamic Voltage Control
Monitor the system
When activity is low, reduce power
– This also reduces computational capacity
– May cost more energy overall if the work takes longer
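The basic control loop is simple to state. A minimal sketch, where `read_utilization()` and `set_voltage_level()` stand in for whatever monitoring and power-management hooks the platform provides (both names are hypothetical):

```c
/* Hypothetical DVS control loop: sample activity each interval and
 * step the voltage/frequency level down when the core is idle,
 * up when it is busy. The thresholds are illustrative. */
extern double read_utilization(void);     /* 0.0..1.0, hypothetical hook */
extern void   set_voltage_level(int lvl); /* 0 = lowest, hypothetical hook */

void dvs_step(int *level, int max_level) {
    double util = read_utilization();
    if (util < 0.3 && *level > 0)
        (*level)--;            /* low activity: save power */
    else if (util > 0.8 && *level < max_level)
        (*level)++;            /* high activity: restore capacity */
    set_voltage_level(*level);
}
```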

Multiple Clock Domains
Adjust separate components independently
Better performance/power trade-offs
– E.g., a CPU-bound application may be able to decrease power to memory and cache without affecting performance
But: a more complex DVM policy

Motivation
Applications go through phases
Frequencies/voltages should change with them
Focus on the core and the L2 cache
– They consume a large fraction of total power
The best policy may also change over time
– On battery: conserve power
– Plugged in: maximize performance

Learning a DVM Policy
The compiler automatically instruments the code
– Inserts sampling code to record performance counters
– Instrumentation is used only to gather training data
Use machine learning to create the policy
Implement the policy in a microcontroller
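What the inserted sampling code might look like, in spirit: read a few counters at each sample point and log per-instruction rates for the learner. The counter-reading hook below is hypothetical; real code would use the platform's PMU interface and difference successive counter readings:

```c
#include <stdio.h>

/* Hypothetical PMU hook; a real system would read hardware
 * performance counters and subtract the previous sample's values. */
extern unsigned long read_counter(int which);

enum { CTR_INSTRUCTIONS, CTR_CYCLES, CTR_L2_ACCESS, CTR_MEM_ACCESS };

/* Compiler-inserted sample point: log the per-instruction rates used
 * as features. This runs only during training-data collection; the
 * deployed policy runs in the microcontroller without this code. */
void sample_point(FILE *log) {
    unsigned long insns = read_counter(CTR_INSTRUCTIONS);
    if (insns == 0)
        return;
    fprintf(log, "%f %f %f\n",
            (double)read_counter(CTR_CYCLES) / insns,      /* CPI  */
            (double)read_counter(CTR_L2_ACCESS) / insns,   /* L2PI */
            (double)read_counter(CTR_MEM_ACCESS) / insns); /* MPI  */
}
```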

ML Parameters
Features:
– Clock cycles per instruction (CPI)
– L2 accesses per instruction (L2PI)
– Memory accesses per instruction (MPI)
Select voltages to minimize either:
– Total energy
– Energy × delay

Machine Learning Algorithm
Automatically learns a set of if-then rules
– E.g.: if (L2PI >= 1) and (CPI <= 0) then f_cache = 1 GHz
Rules are compact and expressive
Can be implemented in hardware
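A rule set like this is cheap to evaluate in software or hardware. A minimal sketch of a rule table and its evaluation (the thresholds and frequencies are made up for illustration, not the learned values from the paper):

```c
#define NUM_RULES 3

/* One learned if-then rule: fires when both features clear their
 * thresholds, selecting a cache frequency. */
struct rule {
    double l2pi_min;  /* fires if L2PI >= this... */
    double cpi_max;   /* ...and CPI <= this       */
    int    freq_mhz;  /* cache frequency to select */
};

/* Illustrative rule table; a real one would be learned offline.
 * Rules are checked in order and the first match wins; the last
 * entry is a catch-all default. */
static const struct rule rules[NUM_RULES] = {
    { 1.0, 1.0, 1000 },  /* many L2 accesses, modest CPI -> 1 GHz */
    { 0.5, 2.0,  750 },
    { 0.0, 1e9,  500 },  /* catch-all default */
};

int select_cache_freq(double l2pi, double cpi) {
    for (int i = 0; i < NUM_RULES; i++)
        if (l2pi >= rules[i].l2pi_min && cpi <= rules[i].cpi_max)
            return rules[i].freq_mhz;
    return 500;  /* unreachable given the catch-all; kept for safety */
}
```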

Results
Compared to managing the core and L2 independently:
– Saves 22% on average, 46% at best
Learns effective rules from only a few features
Compiler modifications instrument the code
The policy is learned offline
The learned policy is implemented in a microcontroller

Conclusion
Machine learning derives models from data automatically
It allows easy maintenance of heuristics
It can create models that are more effective than hand-tuned ones