Phase-Aware Optimization in Approximate Computing


1 Phase-Aware Optimization in Approximate Computing
CGO 2017. Subrata Mitra, Saurabh Bagchi (Purdue), Manish K. Gupta (Microsoft), Sasa Misailovic (UIUC). Good morning everyone. I am Subrata Mitra; I just graduated from Purdue. I am really sorry for not being able to present our paper on "Phase-Aware Optimization in Approximate Computing" in person at CGO. For any questions, please feel free to email me at subrata@purdue.edu. This is joint work between me and my advisor Saurabh Bagchi from Purdue, Manish Gupta from Microsoft, and Sasa Misailovic from UIUC.

2 Huge energy demands of computation
The energy demand of computation in modern systems is well known and often makes it into the headlines.

3 HPC world has a similar story
HPC systems, or supercomputers, face a similar problem. A report from 2015 highlighted that an exascale system would draw almost the entire output of a nuclear power plant. Such a huge energy demand has prompted many research efforts to find smarter ways to perform equally effective computation.

4 Can tolerate some imprecision
We can do much better. Can tolerate some imprecision: Image Processing, Media Applications, Computer Vision, Data Analytics, Machine Learning, Scientific Simulations. It turns out that we can really do much better! Fortunately, we know of a class of applications that can tolerate a moderate amount of imprecise computation. These applications span various domains such as image processing, media applications, data analytics, machine learning, and scientific simulation. Even though these applications vary widely, in the majority of cases their end output is consumed by humans in analog form and thus can inherently tolerate slight inexactness in the computation. Did you even notice that the colors of the "blue arrows" are slightly different from each other?

5 Output quality degradation in Sobel
To give a concrete example, for the Sobel filter used by many image-processing applications, Rahimi et al. (DATE 2015) showed that a 10% error in the output image remains invisible to the end user, giving us ample opportunity to approximate and save energy or gain speedup. For example, this 10% error from approximation can result in a huge energy saving of about 57%. (Slide images: 0% quality loss, 5% quality loss, 10% quality loss.) A 10% quality loss is nearly indiscernible to the eye and yet provides 57% energy savings (Rahimi et al., DATE 2015).

6 Approximate computing: Trade accuracy for energy saving or computation speedup
Adjust knobs to control the approximation level of the computation. This approach of trading accuracy to gain speed or reduce energy is called approximate computing. (Slide figure: accuracy vs. speedup and energy reduction for LULESH, a hydrodynamic simulation; chart labels: 90%, 10%, 10x.)

7 Approximate computing: Various prior approaches
Software: Sage-MICRO-2013, Capri-ASPLOS-2016, Dynamic-Knobs-ASPLOS-2011. Hardware: Esmaeilzadeh-ASPLOS-2012, Chippa-DAC-2010, Raha-CASES-2014. Compilers/PL: Ansel-CGO-2011, PetaBricks-PLDI-2009, Misailovic-OOPSLA-2014, EnerJ-PLDI-2011. Input sensitivity: Ansel-PLDI-2015, Ding-PLDI-2015, Laurenzano-PLDI-2016. Prior work used various techniques for approximate computing. Some techniques were embedded in the software layer, some required changes in the hardware or architecture, some proposed compiler transformations and auto-tuner frameworks, and some chose among different algorithms based on the properties of the input data.

8 Assumption of a monolithic execution
The general approach has been to use a single approximation configuration throughout the entire execution. (Slide figure: application kernel execution with a single setting, and the resulting output quality and speedup.) Previous approaches treated the application execution as a single monolithic entity: they try a single approximation setting for the entire duration of the execution and, depending on the quality of the final result, choose a suitable approximation setting for that application. We propose a new technique called OPPROX which opens up the execution and chooses different settings for different "phases" of the execution. We found that since OPPROX operates at a finer granularity, it can often find a better setting that was not feasible with the prior coarse-grained approach.

9 Phases in super-loop computations
Our concept of a phase is based on a common super-loop computation pattern that frequently appears in many applications. These super-loops, inside which the main computation happens, can represent a time-step loop, an iterative convergence loop, or a stream-processing loop, depending on the application. Here is an example of the abstracted simulation loop in LULESH, a hydrodynamic simulation application. The computation for the simulation continues inside a loop, between lines 2-11, until it reaches a stable state. OPPROX divides these super-loop iterations into multiple equal-sized phases. (Slide figure: abstract computation pattern in LULESH.) A minimal sketch of such a phased super-loop is shown below.
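The sketch below only illustrates the phased super-loop idea; NUM_PHASES, phase_knob, and do_timestep() are hypothetical placeholders and not code from LULESH or from the OPPROX paper.

    /* A super-loop whose iterations are split into equal-sized phases,
       each running with its own approximation level. */
    #include <stdio.h>

    #define NUM_PHASES 4

    /* Hypothetical per-phase approximation levels chosen offline. */
    static const int phase_knob[NUM_PHASES] = {1, 1, 2, 4};

    static void do_timestep(int approx_level) {
        /* Main work of one super-loop iteration; approx_level would be
           consumed by the approximable blocks inside it. */
        (void)approx_level;
    }

    int main(void) {
        const int total_iters = 1000;               /* super-loop iterations */
        const int phase_len = total_iters / NUM_PHASES;

        for (int i = 0; i < total_iters; i++) {
            int phase = i / phase_len;              /* which phase we are in */
            if (phase >= NUM_PHASES) phase = NUM_PHASES - 1;
            do_timestep(phase_knob[phase]);         /* phase-specific setting */
        }
        printf("ran %d iterations in %d phases\n", total_iters, NUM_PHASES);
        return 0;
    }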

10 Workflow of OPPROX OPPROX is a technique to perform phase-aware optimization in approximate computing. It takes a user-provided error budget and an application with tunable knobs, and determines the best phase-specific settings to maximize performance. At a very high level, let me now explain the workflow of our technique, which as I mentioned is called OPPROX. OPPROX takes as input an application with tunable approximation settings and a set of representative inputs. It then goes through a series of offline training steps which divide the application execution into phases and build control-flow-path and phase-specific models. Then, for a user-provided error budget, it finds the best approximation settings for the individual phases that maximize the speedup.

11 Application with tunable approximation levels
Loop Perforation (ICSE 2010):

    for (i = 0; i < n; i = i + approx_level) {
        result = computeresult();
    }

Loop Truncation (ICSE 2010):

    for (i = 0; i < (n - approx_level); i++) {
        result = computeresult();
    }

Loop Memoization (FSE 2011):

    for (i = 0; i < n; i++) {
        if (0 == i % approx_level)
            cached_result = result = computeresult();
        else
            result = cached_result;
    }

OPPROX can work with ANY approximation method that has a tunable knob to control its level. We used a few compiler-based approximation methods from prior work, namely loop perforation, loop truncation, and loop memoization. Loop perforation skips some iterations, loop truncation terminates the loop early, and loop memoization reuses cached results. We call these code regions approximable blocks. The variable approx_level in these examples controls the level of approximation. Basically, we profile the application to find which of its loops can be transformed into one of these approximable blocks. In addition, some applications provide explicit algorithmic knobs which can also be tuned to achieve some level of approximation.

12 Phase-specific characteristics (1)
But why are phases so important? Let us introduce QoS (quality of service) degradation, an application-specific metric that measures how much the final output differs from the exact computation. Here we show the QoS degradation for two applications, LULESH and Bodytrack, when we approximate in different phases versus keeping a fixed setting for the entire duration of the execution. 1, 2, 3, and 4 on the X-axis show which phase was approximated, and "All" indicates approximation of the entire execution. The Y-axis is QoS degradation; lower is better. It can be seen that if approximation is performed in the later phases of the execution, it results in lower QoS degradation. (Slide figures: Bodytrack and LULESH.)

13 Phase-specific characteristics (2)
To illustrate further, here we show that as we increase the granularity of the phases, i.e., divide the execution into a larger number of phases, we achieve finer control over QoS degradation. But higher phase granularity comes with higher training overhead. So how do we choose an appropriate phase granularity? (Slide figures: LULESH and Bodytrack.)

14 Choose a proper phase-granularity
We use a user-provided, threshold-based algorithm to find a suitable phase granularity for an application. The algorithm works as follows. We first divide the execution into 2 phases, approximate one phase at a time, and measure the resulting mean QoS degradation for each phase. Then we calculate the maximum difference in QoS degradation between any two consecutive phases. Next we increase the phase granularity by dividing the execution into 4 phases and again calculate the maximum difference in QoS degradation between any two consecutive phases. If the maximum-difference values for the 2-phase and 4-phase executions differ by less than a user-configurable threshold, we stop. Ultimately, it is up to the user to decide how finely tuned a control she wants. A sketch of this search is shown below.
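A minimal sketch of this threshold-based search, assuming the phase count is doubled at each step; measure_mean_qos_degradation is a hypothetical stub standing in for the profiling runs described above, and its numbers are made up.

    #include <math.h>
    #include <stdio.h>

    /* Stub for the profiling runs: in the real workflow this would run the
       application with only phase p (out of n) approximated and report the
       mean QoS degradation over the training inputs. */
    static double measure_mean_qos_degradation(int p, int n) {
        return 10.0 * (double)(n - p) / (double)n;  /* toy: later phases degrade less */
    }

    /* Largest QoS-degradation gap between consecutive phases at a given granularity. */
    static double max_consecutive_gap(int num_phases) {
        double max_gap = 0.0;
        double prev = measure_mean_qos_degradation(0, num_phases);
        for (int p = 1; p < num_phases; p++) {
            double cur = measure_mean_qos_degradation(p, num_phases);
            double gap = fabs(cur - prev);
            if (gap > max_gap) max_gap = gap;
            prev = cur;
        }
        return max_gap;
    }

    /* Double the phase count until extra granularity changes the maximum
       consecutive gap by less than the user-provided threshold. */
    static int choose_phase_granularity(double threshold, int max_phases) {
        int n = 2;
        double gap_n = max_consecutive_gap(n);
        while (2 * n <= max_phases) {
            double gap_2n = max_consecutive_gap(2 * n);
            if (fabs(gap_2n - gap_n) < threshold)
                return n;                           /* finer phases add little control */
            n *= 2;
            gap_n = gap_2n;
        }
        return n;
    }

    int main(void) {
        printf("chosen granularity: %d phases\n", choose_phase_granularity(1.0, 16));
        return 0;
    }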

15 Application speedup Measure speedup in terms of the number of instructions executed: S = (# instructions executed in accurate run) / (# instructions executed in approximate run). That was the error, or QoS degradation. Let us now see what happens to the speedup. We define speedup in terms of the number of instructions executed.

16 Speedup characteristics
For many applications, the speedup characteristics of different phases look identical, which means the speedup benefit of approximating any of these phases is similar. This is the case for the Bodytrack example shown here, where we divided the execution into 4 phases. 1, 2, 3, and 4 on the X-axis show which phase was approximated, and "All" indicates approximation of the entire execution. The Y-axis is speedup; higher is better. However, there are a few applications where phase-specific approximation does change the speedup characteristics across phases, for example LULESH here. (Slide figures: Bodytrack and LULESH.)

17 Modeling to capture phase behavior
Collect training data for different phase-specific approximation settings. Build phase-specific speedup and QoS-degradation models using polynomial regression. For the polynomial regression, the approximation knobs corresponding to the different approximable blocks are the inputs and the final speedup or QoS degradation is the output. Example: two approximable blocks with two knobs a1 and a2; the model for speedup with a degree-2 polynomial is S = c0 + c1*a1 + c2*a2 + c3*a1^2 + c4*a2^2 + c5*a1*a2. To capture both the speedup and the QoS-degradation characteristics, we build phase-specific models. First we instrument the application and run it with different phase-specific approximation settings to collect training data. Then we use polynomial regression to model the data. The coefficients c0, c1, c2, etc. are determined by the regression algorithm from the training data. A small sketch of evaluating such a model is shown below.
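A minimal sketch of evaluating the degree-2 model above once its coefficients have been fitted; the coefficient values here are made up for illustration, whereas in OPPROX they would come from the phase-specific regression over the training runs.

    #include <stdio.h>

    /* S = c0 + c1*a1 + c2*a2 + c3*a1^2 + c4*a2^2 + c5*a1*a2 */
    static double predict(const double c[6], double a1, double a2) {
        return c[0] + c[1] * a1 + c[2] * a2
             + c[3] * a1 * a1 + c[4] * a2 * a2 + c[5] * a1 * a2;
    }

    int main(void) {
        /* Hypothetical fitted coefficients for one phase's speedup model. */
        const double speedup_coeffs[6] = {1.0, 0.15, 0.08, -0.01, -0.005, 0.002};

        for (int a1 = 0; a1 <= 4; a1++)
            for (int a2 = 0; a2 <= 4; a2++)
                printf("a1=%d a2=%d -> predicted speedup %.3f\n",
                       a1, a2, predict(speedup_coeffs, a1, a2));
        return 0;
    }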

18 Control-flow path specific models
Application speedup / QoS-degradation characteristics might change with a change in the control-flow path. Example: changing the filter ordering in FFmpeg. We use decision trees to predict input-parameter-dependent control-flow paths. One thing to note here is that the speedup or QoS-degradation characteristics might change depending on the control-flow path the program takes. For example, in FFmpeg, an image/video processing application, the relative ordering of two filters drastically changes the resulting QoS-degradation characteristics. Therefore we build all models per unique control-flow path, and we use decision trees to predict which control-flow path the application is going to take. First we estimate the path for a given input parameter, and then for that path we find the optimal setting of the approximation knobs. We build separate speedup and QoS models per unique control-flow path. A small sketch of this per-path lookup is shown below.
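A minimal sketch of the per-path model lookup; the hand-written tests in predict_path and the input parameters (frame width, filter-order flag) are hypothetical stand-ins for the decision tree that OPPROX learns from training data, and the coefficient values are toy numbers.

    #include <stdio.h>

    enum path_id { PATH_A = 0, PATH_B = 1, NUM_PATHS = 2 };

    /* Stand-in decision tree: predict the control-flow path from input parameters. */
    static enum path_id predict_path(int frame_width, int deblock_first) {
        if (deblock_first)
            return PATH_A;                          /* e.g., one FFmpeg filter order */
        return (frame_width > 1280) ? PATH_B : PATH_A;
    }

    /* One degree-2 model (6 coefficients) per control-flow path -- toy values. */
    static const double qos_coeffs[NUM_PATHS][6] = {
        {0.0, 0.02, 0.01, 0.001, 0.0005, 0.0002},
        {0.0, 0.05, 0.03, 0.002, 0.0010, 0.0004},
    };

    int main(void) {
        enum path_id p = predict_path(1920, 0);     /* estimate path for this input */
        printf("predicted path %d, first QoS coefficient %.3f\n",
               (int)p, qos_coeffs[p][1]);           /* then use that path's model */
        return 0;
    }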

19 Finding phase-specific optimization
For a user-provided QoS-degradation budget, find the best phase-specific optimization settings. For each phase, calculate the ROI, or "return on investment," as the mean speedup over the mean error. Divide the error budget among the phases in proportion to the ROI values. Solve a polynomial optimization problem for each phase with the sub-budget as the constraint and find the best approximation settings for that phase. Redistribute any unused budget to the remaining phases. Now, for a user-provided QoS-degradation budget, we find the best phase-specific approximation settings. We follow a greedy optimization approach. For each phase we calculate a metric called ROI, or return on investment, as the mean speedup over the mean error for that phase. Then we divide the user-provided budget into phase-specific sub-budgets in proportion to these ROI values. Next we solve an optimization problem for each phase using these sub-error-budgets as the constraint and find the best phase-specific settings. Any unused budget is redistributed among the other phases. Please refer to our paper for further technical details. A simplified sketch of this budget split is shown below.
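A simplified sketch of the ROI-based budget split, assuming the per-phase solver is a stub and that unused budget simply rolls forward to later phases (the paper's redistribution is more general); all numbers are illustrative.

    #include <stdio.h>

    #define NUM_PHASES 4

    /* Toy per-phase statistics, as would come from the phase-specific models. */
    static const double mean_speedup[NUM_PHASES] = {1.2, 1.3, 1.6, 1.8};
    static const double mean_error[NUM_PHASES]   = {8.0, 5.0, 3.0, 1.5};

    /* Stub: returns how much of the sub-budget the chosen knob setting actually
       uses. In OPPROX this is a polynomial optimization over the knob settings
       constrained by the sub-budget. */
    static double solve_phase(int phase, double sub_budget) {
        (void)phase;
        return 0.8 * sub_budget;                    /* pretend 80% is consumed */
    }

    int main(void) {
        const double total_budget = 10.0;           /* user-provided QoS budget (%) */
        double roi[NUM_PHASES], roi_sum = 0.0;

        for (int p = 0; p < NUM_PHASES; p++) {
            roi[p] = mean_speedup[p] / mean_error[p];   /* return on investment */
            roi_sum += roi[p];
        }

        double unused = 0.0;
        for (int p = 0; p < NUM_PHASES; p++) {
            /* Proportional share of the budget, plus whatever earlier phases left over. */
            double sub_budget = total_budget * roi[p] / roi_sum + unused;
            double used = solve_phase(p, sub_budget);
            unused = sub_budget - used;             /* roll leftover into later phases */
            printf("phase %d: sub-budget %.2f%%, used %.2f%%\n", p, sub_budget, used);
        }
        return 0;
    }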

20 Evaluations Now let's talk about the evaluations.

21 Phase characteristics
(Slide figures: LULESH, CoMD, Bodytrack, FFmpeg.) First, let me show the phase-specific characteristics of a few more applications, with QoS degradation at the top and speedup at the bottom. We see a similar trend for QoS degradation: approximating in later phases limits the degradation, i.e., improves the quality of the result.

22 Behavior with different inputs
(Slide figures: QoS degradation and speedup for Bodytrack and LULESH.) Now let us inspect the phase-specific behavior for different input parameters for Bodytrack and LULESH. It can be seen that the characteristics differ slightly as we vary the inputs, but qualitatively they show the same property: approximating in later phases tends to reduce QoS degradation.

23 Speedup obtained by Opprox
Now we show the final speedup benefit from our phase-specific optimization. We evaluate for three levels of QoS-degradation budget: high, medium, and small (30%, 20%, and 10% QoS degradation). The baseline is a phase-agnostic exhaustive search in which we go over all the points to find the best setting when the same approximation setting is used for the entire execution. It can be seen that phase-aware optimization becomes most effective when operating under medium or low error budgets. This is because, since we can control error at a finer granularity, we can explore settings that a phase-agnostic approximation would never find feasible. Phase-specific approximation is more attractive when operating under a small error budget.

24 Summary We show that when using approximation to boost application performance, instead of only deciding "where" and "how much," we can also control "when" to approximate in order to fine-tune the expected outcome. In many applications the main computation sits inside a giant outer loop which can be divided into "phases" to achieve fine-grained control over when to approximate. We present OPPROX, a technique to characterize, model, and optimize the gains from such phase-specific approximation. OPPROX is particularly useful compared to traditional methods when operating under a low error budget. So, in summary: in addition to "where" and "how much," we introduce another control that says "when" to approximate. The main computation inside a giant loop, present in many applications, can be divided into phases to achieve this finer temporal control. We introduce OPPROX, a technique to characterize, model, and optimize the gains from such phase-specific approximation. We show that OPPROX is particularly useful compared to traditional methods when operating under a low error budget.

25 For any questions: subrata@purdue.edu
Thank you! For any questions: subrata@purdue.edu. Thanks a lot for your time, and I once again apologize for this recorded presentation.

