
1 Real-Number Optimisation: A Speculative, Profile-Guided Approach
PhD Transfer Presentation, 6th June 2007 | Ashley Brown
The Queen’s Tower, Imperial College London, South Kensington, SW7

2 Introduction
Most useful applications use real-number algorithms:
– Chemical modelling
– Weather forecasting
– MP3 players
– Mobile phones
– etc.
Scientific applications = double-precision floating point (64-bit or 80-bit floating-point ALU)
Embedded applications = 32-bit fixed point (32-bit integer ALU)

3 Introduction: Our Focus
Two distinct sets of requirements:
Embedded systems
– High precision often not important (video/audio processing)
– Fixed-point implementations possible
Scientific computation
– High precision extremely important
– Reduction in precision or conversion to single precision must be done with great care
– IEEE-754 floating point

4 What if…?
What if we could make these applications run faster?
What if we could shrink the hardware resources needed to run them?
What if we could do it more aggressively than current methods?
What if we took up gambling?

5 Introduction: Avoiding the Safe Option
Standard formats (e.g. IEEE-754) are good for generality, but we can do better:
– Optimise the data format for the job in hand
– Use reconfigurable technology to change the format on the fly
Static analysis must be conservative; formal proofs are also conservative. We could be more daring!

6 Motivation
Graphics cards have made highly parallel vector processors commodity items.
In the high-performance computing world, FPGAs are available as acceleration cards:
– New HyperTransport cards provide faster communication channels
Both provide speed increases if used effectively:
– But graphics cards only have single-precision f.p.
– Simply implementing double-precision f.p. on FPGAs provides no benefit
Optimise aggressively to provide reduced f.p. capabilities, within the needs of the application.

7 FPGA Focus
FPGAs provide a good prototyping platform; exploiting reconfiguration may provide further benefits.
Limitations:
– Clock rates are much slower than attached processors
– Communication channels are typically slow
Acceleration comes from parallelism.
The rest of this talk considers FPGA/custom-hardware implementations.

8 The Problem
Double-precision floating point on FPGAs uses a lot of area.
Density is improving, but we still want to squeeze more in:
– Re-using hardware can reduce concurrency
Scientific applications typically use 64-bit floating point, and often the full precision is (believed to be) required.
– Is this really the case? We have more options than single or double.

9 Current Solutions for F.P. Minimisation
Finding ‘minimal precision’:
– Tools such as BitSize
– Select precision for some operands; the tool calculates the rest
– Test vectors used to gauge errors
Reducing hardware area:
– Replacing floating point with fixed point, transparently to the user (Cheung et al.)
– This is dangerous for scientific computations in general
– It works in that case because trigonometric functions have defined ranges

10 Current Solutions for F.P. Minimisation (2)
Approach by Dongarra et al.:
– Inspired by the Cell’s single-precision f.p. performance
– Use single precision most of the time, with an iterative refinement algorithm
– When approaching convergence, switch to double precision to finish off
– Works on Cell and with SSE instructions
Strzodka’s mixed-precision approach splits the loop in two:
– A low-precision computation loop
– A high-precision correction loop
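The mixed-precision iterative refinement idea above can be sketched as follows: solve in single precision, but compute the residual and accumulate the solution in double precision. The matrix, tolerance and iteration cap here are illustrative choices, not values from the talk.

```python
import numpy as np

def refine(A, b, iters=10, tol=1e-12):
    A32 = A.astype(np.float32)                 # low-precision copy used for all solves
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                          # residual in double precision
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break                              # "ballpark" convergence reached
        d = np.linalg.solve(A32, r.astype(np.float32))   # cheap low-precision correction
        x += d.astype(np.float64)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = refine(A, b)
```

The expensive factor/solve work stays in the fast low-precision path; only the residual, which controls the final accuracy, uses double precision.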

11 Profile-Guided Speculative Optimisation
Three stages:
– Profile to find key kernels
– Optimise the data format
– Generate hardware with a fallback mechanism
Optimised hardware should produce correct results for most calculations; we need to know when it gets them wrong.

12 Backup Option if We Get It Wrong!
Aggressive optimisation means we could get it wrong, so we must ensure we still get correct answers.
We must guess correctly often enough to make falling back insignificant.
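The speculate-and-fall-back pattern can be sketched like this: run each operation on the optimised path, and re-run on the fully general double-precision path whenever an operand falls outside the profiled range. The range bounds and the counter are hypothetical, purely for illustration.

```python
PROFILED_LO, PROFILED_HI = 2.0**-16, 2.0**16   # assumed profiled magnitude range

fallbacks = 0

def speculative_add(a, b):
    """Add on the 'reduced' path when operands are in range, else fall back."""
    global fallbacks
    for v in (a, b):
        if v != 0.0 and not (PROFILED_LO <= abs(v) <= PROFILED_HI):
            fallbacks += 1
            return a + b        # slow, fully general double-precision path
    return a + b                # fast path: same maths here, but reduced hardware on an FPGA

total = speculative_add(1.5, 2.25)       # in range: fast path
total = speculative_add(total, 1e30)     # out of range: triggers the fallback
```

The fallback counter is the key metric: if it stays near zero over real workloads, the speculation pays for itself.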

13 Optimisation Opportunities
Reduce the floating-point unit:
– Reduced precision
– Restricted normalisation
Use an alternative representation:
– Non-standard floating point (e.g. 48-bit)
– Fixed point
– Dual fixed point
Minimise redundancy:
– Remove denormal handling unless required
– Remove or predict zero-value calculations

14 Background: Floating-Point Primer

15 Simplified F.P. Adder

16 Reduce Hardware: Example Using MORPHY
F.P. values are interesting:
– Most are confined to a narrow range
– Different data sets do not vary the range
The full range of double-precision floating point is not required.
Reduce the exponent:
– Limits the size of the shifting logic
– Smaller data format = lower communication cost

17 Reduce Hardware: Alignment/Normalisation
The most expensive step is shifting for add/subtract:
– Operand alignment
– Normalisation
Set limits on alignment to reduce hardware size:
– Trap to software to perform other alignments
Provisional results: only shift-by-4 is required for some applications.
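The shift distance an adder needs for a+b is just the difference of the operands' binary exponents, so whether a small bounded shifter (such as shift-by-4) plus a software trap would suffice can be estimated from profiled operand pairs. The sample data below is illustrative, not from the talk.

```python
import math
from collections import Counter

def align_shift(a, b):
    """Exponent difference an f.p. adder must shift through to align a and b."""
    ea = math.frexp(a)[1]       # frexp returns (fraction, exponent)
    eb = math.frexp(b)[1]
    return abs(ea - eb)

# Hypothetical profiled operand pairs from add/subtract sites
pairs = [(1.5, 1.25), (3.0, 0.75), (2.0, 1.0), (1024.0, 1.0)]
shifts = Counter(align_shift(a, b) for a, b in pairs)
within_4 = sum(n for s, n in shifts.items() if s <= 4)   # handled by a shift-by-4 unit
```

If nearly all observed pairs fall in the `within_4` bucket, the remaining cases can trap to software as the slide describes.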

18 Alternative Representations #1: Custom Floating Point
No need to use 64 or 32 bits; use a compromise instead. Maybe 48 bits is enough?
– IEEE double: 1 sign | exp(11) | mantissa(52)
– IEEE single: 1 sign | exp(8) | mantissa(23)
– Custom: 1 sign | exp(9) | mantissa(38)
Can we drop the sign bit?
Reduces hardware; reduces communications.
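A minimal sketch of the 48-bit format above (1 sign, 9 exponent, 38 mantissa bits), built by truncating an IEEE double. The IEEE-style bias and round-by-truncation are assumptions for illustration; a real hardware format would also define zero, denormal and overflow behaviour, which this sketch omits.

```python
import struct

EXP_BITS, MAN_BITS = 9, 38
BIAS = (1 << (EXP_BITS - 1)) - 1          # 255, by analogy with IEEE biasing

def to_custom48(x):
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exp = ((bits >> 52) & 0x7FF) - 1023                  # unbiased IEEE exponent
    man = (bits & ((1 << 52) - 1)) >> (52 - MAN_BITS)    # truncate low mantissa bits
    return (sign << (EXP_BITS + MAN_BITS)) | ((exp + BIAS) << MAN_BITS) | man

def from_custom48(c):
    sign = c >> (EXP_BITS + MAN_BITS)
    exp = ((c >> MAN_BITS) & ((1 << EXP_BITS) - 1)) - BIAS
    man = c & ((1 << MAN_BITS) - 1)
    value = (1 + man / (1 << MAN_BITS)) * 2.0**exp       # implicit leading 1
    return -value if sign else value

x = 3.141592653589793
y = from_custom48(to_custom48(x))         # round trip loses only the low mantissa bits
```

The round-trip error is bounded by the 14 dropped mantissa bits, i.e. a relative error of about 2^-38.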

19 Alternative Representations #2: Fixed Point
For very narrow ranges, fixed point may be an option, but it must be treated with extreme care.
Dual fixed-point format provides another possibility:
– Two different formats with different fixed-point positions
– 1 bit reserved to switch between formats
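Dual fixed point can be sketched as follows: one reserved bit selects between two scalings, giving fine resolution near zero and a wider range otherwise. The word width and the two fraction lengths here are illustrative assumptions, not a format from the talk.

```python
FRAC0, FRAC1 = 28, 12          # fraction bits of format 0 (fine) and format 1 (wide)
WIDTH = 31                     # value bits; one extra bit selects the format

def dfx_encode(x):
    q = round(x * (1 << FRAC0))
    if abs(q) < (1 << (WIDTH - 1)):            # fits in the fine format
        return (0, q)
    return (1, round(x * (1 << FRAC1)))        # otherwise use the wide format

def dfx_decode(fmt, q):
    return q / (1 << (FRAC0 if fmt == 0 else FRAC1))

small = dfx_decode(*dfx_encode(0.0001))        # fine format: ~2^-28 resolution
large = dfx_decode(*dfx_encode(100000.0))      # wide format: ~2^-12 resolution
```

The hardware cost stays close to plain fixed point: one multiplexed shift amount instead of a full floating-point align/normalise path.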

20 FloatWatch
A Valgrind-based value profiler. It can return a number of metrics:
– Floating-point value ranges
– Variation between 32-bit and 64-bit f.p. executions
– Difference in magnitude between f.p. operations
Each metric has uses for optimisation!
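The value-range metric can be illustrated by bucketing each observed magnitude by its power-of-two range (binade). FloatWatch itself instruments x86 machine code under Valgrind; this pure-Python stand-in just wraps the values of interest and is only a sketch of the kind of histogram collected.

```python
import math
from collections import Counter

histogram = Counter()

def observe(x):
    """Record which binade [2^e, 2^(e+1)) the value's magnitude falls in."""
    if x != 0.0:
        histogram[math.frexp(abs(x))[1] - 1] += 1   # frexp exponent, shifted to binade
    else:
        histogram['zero'] += 1                      # zero results tracked separately
    return x

for v in [0.75, 1.5, 1.9, 3.0, 0.0]:
    observe(v)
```

A narrow histogram over real datasets is exactly the evidence needed to justify a reduced exponent, and the 'zero' bucket flags candidates for zero-value prediction.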

21 FloatWatch
Operates on x86 binaries under Valgrind:
– x86 machine code is converted to a simplified SSA form
– FloatWatch inserts instrumentation code after floating-point operations
– The SSA is converted back to x86 and cached
Outputs a data file with the selected metrics; a processing script produces an HTML+JavaScript report.

22 Report
Dynamic HTML interface:
– Copy the HTML file from the computing cluster to a desktop; no installation required
Select/deselect source lines and SSA “instructions”:
– Dynamic in-page graph
– Table for exporting to gnuplot, Excel, etc.
View value ranges at instruction, source-line, function, file and application level.

23–28 (These slides contain figures only; no transcript text.)

29 What Does This Tell Us?
– Alpha is constant (but we could have found that from the source)
– Memory operands all fall within the same range
– The result falls within the same range as the memory operands
– Intermediate values result in a shift in the range
Optimisation: we do not need double precision; a custom floating-point format would suffice.

30 Profiling Results: SPECFP95 ‘swim’ (sawtooth caused by multiplication)

31 ‘swim’ Close-up

32 Profiling Results: SPECFP95 ‘mgrid’ (operations producing zero; two ranges with similar shapes)

33 Range Close-up

34 Profiling Results: MMVB
As with MORPHY, ranges are similar between datasets, but we were using test datasets.

35 Adaptive Floating Point
The main focus of future work: dynamic modification of acceleration hardware.
– Exploit the reconfigurability of FPGAs
– Reconfigure the device to meet application requirements
Important considerations:
– What happens to data already on the chip?

36 “Pipeline Prediction”
A similar concept to branch prediction: build a selection of pipelines with different performance characteristics.
– A slow but generic version
– A fast version with limited range and reduced operand alignment
– A compromise in between
Predict which version is best to use (how?).
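One possible answer to the "how?" is to borrow a branch predictor's saturating counters: track per call site whether the fast pipeline has recently been safe, and predict accordingly. Everything here, the counter width, the threshold and the safe range, is a hypothetical illustration, not a mechanism from the talk.

```python
counter = 2                                    # 2-bit saturating counter, starts weakly "fast"

def in_fast_range(a, b):
    """Would the fast, range-limited pipeline handle these operands?"""
    return all(1e-4 <= abs(v) <= 1e4 for v in (a, b) if v != 0.0)

def predicted_mul(a, b):
    """Multiply, picking a pipeline by prediction; returns (result, mispredicted)."""
    global counter
    chose_fast = counter >= 2                  # prediction made before seeing the result
    actually_safe = in_fast_range(a, b)
    # On a mispredict, real hardware would discard the fast result and
    # replay the operation on the slow generic pipeline.
    counter = min(3, counter + 1) if actually_safe else max(0, counter - 1)
    return a * b, chose_fast and not actually_safe

r1, miss1 = predicted_mul(2.0, 3.0)            # in range: trains the counter up
r2, miss2 = predicted_mul(1e30, 2.0)           # out of range: a mispredict
```

As with branch prediction, the scheme only wins if mispredicts are rare enough that replay cost is amortised.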

37 True Reconfiguration: Temporal Profiling
Value ranges can vary:
– between different application phases
– within loops iterating to convergence
There is potential to reconfigure the hardware as phases change; this is particularly apparent when iterating to convergence.
A simple example using the Newton-Raphson method:
– Solve a·cos(x) − b·x³ + c = 0
– Choose an estimate for the solution and iterate to refine it
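The slide's example, sketched in code. The slide does not give a, b, c, so the coefficients below are assumptions (a = b = 1, c = 0, which happen to give a root near 0.86). The point is the shrinking correction |dx| per iteration: the precision the hardware needs grows only as convergence nears.

```python
import math

a, b, c = 1.0, 1.0, 0.0        # hypothetical coefficients, not from the talk

def f(x):
    return a * math.cos(x) - b * x**3 + c

def df(x):
    return -a * math.sin(x) - 3 * b * x**2

def newton(x, steps=20, tol=1e-12):
    corrections = []            # |dx| per iteration: a proxy for precision needed
    for _ in range(steps):
        dx = f(x) / df(x)
        x -= dx
        corrections.append(abs(dx))
        if abs(dx) < tol:
            break
    return x, corrections

root, corrections = newton(1.0)
```

Early iterations could run on an 8-bit or fixed-point pipeline and only the last few need full double precision, which is exactly the reconfiguration opportunity being described.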

38 Predictable Behaviour: Newton’s Method
Each line represents a different starting estimate; the actual value is around 0.86.

39 Resolution Refinement
Work by Dongarra et al. uses this technique:
– Start with SSE/32-bit f.p.
– Iterate to “ballpark” convergence
– Switch to 64-bit for a more precise result
An alternative with reconfiguration: multi-stepped refinement.
– Start at 8 bits; move to 24, 48, 64 or 128 depending on the precision required

40 Problems With Our Approach
There is no guarantee that values outside the identified ranges will not occur:
– We must have a backup plan if it goes wrong
Not all applications will behave like MORPHY:
– Value ranges could vary wildly with different datasets
Valgrind is slow.
Getting FPGAs to provide a speed-up can be difficult and painful.

41 Future Work
State-based profiling:
– Profile functions based on the call stack
– Allows context-dependent configurations
Active simulation:
– Test new representations to check for errors
Use the results in practice:
– FPGA implementations for real applications
– Adaptive representations

42 Any Questions?
Jezebel, a 1916 Dennis ‘N’ Type fire engine. Royal College of Science Motor Club, Imperial College Union, SW7.

