# Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis.

## Presentation on theme: "Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis."— Presentation transcript:

Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis de Supinski and Matt LeGendre Lawrence Livermore National Lab

Background Floating point represents real numbers as (± sgnf × 2 exp ) o Sign bit o Exponent o Significand (mantissa or fraction) Finite precision o Single-precision: 24 bits (~7 decimal digits) o Double-precision: 53 bits (~16 decimal digits) o Introduces rounding error 032 16 84 Significand (23 bits)Exponent (8 bits) IEEE Single 2 03264 16 84 Significand (52 bits)Exponent (11 bits) IEEE Double

Motivation Double precision is ubiquitous o Necessary for some computations o Lack of easy-to-use techniques for reasoning about precision Single precision is preferable o Faster computation o Tesla K20X: 2.95 TFlops (singles) vs. 1.31 TFlops (doubles) o Intel Xeon Phi: 2.15 GFlops (singles) vs. 1.07 GFlops (doubles) o Standard CPUs: 2x operations w/ SSE vector operations o Reduced memory pressure o Up to 50% footprint reduction o Data movement is a bottleneck for some domains Desire: Balance speed (singles) with accuracy (doubles) 3

Mixed Precision 4 1: LU PA 2: solve Ly = Pb 3: solve Ux 0 = y 4: for k = 1, 2,... do 5:r k b – Ax k-1 6:solve Ly = Pr k 7:solve Uz k = y 8:x k x k-1 + z k 9:check for convergence 10: end for Red text indicates steps performed in double-precision (all other steps are single-precision) Mixed-precision linear solver algorithm Use double precision where necessary Use single precision where possible Nearly 2x speedups [Baboulin2008]

Our Goal Use automated analysis techniques to prototype mixed-precision variants and provide insight about a programs precision level requirements. 5

Framework CRAFT : Configurable Runtime Analysis for Floating-point Tuning Static binary instrumentation o Parse binary on disk o Replace or augment floating-point instructions with new code o Rewrite modified binary Dynamic analysis o Run modified program on representative data set o Produce results and recommendations 6

Previous Work Cancellation detection [WHIST11] o Reports loss of precision due to subtraction o Provides insight regarding numerical behavior Range tracking o Reports per-instruction min/max values o Provides insight regarding low dynamic ranges Mixed-precision variants o Replaces double-precision instructions and operands o Provides insight regarding precision-level sensitivity 7

downcast conversion In-place replacement o Narrowed focus: doubles singles o In-place downcast conversion o Flag in the high bits to indicate replacement 03264 16 84 Double 03264 16 84 Replaced Double 7FF4DEAD Non-signalling NaN 032 16 84 Single 8Implementation

Example gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1movsd 0x601e38(%rax, %rbx, 8) %xmm0 2mulsd -0x78(%rsp) * %xmm0 %xmm0 3addsd -0x4f02(%rip) + %xmm0 %xmm0 4movsd %xmm0 0x601e38(%rax, %rbx, 8) 9

Example gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1movsd 0x601e38(%rax, %rbx, 8) %xmm0 2mulss -0x78(%rsp) * %xmm0 %xmm0 3addss -0x4f02(%rip) + %xmm0 %xmm0 4movsd %xmm0 0x601e38(%rax, %rbx, 8) 10

gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1movsd 0x601e38(%rax, %rbx, 8) %xmm0 check/replace -0x78(%rsp) and %xmm0 2mulss -0x78(%rsp) * %xmm0 %xmm0 check/replace -0x4f02(%rip) and %xmm0 3addss -0x4f02(%rip) + %xmm0 %xmm0 4movsd %xmm0 0x601e38(%rax, %rbx, 8) 11Example

Replacement Code push %rax push %rbx mov %rbx, 0xffffffff00000000 and %rax, %rbx # extract high word mov %rbx, 0x7ff4dead00000000 test %rax, %rbx # check for flag je next # skip if replaced cvtsd2ss %rax, %rax # down-cast value or %rax, %rbx # set flag next: pop %rbx pop %rax # e.g. addsd => addss 12

Dyninst Binary analysis framework o Parses executable files (InstructionAPI & ParseAPI) o Inserts instrumentation (DyninstAPI) o Supports full binary modification (PatchAPI) o Rewrites binary executable files (SymtabAPI) dyninst.org 13

Block Editing 14 double single conversion original instruction in block block splits initialization check/replace

Overhead 15 Benchmark (name.CLASS) Average Overhead bt.A50.6X cg.A6.1X ep.A13.8X ft.A10.1X lu.A28.5X mg.A14.0X sp.A19.5X

Binary Editing 16 Original Binary (mutatee) Modified Binary CRAFT (mutator) Double Precision Mixed Precision Mixed Config Configuration (parser & GUI)

Configuration 17

Automated Search Manual mixed-precision replacement o Hard to use without intuition regarding potential replacements Automatic mixed-precision analysis o Try lots of configurations (empirical auto-tuning) o Test with user-defined verification routine and data set o Exploit program control structure: replace larger structures (modules, functions) first o If coarse-grained replacements fail, try finer-grained subcomponent replacements 18

System Overview 19

Example Results 20

Example Results 21

NAS Results 22 Benchmark (name.CLASS) Candidate Instructions Configurations Tested Instructions Replaced % Static % Dynamic bt.W6,6473,85476.2 85.7 bt.A6,6823,83275.9 81.6 cg.W94027093.7 6.4 cg.A93422994.7 5.3 ep.W39711293.7 30.7 ep.A39711393.1 23.9 ft.W4227284.4 0.3 ft.A4227393.6 0.2 lu.W5,9573,76973.7 65.5 lu.A5,9292,81480.4 69.4 mg.W1,35145884.4 28.0 mg.A1,35145684.1 24.4 sp.W4,7725,72936.9 45.8 sp.A4,8215,04451.9 43.0

NAS Results 23 Benchmark (name.CLASS) Candidate Instructions Configurations Tested Instructions Replaced % Static % Dynamic bt.W6,2283,93473.783.2 bt.A6,262 cg.W96225190.7 7.4 cg.A95625590.4 5.6 ep.W42311792.047.2 ep.A42311491.745.5 ft.W4267593.0 0.3 ft.A4267493.0 0.2 lu.W6,0384,11769.257.4 lu.A6,014 mg.W1,39344389.439.2 mg.A1,39343788.836.6 sp.W4,4585,12438.140.5 sp.A4,507

AMGmk Results 24 Algebraic MultiGrid microkernel Multigrid method is iterative and highly adaptive Good candidate for replacement Automatic search Complete conversion (100% replacement) Manually-rewritten version Speedup: 175 sec to 95 sec ( 1.8X ) Conventional x86_64 hardware

SuperLU Results 25 Package for LU decomposition and linear solves Reports final error residual (useful for threshholding) Both single- and double-precision versions Verified manual conversion via automatic search Used error from provided single-precision version as threshold Final config matched single-precision profile (99.9% replacement) ThresholdInstructions Replaced % Static % Dynamic Final Error 1.0e-0399.199.91.59e-04 1.0e-0494.187.34.42e-05 7.5e-0591.352.54.40e-05 5.0e-0587.945.23.00e-05 2.5e-0580.326.61.69e-05 1.0e-0575.4 1.67.15e-07 1.0e-0672.6 1.64.7e7-07

Future Work Memory-based analysis Case studies Search optimization 26

Conclusion Automated binary modification can build prototype mixed-precision program variants. Automated search can provide insight to focus mixed-precision implementation efforts. 27

Thank you! sf.net/p/crafthpc 28

Download ppt "Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis."

Similar presentations