Efficient Run-Time Dispatching in Generic Programming with Minimal Code Bloat
Lubomir Bourdev, Advanced Technology Labs, Adobe Systems
Jaakko Järvi, Computer Science Department, Texas A&M University

Agenda
–Context & problem statement
–Background: previous approaches
–Our approach to code bloat reduction
–Code bloat reduction in run-time dispatch
–Results & conclusion

Context: Image Manipulation
–Images vary in many different ways
–Writing generic and efficient image processing algorithms is challenging

Image Representations
[Figure: a 4x3 image with the second pixel highlighted, shown in interleaved form and in planar form]
Images vary in:
–layout: planar vs. interleaved
–channel depth: 8-bit, 16-bit…
–channel order (RGB vs. BGR)
–color space (RGB, CMYK…)
–optional padding at the end of rows

Generic Image Library (GIL)
–Adobe’s open-source image library: http://opensource.adobe.com/gil
–Abstracts image representations from algorithms on images
–Allows writing an algorithm once and having it work on images of any representation, without loss of performance

Problem Statement
How do we write image processing algorithms that are:
–Generic
–Efficient
–Compact
–Run-Time Flexible

Image algorithms via inheritance & polymorphism (one virtual call per image)
    struct image {
        virtual void invert() = 0;
    };
    struct rgb_image : public image {
        // the loop is written again in every subclass
        virtual void invert() { for (size_t i=0; i<img.size(); ++i) … }
    };
    struct cmyk_image : public image {
        virtual void invert() { for (size_t i=0; i<img.size(); ++i) … }
    };
Generic: XX | Efficient: √ | Compact: √ | Run-Time Flexible: √

Image algorithms via inheritance & polymorphism (one virtual call per pixel)
    struct pixel { virtual void invert() = 0; };
    struct rgb_pixel  : public pixel { virtual void invert(); };
    struct gray_pixel : public pixel { virtual void invert(); };
    struct image {
        size_t size() const;
        pixel* operator[](size_t i);
    };
    void invert(image* img) {
        for (size_t i = 0; i < img->size(); ++i)
            (*img)[i]->invert();   // virtual call for every pixel
    }
Generic: X | Efficient: X | Compact: √ | Run-Time Flexible: √
Performance problem: dynamic dispatch once per pixel

Image Algorithms via Generic Programming
    struct rgb_pixel {…};
    struct gray_pixel {…};
    void invert_pixel(rgb_pixel&) {…}
    void invert_pixel(gray_pixel&) {…}
    template <typename Pixel>
    struct image {
        Pixel& operator[](size_t i);
    };
    template <typename Image>
    void invert(Image& img) {
        for (size_t i = 0; i < img.size(); ++i)
            invert_pixel(img[i]);
    }
Generic: √ | Efficient: √ | Compact: √ | Run-Time Flexible: X
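A minimal compilable sketch of the generic version above. The pixel member layouts, the std::vector storage and the size() member are assumptions added for illustration; the slide elides them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 8-bit pixel types; the member layout is an assumption.
struct rgb_pixel  { unsigned char r, g, b; };
struct gray_pixel { unsigned char v; };

// One overload per pixel type, selected at compile time.
inline void invert_pixel(rgb_pixel& p) {
    p.r = static_cast<unsigned char>(255 - p.r);
    p.g = static_cast<unsigned char>(255 - p.g);
    p.b = static_cast<unsigned char>(255 - p.b);
}
inline void invert_pixel(gray_pixel& p) {
    p.v = static_cast<unsigned char>(255 - p.v);
}

// Minimal image: a flat buffer of pixels (storage choice is an assumption).
template <typename Pixel>
struct image {
    std::vector<Pixel> pixels;
    std::size_t size() const { return pixels.size(); }
    Pixel& operator[](std::size_t i) { return pixels[i]; }
};

// Written once; the compiler generates a separate copy per image type.
template <typename Image>
void invert(Image& img) {
    for (std::size_t i = 0; i < img.size(); ++i)
        invert_pixel(img[i]);
}
```

Calling invert on an image<gray_pixel> and on an image<rgb_pixel> produces two instantiations, each with its inner call resolved statically. The image type must be known at compile time, which is the missing run-time flexibility.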

Generic Code Lacks Flexibility
We need run-time flexibility:
    typedef boost::mpl::vector<…> images;
    gil::any_image<images> runtime_image;
    gil::jpeg_read_image(runtime_image, "test.jpg");
    invert(runtime_image);
How can we do that without loss of performance?
–Variant construct (see boost::variant)
–runtime_image holds:
    index: index to the type of image
    bits: buffer containing the currently instantiated image
–To invoke an algorithm, go through a switch statement & cast
–Efficient: invoke dynamic dispatch only once per algorithm

Variant invocation
    void invert_image(void* bits, int index) {
        switch (index) {   // each case casts to a different image type
            case kLAB: invert(*(image<…> *)(bits)); break;
            case kRGB: invert(*(image<…> *)(bits)); break;
        }
    }
Generic version:
    template <typename Op>
    void apply_operation(void* bits, int index, Op op) {
        switch (index) {
            case kLAB: op(*(image<…> *)(bits)); break;
            case kRGB: op(*(image<…> *)(bits)); break;
        }
    }
Generic: √ | Efficient: √ | Compact: X | Run-Time Flexible: √
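The switch-and-cast dispatch can be sketched as follows. The image structs, the any_image_ref wrapper and the body of invert_op are hypothetical stand-ins, not GIL's actual types; only the shape of apply_operation follows the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for two concrete image types (assumption).
struct rgb_image { std::vector<unsigned char> data; };
struct lab_image { std::vector<unsigned char> data; };

enum image_kind { kLAB = 0, kRGB = 1 };

// A hand-rolled variant: a type index plus an untyped pointer to the image.
struct any_image_ref {
    void* bits;
    int   index;
};

// One run-time dispatch per algorithm call. Every case instantiates op
// for a different image type, which is where the code bloat comes from.
template <typename Op>
void apply_operation(any_image_ref img, Op op) {
    switch (img.index) {
        case kLAB: op(*static_cast<lab_image*>(img.bits)); break;
        case kRGB: op(*static_cast<rgb_image*>(img.bits)); break;
    }
}

// The algorithm as a function object with a templated call operator.
struct invert_op {
    template <typename Image>
    void operator()(Image& img) const {
        for (std::size_t i = 0; i < img.data.size(); ++i)
            img.data[i] = static_cast<unsigned char>(255 - img.data[i]);
    }
};
```

With N image types in the variant, each algorithm is instantiated N times even when several of the instantiations compile to identical machine code.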

Solution: Template Hoisting
Define a class hierarchy:
    template <int K> class k_channel_image {…};
    class rgb_image : public k_channel_image<3> {};
    class lab_image : public k_channel_image<3> {};
Define the algorithm at the appropriate level of the hierarchy:
    template <int K> void invert(k_channel_image<K>&) {…}
Drawbacks:
- enforces a specific hierarchy
- different algorithms may need different hierarchies
- switch statement overhead remains
- does not help when the function is inlined
Generic: X | Efficient: √ | Compact: (unmarked) | Run-Time Flexible: √
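A compilable sketch of template hoisting under stated assumptions: 8-bit channels, std::vector storage, and the constructors are all invented here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Base class parameterized only on what the algorithm depends on:
// the number of 8-bit channels per pixel (an assumption of this sketch).
template <int K>
class k_channel_image {
public:
    explicit k_channel_image(std::size_t num_pixels) : data_(num_pixels * K, 0) {}
    std::size_t size() const { return data_.size(); }
    unsigned char& operator[](std::size_t i) { return data_[i]; }
private:
    std::vector<unsigned char> data_;
};

// Both concrete types share the 3-channel base.
class rgb_image : public k_channel_image<3> {
public:
    explicit rgb_image(std::size_t n) : k_channel_image<3>(n) {}
};
class lab_image : public k_channel_image<3> {
public:
    explicit lab_image(std::size_t n) : k_channel_image<3>(n) {}
};

// Defined on the base: rgb_image and lab_image arguments both deduce K = 3,
// so a single instantiation, invert<3>, serves both.
template <int K>
void invert(k_channel_image<K>& img) {
    for (std::size_t i = 0; i < img.size(); ++i)
        img[i] = static_cast<unsigned char>(255 - img[i]);
}
```

The sharing works only because both image classes were designed to derive from the same base, which is the hierarchy constraint listed among the drawbacks.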

Our method: Algorithm-centric approach to code bloat
–Define dimensions of variability of the type
–Specify, for a given algorithm, the set of dimensions that matter
    example: copy_pixels(source_image, dst_image);
–Reduce the type along the dimensions that don’t matter
    Image property        Source of “copy_pixels”
    Color Space           Not important
    Channel Type          Important
    Number of Channels    Important
    Channel Ordering      Important
    Mutability            Not important

Type Reduction
–Every algorithm partitions the space of its argument types into a set of equivalence classes
–Members of the same equivalence class result in the same assembly code when instantiated
–The algorithm is instantiated with only one representative from each equivalence class

Type Reduction Implementation
Metafunction to define the partition:
    template <typename Op, typename T>
    struct reduce { typedef T type; };
Generic algorithm invocation:
    template <typename Op, typename T>
    inline void apply_operation(const T& argument, Op op) {
        typedef typename reduce<Op,T>::type base_t;
        op(reinterpret_cast<const base_t&>(argument));
    }

Example: The invert algorithm
Define the algorithm as a function object:
    struct invert_op {
        template <typename Image> void operator()(Image&) {…}
    };
Provide a function overload to invoke it:
    template <typename Image>
    inline void invert(Image& image) {
        apply_operation(image, invert_op());
    }
Inverting RGB and LAB images is assembly-level identical:
    template<> struct reduce<invert_op, lab8_image_t> { typedef rgb8_image_t type; };
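The pieces of the invert example can be put together in a small sketch. The two image structs, their identical fixed-size layout, and the call_count helper are assumptions made here so the effect is observable; the non-const apply_operation signature is also a simplification.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical image types with identical layout (assumption): a fixed
// buffer of 8-bit samples.
struct rgb8_image_t { unsigned char samples[6]; };
struct lab8_image_t { unsigned char samples[6]; };

// Default partition: every type is its own representative.
template <typename Op, typename T>
struct reduce { typedef T type; };

// Counts calls per instantiated Image type (illustration only).
template <typename Image>
int& call_count() { static int n = 0; return n; }

struct invert_op {
    template <typename Image>
    void operator()(Image& img) const {
        ++call_count<Image>();
        for (std::size_t i = 0; i < sizeof(img.samples); ++i)
            img.samples[i] = static_cast<unsigned char>(255 - img.samples[i]);
    }
};

// For invert_op, LAB images fall into the RGB equivalence class.
template <>
struct reduce<invert_op, lab8_image_t> { typedef rgb8_image_t type; };

// Cast the argument to its class representative before invoking op.
// Unsafe by design: valid only if T and base_t really share a layout.
template <typename Op, typename T>
inline void apply_operation(T& argument, Op op) {
    typedef typename reduce<Op, T>::type base_t;
    op(reinterpret_cast<base_t&>(argument));
}

template <typename Image>
inline void invert(Image& image) {
    apply_operation(image, invert_op());
}
```

Inverting a lab8_image_t and an rgb8_image_t both go through invert_op::operator()<rgb8_image_t>, so the LAB version of the loop is never instantiated. The reinterpret_cast is what makes the approach unsafe when the layouts differ.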

The technique generalizes to multiple dimensions
    template <typename Op, typename T1, typename T2>
    void apply_operation(T1& arg1, T2& arg2, Op op) {
        // unary pre-reduction of each argument
        typedef typename reduce<Op,T1>::type base1_t;
        typedef typename reduce<Op,T2>::type base2_t;
        // binary reduction of the pair
        typedef std::pair<base1_t*, base2_t*> pair_t;
        typedef typename reduce<Op,pair_t>::type base_pair_t;
        std::pair<void*, void*> p(&arg1,&arg2);
        op(reinterpret_cast<base_pair_t&>(p));
    }
    template <> struct reduce<…> {…};
    template <> struct reduce<copy_pixels_op, std::pair<…> > {…};

Defining Reduce Specializations
Reduce dimensions separately, then combine:
    template <…>
    struct reduce<…> {
        typedef reduce_cs<…>::type cs;
        typedef reduce_ch<…>::type channel;
        typedef image_type<…>::type type;
    };
Reuse structures via metafunction forwarding:
    template <…>
    struct reduce<… > : public reduce<… > {};

Example: binary color space reduction
We identified eight such common color space equivalence classes
[Figure: one pair of channel orderings reduces to an equivalent pair; labels: A B G R, B G R A, A R G B, R G B A]

Reduction in variants
Input: a variant of:
    input_types: [rgb8_image, lab8_image, cmyk16_image, rgba16_image]
    input_index: 2
Step 1: Reduce each member of the vector:
    reduced_t: [rgb8_image, rgb8_image, rgba16_image, rgba16_image]
Step 2: Remove duplicates:
    output_types_t: [rgb8_image, rgba16_image]
Step 3: Create index vector from reduced_t to output_types_t:
    indices_t: [0, 0, 1, 1]
Step 4: Use indices_t to map the input index to an output index:
    output_index = indices_t[input_index] = indices_t[2] = 1
Invoke the algorithm on a variant of:
    output_types_t: [rgb8_image, rgba16_image]
    output_index: 1
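The four steps can be modeled at run time as below. This is only an illustration: the talk's technique computes these vectors at compile time over type lists, whereas here type names are plain strings, and reduce_name and reduced_variant are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reduction rule matching the example: LAB reduces to RGB
// and 16-bit CMYK to 16-bit RGBA. Type names are modeled as strings.
inline std::string reduce_name(const std::string& t) {
    if (t == "lab8_image")   return "rgb8_image";
    if (t == "cmyk16_image") return "rgba16_image";
    return t;
}

struct reduced_variant {
    std::vector<std::string> output_types;  // result of Step 2
    std::vector<std::size_t> indices;       // result of Step 3
};

inline reduced_variant reduce_variant(const std::vector<std::string>& input_types) {
    reduced_variant r;
    for (std::size_t i = 0; i < input_types.size(); ++i) {
        const std::string reduced = reduce_name(input_types[i]);  // Step 1
        // Find the reduced type among the representatives seen so far.
        std::size_t pos = 0;
        while (pos < r.output_types.size() && r.output_types[pos] != reduced)
            ++pos;
        if (pos == r.output_types.size())
            r.output_types.push_back(reduced);                    // Step 2
        r.indices.push_back(pos);                                 // Step 3
    }
    return r;
}
```

Step 4 is then a single lookup, indices[input_index], after which the algorithm is dispatched over the shorter output_types list.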

Binary reduction in variants
Step 1: Perform unary pre-reduction on each argument
    [A1, A2, A3, A4] with index 2 -> [A1, A3] with out_index1 = 1
    [B1, B2, B3] with index 3 -> [B1, B2] with out_index2 = 0
Step 2: Compute a vector of the cross-products of types
    [(A1,B1), (A1,B2), (A3,B1), (A3,B2)]
Step 3: Apply unary reduction on it:
    output_types_t = [(A1,B1), (A1,B2), (A3,B2)]
Step 4: Compute the index in the output vector
    out_index = out_index1 * size(Vec1) + out_index2
Invoke the algorithm on a single variant of:
    output_types_t = [(A1,B1), (A1,B2), (A3,B2)]
    out_index

Hypothetical Reduction Example: copy_pixels
Start with 3*9*2*2 = 108 image types
–channel type (8 / 16 / 32 bit)
–color space (rgb, bgr, lab, hsb, rgba, argb, bgra, abgr, cmyk)
–planar / interleaved pixel ordering
–mutable / immutable type
Unary pre-reduction (3*6*2*1 = 36 equiv. classes)
–reduce lab, hsb to rgb; cmyk to rgba
–mutable-only
Binary reduction
–reduce color space pairs to 8 equiv. classes based on mapping
–reduce incompatible combinations
End result: 1 switch statement with 96 cases (down from 109 switch statements with 108^2 = 11664 cases!)
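The counts in this example can be checked mechanically. The sketch below assumes C++11 and uses constant names invented for illustration. It covers only the figures that follow from the listed dimensions; the final 96 cases depend on details of the binary reduction that the slide does not spell out, so that number is not derived here.

```cpp
#include <cassert>

// Dimension sizes taken from the example (constant names are hypothetical).
constexpr int channel_types = 3;   // 8 / 16 / 32 bit
constexpr int color_spaces  = 9;   // rgb, bgr, lab, hsb, rgba, argb, bgra, abgr, cmyk
constexpr int layouts       = 2;   // planar / interleaved
constexpr int mutabilities  = 2;   // mutable / immutable

constexpr int image_types = channel_types * color_spaces * layouts * mutabilities;
static_assert(image_types == 108, "3*9*2*2 image types");

// Without reduction, binary dispatch needs one case per ordered pair of types.
constexpr int unreduced_pairs = image_types * image_types;
static_assert(unreduced_pairs == 11664, "108^2 cases");

// Unary pre-reduction: lab and hsb fold into rgb, cmyk into rgba, leaving
// 6 color spaces; mutable-only leaves a single mutability choice.
constexpr int reduced_color_spaces = color_spaces - 3;
constexpr int unary_classes = channel_types * reduced_color_spaces * layouts * 1;
static_assert(unary_classes == 36, "3*6*2*1 equivalence classes");
```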

Tests
Test sets
–Set A: 90 types (10 color spaces, 3 channel types, other variations)
–Set B: 10 types (4 color spaces, other)
–Set C: 12 types (3 color spaces, planar/interleaved, step/nonstep)
Tests
–Test 1: copy_pixels on Set B (inlined binary algorithm)
–Test 2: copy_pixels on Set C (inlined binary algorithm)
–Test 3: resample_pixels on Set B (non-inlined binary algorithm)
–Test 4: resample_pixels on Set C (non-inlined binary algorithm)
–Test 5: invert_pixels on Set A (inlined unary algorithm)

Results
Reduction in code bloat
              Visual Studio 8                           GCC 4.0
              No Reduce   Reduce   Percent reduction    No Reduce   Reduce   Percent reduction
    Test 1    42.0        34.5     18%                  201.6       107.5    47%
    Test 2    41.5        26.0     37%                  252.8       75.9     70%
    Test 3    46.0        42.5     8%                   259.8       144.0    45%
    Test 4    33.5        34.0     -1%                  318.7       98.8     69%
    Test 5    24.0        16.5     31%                  62.2        31.2     50%
Effect on compile time
              VS 8.0   GCC 4.0
    Test 1    106%     116%
    Test 2    78%      97%
    Test 3    87%      118%
    Test 4    75%      103%
    Test 5    194%     307%

Conclusion
Drawbacks
–Unsafe
–Requires intimate knowledge of the types and the algorithm
–Some compilers can optimize most of the code bloat
Benefits
–Works even when functions are inlined
–Simplifies code generated by variants (especially double dispatch)
–Does not impose class hierarchy (essential for generic code!)
–Works when algorithms differ in requirements
