Interleaved Pixel Lookup for Embedded Computer Vision


1 Interleaved Pixel Lookup for Embedded Computer Vision
Kota Yamaguchi, Yoshihiro Watanabe, Takashi Komuro, Masatoshi Ishikawa

2 Outline
- Introduction
- Problems in applying interleaving
- Techniques
- Example: Lucas-Kanade
- Conclusion

3 Purpose
To find a technique (interleaving) for efficiently implementing a parallel memory for pixel lookup operations.
[Figure: camera captures images → image processing → computer vision tasks (model objects, feature space, e.g. pose, shape)]

4 Motivation
Pixel lookup strongly influences downstream performance. It involves massive memory operations, which are always a headache for embedded designers.

5 Motivation
Interleaving is used in graphics hardware: Texram [Schilling, 96] and the texture memory in recent GPUs. Is it also beneficial to embedded computer vision hardware? Yes, if appropriately implemented.

6 Pixel lookup operations
Geometry-to-pixel conversion: a geometry stream $x_k, x_{k+1}, x_{k+2}, \dots$ is converted into a pixel stream $I(x_k), I(x_{k+1}), I(x_{k+2}), \dots$, using the input images as a lookup table.
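A minimal sketch of this conversion, assuming integer coordinates and an image stored as a Python list of rows (`pixel_lookup` is a hypothetical name, not from the slides):

```python
def pixel_lookup(image, geometry_stream):
    """Convert a stream of integer coordinates x_k = (x, y) into the pixel stream I(x_k)."""
    for x, y in geometry_stream:
        yield image[y][x]   # the input image acts as a lookup table


image = [[10 * y + x for x in range(4)] for y in range(3)]   # value encodes (x, y)
print(list(pixel_lookup(image, [(0, 0), (3, 1), (2, 2)])))   # [0, 13, 22]
```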

7 Straightforward implementation
Each coordinate $x_k$ from the geometry stream is looked up in a random access memory (RAM) holding the input images, producing the pixel stream $I(x_k)$. This is expensive and slow.

8 Interleaved implementation
Storing the input images as packed words in an interleaved memory gives higher throughput with the same capacity. However, it suffers from partitioning and alignment issues.

9 Partitioning issue
The parallel word does not match the operation: e.g. neighboring 1x4 pixels are packed into a word, but each operation requires 4x1 pixels, so one operation needs several reads followed by alignment.
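To make the mismatch concrete, here is a count of word reads under a hypothetical layout that packs four horizontally adjacent pixels per word and addresses a word by `(row, x // 4)`; the helper name is illustrative:

```python
def words_touched(pixels, pack_w=4):
    """Addresses of the packed words holding the given (x, y) pixels, assuming each
    word packs pack_w horizontally adjacent pixels of one row."""
    return {(y, x // pack_w) for x, y in pixels}


row_access = [(8 + i, 3) for i in range(4)]   # 1x4: matches the packing
col_access = [(8, 3 + i) for i in range(4)]   # 4x1: what the operation needs
print(len(words_touched(row_access)))   # 1 read
print(len(words_touched(col_access)))   # 4 reads, one per pixel
```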

10 Misalignment issue
An unaligned access requires multiple reads and sub-word alignment.
[Figure: an access crossing a word boundary: read, read, align]
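A sketch of what an unaligned read costs in such a word-packed memory, assuming 4 pixels per word and one image row stored as consecutive words (the function name and the data are illustrative):

```python
PACK = 4  # pixels per packed word (assumed)


def read_unaligned(memory, start):
    """Read PACK consecutive pixels starting at pixel index `start` from word-packed
    memory. Returns (pixels, number_of_word_reads)."""
    first, last = start // PACK, (start + PACK - 1) // PACK
    words = [memory[a] for a in range(first, last + 1)]   # 2 reads when misaligned
    flat = [p for w in words for p in w]
    offset = start - first * PACK                          # sub-word alignment (shift)
    return flat[offset:offset + PACK], len(words)


row = list(range(100, 116))                                   # 16 pixels of one row
memory = [row[i:i + PACK] for i in range(0, len(row), PACK)]  # 4 packed words
print(read_unaligned(memory, 8))   # ([108, 109, 110, 111], 1)
print(read_unaligned(memory, 9))   # ([109, 110, 111, 112], 2)
```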

11 Techniques
- 2D partitioning
- Indirect addressing
- Data switching

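12 2D partitioning
Treat the entire image as tiled spatial patterns, so that a packed word equals the spatial pattern the lookup requires. This avoids the partitioning issue.
[Figure: spatial patterns tiling the image; each pattern is one packed word spread across the memory banks]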
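A minimal sketch of 2D partitioning with `pw x ph` patterns. The bank numbering and the row-major tile addresses are assumptions for illustration; the slides do not specify the exact layout:

```python
def partition(img, pw, ph):
    """Store an image (list of rows) in pw*ph banks by tiling it with pw x ph patterns.

    Assumed layout: pixel (x, y) goes to bank (y % ph) * pw + (x % pw), at address
    (y // ph) * (width // pw) + x // pw, i.e. the row-major index of its tile.
    """
    height, width = len(img), len(img[0])
    assert width % pw == 0 and height % ph == 0, "image must be a whole number of tiles"
    tiles = (width // pw) * (height // ph)
    banks = [[0] * tiles for _ in range(pw * ph)]
    for y in range(height):
        for x in range(width):
            banks[(y % ph) * pw + (x % pw)][(y // ph) * (width // pw) + x // pw] = img[y][x]
    return banks


img = [[10 * y + x for x in range(4)] for y in range(4)]   # value encodes (x, y)
banks = partition(img, 2, 2)
# Reading address 3 from all four banks returns the aligned 2x2 tile at (2, 2).
print([bank[3] for bank in banks])   # [22, 23, 32, 33]
```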

13 Spatial pattern
A certain pattern present in a lookup sequence, e.g.
- 2x2 block for interpolation
- 3x3 block for convolution
[Figure: 2x2 blocks at (i, j) and (i', j') in the input images; the block at (i, j) covers (i, j), (i+1, j), (i, j+1), (i+1, j+1)]

14 2D partitioning and misalignment
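Because the image is tiled with the pattern, the data elements of a requested word are always distributed across different banks, even if an access overlaps address boundaries.
[Figure: pixels 1-4 of a pattern stored in banks 1-4, for aligned and misaligned accesses]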
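The guarantee can be checked exhaustively for 4x4 patterns, assuming pixel `(x, y)` is stored in bank `(y % 4) * 4 + (x % 4)` (an illustrative numbering): any 4x4 block, aligned or not, contains each residue of `x % 4` and `y % 4` exactly once, so it hits each of the 16 banks exactly once.

```python
def banks_for_block(x0, y0, pw=4, ph=4):
    """Sorted bank indices hit by the pw x ph block with top-left pixel (x0, y0),
    assuming pixel (x, y) is stored in bank (y % ph) * pw + (x % pw)."""
    return sorted((y % ph) * pw + (x % pw)
                  for y in range(y0, y0 + ph) for x in range(x0, x0 + pw))


# Every origin, aligned or not, touches each of the 16 banks exactly once.
print(all(banks_for_block(x0, y0) == list(range(16))
          for y0 in range(8) for x0 in range(8)))   # True
```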

15 Indirect addressing
Generating a patterned address for each bank removes the multiple reads otherwise needed for a misaligned access.
[Figure: an address generator supplying a separate address to each of banks 1-4]
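A sketch of such an address generator, under the same assumed layout as a plain 2D-partitioned memory (bank `(y % ph) * pw + (x % pw)`, address = row-major tile index). The function name is hypothetical; the point is that an aligned block uses the same address in every bank, while a misaligned block gets a different address per bank:

```python
def bank_addresses(x0, y0, pw, ph, width):
    """Hypothetical address generator for the pw x ph block with top-left pixel (x0, y0).

    Each bank holds exactly one pixel of the block, so each bank gets its own
    address and a single read per bank fetches the whole block.
    """
    addrs = []
    for by in range(ph):
        for bx in range(pw):
            x = x0 + (bx - x0) % pw   # the block column whose x % pw == bx
            y = y0 + (by - y0) % ph   # the block row whose y % ph == by
            addrs.append((y // ph) * (width // pw) + x // pw)
    return addrs


print(bank_addresses(0, 0, 2, 2, 4))  # aligned:    [0, 0, 0, 0]
print(bank_addresses(1, 1, 2, 2, 4))  # misaligned: [3, 2, 1, 0]
```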

16 Data switching
A switch removes the throughput decrease caused by sub-word alignment.
[Figure: a switch reordering the outputs of banks 1-4 behind the address generator]
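With one pixel per bank, sub-word alignment reduces to routing: the bank outputs only need to be permuted into block order, which a switch can do in the same cycle. A sketch, assuming pixel `(x, y)` lives in bank `(y % ph) * pw + (x % pw)`; the example data are bank outputs for the 2x2 block at (1, 1) of a 4x4 image with I(x, y) = 10y + x:

```python
def switch(bank_out, x0, y0, pw, ph):
    """Hypothetical data switch: route bank outputs (in bank order) to row-major
    block order for the pw x ph block at (x0, y0). Block position (i, j) is served
    by the bank of pixel (x0 + i, y0 + j): a rotation in each dimension."""
    return [bank_out[((y0 + j) % ph) * pw + (x0 + i) % pw]
            for j in range(ph) for i in range(pw)]


# bank 0 holds (2, 2), bank 1 holds (1, 2), bank 2 holds (2, 1), bank 3 holds (1, 1)
print(switch([22, 21, 12, 11], 1, 1, 2, 2))   # [11, 12, 21, 22]
```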

17 Techniques overview
[Figure: geometry stream → address generator (indirect addressing) → memory banks holding the input images (2D partitioning) → switch (data switching) → pixel stream]
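Putting the three techniques together, a behavioural model might look as follows. It is a hypothetical software sketch, not the authors' hardware: `InterleavedPixelMemory` and its methods are illustrative names, the bank numbering `(y % ph) * pw + (x % pw)` and row-major tile addresses are assumptions, and each `read_block` call stands for one memory cycle with one read per bank:

```python
class InterleavedPixelMemory:
    """2D partitioning + indirect addressing + data switching (behavioural model)."""

    def __init__(self, img, pw, ph):
        self.pw, self.ph = pw, ph
        self.h, self.w = len(img), len(img[0])
        assert self.w % pw == 0 and self.h % ph == 0
        self.tiles_per_row = self.w // pw
        # 2D partitioning: one bank per position inside the pw x ph pattern.
        self.banks = [[0] * (self.tiles_per_row * (self.h // ph)) for _ in range(pw * ph)]
        for y in range(self.h):
            for x in range(self.w):
                self.banks[self._bank(x, y)][self._addr(x, y)] = img[y][x]

    def _bank(self, x, y):
        return (y % self.ph) * self.pw + (x % self.pw)

    def _addr(self, x, y):
        return (y // self.ph) * self.tiles_per_row + x // self.pw

    def read_block(self, x0, y0):
        """Return the pw x ph block at (x0, y0) in row-major order, any alignment."""
        pw, ph = self.pw, self.ph
        out = [None] * (pw * ph)
        # Indirect addressing: each bank reads the one block pixel it holds.
        for by in range(ph):
            for bx in range(pw):
                x = x0 + (bx - x0) % pw
                y = y0 + (by - y0) % ph
                out[by * pw + bx] = self.banks[by * pw + bx][self._addr(x, y)]
        # Data switching: permute bank outputs into block order.
        return [out[self._bank(x0 + i, y0 + j)] for j in range(ph) for i in range(pw)]


img = [[10 * y + x for x in range(8)] for y in range(8)]
mem = InterleavedPixelMemory(img, 4, 4)
print(mem.read_block(1, 1)[:4])   # first row of the block: [11, 12, 13, 14]
```

Because the address generator and the switch together absorb any misalignment, every block read takes one cycle regardless of where the block starts.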

18 Example: Lucas-Kanade
An image registration algorithm: non-linear least squares, solved by the Gauss-Newton method, for the parameters of the affine transformation between an input image and a template image [Baker & Matthews, 04].
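For reference, one standard way to write this problem is the forward-additive formulation from Baker & Matthews with their six-parameter affine warp. Whether the hardware uses this variant or another (e.g. inverse compositional) is not stated in the slides, so treat it as an assumed formulation; $T$ is the template, $I$ the input image, and $\nabla I$ is evaluated at $W(\mathbf{x};\mathbf{p})$:

```latex
W(\mathbf{x};\mathbf{p}) =
\begin{pmatrix} (1+p_1)\,x + p_3\,y + p_5 \\ p_2\,x + (1+p_4)\,y + p_6 \end{pmatrix},
\qquad
\frac{\partial W}{\partial \mathbf{p}} =
\begin{pmatrix} x & 0 & y & 0 & 1 & 0 \\ 0 & x & 0 & y & 0 & 1 \end{pmatrix}

\min_{\Delta\mathbf{p}} \sum_{\mathbf{x}}
\bigl[ I(W(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})) - T(\mathbf{x}) \bigr]^2

\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
\Bigl[\nabla I \,\frac{\partial W}{\partial \mathbf{p}}\Bigr]^{\top}
\bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
\qquad
H = \sum_{\mathbf{x}}
\Bigl[\nabla I \,\frac{\partial W}{\partial \mathbf{p}}\Bigr]^{\top}
\Bigl[\nabla I \,\frac{\partial W}{\partial \mathbf{p}}\Bigr],
\qquad
\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}
```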

19 LK data flow
The bottleneck is the for-each-x loop nested inside the for-each-iteration loop; it includes the pixel lookup.
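The loop nest can be sketched in Python as below, assuming the forward-additive Gauss-Newton variant of Lucas-Kanade, bilinear interpolation, and `numpy.gradient` for the image gradients (the slides use 4x4 filter kernels instead, and all function names here are hypothetical). The inner loop over template pixels is the bottleneck: every visit performs three pixel lookups, one for the intensity and two for the gradients.

```python
import math

import numpy as np


def affine_warp(p, x, y):
    """W(x; p) in the six-parameter affine form used by Baker & Matthews."""
    return ((1 + p[0]) * x + p[2] * y + p[4],
            p[1] * x + (1 + p[3]) * y + p[5])


def bilinear(img, x, y):
    """Pixel lookup with bilinear interpolation; img is a list of rows and (x, y)
    must stay at least one pixel inside the right/bottom border (no bounds checks)."""
    x0, y0 = math.floor(x), math.floor(y)
    ax, ay = x - x0, y - y0
    r0, r1 = img[y0], img[y0 + 1]
    return ((1 - ax) * (1 - ay) * r0[x0] + ax * (1 - ay) * r0[x0 + 1]
            + (1 - ax) * ay * r1[x0] + ax * ay * r1[x0 + 1])


def lk_affine(image, template, p, iterations=10):
    """Forward-additive Gauss-Newton Lucas-Kanade for an affine warp."""
    gy, gx = (g.tolist() for g in np.gradient(np.asarray(image, dtype=float)))
    p = np.asarray(p, dtype=float).copy()
    for _ in range(iterations):                        # for each iteration
        hess = np.zeros((6, 6))
        rhs = np.zeros(6)
        for y, row in enumerate(template):
            for x, t in enumerate(row):                # for each x: the bottleneck
                wx, wy = affine_warp(p, x, y)
                err = t - bilinear(image, wx, wy)      # pixel lookup + error
                ix, iy = bilinear(gx, wx, wy), bilinear(gy, wx, wy)
                sd = np.array([ix * x, iy * x, ix * y, iy * y, ix, iy])  # steepest descent
                hess += np.outer(sd, sd)               # Hessian accumulation
                rhs += sd * err
        p += np.linalg.solve(hess, rhs)                # Gauss-Newton update
    return p
```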

20 Pixel lookup in LK
Converts affine-warped coordinates into pixels. For each output, the neighboring 4x4 raw pixels are looked up in the input images (the pixel lookup table) and interpolated, producing warped input pixels and warped gradient pixels.
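A sketch of the 4x4 lookup followed by interpolation, assuming Catmull-Rom cubic weights as the filter kernel (the slides do not say which kernel the pipeline uses; the function names are hypothetical). The 16 products in `cubic_interp` correspond to the multiply-adds fed by one 4x4 lookup:

```python
import math


def catmull_rom_weights(t):
    """Weights for the taps at offsets -1, 0, 1, 2, for fractional position t in [0, 1)."""
    t2, t3 = t * t, t * t * t
    return [(-t3 + 2 * t2 - t) / 2,
            (3 * t3 - 5 * t2 + 2) / 2,
            (-3 * t3 + 4 * t2 + t) / 2,
            (t3 - t2) / 2]


def lookup_4x4(img, x, y):
    """Fetch the 4x4 neighbourhood around (x, y): rows y0-1..y0+2, columns x0-1..x0+2."""
    x0, y0 = math.floor(x), math.floor(y)
    block = [img[y0 + j][x0 - 1:x0 + 3] for j in range(-1, 3)]
    return block, x - x0, y - y0


def cubic_interp(img, x, y):
    block, tx, ty = lookup_4x4(img, x, y)
    wx, wy = catmull_rom_weights(tx), catmull_rom_weights(ty)
    # 16 multiply-adds: the 2D kernel is the outer product of wy and wx.
    return sum(wy[j] * wx[i] * block[j][i] for j in range(4) for i in range(4))


img = [[2 * x + 3 * y for x in range(12)] for y in range(12)]   # a linear ramp
print(cubic_interp(img, 5.3, 7.6))   # close to 2*5.3 + 3*7.6 = 33.4
```

A linear ramp is a convenient check because these cubic weights reproduce linear functions exactly.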

21 Straightforward implementation
Raw pixels are read from a RAM holding the input images and combined with the filter kernels by multiply-adds.

22 Interleaved implementation
The input images are stored in memory banks with 4x4 block partitioning. An address generator drives the banks, and the raw pixels from the interleaved memory are combined with the filter kernels by multiply-adds.

23 Comparison of memory configurations
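- Single port: throughput 1; peripherals: none
- 4x4 multi-port: throughput 16; peripherals: none
- 4x4 interleaved (SIMD): throughput 5-6; peripherals: switch
- 4x4 interleaved with alignment support: throughput 16; peripherals: address generator and switch
The interleaved configurations keep the capacity requirement of a single memory. Peripherals are easier to implement than increased memory capacity.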
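One way to see why plain SIMD interleaving lands at roughly 5-6 pixels per cycle: if each word holds one aligned 4x4 tile, a block that is misaligned in a dimension spans two tiles in that dimension, so it needs 1, 2, or 4 word reads. The sketch below averages over all 16 alignments, assuming (our assumption, not the slides') uniformly distributed offsets, one word read per cycle, and a switch that adds no cycles:

```python
from fractions import Fraction


def simd_throughput(n=4):
    """Pixels per word read for an n x n block, averaged over all n*n alignments,
    when each word holds one aligned n x n tile (no indirect addressing)."""
    reads = 0
    for ox in range(n):                 # horizontal offset of the block inside a tile
        for oy in range(n):             # vertical offset
            tiles_x = 1 if ox == 0 else 2
            tiles_y = 1 if oy == 0 else 2
            reads += tiles_x * tiles_y  # words touched by this access
    pixels = n * n * n * n              # n*n accesses of n*n pixels each
    return Fraction(pixels, reads)


print(simd_throughput())                    # 256/49
print(round(float(simd_throughput()), 2))   # 5.22
```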

24 FPGA implementation of LK pipeline
Interleaving alone provides 16x higher throughput for the dedicated pipeline.
[Figure: dedicated hardware pipeline nested as "for each iteration / for each x": affine warp calculator, filter kernel generator, input pixel table, template pixel table, gradient / interpolation filter, Jacobian filter, error calculator, Hessian matrix calculator, SDPU calculator, FPU (FP ALU, FP register)]

25 HDL synthesis
16x higher throughput, but still the same capacity requirement and a feasible hardware cost.
Estimated performance: 200 fps for registration of five 64x64 8-bit image patches at 100 MHz, assuming every registration converges within 10 iterations.
FPGA: Xilinx Virtex-4 XC4VLX200
- Slices: 3,108 / 89,088 (3%)
- DSP slices: 75 / 96 (78%)
- RAM blocks: 266 / 336 (79%), 4,788 Kb
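As a rough cross-check, the performance estimate can be turned into a per-pixel cycle budget. This assumes (our assumption, not stated on the slide) that each iteration visits every pixel of every patch once and that all 10 iterations are always run:

```python
clock_hz = 100e6
fps = 200
patches, side, iterations = 5, 64, 10

pixel_iterations = patches * side * side * iterations   # 204,800 per frame
cycles_per_frame = clock_hz / fps                        # 500,000
cycles_per_pixel = cycles_per_frame / pixel_iterations
print(cycles_per_pixel)                                  # 2.44140625
```

Under these assumptions the pipeline has about 2.4 clock cycles per pixel per iteration, which is only reachable if each 4x4 pixel lookup completes in a single memory cycle.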

26 Summary
- Interleaved pixel lookup: sub-word parallel memory operations that exploit spatial patterns in lookup sequences
- Techniques: 2D partitioning, indirect addressing, data switching
- Example: Lucas-Kanade, with 16x higher throughput at the same memory capacity and a feasible hardware cost

