A Hardware Processing Unit For Point Sets S. Heinzle, G. Guennebaud, M. Botsch, M. Gross Graphics Hardware 2008
Motivation: Point-based graphics established; Powerful algorithms –Representation –Processing –Manipulation –Rendering; Decomposition –Get neighborhood –Operate on neighbors
Motivation: GPUs not suited for getting neighborhood –SIMD –Incoherent branching –Dynamic data structures slow –Recursive calls not supported; CPUs –Small number of FPUs –Inflexible memory caches
Contributions: Hardware architecture for point sets –Neighbor search module –Novel advanced caching mechanism –Reconfigurable processing module –Programmability using FPGA compiler; FPGA prototype and measurements; Small & Lean; Integration into multi-core CPU/GPU possible
Outline: Related Work; Spatial Searching and Caching; Architecture and Prototype; Results; Conclusion
Related Work: Kd-Tree [Bentley 75]; kNN on GPUs [Ma and McCool 02]; Kd-Tree Hardware [Woop et al. 05] [Woop et al. 06]; Kd-Tree on GPUs [Popov et al. 07]
Related Work: Adaptive SPH Fluid Simulation [Adams et al. 07]; Linear Moving Least Squares [Adamson and Alexa 04]; Algebraic Moving Least Squares [Guennebaud and Gross 07]
Linear Moving Least Squares: Implicit surface defined by a set of points p_i with normals n_i; Query point x is projected iteratively onto a plane (x, x', x'', x''', ...); Surface defined by points projecting onto themselves
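The iterative projection can be sketched as follows. This is a minimal illustration, not the operator implemented on the prototype: it assumes a Gaussian weight with a hypothetical bandwidth `h`, and a plane given by the weighted average of the neighbor positions and normals, which is one common way to write a linear MLS step. The names `gaussian_weight` and `mls_project` are made up for this example.

```python
import math

def gaussian_weight(dist_sq, h):
    """Smooth weight that decays with squared distance (assumed kernel)."""
    return math.exp(-dist_sq / (h * h))

def mls_project(x, points, normals, h=1.0, iterations=10):
    """Repeatedly project x onto the plane fitted to its weighted neighbors."""
    x = list(x)
    for _ in range(iterations):
        wsum = 0.0
        a = [0.0, 0.0, 0.0]   # weighted average position
        n = [0.0, 0.0, 0.0]   # weighted average normal
        for p, nrm in zip(points, normals):
            d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
            w = gaussian_weight(d2, h)
            wsum += w
            for c in range(3):
                a[c] += w * p[c]
                n[c] += w * nrm[c]
        a = [c / wsum for c in a]
        length = math.sqrt(sum(c * c for c in n))
        n = [c / length for c in n]
        # Signed distance of x to the plane through a with normal n.
        dist = sum((xi - ai) * ni for xi, ai, ni in zip(x, a, n))
        x = [xi - dist * ni for xi, ni in zip(x, n)]
    return x
```

A point that no longer moves under this step lies on the surface, which matches the "points projecting onto themselves" definition. In a full pipeline, `points` and `normals` would be the kNN result for the current x rather than the whole point set.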
Spatial Search: kNN and εNN –Common in most point operations –Based on kd-tree; Example NN: (figure)
Spatial Search: kNN search similar to NN search –Start with infinite radius –Sort leaf points into priority queue –Shrink radius with every point sorted
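The three steps above can be sketched in software as a recursive kd-tree search. This is only an illustration under assumptions: the tree layout (`build`, tuples tagged "leaf" / "node", a hypothetical `leaf_size`) is invented for the example, and a binary heap stands in for the priority queue.

```python
import heapq

def build(points, depth=0, leaf_size=4):
    """Build a simple kd-tree; leaves hold up to leaf_size points."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def knn(tree, q, k):
    """Return the k nearest points as sorted (squared distance, point) pairs."""
    heap = []  # max-heap on squared distance, stored negated

    def radius_sq():
        # Infinite until k candidates are known, then the k-th best distance.
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                if d2 < radius_sq():
                    if len(heap) == k:
                        heapq.heapreplace(heap, (-d2, p))  # radius shrinks
                    else:
                        heapq.heappush(heap, (-d2, p))
            return
        _, axis, split, left, right = node
        diff = q[axis] - split
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        if diff * diff < radius_sq():  # far side can still hold closer points
            visit(far)

    visit(tree)
    return sorted((-nd, p) for nd, p in heap)
```

Here the search radius starts shrinking once the queue holds k entries; each accepted point after that replaces the current farthest candidate, so whole subtrees on the far side of a split can be skipped.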
Coherent Neighbor Cache (εNN): Find neighbors in slightly bigger radius; Re-use result for spatially close query; Re-use if (condition shown as a formula on the slide, not preserved in this text)
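A minimal sketch of this idea for range queries follows. The reuse test used here is an assumption, not the slide's formula: it is the sufficient condition that follows from the triangle inequality, namely that the new query ball lies entirely inside the cached, enlarged ball. The class name `RangeCache`, the enlargement factor `alpha`, and the brute-force search (standing in for the kd-tree) are all hypothetical.

```python
import math

class RangeCache:
    """Caches one enlarged neighborhood and reuses it for nearby range queries."""

    def __init__(self, points, alpha):
        self.points = points    # brute force over this list stands in for the kd-tree
        self.alpha = alpha      # assumed enlargement factor for the cached radius
        self.center = None
        self.radius = 0.0
        self.candidates = []
        self.hits = 0
        self.misses = 0

    def query(self, q, r):
        # Safe to reuse when ball(q, r) is contained in the cached ball.
        if self.center is not None and math.dist(q, self.center) + r <= self.radius:
            self.hits += 1
        else:
            self.misses += 1
            self.center = q
            self.radius = (1.0 + self.alpha) * r
            self.candidates = [p for p in self.points
                               if math.dist(p, q) <= self.radius]
        # Filter the (small) cached candidate set down to the exact radius.
        return [p for p in self.candidates if math.dist(p, q) <= r]
```

For two queries with the same radius r, the containment test reduces to requiring that the query centers are at most alpha * r apart. A larger `alpha` gives more hits but a larger candidate set to filter on each query.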
Coherent Neighbor Cache (kNN, exact): Find (k+1) neighbors; Re-use result for spatially close query; Re-use if (condition shown as a formula on the slide, not preserved in this text)
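Why the (k+1)-th neighbor helps can be shown with a small sketch. The reuse test below is an assumption derived from the triangle inequality, not a transcription of the slide: if the new query lies at distance e from the cached one, the cached k neighbors are at most d_k + e away and every other point is at least d_(k+1) - e away, so the cached set is still a valid kNN result whenever 2e <= d_(k+1) - d_k. `KnnCache` and `knn_brute` are hypothetical names, and brute force stands in for the kd-tree.

```python
import math

def knn_brute(points, q, k):
    """Stand-in for the kd-tree search: k nearest points by sorting."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

class KnnCache:
    """Caches one kNN result and reuses it while it is provably still exact."""

    def __init__(self, points, k):
        self.points = points   # needs at least k + 1 points
        self.k = k
        self.center = None
        self.neighbors = []
        self.slack = 0.0       # d_(k+1) - d_k measured at the cached query
        self.hits = 0

    def query(self, q):
        if self.center is not None and 2.0 * math.dist(q, self.center) <= self.slack:
            self.hits += 1
            return list(self.neighbors)
        found = knn_brute(self.points, q, self.k + 1)
        d_k = math.dist(found[self.k - 1], q)
        d_k1 = math.dist(found[self.k], q)
        self.center = q
        self.neighbors = found[:self.k]
        self.slack = d_k1 - d_k
        return list(self.neighbors)
```

On a hit the returned set is the same set of points, but it is ordered by distance to the cached query, not to the new one. When neighbors are densely packed the gap d_(k+1) - d_k is small, so exact reuse is rare.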
Coherent Neighbor Cache (kNN, approximation): Approximation error –Enlarge radius; Re-use if (condition shown as a formula on the slide, not preserved in this text)
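How accepting an approximation error enlarges the reuse radius can be illustrated under one explicit assumption about what "approximate" means: every returned neighbor is at most (1 + eps) times as far from the query as the true k-th neighbor. Under that definition, the triangle inequality gives a sufficient reuse distance of ((1 + eps) * d_(k+1) - d_k) / (2 + eps). This is a derived bound for illustration, not the formula from the slide; `reuse_tolerance`, `satisfies_guarantee`, and `eps` are hypothetical names.

```python
import math

def reuse_tolerance(d_k, d_k1, eps=0.0):
    """Largest query offset for which the cached kNN set may be reused.

    d_k and d_k1 are the distances of the k-th and (k+1)-th neighbors at the
    cached query; eps is the accepted relative error (eps = 0 means exact).
    """
    return ((1.0 + eps) * d_k1 - d_k) / (2.0 + eps)

def satisfies_guarantee(points, cached, q, k, eps):
    """Check the assumed (1 + eps) guarantee by brute force."""
    true_kth = sorted(math.dist(p, q) for p in points)[k - 1]
    worst = max(math.dist(p, q) for p in cached)
    return worst <= (1.0 + eps) * true_kth + 1e-12
```

With d_k = 1 and d_(k+1) = 2, the exact tolerance is 0.5, while eps = 0.5 raises it to 0.8. The tolerance grows with eps, so more queries hit the cache.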
The Architecture: (block diagram; surviving label: Host)
Coherent Neighbor Cache: Eight cached neighborhoods; Problem: parallel queries in kd-tree module; Interleave spatially similar queries
Kd-Tree Traversal
Node Recurse: Kd-tree structure on chip; 16 threads; Pipelining and multi-threading
Stacks: 16 stacks; Parallel read/write; Bounded in depth; 6 bytes per thread per recursion
Leaf: 16 parallel priority queues (1-cycle ops); Queues store pointers and distances; Bandwidth bottleneck
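A software model of one such queue is a fixed-capacity list of (distance, pointer) pairs kept sorted by distance. The sketch below is an assumption about the behavior only, not about the circuit: hardware can compare against all slots and shift in parallel to reach one cycle per operation, whereas this model scans sequentially. `BoundedQueue` is a hypothetical name.

```python
class BoundedQueue:
    """Keeps the k smallest (distance, pointer) pairs in ascending order."""

    def __init__(self, k):
        self.k = k
        self.items = []

    def radius(self):
        """Current search radius: infinite until the queue is full."""
        return self.items[-1][0] if len(self.items) == self.k else float("inf")

    def insert(self, dist, pointer):
        """Insert a candidate; return False if it is not closer than the radius."""
        if dist >= self.radius():
            return False
        i = len(self.items)
        while i > 0 and self.items[i - 1][0] > dist:
            i -= 1                      # find the slot that keeps the order
        self.items.insert(i, (dist, pointer))
        if len(self.items) > self.k:
            self.items.pop()            # drop the farthest entry
        return True
```

Storing pointers instead of full point records keeps each entry small; the point data itself is fetched only for distance computation.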
Processing Module: Multithreaded quad-port bank of 16 registers; 128 threads; Programmability using FPGA technology
Further Data: Implemented on two FPGAs –64 bit DDR DRAM –Interconnection: no overhead; Resource usage (registers and LUTs) –Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs –Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs; Clock frequency: 75 MHz
Applications: Tested on various applications; PCI interface of prototype slow; [Weyrich et al. 04] [Adams et al. 07]
Results kNN: (plot; axis labels: Number of Neighbors, Number of queries; factors relative to FPGA, in slide order: CUDA x4, CPU x1.5, FPGA x1; CUDA x2.4, CPU x1.4, FPGA x1, CUDA w/o sort x4.0; CUDA x1.6, CPU x1.1, FPGA x1, CUDA w/o sort (value lost); clock labels: (value lost) MHz, 1200 MHz, 2200 MHz; ASIC estimate, 500 MHz: x6.6); Small hardware footprint; FPGA slightly slower; Realistic clock frequency; Prototype faster than CPU/GPU
Results MLS: (plot; axis labels: Number of Neighbors, Number of queries; factors relative to FPGA: FPGA x1, MLS CPU x0.4, MLS CUDA (value lost); clock labels: (value lost) MHz, 1200 MHz, 2200 MHz); FPGA faster than CPU; kNN bottleneck –FPGA –GPU
Coherent Neighbor Cache: (plot; axis labels: Level of coherence, Number of queries; curves: CPU with approximation parameter = 0.1, FPGA exact, FPGA with approximation parameter = 0.1)
Results Approximation Error (MLS projection): (plots; MLS error vs. approximation, with a "no approx." reference; cache hits vs. approximation)
Approximation Error (visual): Coherent Neighbor Cache: Not optimal for exact queries; Approximate queries –Can be tolerated in most cases –Greatly increase performance –Even for small approximations
Conclusion: Novel hardware architecture for –Nearest-neighbor searches –Generic meshless processing operators; Cache exploiting spatial coherence; Good performance considering resources; Possible GPU integration
Future Work: Programmable data structure –Support different data structures –Programmability in data structure –Construction on-chip; 'Real' programmability in point processing module