A Hardware Processing Unit For Point Sets
S. Heinzle, G. Guennebaud, M. Botsch, M. Gross
Graphics Hardware 2008

Motivation
Point-based graphics is established
Powerful algorithms
– Representation
– Processing
– Manipulation
– Rendering
Decomposition
– Get neighborhood
– Operate on neighbors

Motivation
GPUs are not suited for getting the neighborhood
– SIMD
– Incoherent branching
– Dynamic data structures are slow
– Recursive calls are not supported
CPUs
– Small number of FPUs
– Inflexible memory caches
(Images courtesy of NVIDIA and Intel)

Contributions
Hardware architecture for point sets
– Neighbor search module
– Novel advanced caching mechanism
– Reconfigurable processing module
– Programmability using FPGA compiler
FPGA prototype and measurements
Small & lean → integration into multi-core CPU/GPU possible

Outline
Related Work
Spatial Searching and Caching
Architecture and Prototype
Results
Conclusion

Related Work
Kd-Tree [Bentley 75]
kNN on GPUs [Ma and McCool 02]
Kd-Tree Hardware [Woop et al. 05] [Woop et al. 06]
Kd-Tree on GPUs [Popov et al. 07]

Related Work
Adaptive SPH Fluid Simulation [Adams et al. ’07]
Linear Moving Least Squares [Adamson and Alexa ’04]
Algebraic Moving Least Squares [Guennebaud and Gross ’07]

Linear Moving Least Squares
Implicit surface defined by a set of points p_i with normals n_i
Iterative projections of a query point x onto a plane: x → x′ → x″ → x‴
Surface defined by points projecting onto themselves
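
The iterative projection can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian weight, the support width `h`, the fixed iteration count, and the function name `mls_project` are all assumptions made for illustration, and the sketch weights every input point instead of only the neighbors returned by a spatial search.

```python
import math

def mls_project(x, points, normals, h=1.0, iterations=10):
    """Move x toward the surface implied by oriented points (p_i, n_i)."""
    for _ in range(iterations):
        # Assumed kernel: Gaussian weight of each point relative to the current x.
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (h * h))
             for p in points]
        total = sum(w)
        # Weighted average position and (normalized) weighted average normal.
        avg_p = [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(3)]
        avg_n = [sum(wi * n[d] for wi, n in zip(w, normals)) / total for d in range(3)]
        length = math.sqrt(sum(c * c for c in avg_n))
        avg_n = [c / length for c in avg_n]
        # Signed distance of x to the plane through avg_p with normal avg_n,
        # followed by the orthogonal projection of x onto that plane.
        dist = sum((x[d] - avg_p[d]) * avg_n[d] for d in range(3))
        x = [x[d] - dist * avg_n[d] for d in range(3)]
    return x
```

A point that the loop no longer moves projects onto itself, which is the surface criterion stated on the slide.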

Spatial Search
Spatial search: kNN and εNN
– Common in most point operations
– Based on kd-tree
Example εNN: [figure]
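
A software sketch of a radius query on a kd-tree may help to fix ideas. The node layout (nested dicts), the median split, the leaf size, and the names `build_kdtree` and `eps_nn` are assumptions for illustration; the hardware layout is not described in the transcript.

```python
def build_kdtree(points, depth=0, leaf_size=4):
    """Build a kd-tree as nested dicts; leaves hold short point lists."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build_kdtree(pts[:mid], depth + 1, leaf_size),   # coord <= split
        "right": build_kdtree(pts[mid:], depth + 1, leaf_size),  # coord >= split
    }

def eps_nn(tree, query, radius):
    """Return all points within `radius` of `query`, using an explicit stack."""
    found, stack = [], [tree]
    r2 = radius * radius
    while stack:
        node = stack.pop()
        if "leaf" in node:
            found.extend(p for p in node["leaf"]
                         if sum((a - b) ** 2 for a, b in zip(p, query)) <= r2)
            continue
        delta = query[node["axis"]] - node["split"]
        # Descend into a child only if the query ball overlaps its half-space.
        if delta <= radius:
            stack.append(node["left"])
        if delta >= -radius:
            stack.append(node["right"])
    return found
```

The explicit stack replaces recursion, which matches the motivation slide's remark that recursive calls are not supported on GPUs.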

Spatial Search
kNN search is similar to εNN search:
– Start with infinite radius
– Sort leaf points into priority queue
– Shrink radius with every point sorted
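
The three steps can be sketched in software as follows. This is an illustrative version under assumptions, not the hardware algorithm: the tree layout, `build_kdtree`, `knn`, and the use of a binary heap as the priority queue are all choices made here, and the radius only starts shrinking once the queue holds k points.

```python
import heapq

def build_kdtree(points, depth=0, leaf_size=4):
    """Build a kd-tree as nested dicts (assumed layout); leaves hold points."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"axis": axis, "split": pts[mid][axis],
            "left": build_kdtree(pts[:mid], depth + 1, leaf_size),
            "right": build_kdtree(pts[mid:], depth + 1, leaf_size)}

def knn(tree, query, k):
    """Return the k nearest points to `query`, closest first."""
    heap = []                    # max-heap on squared distance (keys negated)
    radius2 = float("inf")       # squared search radius, initially infinite
    stack = [(tree, 0.0)]        # (node, lower bound on squared distance)
    while stack:
        node, bound2 = stack.pop()
        if bound2 > radius2:
            continue             # radius shrank since this subtree was pushed
        if "leaf" in node:
            for p in node["leaf"]:
                d2 = sum((a - b) ** 2 for a, b in zip(p, query))
                if len(heap) < k:
                    heapq.heappush(heap, (-d2, p))
                elif d2 < radius2:
                    heapq.heapreplace(heap, (-d2, p))
                if len(heap) == k:
                    radius2 = -heap[0][0]   # shrink to current k-th distance
            continue
        delta = query[node["axis"]] - node["split"]
        near, far = ("left", "right") if delta < 0 else ("right", "left")
        # Push the far child first so the near child is processed next.
        stack.append((node[far], max(bound2, delta * delta)))
        stack.append((node[near], bound2))
    return [p for _, p in sorted(heap, reverse=True)]
```

Visiting the near child first tends to fill the queue with close points early, so the radius shrinks quickly and more far subtrees are skipped.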

Coherent Neighbor Cache (εNN)
Find neighbors in slightly bigger radius
Re-use result for spatially close query
Re-use if [formula on slide]
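
The slide's re-use formula is not part of this transcript. The sketch below therefore uses one sufficient condition that follows from the triangle inequality, stated as an assumption: if the cached search used radius (1 + alpha) r around a center c, then any query q with |q - c| <= alpha r has its radius-r ball fully inside the cached ball. The enlargement factor `alpha`, the dict-based cache, and the `search` callback are hypothetical names.

```python
import math

def cached_eps_nn(query, radius, cache, search, alpha=0.2):
    """Answer a radius query, reusing the cached neighborhood when it is safe.

    `search(center, r)` is any exact radius search (assumed interface).
    Returns (points, hit) where `hit` tells whether the cache was reused.
    """
    if cache.get("center") is not None and cache["radius"] == radius:
        shift = math.dist(query, cache["center"])
        # Ball(query, r) lies inside Ball(center, (1 + alpha) r) if shift <= alpha r.
        if shift <= alpha * radius:
            hits = [p for p in cache["points"] if math.dist(p, query) <= radius]
            return hits, True
    # Miss: search a slightly bigger radius and remember the result.
    points = search(query, (1.0 + alpha) * radius)
    cache.update(center=query, radius=radius, points=points)
    return [p for p in points if math.dist(p, query) <= radius], False
```

Filtering the cached superset keeps the answer exact in this sketch; a consumer that weights neighbors by distance could instead use the superset directly.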

Coherent Neighbor Cache (kNN, exact)
Find (k+1) neighbors
Re-use result for spatially close query
Re-use if [formula on slide]
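
The exact re-use formula is not captured in this transcript. One sufficient test, derived here from the triangle inequality and offered as an assumption, compares the query shift with the gap between the k-th and (k+1)-th cached distances: after a shift d, each of the first k neighbors is at most r_k + d away, and every other point is at least r_(k+1) - d away, so the k-set is unchanged when 2d <= r_(k+1) - r_k. The names `cached_knn`, `cache`, and `search` are hypothetical.

```python
import math

def cached_knn(query, k, cache, search):
    """Answer a kNN query, reusing the cached k+1 neighbors when it is safe.

    `search(center, n)` returns the n nearest points, closest first
    (assumed interface); the data set must hold at least k + 1 points.
    Returns (points, hit).
    """
    if cache.get("k") == k:
        center, pts = cache["center"], cache["points"]
        shift = math.dist(query, center)
        r_k = math.dist(pts[k - 1], center)    # distance of the k-th neighbor
        r_k1 = math.dist(pts[k], center)       # distance of the (k+1)-th neighbor
        if 2.0 * shift <= r_k1 - r_k:
            # Same neighbor set; only the ordering can change for the new query.
            return sorted(pts[:k], key=lambda p: math.dist(p, query)), True
    pts = search(query, k + 1)
    cache.update(center=query, k=k, points=pts)
    return pts[:k], False
```

The test is cheap (one distance and one comparison) but conservative: when the k-th and (k+1)-th neighbors are nearly equidistant, almost no shift is allowed, which is consistent with the later remark that the cache is not optimal for exact queries.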

Coherent Neighbor Cache (kNN, approximation)
Approximation error ε
– Enlarge radius
Re-use if [formula on slide]

The Architecture
[Block diagram; labeled component: Host]

Coherent Neighbor Cache
Eight cached neighborhoods
Problem: parallel queries in kd-tree module
→ Interleave spatially similar queries

Kd-Tree Traversal

Node Recurse
Kd-tree structure on chip
16 threads
Pipelining and multi-threading

Stacks
16 stacks
Parallel read/write
Bounded in depth
6 bytes per thread per recursion

Leaf
16 parallel priority queues (1-cycle ops)
Queues store pointers and distances
Bandwidth bottleneck
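
A software analogue of one such queue is sketched below, assuming a fixed capacity k and entries of the form (distance, index), where the index stands in for the stored pointer. The class name `BoundedQueue` and its methods are hypothetical; the hardware performs each operation in one cycle, whereas this sketch takes O(k) per insert.

```python
import bisect

class BoundedQueue:
    """Keeps the k smallest (distance, index) pairs in ascending order."""

    def __init__(self, k):
        self.k = k
        self.items = []

    def insert(self, distance, index):
        """Insert a candidate; return False if it is rejected as too far."""
        if len(self.items) == self.k and distance >= self.items[-1][0]:
            return False
        bisect.insort(self.items, (distance, index))
        del self.items[self.k:]      # evict the farthest entry if over capacity
        return True

    def radius(self):
        """Current search radius: k-th distance once full, else infinity."""
        return self.items[-1][0] if len(self.items) == self.k else float("inf")
```

Storing only an index and a distance per entry keeps each queue small; the point data itself is fetched separately.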

Processing Module
Multithreaded quad-port bank of 16 registers
128 threads
Programmability using FPGA technology

Further Data
Implemented on two FPGAs
– 64-bit DDR DRAM
– Interconnection: no overhead
Resource usage (registers and LUTs)
– Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs
– Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs
Clock frequency: 75 MHz

Applications
Tested on various applications
PCI interface of prototype slow
[Weyrich et al. 04] [Adams et al. 07]

Results kNN
[Plot: number of queries vs. number of neighbors. Labels in slide order: CUDA x4, CPU x1.5, FPGA x1; CUDA x2.4, CPU x1.4, FPGA x1, CUDA w/o sort x4.0; CUDA x1.6, CPU x1.1, FPGA x1, CUDA w/o sort x…. Clock frequencies: … MHz, 1200 MHz, 2200 MHz. ASIC estimate, 500 MHz: x6.6]
Small hardware footprint
FPGA slightly slower
Realistic clock frequency → prototype faster than CPU/GPU

Results MLS
[Plot: number of queries vs. number of neighbors. Labels in slide order: FPGA x1, MLS CPU x0.4, MLS CUDA x…. Clock frequencies: … MHz, 1200 MHz, 2200 MHz]
FPGA faster than CPU
kNN bottleneck
– FPGA
– GPU

Coherent Neighbor Cache
[Plot: number of queries vs. level of coherence. Curves: CPU, ε = 0.1; FPGA, exact; FPGA, ε = 0.1]

Results Approximation Error (MLS projection)
[Plot: MLS error vs. ε approximation, with "no approx." as reference]

Results Approximation Error (MLS projection)
[Plot: cache hits vs. ε approximation]

Approximation Error (visual)
Coherent Neighbor Cache:
Not optimal for exact queries
Approximate queries
– Can be tolerated in most cases
– Greatly increases performance
– Even for small approximations

Conclusion
Novel hardware architecture for
– Nearest-neighbor searches
– Generic meshless processing operators
Cache exploiting spatial coherence
Good performance considering resources
Possible GPU integration

Future Work
Programmable data structure
– Support different data structures
– Programmability in data structure
– Construction on-chip
‘Real’ programmability in point processing module
