Presentation on theme: "A Coherent Grid Traversal Algorithm for Volume Rendering Ioannis Makris Supervisors: Philipp Slusallek*, Céline Loscos *Computer Graphics Lab, Universität."— Presentation transcript:
A Coherent Grid Traversal Algorithm for Volume Rendering Ioannis Makris Supervisors: Philipp Slusallek*, Céline Loscos *Computer Graphics Lab, Universität des Saarlandes UCL Department of Computer Science
2 Overview Introduction Previous work in software Direct Volume Rendering Introduction to the Cell Broadband Engine The Coherent Grid Traversal Algorithm Parallelisation Schemes UCL Department of Computer Science
3 Introduction to Direct Volume Rendering Technique of displaying a 2D projection of a 3D sampled dataset (volume), by accumulating samples across lines of sight with some transfer function. Several types of sampled data. We will only deal with rectilinear grids.
4 Direct Volume Rendering Ray Casting (Levoy 1988, 1990) –Image order algorithm Splatting (Westover 1990) –Object order Shear Warp (Lacroute 1994, 1996) –Hybrid order UCL Department of Computer Science
5 Ray Casting Cast a ray from the viewpoint to the volume for all pixels Obtain samples from the volume in equal intervals, by trilinearly interpolating neighbouring voxels. Accumulate with some operator to get final colour. Several acceleration techniques have been suggested (early ray termination (Levoy 1990), adaptive sampling, octrees (Ogata et al. 1998), kd-trees(Wald et al 2005) UCL Department of Computer Science
6 Shear-Warp Considered the fastest known Direct Volume Rendering algorithm. Steps: –Transform volume to sheared object space –Project sheared slices on an intermediate image –Transform the intermediate image to image space Requires 3 copies of the data, for every principal axis, but RLE compression can help. UCL Department of Computer Science
7 Characteristics of modern x86 processors Deep instruction pipeline. Very sophisticated hardware branch prediction 2 levels of cache, supports software prefetching Rich SIMD instruction set
8 The CELL processor Developed jointly by IBM, Sony and Toshiba Combines a PowerPC general purpose processor with 8 separate SIMD execution units (SPUs). Exceptional FLOPS / cost ratio and more powerful than the Itanium! Needs fast memory, which is relatively expensive UCL Department of Computer Science
9 Notable Characteristics of the SPUs Software managed local store (i.e. no caches) No branch prediction, expensive branch misses SIMD loads/stores ONLY Favors streaming code UCL Department of Computer Science
10 Motivation for a new algorithm Ray Casting algorithms are typically not cache friendly. Performance depends on viewing axis. Acceleration structures may produce non- streaming code and several overheads. Shear Warp may require too much memory for certain data. UCL Department of Computer Science
11 A Coherent Grid Traversal Algorithm for Volume Rendering (1) Original idea from “Ray Tracing Animated Scenes using Coherent Grid Traversal” (Wald et al, SIGGRAPH 2006). Bundles (frustums) of coherent rays are traced in grid space, by incrementaly computing the overlap with grid slices. The overlap of the frustum is computed with a SIMD addition and a SIMD truncation only UCL Department of Computer Science
12 A Coherent Grid Traversal Algorithm for Volume Rendering (2) The volume rendering version of the algorithm uses a “bricked” volume (Sakas et al 1994), bricks replace the grid elements. Bricks are referenced by 3 maps, one for each principal axis. Compression is achieved by not storing empty bricks. UCL Department of Computer Science
13 A Coherent Grid Traversal Algorithm for Volume Rendering (3)
14 A Coherent Grid Traversal Algorithm for Volume Rendering (4) Traversal is performed on the principal axis, using the corresponding map. Indices are computed incrementally. If all the overlapping bricks of a slice are empty, the slice is skipped. If some bricks are empty, they are associated with a locally stored empty brick and processed redundantly (but not fetched). UCL Department of Computer Science
15 A Coherent Grid Traversal Algorithm for Volume Rendering (examples) UCL Department of Computer Science
16 Bundle Parallelisation Bundle Parallelisation is trivial. On a x86 C++ OpenMP implementation, it only required 1 line of code. It is possible to have some blocks fetched multiple times from neighbouring bundles. UCL Department of Computer Science
17 Slice Parallelisation A slice parallelisation is less likely to exhibit this problem, but traversal of brick slices is not incremental! So, how would the processing element know which bundles to process for a given slice? UCL Department of Computer Science
18 Slice Parallelisation Most bundles will start on k=0, or end on k=kmax (or both). During tracing, we create 2 vectors of references to bundles, we shall call them A and D, along with 2 index tables for the corresponding slices we shall call P and Q. The bundles that run through a given slice s can be expressed as Only 2 memory reads are required for that, or no memory reads if the bundles are large enough for A and D to fit in the cache/local store. UCL Department of Computer Science
19 Slice Parallelisation Remaining bundles can take up to 33% (they are about 14% average). We use two more lists, we shall call S and E with index tables M and N. S holds references to the remaining bundles sorted by the first slice they intersect, and E sorted by the last. Remaining bundles that run through s are: We need to run through both these lists to find that out, but this does not hit performance. UCL Department of Computer Science
20 A notable problem of the CGT algorithm as described in [Wald 2006] When the “roll” angle of the bundles to the respective angle of the volume is close to π/4, the number of blocks fetches can be double than the number required. There is a good solution to that (not yet published).
21 Results First results demonstrated an speed increase of up to 2 orders of magnitude from ray-casting. This may increase with further optimisations UCL Department of Computer Science
22 Conclusion We have developed a scalable algorithm for coherent volume traversal with performance on- par with the Shear – Warp, with reduced memory requirements. We demonstrated parallel implementations.
23 Future Work Investigate mixed parallelisation schemes Optimise the computation performed per brick.
24 The End Thank you for your attention Questions? UCL Department of Computer Science