Distributed Interactive Ray Tracing for Large Volume Visualization
Dave DeMarle, Steven Parker, Mark Hartner, Christiaan Gribble, Charles Hansen
Scientific Computing and Imaging Institute, University of Utah

Ray tracing
[Figure: an image divided among CPUs 1-4]
For every pixel, cast a ray and find the first hit object. Every pixel is independent, so image parallelism is a natural choice for acceleration.
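The per-pixel independence above can be sketched as a tile-parallel render loop. This is a minimal illustration, not the system's code: `trace` is a hypothetical stand-in for casting a ray, and the image/tile sizes are toy values.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, TILE = 64, 64, 16  # toy sizes; real tiles are tuned to the network

def trace(x, y):
    # Stand-in for casting a ray and shading the first hit object.
    return (x * 31 + y * 17) % 256

def render_tile(tile_origin):
    # Each tile is fully independent: no shared state is written here.
    ox, oy = tile_origin
    return [(x, y, trace(x, y))
            for y in range(oy, oy + TILE)
            for x in range(ox, ox + TILE)]

def render_frame():
    tiles = [(ox, oy) for oy in range(0, HEIGHT, TILE)
                      for ox in range(0, WIDTH, TILE)]
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    # Tiles can be handed to any worker in any order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for tile in pool.map(render_tile, tiles):
            for x, y, c in tile:
                image[y][x] = c
    return image
```

Because no tile depends on another, the same loop distributes across cluster nodes as easily as across local threads.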
Image parallelism
[Figure: ray traced forest, showing work division]
Interactive ray tracing of scientific data
Large volume visualization
Richtmyer-Meshkov instability simulation
Each timestep is 1920 x 2048 x 2048 x 8 bit (about 7.5 GB)
Zooming in on the test data set
Architectural comparison

SGI test machine                         Cluster test machine
~$1.5 million                            ~$150 thousand
Threaded programming                     Custom or add-on APIs
1x MHz R12K CPUs                         32x2 1.7 GHz Xeon CPUs
64-bit addressing                        32-bit addressing
16 GB RAM (shared)                       32 GB RAM (1 GB per node)
ccNUMA hypercube network                 Switched Gbit Ethernet
335 ns avg round-trip latency (spec)     34,000 ns avg round-trip latency (measured)
12.8 Gbit/s bandwidth (spec)             0.6 Gbit/s bandwidth (measured)
Overcoming cluster limitations

Lack of a parallel programming model
- Build a minimal networking library based on TCP
Lower network performance
- Perform IO asynchronously to overlap computation and communication
- Workers try to keep a backlog; the supervisor does upkeep tasks while workers render
Memory limited to 4 GB and isolated within each node
- Create an object-based DSM in the network library to share memory between nodes
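The backlog idea above can be sketched as a toy supervisor loop: each worker starts with a few assigned tiles, and every returned result is answered with a new assignment, so workers always have queued work while the next assignment is in flight. This is a hypothetical single-process model, not the system's networking code.

```python
from collections import deque

def run_frame(tiles, num_workers, backlog=2):
    """Toy supervisor loop: prime each worker with `backlog` tiles, then
    top up its queue every time it returns a result, so a worker never
    waits idle for an assignment to cross the network."""
    todo = deque(tiles)
    queues = {w: deque() for w in range(num_workers)}
    for w in queues:                       # prime each worker's backlog
        for _ in range(backlog):
            if todo:
                queues[w].append(todo.popleft())
    done = []
    while any(queues.values()):
        for w, q in queues.items():
            if q:
                done.append(q.popleft())   # worker finishes a tile...
                if todo:
                    q.append(todo.popleft())  # ...and gets a new one at once
    return done
```

In the real system the "top up" message rides back with the rendered tile, hiding one round trip of latency per assignment.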
System architecture
[Figure: a supervisor node and worker nodes 1-3; each worker runs two ray threads over its local memory]
Central executive network limitation
[Figure: a moderately complex scene]
Central executive network limitation
Frame rate scales until the supervisor bottleneck dominates.
Central executive network limitation
latency = 19 us per tile; bandwidth = 600 Mbit/s
With enough processors, latency is the limitation
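The ceiling implied by the numbers above can be computed directly: the supervisor must touch every tile once per frame, so no amount of added workers can push the frame rate past 1 / (tiles x per-tile latency). The tile counts below are illustrative examples, not measurements from the talk.

```python
LATENCY_PER_TILE = 19e-6  # seconds of supervisor work per tile (measured above)

def max_frame_rate(tiles_per_frame):
    # Supervisor-bound frame-rate ceiling, independent of worker count.
    return 1.0 / (tiles_per_frame * LATENCY_PER_TILE)

# e.g. a 512x512 image in 32x32 tiles -> 256 tiles -> ceiling near 205 f/s,
# while 16x16 tiles -> 1024 tiles -> ceiling near 51 f/s.
```

Fewer, larger tiles raise the ceiling but worsen load balance, which is why the tile count (not raw CPU count) sets the maximum frame rate.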
Beyond 2^32 bytes
- We make the entire memory space of the cluster usable in our object-based DSM
- Each node owns part of the data
- Application threads acquire bricks from the DSM
- If the data isn't locally owned, the DSM gets a copy from the owner
- The DSM tries to cache bricks so they can be used again
Corrie & Mackerras, 1993, "Parallel Volume Rendering and Data Coherence"
System architecture, extended with SDSM
[Figure: the same supervisor and worker nodes, plus a software distributed shared memory layer spanning the nodes]
Beyond 2^32 bytes
[Figure sequence: three worker nodes, each with a DSM holding its owned bricks, a cache of remote bricks, and a communication thread. Brick ownership is interleaved across the nodes (e.g. node 1 owns bricks 0, 3, 6; node 3 owns bricks 2, 5, 8). Ray threads call acquire(brick); on a miss the communication thread fetches a copy from the owning node. After release, copies stay in the cache for reuse.]
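The acquire/release protocol in the figures can be sketched as a toy, single-process model. The class name, the round-robin ownership rule, and the way "remote" bricks are fetched are all illustrative assumptions; the real DSM runs a communication thread per node over TCP.

```python
class ObjectDSM:
    """Toy object-based DSM: each node owns a slice of the bricks and
    caches copies of bricks fetched from other nodes."""

    def __init__(self, node_id, num_nodes, all_bricks):
        self.node_id = node_id
        # Round-robin ownership, as in the figure above.
        self.owned = {b: data for b, data in all_bricks.items()
                      if b % num_nodes == node_id}
        self.cache = {}
        self.fetches = 0   # simulated network round trips

    def _fetch_remote(self, brick, all_bricks):
        # Stand-in for the communication thread asking the owner for a copy.
        self.fetches += 1
        return all_bricks[brick]

    def acquire(self, brick, all_bricks):
        if brick in self.owned:
            return self.owned[brick]                  # local hit: cheap
        if brick not in self.cache:                   # miss: one round trip
            self.cache[brick] = self._fetch_remote(brick, all_bricks)
        return self.cache[brick]                      # cached copy: cheap

    def release(self, brick):
        # Copies stay cached after release so later rays can reuse them.
        pass
```

Because the volume data is read-only during a frame, no invalidation traffic is needed; caching released copies is always safe.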
Which node owns what?
[Figure: isosurface of the Visible Female, colored by brick ownership]
Acceleration structure
A "macrocell" stores the min and max values inside it; space leaping reduces the number of data accesses.
[Figure: isovalue = 92; a macrocell with max = 85 and one with min = 100 are missed, while one with min = 89, max = 97 is entered]
Parker et al., Vis '98, "Interactive Ray Tracing for Isosurface Rendering"
Acceleration structure
Enter only those macrocells whose [min, max] range contains the isovalue.
[Figure: same scene, isovalue = 92]
Acceleration structure
Recurse inside interesting macrocells, until you have to access the actual volume data.
[Figure: recursion inside the min = 89, max = 97 macrocell; child cells whose ranges exclude the isovalue are missed or not traversed]
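The space-leaping test above can be shown in miniature. This sketch builds one level of macrocells over a 1-D signal for brevity (the real structure is 3-D and multi-level), using the min/max values from the figures.

```python
def build_macrocells(data, cell):
    # One level of macrocells over a 1-D signal: each stores (min, max)
    # of the data values it covers.
    return [(min(data[i:i + cell]), max(data[i:i + cell]))
            for i in range(0, len(data), cell)]

def cells_to_visit(macrocells, isovalue):
    # Space leaping: only enter macrocells whose [min, max] range
    # contains the isovalue; everything else is skipped without
    # touching the underlying volume data.
    return [i for i, (lo, hi) in enumerate(macrocells)
            if lo <= isovalue <= hi]
```

With the figure's values (a cell topping out at 85, a cell bottoming at 100, and a cell spanning 89-97), an isovalue of 92 enters only the last cell.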
Data bricking
Use three-level, 3D tiling for memory coherence:
- 64-byte cache line
- 4 KB OS page
- 4 KB x L^3 network transfer size
For the datasets we've tried, a level-three brick size of 32 KB is the best trade-off between data locality and transmission time.
[Figure: brick address layout, &0 through &15, shown at two tiling levels]
Parker et al., Vis '98, "Interactive Ray Tracing for Isosurface Rendering"
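One level of the tiling above amounts to an address calculation: a voxel's linear offset is its brick's index times the brick volume, plus its offset inside the brick. This sketch shows a single level with a hypothetical brick edge `b`; the system nests three such levels.

```python
def brick_offset(x, y, z, dims, b):
    """Linear offset of voxel (x, y, z) in a volume of size dims stored
    as contiguous b x b x b bricks (one tiling level; dims assumed to be
    multiples of b for simplicity)."""
    nx, ny, nz = dims
    bx, by, bz = x // b, y // b, z // b      # which brick holds the voxel
    ox, oy, oz = x % b, y % b, z % b         # offset inside that brick
    bricks_x, bricks_y = nx // b, ny // b
    brick_index = (bz * bricks_y + by) * bricks_x + bx
    return brick_index * b ** 3 + (oz * b + oy) * b + ox
```

Neighboring voxels inside one brick land in nearby bytes, so a ray marching through a region touches few cache lines, pages, or network bricks.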
Isosurface intersection
We analytically test for ray-isosurface intersection within a voxel by solving a cubic polynomial defined by the ray parameters and the 8 voxel corner values. We use the data gradient for the surface normal.
[Figure: a voxel with corner values; isovalue = 92]
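Along a ray segment through a voxel, the trilinearly interpolated field is a cubic in the ray parameter t. The sketch below finds the first hit numerically, by bracketing a sign change and bisecting, rather than solving the cubic in closed form as the system does; it illustrates the same intersection test.

```python
def trilinear(corners, u, v, w):
    # corners[i][j][k] is the value at voxel corner (i, j, k).
    return sum(corners[i][j][k]
               * (u if i else 1 - u)
               * (v if j else 1 - v)
               * (w if k else 1 - w)
               for i in (0, 1) for j in (0, 1) for k in (0, 1))

def intersect_voxel(corners, p0, p1, isovalue, iters=40):
    """First ray/isosurface hit on the segment p0 -> p1 inside one voxel
    (coordinates in [0, 1]^3). Returns the parameter t, or None."""
    def f(t):
        u, v, w = (p0[i] + t * (p1[i] - p0[i]) for i in range(3))
        return trilinear(corners, u, v, w) - isovalue

    n = 16                              # coarse samples of the cubic
    prev_t, prev_f = 0.0, f(0.0)
    for s in range(1, n + 1):
        t = s / n
        ft = f(t)
        if prev_f == 0.0:
            return prev_t
        if prev_f * ft < 0:             # bracketed a root: bisect it
            a, b = prev_t, t
            for _ in range(iters):
                m = 0.5 * (a + b)
                if f(a) * f(m) <= 0:
                    b = m
                else:
                    a = m
            return 0.5 * (a + b)
        prev_t, prev_f = t, ft
    return None                         # no intersection in this voxel
```

The analytic cubic solve the authors use avoids the sampling step entirely, which matters when this test runs millions of times per frame.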
Benchmark test
[Plot: isovalue and viewpoint varied as a function of frame number]
Consolidated data access
- Most of the time is spent accessing data that is locally owned or cached (7 us hit time, 600 us miss time, 98% hit rate)
- Reduce the number of DSM accesses by eliminating redundant accesses
- When a ray needs data, sort the accesses to get all needed data in one shot
Acquire on every voxel corner
# accesses = 3,279,000 per worker per frame; frame rate = 0.115 f/s
[Figure: a macrocell spanning bricks 1-6]
Acquire once per cell
# accesses = 453,400 per worker per frame; frame rate = 0.709 f/s
[Figure: a macrocell spanning bricks 1-6]
Acquire once per macrocell
# accesses = 53,290 per worker per frame; frame rate = 1.69 f/s
[Figure: a macrocell spanning bricks 1-6]
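The consolidation above boils down to deduplication: instead of acquiring a brick for every voxel-corner read, collect the bricks a traversal step will touch and acquire each one once. The coordinates and the 32^3 brick size below are hypothetical examples.

```python
def bricks_for_corners(corners, brick_size):
    # Map each voxel corner to the brick that holds it, then deduplicate,
    # so every needed brick is acquired exactly once for this step.
    needed = {tuple(c // brick_size for c in corner) for corner in corners}
    return sorted(needed)

# Hypothetical cell straddling a brick boundary: the 8 corners of the
# voxel at (31, 7, 7), with 32^3 bricks.
corners = [(31 + i, 7 + j, 7 + k)
           for i in (0, 1) for j in (0, 1) for k in (0, 1)]
```

Here eight corner reads collapse to two brick acquires, and a cell entirely inside one brick needs only one, which is where the order-of-magnitude drop in access counts comes from.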
Results - cache behaviour
[Plot: measured frame rate (f/s) and transfers (MB/node/frame) vs frame number, with ISO and VIEW phases marked]
Results - cache behaviour
[Plot: filled vs measured frame rate (f/s) over 1290 frames]
Cache fill costs 22% here.
Results - machine comparison
SGI 1x31 CPUs: avg = 4.7 f/s
Cluster 31x2 CPUs: avg = 1.7 f/s
Cluster 31x1 CPUs: avg = 1.1 f/s
Results - machine comparison
SGI 1x31 CPUs: avg = 5.7 f/s
Cluster 31x2 CPUs: avg = 1.5 f/s
SGI is 3.8x faster.
Results - machine comparison
SGI 1x31 CPUs: avg = 4.2 f/s
Cluster 31x2 CPUs: avg = 2.6 f/s
SGI is 1.6x faster.
Conclusions
- Confirmed that interactive ray tracing on a cluster is possible
- Scaling is limited by latency; the number of tiles determines the maximum frame rate
- Data sets that exceed the memory space of any one node can be handled with a DSM
- For isosurfacing, hit time is the limiting factor, not network time
- Overheads make the cluster slower than the supercomputer, but the new solution has a significant price advantage
Future work
Make it faster!
- Use a lower-latency network
- Remove the central bottleneck
- Use block prefetch and ray rescheduling
- Optimize the DSM for faster hit times
- Use more parallelism: SIMD, hyperthreading, GPU
Acknowledgments
NSF Grants, DOE VIEWS, NIH Grants
Mark Duchaineau at LLNL
Our anonymous reviewers
For more information