1 DISTRIBUTED INTERACTIVE RAY TRACING FOR LARGE VOLUME VISUALIZATION Dave DeMarle May 1, 2003

2 Thesis: It is possible to visualize multi-gigabyte datasets interactively using ray tracing on a cluster.

3 Outline Background. Related work. Communication. Ray tracing with replicated data. Distributed shared memory. Ray tracing large volumes.

4 Ray Tracing For every pixel, compute a ray from a viewpoint into space, and test for intersection with every object. Take the nearest hit object’s color for the pixel. Shadows, reflections, refractions and photorealistic effects simply require more rays.
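A minimal C++ sketch of the loop this slide describes; the sphere scene, camera, and ASCII shading are purely illustrative and not from *-Ray:

    // Per-pixel loop from the slide: for every pixel, shoot a ray from the
    // viewpoint, test it against every object, and keep the nearest hit.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Vec { double x, y, z; };
    static Vec sub(Vec a, Vec b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    static double dot(Vec a, Vec b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    struct Sphere { Vec center; double radius; char shade; };

    // Return the parametric hit distance t, or a negative value on a miss.
    static double intersect(const Sphere& s, Vec o, Vec d) {
        Vec oc = sub(o, s.center);
        double a = dot(d, d), b = 2.0 * dot(oc, d);
        double c = dot(oc, oc) - s.radius * s.radius;
        double disc = b * b - 4.0 * a * c;
        return disc < 0.0 ? -1.0 : (-b - std::sqrt(disc)) / (2.0 * a);
    }

    int main() {
        std::vector<Sphere> scene = {{{0, 0, 5}, 1.5, '#'}, {{1.5, 1, 7}, 1.0, '+'}};
        for (int j = 0; j < 24; ++j) {               // for every pixel...
            for (int i = 0; i < 48; ++i) {
                Vec origin{0, 0, 0};                 // ...compute a ray from the viewpoint
                Vec dir{(i - 24) / 24.0, (12 - j) / 12.0, 1.0};
                char pixel = '.';
                double nearest = 1e30;
                for (const Sphere& s : scene) {      // test against every object
                    double t = intersect(s, origin, dir);
                    if (t > 0.0 && t < nearest) { nearest = t; pixel = s.shade; }
                }
                std::putchar(pixel);                 // take the nearest hit's shade
            }
            std::putchar('\n');
        }
    }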

5 (Image-only slide.)

6 Interactive Ray Tracing 1998: *-Ray, an image-parallel renderer optimized for the SGI Origin shared-memory supercomputer. My work moves this program to a cluster in order to make it less expensive.

7 (Diagram: the image divided into regions rendered by CPU 1, CPU 2, CPU 3, and CPU 4.)

8 Ray-traced forest scene; showing task distribution.

9 Cluster Computing Connect inexpensive machines. Advantages: Cheaper. Faster growth curve in the commodity market. Disadvantages: Slower network. Separate memory.

10 Ray vs. Nebula
                              Ray                       Nebula
    Cost                      ~$1.5 million             ~$150 thousand
    CPUs                      32 x 0.39 GHz R12K        2x32 1.7 GHz Xeon
    RAM                       16 GB (shared)            32 GB (1 GB per node)
    Network                   NUMA hypercube            Switched Gbit Ethernet
    Avg round-trip latency    335 ns                    34,000 ns
    Bandwidth                 12.8 Gbit/sec             0.6 Gbit/sec

11 Related Work 2001: Saarland Renderer Trace 4 rays with SIMD operations. Obtain data from a central server. Limited to triangular data. My work keeps *-Ray’s flexibility, and uses distributed ownership.

12 Related Work 1993: Corrie and Mackerras Volume rendering on a Fujitsu AP1000. My work uses recent hardware, and multithreading on each node, to achieve interactivity.

13 Outline Background. Related work. Communication. Ray tracing with replicated data. Distributed shared memory. Ray tracing large volumes.

14 Communication: Legion. Goal 1: reduce library overhead; built on top of TCP. Goal 2: reduce wait time; a dedicated communication thread handles incoming traffic.

15 Inbound: select(), read the header, call the handler function. Outbound: sends are protected with a mutex for thread safety. (Diagram, Node 0: Comp Threads 1 through T call Communicator::send(); the Communicator Thread select()s on the network and dispatches incoming messages to handler_1() through handler_h().)
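A minimal C++ sketch of the communicator-thread pattern on this slide, using POSIX sockets and threads; the MsgHeader layout, handler table, and function names are assumptions, not the actual Legion interface:

    // Dedicated communicator thread: select(), read a header, dispatch to a handler.
    // Compute threads send through a mutex-protected Communicator::send() analogue.
    #include <sys/select.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>
    #include <mutex>

    struct MsgHeader { uint32_t type; uint32_t length; };   // hypothetical wire header

    static void handle_task(int fd, const MsgHeader& h);    // stands in for handler_1()..handler_h()
    using Handler = void (*)(int, const MsgHeader&);
    static Handler handler_table[] = { handle_task };

    static std::mutex send_mutex;   // outbound: compute threads share the socket safely

    void communicator_send(int fd, const void* buf, size_t len) {
        std::lock_guard<std::mutex> lock(send_mutex);
        write(fd, buf, len);
    }

    // Inbound loop run by the communicator thread.
    void communicator_loop(int fd) {
        for (;;) {
            fd_set readfds;
            FD_ZERO(&readfds);
            FD_SET(fd, &readfds);
            if (select(fd + 1, &readfds, nullptr, nullptr, nullptr) <= 0) continue;
            MsgHeader hdr;
            if (read(fd, &hdr, sizeof hdr) != (ssize_t)sizeof hdr) return;   // read header
            if (hdr.type < sizeof handler_table / sizeof handler_table[0])
                handler_table[hdr.type](fd, hdr);                            // call function
        }
    }

    static void handle_task(int fd, const MsgHeader& h) {
        (void)fd; (void)h;   // a real handler would read h.length payload bytes here
    }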

16 Outline Background. Related work. Communication. Ray tracing with replicated data. Distributed shared memory. Ray tracing large volumes.

17 Distributed Ray Tracer Implementation Image Parallel Ray Tracer. Supervisor/Workers program structure. Each node runs a multithreaded application. Replicate data if it fits in each node’s memory. Use Distributed Shared Memory (DSM) for larger volumetric data.

18 (Diagram: the user and the image connect to the supervisor, which distributes work to Workers 1, 2, and 3; each worker runs Render Threads 1 and 2.)

19 Supervisor Program (Diagram, Node 0: the Communicator manages Scene State, Frame State, and Task State; a Display Thread and auxiliary display threads assemble the Image.)

20 Worker Program (Diagram, Node N: the Communicator manages Scene State, Frame State, the ViewManager, and the TaskManager; the TaskManager feeds a TaskQueue consumed by Render Threads 1 through N, which trace the Scene.)

21 Render State Data that *-Ray passed by reference between functional units is now transferred over the network. SceneState – constant over a session: acceleration structure type, number of workers… FrameState – can change each frame: camera position, image resolution… TaskState – changes during a frame: pixel tile assignments.
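A sketch of what the three state messages might look like as plain structs; every field below is an assumption drawn only from the examples the slide gives, not the *-Ray source:

    #include <cstdint>

    struct SceneState {                 // constant over a session
        uint32_t accel_structure_type;  // acceleration structure type
        uint32_t num_workers;
    };

    struct FrameState {                 // can change each frame
        float    camera_position[3];
        float    camera_direction[3];
        uint32_t image_width, image_height;
    };

    struct TaskState {                  // changes during a frame
        uint32_t tile_x, tile_y;        // pixel tile assignment
        uint32_t tile_width, tile_height;
    };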

22 TaskManager keeps a local queue of tasks. Two semaphores guard the queue. (Diagram: the supervisor sends tiles to Worker 1's TaskManager, which places them in the TaskQueue; Render Threads 1 and 2 take tiles and return the results to the image.)
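A minimal C++ sketch of a tile queue guarded by two counting semaphores, as the slide describes; the Tile type, the queue capacity, and the extra mutex protecting the underlying container are assumptions:

    #include <semaphore.h>
    #include <deque>
    #include <mutex>

    struct Tile { int x, y, w, h; };

    class TaskQueue {
    public:
        TaskQueue()  { sem_init(&filled_, 0, 0); sem_init(&empty_, 0, 64); }
        ~TaskQueue() { sem_destroy(&filled_); sem_destroy(&empty_); }

        // TaskManager thread: enqueue a tile received from the supervisor.
        void push(const Tile& t) {
            sem_wait(&empty_);                                    // wait for a free slot
            { std::lock_guard<std::mutex> l(m_); q_.push_back(t); }
            sem_post(&filled_);                                   // wake a render thread
        }

        // Render thread: take the next tile to trace.
        Tile pop() {
            sem_wait(&filled_);                                   // wait for an available tile
            Tile t;
            { std::lock_guard<std::mutex> l(m_); t = q_.front(); q_.pop_front(); }
            sem_post(&empty_);                                    // free the slot
            return t;
        }

    private:
        sem_t filled_, empty_;
        std::mutex m_;
        std::deque<Tile> q_;
    };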

23 Network Limitation The maximum frame rate is determined by the network: 19 μs of queuing per tile, 600 Mbit/sec bandwidth. (Chart: frame rate for 1, 8, 12, 16, and 31 nodes.)
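A back-of-the-envelope illustration of that queuing limit; the 512x512 image and 32x32-pixel tile size are assumptions, only the 19 μs per-tile cost comes from the slide:

    #include <cstdio>

    int main() {
        const double per_tile_us = 19.0;               // from the slide
        const int tiles = (512 / 32) * (512 / 32);     // 256 tiles (assumed tiling)
        const double frame_us = tiles * per_tile_us;   // ~4.9 ms of queuing per frame
        std::printf("queuing-limited max frame rate: ~%.0f f/s\n", 1e6 / frame_us);
    }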

24 Replicated Comparison

25 Machine Comparison with Replicated Data (Chart: results for 1, 8, 16, 24, and 31 nodes.)

26 Outline Background. Related work. Communication. Ray tracing with replicated data. Distributed shared memory. Ray tracing large volumes.

27 Large Volumes Richtmyer-Meshkov instability simulation from Lawrence Livermore National Laboratory. 1920x2048x2048 voxels at 8 bits per voxel (over 7.5 GB per timestep).

28 Legion’s DSM: the DataServer class. Compute threads call acquire to obtain blocks of memory. The DataServer finds and returns the requested block. Compute threads call release to let the DataServer reuse the space. The DataServer uses Legion to transfer blocks over the network. Each node owns the blocks in its resident_set area and caches remotely owned blocks in its local_cache area. Five DataServer flavors: single-threaded, multithreaded direct-mapped, associative, mmap-from-disk, and writable.

29 (Diagram: Nodes 0, 1, and 2. On each node a compute thread calls get_data()/release_data() on the local DataServer, which keeps the blocks that node owns in resident_set and remotely owned blocks in local_cache; blocks move between nodes through the communicator threads.)
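A minimal C++ sketch of the acquire/release interface described on slide 28 (the diagram labels these get_data()/release_data()); the block size, ownership layout, and the stubbed network fetch are assumptions, with the Legion transfer elided:

    // DataServer sketch: owned blocks live in resident_set, remote blocks are
    // fetched (here stubbed) and cached in local_cache.
    #include <cstdint>
    #include <cstring>
    #include <unordered_map>
    #include <vector>

    using BlockId = uint32_t;
    constexpr size_t kBlockBytes = 4096;

    class DataServer {
    public:
        DataServer(BlockId first_owned, BlockId num_owned)
            : first_owned_(first_owned),
              resident_set_(num_owned, std::vector<uint8_t>(kBlockBytes)) {}

        // Compute thread: obtain a block, pulling it into local_cache if remote.
        const uint8_t* acquire(BlockId id) {
            if (id >= first_owned_ && id < first_owned_ + resident_set_.size())
                return resident_set_[id - first_owned_].data();     // owned locally
            auto it = local_cache_.find(id);
            if (it == local_cache_.end()) {
                std::vector<uint8_t> buf(kBlockBytes);
                fetch_from_owner(id, buf.data());                   // network transfer
                it = local_cache_.emplace(id, std::move(buf)).first;
            }
            return it->second.data();
        }

        // Compute thread: let the DataServer reuse the cache slot.
        void release(BlockId /*id*/) { /* e.g. drop a reference count on the slot */ }

    private:
        void fetch_from_owner(BlockId, uint8_t* dst) {
            std::memset(dst, 0, kBlockBytes);   // stand-in for the Legion request/reply
        }
        BlockId first_owned_;
        std::vector<std::vector<uint8_t>> resident_set_;
        std::unordered_map<BlockId, std::vector<uint8_t>> local_cache_;
    };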

30 Outline Background. Related work. Communication. Ray tracing with replicated data. Distributed shared memory. Ray tracing large volumes.

31 Large Volumes Use distributed versions of *-Ray’s templated volume classes, which treat the DataServer as a 3D array. (Diagram: DISOVolume and DMIPVolume sit on DBrickArray3, which translates Data(x,y,z) into Block Q, Offset R requests to the DataServer.)
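A sketch of the Data(x,y,z)-to-(Block Q, Offset R) translation the diagram shows, building on the DataServer sketch above; the brick edge length, voxel layout, and class shape are assumptions, not the actual DBrickArray3 code:

    // Bricked 3D array on top of the DataServer sketch above. A 16^3 brick of
    // 8-bit voxels matches the assumed 4 KB block size; all names are illustrative.
    template <typename T>
    class DBrickArray3 {
    public:
        DBrickArray3(DataServer* ds, int nx, int ny, int nz, int brick_edge = 16)
            : ds_(ds), b_(brick_edge),
              bx_((nx + brick_edge - 1) / brick_edge),
              by_((ny + brick_edge - 1) / brick_edge) { (void)nz; }

        T Data(int x, int y, int z) {
            // Which brick (Block Q) holds this voxel?
            int qx = x / b_, qy = y / b_, qz = z / b_;
            BlockId q = static_cast<BlockId>((qz * by_ + qy) * bx_ + qx);
            // Offset R of the voxel inside that brick.
            int rx = x % b_, ry = y % b_, rz = z % b_;
            size_t r = (static_cast<size_t>(rz) * b_ + ry) * b_ + rx;
            const T* block = reinterpret_cast<const T*>(ds_->acquire(q));
            T value = block[r];
            ds_->release(q);
            return value;
        }

    private:
        DataServer* ds_;
        int b_, bx_, by_;
    };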

32 Isosurface of the Visible Female; showing data ownership.

33 Optimized Data Access for Large Volumes Use three-level bricking for memory coherence: 64-byte cache line, 4 KB OS page, and a network transfer size of 4 KB × L³. Third-level bricks = DataServer blocks. Use a macrocell hierarchy to reduce the number of accesses.
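A small sketch of the macrocell idea, assuming each macrocell caches the min and max of the voxels beneath it so traversal can skip regions that cannot contain the isovalue:

    #include <cstdint>

    struct Macrocell { uint8_t vmin, vmax; };

    inline bool may_contain_isosurface(const Macrocell& mc, uint8_t isovalue) {
        return mc.vmin <= isovalue && isovalue <= mc.vmax;
    }
    // Traversal idea: if may_contain_isosurface() is false, step the ray past the
    // whole macrocell without acquiring any of its bricks from the DSM; otherwise
    // descend and fetch the needed bricks through the DataServer.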

34 Results with Distributed Data Hit time of 6.86 μs or higher; the associative DataServer takes longer. Miss time of 390 μs or higher; larger bricks take longer. Empirically, if the local cache is >10% of the data size, we get >95% hit rates for isosurfacing and MIPing. Investigated techniques to increase the hit rate and reduce the number of accesses.
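A worked example of the effective access time these numbers imply, assuming the 95% hit rate quoted above:

    // Effective access time = hit_rate * hit_time + miss_rate * miss_time.
    #include <cstdio>

    int main() {
        const double hit_us = 6.86, miss_us = 390.0, hit_rate = 0.95;
        const double avg_us = hit_rate * hit_us + (1.0 - hit_rate) * miss_us;
        std::printf("average DSM access: ~%.1f microseconds\n", avg_us);   // ~26 us
    }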

35 Consolidated Access Hit time is usually the limiting factor, so reduce the number of DSM accesses. Eliminate redundant accesses. When a ray needs data, sort the accesses so that all the needed data inside a brick is gathered with one DSM access.
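A sketch of consolidated access built on the DataServer and bricking sketches above: the voxels needed for one trilinear interpolation are sorted by brick so each brick is acquired only once. The VoxelRef bookkeeping is an assumption:

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct VoxelRef { BlockId brick; size_t offset; int corner; };

    void gather_cell(DataServer& ds, std::array<VoxelRef, 8>& refs, uint8_t out[8]) {
        // Sort by brick so voxels sharing a brick are adjacent.
        std::sort(refs.begin(), refs.end(),
                  [](const VoxelRef& a, const VoxelRef& b) { return a.brick < b.brick; });
        size_t i = 0;
        while (i < refs.size()) {
            BlockId current = refs[i].brick;
            const uint8_t* block = ds.acquire(current);      // one DSM access per brick
            for (; i < refs.size() && refs[i].brick == current; ++i)
                out[refs[i].corner] = block[refs[i].offset];
            ds.release(current);
        }
    }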

36 Consolidated Access (Diagram: Bricks 1 through 6 with an overlaid macrocell.)

37 Consolidated Access (Diagram continued.)

38 Consolidated Access (Diagram continued.)

39 (Charts, 2 GB data set: frames/sec and acquires/node/frame for Access 1, Access 8, and Access X.)

40 Machine Comparison Use the Richtmyer-Meshkov data set to compare the distributed ray tracer with *-Ray, and to determine how data sharing affects the cluster program.

41 (Chart: machine comparison results.)

42 Traffic When the entire volume is in view, it takes a few frames for the caches to load, which slows down the renderer. When only a portion is in view, the working set is small and network traffic is not an issue.

43 (Chart: MB/node and frames/sec as the isovalue and viewpoint change.)

44 (Chart: per-frame measurements versus frame number.)

45 Images: Treepot scene. 2 million polygons, 512x512, 1 hard shadow, ~1 f/s. CPU bound, not network bound.

46 Images: Richtmyer-Meshkov, timestep 270. 1920x2048x2048 data rendered at 512x512, 1 to 2 f/s with 1 hard shadow. CPU or network bound, depending on the viewpoint.

47 Images Focusing in…

48 Images Focusing in…

49 Images Focusing in…

50 Images Focusing in…

51 Images Focusing in…

52 Images Focusing in…

53 Conclusion Confirmed that interactive ray tracing on a cluster is possible. Scaling and the ultimate frame rate are limited by latency; the number of tasks in the image determines the maximum frame rate. With reasonably complex scenes the renderer is CPU bound, even with 62 processors. With tens of processors, the cluster is comparable to the supercomputer.

54 Conclusion Data sets that exceed the memory space of any one node can be managed with a DSM. For isosurfacing and MIPing, hit time is the limiting factor, not network time. The longer data access time makes the cluster slower than the supercomputer, but it is still interactive.

55 Future Work Go faster, toward realistic images at interactive rates: a faster network layer, a faster DSM, faster ray tracing. Direct volume rendering. Distributed polygonal data sets.

56 Acknowledgments NSF Grants 9977218, 9978099. DOE Views. NIH Grants. My Committee, Steve, Chuck and Pete. Patti DeMarle. Thanks to everyone else, for making this a great place to live and work!

