Distributed Interactive Ray Tracing for Large Volume Visualization. Dave DeMarle, Steven Parker, Mark Hartner, Christiaan Gribble, Charles Hansen.

Scientific Computing and Imaging Institute, University of Utah. Ray tracing: for every pixel, cast a ray and find the first hit object. Every pixel is independent, so image parallelism is a natural choice for acceleration. (Figure: rays from one viewpoint divided among CPU 1 through CPU 4.)
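
For concreteness, a minimal image-parallel sketch (not the paper's code; render_pixel is a stub and spawning one thread per tile is purely illustrative): every tile of the image can be rendered independently, which is the same property the cluster system exploits by handing tiles to worker nodes.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Stub standing in for a real ray cast: returns a fake grey value per pixel.
static uint32_t render_pixel(int x, int y) { return static_cast<uint32_t>((x ^ y) & 0xFF); }

// Render an image by splitting it into square tiles and handing each tile to
// its own thread. Every pixel is independent, so the only synchronization
// needed is joining the threads at the end of the frame.
static void render_image(int width, int height, int tile, std::vector<uint32_t>& img) {
    std::vector<std::thread> workers;
    for (int ty = 0; ty < height; ty += tile)
        for (int tx = 0; tx < width; tx += tile)
            workers.emplace_back([=, &img] {
                for (int y = ty; y < ty + tile && y < height; ++y)
                    for (int x = tx; x < tx + tile && x < width; ++x)
                        img[y * width + x] = render_pixel(x, y);
            });
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<uint32_t> img(512 * 512);
    render_image(512, 512, 32, img);   // 256 tiles, one thread per tile (fine for a sketch)
}
```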

Scientific Computing and Imaging Institute, University of Utah. (Figure) A ray traced forest rendered with image parallelism, showing the work division.

Scientific Computing and Imaging Institute, University of Utah. Interactive ray tracing of scientific data.

Scientific Computing and Imaging Institute, University of Utah. Large volume visualization: the Richtmyer-Meshkov instability simulation. Each timestep is 1920 x 2048 x 2048 x 8 bit (about 7.5 GB).

Scientific Computing and Imaging Institute, University of Utah. Zooming in on the test data set.

Scientific Computing and Imaging Institute, University of Utah. Architectural comparison (SGI test machine vs. cluster test machine):
- Cost: ~$1.5 million vs. ~$150 thousand
- Programming: threaded programming vs. custom or add-on APIs
- CPUs: 1x MHz R12K CPUs vs. 32x2 1.7 GHz Xeon CPUs
- Addressing: 64 bit vs. 32 bit
- RAM: 16 GB (shared) vs. 32 GB (1 GB per node)
- Interconnect: ccNUMA hypercube network vs. switched Gbit Ethernet
- Round-trip latency: 335 ns avg (spec) vs. 34000 ns avg (measured)
- Bandwidth: 12.8 Gbit/sec (spec) vs. 0.6 Gbit/sec (measured)

Scientific Computing and Imaging Institute, University of Utah. Overcoming cluster limitations:
- Lack of a parallel programming model: build a minimal networking library based on TCP.
- Lower network performance: perform I/O asynchronously to overlap computation and communication; workers try to keep a backlog (see the sketch below), and the supervisor does upkeep tasks while the workers render.
- Memory limited to 4 GB and isolated within each node: create an object-based DSM in the network library to share memory between nodes.
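
As an illustration of the worker backlog mentioned in the second bullet, here is a rough, self-contained sketch; the message functions are placeholders backed by local queues, since the paper's minimal TCP library is not shown here. The point is that the worker always has one tile request outstanding, so the supervisor round trip is hidden behind rendering.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Placeholder message layer: a local queue stands in for the supervisor and
// the network so the sketch compiles and runs stand-alone.
struct Tile { int x = 0, y = 0; bool valid = false; };
static std::queue<Tile> g_supervisor;   // tiles the supervisor still has to hand out
static std::queue<Tile> g_in_flight;    // requests this worker has already issued

static void request_tile_async() {      // fire off a request and return at once
    Tile t;                             // an invalid tile means "no more work"
    if (!g_supervisor.empty()) { t = g_supervisor.front(); g_supervisor.pop(); }
    g_in_flight.push(t);
}
static Tile wait_for_tile() { Tile t = g_in_flight.front(); g_in_flight.pop(); return t; }
static std::vector<uint32_t> render_tile(const Tile&) { return std::vector<uint32_t>(32 * 32); }
static void send_result(const Tile&, const std::vector<uint32_t>&) {}

// Worker main loop: keep one request in flight (a one-tile backlog) so the
// supervisor round trip overlaps with rendering of the current tile.
static void worker_loop() {
    request_tile_async();
    Tile current = wait_for_tile();                  // prime the pipeline
    while (current.valid) {
        request_tile_async();                        // ask for the next tile now...
        send_result(current, render_tile(current));  // ...and render while it travels
        current = wait_for_tile();
    }
}

int main() {
    for (int i = 0; i < 8; ++i) g_supervisor.push(Tile{i * 32, 0, true});
    worker_loop();
}
```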

Scientific Computing and Imaging Institute, University of Utah. System architecture (diagram): a supervisor node plus worker nodes 1-3, each worker running ray threads 1 and 2 against its own local memory.

Scientific Computing and Imaging Institute, University of Utah. Central executive network limitation (figure: a moderately complex test scene).

Scientific Computing and Imaging Institute, University of Utah. Central executive network limitation: the frame rate scales until the supervisor bottleneck dominates.

Scientific Computing and Imaging Institute, University of Utah. Central executive network limitation: latency = 19 µs per tile; bandwidth = 600 Mbit/s.

Scientific Computing and Imaging Institute, University of Utah. With enough processors, latency is the limitation.
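
A back-of-the-envelope illustration of that ceiling (the image and tile sizes below are assumptions, not the benchmark's actual settings): if the supervisor pays roughly 19 µs of round-trip latency handing out each tile, a frame of T tiles can never complete faster than T times that latency, no matter how many workers render.

```latex
% Assumed example: a 512 x 512 image cut into 32 x 32 pixel tiles gives T = 256 tiles;
% t_tile = 19 microseconds of supervisor round-trip latency per tile (from the slide above).
\[
  f_{\max} \;\le\; \frac{1}{T\, t_{\mathrm{tile}}}
           \;=\;   \frac{1}{256 \times 19\,\mu\mathrm{s}}
           \;\approx\; 205\ \mathrm{frames/s}
\]
```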

Scientific Computing and Imaging Institute, University of Utah. Beyond 2^32 bytes: we make the entire memory space of the cluster usable in our object-based DSM. Each node owns part of the data, and application threads acquire bricks from the DSM. If the data isn't locally owned, the DSM gets a copy from the owner; the DSM tries to cache blocks to use again. (Corrie & Mackerras, 1993, “Parallel Volume Rendering and Data Coherence”.)

Scientific Computing and Imaging Institute, University of Utah. System architecture, extended with SDSM (diagram): the supervisor node and worker nodes 1-3 as before, with a software distributed shared memory layer added alongside each node's local memory.

Scientific Computing and Imaging Institute, University of Utah. Beyond 2^32 bytes (diagram): each node runs its ray threads, a communication thread, and a DSM holding the bricks that node owns (brick ids are interleaved across the nodes, e.g. 0, 3, 6 on one node and 2, 5, 8 on another) plus a cache for copies of remote bricks.

Scientific Computing and Imaging Institute, University of Utah. Beyond 2^32 bytes (diagram): ray threads call acquire() for the bricks they need; requests such as acquire(4), acquire(3), and acquire(2) for bricks not owned locally are serviced by the communication threads, which fetch copies from the owning nodes into the local caches.

Scientific Computing and Imaging Institute, University of Utah. Beyond 2^32 bytes (diagram): when a thread is finished with a brick it calls release(); here the threads issue acquire(7)/release(4), acquire(8)/release(3), and acquire(4)/release(2), and the released bricks stay in the caches for later reuse.
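
A minimal sketch of the object-based DSM interface the diagrams describe; the ownership rule, brick size, and fetch_from_owner are assumptions, and the communication thread that answers other nodes' requests is omitted.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// One brick of volume data, identified by a global brick id.
using BrickId = int;
using Brick   = std::vector<uint8_t>;

class ObjectDSM {
public:
    ObjectDSM(int my_rank, int num_nodes) : rank_(my_rank), nodes_(num_nodes) {}

    // Acquire a read-only view of a brick: owned bricks are returned directly,
    // remote bricks are fetched from their owner on a miss and cached locally.
    const Brick& acquire(BrickId id) {
        if (owner_of(id) == rank_) return owned_[id];    // loaded at startup in the real system
        auto it = cache_.find(id);
        if (it == cache_.end())                          // miss: remote fetch
            it = cache_.emplace(id, fetch_from_owner(id)).first;
        return it->second;                               // hit: local copy
    }

    // Release a brick acquired earlier. Cached copies stay resident so later
    // acquires become cheap hits; a real implementation would track reference
    // counts so cold bricks can be evicted when the cache fills.
    void release(BrickId) {}

private:
    int   owner_of(BrickId id) const { return id % nodes_; }       // assumed static ownership
    Brick fetch_from_owner(BrickId)  { return Brick(32 * 1024); }  // stand-in for a network fetch

    int rank_, nodes_;
    std::unordered_map<BrickId, Brick> owned_;   // bricks this node is responsible for
    std::unordered_map<BrickId, Brick> cache_;   // copies of other nodes' bricks
};
```

A ray thread brackets each group of data accesses with acquire()/release(), as in the acquire/release sequences above.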

Scientific Computing and Imaging Institute, University of Utah. Which node owns what? (Figure: an isosurface of the Visible Female dataset, showing ownership.)

Scientific Computing and Imaging Institute, University of Utah. Acceleration structure: a “macrocell” lists the min and max values inside it; space leaping reduces the number of data accesses. (Figure: with isovalue=92, macrocells whose range excludes 92, such as max=85 or min=100, are missed; a macrocell with min=89, max=97 is entered.) Parker et al., Vis '98, “Interactive Ray Tracing for Isosurface Rendering”.

Scientific Computing and Imaging Institute, University of Utah. Acceleration structure: enter only those macrocells that contain the isovalue. (Figure: isovalue=92; the cells with max=85 or min=100 are skipped, and the cell with min=89, max=97 is entered.)

Scientific Computing and Imaging Institute, University of Utah. Acceleration structure: recurse inside interesting macrocells until you have to access the actual volume data. (Figure: inside the macrocell with min=89, max=97, children whose ranges contain the isovalue, such as min=91/max=95 and min=90/max=93, are traversed; a child with max=90 is missed and the rest are not traversed.)
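
A simplified sketch of the macrocell test and recursion; the struct layout and the flat child list are assumptions rather than the system's actual hierarchy, which follows Parker et al. A ray descends only into cells whose [min, max] range brackets the isovalue.

```cpp
#include <vector>

// Minimal macrocell: the extrema of all data under it, plus its children.
struct Macrocell {
    float min = 0.0f, max = 0.0f;        // extrema of the data inside this cell
    std::vector<Macrocell> children;     // empty at the finest level
};

// Stub for the leaf case: intersect the ray with the real voxel data.
static bool intersect_voxels(const Macrocell&, float) { return true; }

// Space leaping: skip any macrocell whose [min, max] range cannot contain the
// isovalue, and only touch actual volume data under the surviving leaves.
// Children are assumed to be stored in ray order, so the first hit wins.
static bool traverse(const Macrocell& cell, float isovalue) {
    if (isovalue < cell.min || isovalue > cell.max)
        return false;                    // e.g. max=85 or min=100 when isovalue=92
    if (cell.children.empty())
        return intersect_voxels(cell, isovalue);
    for (const Macrocell& child : cell.children)
        if (traverse(child, isovalue))
            return true;
    return false;
}
```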

Scientific Computing and Imaging Institute, University of Utah. Data bricking: use multi-level, 3D tiling for memory coherence (a 64 byte cache line and a 4 KB OS page). (Figure: linear address order &0-&15 versus brick-reordered address order.) Parker et al., Vis '98, “Interactive Ray Tracing for Isosurface Rendering”.

Scientific Computing and Imaging Institute, University of Utah. Data bricking: use 3-level, 3D tiling for memory coherence: a 64 byte cache line, a 4 KB OS page, and a 4 KB x L^3 network transfer size. For the datasets we've tried, a level-three brick size of 32 KB is the best trade-off between data locality and transmission time (see the sketch below). (Figure: linear versus bricked address order.)
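
A sketch of the two innermost levels of that tiling, assuming 8-bit voxels and power-of-two brick edges consistent with the sizes above (a 64 byte cache-line brick of 4x4x4 voxels inside a 4 KB page brick of 16x16x16 voxels); the 32 KB network brick would add one more identical split. Here pages_x and pages_y are assumed to be the number of page bricks along x and y.

```cpp
#include <cstddef>

// Bricked address calculation for 8-bit voxels. Power-of-two edge lengths make
// the whole computation shifts, masks, and a few multiplies, and neighbouring
// voxels in 3D land on the same cache line and page.
std::size_t bricked_index(std::size_t x, std::size_t y, std::size_t z,
                          std::size_t pages_x, std::size_t pages_y) {
    // position within the 4-voxel cache-line brick
    const std::size_t cx = x & 3, cy = y & 3, cz = z & 3;
    // position of that cache-line brick within the 16-voxel page brick
    const std::size_t px = (x >> 2) & 3, py = (y >> 2) & 3, pz = (z >> 2) & 3;
    // position of the page brick within the whole (bricked) volume
    const std::size_t bx = x >> 4, by = y >> 4, bz = z >> 4;

    const std::size_t in_line = (cz * 4 + cy) * 4 + cx;              // 0..63 bytes
    const std::size_t in_page = (pz * 4 + py) * 4 + px;              // 0..63 cache lines
    const std::size_t page    = (bz * pages_y + by) * pages_x + bx;  // which 4 KB brick
    return (page * 64 + in_page) * 64 + in_line;
}
```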

Scientific Computing and Imaging Institute, University of Utah. Isosurface intersection: we analytically test for a ray-isosurface intersection within a voxel by solving a cubic polynomial defined by the ray parameters and the 8 voxel corner values, and we use the data gradient for the surface normal. (Figure: a voxel with its corner values, isovalue=92.)
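
In symbols (a standard derivation consistent with the slide, not text taken from it): write the trilinear field inside the voxel from its eight corner values and substitute the ray, which leaves a cubic in the ray parameter t whose real roots inside the voxel are the isosurface crossings.

```latex
% Trilinear field from the eight corner values \rho_{ijk}, i,j,k \in \{0,1\},
% with local voxel coordinates (u,v,w) \in [0,1]^3:
\[
  \rho(u,v,w) = \sum_{i,j,k \in \{0,1\}}
    \rho_{ijk}\, u^{i}(1-u)^{1-i}\, v^{j}(1-v)^{1-j}\, w^{k}(1-w)^{1-k}
\]
% Along the ray (u,v,w)(t) = \mathbf{o} + t\,\mathbf{d} each factor is linear in t,
% so the crossing condition is a cubic in t:
\[
  \rho\bigl((u,v,w)(t)\bigr) - \rho_{\mathrm{iso}} \;=\; a t^{3} + b t^{2} + c t + d \;=\; 0 .
\]
```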

Scientific Computing and Imaging Institute, University of Utah. Benchmark test (figure): the isovalue and the viewpoint both change over the course of the frame sequence.

Scientific Computing and Imaging Institute, University of Utah. Consolidated data access: most of the time is spent accessing data which is locally owned or cached (7 µs hit time, 600 µs miss time, 98% hit rate). Reduce the number of DSM accesses by eliminating redundant accesses: when a ray needs data, sort the accesses to get all of the needed data in one shot (sketched below).
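
A small sketch of that consolidation under assumed helper names (brick_of and the brick dimensions are illustrative): the brick ids touched by one voxel's eight corners are sorted and deduplicated so each brick is acquired once per voxel rather than once per corner, which is the kind of reduction the next three slides quantify.

```cpp
#include <algorithm>
#include <array>
#include <vector>

using BrickId = int;

// Placeholder mapping from a voxel-corner coordinate to the id of the brick
// that stores it; here bricks are assumed to be 32^3 voxels in a 64x64 grid.
static BrickId brick_of(int x, int y, int z) {
    return ((z / 32) * 64 + (y / 32)) * 64 + (x / 32);
}

// Gather the distinct bricks touched by the 8 corners of the voxel at (x,y,z):
// sort the ids and drop duplicates so the caller issues one DSM acquire per
// brick instead of one per corner value read.
static std::vector<BrickId> bricks_for_voxel(int x, int y, int z) {
    std::array<BrickId, 8> ids;
    int n = 0;
    for (int dz = 0; dz <= 1; ++dz)
        for (int dy = 0; dy <= 1; ++dy)
            for (int dx = 0; dx <= 1; ++dx)
                ids[n++] = brick_of(x + dx, y + dy, z + dz);
    std::sort(ids.begin(), ids.end());
    auto last = std::unique(ids.begin(), ids.end());
    return std::vector<BrickId>(ids.begin(), last);
}
```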

Scientific Computing and Imaging Institute, University of Utah. Acquire on every voxel corner (figure: a macrocell overlapping bricks 1-6). # accesses = 3,279,000 per worker per frame; frame rate = 0.115 f/s.

Scientific Computing and Imaging Institute, University of Utah. Acquire on every voxel corner (figure: a macrocell overlapping bricks 1-6). # accesses = 453,400 per worker per frame; frame rate = 0.709 f/s.

Scientific Computing and Imaging Institute, University of Utah. Acquire on every voxel corner (figure: a macrocell overlapping bricks 1-6). # accesses = 53,290 per worker per frame; frame rate = 1.69 f/s.

Scientific Computing and Imaging Institute, University of Utah. Results - cache behaviour (plot): measured frame rate [f/s] and data transfers [MB/node/frame] versus frame number, with the isovalue (ISO) and viewpoint (VIEW) changes marked.

Scientific Computing and Imaging Institute, University of Utah. Results - cache behaviour (plot): frame rate with a filled cache versus measured frame rate [f/s] over the frame sequence; the cache fill costs 22% here.

Scientific Computing and Imaging Institute, University of Utah. Results - machine comparison (plot): SGI 1x31 CPUs, avg = 4.7 f/s; cluster 31x2 CPUs, avg = 1.7 f/s; cluster 31x1 CPUs, avg = 1.1 f/s.

Scientific Computing and Imaging Institute, University of Utah. Results - machine comparison (plot): SGI 1x31 CPUs, avg = 5.7 f/s; cluster 31x2 CPUs, avg = 1.5 f/s. The SGI is 3.8x faster.

Scientific Computing and Imaging Institute, University of Utah. Results - machine comparison (plot): SGI 1x31 CPUs, avg = 4.2 f/s; cluster 31x2 CPUs, avg = 2.6 f/s. The SGI is 1.6x faster.

Scientific Computing and Imaging Institute, University of Utah. Conclusions: We confirmed that interactive ray tracing on a cluster is possible. Scaling is limited by latency, and the number of tiles determines the maximum frame rate. Data sets that exceed the memory space of any one node can be handled with a DSM. For isosurfacing, hit time is the limiting factor, not network time. Overheads make the cluster slower than the supercomputer, but the new solution has a significant price advantage.

Scientific Computing and Imaging Institute, University of Utah. Future work: make it faster! Use a lower latency network; remove the central bottleneck; use block prefetch and ray rescheduling; optimize the DSM for faster hit times; use more parallelism (SIMD, hyperthreading, GPU).

Scientific Computing and Imaging Institute, University of Utah. Acknowledgments: NSF grants, DOE VIEWS, NIH grants, Mark Duchaineau at LLNL, and our anonymous reviewers. For more information: