
1 Streaming v. Multicore in Graphics Applications Jared Hoberock Victor Lu Sanjay Patel John C. Hart Univ. of Illinois

2 Dynamic Virtual Environments
World of Warcraft: social internet world ° Completely unconstrained (can build & share things) ° Lower-quality graphics
Grand Theft Auto IV: sandbox world ° Free interaction (within gamespace) ° High-quality graphics
Halo 3: first-person shooter ° Constrained interaction ° Photorealistic graphics (much precomputation)
Spectrum: dynamic, flexible game graphics vs. precomputed, rigid film graphics
Multicore enables both flexibility and photorealism

3 Videogame Production
Costly ° Expensive: $10M/title ° Slow: 3+ years/title
Compromises ° Precomputed visibility – restricts viewer mobility and environment complexity ° Precomputed lighting – restricts scene dynamics, user alterations ° Precomputed motion – restricts movement to mocap data, rigging
Consequences ° Significant development effort to achieve realtime rates ° Dynamic social gamespace quality lags that of solo/team shooter levels
Solution ° Leverage multicore power to ray trace for dynamic visibility & lighting

4

5 How Close Are We?
Single-CPU ray tracing ° RTRT Core renders at 1–5 Hz on a 2.5 GHz P4 ° Need 60 Hz for games ° A 30 GHz CPU would be needed to ray trace game scenes [Schmittler et al., Realtime Ray Tracing for Current & Future Games, SIGGRAPH 2006 Real Time Ray Tracing Course Notes]
We won't see a 30 GHz serial processor (burns too brightly!)
We will see 16+ cores
But can we do in parallel what we predict in serial?
Ingo Wald, RTRT Core, SIGGRAPH 2005 Real Time Ray Tracing Course Notes

6 Spatial Data Structures
Nearest-neighbor problems in graphics
Rendering: photon mapping (k-NN) ° Find the 500 photons nearest to a ray–surface intersection to compute the surface's illumination
Modeling: surface reconstruction (ε-NN) ° The surface reconstructed at each point depends on the locations of the nearest points within a given distance
Animation: collision detection (ε-NN) ° Collisions between multiple interacting elements are accelerated by avoiding all-pairs intersection tests
All built on hierarchical spatial data structures
How can we build, query and maintain them on SIMD GPUs?

7 kD-Tree
Hierarchy of axis-aligned partitions ° 2-D partitions are lines ° 3-D partitions are planes
The partitioning axis alternates with the depth of the tree
Average access time is O(log n); worst case O(n) when the tree is severely lopsided
Need to maintain a balanced tree: O(n log n)
Can find the k nearest neighbors in O(k + log n) time using a heap
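A minimal Python sketch of the two operations above, assuming simple tuple points and a dict-based node layout (both hypothetical choices, not from the slides): median-split construction keeps the tree balanced, and k-NN search uses a max-heap of the current best candidates to prune far subtrees.

```python
import heapq

def build_kdtree(points, depth=0):
    """Build a balanced kD-tree; the splitting axis alternates with depth,
    and splitting at the median keeps the tree balanced (O(n log n) build)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def k_nearest(node, query, k, heap=None):
    """Collect the k points nearest to `query`. The heap holds
    (-squared_distance, point), so heap[0] is the worst current candidate."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], query))
    if len(heap) < k:
        heapq.heappush(heap, (-d2, node["point"]))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, node["point"]))
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    k_nearest(near, query, k, heap)
    # Descend the far side only if the splitting plane is closer than
    # the current k-th best candidate.
    if len(heap) < k or diff * diff < -heap[0][0]:
        k_nearest(far, query, k, heap)
    return heap
```

For example, `k_nearest(build_kdtree(pts), (9, 2), 2)` returns the two points of `pts` closest to (9, 2).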

8 GPU Hierarchy Traversal
SIMD stackless hierarchy traversal ° Pre-thread the tree with hit/miss pointers ° Hit pointer points to the first child ° Miss pointer points to the next sibling, or, if this is the last sibling, to an ancestor's sibling
References ° Foley & Sugerman, kD-Tree Acceleration Structures for a GPU Raytracer, Graphics Hardware 2005 ° Carr, Hoberock, Crane & Hart, Fast GPU Ray Tracing of Dynamic Meshes Using Geometry Images, Graphics Interface 2006
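The hit/miss idea can be sketched in a few lines of Python; the dict node layout and the toy `ray_hits` predicate are assumptions for illustration. The point is that traversal becomes a flat loop with no per-ray stack, which suits SIMD hardware.

```python
def traverse_stackless(nodes, ray_hits, start=0):
    """Follow pre-threaded links instead of keeping a per-ray stack:
    on a hit, go to the node's `hit` link (first child); on a miss,
    go to its `miss` link (next sibling, or an ancestor's sibling)."""
    order = []
    i = start
    while i is not None:
        order.append(i)
        i = nodes[i]["hit"] if ray_hits(nodes[i]) else nodes[i]["miss"]
    return order

# A toy 3-node tree: root (0) with children 1 and 2. Node 1 is a leaf,
# so both its links continue to the next sibling.
nodes = [
    {"name": "root",  "hit": 1,    "miss": None},
    {"name": "left",  "hit": 2,    "miss": 2},
    {"name": "right", "hit": None, "miss": None},
]
```

A ray that intersects every node visits [0, 1, 2]; a ray that misses the root's bounds stops after node 0.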

9 GPU Hierarchy Construction
Recent approaches sort first, then organize into a hierarchy ° Zhou, Hou, Wang, Guo, Real-Time KD-Tree Construction on Graphics Hardware, SIGGRAPH Asia 2008 ° Godiyal, Hoberock, Hart, Garland, Rapid Multipole Graph Drawing on the GPU, Graph Drawing 2008
The latter uses a kD-tree for a fast n-body approximation to compute force-directed layout
CPU+GPU ° CPU builds the kD-tree ° GPU performs median selection ° Practical when > 50K elements

10

11 Incoherent Shader Execution
Videogame graphics rasterize triangles ° The same shader is applied to all pixels (fragments) in a triangle ° Shading & visibility occur simultaneously
Future videogames will also trace rays ° Visibility first, then shading
Primary eye rays are coherent
Secondary rays are reflected or scattered into incoherent shader queries
A different shader (not just different shader data) is applied to each ray ° e.g. hair, skin, cloth, liquids, foliage
Chris Wyman

12 GPU Architecture
GPU = MIMD of SIMD
MIMD processing ° Cell: 8 MIMD nodes ° GF8800: 16 MIMD nodes ° LRB: 32 MIMD nodes
SIMD processing ° Cell: 4 per MIMD node ° GF8800: 8 per MIMD node ° LRB: 16 per MIMD node
Some MIMD nodes have distinct control processors, though similar processing could occur via one SIMD node (masking the rest)
An LRB core is a MIMD processor; an NVIDIA core is a SIMD processor
An NVIDIA warp is 32 threads streaming on one MIMD node

13 IBM Cell Architecture
[Diagram: Power processor element (Power processor unit, Power execution unit, L1 & L2 caches) plus eight synergistic processor elements (each with local store, SXU, SPU, SMF), connected by the element interconnect bus (up to 96 bytes/cycle) to the memory interface controller (dual XDR, 32 bytes/cycle) and bus interface controller (Flex I/O, 16 bytes/cycle)]
64-bit Power Architecture with vector media extensions
Gschwind et al., Synergistic Processing in Cell's Multicore Architecture, IEEE Micro, 2006

14 NVIDIA Tesla Architecture

15 Conditional Program Flow
High performance is stuck with low-level streaming SIMD, even in multicore
Problem with SIMD: conditional program flow ° If a data-dependent condition leads to two different program flows ° Then both program flows must be executed on all SIMD nodes (serialization) ° The result is masked per SIMD processor by the condition data
MIMD for loop
  SIMD for loop
    if (X) then A else B
[Diagram: lanes where X is true keep A's result, lanes where X is false keep B's; every lane steps through both A and B under a mask on X]
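A small Python emulation of the masked execution described above (the function names and call-counting are illustrative assumptions): every lane evaluates both branch bodies, and the mask merely selects which result each lane keeps, so the work done is independent of how the condition falls.

```python
def simd_if_else(xs, cond, branch_a, branch_b):
    """Emulate SIMD execution of `if (X) then A else B`: the whole vector
    steps through BOTH branch bodies, and a per-lane mask picks which
    result each lane keeps."""
    mask = [cond(x) for x in xs]
    a = [branch_a(x) for x in xs]  # every lane executes A ...
    b = [branch_b(x) for x in xs]  # ... and every lane executes B
    return [av if m else bv for m, av, bv in zip(mask, a, b)]

# Count branch-body executions to show the serialization cost.
calls = {"A": 0, "B": 0}
def A(x):
    calls["A"] += 1
    return x + 1
def B(x):
    calls["B"] += 1
    return x - 1

out = simd_if_else([1, 2, 3, 4], lambda x: x % 2 == 0, A, B)
```

Here both A and B run four times even though each lane only keeps one result, mirroring the slide's point that divergence serializes both paths.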

16 Deferred Shading
Handle visibility first ° Intersect rays with the scene ° Store the result for later shading
Then shade the ray intersections
If different rays in the same MIMD node need different shaders, the shaders are serialized
O(NS) performance ° N = # of rays ° S = # of shaders (per MIMD node) ° O(S) when distributed across N processes
MIMD for all rays
  SIMD for all rays
    intersect ray with scene
    set mask to shader #
MIMD for all rays
  SIMD for all rays
    for all shaders in SIMD ray warp
      shader(ray) if mask == shader
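A toy Python model of the second (shading) pass, assuming each ray is a dict already carrying the id of the shader it needs (the data layout and shader functions are illustrative, not the authors' implementation). Each warp serializes over the distinct shaders its rays request, masking the rest, which is where the O(NS) cost comes from.

```python
def deferred_shade(rays, shaders, warp_size=32):
    """Shade rays warp by warp: each warp makes one serialized pass per
    distinct shader id it contains, and within a pass only the matching
    rays write their results (the mask). Returns (results, total passes)."""
    results, passes = [None] * len(rays), 0
    for base in range(0, len(rays), warp_size):
        warp = range(base, min(base + warp_size, len(rays)))
        for sid in sorted({rays[i]["shader"] for i in warp}):
            passes += 1                    # one pass per shader in this warp
            for i in warp:                 # mask: only matching rays write
                if rays[i]["shader"] == sid:
                    results[i] = shaders[sid](rays[i])
    return results, passes

# Eight rays alternating between two shaders, warps of four:
rays = [{"shader": i % 2} for i in range(8)]
shaders = [lambda r: "wall", lambda r: "glass"]
res, passes = deferred_shade(rays, shaders, warp_size=4)
```

With fully incoherent rays every warp pays for both shaders (4 passes here); if the same rays were grouped by shader, each warp would need only one pass.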

17 Process Sorting
Need to bucket computations to move those with identical control flows onto the SIMD processors of the same MIMD node
When is it worth the trouble?

18 Scan (Prefix Sum)
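The slide is a figure; the operation it names can be stated in a few lines. The serial loop below is just the specification of an exclusive scan; on a GPU the same result is computed in O(log n) parallel steps.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1], with out[0] = 0."""
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

# Scanning a 0/1 flag vector yields scatter addresses for stream compaction:
# each flagged element's address is the count of flagged elements before it.
flags = [1, 0, 1, 1, 0, 1]
addrs = exclusive_scan(flags)   # each 1's address within the compacted output
```

This flags-then-scan-then-scatter pattern is the building block for the shader scheduling on the next slide.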

19 Shader Scheduling
Sort jobs based on shader request ° Radix sort ° Segmented scan ° Global v. local sort
Load MIMD nodes only with rays requesting the same shader
Still O(NS) ° Performing an O(N) scan for each of S shaders
Can we scan on all shaders simultaneously?
MIMD for all rays
  SIMD for all rays
    intersect ray with scene
MIMD for all shaders
  scan rays needing that shader
MIMD for all rays needing that shader
  SIMD for all rays
    shader(ray)
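A Python sketch of the global-compaction schedule above (the dict ray layout is an assumption carried over for illustration): one scan over all N rays per shader, so S scans total, which is exactly the O(NS) cost the slide notes.

```python
from itertools import accumulate

def compact_by_shader(rays, num_shaders):
    """For each shader id, scan a 0/1 flag vector over all rays to get
    scatter addresses, then gather that shader's rays into a contiguous
    range. Warps can then be filled with rays that all request the
    same shader, avoiding per-warp shader serialization."""
    out = []
    for sid in range(num_shaders):
        flags = [1 if r["shader"] == sid else 0 for r in rays]
        addrs = [0] + list(accumulate(flags))[:-1]   # exclusive scan
        bucket = [None] * sum(flags)
        for ray, flag, addr in zip(rays, flags, addrs):
            if flag:
                bucket[addr] = ray                    # scatter
        out.extend(bucket)
    return out
```

After compaction the rays are grouped (and stably ordered) by shader id, so each MIMD node can be loaded with a single-shader stretch of work.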

20 Stanford Bunny in Cornell Box
Three shaders: wall, glass, light; all shaders simple
Warp size: 32

Hit | Incoherence | Branches | Eff.
 1  |      0.6%  |   1.15   | 87%
 2  |       30%  |   2.40   | 42%
 3  |       38%  |   2.55   | 39%
 4  |       40%  |   2.65   | 37%
 5  |       40%  |   2.67   | 37%

Incoherence: how often a ray's shader differs from the previous ray's
Branches: average # of branches per warp

21 Automotive/CAD Viz
DJ_Designs via Google 3D WH
16 simple shaders
Small parts ameliorate their shaders' impact on overall efficiency

Bounce | Incoherence | Efficiency
   1   |      1.6%  |     28%
   2   |       40%  |     13%
   3   |       30%  |     14%
   4   |       22%  |     15%
   5   |       17%  |

22 Angel in Cornell Box
Four shaders: wall and light are simple; marble and wood are more expensive, procedural

Bounce | Incoherence | Efficiency
   1   |      1.2%  |     77%
   2   |       52%  |     23%
   3   |       53%  |     21%
   4   |       47%  |     22%
   5   |       40%  |     23%

23 Siebel Center Staircase
Six shaders ° Copper, glass, girder, chrome, marble, light
Efficiency bump due to smooth glass/chrome coherence and rays exiting the scene

Bounce | Incoherence | Efficiency
   1   |        3%  |     68%
   2   |       34%  |
   3   |       36%  |     33%
   4   |            |     32%
   5   |       30%  |     34%

24 Efficiency Images: Branching Penalties
Warp size: 32 ° All 32 SIMD threads must follow the same control flow
[Images: per-pixel SIMD efficiency with 16 shaders vs. with one shader]

25 Memory Coherence
Shader execution ° Serial: one at a time ° SIMD: as one big switch
Serialized ° Slower, wastes processors ° Avoids locks ° Can conserve memory
Compare with & without stream compaction
[Diagram: processes and the memory locations they touch, with and without compaction]

26 Scheduling Approaches Five Options ° Serial Unsorted ° Serial Global Compaction ° Parallel Unsorted ° Parallel Local Compaction ° Parallel Global Compaction Each variation involves bookkeeping overhead

27 Observations
[Chart: performance of serialized vs. SIMD execution under the schedules: unsorted, locally sorted, and globally sorted]
Even for these modest scenes there are significant performance gains
Local per-node compaction doesn't work ° Local per-node workloads are hindered by too many shaders to schedule
Even a zero-time sort would not improve most cases
Faster stream compaction: prefix sum, scatter/gather

28 Conclusions
Stream compaction ° Not practical for simple shaders ° Practical for procedural textures (wood, marble) ° Probably practical for complex shaders (hair, cloth, skin)
Warp coherence nevertheless leads to data incoherence ° Even when all rays in a MIMD node run the same shader, their data is still distributed across memory, outside cache boundaries
Static tuning is OK, but run-time scheduling is better
Broader implication for object polymorphism ° Streaming the same objects with different virtual function tables

