Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory.

Slides:

Advertisements

Similar presentations

Multilevel Streaming for Out-of-Core Surface Reconstruction

Advertisements

Fast Algorithms For Hierarchical Range Histogram Constructions

Least-squares Meshes Olga Sorkine and Daniel Cohen-Or Tel-Aviv University SMI 2004.

Evaluating Graph Coloring on GPUs Pascal Grosset, Peihong Zhu, Shusen Liu, Suresh Venkatasubramanian, and Mary Hall Final Project for the GPU class - Spring.

Graphs Chapter 12. Chapter Objectives  To become familiar with graph terminology and the different types of graphs  To study a Graph ADT and different.

Efficient Storage and Processing of Adaptive Triangular Grids using Sierpinski Curves Csaba Attila Vigh Department of Informatics, TU München JASS 2006,

CSE554ContouringSlide 1 CSE 554 Lecture 4: Contouring Fall 2013.

I/O-Efficient Batched Union-Find and Its Applications to Terrain Analysis Pankaj K. Agarwal, Lars Arge, Ke Yi Duke University University of Aarhus.

Haptic Rendering using Simplification Comp259 Sung-Eui Yoon.

lecture 4 : Isosurface Extraction

I/O-Algorithms Lars Arge Spring 2009 February 2, 2009.

Streaming Computation of Delaunay Triangulations Jack Snoeyink UNC Chapel Hill Jonathan Shewchuk UC Berkeley Yuanxin Liu UNC Chapel Hill Martin Isenburg.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Cache-Oblivious Mesh Layouts Sung-Eui Yoon, Peter Lindstrom Valerio Pascucci, Dinesh Manocha 1: University.

Advanced Computer Graphics (Fall 2010) CS 283, Lecture 4: 3D Objects and Meshes Ravi Ramamoorthi

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Quick-VDR: Interactive View-Dependent Rendering of Massive Models Sung-Eui Yoon Brian Salomon Russell Gayle.

Lossless Compression of Floating-Point Geometry Martin Isenburg UNC Chapel Hill Peter Lindstrom LLNL Livermore Jack Snoeyink UNC Chapel Hill.

Graphs Chapter 12. Chapter 12: Graphs2 Chapter Objectives To become familiar with graph terminology and the different types of graphs To study a Graph.

Spring 2010CS 2251 Graphs Chapter 10. Spring 2010CS 2252 Chapter Objectives To become familiar with graph terminology and the different types of graphs.

Streaming Compression of Triangle Meshes Martin Isenburg University of California at Berkeley Jack Snoeyink University of North Carolina at Chapel Hill.

Ray Tracing Dynamic Scenes using Selective Restructuring Sung-eui Yoon Sean Curtis Dinesh Manocha Univ. of North Carolina at Chapel Hill Lawrence Livermore.

Irregular to Completely Regular Meshing in Computer Graphics Hugues Hoppe Microsoft Research International Meshing Roundtable 2002/09/17 Hugues Hoppe Microsoft.

R-LODs: Fast LOD-based Ray Tracing of Massive Models Sung-Eui Yoon Lawrence Livermore National Lab. Christian Lauterbach Dinesh Manocha Univ. of North.

Fall 2007CS 2251 Graphs Chapter 12. Fall 2007CS 2252 Chapter Objectives To become familiar with graph terminology and the different types of graphs To.

Martin Isenburg UC Berkeley Streaming Meshes Peter Lindstrom LLNL.

I/O-Algorithms Lars Arge Spring 2008 January 31, 2008.

Out-of-Core Compression for Gigantic Polygon Meshes Martin IsenburgStefan Gumhold University of North CarolinaWSI/GRIS at Chapel Hill Universität Tübingen.

1 A Novel Page-Based Data Structure for Interactive Walkthroughs Behzad Sajadi Yan Huang Pablo Diaz-Gutierrez Sung-Eui Yoon M. Gopi.

Zoltan Szego †*, Yoshihiro Kanamori ‡, Tomoyuki Nishita † † The University of Tokyo, *Google Japan Inc., ‡ University of Tsukuba.

Data Management Techniques Sung-Eui Yoon KAIST URL:

Compressing Multiresolution Triangle Meshes Emanuele Danovaro, Leila De Floriani, Paola Magillo, Enrico Puppo Department of Computer and Information Sciences.

Surface Simplification Using Quadric Error Metrics Michael Garland Paul S. Heckbert.

1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.

Interactive Rendering of Meso-structure Surface Details using Semi-transparent 3D Textures Vision, Modeling, Visualization Erlangen, Germany November 16-18,

Bounds on the Geometric Mean of Arc Lengths for Bounded- Degree Planar Graphs M. K. Hasan Sung-eui Yoon Kyung-Yong Chwa KAIST, Republic of Korea.

Graph Algorithms. Definitions and Representation An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite.

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

1 On-the-Fly Transformation and Rendering of Compressed Irregular Volume Data Chuan-kai Yang Department of Computer Science State University of New York.

Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University,

RACBVHs: Random-Accessible Compressed Bounding Volume Hierarchies Tae-Joon Kim Bochang Moon Duksu Kim Sung-Eui Yoon KAIST (Korea Advanced Institute of.

Random-Accessible Compressed Triangle Meshes Sung-eui Yoon Korea Advanced Institute of Sci. and Tech. (KAIST) Peter Lindstrom Lawrence Livermore National.

Saarland University, Germany B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes Sven Woop Gerd Marmitt Philipp Slusallek.

Click to edit Master title style HCCMeshes: Hierarchical-Culling oriented Compact Meshes Tae-Joon Kim 1, Yongyoung Byun 1, Yongjin Kim 2, Bochang Moon.

Fast BVH Construction on GPUs (Eurographics 2009) Park, Soonchan KAIST (Korea Advanced Institute of Science and Technology)

1 Compressing Triangle Meshes Leila De Floriani, Paola Magillo University of Genova Genova (Italy) Enrico Puppo National Research Council Genova (Italy)

I/O-Efficient Batched Union-Find and Its Applications to Terrain Analysis Pankaj K. Agarwal, Lars Arge, Ke Yi Duke University University of Aarhus.

View-dependent Adaptive Tessellation of Spline Surfaces

CSE554ContouringSlide 1 CSE 554 Lecture 4: Contouring Fall 2015.

Spectral Sequencing Based on Graph Distance Rong Liu, Hao Zhang, Oliver van Kaick {lrong, haoz, cs.sfu.ca {lrong, haoz, cs.sfu.ca.

UNC Chapel Hill M. C. Lin Introduction to Motion Planning Applications Overview of the Problem Basics – Planning for Point Robot –Visibility Graphs –Roadmap.

Models of Hierarchical Memory. Two-Level Memory Hierarchy Problem starts out on disk Solution is to be written to disk Cost of an algorithm is the number.

An Evaluation of Partitioners for Parallel SAMR Applications Sumir Chandra & Manish Parashar ECE Dept., Rutgers University Submitted to: Euro-Par 2001.

Data Structures and Algorithms in Parallel Computing Lecture 7.

BOĞAZİÇİ UNIVERSITY – COMPUTER ENGINEERING Mehmet Balman Computer Engineering, Boğaziçi University Parallel Tetrahedral Mesh Refinement.

Graphs Chapter 12. Chapter 12: Graphs2 Chapter Objectives To become familiar with graph terminology and the different types of graphs To study a Graph.

15-853: Algorithms in the Real World Locality II: Cache-oblivious algorithms – Matrix multiplication – Distribution sort – Static searching.

1 Random Disambiguation Paths Al Aksakalli In Collaboration with Carey Priebe & Donniell Fishkind Department of Applied Mathematics and Statistics Johns.

Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,

CSE554Contouring IISlide 1 CSE 554 Lecture 5: Contouring (faster) Fall 2013.

1 An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer.

Fifth International Conference on Curves and Surfaces Incremental Selective Refinement in Hierarchical Tetrahedral Meshes Leila De Floriani University.

1 Cache-Oblivious Query Processing Bingsheng He, Qiong Luo {saven, Department of Computer Science & Engineering Hong Kong University of.

A Graph Theoretic Approach to Cache-Conscious Placement of Data for Direct Mapped Caches Mirza Beg and Peter van Beek University of Waterloo June

Matrix Multiplication in CUDA Kyeo-Reh Park Kyeo-Reh Park Nuclear & Quantum EngineeringNuclear & Quantum Engineering.

Random Testing: Theoretical Results and Practical Implications IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 2012 Andrea Arcuri, Member, IEEE, Muhammad.

Hybrid Ray Tracing and Path Tracing of Bezier Surfaces using a mixed hierarchy Rohit Nigam, P. J. Narayanan CVIT, IIIT Hyderabad, Hyderabad, India.

Morphing and Shape Processing

Hybrid Ray Tracing of Massive Models

Cache-Efficient Layouts of BVHs and Meshes

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

Presentation transcript:

Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory

2 Goal ● Provide cache-coherent layouts of meshes and graphs ● Derive metrics that measure cache- coherence of layouts ● Generality ● Simplicity ● Efficiency ● Accuracy

3 Cache-Coherent Metrics ● Measure the expected number of cache misses of a layout given block-based caches ● Should correlate well with the observed number of cache misses ● Cache-aware metrics ● Measure cache-coherence given known cache parameters (e.g., block size) ● Cache-oblivious metrics ● Consider all possible cache parameters

4 Motivation Accumulated growth rate during 1993 – 2004 (log scale) Courtesy: Anselmo Lastra, ●Lower growth rate of data access speed 1.5X 20X 46X 130X during

5 Memory Hierarchies and Block- Based Caches CPU Fast memory or cache Slow memory Block transfer Disk sec Access time: sec10 -8 sec

6 Main Contributions ● Propose novel and practical cache-aware and cache-oblivious metrics ● Derive metrics given block-based caches ● Propose efficient cache-coherent layout constructions ● Apply to different applications

7 Related Work ● Computation reordering ● Data layout optimization

8 Computational Reordering ● Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02] ● Cache-oblivious [Frigo et al. 99, Arge et al. 04] Focus on specific problems such as sorting and linear algebra computations

9 Data Layout Optimization ● Graph and matrix layout [Diaz et al. 02] ● Minimum linear arrangement (MLA), bandwidth, and wavefront, etc. ● Space-filling curves ● [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04] ● Rendering and processing sequences ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05] ● Cache-oblivious mesh layout ● [Yoon et al. 05]

10 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results

11 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results

12 ncnc General Framework of Layout Computation nana nbnb ndnd Input directed graph, G (N, A) Layout algorithm, φ 1D layout, φ(N) …….. nana ndnd ncnc nbnb Cache-coherent metric

13 ncnc Two-Level I/O Model [Aggarwal and Vitter 88] nana nbnb ndnd Input directed graph Layout algorithm Cache M cache blocks, whose size is B 1D layout with block size = 3 …….. nana ndnd ncnc nbnb nana ndnd ncnc nbnb

14 Graph Representation ● Directed graph, G = (N, A) ● Represent access patterns between nodes ● Nodes, N ● Data element ● (e.g., mesh vertex or mesh triangle) ● Directed arcs, A ● Connects two nodes if they are accessed sequentially ncnc nana nbnb ndnd

15 Weights of Nodes and Arcs ● Indicate probabilities that each element will be accessed ● Computed in an equilibrium status during infinite random walks ● Assume that applications infinitely access the data according to the input graph ● Correspond to eigen-values of the probability transition matrix

16 Cache-Coherence of a Layout given Block-Based Caches ● Expected number of cache misses of a layout ● Probability accessing a node from another node by traversing an arc ● Conditional probability that we will have a cache miss given the above access pattern ncnc nana nbnb ndnd

17 Specialization to Meshes ● Expected number of cache misses of a layout ● Probability accessing a node from another node by traversing an arc ● Conditional probability that we will have a cache miss given the above access pattern ncnc nana nbnb ndnd ncnc nana nbnb ndnd An input mesh Implicitly derived graph = constant 1. Two opposite directed arcs 2. Uniform distribution to access adjacent nodes given a node

18 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results

19 Four Different Cases Cache-aware case single cache block, M=1 Cache-oblivious case single cache block, M=1 Cache-aware case multiple cache blocks, M>1 Cache-oblivious case multiple cache blocks, M>1

20 Cache-Aware: Single Cache Block, M=1 ncnc nana nbnb ndnd Input directed graph 1D layout with block size = 3 …….. nana ndnd ncnc nbnb Cache, whose block size is B Straddling arcs

21 Cache-Aware: Multiple Cache Blocks, M>1 ncnc nana nbnb ndnd Input directed graph 1D layout with block boundary …….. nana ndnd ncnc nbnb Straddling arcs Cache

22 Final Cache-Aware Metric ● Counts the number of straddling arcs of the layout given a block size B : block index containing the node, i : Unit step function, 1 if x > 0 0 otherwise. where

23 High Accuracy of Cache-Aware Metric Linear correlation [-1, 1] Observed number of cache misses With 5 cache blocks With 25 cache blocks Cache-aware metric 0.97 Tested layouts: Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout, (bi or uni) row-by-row, (bi or uni) diagnoal layouts Z-curve on a uniform grid Tested block size = 4KB

24 Four Different Cases Cache-aware case single cache block, M=1 Cache-oblivious case single cache block, M=1 Cache-aware case multiple cache blocks, M>1 Cache-oblivious case multiple cache blocks, M>1

25 Cache-Oblivious: Single Cache Block, M=1 Cache Does not assume a particular block size: Then, what are good representatives for block sizes?

26 Two Possible Block Size Progressions ● Arithmetic progression ● 1, 2, 3, 4, … ● Geometric progression ● 2 0, 2 1, 2 2, 2 3, … ● Well reflects current caching architectures ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

27 Probability that an Arc is a Straddling Arc …….. nana ndnd ncnc nbnb Is an arc straddling given a block size? Computed as a probability as a function of arc length, l Arc length, l, = 2

28 Two Cache-Oblivious Metrics ● Arithmetic cache-oblivious metric, ● Geometric cache-oblivious metric, Arc length of arc (i, j) Geometric mean of arc lengths MLA metric, Arithmetic mean

29 Validation for Cache-Oblivious (CO) Metrics ● Geometric cache-oblivious metric ● Practical and useful Geometric CO layout Arithmetic CO layout 97% of tested block sizes The number of cache misses when M = 1 (log scale) 73% of tested power-of-two block sizes

30 Correlations between Metrics and Observed Number of Cache Misses Linear correlation [-1, 1] Observed number of cache misses With 1 cache block With 5 cache blocks Geometric CO metric Arithmetic CO metric Tested block size = 4KB Tested with 10 different layouts on a uniform grid

31 Efficient Layout Computation for Our Metrics ● Cache-aware layouts ● Optimized with cache-aware metric given a block size B ● Computed from the graph partitioning ● Geometric cache-oblivious metric ● Very efficient ● Can be used in different layout methods

32 Layout Computation with Geometric Cache-Oblivious Metric ● Multi-level construction method ● Partition an input mesh into k different sets ● Layout partitions based on our metric … 1. Partition 2. Lay out Generalized layout method for unstructured meshes

33 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results

34 Applications ● Isosurface extraction ● View-dependent rendering

35 Iso-Surface Extraction ● Uses contour tree [van Kreveld et al. 97] ● Runtime is dominated by the traversal of iso- surface ● Layout graph ● Use an input tetrahedral mesh Spx model (140K vertices)

36 High Correlation with Number of Cache Misses Linear correlation [-1, 1] Observed number of cache misses With 1 cache block With 10K cache blocks Geometric CO metric Tested block size = 4KB Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral [Juvan and Mohar 92], cache- oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis sorted layouts

37 High Correlation with Runtime Performance Linear correlation [-1, 1] First iso-surface extraction time Second iso-surface extraction time Geometric CO metric 0.94 Disk I/O time is major bottleneck Memory access time is major bottleneck

38 Comparison with Other Layouts The first iso-surface extraction time (sec) 8% - 77% improvement and very close to the cache-aware performance

39 View-Dependent Rendering ● Layout vertices and triangles of progressive meshes ● Used in an efficient VDR system [Yoon et al. 04] ● Reduce misses in GPU vertex cache

40 Cache Miss Ratio on Bunny Model GPU vertex cache miss ratio Vertex cache size Hoppe [Hoppe 99] Theoretical lower bound [Bar-Yehuda and Gotsman 96] Universal rendering seq. [Bogomjakov and Gotsman 02] Geometric CO layout

41 Cache Miss Ratio on Power Plant Model GPU vertex cache miss ratio Vertex cache size Z-curve Hoppe’s rendering seq. [Hoppe 99] Theoretical lower bound [Bar-Yehuda and Gotsman 96] COML [Yoon et al. 05] Geometric CO layout

42 Conclusion ● Novel cache-aware and cache-oblivious metrics to evaluate layouts ● Derived metrics based on two-level I/O model ● Improved the performance of applications without modifying codes OpenCCL, open source library

43 Ongoing and Future Work ● Derive a lower bound on our geometric cache-oblivious metric ● Employ mesh compression to further reduce disk I/O accesses ● Investigate efficient layout method for deforming models ● Apply to non-graphics applications ● e.g., shortest path or other graph computations

44 Cache-Efficient Layouts of Bounding Volume Hierarchies ● Yoon and Manocha, Eurographics 06 Ray tracingCollision detection

45 Acknowledgements ● Ajith Mascarenhas ● Martin Isenburg ● Dinesh Manocha ● Fabio Bernardon, Joao Comba, and Claudio Silva ● For their unstructured tetrahedra rendering program ● Members of data analysis group in LLNL ● Anonymous reviewers

46 UCRL-PRES This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

47 Additional slides

48 Cache-Coherence of Layouts ● Well known heuristics for cache-coherent layouts ● Space-filling curves [Sagan 94] ● How can we compute better layouts? ● Requires metrics measuring cache-coherence of layouts

49 Main Results ● Define cache-coherence of layout as: ● Expected number of cache misses during random walks of a graph given block-based caches ● Then, the exp. number of cache misses = Number of straddling arcs in a cache-aware cache ≈ Geometric mean of arc lengths in a cache- oblivious case

50 Data Layout Optimization ● Rendering sequences ● Triangle strips ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02] ● Processing sequences ● [Isenburg and Gumhold 03, Isenburg and Lindstrom 05] Assume that access pattern globally follows the layout order

51 Data Layout Optimization ● Space-filling curves ● [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04] Assume geometric regularity

52 Data Layout Optimization ● Graph and matrix layout ● A survey [Diaz et al. 02] ● Minimum linear arrangement (MLA) ● Bandwidth ● Profile ● Wavefront, etc. Does not necessarily produce good layouts for block-based caches

53 Cache-Aware Metric ● Expected number of cache misses given a block size B

54 Correlation between Cache-Aware Metric and Observed Number of Cache Misses Cache-aware metric M = 5 M = 25 Observed number of cache misses Cache block size = 4KB R 2 = 0.97

55 Evaluating Existing Layouts ● No known tight bound ● Compare against the best layout we can construct ● Employ an efficient sampling method Is it close to the optimal layout? An existing layout, φ Use it Build a new one

56 Implementation ● Modify our open source layout computation codes, OpenCCL ● Based on METIS graph partitioning library [Karypis and Kumar 98] ● Processing speed ● 15k triangles / second

57 Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05] ● Two major improvements ● Accuracy ● Usability

58 Pros and Cons ● Limitations ● Not directly applicable to dynamic models ● Pros ● Generality ● Can have benefits without modifying underlying codes

59 Specialization to Meshes ● Assume an equally likelihood to access adjacent nodes given a node ncnc nana nbnb ndnd ncnc nana nbnb ndnd An input mesh Implicitly derived graph

60 Two Possible Block Size Progressions ● Arithmetic progression ● 1, 2, 3, 4, … ● Geometric progression ● 2 0, 2 1, 2 2, 2 3, … ● Well reflects current caching architectures ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc. B Pr(B) B Geometric distribution Uniform distribution

61 Cache Miss Ratio with Different Mesh Resolutions of Bunny Model GPU vertex cache miss ratio (case size: 32) Mesh resolution Hoppe’s rendering sequence [Hoppe 99] COLg

62 Correlation between Cache-Aware Metric and Number of Cache Misses M = 5 M = 25 Observed number of cache misses Cache block size = 4KB R 2 = 0.97

63 Correlations with Observed Number of Cache Misses Cache misses One blk : Mult blks Arithmetic CO metric Geometric CO metric R 2 = 0.98R 2 = 0.81 Cache block size = 4KB,

64 Comparison with Other Layouts X-axis BFL Spectral layout [Juvan and Mohar 92] DFL COML [Yoon et al. 05] Z-curve COLg Aware Correlation: Up to 2X speedup 9% lower than cache- aware layout

65 Comparison with Other Layouts ● Compute eight different layouts ● Our geometric cache-oblivious layout ● Our cache-aware layout ● Breadth-first layout ● Depth-first layout ● Spectral layout [Juvan and Mohar 92] ● Cache-oblivious mesh layout [Yoon et al. 05] ● Z-curve [Sagan 94] ● X-axis sorted layout

66 Our Layout of Bunny Model