Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory.

2 Goal ● Provide cache-coherent layouts of meshes and graphs ● Derive metrics that measure cache-coherence of layouts ● Generality ● Simplicity ● Efficiency ● Accuracy

3 Cache-Coherent Metrics ● Measure the expected number of cache misses of a layout given block-based caches ● Should correlate well with the observed number of cache misses ● Cache-aware metrics ● Measure cache-coherence given known cache parameters (e.g., block size) ● Cache-oblivious metrics ● Consider all possible cache parameters

4 Motivation ● Lower growth rate of data access speed [Chart: accumulated growth rate during 1993 – 2004 (log scale); chart labels: 1.5X, 20X, 46X, 130X during 99 - 04. Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/]

5 Memory Hierarchies and Block-Based Caches [Figure: CPU, fast memory or cache, slow memory, and disk, connected by block transfers. Access times shown: 10^-8 sec, 10^-7 sec, and 10^-2 sec (disk)]

6 Main Contributions ● Propose novel and practical cache-aware and cache-oblivious metrics ● Derive metrics given block-based caches ● Propose efficient cache-coherent layout constructions ● Apply to different applications

7 Related Work ● Computation reordering ● Data layout optimization

8 Computational Reordering ● Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02] ● Cache-oblivious [Frigo et al. 99, Arge et al. 04] Focus on specific problems such as sorting and linear algebra computations

9 Data Layout Optimization ● Graph and matrix layout [Diaz et al. 02] ● Minimum linear arrangement (MLA), bandwidth, and wavefront, etc. ● Space-filling curves ● [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04] ● Rendering and processing sequences ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05] ● Cache-oblivious mesh layout ● [Yoon et al. 05]

10 Outline ● Computation models ● Cache-aware and cache-oblivious metrics ● Results

12 General Framework of Layout Computation [Figure: input directed graph, G = (N, A), with nodes n_a, n_b, n_c, n_d; layout algorithm, φ; resulting 1D layout, φ(N): n_a, n_d, n_c, n_b, …; evaluated by a cache-coherent metric]

13 Two-Level I/O Model [Aggarwal and Vitter 88] ● Cache: M cache blocks, whose size is B [Figure: input directed graph with nodes n_a, n_b, n_c, n_d; layout algorithm; 1D layout with block size = 3: n_a, n_d, n_c, n_b, …]
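
The two-level model above can be simulated directly. The following is a minimal sketch, assuming an LRU replacement policy (the slides do not name one) and a hypothetical access sequence; `count_cache_misses` and its parameters are illustrative names, not part of the presentation.

```python
from collections import OrderedDict


def count_cache_misses(accesses, layout, block_size, num_blocks):
    """Count block misses for a node access sequence.

    layout maps each node to its position in the 1D layout.
    The cache holds num_blocks blocks of block_size nodes each.
    Assumption: least-recently-used (LRU) replacement.
    """
    cache = OrderedDict()  # block index -> None, least recently used first
    misses = 0
    for node in accesses:
        block = layout[node] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark block as most recent
        else:
            misses += 1                    # miss: fetch the block
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


# Layout order from the slide: n_a, n_d, n_c, n_b, with block size 3.
layout = {"a": 0, "d": 1, "c": 2, "b": 3}
accesses = ["a", "b", "a", "b"]  # hypothetical access pattern
misses_one = count_cache_misses(accesses, layout, block_size=3, num_blocks=1)
misses_two = count_cache_misses(accesses, layout, block_size=3, num_blocks=2)
```

With a single cache block, alternating between n_a and n_b evicts on every access; with two blocks, only the first access to each block misses.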

14 Graph Representation ● Directed graph, G = (N, A) ● Represent access patterns between nodes ● Nodes, N ● Data element ● (e.g., mesh vertex or mesh triangle) ● Directed arcs, A ● Connects two nodes if they are accessed sequentially [Figure: example graph with nodes n_a, n_b, n_c, n_d]

15 Weights of Nodes and Arcs ● Indicate probabilities that each element will be accessed ● Computed in the equilibrium state of an infinite random walk ● Assume that applications infinitely access the data according to the input graph ● Correspond to the dominant eigenvector (eigenvalue 1) of the probability transition matrix
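
As a rough illustration of these weights, the sketch below estimates the equilibrium node probabilities by power iteration and derives arc probabilities from them. The graph (a triangle a, b, c with a pendant node d) and the function name `stationary_distribution` are assumptions made for this example only.

```python
def stationary_distribution(transitions, iterations=1000):
    """Approximate the equilibrium node probabilities of a random walk.

    transitions[src][dst] is the probability of stepping from src to dst.
    Power iteration converges when the walk is irreducible and aperiodic.
    """
    nodes = list(transitions)
    prob = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for src, out in transitions.items():
            for dst, p in out.items():
                nxt[dst] += prob[src] * p
        prob = nxt
    return prob


# Hypothetical graph: triangle a-b-c plus node d attached to c,
# with uniform transition probabilities to adjacent nodes.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
transitions = {u: {v: 1.0 / len(vs) for v in vs} for u, vs in neighbors.items()}

node_prob = stationary_distribution(transitions)
# Arc weight: probability of being at src times probability of taking the arc.
arc_prob = {(u, v): node_prob[u] * p
            for u, out in transitions.items() for v, p in out.items()}
```

In this example the node probabilities are proportional to node degree, and every arc ends up with the same probability.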

16 Cache-Coherence of a Layout given Block-Based Caches ● Expected number of cache misses of a layout, summed over arcs as the product of: ● Probability of accessing a node from another node by traversing an arc ● Conditional probability that we will have a cache miss given the above access pattern [Figure: example graph with nodes n_a, n_b, n_c, n_d]

17 Specialization to Meshes ● Expected number of cache misses of a layout ● Probability of accessing a node from another node by traversing an arc ● Conditional probability that we will have a cache miss given the above access pattern ● For meshes, the arc access probability is constant, assuming: 1. Two opposite directed arcs for each mesh edge 2. Uniform distribution to access adjacent nodes given a node [Figure: an input mesh and its implicitly derived graph, with nodes n_a, n_b, n_c, n_d]
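
The implicit graph can be derived from a triangle list along these lines. This is a sketch under the two assumptions stated on the slide; `mesh_to_graph` and the two-triangle mesh are hypothetical.

```python
def mesh_to_graph(triangles):
    """Derive the access graph of a triangle mesh over its vertices.

    Each mesh edge yields two opposite directed arcs, and each vertex
    moves to any adjacent vertex with equal probability.
    Returns transitions[src][dst] = probability of stepping src -> dst.
    """
    neighbors = {}
    for tri in triangles:
        for k in range(3):
            u, v = tri[k], tri[(k + 1) % 3]
            neighbors.setdefault(u, set()).add(v)  # arc u -> v
            neighbors.setdefault(v, set()).add(u)  # opposite arc v -> u
    return {u: {v: 1.0 / len(nbrs) for v in nbrs}
            for u, nbrs in neighbors.items()}


# Hypothetical mesh: two triangles sharing the edge (1, 2).
triangles = [(0, 1, 2), (1, 3, 2)]
transitions = mesh_to_graph(triangles)
num_arcs = sum(len(out) for out in transitions.values())
```

The two triangles have five distinct edges, so the derived graph has ten directed arcs.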

19 Four Different Cases Cache-aware case single cache block, M=1 Cache-oblivious case single cache block, M=1 Cache-aware case multiple cache blocks, M>1 Cache-oblivious case multiple cache blocks, M>1

20 Cache-Aware: Single Cache Block, M=1 [Figure: input directed graph with nodes n_a, n_b, n_c, n_d; 1D layout with block size = 3: n_a, n_d, n_c, n_b, …; a cache whose block size is B; straddling arcs marked]

21 Cache-Aware: Multiple Cache Blocks, M>1 [Figure: input directed graph with nodes n_a, n_b, n_c, n_d; 1D layout with block boundary: n_a, n_d, n_c, n_b, …; cache; straddling arcs marked]

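22 Final Cache-Aware Metric ● Counts the number of straddling arcs of the layout given a block size B ● An arc (i, j) straddles when the block index containing node i differs from the block index containing node j ● The count uses the unit step function, which is 1 if x > 0 and 0 otherwise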
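
The count described above can be sketched as follows. The arc list is a hypothetical example (the slide's graph is not recoverable from the transcript), and the function names are illustrative.

```python
def unit_step(x):
    """Unit step function: 1 if x > 0, 0 otherwise."""
    return 1 if x > 0 else 0


def block_index(position, block_size):
    """Index of the block that contains a given layout position."""
    return position // block_size


def cache_aware_metric(arcs, layout, block_size):
    """Number of arcs whose endpoints fall in different blocks."""
    return sum(
        unit_step(abs(block_index(layout[i], block_size)
                      - block_index(layout[j], block_size)))
        for i, j in arcs
    )


# Layout order n_a, n_d, n_c, n_b; arcs are assumed for illustration.
layout = {"a": 0, "d": 1, "c": 2, "b": 3}
arcs = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),
        ("c", "d"), ("d", "c"), ("b", "d"), ("d", "b")]
straddling_b3 = cache_aware_metric(arcs, layout, block_size=3)
straddling_b2 = cache_aware_metric(arcs, layout, block_size=2)
```

With block size 3, only the arcs touching n_b cross the block boundary; with block size 2, every arc in this example does.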

23 High Accuracy of Cache-Aware Metric ● Linear correlation [-1, 1] between the cache-aware metric and the observed number of cache misses, with 5 cache blocks and with 25 cache blocks: 0.97 ● Tested layouts: Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout, (bi or uni) row-by-row, (bi or uni) diagonal layouts ● Tested block size = 4KB [Figure: Z-curve on a uniform grid]

24 Four Different Cases Cache-aware case single cache block, M=1 Cache-oblivious case single cache block, M=1 Cache-aware case multiple cache blocks, M>1 Cache-oblivious case multiple cache blocks, M>1

25 Cache-Oblivious: Single Cache Block, M=1 ● Does not assume a particular block size ● Then, what are good representatives for block sizes?

26 Two Possible Block Size Progressions ● Arithmetic progression ● 1, 2, 3, 4, … ● Geometric progression ● 2^0, 2^1, 2^2, 2^3, … ● Well reflects current caching architectures ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

27 Probability that an Arc is a Straddling Arc ● Is an arc straddling given a block size? ● Computed as a probability as a function of arc length, l [Figure: 1D layout n_a, n_d, n_c, n_b, … with an example arc of length l = 2]
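
One way to make this probability concrete is to enumerate every alignment of the arc relative to the block boundaries. The sketch below assumes each alignment is equally likely, which is an assumption of this example rather than something stated on the slide.

```python
def straddle_probability(arc_length, block_size):
    """Fraction of block alignments for which an arc crosses a boundary.

    The arc starts at `offset` within a block and ends arc_length
    positions later; all block_size offsets are assumed equally likely.
    """
    straddling = 0
    for offset in range(block_size):
        start, end = offset, offset + arc_length
        if start // block_size != end // block_size:
            straddling += 1
    return straddling / block_size


p_short = straddle_probability(arc_length=2, block_size=4)
p_long = straddle_probability(arc_length=5, block_size=4)
```

Under this assumption the probability grows linearly with the arc length until the arc is at least as long as the block, after which it always straddles.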

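28 Two Cache-Oblivious Metrics ● Arithmetic cache-oblivious metric: arithmetic mean of arc lengths (the MLA metric), $\frac{1}{|A|} \sum_{(i,j) \in A} l_{ij}$ ● Geometric cache-oblivious metric: geometric mean of arc lengths, $\left( \prod_{(i,j) \in A} l_{ij} \right)^{1/|A|}$ ● $l_{ij} = |\varphi(i) - \varphi(j)|$: arc length of arc (i, j)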
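
Both means can be computed in a few lines. This is a minimal sketch with a hypothetical layout and arc list; the geometric mean is evaluated through logarithms to avoid overflow on large graphs.

```python
import math


def arc_lengths(arcs, layout):
    """Distance in the 1D layout between the endpoints of each arc."""
    return [abs(layout[i] - layout[j]) for i, j in arcs]


def arithmetic_co_metric(arcs, layout):
    """Arithmetic mean of arc lengths."""
    lengths = arc_lengths(arcs, layout)
    return sum(lengths) / len(lengths)


def geometric_co_metric(arcs, layout):
    """Geometric mean of arc lengths, computed as exp(mean(log(l)))."""
    lengths = arc_lengths(arcs, layout)
    return math.exp(sum(math.log(length) for length in lengths) / len(lengths))


# Hypothetical example: two arcs of length 1 and two arcs of length 4.
layout = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
arcs = [("a", "b"), ("b", "a"), ("a", "e"), ("e", "a")]
arithmetic = arithmetic_co_metric(arcs, layout)
geometric = geometric_co_metric(arcs, layout)
```

The geometric mean penalizes a few long arcs less than the arithmetic mean does, which is why the two metrics can rank layouts differently.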

29 Validation for Cache-Oblivious (CO) Metrics ● Geometric cache-oblivious metric ● Practical and useful [Plot: number of cache misses when M = 1 (log scale) for the geometric CO layout and the arithmetic CO layout; annotations: "97% of tested block sizes" and "73% of tested power-of-two block sizes"]

30 Correlations between Metrics and Observed Number of Cache Misses ● Linear correlation [-1, 1] with the observed number of cache misses ● Geometric CO metric: 0.98 with 1 cache block, 0.81 with 5 cache blocks ● Arithmetic CO metric: -0.19 with 1 cache block, -0.32 with 5 cache blocks ● Tested block size = 4KB ● Tested with 10 different layouts on a uniform grid
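
An experiment of this kind could be set up roughly as below. This is only a sketch: it uses a small 8x8 grid, one row-by-row layout plus nine random layouts, a seeded random walk, and an assumed LRU cache, so it does not reproduce the numbers reported on the slide.

```python
import math
import random
from collections import OrderedDict


def grid_arcs(width, height):
    """Two opposite directed arcs between 4-connected grid cells."""
    arcs = []
    for y in range(height):
        for x in range(width):
            node = y * width + x
            if x + 1 < width:
                arcs += [(node, node + 1), (node + 1, node)]
            if y + 1 < height:
                arcs += [(node, node + width), (node + width, node)]
    return arcs


def geometric_metric(arcs, layout):
    """Geometric mean of arc lengths; layout[node] is the node's position."""
    logs = [math.log(abs(layout[i] - layout[j])) for i, j in arcs]
    return math.exp(sum(logs) / len(logs))


def simulated_misses(walk, layout, block_size, num_blocks):
    """Block misses along a walk, assuming LRU replacement."""
    cache, misses = OrderedDict(), 0
    for node in walk:
        block = layout[node] // block_size
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)
    return misses


def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
width = height = 8
arcs = grid_arcs(width, height)
successors = {}
for i, j in arcs:
    successors.setdefault(i, []).append(j)

walk = [0]  # random walk with uniform choice among adjacent cells
for _ in range(5000):
    walk.append(rng.choice(successors[walk[-1]]))

layouts = [list(range(width * height))]  # row-by-row layout
for _ in range(9):                       # plus nine random layouts
    order = list(range(width * height))
    rng.shuffle(order)
    layouts.append(order)

metrics = [geometric_metric(arcs, layout) for layout in layouts]
misses = [simulated_misses(walk, layout, block_size=4, num_blocks=1)
          for layout in layouts]
correlation = pearson(metrics, misses)
```

The block size, walk length, and choice of layouts here are arbitrary; a faithful comparison would use the layouts and cache parameters listed on the slide.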

31 Efficient Layout Computation for Our Metrics ● Cache-aware layouts ● Optimized with cache-aware metric given a block size B ● Computed from the graph partitioning ● Geometric cache-oblivious metric ● Very efficient ● Can be used in different layout methods

32 Layout Computation with Geometric Cache-Oblivious Metric ● Multi-level construction method ● Partition an input mesh into k different sets ● Lay out partitions based on our metric ● Generalized layout method for unstructured meshes [Figure: 1. Partition 2. Lay out]
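
The "lay out" step for one level can be illustrated with a brute-force search over orderings of k partitions. This is a toy sketch, not the construction used in the presentation: it assumes k is small enough to enumerate all permutations, and the partition names and arcs between them are hypothetical.

```python
import math
from itertools import permutations


def geometric_metric(arcs, layout):
    """Geometric mean of arc lengths for a layout {item: position}."""
    logs = [math.log(abs(layout[i] - layout[j])) for i, j in arcs]
    return math.exp(sum(logs) / len(logs))


def best_order(parts, arcs):
    """Exhaustively pick the ordering of parts with the lowest metric."""
    def cost(order):
        layout = {part: pos for pos, part in enumerate(order)}
        return geometric_metric(arcs, layout)
    return min(permutations(parts), key=cost)


# Hypothetical coarse graph: four partitions connected in a chain.
parts = ["p0", "p1", "p2", "p3"]
arcs = [("p0", "p1"), ("p1", "p0"), ("p1", "p2"),
        ("p2", "p1"), ("p2", "p3"), ("p3", "p2")]
order = best_order(parts, arcs)
order_cost = geometric_metric(arcs, {p: k for k, p in enumerate(order)})
```

In a multi-level scheme, the same ordering step would be applied recursively inside each partition; exhaustive search is only feasible because k is kept small.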

34 Applications ● Isosurface extraction ● View-dependent rendering

35 Iso-Surface Extraction ● Uses contour tree [van Kreveld et al. 97] ● Runtime is dominated by the traversal of iso-surface ● Layout graph ● Use an input tetrahedral mesh [Figure: Spx model (140K vertices)]

36 High Correlation with Number of Cache Misses ● Linear correlation [-1, 1] between the geometric CO metric and the observed number of cache misses: 0.99 with 1 cache block, 0.98 with 10K cache blocks ● Tested block size = 4KB ● Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral [Juvan and Mohar 92], cache-oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis sorted layouts

37 High Correlation with Runtime Performance ● Linear correlation [-1, 1] between the geometric CO metric and iso-surface extraction time: 0.94 ● First iso-surface extraction time: disk I/O time is major bottleneck ● Second iso-surface extraction time: memory access time is major bottleneck

38 Comparison with Other Layouts The first iso-surface extraction time (sec) 8% - 77% improvement and very close to the cache-aware performance

39 View-Dependent Rendering ● Lay out vertices and triangles of progressive meshes ● Used in an efficient VDR system [Yoon et al. 04] ● Reduce misses in GPU vertex cache

40 Cache Miss Ratio on Bunny Model [Plot: GPU vertex cache miss ratio versus vertex cache size, for Hoppe's rendering sequence [Hoppe 99], the theoretical lower bound [Bar-Yehuda and Gotsman 96], the universal rendering sequence [Bogomjakov and Gotsman 02], and the geometric CO layout]

41 Cache Miss Ratio on Power Plant Model [Plot: GPU vertex cache miss ratio versus vertex cache size, for Z-curve, Hoppe’s rendering sequence [Hoppe 99], the theoretical lower bound [Bar-Yehuda and Gotsman 96], COML [Yoon et al. 05], and the geometric CO layout]

42 Conclusion ● Novel cache-aware and cache-oblivious metrics to evaluate layouts ● Derived metrics based on two-level I/O model ● Improved the performance of applications without modifying code ● OpenCCL, open source library: http://gamma.cs.unc.edu/COL/OpenCCL

43 Ongoing and Future Work ● Derive a lower bound on our geometric cache-oblivious metric ● Employ mesh compression to further reduce disk I/O accesses ● Investigate efficient layout method for deforming models ● Apply to non-graphics applications ● e.g., shortest path or other graph computations

44 Cache-Efficient Layouts of Bounding Volume Hierarchies ● Yoon and Manocha, Eurographics 06 ● Ray tracing ● Collision detection

45 Acknowledgements ● Ajith Mascarenhas ● Martin Isenburg ● Dinesh Manocha ● Fabio Bernardon, Joao Comba, and Claudio Silva ● For their unstructured tetrahedra rendering program ● Members of data analysis group in LLNL ● Anonymous reviewers

46 UCRL-PRES-225448 This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

47 Additional slides

48 Cache-Coherence of Layouts ● Well known heuristics for cache-coherent layouts ● Space-filling curves [Sagan 94] ● How can we compute better layouts? ● Requires metrics measuring cache-coherence of layouts

49 Main Results ● Define cache-coherence of layout as: ● Expected number of cache misses during random walks of a graph given block-based caches ● Then, the expected number of cache misses = Number of straddling arcs in a cache-aware case ≈ Geometric mean of arc lengths in a cache-oblivious case

50 Data Layout Optimization ● Rendering sequences ● Triangle strips ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02] ● Processing sequences ● [Isenburg and Gumhold 03, Isenburg and Lindstrom 05] Assume that access pattern globally follows the layout order

51 Data Layout Optimization ● Space-filling curves ● [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04] Assume geometric regularity

52 Data Layout Optimization ● Graph and matrix layout ● A survey [Diaz et al. 02] ● Minimum linear arrangement (MLA) ● Bandwidth ● Profile ● Wavefront, etc. Does not necessarily produce good layouts for block-based caches

53 Cache-Aware Metric ● Expected number of cache misses given a block size B

54 Correlation between Cache-Aware Metric and Observed Number of Cache Misses [Plot: cache-aware metric versus observed number of cache misses, for M = 5 and M = 25; cache block size = 4KB; R^2 = 0.97]

55 Evaluating Existing Layouts ● No known tight bound ● Compare against the best layout we can construct ● Employ an efficient sampling method [Flowchart: given an existing layout, φ, is it close to the optimal layout? If so, use it; otherwise, build a new one]

56 Implementation ● Modify our open source layout computation codes, OpenCCL ● Based on METIS graph partitioning library [Karypis and Kumar 98] ● Processing speed ● 15k triangles / second

57 Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05] ● Two major improvements ● Accuracy ● Usability

58 Pros and Cons ● Limitations ● Not directly applicable to dynamic models ● Pros ● Generality ● Can have benefits without modifying underlying codes

59 Specialization to Meshes ● Assume an equal likelihood of accessing each adjacent node given a node [Figure: an input mesh and its implicitly derived graph, with nodes n_a, n_b, n_c, n_d]

60 Two Possible Block Size Progressions ● Arithmetic progression ● 1, 2, 3, 4, … ● Geometric progression ● 2^0, 2^1, 2^2, 2^3, … ● Well reflects current caching architectures ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc. [Plots: Pr(B) versus B for a geometric distribution and a uniform distribution]

61 Cache Miss Ratio with Different Mesh Resolutions of Bunny Model [Plot: GPU vertex cache miss ratio (cache size: 32) versus mesh resolution, for Hoppe’s rendering sequence [Hoppe 99] and COLg]

62 Correlation between Cache-Aware Metric and Number of Cache Misses [Plot: observed number of cache misses for M = 5 and M = 25; cache block size = 4KB; R^2 = 0.97]

63 Correlations with Observed Number of Cache Misses [Plots: arithmetic CO metric and geometric CO metric versus cache misses, with one block and with multiple blocks; annotations: R^2 = 0.98 and R^2 = 0.81; cache block size = 4KB]

64 Comparison with Other Layouts ● Up to 2X speedup ● 9% lower than cache-aware layout [Chart: X-axis, BFL, spectral layout [Juvan and Mohar 92], DFL, COML [Yoon et al. 05], Z-curve, COLg, and cache-aware layouts; correlation values shown: 0.99, 0.98, 0.94]

65 Comparison with Other Layouts ● Compute eight different layouts ● Our geometric cache-oblivious layout ● Our cache-aware layout ● Breadth-first layout ● Depth-first layout ● Spectral layout [Juvan and Mohar 92] ● Cache-oblivious mesh layout [Yoon et al. 05] ● Z-curve [Sagan 94] ● X-axis sorted layout

66 Our Layout of Bunny Model
