1 Many-Core Graph Workload Analysis
Stijn Eyerman, Wim Heirman, Kristof Du Bois, Joshua B. Fryman, Ibrahim Hur
Intel Corporation

2 Graph Analysis: a growing domain, different from classic HPC
Many Big Data sets can be represented as a graph: social networks, IP network traffic, road networks, physics models, etc.
Graph analytics can reveal interesting information: detecting patterns and clusters, shortest-path calculation, search problems.
Differences with 'classic' HPC:
- Sparse data: the connection matrix has only a small fraction of non-zeros
- Light computation: walking through the graph makes most algorithms memory bound
- Data dependent: processing time depends on the number of neighbors, which can be highly unbalanced

3 Prior work observations of graph applications
Graph applications are memory bound, but do not consume the full memory bandwidth [4].
Graph applications have high cache and TLB miss rates and low prefetcher accuracy [10], but also contain memory streams that can benefit from caching and prefetching.
Vector units are underutilized [16].
The optimal thread count and the benefit of multi-threading are variable [16].

4 Goals of our study
Study the performance of graph applications on contemporary architectures.
Find the root causes of the observed behavior through detailed microarchitectural simulation.
Project performance for a future graph-processing architecture.
Provide recommendations for an efficient graph processor.

5 Benchmarks
GAP benchmark suite (UC Berkeley): high-performance, low-level (C++/OpenMP) implementations; higher performance than high-level frameworks [22].
Applications:
- PageRank (pr): website popularity
- Triangle count (tc): graph density metric
- Connected components (cc): split disjoint graphs
- Breadth-first search (bfs): search starting from one vertex
- Single-source shortest path (sssp): shortest path from one vertex to all others
- Betweenness centrality (bc): vertex centrality metric
Input sets: synthetic RMAT matrices with an exponential degree distribution (many vertices with few neighbors, few vertices with many neighbors), at scales 20, 22 and 24 (log2 of the vertex count).

6 Methodology
Run the benchmarks on two machines, from 1 thread up to the maximum thread count:
- Intel Xeon Skylake server (SKX): 26 cores, 2.4 GHz, 39 MB L3 cache, 115 GB/s DDR, 2 threads per core
- Intel Xeon Phi Knights Landing (KNL): 64 cores, 1.4 GHz, 1 MB L2 per 2 cores, 460 GB/s MCDRAM, 4 threads per core
Simulate the same machines using an in-house detailed simulator (based on the Sniper multicore simulator):
- In-depth profiling of memory, caches, prefetcher, core pipeline, etc.
- Find the main bottlenecks and their causes
Simulate a hypothetical many-core graph processor:
- 512 single-issue in-order cores, 4 threads per core, 1 GHz
- No L2/L3 caches, no prefetcher, high-bandwidth memory (400 GB/s)

7 Pagerank scaling on SKX and KNL
Good scaling with 1 thread per core; hyperthreading is not beneficial.
An SKX core performs about 4x better than a KNL core: 2x pipeline width, 1.7x higher frequency, larger caches.
[Scaling charts: series for 1, 2 and 4 threads per core, with a perfect-scaling reference.]

8 Single source shortest path scaling on SKX and KNL
Scales worse than pagerank, especially for small graphs.
Hyperthreading even decreases performance.

9 Execution profile: pagerank on KNL (scale 24, 64 threads)
Memory latency bound, but low bandwidth utilization.

10 Execution profile: SSSP on SKX (scale 24, 26 threads)
Not enough parallelism: single source, load imbalance.
Memory latency bound.
Branch predictor misses.

11 Execution profile: SSSP on KNL (scale 24, 64 threads)
Larger impact of low parallelism.
Low bandwidth utilization.

12 Main observations
Low core pipeline usage: < 10% of peak IPC.
All applications are memory latency bound, but bandwidth utilization is low.
Some applications have phases with low parallelism and load imbalance.
Hyperthreading (SMT) is not beneficial.
→ The simulator profiles explain why this happens.

13 High cache miss rates due to low reuse
[Histogram of bytes used per cache line: 0 bytes used (unused prefetch), only one 32-bit element used, full 64-byte cache line used.]

14 Why low reuse and low prefetch coverage?
[Diagram: CSR representation of the graph, with an index pointer list (no locality), concatenated per-vertex neighbor lists (some locality), and a per-vertex value array (no locality).]
This explains the peak at 1 element used (index pointer list and values) and at all elements used (neighbor list).
Unused prefetches: the neighbor list is prefetched too far.
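To make the access pattern concrete, here is a minimal sketch of a CSR neighbor walk, annotated with the reuse behavior described above. This is an illustration, not the GAP code itself; the array names, 32-bit indices/values and the worklist-driven traversal are assumptions.

```cpp
// Sketch of a CSR traversal and its cache-line reuse behavior (assumed names/types).
#include <cstdint>
#include <vector>

struct CsrGraph {
    std::vector<uint64_t> row_ptr;   // index pointer list, size n+1
    std::vector<uint32_t> neighbors; // concatenated neighbor lists, size m
};

float process(const CsrGraph& g, const std::vector<float>& values,
              const std::vector<uint32_t>& worklist) {
    float acc = 0.0f;
    for (uint32_t v : worklist) {
        // Vertices arrive in an irregular order, so typically only one element
        // of the cache line holding row_ptr[v] is used: little reuse.
        uint64_t begin = g.row_ptr[v], end = g.row_ptr[v + 1];
        for (uint64_t i = begin; i < end; ++i) {
            // The neighbor list of v is a contiguous stream: full 64-byte lines
            // are consumed, but the prefetcher can run past the end of the
            // (often short) list, producing unused prefetches.
            uint32_t u = g.neighbors[i];
            // The per-vertex value of u is a random access: only one 32-bit
            // element of the fetched cache line is used.
            acc += values[u];
        }
    }
    return acc;
}
```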

15 Why low memory bandwidth usage?
Few prefetches, mostly demand misses: no background traffic.
Some applications have low memory-level parallelism due to pointer-chasing code and atomics (see the sketch below).
Higher latencies due to TLB misses: large data sets with low locality.
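As an illustration of the atomics point, a hedged sketch of an SSSP-style edge relaxation (names and types are assumptions, not the GAP implementation). The atomic read-modify-write on the destination distance, and the fact that each address depends on a previously loaded value, keep only a few misses in flight, which is one reason demand misses alone do not saturate bandwidth.

```cpp
// Sketch: atomic-min relaxation of a distance value (assumed data layout).
#include <atomic>
#include <cstdint>
#include <vector>

void relax(std::vector<std::atomic<uint32_t>>& dist,
           uint32_t dst, uint32_t new_dist) {
    uint32_t cur = dist[dst].load(std::memory_order_relaxed);
    // CAS loop: the update must first observe the current value, serializing
    // the accesses to this vertex instead of overlapping them.
    while (new_dist < cur &&
           !dist[dst].compare_exchange_weak(cur, new_dist,
                                            std::memory_order_relaxed)) {
        // cur was refreshed by the failed CAS; retry while still an improvement.
    }
}
```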

16 Many Small Cores (MSC)
In-order cores, because of the low IPC on out-of-order cores and the memory boundedness.
512 cores with 4 threads per core, to saturate bandwidth with demand misses.
No large caches and no prefetcher, because of the low locality.
High-bandwidth memory, to sustain many in-flight memory operations.

17 Scaling on MSC

18 Pagerank on MSC
Solution: fetch only 4 bytes if there is no reuse.

19 SSSP on MSC
Solution: a heterogeneous design with a few high-performance cores to speed up phases with a low active thread count.

20 Conclusions
Graph applications are not a good fit for general-purpose computers:
- Low cache, TLB and branch predictor hit rates
- Low IPC
- Bandwidth underutilization
A graph processing architecture should have:
- Many small cores, to issue many memory operations
- High-bandwidth memory
- Some large cores, for phases with low parallelism
- Small caches/scratchpads with a selective caching policy:
  - No locality: not cached, fetch only 1 element to save bandwidth
  - Locality: cached, fetch the full cache line to exploit spatial locality

21 Future work Further optimize the graph processor:
- Implement non-cached 1-element (4-byte or 8-byte) memory loads
- Find heuristics for what to cache and what not to cache, for optimal performance and efficient bandwidth usage; implemented in hardware or in software?
- A prefetcher for indirect memory accesses (a software analogue is sketched below)
- A scheduling policy for heterogeneous designs, to reduce load imbalance
Look for other algorithms with less synchronization and load imbalance, to enable massive parallelism.
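To illustrate the indirect-prefetch idea, a software analogue using the GCC/Clang `__builtin_prefetch` builtin; the array names and the lookahead distance are assumptions, and a hardware prefetcher would do the equivalent transparently.

```cpp
// Sketch: software prefetching for an indirect access pattern values[neighbors[i]].
#include <cstdint>
#include <vector>

float sum_neighbor_values(const std::vector<uint32_t>& neighbors,
                          const std::vector<float>& values) {
    constexpr size_t D = 16;  // lookahead distance; would need tuning
    float acc = 0.0f;
    for (size_t i = 0; i < neighbors.size(); ++i) {
        if (i + D < neighbors.size()) {
            // neighbors[] is a sequential stream (easy for hardware prefetchers);
            // values[neighbors[i + D]] is the hard-to-predict indirect access,
            // so we issue its miss ahead of time.
            __builtin_prefetch(&values[neighbors[i + D]], /*rw=*/0, /*locality=*/0);
        }
        acc += values[neighbors[i]];
    }
    return acc;
}
```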


23 Specialized graph processors
Cray Urika [9,15]:
- Many small cores, many threads per core to hide memory latency
- No large caches
- Memory-coherent network
Several proposals for accelerators:
- Sparse matrix accelerator [25]
- Processing-in-memory [2]
- Dedicated accelerators [12,20]

24 Execution profile: pagerank on SKX (scale 24, 26 threads)
Most pagerank values fit in cache

25 Prefetcher performance
[Chart: total prefetches vs. useful prefetches, and misses avoided vs. misses without the prefetcher.]

26 Heterogeneous configuration to speed up sequential sections
Replace 64 small MSC cores with 4 SKX cores.
If no thread is executing on an SKX core → move a thread from an MSC core to the SKX core (see the sketch below).
This ensures that the SKX cores execute threads when the active thread count is low.
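A minimal sketch of that migration rule; the types and function name are hypothetical, and the actual policy lives inside the simulator's scheduler.

```cpp
// Sketch: whenever a big (SKX-class) core is idle, steal a runnable thread
// from a small (MSC) core, so low-parallelism phases run on the fast cores.
#include <optional>
#include <vector>

struct Core { bool is_big; std::optional<int> running_thread; };

void rebalance(std::vector<Core>& cores) {
    for (Core& big : cores) {
        if (!big.is_big || big.running_thread) continue;  // only idle big cores
        for (Core& small : cores) {
            if (!small.is_big && small.running_thread) {
                big.running_thread = small.running_thread;  // migrate the thread
                small.running_thread.reset();
                break;
            }
        }
    }
}
```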

27 SSSP: homogeneous versus heterogeneous

28 Why is there no benefit from SMT?
Most applications are memory latency bound → SMT should help hide latencies.
Problem 1: threads on one core share the L1 and L2 caches.
- Threads create many useless cache line fetches and evictions.
- Threads evict each other's cache lines, even the useful ones → performance degrades.
Problem 2: more threads mean more load imbalance.
- The longest-running thread also takes longer, because it competes with another thread at the beginning of its execution.
Result: similar or even lower performance.

