Heterogeneous Memory Subsystem for Natural Graph Analytics


1 Heterogeneous Memory Subsystem for Natural Graph Analytics
Abraham Addisie, Hiwot Kassa, Opeoluwa Matthews, Valeria Bertacco
University of Michigan. IISWC 2018, October 2, 2018, Raleigh, NC

2 Graph Applications: Challenges and Solutions
Challenges:
- The memory subsystem is a huge bottleneck to performance
- The few hardware solutions available are inflexible

Graph applications are being applied in a wide range of areas. In search engines, applications like PageRank are used to provide high-quality search results. In social networks, they are used to analyze interactions among users and, for instance, map an organized crime group. In medical centers, they are used to understand functional brain connectivity, aiding the diagnosis of brain-related diseases such as tumors. At the same time, the amount of data generated keeps increasing: social networks see a larger number of online users each day, and a high-resolution MRI device can generate a very large functional-brain-connectivity graph. This growing data size creates two challenges for existing hardware solutions: the memory subsystem is becoming a huge bottleneck, and the few hardware solutions available today are inflexible. To address the first challenge, we design a specialized memory-subsystem architecture that extracts the locality present in common natural graphs. To address the second challenge, we keep the general-purpose cores intact and focus the optimization effort on the memory subsystem, which makes it possible for many graph frameworks to benefit from our solution.

Solutions:
- Extract the locality that exists in common natural graphs
- Keep general-purpose cores and applications unmodified

3 Background: Graph Algorithms
PageRank execution flow. Key operation (performed atomically):
[atomic_]PR[dest] += PR[src]/src.outDegree
[Figure: example nine-vertex graph showing +PR contributions flowing along edges to each destination]
Key data structures:
- vertex property (vtxProp): accessed randomly
- edge list (edgelist): accessed sequentially
- non-graph data (nongraph)
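As a minimal sketch of this key operation (illustrative names, assuming a push-style iteration over the edge list; not the paper's implementation):

#include <atomic>
#include <vector>

struct Edge { int src, dest; };

// One push-style PageRank iteration: sequential reads of the edge list,
// random atomic updates to the vtxProp array next_pr.
void pagerank_iteration(const std::vector<Edge>& edgelist,
                        const std::vector<int>& out_degree,
                        const std::vector<double>& curr_pr,
                        std::vector<std::atomic<double>>& next_pr) {
    for (const Edge& e : edgelist) {
        double contrib = curr_pr[e.src] / out_degree[e.src];
        // Atomic accumulation: many edges may target the same destination,
        // so the read-modify-write must retry if another core raced on it.
        double old = next_pr[e.dest].load();
        while (!next_pr[e.dest].compare_exchange_weak(old, old + contrib)) {}
    }
}

The sequential edge-list scan and the random vtxProp updates are exactly the two access patterns the heterogeneous memory subsystem separates.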

4 Background: Power-Law Distribution
The degree of many graphs follows a power-law distribution.
[Figure: #vertices vs. #indegree edges, with an example nine-vertex graph; larger circles denote highly-connected vertices]
- An instance of the power law: 20% of the vertices connect 80% of the indegree edges
- Common due to preferential attachment
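A quick way to check this 20%/80% property on a concrete graph (an illustrative helper, not from the paper):

#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Returns the fraction of all indegree edges held by the top 20% of vertices;
// for power-law graphs this comes out near 0.8, for road networks much lower.
double top20_indegree_share(std::vector<long> indegree) {
    std::sort(indegree.begin(), indegree.end(), std::greater<long>());
    long total = std::accumulate(indegree.begin(), indegree.end(), 0L);
    if (total == 0) return 0.0;
    auto top = indegree.begin() + indegree.size() / 5;   // most-connected 20%
    return static_cast<double>(std::accumulate(indegree.begin(), top, 0L)) / total;
}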

5 Graph Workload Characterization
[Table: percentage of accesses to the vtxProp of the 20% most-connected vertices, for eight algorithms (PageRank, Breadth-First Search, Single-Source Shortest Path, Betweenness Centrality, Radii, Connected Components, Triangle Counting, K-Core) across eleven datasets (slashdot, ca-AstroPh, rMat, orkut, enwiki, ljournal, indochina, uk, roadNet-PA, roadNet-CA, west-USA); most entries exceed 70%, while the road networks stay near 20-30%]
Key observations:
- For most graphs, >70% of vtxProp accesses target 20% of the vertices
- Exceptions: road networks (e.g., rCA)

6 OMEGA Architecture: Baseline CMP vs. OMEGA
Key ideas:
- Heterogeneous memory subsystem architecture:
  - Scratchpads (SPs) store the most-connected vertices' vtxProp
  - Caches store the edgelist, nongraph data, and the least-connected vertices' vtxProp
- Processing in Scratchpad (PISC): computes atomic operations on vtxProp in situ
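A minimal sketch of the resulting placement rule (assuming the indegree-based reordering described on the preprocessing slide; names are illustrative):

// The most-connected vertices' vtxProp is homed in the scratchpads; all other
// data (cold vtxProp, edgelist, nongraph) flows through the regular caches.
enum class Home { Scratchpad, Cache };

Home vtxprop_home(int vertex_id, int hot_count) {
    // After reordering, IDs [0, hot_count) are the most-connected vertices.
    return vertex_id < hot_count ? Home::Scratchpad : Home::Cache;
}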

7 Execution of Atomic Operations without PISC
Example: Core0 runs PageRank starting at vertex V4.
- V1's current value is read from the remote scratchpad (SP1); Core0 updates its value
[Figure: message flow with req 4 / res 4 to the local SP and req 1 / res 1 to SP1]
Costs: on-chip latency & traffic, locking overhead, core energy consumption.

8 Offloading Atomic Operations to PISC
Example: Core0 runs PageRank starting at vertex V4.
- The V1 update is offloaded to SP1's PISC
[Figure: message flow with req 4 / res 4 to the local SP and a single one-way upd 1 to SP1]
Savings: on-chip latency & traffic, locking overhead, core energy consumption.
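The contrast between the two flows, as an illustrative sketch (the structures and message names are assumptions, not the actual OMEGA protocol):

#include <mutex>
#include <queue>
#include <vector>

struct Scratchpad {
    std::vector<double> vtxprop;           // hot vertex properties homed here
    std::mutex lock;                       // stands in for line-level locking

    struct Update { int dest; double contrib; };
    std::queue<Update> pisc_inbox;         // incoming one-way "upd" messages

    // Without PISC: the requesting core performs the read-modify-write itself,
    // paying a request/response round trip plus locking for a remote vertex.
    void core_side_update(int dest, double contrib) {
        std::lock_guard<std::mutex> guard(lock);   // locking overhead on the core
        double v = vtxprop[dest];                  // "req"/"res" round trip
        vtxprop[dest] = v + contrib;               // second network traversal
    }

    // With PISC: the core fires a single update message and moves on.
    void offload_update(int dest, double contrib) {
        pisc_inbox.push({dest, contrib});          // one-way "upd", no response
    }

    // The PISC unit applies queued updates in situ, atomically at this SP.
    void pisc_drain() {
        while (!pisc_inbox.empty()) {
            Update u = pisc_inbox.front();
            pisc_inbox.pop();
            vtxprop[u.dest] += u.contrib;
        }
    }
};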

9 Source Vertex Buffer
Minimizes remote accesses to the source vertex's data.
SSSP atomic update function (the access to the source vertex is ShortestLen[src]):
[atomic_]ShortestLen[dest] = min(ShortestLen[dest], ShortestLen[src] + edgeLen)
Example: Core0 runs SSSP starting at vertex V3.
- First V3 access: served from SP1 and cached in the buffer
- Subsequent V3 accesses: served from the buffer
[Figure: message flow — a single req 3 / res 3 pair fills the buffer, after which upd messages carry the destination updates]
The buffer is read-only and not coherent with the other SPs.
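A minimal sketch of such a per-core, direct-mapped source-vertex buffer (illustrative names; only the 32-entry size comes from the experimental-setup slide):

#include <array>
#include <cstdint>

struct SrcVtxBuffer {
    static constexpr int kEntries = 32;   // size from the experimental-setup slide
    struct Entry { int32_t vertex = -1; double value = 0.0; };
    std::array<Entry, kEntries> entries;  // direct-mapped, read-only copies

    // Fetch is any callable performing the remote scratchpad read (e.g. to SP1).
    template <typename Fetch>
    double read(int32_t vertex, Fetch&& fetch_remote) {
        Entry& e = entries[vertex % kEntries];
        if (e.vertex != vertex) {                     // first access: buffer miss
            e = Entry{vertex, fetch_remote(vertex)};  // req/res to the home SP
        }
        return e.value;                               // later accesses: local hit
    }
};

Because the buffer is read-only and refilled on a miss, no coherence traffic with the other SPs is needed.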

10 Graph Preprocessing
Graph reordering identifies the most-connected vertices; indegree-based reordering is preferable.
[Figure: example nine-vertex graph before and after reordering, mapped onto an OMEGA node]
- Most-connected vertices ([V0-V1] in the example): mapped to the SPs
- Least-connected vertices ([V2-V9]): stored in the caches
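A sketch of indegree-based reordering (illustrative, not the paper's exact preprocessing pass): relabel vertices so the most-connected ones receive the lowest IDs, letting a simple ID threshold map them onto the scratchpads.

#include <algorithm>
#include <numeric>
#include <vector>

// Maps each old vertex ID to a new ID such that new IDs [0, k) belong to the
// k most-connected vertices (by indegree).
std::vector<int> reorder_by_indegree(const std::vector<long>& indegree) {
    std::vector<int> order(indegree.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return indegree[a] > indegree[b]; });
    std::vector<int> new_id(indegree.size());
    for (int rank = 0; rank < static_cast<int>(order.size()); ++rank)
        new_id[order[rank]] = rank;          // old ID -> new ID
    return new_id;
}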

11 High-Level Framework Modifications
A source-to-source transformation tool:
- Transforms the atomic update function
- Extracts scratchpad & PISC configuration parameters (e.g., #vertices, atomic-operation type)
Transforming the atomic update function:
Before: [atomic_]next_PR[dest] += curr_PR[src]/src.degree
After:  *mmr1 = curr_PR[src]/src.degree; *mmr2 = dest
(*mmr = memory-mapped register)
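A sketch of what the transformed code might look like (the register addresses and the convention that writing the destination register triggers the operation are assumptions for illustration; the slide only specifies that the value and the destination go to memory-mapped registers):

#include <cstdint>

// Hypothetical MMR addresses; the actual mapping is not given on the slide.
volatile double*  const mmr1 = reinterpret_cast<volatile double*>(0xFFFF0000u);
volatile int32_t* const mmr2 = reinterpret_cast<volatile int32_t*>(0xFFFF0008u);

inline void offload_pagerank_update(int src, int dest,
                                    const double* curr_PR, const int* degree) {
    // Was: [atomic_]next_PR[dest] += curr_PR[src]/src.degree
    *mmr1 = curr_PR[src] / degree[src];  // stage the operand value
    *mmr2 = dest;                        // writing the destination fires the PISC op
}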

12 Comparison with Prior Works
[Table comparing Beamer et al. [IISWC'15], Graphicionado [MICRO'16], Tesseract [ISCA'15], GraphPIM [HPCA'17], and OMEGA along five axes: whether the design leverages the power law (yes/no), memory subsystem (general, specialized, or heterogeneous), framework flexibility (yes, limited, or no), on-chip traffic (high or low), and vtxProp access latency and energy (limited, low, or high). OMEGA is the only design combining a heterogeneous, power-law-aware memory subsystem with framework flexibility, low on-chip traffic, and low vtxProp access latency and energy.]

13 Experimental Setup (gem5)
Common:
- 16 cores, OoO, 8-wide
- L1 I/D cache: 16KB, 4/8-way, private
Baseline-specific:
- L2 cache: 2MB per core, 8-way, shared
OMEGA-specific:
- Scratchpad + L2 cache = baseline's L2 cache (L2 cache: 1MB per core & scratchpad: 1MB per core)
- PISC has insignificant area and power overhead (<<1%)
- Source vertex buffer: 32 entries
Graph framework: Ligra

14 Workload Description
Graph algorithms:
- PageRank: PageRank
- BFS: Breadth-First Search
- SSSP: Single-Source Shortest Path
- BC: Betweenness Centrality
- Radii: Radii
- CC: Connected Components
- TC: Triangle Counting
Graph datasets:
- rMat: synthetic (medium)
- sd: slashdot, social network (medium)
- ap: astroPh, collaboration network (medium)
- rCA: roadNet-CA, road network (medium)
- rPA: roadNet-PA, road network (medium)
- ic: indochina, web graph (large)
- wiki: enwiki (large)
- orkut: Orkut (large)
- lj: ljournal (large)
Missing from our workload characterization: K-Core (similar characteristics to TC) and some datasets (long simulation times).

15 Performance Analysis
[Figure: speedup over the baseline CMP for medium and large graphs]
>2x speedup on average.
Key observations:
- Both medium and large graphs achieve similar speedups
- Significant speedup even for the non-power-law graphs (rCA and rPA), which fit in the scratchpads

16 Comparison of Off- and On-chip Communication
- Off-chip: PageRank achieves 2.28x better bandwidth utilization over the CMP baseline
- On-chip: PageRank achieves a 3.2x on-chip traffic reduction over the CMP baseline

17 Energy Analysis (PageRank)
2.5x energy saving over the baseline CMP.
Main sources of energy savings:
- Higher speedup
- Fewer DRAM accesses
- Lower access energy for the scratchpad compared to a cache

18 Conclusions
A heterogeneous memory subsystem provides significant performance/energy improvements for power-law graphs. OMEGA:
- provides over 2x speedup on average
- achieves over 2.5x energy savings on PageRank
- does not incur area overhead

19 Backup Slides

20 Performance on Large-Scale Datasets
- 1.68x speedup for PageRank on twitter, storing only 5% of the vtxProp in the scratchpads
- 1.35x speedup for BFS on twitter, storing only 10% of the vtxProp

21 Performance on non-power-law graphs
Only 1.15x speedup on a large non-power-law graph

22 Comparison of On-chip Storage Access
OMEGA's scratchpad + cache storage provides over a 70% hit rate for power-law graphs

