Accelerating Linked-list Traversal Through Near-Data Processing

1 Accelerating Linked-list Traversal Through Near-Data Processing
Byungchul Hong, Gwangsun Kim, John Kim, Yongkee Kwon, Hongsik Kim, Jung Ho Ahn

2 Background : Linked-list
A collection of data elements of any type, chained through nodes (node = data + pointer)
Linked-list traversal (LLT): accessing data elements through pointer-chasing
To retrieve D2: ptr0 -> Dnew -> D1 -> D2, i.e., 4 sequential memory accesses
(+) Easy to insert/delete
(+) Variable types of data
(-) Slow data retrieval
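As a concrete illustration of the pointer-chasing above, here is a minimal C sketch (the struct layout and the llt_find helper are illustrative, not from the paper); each loop iteration must wait for the previous load to return before it knows the next address.

```c
#include <stddef.h>

struct node {
    int          data;   /* payload (D1, D2, ...) */
    struct node *next;   /* pointer to the next node */
};

/* Each iteration issues a dependent load: the address of the next node is
 * not known until the current node has been fetched, so the accesses are
 * inherently sequential. */
struct node *llt_find(struct node *head, int key)
{
    for (struct node *cur = head; cur != NULL; cur = cur->next)
        if (cur->data == key)
            return cur;
    return NULL;
}
```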

3 Background : Linked-list
Linked-list is the basis of important data structures:
Hash table (key/value data store): a hash function maps each key to an index into the buckets (a head array of linked-lists)
Adjacency list of graphs
[Diagram: keys such as "snu", "kaist", "seoul", "daejeon" hashed to bucket indices, each bucket chaining key/value nodes]
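A minimal sketch of the bucket-array-plus-chain structure, assuming a hypothetical bucket count and hash function; a lookup hashes the key once and then pointer-chases within one bucket's list.

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 1024                  /* hypothetical table size */

struct entry {
    const char   *key;
    const char   *value;
    struct entry *next;                /* chain within one bucket */
};

static struct entry *buckets[NBUCKETS];   /* head array */

static unsigned hash(const char *key)
{
    unsigned h = 5381;                 /* simple djb2-style hash, for illustration */
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* GET(key): one hash computation, then pointer-chasing along the chain. */
const char *lookup(const char *key)
{
    for (struct entry *e = buckets[hash(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}
```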

4 Background : Different Types of Linked-List
Linked-list is the basis of important data structures:
Hash table (key/value type data store)
Adjacency list of graphs
[Diagram: four linked-list layouts (Type1 to Type4), each built from a bucket array (head) pointing to chains of nodes that hold key + value (+ header) and a next pointer]

5 Linked-List Traversal in Big-memory Workloads
Measured on a Dell PowerEdge R910 (4x Xeon E7540, 512 GB DDR3)

6 Challenges in Linked-List Traversal (LLT)
LLT is intrinsically sequential and random-access
Modern CPUs are not efficient at executing LLT
Prior work proposed specialized logic in the host CPU to accelerate hash + LLT [Kocberber et al., MICRO'13], but its scalability is limited
[Diagram: accelerators (Acc) beside the OoO core in the CPU; accesses within one chain (A0 -> A1 -> A2) are dependent, while different chains (A, B, C, D) are independent and expose parallelism]

7 Enabling Near-Data Processing (NDP) of LLT
NDP architecture and communication interface
[Diagram: the LLT compute (accelerators) is moved from the host CPU into the memory, next to the data]

8 Enabling Near-Data Processing (NDP) of LLT
NDP architecture and communication interface
Latency: minimize off-chip accesses through localization
Throughput: batch multiple LLT operations to improve parallelism
[Diagram: each chain (A, B, C, D) is localized to one memory module and traversed by the LLT engine near it]

9 Contents
Introduction/Background
Memory-Centric Near-Data Processing (NDP) Architecture
NDP-aware Data Localization
Batching
Results
Conclusion

10 Memory-Centric Systems
Hybrid Memory Cube (HMC): DRAM layers stacked on a logic layer; each vault has its own vault controller, and an intra-HMC network connects the vaults to packet-based high-speed link I/O ports

11 Memory-Centric Systems
[Diagram: CPU connected to multiple HMCs through a memory network, each HMC with its logic layer, DRAM layers, vault controllers, and high-speed link I/O ports; Kim et al., PACT'13]

12 NDP architecture : System with NDP
[Diagram: NDP hardware added per HMC: the CPU sends command packets through the I/O ports and intra-HMC network to an engine scheduler, which dispatches them to LLT engines (E) that access the vault controllers via load/store; a command buffer (CBUF), a response buffer (RBUF), and a page table support the engines]

13 System with multiple memory modules
Host-processing vs. NDP memory networks:
4 HMC: Star (host-processing) vs. FBFLY [Kim et al., ISCA'07] (NDP)
16 HMC: Tree (host-processing) vs. 4-way dDFLY [Kim et al., PACT'13] (NDP)

14 Host-processing vs. NDP
Linked-list traversal example (Arr -> D0 -> D1): off-loading to NDP does not necessarily reduce traffic; when the list is spread across modules, the off-loaded traversal can take more network hops than host-processing (14 hops vs. 10 hops in the example)

15 NDP-aware data localization
Store each linked-list in a physically neighboring memory region
Physically neighboring = memory group = domain of localization; M = number of memory groups
M=1: no localization, M=4: per host link, M=16: per module (HMC), M=256: per vault
Memory accesses from NDP then stay within each memory group

16 NDP-aware data localization : How ?
Store each linked-list in a single memory group: "Hash mod M"
Without group assignment, nodes are spread across groups at the interleaving granularity (ITLV); with "hash mod M", the hash value selects both the bucket index and the memory group (hash value % M), so the whole chain is localized
[Diagram: M = 4 memory groups, shown with and without localization]
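A hedged sketch of the group assignment, assuming the bucket index and the memory group are both derived from the hash value; M, NBUCKETS, and the helper names are illustrative, not from the paper.

```c
#include <stdio.h>

#define M        4      /* number of memory groups (example value) */
#define NBUCKETS 1024   /* hash-table size (hypothetical) */

/* The hash value picks the bucket; taking it modulo M also picks the memory
 * group, so every node chained under that bucket can be kept in one group. */
static unsigned bucket_index(unsigned hv) { return hv % NBUCKETS; }
static unsigned memory_group(unsigned hv) { return hv % M; }

int main(void)
{
    unsigned hv = 0xCAFEu;   /* example hash value */
    printf("hash %#x -> bucket %u, group %u\n",
           hv, bucket_index(hv), memory_group(hv));
    return 0;
}
```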

17 NDP-aware data localization : How ?
Linked-list within an array (Type4 in our paper), e.g., the adjacency list of a graph
The edges of a vertex (e.g., 2, 11, 15, 20) are chained through a "next" array indexed by vertex number, with payloads in a separate item array; the head array is accessed by the host
[Diagram: without localization, the chain 2 -> 11 -> 15 -> 20 -> END spans multiple memory groups]
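A small C sketch of such an array-embedded (Type4) list, using the slide's example chain 2 -> 11 -> 15 -> 20; the array sizes, END sentinel, and item payloads are illustrative.

```c
#include <stdio.h>

#define NVERTEX 32
#define END     (-1)

static int  next_arr[NVERTEX];   /* next_arr[v]: next vertex in the chain */
static char item_arr[NVERTEX];   /* item_arr[v]: payload for vertex v     */

/* Traverse the chain starting from head (e.g., 2 -> 11 -> 15 -> 20). */
static void traverse(int head)
{
    for (int v = head; v != END; v = next_arr[v])
        printf("vertex %d item %c\n", v, item_arr[v]);
}

int main(void)
{
    for (int v = 0; v < NVERTEX; v++) { next_arr[v] = END; item_arr[v] = '.'; }
    next_arr[2]  = 11; next_arr[11] = 15; next_arr[15] = 20;
    item_arr[2] = 'w'; item_arr[11] = 'x'; item_arr[15] = 'y'; item_arr[20] = 'z';
    traverse(2);
    return 0;
}
```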

18 NDP-aware data localization : How ?
Relocate the linked-list to a single memory group: "Vertex mod M"
Two restrictions:
The relocation boundary is limited to "ITLV × M" entries
A node can be relocated only to a single specific index in each group
[Diagram: vertex 2 is relocated to index 3, shown as 3(2), so the chain 3(2) -> 11 -> 15 -> 20 stays within the relocation boundary; the head array is still accessed by the host]
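A hedged sketch of one way the relocation index could be computed under these restrictions; the formula is inferred from the slide's 3(2) example (which is consistent with ITLV = 1 and M = 4) and is an assumption, not the paper's definition.

```c
#include <stdio.h>

#define ITLV 1   /* interleaving granularity (assumed from the slide's example) */
#define M    4   /* number of memory groups */

/* Memory group that array index i naturally maps to under the interleaving. */
static unsigned group_of(unsigned i) { return (i / ITLV) % M; }

/* The single candidate index for vertex v inside target_group, within v's
 * relocation boundary of ITLV * M entries (assumed formula). */
static unsigned relocate(unsigned v, unsigned target_group)
{
    unsigned base = (v / (ITLV * M)) * (ITLV * M);   /* start of v's boundary */
    return base + target_group * ITLV + (v % ITLV);
}

int main(void)
{
    unsigned idx = relocate(2, 3);                   /* slide example: 3(2) */
    printf("vertex 2 -> index %u (group %u)\n", idx, group_of(idx));
    return 0;
}
```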

19 Batching
To maximize parallel execution: exploit inter-LLT parallelism with sufficient LLT engines in NDP
How? Batch multiple "independent" LLT operations and send multiple NDP commands through a single memory packet (max. 16 per batch)
What to batch, per workload:
Hash Join (Probe): multiple tuples
Memcached: multiple GET(key) requests
Graph BFS: vertices having the same distance from the source vertex (each BFS level forms a batch: 1st, 2nd, 3rd, 4th, ...)
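A hedged sketch of batching on the host side, assuming a hypothetical packet layout with up to 16 entries; the point is that independent lookups are packed into one command packet instead of being issued one at a time.

```c
#include <stdint.h>
#include <string.h>

#define MAX_BATCH 16                    /* max commands per packet (from the slide) */

struct ndp_packet {
    uint8_t  count;                     /* number of batched LLT commands */
    uint64_t key_hash[MAX_BATCH];       /* one independent lookup per entry */
};

/* Collect up to MAX_BATCH independent lookups into one packet; the caller
 * then issues the packet to memory and returns how many were consumed. */
size_t build_batch(struct ndp_packet *pkt, const uint64_t *hashes, size_t n)
{
    size_t take = n < MAX_BATCH ? n : MAX_BATCH;
    pkt->count = (uint8_t)take;
    memcpy(pkt->key_hash, hashes, take * sizeof(uint64_t));
    return take;
}
```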

20 NDP architecture : With Batching
[Diagram: the same NDP vault architecture as slide 12, now with multiple command packets (C) buffered in CBUF; the engine scheduler dispatches the batched commands across the LLT engines (E), and a response packet (R) is returned to the CPU through RBUF]

21 Methodology
Workloads: LLU, HashJoin (probe phase), Memcached, Graph500 BFS
Performance: McSimA+ (core) + gem5 (cache/directory) + Booksim (network)
Energy: McPAT (CPU) + CACTI-3DD (DRAM) + network energy
Evaluated systems: 1CPU-4HMC, 1CPU-16HMC
CPU: 32 out-of-order cores
HMC: 4 GB, 8 layers x 16 vaults
Memory network: Star/Tree (host-processing), FBFLY/dDFLY (NDP)

22 Evaluated Configurations
Add each optimization (localization, batching, multiple engines) to NDP and compare with host-processing:
HSP: baseline host-processing
NDP: near-data processing with LLT offloading
NDP_d: NDP with data locality
NDP_b: NDP with batching
NDP_db: NDP with data locality and batching
HSP_4x: HSP with 4x processing (i.e., 128 host threads)
NDP_b_4x: NDP_b with 4 LLT engines per vault
NDP_db_4x: NDP_db with 4 LLT engines per vault

23 Results
[Charts: performance for the 16 HMC and 4 HMC systems and energy for 16 HMC; annotated values include 6.6x and 6.4x speedups, -21% and -6.4% slowdowns, +31%, and energy ratios of 0.36x, 4.8x, and 2.1x, with bar values 11.7, 13.0, 14.1, 16.3 (16 HMC) and 7.38, 9.39 (4 HMC)]

24 Conclusion
While NDP can provide significant benefits, simply off-loading LLT to NDP does not necessarily improve performance and can actually degrade energy efficiency.
NDP-aware data localization and batching fully realize the benefits of near-memory processing:
Localization minimizes off-chip accesses
Batching improves throughput by exploiting inter-LLT parallelism
The approach can be extended to other memory-intensive workloads to provide scalable performance

25 Accelerating Linked-list Traversal Through Near-Data Processing
Byungchul Hong, Gwangsun Kim, John Kim, Yongkee Kwon, Hongsik Kim, Jung Ho Ahn

