Accelerating Linked-list Traversal Through Near-Data Processing

Presentation transcript:

Accelerating Linked-list Traversal Through Near-Data Processing
Byungchul Hong, Gwangsun Kim, John Kim
Yongkee Kwon, Hongsik Kim
Jung Ho Ahn

Background: Linked-list
A linked-list is a collection of data elements of any type, chained through nodes (node = data + pointer).
Linked-list traversal (LLT): accessing data elements through pointer-chasing. For example, to retrieve D2 starting from the head, ptr0 -> Dnew -> D1 -> D2 requires 4 memory accesses, performed sequentially.
(+) Easy to insert/delete
(+) Supports data of variable types
(-) Slow data retrieval
[Figure: traversal from the head/index through ptr0, Dnew, and D1 to D2, with a key comparison (=?) at each node]
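
To make the cost concrete, a minimal C sketch of an LLT follows; the node layout is an illustrative assumption. Each iteration must load the next pointer before the following access can even begin.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct node {
    uint64_t     key;    /* data element (illustrative layout) */
    struct node *next;   /* pointer to the next node           */
} node_t;

/* Linked-list traversal (LLT): each iteration's load depends on the
 * previous one, so accesses are serialized and, for a large table,
 * effectively random in memory. */
node_t *llt_find(node_t *head, uint64_t key) {
    for (node_t *cur = head; cur != NULL; cur = cur->next)
        if (cur->key == key)
            return cur;   /* found after a chain of dependent loads */
    return NULL;
}
```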

Background: Linked-list
Linked-lists are the basis of important data structures:
- Hash tables (key/value data stores)
- Adjacency lists of graphs
[Figure: a hash function maps keys ("snu", "kaist", "seoul", "daejeon") to indices into the buckets, i.e., a head array of linked-lists holding the key/value pairs]
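
As a concrete illustration, a minimal chained hash table GET in C; the FNV-1a hash and the entry layout are assumptions of this sketch, not the exact store used in the talk.

```c
#include <stdint.h>
#include <string.h>

#define NBUCKETS 1024

typedef struct entry {
    const char   *key;     /* e.g., "kaist"            */
    const char   *value;
    struct entry *next;    /* chaining within a bucket */
} entry_t;

static entry_t *buckets[NBUCKETS];   /* head array */

/* FNV-1a string hash (an illustrative choice). */
static uint64_t hash_str(const char *s) {
    uint64_t h = 14695981039346656037ull;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ull; }
    return h;
}

/* GET(key): the hash picks the bucket, then an LLT resolves collisions. */
const char *ht_get(const char *key) {
    for (entry_t *e = buckets[hash_str(key) % NBUCKETS]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}
```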

Background: Different Types of Linked-List
Linked-lists are the basis of important data structures (hash tables, adjacency lists of graphs), but they come in different layouts:
- Type 1: a bucket array (head array) of pointers; each heap-allocated node holds key + value (+ header) and a next pointer
- Types 2 and 3: variants that differ in whether nodes are embedded in the head array and whether items sit inside the node or are reached through a pointer
- Type 4: a linked-list within an array, with items in one array and the chain encoded in a parallel next array
[Figure: the four node layouts, Type1 through Type4]
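
For reference, illustrative C sketches of two of these layouts (the field names are assumptions): Type 1 chains heap-allocated nodes from a head array, while Type 4, which reappears later for adjacency lists, encodes the chain as indices in a parallel next array.

```c
#include <stdint.h>

/* Type 1 (illustrative): head array of pointers to heap-allocated
 * nodes; each node carries its item and a next pointer. */
typedef struct t1_node {
    uint64_t        item;   /* key + value (+ header) in the real layout */
    struct t1_node *next;
} t1_node_t;
static t1_node_t *t1_head[1024];   /* bucket (head) array */

/* Type 4 (illustrative): linked-list within an array. item[i] holds
 * the element and next[i] the index of the following element, so a
 * chain such as 2 -> 11 -> 15 -> 20 -> END lives entirely in indices. */
#define T4_END UINT32_MAX
typedef struct {
    uint64_t *item;   /* item array                  */
    uint32_t *next;   /* next array (index chain)    */
    uint32_t *head;   /* first index per list/vertex */
} t4_list_t;
```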

Linked-List Traversal in Big-memory Workloads
Measured on a Dell PowerEdge R910 (4x Xeon E7540, 512 GB DDR3).
[Figure: measured LLT behavior of big-memory workloads on this system]

Challenges in Linked-List Traversal (LLT)
- LLT is intrinsically sequential and random-access
- Modern CPUs are not efficient at executing LLT
- Prior work proposed specialized logic in the host CPU to accelerate hash + LLT [Kocberber et al., MICRO'13], but keeping the accelerators at the CPU limits scalability
[Figure: accelerators (Acc) beside the OoO core; accesses within one chain (A0 -> A1 -> A2) are dependent, while separate chains (A, B, C, D) are independent and expose parallelism]

Enabling Near-Data Processing (NDP) of LLT
NDP architecture and communication interface: LLT compute moves from accelerators beside the CPU core into accelerators inside the memory itself.
[Figure: LLT accelerators (Acc) distributed across the memory modules instead of packed next to the OoO core]

Enabling Near-Data Processing (NDP) of LLT
NDP architecture and communication interface:
- Latency: minimize off-chip accesses through localization
- Throughput: batch multiple LLT operations to improve parallelism
[Figure: with localization, a chain (A0 -> A1 -> A2) is traversed entirely within one module; with batching, independent chains (C, D) run on different accelerators in parallel]

Contents
- Introduction/Background
- Memory-Centric Near-Data Processing (NDP) Architecture
- NDP-aware Data Localization
- Batching
- Results
- Conclusion

Memory-Centric Systems
Hybrid Memory Cube (HMC): stacked DRAM layers are partitioned into vaults, each with its own vault controller on the logic layer; an intra-HMC network connects the vault controllers to high-speed I/O ports, and all communication over the high-speed links is packet-based.
[Figure: HMC with DRAM layers, vault controllers, intra-HMC network, and I/O ports]

Memory-Centric Systems
Multiple HMCs can be chained through their I/O ports to form a memory network attached to the CPU [Kim et al., PACT'13].
[Figure: CPU connected to a network of HMC modules over high-speed links]

NDP Architecture: System with NDP
LLT engines (E) are added to each HMC's logic layer: NDP command packets from the CPU are queued in a command buffer (CBUF), a scheduler dispatches them to the engines, and the engines issue loads/stores to the vault controllers over the intra-HMC network. Responses are collected in a response buffer (RBUF), and a page table supports address translation near memory.
[Figure: engine scheduler, engines, CBUF/RBUF, page table, and vault controllers in the logic layer]
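
A minimal sketch of what an off-loaded LLT command and its engine-side execution might look like; the packet fields and function names are illustrative assumptions, not the paper's actual interface.

```c
#include <stdint.h>

/* Hypothetical NDP command: the host packs one of these into a memory
 * packet and off-loads the whole traversal to a near-memory engine. */
typedef struct {
    uint64_t head_addr;   /* address of the list head         */
    uint64_t key;         /* key to match while traversing    */
    uint16_t key_offset;  /* byte offset of the key in a node */
    uint16_t next_offset; /* byte offset of the next pointer  */
} ndp_llt_cmd_t;

/* Engine-side execution: every dereference is a local DRAM access
 * through a vault controller rather than an off-chip round trip. */
static uint64_t ndp_llt_execute(const ndp_llt_cmd_t *cmd) {
    uint64_t node = cmd->head_addr;
    while (node != 0) {
        uint64_t k = *(const uint64_t *)(uintptr_t)(node + cmd->key_offset);
        if (k == cmd->key)
            return node;  /* hit: return the matching node's address */
        node = *(const uint64_t *)(uintptr_t)(node + cmd->next_offset);
    }
    return 0;             /* miss */
}
```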

System with Multiple Memory Modules
Memory network topologies for host-processing vs. NDP:
- 4 HMCs: Star (host-processing) vs. FBFLY [Kim et al., ISCA'07] (NDP)
- 16 HMCs: Tree (host-processing) vs. 4-way dDFLY [Kim et al., PACT'13] (NDP)
[Figure: the four topologies around the host CPU]

Host-processing vs. NDP
Linked-list traversal example (Arr -> D0 -> D1): with host-processing, every pointer dereference is a round trip between the host CPU and memory; with NDP, the traversal is off-loaded into the memory network. If the list spans modules, however, the NDP traversal can travel farther than the host-based one (14 hops vs. 10 hops in this example), so off-loading alone is not enough.

NDP-aware Data Localization
Store each linked-list in a physically neighboring region of memory. Physically neighboring = memory group = the domain of localization, with M the number of memory groups:
- M=1: no localization
- M=4: per host link
- M=16: per module (intra-HMC network)
- M=256: per vault
Memory accesses from NDP then stay within each memory group.

NDP-aware Data Localization: How?
Store each linked-list in a single memory group via "hash mod M": the bucket's hash value mod M selects the memory group, and all nodes of that bucket's list are allocated within it.
[Figure: with M = 4 memory groups, buckets are assigned to groups by hash value mod M, versus the baseline where the interleaving granularity (ITLV) spreads them with no group assignment]
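
A minimal sketch of this placement policy, assuming a hypothetical per-group arena allocator; in a real system each arena would be backed by physical memory belonging to its group.

```c
#include <stdint.h>
#include <stddef.h>

#define M 4                     /* number of memory groups (e.g., per host link) */
#define ARENA_BYTES (1u << 20)  /* per-group arena size, illustrative */

/* One arena per memory group; separate regions merely illustrate the
 * placement policy here. */
static unsigned char arena[M][ARENA_BYTES];
static size_t arena_top[M];

/* "Hash mod M": the bucket's hash value picks its memory group. */
static unsigned group_of(uint64_t hash_value) {
    return (unsigned)(hash_value % M);
}

/* Bump-allocate every node of a bucket's list from the bucket's own
 * group, so an NDP engine traversing the list never leaves the group. */
static void *alloc_in_group(unsigned g, size_t size) {
    size = (size + 7) & ~(size_t)7;          /* 8-byte alignment */
    if (arena_top[g] + size > ARENA_BYTES)
        return NULL;                          /* arena exhausted */
    void *p = &arena[g][arena_top[g]];
    arena_top[g] += size;
    return p;
}
```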

NDP-aware Data Localization: How?
For a linked-list within an array (Type4 in our paper), e.g., the adjacency list of a graph: the head gives a vertex's first edge index, the next array chains the remaining edges (here 2 -> 11 -> 15 -> 20 -> END), and a parallel item array holds the edge data. The arrays are also accessed by the host, and without localization the chain is spread across memory groups.

NDP-aware Data Localization: How?
Relocate each linked-list into a single memory group via "vertex mod M", under two restrictions:
- The relocation boundary is limited to ITLV x M entries, so an entry can only move within its window
- Within that window, an entry can be relocated only to the single index that falls in the target group
[Figure: the chain 2 -> 11 -> 15 -> 20 relocated into one memory group, with the original index kept alongside the new one (e.g., 3(2)) because the host still accesses the array in place]
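
A sketch of the index arithmetic these restrictions imply, assuming a simple rotating interleaving of ITLV-entry chunks across groups; the constants and the formula are illustrative, not the paper's exact scheme.

```c
#include <stdint.h>

#define M    4   /* number of memory groups                              */
#define ITLV 8   /* interleaving granularity in array entries (assumed)  */

/* Assumed interleaving: consecutive ITLV-entry chunks of the array
 * rotate across the M groups, so index i lives in group: */
static unsigned group_of_index(uint64_t i) {
    return (unsigned)((i / ITLV) % M);
}

/* Relocate entry i into group g = vertex % M without leaving its
 * ITLV*M-entry window: keep the offset within the chunk but move to
 * the one chunk of the window that maps to group g. This mirrors the
 * slide's two restrictions (bounded window, single target index). */
static uint64_t relocate_index(uint64_t i, unsigned g) {
    uint64_t window = i / ((uint64_t)ITLV * M);  /* which ITLV*M window     */
    uint64_t offset = i % ITLV;                  /* position inside a chunk */
    return window * ITLV * M + (uint64_t)g * ITLV + offset;
}
```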

Batching
To maximize parallel execution, exploit inter-LLT parallelism and the ample LLT engines in NDP. How? Batch multiple independent LLT operations, carrying multiple NDP commands (max. 16) in a single memory packet.
What to batch, per workload:
- Hash Join (probe): multiple tuples
- Memcached: multiple GET(key) requests
- Graph BFS: vertices at the same distance from the source vertex
[Figure: BFS frontiers around the source vertex at distance 1, 2, ... form the 1st, 2nd, 3rd, and 4th batches]
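
A minimal sketch of how a host might pack independent LLT operations into batched packets; the packet format and names are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define BATCH_MAX 16   /* NDP commands per memory packet (from the slide) */

/* Hypothetical batched request: one packet carries up to 16 independent
 * LLT commands (e.g., 16 Memcached GET keys, 16 hash-join probe tuples,
 * or 16 same-distance BFS vertices). */
typedef struct {
    uint8_t  count;
    uint64_t keys[BATCH_MAX];
    uint64_t heads[BATCH_MAX];  /* bucket/list head of each command */
} ndp_batch_packet_t;

/* Pack a stream of independent requests into as few packets as possible
 * so the per-vault engines can execute them in parallel. */
static size_t make_batches(const uint64_t *keys, const uint64_t *heads,
                           size_t n, ndp_batch_packet_t *out) {
    size_t npkt = 0;
    for (size_t i = 0; i < n; i += BATCH_MAX) {
        ndp_batch_packet_t *p = &out[npkt++];
        size_t left = n - i;
        p->count = (uint8_t)(left < BATCH_MAX ? left : BATCH_MAX);
        for (uint8_t j = 0; j < p->count; j++) {
            p->keys[j]  = keys[i + j];
            p->heads[j] = heads[i + j];
        }
    }
    return npkt;   /* number of packets produced */
}
```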

NDP Architecture: With Batching
[Figure: the same NDP datapath, now with multiple command packets (C) queued in the CBUF and a response packet (R) returned from the RBUF, keeping all engines (E) busy]

Methodology
- Workloads: LLU, HashJoin (probe phase), Memcached, Graph500 BFS
- Performance: McSimA+ (core) + gem5 (cache/directory) + Booksim (network)
- Energy: McPAT (CPU) + CACTI-3DD (DRAM) + network energy
- Evaluated systems: 1 CPU + 4 HMCs and 1 CPU + 16 HMCs
  - CPU: 32 out-of-order cores
  - HMC: 4 GB, 8 layers x 16 vaults
  - Memory network: Star/Tree (host-processing), FBFLY/dDFLY (NDP)

Evaluated Configurations
Add each optimization (localization, batching, multiple engines) to NDP and compare with host-processing:

Configuration Name | System Configuration
HSP        | Baseline host-processing
NDP        | Near-data processing with LLT off-loading
NDP_d      | NDP with data localization
NDP_b      | NDP with batching
NDP_db     | NDP with data localization and batching
HSP_4x     | HSP with 4x processing (i.e., 128 host threads)
NDP_b_4x   | NDP_b with 4 LLT engines per vault
NDP_db_4x  | NDP_db with 4 LLT engines per vault

Results
[Figure: Perf. (4 HMC) with annotated values 7.38, 9.39, +31%; Perf. (16 HMC) with annotated values 11.7, 13.0, 14.1, 16.3 and 6.6x, 6.4x, -21%, -6.4%; Energy (16 HMC) with annotated values 0.36x, 4.8x, 2.1x]

Conclusion
- While NDP can provide significant benefits, simply off-loading LLT to NDP does not necessarily improve performance and can actually degrade energy efficiency.
- NDP-aware data localization and batching fully realize the benefits of near-memory processing:
  - Localization minimizes off-chip accesses.
  - Batching improves throughput by exploiting inter-LLT parallelism.
- The approach can be extended to other memory-intensive workloads to provide scalable performance.

Accelerating Linked-list Traversal Through Near-Data Processing
Byungchul Hong, Gwangsun Kim, John Kim
Yongkee Kwon, Hongsik Kim
Jung Ho Ahn