
1 Buffered Compares: Excavating the Hidden Parallelism inside DRAM Architectures with Lightweight Logic Jinho Lee, Kiyoung Choi, and Jung Ho Ahn Seoul National University

2 Outline Introduction Our Approach Buffered Compare Architecture Evaluation Summary

3 Introduction – Memory Wall
The number of cores per chip keeps increasing
Memory bandwidth is not growing as fast --> the "memory wall" problem
Emerging big data applications require even more bandwidth
Actually, much of the available bandwidth is wasted!

4 Introduction – Table Scan
Which items are made out of wood? Which items are heavier than 5kg?

    Item#   Material   Weight
    A       Wood       10kg
    B       Metal      1.5kg
    C                  7kg
    D       Stone      3kg
    E                  2kg

5 Introduction – Table Scan
[Figure: the core sends a search key; data items D0-D3 are read from DRAM and compared (Cmp) on the core]
All the data in the table are read out and the comparisons are done on the processor
We only need the result: the data transfer is wasted bandwidth!
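To make the baseline concrete, here is a minimal sketch (my illustration, not from the talk) of a host-side table scan: every item crosses the memory bus even though only a one-bit match result is needed.

    #include <stdint.h>
    #include <stddef.h>

    /* Baseline table scan: every 64-bit item is transferred from DRAM
     * to the core just to produce a single match bit. */
    size_t scan_on_host(const uint64_t *items, size_t n,
                        uint64_t key, uint8_t *match)
    {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++) {
            match[i] = (items[i] == key);  /* full item read, 1-bit result */
            hits += match[i];
        }
        return hits;
    }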

6 Introduction – Table Scan
[Figure: the core sends the key into DRAM; the compare (Cmp) against D0-D3 happens inside the memory, and only the result comes back]
Do the compare within the memory
Only two transfers (key in, result out) are needed instead of many
Essentially a PIM (processing-in-memory) approach
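In contrast, the offloaded version moves only the key and the packed results across the bus. A minimal host-view sketch, assuming a hypothetical device call bc_compare_read() standing in for the in-memory compare:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical in-memory compare: the key travels to DRAM, the compare
     * runs next to the data, and a packed result bitmap comes back. */
    extern void bc_compare_read(const uint64_t *items, size_t n,
                                uint64_t key, uint64_t *result_bits);

    size_t scan_in_memory(const uint64_t *items, size_t n,
                          uint64_t key, uint64_t *result_bits)
    {
        bc_compare_read(items, n, key, result_bits);  /* 2 logical transfers */
        size_t hits = 0;
        for (size_t i = 0; i < n; i++)                /* count set result bits */
            hits += (result_bits[i / 64] >> (i % 64)) & 1;
        return hits;
    }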

7 Introduction - PIM
PIM research was active from the late 90's to the early 00's
EXECUBE, IRAM, FlexRAM, Smart Memory, Yukon, DIVA, etc.
Placing multiple cores in DRAM --> hard to integrate
Regaining interest for various reasons:
Big data workloads
Limited improvement of processor speed
Limited improvement of memory bandwidth
3D-stacked memory (HMC, HBM, etc.)

8 Introduction - PIM
PIM with 3D-stacked memory
[Figure: two designs built on HMC, both from J. Ahn et al., ISCA 2015: PEI (PIM-enabled instructions), which pairs a PCU with each DRAM controller and adds a PIM directory and locality monitor to the host processor, and Tesseract, which places an in-order core, list/message-triggered prefetchers, a prefetch buffer, and a message queue in each vault behind a crossbar network]

9 Our Approach - DRAM Architecture & Motivation
[Figure: DRAM chip organization: banks with global row decoders, mats of 512 x 512 cells, local sense amps (row buffer), global datalines, global sense amps (bank I/O), an internal shared bus, and chip I/O to the off-chip link]
A single chip is comprised of 8-16 banks
When accessing data, a row in a bank is "activated" and latched in its row buffer
A cache line (64B) is fetched in one burst

10 Our Approach - DRAM Architecture & Motivation
[Figure: one activated bank streams data over the internal shared bus while the other banks idle]
A single bank can fill up the bandwidth of the off-chip link
Because activating a row takes very long, multiple banks are operated in parallel --> the chip has 8x-16x the internal bandwidth of the link
Most of this internal bandwidth is wasted
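As a rough, illustrative calculation (my numbers, not the paper's; assuming an x8 device at DDR4-2000): the pins move 2000 MT/s x 1 byte = 2 GB/s, while 16 banks that can each source data at that rate would give 32 GB/s internally.

    #include <stdio.h>

    /* Back-of-the-envelope internal vs. off-chip bandwidth for one DRAM
     * chip. Illustrative numbers: DDR4-2000, x8 device, 16 banks. */
    int main(void)
    {
        const double mt_per_s   = 2000e6;           /* transfers per second */
        const double bytes_xfer = 1.0;              /* x8 device: 1 byte/transfer */
        const int    banks      = 16;

        double link_bw     = mt_per_s * bytes_xfer; /* off-chip pin bandwidth */
        double internal_bw = link_bw * banks;       /* if every bank streams */

        printf("off-chip: %.1f GB/s, internal: %.1f GB/s (%.0fx)\n",
               link_bw / 1e9, internal_bw / 1e9, internal_bw / link_bw);
        return 0;
    }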

11 Our Approach - DRAM Architecture & Motivation
[Figure: each bank gains a small compute unit next to its bank I/O, so all banks compute in parallel on their activated rows]
Compute inside each bank to utilize the excess internal bandwidth

12 Our Approach – Goal
Utilize the unused internal bandwidth
Minimal area overhead to the DRAM
Less invasive to the existing ecosystem (i.e., leave the DDR3/4 protocol intact as much as possible)

13 Our Approach – Goal
All PIM operations have deterministic latency
All DRAM commands (ACT, RD, ...) have pre-determined latencies
DDR protocols have no mechanism for the memory to signal the processor, so no branching, caching, or pipelining is allowed
This preserves the existing DDR interface and keeps the added logic lightweight
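Deterministic latency means the host memory controller can schedule a buffered-compare command exactly like a normal DRAM command: it just counts cycles. A minimal sketch under that assumption (the cycle counts here are illustrative, not from any DDR spec):

    #include <stdint.h>

    /* Every command, including the PIM compare, completes in a fixed
     * number of memory-clock cycles, so the controller never waits on a
     * signal from the DRAM; it only advances a cycle counter. */
    typedef enum { CMD_ACT, CMD_RD, CMD_WR, CMD_CMP_RD } cmd_t;

    static const uint32_t fixed_latency[] = {
        [CMD_ACT]    = 14,   /* illustrative row-activate latency */
        [CMD_RD]     = 14,   /* illustrative CAS latency */
        [CMD_WR]     = 10,
        [CMD_CMP_RD] = 18,   /* compare adds a known, constant delay */
    };

    uint64_t issue(cmd_t cmd, uint64_t now_cycle)
    {
        /* Returns the cycle at which the command is guaranteed to finish. */
        return now_cycle + fixed_latency[cmd];
    }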

14 Our Approach – Goal
Single-row restriction: each operation works on one activated row per bank
[Figure: every bank operates only on its own activated row]
Inter-bank communication is expensive
Activating additional rows would incur further overhead
This restriction also allows the bank I/O to be used as an operand register

15 Our Approach - What to compute with PIM?
[Figure: a key is compared (CMP) against a long range of items D0..DN inside DRAM]
We focus on "compare-n-op" patterns over a long range of data

16 Our Approach - What to compute with PIM?
Compare-n-read: returns the match result for each item
[Figure: the key is compared against D0..DN; the result is a vector such as (=, <, =, ..., >)]
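Functionally (a reference sketch in plain C, not the hardware), compare-n-read produces one comparison outcome per item:

    #include <stdint.h>
    #include <stddef.h>

    typedef enum { CMP_LT, CMP_EQ, CMP_GT } cmp_t;

    /* Reference semantics of compare-n-read: one outcome per scanned item. */
    void compare_n_read(const uint64_t *d, size_t n, uint64_t key, cmp_t *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = (d[i] < key) ? CMP_LT : (d[i] == key) ? CMP_EQ : CMP_GT;
    }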

17 Our Approach - What to compute with PIM?
Compare-n-select: returns the min/max among the items
[Figure: a running Max is compared against D0..DN; the answer is the maximum item, e.g., D7]
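Again as reference semantics (a sketch, using max as the example): only the final winner ever has to leave the memory.

    #include <stdint.h>
    #include <stddef.h>

    /* Reference semantics of compare-n-select: keep a running maximum. */
    uint64_t compare_n_select_max(const uint64_t *d, size_t n)
    {
        uint64_t best = d[0];
        for (size_t i = 1; i < n; i++)
            if (d[i] > best)
                best = d[i];
        return best;
    }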

18 Our Approach - What to compute with PIM?
Compare-n-increment: increments the values of matching items
[Figure: key K2 is compared against key-value pairs (K0,V0)..(KN,VN); the matching entry's value V2 is incremented in place]
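And compare-n-increment, which read-modify-writes matching values in place (reference sketch):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t key, val; } kv_t;

    /* Reference semantics of compare-n-increment: bump the value of every
     * entry whose key matches, without shipping the table to the host. */
    void compare_n_increment(kv_t *t, size_t n, uint64_t key)
    {
        for (size_t i = 0; i < n; i++)
            if (t[i].key == key)
                t[i].val++;
    }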

19 Buffered Compare Architecture
[Figure: each bank is augmented with four small blocks next to its bank I/O: a key buffer, an arithmetic unit, a result queue, and a command generator (CGEN)]
Key buffer: holds a value written by the processor
Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
Result queue: stores the compare results
CGEN: repeats the bank-local commands
The datapath is 64 bits wide
0.53% overhead in DRAM area
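As a mental model, here is a software sketch of the per-bank state (the real unit is fixed-function logic, not C; the queue depth is my assumption):

    #include <stdint.h>

    /* Per-bank buffered-compare state, modeled in software.
     * The datapath is 64 bits wide. */
    #define RESULT_QUEUE_DEPTH 64   /* assumed depth, not from the paper */

    typedef struct {
        uint64_t key_buffer;        /* operand written once by the host    */
        uint64_t mask;              /* selects which bit fields to compare */
        uint8_t  result_queue[RESULT_QUEUE_DEPTH];
        uint32_t rq_tail;
        uint32_t cgen_remaining;    /* column reads CGEN still has to issue */
    } bc_bank_t;

    /* One arithmetic-unit step: compare the 64 bits at the bank I/O
     * against the key buffer and queue the outcome. */
    void bc_step(bc_bank_t *b, uint64_t bank_io_word)
    {
        uint64_t a = bank_io_word & b->mask, k = b->key_buffer & b->mask;
        b->result_queue[b->rq_tail++ % RESULT_QUEUE_DEPTH] =
            (a < k) ? 0 : (a == k) ? 1 : 2;    /* <, =, > */
    }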

20 Buffered Compare Architecture
[Figure: zoomed view of one bank: 64 mats feed the 64-bit bank I/O; the key buffer, mask, arithmetic unit, result queue, and command generator (Cmd Gen) sit between the bank I/O and the internal shared bus]

21 Buffered Compare Architecture - Compare-n-read
❶ A DRAM row is activated and the data become ready in the row buffer

22 Buffered Compare Architecture - Compare-n-read
❷ The key buffer is filled: the host writes the search key

23 Buffered Compare Architecture - Compare-n-read
❸ The target data (64B) are fetched to the bank I/O

24 Buffered Compare Architecture - Compare-n-read
❹ The arithmetic unit performs the comparison and queues the result

25 Buffered Compare Architecture - Compare-n-read
❺ Steps ❸ and ❹ are repeated over the requested range by the command generator

26 Buffered Compare Architecture - Compare-n-read
❻ The results are sent back to the host
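Seen from the memory controller, the whole compare-n-read is an ordinary-looking command sequence. A hedged sketch (the helper functions are illustrative, not a real DRAM API; only the CMP_RD name comes from the slides):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative controller-side primitives. */
    void dram_activate(int bank, uint32_t row);
    void dram_write_key_buffer(int bank, uint64_t key);
    void dram_cmp_rd(int bank, uint32_t col);          /* fixed latency */
    void dram_read_result_queue(int bank, uint8_t *dst, size_t n);

    /* Compare-n-read over `cols` columns of one activated row. */
    void cmp_read_row(int bank, uint32_t row, uint64_t key,
                      uint32_t cols, uint8_t *results)
    {
        dram_activate(bank, row);                 /* 1: open the row        */
        dram_write_key_buffer(bank, key);         /* 2: load the search key */
        for (uint32_t c = 0; c < cols; c++)       /* 5: CGEN repeats...     */
            dram_cmp_rd(bank, c);                 /* 3+4: fetch, cmp, queue */
        dram_read_result_queue(bank, results, cols); /* 6: drain results    */
    }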

27 Buffered Compare Architecture - Problems and Solutions
Problems:
Virtual addresses cannot be handled: either physical addresses must be used, or the virtual address must be translated inside the DRAM
Cache coherence: the processor caches and the DRAM have to stay coherent
Solution: a direct segment with a non-cacheable region
Keep base, limit, and offset registers for one large memory segment, so translation is done by simple additions
Data within the segment are kept non-cacheable
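A direct segment makes translation a bounds check plus an addition. A minimal sketch of that idea (register names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    /* Direct segment: one contiguous VA range [base, limit) maps to
     * physical memory at a fixed offset, so no page-table walk is needed. */
    typedef struct { uint64_t base, limit, offset; } direct_seg_t;

    bool seg_translate(const direct_seg_t *s, uint64_t va, uint64_t *pa)
    {
        if (va < s->base || va >= s->limit)
            return false;          /* outside the segment: normal paging */
        *pa = va + s->offset;      /* translation is a single addition */
        return true;
    }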

28 Buffered Compare Architecture - Problems and Solutions
Data placement: a 64-bit word is normally distributed over the multiple chips of a rank, interleaved in units of 8 bits, but buffered compare needs the whole word within one chip
[Figure: byte interleaving spreads word A as A0..A7 across chips 0-7; word interleaving instead keeps all of word A on chip 0, word B on chip 1, ..., word H on chip 7]
Solution: use word interleaving within the segment
Critical-word-first is disabled within the segment
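The two placements differ only in the address-to-chip mapping. A small sketch contrasting them (8 chips, 8-byte words; purely illustrative):

    #include <stdint.h>

    #define CHIPS 8
    #define WORD_BYTES 8

    /* Byte interleaving: consecutive bytes of one word go to different
     * chips, so no single chip ever holds a whole 64-bit operand. */
    int chip_byte_interleaved(uint64_t byte_addr)
    {
        return byte_addr % CHIPS;
    }

    /* Word interleaving (used inside the BC segment): all 8 bytes of a
     * word land on one chip, so the in-bank logic can compare it whole. */
    int chip_word_interleaved(uint64_t byte_addr)
    {
        return (byte_addr / WORD_BYTES) % CHIPS;
    }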

29 Buffered Compare Architecture - Programming Model
SW code:

    __kernel void search(__global const int *keys, int searchkey,
                         __global int *d)
    {
        int id = get_global_id(0);
        if (keys[id] == searchkey)
            d[id] = 1;
    }

Instruction: BC_cmp_read(searchkey, keys, N)
DRAM cmd: CMP_RD(searchkey, addr, range)

OpenCL-based programming model
Programmers need not be aware of DRAM parameters (page size, number of banks, ...)
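So one kernel-level call lowers to a repeated DRAM command. A hypothetical view of that lowering (the splitting scheme and bank count are my illustration, not the paper's runtime):

    #include <stdint.h>
    #include <stddef.h>

    #define BANKS 16   /* illustrative */

    /* Hypothetical device command: compare `range` words starting at DRAM
     * address `addr` in one bank against `key` (slide: CMP_RD). */
    void cmp_rd(int bank, uint64_t key, uint64_t addr, size_t range);

    /* One possible lowering of BC_cmp_read: split the key array across
     * banks so all compares run in parallel; the programmer only ever
     * sees BC_cmp_read(searchkey, keys, N). */
    void BC_cmp_read(uint64_t searchkey, uint64_t keys_addr, size_t n)
    {
        size_t per_bank = (n + BANKS - 1) / BANKS;
        for (int b = 0; b < BANKS; b++) {
            size_t lo = (size_t)b * per_bank;
            if (lo >= n) break;
            size_t len = (n - lo < per_bank) ? n - lo : per_bank;
            cmp_rd(b, searchkey, keys_addr + lo * 8, len);
        }
    }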

30 Evaluation - Setup
McSimA+ simulator
Processor: 22nm, 16 cores running at 3GHz; 16KB private L1; 32MB S-NUCA L2; directory-based MESI coherence
Memory: 28nm DDR4-2000; 4 ranks per channel; 16 banks per chip; PAR-BS (parallelism-aware batch scheduling)

31 Evaluation - Setup
Six workloads:
TSC: in-memory linear scan (column-store)
TSR: in-memory linear scan (row-store)
BT: B+ tree traversal (index scan)
MAX: MAX aggregation
SA: sequence assembly
KV: key-value store
BC was evaluated against the baseline and AMO (Active Memory Operation)

32 Evaluation - Speedup
[Chart: speedup of AMO and BC over the baseline across the six workloads]
BC performs 3.62 times better than the baseline

33 Evaluation – Bandwidth Usage
BC utilizes more than 8.64x the internal bandwidth of the baseline on geomean

34 Evaluation – Sensitivity
Usually, the more aggregate banks, the higher the speedup
Sometimes adding more ranks degrades performance instead

35 Experimental Result
Energy consumption is reduced by 73.3% on average
Processor energy: 77.2% reduction; Memory energy: 43.9% reduction

36 Summary
We proposed Buffered Compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM
Minimal overhead to the DRAM area
Less invasive to the existing DDR protocols
3.62x speedup and 73.3% energy reduction
Limitations:
Operation is restricted to a single large segment
When using x4 devices, only operands up to 32 bits are supported

37 The End Thank you!

