Ambit In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology Vivek Seshadri Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry

Executive Summary
Problem: Bulk bitwise operations are present in many applications (e.g., databases, search filters), but existing systems are memory-bandwidth limited.
Our Proposal: Ambit performs bulk bitwise operations completely inside DRAM. Bulk bitwise AND/OR: simultaneous activation of three rows. Bulk bitwise NOT: uses the inverters already present in the sense amplifiers. Less than 1% area overhead over existing DRAM chips.
Results (compared to a state-of-the-art baseline): averaged across seven bulk bitwise operations, 32X performance improvement and 35X energy reduction; 3X-7X performance improvement for real-world data-intensive applications.

Bulk Bitwise Operations
Bitmap indices (database indexing)
BitWeaving (database queries) [1]
BitFunnel (web search) [2]
Set operations
DNA sequence mapping
Encryption algorithms
...
[1] Li and Patel, BitWeaving, SIGMOD 2013
[2] Goodwin+, BitFunnel, SIGIR 2017
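Many of these workloads reduce to operations over large bitvectors. As a minimal sketch (my own illustration, not from the paper), a bitmap-index query over hypothetical predicate bitvectors is a single bulk bitwise AND followed by a popcount:

```python
# Hypothetical 8-row table: bit i of each bitvector is 1 iff row i
# satisfies that predicate.
age_18_25  = 0b10110010   # rows with age in [18, 25]
country_us = 0b11010110   # rows with country == "US"

# "age in [18, 25] AND country == US" is one bulk bitwise AND:
matches = age_18_25 & country_us
count = bin(matches).count("1")   # popcount -> number of matching rows
print(count)   # -> 3
```

On a real table the bitvectors span millions of rows, which is why the throughput of the bitwise AND dominates the query.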

Today, DRAM is just a storage device! The processor (CPU, GPU, FPGA) reads and writes data over the memory channel. The throughput of bulk bitwise operations is limited by the available memory bandwidth.

Our Approach
Use the analog operation of DRAM to perform bitwise operations completely inside memory, with the processor (CPU, GPU, FPGA, or PiM) communicating over the same channel.

Outline of the talk 1. DRAM Background 2. Ambit-AND-OR: Bitwise AND/OR in DRAM 3. Ambit-NOT: Bitwise NOT in DRAM 4. Ambit Implementation 5. Applications and Evaluation

Inside a DRAM Chip
[Figure: a 2D array of DRAM cells connected to a row of sense amplifiers; each row holds 8 KB]

DRAM Cell Operation
[Figure: a DRAM cell (capacitor plus access transistor on a wordline) connected through a bitline to a sense amplifier with an enable signal]

DRAM Cell Operation
1. Raise the wordline: the access transistor connects the cell to the bitline (precharged to ½ VDD).
2. The cell loses charge to the bitline, causing a small deviation in the bitline voltage (½ VDD + δ for a cell storing 1).
3. Enable the sense amplifier: it amplifies the deviation, driving the bitline to full VDD, and the cell regains its charge.
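The charge sharing in step 2 can be sketched numerically. This is a simple charge-conservation model with illustrative, assumed values for VDD and the cell/bitline capacitances (the paper relies on detailed SPICE simulation instead):

```python
VDD = 1.2                    # assumed supply voltage (volts)
Cc, Cb = 25e-15, 100e-15     # assumed cell and bitline capacitances (farads)

def bitline_after_sharing(cell_voltage):
    """Bitline voltage after a cell at cell_voltage connects to a
    bitline precharged to VDD/2 (charge conservation)."""
    return (Cc * cell_voltage + Cb * VDD / 2) / (Cc + Cb)

v1 = bitline_after_sharing(VDD)   # cell storing 1 -> positive deviation
v0 = bitline_after_sharing(0.0)   # cell storing 0 -> negative deviation
assert v1 > VDD / 2 > v0
# The sense amplifier amplifies this small deviation to full VDD or 0,
# which also restores the cell's charge.
```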

Outline of the talk 1. DRAM Background 2. Bitwise AND/OR in DRAM 3. Bitwise NOT in DRAM 4. Ambit Implementation 5. Applications and Evaluation

Triple-Row Activation: Majority Function
Activate all three rows simultaneously: the three cells share charge with the bitline at once, deviating it toward the majority value of the three cells (½ VDD + δ when at least two cells store 1). Enabling the sense amplifier then drives the bitline, and all three cells, to that majority value.
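The same charge-sharing model as before (again with assumed, illustrative capacitances rather than the paper's SPICE parameters) shows why the bitline deviation follows the majority of the three activated cells:

```python
from itertools import product

VDD = 1.2                    # assumed supply voltage
Cc, Cb = 25e-15, 100e-15     # assumed per-cell and bitline capacitances

def bitline_voltage(bits):
    """Bitline voltage after three cells share charge with a bitline
    precharged to VDD/2 (simple charge conservation)."""
    charge = sum(Cc * (VDD if b else 0.0) for b in bits) + Cb * VDD / 2
    return charge / (3 * Cc + Cb)

for bits in product((0, 1), repeat=3):
    majority = sum(bits) >= 2
    # The deviation's sign follows the majority, so the sense amplifier
    # settles the bitline (and all three cells) to the majority value.
    assert (bitline_voltage(bits) > VDD / 2) == majority
```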

Bitwise AND/OR Using Triple-Row Activation
Output = AB + BC + CA = C (A OR B) + ~C (A AND B)
Control the value of C to perform bitwise OR or bitwise AND of A and B.
Result: 38X improvement in raw throughput and 44X reduction in energy consumption for bulk bitwise AND/OR operations.
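The identity on this slide can be checked exhaustively; the sketch below verifies that fixing C selects between AND and OR of A and B:

```python
# MAJ(A, B, C) = AB + BC + CA; fixing C selects the operation.
for A in (0, 1):
    for B in (0, 1):
        assert ((A & B) | (B & 0) | (0 & A)) == (A & B)   # C = 0 -> AND
        assert ((A & B) | (B & 1) | (1 & A)) == (A | B)   # C = 1 -> OR
```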

Potential Concerns with Triple-Row Activation
- With three cells, the bitline deviation may not be large enough
- Process variation: all cells are not equal
- Cells leak charge
- The memory controller may have to send three addresses
- The source data gets destroyed
SPICE simulations put the reliability concerns to rest (Section 6 in the paper); the remaining concerns are addressed by the implementation (next slide).

Bulk Bitwise AND/OR in DRAM
Statically reserve three designated rows t1, t2, and t3.
Result = row A AND/OR row B:
1. Copy data of row A to row t1
2. Copy data of row B to row t2
3. Initialize data of row t3 to 0 (AND) or 1 (OR)
4. Activate rows t1/t2/t3 simultaneously
5. Copy data of row t1/t2/t3 to the Result row
Each copy operation takes only 80 ns.

Bulk Bitwise AND/OR in DRAM
Use RowClone (MICRO 2013) to perform the copy and initialization operations completely in DRAM!
1. RowClone data of row A to row t1
2. RowClone data of row B to row t2
3. Initialize data of row t3 to 0/1 (via RowClone)
4. Activate rows t1/t2/t3 simultaneously
5. RowClone data of row t1/t2/t3 to the Result row
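The five-step sequence above can be captured in a toy functional model (my sketch, not the paper's implementation; the row names "t1"/"t2"/"t3" are hypothetical). Note that the source rows survive, because the triple activation only overwrites the designated rows:

```python
ROW_BITS = 8

def triple_activate(subarray, r1, r2, r3):
    """Model of simultaneous activation: each bitline settles to the
    majority of the three cells, overwriting all three rows."""
    maj = [1 if a + b + c >= 2 else 0
           for a, b, c in zip(subarray[r1], subarray[r2], subarray[r3])]
    subarray[r1], subarray[r2], subarray[r3] = maj[:], maj[:], maj[:]

def ambit_and_or(subarray, row_a, row_b, result_row, op):
    subarray["t1"] = list(subarray[row_a])       # RowClone: A -> t1
    subarray["t2"] = list(subarray[row_b])       # RowClone: B -> t2
    ctrl = 1 if op == "or" else 0                # t3 selects AND vs. OR
    subarray["t3"] = [ctrl] * ROW_BITS           # init t3 to 0 (AND) / 1 (OR)
    triple_activate(subarray, "t1", "t2", "t3")  # in-DRAM majority
    subarray[result_row] = list(subarray["t1"])  # RowClone: t1 -> Result

mem = {"A": [1, 0, 1, 1, 0, 0, 1, 0], "B": [1, 1, 0, 1, 0, 1, 1, 0]}
ambit_and_or(mem, "A", "B", "R", "and")
print(mem["R"])   # -> [1, 0, 0, 1, 0, 0, 1, 0]
```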

Outline of the talk 1. DRAM Background 2. Bitwise AND/OR in DRAM 3. Bitwise NOT in DRAM 4. Ambit Implementation 5. Applications and Evaluation

Negation Using the Sense Amplifier
Key idea: when the sense amplifier is enabled, the negated value of the cell is already present on the ~bitline. Can we copy this negated value from the bitline into a DRAM cell?

Negation Using the Sense Amplifier
Dual Contact Cell: a DRAM cell with two access transistors, one connecting it to the bitline (controlled by a regular wordline) and one connecting it to the ~bitline (controlled by a negation wordline).

Negation Using the Sense Amplifier
1. Activate the source row and enable the sense amplifier: the bitline is driven to the source value and the ~bitline to its negation.
2. Activate the negation wordline: the dual contact cell captures the negated value from the ~bitline.

Ambit vs. DDR3: Performance and Energy
[Chart: averaged across the bulk bitwise operations, Ambit improves performance by 32X and reduces energy by 35X relative to DDR3]

Outline of the talk 1. DRAM Background 2. Bitwise AND/OR in DRAM 3. Bitwise NOT in DRAM 4. Ambit Implementation 5. Applications and Evaluation

Ambit – Implementation
[Figure: an Ambit subarray — regular data rows, pre-initialized rows (all-0/all-1), designated rows for triple activation, dual contact cells, and the sense amplifiers]

Ambit – Implementation
[Figure: the same subarray with two row decoders — the regular row decoder drives the data rows, while a small bitwise decoder drives the designated rows and dual contact cells]

Integrating Ambit with the System — two approaches:
1. As a PCIe device, similar to other accelerators (e.g., GPU)
2. On the system memory bus — Ambit uses the same DRAM command/address interface
The pros and cons of each approach are discussed in the paper (Section 5.4).

Outline of the talk 1. DRAM Background 2. Bitwise AND/OR in DRAM 3. Bitwise NOT in DRAM 4. Ambit Implementation 5. Applications and Evaluation

Real-world Applications
Methodology (gem5 simulator):
- Processor: x86, 4 GHz, out-of-order, 64-entry instruction queue
- L1 cache: 32 KB D-cache and 32 KB I-cache, LRU policy
- L2 cache: 2 MB, LRU policy
- Memory controller: FR-FCFS, 8 KB row size
- Main memory: DDR4-2400, 1 channel, 1 rank, 8 banks
Workloads:
- Database bitmap indices
- BitWeaving – column scans using bulk bitwise operations
- Set operations – comparing bitvectors with red-black trees

Bitmap Indices: Performance
[Chart: speedups of 5.4X, 6.1X, 6.3X, 5.7X, 6.2X, and 6.6X across queries — a consistent reduction in execution time, 6X on average]

Speedup Offered by Ambit for BitWeaving
[Chart: for the query "select count(*) where c1 < field < c2", Ambit's speedup grows with the number of rows in the database table, reaching 12X]
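BitWeaving evaluates such predicates with bulk bitwise operations over bit-sliced columns. The sketch below is a simplified bit-sliced less-than scan in that spirit (my own simplification, not BitWeaving's exact layout), with an assumed 4-bit value width:

```python
WIDTH = 4   # assumed bits per value

def to_slices(values, n_bits=WIDTH):
    """slices[j] holds bit j (MSB first) of every value; record i sits at bit i."""
    slices = []
    for j in reversed(range(n_bits)):
        word = 0
        for i, v in enumerate(values):
            word |= ((v >> j) & 1) << i
        slices.append(word)
    return slices

def less_than(slices, c, n_records, n_bits=WIDTH):
    """Bitvector with bit i set iff value i < c, using only bitwise ops."""
    mask = (1 << n_records) - 1
    lt, eq = 0, mask
    for j, s in enumerate(slices):                 # walk MSB -> LSB
        cj = mask if (c >> (n_bits - 1 - j)) & 1 else 0
        lt |= eq & ~s & cj        # first differing bit: value 0, constant 1
        eq &= ~(s ^ cj) & mask    # records still equal to the constant so far
    return lt

vals = [3, 9, 5, 12, 7, 0]
res = less_than(to_slices(vals), 7, len(vals))
print([i for i in range(len(vals)) if (res >> i) & 1])   # -> [0, 2, 5]
```

Every step is a bulk AND, OR, or NOT over whole bitvectors, which is exactly the pattern Ambit accelerates in DRAM.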

Other Details and Results in the Paper
- Detailed implementation of Ambit: changes to DRAM chips, optimizations to improve performance, error correction codes (open problem)
- Detailed SPICE simulation analysis
- Comparison to 3D-stacked DRAM
- Other applications: set operations, BitFunnel (web search document filtering), masked initialization, cryptography, DNA sequence mapping

Conclusion
Problem: Bulk bitwise operations are present in many applications (e.g., databases, search filters), but existing systems are memory-bandwidth limited.
Our Proposal: Ambit performs bulk bitwise operations completely inside DRAM. Bulk bitwise AND/OR: simultaneous activation of three rows. Bulk bitwise NOT: uses the inverters already present in the sense amplifiers. Less than 1% area overhead over existing DRAM chips.
Results (compared to a state-of-the-art baseline): averaged across seven bulk bitwise operations, 32X performance improvement and 35X energy reduction; 3X-7X performance improvement for real-world data-intensive applications.

Ambit In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology Vivek Seshadri Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry

Backup Slides

Data movement consumes high energy. Source: Bill Dally, keynote, "Challenges for Future Computing Systems," HiPEAC 2015.

RowClone: In-DRAM Bulk Data Copy (MICRO 2013)
1. Activate the source row: the sense amplifiers latch the source data, driving the bitline to full VDD.
2. Activate the destination row: the sense amplifiers drive the latched data into the destination cells — the data gets copied from source to destination.

Today, DRAM is just a storage device! The processor reads and writes data over the channel in 64 B cache lines. Can we do more with DRAM?

Ambit – Implementation
[Figure: the same subarray, with the bitwise decoder triggered by reserved addresses (e.g., Reserved Address 0), so a command sequence can perform a bitwise AND]

Summary of Operations
[Figure: the subarray and sense amplifiers together support Copy, AND, OR, and NOT]

Ambit Throughput

Error Correction Code
We need an ECC that is homomorphic over the bitwise operations:
ECC(A and B) = ECC(A) and ECC(B)
ECC(A or B) = ECC(A) or ECC(B)
ECC(not A) = not ECC(A)
Triple Modular Redundancy trivially satisfies these conditions: 2X capacity overhead, but better performance and energy efficiency, and lower overall cost.
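The homomorphism requirement is easy to check for triple modular redundancy. In the sketch below (my own illustration), ECC(x) simply replicates the word three times, with an assumed 8-bit word width:

```python
def ecc(x):
    """TMR encode: three identical copies of the word."""
    return (x, x, x)

def band(a, b): return tuple(x & y for x, y in zip(a, b))
def bor(a, b):  return tuple(x | y for x, y in zip(a, b))
def bnot(a, width=8):
    m = (1 << width) - 1
    return tuple(~x & m for x in a)

A, B = 0b10110010, 0b11010110
assert band(ecc(A), ecc(B)) == ecc(A & B)   # ECC(A and B) = ECC(A) and ECC(B)
assert bor(ecc(A), ecc(B)) == ecc(A | B)    # ECC(A or B)  = ECC(A) or  ECC(B)
assert bnot(ecc(A)) == ecc(~A & 0xFF)       # ECC(not A)   = not ECC(A)
```

Because each codeword is just three copies, operating on the codewords bitwise is the same as encoding the result of the operation, which is exactly the property in-DRAM computation needs.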