DEUCE: WRITE-EFFICIENT ENCRYPTION FOR PCM March 16 th 2015 ASPLOS-XX Istanbul, Turkey Vinson Young Prashant Nair Moinuddin Qureshi.

Slides:



Advertisements
Similar presentations
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy J. Zebchuk, E. Safi, and A. Moshovos.
Advertisements

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.
Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.
A Case for Refresh Pausing in DRAM Memory Systems
Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs Mrinmoy Ghosh Hsien-Hsin S. Lee School.
Architectures for Secure Processing Matt DeVuyst.
Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon*, Naveen Muralimanohar ǂ, Justin Meza*, Onur Mutlu*,
†The Pennsylvania State University
Reducing Read Latency of Phase Change Memory via Early Read and Turbo Read Feb 9 th 2015 HPCA-21 San Francisco, USA Prashant Nair - Georgia Tech Chiachen.
A Cache-Like Memory Organization for 3D memory systems CAMEO 12/15/2014 MICRO Cambridge, UK Chiachen Chou, Georgia Tech Aamer Jaleel, Intel Moinuddin K.
Phase Change Memory What to wear out today? Chris Craik, Aapo Kyrola, Yoshihisa Abe.
Exploring timing based side channel attacks against i CCMP Suman Jana, Sneha K. Kasera University of Utah Introduction
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
TinySec: Link Layer Security Chris Karlof, Naveen Sastry, David Wagner University of California, Berkeley Presenter: Todd Fielder.
CAFO: Cost Aware Flip Optimization for Asymmetric Memories RAKAN MADDAH *, SEYED MOHAMMAD SEYEDZADEH AND RAMI MELHEM COMPUTER SCIENCE DEPARTMENT UNIVERSITY.
1 Lecture 14: DRAM, PCM Today: DRAM scheduling, reliability, PCM Class projects.
Moinuddin K. Qureshi ECE, Georgia Tech ISCA 2012 Michele Franceschini, Ashish Jagmohan, Luis Lastras IBM T. J. Watson Research Center PreSET: Improving.
Study of AES Encryption/Decription Optimizations Nathan Windels.
Flexible Reference-Counting-Based Hardware Acceleration for Garbage Collection José A. Joao * Onur Mutlu ‡ Yale N. Patt * * HPS Research Group University.
Secure Embedded Processing through Hardware-assisted Run-time Monitoring Zubin Kumar.
Defining Anomalous Behavior for Phase Change Memory
Roza Ghamari Bogazici University.  Current trends in transistor size, voltage, and clock frequency, future microprocessors will become increasingly susceptible.
Security Refresh Prevent Malicious Wear-out and Increase Durability for Phase-Change Memory with Dynamically Randomized Address Mapping Nak Hee Seong Dong.
Sangyeun Cho Hyunjin Lee
© 2007 IBM Corporation HPCA – 2010 Improving Read Performance of PCM via Write Cancellation and Write Pausing Moinuddin Qureshi Michele Franceschini and.
Reducing Refresh Power in Mobile Devices with Morphable ECC
Protecting Data on Smartphones and Tablets from Memory Attacks
Page Overlays An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management Vivek Seshadri Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu,
Timing Channel Protection for a Shared Memory Controller Yao Wang, Andrew Ferraiuolo, G. Edward Suh Feb 17 th 2014.
1 Lecture: Virtual Memory, DRAM Main Memory Topics: virtual memory, TLB/cache access, DRAM intro (Sections 2.2)
1 Towards Phase Change Memory as a Secure Main Memory André Seznec IRISA/INRIA.
The Memory Hierarchy 21/05/2009Lecture 32_CA&O_Engr Umbreen Sabir.
P AY -A S -Y OU -G O S TORAGE -E FFICIENT H ARD E RROR C ORRECTION Moinuddin K. Qureshi ECE, Georgia Tech Research done while at: IBM T. J. Watson Research.
Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.
Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.
OCR GCSE Computing © Hodder Education 2013 Slide 1 OCR GCSE Computing Chapter 2: CPU.
Modes of Usage Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) 11 Coming up: Modes of.
TinySec: A Link Layer Security Architecture for Wireless Sensor Networks Chris Karlof :: Naveen Sastry :: David Wagner Presented by Roh, Yohan October.
+ CS 325: CS Hardware and Software Organization and Architecture Memory Organization.
BEAR: Mitigating Bandwidth Bloat in Gigascale DRAM caches
© 2007 IBM Corporation MICRO-2009 Start-Gap: Low-Overhead Near-Perfect Wear Leveling for Main Memories Moinuddin Qureshi John Karidis, Michele Franceschini.
MIAO ZHOU, YU DU, BRUCE CHILDERS, RAMI MELHEM, DANIEL MOSSÉ UNIVERSITY OF PITTSBURGH Writeback-Aware Bandwidth Partitioning for Multi-core Systems with.
07/11/2005 Register File Design and Memory Design Presentation E CSE : Introduction to Computer Architecture Slides by Gojko Babić.
HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.
New-School Machine Structures Parallel Requests Assigned to computer e.g., Search “Katz” Parallel Threads Assigned to core e.g., Lookup, Ads Parallel Instructions.
CMSC 611: Advanced Computer Architecture Memory & Virtual Memory Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material.
Cache Replacement Policy Based on Expected Hit Count
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
November 14, 2016 Secure MAC algorithms for use with NTP draft-aanchal4-ntp-mac-03 CFRG: IETF97 Aanchal Malhotra Sharon Goldberg.
Seth Pugsley, Jeffrey Jestes,
Cache Memory Presentation I
Moinuddin K. Qureshi ECE, Georgia Tech Gabriel H. Loh, AMD
Scalable High Performance Main Memory System Using PCM Technology
Energy-Efficient Address Translation
CMSC 611: Advanced Computer Architecture
Lecture: DRAM Main Memory
Lecture: DRAM Main Memory
Lecture 6: Reliability, PCM
CARP: Compression-Aware Replacement Policies
MICRO-2018 Gururaj Saileshwar1 Prashant Nair1 Prakash Ramrakhyani2
Motherboard External Hard disk USB 1 DVD Drive RAM CPU (Main Memory)
Manjunath Shevgoor, Rajeev Balasubramonian, University of Utah
SYNERGY: Rethinking Secure-Memory Design for Error-Correcting Memories
Use ECP, not ECC, for hard failures in resistive memories
Milestone 2 Enhancing Phase-Change Memory via DRAM Cache
Enabling Transparent Memory-Compression for Commodity Memory Systems
Counter Mode, Output Feedback Mode
Janus Optimizing Memory and Storage Support for Non-Volatile Memory Systems Sihang Liu Korakit Seemakhupt, Gennady Pekhimenko, Aasheesh Kolli, and Samira.
Introduction to CUDA Programming
Presentation transcript:

DEUCE: WRITE-EFFICIENT ENCRYPTION FOR PCM March 16 th 2015 ASPLOS-XX Istanbul, Turkey Vinson Young Prashant Nair Moinuddin Qureshi

Insight: re-encrypt only modified words DEUCE: write-efficient encryption scheme Result: bit flips reduced 50%  23% (27% speedup) PCM more vulnerable to attacks  stolen module Secure PCM with encryption Problem: Encryption increases bit flips 12%  50% Goal: Encrypt PCM without causing 4x bit flips Bit flips expensive in Phase Change Memory (PCM) –Affects Lifetime, Power, and Performance –PCM system optimized to reduce bit flips (~12%) EXECUTIVE SUMMARY 2

PCM AS MAIN MEMORY Phase Change Memory (PCM) –Improved scaling + density –Non-volatile (no refresh) Key Challenge –Writes: Limited endurance ( million writes) –Writes: Slow and limited throughput –Writes: Power-hungry 3 PCM systems are designed to reduce writes

Invert if too many bit flipsWrite only flipped bits Flip-N-Write TYPICAL WRITE OPTIMIZATIONS IN PCM Data-Comparison-Write 4 Cache Comparison Write Cache Flip-N-Write Optimizations reduce bit-flips per write to 10-12%

Non-volatility: Power savings Security More vulnerable to stolen-memory attack SECURITY VULNERABILITY OF PCM 5 Stolen memory attack Bus snooping attack also possible We want to protect PCM from both “stolen memory attack” and “bus snooping attack”

ENCRYPTION ON PCM Protect PCM using memory encryption Encryption causes 50% bit flips on each write 1 bit flip in line  50% bit flips in encrypted line 6 Cache Encrypted PCM Encryption causes 50% bit flips  Write-intensive Avalanche Effect

ENCRYPTION ON PCM COSTLY 7 Encryption increases bit flips from 12% to 50% (4x!) 4x

GOAL: WRITE-EFFICIENT ENCRYPTION Goal: How do we implement memory encryption, without increasing bit flips by 4x? 8

OUTLINE Introduction to PCM and encryption Background on Counter-mode Encryption DEUCE Results Summary 9

Pad-based DecryptionDirect Decryption Encrypted Data WHY PAD-BASED ENCRYPTION? 10 AES Key Data Pad Encrypted Data AES Key Data Pad-based decryption has low-latency AES AES in critical path! XOR in critical path + PCM Cache

+ ZEROES PAD NEED FOR UNIQUE PADS (0000) ⊕ Pad = Pad Attack! Insecure if pad learned 11 ZEROES PAD Secure pad encryption cannot re-use pads + PAD ENCRYPTED PAD DATA ENCRYPTED

Different pad per line address Different pad per write to same line  per-line counter PAD-ctr Line1 PAD1 Line2 PAD2 Line1 Time3 Line1 Time2 COUNTER-MODE ENCRYPTION 12 Line1ENCRYPT1 Line2 PAD2 ENCRYPT2 Line1 Time1 ENCRYPT1 ENCRYPT2 PAD1 PAD-ctr2PAD-ctr1 Pad changes every write  50% bit flips every write Secured!

OUTLINE Introduction to PCM and encryption Background on Counter-mode Encryption DEUCE Results Summary 13

What if we re-encrypt only modified words? INSIGHT 1 Reduce bit flips by re-encrypting only modified words Re-encrypted Cache Encrypted PCM Re-encrypted Remains encrypted

Naïve implementation needs per-word counters/pads Reduce storage overhead by using only two counters INSIGHT 2: EFFICIENT IMPLEMENTATION 15 Encrypted PCM Do efficient partial re-encryption with only two counters Expensive!

Each line has two counters: Leading counter (LeadCTR): incremented every write Trailing counter (TrailCTR): updated every N writes (Epoch) “Modified” bit per word to choose between counters DEUCE: DUAL COUNTER ENCRYPTION 16

Epoch Start DEUCE: OPERATION On a write LeadCTR++ Re-encrypt words modified since Epoch Set “modified” bits TrailCTR=LeadCTR Re-encrypt ALL words Reset “modified” bits Between Epoch 17 DEUCE re-encrypts only words modified since Epoch

DEUCE: EXAMPLE (EPOCH = 4 WRITES) ALL W0 W7 W0 W4 W7 W0 W4 W7 ALL W2 W4 Encrypted Words LCTR TCTR W0,W7 W4 W7W2,W Modified Words in Write Get TrailCTR by masking 2 LSB off LeadCTR DEUCE needs only one physical counter per line X 18 DEUCE does partial re-encryption between Epochs

Encrypted Line DEUCE: HOW TO DECRYPT? DEUCE generates two pads with LCTR and TCTR Which pad to use decided by the “Modified” bits Decrypted Line Pad - LCTR Pad - TCTR AES Line counter Mask 4 LSB LCTR TCTR Combined 19

OUTLINE Introduction to PCM and encryption Background on Counter-mode Encryption DEUCE Results Summary 20

Core Chip  8 cores, each 4GHz 4-wide core  L1/L2/L3  32KB/256KB/1MB METHODOLOGY 21 Shared L4 Cache Phase Change Memory CPU

Shared L4 Cache:  64 MB capacity  50 cycle latency METHODOLOGY 22 Shared L4 Cache Phase Change Memory CPU

Phase Change Memory [Samsung ISSCC’12]  4 ranks, each 8GB  32GB total  Read latency 75ns  Write latency 150ns (per 128-bit write slot) METHODOLOGY 23 Shared L4 Cache Phase Change Memory CPU

METHODOLOGY 24 Shared L4 Cache Phase Change Memory CPU Workloads  SPEC2006: High MPKI, rate mode  4 billion instruction slice DEUCE: Epoch=32; Word size=2B

RESULTS: BIT FLIP ANALYSIS 25 DEUCE eliminates two-thirds of the extra bit flips caused by encryption

RESULTS: SPEEDUP&POWER ANALYSIS 26 Bit flip reduction improves speedup and EDP +27% DEUCE FNW Speedup Energy-Delay-Product -43% Unencrypted

Heavily-written bits still heavily-written with DEUCE Solution: Zero-cost “Horizontal” Wear Leveling RESULTS: LIFETIME ANALYSIS 27 Lifetime improvement of 2x +100% DEUCE with HWL

OUTLINE Introduction to PCM and encryption Baseline–Counter-mode Encryption DEUCE Results Summary 28

Insight: re-encrypt only modified words DEUCE: write-efficient encryption scheme Result: bit flips reduced 50%  23% (27% speedup) PCM more vulnerable to attacks  stolen module Secure PCM with encryption Problem: Encryption increases bit flips 12%  50% Goal: Encrypt PCM without causing 4x bit flips Bit flips expensive in Phase Change Memory (PCM) –Affects Lifetime, Power, and Performance –PCM system optimized to reduce bit flips (~12%) EXECUTIVE SUMMARY 29

THANK YOU 30

EXTRA SLIDES 31

DEUCE FOR OTHER NVM In-place writing NVM –RRAM –STT-MRAM 32

WRITE SLOTS PCM is power-limited Write-slot of 128 Flip-N-Write-enabled bits (64 bit flips) 33 Average number of write slots used per write request. On average, DEUCE consumes 2.64 slots whereas unencrypted memory takes 1.92 slots out of the 4 write slots

IMPROVING PCM WEAR LEVELING 34 Vertical Wear Leveling (Start-gap)—Inter-line leveling Horizontal Wear Leveling—Intra-line leveling Re-use vertical wear leveling to level inside line  START A B C D GAP   START C Rot3 D Rot3 A Rot B Rot2 A Rot3 D Rot4 C Rot4 B Rot4 B Rot3 GAP  Vertical Wear LevelingHorizontal Wear Leveling

QUESTION: IS IT SECURE? Aren’t you re-using the pad? AES-CTR-mode –Unique full pad per ever write. DEUCE –Epoch start, uses unique full pad –Between epochs, uses a unique full pad per write, to re- encrypt modified words. Unmodified words simply not touched  whole line is always encrypted, and pads are not re-used 35

MINIMUM RE-ENCRYPT 36 Words modified Value of Leading Counter (Epoch=4) W2 W7 W6 W0, W4, W5 W0 W0, W3, W4

DEUCE RE-ENCRYPT 37 Words modified Value of Leading Counter (Epoch=4) W2 W7 W6 W0, W4, W5 W0 W0, W3, W4

QUESTION: IS IT SECURE? What about information leak? AES-CTR-mode –Can tell when a line is modified DEUCE –Can tell when a line is modified, and –Can sometimes tell when a word is modified –Similar information leakage 38

CIPHER BLOCK CHAINING 39

COUNTER-MODE ENCRYPTION 40 Security bounds no worse than CBC (NIST-approved) Weakness to input is accepted to due to the underlying block cipher and not the mode of operation

RESULTS: BIT FLIP ANALYSIS 41 DEUCE eliminates two-thirds of the extra bit flips caused by encryption