
Architectural Support for Security in the Many-core Age: Threats and Opportunities Dr. Nael Abu-Ghazaleh and Dr. Dmitry Ponomarev Department of Computer Science SUNY-Binghamton {nael,dima}@cs.binghamton.edu

Multi-cores --> Many-cores Moore's law is coming to an end: the power wall, the ILP wall, and the memory wall mark the "end of the lazy-boy programming era." Multi-cores offer a way out. The new Moore's law: 2x the number of cores every 1.5 years. The many-core era is about to start. We will have more cores than we can power at once, so chips are likely to include many accelerators, including ones for security. How do we best support trusted computing? It is critical to anticipate and defuse security threats.

Security Challenges for Many-cores Diverse applications, both parallel and sequential. New vulnerabilities arise due to resource sharing: side-channel and denial-of-service attacks. Performance impact is a critical consideration. We can use spare cores/thread contexts to accelerate security mechanisms, and speculative checks to lower latency.

Defending against attacks on shared resources

Attacks on Shared Resources Resource sharing (specifically the sharing of the cache hierarchy) opens the door for two types of attacks Side-Channel Attacks Denial-of-Service Attacks Our first target: software cache-based side channel attacks. First, some cache background...

Background: Set-Associative Caches

L1 Cache Sharing in SMT Processor (diagram: an SMT pipeline in which the per-thread PCs, re-order buffers, and architectural state are private resources, while the fetch unit, instruction cache, decode, register rename, issue queue, register file, execution units, load/store queues, LD/ST units, and data cache are shared resources)

Last-level Cache Sharing on Multicores (Intel Xeon): 2 × quad-core Intel Xeon E5345 (Clovertown). (figure from Morgan Kaufmann, Chapter 7 — Multicores, Multiprocessors, and Clusters)

Advanced Encryption Standard (AES) One of the most popular algorithms in symmetric-key cryptography. 16-byte input (plaintext), 16-byte output (ciphertext), 16-byte secret key (for standard 128-bit encryption). Several rounds of 16 XOR operations and 16 table lookups. (diagram: lookup table index = input byte XOR secret key byte)
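The index relation in the diagram is exactly what cache attacks exploit. As a minimal sketch (not a full AES implementation; the function names are ours), the first-round table index in a table-based AES is the XOR of a plaintext byte with a key byte, so an attacker who learns which table entry was touched can invert the XOR:

```c
#include <assert.h>
#include <stdint.h>

/* First-round lookup index for a table-based AES implementation:
 * index = plaintext byte XOR key byte. Observing which table
 * entries (cache lines) are touched therefore leaks the key. */
static uint8_t first_round_index(uint8_t plaintext_byte, uint8_t key_byte)
{
    return plaintext_byte ^ key_byte;
}

/* Once the attacker learns the index, recovering the key byte is
 * a single XOR with the known plaintext byte. */
static uint8_t recover_key_byte(uint8_t observed_index, uint8_t plaintext_byte)
{
    return observed_index ^ plaintext_byte;
}
```

In practice the attacker observes only the cache line, not the exact table index, so the low-order bits of each key byte that fall within one line must still be brute-forced.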

Cache-Based Side Channel Attacks An attacker and a victim process (e.g., AES) run together using a shared cache. Access-driven attack: the attacker occupies the cache, evicting the victim's data; when the victim accesses the cache, the attacker's data is evicted; by timing its own accesses, the attacker can detect the intervening accesses by the victim. Time-driven attack: the attacker fills the cache, times the victim's execution for various inputs, and performs correlation analysis.

Attack Example (diagram: a cache with ways a, b, c, d backed by main memory holding AES data and the attacker's data; after the victim runs, the attacker's reload time satisfies b > (a ≈ c ≈ d), revealing that the victim touched line b). Can exploit knowledge of the cache replacement policy to optimize the attack.

Simple Attack Code Example

#define ASSOC 8
#define NSETS 128
#define LINESIZE 32
#define ARRAYSIZE (ASSOC*NSETS*LINESIZE/sizeof(int))

static int the_array[ARRAYSIZE];

int fine_grain_timer(); /* implemented as inline assembler */

void time_cache()
{
    register int i, time, x;
    for (i = 0; i < ARRAYSIZE; i++) {
        time = fine_grain_timer();
        x = the_array[i];                  /* probe: load one array element */
        time = fine_grain_timer() - time;  /* hit vs. miss latency */
        the_array[i] = time;               /* record the timing in place */
    }
}

Existing Solutions Avoid using pre-computed tables: too slow. Lock critical data in the cache (Lee, ISCA 07): impacts performance; requires OS/ISA support for identifying critical data. Randomize the victim selection (Lee, ISCA 07): significant cache re-engineering (impractical); high complexity; requires OS/ISA support to limit the randomization to critical data only.

Desired Features and Our Proposal Desired solution: Hardware-only (no OS, ISA or language support) Low performance impact Low complexity Strong security guarantee Ability to simultaneously protect against denial-of-service (a by-product of access-driven attack) Our solution: Non-Monopolizable (NoMo) Caches

NoMo Caches Key idea: an application sharing the cache cannot use all lines in a set. NoMo invariant: in an N-way cache, a thread can use at most N – Y lines of each set, where Y is the NoMo degree. Essentially, we reserve Y cache ways for each co-executing application and dynamically share the rest. If Y = N/2, we have static non-overlapping cache partitioning. Implementation is very simple: just check the reservation bits at the time of replacement.

NoMo Replacement Logic
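The replacement check can be sketched as follows. This is our illustrative reading of the scheme with hypothetical names, not the actual hardware: each way carries a reservation tag, and a missing thread may only choose a victim among ways that are unreserved or reserved for itself.

```c
#include <assert.h>

#define NWAYS 8
#define UNRESERVED -1

/* NoMo-2 on an 8-way set for two threads: ways 0-1 reserved for
 * thread 0, ways 2-3 for thread 1, ways 4-7 dynamically shared. */
static const int reserved_for[NWAYS] = { 0, 0, 1, 1,
                                         UNRESERVED, UNRESERVED,
                                         UNRESERVED, UNRESERVED };

/* Pick the replacement victim for a miss by `thread` in one set:
 * scan ways from least to most recently used (lru_order lists way
 * indices, LRU first) and take the first way this thread is allowed
 * to use, i.e. unreserved or reserved for this thread. */
static int nomo_pick_victim(int thread, const int lru_order[NWAYS])
{
    for (int i = 0; i < NWAYS; i++) {
        int w = lru_order[i];
        if (reserved_for[w] == UNRESERVED || reserved_for[w] == thread)
            return w;
    }
    return -1; /* unreachable: every thread always has eligible ways */
}
```

With Y = N/2 every way is reserved and the check degenerates to static non-overlapping partitioning, matching the NoMo invariant above.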

NoMo Example for an 8-way Cache (diagram: four sets of an 8-way cache under NoMo-2, shown as a step-by-step animation; X:N means data X from thread N, with yellow marking thread 1's reserved ways and blue thread 2's. The sequence shows the initial cache usage by thread 1, thread 2 entering, cache usage growing, and allocations landing in reserved versus dynamically shared ways.)

Why Does NoMo Work? The victim's accesses become visible to the attacker only if the victim accesses lines outside of its allocated partition between two cache fills by the attacker. In this example: NoMo-1.

Evaluation Methodology We used the M-Sim-3.0 cycle-accurate simulator (a multithreaded and multicore derivative of SimpleScalar) developed at SUNY Binghamton: http://www.cs.binghamton.edu/~msim. Evaluated security for AES and Blowfish encryption/decryption. Ran the security benchmarks for 3M blocks of randomly generated input. Implemented the attacker as a separate thread and ran it alongside the crypto processes. Assumed that the attacker is able to synchronize at block encryption boundaries (i.e., it fills the cache after each block encryption and checks the cache after the next encryption). Evaluated performance on a set of SPEC 2006 benchmarks using a Pin-based trace-driven simulator built with Pintools.

Aggregate Exposure of Critical Data

Sets with Critical Exposure (table: number of cache sets, out of 128, with critical exposure for AES encryption/decryption and Blowfish encryption/decryption under NoMo-0 through NoMo-4; NoMo-0 exposes all 128 sets, NoMo-2 exposes between 10 and 22 sets, NoMo-3 at most one, and NoMo-4 none)

Impact on IPC Throughput (105 2-threaded SPEC 2006 workloads simulated)

Impact on Fair Throughput (105 2-threaded SPEC 2006 workloads simulated)

NoMo Design Summary A practical and low-overhead hardware-only design for defeating access-driven cache-based side channel attacks. Can easily adjust security-performance trade-offs by manipulating the NoMo degree. Can support unrestricted cache usage in single-threaded mode. Performance impact is very low in all cases. No OS or ISA support required.

NoMo Results Summary (for an 8-way L1 cache) NoMo-4 (static partitioning): complete application isolation, with 1.2% average (5% max) performance and fairness impact on SPEC 2006 benchmarks. NoMo-3: no side channel for AES, and 0.07% critical leakage for Blowfish; 0.8% average (4% max) performance impact. NoMo-2: leaks 0.6% of critical accesses for AES and 1.6% for Blowfish; 0.5% average (3% max) performance impact. NoMo-1: leaks 15% of critical accesses for AES and 18% for Blowfish; 0.3% average (2% max) performance impact.

Extending NoMo to Last-level Caches Side-channel attacks are also possible at the L2/L3 level, especially with cache hierarchies that explicitly guarantee inclusion: the attacker can invalidate the victim's lines in the L2/L3, thus forcing their eviction from the private L1s. The effect of partitioning is much more profound at that level. We also have to address the possibility of a multithreaded attack, and examine other designs for protecting L2/L3 caches, where latency is less critical. Investigations in progress...

Using Extra Cores/Threads for Security

Using Extra Cores/Threads for Security Main opportunity: using extra cores and core extensions to support security: Improve performance by offloading security-related computations Reduce design complexity Applications that we consider: Dynamic Information Flow Tracking (DIFT) Dynamic Bounds Checking (not covered in this talk)

Dynamic Information Flow Tracking Basic idea: attacks come from outside of the processor. Mark data coming from the outside as tainted. Propagate taint inside the processor during program execution. Flag the use of tainted data in unsafe ways.

Security Checking Policies Memory address AND data tainted. Load address is tainted. Store address is tainted. Jump destination is tainted*. Branch condition is tainted*. System call arguments are tainted*. Return address register is tainted. Stack pointer is tainted. Memory address OR data tainted*. (*: policies aimed at ensuring undesired code is not executed)
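Propagation and checking can be sketched together in a few lines. This is a hedged illustration with register-level shadow state only (names are ours; a real DIFT design also shadows memory and flags a security exception rather than returning a flag):

```c
#include <assert.h>
#include <stdbool.h>

#define NREGS 32
static bool reg_taint[NREGS];   /* one taint bit per register */

/* Taint propagation for an ALU op dst = src1 OP src2: the result
 * is tainted if either source is (a logical OR of the taint bits). */
static void propagate_alu(int dst, int src1, int src2)
{
    reg_taint[dst] = reg_taint[src1] | reg_taint[src2];
}

/* Example checking policy: flag a jump whose destination register
 * is tainted (one of the starred policies above). */
static bool check_jump(int target_reg)
{
    return reg_taint[target_reg];   /* true = unsafe use detected */
}
```

Tainting an input register and then propagating through an add makes a subsequent indirect jump through the result trip the check, which is exactly how an injected code pointer is caught.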

Existing DIFT Schemes and Limitations Hardware solutions: taint propagation with extra busses and additional checking units. Limitation: intrusive changes to the datapath design. Software solutions: more instructions to propagate and check taint. Limitations: high performance cost; source code recompilation required.

Hardware-based DIFT (diagram: an existing hardware DIFT pipeline executing add r3,r1,r4; a taint bit travels with the data through the IFQ, decode, register file, ALU, and memory stages over extra busses, with taint computation logic and exception checking logic attached to the datapath)

Software-based DIFT (diagram: the compiler instruments add r3,r1,r4 with additional instructions, e.g. shr/shl/and/or sequences, that compute and propagate taint in software; a small region of memory stores the taint information for memory, and register r5 is reserved for the taint information of the rest of the register file)

Our Proposal: SIFT (SMT-based DIFT) Execute two threads on an SMT processor. The primary thread executes the real program; the security thread executes taint-tracking instructions. Committed instructions from the main thread generate taint-checking instruction(s) for the security thread. Instruction generation is done in hardware. Taint-tracking instructions are stored in a buffer, from which they are fed to the second thread context. The threads are synchronized at system calls.
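The generation step can be sketched for the ALU case. This is an illustrative simplification (the struct and names are our assumptions, not the actual hardware tables): each committed add maps to an or over the taint register file, the same mapping that turns add r3,r1,r4 into or r3,r1,r4.

```c
#include <assert.h>
#include <string.h>

/* A committed primary-thread instruction (simplified). */
struct insn { char op[8]; int dst, src1, src2; };

/* SIFT-style hardware generation, sketched in C: each committed
 * ALU instruction yields one taint-tracking instruction for the
 * security thread -- an OR of the source taint registers into the
 * destination taint register (taint is sticky under ALU ops). */
static struct insn gen_taint_insn(const struct insn *committed)
{
    struct insn t;
    strcpy(t.op, "or");   /* taint(dst) = taint(src1) | taint(src2) */
    t.dst  = committed->dst;
    t.src1 = committed->src1;
    t.src2 = committed->src2;
    return t;
}
```

Loads, stores, and branches would generate different checking sequences (and memory addresses go to the Address Buffer), but the register-to-register case above captures the core idea.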

Instruction Flow in SIFT

SMT Datapath with SIFT Logic

SIFT Example (diagram: the primary thread commits add r3,r1,r4 in context 1; the SIFT generator emits the taint-tracking instruction or r3,r1,r4 into the IFQ of context 2, which executes it against the taint register file RF2; the execution units and memory are shared between the two contexts)

SIFT Instruction Generation Logic 1. Taint code generation. 2. Security instruction opcodes are read from the COT. 3. The rest of the instruction is taken from the Register Organizer and stored in the Instruction Buffer. 4. Load and store instructions' memory addresses are stored in the Address Buffer.

Die Floorplan with SIFT Logic Based on the SUN T1 open-source core. The IGL was synthesized with Synopsys Design Compiler using a TSMC 90nm standard-cell library. The COT, IB, and AB were implemented using Cadence Virtuoso. The integrated processor netlist was placed and routed using Cadence SoC Encounter. The SIFT logic costs 4.5% of the whole processor area.

Benefits of Taint Checking with SMT Software is not involved; the scheme is transparent to the user and to applications (although the checking code can also be generated in software). Hardware instruction generation is faster than software generation. The additional hardware is at the back end of the pipeline, so it is not on the critical path. No inter-core communication.

Number of Security Instructions per committed Primary Thread Instruction

SIFT Performance Overhead

SIFT Performance Optimizations Reduce the number of checking instructions by eliminating the ones that never change the taint state. Reduce data dependencies in the checker by preloading taint values into its cache once the main program encounters the corresponding address Reduce the number of security instructions depending on taint state of registers and TLB

Eliminating Checking Instructions (example, three columns: a primary-thread Alpha instruction sequence; the full SIFT security-thread sequence it generates; and the shorter SIFT-F security-thread sequence obtained after filtering out checking instructions that can never change the taint state)

SIFT Logic with Instruction Elimination

Performance Impact of Eliminating Security Instructions (charts: performance loss of SIFT and SIFT-F compared to baseline single-thread execution; percentage of filtered instructions)

Performance Impact of Cache Prefetching

SIFT Datapath with TLB Based Optimization

SIFT Logic with TLB Based Optimization

Details of TLB-Based Optimization
For stores:
tainted register -> clean page: generate instructions
tainted register -> tainted page: generate instructions
clean register -> clean page: don't generate instructions
clean register -> tainted page: generate instructions
For loads:
tainted page -> tainted register: generate instructions
tainted page -> clean register: generate instructions
clean page -> tainted register: generate instructions
clean page -> clean register: don't generate instructions
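The eight rules above collapse to a single predicate. As a minimal sketch (the function name is ours): a checking instruction must be generated unless both the register and the page are clean, since only then is the taint state guaranteed not to change.

```c
#include <assert.h>
#include <stdbool.h>

/* TLB-based filtering: generate a checking instruction for a load
 * or store unless both the register and the page are clean. */
static bool must_generate(bool reg_tainted, bool page_tainted)
{
    return reg_tainted || page_tainted;
}
```

The same predicate covers loads and stores; only which side is the register and which the page differs, which is why a taint bit per TLB entry suffices to filter the clean/clean case.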

SIFT Performance on a 4-way Issue Processor

SIFT Performance on a 8-way Issue Processor

Future Work To consider in the future: Collapse multiple checking instructions via the ISA. Optimize resource sharing between the two threads. Provide additional execution units for the checker. Implement the Register Taint Vector and TLB-based instruction elimination.

Thank you! Any Questions?