Bo Fang☨, Qining Lu ☨, Karthik Pattabiraman ☨, Matei Ripeanu ☨

ePVF: An Enhanced Program Vulnerability Factor Methodology for Cross-Layer Resilience Analysis
Bo Fang☨, Qining Lu☨, Karthik Pattabiraman☨, Matei Ripeanu☨ and Sudhanva Gurumurthi*
☨ The University of British Columbia, Canada
* Cloud Innovation Lab, IBM, USA

What are we facing?
SoC soft error trends: the overall FIT (failures in time) rate per SoC is increasing [DATE 2014, Chandra, AMD]. The soft error rate increases as the feature size shrinks. This figure, presented by AMD at DATE 2014, shows that the trend holds for both memory and processor logic. Combined with other studies, the take-away is that we will see one failure per day per chip in the near future.

Why Software-based Fault Tolerance
We look at this problem in the context of the whole system stack. A hardware fault occurs and causes errors in the system, and those errors can propagate through all system layers. The number of impactful errors decreases as errors move up the stack. Hardware-based techniques for protecting the system from hardware faults have been studied for decades. But under current constraints, especially in HPC systems, pure hardware-based solutions are no longer affordable because of their energy overhead. We must therefore consider software-based techniques, which are more cost-effective. To do that, we first need to understand the error resilience of the software and applications.

Mitigating Silent Data Corruption (SDC): Key to Error Resilience
A main goal of error resilience studies is to mitigate SDCs. To set up some context from the application's perspective: a program is running normally, a fault occurs and becomes an error in the system, and the program then crashes, hangs, finishes with a benign result, or produces an incorrect output (an SDC). SDCs matter because they give the user no indication that anything went wrong.

Error Resilience Estimation: Accuracy vs Cost
Before doing anything to mitigate SDCs, we need to measure them. Error resilience estimation (characterization) can be explored along two dimensions: accuracy and cost. Fault injection (FI) experiments can provide high accuracy, but only by running a large number of fault injection runs. They have high resource consumption and low predictive power, which makes them a poor fit for early-stage analysis. Another line of study is AVF/PVF (Architectural Vulnerability Factor / Program Vulnerability Factor) analysis, which aims to identify the vulnerable bits among all the bits used by a program. AVF-like analysis is fast because it requires no fault injection, but it has been shown to have low accuracy: it provides only a conservative estimate of error resilience. Our goal is to design a method that provides high accuracy at low cost.

Identifying SDC-causing Bits
AVF/PVF: identify Architecturally Correct Execution (ACE) bits [MICRO03, HPCA10]. (Venn diagram: within the total bits for execution, the ACE bits contain the SDC-causing bits and the crash-causing bits.) A major part of this goal is identifying the SDC-causing state of a program. Operationally, a bit is an ACE bit if a fault in that bit can propagate to the final output. To identify the SDC-causing bits, we first identify the crash-causing bits among the ACE bits (a crash, by definition, does not lead to an SDC). Removing them gives a much closer estimate of the SDC-causing bits. e(nhanced)PVF: a methodology that distinguishes crash-causing bits from ACE bits.

PVF Analysis [Sridharan, HPCA'10]
Before jumping into ePVF, let me first briefly explain how PVF works. Assume all registers are 32 bits wide. I will walk you through building a Data Dependence Graph (DDG) for this piece of code:

    R1 = LD R2
    R4 = ADD R1, R3
    R5 = ADD R6*4, R7
    ST R4, R5
    R8 = LD R2

ADDR2 is where the final output is stored. (DDG figure: nodes ADDR1, R1 through R8 and ADDR2, connected by LD, ADD and ST edges.)

$\text{ACE Bits} = \sum_{i=1}^{7} \text{Bits in } R_i$
$\text{Total Bits} = \sum_{i=1}^{8} \text{Bits in } R_i$
$\text{PVF} = \frac{\text{ACE Bits}}{\text{Total Bits}} = 88.9\%$
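The ACE set in the PVF example comes from walking the DDG backward from the final output. Below is a minimal sketch of that backward slice, assuming the DDG is stored as a dictionary that maps each defined node to the nodes it reads. The dictionary layout and the name `backward_slice` are hypothetical; only the five instructions come from the slide.

```python
from collections import deque

# Data dependences for the slide's five instructions: node -> nodes it reads.
# ADDR1 / ADDR2 stand for the memory locations touched by LD / ST.
DDG = {
    "R1": ["ADDR1", "R2"],   # R1 = LD R2 (reads memory at ADDR1)
    "R4": ["R1", "R3"],      # R4 = ADD R1, R3
    "R5": ["R6", "R7"],      # R5 = ADD R6*4, R7
    "ADDR2": ["R4", "R5"],   # ST R4, R5 (value in R4, address in R5)
    "R8": ["ADDR1", "R2"],   # R8 = LD R2
}

def backward_slice(ddg, output):
    """Return every node that the output transitively depends on (plus the output)."""
    seen = {output}
    work = deque([output])
    while work:
        node = work.popleft()
        for dep in ddg.get(node, []):
            if dep not in seen:
                seen.add(dep)
                work.append(dep)
    return seen

# Registers on the backward slice of the final output are the ACE registers.
ace_registers = sorted(n for n in backward_slice(DDG, "ADDR2") if n.startswith("R"))
print(ace_registers)
```

R8 is defined but never reaches ADDR2, so it is the only register left out of the slice.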

Our Approach: ePVF
Again, our approach is to identify the crash-causing bits among all the bits. The challenge is how. First of all, we need to understand why crashes happen. Source of crashes: segmentation faults (99% of crashes are due to segfaults). Direct crash-causing bits are identified by the crash model. Indirect crash-causing bits are identified by the propagation model. (Figure: the DDG from the PVF example, with nodes ADDR1, R1 through R8 and ADDR2 and LD, ADD and ST edges.)

Overall methodology
The overall methodology contains four steps: (1) obtaining the program trace; (2) PVF, which identifies the ACE bits; (3) the crash model, which identifies bits that cause a program to make an invalid memory access and crash; and (4) the propagation model, which identifies bits on the backward slice of the bits that directly cause crashes. I am going to focus on the crash model and the propagation model in this talk.

Crash model
The crash model determines the bits that cause an out-of-bounds memory access, and it is applied to every memory instruction. Here is an example based on the same piece of code (R1 = LD R2; R4 = ADD R1, R3; R5 = ADD R6*4, R7; ST R4, R5; R8 = LD R2). Remember that the crash model identifies the direct crash-causing bits. Take R1 = LD R2. Because we have already captured the valid segment bounds (vma_start, vma_end) for this LD, we have a range of values, R2 ∈ [addr_min, addr_max], that R2 must stay inside to avoid a segfault. From that we can infer which bits must remain correct for R2 to stay inside the range. We tried this idea, but it did not work. We then looked into the OS kernel code and found that the actual valid range is wider and is related to the current stack pointer (ESP). After revising the model with this OS information, it determines those bits with 99% accuracy.
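The range check at the heart of the crash model can be illustrated as follows. This is a minimal sketch, assuming single bit-flips, inclusive bounds and a 32-bit address register. The function name, the address and the bounds are hypothetical, and the stack-pointer-dependent widening found in the kernel is not modeled here.

```python
def crash_causing_bits(value, addr_min, addr_max, width=32):
    """Return the bit positions whose single flip moves `value` outside [addr_min, addr_max]."""
    bits = []
    for bit in range(width):
        flipped = value ^ (1 << bit)  # flip exactly one bit
        if not (addr_min <= flipped <= addr_max):
            bits.append(bit)
    return bits

# Hypothetical address held in R2 and hypothetical segment bounds (one 4 KiB region).
r2 = 0x0804A010
bits = crash_causing_bits(r2, addr_min=0x0804A000, addr_max=0x0804AFFF)
print(bits)
```

With these made-up bounds, a flip in any of the low 12 bits keeps the address inside the region, while a flip in any higher bit leaves it, so only the upper bits are flagged as direct crash-causing bits.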

Propagation model
The next step is the propagation model, which identifies all bits that can affect the bits identified by the crash model. In the example, the crash model applied to ST R4, R5 gives the valid range min(R5), max(R5) for the address register R5. That range is propagated backward through R5 = ADD R6*4, R7 (that is, R5 = R6*4 + R7) to the operands:
max(R6) = (max(R5) - R7)/4
min(R6) = (min(R5) - R7)/4
max(R7) = max(R5) - R6*4
min(R7) = min(R5) - R6*4
The same kind of backward propagation applies to other instructions, not only ADD.
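The four range equations can be written out directly. This is a minimal sketch under stated assumptions: integer registers, observed values of R6 and R7 taken from the trace, and inward rounding for the division (ceiling for the minimum, floor for the maximum), which the slide does not specify. The function name and the numbers are hypothetical.

```python
def propagate_add_scaled(r5_min, r5_max, r6, r7, scale=4):
    """Backward-propagate the valid range of R5 = R6*scale + R7 to R6 and R7.

    r6 and r7 are the values observed in the trace; each operand's range is
    derived while the other operand is held at its observed value.
    """
    # Assumption of this sketch: round inward so every value in the range is valid.
    r6_min = -((r7 - r5_min) // scale)   # ceil((r5_min - r7) / scale)
    r6_max = (r5_max - r7) // scale      # floor((r5_max - r7) / scale)
    r7_min = r5_min - r6 * scale
    r7_max = r5_max - r6 * scale
    return (r6_min, r6_max), (r7_min, r7_max)

# Hypothetical numbers: R5 must stay in [0x1000, 0x1FFF]; the trace has R6 = 3, R7 = 0x1000.
r6_range, r7_range = propagate_add_scaled(0x1000, 0x1FFF, r6=3, r7=0x1000)
print(r6_range, r7_range)
```

Checking each bit of R6 and R7 against its propagated range would then flag the indirect crash-causing bits.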

Overall ePVF methodology
Steps: Obtaining Program Trace, PVF (identify ACE bits), Crash Model, Propagation Model. The result is the ePVF: the bits that potentially lead to SDCs. ePVF is a general method that can be implemented at any level; we chose to implement it in LLVM. LLVM is a production-level compiler, close to assembly, and architecture-neutral.

Experimental setup
Scientific benchmarks: 8 from Rodinia [IISWC 09], Matrix Multiplication, and LULESH, a DOE proxy app [IPDPS 2013]. Fault model: LLFI [DSN 14], 3,000 runs per benchmark. We use fault injection as the ground truth when evaluating our models. We use LLFI to inject single bit-flips into IR source registers. We care about the faults that affect the application; we do not consider faults that get masked in the layers of the stack below the application. (Presenter note: prepare for the DAC questions.)

Evaluation
RQ1: Accuracy of the models. How accurately does the ePVF methodology predict the bits in which faults lead to program crashes?
RQ2: Effectiveness of the ePVF methodology. Can the methodology be used to obtain a significantly tighter estimate of the SDC rate than the conventional PVF methodology?
RQ3: Performance. How fast and scalable is the ePVF analysis?
In terms of the Venn diagram (total bits for execution, ACE bits, SDC-causing bits, crash-causing bits): how large is the dotted circle, and how large is the remaining area?

RQ1: Accuracy of the models
Our models achieve 89% recall and 92% precision on average. Recall: for the fault injection runs that ended in a crash, we check whether the flipped bits are identified by our model. Precision: we inject bit-flips into the bits suggested by the models and check whether they actually cause crashes. (Presenter notes: add error bars; add a flowchart showing how recall and precision are obtained; explain why these results are good, relating them to the Venn diagram; try to put the flow and the result together; show the true positives.)
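The two measurements can be expressed as simple ratios. This is a minimal sketch, assuming each bit is identified by a (dynamic instruction id, bit position) pair and each targeted injection is labeled with its outcome. The data and function names are hypothetical, not results from the talk.

```python
def recall(fi_crash_bits, predicted_bits):
    """Fraction of bits that crashed under random fault injection and were also flagged by the model."""
    fi_crash_bits = set(fi_crash_bits)
    return len(fi_crash_bits & set(predicted_bits)) / len(fi_crash_bits)

def precision(targeted_outcomes):
    """Fraction of injections into model-flagged bits that actually ended in a crash."""
    crashes = sum(1 for outcome in targeted_outcomes if outcome == "crash")
    return crashes / len(targeted_outcomes)

# Hypothetical data: bits are (dynamic instruction id, bit position) pairs.
fi_crash_bits = [(10, 31), (10, 30), (42, 17), (57, 3)]           # crashes seen in random FI
predicted_bits = {(10, 31), (10, 30), (42, 17), (42, 18)}         # bits flagged by the models
targeted_outcomes = ["crash", "crash", "crash", "benign"]         # FI into flagged bits

print(recall(fi_crash_bits, predicted_bits), precision(targeted_outcomes))
```

Recall is computed from random injections that crashed, while precision needs a second campaign that targets only the flagged bits, which is why the two functions take different inputs.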

RQ1. Accuracy of the Models
Let's see how close we are to the SDC-causing bits in the Venn diagram (total bits for execution, ACE bits, SDC-causing bits, crash-causing bits). On average, the ePVF methodology correctly identifies the crash-causing bits 90% of the time.

RQ2: Effectiveness of the ePVF
SDC estimates from PVF analysis, ePVF analysis and fault injection. ePVF significantly tightens the upper bound on the estimated SDC rate, by 61% on average: the PVF estimates range from 71% to 98%, while the ePVF estimates range from 25% to 40%. The ePVF estimate is still 19% off on average when predicting the SDC rate. We understand that we are still a little conservative for SDCs, but we will show that ePVF is already useful.

ePVF-informed Duplication
Rank instructions based on their ePVF value:
$\text{ePVF value per instruction} = \frac{\text{ACE bits} - \text{Crash-causing bits}}{\text{ACE bits}}$
The higher the ePVF value, the higher the chance of leading to an SDC. Duplicating the highly ranked ePVF instructions gives 30% more SDC coverage than hot-path duplication for the same performance overhead.
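The ranking step can be sketched as follows. This is an illustration only: the per-instruction bit counts are made up, and expressing the overhead budget as a number of instructions to duplicate is an assumption of this sketch, not something the slide specifies.

```python
def epvf_value(ace_bits, crash_bits):
    """Per-instruction ePVF value: (ACE bits - crash-causing bits) / ACE bits."""
    return (ace_bits - crash_bits) / ace_bits if ace_bits else 0.0

def select_for_duplication(instructions, budget):
    """Pick the `budget` instructions with the highest ePVF value (budget as a count is assumed)."""
    ranked = sorted(
        instructions,
        key=lambda ins: epvf_value(ins["ace"], ins["crash"]),
        reverse=True,
    )
    return [ins["name"] for ins in ranked[:budget]]

# Hypothetical per-instruction bit counts.
instrs = [
    {"name": "load_a", "ace": 32, "crash": 20},  # ePVF value 0.375
    {"name": "add_b", "ace": 32, "crash": 0},    # ePVF value 1.0
    {"name": "mul_c", "ace": 32, "crash": 8},    # ePVF value 0.75
    {"name": "dead_d", "ace": 0, "crash": 0},    # no ACE bits, value 0.0
]
print(select_for_duplication(instrs, budget=2))
```

Instructions whose ACE bits are mostly crash-causing rank low, because a fault there is more likely to crash the program than to silently corrupt its output.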

RQ3: Performance
Modeling time ranges from 30 s (lavaMD) to about 4 hours (pathfinder), depending on the size of the DDG and hence on the number of dynamic instructions. Optimization: sampling and extrapolation. The intuition is that scientific applications usually have repetitive behavior. ePVF values extrapolated from a 10% sample of the graph (about 1/10 of the overall time) differ by less than 1% on average. (Presenter notes: how to sample; performance.)

Conclusion
ePVF removes the crash-causing bits from PVF to get a more accurate estimate of the SDC rate. It consists of a crash model that predicts the direct crash-causing bits, a propagation model that identifies the bits that lead to direct crash-causing bits, and an implementation in the LLVM compiler. It can drive selective protection of SDC-causing instructions. Code: https://github.com/flyree/enhancedPVF (Presenter notes: one fundamental insight, three things; add the Venn diagram, address and contact; add a discussion slide as a backup slide.)

Discussion
Sources of inaccuracy: lucky loads, floating-point precision, Y-branches. Performance: Python -> C/C++, parallelization.

Obtaining program trace
The first step is obtaining the program trace. In this step, we collect the dynamic instructions of the program and the values stored in each register. The trace records the dynamic instructions, the data flow information and, for each load and store, the valid segment bounds.
