Tong Zhang, Dongyoon Lee, Changhee Jung


1 TxRace: Efficient Data Race Detection Using Commodity Hardware Transactional Memory
Tong Zhang, Dongyoon Lee, Changhee Jung. Computer Science Department, Virginia Tech.
Good afternoon, my name is Tong, and I come from Virginia Tech. Today I am going to talk about our paper, TxRace: Efficient Data Race Detection Using Commodity Hardware Transactional Memory. This work was done under the supervision of my advisors, Dongyoon Lee and Changhee Jung.

2 Data Races in Multithreaded Programs
Two threads access the same shared location (at least one is a write), and the accesses are not ordered by synchronization.
Thread 1: if (p) { fput(p, …) }    Thread 2: p = NULL    → crash (MySQL bug #3596)
What is a data race? A data race occurs when two or more threads in a single process access the same memory location concurrently, at least one of the accesses is a write, and the threads use no synchronization to order their accesses to that memory. For example, here is a real-world bug from MySQL: if the interleaving nullifies p between Thread 1's check and its use, Thread 1 crashes. Now let's look at the real-world problems data races have caused.

3 Race Conditions Caused Severe Problems
Northeast Blackout of 2003: 50+ million people lost power; cost an estimated $6 billion. Stock price mismatch in 2012: about 30 million shares' worth of trading were affected; cost an estimated $13 million.
The Northeast blackout of 2003 left more than 50 million people in the dark and caused an estimated $6 billion in losses. More recently, in 2012, a race condition caused a Nasdaq stock price mismatch that affected about 30 million shares' worth of trading and cost an estimated $13 million. We cannot let this keep happening, and that motivated the development of data race detectors.

4 State-of-the-Art Dynamic Data Race Detectors
Software-based solutions (FastTrack [PLDI'09], Intel Inspector XE, Google ThreadSanitizer, ...): sound (no false negatives) and complete (no false positives), but high overhead (10-100x). Hardware-based solutions (ReEnact [ISCA'03], CORD [HPCA'06], SigRace [ISCA'09]): low overhead, but require custom hardware.
There are static and dynamic data race detectors. TxRace focuses on dynamic race detectors, which detect data races that happen during program execution. Among dynamic detectors, software solutions are often sound and complete but have high overhead, which keeps them from being widely used. To lower the overhead of software solutions, hardware solutions were proposed, but they require custom hardware that is unlikely to appear in commodity processors, which is another obstacle to adoption by developers. TxRace takes the good part of both worlds: it has low overhead and requires no custom hardware.

5 Our Approach: Hybrid SW + (Existing) HW Solution
Leverage the data conflict detection mechanism of Hardware Transactional Memory (HTM) in commodity processors for lightweight data race detection. Low overhead, no custom hardware.
TxRace is a hybrid approach: to perform lightweight data race detection, it exploits HTM's data conflict detection mechanism.

6 Outline: Motivation; Background: Transactional Memory
TxRace: Design and Implementation; Experiments; Conclusion
So far we have talked about data races, the pros and cons of existing solutions, and our proposed solution, which is based on HTM. Next I am going to talk about transactional memory and the challenges of using it for data race detection.

7 Transactional Memory (TM)
Allows a group of instructions (a transaction) to execute in an atomic manner. Example: Thread 1 executes Read(X) inside a transaction while Thread 2 executes Write(X); the data conflict aborts Thread 1's transaction, which rolls back and re-executes.
What is transactional memory? Transactional memory is a synchronization technology for concurrent threads. It simplifies parallel programming by turning groups of instructions into atomic transactions. In the example, two threads executing on separate cores have conflicting accesses to the shared variable X; this aborts the transaction in Thread 1, which rolls back and re-executes. Because of the limitations of commodity HTM, there are a few challenges in using it for data race detection.

8 Challenge 1: Unable to Pinpoint Racy Instructions
When a transaction gets aborted, we know there was a data conflict between transactions. However, we DO NOT know WHY and WHERE: which instruction? at which address? which transaction caused the conflict?
The first challenge is that HTM cannot pinpoint the racy instructions. In data race detection we usually need not only the racy instruction pair but also the racy variable; with HTM, however, we only learn the fact that a transaction aborted. Consider the previous two threads: once the transaction in Thread 1 aborts and rolls back, no information about the racy instruction pair or the racy variable is exposed to the outside.

9 Challenge 2: False Conflicts → False Positives
HTM detects data conflicts at cache-line granularity → false positives. Example: Thread 1 executes Read(X) and Thread 2 executes Write(Y), where X and Y reside in the same cache line; the transaction aborts without any data race.
The second challenge is that HTM is implemented on top of the cache coherence protocol, which detects conflicting accesses at cache-line granularity, and that may cause false positives. If two threads access different variables that happen to be located in the same cache line, the detection granularity causes an abort. Such cases must be ruled out, or they become false positives.

10 Challenge 3: Non-conflict Aborts
Best-effort (non-ideal) HTM has limitations → a transaction may get aborted without any data conflict → false negatives (if ignored). Two kinds: "capacity" aborts, when the hardware buffer tracking the transaction's accesses fills up, and "unknown" aborts, e.g., on I/O or a syscall inside a transaction.
The third challenge is non-conflict aborts. Because HTM is implemented on top of the cache coherence protocol, a hardware buffer, usually the L1 cache, holds intermediate results and keeps track of the variables accessed inside the transaction. Once it is full and can no longer track accesses, the transaction aborts: a capacity abort. Unsupported instructions, such as I/O and syscalls inside a transaction, also cause aborts, which are categorized as unknown aborts. The hardware simply cannot run detection in these cases, and if we leave them alone, the result is false negatives.

11 Outline: Motivation; Background: Transactional Memory
TxRace: Design and Implementation; Experiments; Conclusion
So far we have introduced transactional memory and the three main challenges. Next, I am going to talk about how TxRace solves them.

12 TxRace: Two-phase Data Race Detection
Fast path (HTM-based, Intel Haswell RTM): fast, but unable to pinpoint races, subject to false sharing (false positives) and non-conflict aborts (false negatives). Slow path (SW-based, Google ThreadSanitizer): sound (no false negatives) and complete (no false positives), but slow. On a potential data race, execution switches from the fast path to the slow path.
TxRace has a fast path that leverages the hardware's conflicting-access detection. Because of HTM's limitations, the fast path cannot pinpoint races, suffers from false sharing, and has non-conflict aborts. To address these challenges, we employ a software-based slow path that is sound and complete, but slow. When a potential data race is detected (possibly a false positive due to false sharing), we switch to the slow path to perform sound and complete detection: pinpoint the racy instructions, rule out false positives, and conservatively handle non-conflict aborts. Because the program switches to the slow path only when there is a potential data race, we expect it to take the fast path most of the time. When the slow path finishes, we switch back to the fast path for fast conflict detection. In this study, Intel RTM provides the fast path, and Google ThreadSanitizer, configured to be sound and complete, provides the slow path. To support this two-phase execution, the program must be instrumented.

13 Compile-time Instrumentation
Fast path: convert sync-free regions into transactions. Slow path: add Google TSan checks. Example: in each thread, the code between Unlock() and the next Lock() is a sync-free region and becomes a transaction, so conflicting accesses such as X=1 and X=2 can be detected by HTM.
There are two things to do when instrumenting a program. For the fast path, we annotate the code regions, the sync-free regions, that need to be protected by HTM. For the slow path, we intercept memory operations and synchronization operations for Google TSan. For example, two threads have conflicting accesses to X and we want to detect that race condition, so we annotate each such code region as a transaction, allowing HTM to detect the conflict. Next I am going to explain how the program transitions between the two phases.

14 Fast-path HTM-based Detection
Leverage HW-based data conflict detection in HTM. Problem: on a conflict, one transaction gets aborted, but all the others just proceed → the slow path misses the racy transactions.
The fast path leverages HTM to detect conflicting data accesses. In the example we have three threads; Threads 2 and 3 have a conflicting data access (X=2 vs. X=1). Thread 2 gets aborted, while Thread 3 has already passed the access and continues executing. That is a problem: if we just let Thread 3 go without sending it to the slow path for data race detection, we miss the race and get a false negative.

15 Fast-path HTM-based Detection
Leverage HW-based data conflict detection in HTM. Problem: on a conflict, one transaction gets aborted, but all the others just proceed → cannot switch to the slow path. Solution: abort in-flight transactions artificially.
So we need to artificially abort the rest of the transactions and send all threads to the slow path. To do so, we introduce a shared variable, TxFail, and make every transaction read TxFail (R(TxFail)) immediately after it begins. When an abort happens, the aborted transaction writes TxFail (W(TxFail)). Because of HTM's strong isolation property, this write aborts all in-flight transactions. After they all roll back, the program runs in the slow-path phase.

16 Slow-path SW-based Detection
Use SW-based sound and complete data race detection to pinpoint racy instructions, filter out false positives (due to false sharing), and handle non-conflict (e.g., capacity) aborts conservatively.
In the slow path, TxRace performs sound and complete data race detection, so the racy instruction pair can be pinpointed and false positives due to false sharing can be ruled out. Other kinds of aborts are handled conservatively. After the slow path is done, we switch back to the fast path.

17 Implementation
Two-phase data race detection. Fast path: Intel's Haswell processor (RTM). Slow path: Google's ThreadSanitizer. Instrumentation: LLVM compiler framework, with compile-time and profile-guided optimizations. Evaluation: PARSEC benchmark suite with simlarge input; Apache web server with 300K requests from 20 clients; 4 worker threads (4 hardware transactions).
We implemented TxRace using the LLVM framework; the fast path targets Intel RTM, and Google ThreadSanitizer is used for the slow path. There are several other optimizations, such as splitting transactions into smaller ones so there are fewer capacity aborts; please refer to the paper for details. TxRace was evaluated on the PARSEC benchmark suite and one real-world application, Apache. The thread count was set to the number of cores, 4.

18 Outline: Motivation; Background: Hardware Transactional Memory
TxRace: Design and Implementation; Experiments: 1) Performance, 2) Soundness (detection capability), 3) Cost-effectiveness; Conclusion
Next, I am going to show some results on performance overhead, detection capability, and cost-effectiveness.

19 1. Performance Overhead: >10x reduction (11.68x → 4.65x)
This figure shows the runtime overhead of TSan and TxRace. The yellow bars are TSan and the blue bars are TxRace; the x-axis lists the applications, and the y-axis is runtime overhead normalized to the original execution, so lower is better. The geomean for TSan is 11.68x, while for TxRace it is 4.65x: TxRace is much better than TSan in terms of overhead. Next, let's take a look at the breakdown of the performance overhead.

20 2. Soundness (Detection Capability): Recall 0.95
This chart shows the number of data races detected by TSan and TxRace. The yellow bars are TSan and the blue bars are TxRace; the x-axis lists the applications, and the y-axis is the number of data races detected, so higher is better. For most applications, TxRace detected all the data races that TSan can detect; for 3 of them, TxRace missed some. Some of these false negatives are caused by data races in non-overlapping transactions. Overall, the figure shows that TxRace has a high recall of 0.95. Why do the false negatives exist?

21 False Negatives Due to Non-overlapping Transactions
Example: Thread 1's transaction containing X=1 ends before Thread 2's transaction containing X=2 begins.
One limitation of TxRace is that it may have false negatives. TxRace can only detect data races that occur in overlapping transactions; if the racy transactions do not overlap in time, there is no data conflict for the hardware to observe, and TxRace cannot detect the race.

22 False Negatives: Case Study in vips
Repeat the experiment to exercise different interleavings → all detected.
This can be mitigated by repeating the experiment. For vips, we can run the program several times to exercise more interleavings, so that all the data races are eventually discovered. Finally, what about naive sampling? We want to know where TxRace stands compared to naive sampling of memory operations.

23 3. Cost-effectiveness Compared to Sampling
TxRace vs. TSan with sampling: overhead equivalent to naive sampling at 25.5%.
We picked one application from the PARSEC benchmark suite, bodytrack. The figure shows how the overhead changes as the percentage of memory operations sampled varies. The yellow triangle marks where TxRace stands: its overhead is similar to sampling at 25.5%. But does TxRace detect more data races than 25.5% sampling?

24 Recall Compared to Sampling
TxRace: less overhead + high recall. Recall = (reported real data races) / (total real data races). Spend 25.5%, get 47.2%.
This figure shows recall versus sampling rate for TSan. Recall is calculated by dividing the number of reported real data races by the total number of real data races; higher is better. Here we use unmodified TSan as the oracle, which reports the total number of real data races. The yellow triangle marks TxRace, and the blue line is TSan with sampling. TxRace detects the same number of data races as sampling 47.2% of all memory operations, while paying only the overhead of 25.5% sampling. In other words, we spend less but get more: TxRace does better than TSan with sampling.

25 Conclusion
TxRace: HTM-based fast path (most of the time) + SW-based slow path (on demand). Performance: 11.68x (TSan) → 4.65x (TxRace). Soundness: recall 0.95. Completeness: no false positives.
To conclude, TxRace is an efficient dynamic race detector that leverages HTM in commodity hardware; it reports no false positives while keeping overhead low. We addressed the HTM challenges with a two-phase data race detector. Its recall is high, and it is more cost-effective than naive sampling.

26 Q&A Thank you!

27 Performance Overhead: Transaction Overhead Is Low
This figure shows the overhead of the transactions alone, normalized to the original execution. The transaction overhead is low (1.16x). For swaptions and streamcluster the overhead is higher, because those applications have a large number of short transactions.

28 Performance Overhead: Handling Conflict Aborts
After handling conflict aborts using the slow path, the overhead becomes 2.73x. Overall, the overhead of conflict handling is higher than the pure transaction overhead.

29 Performance Overhead: Handling Capacity Aborts
This figure shows the overhead when we also handle capacity aborts. For some applications, we found that the cost of handling capacity aborts is even higher than the cost of handling conflict aborts.

30 Performance Overhead: Handling Unknown Aborts
This is the last breakdown figure, showing the overhead when unknown aborts are also handled. In most cases the transactions themselves have low overhead; most of the time is spent dealing with the different kinds of aborts, i.e., in the slow path. Next, let's see whether the detection capability is heavily compromised by this low overhead.

31 False Negatives: Transactions Finish and Escape Before Being Artificially Aborted
Example with three threads: one transaction aborts and writes TxFail (W(TxFail)); in-flight transactions that read TxFail (R(TxFail)) are aborted, but a conflicting transaction that has already committed escapes before the write and is missed.

