A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

Presentation transcript:

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay
Min Xu, Rastislav Bodik, Mark D. Hill
June 9th, 2003
– Software bugs cost time & money
– Hardware is getting cheaper
– Use hardware to aid software debugging?

Xu et al., ISCA'03: Flight Data Recorder

Slide 2: Brief Overview
Approach: Full-system Record-Replay
– Add H/W “Flight Data Recorder”
– Target cache-coherent multiprocessor server
– Enables S/W deterministic replay
Full-system Evaluation: Low Overhead
– Piggyback on coherence protocol: little extra H/W
– Non-trivial recording interval: 1 second
– Negligible runtime overhead: less than 2%
– Can be “Always On”

Slide 3: Outline
Overview
– Why Deterministic Replay?
– The Debugging Scenario
– The Solution
Recording Multithreading
Recording System State & I/O
Evaluation
Conclusions
[Side annotations: “Efficient Recording”; “With full-system commercial workloads”]

Slide 4: Why Deterministic Replay?
Software Bugs Happen in the Field
– Differences between development & deployment
– Data races (Web server, Database)
– I/O interactions (OS, Device Driver)
Debugging Usually Happens in the Lab
– Need to replay the buggy execution
Use Core Dump?
– Captures the final application state
– Not enough for “race” bugs
Need Better “Core Dump”
– Enable faithful replay of the execution prior to the failure

Slide 5: The Debugging Scenario
[Diagram: on a four-processor system (P1, P2, P3, P4) the recorder takes Checkpoints A, B, C and stores logs A, B, C. At the crash it dumps “core”. The replayer reads Checkpoint B and replays from logs B and C.]

Slide 6: The Solution
Online Recorder (focus of this work)
– Like airplane flight data recorder
– “Always on” even on deployed system
– H/W based (no change to S/W)
  Transparent to S/W
  Minimal performance impact
Offline Replayer (not emphasized in this work)
– Post-mortem replay of pre-crash execution
– Possibly on a different machine off-site
– Based on existing technology, i.e., Simics full-system simulator

Slide 7: Outline
Overview
Recording Multithreading
– What to record?
– An example
– Practical recorder hardware
Recording System State & I/O
Evaluation
Conclusions
[Side annotation: “Efficient Recording”]

Slide 8: What to Record?
Multithreading Problem
– Record order of instruction interleaving
Assume Sequential Consistency (SC)
– Accesses (appear to have) total order

Slide 9: Previous Record-Replay Approaches
InstantReplay ’87
– Record order of memory accesses
– Overhead may affect program behavior
Netzer ’93
– Record optimal trace
– Too expensive to keep track of all memory locations
Bacon & Goldstein ’91
– Record memory bus transactions with hardware
– High logging bandwidth
RecPlay ’00
– Record only synchronizations
– Not deterministic if the program has data races

Slide 10: Our Approach
Uses existing cache coherence hardware
– Low overhead, does not affect program behavior
– Works for programs with races
– Adapts Netzer’s algorithm in hardware
– Records only synchronization if the program is data-race free
An Example
– Progressively refine the recording algorithm

Slide 11: Example: Record SC Order
Processor i        Processor j
4  Flag=1
5  X1:=5           15  $r1:=Flag
6  X2:=6           16  Bneq $r1,$r0,-1
7  Flag:=0         17  Nop
                   18  $r1:=Flag
                   19  Bneq $r1,$r0,-1
                   20  Nop
                   21  Y:=X1
                   22  Z:=X2

Slide 12: Example: Record SC Order
(Same two-processor code as slide 11.)
Arcs: i:4 → j:15, j:15 → i:5, i:5 → j:16, j:16 → i:6, i:6 → j:17, j:17 → i:7, i:7 → j:18
– Need to add processor instruction count (IC)
– The very same interleaving is recorded, but …

Slide 13: Example: Record Word Conflict Order
(Same two-processor code as slide 11.)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22
– Recording just word conflicts can enable deterministic replay
– Hard to remember word accesses, and too many arcs …

Slide 14: Example: Record Block Conflict Order
(Same two-processor code as slide 11.)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22

Slide 15: Example: Record Block Conflict Order
(Same two-processor code as slide 11.)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22
New arc: i:6 → j:21
– Need to remember last accessing IC in the cache
– But, can we do better?

Slide 16: Example: Apply Transitive Reduction
(Same two-processor code as slide 11.)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22, i:6 → j:21

Slide 17: Example: Apply Transitive Reduction
(Same two-processor code as slide 11.)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22, i:3 → j:21
– Three arcs!
– No need to know syncs
– Automatically records only synchronization for race-free programs
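The reduction in this example can be checked with a small sketch. It assumes the four block-level arcs read off slides 13 to 15 (i:4 → j:15, j:15 → i:7, i:7 → j:18, i:6 → j:21), and it only tests direct implication between the same pair of processors, which is enough for a two-processor example. The function names `implied` and `reduce_arcs` are illustrative and do not come from the slides.

```python
# Illustrative sketch; names are hypothetical, not from the slides.
# An arc is ((src_proc, src_ic), (dst_proc, dst_ic)).

def implied(arc, kept):
    """True if an already-kept arc between the same two processors
    starts no earlier (at the source) and ends no later (at the
    destination) than `arc`; program order then implies `arc`."""
    (sp, sic), (dp, dic) = arc
    return any(
        ksp == sp and kdp == dp and ksic >= sic and kdic <= dic
        for (ksp, ksic), (kdp, kdic) in kept
    )

def reduce_arcs(arcs):
    """Keep only arcs that are not implied by arcs kept earlier."""
    kept = []
    for arc in arcs:
        if not implied(arc, kept):
            kept.append(arc)
    return kept

# Block-conflict arcs of the running example, in the order they occur.
arcs = [
    (("i", 4), ("j", 15)),
    (("j", 15), ("i", 7)),
    (("i", 7), ("j", 18)),
    (("i", 6), ("j", 21)),  # implied: i:6 precedes i:7, j:18 precedes j:21
]
kept = reduce_arcs(arcs)
print(kept)
```

Under these assumptions the last arc is dropped because i:7 → j:18 already orders i:6 before j:21, leaving the three arcs the slide mentions.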

Slide 18: Practical Recorder Hardware
Processor
– Instruction count: 4 bytes per processor
Cache
– Last-access instruction count: 6.25% space overhead
Coherence Controller
– Vector of instruction counters: 3×4 bytes per processor for a 4-way multiprocessor
Finite Cache, Out-of-Order, Prefetch, etc.
– Recorder still applicable
– Details in the paper
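The overhead figures on this slide are consistent with simple arithmetic, assuming one 4-byte instruction count per 64-byte cache block. The block size is not stated in the transcript and is an assumption here; P denotes the number of processors.

```latex
\[
\frac{4\ \text{bytes (IC)}}{64\ \text{bytes (block, assumed)}} = 0.0625 = 6.25\%,
\qquad
(P-1)\times 4\ \text{bytes} = 3 \times 4\ \text{bytes} = 12\ \text{bytes for } P = 4 .
\]
```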

Slide 19: Further Details
At each processor j:
– IC = inst count of last committed inst by j
– VIC[P] = latest IC received from each processor
– CIC[M] = IC of last load/store of block b in j’s L1
On commit, IC++; if load(b) then CIC[b] = IC
i sends arc start (i, CIC_i[b]) on coherence reply
On coherence reply receive, arc end is (j, IC+1)
– If CIC_i[b] > VIC[i] then log arc; VIC[i] = CIC_i[b]
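The per-processor rules above can be sketched in software as follows. This is a minimal model, not the hardware design: the class name `Recorder`, its method names, and the hand-driven coherence replies are assumptions made for illustration, and CIC is updated on every memory access (load or store), following the slide's definition of CIC. The driver replays the Flag/X1/X2 example with X1 and X2 in one block.

```python
# Minimal software model of the per-processor recorder state (a sketch).
class Recorder:
    def __init__(self, pid, procs, ic=0):
        self.pid = pid
        self.ic = ic                                    # IC: last committed instruction
        self.vic = {p: 0 for p in procs if p != pid}    # VIC: latest IC seen from each peer
        self.cic = {}                                   # CIC: block -> IC of last access
        self.log = []                                   # logged arcs

    def commit(self, block=None):
        """Commit one instruction; `block` is set for memory accesses."""
        self.ic += 1
        if block is not None:
            self.cic[block] = self.ic

    def reply(self, block):
        """Arc start piggybacked on a coherence reply for `block`."""
        return (self.pid, self.cic.get(block, 0))

    def receive(self, arc_start):
        """On a coherence reply, log the arc unless VIC already covers it."""
        src, src_ic = arc_start
        if src_ic > self.vic[src]:
            self.log.append(((src, src_ic), (self.pid, self.ic + 1)))
            self.vic[src] = src_ic

# Drive the example by hand: "F" is Flag's block, "X" holds X1 and X2.
i = Recorder("i", ["i", "j"], ic=3)
j = Recorder("j", ["i", "j"], ic=14)
i.commit("F")                               # i:4  writes Flag
j.receive(i.reply("F")); j.commit("F")      # j:15 reads Flag
i.commit("X"); i.commit("X")                # i:5, i:6 write X1, X2
j.commit(); j.commit()                      # j:16, j:17
i.receive(j.reply("F")); i.commit("F")      # i:7  writes Flag
j.receive(i.reply("F")); j.commit("F")      # j:18 reads Flag
j.commit(); j.commit()                      # j:19, j:20
j.receive(i.reply("X")); j.commit("X")      # j:21 reads X1: start (i,6) is ignored
j.commit("X")                               # j:22 reads X2 (cache hit)
print(i.log, j.log)
```

In this model the reply for block X carries (i, 6), which is not greater than VIC[i] = 7 at processor j, so no fourth arc is logged.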

Slide 20: Example Transitive Reduction
(Same two-processor code as slide 11.)
Logged arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18
[Diagram labels: IC, LOG, VIC i: 7, VIC j: 15, CIC {x1,x2}: 6]
– Arc start (i, 6) arrives: 6 < 7, so ignore
– Coherence traffic?

Slide 21: Relaxation
– Exact: coherence reply sends (i, CIC[b])
– Safe: send (i, x) for any x: CIC[b] ≤ x ≤ IC
– Exact: on receive, add arc (x, IC’+1)
– Safe: add arc (x, y) for any y: IC+1 ≤ y ≤ IC’+1
[Timeline diagram: x lies between CIC[b] and IC; y lies between IC+1 (e.g., IC+1) and IC’+1 (read b). Labels: Finite Caches, Speculative Processors]

Slide 22: Outline

Slide 23: SafetyNet Checkpoint Hardware
Problem
– To beginning of “replay” interval
– Logically take a snapshot of the system
Solution
– Adapt SafetyNet [Sorin et al. ISCA '02]
  Processor: checkpointing
  Memory: incremental logging
  Slightly modified for longer interval

Slide 24: Recording I/O
Interrupts
– Not exceptions
– Record interrupt type & IC
Instruction I/O
– Load: record values
– Store: ignored
DMA
– Record input values
– Record ordering: as pseudo thread
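One way to picture the three I/O logs is the sketch below. The record layouts, the class name `IOLog`, and the idea of giving the DMA pseudo thread its own counter are assumptions for illustration; the slides only state what is recorded, not how it is encoded.

```python
from dataclasses import dataclass, field

@dataclass
class IOLog:
    """Hypothetical in-memory layout of the I/O logs (a sketch)."""
    interrupts: list = field(default_factory=list)  # (ic, interrupt_type)
    io_loads: list = field(default_factory=list)    # (ic, value)
    dma: list = field(default_factory=list)         # (dma_ic, address, data)
    dma_ic: int = 0                                 # counter of the DMA pseudo thread (assumed)

    def on_interrupt(self, ic, kind):
        # Interrupts are asynchronous, so type and delivery point (IC) are logged.
        self.interrupts.append((ic, kind))

    def on_io_load(self, ic, value):
        # Values returned by I/O loads cannot be recomputed during replay.
        self.io_loads.append((ic, value))

    def on_io_store(self, ic, value):
        # I/O stores are regenerated by re-execution, so nothing is logged.
        pass

    def on_dma_write(self, address, data):
        # DMA input is logged; its ordering is tracked like another thread.
        self.dma_ic += 1
        self.dma.append((self.dma_ic, address, data))

log = IOLog()
log.on_interrupt(ic=1200, kind="timer")
log.on_io_load(ic=1234, value=0x2A)
log.on_io_store(ic=1240, value=0xFF)
log.on_dma_write(address=0x8000, data=b"\x01\x02")
print(log)
```

Exceptions do not appear in this sketch because they are synchronous and recur naturally when the same instructions are replayed.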

Slide 25: Outline
Overview
Recording Memory Races
Recording System State & I/O
Evaluation
– An example system
– Simulation methods
– Runtime, log size
Conclusions
[Side annotation: “With full-system commercial workloads”]

Slide 26: Target System
Commercial Server H/W
– Sequentially consistent CC-NUMA
– Full I/O: Interrupt, DMA, etc.
– Simulation system (Simics + Memory Simulator)
  4-way in-order issue, 1 GHz, 4 processors
  128KB I/D L1, 4MB L2, MOSI directory protocol
Commercial Server S/W
– Unmodified commercial server benchmarks
  Apache, Slash, SPEC JBB, OLTP

Slide 27: An Example System
[Block diagram of a CC-NUMA MP node: Core, Cache(s), Cache Controller, Directory, Memory Banks, DMA Interface, Data Compressor (LZ77), Recorder Memory. Recorded streams: DMA Content & Order; Interrupts, I/O; Cache Checkpoint; Memory Races; Memory Checkpoint.]

Slide 28: Runtime Overhead
Slowdown
– Less than 2%
– Statistically insignificant for 2 workloads
– No problem “always on”
Slowdown causes
– Extra traffic
– Stall by buffer overflow
– More blocking
– Extra coherence message on some get-shared’s
[Chart: Runtime per Transaction (normalized to base system) for OLTP, JBB, APACHE, SLASH]
OLTP: database transactions (TPC-C on DB2)
JBB: server-side Java benchmark that models a 3-tier system
APACHE: static web server
SLASH: dynamic web server (slashdot message posting)

Slide 29: Log Size
1 – 1.33 Second Recording
– Buffer: 35 MB (7%); Bandwidth: 25 MB/Second/Processor
Efficient Race Log
– Longer recording is possible with better checkpoint scheme
Longer Recording
– Using disk can get longer replay: 320 GB disk = ~3 hours recording
[Charts: Log Size (MB/Second/Processor), uncompressed and compressed, for OLTP, JBB, APACHE, SLASH. Components: Interrupt, Input, DMA Log; Races Log; Checkpoint Log]

Slide 30: Conclusion
Low Overhead Deterministic Replay
– Piggyback MP cache coherence hardware
– Modest extra hardware
– Modest overhead (less than 2% slowdown)
Minimal race recording with transitive reduction
Full-system Deterministic Replay
– Evaluated with commercial workloads
– Full-system recording (including OS, I/O)

Slide 31: Thank You. Questions?

Slide 32: Flight Data Recorder vs. ReEnact
                                    Flight Data Recorder   ReEnact
Target System                       CC-NUMA                TLS
Deterministic Replay?               Yes                    Yes*
Race-detection?                     No**                   Yes
Effective Interval (instructions)   >100,000,000           <100,000
Slowdown                            <2%                    Avg 5.8%
OS, I/O                             Yes                    No (extendable?)
Active during OS & I/O?             Yes                    No
* Need to disable TLS?
** Not in the recorder, but in the replayer

Slide 33: Scalability
More processors, larger race log
– Not a quadratic increase
– e.g., going from 4 to 16 processors gives 2x more log
Real systems have more I/O
– But also more memory available for the log

Slide 34: Protocol Changes
Get IC count from source processor
– W → R: Piggyback IC count on DataResponse msg
– W → W: Piggyback IC count on DataResponse msg
– R → W: Piggyback IC count on InvalidateAck msg
Cache block Writeback
– Snooping protocol
  Eager IC update
  Extra messages on interconnect
  Not on critical path
– Directory-based protocol
  Lazy IC update
  Extra latency for cache misses

Slide 35: Replayer (Full-system Simulator)
Input data to the replayer
– Checkpoint
– Execution log
– DMA log
– I/O log
– Exception log
Replay the execution
– Load system checkpoint: registers, TLB, etc.
– Replay the MP execution order in partial order
– Replay the I/O and exceptions
– Proper device model needed to interrupt system output
– Memory inspection support
– Step forward/backward (enhanced debugger features)
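Replaying “in partial order” can be illustrated with a small scheduler that picks any interleaving consistent with program order and the logged arcs. This is a sketch under assumptions: the function `replay_order` and its arguments are hypothetical, it works on instruction counts only, and it does not model Simics or any real replayer interface.

```python
def replay_order(start, end, arcs):
    """Return one interleaving of (proc, ic) steps that respects program
    order and every logged arc ((src_proc, src_ic), (dst_proc, dst_ic)).
    start/end map each processor to its first/last IC to replay."""
    waits = {}
    for src, dst in arcs:
        waits.setdefault(dst, []).append(src)
    done = {p: start[p] - 1 for p in start}    # last IC replayed per processor
    order = []
    progress = True
    while progress:
        progress = False
        for p in sorted(start):                # round-robin over processors
            nxt = done[p] + 1
            if nxt > end[p]:
                continue
            # An instruction may run once all its arc sources have completed.
            if all(done[sp] >= sic for sp, sic in waits.get((p, nxt), [])):
                order.append((p, nxt))
                done[p] = nxt
                progress = True
    if any(done[p] < end[p] for p in start):
        raise ValueError("arcs cannot be satisfied (cycle or missing source)")
    return order

# The three arcs kept after transitive reduction in the running example.
arcs = [(("i", 4), ("j", 15)), (("j", 15), ("i", 7)), (("i", 7), ("j", 18))]
order = replay_order({"i": 4, "j": 15}, {"i": 7, "j": 22}, arcs)
print(order)
```

Any schedule produced this way yields the same values for conflicting accesses, which is why the replayer needs only the partial order rather than the original total interleaving.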

Slides 36–41: Example: False Sharing
(Animation sequence over the same code.)
P1                 P2
31  Flag=1         14  Private2:=2
32  X1:=5          15  $r1:=Flag
33  X2:=6          16  Bneq $r1,$r0,-1
34  Flag:=0        17  Nop
35  Private1:=3    18  $r1:=Flag
                   19  Bneq $r1,$r0,-1
                   20  Nop
                   21  Y:=X1
                   22  Z:=X2
Log entries shown on all slides: 15(P1,31), 34(P2,15), 18(P1,34), 21(P1,32), 22(P1,33)
Log entries added from slide 38 on: 21(P1,33), 35(P2,14)