Computer Architecture Lab at Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry.


1 ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry
Computer Architecture Lab

2 Software Errors & Analysis Tools
Errors abundant in parallel software
–Program crashes/vulnerabilities, limited performance
Three main categories of analysis tools
–Checking before, during or after program execution
Instruction-grain Lifeguards
–Online detailed analysis, but with high overhead
–Several tools available, but mostly support for single-threaded code
ParaLog: a framework for efficient analysis of parallel applications
© Evangelos Vlachos ASPLOS '10 - ParaLog

3 Lifeguards and Parallel Applications
Timesliced Execution & Analysis
–DBI tools available today
–(-) high overhead due to serialization
Parallel Execution & Analysis: Butterfly Analysis (previous talk)
–Windows of uncertainty
–(-) some false positives
–(+) software-based
Parallel Execution & Analysis: ParaLog (this talk)
–Precise application order
–(-) new hardware required
–(+) no false positives
–(+) even better performance

4 Low-Overhead Instruction-level Analysis
Online monitoring platform [Chen et al., ISCA'08]
–Application core: application thread, event capturing, event stream (e.g., add r1 ← r2, r4)
–Lifeguard core: event delivery (e.g., add, r1, r2, r4), lifeguard thread, metadata
–Accelerators: IT, IF, MTLB
Example handler run by the lifeguard:
  add_handler() {
    i = load_state(r2);
    j = load_state(r4);
    if (check(i, j)) upd_state(r1);
    else error();
  }
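
The handler on this slide can be mimicked in a few lines of Python. This is only a sketch under assumptions that are not in the talk: the metadata is taken to be one "initialized" bit per register, `check` is a made-up policy requiring both sources to be initialized, and the `Event` record and `Lifeguard` class are hypothetical names. It shows how a delivered event record is dispatched to a handler that reads and updates metadata.

```python
from collections import namedtuple

# One record in the event stream, e.g. Event("add", "r1", "r2", "r4").
# The record layout is an assumption made for this sketch.
Event = namedtuple("Event", ["op", "dst", "src1", "src2"])


class Lifeguard:
    """Toy lifeguard: metadata is one 'initialized' bit per register."""

    def __init__(self, initialized):
        self.metadata = {reg: True for reg in initialized}
        self.errors = []
        self.handlers = {"add": self.add_handler}

    def load_state(self, reg):
        return self.metadata.get(reg, False)

    def upd_state(self, reg):
        self.metadata[reg] = True

    def check(self, i, j):
        # Hypothetical policy: both source registers must be initialized.
        return i and j

    def error(self, event):
        self.errors.append(event)

    def add_handler(self, event):
        # Same shape as the add_handler() pseudocode on the slide.
        i = self.load_state(event.src1)
        j = self.load_state(event.src2)
        if self.check(i, j):
            self.upd_state(event.dst)
        else:
            self.error(event)

    def deliver(self, event_stream):
        # Event delivery: dispatch each record to the handler for its opcode.
        for event in event_stream:
            self.handlers[event.op](event)


lifeguard = Lifeguard(initialized=["r2", "r4"])
lifeguard.deliver([
    Event("add", "r1", "r2", "r4"),  # both sources initialized, so r1 is marked
    Event("add", "r5", "r6", "r1"),  # r6 was never initialized, so this is reported
])
```

After the two events, `r1` is marked initialized and the second event is the only entry in `lifeguard.errors`.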

5 Challenges in Parallel Monitoring
Online parallel monitoring platform [ParaLog]
–Application threads 1..k: each with its own event capturing and event stream
–Lifeguard threads 1..k: each fed by event delivery from its application thread
–Global metadata
–Accelerators: IT, IF, MTLB

6 Addressing the Challenges
1.Application event ordering
2.Ensuring metadata access atomicity efficiently
3.Parallelizing hardware accelerators
Online parallel monitoring platform [ParaLog]
–Application threads 1..k: event capturing plus application-only order capturing
–Lifeguard threads 1..k: event delivery plus order enforcing
–Dependence arcs: captured on the application side, enforced on the lifeguard side
–Global metadata
–Accelerators: IT, IF, MTLB

7 Outline
Introduction
Addressing the Challenges of Parallel Monitoring
1.Capturing & enforcing application event ordering
2.Ensuring metadata access atomicity
3.Parallelizing hardware accelerators
Evaluation
Conclusions

8 Event Ordering: the Problem
Case Study: Information flow analysis (e.g., TaintCheck)
–Application time: thread j executes store(A), then thread k executes load(A)
–Lifeguard time: nothing orders st_handler(A) in lifeguard thread j before ld_handler(A) in lifeguard thread k
Expose happens-before information to lifeguards

9 Event Ordering: the Solution (1/2)
Coherence-based ordering of application events
–Similar to FDR, but online, focusing on application-only events
Application time: thread j executes store(A) at time t_j, then thread k executes load(A) at time t_k
–Dependence arc recorded for the load: {thread j, t_j}
Lifeguard time: each lifeguard thread publishes its progress (progress_j: t_j - 2, t_j - 1, t_j; progress_k: t_k - 2, t_k - 1, t_k)
–Before ld_handler(A), lifeguard thread k waits while progress_j < t_j, so st_handler(A) runs first
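
The "wait while progress_j < t_j" rule can be sketched with ordinary threads. The code below is an illustration, not the ParaLog hardware mechanism: `Progress`, `run_lifeguards` and the log tuple layout are hypothetical names, and it assumes a lifeguard publishes timestamp t only after it has finished handling event t.

```python
import threading


class Progress:
    """Per-lifeguard progress counters, published after each handled event."""

    def __init__(self, num_threads):
        self.counters = [0] * num_threads
        self.cond = threading.Condition()

    def publish(self, tid, timestamp):
        with self.cond:
            self.counters[tid] = timestamp
            self.cond.notify_all()

    def wait_for(self, tid, timestamp):
        # Blocks while progress[tid] < timestamp.
        with self.cond:
            self.cond.wait_for(lambda: self.counters[tid] >= timestamp)


def run_lifeguards(logs):
    """logs[tid] is a list of (timestamp, handler_name, arc) tuples, where
    arc is None or a (producer_tid, producer_timestamp) dependence arc."""
    progress = Progress(len(logs))
    order = []
    order_lock = threading.Lock()

    def lifeguard(tid):
        for timestamp, handler_name, arc in logs[tid]:
            if arc is not None:
                progress.wait_for(*arc)      # enforce the dependence arc
            with order_lock:
                order.append(handler_name)   # stands in for running the handler
            progress.publish(tid, timestamp)

    # Start the consumer first to show that the arc, not the thread start
    # order, decides which handler runs first.
    threads = [threading.Thread(target=lifeguard, args=(tid,))
               for tid in reversed(range(len(logs)))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return order


THREAD_J, THREAD_K = 0, 1
logs = {
    THREAD_J: [(1, "st_handler(A)", None)],
    THREAD_K: [(1, "ld_handler(A)", (THREAD_J, 1))],
}
order = run_lifeguards(logs)
```

Because lifeguard thread k blocks on the arc, `order` always lists `st_handler(A)` before `ld_handler(A)`, whichever thread the scheduler runs first.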

10 Event Ordering: the Solution (2/2)
Is monitoring coherence enough?
Previous work has not solved the problem of Logical Races
–Example: thread j calls free(A) while thread k executes load(A), a logical race on Metadata(A)
–A Conflict Alert message sent around free(A) (free(A) start, free(A) end) creates the dependence that orders ld_handler(A) in lifeguard thread k
Both logical races and system calls resolved with Conflict Alert messages

11 Metadata Atomicity
Frequent use of locking too expensive
–# of instructions added & synchronization cost
Dependence arcs handle the majority of the cases
–Sufficient conditions:
  1.One-to-one data-to-metadata mapping
  2.Application reads don't become metadata writes
–Enforcing dependence arcs ⇒ race-free operation
Rest of the cases handled by acquiring a lock
–Lock used only in the load_handler(); other handlers safe
(more details in the paper)
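
A small sketch of why only the load handler needs the lock. The lifeguard below is a hypothetical LockSet-style tool, chosen because it violates condition 2: an application read turns into a metadata read-modify-write. The class and field names are invented for illustration, and the lock-free `store_handler` is only safe under the stated assumption that the platform has already enforced the dependence arcs.

```python
import threading


class LockSetLikeLifeguard:
    """Toy lifeguard whose load handler writes metadata (a candidate lock set)."""

    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)
        self.candidates = {}                  # address -> candidate lock set
        self.metadata_lock = threading.Lock()

    def load_handler(self, addr, held_locks):
        # Slow path: concurrent application reads have no dependence arc
        # between them, so the read-modify-write of the metadata is locked.
        with self.metadata_lock:
            current = self.candidates.get(addr, self.all_locks)
            self.candidates[addr] = current & frozenset(held_locks)

    def store_handler(self, addr, held_locks):
        # Fast path, no lock: assumes enforced dependence arcs already order
        # this store against every conflicting access to addr.
        current = self.candidates.get(addr, self.all_locks)
        self.candidates[addr] = current & frozenset(held_locks)


lifeguard = LockSetLikeLifeguard(all_locks={"L1", "L2", "L3"})
held_by_thread = [{"L1", "L2"}, {"L2", "L3"}, {"L1", "L2", "L3"}]
threads = [threading.Thread(target=lifeguard.load_handler, args=("A", held))
           for held in held_by_thread]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Set intersection is commutative, so with the lock in place the candidate set for `A` ends up as `{"L2"}` regardless of how the three load handlers interleave.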

12 Parallel Hardware Accelerators
Speed-up frequent lifeguard actions
–Metadata-TLB: fast metadata address calculation
–Idempotent Filters: filter out redundant checking
–Inheritance Tracking: fast tracking of dataflow paths
Accelerators have only local view of the analysis
–Cache analysis information locally (e.g., frequent events)
–Important events have application-wide effects (e.g., free())
–Coherence-like issues with accelerators' local state
Important events accompanied by Conflict Alerts
–Use Conflict Alerts to flush accelerators' state

13 Outline
Introduction
Addressing the Challenges of Parallel Monitoring
–Capturing & enforcing application event ordering
–Ensuring metadata access atomicity
–Parallelizing hardware accelerators
Evaluation
Conclusions

14 Experimental Framework
Log-Based Architectures framework
–Simics full-system simulation
–CMP system with {2, 4, 8, 16} cores
–{1, 2, 4, 8} application threads, with as many lifeguard threads
–Sequentially Consistent memory model
Benchmarks and multithreaded Lifeguards used
–SPLASH-2 and PARSEC
–TaintCheck: Information flow tracking; accelerated by M-TLB, IT
–AddrCheck: Memory access checking; accelerated by M-TLB, IF
Comparison with Timesliced Monitoring

15 Performance Results: AddrCheck
[Chart: 8 app/lifeguard threads, 16 cores total; normalized to sequential, unmonitored execution]

16 Performance Results: AddrCheck

17 Performance Results: AddrCheck
[Chart data labels: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4]
Timesliced Monitoring is not scalable
On average 15x slowdown over No Monitoring (8 threads)

18 Performance Results: AddrCheck
Highest overhead with 8 threads: SWAPTIONS, 6x
Lowest overhead with 8 threads: < 5%
Average overhead with 8 threads: 26%

19 Performance Results: TaintCheck

20 Performance Results: TaintCheck
[Chart data labels: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7]
Timesliced Monitoring is not scalable
On average 23x slowdown over No Monitoring (8 threads)

21 Performance Results: TaintCheck
Highest overhead with 8 threads: BARNES, 2.6x
Lowest overhead with 8 threads: LU, 5%
Average overhead with 8 threads: 48%

22 Other Results in the Paper
Order capturing and order enforcing under TSO
Performance Impact of Lifeguard Accelerators
–AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]
A less expensive order capturing mechanism gets similar performance results
–1 timestamp per core vs. 1 timestamp per cache block

23 Conclusions
ParaLog: Fast and precise parallel monitoring
Components of event ordering
–Normal memory accesses: monitor coherence activity
–Logical Races: use of Conflict Alert messages
Metadata Atomicity
–Enforcing dependence arcs ensures atomicity (most cases)
Parallel Hardware Accelerators
–Flush local state on remote events (Conflict Alert)
Average overhead is relatively low
–AddrCheck: 26% and TaintCheck: 48% (8 threads)

24 Questions?

25 Backup Slides

26 Metadata Atomicity
Synchronization-free fast path vs. slow path
–Concurrent application reads; no ordering available!
  –Concurrent metadata reads: follow the fast path
  –Concurrent metadata writes: follow the slow path, acquiring a lock
  –Concurrent metadata read and write: read may get either value
–In any other case dependence arcs are available
[Table: Application Event (R, W) vs. Lifeguard Action (R, W) for AddrCheck, TaintCheck, MemCheck and LockSet]

27 Parallel Hardware Accelerators
Accelerators have only local view of the analysis
–Important events have system-wide effects
–Case study: Idempotent Filters and AddrCheck
[Diagram: read events R(A), R(B), R(C) pass through the IF of lifeguards LG 0 and LG 1; ✔ = delivered to lifeguard, ✖ = redundant, discarded; free(A) flushes local and remote IF filters]
Builds on Remote Conflict Messages
Details for parallel M-TLB and IT can be found in the paper
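
The case study can be sketched in software as follows. This is a simplified model, not the hardware design: the filter is an unbounded set rather than a small cache, and `IdempotentFilter`, `Lifeguard` and `on_free` are hypothetical names. The point it illustrates is that a `free(A)` seen by one lifeguard must flush the filters of all lifeguards, otherwise a later read of A elsewhere would be discarded as redundant and never checked.

```python
class IdempotentFilter:
    """Discards events that were already delivered since the last flush."""

    def __init__(self):
        self.seen = set()

    def admit(self, event):
        if event in self.seen:
            return False          # redundant; discarded
        self.seen.add(event)
        return True               # delivered to lifeguard

    def flush(self):
        self.seen.clear()


class Lifeguard:
    def __init__(self):
        self.filter = IdempotentFilter()
        self.delivered = []

    def on_read(self, addr):
        event = ("R", addr)
        if self.filter.admit(event):
            self.delivered.append(event)


def on_free(addr, local, lifeguards):
    # Models the Conflict Alert: flush local and remote IF filters, then
    # deliver the free event to the lifeguard that observed it.
    for lifeguard in lifeguards:
        lifeguard.filter.flush()
    local.delivered.append(("free", addr))


lg0, lg1 = Lifeguard(), Lifeguard()
for addr in ["A", "B", "A"]:
    lg0.on_read(addr)             # the second R(A) is filtered out
lg1.on_read("A")
lg1.on_read("A")                  # filtered out
on_free("A", local=lg0, lifeguards=[lg0, lg1])
lg1.on_read("A")                  # delivered again because the filter was flushed
```

In this run LG 1 receives `R(A)` twice, once before and once after the free, so an AddrCheck-style lifeguard gets the chance to flag the access to freed memory.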

28 Performance Impact of Lifeguard Accelerators
[Chart data labels: 9.4, 6.8, 7.3, 11.3]
Accelerators provide a major speedup [2x – 9x]

29 Performance Impact of Lifeguard Accelerators
Accelerators provide a major speedup [1.13x – 3.4x]

30 Transitive Reduction Sensitivity Study
Limited transitive reduction
–No major performance impact; savings in chip area

31 Supporting Total Store Order (TSO)
Cycle of dependencies in relaxed memory models
–TSO relaxes the RAW ordering
–Previous work (RTR): maintain versions of data
–Identify SC offending instructions; save loaded value
This paper: maintain versions of metadata
Example (commit order 0, 1, 2)
–Thread 0: Wr(A), Rd(B); Thread 1: Wr(B), Rd(A)
–Log 0: Wr(A), Rd(B, v_0); Log 1: Wr(B), Rd(A, v_1)
–Version annotations: P(v_1, A), C(v_0, B), P(v_0, B), C(v_1, A)
–Lifeguard 0: produce_version(v_1, A); store_handler(A); wait_until_available(v_0, B); load_handler(B, v_0)
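
The metadata-versioning idea can be tried out with a single-threaded simulation. Everything below is an illustrative assumption rather than the paper's design: the `VersionedMetadata` class, the log tuples, the "clean"/"tainted" values and the priority-based scheduler are all hypothetical. The sketch saves the old metadata before each store handler overwrites it, so a load that the application performed before the remote store still sees the old metadata, even though the two threads form a dependence cycle.

```python
class VersionedMetadata:
    def __init__(self, initial):
        self.current = dict(initial)   # address -> current metadata
        self.versions = {}             # (version, address) -> saved metadata

    def produce_version(self, version, addr):
        # Snapshot the metadata before a store handler overwrites it.
        self.versions[(version, addr)] = self.current[addr]

    def is_available(self, version, addr):
        return (version, addr) in self.versions

    def store_handler(self, addr, value):
        self.current[addr] = value

    def load_handler(self, addr, version):
        # Consume the saved version instead of the current metadata.
        return self.versions[(version, addr)]


def run(logs, priority):
    """Run the lifeguard logs one operation at a time, always preferring the
    first lifeguard in `priority` that is not blocked on a missing version."""
    meta = VersionedMetadata({"A": "clean", "B": "clean"})
    position = {lg: 0 for lg in logs}
    loaded = {}
    while any(position[lg] < len(logs[lg]) for lg in logs):
        for lg in priority:
            if position[lg] == len(logs[lg]):
                continue
            op, addr, version = logs[lg][position[lg]]
            if op == "wait" and not meta.is_available(version, addr):
                continue                      # blocked; try the next lifeguard
            if op == "produce":
                meta.produce_version(version, addr)
            elif op == "store":
                meta.store_handler(addr, "tainted")
            elif op == "load":
                loaded[lg] = meta.load_handler(addr, version)
            position[lg] += 1
            break
        else:
            raise RuntimeError("deadlock: every lifeguard is blocked")
    return loaded


logs = {
    0: [("produce", "A", "v1"), ("store", "A", None),
        ("wait", "B", "v0"), ("load", "B", "v0")],
    1: [("produce", "B", "v0"), ("store", "B", None),
        ("wait", "A", "v1"), ("load", "A", "v1")],
}
result_0_first = run(logs, priority=[0, 1])
result_1_first = run(logs, priority=[1, 0])
```

Both schedules complete without deadlock, and both loads observe the "clean" metadata that was saved before the stores, matching an application execution in which each read returned the old value.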

32 Parallel Hardware Accelerators
Speed-up frequent lifeguard actions
–Fast metadata address calculation: Metadata-TLB
–Fast tracking of data-flow paths: Inheritance Tracking
–Filter out redundant checking: Idempotent Filters
  –Per-instruction checking gives the same result; cache event
Accelerators have only local view of the analysis
–Important events have system-wide effects (e.g., free())
–Coherence-like issues with accelerators' local state
Important events accompanied by Conflict Alerts
–Use Conflict Alerts to flush state and deliver pending events

33 Experimental Framework
Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    Simlarge
fluidanimate    Simlarge
swaptions       Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, in-order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
–TaintCheck: Information flow tracking; accelerated by M-TLB and IT
–AddrCheck: Memory access checking; accelerated by M-TLB and IF

34 Relative Slowdown - TaintCheck

35 Relative Slowdown - AddrCheck
[Chart data labels: 3.0, 6.0]

36 Performance Results - AddrCheck
[Chart data labels: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4]

37 Performance Results - TaintCheck
[Chart data labels: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7]

38 Parallel Hardware Accelerators
Speed-up frequent lifeguard actions
–Metadata-TLB & Inheritance Tracking (discussed in the paper)
–Idempotent Filters: identify and filter out redundant checking
  –Per-instruction checking gives the same result
  –Cache incoming event and local state to identify redundancy
Accelerators have only local view of the analysis
–Important events have application-wide effects (e.g., free())
–Coherence-like issues with accelerators' local state
Important events accompanied by Conflict Alerts
–Use Conflict Alerts to flush accelerators' state

