Programming, Debugging, Profiling and Optimizing Transactional Memory Applications Department of Computer Architecture Universitat Politècnica de Catalunya.

1 Programming, Debugging, Profiling and Optimizing Transactional Memory Applications
Ferad Zyulkyarov
PhD Thesis Proposal
Department of Computer Architecture, Universitat Politècnica de Catalunya – BarcelonaTech
Barcelona Supercomputing Center
01 July 2010

2 Publications
– Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero. Discovering and Understanding Performance Bottlenecks in Transactional Applications. PACT'10.
– Ferad Zyulkyarov, Tim Harris, Osman Unsal, Adrian Cristal, Mateo Valero. Debugging Programs that use Atomic Blocks and Transactional Memory. PPoPP'10.
– Vladimir Gajinov, Ferad Zyulkyarov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero. QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory. ICS'09.
– Ferad Zyulkyarov, Vladimir Gajinov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero. Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server. PPoPP'09.
– Ferad Zyulkyarov, Sanja Cvijic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero. WormBench - A Configurable Workload for Evaluating Transactional Memory Systems. MEDEA'09.
– Ferad Zyulkyarov, Milos Milovanovic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero. Memory Management for Transaction Processing Core in Heterogeneous Chip-Multiprocessors. OSHMA'09.
– Milos Milovanovic, Osman Unsal, Adrian Cristal, Ferad Zyulkyarov, Mateo Valero. Compiler Support for Using Transactional Memory in C/C++ Applications. INTERACT'07.

3 Work Plan
[Timeline chart. Durations shown: 12m, 11m, 21m, 10m, 15m, 9.5m, 7m, 2m. Date marker: 01/10/2010]

4 Transactional Memory
    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
        ...
    }

5 The Big Questions
– Is programming with TM easy?
– Is TM competitive with locks?
– Are existing development tools sufficient?

6 Atomic Quake
Parallel Quake game server
– All locks are replaced with atomic blocks
27,400 LOC of C code in 56 files
Rich transactional application
– 63 atomic blocks
– Rich uses of atomic blocks: library calls, I/O, error handling, memory allocation, failure atomicity
– Various transactional characteristics
A workload to drive research in TM

7 Is programming with TM easy?
Yes, in large applications that have many shared objects and need efficient fine-grained synchronization.
– Example: region-based locking in tree and graph data structures.

8 Where Do Transactions Fit?
Guarding different types of objects with separate locks.

With locks:
    switch (object->type) {          /* Lock phase */
    case KEY:    lock(key_mutex);    break;
    case LIFE:   lock(life_mutex);   break;
    case WEAPON: lock(weapon_mutex); break;
    case ARMOR:  lock(armor_mutex);  break;
    }

    pick_up_object(object);

    switch (object->type) {          /* Unlock phase */
    case KEY:    unlock(key_mutex);    break;
    case LIFE:   unlock(life_mutex);   break;
    case WEAPON: unlock(weapon_mutex); break;
    case ARMOR:  unlock(armor_mutex);  break;
    }

With an atomic block:
    atomic {
        pick_up_object(object);
    }

9 Is TM Competitive with Locks?
No.
– 4-5x slowdown on the single-threaded version.
But it promises to become competitive because it scales well.

Threads  Transactions  Aborts (Num)  Aborts (%)  Irrevocable
1        36 667        0             0.00%       17
2        75 824        241           0.42%       31
4        166 000       2 612         1.58%       85
8        477 519       76 771        25.50%      237

Scales OK up to 4 threads.
Sudden increase in aborts.

10 Are Existing Tools Sufficient?
No. We need:
– Richer language level primitives and integration.
– Mechanisms to handle I/O.
– Dynamic error handling.
– Debuggers.
– Profilers.

11 Unstructured Use of Locks

Locks:
    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        LOCK(cl_msg_lock[c - svs.clients]);
        if (!c->send_message) {
            UNLOCK(cl_msg_lock[c - svs.clients]);
            continue;
        }
        if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
            UNLOCK(cl_msg_lock[c - svs.clients]);
            continue;
        }
        if (c->state == cs_spawned) {
            if (frame_threads_num > 1) LOCK(par_runcmd_lock);
            if (frame_threads_num > 1) UNLOCK(par_runcmd_lock);
        }
        UNLOCK(cl_msg_lock[c - svs.clients]);
    }

Atomic Block:
    bool first_if = false;
    bool second_if = false;
    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        atomic {
            if (!c->send_message) {
                first_if = true;
            } else {
                if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
                    second_if = true;
                } else {
                    if (c->state == cs_spawned) {
                        if (frame_threads_num > 1) {
                            atomic {
                            }
                        } else {
                            ;
                        }
                    }
                }
            }
        }
        if (first_if) {
            ;
            first_if = false;
            continue;
        }
        if (second_if) {
            ;
            second_if = false;
            continue;
        }
    }

Problems with the atomic block version:
– Extra variables and code
– Complicated conditional logic
Solution: explicit "commit"

12 Various Transactional Characteristics
Per-atomic block runtime statistics from Atomic Quake.

            Dynamic Length (CPU Cycles)                 Read Set (Bytes)                     Write Set (Bytes)
ID  TX#     Total        Min      Max        Avg        Total       Min    Max     Avg       Total      Min    Max     Avg
56  26,962  172,872,572  288      112,832    6,412      1,328,536   20     104     49        0          0      0       0
60  5,931   5,810,152    224      41,552     980        76,212      12     640     13        928        0      116     0
61  1,095   20,573,540   4,560    49,984     19,208     723,474     88     776     661       9084
59  1,042   3,117,844    1,520    39,344     2,999      29,176      528                      16,672     16
57  1,038   401,502,152  288,704  522,528    387,552    10,963,719  7,614  15,490  10,562    2,592,367  1,680  3,656   2,497
58  1,002   134,949,344  87,056   1,341,504  134,949    5,054,282   3,028  53,566  5,044     931,445    548    11,161  930
15  3       67,660       720      48,176     1,735      96          32                       18         6      6       6
5   2       99,988       592      36,384     1,923      64          32                       10         5      5       5
22  2       43,632       12,176   35,504     21,816     72          36                       128        64
36  2       40,476       6,800    44,880     20,238     249         108    141     125       55         22     33      28
38  2       71,368       2,144    31,504     4,461      90          44     46      45        26         12     14      13

Observations:
– Very small transactions
– Very large transactions
– Different execution frequency -> phased behavior
– Control flow does not reach all atomic blocks
– Most frequent atomic block is read-only

13 Debugging Transactional Applications
Existing debuggers are not aware of atomic blocks and transactional memory.
New principles and approaches:
– Debugging atomic blocks atomically
– Debugging at the level of transactions
– Managing transactions at debug-time
Extension for WinDbg to debug programs with atomic blocks.

14 Atomicity in Debugging
Step over atomic blocks as if they were a single instruction.
Abstracts whether atomic blocks are implemented with TM or lock inference.
Good for debugging synchronization errors at the granularity of atomic blocks rather than individual statements inside the atomic blocks.
[Figure: stepping through atomic { } in a non-TM-aware debugger vs. a TM-aware debugger]
Debugging becomes frustrating when a transaction aborts.

15 Isolation in Debugging
What if we want to debug wrong code within an atomic block?
– Put a breakpoint inside the atomic block.
– Validate the transaction.
– Step within the transaction. The user does not observe intermediate results of concurrently running transactions.
– Switch the transaction to irrevocable mode after validation.
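The validate-then-irrevocable step can be sketched in C. This is a minimal illustration under stated assumptions, not the debugger's actual implementation: it assumes a version-based read set, and the names `read_entry`, `dbg_tx`, `tx_validate`, and `on_breakpoint` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* One read-set entry: the version word of a location and the value
 * of that version at the time the transaction read the location. */
struct read_entry {
    const unsigned *version;
    unsigned seen;
};

/* Hypothetical transaction descriptor as seen by the debugger. */
struct dbg_tx {
    struct read_entry reads[8];
    size_t n_reads;
    bool irrevocable;
};

/* The transaction is valid if no location it read has changed version. */
static bool tx_validate(const struct dbg_tx *tx) {
    for (size_t i = 0; i < tx->n_reads; i++)
        if (*tx->reads[i].version != tx->reads[i].seen)
            return false;
    return true;
}

/* Called when a breakpoint inside an atomic block is hit.
 * Returns false if the transaction is stale and should be re-executed
 * before the user starts stepping. */
static bool on_breakpoint(struct dbg_tx *tx) {
    if (!tx_validate(tx))
        return false;
    tx->irrevocable = true;   /* from here on the transaction cannot abort */
    return true;
}
```

Validating first means the user never steps through a transaction that is already doomed; switching to irrevocable mode then keeps it from aborting in the middle of the debugging session.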

16 Debugging at the Level of Transactions
Assumes that atomic blocks are implemented with transactional memory.
Examine the internal state of the TM
– Read/write set, re-executions, status
TM-specific watchpoints
– Break when a conflict happens
– Filters
Concurrent work with Herlihy and Lev [PACT'09].
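The internal state listed on this slide (read/write set, re-executions, status) can be pictured as a small transaction descriptor that a debugger would display. The sketch below assumes a hypothetical word-based STM; `tx_desc`, `tx_begin`, `tx_read`, `tx_write`, and `tx_restart` are illustrative names, not the API of the TM system used in this work.

```c
#include <assert.h>
#include <stddef.h>

#define TX_SET_MAX 16   /* fixed capacity, chosen only for the sketch */

enum tx_status { TX_ACTIVE, TX_ABORTED, TX_COMMITTED };

struct tx_desc {
    enum tx_status status;
    unsigned reexecutions;               /* times the atomic block was restarted */
    const void *read_set[TX_SET_MAX];    /* addresses read so far */
    size_t n_reads;
    const void *write_set[TX_SET_MAX];   /* addresses written so far */
    size_t n_writes;
};

/* Start (or restart) the transaction with empty read and write sets. */
static void tx_begin(struct tx_desc *tx) {
    tx->status = TX_ACTIVE;
    tx->n_reads = 0;
    tx->n_writes = 0;
}

/* Add an address to a set unless it is already there. */
static void set_add(const void **set, size_t *n, const void *addr) {
    for (size_t i = 0; i < *n; i++)
        if (set[i] == addr)
            return;
    assert(*n < TX_SET_MAX);
    set[(*n)++] = addr;
}

static void tx_read(struct tx_desc *tx, const void *addr) {
    set_add(tx->read_set, &tx->n_reads, addr);
}

static void tx_write(struct tx_desc *tx, const void *addr) {
    set_add(tx->write_set, &tx->n_writes, addr);
}

/* Abort the current attempt, count the re-execution, and start over. */
static void tx_restart(struct tx_desc *tx) {
    tx->status = TX_ABORTED;
    tx->reexecutions++;
    tx_begin(tx);
}
```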

17 TM Specific Watchpoints
Break when a conflict happens.
Conflict information:
– Conflicting threads: T1, T2
– Address: 0x84D2F0
– Symbol: reservation@04
– Readers: T1
– Writers: T2
Filter: break if Address = reservation@04 AND Thread = T2
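A conflict filter of this kind amounts to a predicate over the conflict record, with the enabled conditions AND-ed together. A minimal sketch in C follows; the struct layouts and the function `filter_hits` are assumptions made for illustration and are not taken from the WinDbg extension.

```c
#include <stdbool.h>
#include <stdint.h>

/* What the TM runtime reports when two transactions conflict. */
struct conflict_info {
    uintptr_t address;   /* conflicting memory address */
    int reader;          /* id of the thread that read the address */
    int writer;          /* id of the thread that wrote the address */
};

/* A user-defined watchpoint filter; each condition can be switched on or off. */
struct watch_filter {
    bool match_address;
    uintptr_t address;
    bool match_thread;
    int thread;
};

/* Returns true when every enabled condition holds for this conflict. */
static bool filter_hits(const struct watch_filter *f,
                        const struct conflict_info *c) {
    if (f->match_address && c->address != f->address)
        return false;
    if (f->match_thread && c->reader != f->thread && c->writer != f->thread)
        return false;
    return true;
}
```

With the values from the slide, a filter on address 0x84D2F0 and thread T2 hits the conflict between reader T1 and writer T2, while the same filter on an uninvolved thread does not.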

18 Managing Transactions at Debug-Time
At the level of atomic blocks
– Debug-time atomic blocks
– Splitting atomic blocks
At the level of transactions
– Changing the state of the TM system (i.e. adding and removing entries from the read/write set, changing the status, aborting)
Analogous to the functionality of existing debuggers to change the CPU state.

19 Example: Debug-Time Atomic Blocks

20 Example: Debug-Time Atomic Blocks
The user marks the start and the end of the transactions (StartDebugAtomic, EndDebugAtomic).

21 Issues of Profiling TM Programs
TM applications have unanticipated overheads
– Problem raised by Pankratius [talk at ICSE'09] and Rossbach et al. [PPoPP'10]
Difficult to profile TM applications without profiling tools and without knowing the implementation of the TM system
– Experience of optimizing QuakeTM, Gajinov et al. [ICS'09]

22 Profiling TM Programs
Design principles
– Report results at source language constructs
– Abstract the underlying TM system
– Low probe effect and overhead
Profiling techniques
– Conflict point discovery
– Identifying conflicting data structures
– Visualizing transactions

23 Conflict Point Discovery
Identifies the statements involved in conflicts.
Provides contextual information.
Finds the critical path.

File:Line         #Conf.  Method    Line
Hashtable.cs:51   152     Add       if (_container[hashCode]…
Hashtable.cs:48   62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53   5       Add       _container[hashCode] = n …
Hashtable.cs:83   5       Add       while (entry != null) …
ArrayList.cs:79   3       Contains  for (int i = 0; i < count; i++ )
ArrayList.cs:52   1       Add       if (count == capacity – 1) …
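A table like the one above can be produced by counting conflicts per source location and sorting by count. The following C sketch shows one way to do that bookkeeping; `record_conflict`, `sort_conflict_points`, and the fixed-size table are assumptions for illustration, not the profiler's actual data structures.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_POINTS 64   /* table capacity, chosen only for the sketch */

struct conflict_point {
    const char *file;
    int line;
    unsigned count;
};

static struct conflict_point points[MAX_POINTS];
static size_t n_points;

/* Count one conflict at file:line, adding a new row on first sight. */
static void record_conflict(const char *file, int line) {
    for (size_t i = 0; i < n_points; i++) {
        if (points[i].line == line && strcmp(points[i].file, file) == 0) {
            points[i].count++;
            return;
        }
    }
    if (n_points < MAX_POINTS)
        points[n_points++] = (struct conflict_point){ file, line, 1 };
}

/* Comparator for qsort: higher conflict counts come first. */
static int by_count_desc(const void *a, const void *b) {
    const struct conflict_point *x = a, *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* Order the table so the hottest conflict points are at the top. */
static void sort_conflict_points(void) {
    qsort(points, n_points, sizeof points[0], by_count_desc);
}
```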

24 Call Context
    increment() {
        counter++;
    }

    probability80() {
        probability = random() % 100;
        if (probability < 80) {
            atomic {
                increment();
            }
        }
    }

    probability20() {
        probability = random() % 100;
        if (probability >= 80) {
            atomic {
                increment();
            }
        }
    }

    /* Executed by both Thread 1 and Thread 2 */
    for (int i = 0; i < 100; i++) {
        probability80();
        probability20();
    }

Bottom-up view
    + increment (100%)
    |---- probability80 (80%)
    |---- probability20 (20%)

Top-down view
    + main (100%)
    |---- probability80 (80%)
          |---- increment (80%)
    |---- probability20 (20%)
          |---- increment (20%)
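The bottom-up percentages can be derived from conflict samples that record the caller as well as the function where the conflict occurred. A minimal sketch in C, assuming a hypothetical `sample` record with one caller level; a real profiler would keep full call stacks.

```c
#include <stddef.h>
#include <string.h>

/* One aggregated sample: conflicts observed in `callee` when it was
 * called from `caller`. */
struct sample {
    const char *caller;
    const char *callee;
    unsigned conflicts;
};

/* Bottom-up view: percentage of the conflicts in `callee` that were
 * reached through `caller`. Returns 0 if `callee` has no conflicts. */
static unsigned caller_share(const struct sample *s, size_t n,
                             const char *callee, const char *caller) {
    unsigned total = 0, via = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(s[i].callee, callee) != 0)
            continue;
        total += s[i].conflicts;
        if (strcmp(s[i].caller, caller) == 0)
            via += s[i].conflicts;
    }
    return total ? (100u * via) / total : 0;
}
```

Without the caller field, all conflicts would be attributed to increment alone, and the profile would not show which of the two call sites is worth optimizing.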

25 Aborts Graph (Bayes)
There are 15 atomic blocks and only one of them aborts most.
Which atomic blocks cause AB3 to abort?
[Graph: nodes AB1, AB2, AB3. Edge labels: "Conf: 73%, Wasted: 63%" and "Conf: 20%, Wasted: 29%". Annotation: "72% of wasted work"]
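The "Conf" edge labels in such a graph can be read as the share of one block's aborts that another block caused. The C sketch below computes that share from a matrix of abort counts; the matrix layout and the name `abort_share` are assumptions for illustration, and the way the profiler actually computes the labels is not stated on the slide.

```c
#define N_BLOCKS 3   /* number of atomic blocks in the sketch */

/* aborts[victim][cause] holds how many times atomic block `cause`
 * aborted atomic block `victim`. The function returns the percentage
 * of the victim's aborts that were caused by `cause`. */
static unsigned abort_share(unsigned aborts[N_BLOCKS][N_BLOCKS],
                            int victim, int cause) {
    unsigned total = 0;
    for (int j = 0; j < N_BLOCKS; j++)
        total += aborts[victim][j];
    return total ? (100u * aborts[victim][cause]) / total : 0;
}
```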

26 Identifying Conflicting Objects
    1: List list = new List();
    2: list.Add(1);
    3: list.Add(2);
    4: list.Add(3);
    ...
    atomic {
        list.Replace(2, 33);
    }

Per-Object View
    + List.cs:1 "list" (42%)
    |--- ChangeNode (20%)
         +---- Replace (12%)
         +---- Add (8%)

[Figure: List and nodes 1, 2, 3 at addresses 0x08, 0x10, 0x18, 0x20; components: GC, Memory Allocator, DbgEng; Object Addr 0x20, GC Root 0x08, Instr Addr 0x446290, List.cs:1]
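The figure suggests that a conflicting object address is traced back to a GC root and from there to the source line that allocated it. The C sketch below shows that walk under stated assumptions: the `heap_object` record, its `referrer` link, and the two lookup functions are hypothetical and only illustrate the idea, not how the GC, memory allocator, and DbgEng cooperate in the actual tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a heap object as the profiler might record it. */
struct heap_object {
    uintptr_t addr;                       /* object address */
    const struct heap_object *referrer;   /* object pointing to this one; NULL for a root */
    const char *alloc_site;               /* source location of the allocation, if known */
};

/* Find the record for a conflicting address, or NULL if it is unknown. */
static const struct heap_object *find_object(const struct heap_object *objs,
                                             size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++)
        if (objs[i].addr == addr)
            return &objs[i];
    return NULL;
}

/* Follow referrer links up to the root object, whose allocation site
 * names the data structure in the source program. */
static const struct heap_object *find_root(const struct heap_object *obj) {
    while (obj->referrer != NULL)
        obj = obj->referrer;
    return obj;
}
```

Reporting the root's allocation site (List.cs:1 in the slide) lets the profile speak about "list" as a whole instead of about individual node addresses.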

27 Transaction Visualizer (Genome)
[Figure: transaction visualizer output with regions marked "Garbage Collection" and "Wait on barrier"]
Aborts occur at the first and last atomic blocks in program order.

28 Overhead and Probe Effect
+ Profiling enabled, - Profiling disabled

Normalized Execution Time
Thrd#  Bayes+  Bayes-  Gen+  Gen-  Intrd+  Intrd-  Labr+  Labr-  Vac+  Vac-  WB+   WB-
1      1.59    1.00    1.27  1.00  1.29    1.00    1.07   1.00   1.26  1.00  0.71  1.00
2      0.56    0.56    0.97  0.67  0.97    0.58    0.64   0.61   0.83  0.59  0.60  0.55
4      0.23    0.23    0.73  0.52  0.91    0.36    0.45   0.46   0.58  0.40  0.41  0.33
8      0.21    0.20    0.73  0.55  1.57    0.38    0.72   0.56   0.53  0.34  0.33  0.22

Abort Rate in %
Thrd#  Bayes+  Bayes-  Gen+  Gen-  Intrd+  Intrd-  Labr+  Labr-  Vac+  Vac-  WB+   WB-
1      0.00    0.00    0.00  0.00  0.00    0.00    0.00   0.00   0.00  0.00  0.00  0.00
2      4.39    4.69    0.07  0.07  3.69    3.51    0.19   0.15   0.80  0.80  0.00  0.00
4      16.29   27.31   0.26  0.36  14.90   13.65   0.35   0.36   2.30  2.45  0.00  0.00
8      53.74   66.08   0.53  0.80  39.64   37.41   0.40   0.47   4.91  5.30  0.02  0.03

Standard deviation for the difference: 27%
Standard deviation for the difference: 3.88%
Process data offline or during GC.

29 Optimization Techniques
– Moving statements
– Atomic block scheduling
– Checkpoints and nested atomic blocks
– Pessimistic reads
– Early release

30 Moving Statements
Will this code execute the same?
    atomic {
        counter++;
    }
    atomic {
        counter++;
    }
No!

31 Checkpoints
[Figure: atomic { } with the share of conflicts at successive points: 2%, 15%, 4%, 79%; marker "Insert Checkpoint"]

32 Checkpoints
[Figure: atomic { } with the share of conflicts at successive points: 2%, 15%, 4%, 79%; marker "Insert Checkpoint"]
Reduced wasted work for the atomic block by 40%.
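The benefit of a checkpoint is that a conflict late in the atomic block only discards the work done after the checkpoint. A minimal sketch in C using an undo log; the log layout and the names `tx_store`, `tx_checkpoint`, and `tx_rollback_to` are assumptions for illustration and are not the mechanism of the TM system used in this work.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_MAX 32   /* log capacity, chosen only for the sketch */

/* One undo record: where we wrote and what was there before. */
struct undo_entry {
    int *addr;
    int old_value;
};

struct tx_log {
    struct undo_entry entries[LOG_MAX];
    size_t len;
    size_t checkpoint;   /* log length at the time of the checkpoint */
};

/* Transactional store: remember the old value, then write in place. */
static void tx_store(struct tx_log *log, int *addr, int value) {
    assert(log->len < LOG_MAX);
    log->entries[log->len++] = (struct undo_entry){ addr, *addr };
    *addr = value;
}

/* Mark the current position; later conflicts roll back only to here. */
static void tx_checkpoint(struct tx_log *log) {
    log->checkpoint = log->len;
}

/* Undo writes, newest first, until only `keep` log entries remain.
 * keep == log->checkpoint is a partial rollback; keep == 0 is a full abort. */
static void tx_rollback_to(struct tx_log *log, size_t keep) {
    while (log->len > keep) {
        struct undo_entry *e = &log->entries[--log->len];
        *e->addr = e->old_value;
    }
}
```

Placing the checkpoint just before the statement where most conflicts occur (79% in the figure) keeps the earlier part of the block from being re-executed on every abort.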

33 Conclusion
– Study the programmability aspects of TM
– New debugging principles and approaches for TM applications
– New profiling techniques for TM applications
– Profile-guided optimization approaches for TM applications

34 End

