Discovering and Understanding Performance Bottlenecks in Transactional Applications Ferad Zyulkyarov 1,2, Srdjan Stipic 1,2, Tim Harris 3, Osman S. Unsal.


1 Discovering and Understanding Performance Bottlenecks in Transactional Applications
Ferad Zyulkyarov (1,2), Srdjan Stipic (1,2), Tim Harris (3), Osman S. Unsal (1), Adrián Cristal (1,4), Ibrahim Hur (1), Mateo Valero (1,2)
(1) BSC-Microsoft Research Centre
(2) Universitat Politècnica de Catalunya
(3) Microsoft Research Cambridge
(4) IIIA - Artificial Intelligence Research Institute, CSIC - Spanish National Research Council
19th International Conference on Parallel Architectures and Compilation Techniques, 11-15 September 2010, Vienna

2 Abstract the TM Implementation
Thread 1:
    for (i = 0; i < N; i++) {
        atomic { x[i]++; }
    }
Thread 2:
    for (i = 0; i < N; i++) {
        atomic { y[i]++; }
    }
Accesses to different arrays. We can observe overheads inherent to the TM implementation. We are not interested in such bottlenecks.

3 Abstract the TM Implementation
Thread 1:
    for (i = 0; i < N; i++) {
        atomic { x[i]++; }
    }
Thread 2:
    for (i = 0; i < N; i++) {
        atomic { x[i]++; }
    }
Accesses to the same array. Contention: a bottleneck common to all implementations of the TM programming model. We are interested in this kind of bottleneck.

4 Can We Find This Kind of Bottleneck?
    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }
Abort rate: 80%
Where do aborts happen? Which variables conflict? Are there false conflicts?

5 Can We Find This Kind of Bottleneck?
    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }
counter1=0; counter2=0; counter3=0; counter4=0;

6 Can We Find This Kind of Bottleneck?
    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }
counter1=1; counter2=0; counter3=0; counter4=0;

7 Can We Find This Kind of Bottleneck?
    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
    }
counter1=1; counter2=1; counter3=0; counter4=0;
Conflict between statement2 and statement4.
Goal: profiling techniques that find bottlenecks (important conflicting locations) and explain why these conflicts happen.

8 Outline
- Profiling Techniques
- Implementation
- Case Studies

9 Profiling Techniques
- Visualizing transactions
- Conflict point discovery
- Identifying conflicting data structures

10 Transaction Visualizer (Genome)
[Figure: transaction visualizer timeline for Genome; annotations: Garbage Collection, 14% Aborts, Wait on barrier]
When do these aborts happen? Aborts occur at the first and last atomic blocks in program order.

11 Aborts Graph (Bayes)
[Figure: aborts graph over atomic blocks AB1 to AB15; labels: 93% Aborts, 73%, 20%]

12 Number of Aborts vs Wasted Work
Atomic block                      Aborts   Wasted work
atomic { counter++ }              9        10%
atomic { hashtable.Rehash(); }    1        90%
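
The point of this slide, that a high abort count does not imply a large share of wasted work, can be sketched with a few lines of Python. The record layout, the function name `abort_summary`, and the tick values are assumptions made for illustration (the ticks are chosen so that the shares match the slide); they are not the profiler's actual data format.

```python
from collections import defaultdict

def abort_summary(aborted_attempts):
    """Return {block: (abort_count, share_of_total_wasted_time)}.

    aborted_attempts is a list of (block_name, ticks_lost) pairs,
    one per aborted transaction attempt (hypothetical layout).
    """
    counts = defaultdict(int)
    wasted = defaultdict(int)
    for block, ticks in aborted_attempts:
        counts[block] += 1
        wasted[block] += ticks
    total = sum(wasted.values())
    return {b: (counts[b], wasted[b] / total if total else 0.0) for b in counts}

# Invented sample: nine cheap aborts versus one expensive abort.
aborted = [("counter++", 1)] * 9 + [("hashtable.Rehash()", 81)]
summary = abort_summary(aborted)
for block, (n, share) in summary.items():
    print(f"{block}: aborts={n}, wasted work={share:.0%}")
```

Ranking atomic blocks by the second field rather than the first puts the rehash transaction at the top, which is the ordering the slide argues for.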

13 Conflict Point Discovery
File:Line         #Conf.   Method     Line
Hashtable.cs:51   152      Add        If (_container[hashCode]…
Hashtable.cs:48   62       Add        uint hashCode = HashSdbm(…
Hashtable.cs:53   5        Add        _container[hashCode] = n …
Hashtable.cs:83   5        Add        while (entry != null) …
ArrayList.cs:79   3        Contains   for (int i = 0; i < count; i++ )
ArrayList.cs:52   1        Add        if (count == capacity – 1) …
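
A conflict point table of this kind can be produced by grouping abort records by source location and sorting by count. The following is a minimal sketch under stated assumptions: the event dictionaries, their keys, the function name `conflict_points`, and the sample line numbers are all hypothetical and do not reflect the tool's real output.

```python
from collections import Counter

def conflict_points(abort_events):
    """Count conflicts per (file, line), most frequent first."""
    counts = Counter((e["file"], e["line"]) for e in abort_events)
    return counts.most_common()

# Invented sample events; only the file names echo the slide.
events = (
    [{"file": "Hashtable.cs", "line": 10}] * 3
    + [{"file": "Hashtable.cs", "line": 20}] * 2
    + [{"file": "ArrayList.cs", "line": 30}]
)
table = conflict_points(events)
for (file, line), n in table:
    print(f"{file}:{line}  conflicts={n}")
```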

14 Conflicts Context
    increment() {
        counter++;    // All conflicts happen here.
    }
    probability80() {
        probability = random() % 100;
        if (probability < 80) {
            atomic { increment(); }
        }
    }
    probability20() {
        probability = random() % 100;
        if (probability >= 80) {
            atomic { increment(); }
        }
    }
Thread 1 and Thread 2 (same code):
    for (int i = 0; i < 100; i++) {
        probability80();
        probability20();
    }
Bottom-up view
    + increment (100%)
      |---- probability80 (80%)
      |---- probability20 (20%)
Top-down view
    + main (100%)
      |---- probability80 (80%)
           |---- increment (80%)
      |---- probability20 (20%)
           |---- increment (20%)
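
Both views can be derived from the call stacks recorded at each conflict. The sketch below is an illustration only: the function name `call_tree`, the stack representation (outermost frame first), and the 80/20 sample are assumptions, not the profiler's implementation.

```python
from collections import Counter

def call_tree(stacks, bottom_up=False):
    """Map each call-path prefix to the fraction of conflicts under it.

    stacks: list of call stacks, each listed outermost frame first.
    bottom_up=True reverses every stack so paths start at the leaf.
    """
    weights = Counter()
    for stack in stacks:
        path = tuple(reversed(stack)) if bottom_up else tuple(stack)
        for depth in range(1, len(path) + 1):
            weights[path[:depth]] += 1
    total = len(stacks)
    return {path: n / total for path, n in weights.items()}

# Invented sample: 80 conflicts reached via probability80, 20 via probability20.
stacks = (
    [["main", "probability80", "increment"]] * 80
    + [["main", "probability20", "increment"]] * 20
)
top_down = call_tree(stacks)
bottom_up = call_tree(stacks, bottom_up=True)

# Sorting the path tuples lists every parent before its children.
for path, share in sorted(bottom_up.items()):
    print("    " * (len(path) - 1) + f"{path[-1]} ({share:.0%})")
```

The bottom-up view answers "which function conflicts, and who called it", while the top-down view answers "which part of the program accounts for the conflicts".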

15 Identifying multiple conflicts from a single run
Thread 1:
    atomic {
        obj1.x = t1;
        obj2.x = t2;
        obj3.x = t3;
        ...
    }
Thread 2:
    atomic {
        ...
        obj1.x = t1;
        obj2.x = t2;
        obj3.x = t3;
    }
Conflict detected at 1st iteration
Conflict detected at 2nd iteration
Conflict detected at 3rd iteration

16 Identifying Conflicting Objects
    List list = new List();
    list.Add(1);
    list.Add(2);
    list.Add(3);
    ...
    atomic {
        list.Replace(3, 33);
    }
[Figure: the list and its nodes 1, 2, 3 at addresses 0x08, 0x10, 0x18, 0x20. Labels: GC, DbgEng, Object Addr 0x20, GC Root 0x08, Variable Name (list); Memory Allocator, DbgEng, Instr Addr 0x446290, List.cs:1]
Per-Object View
    + List.cs:1 "list" (42%)
      |--- ChangeNode (20%)
           +---- Replace (12%)
           +---- Add (8%)

17 Outline
- Profiling Techniques
- Implementation
    - Bartok
    - The data that we collect
    - Probe effect and profiling
- Case Studies

18 Bartok
C# to x86 research compiler with language-level support for TM
STM
- Eager versioning (i.e. in-place update)
- Detects write-write conflicts eagerly (i.e. immediately)
- Detects read-write conflicts lazily (i.e. at commit)
- Detects conflicts at object granularity
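
The four STM properties listed on this slide can be illustrated with a toy, single-threaded model. This is a sketch under heavy assumptions, not Bartok's implementation: the classes `Obj`, `Tx`, and `Conflict` are hypothetical, ownership and version fields stand in for whatever metadata the real STM keeps per object, and interleaving is simulated by calling two transactions in turn.

```python
class Conflict(Exception):
    """Raised when a transaction must abort."""

class Obj:
    def __init__(self, value):
        # One version counter and one owner per object: object granularity.
        self.value, self.version, self.owner = value, 0, None

class Tx:
    def __init__(self, name):
        self.name = name
        self.reads = {}   # object -> version seen at first read
        self.undo = []    # (object, old value) log for in-place updates

    def read(self, obj):
        self.reads.setdefault(obj, obj.version)
        return obj.value

    def write(self, obj, value):
        # Eager write-write detection: fail at the write itself.
        if obj.owner not in (None, self):
            self.abort()
            raise Conflict(f"{self.name}: write-write conflict")
        obj.owner = self
        self.undo.append((obj, obj.value))
        obj.value = value  # eager versioning: update in place

    def commit(self):
        # Lazy read-write detection: validate the read set only now.
        for obj, seen in self.reads.items():
            if obj.version != seen or obj.owner not in (None, self):
                self.abort()
                raise Conflict(f"{self.name}: read-write conflict")
        for obj, _ in self.undo:
            if obj.owner is self:
                obj.version += 1
                obj.owner = None
        self.reads.clear()
        self.undo.clear()

    def abort(self):
        # Roll back in-place updates, newest first, and release ownership.
        for obj, old in reversed(self.undo):
            obj.value, obj.owner = old, None
        self.reads.clear()
        self.undo.clear()

x, y = Obj(0), Obj(0)
t1, t2 = Tx("T1"), Tx("T2")

t1.write(x, 1)
try:
    t2.write(x, 2)          # x is owned by T1
    eager_detected = False
except Conflict:
    eager_detected = True   # detected immediately, at the write

t2.read(y)                  # T2 reads y at version 0
t1.write(y, 5)
t1.commit()                 # bumps the versions of x and y
try:
    t2.commit()             # read set is stale
    lazy_detected = False
except Conflict:
    lazy_detected = True    # detected only at commit
```

The model also shows why the position of a conflicting write inside an atomic block matters: everything executed before the failing `write` or `commit` call is the work that gets rolled back.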

19 Profiling Data That We Collect
Timestamp
- TX start
- TX commit or TX abort
Read and write set size
On abort
- The instructions of the read and write operations involved in the conflict
- The conflicting memory address
- The call stack
Process data offline or during GC

20 Probe Effect and Overheads
Normalized Execution Time (average 0.25)
Threads   Bayes   Genome   Intruder   Labyrinth   Vacation   WormBench
1         0.59    0.27     0.29       0.07        0.26       0.29
2         0.45    0.30     0.39       0.03        0.24       0.05
4         0.01    0.21     0.55       0.01        0.18       0.08
8         0.02    0.18     1.19       0.16        0.19       0.11
Normalized Abort Rates (average 0.016)
Columns: Threads, Bayes, Genome, Intruder, Labyrinth, Vacation, WormBench
[Values as they survive in the transcript, column assignment lost: 2 threads: 0.00; 4 threads: 0.11, 0.00, 0.01, 0.00; 8 threads: 0.12, 0.00, 0.02, 0.00]

21 Outline
- Profiling Techniques
- Implementation
- Case Studies

22 Case Studies
- Bayes
- Intruder
- Labyrinth

23 Bayes
Wrapper object for function arguments:
    public class FindBestTaskArg {
        public int toId;
        public Learner learnerPtr;
        public Query[] queries;
        public Vector queryVectorPtr;
        public Vector parentQueryVectorPtr;
        public int numTotalParent;
        public float basePenalty;
        public float baseLogLikelihood;
        public Bitmap bitmapPtr;
        public Queue workQueuePtr;
        public Vector aQueryVectorPtr;
        public Vector bQueryVectorPtr;
    }
Create wrapper object:
    FindBestTaskArg arg = new FindBestTaskArg();
    arg.learnerPtr = learnerPtr;
    arg.queries = queries;
    arg.queryVectorPtr = queryVectorPtr;
    arg.parentQueryVectorPtr = parentQueryVectorPtr;
    arg.bitmapPtr = visitedBitmapPtr;
    arg.workQueuePtr = workQueuePtr;
    arg.aQueryVectorPtr = aQueryVectorPtr;
    arg.bQueryVectorPtr = bQueryVectorPtr;

24 Bayes
    public class FindBestTaskArg {
        public int toId;
        public Learner learnerPtr;
        public Query[] queries;
        public Vector queryVectorPtr;
        public Vector parentQueryVectorPtr;
        public int numTotalParent;
        public float basePenalty;
        public float baseLogLikelihood;
        public Bitmap bitmapPtr;
        public Queue workQueuePtr;
        public Vector aQueryVectorPtr;
        public Vector bQueryVectorPtr;
    }
Create wrapper object:
    FindBestTaskArg arg = new FindBestTaskArg();
    arg.learnerPtr = learnerPtr;
    arg.queries = queries;
    arg.queryVectorPtr = queryVectorPtr;
    arg.parentQueryVectorPtr = parentQueryVectorPtr;
    arg.bitmapPtr = visitedBitmapPtr;
    arg.workQueuePtr = workQueuePtr;
    arg.aQueryVectorPtr = aQueryVectorPtr;
    arg.bQueryVectorPtr = bQueryVectorPtr;
Call the function using the wrapper object:
    atomic {
        FindBestInsertTask(arg);
    }
98% of wasted work is due to the wrapper object.
- 2 threads: 24% execution time
- 4 threads: 80% execution time

25 Bayes – Solution
    atomic {
        FindBestInsertTaskArg(
            toId,
            learnerPtr,
            queries,
            queryVectorPtr,
            parentQueryVectorPtr,
            numTotalParent,
            basePenalty,
            baseLogLikelihood,
            bitmapPtr,
            workQueuePtr,
            aQueryVectorPtr,
            bQueryVectorPtr
        );
    }
Pass the arguments directly and avoid the wrapper object.

26 Intruder – Map Data Structure
[Figure: network stream and the map holding assembled packet fragments; fragment labels 1/3, 3/1, 6/2, 4/3, 6/3, 2/4, 6/4]

27 Intruder – Map Data Structure
[Figure: network stream and the map holding assembled packet fragments; fragment labels 1/3, 3/1, 6/2, 4/3, 6/3, 2/4, 6/4]
Aborts caused 68% wasted work. Replaced with a chaining hashtable.

28 Intruder – Moving Code
Write-write conflicts are detected eagerly.
    atomic {
        Decoded decodedPtr = new Decoded();
        char[] data = new char[length];
        Array.Copy(packetPtr.Data, data, length);
        decodedPtr.flowId = flowId;
        decodedPtr.data = data;
    }
    this.decodedQueuePtr.Push(decodedPtr);
More to roll back, more wasted work.
Little to roll back, less wasted work.

29 Labyrinth
    atomic {
        localGrid.CopyFrom(globalGrid);
        if (this.PdoExpansion(myGrid, myExpansionQueue, src, dst)) {
            pointVector = PdoTraceback(grid, myGrid, dst, bendCost);
            success = true;
            raced = grid.addPathOfOffsets(pointVector);
        }
    }
- 2 threads: 80% wasted work
- 4 threads: 98% wasted work
According to Watson (PACT'07), it is safe if localGrid is not up to date. Do not instrument CopyFrom with transactional reads and writes.

30 Summary
Design principles
- Abstract the underlying TM system
- Report results at the level of source language constructs
- Low instrumentation probe effect and overhead
Profiling techniques
- Visualizing transactions
- Conflict point discovery
- Identifying conflicting data structures

31 PPoPP'2010: Debugging Programs that use Atomic Blocks and Transactional Memory
ICS'2009: QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory
PPoPP'2008: Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server
The End

