A Case of Performance Tuning In a semi-realistic desktop app.
The Application: a simulation of a Computer-Integrated Manufacturing (CIM) system. There are thousands of Jobs. Each Job consists of several Operations. Each Operation consists of several Tasks (such as machining) and several Material Handling steps (transporting the workpiece).
Objective: Speed up the simulation. Method: Random Pausing, over multiple iterations. Result: 2-3 orders of magnitude speedup.
The method:
1. Run the program with enough workload so it can be paused.
2. Pause the program.
3. Understand what it is doing at that point in time, including examining the call stack.
4. Repeat steps 1-3 enough times (typically 3 to 20) to see things that could be changed to save time.
How is it different from Profiling?
- Precision of measurement (of time spent in routines, or even lines of code) is less important.
- Precision of understanding what is happening is more important. Example: an avoidable activity of some kind is seen on 3 out of 5 samples, in different functions. If it could be removed, a mean savings of roughly (3+1)/(5+2) ≈ 57% could be achieved (Rule of Succession). Since the activity is not localized to a particular function, let alone a particular line of code, a summarizing profiler is less likely to draw attention to it.
- Stack-sampling profilers collect, summarize, and then discard the information you need, without letting you see it, on the assumption that the statistics matter more. Other kinds of profilers don't even collect it.
So the scenario follows, starting with running and sampling the first iteration of the program.
Result: time goes from 2700 usec/job down to 1800 usec/job. Now take more samples:
Result: 1500 usec/job. Take more samples. Here's what I see (out of 10 samples):
- 3 in delete
- 2 in new
- 2 in Add
- 1 in RemoveAt
- 1 in cout
- 1 in my code
Not sure what to do next. I could make do-it-yourself arrays to help with the Add and RemoveAt time, or form the objects into linked lists, because I'm not really accessing them randomly. I could harvest used objects and not spend so much time in new and delete. Decision: make linked lists, and pool used objects.
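The two interventions chosen here can be sketched together: recycle freed nodes through a pool instead of calling new/delete, and link the objects so insertion and removal become pointer swaps rather than array Add/RemoveAt shifts. A minimal sketch (the class and field names are my own, not from the original code):

```cpp
#include <cassert>
#include <cstddef>

// A singly linked node with an intrusive free list: instead of
// delete-ing a node, we push it onto the pool; instead of new-ing,
// we pop from the pool whenever it is non-empty.
struct Node {
    int   payload = 0;
    Node* next    = nullptr;
};

class NodePool {
    Node* free_list_ = nullptr;
public:
    Node* acquire() {
        if (free_list_) {                 // reuse a harvested node
            Node* n = free_list_;
            free_list_ = n->next;
            n->next = nullptr;
            return n;
        }
        return new Node;                  // fall back to the heap
    }
    void release(Node* n) {               // no delete: recycle instead
        n->next = free_list_;
        free_list_ = n;
    }
    ~NodePool() {                         // reclaim pooled nodes at the end
        while (free_list_) {
            Node* n = free_list_;
            free_list_ = n->next;
            delete n;
        }
    }
};
```

After the first allocation wave, steady-state churn never touches the general-purpose allocator at all, which is why pooling removed such a large fraction of the time.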
Results: after putting in linked lists: 1300 usec/job. After pooling used objects: 440 usec/job!
Take more samples. See: 5 samples, 3 in NTH, 1 in cout, 1 in entering a routine. (NTH is a macro for indexing into a linked list.) Suggests: use a pointer directly into the list, rather than an integer index.
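The NTH problem can be illustrated like this (a sketch; `nth` is my reconstruction of what such an indexing macro typically does, not the original code):

```cpp
#include <cassert>

struct Node { int payload; Node* next; };

// Integer indexing: every lookup walks the list from the head,
// so visiting all n elements by index costs O(n^2) total.
Node* nth(Node* head, int i) {
    while (i-- > 0) head = head->next;
    return head;
}

int sum_by_index(Node* head, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += nth(head, i)->payload;      // O(n) walk per lookup
    return s;
}

// A pointer directly into the list: one advance per element, O(n) total.
int sum_by_pointer(Node* head) {
    int s = 0;
    for (Node* p = head; p; p = p->next)
        s += p->payload;
    return s;
}
```

Both functions return the same result; only the traversal cost differs, which is exactly why the samples landed in NTH rather than in any one caller.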
Result: after using a pointer into the list: 170 usec/job. Take more samples. Out of 4 samples, every sample is in cout. Suppress the output. Result: after suppressing output: 3.7 usec/job. That's a total speedup of 2700/3.7 ≈ 730 times!
How is this speedup possible?
1. 33.3% in push_back
2. 11.1% in out-of-line indexing
3. 7.4% in Add/RemoveAt
4. 31.9% in new/delete
5. 9.3% in getting the Nth list element
6. 6.1% in cout
That is over 99% of the time spent in activities that could be removed!
Common objection: too few samples! Answer: you don't need more! The cost (i.e. potential savings) x is a distribution: x ~ Beta(hits + 1, N - hits + 1). Precision depends on N, but the expected savings varies little. Notice that the speedup ratio 1/(1-x) is heavily right-skewed: an incorrect estimate carries little downside risk, with high upside benefit.
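The skew claim can be checked numerically. A Beta(a, b) variate can be drawn as Ga/(Ga+Gb) from two independent gamma variates, which the standard library supports directly; a sketch, using the 3-of-5 counts from the earlier example (the helper names are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Draw `trials` samples of x ~ Beta(hits+1, n-hits+1), using the
// identity Beta(a,b) = Ga / (Ga + Gb) for independent gamma variates.
std::vector<double> sample_savings(int hits, int n, int trials,
                                   unsigned seed) {
    std::mt19937 rng(seed);
    std::gamma_distribution<double> ga(hits + 1, 1.0);
    std::gamma_distribution<double> gb(n - hits + 1, 1.0);
    std::vector<double> xs(trials);
    for (double& x : xs) {
        double a = ga(rng), b = gb(rng);
        x = a / (a + b);
    }
    return xs;
}

// Speedup ratio 1/(1-x) at a given quantile of the savings distribution.
double speedup_quantile(std::vector<double> xs, double q) {
    std::sort(xs.begin(), xs.end());
    double x = xs[static_cast<size_t>(q * (xs.size() - 1))];
    return 1.0 / (1.0 - x);
}
```

With hits = 3 and N = 5, the sample mean of x sits near (3+1)/(5+2) ≈ 0.57 even at modest sample counts, and the gap between the 95th-percentile and median speedup is several times the gap between the median and the 5th percentile: the upside tail is much longer than the downside.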
What NOT to Learn From This. Each iteration consisted of two phases:
- A: Diagnosis, via random pausing.
- B: Intervention, to get the speedup.
What's the important thing to learn? That A is more important than B. Too often people skip A, go directly to B, and get little or no speedup.
Conclusion: Random Pausing should be in every programmer's toolkit.