Performance Tuning Panotools - PTMender. Layout Project Goal About Panotools Multi-threading SIMD, micro-architectural pitfalls Results.


1 Performance Tuning Panotools - PTMender

2 Layout: Project Goal; About Panotools; Multi-threading; SIMD, micro-architectural pitfalls; Results

3 Project Goal: gaining performance on PanoTools. This goal will be achieved through:
1. Multi-threading the application, to take advantage of new multi-core machines, which hold the most significant performance promise.
2. Using SSE code.
3. Finding micro-architectural pitfalls and solving them, using the VTune tuning assistant.

4 About Panotools: Panotools is the cross-platform library behind Panorama Tools and many other GUI photo stitchers. It is gaining popularity as the back-end engine for many panoramic applications, and it was selected to participate in the Google Summer of Code. We focused on the PTMender module of the library. More details on Panotools on:

5 Multi-threading: there are two major approaches to multi-threading an existing single-threaded application:
1. Data decomposition: dividing the data into smaller parts and performing parallel work on each part. This is not always possible, due to algorithmic dependencies between the divided parts.
2. Functional decomposition: dividing the work according to functional tasks. Each thread performs a unique, predefined task. This is harder to do and requires a deep understanding of the original algorithm.
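Data decomposition usually starts by cutting the work into contiguous chunks, one per thread. The following is a minimal sketch of that step only, not code from PTMender: the `range_t` type and the `chunk_for` helper are hypothetical names, and the unit of work (rows, tiles, files) is left abstract.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of work items assigned to one thread. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Hypothetical helper: split `total` items into `num_parts` contiguous
 * chunks whose sizes differ by at most one, and return chunk `index`.
 * The first (total % num_parts) chunks receive one extra item. */
static range_t chunk_for(size_t total, size_t num_parts, size_t index)
{
    assert(num_parts > 0 && index < num_parts);

    size_t base = total / num_parts;
    size_t extra = total % num_parts;
    range_t r;

    r.begin = index * base + (index < extra ? index : extra);
    r.end = r.begin + base + (index < extra ? 1 : 0);
    return r;
}
```

Each thread would call `chunk_for` with its own index and process only its range. This works only when the chunks are independent of one another, which is the condition the slide warns about.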

6 Multi-threading – contd. Naturally, we started by looking for a data decomposition. In theory, because PTMender works on several files, we could have processed a number of files simultaneously. Alternatively, we could have divided a single file and processed its parts simultaneously. In practice, using the Call Graph function in VTune, we noticed a native division of each file into independent parts on which the algorithm runs. We chose the latter method, dividing a single file, because it scales better.

7 VTune - Call graph

8 The serial (Original) model Serial task

9 The Parallel model thread0 thread1

10 Multi-threading – contd. Data sharing: we created arrays of thread-specific data structures:
    typedef struct thread_vars {
        Image result;
        TrformStr transform;
        int pad[16];
    } thread_vars_t;
    thread_vars_t thread_private[NUM_THREADS];
And not:
    Image result[NUM_THREADS];
    TrformStr transform[NUM_THREADS];
Padding is used to create full cache line separation between array entries and prevent false sharing.
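The same separation can also be requested from the compiler instead of padding by hand. The sketch below is an alternative to the slide's `int pad[16]`, not the authors' code: it uses C11 `alignas`, assumes a 64-byte cache line, and uses hypothetical stand-ins for `Image` and `TrformStr`, whose real definitions live in the library.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 2
#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hypothetical stand-ins for the library types named on the slide. */
typedef struct { int width; int height; } Image;
typedef struct { int interpolator; } TrformStr;

/* Aligning the first member aligns the whole struct, so sizeof() is
 * rounded up to a multiple of CACHE_LINE and array entries never
 * share a cache line. */
typedef struct thread_vars {
    alignas(CACHE_LINE) Image result;
    TrformStr transform;
} thread_vars_t;

static_assert(sizeof(thread_vars_t) % CACHE_LINE == 0,
              "each entry must occupy whole cache lines");

static thread_vars_t thread_private[NUM_THREADS];

/* Returns 1 if both addresses fall in the same cache line, else 0. */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

With this layout, a write by thread 0 to `thread_private[0]` cannot invalidate the line holding `thread_private[1]`, which is the false-sharing case the padding is meant to prevent.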

11 Thread Checker

12 Thread Checker - Debug

13 Noise: the effects of the data races were later obvious when inspecting the output.

14 Thread Checker – Debug – contd. Adding synchronization around critical sections:
    #ifdef PROTECT_WRITE
    // Request ownership of mutex.
    dwWaitResult = WaitForSingleObject(
        hTiffWriteMutex,   // handle to mutex
        5000L);            // five-second time-out interval
    if (dwWaitResult == WAIT_OBJECT_0) {
        __try {
            // Write to the database.
    #endif
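The slide shows only the start of the Win32 critical section. As a complete but deliberately different illustration, the sketch below protects a shared write with a POSIX mutex rather than the Win32 mutex used in the project. The names `write_mutex`, `output`, and `write_row` are hypothetical, the shared array stands in for the TIFF output, and unlike the slide's five-second time-out this version blocks until the lock is available.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define OUT_ROWS 8

/* Shared state: every access goes through write_mutex. */
static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
static int output[OUT_ROWS];   /* hypothetical stand-in for the output file */
static int rows_written = 0;

/* Append one value to the shared output.
 * Returns 0 on success, -1 if the lock fails or the output is full. */
static int write_row(int value)
{
    int status = -1;

    if (pthread_mutex_lock(&write_mutex) != 0)
        return -1;

    /* Critical section: only one thread at a time reaches this point. */
    if (rows_written < OUT_ROWS) {
        output[rows_written++] = value;
        status = 0;
    }

    int rc = pthread_mutex_unlock(&write_mutex);
    assert(rc == 0);
    (void)rc;   /* silence the unused warning when NDEBUG is defined */
    return status;
}
```

Keeping the lock and unlock in one function with a single exit path after the lock makes it harder to leave the mutex held on an error path.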

15 Thread Profiler

16 Thread Profiler – contd.

17 Image comparison

18 SIMD & uArchitecture: Unfortunately, we did not find good opportunities for vectorizing. The main micro-architectural issue is mispredicted indirect calls. This cannot be solved, since the Panotools mechanism relies heavily on function pointers for flexibility. FP activity is significant. We changed the floating-point model in compilation from precise to fast, which reduced the instruction count in the benchmark to under 90% of that of the original code generation.
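To illustrate where such indirect calls come from, here is a minimal sketch of a per-pixel chain of transforms dispatched through function pointers. It is not the Panotools code: `transform_fn`, `stage_t`, `run_stack`, and the two sample transforms are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature of one coordinate-transform stage. */
typedef void (*transform_fn)(double x, double y,
                             double *x_out, double *y_out,
                             const void *params);

typedef struct {
    transform_fn fn;      /* which transform to apply */
    const void *params;   /* its parameters */
} stage_t;

static void scale(double x, double y, double *x_out, double *y_out,
                  const void *params)
{
    double k = *(const double *)params;
    *x_out = x * k;
    *y_out = y * k;
}

static void shift(double x, double y, double *x_out, double *y_out,
                  const void *params)
{
    double d = *(const double *)params;
    *x_out = x + d;
    *y_out = y + d;
}

/* Apply every stage in order to one coordinate pair. Each iteration
 * makes one indirect call, and the target changes from stage to stage,
 * which is the pattern a branch predictor can struggle with. */
static void run_stack(const stage_t *stack, size_t n,
                      double x, double y, double *x_out, double *y_out)
{
    assert(stack != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        stack[i].fn(x, y, &x, &y, stack[i].params);
    }
    *x_out = x;
    *y_out = y;
}
```

Because the chain is assembled at run time, the compiler cannot inline the stages, so the flexibility and the indirect calls come together.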

19 Results

20 Thank you

