Software Performance Tuning Project – Final Presentation. Prepared by: Eyal Segal, Koren Shoval. Advisors: Liat Atsmon, Koby Gottlieb.


1 Software Performance Tuning Project – Final Presentation Prepared by: Eyal Segal, Koren Shoval Advisors: Liat Atsmon, Koby Gottlieb

2 WavPack – Description WavPack is an open source audio compression format. –Allows lossless audio compression. –Compresses WAV files to WV files. –Average compression ratio is 30-70%. –Supported on Windows and on mobile devices: Cowon A3 PMP, iRiver, iPod, Nokia phones, and more.

3 Project Goals Enhance WavPack's performance by: –Analyzing the application with Intel® VTune™ Performance Analyzer. –Studying and applying instructions of Intel's newer processors. –Implementing multi-threading techniques in order to achieve higher performance. Contribute the resulting source code back to the community.

4 Algorithm Description The input file is processed in blocks of 512 KB. –A global context exists for all blocks. –Blocks are divided into sub-blocks of 24,000 samples, equivalent to about 0.5 second of WAV audio at CD quality. –Each sub-block is encoded and written to the output. –The context data is updated for the next block.
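The block loop above can be sketched as a short driver. This is a minimal illustration, not WavPack's actual C code; `encode_block` and the dict-based context are hypothetical stand-ins for the real encoder and its global context structure:

```python
BLOCK_BYTES = 512 * 1024        # read granularity from the slides
SUB_BLOCK_SAMPLES = 24_000      # ~0.5 s of CD-quality audio

def encode_stream(samples, encode_block, context=None):
    """Hypothetical driver loop: encode fixed-size sub-blocks in order,
    threading the mutable context from one block into the next.
    This sequential dependency is what later blocks the naive
    parallelization of the whole flow."""
    context = context if context is not None else {}
    out = []
    for start in range(0, len(samples), SUB_BLOCK_SAMPLES):
        block = samples[start:start + SUB_BLOCK_SAMPLES]
        encoded, context = encode_block(block, context)
        out.append(encoded)
    return out, context
```

Because each call receives the context produced by the previous call, the sub-blocks cannot be encoded out of order.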

5 [Flowchart: the WavPack compression flow. Init reads a 512 KB buffer from the input file and builds a configuration (stereo/mono, bps in {8,16,24,32}, pass count, and more options). If the stream is lossless and stereo, the left and right channels are transformed to mid and diff. The first part of the algorithm walks the buffer one block of 24,000 samples at a time and performs the WavPack decorrelation algorithm on the buffer, repeated pass-count times. The second part calculates additional information for compression and performs the compression bit by bit, counting ones and zeros until a change occurs, then writes the resulting buffer to the output; this is the compression stage. A global context is passed down to each function, and each subset of bytes depends on an indeterminate subset of the previous bytes; this is why parallelizing the entire flow fails.]

6 Testing Environment Hardware –Core i7 2.66 GHz and Core 2 Quad Q6600 2.4 GHz CPUs. –4 GB of RAM. Software –Windows XP/Vista. –Visual Studio 2008. –Intel VTune toolkit. –Compiled with the Microsoft compiler. Tests are done on a 330 MB WAV file.

7 Original Implementation Single-threaded application –Reads from disk. –Encodes. –Writes to disk directly. Old MMX instructions are used. Processing a 330 MB WAV file takes about 30 seconds.

8 Optimizations Parallel I/O and CPU

9 General –Separate the read, write, and processing operations into several threads. Flow –Use the main thread to read the input file, create "jobs", and submit them into a work queue. –Use an additional thread to process the "jobs"; output is redirected to memory instead of disk. –Another thread writes the processed output to the disk.

10 Optimizations Parallel I/O and CPU – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 29 seconds. –Speedup is 1.026, relative to the original results. Conclusions –No significant improvement. –I/O operations take considerably less time than the block processing: reads finish long before the processing does, and the writing thread is almost never busy.

11 Optimizations Multi Threaded Processing

12 General –Obstacle: each block depends on the previously processed block, so parallelizing the entire flow is impossible. –Instead, multithread parts of the algorithm: locate the parts of the code where the program spends most of its time and parallelize several functions in those parts. Implementation –Uses a "thread pool". –Work is separated into the left and right channels. Within each channel, each sample depends on the previous sample, so no more than two threads can be used. –Each thread uses a different memory area; results must be combined after the work is done.
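The channel split above can be sketched as two workers with private buffers whose results are interleaved afterwards. A minimal illustration, assuming a hypothetical `decorrelate` callback in place of the real per-channel WavPack processing:

```python
import threading

def process_stereo(left, right, decorrelate):
    """Two-worker channel split: each channel is processed on its own
    thread into a private slot (separate memory areas, as in the
    slides), then the per-channel results are interleaved back into
    one output buffer."""
    results = [None, None]

    def work(idx, channel):
        results[idx] = decorrelate(channel)   # private memory per thread

    threads = [threading.Thread(target=work, args=(0, left)),
               threading.Thread(target=work, args=(1, right))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # combine after the work is done: interleave left & right samples
    out = []
    for l, r in zip(results[0], results[1]):
        out.extend((l, r))
    return out
```

Two threads is the hard ceiling here because the sample-to-sample dependency inside each channel keeps each channel's work strictly sequential.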

13 [Flowchart: the multithreaded processing flow. If the stream is lossless and stereo, the processing thread creates duplicates of each shared data structure to avoid cache conflicts, fills two new "Thread Args" structures (one with the left channel data and one with the right), submits each work item to the "Thread Pool", and waits on the "OnComplete" mutex. Each worker thread waits for work to arrive in the "Thread Pool", performs the WavPack decorrelation algorithm on its buffer (repeated pass-count times), calculates additional information for compression, performs the compression bit by bit (counting ones and zeros until a change occurs), writes the resulting buffer to its output (the compression stage), and returns to the "Thread Pool". Finally, the left and right channel data are interleaved into one output buffer.]

14 Optimizations Multi Threaded Processing – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 25 seconds. –Speedup is 1.167, relative to the original results. Conclusions –About 17% of the running time is parallelized. –Due to threading overhead, the total improvement is a little smaller than that.

15 Optimizations Moving to SIMD

16 General –Locate mathematical calculations and loops where the program spends most of its time. –Use 128-bit wide instructions. –Convert four 32-bit operations into one 128-bit operation. Theoretically, performance can be up to 4x faster; in practice, there is overhead (load, store). Implementation –Re-factor the code as a basis for adding SIMD operations. –Loop unrolling; make sure to complete the "leftovers" of the loop. –Re-implement using SIMD code.
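The unrolling structure above can be shown with a scalar stand-in. This sketch illustrates only the loop shape (four samples per iteration plus a leftover tail), not actual SSE intrinsics; the 4-wide body mirrors where one 128-bit operation would replace four 32-bit ones in the real C code:

```python
def scale_unrolled(samples, k):
    """Unrolled loop shape used when vectorizing: the main body handles
    four 32-bit samples per iteration (one 128-bit SIMD op in the real
    code), and a tail loop completes the 0-3 leftover samples."""
    out = [0] * len(samples)
    n4 = len(samples) - len(samples) % 4
    for i in range(0, n4, 4):            # body: 4 samples per iteration
        out[i]     = samples[i] * k
        out[i + 1] = samples[i + 1] * k
        out[i + 2] = samples[i + 2] * k
        out[i + 3] = samples[i + 3] * k
    for i in range(n4, len(samples)):    # tail: the "leftovers"
        out[i] = samples[i] * k
    return out
```

Forgetting the tail loop is the classic unrolling bug: any input whose length is not a multiple of four would lose its last few samples.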

17 Optimizations Moving to SIMD – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 28 seconds. –Speedup is 1.043, relative to the original results. Conclusions –The mathematical calculations can mostly be done with SSE2 and SSE3. –SSE4 instructions were not useful for this application. –The improvement alone isn't significant; it becomes more significant when combined with the multi-threading optimization.

18 Optimizations Implementation Improvements

19 General –We found several hot spots in the program that we couldn't improve using the methods above, mainly due to branch misprediction. –Re-implement them in a more efficient way. Implementation –Focused on one main function with lots of branch mispredictions, where a 16-bit integer was used as the buffered output. –Removed most of the branch instructions. –Re-implemented the same logic with a 64-bit integer buffer: the largest register size, and SIMD would require too much overhead.
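The wide-buffer idea above can be illustrated with a small bit writer. This is a generic sketch of buffering output bits in a 64-bit accumulator, not the project's actual function; the shift-and-or body avoids the per-bit, data-dependent branching that a narrower, branch-heavy implementation would incur:

```python
def pack_bits(bits):
    """Buffered bit writer over a 64-bit accumulator: each bit is
    shifted in with branch-free arithmetic, and the only branch in the
    hot loop is the counter test that flushes a full 64-bit word."""
    acc, nbits, out = 0, 0, bytearray()
    for b in bits:
        acc = (acc << 1) | (b & 1)   # branch-free append of one bit
        nbits += 1
        if nbits == 64:              # flush a full 64-bit word
            out += acc.to_bytes(8, "big")
            acc, nbits = 0, 0
    if nbits:                        # flush the partial tail, zero-padded
        acc <<= (64 - nbits)
        out += acc.to_bytes(8, "big")[:(nbits + 7) // 8]
    return bytes(out)
```

The win in the real code comes from the same principle: replacing many data-dependent branches with arithmetic on one wide register keeps the branch predictor out of the inner loop.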

20 Optimizations Implementation Improvements – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 28 seconds. –Speedup is 1.06, relative to the original results. Conclusions –Branch instructions and branch mispredictions were reduced. –Improvement in performance: almost 2 seconds less. –The implementation is centered in one method, so it is easy to re-factor and requires no major architecture changes.

21 Summary The most significant optimization was multithreading code sections. –16% speedup. The least significant was the multithreaded I/O. –2.6% speedup.

22 Summary – Cont. Benchmark –VTune analysis showed the following results. –Average running time is about 22 seconds. –The total speedup we achieved is 1.335, i.e., the program runs faster by 33.5%.

23 Summary – Cont. Conclusions –Multithreading is something to be considered in the architectural stages of an application. In this application, the performance improvement was not worth the development and maintenance effort. –SIMD optimizations should only be used in specific cases; they make the code harder to use and understand. –Decreasing branch mispredictions and cache misses is a better way to improve performance: refactoring only specific methods is easier to implement and usually simplifies the code. Using VTune and similar analysis tools is a good practice. –Leveraging new CPU instructions should be the compiler's responsibility; developers shouldn't really need to do this job, and the code gets cluttered.

24 Sources WavPack official website –http://www.wavpack.com Intel® VTune™ Performance Analyzer SourceForge website –http://sourceforge.net/ Software lab website –http://softlab.technion.ac.il/ MSDN –http://msdn.microsoft.com Wikipedia –http://en.wikipedia.org/wiki/ Intel website –http://www.intel.com/

