
1 Speex encoder project. Software performance enhancement using multithreading, SIMD and architectural considerations. Presented by: Gruper Leetal, Kamelo Tsafrir. Instructor: Guz Zvika.

2 Speex Speex is an Open Source/Free Software audio compression format designed for speech. It targets packet networks and voice over IP (VoIP) applications. File-based compression is also supported.

3 Initial code analysis (chart): original Speex run time

4 Initial code analysis (chart legend): tuned functions using SSE; tuned function using threading & SSE

5 Speex encoding overview (diagram): the audio stream is split into frames (Frame 1, Frame 2, Frame 3, Frame 4). For each frame: read frame, encode, write frame.

6 Our work on Speex: threading, SSE and compiling with the Intel compiler

7 Threading by data decomposition – trial 1 Create a few threads and give each one a frame to work on. The main thread reads frames from the input file and creates a "worker" thread for each frame. Each worker thread encodes its frame and writes the result to the output file.

8 Problem & Solution Problem: creating a new thread for every frame takes a lot of time. Solution: create the worker threads only once, at startup, with as many workers as the CPU has cores. For example, with 2 cores there is a main thread and 2 worker threads: one handles all the even frames and the other handles all the odd frames.
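The even/odd split described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes POSIX threads (the slides do not name the threading API), `encode_frame` is a hypothetical stand-in for the real encoder call, and the file reading done by the main thread is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 2   /* assumed: one worker per core on a 2-core CPU */
#define NUM_FRAMES  10  /* hypothetical frame count for the demo */

/* Hypothetical stand-in for encoding one frame; not the libspeex API. */
static int encode_frame(int frame_index) { return frame_index * 2; }

typedef struct {
    int worker_id;      /* 0 handles the even frames, 1 handles the odd frames */
    int *encoded;       /* shared output, one slot per frame */
} worker_arg;

static void *worker_main(void *p)
{
    worker_arg *arg = (worker_arg *)p;
    /* Each worker strides through the frames, so no two workers share a slot. */
    for (int f = arg->worker_id; f < NUM_FRAMES; f += NUM_WORKERS)
        arg->encoded[f] = encode_frame(f);
    return NULL;
}

/* Creates the workers once, waits for them, and returns 0 on success. */
int encode_all_frames(int *encoded)
{
    pthread_t threads[NUM_WORKERS];
    worker_arg args[NUM_WORKERS];
    int created = 0, status = 0;

    assert(encoded != NULL);
    for (int w = 0; w < NUM_WORKERS; ++w) {
        args[w].worker_id = w;
        args[w].encoded = encoded;
        if (pthread_create(&threads[w], NULL, worker_main, &args[w]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* Join only the threads that were actually created. */
    for (int w = 0; w < created; ++w)
        pthread_join(threads[w], NULL);
    return status;
}
```

The thread-creation cost is paid `NUM_WORKERS` times in total instead of once per frame, which is the point of the fix.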

9 Diagram: input wav file → main thread → encoding threads → output spx file

10 Threading by data decomposition – trial 1 65% speedup! But: although Speex is robust to packet loss, it does depend on the previous packet when it is available → definite reduction in sound quality.

11 Threading by data decomposition – trial 2 Split the file into two. One thread will handle half a file and the other will handle the other half.
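A small sketch of how the file could be divided into contiguous blocks of frames, one block per thread. The names `frame_range` and `split_frames` are hypothetical, and the slides only describe the two-thread case; the general form is an assumption.

```c
#include <assert.h>

/* A contiguous block of frames owned by one thread (hypothetical type). */
typedef struct { long first; long count; } frame_range;

/* Returns the block of frames owned by thread `idx` out of `num_threads`. */
frame_range split_frames(long total_frames, int num_threads, int idx)
{
    assert(total_frames >= 0 && num_threads > 0);
    assert(idx >= 0 && idx < num_threads);

    long base  = total_frames / num_threads;
    long extra = total_frames % num_threads;  /* first `extra` threads get one more frame */

    frame_range r;
    r.first = idx * base + (idx < extra ? idx : extra);
    r.count = base + (idx < extra ? 1 : 0);
    return r;
}
```

With contiguous blocks, every frame except the first one in each block still follows its true predecessor, which is presumably why this trial avoids the quality loss seen in trial 1.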

12 Diagram: input wav file → Thread A (encoding) and Thread B (encoding) → output spx file

13 Threading by data decomposition – trial 2 87% speedup for 2 threads. 93% speedup for 22 threads.

14

15 Threading by data decomposition – trial 2 Passed VTune's thread checker.

16 vq_nbest vq_nbest finds the n best vectors in the codebook: it checks each codebook vector against the given vector and compares the result with the best vectors found so far. In a regular run, the loop is performed 256 times on average, which is a big waste of time.
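The search described above can be sketched as an n-best insertion loop. This is an illustration only, not the Speex source: `nbest_search` is a hypothetical name, and plain squared Euclidean distance is an assumed metric (the real routine may score entries differently).

```c
#include <assert.h>
#include <float.h>

/* Keeps the n closest codebook entries, sorted by ascending distance. */
void nbest_search(const float *target, const float *codebook,
                  int len, int entries, int n,
                  int *best_index, float *best_dist)
{
    assert(len > 0 && n > 0 && n <= entries);
    for (int k = 0; k < n; ++k) {
        best_index[k] = -1;
        best_dist[k] = FLT_MAX;
    }

    for (int i = 0; i < entries; ++i) {
        /* Squared Euclidean distance to entry i (an assumed metric). */
        float dist = 0.0f;
        for (int j = 0; j < len; ++j) {
            float d = target[j] - codebook[i * len + j];
            dist += d * d;
        }
        if (dist >= best_dist[n - 1])
            continue;                      /* not among the n best so far */

        /* Insertion step: shift worse candidates down, then place entry i. */
        int k = n - 1;
        while (k > 0 && dist < best_dist[k - 1]) {
            best_dist[k] = best_dist[k - 1];
            best_index[k] = best_index[k - 1];
            --k;
        }
        best_dist[k] = dist;
        best_index[k] = i;
    }
}
```

The outer loop runs once per codebook entry, so its iteration count is the figure the slide is concerned with.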

17 Threading by functional decomposition - Trial 1 We split the codebook search between a few threads. Each thread received a portion of the codebook and was supposed to find the n best vectors in it. Then some of the threads would find the n best among their own vectors and those found by a different thread, repeating until a single set of n best vectors remained.
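The combining step can be sketched as a merge of two sorted n-best lists. `merge_nbest` is a hypothetical helper, and it assumes each partial list holds exactly n entries sorted by ascending distance and that the output arrays do not alias the inputs.

```c
#include <assert.h>

/* Merges two ascending n-best lists into the n best overall. */
void merge_nbest(const int *idx_a, const float *dist_a,
                 const int *idx_b, const float *dist_b,
                 int n, int *idx_out, float *dist_out)
{
    assert(n > 0);
    int a = 0, b = 0;
    for (int k = 0; k < n; ++k) {
        /* a + b == k < n here, so both read positions stay in range. */
        if (dist_a[a] <= dist_b[b]) {
            idx_out[k] = idx_a[a];
            dist_out[k] = dist_a[a];
            ++a;
        } else {
            idx_out[k] = idx_b[b];
            dist_out[k] = dist_b[b];
            ++b;
        }
    }
}
```

Applying the merge pairwise reduces the per-thread results to one final list.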

18 Threading by functional decomposition - Trial 1 Because of the overhead of creating all the threads, we actually got worse performance. It takes approximately 72 µs to create a thread, while the original vq_nbest runtime is 13 µs. Even if we create only 1 thread per call, and ignoring the overhead of synchronizing the threads, thread creation alone costs about 5.5 times the original vq_nbest runtime (72/13 ≈ 5.5). As a result, the total runtime of Speex was 1.45 minutes!

19 Threading by functional decomposition - Trial 2 A "main" thread and a few "worker" threads that live throughout the whole run of Speex. The main thread performs all the routines that must not run in parallel. Every time it encounters a function that can be split across threads, it hands the work to the existing worker threads.
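One way to keep workers alive and hand them work on demand is a small pool built on a mutex and two condition variables. The sketch below assumes POSIX threads; `pool`, `pool_start`, `pool_run`, `pool_stop` and the demo `sum_job` are hypothetical names, not taken from the project.

```c
#include <pthread.h>
#include <stddef.h>

#define POOL_WORKERS 2            /* assumed: one worker per core */

typedef void (*job_fn)(int worker_id, void *ctx);

typedef struct {
    pthread_t threads[POOL_WORKERS];
    pthread_mutex_t lock;
    pthread_cond_t work_ready, work_done;
    job_fn job;                   /* function currently being split */
    void *ctx;
    unsigned generation;          /* bumped once per dispatched job */
    int started, finished, next_id, shutting_down;
} pool;

static void *pool_worker(void *p)
{
    pool *pl = (pool *)p;
    unsigned seen = 0;
    pthread_mutex_lock(&pl->lock);
    int id = pl->next_id++;
    for (;;) {
        while (!pl->shutting_down && pl->generation == seen)
            pthread_cond_wait(&pl->work_ready, &pl->lock);
        if (pl->shutting_down)
            break;
        seen = pl->generation;
        job_fn job = pl->job;
        void *ctx = pl->ctx;
        pthread_mutex_unlock(&pl->lock);
        job(id, ctx);             /* run this worker's share outside the lock */
        pthread_mutex_lock(&pl->lock);
        if (++pl->finished == pl->started)
            pthread_cond_signal(&pl->work_done);
    }
    pthread_mutex_unlock(&pl->lock);
    return NULL;
}

/* Wakes the workers, lets them exit, and joins them. */
void pool_stop(pool *pl)
{
    pthread_mutex_lock(&pl->lock);
    pl->shutting_down = 1;
    pthread_cond_broadcast(&pl->work_ready);
    pthread_mutex_unlock(&pl->lock);
    for (int w = 0; w < pl->started; ++w)
        pthread_join(pl->threads[w], NULL);
    pthread_mutex_destroy(&pl->lock);
    pthread_cond_destroy(&pl->work_ready);
    pthread_cond_destroy(&pl->work_done);
}

/* Creates the long-lived workers once; returns 0 on success. */
int pool_start(pool *pl)
{
    pl->job = NULL; pl->ctx = NULL; pl->generation = 0;
    pl->started = pl->finished = pl->next_id = pl->shutting_down = 0;
    pthread_mutex_init(&pl->lock, NULL);
    pthread_cond_init(&pl->work_ready, NULL);
    pthread_cond_init(&pl->work_done, NULL);
    for (int w = 0; w < POOL_WORKERS; ++w) {
        if (pthread_create(&pl->threads[w], NULL, pool_worker, pl) != 0) {
            pool_stop(pl);
            return -1;
        }
        pl->started++;
    }
    return 0;
}

/* Runs job(id, ctx) on every worker and blocks until all have finished. */
void pool_run(pool *pl, job_fn job, void *ctx)
{
    pthread_mutex_lock(&pl->lock);
    pl->job = job; pl->ctx = ctx; pl->finished = 0;
    pl->generation++;
    pthread_cond_broadcast(&pl->work_ready);
    while (pl->finished < pl->started)
        pthread_cond_wait(&pl->work_done, &pl->lock);
    pthread_mutex_unlock(&pl->lock);
}

/* Demo job (hypothetical): each worker sums a strided share of an array. */
typedef struct { const int *data; int n; long partial[POOL_WORKERS]; } sum_ctx;

void sum_job(int id, void *p)
{
    sum_ctx *c = (sum_ctx *)p;
    c->partial[id] = 0;
    for (int i = id; i < c->n; i += POOL_WORKERS)
        c->partial[id] += c->data[i];
}
```

Each `pool_run` call costs a wake-up and a wait rather than a thread creation, but that synchronization is still overhead that a function running for only a few microseconds has to absorb.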

20 Threading by functional decomposition - Trial 2 The original vq_nbest runs for a total of 10.5 sec. After the SSE improvement, it runs for a total of 8 sec. Perfect parallelization would bring this to 4 sec, plus synchronization overhead, for about 5 sec. → Total Speex runtime would drop to 33.14 sec. In the most optimistic prediction, the speedup of the new method including the SSE would be 16.5% → 10% improvement on top of the SSE improvement.
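The prediction can be checked with a few lines of arithmetic. The sketch assumes that "speedup" means old runtime divided by new runtime, minus one; the slides do not state the convention, but this one roughly reproduces their figures. Both helper names are hypothetical.

```c
#include <assert.h>

/* Speedup in percent, assuming the convention t_old / t_new - 1. */
double speedup_percent(double t_old, double t_new)
{
    assert(t_old > 0.0 && t_new > 0.0);
    return (t_old / t_new - 1.0) * 100.0;
}

/* Total runtime after one function's time changes from f_old to f_new. */
double total_after_change(double total, double f_old, double f_new)
{
    return total - f_old + f_new;
}
```

Starting from the predicted total of 33.14 sec with vq_nbest at 5 sec, `total_after_change(33.14, 5.0, 10.5)` gives the implied original total of 38.64 sec and `total_after_change(33.14, 5.0, 8.0)` gives 36.14 sec for the SSE-only version. Under the assumed convention these yield speedups of about 16.6% and 9.1%, close to the 16.5% and 10% quoted on the slide.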

21 Threading by Pipelined Data Domain Decomposition Each frame's encoding depends on the previous frame's gain, excitation and adaptive codebook. One main object holds all the parameters for the encoding process.

22 Diagram text (encoder stages and their inputs), functions shown: qmf_decomp, speex_encode, spx_autocorr, lpc_to_lsp, iir_mem2, fir_mem_up. Compute the two sub-bands using the input wav and h0, h1 (filters computed in the previous frame); encode the narrowband part using the input from qmf_decomp and h0; high-band buffering / sync with low band; start encoding the high-band; compute auto-correlation using the high band mixed with the narrow band; moving the high band bits in the time domain and inserting the narrow band bits in the correct place; filtering in the power-spectrum domain, using the auto-correlation; WLD Levinson-Durbin, using the auto-correlation and the lpc; LPC to LSPs (x-domain) transform; x-domain to angle domain of lsp; final signal synthesis from excitation, using the excitation, the high band part and the lsp; using the h0, h1 filters and the gain on the full frame; LSP quantization; LSP interpolation (quantized and unquantized), using the current lsp and the previous frame's lsp; compute mid-band (4000 Hz for wideband) response of low-band and high-band filters, using the lsp, excitation and gain of the wideband.

23 Threading by Pipelined Data Domain Decomposition Unfortunately, we could not perform Pipelined Data Domain Decomposition on Speex.

24 Streaming SIMD Extensions Using SSE intrinsics, we rewrote the following functions: – inner_prod Function speedup: 52%. Total speedup: 1.31% – vq_nbest Function speedup: 31.25%. Total speedup: 7% – vq_nbest_sign Function speedup: 40%. Total speedup: 3%

25 – split_cb_search_shape_sign Function speedup: 2%. Total speedup: 0.5%
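As an illustration of the kind of rewrite listed above, here is a minimal inner product written with SSE intrinsics. It is a sketch, not the project's code: `inner_prod_sse` is a hypothetical name, the length is assumed to be a multiple of 4, and unaligned loads are used so that no alignment is assumed.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Dot product of x and y; len is assumed to be a multiple of 4. */
float inner_prod_sse(const float *x, const float *y, int len)
{
    assert(len >= 0 && len % 4 == 0);

    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < len; i += 4) {
        __m128 a = _mm_loadu_ps(x + i);   /* load 4 floats, no alignment assumed */
        __m128 b = _mm_loadu_ps(y + i);
        sum = _mm_add_ps(sum, _mm_mul_ps(a, b));  /* 4 multiply-adds per iteration */
    }

    /* Horizontal add of the four partial sums. */
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));         /* lanes 0+2 and 1+3 */
    sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55));  /* lane 0 + lane 1 */
    return _mm_cvtss_f32(sum);
}
```

The four lanes accumulate independently, so the floating-point summation order differs from a scalar loop and results can differ in the last bits for general inputs.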

26

27

28 Compiling with Intel® C Compiler We used the following flags to compile the final version of Speex, with the SSE and threading improvements: General: – Detect 64-bit portability issues: Yes (/Wp64) – Optimization: Maximize Speed (/O2) – Inline intrinsic functions: Yes (/Oi) – Favor size or speed: Favor Fast Code (/Ot)

29 Intel specific: – Global optimization: Yes (/Og) – Floating point precision improvement: None – Floating point speculation: Fast (/Qfp-speculation:fast) – Use Intel® processor extensions: Intel® Core™ 2 Duo Processor (/QaxT) – Parallelization: Enable Parallelization (/Qparallel)

30 Chart labels: main thread; 2 worker threads

31 Chart (axis in seconds)

32 Chart (axis in %)

33 Summary We got a 127.54% speedup using threading, intrinsic SSE functions and the Intel compiler. The project goal was achieved!

34 Thank you Leetal and Tsafrir

