Speex encoder project. Presented by: Leetal Gruper and Tsafrir Kamelo. Instructor: Zvika Guz. Software performance enhancement using multithreading, SIMD and architectural considerations.

Speex: Speex is an Open Source/Free Software audio compression format designed for speech. It is designed for packet networks and voice over IP (VoIP) applications; file-based compression is of course also supported.

Initial code analysis – chart: original Speex run time.

Initial code analysis – chart: functions tuned using SSE, and functions tuned using threading & SSE.

Speex encoding overview – diagram: the audio stream is split into frames (Frame 1, Frame 2, Frame 3, Frame 4); each frame is read, encoded, and written to the output.
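To make the read-encode-write cycle concrete, here is a minimal sketch of a frame-by-frame encoding loop built on the public Speex API (speex_encoder_init, speex_encode_int, speex_bits_write). The narrowband mode, the length-prefix framing and the buffer sizes are illustrative assumptions, not the project's actual code.

```c
#include <stdio.h>
#include <speex/speex.h>

/* Minimal frame-by-frame encoding loop (sketch, not the project's code). */
int encode_file(FILE *in, FILE *out)
{
    void *enc = speex_encoder_init(&speex_nb_mode);   /* narrowband encoder state */
    SpeexBits bits;
    int frame_size;
    short frame[640];
    char byte_buf[1024];

    speex_bits_init(&bits);
    speex_encoder_ctl(enc, SPEEX_GET_FRAME_SIZE, &frame_size);

    /* Read frame -> encode -> write frame, until the input is exhausted. */
    while (fread(frame, sizeof(short), frame_size, in) == (size_t)frame_size) {
        speex_bits_reset(&bits);
        speex_encode_int(enc, frame, &bits);
        int nbytes = speex_bits_write(&bits, byte_buf, sizeof(byte_buf));
        fwrite(&nbytes, sizeof(int), 1, out);          /* simple length prefix */
        fwrite(byte_buf, 1, nbytes, out);
    }

    speex_bits_destroy(&bits);
    speex_encoder_destroy(enc);
    return 0;
}
```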

Our work on Speex: threading, SSE, and compiling with the Intel compiler.

Threading by data decomposition – trial 1: create a couple of threads and give each of them a frame to work on. A main thread reads frames from the input file and creates a "worker" thread for each frame; each worker thread encodes its frame and writes it to the output file.

Problem & Solution. Problem: creating a new thread for every new frame costs a lot of time. Solution: create a fixed number of threads (equal to the number of CPU cores) only once, at the beginning. For example, with 2 cores we have a main thread and 2 worker threads: one handles all the even frames and the other handles all the odd frames.
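A minimal sketch of this fixed-pool idea, assuming POSIX threads and hypothetical helpers (encode_frame, create_encoder_state); each worker walks the frame indices with a stride equal to the number of workers (even/odd for two threads). This illustrates the scheme described above, not the project's actual code.

```c
#include <pthread.h>

#define NUM_WORKERS 2          /* one worker per CPU core (assumption: 2 cores) */

/* Hypothetical helpers standing in for the real Speex calls. */
extern int  total_frames;
extern void encode_frame(void *encoder_state, int frame_index);
extern void *create_encoder_state(void);

typedef struct { int id; } worker_arg_t;

static void *worker(void *p)
{
    worker_arg_t *arg = (worker_arg_t *)p;
    void *enc = create_encoder_state();        /* per-thread encoder state */

    /* Worker 0 takes frames 0, 2, 4, ...; worker 1 takes 1, 3, 5, ... */
    for (int i = arg->id; i < total_frames; i += NUM_WORKERS)
        encode_frame(enc, i);
    return NULL;
}

void encode_all_frames(void)
{
    pthread_t tid[NUM_WORKERS];
    worker_arg_t args[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++) {    /* threads are created once */
        args[i].id = i;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
}
```

Because each worker keeps its own encoder state, consecutive frames handled by different workers lose the previous-frame prediction context, which is exactly the quality problem noted on the next slide.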

Diagram: the main thread reads the input wav file and hands frames to the encoding threads, which produce the output spx file.

Threading by data decomposition – trial 1: 65% speedup! But… although Speex is robust to packet loss, it does depend on the previous packet when it can → a definite reduction in sound quality.

Threading by data decomposition – trial 2: split the file into two. One thread handles the first half of the file and the other handles the second half.
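A sketch of the split-in-two approach, again assuming POSIX threads and hypothetical helpers (count_frames, encode_range, concatenate); each thread runs an independent encoder over a contiguous half of the frames and writes its own temporary output, and the two parts are concatenated afterwards. The helper names and the temporary-file strategy are assumptions made for illustration.

```c
#include <pthread.h>

/* Hypothetical helpers standing in for the real file/encoder handling. */
extern int  count_frames(const char *wav_path);
extern void encode_range(const char *wav_path, const char *spx_part,
                         int first_frame, int last_frame);
extern void concatenate(const char *out_path, const char *part0, const char *part1);

typedef struct {
    const char *wav_path;
    const char *part_path;
    int first, last;
} half_t;

static void *encode_half(void *p)
{
    half_t *h = (half_t *)p;
    encode_range(h->wav_path, h->part_path, h->first, h->last);
    return NULL;
}

void encode_split(const char *wav_path, const char *out_path)
{
    int n = count_frames(wav_path);
    half_t halves[2] = {
        { wav_path, "half0.spx.part", 0,     n / 2 - 1 },
        { wav_path, "half1.spx.part", n / 2, n - 1     },
    };
    pthread_t tid[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, encode_half, &halves[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    /* Only one "seam" between the halves, so the inter-frame prediction
       context is broken once per file instead of at every frame. */
    concatenate(out_path, "half0.spx.part", "half1.spx.part");
}
```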

Diagram: Thread A and Thread B encode in parallel, each reading its half of the input wav file and writing to the output spx file.

Threading by data decomposition – trial 2: 87% speedup for 2 threads; 93% speedup for 22 threads.

Threading by data decomposition – trial 2: passed the VTune thread checker.

vq_nbest: finds the n best vectors in the codebook, i.e., it checks each codebook vector against the given vector and compares it against the rest of the vectors it has already found to be best. In a regular run the loop is performed 256 times on average – a big waste of time.
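A simplified scalar sketch of what an n-best codebook search like vq_nbest does. The signature and the squared-error distance used here are assumptions for illustration, not the exact Speex routine.

```c
#include <float.h>

/* Find the N codebook entries closest (squared error) to the target vector.
   Simplified illustration of an n-best search; not the exact Speex routine. */
void nbest_search(const float *target, const float *codebook,
                  int len, int entries, int N, int *nbest, float *best_dist)
{
    for (int k = 0; k < N; k++) {
        best_dist[k] = FLT_MAX;
        nbest[k] = -1;
    }

    for (int i = 0; i < entries; i++) {            /* typically ~256 entries */
        float dist = 0.f;
        for (int j = 0; j < len; j++) {
            float d = target[j] - codebook[i * len + j];
            dist += d * d;
        }
        /* Insert entry i into the sorted list of the N best found so far. */
        for (int k = 0; k < N; k++) {
            if (dist < best_dist[k]) {
                for (int m = N - 1; m > k; m--) {
                    best_dist[m] = best_dist[m - 1];
                    nbest[m] = nbest[m - 1];
                }
                best_dist[k] = dist;
                nbest[k] = i;
                break;
            }
        }
    }
}
```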

Threading by functional decomposition – trial 1: we split the search in the codebook between a couple of threads. Each thread received a portion of the codebook and was supposed to find the n best vectors in it. Then some of the threads would find the n best among their own vectors and the ones a different thread found, until we got one set of n best vectors.

Threading by functional decomposition – trial 1: because of the overhead of creating all the threads, we actually got worse performance. It takes approximately 72 µs to create a thread, while the original vq_nbest runtime is 13 µs. Even if we create only 1 thread each time – not to mention the overhead of synchronizing all the threads – we still get a slowdown of ~5.5 in the vq_nbest runtime (72 µs / 13 µs ≈ 5.5). As a result, the total runtime of Speex was 1.45 minutes!

Threading by functional decomposition – trial 2: a "main" thread and a couple of "worker" threads that live throughout the whole run of Speex. The main thread does all the routines that must not be done in parallel, and every time it encounters a function that can be split across threads, it hands the work to the existing worker threads.
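A minimal sketch of such a persistent worker pool, using POSIX threads and barriers to hand out codebook chunks. The barrier-based hand-off and the search_chunk helper are assumptions made for illustration; the point is that the thread-creation cost is paid only once, at start-up.

```c
#include <pthread.h>

#define NUM_WORKERS 2

/* Hypothetical per-chunk work function (e.g. an n-best search over part of
   the codebook); it stands in for the real Speex routine being parallelized. */
extern void search_chunk(int worker_id, int first_entry, int last_entry);

static pthread_barrier_t start_barrier, done_barrier;
static int chunk_first[NUM_WORKERS], chunk_last[NUM_WORKERS];

static void *worker(void *p)
{
    int id = (int)(long)p;
    for (;;) {                                    /* lives for the whole run */
        pthread_barrier_wait(&start_barrier);     /* wait for the main thread */
        search_chunk(id, chunk_first[id], chunk_last[id]);
        pthread_barrier_wait(&done_barrier);      /* report completion */
    }
    return NULL;                                  /* shutdown omitted for brevity */
}

void pool_init(pthread_t *tid)
{
    pthread_barrier_init(&start_barrier, NULL, NUM_WORKERS + 1);
    pthread_barrier_init(&done_barrier, NULL, NUM_WORKERS + 1);
    for (long i = 0; i < NUM_WORKERS; i++)        /* threads created once */
        pthread_create(&tid[i], NULL, worker, (void *)i);
}

/* Called by the main thread whenever it reaches a parallelizable search. */
void pool_run_search(int entries)
{
    int per = entries / NUM_WORKERS;
    for (int i = 0; i < NUM_WORKERS; i++) {
        chunk_first[i] = i * per;
        chunk_last[i]  = (i == NUM_WORKERS - 1) ? entries - 1 : (i + 1) * per - 1;
    }
    pthread_barrier_wait(&start_barrier);         /* release the workers */
    pthread_barrier_wait(&done_barrier);          /* wait until all are done */
}
```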

Threading by functional decomposition – trial 2: the original vq_nbest runs for a total of 10.5 sec; after the SSE improvement it runs for a total of 8 sec. Perfect parallelization would bring us to 4 sec plus ~1 sec of synchronization overhead ≈ 5 sec → the total Speex runtime would drop to 33.14 sec. In the most optimistic prediction, the speedup of the new method including SSE would be 16.5% → a 10% improvement on top of the SSE improvement.

Threading by pipelined data domain decomposition: each frame's encoding depends on the previous frame's gain, excitation and adaptive codebook, and one main object holds all the parameters for the encoding process.

Diagram: the per-frame wideband encoding flow. Stages include:
– qmf_decomp: compute the two sub-bands using the input wav and the h0, h1 filters (computed in the previous frame)
– speex_encode: encode the narrowband part using the input from qmf_decomp and h0
– High-band buffering / sync with the low band; start encoding the high band
– Moving the high-band bits in the time domain and inserting the narrow-band bits in the correct place
– spx_autocorr: compute the auto-correlation using the high band mixed with the narrow band
– Filtering in the power-spectrum domain, using the auto-correlation
– WLD: Levinson-Durbin using the auto-correlation and the LPC
– lpc_to_lsp: LPC to LSPs (x-domain); transform from the x-domain to the angle domain of the LSPs
– LSP quantization; LSP interpolation (quantized and unquantized) using the current LSPs and the previous frame's LSPs
– Compute the mid-band (4000 Hz for wideband) response of the low-band and high-band filters
– iir_mem2, fir_mem_up: final signal synthesis from excitation, using the excitation, the high-band part and the LSPs; using the h0, h1 filters and the gain on the full frame; using the LSPs, excitation and gain of the wideband

Threading by pipelined data domain decomposition: unfortunately, we could not apply pipelined data domain decomposition to Speex.

Streaming SIMD Extensions: using intrinsic SSE commands we re-wrote the following functions (a sketch of an SSE inner product follows this list):
– inner_prod: function speedup 52%; total speedup 1.31%
– vq_nbest: function speedup 31.25%; total speedup 7%
– vq_nbest_sign: function speedup 40%; total speedup 3%
– split_cb_search_shape_sign: function speedup 2%; total speedup 0.5%
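As an example of the kind of rewrite involved, here is a minimal SSE sketch of a floating-point inner product using compiler intrinsics; it assumes the length is a multiple of 4 and ignores alignment handling, so it illustrates the technique rather than reproducing the project's exact inner_prod.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Inner product of two float vectors, 4 elements per iteration.
   Assumes len is a multiple of 4; unaligned loads keep the sketch simple. */
float inner_prod_sse(const float *x, const float *y, int len)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < len; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));   /* 4 partial sums */
    }

    /* Horizontal sum of the 4 partial sums. */
    float partial[4];
    _mm_storeu_ps(partial, acc);
    return partial[0] + partial[1] + partial[2] + partial[3];
}
```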

Compiling with the Intel® C Compiler: we used the following flags for the compilation of the final version of Speex, with the SSE and the threading improvements.
General:
– Detect 64-bit portability issues – Yes (/Wp64)
– Optimization – Maximize Speed (/O2)
– Inline intrinsic functions – Yes (/Oi)
– Favor size or speed – Favor Fast Code (/Ot)

Intel specific:
– Global optimization – Yes (/Og)
– Floating-point precision improvement – None
– Floating-point speculation – Fast (/Qfp-speculation:fast)
– Use Intel® processor extensions – Intel Core™ 2 Duo processor (/QaxT)
– Parallelization – Enable Parallelization (/Qparallel)
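For reference, these settings correspond roughly to a command line like the one below (the source placeholder and the exact spelling of /Qfp-speculation are our assumptions; the project configured the options through the Visual Studio property pages rather than by hand):

```
icl /O2 /Oi /Ot /Og /Wp64 /QaxT /Qparallel /Qfp-speculation:fast <libspeex sources> /Fespeexenc.exe
```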

Diagram: final threading structure – a main thread and 2 worker threads.

Chart: run time results (sec).

Chart: speedup results (%).

Summary: we got % speedup using threading, intrinsic SSE functions and the Intel compiler. The project goal was achieved!!

Thank you Leetal and Tsafrir