Multi-Threading LAME MP3 Encoder

Presentation transcript:

Multi-Threading LAME MP3 Encoder
TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY, Electrical Engineering Department, Software Systems Lab
Performed by: Gilad Riachshtian
Copyright © 2004 Gilad Raichshtain

Talk Layout
- What is the L.A.M.E. Project?
- Project Goal
- MP3 Encoding & Hyper-Threading Overview
- Multi-Threading strategies
- Results & Remarks
- Future Work

What is the L.A.M.E. Project?
- An Open Source project
- An educational tool used for learning about MP3 encoding
- Its goal is to improve psycho-acoustic quality and the speed of MP3 encoding
- LAME is the most popular state-of-the-art MP3 encoder/decoder, used by today’s leading products
For more info: http://lame.sourceforge.net

Project Goal
- Speed up the encoding of an audio stream
- Turn LAME into a Multi-Threaded (MT) engine
- Be 1:1 bit compatible with the original version
- Optimize specifically for SMT platforms (implementation on Intel’s P4 with Hyper-Threading Technology)
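The 1:1 bit-compatibility goal can be checked mechanically by comparing the output of the original encoder with the output of the multi-threaded build. The following is a minimal sketch, assuming both outputs are already loaded as bytes; `first_mismatch` is a hypothetical helper name and is not part of LAME.

```python
from typing import Optional


def first_mismatch(reference: bytes, candidate: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (ref_byte, cand_byte) in enumerate(zip(reference, candidate)):
        if ref_byte != cand_byte:
            return offset
    if len(reference) != len(candidate):
        # One stream is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None
```

Reporting an offset rather than a plain yes/no makes it easier to see which part of the stream diverged when the two builds disagree.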

Thread Level Parallelism
Hyper-Threading Technology provides thread-level parallelism on each processor, resulting in:
- increased use of processor execution resources
- higher processing throughput
This is achieved by duplicating the architectural state on each processor while sharing one set of processor execution resources.

MP3 Encoding Overview (specifically in LAME)
Break up the audio stream into frames (uniform chunks, typically ~1K).
[Figure: an audio stream divided into Frame 1, Frame 2, Frame 3, Frame 4.]
Stages shown for each frame: Read Frame, Psycho-Acoustic Perceptual Model, Analysis Filterbank, MDCT, Quantization, Huffman Encoding, Bitstream Encode.
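The framing step can be sketched as follows. This is only an illustration: an MPEG-1 Layer III frame covers 1152 samples per channel, which matches the "~1K" on the slide, but the function name and the zero-padding of the last frame are assumptions made for the example and do not reflect how LAME handles the end of a stream.

```python
from typing import List, Sequence

FRAME_SIZE = 1152  # samples per MPEG-1 Layer III frame


def split_into_frames(samples: Sequence[int],
                      frame_size: int = FRAME_SIZE) -> List[List[int]]:
    """Cut a sample stream into uniform frames, zero-padding the last one."""
    frames = []
    for start in range(0, len(samples), frame_size):
        frame = list(samples[start:start + frame_size])
        # Assumption for this sketch: pad the tail frame with silence.
        frame.extend([0] * (frame_size - len(frame)))
        frames.append(frame)
    return frames
```

Each resulting frame would then pass through the stages listed above, from the psycho-acoustic model to bitstream encoding.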

LAME MT – Intuitive approach
The intuitive approach: divide the frames between two threads.
[Figure: Frames 1-6 distributed between Thread 1 and Thread 2.]
- An unbreakable dependence due to Huffman Encoding
- This is actually Data Decomposition
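The slide does not spell out the dependence. One plausible reading, stated here as an assumption, is MP3's bit reservoir: bits that a frame leaves unused carry over to later frames, and that number is known only after the frame has been Huffman coded. The toy model below uses a hypothetical fixed budget and made-up bit demands to show why frames coded with no shared state do not reproduce the sequential result.

```python
from typing import List

BITS_PER_FRAME = 100  # hypothetical fixed budget per frame (toy value)


def encode_sequential(demands: List[int]) -> List[int]:
    """Bits spent per frame when unused bits carry over to the next frame."""
    reservoir = 0
    spent = []
    for demand in demands:
        available = BITS_PER_FRAME + reservoir
        used = min(demand, available)
        reservoir = available - used  # known only after this frame is coded
        spent.append(used)
    return spent


def encode_independent(demands: List[int]) -> List[int]:
    """Same frames coded with no shared state, as naive data decomposition would."""
    return [min(demand, BITS_PER_FRAME) for demand in demands]
```

With demands of 60, 130 and 100 bits, the sequential version lets the second frame borrow the 40 bits the first frame saved, while the independent version cannot, so the two outputs differ and bit compatibility is lost.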

LAME MT – Functional Decomposition
[Figure: Frames 1-6 flowing through a two-thread pipeline. Thread labels: T1 and T2; workload labels: Floating Point Intensive and Integer Intensive. Stages shown: Psycho-Acoustic, Read Frame, Analysis Filterbank, MDCT, Quantization, Huffman Encoding.]
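Functional decomposition of this kind is commonly built as a two-stage pipeline: one thread runs the first group of stages on frame N+1 while a second thread finishes frame N. The sketch below shows that structure only. The function and parameter names are hypothetical, `stage_one` and `stage_two` are placeholders (the transcript does not make clear which encoder stages belong to which thread), and error handling inside the second stage is omitted.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the frame stream


def run_two_stage_pipeline(frames: Iterable[Any],
                           stage_one: Callable[[Any], Any],
                           stage_two: Callable[[Any], Any]) -> List[Any]:
    """Run stage_one and stage_two on separate threads, preserving frame order."""
    handoff: queue.Queue = queue.Queue(maxsize=4)  # bounded buffer between threads
    results: List[Any] = []

    def first_thread() -> None:
        try:
            for frame in frames:
                handoff.put(stage_one(frame))
        finally:
            handoff.put(_DONE)  # always let the second thread terminate

    def second_thread() -> None:
        while True:
            item = handoff.get()
            if item is _DONE:
                return
            results.append(stage_two(item))

    t1 = threading.Thread(target=first_thread)
    t2 = threading.Thread(target=second_thread)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Because a single consumer drains a FIFO queue, frames leave the pipeline in the order they entered, which is what a bit-compatible encoder needs. In CPython the global interpreter lock keeps CPU-bound Python code from running truly in parallel, so this illustrates the structure rather than the speedup.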

Results

Results due to Multi-Threading (values are CBR / VBR)

Configuration                 SMT Platform    SMP Platform
Using Microsoft’s Compiler    22% / 32%       38% / 62%
Using Intel’s Compiler 8.1    20% / 29%       44% / 59%

Results using Intel’s Compiler 8.1 (values are CBR / VBR)

Configuration                 SMT Platform    SMP Platform
LAME Original Code 3.97a      21% / 19%       22% / 17%
LAME MT Code                  19% / 17%       28% / 15%

Overall Performance Results (values are CBR / VBR)

Configuration                                SMT Platform    SMP Platform
LAME MT code + Using Intel’s Compiler 8.1    52% / 70%       78% / 109%

Remarks
Architectural issues:
- Pitfall found in version 3.93: memory accesses to two different pages with the same offset
- ~11% speedup achieved by fixing it
- No longer relevant in later versions: no major architectural issues were found in versions 3.94-3.97a
Implement a PNI (Prescott New Instructions, i.e. SSE3) version for FFT:
- No significant gain achieved
Overall, ~40 blocks of code were changed, and all of them are under #ifdef.

Future work

Future Work
- Splitting the encoding process into more than two steps
- Reading frames in parallel

That's all Folks