Software Performance Tuning Project: Flake
Prepared by: Meni Orenbach, Roman Kaplan
Advisors: Zvika Guz, Kobi Gottlieb


FLAC – Free Lossless Audio Codec
FLAC is specially designed for efficient packing of audio data. It can achieve compression ratios of 30%-50% for most music. Flake is a FLAC encoder.

Platform and Benchmark Used
Platform: Intel 64-bit Core 2 Duo at 2.4 GHz, 2 GB of RAM, running Windows XP.
Benchmark:
- A 238 MB song.
- Original encoding duration: … sec.

Algorithm Description
The input file is read frame by frame. Every frame contains a constant number of channels. Each channel is encoded independently using Rice codes (a special case of Golomb coding).
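The Rice coding step can be sketched as follows. This is an illustrative sketch, not Flake's actual code: the function names and helper below are ours. A signed prediction residual is first zig-zag mapped to an unsigned value; a Rice code with parameter k then writes the quotient v >> k in unary (plus a stop bit) followed by the k low bits of v.

```c
#include <stdint.h>

/* Zig-zag map signed residuals to unsigned values:
   0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
   The encoder applies this mapping before Rice coding. */
static uint32_t zigzag(int32_t x)
{
    return (2u * (uint32_t)x) ^ (uint32_t)(x >> 31);
}

/* Length in bits of the Rice code for v with parameter k:
   a unary quotient (v >> k), one stop bit, and k remainder bits.
   Illustrative only - choosing k per partition is the encoder's job. */
static unsigned rice_code_length(uint32_t v, unsigned k)
{
    return (v >> k) + 1 + k;
}
```

Choosing k near log2 of the mean residual magnitude minimizes the total bit count, which is what a function like calc_rice_params() would search for.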

Flake – Data Flow
- Encode every frame.
- For every channel, encode the prediction error produced by the LPC algorithm.

Flake – Optimization Method
We focused on the most time-consuming functions. Two approaches were taken:
- Multithreading.
- SIMD.

Optimization Method 1: Threads
Flake was originally managed by a single thread. Parallelization lets independent work run simultaneously. While parallelizing Flake we considered:
- The algorithm.
- The data flow.

Encoding Process in Flake
(Diagram: the encoding pipeline, with three stages marked "MultiThread Here!")

Conclusions: Possible Ways to Parallelize Flake
1. Parallelize the reads from and writes to the file.
2. Parallelize the encoding phase for each frame separately.
3. Parallelize the encoding phase for each channel separately.
4. A combination of the above.

Our Resolution
We chose to parallelize the channel encoding. Our reasons for doing so:
- The number of channels bounds the number of threads needed.
- File I/O goes through a shared device (the disk), limiting what parallel reads and writes can gain.
- Per-frame encoding would require multiple reads of the file.
- Per-frame encoding would need a higher synchronization rate.

Implementing the Solution, First Try
- Create as many threads as there are channels.
- Every thread encodes and then terminates.
- This solution achieved a speedup of x1.68.
- Repeatedly opening and closing threads added overhead.

VTune Thread Profiler, First Try

Implementing the Solution, Second Try
- Create as many threads as there are channels.
- Every thread encodes and then waits for a signal.
- Thread handles are saved so the same threads can be reused.
- Not closing the threads saves time and yields a bigger speedup!
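The reuse scheme above can be sketched with persistent worker threads. This is an illustrative sketch in portable pthreads, not Flake's actual Win32 implementation (which saves thread handles and signals them): each worker sleeps on a condition variable, encodes its channel when signaled, reports completion, and is torn down only at the very end.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_CHANNELS 2            /* our benchmark has only 2 channels */

static int encoded_frames[MAX_CHANNELS];  /* stand-in for real per-channel output */

/* Stand-in for the real per-channel work (LPC + Rice coding). */
static void encode_channel(int ch)
{
    encoded_frames[ch]++;
}

/* One persistent worker per channel: created once, reused every frame. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  go;           /* main -> worker: a frame is ready  */
    pthread_cond_t  done;         /* worker -> main: channel encoded   */
    bool work_ready, work_done, quit;
    int  channel;
} worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    for (;;) {
        pthread_mutex_lock(&w->lock);
        while (!w->work_ready && !w->quit)
            pthread_cond_wait(&w->go, &w->lock);  /* sleep instead of exiting */
        if (w->quit) {
            pthread_mutex_unlock(&w->lock);
            break;                                /* torn down only at the end */
        }
        w->work_ready = false;
        pthread_mutex_unlock(&w->lock);

        encode_channel(w->channel);               /* the per-frame work */

        pthread_mutex_lock(&w->lock);
        w->work_done = true;
        pthread_cond_signal(&w->done);
        pthread_mutex_unlock(&w->lock);
    }
    return NULL;
}
```

Per frame, the main thread sets work_ready, signals go, and waits on done for every channel, so the thread-creation cost is paid once instead of once per frame.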

VTune Thread Profiler, Second Try
Note: in our benchmark there are only 2 channels.

Speedup Gained Through Multithreading
Total speedup from using MT: x1.85!

Optimization Method 2: SIMD
- Mainly used SSE and SSE2 instructions.
- Operations on double-precision FP and integers.
- SSE was applied to two main functions:
  - calc_rice_params()
  - compute_autocorr()

calc_rice_params() – Improvements
- Integer logical operations.
- The original loop was unrolled by 4.
- The input and output arrays were aligned to prevent split loads.

calc_rice_params() – The Code

Old code:

    for (i = 0; i < n; i++) {
        udata[i] = (2 * data[i]) ^ (data[i] >> 31);
    }

New code:

    for (i = 0; i < n; i += 4) {
        temp1 = _mm_load_si128((__m128i *)(data + i));
        temp2 = _mm_slli_epi32(temp1, 1);    /* 2 * data[i] */
        temp3 = _mm_srai_epi32(temp1, 31);   /* shift right by 31 bits */
        temp1 = _mm_xor_si128(temp2, temp3); /* bitwise XOR */
        _mm_store_si128((__m128i *)(udata + i), temp1);
    }
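The scalar/SSE2 pair above can be checked for bit-exact equivalence. A self-contained sketch (assuming an SSE2-capable x86 CPU, 16-byte-aligned buffers, and n a multiple of 4; the standalone function names are ours):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* The original scalar zig-zag loop from calc_rice_params(). */
void zigzag_scalar(const int32_t *data, uint32_t *udata, int n)
{
    for (int i = 0; i < n; i++)
        udata[i] = (uint32_t)((2 * data[i]) ^ (data[i] >> 31));
}

/* The SSE2 version: 4 values per iteration, aligned loads and stores.
   Requires n % 4 == 0 and 16-byte-aligned data/udata (no split loads). */
void zigzag_sse2(const int32_t *data, uint32_t *udata, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i t1 = _mm_load_si128((const __m128i *)(data + i));
        __m128i t2 = _mm_slli_epi32(t1, 1);   /* 2 * data[i]      */
        __m128i t3 = _mm_srai_epi32(t1, 31);  /* arithmetic >> 31 */
        _mm_store_si128((__m128i *)(udata + i), _mm_xor_si128(t2, t3));
    }
}
```

The arithmetic shift preserves the sign bit across all 32 lanes' bits, so the XOR flips the doubled value exactly as the scalar code does.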

SIMD – compute_autocorr()
compute_autocorr() contains an inlined function, apply_welch_window(), which performs the first calculations. The speedup is therefore reported for both functions together.

Old code vs. new code: apply_welch_window()

Old code (conversion to FP and multiplication):

    for (i = 0; i < (len >> 1); i++) {
        w = (c - i) * (c - i);
        w_data[i] = data[i] * w;
        w_data[len-1-i] = data[len-1-i] * w;
    }

New code (loading 4 integers at once cuts 50% of the load operations):

    iup_align = _mm_load_si128((__m128i *)(data + i));
    fpup = _mm_cvtepi32_pd(iup_align);        /* low 2 ints -> 2 doubles  */
    fpup = _mm_mul_pd(fpup, w_d_low);
    _mm_store_pd(w_data + i, fpup);
    iup_align = _mm_shuffle_epi32(iup_align, _MM_SHUFFLE(1, 0, 3, 2));
    fpup = _mm_cvtepi32_pd(iup_align);        /* high 2 ints -> 2 doubles */
    fpup = _mm_mul_pd(fpup, w_d_high);
    _mm_store_pd(w_data + i + 2, fpup);

compute_autocorr()
Uses the output array from apply_welch_window().
Loop unrolling steps:
1. Every inner loop unrolled by 2.
2. Main loop unrolled by 2, with every inner loop unrolled by 4.

compute_autocorr() – The Code

Scalar code (the long inner loop was already unrolled in the original code):

    for (i = 0; i <= lag; ++i) {                  /* main loop */
        temp = 1.0; temp2 = 1.0;
        for (j = 0; j <= lag - i; ++j)            /* short inner loop */
            temp += data1[j+i] * data1[j];
        for (j = lag + 1; j <= len - 1; j += 2) { /* long inner loop */
            temp  += data1[j]   * data1[j-i];
            temp2 += data1[j+1] * data1[j+1-i];
        }
        autoc[i] = temp + temp2;
    }

SSE body of the long inner loop, using as many aligned loads (and stores) as we can:

    if (lag % 2 == 0) {
        a_high = a_low = _mm_loadu_pd(data1 + j);
        b_low  = _mm_loadu_pd(data1 + j - i);
        b_high = _mm_load_pd (data1 + j - i - 1);
    } else {
        a_high = a_low = _mm_load_pd (data1 + j);
        b_low  = _mm_load_pd (data1 + j - i);
        b_high = _mm_loadu_pd(data1 + j - i - 1);
    }
    /* multiplying and adding the result */
    a_low  = _mm_mul_pd(a_low,  b_low);
    c_low  = _mm_add_pd(a_low,  c_low);
    a_high = _mm_mul_pd(a_high, b_high);
    c_high = _mm_add_pd(a_high, c_high);
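The unroll-by-2 trick above can be sanity-checked against a naive autocorrelation, independent of SSE. An illustrative sketch (simplified relative to the slide's code: sums start at 0 and the short/long loop split is folded into one loop):

```c
/* Naive autocorrelation for lags 0..lag. */
void autocorr_naive(const double *d, int len, int lag, double *autoc)
{
    for (int i = 0; i <= lag; i++) {
        double t = 0.0;
        for (int j = i; j < len; j++)
            t += d[j] * d[j - i];
        autoc[i] = t;
    }
}

/* Same computation with the inner loop unrolled by 2, keeping two
   partial sums like the slide's temp/temp2 trick. */
void autocorr_unroll2(const double *d, int len, int lag, double *autoc)
{
    for (int i = 0; i <= lag; i++) {
        double t1 = 0.0, t2 = 0.0;
        int j = i;
        for (; j + 1 < len; j += 2) {
            t1 += d[j]     * d[j - i];
            t2 += d[j + 1] * d[j + 1 - i];
        }
        if (j < len)                  /* odd remainder element */
            t1 += d[j] * d[j - i];
        autoc[i] = t1 + t2;
    }
}
```

The two independent partial sums break the loop-carried dependence on a single accumulator, which lets the multiply-adds overlap and later maps naturally onto the two-lane SSE2 double registers.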

SIMD Speedup Summary
- calc_rice_params() local speedup: x1.14; overall speedup: x1.04.
- compute_autocorr() local speedup: x1.92! Overall speedup: x1.03.
- Total speedup using SIMD: x1.07.

Intel Tuning Assistant
- With the aligned arrays, split loads no longer occurred.
- No micro-architectural problems were found in the optimized code.

Final Results
A total speedup of x1.985 was achieved by using only MT and SIMD.