Comparative Analysis of Parallel OPIR Compression on Space Processors

Slides:



Advertisements
Similar presentations
Acceleration of Cooley-Tukey algorithm using Maxeler machine
Advertisements

Parallelism Lecture notes from MKP and S. Yalamanchili.
Computer Organization Lab 1 Soufiane berouel. Formulas to Remember CPU Time = CPU Clock Cycles x Clock Cycle Time CPU Clock Cycles = Instruction Count.
Ascendent's Fusion 360 hybrid platform creates a true hybrid surveillance system by utilizing the advantages of Analog, Megapixel, and IP technologies.
EEL6686 Guest Lecture February 25, 2014 A Framework to Analyze Processor Architectures for Next-Generation On-Board Space Computing Tyler M. Lovelly Ph.D.
MULTICORE PROCESSOR TECHNOLOGY.  Introduction  history  Why multi-core ?  What do you mean by multicore?  Multi core architecture  Comparison of.
Fall EE 333 Lillevik 333f06-l20 University of Portland School of Engineering Computer Organization Lecture 20 Pipelining: “bucket brigade” MIPS.
Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.
Parallel Processors Todd Charlton Eric Uriostique.
Software Architecture of High Efficiency Video Coding for Many-Core Systems with Power- Efficient Workload Balancing Muhammad Usman Karim Khan, Muhammad.
Fast Paths in Concurrent Programs Wen Xu, Princeton University Sanjeev Kumar, Intel Labs. Kai Li, Princeton University.
© Fastvideo, Key Points We implemented the fastest JPEG codec Many applications using JPEG can benefit from our codec.
Computer Organization and Architecture 18 th March, 2008.
Portable Image File Viewer ENEE 408G: Multimedia Signal Processing Seun Fabayo John Glancy Gordon Krauthamer.
Video on DSP and FPGA John Johansson April 12, 2004.
1 Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper- Threading Architectures Steven Ge, Xinmin Tian, and Yen-Kuang Chen IEEE Pacific-Rim.
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
GPGPU platforms GP - General Purpose computation using GPU
Synergy.cs.vt.edu Power and Performance Characterization of Computational Kernels on the GPU Yang Jiao, Heshan Lin, Pavan Balaji (ANL), Wu-chun Feng.
2014 IEEE Aerospace Conference March 2, 2014 A Framework to Analyze Processor Architectures for Next-Generation On-Board Space Computing Tyler M. Lovelly.
Measuring zSeries System Performance Dr. Chu J. Jong School of Information Technology Illinois State University 06/11/2012 Sponsored in part by Deer &
CS 1308 Computer Literacy and the Internet. Creating Digital Pictures  A traditional photograph is an analog representation of an image.  Digitizing.
JPEG C OMPRESSION A LGORITHM I N CUDA Group Members: Pranit Patel Manisha Tatikonda Jeff Wong Jarek Marczewski Date: April 14, 2009.
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.
High Throughput Compression of Double-Precision Floating-Point Data Martin Burtscher and Paruj Ratanaworabhan School of Electrical and Computer Engineering.
A New Reference Design Development Environment for JPEG 2000 Applications Bill Finch CAST, Inc. Warren Miller AVNET Design Services DesignCon 2003 January.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Binary Image Compression via Monochromatic Pattern Substitution: A Sequential Speed-Up Luigi Cinque and Sergio De Agostino Computer Science Department.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Gregory Fotiades.  Global illumination techniques are highly desirable for realistic interaction due to their high level of accuracy and photorealism.
GU Junli SUN Yihe 1.  Introduction & Related work  Parallel encoder implementation  Test results and Analysis  Conclusions 2.
Survey on Improving Dynamic Web Performance Guide:- Dr. G. ShanmungaSundaram (M.Tech, Ph.D), Assistant Professor, Dept of IT, SMVEC. Aswini. S M.Tech CSE.
History of Microprocessor MPIntroductionData BusAddress Bus
PFPC: A Parallel Compressor for Floating-Point Data Martin Burtscher 1 and Paruj Ratanaworabhan 2 1 The University of Texas at Austin 2 Cornell University.
Addressing Image Compression Techniques on current Internet Technologies By: Eduardo J. Moreira & Onyeka Ezenwoye CIS-6931 Term Paper.
Hyper Threading Technology. Introduction Hyper-threading is a technology developed by Intel Corporation for it’s Xeon processors with a 533 MHz system.
4/19/20021 TCPSplitter: A Reconfigurable Hardware Based TCP Flow Monitor David V. Schuehler.
Class 9 LBSC 690 Information Technology Multimedia.
MULTICORE PROCESSOR TECHNOLOGY.  Introduction  history  Why multi-core ?  What do you mean by multicore?  Multi core architecture  Comparison of.
Aerospace Conference ‘12 A Framework to Analyze, Compare, and Optimize High-Performance, On-Board Processing Systems Nicholas Wulf Alan D. George Ann Gordon-Ross.
Comp 335 File Structures Data Compression. Why Study Data Compression? Conserves storage space Files can be transmitted faster because there are less.
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
Parallelizing an Image Compression Toolbox MSE Project Final Presentation Hadassa Baker.
1 Hierarchical Parallelization of an H.264/AVC Video Encoder A. Rodriguez, A. Gonzalez, and M.P. Malumbres IEEE PARELEC 2006.
MAHARANA PRATAP COLLEGE OF TECHNOLOGY SEMINAR ON- COMPUTER PROCESSOR SUBJECT CODE: CS-307 Branch-CSE Sem- 3 rd SUBMITTED TO SUBMITTED BY.
Relational Query Processing on OpenCL-based FPGAs Zeke Wang, Johns Paul, Hui Yan Cheah (NTU, Singapore), Bingsheng He (NUS, Singapore), Wei Zhang (HKUST,
Parallel Programming Models
Application and Desktop Sharing
File Compression 3.3.
Presenter: Darshika G. Perera Assistant Professor
Two-Dimensional Phase Unwrapping On FPGAs And GPUs
Bus Systems ISA PCI AGP.
School of Engineering University of Guelph
Evaluating Partial Reconfiguration for Embedded FPGA Applications
File Compression 3.3.
Steven Ge, Xinmin Tian, and Yen-Kuang Chen
ENG3050 Embedded Reconfigurable Computing Systems
Data Compression.
Intel’s Core i7 Processor
Linchuan Chen, Xin Huo and Gagan Agrawal
Edited by : Noor Alhareqi
Edited by : Noor Alhareqi
Performance Evaluation of the Parallel Fast Multipole Algorithm Using the Optimal Effectiveness Metric Ioana Banicescu and Mark Bilderback Department of.
CLUSTER COMPUTING.
Using NFFI Web Services on the tactical level: An evaluation of compression techniques 13th ICCRTS: C2 for Complex Endeavors Frank T. Johnsen.
Hossein Omidian, Guy Lemieux
Chapter 4 Multiprocessors
Experiments on Wireless VR for EHT
Lecture 20 Parallel Programming CSE /27/2019.
Fog Computing for Low Latency, Interactive Video Streaming
Presentation transcript:

Comparative Analysis of Parallel OPIR Compression on Space Processors 2017 IEEE Aerospace Conference Eric Shea M.S. Student University of Pittsburgh An Ho M.S. Graduate, University of Florida (now at Harris) Dr. Alan D. George Mickle Chair Professor of ECE University of Pittsburgh Dr. Ann Gordon-Ross Assoc. Professor of ECE University of Florida

Outline Introduction Video Preprocessing Methods Compression Methods Experimental Setup Platforms Analyzed 16-Bit Encoder Analysis 8-Bit Encoder Analysis 16-Bit vs 8-Bit Encoder Analysis Conclusions and Future Research

Introduction Video compression on space processors Increasing high bit-depth on next-generation image sensors presents unique challenges Limited resources, such as memory and download bandwidth, requires more efficient preprocessing and compression methods Existing methods use limited/incompatible proprietary data encoders lacking preprocessing for limited resource environments Balance between video quality and data throughput We present novel real-time processing on high bit-depth video for different processor architectures Perform preprocessing and compression faster than transmitting raw data based on available bandwidth Achieve better compression ratios with minimal effect on video quality and latency

Video Preprocessing Methods 1) Bit-Stream Splitting (BSS) [1] Splits high bit-depth video into separate upper and lower byte video files Enables use of 8-bit encoders and parallelization 2) SuperFrame [1] Consecutive frames placed in one massive SuperFrame to leverage image encoders Bit-Stream Splitting SuperFrame [1] A. Ho, A. George, A. Gordon-Ross “Improving Compression Ratios for High Bit-Depth Grayscale Video Formats,” Proc. Of IEEE Aerospace Conference 2016

Compression Methods 1) Baseline method 2) Parallelized method Serially compresses each split file using a single thread 2) Parallelized method Uses FFmpeg’s built-in multithreading capabilities to operate on both split files in parallel 3) Enhanced Parallelized method Executes multiple encoder threads in parallel Dynamically re-allocates threads to cores after thread completes execution Performed with BSS and 8-bit encoders

Experimental Setup Using video data taken from 14-bit Overhead-Persistent InfraRed (OPIR) image sensor Video files contain ~4000 frames of various cloud cover Files provided by AFRL Space Using open source FFmpeg compression library [5] Encoders tested: PNG, FFV1, FFVhuff, JPEG-LS, FFV1 V3, x264 Performed lossless compression with 16-bit encoders, and lossless and lossy compression with 8-bit x264 [5] FFMPEG [Online]. Available: http//:www.ffmpeg.org

COTs equivalent and space-grade processors Platforms Analyzed ODROID-C2 Amlogic S905 with quad-core ARM Cortex-A53 (akin to HPSC) 1.50 GHz clock frequency Avnet Zedboard (akin to CSP) Xilinx Zynq-7020 with dual-core ARM Cortex-A9 0.766 GHz clock frequency NXP P5040 Evaluation Board FreeScale P5040 with quad-core e5500 2.20 GHz clock frequency BAE Systems RAD5545 Extrapolated results [8] using device metrics which revealed that RAD5545 achieves, on average, 21.8% of P5040 performance 0.466 GHz clock frequency COTs equivalent and space-grade processors [8] T. Lovelly, D. Bryan, K. Cheng, R. Kreynin, A. George, A. Gordon-Ross, and G. Mounce, “A Framework to Analyze Processor Architecture for Next-Generation On-Board Space Computing,” Proc. of IEEE Aerospace Conference 2014.

16-Bit Encoder Analysis No BSS required PNG and FFV1 version 3 offered significant improvement with increasing number of cores Negligible improvement for FFV1 and FFVhuff FFVhuff offered fastest execution time due to inefficient compression of OPIR video P5040 and Amlogic S905 offered best 16-bit performance due to faster clock speeds 1.0x 1.0x 3.9x 2.6x RAD5545 achieves 21.8% performance of P5040 1.0x 1.0x 1.0x Offers 4.7x average speedup over next fastest platform, S905 No improvement 1.0x No improvement 1.0x 1.0x 2.0x 2.4x 3.9x 1.7x 2.9x 2.6x FFVhuff performed extremely well offering 4.2x speedup over PNG FFVhuff performed extremely well offering 11.6x speedup over PNG

8-Bit Encoder Analysis x264 supports multithreading using pthreads BSS required x264 supports multithreading using pthreads Execution time decreased by nearly 2x using two threads compared to single thread Increases overall with increasing quality factor due to addition of quantization step Augmented compression method performed extremely well with 3.4x average speedup over baseline method Relative speedup between compression methods increases with increasing quality factor 1.0x 1.0x 1.9x 3.2x 3.5x 1.0x RAD5545 achieves 21.8% performance of P5040 S905 outperforms P5040 but was opposite for 16-bit encoders Best speedup between compression methods with increasing quality factor Fastest due to efficient handling of threads 1.3x 1.2x 1.1x 1.0x 1.0x 1.0x Offers 2.8x average speedup over next fastest platform, P5040 1.0x 1.0x 3.6x 4.2x 1.9x 1.0x 2.3x 1.0x 3.5x 2.9x 3.2x 3.2x 1.0x 3.6x 1.0x 1.0x

16-Bit vs 8-Bit Encoder Analysis BSS with x264 compression results in fastest total run-time for bandwidths less than 3 Mbps Provided highest compression ratio [1] FFVhuff provided fastest 16-bit execution time but performed worst in total run-time Does not compress OPIR video efficiently Mix between RAW and JPEGLS being fastest at much higher bandwidths Raw data fastest at all bandwidths Total Run Time = Execution Time + Transfer Time 1.0x FFVhuff averaged 2.6x slower than x264 1.0x 1.0x 2.7x 2.2x 2.7x 2.7x JPEGLS fastest at high bandwidths 1.0x JPEGLS fastest at high bandwidths RAW fastest RAW Fastest [1] A. Ho, A. George, A. Gordon-Ross “Improving Compression Ratios for High Bit-Depth Grayscale Video Formats,” Proc. Of IEEE Aerospace Conference 2016

Conclusions and Future Research Analysis of performance of 16-bit and 8-bit encoders on multiple space processors Analyzed execution time using three different compression methods PNG and FFV1 version 3 offered much improvement with increasing number of cores 16-bit encoders executed faster since BSS not needed P5040 had 4.7x speedup over Amlogic S905 Analyzed total run-time BSS and x264 combination was fastest at lower bandwidths while JPEGLS and RAW were fastest at higher bandwidths Amlogic S905 and P5040 had best performance due to highest clock speeds Expand analysis to new devices and SuperFrame method Use SuperFrame method to leverage image encoders Integrate encoder IP cores on FPGAs and analyze total run-time