HW/SW Co-Design of an MPEG-2 Decoder Pradeep Dhananjay Kiran Divakar Leela Kishore Kothamasu Anthony Weerasinghe.

Outline
Objectives
Background
HW-SW Partitioning
SW/HW Design
Testing and Debug
VGA Display Driver
Results
Lessons Learned
Future Work

Objectives
Accelerate MPEG-2 Decoder
– Identify bottlenecks
– Isolate bottleneck functions and partition design
– Convert SW functions to HW blocks
– Design HW/SW interfaces for communication
– Measure accelerated performance
Design VGA display driver on FPGA
– Attempt to display decoded stream in real time

Background
Development Platform
– TLL-5000 prototyping board
  ARM9 processor, Spartan-3 FPGA, VGA DAC (ADV7125)
Source code for MPEG2 Decoder
– Obtained from sourceforge.net

Background – MPEG2
Consists of a sequence of groups of pictures (GOPs)
Types of pictures
– I-picture (intra coded)
– P-picture (forward predicted)
– B-picture (bidirectionally predicted)

Background – MPEG2

HW-SW Partitioning
Linux profiling done to determine critical functions
– Results based on a particular input (MPEG file)
– Assumed to be representative of a typical use case
– Profiling done on x86 Linux as well as on the board
  gmon.out generated on the board

Profiling on x86-Linux

Profiling on ARM-Linux

HW-SW Partitioning

SW Design
IDCT function uses pointers to access an input array
– Not suitable for synthesis by Catapult-C
– Converted all pointer accesses to array accesses
IDCT performs non-sequential accesses with varying stride
– Modified caller of the IDCT function to reorganize the access pattern into sequential form
– Created temporary array, which is passed to the function
– Array returned from the function is redistributed to the correct locations
Changes to software verified against golden code
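The caller-side reorganization described on this slide can be sketched in C as a gather, compute, scatter sequence. This is only an illustration under stated assumptions: `process_strided`, `kernel_contiguous`, and `BLOCK_LEN` are hypothetical names, and the doubling kernel is a placeholder rather than the decoder's IDCT.

```c
#include <stddef.h>

#define BLOCK_LEN 8   /* hypothetical: one 8-sample row or column */

/* Placeholder for the accelerated kernel (NOT the real IDCT).
   It only accepts a contiguous array, like a synthesizable interface. */
static void kernel_contiguous(short buf[BLOCK_LEN])
{
    for (int i = 0; i < BLOCK_LEN; i++)
        buf[i] = (short)(buf[i] * 2);
}

/* Gather BLOCK_LEN samples spaced `stride` apart into a temporary array,
   run the kernel on it, then scatter the results back to their origin. */
void process_strided(short *data, size_t start, size_t stride)
{
    short tmp[BLOCK_LEN];

    for (int i = 0; i < BLOCK_LEN; i++)
        tmp[i] = data[start + (size_t)i * stride];      /* gather  */

    kernel_contiguous(tmp);                             /* compute */

    for (int i = 0; i < BLOCK_LEN; i++)
        data[start + (size_t)i * stride] = tmp[i];      /* scatter */
}
```

For an 8x8 block stored row-major, `process_strided(block, c, 8)` would handle column `c`, while `process_strided(block, 8 * r, 1)` would handle row `r`.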

SW Flow Chart
MPEG2 SW code … IDCT function call …
Start
– Create temporary buffer
– Pass input values in temporary buffer to FPGA memory
– Issue Start command to FPGA
– Wait for interrupt
  FPGA: IDCT does computation and stores data back in FPGA memory
  FPGA: Generates interrupt signal after computation is done
– Read values from FPGA memory to temporary buffer
– Store values from temp buffer back to original array in order
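The flow chart's software path can be approximated with a host-only simulation. The sketch below assumes a hypothetical in-process stand-in for the FPGA (`fpga_mem`, `fpga_start`, `irq_done`); on the real board these would be memory-mapped accesses and an interrupt handled through the driver, and the computation shown is a placeholder, not the IDCT.

```c
#include <string.h>

#define N 64   /* one 8x8 coefficient block */

/* Simulated device state (hypothetical; memory-mapped on real hardware). */
static short fpga_mem[N];
static volatile int irq_done;

/* Simulated accelerator: placeholder computation, then raise the "interrupt". */
static void fpga_start(void)
{
    for (int i = 0; i < N; i++)
        fpga_mem[i] = (short)(fpga_mem[i] + 1);
    irq_done = 1;
}

/* Software side of the flow chart for a single block. */
void offload_block(short block[N])
{
    short tmp[N];

    memcpy(tmp, block, sizeof tmp);       /* create temporary buffer         */
    memcpy(fpga_mem, tmp, sizeof tmp);    /* pass inputs to FPGA memory      */
    irq_done = 0;
    fpga_start();                         /* issue Start command             */
    while (!irq_done) { }                 /* wait for interrupt              */
    memcpy(tmp, fpga_mem, sizeof tmp);    /* read results from FPGA memory   */
    memcpy(block, tmp, sizeof tmp);       /* store back to original array    */
}
```

In this simulation the wait loop exits immediately because `fpga_start` completes synchronously; a real driver would more likely block in the kernel until the interrupt arrives.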

HW Design
Mentor Catapult-C Synthesis Tool
– High-level synthesis from C/C++ to Verilog RTL

HW Design
High Level Synthesis
– Tool schedules operations on a cycle-by-cycle basis
– Constrained to available resources
  Uses target device and library information
– Built RTL as an interface + controller + datapath

Example: Y = A*C + B*D
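How resource constraints shape the schedule for this expression can be sketched with a small C model. It assumes every operator takes exactly one cycle, which is a simplification and not a Catapult-C result; `schedule_cycles` and `mac_one_multiplier` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Cycle count for Y = A*C + B*D, assuming single-cycle operators:
   the two multiplies share `multipliers` units, then one add follows. */
int schedule_cycles(int multipliers)
{
    assert(multipliers > 0);
    const int mul_ops = 2;
    int mul_cycles = (mul_ops + multipliers - 1) / multipliers;  /* ceiling */
    return mul_cycles + 1;                                       /* final add */
}

/* The same expression, written in the order a one-multiplier
   datapath would execute it. */
int mac_one_multiplier(int a, int b, int c, int d)
{
    int t1 = a * c;    /* cycle 1: multiplier computes A*C   */
    int t2 = b * d;    /* cycle 2: multiplier reused for B*D */
    return t1 + t2;    /* cycle 3: adder produces Y          */
}
```

Under these assumptions one multiplier gives a three-cycle schedule and two multipliers give a two-cycle schedule, which is the area versus latency trade-off a scheduler resolves from the resource constraints.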

HW Design
Code conversion for synthesis
– Isolate IDCT function from MPEG2 code
– Merge initialization functions
  One initialization construct was needed
– Remove all global variables
  Few dependencies for the IDCT function
– Convert pointer arithmetic to array offsets
  Most work needed for this conversion
  No standard guidelines available

HW Design Pointer conversions
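The slide's own before/after code is not reproduced in this transcript. As a minimal sketch of the kind of rewrite involved, the example below shows a pointer-walking routine next to an equivalent version that uses a fixed-size array and explicit offsets. The row-sum computation and all function names are hypothetical and unrelated to the actual IDCT source.

```c
#define ROWS 8
#define COLS 8

/* Before: a helper receives a pointer into the block and walks it. */
static void row_sum_ptr(short *row)
{
    int acc = 0;
    for (short *p = row; p < row + COLS; p++)
        acc += *p;
    *row = (short)acc;
}

void all_rows_ptr(short *block)
{
    for (int r = 0; r < ROWS; r++)
        row_sum_ptr(block + COLS * r);   /* pointer arithmetic in the caller */
}

/* After: fixed-size array and explicit offsets, no pointer arithmetic. */
void all_rows_idx(short block[ROWS * COLS])
{
    for (int r = 0; r < ROWS; r++) {
        int acc = 0;
        for (int c = 0; c < COLS; c++)
            acc += block[COLS * r + c];
        block[COLS * r] = (short)acc;
    }
}
```

Both versions compute the same result; the second makes every access a statically bounded index into one array, which is generally easier for a synthesis tool to map onto a memory.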

HW Design Hardware Interface

HW Design
Verifying isolated IDCT function in C and RTL
– C testbench written to test isolated IDCT function
– Catapult-C allows testing of C function vs. RTL
  Ensures generated RTL matches expected behavior
  Unconverted pointer code generated wrong RTL
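A C testbench of the kind mentioned here typically feeds identical stimulus to a golden function and to the function under test, then compares outputs. The sketch below is an assumption about how such a harness might look, not the authors' testbench: `count_mismatching_blocks`, `golden_negate`, `dut_good`, and `dut_faulty` are hypothetical, and negation stands in for the IDCT.

```c
#include <stdint.h>
#include <string.h>

#define N 64
typedef void (*block_fn)(short blk[N]);

/* Hypothetical golden model and two candidate implementations. */
void golden_negate(short blk[N]) { for (int i = 0; i < N; i++) blk[i] = (short)-blk[i]; }
void dut_good(short blk[N])      { for (int i = N - 1; i >= 0; i--) blk[i] = (short)-blk[i]; }
void dut_faulty(short blk[N])    { golden_negate(blk); blk[0] = (short)(blk[0] + 1); }

/* Small linear congruential generator for reproducible stimulus. */
static uint32_t lcg(uint32_t *s) { *s = *s * 1664525u + 1013904223u; return *s; }

/* Run `trials` random blocks through both functions; count differing blocks. */
int count_mismatching_blocks(block_fn golden, block_fn dut, int trials)
{
    uint32_t seed = 1;
    int bad = 0;

    for (int t = 0; t < trials; t++) {
        short a[N], b[N];
        for (int i = 0; i < N; i++)
            a[i] = (short)((int)(lcg(&seed) >> 20) - 2048);  /* [-2048, 2047] */
        memcpy(b, a, sizeof a);
        golden(a);
        dut(b);
        if (memcmp(a, b, sizeof a) != 0)
            bad++;
    }
    return bad;
}
```

A result of zero for `count_mismatching_blocks(golden_negate, dut_good, 50)` would indicate agreement on all stimulus, while the deliberately broken `dut_faulty` differs on every block.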

HW Design
Integration with communication interface
– Communication FSM given
– Integrate IDCT block

Problems Faced
IDCT RTL would not synthesize at 66 MHz
– 27 MHz clock used instead
IDCT code takes ~30 minutes to synthesize
– Inefficiency of using Catapult-C to generate code
Catapult code difficult to debug
Some reads not returning correct values
– Read/write alignment
– Synthesis could be a problem

Debug Techniques
Removed IDCT block for fast synthesis
– Used to check interface memory writes
– Showed 16-bit writes were not successful
Routed state bits to board LEDs
– Helpful when program hangs due to lack of DTACK
– OR’d DTACK with DIP switch to prevent hang
printf and printk statements to check addresses and data being sent

Delay Values
Hardware delay
– Approximately 10 us to compute IDCT
  Based on cycle count provided by Catapult-C and 27 MHz clock frequency of FPGA
Pure software implementation
– Approximately 30 us
Overhead for communication
– ~15000 us
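The arithmetic behind these figures can be made explicit with a few helper functions. The helper names are hypothetical; the only inputs are the values quoted on the slide (10 us, 27 MHz, 30 us, 15000 us).

```c
/* Cycles spent in hardware: time in microseconds times clock in MHz. */
double cycles_from_us(double us, double clock_mhz)
{
    return us * clock_mhz;
}

/* Speedup of an offloaded call relative to the pure software call. */
double offload_speedup(double sw_us, double hw_us, double comm_us)
{
    return sw_us / (hw_us + comm_us);
}

/* Largest communication overhead at which offloading still breaks even. */
double break_even_comm_us(double sw_us, double hw_us)
{
    return sw_us - hw_us;
}
```

With the slide's numbers, `cycles_from_us(10.0, 27.0)` gives 270 cycles and `offload_speedup(30.0, 10.0, 0.0)` gives a 3x gain for computation alone. Including communication, `offload_speedup(30.0, 10.0, 15000.0)` is about 0.002, so each offloaded call is roughly 500 times slower than software, and `break_even_comm_us(30.0, 10.0)` shows the overhead would need to fall below 20 us for the accelerator to pay off.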

VGA Display Block Diagram
[Block diagram. Blocks: Generated ppm files, VGA Application, Driver, VGA Controller, Main FSM, RAM 1, RAM 2, ADV7125, VGA Monitor. Regions: ARM, FPGA, On Board.]

VGA Hardware: ADV7125 Video DAC
ADV7125 has triple 8-bit video DACs
VGA DAC requires 8-bit R, G, B values
Needs H-Sync and V-Sync signals

VGA Controller
Used double buffer to store frame data
– FIFO implementation didn’t work
  ARM cannot keep up with the display data rate requirement
– Frame resolution: 64x48
– Each frame transfer requires 3072 words
– Used 12 KB of RAM to implement double buffer
One full frame transferred with single driver call
– Reduces system call overhead
– Each call overhead ~26 μs
Interrupt used to communicate to user application
– Fills the next buffer
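The double-buffer scheme can be modeled in C as two frame-sized arrays whose roles swap at a frame boundary. This is a software sketch, not the FPGA controller: the type and function names are hypothetical, and the 16-bit word width is an assumption, chosen because 2 buffers x 3072 words x 2 bytes equals 12288 bytes, which matches the 12 KB figure if that figure covers both buffers.

```c
#include <string.h>

#define FRAME_W 64
#define FRAME_H 48
#define FRAME_WORDS (FRAME_W * FRAME_H)   /* 3072 words per frame */

typedef struct {
    unsigned short ram[2][FRAME_WORDS];   /* stands in for RAM 1 and RAM 2 */
    int display_idx;                      /* buffer currently scanned out  */
} double_buffer;

/* Writer side: one call transfers a full frame into the hidden buffer. */
void db_write_frame(double_buffer *db, const unsigned short *frame)
{
    memcpy(db->ram[1 - db->display_idx], frame, sizeof db->ram[0]);
}

/* Frame boundary: swap roles. The real design would signal the user
   application by interrupt at this point so it can fill the next buffer. */
void db_swap(double_buffer *db)
{
    db->display_idx = 1 - db->display_idx;
}

/* Display side: the buffer the controller would read pixels from. */
const unsigned short *db_display(const double_buffer *db)
{
    return db->ram[db->display_idx];
}
```

Because the writer never touches the buffer being displayed, a slow producer delays the next frame instead of corrupting the current one, which fits the slide's observation that a FIFO could not be kept filled by the ARM.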

VGA Display Demonstration

Lessons Learned
Debugging on an FPGA is difficult!
Hand-conversion of C code could have been more efficient
Create test bench to simulate ARM-FPGA communication
– Allows quick debug of FPGA hardware
– Visibility into internal signals
Hardware partition should have high computation-to-communication ratio
– IDCT called many times with small computation time
– ~10 us of computation; ~15000 us of communication

Future Work
Fix erroneous reads from IDCT
Integrate VGA display driver and MPEG2 decoder

Thank you!