
Motion Estimation Based Frame Rate Conversion Hardware Designs by Özgür Taşdizen PhD Thesis Sabancı University May 2010.


1 Motion Estimation Based Frame Rate Conversion Hardware Designs by Özgür Taşdizen PhD Thesis Sabancı University May 2010

2 Outline
• Introduction
• Motion Estimation
• Hexagon Based Motion Estimation Algorithm and Hardware Architectures for Its Implementation
• Dynamic Step Search Motion Estimation Algorithms and a Hardware Architecture for Their Implementation
• Computation Reductions for Vector Median Filtering
• Frame Interpolation Hardware
• Conclusions

3 Frame Rate Up-Conversion (FRC)
• FRC is the conversion of a lower frame rate video signal to a higher frame rate video signal.
• Broadcasting standards are fixed at 50 Hz for PAL and 60 Hz for NTSC, and movies are recorded at 24, 25, or 30 fps, whereas currently available flat panel displays support frame rates of up to 240 Hz.
• Simple techniques degrade the quality of the video, whereas motion estimation based techniques, which produce high quality results, are difficult to implement in real time.

4 Motion Estimation (ME)
[Figure: Previous Frame, Current Frame, Motion Vectors (MVs)]
• ME is the process of finding similarities between adjacent frames.
• Application areas: video compression, FRC, de-interlacing, de-noising, and super resolution.
• Block based ME is the most preferred method due to its simplicity.

5 Block Based ME
• M, N: MacroBlock (MB) of size MxN
• d(dx, dy): MV
• c: current frame
• r: reference frame
• [-p, p]: search window
The goal is to find the closest matching block.
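The block matching setup above can be sketched in a few lines of Python. This is only an illustration, assuming that the matching criterion is the sum of absolute differences (SAD) used on the following slides, that frames are plain lists of rows, and that the MV points from the current block to the matched block in the reference frame. The names `sad` and `full_search` are hypothetical and do not come from the thesis.

```python
def sad(cur, ref, x, y, dx, dy, m, n):
    """SAD between the MxN block at (x, y) in the current frame and the
    block displaced by (dx, dy) in the reference frame."""
    total = 0
    for j in range(n):          # rows of the block
        for i in range(m):      # columns of the block
            total += abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
    return total


def full_search(cur, ref, x, y, m, n, p):
    """Check every displacement in [-p, p] x [-p, p] that keeps the candidate
    block inside the reference frame; return (best_mv, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = None, None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if not (0 <= x + dx and x + dx + m <= width):
                continue
            if not (0 <= y + dy and y + dy + n <= height):
                continue
            cost = sad(cur, ref, x, y, dx, dy, m, n)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad


# Tiny synthetic example: the current frame is the reference shifted right by 2.
ref = [[(7 * r + 13 * c) % 31 for c in range(12)] for r in range(8)]
cur = [[ref[r][max(c - 2, 0)] for c in range(12)] for r in range(8)]
mv, cost = full_search(cur, ref, x=4, y=2, m=4, n=4, p=2)
print(mv, cost)
```

In this toy case the block content sits two pixels to the left in the reference frame, so the search returns a displacement of (-2, 0) with a SAD of 0.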

6 Full Search (FS)
• FS computes SADs for each search location in the search window.
• Processing HD video requires larger search ranges because of the larger motions between consecutive frames. There are (2p + 1)^2 search locations in a (±p, ±p) search window.
• A search range of (±63, ±63) pixels consists of 16129 search locations. Implementing FS in this search range for 1920x1080 resolution, 25 fps video requires 2.51 TOPS.
[Figure: search locations of FS (p = 4)]
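The two figures on this slide can be reproduced with a short calculation. The sketch below assumes three operations per pixel per search location (subtract, absolute value, accumulate). That count is an assumption, not something the slide states, but it does reproduce the quoted 2.51 TOPS.

```python
def fs_search_locations(p):
    # A (±p, ±p) window holds (2p + 1) positions per axis.
    return (2 * p + 1) ** 2


def fs_ops_per_second(width, height, fps, p, ops_per_pixel=3):
    # Every pixel of every frame takes part in one SAD per search location.
    # ops_per_pixel=3 is an assumed cost model (subtract, abs, accumulate).
    return width * height * fps * fs_search_locations(p) * ops_per_pixel


locations = fs_search_locations(63)
tops = fs_ops_per_second(1920, 1080, 25, 63) / 1e12
print(locations, round(tops, 2))
```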

7 Fast Search Algorithms
• They were developed for low bit-rate applications, and they try to reach the PSNR of FS by checking fewer search locations.
• However, they obtain lower PSNR results for the larger search ranges that are necessary for HD video.
• In addition, they are not suitable for HW implementation due to their sequential nature.
• Successful fast search algorithms: TSS, 2D-LS, NTSS, FSS, BBGDS, DS, HEXBS, ARPS, ADCS, FTS.
[Figure legend: Pixel Location; Search Location of First Step; Selected Location of First Step; Search Location of Second Step; Selected Location of Second Step; Search Location of Third Step; Selected Location of Third Step]

8 Motivation
• Fast search ME algorithms perform very well for low bit-rate applications such as video phones and video conferencing.
• However, fast search algorithms do not produce satisfactory results for recent consumer electronics devices such as high frame rate HD flat panel displays.
• The computational complexity of the FS algorithm is very high, especially for recent HD applications.
• Therefore, we propose new ME algorithms with performance similar to the FS algorithm, together with hardware architectures that implement these algorithms efficiently enough to support real-time processing of HD videos on low cost FPGA devices.

9 Hexagon Based Search (HEXBS)
[Figure legend: Pixel Location; Search Location of First Step; Selected Location of First Step; Search Location of Second Step; Selected Location of Second and Third Steps; Search Location of Third Step; Search Location of Fourth Step; Selected Location of Fourth Step]
[Figure: Coarse Pattern, Fine Pattern]
• Consists of two search patterns: coarse and fine.
• Each new hexagon brings three new search locations.
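A minimal sketch of the coarse and fine steps follows. The pattern offsets are an assumption taken from the commonly described HEXBS formulation (a six-point hexagon, then the four nearest neighbors), because the slide shows them only as a figure. `cost` is a hypothetical callable that returns the matching error for a displacement. For brevity the sketch re-evaluates all six vertices after each move instead of only the three new ones.

```python
# Assumed pattern offsets; the slide gives them only graphically.
COARSE = [(2, 0), (-2, 0), (1, 2), (1, -2), (-1, 2), (-1, -2)]
FINE = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def hexbs(cost, p):
    """Return the displacement chosen by a hexagon based search.

    cost(dx, dy) returns the matching error (for example a SAD);
    p bounds the search window to [-p, p] on each axis.
    """
    def in_window(dx, dy):
        return -p <= dx <= p and -p <= dy <= p

    center = (0, 0)
    best = cost(*center)
    while True:
        # Coarse step: test the hexagon vertices around the current center.
        candidate, candidate_cost = center, best
        for ox, oy in COARSE:
            dx, dy = center[0] + ox, center[1] + oy
            if in_window(dx, dy):
                c = cost(dx, dy)
                if c < candidate_cost:
                    candidate, candidate_cost = (dx, dy), c
        if candidate == center:
            break  # the center is the best point, switch to the fine pattern
        center, best = candidate, candidate_cost

    # Fine step: test the four nearest neighbors of the final center.
    result, result_cost = center, best
    for ox, oy in FINE:
        dx, dy = center[0] + ox, center[1] + oy
        if in_window(dx, dy):
            c = cost(dx, dy)
            if c < result_cost:
                result, result_cost = (dx, dy), c
    return result


# Toy cost surface with its minimum at (4, 0).
mv = hexbs(lambda dx, dy: abs(dx - 4) + abs(dy), p=7)
print(mv)
```

Because each move depends on the result of the previous hexagon, the loop is inherently sequential, which matches the earlier remark about fast search algorithms and hardware.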

10 Proposed Hexagon Based ME Algorithm
[Figure: 32x16 Main Search Pattern; Fine Search Patterns: Plus, Side, Double, Cross]
• Generalization of the HEXBS ME algorithm
• Consists of two search steps: Main and Fine
• The 32x16 main search pattern consists of all the search locations that can be checked by the HEXBS algorithm during several iterations
• Used with various fine search patterns

11 Proposed Smaller Search Patterns
[Figure: 10x9 Main Search Pattern]
• The 10x9, 12x12, and 14x15 main search patterns have a two pixel gap in the vertical direction.
Pattern   Search Range   Number of Search Locations
10x9      (±10,±9)       73
12x12     (±12,±12)      113
14x15     (±14,±15)      159
32x16     (±32,±16)      533

12 Benchmark Suite
Video sequences: TableTennis (704x480, 15 fps), Flowers (704x480, 15 fps), Susie (704x480, 15 fps), Foreman (352x288, 15 fps), Spiderman (720x576, 25 fps), Gladiator (720x576, 25 fps), IRobot (720x576, 25 fps)
• Used to evaluate the performance of the proposed search patterns.
• The video sequences are 100 frames long.

13 Performance Evaluation
MAD(u, v) = [equation image, not included in the transcript]
[Table: MAD results (Frame Distance = 2)]

14 Proposed 16x16 Generic Architecture

15 Comparison of Generic Architectures for Various Block Sizes

16 Proposed Systolic Hardware Architecture

17 Datapath of the Systolic Hardware Architecture
• 256 Processing Elements (PEs) to process a 16x16 MB
• 16 Block RAMs (BRAMs), each configured as 16 bits wide

18 Dataflow Through the Systolic PE Array
• Loading the reference data of the initial search location takes 8 clock cycles.
• Consecutive search locations require only 1 clock cycle each.
• Therefore, completing the search for a single MB takes 672, 236, 176, and 122 clock cycles for the 32x16, 14x15, 12x12, and 10x9 patterns, respectively.

19 Implementation Results
• The proposed architectures are implemented in VHDL, verified with simulations using Modelsim, and mapped to a low cost XC3S1200E-5 FPGA using Xilinx ISE.
• The systolic architecture consumes 6648 LUTs and 16 BRAMs.
• Both hardware architectures can run at 144 MHz on this FPGA, and they can process 25 1920x1080 frames per second for the largest search range of (±32,±16) pixels.
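The link between clock cycles per MB and supported frame rate can be checked with a rough calculation. The sketch below assumes 16x16 MBs counted as a plain area ratio (8100 MBs for 1920x1080) and ignores any stalls outside the per-MB cycle count, so it gives an upper bound rather than a measured figure. The function names are hypothetical.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    # Plain area ratio; gives 8100 for a 1920x1080 frame.
    return width * height / (mb_size * mb_size)


def supported_fps(clock_hz, cycles_per_mb, width, height):
    mbs_per_second = clock_hz / cycles_per_mb
    return mbs_per_second / macroblocks_per_frame(width, height)


# Cycle counts per MB as listed for the systolic architecture.
cycles = {"32x16": 672, "14x15": 236, "12x12": 176, "10x9": 122}
fps = {name: supported_fps(144e6, c, 1920, 1080) for name, c in cycles.items()}
for name, value in fps.items():
    print(name, round(value, 1))
```

Under these assumptions the 32x16 pattern comes out slightly above 26 fps at 144 MHz, which is consistent with the 25 fps claim.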

20 Summary
• A hexagon based ME algorithm with lower computational complexity than the FS ME algorithm is proposed.
• The simulation results showed that the PSNR obtained by this algorithm is better than the PSNR obtained by other fast search algorithms.
• Two high performance hardware architectures, generic and systolic, are proposed for implementing this algorithm.
• The generic architecture can be used for implementing any search pattern, but its on-chip memory area and on-chip memory bandwidth requirements are high.
• The systolic architecture is an effective way of implementing the proposed search patterns.

21 Dynamically Variable Step Search (DVSS)
• DVSS has a variable number and size of search steps, which can be dynamically reconfigured.
• The DVSS algorithm decreases the computational complexity by adaptively switching between search patterns A1, A2, and A3.
• The number of steps and the search range of each step are determined for the current block based on the size and SAD value of the MV previously found for the left neighboring block, which is called the Left Neighboring Motion Vector (LNMV).
Pattern   First Step Range   Second Step Range   Third Step Range   Search Locations
10x9      -                  ±10, ±9             ±3, ±3             73
14x15     -                  ±14, ±15            ±3, ±3             159
A1        ±48, ±24           ±6, ±6              ±3, ±3             405
A2        ±24, ±12           ±6, ±6              ±3, ±3             161
A3        -                  ±18, ±10            ±3, ±3             249
32x16     -                  ±32, ±16            ±3, ±3             553
B         ±48, ±24           ±12, ±12            ±6, ±6             565
C         ±48, ±24           ±24, ±12            ±12, ±6            793
48x24     -                  ±48, ±24            ±3, ±3             1221
FS        -                  -                   ±48, ±24           4753

22 Search Pattern A1

23 Pseudo Code of DVSS Algorithm
• If the LNMV falls within a smaller search range, DVSS uses a finer search granularity and a smaller search range, because for small motions searching a smaller range is sufficient, and a finer granularity search in a smaller range can give better MAD results.
If there is no left neighboring block
    Do Pattern A1
Else if SAD value of LNMV exceeds the threshold (τ)
    Switch to next coarser pattern
Else if LNMV is within (±8, ±4) pixels
    Do FS in (±10, ±5) search range
Else if LNMV is within (±16, ±8) pixels
    Do Pattern A3
Else if LNMV is within (±24, ±12) pixels
    Do Pattern A2
Else
    Do Pattern A1
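The selection logic can be written as a small runnable sketch. One point in the pseudo code is open to interpretation: what "next coarser pattern" is measured against. The code below assumes it means one step coarser than the pattern the LNMV range alone would select, saturating at A1. That reading is an assumption, and the names `select_pattern`, `pattern_from_range`, and `FS_10x5` are hypothetical.

```python
# Patterns ordered from smallest/finest to largest/coarsest, following the
# order in which the pseudo code selects them.
PATTERNS = ["FS_10x5", "A3", "A2", "A1"]


def pattern_from_range(lnmv):
    """Pick a pattern from the size of the left neighbor's MV alone."""
    dx, dy = abs(lnmv[0]), abs(lnmv[1])
    if dx <= 8 and dy <= 4:
        return "FS_10x5"   # full search in a (±10, ±5) range
    if dx <= 16 and dy <= 8:
        return "A3"
    if dx <= 24 and dy <= 12:
        return "A2"
    return "A1"


def select_pattern(lnmv, lnmv_sad, tau):
    """lnmv is (dx, dy), or None when there is no left neighboring block."""
    if lnmv is None:
        return "A1"
    if lnmv_sad > tau:
        # ASSUMPTION: "next coarser" is relative to the range-based choice
        # and saturates at A1; the slide does not spell this out.
        index = PATTERNS.index(pattern_from_range(lnmv))
        return PATTERNS[min(index + 1, len(PATTERNS) - 1)]
    return pattern_from_range(lnmv)


print(select_pattern((3, 2), lnmv_sad=100, tau=256))
print(select_pattern((3, 2), lnmv_sad=300, tau=256))
```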

24 MAD Results of DVSS Algorithm
• The MAD performance gap between search pattern A1 (405 search locations) and FS (4753 search locations) is only 7.5% on average.
• DVSS decreases the computational complexity significantly with a small decrease in MAD performance.

25 Proposed Reconfigurable and Systolic Hardware Architecture

26 16x16 Reconfigurable PE Array
• Multiplexers are placed between PEs to implement reconfigurability.

27 Dataflow Through the Reconfigurable & Systolic PE Array
• The reference pixels for the first search location in a line of the search window are loaded in 4 clock cycles.
• The SAD value of each subsequent search location is calculated in 1 cycle.
• Reference data is shifted to the right in the PE array in each consecutive clock cycle, and the shift amount can be 4, 2, or 1 pixels depending on the type of the step: coarse, medium, or fine, respectively.

28 Performance of Proposed Hardware Architecture
Search Pattern   Required Clock Cycles per MB   Processed MBs per Second   Supported Frame Size & Rate
A1               633                            205371                     1920x1080, 25.3 fps
B                957                            135841                     1366x768, 33.1 fps
C                1221                           106470                     1366x768, 25.9 fps
10x9             122                            1180327                    1920x1080, 145.7 fps
14x15            236                            610169                     1920x1080, 75.3 fps
32x16            672                            214285                     1920x1080, 26.4 fps
48x24            1425                           101052                     1366x768, 24.6 fps
FS               5103                           25475                      720x576, 15.7 fps
• Works at 130 MHz on an XC3S1500-5, and consumes 9128 slices and 16 BRAMs.
• A1, A2, and A3 require 633, 357, and 380 clock cycles, respectively.
• DVSS requires 467 cycles per MB on average when τ is set to 256 and can support 34.3 HD fps.

29 Performance of Proposed Hardware Architecture for DVSS Algorithm
Video         Threshold (τ)   Required Cycles for 100 Frames   MBs per Frame   Average Cycles per MB   Supported 1920x1080 fps
Spider        256             96094246                         1620            594                     27.056
Spider        1024            90284377                         1620            558                     28.797
Gladiator     256             87299334                         1620            539                     29.782
Gladiator     1024            80952068                         1620            500                     32.117
Irobot        256             77966499                         1620            482                     33.347
Irobot        1024            74177157                         1620            458                     35.051
Susie         256             59212520                         1320            449                     35.778
Susie         1024            51666864                         1320            392                     41.003
Flowers       256             52181938                         1320            396                     40.598
Flowers       1024            49586582                         1320            376                     42.723
TableTennis   256             53382291                         1320            405                     39.685
TableTennis   1024            47136775                         1320            358                     44.944

30 Recursive Dynamically Variable Step Search (RDVSS)
RDVSS has three distinct search paths:
1) Temporal Search: can track the camera movement; applied if Motion Vector Fields (MVFs) are similar
2) Spatial Search: can track the object movement; applied if the Temporal path fails
3) Main Search: performs well for complex motions; applied if the Spatial and Temporal paths fail
Main Search has a maximum range of (±64,±64) pixels.
[Figure: Spatially Searched Areas, Temporally Searched Area, and Current MB, with MBs labeled MB(i,j,t), MB(i,j-1,t), MB(i-1,j,t), MB(i-1,j-1,t), MB(i+1,j-1,t)]

31 RDVSS Search Patterns
• As in the DVSS algorithm, each search pattern has at most three search steps of different granularity, each with a different search range size.
• In the first, second, and third steps, the horizontal and vertical distances between search locations are 4, 2, and 1 pixels, respectively.

32 Early Search Termination Based on an Adaptive Threshold Level
• The minimum SAD values of the four neighboring MBs (i-1,j-1,t), (i,j-1,t), (i+1,j-1,t), and (i-1,j,t) are selected and compared with a pre-determined threshold (τ). The greater value is used as the early search termination level.

33 Pseudo Code of RDVSS Algorithm
Iteration 1 (Temporal Search):
    If (TD is equal or less than (±4,±4) pixels)
        Do Recursive Small Pattern around MV(i,j,t-1)
    Else if (TD is equal or less than (±8,±8) pixels)
        Do Recursive Medium Pattern around MV(i,j,t-1)
    Else if (TD is equal or less than (±16,±16) pixels)
        Do Recursive Large Pattern around MV(i,j,t-1)
    Else
        Do 1x1 Full Search Pattern around MV(i,j,t-1)
Iteration 2 (Spatial Search):
    If (SD is equal or less than (±3,±3) pixels)
        Do 3x3 Full Search Pattern around ASNMV
    Else
        Do 1x1 Full Search Pattern around MV(i-1,j-1,t), MV(i,j-1,t), MV(i+1,j-1,t), and MV(i-1,j,t)
Iteration 3 (Main Search):
    If (SD is equal or less than (±16,±16) pixels)
        Do Main Small Pattern around (0,0)
    Else if (SD is equal or less than (±32,±32) pixels)
        Do Main Medium Pattern around (0,0)
    Else
        Do Main Large Pattern around (0,0)
    Until (Main Large Pattern is used)
        Do next larger Main Pattern around (0,0)

34 HD Videos Added to the Benchmark Suite
Video sequences: IceAge2 (1920x1080, 25 fps), ParkJoy (1920x1080, 25 fps), Spiderman3 (1280x576, 25 fps), Ducks (1280x760, 25 fps), SthlmPan (1280x760, 25 fps)
• “IceAge2”, “ParkJoy1080p”, “Ducks”, and “ParkJoy720p” are 50 frames long, and the remaining videos are 100 frames long.

35 MAD Performance Results of RDVSS
• When the threshold level is set to 256, the MAD performance of RDVSS is within 14.7% of FS.
• RDVSS has nearly the same performance as the main coarse pattern, which checks 1113 search locations.
• DVSS gives slightly better results for videos containing very small motions, because it checks more search locations in the fine search step.

36 Computation Savings of RDVSS
Number of search locations:
Video Sequence                     RDVSS τ = 256   RDVSS τ = 512   RDVSS τ = 1024
ParkJoy1080p (1920x1080, 25 fps)   959             933             738
IceAge2 (1920x1080, 25 fps)        601             448             301
Ducks (1280x760, 25 fps)           380             372             366
ParkJoy720p (1280x720, 25 fps)     921             805             723
Spider3 (1280x576, 25 fps)         529             429             322
Spider2 (720x576, 25 fps)          843             660             327
Susie (704x480, 15 fps)            850             729             365
TableTennis (704x480, 15 fps)      782             716             204
• FS checks 16641 search locations within a (±64,±64) search window, whereas RDVSS checks 418 (97.5% fewer) search locations on average when the threshold value is set to 1024.
• Compared with DVSS for a maximum search range of (±48, ±24) pixels and the same threshold level (τ = 256), RDVSS searches 34% fewer search locations on average while giving nearly the same MAD results.

37 Vector Median Filter (VMF)
• VMFs are non-linear filters.
• They are mainly used for removing noise from a signal by smoothing it. Recently, they have also been used in FRC for finding the true motion information.
• The output of the VMF is the input vector that minimizes the sum of distances to all the other input vectors.
• VMFs are difficult to implement in real time because of their high computational complexity.
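The VMF definition above translates directly into code. The sketch below assumes an L1 (city block) distance between 2-D motion vectors, since the slide does not name the distance measure. `vector_median` and `l1_distance` are hypothetical names, and the 3x3 window of MVs is made-up example data with one outlier.

```python
def l1_distance(a, b):
    # Assumed distance measure; the slide does not specify one.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def vector_median(vectors, distance=l1_distance):
    """Return the input vector with the smallest summed distance to all
    vectors in the window (the distance to itself is zero)."""
    best_vector, best_sum = None, None
    for v in vectors:
        total = sum(distance(v, w) for w in vectors)
        if best_sum is None or total < best_sum:
            best_vector, best_sum = v, total
    return best_vector


# A 3x3 window of MVs, flattened row by row; (9, -7) is an outlier.
window = [(1, 0), (1, 1), (2, 0),
          (1, 0), (9, -7), (1, 0),
          (0, 0), (1, 1), (2, 1)]
smoothed = vector_median(window)
print(smoothed)
```

Unlike a component-wise median, the output is always one of the input vectors, so the filter never invents a motion vector that was not estimated. The nested loop also shows where the cost comes from: every vector is compared against every other vector in the window.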

38 Smoothing
[Figure: current frame and its MVF; original MVF and smoothed MVF]

39 Proposed Data-Reuse Technique
[Figure: vectors belonging to 3x3 windows; some of the vector distances for consecutive filtering windows: (a) t_n, (b) t_n+1, (c) t_n+2]
[Tables: required number of computations for various window sizes; savings of data-reuse technique]

40 Proposed Spatial Correlations Techniques
• Correlation 1 compares the new vectors entering the window among themselves.
• Correlation 2 compares the new vectors with the vector in the middle of the current filtering window.
• Correlation 3 compares the new vectors with the remaining vectors of the current filtering window.
• Correlations 1, 2, and 3 require (N^2 - N), 2N, and 2(N^3 - N^2) comparisons, respectively.
• Correlations 2 and 3 require 2 and 2(N^2 - N) storage operations, respectively.
[Tables: number of comparison operations for proposed techniques; number of storage operations for proposed techniques]

41 Computation Reductions for 3x3 VMF
• For a 3x3 window size, the proposed data-reuse technique and spatial correlations 1 and 2 together reduce the number of operations from 414 to 114.

42 “Dif” Parameter
[Tables: computation reductions for 3x3 VMF for various “dif” values; average computation reductions]

43 VMF Hardware
[Figures: Top-level Block Diagram; Weighting & Minimum Selector]
• To the best of our knowledge, no hardware architecture implementing the proposed techniques has been presented in the literature.
• The proposed architecture is implemented for a 3x3 filtering window, but it is scalable to any filtering window size.

44 VMF Datapath

45 Implementation Results
• The proposed hardware architecture is implemented in VHDL and mapped to a low cost Xilinx XC3S400A-5 FPGA.
• It consumes 1426 slices.
• It can work at 145 MHz.
• Since processing a 1920x1080 HD frame with MVs corresponding to 4x4 MBs requires 1569440 clock cycles, processing a frame takes 10.824 ms. Therefore, the proposed hardware can process 92 HD frames per second when no computation reduction from spatial correlations is applied. When spatial correlations 1 and 2 are used, 110 HD frames per second can be processed on average.

46 Frame Interpolation
• Real-time interpolation of HD frames is a major design challenge.
• We propose a low cost reconfigurable hardware architecture for real-time implementation of frame interpolation algorithms.
• The main bottleneck in an FRC system is accessing the frame memory. The required memory bandwidth of the example system shown in the slide figure is 6.5 GB/s.

47 Frame Interpolation Techniques
• Linear Interpolation (LI)
• Motion Compensated Averaging (MCA)
• Static Median Filtering (SMF)
• Dynamic Median Filtering (DMF)
• Soft Switching (SS)
• Cascaded Median Filtering (CMF)
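The slide only names the techniques, so as an illustration here is a sketch of the two simplest ones under textbook-style assumptions: the new frame lies halfway in time between the previous and next frames, the MV is the displacement from the previous frame to the next frame, and MV components are even so that half a vector is a whole number of pixels. The function names and the sample data are hypothetical, and no boundary clamping is done.

```python
def linear_interpolation(prev, nxt, x, y):
    # Average of the co-located pixels; motion is ignored.
    return (prev[y][x] + nxt[y][x]) // 2


def motion_compensated_average(prev, nxt, x, y, mv):
    # ASSUMPTION: mv = (dx, dy) is the displacement from prev to nxt with
    # even components, and the new frame is temporally halfway, so each
    # side is fetched half a vector away from (x, y).
    hx, hy = mv[0] // 2, mv[1] // 2
    return (prev[y - hy][x - hx] + nxt[y + hy][x + hx]) // 2


# One-row frames: a bright pixel moves two positions to the right.
prev = [[0, 0, 100, 0, 0, 0]]
nxt = [[0, 0, 0, 0, 100, 0]]

li = linear_interpolation(prev, nxt, x=3, y=0)
mca = motion_compensated_average(prev, nxt, x=3, y=0, mv=(2, 0))
print(li, mca)
```

In this toy case LI misses the object at its halfway position (and would leave half-intensity ghosts at x=2 and x=4), whereas MCA places it at x=3 at full intensity, provided the MV is correct.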

48 PSNR Results
• All the ME based techniques perform better than LI.
• Although it checks far fewer search locations, DVSS obtains PSNR results similar to FS. The performance gap between DVSS and FS is less than 1%.

49 Frame Interpolation Hardware
• The proposed architecture implements LI, MCA, SMF, DMF, SS, and CMF.
• It allows adaptive selection of these algorithms for each MB.
• It takes the selected interpolation algorithm and the MV as inputs.

50 Datapath and BRAMs
• Each box labeled R, G, B represents a processing element.
• 30 16-bit rotators are used for aligning the interpolated MB.

51 Processing Element

52 Implementation Results
• The proposed hardware is implemented in VHDL and mapped to a Xilinx XC3SD1800A-4 FPGA.
• It consumes 15384 slices and 32 BRAMs.
• It can run at 101 MHz.
• When operated in any mode except CMF, it interpolates a 16x16 MB in 16 clock cycles after the first result is ready. When operated in CMF mode, it interpolates a 16x16 MB in 48 clock cycles after the first result is ready.

53 Contributions
• We proposed ME algorithms (Hexagon Based ME, DVSS, RDVSS) and hardware architectures (Generic, Systolic, Reconfigurable and Systolic) to efficiently implement these algorithms. The proposed algorithms perform very close to the FS algorithm while searching far fewer search locations, and they outperform successful fast search ME algorithms by searching more search locations than those algorithms.
• We proposed techniques to reduce the computational complexity of VMFs by using a data reuse methodology and by exploiting the spatial correlations in the vector field. We designed and implemented an efficient VMF hardware that includes the computation reduction techniques exploiting the spatial correlations in the vector field.
• We proposed a low cost hardware architecture for real-time implementation of frame interpolation algorithms. The proposed hardware architecture is reconfigurable, and it allows adaptive selection of frame interpolation algorithms for each MB.

54 Published Papers
• O. Tasdizen, A. Akin, H. Kukner, I. Hamzaoglu, and H. F. Ugurdag, “High Performance Hardware Architectures for a Hexagon-Based Motion Estimation Algorithm,” Proc. IEEE 16th IFIP International Conference on VLSI-SoC, Rhodes, Greece, October 2008.
• O. Tasdizen, H. Kukner, A. Akin, and I. Hamzaoglu, “A High Performance Reconfigurable Motion Estimation Hardware Architecture,” Proc. IEEE DATE Conference, Nice, France, Apr. 2009.
• O. Tasdizen, A. Akin, H. Kukner, and I. Hamzaoglu, “Dynamically variable step search motion estimation algorithm and a dynamically reconfigurable hardware for its implementation,” IEEE Transactions on Consumer Electronics, vol. 55, no. 3, pp. 1645-1653, Aug. 2009.
• O. Tasdizen and I. Hamzaoglu, “A reconfigurable frame interpolation hardware architecture for high definition video,” in Euromicro DSD, Patras, Greece, Aug. 2009.
• O. Tasdizen and I. Hamzaoglu, “Recursive dynamically variable step search motion estimation algorithm for high definition video,” in ICPR, Istanbul, Turkey, Aug. 2010.
• O. Tasdizen and I. Hamzaoglu, “Computation reduction techniques for vector median filtering and their hardware implementation,” in Euromicro DSD, Lille, France, Sep. 2010.

