Hardware Implementation of 2-D Wavelet Transforms in Viva on Starbridge Hypercomputer
S. Gakkhar, A. Dasu, Utah State University

Presentation transcript:
1 Hardware Implementation of 2-D Wavelet Transforms in Viva on Starbridge Hypercomputer
S. Gakkhar, A. Dasu, Utah State University

Why Wavelet Transforms?
A signal cannot be represented as a point in the time-frequency space.
• Fourier transforms and integrals can represent any arbitrary signal in terms of sines and cosines. However, although the Fourier expansion of a signal can have infinite support in the frequency domain, the expansion contains only frequency resolution and no time resolution. That is, although every frequency present in the signal can be determined, it cannot be determined when those frequencies occur.
• Introducing a window on the basis function to extract time resolution translates into convolution between the windowing function and the signal in the time domain, or multiplication in the frequency domain. Since windowing functions tend to contain a wide range of frequencies (a Dirac pulse, for instance, comprises all possible frequencies), this smears the signal and poses the exact opposite problem to Fourier transforms: time resolution is present, but frequency resolution is absent.

...the solution: Wavelet Transforms
• Wavelet transforms address this issue by introducing a fully scalable, modulated windowing function. As such, wavelet transforms involve multi-resolution analysis, in which the windowing function is scaled and shifted relative to the signal while tracking both the frequency and 'spatial' spectra.
• An important characteristic of wavelet transforms is perfect reconstructability: the original signal can be recovered by taking the inverse wavelet transform.
Continuous Wavelet Transforms: a theoretical foundation
Building on the fundamentals of function spaces, the Continuous Wavelet Transform can be written mathematically as the decomposition of a signal f(t) onto a set of basis functions called wavelets, ψ_s,τ(t), where s represents the scale and τ the translation (the new variables after the transformation):

γ(s, τ) = ∫ f(t) ψ*_s,τ(t) dt

The wavelets are generated from a single basic wavelet ψ(t) (the mother wavelet) by scaling and translation:

ψ_s,τ(t) = (1/√s) ψ((t − τ)/s)

The factor 1/√s is a normalizing factor.

...but the Continuous Wavelet Transform is impractical to use in this mathematical form
• The CWT is redundant, for it involves correlation between a continuously shifting/scaling function and the signal.
• Most functions lack an analytical wavelet transform.
• The wavelets in the transform have infinite support.

Discrete Wavelet Transform
A modified definition of the wavelet transform, in which the wavelets are scaled and shifted in discrete steps, addresses these flaws:

ψ_j,k(t) = (1/√(s₀^j)) ψ((t − k τ₀ s₀^j) / s₀^j)

where j and k are integers; τ₀ is chosen as 1 for dyadic sampling of the time axis, and s₀ is chosen as 2 for dyadic sampling of the frequency axis.

The DWT Algorithm: Subband Coding Principles
Computational implementations of the DWT use subband coding schemes, modeling the transform as filtering through a set of filter banks. The filter banks are designated Wavelet (high pass) and Scaling (low pass), based on their attributes. In a nutshell, the algorithm for the 2-D DWT comprises convolution along the x axis, followed by convolution along the y axis (on the previously transformed matrix), with the two filters taken in all possible permutations. The matrix is then down-sampled by 2 along both axes.
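The subband-coding view of the 2-D DWT described above can be sketched in a few lines. This is a minimal illustration, assuming Haar filters and circular (periodic) signal extension for brevity; the design on this poster uses four-tap Daubechies filters and zero extension, and the function names here are illustrative, not from the design.

```python
import numpy as np

def dwt2_level(a, h, g):
    """One analysis level of the separable 2-D DWT via subband coding:
    filter along x, then along y, down-sampling by 2 after each stage."""
    def filt_ds(x, f, axis):
        # circular convolution with filter f along `axis`, then 2-downsample
        x = np.moveaxis(x, axis, -1)
        y = np.zeros_like(x, dtype=float)
        for k, c in enumerate(f):
            y = y + c * np.roll(x, -k, axis=-1)
        return np.moveaxis(y[..., ::2], -1, axis)

    lo = filt_ds(a, h, 1)               # scaling (low-pass) along rows
    hi = filt_ds(a, g, 1)               # wavelet (high-pass) along rows
    # all four filter permutations along columns -> four subbands
    return (filt_ds(lo, h, 0), filt_ds(lo, g, 0),
            filt_ds(hi, h, 0), filt_ds(hi, g, 0))

# Haar filters (orthonormal); a constant image puts all energy in LL
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)
LL, LH, HL, HH = dwt2_level(np.ones((4, 4)), h, g)
```

Because the filters are orthonormal, the energy of the input is preserved across the four subbands, which is one way to sanity-check an implementation.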
[Filter-bank diagram: λ_j passes through h(k) and g(k); each branch is down-sampled by 2 (2↓), producing λ_{j−1} and γ_{j−1}.]

DWT algorithm optimizations for the design
Let A[i][j] represent the elements of a matrix, and let F[N] and G[M] be the two filter banks, having N and M taps. The matrix transformations are A[i][j] → B[i][j] → C[i][j], where:

B[i][j] = A[i][j]·F[0] + ... + A[i][j+N−1]·F[N−1]   (convolution along rows with F)
C[i][j] = B[i][j]·G[0] + ... + B[i+M−1][j]·G[M−1]   (convolution along columns with G)

The target transform applies the two filters in all possible permutations, each result being down-sampled.

Notes:
• 2↓ represents down-sampling of the final transformed matrix.
• The convolution along rows is performed first.

Take, for example, a 4x4 input matrix (elements a through p) with two four-tap filters. Each sub-matrix of the result represents the down-sampled output of one set of convolutions; considering only one set of convolutions, the original 16-point matrix transforms to just a four-element matrix. However, only point [0][0] of the final sub-matrix can be computed, as the remaining points depend on elements external to the matrix. This is easy to visualize by considering the corner and edge blocks of an image. Such a scenario is referred to as an edge condition. A commonly adopted way of dealing with it is to assume the elements outside the matrix to be zero.
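The B and C transformations above can be written out directly as a naive reference, assuming (as the slide does) that elements outside the matrix are treated as zero, with 2↓ down-sampling along both axes at the end. The function and variable names are illustrative, not from the design.

```python
import numpy as np

def target_transform(A, F, G):
    """B[i][j] = A[i][j]*F[0] + ... + A[i][j+N-1]*F[N-1]  (rows first),
    C[i][j] = B[i][j]*G[0] + ... + B[i+M-1][j]*G[M-1]  (then columns),
    with out-of-matrix elements taken as zero (the edge condition),
    followed by down-sampling by 2 along both axes."""
    I, J = A.shape
    N, M = len(F), len(G)
    B = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            B[i, j] = sum(F[k] * A[i, j + k] for k in range(N) if j + k < J)
    C = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            C[i, j] = sum(G[k] * B[i + k, j] for k in range(M) if i + k < I)
    return C[::2, ::2]   # 2-downsample along both axes

A = np.arange(16.0).reshape(4, 4)        # the a..p example as numbers 0..15
out = target_transform(A, [1.0], [1.0])  # identity filters: pure down-sampling
```

With single-tap identity filters the convolutions are no-ops, so the result is simply the input down-sampled by 2 in each direction, which makes the edge handling easy to check by hand.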
Matrix defining pseudo edge conditions for the red sub-matrix
[Figure: an M x N sub-matrix at position (i, j) within the image, showing the first M x N sub-matrix and the first set of data written to cache.]

The sub-matrices can be formed so that the edge-condition matrix is reduced.

Doubling the throughput
The target transform can be represented as a tree: the first-stage convolution (low-pass and high-pass filters) feeds the second-stage convolution (low-pass and high-pass filters on each branch), with the branches computed in parallel. With this factor of 2, the parallelism extracted is M·N·2; for Daubechies' four-tap scaling and wavelet filters, the parallelism extracted is 32, along with pipelining introduced between the two stages of convolution.

Design Overview
• Hardware: Xilinx 2V6000 FPGA, PE6 on the Starbridge® Hypercomputer.
• Software: Viva 2.4.1, a polymorphic data- and information-rate EDA tool.

DWT: A Quasi-Recursive Approach
Consider a 2x2 matrix with two-tap filters. Evidently only one step is required to generate the entire first-stage output, and this output can be fed directly into the second-stage processing. Now, if the larger image can be broken into small chunks that feed straight into the second stage, the redundant memory read is eliminated, since the two convolution stages have been collapsed into one. But since two-tap filters are quite uncommon (to say the least), the methodology has to be generalized for arbitrary filter sizes (M x N). This introduces an additional complication: the first stage cannot be computed completely, as it depends on data outside the matrix. This is quite similar to an edge condition, except that the outside elements can no longer be treated as zero. Let such a situation be referred to as a pseudo edge condition. For every chunk of data, a pseudo-edge-condition matrix can then be defined, comprising the elements required for complete evaluation of the chunk's first-stage convolution.
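The 2x2, two-tap case that motivates the quasi-recursive approach can be shown in a few lines: the entire first-stage output of the block is available at once, so it feeds the second stage with no intermediate memory write-back, and each first-stage result is shared by two second-stage convolutions. Haar filters stand in here for the generic two-tap filters; the names are illustrative.

```python
import numpy as np

def fused_block(block, f, g):
    """Collapse both convolution stages for one 2x2 block with two-tap
    filters: one sample per subband, no intermediate memory traffic."""
    lo = block @ f   # first stage: low-pass along x (one value per row)
    hi = block @ g   # first stage: high-pass along x
    # second stage along y; each first-stage vector is reused twice
    return {"LL": lo @ f, "LH": lo @ g, "HL": hi @ f, "HH": hi @ g}

f = np.array([1.0, 1.0]) / np.sqrt(2.0)   # scaling (low-pass) filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # wavelet (high-pass) filter
sub = fused_block(np.ones((2, 2)), f, g)
```

Note how `lo` and `hi` are each consumed by two second-stage dot products; this sharing is exactly what scatters memory accesses so badly on a scalar processor, but maps naturally onto parallel hardware.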
New matrix defining pseudo edge conditions for the red sub-matrix

Algorithmic Overview
Based on this recursive outlook, it is possible to modify the algorithm so as to eliminate the potential pitfalls associated with a hardware implementation of the DWT. The image is traversed row-wise. An N x (M−2) matrix is brought in and concatenated with a 2x1 matrix already in cache. Initially the cache contains all zeroes, which gives an intrinsic method for dealing with true edge conditions. The first-stage convolution is computed and fed into the second-stage convolution. For M = N = 4, it has been demonstrated that only one element of the final convolution can be fully computed. This element is written to memory, while partially computed elements are stored in a cache associated with the second-stage convolution. This continues until an entire row is traversed. The cache for the first-stage convolution is then flushed, as a new edge condition needs to be dealt with for the next set of rows. The first-stage convolution is calculated in precisely the same manner; for the second stage, the partially computed values from the last iteration are used to compute the new second-stage values, and the new partial values overwrite the previous partial values in the second cache.

Hardware Issues with DWT Implementation
High computational overhead
High computational redundancy arises from the way the filtering is required by the algorithm. The combination of filters used in the algorithm allows the first-stage convolution results to be shared between two second-stage convolutions. However, on a scalar processor the memory accesses (if this scheme of sharing first-stage results is to be utilized) are so scattered that multiple page faults are induced, which hurts performance severely.
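The traversal and cache discipline described above can be sketched as control flow. This is a schematic model only: the chunk shapes and names are assumptions for the M = N = 4 case, and the convolutions themselves are elided so that only the cache behavior (zero initialization at each new row band, and the cached columns carried between steps) is shown.

```python
import numpy as np

def traverse(image, taps=4, step=2):
    """Row-wise traversal for `taps`-tap filters: each iteration brings
    in a taps x step chunk, concatenates it with the columns cached from
    the previous step, and yields the window that would feed the
    first-stage convolution. The cache starts as zeros, handling the
    true edge condition; it is flushed at every new band of rows."""
    rows, cols = image.shape
    for r in range(0, rows, step):
        cache = np.zeros((taps, step))        # flush: new edge condition
        for c in range(0, cols, step):
            chunk = np.zeros((taps, step))    # zero-pad at true edges
            src = image[r:r + taps, c:c + step]
            chunk[:src.shape[0], :src.shape[1]] = src
            window = np.hstack([cache, chunk])  # cached cols + new data
            yield window                        # -> first-stage convolution
            cache = chunk                       # new block replaces old in cache

windows = list(traverse(np.ones((8, 8))))
```

Each input element enters exactly one chunk and is reused from cache exactly once, so the redundant re-read of first-stage results from main memory is avoided.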
Multiple / redundant memory accesses
Implementation of the algorithm requires two sets of memory accesses (reads/writes):
i) First, the original image read and the first-stage convolution write.
ii) Second, the first-stage convolution result read and the final result write.

Data Access Pattern
For filters of sizes M and N, the basic matrix size is M x N. Since the data is down-sampled after the first convolution (essentially every other row is thrown away), the first two columns of the matrix can be cached; at each cycle N x (M−2) elements are brought in, and the first two columns of the new matrix replace the old block in cache after the data is processed.

Design Overview
[Viva schematic for the processing core: the input feeds the first-stage convolution with its associated cache, whose output feeds the second-stage convolution with its associated cache, producing the output.]

Results
• The processing core implements the wavelet transform for two single-precision floating-point four-tap filters. The design is completely polymorphic, though it has been implemented with floating-point multipliers, and all intermediates are cast to single-precision floating-point values.
• The core is pipelined and generates four final pixel values (after both convolutions) every clock cycle. This is equivalent to 24 floating-point multiplies and 18 floating-point additions.
• The logic utilization is 93%.
• Generating 4 pixel values every clock cycle, this design can compute the wavelet transform of a 512x512 image in 65536 clock cycles.

Gakkhar, MAPLD 2005/163
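The cycle count quoted in the results follows directly from the four-pixels-per-clock output rate:

```python
# Throughput arithmetic for the quoted results
pixels = 512 * 512          # total output pixels in a 512x512 transform
cycles = pixels // 4        # 4 pixels produced per clock cycle
flops_per_cycle = 24 + 18   # floating-point multiplies + additions
```

At 4 pixels per cycle, 512·512 = 262144 pixels take 262144 / 4 = 65536 cycles, with 42 floating-point operations sustained per cycle.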

