
1 ECE 734 VLSI Array Structures for Digital Signal Processing Topic: Implementation of JPEG 2000 component algorithm—DWT in TI TMS32060 Team Members: Peng Zhang and Xun Zhang Advisor: Yu Hen Hu Spring 2004

2 Agenda
    Abstract
    DWT C implementation
    DWT TMS320C62 assembly code
        Without optimization
        Speed optimization
        Pipeline optimization (by us)
    Result comparison
    JPEG 2000 and DWT (if time permits)

3 Abstract In this project, we implement and optimize the DWT algorithm, a key algorithm in JPEG 2000, on the TI TMS320C62 platform. Step 1: we implemented the 2D DWT algorithm in C. Step 2: we implemented the 2D DWT algorithm on the TI TMS320C62 platform twice, once without any optimization and once with the fastest speed optimization. Step 3: we applied further optimization to the assembly code, mainly pipelining. Step 4: we compared the performance before and after our optimization.

4 C code Implementation

    ...
    #define S(i) a[x*(i)*2]
    ...
    void dwt_deinterleave(int *a, int n, int x)
    {
        int dn, sn, i;
        int *b;
        dn = n/2;
        sn = (n+1)/2;
        b = (int*)malloc(n*sizeof(int));
        for (i=0; i<sn; i++)
            b[i] = a[2*i*x];
        ...
    }

    /// Forward wavelet transform in 1-D.
    void dwt_encode_1(int *a, int n, int x)
    {
        ...
        dwt_deinterleave(a, n, x);
    }

    /// Forward wavelet transform in 2-D.
    void dwt_encode(int *a, int w, int h, int l)
    {
        int i, j, rw, rh;
        for (i=0; i<l; i++) {
            rw = int_ceildivpow2(w, i);
            rh = int_ceildivpow2(h, i);
            for (j=0; j<rw; j++)
                dwt_encode_1(a+j, rh, w);
            ...
        }
    }

    void main()
    {
        ...
        dwt_encode(image[0], 200, 165, 8);
        ...
    }
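
The slide elides the rest of dwt_deinterleave after the first loop. A minimal sketch of how such a routine might be completed is given below. The second loop, the copy-back, and the free() are assumptions (the source shows only the even-sample loop), and the name dwt_deinterleave_sketch is hypothetical.

```c
#include <stdlib.h>

/* Sketch only: everything after the first loop is an assumption,
   not taken from the slides. */
void dwt_deinterleave_sketch(int *a, int n, int x)
{
    int sn = (n + 1) / 2;   /* number of even-indexed samples */
    int dn = n / 2;         /* number of odd-indexed samples  */
    int i;
    int *b = (int *)malloc((size_t)n * sizeof(int));

    if (b == NULL)
        return;             /* assumed policy: leave the input untouched */

    /* Even samples go to the first half (the loop shown on the slide). */
    for (i = 0; i < sn; i++)
        b[i] = a[2 * i * x];

    /* Assumed: odd samples go to the second half. */
    for (i = 0; i < dn; i++)
        b[sn + i] = a[(2 * i + 1) * x];

    /* Assumed: copy the reordered samples back with the same stride x. */
    for (i = 0; i < n; i++)
        a[i * x] = b[i];

    free(b);
}
```

With x = 1 the routine reorders a contiguous row; with x equal to the image width it reorders one column, which matches how dwt_encode passes w as the stride.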

5 Assembly Code without any optimization

    ;----------------------------------------------------------------------
    ;  24 | void dwt_deinterleave(int *a, int n, int x)
    ;----------------------------------------------------------------------
    _dwt_deinterleave:
    ;** --------------------------------------------------------------------------*
    ...
    ;----------------------------------------------------------------------
    ;  31 | for (i=0; i<sn; i++)
    ;----------------------------------------------------------------------
            ZERO    .D2     B4                ; |31|
            STW     .D2T2   B4,*+SP(24)       ; |31|
            LDW     .D2T2   *+SP(24),B5       ; |31|
            LDW     .D2T2   *+SP(20),B4       ; |31|
            NOP             4
            CMPLT   .L2     B5,B4,B0          ; |31|
     [!B0]  B       .S1     L2                ; |31|
            NOP             5
            ; BRANCH OCCURS                   ; |31|
    L1:
            .line   9
    ;----------------------------------------------------------------------
    ;  32 | b[i]=a[2*i*x];
    ;----------------------------------------------------------------------
            LDW     .D2T2   *+SP(24),B4       ; |32|
            LDW     .D2T2   *+SP(12),B5       ; |32|
            LDW     .D2T2   *+SP(4),B6        ; |32|
            NOP             2
            ADD     .D2     B4,B4,B4
            MPYLH   .M2     B5,B4,B8          ; |32|
            MPYLH   .M2     B4,B5,B7          ; |32|
            MPYU    .M2     B5,B4,B5          ; |32|
            ADD     .D2     B8,B7,B4          ; |32|
            SHL     .S2     B4,16,B4          ; |32|
            ADD     .S2     B5,B4,B4          ; |32|
    ||      LDW     .D2T2   *+SP(28),B7       ; |32|
            LDW     .D2T2   *+B6[B4],B4       ; |32|
            LDW     .D2T2   *+SP(24),B5       ; |32|
            NOP             4
            STW     .D2T2   B4,*+B7[B5]       ; |32|
            LDW     .D2T2   *+SP(24),B4       ; |32|
            NOP             4
            ADD     .D2     1,B4,B4           ; |32|
            STW     .D2T2   B4,*+SP(24)       ; |32|
            LDW     .D2T2   *+SP(24),B5       ; |32|
            LDW     .D2T2   *+SP(20),B4       ; |32|
            NOP             4
            CMPLT   .L2     B5,B4,B0          ; |32|
     [ B0]  B       .S1     L1                ; |32|
            NOP             5
            ; BRANCH OCCURS                   ; |32|
    ;----------------------------------------------------------------------
    ...

6 Assembly Code with speed optimization

    _dwt_deinterleave:
    …
    ;** --------------------------------------------------------------------------*
    ||      MV      .D2     B4,B11
            .line   5
            MV      .D2     B11,B0            ; |28|
            SHRU    .S2     B0,31,B4          ; |28|
            ADD     .D2     B4,B0,B4          ; |28|
            SHR     .S2     B4,1,B0           ; |28|
            MV      .D2     B0,B12            ; |28|
            .line   6
            ADD     .D2     1,B11,B10         ; |29|
            SHRU    .S2     B10,31,B4         ; |29|
            ADD     .D2     B4,B10,B4         ; |29|
            SHR     .S2     B4,1,B4           ; |29|
            MV      .S1X    B4,A12            ; |29|
            .line   7
            B       .S1     _malloc           ; |30|
            MVKL    .S2     RL0,B3            ; |30|
            SHL     .S1X    B11,2,A4          ; |30|
            MVKH    .S2     RL0,B3            ; |30|
            NOP             2
    RL0:    ; CALL OCCURS                     ; |30|
            .line   8
            CMPLT   .L2     B10,2,B0
     [ B0]  B       .S1     L2                ; |31|
            MV      .D2     B10,B4
     [!B0]  MV      .D1     A4,A3
     [!B0]  MV      .S1     A10,A0
            NOP             2
            ; BRANCH OCCURS                   ; |31|
    ;** --------------------------------------------------------------------------*
    ;**   -----------------------    U$22 = a;
    ;**   -----------------------    U$25 = b;
    ;** 32-----------------------    L$1 = K$7>>1;
    ;**   -----------------------    X$4 = x<<3;
    ;**   -----------------------    #pragma MUST_ITERATE(1, 1073741823, 1)
            .line   9
            SHR     .S2     B4,1,B0           ; |32|
    ||      SHL     .S1     A11,3,A6
    ;**   -----------------------g3:
    ;** 32-----------------------    *U$25++ = *U$22;
    ;** 32-----------------------    U$22 += X$4;
    ;** 32-----------------------    if ( --L$1 ) goto g3;
            SUB     .D2     B0,1,B0           ; |32|
    L1:
     [ B0]  B       .S1     L1                ; |32|
    ||      LDW     .D1T1   *A0,A5            ; |32|
            ADD     .S1     A6,A0,A0          ; |32|
     [ B0]  SUB     .D2     B0,1,B0           ; |32|
            NOP             2
            STW     .D1T1   A5,*A3++          ; |32|
            ; BRANCH OCCURS                   ; |32|
    ;** --------------------------------------------------------------------------*
    ...
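
The compiler annotations in this listing (U$22 += X$4, *U$25++ = *U$22, if ( --L$1 ) goto g3) suggest that the indexed loop was rewritten as pointer stepping with a down-counter, which removes the multiply seen in the unoptimized code. A rough C sketch of that rewrite follows. The function and variable names are hypothetical, and the stride comment assumes a 4-byte int.

```c
/* Indexed form, as written on the C slide. */
void copy_even_indexed(const int *a, int *b, int sn, int x)
{
    int i;
    for (i = 0; i < sn; i++)
        b[i] = a[2 * i * x];
}

/* Pointer-stepping form, loosely mirroring the compiler's annotations. */
void copy_even_strided(const int *a, int *b, int sn, int x)
{
    const int *src = a;   /* plays the role of U$22 */
    int *dst = b;         /* plays the role of U$25 */
    int stride = 2 * x;   /* in ints; x<<3 is the same step in bytes if int is 4 bytes */
    int count;            /* down-counter, like L$1 */

    for (count = sn; count > 0; count--) {
        *dst++ = *src;
        if (count > 1)
            src += stride;   /* skip the final step so src never leaves the array */
    }
}
```

Both functions write the same values to b; the second one only replaces the per-iteration index computation with an addition.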

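7 Speed optimized code analysis

    for (i=0; i<sn; i++)
        b[i] = a[2*i*x];

Assume sn = n+1; then 6*(n+1) clock cycles are needed. Iteration i occupies cycles 6i+1 through 6i+6: the branch and the load of a[2ix] issue together, the source address advances to &a[2(i+1)x], the loop counter (starting at sn-1) is decremented, two NOP cycles wait for the load, and the loaded value is stored to b[i] while the destination address advances to &b[i+1].

     [ B0]  B       .S1     L1                ; |32|
    ||      LDW     .D1T1   *A0,A5            ; |32|
            ADD     .S1     A6,A0,A0          ; |32|
     [ B0]  SUB     .D2     B0,1,B0           ; |32|
            NOP             2
            STW     .D1T1   A5,*A3++          ; |32|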

8 Assembly Code with pipeline optimization

            SHR     .S2     B4,1,B0
            CMPGT   .L2     B0,6,B1
     [ B1]  B       .S1     L2
            SHL     .S1     A10,3,A3
     [!B1]  SUB     .D2     B0,1,B0
            NOP             3
    ;** --------------------------------------------------------------------------*
    ...
    ;** --------------------------------------------------------------------------*
    L2:
            ADD     .S1     A3,A4,A4
    ||      SUB     .D2     B0,7,B0
    ||      LDW     .D1T1   *A4,A6
    ;** --------------------------------------------------------------------------*
    L3:    ; PIPELINED LOOP PRE-PROCESS
            MV      .S2X    A0,B4
    || [ B0] B      .S1     L4
    ||      ADD     .L1     A3,A4,A0
    || [ B0] SUB    .D2     B0,1,B0
    ||      LDW     .D1T1   *A4,A0

            ADD     .L1     A3,A0,A0
    || [ B0] SUB    .D2     B0,1,B0
    ||      LDW     .D1T1   *A0,A0
    || [ B0] B      .S1     L4

     [ B0]  B       .S1     L4
    ||      ADD     .L1     A3,A0,A0
    || [ B0] SUB    .D2     B0,1,B0
    ||      LDW     .D1T1   *A0,A0

            ADD     .L1     A3,A0,A0
    || [ B0] SUB    .D2     B0,1,B0
    ||      LDW     .D1T1   *A0,A0
    || [ B0] B      .S1     L4

            MV      .S2X    A6,B5
    || [ B0] B      .S1     L4
    ||      ADD     .L1     A3,A0,A4
    || [ B0] SUB    .D2     B0,1,B0
    ||      LDW     .D1T1   *A0,A0
    ;** --------------------------------------------------------------------------*
    L4:    ; PIPELINED LOOP
            STW     .D2T2   B5,*B4++
    ||      MV      .S2X    A0,B5
    || [ B0] B      .S1     L4
    ||      ADD     .L1     A3,A4,A4
    || [ B0] SUB    .L2     B0,1,B0
    ||      LDW     .D1T1   *A4,A0
    ;** --------------------------------------------------------------------------*
    L5:    ; PIPELINED LOOP PAST-PROCESS
            MV      .S2X    A0,B5
    ||      STW     .D2T2   B5,*B4++

            MV      .S2X    A0,B5
    ||      STW     .D2T2   B5,*B4++

            MV      .S2X    A0,B5
    ||      STW     .D2T2   B5,*B4++

            MVC     .S2     B6,CSR
    ||      MV      .L2X    A0,B5
    ||      STW     .D2T2   B5,*B4++
    ;** --------------------------------------------------------------------------*
            MV      .S2X    A0,B5
    ||      STW     .D2T2   B5,*B4++

            STW     .D2T2   B5,*B4++
    ;** --------------------------------------------------------------------------*

9 Pipeline optimized code design

    for (i=0; i<sn; i++)
        b[i] = a[2*i*x];

Assume sn = n+1; then n+7 clock cycles are needed.

    Clock cycle:          0  1  2  3  4  5  6  7  …  n-5  …  n+1  n+2  n+3  n+4  n+5  n+6  n+7
    Loop counter:         sn-7  sn-8  sn-9  sn-10  sn-11  sn-12  sn-13  …  0  …
    Source address:       &a[0]  &a[2x]  &a[4x]  &a[6x]  &a[8x]  &a[10x]  &a[12x]  &a[14x]  …  &a[2(n+1)x]
    Loaded value:         a[0]  a[2x]  …  a[2(n-5)x]  a[2(n-4)x]  a[2(n-3)x]  a[2(n-2)x]  a[2(n-1)x]  a[2nx]
    Destination address:  &b[0]  &b[1]  …  &b[n-5]  &b[n-4]  &b[n-3]  &b[n-2]  &b[n-1]  &b[n]
    Store:                b[0]=a[0]  …  b[n-6]=a[2(n-6)x]  b[n-5]=a[2(n-5)x]  b[n-4]=a[2(n-4)x]  b[n-3]=a[2(n-3)x]  b[n-2]=a[2(n-2)x]  b[n-1]=a[2(n-1)x]  b[n]=a[2nx]

    L4:    ; PIPELINED LOOP
            STW     .D2T2   B5,*B4++
    ||      MV      .S2X    A0,B5
    || [ B0] B      .S1     L4
    ||      ADD     .L1     A3,A4,A4
    || [ B0] SUB    .L2     B0,1,B0
    ||      LDW     .D1T1   *A4,A0

10 Comparison: speed-optimized code (by C6) vs. pipeline-optimized code (by us)

    for (i=0; i<sn; i++)
        b[i] = a[2*i*x];

Assuming sn = n+1, the results are:
    Speed-optimized code: 6(n+1) clock cycles
    Pipeline-optimized code: n+7 clock cycles
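
To get a feel for the two formulas, the cycle counts can be evaluated directly. The sketch below simply encodes the expressions from the slide (with sn = n+1); the function names are hypothetical, and the numbers are formula values rather than measurements.

```c
/* Cycle-count models copied from the slide formulas, with sn = n + 1. */
long cycles_speed_optimized(long n)
{
    return 6 * (n + 1);
}

long cycles_pipelined(long n)
{
    return n + 7;
}

/* Ratio of the two models; it approaches 6 as n grows. */
double speedup(long n)
{
    return (double)cycles_speed_optimized(n) / (double)cycles_pipelined(n);
}
```

For a column of 165 samples (sn = 83, n = 82), the formulas give 498 cycles against 89. For n = 0 the pipelined version is the slower one (7 cycles against 6), which is consistent with the pipelined listing testing the trip count before entering the pipelined loop.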

11 JPEG2000 Lossy Image Compression [Figure: encoder block diagram with three stages in sequence: DWT, Quantizer, Entropy Coder.]

12 1-Level Wavelet Decomposition (2D DWT) [Figure: the input image passes through filters H1 (low pass) and H2 (high pass), each followed by a decimator by 2 that keeps one out of two pixels, first as row-wise operations and then as column-wise operations, producing the LL, HL, LH, and HH components. Each branch has the form x[n], filter Hi, decimator (2), y[n].]
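
The filter-then-decimate step on this slide can be sketched in C for one row or one column. The slides do not say which filters H1 and H2 are, so the sketch assumes the simplest possible pair (a Haar-style average and difference) purely for illustration; JPEG 2000 itself uses longer filters. The function name is hypothetical, and n is assumed to be even.

```c
/* One filter-and-decimate pass over n samples spaced `stride` apart.
   Assumption: Haar-style average/difference stand in for H1 and H2. */
void analyze_1d_haar(const int *in, int n, int stride, int *low, int *high)
{
    int i;
    for (i = 0; i < n / 2; i++) {
        int even = in[(2 * i) * stride];
        int odd  = in[(2 * i + 1) * stride];
        low[i]  = (even + odd) / 2;   /* low pass, one output kept per two inputs  */
        high[i] = even - odd;         /* high pass, one output kept per two inputs */
    }
}
```

Calling it with stride 1 processes a row, and calling it with stride equal to the image width processes a column. Applying the pass row-wise and then column-wise to both outputs gives the four components named on the slide.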

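13 Multi-Level Wavelet Decomposition [Figure: the level-1 decomposition (LL, HL1, LH1, HH1) is passed through the 2D-DWT again, which splits its LL component into LL, HL2, LH2, and HH2 while HL1, LH1, and HH1 are kept.]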
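
In the C code from slide 4, dwt_encode sizes the region processed at level i with int_ceildivpow2(w, i) and int_ceildivpow2(h, i). The body of that helper is not shown in the slides, so the sketch below assumes it computes the ceiling of a / 2^b. Both function names are hypothetical.

```c
#include <stdio.h>

/* Assumed behavior of the helper called in dwt_encode: ceil(a / 2^b). */
int ceildivpow2_sketch(int a, int b)
{
    return (a + (1 << b) - 1) >> b;
}

/* Print the width and height of the region transformed at each level. */
void print_level_sizes(int w, int h, int levels)
{
    int i;
    for (i = 0; i < levels; i++)
        printf("level %d: %d x %d\n", i,
               ceildivpow2_sketch(w, i), ceildivpow2_sketch(h, i));
}
```

Under that assumption, the 200 x 165 image from main() would be processed as 200 x 165 at level 0 and 100 x 83 at level 1, with each further level roughly halving both dimensions.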

14 Thanks! Questions?

