ECE 734 VLSI Array Structures for Digital Signal Processing
Topic: Implementation of a JPEG 2000 component algorithm, the DWT, on the TI TMS32060
Team Members: Peng Zhang and Xun Zhang
Advisor: Yu Hen Hu
Spring 2004
Agenda
  Abstract
  DWT C implementation
  DWT TMS320C62 assembly code
    Without optimization
    Speed optimization
    Pipeline optimization (by us)
  Result comparison
  JPEG 2000 and DWT (if we have free time)
Abstract
In this project we implement and optimize the DWT algorithm, a key algorithm in JPEG 2000, on the TI TMS320C62 platform.
  Step 1: we implemented the 2D DWT algorithm in C.
  Step 2: we implemented the 2D DWT algorithm on the TI TMS320C62 platform twice, once without any optimization and once with the fastest speed optimization.
  Step 3: we applied further optimization to the assembly code, mainly through pipelining.
  Step 4: we compared the performance before and after our optimization.
C code Implementation

    ...
    #define S(i) a[x*(i)*2]
    ...
    void dwt_deinterleave(int *a, int n, int x) {
        int dn, sn, i;
        int *b;
        dn=n/2;
        sn=(n+1)/2;
        b=(int*)malloc(n*sizeof(int));
        for (i=0; i<sn; i++)
            b[i]=a[2*i*x];
        ...
    }

    /// Forward wavelet transform in 1-D.
    void dwt_encode_1(int *a, int n, int x) {
        ...
        dwt_deinterleave(a, n, x);
    }

    /// Forward wavelet transform in 2-D.
    void dwt_encode(int *a, int w, int h, int l) {
        int i, j, rw, rh;
        for (i=0; i<l; i++) {
            rw=int_ceildivpow2(w, i);
            rh=int_ceildivpow2(h, i);
            for (j=0; j<rw; j++)
                dwt_encode_1(a+j, rh, w);
            ...
    }

    void main() {
        ...
        dwt_encode(image[0], 200, 165, 8);
        ...
    }
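The deinterleave loop on this slide is the one analyzed in the assembly slides. As a minimal sketch, a complete version of the function might look like the code below. Only the first loop (the even samples) comes from the slide. The odd-sample loop, the copy back, the `free`, the return code and the name `deinterleave_sketch` are assumptions added for illustration, not the project's actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: reorder n samples (stride x) so that the even-indexed samples
 * come first, followed by the odd-indexed samples.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int deinterleave_sketch(int *a, int n, int x)
{
    int sn = (n + 1) / 2;   /* number of even-indexed samples */
    int dn = n / 2;         /* number of odd-indexed samples  */
    int i;
    int *b;

    assert(a != NULL && n > 0 && x > 0);

    b = (int *)malloc((size_t)n * sizeof(int));
    if (b == NULL)
        return -1;

    for (i = 0; i < sn; i++)
        b[i] = a[2 * i * x];             /* loop shown on the slide */
    for (i = 0; i < dn; i++)
        b[sn + i] = a[(2 * i + 1) * x];  /* assumed: gather odd samples */
    for (i = 0; i < n; i++)
        a[i * x] = b[i];                 /* assumed: copy back in place */

    free(b);
    return 0;
}
```

With stride x = 1 and input {10, 11, 12, 13, 14}, the even samples 10, 12, 14 land in the first three positions and the odd samples 11, 13 in the last two.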
Assembly Code without any optimization

    ; 24 | void dwt_deinterleave(int *a, int n, int x)
    _dwt_deinterleave:
    ;** *
    ...
    ; 31 | for (i=0; i<sn; i++)
               ZERO.D2 B4                  ; |31|
               STW.D2T2 B4,*+SP(24)        ; |31|
               LDW.D2T2 *+SP(24),B5        ; |31|
               LDW.D2T2 *+SP(20),B4        ; |31|
               NOP 4
               CMPLT.L2 B5,B4,B0           ; |31|
       [!B0]   B.S1 L2                     ; |31|
               NOP 5
               ; BRANCH OCCURS             ; |31|
    L1:
            .line 9
    ; 32 | b[i]=a[2*i*x];
               LDW.D2T2 *+SP(24),B4        ; |32|
               LDW.D2T2 *+SP(12),B5        ; |32|
               LDW.D2T2 *+SP(4),B6         ; |32|
               NOP 2
               ADD.D2 B4,B4,B4
               MPYLH.M2 B5,B4,B8           ; |32|
               MPYLH.M2 B4,B5,B7           ; |32|
               MPYU.M2 B5,B4,B5            ; |32|
               ADD.D2 B8,B7,B4             ; |32|
               SHL.S2 B4,16,B4             ; |32|
               ADD.S2 B5,B4,B4             ; |32|
    ||         LDW.D2T2 *+SP(28),B7        ; |32|
               LDW.D2T2 *+B6[B4],B4        ; |32|
               LDW.D2T2 *+SP(24),B5        ; |32|
               NOP 4
               STW.D2T2 B4,*+B7[B5]        ; |32|
               LDW.D2T2 *+SP(24),B4        ; |32|
               NOP 4
               ADD.D2 1,B4,B4              ; |32|
               STW.D2T2 B4,*+SP(24)        ; |32|
               LDW.D2T2 *+SP(24),B5        ; |32|
               LDW.D2T2 *+SP(20),B4        ; |32|
               NOP 4
               CMPLT.L2 B5,B4,B0           ; |32|
       [ B0]   B.S1 L1                     ; |32|
               NOP 5
               ; BRANCH OCCURS             ; |32|
Assembly Code with speed optimization

    _dwt_deinterleave:
    …
    ;**
    ||         MV.D2 B4,B11
            .line 5
               MV.D2 B11,B0                ; |28|
               SHRU.S2 B0,31,B4            ; |28|
               ADD.D2 B4,B0,B4             ; |28|
               SHR.S2 B4,1,B0              ; |28|
               MV.D2 B0,B12                ; |28|
            .line 6
               ADD.D2 1,B11,B10            ; |29|
               SHRU.S2 B10,31,B4           ; |29|
               ADD.D2 B4,B10,B4            ; |29|
               SHR.S2 B4,1,B4              ; |29|
               MV.S1X B4,A12               ; |29|
            .line 7
               B.S1 _malloc                ; |30|
               MVKL.S2 RL0,B3              ; |30|
               SHL.S1X B11,2,A4            ; |30|
               MVKH.S2 RL0,B3              ; |30|
               NOP 2
    RL0:       ; CALL OCCURS               ; |30|
            .line 8
               CMPLT.L2 B10,2,B0
       [ B0]   B.S1 L2                     ; |31|
               MV.D2 B10,B4
       [!B0]   MV.D1 A4,A3
       [!B0]   MV.S1 A10,A0
               NOP 2
               ; BRANCH OCCURS             ; |31|
    ;** *
    ;**     U$22 = a;
    ;**     U$25 = b;
    ;**     L$1 = K$7>>1;
    ;**     X$4 = x<<3;
    ;**     #pragma MUST_ITERATE(1, , 1)
            .line 9
               SHR.S2 B4,1,B0              ; |32|
    ||         SHL.S1 A11,3,A6
    ;** g3:
    ;**     *U$25++ = *U$22;
    ;**     U$22 += X$4;
    ;**     if ( --L$1 ) goto g3;
               SUB.D2 B0,1,B0              ; |32|
    L1:
       [ B0]   B.S1 L1                     ; |32|
    ||         LDW.D1T1 *A0,A5             ; |32|
               ADD.S1 A6,A0,A0             ; |32|
       [ B0]   SUB.D2 B0,1,B0              ; |32|
               NOP 2
               STW.D1T1 A5,*A3++           ; |32|
               ; BRANCH OCCURS             ; |32|
    ;** *
    ...
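Speed optimized code analysis

    for (i=0; i<sn; i++) b[i]=a[2*i*x];

Assume sn = n+1.
6*(n+1) clock cycles are needed: each iteration occupies six cycles, because the branch issued in the first cycle of the loop body only takes effect after its five delay slots.

[Timing table: columns are clock cycles 0, …, 6i+1, 6i+2, 6i+3, 6i+4, 6i+5, 6i+6, …; rows track the loop counter (starting at sn-1), the source address (&a[0] … &a[2(i+1)x]), the loaded value (load a[2ix]), the destination address (&b[0] … &b[i+1]) and the store (b[i]=a[2ix]).]

Loop body:

       [ B0]   B.S1 L1                     ; |32|
    ||         LDW.D1T1 *A0,A5             ; |32|
               ADD.S1 A6,A0,A0             ; |32|
       [ B0]   SUB.D2 B0,1,B0              ; |32|
               NOP 2
               STW.D1T1 A5,*A3++           ; |32|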
Assembly Code with pipeline optimization

               SHR.S2 B4,1,B0
               CMPGT.L2 B0,6,B1
       [ B1]   B.S1 L2
               SHL.S1 A10,3,A3
       [!B1]   SUB.D2 B0,1,B0
               NOP 3
    ;** *
    ...
    ;** *
    L2:
               ADD.S1 A3,A4,A4
    ||         SUB.D2 B0,7,B0
    ||         LDW.D1T1 *A4,A6
    ;** *
    L3:        ; PIPELINED LOOP PRE-PROCESS
               MV.S2X A0,B4
    || [ B0]   B.S1 L4
    ||         ADD.L1 A3,A4,A0
    || [ B0]   SUB.D2 B0,1,B0
    ||         LDW.D1T1 *A4,A0

               ADD.L1 A3,A0,A0
    || [ B0]   SUB.D2 B0,1,B0
    ||         LDW.D1T1 *A0,A0
    || [ B0]   B.S1 L4

       [ B0]   B.S1 L4
    ||         ADD.L1 A3,A0,A0
    || [ B0]   SUB.D2 B0,1,B0
    ||         LDW.D1T1 *A0,A0

               ADD.L1 A3,A0,A0
    || [ B0]   SUB.D2 B0,1,B0
    ||         LDW.D1T1 *A0,A0
    || [ B0]   B.S1 L4

               MV.S2X A6,B5
    || [ B0]   B.S1 L4
    ||         ADD.L1 A3,A0,A4
    || [ B0]   SUB.D2 B0,1,B0
    ||         LDW.D1T1 *A0,A0
    ;** *
    L4:        ; PIPELINED LOOP
               STW.D2T2 B5,*B4++
    ||         MV.S2X A0,B5
    || [ B0]   B.S1 L4
    ||         ADD.L1 A3,A4,A4
    || [ B0]   SUB.L2 B0,1,B0
    ||         LDW.D1T1 *A4,A0
    ;** *
    L5:        ; PIPELINED LOOP PAST-PROCESS
               MV.S2X A0,B5
    ||         STW.D2T2 B5,*B4++

               MV.S2X A0,B5
    ||         STW.D2T2 B5,*B4++

               MV.S2X A0,B5
    ||         STW.D2T2 B5,*B4++

               MVC.S2 B6,CSR
    ||         MV.L2X A0,B5
    ||         STW.D2T2 B5,*B4++
    ;** *
               MV.S2X A0,B5
    ||         STW.D2T2 B5,*B4++

               STW.D2T2 B5,*B4++
    ;** *
Pipeline optimized code design

    for (i=0; i<sn; i++) b[i]=a[2*i*x];

Assume sn = n+1.
n+7 clock cycles are needed: the loop kernel L4 is a single execute packet, so one store completes per cycle once the pipeline is full, and the remaining cycles fill and drain the pipeline.

[Timing table, rows as recovered from the slide:
  Clock cycle:          … n-5 … n+1 n+2 n+3 n+4 n+5 n+6 n+7
  Loop counter:         sn-7 sn-8 sn-9 sn-10 sn-11 sn-12 sn-13 … 0 …
  Source address:       &a[0] &a[2x] &a[4x] &a[6x] &a[8x] &a[10x] &a[12x] &a[14x] … &a[2(n+1)x]
  Loaded value:         a[0] a[2x] … a[2(n-5)x] a[2(n-4)x] a[2(n-3)x] a[2(n-2)x] a[2(n-1)x] a[2nx]
  Destination address:  &b[0] &b[1] … &b[n-5] &b[n-4] &b[n-3] &b[n-2] &b[n-1] &b[n]
  Store:                b[0]=a[0] … b[n-6]=a[2(n-6)x] b[n-5]=a[2(n-5)x] b[n-4]=a[2(n-4)x] b[n-3]=a[2(n-3)x] b[n-2]=a[2(n-2)x] b[n-1]=a[2(n-1)x] b[n]=a[2nx]]

Loop kernel:

    L4:        ; PIPELINED LOOP
               STW.D2T2 B5,*B4++
    ||         MV.S2X A0,B5
    || [ B0]   B.S1 L4
    ||         ADD.L1 A3,A4,A4
    || [ B0]   SUB.L2 B0,1,B0
    ||         LDW.D1T1 *A4,A0
Comparison: speed-optimized code (by C6) vs. pipeline-optimized code (by us)

    for (i=0; i<sn; i++) b[i]=a[2*i*x];

Assume sn = n+1.
Result:
  Speed-optimized code uses 6(n+1) clock cycles.
  Pipeline-optimized code uses n+7 clock cycles.
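To make the comparison concrete, the two cycle counts can be evaluated for a given n. The sketch below only encodes the two formulas quoted on this slide; the function names are hypothetical, and the numbers are the slide's model, not measurements.

```c
#include <assert.h>

/* Cycle count quoted for the speed-optimized loop: 6(n+1). */
long cycles_speed_opt(long n)
{
    assert(n >= 0);
    return 6 * (n + 1);
}

/* Cycle count quoted for the pipeline-optimized loop: n+7. */
long cycles_pipelined(long n)
{
    assert(n >= 0);
    return n + 7;
}

/* Ratio of the two counts; approaches 6 as n grows. */
double pipeline_speedup(long n)
{
    return (double)cycles_speed_opt(n) / (double)cycles_pipelined(n);
}
```

For n = 99 (100 loop iterations) the model gives 600 cycles against 106, a ratio of roughly 5.7. For very short loops the advantage shrinks because the pipeline fill and drain cycles dominate.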
JPEG2000 Lossy Image Compression
Encoder: DWT -> Quantizer -> Entropy Coder
1-Level Wavelet Decomposition (2D DWT)
Each stage: x[n] -> filter Hi -> decimator (2: keep one out of two pixels) -> y[n], with H1 the low-pass filter and H2 the high-pass filter.
[Figure: the input image is filtered by H1 and H2 and decimated by 2 in row-wise operations, then filtered by H1 and H2 and decimated by 2 again in column-wise operations, producing the LL, HL, LH and HH components.]
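The filter-then-decimate structure on this slide can be sketched in C as follows. This is an illustration only: it assumes the simple Haar pair (average as low pass, difference as high pass) in place of H1 and H2, which is not the filter used by JPEG 2000 or by the project code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One 1-D step over n samples with stride x (n must be even).
 * Low-pass outputs fill the first half, high-pass outputs the second. */
static void haar_1d(int *a, int n, int x)
{
    int half = n / 2;
    int i;
    int *b;

    assert(a != NULL && n > 0 && n % 2 == 0 && x > 0);
    b = (int *)malloc((size_t)n * sizeof(int));
    assert(b != NULL);

    for (i = 0; i < half; i++) {
        int even = a[2 * i * x];
        int odd  = a[(2 * i + 1) * x];
        b[i]        = (even + odd) / 2;  /* low pass, one output per pair  */
        b[half + i] = even - odd;        /* high pass, one output per pair */
    }
    for (i = 0; i < n; i++)
        a[i * x] = b[i];
    free(b);
}

/* One decomposition level on a w-by-h image stored row by row:
 * row-wise operations first, then column-wise operations. */
void haar_2d_one_level(int *img, int w, int h)
{
    int r, c;
    for (r = 0; r < h; r++)
        haar_1d(img + r * w, w, 1);   /* rows: stride 1    */
    for (c = 0; c < w; c++)
        haar_1d(img + c, h, w);       /* columns: stride w */
}
```

After the call, the top-left quadrant holds the LL component, the top-right HL, the bottom-left LH and the bottom-right HH. A constant image leaves all energy in LL and zeros elsewhere.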
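Multi-Level Wavelet Decomposition
[Figure: the first-level decomposition (LL, HL1, LH1, HH1) is passed through the 2D-DWT again, which splits its LL component into LL, HL2, LH2 and HH2 while HL1, LH1 and HH1 are kept.]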
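Because each level operates only on the previous LL component, the region processed shrinks by about half in each dimension per level. The sketch below shows one way to compute that size. It assumes `int_ceildivpow2(a, b)` from the C code slide means ceil(a / 2^b); that definition is not shown in the slides, and `ceildivpow2_sketch` is a hypothetical stand-in.

```c
#include <assert.h>

/* Assumed meaning of int_ceildivpow2: a divided by 2^b, rounded up. */
int ceildivpow2_sketch(int a, int b)
{
    /* Bounds keep the addition below from overflowing a 32-bit int. */
    assert(a >= 0 && a < (1 << 30) && b >= 0 && b <= 15);
    return (a + (1 << b) - 1) >> b;
}
```

For the 200 x 165 image passed to `dwt_encode` on the C code slide, this gives regions of 200 x 165 at level 0, 100 x 83 at level 1, 50 x 42 at level 2 and 25 x 21 at level 3.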
Thanks! Questions?