ECE 734 VLSI Array Structures for Digital Signal Processing Topic: Implementation of JPEG 2000 component algorithm—DWT in TI TMS32060 Team Members: Peng.


1 ECE 734 VLSI Array Structures for Digital Signal Processing Topic: Implementation of JPEG 2000 component algorithm—DWT in TI TMS32060 Team Members: Peng Zhang and Xun Zhang Advisor: Yu Hen Hu Spring 2004

2 Agenda: Abstract; DWT C implementation; DWT TMS320 C62 assembly code (without optimization, with speed optimization, with pipeline optimization by us); Result comparison; JPEG 2000 and DWT (if we have free time)

3 Abstract In this project, we implement and optimize the DWT algorithm, a key algorithm in JPEG2000, on the TI TMS320C62 platform. Step 1: we implemented the 2D DWT algorithm in C. Step 2: we implemented the 2D DWT algorithm on the TI TMS320C62 platform twice, once without any optimization and once with the fastest speed optimization. Step 3: we applied advanced optimization to the assembly code, mainly pipelining. Step 4: we compared the performance before and after our optimization.

4 C code implementation
...
#define S(i) a[x*(i)*2]
...
void dwt_deinterleave(int *a, int n, int x) {
    int dn, sn, i;
    int *b;
    dn = n/2;
    sn = (n+1)/2;
    b = (int*)malloc(n*sizeof(int));
    for (i=0; i
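The listing on this slide breaks off at the first loop. As a rough guide to what a routine with this signature typically does, here is a minimal sketch that assumes the even-indexed samples (stride `x`) are gathered into the first `sn` slots and the odd-indexed samples into the remaining `dn` slots. The loop bodies, the name `deinterleave_sketch`, and the error return are assumptions, not the authors' code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical completion, not the original slide code.
   a: samples spaced x ints apart; n: number of samples.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int deinterleave_sketch(int *a, int n, int x)
{
    int dn, sn, i;
    int *b;

    assert(a != NULL && n >= 0 && x >= 1);
    dn = n / 2;          /* number of odd-indexed samples  */
    sn = (n + 1) / 2;    /* number of even-indexed samples */
    b = (int *)malloc((size_t)n * sizeof(int));
    if (b == NULL && n > 0)
        return -1;

    for (i = 0; i < sn; i++)        /* even samples go to the front */
        b[i] = a[2 * i * x];
    for (i = 0; i < dn; i++)        /* odd samples follow */
        b[sn + i] = a[(2 * i + 1) * x];
    for (i = 0; i < n; i++)         /* copy back with the same stride */
        a[i * x] = b[i];

    free(b);
    return 0;
}
```

With `x` equal to 1 the routine works along a row; with `x` equal to the row length it works down a column of a row-major image.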

5 Assembly Code without any optimization ; ; 24 | void dwt_deinterleave(int *a, int n, int x) ; _dwt_deinterleave: ;** *... ; ; 31 | for (i=0; i

6 Assembly Code with speed optimization
_dwt_deinterleave:
…
;**
||      MV.D2 B4,B11
        .line 5
        MV.D2 B11,B0 ; |28|
        SHRU.S2 B0,31,B4 ; |28|
        ADD.D2 B4,B0,B4 ; |28|
        SHR.S2 B4,1,B0 ; |28|
        MV.D2 B0,B12 ; |28|
        .line 6
        ADD.D2 1,B11,B10 ; |29|
        SHRU.S2 B10,31,B4 ; |29|
        ADD.D2 B4,B10,B4 ; |29|
        SHR.S2 B4,1,B4 ; |29|
        MV.S1X B4,A12 ; |29|
        .line 7
        B.S1 _malloc ; |30|
        MVKL.S2 RL0,B3 ; |30|
        SHL.S1X B11,2,A4 ; |30|
        MVKH.S2 RL0,B3 ; |30|
        NOP 2
RL0:    ; CALL OCCURS ; |30|
        .line 8
        CMPLT.L2 B10,2,B0
[ B0]   B.S1 L2 ; |31|
        MV.D2 B10,B4
[!B0]   MV.D1 A4,A3
[!B0]   MV.S1 A10,A0
        NOP 2
        ; BRANCH OCCURS ; |31|
;** *
;** U$22 = a;
;** U$25 = b;
;** L$1 = K$7>>1;
;** X$4 = x<<3;
;** #pragma MUST_ITERATE(1, , 1)
        .line 9
        SHR.S2 B4,1,B0 ; |32|
||      SHL.S1 A11,3,A6
;** g3:
;** *U$25++ = *U$22;
;** U$22 += X$4;
;** if ( --L$1 ) goto g3;
        SUB.D2 B0,1,B0 ; |32|
L1:
[ B0]   B.S1 L1 ; |32|
||      LDW.D1T1 *A0,A5 ; |32|
        ADD.S1 A6,A0,A0 ; |32|
[ B0]   SUB.D2 B0,1,B0 ; |32|
        NOP 2
        STW.D1T1 A5,*A3++ ; |32|
        ; BRANCH OCCURS ; |32|
;** *...
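The `;**` comments in this listing show how the compiler rewrote the first copy loop: the index multiplication is replaced by a pointer that advances by a fixed step (`X$4 = x<<3` bytes, which is `2*x` ints if `int` is 4 bytes), and the loop counts down to zero. A minimal C sketch of that strength-reduced form follows; the function name `gather_even_strided` and its parameters are hypothetical, and the loop is arranged so the source pointer never advances past the last element read.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the loop shape in the compiler comments:
       *U$25++ = *U$22;  U$22 += X$4;  if (--L$1) goto g3;
   Equivalent to: for (i = 0; i < count; i++) dst[i] = src[2*i*x];
   count must be at least 1, mirroring MUST_ITERATE(1, , 1). */
void gather_even_strided(int *dst, const int *src, int count, int x)
{
    const int *p = src;                  /* plays the role of U$22 */
    int *q = dst;                        /* plays the role of U$25 */
    int trips = count;                   /* plays the role of L$1  */
    ptrdiff_t step = 2 * (ptrdiff_t)x;   /* X$4 expressed in ints, not bytes */

    assert(dst != NULL && src != NULL && count >= 1 && x >= 1);
    for (;;) {
        *q++ = *p;
        if (--trips == 0)
            break;                       /* skip the final pointer bump */
        p += step;
    }
}
```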

7 Speed optimized code analysis for (i=0; i

8 Assembly Code with pipeline optimization
        SHR.S2 B4,1,B0
        CMPGT.L2 B0,6,B1
[ B1]   B.S1 L2
        SHL.S1 A10,3,A3
[!B1]   SUB.D2 B0,1,B0
        NOP 3
;** *...
;** *
L2:
        ADD.S1 A3,A4,A4
||      SUB.D2 B0,7,B0
||      LDW.D1T1 *A4,A6
;** *
L3:     ; PIPELINED LOOP PRE-PROCESS
        MV.S2X A0,B4
|| [ B0] B.S1 L4
||      ADD.L1 A3,A4,A0
|| [ B0] SUB.D2 B0,1,B0
||      LDW.D1T1 *A4,A0
        ADD.L1 A3,A0,A0
|| [ B0] SUB.D2 B0,1,B0
||      LDW.D1T1 *A0,A0
|| [ B0] B.S1 L4
[ B0]   B.S1 L4
||      ADD.L1 A3,A0,A0
|| [ B0] SUB.D2 B0,1,B0
||      LDW.D1T1 *A0,A0
        ADD.L1 A3,A0,A0
|| [ B0] SUB.D2 B0,1,B0
||      LDW.D1T1 *A0,A0
|| [ B0] B.S1 L4
        MV.S2X A6,B5
|| [ B0] B.S1 L4
||      ADD.L1 A3,A0,A4
|| [ B0] SUB.D2 B0,1,B0
||      LDW.D1T1 *A0,A0
;** *
L4:     ; PIPELINED LOOP
        STW.D2T2 B5,*B4++
||      MV.S2X A0,B5
|| [ B0] B.S1 L4
||      ADD.L1 A3,A4,A4
|| [ B0] SUB.L2 B0,1,B0
||      LDW.D1T1 *A4,A0
;** *
L5:     ; PIPELINED LOOP PAST-PROCESS
        MV.S2X A0,B5
||      STW.D2T2 B5,*B4++
        MV.S2X A0,B5
||      STW.D2T2 B5,*B4++
        MV.S2X A0,B5
||      STW.D2T2 B5,*B4++
        MVC.S2 B6,CSR
||      MV.L2X A0,B5
||      STW.D2T2 B5,*B4++
;** *
        MV.S2X A0,B5
||      STW.D2T2 B5,*B4++
        STW.D2T2 B5,*B4++
;** *
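The listing has the usual three parts of a software-pipelined loop: a prologue (L3) that starts several loads before any store, a single-cycle kernel (L4) that issues one load and one store per trip, and an epilogue (L5) that drains the loads still in flight. The same schedule can be modeled in C as a sketch. The depth of 5 assumes a C62x load result becomes usable five cycles after issue (four delay slots), and the name `gather_pipelined` and the ring buffer are illustrative only, not a translation of the assembly.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 5  /* assumed number of loads in flight (LDW + 4 delay slots) */

/* Copies src[0], src[stride], src[2*stride], ... into dst[0..count-1],
   keeping up to DEPTH loads "in flight" ahead of the matching stores. */
void gather_pipelined(int *dst, const int *src, int count, int stride)
{
    int inflight[DEPTH];   /* models registers holding loaded values */
    int issued = 0;        /* loads started so far  */
    int stored = 0;        /* stores finished so far */

    assert(dst != NULL && src != NULL && count >= 0 && stride >= 1);

    /* Prologue: fill the pipeline with loads only. */
    while (issued < DEPTH && issued < count) {
        inflight[issued % DEPTH] = src[issued * stride];
        issued++;
    }
    /* Kernel: retire the oldest load, then reuse its slot for a new load. */
    while (issued < count) {
        dst[stored] = inflight[stored % DEPTH];
        stored++;
        inflight[issued % DEPTH] = src[issued * stride];
        issued++;
    }
    /* Epilogue: drain the remaining loads with stores only. */
    while (stored < count) {
        dst[stored] = inflight[stored % DEPTH];
        stored++;
    }
}
```

On the real hardware the kernel's load and store execute in the same cycle on different functional units; the C model only reproduces the ordering, not the timing.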

9 Pipeline optimized code design for (i=0; i

10 Comparison of speed-optimized code (by C6) vs. pipeline-optimized code (by us) for (i=0; i

11 JPEG2000 Lossy Image Compression. Figure (encoder block diagram): DWT -> Quantizer -> Entropy Coder.

12 1-Level Wavelet Decomposition (2D DWT). Figure: the input image passes through filters H1 and H2 (labeled low pass and high pass) in row-wise operations and then again in column-wise operations; each filter Hi takes x[n] to y[n] and is followed by a decimator by 2 (keep one out of two pixels). The outputs are the LL, HL, LH, and HH components.
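The row-then-column structure in this figure can be sketched in C. The sketch below uses an unnormalized Haar pair (sum and difference of neighboring samples) purely as a stand-in for H1 and H2; the filters actually used in JPEG2000 are different, and the names `analyze_1d` and `dwt2d_one_level` are hypothetical. The image is assumed to be row-major with `w` columns and `h` rows.

```c
#include <assert.h>
#include <stdlib.h>

/* Filter + decimate n samples spaced x ints apart, in place.
   Output layout: (n+1)/2 low-pass values, then n/2 high-pass values. */
static int analyze_1d(int *a, int n, int x)
{
    int sn = (n + 1) / 2;
    int dn = n / 2;
    int i;
    int *b = (int *)malloc((size_t)n * sizeof(int));

    if (b == NULL)
        return -1;
    for (i = 0; i < dn; i++) {
        int even = a[2 * i * x];
        int odd = a[(2 * i + 1) * x];
        b[i] = even + odd;        /* stand-in low pass  */
        b[sn + i] = even - odd;   /* stand-in high pass */
    }
    if (sn > dn)                  /* odd length: pair the last sample with itself */
        b[sn - 1] = 2 * a[(n - 1) * x];
    for (i = 0; i < n; i++)
        a[i * x] = b[i];
    free(b);
    return 0;
}

/* One decomposition level: every row first, then every column.
   Afterwards LL is top-left, HL top-right, LH bottom-left, HH bottom-right. */
int dwt2d_one_level(int *img, int w, int h)
{
    int r, c;

    assert(img != NULL && w >= 1 && h >= 1);
    for (r = 0; r < h; r++)
        if (analyze_1d(img + r * w, w, 1) != 0)
            return -1;
    for (c = 0; c < w; c++)
        if (analyze_1d(img + c, h, w) != 0)
            return -1;
    return 0;
}
```

For a constant image, only the LL quadrant is nonzero after the transform, which is a quick sanity check on the subband layout.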

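13 Multi-Level Wavelet Decomposition. Figure: the first level yields LL, HL1, LH1, and HH1; applying the 2D-DWT again to the LL band yields LL, HL2, LH2, and HH2, while HL1, LH1, and HH1 are kept unchanged.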
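Because each level transforms only the LL band of the previous one, the region being processed shrinks by about half in each dimension per level. A small sketch of that bookkeeping, assuming the low-pass half of an `n`-sample line holds `(n+1)/2` samples as in the `sn` computation on slide 4; the type `band_size` and the function `ll_size_after` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int w;  /* width of the LL band in pixels  */
    int h;  /* height of the LL band in pixels */
} band_size;

/* Size of the LL band left after `levels` decomposition levels. */
band_size ll_size_after(int w, int h, int levels)
{
    band_size s;

    assert(w >= 1 && h >= 1 && levels >= 0);
    s.w = w;
    s.h = h;
    while (levels-- > 0) {
        s.w = (s.w + 1) / 2;  /* low-pass sample count per row    */
        s.h = (s.h + 1) / 2;  /* low-pass sample count per column */
    }
    return s;
}
```

A multi-level driver would call the one-level transform on the top-left `s.w` by `s.h` region at each step, leaving the detail bands from earlier levels untouched.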

14 Thanks! Questions?

