ECE 734 VLSI Array Structures for Digital Signal Processing. Topic: Implementation of JPEG 2000 component algorithm (DWT) in TI TMS32060. Team Members: Peng Zhang and Xun Zhang.

Presentation transcript:

ECE 734 VLSI Array Structures for Digital Signal Processing
Topic: Implementation of JPEG 2000 component algorithm (DWT) in TI TMS32060
Team Members: Peng Zhang and Xun Zhang
Advisor: Yu Hen Hu
Spring 2004

Agenda
  Abstract
  DWT C implementation
  DWT TMS320 C62 assembly code
    Without optimization
    Speed optimization
    Pipeline optimization (by us)
  Result comparison
  JPEG 2000 and DWT (if we have free time)

Abstract
In this project, we implement and optimize the DWT algorithm, a key algorithm in JPEG2000, on the TI TMS320C62 platform.
  Step 1: we implemented the 2D DWT algorithm in C.
  Step 2: we built the 2D DWT for the TI TMS320C62 platform twice: once without any optimization and once with the fastest speed optimization.
  Step 3: we applied further optimization to the assembly code, mainly pipelining.
  Step 4: we compared the performance before and after our optimization.

C code Implementation

...
#define S(i) a[x*(i)*2]
...
void dwt_deinterleave(int *a, int n, int x) {
    int dn, sn, i;
    int *b;
    dn=n/2;
    sn=(n+1)/2;
    b=(int*)malloc(n*sizeof(int));
    for (i=0; i<sn; i++)
        b[i]=a[2*i*x];
    ...
}

/// Forward wavelet transform in 1-D.
void dwt_encode_1(int *a, int n, int x) {
    ...
    dwt_deinterleave(a, n, x);
}

/// Forward wavelet transform in 2-D.
void dwt_encode(int *a, int w, int h, int l) {
    int i, j, rw, rh;
    for (i=0; i<l; i++) {
        rw=int_ceildivpow2(w, i);
        rh=int_ceildivpow2(h, i);
        for (j=0; j<rw; j++)
            dwt_encode_1(a+j, rh, w);
        ...
    }
}

void main() {
    ...
    dwt_encode(image[0], 200, 165, 8);
    ...
}
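The listing elides the rest of dwt_deinterleave. As a minimal sketch, assuming the elided part places the odd-indexed samples after the even-indexed ones and then writes the buffer back with stride x (an assumption; only the first loop appears on the slide), the whole routine might look like the following. The name deinterleave_sketch and the return code are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical completion of the deinterleave step. Gathers the
 * even-indexed samples (stride x) into the first half of the line and
 * the odd-indexed samples into the second half. Only the first loop is
 * shown in the original listing; the other two loops are assumptions. */
int deinterleave_sketch(int *a, int n, int x)
{
    int dn = n / 2;        /* number of odd-indexed samples  */
    int sn = (n + 1) / 2;  /* number of even-indexed samples */
    int i;
    int *b;

    assert(a != NULL && n > 0 && x > 0);
    b = (int *)malloc((size_t)n * sizeof(int));
    if (b == NULL)
        return -1;                      /* allocation failed */

    for (i = 0; i < sn; i++)            /* loop shown on the slide */
        b[i] = a[2 * i * x];
    for (i = 0; i < dn; i++)            /* assumed: odd samples follow */
        b[sn + i] = a[(2 * i + 1) * x];
    for (i = 0; i < n; i++)             /* assumed: write back in place */
        a[i * x] = b[i];

    free(b);
    return 0;
}
```

With x = 1 and the five samples 0, 1, 2, 3, 4, the line becomes 0, 2, 4, 1, 3. With x equal to the image width, the same call reorders one column.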

Assembly Code without any optimization

; 24 | void dwt_deinterleave(int *a, int n, int x)
_dwt_deinterleave:
;** *
...
; 31 | for (i=0; i<sn; i++)
        ZERO    .D2     B4                ; |31|
        STW     .D2T2   B4,*+SP(24)       ; |31|
        LDW     .D2T2   *+SP(24),B5       ; |31|
        LDW     .D2T2   *+SP(20),B4       ; |31|
        NOP             4
        CMPLT   .L2     B5,B4,B0          ; |31|
 [!B0]  B       .S1     L2                ; |31|
        NOP             5
        ; BRANCH OCCURS                   ; |31|
L1:
        .line   9
; 32 | b[i]=a[2*i*x];
        LDW     .D2T2   *+SP(24),B4       ; |32|
        LDW     .D2T2   *+SP(12),B5       ; |32|
        LDW     .D2T2   *+SP(4),B6        ; |32|
        NOP             2
        ADD     .D2     B4,B4,B4
        MPYLH   .M2     B5,B4,B8          ; |32|
        MPYLH   .M2     B4,B5,B7          ; |32|
        MPYU    .M2     B5,B4,B5          ; |32|
        ADD     .D2     B8,B7,B4          ; |32|
        SHL     .S2     B4,16,B4          ; |32|
        ADD     .S2     B5,B4,B4          ; |32|
||      LDW     .D2T2   *+SP(28),B7       ; |32|
        LDW     .D2T2   *+B6[B4],B4       ; |32|
        LDW     .D2T2   *+SP(24),B5       ; |32|
        NOP             4
        STW     .D2T2   B4,*+B7[B5]       ; |32|
        LDW     .D2T2   *+SP(24),B4       ; |32|
        NOP             4
        ADD     .D2     1,B4,B4           ; |32|
        STW     .D2T2   B4,*+SP(24)       ; |32|
        LDW     .D2T2   *+SP(24),B5       ; |32|
        LDW     .D2T2   *+SP(20),B4       ; |32|
        NOP             4
        CMPLT   .L2     B5,B4,B0          ; |32|
 [ B0]  B       .S1     L1                ; |32|
        NOP             5
        ; BRANCH OCCURS                   ; |32|

Assembly Code with speed optimization

_dwt_deinterleave:
…
;**
||      MV      .D2     B4,B11
        .line   5
        MV      .D2     B11,B0            ; |28|
        SHRU    .S2     B0,31,B4          ; |28|
        ADD     .D2     B4,B0,B4          ; |28|
        SHR     .S2     B4,1,B0           ; |28|
        MV      .D2     B0,B12            ; |28|
        .line   6
        ADD     .D2     1,B11,B10         ; |29|
        SHRU    .S2     B10,31,B4         ; |29|
        ADD     .D2     B4,B10,B4         ; |29|
        SHR     .S2     B4,1,B4           ; |29|
        MV      .S1X    B4,A12            ; |29|
        .line   7
        B       .S1     _malloc           ; |30|
        MVKL    .S2     RL0,B3            ; |30|
        SHL     .S1X    B11,2,A4          ; |30|
        MVKH    .S2     RL0,B3            ; |30|
        NOP             2
RL0:    ; CALL OCCURS                     ; |30|
        .line   8
        CMPLT   .L2     B10,2,B0
 [ B0]  B       .S1     L2                ; |31|
        MV      .D2     B10,B4
 [!B0]  MV      .D1     A4,A3
 [!B0]  MV      .S1     A10,A0
        NOP             2
        ; BRANCH OCCURS                   ; |31|
;** *
;**     U$22 = a;
;**     U$25 = b;
;**     L$1 = K$7>>1;
;**     X$4 = x<<3;
;**     #pragma MUST_ITERATE(1, , 1)
        .line   9
        SHR     .S2     B4,1,B0           ; |32|
||      SHL     .S1     A11,3,A6
;** g3:
;**     *U$25++ = *U$22;
;**     U$22 += X$4;
;**     if ( --L$1 ) goto g3;
        SUB     .D2     B0,1,B0           ; |32|
L1:
 [ B0]  B       .S1     L1                ; |32|
||      LDW     .D1T1   *A0,A5            ; |32|
        ADD     .S1     A6,A0,A0          ; |32|
 [ B0]  SUB     .D2     B0,1,B0           ; |32|
        NOP             2
        STW     .D1T1   A5,*A3++          ; |32|
        ; BRANCH OCCURS                   ; |32|
;** *
...
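The compiler comments in this listing (U$22, U$25, L$1, X$4) describe a pointer-walking loop in which the multiply 2*i*x is replaced by a fixed stride added on every pass. A minimal C sketch of that transformation follows; the function name gather_even_strided is hypothetical, and the stride is expressed in ints rather than bytes (x<<3 bytes corresponds to 2*x ints when int is 4 bytes wide).

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-walking form of:  for (i = 0; i < sn; i++) b[i] = a[2*i*x];
 * The index multiply is replaced by adding a constant stride. */
void gather_even_strided(const int *a, int *b, int sn, int x)
{
    const int *src = a;                    /* role of U$22 */
    int *dst = b;                          /* role of U$25 */
    ptrdiff_t stride = (ptrdiff_t)2 * x;   /* role of X$4, counted in ints */
    int count = sn;                        /* role of L$1 */

    assert(a != NULL && b != NULL && x > 0);
    if (count < 1)
        return;                 /* mirrors the guard branch before the loop */

    for (;;) {
        *dst++ = *src;          /* *U$25++ = *U$22; */
        if (--count == 0)       /* if ( --L$1 ) goto g3; */
            break;
        src += stride;          /* U$22 += X$4; skipped after the last copy
                                   so the pointer never leaves the array */
    }
}
```

Unlike the generated code, this sketch tests the counter before advancing the source pointer, which keeps the pointer arithmetic within the array in portable C.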

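Speed optimized code analysis

for (i=0; i<sn; i++) b[i]=a[2*i*x];

Assume sn = n+1. One pass through the loop below takes 6 cycles (branch issued with the load, add, subtract, two NOP cycles, store), so 6*(n+1) clock cycles are needed.

Schedule of iteration i, cycles 6i+1 to 6i+6 (rows of the slide's timing table):
  Loop counter:   starts at sn-1
  Load address:   &a[0] … &a[2(i+1)x] …
  Loaded value:   a[2ix]
  Store address:  &b[0] … &b[i+1] …
  Store:          b[i]=a[2ix]

L1:
 [ B0]  B       .S1     L1                ; |32|
||      LDW     .D1T1   *A0,A5            ; |32|
        ADD     .S1     A6,A0,A0          ; |32|
 [ B0]  SUB     .D2     B0,1,B0           ; |32|
        NOP             2
        STW     .D1T1   A5,*A3++          ; |32|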

Assembly Code with pipeline optimization

        SHR     .S2     B4,1,B0
        CMPGT   .L2     B0,6,B1
 [ B1]  B       .S1     L2
        SHL     .S1     A10,3,A3
 [!B1]  SUB     .D2     B0,1,B0
        NOP             3
;** *
...
;** *
L2:
        ADD     .S1     A3,A4,A4
||      SUB     .D2     B0,7,B0
||      LDW     .D1T1   *A4,A6
;** *
L3:     ; PIPELINED LOOP PRE-PROCESS
        MV      .S2X    A0,B4
|| [ B0] B      .S1     L4
||      ADD     .L1     A3,A4,A0
|| [ B0] SUB    .D2     B0,1,B0
||      LDW     .D1T1   *A4,A0

        ADD     .L1     A3,A0,A0
|| [ B0] SUB    .D2     B0,1,B0
||      LDW     .D1T1   *A0,A0
|| [ B0] B      .S1     L4

 [ B0]  B       .S1     L4
||      ADD     .L1     A3,A0,A0
|| [ B0] SUB    .D2     B0,1,B0
||      LDW     .D1T1   *A0,A0

        ADD     .L1     A3,A0,A0
|| [ B0] SUB    .D2     B0,1,B0
||      LDW     .D1T1   *A0,A0
|| [ B0] B      .S1     L4

        MV      .S2X    A6,B5
|| [ B0] B      .S1     L4
||      ADD     .L1     A3,A0,A4
|| [ B0] SUB    .D2     B0,1,B0
||      LDW     .D1T1   *A0,A0
;** *
L4:     ; PIPELINED LOOP
        STW     .D2T2   B5,*B4++
||      MV      .S2X    A0,B5
|| [ B0] B      .S1     L4
||      ADD     .L1     A3,A4,A4
|| [ B0] SUB    .L2     B0,1,B0
||      LDW     .D1T1   *A4,A0
;** *
L5:     ; PIPELINED LOOP PAST-PROCESS
        MV      .S2X    A0,B5
||      STW     .D2T2   B5,*B4++

        MV      .S2X    A0,B5
||      STW     .D2T2   B5,*B4++

        MV      .S2X    A0,B5
||      STW     .D2T2   B5,*B4++

        MVC     .S2     B6,CSR
||      MV      .L2X    A0,B5
||      STW     .D2T2   B5,*B4++
;** *
        MV      .S2X    A0,B5
||      STW     .D2T2   B5,*B4++

        STW     .D2T2   B5,*B4++
;** *

Pipeline optimized code design

for (i=0; i<sn; i++) b[i]=a[2*i*x];

Assume sn = n+1. n+7 clock cycles are needed.

Pipeline schedule (rows of the slide's timing table):
  Cycle:          … n-5 … n+1 n+2 n+3 n+4 n+5 n+6 n+7
  Loop counter:   sn-7 sn-8 sn-9 sn-10 sn-11 sn-12 sn-13 … 0 …
  Load address:   &a[0] &a[2x] &a[4x] &a[6x] &a[8x] &a[10x] &a[12x] &a[14x] … &a[2(n+1)x]
  Loaded value:   a[0] a[2x] … a[2(n-5)x] a[2(n-4)x] a[2(n-3)x] a[2(n-2)x] a[2(n-1)x] a[2nx]
  Store address:  &b[0] &b[1] … &b[n-5] &b[n-4] &b[n-3] &b[n-2] &b[n-1] &b[n]
  Store:          b[0]=a[0] … b[n-6]=a[2(n-6)x] b[n-5]=a[2(n-5)x] b[n-4]=a[2(n-4)x] b[n-3]=a[2(n-3)x] b[n-2]=a[2(n-2)x] b[n-1]=a[2(n-1)x] b[n]=a[2nx]

L4:     ; PIPELINED LOOP
        STW     .D2T2   B5,*B4++
||      MV      .S2X    A0,B5
|| [ B0] B      .S1     L4
||      ADD     .L1     A3,A4,A4
|| [ B0] SUB    .L2     B0,1,B0
||      LDW     .D1T1   *A4,A0
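The idea behind the pipelined loop is that the load for a later iteration is issued while the store for an earlier one completes, with a prolog to fill the pipeline and an epilog to drain it. The C sketch below illustrates that structure with only one value in flight; the slide's assembly keeps several loads in flight because LDW has delay slots. The function name gather_even_pipelined is hypothetical, and the sketch models the ordering only, not the cycle timing.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined shape of:  for (i = 0; i < sn; i++) b[i] = a[2*i*x];
 * Depth is one iteration here, purely for illustration. */
void gather_even_pipelined(const int *a, int *b, int sn, int x)
{
    int in_flight;   /* value loaded but not yet stored */
    int i;

    assert(a != NULL && b != NULL && x > 0);
    if (sn < 1)
        return;

    in_flight = a[0];                  /* prolog: first load, no store yet */

    for (i = 1; i < sn; i++) {         /* kernel: one load and one store   */
        int next = a[2 * i * x];       /*   load for iteration i           */
        b[i - 1] = in_flight;          /*   store for iteration i-1        */
        in_flight = next;
    }

    b[sn - 1] = in_flight;             /* epilog: drain the last value */
}
```

The result is identical to the plain loop; only the order in which loads and stores are issued changes, which is what lets the hardware version finish one element per cycle in the kernel.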

Comparison: speed-optimized code (by the C6 compiler) vs. pipeline-optimized code (by us)

for (i=0; i<sn; i++) b[i]=a[2*i*x];

Assume sn = n+1.

Result:
  Speed-optimized code uses 6(n+1) clock cycles.
  Pipeline-optimized code uses n+7 clock cycles.
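To make the two cycle counts concrete, the short sketch below evaluates both formulas for a given n. The function names are hypothetical, and the numbers cover only this copy loop under the slide's model (sn = n+1), not the whole transform.

```c
#include <assert.h>

/* Cycle count of the speed-optimized loop: 6 cycles per iteration. */
long cycles_speed_optimized(long n)
{
    assert(n >= 0);
    return 6 * (n + 1);
}

/* Cycle count of the pipelined loop: one cycle per iteration in the
 * kernel plus a fixed fill/drain overhead, n + 7 in total. */
long cycles_pipelined(long n)
{
    assert(n >= 0);
    return n + 7;
}
```

For a column of 165 samples, sn = 83 and therefore n = 82, giving 498 cycles for the speed-optimized loop against 89 for the pipelined one, a ratio of roughly 5.6. As n grows the ratio approaches 6.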

JPEG2000 Lossy Image Compression

[Figure: encoder block diagram. DWT, then Quantizer, then Entropy Coder.]

1-Level Wavelet Decomposition (2D DWT)

[Figure: the input image is filtered row-wise by H1 (low pass) and H2 (high pass), and each output is decimated by 2 (keep one out of two pixels). The same filter-and-decimate stage is then applied column-wise, producing the LL, HL, LH, and HH components. Each stage consists of a filter Hi taking x[n] to y[n], followed by a decimator.]
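The slides do not say which filter pair H1/H2 is used. As a hedged illustration of one filter-and-decimate pass along a single row or column, the sketch below assumes the reversible 5/3 lifting filter of JPEG 2000 with symmetric extension at the borders; the function names lift53_forward and floordiv are hypothetical, and the output is written to separate low-pass and high-pass arrays rather than in place.

```c
#include <assert.h>
#include <stddef.h>

/* Floor division for d > 0; avoids relying on right shifts of negatives. */
static int floordiv(int a, int d)
{
    int q = a / d;
    if ((a % d) != 0 && a < 0)
        q--;
    return q;
}

/* One level of a 1-D forward 5/3 lifting transform (assumed filter).
 * x: n input samples; low: (n+1)/2 outputs; high: n/2 outputs. */
void lift53_forward(const int *x, int n, int *low, int *high)
{
    int sn = (n + 1) / 2;   /* low-pass sample count  */
    int dn = n / 2;         /* high-pass sample count */
    int i;

    assert(x != NULL && low != NULL && n > 0);
    if (dn == 0) {          /* single sample: nothing to predict */
        low[0] = x[0];
        return;
    }
    assert(high != NULL);

    /* Predict: odd sample minus the mean of its even neighbours. */
    for (i = 0; i < dn; i++) {
        int left = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i]; /* mirror */
        high[i] = x[2 * i + 1] - floordiv(left + right, 2);
    }

    /* Update: even sample plus a rounded quarter of neighbouring details. */
    for (i = 0; i < sn; i++) {
        int dl = (i > 0) ? high[i - 1] : high[0];      /* mirror at left  */
        int dr = (i < dn) ? high[i] : high[dn - 1];    /* mirror at right */
        low[i] = x[2 * i] + floordiv(dl + dr + 2, 4);
    }
}
```

Applying such a pass to every row and then to every column of the result gives the four components named in the figure. A linear ramp such as 0, 1, ..., 7 yields high-pass values of zero everywhere except at the right border, where the mirrored extension breaks the ramp.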

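Multi-Level Wavelet Decomposition

[Figure: the first level splits the image into LL, HL1, LH1, and HH1. Applying the 2D-DWT again to the LL component yields LL, HL2, LH2, and HH2, while HL1, LH1, and HH1 are kept unchanged.]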
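Because each level operates only on the LL component of the previous one, the region being transformed roughly halves in each dimension per level. The sketch below assumes that the int_ceildivpow2(a, b) helper called in the C listing returns the ceiling of a / 2^b, as its name suggests; the source does not show its definition, and the name ceildivpow2_sketch is hypothetical.

```c
#include <assert.h>

/* Assumed behaviour of the helper: ceil(a / 2^b) for non-negative a. */
int ceildivpow2_sketch(int a, int b)
{
    assert(a >= 0 && b >= 0 && b < 31);
    return (int)(((long long)a + (1LL << b) - 1) >> b);
}
```

Under that assumption, the 200 x 165 image passed to dwt_encode has working sizes of 200 x 165, 100 x 83, 50 x 42, 25 x 21, 13 x 11, 7 x 6, 4 x 3, and 2 x 2 at levels 0 through 7.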

Thanks! Questions?