Multi-core SOC for Future Media Processing

Qin Xing, Yan Xiaolang
The Institute of VLSI Design, Zhejiang University

Outline
- Opportunities & challenges from media processing
- Multimedia algorithm characteristics & mapping
- Multi-core SOC architecture & technology
- Benchmarking results
- Project status
- Future work
The Institute of VLSI Design, Zhejiang Univ. 2018/9/17

Opportunities
- Video conference
- IP phone
- Smart terminal
- PDA
- Video camera
- HDTV
- Set-top box
- …

Challenges: multiple standards
[Figure: bit rate (Mbit/s, axis from 1 to 6) versus year (1994 to 2005) for five encoder generations, 1st through 5th. Standards labelled on the chart: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 Part 10, WMV, VP3, AVS.]

Challenges: excellent hardware
- Very high computational complexity: H.264 encoding of 720 x frames/s needs up to 30 GOPS
- Multiple standards co-exist
- Demand for flexibility & programmability
- Low power
- Low cost
- Best choice: application-specific instruction-set processor (ASIP)

Multimedia algorithm characteristics
Outer loop and inner loop
- Outer loop: interface (GUI), OS (Linux), bit-stream parsing (pack/unpack, VLC, CABAC), data transfer
- Inner loop: regular algorithms (prediction, FIR, DCT, motion estimation)

Multimedia algorithm mapping
Programmable, heterogeneous processors are the preferred choice for the implementation:
- General MCU (RISC core): outer loop
- Enhanced DSP (EDSP, with added bit-wise operations): outer loop
- Vector processor (VP, VLIW + SIMD): inner loop

Multi-core SOC architecture
[Figure: top-level architecture. Labelled block: media processing kernel.]

Inside the media processing kernel
[Figure: block diagram of the media processing kernel. Labelled blocks: GAG1 to GAG4; GDM; GTM; V-DM1 to V-DM4; EDSP control path; vector control path; DMA and off-chip memories; 2D crossbar connection network; E-DP; V-DP1 to V-DP4.]

Technologies: specialized instruction set
Adapt programmable processors to specific algorithms by introducing specialized instructions for frequently occurring operations of higher complexity.

Example: integer IDCT in H.264

C reference code (excerpt, as shown on the slide):

    for (j=0;j<BLOCK_SIZE;j++){
        for (i=0;i<BLOCK_SIZE;i++){
            m5[i]=img->cof[i0][j0][i][j];
        }
        m6[0]=(m5[0]+m5[2]);
        m6[1]=(m5[0]-m5[2]);
        m6[2]=(m5[1]>>1)-m5[3];
        m6[3]=m5[1]+(m5[3]>>1);

Intel MMX version (excerpt): 13 cycles

    __asm{
        mov      edx, mptr
        movdqu   xmm1, [edx]
        packssdw xmm1, xmm1      // read m5[0] from memory to xmm1
    }
    __asm{
        movdqu   xmm4, [edx+48]
        packssdw xmm4, xmm4      // read m5[3] from memory
    }
    __asm{
        movq     xmm5, xmm1
        psubw    xmm1, xmm3      // m6[1]=(m5[0]-m5[2]);
        paddw    xmm3, xmm5      // m6[0]=(m5[0]+m5[2]);
        movq     xmm5, xmm2
        psraw    xmm2, 1
        psubw    xmm2, xmm4      // m6[2]=(m5[1]>>1)-m5[3]
        psraw    xmm4, 1
        paddw    xmm4, xmm5      // m6[3]=m5[1]+(m5[3]>>1)
    }

Our instruction set: 6 cycles
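For readers who want to run the butterfly from this slide, here is a minimal C sketch of one 1-D pass of the 4-point inverse integer transform. The first stage (m5 to m6) is taken from the slide's C excerpt. The second stage, which combines m6 into the output, is not shown on the slide and follows the usual H.264 inverse-transform form. The function name `idct4_1d` and the flat array interface are assumptions made for illustration, not the authors' code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 1-D pass of the 4-point inverse integer transform used by H.264.
 * Hypothetical helper for illustration; not taken from the presentation.
 * Note: >> on a negative signed value is implementation-defined in C;
 * this sketch assumes an arithmetic shift, as common compilers provide. */
void idct4_1d(const int32_t m5[4], int32_t out[4])
{
    int32_t m6[4];

    assert(m5 != NULL && out != NULL);

    /* Stage 1: the butterfly shown on the slide. */
    m6[0] = m5[0] + m5[2];
    m6[1] = m5[0] - m5[2];
    m6[2] = (m5[1] >> 1) - m5[3];
    m6[3] = m5[1] + (m5[3] >> 1);

    /* Stage 2: combine even and odd parts (not shown on the slide). */
    out[0] = m6[0] + m6[3];
    out[1] = m6[1] + m6[2];
    out[2] = m6[1] - m6[2];
    out[3] = m6[0] - m6[3];
}
```

A full 4x4 inverse transform would apply this pass to each row and then to each column, followed by rounding. Each output depends on all four inputs through only additions, subtractions, and shifts, which is the kind of regular pattern a single specialized instruction can cover.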

Technologies: instruction merging
Example: 6-tap sub-pixel interpolation

    result = 0;
    pres_y = dy == 1 ? y_pos : y_pos+1;
    pres_y = max(0,min(maxold_y,pres_y));          // load
    for(x=-2;x<4;x++)                              // control
    {
        pres_x = max(0,min(maxold_x,x_pos+x));     // load
        result += imY[pres_y][pres_x]*COEF[x+2];   // computation, permutation and load
    }
    result1 = max(0, min(255, (result+16)/32));    // computation

Operation mix before merging:

    Load/Store     30%
    Permutation    25%
    Computation    35%
    Control        10%

After merging, load/store and permutation form a single merged operation, alongside computation and control. This reduces the execution time by half.
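The interpolation loop on this slide can be made runnable as follows. This is a sketch under stated assumptions: the image is stored row-major with `stride` samples per row, the filter taps are the standard H.264 luma half-sample coefficients (1, -5, 20, 20, -5, 1), and the slide's `dy`-dependent row selection is left out for brevity. The names `interp_half_pel_h` and `clamp_int` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp v into [lo, hi]; plays the role of max(lo, min(hi, v)) on the slide. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Standard H.264 6-tap luma half-sample filter (assumed; the slide only
 * names the table COEF without listing its values). */
static const int COEF[6] = {1, -5, 20, 20, -5, 1};

/* Horizontal half-sample value between x_pos and x_pos + 1 on row y_pos.
 * imY is row-major with `stride` samples per row; maxold_x and maxold_y
 * are the largest valid coordinates. Hypothetical helper. */
int interp_half_pel_h(const unsigned char *imY, int stride,
                      int maxold_x, int maxold_y, int x_pos, int y_pos)
{
    int result = 0;
    int pres_y, x;

    assert(imY != NULL && stride > maxold_x);

    pres_y = clamp_int(y_pos, 0, maxold_y);                     /* load address */
    for (x = -2; x < 4; x++) {                                  /* control */
        int pres_x = clamp_int(x_pos + x, 0, maxold_x);         /* load address */
        result += imY[pres_y * stride + pres_x] * COEF[x + 2];  /* load + compute */
    }
    return clamp_int((result + 16) / 32, 0, 255);               /* round and clip */
}
```

On a flat region the function returns the sample value unchanged, since the taps sum to 32. Most of the loop body is address clamping and data fetch rather than arithmetic, which is consistent with the slide's point that load/store and permutation are worth merging.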

Benchmarking results for CPU core
CK520

Simulation results for DSP performance
Enhanced DSP

CAVLC (context-adaptive variable-length coding), CIF sequences:

    Sequence    Max MIPS/frame    Average MIPS/frame
    Foreman     0.147,832         0.029,898
    Mobile      0.541,943         0.134,240

OGG (new audio standard):

    Function          MIPS/frame
    MDCT              6
    De_VQ             2.5
    Floor/Coupling    3.5

Simulation results for DSP performance
Vector processor: H.264 baseline decoder (298-frame sequences)

    Format    Sequence    Max      Average
    QCIF      Foreman     28.1     12.7
    QCIF      Akiyo       19.8     5.3
    CIF                   116.3    52.3
    CIF                   92.9     22.8

Project status
- Finished 2 versions of the CPU core
- Released the DSP instruction set
- Writing and verifying RTL of the enhanced DSP
- Benchmarking the vector processor
- Developing software tools

Future work
- Scheduling for task-level parallelism (TLP) between heterogeneous processors
- Simulation/debugging tools for heterogeneous processors
- Methodologies for design space exploration

Thank you!
