GenTera’s I M A G I N E 3 Introducing: GenTera’s I M A G I N E 3 HANS DE VRIES.


1 GenTera’s I M A G I N E 3 Introducing: GenTera’s I M A G I N E 3 HANS DE VRIES

2 GenTera’s I M A G I N E 3 Building Blocks
- PCI/AGP Bus interface
- 128 bit DDR-SDRAM Bus
- Imagine 3 Core Processor: Multi-Stream (32) Scalar / Vector Processor, 80 Billion operations / second
- Advanced High Quality 3D Graphics / Volume processing Pipelines: 220 Billion operations / second
- Graphics Mask Generator
- Motion Estimator: 100 Billion op/s
- Data (Video) Input
- Data flow Ring Input
- Data (Video) Output
- Data flow Ring Output
Bandwidth labels in the diagram: 2.0 Gigabyte/s, 160 Megabyte/s, 1.0 Gigabyte/s, 4.2 Gigabyte/s, 0.5 Gigabyte/s

3 GenTera’s I M A G I N E 3 Core Processor
HISC ™ processor architecture
- 120 General Purpose registers (2x32 bit)
- 256 Vector registers (2x32 bit)
- 256x4 MAC Vector registers (2x32 bit)
- 128 Special Purpose control registers (2x32 bit), 1200 control table registers (2x32 bit)
- 80 Billion operations per second (320 operations per cycle), including 64 Multiply Accumulates per cycle with saturate
- 10 Giga Byte per second streaming I/O (memory & processor I/O)
- 40 Conditional operations per cycle
- 24 internal addresses per cycle
- 32 simultaneous concatenated vector streams (32 bit) (128 in byte mode)
- Single cycle 2D and 3D addressing modes (1D, 2D and 3D memory management)
Software tools: C and C++ compiler, Assembler, Linker, Debugger, Visual Simulator, Soft In circuit Emulator
Libraries: Image Processing Library, 3D graphics Library, Multi Media Library, Machine Vision Library
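The two headline figures on this slide imply a core clock. A minimal arithmetic sketch in Python (the variable names are illustrative, not from the presentation): dividing 80 billion operations per second by 320 operations per cycle gives the clock, and its reciprocal matches the 4 ns cycle quoted on the 3D texture/volume hardware slide.

```python
# Figures quoted on the slide.
OPS_PER_SECOND = 80e9   # 80 billion operations per second
OPS_PER_CYCLE = 320     # operations per cycle
MACS_PER_CYCLE = 64     # multiply-accumulates per cycle

clock_hz = OPS_PER_SECOND / OPS_PER_CYCLE     # implied core clock
cycle_ns = 1e9 / clock_hz                     # implied cycle time
macs_per_second = MACS_PER_CYCLE * clock_hz   # implied MAC rate

print(f"clock: {clock_hz / 1e6:.0f} MHz, cycle: {cycle_ns:.0f} ns")
print(f"MAC rate: {macs_per_second / 1e9:.0f} billion per second")
```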

4 GenTera’s I M A G I N E 3 HISC Processor Architecture
HISC: Hierarchical Instruction Set Computer
- RISC LEVEL: provides C and C++ compatibility.
- VLIW LEVEL: a moderate length VLIW instruction word plus a fully programmable bus interconnect directly controlled by the instruction code.
- EXTENDED VECTOR PROCESSING: numerous function specific Control Registers add extended functionality that is activated by the group of extended operations (as opposed to the basic operations). This increases the effective instruction word for vector operations to 1000+ bits.
- VARIABLE LENGTH VECTOR PROCESSING: enables up to 32 simultaneous and concatenated Vector Processing Streams. Word based Vector Processing (32, 2x16, 4x8) is symmetrically applied throughout the entire architecture.

5 GenTera’s I M A G I N E 3 Core Processor
Examples of Basic Processor Stream performance (from external memory to external memory)
Standard GUI functions:
- Screen to Screen Copy: 2000 Mega pixels/s (8 bit pixels), 500 Mega pixels/s (32 bit pixels)
- 3 operand ROPS: 1000 Mega pixels/s (8 bit pixels)
- Bitmap to Color expansion: 2000 Mega pixels/s (8 bit pixels)
Windows Direct Draw GUI functions:
- Pseudo to True Color: 500 Mega pixels/s (8 bit pseudo to 16 bit or 32 bit colors)
- True Color to Pseudo: 500 Mega pixels/s (32, 16 bit color to 8 bit pseudo color)
- Z buffer aware copy: 666 Mega pixels/s (8 bit pixels, 16 bit Z buffer), 500 Mega pixels/s (16 bit pixels, 16 bit Z buffer)
- Alpha Blended Copy: 250 Mega pixels/s (32 bit ARGB pixels)
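Several of these rates look memory bound. The sketch below checks three of them against the 4.2 Gigabyte/s memory bus figure from the Building Blocks slide. The number of memory accesses per pixel is an assumption made for illustration (a copy reads once and writes once, a 3 operand ROP reads three operands and writes one result); the presentation does not state it.

```python
MEMORY_BUS_GB_S = 4.2  # memory bus bandwidth quoted elsewhere in the deck

# (name, megapixels per second, bytes per pixel, assumed accesses per pixel)
cases = [
    ("Screen to Screen Copy, 8 bit", 2000, 1, 2),
    ("Screen to Screen Copy, 32 bit", 500, 4, 2),
    ("3 operand ROPS, 8 bit", 1000, 1, 4),
]

def traffic_gb_s(mpixels_per_s, bytes_per_pixel, accesses):
    """Memory traffic in gigabytes per second (1 GB taken as 1e9 bytes)."""
    return mpixels_per_s * 1e6 * bytes_per_pixel * accesses / 1e9

for name, rate, size, accesses in cases:
    gb_s = traffic_gb_s(rate, size, accesses)
    print(f"{name}: {gb_s:.1f} GB/s of {MEMORY_BUS_GB_S} GB/s")
```

Under these assumptions each case moves about 4.0 GB/s, just under the quoted bus bandwidth.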

6 GenTera’s I M A G I N E 3 Core Processor
Examples of Core Processor stream performance (2) (from external memory to external memory)
Multi Media Functions (numbers in result pixels/s):
- YUV to RGB conversion: 500 Mega pixels/s (32 bit color, 16 bit hi-color, 8 bit pseudo)
- DCT and IDCT (8x8 blocks): 167 Mega pixels/s (16 bit values, 32 bit calculations)
- DCT and IDCT (8x8 blocks): 667 Mega pixels/s (8 bit values, 16 bit calculations)
Photo shop type Image Processing Functions (numbers in result pixels/s):
- 3x3 kernel convolution: 2000 Mega pixels/s (8 bit pixels, 16 bit calculations)
- 7x7 kernel convolution: 500 Mega pixels/s (8 bit pixels, 16 bit calculations)
- Bi-cubic Rotation: 1000 Mega pixels/s (8 bit pixels, 16 bit calculations)
- Bi-cubic Scaling: 1000 Mega pixels/s (8 bit pixels, 16 bit calculations)
3D graphics Geometry:
- (4x4) homogeneous transformations plus perspective divides for X, Y and Z for meshed triangles in 32 bit floating point (IEEE): 50 Million triangles/s

7 GenTera’s I M A G I N E 3 Core Processor
[Block diagram]
- Data Read Ports: A0 (REG, VIO 0), A1 (REG, VIO 1), B0 (REG, DIO 0), B1 (REG, DIO 1)
- Interconnect (100 % connectivity)
- Data Processing Units: X0, X1, Y0, Y1, each with a MAC and an ALU
- Data Write Ports: REG WR0, REG WR1, DIO WR, VIO WR

8 GenTera’s I M A G I N E 3 Core Processor
[Block diagram: Control Register Busses]
- Control reg bus 0: bits [31:0]
- Control reg bus 1: bits [63:32]
- Units shown on the bus interconnect: SEQ, DIO, REG, VIO 0, VIO 1, I3D0, I3D1, MES0, MES1, RING0, RING1, MSK0, MSK1, VAU 0, VAU 1, MTAB, EMI, and the X0, Y0, X1, Y1 ALU and MAC units

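9 GenTera’s I M A G I N E 3 Instruction Word
Highly orthogonal VLIW instruction word
[Field diagram]
- Upper word fields: Da, Wr1, B1, A1, Y1, X1, with bit markers 127, 123, 112, 100, 88, 76, 64
- Lower word fields: Dd, Wr0, B0, A0, Y0, X0, with bit markers 63, 59, 48, 36, 24, 12, 0
ND0 = 0: Data Processing Functions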
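As a rough illustration of how such a word could be split into fields, here is a sketch in Python. The exact field boundaries are an assumption: the bit markers 0, 12, 24, 36, 48, 59 (and the same markers plus 64 for the upper word) are treated as the lowest bit of each field, which the slide does not state explicitly. The names `FIELDS`, `decode` and `encode` are hypothetical.

```python
# Assumed half-open bit ranges [lo, hi) for each field of the 128 bit word.
FIELDS = [
    ("X0", 0, 12), ("Y0", 12, 24), ("A0", 24, 36),
    ("B0", 36, 48), ("Wr0", 48, 59), ("Dd", 59, 64),
    ("X1", 64, 76), ("Y1", 76, 88), ("A1", 88, 100),
    ("B1", 100, 112), ("Wr1", 112, 123), ("Da", 123, 128),
]

def decode(word):
    """Split a 128 bit instruction word into a dict of field values."""
    return {name: (word >> lo) & ((1 << (hi - lo)) - 1)
            for name, lo, hi in FIELDS}

def encode(fields):
    """Pack field values into a 128 bit word; missing fields are zero."""
    word = 0
    for name, lo, hi in FIELDS:
        value = fields.get(name, 0)
        if value < 0 or value >> (hi - lo):
            raise ValueError(f"{name} does not fit in {hi - lo} bits")
        word |= value << lo
    return word

word = encode({"X0": 0xABC, "B1": 0x123, "Da": 0x1F})
print(hex(word), decode(word)["B1"])
```

Python integers are arbitrary precision, so the 128 bit word needs no special type here.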

10 GenTera’s I M A G I N E 3 Interconnect
Instruction Word provides 8-way Interconnectivity In Scalar-Processing Mode
[Diagram: Select path 1 and Select path 2 of each Data Processing Unit, and the Select path of each Data Write Port, each choose among A0, A1, B0, B1, X0, X1, Y0, Y1]

11 GenTera’s I M A G I N E 3 Interconnect
Instruction Word provides 100% Interconnectivity In Vector Processing Mode
[Diagram: Select path 1 and Select path 2 of each Data Processing Unit, and the select of each Data Write Port, each choose among 16 sources: A0 REG, A0 MEM, B0 REG, B0 MEM, X0 ALU, X0 MAC, Y0 ALU, Y0 MAC, A1 REG, A1 MEM, B1 REG, B1 MEM, X1 ALU, X1 MAC, Y1 ALU, Y1 MAC]

12 GenTera’s I M A G I N E 3 Instruction Word
Data processing instruction fields
[Field diagram: X0 with bit markers 0, 4, 8, 12 and Y0 with bit markers 12, 16, 20, 24. Each field has three formats, each followed by path 1 and path 2 selects:]
- 00: ALU
- 01: Shift, Ufu
- 1: MAC

13 GenTera’s I M A G I N E 3 Instruction Word
Data read ports instruction fields
[Field diagram: A0 with bit markers 24, 28, 32, 36 and B0 with bit markers 36, 40, 44, 48. Each field addresses a register port or a memory port. Entries recoverable from the diagram: register with size, control register with size, 16 bit immediate split into imm. [15:8] and imm. [7:0], 11 bit signed immediate, VIO function with size, DIO read with size]

14 GenTera’s I M A G I N E 3 Instruction Word
DIO address / data and (control-) register write ports fields
[Field diagram: Wr0 and ND 0 with bit markers 48, 52, 56, 58, 59, 62, 63 (123 and 127 in the upper word). Entries recoverable from the diagram: register port with size and wr addr; 0 = register path, 1 = control register path; DIO address, DIO address select, DIO data select, DIO rd/wr; rd addr, xrd addr, xwr data; Non data-processing function]

15 GenTera’s I M A G I N E 3 Parallel Conditional Processing
64 bit Uniform Status Register
[Register diagram: fields [63:56], [55:48], [47:40], [39:32], [31:24], [23:16], [15:8], [7:0], labeled Status for Byte 0 through Status for Byte 7 and marked with the X1 Y1 and X0 Y0 units]
Status bits for Byte 0: S0 C0 M0 Z0, W0 L0 H0 I0
- ALU Status: Overflow, Carry, Minus, Zero (ALU, Shifts, Unary functions)
- MAC Status: Wrong, Lower, Higher, Inside

16

17 GenTera’s I M A G I N E 3 Register File
GENERAL PURPOSE REGISTERS, VECTOR REGISTERS
- 256 vector registers: 2 x 32 bit wide, 4 x 16 bit wide or 8 x 8 bit wide; up to 24 independent and conditional byte addresses; up to 8 independent and conditional byte write enables
- 120 general registers: 2 x 32 bit / 4 x 16 bit / 8 x 8 bit
- Vector Index generators for Write Port C, Read Port A and Read Port B: 8 x Write Indices, 8 x Read A Indices, 8 x Read B Indices
- General Register Addresses From the Instruction Code: 2 x Write Address, 2 x Read A Address, 2 x Read B Address
- Connections to the INTERNAL BUS MATRIX: Write Port C Input BUS select, Read Port A output BUS register, Read Port B output BUS register
- DATA PORTS: Write Data 2,4,8 x; Read A Data 2,4,8 x; Read B Data 2,4,8 x (A1, A0, B1, B0)

18 GenTera’s I M A G I N E 3 Function Units
- ALU: Arithmetic, Boolean, Shift / Rotate, Unary Functions; 4 x 8, 2 x 16, 1 x 32; 32 bit float
- MULTIPLIER: (un)signed x (un)signed; binary point at: end, middle or top; graphics formats ( 0.0..1.0 == 00..ff ); 4 x 8, 2 x 16, 1 x 32; 32 bit float
- MAC Vector Registers: 256 words x 64 bit
- ACCUMULATOR
- Variable Range Clamp

19

20 GenTera’s I M A G I N E 3 Multiplier / Accumulator
8 bit Matrix functions: Open GL Blend Function (8 multiplies & 4 adds per MAC)
[Diagram: 32 bit input data into a 4 tap shift register (4 times for each byte); 8 bit inputs, 16 bit products]
Coefficients fixed or derived from the input operands:
- 0 BLEND_CONSTANT
- 1 BLEND_ZERO
- 2 BLEND_ONE
- 3 SRC_COLOR
- 4 INV_SRC_COLOR
- 5 SRC_ALPHA
- 6 INV_SRC_ALPHA
- 7 DST_ALPHA
- 8 INV_DST_ALPHA
- 9 DST_COLOR
- 10 INV_DST_COLOR
- 11 SRC_ALPHA_SATURATE
- 12 BOTH_SRC_ALPHA (source), BOTH_SRC_ALPHA (dest)
- 13 BOTH_INV_SRC_ALPHA (source), BOTH_INV_SRC_ALPHA (dest)
- 14 MAX_INTENSITY (source), MAX_INTENSITY (dest)
- 15 MIN_INTENSITY (source), MIN_INTENSITY (dest)
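The count of 8 multiplies and 4 adds corresponds to blending one 4 channel pixel: each channel needs a source multiply, a destination multiply and one add. The Python sketch below shows this for the SRC_ALPHA / INV_SRC_ALPHA factor pair, using the graphics format from the Function Units slide in which 00..ff stands for 0.0..1.0. The rounding rule and the function names are assumptions; the presentation does not describe the hardware's rounding.

```python
def blend_channel(src, dst, src_factor, dst_factor):
    """Blend one 8 bit channel; factors are 0..255, meaning 0.0..1.0."""
    total = src * src_factor + dst * dst_factor  # two multiplies, one add
    return min(255, (total + 127) // 255)        # round (assumed), then saturate

def blend_argb(src, dst):
    """SRC_ALPHA / INV_SRC_ALPHA blend of two (A, R, G, B) tuples."""
    alpha = src[0]
    return tuple(blend_channel(s, d, alpha, 255 - alpha)
                 for s, d in zip(src, dst))

opaque = blend_argb((255, 10, 20, 30), (255, 200, 100, 50))
clear = blend_argb((0, 10, 20, 30), (255, 200, 100, 50))
print(opaque, clear)
```

A fully opaque source replaces the destination, and a fully transparent source leaves it unchanged; the `min(255, ...)` clamp plays the role of the saturate step.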

21 GenTera’s I M A G I N E 3 Multiplier / Accumulator
16 bit Matrix functions:
- Convolute (4 multiplies & 2 adds per Multiplier): 32 bit input data into a 2 tap shift register (2 times for each 16 bit word)
- Transform (4 multiplies & 2 adds per Multiplier): 32 bit input data distributed to both columns (2 times for each 16 bit word)
[Diagram: 16 bit inputs, 32 bit products]
Mix:
MH[63:32] = Coef10[31:0] . Mb[31:16] + Coef11[31:0] . Ma[31:16]
ML[31:0] = Coef00[31:0] . Mb[15:0] + Coef01[31:0] . Ma[15:0]
Merge:
MH[63:32] = Coef10[31:0] . Ma[31:16] + Coef11[31:0] . Ma[15:0]
ML[31:0] = Coef00[31:0] . Mb[31:16] + Coef01[31:0] . Mb[15:0]
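The Mix and Merge formulas can be written out directly. This Python sketch follows the four equations term by term; it treats all operands as unsigned and does not truncate the sums to 32 bits, both of which are simplifying assumptions. The function names are illustrative.

```python
def halves(word32):
    """Return (bits [31:16], bits [15:0]) of a 32 bit word."""
    return (word32 >> 16) & 0xFFFF, word32 & 0xFFFF

def mix(ma, mb, c00, c01, c10, c11):
    """Mix: pair the high halves of Mb and Ma, and the low halves."""
    ma_hi, ma_lo = halves(ma)
    mb_hi, mb_lo = halves(mb)
    mh = c10 * mb_hi + c11 * ma_hi
    ml = c00 * mb_lo + c01 * ma_lo
    return mh, ml

def merge(ma, mb, c00, c01, c10, c11):
    """Merge: combine both halves of Ma into MH and both halves of Mb into ML."""
    ma_hi, ma_lo = halves(ma)
    mb_hi, mb_lo = halves(mb)
    mh = c10 * ma_hi + c11 * ma_lo
    ml = c00 * mb_hi + c01 * mb_lo
    return mh, ml

print(mix(0x00030002, 0x00050004, 1, 10, 100, 1000))
print(merge(0x00030002, 0x00050004, 1, 10, 100, 1000))
```

Each result half uses two multiplies and one add, so a full Mix or Merge takes 4 multiplies and 2 adds, matching the per-multiplier count on the slide.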

22 GenTera’s I M A G I N E 3 Multiplier/Accumulator
[Diagram: arrays of multiplier cells labeled "8 x 8 extern, 8 x16 intern" and "16 x 16 bit extern, 16 x 32 bit intern, 32 bit accumulate", combining into "32 x 32 bit extern, 32 x 32 bit intern, 32 x 32 bit floating point, 64 bit accumulate"]
Each of the 4 Multiplier/Accumulators handles all operations by utilizing the same hardware!
Imagine 3 operations per cycle:
- 64: 8x16 bit: quad in-product (4 comp.)
- 64: 8x16 bit: 4x4 matrix x vector
- 32: 8x16 bit: Open GL blending functions
- 16: 16x16 bit: in-product, cross-product
- 16: 16x16 bit: complex product
- 16: 16x32 bit: FIR filter
- 16: 16x32 bit: in-product, cross-product
- 16: 16x32 bit: complex product

23 GenTera’s I M A G I N E 3 Vector processing
Variable length vector processing made simple.
[Pipeline diagram over numbered cycles with the stages: genad(A0), genad(A1), A0=rd4x8(ri), A1=rd4x8(ri), B0=input, B1=rd(RING_Data), X0=mult(A0,B0,nuu), Y0=subsat(X0,A1), X1=mult(Y0,B1,nus), DA=again, D0=word4x8(uI), X0=addsat(X1,D0), Y0=matxvec(X0), Y1=inproduct(X0), X1=addsat(Y0,Y1), outputV1]
ACTUAL ASSEMBLY CODE FOR THE EXAMPLE ABOVE:
    repeat, graph (label_1);;;
    label_1:
    genad(A0) => B0=input, A0=rd4x8(ri) => X0=mult(A,V,nuu ) ===> genad(A1) => A1=rd4x8(ri) => Y0=subsat(X0,A1), B1=rd4x8(RING_Data) => X1=mult(Y0,B1,nus) ===> DA=Again ==> D0=word4x8(uI), X0=addsat(X1,D0) => Y0=matxvec(X0), Y1=inproduct(X0) =====> X1=addsat(Y0,Y1) => outputV1;

24 GenTera’s I M A G I N E 3 10 Gigabyte Streaming I/O
[Diagram: I M A G I N E 3 Internal Data Processing Core]
- VECTOR UNITS: simultaneous input and output to and from memory
- DATA CACHE or 3D GRAPHICS / VOLUME pipelines: INPUT AND OUTPUT
- Dataflow Ring input, Dataflow Ring output
The Imagine 3 core can stream data from memory or other processors at 10 GByte/sec (compared to 0.48 GByte/sec for the Imagine 1).

25 GenTera’s I M A G I N E 3 Non-aligned S I M D
SIMD processing made simple with non-aligned memory accesses (no complex, time-consuming shift-mask-merge operations needed).
[Diagram: 32 bit memory words, a non-aligned 32 bit word, 8 bit units]
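For contrast, here is a sketch of the shift-mask-merge sequence that the slide says is not needed: what software has to do when only aligned 32 bit loads are available. Little-endian byte order and the function name `load_unaligned` are assumptions made for the illustration; this is not Imagine 3 code.

```python
def load_unaligned(words, byte_addr):
    """Read a 32 bit value at any byte address from aligned 32 bit words."""
    index, offset = divmod(byte_addr, 4)
    if offset == 0:
        return words[index]                        # aligned: one plain load
    low = words[index] >> (8 * offset)             # shift out unwanted low bytes
    high = words[index + 1] << (8 * (4 - offset))  # shift next word into place
    return (low | high) & 0xFFFFFFFF               # merge, then mask to 32 bits

# Memory holding the bytes 11 22 33 44 55 66 77 88 in little-endian words.
words = [0x44332211, 0x88776655]
for addr in range(5):
    print(addr, hex(load_unaligned(words, addr)))
```

Every non-aligned access costs two loads, two shifts, an OR and a mask in this scheme, which is the overhead the non-aligned access hardware is described as removing.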

26

27

28 GenTera’s I M A G I N E 3 1, 2 and 3D memory management
1 M Byte PAGE
TILE formats (X, Y):
- 1024 x 1024 8 bit pixel TILE
- 512 x 1024 16 bit pixel TILE
- 256 x 1024 32 bit pixel TILE
BRICK formats (X, Y, Z):
- 256 x 128 x 128 8 bit voxel BRICK
- 128 x 128 x 128 16 bit voxel BRICK
- 64 x 128 x 128 32 bit voxel BRICK
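The tile and brick dimensions are chosen so that the byte size stays constant across pixel depths. A short Python check follows; the row-major `tile_offset` function is an assumed layout added for illustration, since the presentation does not describe how pixels are ordered inside a tile.

```python
PAGE_BYTES = 1 << 20  # 1 M Byte page

# bits per element -> dimensions quoted on the slide
TILES = {8: (1024, 1024), 16: (512, 1024), 32: (256, 1024)}
BRICKS = {8: (256, 128, 128), 16: (128, 128, 128), 32: (64, 128, 128)}

def tile_bytes(bits):
    width, height = TILES[bits]
    return width * height * bits // 8

def brick_bytes(bits):
    x, y, z = BRICKS[bits]
    return x * y * z * bits // 8

def tile_offset(x, y, bits):
    """Byte offset of pixel (x, y) in a tile, assuming row-major order."""
    width, height = TILES[bits]
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError("pixel outside tile")
    return (y * width + x) * (bits // 8)

for bits in (8, 16, 32):
    print(bits, tile_bytes(bits) // PAGE_BYTES, brick_bytes(bits) // PAGE_BYTES)
```

Every tile format fills exactly one 1 M Byte page with rows of 1024 bytes, and every brick format comes to 4 M Byte, that is four pages.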

29 GenTera’s I M A G I N E 3 3D texture/volume Hardware
Very High Quality, 220 Billion operations/sec: 2 x 440 operations per cycle (4 ns)
- Texture Quality: BI linear, TRI Linear and QUAD interpolation.
- Texture Types: 32 bit ARGB, 16 bit (4 types), 8, 4, 2 and 1 bit pseudo color, 16 bit and 32 bit greyscale (signed and unsigned), 2x16 bit complex
- Texture Size: 16,384 x 16,384 max (2d), 2048 x 2048 x 2048 max (3d)
- Texture Dimension: 1, 2 and 3 dimensional textures.
- Texture Clamping: Clamp and Wrap for all 3 co-ordinates.
- Texture Border: 0 or 1 pixel texture borders, Border Color supported.
- Texture MIP maps: up to 16 levels, selection made for each individual pixel.
- Perspective division for all 9 parameters: S, T, R, Alpha, Red, Green, Blue, Fog, Z
- Perspective Correct Texture Mapping, Perspective Correct Texture Lighting, Perspective Correct Linear and Exponential (2 types) Fog, Perspective Correct Depth Buffering

30 GenTera’s I M A G I N E 3 3D graphics Pipelines
[Pipeline diagram, attached to the D BUS. Blocks shown:]
- 3D graphics pipeline control unit
- Perspect. MIP map processing pipeline
- Bresenham Edge Start Interpolators (Q, R, S, T, Z⁻¹) (F, A, R, G, B)
- Vector Start Interpolators (Q, R, S, T, Z⁻¹) (F, A, R, G, B)
- Pixel Value Interpolators (Q, R, S, T, Z⁻¹) (F, A, R, G, B)
- Perspective 3D co-ordinate Generator: 5 stages
- Perspective 3D correct Lighting: 5 stages
- Perspective MIP Map Addresses Calculations: 2 stages
- Perspective Interpolation Coefficients
- Perspective Lighting & Fog Coefficients
- Memory Access Input Fifo / Port Select
- External Memory with MIP Map Textures: 4 - 6 stages
- Memory Access Re-order buffers
- Memory Access Internal Delay Line for Interpolation, Lighting & Fog Coefficients: 3 - 17 stages
- Memory Access Data Load unit
- Texel Interp./ Lighting control unit
- Texel Selection / Expansion
- Texel Color Look Up
- Texel Interpolation / Lighting coefficients generator
- Texel Interpolation / Lighting Multiply stage
- Texel Interpolation / Lighting Summation stage

31 GenTera’s I M A G I N E 3 3D texture/volume Hardware
3D graphics Pipeline + Core stream performance (from external memory to external memory)
Direct Draw functions (numbers in result pixels/s):
- Bilinear Image Scale: 333 Mega pixels/s (32 bit gray scale or 32 bit color pixels)
- Bilinear Image Rotate: 333 Mega pixels/s (32 bit gray scale or 32 bit color pixels)
- Bilinear Affine Transform: 333 Mega pixels/s (32 bit gray scale or 32 bit color pixels)
MPEG functions (numbers in result pixels/s):
- Bilinear Scaling plus kYUV to αRGB: 333 Mega pixels/s (32 bit αRGB pixels)
3D functions (numbers in result pixels/s):
- Z-buffered, Perspective Correct, Bilinear Interpolated Texture mapping with perspective correct lighting and exponential fog (Texture size up to 16k x 16k), MIP-Mapping: 300 Mega pixels/s (32 bit αRGB pixels, 16 bit hi-color, 8 bit pseudo, 16 bit Z values)

32 GenTera’s I M A G I N E 3 Fan Beam Back projection
The 3D Texture/Volume pipelines and the Multiplier / Accumulators in the Imagine 3 can handle eight 16 bit linear interpolated samples per cycle with 32 bit accuracy.
[Diagram labels: Vector Direction, Back Projection Direction]

33 GenTera’s I M A G I N E 3 Cone beam reconstruction
The Back projection in cone beam systems requires inverse perspective mapping from filtered images back to a 3D volume. The Imagine 3 performs this directly with its 3D volume pipelines.

34 GenTera’s I M A G I N E 3 De-blur filtering
FIR filter performance (16 bit input, 32 bit calculations):
- 128 taps: 32 Mega-pixels / second
- 256 taps: 16 Mega-pixels / second
- 512 taps: 8 Mega-pixels / second
Filtered Backprojection for Medical Imaging, 324 x 512 to 256 x 256 (324 projections of 512 values, 256 x 256 result image):
- De-blur filtering: 10 ms (256 taps)
- Backprojection: 11 ms
- Reconstruction: 21 ms
Filtered Backprojection for Medical Imaging, 840 x 928 to 512 x 512 (840 projections of 928 values, 512 x 512 result image):
- De-blur filtering: 100 ms (512 taps)
- Backprojection: 108 ms
- Reconstruction: 208 ms
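The filtering times follow from the FIR rates: the number of samples divided by the sample rate for the chosen filter length. A small Python sketch of that arithmetic, assuming "Mega" means 10^6 and that every projection value is filtered exactly once (the names are illustrative):

```python
# Filter length (taps) -> filtered samples per second, from the slide.
FIR_RATE = {128: 32e6, 256: 16e6, 512: 8e6}

def filter_time_ms(projections, values, taps):
    """Time to filter every sample of every projection, in milliseconds."""
    return projections * values / FIR_RATE[taps] * 1e3

def mac_rate(taps):
    """Multiply-accumulates per second implied by a table entry."""
    return taps * FIR_RATE[taps]

print(f"324 x 512, 256 taps: {filter_time_ms(324, 512, 256):.1f} ms")
print(f"840 x 928, 512 taps: {filter_time_ms(840, 928, 512):.1f} ms")
print(f"MAC rate: {mac_rate(128) / 1e9:.3f} billion per second")
```

The results, about 10.4 ms and 97.4 ms, agree with the quoted 10 ms and 100 ms. All three table entries imply the same multiply-accumulate rate, since halving the sample rate exactly offsets doubling the filter length.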

35 GenTera’s I M A G I N E 3 De-blur filtering (FFT)
Complex input Fast Fourier Transform performance (vectorized)
Points   32 bit Floating Point   32 bit Integer   16 bit Integer
256      8 μs                    4 μs             2.0 μs
512      18 μs                   9 μs             4.4 μs
1024     40 μs                   20 μs            10 μs
2048     88 μs                   44 μs            22 μs
4096     192 μs                  96 μs            48 μs
8192     436 μs                  218 μs           109 μs
16384    896 μs                  448 μs           224 μs
Filtered Back-projection for Medical Imaging, 1200 x 960 to 512 x 512 (1200 projections of 960 values, 512 x 512 result image):
- FFT filtering: 106 ms (2048 point FP)
- Back-projection: 157 ms
- Reconstruction: 263 ms
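The 106 ms filtering figure is consistent with one 2048 point floating point transform per projection. The Python sketch below reproduces that arithmetic from the table; the one-transform-per-projection reading is an inference from the numbers, not something the slide states.

```python
# 32 bit floating point FFT times in microseconds, from the table.
FFT_FP_US = {256: 8, 512: 18, 1024: 40, 2048: 88,
             4096: 192, 8192: 436, 16384: 896}

PROJECTIONS = 1200
BACKPROJECTION_MS = 157  # quoted on the slide

# Assumption: one 2048 point transform per projection.
filtering_ms = PROJECTIONS * FFT_FP_US[2048] / 1000
reconstruction_ms = round(filtering_ms) + BACKPROJECTION_MS

print(f"FFT filtering: {filtering_ms:.1f} ms")
print(f"Reconstruction: {reconstruction_ms} ms")
```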

36 GenTera’s I M A G I N E 3 Radar Display Processing Cartesian to Polar conversion with bi-linear interpolation 32 bit colors: 250 Mega-pixels /second

37 GenTera’s I M A G I N E 3 Motion Estimators
Motion Estimation Unit for MPEG1…MPEG4 video encoding, 100 Billion operations / second
- software controllable
- arbitrary MxN kernel sizes up to 256 by 256
- arbitrary search space sizes up to 4096 by 4096 for HDTV and higher
- allows optimizing algorithms (reduced search space)
- forward and backward prediction
- vector processing co-operation with core for bi-cubic pixel interpolation / rotation
Performance: comparing a 16x16 pixel block with any other 16x16 pixel block (half, quarter, 1/8th, 1/16th pixels with bi-cubic interpolation) runs at 120 Million Block Compares / second
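A block compare can be sketched in a few lines. The presentation does not name the matching metric, so the sum of absolute differences (SAD) used here is an assumption; it is a common choice in video encoders. The exhaustive full-pixel search and all names are illustrative only.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size=16):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        row_a = frame_a[ay + dy]
        row_b = frame_b[by + dy]
        for dx in range(size):
            total += abs(row_a[ax + dx] - row_b[bx + dx])
    return total

def best_match(ref, rx, ry, target, size=16):
    """Exhaustive search: return (cost, x, y) of the best block in target."""
    best = None
    for y in range(len(target) - size + 1):
        for x in range(len(target[0]) - size + 1):
            cost = sad(ref, rx, ry, target, x, y, size)
            if best is None or cost < best[0]:
                best = (cost, x, y)
    return best

# Synthetic 24 x 24 frame, and a copy shifted right by 3 and down by 2.
ref = [[(x * 7 + y * 13) % 256 for x in range(24)] for y in range(24)]
target = [[ref[y - 2][x - 3] if y >= 2 and x >= 3 else 0
           for x in range(24)] for y in range(24)]

cost, mx, my = best_match(ref, 0, 0, target)
print(cost, mx, my)
```

The search recovers the shift as a motion vector of (3, 2) with zero cost. One 16x16 compare touches 256 pixel pairs, which gives a sense of the work behind the quoted compare rate.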

38 GenTera’s I M A G I N E 3 Graphics Mask Generators
Generates Transparent and Opaque Masks for 512 pixels; multiple units work in parallel:
- Window Mask Generator: automatically clips pixels outside the View Port (scissoring)
- Span line Mask Generator: for Concave Polygons and arbitrary Objects
- Range Mask generator: for Depth Buffer Tests, Stencil Buffer Tests, Alpha Test, Chroma Keying Tests et cetera
- Complex Mask Generator: for Concave and Complex Polygons according to the odd/even or winding rules
- Alpha Mask Generator: for objects with partially covered pixels

39 GenTera’s I M A G I N E 3 Graphics Mask Generators
[Diagram: a triangle overlapped by another triangle inside a window, drawn against the Spanline Address]
Registers shown: Window X min / max, Window Y min / max; Spanline 0 to 3 Start / End; Spanline Delta Start, Spanline Delta End; Spanline Y min / max; Spanline Length (-1); Range mask 0 to 3; Complex mask 0 to 3
- The Window is defined by the Window registers.
- The Spanline registers define the outlines of the triangle.
- The Range Mask contains the result of the Depth buffer test (overlapping triangle).
- The Complex Mask is used in this example to hold the Polygon Stipple pattern.

40 GenTera’s I M A G I N E 3 Multi media I/O units
Video Output:
- (Α), R, G, B outputs with 330 MHz dot clock for 1800 x 1400 screen format at 90 Hz.
- 12 (16) bit video out for Studio Quality video processing.
- Interface to DVI-TFT transmitters for high resolution, high quality LCD displays.
Video Input:
- CCIR 656: 8 bit digital video input for NTSC, PAL, SECAM, HDTV and custom formats
Audio Codec 97 Interface:
- Standard from Intel, Creative Labs, Yamaha, Analog Devices and Nat. Semiconductor
- Supports Analog speakers, Microphone, Headphone + Headphone micro, Telephony and Modem signals, CD analog audio in, Analog Video Sound In, PC beep in, et cetera
Digital Audio:
- 4 stereo serial I/O ports (I²S type and S type emulation capabilities)
- Supports CD, DVD and Dolby AC3 input or output
External Device Control:
- 8 bit classic μP interface bus and I²C type emulation capability
- MIDI interface (Input and output for synthesizers and keyboards)

41 GenTera’s I M A G I N E 3 Real Time Support
MULTI MEDIA REAL TIME SUPPORT
- Level 1 Events (1 micro second response time requirement): Horizontal Sync interrupts, Video I/O interrupts, Register Virtualization interrupts.
- Level 2 Events (2 - 100 micro second response time requirement): Communication Fifo interrupts, Mailbox Interrupts, I²S Fifo Interrupts, Ac97 Fifo Interrupts, Midi Interrupt, I²C interrupt, Vertical Sync Interrupts, Scheduler Clock Tick, et cetera
- Threads (100 micro - 10 millisecond response time requirement): Host Command Queues Manager, Audio Stream managers, Modem Stream managers, User definable threads

42 GenTera’s I M A G I N E 3 High-end Board
8 Processors: 3.2 Tera operations/s, 4 GigaByte memory
[Board diagram: 8 IMAGINE 3 processors]

43 GenTera’s I M A G I N E 3 High-end Board
8 Imagine 3 processors, 3200 Billion operations per second
32 GigaByte per second Memory Bandwidth
16 GigaByte per second Inter-Processor Bandwidth
- Perspective Volume Rendering: 1000 x 1000 x 1000 at 15 frames/second (based on 25% volume traversal)
- Cone Beam Reconstruction: 512 x 512 x 512 from 1000² x 128 in 4 seconds
- Real Time 3D ultra sound reconstruction and visualization
- Real Time HDTV MPEG 4 video encoding
- Advanced Radar Processing
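The board totals can be rebuilt from the per-processor figures on the Building Blocks slide. A short Python sketch of that arithmetic follows; the way the figures are combined is an inference from the numbers, and the variable names are illustrative.

```python
PROCESSORS = 8

# Billion operations per second per processor, from the Building Blocks slide.
PER_PROCESSOR_GOPS = {
    "core processor": 80,
    "3D graphics / volume pipelines": 220,
    "motion estimator": 100,
}
MEMORY_BUS_GB_S = 4.2  # per processor memory bus
RING_GB_S = 2.0        # per processor Dataflow Ring

per_chip_gops = sum(PER_PROCESSOR_GOPS.values())
board_gops = PROCESSORS * per_chip_gops
board_memory_gb_s = PROCESSORS * MEMORY_BUS_GB_S
board_ring_gb_s = PROCESSORS * RING_GB_S

print(per_chip_gops, board_gops)
print(f"{board_memory_gb_s:.1f} GB/s memory, {board_ring_gb_s:.0f} GB/s ring")
```

The operations and ring totals match the quoted 3200 Billion operations per second and 16 GigaByte per second. The memory total comes to 33.6 GB/s against the quoted 32, which may simply reflect rounding to 4 GB/s per processor.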

44 GenTera’s I M A G I N E 3 High Speed Dataflow Ring
Up to 2 Gigabyte per second Dataflow Ring (SSTL-2)
Point-to-point with Broadcast options and auto configuration
[Diagram: IMAGINE 3 processors connected in a ring]

45 GenTera’s I M A G I N E 3 High Speed System I/O
The Dataflow Ring also provides very high speed System I/O. Entry level systems can use the programmable Video Data I/O for general purpose I/O (160 MB/s in and 1 GB/s out per processor).
[Diagram: IMAGINE 3 processors on the Dataflow Ring with an Optional System I/O FPGA, e.g. Xilinx Virtex II]
- Video In: 160 MB/s
- Video out: 1 GB/s
- Dataflow input: up to 2.0 GB/s
- Dataflow output: up to 2.0 GB/s

46 GenTera’s I M A G I N E 3 Pipeline Processing
The Dataflow Ring allows long vector processing pipelines over multiple processors. Here is an example with just 2 processors.
[Diagram: each processor has a Vector Read from memory and a Vector Write to memory, and the processors are linked by the Dataflow Ring. Stages shown: Bi linear Interpolated Data from the Graphics pipeline, MAC as 3D blend unit, ALU, 256 entry vector register, MAC as FIR filter, ALU]

47 GenTera’s I M A G I N E 3 128 bit memory bus (reads)
Memory Bus: 128 bit PC2100, 4.2 Gigabyte / second
Blocks on the Memory Read access path:
- 16 kbyte 1st Level data cache
- 16 kbyte 1st Level instruction cache
- Dual 128 word x 128 bit Vector input fifos
- Dual 3D-graphics pipelines
- PCI/AGP
- Video Output: 128 word x 128 bit fifo

48 GenTera’s I M A G I N E 3 128 bit memory bus (writes)
Memory Bus: 128 bit PC2100, 4.2 Gigabyte / second
Blocks on the Memory Write access path:
- 16 kbyte 1st level data cache
- Dual 128 word x 128 bit Vector output fifos
- 16 word x 128 bit write buffer
- PCI/AGP
8-fold address interleaved memory reads and writes. Out of order accesses with coherency checking.
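The 4.2 Gigabyte/second figure matches a 128 bit wide PC2100 interface. PC2100 is DDR-266 memory: a clock of roughly 133 MHz with two transfers per clock, which gives about 2.1 GB/s on a standard 64 bit module. The sketch below doubles the width to 128 bits; the variable names are illustrative.

```python
CLOCK_HZ = 133.33e6      # DDR-266 (PC2100) memory clock, approximate
TRANSFERS_PER_CLOCK = 2  # double data rate
BUS_BITS = 128           # bus width quoted on the slide

bytes_per_transfer = BUS_BITS // 8
bandwidth = CLOCK_HZ * TRANSFERS_PER_CLOCK * bytes_per_transfer

print(f"{bandwidth / 1e9:.2f} GB/s")
```

The peak comes to about 4.27 GB/s, which the slides quote as 4.2, that is twice the 2.1 GB/s that gives PC2100 its name.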

49 GenTera’s I M A G I N E 3 END GenTera’s I M A G I N E 3 HANS DE VRIES

