Exploiting On-chip Memory Bandwidth in the VIRAM Compiler


1 Exploiting On-chip Memory Bandwidth in the VIRAM Compiler
Dave Judd, Katherine Yelick, Christoforos Kozyrakis, David Martin, and David Patterson

2 IRAM Overview
A processor architecture for embedded/portable systems running media applications
- MIPS scalar core with vector co-processor
- Embedded DRAM
[Block diagram: MIPS64™ 5Kc core with 8KB instruction and data caches, FPU, and coprocessor interface; vector unit with an 8KB vector register file, a 512B flag register file, two 256b arithmetic units (Arith 0/1), two flag units (Flag 0/1), and a memory unit with TLB; 256b memory crossbar with DMA and JTAG interfaces; eight 2MB DRAM macros (DRAM0 … DRAM7); 64b SysAD interface]

I will start with what is interesting about Vector IRAM. This is a prototype microprocessor that integrates a vector unit with 256-bit datapaths and a 16-MByte embedded DRAM memory system. The design uses 150 million transistors and occupies nearly 300 square mm. While operating at just 200 MHz, Vector IRAM achieves 3.2 giga-ops and consumes 2 Watts. Vector IRAM also comes with an industrial-strength vectorizing compiler for software development. Vector IRAM is being implemented by a group of only 6 graduate students, responsible for architecture, design, simulation, and testing. So, if Patterson and Hennessy decide to introduce performance/watt/man-year as a major processor metric in the new version of their book, this processor will likely be one of the best in this class.

3 Why Vectors?
- Utilizes the on-chip bandwidth of IRAM: parallelism within instructions
- Efficient architecture for vectorizable code: avoids the area, power, and design cost of reorder logic; low instruction decode overhead
- Multimedia algorithms are vectorizable, e.g., vectorize across pixels in an image (see the sketch below)
- Scales easily across chip generations: e.g., 32-way parallelism in an instruction can be implemented by 1-, 2-, 4-, or 8-way hardware
- Leverages well-known compiler technology
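To make the pixel example concrete, here is the kind of loop meant above; a minimal sketch, with the kernel, names, and saturation bound invented for illustration rather than taken from the talk's benchmarks:

/* Hypothetical per-pixel brightness adjustment. Every iteration is
   independent, so a vectorizing compiler can execute many pixels per
   vector instruction. */
void brighten(unsigned short *pix, int n, unsigned short delta) {
    for (int i = 0; i < n; i++) {
        unsigned int v = pix[i] + delta;                     /* 16-bit data, 32-bit temp */
        pix[i] = (v > 65535u) ? 65535u : (unsigned short)v;  /* saturate */
    }
}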

4 Architecture Details
MIPS64™ 5Kc core (200 MHz): single-issue scalar core with 8 KByte I & D caches
Vector unit (200 MHz):
- 8 KByte register file (32 64b elements per register)
- 256b datapaths, can be subdivided into 16b, 32b, 64b
- 2 arithmetic units (1 does single-precision FP), 2 flag processing units
Memory unit: 4 address generators for strided/indexed accesses
Main memory system:
- 8 2-MByte DRAM macros; 25ns random access time, 7.5ns page access time
- Crossbar interconnect: 12.8 GBytes/s peak bandwidth per direction (load/store)
Off-chip interface: 2-channel DMA engine and 64b SysAD bus

The vector unit is also connected to the coprocessor interface of the MIPS processor and works at 200 MHz. It includes a multiported 8 KByte register file. This allows each of the 32 registers to hold 32 64b elements, or 64 32b elements, and so on. The flag register file has a capacity of half a KByte. There are two functional units for arithmetic operations. Both can execute integer and logical operations, but only one can execute floating-point. There are also 2 flag processing units which provide support for predicated execution and exception handling. Each of the functional units has a 256-bit pipelined datapath. On each cycle, 4 64b operations, 8 32b operations, or 16 16b operations can execute in parallel. To simplify the design and reduce area requirements, our prototype does not implement 8b integer operations or double-precision arithmetic. All operations excluding divides are fully pipelined. The vector coprocessor also includes one memory (load/store) unit. The LSU can exchange up to 256b per cycle with the memory system and has four address generators for strided and indexed accesses. Address translation is performed in a two-level TLB structure. The hardware-managed, first-level microTLB has four entries and four ports, while the main TLB has 32 double-page entries and a single access port. The main TLB is managed by software. The memory unit is pipelined and up to 64 independent accesses may be pending at any time. The 64b SysAD bus connects to the external chip-set at 100 MHz.

5 Floorplan
- Technology: IBM SA-27E, 0.18µm CMOS, 6 metal layers
- 290 mm² die area, 225 mm² for memory/logic
- Transistor count: ~150M
- Power supply: 1.2V for logic, 1.8V for DRAM
- Typical power consumption: 2.0 W total; 0.5 W (scalar), … W (vector), … W (DRAM), … W (misc)
- Peak vector performance: 1.6/3.2/6.4 Gops w/o multiply-add (64b/32b/16b operations); 3.2/6.4/12.8 Gops w/ madd; 1.6 GFLOPS (single-precision)
- Die size: 14.5 mm × 20.0 mm
- Tape-out planned for Spring '01

This figure presents the floorplan of Vector IRAM. It occupies nearly 300 square mm and uses 150 million transistors in a 0.18µm CMOS process by IBM. Blue blocks on the floorplan indicate DRAM macros or compiled SRAM blocks. Golden blocks are those designed at Berkeley. They include synthesized logic for control and the FP datapaths, and full-custom logic for register files, integer datapaths, and DRAM. Vector IRAM operates at 200 MHz. The power supply is 1.2V for logic and 1.8V for DRAM. The peak performance of the vector unit is 1.6 giga-ops for 64-bit integer operations. Performance doubles or quadruples for 32b and 16b operations respectively. Peak floating-point performance is 1.6 GFLOPS. There are several interesting things to notice on the floorplan. First, the overall design modularity and scalability: it mostly consists of replicated DRAM macros and vector lanes connected through a crossbar. Another very interesting feature is the percentage of this design directly visible to software. Compilers can control any part of the design that is registers, datapaths, or main memory. They do that by scheduling the proper arithmetic or load/store instructions. The majority of our design is used for main memory, vector registers, and datapaths. On the other hand, if you take a look at a processor like the Pentium 3, you will see that less than 20% of its area is used for datapaths and registers. The rest is caches and dynamic issue logic. While these usually work for the benefit of applications, they cannot be controlled by the compiler and they cannot be turned off when not necessary.

6 Scalable Design
Scaling the number of lanes trades performance, energy, and area:
- 4 lanes, 8 MB: 3.2 Gops (32-bit)
- 2 lanes, 4 MB: 1.6 Gops
- 1 lane, 2 MB: 0.8 Gops
The number of DRAM banks may scale independently, e.g., 16 banks rather than 8.

7 Vector Architectural State
[Figure: data registers vr0 … vr31, each divided into virtual processors VP0, VP1, …, VPvl-1 of width vpw]
- The number of VPs is given by the vector length register vl
- The width of each VP is given by the register vpw; vpw is one of {8b, 16b, 32b, 64b}
- The maximum vector length is given by a read-only register mvl
- mvl depends on the implementation and vpw: {128, 128, 64, 32} in VIRAM-1
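The following strip-mining sketch shows how vectorized code uses vl and mvl; it is a scalar C rendering of the idea, assuming VIRAM-1's mvl of 64 for 32-bit vpw, not actual compiler output:

/* Strip-mining: process n elements in chunks of at most MVL, setting
   the vector length vl for each chunk the way vectorized code sets
   the vl register. MVL = 64 matches VIRAM-1's mvl at vpw = 32b. */
#define MVL 64
void vadd32(int *c, const int *a, const int *b, int n) {
    for (int i = 0; i < n; ) {
        int vl = (n - i < MVL) ? (n - i) : MVL;   /* set vector length */
        for (int j = 0; j < vl; j++)              /* one vector add's work */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}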

8 VIRAM Compiler
[Diagram: C, C++, and Fortran95 frontends feed Cray's PDGCS optimizer, which drives code generators for C90/T90/SV1, T3D/T3E, and SV2/VIRAM]
- Based on Cray's production compiler
- Challenges: narrow data types and scalar/vector memory consistency
- Advantages relative to media extensions: powerful addressing modes and an ISA independent of datapath width

Apart from the hardware, we have also worked on software development tools. We have a vectorizing compiler with C, C++, and Fortran front-ends. It is based on the production compiler by Cray for its vector supercomputers, which we ported to our architecture. It has extensive vectorization capabilities, including outer-loop vectorization. Using this compiler, one can vectorize applications written in high-level languages without necessarily using optimized libraries or "special" (non-standard) variable types.

9 Compiler Challenges
Can compiled code effectively use the VIRAM design?
- Is the on-chip DRAM bandwidth sufficient?
- How well do multimedia applications vectorize?
- Generating code for variable-width data

10 Matrix-Vector Multiplication
[Figure: source vector, destination vector, matrix; assume row layout]
Vector-matrix multiply (= mvm with column layout):
- saxpy: 2 vloads, 1 vstore (all unit stride)
Matrix-vector multiply:
- dot: 2 vloads (both unit stride) + a reduction
- saxpy: 2 vloads, 1 vstore (2 strided + 1 unit)
Sparse matrix-vector multiply:
- dot: 3 vloads (1 indexed, 2 unit) + reduction
- saxpy: 3 vloads, 1 vstore (2 indexed, 2 unit); needs column layout
The two dense formulations are sketched in C below.
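The dot and saxpy formulations differ only in loop order; the following C sketch (written for illustration, not taken from the benchmark source) shows both for a row-major n x n matrix A with row stride acol:

/* dot form: the inner loop reads a row of A and x at unit stride and
   reduces into a scalar sum. */
void mvm_dot(float *y, const float *A, const float *x, int n, int acol) {
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += A[i*acol + j] * x[j];
        y[i] = sum;
    }
}

/* saxpy form: the inner loop updates y at unit stride while A is read
   down a column, i.e., with stride acol. */
void mvm_saxpy(float *y, const float *A, const float *x, int n, int acol) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[j] += A[j*acol + i] * x[i];
}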

11 Matrix Vector Multiplication
[Chart: performance of various source optimizations; column-layout performance is approximately peak]

12 Comparison of MVM Performance
- Double-precision floating point: compiled for VIRAM (note: the chip only does single precision), hand- or Atlas-optimized for the other machines
- 100x100 matrix
- As matrix size increases, performance drops on cache-based designs and increases on vector designs
[Chart: MFLOPS by machine]

13 Sparse MVM Performance
Performance is matrix-dependent (lp matrix):
- Compiled for VIRAM using the "independent" pragma; sparse column layout
- Sparsity-optimized for the other machines; sparse row (or blocked row) layout
[Chart: MFLOPS by machine]

14 Generating Code for Variable VPW
Strategy: the vectorizer determines the minimum correct vpw for each loop nest
- The vectorizer assumes vpw=64 initially
- At the end of vectorization, the vectorized copy of the loop is discarded if the greatest width encountered is less than 64, and vectorization starts over with the new vpw
- Code generation checks vpw for each loop nest
Limitation: a single loop nest runs at the speed of the widest type (see the sketch below)
- Reason: simplicity & performance of the common case
- No attempt to split/combine loops based on vpw
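As an invented illustration of the limitation, a loop that mixes 16-bit and 32-bit data must run with vpw=32, halving the element parallelism of its 16-bit operations:

/* Hypothetical example: the 32-bit accumulation forces vpw = 32 for
   the whole loop nest, so the 16-bit loads process 8 elements per
   256b datapath cycle instead of the 16 possible at vpw = 16. */
void accumulate(int *sum, const short *a, const short *b, int n) {
    for (int i = 0; i < n; i++)
        sum[i] += (int)a[i] * (int)b[i];   /* widest type: 32 bits */
}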

15 Media Benchmarks
Mostly from U Toronto's benchmark suite
8-bit data, 16-bit operations:
- Colorspace: strided loads/stores
- Composition: unit stride
- Convolve: strided
Mixed 16- and 32-bit integer:
- Detect
- Decrypt
32-bit floating point:
- FIR filter
- SAXPY 64: 64 elements
- SAXPY 1K: 1024 elements
- matmul: matrix multiplication

16 Integer Benchmarks
- Strided access is important (e.g., RGB); narrow types are limited by address generation
- Outer-loop vectorization and unrolling are used: they help avoid short vectors, but spilling can be a problem
- Tiling could probably help (see the sketch below)
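A minimal sketch of the kind of tiling meant here, with the kernel and tile size invented for illustration: blocking the column loop lets a strip of the result stay in vector registers across the reuse loop instead of being spilled and reloaded.

/* Hypothetical tiled matrix multiply (c assumed zero-initialized):
   with vectorization over j, the strip c[i*n+j0 .. i*n+j0+TILE) can
   be accumulated in vector registers across the whole k loop. */
#define TILE 64
void matmul_tiled(float *c, const float *a, const float *b, int n) {
    for (int j0 = 0; j0 < n; j0 += TILE)
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = j0; j < j0 + TILE && j < n; j++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
}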

17 Floating Point DSP Benchmarks
- Performance is competitive with hand-coding
- Vector length is important (e.g., saxpy), but multiple vectors are fine (e.g., matmul)

18 Conclusions
The VIRAM ISA shows high performance on compiled code:
- Competitive with modern processors
- The limitations are address generation for strided and indexed memory operations
The compiler effectively uses variable-width data:
- Allows media applications to vectorize
- Performance scales with inverse data width
Future compiler work:
- Tiling
- Fixed-point support
- Better register allocation

19 Backup slides

20 Performance Summary
Performance of compiled code is generally good:
- matmul and saxpy meet or beat hand-coded versions
- The 3 addressing modes are very useful
Limitations to performance:
- Dependencies or inadequate compiler analysis
- Inadequate memory bandwidth
- Lack of address generators
- Short vectors
Future compiler work:
- Tiling
- Fixed-point support
- Better register allocation

21 Scaling Media Benchmarks

22 Compiled matrix-vector multiplication: 2 Flops/element
- Easy compilation problem; stresses memory bandwidth
- Compare to 304 MFLOPS (64-bit) for Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
- More memory banks than the default DRAM macro provides are needed for 8 lanes

23 Outline
- Why vectors for IRAM? Including media types
- The virtual lane model; virtual processor width
- Limitations to performance: dependencies or inadequate compiler analysis, inadequate memory bandwidth, lack of address generators, short vectors
- Comparisons to other architectures
- Conclusions

24 Matrix-Vector Multiply
Scaling Matrix-Vector Multiplication

25 Performance on Media Benchmarks
Using compiled code: 1, 2, 4, and 8 lanes

26 Compiled matrix-vector multiplication: 2 Flops/element
- Easy compilation problem; stresses memory bandwidth
- Compare to 304 MFLOPS (64-bit) for Power3 (hand-coded)
- Performance scales with the number of lanes up to 4
- More memory banks than the default DRAM macro provides are needed for 8 lanes
[Chart: MFLOPS vs. number of lanes]

27 Compiling Media Kernels on IRAM
- The compiler generates code for narrow data widths, e.g., 16-bit integer
- The compilation model is simple and more scalable (across generations) than MMX, VIS, etc.
- Strided and indexed loads/stores are simpler than pack/unpack
- The maximum vector length is longer than the datapath width (256 bits); all lane scalings run with a single executable

The IBM Power3 number is from the latest LAPACK manual, and is for the BLAS2 (dgemv) performance, presumably hand-coded by IBM experts. The IRAM numbers are from the Cray compiler. This algorithm requires either strided accesses or reduction operations. The compiler uses the strided accesses. (Reductions are worse, because more time is spent with short vectors.) Because of the strided accesses, we start to have bank conflicts with more lanes. I think we had trouble getting the simulator to do anything reasonable with subbanks, so this reports 16 banks, rather than 8 banks with 2 subbanks per bank. The BLAS numbers for the IBM are better than for most other machines without such expensive memory systems. E.g., the Pentium III is 141, the SGI O2K is 216, the Alpha Miata is 66, and the Sun Enterprise 450 is 267. Only the AlphaServer DS-20 at 372 beats VIRAM-1 (4 lanes, 8 banks) at 312. None of the IRAM numbers use a multiply-add; performance would definitely increase with that.

28 Vector Vs. SIMD: Example
Simple image processing example: conversion from RGB to YUV
Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G + 14221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
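For reference, a scalar C rendering of this conversion; a sketch, with the clamping and the planar in/out layout assumed for illustration rather than taken from the benchmark source:

#include <stdint.h>

/* Clamp helper plus the per-pixel fixed-point conversion. A
   vectorizing compiler maps the loop body onto vector instructions. */
static uint8_t clamp8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

void rgb2yuv(uint8_t *y, uint8_t *u, uint8_t *v,
             const uint8_t *r, const uint8_t *g, const uint8_t *b, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = clamp8(( 9798*r[i] + 19235*g[i] +  3736*b[i]) / 32768);
        u[i] = clamp8((-4784*r[i] -  9437*g[i] + 14221*b[i]) / 32768 + 128);
        v[i] = clamp8((20218*r[i] - 16941*g[i] -  3277*b[i]) / 32768 + 128);
    }
}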

29 VIRAM Code (22 instructions)
RGBtoYUV:
  vlds.u.b    r_v, r_addr, stride3, addr_inc   # load R
  vlds.u.b    g_v, g_addr, stride3, addr_inc   # load G
  vlds.u.b    b_v, b_addr, stride3, addr_inc   # load B
  xlmul.u.sv  o1_v, t0_s, r_v                  # calculate Y
  xlmadd.u.sv o1_v, t1_s, g_v
  xlmadd.u.sv o1_v, t2_s, b_v
  vsra.vs     o1_v, o1_v, s_s
  xlmul.u.sv  o2_v, t3_s, r_v                  # calculate U
  xlmadd.u.sv o2_v, t4_s, g_v
  xlmadd.u.sv o2_v, t5_s, b_v
  vsra.vs     o2_v, o2_v, s_s
  vadd.sv     o2_v, a_s, o2_v
  xlmul.u.sv  o3_v, t6_s, r_v                  # calculate V
  xlmadd.u.sv o3_v, t7_s, g_v
  xlmadd.u.sv o3_v, t8_s, b_v
  vsra.vs     o3_v, o3_v, s_s
  vadd.sv     o3_v, a_s, o3_v
  vsts.b      o1_v, y_addr, stride3, addr_inc  # store Y
  vsts.b      o2_v, u_addr, stride3, addr_inc  # store U
  vsts.b      o3_v, v_addr, stride3, addr_inc  # store V
  subu        pix_s, pix_s, len_s
  bnez        pix_s, RGBtoYUV

Note the very long instruction and variable names (single column).

30 MMX Code (part 1)
RGBtoYUV:
  movq      mm1, [eax]
  pxor      mm6, mm6
  movq      mm0, mm1
  psrlq     mm1, 16
  punpcklbw mm0, ZEROS
  movq      mm7, mm1
  punpcklbw mm1, ZEROS
  movq      mm2, mm0
  pmaddwd   mm0, YR0GR
  movq      mm3, mm1
  pmaddwd   mm1, YBG0B
  movq      mm4, mm2
  pmaddwd   mm2, UR0GR
  movq      mm5, mm3
  pmaddwd   mm3, UBG0B
  punpckhbw mm7, mm6
  pmaddwd   mm4, VR0GR
  paddd     mm0, mm1
  pmaddwd   mm5, VBG0B
  movq      mm1, 8[eax]
  paddd     mm2, mm3
  movq      mm6, mm1
  paddd     mm4, mm5
  movq      mm5, mm1
  psllq     mm1, 32
  paddd     mm1, mm7
  punpckhbw mm6, ZEROS
  movq      mm3, mm1
  pmaddwd   mm1, YR0GR
  movq      mm7, mm5
  pmaddwd   mm5, YBG0B
  psrad     mm0, 15
  movq      TEMP0, mm6
  movq      mm6, mm3
  pmaddwd   mm6, UR0GR
  psrad     mm2, 15
  paddd     mm1, mm5
  movq      mm5, mm7
  pmaddwd   mm7, UBG0B
  psrad     mm1, 15
  pmaddwd   mm3, VR0GR
  packssdw  mm0, mm1
  pmaddwd   mm5, VBG0B
  psrad     mm4, 15
  movq      mm1, 16[eax]

31 MMX Code (part 2)
  paddd     mm6, mm7
  movq      mm7, mm1
  psrad     mm6, 15
  psllq     mm7, 16
  movq      mm5, mm7
  psrad     mm3, 15
  movq      TEMPY, mm0
  packssdw  mm2, mm6
  movq      mm0, TEMP0
  punpcklbw mm7, ZEROS
  movq      mm6, mm0
  movq      TEMPU, mm2
  psrlq     mm0, 32
  paddw     mm7, mm0
  movq      mm2, mm6
  pmaddwd   mm2, YR0GR
  movq      mm0, mm7
  pmaddwd   mm7, YBG0B
  packssdw  mm4, mm3
  add       eax, 24
  add       edx, 8
  movq      TEMPV, mm4
  movq      mm4, mm6
  pmaddwd   mm6, UR0GR
  movq      mm3, mm0
  pmaddwd   mm0, UBG0B
  paddd     mm2, mm7
  pmaddwd   mm4,
  pxor      mm7, mm7
  pmaddwd   mm3, VBG0B
  punpckhbw mm1,
  paddd     mm0, mm6
  movq      mm6, mm1
  pmaddwd   mm6, YBG0B
  punpckhbw mm5,
  movq      mm7, mm5
  paddd     mm3, mm4
  pmaddwd   mm5, YR0GR
  movq      mm4, mm1
  pmaddwd   mm4, UBG0B
  psrad     mm0, 15
  paddd     mm0, OFFSETW
  psrad     mm2, 15
  paddd     mm6, mm5
  movq      mm5, mm7

32 MMX Code (pt. 3: 121 instructions)
  pmaddwd   mm7, UR0GR
  psrad     mm3, 15
  pmaddwd   mm1, VBG0B
  psrad     mm6, 15
  paddd     mm4, OFFSETD
  packssdw  mm2, mm6
  pmaddwd   mm5, VR0GR
  paddd     mm7, mm4
  psrad     mm7, 15
  movq      mm6, TEMPY
  packssdw  mm0, mm7
  movq      mm4, TEMPU
  packuswb  mm6, mm2
  movq      mm7, OFFSETB
  paddd     mm1, mm5
  paddw     mm4, mm7
  psrad     mm1, 15
  movq      [ebx], mm6
  packuswb  mm4,
  movq      mm5, TEMPV
  packssdw  mm3, mm4
  paddw     mm5, mm7
  paddw     mm3, mm7
  movq      [ecx], mm4
  packuswb  mm5, mm3
  add       ebx, 8
  add       ecx, 8
  movq      [edx], mm5
  dec       edi
  jnz       RGBtoYUV

33 IRAM Status
Chip:
- The ISA has not changed significantly in over a year
- Verilog complete, except the SRAM for the scalar cache
- Testing framework in place
Compiler:
- Backend code generation complete
- Continued performance improvements, especially for narrow data widths
Applications & benchmarks:
- Hand-coded kernels better than MMX, VIS, and general-purpose DSPs: DCT, FFT, MVM, convolution, image composition, …
- Compiled kernels demonstrate the ISA's advantages: MVM, sparse MVM, decrypt, image composition, …
- Full applications: H263 encoding (done), speech (underway)

To conclude my talk, today I have presented to you Vector IRAM. This is an integrated architecture for media processing that combines a 256-bit vector unit with 16 MBytes of embedded DRAM. It uses 150 million transistors and 300 square mm. At just 200 MHz, it achieves 3.2 giga-ops for 32b integers and consumes 2 Watts. It is a simple, scalable design that is efficient in terms of performance, power, and area. The current status of the prototype design is the following. We are currently in the verification and back-end stage of the design. RTL development and the design of several full-custom components have been completed. We expect to tape out the design by the end of the year. The compiler is also operational and is being tuned for performance. We are also working on applications for this system.

34 Backup from Dave Judd’s Talk

35 VIRAM Tools
- vas: assembler
- vdis: disassembler
- vsim-isa: ISA simulator
- vsim-db: debugger
- vsim-p: performance simulator
- vsim-sync: memory consistency simulator

36 Compiler Testing
C regression test suite (commercial test suite):
- Scalar emphasis, C conformance
- All tests pass except for small numerical differences due to the lack of 128-bit f.p. support
C++ test suite:
- 1167 of 1183 tests execute correctly
- 12 failures in compilation: "undefined variables"
- 4 failures in execution: bad answers

37 Compiler Testing
Vector regression test suites (Cray):
- Specifically test for vectorization
- Compare vector and scalar results, making it easy to isolate problems
"vector" status: 59 of 62 tests pass
- Some minor numerical differences
- 1 bad answer, 2 integer overflows
"vector4" status: 163 of 165 tests execute correctly
- 1 bad answer, 1 illegal use of a vector instruction

38 Kernel Performance: mvm
Matrix-vector multiplication, 64x64, 32-bit floating point:
- Hand-optimized assembly code: 579 MFLOPS
- vcc w/ restrict keywords added: 352 MFLOPS
- + 1-element padding to avoid bank conflicts: 401 MFLOPS
- + shortloop directive, loops interchanged & outer loop vectorized by vcc: 592 MFLOPS

39 Mods to mvm code

/* Original code mvm.c */
void mvm (float * A, float * X, float * Y, int n, int acol) {
  int i, j;
  if ( n <= 64 ) {
    for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
        Y[j] += A[j*acol+i] * X[i];
      }
    }
  }
}

/* Modified code */
void mvm (float * restrict A, float * restrict X, float * restrict Y, int n, int acol) {
  int i, j;
  float x_elem;
  if ( n <= 64 ) {
    for (i = 0; i < n; i++) {
      x_elem = X[i];        /* scalar load hoisted out of the vector loop */
#pragma shortloop
      for (j = 0; j < n; j++) {
        Y[j] += A[j*acol+i] * x_elem;
      }
    }
  }
}

40 Kernel performance: mm_mul
Matrix-matrix multiplication, 64x64x64, 32-bit float, 1.6 GFLOPS theoretical peak:
- Hand-coded assembly (mm-mul-small.s): 1.58 GFLOPS
- vcc w/ restrict and shortloop keywords: 0.852 GFLOPS
- + inner two loops in a separate function, allowing outer-loop vectorization: 1.51 GFLOPS

41 Kernel performance: saxpy
32-bit floating point ops (MFLOPS):

                           N=64   256   1024   4096
Hand-coded assembly         379   593    691    720
vcc w/ restrict keywords    385   596    692    721
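For reference, saxpy is the single-precision "a times x plus y" kernel; a minimal C version with the restrict qualifiers the slide mentions (a sketch, not the measured source):

/* saxpy: y = a*x + y over n elements. restrict tells vcc that x and
   y do not alias, which enables vectorization. */
void saxpy(int n, float a, const float * restrict x, float * restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}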

42 Kernel performance: motion_estimate
32-bit integer ops; finds the minimum sum of absolute differences between a reference block and a region in an image:
- Hand-optimized assembly: 1.181 Gops
- vcc w/ restrict keywords: 170 Mops
- + shortloop directives: 253 Mops
- + outer-loop unroll directive: 257 Mops*
*No improvement because of spilling.
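A sketch of the inner computation of such a kernel, invented here for illustration (block size, names, and layout are assumptions): motion estimation evaluates this sum for many candidate offsets and keeps the minimum.

#include <stdlib.h>   /* abs */

/* Sum of absolute differences for one 16x16 candidate position. */
int sad16(const unsigned char *ref, const unsigned char *cand, int stride) {
    int sum = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            sum += abs((int)ref[i*stride + j] - (int)cand[i*stride + j]);
    return sum;
}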

43 Dongarra loops
100 loops to test compiler vectorization capability; rewritten in C by Cray (?)
- vcc vectorizes 74 loops
- vcc partially vectorizes 3 loops
- vcc conditionally vectorizes 3 loops
- 1 loop is not vectorized because vector sin/cos are not currently available on VIRAM
- 19 other loops are not vectorized
Data provided by Sam Williams

44 Features Remaining
Support version 3 ISA and version 4 ISA:
- ISA changes required by the MIPS Inc. scalar core
- The performance simulator only supports the "old" ISA
Finish sync support:
- Take advantage of the Cray implementation
VIRAM machine "target":
- Allow easier maintenance of frontend and optimizer mods for VIRAM
User documentation:
- Summary of differences w/ the Cray compiler
- Useful options, hints for vector code

45 Performance Features Remaining
- Additional tuning of the instruction scheduler
- Support the new SV2 inliner for C/C++
- Shortloop enhancements
- Reduce spilling: scheduler handling of registers; ordering of blocks for register assignment within "priority groups"; special vector registers carried across calls
- Loop unrolling for vector loops
- Tune for key benchmarks

46 Other Future Compiler Features?
- Support for speculative execution
- Compiler extensions for fixed-point hardware
- Support for vector functions; vector mlib

47 Summary
- vcc is a reasonably robust compiler for VIRAM
- Performance on kernels is good with appropriate directives and some effort for optimum vectorization
- Need to prioritize the remaining work

48 Codegen/optimizer issues for VIRAM
- Variable virtual processor width (VPW)
- Variable maximum vector register length (MVL)
- Vector flag registers treated as 1-bit-wide vector registers
- Multiple base, increment, and stride registers + autoincrement
- Fixed-point arithmetic (saturating add, etc.; see the sketch below)
- Memory consistency
- New vector instructions not available on SV2
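A minimal C sketch of the saturating add behind the fixed-point item, assuming 16-bit signed semantics; VIRAM provides this as a hardware operation, so the C version is only for illustration:

#include <stdint.h>

/* Saturating 16-bit add: clamp to the representable range instead of
   wrapping on overflow, as fixed-point media code expects. */
int16_t sat_add16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;   /* widen so the sum cannot overflow */
    if (s >  32767) return  32767;
    if (s < -32768) return -32768;
    return (int16_t)s;
}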

49 Generating Code for Variable MVL
The maximum vector length is not specified in the IRAM ISA. However, the compiler assumes an mvl at compile time:
- mvl is based on vpw
- The mvl assumption depends on the VIRAM-1 hardware implementation
- Recompiling is required for future hardware versions if mvl changes
MVL knowledge is useful for code generation and the vectorizer:
- register spilling
- short loop vectorization
- length-dependent vectorization (and may eliminate the safe-vector-length computation at run time), e.g.:
  for (i = 0; i < n; i++) a[i] = a[i+32];
The sketch below spells out the last example.
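The loop above carries a dependence at distance 32, so it vectorizes safely only if at most 32 elements are processed per vector operation; when the compiler knows mvl <= 32 (VIRAM-1's mvl at 64b vpw), it can prove this statically. A strip-mined C rendering, with the MVL constant assumed:

/* Each strip writes a[i .. i+vl) and reads a[i+32 .. i+32+vl); with
   vl <= 32 the two ranges never overlap within a strip, so no run-time
   safe-vector-length check is needed. (a must hold n+32 elements.) */
#define MVL 32
void shift_by_32(long *a, int n) {
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL;
        for (int j = 0; j < vl; j++)    /* one vector operation's work */
            a[i + j] = a[i + j + 32];
    }
}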

50 Memory consistency
Sync instructions: SaV (scalar after vector), VaS (vector after scalar), VaV (vector after vector; vp), covering RaW, WaR, and WaW ordering

