Rough Schedule 1:30-2:15 IRAM overview 2:15-3:00 ISTORE overview break


1 Rough Schedule
- 1:30-2:15 IRAM overview
- 2:15-3:00 ISTORE overview
- break
- 3:15-3:30 Financial
- 4:00-5:00 Future

2 IRAM Hardware and Software
Kathy Yelick Computer Science Division UC Berkeley

3 Intelligent RAM: IRAM
[Diagram labels: logic fab with Proc, L2$, Bus, I/O and separate DRAM; DRAM fab with Proc, Bus, I/O]
- Microprocessor & DRAM on a single chip:
  - 10X capacity vs. DRAM
  - on-chip memory latency 5-10X, bandwidth X
  - improve energy efficiency 2X-4X (no off-chip bus)
  - serial I/O 5-10X v. buses
  - smaller board area/volume
- IRAM advantages extend to:
  - a single chip system
  - a building block for larger systems
- $B for separate lines for logic and memory
- Single chip: either processor in DRAM or memory in logic fab

4 VIRAM: System on a Chip
- 0.18 um EDL process
- 16 MB DRAM, 8 banks
- MIPS scalar core, 200 MHz
- 4-lane 64-bit vector unit, 200 MHz
- 17x17 mm, 2 Watts target
- 25.6 GB/s memory (6.4 GB/s per direction and per Xbar)
- 0.8 Gflops (64-bit), 6.4 GOPs (16-bit)
[Floorplan labels: Memory (64 Mbits / 8 MBytes), 4 Vector Pipes/Lanes, CPU+$, Xbar, Memory (64 Mbits / 8 MBytes)]
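The bandwidth figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a sanity check under stated assumptions: the slide gives the 6.4 GB/s and 25.6 GB/s numbers, while the 256-bit crossbar width (4 lanes x 64 bits) and the count of two crossbars are inferred here, not stated.

```python
# Back-of-the-envelope check of the VIRAM-1 memory bandwidth figures.
# Assumptions (not stated on the slide): each crossbar moves
# 4 lanes x 64 bits = 256 bits per cycle in each direction, and the
# chip has two crossbars.

CLOCK_HZ = 200e6        # 200 MHz, from the slide
LANES = 4               # 4 vector pipes/lanes, from the slide
LANE_WIDTH_BITS = 64    # 64-bit lanes, from the slide
DIRECTIONS = 2          # load and store (assumed meaning of "per direction")
CROSSBARS = 2           # assumed


def crossbar_bandwidth_gb_s():
    """Bandwidth of one crossbar in one direction, in GB/s."""
    bytes_per_cycle = LANES * LANE_WIDTH_BITS // 8
    return bytes_per_cycle * CLOCK_HZ / 1e9


def total_bandwidth_gb_s():
    """Aggregate bandwidth over both directions and all crossbars."""
    return crossbar_bandwidth_gb_s() * DIRECTIONS * CROSSBARS
```

Under these assumptions the two functions give about 6.4 and 25.6 GB/s, matching the slide.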

5 IRAM Chip Update
- IBM supplying embedded DRAM/Logic (100%)
  - Agreement in place and technology files available
- MIPS supplying scalar core (100%)
  - MIPS processor, caches, TLB
- MIT supplying FPU (100%)
- VIRAM-1 tape-out scheduled for late 2000
- Simplifications
  - Floating point
  - Network interface

6 VIRAM-1 Chip Design Status
- MIPS scalar core
  - Synthesizable RTL code received from MIPS
  - Cache RAMs to be compiled for IBM technology
- FPU
  - RTL code almost complete
- Vector unit
  - RTL models for sub-blocks developed; currently being integrated and tested
  - Control logic to be compiled for IBM technology
  - Full-custom layout for multipliers/adders developed; layout for shifters to be developed
- Memory system
  - Synthesizable model for DRAM controllers done
  - To be integrated with IBM DRAM macros
  - Full-custom layout for crossbar under development
- Testing infrastructure
  - Environment developed for automatic test & validation
  - Directed tests for single/multiple instruction groups developed
  - Random instruction sequence generator developed

7 IRAM Architecture Update
- ISA mostly frozen since 6/99
  - Changes in 2H 99 for better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
  - Minor changes in 1H 00 to address new co-processor interface in MIPS core
- ISA manual publicly available
- Suite of simulators actively used
  - vsim-isa (functional)
    - Major rewrite underway for new scalar processor
    - All UCB code
  - vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)

8 IRAM Compiler Status
[Diagram labels: Frontends (C, Fortran, C++); Vectorizer; PDGCS; Code Generators (IRAM, C90)]
- Retarget of Cray backend
- Steps in compiler development
  - Build MIPS backend (done)
  - Build VIRAM backend for vectorized loops (done)
  - Instruction scheduling for VIRAM-1 (works, but could be improved)
  - Insertion of memory barriers (using Cray strategy, improving)
  - Optimizations for short loops (reduce overhead)
  - Feedback results to Cray, new version from Cray (ongoing)

9 IRAM Compiler Update
- Study of compiler quality using 100 “Dongarra loops”
  - 70 vectorized
    - Average 10x reduction in dynamic instruction count
    - Average vector length of 42
  - 30 did not vectorize, usually due to a dependence
    - Some reductions missed
    - Vector version of math libraries (sin, cos, etc.) needed
    - Some failed due to bugs in benchmark
- Identified 2 specific areas for improvements in loop overhead
  - Use VL and MVL more carefully
  - Use auto-increment instruction more extensively
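The loop-overhead point about VL and MVL can be illustrated with a strip-mined loop. VL and MVL are read here as the vector length register and the maximum vector length, the usual vector-architecture terms; the slide does not expand them. This is a minimal Python sketch, not VIRAM code: the function name `saxpy_strip_mined` and the default `mvl=64` are illustrative choices, and each list slice stands in for one vector instruction.

```python
def saxpy_strip_mined(a, x, y, mvl=64):
    """Compute a*x + y in strips of at most mvl elements.

    Returns the result list and the number of strips executed.
    mvl is a hypothetical maximum vector length, chosen for illustration.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    result = list(y)
    strips = 0
    i = 0
    while i < n:
        # Set the vector length for this strip; only the last strip
        # may be shorter than mvl.
        vl = min(mvl, n - i)
        # One "vector instruction" worth of work on vl elements.
        result[i:i + vl] = [a * xi + yi
                            for xi, yi in zip(x[i:i + vl], result[i:i + vl])]
        i += vl
        strips += 1
    return result, strips
```

For 100 elements and `mvl=64` the loop runs two strips (64 and 36 elements). Per-strip bookkeeping such as setting VL and advancing pointers is the kind of overhead the two improvement areas target, and it matters most when loops are short.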

10 Compiled Applications Update
- Applications using compiler
  - Speech processing under development
    - Developed new small-memory algorithm for speech processing
    - Uses some existing kernels (FFT and MM)
    - Vector search algorithm is most challenging
  - DIS image understanding application under development
    - Compiles, but does not yet vectorize well
  - Singular Value Decomposition
    - Better than 2 VLIW machines (TI C67 and TM 1100)
    - Challenging BLAS-1,2 work well on IRAM because of memory BW
- Kernels
  - SAXPY, MVM, etc.
  - Will include DIS stress-marks

11 (10n x n SVD, rank 10) (From Herman, Loo, Tang, CS252 project)

12 Hand-Coded Applications Update
- Image processing kernels (old FPU model)
- Note BLAS-2 performance

13 Problem: General Element Permutation
- Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
  - Datapath: 16 x 16 (x 16b) crossbar; scales by O(N^2)
  - Control: to-1 multiplexors; scales by O(N*logN)
- Time/energy wasted on wide vector register file port

14 Simple Vector Permutations
- Simple steps of butterfly permutations
  - A register provides the butterfly radix
  - Separate instructions for moving elements to left/right
- Sufficient semantics for
  - Fast reductions of vector registers (dot products)
  - Fast FFT kernels
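A reduction built from butterfly steps can be sketched in a few lines. This is an illustrative Python model, not the VIRAM instruction semantics: `shift_down` is a hypothetical stand-in for one "move elements left" step with a given radix, and filling vacated slots with zero is an assumption made here.

```python
def shift_down(vec, radix):
    """Hypothetical permutation step: element i receives element i + radix.

    Slots with no source element are filled with 0 (an assumption).
    """
    n = len(vec)
    return [vec[i + radix] if i + radix < n else 0 for i in range(n)]


def butterfly_sum(vec):
    """Sum a power-of-two-length vector in log2(n) shift-and-add steps."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    acc = list(vec)
    radix = n // 2
    while radix >= 1:
        moved = shift_down(acc, radix)
        acc = [a + b for a, b in zip(acc, moved)]  # one vector add
        radix //= 2
    return acc[0]  # element 0 holds the full sum


def dot(x, y):
    """Dot product: element-wise multiply, then butterfly reduction."""
    return butterfly_sum([a * b for a, b in zip(x, y)])
```

For example, `dot([1, 2, 3, 4], [5, 6, 7, 8])` returns 70 after two shift-and-add steps. The radix halves each step, which is why a register-supplied radix is enough to express the whole reduction.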

15 Hardware for Simple Permutations
- Hardware for b elements, 256b datapath
  - Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales by O(N)
  - Control: 6 control cases; scales by O(N)
- Other benefits
  - Consecutive result elements written together
  - Buses used only for small radices

16 FFT: Uses In-Register Permutations
Without in-register permutations

17 Summary
- IRAM takes advantage of high on-chip bandwidth
  - BLAS-2 performance confirms this
- Vector IRAM ISA utilizes this bandwidth
  - Unit, strided, and indexed memory access patterns supported
  - Exploits fine-grained parallelism, even with pointer chasing
- Compiler
  - Well-understood compiler model, semi-automatic
  - Still some work on code generation quality
- Application benchmarks
  - Compiled and hand-coded
  - Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
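The three memory access patterns named in the summary can be modelled on a flat Python list. This is a conceptual sketch only: the function names are hypothetical and do not correspond to VIRAM instruction mnemonics.

```python
def load_unit_stride(mem, base, vl):
    """Consecutive elements: mem[base], mem[base+1], ..."""
    return [mem[base + i] for i in range(vl)]


def load_strided(mem, base, stride, vl):
    """Every stride-th element, e.g. a column of a row-major matrix."""
    return [mem[base + i * stride] for i in range(vl)]


def load_indexed(mem, base, indices):
    """Gather: one element per entry of an index vector."""
    return [mem[base + idx] for idx in indices]


def sparse_row_dot(values, col_idx, x):
    """One row of a sparse matrix-vector product, using an indexed load.

    values and col_idx hold the row's nonzeros and their column numbers.
    """
    gathered = load_indexed(x, 0, col_idx)
    return sum(v * g for v, g in zip(values, gathered))
```

With `mem = list(range(12))` viewed as a 3 x 4 row-major matrix, `load_strided(mem, 1, 4, 3)` returns column 1, `[1, 5, 9]`. The indexed form is what sparse MVM relies on: `sparse_row_dot([2, 3], [0, 3], [10, 20, 30, 40])` returns 140.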

18 IRAM as Building Block for ISTORE
- System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
- Target for years:
  - building block: 2006 MicroDrive integrated with IRAM
    - 9GB disk, 50 MB/sec disk (projected)
    - connected via crossbar switch
    - O(10) Gflops
  - 10,000+ nodes fit into one rack!

