Presentation is loading. Please wait.

Presentation is loading. Please wait.

Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas FlexRAM.

Similar presentations


Presentation on theme: "Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas FlexRAM."— Presentation transcript:

1 Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas http://iacoma.cs.uiuc.edu torrellas@cs.uiuc.edu FlexRAM

2 People Involved Students Other faculty Michael Huang Seung Yoo H. V. Jagadish Joe Renau David Padua Jaejin Lee Daniel Reed

3 Technological Landscape Merged Logic and DRAM (MLD): IBM, Mitsubishi, Samsung, Toshiba and others 0.18 um (chips for 1 Gbit DRAM) Powerful: e.g. IBM SA-27E ASIC (Feb 99) Logic frequency: 400 MHz IBM PowerPC 603 proc + 16 KB I, D caches = 3% Further advances in the horizon Opportunity: How to exploit MLD best?

4 Terminology Processor In Memory (PIM) Intelligent Memory or Intelligent RAM (IRAM) =

5 Key Applications Benefit from HW Data Mining (decision trees and neural networks) Computational Biology (protein sequence matching) Molecular Dynamics (short-range forces) Financial Modeling (stock options, derivatives) Multimedia (MPEG-2) Decision Support Systems (TPC-D) Speech Recognition All these are Data Intensive Applications

6 Example App: DNA Matching z Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

7 How the Algorithm Works z Generate 50+ most-likely mutations zPick 4 consecutive aminoacids from sample

8 Example App: DNA Matching z If match is found: try to extend it zCompare them to every positions in the database DNAs

9 How to Use MLD 1. Main compute engine of the machine Add proc to DRAM chip Include a vector processor or multiple processors Incremental gains Hard to program UC Berkeley: IRAM Notre Dame: Execube, Petaflops MIT: Raw Stanford: Smart Memories

10 How to Use MLD (II) 2. Co-processor, special-purpose processor ATM switch controller Process data beside the disk Graphics accelerator Stanford: Imagine UC Berkeley: ISTORE

11 How to Use MLD (III) 3. Our approach: take the place of memory chips in a workstation or server PIM chip processes the memory-intensive parts of the program Illinois: FlexRAM UC Davis: Active Pages USC-ISI: DIVA

12 Our Solution: Principles zExtract high bandwidth from DRAM: yMany simple processing units zRun legacy codes with high performance: yDo not replace off-the-shelf uP in workstation yTake place of memory chip. Same interface as DRAM yIntelligent memory defaults to plain DRAM zSmall increase in cost over DRAM: ySimple processing units, still dense zGeneral purpose: yDo not hardwire any algorithm. No Special purpose

13 Architecture Proposed

14 The FlexRAM Memory System Can exploit multiple levels of parallelism For a high-end workstation: 100’s of P.Mems in memory (e.g. IBM PowerPC 603) 1 P.Host processor (e.g. Merced, IBM GP) 100,000’s of very simple P.Arrays in memory

15 Chip Organization

16 Memory in one FlexRAM Chip 64 Mbytes of DRAM organized as 16Mx32 bits Organized in 64 1-Mbyte banks 1 single port Associated to 1 P.Array 2 2-Kbyte row buffers (no P.Array cache) P.Array access to memory: 10 ns (row hit) or 20 ns (miss) On-chip memory bandwidth: 102 Gbytes/second Each bank:

17 Memory in one FlexRAM Chip Group of 4 P.Arrays share one 8-Kbyte, 4-ported SRAM instruction memory Holds the P.Array code Small because short code Aggressive access time: 1 cycle = 2.5 ns

18 P.Array 64 P.Arrays per chip. Not SIMD but SPMD 32-bit integer arithmetic; 16 registers 4 P.Arrays share one multiplier No caches, no floating point 28 different 16-bit instructions Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring Broadcast and notify primitives: Barrier

19 P.Mem 2-issue static superscalar like IBM PowerPC 603 16-Kbyte I, D caches Communication with P.Arrays: Executes serial sections Broadcast/notify or plain write/read to memory Communication with other P.Mems: Memory in all chips is visible Access via the inter-chip network Must flush caches to ensure data coherence

20 Issues Communication P.Mem-P.Host: P.Mem cannot be the master of bus P.Host polls a register in Rambus interf. of master P.Mem P.Host starts P.Mems by writing register in Rambus interf. If P.Mem not finished: memory controller retries. Retries are invisible to P.Host Virtual memory: They share a range of virtual addresses with P.Host P.Mems and P.Arrays use virtual memory

21 Chip Architecture

22 Basic Block

23 Area Estimation (mm ) PowerPC 603+caches: 12 SRAM instruction memory: 34 64 Mbytes of DRAM: 330 P.Arrays: 96 Pads + network interf. + refresh logic 20 Rambus interface: 3.4 Multipliers: 10 Total = 505 Of which 28% logic, 65% DRAM, 7% SRAM 2 VERY CONSERVATIVE

24 Evaluation

25 Utilization zHigh P.Array Utilz Low P.Mem Util

26 Utilization z Low P.Host Utilization

27 Speedups zConstant Problem Szz Scaled Problem Sz

28 Speedups zVarying Logic Frequency

29 Programming FlexRAM FlexRAM programmed in C + extensions: C-Flex Library of Intelligent Memory Operations (IMOs) Executed by P.Arrays or P.Mem C subroutines that can be called from main pgm Operate on large data sets with poor locality Library also contains plain subroutines Link program with IMOs or plain subroutines

30 C-Flex Programming Extensions On processor_range: where the following code is executed Waitfor processor_range: processors waiting for others Release object Map object to processor_range: mapping of pages Flush(object), Flush&Inval(object): flush from cache Broadcast(address), Poll(), Receive(address), Notify() FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()

31 Performance Evaluation zHardware performance monitoring embedded in the chip zSoftware tools to extract and interpret performance info

32 Current Status zIdentified and wrote all applications zDesigned architecture based on apps & feasible technology zConceived ideas behind language/compiler zNeed to do: chip layout and fabrication development of the compiler zFunds needed for: yprocessor core (P.Mem) ychip fabrication yhardware and software engineers

33 Overall Goal Fabricate chips Build a workstation with an intelligent memory system Demonstrate significant speedups on real applications Build a compiler for the intelligent memory system

34 Conclusion zWe have a handle on: yA promising technology (MLD) yKey applications of industrial interest zReal chance to transform the computing landscape


Download ppt "Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas FlexRAM."

Similar presentations


Ads by Google