Embedded DRAM for a Reconfigurable Array. S. Perissakis, Y. Joo¹, J. Ahn¹, A. DeHon, J. Wawrzynek. University of California, Berkeley; ¹LG Semicon Co., Ltd.

1 Embedded DRAM for a Reconfigurable Array. S. Perissakis, Y. Joo¹, J. Ahn¹, A. DeHon, J. Wawrzynek. University of California, Berkeley; ¹LG Semicon Co., Ltd.

2 Outline Reconfigurable architecture overview Motivation for on-chip DRAM Configurable Memory Block (CMB) Evaluation Conclusion

3 Long Term Architecture Goal On-chip CPU LUT-based compute pages DRAM memory pages Fat pyramid network (fat tree + shortcuts)

7 Long Term Architecture Goal [Figure labels: CPU, Kernel 1 (producer), Kernel 2 (consumer), Reconfigure]

8 Motivation Need large on-chip memory for: – Stream buffers: reduce reconfiguration frequency – Configuration memory: speed up reconfiguration – Application memory: speed up individual kernels

9 Challenges DRAM offers increased density (10X to 20X that of SRAM), but: Harder to use – Row/Col accesses & variable latency – Refresh Lower performance – Increased access latency Q: Is it worth the trouble?

10 Trumpet test chip – One compute page – One memory page – Corresponding fraction of network

11 CMB Functions Configuration source State source/sink Data store Input/output

12 CMB Overview [Block diagram labels: Stall Buffers, Retiming Registers, Address & Data Xbars, Rate Matching, CMB Controller, DRAM Macro. Signals: DQ[127:0], [127:0], [63:0], Ctl[1:0], Addr[17:0], Addr[9:0], Ctl[1:0], Tree[159:0], Short[159:0], Cmd. Sources: from compute page, from host]

13 DRAM Macro 0.25 µm, 4 metal eDRAM process 1 to 8 Mbits (2 Mbits in test chip) 128-bit wide SDRAM interface Up to 125 MHz clock → 2 GB/s peak B/W 36 ns / 12 ns row/col latencies Row buffers to hide precharge & refresh Designed by LG Semicon
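The peak-bandwidth figure on this slide follows directly from the interface width and clock. The short Python sketch below redoes that arithmetic and also converts the quoted row/column latencies into clock cycles at 125 MHz. The constants come from the slide; the function names are illustrative, and "GB" is assumed to mean 10^9 bytes.

```python
# Figures taken from the slide; the helper names are hypothetical.
BUS_WIDTH_BITS = 128     # SDRAM interface width
CLOCK_HZ = 125e6         # maximum macro clock
ROW_LATENCY_NS = 36.0    # row access latency
COL_LATENCY_NS = 12.0    # column access latency


def peak_bandwidth_bytes_per_s(width_bits: int, clock_hz: float) -> float:
    """Assume one full-width transfer per clock cycle."""
    return width_bits / 8 * clock_hz


def latency_in_cycles(latency_ns: float, clock_hz: float) -> float:
    """Convert a latency in nanoseconds to (unrounded) clock cycles."""
    return latency_ns * 1e-9 * clock_hz


peak = peak_bandwidth_bytes_per_s(BUS_WIDTH_BITS, CLOCK_HZ)
row_cycles = latency_in_cycles(ROW_LATENCY_NS, CLOCK_HZ)
col_cycles = latency_in_cycles(COL_LATENCY_NS, CLOCK_HZ)

print(f"peak bandwidth: {peak / 1e9:.1f} GB/s")
print(f"row latency: {row_cycles:.1f} cycles, column latency: {col_cycles:.1f} cycles")
```

At an 8 ns clock period the raw row and column latencies are about 4.5 and 1.5 cycles; the larger cycle counts quoted for the SRAM-like interface on later slides presumably include controller and pipeline overhead, which this sketch does not model.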

14 SRAM Abstraction SRAM-like interface: Req, R/W, Address, Data Row buffers → simple direct-mapped cache 6-cycle minimum latency, pipelined Misses handled by logic stalls 10-cycle miss latency “hidden” from logic
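The idea of treating row buffers as a simple direct-mapped cache can be illustrated with a small Python model. This is a sketch under stated assumptions, not the CMB controller's actual logic: the 10-cycle miss stall is from this slide and the 1 Kbit row size from the later DCT example, while the 64-bit word width, the number of row buffers (4), and the row-to-buffer mapping (row index modulo buffer count) are hypothetical choices made only for illustration.

```python
class RowBufferModel:
    """Direct-mapped row buffers in front of a DRAM array (illustrative only)."""

    def __init__(self, row_bits=1024, word_bits=64, n_buffers=4, miss_stall=10):
        # n_buffers and word_bits are assumptions; the slides do not give them.
        self.words_per_row = row_bits // word_bits
        self.n_buffers = n_buffers
        self.miss_stall = miss_stall
        self.open_rows = [None] * n_buffers  # DRAM row held by each buffer
        self.stall_cycles = 0
        self.accesses = 0

    def access(self, word_addr: int) -> bool:
        """Return True on a row-buffer hit, False on a miss that stalls the logic."""
        row = word_addr // self.words_per_row
        slot = row % self.n_buffers          # direct-mapped: one candidate buffer
        self.accesses += 1
        if self.open_rows[slot] == row:
            return True
        self.open_rows[slot] = row           # replace whatever row was there
        self.stall_cycles += self.miss_stall
        return False


# Sequential walk over 64 words: one miss per 16-word row.
seq = RowBufferModel()
seq_hits = [seq.access(a) for a in range(64)]

# Two rows that map to the same buffer evict each other on every access.
conflict = RowBufferModel()
conflict_hits = [conflict.access(a) for a in (0, 64, 0, 64)]

print(seq_hits.count(False), seq.stall_cycles)
print(conflict_hits.count(False), conflict.stall_cycles)
```

In this model the user logic sees an SRAM-like request interface and only notices the DRAM underneath through the accumulated stall cycles, which is the behavior the slide describes.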

15 Stalls Stall sources: – Row buffer miss (10 cycles) – Write after read (4 cycles) – DRAM/logic clock alignment (1 cycle) – Refresh (Halt from host) Multicycle stall distribution

16 Stall Buffers Memory page is never stalled – Must buffer read data during stall – Must buffer requests during stall distribution [Figure labels: Input, Stall Buf, Output, DRAM macro, User logic, CMB logic]

17 Trumpet Test Chip 0.25 µm DRAM, 0.4 µm logic 2 Mbits + 64 LUTs 125 MHz operation 1 GB/s peak bandwidth 10 µs reconfiguration 10 x 5 mm² die 1 W @ 125 MHz

18 CMB Area Breakdown 13.95 mm² total 2 Mbits capacity → 147 Kbits/mm² average density Compare to 700-900 Kbits/mm² commodity DRAM [Figure labels: DRAM Macro, CMB Logic]

19 Using a Custom Macro Existing: – 13.95 mm² – 147 Kbits/mm² Custom: – 9.4 mm² – 218 Kbits/mm²

20 Comparison to SRAM CMB DRAM (custom macro) → 218 Kb/mm² SRAM (equal area) → 25 Kb/mm² With typical SRAM core densities and: – No stall buffers – Simplified controller Close to 1 order of magnitude density advantage for DRAM
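The density figures on the area slides can be checked with a few lines of Python. The capacities, areas, and the 25 Kb/mm² SRAM figure are taken from the slides; the conversion 1 Mbit = 1024 Kbit is an assumption, chosen because it reproduces the quoted 147 and 218 Kbits/mm² values.

```python
KBITS_PER_MBIT = 1024  # assumption: binary megabits


def density_kbits_per_mm2(capacity_mbits: float, area_mm2: float) -> float:
    """Average storage density of a memory block, in Kbits per mm^2."""
    return capacity_mbits * KBITS_PER_MBIT / area_mm2


existing = density_kbits_per_mm2(2, 13.95)  # CMB with the existing macro
custom = density_kbits_per_mm2(2, 9.4)      # CMB with a custom macro
sram = 25.0                                 # equal-area SRAM figure from the slide
advantage = custom / sram                   # DRAM-to-SRAM density ratio

print(f"existing: {existing:.1f} Kbits/mm^2")
print(f"custom:   {custom:.1f} Kbits/mm^2")
print(f"DRAM/SRAM density ratio: {advantage:.1f}x")
```

The ratio comes out a little under 9x, which is what the slide summarizes as close to one order of magnitude.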

21 Performance Configuration / state swap: peak 1 GB/s User accesses: dependent on access patterns – Peak if high locality – Near peak for sequential patterns (62-93%) – Column latency exposed when dependencies exist, or on mixed R/W – Row latency exposed on random accesses
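A simple way to reason about the sequential-access case is to charge one row-buffer miss per DRAM row and one cycle per access otherwise. The Python sketch below does that. The 10-cycle miss stall and the 1 Kbit row size come from other slides; the one-access-per-cycle hit rate and the access widths (64-bit and 8-bit) are assumptions. The slides do not say how the 62-93% range was derived, so the agreement shown here is suggestive only.

```python
def sequential_efficiency(row_bits: int, access_bits: int, miss_stall_cycles: int) -> float:
    """Fraction of cycles doing useful transfers for a purely sequential scan.

    Assumes one access per cycle on a row-buffer hit and a single
    miss_stall_cycles penalty each time the scan enters a new DRAM row.
    """
    accesses_per_row = row_bits // access_bits
    return accesses_per_row / (accesses_per_row + miss_stall_cycles)


wide = sequential_efficiency(row_bits=1024, access_bits=64, miss_stall_cycles=10)
narrow = sequential_efficiency(row_bits=1024, access_bits=8, miss_stall_cycles=10)

print(f"64-bit accesses: {wide:.1%}")
print(f" 8-bit accesses: {narrow:.1%}")
```

Under these assumptions, wide accesses exhaust a row quickly and pay the miss penalty more often, while narrow accesses amortize it over many more hits, which is consistent with efficiency depending on the access pattern as the slide states.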

22 Performance (example) Input image in scanline order, processed in 8x8 DCT blocks; 1 Kbit = 1 DRAM row Row: ~4 misses / DCT block Col: 2 misses / DCT block → 73% efficiency

23 Refresh Overhead 8 to 16 ms retention time expected 2.5% to 5.0% bandwidth loss Can reduce by refreshing only active part of memory May skip refresh for short-lived data
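The two ends of the quoted ranges can be related with simple arithmetic: if refreshing the whole memory once takes a fixed time, the bandwidth loss is that time divided by the retention time. The Python sketch below assumes the pairing 16 ms with 2.5% and 8 ms with 5.0% (shorter retention means more frequent refresh), and models "refreshing only the active part" as a linear scaling by a hypothetical `active_fraction` parameter.

```python
def refresh_pass_time_ms(retention_ms: float, bandwidth_loss: float) -> float:
    """Time spent refreshing per retention period, implied by a given loss."""
    return retention_ms * bandwidth_loss


def bandwidth_loss(retention_ms: float, pass_time_ms: float,
                   active_fraction: float = 1.0) -> float:
    """Fraction of time lost to refresh when only part of the memory is refreshed."""
    return pass_time_ms * active_fraction / retention_ms


pass_long = refresh_pass_time_ms(16.0, 0.025)   # 16 ms retention, 2.5% loss
pass_short = refresh_pass_time_ms(8.0, 0.050)   # 8 ms retention, 5.0% loss
half_active = bandwidth_loss(8.0, pass_short, active_fraction=0.5)

print(f"implied refresh time per pass: {pass_long:.2f} ms and {pass_short:.2f} ms")
print(f"loss at 8 ms retention with half the memory active: {half_active:.1%}")
```

Both ends of the range imply the same refresh cost of about 0.4 ms per full pass, so under this model the spread in bandwidth loss comes entirely from the uncertainty in retention time.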

24 Conclusion Q: Is on-chip DRAM advantageous over SRAM? Our experience so far: – User-friendly abstraction possible – Can maintain density advantage – Effect on application performance: » Large buffer space → less frequent reconfiguration » High bandwidth → faster reconfiguration » Effect on individual kernels often limited by DRAM core latency

