Embedded DRAM for a Reconfigurable Array
S. Perissakis, Y. Joo (1), J. Ahn (1), A. DeHon, J. Wawrzynek
University of California, Berkeley; (1) LG Semicon Co., Ltd.


Outline
Reconfigurable architecture overview
Motivation for on-chip DRAM
Configurable Memory Block (CMB)
Evaluation
Conclusion

Long Term Architecture Goal
On-chip CPU
LUT-based compute pages
DRAM memory pages
Fat pyramid network
– fat tree + shortcuts

Long Term Architecture Goal
[Figure: CPU; Kernel 1 (producer) and Kernel 2 (consumer), with a Reconfigure step between them]

Motivation
Need large on-chip memory for:
– Stream buffers: reduce reconfiguration frequency
– Configuration memory: speed up reconfiguration
– Application memory: speed up individual kernels

Challenges
DRAM offers increased density (10X to 20X that of SRAM), but:
Harder to use
– Row/col accesses & variable latency
– Refresh
Lower performance
– Increased access latency
Q: Is it worth the trouble?

Trumpet test chip
One compute page
One memory page
Corresponding fraction of network

CMB Functions
Configuration source
State source/sink
Data store
Input/output

CMB Overview
[Block diagram: stall buffers, retiming registers, address & data crossbars, rate matching, CMB controller, DRAM macro. Signal labels: DQ[127:0], [127:0], [63:0], Ctl[1:0], Addr[17:0], Addr[9:0], Tree[159:0], Short[159:0], Cmd. Inputs arrive from the compute page and from the host.]

DRAM Macro
0.25 µm, 4-metal eDRAM process
1 to 8 Mbits (2 Mbits in test chip)
128-bit wide SDRAM interface
Up to 125 MHz clock → 2 GB/s peak B/W
36 ns / 12 ns row/col latencies
Row buffers to hide precharge & refresh
Designed by LG Semicon
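The 2 GB/s figure follows from the interface width and clock quoted on this slide. The sketch below reproduces the arithmetic. The function name is illustrative, and the second call is only a reading of the [63:0] label in the CMB overview: a 64-bit path at the same clock would give the 1 GB/s quoted for the test chip, which the slides do not state explicitly.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth when one bus-wide word is transferred every clock cycle."""
    return bus_width_bits / 8 * clock_hz


# Figures from the slide: 128-bit interface, 125 MHz clock.
macro_bw = peak_bandwidth_bytes_per_s(128, 125e6)
print(macro_bw / 1e9)  # 2.0

# Assumption (not stated on the slides): a 64-bit user-side path at 125 MHz.
user_bw = peak_bandwidth_bytes_per_s(64, 125e6)
print(user_bw / 1e9)  # 1.0
```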

SRAM Abstraction
SRAM-like interface: Req, R/W, Address, Data
Row buffers → simple direct-mapped cache
6-cycle minimum latency, pipelined
Misses handled by logic stalls
10-cycle miss latency “hidden” from logic
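A minimal sketch of the idea of treating row buffers as a direct-mapped cache of DRAM rows. It is a toy model, not the CMB controller itself: the number of row buffers and the number of words per row are hypothetical parameters, and only the 10-cycle miss penalty comes from the slides.

```python
class RowBufferModel:
    """Toy model: row buffers acting as a direct-mapped cache of DRAM rows."""

    MISS_STALL_CYCLES = 10  # row-buffer miss penalty quoted on the slides

    def __init__(self, num_buffers: int = 4, words_per_row: int = 16):
        # num_buffers and words_per_row are hypothetical, chosen for illustration.
        self.num_buffers = num_buffers
        self.words_per_row = words_per_row
        self.tags = [None] * num_buffers  # DRAM row currently held by each buffer

    def access(self, word_addr: int) -> int:
        """Return the stall cycles seen by user logic for this access."""
        row = word_addr // self.words_per_row
        index = row % self.num_buffers  # direct-mapped: exactly one candidate buffer
        if self.tags[index] == row:
            return 0  # hit: the request proceeds without a stall
        self.tags[index] = row  # miss: load the row, evicting the previous one
        return self.MISS_STALL_CYCLES


model = RowBufferModel()
stalls = [model.access(addr) for addr in range(32)]  # sweep two consecutive rows
print(sum(stalls))  # 20
```

In this model a sequential sweep pays one miss per row, while two rows that map to the same buffer evict each other on alternating accesses.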

Stalls
Stall sources:
– Row buffer miss (10 cycles)
– Write after read (4 cycles)
– DRAM/logic clock alignment (1 cycle)
– Refresh (Halt from host)
Multicycle stall distribution

Stall Buffers
Memory page is never stalled
– Must buffer read data during stall
– Must buffer requests during stall distribution
[Figure: user logic, input and output stall buffers, CMB logic, DRAM macro]

Trumpet Test Chip
0.25 µm DRAM, 0.4 µm logic
2 Mbits + 64 LUTs
125 MHz operation
1 GB/sec peak bandwidth
10 µsec reconfiguration
10 x 5 mm² die

CMB Area Breakdown
… mm² total
2 Mbits capacity → 147 Kbits/mm² average density
Compare to … Kbits/mm² commodity DRAM
[Figure: area split between DRAM macro and CMB logic]

Using a Custom Macro
Existing:
– … mm²
– 147 Kbits/mm²
Custom:
– 9.4 mm²
– 218 Kbits/mm²

Comparison to SRAM
CMB DRAM (custom macro) → 218 Kb/mm²
SRAM (equal area) → 25 Kb/mm²
With typical SRAM core densities and:
– No stall buffers
– Simplified controller
Close to 1 order of magnitude density advantage for DRAM
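The density figures can be cross-checked from numbers already on the slides. The sketch below assumes 1 Mbit = 1024 Kbits, which the slides do not spell out; the variable names are illustrative.

```python
CAPACITY_KBITS = 2 * 1024   # 2 Mbits, assuming 1 Mbit = 1024 Kbits
CUSTOM_AREA_MM2 = 9.4       # CMB area with the custom macro, from the slides
SRAM_DENSITY = 25           # Kbits/mm^2, equal-area SRAM figure from the slides

dram_density = CAPACITY_KBITS / CUSTOM_AREA_MM2  # Kbits/mm^2
advantage = dram_density / SRAM_DENSITY          # DRAM-to-SRAM density ratio

print(round(dram_density), round(advantage, 1))  # 218 8.7
```

A ratio of about 8.7 is what the slide summarizes as close to one order of magnitude.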

Performance
Configuration / state swap: peak 1 GB/s
User accesses: dependent on access patterns
– Peak if high locality
– Near peak for sequential patterns (62-93%)
– Column latency exposed when dependencies exist, or on mixed R/W
– Row latency exposed on random accesses
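One possible reading of the 62-93% range for sequential patterns: each 1 Kbit DRAM row costs one 10-cycle row-buffer miss, and every other access completes in one cycle. Under that assumption the efficiency depends only on how many accesses fit in a row. The access widths of 64 and 8 bits below are hypothetical choices, not taken from the slides, and the agreement with the quoted range may be coincidental.

```python
ROW_BITS = 1024       # "1 Kbit = 1 DRAM row", from the DCT example slide
ROW_MISS_STALL = 10   # cycles per row-buffer miss, from the Stalls slide


def sequential_efficiency(access_width_bits: int) -> float:
    """Useful cycles divided by total cycles for a purely sequential sweep.

    Assumes one access per cycle on hits and exactly one miss per DRAM row.
    """
    accesses_per_row = ROW_BITS // access_width_bits
    return accesses_per_row / (accesses_per_row + ROW_MISS_STALL)


for width in (64, 8):  # hypothetical user access widths in bits
    print(width, round(100 * sequential_efficiency(width)))
# prints "64 62" then "8 93"
```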

Performance (example)
[Figure: input image stored in scanline order, with one 8x8 DCT block marked; row and column directions labeled; 1 Kbit = 1 DRAM row]
Row: ~4 misses / DCT block
Col: 2 misses / DCT block
→ 73% efficiency

Refresh Overhead
8 to 16 ms retention time expected
2.5% to 5.0% bandwidth loss
Can reduce by refreshing only active part of memory
May skip refresh for short-lived data
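A rough sketch of why the bandwidth loss halves when the retention time doubles: if every row must be refreshed once per retention period, the lost fraction is the total refresh time divided by that period. The row count is derived from the 2 Mbit macro and 1 Kbit rows given on the slides. The cost of 24 cycles per row refresh is a hypothetical value chosen so the result lands near the quoted range; it is not given in the source.

```python
def refresh_overhead(rows: int, cycles_per_row: int,
                     retention_s: float, clock_hz: float) -> float:
    """Fraction of cycles spent refreshing every row once per retention period."""
    return rows * cycles_per_row / (retention_s * clock_hz)


ROWS = 2 * 1024 * 1024 // 1024  # 2 Mbit macro with 1 Kbit rows -> 2048 rows
CYCLES_PER_ROW = 24             # hypothetical refresh cost per row
CLOCK_HZ = 125e6                # 125 MHz, from the slides

for retention_ms in (8, 16):
    loss = refresh_overhead(ROWS, CYCLES_PER_ROW, retention_ms / 1e3, CLOCK_HZ)
    print(retention_ms, round(100 * loss, 1))
```

The same formula shows the effect of refreshing only the active part of memory: the loss scales linearly with the number of rows that actually need refreshing.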

Conclusion
Q: Is on-chip DRAM advantageous compared to SRAM?
Our experience so far:
– User-friendly abstraction possible
– Can maintain density advantage
– Effect on application performance:
  » Large buffer space → less frequent reconfiguration
  » High bandwidth → faster reconfiguration
  » Effect on individual kernels often limited by DRAM core latency