The Memory/Logic Interface in FPGA’s with Large Embedded Memory Arrays Steven J. E. Wilton, Member, IEEE, Jonathan Rose, Member, IEEE, and Zvonko G. Vranesic, Senior Member, IEEE Laboratory of Reliable Computing Department of Electrical Engineering National Tsing Hua University Hsinchu, Taiwan
Reference S. J. E. Wilton, “Architectures and algorithms for field-programmable gate arrays with embedded memory,” Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. Toronto, Toronto, Ont., Canada, 1997.
Outline Introduction Baseline architecture Experimental methodology and results Enhanced architecture and its improvements
Introduction In the past, FPGAs were primarily used to implement small logic subcircuits As FPGA capacities grow, they will be used to implement much larger circuits than ever before To address the storage requirements of large systems, FPGAs with large embedded memory arrays are now being developed by many vendors
Introduction One of the challenges in embedding memory arrays into an FPGA is providing enough interconnect between the memory arrays and the logic resources
Baseline Architecture
Memory/Logic Interconnect Block
Benchmark Circuit Generation Benchmark circuits must be generated for this architecture because Typical circuits contain only a few memories each Gathering hundreds of such circuits is not feasible The solution is to study the types of memory configurations found in real systems and develop a stochastic memory configuration generator Circuit analysis ensures the generated circuits are realistic
Circuit Analysis Memory configurations Logic/memory clustering Interconnect patterns Point-to-point patterns Shared-connection patterns Point-to-point patterns with no shuffling
Memory Configurations 171 circuits with a total of 268 user memories, drawn from Recent conference proceedings Recent journal articles Local designers A customer study conducted by Altera
Memory Configurations
Logic Memory Clustering
Interconnect Patterns
Stochastic Circuit Generation A stochastic circuit generator was developed using the statistics gathered during circuit analysis The steps in generating a benchmark circuit Choosing a logical memory configuration Dividing the logical memories into clusters Choosing an interconnect pattern for each cluster Choosing the number of data-in/data-out subcircuits for each cluster Generating logic subcircuits and connecting them to the memory arrays
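The generation steps above can be sketched as a toy generator. Every distribution, size, and count below is a hypothetical placeholder; the real generator draws from the statistics gathered during circuit analysis:

```python
import random

PATTERNS = ["point-to-point", "shared-connection", "point-to-point-no-shuffle"]

def generate_benchmark(seed=0):
    """Toy sketch of the stochastic benchmark generator.

    All distributions here are made-up placeholders, not the
    measured statistics used in the paper.
    """
    rng = random.Random(seed)

    # 1. Choose a logical memory configuration (number/shape of memories).
    memories = [{"width": rng.choice([8, 16, 32]),
                 "depth": rng.choice([256, 512, 1024])}
                for _ in range(rng.randint(1, 4))]

    # 2. Divide the logical memories into clusters.
    clusters, remaining = [], list(memories)
    while remaining:
        k = rng.randint(1, len(remaining))
        clusters.append(remaining[:k])
        remaining = remaining[k:]

    # 3. Choose an interconnect pattern for each cluster.
    patterns = [rng.choice(PATTERNS) for _ in clusters]

    # 4. Choose the number of data-in/data-out subcircuits per cluster.
    subcircuits = [(rng.randint(1, 2), rng.randint(1, 2)) for _ in clusters]

    # 5. Generate logic subcircuits and connect them to the memory arrays
    #    (represented here only as an abstract logic-block count).
    logic_blocks = sum(m["width"] for m in memories) * rng.randint(2, 4)

    return {"memories": memories, "clusters": clusters,
            "patterns": patterns, "subcircuits": subcircuits,
            "logic_blocks": logic_blocks}

circuit = generate_benchmark(seed=42)
```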
Implementation Tool Each generated benchmark circuit is “implemented” in each FPGA Logical-to-physical mapping Placement Memory and logic blocks are placed simultaneously Routing Initially, nets to memories have higher priority Between iterations the nets are reordered Repeat up to 10 times, then increase W Determine the minimum value of W (the channel width)
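The routing loop above can be sketched as follows. The `route_attempt` callback is a hypothetical stand-in for one routing pass and is not part of the paper's tool; memory nets are tried first, failed nets are promoted between iterations, and W grows until the circuit routes:

```python
def minimum_channel_width(route_attempt, nets, w_start=5, w_max=60,
                          max_iters=10):
    """Sketch of the search for the minimum channel width W.

    route_attempt(order, w) stands in for one router pass and returns
    the list of nets that failed to route (empty list = success).
    """
    w = w_start
    while w <= w_max:
        # Nets connecting to memories get higher priority initially.
        order = sorted(nets, key=lambda n: not n["to_memory"])
        for _ in range(max_iters):
            failed = route_attempt(order, w)
            if not failed:
                return w  # every net routed: this W suffices
            # Reorder: failed nets are tried first on the next iteration.
            order = failed + [n for n in order if n not in failed]
        w += 1  # could not route in 10 tries: widen the channels
    raise RuntimeError("unroutable even at w_max")

# Example with a stand-in router that only succeeds once W reaches 8.
nets = [{"to_memory": True}, {"to_memory": False}]
w_min = minimum_channel_width(
    lambda order, w: [] if w >= 8 else order[:1], nets)
```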
Memory/Logic Flexibility Result
Area Result The area of the FPGA is the sum of Logic blocks Memory blocks Routing resources Programmable switch Programming bits Metal routing segments
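The area accounting above amounts to a simple sum over components. The per-component constants in this sketch are invented illustrative values, not the transistor-counted areas used in the paper:

```python
def fpga_area(n_logic, n_memory, n_switches, n_bits, n_segments,
              a_logic=1.0, a_memory=12.0, a_switch=0.1,
              a_bit=0.05, a_segment=0.02):
    """Toy area model: total FPGA area is the sum of logic-block,
    memory-block, and routing-resource areas, where routing area
    covers programmable switches, programming bits, and metal
    routing segments. All unit areas are hypothetical."""
    routing = (n_switches * a_switch
               + n_bits * a_bit
               + n_segments * a_segment)
    return n_logic * a_logic + n_memory * a_memory + routing
```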
Area Result
Delay Result A delay model is used to measure the memory read time of all memories in the circuit CACTI: estimates the array access time Elmore delay: models the address-in and data-out paths
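The Elmore portion of the delay model, for the simple case of an RC chain, can be sketched as below; the paper combines this with CACTI's array access time, which is not reproduced here:

```python
def elmore_delay_chain(rs, cs):
    """Elmore delay of an RC chain: rs[i] is the resistance feeding
    node i, cs[i] the capacitance at node i. Each resistor contributes
    R times the total downstream capacitance it must charge."""
    assert len(rs) == len(cs)
    delay = 0.0
    for i, r in enumerate(rs):
        delay += r * sum(cs[i:])
    return delay

# Two-segment chain with unit R and C per segment:
# 1*(1+1) + 1*1 = 3.0
d = elmore_delay_chain([1.0, 1.0], [1.0, 1.0])
```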
Delay Result
Issues Some nets connect more than one memory block to one or more logic blocks When small memory arrays are combined to implement a large one When the data-in pins of several user memories are driven by a common data bus Such nets appear often but are hard to route, especially in larger architectures Should a higher value of Fm be used for larger architectures, or is there a better option?
Further Investigation
Enhanced Architecture The above motivates a closer study of memory-to-memory connections An enhanced architecture Extra switches are added between memory arrays to support these nets Result The extra switches take up negligible area Both speed and routability improve
Enhanced Architecture
Baseline Architecture
Enhanced Architecture
Evaluation of Enhanced Architecture The maze routing algorithm must be restricted so that it uses memory-to-memory switches only to implement memory-to-memory connections If the maze router is not modified…
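The restriction above boils down to one extra check during wavefront expansion. The edge and net fields in this sketch are hypothetical names, not the paper's data structures:

```python
def allowed_expansion(edge, net):
    """Sketch of the restricted maze-router rule: a memory-to-memory
    switch may be taken only when the net being routed actually
    connects two memory blocks; all other routing resources remain
    usable by every net. Field names here are invented for
    illustration."""
    if edge["is_mem_to_mem_switch"]:
        return net["source_is_memory"] and net["sink_is_memory"]
    return True  # ordinary routing resources are unrestricted
```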
Routing Result Using Standard Maze
Modified Maze Although some tracks are wasted when a circuit contains few or no memory-to-memory connections, the modification alleviates the problem above
Area Result
Delay Result
Conclusion Even with this relatively unaggressive use of the memory-to-memory switches, area improves somewhat and speed improves significantly Developing algorithms that use these tracks more aggressively is left as future work The enhanced architecture reduces the channel width by 0.5 to 1 tracks and improves speed by 25%