Download presentation

Presentation is loading. Please wait.

Published byColin Randell Modified over 2 years ago

1
A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu, Daming Zhang, Xinyu He, Pei Zhang, Huazhong Yang ypliu@tsinghua.edu.cn 2012-Jan-30

2
Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 2

3
Background A rapid rising demand for the high quality C language to RTL (C2RTL) tools [1] –Design gap between design productivity and transistor resources –Huge silicon capacity –Extensive use of embedded processors –Reuse of behavior IPs –Extensive adoption of accelerators –More time-to-market pressure 3

4
Background Designers have successfully developed various applications using C2RTL tools with much shorter design time.[2]-[4] –F–Focusing on Stream applications GAUT[5], ROCCC[6], Impulse C[7], ASC[8] –F–Focusing on general purpose applications Catapult C[9] Little control on the architecture leads to suboptimal results.[10] Hierarchical C2RTL framework Leaving the user to determine the FIFO capacity between modules. Algorithm for FIFO-connected blocks 4

5
To design a stream application, researchers also had investigated on FIFO sizing.[11]-[14] All of them adopts a more complicated behavior model for PE streams, which is not necessary in the hierarchical C2RTL framework. Our Hierarchical C2RTL Framework for FIFO-Connected Stream Applications: –The hierarchical implementation provides even10.43 times speedup compared to the flatten design. –We develop an algorithm to find the optimal FIFO capacity in a multiple-module design. –We demonstrate the proposed method in seven real applications. 5

6
Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 6

7
Motivation Hierarchical vs Flatten Approach –The synthesized quality for larger algorithms is generally not so good as small ones. –The translating time is unacceptable. –10.43 times performance speedup can be made by JPEG encoder case. 7 Performance with Different FIFO Capacity –FIFO capacity affect both speed and area. –Designers need to decide the FIFO size in an iterative way which is time consuming.

8
Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 8

9
Hierarchical C2RTL framework Step 1: We partition C codes into appropriate-size functions. Step 2: We use C2RTL tools to translate each function into a hardware process element (PE), which has a FIFO interface. Step 3: We connect those PEs with proper sized FIFOs. 9

10
Hierarchical C2RTL framework Step 1: −The C code partition has a great impact on the hardware performance. − A example of GSM case is shown in the figure. −Currently, we use a manual partition strategy. −An integer linear programming based partition approach is presented in [16]. 10

11
Hierarchical C2RTL framework Step 2: −We use the C2RTL tool of eXCite from Y Exploration, Inc[15] Step 3: −We define the FIFO interface and Modeling the FIFO −Define the PE interface Modeling PEs. 11

12
Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 12

13
Given a design consisting of N PEs, we need to determine the depth D (i-1)i of each FIFO, which maximizes the throughput TH all and minimizes the area A all. That is: Without losing generality, we set TH ref =TH best and A ref =∞. Algorithm for FIFO-connected blocks 13 FIFO Interconnection Formulation:

14
Algorithm for FIFO-connected blocks Optimal FIFO Capacity for Two PEs: –For PE 1 and PE 2, we assume that FIFO 01 never empty and FIFO 23 never full, and then we calculate FIFO 12 ’s capacity. 14

15
Algorithm for FIFO-connected blocks Step 1: –Find out the bottleneck –Talk about later Step 2: –FIFO sizing for two block –Optimal FIFO Capacity for Two PEs theory Step 3: –Merging two PEs into a new one –Talk about later 15 FIFO Capacity for Multiple Blocks:

16
Step 1(Find out bottleneck): −We set TH n as the throughput of the system consisting of PE 1 to PE n. We have a recursive equation As TH all =TH N, we can express TH all : Step 3(Merging two PEs): −We give out the equations to compute the interface parameters for the merged PE Algorithm for FIFO-connected blocks 16 FIFO Capacity for Multiple Blocks:

17
Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 17

18
Experiments Experimental Configurations –eXCite + ModelSim + Quartus –Seven benchmarks from real applications. They are: JPEG encode/decode AES encryption/decryption GSM ADPCM Filter Group The benefit of optimize FIFO capacity algorithm: –Area saving –Timing efficiency –The average time cost by the entire exploration time is N *log2(p) *C, using a binary searching algorithm 18

19
Experiments Flatten vs Hierarchical Approach: 19

20
Experiments Optimal FIFO Capacity: 20 −Compared our results with exhaustive simulations under random inputs. −Lists both the analytical results and the experimental ones on FIFO sizing in seven real cases.

21
Future works The automatically C code partition Using our algorithm in more complex architectures with feedback and branches 21

22
Reference [1] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, “High-level synthesis for fpgas: From prototyping to deployment,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-actions on, vol. 30, no. 4, pp. 473–491, 2011. [2] B. Schafer, A. Trambadia, and K. Wakabayashi, “Design of complex image processing systems in esl,” in Asia and South Pacific Design Automation Conference. IEEE Press, 2010, pp. 809–814. [3] Y. Guo and D. McCain, “Rapid prototyping and vlsi exploration for 3g/4g mimo wireless systems using integrated catapult-c methodology,” in Wireless Communications and Networking Conference. IEEE, 2006, pp. 958–963. [4] M. Rossler, H. Wang, U. Heinkel, N. Engin, and W. Drescher, “Rapid prototyping of a dvb-sh turbo decoder using high-level-synthesis,” in Specification & Design Languages. IEEE Forum on, 2009, pp. 1–6. [5] G. Lhairech-Lebreton, P. Coussy, and E. Martin, “Hierarchical and multiple-clock domain high-level synthesis for low-power design on fpga,” in International Conference on Field Programmable Logic and Applications. IEEE, 2010, pp. 464–468. [6] J. Villarreal, A. Park, W. Najjar, and R. Halstead, “Designing mod-ular hardware accelerators in c with roccc 2.0,” in IEEE Annual International Symposium on Field- Programmable Custom Computing Machines. IEEE, 2010, pp. 127–134. 22

23
Reference [7] M. Gokhale, J. Stone, J. Arnold, and M. Kalinowski, “Stream-oriented fpga computing in the streams-c high level language,” in Field-Programmable Custom Computing Machines. IEEE Symposium, 2000, pp. 49–56. [8] O. Mencer, “Asc: a stream compiler for computing with fpgas,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-actions on, vol. 25, no. 9, pp. 1603–1617, 2006. [9] “http://www.mentor.com.” [10] A. Agarwal, “Comparison of high level design methodologies for algorithmic ips: Bluespec and c-based synthesis,” Ph.D. dissertation, Massachusetts Institute of Technology, 2009. [11] K. Lahiri, A. Raghunathan, and S. Dey, “System-level performance anal-ysis for designing on-chip communication architectures,” in Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 20, no. 6, 2001, pp. 768– 783. [12] R. Cruz, “Quality of service guarantees in virtual circuit switched networks,” in Selected Areas in Communications, IEEE Journal on, vol. 13, no. 6, 1995, pp. 1048– 1056. [13] A. Maxiaguine, S. K ¨unzli, S. Chakraborty, and L. Thiele, “Rate analysis for streaming applications with on-chip buffer constraints,” in Asia and South Pacific Design Automation Conference. IEEE Press, 2004, pp. 131–136. 23

24
Reference [14] Y. Liu, S. Chakraborty, and R. Marculescu, “Generalized rate analysis for media- processing platforms,” in International Conference on Em-bedded and Real-Time Computing Systems and Applications, RTCSA. IEEE, 2006, pp. 305–314. [15] “http://www.yxi.com.” [16] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Partitioning of behavioral descriptions with exploiting function-level parallelism,” in Fundamentals,IEICE Trans on, vol. E93-A, 2010, pp. 488–499. [17] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, “Chstone: A benchmark program suite for practical c-based high-level synthesis,” in Circuits and Systems. IEEE International Symposium on, 2008, pp. 1192–1195. 24

25
Thank you !

Similar presentations

OK

A New Class of High Performance FFTs Dr. J. Greg Nash Centar (www.centar.net) High Performance Embedded Computing (HPEC) Workshop.

A New Class of High Performance FFTs Dr. J. Greg Nash Centar (www.centar.net) High Performance Embedded Computing (HPEC) Workshop.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google