Presentation is loading. Please wait.

Presentation is loading. Please wait.

Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu.

Similar presentations


Presentation on theme: "Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu."— Presentation transcript:

1 Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu He, Pei Zhang, and Huazhong Yang 2013/01/23

2 Outline Introduction Overview MILP-Based Solution Heuristic Solution Experimental Evaluation Conclusions and Future work 2

3 Well-developed software libraries Low speed, high power Introduction: Background 3 Application complexity increasing MPSoCs architecture Hardware design complexity How to rapidly design hardware from existing software algorithms?

4 Introduction: Background (cont’d) This challenge is now new. However, –Ever increasing design gap –Progression of EDA tools 4 How to rapidly design hardware from existing software algorithms? C-to-RTL synthesis can help to tackle this challenge [29]

5 A 3g/4g MIMO wireless systems [7] Introduction: Motivation C2RTL tools are promising –A number of C2RTL tools –A lot of successful stories 5 …… A DVB-SH Turbo Decoder [8] 10 days 3x speedup A face detection system [9] Very complicated design

6 Introduction: Motivation (cont’d) However, state-of-the-art C2RTL tools suffer from: –Low Quality of results (QoR) for large C programs –System-level optimization options are limited 6 A Reed-Solomon decoding [28] A JPEG encoder [10] Flatten approach Hierarchical approach speedup Cycles42,475,2024,070,60310.43x Clock69.7474.21.06x Control and optimization at the system level are needed

7 Outline Introduction Overview MILP-Based Solution Heuristic Solution Experimental Evaluation Conclusions and Future work 7

8 Overview: Our work Given –a large C program for a streaming application –system constraints (latency, area, …) Determine –how to partition the code into pipelined blocks –which blocks should be parallelized The objectives –Improve synthesis result quality –Provide more system-level optimization options 8 Partition Parallelization

9 Overview: Design flow STEP 1: –We use eXCite here STEP 2: –Determine partition and parallelization STEP 3: –Synthesize each block with a C2RTL tool STEP 4: –Construct the complete system 9

10 Overview: An example 10 Given a C program: –In the straight-line style Given constraints: –System throughput and area Partition: –Which functions should be synthesized together as one pipeline stage Parallelization: –Which synthesized modules should be parallelized

11 Overview: Challenges The design space is large: –Partition has a great impact on throughput and area –Parallelization has a great impact on throughput and area –The Pareto optimal solutions The importance to simultaneously consider partition and parallelization: –The constraints are for the system after both partition and parallelization –If optimizing them separately, it is not clear how to apply the constraints to each problem individually 11 A GSM case

12 Overview: Related work 12 A somewhat related line of work is mapping C programs to MPSoCs (software mapping): –Blocks (or tasks) can be assigned to the same processor –The processor area is given ApplicationInputTargetPartitionParallelization A. Hagiescu and et al., in DAC2009[11] StreamStreamITMSoPCManuallyHeuristic J. Cong and et al., in DATE2012[12] StreamCFPGAManuallyILP Y. Liu and et al., in Intech Book[13] StreamCFPGAManuallyHeuristic Y. Hara and et al., in IEICE[14] GeneralCFPGAILPN/A This workStreamCFPGA Both MILP and Heuristic (consider simultaneously)

13 Overview: Our Contribution A novel MILP based formulation –Find a partition and parallelization solution with maximum throughput or minimum area while satisfying a given area or throughput constraint, respectively An efficient heuristic algorithm –Overcome the scalability challenge facing the MILP formulation Validation of the proposed methods –Developing FPGA based accelerators for seven streaming applications 13

14 Outline Introduction Overview MILP-Based Solution Heuristic Solution Experimental Evaluation Conclusions and Future work 14

15 MILP-Based Solution: Formulation Given function parameters (Para) –Area, throughput … of each function Determine (x n ) –Which functions should be clustered to form blocks –Which blocks should be parallelized Objective: –min. Area (a all (x n,Para) ) or max. Throughput (r all (x n,Para)) Subject to: –Area constraints (a all <A req ) –Throughput constraints (r all >R req ) –Connectivity constraints 15

16 MILP-Based Solution: Variable We use {x n } ∈ Z to represent partition and parallelization: –Partition: If x n =0: F n and F n+1 are in the same block –Parallelization: If x n ≠0: The parallelism degree of block with F n is x n We also use {y i,j } ∈ Binary to represent partition –y i,j =1 means F i,F i+1 …F j are clustered x n 16

17 MILP-Based Solution: Details To calculate throughput r all (x n,Para): To calculate area a all (x n,Para): Connectivity constraints: 17 (1) (2) (3) (4) (5)(6)

18 Outline Introduction Overview MILP-Based Solution Heuristic Solution Experimental Evaluation Conclusions and Future work 18

19 Heuristic Solution: Overview 19 Motivation: –MILP is not scalable –Bad feasible regions may incur long running time even when N is small Consider partition and oarallelization separately (constructive algorithm): –Parallelization before partition to increase throughput: Incx() –Partition for the given parallelization to reduce area: Clust() –Implement Incx() and Clust() in a backtracking iterative way

20 Heuristic Solution: Algorithm 20 Incx(): Parallelization before Partition to increase throughput Clust(): Partition for the given Parallelization to reduce area

21 Heuristic Solution: Algorithm (cont’d) 21 Incx(), Parallelization before Partition: –Increase the parallelization degree of the bottleneck function Clust(), Partition under the given Parallelization: –Model the blocks and their connections as a graph –Convert the problem to a shortest path problem

22 Outline Introduction Overview MILP-Based Solution Heuristic Solution Experimental Evaluation Conclusions and Future work 22

23 Experiments: Set up 23 7 Benchmark [21]: –ADPCM –JPEG encoder/decoder –AES encryption/decryption –GSM –Filter Groups Environment & flow: –C2RTL: eXCite –Logic synthesis: Quartus II (cyclone II) –Simulation: Modelsim

24 Experiments: Validate proposed method 24 Min. area for GSM case Both MILP and Heuristic are effective –Heuristic solutions differ from the MILP results by 2.3% on average

25 25 Min. Area for 7 benchmarks –Heuristic with a difference of 7.5% on average Solutions are effective with different applications Exp.: Validate proposed method (cont’d)

26 Experiments: Running time 26 Running time: –The heuristic solutions are worse by 7.2% on average 350-15000x shorter time than the MILP

27 Outline Introduction Overview MILP-Based Solution Heuristic Solution Experimental Evaluation Conclusions and Future work 27

28 Conclusions and Future work 28 Conclusions : –Our work adopts a hierarchical framework with automatic C-code partition and block-level parallelization –Both an MILP-based solution and a heuristic solution are proposed –Experimental results obtained from seven real applications show that our approaches are effective Future work: –Extend the solution to C program with feedback –Taking power into consideration

29 Reference 29 [1]-[27] is listed in the paper [28] “Comparison of high level design methodologies for algorithmic ips: Bluespec and c-based synthesis,” Ph.D. dissertation, MIT, 2009 [29] “ITRS roadmap on Design” 2011 Edition

30 THANK YOU! 30

31 MILP-Based Solution: Linearization Linearize x j y i,j : z i,j =x j y i,j Linearize Equation (1): Linearize Equation (2): Linearize Equation (4): 31

32 Exp.: Validate proposed method (cont’d) 32 Min. area or Max. throughput for GSM Solutions are effective with various settings


Download ppt "Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu."

Similar presentations


Ads by Google