Andy Ye, Jonathan Rose, David Lewis

Architecture of Datapath-oriented Coarse-grain Logic and Routing for FPGAs
Andy Ye, Jonathan Rose, David Lewis
Department of Electrical and Computer Engineering, University of Toronto
{yeandy, jayar,
Good afternoon. In this presentation, I am going to present a new FPGA architecture containing coarse-grain logic and routing resources. The architecture utilizes the regularity of datapath circuits in order to reduce the implementation area of datapath circuits on FPGAs.

Outline Motivation A datapath-oriented FPGA Experimental results
Datapath regularity A datapath-oriented FPGA Architecture CAD flow Experimental results Area efficiency Conclusion Here is a brief outline of our talk. First we present the motivation of our research, namely datapath regularity. Then we present the architecture of our proposed datapath-oriented FPGA. After the architectural description, we present a set of CAD tools that we custom designed for the new architecture. Using the CAD tools, we performed a set of experiments to investigate the area efficiency of our datapath-oriented FPGA. The experimental results are presented next. And finally we conclude.

Modern FPGAs Very large logic capacities
Over 10 million equivalent logic gates Increasingly used to implement large and complex applications Arithmetic-intensive applications Central processing units Graphics accelerators Digital signal processors Internet routers Modern FPGAs have very large logic capacities. Current state-of-the-art FPGAs can have logic capacities of over 10 million equivalent logic gates [Xilinx Website]. Because of this large logic capacity, FPGAs are increasingly used to implement very large and very complex applications. One class of these applications is the arithmetic-intensive applications which, as the name suggests, contain a large number of arithmetic operations. Examples include CPUs, graphics accelerators, digital signal processors, and, to some extent, internet routers.

Datapath Circuits Arithmetic-intensive applications Datapath circuits
Implement arithmetic operations using datapath circuits Datapath circuits Consist of multiple identical logic structures called bit-slices Regularity Predictability Arithmetic operations in arithmetic-intensive applications are usually implemented through a class of circuits called datapath circuits, which typically consist of multiple identical logic structures called bit-slices. These bit-slices make datapath structures very regular and very predictable.

An Example
[Figure: four-bit ripple-carry adder built from full-adder bit-slices; inputs A0-A3 and B0-B3, outputs C0-C3, Carry In, Carry Out.]
Here is a very simple example of a datapath circuit. This is a ripple-carry adder that processes four-bit-wide data. It adds two four-bit-wide numbers, A0-A3 and B0-B3, and produces another four-bit-wide number, C0-C3, as its output. The ripple-carry adder consists of four identical bit-slices, and each bit-slice is a full adder. Each full adder adds one bit of each input together and produces one bit of the output.
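The bit-slice structure of the adder can be sketched in a few lines of Python. This is only an illustration of the regularity described above, not part of the presented tools; the function names and the least-significant-bit-first ordering are assumptions made for the example.

```python
def full_adder(a, b, carry_in):
    """One bit-slice: add two bits and a carry, return (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Chain identical full-adder bit-slices; index 0 is the least significant bit."""
    c_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bit, carry = full_adder(a, b, carry)
        c_bits.append(sum_bit)
    return c_bits, carry


def to_bits(value, width=4):
    """Split an integer into `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Reassemble an integer from bits stored least significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    c_bits, carry_out = ripple_carry_add(to_bits(9), to_bits(5))
    print(from_bits(c_bits), carry_out)
```

Every iteration of the loop runs the same `full_adder` function, which is the software analogue of four identical bit-slices connected only through the carry.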

Research Goal Design a new FPGA architecture
Utilize datapath regularity Reduce the implementation area of datapath circuits on FPGAs Implement a full set of CAD tools for the new architecture Synthesis Packing Placement Routing So the goal of our research is to design a new FPGA architecture that utilizes datapath regularity in order to reduce the implementation area of datapath circuits on FPGAs. We also want to design and implement a full suite of CAD tools to support this new datapath-oriented FPGA architecture. These tools include synthesis, packing, placement, and routing tools. All these tools should preserve datapath regularity and efficiently map datapath regularity onto the new architecture.

Key Architectural Features
A bus-oriented logic block architecture A mixture of coarse-grain routing tracks and fine-grain routing tracks Our datapath-oriented FPGA contains two key architectural features: a bus-oriented logic block architecture, and a mixture of coarse-grain and fine-grain routing tracks. Each is explained in more detail in the next few slides.

Datapath FPGA Overview
[Figure: two-dimensional array of logic blocks (L) and switch blocks (S) connected by routing channels that contain coarse-grain and fine-grain routing tracks.]
The overall structure of our datapath-oriented architecture is shown on this slide. It consists of a two-dimensional array of logic blocks. Each logic block contains several look-up tables and D-type flip-flops and has its own local routing resources.

Logic Block — Super-cluster
[Figure: a super-cluster containing Cluster 1 to Cluster 4; a cluster containing BLEs and a local routing network; a Basic Logic Element (BLE) containing a LUT, a DFF, a MUX, and configuration memory (M).]
The detailed structure of a logic block is shown here. The logic block is called a super-cluster. A super-cluster is divided into several clusters; in this case there are four clusters inside the super-cluster. Clusters are connected together by carry chains. Here we have two groups of carry chains running horizontally, connecting the four clusters together. Now we take a look at the structure of a cluster. Our cluster is very much like the clusters used in many previous studies. It consists of several Basic Logic Elements (BLEs) and a fully connected local routing network. Each BLE consists of a look-up table and a D-type flip-flop. A multiplexer controlled by a configuration memory bit selects the output of the BLE from either the LUT output or the DFF output.
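As a rough behavioural sketch of the BLE described above, the following Python class models a LUT, a DFF, and the output multiplexer. The class name, the four-input LUT size (borrowed from the 4-LUTs in the synthesis example), and the method names are assumptions for illustration, not details taken from the architecture itself.

```python
class BLE:
    """Behavioural model of a Basic Logic Element: LUT + DFF + output MUX."""

    def __init__(self, lut_bits, use_dff, k=4):
        # lut_bits is the LUT configuration: one output bit per input combination.
        if len(lut_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs 2**k configuration bits")
        self.lut_bits = list(lut_bits)
        self.use_dff = use_dff  # configuration bit driving the output MUX
        self.k = k
        self.dff_state = 0

    def lut(self, inputs):
        """Look up the LUT output; inputs[0] is the least significant index bit."""
        index = sum(bit << i for i, bit in enumerate(inputs))
        return self.lut_bits[index]

    def clock(self, inputs):
        """On a clock edge the DFF captures the current LUT output."""
        self.dff_state = self.lut(inputs)

    def output(self, inputs):
        """The MUX selects either the registered or the combinational value."""
        return self.dff_state if self.use_dff else self.lut(inputs)


if __name__ == "__main__":
    and4 = [0] * 15 + [1]  # LUT configured as a four-input AND
    ble = BLE(and4, use_dff=True)
    ble.clock([1, 1, 1, 1])
    print(ble.output([0, 0, 0, 0]))
```

With `use_dff=True` the output follows the value captured at the last clock edge; with `use_dff=False` the same BLE behaves as a purely combinational LUT.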

Datapath FPGA Overview
[Figure: array of super-clusters (L) and switch blocks (S) connected by routing channels that contain coarse-grain and fine-grain routing tracks.]
Now we return to the top view of the FPGA and take a look at the global routing resources. The global routing network consists of horizontal and vertical routing channels connected by switch boxes. Each routing channel in our FPGA contains routing tracks not only of various lengths but also of different granularities. In this example, the routing channel contains routing tracks of two granularities: coarse-grain routing tracks and fine-grain routing tracks.

Coarse-grain Routing Tracks
[Figure: a super-cluster of four clusters and a switch block, with fine-grain routing and coarse-grain routing; M marks configuration memory.]
Now we will take a look at these two types of routing tracks in detail. Shown here is a super-cluster containing four clusters. We have two kinds of routing resources: fine-grain routing tracks and coarse-grain routing tracks. The fine-grain routing tracks are very much like regular routing tracks in a regular FPGA. Each fine-grain routing track is controlled by its own set of configuration memory. The coarse-grain routing tracks are organized into groups, each containing several routing tracks. The number of tracks in a group is called the granularity of the coarse-grain routing. In this example, the granularity is four. We control a group of coarse-grain routing tracks with a single set of configuration memory. Consider the switch box, for example. For fine-grain routing, a single switch is controlled by a single bit of SRAM. For coarse-grain routing, on the other hand, a group of four switches is controlled by a single bit of SRAM. The output connection blocks are similar. For fine-grain routing tracks, each logic cluster output is independently controlled by a single bit of SRAM. For coarse-grain tracks, a group of four outputs is controlled by a single bit of SRAM. So we have individually controlled output switches for fine-grain routing and collectively controlled output switches for coarse-grain routing. Coarse-grain routing tracks are more efficient at routing bus signals, which originate from one logic cluster and terminate at another.
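The sharing of configuration memory can be made concrete with a small counting sketch. It counts SRAM bits only, under the rule stated above (one bit per fine-grain switch, one bit per group of coarse-grain switches); the function name and the example switch counts are hypothetical, and the sketch says nothing about the area of the switches themselves.

```python
def sram_bits(fine_switches, coarse_switches, granularity=4):
    """Count configuration bits for a mix of fine- and coarse-grain switches.

    Each fine-grain switch has its own SRAM bit; each group of `granularity`
    coarse-grain switches shares one SRAM bit.
    """
    if coarse_switches % granularity != 0:
        raise ValueError("coarse-grain switches must come in whole groups")
    return fine_switches + coarse_switches // granularity


if __name__ == "__main__":
    # Hypothetical switch block with 32 switches in total.
    print(sram_bits(32, 0))   # all switches fine-grain
    print(sram_bits(16, 16))  # half of the switches coarse-grain
```

In this hypothetical case, moving half of the switches to coarse-grain tracks reduces the bit count from 32 to 20, which is the kind of saving the shared configuration memory is meant to provide for bus signals.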

CAD Flow CAD flow for the datapath-oriented FPGA consists of
Synthesis Packing Placement Routing Conventional CAD flow Minimize area and delay metrics Destroy datapath regularity Having designed the datapath-oriented architecture, we also designed an automated CAD flow that maps datapath circuits onto the FPGA architecture. The CAD flow includes synthesis, packing, placement, and routing tools. We will briefly describe the functionality of these tools in the next few slides. These datapath-oriented tools are based on conventional CAD tools and, like the conventional tools, they try to minimize a set of area and delay metrics. However, unlike conventional CAD tools, they do not destroy datapath regularity.

Datapath-oriented CAD Flow
Preserve datapath regularity (bit-sliced structures) Map the preserved regularity onto the datapath-oriented FPGA architecture Maximize the utilization of coarse-grain routing tracks Minimize the implementation area of datapath structures So overall, our datapath-oriented CAD flow tries to do the following: It tries to preserve the regularity of datapath circuits through preserving bit-sliced structures of datapaths. The flow then tries to map the preserved regularity onto the datapath-oriented FPGA architecture. During the mapping process, the flow also tries to maximize the utilization of coarse-grain routing tracks in order to minimize the implementation area of regular datapath structures.

Datapath Representation
Datapath circuits are represented by netlists of datapath components (VHDL or Verilog) Datapath component library Multiplexers Adders/subtracters Shifters Comparators Registers Each component consists of identical bit-slices Our CAD flow requires datapath circuits to be represented in a special format. In this format, the input datapath circuits are represented as a netlist of predefined datapath components from a datapath component library. The library contains basic datapath building blocks, including multiplexers, adders/subtracters, shifters, comparators, and registers. Each datapath component consists of multiple identical bit-slices.

Synthesis Enhanced module compaction algorithm
Based on the Synopsys FPGA compiler Augmented with several datapath-oriented features Preserve datapath regularity by preserving bit-slice boundaries Achieve as good area results as the conventional synthesis tools For the synthesis process, we designed a special synthesis algorithm called the enhanced module compaction algorithm. The algorithm is based on the Synopsys FPGA compiler. We augmented the compiler with several datapath-oriented features. These features allow the synthesis process to preserve datapath regularity through the preservation of bit-slice boundaries and the tool achieves nearly as good area results as the conventional synthesis tools.

An Example Datapath Circuit
[Figure: four bit-slices, each a mux followed by an adder (+); bit-slice i has inputs ai, bi, ci, di and output si; sel is shared by all slices; the carry chain runs from cin to cout.]
We use an example to demonstrate the synthesis process. Here is a four-bit-wide datapath circuit consisting of four identical bit-slices. Each bit-slice consists of a two-input multiplexer, followed by a two-input AND gate, and finally a full adder. This bit-slice structure is duplicated four times to form the datapath circuit.

Synthesis
[Figure: bit-slice 0 (mux and +, with inputs a0, b0, c0, d0, sel, cin and output s0) mapped onto 4-LUTs.]
The basic idea of our synthesis process is to synthesize one bit-slice at a time. Here the multiplexer and the AND gate are mapped into a single four-input LUT, colored in blue. The full adder is implemented using two four-input LUTs, colored in silver and gray.

Synthesis
[Figure: the synthesized 4-LUT bit-slice repeated for each bit, with the carry chain ending at cout.]
Then the synthesized bit-slice is duplicated four times to recreate the original datapath.
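The idea of synthesizing one bit-slice into LUTs and then replicating it can be sketched as follows. This is an illustration only, not the enhanced module compaction algorithm. The exact wiring of the example bit-slice is an assumption: the multiplexer is taken to select a (sel = 0) or b (sel = 1), the AND gate combines the multiplexer output with c, and the full adder adds that result to d. The sum and carry functions need only three inputs, so each would occupy a four-input LUT with one input unused.

```python
def make_lut(func, k):
    """Tabulate `func` over all k-bit input combinations (argument 0 is index bit 0)."""
    table = []
    for index in range(2 ** k):
        bits = [(index >> i) & 1 for i in range(k)]
        table.append(func(*bits) & 1)
    return table


def read_lut(table, *bits):
    """Evaluate a LUT by indexing its truth table."""
    return table[sum(bit << i for i, bit in enumerate(bits))]


# One bit-slice mapped to three LUTs (the wiring is an assumption, see lead-in).
MUX_AND = make_lut(lambda a, b, c, sel: (b if sel else a) & c, 4)
SUM = make_lut(lambda x, d, cin: x ^ d ^ cin, 3)
CARRY = make_lut(lambda x, d, cin: (x & d) | (cin & (x ^ d)), 3)


def bit_slice(a, b, c, d, sel, cin):
    """Evaluate one synthesized bit-slice; returns (s, cout)."""
    x = read_lut(MUX_AND, a, b, c, sel)
    return read_lut(SUM, x, d, cin), read_lut(CARRY, x, d, cin)


def datapath(a_bits, b_bits, c_bits, d_bits, sel, cin=0):
    """Replicate the same bit-slice across the width, linked by the carry chain."""
    s_bits = []
    carry = cin
    for a, b, c, d in zip(a_bits, b_bits, c_bits, d_bits):
        s, carry = bit_slice(a, b, c, d, sel, carry)
        s_bits.append(s)
    return s_bits, carry


def to_bits(value, width=4):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```

Because the three truth tables are built once and reused for every bit position, the bit-slice boundary survives the mapping, which is the property the datapath-oriented flow relies on in later stages.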

Packing Based on the T-VPACK packing algorithm
Pack adjacent bit-slices into super-clusters Utilize carry connections in super-clusters to minimize the delay of carry chains The packing algorithm that we used is based on the T-VPACK packing algorithm. Using T-VPACK, we pack adjacent bit-slices into super-clusters. We also utilize the carry connections in each super-cluster to minimize the delay of carry chains.

An Example Four clusters per super-cluster Two BLEs per cluster
Six inputs per cluster
[Figure: a super-cluster of four clusters with two BLEs each.]
Now here is an example. We want to pack our example datapath circuit into the following super-cluster architecture. Each of our super-clusters contains four clusters. Each cluster has two BLEs and six inputs.

Packing Into Clusters
[Figure: the 4-LUTs of bit-slice 0 (inputs a0, b0, c0, sel, d0, cin; output s0) packed into the BLEs of two clusters.]
So what does our packing algorithm do? It first packs one bit-slice at a time into clusters. Here the blue four-input LUT is packed into a cluster by itself. The silver and the gray four-input LUTs are packed into a separate cluster. Notice that in the second cluster, we use its carry connection to connect the carry chain. Once all the LUTs and DFFs have been packed into clusters, we then pack clusters into super-clusters.

Packing Into Super-clusters
[Figure: two super-clusters of eight BLEs each; the top one takes inputs ai, bi, ci, and sel for bits 0 to 3; the bottom one takes d0-d3 and cin and produces s0-s3 and cout.]
When packing clusters into super-clusters, we group clusters containing identical portions of adjacent bit-slices together into super-clusters. Here two super-clusters are created. The top super-cluster implements all the blue LUTs from the four bit-slices. The bottom super-cluster implements the full adders from the four bit-slices. Notice that the connection between the top super-cluster and the bottom super-cluster forms a four-bit-wide bus. This bus can be efficiently routed through the coarse-grain routing tracks. Depending on their sources, signals a0-a3 can also form a four-bit-wide bus, which can likewise be routed through the coarse-grain routing tracks to save area. Similarly, b0-b3, c0-c3, d0-d3, and s0-s3 can all be routed through the coarse-grain routing tracks to improve area efficiency if their sources or sinks are in the same super-cluster.
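The grouping rule described above can be sketched as follows. This is a deliberately simplified illustration, not the T-VPACK-based packer: clusters are represented as hypothetical (bit index, role) pairs, where the role names the portion of the bit-slice the cluster implements, and clusters with the same role from adjacent bit-slices are placed in the same super-cluster.

```python
from collections import defaultdict


def pack_super_clusters(clusters, clusters_per_super=4):
    """Group clusters that implement the same portion of adjacent bit-slices.

    `clusters` is a list of (bit_index, role) pairs. Returns a list of
    super-clusters, each a list of at most `clusters_per_super` clusters.
    """
    by_role = defaultdict(list)
    for bit_index, role in clusters:
        by_role[role].append((bit_index, role))

    super_clusters = []
    for role in sorted(by_role):
        members = sorted(by_role[role])  # adjacent bit-slices end up side by side
        for start in range(0, len(members), clusters_per_super):
            super_clusters.append(members[start:start + clusters_per_super])
    return super_clusters


if __name__ == "__main__":
    # The example circuit: each of four bit-slices yields two clusters.
    clusters = [(i, role) for i in range(4) for role in ("mux_and", "adder")]
    for super_cluster in pack_super_clusters(clusters):
        print(super_cluster)
```

For the example circuit this produces two super-clusters, one holding the four multiplexer/AND clusters and one holding the four full-adder clusters, so the signals between them line up as a four-bit bus.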

Placement Based on the VPR placer Use simulated annealing algorithm
For super-clusters containing datapath circuits Move super-clusters only For super-clusters containing non-datapath circuits Move individual clusters Our placer is based on the VPR placer, which uses the simulated annealing algorithm to search for optimal placements. In our algorithm, during the simulated annealing process, we move whole super-clusters only if any of the super-clusters involved contains datapath circuits. However, when we exchange the positions of two super-clusters containing non-datapath circuits, we exchange the positions of two individual clusters, one from each super-cluster.
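The move rule can be sketched as a single swap function. This covers only the choice of what to swap; the cost function and the annealing acceptance test of the VPR-based placer are omitted, and the data layout (a dictionary from grid position to a super-cluster record with a `datapath` flag and a `clusters` list) is a hypothetical one chosen for the example.

```python
import random


def propose_swap(placement, pos_a, pos_b, rng):
    """Apply one placement move between two positions and report its kind.

    If either super-cluster contains datapath circuits, the whole super-clusters
    trade places. Otherwise one randomly chosen cluster from each is exchanged.
    """
    sc_a, sc_b = placement[pos_a], placement[pos_b]
    if sc_a["datapath"] or sc_b["datapath"]:
        placement[pos_a], placement[pos_b] = sc_b, sc_a
        return "super-cluster"

    i = rng.randrange(len(sc_a["clusters"]))
    j = rng.randrange(len(sc_b["clusters"]))
    sc_a["clusters"][i], sc_b["clusters"][j] = sc_b["clusters"][j], sc_a["clusters"][i]
    return "cluster"


if __name__ == "__main__":
    rng = random.Random(0)
    placement = {
        (0, 0): {"datapath": True, "clusters": ["a0", "a1"]},
        (0, 1): {"datapath": False, "clusters": ["b0", "b1"]},
        (1, 0): {"datapath": False, "clusters": ["c0", "c1"]},
    }
    print(propose_swap(placement, (0, 0), (0, 1), rng))
    print(propose_swap(placement, (0, 0), (1, 0), rng))
```

Keeping datapath super-clusters intact during moves is what preserves the bus structure created by the packer, while non-datapath logic retains the finer cluster-level freedom of a conventional placer.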

Routing Based on the VPR router Use the path finder algorithm
As much as possible Route buses through coarse-grain routing tracks Route individual signals through fine-grain routing tracks When necessary Use coarse-grain routing tracks for individual signals Use fine-grain routing tracks for buses Our routing algorithm is again based on the VPR router, which uses the path finder algorithm to connect super-clusters together. As much as possible, we route buses through coarse-grain routing tracks and non-bus signals through fine-grain routing tracks. When necessary, however, our router is capable of routing non-bus signals on coarse-grain routing tracks and, vice versa, routing buses through fine-grain routing tracks. When routing a non-bus signal on coarse-grain routing tracks, only one track in a group of coarse-grain tracks is used; all the other tracks in the group are wasted. So we want to use as few coarse-grain tracks as possible for routing non-bus signals. Similarly, when we use fine-grain routing tracks to route buses, we do not get any area savings, so routing buses on fine-grain routing tracks is also less desirable.

Area Efficiency
Benchmarks 15 datapath circuits from the Pico-java processor Architectural assumptions Four BLEs per cluster Four clusters per super-cluster Four coarse-grain tracks sharing configuration memory Logic track length of two Disjoint switch block topology Architectural variables Number of coarse-grain tracks To investigate the area efficiency of the datapath-oriented FPGA, we ran a set of experiments using fifteen datapath circuits from the Pico-java processor from Sun Microsystems. We assume that there are four BLEs per cluster and four clusters in every super-cluster. Four coarse-grain routing tracks share a single set of configuration memory. We set the logic track length to two, which means that every track spans two super-clusters before it is interrupted by a routing switch. We also assume a disjoint switch block topology for both coarse-grain and fine-grain routing tracks. For the experiment, we vary the number of coarse-grain tracks and measure the area results.

Area Efficiency
[Figure: average circuit area versus % of coarse-grain tracks (bins 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%). Left y-axis: circuit area in minimum transistor areas (x10^6), 1.40 to 1.60. Right y-axis: normalized circuit area, 90.0% to 100.0%.]
This graph shows the area results. The x-axis shows the number of coarse-grain routing tracks as a percentage of the total number of routing tracks. It is divided into eight bins. The first bin represents the architecture with no coarse-grain routing tracks. The remaining seven bins represent the architectures with 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, and 60%-70% coarse-grain routing tracks. There are two y-axes. The one on the left shows the absolute area in terms of minimum transistor areas. The one on the right shows the normalized circuit area, where the area results are normalized against the best comparable conventional FPGA architecture. Each data point shown here is the average area across all fifteen benchmark circuits. We can see that initially, as the number of coarse-grain routing tracks is increased from none to 10%-20% of the total number of tracks, the area actually gets worse. The primary reason is that we are differentiating the routing tracks into two types, coarse-grain and fine-grain. This differentiation reduces the flexibility available to the router and results in increased area. As we further increase the number of coarse-grain routing tracks beyond 10%-20% of the total tracks, the area required to implement the circuits starts to decrease. It reaches a minimum when the coarse-grain tracks make up between 40% and 50% of the total number of tracks. Then, as the percentage of coarse-grain routing tracks increases further, the area starts to increase again. This increase occurs because we now have too many coarse-grain routing tracks for the number of buses that we have to route, so non-bus signals start to be routed on coarse-grain routing tracks. The datapath-oriented architecture is always better than the conventional FPGA architecture. The best datapath-oriented architecture is 9.6% smaller than the conventional FPGA architecture.

Logic Track Length Vs. Area
Architectural assumptions Four clusters per super-cluster Four coarse-grain tracks sharing configuration memory 50% of tracks are coarse-grain tracks Disjoint switch block topology Architectural variables Number of BLEs per cluster Logic track length In the second experiment, we vary the logic track length and the number of BLEs per cluster and measure the area results. Here again, we assume four clusters per super-cluster and four coarse-grain routing tracks sharing a single set of configuration memory. We also assume a disjoint switch block topology. Finally, we assume that 50% of the tracks are coarse-grain and the other 50% are fine-grain. What we vary is the number of BLEs in a cluster and the logic track length. Again we measure area using the same benchmarks. (Didn’t search for the optimal Fc_pad)

Logic Track Length Vs. Area
[Figure: circuit area in minimum transistor areas (x10^6), 1.60 to 2.20, versus track length (1, 2, 4, 8, 16), with one line each for N = 2, N = 4, N = 8, and N = 10.]
This graph shows the results. The x-axis is the track length. The y-axis is the area measured in minimum transistor areas. There are four lines, one each for two, four, eight, and ten BLEs per cluster. We see that all cluster sizes achieve their best area results when the track length is two, and the best area result overall is achieved when there are four BLEs inside each cluster.

Conclusion Proposed a datapath-oriented FPGA architecture and its CAD tools Best area is achieved when 40% - 50% of tracks are coarse-grain routing tracks Four BLEs per cluster Logic track length of two Best area is 9.6% smaller than conventional FPGAs To conclude, in this presentation we have proposed a datapath-oriented FPGA architecture and a set of CAD tools supporting the architecture. Experimentally, we show that the best architectural configuration has 40%-50% of its tracks as coarse-grain tracks, four BLEs per cluster, and a logic track length of two. In this configuration, the area of the datapath-oriented architecture is 9.6% smaller than that of the best comparable conventional FPGA for implementing the datapath circuits in our benchmarks.
