# Using Bus-Based Connections to Improve Field-Programmable Gate-Array Density for Implementing Datapath Circuits

Andy Ye, Member, IEEE, and Jonathan Rose, Member, IEEE

Abstract-As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly being used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bitslices) which are connected together by regularly structured signals (called buses), it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multibit routing architecture, which employs busbased connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture including the most area efficient granularity values and the most area efficient proportion of bus-based connections.

*Index Terms*—Area efficiency, datapath regularity, field-programmable gate arrays (FPGAs), reconfigurable fabric, routing architecture.

## I. INTRODUCTION

TIELD-PROGRAMMABLE gate arrays (FPGAs) that process multiple bits of data at a time are an alternative architectural approach for implementing datapath circuits on reconfigurable hardware that presents new opportunities for exploiting datapath regularity. In particular, multibit processing increases the number of signals that can be grouped and routed as buses and, consequently, can be effectively used to reduce the number of programmable routing connections in an FPGA. This reduction in programmable connections can lead to substantial decreases in the number of routing switches and result in significant increases in FPGA routing density. Referred to as multibit FPGAs in this paper, the detailed implementation of these devices often consists of logic blocks that can process several bits of data at a time and routing resources that connect these logic blocks together. By processing wide data formats, these FPGAs become especially efficient at implementing large arithmetic-intensive datapath circuits including computer

A. Ye is with the Department of Electrical and Computer Engineering, Ry-

erson University, Toronto, ON M5B 2K3, Canada (e-mail: aye@ee.ryerson.ca). J. Rose is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: jayar@eecg.utoronto.ca).

Digital Object Identifier 10.1109/TVLSI.2006.876095

graphics, multimedia, digital signal processing, and Internet routing applications.

Several multibit FPGA architectures have been proposed in the past [1]–[12] with a wide range of routing architecture designs; and, in this work, we focus on the problem of incorporating bus-based connections into segmented-style routing resources [13]. In particular, we propose a specific routing architecture, called the multibit routing architecture, and empirically evaluate its area efficiency. The result obtained is highly relevant to the current FPGA research due to the fact that many state-of-the-art commercial FPGAs (including the Altera Flex, Stratix, and Cyclone series [14] and Xilinx 5200, Virtex, and Spartan families [15] of FPGAs) use similarly styled routing resources and, with their ever-increasing logic capacity, commercial FPGAs are being increasingly used to implement large datapath-intensive applications.

It is essential to have a set of automated design tools to make effective use of multibit architectures. A set of datapath-oriented computer-aided design (CAD) tools, including synthesis, packing, placement, and routing tools, have been developed at the University of Toronto; and in this paper, these tools are used to investigate the area efficiency of the multibit routing architecture by experimentally determining the best granularity values and the best amount of bus-based connections for the proposed routing architecture. Extensive research [16]-[21] has been conducted in the past in order to determine the best structures for various conventional FPGA routing architectures. These studies have shown the importance of routing architecture on the overall area-efficiency of FPGAs. None of the studies, however, have explored bus-based connections, which require the preservation of datapath regularity (all these studies use conventional CAD algorithms, which destroy the regularity of datapath circuits and essentially turn datapath into finite state machine-like netlists of "randomly" connected logic gates). In this study, a set of datapath-oriented algorithms [22]-[24] is used, which preserve a great amount of user-specified regularity. The preserved regularity, in turn, is used in the investigation of the area efficiency of the proposed routing architecture.

An early version of this paper appeared in [25]. This work augments that work with the addition of an extended analytical section and additional analysis on the experimental results. Another closely related work in [26] explores the design space of bus-based connections in a limited way by using several simplifying architectural assumptions. In this work, we perform a much more extensive search of the entire architectural design space to more accurately determine the most area efficient

Manuscript received April 1, 2005; revised January 14, 2006.

|  | Logic<br>Block                   | Vertical<br>Routing<br>Channel |
|--|----------------------------------|--------------------------------|
|  | Horizontal<br>Routing<br>Channel | Switch<br>Block                |

Fig. 1. FPGA tile.

values of several important architectural parameters including the best granularity of the bus-based connections. An analytical analysis is also performed to determine the maximum theoretical benefit of bus-based connections on FPGA routing.

The rest of this paper is organized as follows. Section II motivates the multibit routing architecture design by describing the advantages of the bus-based routing connections in implementing datapath circuits. Section III presents the multibit routing architecture in detail. Section IV presents the experimental results on the area efficiency of the proposed routing architecture, and Section V gives concluding remarks.

## II. BUS-BASED ROUTING CONNECTIONS

The multibit routing architecture that we will explore is designed as a tile-based FPGA architecture and is structurally similar to the segmented-style FPGA tiles, which were described in detail in [13]. As shown in Fig. 1, a segmented-style tile consists of a logic block, two routing channels, and a switch block. The logic block is designed as a generalized version of the logic blocks used by the Altera FLEX 8K and FLEX 10K series of FPGAs, and is assumed to contain N output pins and I input pins. The routing resources (consisting of the routing channels and the switch block), on the other hand, are similar to the Xilinx 5200, Virtex, and Spartan families of FPGAs; and each routing channel is assumed to contain W routing tracks.

Various resources in the tile are connected together through four types of programmable routing connections, including the input connections, the output connections, the full-switch block connections, and the half-switch block connections. The logic block input pins are connected to the routing tracks through the input connections; and each pin has  $\lceil F_{ci} \times W \rceil$  connections, where  $F_{ci}$  represents the fraction of tracks in a routing channel that the pin is connected to. Overall, the total number of input connections per tile  $C_{input}$  is equal to  $\lceil F_{ci} \times W \rceil \times I$ . Similarly, logic block output pins are connected to the routing tracks through the output connections; and the total number of output connections per tile  $C_{output}$  is equal to  $\lceil F_{co} \times W \rceil \times N$ , where  $F_{co}$  is equal to the fraction of tracks in a routing channel that an output pin is connected to.

Each routing track in a routing channel is composed of a series of wire segments and each segment extends across L logic blocks. As shown in Fig. 2, a wire segment drives several other wire segments at its ends and at a selected set of internal locations. The set of end connections for the segment (including the driving buffer) as shown in Fig. 2(a) are called a full-switch block connection and the total number of full-switch block connections per tile  $C_{\rm full}$  is a function of the topology of the switch blocks. For the most commonly used switch



Fig. 2. (a) Full and (b) half-switch block connections.

TABLE I PROGRAMMABLE ROUTING CONNECTIONS

| W   | C <sub>input</sub> | Coutput | C <sub>full</sub> | $C_{\rm half}$ | A <sub>FPGA</sub> | A <sub>conn</sub> | $\frac{A_{\rm conn}}{A_{\rm FPGA}}$ |
|-----|--------------------|---------|-------------------|----------------|-------------------|-------------------|-------------------------------------|
| 1   | 10                 | 4       | 2                 | 1              | 2018              | 190               | 9.4%                                |
| 2   | 10                 | 4       | 4                 | 2              | 2140              | 293               | 14%                                 |
| 4   | 20                 | 4       | 8                 | 4              | 2463              | 578               | 23%                                 |
| 6   | 30                 | 8       | 12                | 6              | 2823              | 902               | 32%                                 |
| 8   | 40                 | 8       | 16                | 8              | 3086              | 1128              | 37%                                 |
| 10  | 50                 | 12      | 20                | 10             | 3447              | 1452              | 42%                                 |
| 12  | 60                 | 12      | 24                | 12             | 3710              | 1678              | 45%                                 |
| 16  | 80                 | 16      | 32                | 16             | 4274              | 2168              | 51%                                 |
| 20  | 100                | 20      | 40                | 20             | 4898              | 2718              | 55%                                 |
| 24  | 120                | 24      | 48                | 24             | 5462              | 3207              | 59%                                 |
| 28  | 140                | 28      | 56                | 28             | 6025              | 3697              | 61%                                 |
| 32  | 160                | 32      | 64                | 32             | 6589              | 4187              | 64%                                 |
| 40  | 200                | 40      | 80                | 40             | 7777              | 5227              | 67%                                 |
| 48  | 240                | 48      | 96                | 48             | 8904              | 6206              | 70%                                 |
| 64  | 320                | 64      | 128               | 64             | 11160             | 8165              | 73%                                 |
| 80  | 400                | 80      | 160               | 80             | 13475             | 10185             | 76%                                 |
| 96  | 480                | 96      | 192               | 96             | 15730             | 12144             | 77%                                 |
| 128 | 640                | 128     | 256               | 128            | 20240             | 16062             | 79%                                 |

block topology—the disjoint topology [27]— $C_{\text{full}}$  is equal to  $4\lfloor (W/L) \rfloor$ . Each internal connection of the segment (also including the driving buffer) as shown in Fig. 2(b) is called a half-switch block connection, since it requires fewer switches to implement. The number of half-switch block connections per tile,  $C_{\text{half}}$ , also depends on the switch block topology; and, for the disjoint topology,  $C_{\text{half}}$  is equal to  $2(W - \lfloor (W/L) \rfloor)$ .

Note that  $C_{\text{input}}, C_{\text{output}}, C_{\text{full}}$ , and  $C_{\text{half}}$  are all monotonically increasing functions of W and their values as functions of W are summarized in columns 2–5 of Table I, respectively. For the table, the calculations are based on a typical FPGA tile with  $N = 4, I = 10, F_{\text{ci}} = 0.5, F_{\text{co}} = 0.25, L = 2$ , and the disjoint switch block topology.

As in previous studies [13], in this paper the active area (the area consumed by transistors) A is used to estimate the overall area consumed by various FPGA routing resources; and this area is measured as the number of minimum width transistors based on the following formula:

$$A = \sum_{\text{AllTrans.}} \left( \frac{0.5 + \text{Drive Strength of the Current Trans.}}{2 \times \text{Drive Strength of Min. Width Trans.}} \right).$$
(1)



Fig. 3. Implementation example: (a) datapath circuit and (b) conventional FPGA tile.

Based on the formula, Table I also lists the total active area consumed by the FPGA tiles and the total area consumed by their routing connections in columns 6 and 7, respectively. The total connection area as a percentage of the total FPGA area is shown in column 8. As shown, like the connection count, the connection area as a percentage of the total FPGA area increases with increasing W. For small channel widths, the programmable routing connections consist of 10% to 20% of the total FPGA area; for large channel widths, on the other hand, the connection area consists of over 70% of the total FPGA area. Most importantly, for the channel width of 20 to 40, which are the typical track counts for the given architecture parameters (i.e., the number of tracks needed to make the circuits route), the programmable routing connections consume a substantial amount (between 55% and 67%) of the total FPGA area. (Note that for the active area calculations, all transistors in the tile are properly sized using the methodology outlined in [13].)

The large amount of area consumed by the routing connections motivates the multibit routing architecture design, which uses more sparse bus-based connections in place of the denser, conventional, bit-based connections. The area advantage of the bus-based connections over bit-based connections is best illustrated through an example. Consider implementing the simple 4-bit-slice datapath circuit, shown in Fig. 3(a), using copies of a conventional FPGA tile, shown in Fig. 3(b); and also assume that each logic block (will be called the conventional logic block in the remainder of this paper) is just large enough to accommodate a single bit-slice of the circuit. Fig. 4 shows the conventional FPGA implementation and, as shown, one needs a minimum of four FPGA tiles with four routing tracks per channel in order to implement the logic and transport the input and output signals of the circuit. Overall, these four tiles consume 16 input connections, 16 output connections, 32 full-switch block connections, and 16 half-switch block connections.

On the other hand, using bus-based connections, one can create a larger FPGA tile that encompasses the entire 4-bit wide circuit by quadrupling the logic capacity of the logic block to create a *multibit logic block* (by grouping four conventional logic blocks together) and substituting each individual wire in the original tile by a 4-bit wide *routing bus*. Overall, two buses per channel are needed in order to implement the same circuit and the resulting architecture, shown in Fig. 5, contains only 8 input connections, 8 output connections, 16 full-switch block connections, and 8 half-switch block connections—which is equivalent to an overall reduction of 50% for each type of



Fig. 4. Conventional implementation.



Fig. 5. Multibit FPGA implementation.

routing connection. Note that this reduction is due to the fact that routing resources are grouped into buses and programmable connections are only provided for resources that have the same bit positions in their buses.

In general, the basic building block of an FPGA tile with bus-based routing connections is an M-bit wide multibit logic block, where M is called the *granularity* (defined as the number of conventional logic blocks that a multibit logic block contains) of the tile. This logic block should have the same logic capacity as M conventional logic blocks; and its corresponding routing resources should be constructed out of M-bit wide buses. This means that each routing channel should

TABLE II MULTIBIT ROUTING AND AREA REDUCTION

|    | Bus-I          | Based            | Bit-F                                                |                  |                           |
|----|----------------|------------------|------------------------------------------------------|------------------|---------------------------|
| М  | W <sub>B</sub> | A <sub>bus</sub> | $W = \left\lfloor W_B \times \sqrt{M} \right\rfloor$ | A <sub>bit</sub> | $\frac{A_{bus}}{A_{bit}}$ |
| 2  | 10             | 6894             | 14                                                   | 8022             | 86%                       |
| 4  | 10             | 13788            | 20                                                   | 19590            | 70%                       |
| 8  | 10             | 27578            | 28                                                   | 48203            | 57%                       |
| 12 | 10             | 41366            | 34                                                   | 83400            | 50%                       |
| 16 | 10             | 55156            | 40                                                   | 124428           | 44%                       |
| 20 | 10             | 68944            | 44                                                   | 166810           | 41%                       |
| 24 | 10             | 82732            | 48                                                   | 213703           | 39%                       |
| 28 | 10             | 96522            | 52                                                   | 265107           | 36%                       |
| 32 | 10             | 110310           | 56                                                   | 321021           | 34%                       |

contain  $W_B$  routing buses, where each bus will contain M routing tracks. The input and output pins of the multibit logic block should also be grouped into M-bit wide buses (called input buses and output buses, respectively); and all the buses are connected together in the same way that individual wires are connected together in a conventional FPGA. So in terms of bus-based connections,  $F_{ci}$  and  $F_{co}$  represent the fraction of routing buses in a routing channel that each input bus and output bus are connected to, respectively, L represents the number of multibit logic blocks that each wire segment extends across, and the switch blocks of the bus-based connections are created by applying conventional switch block topologies to routing buses instead of individual routing tracks.

Using these architecture parameters, one can generalize the above example to a set of *ideal datapath circuits*, which can be best modeled as networks of *M*-bit wide *datapath components* that are randomly connected by M-bit wide buses. (Each datapath component is defined to be M bit-slices.) In general, if a datapath component and its associated signals can be implemented by a set of conventional FPGA tiles arranged in a  $H \times H$ square with W tracks per channel, the same circuit can also be implemented by a single FPGA tile with bus-based connections. Specifically, the granularity of the tile, M should be equal to  $H^2$ , and the logic capacity of the multibit logic block should be  $H^2$  times of the logic capacity of a conventional FPGA logic block. Finally, each routing channel of the tile should contain  $W_B M$ -bit wide routing buses, where  $W_B = (W \times H/M) =$  $(W/\sqrt{M})$ . (Note that it is assumed that the formulas for M and  $W_B$  apply equally well for both the integer and the fractional values of H.)

Column 3 of Table II shows the average active area required to implement a single FPGA tile with bus-based connections for a variety of granularity values and column 5 lists the area required to implement the corresponding conventional FPGA tiles. These calculations are based on the following architectural parameters:  $F_{ci} = 0.5$ ,  $F_{co} = 0.25$ , and L = 2. We also assume disjoint switch blocks and I = 10 and N = 4 for the conventional FPGA logic blocks. Column 6 shows the area reduction of the bus-based connections over the bit-based connections. As shown, FPGA tiles with bus-based connections consume significantly less implementation area as their conventional counter parts and this area reduction increases with increasing M. For



Fig. 6. Sets of configuration memory sharing connections.

TABLE III MULTIBIT ROUTING AND AREA REDUCTION WITH CONFIGURATION MEMORY SHARING

|    | Bus-I          | Based            | Bit-F                                                   |                  |                           |
|----|----------------|------------------|---------------------------------------------------------|------------------|---------------------------|
| М  | W <sub>B</sub> | A <sub>bus</sub> | $W = \\ \left\lfloor W_B \times \sqrt{M} \right\rfloor$ | A <sub>bit</sub> | $\frac{A_{bus}}{A_{bit}}$ |
| 2  | 10             | 6402             | 14                                                      | 8022             | 80%                       |
| 4  | 10             | 12312            | 20                                                      | 19590            | 63%                       |
| 8  | 10             | 24134            | 28                                                      | 48203            | 50%                       |
| 12 | 10             | 35954            | 34                                                      | 83400            | 43%                       |
| 16 | 10             | 47776            | 40                                                      | 124428           | 38%                       |
| 20 | 10             | 59596            | 44                                                      | 166810           | 36%                       |
| 24 | 10             | 71416            | 48                                                      | 213703           | 33%                       |
| 28 | 10             | 83238            | 52                                                      | 265107           | 31%                       |
| 32 | 10             | 95058            | 56                                                      | 321021           | 30%                       |

example, for M = 2 the area reduction is around 14% and for M = 32, the area reduction can be as much as 66%. Furthermore, the area reduction can be further increased through the sharing of configuration memory among each set of bus-based connections (as shown in Fig. 6) to take advantage of the identical routing paths that signals take in buses. Assuming configuration memory sharing, Table III shows that the total area reduction can be increased to 20% for M = 2 and 70% for M = 32.

Note that the above comparison assumes the same parameter values, including  $F_{ci}$ ,  $F_{co}$ , and L, for both types of FPGA tiles and it did not experimentally search for the best parameter values for either the bit-based or the bus-based tiles. Also, the comparison only considers tiles with either purely bit-based or purely bus-based connections and the bus-based connections can lose area-efficiency when they are used to implement circuits that are less regular than the ideal datapath model (circuits containing singular signals, narrower buses, or connected signals that are from different bit positions of buses). To implement these circuits well, one needs an architecture that can efficiently accommodate irregularities as well as highly utilize datapath regularity.

To address the above issues, the remainder of this paper outlines the detailed structure of a routing architecture, called the multibit routing architecture, which encompasses both bus-based connections and conventional bit-based connections. The area efficiency of the proposed architecture is then empirically studied using a set of automated CAD tools to determine



Fig. 7. Multibit routing architecture.



Fig. 8. Input connections (M = 4).

the best granularity values and the most appropriate amount of bus-based connections that the architecture should contain. Finally, the architecture is compared against the conventional FPGA routing architectures for area efficiency based on a set of the most area-efficient parameter values that are experimentally determined for both architectures.

## **III. MULTIBIT ROUTING ARCHITECTURE**

In order to accommodate irregularity, the multibit routing architecture contains both bus-based and bit-based connections. The overall structure of the architecture is shown in Fig. 7, which consists of a two-dimensional (2-D) array of multibit logic blocks interconnected by horizontal and vertical routing channels. Each logic block contains M logic clusters, which is constructed out of a series of basic logic elements (BLEs). Each BLE consists of a lookup table (LUT) and a flip-flop. (Note that the exact structures of the BLEs and the logic clusters are defined in detail in [13] and [28].) Each routing channel contains two types of routing tracks—the bus-based tracks (grouped into M-bit wide routing buses), which only use bus-based connections, and the bit-based tracks, which always use bit-based connections.

The detailed implementations of the input connections, the output connections, and typical examples of the routing switches used in the switch blocks are shown in Figs. 8–10, respectively. Note that, except the input connections, configuration memory is shared in all bus-based connections in order to further reduce the implementation area. On the other hand, for the input connections, the most area efficient topology, which shares the input multiplexers between the two types of connections, is used instead of sharing the configuration memory.

The set of potential FPGA architectures can be an extremely large design space for any given FPGA architectural definition. For example, 19 architectural parameters are needed in order to



Fig. 9. Output connections (M = 4).



Fig. 10. Routing switches: (a) bit-based bi-directional and buffered and (b) busbased bi-directional and buffered (M=2).

completely define the multibit routing architecture and its associated multibit logic blocks. On the other hand, the comparable conventional architecture discussed in Section II can be characterized using fewer design parameters since it is completely bit-based. Nevertheless, 12 architectural parameters are still needed. (Note that each logic block of the conventional architecture is assumed to be a logic cluster.)

This combination of parameters creates a design space that is too large to be explored completely. This study uses a more intelligent exploration strategy where many of these parameters are set to known good values from previous FPGA studies. Care is also taken in the parameter selection process to ensure a fair comparison between the multibit architecture and the conventional architecture. In the remainder of this section, we first define each of the architectural parameters and then justify their settings for both the multibit and the conventional routing architectures.

#### A. Summary of Architectural Parameters

The 19 architectural parameters that completely define the multibit routing architecture are listed in Table IV. The first column shows the classification of the parameters and column 2 lists the symbol for each. As shown, the parameters are classified into four categories. First, the routing capacity category contains three members, including  $M, W_f$ , and  $W_c$ . They characterize the structure of a routing channel. The second category contains four members, including K, N, I, and  $T_p$ . Along with M, they completely define the structure of the multibit logic blocks. The third and the fourth category, the connection and switch block parameters, contain five and seven members each and define the structures of the input and output connections and the switch blocks, respectively.

The 12 parameters that describe the conventional routing architecture is a subset of the 19 parameters listed in Table IV, where column 3 of the table indicates if a given architectural parameter also can be used to describe the conventional architecture. Finally, the parameter definition is given in column 4.

TABLE IV Multibit Architectural Parameters

| Class.     | Symb.                      | Conv.<br>FPGA | Definition                                         |
|------------|----------------------------|---------------|----------------------------------------------------|
| Routing    | M                          | no            | the granularity of the architecture                |
| Capacity   | W <sub>f</sub>             | yes           | the number of bit-based tracks per channel         |
| rarameters | W <sub>c</sub>             | no            | the number of bus-based tracks per channel         |
|            | Ľ                          |               | (equal to $W_B \times M$ )                         |
| Logic      | K                          | yes           | LUT size — the number of inputs that a             |
| Block      |                            |               | LUT has                                            |
| Parameters | N                          | yes           | the number of outputs per logic cluster            |
|            |                            |               | (Each logic cluster output is directly con-        |
|            |                            |               | nected to a unique logic block output; and         |
|            |                            |               | logic block outputs are grouped into $N$ out-      |
|            |                            |               | put buses.)                                        |
|            |                            | yes           | the number of inputs per logic cluster (Each       |
|            |                            |               | logic cluster input is directly connected to a     |
|            |                            |               | unique logic block input; and logic block          |
|            |                            |               | inputs are grouped into <i>I</i> input buses)      |
|            | $T_p$                      | yes           | the topology of the physical placement of          |
| Connection | -                          | NOS           | the logic block inputs and outputs                 |
| Parameters | $F_{\rm cif}$              | yes           | block input connects to as a percentage of         |
| rarameters |                            |               | brock input connects to as a percentage of         |
|            |                            |               | W <sub>f</sub>                                     |
|            | F <sub>cic</sub>           | no            | the number of routing buses that an input          |
|            |                            |               | bus connects to as a percentage of $\frac{W_c}{M}$ |
|            | F <sub>cof</sub>           | yes           | the number of bit-based tracks that a logic        |
|            |                            |               | block output connects to as a percentage of        |
|            |                            |               | $W_{f}$                                            |
|            | F <sub>coc</sub>           | no            | the number of routing buses that an output         |
|            |                            |               | bus connects to as a percentage of $\frac{W_c}{M}$ |
|            |                            | VAC           | the topology of the physical placement of          |
|            | $T_i$                      | yes           | isolation buffers                                  |
| Switch     | Т                          | ves           | the switch block topology                          |
| Block      | 1 5                        |               | the hit flexibility of the switch blacks the       |
| Parameters | $F_{\rm sf}$               | yes           | the bit-flexibility of the switch blocks — the     |
|            |                            |               | connects to in a switch block                      |
|            | F                          | no            | the bus-flexibility of the switch blocks —         |
|            | <sup>r</sup> <sub>sc</sub> |               | the number of other buses that a routing bus       |
|            |                            |               | connects to in a switch block                      |
|            | $L_{f}$                    | yes           | the segment length of a bit-based track            |
|            | L <sub>c</sub>             | no            | the segment length of a bus-based track            |
|            | S <sub>f</sub>             | yes           | the type of bit-based routing switches that        |
|            |                            |               | connect bit-based routing tracks to each           |
|            | L                          |               | other in a switch block                            |
|            | S <sub>c</sub>             | no            | the type of bus-based routing switches that        |
|            |                            |               | connect routing buses to each other in a           |
| 1          | 1                          | 1             | ISWITCH BLOCK                                      |

#### **B.** Parameter Values

Two sets of parameter settings are used in this study with one for the investigation of the best granularity and the best proportion of bus-based connections and the other for the comparison of the area efficiency between the multibit and the conventional routing architectures. These settings are summarized in Table V, where the classification and the symbol of each parameter are listed in columns 1 and 2, respectively. As shown by column 3, all 19 parameters are involved in the investigation of granularity and proportion, where  $M, W_f$ , and  $W_c$  are the output (dependent) variables of the investigation. K (LUT size) is set to be 4,

TABLE V VALUES FOR ARCHITECTURAL PARAMETERS

| Class                      | Param.           | Granularity    | Area Efficiency           |                |  |
|----------------------------|------------------|----------------|---------------------------|----------------|--|
| Class.                     |                  | & Proportion   | Multi-Bit                 | Conventional   |  |
| Routing                    | М                | variable       | 4                         | n.a.           |  |
| Capacity<br>Decompositions | $W_{f}$          | variable       | variable                  | variable       |  |
| Farameters                 | W <sub>c</sub>   | variable       | variable                  | n.a.           |  |
| Logic                      | K                | 4              | 4                         | 4              |  |
| Block                      | Ν                | 4              | 4                         | 4              |  |
| Parameters                 | Ι                | 10             | 10                        | 10             |  |
|                            | $T_p$            | see Sec. III.B | see Sec. III.B            | see Sec. III.B |  |
| Connection                 | F <sub>cif</sub> | 0.5            | best                      | best           |  |
| Parameters                 | F <sub>cic</sub> | 0.5            | equal to $F_{\text{cif}}$ | n.a.           |  |
|                            | F <sub>cof</sub> | 0.25           | best                      | best           |  |
|                            | $F_{\rm coc}$    | 0.25           | equal to $F_{\rm coc}$    | n.a.           |  |
|                            | $T_i$            | see Sec. III.B | see Sec. III.B            | see Sec. III.B |  |
| Switch                     | $T_s$            | disjoint       | disjoint                  | disjoint       |  |
| Block                      | $F_{\rm sf}$     | 3              | 3                         | 3              |  |
| T unumotorio               | F <sub>sc</sub>  | 3              | 3                         | n.a.           |  |
|                            | $L_{f}$          | 2              | best                      | best           |  |
|                            | $L_c$            | 2              | equal to $L_f$            | n.a.           |  |
|                            | $S_{f}$          | bi-directional | bi-directional            | bi-directional |  |
|                            | J                | buffered       | buffered                  | buffered       |  |
|                            | S <sub>c</sub>   | bi-directional | bi-directional            | n.a.           |  |
|                            | L                | buffered       | buffered                  |                |  |

since it has been shown to be one of the most efficient LUT sizes [29], [30] and it is also used in many commercial FPGAs [14], [15]. N and I are set to be 4 and 10, respectively, since this combination was shown to be one of the most efficient by [28] and is used in many previous FPGA studies [13], [19], [30]–[33]. The setting of  $T_p$  and  $T_i$  are discussed in more detail later on in the section.

For the remaining parameters, both  $F_{\rm cic}$  and  $F_{\rm cif}$  are set to be 0.50 and  $F_{\rm coc}$  and  $F_{\rm cof}$  are set to be 0.25. These values were found to generate good area results by the study done in [13] for bit-based resources. The disjoint switch block topology [27] with  $F_{\rm sf}$  and  $F_{\rm sc}$  set to be three is used for both the bit-based and the bus-based connections since this is one of the most efficient and widely used topologies for many academic architectures and several commercial ones. A fully buffered global routing architecture is also assumed—all switches in the switch blocks are buffered switches—since buffered switches are widely used in many commercial FPGAs [14], [15], [34]. The track length is set to be two for both types of routing tracks since the track length of two, along with the cluster size of four, was found to generate good area results in [13] for conventional FPGAs.

Comparing the area efficiency of the multibit architecture against the conventional architecture requires the definition of two sets of independent architectural parameters. One set describes the multibit architecture and is shown in column 4. The other set describes the comparable conventional architecture and is shown in column 5. For the multibit architecture,  $W_f$  and  $W_c$  are the output (dependent) variables of the investigation. M is set to be 4, since it is shown to be one of the most area efficient granularity values by the granularity investigation. Again, K is set to be 4.



Fig. 11.  $T_P$  for N = 4 and I = 10. (a) Physical placement of ten logic block input pins or input buses. (b) Physical placement of four logic block output pins or output buses.

The rest of the parameters can be classified into two groups—one group including  $T_p, T_i, T_s, F_{sf}, F_{sc}, S_f$ , and  $S_c$ , describes the topological features of the architecture, and the other group including  $N, I, F_{cif}, F_{cic}, F_{cof}, F_{coc}, L_f$ , and  $L_c$ consists of single numerical values. The topological parameters are set to be the same values as the ones used in the investigation of granularity and proportion. Two of the numerical parameters N and I, describe the multibit logic block structure and they are also set to be the same values as the ones used to address granularity and proportion.

The remaining numerical parameters describe the multibit routing architecture. As shown in Table V, it is assumed that the architectural parameters that describe the bus-based resources are always equal to their corresponding parameters that describe the bit-based resources. (Note that this assumption allows us to effectively isolate the effects of grouping routing resources into buses from the effects of differentiating these resources.) Since the routing resources usually consume the majority of FPGA area, these bit-based parameters, including  $F_{cif}$ ,  $F_{cof}$ , and  $L_f$ , are systematically searched to ensure that the best possible area results are obtained for the multibit routing architecture. These experimentally determined values will be presented in detail in Section VI.

Similarly, the corresponding numerical parameters of the conventional architecture are also systematically searched to find a set of values that generate the best area and all other parameters are set to be the most area efficient values based on the results of the previous studies.

 $T_p$  For the conventional FPGA, the logic block inputs and outputs are assumed to be uniformly distributed around the perimeter of each logic block [13]. This distribution topology takes the advantage of the logical equivalency among the cluster inputs or outputs [13]. An example of the distribution topology is shown in Fig. 11. Here each number represents either a cluster input or a cluster output.

The multibit architecture uses a similar distribution topology. However, instead of uniformly distributing input or output pins, the input buses and the output buses are uniformly distributed. For the multibit routing architecture, each number in Fig. 11 represents an input/output bus instead of an individual input/output pin. Again, this uniform distribution topology takes the advantage of the logical equivalency among input buses or output buses.

 $T_i$  The function of the isolation buffers is to electrically isolate the routing tracks from the input connections. For the conventional routing architecture, each routing track has one isolation buffer for each logic block position that it passes [13].



Fig. 12. Isolation buffer topology: (a) Conventional routing and (b) multibit routing.

An example is shown in Fig. 12(a), where an X indicates the presence of an isolation buffer. In the figure, the conventional FPGA consists of four conventional logic blocks, three horizontal routing channels, and three vertical routing channels. There is one routing track in each channel. In total, there are 12 isolation buffers in the figure. In general, the total number of isolation buffers C in a conventional architecture can be determined by the following formula:

$$C = W \times (X \times (Y+1) + (X+1) \times Y)$$
<sup>(2)</sup>

where W is the number of routing tracks in each routing channel, X is the number of rows of clusters, and Y is the number of columns of clusters.

For the multibit routing architecture, electrically, it is also sufficient to place only one isolation buffer for every multibit logic block position that a routing track passes. However, this topology gives the multibit routing architecture an unfair area advantage since it needs only half of the isolation buffers as compared with an equivalent conventional architecture. This unfairness is illustrated by Fig. 12(b), which is a transformation of Fig. 12(a) by grouping four conventional logic blocks into a multibit logic block. As shown, only six isolation buffers are needed for the new architecture. In general, the total number of isolation buffers, C', needed in the multibit routing architecture is determined by the formula

$$C' = W \times \left( \frac{X \times (Y+1)}{\lceil \sqrt{M} \rceil} + \frac{(X+1) \times Y}{\lceil \sqrt{M} \rceil} \right)$$
$$= \frac{C}{\lceil \sqrt{M} \rceil}.$$
(3)

Since isolation buffers do not influence the overall routability of the FPGAs, extra isolation buffers are added to the multibit architecture to cancel this unfair advantage. For the multibit architecture shown in Fig. 12(b), two isolation buffers, instead of one, are counted for every multibit logic block position that a track passes. (Note that the adjusted isolation buffer placement slightly disadvantages the multibit architecture since all routing channels in Fig. 12(b) should contain two tracks.)



Fig. 13. CAD flows: (a) for the multibit architecture and (b) for the conventional architecture.

| Cinonita    | <b>#BLEs Obtained at Each Synthesis Granularity</b> |      |      |      |      |      |  |  |
|-------------|-----------------------------------------------------|------|------|------|------|------|--|--|
| Circuits    | 1                                                   | 2    | 4    | 8    | 12   | 16   |  |  |
| code_seq_dp | 362                                                 | 364  | 364  | 364  | 364  | 364  |  |  |
| dcu_dpath   | 958                                                 | 962  | 966  | 974  | 982  | 974  |  |  |
| ex_dpath    | 2823                                                | 2747 | 2649 | 2719 | 2947 | 2955 |  |  |
| exponent    | 467                                                 | 517  | 517  | 539  | 567  | 565  |  |  |
| icu_dpath   | 3254                                                | 3237 | 3245 | 3245 | 3273 | 3277 |  |  |
| imdr_dpath  | 1286                                                | 1268 | 1255 | 1286 | 1288 | 1283 |  |  |
| incmod      | 870                                                 | 862  | 867  | 940  | 948  | 1005 |  |  |
| mantissa_dp | 912                                                 | 919  | 942  | 966  | 971  | 982  |  |  |
| multmod_dp  | 1602                                                | 1636 | 1634 | 1636 | 1636 | 1636 |  |  |
| pipe_dpath  | 452                                                 | 499  | 452  | 503  | 503  | 501  |  |  |
| prils_dp    | 363                                                 | 396  | 393  | 385  | 385  | 393  |  |  |
| rsadd_dp    | 350                                                 | 314  | 313  | 305  | 305  | 305  |  |  |
| smu_dpath   | 561                                                 | 557  | 557  | 560  | 563  | 561  |  |  |
| ucode_dat   | 1264                                                | 1273 | 1304 | 1278 | 1282 | 1286 |  |  |
| ucode_reg   | 78                                                  | 80   | 82   | 86   | 86   | 94   |  |  |

TABLE VI EXPERIMENTAL CIRCUITS

## IV. EXPERIMENTAL RESULTS

The multibit routing architecture has been used to implement several benchmark circuits using the datapath-oriented CAD flow shown in Fig. 13(a). We developed a set of 15 benchmark circuits by extracting pieces from the Pico-Java processor from Sun Microsystems [35], which cover all the major datapath components of the processor. These circuits are synthesized into LUTs using the EMC datapath-oriented synthesis process as described in [22], which preserves the regularity of datapath circuits while attempting to minimize area. Table VI gives the name, size (number of BLEs) of each circuit for each synthesis granularity value (here the synthesis granularity is defined to be the maximum datapath width that is preserved by the synthesis process and is an input to the synthesis).

The synthesized circuits are then packed into multibit logic blocks using the CNG datapath-oriented packing algorithm as described in [23]. The algorithm packs adjacent bit-slices into a series of logic blocks and the packed circuits are then placed using a placement algorithm modified from the VPR placer [13], as described in [24]. This datapath-oriented placer moves each multibit logic block as a single unit if it contains adjacent bitslices. Otherwise, each logic cluster is optimized individually. The placed circuits are then routed using the CGR datapath-oriented router as described in [24], which is modified for the efficient use of bus-based routing resources. Using a set of specially designed cost functions, the router tries to balance the use of the bit-based and the bus-based routing resources based on routing congestion and the goal of timing optimization.

Fig. 13(b) shows the CAD flow used for implementing the same circuits on the comparable conventional routing architecture. For this flow the best available flat synthesis results are used instead of the regularity preserving datapath synthesis. The T–VPack algorithm is then used for packing and the VPR tools [13] are used for placement and routing.

The area results are measured at the end of the CAD flows. For each multibit architecture, a set of experiments was performed by repeatedly invoking the CGR router over a range of values for  $W_c$  and  $W_f$ . For each invocation a fixed value of  $W_c$ is first chosen in increments of M starting from 0 bus-based tracks per channel. Then, the router is instructed to search for the minimum number of additional bit-based routing tracks,  $W_f$ , that is needed in order to successfully route each circuit. The resulting architectures are then classified into fixed percentile ranges based on the percentage of bus-based routing tracks in the routing channels.

Within each percentile range, the minimum area obtainable for each circuit is first recorded and then averaged across the benchmarks using arithmetic averaging. Note that since arithmetic averaging is used, the results presented here contain a higher percentage of contribution from the larger benchmark circuits and the results on granularity, proportion, and area are presented in turn.

# A. Effect of Granularity on Area

Fig. 14 plots the average area required to implement the benchmark circuits against the granularity value M for the multibit architecture. Here the granularity is shown on the x-axis and the area is shown on the y-axis. The plot contains two curves. The top curve represents the best area obtainable by a set of multibit architectures that contain no bus-based tracks. The bottom curve represents the most area efficient multibit architectures containing bus-based tracks. The percentile ranges of bus-based tracks used by the architectures are labeled beside each data point on the bottom curve.

As shown, for all granularity values, the bus-based routing tracks can be used to increase the area efficiency of the multibit architecture. For example, when M is equal to 2, the best architecture with bus-based tracks is 5.6% smaller than the architecture with no bus-based tracks and when M is equal to 4, the best architecture with bus-based tracks is 11% smaller. Overall, the most area efficient multibit architectures have a granularity value of 4 and have a bus-based track count that is between 50% and 60% of the total number of routing tracks per channel. This is significantly different from the best granularity values for



Fig. 14. Area as a function of bus-width (granularity).



Fig. 15. Proportion of signals in M-bit wide buses.

implementing the ideal datapath model, where, as discussed in Section II, higher granularity values always result in higher area efficiency. The difference is due to the fact that, unlike the ideal model, a significant percentage of signals in the benchmark circuits cannot be grouped into M-bit wide buses and, as shown by Fig. 15, with increasing M the percentage of signals that can be grouped into M-bit wide buses decreases nearly monotonically across all benchmark circuits.

## B. Effect of Proportion of Buses on Area

Fig. 16 plots the average area required to implement the benchmark circuits against the percentile ranges for the bus-based routing tracks. In the figure, the *x*-axis represents the percentage of bus-based tracks per channel and there are eight percentile ranges on the axis, including (0%, 0%], (0%, 10%], (10%, 20%], (20%, 30%], (30%, 40%], (40%, 50%], (50%, 60%], and (60%, 70%). The *y*-axis represents the area. There are five curves in the figure. Each curve represents a



Fig. 16. Area as a function of the proportion of bus-based routing tracks.

set of multibit architectures with a fixed granularity value. An "\*" marks the location of the minimum area on each curve. As shown, the most area efficient proportion of bus-based tracks remains within the range of 30%–60% for all granularity values.

The best proportion values also closely correlate to P, the percentage of signals that can be grouped into M-bit wide buses after packing, since these signals can be most efficiently routed through the bus-based routing tracks. In the figure, the percentile range that P resides in is marked by an "o" for each granularity value and the graph shows that, with the exception of M = 16, the most area efficient percentile range is always less than 1 percentile range away from P.

### C. Multibit Versus Conventional Routing

To compare the multibit routing architecture against the conventional architecture, two sets of experiments were performed. The first set determines the most area efficient values for the numerical parameters described in Section III. These values are then used in the second set of experiments to measure area.

The best values for three numerical architectural parameters, including  $F_{\rm cof}$ ,  $F_{\rm cif}$ , and  $L_f$ , were systematically searched using a divide-and-conquer approach to reduce the number of searches required to explore the three-dimensional (3-D) design space. To further reduce the number of searches, only the multibit architectures that contain no bus-based tracks are used for  $F_{\rm cof}$  and  $F_{\rm cif}$ . It is assumed that the results are equally applicable across all variations of the multibit architectures and each is described in turn.

 $F_{\rm cof}$  As it is described in [13], for the conventional FPGA architecture, the most area efficient values for  $F_{\rm cof}$  are equal to (1/N). Using the same set of experiments, it is found that the same formula applies to the multibit architecture containing no bus-based routing tracks.

 $F_{\rm cif}$  Fig. 17 plots routing area against  $F_{\rm cif}$  for the multibit routing architecture. As shown, the best routing area occurs



Fig. 17. Routing area as a function of input flexibility— $F_{cif}$ .



Fig. 18. Area as a function of segment length— $L_f$ .

when  $F_{\text{cif}}$  is equal to 0.4. Interestingly, the best  $F_{\text{cif}}$  value for the conventional routing architecture is also equal to 0.4.

 $L_f$  Fig. 18 is a plot of the average area required to implement the benchmark circuits versus the track length  $L_f$ . It is assumed that 50% of the tracks in the multibit architecture are bus-based and  $L_c$  is always equal to  $L_f$ . In the figure, the x-axis represents  $L_f$ , which ranges from 1 to 16. The y-axis represents the area. There are four curves in the figure and each represents a unique cluster size N, including 2, 4, 8, and 10. For these cluster sizes, I is set to be 4, 10, 18, and 22, respectively. These values of I are shown to generate good area results for their corresponding cluster sizes [13]. As shown, the cluster size of 4 and the track length of 2 are the best architectural choices for the multibit routing architecture. Furthermore, the track length of 2 is always the most area efficient value across all cluster sizes. Finally, using the same experiment, the best  $L_f$  value for the conventional routing architecture is determined also to be 2.



Fig. 19. Multibit and conventional implementation area.

Fig. 19 plots the total area versus the percentage of tracks that are bus-based for the multibit architecture. As shown, when there are only a small percentage of bus-based routing tracks (0%-20%), the implementation area of the benchmarks actually increases with the increasing bus-based track count; and there are two main causes of this increase. First, when there are only a few bus-based tracks, not all logic block input pins can be connected to all logic block output pins through these tracks. This limitation dramatically reduces the usefulness of the bus-based connections, hence, resulting in the increases in area. A secondary cause is that when the bus-based tracks are added to the routing fabric, routing resources are differentiated into two types—the bus-based connections and the bit-based connections. This differentiation reduces the routing flexibility of the routing fabric and also accounts for the rise in area.

As the number of bus-based tracks is increased to the 20%-30% range, enough logic block pins can be connected to each other through the bus-based connections, and the benefit of bus-based routing starts to outweigh the decreases in flexibility. As a result, the total area required to implement the benchmark circuits decreases until it reaches the minimum at the 40%-50% range. When the number of bus-based tracks is further increased, the number of bus-based connections provided by the architecture starts to exceed what is actually required by the circuits. As a result, the router is forced to use bus-based tracks to route individual signals, which reduces the area efficiency of the multibit architecture past the 50% point.

Overall, the best area for the multibit architecture is achieved when bus-based tracks account for 40% to 50% of the total number of routing tracks. At this point, the benchmark circuits use 6% less area as compared to the multibit architectures that contain no bus-based tracks. It is interesting to note that even though 90% of the LUTs in the benchmark circuits belong to four-bit wide datapath components, only 40% to 50% of the tracks should be bus-based. This is because many datapath components are not only connected by buses but also by a substantial amount of non-bus control signals (only 48 % of the signals in the benchmarks can be grouped into 4-bit wide buses) and as a

|             | Commentional | Multi-Bit FPGA |                       |  |  |
|-------------|--------------|----------------|-----------------------|--|--|
| Circuits    | FPGA         | 0% Bus-Based   | 40%–50% Bus-<br>Based |  |  |
| code_seq_dp | 3.22         | 2.66           | 2.96                  |  |  |
| dcu_dpath   | 9.02         | 8.81           | 7.24                  |  |  |
| ex_dpath    | 38.5         | 34.6           | 26.9                  |  |  |
| exponent    | 4.79         | 5.40           | 5.30                  |  |  |
| icu_dpath   | 42.0         | 40.7           | 34.2                  |  |  |
| imdr_dpath  | 14.5         | 12.9           | 11.8                  |  |  |
| incmod      | 6.39         | 7.27           | 7.73                  |  |  |
| mantissa_dp | 11.4         | 11.6           | 10.5                  |  |  |
| multmod_dp  | 15.6         | 14.7           | 17.9                  |  |  |
| pipe_dpath  | 3.35         | 2.71           | 3.03                  |  |  |
| prils_dp    | 3.07         | 2.54           | 2.91                  |  |  |
| rsadd_dp    | 2.60         | 2.09           | 2.14                  |  |  |
| smu_dpath   | 4.36         | 4.16           | 4.26                  |  |  |
| ucode_dat   | 13.3         | 11.6           | 10.5                  |  |  |
| ucode_reg   | 0.668        | 0.565          | 0.633                 |  |  |

TABLE VIIPER CIRCUIT ROUTING AREA (10E5)

result, even highly regular circuits still might need a substantial amount of bit-based tracks.

Compared to the conventional architecture, all multibit architectural variations performed better. Even with no bus-based routing tracks, the multibit architecture is 3.6% smaller due to the more efficient datapath-oriented placement and routing. Overall, the best multibit architecture is 9.6% smaller than the best standard architecture, which represents a routing area reduction of over 14%. Note that the theoretical maximum area saving as predicted in Table III is 37% for M = 4. This area reduction, however, assumes 100% bus-based routing tracks. For the percentile range of 40%–50%, the theoretical maximum area saving is between 15% to 19%. In comparison, our experimental results have achieved between 50% to 64% of this theoretical maximum.

Finally, Table VII lists the per-circuit-behavior of the benchmark circuits. As shown, for the 40% to 50% percentile range, 12 circuits (representing 79% of the total benchmark area) require less area to implement when compared with the conventional FPGA architecture. When compared with multibit FPGA architectures that contain no bus-based tracks, 7 out of 15 circuits, representing 72% of the total benchmark area, require less area to implement.

# V. CONCLUSION

This paper has explored the relationship between the busbased connections and the area efficiency of FPGAs. To this end, an analytical analysis of the bus-based connections is first presented, which assumes an ideal datapath model. To account for irregularity that is typically present in realistic datapath circuits, the paper then proposes a routing architecture, called the multibit routing architecture, that incorporates both bus-based connections and bit-based connections. A set of experiments was then performed using the architecture and a set of datapath-oriented CAD tools. The principle conclusion of this empirical study are that the granularity value of 4 gives the best area result for the multibit routing architecture. Furthermore, to achieve the best area, 40% to 50% of the total number of routing tracks should be bus-based despite the fact that, in the benchmark circuits, over 90% of LUTs are in regular datapath components. Finally, the best multibit architecture is 9.6% smaller than the best conventional architecture, which represents an overall routing area savings of over 14%.

#### REFERENCES

- D. Chen and J. Rabaey, "A reconfigurable multiprocessor IC for rapid prototyping of algorithmic-specific high-speed DSP data paths," *IEEE* J. Solid-State Circuits, vol. 27, no. 12, pp. 1895–1904, Dec. 1992.
- [2] A. Yeung and J. Rabaey, "A reconfigurable data driven multi-processor architecture for rapid prototyping of high throughput DSP algorithms," in *Proc. hawaii Int. Conf. Syst. Sci.*, 1993, pp. 169–178.
- [3] R. Bittner and P. Athanas, "Wormhole run-time reconfiguration," in *Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays*, 1997, pp. 79–85.
- [4] D. Cherepacha and D. Lewis, "DP-FPGA: An FPGA architecture optimized for datapaths," J. VLSI Des., vol. 4, no. 4, pp. 329–343, 1996.
- [5] C. Ebeling, "RaPiD—Reconfigurable pipelined datapath," in Proc. Int. Workshop Field-Programmable Logic Appl., 1996, pp. 237–241.
- [6] J. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in *Proc. IEEE Symp. Field-Programmable Custom Comput. Mach.*, 1997, pp. 24–33.
- [7] E. Waingold, "Baring It all to software: Raw machines," *IEEE Computer*, vol. 30, no. 9, pp. 86–93, Sep. 1997.
- [8] T. Miyamori and K. Olukotun, "A quantitative analysis of reconfigurable coprocessors for multimedia applications," in *Proc. IEEE Symp. Field-Programmable Custom Comput. Mach.*, 1998, pp. 2–11.
- [9] A. Marshall, "A reconfigurable arithmetic array for multimedia applications," in *Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, 1999, pp. 135–143.
- [10] A. Alsolaim, "Architecture and application of a dynamically reconfigurable hardware array for future mobile communication systems," in *Proc. IEEE Symp. Field-Programmable Custom Comput. Mach.*, 2000, pp. 205–214.
- [11] S. Goldstein, "PipeRench: A reconfigurable architecture and compiler," *IEEE Computer*, vol. 33, no. 4, pp. 70–77, Apr. 2000.
- [12] K. Leijten-Nowak and J. van Meerbergen, "An FPGA architecture with enhanced datapath functionality," in *Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays*, 2003, pp. 195–204.
- [13] V. Betz, J. Rose, and A. Marquardt, *Architecture and CAD for Deep-Submicron FPGAs.* Norwell, MA: Kluwer, 1999.
- [14] "Altera Documentation Library," Altera Corporation, San Jose, CA, 2004 [Online]. Available: http://www.altera.com
- [15] "Xilinx Data Sheets," Xilinx Inc., San Jose, CA, 2004 [Online]. Available: http://www.xilinx.com
- [16] J. Rose and S. Brown, "Flexibility of interconnection structures for field-programmable gate arrays," *IEEE J. Solid-State Circuits*, vol. 26, no. 3, pp. 277–282, Mar. 1991.
- [17] S. Brown, G. Lemieux, and M. Khellah, "Segmented routing for speed-performance and routability in field-programmable gate arrays," *J. VLSI Des.*, vol. 4, no. 4, pp. 275–291, 1996.
- [18] V. Betz and J. Rose, "Effect of the prefabricated routing track distribution on FPGA area-efficiency," *IEEE Trans. Very Large Scale Integr.* (VLSI) Syst., vol. 6, no. 3, pp. 445–456, Sep. 1998.
- [19] V. Betz and J. Rose, "FPGA routing architecture: Segmentation and buffering to optimize speed and density," in *Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays*, 1999, pp. 59–68.
- [20] M. Sheng and J. Rose, "Mixing buffers and pass transistors in FPGA routing architectures," in *Proc. ACM/SIGDA Int. Symp. Field Pro*grammable Gate Arrays, 2001, pp. 75–84.
- [21] G. Lemieux and D. Lewis, "Circuit design of FPGA routing switches," in *Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays*, 2002, pp. 19–28.
- [22] A. Ye, J. Rose, and D. Lewis, "Synthesizing datapath circuits for FPGAs with emphasis on area minimization," in *Proc. Int. Conf. Field-Programmable Technol.*, 2002, pp. 219–227.
- [23] A. Ye and J. Rose, "Using multi-bit logic blocks and automated packing to improve field-programmable gate array density for implementing datapath circuits," in *Proc. Int. Conf. Field-Programmable Technol.*, 2004, pp. 129–136.

- [24] A. Ye, "Field-Programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits," Ph.D. thesis, Univ. Toronto, Dept. Elect. Comput. Eng., Univ. Toronto, ON, Canada, 2004 [Online]. Available: (http://www.eecg.toronto.edu/~jayar/pubs/ theses/Ye/ AndyYe.pdf)
- [25] A. Ye and J. Rose, "Using bus-based connections to improve field-programmable gate array density for implementing datapath circuits," in *Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays*, 2005, pp. 3–13.
- [26] A. Ye, J. Rose, and D. Lewis, "Architecture of datapath-oriented coarse-grain logic and routing for FPGAs," in *Proc. IEEE Custom Integr. Circuits Conf.*, 2003, pp. 61–64.
- [27] H. Hseih, "Third-generation architecture boosts speed and density of field-programmable gate arrays," in *Proc. IEEE Custom Integr. Circuits Conf.*, 1990, pp. 31.2.1–31.2.7.
- [28] V. Betz and J. Rose, "How much logic should go in an FPGA logic block?," *IEEE Des. Test Comput. Mag.*, vol. 15, no. 1, pp. 10–15, Spring 1998.
- [29] J. Rose, R. Francis, D. Lewis, and P. Chow, "Architecture of field-programmable gate arrays: The effect of logic block functionality on area efficiency," *IEEE J. Solid-State Circuits*, vol. 25, no. 5, pp. 1217–1225, Oct. 1990.
- [30] E. Ahmed and J. Rose, "The Effect of LUT and cluster size on deepsubmicron FPGA performance and density," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 12, no. 3, pp. 288–298, Mar. 2004.
- [31] G. Lemieux and D. Lewis, "Using sparse crossbars within LUT clusters," in Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays, 2001, pp. 59–68.
- [32] A. Marquardt, V. Betz, and J. Rose, "Timing-driven placement for FPGAs," in *Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays*, 2000, pp. 203–213.
- [33] A. Marquardt, V. Betz, and J. Rose, "Speed and area tradeoffs in cluster-based FPGA architectures," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 8, no. 1, pp. 84–93, Feb. 2000.
- [34] P. Leventis, "Cyclone: A low-cost, high-performance FPGA," in Proc. IEEE Custom Integr. Circuits Conf., 2003, pp. 49–52.
- [35] Pico-Java Processor Design Documentation, Sun Microsystems, 1999.



Andy Ye (S'97–M'06) received the B.A.Sc., M.A.Sc., and Ph.D. degrees in computer engineering from the University of Toronto, Toronto, ON, Canada, in 1996, 1999, and 2004, respectively. He graduated first in class in the Engineering Science Program in 1996.

From 1999 to 2000, he participated in the development of the Ultragizmo board for the University of Toronto Undergraduate Microprocessor Laboratory. Currently, he is an Assistant Professor in the Department of Electrical and Computer Engineering at

Ryerson University, Toronto. His research interests include field-programmable gate array (FPGA) architectures, computer-aided design (CAD) tools for FPGAs, logic synthesis, and hardware implementation of computer graphics algorithms.



**Jonathan Rose** (M'80) received the Ph.D. degree in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 1986.

From 1986 to 1989, he was a Post-Doctoral Scholar and a Research Associate in the Computer Systems Laboratory at Stanford University, Stanford, CA. In 1989, he joined the faculty of the University of Toronto. From 1989 to 1999, he was an NSERC University Research Fellow. He spent 1995–1996 as a Senior Research Scientist at Xilinx, San Jose, CA, working on the Virtex FPGA architecture. In October

1998, he co-founded Right Track CAD Corporation, which delivered architecture for FPGAs and packing, placement, and routing software for FPGAs to FPGA device vendors. He was President and CEO of Right Track until May 2000. Right Track was purchased by Altera, and became part of the Altera Toronto Technology Centre, where he was Senior Director until April 2003. His group at Altera Toronto shared responsibility for the development of the architecture for the Altera Stratix, Stratix II, Stratix GX and Cyclone FPGAs. His group was also responsible for placement, routing, delay annotation software and benchmarking for these devices, and for the placement and routing software for the Altera Apex 20K and Flex 10K FPGAs. From May 2003 to April 2004, he held the part-time position of Senior Research Scientist at Altera Toronto. He has worked for Bell-Northern Research and a number of FPGA companies on a consulting basis. Currently, he is a Professor and Chair of the Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto. He is the co-founder of the ACM FPGA Symposium and remains part of that Symposium on its steering committee. His research covers all aspects of FPGAs including their architecture, computer-aided design (CAD), field-programmable systems, soft processors, and graphics, vision and bio-informatic applications of programmable hardware.