### Student name:

Date: December 11, 2007

### **General requirements for the exam:**

- 1. This is CLOSED BOOK examination;
- 2. No questions allowed within examination period;
- 3. If something is not clear in question please, put your assumptions;
- 4. No extra papers cell-phones or programmable calculators are allowed;
- 5. For calculations or assumptions you can use reserved space in the exam paper or opposite side of each page;
- 6. It is allowed for use: Pens and pencils, erasers, simple calculators and rulers.
- 7. All explanations in boxes and tables has to be printed

## 1 <u>Section: Synthesis of Application Specific Computing Architecture</u>

Create the architecture for the application specific data-path for vector processing using the following function:  $\mathbf{y} = k^*a^2 + s^*b^2 + m^*(a + b)$ 

The data segment consists of two vectors, where each vector can contain 32, 64, 128 or 256 elements:  $A = [a_1, a_2, a_3...a_n]$  and  $B = [b_1, b_2, b_3 ...b_n]$ , where n = 32, 64, 128, 256. Each pair of elements of the above vectors:  $a_i$  and  $b_i$  for i=1, 2, 3,...n has to be processed by the given formula during the same execution cycle. Coefficients – k, s, and m are scalar (integer) data, which stay constant during the period of entire vector processing.

### Question 1.1 (5 marks)

In the Box below (Figure 1.1) draw the sequencing graph scheduled with assumption that there is no limit for any logic resources (maximum possible resources can be used and each operation can start "as soon as possible".

Figure 1.1: Sequencing Graph with unconstraint scheduled (ASAP algorithm)

### <u>EE 8208 \*Architectural Synthesis and Design of Digital Systems\*</u> <u>FINAL EXAMINATION</u>

#### Question 1.2 (5 marks)

a) Calculate the maximum number of adders and multipliers for the case of fully pipelined data-path:

Number of adders = \_\_\_\_ Number of multipliers = \_\_\_\_

b) Calculate latency and cycle time (in case of fully pipelined data path) when:

i) "Adder" requires 1 clock cycle (c.c.) for the operation and

ii) "Multiplier" takes 4 c.c. for multiplication.

Latency = \_\_\_\_\_ c.c. Cycle time = \_\_\_\_\_

Please provide time schedule in Figure 1.2 (below) according to sequencing graph in Figure 1.1. This schedule should reflect moment of start and finish of each operation for each of resources (adder- A1, A2, etc. and multipliers – M1, M2, etc.) during the time slots.

| <br>↑     |              |
|-----------|--------------|
| Resources |              |
|           |              |
|           |              |
|           |              |
|           |              |
|           |              |
|           |              |
|           | <b>&gt;</b>  |
|           | Clock cycles |

Figure 1.2: Schedule of resource usage for the first cycle (Latency) and further cycles

### Question 1.3 (5 marks)

In the Box below (Figure 1.3) draw the sequencing graph scheduled with assumption that there are two multipliers and one adder available for this function and each operation can start "as soon as possible".

### <u>EE 8208 \*Architectural Synthesis and Design of Digital Systems\*</u> <u>FINAL EXAMINATION</u>

| Time 0 | A | В |  |
|--------|---|---|--|
| Time 1 |   |   |  |
| Time 2 |   |   |  |
| Time 3 |   |   |  |
| Time4  |   |   |  |
| Time5  |   |   |  |
| Time 6 |   |   |  |

Figure 1.3: Binded sequencing graph scheduled by ASAP algorithm

### Question 1.4 (5 marks)

Calculate latency and cycle time for this data path (Figure 1.3) when: i) "Adder" requires 1 clock cycle (c.c.) for the operation and ii) "Multiplier" takes 4 c.c. for multiplication.

Latency =  $\_$  c.c. Cycle time =  $\_$  c.c.

Please provide time schedule in Figure 1.4 (below) according to sequencing graph in Figure 1.3. This schedule should reflect moment of start and finish of each operation for each of resources (adder- "A" and multipliers – M1 and M2) during the time slots.





## Question 1.5 (5 marks)

Conduct the procedure for conversion of the sequencing graph scheduled and binded according to resource constraints described in Question 1.2.

a) Fill the Table 1.1 for the Component "Adder" according to resource constraints and scheduled time slots.

## EE 8208 \*Architectural Synthesis and Design of Digital Systems\*Page4FINAL EXAMINATION4

Table 1.1

| Time Slot | Operation | Input 1 | Input 2 | Output |
|-----------|-----------|---------|---------|--------|
| 0         |           |         |         |        |
| 1         |           |         |         |        |
| 2         |           |         |         |        |
| 3         |           |         |         |        |
| 4         |           |         |         |        |
| 5         |           |         |         |        |

b) How many sources have to multiplex each of input MUX (Multiplexers) and how many outputs have to de-multiplex DeMUX of the "Adder" Component (see Table 1.1)

Number of MUX Inputs = \_\_\_\_\_ Number of DeMUX outputs = \_\_\_\_\_

## 2 Section: Optimization of Application Specific Computing Architecture

Find the optimal architecture of Application Specific Processor (ASP) for the above function:  $\mathbf{y} = k^* a^2 + s^* b^2 + m^*(a + b)$  and data structure which consists of two vectors, where each vector can contain 32, 64, 128 or 256 elements:  $\mathbf{A} = [a_1, a_2, a_3...a_n]$  and  $\mathbf{B} = [b_1, b_2, b_3 ... b_n]$ , where n = 32, 64, 128, 256. Each pair of elements of the above vectors:  $a_i$  and  $b_i$  for i=1, 2, 3, ... n has to be processed by the given formula during the same execution cycle. Coefficients -k, s, and m are scalar (integer) data, which stay constant during the period of entire vector processing.

<u>The optimization requirements</u> are as follows: ASP should process vectors A and B with 256 elements in each not longer than 3080 clock cycles taking minimum area units. <u>The resources available</u> are as follows:

**R1:** "Adder" may be used in variants: R1.1 – one adder unit or R1.2 – two units in ASP **R2:** "Multiplier" may be used as: R2.1 – one multiplier, R2.2 – two multipliers, R2.3 – four multipliers and R2.4 – five multipliers in ASP

Resource specification:

The period of "ADD" operation = 1 clock cycles

The period of "MULT" operation = 4 clock cycles

One Adder unit requires 20 area units (au) on the chip (e.g. 20 CLBs in the FPGA) One Multiplier unit requires 100 area units (au)

#### Question 2.1 (20 marks)

Provide the Mini-Max analysis of ASP architecture for the above task

**Q2.1.1:** In the box below present the scheduled sequencing graph with resource binding for the variant of architecture which employs maximum of computation resources – Vmax => 2 Adders units and 5 multipliers

### <u>EE 8208 \*Architectural Synthesis and Design of Digital Systems\*</u> <u>FINAL EXAMINATION</u>

| Time 0 | A | В |  |
|--------|---|---|--|
| Time 1 |   |   |  |
| Time 2 |   |   |  |
| Time 3 |   |   |  |
| Time4  |   |   |  |
| Time5  |   |   |  |
| Time 6 |   |   |  |

**<u>Q2.1.2</u>**: Calculate the latency of execution of first pair of vector elements:  $a_1$  and  $b_1$  and cycle time for the rest of vector elements (in clock cycles):

Latency (V max) =  $\_$  c.c.

Cycle time (V max) = \_\_\_\_\_ c.c.

**<u>02.1.3</u>**: Calculate total execution time of full data vectors (256 elements of *a* and *b*)

Total execution time (Vmax) = \_\_\_\_\_ c.c.

Show calculations here

**<u>Q2.1.4</u>**: Calculate total area to be used for the V max

Total area = \_\_\_\_\_ = \_\_\_\_ au

**Q2.1.5:** In the box below present the scheduled sequencing graph with resource binding for the variant of architecture which employs minimum of computation resources – Vmin => 1 Adder unit and 1 multiplier

5

| Time 0 | A | В |
|--------|---|---|
| Time 1 |   |   |
| Time 2 |   |   |
| Time 3 |   |   |
| Time4  |   |   |
| Time5  |   |   |
| Time 6 |   |   |
|        |   |   |
|        |   |   |
|        |   |   |

<u>Q2.1.6</u>: Calculate the latency of execution of first pair of vector elements:  $a_1$  and  $b_1$  and cycle time for the rest of vector elements (in clock cycles):

Latency (V min) =  $\_$  c.c.

Cycle time (V min) = \_\_\_\_\_ c.c.

<u>Q2.1.7</u>: Calculate total execution time of Total execution time (Vmin) =  $\_$  c.c.

Show calculations here

<u>Q2.1.8</u>: Calculate total area to be used for the V min

Total area = \_

\_\_\_ au

=

## EE 8208 \*Architectural Synthesis and Design of Digital Systems\* Page FINAL EXAMINATION Page

7

#### **Question 2.2 (25 marks)**

Conduct the analysis of Critical variants of ASP architecture for each of resources: R1 - "Adder" and R2 - "Multiplier"

**<u>Q2.2.1</u>**: Identify the critical variant of ASP for the "Adder" (R1): put number of units for the each resources in variant identification below (near "?")

 $Vcr(R1) \Rightarrow ?$  Adder units and ? Multipliers

In the box below present the scheduled sequencing graph with resource binding for the critical variant of architecture for the R1 (adder)

| Time 0 | A | В |  |
|--------|---|---|--|
| Time 1 |   |   |  |
| Time 2 |   |   |  |
| Time 3 |   |   |  |
| Time4  |   |   |  |
| Time5  |   |   |  |
| Time 6 |   |   |  |

<u>Q2.2.2</u>: Calculate the latency of execution of first pair of vector elements:  $a_1$  and  $b_1$  and cycle time for the rest of vector elements (in clock cycles):

Latency (V cr[R1]) = \_\_\_\_\_ c.c. Cycle time (V cr[R1]) = \_\_\_\_\_ c.c.

**<u>02.2.3</u>**: Calculate total execution time of full data vectors (256 elements of *a* and *b*)

Total execution time V cr[R1] = \_\_\_\_\_ c.c.

Show calculations here

## EE 8208 \*Architectural Synthesis and Design of Digital Systems\* Page FINAL EXAMINATION Page

8

**Q2.2.4:** Identify the critical variant of ASP for the "Multiplier" (R2): put number of units for the each of resources in variant identification below (near "?")

 $Vcr(R2) \Rightarrow 2$  Adder units and 2 Multipliers

In the box below present the scheduled sequencing graph with resource binding for the critical variant of architecture for the R1 (adder)

**<u>02.2.5</u>**: Calculate the latency of execution of first pair of vector elements:  $a_1$  and  $b_1$  and cycle time for the rest of vector elements (in clock cycles):

Latency (V cr[R2]) =\_\_\_\_\_c.c.

Cycle time  $(V cr[R2]) = \____ c.c.$ 

**<u>Q2.2.6</u>**: Calculate total execution time of full data vectors (256 elements of *a* and *b*)

Total execution time V cr[R2] = \_\_\_\_\_ c.c.

Show calculations here

**Question 2.3 (15 marks)** 

# EE 8208 \*Architectural Synthesis and Design of Digital Systems\* Page FINAL EXAMINATION Page

Conduct the partial arrangement of Architecture Configurations Graph (ACG) representing the design space for the above ASP in order of:
i) Decrease the total execution time from the left terminal to the right terminal of ACG;
ii) Increase the area utilized for resources from time from the left terminal to the right terminal of ACG. <u>Note</u>: The hierarchical arrangement criterion:

 $K(Ri) = [\text{Total exec. Time for Vcr}(Ri) - \text{Total exec. Time for Vmax}] / (m_i - 1),$ where  $m_i$  – number of variants of the resource Ri

**<u>0</u> 2.3.1:** Calculate arrangement criterion for "Adder" – R1

K(R1) = \_\_\_\_\_ = \_\_\_\_

**<u>Q 2.3.2:</u>** Calculate arrangement criterion for "Multiplier" – R2

K(R2) = \_\_\_\_\_ = \_\_\_\_

**<u>Q 2.3.3:</u>** In the box below present the partially arranged ACG with identification of all nodes and edges (e.g. Root  $\rightarrow Ri$ , edge - Ri, *l*, etc.)



Figure 2.1: Partially arranged ACG

**Question 2.4 (15 marks)** Select the optimal variant of ASP architecture for the given function which must: i) provide the computation of full vectors during the period not exceeding 3080 c.c. and ii) with minimum possible resources (area units)

9

# EE 8208 \*Architectural Synthesis and Design of Digital Systems\*Page10FINAL EXAMINATION10

<u>Note:</u> The arrangement of ACG by area utilized for resources is in opposite to arrangement of ACG by total execution time (e.g. variant which provides shorter execution time may require more area units – see Figure 2.1).

<u>**O**</u> 2.4.1: Provide the binary search procedure checking variants # 4 and #6 or #2, etc. to find the border variant number = terminal # on ACG in Figure 2.1. Point the border variant by \*- mark.

Optimal variant number is <u>V#</u>

Г

| Total execution time for V#4 = | $\leq$ 3080 c.c. ? Yes / No |
|--------------------------------|-----------------------------|
| Total execution time for V# =  | $\leq$ 3080 c.c. ? Yes / No |
| Total execution time for V# =  | $\leq$ 3080 c.c. ? Yes / No |

<u>**Q**</u> 2.4.2: Calculate the Total execution time for the optimal variant working with clock frequency = 100 MHz

Total execution time (ASPopt) = \_\_\_\_\_\_ uS (micro-seconds)

Show calculations here

**<u>Q2.4.3</u>**: Calculate total area to be used for the Vopt

Total area = \_\_\_\_\_ = \_\_\_\_ au