The Step-by-Step Guide to Building an ML Accelerator

After you have read the overview of the CFU Playground components and set up your CFU-Playground, it’s time to accelerate a model. This tutorial will walk through the steps for building a basic CFU in your choice of Amaranth or Verilog.

Step 1: Make Your Project

You’ll be building an accelerator in your own project folder. Navigate to the root of your CFU-Playground directory and enter the following in your terminal:

$ cp -r proj/proj_template proj/my_first_cfu
$ cd proj/my_first_cfu

Now that you’ve made a project folder you can choose what model you’d like to accelerate. In the project folder you just made, open the Makefile and you should see something like this:

# This variable lists symbols to define to the C preprocessor
export DEFINES :=

# Uncomment this line to use software defined CFU functions in software_cfu.cc
#DEFINES += CFU_SOFTWARE_DEFINED

# Uncomment this line to skip debug code (large effect on performance)
DEFINES += NDEBUG

# Uncomment this line to skip individual profiling output (has minor effect on performance).
#DEFINES += NPROFILE

# Uncomment to include pdti8 in built binary
DEFINES += INCLUDE_MODEL_PDTI8

# Uncomment to include micro_speech in built binary
#DEFINES += INCLUDE_MODEL_MICRO_SPEECH

#...

By uncommenting lines in this Makefile you can add specific models to your build (you can also remove models by re-commenting the lines).

By default only the pdti8 model is configured as part of the build. If you can fit this model on your board (if you’re using the Arty you can), we recommend using it for this tutorial.

Step 2: Profile and Identify What to Accelerate

Now that we’ve set up our project, we should start by measuring the performance of the unmodified software, both to understand what we should accelerate and to get a baseline performance reading.

In your project folder (proj/my_first_cfu) run the following:

$ make prog  # if you're not using the arty add TARGET=<your board>
$ make load  # if you're not using the arty add TARGET=<your board>

After the build processes complete, you should see something like the following in your terminal:

CFU Playground
==============
1: TfLM Models menu
2: Functional CFU Tests
3: Project menu
4: Performance Counter Tests
5: TFLite Unit Tests
6: Benchmarks
7: Util Tests

Navigate to your model’s menu by entering 1, then 1 in the menus. Once at the model menu (shown below), press g to run the golden tests.

Tests for pdti8 model
=====================
1: Run with zeros input
2: Run with no-person input
3: Run with person input
g: Run golden tests (check for expected outputs)
x: eXit to previous menu
pdti8> g

After running the golden tests you should have some output that looks like this:

"Event","Tag","Ticks"
0,DEPTHWISE_CONV_2D,7892
1,DEPTHWISE_CONV_2D,8063
2,CONV_2D,11703
3,DEPTHWISE_CONV_2D,4089
4,CONV_2D,8264
5,DEPTHWISE_CONV_2D,8045
6,CONV_2D,13234
7,DEPTHWISE_CONV_2D,2041
8,CONV_2D,6618
9,DEPTHWISE_CONV_2D,4065
10,CONV_2D,11637
11,DEPTHWISE_CONV_2D,1011
12,CONV_2D,5955
13,DEPTHWISE_CONV_2D,1923
14,CONV_2D,11611
15,DEPTHWISE_CONV_2D,1919
16,CONV_2D,11601
17,DEPTHWISE_CONV_2D,1939
18,CONV_2D,11628
19,DEPTHWISE_CONV_2D,1943
20,CONV_2D,11619
21,DEPTHWISE_CONV_2D,1925
22,CONV_2D,11624
23,DEPTHWISE_CONV_2D,509
24,CONV_2D,5859
25,DEPTHWISE_CONV_2D,922
26,CONV_2D,11257
27,AVERAGE_POOL_2D,51
28,CONV_2D,14
29,RESHAPE,1
30,SOFTMAX,11
 Counter |  Total | Starts | Average |     Raw
---------+--------+--------+---------+--------------
    0    |     0  |     0  |   n/a   |            0
    1    |     0  |     0  |   n/a   |            0
    2    |     0  |     0  |   n/a   |            0
    3    |     0  |     0  |   n/a   |            0
    4    |     0  |     0  |   n/a   |            0
    5    |     0  |     0  |   n/a   |            0
    6    |     0  |     0  |   n/a   |            0
    7    |     0  |     0  |   n/a   |            0
   183M (   183328857) cycles total

Each comma-separated line at the top gives a TensorFlow operation and the number of “ticks” it took to complete. Each tick counted by the profiler is 1024 clock cycles. For easier analysis, you can copy and paste these values into a spreadsheet that you maintain whilst performing optimizations.

The table at the bottom shows statistics from the performance CSRs (if you enabled them); the final line shows the total number of cycles spent during inference.

Summing up the cycle counts for all the TensorFlow operations, we see that about 75% of the time is spent inside the CONV_2D operation. That seems like a good place to focus our optimization efforts.
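
To see where the time goes, you can total the ticks per operation with a few lines of Python (the ticks_per_tag helper below is our own sketch, not part of the Playground); remember that each tick is 1024 clock cycles:

```python
from collections import defaultdict

# Sum profiler ticks per TensorFlow operation. `csv_text` is the
# "Event","Tag","Ticks" block copied from the golden-test output.
def ticks_per_tag(csv_text):
    totals = defaultdict(int)
    for line in csv_text.strip().splitlines():
        _, tag, ticks = line.split(",")
        totals[tag] += int(ticks)
    return dict(totals)

# A few sample rows from the output above:
sample = """2,CONV_2D,11703
4,CONV_2D,8264
0,DEPTHWISE_CONV_2D,7892"""

totals = ticks_per_tag(sample)
grand = sum(totals.values())
for tag, ticks in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{tag:20s} {ticks * 1024:>10d} cycles  {100 * ticks / grand:5.1f}%")
```

Pasting the full table above into sample gives the roughly 75% CONV_2D share quoted here.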

To further profile CONV_2D, let’s use the performance counters inside the source code.

Inside your project folder run the following:

$ mkdir -p src/tensorflow/lite/kernels/internal/reference/integer_ops/
$ cp \
  ../../third_party/tflite-micro/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h \
  src/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h

This will create a copy of the convolution source code in your project directory. At build time your copy of the source code will replace the regular implementation.

Open the newly created copy at proj/my_first_cfu/src/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h. The pdti8 model uses the first function in this file. Locate the innermost loop of the first function; it should look something like this:

for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
  int32_t input_val = input_data[Offset(input_shape, batch, in_y,
                                          in_x, in_channel)];
  int32_t filter_val = filter_data[Offset(
      filter_shape, out_channel, filter_y, filter_x, in_channel)];
    /* ... */
  acc += filter_val * (input_val + input_offset);
}

To count how many cycles this inner loop takes you can utilize the performance counters built into the soft-CPU. Add #include "perf.h" at the top of the file and then surround the inner loop with perf functions like below:

#include "perf.h"
/* ... */
perf_enable_counter(0);
for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
  int32_t input_val = input_data[Offset(input_shape, batch, in_y,
                                          in_x, in_channel)];
  int32_t filter_val = filter_data[Offset(
      filter_shape, out_channel, filter_y, filter_x, in_channel)];
    /* ... */
  acc += filter_val * (input_val + input_offset);
}
perf_disable_counter(0);

Re-build the project (make load) and run the golden tests again; the table at the end of the terminal output should now look something like this:

 Counter |  Total | Starts | Average |     Raw
---------+--------+--------+---------+--------------
    0    |   113M | 124418 |   908   |    113064622
    1    |     0  |     0  |   n/a   |            0
    2    |     0  |     0  |   n/a   |            0
    3    |     0  |     0  |   n/a   |            0
    4    |     0  |     0  |   n/a   |            0
    5    |     0  |     0  |   n/a   |            0
    6    |     0  |     0  |   n/a   |            0
    7    |     0  |     0  |   n/a   |            0

If you don’t see any numbers in the first row of the table, add some print statements inside the function to check that you’re editing in the right place. If those print statements execute but the performance counters still don’t count, your FPGA board may be too resource-constrained to have room for those extra registers.

Not having the performance counter registers isn’t a problem: just replace perf_enable_counter(0); with printf("Entering loop at: %lu\n", perf_get_mcycle()); and perf_disable_counter(0); with printf("Exiting loop at: %lu\n", perf_get_mcycle());. perf_get_mcycle() returns the value of a free-running 32-bit counter. You’ll just need to write a script to parse the output and count how many cycles were spent in the inner loop.
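
Such a script can be quite small. Here is one possible sketch in Python (the log format matches the printf strings above; the modulo arithmetic is there because the free-running 32-bit counter wraps around):

```python
import re

# Sum cycles spent between each "Entering loop at:"/"Exiting loop at:" pair.
# Differences are taken modulo 2**32 because perf_get_mcycle() wraps.
def cycles_in_loop(log_text):
    stamps = re.findall(r"(Entering|Exiting) loop at: (\d+)", log_text)
    total = 0
    for (tag0, t0), (tag1, t1) in zip(stamps[::2], stamps[1::2]):
        assert (tag0, tag1) == ("Entering", "Exiting")
        total += (int(t1) - int(t0)) % (1 << 32)
    return total

log = """Entering loop at: 1000
Exiting loop at: 1900
Entering loop at: 4294967290
Exiting loop at: 44"""
print(cycles_in_loop(log))  # 900 + 50 (wrapped) = 950
```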

Looking at the total cycle count for CONV_2D and the cycle count in the innermost loop, approximately 83% of our time in CONV_2D is spent in that inner loop. Let’s do some optimizations!

Step 3: Software Specialization

Before we write any hardware let’s perform some simple, model-specific optimizations in software. In order to understand what optimizations we can make with our specific model, let’s print out the parameters of the CONV_2D operation. Include “playground_util/print_params.h” and add the following to the top of the function:

#include "playground_util/print_params.h"
/* ... */
inline void ConvPerChannel(
    const ConvParams& params, const int32_t* output_multiplier,
    const int32_t* output_shift, const RuntimeShape& input_shape,
    const int8_t* input_data, const RuntimeShape& filter_shape,
    const int8_t* filter_data, const RuntimeShape& bias_shape,
    const int32_t* bias_data, const RuntimeShape& output_shape,
    int8_t* output_data) {
  // Format is:
  // "padding_type", "padding_width", "padding_height", "padding_width_offset",
  // "padding_height_offset", "stride_width", "stride_height",
  // "dilation_width_factor", "dilation_height_factor", "input_offset",
  // "weights_offset", "output_offset", "output_multiplier", "output_shift",
  // "quantized_activation_min", "quantized_activation_max",
  // "input_batches", "input_height", "input_width", "input_depth",
  // "filter_output_depth", "filter_height", "filter_width", "filter_input_depth",
  // "output_batches", "output_height", "output_width", "output_depth",
  print_conv_params(params, input_shape, filter_shape, output_shape);

  /* ... */

After running the golden tests again, we can observe the following parameters are all constant:

 Const. Parameter       | Value
------------------------+-------
 stride_width           |   1
 stride_height          |   1
 dilation_width_factor  |   1
 dilation_height_factor |   1
 filter_height          |   1
 filter_width           |   1
 pad_width              |   0
 pad_height             |   0
 input_offset           |  128

By replacing all these parameters with literal values in the source code, we get the following speedup in our golden tests:

 Counter |  Total | Starts | Average |     Raw
---------+--------+--------+---------+--------------
    0    |    72M | 124418 |   577   |     71859761
    1    |     0  |     0  |   n/a   |            0
    2    |     0  |     0  |   n/a   |            0
    3    |     0  |     0  |   n/a   |            0
    4    |     0  |     0  |   n/a   |            0
    5    |     0  |     0  |   n/a   |            0
    6    |     0  |     0  |   n/a   |            0
    7    |     0  |     0  |   n/a   |            0
   136M (   135730786) cycles total

Another optimization we can do is called “loop unrolling”. Because input_depth is always a multiple of 4, we can make the innermost loop do 4 times as much work before checking the loop conditions. Implementing this change should make your innermost loop look like:

for (int in_channel = 0; in_channel < input_depth; in_channel += 4) {
  int32_t input_val = input_data[Offset(input_shape, batch, in_y,
                                          in_x, in_channel)];
  int32_t filter_val = filter_data[Offset(
      filter_shape, out_channel, filter_y, filter_x, in_channel)];
  acc += filter_val * (input_val + 128);

  input_val = input_data[Offset(input_shape, batch, in_y,
                                          in_x, in_channel + 1)];
  filter_val = filter_data[Offset(
      filter_shape, out_channel, filter_y, filter_x, in_channel + 1)];
  acc += filter_val * (input_val + 128);

  input_val = input_data[Offset(input_shape, batch, in_y,
                                          in_x, in_channel + 2)];
  filter_val = filter_data[Offset(
      filter_shape, out_channel, filter_y, filter_x, in_channel + 2)];
  acc += filter_val * (input_val + 128);

  input_val = input_data[Offset(input_shape, batch, in_y,
                                          in_x, in_channel + 3)];
  filter_val = filter_data[Offset(
      filter_shape, out_channel, filter_y, filter_x, in_channel + 3)];
  acc += filter_val * (input_val + 128);
}

After this change we get another significant speed-up in our golden tests:

 Counter |  Total | Starts | Average |     Raw
---------+--------+--------+---------+--------------
    0    |    54M | 124418 |   431   |     53743879
    1    |     0  |     0  |   n/a   |            0
    2    |     0  |     0  |   n/a   |            0
    3    |     0  |     0  |   n/a   |            0
    4    |     0  |     0  |   n/a   |            0
    5    |     0  |     0  |   n/a   |            0
    6    |     0  |     0  |   n/a   |            0
    7    |     0  |     0  |   n/a   |            0
   117M (   117297894) cycles total

Even with the simplest possible software specialization, we’ve already seen massive gains in performance. Our innermost loop is now twice as fast and the total number of cycles spent in inference has decreased by 36%.
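
These figures follow directly from the “Raw” columns of the counter tables above; a quick arithmetic check in Python:

```python
# Cycle counts copied from the profiler output above.
baseline_loop, specialized_loop = 113_064_622, 53_743_879
baseline_total, specialized_total = 183_328_857, 117_297_894

print(f"inner loop speedup: {baseline_loop / specialized_loop:.1f}x")                # ~2.1x
print(f"total cycles saved: {100 * (1 - specialized_total / baseline_total):.0f}%")  # ~36%
```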

Step 4: Simple Calculation Gateware

Now that we’ve picked off some low-hanging fruit in our software, let’s direct our attention to our hardware.

In our innermost loop you might notice we load – then multiply and accumulate – 8 different values from int8_t arrays, one byte at a time. This is wasteful: our registers are 32 bits wide and the int8_t values are already contiguous in memory. With a custom CFU we could create an instruction that performs a SIMD multiply-and-accumulate operation in one or two cycles.

The instruction will have two 32-bit inputs: a set of four bytes from input_data and a set of four bytes from filter_data. Each time the instruction is executed, an internal register will accumulate and return the running sum. We’ll also need some way to reset the internal accumulator; we can use the funct7 field of the assembly instruction for this task. A non-zero funct7 value will reset the internal accumulator to 0. A graphical representation of the inputs and output of the instruction is shown below:

              7 bits
         +--------------+
funct7 = | (bool) reset |
         +--------------+

              int8_t           int8_t           int8_t           int8_t
         +----------------+----------------+----------------+----------------+
   in0 = | input_data[0]  | input_data[1]  | input_data[2]  | input_data[3]  |
         +----------------+----------------+----------------+----------------+

              int8_t           int8_t           int8_t           int8_t
         +----------------+----------------+----------------+----------------+
   in1 = | filter_data[0] | filter_data[1] | filter_data[2] | filter_data[3] |
         +----------------+----------------+----------------+----------------+

                                        int32_t
         +-------------------------------------------------------------------+
output = | output + (input_data[0, 1, 2, 3] + 128) * filter_data[0, 1, 2, 3] |
         +-------------------------------------------------------------------+
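
Before committing this specification to gateware, it can help to pin the semantics down as a small executable reference model. The Python below is our own sketch (not part of the Playground); input_offset is fixed at 128, the constant we found during software specialization:

```python
# Reference (software) model of the SIMD multiply-and-accumulate instruction.
def unpack_int8(word):
    """Split a 32-bit word into four signed bytes, lowest byte first."""
    return [((word >> (8 * i)) & 0xFF) - 256 * ((word >> (8 * i + 7)) & 1)
            for i in range(4)]

class SimdMacModel:
    def __init__(self, input_offset=128):
        self.acc = 0
        self.input_offset = input_offset

    def execute(self, funct7, in0, in1):
        if funct7:  # non-zero funct7 resets the accumulator
            self.acc = 0
        else:
            for a, f in zip(unpack_int8(in0), unpack_int8(in1)):
                self.acc += (a + self.input_offset) * f
        return self.acc & 0xFFFFFFFF  # result as seen in a 32-bit register

m = SimdMacModel()
m.execute(1, 0, 0)                           # reset
print(m.execute(0, 0x7F7F7F7F, 0x7F7F7F7F))  # 4 * (127+128)*127 = 129540
```

A model like this doubles as a source of expected values for the hardware tests that follow.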

Now that we’ve described our CFU it’s time to actually write the gateware. If you’d like to implement your CFU directly in Verilog you can skip the upcoming section about Amaranth CFUs (and likewise if you’re going to be using Amaranth you can skip the Verilog section).

Amaranth CFU Development

There is a fairly robust framework for building a CFU in Amaranth. Inside of <CFU-Playground root>/python/amaranth_cfu there are a set of helper files that you can import in your code. It’s best to read through the doc comments in <CFU-Playground root>/python/amaranth_cfu/cfu.py and <CFU-Playground root>/python/amaranth_cfu/util.py before starting development, but you should be able to get a reasonable understanding of the framework through this example.

from amaranth import C, Module, Signal, signed
from amaranth_cfu import all_words, InstructionBase, InstructionTestBase, pack_vals, simple_cfu
import unittest


# Custom instruction inherits from the InstructionBase class.
class SimdMac(InstructionBase):
    def __init__(self, input_offset=128) -> None:
        super().__init__()

        self.input_offset = C(input_offset, signed(9))

    # `elab` method implements the logic of the instruction.
    def elab(self, m: Module) -> None:
        words = lambda s: all_words(s, 8)

        # SIMD multiply step:
        self.prods = [Signal(signed(16)) for _ in range(4)]
        for prod, w0, w1 in zip(self.prods, words(self.in0), words(self.in1)):
            m.d.comb += prod.eq(
                (w0.as_signed() + self.input_offset) * w1.as_signed())

        m.d.sync += self.done.eq(0)
        # self.start signal high for one cycle when instruction started.
        with m.If(self.start):
            with m.If(self.funct7):
                m.d.sync += self.output.eq(0)
            with m.Else():
                # Accumulate step:
                m.d.sync += self.output.eq(self.output + sum(self.prods))
            # self.done signal indicates instruction is completed.
            m.d.sync += self.done.eq(1)


# Tests for the instruction inherit from InstructionTestBase class.
class SimdMacTest(InstructionTestBase):
    def create_dut(self):
        return SimdMac()

    def test(self):
        # self.verify method steps through expected inputs and outputs.
        self.verify([
            (1, 0, 0, 0),  # reset
            (0, pack_vals(-128, 0, 0, 1), pack_vals(111, 0, 0, 1), 129 * 1),
            (0, pack_vals(0, -128, 1, 0), pack_vals(0, 52, 1, 0), 129 * 2),
            (0, pack_vals(0, 1, 0, 0), pack_vals(0, 1, 0, 0), 129 * 3),
            (0, pack_vals(1, 0, 0, 0), pack_vals(1, 0, 0, 0), 129 * 4),
            (0, pack_vals(0, 0, 0, 0), pack_vals(0, 0, 0, 0), 129 * 4),
            (0, pack_vals(0, 0, 0, 0), pack_vals(-5, 0, 0, 0), 0xffffff84),
            (1, 0, 0, 0),  # reset
            (0, pack_vals(-12, -128, -88, -128), pack_vals(-1, -7, -16,
                                                           15), 0xfffffd0c),
            (1, 0, 0, 0),  # reset
            (0, pack_vals(127, 127, 127, 127), pack_vals(127, 127, 127,
                                                         127), 129540),
            (1, 0, 0, 0),  # reset
            (0, pack_vals(127, 127, 127,
                          127), pack_vals(-128, -128, -128, -128), 0xfffe0200),
        ])


# Expose make_cfu function for cfu_gen.py
def make_cfu():
    # Associate cfu_op0 with SimdMac.
    return simple_cfu({0: SimdMac()})


# Use `../../scripts/pyrun cfu.py` to run unit tests.
if __name__ == '__main__':
    unittest.main()

This CFU implements our instruction specification whilst being easily testable and extendable.

Verilog CFU Development

Developing CFUs in Verilog is lower-level and doesn’t give you access to the nice testing features of Amaranth, but it does give you more control over the CFU. Firstly, delete the cfu.py and cfu_gen.py files from your project folder; we’ll be creating and editing a file named cfu.v directly. To add the cfu.v file to git, you’ll need to use the force option: git add -f cfu.v.

When doing CFU development with Amaranth, the CFU-CPU handshaking is implemented for you in the Cfu base class. In Verilog you will need to implement your own handshaking and for that it’s important to know the CFU module specification.

The “CFU bus” provides communication between the CPU and CFU. The CFU Bus is composed of two independent streams:

  • The CPU uses the command stream (cmd) to send operands and 10 bits of function code to the CFU, thus initiating the CFU computation.

  • The CFU uses the response stream (rsp) to return the result to the CPU. Since the responses are not tagged, they must be delivered in-order.

Each stream has two-way handshaking and backpressure (*_valid and *_ready in the diagram below). An endpoint can indicate that it cannot accept another transfer by pulling its ready signal low. A transfer takes place only when both valid from the sender and ready from the receiver are high.

Note

The data values from the CPU (cmd_function_id, cmd_inputs_0, and cmd_inputs_1) are valid ONLY during the cycle that the handshake is active (when both cmd_valid and cmd_ready are asserted). If your CFU needs to use these values in subsequent cycles, it must store them in registers.

    >--- cmd_valid --------------->
    <--- cmd_ready ---------------<
    >--- cmd_function_id[9:0] ---->
    >--- cmd_inputs_0[31:0] ------>
    >--- cmd_inputs_1[31:0] ------>
CPU                                 CFU
    <--- rsp_valid ---------------<
    >--- rsp_ready --------------->
    <--- rsp_outputs_0[31:0] -----<

With the previous specification in mind, here’s an implementation of our SIMD multiply-and-accumulate instruction in cfu.v:

module Cfu (
  input               cmd_valid,
  output              cmd_ready,
  input      [9:0]    cmd_payload_function_id,
  input      [31:0]   cmd_payload_inputs_0,
  input      [31:0]   cmd_payload_inputs_1,
  output reg          rsp_valid,
  input               rsp_ready,
  output reg [31:0]   rsp_payload_outputs_0,
  input               reset,
  input               clk
);
  localparam InputOffset = $signed(9'd128);

  // SIMD multiply step:
  wire signed [15:0] prod_0, prod_1, prod_2, prod_3;
  assign prod_0 =  ($signed(cmd_payload_inputs_0[7 : 0]) + InputOffset)
                  * $signed(cmd_payload_inputs_1[7 : 0]);
  assign prod_1 =  ($signed(cmd_payload_inputs_0[15: 8]) + InputOffset)
                  * $signed(cmd_payload_inputs_1[15: 8]);
  assign prod_2 =  ($signed(cmd_payload_inputs_0[23:16]) + InputOffset)
                  * $signed(cmd_payload_inputs_1[23:16]);
  assign prod_3 =  ($signed(cmd_payload_inputs_0[31:24]) + InputOffset)
                  * $signed(cmd_payload_inputs_1[31:24]);

  wire signed [31:0] sum_prods;
  assign sum_prods = prod_0 + prod_1 + prod_2 + prod_3;

  // Only not ready for a command when we have a response.
  assign cmd_ready = ~rsp_valid;

  always @(posedge clk) begin
    if (reset) begin
      rsp_payload_outputs_0 <= 32'b0;
      rsp_valid <= 1'b0;
    end else if (rsp_valid) begin
      // Waiting to hand off response to CPU.
      rsp_valid <= ~rsp_ready;
    end else if (cmd_valid) begin
      rsp_valid <= 1'b1;
      // Accumulate step:
      rsp_payload_outputs_0 <= |cmd_payload_function_id[9:3]
          ? 32'b0
          : rsp_payload_outputs_0 + sum_prods;
    end
  end
endmodule
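
As a sanity check on the handshaking, the always block above can be mirrored with a cycle-level Python model (our own sketch, not part of the build flow): registers update on the clock edge, and cmd_ready is combinational.

```python
# Cycle-level model of the Verilog Cfu module above.
class CfuModel:
    def __init__(self):
        self.rsp_valid = False
        self.output = 0

    @staticmethod
    def _sum_prods(in0, in1):
        def int8(w, i):
            b = (w >> (8 * i)) & 0xFF
            return b - 256 if b >= 128 else b
        return sum((int8(in0, i) + 128) * int8(in1, i) for i in range(4))

    def cmd_ready(self):
        return not self.rsp_valid  # only not ready while holding a response

    def clock(self, reset, cmd_valid, rsp_ready, function_id=0, in0=0, in1=0):
        """One rising clock edge; mirrors the always @(posedge clk) block."""
        if reset:
            self.output, self.rsp_valid = 0, False
        elif self.rsp_valid:
            self.rsp_valid = not rsp_ready  # hold response until CPU takes it
        elif cmd_valid:
            self.rsp_valid = True
            if function_id >> 3:            # funct7 != 0: reset accumulator
                self.output = 0
            else:
                self.output = (self.output + self._sum_prods(in0, in1)) & 0xFFFFFFFF

cfu = CfuModel()
cfu.clock(reset=True, cmd_valid=False, rsp_ready=False)
cfu.clock(False, True, False, in0=0x01, in1=0x02)  # MAC: (1+128)*2 = 258
assert cfu.rsp_valid and cfu.output == 258
cfu.clock(False, False, True)                      # CPU accepts the response
assert cfu.cmd_ready()
```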

Using the CFU in Software

No matter what language you chose to write your CFU in, it won’t be very useful if you don’t use it in your software!

In your project-specific conv.h file, modify the loops to utilize your CFU SIMD multiply-and-accumulate instruction:

#include "cfu.h"
/* ... */
    for (int out_channel = 0; out_channel < output_depth; ++out_channel) {
      int32_t acc = cfu_op0(/* funct7= */ 1, 0, 0); // resets acc
      for (int filter_y = 0; filter_y < 1; ++filter_y) {
        const int in_y = in_y_origin + filter_y;
        for (int filter_x = 0; filter_x < 1; ++filter_x) {
          const int in_x = in_x_origin + filter_x;

          // Zero padding by omitting the areas outside the image.
          const bool is_point_inside_image =
              (in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
              (in_y < input_height);

          if (!is_point_inside_image) {
            continue;
          }

          for (int in_channel = 0; in_channel < input_depth; in_channel += 4) {
            uint32_t input_val = *((uint32_t *)(input_data + Offset(
                input_shape, batch, in_y, in_x, in_channel)));

            uint32_t filter_val = *((uint32_t *)(filter_data + Offset(
                filter_shape, out_channel, filter_y, filter_x, in_channel)));
            acc = cfu_op0(/* funct7= */ 0, /* in0= */ input_val, /* in1= */ filter_val);
          }
        }
      }

      if (bias_data) {
        acc += bias_data[out_channel];
      }
      acc = MultiplyByQuantizedMultiplier(
          acc, output_multiplier[out_channel], output_shift[out_channel]);
      acc += output_offset;
      acc = std::max(acc, output_activation_min);
      acc = std::min(acc, output_activation_max);
      output_data[Offset(output_shape, batch, out_y, out_x, out_channel)] =
          static_cast<int8_t>(acc);
    }
/* ... */

After modifying this code, re-build the project and bitstream to test out the changes. When running the golden tests you should first make sure they all pass and then note the cycle count.

 Counter |  Total | Starts | Average |     Raw
---------+--------+--------+---------+--------------
    0    |    22M | 124418 |   178   |     22147970
    1    |     0  |     0  |   n/a   |            0
    2    |     0  |     0  |   n/a   |            0
    3    |     0  |     0  |   n/a   |            0
    4    |     0  |     0  |   n/a   |            0
    5    |     0  |     0  |   n/a   |            0
    6    |     0  |     0  |   n/a   |            0
    7    |     0  |     0  |   n/a   |            0
    86M (    85771518) cycles total

What an improvement! Compared to the unoptimized version, the innermost loop is five times faster and the total number of cycles spent in inference has decreased by 53%.

For extra analysis you can look at the build folder in your project directory. In there you can inspect disassemblies of your software to see how the addition of your CFU improved the code.

Before adding the CFU this is what the assembly of our innermost loop looked like:

# innermost loop before adding CFU
lw      a4,16(sp)
blez    a4,4005f0dc
lw      a4,24(sp)
add     a0,a4,a3
lw      a4,28(sp)
add     a2,a4,a2
lw      a4,36(sp)
add     a3,a4,a3
mv      t5,a3
lb      a4,0(a0)
lb      t0,0(a2)
lb      a3,1(a2)
addi    a4,a4,128
mul     a4,a4,t0
sw      a3,4(sp)
lb      a3,2(a2)
lb      t4,1(a0)
lb      t3,2(a0)
sw      a3,8(sp)
lw      t0,4(sp)
lb      a3,3(a0)
lb      s4,3(a2)
addi    t4,t4,128
add     a4,a4,a5
lw      a5,8(sp)
addi    t3,t3,128
mul     t4,t4,t0
addi    a3,a3,128
addi    a0,a0,4
addi    a2,a2,4
mul     t3,t3,a5
add     t4,a4,t4
mul     a3,a3,s4
add     a5,t4,t3
add     a5,a5,a3
bne     t5,a0,4005f218

After adding the CFU, the assembly has greatly shrunk in size as our CFU does the heavy lifting:

# innermost loop after adding CFU
lw        a7,4(sp)
blez      a7,4005f0dc
lw        a0,12(sp)
add       a7,a0,a5
lw        a0,16(sp)
add       a4,a0,a4
lw        a0,24(sp)
add       a5,a0,a5
lw        a0,0(a7)
lw        t1,0(a4)
cfu[0,0]  a0, a0, t1
addi      a7,a7,4
addi      a4,a4,4
bne       a5,a7,4005f200

With just simple software improvements and a tiny CFU we’ve decreased the number of cycles taken by the innermost loop from 113 million down to just 22 million!

Step 5: Next Steps

This document only briefly touched on very simple hardware and software optimizations, but there’s so much more that can be done. Possible next steps include:

  • Stronger software optimization

  • Moving entire loops from software to gateware

  • Optimization of other TensorFlow operations

  • Investigation of other models

  • Generalizing instructions so they can be used in multiple places