Signed multiplication with FPGA DSP blocks
The Lattice iCE40 FPGA
has a 16x16 DSP block that can replace a "soft" multiplier and adder
with far less logic and with a much higher fMAX. It produces a 32 bit output in a
single cycle, although as of 2021 yosys and nextpnr won't infer a DSP based multiplier,
so it is necessary to explicitly use the SB_MAC16
block.
For softcore 32-bit CPUs like Claire Wolf's picrorv32 that have a 64-bit (optionally signed) multiply instruction, it is possible to chain four of the iCE40's DSP blocks to produce a 32x32-bit to 64-bit multiplier and still meet the 25 MHz timing requirement.
DSP Overview
The DSP function usage for iCE40 devices
app note from Lattice shows the many subcomponents inside the SB_MAC16
primitive.
Externally it has four 16-bit inputs named A
, B
, C
, and D
, as well as carry-in and -out
bits, and a 32-bit output named O
.
Internally it has a 16x16 mutiplier built from two 8x8 multipliers connected to A
and B
, and two 16x16 adders
that can be connected to the outputs of the multipliers as well as C
and D
, and all of this
is configured with numerous control bits.
The mode that we're using as a building block implements a 16x16 multiplier with a 32-bit accumulator, with a 33-bit output (32-bits plus a carry). The Verilator version shows the equivilant operation that the configuration bits select:
module mult16x16(
input clk,
input [15:0] a,
input [15:0] b,
input [31:0] c,
output [32:0] prod
);
`ifdef VERILATOR
wire [32:0] ab = a * b;
assign prod = ab + c;
`else
SB_MAC16 #(
.TOPOUTPUT_SELECT(2'b00), // adder, unregistered
.TOPADDSUB_LOWERINPUT(2'b10), // multiplier hi bits
.TOPADDSUB_UPPERINPUT(1'b1), // input C
.TOPADDSUB_CARRYSELECT(2'b11), // top carry in is bottom carry out
.BOTOUTPUT_SELECT(2'b00), // adder, unregistered
.BOTADDSUB_LOWERINPUT(2'b10), // multiplier lo bits
.BOTADDSUB_UPPERINPUT(1'b1), // input D
.BOTADDSUB_CARRYSELECT(2'b00) // bottom carry in constant 0
) mult16x16_add(
.CLK(clk),
.A(a),
.B(b),
.C(c[31:16]),
.D(c[15:0]),
.O(prod[31:0]),
.CO(prod[32])
);
`endif
endmodule
While a clk
input is supplied, it is unused in this design and there are no flipflops
or registers in this module. The prod
and implicit carry outputs are continuous assignment,
so they will update continuously as the a
, b
, and c
inputs change. It is possible
to configure internal flipflops in the SB_MAC16
block, but they are not used.
Unsigned multiply
Baseline multiplier
module mult32x32unsigned_baseline(
input clk,
input [31:0] a,
input [31:0] b,
output [63:0] prod
);
assign prod = a * b;
endmodule
As a base line, the "fast multiplier" in the picorv32 does an explicit multiply and allows yosys to synthesize the 32x32-bit to 64-bit output, which uses nearly 2700 logic cells of the 5000 in the up5k.
First try
32x32-bit unsigned multiply is "easy" to derive by re-writing the expression.
For instance, if we want to compute the 32-bit multiply of two unsigned
values 0x12345678
and 0x9ABCDEF0
, we can split them into four
16-bit values, where <<
is a shift by that many bits:
0x12345678 = 0x1234 << 16 + 0x5678 << 0
0x90ABCDEF = 0x90AB << 16 + 0xCDEF << 0
The product can then be expanded into four multiplies and some number of additions:
0x12345678 * 0x90ABCDEF
= (0x1234 << 16 + 0x5678 << 0) * (0x90AB << 16 + 0xCDEF << 0)
= (0x1234 * 0x90AB) << 32
+ (0x1234 * 0xCDEF) << 16
+ (0x5678 * 0x90AB) << 16
+ (0x5678 * 0xCDEF) << 0
= 0x0A4968BC << 32
+ 0x0EA4A28C << 16
+ 0x30DD4228 << 16
+ 0x458ED208 << 0
This could be implemented in Verilog exactly like this with the mult16x16()
module:
module mult32x32unsigned_noaccumulate(
input clk,
input [31:0] a,
input [31:0] b,
output [63:0] prod
);
wire [15:0] al = a[15:0];
wire [15:0] ah = a[31:16];
wire [15:0] bl = b[15:0];
wire [15:0] bh = b[31:16];
wire [32:0] al_bl;
wire [32:0] ah_bl;
wire [32:0] al_bh;
wire [32:0] ah_bh;
mult16x16 al_bl_mult(clk, al, bl, 0, al_bl);
mult16x16 ah_bl_mult(clk, ah, bl, 0, ah_bl);
mult16x16 al_bh_mult(clk, al, bh, 0, al_bh);
mult16x16 ah_bh_mult(clk, ah, bh, 0, ah_bh);
assign prod = ah_bh[31:0] << 32
+ (al_bh[31:0] + bh_al[31:0]) << 16
+ al_bl[31:0] << 0;
endmodule
Although the need for a 64-bit ripple adder across the four partial products is a significant slow down on the fMAX for the design. It is, however, a significant improvement over the base line: only 680 logic cells are required.
Using the accumulator
We can do better, however, since the DSP also includes a 32-bit accumulate
input and can implement a "fused multiply accumulate" (FMAC) operation.
The lower 16-bits of al_bl
go directly to the output, while the upper 16-bits
can be fed in as the accumulator input to ah_bl
, and the full 32-bit output
of ah_bl
can be fed into al_bh
. Likewise the lower 16-bits of al_bh
go
directly to the output, while the upper 16-bits can be fed into the accumulator
input for ah_bh
.
The only difficulty is now we must think about where carry-out is possible
so that we can ensure that is correctly propagated. It is usually easiest to
think about what happens when both A
and B
are 0xFFFFFFFF
.
For al*bl
, even if they are both 0xFFFF
the result 0xFFFE0001
is still less
than the 32-bit output, so it can't carry.
The ah*bl + al*bl[31:16]
is 0xFFFE0001 + 0xFFFE = 0xFFFEFFFF
, so it can't
carry either.
But al*bh + (ah*bl + al*bl[31:16])
is 0xFFFE0001 + 0xFFFEFFFF = 0x1FFFD0000
,
so it is possible that there is a carry output and 17-bits need to be propagated
to the ah*bh
computation.
module mult32x32unsigned(
input clk,
input [31:0] a,
input [31:0] b,
output [63:0] prod
);
wire [15:0] al = a[15:0];
wire [15:0] ah = a[31:16];
wire [15:0] bl = b[15:0];
wire [15:0] bh = b[31:16];
wire [32:0] al_bl;
wire [32:0] ah_bl;
wire [32:0] al_bh;
wire [32:0] ah_bh;
mult16x16 al_bl_mult(clk, al, bl, 0, al_bl);
mult16x16 ah_bl_mult(clk, ah, bl, { 16'h0000, al_bl[31:16] }, ah_bl);
mult16x16 al_bh_mult(clk, al, bh, ah_bl[31:0], al_bh);
mult16x16 ah_bh_mult(clk, ah, bh, { 15'h0000, al_bh[32:16] }, ah_bh);
assign prod[63:32] = ah_bh[31:0];
assign prod[31:16] = al_bh[15:0];
assign prod[15:0] = al_bl[15:0];
endmodule
This version uses only 68 logic cells in the iCE40, a 98% reduction! In the nextpnr-ice40 --gui
output the amount of die area used by the Verilog multiplier (on the left) compared to the
DSP multiplier (on the right) is really striking. Additionally if the 32-bit counters
for a
and b
are replaced with something faster, the fMAX of the DSP system is 127 MHz
compared to 20 MHz for the Verilog multiplier.
Signed multiply
Signed multiplication is similiar, except that there is an implicit "sign extension"
on the leading bit if the operand is signed
. In the above example, if both of the
values are signed, 0x12345678
(00010010001101000101011001111000
in binary) is positive
since the leading bit is zero, while 0x90ABCDEF
(10010000101010111100110111101111
in binary)
is negative since the leading bit is 1. This means that the 64-bit multiplication is actually
0x0000000012345678
times 0xFFFFFFFF90ABCDEF
when the leading bits are "extended" to the full size.
In the signed multiplier we can ignore this since 0 times anything is still zero.
This means that the 64-bit result can be expanded into the 16-bit chunks (and ignoring ones where the resulting output is either trivially zero or shifted beyond 64-bits):
0x0000000012345678 * 0xFFFFFFFF90ABCDEF
= (0x00000000 << 32 + 0x1234 << 16 + 0x5678 << 0)
* (0xFFFFFFFF << 32 + 0x90AB << 16 + 0xCDEF << 0)
= (0xFFFFFFFF * 0x12345678) << 32
+ (0x1234 * 0x90AB) << 32
+ (0x1234 * 0xCDEF) << 16
+ (0x5678 * 0x90AB) << 16
+ (0x5678 * 0xCDEF) << 0
=
| 64| 48| 32| 16| 0 |
+ 0x12345677EDCBA988 << 32
+ 0x0A4968BC << 32
+ 0x0EA4A28C << 16
+ 0x30DD4228 << 16
+ 0x458ED208 << 0
The bottom 32-bits are unaffected by this sign extension, so they can be
computed as before, and anything over the 64-bit length can be ignored.
The other thing to note is that multiplying by 0xFFFFFFFF
is the same as negating
the value, so this can be re-written as:
+ 0x0A4968BC << 32
+ 0x0EA4A28C << 16
+ 0x30DD4228 << 16
+ 0x458ED208 << 0
- 0x12345678 << 32
Todo: figure out how to use the DSP to handle this sign extension and carry.