[Lab 2] Failure rate of my master unit scales with buffer size

[Lab 2] Failure rate of my master unit scales with buffer size

by Alexandre Chau -
Number of replies: 1

Hi everyone,

I've been debugging my master unit for the last few days and I'm slowly losing my sanity. Anyway I am currently stuck on a weird bug where my master unit always works for a buffer of length 1, works exactly twice then gets stuck for a length of 2 or 3, works exactly once for a length of 4, and is always stuck for a bigger length >= 5. By stuck, I mean that the interrupt is never raised thus my program waits indefinitely. I would appreciate it if someone could have a fresh look on my design or have any tips, which may help me pinpoint the root cause.

Here's a detailed explanation of my design: my master unit is an Avalon slave / master module which performs the byte swap and bits mirror operation in place on a contiguous array in memory. It must be first setup by the Nios II CPU through the following slave register map:

  • begin_address : register to save the buffer start address inclusive
  • end_address : register to save the buffer end address inclusive
  • start : a write to this register starts the operation and the device becomes busy until finished. An IRQ is then raised.
  • irq_clear : a write to this register clears any pending IRQ
Once the values are set and the operation is started, a master read unit fetches bursts from the memory into a read FIFO. This FIFO is consumed by the swap / mirror operator and fed into a write FIFO. Another master write unit empties the write FIFO also using bursts. The FIFOs are of size 32. Here's a diagram overview (notice that am_r_readdatavalid and am_readdata are directly connected to the read FIFO, and there's a small one-cycle latency logic to transfer words between FIFOs):

block diagram

Thus, the state machine of my master read process is as follows: in short, it fetches bursts of size 16 (the last one may be smaller) if the FIFO (of size 32) has at least 16 free entries until the read pointer reaches the end address. It supports wait requests and readdatavalid = 0. 

read process

Similarly, the master write process waits for the FIFO to have at least 16 elements if there are more than 16 words left to send them as a burst (the last burst may be smaller). Since the FIFO read request must always be asserted one cycle earlier, there's a write_next_output register that delays the output of the write FIFO to the actual Avalon writedata bus (so that a waitrequest can stall an earlier read).

master write fsm

I have simulated the device successfully with bursts of 2, 5 and 33 words and they all work in ModelSim (the interrupt is raised at the end of the operation). I also know that the C software and the interruption mechanism is correct because I can always make it work for a buffer of single value. Thus, the error may only lay in the read or write process (i.e. real Avalon timings differ from the simulation) or the FIFO IPs do not behave as observed in the simulation.

Although it is not very legible, you can see in the next screenshots that the IRQ is raised in all cases in ModelSim (I can also provide the .wlf files if needed). In all simulations, I have tried to insert / ignore various wait requests and data valid assertions:

modelsim burst of size 2

modelsim burst of size 5

modelsim burst of size 33

However, when I perform the test with the real hardware, the device will predictably succeed an exact number of times and fail afterwards, depending on the buffer size.

There's one thing that I think of, which I wonder if it could cause some weird behavior: I have enabled dual-ports on my on-chip memory, and each master unit is connected to a distinct respective port. Here's my Qsys:

qsys

What puzzles me is that the error rate scales with the burst size. Why does a burst of size 1 always work, a burst of size 2 or 3 works exactly twice then fails, and a burst of size 4 works exactly once and then fails; why does a burst of size 5 or bigger never terminates? I'm running out of ideas to debug this.

Any help would be greatly appreciated.

Thank you in advance and best regards,

Alex


Below is the VHDL implementation:

--
-- MIRROR AND SWAP AVALON SLAVE/MASTER
-- RTES Lab 2
--
-- file: master_mirror_swap.vhd
-- author: Alexandre CHAU
-- date: 18/04/2020
-- version: 0.1
--
-- This component is an Avalon slave/master module that performs the bits
-- mirror and swap operation as defined in {@link mirror_swap_op.vhd} on
-- an arbitrary contiguous array in memory. The operation transforms the
-- values in place and triggers an interrupt request when finished.
--
-- The addressable register map for the slave interface is as follows:
-- +---------------+-------------------+------------------------------------------------------+
-- | Register name | Offset (in bytes) | Description |
-- +---------------+-------------------+------------------------------------------------------+
-- | begin_address | 0 | A write to this register when the device is not busy |
-- | | | sets the value as the inclusive start address of the |
-- | | | array to process. The value in this register MUST be |
-- | | | smaller or equal to the one set in end_address. |
-- +---------------+-------------------+------------------------------------------------------+
-- | end_address | 4 | A write to this register when the device is not busy |
-- | | | sets the value as the inclusive end address of the |
-- | | | array to process. The value in this register MUST be |
-- | | | bigger or equal to the one set in start_address. |
-- +---------------+-------------------+------------------------------------------------------+
-- | start | 8 | A write of any value to this register when the |
-- | | | device is not busy will start the processing of the |
-- | | | array. This register MUST only be written once |
-- | | | begin_address and end_address have been set. The |
-- | | | module then becomes busy until an IRQ is raised. |
-- +---------------+-------------------+------------------------------------------------------+
-- | irq_clear | 12 | A write of any value to this register when the |
-- | | | device is not busy will clear a pending interrupt |
-- | | | request. |
-- +---------------+-------------------+------------------------------------------------------+
--
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity master_mirror_swap is
port (
clk : in std_logic;
reset_n : in std_logic;
-- Avalon slave
as_address : in std_logic_vector(1 downto 0);
as_write : in std_logic;
as_writedata : in std_logic_vector(31 downto 0);
-- Avalon master read
am_r_address : out std_logic_vector(31 downto 0);
am_r_read : out std_logic;
am_r_readdata : in std_logic_vector(31 downto 0);
am_r_readdatavalid : in std_logic;
am_r_waitrequest : in std_logic;
am_r_burstcount : out std_logic_vector(10 downto 0);
-- Avalon master write
am_w_address : out std_logic_vector(31 downto 0);
am_w_burstcount : out std_logic_vector(10 downto 0);
am_w_write : out std_logic;
am_w_writedata : out std_logic_vector(31 downto 0);
am_w_waitrequest : in std_logic;
-- Interrupt line
irq : out std_logic
);
end entity master_mirror_swap;
architecture rtl of master_mirror_swap is
    -- Burst size
constant BURST_SIZE : unsigned := to_unsigned(16, am_r_burstcount'length);
    -- Avalon slave port map
constant REG_BEGIN_ADDRESS_OFFSET : std_logic_vector(as_address'length - 1 downto 0) := "00";
constant REG_END_ADDRESS_OFFSET : std_logic_vector(as_address'length - 1 downto 0) := "01";
constant REG_START_OFFSET : std_logic_vector(as_address'length - 1 downto 0) := "10";
constant REG_IRQ_CLEAR_OFFSET : std_logic_vector(as_address'length - 1 downto 0) := "11";
    -- Avalon slave control registers
signal reg_begin_address : std_logic_vector(31 downto 0);
signal reg_end_address : std_logic_vector(31 downto 0);
signal reg_irq_pending : std_logic;
    -- Avalon master read registers
signal reg_read_pointer : std_logic_vector(31 downto 0);
signal reg_read_burstcount : std_logic_vector(10 downto 0);
    -- Avalon master write registers
signal reg_write_pointer : std_logic_vector(31 downto 0);
signal reg_write_next_output : std_logic_vector(31 downto 0);
signal reg_write_output : std_logic_vector(31 downto 0);
signal reg_write_burstcount : std_logic_vector(10 downto 0);
    -- FSM states for Avalon slave
type as_state_type is (
Q_RECEIVE,
Q_BUSY
);
signal as_state : as_state_type;
    -- FSM states for Avalon master read
type am_r_state_type is (
Q_IDLE,
Q_WAIT_READY,
Q_BURST_SETUP,
Q_READ,
Q_DONE
);
signal am_r_state : am_r_state_type;
    -- FSM states for Avalon master write
type am_w_state_type is (
Q_IDLE,
Q_WAIT_DATA,
Q_FETCH_WORD,
Q_HOLD_VALUE,
Q_BURST_SETUP,
Q_WRITE,
Q_DONE
);
signal am_w_state : am_w_state_type;
    -- Mirror/swap operator ports
signal wire_op_input : std_logic_vector(31 downto 0);
signal wire_op_output : std_logic_vector(31 downto 0);
signal op_valid : std_logic;
signal op_valid_next : std_logic;
    -- Declare mirror/swap operator component
component mirror_swap_op
port (
input : in std_logic_vector(31 downto 0);
output : out std_logic_vector(31 downto 0)
);
end component;
    -- Declare FIFO IP component
component fifo_ip
port (
clock : in std_logic;
data : in std_logic_vector(31 downto 0);
rdreq : in std_logic;
wrreq : in std_logic;
almost_empty : out std_logic; -- less or equal than 16 elements in FIFO
almost_full : out std_logic; -- more or equal than 16 elements in FIFO
empty : out std_logic;
full : out std_logic;
q : out std_logic_vector(31 downto 0);
usedw : out std_logic_vector(4 downto 0)
);
end component;
    -- Additional FIFOs plumbing
signal fifo_read_empty : std_logic;
signal fifo_read_almost_empty : std_logic;
signal fifo_write_full : std_logic;
signal fifo_write_q : std_logic_vector(31 downto 0);
signal fifo_write_rdreq : std_logic;
signal fifo_write_almost_full : std_logic;
signal fifo_write_usedw : std_logic_vector(4 downto 0);
begin
    -- FSM process for Avalon slave
as_process : process (clk, reset_n)
begin
if reset_n = '0' then
reg_begin_address <= (others => '0');
reg_end_address <= (others => '0');
reg_irq_pending <= '0';
as_state <= Q_RECEIVE;
elsif rising_edge(clk) then
case as_state is
                when Q_RECEIVE =>
-- In Q_RECEIVE state, the slave process waits for
-- Avalon writes to setup the control registers
if as_write = '1' then
case as_address is
when REG_BEGIN_ADDRESS_OFFSET =>
-- initialize start adddress
reg_begin_address <= as_writedata;
when REG_END_ADDRESS_OFFSET =>
-- sets end address
reg_end_address <= as_writedata;
when REG_START_OFFSET =>
-- transit to busy state
as_state <= Q_BUSY;
when REG_IRQ_CLEAR_OFFSET =>
-- reset IRQ level
reg_irq_pending <= '0';
when others =>
-- ignore otherwise
end case;
end if;
                when Q_BUSY =>
-- In Q_BUSY state, the slave process waits for the
-- the Avalon write master to finish
if am_w_state = Q_DONE then
reg_irq_pending <= '1'; -- trigger IRQ
as_state <= Q_RECEIVE; -- go back to receive state
end if;
end case;
end if;
end process as_process;
    -- Connect IRQ level to pending register
irq <= reg_irq_pending;
    -- FSM process for Avalon master read
am_r_process : process (clk, reset_n)
variable write_words_left : std_logic_vector(am_r_address'length - 1 downto 0);
begin
if reset_n = '0' then
write_words_left := (others => '0');
am_r_state <= Q_IDLE;
reg_read_pointer <= (others => '0');
reg_read_burstcount <= (others => '0');
am_r_read <= '0';
am_r_burstcount <= (others => '0');
am_r_address <= (others => '0');
elsif rising_edge(clk) then
case am_r_state is
                when Q_IDLE =>
-- In Q_IDLE, we wait for the Avalon slave to start
-- in which case we initialize the read pointer
if as_state = Q_BUSY then
reg_read_pointer <= reg_begin_address;
am_r_state <= Q_WAIT_READY;
end if;
                when Q_WAIT_READY =>
-- In Q_WAIT_READY, we wait for the read FIFO to have
-- at least BURST_SIZE empty slots
if fifo_read_almost_empty = '1' then
-- helper to compute the number of words left to write
write_words_left := std_logic_vector(unsigned(reg_end_address) - unsigned(reg_read_pointer) + 1);
-- determine burst size
if unsigned(write_words_left) >= BURST_SIZE then
reg_read_burstcount <= std_logic_vector(BURST_SIZE);
else
reg_read_burstcount <= write_words_left(am_r_burstcount'length - 1 downto 0); -- we can take lowest bits since we know < BURST_SIZE
end if;
am_r_state <= Q_BURST_SETUP;
end if;
                when Q_BURST_SETUP =>
-- In Q_BURST_SETUP, we assert the necessary lines on the
-- Avalon master read to perform a burst read
am_r_read <= '1';
am_r_burstcount <= reg_read_burstcount;
am_r_address <= reg_read_pointer;
reg_read_burstcount <= std_logic_vector(unsigned(reg_read_burstcount) - 1);
am_r_state <= Q_READ;
                when Q_READ =>
-- In Q_READ, we let the FIFO consume one word from the
-- Avalon bus. In case waitrequest is asserted, or data
-- is not valid, we simply wait until it is de-asserted.
if am_r_waitrequest = '0' and am_r_readdatavalid = '1' then
-- No wait request or invalid read data so normal read cycle
if unsigned(reg_read_burstcount) = 0 then
-- Reached the end of the burst
if reg_read_pointer = reg_end_address then
-- Reading is done
am_r_state <= Q_DONE;
else
-- There are still more bursts to read
am_r_state <= Q_WAIT_READY;
end if;
else
reg_read_burstcount <= std_logic_vector(unsigned(reg_read_burstcount) - 1);
end if;
-- In any case, de-assert read, burstcount and address
am_r_read <= '0';
am_r_burstcount <= (others => '0');
am_r_address <= (others => '0');
-- Move on to the next read
reg_read_pointer <= std_logic_vector(unsigned(reg_read_pointer) + 1);
end if;
                when Q_DONE =>
-- In Q_DONE state, we wait for the write unit to finish
if am_w_state = Q_DONE then
am_r_state <= Q_IDLE;
end if;
end case;
end if;
end process am_r_process;
    -- FSM process for Avalon master write
am_w_process : process (clk, reset_n)
variable write_words_left : std_logic_vector(am_w_address'length - 1 downto 0);
begin
if reset_n = '0' then
-- Reset vars
write_words_left := (others => '0');
-- Reset state
reg_write_pointer <= (others => '0');
fifo_write_rdreq <= '0';
am_w_burstcount <= (others => '0');
reg_write_next_output <= (others => '0');
reg_write_output <= (others => '0');
reg_write_burstcount <= (others => '0');
am_w_address <= (others => '0');
am_w_write <= '0';
am_w_state <= Q_IDLE;
elsif rising_edge(clk) then
case am_w_state is
                when Q_IDLE =>
-- In Q_IDLE state, we wait for the Avalon slave to start
-- in which case we initialize the write pointer
if as_state = Q_BUSY then
reg_write_pointer <= reg_begin_address;
am_w_state <= Q_WAIT_DATA;
end if;
                when Q_WAIT_DATA =>
-- In Q_WAIT_DATA state, we wait for the write FIFO to
-- deliver data. If there are less than BURST_SIZE elements
-- we send a smaller burst.
-- helper to compute the number of words left to write
write_words_left := std_logic_vector(unsigned(reg_end_address) - unsigned(reg_write_pointer) + 1);
if unsigned(write_words_left) >= BURST_SIZE and fifo_write_almost_full = '1' then
-- request one word from write FIFO
fifo_write_rdreq <= '1';
reg_write_burstcount <= std_logic_vector(BURST_SIZE);
am_w_state <= Q_FETCH_WORD;
elsif unsigned(write_words_left) < BURST_SIZE and unsigned(fifo_write_usedw) = unsigned(write_words_left) then
-- request one word from write FIFO
fifo_write_rdreq <= '1';
reg_write_burstcount <= write_words_left(am_w_burstcount'length - 1 downto 0); -- we can take lowest bits since we know < BURST_SIZE
am_w_state <= Q_FETCH_WORD;
end if;
                when Q_FETCH_WORD =>
-- In Q_FETCH_WORD, the first word leaves the FIFO
am_w_state <= Q_HOLD_VALUE;
                when Q_HOLD_VALUE =>
-- In Q_HOLD_VALUE, the first word can now be sampled
reg_write_next_output <= fifo_write_q;
am_w_state <= Q_BURST_SETUP;
                when Q_BURST_SETUP =>
-- In Q_BURST_SETUP, we assert the necessary lines on the
-- Avalon master write to perform a burst write
reg_write_next_output <= fifo_write_q;
reg_write_output <= reg_write_next_output;
am_w_burstcount <= reg_write_burstcount;
reg_write_burstcount <= std_logic_vector(unsigned(reg_write_burstcount) - 1);
am_w_address <= reg_write_pointer;
am_w_write <= '1';
am_w_state <= Q_WRITE;
                when Q_WRITE =>
-- In Q_WRITE, we check whether a wait request has been
-- asserted. Otherwise, we move on to the next word
if am_w_waitrequest = '1' then
-- stall the FIFO out pipeline
fifo_write_rdreq <= '0';
elsif unsigned(reg_write_burstcount) = 0 then
-- the burst is finished
-- stop reading from the FIFO and deassert Avalon lines
fifo_write_rdreq <= '0';
am_w_write <= '0';
am_w_address <= (others => '0');
am_w_burstcount <= (others => '0');
reg_write_next_output <= (others => '0');
reg_write_output <= (others => '0');
if reg_write_pointer = reg_end_address then
-- reached the end of all writes
am_w_state <= Q_DONE;
else
-- there are still more bursts to write
am_w_state <= Q_WAIT_DATA;
end if;
-- Must move write pointer to next valid location
reg_write_pointer <= std_logic_vector(unsigned(reg_write_pointer) + 1);
else
-- anticipate end of FIFO out pipeline
if unsigned(reg_write_burstcount) <= 3 then
-- do not request words from FIFO anymore
fifo_write_rdreq <= '0';
else
fifo_write_rdreq <= '1';
end if;
-- move the FIFO out pipeline
reg_write_next_output <= fifo_write_q;
reg_write_output <= reg_write_next_output;
reg_write_pointer <= std_logic_vector(unsigned(reg_write_pointer) + 1);
reg_write_burstcount <= std_logic_vector(unsigned(reg_write_burstcount) - 1);
end if;
                when Q_DONE =>
-- just one cycle to signal other process that we finished
am_w_state <= Q_IDLE;
            end case;
end if;
end process am_w_process;
    -- Connect Avalon master write output
am_w_writedata <= reg_write_output;
    -- Instantiate mirror/swap operator component
op : mirror_swap_op port map(
input => wire_op_input,
output => wire_op_output
);
    -- Assert operation validity to enable word transfer between FIFOs
op_valid_next <= not fifo_read_empty and not fifo_write_full;
    delay_op_valid : process (clk, reset_n)
begin
if reset_n = '0' then
op_valid <= '0';
elsif rising_edge(clk) then
op_valid <= op_valid_next;
end if;
end process delay_op_valid;
    -- Instantiate read FIFO
fifo_read : fifo_ip port map(
clock => clk,
data => am_r_readdata,
rdreq => op_valid_next,
wrreq => am_r_readdatavalid,
almost_empty => fifo_read_almost_empty,
almost_full => open,
empty => fifo_read_empty,
full => open,
q => wire_op_input,
usedw => open
);
    -- Instantiate write FIFO
fifo_write : fifo_ip port map(
clock => clk,
data => wire_op_output,
rdreq => fifo_write_rdreq,
wrreq => op_valid,
almost_empty => open,
almost_full => fifo_write_almost_full,
empty => open,
full => fifo_write_full,
q => fifo_write_q,
usedw => fifo_write_usedw
);
end architecture rtl;



And the implementation of the operator is very straightforward:

--
-- MIRROR AND SWAP OPERATOR
-- RTES Lab 2
--
-- file: mirror_swap_op.vhd
-- author: Alexandre CHAU
-- date: 18/04/2020
-- version: 0.1
--
-- This component is a combinatorial operation of bits mirror and swap.
-- Given a 32 bits input a31..a0 it returns a 32 bits output o31..o0
-- such that:
-- a31..a24 -> o7..o0 i.e. MSB becomes LSB
-- a7..a0 -> o31..o24 i.e. LSB becomes MSB
-- a23..a8 -> o8..o23 i.e. middle bytes are reversed
--
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity mirror_swap_op is
port (
input : in std_logic_vector(31 downto 0);
output : out std_logic_vector(31 downto 0)
);
end entity mirror_swap_op;
architecture comb of mirror_swap_op is
begin
-- a31..a24 -> o7..o0 MSB becomes LSB
output(7 downto 0) <= input(31 downto 24);
-- a7..a0 -> o31..o24 LSB becomes MSB
output(31 downto 24) <= input(7 downto 0);
-- a23..a8 -> o8..o23 middle bytes are reversed
gen: for i in 8 to 23 generate
output(i) <= input(23 - i + 8);
end generate gen;
end architecture comb;

In reply to Alexandre Chau

Re: [Lab 2] Failure rate of my master unit scales with buffer size

by Sahand Kashani-Akhavan -

Hey,

Your design is extremely complicated and I understand it can be difficult to debug. I think the first thing to do is to decouple some parts of the design to simplify debugging. Here are some ideas:

  1. In your schematic, why does the read master need to receive the "done" signal from the write master? If the two entities have independent FIFOs, then the flow control between the FIFOs alone should be enough to transfer data between them. The state machine just need to check if there is enough space in both FIFOs before starting.
  2. I'd recommend you switch to the "first word fall-through" (or "lookahead") FIFO in quartus rather than the standard FIFO you are using. The FWFT FIFO has the advantage of the read signal acting as an acknowledgment that the data has been read (data is available immediately) rather than a request to read data that will arrive on the next cycle. It really simplifies the construction of burstable state machines. You also wouldn't need the delay cycle before the write FIFO if you use the FWFT FIFO as you can just cable read FIFO's "valid" signal to the write port of the second FIFO.
  3. Your end_pointers are computed as (reg_end_address - reg_write_pointer + 1), but your write pointer increments by 1 in the code. Does this mean end_address is not an address, but a word offset? This value is given by the slave, so I'd assume in C code this is a byte address and not a word address?
  4. I would not use the almost_full and almost_empty signals on FIFOs as it forces you to keep counters in the state machines and is an open vector for off-by-one errors to creep in. I'd configure the FIFOs to not even have those ports and instead the state machines would read the USEDW port to see how much space there is. This way there's only a single place where computations are done.
  5. BTW, you can get rid of the for-generate loop in the mirror_swap_op component and just do "output(8 to 23) <= input(23 downto 0)".

I honestly think one can write this accelerator in 1/3 the space after decoupling as it reduces state machine and synchronization complexity. Let us know how it goes!