CS-476: [Lab 2] Failure rate of my master unit scales with buffer size

Hey,

Your design is extremely complicated and I understand it can be difficult to debug. I think the first thing to do is to decouple some parts of the design to simplify debugging. Here are some ideas:

In your schematic, why does the read master need to receive the "done" signal from the write master? If the two masters have independent FIFOs, then the flow control between the FIFOs alone should be enough to transfer data between them. The state machine just needs to check that there is enough space in both FIFOs before starting.
I'd recommend you switch to a "first word fall-through" (FWFT) FIFO, which Quartus calls "show-ahead" mode, rather than the standard FIFO you are using. With an FWFT FIFO, the read signal acts as an acknowledgment that the data has been read (the data is already available on the output) rather than as a request for data that will arrive on the next cycle. This really simplifies the construction of burst-capable state machines. You also wouldn't need the delay cycle before the write FIFO, as you can just wire the read FIFO's "valid" signal to the write port of the second FIFO.
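Your end_pointers are computed as (reg_end_address - reg_write_pointer + 1), but your write pointer increments by 1 in the code. Does this mean end_address is not an address, but a word offset? This value is written through the slave, so I'd assume the C code supplies a byte address and not a word address? If so, either the pointer has to advance by the word size in bytes, or the address has to be converted to a word offset before this computation.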
I would not use the almost_full and almost_empty signals on the FIFOs, as they force you to keep counters in the state machines, which is an easy way for off-by-one errors to creep in. I'd configure the FIFOs to not even have those ports; instead, the state machines would read the usedw port to see how many words are stored, and therefore how much space is left. This way the computation is done in a single place.
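BTW, you can get rid of the for-generate loop in the mirror_swap_op component and use a single slice assignment instead, e.g. "output(8 to 23) <= input(23 downto 8)". Both slices must have the same width, and output has to be declared with an ascending ("to") range for the assignment to reverse the bit order.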
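To make the end_pointers question concrete, here is a minimal C sketch of the two interpretations. It assumes 32-bit words and word-aligned addresses; WORD_BYTES and the function names are made up for illustration and are not taken from your design.

```c
#include <stdint.h>

/* Assumption: the data bus is 32 bits wide, so one word is 4 bytes. */
#define WORD_BYTES 4u

/* Word count if the registers hold word offsets.
   This mirrors (reg_end_address - reg_write_pointer + 1). */
static inline uint32_t words_from_word_offsets(uint32_t start, uint32_t end)
{
    return end - start + 1u;
}

/* Word count if the registers hold word-aligned byte addresses,
   where 'end' is the byte address of the last word to transfer. */
static inline uint32_t words_from_byte_addresses(uint32_t start, uint32_t end)
{
    return (end - start) / WORD_BYTES + 1u;
}

/* Next pointer value after one word when the pointer is a byte address:
   it advances by the word size, not by 1. */
static inline uint32_t next_byte_address(uint32_t addr)
{
    return addr + WORD_BYTES;
}
```

For a 16-word buffer whose first word is at byte address 0x1000 and whose last word is at 0x103C, words_from_byte_addresses gives 16, while applying the word-offset formula to the same byte addresses gives 61. If the hardware mixes the two conventions, the size of the error depends on the buffer length, so it is worth checking which convention your C code and your state machine each use.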

I honestly think one can write this accelerator in 1/3 the space after decoupling as it reduces state machine and synchronization complexity. Let us know how it goes!