Unexplained Accelerator Behaviour

Unexplained Accelerator Behaviour

by Filip Slezák -
Number of replies: 9

Hello, so as mentioned on the zoom chat – we have trouble with our accelerator implementation and I believe it is due to one of these two things: the accelerator never starts when we write to it or the accelerator does not modify the values in our table i.e. memory access problem. The implementation is 90% copied from the one of the moodle and most of the other 10% is identical to our custom instruction code that works. Which one of these is most likely? I attach the two main files below - hopefully they are clear by themselves and issue is in there...

Thank you for your responsiveness in answering these questions!

Filip

In reply to Filip Slezák

Re: Unexplained Accelerator Behaviour

by Sahand Kashani -

Hi,

I don't see any place in your C code where you actually read back the values that have been modified to check that the result is what you expect. How are you doing this check? Is it through the "memory view" in the Nios IDE's debugger?

First, if the "memory view" tab is what you are using (i.e., visual inspection with your eyes), that is not a good way to check for correctness. You should read back the values and compare them against their expected values computed with a C implementation (your special_bitswap_optfunction()).

Second, with respect to your accelerator: If the original C function and the custom instruction versions work, I suspect the problem you are facing is that your CPU has a data cache. If you using C code or custom instructions, data from memory is brought into the CPU's caches and acted upon by the ALU in the CPU. So the "memory view" in the Nios IDE can show that memory has changed after your functions run. However, if an accelerator modifies the memory behind the scene, there is no way for a CPU with a data cache to know that the memory has been modified unless there is some cache coherency mechanism being used (which the Nios does not have). Visual inspection of the "memory view" tab therefore won't work here and you should read back the memory with IORD_32DIRECT() which is a non-cacheable load instruction to force the CPU to bypass any cache it may have and fetch the data from memory. This is the only way you can know if the accelerator is working, because, from your description, I understand that the accelerator finishes (your while loop polling the AccFinished register terminates).

In reply to Sahand Kashani

Re: Unexplained Accelerator Behaviour

by Filip Slezák -

Hi, interesting observation, I was not aware of the difference between IOxx_DIRECT and IOxx. Your guess is correct, I do have 4kB data and instruction cache. 

Now, I am not using the memory view, I just made a quick sanity check that the accelerator doesn't pass in the first place... but I see your point and made adjustments at the end of the main function for this - please see my comment about the triple comparison. 

I made a run with the code I attach as screenshot - the accelerator yields back 0, yet if I comment line 128 where the accelerator is called I can read back the original value. This brings me back to my initial guess about the accelerator not writing correctly in memory. I don't know if this is possible but are the variables declared where they should be in the first place? Furthermore, notice how the program suddenly stops in between two prints at the end...any idea why?

Thanks for your help :)

Attachment Screenshot 2021-04-24 at 16.24.21.png
In reply to Filip Slezák

Re: Unexplained Accelerator Behaviour

by Sahand Kashani -

You should not use IOWR or IORD, but always IOWR_xxDIRECT and IORD_xxDIRECT. The reason is that IOWR/IORD use native addressing which is kept for backwards-compatibility only, but it is not recommended to be used for new code any more. Nevertheless, IOWR and IORD are also non-cacheable loads. It's direct memory accesses like randArray[i] which go through the cache.

First, I suggest you do not do randArray[i] = rand(), but rather

int rand_value = rand();
IOWR_32DIRECT(randArray, i * 4, rand_value);
IOWR_32DIRECT(randArray1, i * 4, rand_value);
IOWR_32DIRECT(randArray2, i * 4, rand_value);

This will ensure that the data you place in the randArray is actually in memory and not just contained in the data cache of the CPU, otherwise you have no guarantee the accelerator is reading the random values you intended to write in memory in the first place.

Random data is also hard to debug. Why don't you use some pattern you know in advance in all randArrays like 0x12345678, then place a signaltap probe on your accelerator's master interface and see what it is reading from memory? Then you'll be sure if it is reading correctly or not (you still must use IOWR_32DIRECT to write the arrays though).

Also, your accelerator is more of a decelerator and I don't expect you to get any speedup over the custom instruction implementation since your accelerator is not doing burst transfers and you are basically reading one 32-bit word from memory, performing the bit swap, then writing one 32-bit word to memory. This loop takes 5 cycles in your code at first glance with your state machine, and you are therefore at best using 1/7 of the total bus bandwidth (read takes 2 cycles, write 1 cycle if the bus answers immediately without waitrequest). You should add burst if you want to get any form of speedup.

I can't tell exactly why the printing stops randomly, but I suspect you are overwriting the code of the CPU when your accelerator runs and it causes the program to hang (welcome to the world of unprotected code as there is no OS to protect against wrong programs). Look at your function signatures:

void special_bitswap_optfunction(uint* tab, size_t size)
void special_bitswap_instruction(uint* tab[], size_t size)
void special_bitswap_accelerator(uint* tab[], size_t size)
I'm not 100% sure, but I'm pretty sure the extra [] you've placed in the function signature is causing the argument to be a pointer to a pointer (i.e., uint**, not uint*). Only the optfunction version looks correct.
In reply to Sahand Kashani

Re: Unexplained Accelerator Behaviour

by Filip Slezák -

Thank you again for the lengthy explanation, I will look into it tomorrow. At this stage it just seems simpler to remove caches xD

I have never used bursts before and this code is inspired from the sample on the moodle; is there any change to make to the VHDL or just change the quartus platform designer setting? or could you suggest a good guide to it? - I guess there is otherwise you wouldn't know yourself but I have no clue

The uint* tab[] is indeed an error. It's a remnant from my debugging I forgot to remove everywhere - eclipse was suggesting to add it.

In reply to Filip Slezák

Re: Unexplained Accelerator Behaviour

by Sahand Kashani -

Hi,

We covered burst transfers in Embedded Systems (I'm assuming you didn't take the course last semester). Enabling bursts is not as simple as just configuring something in platform designer as it needs some VHDL modifications when interfacing with the bus.

I would concentrate on getting your current version working without bursts for now as adding bursts afterwards is not too difficult (need to modify the FSM).

In reply to Filip Slezák

Re: Unexplained Accelerator Behaviour

by Filip Slezák -

Thank you both for your input, especially during the weekend. It really helps after having spent endless hours on this. 

I implemented your suggested changes and disabled cache for the moment just to see if it runs without it.

I then managed to pinpoint the error. The accelerator never starts if the array size N is greater than 1 – the problem is I see no reason for this to fail. The vhd clearly states we have 16bits for this size parameter and I don't see a manipulation error of this value in the code. In the C code I also changed all my size_t (64bits) to uint (32bits) as well as tried IOWR_16DIRECT instead of IOWR_32DIRECT here for the table length just in case... but that wasn't the issue.

I attach the console output when N > 1 and adapted code, with N = 0 or 1 the output is as expected. As you expected my accelerator gives the same time as the custom instruction for N=1, a 20x improvement over the C function.

Attachment Screenshot 2021-04-25 at 17.51.32.png
In reply to Filip Slezák

Re: Unexplained Accelerator Behaviour

by Sahand Kashani -

I don't see a real problem my just inspecting the VHDL at this point. Have you written a testbench to see how your counters are behaving in your FSM when you use N>1? That would certainly help pinpoint the issue. There is probably some small mistake that is hard for humans to see.

In reply to Sahand Kashani

Re: Unexplained Accelerator Behaviour

by Filip Slezák -

Hi, we solved this issue with the TAs yesterday, we added a wait for write state to the fsm. It was minor but critical.

In reply to Sahand Kashani

Re: Unexplained Accelerator Behaviour

by René Beuchat -


1) Reading your vhdl code, when you start to write in memory, do NOT wait until avm_WaitRequest is = '0' to put avm_Add, BE, Wr and  WrData

Just wait to finish the transfer to verify that avm_WaitRequest is at '0' 

Depending on the avalon implementation wait_request is allowed to be directly activated when a cycle start.

2) Another way, and more universal when a processor and a DMA have to work together is to use the FLUSH type instructions

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/nios2/edh_ed51002.pdf

"Your application can share non-cache-bypassed memory regions with external masters if it runs the alt_dcache_flush() function before it allows the external master to operate on the memory"

"The alt_dcache_flush_all() function call flushes the entire data cache, but this function is not efficient. Altera recommends that you flush from the cache only the entries for the memory region that you make available to the other master peripheral"

3) Use the IORD_xxDIRECT(base, offset), IOWR_xxDIRECT(base, offset, data) macro from "io.h", offset address in number of byte from a base address, and size of transfers depending on the xx: 8, 16, 32 bits.

IORD() and IOWR() are doing 32 bits access only ond the offset is provide as 32 bits size transfers and not byte. More for programmable interface access than memory. The main characteristic of those macro is to use assembly level uncacheable memory space access.

Another way, but deprecated is to put address bit 31 @'1' when doing an access by the processor with normal instructions, but no more recommended.

Best.

RB



#define __IO_CALC_ADDRESS_DYNAMIC(BASE, OFFSET) \
((void *)(((alt_u8*)BASE) + (OFFSET)))
#define IORD_32DIRECT(BASE, OFFSET) \
__builtin_ldwio (__IO_CALC_ADDRESS_DYNAMIC ((BASE), (OFFSET)))
#define IORD_16DIRECT(BASE, OFFSET) \
__builtin_ldhuio (__IO_CALC_ADDRESS_DYNAMIC ((BASE), (OFFSET)))
#define IORD_8DIRECT(BASE, OFFSET) \
__builtin_ldbuio (__IO_CALC_ADDRESS_DYNAMIC ((BASE), (OFFSET)))
#define IOWR_32DIRECT(BASE, OFFSET, DATA) \
__builtin_stwio (__IO_CALC_ADDRESS_DYNAMIC ((BASE), (OFFSET)), (DATA))
#define IOWR_16DIRECT(BASE, OFFSET, DATA) \
__builtin_sthio (__IO_CALC_ADDRESS_DYNAMIC ((BASE), (OFFSET)), (DATA))
#define IOWR_8DIRECT(BASE, OFFSET, DATA) \
__builtin_stbio (__IO_CALC_ADDRESS_DYNAMIC ((BASE), (OFFSET)), (DATA))
/* Native bus access functions */
#define __IO_CALC_ADDRESS_NATIVE(BASE, REGNUM) \
((void *)(((alt_u8*)BASE) + ((REGNUM) * (SYSTEM_BUS_WIDTH/8))))
#define IORD(BASE, REGNUM) \
__builtin_ldwio (__IO_CALC_ADDRESS_NATIVE ((BASE), (REGNUM)))
#define IOWR(BASE, REGNUM, DATA) \
__builtin_stwio (__IO_CALC_ADDRESS_NATIVE ((BASE), (REGNUM)), (DATA))