Hi,
The goal of this lab is to implement performance-enhancing circuitry for a workload (bit pattern shuffling) that is difficult to express in C, but which can be done in a highly parallel fashion in hardware. This circuitry can be implemented in hardware in more than one way:
- Creating a hardware accelerator: A unit which is external to the CPU that contains a read/write DMA engine, as well as the bit pattern shuffling circuitry.
- Creating a custom instruction: A unit that is internal to the CPU that does not contain a read/write DMA, but rather just the bit pattern shuffling circuitry.
The custom instruction approach makes it easier to accelerate software code: you can replace the few loops and branches that did the bit shuffling in C with a single instruction that the compiler understands and sends to the CPU's ALU. However, a custom instruction can deliver lower performance than a custom accelerator with a DMA unit. The goal of this lab is to quantify this difference, get a feel for how significant it is, and judge whether the effort needed to design a custom accelerator with a DMA unit on a bus is worth it for a given workload.
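To give an idea of the kind of C code a custom instruction can replace, here is a minimal sketch. It assumes the shuffle is a full 32-bit bit reversal, which may not be the exact pattern used in the lab, and the function name bit_reverse32_sw is made up for this illustration.

```c
#include <stdint.h>

/*
 * Software bit reversal of a 32-bit word (assumed shuffle pattern,
 * chosen only for illustration). Bit i of the input ends up at
 * position 31 - i of the output.
 *
 * The CPU executes one loop iteration and one branch per bit.
 * In hardware the same operation is just a rewiring of 32 signals,
 * which is why a single custom instruction can replace this function.
 */
uint32_t bit_reverse32_sw(uint32_t x)
{
    uint32_t result = 0;

    for (int i = 0; i < 32; i++) {
        /* Test bit i of the input ... */
        if (x & (UINT32_C(1) << i)) {
            /* ... and set the mirrored bit in the output. */
            result |= UINT32_C(1) << (31 - i);
        }
    }
    return result;
}
```

With a custom instruction, the body of this function would collapse to a single call that the toolchain maps to the new opcode. With the accelerator, the CPU would instead program the DMA with a source address, a destination address and a length, and the unit would process a whole buffer without executing any instruction per word.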
The profiling tutorial you are asking about is this one:
https://moodlearchive.epfl.ch/2019-2020/pluginfile.php/530941/mod_resource/content/0/laboratories/Profiling_NIOSII_an391.pdf
It shows you how to use the GNU profiler on a Nios II system, along with the caveats this entails.