CS-233(b): Transformers structure

Good question.
Honestly, transformers are a hot research topic. Both the theory and the practice are evolving rapidly, so the structure is not restricted to the one in your figure.
For example, people study where to put the LayerNorm:
http://proceedings.mlr.press/v119/xiong20b/xiong20b.pdf
https://aclanthology.org/2020.emnlp-main.463/
how to remove the LayerNorm:
https://arxiv.org/pdf/2003.04887.pdf
why you cannot remove the feedforward layers (MLP in your figure):
https://arxiv.org/pdf/2103.03404.pdf
why you can remove the feedforward layers:
https://arxiv.org/pdf/1907.01470.pdf
https://aclanthology.org/2020.findings-emnlp.432.pdf
routed feedforward layers designed for large-scale, multi-device models:
https://arxiv.org/pdf/2106.04426.pdf
...
And remember, these are not even very recent papers. There is currently no standard answer to the 'what' and 'why' of transformers.
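
To make the LayerNorm items concrete, here is a minimal sketch in plain Python. It is my own toy code, not taken from any of the linked papers. The names post_ln_block, pre_ln_block, gated_residual_block and toy_sublayer are made up for illustration, the LayerNorm has no learned gain or bias, and "sublayer" stands in for either attention or the MLP. The two placements are commonly called Post-LN (normalize after the residual addition) and Pre-LN (normalize the sublayer input only). The third variant drops LayerNorm and scales the sublayer output by a gate alpha that starts at zero, which I am assuming is roughly the flavor of the "remove the LayerNorm" line of work.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance.

    Simplification: no learned gain or bias parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(x, y):
    """Element-wise sum of two equal-length vectors (the residual connection)."""
    return [a + b for a, b in zip(x, y)]


def post_ln_block(x, sublayer):
    """Post-LN: apply the sublayer, add the residual, then normalize the sum."""
    return layer_norm(add(x, sublayer(x)))


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return add(x, sublayer(layer_norm(x)))


def gated_residual_block(x, sublayer, alpha=0.0):
    """No LayerNorm: scale the sublayer output by a gate alpha.

    With alpha = 0 the block is exactly the identity. In a real model
    alpha would be a learned parameter (an assumption of this sketch).
    """
    return [a + alpha * b for a, b in zip(x, sublayer(x))]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the MLP: doubles every entry."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print("post-LN:", post_ln_block(x, toy_sublayer))
print("pre-LN: ", pre_ln_block(x, toy_sublayer))
print("gated:  ", gated_residual_block(x, toy_sublayer))
```

The thing to notice is what happens to the residual path. In Post-LN the output is always re-normalized, so the raw input never passes through unchanged. In Pre-LN and in the gated variant the input x is added back as-is, which is one reason people argue about training stability in deep stacks.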
