Transformers structure

Re: Transformers structure

by Zhen Wei
Good question.
Honestly, transformers are a hot research topic. The theory and practice around them are evolving rapidly, so the structure is not restricted to the one in your figure.
For example, people study where to put the LayerNorm:
http://proceedings.mlr.press/v119/xiong20b/xiong20b.pdf
https://aclanthology.org/2020.emnlp-main.463/
how to remove the LayerNorm:
https://arxiv.org/pdf/2003.04887.pdf
why you cannot remove the feedforward layers (MLP in your figure):
https://arxiv.org/pdf/2103.03404.pdf
why you can remove the feedforward layers:
https://arxiv.org/pdf/1907.01470.pdf
https://aclanthology.org/2020.findings-emnlp.432.pdf
routed feedforward layers designed for large-scale, multi-device models:
https://arxiv.org/pdf/2106.04426.pdf
...
And remember, these are not even very recent papers. There is currently no standard answer to the 'what' and 'why' of transformers.
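To make the LayerNorm placement question concrete, here is a minimal sketch of the two common choices, usually called Post-LN and Pre-LN. It is an illustration under simplifying assumptions, not code from any of the linked papers: the LayerNorm has no learned scale or shift, the input is a single vector, and `toy_sublayer` is a made-up stand-in for the attention or MLP sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance.
    Simplification: no learned scale/shift parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: LayerNorm(x + sublayer(x)).
    The normalization sits after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: x + sublayer(LayerNorm(x)).
    Only the sublayer input is normalized; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the MLP, for illustration only.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)

print("Post-LN output:", post)
print("Pre-LN output: ", pre)
```

In the Post-LN block the output is always re-normalized, so its mean is zero. In the Pre-LN block the input passes through the residual connection unchanged, so the output keeps the mean of x. That difference in what flows along the residual path is what the placement question is about.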