How is the number of parameters calculated when a multi-head self-attention layer is used in a CNN model?

57 views (last 30 days)
I have run the example in the following link in two cases:

Case 1: NumHeads = 4, NumKeyChannels = 784
Case 2: NumHeads = 8, NumKeyChannels = 392

Note that 4x784 = 8x392 = 3136 (the size of the input feature vector to the attention layer). I calculated the number of model parameters in the two cases and got 9.8 M for the first case and 4.9 M for the second.
I expected the number of learnable parameters to be the same. However, MATLAB reports different parameter counts.
My understanding from research papers is that the total number of parameters should not depend on how the input is split across heads: as long as the input feature vector is the same, and the product of the number of heads and the per-head size (number of channels) equals the input size, the parameter count should be identical.
Why does MATLAB’s selfAttentionLayer produce different parameter counts for these two configurations? Am I misinterpreting how the layer is implemented in this toolbox?
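One way the two reported totals can be reproduced is if NumKeyChannels is treated as the total projection width shared across all heads, rather than a per-head width, so that the head count itself adds no parameters. The sketch below is a hypothetical counting model under that assumption (function and parameter names are illustrative, not MATLAB's internals):

```python
# Hypothetical parameter count for a self-attention layer, assuming
# NumKeyChannels is the TOTAL width of the Q/K projections (split across
# heads), the value width defaults to the key width, and the output
# projection maps back to the input size. Includes bias terms.

def attention_params(input_size, num_key_channels,
                     num_value_channels=None, output_size=None):
    if num_value_channels is None:
        num_value_channels = num_key_channels
    if output_size is None:
        output_size = input_size
    qk = 2 * (num_key_channels * input_size + num_key_channels)   # Q and K
    v = num_value_channels * input_size + num_value_channels      # V
    out = output_size * num_value_channels + output_size          # output proj
    return qk + v + out

# Head count does not appear in the formula at all:
case1 = attention_params(3136, 784)   # NumHeads = 4  -> 9,839,984 (~9.8 M)
case2 = attention_params(3136, 392)   # NumHeads = 8  -> 4,921,560 (~4.9 M)
print(case1, case2)
```

Under this counting model the totals match the 9.8 M and 4.9 M figures reported above, which would suggest the two configurations really do have different learnable sizes because NumKeyChannels halved.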
  8 Comments
Hana Ahmed, about 3 hours ago
Thank you for your feedback. Regarding the input argument issue, I understand that the function requires the input X when called, and I am providing it accordingly. I have executed the code and obtained the expected results. Also, since I do not use a batch size of 1, is replacing squeeze with reshape still necessary in my case?
Hana Ahmed, about 3 hours ago
Could you please confirm the following points?
  1. Is the functional implementation of the selfAttentionLayer correct? Specifically, are the forward and backward passes implemented correctly, such that using this layer for training and evaluating model accuracy (e.g., classification accuracy) yields valid and reliable results?
  2. Is the issue limited to the reporting of the number of parameters, rather than the actual computation? In other words, is the layer itself functionally sound, and does the discrepancy arise only in how parameters are counted or displayed?
  3. For academic publication, what value should be reported as NumKeyChannels? Should I report NumKeyChannels as the per-head dimension?
It would be very helpful if this behavior could be reviewed and potentially corrected in future releases, so that users are not misled when comparing model sizes or reporting parameter efficiency. Accurate parameter counting is especially important for reproducibility and fair comparison in research.
Thank you for your time and for providing such powerful deep learning tools in MATLAB. I appreciate your support and look forward to your clarification.
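For contrast, the convention assumed in most transformer papers (the one referenced earlier in the question) fixes the total model width and gives each head d_model // num_heads channels, which makes the parameter count head-invariant. A minimal sketch of that convention, with illustrative names only:

```python
# Paper-style convention: total width d_model is fixed, each head gets
# d_model // num_heads channels, and the output projection maps the
# concatenated heads back to d_model. Bias terms included.

def paper_convention_params(d_model, num_heads):
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Per head: Q, K, V projections of size d_model -> d_head
    per_head = 3 * (d_head * d_model + d_head)
    # Output projection over the concatenated heads (num_heads * d_head)
    out = d_model * (num_heads * d_head) + d_model
    return num_heads * per_head + out

# Here 4 heads and 8 heads give the SAME count for the same total width:
print(paper_convention_params(3136, 4) == paper_convention_params(3136, 8))
```

This illustrates why the expectation of equal parameter counts is reasonable under the paper convention, and why the per-head versus total interpretation of NumKeyChannels matters when reporting model sizes.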


Answers (0)

