How is the number of parameters calculated when a multi-head self-attention layer is used in a CNN model?

57 views (last 30 days)
I have run the example in the following link in two cases:

Case 1: NumHeads = 4, NumKeyChannels = 784
Case 2: NumHeads = 8, NumKeyChannels = 392

Note that 4x784 = 8x392 = 3136 (the size of the input feature vector to the attention layer). I calculated the number of model parameters in the two cases and got 9.8 M for the first case and 4.9 M for the second.
I expected the number of learnable parameters to be the same. However, MATLAB reports different parameter counts.
My understanding from research papers is that the total number of parameters should not depend on how the input is split across heads: as long as the input feature vector is the same, and the product of the number of heads and the per-head size (number of channels) equals the input size, the parameter count should be identical.
Why does MATLAB’s selfAttentionLayer produce different parameter counts for these two configurations? Am I misinterpreting how the layer is implemented in this toolbox?
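One way the two reported totals can be reproduced is if NumKeyChannels is treated as the total projection width shared across all heads, rather than a per-head width, so that the head count itself adds no parameters. The sketch below is a hypothetical counting model under that assumption (function and parameter names are illustrative, not MATLAB's internals):

```python
# Hypothetical parameter count for a self-attention layer, assuming
# NumKeyChannels is the TOTAL width of the Q/K projections (split across
# heads), the value width defaults to the key width, and the output
# projection maps back to the input size. Includes bias terms.

def attention_params(input_size, num_key_channels,
                     num_value_channels=None, output_size=None):
    if num_value_channels is None:
        num_value_channels = num_key_channels
    if output_size is None:
        output_size = input_size
    qk = 2 * (num_key_channels * input_size + num_key_channels)   # Q and K
    v = num_value_channels * input_size + num_value_channels      # V
    out = output_size * num_value_channels + output_size          # output proj
    return qk + v + out

# Head count does not appear in the formula at all:
case1 = attention_params(3136, 784)   # NumHeads = 4  -> 9,839,984 (~9.8 M)
case2 = attention_params(3136, 392)   # NumHeads = 8  -> 4,921,560 (~4.9 M)
print(case1, case2)
```

Under this counting model the totals match the 9.8 M and 4.9 M figures reported above, which would suggest the two configurations really do have different learnable sizes because NumKeyChannels halved.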
  8 Comments
Hana Ahmed, about 3 hours ago
Thank you for your feedback. Regarding the input argument issue, I understand that the function requires the input X when called, and I am providing it accordingly. I have executed the code and obtained the expected results. Also, since I do not use a batch size of 1, is replacing squeeze with reshape still necessary in my case?
Hana Ahmed, about 3 hours ago
Could you please confirm the following points?
  1. Is the functional implementation of the selfAttentionLayer correct? Specifically, are the forward and backward passes implemented correctly, such that using this layer for training and evaluating model accuracy (e.g., classification accuracy) yields valid and reliable results?
  2. Is the issue limited to the reporting of the number of parameters, rather than the actual computation? In other words, is the layer itself functionally sound, and does the discrepancy arise only in how parameters are counted or displayed?
  3. For academic publication, what value should be reported as NumKeyChannels? Should I report NumKeyChannels as the per-head dimension?
It would be very helpful if this behavior could be reviewed and potentially corrected in future releases, so that users are not misled when comparing model sizes or reporting parameter efficiency. Accurate parameter counting is especially important for reproducibility and fair comparison in research.
Thank you for your time and for providing such powerful deep learning tools in MATLAB. I appreciate your support and look forward to your clarification.
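For contrast, the convention assumed in most transformer papers (the one referenced earlier in the question) fixes the total model width and gives each head d_model // num_heads channels, which makes the parameter count head-invariant. A minimal sketch of that convention, with illustrative names only:

```python
# Paper-style convention: total width d_model is fixed, each head gets
# d_model // num_heads channels, and the output projection maps the
# concatenated heads back to d_model. Bias terms included.

def paper_convention_params(d_model, num_heads):
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Per head: Q, K, V projections of size d_model -> d_head
    per_head = 3 * (d_head * d_model + d_head)
    # Output projection over the concatenated heads (num_heads * d_head)
    out = d_model * (num_heads * d_head) + d_model
    return num_heads * per_head + out

# Here 4 heads and 8 heads give the SAME count for the same total width:
print(paper_convention_params(3136, 4) == paper_convention_params(3136, 8))
```

This illustrates why the expectation of equal parameter counts is reasonable under the paper convention, and why the per-head versus total interpretation of NumKeyChannels matters when reporting model sizes.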


Answers (0)

