Does the selfAttentionLayer also perform softmax and scaling?

A self-attention layer computes single-head or multihead self-attention of its input.
The layer:
  1. Computes the queries, keys, and values from the input
  2. Computes the scaled dot-product attention across heads using the queries, keys, and values
  3. Merges the results from the heads
  4. Performs a linear transformation on the merged result
I wonder whether the layer also applies the scaling (i.e., dividing (Q*K) by sqrt(dim)) and the softmax. My understanding is that both should happen within step 2.
Please clarify this for me and for other users.

Accepted Answer

Rohit on 20 Apr 2023
I understand that you want to know whether ‘selfAttentionLayer’ performs the softmax and scaling operations involved in computing the attention scores.
Yes, the layer performs both operations: it scales the attention scores and then applies softmax, as required by the attention mechanism.
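
To make the order of operations concrete, here is a minimal single-head sketch of textbook scaled dot-product attention in plain Python. It is not the internal code of selfAttentionLayer. The function names, the toy matrices, and the choice of the key dimension as the scaling factor are assumptions made for illustration only.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q, K, V are lists of row vectors (one per token)."""
    d_k = len(K[0])                  # key dimension (assumed scaling factor)
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # Step 1: dot product of this query with every key, then scaling.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Step 2: softmax over the scaled scores gives the attention weights.
        weights.append(softmax(scores))
    # Step 3: each output row is the weighted sum of the value rows.
    d_v = len(V[0])
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Hypothetical toy input: two tokens with two features each.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of weights sums to 1, and each query puts the most weight on the key it matches. In a multihead setting the same computation would run once per head before the results are merged (steps 3 and 4 in the question).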

More Answers (1)

cui,xingxing on 11 Jan 2024
Please check out the details of the code I wrote here link.


