import torch
from torch import nn
from torch.nn import functional as F

n_embd = 20     # embedding dimension
head_size = 10  # dimension of keys/queries/values in this head
block_size = 4  # maximum context length
dropout = 0.5   # dropout probability

class Head(nn.Module):
    """ one head of self-attention
    https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # This limits us to a maximum context length of block_size
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B,T,C=n_embd) -> (B,T,C=head_size)
        q = self.query(x)  # (B,T,C=n_embd) -> (B,T,C=head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        # Dropout is applied over the full (B, T, T) weight matrix; arguably it would be
        # better to drop only on the unmasked entries, as this is biased. Conceptually,
        # perhaps a symmetric dropout pattern would be more natural.
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B,T,C=head_size)
        # The matrix multiplication is batched and applied over the last two dimensions!
        out = wei @ v  # (B, T, T) @ (B, T, C=head_size) -> (B, T, C=head_size)
        return out
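A quick smoke test of this head, using the definitions above (my own sketch; only the output shape is being checked, and the dropout is active because a freshly constructed module is in training mode):

torch.manual_seed(0)
head = Head(head_size)
x = torch.randn(2, block_size, n_embd)  # (B=2, T=4, C=20)
out = head(x)
print(out.shape)  # torch.Size([2, 4, 10])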
Self attention softmax weights dropout

In the self-attention head we apply dropout to the softmax weights. Since the first word in the block attends only to itself, it has a single nonzero attention weight, so its output representation is exactly \(0\) with dropout probability \(p\)? Moreover, the representation of the \(n\)-th word, which has \(n\) nonzero weights, is \(0\) with probability \(p^n\)?
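A quick empirical check of these rates (my own sketch): build random causal post-softmax weights for a block of 4 words, apply dropout with \(p = 0.5\) many times, and count how often each row is wiped out entirely; the frequencies should land near \(0.5, 0.25, 0.125, 0.0625\).

import torch
from torch.nn import functional as F

torch.manual_seed(0)
p, trials, T = 0.5, 10_000, 4
causal = torch.tril(torch.ones(T, T)) == 0
wei = torch.rand(trials, T, T).masked_fill(causal, float('-inf'))
wei = F.dropout(F.softmax(wei, dim=-1), p=p)
# fraction of trials in which row n lost every one of its n weights
print((wei.sum(dim=-1) == 0).float().mean(dim=0))
# ~ tensor([0.5000, 0.2500, 0.1250, 0.0625]), i.e. p ** n for the n-th word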
Should we be bothered by this, maybe?
If we are using multiple layers and multiple heads, the “bias” induced by this seems negligible.
In toy examples this actually seems to slow down training.
A separate question is why we do not drop whole words instead. I assume that this is valid, but it would make batch generation ragged and harder to parallelize; I would also assume it would slow down training, since it is far more aggressive than dropping individual weights.
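One parallelism-friendly reading of "drop whole words" (my own sketch, not something from the code above) is to zero out entire token embeddings instead of removing them, sometimes called word dropout; tensor shapes stay fixed, so batching is unaffected:

import torch

torch.manual_seed(0)
B, T, C, p = 2, 4, 20, 0.1                # made-up dimensions for illustration
x = torch.randn(B, T, C)
keep = (torch.rand(B, T, 1) > p).float()  # one keep/drop decision per word
x = x * keep / (1 - p)                    # a dropped word becomes a zero vector
print((x.abs().sum(dim=-1) == 0).sum().item(), "words dropped")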
I found this kind of surprising when I first noticed it. I would welcome any references on this, and on why the sampling is not modified to guarantee, for example, at least one nonzero weight per row, which would ensure a nonzero output vector.
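I am not aware of a standard op that does this, but a hypothetical variant is easy to sketch: run ordinary inverted dropout on the attention weights, and for any row whose mask came out all-zero, force-keep the diagonal (self-attention) entry, which under causal masking always carries a nonzero weight:

import torch

def dropout_keep_self(wei, p=0.5):
    # Hypothetical dropout for (B, T, T) attention weights: a row that
    # would lose every weight keeps its diagonal entry instead.
    keep = (torch.rand_like(wei) > p).float()
    dead = keep.sum(dim=-1, keepdim=True) == 0   # (B, T, 1): rows with no survivors
    eye = torch.eye(wei.shape[-1], device=wei.device).expand_as(wei)
    keep = torch.where(dead, eye, keep)
    return wei * keep / (1 - p)

torch.manual_seed(0)
wei = torch.full((1, 4, 4), 0.25)
print(dropout_keep_self(wei, p=0.9))  # no row comes out entirely zero

Note this is no longer unbiased in expectation, which may be exactly why the standard op does not bother.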
Toy code example
import torch

torch.manual_seed(1367149)
mask = torch.tril(torch.ones(4, 4))       # causal mask: word i may only attend to words <= i
similarities = torch.rand((4, 4)) * mask  # stand-in for the post-softmax attention weights
dd_similarities = torch.nn.functional.dropout(similarities, p=0.5)  # survivors scaled by 1/(1-p) = 2
values = torch.rand((4, 10))
dd_similarities @ values
tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000],
[0.4227, 1.1676, 0.1916, 0.7144, 1.0432, 0.5021, 0.1680, 0.1247, 1.1446,
0.2948],
[2.3534, 2.1808, 1.3240, 2.1463, 2.5636, 0.9744, 2.2360, 1.6751, 1.3820,
2.3454]])
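The first two output rows are exactly zero: every attention weight those words had was dropped. One more detail relevant to the magnitudes above: F.dropout is inverted dropout, so surviving entries are scaled by \(1/(1-p)\) during training, and the whole op is the identity in eval mode. A quick check of that (known PyTorch behavior):

import torch

x = torch.ones(8)
print(torch.nn.functional.dropout(x, p=0.5))                  # survivors come out as 2.0, not 1.0
print(torch.nn.functional.dropout(x, p=0.5, training=False))  # identity when not training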