Step-by-step guide to building transformer components in PyTorch
Self-attention computes relationships between all words in a sequence by calculating pairwise similarity scores and using them to form weighted combinations of the values.

import math

import torch
import torch.nn.functional as F


def self_attention(query, key, value):
    # Compute raw attention scores as dot products between queries and keys
    scores = torch.matmul(query, key.transpose(-2, -1))
    # Scale by sqrt(d_k) so the softmax stays in a well-behaved range
    scores = scores / math.sqrt(key.size(-1))
    # Normalize the scores into a probability distribution over the sequence
    attention = F.softmax(scores, dim=-1)
    # Return the attention-weighted sum of the values
    return torch.matmul(attention, value)
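A minimal usage sketch, assuming 3-D inputs of shape (batch, sequence length, model dimension); the concrete sizes below are illustrative and not taken from the guide:

# Hypothetical example: batch of 2 sequences, 5 tokens each, 64-dimensional vectors
q = torch.randn(2, 5, 64)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)

out = self_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64])

Because the softmax is taken over the last dimension, each output position is a convex combination of the value vectors, weighted by how strongly that position attends to every other position in the sequence.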