Step-by-step guide to building transformer components in PyTorch
import math

import torch
import torch.nn.functional as F


def self_attention(query, key, value):
    # Compute raw attention scores: dot product of each query with every key
    scores = torch.matmul(query, key.transpose(-2, -1))
    # Scale by the square root of the key dimension to keep the softmax well-behaved
    scores = scores / math.sqrt(key.size(-1))
    # Apply softmax so each row becomes a probability distribution
    attention = F.softmax(scores, dim=-1)
    # Return the weighted sum of the values
    return torch.matmul(attention, value)
Self-attention relates every position in a sequence to every other position: each query is compared against all keys to produce similarity scores, the scores are scaled and passed through a softmax, and the resulting weights are used to form a weighted combination of the values.
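To see the function in action, here is a minimal usage sketch; the random tensors and the batch, sequence-length, and dimension sizes are arbitrary illustrative assumptions.

# Illustrative shapes only: batch of 2 sequences, 5 tokens each, 64-dim vectors
batch, seq_len, d_k = 2, 5, 64
q = torch.randn(batch, seq_len, d_k)
k = torch.randn(batch, seq_len, d_k)
v = torch.randn(batch, seq_len, d_k)

out = self_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64]): one attended value vector per token

The output has the same shape as the value tensor: every position receives its own weighted mixture of all the value vectors.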
[Interactive visualization: how attention weights distribute across the sequence]
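The interactive widget cannot be reproduced in text, but a small variation of the function above exposes the attention matrix for inspection; returning the weights alongside the output is an assumption for illustration, not part of the original code, and the example reuses the q, k, v tensors from the sketch above.

def self_attention_with_weights(query, key, value):
    # Same computation as self_attention, but also returns the attention weights
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(key.size(-1))
    attention = F.softmax(scores, dim=-1)
    return torch.matmul(attention, value), attention

_, weights = self_attention_with_weights(q, k, v)
# weights[b, i, j] is how strongly token i attends to token j in batch b;
# each row sums to 1 because of the softmax.
print(weights[0])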