Implementing Transformers

Step-by-step guide to building transformer components in PyTorch

import math

import torch
import torch.nn.functional as F


def self_attention(query, key, value):
    # Calculate attention scores (dot products of queries and keys)
    scores = torch.matmul(query, key.transpose(-2, -1))
    # Scale by the square root of the key dimension to keep softmax gradients stable
    scores = scores / math.sqrt(key.size(-1))

    # Apply softmax so each query's scores form a probability distribution
    attention = F.softmax(scores, dim=-1)

    # Return the weighted sum of the value vectors
    return torch.matmul(attention, value)

Self-attention relates every token in a sequence to every other token: it scores each query against all keys, normalizes the scores with softmax, and uses them to form a weighted combination of the value vectors.
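
As a quick sanity check, the function above can be called on random tensors. The shapes below (a batch of 2 sequences of length 5 with a 64-dimensional model) are illustrative values, not ones prescribed by this guide:

batch, seq_len, d_model = 2, 5, 64  # illustrative sizes

x = torch.randn(batch, seq_len, d_model)

# Passing the same tensor as query, key, and value is the simplest
# self-attention setup; real models first project x with learned weights.
out = self_attention(x, x, x)
print(out.shape)  # torch.Size([2, 5, 64])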

Implementation Tips

Memory Efficiency

  • Use gradient checkpointing for long sequences (see the sketch after this list)
  • Implement efficient attention patterns
  • Optimize batch size for your hardware
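
Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal sketch using torch.utils.checkpoint, assuming some attention module named attn_block that takes a single tensor (the module name is hypothetical):

import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(attn_block, x):
    # attn_block is a placeholder for any attention module in your model.
    # Its intermediate activations are recomputed during backward
    # rather than kept in memory for the whole forward pass.
    return checkpoint(attn_block, x, use_reentrant=False)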

Training Stability

  • Initialize weights properly
  • Use layer normalization
  • Implement warmup scheduling (see the sketch after this list)
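
Warmup scheduling ramps the learning rate up from near zero over the first few thousand steps before decaying it, which helps keep early training stable. A minimal sketch using torch.optim.lr_scheduler.LambdaLR; the 4000-step warmup and inverse-square-root decay follow a common transformer recipe and are assumptions, not requirements:

import torch

model = torch.nn.Linear(64, 64)  # stand-in for a real transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 4000  # assumed value; tune for your setup

def lr_lambda(step):
    # Linear warmup, then inverse-square-root decay
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, call optimizer.step() and then scheduler.step().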