The Complete Transformer Architecture: A Deep Dive

The Transformer architecture revolutionized natural language processing when it was introduced in the landmark 2017 paper “Attention Is All You Need” by Vaswani et al. This blog post explores the complete architecture, breaking down each component to provide a thorough understanding of how Transformers work.

Figure: The Transformer architecture — input and output embeddings with positional encoding; an encoder stack (Nx) of multi-head self-attention and feed-forward sub-layers, each followed by Add & Norm; and a decoder stack (Nx) of masked multi-head self-attention, encoder-decoder (cross) attention, and feed-forward sub-layers, followed by a final linear layer and softmax.

Introduction

The Transformer marked a pivotal shift in NLP architectures. Unlike recurrent neural networks (RNNs) and convolutional neural networks (CNNs), it relies entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This design allows for significantly more parallelization during training and has become the foundation for models like BERT, GPT, and T5.

Core Architecture

The Transformer follows an encoder-decoder architecture, but with a novel approach:

  1. Encoder: Processes the input sequence and builds representations
  2. Decoder: Generates output sequences using both the encoder’s representations and its own previous outputs

Both components are composed of stacks of identical layers, each with two main sub-layers:

  • Multi-head attention mechanism
  • Position-wise fully connected feed-forward network

Let’s break down each component in detail.

Encoder Architecture

The encoder consists of N identical layers (N=6 in the original paper). Each layer has two sub-layers:

1. Multi-Head Self-Attention

The first sub-layer is a multi-head self-attention mechanism. Self-attention allows the encoder to consider all positions in the input sequence when encoding a specific position, enabling the model to capture relationships regardless of their distance in the sequence.

The self-attention mechanism is calculated as follows:

  1. For each word in the input, three vectors are created:
    • Query (Q): What the current word is looking for
    • Key (K): What the current word offers for other positions to match against
    • Value (V): The actual content that is passed along once a match is found
  2. For each position, attention scores are computed as the dot products of its query with every key; after scaling and a softmax, these scores weight the values:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The paper calls this “scaled dot-product attention”: the dot products are scaled by 1/√d_k (where d_k is the dimension of the key vectors) so that large dot products do not push the softmax into regions with extremely small gradients.
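To make the computation concrete, here is a minimal sketch of scaled dot-product attention. I use PyTorch for all of the sketches in this post; the function name, tensor shapes, and the optional boolean mask are my own illustrative assumptions, not the paper’s reference code.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V.

    q, k, v: tensors of shape (batch, seq_len, d_k). mask, if given, must be
    broadcastable to (batch, seq_len, seq_len), with 0/False marking positions
    that may not be attended to.
    """
    d_k = q.size(-1)
    # Raw attention scores: dot product of every query with every key, scaled.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Blocked positions get -inf so the softmax assigns them ~0 weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values.
    return torch.matmul(weights, v), weights
```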

Multi-Head Attention

Rather than performing a single attention function, the Transformer uses multi-head attention:

MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ)W^O
where headᵢ = Attention(QW^Q_i, KW^K_i, VW^V_i)

In the original paper, they used h=8 parallel attention heads. Each head uses different learned linear projections for queries, keys, and values, allowing the model to jointly attend to information from different representation subspaces.
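A compact sketch of multi-head attention under the same assumptions (the module and projection names are mine; the default sizes follow the base model):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head attention: h parallel heads, then W^O."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads          # 64 in the base model
        self.num_heads = num_heads
        # One projection per role; each is split into heads below.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, query, key, value, mask=None):
        # mask, if given, must broadcast to (batch, heads, tgt_len, src_len).
        batch = query.size(0)

        def split_heads(x, proj):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return proj(x).view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = torch.matmul(F.softmax(scores, dim=-1), v)   # (batch, heads, seq, d_k)
        # Concatenate the heads and apply W^O.
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_k)
        return self.w_o(out)
```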

2. Position-wise Feed-Forward Network

The second sub-layer is a simple feed-forward network applied to each position separately and identically:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

This is a two-layer neural network with a ReLU activation in between. In the original implementation, the inner layer has a dimensionality of 2048, while input and output are of dimension 512.
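A sketch of this sub-layer with the base-model dimensions (the class name is mine):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied at each position independently."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand to the inner dimension (2048)
        self.linear2 = nn.Linear(d_ff, d_model)   # project back to d_model (512)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, seq_len, d_model). nn.Linear acts on the last dimension,
        # so every position is transformed identically and independently.
        return self.linear2(self.relu(self.linear1(x)))
```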

Residual Connections and Layer Normalization

Around each sub-layer, the encoder employs a residual connection followed by layer normalization:

LayerNorm(x + Sublayer(x))

This helps with training deeper networks and maintains gradient flow.
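A sketch of this wrapper, assuming the post-layer-norm arrangement of the original paper and the residual dropout mentioned later in this post; the class name is mine:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)   # dropout on the sub-layer output

    def forward(self, x, sublayer):
        # sublayer is a callable, e.g. the attention or feed-forward module.
        return self.norm(x + self.dropout(sublayer(x)))
```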

Decoder Architecture

The decoder also consists of N identical layers (N=6 in the original paper), but each decoder layer has three sub-layers:

1. Masked Multi-Head Self-Attention

The first sub-layer is similar to the encoder’s self-attention but adds a masking mechanism. The mask ensures that the prediction for a given position can depend only on the known outputs at earlier positions, preserving the auto-regressive property even though all positions are processed in parallel during training.
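A sketch of how such a look-ahead (causal) mask is typically built; the helper name is mine, and the False-means-blocked convention matches the `masked_fill(mask == 0, …)` usage in the attention sketches above:

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to positions <= i.

    Returns shape (1, seq_len, seq_len), which broadcasts over batch (and heads).
    False marks blocked positions.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return mask.unsqueeze(0)

# Example: for seq_len = 4, the third row is [1, 1, 1, 0] — the third position
# can attend to itself and earlier positions, but not to the fourth.
```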

2. Multi-Head Attention Over Encoder Output

The second sub-layer performs multi-head attention where:

  • Queries (Q) come from the previous decoder layer
  • Keys (K) and Values (V) come from the encoder’s output

This allows the decoder to focus on appropriate parts of the input sequence.

3. Position-wise Feed-Forward Network

The third sub-layer is identical to the feed-forward network used in the encoder.

Like the encoder, residual connections and layer normalization are applied around each sub-layer.

Embeddings and Positional Encoding

Input and Output Embeddings

Both the encoder and decoder use learned embeddings to convert input tokens to vectors of dimension d_model (512 in the original paper). The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation, and the embedding weights are multiplied by √d_model.
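A sketch of this weight sharing (the class name is mine; the √d_model scaling follows the paper):

```python
import math
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Token embedding scaled by sqrt(d_model), with the same weight matrix
    reused as the pre-softmax output projection (weight tying)."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight   # share one parameter tensor

    def forward(self, token_ids):
        # The paper multiplies the embedding weights by sqrt(d_model).
        return self.embed(token_ids) * math.sqrt(self.d_model)

    def logits(self, hidden):
        # Pre-softmax linear transformation using the shared weights.
        return self.proj(hidden)
```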

Positional Encoding

Since the Transformer contains no recurrence or convolution, it needs a way to understand the order of the sequence. The solution is adding “positional encodings” to the input embeddings:

PE(pos, 2i) = sin(pos/10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))

Where pos is the position and i is the dimension. This function creates a unique pattern for each position, allowing the model to understand relative positions in the sequence.
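A sketch of this encoding as a standalone function (the function name is mine):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    """Build the (max_len, d_model) matrix of sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    # 1 / 10000^(2i / d_model) for each even dimension index 2i.
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe

# Usage: added to the (scaled) token embeddings before the first layer, e.g.
# x = token_embeddings + sinusoidal_positional_encoding(x.size(1), d_model)
```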

Figure: Sinusoidal positional encoding values for dimensions 0, 2, and 4 (sine components) plotted over positions 0–25.

Self-Attention in Detail

The self-attention mechanism is the core innovation of the Transformer. Let’s take a deeper look at how it works:


Figure: The self-attention mechanism — the input embeddings are projected by W^Q, W^K, and W^V into queries, keys, and values; QK^T is scaled by 1/√d_k, passed through a softmax, and multiplied by V to produce the attention output, i.e. Attention(Q, K, V) = softmax(QK^T/√d_k)V.

Multi-Head Attention in Detail

As introduced above, multi-head attention lets the model jointly attend to information from different representation subspaces. The figure below traces the full data flow through the h parallel heads:

Figure: Multi-head attention — h parallel scaled dot-product attention heads, each with its own learned projections W^Q_i, W^K_i, W^V_i; the head outputs are concatenated and projected by W^O, giving MultiHead(Q, K, V) = Concat(head₁, …, headₕ)W^O.

Training and Optimization

The Transformer model was trained using the Adam optimizer with the following parameters:

  • Learning rate varying according to the schedule lr = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5)), sketched below
  • β₁ = 0.9, β₂ = 0.98, ε = 10^(-9)
  • warmup_steps = 4000
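A sketch of this schedule as a plain Python function (the function name is mine); if plugged into an optimizer, the base learning rate would be set to 1.0 so that the function’s value is the actual rate:

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    """Learning rate from the paper: linear warmup, then inverse square-root decay."""
    step_num = max(step_num, 1)   # avoid step 0
    return (d_model ** -0.5) * min(step_num ** -0.5,
                                   step_num * warmup_steps ** -1.5)

# The rate rises linearly for the first 4000 steps and afterwards decays
# proportionally to 1/sqrt(step_num).
```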

Additional techniques used during training included:

  • Residual dropout with rate 0.1
  • Label smoothing with value ε_ls = 0.1

Why Self-Attention?

The paper outlines several advantages of self-attention over recurrent and convolutional layers:

  1. Computational complexity per layer: Self-attention has lower complexity when the sequence length is smaller than the representation dimensionality.
  2. Amount of computation that can be parallelized: Self-attention requires a fixed number of sequential operations (O(1)), whereas recurrent layers require O(n) sequential operations.
  3. Path length between long-range dependencies: Self-attention creates direct connections between any two positions, resulting in maximum path length O(1), while recurrent layers require O(n) steps.

This table compares the computational characteristics:

| Layer Type     | Complexity per Layer | Sequential Operations | Maximum Path Length |
|----------------|----------------------|-----------------------|---------------------|
| Self-Attention | O(n²·d)              | O(1)                  | O(1)                |
| Recurrent      | O(n·d²)              | O(n)                  | O(n)                |
| Convolutional  | O(k·n·d²)            | O(1)                  | O(log_k(n))         |

Cross-Attention Mechanism

In the decoder, the second attention layer performs cross-attention, which is a crucial bridge between encoder and decoder:

  1. Queries (Q) come from the previous decoder layer
  2. Keys (K) and Values (V) come from the encoder’s output

This architecture enables the decoder to focus on appropriate parts of the input sequence, creating a context-aware generation process.

The mathematical formulation is the same as self-attention:

CrossAttention(Q, K, V) = softmax(QK^T / √d_k) V

The difference lies in the source of Q, K, and V vectors.
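To make the difference concrete, here is a small sketch using PyTorch’s built-in nn.MultiheadAttention (my choice for brevity, not the paper’s code); the random tensors stand in for real encoder output and decoder states:

```python
import torch
import torch.nn as nn

d_model, num_heads, src_len, tgt_len = 512, 8, 10, 7
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(1, src_len, d_model)   # from the encoder stack
decoder_states = torch.randn(1, tgt_len, d_model)   # from the previous decoder sub-layer

# Q from the decoder, K and V from the encoder output.
context, attn_weights = cross_attn(query=decoder_states,
                                   key=encoder_output,
                                   value=encoder_output)
print(context.shape)        # (1, 7, 512): one context vector per target position
print(attn_weights.shape)   # (1, 7, 10): how each target position attends over the source
```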

Implementation Details

The original Transformer “base” model used the following hyperparameters (collected in the sketch after this list):

  • Encoder and decoder each had N=6 identical layers
  • d_model = 512 (dimensionality of embeddings)
  • d_ff = 2048 (dimensionality of feed-forward layers)
  • h = 8 (number of attention heads)
  • d_k = d_v = 64 (dimensionality of keys and values)
  • Dropout rate of 0.1 was applied to the output of each sub-layer and to embeddings and positional encodings
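For reference, a minimal configuration object gathering these values; the field names are my own, while the values come from the paper:

```python
from dataclasses import dataclass

@dataclass
class TransformerBaseConfig:
    """Hyperparameters of the base Transformer (Vaswani et al., 2017)."""
    num_layers: int = 6          # N identical encoder layers and N decoder layers
    d_model: int = 512           # embedding / residual-stream dimensionality
    d_ff: int = 2048             # inner dimension of the feed-forward sub-layer
    num_heads: int = 8           # h attention heads
    d_k: int = 64                # per-head key/query dimension (d_model / h)
    d_v: int = 64                # per-head value dimension
    dropout: float = 0.1         # on sub-layer outputs, embeddings, positional encodings
    label_smoothing: float = 0.1

config = TransformerBaseConfig()
assert config.d_model == config.num_heads * config.d_k
```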

Ablation Studies from the Paper

The authors conducted several experiments to validate design choices:

  1. Varying model size: Performance improved with both d_model and d_ff
  2. Attention heads: Multiple attention heads outperformed a single head, although quality dropped off again with too many heads
  3. Positional encoding: Replacing the sinusoidal encodings with learned positional embeddings produced nearly identical results

Applications and Extensions

Since its introduction, the Transformer architecture has been the foundation for numerous breakthrough models:

  1. BERT (Bidirectional Encoder Representations from Transformers): Uses only the encoder portion for bidirectional context understanding
  2. GPT (Generative Pre-trained Transformer): Uses only the decoder portion for autoregressive text generation
  3. T5 (Text-to-Text Transfer Transformer): Frames all NLP tasks as text-to-text problems
  4. Vision Transformer (ViT): Adapts Transformers for image classification by treating image patches as sequence tokens

Limitations and Challenges

Despite its success, the Transformer has some limitations:

  1. Quadratic complexity: The self-attention mechanism scales quadratically with sequence length, making it computationally expensive for very long sequences
  2. Positional encoding limitations: Position information is injected only additively at the input through fixed encodings, which may capture order less directly than the step-by-step processing of recurrent architectures
  3. Limited inductive bias: Without the sequential bias of RNNs or the spatial locality bias of CNNs, Transformers may require more data to learn patterns

Addressing the Complexity Issue

Several approaches have been proposed to address the quadratic complexity issue:

  1. Sparse Attention: Only attend to a subset of positions
  2. Linear Attention: Reformulate attention to achieve linear complexity
  3. Local Attention: Restrict attention to local neighborhoods
  4. Longformer/BigBird: Use a combination of local, global, and random attention patterns

Conclusion

The Transformer architecture represents a paradigm shift in sequence modeling. By replacing recurrence and convolutions with self-attention, it achieves state-of-the-art results while being far more parallelizable and cheaper to train than comparable recurrent models. Its success has led to a new generation of pre-trained models that have pushed the boundaries of what’s possible in natural language processing and beyond.

The fundamental principles of the Transformer—self-attention, multi-head attention, and position-wise feed-forward networks—have proven to be remarkably versatile and effective across domains, cementing its place as one of the most significant architectural innovations in deep learning.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  3. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
  4. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
  5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
