Deep Learning For NLP Prerequisites

Understanding RNN Architectures for NLP: From Simple to Complex

Natural Language Processing (NLP) has evolved dramatically with the development of increasingly sophisticated neural network architectures. In this blog post, we’ll explore various recurrent neural network (RNN) architectures that have revolutionized NLP tasks, from basic RNNs to complex encoder-decoder models.

Simple RNN: The Foundation

What is a Simple RNN?

A Simple Recurrent Neural Network (RNN) is the most basic form of recurrent architecture designed to handle sequential data. Unlike feedforward networks, RNNs include connections that feed the previous hidden state back into the computation at the current time step, giving the network a form of “memory” of past inputs.

How Simple RNNs Work

In a simple RNN, at each time step t, the network:

  1. Takes in the current input x_t
  2. Combines it with the previous hidden state h_t-1
  3. Produces a new hidden state h_t and an output

The formula for this computation is:

h_t = tanh(W_x * x_t + W_h * h_t-1 + b)
y_t = W_y * h_t + b_y

Where:

  • W_x, W_h, and W_y are weight matrices
  • b and b_y are bias vectors
  • tanh is the activation function
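
To make the recurrence concrete, here is a minimal NumPy sketch of these two equations. The dimensions, random weights, and random inputs are purely illustrative, not a trained model.

import numpy as np

# Toy dimensions and randomly initialised weights (illustrative only).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2

W_x = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_y = rng.normal(size=(output_dim, hidden_dim))  # hidden-to-output weights
b, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_x * x_t + W_h * h_t-1 + b); y_t = W_y * h_t + b_y
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Unroll over a short sequence, carrying the hidden state forward.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # 5 random "word vectors"
    h, y = rnn_step(x_t, h)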

Applications in NLP

Simple RNNs can be used for:

  • Next word prediction
  • Part-of-speech tagging
  • Simple text classification

Limitations

The major limitation of simple RNNs is the “vanishing gradient problem.” During backpropagation through time, gradients either vanish or explode as they’re propagated back through many time steps, making it difficult for the network to capture long-term dependencies.
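
A toy illustration of the effect (a deliberate simplification that ignores the recurrent weight matrix): backpropagating through many tanh steps multiplies local derivatives that are each at most 1, so their product shrinks roughly geometrically with sequence length.

import numpy as np

# Toy illustration: tanh'(z) = 1 - tanh(z)^2 is at most 1, so multiplying
# 50 such local derivatives together yields a vanishingly small gradient.
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=50)            # pretend pre-activations over 50 steps
local_grads = 1 - np.tanh(pre_activations) ** 2  # derivative of tanh at each step
print(np.prod(local_grads))                      # a very small number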

LSTM: Solving the Long-Term Dependency Problem

What is LSTM?

Long Short-Term Memory (LSTM) networks were designed specifically to address the vanishing gradient problem of simple RNNs. Introduced by Hochreiter & Schmidhuber in 1997, LSTMs use a more complex internal structure with gating mechanisms.

How LSTMs Work

LSTMs introduce a cell state (C_t) that runs through the entire sequence, with gates controlling information flow:

  1. Forget Gate: Decides what to forget from the cell state
     f_t = σ(W_f * [h_t-1, x_t] + b_f)
  2. Input Gate: Decides what new information to store
     i_t = σ(W_i * [h_t-1, x_t] + b_i)
     C̃_t = tanh(W_C * [h_t-1, x_t] + b_C)
  3. Cell State Update: Updates the cell state
     C_t = f_t * C_t-1 + i_t * C̃_t
  4. Output Gate: Controls what to output from the cell state
     o_t = σ(W_o * [h_t-1, x_t] + b_o)
     h_t = o_t * tanh(C_t)
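
A minimal NumPy sketch of one such step, following the four gate equations above; the dimensions and random weights are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
concat_dim = hidden_dim + input_dim          # size of [h_t-1, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix (and bias) per gate, each acting on [h_t-1, x_t].
W_f, W_i, W_C, W_o = (rng.normal(size=(hidden_dim, concat_dim)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])        # [h_t-1, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t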

Applications in NLP

LSTMs excel at:

  • Machine translation
  • Text summarization
  • Sentiment analysis
  • Named entity recognition
  • Speech recognition

Advantages over Simple RNNs

  • Better at capturing long-term dependencies
  • More resistant to the vanishing gradient problem
  • Higher capacity for learning complex patterns

Bidirectional LSTM: Context from Both Directions

What is a Bidirectional LSTM?

A Bidirectional LSTM (BiLSTM) processes sequences in both forward and backward directions, capturing context from both past and future states.

How BiLSTMs Work

BiLSTMs include two separate LSTMs:

  1. A forward LSTM that processes the sequence from start to end
  2. A backward LSTM that processes from end to start

The outputs of both networks are typically concatenated or summed, providing a representation that incorporates context from both directions.
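
In practice you rarely wire the two directions by hand. Here is a minimal sketch using PyTorch's built-in bidirectional LSTM; the sizes and random inputs are made up for illustration.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128
embedding = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (2, 7))   # batch of 2 sequences, 7 tokens each
outputs, _ = bilstm(embedding(tokens))

# Each position is represented by the concatenated forward and backward
# hidden states, so the feature size is 2 * hidden_dim.
print(outputs.shape)  # torch.Size([2, 7, 256])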

Applications in NLP

BiLSTMs are especially powerful for:

  • Named entity recognition
  • Part-of-speech tagging
  • Question answering
  • Sentiment analysis

Advantages over Standard LSTMs

  • Captures context from both past and future time steps
  • Provides richer representations for words in the middle of sequences
  • Better performance on tasks where surrounding context matters

Encoder-Decoder Architecture: The Seq2Seq Revolution

What is an Encoder-Decoder Architecture?

The Encoder-Decoder (or Sequence-to-Sequence, Seq2Seq) architecture consists of two RNNs:

  1. An encoder that processes the input sequence
  2. A decoder that generates the output sequence

How Encoder-Decoders Work

  1. Encoder: Processes the input sequence word by word, producing a final hidden state that encapsulates the entire input.
  2. Decoder: Takes the encoder’s final state and generates output tokens one by one, feeding each generated token back as input for the next step.

In modern implementations, both the encoder and decoder typically use LSTM or GRU cells.
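
The sketch below shows one way to wire this up in PyTorch: the encoder's final (hidden, cell) state initialises the decoder, and a linear layer turns decoder outputs into vocabulary scores. The vocabulary sizes and dimensions are illustrative, and no attention is used yet.

import torch
import torch.nn as nn

src_vocab, tgt_vocab, embed_dim, hidden_dim = 1000, 1200, 64, 128

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: its final (h, c) state summarises the whole source sequence.
        _, state = self.encoder(self.src_embed(src))
        # Decoder: starts from the encoder state and predicts the next token
        # at every position of the (shifted) target sequence.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in), state)
        return self.out(dec_out)              # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
src = torch.randint(0, src_vocab, (2, 9))     # two source sentences of 9 tokens
tgt_in = torch.randint(0, tgt_vocab, (2, 6))  # decoder inputs (e.g. shifted targets)
logits = model(src, tgt_in)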

Applications in NLP

Encoder-Decoder architectures are ideal for:

  • Machine translation
  • Text summarization
  • Dialogue systems
  • Question answering
  • Code generation

Advanced Variants: Attention Mechanism

The attention mechanism revolutionized encoder-decoder models by allowing the decoder to “pay attention” to different parts of the input sequence when generating each output token. At each decoding step, the attention weights and context vector are computed as:

attention_weights = softmax(score(decoder_hidden_state, encoder_hidden_states))
context_vector = sum(attention_weights * encoder_hidden_states)
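
A minimal PyTorch sketch of these two lines for a single decoding step, using a simple dot-product score; the shapes and random tensors are illustrative.

import torch

batch, src_len, hidden_dim = 2, 9, 128
encoder_hidden_states = torch.randn(batch, src_len, hidden_dim)
decoder_hidden_state = torch.randn(batch, hidden_dim)

# score(decoder state, each encoder state) via a dot product.
scores = torch.bmm(encoder_hidden_states, decoder_hidden_state.unsqueeze(2)).squeeze(2)
attention_weights = torch.softmax(scores, dim=1)           # (batch, src_len)

# Weighted sum of encoder states -> context vector for this decoding step.
context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_hidden_states).squeeze(1)
print(context_vector.shape)  # torch.Size([2, 128])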

Comparison of Architectures

Simple RNN
  • Strengths: simplicity, fewer parameters
  • Weaknesses: vanishing gradients, limited memory
  • Ideal NLP tasks: very short sequences, simple classification

LSTM
  • Strengths: long-term dependencies, stable training
  • Weaknesses: more complex, more parameters
  • Ideal NLP tasks: translation, summarization, general NLP

BiLSTM
  • Strengths: context from both directions
  • Weaknesses: twice as many parameters as LSTM
  • Ideal NLP tasks: named entity recognition, POS tagging

Encoder-Decoder
  • Strengths: handles variable-length I/O, maps between sequences
  • Weaknesses: complex training, slow inference
  • Ideal NLP tasks: machine translation, summarization

Encoder-Decoder with Attention
  • Strengths: focuses on relevant parts of the input
  • Weaknesses: most complex of the architectures mentioned
  • Ideal NLP tasks: state-of-the-art machine translation, summarization

Visual Diagrams

Simple RNN Architecture

        +-----+
        |     |
        |  h  |<---+
        |     |    |
        +-----+    |
           ^       |
           |       |
        +--+--+    |
        |     |    |
x_t --->| RNN |----+
        |     |
        +-----+
           |
           v
          y_t

LSTM Cell Structure

                forget gate  input gate          output gate
                    f_t        i_t*C̃_t               o_t
                     |            |                   |
  C_t-1 ----------> (×) -------> (+) ---+---> C_t     |
                                        |             |
                                      tanh            |
                                        |             v
                                        +------------(×) ----> h_t

  (all gates are computed from [h_t-1, x_t], as in the equations above)

Bidirectional LSTM Architecture

        Forward LSTM
+-----+     +-----+     +-----+
|     |     |     |     |     |
| LSTM|---->| LSTM|---->| LSTM|
|     |     |     |     |     |
+-----+     +-----+     +-----+
   ^           ^           ^
   |           |           |
  x_1         x_2         x_3
   |           |           |
   v           v           v
+-----+     +-----+     +-----+
|     |     |     |     |     |
| LSTM|<----| LSTM|<----| LSTM|
|     |     |     |     |     |
+-----+     +-----+     +-----+
        Backward LSTM

     [Combined outputs]
        |    |    |
        v    v    v
      y_1   y_2   y_3

Encoder-Decoder Architecture

     Encoder                 Decoder
                          y_1   y_2   y_3
                           ^     ^     ^
                           |     |     |
+-----+-----+-----+     +-----+-----+-----+
|     |     |     |     |     |     |     |
| LSTM| LSTM| LSTM|---->| LSTM| LSTM| LSTM|
|     |     |     |     |     |     |     |
+-----+-----+-----+     +-----+-----+-----+
   ^     ^     ^           ^     ^     ^
   |     |     |           |     |     |
  x_1   x_2   x_3      <START> y_1   y_2

Encoder-Decoder with Attention

     Encoder                 Decoder
                          y_1   y_2   y_3
                           ^     ^     ^
                           |     |     |
+-----+-----+-----+     +-----+-----+-----+
|     |     |     |     |     |     |     |
| LSTM| LSTM| LSTM|<--->| LSTM| LSTM| LSTM|
|     |     |     |     |     |     |     |
+-----+-----+-----+     +-----+-----+-----+
   ^     ^     ^           ^     ^     ^
   |     |     |           |     |     |
  x_1   x_2   x_3      <START> y_1   y_2

   (each decoder step attends over all encoder hidden states)

NLP Algorithms Using These Architectures

Text Classification with LSTM

  1. Preprocessing:
    • Tokenize text
    • Convert tokens to embeddings
  2. Model Architecture:
    • Embedding layer
    • LSTM layer(s)
    • Dense layer with softmax activation
  3. Training:
    • Cross-entropy loss
    • Adam optimizer
  4. Prediction:
    • Feed new text through the model
    • Take argmax of softmax outputs
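
Here is a minimal PyTorch sketch of this recipe; the vocabulary size, dimensions, and batch are placeholders, and the softmax is folded into the cross-entropy loss, as is conventional in PyTorch.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # token ids -> embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embedding(token_ids))
        return self.fc(h_n[-1])              # class logits from the final hidden state

model = LSTMClassifier()
criterion = nn.CrossEntropyLoss()                      # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters())       # Adam optimizer

tokens = torch.randint(0, 1000, (4, 12))               # a batch of 4 tokenised texts
labels = torch.randint(0, 3, (4,))
loss = criterion(model(tokens), labels)
loss.backward()
optimizer.step()

prediction = model(tokens).argmax(dim=1)               # argmax over class scores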

Named Entity Recognition with BiLSTM

  1. Preprocessing:
    • Tokenize text
    • Convert tokens to embeddings
    • Create BIO/IOB tags for entities
  2. Model Architecture:
    • Embedding layer
    • BiLSTM layer(s)
    • Time-distributed dense layer with softmax
  3. Training:
    • Cross-entropy loss (or CRF loss)
    • Often includes a CRF layer for coherent predictions
  4. Prediction:
    • Feed new text through the model
    • Decode the most likely sequence of tags
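
A minimal PyTorch sketch of the BiLSTM tagger described above, without the optional CRF layer; the vocabulary and tag-set sizes are placeholders.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_tags=9):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Applied independently at every time step, i.e. a "time-distributed" dense layer.
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        outputs, _ = self.bilstm(self.embedding(token_ids))
        return self.fc(outputs)              # (batch, seq_len, num_tags) logits

model = BiLSTMTagger()
tokens = torch.randint(0, 1000, (2, 10))     # two tokenised sentences of 10 tokens
tag_logits = model(tokens)
predicted_tags = tag_logits.argmax(dim=-1)   # one BIO tag id per token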

Machine Translation with Encoder-Decoder + Attention

  1. Preprocessing:
    • Tokenize source and target text
    • Create vocabulary for both languages
    • Convert tokens to indices
  2. Model Architecture:
    • Source embedding layer
    • Encoder (LSTM/BiLSTM)
    • Attention mechanism
    • Decoder (LSTM)
    • Target embedding layer
    • Output dense layer with softmax
  3. Training:
    • Teacher forcing (use ground truth as next input)
    • Cross-entropy loss
    • Beam search for inference
  4. Prediction:
    • Encode source sentence
    • Generate target tokens one by one
    • Use beam search to find best translation
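
To illustrate the teacher-forcing step, here is a sketch of how the target sequence is split into decoder inputs and labels and scored with cross-entropy. It assumes a model with the interface of the Seq2Seq sketch earlier in this post (logits of shape batch × length × vocabulary) and a target that begins with the <START> token; beam search is omitted here.

import torch.nn.functional as F

def teacher_forcing_loss(model, src, tgt, tgt_vocab, pad_id=0):
    # The target (starting with <START>) is split so the decoder sees the
    # ground-truth tokens up to position t and must predict the token at t+1.
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
    logits = model(src, tgt_in)                       # (batch, len-1, tgt_vocab)
    return F.cross_entropy(
        logits.reshape(-1, tgt_vocab),                # flatten batch and time
        tgt_out.reshape(-1),
        ignore_index=pad_id,                          # don't penalise padding
    )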

Conclusion

The evolution from simple RNNs to attention-based encoder-decoder models has dramatically improved the capabilities of NLP systems. While transformers and large language models have since surpassed these architectures in many tasks, understanding these fundamental RNN-based models provides valuable insights into the development of sequence modeling in deep learning.

Each architecture builds upon the previous one, addressing specific limitations:

  • LSTMs solved the vanishing gradient problem of simple RNNs
  • BiLSTMs incorporated context from both directions
  • Encoder-Decoder models enabled variable-length sequence-to-sequence mapping
  • Attention mechanisms allowed models to focus on relevant parts of the input

Understanding these architectures and their evolution provides a solid foundation for working with modern NLP systems and developing new approaches to language understanding and generation.
