Understanding RNN Architectures for NLP: From Simple to Complex
Natural Language Processing (NLP) has evolved dramatically with the development of increasingly sophisticated neural network architectures. In this blog post, we’ll explore various recurrent neural network (RNN) architectures that have revolutionized NLP tasks, from basic RNNs to complex encoder-decoder models.
Simple RNN: The Foundation
What is a Simple RNN?
A Simple Recurrent Neural Network (RNN) is the most basic form of recurrent architecture designed to handle sequential data. Unlike feedforward networks, RNNs include connections that feed the network’s previous state back into the current state, creating a form of “memory” about past inputs.
How Simple RNNs Work
In a simple RNN, at each time step t, the network:
- Takes in the current input x_t
- Combines it with the previous hidden state h_t-1
- Produces a new hidden state h_t and an output
The formula for this computation is:
h_t = tanh(W_x * x_t + W_h * h_t-1 + b)
y_t = W_y * h_t + b_y
Where:
- W_x, W_h, and W_y are weight matrices
- b and b_y are bias vectors
- tanh is the activation function
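To make the recurrence concrete, here is a minimal NumPy sketch of a single forward pass over a toy sequence; the dimensions, random weights, and toy inputs are illustrative assumptions rather than values from any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b, b_y):
    """One time step of a simple RNN: new hidden state and output."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Illustrative sizes: 4-dimensional inputs, 8-dimensional hidden state, 3 outputs
input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_y = rng.normal(size=(output_dim, hidden_dim))
b, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)                      # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 steps
    h, y = rnn_step(x_t, h, W_x, W_h, W_y, b, b_y)
```

Note how the same weight matrices are reused at every step; the only thing carried forward is the hidden state h.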
Applications in NLP
Simple RNNs can be used for:
- Next word prediction
- Part-of-speech tagging
- Simple text classification
Limitations
The major limitation of simple RNNs is the “vanishing gradient problem.” During backpropagation through time, gradients either vanish or explode as they’re propagated back through many time steps, making it difficult for the network to capture long-term dependencies.
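A small NumPy experiment makes this concrete: backpropagation through time multiplies the gradient by the recurrent Jacobian at every step, and with modest weights the gradient norm quickly decays toward zero. The network size, weight scale, and sequence length below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 8, 50
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
inputs = rng.normal(size=(steps, hidden_dim))

# Forward pass: record every hidden state.
hs = []
h = np.zeros(hidden_dim)
for x_t in inputs:
    h = np.tanh(W_h @ h + x_t)
    hs.append(h)

# Backward pass: propagate a unit gradient from the last step toward the first.
# Each step multiplies by W_h^T * diag(1 - h_t^2), the Jacobian of the recurrence.
grad = np.ones(hidden_dim)
for t in reversed(range(steps)):
    grad = W_h.T @ (grad * (1 - hs[t] ** 2))
    if t % 10 == 0:
        print(f"step {t}: gradient norm {np.linalg.norm(grad):.2e}")
```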
LSTM: Solving the Long-Term Dependency Problem
What is LSTM?
Long Short-Term Memory (LSTM) networks were designed specifically to address the vanishing gradient problem of simple RNNs. Introduced by Hochreiter & Schmidhuber in 1997, LSTMs use a more complex internal structure with gating mechanisms.
How LSTMs Work
LSTMs introduce a cell state (C_t) that runs through the entire sequence, with gates controlling information flow:
- Forget Gate: Decides what to forget from the cell state
f_t = σ(W_f * [h_t-1, x_t] + b_f)
- Input Gate: Decides what new information to store
i_t = σ(W_i * [h_t-1, x_t] + b_i)
C̃_t = tanh(W_C * [h_t-1, x_t] + b_C)
- Cell State Update: Updates the cell state
C_t = f_t * C_t-1 + i_t * C̃_t
- Output Gate: Controls what to output from the cell state
o_t = σ(W_o * [h_t-1, x_t] + b_o)
h_t = o_t * tanh(C_t)
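Putting the four gate equations together, here is a rough NumPy sketch of one LSTM cell update; the weight shapes and toy sequence are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Illustrative sizes
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
shape = (hidden_dim, hidden_dim + input_dim)
W_f, W_i, W_C, W_o = [rng.normal(size=shape) for _ in range(4)]
b_f = b_i = b_C = b_o = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
```

The additive cell state update (f_t * C_t-1 + i_t * C̃_t) is what lets gradients flow across many time steps without vanishing.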
Applications in NLP
LSTMs excel at:
- Machine translation
- Text summarization
- Sentiment analysis
- Named entity recognition
- Speech recognition
Advantages over Simple RNNs
- Better at capturing long-term dependencies
- More resistant to the vanishing gradient problem
- Higher capacity for learning complex patterns
Bidirectional LSTM: Context from Both Directions
What is a Bidirectional LSTM?
A Bidirectional LSTM (BiLSTM) processes sequences in both forward and backward directions, capturing context from both past and future states.
How BiLSTMs Work
BiLSTMs include two separate LSTMs:
- A forward LSTM that processes the sequence from start to end
- A backward LSTM that processes from end to start
The outputs of both networks are typically concatenated or summed, providing a representation that incorporates context from both directions.
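As a quick illustration, here is a minimal PyTorch sketch (the sizes are arbitrary) showing that a bidirectional LSTM's per-step output is the concatenation of the forward and backward hidden states:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM over a toy batch; dimensions are illustrative.
embed_dim, hidden_dim = 16, 32
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(2, 10, embed_dim)      # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = bilstm(x)

# Forward and backward hidden states are concatenated at each time step,
# so the output feature size is 2 * hidden_dim.
print(outputs.shape)                   # torch.Size([2, 10, 64])
```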
Applications in NLP
BiLSTMs are especially powerful for:
- Named entity recognition
- Part-of-speech tagging
- Question answering
- Sentiment analysis
Advantages over Standard LSTMs
- Captures context from both past and future time steps
- Provides richer representations for words in the middle of sequences
- Better performance on tasks where surrounding context matters
Encoder-Decoder Architecture: The Seq2Seq Revolution
What is an Encoder-Decoder Architecture?
The Encoder-Decoder (or Sequence-to-Sequence, Seq2Seq) architecture consists of two RNNs:
- An encoder that processes the input sequence
- A decoder that generates the output sequence
How Encoder-Decoders Work
- Encoder: Processes the input sequence word by word, producing a final hidden state that encapsulates the entire input.
- Decoder: Takes the encoder’s final state and generates output tokens one by one, feeding each generated token back as input for the next step.
In modern implementations, both the encoder and decoder typically use LSTM or GRU cells.
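Here is a bare-bones PyTorch sketch of the idea, assuming single-layer LSTMs and illustrative vocabulary sizes; it is a simplification for exposition, not a reference implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """A minimal encoder-decoder: the decoder is initialized with the
    encoder's final state. Vocabulary sizes and dimensions are illustrative."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.src_embed(src_ids))    # summarize the input
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), (h, c))
        return self.out(dec_out)                             # logits per target step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))    # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1200, (2, 5))    # shifted target tokens (teacher forcing)
logits = model(src, tgt)                # shape: (2, 5, 1200)
```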
Applications in NLP
Encoder-Decoder architectures are ideal for:
- Machine translation
- Text summarization
- Dialogue systems
- Question answering
- Code generation
Advanced Variants: Attention Mechanism
The attention mechanism revolutionized encoder-decoder models by allowing the decoder to “pay attention” to different parts of the input sequence when generating each output token. The formula for attention is:
attention_weights = softmax(score(decoder_hidden_state, encoder_hidden_states))
context_vector = sum(attention_weights * encoder_hidden_states)
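As a concrete example, the following NumPy sketch computes dot-product attention, one common choice for the score function above; the dimensions are illustrative:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Dot-product variant of the attention equations above.
    decoder_state: (hidden_dim,), encoder_states: (src_len, hidden_dim)."""
    scores = encoder_states @ decoder_state      # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ encoder_states           # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))    # 6 source positions, hidden size 8 (illustrative)
dec = rng.normal(size=8)
context, weights = dot_product_attention(dec, enc)
```

The context vector is recomputed for every decoder step, so each output token can draw on a different part of the input.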
Comparison of Architectures
| Architecture | Strengths | Weaknesses | Ideal NLP Tasks |
|---|---|---|---|
| Simple RNN | Simplicity, fewer parameters | Vanishing gradients, limited memory | Very short sequences, simple classification |
| LSTM | Long-term dependencies, stable training | More complex, more parameters | Translation, summarization, general NLP |
| BiLSTM | Context from both directions | Twice as many parameters as LSTM | Named entity recognition, POS tagging |
| Encoder-Decoder | Handles variable-length I/O, maps between sequences | Complex training, slow inference | Machine translation, summarization |
| Encoder-Decoder with Attention | Focuses on relevant parts of input | Most complex of the architectures listed | State-of-the-art MT, summarization |
Visual Diagrams
Simple RNN Architecture
+-----+
| |
| h |<---+
| | |
+-----+ |
^ |
| |
+--+--+ |
| | |
x_t->| RNN |----+
| |
+-----+
|
v
y_t
LSTM Cell Structure
+---+ +---+
| × |<--| σ |<--+
+---+ +---+ |
| |
v |
+---+---+ +---+ |
| |<---| × |<-+
C_t-1 -> | + | | |
| |--->C_t | |
+-------+ ^ | |
| | |
+---+ | |
| × | | |
+---+ | |
^ | |
| | |
+---+ | |
+-->| σ | | |
| +---+ | |
| ^ | |
h_t-1 --->-----+----+ |
| | |
x_t ------+-----+------+
| |
| +---+
+-->| σ |
+---+
|
v
h_t
Bidirectional LSTM Architecture
Forward LSTM
+-----+ +-----+ +-----+
| | | | | |
| LSTM|---->| LSTM|---->| LSTM|
| | | | | |
+-----+ +-----+ +-----+
^ ^ ^
| | |
x_1 x_2 x_3
| | |
v v v
+-----+ +-----+ +-----+
| | | | | |
| LSTM|<----| LSTM|<----| LSTM|
| | | | | |
+-----+ +-----+ +-----+
Backward LSTM
[Combined outputs]
| | |
v v v
y_1 y_2 y_3
Encoder-Decoder Architecture
Encoder Decoder
+-----+-----+-----+ +-----+-----+-----+
| | | | | | | |
| LSTM| LSTM| LSTM| | LSTM| LSTM| LSTM|
| | | | | | | |
+-----+-----+-----+ +-----+-----+-----+
^ ^ ^ ^ ^ ^
| | | | | |
x_1 x_2 x_3 <START> y_1 y_2
| | |
v v v
y_1 y_2 y_3
Encoder-Decoder with Attention
Encoder Decoder
+-----+-----+-----+ +-----+-----+-----+
| | | | | | | |
| LSTM| LSTM| LSTM|<--->| LSTM| LSTM| LSTM|
| | | | | | | |
+-----+-----+-----+ +-----+-----+-----+
^ ^ ^ ^ ^ ^
| | | | | |
x_1 x_2 x_3 <START> y_1 y_2
| | |
v v v
y_1 y_2 y_3
NLP Algorithms Using These Architectures
Text Classification with LSTM
- Preprocessing:
- Tokenize text
- Convert tokens to embeddings
- Model Architecture:
- Embedding layer
- LSTM layer(s)
- Dense layer with softmax activation
- Training:
- Cross-entropy loss
- Adam optimizer
- Prediction:
- Feed new text through the model
- Take argmax of softmax outputs
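A minimal PyTorch sketch of this recipe might look as follows; the vocabulary size, dimensions, class count, and toy batch are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> LSTM -> dense classifier, mirroring the recipe above."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.fc(h_n[-1])          # logits; softmax is folded into the loss

model = LSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()          # cross-entropy loss, as described above

tokens = torch.randint(0, 5000, (8, 20)) # toy batch: 8 sequences of 20 token ids
labels = torch.randint(0, 2, (8,))
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()

preds = model(tokens).argmax(dim=-1)     # prediction: argmax over class logits
```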
Named Entity Recognition with BiLSTM
- Preprocessing:
- Tokenize text
- Convert tokens to embeddings
- Create BIO/IOB tags for entities
- Model Architecture:
- Embedding layer
- BiLSTM layer(s)
- Time-distributed dense layer with softmax
- Training:
- Cross-entropy loss (or CRF loss)
- Often includes a CRF layer for coherent predictions
- Prediction:
- Feed new text through the model
- Decode the most likely sequence of tags
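A rough PyTorch sketch of the tagging model (without the optional CRF layer) could look like this; the tag set size and other dimensions are illustrative:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embedding -> BiLSTM -> per-token classifier over BIO tags.
    A CRF layer could replace the plain per-token softmax."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)   # one score vector per token

    def forward(self, token_ids):
        out, _ = self.bilstm(self.embed(token_ids))
        return self.fc(out)              # (batch, seq_len, num_tags) logits

tagger = BiLSTMTagger()
tokens = torch.randint(0, 5000, (4, 15))
tag_logits = tagger(tokens)
predicted_tags = tag_logits.argmax(dim=-1)   # greedy decoding of the tag sequence
```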
Machine Translation with Encoder-Decoder + Attention
- Preprocessing:
- Tokenize source and target text
- Create vocabulary for both languages
- Convert tokens to indices
- Model Architecture:
- Source embedding layer
- Encoder (LSTM/BiLSTM)
- Attention mechanism
- Decoder (LSTM)
- Target embedding layer
- Output dense layer with softmax
- Training:
- Teacher forcing (use ground truth as next input)
- Cross-entropy loss
- Prediction:
- Encode source sentence
- Generate target tokens one by one
- Use beam search to find best translation
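To illustrate the prediction loop, here is a sketch of greedy decoding that reuses the illustrative Seq2Seq model from the encoder-decoder section; a real system would typically use beam search, which keeps several candidate translations at each step rather than only the single best token. The start and end token ids are assumed to exist in the target vocabulary.

```python
import torch

def greedy_decode(model, src_ids, start_id, end_id, max_len=30):
    """Greedy decoding with the illustrative Seq2Seq sketch from earlier:
    encode once, then feed each predicted token back in as the next input."""
    _, state = model.encoder(model.src_embed(src_ids))     # encode the source once
    token = torch.full((src_ids.size(0), 1), start_id)     # start with <START>
    output_ids = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_embed(token), state)
        token = model.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
        output_ids.append(token)
        if (token == end_id).all():                        # stop at <END>
            break
    return torch.cat(output_ids, dim=1)
```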
Conclusion
The evolution from simple RNNs to attention-based encoder-decoder models has dramatically improved the capabilities of NLP systems. While transformers and large language models have since surpassed these architectures in many tasks, understanding these fundamental RNN-based models provides valuable insights into the development of sequence modeling in deep learning.
Each architecture builds upon the previous one, addressing specific limitations:
- LSTMs solved the vanishing gradient problem of simple RNNs
- BiLSTMs incorporated context from both directions
- Encoder-Decoder models enabled variable-length sequence-to-sequence mapping
- Attention mechanisms allowed models to focus on relevant parts of the input
Understanding these architectures and their evolution provides a solid foundation for working with modern NLP systems and developing new approaches to language understanding and generation.