Understanding RNN Architectures for NLP: From Simple to Complex
Natural Language Processing (NLP) has evolved dramatically with the development of increasingly sophisticated neural network architectures. In this blog post, we’ll explore various recurrent neural network (RNN) architectures that have revolutionized NLP tasks, from basic RNNs to complex encoder-decoder models.
Simple RNN: The Foundation
What is a Simple RNN?
A Simple Recurrent Neural Network (RNN) is the most basic form of recurrent architecture designed to handle sequential data. Unlike feedforward networks, RNNs include connections that feed the network’s previous state back into the current state, creating a form of “memory” about past inputs.
How Simple RNNs Work
In a simple RNN, at each time step t, the network:
- Takes in the current input x_t
- Combines it with the previous hidden state h_t-1
- Produces a new hidden state h_t and an output
The formula for this computation is:
h_t = tanh(W_x * x_t + W_h * h_t-1 + b)
y_t = W_y * h_t + b_y
Where:
- W_x, W_h, and W_y are weight matrices
- b and b_y are bias vectors
- tanh is the activation function
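To make the recurrence concrete, here is a minimal NumPy sketch of a single forward pass over a toy sequence; the dimensions, random weights, and toy inputs are illustrative assumptions rather than values from any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b, b_y):
    """One time step of a simple RNN: new hidden state and output."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Illustrative sizes: 4-dimensional inputs, 8-dimensional hidden state, 3 outputs
input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_y = rng.normal(size=(output_dim, hidden_dim))
b, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)                      # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 steps
    h, y = rnn_step(x_t, h, W_x, W_h, W_y, b, b_y)
```

Note how the same weight matrices are reused at every step; the only thing carried forward is the hidden state h.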
Applications in NLP
Simple RNNs can be used for:
- Next word prediction
- Part-of-speech tagging
- Simple text classification
Limitations
The major limitation of simple RNNs is the “vanishing gradient problem.” During backpropagation through time, gradients either vanish or explode as they’re propagated back through many time steps, making it difficult for the network to capture long-term dependencies.
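A small NumPy experiment makes this concrete: backpropagation through time multiplies the gradient by the recurrent Jacobian at every step, and with modest weights the gradient norm quickly decays toward zero. The network size, weight scale, and sequence length below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 8, 50
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
inputs = rng.normal(size=(steps, hidden_dim))

# Forward pass: record every hidden state.
hs = []
h = np.zeros(hidden_dim)
for x_t in inputs:
    h = np.tanh(W_h @ h + x_t)
    hs.append(h)

# Backward pass: propagate a unit gradient from the last step toward the first.
# Each step multiplies by W_h^T * diag(1 - h_t^2), the Jacobian of the recurrence.
grad = np.ones(hidden_dim)
for t in reversed(range(steps)):
    grad = W_h.T @ (grad * (1 - hs[t] ** 2))
    if t % 10 == 0:
        print(f"step {t}: gradient norm {np.linalg.norm(grad):.2e}")
```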
LSTM: Solving the Long-Term Dependency Problem
What is LSTM?
Long Short-Term Memory (LSTM) networks were designed specifically to address the vanishing gradient problem of simple RNNs. Introduced by Hochreiter & Schmidhuber in 1997, LSTMs use a more complex internal structure with gating mechanisms.
How LSTMs Work
LSTMs introduce a cell state (C_t) that runs through the entire sequence, with gates controlling information flow:
- Forget Gate: Decides what to forget from the cell state
f_t = σ(W_f * [h_t-1, x_t] + b_f)
- Input Gate: Decides what new information to store
i_t = σ(W_i * [h_t-1, x_t] + b_i)
C̃_t = tanh(W_C * [h_t-1, x_t] + b_C)
- Cell State Update: Updates the cell state
C_t = f_t * C_t-1 + i_t * C̃_t
- Output Gate: Controls what to output from the cell state
o_t = σ(W_o * [h_t-1, x_t] + b_o)
h_t = o_t * tanh(C_t)
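Putting the four gate equations together, here is a rough NumPy sketch of one LSTM cell update; the weight shapes and toy sequence are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Illustrative sizes
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
shape = (hidden_dim, hidden_dim + input_dim)
W_f, W_i, W_C, W_o = [rng.normal(size=shape) for _ in range(4)]
b_f = b_i = b_C = b_o = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
```

The additive cell state update (f_t * C_t-1 + i_t * C̃_t) is what lets gradients flow across many time steps without vanishing.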
Applications in NLP
LSTMs excel at:
- Machine translation
- Text summarization
- Sentiment analysis
- Named entity recognition
- Speech recognition
Advantages over Simple RNNs
- Better at capturing long-term dependencies
- More resistant to the vanishing gradient problem
- Higher capacity for learning complex patterns
Bidirectional LSTM: Context from Both Directions
What is a Bidirectional LSTM?
A Bidirectional LSTM (BiLSTM) processes sequences in both forward and backward directions, capturing context from both past and future states.
How BiLSTMs Work
BiLSTMs include two separate LSTMs:
- A forward LSTM that processes the sequence from start to end
- A backward LSTM that processes from end to start
The outputs of both networks are typically concatenated or summed, providing a representation that incorporates context from both directions.
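As a quick illustration, here is a minimal PyTorch sketch (the sizes are arbitrary) showing that a bidirectional LSTM's per-step output is the concatenation of the forward and backward hidden states:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM over a toy batch; dimensions are illustrative.
embed_dim, hidden_dim = 16, 32
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(2, 10, embed_dim)      # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = bilstm(x)

# Forward and backward hidden states are concatenated at each time step,
# so the output feature size is 2 * hidden_dim.
print(outputs.shape)                   # torch.Size([2, 10, 64])
```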
Applications in NLP
BiLSTMs are especially powerful for:
- Named entity recognition
- Part-of-speech tagging
- Question answering
- Sentiment analysis
Advantages over Standard LSTMs
- Captures context from both past and future time steps
- Provides richer representations for words in the middle of sequences
- Better performance on tasks where surrounding context matters
Encoder-Decoder Architecture: The Seq2Seq Revolution
What is an Encoder-Decoder Architecture?
The Encoder-Decoder (or Sequence-to-Sequence, Seq2Seq) architecture consists of two RNNs:
- An encoder that processes the input sequence
- A decoder that generates the output sequence
How Encoder-Decoders Work
- Encoder: Processes the input sequence word by word, producing a final hidden state that encapsulates the entire input.
- Decoder: Takes the encoder’s final state and generates output tokens one by one, feeding each generated token back as input for the next step.
In modern implementations, both the encoder and decoder typically use LSTM or GRU cells.
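Here is a bare-bones PyTorch sketch of the idea, assuming single-layer LSTMs and illustrative vocabulary sizes; it is a simplification for exposition, not a reference implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """A minimal encoder-decoder: the decoder is initialized with the
    encoder's final state. Vocabulary sizes and dimensions are illustrative."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.src_embed(src_ids))    # summarize the input
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), (h, c))
        return self.out(dec_out)                             # logits per target step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))    # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1200, (2, 5))    # shifted target tokens (teacher forcing)
logits = model(src, tgt)                # shape: (2, 5, 1200)
```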
Applications in NLP
Encoder-Decoder architectures are ideal for:
- Machine translation
- Text summarization
- Dialogue systems
- Question answering
- Code generation
Advanced Variants: Attention Mechanism
The attention mechanism revolutionized encoder-decoder models by allowing the decoder to “pay attention” to different parts of the input sequence when generating each output token. The formula for attention is:
attention_weights = softmax(score(decoder_hidden_state, encoder_hidden_states))
context_vector = sum(attention_weights * encoder_hidden_states)
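As a concrete example, the following NumPy sketch computes dot-product attention, one common choice for the score function above; the dimensions are illustrative:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Dot-product variant of the attention equations above.
    decoder_state: (hidden_dim,), encoder_states: (src_len, hidden_dim)."""
    scores = encoder_states @ decoder_state      # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ encoder_states           # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))    # 6 source positions, hidden size 8 (illustrative)
dec = rng.normal(size=8)
context, weights = dot_product_attention(dec, enc)
```

The context vector is recomputed for every decoder step, so each output token can draw on a different part of the input.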
Comparison of Architectures
| Architecture | Strengths | Weaknesses | Ideal NLP Tasks |
|---|---|---|---|
| Simple RNN | Simplicity, fewer parameters | Vanishing gradients, limited memory | Very short sequences, simple classification |
| LSTM | Long-term dependencies, stable training | More complex, more parameters | Translation, summarization, general NLP |
| BiLSTM | Context from both directions | Twice as many parameters as LSTM | Named entity recognition, POS tagging |
| Encoder-Decoder | Handles variable-length I/O, maps between sequences | Complex training, slow inference | Machine translation, summarization |
| Encoder-Decoder with Attention | Focuses on relevant parts of input | Most complex of the architectures listed | State-of-the-art MT, summarization |
Visual Diagrams
Simple RNN Architecture
+-----+
| |
| h |<---+
| | |
+-----+ |
^ |
| |
+--+--+ |
| | |
x_t->| RNN |----+
| |
+-----+
|
v
y_t
LSTM Cell Structure
+---+ +---+
| × |<--| σ |<--+
+---+ +---+ |
| |
v |
+---+---+ +---+ |
| |<---| × |<-+
C_t-1 -> | + | | |
| |--->C_t | |
+-------+ ^ | |
| | |
+---+ | |
| × | | |
+---+ | |
^ | |
| | |
+---+ | |
+-->| σ | | |
| +---+ | |
| ^ | |
h_t-1 --->-----+----+ |
| | |
x_t ------+-----+------+
| |
| +---+
+-->| σ |
+---+
|
v
h_t
Bidirectional LSTM Architecture
Forward LSTM
+-----+ +-----+ +-----+
| | | | | |
| LSTM|---->| LSTM|---->| LSTM|
| | | | | |
+-----+ +-----+ +-----+
^ ^ ^
| | |
x_1 x_2 x_3
| | |
v v v
+-----+ +-----+ +-----+
| | | | | |
| LSTM|<----| LSTM|<----| LSTM|
| | | | | |
+-----+ +-----+ +-----+
Backward LSTM
[Combined outputs]
| | |
v v v
y_1 y_2 y_3
Encoder-Decoder Architecture
Encoder Decoder
+-----+-----+-----+ +-----+-----+-----+
| | | | | | | |
| LSTM| LSTM| LSTM| | LSTM| LSTM| LSTM|
| | | | | | | |
+-----+-----+-----+ +-----+-----+-----+
^ ^ ^ ^ ^ ^
| | | | | |
x_1 x_2 x_3 <START> y_1 y_2
| | |
v v v
y_1 y_2 y_3
Encoder-Decoder with Attention
Encoder Decoder
+-----+-----+-----+ +-----+-----+-----+
| | | | | | | |
| LSTM| LSTM| LSTM|<--->| LSTM| LSTM| LSTM|
| | | | | | | |
+-----+-----+-----+ +-----+-----+-----+
^ ^ ^ ^ ^ ^
| | | | | |
x_1 x_2 x_3 <START> y_1 y_2
| | |
v v v
y_1 y_2 y_3
NLP Algorithms Using These Architectures
Text Classification with LSTM
- Preprocessing:
- Tokenize text
- Convert tokens to embeddings
- Model Architecture:
- Embedding layer
- LSTM layer(s)
- Dense layer with softmax activation
- Training:
- Cross-entropy loss
- Adam optimizer
- Prediction:
- Feed new text through the model
- Take argmax of softmax outputs
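A minimal PyTorch sketch of this recipe might look as follows; the vocabulary size, dimensions, class count, and toy batch are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> LSTM -> dense classifier, mirroring the recipe above."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.fc(h_n[-1])          # logits; softmax is folded into the loss

model = LSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()          # cross-entropy loss, as described above

tokens = torch.randint(0, 5000, (8, 20)) # toy batch: 8 sequences of 20 token ids
labels = torch.randint(0, 2, (8,))
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()

preds = model(tokens).argmax(dim=-1)     # prediction: argmax over class logits
```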
Named Entity Recognition with BiLSTM
- Preprocessing:
- Tokenize text
- Convert tokens to embeddings
- Create BIO/IOB tags for entities
- Model Architecture:
- Embedding layer
- BiLSTM layer(s)
- Time-distributed dense layer with softmax
- Training:
- Cross-entropy loss (or CRF loss)
- Often includes a CRF layer for coherent predictions
- Prediction:
- Feed new text through the model
- Decode the most likely sequence of tags
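A rough PyTorch sketch of the tagging model (without the optional CRF layer) could look like this; the tag set size and other dimensions are illustrative:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embedding -> BiLSTM -> per-token classifier over BIO tags.
    A CRF layer could replace the plain per-token softmax."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)   # one score vector per token

    def forward(self, token_ids):
        out, _ = self.bilstm(self.embed(token_ids))
        return self.fc(out)              # (batch, seq_len, num_tags) logits

tagger = BiLSTMTagger()
tokens = torch.randint(0, 5000, (4, 15))
tag_logits = tagger(tokens)
predicted_tags = tag_logits.argmax(dim=-1)   # greedy decoding of the tag sequence
```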
Machine Translation with Encoder-Decoder + Attention
- Preprocessing:
- Tokenize source and target text
- Create vocabulary for both languages
- Convert tokens to indices
- Model Architecture:
- Source embedding layer
- Encoder (LSTM/BiLSTM)
- Attention mechanism
- Decoder (LSTM)
- Target embedding layer
- Output dense layer with softmax
- Training:
- Teacher forcing (use ground truth as next input)
- Cross-entropy loss
- Prediction:
- Encode source sentence
- Generate target tokens one by one
- Use beam search to find best translation
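To illustrate the prediction loop, here is a sketch of greedy decoding that reuses the illustrative Seq2Seq model from the encoder-decoder section; a real system would typically use beam search, which keeps several candidate translations at each step rather than only the single best token. The start and end token ids are assumed to exist in the target vocabulary.

```python
import torch

def greedy_decode(model, src_ids, start_id, end_id, max_len=30):
    """Greedy decoding with the illustrative Seq2Seq sketch from earlier:
    encode once, then feed each predicted token back in as the next input."""
    _, state = model.encoder(model.src_embed(src_ids))     # encode the source once
    token = torch.full((src_ids.size(0), 1), start_id)     # start with <START>
    output_ids = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_embed(token), state)
        token = model.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
        output_ids.append(token)
        if (token == end_id).all():                        # stop at <END>
            break
    return torch.cat(output_ids, dim=1)
```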
Conclusion
The evolution from simple RNNs to attention-based encoder-decoder models has dramatically improved the capabilities of NLP systems. While transformers and large language models have since surpassed these architectures in many tasks, understanding these fundamental RNN-based models provides valuable insights into the development of sequence modeling in deep learning.
Each architecture builds upon the previous one, addressing specific limitations:
- LSTMs solved the vanishing gradient problem of simple RNNs
- BiLSTMs incorporated context from both directions
- Encoder-Decoder models enabled variable-length sequence-to-sequence mapping
- Attention mechanisms allowed models to focus on relevant parts of the input
Understanding these architectures and their evolution provides a solid foundation for working with modern NLP systems and developing new approaches to language understanding and generation.