Types of Recurrent Neural Networks: Architectures, Examples and Implementation

Recurrent Neural Networks (RNNs) are powerful sequence processing models that can handle data with temporal relationships. Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain memory of previous inputs. This makes them particularly effective for tasks involving sequential data like text, speech, time series, and more.

In this blog, we’ll explore different RNN architectures based on their input-output relationships, along with practical examples and implementations for each type.

Table of Contents

  1. One-to-One RNNs
  2. One-to-Many RNNs
  3. Many-to-One RNNs
  4. Many-to-Many RNNs (Synchronized)
  5. Many-to-Many RNNs (Encoder-Decoder)
  6. Advanced RNN Architectures

One-to-One RNNs

Architecture

One-to-One RNNs are the simplest form and technically not recurrent at all. They take a single input and produce a single output, much like a standard feedforward neural network.

Input(x) → Neural Network → Output(y)

Example Use Case

Image classification, where a single image is classified into a single category.

Implementation

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.layer2(out)
        return out

# Example usage
input_size = 784  # For MNIST
hidden_size = 128
output_size = 10  # 10 digits

model = SimpleNN(input_size, hidden_size, output_size)

# For a single input
x = torch.randn(1, input_size)
output = model(x)
print(output.shape)  # torch.Size([1, 10])

Explanation

This isn’t truly an RNN since it lacks recurrence, but it’s included as a baseline for comparison. The model takes a flattened image (784 pixels for MNIST) and outputs a raw score (logit) for each of the 10 digit classes; applying a softmax turns these scores into class probabilities.
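
Continuing the example above, here is a minimal sketch of how the logits would typically be used; the label tensor is a made-up placeholder:

import torch.nn.functional as F

# Convert logits to class probabilities for inspection or inference
probs = F.softmax(output, dim=1)       # (1, 10), each row sums to 1
predicted_digit = probs.argmax(dim=1)  # index of the most likely class

# For training, CrossEntropyLoss takes the raw logits plus integer labels
criterion = nn.CrossEntropyLoss()
labels = torch.tensor([3])             # hypothetical ground-truth digit
loss = criterion(output, labels)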

One-to-Many RNNs

Architecture

One-to-Many RNNs take a single input and produce a sequence of outputs. The input is processed once, and the network then generates the output sequence one step at a time, feeding its own state forward at each step.

Single Input(x) → RNN → Output sequence(y₁, y₂, ..., yₙ)

Example Use Case

Image captioning, where a single image generates a sequence of words describing it.

Implementation

import torch
import torch.nn as nn

class ImageCaptioningRNN(nn.Module):
    def __init__(self, image_feature_size, hidden_size, vocab_size, seq_length):
        super(ImageCaptioningRNN, self).__init__()
        self.hidden_size = hidden_size
        self.seq_length = seq_length
        
        # Image feature to hidden state
        self.image_to_hidden = nn.Linear(image_feature_size, hidden_size)
        
        # RNN cell (using GRU for simplicity)
        self.rnn_cell = nn.GRUCell(hidden_size, hidden_size)
        
        # Output layer
        self.hidden_to_output = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, image_features):
        batch_size = image_features.size(0)
        
        # Initialize hidden state from image features
        hidden = self.image_to_hidden(image_features)
        
        # Container for outputs
        outputs = []
        
        # Input for first step is the hidden state
        input_t = hidden
        
        # Generate sequence
        for t in range(self.seq_length):
            # Update hidden state
            hidden = self.rnn_cell(input_t, hidden)
            
            # Generate output for this timestep
            output = self.hidden_to_output(hidden)
            outputs.append(output)
            
            # Next input is the current hidden state
            input_t = hidden
        
        # Stack outputs along sequence dimension
        outputs = torch.stack(outputs, dim=1)
        return outputs

# Example usage
image_feature_size = 2048  # From a CNN like ResNet
hidden_size = 512
vocab_size = 10000
seq_length = 20  # Maximum caption length

model = ImageCaptioningRNN(image_feature_size, hidden_size, vocab_size, seq_length)

# For a batch of images
image_features = torch.randn(32, image_feature_size)
captions = model(image_features)
print(captions.shape)  # torch.Size([32, 20, 10000])

Explanation

This RNN uses a single image input to generate a sequence of words. After projecting the image features into an initial hidden state, the model generates the sequence step by step, feeding the current hidden state back in as the next input. At each step, the output is a vector of scores (logits) over the vocabulary; a softmax converts them into a probability distribution over possible words.
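
Continuing the example above, a minimal sketch of turning those scores into predicted word indices (mapping the indices back to actual words depends on the vocabulary in use, which is not shown here):

import torch.nn.functional as F

# captions: (batch_size, seq_length, vocab_size) logits from the model
probs = F.softmax(captions, dim=-1)   # per-step distribution over the vocabulary
predicted_ids = probs.argmax(dim=-1)  # (batch_size, seq_length) word indices
print(predicted_ids.shape)            # torch.Size([32, 20])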

Many-to-One RNNs

Architecture

Many-to-One RNNs process a sequence of inputs and produce a single output, typically at the end of the sequence.

Input sequence(x₁, x₂, ..., xₙ) → RNN → Single Output(y)

Example Use Case

Sentiment analysis of text, where a sequence of words is classified as positive or negative.

Implementation

import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super(SentimentRNN, self).__init__()
        self.hidden_size = hidden_size
        
        # Word embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # RNN layer
        self.rnn = nn.GRU(embedding_dim, hidden_size, batch_first=True)
        
        # Output layer
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        # x shape: (batch_size, sequence_length)
        
        # Embed the input sequence
        x = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)
        
        # Process through RNN
        _, hidden = self.rnn(x)  # hidden: (1, batch_size, hidden_size)
        
        # Use the final hidden state
        hidden = hidden.squeeze(0)  # (batch_size, hidden_size)
        
        # Pass through output layer
        output = self.fc(hidden)  # (batch_size, output_size)
        
        return output

# Example usage
vocab_size = 20000
embedding_dim = 300
hidden_size = 256
output_size = 2  # Binary sentiment (positive/negative)

model = SentimentRNN(vocab_size, embedding_dim, hidden_size, output_size)

# For a batch of sequences
batch_size = 64
sequence_length = 100
input_sequences = torch.randint(0, vocab_size, (batch_size, sequence_length))
sentiment = model(input_sequences)
print(sentiment.shape)  # torch.Size([64, 2])

Explanation

This model processes a sequence of word indices, embeds them, and passes them through a GRU layer. Only the final hidden state is used: it is passed through a linear layer to produce the sentiment classification.
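
A minimal training step for this model might look like the following sketch; the labels are random placeholders and the learning rate is purely illustrative:

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

labels = torch.randint(0, output_size, (batch_size,))  # placeholder sentiment labels

optimizer.zero_grad()
logits = model(input_sequences)   # (batch_size, 2)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(loss.item())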

Many-to-Many RNNs (Synchronized)

Architecture

In synchronized Many-to-Many RNNs, there’s an output for each input at the same time step.

Input sequence(x₁, x₂, ..., xₙ) → RNN → Output sequence(y₁, y₂, ..., yₙ)

Example Use Case

Part-of-speech tagging, where each word in a sentence is tagged with its grammatical role.

Implementation

import torch
import torch.nn as nn

class POSTaggerRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_tags):
        super(POSTaggerRNN, self).__init__()
        
        # Word embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Bidirectional RNN for better context
        self.rnn = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)
        
        # Output layer (accounting for bidirectionality)
        self.fc = nn.Linear(hidden_size * 2, num_tags)
        
    def forward(self, x):
        # x shape: (batch_size, sequence_length)
        
        # Embed the input sequence
        x = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)
        
        # Process through RNN
        outputs, _ = self.rnn(x)  # outputs: (batch_size, sequence_length, hidden_size*2)
        
        # Pass each output through the final layer
        tag_space = self.fc(outputs)  # (batch_size, sequence_length, num_tags)
        
        return tag_space

# Example usage
vocab_size = 20000
embedding_dim = 300
hidden_size = 256
num_tags = 45  # Number of POS tags

model = POSTaggerRNN(vocab_size, embedding_dim, hidden_size, num_tags)

# For a batch of sequences
batch_size = 32
sequence_length = 50
input_sequences = torch.randint(0, vocab_size, (batch_size, sequence_length))
pos_tags = model(input_sequences)
print(pos_tags.shape)  # torch.Size([32, 50, 45])

Explanation

This bidirectional LSTM processes a sequence of words and outputs a tag prediction for each word in the sequence. The model embeds each word, processes the entire sequence with a bidirectional LSTM to capture context in both directions, and then maps each hidden state to a tag probability distribution.
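
Because the output keeps the time dimension, the usual training trick is to flatten the batch and sequence dimensions before applying cross-entropy. A minimal sketch, with randomly generated placeholder tags:

criterion = nn.CrossEntropyLoss()

gold_tags = torch.randint(0, num_tags, (batch_size, sequence_length))  # placeholder labels

# Compare (batch*seq, num_tags) scores against (batch*seq,) tag indices
loss = criterion(pos_tags.view(-1, num_tags), gold_tags.view(-1))
print(loss.item())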

Many-to-Many RNNs (Encoder-Decoder)

Architecture

The encoder-decoder architecture first processes the entire input sequence (encoder) and then generates an output sequence (decoder), possibly of different length.

Input sequence(x₁, x₂, ..., xₙ) → Encoder → [State] → Decoder → Output sequence(y₁, y₂, ..., yₘ)

Example Use Case

Machine translation, where a sentence in one language is translated to another language.

Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_vocab_size, embedding_dim, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(input_vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        
    def forward(self, x):
        # x shape: (batch_size, sequence_length)
        
        embedded = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)
        outputs, (hidden, cell) = self.lstm(embedded)
        
        return outputs, hidden, cell

class Decoder(nn.Module):
    def __init__(self, output_vocab_size, embedding_dim, hidden_size):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(output_vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_vocab_size)
        
    def forward(self, x, hidden, cell):
        # x shape: (batch_size, 1)
        
        embedded = self.embedding(x)  # (batch_size, 1, embedding_dim)
        
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # output: (batch_size, 1, hidden_size)
        
        prediction = self.fc(output.squeeze(1))  # (batch_size, output_vocab_size)
        
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, source, target, teacher_forcing_ratio=0.5):
        # source: (batch_size, source_length)
        # target: (batch_size, target_length)
        
        batch_size = source.shape[0]
        target_length = target.shape[1]
        target_vocab_size = self.decoder.fc.out_features
        
        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, target_length, target_vocab_size).to(self.device)
        
        # Encode the source sequence
        _, hidden, cell = self.encoder(source)
        
        # First decoder input is the <SOS> token
        decoder_input = target[:, 0].unsqueeze(1)  # (batch_size, 1)
        
        for t in range(1, target_length):
            # Pass through decoder
            output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            
            # Store prediction
            outputs[:, t, :] = output
            
            # Decide whether to use teacher forcing
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            
            # Get the highest predicted token
            top1 = output.argmax(1)
            
            # Use either prediction or actual target as next input
            decoder_input = target[:, t].unsqueeze(1) if teacher_force else top1.unsqueeze(1)
        
        return outputs

# Example usage
input_vocab_size = 10000
output_vocab_size = 8000
embedding_dim = 256
hidden_size = 512
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

encoder = Encoder(input_vocab_size, embedding_dim, hidden_size).to(device)
decoder = Decoder(output_vocab_size, embedding_dim, hidden_size).to(device)
model = Seq2Seq(encoder, decoder, device).to(device)

# For a batch of sequences
batch_size = 16
source_length = 20
target_length = 25
source = torch.randint(0, input_vocab_size, (batch_size, source_length)).to(device)
target = torch.randint(0, output_vocab_size, (batch_size, target_length)).to(device)

output = model(source, target)
print(output.shape)  # torch.Size([16, 25, 8000])

Explanation

This is a more complex sequence-to-sequence model with an encoder-decoder architecture. The encoder processes the entire input sequence and passes its final state to the decoder, which then generates an output sequence. The implementation includes teacher forcing, where during training the model sometimes uses the true target from the previous time step instead of its own prediction.
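
At training time the decoder outputs are compared against the target sequence. Position 0 of the output tensor is left as zeros in this implementation (it corresponds to the <SOS> token), so it is excluded from the loss. A minimal sketch continuing the example above:

criterion = nn.CrossEntropyLoss()

# Skip the first time step, then flatten for cross-entropy
logits = output[:, 1:, :].reshape(-1, output_vocab_size)  # (batch*(target_length-1), vocab)
gold = target[:, 1:].reshape(-1)                          # (batch*(target_length-1),)
loss = criterion(logits, gold)
print(loss.item())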

Advanced RNN Architectures

While the basic types above cover the fundamental input-output relationships, modern RNN applications often use more advanced architectures:

LSTM (Long Short-Term Memory)

LSTMs address the vanishing gradient problem in standard RNNs with a more complex cell that maintains a separate cell state regulated by input, forget, and output gates.
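
Any of the GRU-based models above could swap in an LSTM; the main API difference in PyTorch is that nn.LSTM carries both a hidden state and a cell state. A minimal sketch with arbitrary sizes:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=256, batch_first=True)

x = torch.randn(8, 50, 300)    # (batch, seq_len, features)
outputs, (h_n, c_n) = lstm(x)  # outputs: (8, 50, 256)
print(h_n.shape, c_n.shape)    # both torch.Size([1, 8, 256])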

GRU (Gated Recurrent Unit)

GRUs are a simplified version of LSTMs with fewer parameters but similar capabilities for long-term dependency modeling.
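
The parameter savings are easy to verify: for the same sizes, a GRU uses three sets of gate weights where an LSTM uses four. A quick check with arbitrary sizes:

import torch.nn as nn

lstm = nn.LSTM(300, 256, batch_first=True)
gru = nn.GRU(300, 256, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))  # the GRU has about 3/4 as many parameters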

Bidirectional RNNs

These process sequences in both forward and backward directions to capture context from both past and future time steps.
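
The POS tagger above already uses this idea; the main practical detail is that the output feature dimension doubles, as this quick check with arbitrary sizes shows:

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=300, hidden_size=256, batch_first=True, bidirectional=True)

x = torch.randn(8, 50, 300)
outputs, h_n = rnn(x)
print(outputs.shape)  # torch.Size([8, 50, 512]), forward and backward states concatenated
print(h_n.shape)      # torch.Size([2, 8, 256]), one final state per direction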

Attention Mechanisms

Attention allows models to focus on relevant parts of the input sequence when generating each output element, significantly improving performance for tasks like translation.
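
A minimal sketch of dot-product attention over encoder outputs, as it might be added to the decoder above; the tensors here are random stand-ins, and this is not the full Bahdanau or Luong formulation:

import torch
import torch.nn.functional as F

# encoder_outputs: (batch, src_len, hidden); decoder_hidden: (batch, hidden)
encoder_outputs = torch.randn(16, 20, 512)
decoder_hidden = torch.randn(16, 512)

# Score each encoder position against the current decoder state
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)  # (16, 20)
weights = F.softmax(scores, dim=1)                                           # attention weights

# Context vector: weighted sum of encoder outputs
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (16, 512)
print(context.shape)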

Transformers

While not traditional RNNs, transformers have largely replaced RNNs in many sequence processing tasks due to their parallelization capabilities and stronger performance.
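
For comparison, a minimal sketch using PyTorch's built-in transformer encoder on the same kind of batched input (assuming a recent PyTorch version that supports batch_first here):

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(16, 50, 512)  # (batch, seq_len, d_model)
out = encoder(x)
print(out.shape)              # torch.Size([16, 50, 512])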

Conclusion

Recurrent Neural Networks offer versatile architectures for processing sequential data across various applications. From the simple one-to-many models for image captioning to the complex encoder-decoder structures for machine translation, RNNs and their variants enable powerful modeling of temporal dependencies.

Understanding the different input-output relationships in RNN architectures is crucial for selecting the right approach for your specific sequence processing task. While traditional RNNs have been widely used, more advanced architectures like LSTMs, GRUs, and attention-based models have pushed the boundaries of sequence modeling even further.
