Sentiment Analysis with RNNs: An End-to-End Project and Technical Exploration

In today’s digital landscape, understanding sentiment from text data has become a crucial component for businesses and researchers alike. This blog post explores an end-to-end implementation of a sentiment analysis system using Recurrent Neural Networks (RNNs), with a detailed examination of the underlying code, architecture decisions, and deployment strategy.

Try the IMDB Sentiment Analysis web app (model accuracy above 90%): analyze the sentiment of any IMDB review using the embedded Streamlit application.

Introduction to the Project

The Sentiment Analysis RNN project by Tejas K provides a comprehensive implementation of sentiment analysis that takes raw text as input and classifies it as positive, negative, or neutral. What makes this project stand out is its careful attention to the entire machine learning pipeline, from data preprocessing to deployment.

Let’s delve into the technical aspects of this implementation.

Data Preprocessing: The Foundation

The quality of any NLP model heavily depends on how well the text data is preprocessed. The project implements several crucial preprocessing steps:

# Requires the NLTK 'punkt', 'stopwords', and 'wordnet' resources
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return ' '.join(tokens)

This preprocessing function performs several important operations:

  1. Converting text to lowercase to ensure consistent processing
  2. Removing HTML tags that might be present in web-scraped data
  3. Filtering out special characters and numbers to focus on alphabetic content
  4. Tokenizing the text into individual words
  5. Removing stopwords (common words like “the”, “and”, etc.) that typically don’t carry sentiment
  6. Lemmatizing words to reduce them to their base form
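
To see the function in action, here is roughly what it produces for a short, markup-laden review (an illustrative example, not output taken from the project; exact results depend on the installed NLTK resources):

sample = "This movie was <b>absolutely</b> FANTASTIC!!! A 10/10 performance."
print(preprocess_text(sample))
# -> "movie absolutely fantastic performance"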

Building the Vocabulary: Tokenization and Embedding

Before feeding text to an RNN, we need to convert words into numerical vectors. The project implements a vocabulary builder and embedding mechanism:

class Vocabulary:
    def __init__(self, max_size=None):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.word_count = {}
        self.max_size = max_size
    
    def add_word(self, word):
        if word not in self.word_count:
            self.word_count[word] = 1
        else:
            self.word_count[word] += 1
    
    def build_vocab(self):
        # Sort words by frequency
        sorted_words = sorted(self.word_count.items(), key=lambda x: x[1], reverse=True)
        
        # Take only max_size most common words if specified
        if self.max_size:
            sorted_words = sorted_words[:self.max_size-2]  # -2 for <PAD> and <UNK>
        
        # Add words to dictionaries
        for word, _ in sorted_words:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word
    
    def text_to_indices(self, text, max_length=None):
        words = text.split()
        indices = [self.word2idx.get(word, self.word2idx["<UNK>"]) for word in words]
        
        if max_length:
            if len(indices) > max_length:
                indices = indices[:max_length]
            else:
                indices += [self.word2idx["<PAD>"]] * (max_length - len(indices))
        
        return indices

This vocabulary class:

  1. Maintains mappings between words and their numerical indices
  2. Counts word frequencies to build a vocabulary of the most common words
  3. Handles unknown words with a special <UNK> token
  4. Pads sequences to a consistent length with a <PAD> token
  5. Converts text to sequences of indices for model processing
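
A small usage sketch (with made-up example sentences, not data from the project) shows how these pieces fit together:

corpus = ["great movie", "terrible movie", "great acting"]
vocab = Vocabulary(max_size=1000)
for sentence in corpus:
    for word in sentence.split():
        vocab.add_word(word)
vocab.build_vocab()

print(vocab.word2idx)
# {'<PAD>': 0, '<UNK>': 1, 'great': 2, 'movie': 3, 'terrible': 4, 'acting': 5}
print(vocab.text_to_indices("great film", max_length=4))
# 'film' is out of vocabulary -> <UNK>; padded to length 4: [2, 1, 0, 0]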

The Core: RNN Model Architecture

The heart of the project is the RNN model architecture. The implementation uses PyTorch to build a flexible model that can be configured with different RNN cell types (LSTM or GRU) and embedding dimensions:

import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx, cell_type='lstm'):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        if cell_type.lower() == 'lstm':
            self.rnn = nn.LSTM(embedding_dim, 
                              hidden_dim, 
                              num_layers=n_layers, 
                              bidirectional=bidirectional, 
                              dropout=dropout if n_layers > 1 else 0,
                              batch_first=True)
        elif cell_type.lower() == 'gru':
            self.rnn = nn.GRU(embedding_dim, 
                             hidden_dim, 
                             num_layers=n_layers, 
                             bidirectional=bidirectional, 
                             dropout=dropout if n_layers > 1 else 0,
                             batch_first=True)
        else:
            raise ValueError("cell_type must be 'lstm' or 'gru'")
        
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        # text = [batch size, seq length]
        embedded = self.dropout(self.embedding(text))
        # embedded = [batch size, seq length, embedding dim]
        
        # Pack sequence for RNN efficiency
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), 
                                                         batch_first=True, enforce_sorted=False)
        
        if isinstance(self.rnn, nn.LSTM):
            packed_output, (hidden, _) = self.rnn(packed_embedded)
        else:  # GRU
            packed_output, hidden = self.rnn(packed_embedded)
            
        # hidden = [n layers * n directions, batch size, hidden dim]
        
        # If bidirectional, concatenate the final forward and backward hidden states
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
            
        # hidden = [batch size, hidden dim * n directions]
        
        return self.fc(hidden)

This model includes several key components:

  1. An embedding layer that converts word indices to dense vectors
  2. A configurable RNN layer (either LSTM or GRU) that processes the sequence
  3. Support for bidirectional processing to capture context from both directions
  4. Dropout for regularization to prevent overfitting
  5. A final fully connected layer for classification
  6. Efficient sequence packing to handle variable-length inputs
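
The hyperparameter values below are illustrative assumptions rather than the project's actual configuration, but they show how the class might be instantiated for the three-class problem:

model = SentimentRNN(vocab_size=20000,      # vocabulary size incl. <PAD> and <UNK>
                     embedding_dim=100,
                     hidden_dim=256,
                     output_dim=3,          # negative / neutral / positive
                     n_layers=2,
                     bidirectional=True,
                     dropout=0.5,
                     pad_idx=0,             # index of the <PAD> token
                     cell_type='lstm')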

Training the Model: The Learning Process

The training loop implements several best practices for deep learning:

def train_model(model, train_iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    epoch_acc = 0
    
    for batch in train_iterator:
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        predictions = model(text, text_lengths)
        
        loss = criterion(predictions, batch.label)
        acc = calculate_accuracy(predictions, batch.label)
        
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    
    return epoch_loss / len(train_iterator), epoch_acc / len(train_iterator)

Notable aspects include:

  1. Setting the model to training mode with model.train()
  2. Zeroing gradients before each batch to prevent accumulation
  3. Computing loss and accuracy for monitoring training progress
  4. Implementing gradient clipping to prevent exploding gradients
  5. Updating model weights with the optimizer
  6. Tracking and returning average loss and accuracy
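
The training loop also relies on a calculate_accuracy helper that is not shown in the post; a minimal sketch for multi-class classification (assuming raw logits and integer class labels) could look like this:

def calculate_accuracy(predictions, labels):
    # predictions: [batch size, n classes] raw logits; labels: [batch size]
    predicted_classes = predictions.argmax(dim=1)
    correct = (predicted_classes == labels).float()
    return correct.mean()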

Evaluation and Testing: Measuring Performance

The evaluation function follows a similar structure but disables certain training-specific components:

def evaluate_model(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths)
            
            loss = criterion(predictions, batch.label)
            acc = calculate_accuracy(predictions, batch.label)
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Key differences from the training function:

  1. Setting the model to evaluation mode with model.eval()
  2. Using torch.no_grad() to disable gradient calculation for efficiency
  3. Not performing backward passes or optimizer steps
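
In practice, the same function serves both for validation during training and for the final held-out test set, for example (assuming the test_iterator built by the data pipeline):

test_loss, test_acc = evaluate_model(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')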

Model Deployment: From PyTorch to Streamlit

The project’s deployment strategy involves exporting the trained PyTorch model to TorchScript for production use:

import json
import torch

def export_model(model, vocab):
    model.eval()
    
    # Create a script module from the PyTorch model
    example_text = torch.randint(0, len(vocab.word2idx), (1, 10))  # random token indices as dummy input
    example_lengths = torch.tensor([10])
    
    traced_model = torch.jit.trace(model, (example_text, example_lengths))
    
    # Save the scripted model
    torch.jit.save(traced_model, "sentiment_model.pt")
    
    # Save the vocabulary
    with open("vocab.json", "w") as f:
        json.dump({
            "word2idx": vocab.word2idx,
            "idx2word": {int(k): v for k, v in vocab.idx2word.items()}
        }, f)
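
One caveat: torch.jit.trace records the operations executed for the example inputs. That is fine here, because the LSTM/GRU branch in forward is fixed when the model is constructed, but models with genuinely data-dependent control flow would typically be exported with torch.jit.script instead.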

The exported model is then integrated into a Streamlit application for easy access:

import json

import streamlit as st
import torch
import torch.nn.functional as F

def load_model():
    # Load the TorchScript model
    model = torch.jit.load("sentiment_model.pt")
    
    # Load vocabulary
    with open("vocab.json", "r") as f:
        vocab_data = json.load(f)
        
    # Recreate vocabulary object
    vocab = Vocabulary()
    vocab.word2idx = vocab_data["word2idx"]
    vocab.idx2word = {int(k): v for k, v in vocab_data["idx2word"].items()}
    
    return model, vocab

def predict_sentiment(model, vocab, text):
    # Preprocess text
    processed_text = preprocess_text(text)
    
    # Convert to indices
    indices = vocab.text_to_indices(processed_text, max_length=100)
    tensor = torch.LongTensor(indices).unsqueeze(0)  # Add batch dimension
    length = torch.tensor([len(indices)])
    
    # Make prediction
    model.eval()
    with torch.no_grad():
        prediction = model(tensor, length)
        
    # Get probability using softmax
    probabilities = F.softmax(prediction, dim=1)
    
    # Get predicted class
    predicted_class = torch.argmax(prediction, dim=1).item()
    
    # Map to sentiment
    sentiment_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
    
    return {
        "sentiment": sentiment_map[predicted_class],
        "confidence": probabilities[0][predicted_class].item(),
        "probabilities": {
            sentiment_map[i]: prob.item() for i, prob in enumerate(probabilities[0])
        }
    }
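
A quick call outside of Streamlit shows the shape of the result (the numbers are of course model-dependent; this is an illustrative run, not output from the project):

model, vocab = load_model()
result = predict_sentiment(model, vocab, "One of the best films I have seen in years!")
print(result["sentiment"], f"{result['confidence']*100:.1f}%")
# e.g. Positive 97.3%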

The Streamlit application code brings everything together in a user-friendly interface:

def main():
    st.title("Sentiment Analysis with RNN")
    
    model, vocab = load_model()
    
    st.write("Enter text to analyze its sentiment:")
    user_input = st.text_area("Text input", "")
    
    if st.button("Analyze Sentiment"):
        if user_input:
            with st.spinner("Analyzing..."):
                result = predict_sentiment(model, vocab, user_input)
            
            st.write(f"**Sentiment:** {result['sentiment']}")
            st.write(f"**Confidence:** {result['confidence']*100:.2f}%")
            
            # Display probabilities
            st.write("### Probability Distribution")
            for sentiment, prob in result['probabilities'].items():
                st.write(f"{sentiment}: {prob*100:.2f}%")
                st.progress(prob)
        else:
            st.warning("Please enter some text to analyze.")

if __name__ == "__main__":
    main()
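
Assuming the interface above is saved in a file such as app.py (the filename is an assumption, not taken from the project), it can be launched locally with streamlit run app.py.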

Embedding the Streamlit application into the WordPress site is handled with an iframe; its parameters and styling ensure:

  1. The dark theme specified with embed_options=dark_theme
  2. Responsive design that works on different screen sizes
  3. Clean integration with the WordPress site’s aesthetics
  4. Proper sizing to accommodate the application’s interface

Performance Optimization and Model Improvements

The project implements several performance optimizations:

  1. Batch processing during training to improve GPU utilization:
def create_iterators(train_data, valid_data, test_data, batch_size=64):
    train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
        (train_data, valid_data, test_data), 
        batch_size=batch_size,
        sort_key=lambda x: len(x.text),
        sort_within_batch=True,
        device=device)
    
    return train_iterator, valid_iterator, test_iterator
  2. Early stopping to prevent overfitting:
def train_with_early_stopping(model, train_iterator, valid_iterator,
                             optimizer, criterion, patience=5, max_epochs=50):
    best_valid_loss = float('inf')
    epochs_without_improvement = 0
    
    for epoch in range(max_epochs):
        train_loss, train_acc = train_model(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate_model(model, valid_iterator, criterion)
        
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'best-model.pt')
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        
        print(f'Epoch: {epoch+1}')
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
        print(f'\tVal. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
        
        if epochs_without_improvement >= patience:
            print(f'Early stopping after {epoch+1} epochs')
            break
    
    # Load the best model
    model.load_state_dict(torch.load('best-model.pt'))
    return model
  3. Learning rate scheduling for better convergence:
optimizer = optim.Adam(model.parameters(), lr=2e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', 
                                                factor=0.5, patience=2)

# In training loop
scheduler.step(valid_loss)
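
One way to tie the scheduler into the training process, assuming the functions defined earlier are in scope (this helper is an illustration, not code from the project; the early-stopping bookkeeping shown above is omitted for brevity):

def train_with_scheduler(model, train_iterator, valid_iterator,
                         optimizer, scheduler, criterion, max_epochs=20):
    for epoch in range(max_epochs):
        train_loss, train_acc = train_model(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate_model(model, valid_iterator, criterion)
        
        # Lower the learning rate when the validation loss plateaus
        scheduler.step(valid_loss)
        
        print(f'Epoch {epoch+1}: val loss {valid_loss:.3f}, val acc {valid_acc*100:.2f}%')
    return model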

Conclusion: Putting It All Together

The Sentiment Analysis RNN project demonstrates how to build a complete NLP system from data preprocessing to web deployment. Key technical takeaways include:

  1. Effective text preprocessing is crucial for good model performance
  2. RNNs (particularly LSTMs and GRUs) excel at capturing sequential dependencies in text
  3. Proper training techniques like early stopping and learning rate scheduling improve model quality
  4. Model export and deployment bridges the gap between development and production
  5. Web integration makes the model accessible to end-users without technical knowledge

By embedding the Streamlit application in a WordPress site, this technical solution becomes accessible to a wider audience, showcasing how advanced NLP techniques can be applied to practical problems.

The combination of robust model architecture, efficient training procedures, and user-friendly deployment makes this project an excellent case study in applied deep learning for natural language processing.

You can explore the full implementation on GitHub or try the live demo in the Streamlit app.