Sentiment Analysis with RNN End-to-End Project: A Technical Exploration
In today’s digital landscape, understanding sentiment from text data has become a crucial component for businesses and researchers alike. This blog post explores an end-to-end implementation of a sentiment analysis system using Recurrent Neural Networks (RNNs), with a detailed examination of the underlying code, architecture decisions, and deployment strategy.
Try the Sentiment WebApp (model accuracy > 90%)
IMDB Sentiment Analysis Webapp: analyze the sentiment of any IMDB review using our Sentiment Analysis Tool.
Introduction to the Project
The Sentiment Analysis RNN project by Tejas K provides a comprehensive implementation of sentiment analysis that takes raw text as input and classifies it into positive, negative, or neutral categories. What makes this project stand out is its careful attention to the entire machine learning pipeline from data preprocessing to deployment.
Let’s delve into the technical aspects of this implementation.
Data Preprocessing: The Foundation
The quality of any NLP model heavily depends on how well the text data is preprocessed. The project implements several crucial preprocessing steps:
import re

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)
This preprocessing function performs several important operations:
- Converting text to lowercase to ensure consistent processing
- Removing HTML tags that might be present in web-scraped data
- Filtering out special characters and numbers to focus on alphabetic content
- Tokenizing the text into individual words
- Removing stopwords (common words like “the”, “and”, etc.) that typically don’t carry sentiment
- Lemmatizing words to reduce them to their base form
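To make the effect concrete, here is roughly what the function produces for a short review. The sample text is illustrative, and the exact output depends on the NLTK resources installed (punkt tokenizer, stopword list, WordNet data):

sample = "This movie was absolutely wonderful! 10/10 <br /> Would watch again."
print(preprocess_text(sample))
# Expected output (approximately): "movie absolutely wonderful would watch"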
Building the Vocabulary: Tokenization and Embedding
Before feeding text to an RNN, we need to convert words into numerical vectors. The project implements a vocabulary builder and embedding mechanism:
class Vocabulary:
    def __init__(self, max_size=None):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.word_count = {}
        self.max_size = max_size

    def add_word(self, word):
        if word not in self.word_count:
            self.word_count[word] = 1
        else:
            self.word_count[word] += 1

    def build_vocab(self):
        # Sort words by frequency
        sorted_words = sorted(self.word_count.items(), key=lambda x: x[1], reverse=True)
        # Take only the max_size most common words if specified
        if self.max_size:
            sorted_words = sorted_words[:self.max_size - 2]  # -2 for <PAD> and <UNK>
        # Add words to the dictionaries
        for word, _ in sorted_words:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def text_to_indices(self, text, max_length=None):
        words = text.split()
        indices = [self.word2idx.get(word, self.word2idx["<UNK>"]) for word in words]
        if max_length:
            if len(indices) > max_length:
                indices = indices[:max_length]
            else:
                indices += [self.word2idx["<PAD>"]] * (max_length - len(indices))
        return indices
This vocabulary class:
- Maintains mappings between words and their numerical indices
- Counts word frequencies to build a vocabulary of the most common words
- Handles unknown words with a special <UNK> token
- Pads sequences to a consistent length with a <PAD> token
- Converts text to sequences of indices for model processing
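A minimal sketch of how this class might be wired up on a small, already-preprocessed corpus (the variable names and corpus here are illustrative, not taken from the project):

corpus = ["movie absolutely wonderful", "plot terrible acting worse"]

vocab = Vocabulary(max_size=10000)
for review in corpus:
    for word in review.split():
        vocab.add_word(word)
vocab.build_vocab()

# Convert one review to a fixed-length index sequence (padded/truncated to 10 tokens)
indices = vocab.text_to_indices("movie plot wonderful", max_length=10)
print(indices)  # e.g. [2, 5, 4, 0, 0, 0, 0, 0, 0, 0] -- exact ids depend on word frequencies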
The Core: RNN Model Architecture
The heart of the project is the RNN model architecture. The implementation uses PyTorch to build a flexible model that can be configured with different RNN cell types (LSTM or GRU) and embedding dimensions:
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx, cell_type='lstm'):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        if cell_type.lower() == 'lstm':
            self.rnn = nn.LSTM(embedding_dim,
                               hidden_dim,
                               num_layers=n_layers,
                               bidirectional=bidirectional,
                               dropout=dropout if n_layers > 1 else 0,
                               batch_first=True)
        elif cell_type.lower() == 'gru':
            self.rnn = nn.GRU(embedding_dim,
                              hidden_dim,
                              num_layers=n_layers,
                              bidirectional=bidirectional,
                              dropout=dropout if n_layers > 1 else 0,
                              batch_first=True)
        else:
            raise ValueError("cell_type must be 'lstm' or 'gru'")
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text = [batch size, seq length]
        embedded = self.dropout(self.embedding(text))
        # embedded = [batch size, seq length, embedding dim]

        # Pack the sequence for RNN efficiency
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(),
                                                            batch_first=True, enforce_sorted=False)
        if isinstance(self.rnn, nn.LSTM):
            packed_output, (hidden, _) = self.rnn(packed_embedded)
        else:  # GRU
            packed_output, hidden = self.rnn(packed_embedded)

        # hidden = [n layers * n directions, batch size, hidden dim]
        # If bidirectional, concatenate the final forward and backward hidden states
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        else:
            hidden = self.dropout(hidden[-1, :, :])
        # hidden = [batch size, hidden dim * n directions]
        return self.fc(hidden)
This model includes several key components:
- An embedding layer that converts word indices to dense vectors
- A configurable RNN layer (either LSTM or GRU) that processes the sequence
- Support for bidirectional processing to capture context from both directions
- Dropout for regularization to prevent overfitting
- A final fully connected layer for classification
- Efficient sequence packing to handle variable-length inputs
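For orientation, instantiating the model might look like the following. The hyperparameter values are assumptions for illustration (the post does not list the exact settings); only the three-class output and the vocabulary built earlier are taken from the source:

# Illustrative instantiation -- hyperparameters are assumptions, not the project's exact values
VOCAB_SIZE = len(vocab.word2idx)
PAD_IDX = vocab.word2idx["<PAD>"]

model = SentimentRNN(
    vocab_size=VOCAB_SIZE,
    embedding_dim=100,
    hidden_dim=256,
    output_dim=3,        # negative / neutral / positive
    n_layers=2,
    bidirectional=True,
    dropout=0.5,
    pad_idx=PAD_IDX,
    cell_type='lstm',
)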
Training the Model: The Learning Process
The training loop implements several best practices for deep learning:
def train_model(model, train_iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    epoch_acc = 0
    for batch in train_iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths)
        loss = criterion(predictions, batch.label)
        acc = calculate_accuracy(predictions, batch.label)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(train_iterator), epoch_acc / len(train_iterator)
Notable aspects include:
- Setting the model to training mode with model.train()
- Zeroing gradients before each batch to prevent accumulation
- Computing loss and accuracy for monitoring training progress
- Implementing gradient clipping to prevent exploding gradients
- Updating model weights with the optimizer
- Tracking and returning average loss and accuracy
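The loop relies on a calculate_accuracy helper that is not shown in the post. For a multi-class setup like this one, a minimal version could look like the sketch below (an assumption, not necessarily the project's exact implementation):

import torch

def calculate_accuracy(predictions, labels):
    # Fraction of examples where the highest-scoring class matches the label
    predicted_classes = predictions.argmax(dim=1)
    correct = (predicted_classes == labels).float()
    return correct.mean()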
Evaluation and Testing: Measuring Performance
The evaluation function follows a similar structure but disables certain training-specific components:
def evaluate_model(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths)
            loss = criterion(predictions, batch.label)
            acc = calculate_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
Key differences from the training function:
- Setting the model to evaluation mode with model.eval()
- Using torch.no_grad() to disable gradient calculation for efficiency
- Not performing backward passes or optimizer steps
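To tie the two functions together, a typical per-epoch driver might look like this. The loss function and epoch count are assumptions consistent with the multi-class output; the learning rate matches the value shown later in the post, and train_iterator/valid_iterator come from the batching helper described below:

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()   # assumed multi-class loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)

for epoch in range(5):              # illustrative epoch count
    train_loss, train_acc = train_model(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate_model(model, valid_iterator, criterion)
    print(f"Epoch {epoch+1}: train acc {train_acc*100:.2f}%, valid acc {valid_acc*100:.2f}%")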
Model Deployment: From PyTorch to Streamlit
The project’s deployment strategy involves exporting the trained PyTorch model to TorchScript for production use:
def export_model(model, vocab):
    model.eval()
    # Create a script module from the PyTorch model
    # (vocabulary size taken from word2idx, since Vocabulary does not define __len__)
    example_text = torch.randint(0, len(vocab.word2idx), (1, 10))
    example_lengths = torch.tensor([10])
    traced_model = torch.jit.trace(model, (example_text, example_lengths))
    # Save the scripted model
    torch.jit.save(traced_model, "sentiment_model.pt")
    # Save the vocabulary
    with open("vocab.json", "w") as f:
        json.dump({
            "word2idx": vocab.word2idx,
            "idx2word": {int(k): v for k, v in vocab.idx2word.items()}
        }, f)
The exported model is then integrated into a Streamlit application for easy access:
import json

import torch
import torch.nn.functional as F

def load_model():
    # Load the TorchScript model
    model = torch.jit.load("sentiment_model.pt")
    # Load the vocabulary
    with open("vocab.json", "r") as f:
        vocab_data = json.load(f)
    # Recreate the vocabulary object
    vocab = Vocabulary()
    vocab.word2idx = vocab_data["word2idx"]
    vocab.idx2word = {int(k): v for k, v in vocab_data["idx2word"].items()}
    return model, vocab

def predict_sentiment(model, vocab, text):
    # Preprocess the text
    processed_text = preprocess_text(text)
    # Convert to indices
    indices = vocab.text_to_indices(processed_text, max_length=100)
    tensor = torch.LongTensor(indices).unsqueeze(0)  # Add batch dimension
    length = torch.tensor([len(indices)])
    # Make the prediction
    model.eval()
    with torch.no_grad():
        prediction = model(tensor, length)
    # Get probabilities using softmax
    probabilities = F.softmax(prediction, dim=1)
    # Get the predicted class
    predicted_class = torch.argmax(prediction, dim=1).item()
    # Map to a sentiment label
    sentiment_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
    return {
        "sentiment": sentiment_map[predicted_class],
        "confidence": probabilities[0][predicted_class].item(),
        "probabilities": {
            sentiment_map[i]: prob.item() for i, prob in enumerate(probabilities[0])
        }
    }
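Calling the prediction function on a raw review returns a dictionary the UI can render directly. For example (the review text and the numbers shown are purely illustrative):

model, vocab = load_model()
result = predict_sentiment(model, vocab, "One of the best films I have seen in years.")
print(result["sentiment"])      # e.g. "Positive"
print(result["confidence"])     # e.g. 0.93
print(result["probabilities"])  # e.g. {"Negative": 0.03, "Neutral": 0.04, "Positive": 0.93}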
The Streamlit application code brings everything together in a user-friendly interface:
import streamlit as st

def main():
    st.title("Sentiment Analysis with RNN")
    model, vocab = load_model()
    st.write("Enter text to analyze its sentiment:")
    user_input = st.text_area("Text input", "")
    if st.button("Analyze Sentiment"):
        if user_input:
            with st.spinner("Analyzing..."):
                result = predict_sentiment(model, vocab, user_input)
                st.write(f"**Sentiment:** {result['sentiment']}")
                st.write(f"**Confidence:** {result['confidence']*100:.2f}%")
                # Display probabilities
                st.write("### Probability Distribution")
                for sentiment, prob in result['probabilities'].items():
                    st.write(f"{sentiment}: {prob*100:.2f}%")
                    st.progress(prob)
        else:
            st.warning("Please enter some text to analyze.")

if __name__ == "__main__":
    main()
The Streamlit application is embedded in the WordPress site through an iframe, whose parameters and styling ensure:
- The dark theme specified with embed_options=dark_theme
- Responsive design that works on different screen sizes
- Clean integration with the WordPress site’s aesthetics
- Proper sizing to accommodate the application’s interface
Performance Optimization and Model Improvements
The project implements several performance optimizations:
- Batch processing during training to improve GPU utilization:
def create_iterators(train_data, valid_data, test_data, batch_size=64):
    # Note: `data` refers to torchtext's data module (BucketIterator), and `device`
    # is assumed to be defined elsewhere, e.g. torch.device('cuda').
    train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
        (train_data, valid_data, test_data),
        batch_size=batch_size,
        sort_key=lambda x: len(x.text),
        sort_within_batch=True,
        device=device)
    return train_iterator, valid_iterator, test_iterator
- Early stopping to prevent overfitting:
def train_with_early_stopping(model, train_iterator, valid_iterator,
                              optimizer, criterion, patience=5):
    # `max_epochs` is assumed to be defined elsewhere (e.g. max_epochs = 50)
    best_valid_loss = float('inf')
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_loss, train_acc = train_model(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate_model(model, valid_iterator, criterion)
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'best-model.pt')
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        print(f'Epoch: {epoch+1}')
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
        print(f'\tVal. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
        if epochs_without_improvement >= patience:
            print(f'Early stopping after {epoch+1} epochs')
            break
    # Load the best model
    model.load_state_dict(torch.load('best-model.pt'))
    return model
- Learning rate scheduling for better convergence:
optimizer = optim.Adam(model.parameters(), lr=2e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                 factor=0.5, patience=2)

# In the training loop
scheduler.step(valid_loss)
Conclusion: Putting It All Together
The Sentiment Analysis RNN project demonstrates how to build a complete NLP system from data preprocessing to web deployment. Key technical takeaways include:
- Effective text preprocessing is crucial for good model performance
- RNNs (particularly LSTMs and GRUs) excel at capturing sequential dependencies in text
- Proper training techniques like early stopping and learning rate scheduling improve model quality
- Model export and deployment bridges the gap between development and production
- Web integration makes the model accessible to end-users without technical knowledge
By embedding the Streamlit application in a WordPress site, this technical solution becomes accessible to a wider audience, showcasing how advanced NLP techniques can be applied to practical problems.
The combination of robust model architecture, efficient training procedures, and user-friendly deployment makes this project an excellent case study in applied deep learning for natural language processing.
You can explore the full implementation on GitHub or try the live demo in the Streamlit app linked above.