Natural Language Processing (NLP) requires converting human language into numerical formats that computers can understand. This guide explores major text representation techniques in depth, comparing their strengths, weaknesses, and practical applications.
1. One-Hot Encoding
One-hot encoding is a fundamental representation technique that forms the conceptual foundation for many text representation methods.
How It Works
One-hot encoding represents each word as a binary vector with a length equal to the vocabulary size. For a vocabulary of size V:
- Create a vector of length V filled with zeros
- Set the position corresponding to the word’s index to 1
- All other positions remain 0
Detailed Example
Consider a small vocabulary: [“apple”, “banana”, “cherry”, “date”, “elderberry”]
One-hot encodings:
- “apple” = [1, 0, 0, 0, 0]
- “banana” = [0, 1, 0, 0, 0]
- “cherry” = [0, 0, 1, 0, 0]
- “date” = [0, 0, 0, 1, 0]
- “elderberry” = [0, 0, 0, 0, 1]
To represent the sentence “I like apple and banana”:
- We would create five separate vectors for each word
- Words not in our vocabulary (like “I”, “like”, “and”) would either be ignored or added to the vocabulary
Mathematical Formulation
For a vocabulary V = {w₁, w₂, …, wₙ}, the one-hot encoding of word wᵢ is a vector v where:
- v[j] = 1 if j = i
- v[j] = 0 if j ≠ i
Advantages
- Simplicity: Straightforward to implement and understand
- Unique Representation: Each word has a distinct representation
- No Assumptions: Makes no assumptions about relationships between words
- Lossless: Preserves word identity perfectly
Disadvantages
- Dimensionality: For real vocabularies (50,000+ words), vectors become enormous
- Sparsity: Most elements are zero, wasting memory and computation
- No Semantic Information: “apple” and “fruit” are as different as “apple” and “rocket”
- No Contextual Information: The same word always has the same representation regardless of usage
Code Implementation
def one_hot_encode(word, vocabulary):
    vector = [0] * len(vocabulary)
    if word in vocabulary:
        vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["apple", "banana", "cherry", "date", "elderberry"]
print(one_hot_encode("banana", vocabulary))  # [0, 1, 0, 0, 0]
print(one_hot_encode("apple", vocabulary))   # [1, 0, 0, 0, 0]
2. Bag of Words (BoW)
Bag of Words builds on one-hot encoding to represent entire documents rather than individual words.
How It Works
- Create a vocabulary from all unique words in the corpus
- For each document:
  - Initialize a vector of zeros with length equal to vocabulary size
  - For each word in the document, increment the corresponding position
- The final vector contains counts of word occurrences
Detailed Example
Consider two documents:
- Document 1: “The cat sat on the mat”
- Document 2: “The dog chased the cat”
Vocabulary: [“the”, “cat”, “sat”, “on”, “mat”, “dog”, “chased”]
BoW representations:
- Document 1: [2, 1, 1, 1, 1, 0, 0] (2 occurrences of “the”, 1 of “cat”, etc.)
- Document 2: [2, 1, 0, 0, 0, 1, 1]
Mathematical Formulation
For a document D and vocabulary V = {w₁, w₂, …, wₙ}, the BoW representation is a vector v where:
- v[i] = count of word wᵢ in document D
Advantages
- Frequency Information: Captures how often words appear
- Document Comparison: Enables comparing documents based on content
- Simplicity: Easy to implement and understand
- Scalability: Works well with many classification algorithms
- Success in Practice: Despite simplicity, works well for many tasks like spam detection and document categorization
Disadvantages
- Loss of Order: “The cat chased the dog” and “The dog chased the cat” have identical representations
- Equal Weighting: Common words like “the” get high values despite low information content
- Sparse Representation: Most entries are zero for large vocabularies
- No Semantics: Doesn’t capture word relationships or meanings
Practical Applications
- Sentiment Analysis: Determining whether reviews are positive or negative
- Spam Detection: Identifying unwanted emails
- Document Categorization: Sorting documents into topics
Code Implementation
from collections import Counter

def create_bow(document, vocabulary):
    word_counts = Counter(document.lower().split())
    return [word_counts.get(word, 0) for word in vocabulary]

vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "chased"]
doc1 = "The cat sat on the mat"
doc2 = "The dog chased the cat"

bow1 = create_bow(doc1, vocabulary)
bow2 = create_bow(doc2, vocabulary)

print(bow1)  # [2, 1, 1, 1, 1, 0, 0]
print(bow2)  # [2, 1, 0, 0, 0, 1, 1]
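The same counts can be produced with scikit-learn, which also builds the vocabulary automatically. A brief sketch using CountVectorizer (its default tokenizer lowercases the text and sorts the vocabulary alphabetically, so the column order differs from the manual version above):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cat sat on the mat", "The dog chased the cat"]

vectorizer = CountVectorizer()           # lowercases and tokenizes by default
X = vectorizer.fit_transform(corpus)     # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # alphabetically ordered vocabulary
print(X.toarray())                         # word counts per document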
3. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF enhances BoW by weighting terms based on their importance within and across documents.
How It Works
TF-IDF consists of two components:
- Term Frequency (TF): Measures how frequently a term appears in a document
  - TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)
- Inverse Document Frequency (IDF): Measures how important a term is across the corpus
  - IDF(t) = log(Total number of documents / Number of documents containing term t)
The final TF-IDF score is: TF-IDF(t,d) = TF(t,d) × IDF(t)
Detailed Example
Consider a corpus of three documents:
- Doc1: “The cat sat on the mat”
- Doc2: “The dog chased the cat”
- Doc3: “The bird flew over the house”
Let’s calculate the TF-IDF for the word “cat” in Doc1:
- Term Frequency for “cat” in Doc1:
  - TF(“cat”, Doc1) = 1/6 ≈ 0.167
- Inverse Document Frequency for “cat”:
  - “cat” appears in 2 out of 3 documents
  - IDF(“cat”) = log(3/2) ≈ 0.176 (using a base-10 logarithm; a natural log would give ≈ 0.405)
- TF-IDF for “cat” in Doc1:
  - TF-IDF(“cat”, Doc1) = 0.167 × 0.176 ≈ 0.029
Compare this with the common word “the”:
- TF(“the”, Doc1) = 2/6 = 0.333
- IDF(“the”) = log(3/3) = log(1) = 0
- TF-IDF(“the”,Doc1) = 0.333 × 0 = 0
This shows how TF-IDF reduces the weight of common words that appear in all documents.
Mathematical Formulation
For term t in document d, from a corpus D:
- TF(t,d) = f(t,d) / Σₓ f(x,d) where f(t,d) is the count of term t in document d
- IDF(t) = log(|D| / |{d ∈ D : t ∈ d}|) where |D| is the total number of documents
- TF-IDF(t,d) = TF(t,d) × IDF(t)
Advantages
- Word Importance: Distinguishes between common and distinctive terms
- Weighting Mechanism: Reduces the impact of high-frequency, low-information words
- Enhanced Discrimination: Highlights words that characterize specific documents
- Proven Effectiveness: Outperforms raw BoW in many tasks
- Interpretability: Values have clear meaning (higher = more distinctive)
Disadvantages
- Still Ignores Order: Word sequence is not considered
- Corpus Dependency: IDF calculation requires a complete corpus
- No Semantic Understanding: Doesn’t capture word relationships
- Fixed Vocabulary: Struggles with out-of-vocabulary words
- Limited Context: Doesn’t capture word usage context
Practical Applications
- Information Retrieval: Powering search engines
- Document Clustering: Grouping similar documents
- Feature Extraction: Creating input features for machine learning algorithms
- Keyword Extraction: Identifying the most distinctive words in text
Code Implementation
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Manual implementation
def compute_tfidf(corpus):
    # Create vocabulary
    all_words = set()
    for doc in corpus:
        for word in doc.lower().split():
            all_words.add(word)
    vocabulary = list(all_words)

    # Calculate document frequency
    doc_freq = Counter()
    for doc in corpus:
        words_in_doc = set(doc.lower().split())
        for word in words_in_doc:
            doc_freq[word] += 1

    # Calculate TF-IDF
    tfidf_vectors = []
    for doc in corpus:
        word_counts = Counter(doc.lower().split())
        total_words = len(doc.lower().split())
        tfidf_vector = []
        for word in vocabulary:
            # Term frequency
            tf = word_counts.get(word, 0) / total_words
            # Inverse document frequency
            idf = np.log(len(corpus) / doc_freq.get(word, 1))
            # TF-IDF
            tfidf_vector.append(tf * idf)
        tfidf_vectors.append(tfidf_vector)
    return tfidf_vectors, vocabulary

# Example usage
corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The bird flew over the house"
]

tfidf_vectors, vocab = compute_tfidf(corpus)
print(f"Vocabulary: {vocab}")
print(f"TF-IDF for document 1: {tfidf_vectors[0]}")

# Using scikit-learn
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("scikit-learn TF-IDF:")
print(X.toarray())
4. Word2Vec
Word2Vec represents a paradigm shift in text representation by generating dense, continuous vector embeddings that capture semantic relationships between words.
How It Works
Word2Vec uses shallow neural networks with two main architectures:
- Skip-gram:
  - Input: Target word
  - Output: Context words (surrounding words)
  - The model learns to predict the context from a single word
- Continuous Bag of Words (CBOW):
  - Input: Context words
  - Output: Target word
  - The model learns to predict a word from its context
During training, Word2Vec adjusts word vectors to maximize the probability of correct predictions, resulting in semantically similar words having similar embeddings.
Detailed Example
Consider training Word2Vec on this corpus: “The quick brown fox jumps over the lazy dog.”
For a window size of 2, training examples for Skip-gram include:
- Input: “quick”, Output: [“The”, “brown”, “fox”]
- Input: “brown”, Output: [“The”, “quick”, “fox”, “jumps”]
- Input: “fox”, Output: [“brown”, “quick”, “jumps”, “over”]
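These (target, context) pairs are easy to generate; a minimal sketch in plain Python (the window size and whitespace tokenization are simplifications):

def skipgram_pairs(tokens, window=2):
    # Collect (target, context) pairs within the given window on each side
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((target, context))
    return pairs

tokens = "The quick brown fox jumps over the lazy dog".split()
for target, context in skipgram_pairs(tokens)[:3]:
    print(target, "->", context)
# The -> ['quick', 'brown']
# quick -> ['The', 'brown', 'fox']
# brown -> ['The', 'quick', 'fox', 'jumps']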
After training, similar words have similar vectors. For example, the vectors for “king”, “queen”, “man”, and “woman” would capture their semantic relationships, enabling vector arithmetic like:
- vec(“king”) – vec(“man”) + vec(“woman”) ≈ vec(“queen”)
Mathematical Formulation
For the Skip-gram model, the objective is to maximize:
- J = Σᵢ₌₁ᵀ Σⱼ∈context(i) log P(wⱼ|wᵢ)
Where P(wⱼ|wᵢ) is modeled using the softmax function:
- P(wⱼ|wᵢ) = exp(v′wⱼ · vwᵢ) / Σₖ₌₁ⱽ exp(v′wₖ · vwᵢ)
Where vwᵢ is the input (“center”) vector of word wᵢ, and v′wⱼ is the output (“context”) vector of word wⱼ.
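To make the softmax concrete, here is a small NumPy sketch that scores one input word against every output vector; the tiny random matrices stand in for learned embeddings and are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                       # toy vocabulary size and embedding dimension
v_in = rng.normal(size=(V, d))    # input (center) vectors, one per word
v_out = rng.normal(size=(V, d))   # output (context) vectors, one per word

i = 2                             # index of the input word w_i
scores = v_out @ v_in[i]          # dot product of every context vector with v_{w_i}
probs = np.exp(scores - scores.max())
probs /= probs.sum()              # softmax over the whole vocabulary

print(probs)                      # P(w_j | w_i) for every j; sums to 1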
Advantages
- Semantic Relationships: Captures word similarity and relationships
- Dense Representation: Low-dimensional vectors (typically 100-300 dimensions)
- Vector Arithmetic: Enables mathematical operations on word meanings
- Transferability: Pre-trained embeddings can be used across different tasks
- Performance: Dramatically improves results on many NLP tasks
Disadvantages
- Training Requirements: Needs large text corpora and computational resources
- Fixed Vocabulary: Cannot handle out-of-vocabulary words
- Single Representation: One vector per word, regardless of context
- No Polysemy: Cannot represent multiple meanings of the same word
- Black Box: Difficult to interpret individual dimensions
Practical Applications
- Semantic Similarity: Finding related words or documents
- Machine Translation: Improving language translation
- Named Entity Recognition: Identifying entities in text
- Text Classification: Enhancing document categorization
- Recommendation Systems: Finding similar items based on descriptions
Code Implementation
from gensim.models import Word2Vec
import numpy as np

# Sample corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "fox", "is", "quick", "and", "the", "dog", "is", "lazy"],
    ["quick", "brown", "foxes", "jump", "over", "lazy", "dogs"]
]

# Train model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Explore word vectors
print(f"Vector for 'fox': {model.wv['fox'][:5]}...")  # Show first 5 dimensions

# Find similar words
similar_words = model.wv.most_similar("fox", topn=3)
print(f"Words similar to 'fox': {similar_words}")

# Vector arithmetic
result = model.wv.most_similar(positive=["quick", "dog"], negative=["lazy"], topn=1)
print(f"quick - lazy + dog: {result}")

# Manual similarity calculation
cosine_similarity = np.dot(model.wv["fox"], model.wv["dog"]) / (
    np.linalg.norm(model.wv["fox"]) * np.linalg.norm(model.wv["dog"])
)
print(f"Cosine similarity between 'fox' and 'dog': {cosine_similarity}")
5. GloVe (Global Vectors for Word Representation)
GloVe combines the benefits of matrix factorization methods and local context window methods.
How It Works
- Build a co-occurrence matrix X, where Xᵢⱼ represents how often word i appears in the context of word j
- Define a weighting function f(Xᵢⱼ) that gives less weight to rare and extremely common co-occurrences
- Find word vectors wᵢ and context vectors w̃ⱼ such that their dot product approximates the log of the co-occurrence count:
  - wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ ≈ log(Xᵢⱼ)
- Minimize the following objective:
  - J = Σᵢ,ⱼ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ − log(Xᵢⱼ))²
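The first step above, building the co-occurrence matrix, is what the code at the end of this section takes as given. A minimal sketch of that step (symmetric window, no distance weighting; both are simplifying assumptions):

from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    # Count how often each word appears within `window` positions of another word
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(word, tokens[j])] += 1.0
    return counts

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"]]
cooc = build_cooccurrence(sentences)
print(cooc[("the", "cat")])  # co-occurrence count for the pair ("the", "cat")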
Detailed Example
Imagine we’ve analyzed a large corpus and created a co-occurrence matrix:
 | the | cat | sat | mat |
---|---|---|---|---|
the | 0 | 45 | 12 | 32 |
cat | 45 | 0 | 67 | 5 |
sat | 12 | 67 | 0 | 56 |
mat | 32 | 5 | 56 | 0 |
GloVe would find word vectors such that:
- vec(“cat”)·vec(“sat”) > vec(“cat”)·vec(“mat”), because “cat” co-occurs with “sat” (67 times) far more often than with “mat” (5 times)
- vec(“the”)·vec(“cat”) > vec(“the”)·vec(“sat”), because “the” co-occurs with “cat” (45 times) more often than with “sat” (12 times)
After training, GloVe might produce 300-dimensional vectors that capture these statistical relationships. These vectors would support the same analogical reasoning as Word2Vec.
Mathematical Formulation
The GloVe objective function:
- J = Σᵢ,ⱼ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ – log(Xᵢⱼ))²
Where:
- Xᵢⱼ is the co-occurrence count between words i and j
- wᵢ and w̃ⱼ are word and context vectors
- bᵢ and b̃ⱼ are bias terms
- f(Xᵢⱼ) is a weighting function:
  - f(x) = (x/xₘₐₓ)^α if x < xₘₐₓ
  - f(x) = 1 otherwise
Advantages
- Global Statistics: Captures both word-word relationships and corpus-level statistics
- Efficiency: More efficient use of statistics than Word2Vec
- Performance: Often outperforms Word2Vec on analogy tasks
- Explicit Modeling: Directly models co-occurrence probabilities
- Parallelizable: Training can be parallelized more effectively than neural methods
Disadvantages
- Static Embeddings: One representation per word, regardless of context
- Training Data: Requires substantial text data for good performance
- Memory Usage: Co-occurrence matrix can be massive for large vocabularies
- Out-of-Vocabulary: Cannot handle words not seen during training
- No Polysemy: Cannot represent multiple word senses
Practical Applications
- Document Classification: Improving classification accuracy
- Machine Translation: Enhancing translation quality
- Named Entity Recognition: Identifying entities in text
- Question Answering: Improving understanding of questions and contexts
- Transfer Learning: Providing pre-trained representations for other tasks
Code Implementation
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Simplified GloVe-style training (actual implementation would be more complex)
def train_simplified_glove(co_occurrence_matrix, vector_size=50, iterations=50, learning_rate=0.05):
    vocab_size = co_occurrence_matrix.shape[0]

    # Initialize random word and context vectors
    W = np.random.randn(vocab_size, vector_size) * 0.01
    W_context = np.random.randn(vocab_size, vector_size) * 0.01
    b = np.zeros(vocab_size)
    b_context = np.zeros(vocab_size)

    # Training loop
    for iteration in range(iterations):
        cost = 0
        for i in range(vocab_size):
            for j in range(vocab_size):
                if co_occurrence_matrix[i, j] > 0:
                    # Weight function - simplified
                    weight = min(1, (co_occurrence_matrix[i, j] / 100) ** 0.75)

                    # Compute prediction and error
                    prediction = np.dot(W[i], W_context[j]) + b[i] + b_context[j]
                    error = prediction - np.log(max(co_occurrence_matrix[i, j], 1))

                    # Update cost
                    cost += weight * error ** 2

                    # Compute gradients
                    grad = weight * error

                    # Update parameters (use the pre-update value of W[i] for both updates)
                    W_i_old = W[i].copy()
                    W[i] -= learning_rate * grad * W_context[j]
                    W_context[j] -= learning_rate * grad * W_i_old
                    b[i] -= learning_rate * grad
                    b_context[j] -= learning_rate * grad

        if iteration % 10 == 0:
            print(f"Iteration {iteration}, Cost: {cost}")

    # Final word vectors (sum of word and context vectors)
    final_vectors = W + W_context
    return final_vectors

# Example co-occurrence matrix
co_occurrence = np.array([
    [0, 45, 12, 32],
    [45, 0, 67, 5],
    [12, 67, 0, 56],
    [32, 5, 56, 0]
])

# Train simplified GloVe
word_vectors = train_simplified_glove(co_occurrence, vector_size=10, iterations=100)

# Calculate similarities
sim_matrix = cosine_similarity(word_vectors)
print("Word similarity matrix:")
print(sim_matrix)

# Example words
words = ["the", "cat", "sat", "mat"]

# Show most similar words for each word
for i, word in enumerate(words):
    similarities = [(words[j], sim_matrix[i, j]) for j in range(len(words)) if j != i]
    similarities.sort(key=lambda x: x[1], reverse=True)
    print(f"Words most similar to '{word}': {similarities}")
6. Contextual Embeddings (BERT, ELMo, etc.)
Contextual embeddings revolutionized NLP by generating dynamic representations that capture the meaning of words in their specific context.
How It Works
Unlike static embeddings, contextual models:
- Process entire sentences/paragraphs together
- Use deep neural architectures (Transformers for BERT, bidirectional LSTMs for ELMo)
- Pre-train on massive corpora using tasks like masked language modeling
- Generate different vectors for the same word depending on its context
- Often use subword tokenization (WordPiece for BERT, Byte-Pair Encoding for others)
Detailed Example
Consider the word “bank” in different contexts:
- “I deposited money in the bank yesterday.”
- “We sat on the bank of the river and watched the sunset.”
With contextual embeddings:
- The model processes each full sentence
- “bank” in the first sentence gets a vector representing the financial institution meaning
- “bank” in the second sentence gets a different vector representing the river shore meaning
- These vectors capture the distinct meanings despite being the same word
For BERT specifically, before generating embeddings:
- The input is tokenized: “I deposited money in the bank yesterday.” → [“[CLS]”, “i”, “deposit”, “##ed”, “money”, “in”, “the”, “bank”, “yesterday”, “.”, “[SEP]”]
- Each token is assigned three embeddings (token, position, segment) which are summed
- This combined representation passes through multiple Transformer layers
- The final layer outputs contextualized embeddings for each token
Mathematical Formulation
For the BERT model, the attention mechanism is a key component:
- Attention(Q, K, V) = softmax(QK^T / √dₖ)V
Where Q, K, and V are query, key, and value matrices derived from the input embeddings.
The entire model consists of multiple layers of multi-headed attention and feed-forward networks:
- h₁ = LayerNorm(x + MultiHeadAttention(x))
- h₂ = LayerNorm(h₁ + FeedForward(h₁))
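A tiny NumPy sketch of the scaled dot-product attention formula above; the random matrices stand in for learned projections, so this illustrates the equation rather than BERT's actual implementation:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                                  # four tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8): one output vector per token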
Advantages
- Context Awareness: Captures word meaning based on surrounding context
- Polysemy Handling: Different representations for different word senses
- Subword Tokenization: Handles out-of-vocabulary words effectively
- Deep Understanding: Captures complex language phenomena like negation
- State-of-the-Art Performance: Achieves best results on most NLP tasks
- Transfer Learning: Pre-trained models can be fine-tuned for specific tasks
Disadvantages
- Computational Requirements: Extremely resource-intensive
- Complexity: More difficult to implement and use
- Interpretability: Hard to understand what specific dimensions represent
- Size: Models are very large (hundreds of millions to billions of parameters)
- Training Data: Requires massive amounts of text for pre-training
Practical Applications
- Question Answering: Understanding questions and finding answers in text
- Text Classification: Superior document categorization
- Named Entity Recognition: Identifying and classifying entities
- Text Generation: Creating coherent and contextually appropriate text
- Sentiment Analysis: Understanding nuanced opinions and emotions
- Machine Translation: Producing high-quality translations
Code Implementation
import torch
import numpy as np
from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentences with the word "bank"
sentences = [
    "I deposited money in the bank yesterday.",
    "We sat on the bank of the river and watched the sunset."
]

# Get contextual embeddings for each sentence
bank_embeddings = []
for sentence in sentences:
    # Tokenize and prepare for model
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    # Get embeddings from last layer
    embeddings = outputs.last_hidden_state

    # Find the position of "bank" in tokenized input
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_position = tokens.index("bank")

    # Extract the embedding for "bank"
    bank_embedding = embeddings[0, bank_position].numpy()
    bank_embeddings.append(bank_embedding)

    print(f"\nContextual embedding for 'bank' in: \"{sentence}\"")
    print(f"First 5 dimensions: {bank_embedding[:5]}...")

# Compare the two "bank" embeddings
bank1, bank2 = bank_embeddings

# Calculate cosine similarity
cosine_sim = np.dot(bank1, bank2) / (np.linalg.norm(bank1) * np.linalg.norm(bank2))
print(f"\nCosine similarity between the two 'bank' embeddings: {cosine_sim}")
print("Note that this value would be 1.0 with static embeddings like Word2Vec")
7. FastText
FastText extends Word2Vec by incorporating subword information, making it better at handling rare and unseen words.
How It Works
- Represents each word as a bag of character n-grams (plus the word itself)
- Each n-gram has its own vector representation
- A word’s embedding is the sum of its n-gram vectors
- Uses similar training approaches as Word2Vec (Skip-gram or CBOW)
Detailed Example
For the word “apple” with n-grams of length 3-6:
- Character n-grams: “<ap”, “app”, “ppl”, “ple”, “le>”, “<app”, “appl”, “pple”, “ple>”, “<appl”, “apple”, “pple>”, “<apple”, “apple>”
- (where < and > represent word boundaries)
The final vector for “apple” would be the sum of these n-gram vectors plus the vector for the whole word.
When encountering an unseen word like “applet”:
- Many n-grams overlap with “apple” (e.g., “app”, “ppl”)
- FastText can build a reasonable embedding from these shared n-grams
- This gives better coverage for rare, technical, or misspelled words
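A short sketch of the n-gram extraction described above, using the same 3-6 length range as the example (the hashing and bucketing that FastText actually applies to these n-grams are omitted):

def char_ngrams(word, min_n=3, max_n=6):
    # Wrap the word in boundary markers, then collect all substrings of length min_n..max_n
    wrapped = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

print(sorted(char_ngrams("apple")))
# Shared n-grams let an unseen word like "applet" reuse what was learned for "apple"
print(sorted(char_ngrams("apple") & char_ngrams("applet")))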
Advantages
- Morphological Awareness: Captures word structure and morphology
- Out-of-Vocabulary Handling: Can generate embeddings for unseen words
- Robustness to Misspellings: Similar embeddings for misspelled variants
- Better for Morphologically Rich Languages: Works well for languages with many word forms
- Compact Models: Can be compressed efficiently
Disadvantages
- Larger Models: More parameters than Word2Vec
- Computational Cost: More expensive to train
- Still Static: No context sensitivity despite subword awareness
- Less Semantic Precision: May blur distinctions between some similar words
Code Implementation
from gensim.models import FastText

# Sample corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "fox", "is", "quick", "and", "the", "dog", "is", "lazy"],
    ["quick", "brown", "foxes", "jump", "over", "lazy", "dogs"]
]

# Train model
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4, min_n=3, max_n=6)

# Explore word vectors
print(f"Vector for 'fox': {model.wv['fox'][:5]}...")  # Show first 5 dimensions

# Find similar words
similar_words = model.wv.most_similar("fox", topn=3)
print(f"Words similar to 'fox': {similar_words}")

# Out-of-vocabulary word handling
print(f"Vector for unseen word 'foxiest': {model.wv['foxiest'][:5]}...")
8. Doc2Vec (Paragraph Vector)
Doc2Vec extends Word2Vec to learn fixed-length representations for variable-length texts such as sentences, paragraphs, or documents.
How It Works
Doc2Vec has two main variants:
- Distributed Memory (DM):
  - Similar to CBOW in Word2Vec
  - Predicts a target word given context words AND a document vector
  - The document vector serves as a memory that captures the topic of the document
- Distributed Bag of Words (DBOW):
  - Similar to Skip-gram in Word2Vec
  - Predicts context words given only the document vector
  - Simpler but often works as well as DM
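In gensim, the choice between the two variants is a single constructor flag; a minimal sketch (dm=1 selects Distributed Memory, dm=0 selects DBOW):

from gensim.models.doc2vec import Doc2Vec

# Distributed Memory (PV-DM): document vector plus context words predict the target word
dm_model = Doc2Vec(dm=1, vector_size=50, min_count=1, epochs=40)

# Distributed Bag of Words (PV-DBOW): the document vector alone predicts sampled words
dbow_model = Doc2Vec(dm=0, vector_size=50, min_count=1, epochs=40)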
Detailed Example
Consider training Doc2Vec on a corpus of movie reviews:
- Assign a unique ID to each review
- Train the model to predict words in the review given the review ID
- The resulting vectors for each review ID capture the semantic content
For example, reviews about action movies will have similar vectors, distinct from those about romantic comedies.
After training, we can:
- Compare documents directly (e.g., find similar movie reviews)
- Infer vectors for new, unseen documents
- Use document vectors for classification or clustering
Advantages
- Document-Level Semantics: Captures meaning at document scale
- Fixed-Length Representation: Consistent size regardless of document length
- Compositionality: Combines word and document meaning
- End-to-End Learning: Learns document representations directly
- Unsupervised: Doesn’t require labeled data
Disadvantages
- Training Complexity: More complex to train than word embeddings
- Data Requirements: Needs substantial corpus for good representations
- Hyperparameter Sensitivity: Performance varies with parameter settings
- Black Box: Difficult to interpret what dimensions represent
- No Context Within Documents: Treats all words in document equally
Practical Applications
- Document Classification: Categorizing texts by topic or sentiment
- Information Retrieval: Finding similar documents
- Document Clustering: Grouping similar texts
- Recommendation Systems: Suggesting similar content
- Plagiarism Detection: Identifying semantically similar documents
Code Implementation
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Sample corpus
documents = [
    "The quick brown fox jumps over the lazy dog",
    "The fox is quick and the dog is lazy",
    "Quick brown foxes jump over lazy dogs",
    "I love machine learning and natural language processing",
    "Vector representations are useful in NLP tasks",
    "Natural language processing involves machine learning"
]

# Preprocess and tag documents
tagged_docs = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[i])
               for i, doc in enumerate(documents)]

# Train model
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

# Explore document vectors
print(f"Vector for document 0: {model.dv[0][:5]}...")  # Show first 5 dimensions

# Find similar documents
similar_docs = model.dv.most_similar(0, topn=2)
print(f"Documents similar to document 0: {similar_docs}")

# Infer vector for a new document
new_doc = "Foxes and dogs are quick animals"
inferred_vector = model.infer_vector(word_tokenize(new_doc.lower()))
print(f"Inferred vector for new document: {inferred_vector[:5]}...")

# Find similar documents to the new document
similar_to_new = model.dv.most_similar([inferred_vector], topn=2)
print(f"Documents similar to new document: {similar_to_new}")
9. Universal Sentence Encoder (USE)
The Universal Sentence Encoder provides sentence-level embeddings that scale to various NLP tasks with minimal task-specific training data.
How It Works
USE has two major variants:
- Transformer-based:
  - Uses a Transformer architecture similar to BERT
  - Optimizes for accuracy at the cost of computational complexity
  - Processes full sentences with attention mechanisms
- DAN-based (Deep Averaging Network):
  - Averages embeddings for input words/n-grams and passes them through a deep neural network
  - More efficient but slightly less accurate
  - Better suited for mobile and low-resource environments
Both are trained on a variety of tasks, including:
- Skip-thought prediction
- Translation ranking
- Natural language inference
- Conversational response prediction
Detailed Example
Consider two sentences:
- “The cat sat on the mat.”
- “A feline rested on the floor covering.”
Despite different vocabulary, USE would produce similar embeddings for these semantically similar sentences.
When applied to question answering:
- Question: “What is the capital of France?”
- Candidate answers from a knowledge base are encoded
- The answer with the highest cosine similarity to the question is selected
- “Paris is the capital of France” would have high similarity
Advantages
- Sentence-Level Semantics: Captures meaning at sentence scale
- Transfer Learning Ready: Pre-trained for use across multiple tasks
- Minimal Fine-tuning: Works well with limited task-specific data
- Language Understanding: Captures semantic similarities regardless of phrasing
- Multilingual Versions: Available for multiple languages
Disadvantages
- Fixed Representation: One vector per sentence regardless of length
- Computational Requirements: Transformer variant is resource-intensive
- Limited Context Length: Performance degrades with very long texts
- Black Box: Difficult to interpret dimensions
- Less Precise Than Task-Specific Models: Jack-of-all-trades approach
Practical Applications
- Semantic Textual Similarity: Measuring how similar two texts are
- Clustering: Grouping similar sentences or paragraphs
- Classification: Categorizing short texts
- Information Retrieval: Finding relevant information from a corpus
- Semantic Search: Searching by meaning rather than keywords
Code Implementation
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained USE model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Example sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the floor covering.",
    "Dogs chase cats.",
    "What is the capital of France?",
    "Paris is the capital of France."
]

# Generate embeddings
embeddings = embed(sentences)
print(f"Embedding shape: {embeddings.shape}")  # Should be (5, 512)

# Compute similarity matrix
similarity_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        print(f"Similarity between \"{sentences[i]}\" and \"{sentences[j]}\": {similarity_matrix[i, j]:.4f}")

# Question answering example
question = "What is the capital of France?"
question_embedding = embed([question])

candidate_answers = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "London is the capital of England."
]
answer_embeddings = embed(candidate_answers)

# Calculate similarities between question and answers
similarities = cosine_similarity(question_embedding, answer_embeddings)[0]
for i, (answer, similarity) in enumerate(zip(candidate_answers, similarities)):
    print(f"Answer {i+1}: \"{answer}\" - Similarity: {similarity:.4f}")

# Get best answer
best_answer_index = np.argmax(similarities)
print(f"Best answer: \"{candidate_answers[best_answer_index]}\"")
10. Sentence-BERT (SBERT)
Sentence-BERT modifies the BERT architecture to derive semantically meaningful sentence embeddings efficiently.
How It Works
- Uses siamese and triplet network structures with BERT/RoBERTa/etc. as base models
- Applies pooling to the output of BERT (mean, max, or CLS token pooling)
- Trained on sentence pairs with objectives like:
  - Natural Language Inference (entailment, contradiction, neutral)
  - Semantic Textual Similarity (scoring sentence similarity)
- Produces fixed-size sentence embeddings optimized for semantic comparison
Detailed Example
Training process example:
- Take sentence pairs labeled for similarity
- “I love pizza” and “Pizza is my favorite food” (similar)
- “I love pizza” and “I hate vegetables” (dissimilar)
- Pass both sentences through the same BERT model with shared weights
- Apply pooling to get a fixed vector for each sentence
- Train the network to minimize distance between similar sentences and maximize distance between dissimilar ones
In practice:
- Computing pairwise similarities among 10,000 sentences by feeding every pair through BERT would require roughly 50 million forward passes
- With SBERT, each sentence is encoded once and similarities are computed via fast vector operations, reducing the computation from many hours to a few seconds
Mathematical Formulation
For the triplet objective function:
- L = max(0, ||a-p||² – ||a-n||² + margin)
Where:
- a is the anchor sentence embedding
- p is a positive example (similar sentence)
- n is a negative example (dissimilar sentence)
- margin is a hyperparameter that enforces a minimum distance
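A tiny NumPy sketch of this loss on made-up two-dimensional embeddings (the squared Euclidean distance and the margin value are just for illustration):

import numpy as np

def triplet_loss(a, p, n, margin=1.0):
    # max(0, ||a - p||^2 - ||a - n||^2 + margin)
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + margin)

a = np.array([1.0, 0.0])    # anchor sentence embedding
p = np.array([0.9, 0.1])    # similar sentence (positive)
n = np.array([-1.0, 0.5])   # dissimilar sentence (negative)
print(triplet_loss(a, p, n))  # 0.0 here, since the negative is already far enough away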
Advantages
- Efficiency: Much faster than comparing all sentence pairs with BERT
- Semantic Understanding: Captures sentence meaning well
- Strong Transfer Learning: Pre-trained models work well across domains
- State-of-the-art Performance: Achieves excellent results on sentence similarity tasks
- Handles Longer Text: Better than word embeddings for sentences
- Task Adaptability: Can be fine-tuned for specific tasks
Disadvantages
- Resource Requirements: Still needs significant computational resources
- Limited Context Length: Performance decreases with very long texts
- Black Box Nature: Difficult to interpret what dimensions represent
- Fixed Embedding Size: Same dimensionality regardless of sentence complexity
- Domain Adaptation Challenges: May require fine-tuning for specialized domains
Practical Applications
- Semantic Search: Finding relevant documents quickly
- Clustering: Grouping similar texts efficiently
- Information Retrieval: Retrieving relevant information
- Paraphrase Mining: Finding alternative expressions of the same idea
- Automatic Essay Grading: Comparing student answers to reference answers
- Duplicate Question Detection: Finding similar questions on Q&A platforms
Code Implementation
from sentence_transformers import SentenceTransformer, util

# Load pre-trained SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the floor covering.",
    "Dogs chase cats.",
    "What is the capital of France?",
    "Paris is the capital of France."
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")  # Should be (5, 384)

# Compute similarity matrix
similarity_matrix = util.cos_sim(embeddings, embeddings)
print("Similarity matrix:")
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        print(f"Similarity between \"{sentences[i]}\" and \"{sentences[j]}\": {similarity_matrix[i, j].item():.4f}")

# Semantic search example
query = "What is the capital of France?"
query_embedding = model.encode([query])

corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "London is the capital of England."
]
corpus_embeddings = model.encode(corpus)

# Calculate similarities between query and corpus
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
for hit in hits[0]:
    print(f"Score: {hit['score']:.4f} - \"{corpus[hit['corpus_id']]}\"")
Comparison of Text Representation Techniques
Technique | Dimensionality | Context Awareness | OOV Handling | Semantic Capture | Computational Cost | Best For |
---|---|---|---|---|---|---|
One-Hot Encoding | Very High (vocab size) | None | Poor | None | Low | Basic preprocessing |
Bag of Words | High (vocab size) | None | Poor | None | Low | Simple classification |
TF-IDF | High (vocab size) | None | Poor | Limited | Low | Information retrieval |
Word2Vec | Medium (100-300) | None | Poor | Good | Medium | Word similarity, analogies |
GloVe | Medium (100-300) | None | Poor | Good | Medium | Word semantics, analogies |
FastText | Medium (100-300) | None | Good | Good | Medium-High | Morphologically rich languages |
Doc2Vec | Medium (100-300) | Document-level | Poor | Good | Medium | Document classification |
BERT/Contextual | High (768+) | Excellent | Good | Excellent | Very High | Complex NLP tasks |
Universal Sentence Encoder | Medium (512) | Sentence-level | Medium | Very Good | Medium-High | Sentence comparison |
Sentence-BERT | Medium (384-768) | Sentence-level | Good | Excellent | High | Efficient semantic search |
Practical Selection Guide
When to Use Each Technique
- One-Hot Encoding:
  - Teaching concepts
  - Very small vocabularies
  - When explicit word identity is critical
- Bag of Words:
  - Simple text classification tasks
  - When word order doesn’t matter
  - Limited computational resources
  - Easily interpretable models
- TF-IDF:
  - Search engine relevance ranking
  - When distinctive words matter more than common ones
  - Document similarity measures
  - Topic extraction
- Word2Vec/GloVe:
  - When word relationships matter
  - Transfer learning for limited datasets
  - Exploration of semantic relationships
  - Moderate computational resources
- FastText:
  - Languages with rich morphology
  - When handling rare words is important
  - When misspellings are common
  - Social media text with neologisms
- Doc2Vec:
  - Document-level tasks
  - When document identity matters more than individual words
  - Recommendation systems
  - Plagiarism detection
- BERT/Contextual Embeddings:
  - Complex language understanding tasks
  - When word sense disambiguation is critical
  - When context significantly changes meaning
  - When state-of-the-art performance is required
- Universal Sentence Encoder:
  - Cross-domain sentence comparison
  - Limited fine-tuning data available
  - Mobile or resource-constrained environments (DAN version)
  - Quick prototyping of sentence-level applications
- Sentence-BERT:
  - Large-scale semantic search
  - Efficient clustering of many sentences
  - Real-time similarity computation
  - Production systems requiring sentence embeddings
Implementation Considerations
Data Preprocessing
Regardless of the technique chosen, proper text preprocessing is crucial:
- Tokenization: Breaking text into words, subwords, or characters
- Lowercasing: Converting all text to lowercase (usually)
- Stopword Removal: Removing common words with little semantic value (for non-neural methods)
- Stemming/Lemmatization: Reducing words to base forms (for non-neural methods)
- Special Character Handling: Deciding how to treat punctuation, numbers, etc.
- Handling Out-of-Vocabulary Words: Creating strategies for unseen words
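The classic (non-neural) pipeline can be sketched in a few lines. A minimal example using NLTK; the regex tokenizer and WordNet lemmatizer are one reasonable choice among many, and the required data downloads are shown explicitly:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text):
    tokenizer = RegexpTokenizer(r"[a-z]+")       # tokenize and drop punctuation/numbers
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    tokens = tokenizer.tokenize(text.lower())                # lowercase + tokenize
    tokens = [t for t in tokens if t not in stop_words]      # stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]         # lemmatization

print(preprocess("The cats were sitting on the mats, watching 3 dogs!"))
# ['cat', 'sitting', 'mat', 'watching', 'dog']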
Evaluation Metrics
When comparing text representation techniques, consider these metrics:
- Intrinsic Evaluation:
  - Word/Sentence Similarity Correlation
  - Analogy Task Accuracy
  - Word/Document Clustering Quality
- Extrinsic Evaluation:
  - Downstream Task Performance
  - Classification Accuracy
  - Retrieval Precision/Recall
  - Machine Translation BLEU Scores
Hybrid Approaches
Often, the best solution combines multiple techniques:
- Ensemble Methods: Using multiple representation types and combining predictions
- Feature Stacking: Concatenating different embeddings
- Task-Specific Fine-Tuning: Starting with pre-trained embeddings and adapting to domain
- Multi-level Representations: Using word, sentence, and document embeddings together
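As a concrete example of feature stacking, TF-IDF vectors and averaged word embeddings can simply be concatenated into one feature matrix before training a classifier. A sketch under the assumption that a raw text corpus and a trained gensim Word2Vec model are already available:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def stack_features(corpus, w2v_model):
    # Sparse lexical features: TF-IDF over the corpus
    tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

    # Dense semantic features: average of the word vectors found in each document
    dim = w2v_model.vector_size
    dense = np.zeros((len(corpus), dim))
    for row, doc in enumerate(corpus):
        vectors = [w2v_model.wv[w] for w in doc.lower().split() if w in w2v_model.wv]
        if vectors:
            dense[row] = np.mean(vectors, axis=0)

    # Concatenate both views into a single feature matrix
    return np.hstack([tfidf, dense])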
Conclusion
Text representation has evolved dramatically from simple one-hot encoding to sophisticated contextual embedding models. Each technique offers unique trade-offs between simplicity, computational efficiency, and semantic understanding.
For practical applications:
- Consider your computational constraints
- Evaluate the importance of contextual understanding
- Assess the availability of training data
- Balance accuracy requirements against implementation complexity
The field continues to advance rapidly, with contextual embeddings and their efficient derivatives currently representing the state-of-the-art for most applications. However, simpler techniques like TF-IDF and non-contextual word embeddings remain valuable for specific use cases, especially when computational resources are limited or when interpretability is important.
By understanding the full spectrum of text representation techniques, NLP practitioners can make informed choices for their specific applications, leading to more effective and efficient text processing systems.