Natural Language Processing (NLP) requires converting human language into numerical formats that computers can understand. This guide explores major text representation techniques in depth, comparing their strengths, weaknesses, and practical applications.
1. One-Hot Encoding
One-hot encoding is a fundamental representation technique that forms the conceptual foundation for many text representation methods.
How It Works
One-hot encoding represents each word as a binary vector with a length equal to the vocabulary size. For a vocabulary of size V:
- Create a vector of length V filled with zeros
- Set the position corresponding to the word’s index to 1
- All other positions remain 0
Detailed Example
Consider a small vocabulary: [“apple”, “banana”, “cherry”, “date”, “elderberry”]
One-hot encodings:
- “apple” = [1, 0, 0, 0, 0]
- “banana” = [0, 1, 0, 0, 0]
- “cherry” = [0, 0, 1, 0, 0]
- “date” = [0, 0, 0, 1, 0]
- “elderberry” = [0, 0, 0, 0, 1]
To represent the sentence “I like apple and banana”:
- We would create five separate vectors for each word
- Words not in our vocabulary (like “I”, “like”, “and”) would either be ignored or added to the vocabulary
Mathematical Formulation
For a vocabulary V = {w₁, w₂, …, wₙ}, the one-hot encoding of word wᵢ is a vector v where:
- v[j] = 1 if j = i
- v[j] = 0 if j ≠ i
Advantages
- Simplicity: Straightforward to implement and understand
- Unique Representation: Each word has a distinct representation
- No Assumptions: Makes no assumptions about relationships between words
- Lossless: Preserves word identity perfectly
Disadvantages
- Dimensionality: For real vocabularies (50,000+ words), vectors become enormous
- Sparsity: Most elements are zero, wasting memory and computation
- No Semantic Information: “apple” and “fruit” are as different as “apple” and “rocket”
- No Contextual Information: The same word always has the same representation regardless of usage
Code Implementation
def one_hot_encode(word, vocabulary):
    vector = [0] * len(vocabulary)
    if word in vocabulary:
        vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["apple", "banana", "cherry", "date", "elderberry"]
print(one_hot_encode("banana", vocabulary))  # [0, 1, 0, 0, 0]
print(one_hot_encode("apple", vocabulary))   # [1, 0, 0, 0, 0]
2. Bag of Words (BoW)
Bag of Words builds on one-hot encoding to represent entire documents rather than individual words.
How It Works
- Create a vocabulary from all unique words in the corpus
- For each document:
  - Initialize a vector of zeros with length equal to vocabulary size
  - For each word in the document, increment the corresponding position
- The final vector contains counts of word occurrences
Detailed Example
Consider two documents:
- Document 1: “The cat sat on the mat”
- Document 2: “The dog chased the cat”
Vocabulary: [“the”, “cat”, “sat”, “on”, “mat”, “dog”, “chased”]
BoW representations:
- Document 1: [2, 1, 1, 1, 1, 0, 0] (2 occurrences of “the”, 1 of “cat”, etc.)
- Document 2: [2, 1, 0, 0, 0, 1, 1]
Mathematical Formulation
For a document D and vocabulary V = {w₁, w₂, …, wₙ}, the BoW representation is a vector v where:
- v[i] = count of word wᵢ in document D
Advantages
- Frequency Information: Captures how often words appear
- Document Comparison: Enables comparing documents based on content
- Simplicity: Easy to implement and understand
- Scalability: Works well with many classification algorithms
- Success in Practice: Despite simplicity, works well for many tasks like spam detection and document categorization
Disadvantages
- Loss of Order: “The cat chased the dog” and “The dog chased the cat” have identical representations
- Equal Weighting: Common words like “the” get high values despite low information content
- Sparse Representation: Most entries are zero for large vocabularies
- No Semantics: Doesn’t capture word relationships or meanings
Practical Applications
- Sentiment Analysis: Determining whether reviews are positive or negative
- Spam Detection: Identifying unwanted emails
- Document Categorization: Sorting documents into topics
Code Implementation
from collections import Counter

def create_bow(document, vocabulary):
    word_counts = Counter(document.lower().split())
    return [word_counts.get(word, 0) for word in vocabulary]

vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "chased"]
doc1 = "The cat sat on the mat"
doc2 = "The dog chased the cat"

bow1 = create_bow(doc1, vocabulary)
bow2 = create_bow(doc2, vocabulary)

print(bow1)  # [2, 1, 1, 1, 1, 0, 0]
print(bow2)  # [2, 1, 0, 0, 0, 1, 1]
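The same counts can be produced with scikit-learn, which also builds the vocabulary automatically. A brief sketch using CountVectorizer (its default tokenizer lowercases the text and sorts the vocabulary alphabetically, so the column order differs from the manual version above):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cat sat on the mat", "The dog chased the cat"]

vectorizer = CountVectorizer()           # lowercases and tokenizes by default
X = vectorizer.fit_transform(corpus)     # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # alphabetically ordered vocabulary
print(X.toarray())                         # word counts per document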
3. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF enhances BoW by weighting terms based on their importance within and across documents.
How It Works
TF-IDF consists of two components:
- Term Frequency (TF): Measures how frequently a term appears in a document
  - TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)
- Inverse Document Frequency (IDF): Measures how important a term is across the corpus
  - IDF(t) = log(Total number of documents / Number of documents containing term t)
The final TF-IDF score is: TF-IDF(t,d) = TF(t,d) × IDF(t)
Detailed Example
Consider a corpus of three documents:
- Doc1: “The cat sat on the mat”
- Doc2: “The dog chased the cat”
- Doc3: “The bird flew over the house”
Let’s calculate the TF-IDF for the word “cat” in Doc1:
- Term Frequency for “cat” in Doc1:
  - TF(“cat”, Doc1) = 1/6 ≈ 0.167
- Inverse Document Frequency for “cat”:
  - “cat” appears in 2 out of 3 documents
  - IDF(“cat”) = log(3/2) ≈ 0.176 (using a base-10 logarithm; a natural log would give ≈ 0.405)
- TF-IDF for “cat” in Doc1:
  - TF-IDF(“cat”, Doc1) = 0.167 × 0.176 ≈ 0.029
Compare this with the common word “the”:
- TF(“the”, Doc1) = 2/6 = 0.333
- IDF(“the”) = log(3/3) = log(1) = 0
- TF-IDF(“the”,Doc1) = 0.333 × 0 = 0
This shows how TF-IDF reduces the weight of common words that appear in all documents.
Mathematical Formulation
For term t in document d, from a corpus D:
- TF(t,d) = f(t,d) / Σₓ f(x,d) where f(t,d) is the count of term t in document d
- IDF(t) = log(|D| / |{d ∈ D : t ∈ d}|) where |D| is the total number of documents
- TF-IDF(t,d) = TF(t,d) × IDF(t)
Advantages
- Word Importance: Distinguishes between common and distinctive terms
- Weighting Mechanism: Reduces the impact of high-frequency, low-information words
- Enhanced Discrimination: Highlights words that characterize specific documents
- Proven Effectiveness: Outperforms raw BoW in many tasks
- Interpretability: Values have clear meaning (higher = more distinctive)
Disadvantages
- Still Ignores Order: Word sequence is not considered
- Corpus Dependency: IDF calculation requires a complete corpus
- No Semantic Understanding: Doesn’t capture word relationships
- Fixed Vocabulary: Struggles with out-of-vocabulary words
- Limited Context: Doesn’t capture word usage context
Practical Applications
- Information Retrieval: Powering search engines
- Document Clustering: Grouping similar documents
- Feature Extraction: Creating input features for machine learning algorithms
- Keyword Extraction: Identifying the most distinctive words in text
Code Implementation
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Manual implementation
def compute_tfidf(corpus):
    # Create vocabulary
    all_words = set()
    for doc in corpus:
        for word in doc.lower().split():
            all_words.add(word)
    vocabulary = list(all_words)

    # Calculate document frequency
    doc_freq = Counter()
    for doc in corpus:
        words_in_doc = set(doc.lower().split())
        for word in words_in_doc:
            doc_freq[word] += 1

    # Calculate TF-IDF
    tfidf_vectors = []
    for doc in corpus:
        word_counts = Counter(doc.lower().split())
        total_words = len(doc.lower().split())
        tfidf_vector = []
        for word in vocabulary:
            # Term frequency
            tf = word_counts.get(word, 0) / total_words
            # Inverse document frequency
            idf = np.log(len(corpus) / doc_freq.get(word, 1))
            # TF-IDF
            tfidf_vector.append(tf * idf)
        tfidf_vectors.append(tfidf_vector)
    return tfidf_vectors, vocabulary

# Example usage
corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The bird flew over the house"
]

tfidf_vectors, vocab = compute_tfidf(corpus)
print(f"Vocabulary: {vocab}")
print(f"TF-IDF for document 1: {tfidf_vectors[0]}")

# Using scikit-learn
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("scikit-learn TF-IDF:")
print(X.toarray())
4. Word2Vec
Word2Vec represents a paradigm shift in text representation by generating dense, continuous vector embeddings that capture semantic relationships between words.
How It Works
Word2Vec uses shallow neural networks with two main architectures:
- Skip-gram:
  - Input: Target word
  - Output: Context words (surrounding words)
  - The model learns to predict the context from a single word
- Continuous Bag of Words (CBOW):
  - Input: Context words
  - Output: Target word
  - The model learns to predict a word from its context
During training, Word2Vec adjusts word vectors to maximize the probability of correct predictions, resulting in semantically similar words having similar embeddings.
Detailed Example
Consider training Word2Vec on this corpus: “The quick brown fox jumps over the lazy dog.”
For a window size of 2, training examples for Skip-gram include:
- Input: “quick”, Output: [“The”, “brown”, “fox”]
- Input: “brown”, Output: [“The”, “quick”, “fox”, “jumps”]
- Input: “fox”, Output: [“brown”, “quick”, “jumps”, “over”]
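These (target, context) pairs are easy to generate; a minimal sketch in plain Python (the window size and whitespace tokenization are simplifications):

def skipgram_pairs(tokens, window=2):
    # Collect (target, context) pairs within the given window on each side
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((target, context))
    return pairs

tokens = "The quick brown fox jumps over the lazy dog".split()
for target, context in skipgram_pairs(tokens)[:3]:
    print(target, "->", context)
# The -> ['quick', 'brown']
# quick -> ['The', 'brown', 'fox']
# brown -> ['The', 'quick', 'fox', 'jumps']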
After training, similar words have similar vectors. For example, the vectors for “king”, “queen”, “man”, and “woman” would capture their semantic relationships, enabling vector arithmetic like:
- vec(“king”) – vec(“man”) + vec(“woman”) ≈ vec(“queen”)
Mathematical Formulation
For the Skip-gram model, the objective is to maximize:
- J = Σᵢ₌₁ᵀ Σⱼ∈context(i) log P(wⱼ|wᵢ)
Where P(wⱼ|wᵢ) is modeled using the softmax function:
- P(wⱼ|wᵢ) = exp(v′wⱼ · vwᵢ) / Σₖ₌₁ⱽ exp(v′wₖ · vwᵢ)
Where vwᵢ is the input (“center”) vector of word wᵢ, and v′wⱼ is the output (“context”) vector of word wⱼ.
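To make the softmax concrete, here is a small NumPy sketch that scores one input word against every output vector; the tiny random matrices stand in for learned embeddings and are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                       # toy vocabulary size and embedding dimension
v_in = rng.normal(size=(V, d))    # input (center) vectors, one per word
v_out = rng.normal(size=(V, d))   # output (context) vectors, one per word

i = 2                             # index of the input word w_i
scores = v_out @ v_in[i]          # dot product of every context vector with v_{w_i}
probs = np.exp(scores - scores.max())
probs /= probs.sum()              # softmax over the whole vocabulary

print(probs)                      # P(w_j | w_i) for every j; sums to 1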
Advantages
- Semantic Relationships: Captures word similarity and relationships
- Dense Representation: Low-dimensional vectors (typically 100-300 dimensions)
- Vector Arithmetic: Enables mathematical operations on word meanings
- Transferability: Pre-trained embeddings can be used across different tasks
- Performance: Dramatically improves results on many NLP tasks
Disadvantages
- Training Requirements: Needs large text corpora and computational resources
- Fixed Vocabulary: Cannot handle out-of-vocabulary words
- Single Representation: One vector per word, regardless of context
- No Polysemy: Cannot represent multiple meanings of the same word
- Black Box: Difficult to interpret individual dimensions
Practical Applications
- Semantic Similarity: Finding related words or documents
- Machine Translation: Improving language translation
- Named Entity Recognition: Identifying entities in text
- Text Classification: Enhancing document categorization
- Recommendation Systems: Finding similar items based on descriptions
Code Implementation
from gensim.models import Word2Vec
import numpy as np

# Sample corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "fox", "is", "quick", "and", "the", "dog", "is", "lazy"],
    ["quick", "brown", "foxes", "jump", "over", "lazy", "dogs"]
]

# Train model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Explore word vectors
print(f"Vector for 'fox': {model.wv['fox'][:5]}...")  # Show first 5 dimensions

# Find similar words
similar_words = model.wv.most_similar("fox", topn=3)
print(f"Words similar to 'fox': {similar_words}")

# Vector arithmetic
result = model.wv.most_similar(positive=["quick", "dog"], negative=["lazy"], topn=1)
print(f"quick - lazy + dog: {result}")

# Manual similarity calculation
cosine_similarity = np.dot(model.wv["fox"], model.wv["dog"]) / (
    np.linalg.norm(model.wv["fox"]) * np.linalg.norm(model.wv["dog"])
)
print(f"Cosine similarity between 'fox' and 'dog': {cosine_similarity}")
5. GloVe (Global Vectors for Word Representation)
GloVe combines the benefits of matrix factorization methods and local context window methods.
How It Works
- Build a co-occurrence matrix X, where Xᵢⱼ represents how often word i appears in the context of word j
- Define a weighting function f(Xᵢⱼ) that gives less weight to rare and extremely common co-occurrences
- Find word vectors wᵢ and context vectors w̃ⱼ such that their dot product approximates the log of the co-occurrence count:
  - wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ ≈ log(Xᵢⱼ)
- Minimize the following objective:
  - J = Σᵢ,ⱼ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ − log(Xᵢⱼ))²
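The first step above, building the co-occurrence matrix, is what the code at the end of this section takes as given. A minimal sketch of that step (symmetric window, no distance weighting; both are simplifying assumptions):

from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    # Count how often each word appears within `window` positions of another word
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(word, tokens[j])] += 1.0
    return counts

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"]]
cooc = build_cooccurrence(sentences)
print(cooc[("the", "cat")])  # co-occurrence count for the pair ("the", "cat")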
Detailed Example
Imagine we’ve analyzed a large corpus and created a co-occurrence matrix:
 | the | cat | sat | mat |
---|---|---|---|---|
the | 0 | 45 | 12 | 32 |
cat | 45 | 0 | 67 | 5 |
sat | 12 | 67 | 0 | 56 |
mat | 32 | 5 | 56 | 0 |
GloVe would find word vectors such that:
- vec(“cat”)·vec(“sat”) > vec(“cat”)·vec(“mat”), because “cat” co-occurs with “sat” (67 times) far more often than with “mat” (5 times)
- vec(“the”)·vec(“cat”) > vec(“the”)·vec(“sat”), because “the” co-occurs with “cat” (45 times) more often than with “sat” (12 times)
After training, GloVe might produce 300-dimensional vectors that capture these statistical relationships. These vectors would support the same analogical reasoning as Word2Vec.
Mathematical Formulation
The GloVe objective function:
- J = Σᵢ,ⱼ f(Xᵢⱼ)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ – log(Xᵢⱼ))²
Where:
- Xᵢⱼ is the co-occurrence count between words i and j
- wᵢ and w̃ⱼ are word and context vectors
- bᵢ and b̃ⱼ are bias terms
- f(Xᵢⱼ) is a weighting function:
  - f(x) = (x/xₘₐₓ)^α if x < xₘₐₓ
  - f(x) = 1 otherwise
Advantages
- Global Statistics: Captures both word-word relationships and corpus-level statistics
- Efficiency: More efficient use of statistics than Word2Vec
- Performance: Often outperforms Word2Vec on analogy tasks
- Explicit Modeling: Directly models co-occurrence probabilities
- Parallelizable: Training can be parallelized more effectively than neural methods
Disadvantages
- Static Embeddings: One representation per word, regardless of context
- Training Data: Requires substantial text data for good performance
- Memory Usage: Co-occurrence matrix can be massive for large vocabularies
- Out-of-Vocabulary: Cannot handle words not seen during training
- No Polysemy: Cannot represent multiple word senses
Practical Applications
- Document Classification: Improving classification accuracy
- Machine Translation: Enhancing translation quality
- Named Entity Recognition: Identifying entities in text
- Question Answering: Improving understanding of questions and contexts
- Transfer Learning: Providing pre-trained representations for other tasks
Code Implementation
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Simplified GloVe-style training (actual implementation would be more complex)
def train_simplified_glove(co_occurrence_matrix, vector_size=50, iterations=50, learning_rate=0.05):
    vocab_size = co_occurrence_matrix.shape[0]

    # Initialize random word and context vectors
    W = np.random.randn(vocab_size, vector_size) * 0.01
    W_context = np.random.randn(vocab_size, vector_size) * 0.01
    b = np.zeros(vocab_size)
    b_context = np.zeros(vocab_size)

    # Training loop
    for iteration in range(iterations):
        cost = 0
        for i in range(vocab_size):
            for j in range(vocab_size):
                if co_occurrence_matrix[i, j] > 0:
                    # Weight function - simplified
                    weight = min(1, (co_occurrence_matrix[i, j] / 100) ** 0.75)

                    # Compute prediction and error
                    prediction = np.dot(W[i], W_context[j]) + b[i] + b_context[j]
                    error = prediction - np.log(max(co_occurrence_matrix[i, j], 1))

                    # Update cost
                    cost += weight * error ** 2

                    # Compute gradients
                    grad = weight * error

                    # Update parameters (use the pre-update value of W[i] for both updates)
                    W_i_old = W[i].copy()
                    W[i] -= learning_rate * grad * W_context[j]
                    W_context[j] -= learning_rate * grad * W_i_old
                    b[i] -= learning_rate * grad
                    b_context[j] -= learning_rate * grad

        if iteration % 10 == 0:
            print(f"Iteration {iteration}, Cost: {cost}")

    # Final word vectors (sum of word and context vectors)
    final_vectors = W + W_context
    return final_vectors

# Example co-occurrence matrix
co_occurrence = np.array([
    [0, 45, 12, 32],
    [45, 0, 67, 5],
    [12, 67, 0, 56],
    [32, 5, 56, 0]
])

# Train simplified GloVe
word_vectors = train_simplified_glove(co_occurrence, vector_size=10, iterations=100)

# Calculate similarities
sim_matrix = cosine_similarity(word_vectors)
print("Word similarity matrix:")
print(sim_matrix)

# Example words
words = ["the", "cat", "sat", "mat"]

# Show most similar words for each word
for i, word in enumerate(words):
    similarities = [(words[j], sim_matrix[i, j]) for j in range(len(words)) if j != i]
    similarities.sort(key=lambda x: x[1], reverse=True)
    print(f"Words most similar to '{word}': {similarities}")
6. Contextual Embeddings (BERT, ELMo, etc.)
Contextual embeddings revolutionized NLP by generating dynamic representations that capture the meaning of words in their specific context.
How It Works
Unlike static embeddings, contextual models:
- Process entire sentences/paragraphs together
- Use deep neural architectures (Transformers for BERT, bidirectional LSTMs for ELMo)
- Pre-train on massive corpora using tasks like masked language modeling
- Generate different vectors for the same word depending on its context
- Often use subword tokenization (WordPiece for BERT, Byte-Pair Encoding for others)
Detailed Example
Consider the word “bank” in different contexts:
- “I deposited money in the bank yesterday.”
- “We sat on the bank of the river and watched the sunset.”
With contextual embeddings:
- The model processes each full sentence
- “bank” in the first sentence gets a vector representing the financial institution meaning
- “bank” in the second sentence gets a different vector representing the river shore meaning
- These vectors capture the distinct meanings despite being the same word
For BERT specifically, before generating embeddings:
- The input is tokenized: “I deposited money in the bank yesterday.” → [“[CLS]”, “i”, “deposit”, “##ed”, “money”, “in”, “the”, “bank”, “yesterday”, “.”, “[SEP]”]
- Each token is assigned three embeddings (token, position, segment) which are summed
- This combined representation passes through multiple Transformer layers
- The final layer outputs contextualized embeddings for each token
Mathematical Formulation
For the BERT model, the attention mechanism is a key component:
- Attention(Q, K, V) = softmax(QK^T / √dₖ)V
Where Q, K, and V are query, key, and value matrices derived from the input embeddings.
The entire model consists of multiple layers of multi-headed attention and feed-forward networks:
- h₁ = LayerNorm(x + MultiHeadAttention(x))
- h₂ = LayerNorm(h₁ + FeedForward(h₁))
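A tiny NumPy sketch of the scaled dot-product attention formula above; the random matrices stand in for learned projections, so this illustrates the equation rather than BERT's actual implementation:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                                  # four tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8): one output vector per token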
Advantages
- Context Awareness: Captures word meaning based on surrounding context
- Polysemy Handling: Different representations for different word senses
- Subword Tokenization: Handles out-of-vocabulary words effectively
- Deep Understanding: Captures complex language phenomena like negation
- State-of-the-Art Performance: Achieves best results on most NLP tasks
- Transfer Learning: Pre-trained models can be fine-tuned for specific tasks
Disadvantages
- Computational Requirements: Extremely resource-intensive
- Complexity: More difficult to implement and use
- Interpretability: Hard to understand what specific dimensions represent
- Size: Models are very large (hundreds of millions to billions of parameters)
- Training Data: Requires massive amounts of text for pre-training
Practical Applications
- Question Answering: Understanding questions and finding answers in text
- Text Classification: Superior document categorization
- Named Entity Recognition: Identifying and classifying entities
- Text Generation: Creating coherent and contextually appropriate text
- Sentiment Analysis: Understanding nuanced opinions and emotions
- Machine Translation: Producing high-quality translations
Code Implementation
import torch
import numpy as np
from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentences with the word "bank"
sentences = [
    "I deposited money in the bank yesterday.",
    "We sat on the bank of the river and watched the sunset."
]

# Get contextual embeddings for each sentence
bank_embeddings = []
for sentence in sentences:
    # Tokenize and prepare for model
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    # Get embeddings from last layer
    embeddings = outputs.last_hidden_state

    # Find the position of "bank" in tokenized input
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_position = tokens.index("bank")

    # Extract the embedding for "bank"
    bank_embedding = embeddings[0, bank_position].numpy()
    bank_embeddings.append(bank_embedding)

    print(f"\nContextual embedding for 'bank' in: \"{sentence}\"")
    print(f"First 5 dimensions: {bank_embedding[:5]}...")

# Compare the two "bank" embeddings
bank1, bank2 = bank_embeddings

# Calculate cosine similarity
cosine_sim = np.dot(bank1, bank2) / (np.linalg.norm(bank1) * np.linalg.norm(bank2))
print(f"\nCosine similarity between the two 'bank' embeddings: {cosine_sim}")
print("Note that this value would be 1.0 with static embeddings like Word2Vec")
7. FastText
FastText extends Word2Vec by incorporating subword information, making it better at handling rare and unseen words.
How It Works
- Represents each word as a bag of character n-grams (plus the word itself)
- Each n-gram has its own vector representation
- A word’s embedding is the sum of its n-gram vectors
- Uses similar training approaches as Word2Vec (Skip-gram or CBOW)
Detailed Example
For the word “apple” with n-grams of length 3-6:
- Character n-grams: “<ap”, “app”, “ppl”, “ple”, “le>”, “<app”, “appl”, “pple”, “ple>”, “<appl”, “apple”, “pple>”, “<apple”, “apple>”
- (where < and > represent word boundaries)
The final vector for “apple” would be the sum of these n-gram vectors plus the vector for the whole word.
When encountering an unseen word like “applet”:
- Many n-grams overlap with “apple” (e.g., “app”, “ppl”)
- FastText can build a reasonable embedding from these shared n-grams
- This gives better coverage for rare, technical, or misspelled words
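A short sketch of the n-gram extraction described above, using the same 3-6 length range as the example (the hashing and bucketing that FastText actually applies to these n-grams are omitted):

def char_ngrams(word, min_n=3, max_n=6):
    # Wrap the word in boundary markers, then collect all substrings of length min_n..max_n
    wrapped = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

print(sorted(char_ngrams("apple")))
# Shared n-grams let an unseen word like "applet" reuse what was learned for "apple"
print(sorted(char_ngrams("apple") & char_ngrams("applet")))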
Advantages
- Morphological Awareness: Captures word structure and morphology
- Out-of-Vocabulary Handling: Can generate embeddings for unseen words
- Robustness to Misspellings: Similar embeddings for misspelled variants
- Better for Morphologically Rich Languages: Works well for languages with many word forms
- Compact Models: Can be compressed efficiently
Disadvantages
- Larger Models: More parameters than Word2Vec
- Computational Cost: More expensive to train
- Still Static: No context sensitivity despite subword awareness
- Less Semantic Precision: May blur distinctions between some similar words
Code Implementation
from gensim.models import FastText

# Sample corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "fox", "is", "quick", "and", "the", "dog", "is", "lazy"],
    ["quick", "brown", "foxes", "jump", "over", "lazy", "dogs"]
]

# Train model
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4, min_n=3, max_n=6)

# Explore word vectors
print(f"Vector for 'fox': {model.wv['fox'][:5]}...")  # Show first 5 dimensions

# Find similar words
similar_words = model.wv.most_similar("fox", topn=3)
print(f"Words similar to 'fox': {similar_words}")

# Out-of-vocabulary word handling
print(f"Vector for unseen word 'foxiest': {model.wv['foxiest'][:5]}...")
8. Doc2Vec (Paragraph Vector)
Doc2Vec extends Word2Vec to learn fixed-length representations for variable-length texts such as sentences, paragraphs, or documents.
How It Works
Doc2Vec has two main variants:
- Distributed Memory (DM):
  - Similar to CBOW in Word2Vec
  - Predicts a target word given context words AND a document vector
  - The document vector serves as a memory that captures the topic of the document
- Distributed Bag of Words (DBOW):
  - Similar to Skip-gram in Word2Vec
  - Predicts context words given only the document vector
  - Simpler but often works as well as DM
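In gensim, the choice between the two variants is a single constructor flag; a minimal sketch (dm=1 selects Distributed Memory, dm=0 selects DBOW):

from gensim.models.doc2vec import Doc2Vec

# Distributed Memory (PV-DM): document vector plus context words predict the target word
dm_model = Doc2Vec(dm=1, vector_size=50, min_count=1, epochs=40)

# Distributed Bag of Words (PV-DBOW): the document vector alone predicts sampled words
dbow_model = Doc2Vec(dm=0, vector_size=50, min_count=1, epochs=40)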
Detailed Example
Consider training Doc2Vec on a corpus of movie reviews:
- Assign a unique ID to each review
- Train the model to predict words in the review given the review ID
- The resulting vectors for each review ID capture the semantic content
For example, reviews about action movies will have similar vectors, distinct from those about romantic comedies.
After training, we can:
- Compare documents directly (e.g., find similar movie reviews)
- Infer vectors for new, unseen documents
- Use document vectors for classification or clustering
Advantages
- Document-Level Semantics: Captures meaning at document scale
- Fixed-Length Representation: Consistent size regardless of document length
- Compositionality: Combines word and document meaning
- End-to-End Learning: Learns document representations directly
- Unsupervised: Doesn’t require labeled data
Disadvantages
- Training Complexity: More complex to train than word embeddings
- Data Requirements: Needs substantial corpus for good representations
- Hyperparameter Sensitivity: Performance varies with parameter settings
- Black Box: Difficult to interpret what dimensions represent
- No Context Within Documents: Treats all words in document equally
Practical Applications
- Document Classification: Categorizing texts by topic or sentiment
- Information Retrieval: Finding similar documents
- Document Clustering: Grouping similar texts
- Recommendation Systems: Suggesting similar content
- Plagiarism Detection: Identifying semantically similar documents
Code Implementation
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Sample corpus
documents = [
    "The quick brown fox jumps over the lazy dog",
    "The fox is quick and the dog is lazy",
    "Quick brown foxes jump over lazy dogs",
    "I love machine learning and natural language processing",
    "Vector representations are useful in NLP tasks",
    "Natural language processing involves machine learning"
]

# Preprocess and tag documents
tagged_docs = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[i])
               for i, doc in enumerate(documents)]

# Train model
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

# Explore document vectors
print(f"Vector for document 0: {model.dv[0][:5]}...")  # Show first 5 dimensions

# Find similar documents
similar_docs = model.dv.most_similar(0, topn=2)
print(f"Documents similar to document 0: {similar_docs}")

# Infer vector for a new document
new_doc = "Foxes and dogs are quick animals"
inferred_vector = model.infer_vector(word_tokenize(new_doc.lower()))
print(f"Inferred vector for new document: {inferred_vector[:5]}...")

# Find similar documents to the new document
similar_to_new = model.dv.most_similar([inferred_vector], topn=2)
print(f"Documents similar to new document: {similar_to_new}")
9. Universal Sentence Encoder (USE)
The Universal Sentence Encoder provides sentence-level embeddings that scale to various NLP tasks with minimal task-specific training data.
How It Works
USE has two major variants:
- Transformer-based:
  - Uses a Transformer architecture similar to BERT
  - Optimizes for accuracy at the cost of computational complexity
  - Processes full sentences with attention mechanisms
- DAN-based (Deep Averaging Network):
  - Averages embeddings for input words/n-grams and passes them through a deep neural network
  - More efficient but slightly less accurate
  - Better suited for mobile and low-resource environments
Both are trained on a variety of tasks, including:
- Skip-thought prediction
- Translation ranking
- Natural language inference
- Conversational response prediction
Detailed Example
Consider two sentences:
- “The cat sat on the mat.”
- “A feline rested on the floor covering.”
Despite different vocabulary, USE would produce similar embeddings for these semantically similar sentences.
When applied to question answering:
- Question: “What is the capital of France?”
- Candidate answers from a knowledge base are encoded
- The answer with the highest cosine similarity to the question is selected
- “Paris is the capital of France” would have high similarity
Advantages
- Sentence-Level Semantics: Captures meaning at sentence scale
- Transfer Learning Ready: Pre-trained for use across multiple tasks
- Minimal Fine-tuning: Works well with limited task-specific data
- Language Understanding: Captures semantic similarities regardless of phrasing
- Multilingual Versions: Available for multiple languages
Disadvantages
- Fixed Representation: One vector per sentence regardless of length
- Computational Requirements: Transformer variant is resource-intensive
- Limited Context Length: Performance degrades with very long texts
- Black Box: Difficult to interpret dimensions
- Less Precise Than Task-Specific Models: Jack-of-all-trades approach
Practical Applications
- Semantic Textual Similarity: Measuring how similar two texts are
- Clustering: Grouping similar sentences or paragraphs
- Classification: Categorizing short texts
- Information Retrieval: Finding relevant information from a corpus
- Semantic Search: Searching by meaning rather than keywords
Code Implementation
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained USE model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Example sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the floor covering.",
    "Dogs chase cats.",
    "What is the capital of France?",
    "Paris is the capital of France."
]

# Generate embeddings
embeddings = embed(sentences)
print(f"Embedding shape: {embeddings.shape}")  # Should be (5, 512)

# Compute similarity matrix
similarity_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        print(f"Similarity between \"{sentences[i]}\" and \"{sentences[j]}\": {similarity_matrix[i, j]:.4f}")

# Question answering example
question = "What is the capital of France?"
question_embedding = embed([question])

candidate_answers = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "London is the capital of England."
]
answer_embeddings = embed(candidate_answers)

# Calculate similarities between question and answers
similarities = cosine_similarity(question_embedding, answer_embeddings)[0]
for i, (answer, similarity) in enumerate(zip(candidate_answers, similarities)):
    print(f"Answer {i+1}: \"{answer}\" - Similarity: {similarity:.4f}")

# Get best answer
best_answer_index = np.argmax(similarities)
print(f"Best answer: \"{candidate_answers[best_answer_index]}\"")
10. Sentence-BERT (SBERT)
Sentence-BERT modifies the BERT architecture to derive semantically meaningful sentence embeddings efficiently.
How It Works
- Uses siamese and triplet network structures with BERT/RoBERTa/etc. as base models
- Applies pooling to the output of BERT (mean, max, or CLS token pooling)
- Trained on sentence pairs with objectives like:
  - Natural Language Inference (entailment, contradiction, neutral)
  - Semantic Textual Similarity (scoring sentence similarity)
- Produces fixed-size sentence embeddings optimized for semantic comparison
Detailed Example
Training process example:
- Take sentence pairs labeled for similarity
- “I love pizza” and “Pizza is my favorite food” (similar)
- “I love pizza” and “I hate vegetables” (dissimilar)
- Pass both sentences through the same BERT model with shared weights
- Apply pooling to get a fixed vector for each sentence
- Train the network to minimize distance between similar sentences and maximize distance between dissimilar ones
In practice:
- Computing pairwise similarities among 10,000 sentences by feeding every pair through BERT would require roughly 50 million forward passes
- With SBERT, each sentence is encoded once and similarities are computed via fast vector operations, reducing the computation from many hours to a few seconds
Mathematical Formulation
For the triplet objective function:
- L = max(0, ||a-p||² – ||a-n||² + margin)
Where:
- a is the anchor sentence embedding
- p is a positive example (similar sentence)
- n is a negative example (dissimilar sentence)
- margin is a hyperparameter that enforces a minimum distance
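A tiny NumPy sketch of this loss on made-up two-dimensional embeddings (the squared Euclidean distance and the margin value are just for illustration):

import numpy as np

def triplet_loss(a, p, n, margin=1.0):
    # max(0, ||a - p||^2 - ||a - n||^2 + margin)
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + margin)

a = np.array([1.0, 0.0])    # anchor sentence embedding
p = np.array([0.9, 0.1])    # similar sentence (positive)
n = np.array([-1.0, 0.5])   # dissimilar sentence (negative)
print(triplet_loss(a, p, n))  # 0.0 here, since the negative is already far enough away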
Advantages
- Efficiency: Much faster than comparing all sentence pairs with BERT
- Semantic Understanding: Captures sentence meaning well
- Strong Transfer Learning: Pre-trained models work well across domains
- State-of-the-art Performance: Achieves excellent results on sentence similarity tasks
- Handles Longer Text: Better than word embeddings for sentences
- Task Adaptability: Can be fine-tuned for specific tasks
Disadvantages
- Resource Requirements: Still needs significant computational resources
- Limited Context Length: Performance decreases with very long texts
- Black Box Nature: Difficult to interpret what dimensions represent
- Fixed Embedding Size: Same dimensionality regardless of sentence complexity
- Domain Adaptation Challenges: May require fine-tuning for specialized domains
Practical Applications
- Semantic Search: Finding relevant documents quickly
- Clustering: Grouping similar texts efficiently
- Information Retrieval: Retrieving relevant information
- Paraphrase Mining: Finding alternative expressions of the same idea
- Automatic Essay Grading: Comparing student answers to reference answers
- Duplicate Question Detection: Finding similar questions on Q&A platforms
Code Implementation
from sentence_transformers import SentenceTransformer, util

# Load pre-trained SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the floor covering.",
    "Dogs chase cats.",
    "What is the capital of France?",
    "Paris is the capital of France."
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")  # Should be (5, 384)

# Compute similarity matrix
similarity_matrix = util.cos_sim(embeddings, embeddings)
print("Similarity matrix:")
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        print(f"Similarity between \"{sentences[i]}\" and \"{sentences[j]}\": {similarity_matrix[i, j].item():.4f}")

# Semantic search example
query = "What is the capital of France?"
query_embedding = model.encode([query])

corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "London is the capital of England."
]
corpus_embeddings = model.encode(corpus)

# Calculate similarities between query and corpus
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
for hit in hits[0]:
    print(f"Score: {hit['score']:.4f} - \"{corpus[hit['corpus_id']]}\"")
Comparison of Text Representation Techniques
Technique | Dimensionality | Context Awareness | OOV Handling | Semantic Capture | Computational Cost | Best For |
---|---|---|---|---|---|---|
One-Hot Encoding | Very High (vocab size) | None | Poor | None | Low | Basic preprocessing |
Bag of Words | High (vocab size) | None | Poor | None | Low | Simple classification |
TF-IDF | High (vocab size) | None | Poor | Limited | Low | Information retrieval |
Word2Vec | Medium (100-300) | None | Poor | Good | Medium | Word similarity, analogies |
GloVe | Medium (100-300) | None | Poor | Good | Medium | Word semantics, analogies |
FastText | Medium (100-300) | None | Good | Good | Medium-High | Morphologically rich languages |
Doc2Vec | Medium (100-300) | Document-level | Poor | Good | Medium | Document classification |
BERT/Contextual | High (768+) | Excellent | Good | Excellent | Very High | Complex NLP tasks |
Universal Sentence Encoder | Medium (512) | Sentence-level | Medium | Very Good | Medium-High | Sentence comparison |
Sentence-BERT | Medium (384-768) | Sentence-level | Good | Excellent | High | Efficient semantic search |
Practical Selection Guide
When to Use Each Technique
- One-Hot Encoding:
  - Teaching concepts
  - Very small vocabularies
  - When explicit word identity is critical
- Bag of Words:
  - Simple text classification tasks
  - When word order doesn’t matter
  - Limited computational resources
  - Easily interpretable models
- TF-IDF:
  - Search engine relevance ranking
  - When distinctive words matter more than common ones
  - Document similarity measures
  - Topic extraction
- Word2Vec/GloVe:
  - When word relationships matter
  - Transfer learning for limited datasets
  - Exploration of semantic relationships
  - Moderate computational resources
- FastText:
  - Languages with rich morphology
  - When handling rare words is important
  - When misspellings are common
  - Social media text with neologisms
- Doc2Vec:
  - Document-level tasks
  - When document identity matters more than individual words
  - Recommendation systems
  - Plagiarism detection
- BERT/Contextual Embeddings:
  - Complex language understanding tasks
  - When word sense disambiguation is critical
  - When context significantly changes meaning
  - When state-of-the-art performance is required
- Universal Sentence Encoder:
  - Cross-domain sentence comparison
  - Limited fine-tuning data available
  - Mobile or resource-constrained environments (DAN version)
  - Quick prototyping of sentence-level applications
- Sentence-BERT:
  - Large-scale semantic search
  - Efficient clustering of many sentences
  - Real-time similarity computation
  - Production systems requiring sentence embeddings
Implementation Considerations
Data Preprocessing
Regardless of the technique chosen, proper text preprocessing is crucial:
- Tokenization: Breaking text into words, subwords, or characters
- Lowercasing: Converting all text to lowercase (usually)
- Stopword Removal: Removing common words with little semantic value (for non-neural methods)
- Stemming/Lemmatization: Reducing words to base forms (for non-neural methods)
- Special Character Handling: Deciding how to treat punctuation, numbers, etc.
- Handling Out-of-Vocabulary Words: Creating strategies for unseen words
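The classic (non-neural) pipeline can be sketched in a few lines. A minimal example using NLTK; the regex tokenizer and WordNet lemmatizer are one reasonable choice among many, and the required data downloads are shown explicitly:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text):
    tokenizer = RegexpTokenizer(r"[a-z]+")       # tokenize and drop punctuation/numbers
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    tokens = tokenizer.tokenize(text.lower())                # lowercase + tokenize
    tokens = [t for t in tokens if t not in stop_words]      # stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]         # lemmatization

print(preprocess("The cats were sitting on the mats, watching 3 dogs!"))
# ['cat', 'sitting', 'mat', 'watching', 'dog']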
Evaluation Metrics
When comparing text representation techniques, consider these metrics:
- Intrinsic Evaluation:
  - Word/Sentence Similarity Correlation
  - Analogy Task Accuracy
  - Word/Document Clustering Quality
- Extrinsic Evaluation:
  - Downstream Task Performance
  - Classification Accuracy
  - Retrieval Precision/Recall
  - Machine Translation BLEU Scores
Hybrid Approaches
Often, the best solution combines multiple techniques:
- Ensemble Methods: Using multiple representation types and combining predictions
- Feature Stacking: Concatenating different embeddings
- Task-Specific Fine-Tuning: Starting with pre-trained embeddings and adapting to domain
- Multi-level Representations: Using word, sentence, and document embeddings together
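As a concrete example of feature stacking, TF-IDF vectors and averaged word embeddings can simply be concatenated into one feature matrix before training a classifier. A sketch under the assumption that a raw text corpus and a trained gensim Word2Vec model are already available:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def stack_features(corpus, w2v_model):
    # Sparse lexical features: TF-IDF over the corpus
    tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

    # Dense semantic features: average of the word vectors found in each document
    dim = w2v_model.vector_size
    dense = np.zeros((len(corpus), dim))
    for row, doc in enumerate(corpus):
        vectors = [w2v_model.wv[w] for w in doc.lower().split() if w in w2v_model.wv]
        if vectors:
            dense[row] = np.mean(vectors, axis=0)

    # Concatenate both views into a single feature matrix
    return np.hstack([tfidf, dense])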
Conclusion
Text representation has evolved dramatically from simple one-hot encoding to sophisticated contextual embedding models. Each technique offers unique trade-offs between simplicity, computational efficiency, and semantic understanding.
For practical applications:
- Consider your computational constraints
- Evaluate the importance of contextual understanding
- Assess the availability of training data
- Balance accuracy requirements against implementation complexity
The field continues to advance rapidly, with contextual embeddings and their efficient derivatives currently representing the state-of-the-art for most applications. However, simpler techniques like TF-IDF and non-contextual word embeddings remain valuable for specific use cases, especially when computational resources are limited or when interpretability is important.
By understanding the full spectrum of text representation techniques, NLP practitioners can make informed choices for their specific applications, leading to more effective and efficient text processing systems.