Introduction
In the realm of unsupervised machine learning, clustering algorithms play a pivotal role in identifying patterns, grouping similar data points, and detecting anomalies within datasets. Among these algorithms, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) stands out for its ability to discover clusters of arbitrary shapes and effectively identify outliers without requiring a predefined number of clusters.
This case study explores the application of DBSCAN for clustering and anomaly detection across various datasets, highlighting its strengths, limitations, and practical implementations. We’ll examine how DBSCAN performs in different scenarios, from well-separated clusters to complex datasets with varying densities and noise levels.
Understanding DBSCAN
Algorithmic Foundation
DBSCAN, introduced by Ester, Kriegel, Sander, and Xu in 1996, is a density-based clustering algorithm that groups together points lying in densely populated regions of the feature space. Unlike centroid-based methods such as K-means, DBSCAN defines clusters as dense regions separated by sparser areas.
The algorithm relies on two key parameters:
- ε (epsilon): The radius that defines the neighborhood of a point
- MinPts: The minimum number of points required to form a dense region
DBSCAN categorizes points into three types:
- Core points: Points with at least MinPts neighbors within distance ε
- Border points: Points within distance ε of a core point but with fewer than MinPts neighbors
- Noise points: Points that are neither core nor border points
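Once a model is fitted, scikit-learn exposes enough information to recover these three roles. A minimal sketch, assuming X is the data matrix and dbscan is an already-fitted sklearn.cluster.DBSCAN estimator:
import numpy as np
# labels_ holds the cluster assignment (-1 = noise);
# core_sample_indices_ lists the indices of core points.
labels = dbscan.labels_
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask  # assigned to a cluster, but not core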
Advantages of DBSCAN
- Does not require specifying the number of clusters beforehand
- Can find arbitrarily shaped clusters
- Robust to outliers
- Can identify noise points (potential anomalies)
- Does not assume clusters are convex shaped
- Only requires two parameters (ε and MinPts)
Limitations
- Struggles with clusters of varying densities
- Sensitive to parameter selection
- Can have difficulty with high-dimensional data due to the “curse of dimensionality”
- May have performance issues with large datasets
Implementation Framework
For our case studies, we’ll use Python with scikit-learn’s implementation of DBSCAN, supported by NumPy for numerical operations and Matplotlib for visualization.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
Parameter Selection Methodology
Selecting appropriate values for ε and MinPts is crucial for DBSCAN’s performance. We’ll employ the following strategies:
- K-distance graph: Plot the distance to the k-th nearest neighbor for each point in sorted order. The “knee” or “elbow” in this plot suggests a good value for ε.
def find_epsilon(X, k):
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(X)
    distances, indices = neigh.kneighbors(X)
    distances = np.sort(distances[:, k-1])
    plt.figure(figsize=(10, 6))
    plt.plot(range(len(distances)), distances)
    plt.xlabel('Points sorted by distance')
    plt.ylabel(f'Distance to {k}th nearest neighbor')
    plt.title('K-distance Graph')
    plt.grid(True)
    return distances
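Reading the elbow off the plot is subjective. If an automated starting point is useful, one simple dependency-free heuristic (a sketch added here, not part of the original workflow) picks the point on the sorted k-distance curve that lies farthest from the straight line joining its endpoints:
def suggest_epsilon(distances):
    # Perpendicular distance of each point on the sorted k-distance curve
    # from the line through its first and last points; the farthest point
    # is a reasonable "knee" candidate.
    d = np.asarray(distances)
    n = len(d)
    x = np.arange(n)
    x0, y0, x1, y1 = 0, d[0], n - 1, d[-1]
    num = np.abs((y1 - y0) * x - (x1 - x0) * d + x1 * y0 - y1 * x0)
    den = np.hypot(y1 - y0, x1 - x0)
    return d[np.argmax(num / den)]
# Hypothetical usage: eps_candidate = suggest_epsilon(find_epsilon(X_scaled, k=5))
The value it returns should still be sanity-checked against the plot and domain knowledge.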
- Rule of thumb for MinPts: A common starting point is MinPts = 2*d, where d is the dimensionality of the data (so MinPts = 4 for 2D data).
Case Study 1: Synthetic Dataset with Well-Defined Clusters
Dataset Generation
We’ll start with a synthetic dataset containing well-defined clusters of different shapes to demonstrate DBSCAN’s ability to identify non-convex clusters.
from sklearn.datasets import make_moons, make_blobs
# Generate two interleaving half circles
X1, y1 = make_moons(n_samples=300, noise=0.05, random_state=42)
# Generate three isotropic blobs
X2, y2 = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
# Combine datasets
X = np.vstack([X1, X2])
# Standardize features
X_scaled = StandardScaler().fit_transform(X)
plt.figure(figsize=(10, 8))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=50, alpha=0.7)
plt.title('Synthetic Dataset with Mixed Cluster Shapes')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
Parameter Selection
Using the k-distance graph to find an appropriate ε value:
distances = find_epsilon(X_scaled, k=5)
# Let's say we identified epsilon = 0.2 from the plot
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
# Count number of clusters and noise points
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = list(clusters).count(-1)
Results Visualization
plt.figure(figsize=(12, 8))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis',
s=50, alpha=0.7, marker='o')
# Highlight noise points
noise_mask = clusters == -1
plt.scatter(X_scaled[noise_mask, 0], X_scaled[noise_mask, 1],
s=80, facecolors='none', edgecolors='red', label='Noise')
plt.title(f'DBSCAN Clustering (ε=0.2, MinPts=5): {n_clusters} clusters, {n_noise} noise points')
plt.legend()
plt.grid(True)
Analysis
For this dataset, DBSCAN successfully:
- Identified the half-moon shapes as distinct clusters
- Properly grouped the blob-shaped clusters
- Distinguished outliers that don’t belong to any dense region
This demonstrates DBSCAN’s ability to identify clusters of arbitrary shapes, unlike K-means which would struggle with the non-convex half-moon clusters.
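To make that comparison concrete, here is a quick illustrative side-by-side; k=5 is an assumption chosen because the synthetic data combines two moons and three blobs:
from sklearn.cluster import KMeans
# Even with the "right" number of clusters, K-means splits the non-convex
# moons along straight boundaries because it assumes roughly spherical clusters.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis', s=30, alpha=0.7)
axes[0].set_title('DBSCAN')
axes[1].scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans_labels, cmap='viridis', s=30, alpha=0.7)
axes[1].set_title('K-means (k=5)')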
Case Study 2: Credit Card Fraud Detection
Dataset Overview
Credit card transactions provide a real-world application for anomaly detection. We’ll use a modified version of the Credit Card Fraud Detection dataset from Kaggle, focusing on identifying fraudulent transactions as anomalies.
# Load dataset (assuming it's already downloaded)
cc_data = pd.read_csv('creditcard.csv')
# The dataset contains legitimate and fraudulent transactions
# Features V1-V28 are PCA-transformed to protect confidentiality
# Separate features and target
X_cc = cc_data.drop('Class', axis=1)
y_cc = cc_data['Class'] # 0: legitimate, 1: fraud
# Scale the features
X_cc_scaled = StandardScaler().fit_transform(X_cc)
Dimensionality Reduction for Visualization
Since the dataset is high-dimensional, we’ll use PCA to reduce it to two dimensions for visualization:
pca = PCA(n_components=2)
X_cc_pca = pca.fit_transform(X_cc_scaled)
plt.figure(figsize=(10, 8))
plt.scatter(X_cc_pca[:, 0], X_cc_pca[:, 1], c=y_cc, cmap='coolwarm',
s=5, alpha=0.5)
plt.title('Credit Card Transactions (PCA-reduced)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='Class (0: legitimate, 1: fraud)')
DBSCAN for Fraud Detection
# Finding optimal eps using k-distance graph
k_distances = find_epsilon(X_cc_scaled, k=15)
# Let's say we identified eps=0.5 from the plot
# Apply DBSCAN
dbscan_cc = DBSCAN(eps=0.5, min_samples=15)
clusters_cc = dbscan_cc.fit_predict(X_cc_scaled)
# Evaluate fraudulent transactions classified as anomalies
noise_indices = np.where(clusters_cc == -1)[0]
correctly_identified_fraud = sum(y_cc.iloc[noise_indices] == 1)
total_fraud = sum(y_cc == 1)
fraud_detection_rate = correctly_identified_fraud / total_fraud
Results Analysis
print(f"Total transactions: {len(X_cc)}")
print(f"Legitimate transactions: {sum(y_cc == 0)}")
print(f"Fraudulent transactions: {sum(y_cc == 1)}")
print(f"Transactions classified as anomalies: {len(noise_indices)}")
print(f"Fraudulent transactions correctly identified as anomalies: {correctly_identified_fraud}")
print(f"Fraud detection rate: {fraud_detection_rate:.2%}")
Visualization with PCA
plt.figure(figsize=(12, 8))
# Plot all points with their cluster labels
plt.scatter(X_cc_pca[:, 0], X_cc_pca[:, 1], c=clusters_cc, cmap='viridis',
s=5, alpha=0.5, marker='o')
# Highlight noise points
noise_mask_cc = clusters_cc == -1
plt.scatter(X_cc_pca[noise_mask_cc, 0], X_cc_pca[noise_mask_cc, 1],
s=20, facecolors='none', edgecolors='red', alpha=0.7, label='Anomalies')
# Highlight actual fraud cases
fraud_mask = y_cc == 1
plt.scatter(X_cc_pca[fraud_mask, 0], X_cc_pca[fraud_mask, 1],
s=30, marker='x', c='black', alpha=1, label='Actual Fraud')
plt.title('DBSCAN for Credit Card Fraud Detection')
plt.legend()
plt.xlabel('PC1')
plt.ylabel('PC2')
Discussion
This case study demonstrates DBSCAN’s potential for anomaly detection in financial transactions. However, it also reveals some challenges:
- The imbalanced nature of fraud data (typically <1% fraudulent transactions)
- The need for careful parameter tuning to achieve optimal results
- The trade-off between detecting all fraud cases and minimizing false positives
A comparison with other anomaly detection techniques (isolation forests, one-class SVM) would be valuable to assess DBSCAN’s efficacy for this specific use case.
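As a rough illustration of what such a comparison might look like, the sketch below runs scikit-learn's IsolationForest with a contamination rate matched to the fraction of points DBSCAN flagged as noise; the threshold choice is an assumption for illustration, not a tuned baseline:
from sklearn.ensemble import IsolationForest
# Contamination matched to DBSCAN's noise fraction, clipped to a valid range.
contamination = min(max(len(noise_indices) / len(X_cc_scaled), 1e-4), 0.5)
iso = IsolationForest(contamination=contamination, random_state=42)
iso_pred = iso.fit_predict(X_cc_scaled)  # -1 = anomaly, 1 = normal
iso_anomalies = np.where(iso_pred == -1)[0]
iso_fraud_hits = sum(y_cc.iloc[iso_anomalies] == 1)
print(f"IsolationForest fraud detection rate: {iso_fraud_hits / total_fraud:.2%}")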
Case Study 3: Geographic Clustering for Market Segmentation
Dataset Description
For this case study, we’ll use a dataset containing customer locations (latitude and longitude) along with purchase frequency and average transaction values.
# Generate synthetic customer data
np.random.seed(42)
n_customers = 1000
# Generate geographic clusters in different cities/regions
locations = np.vstack([
np.random.normal(loc=[40.7, -74.0], scale=0.05, size=(300, 2)), # NYC
np.random.normal(loc=[34.1, -118.2], scale=0.08, size=(350, 2)), # LA
np.random.normal(loc=[41.9, -87.6], scale=0.06, size=(250, 2)), # Chicago
np.random.normal(loc=[37.8, -122.4], scale=0.04, size=(100, 2)), # SF
])
# Generate purchase behavior data
purchase_freq = np.random.gamma(shape=2, scale=2, size=n_customers)
avg_purchase = np.random.gamma(shape=5, scale=10, size=n_customers)
# Create a DataFrame
geo_data = pd.DataFrame({
'latitude': locations[:, 0],
'longitude': locations[:, 1],
'purchase_frequency': purchase_freq,
'avg_purchase': avg_purchase
})
# Scale the data for clustering
geo_features = geo_data[['latitude', 'longitude', 'purchase_frequency', 'avg_purchase']]
geo_scaled = StandardScaler().fit_transform(geo_features)
Parameter Selection and DBSCAN Application
# Find optimal epsilon
geo_distances = find_epsilon(geo_scaled, k=10)
# Let's say we determine eps=0.3 is optimal
# Apply DBSCAN
dbscan_geo = DBSCAN(eps=0.3, min_samples=10)
clusters_geo = dbscan_geo.fit_predict(geo_scaled)
# Add cluster information to the dataframe
geo_data['cluster'] = clusters_geo
# Calculate cluster statistics
cluster_stats = geo_data.groupby('cluster').agg({
'purchase_frequency': ['mean', 'std'],
'avg_purchase': ['mean', 'std'],
'latitude': 'count'
}).rename(columns={'latitude': 'count'})
print(cluster_stats)
Geographic Visualization
plt.figure(figsize=(14, 10))
# Create a scatter plot of customer locations colored by cluster
scatter = plt.scatter(geo_data['longitude'], geo_data['latitude'],
c=geo_data['cluster'], cmap='viridis',
s=40, alpha=0.7)
# Highlight noise points
noise_geo = geo_data[geo_data['cluster'] == -1]
plt.scatter(noise_geo['longitude'], noise_geo['latitude'],
s=80, facecolors='none', edgecolors='red', label='Outliers')
plt.colorbar(scatter, label='Cluster')
plt.title('Geographic Customer Clusters')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.grid(True)
Market Segmentation Analysis
We can extend the analysis by incorporating the purchase behavior data to develop targeted marketing strategies for each cluster:
# Visualize purchase behavior by cluster
plt.figure(figsize=(12, 8))
# Filter out noise points for clearer visualization
valid_clusters = geo_data[geo_data['cluster'] != -1]
plt.scatter(valid_clusters['purchase_frequency'], valid_clusters['avg_purchase'],
c=valid_clusters['cluster'], cmap='viridis', s=50, alpha=0.7)
plt.title('Purchase Behavior by Customer Cluster')
plt.xlabel('Purchase Frequency')
plt.ylabel('Average Purchase Value ($)')
plt.colorbar(label='Cluster')
plt.grid(True)
Discussion
This case study demonstrates DBSCAN’s utility for geographic market segmentation. The algorithm effectively:
- Identified distinct regional customer clusters
- Detected outliers that might represent remote customers or data errors
- Enabled segment-specific analysis of purchase behaviors
Marketing teams could use these insights to:
- Develop location-based promotions for different regions
- Create targeted campaigns based on purchase behavior patterns
- Identify potential expansion opportunities in areas with high customer density
Case Study 4: Anomaly Detection in Network Traffic
Dataset Description
For our final case study, we’ll apply DBSCAN to network traffic data to identify potential security threats and anomalous behavior patterns.
# Generate synthetic network traffic data
np.random.seed(42)
n_connections = 5000
# Normal traffic features (connections per minute, bytes transferred, packet size)
normal_traffic = np.vstack([
np.random.poisson(lam=20, size=4900), # connections per minute
np.random.normal(loc=5000, scale=1000, size=4900), # bytes transferred
np.random.normal(loc=1024, scale=200, size=4900) # avg packet size
]).T
# Anomalous traffic (potential DoS attack, data exfiltration, port scanning)
anomalous_traffic = np.vstack([
np.random.poisson(lam=100, size=100), # high connection rate
np.random.normal(loc=30000, scale=5000, size=100), # unusual data volume
np.random.normal(loc=256, scale=50, size=100) # small packets
]).T
# Combine and shuffle
network_data = np.vstack([normal_traffic, anomalous_traffic])
np.random.shuffle(network_data)
network_df = pd.DataFrame(network_data, columns=['conn_per_min', 'bytes', 'packet_size'])
# Label the data (for evaluation only, not used in clustering)
network_df['is_anomaly'] = 0
network_df.loc[network_df['conn_per_min'] > 50, 'is_anomaly'] = 1
network_df.loc[network_df['bytes'] > 15000, 'is_anomaly'] = 1
network_df.loc[network_df['packet_size'] < 500, 'is_anomaly'] = 1
# Scale features
network_scaled = StandardScaler().fit_transform(network_df.iloc[:, :3])
DBSCAN for Network Anomaly Detection
# Find appropriate epsilon
net_distances = find_epsilon(network_scaled, k=20)
# Let's say we determine eps=0.3 is appropriate
# Apply DBSCAN
dbscan_net = DBSCAN(eps=0.3, min_samples=20)
clusters_net = dbscan_net.fit_predict(network_scaled)
# Add clustering results to DataFrame
network_df['cluster'] = clusters_net
# Evaluate anomaly detection performance
predicted_anomalies = (clusters_net == -1).astype(int)
actual_anomalies = network_df['is_anomaly'].values
true_positives = sum((predicted_anomalies == 1) & (actual_anomalies == 1))
false_positives = sum((predicted_anomalies == 1) & (actual_anomalies == 0))
true_negatives = sum((predicted_anomalies == 0) & (actual_anomalies == 0))
false_negatives = sum((predicted_anomalies == 0) & (actual_anomalies == 1))
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_score = 2 * (precision * recall) / (precision + recall)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1_score:.4f}")
Visualization
# 3D visualization
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(network_df['conn_per_min'], network_df['bytes'],
network_df['packet_size'], c=network_df['cluster'],
cmap='viridis', s=50, alpha=0.7)
# Mark actual anomalies with 'x'
anomalies = network_df[network_df['is_anomaly'] == 1]
ax.scatter(anomalies['conn_per_min'], anomalies['bytes'], anomalies['packet_size'],
marker='x', c='red', s=100, label='Actual Anomalies')
ax.set_xlabel('Connections per Minute')
ax.set_ylabel('Bytes Transferred')
ax.set_zlabel('Avg Packet Size')
ax.set_title('Network Traffic Clusters with DBSCAN')
plt.colorbar(scatter, label='Cluster')
plt.legend()
Discussion
This case study highlights DBSCAN’s effectiveness for network security monitoring:
- It successfully identified unusual traffic patterns that deviate from the norm
- The algorithm doesn’t require labeled training data, making it suitable for detecting novel attacks
- The precision-recall trade-off can be adjusted by modifying the ε and MinPts parameters
However, DBSCAN also faces challenges in this domain:
- Network traffic patterns can change over time, requiring adaptive parameter settings
- Some sophisticated attacks might mimic normal traffic patterns, evading detection
- High-volume traffic requires efficient implementations of DBSCAN
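One pragmatic response to the last point is to cluster a random sample and then assign the remaining points to the nearest core sample within ε. The sketch below is an assumption-laden illustration of that idea (it presumes the sample is large enough to contain core points), not a drop-in replacement for full DBSCAN:
# Cluster a sample, then map the rest to the nearest core sample within eps;
# anything farther away is treated as noise.
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(network_scaled), size=2000, replace=False)
dbscan_sample = DBSCAN(eps=0.3, min_samples=20, n_jobs=-1)
sample_labels = dbscan_sample.fit_predict(network_scaled[sample_idx])
core_points = network_scaled[sample_idx][dbscan_sample.core_sample_indices_]
core_labels = sample_labels[dbscan_sample.core_sample_indices_]
nn = NearestNeighbors(n_neighbors=1).fit(core_points)
dist, nearest = nn.kneighbors(network_scaled)
full_labels = np.where(dist[:, 0] <= 0.3, core_labels[nearest[:, 0]], -1)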
Comparative Analysis and Best Practices
After applying DBSCAN across various datasets, we can draw some general conclusions and best practices:
Parameter Selection Strategies
- ε (epsilon) selection:
- The k-distance plot provides a data-driven approach for selecting ε
- Domain knowledge should inform the selection process
- Multiple runs with different values may be necessary
- MinPts selection:
- Higher values provide more robust clusters but may miss smaller clusters
- A common rule of thumb: MinPts ≥ dimension + 1
- For anomaly detection, larger MinPts values often yield better results
DBSCAN vs. Other Clustering Algorithms
| Aspect | DBSCAN | K-means | Hierarchical Clustering |
| --- | --- | --- | --- |
| Cluster shape | Arbitrary | Spherical | Depends on linkage |
| Number of clusters | Automatic | Predefined | Can be determined post-hoc |
| Noise handling | Explicit | None | Limited |
| Scalability | O(n²), or O(n log n) with spatial indexing | O(nki) | O(n²) to O(n³) |
| Varying densities | Limited | Moderate | Moderate |
When to Use DBSCAN
DBSCAN is particularly suitable when:
- The number of clusters is unknown
- Clusters may have irregular shapes
- The dataset contains noise and outliers
- The goal includes anomaly detection
When to Consider Alternatives
Consider other algorithms when:
- Clusters have significantly different densities
- The dataset is very high-dimensional
- Computational efficiency is critical for very large datasets
- Hierarchical relationships between clusters are important
Advanced Techniques and Extensions
HDBSCAN: Hierarchical DBSCAN
HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, addressing some of DBSCAN’s limitations:
- Better handles varying density clusters
- Eliminates the need to choose ε
- Provides a hierarchical view of the cluster structure
from hdbscan import HDBSCAN
hdbscan_clusterer = HDBSCAN(min_cluster_size=15, min_samples=5)
hdbscan_clusters = hdbscan_clusterer.fit_predict(X_scaled)
OPTICS: Ordering Points To Identify Clustering Structure
OPTICS is another density-based algorithm that addresses DBSCAN’s parameter sensitivity:
- Creates an ordering of points
- Stores reachability distances
- Allows for extracting clusters at different density levels
from sklearn.cluster import OPTICS
optics_clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)
optics_clusters = optics_clusterer.fit_predict(X_scaled)
Parameter Optimization with Silhouette Score
Automated parameter selection can be achieved using silhouette scores:
def optimize_dbscan_params(X, eps_range, min_samples_range):
    best_score = -1
    best_params = None
    for eps in eps_range:
        for min_samples in min_samples_range:
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            labels = dbscan.fit_predict(X)
            # Silhouette needs at least two clusters (noise excluded)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            if n_clusters < 2:
                continue
            # Calculate silhouette score on non-noise points only
            mask = labels != -1
            score = silhouette_score(X[mask], labels[mask])
            if score > best_score:
                best_score = score
                best_params = (eps, min_samples)
    return best_params, best_score
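A short usage example; the search ranges below are illustrative and should be adapted to the scale of the data:
# Grid search over eps and min_samples (ranges are assumptions)
eps_range = np.arange(0.1, 1.01, 0.1)
min_samples_range = range(3, 21, 2)
best_params, best_score = optimize_dbscan_params(X_scaled, eps_range, min_samples_range)
print(f"Best (eps, min_samples): {best_params}, silhouette score: {best_score:.3f}")
Keep in mind that the silhouette score favors compact, well-separated clusters, so it can undervalue the elongated, arbitrarily shaped clusters DBSCAN is designed to find; treat the result as a starting point rather than a final answer.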
Conclusion
This case study has explored the application of DBSCAN for clustering and anomaly detection across diverse datasets. The algorithm’s ability to identify arbitrarily shaped clusters and detect outliers makes it a valuable tool in the data scientist’s toolkit.
Key takeaways include:
- DBSCAN excels at identifying clusters without prior knowledge of their number or shape
- Parameter selection is critical and should be informed by both data characteristics and domain knowledge
- The algorithm is particularly effective for anomaly detection in various domains
- Extensions like HDBSCAN and OPTICS address some of DBSCAN’s limitations
For practitioners, we recommend:
- Using visualization techniques to understand data distribution before applying DBSCAN
- Employing systematic parameter selection methods like k-distance plots
- Comparing results with other clustering algorithms when appropriate
- Considering domain-specific evaluation metrics beyond general clustering quality measures
By leveraging DBSCAN’s strengths while being mindful of its limitations, data scientists can extract valuable insights from complex, noisy datasets across numerous application domains.