Biological datasets are almost always high-dimensional: 20,000 genes measured per sample, 30,000 genes per cell, millions of genetic variants per individual. Human intuition doesn't extend beyond three dimensions, and statistical methods struggle with the curse of dimensionality. Dimensionality reduction and clustering are the tools for finding structure in this high-dimensional space.
These methods are foundational to modern bioinformatics. Every single-cell RNA-seq analysis uses them. Every bulk transcriptomics paper includes a PCA plot. Understanding what these algorithms actually do — and where they mislead — is essential for interpreting results.
The Core Problem
You have a matrix: n samples × p features (where p might be 20,000 genes). You want to:
- Visualize the data to discover structure
- Cluster samples or cells into groups with similar expression patterns
- Reduce noise by focusing on the principal axes of variation
The fundamental challenge: high-dimensional geometry is counterintuitive. In 20,000 dimensions, all points tend to be equidistant from each other, and the volume of space grows exponentially with dimensions. Methods that work in 2D may fail catastrophically at genomic scale.
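To make distance concentration concrete, here is a small simulation (random Gaussian points, not real expression data) showing how the spread of pairwise distances collapses as dimensionality grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_spread(n_points: int, n_dims: int) -> float:
    """(max - min) / mean of pairwise Euclidean distances between random points."""
    X = rng.standard_normal((n_points, n_dims))
    d = pdist(X)  # condensed vector of all pairwise distances
    return (d.max() - d.min()) / d.mean()

# As dimensionality grows, distances concentrate: points become nearly equidistant
for dims in (2, 20, 2_000, 20_000):
    print(f"{dims:>6} dims: relative spread = {relative_spread(100, dims):.3f}")
```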
Principal Component Analysis (PCA)
PCA is the workhorse linear dimensionality reduction method. It finds the directions (principal components) of maximum variance in the data.
The math: PCA solves for the eigenvectors of the covariance matrix of the data. The first principal component (PC1) is the direction of maximum variance; PC2 is the direction of maximum variance orthogonal to PC1; and so on.
Concretely: given a samples × genes matrix X (mean-centered), PCA finds a rotation W such that XW = scores (coordinates in PC space), and the variance of each score column is maximized (decreasing from PC1 to PC2 to PC3...).
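A minimal numpy sketch of that definition, checked against scikit-learn on toy data (each component is only defined up to a sign flip):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))        # toy samples x genes matrix
Xc = X - X.mean(axis=0)                   # mean-center each gene

# Eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # re-sort descending by variance
W = eigvecs[:, order[:2]]                 # top-2 PCs as columns of W
scores = Xc @ W                           # coordinates in PC space

# Matches scikit-learn up to a sign flip per component
pca = PCA(n_components=2).fit(Xc)
assert np.allclose(np.abs(scores), np.abs(pca.transform(Xc)), atol=1e-6)
```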
What it preserves: global structure. Points that are far apart in the original space remain far apart in the top PCs; because the projection is linear, large-scale geometry is approximately preserved (exactly preserved only if all components are kept).
What it loses: fine-grained local structure. PCA projects linearly — it cannot capture non-linear structure (curved manifolds in gene expression space).
PCA in Transcriptomics
Input: normalized log-counts matrix (samples × genes)
↓
Filter to highly variable genes (top 2000–5000)
↓
Scale to unit variance (optional but common)
↓
PCA on gene expression matrix
↓
PC coordinates used for:
- Quality control (detect batch effects, outliers)
- Sample relationship visualization
- Input to downstream clustering/UMAP
PC1 and PC2 plot: samples close together in PC space have similar global expression profiles. In a well-designed experiment, samples should cluster by biological condition, not by batch.
Variance explained: each PC explains a fraction of total variance. A scree plot shows the "elbow" — how many PCs capture most variation. Typically the first 10–50 PCs are informative; the rest is noise.
Batch effect detection: if PC1 separates samples by sequencing date rather than by biology, you have a batch effect. Check for this before any biological interpretation.
For RNA-seq: always log-transform counts before PCA (log2(CPM + 1) or DESeq2's variance-stabilizing transformation). Raw counts violate the assumptions PCA relies on — highly expressed genes dominate the covariance matrix and obscure biological structure. Also filter to highly variable genes; noise from stably expressed genes dilutes signal.
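A sketch of that preprocessing in Python, using a hypothetical counts matrix and a plain variance ranking for gene selection (Scanpy/Seurat use a mean-variance-adjusted selection):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical raw count matrix, samples x genes
counts = np.random.default_rng(1).poisson(5, size=(24, 20_000)).astype(float)

# log2(CPM + 1): counts-per-million, then log transform
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
logcpm = np.log2(cpm + 1)

# Keep the top 2,000 most variable genes (simple variance ranking here)
hvg = np.argsort(logcpm.var(axis=0))[::-1][:2_000]
X = logcpm[:, hvg]

pca = PCA(n_components=10).fit(X)
print("variance explained per PC:", pca.explained_variance_ratio_.round(3))
```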
Interpreting PC Loadings
Each PC is a linear combination of all genes. The "loadings" (gene coefficients for each PC) tell you which genes drive the separation:
- Genes with large positive loadings on PC1 are highly expressed in samples with high PC1 scores
- Genes with large negative loadings on PC1 are highly expressed in samples with low PC1 scores
This enables biological interpretation: if PC1 separates tumor from normal samples, its loadings identify the genes most responsible for that separation. These are candidates for further investigation.
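With scikit-learn, the loadings are the rows of components_; a toy sketch with hypothetical gene names:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((24, 2_000))      # toy log-expression, samples x genes
gene_names = np.array([f"gene{i}" for i in range(X.shape[1])])  # hypothetical

pca = PCA(n_components=2).fit(X)

# Each row of components_ holds one PC's loadings: one coefficient per gene
pc1 = pca.components_[0]
order = np.argsort(pc1)
print("high in high-PC1 samples:", gene_names[order[::-1][:10]])
print("high in low-PC1 samples: ", gene_names[order[:10]])
```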
t-SNE: Visualizing Local Structure
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction method designed specifically for visualization. It produces 2D embeddings where similar points are close together.
The algorithm:
- Compute pairwise similarities in high-dimensional space (using Gaussian kernels)
- Define target pairwise similarities in 2D (using t-distribution with heavy tails)
- Optimize 2D coordinates to minimize KL divergence between the two similarity distributions
The heavy-tailed t-distribution in 2D is the key innovation: it prevents crowding (all similar points collapsing to one spot) by allowing moderate-distance points to be mapped to greater distances in 2D.
What t-SNE preserves: local neighborhood structure. Points that are close together in high-dimensional space end up close in 2D.
What t-SNE does NOT preserve:
- Global distances. Clusters far apart in t-SNE may or may not be far apart in expression space
- Cluster sizes. A cluster's apparent area in t-SNE does not reflect how many cells it contains or how variable their expression is
- Distances between clusters are not interpretable
In a t-SNE plot of single-cell data, the distance between cluster A and cluster B tells you nothing reliable about how different those cell types are transcriptomically. Two clusters that look far apart might be more similar to each other than two clusters that look adjacent. Use t-SNE for visualizing cluster membership, not for quantifying relationships.
Perplexity parameter: controls the effective number of neighbors considered for each point. Low perplexity (5–10) captures very local structure; high perplexity (50–100) captures more global structure. Try multiple values — the "best" t-SNE depends on your question.
Stochasticity: t-SNE uses random initialization, so two runs on the same data will look different unless the seed is fixed. Use a fixed seed and PCA initialization for reproducibility and a more stable global arrangement.
Computational cost: O(n²) naive; Barnes-Hut approximation reduces to O(n log n). Still slow for >100,000 cells.
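A minimal scikit-learn sketch putting these recommendations together (PCA initialization, fixed seed, several perplexities):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))   # e.g., cells in 50-PC space

# PCA initialization plus a fixed seed makes runs reproducible; each
# perplexity emphasizes different structure, so inspect several.
for perplexity in (5, 30, 100):
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape {emb.shape}")
```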
UMAP: Better Topology Preservation
UMAP (Uniform Manifold Approximation and Projection) has largely replaced t-SNE for single-cell analysis. It is faster, scales better, and preserves more global structure.
The mathematical basis: UMAP is grounded in topological data analysis and Riemannian geometry. It models the data as lying on a low-dimensional manifold and constructs a fuzzy topological representation, then optimizes a 2D embedding to match this representation.
Advantages over t-SNE:
- Faster (often 10–100× for large datasets)
- Better preservation of global structure (distances between clusters more interpretable)
- Deterministic with fixed random seed
- Scales to millions of cells
Key parameters:
- n_neighbors (15–50): controls the balance between local and global structure. Small values → focus on local neighborhoods; large values → capture more global topology
- min_dist (0.0–1.0): minimum distance between points in the 2D embedding. Low values → tighter clusters; high values → more uniform distribution
- n_components: dimensionality of the output (usually 2 for visualization; sometimes 10–30 as an intermediate representation)
UMAP is still non-linear: like t-SNE, UMAP distances between distant clusters are not perfectly interpretable. But within and between nearby clusters, the structure is more reliable than t-SNE.
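A minimal sketch with the umap-learn package, assuming the input is a cells × PCs matrix:

```python
import numpy as np
import umap  # umap-learn package

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 50))   # e.g., cells in 50-PC space

# Fixing random_state makes the embedding reproducible
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1,
                    n_components=2, random_state=0)
embedding = reducer.fit_transform(X)
print(embedding.shape)   # (1000, 2)
```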
Single-Cell Analysis Pipeline (Scanpy/Seurat)
Raw count matrix (cells × genes)
↓
Quality filtering (min genes/cell, max mitochondrial %)
↓
Normalization + log transform
↓
Identify highly variable genes
↓
PCA (50 PCs)
↓
k-NN graph in PC space (n_neighbors = 15)
↓
UMAP on k-NN graph
↓
Leiden/Louvain clustering on k-NN graph
↓
Cell type annotation (marker genes, reference datasets)
A key insight: UMAP is typically computed from the k-nearest-neighbor (k-NN) graph, not directly from raw expression. Clustering is computed from that same k-NN graph, so the UMAP layout and the cluster assignments are derived from the same underlying graph structure.
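A compact Scanpy sketch of this pipeline (the demo dataset and QC thresholds are placeholders to adjust for real data):

```python
import scanpy as sc

# Placeholder data: a small public 10x dataset. For real data, also filter
# on mitochondrial fraction and suspected doublets (omitted here).
adata = sc.datasets.pbmc3k()

sc.pp.filter_cells(adata, min_genes=200)            # QC: drop near-empty droplets
sc.pp.normalize_total(adata, target_sum=1e4)        # counts per 10,000
sc.pp.log1p(adata)                                  # log transform
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)    # k-NN graph in PC space

# UMAP and Leiden both consume the neighbors graph computed above
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=1.0)
sc.pl.umap(adata, color="leiden")
```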
Clustering Methods
Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) showing how samples group together.
Agglomerative (bottom-up): each sample starts as its own cluster; then iteratively merge the two most similar clusters until one remains.
Linkage methods (how distance between clusters is defined):
- Complete linkage: distance = maximum distance between any pair of points from the two clusters. Produces compact, similar-sized clusters.
- Average linkage (UPGMA): distance = average distance between all pairs. Balanced trade-off.
- Ward's method: minimize within-cluster variance at each merge. Often best for gene expression data.
Distance metrics: for expression data, typically:
- Euclidean distance for normalized log-counts
- 1 − Pearson correlation for expression pattern similarity (captures relative shape, not absolute levels)
- Spearman correlation-based distance for robustness to outliers
The dendrogram: cutting at different heights gives different numbers of clusters. The choice is subjective — use domain knowledge and validation metrics.
Heatmaps + hierarchical clustering: the canonical visualization for bulk RNA-seq differential expression results. Genes (rows) and samples (columns) clustered by expression profile. Co-regulated gene modules appear as blocks of similar color.
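A SciPy sketch of agglomerative clustering on a toy expression matrix, showing both Ward on Euclidean distances and average linkage on correlation distance:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expr = rng.standard_normal((30, 500))   # toy log-expression, samples x genes

# Ward linkage on Euclidean distances between samples
Z = linkage(expr, method="ward", metric="euclidean")

# "Cut" the dendrogram into a chosen number of clusters
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)

# Alternative: correlation distance (1 - Pearson) with average linkage (UPGMA)
Z_corr = linkage(pdist(expr, metric="correlation"), method="average")
```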
k-Means Clustering
k-Means partitions n points into k clusters by minimizing within-cluster sum of squared distances to the centroid.
Algorithm:
- Initialize k centroids (randomly or with k-means++ smart initialization)
- Assign each point to its nearest centroid
- Recompute centroids as mean of assigned points
- Repeat until convergence
Limitations:
- Requires specifying k in advance
- Assumes spherical clusters (equal variance in all directions) — fails for elongated or non-convex clusters
- Sensitive to initialization (run multiple times, take best result)
- Poor performance with outliers
Choosing k: plot within-cluster sum of squares vs. k (elbow method); or use silhouette score (measures how well each point fits its cluster vs. the nearest alternative).
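A scikit-learn sketch of this k-selection scan on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 20)) for c in (0, 4, 8)])  # 3 blobs

# Scan k: look for the inertia elbow and the silhouette peak
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:9.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```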
Graph-Based Clustering (Louvain/Leiden)
For single-cell data, graph-based methods are standard:
- Build a k-NN graph: connect each cell to its k nearest neighbors in PCA space
- Weight edges by similarity
- Apply community detection (Louvain or Leiden algorithm) to find communities that maximize modularity
Why this works for scRNA-seq: cells form a manifold in expression space. k-NN graphs capture the local topology of this manifold better than global distance-based methods. Clusters correspond to distinct cell states or types.
Resolution parameter: controls granularity. Higher resolution → more, smaller clusters. Lower resolution → fewer, larger clusters. There is no single "correct" resolution — it depends on the biological question (major lineages vs. fine subtypes).
Leiden (Traag et al. 2019) is an improved version of Louvain that guarantees well-connected communities and is the current recommendation for single-cell clustering. Louvain can produce internally disconnected communities in some cases. For most practical purposes the results are similar, but use Leiden as the default.
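Continuing from the Scanpy sketch above (assumes the neighbors graph has already been built with sc.pp.neighbors), scanning resolutions is one call per value:

```python
import scanpy as sc

# Store each resolution under its own key for side-by-side comparison
for res in (0.3, 0.5, 1.0, 2.0):
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")
    print(f"resolution={res}: {adata.obs[f'leiden_{res}'].nunique()} clusters")
```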
Comparing Dimensionality Reduction Methods
| Method | Type | Preserves | Speed | Best for |
|---|---|---|---|---|
| PCA | Linear | Global variance | Fast | QC, batch detection, input to downstream methods |
| t-SNE | Non-linear | Local neighborhoods | Slow | Visualization (≤100K cells) |
| UMAP | Non-linear | Local + some global | Fast | Visualization + downstream analysis |
| Hierarchical | Clustering | Hierarchical structure | O(n²) space | Heatmaps, small datasets, dendrogram needed |
| k-means | Clustering | Spherical clusters | Fast | Large datasets, well-separated clusters |
| Leiden/Louvain | Graph community | Topological structure | Fast | Single-cell clustering |
Evaluating Clusters
Clustering is unsupervised — there's no ground truth. Evaluation is inherently harder than supervised learning:
Internal metrics (don't require labels):
- Silhouette score: for each point, measures how similar it is to its own cluster vs. nearest other cluster. Range [−1, 1]; higher = better separation.
- Davies-Bouldin index: average ratio of within-cluster scatter to between-cluster distance. Lower = better.
- Calinski-Harabasz index: ratio of between-cluster to within-cluster variance. Higher = better.
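All three are available in scikit-learn; a toy example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 10)) for c in (0, 5)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("silhouette        (higher = better):", silhouette_score(X, labels))
print("Davies-Bouldin    (lower  = better):", davies_bouldin_score(X, labels))
print("Calinski-Harabasz (higher = better):", calinski_harabasz_score(X, labels))
```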
Biological validation (for scRNA-seq):
- Marker genes: do cluster-specific marker genes match known cell type markers?
- Reference dataset integration: do clusters align with annotated reference datasets (CellTypist, Azimuth)?
- Functional coherence: do cells in a cluster respond similarly to perturbations?
- Trajectory analysis: do cluster relationships form biologically plausible developmental paths?
Batch Correction Before Visualization
A common problem: samples processed in different batches cluster by batch in PCA/UMAP rather than by biology.
ComBat (bulk RNA-seq): parametric batch effect correction that adjusts for additive and multiplicative batch effects. Run on normalized log-counts before PCA.
Harmony (single-cell): integrates single-cell datasets by iteratively adjusting PC coordinates to remove batch effects while preserving biological variation.
scVI (deep generative model): learns a latent representation that accounts for batch effects probabilistically.
After batch correction: UMAP and clustering should reflect biology, not technical factors. Always verify with known biological markers — over-correction can merge genuinely different cell types.
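A sketch of Harmony correction in Scanpy (continues the pipeline sketch above; assumes a batch column in adata.obs and the harmonypy package installed):

```python
import scanpy as sc

# Adjusts the PCA embedding for batch; writes adata.obsm["X_pca_harmony"]
sc.external.pp.harmony_integrate(adata, key="batch")

# Rebuild the graph, UMAP, and clustering on the corrected embedding
sc.pp.neighbors(adata, use_rep="X_pca_harmony", n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)
```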
Application: Single-Cell RNA-seq Workflow
The Scanpy/Seurat standard workflow exemplifies how these methods combine:
- QC filtering: remove cells with too few genes (empty droplets), too many genes (doublets), or high mitochondrial fraction (damaged cells)
- Normalization: normalize to 10,000 counts per cell, then log-transform
- Feature selection: keep the top 2,000–5,000 highly variable genes (reduces noise, speeds computation)
- PCA: run on highly variable genes; keep top 50 PCs
- Batch correction (if needed): Harmony on PC embeddings
- k-NN graph: k=15 neighbors in PC space
- Clustering: Leiden at multiple resolutions; choose resolution that matches biological prior
- UMAP: for visualization; run on the same k-NN graph
- Differential expression: between clusters (identifies marker genes)
- Cell type annotation: match markers to reference; confirm with scores
This pipeline is largely automated in Scanpy (sc.pp, sc.tl, sc.pl) and Seurat (FindVariableFeatures, RunPCA, FindNeighbors, RunUMAP).
Common Pitfalls
Treating UMAP distances as meaningful: the most common misinterpretation. Cluster topology in UMAP is informative; inter-cluster distances are not.
Over-clustering: too many clusters split real cell types into arbitrary sub-clusters without biological meaning. Always validate with marker genes.
Under-clustering: too few clusters merge distinct cell populations. Rare cell types (e.g., <5% of cells) may not appear as their own cluster unless the resolution is high enough.
PCA on raw counts: always log-transform first. PCA on raw counts is dominated by highly expressed genes and gives misleading results.
Not filtering highly variable genes: running PCA on all 30,000 genes includes thousands of stably expressed housekeeping genes that add noise without signal.
Ignoring batch effects: batch effects can be stronger than biological signal. Always run PCA first and check for non-biological clustering.