Biological datasets are almost always high-dimensional: 20,000 genes measured per sample, 30,000 genes per cell, millions of genetic variants per individual. Human intuition doesn't extend beyond three dimensions, and statistical methods struggle with the curse of dimensionality. Dimensionality reduction and clustering are the tools for finding structure in this high-dimensional space.
These methods are foundational to modern bioinformatics. Every single-cell analysis uses them. Every bulk transcriptomics paper includes a PCA plot. Understanding what these algorithms actually do, and where they mislead, is essential for interpreting results.
The Core Problem
You have a matrix: n samples × p features (where p might be 20,000 genes). You want to:
- Visualize the data to discover structure
- Cluster samples or genes into groups with similar expression patterns
- Reduce noise by focusing on the principal axes of variation
The fundamental challenge: high-dimensional geometry is counterintuitive. In 20,000 dimensions, all points tend to be equidistant from each other, and the volume of space grows exponentially with dimensions. Methods that work in 2D may fail catastrophically at genomic scale.
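The distance-concentration effect can be checked numerically. The sketch below uses uniformly random points, which is an illustrative assumption (real expression data has correlated features and concentrates less severely), and the helper name `relative_contrast` is made up for this example.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of the distances from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    # Euclidean distances from the first point to every other point
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast between nearest and farthest neighbor shrinks as dimensions grow
for dims in (2, 20, 200, 20_000):
    print(dims, round(relative_contrast(200, dims), 3))
```

A contrast near zero means the nearest and farthest neighbors are almost the same distance away, which is why raw distances become uninformative at genomic scale.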
Principal Component Analysis (PCA)
PCA is the workhorse linear dimensionality reduction method. It finds the directions (principal components) of maximum variance in the data.
The math: PCA solves for the eigenvectors of the covariance matrix of the data. The first principal component (PC1) is the direction of maximum variance; PC2 is the direction of maximum variance orthogonal to PC1; and so on.
Concretely: given a samples × genes matrix X (mean-centered), PCA finds a rotation W such that XW = scores (coordinates in PC space), and the variance of each score column is maximized (decreasing from PC1 to PC2 to PC3...).
What it preserves: global structure. PCA is a linear projection, so points that are far apart along the high-variance directions of the original space remain far apart in PC space.
What it loses: fine-grained local structure. PCA projects linearly, so it cannot capture non-linear structure (curved manifolds in expression space).
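The rotation W and the scores XW can be computed with a singular value decomposition, which yields the same components as an eigendecomposition of the covariance matrix. This is a minimal sketch on toy data; the function name `pca_svd` and the random matrix are assumptions for illustration.

```python
import numpy as np

def pca_svd(X, n_components):
    """Return (scores, W, variances) for a samples x genes matrix X."""
    Xc = X - X.mean(axis=0)                      # mean-center each gene
    # Thin SVD: Xc = U S Vt; rows of Vt are eigenvectors of the covariance matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                      # genes x components rotation
    scores = Xc @ W                              # samples x components
    variances = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, W, variances

rng = np.random.default_rng(0)
# Toy data: 30 samples, 100 genes with decreasing spread
X = rng.normal(size=(30, 100)) * np.linspace(3.0, 0.1, 100)
scores, W, variances = pca_svd(X, n_components=5)
print(variances)  # variance of each score column, largest first
```

The columns of W are orthonormal, so XW is a rotation followed by truncation to the leading components.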
PCA in Transcriptomics
Input: normalized log-counts matrix (samples × genes)
↓
Filter to highly variable genes (top 2000–5000)
↓
Scale to unit variance (optional but common)
↓
PCA on gene expression matrix
↓
PC coordinates used for:
- Quality control (detect batch effects, outliers)
- Sample relationship visualization
- Input to downstream clustering/UMAP
PC1 and PC2 plot: samples close together in PC space have similar global expression profiles. In a well-designed experiment, samples should cluster by biological condition, not by batch.
Variance explained: each PC explains a fraction of total variance. A scree plot shows the "elbow" — how many PCs capture most variation. Typically the first 10–50 PCs are informative; the rest is noise.
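The fraction of variance explained per PC follows directly from the singular values. The sketch below computes it and picks the number of PCs needed to reach a cumulative threshold; the threshold rule and the simulated three-factor data are assumptions for illustration, not a replacement for inspecting the scree plot.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each PC, largest first."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)
    var = S ** 2
    return var / var.sum()

def n_components_for(ratio, threshold=0.9):
    """Smallest number of PCs whose cumulative variance reaches the threshold."""
    cumulative = np.cumsum(ratio)
    return min(int(np.searchsorted(cumulative, threshold)) + 1, len(ratio))

rng = np.random.default_rng(0)
# Toy data: 3 strong latent factors plus weak noise across 200 genes
latent = rng.normal(size=(50, 3))
gene_weights = rng.normal(size=(3, 200)) * 5.0
X = latent @ gene_weights + rng.normal(size=(50, 200)) * 0.1

ratio = explained_variance_ratio(X)
print(np.round(ratio[:5], 4))
print(n_components_for(ratio, threshold=0.95))
```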
Batch effect detection: if PC1 separates samples by date rather than by biology, you have a batch effect. Check for this before any biological interpretation.
For RNA-seq: always log-transform counts before PCA (log2(CPM + 1) or DESeq2's variance-stabilizing transformation). Raw counts violate the assumptions PCA relies on: highly expressed genes dominate the covariance matrix and obscure biological structure. Also filter to highly variable genes; noise from stably expressed genes dilutes signal.
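A minimal sketch of the log2(CPM + 1) transform followed by a variance-based gene filter. Ranking genes by plain variance is a simplification assumed here; Seurat and Scanpy model the mean-variance trend instead. The function names and simulated counts are illustrative.

```python
import numpy as np

def log_cpm(counts):
    """log2(CPM + 1) for a samples x genes matrix of raw counts."""
    library_size = counts.sum(axis=1, keepdims=True)
    cpm = counts / library_size * 1e6
    return np.log2(cpm + 1.0)

def top_variable_genes(log_expr, n_top):
    """Column indices of the n_top genes with the highest variance."""
    variances = log_expr.var(axis=0, ddof=1)
    return np.argsort(variances)[::-1][:n_top]

rng = np.random.default_rng(0)
# Simulated counts: 12 samples x 500 genes with gene-specific mean expression
gene_means = rng.gamma(shape=2.0, scale=50.0, size=500)
counts = rng.poisson(gene_means, size=(12, 500))

log_expr = log_cpm(counts)
hvg = top_variable_genes(log_expr, n_top=100)
pca_input = log_expr[:, hvg]          # matrix that would be passed to PCA
print(pca_input.shape)
```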
Interpreting PC Loadings
Each PC is a linear combination of all genes. The "loadings" (gene coefficients for each PC) tell you which genes drive the separation:
- Genes with large positive loadings on PC1 are highly expressed in samples with high PC1 scores
- Genes with large negative loadings on PC1 are highly expressed in samples with low PC1 scores
This enables biological interpretation: if PC1 separates tumor from normal samples, its loadings identify the genes most responsible for that separation. These genes are candidates for further investigation.
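As an illustration of reading loadings, the sketch below simulates two groups in which five hypothetical genes (indices 0 to 4) are elevated in the "tumor" group, then ranks genes by absolute PC1 loading. The data and group labels are invented for the example. Note that the overall sign of a PC is arbitrary, so only the relationship between loading sign and score sign is meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group, n_genes = 10, 50
X = rng.normal(scale=0.5, size=(2 * n_per_group, n_genes))
is_tumor = np.repeat([False, True], n_per_group)
X[is_tumor, :5] += 4.0                 # hypothetical genes 0-4 are up in "tumor"

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_loadings = Vt[0]                   # one coefficient per gene
pc1_scores = Xc @ pc1_loadings         # one coordinate per sample

# Genes with the largest absolute loading drive the PC1 separation
top_genes = np.argsort(np.abs(pc1_loadings))[::-1][:5]
print(sorted(top_genes.tolist()))
```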
t-SNE: Visualizing Local Structure
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction method designed specifically for visualization. It produces 2D embeddings where similar points are close together.
The algorithm:
- Compute pairwise similarities in high-dimensional space (using Gaussian kernels)
- Define target pairwise similarities in 2D (using t-distribution with heavy tails)
- Optimize 2D coordinates to minimize KL divergence between the two similarity distributions
The heavy-tailed t-distribution in 2D is the key innovation: it prevents crowding (all similar points collapsing to one spot) by allowing moderate-distance points to be mapped to greater distances in 2D.
What t-SNE preserves: local neighborhood structure. Points that are close together in high-dimensional space end up close in 2D.
What t-SNE does NOT preserve:
- Global distances. Clusters far apart in t-SNE may or may not be far apart in expression space
- Cluster sizes. A large cluster in t-SNE may have fewer or more cells than a small one
- Distances between clusters are not interpretable
In a t-SNE plot of single-cell data, the distance between cluster A and cluster B tells you nothing reliable about how different those cell types are transcriptomically. Two clusters that look far apart might be more similar to each other than two that look adjacent. Use t-SNE for visualizing cluster membership, not for quantifying relationships.
Perplexity parameter: controls the effective number of neighbors considered for each point. Low perplexity (5–10) captures very local structure; high perplexity (50–100) captures more global structure. Try multiple values — the "best" t-SNE depends on your question.
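"Effective number of neighbors" has a precise meaning: perplexity is 2 raised to the entropy of a point's neighbor distribution, and the Gaussian bandwidth for each point is tuned until that value matches the target. Below is a minimal sketch of the calibration for a single point, assuming a simple bisection over the bandwidth; the function names are illustrative.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution p."""
    p = p[p > 0]
    return float(2.0 ** (-np.sum(p * np.log2(p))))

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbor probabilities for one point, given squared distances to the others."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(sq_dists, target, n_iter=60):
    """Bisection on a log scale: perplexity grows monotonically with sigma."""
    lo, hi = 1e-4, 1e4
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)              # geometric midpoint
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return float(np.sqrt(lo * hi))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
sq_dists = np.sum((X[1:] - X[0]) ** 2, axis=1)   # distances from point 0 to the rest

for target in (5, 30, 50):
    sigma = sigma_for_perplexity(sq_dists, target)
    print(target, round(sigma, 3))
```

A larger target perplexity forces a wider bandwidth, so each point spreads its similarity over more neighbors.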
Stochasticity: t-SNE uses random initialization and is stochastic, so two runs on the same data will look different unless the seed is fixed. Run it multiple times with different seeds to check that the structure is stable, and use PCA initialization with a fixed seed for reproducibility.
Computational cost: O(n²) naive; Barnes-Hut approximation reduces to O(n log n). Still slow for >100,000 cells.
UMAP: Better Topology Preservation
UMAP (Uniform Manifold Approximation and Projection) has largely replaced t-SNE for single-cell analysis. It is faster, scales better, and preserves more global structure.
The mathematical basis: UMAP is grounded in topological data analysis and Riemannian geometry. It models the data as lying on a low-dimensional manifold and constructs a fuzzy topological representation, then optimizes a 2D embedding to match this representation.
Advantages over t-SNE:
- Faster (often 10–100× for large datasets)
- Better preservation of global structure (distances between clusters more interpretable)
- Deterministic with fixed random seed
- Scales to millions of cells
Key parameters:
- n_neighbors (15–50): controls the balance between local and global structure. Small values → focus on local neighborhoods. Large values → capture more global topology
- min_dist (0.0–1.0): minimum distance between points in the 2D embedding. Low values → tighter clusters; high values → more uniform distribution
- n_components: dimensionality of output (usually 2 for visualization; sometimes 10–30 as intermediate representation)
UMAP is still non-linear: like t-SNE, UMAP distances between distant clusters are not perfectly interpretable. But within clusters and between nearby clusters, the structure is more reliable than t-SNE.
Single-Cell Analysis Pipeline (Scanpy/Seurat)
Raw count matrix (cells × genes)
↓
Quality filtering (min genes/cell, max mitochondrial %)
↓
Normalization + log transform
↓
Identify highly variable genes
↓
PCA (50 PCs)
↓
k-NN graph in PC space (n_neighbors = 15)
↓
UMAP on k-NN graph
↓
Leiden/Louvain clustering on k-NN graph
↓
Cell type annotation (marker genes, reference datasets)
A key insight: UMAP is typically computed from the k-nearest-neighbor (k-NN) graph, not directly from raw expression. Clustering is also computed from the same k-NN graph, so the UMAP layout and the cluster assignments are derived from the same underlying graph structure.
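The shared k-NN graph is simple to construct by brute force. The sketch below is an O(n²) illustration on random stand-in PC coordinates; Scanpy and Seurat use approximate neighbor search and additionally weight the edges, which is not shown here.

```python
import numpy as np

def knn_graph(pcs, k):
    """Indices and distances of the k nearest neighbors of each row (brute force)."""
    sq = np.sum(pcs ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * pcs @ pcs.T, 0.0))
    np.fill_diagonal(D, np.inf)                 # exclude self-matches
    idx = np.argsort(D, axis=1)[:, :k]          # nearest first
    dist = np.take_along_axis(D, idx, axis=1)
    return idx, dist

rng = np.random.default_rng(0)
pcs = rng.normal(size=(100, 50))                # stand-in for 100 cells x 50 PCs
neighbors, distances = knn_graph(pcs, k=15)
print(neighbors.shape, distances.shape)
```

Both the embedding and the clustering would then consume `neighbors` and `distances`, which is why points that cluster together also tend to sit together in the UMAP.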
Clustering Methods
Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) showing how samples group together.
Agglomerative (bottom-up): each sample starts as its own cluster; then iteratively merge the two most similar clusters until one remains.
Linkage methods (how distance between clusters is defined):
- Complete linkage: distance = maximum distance between any pair of points from the two clusters. Produces compact, similar-sized clusters.
- Average linkage (UPGMA): distance = average distance between all pairs. Balanced trade-off.
- Ward's method: minimize the increase in within-cluster variance at each merge. Often best for RNA-seq data.
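The agglomerative procedure with complete or average linkage can be sketched naively as follows. This is a slow illustration for small inputs, stopping at a requested number of clusters rather than building the full dendrogram; Ward's method is omitted, and library implementations should be preferred in practice.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns one integer label per row of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    linkage_fn = {"complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(X))]      # every sample starts alone
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage distance between cluster a and cluster b
                d = linkage_fn(D[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

rng = np.random.default_rng(0)
# Toy data: two well-separated groups of 10 samples each
X = np.vstack([rng.normal(size=(10, 5)), rng.normal(size=(10, 5)) + 10.0])
print(agglomerative(X, n_clusters=2, linkage="complete"))
```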
Distance metrics: for expression data, typically:
- Euclidean distance for normalized log-counts
- 1 − Pearson correlation for expression pattern similarity (captures relative shape, not absolute levels)
- Spearman correlation-based distance for robustness to outliers
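The difference between these metrics shows up on small examples. In the sketch below the expression vectors are invented, and the rank computation assumes no tied values.

```python
import numpy as np

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

def spearman_distance(a, b):
    """1 - Spearman correlation, computed on ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return 1.0 - np.corrcoef(rank_a, rank_b)[0, 1]

gene_a = np.array([1.0, 2.0, 4.0, 8.0, 3.0])
gene_b = gene_a + 5.0                  # same pattern, higher baseline
gene_c = 10.0 - gene_a                 # mirrored pattern

print(np.linalg.norm(gene_a - gene_b))          # Euclidean sees a large gap
print(pearson_distance(gene_a, gene_b))         # correlation distance sees none
print(pearson_distance(gene_a, gene_c))

# An outlier distorts Pearson but leaves the rank-based distance unchanged
steady = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(pearson_distance(steady, with_outlier), spearman_distance(steady, with_outlier))
```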
The dendrogram: cutting at different heights gives different numbers of clusters. The choice is subjective: use domain knowledge and validation metrics.
Heatmaps + hierarchical clustering: the canonical visualization for bulk RNA-seq results. Genes (rows) and samples (columns) are clustered by expression profile. Co-regulated gene modules appear as blocks of similar color.
k-Means Clustering
k-Means partitions n points into k clusters by minimizing the within-cluster sum of squared distances to the centroid.
Algorithm:
- Initialize k centroids (randomly or with k-means++ smart initialization)
- Assign each point to its nearest centroid
- Recompute centroids as mean of assigned points
- Repeat until convergence
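The four steps above can be sketched as a short NumPy implementation with k-means++ initialization. The function name, the empty-cluster handling, and the toy data are assumptions for illustration; a library implementation with multiple restarts is preferable for real analyses.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with k-means++ initialization."""
    rng = np.random.default_rng(seed)
    # k-means++: pick later centroids with probability proportional to squared distance
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                               # converged
        centroids = new_centroids

    inertia = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, inertia

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), rng.normal(size=(30, 2)) + 10.0])
labels, centroids, inertia = kmeans(X, k=2)
print(centroids.round(2), round(inertia, 2))
```

The returned `inertia` is the within-cluster sum of squares, which is the quantity plotted against k in the elbow method.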
Limitations:
- Requires specifying k in advance
- Assumes spherical clusters (equal variance in all directions); fails for elongated or non-convex clusters
- Sensitive to initialization (run multiple times, take best result)
- Poor performance with outliers
Choosing k: plot within-cluster sum of squares vs. k (elbow method); or use silhouette score (measures how well each point fits its cluster vs. the nearest alternative).
Graph-Based Clustering (Louvain/Leiden)
For single-cell data, graph-based methods are standard:
- Build a k-NN graph: connect each cell to its k nearest neighbors in PCA space
- Weight edges by similarity
- Apply community detection (Louvain or Leiden algorithm) to find communities that maximize modularity
Why this works for scRNA-seq: cells form a manifold in expression space. k-NN graphs capture the local topology of this manifold better than global distance-based methods. Clusters correspond to distinct cell states or cell types.
Resolution parameter: controls granularity. Higher resolution → more, smaller clusters. Lower resolution → fewer, larger clusters. There is no single "correct" resolution; it depends on the biological question (major lineages vs. fine subtypes).
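To see what "maximize modularity" and "resolution" mean, the sketch below scores a given partition of a small graph. It assumes one common resolution-weighted form of modularity; Louvain and Leiden implementations may use other quality functions, and the search over partitions that those algorithms perform is not shown.

```python
import numpy as np

def modularity(A, labels, resolution=1.0):
    """Resolution-weighted modularity of a partition.

    A: symmetric adjacency matrix with zero diagonal.
    """
    labels = np.asarray(labels)
    two_m = A.sum()                              # twice the total edge weight
    degrees = A.sum(axis=1)
    q = 0.0
    for c in np.unique(labels):
        members = labels == c
        within = A[np.ix_(members, members)].sum()   # each internal edge counted twice
        q += within / two_m - resolution * (degrees[members].sum() / two_m) ** 2
    return float(q)

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by a single edge 2-3
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

split = [0, 0, 0, 1, 1, 1]      # one community per triangle
merged = [0, 0, 0, 0, 0, 0]     # everything in one community

for gamma in (1.0, 0.1):
    print(gamma, modularity(A, split, gamma), modularity(A, merged, gamma))
```

At resolution 1.0 the two-triangle split scores higher, while at resolution 0.1 the single merged community scores higher, which matches the rule that lower resolution yields fewer, larger clusters.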
Leiden (Traag et al. 2019) is an improved version of Louvain that guarantees well-connected communities and is the current recommendation for single-cell clustering. Louvain can produce internally disconnected communities in some cases. For most practical purposes the results are similar, but use Leiden as the default.
Comparing Dimensionality Reduction and Clustering Methods
| Method | Type | Preserves | Speed | Best for |
|---|---|---|---|---|
| PCA | Linear | Global variance | Fast | QC, batch detection, input to downstream methods |
| t-SNE | Non-linear | Local neighborhoods | Slow | Visualization (≤100K cells) |
| UMAP | Non-linear | Local + some global | Fast | Visualization + downstream analysis |
| Hierarchical | Clustering | Hierarchical structure | O(n²) space | Heatmaps, small datasets, dendrogram needed |
| k-means | Clustering | Spherical clusters | Fast | Large datasets, well-separated clusters |
| Leiden/Louvain | Graph community | Topological structure | Fast | Single-cell clustering |
Evaluating Clusters
Clustering is unsupervised: there's no ground truth. Evaluation is inherently harder than in supervised learning:
Internal metrics (don't require labels):
- Silhouette score: for each point, measures how similar it is to its own cluster vs. the nearest other cluster. Range [−1, 1]; higher = better separation.
- Davies-Bouldin index: average ratio of within-cluster scatter to between-cluster distance. Lower = better.
- Calinski-Harabasz index: ratio of between-cluster to within-cluster variance. Higher = better.
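As a concrete example of an internal metric, here is a minimal sketch of the silhouette score using Euclidean distance on toy data. It assumes at least two clusters and scores singleton clusters as 0; for real analyses a library implementation such as scikit-learn's `silhouette_score` is the safer choice.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                                 # singleton cluster scores 0
        a = D[i, same].sum() / (n_same - 1)          # mean distance within own cluster
        b = min(D[i, labels == c].mean()             # mean distance to nearest other cluster
                for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), rng.normal(size=(30, 2)) + 10.0])
true_labels = np.repeat([0, 1], 30)
shuffled_labels = rng.permutation(true_labels)

print(silhouette_score(X, true_labels))       # well-separated clusters
print(silhouette_score(X, shuffled_labels))   # labels unrelated to structure
```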
Biological validation (for scRNA-seq):
- Marker genes: do cluster-specific marker genes match known cell type markers?
- Reference dataset integration: do cells cluster with annotated reference datasets (CellTypist, Azimuth)?
- Functional coherence: do cells in a cluster respond similarly to perturbations?
- Trajectory analysis: do cluster relationships form biologically plausible developmental paths?
Batch Correction Before Visualization
A common problem: samples processed in different batches cluster by batch in PCA/UMAP rather than by biology.
ComBat (bulk RNA-seq): parametric batch effect correction that adjusts for additive and multiplicative batch effects. Run on normalized log-counts before PCA.
Harmony (single-cell): integrates single-cell datasets by iteratively adjusting PC coordinates to remove batch effects while preserving biological variation.
scVI (deep generative model): learns a latent representation that accounts for batch effects probabilistically.
After batch correction: UMAP and clustering should reflect biology, not technical factors. Always verify with known biological markers: over-correction can merge genuinely different cell types.
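To show what an additive batch effect looks like numerically, the sketch below removes per-batch gene means from simulated log-expression values. This is deliberately not ComBat, Harmony, or scVI: it is a much cruder stand-in that ignores multiplicative effects and will also erase real biology whenever condition is confounded with batch. The function name and data are invented for the example.

```python
import numpy as np

def center_by_batch(log_expr, batches):
    """Subtract each batch's per-gene mean, then restore the overall per-gene mean."""
    batches = np.asarray(batches)
    corrected = np.array(log_expr, dtype=float)      # work on a copy
    overall_mean = corrected.mean(axis=0)
    for b in np.unique(batches):
        rows = batches == b
        corrected[rows] -= corrected[rows].mean(axis=0)
    return corrected + overall_mean

rng = np.random.default_rng(0)
log_expr = rng.normal(loc=5.0, size=(12, 20))        # 12 samples x 20 genes
batches = np.repeat([0, 1], 6)
log_expr[batches == 1] += 3.0                        # simulated additive batch shift

corrected = center_by_batch(log_expr, batches)
gap_before = np.abs(log_expr[batches == 0].mean(axis=0) - log_expr[batches == 1].mean(axis=0))
gap_after = np.abs(corrected[batches == 0].mean(axis=0) - corrected[batches == 1].mean(axis=0))
print(gap_before.mean(), gap_after.mean())
```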
Application: Single-Cell RNA-seq Workflow
The Scanpy/Seurat standard workflow exemplifies how these methods combine:
- QC filtering: remove cells with too few genes (empty droplets), too many genes (doublets), or high mitochondrial fraction (damaged cells)
- Normalization: normalize to 10,000 counts per cell, then log-transform
- Feature selection: keep the top 2,000–5,000 highly variable genes (reduces noise, speeds computation)
- PCA: run on highly variable genes; keep top 50 PCs
- Batch correction (if needed): Harmony on PC embeddings
- k-NN graph: k=15 neighbors in PC space
- Clustering: Leiden at multiple resolutions; choose resolution that matches biological prior
- UMAP: for visualization; run on the same k-NN graph
- Differential expression: between clusters (identifies marker genes)
- Cell type annotation: match markers to reference; confirm with scores
This pipeline is largely automated in Scanpy (sc.pp, sc.tl, sc.pl) and Seurat (FindVariableFeatures, RunPCA, FindNeighbors, RunUMAP).
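In Scanpy, the workflow might look roughly like the sketch below. It assumes `adata` is an AnnData object holding raw counts (cells × genes) with human gene symbols. The QC thresholds (`min_genes=200`, `min_cells=3`, 5% mitochondrial) are illustrative values rather than recommendations, and the upper gene-count filter for doublets, batch correction, and annotation are left out. Scanpy is imported inside the function so the definition itself does not require it.

```python
def standard_workflow(adata, n_top_genes=2000, n_pcs=50, n_neighbors=15,
                      resolution=1.0, max_pct_mt=5.0):
    """Sketch of a Scanpy workflow on an AnnData object of raw counts."""
    import scanpy as sc  # lazy import: only needed when the function is called

    # QC filtering (thresholds are illustrative assumptions)
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata.var["mt"] = adata.var_names.str.startswith("MT-")  # human naming assumed
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None,
                               log1p=False, inplace=True)
    adata = adata[adata.obs["pct_counts_mt"] < max_pct_mt].copy()

    # Normalize to 10,000 counts per cell, then log-transform
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Feature selection and PCA
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, subset=True)
    sc.pp.pca(adata, n_comps=n_pcs)

    # k-NN graph, then clustering and UMAP on that same graph
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    sc.tl.leiden(adata, resolution=resolution)
    sc.tl.umap(adata)

    # Marker genes per cluster
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    return adata
```

Calling `standard_workflow(adata, resolution=0.5)` and again with a higher resolution is one way to compare clusterings at several granularities.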
Common Pitfalls
Treating UMAP distances as meaningful: the most common misinterpretation. Cluster topology in UMAP is informative; inter-cluster distances are not.
Over-clustering: too many clusters split real cell types into arbitrary sub-clusters without biological meaning. Always validate with marker genes.
Under-clustering: too few clusters merge distinct populations. Rare cell types (5% of cells) may not appear as their own cluster unless the resolution is high enough.
PCA on raw counts: always log-transform first. PCA on raw counts is dominated by highly expressed genes and gives misleading results.
Not filtering to highly variable genes: running PCA on all 30,000 genes includes thousands of stably expressed housekeeping genes that add noise without signal.
Ignoring batch effects: batch effects can be stronger than biological signal. Always run PCA first and check for non-biological clustering.