Differential Expression Analysis

Differential expression analysis answers the most common question in transcriptomics: which genes change their expression between condition A and condition B? It sounds straightforward, but the statistical challenges — count data, high dimensionality, small sample sizes, multiple testing — require specialized methods developed specifically for RNA-seq data.

The Statistical Problem

Given:

n₁ samples in condition A (e.g., tumor)
n₂ samples in condition B (e.g., normal)
Expression measurements for G ~20,000 genes in each sample

Find: which genes differ significantly between conditions?

Why you can't use a t-test on raw counts:

RNA-seq counts are integers, not continuous — a gene with 0 counts in one sample but 100 in another can't be normally distributed
Variance is not constant across count levels. For Poisson counts, variance = mean. For RNA-seq, variance >> mean (overdispersion) — this is biological variability on top of sampling variation
Sequencing depth varies between samples — a sample with 2× more total reads will show 2× more counts per gene, which is a technical artifact, not biology

Normalization First

Before testing, counts must be normalized to remove sequencing depth effects:

DESeq2 size factor normalization: For each sample, compute a size factor as the median of (count of gene g in sample j) / (geometric mean of gene g across all samples). Dividing each count by the sample's size factor makes samples comparable.

This is more robust than simple per-million scaling because it's resistant to the influence of highly expressed genes (a gene expressed at 10% of total reads would dominate simple scaling).

When to use which normalization:

DESeq2 size factors (within-experiment normalization): correct for library size differences between samples in the same experiment
TMM (edgeR): similar to DESeq2, assumes most genes are non-differentially expressed
TPM (for cross-experiment comparison): normalizes for both library size AND gene length; appropriate for comparing expression levels across samples from different studies
CPM/RPKM/FPKM: older methods, generally avoid RPKM/FPKM due to mathematical inconsistencies in cross-sample comparison

ℹDon't use TPM for differential expression testing

TPM is appropriate for visualizing and reporting expression levels (e.g., "BRCA1 is expressed at 15 TPM in breast tissue"). But for differential expression statistical testing, use raw counts with DESeq2 or edgeR — these tools do their own normalization internally and their statistical models require raw counts. Running DESeq2 on TPM-normalized data invalidates the variance model.

The Negative Binomial Model

The appropriate model for RNA-seq count data is the negative binomial (NB) distribution:

P(Y = k) = Γ(k + μ/α) / (Γ(μ/α) × k!) × (αμ)^k / (1 + αμ)^(k + μ/α)

Where:

μ = mean count (mean expression level)
α = dispersion parameter (overdispersion; α=0 reduces to Poisson)

The variance is: Var(Y) = μ + α × μ²

Dispersion α captures the extra variability beyond sampling noise. It varies by gene and is typically estimated from the data.

The Small n Problem and Shrinkage

With only n=3–5 replicates per group, the per-gene variance estimate is unreliable. A gene with 3 control samples might show high variance just by chance — leading to inflated confidence intervals and lost power.

Empirical Bayes shrinkage: both DESeq2 and edgeR "borrow strength" across genes. The idea: the dispersion of a gene with count 50 should be similar to dispersions of other genes with similar counts. By pooling information across thousands of genes, you get much better dispersion estimates for any individual gene.

DESeq2 fits a curve: dispersion as a function of mean expression (the "dispersion trend"). Each gene's dispersion estimate is then shrunk toward this trend. Genes with few replicates (where the estimate is unreliable) are shrunk more; genes where the data strongly support a high dispersion are shrunk less.

This shrinkage is a form of regularization — exactly the same principle as Ridge regression. It reduces variance in the estimate at the cost of introducing some bias, but the net effect on test performance is strongly positive.

DESeq2 Workflow

Raw counts matrix (genes × samples)
    ↓
Create DESeqDataSet object
    ↓
Estimate size factors (normalization)
    ↓
Estimate dispersions (gene-wise, fit trend, shrink)
    ↓
Fit negative binomial GLM per gene
    ↓
Wald test or LRT for log fold changes
    ↓
Shrink log fold changes (apeglm/ashr)
    ↓
Apply multiple testing correction (Benjamini-Hochberg)
    ↓
Results table: log2FC, SE, Wald statistic, p-value, padj

Design Formula

The design formula specifies what comparisons to make:

# Simple two-group comparison
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~condition)

# Controlling for batch effect
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~batch + condition)

# Interaction: does treatment effect differ between genotypes?
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~genotype + treatment + genotype:treatment)

The design formula is a linear model in the generalized linear framework. Each gene is tested with the same model structure; coefficients and tests correspond to the design matrix columns.

Log Fold Change Shrinkage

Even after dispersion shrinkage, log fold change estimates can be noisy for low-count genes. DESeq2's lfcShrink() applies an additional shrinkage prior to fold change estimates:

The apeglm method (adaptive Student's t prior) is the default recommendation
Shrinks large LFCs for low-count genes toward zero while preserving estimates for high-count genes
Does NOT change the p-value or significance status — only the LFC estimate for visualization and ranking

Shrunken LFCs are appropriate for MA plots, volcano plots, and ranking genes for pathway analysis. Unshrunken LFCs can produce extreme values for low-count genes that dominate rankings.

Interpreting Results

The Results Table

Column	Meaning
baseMean	Mean normalized count across all samples
log2FoldChange	Shrunken log₂(treated/control)
lfcSE	Standard error of the log2FC estimate
stat	Wald statistic (log2FC / lfcSE)
pvalue	P-value from Wald test (assuming null: log2FC = 0)
padj	BH-adjusted p-value (FDR-corrected)

Standard significance threshold: padj < 0.05 (5% FDR) AND |log2FoldChange| > 1 (at least 2-fold change). The second criterion is a biological significance filter — statistically significant but tiny fold changes may not be biologically relevant.

The MA Plot

The MA plot shows log fold change (y-axis) vs. mean expression (x-axis). With fold change shrinkage:

High-count genes: LFC estimates spread around the true values; significant genes clearly visible
Low-count genes: LFCs shrunk toward zero; noisy estimates don't dominate the plot

Red dots = significant (padj < 0.05). A well-behaved MA plot shows symmetry around LFC = 0 for non-significant genes, with significant genes extending away from the center.

The Volcano Plot

Volcano plot shows -log₁₀(p-value) vs. log₂ fold change:

X-axis: magnitude of change (more extreme = larger effect)
Y-axis: statistical significance (higher = more significant)
Upper-left: significantly downregulated
Upper-right: significantly upregulated
Bottom middle: neither significant nor large effect

Colored/labeled points are typically those passing both significance and fold-change thresholds.

Independent Filtering

DESeq2 automatically performs independent filtering: it removes genes with very low mean counts from multiple testing correction. Why? These genes have essentially no power to be detected as significant, so they consume multiple testing correction budget without contributing real discoveries.

The filter is "independent" because it uses a criterion (mean count) that is independent of the test statistic under the null — so it doesn't bias the results, just increases power by reducing the number of hypotheses tested.

LRT vs. Wald Test

Wald test (default): tests whether a specific coefficient (e.g., log2FC for condition) is significantly different from zero. Fast; directly tests one coefficient.

Likelihood ratio test (LRT): compares the full model to a reduced model (without the term of interest). More appropriate for:

Testing whether condition explains significant variance overall (across a set of conditions)
Comparing models with different complexity
Testing for significant effects in any direction simultaneously

Use LRT when you have multiple conditions and want to identify genes that change across any of them: DESeq2::DESeq(dds, test="LRT", reduced=~1).

edgeR: The Alternative

edgeR uses a similar negative binomial framework but implements different specifics:

TMM normalization instead of DESeq2's size factors
Quasi-likelihood approach (QL F-test) for better calibration with small samples — now the recommended mode
Tagwise dispersion estimation (per-gene) with empirical Bayes pooling

For most purposes, DESeq2 and edgeR give very similar results. DESeq2 has more built-in features (time-series, complex designs, shrinkage). edgeR's QL approach is sometimes preferred for very small sample sizes.

limma-voom: The Workhorse for Complex Designs

For complex multi-factor designs, limma-voom is often the most flexible tool:

Voom: transforms log-CPM values with precision weights that account for the mean-variance relationship in count data
limma: applies linear modeling with empirical Bayes variance shrinkage

limma's strength: it handles any linear model design, including continuous covariates, interaction terms, batch effects, and time-course comparisons. The full linear model framework of limma is more flexible than DESeq2/edgeR for complex experimental designs.

Time-Course Analysis

For experiments with multiple timepoints, the DESeq2 LRT identifies genes with any change over time, while maSigPro and ImpulseDE2 can identify genes with specific temporal patterns (monotonic increase, transient peak, etc.).

For periodic patterns (circadian rhythms), RAIN and JTK_CYCLE use non-parametric tests specifically designed for rhythm detection.

Practical Considerations

Minimum sample size: n ≥ 3 per group is an absolute minimum; n ≥ 5–6 is strongly recommended. With n=2, the only statistical option is treating samples as technical replicates — which is rarely justified.

Check for outlier samples before analysis: PCA or hierarchical clustering of samples should show grouping by condition, not by sample. A sample that clusters with the wrong group needs investigation (possible labeling error, contamination, or poor quality).

Surpassing the low-count filter manually: DESeq2's independent filtering is automatic, but manually filtering before analysis (e.g., keeping only genes with CPM > 1 in at least 3 samples) ensures you analyze genes with evidence of expression in at least some samples.

Biological interpretation requires context: 500 differentially expressed genes is the start, not the answer. Next steps: pathway enrichment, network analysis, validation of top candidates, integration with other data types (ATAC-seq for regulatory elements, protein expression for functional confirmation).