Part 8·8.3·16 min read

Differential Expression Analysis

The statistical framework for finding genes that change between conditions in RNA-seq — negative binomial models, shrinkage estimation, and multiple testing.

RNA-seqdifferential expressionDESeq2statisticstranscriptomics

analysis answers the most common question in transcriptomics: which change their expression between condition A and condition B? It sounds straightforward, but the statistical challenges — count data, high dimensionality, small sample sizes, multiple testing — require specialized methods developed specifically for data.

The Statistical Problem

Given:

  • n₁ samples in condition A (e.g., tumor)
  • n₂ samples in condition B (e.g., normal)
  • Expression measurements for G ~20,000 in each sample

Find: which differ significantly between conditions?

Why you can't use a t-test on raw counts:

  1. counts are integers, not continuous — a with 0 counts in one sample but 100 in another can't be normally distributed

  2. Variance is not constant across count levels. For Poisson counts, variance = mean. For , variance >> mean (overdispersion) — this is biological variability on top of sampling variation

  3. depth varies between samples — a sample with 2× more total will show 2× more counts per , which is a technical artifact, not biology

Normalization First

Before testing, counts must be normalized to remove depth effects:

DESeq2 size factor normalization: For each sample, compute a size factor as the median of (count of g in sample j) / (geometric mean of g across all samples). Dividing each count by the sample's size factor makes samples comparable.

This is more robust than simple per-million scaling because it's resistant to the influence of highly expressed (a expressed at 10% of total would dominate simple scaling).

When to use which normalization:

  • DESeq2 size factors (within-experiment normalization): correct for library size differences between samples in the same experiment
  • TMM (edgeR): similar to DESeq2, assumes most are non-
  • TPM (for cross-experiment comparison): normalizes for both library size AND length; appropriate for comparing across samples from different studies
  • CPM/RPKM/FPKM: older methods, generally avoid RPKM/FPKM due to mathematical inconsistencies in cross-sample comparison
Don't use TPM for differential expression testing

TPM is appropriate for visualizing and reporting (e.g., "BRCA1 is expressed at 15 TPM in breast tissue"). But for statistical testing, use raw counts with DESeq2 or edgeR — these tools do their own normalization internally and their statistical models require raw counts. Running DESeq2 on TPM-normalized data invalidates the variance model.

The Negative Binomial Model

The appropriate model for count data is the negative binomial (NB) distribution:

P(Y = k) = Γ(k + μ/α) / (Γ(μ/α) × k!) × (αμ)^k / (1 + αμ)^(k + μ/α)

Where:

  • μ = mean count (mean )
  • α = dispersion parameter (overdispersion; α=0 reduces to Poisson)

The variance is: Var(Y) = μ + α × μ²

Dispersion α captures the extra variability beyond sampling noise. It varies by and is typically estimated from the data.

The Small n Problem and Shrinkage

With only n=3–5 replicates per group, the per- variance estimate is unreliable. A with 3 control samples might show high variance just by chance — leading to inflated confidence intervals and lost power.

Empirical Bayes shrinkage: both DESeq2 and edgeR "borrow strength" across . The idea: the dispersion of a with count 50 should be similar to dispersions of other with similar counts. By pooling information across thousands of , you get much better dispersion estimates for any individual .

DESeq2 fits a curve: dispersion as a function of mean expression (the "dispersion trend"). Each 's dispersion estimate is then shrunk toward this trend. with few replicates (where the estimate is unreliable) are shrunk more; where the data strongly support a high dispersion are shrunk less.

This shrinkage is a form of regularization — exactly the same principle as Ridge regression. It reduces variance in the estimate at the cost of introducing some bias, but the net effect on test performance is strongly positive.

DESeq2 Workflow

Raw counts matrix (genes × samples)
    ↓
Create DESeqDataSet object
    ↓
Estimate size factors (normalization)
    ↓
Estimate dispersions (gene-wise, fit trend, shrink)
    ↓
Fit negative binomial GLM per gene
    ↓
Wald test or LRT for log fold changes
    ↓
Shrink log fold changes (apeglm/ashr)
    ↓
Apply multiple testing correction (Benjamini-Hochberg)
    ↓
Results table: log2FC, SE, Wald statistic, p-value, padj

Design Formula

The design formula specifies what comparisons to make:

r
# Simple two-group comparison
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~condition)

# Controlling for batch effect
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~batch + condition)

# Interaction: does treatment effect differ between genotypes?
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~genotype + treatment + genotype:treatment)

The design formula is a linear model in the generalized linear framework. Each is tested with the same model structure; coefficients and tests correspond to the design matrix columns.

Log Fold Change Shrinkage

Even after dispersion shrinkage, log estimates can be noisy for low-count . DESeq2's lfcShrink() applies an additional shrinkage prior to estimates:

  • The apeglm method (adaptive Student's t prior) is the default recommendation
  • Shrinks large LFCs for low-count toward zero while preserving estimates for high-count
  • Does NOT change the or significance status — only the LFC estimate for visualization and ranking

Shrunken LFCs are appropriate for MA plots, volcano plots, and ranking for analysis. Unshrunken LFCs can produce extreme values for low-count that dominate rankings.

Interpreting Results

The Results Table

ColumnMeaning
baseMeanMean normalized count across all samples
log2FoldChangeShrunken log₂(treated/control)
lfcSEStandard error of the log2FC estimate
statWald statistic (log2FC / lfcSE)
pvalueP-value from Wald test (assuming null: log2FC = 0)
padjBH-adjusted p-value (FDR-corrected)

Standard significance threshold: < 0.05 (5% ) AND |log2FoldChange| > 1 (at least 2-). The second criterion is a biological significance filter — statistically significant but tiny fold changes may not be biologically relevant.

The MA Plot

The MA plot shows log (y-axis) vs. mean expression (x-axis). With shrinkage:

  • High-count : LFC estimates spread around the true values; significant clearly visible
  • Low-count : LFCs shrunk toward zero; noisy estimates don't dominate the plot

Red dots = significant ( < 0.05). A well-behaved MA plot shows symmetry around LFC = 0 for non-significant , with significant extending away from the center.

The Volcano Plot

Volcano plot shows -log₁₀() vs. log₂ :

  • X-axis: magnitude of change (more extreme = larger effect)
  • Y-axis: statistical significance (higher = more significant)
  • Upper-left: significantly downregulated
  • Upper-right: significantly upregulated
  • Bottom middle: neither significant nor large effect

Colored/labeled points are typically those passing both significance and thresholds.

Independent Filtering

DESeq2 automatically performs independent filtering: it removes with very low mean counts from multiple testing correction. Why? These have essentially no power to be detected as significant, so they consume multiple testing correction budget without contributing real discoveries.

The filter is "independent" because it uses a criterion (mean count) that is independent of the test statistic under the null — so it doesn't bias the results, just increases power by reducing the number of hypotheses tested.

LRT vs. Wald Test

Wald test (default): tests whether a specific coefficient (e.g., for condition) is significantly different from zero. Fast; directly tests one coefficient.

Likelihood ratio test (LRT): compares the full model to a reduced model (without the term of interest). More appropriate for:

  • Testing whether condition explains significant variance overall (across a set of conditions)
  • Comparing models with different complexity
  • Testing for significant effects in any direction simultaneously

Use LRT when you have multiple conditions and want to identify that change across any of them: DESeq2::DESeq(dds, test="LRT", reduced=~1).

edgeR: The Alternative

edgeR uses a similar negative binomial framework but implements different specifics:

  • TMM normalization instead of DESeq2's size factors
  • Quasi-likelihood approach (QL F-test) for better calibration with small samples — now the recommended mode
  • Tagwise dispersion estimation (per-) with empirical Bayes pooling

For most purposes, DESeq2 and edgeR give very similar results. DESeq2 has more built-in features (time-series, complex designs, shrinkage). edgeR's QL approach is sometimes preferred for very small sample sizes.

limma-voom: The Workhorse for Complex Designs

For complex multi-factor designs, limma-voom is often the most flexible tool:

  1. Voom: transforms log-CPM values with precision weights that account for the mean-variance relationship in count data
  2. limma: applies linear modeling with empirical Bayes variance shrinkage

limma's strength: it handles any linear model design, including continuous covariates, interaction terms, batch effects, and time-course comparisons. The full linear model framework of limma is more flexible than DESeq2/edgeR for complex experimental designs.

Time-Course Analysis

For experiments with multiple timepoints, the DESeq2 LRT identifies with any change over time, while maSigPro and ImpulseDE2 can identify with specific temporal patterns (monotonic increase, transient peak, etc.).

For periodic patterns (circadian rhythms), RAIN and JTK_CYCLE use non-parametric tests specifically designed for rhythm detection.

Practical Considerations

Minimum sample size: n ≥ 3 per group is an absolute minimum; n ≥ 5–6 is strongly recommended. With n=2, the only statistical option is treating samples as technical replicates — which is rarely justified.

Check for outlier samples before analysis: or hierarchical of samples should show grouping by condition, not by sample. A sample that with the wrong group needs investigation (possible labeling error, contamination, or poor quality).

Surpassing the low-count filter manually: DESeq2's independent filtering is automatic, but manually filtering before analysis (e.g., keeping only with CPM > 1 in at least 3 samples) ensures you analyze with evidence of expression in at least some samples.

Biological interpretation requires context: 500 is the start, not the answer. Next steps: enrichment, network analysis, validation of top candidates, integration with other data types (ATAC-seq for regulatory elements, expression for functional confirmation).