Differential expression analysis answers the most common question in transcriptomics: which genes change their expression between condition A and condition B? It sounds straightforward, but the statistical challenges — count data, high dimensionality, small sample sizes, multiple testing — require specialized methods developed specifically for RNA-seq data.
The Statistical Problem
Given:
- n₁ samples in condition A (e.g., tumor)
- n₂ samples in condition B (e.g., normal)
- Expression measurements for G ~20,000 genes in each sample
Find: which genes differ significantly between conditions?
Why you can't use a t-test on raw counts:
-
RNA-seq counts are integers, not continuous — a gene with 0 counts in one sample but 100 in another can't be normally distributed
-
Variance is not constant across count levels. For Poisson counts, variance = mean. For RNA-seq, variance >> mean (overdispersion) — this is biological variability on top of sampling variation
-
Sequencing depth varies between samples — a sample with 2× more total reads will show 2× more counts per gene, which is a technical artifact, not biology
Normalization First
Before testing, counts must be normalized to remove sequencing depth effects:
DESeq2 size factor normalization: For each sample, compute a size factor as the median of (count of gene g in sample j) / (geometric mean of gene g across all samples). Dividing each count by the sample's size factor makes samples comparable.
This is more robust than simple per-million scaling because it's resistant to the influence of highly expressed genes (a gene expressed at 10% of total reads would dominate simple scaling).
When to use which normalization:
- DESeq2 size factors (within-experiment normalization): correct for library size differences between samples in the same experiment
- TMM (edgeR): similar to DESeq2, assumes most genes are non-differentially expressed
- TPM (for cross-experiment comparison): normalizes for both library size AND gene length; appropriate for comparing expression levels across samples from different studies
- CPM/RPKM/FPKM: older methods, generally avoid RPKM/FPKM due to mathematical inconsistencies in cross-sample comparison
TPM is appropriate for visualizing and reporting expression levels (e.g., "BRCA1 is expressed at 15 TPM in breast tissue"). But for differential expression statistical testing, use raw counts with DESeq2 or edgeR — these tools do their own normalization internally and their statistical models require raw counts. Running DESeq2 on TPM-normalized data invalidates the variance model.
The Negative Binomial Model
The appropriate model for RNA-seq count data is the negative binomial (NB) distribution:
P(Y = k) = Γ(k + μ/α) / (Γ(μ/α) × k!) × (αμ)^k / (1 + αμ)^(k + μ/α)
Where:
- μ = mean count (mean expression level)
- α = dispersion parameter (overdispersion; α=0 reduces to Poisson)
The variance is: Var(Y) = μ + α × μ²
Dispersion α captures the extra variability beyond sampling noise. It varies by gene and is typically estimated from the data.
The Small n Problem and Shrinkage
With only n=3–5 replicates per group, the per-gene variance estimate is unreliable. A gene with 3 control samples might show high variance just by chance — leading to inflated confidence intervals and lost power.
Empirical Bayes shrinkage: both DESeq2 and edgeR "borrow strength" across genes. The idea: the dispersion of a gene with count 50 should be similar to dispersions of other genes with similar counts. By pooling information across thousands of genes, you get much better dispersion estimates for any individual gene.
DESeq2 fits a curve: dispersion as a function of mean expression (the "dispersion trend"). Each gene's dispersion estimate is then shrunk toward this trend. Genes with few replicates (where the estimate is unreliable) are shrunk more; genes where the data strongly support a high dispersion are shrunk less.
This shrinkage is a form of regularization — exactly the same principle as Ridge regression. It reduces variance in the estimate at the cost of introducing some bias, but the net effect on test performance is strongly positive.
DESeq2 Workflow
Raw counts matrix (genes × samples)
↓
Create DESeqDataSet object
↓
Estimate size factors (normalization)
↓
Estimate dispersions (gene-wise, fit trend, shrink)
↓
Fit negative binomial GLM per gene
↓
Wald test or LRT for log fold changes
↓
Shrink log fold changes (apeglm/ashr)
↓
Apply multiple testing correction (Benjamini-Hochberg)
↓
Results table: log2FC, SE, Wald statistic, p-value, padj
Design Formula
The design formula specifies what comparisons to make:
# Simple two-group comparison
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = metadata,
design = ~condition)
# Controlling for batch effect
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = metadata,
design = ~batch + condition)
# Interaction: does treatment effect differ between genotypes?
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = metadata,
design = ~genotype + treatment + genotype:treatment)
The design formula is a linear model in the generalized linear framework. Each gene is tested with the same model structure; coefficients and tests correspond to the design matrix columns.
Log Fold Change Shrinkage
Even after dispersion shrinkage, log fold change estimates can be noisy for low-count genes. DESeq2's lfcShrink() applies an additional shrinkage prior to fold change estimates:
- The apeglm method (adaptive Student's t prior) is the default recommendation
- Shrinks large LFCs for low-count genes toward zero while preserving estimates for high-count genes
- Does NOT change the p-value or significance status — only the LFC estimate for visualization and ranking
Shrunken LFCs are appropriate for MA plots, volcano plots, and ranking genes for pathway analysis. Unshrunken LFCs can produce extreme values for low-count genes that dominate rankings.
Interpreting Results
The Results Table
| Column | Meaning |
|---|---|
| baseMean | Mean normalized count across all samples |
| log2FoldChange | Shrunken log₂(treated/control) |
| lfcSE | Standard error of the log2FC estimate |
| stat | Wald statistic (log2FC / lfcSE) |
| pvalue | P-value from Wald test (assuming null: log2FC = 0) |
| padj | BH-adjusted p-value (FDR-corrected) |
Standard significance threshold: padj < 0.05 (5% FDR) AND |log2FoldChange| > 1 (at least 2-fold change). The second criterion is a biological significance filter — statistically significant but tiny fold changes may not be biologically relevant.
The MA Plot
The MA plot shows log fold change (y-axis) vs. mean expression (x-axis). With fold change shrinkage:
- High-count genes: LFC estimates spread around the true values; significant genes clearly visible
- Low-count genes: LFCs shrunk toward zero; noisy estimates don't dominate the plot
Red dots = significant (padj < 0.05). A well-behaved MA plot shows symmetry around LFC = 0 for non-significant genes, with significant genes extending away from the center.
The Volcano Plot
Volcano plot shows -log₁₀(p-value) vs. log₂ fold change:
- X-axis: magnitude of change (more extreme = larger effect)
- Y-axis: statistical significance (higher = more significant)
- Upper-left: significantly downregulated
- Upper-right: significantly upregulated
- Bottom middle: neither significant nor large effect
Colored/labeled points are typically those passing both significance and fold-change thresholds.
Independent Filtering
DESeq2 automatically performs independent filtering: it removes genes with very low mean counts from multiple testing correction. Why? These genes have essentially no power to be detected as significant, so they consume multiple testing correction budget without contributing real discoveries.
The filter is "independent" because it uses a criterion (mean count) that is independent of the test statistic under the null — so it doesn't bias the results, just increases power by reducing the number of hypotheses tested.
LRT vs. Wald Test
Wald test (default): tests whether a specific coefficient (e.g., log2FC for condition) is significantly different from zero. Fast; directly tests one coefficient.
Likelihood ratio test (LRT): compares the full model to a reduced model (without the term of interest). More appropriate for:
- Testing whether condition explains significant variance overall (across a set of conditions)
- Comparing models with different complexity
- Testing for significant effects in any direction simultaneously
Use LRT when you have multiple conditions and want to identify genes that change across any of them: DESeq2::DESeq(dds, test="LRT", reduced=~1).
edgeR: The Alternative
edgeR uses a similar negative binomial framework but implements different specifics:
- TMM normalization instead of DESeq2's size factors
- Quasi-likelihood approach (QL F-test) for better calibration with small samples — now the recommended mode
- Tagwise dispersion estimation (per-gene) with empirical Bayes pooling
For most purposes, DESeq2 and edgeR give very similar results. DESeq2 has more built-in features (time-series, complex designs, shrinkage). edgeR's QL approach is sometimes preferred for very small sample sizes.
limma-voom: The Workhorse for Complex Designs
For complex multi-factor designs, limma-voom is often the most flexible tool:
- Voom: transforms log-CPM values with precision weights that account for the mean-variance relationship in count data
- limma: applies linear modeling with empirical Bayes variance shrinkage
limma's strength: it handles any linear model design, including continuous covariates, interaction terms, batch effects, and time-course comparisons. The full linear model framework of limma is more flexible than DESeq2/edgeR for complex experimental designs.
Time-Course Analysis
For experiments with multiple timepoints, the DESeq2 LRT identifies genes with any change over time, while maSigPro and ImpulseDE2 can identify genes with specific temporal patterns (monotonic increase, transient peak, etc.).
For periodic patterns (circadian rhythms), RAIN and JTK_CYCLE use non-parametric tests specifically designed for rhythm detection.
Practical Considerations
Minimum sample size: n ≥ 3 per group is an absolute minimum; n ≥ 5–6 is strongly recommended. With n=2, the only statistical option is treating samples as technical replicates — which is rarely justified.
Check for outlier samples before analysis: PCA or hierarchical clustering of samples should show grouping by condition, not by sample. A sample that clusters with the wrong group needs investigation (possible labeling error, contamination, or poor quality).
Surpassing the low-count filter manually: DESeq2's independent filtering is automatic, but manually filtering before analysis (e.g., keeping only genes with CPM > 1 in at least 3 samples) ensures you analyze genes with evidence of expression in at least some samples.
Biological interpretation requires context: 500 differentially expressed genes is the start, not the answer. Next steps: pathway enrichment, network analysis, validation of top candidates, integration with other data types (ATAC-seq for regulatory elements, protein expression for functional confirmation).