Bioinformatics analyses routinely require statistical testing, but the tests used cover a narrower range than a full statistics course. This chapter focuses on the tests you'll encounter most often, emphasizing assumptions (what the test requires to be valid) and the biological contexts where each applies.
Two-Group Comparisons
Student's t-test
The workhorse of two-group comparisons. Tests whether the means of two groups differ.
Assumptions:
- Continuous data (interval/ratio scale)
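- Approximately normally distributed within each group (or n large enough, roughly n > 30 per group, for the central limit theorem to make the sample means approximately normal)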
- Independent observations
- For the classic two-sample t-test: equal variances in both groups. Welch's t-test drops this assumption, so use it by default.
When used in bioinformatics:
- Comparing continuous biomarker levels between cases and controls
- Comparing normalized expression values (log2 CPM, TPM) between conditions — though specialized tools (limma) use more sophisticated variance modeling
- Comparing alpha diversity (species richness) between microbiome samples
Not appropriate for:
- Raw RNA-seq counts (discrete and overdispersed; usually modeled with a negative binomial distribution, not a normal one)
- Comparing proportions (use proportion test or chi-squared)
- Paired samples without specifying the paired structure
Paired t-test: use when each observation in group A is matched to one in group B (before/after, or the same patient treated vs. untreated). Pairing removes between-subject variance, which can substantially increase power when subjects differ more from each other than the treatment effect.
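The difference between ignoring and using the paired structure can be sketched with SciPy. This is a minimal illustration, not a recommended pipeline: the biomarker values are invented, and `before` / `after` are hypothetical names for measurements on the same eight patients.

```python
from scipy import stats

# Hypothetical biomarker levels for 8 patients; values are invented for illustration.
before = [5.1, 4.8, 6.2, 5.5, 4.9, 6.0, 5.3, 5.8]
after = [5.6, 5.2, 6.9, 6.1, 5.3, 6.4, 5.9, 6.3]

# Welch's t-test treats the two lists as independent groups
# (equal_var=False drops the equal-variance assumption).
welch = stats.ttest_ind(after, before, equal_var=False)

# The paired t-test works on the within-patient differences instead.
paired = stats.ttest_rel(after, before)

print(f"Welch  (ignores pairing): t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
print(f"Paired (uses pairing):    t = {paired.statistic:.2f}, p = {paired.pvalue:.3g}")
```

Every patient increases by a similar amount, so the within-patient differences have little spread and the paired test gives a much smaller p-value than the Welch test on the same numbers.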
Mann-Whitney U test (Wilcoxon rank-sum test)
Non-parametric alternative to the t-test. Tests whether the distributions of two groups are shifted relative to each other, without assuming normality.
When to prefer Mann-Whitney over t-test:
- Small n (n < 20) where normality can't be assumed
- Ordinal data (e.g., disease grade 0–4)
- Presence of outliers that would heavily influence the mean
- Clearly non-normal distributions (e.g., protein turnover rates, metabolite concentrations)
Trade-off: slightly lower power than the t-test when the data truly are normally distributed. If n is large and there are no severe outliers, the t-test and Mann-Whitney usually give similar p-values.
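A small sketch of the outlier case, assuming SciPy is available. The concentrations are invented, and one extreme value is planted in `group_a` to show how it affects each test.

```python
from scipy import stats

# Hypothetical metabolite concentrations; group_a contains one extreme outlier (25.0).
group_a = [1.2, 1.5, 1.1, 1.8, 1.4, 1.6, 25.0]
group_b = [2.1, 2.6, 2.3, 2.9, 2.4, 2.2, 2.7]

# Rank-based test: the outlier only counts as "the largest value".
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Welch's t-test: the outlier inflates the mean and variance of group_a.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Mann-Whitney: U = {mwu.statistic:.0f}, p = {mwu.pvalue:.3g}")
print(f"Welch t-test: t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
```

Six of the seven values in `group_a` lie below every value in `group_b`. The rank-based test picks this up, while the single outlier inflates the variance enough to mask it in the t-test.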
Comparing More Than Two Groups
One-way ANOVA
Extends the t-test to three or more groups. Tests whether any group means differ.
F-statistic: ratio of between-group variance to within-group variance. Large F → groups differ more than expected by chance.
Important: ANOVA tells you that at least one group differs, not which. Post-hoc tests (Tukey HSD, Bonferroni correction, Dunnett's test for comparing all groups to control) identify which pairs differ.
In bioinformatics: comparing expression levels across multiple cell lines, multiple developmental timepoints, or multiple patient cohorts.
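The F-statistic and a simple post-hoc step can be sketched as follows. The expression values and cell-line names are hypothetical. The post-hoc step uses pairwise Welch t-tests with a Bonferroni correction as one simple choice; Tukey HSD or Dunnett's test would be alternatives.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical log2 expression of one gene in three cell lines.
groups = {
    "line_A": np.array([5.1, 4.9, 5.3, 5.0]),
    "line_B": np.array([5.2, 5.0, 5.4, 5.1]),
    "line_C": np.array([6.8, 7.1, 6.9, 7.2]),
}
samples = list(groups.values())

# F by hand: between-group mean square divided by within-group mean square.
grand_mean = np.concatenate(samples).mean()
k = len(samples)
n_total = sum(len(s) for s in samples)
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# The same test via SciPy.
anova = stats.f_oneway(*samples)
print(f"F (manual) = {f_manual:.1f}, F (scipy) = {anova.statistic:.1f}, p = {anova.pvalue:.3g}")

# Post-hoc: pairwise Welch t-tests, Bonferroni-adjusted for the number of pairs.
pairs = list(combinations(groups, 2))
adjusted = {}
for g1, g2 in pairs:
    p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False).pvalue
    adjusted[(g1, g2)] = min(1.0, p * len(pairs))
    print(f"{g1} vs {g2}: adjusted p = {adjusted[(g1, g2)]:.3g}")
```

The ANOVA p-value only says that some group differs; the pairwise step shows that `line_C` is the one that stands apart.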
Kruskal-Wallis test
Non-parametric alternative to one-way ANOVA. Tests whether samples come from the same distribution without normality assumptions.
Common in microbiome research (comparing alpha diversity across multiple sample groups) and clinical studies where normality can't be assumed.
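As a rough sketch, the Kruskal-Wallis H statistic can be computed from rank sums and checked against SciPy. The Shannon diversity values and body-site labels are invented, and the hand formula below omits the tie correction, which is harmless here because there are no tied values.

```python
import numpy as np
from scipy import stats

# Hypothetical Shannon diversity values for three sample groups.
shannon = {
    "gut": [3.1, 3.4, 2.9, 3.6, 3.3],
    "skin": [2.0, 2.4, 1.8, 2.2, 2.6],
    "oral": [3.0, 2.8, 3.5, 3.2, 2.7],
}

result = stats.kruskal(*shannon.values())

# H by hand: rank all values together, then compare the per-group rank sums.
values = np.concatenate(list(shannon.values()))
ranks = stats.rankdata(values)
sizes = [len(v) for v in shannon.values()]
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])
n_total = len(values)
h_manual = (
    12 / (n_total * (n_total + 1)) * sum(r.sum() ** 2 / len(r) for r in rank_groups)
    - 3 * (n_total + 1)
)

print(f"H (manual) = {h_manual:.2f}, H (scipy) = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```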
Categorical Data
Chi-squared test
Tests whether two categorical variables are independent.
Construction: compare observed cell counts in a contingency table to expected counts (under the null of independence). χ² = Σ (O-E)²/E.
Assumptions:
- Independent observations
- Expected count ≥ 5 in each cell (use Fisher's exact test if this is violated)
When used:
- Testing whether a SNP is associated with a disease outcome (for example, a simple 2×2 table of carrier status by case/control status, before moving to GWAS methods)
- Testing whether a GO term is enriched (annotated with the term vs. not annotated × in the query set vs. in the background)
- Testing whether two mutations co-occur or are mutually exclusive in tumor samples
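The χ² formula above can be sketched for a co-occurrence table. The counts are invented. `correction=False` turns off the Yates continuity correction that SciPy applies to 2×2 tables by default, so that the result matches the plain formula.

```python
import numpy as np
from scipy import stats

# Hypothetical tumor counts:
# rows = mutation A present / absent, columns = mutation B present / absent.
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(f"chi2 (manual) = {chi2_manual:.3f}, chi2 (scipy) = {chi2:.3f}, dof = {dof}, p = {p_value:.3g}")
print(f"smallest expected count = {expected.min():.0f}")
```

Checking `expected.min()` before trusting the p-value is a cheap way to enforce the expected-count assumption.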
Fisher's Exact Test
Exact test of independence for 2×2 contingency tables, used in place of chi-squared when counts are small (total n < ~20, or any expected cell count < 5).
Used extensively in gene set enrichment: given a set of differentially expressed genes and a pathway gene set, is the overlap greater than expected by chance?
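| | In DEG set | Not in DEG set | Row total |
|---|---|---|---|
| In pathway | a | b | a + b = pathway size |
| Not in pathway | c | d | c + d |
| Column total | a + c = DEG set size | b + d | a + b + c + d = total genes |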
The one-sided Fisher's exact p-value for this over-representation test is the same as the p-value from the one-sided hypergeometric test.
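A short sketch of this equivalence using SciPy. The cell counts a, b, c, d are invented; `alternative="greater"` requests the one-sided over-representation test.

```python
from scipy import stats

# Hypothetical counts for the 2x2 table above.
a, b = 8, 12     # pathway genes: in DEG set, not in DEG set
c, d = 42, 938   # non-pathway genes: in DEG set, not in DEG set

# One-sided Fisher's exact test for over-representation.
odds_ratio, p_fisher = stats.fisher_exact([[a, b], [c, d]], alternative="greater")

# The same question as a hypergeometric tail probability, P(overlap >= a).
total_genes = a + b + c + d
pathway_size = a + b
deg_set_size = a + c
p_hyper = stats.hypergeom.sf(a - 1, total_genes, pathway_size, deg_set_size)

print(f"odds ratio = {odds_ratio:.2f}")
print(f"Fisher p = {p_fisher:.3g}, hypergeometric p = {p_hyper:.3g}")
```

Note the `a - 1`: the survival function gives P(X > x), so P(X ≥ a) is `sf(a - 1)`.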
Hypergeometric Test
The formal statistical test for overlap significance. Given m total genes, K in the pathway, n in the DEG set, and k in the overlap: what is the probability of k or more genes overlapping by chance?
This test, or a close variant of it, underlies tools like topGO, DAVID, and Enrichr. Understanding it lets you interpret their outputs critically. In particular, the choice of background (the "universe") strongly affects the p-value and is often made poorly, for example by using all measured genes rather than only the genes expressed in that cell type.
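The sensitivity to the background can be sketched with a direct implementation of the tail probability. The function name and all gene counts are hypothetical. As a simplification, the pathway size K is held fixed across the two universes, whereas in practice it would also shrink when unexpressed genes are removed.

```python
from math import comb


def hypergeom_enrichment_p(m, K, n, k):
    """P(overlap >= k) when n genes are drawn at random from m genes, K of them in the pathway."""
    total = comb(m, n)
    tail = sum(comb(K, i) * comb(m - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Hypothetical numbers: 100 pathway genes, 300 DEGs, 8 genes in the overlap.
K, n, k = 100, 300, 8

p_expressed = hypergeom_enrichment_p(12_000, K, n, k)  # universe = expressed genes only
p_all_genes = hypergeom_enrichment_p(20_000, K, n, k)  # universe = every annotated gene

print(f"universe of 12,000 genes: p = {p_expressed:.3g}")
print(f"universe of 20,000 genes: p = {p_all_genes:.3g}")
```

The larger universe lowers the expected overlap, so the same 8 genes look more surprising. An inflated background therefore tends to overstate enrichment.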
Survival Analysis
Survival analysis handles time-to-event data: how long it takes until an event of interest (death, relapse, progression) occurs. It's ubiquitous in clinical oncology.
Why Standard Tests Fail for Survival Data
Two problems:
- Censoring: many patients haven't had the event by the end of the study. You know they survived at least to that point, but not their exact survival time. Censored observations should neither be excluded (which discards information) nor treated as events (which inflates event counts).
- Time-varying hazard: the risk of the event may change over time (a patient recovering from surgery has high risk early, then lower risk if they survive).
Kaplan-Meier Estimator
Estimates the survival function S(t) = probability of surviving past time t, accounting for censored observations.
At each event time tᵢ: S(tᵢ) = S(tᵢ₋₁) × (1 - dᵢ/nᵢ), where dᵢ = events at time tᵢ, nᵢ = at risk just before tᵢ.
The Kaplan-Meier curve is the standard visualization for survival data — a step function that drops at each event time. Median survival is read where the curve crosses S = 0.5.
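The product formula can be sketched in plain Python. The function names and follow-up data are hypothetical, and the sketch only reports S(t) at event times; a real analysis would normally use an established survival package and also report confidence intervals.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] for each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if the subject was censored
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n_at_risk = sum(1 for u in times if u >= t)  # still under observation just before t
        d_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - d_events / n_at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First event time at which the curve reaches or drops below 0.5 (None if it never does)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up in months; event = 0 marks a censored patient.
times = [2, 3, 3, 5, 6, 8, 9, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:>2}: S(t) = {s:.3f}")
print("median survival:", median_survival(curve))
```

Censored patients never cause a drop in the curve, but they do leave the risk set, which is why the later steps are larger than the earlier ones.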
Log-Rank Test
Tests whether two survival curves differ significantly. The standard test for comparing treatment arms in clinical trials.
Assumptions:
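- Proportional hazards: the test is most powerful when the ratio of hazard rates between groups is constant over time. If the survival curves cross, proportional hazards is violated, and the log-rank test loses power and can miss real differences.
- Non-informative censoring: whether a subject is censored must be unrelated to that subject's risk of the event.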
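A minimal sketch of the two-group log-rank statistic, assuming the standard construction: at each event time, compare the observed events in group A with the number expected if both groups shared one hazard, then sum over event times. The function name and the follow-up data are hypothetical.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-squared statistic, p-value)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]

    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e == 1}):
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)    # events, both groups
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of d_a at this event time
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of a chi-squared distribution with 1 df
    return chi2, p_value


# Hypothetical trial arms (months of follow-up; event = 0 means censored).
control_t, control_e = [2, 3, 4, 5, 6, 7], [1, 1, 1, 1, 1, 1]
treated_t, treated_e = [8, 9, 10, 11, 12, 13], [1, 1, 0, 1, 1, 0]

chi2, p_value = logrank_test(control_t, control_e, treated_t, treated_e)
print(f"log-rank chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

The sketch assumes both groups are at risk together at one or more event times; otherwise the variance is zero and the statistic is undefined.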
Cox Proportional Hazards Model
Extends survival analysis to include covariates:
h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...)
Where h₀(t) is the baseline hazard (unspecified — semi-parametric) and exp(βᵢ) is the hazard ratio for covariate Xᵢ.
Hazard ratio (HR): HR = 2 means the instantaneous rate of the event is 2× higher in the exposed group vs. the reference.
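To make the model concrete, here is a rough sketch that fits a single-covariate Cox model by Newton-Raphson on the partial likelihood, which depends only on β because h₀(t) cancels. The function name and data are hypothetical, there is no step-size control, and tied event times are handled only in the simplest way, so treat it as an illustration rather than a substitute for an established implementation (for example `coxph` in R's survival package or `CoxPHFitter` in lifelines).

```python
import numpy as np


def fit_cox_single_covariate(times, events, x, max_iter=50, tol=1e-10):
    """Estimate beta for h(t|x) = h0(t) * exp(beta * x); returns (beta, standard error)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    x = np.asarray(x, dtype=float)

    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in np.flatnonzero(events == 1):
            at_risk = times >= times[i]          # risk set at this event time
            weights = np.exp(beta * x[at_risk])  # relative hazards within the risk set
            mean_x = np.sum(weights * x[at_risk]) / np.sum(weights)
            mean_x2 = np.sum(weights * x[at_risk] ** 2) / np.sum(weights)
            score += x[i] - mean_x               # first derivative of the log partial likelihood
            info += mean_x2 - mean_x ** 2        # negative second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / np.sqrt(info)


# Hypothetical cohort: x = 1 marks high expression of a gene, x = 0 low expression.
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
events = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
high_expression = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

beta, se = fit_cox_single_covariate(times, events, high_expression)
print(f"beta = {beta:.2f}, hazard ratio = {np.exp(beta):.2f}, SE(beta) = {se:.2f}")
```

A positive β means the high-expression group has the higher hazard. With only ten patients the standard error is large, which is a reminder that a hazard ratio should be reported with its confidence interval.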
Used in clinical genomics: does high expression of a gene predict worse overall survival? Is a specific mutation associated with reduced progression-free survival?
In TCGA (The Cancer Genome Atlas) analyses, Cox regression with gene expression as a continuous variable tests prognostic value across thousands of genes simultaneously — requiring FDR correction.
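The FDR correction mentioned here is commonly done with the Benjamini-Hochberg procedure. The sketch below assumes that choice, and the p-values are invented stand-ins for per-gene Cox regression results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
q_values = benjamini_hochberg(p_values)

n_fdr = sum(q < 0.05 for q in q_values)
n_bonferroni = sum(p * len(p_values) < 0.05 for p in p_values)
print("q-values:", [round(q, 3) for q in q_values])
print(f"significant at FDR 5%: {n_fdr}; significant after Bonferroni: {n_bonferroni}")
```

Five of the raw p-values are below 0.05, but only two survive FDR control and only one survives Bonferroni.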
Correlation
Pearson Correlation
Measures linear correlation between two continuous variables. Assumes bivariate normality. Sensitive to outliers.
Use for: measuring how closely two continuous variables track each other linearly, such as the expression of two genes across samples, when normality is reasonable.
Spearman Rank Correlation
Non-parametric rank-based correlation. Tests monotonic (not just linear) association. Robust to outliers.
Use for: comparing omics features generally (biomarker vs. clinical variable), microbiome diversity vs. metadata variables, any data with potential outliers.
Pitfall: both Pearson and Spearman measure pairwise correlation, but correlation is not causation and does not imply regulatory relationships.
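The linear vs. monotonic distinction can be sketched with a relationship that is perfectly monotonic but far from linear. The data are synthetic (y doubles with each unit of x) and carry no biological meaning.

```python
from scipy import stats

# Synthetic data: y increases with x at every step, but exponentially rather than linearly.
x = list(range(1, 11))
y = [2 ** v for v in x]

r, p_pearson = stats.pearsonr(x, y)
rho, p_spearman = stats.spearmanr(x, y)

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Spearman reports a perfect association because the ranks agree exactly, while Pearson is pulled well below 1 by the curvature.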
Test Selection Guide
| Data type | Comparison | Recommended test |
|---|---|---|
| Continuous, normal | 2 groups | Welch t-test |
| Continuous, non-normal or ordinal | 2 groups | Mann-Whitney |
| Continuous, normal | ≥3 groups | One-way ANOVA + post-hoc |
| Continuous, non-normal | ≥3 groups | Kruskal-Wallis |
| Continuous, paired | Before/after same subjects | Paired t-test or Wilcoxon signed-rank |
| Categorical | Contingency table, all expected counts ≥ 5 | Chi-squared |
| Categorical | 2×2 contingency, any expected count < 5 | Fisher's exact |
| Enrichment | Gene set overlap | Hypergeometric (Fisher's) |
| Count data (RNA-seq) | 2+ conditions | DESeq2, edgeR, limma-voom |
| Time to event | 2 groups, survival | Log-rank |
| Time to event | Multiple covariates | Cox proportional hazards |
| Correlation | Linear, normal | Pearson r |
| Correlation | Monotonic (possibly non-linear) or outlier-prone | Spearman ρ |
Effect Sizes: Don't Just Report p-values
A p-value tells you how surprising the data would be if there were no effect; it doesn't tell you how large the effect is or whether it matters. With large samples, tiny meaningless differences become highly significant. Effect size measures the magnitude of the difference:
- Cohen's d (standardized mean difference): d = |μ₁ - μ₂| / σ_pooled. d = 0.2 small, 0.5 medium, 0.8 large.
- Log₂ fold change: standard for differential expression. |log₂FC| > 1 (a 2-fold change) is a common, though arbitrary, threshold for biological significance.
- Hazard ratio: HR = 2 means 2× higher event rate.
- AUC (AUROC): discriminative ability of a biomarker. AUC = 0.5 is random; as rough rules of thumb, AUC > 0.75 is often considered clinically useful and AUC > 0.9 excellent, though the useful threshold depends on the application.
Reporting both the p-value and the effect size, with confidence intervals, is increasingly required by journals and is scientifically necessary for interpreting results. A fold change of 1.1 with p = 10⁻¹⁰ (from a huge transcriptomics study) may not be biologically relevant. A fold change of 3 with p = 0.04 (from a small study) may reflect an important effect, but one that is imprecisely estimated and needs replication.
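Two of the effect sizes above can be sketched with the standard library alone. The function names and measurements are hypothetical, and the fold-change helper assumes linear-scale (not log-transformed) group means.

```python
from math import log2, sqrt
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Absolute standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return abs(mean(group_1) - mean(group_2)) / sqrt(pooled_var)


def log2_fold_change(mean_treated, mean_control):
    """Log2 ratio of two linear-scale means (e.g., mean TPM per condition)."""
    return log2(mean_treated / mean_control)


# Hypothetical measurements.
control = [5.1, 4.9, 5.3, 5.0, 5.2]
treated = [5.6, 5.4, 5.8, 5.5, 5.7]

d = cohens_d(control, treated)
lfc = log2_fold_change(200.0, 100.0)
print(f"Cohen's d = {d:.2f}")
print(f"log2 fold change = {lfc:.1f}")
```

Here the group means differ by 0.5 with a pooled standard deviation of about 0.16, so d is far above the 0.8 "large" benchmark even though the raw difference looks modest.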