Part 8 · 8.2

Essential Statistical Tests for Bioinformatics

The statistical tests that appear most in bioinformatics — when to use each, what the assumptions are, and common mistakes to avoid.

statistics · hypothesis testing · t-test · chi-squared · survival analysis

Bioinformatics analyses routinely require statistical testing, but the tests used cover a narrower range than a full statistics course. This chapter focuses on the tests you'll encounter most often, emphasizing assumptions (what the test requires to be valid) and the biological contexts where each applies.

Two-Group Comparisons

Student's t-test

The workhorse of two-group comparisons. Tests whether the means of two groups differ.

Assumptions:

  • Continuous data (interval/ratio scale)
  • Approximately normally distributed within each group (or n > 30, where the central limit theorem makes the test robust to non-normality)
  • Independent observations
  • For two-sample t-test: equal or unequal variances (Welch's t-test handles unequal variances — use it by default)

When used in bioinformatics:

  • Comparing continuous biomarker levels between cases and controls
  • Comparing normalized expression values (log2 CPM, TPM) between conditions — though specialized tools (limma) use more sophisticated variance modeling
  • Comparing alpha diversity (species richness) between microbiome samples

Not appropriate for:

  • Raw RNA-seq counts (negative binomial distribution, not normal)
  • Comparing proportions (use proportion test or chi-squared)
  • Paired samples analyzed as if they were independent (use the paired t-test instead)

Paired t-test: when each observation in group A is matched to one in group B (before/after, same patient treated vs. untreated). Dramatically increases power by removing between-subject variance.
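
Both variants are one call each in SciPy; the sample values below are invented for illustration:

```python
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical log2 biomarker levels in two independent groups
cases    = [6.1, 5.8, 6.4, 6.0, 6.3, 5.9]
controls = [5.0, 5.2, 4.8, 5.1, 4.7, 5.3]

# Welch's t-test: equal_var=False is the safe default
t_stat, p = ttest_ind(cases, controls, equal_var=False)

# Paired design: same subjects measured before and after treatment
before = [5.0, 5.5, 6.1, 4.8, 5.9]
after  = [5.5, 5.9, 6.6, 5.1, 6.4]
t_paired, p_paired = ttest_rel(after, before)
```

Note that `ttest_ind` defaults to the classic equal-variance Student's test, so `equal_var=False` must be passed explicitly to get Welch's version.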

Mann-Whitney U test (Wilcoxon rank-sum test)

Non-parametric alternative to the t-test. Tests whether the distributions of two groups are shifted relative to each other, without assuming normality.

When to prefer Mann-Whitney over t-test:

  • Small n (n < 20) where normality can't be assumed
  • Ordinal data (e.g., disease grade 0–4)
  • Presence of outliers that would heavily influence the mean
  • Clearly non-normal distributions (e.g., protein turnover rates, metabolite concentrations)

Trade-off: lower power than the t-test when the data truly is normally distributed. If n is large and there are no severe outliers, the t-test and Mann-Whitney give similar p-values.
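
A minimal SciPy sketch on hypothetical ordinal disease grades, where a t-test on means would be inappropriate:

```python
from scipy.stats import mannwhitneyu

# Hypothetical disease grades (ordinal, 0-4) in two cohorts
grade_a = [0, 1, 1, 2, 2, 2, 3]
grade_b = [2, 3, 3, 3, 4, 4, 4]

# Compares ranks, not means: valid for ordinal data, robust to outliers
u_stat, p = mannwhitneyu(grade_a, grade_b, alternative="two-sided")
```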

Comparing More Than Two Groups

One-way ANOVA

Extends the t-test to three or more groups. Tests whether any group means differ.

F-statistic: ratio of between-group variance to within-group variance. Large F → groups differ more than expected by chance.

Important: a significant ANOVA tells you that at least one group differs, but not which. Post-hoc tests (Tukey's HSD, pairwise tests with Bonferroni correction, Dunnett's test for comparing each group to a single control) identify which pairs differ.

In bioinformatics: comparing expression levels across multiple cell lines, multiple developmental timepoints, or multiple patient cohorts.
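
A sketch of the omnibus test plus a Tukey post-hoc in SciPy (the expression values are invented; `tukey_hsd` requires SciPy ≥ 1.8):

```python
from scipy.stats import f_oneway, tukey_hsd  # tukey_hsd: SciPy >= 1.8

# Hypothetical log2 expression of one gene in three cell lines
line1 = [5.1, 5.3, 4.9, 5.2, 5.0]
line2 = [5.2, 5.4, 5.1, 5.3, 5.2]
line3 = [6.8, 7.1, 6.9, 7.2, 7.0]

f_stat, p = f_oneway(line1, line2, line3)  # omnibus: do any means differ?
posthoc = tukey_hsd(line1, line2, line3)   # pairwise comparisons, adjusted
# posthoc.pvalue[i, j] holds the adjusted p-value for group i vs. group j
```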

Kruskal-Wallis test

Non-parametric alternative to one-way ANOVA. Tests whether samples come from the same distribution without normality assumptions.

Common in microbiome research (comparing alpha diversity across multiple sample groups) and clinical studies where normality can't be assumed.
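
A sketch with invented Shannon diversity values for three sample groups:

```python
from scipy.stats import kruskal

# Hypothetical Shannon diversity in three microbiome sample groups
group_a = [2.1, 2.3, 1.9, 2.8, 2.2]
group_b = [3.0, 3.4, 2.9, 3.6, 3.1]
group_c = [1.1, 1.4, 0.9, 1.3, 1.6]

h_stat, p = kruskal(group_a, group_b, group_c)  # rank-based omnibus test
```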

Categorical Data

Chi-squared test

Tests whether two categorical variables are independent.

Construction: compare observed cell counts in a contingency table to expected counts (under the null of independence). χ² = Σ (O-E)²/E.

Assumptions:

  • Independent observations
  • Expected count ≥ 5 in each cell (use Fisher's exact test if this is violated)

When used:

  • Testing whether an SNP genotype is associated with a disease outcome (simple 2×2 test before GWAS methods)
  • Testing whether GO term enrichment is significant (enriched vs. not enriched × query set vs. background)
  • Testing whether two mutations co-occur or are mutually exclusive in tumor samples
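
The genotype-disease case above can be sketched as a 2×2 table in SciPy (counts are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = variant carrier / non-carrier,
# columns = disease / healthy
table = np.array([[30, 70],
                  [10, 90]])

# Applies Yates' continuity correction by default for 2x2 tables
chi2_stat, p, dof, expected = chi2_contingency(table)
```

Inspecting `expected` is a quick way to check the "expected count ≥ 5 per cell" assumption before trusting the p-value.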

Fisher's Exact Test

Exact version of chi-squared for 2×2 contingency tables when expected cell counts are small (n < ~20, or any cell expected count < 5).

Used extensively in gene set enrichment: given a set of differentially expressed genes and a pathway gene set, is the overlap greater than expected by chance?

                  In DEG set     Not in DEG set   Row total
In pathway             a               b          pathway size
Not in pathway         c               d
Column total      DEG set size                    total genes

The one-sided Fisher's exact p-value for this over-representation test is identical to the p-value of a one-sided hypergeometric test (next section).
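
As a sketch with invented counts (12 of 200 DEGs fall in a 100-gene pathway, out of 10,000 genes total, so the expected overlap is 2):

```python
from scipy.stats import fisher_exact

# Hypothetical contingency counts for one pathway
a, b = 12, 100 - 12              # pathway genes: in / not in DEG set
c, d = 200 - 12, 10000 - 100 - (200 - 12)

# alternative="greater" tests over-representation only
odds_ratio, p = fisher_exact([[a, b], [c, d]], alternative="greater")
```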

Hypergeometric Test

The formal statistical test for overlap significance. Given m total genes, K in the pathway, n in the DEG set, and k in the overlap: what is the probability of k or more genes overlapping by chance?

This is exactly what tools like topGO, DAVID, and Enrichr use internally. Understanding it lets you interpret their outputs critically: the choice of background (the "universe") strongly affects the p-value and is often made poorly (e.g., using all annotated genes rather than only the genes actually measured or expressed in that cell type).
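
The tail probability is `hypergeom.sf` in SciPy, and it matches the one-sided Fisher's exact p-value on the corresponding 2×2 table (numbers invented):

```python
from scipy.stats import hypergeom, fisher_exact

# Hypothetical counts: m total genes, K in pathway, n DEGs, k in the overlap
m, K, n, k = 20000, 200, 500, 20

# P(overlap >= k) under the null of random draws
p_hyper = hypergeom.sf(k - 1, m, K, n)

# The same test phrased as a one-sided Fisher's exact on the 2x2 table
p_fisher = fisher_exact([[k, K - k], [n - k, m - K - n + k]],
                        alternative="greater")[1]
```

`sf(k - 1)` rather than `sf(k)` is the classic off-by-one trap: the survival function is P(X > x), so P(X ≥ k) requires k - 1.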

Survival Analysis

Survival analysis handles time-to-event data — when an event of interest (death, relapse, progression) occurs. It's ubiquitous in clinical oncology.

Why Standard Tests Fail for Survival Data

Two problems:

  1. Censoring: many patients haven't had the event by the study end — you know they survived at least to that point, but not the exact survival time. Censored data can't be excluded (information loss) or treated as the event (inflates event counts).
  2. Time-varying hazard: the risk of the event may change over time (a patient recovering from surgery has high risk early, then lower risk if they survive).

Kaplan-Meier Estimator

Estimates the survival function S(t) = probability of surviving past time t, accounting for censored observations.

At each event time tᵢ: S(tᵢ) = S(tᵢ₋₁) × (1 - dᵢ/nᵢ), where dᵢ = events at time tᵢ, nᵢ = at risk just before tᵢ.

The Kaplan-Meier curve is the standard visualization for survival data — a step function that drops at each event time. Median survival is read where the curve crosses S = 0.5.
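
The product-limit formula above is short enough to implement directly; a minimal sketch in NumPy (survival libraries provide this plus confidence intervals and plotting):

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate; event = 1 if observed, 0 if censored.
    Returns the distinct event times and S(t) just after each one."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(time >= t)            # n_i: at risk just before t
        d = np.sum((time == t) & (event == 1))   # d_i: events at t
        s *= 1.0 - d / n_at_risk                 # S(t_i) = S(t_{i-1}) * (1 - d_i/n_i)
        surv.append(s)
    return event_times, np.array(surv)
```

For times [1, 2, 3, 4, 5] with events [1, 1, 0, 1, 0], the estimates after the event times 1, 2, and 4 are 0.8, 0.6, and 0.3: the censored subject at t = 3 leaves the risk set without causing a drop.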

Log-Rank Test

Tests whether two survival curves differ significantly. The standard test for comparing treatment arms in clinical trials.

Assumptions:

  • Proportional hazards: the ratio of hazard rates between groups is constant over time. If the survival curves cross, proportional hazards is violated and the log-rank test is unreliable.
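
The test itself compares observed vs. expected events in one group across all event times; a hand-rolled sketch of the standard statistic (survival libraries provide tested implementations):

```python
import numpy as np
from scipy.stats import chi2

def logrank(time1, event1, time2, event2):
    """Two-group log-rank test (event = 1 observed, 0 censored).
    Returns the chi-squared statistic (1 df) and its p-value."""
    t = np.concatenate([time1, time2]).astype(float)
    e = np.concatenate([event1, event2]).astype(int)
    g = np.concatenate([np.zeros(len(time1)), np.ones(len(time2))])
    obs_minus_exp, var = 0.0, 0.0
    for ti in np.unique(t[e == 1]):                   # each distinct event time
        at_risk = t >= ti
        n = at_risk.sum()
        n1 = (at_risk & (g == 0)).sum()               # at risk in group 1
        d = ((t == ti) & (e == 1)).sum()              # events at this time
        d1 = ((t == ti) & (e == 1) & (g == 0)).sum()  # ... in group 1
        obs_minus_exp += d1 - d * n1 / n              # observed minus expected
        if n > 1:                                     # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = obs_minus_exp ** 2 / var
    return stat, chi2.sf(stat, df=1)
```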

Cox Proportional Hazards Model

Extends survival analysis to include covariates:

h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...)

Where h₀(t) is the baseline hazard (unspecified — semi-parametric) and exp(βᵢ) is the hazard ratio for covariate Xᵢ.

Hazard ratio (HR): HR = 2 means the instantaneous rate of the event is 2× higher in the exposed group vs. the reference.

Used in clinical genomics: does high expression of a gene predict worse overall survival? Is a specific mutation associated with reduced progression-free survival?

In TCGA (The Cancer Genome Atlas) analyses, Cox regression with gene expression as a continuous variable tests prognostic value across thousands of genes simultaneously — requiring FDR correction.

Correlation

Pearson Correlation

Measures linear correlation between two continuous variables. Assumes bivariate normality. Sensitive to outliers.

Use for: comparing gene expression between two conditions when normality is reasonable.

Spearman Rank Correlation

Non-parametric rank-based correlation. Tests monotonic (not just linear) association. Robust to outliers.

Use for: comparing omics features generally (biomarker vs. clinical variable), microbiome diversity vs. metadata variables, any data with potential outliers.

Pitfall: both Pearson and Spearman measure pairwise correlation, but correlation is not causation and does not imply regulatory relationships.
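
The outlier sensitivity is easy to demonstrate with simulated data: one extreme point wrecks Pearson's r while Spearman's ρ barely moves.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.5, size=50)   # roughly linear relationship

x_out = np.append(x, 10.0)               # one extreme discordant pair
y_out = np.append(y, -10.0)

r, _ = pearsonr(x_out, y_out)            # dragged far down by one point
rho, _ = spearmanr(x_out, y_out)         # ranks barely move
```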

Test Selection Guide

Data type                          Comparison                     Recommended test
Continuous, normal                 2 groups                       Welch t-test
Continuous, non-normal or ordinal  2 groups                       Mann-Whitney
Continuous, normal                 ≥3 groups                      One-way ANOVA + post-hoc
Continuous, non-normal             ≥3 groups                      Kruskal-Wallis
Continuous, paired                 Before/after, same subjects    Paired t-test or Wilcoxon
Categorical                        2×2 contingency, n > 40        Chi-squared
Categorical                        2×2 contingency, small n       Fisher's exact
Enrichment                         Gene set overlap               Hypergeometric (Fisher's)
Count data (RNA-seq)               2+ conditions                  DESeq2, edgeR, limma-voom
Time to event                      2 groups, survival             Log-rank
Time to event                      Multiple covariates            Cox proportional hazards
Correlation                        Linear, normal                 Pearson r
Correlation                        Non-linear or robust           Spearman ρ

Effect Sizes: Don't Just Report p-values

A p-value tells you whether an effect exists; it doesn't tell you whether it matters. With large samples, tiny meaningless differences become highly significant. Effect size measures the magnitude of the difference:

  • Cohen's d (standardized mean difference): d = |μ₁ - μ₂| / σ_pooled. d = 0.2 small, 0.5 medium, 0.8 large.
  • Log₂ fold change: the standard effect size for differential expression. |log₂FC| > 1 (a 2-fold change) is a common threshold for biological significance.
  • Hazard ratio: HR = 2 means 2× higher event rate.
  • AUC (AUROC): discriminative ability of a biomarker. AUC = 0.5 is random; AUC > 0.75 is clinically useful; AUC > 0.9 is excellent.

Reporting both p-value and effect size, with confidence intervals, is increasingly required by journals and is scientifically necessary for interpreting results. A fold change of 1.1 with p = 10⁻¹⁰ (from a huge transcriptomics study) may not be biologically relevant; a fold change of 3 with p = 0.04 (from a small study) may be very important but needs independent replication.
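
Cohen's d, as defined above, is a few lines of NumPy; the two groups below are invented:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # pooled variance weights each group's sample variance by its df
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return abs(a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d([1, 2, 3, 4], [3, 4, 5, 6])   # a "large" effect by the d > 0.8 rule
```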