Statistics taught in introductory courses assumes a setting that biological data frequently violates. Small sample sizes, many variables, non-independent observations, hierarchical structure, zero-inflated distributions, and p-values applied at the scale of 20,000 genes simultaneously — these are the normal conditions of bioinformatics, not edge cases.
Understanding why biological data is different — and what approaches those differences require — is the foundation for doing valid analysis. This chapter introduces the key statistical challenges you'll encounter before we address them one by one.
The Multiple Testing Problem: The Central Challenge
In a standard two-group comparison, you run one test and compare the result to α = 0.05. At this threshold, 5% of truly null results will be called significant (false positives). With one test, this is acceptable.
Now consider a transcriptomics experiment: you measure expression of 20,000 genes and test each one for differential expression. If every gene were truly unchanged, you would expect 0.05 × 20,000 = 1,000 genes to pass p < 0.05 by chance alone, and those chance hits accumulate no matter how strong your true effects are. The standard p < 0.05 threshold is useless at this scale.
This is the multiple testing problem, and it's pervasive in bioinformatics:
- Genome-wide association studies (GWAS): ~6 million SNPs tested
- Differential expression: ~20,000 genes
- ChIP-seq peak calling: thousands of candidate genomic windows
- Single-cell analysis: ~30,000 genes × many cell clusters
Two main solutions:
Bonferroni correction (FWER control): Divide α by the number of tests. For 20,000 tests at α=0.05: threshold = 0.05/20,000 = 2.5×10⁻⁶. This controls the family-wise error rate (the probability of even one false positive). It is very conservative: when power is limited, many true positives fail to clear the threshold and are missed. GWAS uses p < 5×10⁻⁸ by convention, which is the Bonferroni threshold for ~10⁶ independent tests.
Benjamini-Hochberg (FDR control): Controls the false discovery rate (the expected proportion of discoveries that are false) rather than the probability of any error. Sort the p-values in ascending order, find the largest rank k such that p(k) ≤ kq/m (where q is the desired FDR level and m is the number of tests), and declare tests 1 through k significant.
FDR at q=0.05 means 5% of your called significant results are expected to be false positives. This is appropriate for exploratory analysis where you want to find real signals for follow-up, at the cost of some false positives.
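To make the contrast concrete, here is a minimal sketch with simulated p-values, using the multipletests function from statsmodels (the mix of nulls and true effects is invented for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Simulate 20,000 tests: 19,500 true nulls (uniform p-values)
# plus 500 true effects (p-values concentrated near zero).
pvals = np.concatenate([rng.uniform(0, 1, 19_500),
                        rng.beta(0.5, 50, 500)])

# Naive threshold: roughly 5% of the 19,500 nulls pass by chance.
print("p < 0.05:", (pvals < 0.05).sum(), "calls")

# Bonferroni (FWER control) vs. Benjamini-Hochberg (FDR control).
rej_bonf, *_ = multipletests(pvals, alpha=0.05, method="bonferroni")
rej_bh, p_adj, *_ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("Bonferroni:", rej_bonf.sum(), "| BH at q=0.05:", rej_bh.sum())
```

Running this shows the pattern described above: the naive threshold calls ~1,000 nulls significant, Bonferroni keeps almost nothing, and BH recovers most of the true effects at a controlled error rate.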
In a DESeq2 differential expression analysis of 20,000 genes, applying Bonferroni would require p < 2.5×10⁻⁶ to be significant. This might leave you with zero findings even when hundreds of genes truly change.
FDR at q=0.05 allows p-values up to ~0.001 to be significant (the exact cutoff depends on the p-value distribution), recovering hundreds of true positives. The trade-off: an expected 5% of those calls are false positives, but you know the rate and can prioritize the most promising hits for follow-up.
For clinical biomarker development, you may need stricter FDR control. For exploratory screens, 10–20% FDR is often acceptable.
High Dimensionality: More Variables Than Observations
Standard statistical theory is built for n >> p (n observations, p variables). Many biological datasets have the opposite: GWAS with 1000 patients and 6 million SNPs; single-cell RNA-seq with 5000 cells and 30,000 genes.
Problems in high dimensions:
- Overfitting: a model with more parameters than observations will fit the training data perfectly but generalize poorly
- Curse of dimensionality: in high dimensions, pairwise distances concentrate and all points become nearly equidistant, so distance metrics lose their discriminating power (see the demonstration after this list)
- Spurious correlations: with many variables, random correlations appear significant by chance
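The distance-concentration effect is easy to demonstrate on random data; this standalone sketch (not tied to any dataset above) compares the farthest and nearest pairwise distances as dimensionality grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Ratio of farthest to nearest pairwise distance among random points:
# a ratio near 1.0 means distances no longer separate neighbors from
# strangers.
for dim in (2, 10, 100, 1000, 10_000):
    points = rng.uniform(size=(100, dim))
    d = pdist(points)  # all pairwise Euclidean distances
    print(f"dim={dim:>6}: max/min distance ratio = {d.max() / d.min():.2f}")
```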
Solutions:
- Dimensionality reduction before analysis (PCA, UMAP, NMF)
- Regularized regression (Ridge, LASSO, Elastic Net) for prediction models — adds a penalty term that shrinks coefficients, reducing effective model complexity
- Sparse methods that select a subset of features (LASSO forces many coefficients to exactly zero; see the sketch after this list)
- Cross-validation for model selection and performance estimation
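As a minimal sketch of the regularization-plus-cross-validation pattern, the following uses scikit-learn's LassoCV on synthetic p >> n data (all numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)

# 50 samples, 2,000 features: only the first 5 features carry signal.
n, p = 50, 2_000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.normal(scale=0.5, size=n)

# LassoCV picks the penalty strength by cross-validation; the L1
# penalty drives most coefficients to exactly zero (sparse selection).
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"non-zero coefficients: {selected.size} of {p}")
print("selected feature indices:", selected[:10])
```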
Non-Normal Distributions
RNA-seq count data is not normally distributed. It's:
- Count data (non-negative integers)
- Overdispersed relative to Poisson (variance >> mean), following a negative binomial distribution (see the check after this list)
- Zero-inflated in single-cell RNA-seq (many genes have zero counts in any given cell)
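A quick simulation shows the overdispersion that breaks Poisson assumptions (a sketch using NumPy's negative binomial parameterization; a real analysis would compute means and variances from a count matrix):

```python
import numpy as np

rng = np.random.default_rng(3)

# One gene, mean count 100. Poisson: variance equals the mean.
# Negative binomial: variance = mean + dispersion * mean^2.
mean, dispersion = 100, 0.2
pois = rng.poisson(mean, size=100_000)

# NumPy parameterizes NB by (n successes, success probability p).
n_param = 1 / dispersion
p_param = n_param / (n_param + mean)
nb = rng.negative_binomial(n_param, p_param, size=100_000)

print(f"Poisson: mean={pois.mean():.1f}, var={pois.var():.1f}")
print(f"NegBin : mean={nb.mean():.1f}, var={nb.var():.1f}")  # var >> mean
```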
Applying t-tests directly to raw counts is invalid. The field uses specialized models:
- DESeq2 and edgeR: fit negative binomial models with empirical Bayes shrinkage of dispersion estimates
- MAST and Seurat-based tests: for single-cell data with zero inflation
- Voom (limma): transforms counts to log-counts-per-million and applies precision weights, enabling the limma linear-model framework
Understanding the appropriate error model is not optional for transcriptomics — wrong distributional assumptions lead to inflated false positive rates.
Non-Independence: Biological Replicates and Batch Effects
Statistical tests assume observations are independent. Biological data is frequently non-independent:
Paired samples: before/after treatment measurements from the same patient. Using an unpaired test discards this structure and loses power.
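The cost of ignoring pairing is easy to simulate (a sketch with invented before/after values; scipy provides both tests):

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(4)

# 10 patients: large between-patient baseline variation, but a small,
# consistent treatment effect within each patient.
baseline = rng.normal(loc=10, scale=3, size=10)
before = baseline + rng.normal(scale=0.5, size=10)
after = baseline + 1.0 + rng.normal(scale=0.5, size=10)  # true effect = 1

print("unpaired p =", ttest_ind(after, before).pvalue)
print("paired   p =", ttest_rel(after, before).pvalue)
# The paired test removes between-patient variation and is far more
# powerful here.
```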
Repeated measures: multiple timepoints from the same subject. These require mixed-effects models or repeated-measures ANOVA.
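For repeated measures, a random intercept per subject is the simplest mixed-effects structure; this is a sketch using statsmodels, with invented data and column names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# 8 subjects, 4 timepoints each; every subject has its own baseline.
subjects = np.repeat(np.arange(8), 4)
time = np.tile(np.arange(4), 8)
subject_baseline = rng.normal(scale=2, size=8)[subjects]
y = 5 + 0.8 * time + subject_baseline + rng.normal(scale=0.5, size=32)
df = pd.DataFrame({"y": y, "time": time, "subject": subjects})

# A random intercept per subject models the within-subject correlation.
model = smf.mixedlm("y ~ time", df, groups=df["subject"]).fit()
print(model.summary())
```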
Batch effects: samples processed in different labs, on different days, or with different reagent lots will cluster by batch rather than biology. Batch effect correction tools (ComBat, limma::removeBatchEffect) or design matrices that include batch as a covariate are essential.
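Including batch in the design matrix is a one-line change in a formula interface. This sketch fits a single simulated gene with statsmodels OLS; DESeq2 and limma express the same idea as a ~ batch + condition design:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)

# One gene, 12 samples: 2 conditions measured across 2 batches.
condition = np.array(["ctrl", "treat"] * 6)
batch = np.repeat(["A", "B"], 6)
expr = (2.0 * (condition == "treat")   # true condition effect
        + 3.0 * (batch == "B")         # batch shift
        + rng.normal(scale=0.5, size=12))
df = pd.DataFrame({"expr": expr, "condition": condition, "batch": batch})

# Without the batch covariate, the batch shift inflates the residual
# error (and biases the estimate if batch and condition are unbalanced).
naive = smf.ols("expr ~ condition", df).fit()
adjusted = smf.ols("expr ~ condition + batch", df).fit()
print(naive.params, adjusted.params, sep="\n")
```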
Technical replicates vs. biological replicates: sequencing the same RNA library twice (technical replicate) gives highly correlated data — not two independent observations. Only biological replicates (different patients, different animals, different wells of cells) provide truly independent observations. A study with 10 technical replicates and no biological replicates has n=1, regardless of sequencing depth.
Confounding Variables
A confounder is a variable associated with both the exposure (treatment, genotype) and the outcome (gene expression, disease); left uncontrolled, it creates a spurious association between the two.
Classic example: case-control study comparing gene expression in disease vs. healthy individuals. If disease patients are older on average, age-related gene expression changes will appear as disease-specific changes. Including age as a covariate in the linear model controls for this.
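A small simulation makes the age-confounding example concrete (all values invented; the pattern is ordinary least squares with and without the covariate):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Disease status has NO true effect on this gene; age does.
n = 200
disease = np.repeat([0, 1], n // 2)
age = rng.normal(loc=50 + 10 * disease, scale=5)   # cases are older
expr = 0.1 * age + rng.normal(scale=0.5, size=n)   # age-driven expression

df = pd.DataFrame({"expr": expr, "disease": disease, "age": age})
naive = smf.ols("expr ~ disease", df).fit()           # spurious association
adjusted = smf.ols("expr ~ disease + age", df).fit()  # confounder controlled
print(f"naive disease effect:    {naive.params['disease']:.3f}")
print(f"adjusted disease effect: {adjusted.params['disease']:.3f}")  # ~0
```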
In genomics:
- Population stratification in GWAS: individuals with different ancestries have different allele frequencies AND different disease rates. A SNP enriched in a high-risk ancestry group will appear disease-associated even if it has no biological role. Principal components of genome-wide SNP variation are included as covariates to control for ancestry (see the sketch after this list).
- Cell type composition in bulk RNA-seq: differences in immune infiltration (tumor vs. normal) confound gene expression comparisons. Cell type deconvolution methods (CIBERSORT, MuSiC) estimate cell type proportions for correction.
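A sketch of the ancestry-adjustment idea on a hypothetical genotype matrix (real GWAS pipelines compute PCs with dedicated tools, but the principle is PCA followed by covariate adjustment):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)

# Hypothetical genotype matrix: 500 individuals x 5,000 SNPs,
# coded 0/1/2 copies of the alternate allele.
genotypes = rng.integers(0, 3, size=(500, 5_000)).astype(float)

# Top principal components of genome-wide variation capture ancestry;
# include them as covariates in each per-SNP association model.
pcs = PCA(n_components=10).fit_transform(genotypes)
print(pcs.shape)  # (500, 10) -> columns to add to the design matrix
```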
The Sample Size Problem
Biological experiments are expensive. A typical RNA-seq experiment has n=3–5 per group — dramatically underpowered for detecting small effects after multiple testing correction. The replication crisis in biomedical research is partly caused by underpowered studies that find (and publish) results that can't be replicated.
Power analysis should precede any study: given the expected effect size, desired power (usually 80%), sample variance, and number of tests, what minimum sample size is needed? Tools include the pwr package in R and, for RNA-seq-specific calculations, Bioconductor packages such as RNASeqPower.
For a typical RNA-seq experiment (1000 genes with differential expression at FC=1.5, α=5×10⁻⁵, power=80%): n=5–10 per group is often a minimum, with n=10+ recommended for reliable results.
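For a simple two-group comparison the calculation is standard; this sketch uses statsmodels with an illustrative standardized effect size and a Bonferroni-style per-test alpha (count-based RNA-seq power tools model the negative binomial directly):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size for 80% power to detect a standardized effect size
# (Cohen's d) of 1.0 at a multiple-testing-adjusted per-test alpha.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=1.0,
                                   alpha=5e-5,
                                   power=0.80,
                                   alternative="two-sided")
print(f"required n per group: {n_per_group:.1f}")
```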
Biological Variation vs. Measurement Noise
Biological systems are inherently variable — individuals differ, cells within a sample differ, environmental conditions fluctuate. This biological variability is signal-carrying (differences between conditions) and must be distinguished from measurement noise (sequencing errors, PCR bias, handling variation).
Separating sources of variation requires:
- Properly designed replication
- Appropriate variance decomposition (mixed models, ANOVA)
- Understanding of the biological system (expected effect sizes, biological vs. technical CV)
For single-cell RNA-seq: each cell is highly variable, but the average across cells is more stable. Clustering and trajectory analyses must account for the discrete, bursty nature of transcription (transcriptional noise) rather than treating cell-to-cell differences as noise to be suppressed.
Summary: What to Watch For
Before running any biological data analysis, ask:
- How many tests am I running? → Apply appropriate multiple testing correction
- Are my observations independent? → Check for paired samples, batch effects, hierarchical structure
- Is the error model appropriate? → Count data? Zero-inflated? Proportions? Choose the right test
- What are the confounders? → Include age, sex, ancestry, cell type composition as appropriate
- Is my sample size adequate? → Power analysis before the experiment, not after
- Biological vs. technical replicates? → Only biological replicates provide independent observations
These principles will guide every analysis in the chapters that follow.