Supervised machine learning — learning from labeled examples to predict new ones — has found extensive application in biology. Predicting whether a patient will respond to treatment. Classifying whether a genomic variant is pathogenic. Identifying cancer subtypes from gene expression profiles. Predicting protein structure from sequence. The problems are diverse, but the workflow and pitfalls are remarkably consistent.
This chapter focuses on applying supervised ML in biological contexts: what works, what doesn't, and the specific traps that biological data sets for the unwary.
The Supervised Learning Setup in Biology
Features (X): measurements on a sample. In biology, features are almost always high-dimensional:
- Gene expression: ~20,000 genes per sample
- Genome sequence: millions of SNPs per individual
- Clinical + molecular combined: hundreds to thousands of variables
- DNA or protein sequence: one-hot encoded
Labels (y): what you're predicting:
- Binary: responder/non-responder, pathogenic/benign, cancer/normal
- Multi-class: cancer subtype (LumA/LumB/HER2+/TNBC)
- Continuous (regression): drug IC50, protein stability, survival time
The fundamental constraint: biological datasets are almost always small relative to feature dimensionality. A typical clinical genomics study might have 200 patients and 50,000 features — a ratio that is unfavorable for most ML algorithms.
Model Selection for Biological Data
Regularized Logistic/Linear Regression
For high-dimensional, small-n data, regularized regression is often the best starting point:
LASSO (L1 regularization): adds a penalty proportional to |β|. Drives many coefficients exactly to zero — automatic feature selection. Final models contain tens to hundreds of features from an initial space of thousands. Interpretable; each selected feature has a coefficient.
Ridge (L2 regularization): adds a penalty proportional to β². Shrinks all coefficients but rarely to zero. Better when many features contribute small amounts (polygenic traits, where thousands of SNPs each contribute slightly).
Elastic Net: combines L1 and L2. Handles correlated features better than LASSO alone (LASSO picks one arbitrarily from a correlated group; Elastic Net tends to group them).
These are appropriate when:
- n << p (more features than samples)
- Interpretability is required (which genes/SNPs drive the prediction?)
- Linear decision boundaries are reasonable
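As a concrete starting point, here is a minimal scikit-learn sketch of an elastic-net-penalized logistic regression on simulated high-dimensional data; the dataset and the `l1_ratio` and `C` values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulate a small-n, high-p dataset: 200 samples, 5,000 features,
# of which only 20 are informative
X, y = make_classification(n_samples=200, n_features=5000,
                           n_informative=20, random_state=0)

# l1_ratio interpolates between Ridge (0.0) and LASSO (1.0);
# C is the inverse regularization strength (smaller = stronger penalty)
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.1, max_iter=5000)

aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {aucs.mean():.2f} +/- {aucs.std():.2f}")

# Sparsity: coefficients driven exactly to zero are effectively dropped
model.fit(X, y)
print(f"{np.sum(model.coef_ != 0)} of {X.shape[1]} features retained")
```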
Decision Trees and Random Forests
Random Forest: ensemble of decision trees, each trained on a bootstrapped sample with a random feature subset. Predictions are averaged across trees.
Advantages for biological data:
- Handles high dimensionality without explicit regularization
- Captures non-linear feature interactions
- Robust to irrelevant features
- Provides feature importance estimates
- Handles mixed data types (categorical + continuous)
Feature importance: impurity-based importance (mean decrease in Gini impurity) or permutation importance. Permutation importance is more reliable — it measures the actual performance drop when a feature is shuffled.
Caution with correlated features: when features are highly correlated (common in transcriptomics — co-regulated modules), tree-based importance is split among correlated features, making any single feature appear less important. SHAP values address this more rigorously.
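A minimal sketch of permutation importance for a random forest, reusing the simulated X and y from the elastic-net example above; the held-out split and the number of repeats are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each feature in the held-out split and measure the AUC drop;
# features whose shuffling hurts performance most are most important
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:10]
for i in top:
    print(f"feature {i}: mean AUC drop {result.importances_mean[i]:.4f}")
```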
Gradient Boosting (XGBoost, LightGBM)
Gradient boosting builds an ensemble of weak trees sequentially, each correcting errors of the previous. State-of-the-art for tabular data.
In biology, gradient boosting excels for:
- Clinical + molecular combined predictors
- Datasets with mixed feature types
- Non-linear interactions between clinical variables
The downside: prone to overfitting on small biological datasets. Requires careful regularization and early stopping.
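A minimal sketch of a regularized gradient-boosting classifier with early stopping, again reusing X and y. It assumes a recent xgboost release (1.6 or later) where `early_stopping_rounds` is a constructor argument; all hyperparameter values are illustrative.

```python
from sklearn.model_selection import train_test_split
import xgboost as xgb

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=2000,         # upper bound; early stopping picks the real number
    learning_rate=0.05,
    max_depth=3,               # shallow trees to limit overfitting
    subsample=0.8,             # row subsampling per tree
    colsample_bytree=0.8,      # feature subsampling per tree
    reg_lambda=1.0,            # L2 regularization on leaf weights
    early_stopping_rounds=50,  # stop when validation AUC stops improving
    eval_metric="auc",
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("best iteration:", model.best_iteration)
```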
Support Vector Machines (SVMs)
SVMs find the maximum-margin hyperplane separating classes. With the kernel trick (RBF kernel), they handle non-linear boundaries in high dimensions.
Historically widely used for microarray expression classification (the "SVM era" of bioinformatics). Now largely supplanted by random forests for tabular data, but still used in sequence-based prediction tasks (splice site recognition, binding site classification) where kernel design can encode biological knowledge.
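For completeness, a minimal RBF-kernel SVM sketch reusing X and y; scaling matters for SVMs, so the features are standardized inside a pipeline.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize within each training fold, then fit an RBF-kernel SVM;
# the roc_auc scorer uses the decision function, so probability=True is not needed
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(svm, X, y, cv=5, scoring="roc_auc").mean())
```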
Neural Networks and Deep Learning
Covered in the next chapter. For tabular biological data with small n, deep learning is often not competitive with gradient boosting or regularized regression. Deep learning becomes dominant when:
- Data is large (millions of sequences, whole slide images)
- Raw data structure matters (sequences, images — where CNNs or transformers can learn representations)
The Validation Trap: Biological Data Pitfalls
This is the most critical section for practitioners. Biological ML papers frequently report inflated performance due to validation mistakes.
Sample Size and Power
A training set of 50 samples and a test set of 20 samples gives very wide confidence intervals on any performance estimate. An AUC of 0.82 on 20 test samples might be indistinguishable from AUC 0.60 in a larger study.
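A minimal bootstrap sketch of how wide that interval actually is; `y_true` and `y_score` are simulated placeholders for your own test labels and predicted scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=20)                 # 20 test labels (placeholder)
y_score = 0.5 * y_true + rng.random(20)              # noisy scores (placeholder)

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
    if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC {roc_auc_score(y_true, y_score):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```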
Key question before model development: do you have enough samples for reliable validation? General guidelines:
- Binary classification: at minimum 50–100 events (cases) in the test set for reliable AUC estimation
- Rare classes: need enough positive examples to train — a dataset with 95% negatives and 5% positives requires class weighting or oversampling (SMOTE)
Cross-Validation Correctly
Standard k-fold cross-validation (k=5 or 10): split data into k folds, train on k-1 folds, test on the remaining fold, rotate.
Critical mistake: leakage through feature selection. A common error in genomics:
- Select the top 100 most differentially expressed genes across all samples
- Train a classifier using those 100 genes with cross-validation
This is wrong. The feature selection used all samples including the test fold, so the test data influenced which features were selected. The reported performance is optimistic.
Correct approach: the entire feature selection pipeline must be inside the cross-validation loop:
- In each CV fold: select features using only the training samples
- Apply the selected features to the test fold
- Never use test set information for any step that feeds into the model
In scikit-learn, this means using Pipeline to chain feature selection + model — the pipeline is then passed to cross_validate, ensuring correct separation.
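A minimal sketch of that pattern, reusing X and y: the scaler and the univariate filter (SelectKBest here, an illustrative choice) are re-fit inside every training fold, so the test fold never leaks into feature selection.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),                # fit on the training fold only
    ("select", SelectKBest(f_classif, k=100)),  # top 100 features, per fold
    ("clf", LogisticRegression(max_iter=5000)),
])

# cross_validate refits the entire pipeline inside each fold
scores = cross_validate(pipe, X, y, cv=5, scoring="roc_auc")
print(f"leak-free CV AUC: {scores['test_score'].mean():.2f}")
```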
Independent Test Set vs. Cross-Validation
For clinical biomarker development, cross-validation is not sufficient for claiming clinical validity. Cross-validation estimates generalization within the same cohort; truly independent validation requires:
- A separate cohort (different hospital, different country, different time period)
- Prospective data collected after model development (not retrospective)
Many biomarkers published with impressive cross-validation AUCs fail in independent validation — different patient populations, different sample handling protocols, different platforms.
A 2020 survey of 94 published cancer biomarker studies found that only 7% were validated in an independent cohort. The field has a replication problem. For your own work, build independent validation into the study design from the start — not as an afterthought when a reviewer asks.
Class Imbalance
Biological datasets are often imbalanced:
- Rare disease vs. common controls (1:100 ratio)
- Pathogenic vs. benign variants (pathogenic = minority class)
- Rare cell types in single-cell data
Why accuracy is misleading: a classifier that always predicts "normal" achieves 99% accuracy on a 1:99 imbalanced dataset — but catches zero cases.
Better metrics for imbalanced data:
- AUROC (area under ROC curve): threshold-independent; AUC = 0.5 is random, 1.0 is perfect
- AUPRC (area under precision-recall curve): more informative when positive class is rare; uninformative baseline is the positive rate
- Sensitivity/Specificity at a clinical threshold: often more clinically interpretable than overall AUC
- F1 score: harmonic mean of precision and recall
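A minimal sketch computing these metrics on simulated scores for a 5%-positive problem; the uninformative AUPRC baseline here is the positive rate, 0.05.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)   # ~5% positives
y_score = 0.3 * y_true + rng.random(1000)        # imperfect predicted scores

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))  # compare to 0.05 baseline
print("F1   :", f1_score(y_true, y_score > 0.8))           # requires picking a threshold
```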
Handling imbalance in training:
- Class weights: weight the minority class more heavily in the loss function
- Oversampling: SMOTE generates synthetic minority examples by interpolation
- Undersampling: randomly remove majority class examples
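A minimal sketch of the first two options. Class weighting needs only scikit-learn; SMOTE comes from the separate imbalanced-learn package, whose Pipeline applies oversampling to training folds only.

```python
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Option 1: reweight the loss so minority-class errors cost more
weighted = LogisticRegression(class_weight="balanced", max_iter=5000)

# Option 2: synthesize minority examples by interpolation; wrapping SMOTE in
# an imblearn Pipeline keeps the oversampling out of the test folds during CV
smote_pipe = ImbPipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=5000)),
])
# Either estimator can be passed to cross_val_score / cross_validate as usual.
```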
Overfitting in Small Biological Datasets
With 100 samples and 20,000 features, a model can memorize noise. Signs of overfitting:
- Large gap between training performance and CV performance
- Features selected by the model are biologically implausible (random genes, not known disease genes)
- Performance degrades on external validation
Defenses:
- Strong regularization (high λ in LASSO/Ridge)
- Feature filtering (variance filtering, highly variable gene selection) to reduce dimensionality before modeling
- Simple models (fewer parameters) — often a LASSO logistic regression outperforms a neural network on n=100 data
- Nested cross-validation for hyperparameter tuning (outer loop for performance estimation, inner loop for hyperparameter selection; sketched below)
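A minimal nested cross-validation sketch, reusing X and y: GridSearchCV is the inner loop that picks the LASSO penalty, and the outer cross_val_score estimates the performance of the whole tuning procedure. The grid values are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner loop: choose the regularization strength using the training portion only
inner = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc",
)

# Outer loop: each fold reruns the full search, so the reported AUC
# is not inflated by hyperparameter tuning
outer_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"nested CV AUC: {outer_auc.mean():.2f} +/- {outer_auc.std():.2f}")
```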
Feature Importance and Interpretability
Biological ML demands interpretability beyond most domains — a black-box model with no biological explanation won't be published or adopted clinically.
SHAP (SHapley Additive exPlanations): decomposes each prediction into additive contributions from each feature, grounded in game theory. For each sample, SHAP values show how much each feature pushed the prediction above or below the baseline.
SHAP is now standard for complex models (gradient boosting, random forests) in bioinformatics. Beeswarm plots show global feature importance and direction of effect simultaneously.
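A minimal SHAP sketch, assuming a fitted tree-ensemble `model` (random forest or gradient boosting) and a feature matrix `X_test`, ideally a pandas DataFrame so gene names appear on the plot.

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Beeswarm-style summary: global feature importance and direction of effect
shap.summary_plot(shap_values, X_test)
```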
Coefficient interpretation (LASSO): for linear models, coefficients directly give feature effects. A LASSO model with 50 selected genes and their coefficients is biologically interpretable and can be checked against known biology.
Enrichment analysis on top features: take the top 100 SHAP-ranked genes from a cancer subtype classifier; run pathway enrichment. Do the driving features correspond to known cancer biology? This is a standard sanity check and often yields biological insights.
Specific Applications in Biology
Clinical Outcome Prediction
Predicts patient outcomes (response, survival, toxicity) from molecular + clinical features:
- Training data: retrospective cohorts with known outcomes
- Features: clinical variables + genomic (mutation status, expression, CNV) + pathology
- Output: probability of response or risk score
- Validation: ideally prospective clinical trial
Example: Oncotype DX (a 21-gene expression assay) predicts chemotherapy benefit in breast cancer. The algorithm (developed on retrospective data, validated prospectively in the TAILORx trial) is now standard of care.
Variant Effect Prediction
Predicts whether a variant (especially a missense SNV) is pathogenic:
- Features: evolutionary conservation, protein structural context, biochemical properties of the amino acid change, population frequency
- Labels: known pathogenic/benign from ClinVar
- Tools: CADD, PolyPhen-2, SIFT, REVEL, AlphaMissense
The training-test leakage problem: pathogenicity predictors trained on ClinVar have a specific risk — variants in ClinVar were classified partly based on the same sequence properties the model uses. Benchmarking requires careful exclusion of variants present in the training data.
Drug Response Prediction
Predicts IC50 or AUC from cell line/patient features:
- GDSC/CCLE datasets: ~1,000 cancer cell lines with genomic profiles and drug response for hundreds of drugs
- Features: mutations, expression, CNV
- Challenge: cell line models don't always translate to patient tumors
Spatial Transcriptomics
Newer application: each spot in a tissue section has both a location and an expression profile. Spatial ML predicts cell type composition, identifies spatial expression patterns, and connects histology to molecular state.
Model Performance Benchmarking
Before claiming a new model is state-of-the-art, benchmark rigorously:
Baselines:
- Logistic regression with L2 regularization (strong baseline for high-dimensional data)
- Random forest with default parameters
- Existing published methods for the same task
Evaluation protocol:
- Fix all preprocessing and feature engineering before model selection
- Use nested cross-validation for hyperparameter tuning
- Report confidence intervals (bootstrap or CV variance)
- Test on truly held-out data, not just CV
Multiple comparisons: testing 20 models and reporting the best overfits to the validation set. Either use a held-out final test set or correct for model selection.
The goal is usually not to build the single best model, but to identify which features are biologically meaningful. A LASSO that selects 20 genes and achieves AUC 0.78 is often more valuable than a neural network achieving AUC 0.82, if those 20 genes implicate a specific pathway that can be validated experimentally and potentially targeted therapeutically.
Tools and Frameworks
| Task | Tools |
|---|---|
| General ML | scikit-learn, XGBoost, LightGBM |
| Deep learning | PyTorch, TensorFlow/Keras |
| Interpretability | SHAP, eli5, lime |
| Survival models | lifelines, scikit-survival |
| Expression-specific | limma, DESeq2, glmnet (R) |
| Genomic variant ML | CADD web tool, AlphaMissense, EVE |
| Clinical ML + reporting | mlr3 (R), scikit-learn Pipelines |
The scikit-learn Pipeline API deserves special mention: it chains preprocessing → feature selection → model into a single object that integrates cleanly with cross-validation, preventing data leakage and making model serialization cleaner.