Supervised machine learning — learning from labeled examples to predict new ones — has found extensive application in biology. Predicting whether a patient will respond to treatment. Classifying whether a genomic variant is pathogenic. Identifying cancer subtypes from gene expression profiles. Predicting protein structure from sequence. The problems are diverse, but the workflow and pitfalls are remarkably consistent.
This chapter focuses on applying supervised ML in biological contexts: what works, what doesn't, and the specific traps that biological data sets for the unwary.
The Supervised Learning Setup in Biology
Features (X): measurements on a sample. In biology, features are almost always high-dimensional:
- Gene expression: ~20,000 genes per sample
- Genome sequence: millions of SNPs per individual
- Clinical + molecular combined: hundreds to thousands of variables
- DNA or protein sequence: one-hot encoded
Labels (y): what you're predicting:
- Binary: responder/non-responder, pathogenic/benign, cancer/normal
- Multi-class: cancer subtype (LumA/LumB/HER2+/TNBC)
- Continuous (regression): drug IC50, protein stability, survival time
The fundamental constraint: biological datasets are almost always small relative to feature dimensionality. A typical clinical genomics study might have 200 patients and 50,000 features — a ratio that is unfavorable for most ML algorithms.
Model Selection for Biological Data
Regularized Logistic/Linear Regression
For high-dimensional, small-n data, regularized regression is often the best starting point:
LASSO (L1 regularization): adds a penalty proportional to |β|. Drives many coefficients exactly to zero — automatic feature selection. Final models contain tens to hundreds of features from an initial space of thousands. Interpretable; each selected feature has a coefficient.
Ridge (L2 regularization): adds a penalty proportional to β². Shrinks all coefficients but rarely to zero. Better when many features contribute small amounts (polygenic traits, where thousands of SNPs each contribute slightly).
Elastic Net: combines L1 and L2. Handles correlated features better than LASSO alone (LASSO picks one arbitrarily from a correlated group; Elastic Net tends to group them).
These are appropriate when:
- n << p (more features than samples)
- Interpretability is required (which genes/SNPs drive the prediction?)
- Linear decision boundaries are reasonable
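As a concrete starting point, here is a minimal scikit-learn sketch of an elastic-net-penalized logistic regression on simulated high-dimensional data; the dataset and the `l1_ratio` and `C` values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulate a small-n, high-p dataset: 200 samples, 5,000 features,
# of which only 20 are informative
X, y = make_classification(n_samples=200, n_features=5000,
                           n_informative=20, random_state=0)

# l1_ratio interpolates between Ridge (0.0) and LASSO (1.0);
# C is the inverse regularization strength (smaller = stronger penalty)
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.1, max_iter=5000)

aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {aucs.mean():.2f} +/- {aucs.std():.2f}")

# Sparsity: coefficients driven exactly to zero are effectively dropped
model.fit(X, y)
print(f"{np.sum(model.coef_ != 0)} of {X.shape[1]} features retained")
```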
Decision Trees and Random Forests
Random Forest: ensemble of decision trees, each trained on a bootstrapped sample with a random feature subset. Predictions are averaged across trees.
Advantages for biological data:
- Handles high dimensionality without explicit regularization
- Captures non-linear feature interactions
- Robust to irrelevant features
- Provides feature importance estimates
- Handles mixed data types (categorical + continuous)
Feature importance: impurity-based importance (mean decrease in Gini impurity) or permutation importance. Permutation importance is more reliable — it measures the actual performance drop when a feature is shuffled.
Caution with correlated features: when features are highly correlated (common in transcriptomics — co-regulated modules), tree-based importance is split among correlated features, making any single feature appear less important. SHAP values address this more rigorously.
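A minimal sketch of permutation importance for a random forest, reusing the simulated X and y from the elastic-net example above; the held-out split and the number of repeats are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each feature in the held-out split and measure the AUC drop;
# features whose shuffling hurts performance most are most important
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:10]
for i in top:
    print(f"feature {i}: mean AUC drop {result.importances_mean[i]:.4f}")
```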
Gradient Boosting (XGBoost, LightGBM)
Gradient boosting builds an ensemble of weak trees sequentially, each correcting errors of the previous. State-of-the-art for tabular data.
In biology, gradient boosting excels for:
- Clinical + molecular combined predictors
- Datasets with mixed feature types
- Non-linear interactions between clinical variables
The downside: prone to overfitting on small biological datasets. Requires careful regularization and early stopping.
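A minimal sketch of a regularized gradient-boosting classifier with early stopping, again reusing X and y. It assumes a recent xgboost release (1.6 or later) where `early_stopping_rounds` is a constructor argument; all hyperparameter values are illustrative.

```python
from sklearn.model_selection import train_test_split
import xgboost as xgb

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=2000,         # upper bound; early stopping picks the real number
    learning_rate=0.05,
    max_depth=3,               # shallow trees to limit overfitting
    subsample=0.8,             # row subsampling per tree
    colsample_bytree=0.8,      # feature subsampling per tree
    reg_lambda=1.0,            # L2 regularization on leaf weights
    early_stopping_rounds=50,  # stop when validation AUC stops improving
    eval_metric="auc",
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("best iteration:", model.best_iteration)
```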
Support Vector Machines (SVMs)
SVMs find the maximum-margin hyperplane separating classes. With the kernel trick (RBF kernel), they handle non-linear boundaries in high dimensions.
Historically widely used for microarray expression classification (the "SVM era" of bioinformatics). Now largely supplanted by random forests for tabular data, but still used in sequence-based prediction tasks (splice site recognition, binding site classification) where kernel design can encode biological knowledge.
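For completeness, a minimal RBF-kernel SVM sketch reusing X and y; scaling matters for SVMs, so the features are standardized inside a pipeline.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize within each training fold, then fit an RBF-kernel SVM;
# the roc_auc scorer uses the decision function, so probability=True is not needed
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(svm, X, y, cv=5, scoring="roc_auc").mean())
```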
Neural Networks and Deep Learning
Covered in the next chapter. For tabular biological data with small n, deep learning is often not competitive with gradient boosting or regularized regression. Deep learning becomes dominant when:
- Data is large (millions of sequences, whole slide images)
- Raw data structure matters (sequences, images — where CNNs or transformers can learn representations)
The Validation Trap: Biological Data Pitfalls
This is the most critical section for practitioners. Biological ML papers frequently report inflated performance due to validation mistakes.
Sample Size and Power
A training set of 50 samples and a test set of 20 samples gives very wide confidence intervals on any performance estimate. An AUC of 0.82 on 20 test samples might be indistinguishable from AUC 0.60 in a larger study.
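A minimal bootstrap sketch of how wide that interval actually is; `y_true` and `y_score` are simulated placeholders for your own test labels and predicted scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=20)                 # 20 test labels (placeholder)
y_score = 0.5 * y_true + rng.random(20)              # noisy scores (placeholder)

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
    if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC {roc_auc_score(y_true, y_score):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```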
Key question before model development: do you have enough samples for reliable validation? General guidelines:
- Binary classification: at minimum 50–100 events (cases) in the test set for reliable AUC estimation
- Rare classes: need enough positive examples to train — a dataset with 95% negatives and 5% positives requires class weighting or oversampling (SMOTE)
Cross-Validation Correctly
Standard k-fold cross-validation (k=5 or 10): split data into k folds, train on k-1 folds, test on the remaining fold, rotate.
Critical mistake: leakage through feature selection. A common error in genomics:
- Select the top 100 most differentially expressed genes across all samples
- Train a classifier using those 100 genes with cross-validation
This is wrong. The feature selection used all samples including the test fold, so the test data influenced which features were selected. The reported performance is optimistic.
Correct approach: the entire feature selection pipeline must be inside the cross-validation loop:
- In each CV fold: select features using only the training samples
- Apply the selected features to the test fold
- Never use test set information for any step that feeds into the model
In scikit-learn, this means using Pipeline to chain feature selection + model — the pipeline is then passed to cross_validate, ensuring correct separation.
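A minimal sketch of that pattern, reusing X and y: the scaler and the univariate filter (SelectKBest here, an illustrative choice) are re-fit inside every training fold, so the test fold never leaks into feature selection.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),                # fit on the training fold only
    ("select", SelectKBest(f_classif, k=100)),  # top 100 features, per fold
    ("clf", LogisticRegression(max_iter=5000)),
])

# cross_validate refits the entire pipeline inside each fold
scores = cross_validate(pipe, X, y, cv=5, scoring="roc_auc")
print(f"leak-free CV AUC: {scores['test_score'].mean():.2f}")
```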
Independent Test Set vs. Cross-Validation
For clinical biomarker development, cross-validation is not sufficient for claiming clinical validity. Cross-validation estimates generalization within the same cohort; truly independent validation requires:
- A separate cohort (different hospital, different country, different time period)
- Prospective data collected after model development (not retrospective)
Many biomarkers published with impressive cross-validation AUCs fail in independent validation — different patient populations, different sample handling protocols, different platforms.
A 2020 survey of 94 published cancer biomarker studies found that only 7% were validated in an independent cohort. The field has a replication problem. For your own work, build independent validation into the study design from the start — not as an afterthought when a reviewer asks.
Class Imbalance
Biological datasets are often imbalanced:
- Rare disease vs. common controls (1:100 ratio)
- Pathogenic vs. benign variants (pathogenic = minority class)
- Rare cell types in single-cell data
Why accuracy is misleading: a classifier that always predicts "normal" achieves 99% accuracy on a 1:99 imbalanced dataset — but catches zero cases.
Better metrics for imbalanced data:
- AUROC (area under ROC curve): threshold-independent; AUC = 0.5 is random, 1.0 is perfect
- AUPRC (area under precision-recall curve): more informative when positive class is rare; uninformative baseline is the positive rate
- Sensitivity/Specificity at a clinical threshold: often more clinically interpretable than overall AUC
- F1 score: harmonic mean of precision and recall
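A minimal sketch computing these metrics on simulated scores for a 5%-positive problem; the uninformative AUPRC baseline here is the positive rate, 0.05.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)   # ~5% positives
y_score = 0.3 * y_true + rng.random(1000)        # imperfect predicted scores

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))  # compare to 0.05 baseline
print("F1   :", f1_score(y_true, y_score > 0.8))           # requires picking a threshold
```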
Handling imbalance in training:
- Class weights: weight the minority class more heavily in the loss function
- Oversampling: SMOTE generates synthetic minority examples by interpolation
- Undersampling: randomly remove majority class examples
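A minimal sketch of the first two options. Class weighting needs only scikit-learn; SMOTE comes from the separate imbalanced-learn package, whose Pipeline applies oversampling to training folds only.

```python
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Option 1: reweight the loss so minority-class errors cost more
weighted = LogisticRegression(class_weight="balanced", max_iter=5000)

# Option 2: synthesize minority examples by interpolation; wrapping SMOTE in
# an imblearn Pipeline keeps the oversampling out of the test folds during CV
smote_pipe = ImbPipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=5000)),
])
# Either estimator can be passed to cross_val_score / cross_validate as usual.
```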
Overfitting in Small Biological Datasets
With 100 samples and 20,000 features, a model can memorize noise. Signs of overfitting:
- Large gap between training performance and CV performance
- Features selected by the model are biologically implausible (random genes, not known disease genes)
- Performance degrades on external validation
Defenses:
- Strong regularization (high λ in LASSO/Ridge)
- Feature filtering (variance filtering, highly variable gene selection) to reduce dimensionality before modeling
- Simple models (fewer parameters) — often a LASSO logistic regression outperforms a neural network on n=100 data
- Nested cross-validation for hyperparameter tuning (outer loop for performance estimation, inner loop for hyperparameter selection; sketched below)
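A minimal nested cross-validation sketch, reusing X and y: GridSearchCV is the inner loop that picks the LASSO penalty, and the outer cross_val_score estimates the performance of the whole tuning procedure. The grid values are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner loop: choose the regularization strength using the training portion only
inner = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc",
)

# Outer loop: each fold reruns the full search, so the reported AUC
# is not inflated by hyperparameter tuning
outer_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"nested CV AUC: {outer_auc.mean():.2f} +/- {outer_auc.std():.2f}")
```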
Feature Importance and Interpretability
Biological ML demands interpretability beyond most domains — a black-box model with no biological explanation won't be published or adopted clinically.
SHAP (SHapley Additive exPlanations): decomposes each prediction into additive contributions from each feature, grounded in game theory. For each sample, SHAP values show how much each feature pushed the prediction above or below the baseline.
SHAP is now standard for complex models (gradient boosting, random forests) in bioinformatics. Beeswarm plots show global feature importance and direction of effect simultaneously.
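A minimal SHAP sketch, assuming a fitted tree-ensemble `model` (random forest or gradient boosting) and a feature matrix `X_test`, ideally a pandas DataFrame so gene names appear on the plot.

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Beeswarm-style summary: global feature importance and direction of effect
shap.summary_plot(shap_values, X_test)
```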
Coefficient interpretation (LASSO): for linear models, coefficients directly give feature effects. A LASSO model with 50 selected genes and their coefficients is biologically interpretable and can be checked against known biology.
Enrichment analysis on top features: take the top 100 SHAP-ranked genes from a cancer subtype classifier; run pathway enrichment. Do the driving features correspond to known cancer biology? This is a standard sanity check and often yields biological insights.
Specific Applications in Biology
Clinical Outcome Prediction
Predicts patient outcomes (response, survival, toxicity) from molecular + clinical features:
- Training data: retrospective cohorts with known outcomes
- Features: clinical variables + genomic (mutation status, expression, CNV) + pathology
- Output: probability of response or risk score
- Validation: ideally prospective clinical trial
Example: Oncotype DX (a 21-gene expression assay) predicts chemotherapy benefit in breast cancer. The algorithm (developed on retrospective data, validated prospectively in the TAILORx trial) is now standard of care.
Variant Effect Prediction
Predicts whether a variant (especially a missense SNV) is pathogenic:
- Features: evolutionary conservation, protein structural context, biochemical properties of the amino acid change, population frequency
- Labels: known pathogenic/benign from ClinVar
- Tools: CADD, PolyPhen-2, SIFT, REVEL, AlphaMissense
The training-test leakage problem: pathogenicity predictors trained on ClinVar have a specific risk — variants in ClinVar were classified partly based on the same sequence properties the model uses. Benchmarking requires careful exclusion of variants present in the training data.
Drug Response Prediction
Predicts IC50 or AUC from cell line/patient features:
- GDSC/CCLE datasets: ~1,000 cancer cell lines with genomic profiles and drug response for hundreds of drugs
- Features: mutations, expression, CNV
- Challenge: cell line models don't always translate to patient tumors
Spatial Transcriptomics
Newer application: each spot in a tissue section has both a location and an expression profile. Spatial ML predicts cell type composition, identifies spatial expression patterns, and connects histology to molecular state.
Model Performance Benchmarking
Before claiming a new model is state-of-the-art, benchmark rigorously:
Baselines:
- Logistic regression with L2 regularization (strong baseline for high-dimensional data)
- Random forest with default parameters
- Existing published methods for the same task
Evaluation protocol:
- Fix all preprocessing and feature engineering before model selection
- Use nested cross-validation for hyperparameter tuning
- Report confidence intervals (bootstrap or CV variance)
- Test on truly held-out data, not just CV
Multiple comparisons: testing 20 models and reporting the best overfits to the validation set. Either use a held-out final test set or correct for model selection.
The goal is usually not to build the single best model, but to identify which features are biologically meaningful. A LASSO that selects 20 genes and achieves AUC 0.78 is often more valuable than a neural network achieving AUC 0.82, if those 20 genes implicate a specific pathway that can be validated experimentally and potentially targeted therapeutically.
Tools and Frameworks
| Task | Tools |
|---|---|
| General ML | scikit-learn, XGBoost, LightGBM |
| Deep learning | PyTorch, TensorFlow/Keras |
| Interpretability | SHAP, eli5, lime |
| Survival models | lifelines, scikit-survival |
| Expression-specific | limma, DESeq2, glmnet (R) |
| Genomic variant ML | CADD web tool, AlphaMissense, EVE |
| Clinical ML + reporting | mlr3 (R), scikit-learn Pipelines |
The scikit-learn Pipeline API deserves special mention: it chains preprocessing → feature selection → model into a single object that integrates cleanly with cross-validation, preventing data leakage and making model serialization cleaner.