Deep learning has transformed biology more profoundly than almost any other field outside computer vision and NLP. AlphaFold solved a 50-year-old problem. Sequence models predict the effect of any variant. Foundation models trained on billions of base pairs are learning the grammar of the genome. Medical imaging AI reads pathology slides at radiologist-level accuracy.
This chapter covers how deep learning architecture choices map to biological data types, the landmark models that define the field, and what the current generation of biological foundation models can and cannot do.
Why Deep Learning Works for Biological Sequence Data
Biological sequences — DNA, RNA, and protein — are the natural domain of deep learning:
- Long-range dependencies: a splice site 10,000 bp from a coding region affects how that region is spliced. A distal enhancer 500 kb away regulates gene expression. CNNs and transformers can capture these dependencies.
- Hierarchical feature learning: nucleotides → TF binding motifs → regulatory modules → cell-type-specific programs. Deep architectures naturally learn hierarchical representations.
- Scale: billions of base pairs of sequence are available. Transformer models that require massive training data fit naturally.
- Discrete alphabet: DNA has 4 letters; proteins have 20. Sequence models built for language (which also operate on discrete tokens) map cleanly to these alphabets.
Convolutional Neural Networks on Genomic Sequences
The first wave of biological deep learning applied CNNs to fixed-length genomic sequences — the same architecture that revolutionized image classification, adapted to 1D sequence data.
Architecture:
One-hot encoded sequence (L × 4, where L = sequence length)
↓
Conv1D filters (capture k-mer motifs, analogous to edge detectors)
↓
Pooling (aggregate across local windows)
↓
Stacked convolutional layers (hierarchical feature composition)
↓
Dense layers
↓
Prediction (TF binding, chromatin accessibility, splicing)
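A minimal sketch of the first two stages of this pipeline (one-hot encoding and a motif-detecting Conv1D filter) in plain Python. A real model would use a framework like PyTorch, and the TATA filter below is an illustrative toy, not a trained weight:

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as an L x 4 matrix."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d_scan(x, kernel):
    """Valid 1D convolution: slide a k x 4 filter along the sequence."""
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(x[i + j][c] * kernel[j][c] for j in range(k) for c in range(4))
        out.append(s)
    return out

def max_pool(activations):
    """Global max pooling: strongest motif match anywhere in the window."""
    return max(activations)

# Toy filter that "detects" the k-mer TATA (weight 1 on each matching base).
tata_filter = one_hot("TATA")

x = one_hot("GGTATACC")
acts = conv1d_scan(x, tata_filter)
print(max_pool(acts))  # 4.0: a perfect TATA match at position 2
```

Trained CNN filters play exactly this role, except their weights are learned from data rather than hand-set, so they converge toward informative motifs (PWM-like patterns) automatically.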
DeepBind (2015): predicted TF–DNA binding specificity from sequence; matched or exceeded PWM-based methods and discovered non-linear binding rules.
DeepSEA (2015): predicted 919 chromatin features (DNase, histone marks, TF binding) from 1 kb sequence windows. Trained on ENCODE data. Enabled in silico saturation mutagenesis — scoring every possible single-nucleotide change for its effect on regulatory activity.
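The in silico saturation mutagenesis idea is simple to sketch. Here `model_score` is a hypothetical stand-in for a trained predictor such as DeepSEA; for illustration it just measures GC content:

```python
def model_score(seq):
    """Hypothetical stand-in for a trained model; scores GC content."""
    return sum(base in "GC" for base in seq) / len(seq)

def saturation_mutagenesis(seq, score=model_score):
    """Return (position, ref, alt, delta_score) for every single-nt variant."""
    ref_score = score(seq)
    effects = []
    for i, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutant = seq[:i] + alt + seq[i + 1:]
            effects.append((i, ref, alt, score(mutant) - ref_score))
    return effects

effects = saturation_mutagenesis("ACGT")
print(len(effects))  # 12 variants: 4 positions x 3 alternate bases
```

With a real model plugged in as `score`, the same loop produces a full variant-effect map of the window without a single wet-lab experiment.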
Enformer (2021): attention-based architecture predicting gene expression directly from 200 kb of surrounding sequence. Captures distal enhancer–promoter interactions that shorter-context models miss. Currently the best model for predicting regulatory activity from sequence.
Splicing Prediction
Splicing is governed by short sequence motifs (splice sites, branch points, ESEs, ESSs) acting in complex combinations — a natural target for deep learning.
SpliceAI (2019): deep residual network predicting splice site usage from 10 kb of sequence context. Validated against known splice-altering variants; now used in clinical interpretation. An independent test: SpliceAI predictions correlate with patient phenotypes for variants of unknown significance near splice sites.
SpliceAI enabled genome-wide prediction of splicing consequences for all possible SNVs — the first clinically adopted deep learning model in variant interpretation.
Protein Structure: AlphaFold
AlphaFold 2 (2021) is arguably the most significant scientific breakthrough produced by deep learning. It solved the protein structure prediction problem — predicting 3D structure from amino acid sequence — with accuracy matching experimental methods.
The architecture: AlphaFold 2 combines:
- Multiple sequence alignment (MSA) processing: evolutionary covariation signals from thousands of homologous sequences encode structural constraints
- Pair representation: pairwise relationships between all residue pairs
- Evoformer: a transformer-like module that iteratively updates sequence and pair representations with attention across both dimensions
- Structure module: explicitly constructs 3D coordinates using equivariant neural network operations (frames for each residue)
The key insight: co-evolutionary information is structural information. If two residues are in contact in 3D, mutations in one residue are compensated by mutations in the other across evolution. The pattern of compensatory mutations in the MSA encodes 3D contacts.
The AlphaFold Database now contains predicted structures for >200 million proteins. In practice:
- Look up any UniProt accession → get a structure prediction in seconds
- Per-residue confidence scores (pLDDT) indicate reliability (> 70 = reliable; < 50 = disordered or uncertain)
- PAE (predicted aligned error) maps indicate domain-relative confidence — useful for multi-domain proteins and protein–protein interfaces
AlphaFold 3 (2024): extended to protein–protein, protein–nucleic acid, and protein–small molecule complexes. Enables in silico prediction of drug–target and protein–nucleic acid interactions.
Limitations: AlphaFold predicts the "ground state" structure well but has limitations for:
- Intrinsically disordered regions (low pLDDT correctly indicates uncertainty)
- Conformational changes (binding-induced fit)
- Rare protein families with few homologs in the MSA (sparse evolutionary information)
- Novel de novo designed proteins (no evolutionary history)
When interpreting AlphaFold structures, always check the per-residue pLDDT score (colored on the structure in the AF-DB viewer). Blue regions (pLDDT > 90) are high-confidence; yellow/orange (50–70) are less reliable; red (< 50) should be treated as structurally uncharacterized. Low-confidence regions often correspond to biologically meaningful disordered regions — not prediction failures.
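Checking confidence programmatically is straightforward, since AlphaFold writes pLDDT into the B-factor column of its PDB output. The two ATOM records below are made-up minimal examples, not real coordinates:

```python
# AlphaFold stores per-residue pLDDT in the B-factor field (columns 61-66)
# of its PDB files. The records below are invented for illustration.
PDB = """\
ATOM      1  CA  MET A   1      11.104   6.134  -6.504  1.00 92.50           C
ATOM      2  CA  GLY A   2      12.560   7.800  -5.100  1.00 41.20           C
"""

def residue_plddt(pdb_text):
    """Map residue number -> pLDDT, taken from CA-atom B-factors."""
    plddt = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])
            plddt[resnum] = float(line[60:66])
    return plddt

def confident_residues(plddt, cutoff=70.0):
    """Keep only residues at or above the pLDDT confidence cutoff."""
    return {r: v for r, v in plddt.items() if v >= cutoff}

scores = residue_plddt(PDB)
print(confident_residues(scores))  # {1: 92.5}; residue 2 is low-confidence
```

A filter like this is a reasonable first step before docking, interface analysis, or any downstream use that assumes reliable coordinates.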
Protein Language Models (PLMs)
Protein language models are transformers trained on large collections of protein sequences using masked language modeling — the same approach as BERT for natural language.
ESM-2 (Meta AI): trained on 250 million sequences from UniRef50. The representations (embeddings) capture structural and functional properties without explicit structure training.
Applications:
- Variant effect prediction: the evolutionary language modeling score predicts pathogenicity of missense variants.
- Structure prediction: ESMFold achieves structure prediction from a single sequence, without an MSA.
- Protein function prediction: embeddings cluster by function in embedding space; nearest neighbors in embedding space are often functionally similar proteins.
- Protein engineering: scoring all amino acid substitutions at a position identifies those likely to be tolerated — guiding directed evolution and design experiments.
ProtTrans, ESM-1v, EVE: related sequence models; EVE models epistasis (combined effects of multiple mutations) and predicts clinical pathogenicity.
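A sketch of the masked-marginal scoring heuristic used with models like ESM-1v: a variant is scored as the difference in log-probability between the mutant and wild-type residue at the mutated position. The probability table here is invented; a real workflow would take it from the PLM's softmax output:

```python
import math

AAS = "ACDEFGHIKLMNPQRSTVWY"

def variant_score(log_probs, position, wt, mut):
    """Masked-marginal score: log p(mutant aa) - log p(wild-type aa).
    Higher (less negative) = mutation more tolerated by the model."""
    return log_probs[position][mut] - log_probs[position][wt]

# Hypothetical PLM output for a 2-residue protein: near-uniform at
# position 0, strongly conserved alanine at position 1.
log_probs = [
    {aa: math.log(1 / 20) for aa in AAS},
    {aa: math.log(0.9) if aa == "A" else math.log(0.1 / 19) for aa in AAS},
]

print(variant_score(log_probs, 0, "A", "V"))  # 0.0: interchangeable here
print(variant_score(log_probs, 1, "A", "V"))  # strongly negative: deleterious
```

The same score, computed per position across a whole protein, yields a zero-shot pathogenicity or tolerability map with no task-specific training.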
DNA/RNA Foundation Models
The same transformer architecture trained on genomic sequences learns the regulatory grammar of the genome.
Nucleotide Transformer (2023): transformer trained on 2,500 genomes. Learned representations generalize to downstream regulatory tasks without additional training.
HyenaDNA: convolution-based model that handles sequences up to 1 million bp — capturing very long-range genomic dependencies that transformers (with O(n²) attention) cannot scale to.
DNABERT-2: BERT-style model on genomic sequences; fine-tuned for promoter prediction, TF binding, and chromatin accessibility.
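As a concrete example of how raw genomic sequence reaches a transformer, the original DNABERT used overlapping k-mer tokenization (DNABERT-2 later switched to byte-pair encoding). A sketch:

```python
def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokenization: each k-base window is one token."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ACGTACGT", k=6))  # ['ACGTAC', 'CGTACG', 'GTACGT']
```

Each token is then mapped to an integer ID from a 4^k vocabulary before entering the model, exactly as words are in NLP.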
Genomic language models can predict:
- TF binding sites
- Chromatin accessibility (ATAC-seq peaks)
- Enhancer activity
- Effects of variants on regulatory activity
- Gene expression from sequence
RNA language models (SpliceBERT, RNABERT) target splicing prediction and RNA secondary structure.
Medical Imaging: Computational Pathology
Histopathology slides contain rich information but are large (gigapixel images) and require expert interpretation. Deep learning has transformed this:
Patch-based CNNs: slice whole slide images into patches (256×256 px); classify each patch; aggregate predictions. Used for:
- Tumor vs. normal classification
- Cancer subtype classification
- Grading (Gleason score for prostate cancer)
Multiple instance learning (MIL): treat the slide as a "bag" of patches; learn which patches are informative for the slide-level label without patch-level annotations. The standard approach for weakly supervised pathology.
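Attention-based MIL pooling (in the spirit of the widely used attention-MIL formulation) can be sketched in a few lines. The patch embeddings and attention weight vector below are toy numbers, not trained parameters:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_mil_pool(patch_embeddings, w):
    """Slide embedding = attention-weighted sum of patch embeddings."""
    # Unnormalized attention score per patch: dot(embedding, w).
    scores = [sum(e * wi for e, wi in zip(emb, w)) for emb in patch_embeddings]
    attn = softmax(scores)
    dim = len(patch_embeddings[0])
    slide = [sum(a * emb[d] for a, emb in zip(attn, patch_embeddings))
             for d in range(dim)]
    return slide, attn

# Three patches with 2-d embeddings; patch 2 looks "tumor-like".
patches = [[0.1, 0.0], [0.2, 0.1], [3.0, 2.0]]
w = [1.0, 1.0]
slide_emb, attn = attention_mil_pool(patches, w)
print(max(attn) == attn[2])  # True: the informative patch dominates
```

The attention weights double as an interpretability tool: overlaying them on the slide highlights which regions drove the slide-level prediction.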
Foundation models for pathology (UNI, CONCH, PLIP): vision transformers pretrained on millions of pathology images. Fine-tuned for specific tasks with minimal labeled data.
Multi-modal integration: combining gene expression (from spatial transcriptomics) with histology (H&E images) enables predicting molecular subtypes directly from images, or using images to infer spatial gene expression at scale.
Graph Neural Networks for Molecular Biology
Molecules, interaction networks, and metabolic pathways are naturally represented as graphs. Graph neural networks (GNNs) operate directly on graph-structured data.
Molecular property prediction:
- Atoms = nodes; bonds = edges
- GNN learns atom-level representations by aggregating neighborhood information
- Graph-level readout predicts molecular properties (solubility, toxicity, binding affinity)
- Applications: ADMET prediction in drug discovery, toxicity screening, reaction outcome prediction
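One round of the core GNN operation, neighborhood aggregation, on a toy molecular graph. Real GNNs apply learned weight matrices and nonlinearities at each step; this sketch shows only mean aggregation and a sum readout:

```python
def message_passing_step(features, edges):
    """Update each node to the mean of itself and its bonded neighbors."""
    neighbors = {n: [n] for n in features}          # include a self-loop
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for n, nbrs in neighbors.items():
        dim = len(features[n])
        updated[n] = [sum(features[m][d] for m in nbrs) / len(nbrs)
                      for d in range(dim)]
    return updated

def graph_readout(features):
    """Graph-level representation: sum over node features."""
    dim = len(next(iter(features.values())))
    return [sum(f[d] for f in features.values()) for d in range(dim)]

# Water-like toy graph: O bonded to two H atoms; feature = [is_oxygen].
feats = {"O": [1.0], "H1": [0.0], "H2": [0.0]}
bonds = [("O", "H1"), ("O", "H2")]
h1 = message_passing_step(feats, bonds)
print(h1["H1"])  # [0.5]: the hydrogen now "sees" its oxygen neighbor
```

Stacking several such steps lets information flow across multiple bonds, and the readout vector feeds a classifier or regressor for properties like solubility or toxicity.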
Protein–protein interaction networks: GNNs on PPI graphs predict gene essentiality, drug targets, and disease gene candidates.
Cell graph models: in spatial transcriptomics, each cell is a node with expression features; neighboring cells are connected by edges. GNNs predict cell state from neighborhood context.
Sequence-to-Function: The Central Paradigm
Many modern biological deep learning tasks follow the same pattern:
Sequence → [Deep Learning Model] → Function
Examples:
- DNA sequence → chromatin accessibility
- Pre-mRNA sequence → splicing
- Protein sequence → 3D structure
- Protein sequence → stability / binding affinity
- RNA sequence → secondary structure
- DNA sequence → TF binding
The power of this paradigm: once a model is trained, you can predict the effect of any sequence change in silico, without experiments. This enables:
- Saturation mutagenesis: score all single substitutions at every position
- Inverse design: search sequence space for sequences with desired properties
- Variant interpretation: predict the functional effect of any observed variant
In silico directed evolution: use the sequence-to-function model as an objective function; optimize with gradient descent or evolutionary algorithms to find sequences with maximum predicted activity. AlphaFold and protein language models have enabled design of new proteins and enzymes that function in wet-lab validation.
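A sketch of in silico directed evolution as a hill-climbing loop. `fitness` is a hypothetical stand-in for a trained sequence-to-function model; here it simply rewards matches to an arbitrary target motif:

```python
import random

TARGET = "MKVL"  # arbitrary toy optimum standing in for "high activity"

def fitness(seq):
    """Hypothetical model score: number of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))

def hill_climb(seq, alphabet="ACDEFGHIKLMNPQRSTVWY", steps=1000, seed=0):
    """Propose random point mutations; keep neutral or improving ones."""
    rng = random.Random(seed)
    best, best_fit = seq, fitness(seq)
    for _ in range(steps):
        i = rng.randrange(len(best))
        mutant = best[:i] + rng.choice(alphabet) + best[i + 1:]
        f = fitness(mutant)
        if f >= best_fit:
            best, best_fit = mutant, f
    return best, best_fit

seq, fit = hill_climb("AAAA")
print(seq, fit)  # climbs toward the target motif
```

Real campaigns swap in a PLM or activity predictor for `fitness` and often use smarter proposal schemes (model-guided mutation choice, evolutionary algorithms, or gradients through a differentiable surrogate), but the accept-if-better loop is the same.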
Training Considerations for Biological Deep Learning
Data Splits for Sequence Data
Standard random train/test splits are invalid for biological sequences — homologous sequences in train and test lead to data leakage.
For protein models: split by sequence identity. The test set should share <30% identity with any training sequence. Use tools like MMseqs2 for clustering.
For genomic models: split by chromosome. Train on chr1–chr18; validate on chr19; test on chr20–chr22 and chrX. This ensures no positional overlap between train and test.
Time-based splits: for clinical data or variant databases, split by date of deposition to simulate realistic prospective evaluation.
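The chromosome-based split above can be sketched directly. The example records are invented; real pipelines partition BED- or VCF-style tables the same way:

```python
# Hold out whole chromosomes rather than splitting examples at random,
# so no genomic position appears in more than one partition.
VALID_CHROMS = {"chr19"}
TEST_CHROMS = {"chr20", "chr21", "chr22", "chrX"}

def split_by_chromosome(examples):
    """Partition (chrom, start, label) records into train/valid/test."""
    train, valid, test = [], [], []
    for ex in examples:
        chrom = ex[0]
        if chrom in TEST_CHROMS:
            test.append(ex)
        elif chrom in VALID_CHROMS:
            valid.append(ex)
        else:
            train.append(ex)
    return train, valid, test

examples = [("chr1", 100, 1), ("chr19", 200, 0), ("chr21", 300, 1)]
train, valid, test = split_by_chromosome(examples)
print(len(train), len(valid), len(test))  # 1 1 1
```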
Transfer Learning
Most biological deep learning leverages pretrained models:
- Pretrain on massive unlabeled sequence data (self-supervised)
- Fine-tune on smaller labeled datasets for specific tasks
This is especially important because labeled biological data is scarce (functional annotations require expensive experiments) while unlabeled sequence data is abundant.
Zero-shot prediction: pretrained protein language models can predict variant effects without any task-specific fine-tuning — purely from evolutionary language modeling.
Uncertainty Quantification
Biological applications require knowing when the model doesn't know:
- AlphaFold provides pLDDT confidence scores
- Ensemble models estimate uncertainty from prediction variance
- Conformal prediction provides coverage-guaranteed prediction sets
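The ensemble approach can be sketched with three stand-in "models" (here, hard-coded predictors): disagreement across members flags inputs the ensemble is uncertain about:

```python
import statistics

def ensemble_predict(models, x):
    """Return (mean prediction, std across ensemble members)."""
    preds = [m(x) for m in models]
    return statistics.mean(preds), statistics.stdev(preds)

# Three hypothetical models that agree on input "A" but not on "B".
models = [
    lambda x: {"A": 0.90, "B": 0.20}[x],
    lambda x: {"A": 0.92, "B": 0.80}[x],
    lambda x: {"A": 0.91, "B": 0.50}[x],
]

mean_a, std_a = ensemble_predict(models, "A")
mean_b, std_b = ensemble_predict(models, "B")
print(std_a < std_b)  # True: disagreement flags the uncertain input
```

In practice the members are independently trained networks (different seeds or data folds), and high-variance predictions are routed to abstention or human review.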
In clinical applications, uncertainty quantification is increasingly required by regulatory frameworks.
Limitations and Honest Assessment
Data quality over architecture: most performance gains in biological DL come from better data curation, not novel architectures. A simple model trained on clean data beats a sophisticated architecture trained on a poorly curated dataset.
Distribution shift: models trained on cancer cell lines may not predict patient tumors. Models trained on one tissue may not generalize to another. Biological context matters enormously.
Causation vs. correlation: a model predicting chromatin accessibility from sequence may learn that GC-rich regions are accessible (correlation) rather than that specific TF motifs drive accessibility (causation). Perturbation experiments are required to establish causality.
Clinical gap: even highly accurate predictive models face regulatory, ethical, and practical barriers to clinical deployment. FDA-cleared AI/ML medical devices require prospective clinical validation — a much higher bar than a published paper with high AUC.
If your dataset has fewer than a few thousand samples and the features are well-characterized (known biomarkers, clinical variables), gradient boosting or regularized regression will almost always outperform deep learning. Deep learning shines when: (1) you have raw data (sequences, images) where features need to be learned, (2) data is abundant (hundreds of thousands of examples or more), and (3) you can leverage pretrained representations through transfer learning.
Landmark Models Reference
| Model | Year | Task | Architecture |
|---|---|---|---|
| DeepSEA | 2015 | Chromatin features from sequence | CNN |
| SpliceAI | 2019 | Splice site prediction | Deep residual network |
| AlphaFold 2 | 2021 | Protein structure prediction | Evoformer + structure module |
| Enformer | 2021 | Gene expression from sequence | Transformer (attention) |
| ESM-2 | 2022 | Protein language model | Transformer (BERT-style) |
| AlphaFold 3 | 2024 | Biomolecular complex structure | Diffusion + transformer |
| Nucleotide Transformer | 2023 | Genomic sequence model | Transformer |
These models represent the current state of the art and are actively used in research and increasingly in clinical pipelines.