Part 6·6.1·12 min read

Mutations and Variants

Mutations are changes to the DNA sequence — the raw material of evolution, the drivers of cancer, and the targets of clinical genomics.

mutations · variants · genomics · SNPs · indels

Every human genome differs from the reference at approximately 4–5 million positions. Some differences cause disease. Most are harmless. A few confer advantages. Understanding the types of mutations, how they arise, and how to classify their effects is foundational to clinical genomics, cancer biology, and evolutionary analysis.

This is not abstract taxonomy. When you run a variant caller on a tumor-normal pair, every output line is a mutation described by these categories. When you interpret a clinical variant report, every entry is classified by this framework. Knowing the types and effects of mutations determines what questions you can ask and what tools you use to answer them.

Mutation vs. Variant: The Terminology

These terms are often used interchangeably, but in clinical genomics they have distinct contexts:

Mutation implies a pathological change — a variant known to cause disease. It's a clinical judgment.

Variant is the neutral term for any position that differs from the reference. Variants are further classified by evidence:

  • Pathogenic: known to cause disease
  • Likely pathogenic: strong evidence for pathogenicity
  • Variant of uncertain significance (VUS): insufficient evidence
  • Likely benign: probably harmless
  • Benign: known to have no disease effect

The distinction matters for communication with clinicians and patients. Everything in a genome is a variant; very few are mutations in the clinical sense.

Types of Variants by Size and Mechanism

Single Nucleotide Variants (SNVs)

A single base change, and the most common type of genetic variation. Common SNVs, conventionally those found at >1% frequency in the population, are called SNPs (single nucleotide polymorphisms). Most disease-associated variants discovered in GWAS are SNPs.

SNVs in coding regions are classified by their effect on the protein:

Synonymous (silent): The nucleotide changes but the codon still encodes the same amino acid (due to codon degeneracy). No amino acid change. Often assumed to be neutral — but can affect splicing, codon usage, or mRNA stability.

Missense: The nucleotide change causes a different amino acid to be incorporated. Effect on protein function depends on the amino acid properties and the position. A conservative substitution (e.g., Leu → Ile, both hydrophobic) is less likely to be damaging than a radical one (e.g., Arg → Glu, charge reversal).

Nonsense: The nucleotide change creates a premature stop codon (UAA, UAG, UGA). Produces a truncated protein — almost always loss-of-function if the stop codon is early in the coding sequence. The truncated mRNA is often degraded by NMD (nonsense-mediated decay).

Splice site: Occurs at or next to the consensus splice site sequence (GT at the 5' splice site, AG at the 3' splice site; these dinucleotides sit at the intron ends, so such variants lie at the exon-intron boundary rather than inside a codon). Disrupts splicing → exon skipping, intron retention, or cryptic splice activation. Often as damaging as nonsense mutations.
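The codon-level categories above (synonymous, missense, nonsense) can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an annotation tool: it assumes the codon is given on the coding strand in DNA letters (T instead of U) and that the reading frame is already known. The function name `classify_snv` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change inside one codon.

    ref_codon: reference codon on the coding strand (DNA letters).
    position:  0-based index of the changed base within the codon.
    alt_base:  the new base at that position.
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    if alt_codon == ref_codon:
        raise ValueError("alt_base equals the reference base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


print(classify_snv("CTG", 2, "A"))  # Leu -> Leu
print(classify_snv("GTG", 1, "A"))  # Val -> Glu
print(classify_snv("CGA", 0, "T"))  # Arg -> stop
```

Splice-site variants are deliberately left out of the sketch, because they cannot be detected from a codon alone; they require the exon boundaries of the transcript.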

Insertions and Deletions (Indels)

In-frame indels: Length divisible by 3 → inserts or deletes amino acids without disrupting the reading frame. Typically less severe than frameshift indels. May delete a critical residue or domain.

Frameshift indels: Length not divisible by 3 → shifts the reading frame of all downstream codons. Produces a completely different amino acid sequence after the indel, usually followed quickly by a premature stop codon. Almost always loss-of-function.
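The divisible-by-3 rule is simple enough to express directly. The sketch below assumes VCF-style REF and ALT alleles and an indel that lies entirely within a coding exon; the function name `classify_indel` is hypothetical.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding indel as in-frame or frameshift from its alleles."""
    length_change = len(alt) - len(ref)
    if length_change == 0:
        raise ValueError("not an indel: REF and ALT have equal length")
    kind = "insertion" if length_change > 0 else "deletion"
    # A net change that is a multiple of 3 preserves the reading frame.
    frame = "in-frame" if length_change % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(classify_indel("ACTT", "A"))  # 3 bases removed
print(classify_indel("A", "AG"))    # 1 base added
```

An indel that spans an exon-intron boundary would need transcript-aware handling, which this sketch does not attempt.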

Structural Variants (SVs)

Large-scale DNA rearrangements affecting hundreds of base pairs to megabases:

  • Copy Number Variants (CNVs): duplications or deletions of chromosomal segments. Gene amplification (extra copies → protein overexpression) and deletion (fewer copies → reduced expression or loss-of-function) are both common in cancer.
  • Inversions: a segment is reversed in orientation
  • Translocations: a segment moves to a different chromosome (or a different position on the same chromosome). Oncogenic translocations create fusion genes, such as BCR-ABL in CML (t(9;22)). Other rearrangements can do the same: EML4-ALK in lung cancer arises from an inversion within chromosome 2.
  • Mobile element insertions: retrotransposons or other mobile elements inserting into genes

Tandem Repeats

Short sequence motifs repeated in tandem. Microsatellites (2–6 bp repeats) are highly polymorphic and prone to replication slippage errors. Trinucleotide repeat expansion is the mechanism of Huntington's disease (CAG expansion in HTT), Fragile X syndrome (CGG expansion in FMR1), and other neurological disorders.
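Counting uninterrupted copies of a motif is the basic operation behind repeat-expansion analysis. The following is a minimal sketch using a regular expression; the function name `longest_repeat_run` and the toy sequence are illustrative assumptions, and real repeat genotyping from short reads needs dedicated tools because reads often do not span the whole repeat.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of uninterrupted tandem copies of motif."""
    motif = motif.upper()
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [
        len(match.group(0)) // len(motif)
        for match in pattern.finditer(sequence.upper())
    ]
    return max(runs, default=0)


# Toy sequence: five CAG copies, an interrupting CAA, then one more CAG.
toy = "GGC" + "CAG" * 5 + "CAA" + "CAG" + "CCG"
print(longest_repeat_run(toy, "CAG"))
```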

Mutation Mechanisms

Replication Errors

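DNA polymerase occasionally incorporates the wrong base. Base selection alone leaves roughly one error per 10⁴–10⁵ bases; proofreading by the polymerase's 3'→5' exonuclease lowers this to about 1/10⁷. Mismatch repair then catches most remaining errors, bringing the overall rate to roughly 1/10⁹–1/10¹⁰ per base per replication. The few that escape become permanent mutations.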
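A quick back-of-the-envelope calculation shows what such a rate means per cell division. The numbers below are illustrative assumptions rather than measurements: a diploid human genome of about 6 × 10⁹ base pairs and a final error rate somewhere between 10⁻¹⁰ and 10⁻⁹ per base per replication.

```python
# Assumed values for illustration only.
DIPLOID_GENOME_BP = 6e9


def expected_new_errors(error_rate_per_base: float,
                        genome_bp: float = DIPLOID_GENOME_BP) -> float:
    """Expected number of uncorrected replication errors in one division."""
    return error_rate_per_base * genome_bp


for rate in (1e-9, 1e-10):
    errors = expected_new_errors(rate)
    print(f"rate {rate:.0e}: ~{errors:.1f} errors per cell division")
```

Under these assumptions, each division leaves somewhere between a fraction of one and a handful of new errors, which is why mutations accumulate steadily over many divisions even with highly accurate replication.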

Spontaneous Chemical Damage

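  • Deamination: cytosine spontaneously loses its amino group → uracil, which pairs like thymine if left unrepaired. Creates C→T transitions. These are most common at CpG dinucleotides, where the cytosine is usually methylated: 5-methylcytosine deaminates directly to thymine, which repair handles less efficiently than uracil. This is the most common endogenous mutational mechanism.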
  • Depurination: purine bases are spontaneously cleaved from the backbone, creating abasic sites.
  • Oxidation: reactive oxygen species (ROS) generate 8-oxoguanine, which can mispair with adenine → G:C→T:A transversions.
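The terms transition and transversion used above follow from base chemistry: a transition swaps a purine for a purine or a pyrimidine for a pyrimidine, while a transversion swaps one class for the other. A minimal sketch, with the hypothetical function name `substitution_class`:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("ref and alt must be two different DNA bases")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(substitution_class("C", "T"))  # deamination-type change
print(substitution_class("G", "T"))  # 8-oxoguanine-type change
```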

Environmental Mutagens

  • UV radiation: creates cyclobutane pyrimidine dimers and 6-4 photoproducts at adjacent pyrimidines → C→T and CC→TT transitions. Characteristic signature in skin cancers.
  • Cigarette smoke: polycyclic aromatic hydrocarbons and other carcinogens create bulky adducts → G→T transversions. Characteristic signature in lung cancers from smokers.
  • Alkylating agents: attach methyl or ethyl groups to DNA bases → errors during replication.
  • Ionizing radiation: double-strand breaks → large deletions, translocations.
  • APOBEC cytidine deaminases: strictly an endogenous source rather than an environmental mutagen. These cellular enzymes are normally involved in innate immunity; when dysregulated, they cause extensive C→T and C→G mutations at TC contexts. Major mutational process in many cancer types.

Mutational Signatures

The pattern of mutations in a genome reflects the processes that caused them. The COSMIC Mutational Signatures database (v3.4 as of 2024) catalogs 78 validated single base substitution signatures, plus others for small indels and SVs.

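Each signature is characterized by the relative rates of all 96 mutation types (6 substitution types × 16 trinucleotide contexts). By convention the mutated base is reported on the pyrimidine strand, so a G→T change is recorded as C→A. Signature 4 (smoking) is dominated by C→A substitutions, the G→T transversions described above. Signature 7a/7b (UV) is dominated by C→T at dipyrimidine sites. Signature 3 (homologous recombination deficiency, found in BRCA1/2-mutant tumors) has a comparatively flat profile across the 96 types and tends to co-occur with small deletions flanked by microhomology, which are captured by the indel signatures rather than by the substitution signature itself.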
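The 96 categories can be enumerated directly, which makes the 6 × 16 arithmetic concrete. The sketch below also shows how a variant is assigned to a category under the common convention of reporting the mutated base as a pyrimidine (C or T), reverse-complementing when the reference base is a purine. The function names are hypothetical, and the `A[C>T]G` label format with `>` is the notation typically used in signature catalogs rather than the arrows used in this text.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def sbs96_channels() -> list:
    """All 96 labels: 6 substitutions x 4 upstream bases x 4 downstream bases."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product("ACGT", repeat=2)
    ]


def sbs96_channel(context: str, alt: str) -> str:
    """Map a substitution to its channel label.

    context: 3 reference bases centered on the mutated base.
    alt:     the alternate base, on the same strand as context.
    """
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":  # purine reference: switch to the other strand
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


print(len(sbs96_channels()))
print(sbs96_channel("TGA", "A"))  # reported on the opposite strand
```

Counting how many of a tumor's substitutions fall into each channel gives the 96-element profile that signature decomposition methods take as input.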

Decomposing a tumor's mutations into mutational signatures reveals the etiology — what caused the mutations — and can have clinical implications (BRCA1/2-like signature → may respond to PARP inhibitors).

Variant Classification Frameworks

ACMG/AMP Guidelines

The standard for germline variant classification (used in clinical genetics labs) is the ACMG/AMP 2015 guidelines. They use a combination of evidence criteria:

  • Population frequency: Is the variant common in the general population? Common = less likely pathogenic.
  • Computational predictions: Do tools (SIFT, PolyPhen-2, AlphaMissense) predict it's damaging?
  • Functional studies: Does it disrupt protein function in experimental assays?
  • Segregation: Does the variant co-segregate with disease in affected families?
  • Known pathogenic variants: Is this the same or similar to a previously validated pathogenic variant?

Evidence is combined to reach one of 5 classifications (pathogenic, likely pathogenic, VUS, likely benign, benign).

OncoKB and Clinical Oncogenomics

For somatic (cancer) variants, separate classification systems apply. OncoKB classifies variants by their clinical actionability — whether there's an approved drug, a clinical trial, or just biological evidence.

A BRAF V600E mutation in melanoma is Level 1 (FDA-approved therapy: vemurafenib, dabrafenib). The same mutation in cholangiocarcinoma might be Level 3A (evidence from clinical trials, not approved). Different cancers, same variant, different clinical implications.

The VCF File Format

Variant data is stored in VCF (Variant Call Format) files. This is the universal format for genomic variant data.

##fileformat=VCFv4.2
##reference=GRCh38
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM POS       ID          REF ALT QUAL FILTER INFO      FORMAT  SAMPLE1
chr17  7674220   rs11540652  C   T   .    PASS   AF=0.001  GT:DP   0/1:45
chr7   140753336 .           A   T   100  PASS   SOMATIC   GT:DP   0/1:120

Key fields:

  • CHROM/POS: chromosome and 1-based position
  • REF/ALT: reference and alternate alleles
  • QUAL: Phred-scaled variant quality score ("." if missing)
  • FILTER: PASS or reason for filtering
  • INFO: semicolon-delimited annotations (allele frequency, functional effect, etc.)
  • FORMAT/SAMPLE: per-sample genotype data

The GT (genotype) field encodes the alleles: 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate. Somatic variants in tumors are often 0/1 with a variant allele fraction (VAF) far below 50% due to tumor heterogeneity and normal cell contamination.
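To make the field layout concrete, here is a minimal sketch of parsing one VCF data line with only the standard library. The record used is hypothetical, the function names are illustrative, and the parser ignores multi-sample files and header validation. Real VCF files are tab-delimited; `split()` is used here so that space-aligned examples also parse. For real work, an established library such as pysam or cyvcf2 is the safer choice.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse a single-sample VCF data line into a dictionary."""
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt, sample = fields[:10]

    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            if not entry:  # tolerate a stray trailing semicolon
                continue
            key, _, value = entry.partition("=")
            info_dict[key] = value if value else True  # flags carry no value

    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }


def describe_genotype(gt: str) -> str:
    """Translate a diploid GT string such as 0/1 into words."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


# Hypothetical record for illustration.
record = parse_vcf_line("chr1 12345 . A G 50 PASS AF=0.01;SOMATIC GT:DP 0/1:45")
print(record["info"], describe_genotype(record["sample"]["GT"]))
```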

VCF annotation tools (ANNOVAR, VEP, SnpEff) add predicted functional effects to the INFO field; VEP, for example, writes a CSQ entry and SnpEff an ANN entry.

Key Population Databases

dbSNP: NCBI's database of known short variants. Assigns rs numbers (reference SNP IDs) to submitted variants, whether common or rare. A variant in dbSNP is not necessarily benign; its presence only means it has been observed before.

gnomAD (Genome Aggregation Database): ~730,000 exomes and ~76,000 whole genomes from diverse populations (about 807,000 individuals in total as of v4). The most important population frequency database. A variant observed in thousands of gnomAD individuals is almost certainly not a high-penetrance disease variant.

ClinVar: NCBI's database of variant-disease associations. Aggregates classifications from clinical labs, researchers, and curated sources. The primary reference for clinical variant interpretation.

COSMIC (Catalogue Of Somatic Mutations In Cancer): Somatic mutation database from tumor sequencing. Contains >8 million unique mutations from >40,000 tumor samples. Essential for identifying oncogenic mutations and mutational signatures.

Understanding these databases — their scope, their limitations, and how to query them — is the foundation of clinical genomics and cancer bioinformatics. In Chapter 6.5 we'll work directly with VCF files in Python to perform this analysis computationally.