Mutations and Variants

Point mutation: synonymous vs. missense outcomes

Every human genome differs from the reference genome at approximately 4–5 million positions. Some differences cause disease. Most are harmless. A few confer advantages. Understanding the types of variants, how they arise, and how to classify their effects is foundational to clinical genomics, cancer biology, and evolutionary analysis.

This is not abstract taxonomy. When you run a variant caller on a tumor-normal pair, every output line is a variant described by these categories. When you interpret a clinical report, every entry is classified by this framework. Knowing the types and effects of variants determines what questions you can ask and what tools you use to answer them.

Mutation vs. Variant: The Terminology

These terms are often used interchangeably, but in clinical genomics they have distinct contexts:

Mutation implies a pathological change: a variant known to cause disease. It's a clinical judgment.

Variant is the neutral term for any position that differs from the reference. Variants are further classified by evidence:

Pathogenic: known to cause disease
Likely pathogenic: strong evidence for pathogenicity
Variant of uncertain significance (VUS): insufficient evidence
Likely benign: probably harmless
Benign: known to have no disease effect

The distinction matters for communication with clinicians and patients. Everything in a VCF file is a variant; very few are mutations in the clinical sense.

Types of Variants by Size and Mechanism

Single Nucleotide Variants (SNVs)

A single nucleotide change. The most common type of genetic variation. When referring to common SNVs found at >1% frequency in the population, they're called SNPs (single nucleotide polymorphisms). Most disease-associated variants discovered in GWAS are SNPs.

SNVs in coding regions are classified by their effect on the protein:

Synonymous (silent): The nucleotide changes but the codon still encodes the same amino acid (due to codon degeneracy). No protein change. Often assumed to be neutral, but can affect splicing, codon usage, or mRNA stability.

Missense: The nucleotide change causes a different amino acid to be incorporated. Effect on protein function depends on the amino acid properties and the position. A conservative substitution (e.g., Leu → Ile, both hydrophobic) is less likely to be damaging than a radical one (e.g., Arg → Glu, charge reversal).

Nonsense: The nucleotide change creates a premature stop codon (UAA, UAG, UGA). Produces a truncated protein, almost always loss-of-function if the stop codon is early in the coding sequence. The mRNA carrying the premature stop is often degraded by NMD (nonsense-mediated decay).

Splice site: Occurs at the consensus splice site sequence (GT at the 5' splice site, AG at the 3' splice site, or nearby sequences). Disrupts splicing → exon skipping, intron retention, or cryptic splice site activation. Often as damaging as nonsense variants.

Insertions and Deletions (Indels)

In-frame indels: Length divisible by 3 → inserts or deletes amino acids without disrupting the reading frame. Typically less severe than frameshift indels. May delete a critical residue or domain.

Frameshift indels: Length not divisible by 3 → shifts the reading frame of all downstream codons. Produces a completely different amino acid sequence after the indel, usually followed quickly by a premature stop codon. Almost always loss-of-function.
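The divisible-by-3 rule above can be sketched in a few lines. This is a minimal illustration, assuming VCF-style REF/ALT allele strings for an indel that lies entirely inside a coding exon; the function name `classify_indel` is hypothetical.

```python
def classify_indel(ref, alt):
    """Classify a coding indel as in-frame or frameshift from REF/ALT alleles."""
    net_change = len(alt) - len(ref)      # > 0: insertion, < 0: deletion
    if net_change == 0:
        return "not an indel"
    kind = "insertion" if net_change > 0 else "deletion"
    # The reading frame is preserved only when the net length change is a multiple of 3.
    frame = "in-frame" if net_change % 3 == 0 else "frameshift"
    return f"{frame} {kind} ({abs(net_change)} bp)"

print(classify_indel("ACTT", "A"))   # in-frame deletion (3 bp)
print(classify_indel("A", "AG"))     # frameshift insertion (1 bp)
```

The check only looks at allele lengths. It does not handle indels that span an exon boundary or a splice site, where the effect on the protein cannot be read off the length alone.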

Structural Variants (SVs)

Large-scale rearrangements affecting hundreds of base pairs to megabases:

Copy Number Variants (CNVs): duplications or deletions of chromosomal segments. Gene amplification (extra copies → overexpression) and gene deletion (fewer copies → reduced expression or loss-of-function) are both common in cancer.
Inversions: a segment is reversed in orientation
Translocations: a segment moves to a different chromosome (or a different position on the same chromosome). Oncogenic rearrangements create fusion genes: BCR-ABL in CML (t(9;22)), EML4-ALK in lung cancer (formed by an inversion on chromosome 2), etc.
Mobile element insertions: retrotransposons or other mobile elements inserting into the genome

Tandem Repeats

Short sequence motifs repeated in tandem. Microsatellites (2–6 bp repeats) are highly polymorphic and prone to replication slippage errors. Trinucleotide repeat expansion is the mechanism of Huntington's disease (CAG expansion in HTT), Fragile X syndrome (CGG expansion in FMR1), and other neurological disorders.
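Measuring a repeat tract comes down to finding the longest uninterrupted run of the motif. The sketch below does this with a regular expression on a toy sequence. The helper name `longest_repeat_run` and the sequence are made up for illustration, and real repeat genotyping from sequencing reads needs dedicated tools.

```python
import re

def longest_repeat_run(seq, motif="CAG"):
    """Return the number of motif copies in the longest uninterrupted tandem run."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)

# Toy allele (not a real HTT sequence): a start codon, 42 CAG copies, then flanking bases.
toy_allele = "ATG" + "CAG" * 42 + "CCGCCA"
print(longest_repeat_run(toy_allele))          # 42
print(longest_repeat_run("ATGCCGTTA"))         # 0 (motif absent)
```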

Mutation Mechanisms

Replication Errors

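DNA polymerase occasionally incorporates the wrong nucleotide; proofreading reduces the error rate to roughly 1 in 10⁷ per base per replication. Mismatch repair then catches most remaining errors, bringing the overall rate to about 1 in 10⁹ to 10¹⁰. The few that escape become permanent mutations.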
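To put the error rate in perspective, here is a back-of-the-envelope calculation. It assumes a diploid human genome of about 6.2 × 10⁹ bp (roughly 3.1 × 10⁹ bp per haploid copy) and treats the per-base error rate as uniform, which is a simplification. The function name `expected_errors` is hypothetical.

```python
DIPLOID_GENOME_BP = 6.2e9   # assumption: ~3.1e9 bp haploid genome x 2 copies

def expected_errors(rate_per_bp, genome_bp=DIPLOID_GENOME_BP):
    """Expected number of uncorrected replication errors per cell division."""
    return rate_per_bp * genome_bp

for rate in (1e-9, 1e-10):
    print(f"rate {rate:.0e} per bp -> ~{expected_errors(rate):.1f} new errors per division")
```

Even at these rates, a cell lineage that divides many times accumulates a measurable number of replication-derived mutations.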

Spontaneous Chemical Damage

Deamination: cytosine spontaneously loses its amino group → uracil (read as thymine during replication). Creates C→T transitions, most commonly at CpG dinucleotides, where cytosine is often methylated and deaminates directly to thymine. This is the most common endogenous mutational mechanism.
Depurination: purine bases are spontaneously cleaved from the DNA backbone, creating abasic sites.
Oxidation: reactive oxygen species (ROS) generate 8-oxoguanine, which can mispair with adenine → G:C→T:A transversions.

Environmental Mutagens

UV radiation: creates cyclobutane pyrimidine dimers and 6-4 photoproducts at adjacent pyrimidines → C→T and CC→TT transitions. Characteristic signature in skin cancers.
Cigarette smoke: polycyclic aromatic hydrocarbons and other carcinogens create bulky adducts → G→T transversions. Characteristic signature in lung cancers from smokers.
Alkylating agents: attach methyl or ethyl groups to bases → errors during replication.
Ionizing radiation: double-strand breaks → large deletions, translocations.
APOBEC cytidine deaminases: cellular enzymes normally involved in innate immunity; when dysregulated, cause extensive C→T and C→G mutations at TC contexts. Major mutational process in many cancer types.
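The mechanisms above are described in terms of transitions and transversions. A transition swaps a purine for a purine (A↔G) or a pyrimidine for a pyrimidine (C↔T); a transversion swaps one class for the other. A minimal sketch of that rule, with the hypothetical helper name `substitution_class`:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_class(ref, alt):
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class on both sides means transition; otherwise transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

print(substitution_class("C", "T"))   # transition   (e.g. deamination, UV)
print(substitution_class("G", "T"))   # transversion (e.g. 8-oxoguanine, smoking)
print(substitution_class("C", "G"))   # transversion (e.g. APOBEC)
```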

Mutational Signatures

The pattern of mutations in a genome reflects the processes that caused them. The COSMIC Mutational Signatures database (v3.4 as of 2024) catalogs 78 validated single base substitution (SBS) signatures, plus others for small indels and SVs.

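Each signature is characterized by the relative rates of all 96 mutation types (6 substitution types × 16 trinucleotide contexts). Signature 4 (smoking) is dominated by C→A substitutions (G→T on the opposite strand). Signature 7a/7b (UV) is dominated by C→T at dipyrimidine sites such as C[C→T]C. Signature 3 (homologous recombination deficiency, found in BRCA1/2-mutant tumors) has a comparatively flat substitution profile and tends to co-occur with deletions flanked by microhomology.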
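The 96 mutation types arise because signature catalogs conventionally report each substitution from the pyrimidine strand, which leaves six substitution classes (C>A, C>G, C>T, T>A, T>C, T>G), each with 4 × 4 flanking-base combinations. The sketch below enumerates the channels and normalizes an observed mutation into one of them. The function names are hypothetical and the input is assumed to be a 3-base reference context centered on the mutated base.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def sbs96_channel(context, alt):
    """Map a trinucleotide reference context plus the new base to a 96-channel label."""
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":                      # purine reference: report from the other strand
        context, alt = revcomp(context), revcomp(alt)
    five, ref, three = context
    return f"{five}[{ref}>{alt}]{three}"

ALL_CHANNELS = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]

print(len(ALL_CHANNELS))            # 96
print(sbs96_channel("CGG", "T"))    # C[C>A]G  (a G>T change reported as C>A)
print(sbs96_channel("TCA", "T"))    # T[C>T]A
```

Counting how many of a tumor's substitutions fall into each channel gives the 96-element vector that signature-fitting tools take as input.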

Decomposing a tumor's mutation catalog into mutational signatures reveals the etiology, meaning what caused the mutations, and can have clinical implications (BRCA1/2-like signature → may respond to PARP inhibitors).

Variant Classification Frameworks

ACMG/AMP Guidelines

The standard for germline variant classification (used in clinical genetics labs) is the ACMG/AMP 2015 guidelines. They use a combination of evidence criteria:

Population frequency: Is the variant common in the general population? Common = less likely pathogenic.
Computational predictions: Do tools (SIFT, PolyPhen-2, AlphaMissense) predict it's damaging?
Functional studies: Does it disrupt protein function in experimental assays?
Segregation: Does the variant co-segregate with disease in affected families?
Known pathogenic variants: Is this the same or similar to a previously validated pathogenic variant?

Evidence is combined to reach one of 5 classifications (pathogenic, likely pathogenic, VUS, likely benign, benign).

OncoKB and Clinical Oncogenomics

For somatic (cancer) variants, separate classification systems apply. OncoKB classifies variants by their clinical actionability: whether there's an approved drug, a clinical trial, or just biological evidence.

A BRAF V600E mutation in melanoma is Level 1 (FDA-approved therapy: vemurafenib, dabrafenib). The same variant in cholangiocarcinoma might be Level 3A (evidence from clinical trials, not approved). Different cancers, same variant, different clinical implications.

The VCF File Format

Variant data is stored in VCF (Variant Call Format) files. This is the universal format for genomic variant data.

##fileformat=VCFv4.2
##reference=GRCh38
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic variant">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM POS       ID          REF ALT QUAL FILTER INFO      FORMAT  SAMPLE1
chr17  7674220   rs28934578  G   T   .    PASS   AF=0.001  GT:DP   0/1:45
chr7   140753336 .           A   T   100  PASS   SOMATIC   GT:DP   0/1:120

Key fields:

CHROM/POS: chromosome and 1-based position
REF/ALT: reference and alternate alleles
QUAL: Phred-scaled variant quality score
FILTER: PASS or reason for filtering
INFO: semicolon-delimited annotations (allele frequency, functional effect, etc.)
FORMAT/SAMPLE: per-sample data

The GT (genotype) field encodes the genotype: 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate. Somatic variants in tumors are often 0/1 with a variant allele fraction (VAF) far below 50% due to tumor heterogeneity and normal-cell contamination.
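Why a heterozygous somatic call can sit far below 50% follows from simple allele counting. The sketch below uses a commonly used simplified model: the expected fraction of mutant alleles in a mixture of tumor and normal cells. The function name and the default copy numbers (diploid tumor, diploid normal, one mutated copy) are assumptions for illustration, and real samples add sequencing noise on top.

```python
def expected_vaf(purity, mutant_copies=1, tumor_copy_number=2,
                 cancer_cell_fraction=1.0, normal_copy_number=2):
    """Expected variant allele fraction in a tumor/normal cell mixture.

    purity: fraction of cells in the sample that are tumor cells (0..1)
    cancer_cell_fraction: fraction of tumor cells carrying the variant (1.0 = clonal)
    """
    mutant_alleles = purity * cancer_cell_fraction * mutant_copies
    total_alleles = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    return mutant_alleles / total_alleles

print(round(expected_vaf(1.0), 3))                             # 0.5 (pure tumor, clonal, heterozygous)
print(round(expected_vaf(0.4), 3))                             # 0.2 (60% normal contamination)
print(round(expected_vaf(0.4, cancer_cell_fraction=0.5), 3))   # 0.1 (also subclonal)
```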

VCF annotation tools (ANNOVAR, VEP, SnpEff) add predicted functional effects to the INFO field.
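To make the field layout concrete, here is a minimal pure-Python sketch that splits one VCF data line into its columns. It assumes a tab-delimited line with at least the nine fixed columns plus sample columns (real VCF files are tab-delimited even though the example above is shown with spaces), and the helper names are hypothetical. For real work, an established parser such as pysam or cyvcf2 is the safer choice.

```python
def parse_info(info_field):
    """Turn 'AF=0.001;SOMATIC' into {'AF': '0.001', 'SOMATIC': True}."""
    info = {}
    for entry in info_field.split(";"):
        if entry in ("", "."):                       # tolerate empty entries or a missing INFO
            continue
        key, has_value, value = entry.partition("=")
        info[key] = value if has_value else True     # flag entries carry no value
    return info

def parse_vcf_record(line, sample_names):
    """Parse one tab-delimited VCF data line into a plain dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
    format_keys = fmt.split(":")
    return {
        "chrom": chrom,
        "pos": int(pos),                             # 1-based position
        "id": None if var_id == "." else var_id,
        "ref": ref,
        "alts": alt.split(","),                      # multi-allelic sites list several ALTs
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
        "samples": {
            name: dict(zip(format_keys, column.split(":")))
            for name, column in zip(sample_names, fields[9:])
        },
    }

record = parse_vcf_record(
    "chr17\t7674220\trs28934578\tG\tT\t.\tPASS\tAF=0.001\tGT:DP\t0/1:45",
    sample_names=["SAMPLE1"],
)
print(record["chrom"], record["pos"], record["ref"], record["alts"])
print(record["info"], record["samples"]["SAMPLE1"])
```

Per-sample values are left as strings here. Converting them to the types declared in the `##FORMAT` header lines is the next step a fuller parser would take.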

Key Population Databases

dbSNP: NCBI's database of known variants. Assigns rs numbers to common and clinically observed variants. A variant in dbSNP is not necessarily benign; it just means it's been observed before.

gnomAD (Genome Aggregation Database): ~730,000 exomes and ~76,000 whole genomes from diverse populations (v4). The most important population frequency database. A variant observed in thousands of gnomAD individuals is almost certainly not a high-penetrance variant for a dominant disease.

ClinVar: NCBI's database of variant-disease associations. Aggregates classifications from clinical labs, researchers, and curated sources. The primary reference for clinical variant interpretation.

COSMIC (Catalogue Of Somatic Mutations In Cancer): Somatic mutation database from tumor genomes. Contains >8 million unique mutations from >40,000 tumor samples. Essential for identifying oncogenic mutations and mutational signatures.

Understanding these databases — their scope, their limitations, and how to query them — is the foundation of clinical genomics and cancer bioinformatics. In Chapter 6.5 we'll work directly with VCF files in Python to perform this analysis computationally.

Decoder

Biology

Mutations are permanent changes to the DNA sequence — substitutions, insertions, or deletions of one or more bases. Most mutations are neutral or repaired before expression; a small fraction alter protein function. Somatic mutations affect only the individual; germline mutations are heritable.

For Developers

A mutation is a bit flip in the source code. A synonymous substitution is a no-op (same amino acid, different codon — like renaming a variable in a compiled binary). A missense mutation is a type error: different amino acid, potentially broken function. A frameshift (insertion/deletion) is corruption of the entire downstream sequence — every codon after the edit point is wrong. Nonsense mutations are null pointer dereferences: a premature stop codon truncates the protein.

LAB · Point Mutation Simulator

Python · Pyodide

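# Point mutations: substituting one base changes a codon, potentially changing the amino acid.

# Standard genetic code for DNA codons (coding strand), all 64 entries.
CODON_TABLE = {
    "TTT": "Phe", "TTC": "Phe", "TTA": "Leu", "TTG": "Leu",
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "ATT": "Ile", "ATC": "Ile", "ATA": "Ile", "ATG": "Met",
    "GTT": "Val", "GTC": "Val", "GTA": "Val", "GTG": "Val",
    "TCT": "Ser", "TCC": "Ser", "TCA": "Ser", "TCG": "Ser",
    "CCT": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
    "ACT": "Thr", "ACC": "Thr", "ACA": "Thr", "ACG": "Thr",
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "TAT": "Tyr", "TAC": "Tyr", "TAA": "Stop", "TAG": "Stop",
    "CAT": "His", "CAC": "His", "CAA": "Gln", "CAG": "Gln",
    "AAT": "Asn", "AAC": "Asn", "AAA": "Lys", "AAG": "Lys",
    "GAT": "Asp", "GAC": "Asp", "GAA": "Glu", "GAG": "Glu",
    "TGT": "Cys", "TGC": "Cys", "TGA": "Stop", "TGG": "Trp",
    "CGT": "Arg", "CGC": "Arg", "CGA": "Arg", "CGG": "Arg",
    "AGT": "Ser", "AGC": "Ser", "AGA": "Arg", "AGG": "Arg",
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
}

def translate_dna(dna):
    """Translate codon by codon from index 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i+3], "?")
        if aa == "Stop":
            break
        protein.append(aa)
    return protein

def point_mutate(dna, position, new_base):
    """Return dna with the base at 0-based index `position` replaced by new_base."""
    return dna[:position] + new_base + dna[position+1:]

def classify(orig_protein, mut_protein):
    """Label the protein-level effect of a single-base substitution."""
    if orig_protein == mut_protein:
        return "synonymous mutation (silent) -- same protein"
    if len(mut_protein) < len(orig_protein):
        return "nonsense mutation -- premature stop after residue " + str(len(mut_protein))
    changed = next((i for i, (a, b) in enumerate(zip(orig_protein, mut_protein)) if a != b), None)
    if changed is None:
        return "stop-loss mutation -- protein extended"
    return "missense mutation -- protein changed at position " + str(changed + 1)

original = "ATGGCTGAGCGT"   # Met-Ala-Glu-Arg
orig_protein = translate_dna(original)
print("Original DNA :", original, "->", "-".join(orig_protein))

# Index 5 (0-based): T -> C turns GCT into GCC, both Ala   -> synonymous
# Index 7 (0-based): A -> T turns GAG into GTG, Glu -> Val -> missense
for position, new_base in [(5, "C"), (7, "T")]:
    mutant = point_mutate(original, position, new_base)
    mut_protein = translate_dna(mutant)
    print()
    print("Mutant DNA   :", mutant, "->", "-".join(mut_protein))
    print("Result:", classify(orig_protein, mut_protein))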