Part 2·2.5·16 min read

Proteins: The Runtime Executable

Proteins are the cell's working programs — their shape is their function, and their function emerges from a sequence that folds spontaneously in milliseconds.

Tags: proteins, translation, protein folding, structure
Central Dogma: DNA → RNA → Protein

If DNA is source code and RNA is bytecode, proteins are the compiled, running executables that actually do things. Proteins build structures, catalyze reactions, transmit signals, regulate gene expression, transport molecules, and defend against pathogens. The cell is, in a very real sense, a protein machine: most of what makes a liver cell different from any other cell type is the set of proteins each cell type produces and maintains.

Understanding proteins as a programmer means understanding three things: how they're built (translation), how they achieve their function (folding), and how that function can be predicted and analyzed computationally (structural biology and proteomics).

Translation: Executing the mRNA

Translation converts the nucleotide sequence of an mRNA into the amino acid sequence of a protein. It occurs at ribosomes, large complexes of rRNA and protein that function as the cell's runtime environment.

The ribosome reads the mRNA in triplets (codons) and, for each codon, recruits the corresponding aminoacyl-tRNA. The tRNA brings the correct amino acid, the ribosome forms a peptide bond between successive amino acids, and the growing polypeptide chain is extended one residue at a time.

The three stages:

Initiation

The small ribosomal subunit (40S in eukaryotes) associates with the mRNA at the 5' cap and scans for the start codon (AUG, encoding methionine). When it finds it, the large subunit (60S) joins, forming the complete 80S ribosome. Initiation requires multiple initiation factors and GTP hydrolysis.

Most eukaryotic mRNAs use cap-dependent initiation. Viral mRNAs and some cellular stress-response mRNAs use IRES (Internal Ribosome Entry Sites): RNA structures that recruit the ribosome directly to an internal site without cap recognition.

Elongation

The ribosome has three sites:

  • A site (acceptor): where the incoming aminoacyl-tRNA binds
  • P site (peptidyl): where the growing peptide chain is held
  • E site (exit): where the spent tRNA exits

Each elongation cycle:

  1. An aminoacyl-tRNA with the right anticodon binds the A site
  2. Peptidyl transferase (the rRNA ribozyme) transfers the growing chain from the P-site tRNA to the amino acid on the A-site tRNA, forming a new peptide bond
  3. The ribosome translocates one codon along the mRNA in the 5'→3' direction (moving the peptidyl-tRNA from A→P and the spent tRNA from P→E)
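The cycle above can be traced as a small state machine. The following is a minimal sketch, not a biochemical model: each tRNA is labeled by the codon it decodes, and the function name `elongation_cycles` and the dictionary layout are illustrative assumptions.

```python
def elongation_cycles(codons):
    """Trace A/P/E site occupancy across elongation cycles.

    Each tRNA is labeled by the codon it decodes (a simplification).
    The first codon's tRNA is assumed to already sit in the P site,
    as it does after initiation.
    """
    p_site, e_site = codons[0], None
    chain_length = 1
    trace = []
    for codon in codons[1:]:
        a_site = codon          # 1. aminoacyl-tRNA binds the A site
        chain_length += 1       # 2. peptide bond: chain moves onto the A-site tRNA
        # 3. translocation: P-site tRNA moves to E, A-site tRNA moves to P
        e_site, p_site, a_site = p_site, a_site, None
        trace.append({"E": e_site, "P": p_site, "A": a_site,
                      "chain_length": chain_length})
    return trace


for state in elongation_cycles(["AUG", "GCC", "AAA"]):
    print(state)
```

After every cycle the A site is empty again and the chain is one residue longer, which is why elongation can repeat until a stop codon arrives.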

Speed: ~15–20 amino acids/second in eukaryotes. A 300 aa protein takes ~15–20 seconds to synthesize.
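The timing estimate is simple division of chain length by elongation rate. A short sketch, using the rate quoted above and ignoring initiation and termination time (the function name is illustrative):

```python
def synthesis_time_seconds(length_aa, rate_aa_per_s):
    """Elongation time only: chain length divided by elongation rate."""
    return length_aa / rate_aa_per_s


for rate in (15, 20):
    seconds = synthesis_time_seconds(300, rate)
    print(f"{rate} aa/s: {seconds:.0f} s for a 300 aa protein")
```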

Termination

When a stop codon (UAA, UAG, or UGA) enters the A site, a release factor rather than a tRNA binds. This triggers hydrolysis of the peptide chain from the final tRNA, releasing the completed polypeptide. The ribosome then dissociates.

After release, the newly synthesized polypeptide is just a linear chain of amino acids. It doesn't become functional until it folds.
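The three stages map onto a short loop: find the start codon, read triplets, stop at a stop codon. Here is a minimal sketch using the standard genetic code. It simplifies initiation to "first AUG in the string", which ignores cap scanning context and IRES elements, and the function name `translate` is an illustrative choice.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string into one-letter amino acid codes."""
    start = mrna.find("AUG")                 # initiation: locate the start codon
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3): # elongation: one codon at a time
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":                # termination: stop codon reached
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("GGAUGGCCAAAUGAUU"))
```

The output is a plain string of residues, which matches the point above: translation yields a linear chain, and nothing in it is functional yet.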

The 20 Amino Acids: The Type System

Proteins are built from 20 canonical amino acids, each defined by its side chain (R group). The side chain determines an amino acid's chemical character:

| Property | Amino acids | Functional consequence |
|---|---|---|
| Nonpolar/hydrophobic | Ala, Val, Leu, Ile, Pro, Phe, Trp, Met | Drive hydrophobic core formation |
| Polar, uncharged | Ser, Thr, Cys, Tyr, Asn, Gln | Hydrogen bonding, active site residues |
| Positively charged | Arg, Lys, His | DNA binding, salt bridges |
| Negatively charged | Asp, Glu | Catalysis, charge repulsion |
| Special | Gly (flexibility), Pro (rigidity, disrupts helices) | Structural roles |
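Treating these classes as a type system suggests a lookup from residue to class. The sketch below encodes the table with one-letter codes and reports what fraction of a sequence falls into each class. Two choices are assumptions of this sketch: proline, which the table lists twice, is counted as nonpolar, and glycine is the only member of "special".

```python
from collections import Counter

# One-letter codes grouped as in the table above.
CLASS_MEMBERS = {
    "nonpolar": "AVLIPFWM",
    "polar": "STCYNQ",
    "positive": "RKH",
    "negative": "DE",
    "special": "G",
}
RESIDUE_CLASS = {
    residue: cls
    for cls, residues in CLASS_MEMBERS.items()
    for residue in residues
}


def class_composition(sequence):
    """Return the fraction of residues in each chemical class.

    Raises KeyError for anything outside the 20 canonical amino acids.
    """
    counts = Counter(RESIDUE_CLASS[residue] for residue in sequence.upper())
    total = len(sequence)
    return {cls: count / total for cls, count in counts.items()}


print(class_composition("MKDG"))
```

A sequence with a high nonpolar fraction is a hint, though not proof, of a membrane-embedded or core-forming segment.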

The sequence of amino acids (the primary structure) contains all the information needed to fold into the correct 3D shape. This is Anfinsen's dogma, established in 1961: a protein's native structure is the thermodynamic minimum for that sequence. No assembly instructions are required beyond the sequence itself.

Protein Folding: Compilation from Sequence to Structure

As the polypeptide chain emerges from the ribosome, it begins folding. Folding is driven by thermodynamics: the protein seeks its minimum free energy conformation. But it doesn't get there by sampling every possible conformation, which would take longer than the age of the universe. Instead, folding proceeds through a folding funnel: a landscape of conformations where energy decreases toward the native state, guiding the chain efficiently.
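The "longer than the age of the universe" claim is a back-of-the-envelope estimate, usually credited to Levinthal. The sketch below reproduces it under loudly stated assumptions: 3 conformations per residue and 10^13 conformations sampled per second are illustrative round numbers, not measured values.

```python
CONFORMATIONS_PER_RESIDUE = 3    # assumed, illustrative
SAMPLES_PER_SECOND = 1e13        # assumed, illustrative
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10


def exhaustive_search_years(n_residues):
    """Years needed to visit every conformation once, one after another."""
    conformations = CONFORMATIONS_PER_RESIDUE ** n_residues
    return conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR


years = exhaustive_search_years(100)
print(f"{years:.1e} years, {years / AGE_OF_UNIVERSE_YEARS:.1e} x the age of the universe")
```

Because the search space grows exponentially with chain length, changing the assumed constants shifts the answer by a few orders of magnitude but not the conclusion.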

Secondary Structure

The polypeptide backbone forms regular local structures stabilized by backbone hydrogen bonds:

  • α-helix: a right-handed coil where every backbone NH forms a hydrogen bond with the backbone C=O four residues earlier. Roughly 1.5 Å rise per residue, 3.6 residues per turn. Common in membrane proteins (transmembrane α-helices) and many cytoplasmic proteins.

  • β-sheet: extended strands arranged side-by-side, held together by interstrand hydrogen bonds. Can be parallel or antiparallel. Found in immunoglobulins, β-barrel proteins, and amyloid fibrils.

  • Loops and turns: irregularly structured regions connecting helices and strands. They are often located on protein surfaces, where they form binding sites and active sites.
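The α-helix numbers above support quick geometry estimates. A small sketch follows; the 30 Å figure used for the hydrophobic thickness of a membrane is an assumed round number for illustration, and the function names are hypothetical.

```python
import math

RISE_PER_RESIDUE = 1.5     # Å along the helix axis, from the text above
RESIDUES_PER_TURN = 3.6    # from the text above
PITCH = RISE_PER_RESIDUE * RESIDUES_PER_TURN   # Å advanced per full turn


def helix_length(n_residues):
    """Axial length in Å of an ideal α-helix with n_residues residues."""
    return n_residues * RISE_PER_RESIDUE


def residues_to_span(distance):
    """Smallest number of helical residues covering `distance` Å."""
    return math.ceil(distance / RISE_PER_RESIDUE)


print(f"pitch: {PITCH:.1f} Å per turn")
print(f"residues to span an assumed 30 Å membrane core: {residues_to_span(30)}")
```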

Tertiary and Quaternary Structure

The full 3D arrangement of all atoms in a single polypeptide is its tertiary structure. It's stabilized by:

  • Hydrophobic interactions (nonpolar residues in the core away from water)
  • Hydrogen bonds (between side chains and backbone)
  • Disulfide bonds (covalent bonds between cysteine side chains, common in extracellular proteins)
  • Salt bridges (between oppositely charged side chains)

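Many functional proteins are multi-subunit assemblies; this is quaternary structure. Hemoglobin is a tetramer (α₂β₂). The 26S proteasome contains more than 30 distinct subunits (the "26S" is a sedimentation coefficient, not a subunit count). The eukaryotic ribosome has about 80 proteins plus four rRNAs.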

AlphaFold and the protein folding revolution

For 50 years, predicting protein 3D structure from sequence alone was considered one of the hardest problems in science. In 2020, DeepMind's AlphaFold2 achieved near-experimental accuracy on the CASP14 benchmark, effectively solving the problem for single-chain proteins. The AlphaFold Structure Database now contains predicted structures for >200 million proteins, essentially all known proteins. AlphaFold3 (2024) extended this to complexes with DNA, RNA, and small molecules.

For bioinformatics practitioners, this means structure-based analyses that previously required experimental data (X-ray crystallography, cryo-EM) are now available computationally for virtually any protein.

Protein Domains: Modules and Libraries

Evolution rarely builds proteins from scratch. Instead, it recombines and modifies existing structural units called domains: independently folding segments with defined structure and function that appear in many different proteins.

Classic examples:

  • SH2 domain: binds phosphotyrosine residues. Found in 120+ human proteins. Key transducer in tyrosine kinase signaling.
  • DNA-binding domains: zinc fingers, helix-turn-helix, leucine zippers, each with specific sequence preferences
  • Kinase domain: the catalytic core of protein kinases, responsible for phosphorylating serine, threonine, or tyrosine residues
  • Ubiquitin-binding domains: recognize ubiquitin modifications on other proteins

A single protein can contain multiple domains from different "families," often connected by flexible linkers. This modularity means you can infer partial function from sequence alone: if you find an SH2 domain in an uncharacterized protein, it almost certainly binds phosphoproteins.

The Pfam and InterPro databases catalog known domains and can be used to annotate proteins predicted from genomic sequence.
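Domain-based inference amounts to a library lookup. The sketch below models a domain architecture as a list of annotated segments and maps each to the function described in the examples above. The `Domain` class, the lookup table, and the example protein are all hypothetical; in practice the annotations would come from a database such as Pfam or InterPro rather than a hand-written dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    start: int   # first residue of the domain (1-based)
    end: int     # last residue of the domain (inclusive)


# Hand-written lookup based on the examples in the text (illustrative only).
DOMAIN_FUNCTIONS = {
    "SH2": "binds phosphotyrosine residues",
    "kinase": "phosphorylates serine, threonine, or tyrosine residues",
    "zinc finger": "binds DNA",
    "ubiquitin-binding": "recognizes ubiquitin modifications on other proteins",
}


def infer_functions(domains):
    """List (domain name, inferred function) pairs in N-to-C terminal order."""
    ordered = sorted(domains, key=lambda domain: domain.start)
    return [(d.name, DOMAIN_FUNCTIONS.get(d.name, "unknown")) for d in ordered]


# A made-up uncharacterized protein with two recognizable domains.
hypothetical_protein = [Domain("kinase", 250, 510), Domain("SH2", 120, 215)]
for name, function in infer_functions(hypothetical_protein):
    print(f"{name}: {function}")
```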

Post-Translational Modifications: Runtime Configuration

Proteins don't arrive at their final functional state straight from the ribosome. Post-translational modifications (PTMs) add functional groups after synthesis:

| PTM | Effect | Biological role |
|---|---|---|
| Phosphorylation | Adds negative charge, alters shape | Signal transduction on/off switches |
| Ubiquitination | Tags for proteasomal degradation or trafficking | Protein turnover, DNA repair |
| Glycosylation | Adds sugar chains | Membrane stability, cell recognition |
| Acetylation | Neutralizes positive charge | Histone regulation, metabolic enzymes |
| Methylation | Variable charge effect | Histone code, protein-protein interactions |
| Cleavage | Removes signal peptide or prodomain | Protein activation, secretion |

Phosphorylation alone involves ~70,000 known phosphorylation sites in the human proteome. Kinases (add phosphate groups) and phosphatases (remove them) form intricate regulatory networks — cellular signaling is largely written in the language of phosphorylation.

PTMs as runtime feature flags

If the amino acid sequence is the binary, post-translational modifications are runtime state. The same protein can be active or inactive, nuclear or cytoplasmic, stable or targeted for degradation, all determined by which PTMs it carries at a given moment.

Phosphoproteomics (mass spectrometry that measures phosphorylation states) is analogous to runtime instrumentation: you're not reading the code, you're observing the running state of the system.
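The binary-versus-state distinction can be made concrete with a small class whose sequence never changes while its modifications do. This is a sketch under assumptions: the class and method names are hypothetical, positions are 1-based, and only phosphorylation is modeled, restricted to the serine, threonine, and tyrosine residues that kinases target.

```python
from dataclasses import dataclass, field

PHOSPHO_ACCEPTORS = {"S", "T", "Y"}   # serine, threonine, tyrosine


@dataclass
class ProteinState:
    sequence: str                              # the "binary": fixed after translation
    ptms: dict = field(default_factory=dict)   # 1-based position -> PTM name

    def phosphorylate(self, position):
        """Kinase action: flag a Ser/Thr/Tyr residue as phosphorylated."""
        residue = self.sequence[position - 1]
        if residue not in PHOSPHO_ACCEPTORS:
            raise ValueError(f"{residue}{position} cannot be phosphorylated")
        self.ptms[position] = "phospho"

    def dephosphorylate(self, position):
        """Phosphatase action: clear the flag if it is set."""
        if self.ptms.get(position) == "phospho":
            del self.ptms[position]

    def is_phosphorylated(self, position):
        return self.ptms.get(position) == "phospho"


protein = ProteinState("MSTAY")
protein.phosphorylate(2)
print(protein.sequence, protein.ptms)
```

A phosphoproteomics experiment corresponds to reading `ptms` across many proteins at once, without ever touching `sequence`.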

Protein Degradation: Garbage Collection

Proteins don't live forever. The cell has two main degradation pathways:

The ubiquitin-proteasome system (UPS): Proteins tagged with chains of ubiquitin (a small 76-aa protein) are recognized and degraded by the 26S proteasome, a large barrel-shaped complex whose central chamber contains proteases. This is the primary pathway for degrading cytoplasmic proteins, regulatory proteins with short half-lives, and misfolded proteins. ~80% of cellular protein degradation goes through the UPS.

Autophagy: Portions of cytoplasm, including whole organelles and protein aggregates, are engulfed by a double-membrane vesicle (autophagosome) that fuses with the lysosome for degradation. Used for bulk protein turnover, organelle quality control (mitophagy clears damaged mitochondria), and nutrient recycling during starvation.

Both pathways are tightly regulated. Dysfunction in either contributes to neurodegeneration (protein aggregation diseases like Parkinson's, Alzheimer's), cancer (inappropriate stabilization of oncoproteins), and aging.
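What a "short half-life" means quantitatively can be sketched with simple exponential decay. This assumes first-order degradation (a constant probability of removal per unit time) and no new synthesis, which is a common simplification rather than something established above; the half-life values in the loop are illustrative, not measurements.

```python
def fraction_remaining(elapsed, half_life):
    """Fraction of a protein pool left after `elapsed` time.

    Assumes first-order decay and no new synthesis; both arguments
    must use the same time unit.
    """
    return 0.5 ** (elapsed / half_life)


# Illustrative half-lives in hours, comparing a short-lived and a long-lived pool.
for half_life in (0.5, 24):
    left = fraction_remaining(2, half_life)
    print(f"half-life {half_life} h: {left:.1%} remains after 2 h")
```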

Why Proteins Are Central to Bioinformatics

Almost everything in bioinformatics ultimately relates to proteins:

  • Variant annotation asks: does this variant change the protein sequence, structure, or stability?
  • Drug discovery asks: which proteins are good targets, and how does a small molecule bind to them?
  • Single-cell RNA-seq tells you which genes are expressed, but protein abundance is the downstream readout
  • Structural bioinformatics uses sequence to predict or analyze 3D structure

Proteomics, the mass spectrometry-based measurement of protein abundances and modifications, is becoming increasingly important alongside transcriptomics. Unlike mRNAs, proteins are directly functional; the correlation between mRNA abundance and protein abundance is only moderate (~0.4–0.6 Pearson r in most studies). The reasons include differential translation efficiency, variable protein stability, and PTM regulation.
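To see why r of 0.4–0.6 counts as only moderate, it helps to square it: r² is the share of variance in one measurement that a linear fit on the other accounts for. The sketch below implements Pearson's r from its definition and converts the quoted range; the function names are illustrative, and no real abundance data are used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def variance_explained(r):
    """Fraction of variance accounted for by a linear fit with correlation r."""
    return r ** 2


for r in (0.4, 0.6):
    print(f"r = {r}: about {variance_explained(r):.0%} of variance explained")
```

On this reading, mRNA levels leave most of the variation in protein abundance unexplained, which is the case for measuring proteins directly.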

Knowing a cell's proteins (their sequences, structures, modifications, binding partners, and stability) is knowing what the cell is actually doing right now.

DECODER
Biology

Proteins are linear polymers of amino acids that fold into three-dimensional structures. Their function — enzyme, transporter, receptor, structural component — is entirely determined by their shape. A single amino acid change can abolish or alter function entirely.

For Developers

A protein is a compiled and linked binary. The amino acid sequence is machine code, the 3D fold is the loaded executable in memory. Post-translational modifications (phosphorylation, glycosylation) are runtime patches — they change behavior without recompiling. A missense mutation is a single byte flip that can crash the process or silently corrupt state.
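The "single byte flip" can be made concrete by changing one base of a codon and checking what the standard genetic code does with the result. This is a minimal sketch with a hypothetical function name; it works on RNA codons and classifies only single-base substitutions. The first example, GAG to GUG (glutamate to valine), is the substitution behind sickle-cell hemoglobin.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change at 0-based `position` within a codon.

    Returns (kind, amino acid before, amino acid after); "*" marks a stop.
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        kind = "silent"       # same amino acid: the flip is invisible in the protein
    elif after == "*":
        kind = "nonsense"     # premature stop: the chain is truncated
    elif before == "*":
        kind = "stop-loss"    # stop codon removed: the chain reads through
    else:
        kind = "missense"     # different amino acid at this position
    return kind, before, after


print(classify_substitution("GAG", 1, "U"))
print(classify_substitution("UAU", 2, "A"))
print(classify_substitution("GGU", 2, "C"))
```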

LAB · Amino Acid Composition
Python · Pyodide