If DNA is source code and RNA is bytecode, proteins are the compiled, running executables that actually do things. Proteins build cell structures, catalyze reactions, transmit signals, regulate gene expression, transport molecules, and defend against pathogens. The cell is, in a very real sense, a protein machine — most of what makes a liver cell different from a neuron is the set of proteins each type produces and maintains.
Understanding proteins as a programmer means understanding three things: how they're built (translation), how they achieve their function (folding), and how that function can be predicted and analyzed computationally (structural biology and proteomics).
Translation: Executing the mRNA
Translation converts the nucleotide sequence of an mRNA into the amino acid sequence of a protein. It occurs at ribosomes — large complexes of rRNA and protein that function as the cell's runtime environment.
The ribosome reads mRNA in triplets (codons), and for each codon, recruits the corresponding aminoacyl-tRNA. The tRNA brings the correct amino acid, the ribosome forms a peptide bond between successive amino acids, and the growing polypeptide chain is extended one residue at a time.
The three stages:
Initiation
The small ribosomal subunit (40S in eukaryotes) associates with the mRNA at the 5' cap and scans for the start codon (AUG, encoding methionine). When it finds it, the large subunit (60S) joins, forming the complete 80S ribosome. Initiation requires multiple initiation factors and GTP hydrolysis.
Most eukaryotic mRNAs use cap-dependent translation. Viral mRNAs and some cellular stress-response mRNAs use IRES (Internal Ribosome Entry Sites) — RNA structures that recruit the ribosome directly to an internal site without cap recognition.
Elongation
The ribosome has three sites:
- A site (aminoacyl): where the incoming aminoacyl-tRNA binds
- P site (peptidyl): where the growing peptide chain is held
- E site (exit): where the spent tRNA exits
Each elongation cycle:
- An aminoacyl-tRNA with the right anticodon binds the A site
- Peptidyl transferase (the rRNA ribozyme) transfers the growing chain to the A-site amino acid, forming a new peptide bond
- The ribosome translocates one codon in the 5'→3' direction along the mRNA (moving the peptidyl-tRNA from A→P and the deacylated tRNA from P→E)
Speed: roughly 5–6 amino acids/second in eukaryotes (bacteria reach ~15–20). At 5 aa/s, a 300 aa protein takes about a minute to synthesize.
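The timing estimate is plain division, sketched below. The function name and the example rates are illustrative assumptions; measured elongation rates vary with organism, codon usage, and growth conditions, and the estimate ignores initiation, pausing, and termination.

```python
def synthesis_time_seconds(length_aa: int, rate_aa_per_s: float) -> float:
    """Elongation time for a chain of length_aa residues at a constant rate."""
    if rate_aa_per_s <= 0:
        raise ValueError("rate must be positive")
    return length_aa / rate_aa_per_s

# Compare a few assumed elongation rates for a 300-residue protein.
for rate in (5, 15, 20):
    print(f"300 aa at {rate} aa/s: {synthesis_time_seconds(300, rate):.0f} s")
```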
Termination
When a stop codon (UAA, UAG, or UGA) enters the A site, a release factor rather than a tRNA binds. This triggers hydrolysis of the peptide chain from the final tRNA, releasing the completed polypeptide. The ribosome then dissociates.
After release, the newly synthesized polypeptide is just a linear chain of amino acids. It doesn't become functional until it folds.
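The three stages map onto a short loop. The sketch below is a deliberate simplification: it uses the standard genetic code, starts at the first AUG it finds, and ignores cap recognition, sequence context around the start codon, and IRES elements. The function name and the example mRNA are made up for illustration.

```python
from itertools import product

# Standard genetic code, with bases ordered U, C, A, G at each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")                  # initiation: scan 5'->3' for AUG
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):  # elongation: one codon per cycle
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":                         # termination: stop codon reached
            break
        peptide.append(aa)
    return "".join(peptide)

print(translate("GGCAUGGCUUUUAAAUAAGG"))  # codons AUG GCU UUU AAA UAA -> MAFK
```

Characters outside A, C, G, U (or T) raise a `KeyError` here; a production tool would handle ambiguity codes explicitly.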
The 20 Amino Acids: The Type System
Proteins are built from 20 canonical amino acids, each defined by its side chain (R group). The side chain determines an amino acid's chemical character:
| Property | Amino acids | Functional consequence |
|---|---|---|
| Nonpolar/hydrophobic | Ala, Val, Leu, Ile, Pro, Phe, Trp, Met | Drive hydrophobic core formation |
| Polar, uncharged | Ser, Thr, Cys, Tyr, Asn, Gln | Hydrogen bonding, active site residues |
| Positively charged | Arg, Lys, His | DNA binding, salt bridges |
| Negatively charged | Asp, Glu | Catalysis, charge repulsion |
| Special | Gly (flexibility), Pro (rigidity, disrupts helices) | Structural roles |
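The table translates directly into a lookup, which is enough to summarize a sequence by side-chain class. This is a minimal sketch: the class labels are shorthand for the table rows, proline is filed under nonpolar (the table lists it twice), and the example peptide is hypothetical.

```python
from collections import Counter

# One-letter codes grouped by the side-chain classes in the table above.
SIDE_CHAIN_CLASS = {
    **dict.fromkeys("AVLIPFWM", "nonpolar"),
    **dict.fromkeys("STCYNQ", "polar"),
    **dict.fromkeys("RKH", "positive"),
    **dict.fromkeys("DE", "negative"),
    "G": "special",
}

def class_composition(sequence: str) -> dict:
    """Fraction of residues falling into each side-chain class."""
    if not sequence:
        return {}
    counts = Counter(SIDE_CHAIN_CLASS[aa] for aa in sequence.upper())
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

print(class_composition("MKVLAAGDE"))
```

A high nonpolar fraction over a stretch of roughly 20 residues is one rough hint of a membrane-spanning segment, though real predictors use more than composition.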
The sequence of amino acids (the primary structure) contains, for most proteins, the information needed to fold into the correct 3D shape. This is Anfinsen's dogma, based on refolding experiments published in 1961: a protein's native structure is the thermodynamic minimum for that sequence. No assembly instructions are required beyond the sequence itself, although in the crowded cell many proteins rely on chaperones to avoid misfolding and aggregation along the way.
Protein Folding: Compilation from Sequence to Structure
As the polypeptide chain emerges from the ribosome, it begins folding. Folding is driven by thermodynamics — the protein seeks its minimum free energy conformation — but it doesn't sample all possible configurations (that would take longer than the age of the universe). Instead, folding proceeds through a folding funnel: a landscape of conformations where energy decreases toward the native state, guiding the chain efficiently.
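The "longer than the age of the universe" claim (often called Levinthal's paradox) is easy to check with back-of-the-envelope arithmetic. The numbers below are assumptions chosen for illustration: 3 conformations per residue, a 100-residue chain, and 10^13 conformations sampled per second.

```python
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10

def exhaustive_search_years(n_residues: int,
                            states_per_residue: int = 3,
                            samples_per_second: float = 1e13) -> float:
    """Years needed to visit every conformation once by brute force."""
    conformations = states_per_residue ** n_residues
    return conformations / samples_per_second / SECONDS_PER_YEAR

years = exhaustive_search_years(100)
print(f"{years:.1e} years")
print(f"{years / AGE_OF_UNIVERSE_YEARS:.1e} times the age of the universe")
```

Under these assumptions the search takes on the order of 10^27 years, while real proteins typically fold in microseconds to seconds, which is why a guided, funnel-like landscape is needed to explain folding.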
Secondary Structure
The polypeptide backbone forms regular local structures stabilized by backbone hydrogen bonds:
- α-helix: a right-handed coil where every backbone NH forms a hydrogen bond with the backbone C=O four residues earlier. Roughly 1.5 Å rise per residue, 3.6 residues per turn. Common in membrane proteins (transmembrane α-helices) and many cytoplasmic proteins.
- β-sheet: extended strands arranged side-by-side, held together by interstrand hydrogen bonds. Can be parallel or antiparallel. Found in immunoglobulins, β-barrel membrane proteins, and amyloid fibrils.
- Loops and turns: irregularly structured regions connecting helices and strands. They often lie on the protein surface, where they contribute to binding sites and active sites.
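The α-helix numbers above (1.5 Å rise per residue, 3.6 residues per turn) support quick geometric estimates. The sketch below computes helix length and the residue count needed to cross a given distance; the 30 Å figure for the hydrophobic core of a membrane is an assumed round number, and the function names are illustrative.

```python
import math

RISE_PER_RESIDUE_A = 1.5   # angstroms travelled along the helix axis per residue
RESIDUES_PER_TURN = 3.6

def helix_length_angstrom(n_residues: int) -> float:
    """Axial length of an ideal alpha-helix with n_residues residues."""
    return n_residues * RISE_PER_RESIDUE_A

def residues_to_span(distance_angstrom: float) -> int:
    """Smallest number of helical residues covering the given distance."""
    return math.ceil(distance_angstrom / RISE_PER_RESIDUE_A)

pitch = RISE_PER_RESIDUE_A * RESIDUES_PER_TURN  # axial distance per full turn
print(f"pitch: {pitch:.1f} A per turn")
print(f"residues to span 30 A: {residues_to_span(30)}")
```

This is why transmembrane helices are typically about 20 residues long.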
Tertiary and Quaternary Structure
The full 3D arrangement of all atoms in a single polypeptide is its tertiary structure. It's stabilized by:
- Hydrophobic interactions (nonpolar residues cluster in the core away from water)
- Hydrogen bonds (between side chains and backbone)
- Disulfide bonds (covalent bonds between cysteine side chains — common in extracellular proteins)
- Salt bridges (between oppositely charged side chains)
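Many functional proteins are multi-subunit assemblies; the arrangement of those subunits is the quaternary structure. Hemoglobin is a tetramer (α₂β₂). The 26S proteasome contains more than 30 distinct subunits (the "26S" is a sedimentation coefficient, not a subunit count). The eukaryotic ribosome has about 80 proteins plus four rRNAs.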
For 50 years, predicting 3D structure from sequence alone was considered one of the hardest problems in science. In 2020, DeepMind's AlphaFold2 achieved near-experimental accuracy for many targets on the CASP14 benchmark, a result widely regarded as largely solving the problem for single-chain proteins. The AlphaFold Protein Structure Database now contains predicted structures for >200 million proteins, covering nearly every catalogued protein sequence. AlphaFold3 (2024) extended this to complexes with DNA, RNA, and small molecules.
For bioinformatics practitioners, this means structure-based analyses that previously required experimental data (X-ray crystallography, cryo-EM) are now available computationally for virtually any protein.
Protein Domains: Modules and Libraries
Evolution rarely builds proteins from scratch. Instead, it recombines and modifies existing structural units called domains — independently folding segments with defined structure and function that appear in many different proteins.
Classic examples:
- SH2 domain: binds phosphotyrosine residues. Roughly 120 SH2 domains occur across about 110 human proteins. Key transducer in receptor tyrosine kinase signaling.
- DNA-binding domains: zinc fingers, helix-turn-helix, leucine zippers — each with specific DNA sequence preferences
- Kinase domain: the catalytic core of protein kinases, responsible for phosphorylating serine, threonine, or tyrosine residues
- Ubiquitin-binding domains: recognize ubiquitin modifications on other proteins
A single protein can contain multiple domains from different "families," often connected by flexible linkers. This modularity means you can infer partial function from sequence alone: if you find an SH2 domain in an uncharacterized protein, it very likely binds phosphotyrosine-containing proteins.
The Pfam and InterPro databases catalog known protein domains and can be used to annotate predicted proteins from genomic sequence.
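The "domains as library modules" idea can be sketched as a lookup from domain names to function hints. Everything here is illustrative: the `Domain` class, the `DOMAIN_HINTS` table (filled from the examples above), and the coordinates of the hypothetical protein. Real annotations would come from a domain search against Pfam or InterPro, not from a hand-written list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    name: str
    start: int  # 1-based, inclusive residue positions
    end: int

# Function hints taken from the classic examples listed above.
DOMAIN_HINTS = {
    "SH2": "binds phosphotyrosine",
    "Kinase": "phosphorylates Ser/Thr/Tyr residues",
    "Zinc finger": "binds DNA",
    "UBD": "recognizes ubiquitin modifications",
}

def infer_functions(domains: list) -> list:
    """One hint per recognized domain, ordered from N- to C-terminus."""
    ordered = sorted(domains, key=lambda d: d.start)
    return [
        f"{d.name} ({d.start}-{d.end}): {DOMAIN_HINTS[d.name]}"
        for d in ordered
        if d.name in DOMAIN_HINTS
    ]

# Hypothetical domain hits for an uncharacterized protein.
hits = [Domain("Kinase", 260, 520), Domain("SH2", 140, 235)]
for line in infer_functions(hits):
    print(line)
```

An SH2 domain followed by a kinase domain already suggests a plausible role: a tyrosine-phosphorylation-dependent signaling enzyme.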
Post-Translational Modifications: Runtime Configuration
Proteins don't arrive at their final functional state straight from the ribosome. Post-translational modifications (PTMs) add functional groups after synthesis:
| PTM | Effect | Biological role |
|---|---|---|
| Phosphorylation | Adds negative charge, alters shape | Signal transduction on/off switches |
| Ubiquitination | Tags for proteasomal degradation or trafficking | Protein turnover, DNA repair |
| Glycosylation | Adds sugar chains | Membrane stability, cell recognition |
| Acetylation | Neutralizes positive charge | Histone regulation, metabolic enzymes |
| Methylation | Variable charge effect | Histone code, protein-protein interactions |
| Cleavage | Removes signal peptide or prodomain | Protein activation, secretion |
Phosphorylation alone accounts for ~70,000 known sites in the human proteome. Kinases (which add phosphate groups) and phosphatases (which remove them) form intricate regulatory networks; cellular signaling is largely written in the language of phosphorylation.
If the amino acid sequence is the binary, post-translational modifications are runtime state. The same protein can be active or inactive, nuclear or cytoplasmic, stable or targeted for degradation — all determined by which PTMs it carries at a given moment.
Phosphoproteomics (mass spectrometry that measures phosphorylation states) is analogous to runtime instrumentation: you're not reading the code, you're observing the running state of the system.
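The binary-versus-runtime-state analogy can be made concrete with a small data structure: the sequence never changes, while the set of phosphorylated positions does. This is a toy model under loud assumptions. The class and method names are invented, and the rule that a single phosphosite switches activity on is a simplification of real regulation.

```python
from dataclasses import dataclass, field

@dataclass
class ProteinState:
    sequence: str                                    # fixed by the gene
    phosphosites: set = field(default_factory=set)   # mutable runtime state

    def phosphorylate(self, position: int) -> None:
        """Kinase action: add a phosphate at a 1-based Ser/Thr/Tyr position."""
        if self.sequence[position - 1] not in "STY":
            raise ValueError(f"residue {position} is not Ser, Thr, or Tyr")
        self.phosphosites.add(position)

    def dephosphorylate(self, position: int) -> None:
        """Phosphatase action: remove the phosphate if one is present."""
        self.phosphosites.discard(position)

    def is_active(self, activating_site: int) -> bool:
        """Toy rule: active only while the activating site is phosphorylated."""
        return activating_site in self.phosphosites

protein = ProteinState("MASTYK")   # hypothetical peptide; Tyr is residue 5
protein.phosphorylate(5)
print(protein.is_active(5))        # phosphate present
protein.dephosphorylate(5)
print(protein.is_active(5))        # phosphate removed
```

A phosphoproteomics experiment is, in this picture, a snapshot of `phosphosites` across thousands of proteins at once.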
Protein Degradation: Garbage Collection
Proteins don't live forever. The cell has two main degradation pathways:
The ubiquitin-proteasome system (UPS): Proteins tagged with chains of ubiquitin (a small 76-aa protein) are recognized and degraded by the 26S proteasome — a large barrel-shaped complex whose central chamber contains proteases. This is the primary pathway for degrading cytoplasmic proteins, regulatory proteins with short half-lives, and misfolded proteins. ~80% of cellular protein degradation goes through the UPS.
Autophagy: Portions of cytoplasm — including whole organelles and protein aggregates — are engulfed by a double-membrane vesicle (autophagosome) that fuses with the lysosome for degradation. Used for bulk turnover, organelle quality control (mitophagy clears damaged mitochondria), and nutrient recycling during starvation.
Both pathways are tightly regulated. Dysfunction in either contributes to neurodegeneration (protein aggregation diseases like Parkinson's, Alzheimer's), cancer (inappropriate stabilization of oncoproteins), and aging.
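The practical meaning of a "short half-life" is easiest to see numerically. The sketch below assumes simple first-order decay with no new synthesis, which is an idealization, and the two half-lives are hypothetical values picked to contrast a fast-turnover regulator with a stable protein.

```python
def fraction_remaining(elapsed: float, half_life: float) -> float:
    """Fraction of an initial protein pool left after `elapsed` time units."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (elapsed / half_life)

# Hypothetical half-lives in minutes.
for name, t_half in (("short-lived regulator", 20), ("stable enzyme", 2880)):
    left = fraction_remaining(120, t_half)
    print(f"{name}: {left:.3f} of the pool left after 2 h")
```

Rapid turnover is what lets a regulatory protein's level track its synthesis rate closely, so the cell can switch it off quickly.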
Why Proteins Are Central to Bioinformatics
Almost everything in bioinformatics ultimately relates to proteins:
- Variant annotation asks: does this mutation change the protein sequence, structure, or stability?
- Drug discovery asks: which proteins are good targets, and how does a small molecule bind to them?
- Single-cell RNA-seq tells you which genes are transcribed, but protein abundance is the downstream readout
- Structural bioinformatics uses sequence to predict or analyze 3D structure
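The variant annotation question reduces, at its simplest, to comparing the reference and altered codons. The sketch below classifies a single-base substitution in a coding sequence using the standard genetic code. It assumes the sequence starts in frame and ignores splicing, indels, and strand handling, which real annotators must deal with; the function name and the example sequence are illustrative.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(cds: str, pos: int, alt: str) -> str:
    """Classify a single-base change at 0-based index pos of an in-frame CDS."""
    cds = cds.upper()
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"      # protein sequence unchanged
    if alt_aa == "*":
        return "nonsense"        # premature stop codon
    if ref_aa == "*":
        return "stop-lost"       # stop codon replaced by an amino acid
    return "missense"            # one amino acid swapped for another

cds = "ATGGAGAAATAA"  # hypothetical CDS: ATG GAG AAA TAA
print(classify_substitution(cds, 4, "T"))  # GAG -> GTG (Glu -> Val)
print(classify_substitution(cds, 5, "A"))  # GAG -> GAA (Glu -> Glu)
print(classify_substitution(cds, 3, "T"))  # GAG -> TAG (Glu -> stop)
```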
Proteomics — mass spectrometry-based measurement of protein abundances and modifications — is becoming increasingly important alongside transcriptomics. Unlike mRNA, proteins are directly functional; the correlation between mRNA abundance and protein abundance is only moderate (~0.4–0.6 Pearson r in most studies). The reasons include differential translation efficiency, variable protein stability, and PTM regulation.
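A toy steady-state model shows why the correlation is only moderate. If protein level is mRNA level times translation rate divided by degradation rate, then gene-to-gene variation in the last two terms decouples protein from mRNA. All numbers below are invented for illustration and are not measurements, and the steady-state formula is an assumed simplification.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def steady_state_protein(mrna, k_translation, half_life_h):
    """Protein copies at steady state: synthesis rate / degradation rate."""
    k_degradation = math.log(2) / half_life_h
    return mrna * k_translation / k_degradation

# Hypothetical genes: (mRNA copies, proteins per mRNA per hour, half-life in h)
genes = [(10, 50, 20), (20, 5, 2), (40, 20, 1), (80, 2, 10), (160, 10, 5)]

mrna_levels = [g[0] for g in genes]
protein_levels = [steady_state_protein(*g) for g in genes]
r = pearson_r(mrna_levels, protein_levels)
print(f"Pearson r between mRNA and protein: {r:.2f}")
```

With identical translation rates and half-lives for every gene, the same code gives r = 1, so the drop in correlation here comes entirely from the per-gene variation.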
Knowing the protein — its sequence, structure, modifications, binding partners, and stability — is knowing what the cell is actually doing right now.