Gene Expression

Gene expression is the process by which genetic information becomes functional output — controlled by transcription factors, promoters, and a layered regulatory architecture.

The human genome contains roughly 20,000 protein-coding genes. Every cell in your body carries all of them. A liver cell and a neuron have identical DNA. Yet they look different, behave differently, produce different proteins, and have dramatically different functional properties.

The difference is gene expression: which genes are turned on, at what levels, and in response to what signals. Gene expression is not a binary switch; it's a continuous, dynamically regulated process that determines cell identity, mediates cellular responses to the environment, and underlies development from a single fertilized egg to an organism of tens of trillions of cells.

What "Expression" Means

When we say a gene is "expressed," we mean it's being actively transcribed into RNA, and (for protein-coding genes) that RNA is being translated into protein. When we say a gene is "upregulated," we mean it's producing more RNA and/or protein than baseline. "Downregulated" means less.

In practice, because RNA-seq measures mRNA abundance as a proxy, "gene expression" in bioinformatics usually refers specifically to the mRNA level. This is a useful approximation but not a perfect one — mRNA levels don't always reflect protein levels, and non-coding RNA expression is often measured separately.

Transcription Factors: The Signal Integrators

The primary mechanism for controlling which genes are expressed is transcription factors (TFs) — proteins that bind specific DNA sequences and regulate RNA polymerase activity.

TFs work by binding to regulatory sequences in gene promoters and enhancers. When a TF binds near a promoter, it can:

  • Recruit the general transcription machinery (activators)
  • Block RNA polymerase recruitment (repressors)
  • Stabilize or destabilize nucleosomes (by recruiting chromatin remodelers)
  • Recruit histone-modifying enzymes
Transcription factors as environment-specific build configuration

Imagine your genome as a monorepo with 20,000 modules. Each module has a build configuration that specifies: "build this module if ENV_LIVER is set AND NOT ENV_NEURON AND ENV_OXYGEN > 0.2." Transcription factors are the environment variables. The cell's current TF repertoire determines which configurations evaluate to true, and therefore which genes are built.

The key insight: TFs are themselves proteins encoded by genes. The regulatory system is itself regulated by the very same mechanism.

The human genome encodes approximately 1,600 transcription factors — roughly 8% of all protein-coding genes. Many TFs function in combinations: the combinatorial code of TFs present determines the gene expression profile. A liver-specific gene might require the simultaneous binding of HNF4α, FOXA2, and C/EBPα to its enhancer. Individually, none of them is sufficient.
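
The all-or-nothing requirement described above can be sketched as a set check. This is a deliberately simplified Boolean model, since real enhancers integrate TF occupancy quantitatively. The `EnhancerRule` class, its `forbidden` field, and the cell profiles are illustrative assumptions; the labels HNF4A, FOXA2, and CEBPA stand in for HNF4α, FOXA2, and C/EBPα.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnhancerRule:
    """Hypothetical rule: every `required` TF present, no `forbidden` TF present."""
    required: frozenset
    forbidden: frozenset = frozenset()


def is_expressed(rule, tfs_present):
    """Return True if the cell's TF repertoire satisfies the rule."""
    tfs = set(tfs_present)
    return rule.required <= tfs and not (rule.forbidden & tfs)


# Illustrative liver-specific gene needing all three TFs at its enhancer.
liver_gene = EnhancerRule(required=frozenset({"HNF4A", "FOXA2", "CEBPA"}))

hepatocyte_tfs = {"HNF4A", "FOXA2", "CEBPA", "SP1"}  # made-up cell profile
partial_tfs = {"HNF4A", "FOXA2"}                     # one required TF missing

print(is_expressed(liver_gene, hepatocyte_tfs))  # True
print(is_expressed(liver_gene, partial_tfs))     # False
```

The second call returns False even though two of the three factors are present, which mirrors the point that no single TF, or partial combination, is sufficient.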

Promoter Architecture

A gene's promoter is not a simple on/off switch. It's a regulatory element with multiple functional modules:

Core Promoter

The minimal region sufficient for transcription initiation. Contains the TATA box (TATAAA, ~−30 from start), the initiator element (Inr, at the +1 site), and/or a downstream promoter element (DPE, ~+30). These elements position and orient RNA polymerase II.

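Only a minority of human promoters contain a TATA box; published estimates range from roughly 10% to 30%, depending on how strictly the motif is defined. Most use the Inr or other elements. "TATA-less" promoters are common for housekeeping genes and often lie within CpG islands.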
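
As a rough illustration of what "a TATA box at about −30" means in coordinates, the sketch below searches a window upstream of the transcription start site for the exact string TATAAA. The function name, the window bounds, and the toy sequence are assumptions made for this example; real promoter scanners typically score degenerate motifs rather than matching one exact string.

```python
def find_tata_box(seq, tss_index, motif="TATAAA", window=(-40, -20)):
    """Return start positions of `motif`, relative to the TSS, inside `window`.

    `tss_index` is the 0-based index of the +1 base, so the base just
    upstream of it is -1 (there is no position 0 in this convention).
    """
    seq = seq.upper()
    lo = max(0, tss_index + window[0])
    # Allow a motif that *starts* at the downstream edge of the window.
    hi = min(len(seq), tss_index + window[1] + len(motif))
    hits = []
    start = seq.find(motif, lo, hi)
    while start != -1:
        hits.append(start - tss_index)
        start = seq.find(motif, start + 1, hi)
    return hits


# Toy promoter: TATAAA starts 30 bases upstream of the +1 base at index 100.
promoter = "G" * 70 + "TATAAA" + "C" * 24 + "A" + "G" * 20
print(find_tata_box(promoter, tss_index=100))    # [-30]
print(find_tata_box("G" * 121, tss_index=100))   # [] (a "TATA-less" toy promoter)
```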

Proximal Promoter

Extends ~500 bp upstream. Contains binding sites for specific TFs that regulate the gene. Often contains GC boxes (bound by Sp1) and other common regulatory elements.

Distal Regulatory Elements

Enhancers can be located up to megabases away from the gene they regulate, looping through 3D space to contact the promoter. They function as signal integration hubs: when the right combination of TFs binds, they activate the promoter through direct contact (mediated by the Mediator coactivator complex).

Identifying which enhancers regulate which genes is an active area of genomics research — the 3D genome folding problem.

Signal Transduction → Transcription

Most environmental signals ultimately control gene expression through one of two mechanisms:

Cytoplasmic-to-nuclear signaling: A signal (growth factor, cytokine) binds a receptor at the cell surface, triggering a kinase cascade that activates a dormant TF, either by phosphorylating the TF itself or by removing an inhibitor that holds it in the cytoplasm. The active TF then drives target-gene transcription in the nucleus.

Classic examples:

  • JAK-STAT pathway: Cytokine binds receptor → JAKs phosphorylate STATs → STATs dimerize and enter nucleus → activate immune response genes
  • MAPK/ERK pathway: Growth factor → receptor tyrosine kinase → RAS → RAF → MEK → ERK → phosphorylates TFs like ELK1, c-Fos → cell proliferation genes
  • NF-κB pathway: Inflammatory signal → IκB kinase → IκB degradation → NF-κB enters nucleus → inflammatory gene expression

Steroid/nuclear receptor pathway: Lipid-soluble signaling molecules (steroids, thyroid hormone, retinoic acid) diffuse through the membrane, bind nuclear receptors in the cytoplasm or nucleus, and directly activate gene expression. The receptor-ligand complex is itself the TF.

Why steroid hormones are so potent as drugs

Because steroid-like molecules directly activate nuclear receptors that then regulate hundreds of genes, drugs targeting nuclear receptors have sweeping effects. Glucocorticoids (like dexamethasone) suppress inflammation by binding the glucocorticoid receptor and activating anti-inflammatory genes while suppressing pro-inflammatory ones. This potency is also why they have significant side effects — many genes are changed simultaneously.

Measuring Gene Expression: RNA-seq

RNA sequencing (RNA-seq) is the standard method for measuring gene expression genome-wide. The workflow:

  1. Extract total RNA from cells/tissue
  2. Enrich for mRNA (poly-A selection) or remove ribosomal RNA (ribo-depletion)
  3. Convert to cDNA (reverse transcription)
  4. Fragment and ligate sequencing adapters
  5. Sequence millions of short reads (50–150 bp)
  6. Align reads to the reference genome
  7. Count reads per gene
  8. Normalize and perform statistical testing

The count matrix — genes × samples — is the fundamental data structure of transcriptomics. Each entry is the number of sequencing reads that mapped to a given gene in a given sample.
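
A minimal sketch of that data structure, assuming each mapped read has already been assigned to a (sample, gene) pair. The function name and the toy read assignments are invented for illustration; real pipelines build this matrix with dedicated counting tools.

```python
from collections import Counter


def build_count_matrix(assignments):
    """Build a genes x samples count matrix from (sample, gene) read assignments.

    Returns (samples, matrix), where matrix[gene] is a list of counts in the
    same order as `samples`.
    """
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in counts})
    genes = sorted({gene for _, gene in counts})
    # Counter returns 0 for (sample, gene) pairs that were never observed.
    matrix = {g: [counts[(s, g)] for s in samples] for g in genes}
    return samples, matrix


# Toy input: one tuple per mapped read (values are made up).
reads = [
    ("liver", "ALB"), ("liver", "ALB"), ("liver", "ACTB"),
    ("neuron", "ACTB"), ("neuron", "SYP"),
]
samples, matrix = build_count_matrix(reads)
print(samples)        # ['liver', 'neuron']
print(matrix["ALB"])  # [2, 0]
```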

Normalization

Raw counts can't be compared directly because:

  • Samples are sequenced to different depths (total reads differ)
  • Longer genes generate more reads than shorter ones

Common normalization methods:

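  • TPM (Transcripts Per Million): divides each gene's count by its length, then rescales so the values in a sample sum to one million. Best for comparing genes within a sample; because every sample sums to the same total, TPM values are also more comparable across samples and studies than raw counts, though they do not correct for composition bias.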
  • CPM (Counts Per Million): normalizes to per-million reads without length correction. For RNA-seq, appropriate when comparing the same gene across samples.
  • TMM/DESeq2 normalization: more sophisticated methods that account for composition bias (a few highly expressed genes dominating the total count).
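
The first two methods reduce to a few lines of arithmetic. The sketch below works on one sample at a time; the gene names, counts, and lengths are made-up values, and the function names are chosen for this example only.

```python
def cpm(counts):
    """Counts per million: rescale by sequencing depth only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    # Reads per kilobase of gene length.
    rate = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}


# Toy sample: geneA and geneB have equal counts, but geneB is twice as long.
counts = {"geneA": 100, "geneB": 100, "geneC": 800}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 4000}

print(cpm(counts))
print(tpm(counts, lengths_bp))
```

In this toy sample geneA and geneB get the same CPM, while TPM ranks geneA twice as high as geneB because the same number of reads came from a gene half as long.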

Differential Expression Analysis

The core question in most transcriptomics studies: which genes change between condition A and condition B?

Statistical tools (DESeq2, edgeR, limma) model count data using negative binomial distributions (RNA-seq counts are overdispersed relative to Poisson) and perform hypothesis testing per gene. Output:

  • Log₂ fold change (LFC): how much does expression change? LFC = 2 means 2² = 4× higher in condition A than in condition B; LFC = −1 means half as much.
  • Adjusted p-value: false discovery rate-corrected significance. Typically threshold at 0.05 or 0.1.
  • Volcano plot: visualizes all genes by LFC vs. -log₁₀(p-value)
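
To make "overdispersed relative to Poisson" concrete: under a Poisson model the variance equals the mean, while the negative binomial adds a dispersion term, variance = μ + α·μ². The sketch below estimates α for one gene by the method of moments. The replicate counts are invented, and this crude per-gene estimate is only an illustration; tools such as DESeq2 and edgeR estimate dispersion with more robust procedures that share information across genes.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of alpha in: variance = mu + alpha * mu**2."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    return (var - mu) / (mu * mu)


# Made-up counts for one gene across five replicates of the same condition.
replicates = [90, 110, 150, 60, 140]

print(mean(replicates))      # 110
print(variance(replicates))  # 1350, far above the Poisson expectation of 110
print(round(moment_dispersion(replicates), 3))
```

A value of α near zero would be consistent with Poisson noise; a clearly positive value indicates extra replicate-to-replicate variability.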

A gene with |LFC| > 1 and adj. p < 0.05 is typically considered differentially expressed, though thresholds vary by context.
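
The thresholding rule above can be sketched as follows, assuming per-gene LFCs and raw p-values have already been produced by a statistical tool. The adjustment shown is the Benjamini-Hochberg procedure, one common way to control the false discovery rate; the gene names, values, and function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(results, lfc_cut=1.0, alpha=0.05):
    """results: {gene: (lfc, raw_p)}. Return genes with |LFC| > cut and padj < alpha."""
    genes = list(results)
    padj = benjamini_hochberg([results[g][1] for g in genes])
    return [g for g, q in zip(genes, padj)
            if abs(results[g][0]) > lfc_cut and q < alpha]


# Made-up per-gene results: (log2 fold change, raw p-value).
results = {
    "geneA": (2.0, 0.001),   # large effect, significant
    "geneB": (0.4, 0.01),    # significant, but effect too small
    "geneC": (-1.5, 0.02),   # large decrease, significant
    "geneD": (3.0, 0.6),     # large effect, not significant
}
print(call_de(results))  # ['geneA', 'geneC']
```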

Cell-Type-Specific Expression

Not all cells express all genes equally. The combinatorial TF code creates cell-type-specific expression programs:

  • Tissue-specific genes: expressed in one tissue but not others (e.g., albumin is liver-specific; insulin is β-cell-specific)
  • Housekeeping genes: expressed in all cell types at similar levels (ribosomal proteins, metabolic enzymes)
  • Inducible genes: expressed only in response to specific signals (interferon-stimulated genes, heat shock proteins)
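
A rough sketch of how the first two categories might be told apart from an expression table. The 1 TPM detection cutoff, the labels, and all expression values are assumptions chosen for illustration, not an established classification scheme. Inducible genes cannot be identified this way, since they require measurements with and without the inducing signal.

```python
def classify_gene(tpm_by_tissue, expressed_at=1.0):
    """Label a gene by how many tissues express it at or above `expressed_at` TPM."""
    expressed = [t for t, value in tpm_by_tissue.items() if value >= expressed_at]
    if not expressed:
        return "not detected"
    if len(expressed) == 1:
        return "tissue-specific: " + expressed[0]
    if len(expressed) == len(tpm_by_tissue):
        return "broadly expressed"
    return "restricted to several tissues"


# Made-up TPM values for three tissues.
table = {
    "ALB":   {"liver": 9000.0, "brain": 0.1, "pancreas": 0.3},
    "RPL13": {"liver": 450.0, "brain": 520.0, "pancreas": 480.0},
}
for gene, profile in table.items():
    print(gene, "->", classify_gene(profile))
```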

Single-cell RNA sequencing (scRNA-seq) has revealed that this picture is far more complex than bulk RNA-seq suggests. What appears as a homogeneous cell population often contains multiple distinct subpopulations with different expression programs. A tumor biopsy contains not just tumor cells but fibroblasts, endothelial cells, immune cells — each with distinct expression profiles.

Gene Expression Atlases

Large-scale projects have catalogued expression across human tissues and cell types:

  • GTEx (Genotype-Tissue Expression project): bulk RNA-seq from ~50 tissues from hundreds of human donors. Essential for understanding tissue-specific expression and eQTL mapping (genetic variants that affect expression).

  • Human Cell Atlas: single-cell RNA-seq from all major human cell types. Ongoing; aims to characterize every cell type in the human body.

  • FANTOM5: cap-analysis gene expression (CAGE) data capturing transcription start sites at single-nucleotide resolution across hundreds of human cell types.

These atlases serve as reference datasets: if you find a gene is upregulated in your tumor samples, GTEx tells you whether it's also normally expressed in that tissue, or whether it's aberrant.

Gene Set Enrichment Analysis (GSEA)

A single list of differentially expressed genes is hard to interpret. Gene set enrichment analysis shifts the focus from individual genes to pathways and functional modules.

The idea: given a ranked list of genes (e.g., ranked by fold change), is a predefined gene set (e.g., "genes in the PI3K/AKT pathway") overrepresented at the top or bottom of the list? If so, the pathway is enriched in your condition.
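
The "top or bottom of the list" idea can be sketched as a running sum: walk down the ranked list, step up at each gene in the set and down at each gene outside it, and report the largest deviation from zero. This is a simplified, unweighted variant written for illustration; the published GSEA method additionally weights steps by each gene's ranking statistic and assesses significance by permutation, neither of which is shown here. Gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for `gene_set` in `ranked_genes`."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step sizes are chosen so the walk ends back at zero.
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Placeholder genes ranked from most up-regulated (g1) to most down-regulated (g8).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

print(enrichment_score(ranked, {"g1", "g2", "g3"}))  # close to +1: set sits at the top
print(enrichment_score(ranked, {"g6", "g7", "g8"}))  # close to -1: set sits at the bottom
```

A score near zero would indicate that the set's genes are scattered through the ranking rather than concentrated at either end.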

Tools: GSEA (original), fgsea (fast), clusterProfiler. Gene sets come from MSigDB, KEGG, Reactome, GO (Gene Ontology).

GSEA transforms "500 individual genes changed" into "mTOR signaling, oxidative phosphorylation, and cell cycle were the most affected pathways" — a much more interpretable biological statement.

Why Gene Expression Is the Central Data Type in Biomedicine

Gene expression data is ubiquitous in biomedical research because it provides a functional readout of cell state. You can:

  • Compare healthy vs. diseased tissue
  • Track cellular responses to drug treatment
  • Identify biomarkers for disease stratification
  • Understand developmental trajectories
  • Characterize the tumor microenvironment

The tools and concepts from this chapter — TF regulation, RNA-seq workflow, differential expression, pathway enrichment — are foundational for essentially all transcriptomics analysis. They appear repeatedly throughout the rest of the curriculum.