Gene Expression

Gene expression: TF binding → transcription → mRNA

The human genome contains roughly 20,000 protein-coding genes. Every cell in your body carries all of them. A liver cell and a neuron have identical genomes. Yet they look different, behave differently, produce different proteins, and have dramatically different functional properties.

The difference is gene expression: which genes are turned on, at what levels, and in response to what signals. Gene expression is not a binary switch; it's a continuous, dynamically regulated process that determines cell identity, mediates cellular responses to the environment, and underlies development from a single fertilized egg to a multi-trillion-cell organism.

What "Expression" Means

When we say a gene is "expressed," we mean it's being actively transcribed into RNA, and (for protein-coding genes) that RNA is being translated into protein. When we say a gene is "upregulated," we mean it's producing more RNA and/or protein than baseline. "Downregulated" means less.

In practice, because RNA-seq measures mRNA abundance as a proxy, "gene expression" in bioinformatics usually refers specifically to the mRNA level. This is a useful approximation but not a perfect one: mRNA levels don't always reflect protein levels, and non-coding RNA expression is often measured separately.

Transcription Factors: The Signal Integrators

The primary mechanism for controlling which genes are expressed is transcription factors (TFs): proteins that bind specific DNA sequences and regulate RNA polymerase activity.

TFs work by binding to regulatory sequences in promoters and enhancers. When a TF binds near a gene, it can:

Recruit the general transcription machinery (activators)
Block RNA polymerase recruitment (repressors)
Stabilize or destabilize nucleosomes (chromatin remodelers)
Recruit histone-modifying enzymes

Transcription factors as environment-specific build configuration

Imagine your genome as a monorepo with 20,000 modules. Each module has a build configuration that specifies: "build this module if ENV_LIVER is set AND NOT ENV_NEURON AND ENV_OXYGEN > 0.2." Transcription factors are the environment variables. The cell's current TF repertoire determines which configurations evaluate to true, and therefore which genes are built.

The key insight: TFs are themselves encoded by genes. The regulatory system is itself regulated by the very same mechanism.

The human genome encodes approximately 1,600 transcription factors, roughly 8% of all protein-coding genes. Many TFs function in combinations: the combinatorial code of TFs present determines the gene expression profile. A liver-specific gene might require the simultaneous binding of HNF4α, FOXA2, and C/EBPα to its promoter. Individually, none of them is sufficient.
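The combinatorial requirement can be sketched as a set-membership check. This is a toy model under strong assumptions: the gene names `LIVER_GENE` and `HOUSEKEEPING_GENE` and the rule table are hypothetical, and real regulation is quantitative rather than a strict AND rule.

```python
# Hypothetical rule table: each gene lists the TFs that must ALL be present.
REQUIRED_TFS = {
    "LIVER_GENE": {"HNF4A", "FOXA2", "CEBPA"},  # the three TFs named in the text
    "HOUSEKEEPING_GENE": set(),                 # no cell-type-specific requirement
}


def expressed_genes(tfs_present, rules=REQUIRED_TFS):
    """Return genes whose required TFs are all present (a pure AND rule)."""
    present = set(tfs_present)
    return sorted(gene for gene, needed in rules.items() if needed <= present)


liver_cell = {"HNF4A", "FOXA2", "CEBPA"}
partial = {"HNF4A", "FOXA2"}  # one required TF is missing

print(expressed_genes(liver_cell))  # ['HOUSEKEEPING_GENE', 'LIVER_GENE']
print(expressed_genes(partial))     # ['HOUSEKEEPING_GENE']
```

With only two of the three factors present, the liver gene stays off, which mirrors the statement that no single TF is sufficient.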

Promoter Architecture

A gene's promoter is not a simple on/off switch. It's a regulatory element with multiple functional modules:

Core Promoter

The minimal region sufficient for transcription initiation. Contains the TATA box (TATAAA, ~−30 from the transcription start), the initiator element (Inr, at the +1 site), and/or a downstream promoter element (DPE, ~+30). These elements position and orient RNA polymerase II.

About 20–30% of human promoters contain a TATA box. Most use the Inr or other elements. "TATA-less" promoters are common for housekeeping genes and often contain CpG islands.
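As a rough illustration of what "a TATA box at about −30" means in coordinates, the sketch below scans the sequence immediately upstream of a transcription start site (TSS) for the exact string TATAAA. The function name, the ±10 bp window, and the exact-match rule are assumptions made for this example; real promoter scanners use position weight matrices rather than a fixed string.

```python
import re


def find_tata_boxes(upstream_seq, window=(-40, -20)):
    """Return TSS-relative start positions of TATAAA matches inside the window.

    upstream_seq is the sequence immediately upstream of the TSS, written
    5'->3', so its last base sits at position -1.
    """
    seq = upstream_seq.upper()
    hits = []
    # The lookahead lets overlapping matches be reported too.
    for match in re.finditer(r"(?=TATAAA)", seq):
        position = match.start() - len(seq)  # index 0 maps to -len(seq)
        if window[0] <= position <= window[1]:
            hits.append(position)
    return hits


# Toy 50 bp upstream region with a TATA box starting at -30.
toy_promoter = "G" * 20 + "TATAAA" + "C" * 24
print(find_tata_boxes(toy_promoter))  # [-30]
print(find_tata_boxes("GC" * 25))     # []
```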

Proximal Promoter

Extends ~500 bp upstream. Contains binding sites for specific TFs that regulate the gene. Often contains GC boxes (bound by Sp1) and other common regulatory elements.

Distal Regulatory Elements

Enhancers can be located up to megabases away from the genes they regulate, looping through 3D space to contact the promoter. They function as signal integration hubs: when the right combination of TFs binds, they activate the promoter through direct contact (mediated by the Mediator coactivator complex).

Identifying which enhancers regulate which genes is an active area of genomics research: the 3D genome folding problem.

Signal Transduction → Transcription

Most environmental signals ultimately control transcription through one of two mechanisms:

Cytoplasmic-to-nuclear signaling: A signal (growth factor, cytokine) binds a receptor at the cell surface, triggering a kinase cascade that phosphorylates a dormant TF, causing it to translocate to the nucleus and activate target genes.

Classic examples:

JAK-STAT pathway: Cytokine binds receptor → JAKs phosphorylate STATs → STATs dimerize and enter nucleus → activate immune response genes
MAPK/ERK pathway: Growth factor → RAS → RAF → MEK → ERK → phosphorylates TFs like ELK1, c-Fos → proliferation genes
NF-κB pathway: Inflammatory signal → IκB kinase → IκB degradation → NF-κB enters nucleus → inflammatory genes

Steroid/nuclear receptor signaling: Lipid-soluble signaling molecules (steroids, thyroid hormone, retinoic acid) diffuse through the cell membrane, bind nuclear receptors in the cytoplasm or nucleus, and directly activate transcription. The hormone-receptor complex is itself the TF.

Why steroid hormones are so potent as drugs

Because steroid-like molecules directly activate nuclear receptors that then regulate hundreds of genes, drugs targeting nuclear receptors have sweeping effects. Glucocorticoids (like dexamethasone) suppress inflammation by binding the glucocorticoid receptor and activating anti-inflammatory genes while suppressing pro-inflammatory ones. This potency is also why they have significant side effects: many genes are changed simultaneously.

Measuring Gene Expression: RNA-seq

RNA sequencing (RNA-seq) is the standard method for measuring gene expression genome-wide. The workflow:

Extract total RNA from cells/tissue
Select mRNA (poly-A selection or ribo-depletion)
Convert to cDNA (reverse transcription)
Fragment and ligate adapters
Sequence millions of short reads (50–150 bp)
Align reads to the reference genome
Count reads per gene
Normalize and perform statistical testing

The count matrix (genes × samples) is the fundamental data structure of transcriptomics. Each entry is the number of reads that mapped to a given gene in a given sample.
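To make the genes × samples layout concrete, here is a minimal sketch that tallies per-read gene assignments into a count matrix. It assumes alignment has already happened and that each read has been assigned to one gene; the input format, the function name, and the toy reads are all invented for illustration.

```python
from collections import Counter


def build_count_matrix(assignments):
    """Turn {sample: [gene assigned to each read]} into (genes, samples, matrix).

    matrix[i][j] is the number of reads assigned to genes[i] in samples[j].
    """
    samples = sorted(assignments)
    counts = {s: Counter(assignments[s]) for s in samples}
    genes = sorted({g for c in counts.values() for g in c})
    # Counter returns 0 for genes that received no reads in a sample.
    matrix = [[counts[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


# Toy data: each list entry stands for one aligned read.
toy_reads = {
    "liver": ["ALB", "ALB", "ALB", "ACTB"],
    "brain": ["ACTB", "SNAP25", "SNAP25"],
}
genes, samples, matrix = build_count_matrix(toy_reads)
print(samples)  # ['brain', 'liver']
for gene, row in zip(genes, matrix):
    print(gene, row)
# ACTB [1, 1]
# ALB [0, 3]
# SNAP25 [2, 0]
```

Real pipelines produce this table with dedicated counting tools; the point here is only the shape of the result.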

Normalization

Raw counts can't be compared directly because:

Samples are sequenced to different depths (total reads differ)
Longer genes generate more reads than shorter ones

Common normalization methods:

TPM (Transcripts Per Million): divides counts by gene length, then scales each sample to a total of one million. Best for within-sample comparisons and cross-study comparisons.
CPM (Counts Per Million): normalizes to per-million reads without length correction. For differential expression, appropriate when comparing the same gene across samples.
TMM/DESeq2 normalization: more sophisticated methods that account for composition bias (a few highly expressed genes dominating the total count).
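The arithmetic behind CPM and TPM fits in a few lines. This is a sketch on made-up numbers for a single sample; gene names, counts, and lengths are hypothetical, and lengths are given in kilobases. TMM and DESeq2's size factors are not shown.

```python
def cpm(counts):
    """Counts per million: scale by sequencing depth only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: length-correct first, then scale to one million."""
    rate = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}


# Hypothetical sample: gene C is long, so it collects many reads.
counts = {"A": 100, "B": 100, "C": 800}
lengths_kb = {"A": 1.0, "B": 2.0, "C": 8.0}

print({g: round(v) for g, v in cpm(counts).items()})
# {'A': 100000, 'B': 100000, 'C': 800000}
print({g: round(v) for g, v in tpm(counts, lengths_kb).items()})
# {'A': 400000, 'B': 200000, 'C': 400000}
```

By CPM, gene C looks eight times more abundant than gene A. After length correction the two are equal, which is the reason TPM is preferred when comparing different genes within one sample.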

Differential Expression Analysis

The core question in most transcriptomics studies: which genes change between condition A and condition B?

Statistical tools such as DESeq2 and edgeR model count data using negative binomial distributions (RNA-seq counts are overdispersed relative to Poisson), while limma (with voom) fits linear models to log-transformed counts. All of them perform hypothesis testing per gene. Output:

Log₂ fold change (LFC): how much does expression change? LFC = 2 means 4× higher in condition A.
Adjusted p-value: significance corrected for multiple testing. Typically thresholded at 0.05 or 0.1.
Volcano plot: visualizes all genes by LFC vs. -log₁₀(p-value)

A gene with |LFC| > 1 and adj. p < 0.05 is typically considered differentially expressed, though thresholds vary by context.
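The three quantities above can be computed by hand on toy numbers. The sketch below assumes a Benjamini-Hochberg correction for the adjusted p-value (one common choice, not the only one) and a pseudocount of 1 in the fold change; the function names and the input values are hypothetical. It does not perform the negative binomial modelling that produces the raw p-values.

```python
import math


def log2_fold_change(mean_a, mean_b, pseudocount=1.0):
    """Log2 ratio of condition A over condition B; the pseudocount avoids log(0)."""
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def is_differentially_expressed(lfc, padj, lfc_cutoff=1.0, alpha=0.05):
    """Apply the |LFC| > 1 and adjusted p < 0.05 rule of thumb."""
    return abs(lfc) > lfc_cutoff and padj < alpha


print(log2_fold_change(399, 99))  # 2.0, i.e. 4x higher in condition A
padj = benjamini_hochberg([0.001, 0.01, 0.03, 0.5])
print([round(p, 3) for p in padj])  # [0.004, 0.02, 0.04, 0.5]
print(is_differentially_expressed(2.0, padj[0]))  # True
```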

Cell-Type-Specific Expression

Not all cells express all genes equally. The combinatorial TF code creates cell-type-specific expression programs:

Tissue-specific genes: expressed in one tissue but not others (e.g., albumin is liver-specific; insulin is β-cell-specific)
Housekeeping genes: expressed in all cell types at similar levels (ribosomal proteins, metabolic enzymes)
Inducible genes: expressed only in response to specific signals (interferon-stimulated genes, heat shock proteins)

Single-cell RNA-seq (scRNA-seq) has revealed that this picture is far more complex than bulk RNA-seq suggests. What appears as a homogeneous cell population often contains multiple distinct subpopulations with different expression programs. A tumor biopsy contains not just tumor cells but fibroblasts, endothelial cells, and immune cells, each with distinct expression profiles.
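One way to see why bulk measurements hide this heterogeneity is to treat a bulk profile as a fraction-weighted average of the cell-type profiles in the sample. The sketch below is a simplification under that assumption; the marker names, expression values, and fractions are invented.

```python
def bulk_profile(cell_type_profiles, fractions):
    """Fraction-weighted average expression across cell types, per gene."""
    genes = next(iter(cell_type_profiles.values())).keys()
    return {
        gene: sum(fractions[ct] * cell_type_profiles[ct][gene] for ct in fractions)
        for gene in genes
    }


# Hypothetical biopsy: 75% tumor cells, 25% immune cells.
profiles = {
    "tumor": {"TUMOR_MARKER": 100.0, "IMMUNE_MARKER": 0.0},
    "immune": {"TUMOR_MARKER": 0.0, "IMMUNE_MARKER": 50.0},
}
fractions = {"tumor": 0.75, "immune": 0.25}

print(bulk_profile(profiles, fractions))
# {'TUMOR_MARKER': 75.0, 'IMMUNE_MARKER': 12.5}
```

The bulk values alone cannot tell you whether every cell expresses the immune marker at a low level or a quarter of the cells express it strongly. Single-cell data resolves that ambiguity.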

Gene Expression Atlases

Large-scale projects have catalogued expression across human tissues and cell types:

GTEx (Genotype-Tissue Expression project): bulk RNA-seq from ~50 tissues from hundreds of human donors. Essential for understanding tissue-specific expression and eQTL mapping (genetic variants that affect expression).
Human Cell Atlas: single-cell RNA-seq from all major human cell types. Ongoing; aims to characterize every cell type in the human body.
FANTOM5: cap-analysis gene expression (CAGE) data capturing transcription start sites at single-nucleotide resolution across hundreds of human cell types.

These atlases serve as reference datasets: if you find a gene is upregulated in your tumor samples, GTEx tells you whether it's also normally expressed in that tissue, or whether it's aberrant.

Gene Set Enrichment Analysis (GSEA)

A single list of differentially expressed genes is hard to interpret. Gene set enrichment analysis shifts the focus from individual genes to pathways and functional modules.

The idea: given a ranked list of genes (e.g., ranked by fold change), is a predefined gene set (e.g., "genes in the PI3K/AKT pathway") overrepresented at the top or bottom of the list? If so, the pathway is enriched in your condition.
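The "overrepresented at the top or bottom" idea can be sketched as a running sum over the ranked list: step up on a gene-set member, step down otherwise, and report the largest deviation from zero. This is an unweighted simplification, not the weighted statistic used by the GSEA software, and it omits the permutation step that assigns significance. The gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score in [-1, 1] for a gene set on a ranked list.

    Positive values mean the set clusters near the top of the ranking,
    negative values mean it clusters near the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]  # most up to most down
print(round(enrichment_score(ranked, {"g1", "g2", "g3"}), 3))  # 1.0
print(round(enrichment_score(ranked, {"g6", "g7", "g8"}), 3))  # -1.0
```

A set scattered evenly through the ranking would score close to zero under this scheme.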

Tools: GSEA (original), fgsea (fast), clusterProfiler. Gene sets come from MSigDB, KEGG, Reactome, GO (Gene Ontology).

GSEA transforms "500 individual genes changed" into "mTOR signaling, oxidative phosphorylation, and cell cycle were the most affected pathways", a much more interpretable biological statement.

DECODER

Biology

Gene expression is the process by which information encoded in a gene is used to produce a functional product. Different cell types express different subsets of the ~20,000 human genes, producing radically different proteomes from the same genome.

For Developers

Gene expression is feature flagging at the molecular level. Every cell has the full codebase but only runs certain modules. Transcription factors are the runtime configuration that determines which genes are loaded. Chromatin state is the access control layer — tightly packed DNA is read-protected. The result: liver cells and neurons share 100% of their source code but behave like completely different applications.

Why Gene Expression Is the Central Data Type in Biomedicine

Gene expression data is ubiquitous in biomedical research because it provides a functional readout of cell state. You can:

Compare healthy vs. diseased tissue
Track cellular responses to drug treatment
Identify biomarkers for disease stratification
Understand developmental trajectories
Characterize the tumor microenvironment

The tools and concepts from this chapter (TF regulation, the RNA-seq workflow, differential expression, pathway enrichment) are foundational for essentially all transcriptomics analysis. They appear repeatedly throughout the rest of the curriculum.