Genes: Functions in the Source Code

In the previous chapter, we established that the genome is a ~3 billion character string. But a string alone is not a program. A program needs structure: defined units with names, inputs, outputs, and rules for when they run. In biology, that structure is provided by genes.

A gene is the fundamental unit of biological information: a stretch of DNA with enough regulatory context to be selectively expressed, that is, transcribed into RNA and (usually) translated into protein. Understanding gene structure is essential because every bioinformatics tool that works with genes (variant callers, RNA-seq pipelines, annotation software) reasons about coordinates, exon boundaries, splice sites, and regulatory regions.

What a Gene Actually Is

Here's a definition that will hold up better than the casual version: a gene is a heritable unit of DNA sequence that can be transcribed into RNA, where that transcription is controlled by associated regulatory elements.

Note what's missing from that definition: "encodes a protein." About 1.5% of the human genome encodes protein, but by some estimates roughly 80% is transcribed into RNA at some point. Many of those non-coding RNAs have important regulatory functions. A gene that produces only non-coding RNA is still a gene.

ℹ The evolving gene concept

The definition of "gene" has been revised multiple times since the term was coined in 1909. Early genetics defined genes by their effects. Molecular biology redefined them as DNA sequences encoding proteins. Genomics forced another revision: some genes encode only RNA, some produce multiple proteins via alternative splicing, and some overlap with each other on opposite strands. The operational definition we use here, a transcribed unit with regulatory context, reflects the current working consensus.

Gene Structure: The Anatomy of a Function

A protein-coding gene in a eukaryote has several components, each with a distinct role:

The Promoter

The promoter is a regulatory sequence upstream of the gene (typically within ~2000 bp of the transcription start site) where the transcription machinery assembles. It contains the core promoter, which holds the recognition sequences for RNA polymerase, and often additional sequences that bind regulatory proteins called transcription factors.

Think of the promoter as a function signature combined with its access modifier. It defines: can this gene be called? Under what conditions? With what inputs (transcription factors)?

The classic core promoter elements include:

TATA box (~−25 to −30 from the transcription start site): binding site for TBP (TATA-binding protein), part of the basal transcription machinery
Initiator element (at the +1 site): present in many promoters without a TATA box

Many human promoters also have CpG islands: regions with high GC content and many CpG dinucleotides that resist methylation in active genes. CpG methylation is a key epigenetic silencing mechanism, covered in Chapter 3.2.
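To make "high GC content and many CpG dinucleotides" concrete, here is a minimal sketch of the two numbers usually computed when scanning for CpG islands. The function names and example sequences are illustrative. The thresholds mentioned in the comments (at least 200 bp, GC content above 50%, observed/expected CpG above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria, not something defined in this chapter.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from base composition.

    Expected CpGs = (number of C * number of G) / sequence length.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Toy windows; a real scan would slide a window of 200 bp or more
# along the genome and flag windows with GC > 0.5 and obs/exp > 0.6.
for window in ("CGCGGCGCTACG", "ATTTACAGTTAA"):
    print(window, round(gc_content(window), 2), round(cpg_obs_exp(window), 2))
```

Most of the genome is CpG-depleted, so an observed/expected ratio close to or above 1 stands out against the background.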

Enhancers and Silencers

Enhancers are regulatory sequences that increase transcription when bound by specific transcription factors. They can be located thousands or even hundreds of thousands of base pairs away from the gene they regulate, looping through 3D space to contact the promoter.

Silencers work similarly but decrease transcription.

{ } Enhancers as environment variables

An enhancer is like an environment variable that gets passed to a build process. The gene's promoter is the build script: it runs, but what it does depends on which environment variables are set. A liver-specific enhancer that binds HNF4α (a liver transcription factor) will activate transcription only in liver cells because only liver cells have HNF4α available. The same enhancer in a non-liver cell, without that transcription factor, stays silent.

Exons and Introns

When a protein-coding gene is transcribed, the full RNA copy, called pre-mRNA, includes both the coding sequences and intervening non-coding sequences:

Exons: sequences that end up in the mature mRNA (the word "exon" = "expressed")
Introns: sequences spliced out before the mRNA leaves the nucleus ("intron" = "intervening")

After transcription, a process called splicing removes the introns and joins the exons. The result is a mature mRNA with only the coding and regulatory sequences needed for translation.
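As a string operation, splicing is "keep these slices, drop the rest, concatenate." The sketch below assumes exon coordinates are already known and uses 0-based, end-exclusive positions purely because that matches Python slicing; the sequence and coordinates are made up for illustration.

```python
def splice(pre_mrna, exons):
    """Join the exon slices of a pre-mRNA string in 5'-to-3' order.

    exons: list of (start, end) pairs, 0-based and end-exclusive
    (a convention chosen here to match Python slicing).
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Hypothetical two-exon transcript: exon 1, one intron (GU...AG), exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)                       # AUGGCCGAAUAA
print(len(pre_mrna) - len(mature))  # 12 intronic bases removed
```

Choosing a different subset of exons from the same pre-mRNA is, in this toy model, all that alternative splicing amounts to.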

The average human protein-coding gene has ~9 exons and ~8 introns. Exons average ~200 bp; introns average ~3,500 bp. The actual coding sequence (the open reading frame, or ORF) is typically much smaller than the total gene span, which can exceed 100 kb of genomic DNA.
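Plugging the averages above into a few lines of arithmetic shows how little of a typical gene survives splicing. This is a back-of-the-envelope sketch using only the round numbers quoted in the paragraph, not measured values for any particular gene.

```python
# Round-number averages quoted in the text.
n_exons, exon_bp = 9, 200
n_introns, intron_bp = 8, 3_500

exonic = n_exons * exon_bp        # bases kept in the mature mRNA
intronic = n_introns * intron_bp  # bases removed by splicing
span = exonic + intronic          # transcribed span of this "average" gene

exonic_fraction = exonic / span
print(f"{exonic} bp exonic out of {span} bp transcribed ({exonic_fraction:.1%})")
```

With these figures roughly 94% of the transcribed span is intron, and the coding sequence is smaller still once the UTRs are subtracted.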

{ } Introns as commented-out code

Introns look superficially like comments or dead code: sequences that are present in the genome but removed before execution. But unlike commented-out code, introns are not inert. Many contain regulatory elements: splice site signals, regulatory RNAs, and even entire small genes. The splicing machinery that removes them is also a target for regulation, which can change the protein product entirely.

Splice Sites

The boundaries between exons and introns are defined by splice site consensus sequences. At the 5' splice site (exon|intron boundary) the intron typically starts with GT (GU in RNA); at the 3' splice site (intron|exon boundary) the intron ends with AG. The phrase "GT-AG rule" is a useful mnemonic.

Within the intron, a branch point sequence (~20–50 bp upstream of the 3' splice site) forms a lariat structure during splicing. The spliceosome, a large RNA-protein complex, catalyzes the reaction.

Mutations in splice sites are a major class of pathogenic variants. A single base change at the GT or AG can cause exon skipping (the exon is treated as part of the surrounding intron and removed), intron retention (the intron ends up in the mature mRNA), or cryptic splice site activation (a nearby sequence that looks like a splice site gets used instead). All of these can alter or destroy the protein product.
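The GT-AG rule is easy to express as a check on an intron's first and last two bases. The sketch below is deliberately naive: it looks only at the two canonical dinucleotides, ignores the wider consensus and the branch point, and will reject the minority of real introns that use non-canonical boundaries. Function names and the example intron are hypothetical.

```python
def has_canonical_splice_sites(intron):
    """True if a DNA intron sequence starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def disrupts_canonical_site(intron, offset, alt_base):
    """True if substituting alt_base at a 0-based offset breaks the GT...AG pattern."""
    mutated = intron[:offset] + alt_base + intron[offset + 1:]
    return has_canonical_splice_sites(intron) and not has_canonical_splice_sites(mutated)


intron = "GTAAGTTTTCAG"  # toy intron, far shorter than a real one
print(has_canonical_splice_sites(intron))
print(disrupts_canonical_site(intron, 0, "A"))   # hits the G of the 5' GT
print(disrupts_canonical_site(intron, 5, "C"))   # mid-intron change
print(disrupts_canonical_site(intron, 11, "C"))  # hits the G of the 3' AG
```

A check like this can only say that a canonical site was lost. Predicting whether the outcome is exon skipping, intron retention, or cryptic site use needs more sequence context than two bases.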

The Coding Sequence (CDS) and Open Reading Frame

The coding sequence (CDS) is the portion of the mature mRNA that gets translated into protein. It begins with a start codon (AUG, encoding methionine) and ends with a stop codon (UAA, UAG, or UGA).

The CDS is embedded in the mRNA between UTRs (untranslated regions):

5' UTR: between the cap and the start codon; contains ribosome binding sites and regulatory elements
3' UTR: between the stop codon and the poly-A tail; contains regulatory sequences that influence mRNA stability, translation efficiency, and subcellular localization
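The layout described above (5' UTR, then CDS, then 3' UTR) can be recovered from a mature mRNA string with a simple scan. This sketch makes two simplifying assumptions that real annotation does not: translation starts at the first AUG, and the CDS is reported with its stop codon included. The function name and example sequence are illustrative.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna):
    """Return (five_utr, cds, three_utr), or None if no complete ORF is found.

    Assumes the first AUG is the start codon and reads in-frame
    until the first stop codon, which is kept as part of the CDS.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None


five_utr, cds, three_utr = split_mrna("GGCACC" + "AUGGCUUGGUAA" + "CUUAAA")
print(five_utr, cds, three_utr)  # GGCACC AUGGCUUGGUAA CUUAAA
```

Everything outside the middle slice is invisible to a protein-level comparison, which is why UTR changes are easy to miss.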

★ UTRs are not just flanking sequences

The 3' UTR is a major hub for post-transcriptional regulation. It contains binding sites for microRNAs, small non-coding RNAs that target mRNAs for degradation or translational silencing. Over 60% of human protein-coding genes are estimated to be regulated by microRNAs. When analyzing differential expression, UTR variants affecting microRNA binding sites can have major effects even though they don't change the protein sequence.

The Codon Table: A Lookup Table for Translation

The genetic code maps triplets of nucleotides (codons) to amino acids. There are 4³ = 64 possible codons (61 that specify amino acids plus 3 stop codons) and 20 amino acids, so most amino acids are encoded by multiple codons. This is called degeneracy or redundancy.

The code is:

Universal: almost identical across all life (with minor exceptions in some mitochondria and a few organisms)
Degenerate: multiple codons map to the same amino acid (e.g., GCU, GCC, GCA, GCG all encode alanine)
Non-overlapping: each nucleotide belongs to exactly one codon
Comma-free: no delimiters between codons; the reading frame is established by the start codon

The degeneracy is not random. Synonymous codons (encoding the same amino acid) often differ only in the third position, the "wobble" position. This makes the code more robust to point mutations: a change in the third codon position often doesn't change the amino acid.
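Because the genetic code really is a lookup table, it fits in a dictionary. The sketch below builds the standard table from a 64-character string of one-letter amino acid codes laid out in the conventional U, C, A, G ordering, then uses it to translate a short mRNA and to count how many codons map to each amino acid. The variable and function names are illustrative, and `translate` assumes the input contains only A, C, G, and U.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("GGAUGGCUGCCUAAGG"))  # MAA

# Degeneracy: number of codons per amino acid (and per stop signal).
degeneracy = Counter(CODON_TABLE.values())
print(degeneracy["A"], degeneracy["L"], degeneracy["M"])  # 4 6 1
```

The wobble effect is visible in the table directly: for alanine, every codon of the form GC followed by any base gives the same result, so the third position carries no information.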

Pseudogenes and Gene Families

Not everything that looks like a gene is functional. Pseudogenes are sequences that resemble genes but have lost function through mutation. They often arise when a gene is duplicated and one copy accumulates inactivating mutations.

More productively, gene duplication is a major mechanism for evolving new functions. The human genome contains many gene families: groups of related genes that arose by duplication and divergence. The hemoglobin genes (HBA1, HBA2, HBB, HBD, etc.) are a classic example: all related, all encoding oxygen-carrying proteins, but with different expression patterns and oxygen affinities tuned to developmental stage and tissue type.

Reading a Gene Annotation File

In practice, genes are described in annotation files: GTF (Gene Transfer Format) or GFF3 files that list genomic coordinates for each feature. Every RNA-seq analysis starts by mapping reads to a reference genome and counting reads per gene, which requires a gene annotation file.

A GTF record looks like this:

chr17  HAVANA  gene        43044295  43125483  .  -  .  gene_id "ENSG00000012048"; gene_name "BRCA1";
chr17  HAVANA  transcript  43044295  43125483  .  -  .  gene_id "ENSG00000012048"; transcript_id "ENST00000357654";
chr17  HAVANA  exon        43124017  43125483  .  -  .  gene_id "ENSG00000012048"; exon_number "1";
chr17  HAVANA  CDS         43124017  43125364  .  -  .  gene_id "ENSG00000012048"; protein_id "ENSP00000350283";

Fields: chromosome, source, feature type, start, end, score, strand, frame, attributes.

The coordinates are 1-based and closed: both start and end are inclusive in GTF, so a feature's length is end − start + 1. (BED files, by contrast, are 0-based and half-open.) The strand (+ or -) matters: genes on the minus strand are transcribed right-to-left in genomic coordinates, so position 43125483 is the 5' end of BRCA1.
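A minimal sketch of reading one such record in Python follows. It assumes the nine fields are tab-separated, as in real GTF files (the record above is shown with spaces for readability), and it handles attributes naively: values containing semicolons, and keys that repeat within a record, are not supported. The function names are illustrative rather than taken from any library.

```python
def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dict (naive attribute handling)."""
    fields = line.rstrip("\n").split("\t")
    chrom, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attributes[key] = value.strip('"')
    return {
        "chrom": chrom, "source": source, "feature": feature,
        "start": int(start), "end": int(end),
        "strand": strand, "attributes": attributes,
    }


def feature_length(rec):
    """Length in bp; both ends are inclusive in GTF, hence the +1."""
    return rec["end"] - rec["start"] + 1


def five_prime_end(rec):
    """Genomic coordinate of the 5' end, which depends on strand."""
    return rec["start"] if rec["strand"] == "+" else rec["end"]


line = ('chr17\tHAVANA\tgene\t43044295\t43125483\t.\t-\t.\t'
        'gene_id "ENSG00000012048"; gene_name "BRCA1";')
rec = parse_gtf_line(line)
print(rec["attributes"]["gene_name"], feature_length(rec), five_prime_end(rec))
# BRCA1 81189 43125483
```

Forgetting the +1, or mixing these coordinates with 0-based BED intervals without converting, is one of the most common sources of off-by-one errors in genomics code.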

Understanding GTF/GFF3 files is a prerequisite for RNA-seq, ChIP-seq, variant annotation, CRISPR guide design, and most genome browser work.

Why Gene Structure Matters for Bioinformatics

Almost every bioinformatics analysis involves gene boundaries at some level:

Variant annotation: is this SNP in a coding exon? In a splice site? In a UTR? The functional impact depends entirely on where it falls in the gene structure.
RNA-seq: reads are counted per gene, per transcript, sometimes per exon. Isoform-level analysis requires knowing exon-intron boundaries.
ChIP-seq: where is a transcription factor binding relative to nearby genes?
CRISPR design: guides near a splice site can disrupt splicing even if they don't hit the coding sequence directly.
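The variant annotation question in the list above reduces to interval lookups against a transcript's features. The following is a simplified sketch with hypothetical coordinates: it handles a single transcript, ignores strand (so it cannot tell a 5' UTR from a 3' UTR), and treats the two intronic bases next to each exon as the splice site. Real annotators consider every transcript of every overlapping gene.

```python
def annotate_position(pos, exons, cds_span, splice_window=2):
    """Classify a 1-based genomic position against one transcript.

    exons: list of (start, end) pairs, 1-based inclusive, as in GTF.
    cds_span: (start, end) genomic span from start codon to stop codon.
    """
    tx_start = min(start for start, _ in exons)
    tx_end = max(end for _, end in exons)
    if not tx_start <= pos <= tx_end:
        return "outside_transcript"
    for start, end in exons:
        if start <= pos <= end:
            return "CDS" if cds_span[0] <= pos <= cds_span[1] else "UTR"
    # Inside the transcript but not in an exon, so the position is intronic.
    for start, end in exons:
        if 0 < start - pos <= splice_window or 0 < pos - end <= splice_window:
            return "splice_site"
    return "intron"


# Hypothetical three-exon transcript; coding region runs from 150 to 550.
exons = [(100, 200), (301, 400), (501, 600)]
cds_span = (150, 550)
for pos in (50, 120, 160, 201, 250, 300, 580):
    print(pos, annotate_position(pos, exons, cds_span))
```

The same SNP could return a different label against a different transcript of the same gene, which is why annotation tools report consequences per transcript.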

The gene is not just a label or a name. It's a precise, structured unit with regulatory logic, internal organization, and defined outputs. Treating it as a simple position on a chromosome misses most of the biology.

⟷DECODER

Biology

A gene is a discrete segment of DNA that encodes a functional product — usually a protein, sometimes an RNA. Genes include not just the coding sequence but regulatory regions (promoter, enhancers) that control when and where expression occurs.

{ } For Developers

A gene is a function definition with its own configuration: the promoter is the function signature and access modifier, enhancers are feature flags that change expression in specific contexts, the coding sequence is the function body, and introns are inline comments stripped before execution. The genome is a codebase of ~20,000 such functions.
