A virus is, in the most literal sense, a piece of code looking for a machine to run on. It carries a genome — a complete program for making copies of itself — but possesses none of the cellular machinery needed to execute that program. It must find a host cell and co-opt that cell's ribosomes, polymerases, and energy supply to replicate.
This dependency defines the virus. It is not alive in the usual sense (no metabolism, no homeostasis, no independent reproduction), but it is the most successful self-replicating entity on Earth by copy number. There are an estimated 10³¹ individual viruses in the biosphere — more than all other biological entities combined.
Understanding viruses is essential for bioinformatics because viral sequences appear everywhere: in sequencing datasets as contaminants or co-infections, as integrated retroelements in every vertebrate genome, and as tools (viral vectors) for gene delivery in research and therapy.
What a Virus Actually Is
At minimum, a virus requires:
- A genome — nucleic acid containing the information to encode viral proteins and direct replication
- A capsid — a protein shell that protects the genome during transmission
- A mechanism for entering host cells
Many viruses also have:
- An envelope: a lipid bilayer, derived from host cell membranes, that surrounds the capsid
- Accessory proteins: regulatory, immune evasion, or structural proteins encoded in the genome
A virus is like a compiled binary on a USB drive. The binary contains valid instructions, but it can't do anything without a computer to run on. When it finds a computer (a host cell), it takes over the host's resources — CPU (ribosomes), RAM (cytoplasm), I/O (transport machinery) — to execute its program: replicate itself and prepare new copies for distribution.
The host immune system is the antivirus software: it scans incoming files, flags suspicious patterns, and attempts to quarantine or delete the binary before it can execute.
Genome Types: More Diversity Than Cellular Life
Unlike cellular life, which uses only double-stranded DNA as its genome, viruses use virtually every possible nucleic acid configuration:
| Genome type | Example viruses | Notes |
|---|---|---|
| dsDNA | Herpesviruses, poxviruses, adenoviruses | Most similar to cellular genomes; can be very large (poxviruses ~200 kb) |
| ssDNA | Parvoviruses, circoviruses | Small, simple genomes |
| dsRNA | Reoviruses, rotaviruses | Replicate in the cytoplasm via RNA-dependent RNA polymerase |
| +ssRNA | Coronaviruses (SARS-CoV-2), flaviviruses (dengue, Zika), picornaviruses (polio) | Genome directly functions as mRNA; can be immediately translated |
| −ssRNA | Influenza, rabies, Ebola | Genome is complement of mRNA; must be transcribed first |
| ssRNA-RT | HIV, HTLV | RNA genome, but replicates through DNA intermediate via reverse transcriptase |
| dsDNA-RT | Hepatitis B | DNA genome, replicates via RNA intermediate |
The "+" and "−" designations for RNA viruses refer to strand polarity relative to mRNA: positive-sense (+ssRNA) can be directly translated by ribosomes; negative-sense (−ssRNA) must first be copied into mRNA.
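The polarity rule can be made concrete with a small sketch. Everything here is illustrative: the sequences are invented rather than taken from a real virus, `to_mrna` is a hypothetical helper, and the sense labels use plain ASCII `+` and `-`.

```python
# Toy illustration of strand polarity; the sequences are invented, not real genomes.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def to_mrna(genome: str, sense: str) -> str:
    """Return the mRNA-sense strand for a genome of polarity '+' or '-'."""
    if sense == "+":
        # Already mRNA-sense: ribosomes can read it directly.
        return genome
    if sense == "-":
        # Must be copied into its complement before translation.
        return reverse_complement(genome)
    raise ValueError("sense must be '+' or '-'")


plus_genome = "AUGGCUUAA"                       # reads as Met-Ala-stop
minus_genome = reverse_complement(plus_genome)  # what a -ssRNA virus would package

print(to_mrna(plus_genome, "+"))   # unchanged
print(to_mrna(minus_genome, "-"))  # recovers the same mRNA
```

In a real negative-sense virus the copying step is performed by the viral polymerase; here it is reduced to a string operation.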
The Baltimore classification system, proposed by David Baltimore in 1971, categorizes viruses by genome type and replication strategy; it now comprises 7 classes. (Baltimore shared the 1975 Nobel Prize for the discovery of reverse transcriptase, not for the classification itself.) It remains a foundational framework in virology and indicates which host enzymes a virus can hijack and which it must supply itself. For example, negative-sense RNA viruses must carry their own RNA-dependent RNA polymerase in the virion because host cells have no such enzyme.
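One way to keep the seven classes straight is a small lookup table. The sketch below follows the row order of the genome table above. The `virion_polymerase` flag is a simplified summary of the typical case for each class, added here as an assumption rather than taken from the text (which states it only for negative-sense RNA viruses), and the names `BaltimoreClass` and `describe` are hypothetical.

```python
from typing import NamedTuple


class BaltimoreClass(NamedTuple):
    number: int               # Baltimore class I-VII, written as 1-7
    virion_polymerase: bool   # does the particle typically carry its own polymerase?


# Simplified, typical-case summary; individual virus families can differ.
BALTIMORE = {
    "dsDNA":    BaltimoreClass(1, False),  # poxviruses are a known exception
    "ssDNA":    BaltimoreClass(2, False),
    "dsRNA":    BaltimoreClass(3, True),
    "+ssRNA":   BaltimoreClass(4, False),  # genome is translated first
    "-ssRNA":   BaltimoreClass(5, True),   # host has no RNA-dependent RNA polymerase
    "ssRNA-RT": BaltimoreClass(6, True),   # reverse transcriptase travels in the virion
    "dsDNA-RT": BaltimoreClass(7, True),
}


def describe(genome_type: str) -> str:
    """Return a one-line summary for a genome type key in BALTIMORE."""
    entry = BALTIMORE[genome_type]
    source = "carried in the virion" if entry.virion_polymerase else "made or borrowed after entry"
    return f"{genome_type}: class {entry.number}, polymerase {source}"


for key in BALTIMORE:
    print(describe(key))
```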
Capsid Symmetry: Geometry Matters
Viral capsids self-assemble from repeated copies of one or a few proteins. Two fundamental symmetries dominate, with a third category for structures that fit neither:
Icosahedral symmetry: Used by most non-enveloped animal viruses and by many enveloped ones. The capsid has 20 equilateral triangular faces, forming a nearly spherical shell that encloses a large volume for its surface area. Adenoviruses, poliovirus, HPV, and hepatitis B (whose icosahedral capsid sits inside an envelope) are examples.
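Helical symmetry: Capsid proteins spiral around the genome. Examples include tobacco mosaic virus (a positive-sense RNA plant virus) and the nucleocapsids of many negative-sense RNA viruses such as rabies and influenza. Because the helix can simply grow longer, this design accommodates variable genome sizes.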
Complex capsids: Some viruses have structures that fit neither category cleanly, such as the brick-shaped poxvirus particle or tailed bacteriophages, which combine an icosahedral head with a helical tail.
Self-assembly from symmetric, repeated units is elegant: the genome only needs to encode a small number of capsid protein sequences rather than a unique capsid structure. It's the same principle as building complex 3D structures from identical Lego bricks.
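The economy of repeated units can be put in numbers. The sketch below assumes the Caspar-Klug triangulation scheme, which the text above does not introduce: an idealized icosahedral shell is built from 60T subunits, where T = h² + hk + k² for non-negative integers h and k. The function names are hypothetical, and real capsids can deviate from this idealized count.

```python
def triangulation_number(h: int, k: int) -> int:
    """Caspar-Klug triangulation number T = h^2 + h*k + k^2."""
    if h < 0 or k < 0 or (h == 0 and k == 0):
        raise ValueError("h and k must be non-negative and not both zero")
    return h * h + h * k + k * k


def subunit_count(h: int, k: int) -> int:
    """Idealized number of protein subunits in an icosahedral shell (60 * T)."""
    return 60 * triangulation_number(h, k)


# The simplest shell places 3 subunits on each of the 20 faces: 60 in total.
for h, k in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    print(f"h={h}, k={k}: T={triangulation_number(h, k)}, subunits={subunit_count(h, k)}")
```

The point of the arithmetic is that the shell grows from 60 to hundreds of subunits without the genome needing to encode more capsid genes.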
Enveloped vs. Non-Enveloped: Implications for Transmission
The presence or absence of a lipid envelope has major consequences:
Enveloped viruses (HIV, SARS-CoV-2, influenza, herpesviruses):
- Acquire their envelope by budding through host cell membranes during exit
- Envelope contains host membrane proteins and viral proteins (spike proteins, etc.)
- More sensitive to detergents, heat, and drying — which disrupt the lipid bilayer
- Spread more effectively through direct contact or droplets; less durable on surfaces
Non-enveloped viruses (adenoviruses, rotaviruses, poliovirus, norovirus):
- Naked capsid — resistant to detergents, acid, and drying
- Can survive on surfaces for hours to days
- Typically transmitted via the fecal-oral route or contaminated surfaces
- This is why alcohol-based hand sanitizers are less effective against non-enveloped viruses (alcohol disrupts lipid envelopes but less effectively disrupts bare protein capsids)
Viral Genome Size: The Minimal Program
Viral genomes span an enormous range:
- Smallest: Hepatitis D (1.7 kb, encodes only 1 protein; requires hepatitis B for replication)
- Largest: giant viruses such as Mimivirus, which infects amoebae rather than animals (~1.2 Mb, larger than some bacterial genomes)
- Typical human pathogens: Influenza ~13 kb, SARS-CoV-2 ~30 kb, HIV ~9.7 kb, HSV-1 ~152 kb
The pressure to minimize genome size drives extreme coding density in small viruses:
- Overlapping reading frames: the same nucleotide sequence encodes two different proteins in different frames
- Polyproteins: one large protein is translated and then cleaved by proteases into multiple functional proteins
- Multifunctional proteins: a single protein can fill several roles, for example acting both as a structural component and as an immune antagonist
This coding compactness is one reason viral genome analysis is particularly interesting computationally — a single mutation can affect multiple proteins simultaneously if it falls in an overlapping region.
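A toy example shows how one substitution in an overlapping region changes two proteins at once. The 9-nucleotide sequence is invented for illustration, the code works on the DNA alphabet (T rather than U), and `translate` is a minimal helper written for this sketch using the standard genetic code.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str, frame: int = 0) -> str:
    """Translate seq from the given frame offset; trailing partial codons are ignored."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def mutate(seq: str, pos: int, base: str) -> str:
    """Return seq with the nucleotide at 0-based position pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


wild_type = "ATGGCATGA"                # invented toy sequence
mutant = mutate(wild_type, 4, "T")     # a single C->T substitution

for label, seq in [("wild type", wild_type), ("mutant", mutant)]:
    print(label, "frame 0:", translate(seq, 0), "| frame 1:", translate(seq, 1))
```

In frame 0 the substitution changes alanine to valine, and in frame 1 it changes histidine to tyrosine, so a single-nucleotide change alters both overlapping products.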
Viral Protein Functions
The genes in even small viral genomes encode a complete set of functions:
Structural proteins (capsid, envelope, matrix): pack and protect the genome during transmission
Replication proteins (polymerases, helicases, proteases): carry out genome replication and, in negative-sense RNA viruses, initial transcription
Entry proteins: viral surface proteins that recognize host cell receptors and mediate membrane fusion or endocytosis
Immune evasion proteins: widespread among viruses that must contend with host immunity; mechanisms include blocking interferon signaling, hiding viral proteins from MHC presentation, and expressing host-like proteins to avoid detection
Accessory/regulatory proteins: control the timing and rate of viral gene expression; determine whether infection is acute or latent
Viral Diversity in the Human Genome
About 8% of the human genome consists of recognizable retrovirus-derived sequences — endogenous retroviruses (ERVs) that integrated into ancestral germ cells millions of years ago. Most are degenerate and non-functional, but some ERV-derived sequences have been co-opted:
- Syncytin-1 and Syncytin-2 (derived from ERV envelope proteins) are essential for placental development in primates — the cells that form the placenta fuse together using a repurposed viral fusion protein
- Some ERV promoters drive expression of nearby host genes in tissues where those genes would not otherwise be expressed
- ERV sequences contribute to regulatory elements (enhancers, CTCF binding sites)
This makes the human genome a palimpsest of ancient viral integrations. It also makes viral "contamination" in genomic sequencing data non-trivial to detect, since reads from ERV sequences can align to viral reference genomes.
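The ambiguity can be sketched with a toy k-mer comparison. This is not how production pipelines work (they use full aligners and curated references); it only illustrates why a read from an ERV-like region can look both "host" and "viral". All sequences are invented, and `classify_read`, the k-mer size, and the threshold are hypothetical choices.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(read: str, reference: str, k: int = 8) -> float:
    """Fraction of the read's k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & kmers(reference, k)) / len(read_kmers)


def classify_read(read: str, host_ref: str, viral_ref: str,
                  k: int = 8, threshold: float = 0.5) -> str:
    """Label a read 'host', 'viral', 'ambiguous', or 'unassigned'."""
    hits_host = shared_fraction(read, host_ref, k) >= threshold
    hits_viral = shared_fraction(read, viral_ref, k) >= threshold
    if hits_host and hits_viral:
        return "ambiguous"   # e.g. an ERV-derived region shared with a viral reference
    if hits_viral:
        return "viral"
    if hits_host:
        return "host"
    return "unassigned"


# Invented sequences: the same ERV-like segment sits in both references.
erv_like = "GATTACAGGCTTACCGGATAGC"
host_ref = "ACGTACGTTTGACCA" + erv_like + "TTGGCCAATCGA"
viral_ref = "CCCGGGAAATTT" + erv_like + "AGAGAGTCTCTC"

print(classify_read(erv_like[2:20], host_ref, viral_ref))    # read from the shared segment
print(classify_read("CCCGGGAAATTT", host_ref, viral_ref))    # read unique to the viral reference
print(classify_read("ACGTACGTTTGACCA", host_ref, viral_ref)) # read unique to the host reference
```

A read labelled ambiguous here is the case the paragraph above warns about: on sequence similarity alone, it cannot be attributed to an exogenous virus.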
Viruses as Bioinformatics Tools
Beyond their role as pathogens, viruses are essential tools in molecular biology and medicine:
Viral vectors: Adeno-associated virus (AAV), lentivirus, and adenovirus are engineered as delivery vehicles for gene therapy. Understanding the natural biology of each vector type is necessary to understand its tropism (which cells it infects), capacity (how much DNA it can carry), and immunogenicity.
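Capacity is the easiest of these properties to reason about in code. The sketch below filters vectors by payload size. The capacity values are rough, commonly quoted figures included here as assumptions, not numbers from this chapter, and `vectors_that_fit` is a hypothetical helper.

```python
# Approximate packaging limits in kilobases. These are rough, commonly quoted
# figures used as assumptions; check current literature before relying on them.
APPROX_CAPACITY_KB = {
    "AAV": 4.7,
    "lentivirus": 8.0,
    "adenovirus": 8.0,  # first-generation vectors; stripped-down designs carry more
}


def vectors_that_fit(payload_kb: float) -> list[str]:
    """Return the names of vectors whose assumed capacity covers the payload."""
    if payload_kb <= 0:
        raise ValueError("payload_kb must be positive")
    return sorted(name for name, cap in APPROX_CAPACITY_KB.items() if payload_kb <= cap)


print(vectors_that_fit(4.0))   # small payload: every vector in the table qualifies
print(vectors_that_fit(6.0))   # too large for the assumed AAV limit
print(vectors_that_fit(12.0))  # exceeds every assumed limit
```

Capacity is only a first filter; tropism and immunogenicity do not reduce to a single number and still have to be weighed case by case.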
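CRISPR delivery: Some CRISPR therapies use viral vectors such as AAV to deliver the guide RNA and Cas9, while others rely on non-viral methods such as electroporation or lipid nanoparticles. The choice of delivery method determines delivery efficiency, immune response, and editing duration.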
Research tools: Bacteriophages (viruses that infect bacteria) are used for phage display, protein library screening, and as model systems. Lambda phage was among the early genomes to be completely sequenced, and it became a workhorse cloning vector in recombinant DNA technology.
In the next chapter, we'll examine how viruses actually get into cells and replicate — the infection mechanism.