Part 5 · 5.6 · 25 min read

In Practice: BLAST with Viral Sequences

Use BLAST and Biopython to identify viral sequences, classify strains, build phylogenetic trees, and analyze variant mutations in real genomic data.

BLAST · virology · phylogenetics · sequence alignment · practical

Viral genomics is one of the most active areas in bioinformatics. Sequencing-based surveillance, outbreak investigation, drug resistance monitoring, and vaccine target analysis all require the same core skills: aligning sequences, identifying strains, and placing sequences in evolutionary context. This chapter puts those skills to work with real viral sequence data.

Setup

bash
pip install biopython matplotlib pandas numpy requests
# For local BLAST (optional but recommended for large datasets):
# conda install -c bioconda blast
python
from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import Align
import pandas as pd
import numpy as np
import requests
import time

Part 1: Identifying an Unknown Sequence with BLAST

A common task: you have a sequence of unknown origin (from a patient sample, environmental sample, or novel organism) and need to identify it.

python
# An unknown sequence — let's see what BLAST says
unknown_seq = """
ATGAGAGCCTTGTCCTTTATAGCACAAATGGCAACCAAATGCATCAAAATGAGAGAATCAGTCTCAATGG
ATGGAAATGAATAAGCCTGTCAAGCTGGATGGAAACAACATTACAACAGAAGATCAAATGGGGAAGATGG
AGAAAGAAGTGGTCATCAGAAGCAACAGCTTATCATCAGCAGCAGCCATCAGCAGGATAGAAGCGGCAGC
""".replace("\n", "").strip()

Entrez.email = "your@email.com"

print(f"Query length: {len(unknown_seq)} bp")
print("Running BLAST against NT database...")

result_handle = NCBIWWW.qblast(
    "blastn",
    "nt",
    unknown_seq,
    hitlist_size=10,
    format_type="XML"
)

# Save result to avoid re-running
with open("blast_result.xml", "w") as f:
    f.write(result_handle.read())
result_handle.close()
print("BLAST complete.")
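The web `qblast` call can take minutes and is rate-limited. With the optional local BLAST+ install from the setup step, the same search runs offline. A sketch, assuming `blastn` is on your PATH and that a nucleotide database named `my_viral_db` (an illustrative name) was built beforehand with `makeblastdb`:

```python
import shutil
import subprocess
from pathlib import Path

# Write the query to a FASTA file (short stand-in sequence shown here)
query_path = Path("query.fasta")
query_path.write_text(">unknown\nATGAGAGCCTTGTCCTTTATAGCACAAATGGCAACCAAATGCATC\n")

# Hypothetical local database, built beforehand with e.g.:
#   makeblastdb -in viral_refs.fasta -dbtype nucl -out my_viral_db
cmd = [
    "blastn",
    "-query", str(query_path),
    "-db", "my_viral_db",
    "-outfmt", "6 qseqid sseqid pident length evalue bitscore",  # tabular output
    "-max_target_seqs", "10",
]

if shutil.which("blastn"):
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)
else:
    print("blastn not found on PATH; install BLAST+ (e.g. via bioconda) to run locally")
```

Tabular output (`-outfmt 6`) is convenient because it loads directly with `pd.read_csv(..., sep="\t")`.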

Parsing BLAST Results

python
def parse_blast_results(xml_file: str) -> pd.DataFrame:
    with open(xml_file) as f:
        blast_record = next(NCBIXML.parse(f))

    rows = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps[:1]:  # best HSP per hit
            rows.append({
                "title": alignment.title[:80],
                "accession": alignment.accession,
                "score": hsp.score,
                "e_value": hsp.expect,
                "identity_pct": 100 * hsp.identities / hsp.align_length,
                "alignment_length": hsp.align_length,
                "query_coverage_pct": 100 * (hsp.query_end - hsp.query_start + 1) / blast_record.query_length,
            })

    return pd.DataFrame(rows)

df = parse_blast_results("blast_result.xml")
print(df[["title", "identity_pct", "e_value", "query_coverage_pct"]].to_string(index=False))
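Before interpreting hits, it is common to filter on e-value first. A sketch on a toy DataFrame (the rows here are invented so the snippet stands alone; with real data, use the `df` returned by `parse_blast_results`):

```python
import pandas as pd

# Toy stand-in for the parsed BLAST results
df = pd.DataFrame({
    "title": ["Influenza A virus HA", "Synthetic construct", "Uncultured clone"],
    "e_value": [1e-120, 3e-4, 0.7],
    "identity_pct": [99.1, 82.0, 71.5],
})

# Keep only hits below a conservative e-value cutoff, best first
significant = df[df["e_value"] < 1e-10].sort_values("e_value")
print(significant.to_string(index=False))
```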

Interpreting Results

python
def classify_blast_hit(identity: float, coverage: float) -> str:
    """Classify the BLAST hit quality."""
    if identity >= 99 and coverage >= 95:
        return "Near-identical match (same strain or minor variant)"
    elif identity >= 95 and coverage >= 90:
        return "High confidence match (closely related strain)"
    elif identity >= 80 and coverage >= 70:
        return "Moderate match (same species, different strain)"
    elif identity >= 70:
        return "Distant homolog (same genus or family)"
    else:
        return "Weak match — possibly novel or contamination"

for _, row in df.head(5).iterrows():
    classification = classify_blast_hit(row["identity_pct"], row["query_coverage_pct"])
    print(f"\n{row['title'][:60]}")
    print(f"  Identity: {row['identity_pct']:.1f}%, Coverage: {row['query_coverage_pct']:.1f}%")
    print(f"  {classification}")

Part 2: Fetching and Comparing Influenza Strains

Track the evolution of influenza hemagglutinin (HA) across years.

python
def fetch_influenza_ha_sequences(strains: list[str]) -> list:
    """Fetch HA sequences for specified influenza strains."""
    records = []
    for strain_query in strains:
        handle = Entrez.esearch(
            db="nucleotide",
            term=f"{strain_query}[Title] AND hemagglutinin[Title] AND complete[Title] AND influenza[Organism]",
            retmax=1
        )
        result = Entrez.read(handle)
        handle.close()

        if not result["IdList"]:
            print(f"No results for: {strain_query}")
            continue

        time.sleep(0.4)
        handle = Entrez.efetch(
            db="nucleotide",
            id=result["IdList"][0],
            rettype="fasta",
            retmode="text"
        )
        record = SeqIO.read(handle, "fasta")
        record.description = strain_query  # use strain name as description
        records.append(record)
        handle.close()
        print(f"Fetched: {strain_query} ({len(record.seq)} bp)")
        time.sleep(0.4)

    return records

# A subset of influenza H3N2 strains from different years
strains_to_fetch = [
    "A/Hong Kong/1/1968 H3N2",
    "A/Sydney/5/1997 H3N2",
    "A/Wisconsin/67/2005 H3N2",
    "A/Perth/16/2009 H3N2",
]

ha_records = fetch_influenza_ha_sequences(strains_to_fetch)
SeqIO.write(ha_records, "influenza_ha.fasta", "fasta")
print(f"\nSaved {len(ha_records)} HA sequences")
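Entrez fetches are slow and rate-limited, so it pays to cache downloads and reload them on later runs. One way to wrap that pattern (`load_or_fetch` is a helper name invented for this sketch):

```python
from pathlib import Path
from Bio import SeqIO

def load_or_fetch(fasta_path: str, fetch_fn, *args) -> list:
    """Reload cached sequences if the file exists; otherwise fetch and save them."""
    path = Path(fasta_path)
    if path.exists():
        records = list(SeqIO.parse(path, "fasta"))
        print(f"Loaded {len(records)} cached records from {path}")
        return records
    records = fetch_fn(*args)
    SeqIO.write(records, path, "fasta")
    return records

# Usage with the fetch function defined above:
# ha_records = load_or_fetch("influenza_ha.fasta",
#                            fetch_influenza_ha_sequences, strains_to_fetch)
```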

Part 3: Pairwise Sequence Alignment

Compare sequences to measure evolutionary divergence:

python
from Bio import Align
from Bio.Seq import Seq

def pairwise_identity(seq1: str, seq2: str) -> float:
    """Compute percent identity using pairwise global alignment."""
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    aligner.match_score = 1
    aligner.mismatch_score = -1
    aligner.open_gap_score = -2
    aligner.extend_gap_score = -0.5

    alignment = aligner.align(seq1, seq2)[0]
    matches = sum(a == b for a, b in zip(alignment[0], alignment[1]) if a != "-" and b != "-")
    aligned_length = sum(1 for a, b in zip(alignment[0], alignment[1]) if a != "-" or b != "-")
    return 100 * matches / aligned_length if aligned_length > 0 else 0

# Build identity matrix
n = len(ha_records)
identity_matrix = np.zeros((n, n))
strain_labels = [r.description for r in ha_records]

for i in range(n):
    for j in range(i, n):
        if i == j:
            identity_matrix[i][j] = 100.0
        else:
            # Align only the first 500 bp to keep runtime manageable
            idt = pairwise_identity(str(ha_records[i].seq[:500]), str(ha_records[j].seq[:500]))
            identity_matrix[i][j] = identity_matrix[j][i] = idt

# Display as a heatmap-ready DataFrame
df_identity = pd.DataFrame(identity_matrix, index=strain_labels, columns=strain_labels)
print("\nSequence identity matrix (%):")
print(df_identity.round(1))
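The identity matrix is easy to render as a heatmap with matplotlib's `imshow`. A sketch (the matrix below is a synthetic stand-in so the snippet runs on its own; with real data, plot `identity_matrix` with `strain_labels`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Synthetic stand-in for the identity matrix computed above
labels = ["1968", "1997", "2005", "2009"]
identity = np.array([
    [100.0, 91.2, 88.5, 87.9],
    [ 91.2, 100.0, 95.1, 94.0],
    [ 88.5, 95.1, 100.0, 97.3],
    [ 87.9, 94.0, 97.3, 100.0],
])

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(identity, cmap="viridis", vmin=80, vmax=100)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
# Annotate each cell with its identity value
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{identity[i, j]:.0f}", ha="center", va="center", color="w")
fig.colorbar(im, ax=ax, label="% identity")
fig.savefig("identity_heatmap.png", dpi=150)
```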

Part 4: Mutation Analysis (SARS-CoV-2 Spike)

Analyze mutations in the spike protein receptor-binding domain across variants:

python
REFERENCE_SPIKE_FRAGMENT = (
    "QCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGT"
    "NGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEF"
    "QFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFV"
)

def find_mutations(ref_seq: str, query_seq: str, ref_start: int = 1) -> list:
    """Find amino acid mutations between two sequences (already aligned)."""
    mutations = []
    for i, (ref_aa, qry_aa) in enumerate(zip(ref_seq, query_seq)):
        if ref_aa != qry_aa and ref_aa != "-" and qry_aa != "-":
            pos = ref_start + i
            mutations.append(f"{ref_aa}{pos}{qry_aa}")
    return mutations

# Known VOC spike protein mutations (simplified representation)
variant_mutations = {
    "Alpha (B.1.1.7)": ["N501Y", "D614G", "P681H"],
    "Beta (B.1.351)": ["K417N", "E484K", "N501Y", "D614G"],
    "Delta (B.1.617.2)": ["L452R", "T478K", "D614G", "P681R"],
    "Omicron BA.1": ["G339D", "S371L", "S373P", "S375F", "K417N", "N440K",
                     "G446S", "S477N", "T478K", "E484A", "Q493R", "G496S",
                     "Q498R", "N501Y", "Y505H", "D614G", "H655Y", "N679K",
                     "P681H", "N764K", "D796Y", "N856K", "Q954H", "N969K"],
}

# Analyze which mutations affect the RBD (receptor binding domain) positions 319-541
RBD_START, RBD_END = 319, 541

def is_rbd_mutation(mutation: str) -> bool:
    """Check if mutation is in the receptor-binding domain."""
    pos = int("".join(c for c in mutation[1:-1] if c.isdigit()))
    return RBD_START <= pos <= RBD_END

print("Variant Mutation Analysis:")
print(f"{'Variant':<25} {'Total mutations':>15} {'RBD mutations':>15} {'RBD rate':>12}")
print("-" * 70)

for variant, mutations in variant_mutations.items():
    rbd_muts = [m for m in mutations if is_rbd_mutation(m)]
    rate = len(rbd_muts) / len(mutations) if mutations else 0
    print(f"{variant:<25} {len(mutations):>15} {len(rbd_muts):>15} {rate:>11.0%}")

print(f"\nOmicron RBD mutations: {[m for m in variant_mutations['Omicron BA.1'] if is_rbd_mutation(m)]}")
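To compare variants side by side, the mutation lists can be pivoted into a presence/absence matrix with pandas (the dictionary is abbreviated here so the snippet stands alone):

```python
import pandas as pd

# Abbreviated copy of the variant_mutations table above
variant_mutations = {
    "Alpha (B.1.1.7)": ["N501Y", "D614G", "P681H"],
    "Delta (B.1.617.2)": ["L452R", "T478K", "D614G", "P681R"],
}

# Collect every mutation, ordered by spike position
all_muts = sorted({m for muts in variant_mutations.values() for m in muts},
                  key=lambda m: int(m[1:-1]))

# Variant x mutation matrix: 1 if the variant carries the mutation
matrix = pd.DataFrame(
    [[int(m in muts) for m in all_muts] for muts in variant_mutations.values()],
    index=list(variant_mutations), columns=all_muts,
)
print(matrix)
```

Rows are variants, columns are mutations ordered by spike position; summing a column shows how widely shared a mutation is.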

Part 5: Building a Simple Phylogenetic Tree

Use sequence distances to build a neighbor-joining tree:

python
import matplotlib.pyplot as plt

def distance_matrix_to_upgma_tree(labels: list, dist_matrix: np.ndarray) -> dict:
    """
    Simplified UPGMA clustering.
    Returns a dict mapping node index -> nested-parenthesis label.
    """
    # Convert identity (%) to distance (0 = identical, 1 = maximally different)
    d = {(i, j): (100 - dist_matrix[i][j]) / 100
         for i in range(len(labels)) for j in range(len(labels)) if i != j}

    nodes = {i: labels[i] for i in range(len(labels))}
    sizes = {i: 1 for i in range(len(labels))}
    active = list(range(len(labels)))

    while len(active) > 1:
        # Find the closest pair of active clusters
        ni, nj = min(
            ((a, b) for idx, a in enumerate(active) for b in active[idx + 1:]),
            key=lambda pair: d[pair],
        )
        min_dist = d[(ni, nj)]
        new = len(nodes)
        nodes[new] = f"({nodes[ni]}, {nodes[nj]})"
        print(f"  Merge: {nodes[ni]} + {nodes[nj]} at distance {min_dist:.3f}")

        # Average-linkage update: the distance from any remaining cluster to
        # the new cluster is the size-weighted mean of its distances to the
        # two merged members
        for k in active:
            if k in (ni, nj):
                continue
            d[(new, k)] = d[(k, new)] = (
                sizes[ni] * d[(ni, k)] + sizes[nj] * d[(nj, k)]
            ) / (sizes[ni] + sizes[nj])
        sizes[new] = sizes[ni] + sizes[nj]
        active = [k for k in active if k not in (ni, nj)] + [new]

    return nodes

print("\nBuilding phylogenetic tree:")
tree = distance_matrix_to_upgma_tree(strain_labels, identity_matrix)

Visualization with a Simple Dendrogram

python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Convert identity to distance
dist_array = (100 - identity_matrix) / 100
np.fill_diagonal(dist_array, 0)
condensed = squareform(dist_array)

# UPGMA linkage
Z = linkage(condensed, method="average")

fig, ax = plt.subplots(figsize=(10, 5))
dendrogram(
    Z,
    labels=strain_labels,
    orientation="top",
    leaf_rotation=45,
    ax=ax
)
ax.set_xlabel("Strain")
ax.set_ylabel("Sequence distance")
ax.set_title("Influenza HA Phylogenetic Tree (UPGMA)")
plt.tight_layout()
plt.savefig("influenza_tree.png", dpi=150)
plt.show()
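Downstream tools (FigTree, ete3, Bio.Phylo) usually expect Newick format, and the scipy linkage matrix can be converted with a small recursive helper. A sketch (`linkage_to_newick` is a helper invented here, and the distance matrix is an illustrative stand-in for the one computed above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

def linkage_to_newick(Z, labels):
    """Convert a scipy linkage matrix to a Newick string.

    Branch lengths are parent height minus child height, which is what
    UPGMA's ultrametric tree implies.
    """
    def recurse(node, parent_height):
        length = parent_height - node.dist
        if node.is_leaf():
            return f"{labels[node.id]}:{length:.4f}"
        children = ",".join(
            recurse(child, node.dist)
            for child in (node.get_left(), node.get_right())
        )
        return f"({children}):{length:.4f}"

    root = to_tree(Z)
    return recurse(root, root.dist) + ";"

# Illustrative stand-in for the distance matrix derived above
dist = np.array([
    [0.00, 0.09, 0.12, 0.12],
    [0.09, 0.00, 0.05, 0.06],
    [0.12, 0.05, 0.00, 0.03],
    [0.12, 0.06, 0.03, 0.00],
])
Z = linkage(squareform(dist), method="average")
newick = linkage_to_newick(Z, ["HK/1968", "Syd/1997", "Wis/2005", "Perth/2009"])
print(newick)
```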

Part 6: Practical SARS-CoV-2 Surveillance Pipeline

A simplified version of what genomic surveillance labs do:

python
def sars_cov2_surveillance_summary(accessions: list[str]) -> pd.DataFrame:
    """
    For each accession: fetch the GenBank record and extract basic metadata.
    The signature tables below list the spike mutations a full pipeline would
    check; actual mutation calling requires aligning each genome to the
    reference first and is omitted in this simplified version.
    """
    # Key positions to check (in spike protein)
    SIGNATURE_MUTATIONS = {
        "D614G": (614, "D", "G"),
        "N501Y": (501, "N", "Y"),
        "E484K": (484, "E", "K"),
        "E484A": (484, "E", "A"),
        "L452R": (452, "L", "R"),
        "P681R": (681, "P", "R"),
        "P681H": (681, "P", "H"),
    }

    VARIANT_SIGNATURES = {
        "Omicron": {"N501Y", "E484A", "D614G"},
        "Delta": {"L452R", "D614G", "P681R"},
        "Beta": {"E484K", "N501Y", "D614G"},
        "Alpha": {"N501Y", "D614G", "P681H"},
        "Wuhan-like": set(),
    }

    results = []
    for acc in accessions:
        time.sleep(0.4)
        try:
            handle = Entrez.efetch(db="nucleotide", id=acc, rettype="gb", retmode="text")
            record = SeqIO.read(handle, "genbank")
            handle.close()

            # This is simplified — real pipelines align to reference
            result = {
                "accession": acc,
                "length": len(record.seq),
                "country": "Unknown",
                "date": "Unknown",
            }

            # Extract metadata from qualifiers
            for feature in record.features:
                if feature.type == "source":
                    quals = feature.qualifiers
                    # NCBI renamed the /country qualifier to /geo_loc_name in 2024
                    result["country"] = quals.get("geo_loc_name", quals.get("country", ["Unknown"]))[0]
                    result["date"] = quals.get("collection_date", ["Unknown"])[0]

            results.append(result)
        except Exception as e:
            results.append({"accession": acc, "error": str(e)})

    return pd.DataFrame(results)

# Example accession: MN908947 is the original Wuhan reference genome.
# Extend this list with additional accession numbers to summarize a batch.
example_accessions = ["MN908947"]
df_summary = sars_cov2_surveillance_summary(example_accessions)
print(df_summary)
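The signature tables inside the pipeline can drive a deliberately naive variant caller, once mutations have actually been called against the reference. A sketch (the observed mutation set is invented; real systems match full lineage definitions such as Pangolin's):

```python
# Signature sets copied from the pipeline above. All share D614G, so the
# dict is checked in order; real systems use complete lineage definitions.
VARIANT_SIGNATURES = {
    "Omicron": {"N501Y", "E484A", "D614G"},
    "Delta": {"L452R", "D614G", "P681R"},
    "Beta": {"E484K", "N501Y", "D614G"},
    "Alpha": {"N501Y", "D614G", "P681H"},
}

def assign_variant(observed: set) -> str:
    """Return the first variant whose entire signature appears in `observed`."""
    for variant, signature in VARIANT_SIGNATURES.items():
        if signature <= observed:
            return variant
    return "Wuhan-like / unassigned"

# Invented example mutation set
print(assign_variant({"N501Y", "E484A", "D614G", "T478K"}))  # prints "Omicron"
```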

Summary: BLAST Workflow Patterns

Task                        | Tool/Method
Identify unknown sequence   | BLAST against NT/NR
Compare strains             | Pairwise alignment, identity matrix
Track viral evolution       | Fetch sequences by year/strain, build a phylogenetic tree
Detect resistance mutations | Compare against a reference, annotate positions
Surveillance                | Fetch sequences by country/date, characterize mutations

The full industrial-scale version of this workflow (used in real COVID surveillance) feeds GISAID-scale databases through Nextstrain's augur pipeline for phylogenetic placement and publishes the results as a real-time phylogenetic visualization at nextstrain.org. The underlying logic is exactly what we've covered here; the scale is different, but the biology is the same.

GISAID and NCBI as viral sequence sources

For SARS-CoV-2, GISAID (gisaid.org) is the primary repository — over 15 million sequences deposited. Access requires registration and agreeing to data sharing terms.

NCBI's Virus portal (ncbi.nlm.nih.gov/labs/virus) provides a web and API interface for SARS-CoV-2 and other virus sequences from GenBank, with filtering by country, date, host, and genome completeness.

For influenza, NCBI's Influenza Database (ncbi.nlm.nih.gov/genomes/FLU) provides full genome sequences with standardized segment labels.