Bioinformatics has a reproducibility problem. A 2015 survey found that over 70% of researchers had tried and failed to reproduce another lab's computational analysis. The causes are well understood: undocumented software versions, hardcoded file paths, analyses run interactively without recording steps, random seeds not fixed, and data modifications without provenance tracking.
This matters scientifically (results that can't be reproduced can't be built upon) and practically (you won't be able to reproduce your own analysis six months later without these practices). This chapter covers the practices that separate analysis that works once from analysis that works reliably.
The Reproducibility Stack
Full reproducibility requires addressing multiple layers:
Code (exact version)
↓
Software environment (exact tool versions, dependencies)
↓
Data (exact input, with checksums)
↓
Compute environment (OS, CPU architecture)
↓
Randomness (fixed seeds)
↓
Documentation (parameters, decisions made)
Failure at any layer breaks reproducibility. Each layer has its solution.
Version Control: Git for Analysis
Code without version control is unscientific. Every analysis script, pipeline configuration, and data processing step should be in git.
What to commit:
- Analysis scripts (Python, R, bash)
- Pipeline configuration files (Snakemake/Nextflow rules)
- Parameter files (sample sheets, configuration YAML)
- Custom utility functions
What NOT to commit:
- Raw data (too large; use data management tools)
- Intermediate files (too large; regenerate from pipeline)
- Credentials or API keys
- Large binary files (use Git LFS if necessary)
Commit granularity: commit per logical step ("add read trimming step to pipeline"), not per keystroke and not per week. The commit history should tell the story of the analysis.
Branch strategy for analyses: use feature branches for experimental changes; merge to main when validated. Tag releases that correspond to paper submissions — you need to be able to check out exactly the code that generated the figures.
main: stable, reviewed analysis code
├── feature/add-deseq2-analysis
├── feature/try-alternative-normalization
└── tags: submission-v1, revision-v1
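A minimal sketch of the tagging workflow (tag names are illustrative):

# Mark the exact commit that produced the submitted figures
git tag -a submission-v1 -m "code state at manuscript submission"
git push origin submission-v1
# Later, recover exactly that state
git checkout submission-v1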
Environment Management
The Dependency Problem
Bioinformatics tools are notorious for complex dependency chains and version sensitivity. An analysis run with DESeq2 version 1.28 may give different results than the same analysis with version 1.36 (due to algorithm updates). Tools may conflict with each other's dependencies.
Conda and Mamba
Conda creates isolated environments with reproducible package versions:
# Create environment from specification file
conda env create -f environment.yml
# environment.yml
name: rnaseq-analysis
channels:
- bioconda
- conda-forge
dependencies:
- python=3.10
- snakemake=7.32
- fastqc=0.12
- star=2.7.10
- samtools=1.17
- r-base=4.3
- bioconductor-deseq2=1.40
Always pin exact versions in environment.yml. python>=3.8 is not reproducible; python=3.10.6 is.
Mamba: drop-in replacement for conda with faster dependency resolution (C++ solver vs. Python). Use mamba wherever you'd use conda.
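A short sketch of both points, assuming the environment.yml above: record the fully resolved versions for the run, and let mamba handle the slow solving step.

# Capture the exact resolved versions (including build strings) for the record
conda env export > environment-lock.yml
# Mamba accepts the same subcommands and files as conda
mamba env create -f environment.yml
conda activate rnaseq-analysis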
Containers: Docker and Singularity
Conda captures Python/R packages but not the underlying OS, system libraries, or compiled binaries. Containers capture everything:
Docker: packages the entire runtime environment (OS, system libraries, tools) into a portable image. The same Docker image runs identically on any Linux machine of the same CPU architecture.
FROM continuumio/miniconda3:23.3.1-0
COPY environment.yml .
RUN conda env create -f environment.yml
Singularity/Apptainer: a container runtime designed for HPC environments, where Docker is usually unavailable because it requires root privileges. It converts and runs Docker images without root and is standard at most research computing clusters.
Best practice: define your environment as a Dockerfile or Conda environment file, build the container, and record the image hash in your pipeline configuration. The image hash (SHA256) is an exact fingerprint of the software environment.
Most HPC clusters don't allow Docker (requires root), but do support Singularity/Apptainer. Nextflow and Snakemake both support Singularity natively: specify a container image per rule/process, and the workflow manager pulls and runs the container automatically. This is the recommended approach for pipeline portability across HPC systems.
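A sketch of that workflow on a cluster, assuming the Snakemake 7.x version pinned above and a singularity profile defined in nextflow.config (the container tag is illustrative):

# Pull a Docker image once and convert it to a local Singularity image file
singularity pull docker://quay.io/biocontainers/samtools:1.17--h00cdaf9_0
# Snakemake: run rules inside the containers named by their singularity: directives
snakemake --cores 16 --use-singularity
# Nextflow: enable the singularity profile for the whole run
nextflow run main.nf -profile singularity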
Workflow Managers
Ad hoc bash scripts for multi-step analyses have predictable failure modes:
- No way to resume from failure
- Re-runs entire analysis when you change one step
- No parallelization management
- No resource specification
Workflow managers solve all of these.
Snakemake
Snakemake uses a rule-based system: define rules that specify inputs, outputs, and the shell command. Snakemake builds a dependency DAG from the target output files and runs only what's needed.
# Snakefile
rule trim_reads:
    input:
        r1 = "data/raw/{sample}_R1.fastq.gz",
        r2 = "data/raw/{sample}_R2.fastq.gz"
    output:
        r1 = "results/trimmed/{sample}_R1.fastq.gz",
        r2 = "results/trimmed/{sample}_R2.fastq.gz"
    threads: 4
    resources:
        mem_mb = 8000
    shell:
        "trimmomatic PE -threads {threads} {input.r1} {input.r2} "
        "{output.r1} /dev/null {output.r2} /dev/null ..."

rule star_align:
    input:
        r1 = "results/trimmed/{sample}_R1.fastq.gz",
        r2 = "results/trimmed/{sample}_R2.fastq.gz",
        index = "data/reference/star_index"
    output:
        bam = "results/aligned/{sample}.bam"
    ...
Key Snakemake features:
- Wildcards ({sample}): automatically scale a rule to all samples
- Checkpointing: if a step fails, the re-run resumes from that step, not the beginning
- HPC integration: submit each rule as a separate cluster job automatically (see the invocation sketch after this list)
- Container support: a conda: or singularity: directive per rule
- Report generation: HTML report with runtime statistics and output file provenance
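A typical invocation, assuming the Snakefile above and the Snakemake 7.x version pinned in the environment file (the profile name is illustrative and must exist as a configured Snakemake profile):

# Dry run: show which jobs would run, without executing anything
snakemake -n
# Run locally with per-rule conda environments on 16 cores
snakemake --cores 16 --use-conda
# Submit each rule as a cluster job through a pre-configured profile
snakemake --profile slurm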
Nextflow
Nextflow uses a dataflow paradigm: processes communicate via channels (queues of data). It is more powerful for complex branching workflows and cloud execution.
process TRIM_READS {
    input:
    tuple val(sample), path(reads)

    output:
    tuple val(sample), path("*.trimmed.fastq.gz")

    script:
    """
    trimmomatic PE ${reads[0]} ${reads[1]} ...
    """
}

workflow {
    reads = Channel.fromFilePairs("data/raw/*_{R1,R2}.fastq.gz")
    trimmed = TRIM_READS(reads)
    ALIGN(trimmed)
}
nf-core: a community collection of Nextflow pipelines for common bioinformatics tasks (RNA-seq, ChIP-seq, variant calling, etc.) following best practices. For standard analyses, using an nf-core pipeline is often better than writing your own — the pipelines are extensively tested, documented, and maintained.
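For example, a hedged sketch of launching the nf-core RNA-seq pipeline (the release number and genome key are illustrative; check the pipeline documentation for current parameters):

# Pin the pipeline release with -r so the exact workflow version is recorded
nextflow run nf-core/rnaseq -r 3.14.0 \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results/ \
    --genome GRCh38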
When to Use Which
| Use Case | Tool |
|---|---|
| Custom analysis pipeline | Snakemake (Python-native, easy to learn) |
| Production pipeline, cloud | Nextflow (better cloud support) |
| Standard RNA-seq/ATAC-seq | nf-core pipeline |
| Single scripts, Jupyter | conda environment + git |
Data Management
Immutable Raw Data
Raw data should never be modified. Treat it as read-only:
- Set directory permissions to read-only after initial download
- Verify checksums (MD5/SHA256) after download and periodically
- Keep raw data separate from processed data
Never modify in place: if a processing step requires format conversion, write to a new file. The chain from raw → processed must be fully traceable.
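A minimal sketch of both habits, assuming FASTQ files under data/raw/:

# Record checksums right after download, then make the raw data read-only
md5sum data/raw/*.fastq.gz > data/raw/checksums.md5
chmod -R a-w data/raw/
# Periodically verify that nothing has changed
md5sum -c data/raw/checksums.md5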
Data Versioning with DVC
DVC (Data Version Control): Git extension for versioning large data files. Stores data on S3/GCS/local storage; commits only pointers (checksums) to git.
dvc add data/raw/rnaseq_counts.tsv # track file in DVC
git add data/raw/rnaseq_counts.tsv.dvc # commit pointer
dvc push # push data to remote storage
This gives you the same workflow as git for data: dvc pull to download the exact data version corresponding to a commit.
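For example, to recover the data state behind an old result (assuming the submission-v1 tag from earlier and a configured DVC remote):

# Check out the code and the .dvc pointer files at the tagged version
git checkout submission-v1
# Fetch the exact data files those pointers reference
dvc pull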
Sample Sheets and Metadata
Every analysis should have a structured sample sheet that maps sample IDs to:
- File paths (raw data)
- Biological metadata (condition, genotype, tissue, timepoint)
- Technical metadata (batch, sequencing date, library prep)
- QC metrics (if available)
Keep sample sheets in git (they're small). Never encode sample information in file names as the sole source of truth — file names get changed, copied incorrectly, or truncated.
Computational Notebooks: Jupyter Best Practices
Jupyter notebooks are excellent for exploratory analysis and reporting, but problematic for reproducibility when used naively:
- Cell execution order is not recorded — notebooks can be in a state that can't be reproduced by running top-to-bottom
- Hidden state accumulates from re-running cells in arbitrary order
- Large notebooks are hard to review and test
Best practices:
- Restart and run all before committing — ensure the notebook runs cleanly from top to bottom
- Parameterized notebooks: use Papermill to run notebooks with different parameters — each run produces a separate output notebook, creating a record
- Extract reusable code to .py modules; notebooks should call functions, not define them
- Clear outputs before committing (or use an nbstripout git hook; see the example after this list) — output data in notebooks makes diffs unreadable and inflates repository size
- Pin the kernel environment in the notebook metadata or a companion requirements.txt
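A sketch of the nbstripout and Papermill items above (file and parameter names are illustrative; Papermill injects -p values into a cell tagged "parameters"):

# One-time setup: strip notebook outputs automatically on every commit
nbstripout --install
# Execute a parameterized notebook; each run leaves its own output notebook as a record
papermill analysis.ipynb results/analysis_fdr05.ipynb -p fdr_threshold 0.05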
Randomness and Seeds
Any analysis using randomization must fix the random seed:
- Clustering (k-means initialization)
- Dimensionality reduction (t-SNE, UMAP)
- Train/test splits in ML
- Bootstrapping and permutation tests
- Stochastic gradient descent (neural networks)
import numpy as np
import random
import torch
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
# For CUDA:
torch.cuda.manual_seed_all(SEED)
Report the seed in methods. If the analysis is sensitive to seed choice, run with multiple seeds and report the distribution of results.
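One way to check seed sensitivity from the shell (the script name and flags are hypothetical placeholders for your own entry point):

# Repeat the analysis under several seeds and keep each result separately
for seed in 1 2 3 4 5; do
    python cluster_analysis.py --seed "$seed" --outdir "results/seed_${seed}/"
done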
Code Quality in Analysis
Analysis code in academia is often held to lower quality standards than production software, but those shortcuts cost reproducibility:
Configurable parameters: never hardcode file paths, thresholds, or parameters in scripts. Use a configuration file (YAML/TOML) or command-line arguments. This makes re-running with different parameters trivial and documents what parameters were used.
# config.yaml
input:
  counts: data/counts_matrix.tsv
  metadata: data/sample_metadata.csv
deseq2:
  fdr_threshold: 0.05
  lfc_threshold: 1.0
  reference_level: "control"
output:
  dir: results/deseq2/
Logging over print statements: use Python's logging module or R's futile.logger. Logs should include timestamps, parameter values, and key metrics (input size, output size, runtime). A log file is a record of what actually happened during a run.
Intermediate outputs: write intermediate results at key steps. If a downstream step fails, you can resume without re-running expensive upstream steps (even without a workflow manager).
Reporting and Documentation
Methods section clarity: the methods section of a bioinformatics paper should be reproducible. Include exact tool names, versions, command-line parameters, and reference genome/annotation versions used.
Bad methods: "We performed differential expression analysis using DESeq2 with default parameters."
Good methods: "Differential expression analysis was performed with DESeq2 v1.40.0 (Love et al. 2014), using negative binomial GLM with Wald test. The design formula was ~batch + condition, where batch corrected for the two sequencing runs. Genes were filtered to those with mean normalized counts > 10 across all samples. p-values were adjusted with Benjamini-Hochberg FDR; genes with padj < 0.05 and |log2FC| > 1 were called differentially expressed."
Supplementary tables: provide all processed results (full differential expression tables, not just top hits) as downloadable files. Reviewers and future researchers need the complete results to validate findings or use as prior information.
The Reproducibility Checklist
Before submitting a paper or sharing an analysis:
- All code in a git repository with a README
- Software environment specified (conda environment.yml or Dockerfile)
- Raw data archived with checksums
- Analysis pipeline runs end-to-end from raw data to figures
- All random seeds fixed and documented
- Analysis parameters in config files, not hardcoded
- Methods section includes exact tool versions and parameters
- Full results tables in supplementary data (not just filtered highlights)
- Repository tagged at submission version
A common self-test for reproducibility: "Can someone with no knowledge of this project reproduce the key figures in six months using only the repository and the paper's methods section?" If the answer is no, the analysis is not truly reproducible. The "someone" might be you — lab members move on, memory fades, and you will need to revisit this analysis.
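One concrete way to run this self-test is a fresh clone on a clean machine (the repository URL is illustrative):

# Start from nothing but the repository and the documented environment
git clone https://github.com/your-lab/project-analysis.git
cd project-analysis
conda env create -f environment.yml
conda activate rnaseq-analysis
# The pipeline should rebuild the paper's figures end-to-end
snakemake --cores 8 --use-conda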
Tools Summary
| Category | Tool | Purpose |
|---|---|---|
| Version control | Git + GitHub/GitLab | Code versioning and collaboration |
| Environment | Conda/Mamba | Package management |
| Containers | Docker + Singularity | Full environment reproducibility |
| Workflow | Snakemake | Pipeline management (Python-native) |
| Workflow | Nextflow + nf-core | Pipeline management (cloud/HPC) |
| Data versioning | DVC | Large file versioning |
| Notebooks | Papermill | Parameterized notebook execution |
| Randomness | Fixed seeds | Stochastic reproducibility |
Reproducibility is not a bureaucratic requirement — it is the basic scientific standard that allows findings to be built upon. Every hour invested in these practices saves many hours of confusion, re-analysis, and failed replication.