Bioinformatics has a reproducibility problem. A 2015 survey found that over 70% of researchers had tried and failed to reproduce another lab's computational analysis. The causes are well understood: undocumented software versions, hardcoded file paths, analyses run interactively without recording steps, random seeds not fixed, and data modifications without provenance tracking.
This matters scientifically (results that can't be reproduced can't be built upon) and practically (you won't be able to reproduce your own analysis six months later without these practices). This chapter covers the practices that separate analysis that works once from analysis that works reliably.
The Reproducibility Stack
Full reproducibility requires addressing multiple layers:
Code (exact version)
↓
Software environment (exact tool versions, dependencies)
↓
Data (exact input, with checksums)
↓
Compute environment (OS, CPU architecture)
↓
Randomness (fixed seeds)
↓
Documentation (parameters, decisions made)
Failure at any layer breaks reproducibility. Each layer has its solution.
Version Control: Git for Analysis
Code without version control is unscientific. Every analysis script, pipeline configuration, and data processing step should be in git.
What to commit:
- Analysis scripts (Python, R, bash)
- Pipeline configuration files (Snakemake/Nextflow rules)
- Parameter files (sample sheets, configuration YAML)
- Custom utility functions
What NOT to commit:
- Raw data (too large; use data management tools)
- Intermediate files (too large; regenerate from pipeline)
- Credentials or API keys
- Large binary files (use Git LFS if necessary)
Commit granularity: commit per logical step ("add read trimming step to pipeline"), not per keystroke and not per week. The commit history should tell the story of the analysis.
Branch strategy for analyses: use feature branches for experimental changes; merge to main when validated. Tag releases that correspond to paper submissions — you need to be able to check out exactly the code that generated the figures.
main: stable, reviewed analysis code
├── feature/add-deseq2-analysis
├── feature/try-alternative-normalization
└── tags: submission-v1, revision-v1
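A minimal sketch of the tagging workflow (tag names are illustrative):

# Mark the exact commit that produced the submitted figures
git tag -a submission-v1 -m "code state at manuscript submission"
git push origin submission-v1
# Later, recover exactly that state
git checkout submission-v1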
Environment Management
The Dependency Problem
Bioinformatics tools are notorious for complex dependency chains and version sensitivity. An analysis run with DESeq2 version 1.28 may give different results than the same analysis with version 1.36 (due to algorithm updates). Tools may conflict with each other's dependencies.
Conda and Mamba
Conda creates isolated environments with reproducible package versions:
# Create environment from specification file
conda env create -f environment.yml
# environment.yml
name: rnaseq-analysis
channels:
- bioconda
- conda-forge
dependencies:
- python=3.10
- snakemake=7.32
- fastqc=0.12
- star=2.7.10
- samtools=1.17
- r-base=4.3
- bioconductor-deseq2=1.40
Always pin exact versions in environment.yml. python>=3.8 is not reproducible; python=3.10.6 is.
Mamba: drop-in replacement for conda with faster dependency resolution (C++ solver vs. Python). Use mamba wherever you'd use conda.
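A short sketch of both points, assuming the environment.yml above: record the fully resolved versions for the run, and let mamba handle the slow solving step.

# Capture the exact resolved versions (including build strings) for the record
conda env export > environment-lock.yml
# Mamba accepts the same subcommands and files as conda
mamba env create -f environment.yml
conda activate rnaseq-analysis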
Containers: Docker and Singularity
Conda captures Python/R packages but not the underlying OS, system libraries, or compiled binaries. Containers capture everything:
Docker: packages the entire runtime environment (OS, system libraries, tools) into a portable image. The same Docker image runs identically on any Linux machine of the same CPU architecture.
FROM continuumio/miniconda3:23.3.1-0
COPY environment.yml .
RUN conda env create -f environment.yml
Singularity/Apptainer: a container runtime designed for HPC environments, where Docker is usually unavailable because it requires root privileges. It converts and runs Docker images without root and is standard at most research computing clusters.
Best practice: define your environment as a Dockerfile or Conda environment file, build the container, and record the image hash in your pipeline configuration. The image hash (SHA256) is an exact fingerprint of the software environment.
Most HPC clusters don't allow Docker (requires root), but do support Singularity/Apptainer. Nextflow and Snakemake both support Singularity natively: specify a container image per rule/process, and the workflow manager pulls and runs the container automatically. This is the recommended approach for pipeline portability across HPC systems.
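A sketch of that workflow on a cluster, assuming the Snakemake 7.x version pinned above and a singularity profile defined in nextflow.config (the container tag is illustrative):

# Pull a Docker image once and convert it to a local Singularity image file
singularity pull docker://quay.io/biocontainers/samtools:1.17--h00cdaf9_0
# Snakemake: run rules inside the containers named by their singularity: directives
snakemake --cores 16 --use-singularity
# Nextflow: enable the singularity profile for the whole run
nextflow run main.nf -profile singularity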
Workflow Managers
Ad hoc bash scripts for multi-step analyses have predictable failure modes:
- No way to resume from failure
- Re-runs entire analysis when you change one step
- No parallelization management
- No resource specification
Workflow managers solve all of these.
Snakemake
Snakemake uses a rule-based system: define rules that specify inputs, outputs, and the shell command. Snakemake builds a dependency DAG from the target output files and runs only what's needed.
# Snakefile
rule trim_reads:
    input:
        r1 = "data/raw/{sample}_R1.fastq.gz",
        r2 = "data/raw/{sample}_R2.fastq.gz"
    output:
        r1 = "results/trimmed/{sample}_R1.fastq.gz",
        r2 = "results/trimmed/{sample}_R2.fastq.gz"
    threads: 4
    resources:
        mem_mb = 8000
    shell:
        "trimmomatic PE -threads {threads} {input.r1} {input.r2} "
        "{output.r1} /dev/null {output.r2} /dev/null ..."

rule star_align:
    input:
        r1 = "results/trimmed/{sample}_R1.fastq.gz",
        r2 = "results/trimmed/{sample}_R2.fastq.gz",
        index = "data/reference/star_index"
    output:
        bam = "results/aligned/{sample}.bam"
    ...
Key Snakemake features:
- Wildcards ({sample}): automatically scale a rule to all samples
- Checkpointing: if a step fails, the re-run resumes from that step, not the beginning
- HPC integration: submit each rule as a separate cluster job automatically (see the invocation sketch after this list)
- Container support: a conda: or singularity: directive per rule
- Report generation: HTML report with runtime statistics and output file provenance
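A typical invocation, assuming the Snakefile above and the Snakemake 7.x version pinned in the environment file (the profile name is illustrative and must exist as a configured Snakemake profile):

# Dry run: show which jobs would run, without executing anything
snakemake -n
# Run locally with per-rule conda environments on 16 cores
snakemake --cores 16 --use-conda
# Submit each rule as a cluster job through a pre-configured profile
snakemake --profile slurm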
Nextflow
Nextflow uses a dataflow paradigm: processes communicate via channels (queues of data). It is more powerful for complex branching workflows and cloud execution.
process TRIM_READS {
    input:
    tuple val(sample), path(reads)

    output:
    tuple val(sample), path("*.trimmed.fastq.gz")

    script:
    """
    trimmomatic PE ${reads[0]} ${reads[1]} ...
    """
}

workflow {
    reads = Channel.fromFilePairs("data/raw/*_{R1,R2}.fastq.gz")
    trimmed = TRIM_READS(reads)
    ALIGN(trimmed)
}
nf-core: a community collection of Nextflow pipelines for common bioinformatics tasks (RNA-seq, ChIP-seq, variant calling, etc.) following best practices. For standard analyses, using an nf-core pipeline is often better than writing your own — the pipelines are extensively tested, documented, and maintained.
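For example, a hedged sketch of launching the nf-core RNA-seq pipeline (the release number and genome key are illustrative; check the pipeline documentation for current parameters):

# Pin the pipeline release with -r so the exact workflow version is recorded
nextflow run nf-core/rnaseq -r 3.14.0 \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results/ \
    --genome GRCh38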
When to Use Which
| Use Case | Tool |
|---|---|
| Custom analysis pipeline | Snakemake (Python-native, easy to learn) |
| Production pipeline, cloud | Nextflow (better cloud support) |
| Standard RNA-seq/ATAC-seq | nf-core pipeline |
| Single scripts, Jupyter | conda environment + git |
Data Management
Immutable Raw Data
Raw data should never be modified. Treat it as read-only:
- Set directory permissions to read-only after initial download
- Verify checksums (MD5/SHA256) after download and periodically
- Keep raw data separate from processed data
Never modify in place: if a processing step requires format conversion, write to a new file. The chain from raw → processed must be fully traceable.
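A minimal sketch of both habits, assuming FASTQ files under data/raw/:

# Record checksums right after download, then make the raw data read-only
md5sum data/raw/*.fastq.gz > data/raw/checksums.md5
chmod -R a-w data/raw/
# Periodically verify that nothing has changed
md5sum -c data/raw/checksums.md5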
Data Versioning with DVC
DVC (Data Version Control): Git extension for versioning large data files. Stores data on S3/GCS/local storage; commits only pointers (checksums) to git.
dvc add data/raw/rnaseq_counts.tsv # track file in DVC
git add data/raw/rnaseq_counts.tsv.dvc # commit pointer
dvc push # push data to remote storage
This gives you the same workflow as git for data: dvc pull to download the exact data version corresponding to a commit.
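For example, to recover the data state behind an old result (assuming the submission-v1 tag from earlier and a configured DVC remote):

# Check out the code and the .dvc pointer files at the tagged version
git checkout submission-v1
# Fetch the exact data files those pointers reference
dvc pull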
Sample Sheets and Metadata
Every analysis should have a structured sample sheet that maps sample IDs to:
- File paths (raw data)
- Biological metadata (condition, genotype, tissue, timepoint)
- Technical metadata (batch, sequencing date, library prep)
- QC metrics (if available)
Keep sample sheets in git (they're small). Never encode sample information in file names as the sole source of truth — file names get changed, copied incorrectly, or truncated.
Computational Notebooks: Jupyter Best Practices
Jupyter notebooks are excellent for exploratory analysis and reporting, but problematic for reproducibility when used naively:
- Cell execution order is not recorded — notebooks can be in a state that can't be reproduced by running top-to-bottom
- Hidden state accumulates from re-running cells in arbitrary order
- Large notebooks are hard to review and test
Best practices:
- Restart and run all before committing — ensure the notebook runs cleanly from top to bottom
- Parameterized notebooks: use Papermill to run notebooks with different parameters — each run produces a separate output notebook, creating a record
- Extract reusable code to .py modules; notebooks should call functions, not define them
- Clear outputs before committing (or use an nbstripout git hook; see the example after this list) — output data in notebooks makes diffs unreadable and inflates repository size
- Pin the kernel environment in the notebook metadata or a companion requirements.txt
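A sketch of the nbstripout and Papermill items above (file and parameter names are illustrative; Papermill injects -p values into a cell tagged "parameters"):

# One-time setup: strip notebook outputs automatically on every commit
nbstripout --install
# Execute a parameterized notebook; each run leaves its own output notebook as a record
papermill analysis.ipynb results/analysis_fdr05.ipynb -p fdr_threshold 0.05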
Randomness and Seeds
Any analysis using randomization must fix the random seed:
- Clustering (k-means initialization)
- Dimensionality reduction (t-SNE, UMAP)
- Train/test splits in ML
- Bootstrapping and permutation tests
- Stochastic gradient descent (neural networks)
import numpy as np
import random
import torch
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
# For CUDA:
torch.cuda.manual_seed_all(SEED)
Report the seed in methods. If the analysis is sensitive to seed choice, run with multiple seeds and report the distribution of results.
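One way to check seed sensitivity from the shell (the script name and flags are hypothetical placeholders for your own entry point):

# Repeat the analysis under several seeds and keep each result separately
for seed in 1 2 3 4 5; do
    python cluster_analysis.py --seed "$seed" --outdir "results/seed_${seed}/"
done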
Code Quality in Analysis
Analysis code in academia is often held to lower quality standards than production software, but those shortcuts cost reproducibility:
Configurable parameters: never hardcode file paths, thresholds, or parameters in scripts. Use a configuration file (YAML/TOML) or command-line arguments. This makes re-running with different parameters trivial and documents what parameters were used.
# config.yaml
input:
  counts: data/counts_matrix.tsv
  metadata: data/sample_metadata.csv
deseq2:
  fdr_threshold: 0.05
  lfc_threshold: 1.0
  reference_level: "control"
output:
  dir: results/deseq2/
Logging over print statements: use Python's logging module or R's futile.logger. Logs should include timestamps, parameter values, and key metrics (input size, output size, runtime). A log file is a record of what actually happened during a run.
Intermediate outputs: write intermediate results at key steps. If a downstream step fails, you can resume without re-running expensive upstream steps (even without a workflow manager).
Reporting and Documentation
Methods section clarity: the methods section of a bioinformatics paper should be reproducible. Include exact tool names, versions, command-line parameters, and reference genome/annotation versions used.
Bad methods: "We performed differential expression analysis using DESeq2 with default parameters."
Good methods: "Differential expression analysis was performed with DESeq2 v1.40.0 (Love et al. 2014), using negative binomial GLM with Wald test. The design formula was ~batch + condition, where batch corrected for the two sequencing runs. Genes were filtered to those with mean normalized counts > 10 across all samples. p-values were adjusted with Benjamini-Hochberg FDR; genes with padj < 0.05 and |log2FC| > 1 were called differentially expressed."
Supplementary tables: provide all processed results (full differential expression tables, not just top hits) as downloadable files. Reviewers and future researchers need the complete results to validate findings or use as prior information.
The Reproducibility Checklist
Before submitting a paper or sharing an analysis:
- All code in a git repository with a README
- Software environment specified (conda environment.yml or Dockerfile)
- Raw data archived with checksums
- Analysis pipeline runs end-to-end from raw data to figures
- All random seeds fixed and documented
- Analysis parameters in config files, not hardcoded
- Methods section includes exact tool versions and parameters
- Full results tables in supplementary data (not just filtered highlights)
- Repository tagged at submission version
A common self-test for reproducibility: "Can someone with no knowledge of this project reproduce the key figures in six months using only the repository and the paper's methods section?" If the answer is no, the analysis is not truly reproducible. The "someone" might be you — lab members move on, memory fades, and you will need to revisit this analysis.
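One concrete way to run this self-test is a fresh clone on a clean machine (the repository URL is illustrative):

# Start from nothing but the repository and the documented environment
git clone https://github.com/your-lab/project-analysis.git
cd project-analysis
conda env create -f environment.yml
conda activate rnaseq-analysis
# The pipeline should rebuild the paper's figures end-to-end
snakemake --cores 8 --use-conda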
Tools Summary
| Category | Tool | Purpose |
|---|---|---|
| Version control | Git + GitHub/GitLab | Code versioning and collaboration |
| Environment | Conda/Mamba | Package management |
| Containers | Docker + Singularity | Full environment reproducibility |
| Workflow | Snakemake | Pipeline management (Python-native) |
| Workflow | Nextflow + nf-core | Pipeline management (cloud/HPC) |
| Data versioning | DVC | Large file versioning |
| Notebooks | Papermill | Parameterized notebook execution |
| Randomness | Fixed seeds | Stochastic reproducibility |
Reproducibility is not a bureaucratic requirement — it is the basic scientific standard that allows findings to be built upon. Every hour invested in these practices saves many hours of confusion, re-analysis, and failed replication.