Don't demand reproducibility from researchers—demand it from their tools.

The Reproducibility Crisis

BDP is a dependency manager for biological databases—treating UniProt, NCBI, and other data sources like software packages with version control and lockfiles.
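For example, a project's manifest and lockfile pair might look like this (a sketch only; the field names and lockfile layout shown here are illustrative, not necessarily bdp's actual schema):

```yaml
# bdp.yml — the human-edited manifest listing which data sources a project needs
# (illustrative schema)
sources:
  - uniprot:P01308-fasta@1.0

# bdp.lock — machine-generated pins recorded at pull time (illustrative)
# uniprot:P01308-fasta@1.0:
#   release: 2024_01
#   sha256: <checksum recorded when the file was pulled>
```

Committing both files gives collaborators the same guarantee a software lockfile does: the manifest says what you depend on, and the lockfile pins exactly which bytes satisfy it.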

Only 11% of bioinformatics studies can be reproduced [1], with data versioning being a major contributing factor. The core problem isn't researchers—it's the tooling. Labs spend 4-12 hours per project on manual data management: writing download scripts, verifying checksums, coordinating versions with collaborators, and documenting data provenance for publications. Research shows workflow automation can save 30-75% of this time [2]. With BDP, these tasks take ~15 minutes.

[1] Leipzig, J. et al. (2021). The five pillars of computational reproducibility: bioinformatics and beyond. Briefings in Bioinformatics, 22(6).

[2] Spjuth, O. et al. (2015). Experiences with workflows for automating data-intensive bioinformatics. Biology Direct, 10(1).


Here are some workflow examples showing BDP in action (examples use Git for version control, which is recommended but not required—see Best Practices for details):

Example Workflow: Protein Analysis Project

A typical project analyzing insulin variants across species.

Step 1: Find the right data ~15-20 min → 5 sec

Before:

Browse UniProt website, search for "insulin", manually identify accession IDs, note down version and release date

With BDP:

```bash
bdp search "insulin homo sapiens"
# uniprot:P01308-fasta@1.0 - http://localhost:3000/sources/uniprot/P01308
```

Step 2: Download specific proteins ~30-45 min → 30 sec

Before:

Navigate UniProt FTP, find correct directory structure, write wget script, download files, verify manually

With BDP:

```bash
bdp source add uniprot:P01308-fasta@1.0
bdp pull
```

Step 3: Verify data integrity ~10-15 min → 2 sec

Before:

Download checksums separately, run shasum, compare manually, repeat if mismatch
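In shell, that manual check looks roughly like this (the filenames and stand-in data below are hypothetical; in practice the checksum file comes from the data provider):

```bash
# Manual integrity check that `bdp audit` automates (illustrative)
printf '>sp|P01308|INS_HUMAN stand-in record\nMALWMRLLPLLALLALWGPD\n' > P01308.fasta
sha256sum P01308.fasta > checksums.txt   # normally downloaded from the provider
sha256sum -c checksums.txt               # prints "P01308.fasta: OK" on a match
```

On a mismatch, `sha256sum -c` reports `FAILED` and exits non-zero, which is the signal you would otherwise have to eyeball by comparing hex digests by hand.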

With BDP:

```bash
bdp audit
# ✓ All sources verified
```

Step 4: Share with collaborators ~1-3 hours → 1 min

Before:

Upload files to shared server, send download link via email, explain which version/release, collaborator downloads, confirms they have the right version

With BDP:

```bash
# You: Commit bdp.yml and bdp.lock to git repository
git add bdp.yml bdp.lock
git commit -m "Add insulin data sources"
git push

# Collaborator: Clone and retrieve data in one command
git clone <repo>
bdp pull
```

Step 5: Six months later - reproduce the analysis ~2-6 hours → 10 sec

Before:

Searching through emails and chat history to reconstruct which database versions were used, checking whether files still exist on shared storage, finding broken download links, and attempting to locate archived versions from months ago

With BDP:

```bash
git checkout <commit-from-6-months-ago>
bdp pull
# Exact same data, guaranteed
```

Step 6: Write the paper ~45-90 min → 5 sec

Before:

Manually write Data Availability Statement, look up correct citations for UniProt, format BibTeX entries, include version numbers and dates

With BDP:

```bash
bdp audit export --format das > data-availability.md
bdp cite --format bibtex > references.bib
```

data-availability.md:

```markdown
## Data Availability Statement

Protein sequence data were obtained from UniProt release 2024_01 (accessed
January 15, 2024) using bdp (Bioinformatics Dependencies Platform - http://localhost:3000).
Specifically, we used human insulin precursor (UniProt ID: P01308) obtained via
`bdp pull` with package identifier uniprot:P01308-fasta@1.0. All data sources
are version-pinned in the project repository with cryptographic checksums to
ensure reproducibility (https://github.com/lab/project).
```

references.bib:

```bibtex
@misc{uniprot_P01308_2024,
  author = {{The UniProt Consortium}},
  title = {UniProt: P01308 - Insulin (INS)},
  year = {2024},
  note = {UniProt Release 2024\_01, accessed January 15, 2024},
  url = {https://www.uniprot.org/uniprotkb/P01308},
  version = {2024\_01}
}
```

Total time: ~4-12 hours → ~15 minutes

*Note: Time estimates are illustrative based on researcher interviews. We're actively collecting data on workflow inefficiencies. Have data or want to share your experience? Contact us or open a discussion.*

Another Tool? We Know.

Tool fatigue is real. But this takes 30 seconds to try—see if it actually solves your problems.

View installation guide