
Jupyter Integration#

Jupyter notebooks work with BDP through the standard Python integration: run bdp generate python to produce a paths module, import it in your notebook, and every data path resolves to the exact locked version.

Setup#

Run this once from a terminal in the project directory:

bash
bdp pull
bdp generate python > bdp_data.py

Then from any notebook in the same directory:

python
from bdp_data import BDP_SOURCES
fasta = BDP_SOURCES["uniprot:P01308-fasta@1.0"]
print(fasta)
# /home/user/project/.bdp/data/sources/uniprot/P01308/1.0/P01308_1.0.fasta
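If a notebook might run before bdp pull has completed, a small guard cell fails fast with an actionable message. The following is a sketch, not part of BDP: require_source is a hypothetical helper, and the demo mapping with a temporary file stands in for BDP_SOURCES.

```python
import tempfile
from pathlib import Path

def require_source(sources: dict, key: str) -> Path:
    """Return the resolved path for a locked source, failing early if absent."""
    if key not in sources:
        raise KeyError(f"{key!r} not in the paths module - rerun 'bdp generate python'")
    path = Path(sources[key])
    if not path.exists():
        raise FileNotFoundError(f"{path} is missing - run 'bdp pull' first")
    return path

# Demo: a temporary file stands in for a pulled source.
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    demo_sources = {"uniprot:P01308-fasta@1.0": tmp.name}

fasta = require_source(demo_sources, "uniprot:P01308-fasta@1.0")
```

In a real notebook you would pass BDP_SOURCES instead of demo_sources.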

Automate with a Hook#

Add bdp generate python as a post-pull hook so bdp_data.py stays in sync automatically:

bdp.yml
project:
  name: protein-notebook
  version: 1.0.0

sources:
  - uniprot:P01308-fasta@1.0
  - hpo:hp-obo@1.0

hooks:
  post_pull:
    - bdp generate python > bdp_data.py

Using Environment Variables#

Alternatively, use os.environ to read BDP paths directly — useful when running notebooks via papermill or in CI:

python
import os
fasta = os.environ["BDP_UNIPROT_P01308_FASTA"]
obo = os.environ["BDP_HPO_HP_OBO"]
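The variable names above suggest a mapping from source IDs to environment variable names (drop the @version, uppercase, replace non-alphanumerics with underscores). That scheme is inferred from the two examples, not a documented guarantee; a helper that applies it might look like this:

```python
import re

def bdp_env_var(source_id: str) -> str:
    """Derive the env var name for a source ID.

    Assumed scheme: 'BDP_' + the ID without its @version, uppercased,
    with runs of non-alphanumeric characters replaced by underscores.
    """
    base = source_id.split("@", 1)[0]
    return "BDP_" + re.sub(r"[^A-Za-z0-9]+", "_", base).upper()

print(bdp_env_var("uniprot:P01308-fasta@1.0"))  # BDP_UNIPROT_P01308_FASTA
```

Verify the derived names against your own environment before relying on them.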

A Complete Notebook Example#

python
# Cell 1 — import paths
from bdp_data import BDP_SOURCES

# Cell 2 — load data
fasta_path = BDP_SOURCES["uniprot:P01308-fasta@1.0"]
sequences = {}
current_id = None
with open(fasta_path) as f:
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id:
            sequences[current_id] += line
print(f"Loaded {len(sequences)} sequences")

# Cell 3 — analysis
lengths = {k: len(v) for k, v in sequences.items()}
print(f"Mean length: {sum(lengths.values()) / len(lengths):.0f} aa")
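The parsing logic in Cell 2 can also be wrapped as a function, which makes it easy to sanity-check against an in-memory FASTA string before pointing it at the real file. A sketch (the demo record is illustrative, not real P01308 data):

```python
import io

def parse_fasta(handle) -> dict:
    """Parse FASTA records from a file-like object into {id: sequence}."""
    sequences, current_id = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id:
            sequences[current_id] += line
    return sequences

demo = ">sp|P01308|INS_HUMAN Insulin\nMALWMRLLPL\nLALLALWGPD\n"
print(parse_fasta(io.StringIO(demo)))
# {'sp|P01308|INS_HUMAN': 'MALWMRLLPLLALLALWGPD'}
```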

Reproducibility and Sharing#

Commit these files alongside your notebooks:

  • bdp.yml — which sources the notebook needs
  • bdp.lock — exact versions and checksums
  • bdp_data.py — resolved paths (generated from the lock)

Anyone cloning your repository runs bdp pull and opens the notebook — they get the exact same data, the exact same paths, and byte-for-byte reproducible results.

Do not commit .bdp/data/ — the raw data files are large and regenerated from the lock file on demand.
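A matching .gitignore entry keeps the data directory out of the repository (a minimal sketch; extend it to suit your layout):

```
# Generated data - restored from bdp.lock via 'bdp pull'
.bdp/data/
```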

JupyterHub and Shared Environments#

On JupyterHub, configure a shared BDP cache to avoid each user downloading the same files:

bash
# In your JupyterHub startup script or user profile
export BDP_CACHE_DIR=/shared/bdp/cache

All users then share a single download of each locked source.
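A quick cell can confirm from inside a notebook which cache directory is in effect. Note the ~/.cache/bdp fallback below is an assumption for illustration, not a documented BDP default:

```python
import os
from pathlib import Path

# BDP_CACHE_DIR if set by the JupyterHub profile, else an assumed fallback.
cache_dir = Path(os.environ.get("BDP_CACHE_DIR", Path.home() / ".cache" / "bdp"))
print(f"BDP cache: {cache_dir} (writable: {os.access(cache_dir, os.W_OK)})")
```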

Full Example#

See the complete Jupyter notebook example on Codeberg — includes analysis.ipynb with gene-disease variant analysis using BioPython, cyvcf2, and pandas.