# Jupyter Integration
Jupyter notebooks work with BDP through the standard Python integration. Use `bdp generate python` to produce a paths module, import it in your notebook, and your data is always resolved to the exact locked versions.
## Setup
Run once from your terminal, in the project directory:
```shell
bdp pull
bdp generate python > bdp_data.py
```

Then from any notebook in the same directory:
```python
from bdp_data import BDP_SOURCES

fasta = BDP_SOURCES["uniprot:P01308-fasta@1.0"]
print(fasta)
# /home/user/project/.bdp/data/sources/uniprot/P01308/1.0/P01308_1.0.fasta
```

## Automate with a Hook
Add `bdp generate python` as a post-pull hook so `bdp_data.py` stays in sync automatically:
```yaml
project:
  name: protein-notebook
  version: 1.0.0

sources:
  - uniprot:P01308-fasta@1.0
  - hpo:hp-obo@1.0

hooks:
  post_pull:
    - bdp generate python > bdp_data.py
```
## Using Environment Variables
Alternatively, use `os.environ` to read BDP paths directly — useful when running notebooks via papermill or in CI:
```python
import os

fasta = os.environ["BDP_UNIPROT_P01308_FASTA"]
obo = os.environ["BDP_HPO_HP_OBO"]
```

## A Complete Notebook Example
```python
# Cell 1 — import paths
from bdp_data import BDP_SOURCES

# Cell 2 — load data
fasta_path = BDP_SOURCES["uniprot:P01308-fasta@1.0"]

sequences = {}
current_id = None
with open(fasta_path) as f:
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id:
            sequences[current_id] += line

print(f"Loaded {len(sequences)} sequences")

# Cell 3 — analysis
lengths = {k: len(v) for k, v in sequences.items()}
print(f"Mean length: {sum(lengths.values()) / len(lengths):.0f} aa")
```

## Reproducibility and Sharing
Commit these files alongside your notebooks:
- `bdp.yml` — which sources the notebook needs
- `bdp.lock` — exact versions and checksums
- `bdp_data.py` — resolved paths (generated from the lock)
Anyone cloning your repository runs `bdp pull` and opens the notebook — they get the exact same data, the exact same paths, and byte-for-byte reproducible results.
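The byte-for-byte guarantee rests on the checksums recorded in `bdp.lock`. If you ever want to spot-check a pulled file yourself, a streaming hash keeps memory flat even for large data files. A minimal sketch, assuming SHA-256 (the source doesn't specify which hash algorithm BDP records, so treat the digest comparison as illustrative):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks, so large data files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative usage — compare against the checksum your lock file records:
# assert sha256_of(fasta_path) == expected_digest
```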
Do not commit `.bdp/data/` — the raw data files are large and regenerated from the lock file on demand.
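A matching `.gitignore` entry makes this easy to enforce (assuming the default `.bdp/` layout shown in the Setup output above):

```
# BDP raw data: large files, restored by `bdp pull` from the lock file
.bdp/data/
```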
## JupyterHub and Shared Environments
On JupyterHub, configure a shared BDP cache to avoid each user downloading the same files:
```shell
# In your JupyterHub startup script or user profile
export BDP_CACHE_DIR=/shared/bdp/cache
```

All users then share a single download of each locked source.
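From inside a notebook you can confirm which cache is in effect. This sketch assumes only the `BDP_CACHE_DIR` variable from the snippet above; the per-user fallback path is an illustration, not a documented BDP default:

```python
import os
from pathlib import Path

# BDP_CACHE_DIR comes from the JupyterHub profile above; the
# home-directory fallback here is illustrative, not a BDP default.
cache = Path(os.environ.get("BDP_CACHE_DIR", str(Path.home() / ".bdp" / "cache")))
print(f"BDP cache in effect: {cache}")
```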
## Full Example
See the complete Jupyter notebook example on Codeberg — includes `analysis.ipynb` with gene-disease variant analysis using BioPython, cyvcf2, and pandas.