
Jupyter Integration#

Jupyter notebooks work with BDP through the standard Python integration: run bdp generate python to produce a paths module, import it in your notebook, and every data path resolves to the exact locked version.

Setup#

Run this once from a terminal in the project directory:

bash
bdp pull
bdp generate python > bdp_data.py

Then from any notebook in the same directory:

python
from bdp_data import BDP_SOURCES
fasta = BDP_SOURCES["uniprot:P01308-fasta@1.0"]
print(fasta)
# /home/user/project/.bdp/data/sources/uniprot/P01308/1.0/P01308_1.0.fasta
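If a notebook might run before bdp pull has completed, a small guard cell fails fast with an actionable message. The following is a sketch, not part of BDP: require_source is a hypothetical helper, and the demo mapping with a temporary file stands in for BDP_SOURCES.

```python
import tempfile
from pathlib import Path

def require_source(sources: dict, key: str) -> Path:
    """Return the resolved path for a locked source, failing early if absent."""
    if key not in sources:
        raise KeyError(f"{key!r} not in the paths module - rerun 'bdp generate python'")
    path = Path(sources[key])
    if not path.exists():
        raise FileNotFoundError(f"{path} is missing - run 'bdp pull' first")
    return path

# Demo: a temporary file stands in for a pulled source.
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    demo_sources = {"uniprot:P01308-fasta@1.0": tmp.name}

fasta = require_source(demo_sources, "uniprot:P01308-fasta@1.0")
```

In a real notebook you would pass BDP_SOURCES instead of demo_sources.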

Automate with a Hook#

Add bdp generate python as a post-pull hook so bdp_data.py stays in sync automatically:

bdp.yml
project:
  name: protein-notebook
  version: 1.0.0

sources:
  - uniprot:P01308-fasta@1.0
  - hpo:hp-obo@1.0

hooks:
  post_pull:
    - bdp generate python > bdp_data.py

Using Environment Variables#

Alternatively, use os.environ to read BDP paths directly — useful when running notebooks via papermill or in CI:

python
import os
fasta = os.environ["BDP_UNIPROT_P01308_FASTA"]
obo = os.environ["BDP_HPO_HP_OBO"]
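The variable names above suggest a mapping from source IDs to environment variable names (drop the @version, uppercase, replace non-alphanumerics with underscores). That scheme is inferred from the two examples, not a documented guarantee; a helper that applies it might look like this:

```python
import re

def bdp_env_var(source_id: str) -> str:
    """Derive the env var name for a source ID.

    Assumed scheme: 'BDP_' + the ID without its @version, uppercased,
    with runs of non-alphanumeric characters replaced by underscores.
    """
    base = source_id.split("@", 1)[0]
    return "BDP_" + re.sub(r"[^A-Za-z0-9]+", "_", base).upper()

print(bdp_env_var("uniprot:P01308-fasta@1.0"))  # BDP_UNIPROT_P01308_FASTA
```

Verify the derived names against your own environment before relying on them.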

A Complete Notebook Example#

python
# Cell 1 — import paths
from bdp_data import BDP_SOURCES

# Cell 2 — load data
fasta_path = BDP_SOURCES["uniprot:P01308-fasta@1.0"]
sequences = {}
current_id = None
with open(fasta_path) as f:
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id:
            sequences[current_id] += line
print(f"Loaded {len(sequences)} sequences")

# Cell 3 — analysis
lengths = {k: len(v) for k, v in sequences.items()}
print(f"Mean length: {sum(lengths.values()) / len(lengths):.0f} aa")
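The parsing logic in Cell 2 can also be wrapped as a function, which makes it easy to sanity-check against an in-memory FASTA string before pointing it at the real file. A sketch (the demo record is illustrative, not real P01308 data):

```python
import io

def parse_fasta(handle) -> dict:
    """Parse FASTA records from a file-like object into {id: sequence}."""
    sequences, current_id = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id:
            sequences[current_id] += line
    return sequences

demo = ">sp|P01308|INS_HUMAN Insulin\nMALWMRLLPL\nLALLALWGPD\n"
print(parse_fasta(io.StringIO(demo)))
# {'sp|P01308|INS_HUMAN': 'MALWMRLLPLLALLALWGPD'}
```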

Reproducibility and Sharing#

Commit these files alongside your notebooks:

  • bdp.yml — which sources the notebook needs
  • bdp.lock — exact versions and checksums
  • bdp_data.py — resolved paths (generated from the lock)

Anyone cloning your repository runs bdp pull and opens the notebook — they get the exact same data, the exact same paths, and byte-for-byte reproducible results.

Do not commit .bdp/data/ — the raw data files are large and regenerated from the lock file on demand.
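A matching .gitignore entry keeps the data directory out of the repository (a minimal sketch; extend it to suit your layout):

```
# Generated data - restored from bdp.lock via 'bdp pull'
.bdp/data/
```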

JupyterHub and Shared Environments#

On JupyterHub, configure a shared BDP cache to avoid each user downloading the same files:

bash
# In your JupyterHub startup script or user profile
export BDP_CACHE_DIR=/shared/bdp/cache

All users then share a single download of each locked source.
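A quick cell can confirm from inside a notebook which cache directory is in effect. Note the ~/.cache/bdp fallback below is an assumption for illustration, not a documented BDP default:

```python
import os
from pathlib import Path

# BDP_CACHE_DIR if set by the JupyterHub profile, else an assumed fallback.
cache_dir = Path(os.environ.get("BDP_CACHE_DIR", Path.home() / ".cache" / "bdp"))
print(f"BDP cache: {cache_dir} (writable: {os.access(cache_dir, os.W_OK)})")
```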

Full Example#

See the complete Jupyter notebook example on Codeberg — includes analysis.ipynb with gene-disease variant analysis using BioPython, cyvcf2, and pandas.