Post-Pull Hooks

Downloading a file is rarely the last step. A FASTA file needs to be indexed before BLAST can use it. A compressed archive needs to be decompressed. A GTF annotation needs to be converted. A downstream pipeline needs to be triggered.

Without hooks, you'd write a shell script that wraps bdp pull and then runs the preparation steps manually — and that script lives outside your manifest, outside version control, and breaks the moment someone on your team forgets to run it.

Post-pull hooks put that logic inside bdp.yml. They run automatically after every successful pull, in the same reproducible way, for everyone.

What to Version and What to Ignore

The guiding principle: version the process, not the output.

Biological data files are large — often gigabytes. You don't commit them to git. You commit the specification of what to download and how to prepare it, and anyone with that specification can reproduce the exact same state from scratch.

bdp init handles this automatically. It writes a single entry to your .gitignore:

.gitignore
# BDP cache and runtime files
.bdp/

That one line covers everything: downloaded source files, the audit database, and all hook-generated artifacts — as long as you write hook outputs into .bdp/. You never need to add *.fai, *.nhr, *.nsq, or any other derived-file patterns manually.

What you do commit:

  • bdp.yml — declares which sources to pull and which hooks to run
  • bdp.lock — pins exact versions and checksums for every source
  • scripts/ — the hook scripts themselves

The data is large and changes on an upstream schedule. The specification is small and changes when you decide. Git tracks the specification; bdp pull restores everything else.

Convention: Write Hook Outputs to .bdp/

Because .bdp/ is already gitignored, it is the natural home for anything generated by hooks. The convention is to use subdirectories that mirror what was built:

| Artifact type | Convention |
| --- | --- |
| BLAST databases | .bdp/blast/ |
| Sequence indexes (.fai, .bwt, …) | .bdp/index/ |
| Decompressed files | .bdp/derived/ |
| Custom processed outputs | .bdp/derived/ |

This is a convention, not enforced by BDP — hooks run arbitrary shell commands and can write anywhere. But writing outside .bdp/ means you need to maintain extra .gitignore patterns yourself. Write inside .bdp/ and git ignores it for free.
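
One way to keep hook scripts consistent with this convention is a small helper that resolves an output directory under .bdp/. The sketch below is illustrative only: the function name artifact_dir is hypothetical and not part of BDP, and it assumes BDP_PROJECT_ROOT is set for the hook (falling back to the current directory when it is not).

```python
import os
from pathlib import Path
from typing import Optional


def artifact_dir(kind: str, project_root: Optional[str] = None) -> Path:
    """Return (and create if needed) .bdp/<kind>/ under the project root.

    `kind` is a subdirectory name such as "blast", "index", or "derived".
    Hypothetical helper: BDP itself does not enforce any output layout.
    """
    # Prefer an explicit root, then the variable BDP injects, then the cwd.
    root = Path(project_root or os.environ.get("BDP_PROJECT_ROOT", "."))
    target = root / ".bdp" / kind
    target.mkdir(parents=True, exist_ok=True)
    return target
```

A hook script could then call artifact_dir("blast") and pass the returned path to whatever tool builds the artifact, so nothing lands outside the gitignored directory.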

A BLAST database built by a hook is an artifact of running bdp pull: you rebuild it on demand rather than storing it. Because bdp.lock pins the exact version and checksum of every source, each rebuild starts from byte-for-byte identical inputs, so a deterministic hook produces the same result every time.

A Quick Example

You've added a UniProt protein sequence to your project. After pulling it, you want BLAST to index it immediately:

bdp.yml
project:
  name: my-analysis
  version: 1.0.0

sources:
  - uniprot:P01308-fasta@1.0

hooks:
  post_pull:
    - mkdir -p .bdp/blast && makeblastdb -in "$BDP_UNIPROT_P01308_FASTA" -dbtype prot -out .bdp/blast/p01308

Run bdp pull once. The file is downloaded, verified, and the BLAST database is built — automatically, every time, for every member of your team. The database lives in .bdp/blast/ — gitignored, reproducible on demand.

Configuration

Hooks are defined at the project level in bdp.yml under hooks.post_pull. They are a list of shell commands and run once after the entire pull is done — all sources downloaded, checksums verified, bdp.lock written.

bdp.yml
hooks:
  post_pull:
    - python scripts/decompress.py
    - python scripts/build_index.py
    - nextflow run pipeline.nf

Commands run sequentially, in the order listed. If you have multiple sources and multiple preparation steps, list them all; BDP runs them one after another.

When Hooks Run

Post-pull hooks fire after:

  1. All sources have been downloaded
  2. All checksums have been verified
  3. bdp.lock has been written to disk

They do not run in dry-run mode (bdp pull --dry-run). If a source is already cached and up to date, the pull still runs hooks — the preparation step is always executed to ensure your working state is consistent.
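
Because hooks run on every pull, even when nothing was re-downloaded, an expensive preparation step may be worth guarding with an up-to-date check. The sketch below is one possible approach, not something BDP provides: the helper name needs_rebuild is hypothetical, and it assumes that comparing file modification times is a good enough signal for your workflow.

```python
import os


def needs_rebuild(source_path: str, output_path: str) -> bool:
    """Return True if output_path is missing or older than source_path.

    Hypothetical helper for hook scripts; relies on file mtimes only.
    """
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) < os.path.getmtime(source_path)
```

A hook script could wrap its slow step in `if needs_rebuild(source, output):` so that repeated pulls finish quickly when the derived file is already current. If modification times are not reliable in your environment, comparing a stored checksum would be a sturdier alternative.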

Environment Variables

Every hook runs with environment variables injected for each locked source, so you never have to hardcode paths:

BDP_PROJECT_ROOT — absolute path to the project directory (where bdp.yml lives)

BDP_<ORG>_<NAME>_<FORMAT> — absolute path to each locked source file on disk

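The variable name is derived from the source spec: the @version suffix is dropped, the remainder is uppercased, every non-alphanumeric character is replaced with an underscore, and the result is prefixed with BDP_. For example: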

| Source spec | Environment variable |
| --- | --- |
| uniprot:P01308-fasta@1.0 | BDP_UNIPROT_P01308_FASTA |
| hpo:hp-obo@1.0 | BDP_HPO_HP_OBO |
| ensembl:homo-sapiens-gtf@1.0 | BDP_ENSEMBL_HOMO_SAPIENS_GTF |

This means your hook scripts are portable — they don't need to know where BDP caches files, and they work correctly regardless of who is running the project or where.
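
The naming rule can be sketched in a few lines of Python, which may help when a script needs to look up a variable for a source spec it only knows at runtime. This is an illustration inferred from the examples in the table, not BDP's actual implementation: the function name env_var_name is hypothetical, and dropping the @version suffix is an assumption based on those examples.

```python
import re


def env_var_name(source_spec: str) -> str:
    """Derive the BDP_* variable name for a spec like 'org:name-format@version'."""
    # Assumption: the '@version' suffix is not part of the variable name.
    base = source_spec.split("@", 1)[0]
    # Uppercase, then replace every character outside A-Z and 0-9 with '_'.
    return "BDP_" + re.sub(r"[^A-Z0-9]", "_", base.upper())
```

With this sketch, env_var_name("hpo:hp-obo@1.0") returns "BDP_HPO_HP_OBO", matching the table, and the result can be passed to os.environ to obtain the cached file's path.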

Execution Environment

  • Shell: sh -c on Unix/Linux/macOS, cmd /C on Windows
  • Working directory: project root (where bdp.yml is located)
  • Output: streamed directly to the terminal — both stdout and stderr are visible
  • Timeout: none — hooks can run as long as they need

Error Handling

Hook failures are non-fatal. If a hook exits with a non-zero status, bdp pull prints a warning and moves on. The pull itself exits with code 0.

bash
→ Running post-pull hooks...
→ python scripts/build_index.py
✓ python scripts/build_index.py
→ bash scripts/validate.sh
⚠ Hook 'bash scripts/validate.sh' exited with status 1

All sources remain cached. Fix the script and re-run bdp pull — sources already in cache are not re-downloaded, only the hooks run again.
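
To make the non-fatal behavior concrete, here is a minimal Python model of a hook runner. It is a sketch of the semantics described on this page, not BDP's source code: the function name run_hooks is hypothetical, and it assumes that "moves on" means the remaining hooks still run after a failure.

```python
import subprocess
import sys


def run_hooks(commands, project_root="."):
    """Run each command through the platform shell, in order.

    Returns a list of (command, exit_code) pairs. A non-zero exit code
    produces a warning on stderr but does not stop the remaining commands.
    """
    results = []
    for command in commands:
        print(f"-> {command}")
        # shell=True uses 'sh -c' on Unix and cmd.exe on Windows; stdout and
        # stderr are inherited, so hook output streams to the terminal.
        completed = subprocess.run(command, shell=True, cwd=project_root)
        if completed.returncode != 0:
            print(
                f"warning: hook '{command}' exited with status {completed.returncode}",
                file=sys.stderr,
            )
        else:
            print(f"ok {command}")
        results.append((command, completed.returncode))
    return results
```

Under these assumptions, a caller that needs a failing hook to break a CI job would have to inspect the recorded exit codes itself, since the runner reports failures without raising.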

More Examples

Build a BLAST database

bash
mkdir -p .bdp/blast && makeblastdb -in "$BDP_UNIPROT_P01308_FASTA" -dbtype prot -out .bdp/blast/p01308

Decompress and index a FASTA

bash
mkdir -p .bdp/derived && gzip -dc "$BDP_UNIPROT_P01308_FASTA" > .bdp/derived/p01308.fasta && samtools faidx .bdp/derived/p01308.fasta

Run a Python preparation script

scripts/prepare.py
import os
import pathlib

# Paths injected by BDP for every hook
fasta = os.environ["BDP_UNIPROT_P01308_FASTA"]
project = os.environ["BDP_PROJECT_ROOT"]

# Write to .bdp/derived/ (already gitignored)
derived = pathlib.Path(project) / ".bdp" / "derived"
derived.mkdir(parents=True, exist_ok=True)

# Build a simple sequence index: collect every FASTA header line
with open(fasta) as f:
    headers = [line.strip() for line in f if line.startswith(">")]

# One header per line, each terminated by a newline
with open(derived / "index.txt", "w") as out:
    out.writelines(header + "\n" for header in headers)

Trigger a Nextflow pipeline

bash
nextflow run pipeline.nf --fasta "$BDP_UNIPROT_P01308_FASTA" --outdir results/

Audit Logging

Every hook execution is recorded in the local audit log (.bdp/bdp.db):

  • Command text
  • Exit code
  • Duration (milliseconds)

Run bdp audit to inspect the full execution history, including hook runs. This is useful for debugging flaky preparation steps and for compliance workflows that require a record of all data processing steps.