
# Cache Management

BDP caches every downloaded file locally so that bdp pull is fast after the first run — already-present files are skipped without a network request. The cache is project-local by default and can be redirected to a shared network path for team use.

## Where Files Are Stored

The default cache directory is .bdp/data/ inside your project root — the same .bdp/ directory that bdp init creates and gitignores. Files are stored with a deterministic path:

```text
.bdp/data/sources/{org}/{identifier}/{version}/{identifier}_{version}.{format}
```

For example, uniprot:P01308-fasta@1.0 is cached at:

```text
.bdp/data/sources/uniprot/P01308/1.0/P01308_1.0.fasta
```

This layout means every source has a stable, predictable location. You can reference files directly by path if needed, but using the BDP_* environment variables injected by hooks is more portable.
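The path template can be sketched as a small helper (a hypothetical function for illustration, not part of the BDP API):

```python
from pathlib import Path

def cache_path(cache_dir: str, org: str, identifier: str,
               version: str, fmt: str) -> Path:
    """Mirror BDP's deterministic layout:
    {cache}/sources/{org}/{identifier}/{version}/{identifier}_{version}.{format}
    """
    return (Path(cache_dir) / "sources" / org / identifier / version
            / f"{identifier}_{version}.{fmt}")

# uniprot:P01308-fasta@1.0 resolves to a stable location:
print(cache_path(".bdp/data", "uniprot", "P01308", "1.0", "fasta"))
# .bdp/data/sources/uniprot/P01308/1.0/P01308_1.0.fasta
```

Because the path depends only on org, identifier, version, and format, two runs (or two machines) always agree on where a source lives.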

## How Cache Lookup Works

Before downloading, bdp pull checks whether the file already exists at its expected path. If it does, the download is skipped entirely — no checksum comparison, no network request. The lockfile is still updated with the source's metadata.

```bash
bdp pull
# ✓ uniprot:P01308-fasta@1.0 (cached, skipped)
# ↓ hpo:hp-obo@1.0 (downloading...)
# ✓ hpo:hp-obo@1.0 (verified)
```

To force a re-download of all sources regardless of cache state:

```bash
bdp pull --force
```

This is useful if you suspect a file has been corrupted on disk. BDP will re-download and verify the checksum against bdp.lock.
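The lookup-then-verify behaviour described above can be sketched like this (a simplified illustration assuming SHA-256 checksums in the lockfile; not BDP's actual implementation):

```python
import hashlib
from pathlib import Path

def needs_download(path: Path, force: bool = False) -> bool:
    # Cache lookup is a pure existence check: no checksum, no network.
    if force:
        return True
    return not path.exists()

def verify(path: Path, expected_sha256: str) -> bool:
    # After a (re-)download, the file is checked against the lockfile entry.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected_sha256

cached = Path("P01308_1.0.fasta")
cached.write_text(">sp|P01308|INS_HUMAN\nMALWMRLLPL\n")
print(needs_download(cached))               # present on disk: skipped
print(needs_download(cached, force=True))   # --force: always re-download
```

Note that a corrupted but present file passes the existence check; only `--force` (or deleting the file) triggers the re-download and checksum verification.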

## Cache Commands

```bash
# Show cache location, resolved path, and total disk usage
bdp cache show

# Change the cache directory (writes to .bdp/.config)
bdp cache set /path/to/cache

# Reset cache path back to the default (.bdp/data)
bdp cache reset
```

Sample bdp cache show output:

```text
Cache Configuration:
  Configured path:  .bdp/data
  Resolved path:    /home/user/project/.bdp/data
  Project root:     /home/user/project
  Cache size:       284 MB
```

## Checking What's Cached

bdp status shows which sources from bdp.lock are present on disk and which are missing:

```text
Locked Sources:
  ✓ uniprot:P01308-fasta@1.0      [cached]   4.2 MB
  ✓ hpo:hp-obo@1.0                [cached]  18.7 MB
  ○ ensembl:homo-sapiens-gtf@1.0  [missing]

Summary:
  2 cached · 1 missing · 22.9 MB on disk
  Cache dir: /home/user/project/.bdp/data

Run 'bdp pull' to download missing sources.
```

## Cleaning the Cache

```bash
# Remove all cached source files
bdp clean --all --yes

# Clear only the search results cache (separate database)
bdp clean --search-cache --yes
```

Without --yes, BDP will prompt for confirmation. Without --all, it shows the current cache size and exits.

After cleaning, bdp pull re-downloads everything from scratch on the next run.

## Cache Configuration

Cache settings live in .bdp/.config — a TOML file inside your project's .bdp/ directory. bdp cache set writes to this file; you can also edit it directly.

.bdp/.config:

```toml
[cache]
path = ".bdp/data"
```

The path can be relative (resolved from the project root) or absolute. This is the only current cache configuration option.

## Shared Team Cache

If your team works on the same server or has a shared network filesystem, point everyone's cache to the same directory. Each member runs once and the rest reuse the result.

```bash
# Run this in each team member's project directory
bdp cache set /mnt/shared/bdp-cache
```

This writes the absolute path to .bdp/.config. Commit .bdp/.config to your repository so the shared path is automatically used by everyone who clones the project.

.bdp/.config:

```toml
[cache]
path = "/mnt/shared/bdp-cache"
```

Multiple projects can point to the same cache directory. Files are stored at deterministic, source-addressed paths ({org}/{identifier}/{version}), so projects that pull the same source reuse a single copy rather than colliding.

## HPC and Cluster Usage

On HPC clusters, set the cache to a shared scratch or project storage path so compute nodes don't each download their own copy:

submit.sh:

```bash
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --nodes=4

# Set shared cache before running — only one node will download
bdp cache set /scratch/project/shared/bdp-cache
bdp pull

# All nodes now have access to the same cached files
srun python scripts/analysis.py
```

If jobs run concurrently across nodes accessing the same cache directory, BDP writes are atomic at the file level — a file is either fully written or absent, never partially written. There is no distributed lock, so concurrent pulls of the same source may result in duplicate downloads, but not corruption.
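The file-level atomicity described here is the classic write-to-temp-then-rename pattern. A sketch of the general technique (not BDP's actual code):

```python
import os
import tempfile
from pathlib import Path

def atomic_write(dest: Path, data: bytes) -> None:
    """Write data so that dest is either fully written or absent."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the
    # destination. os.replace is atomic on POSIX filesystems, so readers
    # see either the complete file or no file, never a partial download.
    fd, tmp = tempfile.mkstemp(dir=dest.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write(Path("cache/example.fasta"), b">seq\nACGT\n")
```

The temp file must live on the same filesystem as the destination (hence `dir=dest.parent`), otherwise the rename degrades to a non-atomic copy.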

## Search Cache

bdp search results are cached separately in a SQLite database at your OS cache directory (~/.cache/bdp/bdp.db on Linux/macOS). Entries expire after 5 minutes. This is independent of the data cache and is not project-specific.

```bash
# Clear stale search results
bdp clean --search-cache --yes
```

## Lockfile and Cache Relationship

bdp.lock records the checksum and metadata for every pulled source, but it does not manage the cache directly. The cache is purely a filesystem store — BDP checks for file existence, not lockfile state, when deciding whether to skip a download.

This means:

- Deleting .bdp/data/ and re-running bdp pull always produces the same result — the lockfile guarantees the same versions and checksums are fetched.
- Moving the cache (bdp cache set) does not invalidate the lockfile — BDP will re-download to the new location on the next bdp pull.
- If you share a lockfile without sharing the cache, collaborators download the exact same files independently.
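This division of labour (the lockfile records identity, the filesystem records presence) can be sketched as a status check in the spirit of bdp status (hypothetical helper and lock entries):

```python
from pathlib import Path

def split_status(lock_entries: dict[str, str], cache_dir: str):
    """Given lockfile entries mapping source id -> relative cache path,
    report which sources are on disk and which still need `bdp pull`."""
    cached, missing = [], []
    for source_id, rel_path in lock_entries.items():
        # Presence is decided by file existence alone, never by lockfile state.
        target = cached if (Path(cache_dir) / rel_path).exists() else missing
        target.append(source_id)
    return cached, missing

# Hypothetical lock entries for the sources used earlier on this page
lock = {
    "uniprot:P01308-fasta@1.0": "sources/uniprot/P01308/1.0/P01308_1.0.fasta",
    "hpo:hp-obo@1.0": "sources/hpo/hp-obo/1.0/hp-obo_1.0.obo",
}
cached, missing = split_status(lock, ".bdp/data")
```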