Python Integration

Run bdp generate python to create a Python module that defines a pathlib.Path variable for every source in your bdp.lock, ready to import in any script or notebook.

Generate Paths

bash
bdp generate python

Output (bdp_data.py):

python
"""BDP data paths - auto-generated by 'bdp generate python'. Do not edit."""
from pathlib import Path
_BASE = Path(__file__).parent
# Source: uniprot:P01308-fasta@2024.03
UNIPROT_P01308_FASTA = _BASE / ".bdp" / "data" / "sources" / "uniprot" / "P01308" / "2024.03" / "P01308_2024.03.fasta"
# Source: clinvar:variants-vcf@2024.01
CLINVAR_VARIANTS_VCF = _BASE / ".bdp" / "data" / "sources" / "clinvar" / "variants" / "2024.01" / "variants_2024.01.vcf"

Use in a Script

python
from bdp_data import UNIPROT_P01308_FASTA, CLINVAR_VARIANTS_VCF
from Bio import SeqIO
records = list(SeqIO.parse(str(UNIPROT_P01308_FASTA), "fasta"))
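
Because the generated constants are plain pathlib.Path objects, a script can check that the files exist before parsing anything. The sketch below is one way to do that. `missing_sources` is a hypothetical helper, not part of bdp, and it assumes that a missing file means bdp pull has not been run yet.

```python
from pathlib import Path
from types import ModuleType


def missing_sources(module: ModuleType) -> dict[str, Path]:
    """Return {name: path} for public Path constants whose file is absent."""
    return {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_")   # skip _BASE and other private names
        and isinstance(value, Path)   # ignore imports, docstrings, etc.
        and not value.exists()
    }
```

A script could call `missing_sources(bdp_data)` after `import bdp_data` and, if the result is non-empty, stop with a message suggesting bdp pull.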

Example: Gene-Disease Variant Analysis

python
from bdp_data import CLINVAR_VARIANTS_VCF, OPENTARGETS_ASSOCIATIONS_TSV
from cyvcf2 import VCF
import pandas as pd
# Filter pathogenic ClinVar variants
variants = []
for v in VCF(str(CLINVAR_VARIANTS_VCF)):
    clnsig = v.INFO.get("CLNSIG", "")
    gene = v.INFO.get("GENEINFO", "")
    if "Pathogenic" in str(clnsig) and gene:
        variants.append({
            "gene": gene.split(":")[0],  # GENEINFO is "symbol:id"; keep the symbol
            "chrom": v.CHROM,
            "pos": v.POS,
            "disease": v.INFO.get("CLNDN", "not_provided"),
        })
pathogenic = pd.DataFrame(variants)
# Join with Open Targets association scores
associations = pd.read_csv(
    OPENTARGETS_ASSOCIATIONS_TSV, sep="\t",
    usecols=["targetId", "diseaseId", "overallAssociationScore"],
)
# Keep associations scoring above 0.5, then match on gene.
# Assumes targetId uses the same identifiers as the ClinVar gene symbols;
# map symbols to target IDs first if your export uses Ensembl gene IDs.
result = pathogenic.merge(
    associations[associations["overallAssociationScore"] > 0.5],
    left_on="gene", right_on="targetId", how="inner",
)
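
The loop above keeps only the first gene symbol from GENEINFO. ClinVar's VCF encodes this field as symbol:id pairs separated by "|", so a variant that overlaps several genes carries more than one symbol. A minimal sketch of a parser that returns all of them follows. `gene_symbols` is a hypothetical helper name, and the sample value is made up for illustration.

```python
from typing import Optional


def gene_symbols(geneinfo: Optional[str]) -> list[str]:
    """Split a GENEINFO value ('SYMBOL:id|SYMBOL:id') into its gene symbols."""
    if not geneinfo:
        return []
    # Each pair looks like 'SYMBOL:id'; keep the part before the first colon.
    return [pair.split(":", 1)[0] for pair in geneinfo.split("|") if pair]


# Made-up sample value, not a real ClinVar record
print(gene_symbols("GENEA:111|GENEB:222"))  # ['GENEA', 'GENEB']
```

Emitting one row per symbol instead of one row per variant would let the later merge pick up associations for every overlapping gene, at the cost of duplicated variant rows.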

Use in a Notebook

python
# Jupyter / Quarto notebook
from bdp_data import UNIPROT_P01308_FASTA, ENSEMBL_HOMO_SAPIENS_GTF
from Bio import SeqIO
proteins = {r.id: len(r.seq) for r in SeqIO.parse(str(UNIPROT_P01308_FASTA), "fasta")}
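
Notebooks often live in a subdirectory, in which case the import fails unless the directory containing bdp_data.py is on sys.path. One possible workaround, assuming bdp_data.py sits next to bdp.lock in the project root, is to walk up from the working directory until bdp.lock is found. `find_project_root` is a hypothetical helper, not a bdp feature.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "bdp.lock") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")
```

In a notebook's first cell, `sys.path.insert(0, str(find_project_root(Path.cwd())))` followed by the usual import should then work from any subdirectory of the project.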

Workflow

  1. Add sources: bdp source add uniprot:P01308-fasta@2024.03
  2. Pull data: bdp pull
  3. Generate paths: bdp generate python
  4. Import in your code: from bdp_data import UNIPROT_P01308_FASTA
  5. Commit bdp.yml, bdp.lock, and bdp_data.py
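
Steps 1 to 3 can also be scripted, for instance in a project setup script. The sketch below only shells out to the commands listed above. `workflow_commands` and `run_workflow` are hypothetical names, and the code assumes the bdp executable is on PATH.

```python
import subprocess


def workflow_commands(sources: list[str]) -> list[list[str]]:
    """Build the bdp commands for steps 1-3, one 'source add' per source."""
    commands = [["bdp", "source", "add", source] for source in sources]
    commands.append(["bdp", "pull"])
    commands.append(["bdp", "generate", "python"])
    return commands


def run_workflow(sources: list[str]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in workflow_commands(sources):
        # check=True raises CalledProcessError if a command exits non-zero
        subprocess.run(command, check=True)
```

Calling `run_workflow(["uniprot:P01308-fasta@2024.03"])` would then add the source, pull it, and regenerate bdp_data.py in one go.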

Regenerating

Re-run bdp generate python after any bdp pull that changes versions. Add it as a post-pull hook to automate this:

yaml
# bdp.yml
hooks:
  post_pull:
    - bdp generate python
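
Where the hook is not configured, a script can at least detect a module that may be out of date. The following is a rough heuristic, not something bdp provides: it assumes bdp.lock and bdp_data.py share a directory and treats the module as stale when the lockfile was modified more recently. `paths_are_stale` is a hypothetical name.

```python
from pathlib import Path


def paths_are_stale(project_dir: Path) -> bool:
    """True if bdp_data.py is missing or older than bdp.lock (mtime heuristic)."""
    lock = project_dir / "bdp.lock"
    module = project_dir / "bdp_data.py"
    if not module.exists():
        return True
    if not lock.exists():
        return False  # nothing to compare against
    return lock.stat().st_mtime > module.stat().st_mtime
```

Modification times are not preserved by a fresh git checkout, so this check is only a hint that re-running bdp generate python is worthwhile, not proof that the paths are wrong.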

Full Example

See the complete Python variant analysis example on Codeberg, which includes bdp_data.py (generated) and analysis.py (BioPython + cyvcf2 + pandas).