Snakemake Integration

Use bdp generate snakemake to emit a YAML config block with resolved file paths, ready to include in your Snakemake workflow.

Generate Config

bash
bdp generate snakemake

Output (config/bdp_data.yaml):

yaml
# BDP data paths - auto-generated by 'bdp generate snakemake'. Do not edit.
# Include in Snakefile: configfile: "config/bdp_data.yaml"
bdp:
  clinvar_variants_vcf: ".bdp/data/sources/clinvar/variants/2024.01/variants_2024.01.vcf"
  ensembl_homo_sapiens_gtf: ".bdp/data/sources/ensembl/homo_sapiens/110/homo_sapiens_110.gtf"
  uniprot_p01308_fasta: ".bdp/data/sources/uniprot/P01308/2024.03/P01308_2024.03.fasta"
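
The keys and paths in the sample output appear to follow a regular pattern. The sketch below reproduces that pattern in Python so the naming is easier to predict. It is inferred only from the three entries shown above, not from bdp's implementation, and the function names `config_key` and `data_path` are hypothetical.

```python
def config_key(organization: str, name: str, fmt: str) -> str:
    """Build a key such as 'clinvar_variants_vcf'.

    Assumption: keys are the lowercased organization, name, and format
    joined by underscores, as the sample output suggests.
    """
    return "_".join(part.lower() for part in (organization, name, fmt))


def data_path(organization: str, name: str, version: str, fmt: str) -> str:
    """Build a relative path in the layout seen in the sample output."""
    return (
        f".bdp/data/sources/{organization}/{name}/{version}/"
        f"{name}_{version}.{fmt}"
    )
```

For example, `config_key("uniprot", "P01308", "fasta")` gives `uniprot_p01308_fasta`, matching the third entry above. The generated file remains the source of truth for the actual keys.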

Use in a Snakefile

Include the generated config and reference paths via config["bdp"]:

python
# Snakefile
configfile: "config/bdp_data.yaml"

rule filter_pathogenic:
    input:
        vcf=config["bdp"]["clinvar_variants_vcf"],
    output:
        vcf="results/pathogenic.vcf",
    shell:
        'bcftools view -i \'INFO/CLNSIG~"Pathogenic"\' {input.vcf} > {output.vcf}'
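
Because a Snakefile is Python, the loaded config can be checked before any rule runs. The following is a minimal sketch, assuming the config has the `bdp:` mapping of keys to paths shown earlier. The helper name `missing_bdp_paths` is hypothetical and not part of bdp or Snakemake.

```python
import os


def missing_bdp_paths(config: dict) -> list:
    """Return the keys under 'bdp' whose files are not present on disk."""
    entries = config.get("bdp") or {}
    # Paths are checked relative to the current working directory.
    return sorted(key for key, path in entries.items() if not os.path.exists(path))
```

Placed after the `configfile:` line, a call such as `missing_bdp_paths(config)` returns an empty list when every file has been pulled. A non-empty list is a hint that the data has not been pulled yet.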

Example: Variant Annotation Pipeline

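A three-rule pipeline that filters ClinVar variants, intersects them with gene annotations, and joins the result with disease association scores. The last rule reads config["bdp"]["opentargets_associations_tsv"], a key that does not appear in the sample output above, so it assumes an Open Targets associations source has also been added to the project: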

python
configfile: "config/bdp_data.yaml"

# Default target: the final annotated table.
rule all:
    input: "results/annotated_variants.tsv"

# Keep ClinVar records whose CLNSIG field matches "Pathogenic".
rule filter_pathogenic:
    input:
        vcf=config["bdp"]["clinvar_variants_vcf"],
    output:
        vcf="results/pathogenic.vcf",
    shell:
        'bcftools view -i \'INFO/CLNSIG~"Pathogenic"\' {input.vcf} > {output.vcf}'

# Pair each variant with the GTF features it overlaps.
rule annotate_with_genes:
    input:
        vcf="results/pathogenic.vcf",
        gtf=config["bdp"]["ensembl_homo_sapiens_gtf"],
    output:
        tsv="results/variant_genes.tsv",
    shell:
        "bedtools intersect -a {input.vcf} -b {input.gtf} -wa -wb > {output.tsv}"

# Inner-join variants to associations scoring above 0.5.
rule join_with_associations:
    input:
        variants="results/variant_genes.tsv",
        associations=config["bdp"]["opentargets_associations_tsv"],
    output:
        tsv="results/annotated_variants.tsv",
    run:
        import pandas as pd
        # Assumes both files have header rows with the column names used below.
        # Raw `bedtools intersect` output has no header, so a real pipeline
        # may need to name the columns before this merge.
        vg = pd.read_csv(input.variants, sep="\t")
        assoc = pd.read_csv(input.associations, sep="\t")
        result = vg.merge(assoc[assoc["overallAssociationScore"] > 0.5],
                          left_on="gene_name", right_on="targetId", how="inner")
        result.to_csv(output.tsv, sep="\t", index=False)
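
To make the final join easier to reason about, the sketch below restates its logic in plain Python on in-memory rows: filter associations by score, then inner-join on the gene column. It swaps pandas for dictionaries and lists, reuses the column names and the 0.5 threshold from the rule above, and is an illustration rather than the pipeline's actual code. The function name and the sample rows are hypothetical.

```python
def join_with_associations(variant_rows, association_rows, min_score=0.5):
    """Inner-join variant rows to association rows scoring above min_score.

    Mirrors the filter plus merge(left_on="gene_name", right_on="targetId",
    how="inner") used in the rule, on lists of dicts instead of DataFrames.
    """
    # Index the associations that pass the score filter by their target ID.
    by_target = {}
    for assoc in association_rows:
        if float(assoc["overallAssociationScore"]) > min_score:
            by_target.setdefault(assoc["targetId"], []).append(assoc)

    # Emit one combined row per (variant, matching association) pair.
    joined = []
    for variant in variant_rows:
        for assoc in by_target.get(variant["gene_name"], []):
            joined.append({**variant, **assoc})
    return joined


# Hypothetical sample rows, for illustration only.
variants = [
    {"variant_id": "v1", "gene_name": "G1"},
    {"variant_id": "v2", "gene_name": "G2"},
]
associations = [
    {"targetId": "G1", "overallAssociationScore": "0.8"},
    {"targetId": "G2", "overallAssociationScore": "0.3"},
    {"targetId": "G3", "overallAssociationScore": "0.9"},
]
rows = join_with_associations(variants, associations)
```

Here only `v1` survives: `G2` falls below the threshold and `G3` has no matching variant. Rows whose join columns hold different identifier types (for example a gene symbol on one side and a gene ID on the other) would silently drop out in the same way.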

Workflow

  1. Add sources: bdp source add clinvar:variants-vcf@2024.01
  2. Pull data: bdp pull
  3. Generate config: bdp generate snakemake
  4. Reference in Snakefile: configfile: "config/bdp_data.yaml"
  5. Run: snakemake --cores 4
  6. Commit bdp.yml, bdp.lock, and config/bdp_data.yaml
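
Steps 2, 3, and 5 above can be chained in a small driver script so they always run in the same order. This is a minimal sketch using only the commands listed in the steps. The names `STEPS` and `run_steps` are hypothetical, and the `runner` parameter exists only so the sequence can be exercised without invoking the real tools.

```python
import subprocess

# Commands taken from the workflow steps above, in order.
STEPS = [
    ["bdp", "pull"],
    ["bdp", "generate", "snakemake"],
    ["snakemake", "--cores", "4"],
]


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in steps:
        # check=True makes subprocess.run raise on a non-zero exit code.
        runner(command, check=True)
```

Calling `run_steps()` from the project root runs the three commands in sequence. Adding sources and committing files are left as manual steps.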

Reproducibility

Because config/bdp_data.yaml is generated from bdp.lock, every run resolves the same file paths, whether on your machine, a collaborator's machine, or CI.

Full Example

See the complete Snakemake pipeline example on Codeberg. It includes a three-rule variant annotation workflow using bcftools, bedtools, and pandas.