Snakemake Integration

Use bdp generate snakemake to emit a YAML config block with resolved file paths, ready to include in your Snakemake workflow.

Generate Config

bash

bdp generate snakemake

Output (config/bdp_data.yaml):

yaml

# BDP data paths - auto-generated by 'bdp generate snakemake'. Do not edit.
# Include in Snakefile: configfile: "config/bdp_data.yaml"
bdp:
  clinvar_variants_vcf: ".bdp/data/sources/clinvar/variants/2024.01/variants_2024.01.vcf"
  ensembl_homo_sapiens_gtf: ".bdp/data/sources/ensembl/homo_sapiens/110/homo_sapiens_110.gtf"
  uniprot_p01308_fasta: ".bdp/data/sources/uniprot/P01308/2024.03/P01308_2024.03.fasta"

Use in a Snakefile

Include the generated config and reference paths via config["bdp"]:

python

# Snakefile
configfile: "config/bdp_data.yaml"
 
rule filter_pathogenic:
    input:
        vcf=config["bdp"]["clinvar_variants_vcf"],
    output:
        vcf="results/pathogenic.vcf",
    shell:
        'bcftools view -i \'INFO/CLNSIG~"Pathogenic"\' {input.vcf} > {output.vcf}'
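A rule that indexes config["bdp"] with a key that was never generated fails with a bare KeyError. The helper below is a minimal sketch of a friendlier lookup; bdp_path is a hypothetical name and is not part of bdp or Snakemake. It assumes only that config is the nested dict Snakemake builds from config/bdp_data.yaml.

```python
def bdp_path(config, key):
    """Return config["bdp"][key], or raise a KeyError that lists the known keys.

    Hypothetical helper, not part of bdp or Snakemake.
    """
    entries = config.get("bdp", {})
    if key not in entries:
        available = ", ".join(sorted(entries)) or "(none)"
        raise KeyError(
            f"'{key}' is not in the bdp config block; available keys: {available}. "
            "Re-run 'bdp generate snakemake' after adding the source."
        )
    return entries[key]


# Stand-in for the dict Snakemake builds from config/bdp_data.yaml.
config = {
    "bdp": {
        "clinvar_variants_vcf": ".bdp/data/sources/clinvar/variants/2024.01/variants_2024.01.vcf",
    }
}

vcf_path = bdp_path(config, "clinvar_variants_vcf")
```

Since a Snakefile is Python, the function could be defined at the top of the Snakefile and used as vcf=bdp_path(config, "clinvar_variants_vcf") inside a rule's input section.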

Example: Variant Annotation Pipeline

A pipeline with three processing rules (plus the all target rule) that filters ClinVar variants for pathogenic entries, intersects them with gene annotations, and joins the result with disease association scores:

python

configfile: "config/bdp_data.yaml"
 
rule all:
    input: "results/annotated_variants.tsv"
 
rule filter_pathogenic:
    input:
        vcf=config["bdp"]["clinvar_variants_vcf"],
    output:
        vcf="results/pathogenic.vcf",
    shell:
        'bcftools view -i \'INFO/CLNSIG~"Pathogenic"\' {input.vcf} > {output.vcf}'
 
rule annotate_with_genes:
    input:
        vcf="results/pathogenic.vcf",
        gtf=config["bdp"]["ensembl_homo_sapiens_gtf"],
    output:
        tsv="results/variant_genes.tsv",
    shell:
        "bedtools intersect -a {input.vcf} -b {input.gtf} -wa -wb > {output.tsv}"
 
rule join_with_associations:
    input:
        variants="results/variant_genes.tsv",
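        # Not in the sample config above: this key assumes an Open Targets
        # associations source has also been added to the project.
        associations=config["bdp"]["opentargets_associations_tsv"],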
    output:
        tsv="results/annotated_variants.tsv",
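    run:
        import pandas as pd
        # bedtools intersect writes no header row; with -wa -wb the GTF
        # attribute string is the last column of each line.
        vg = pd.read_csv(input.variants, sep="\t", header=None)
        vg["gene_name"] = vg.iloc[:, -1].str.extract(r'gene_name "([^"]+)"', expand=False)
        assoc = pd.read_csv(input.associations, sep="\t")
        # If targetId holds Ensembl gene IDs, extract gene_id above instead.
        result = vg.merge(assoc[assoc["overallAssociationScore"] > 0.5],
                          left_on="gene_name", right_on="targetId", how="inner")
        result.to_csv(output.tsv, sep="\t", index=False)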
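The join step depends on a gene identifier that, in bedtools intersect -wa -wb output, sits inside the GTF attribute string in the final column rather than in a named column. The sketch below shows one way to pull attributes out of such a string with the standard library. parse_gtf_attributes is a hypothetical helper, and the attribute values in the example are made up for illustration.

```python
import re

# Matches one `key "value"` pair in a GTF attribute column.
_ATTR = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_attributes(attr_field):
    """Return the key/value pairs of a GTF attribute string as a dict."""
    return dict(_ATTR.findall(attr_field))


# Made-up attribute string in the usual GTF layout (values are illustrative).
example = 'gene_id "ENSG_EXAMPLE"; gene_version "1"; gene_name "GENE1"; gene_biotype "protein_coding";'

attrs = parse_gtf_attributes(example)
gene_name = attrs["gene_name"]
gene_id = attrs["gene_id"]
```

Which attribute to join on depends on what the associations table stores in targetId, so it is worth inspecting a few rows of both files before choosing.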

Workflow

1. Add sources: bdp source add clinvar:variants-vcf@2024.01
2. Pull data: bdp pull
3. Generate config: bdp generate snakemake
4. Reference in Snakefile: configfile: "config/bdp_data.yaml"
5. Run: snakemake --cores 4
6. Commit bdp.yml, bdp.lock, and config/bdp_data.yaml
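The command-line steps above can also be scripted, for example for a setup script or a CI job. The following is a rough sketch that uses only the commands listed in this section. The run_steps function and its dry_run flag are hypothetical conveniences, and the script assumes bdp and snakemake are on PATH when it is run for real.

```python
import subprocess

# The commands from the workflow list, in order.
STEPS = [
    ["bdp", "source", "add", "clinvar:variants-vcf@2024.01"],
    ["bdp", "pull"],
    ["bdp", "generate", "snakemake"],
    ["snakemake", "--cores", "4"],
]


def run_steps(steps=STEPS, dry_run=True):
    """Run each command in order and return the command lines.

    With dry_run=True nothing is executed. Otherwise check=True makes
    subprocess.run raise CalledProcessError on the first failing step.
    """
    lines = []
    for cmd in steps:
        lines.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return lines


planned = run_steps(dry_run=True)
```

Calling run_steps(dry_run=False) executes the commands. Committing bdp.yml, bdp.lock, and config/bdp_data.yaml remains a manual step.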

Reproducibility

Because config/bdp_data.yaml is generated from bdp.lock, every checkout resolves to the same versioned file paths: on your machine, on a collaborator's machine, and in CI.
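Identical paths only help if the files behind them are present, so a pre-flight check before running snakemake on a fresh checkout or in CI can be useful. The sketch below reports which config keys point at files that do not exist yet, which would typically mean bdp pull has not been run. missing_files is a hypothetical helper, and the demo uses a throwaway directory with made-up keys rather than real bdp data.

```python
import tempfile
from pathlib import Path


def missing_files(config, root="."):
    """Return the sorted bdp config keys whose files do not exist under root."""
    root = Path(root)
    return sorted(
        key
        for key, rel_path in config.get("bdp", {}).items()
        if not (root / rel_path).is_file()
    )


# Demo in a temporary directory: one file is present, one is not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / ".bdp" / "data" / "a.vcf"
    present.parent.mkdir(parents=True)
    present.touch()

    demo_config = {"bdp": {"a_vcf": ".bdp/data/a.vcf", "b_gtf": ".bdp/data/b.gtf"}}
    missing = missing_files(demo_config, root=tmp)
```

In a real project the config dict would come from config/bdp_data.yaml and root would be the repository root, since the generated paths are relative.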

Full Example

See the complete Snakemake pipeline example on Codeberg, which includes a three-rule variant annotation workflow using bcftools, bedtools, and pandas.
