R / Bioconductor Integration#
Use bdp generate r to produce a YAML data manifest and an R loader script with format-aware Bioconductor package recommendations. BDP knows which R package reads each file format and generates install helpers automatically.
Generate Config#
```bash
bdp generate r
```

Output: two files in r/:
r/bdp_data.yml, a YAML config with paths, formats, and reader hints:

```yaml
# BDP data paths - auto-generated by 'bdp generate r'. Do not edit.
# Usage: cfg <- yaml::read_yaml("r/bdp_data.yml")

sources:
  clinvar_variants_vcf:
    path: "../.bdp/data/sources/clinvar/variants/2024.01/variants_2024.01.vcf"
    format: vcf
    reader: "VariantAnnotation::readVcf"
    package: "VariantAnnotation"
  ensembl_homo_sapiens_gtf:
    path: "../.bdp/data/sources/ensembl/homo_sapiens/110/homo_sapiens_110.gtf"
    format: gtf
    reader: "rtracklayer::import"
    package: "rtracklayer"
  uniprot_p01308_fasta:
    path: "../.bdp/data/sources/uniprot/P01308/2024.03/P01308_2024.03.fasta"
    format: fasta
    reader: "Biostrings::readAAStringSet"
    package: "Biostrings"

packages:
  bioconductor:
    - Biostrings
    - VariantAnnotation
    - rtracklayer
  cran:
    - yaml
```

r/bdp_data.R, a loader script with an install helper:
```r
# BDP data loader - auto-generated by 'bdp generate r'. Do not edit.
# Usage: source("r/bdp_data.R")

bdp_install_packages <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(c("Biostrings", "VariantAnnotation", "rtracklayer"))
  install.packages(c("yaml"))
}

bdp <- yaml::read_yaml(file.path(dirname(sys.frame(1)$ofile), "bdp_data.yml"))

.bdp_root <- normalizePath(file.path(dirname(sys.frame(1)$ofile), ".."), mustWork = FALSE)
for (name in names(bdp$sources)) {
  bdp$sources[[name]]$path <- file.path(.bdp_root, bdp$sources[[name]]$path)
}
```

Quick Start#
```r
# One-time: install required packages
source("r/bdp_data.R")
bdp_install_packages()

# Load the data manifest
source("r/bdp_data.R")

# Access any source
bdp$sources$uniprot_p01308_fasta$path    # absolute file path
bdp$sources$uniprot_p01308_fasta$reader  # "Biostrings::readAAStringSet"
bdp$sources$uniprot_p01308_fasta$format  # "fasta"
```

Package Mapping#
BDP maps file formats to their canonical Bioconductor/CRAN readers:
| Format | Package | Reader Function | Source |
|---|---|---|---|
| fasta | Biostrings | readAAStringSet() | Bioconductor |
| vcf | VariantAnnotation | readVcf() | Bioconductor |
| gtf | rtracklayer | import() | Bioconductor |
| gff | rtracklayer | import() | Bioconductor |
| obo | ontologyIndex | get_ontology() | CRAN |
| tsv | readr | read_tsv() | CRAN |
| csv | readr | read_csv() | CRAN |
| json | jsonlite | fromJSON() | CRAN |
Only packages needed for the data types in your project are included in the install helper.
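Because each manifest entry records its reader as a `package::function` string, a script can dispatch on that string instead of hard-coding the reader per source. The following is a minimal sketch of that idea, not something `bdp generate r` emits: `bdp_read()` is a hypothetical helper, and the small demo manifest (a CSV read with `utils::read.csv`) is made up so the snippet runs on base R without Bioconductor installed.

```r
# Hypothetical helper (not generated by BDP): call the reader named in the manifest.
bdp_read <- function(manifest, name, ...) {
  src <- manifest$sources[[name]]
  if (is.null(src)) stop("Unknown source: ", name)

  # Split "package::function" into its two parts.
  parts <- strsplit(src$reader, "::", fixed = TRUE)[[1]]
  if (length(parts) != 2) stop("Unexpected reader string: ", src$reader)

  # Fail early with a hint if the reader package is not installed.
  if (!requireNamespace(parts[1], quietly = TRUE)) {
    stop("Package '", parts[1], "' is not installed; run bdp_install_packages()")
  }

  reader <- getExportedValue(parts[1], parts[2])
  reader(src$path, ...)
}

# Stand-in manifest (assumption: same shape as the generated 'bdp' object).
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(gene = c("INS", "TP53"), score = c(0.9, 0.7)),
          demo_path, row.names = FALSE)
demo <- list(sources = list(
  demo_scores_csv = list(path = demo_path, format = "csv",
                         reader = "utils::read.csv", package = "utils")
))

scores <- bdp_read(demo, "demo_scores_csv")
```

With the real manifest, a call such as `bdp_read(bdp, "uniprot_p01308_fasta")` would resolve to `Biostrings::readAAStringSet` on that source's path. Extra arguments pass through `...`, for example `genome = "GRCh38"` for a VCF source.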
Example: Gene-Disease Variant Analysis#
A complete analysis using five data types (FASTA, GTF, VCF, OBO, and TSV) with their canonical Bioconductor and CRAN readers:
```r
source("r/bdp_data.R")

library(VariantAnnotation)
library(Biostrings)
library(rtracklayer)
library(GenomicFeatures)
library(ontologyIndex)
library(readr)

# Load protein sequences (UniProt FASTA)
proteins <- readAAStringSet(bdp$sources$uniprot_p01308_fasta$path)

# Build gene model from Ensembl GTF
txdb <- makeTxDbFromGFF(bdp$sources$ensembl_homo_sapiens_gtf$path, format = "gtf")
genes <- genes(txdb)

# Filter pathogenic ClinVar variants
vcf <- readVcf(bdp$sources$clinvar_variants_vcf$path, genome = "GRCh38")
clnsig <- info(vcf)$CLNSIG
# CLNSIG may hold several values per variant, so test each record on its own
is_pathogenic <- vapply(as.list(clnsig), function(x) any(grepl("Pathogenic", x)), logical(1))
pathogenic_vcf <- vcf[is_pathogenic, ]

# Load MONDO disease ontology (get_ontology() is the reader listed in the table above)
mondo <- get_ontology(bdp$sources$mondo_disease_terms_obo$path, propagate_relationships = "is_a")

# Load Open Targets associations
associations <- read_tsv(bdp$sources$opentargets_associations_tsv$path, show_col_types = FALSE)

# Map variants to genes via genomic overlap
variant_gr <- rowRanges(pathogenic_vcf)
hits <- findOverlaps(variant_gr, genes)
variant_gene_map <- data.frame(
  gene_id = genes$gene_id[subjectHits(hits)],
  chrom = as.character(seqnames(variant_gr))[queryHits(hits)],
  pos = start(variant_gr)[queryHits(hits)]
)

# Join with disease association scores
result <- merge(
  variant_gene_map,
  associations[associations$overallAssociationScore > 0.5, ],
  by.x = "gene_id", by.y = "targetId"
)
```

Workflow#
- Add sources: `bdp source add clinvar:variants-vcf@2024.01`
- Pull data: `bdp pull`
- Generate R config: `bdp generate r`
- Install packages: `source("r/bdp_data.R"); bdp_install_packages()`
- Write your analysis (see example above)
- Commit `bdp.yml`, `bdp.lock`, `r/bdp_data.yml`, and `r/bdp_data.R`
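Between pulling data and writing the analysis, it can help to confirm that every path in the manifest points at a file that exists on disk. The following is a minimal sketch under assumptions: `bdp_missing_files()` is a hypothetical helper, not part of BDP, and the demo manifest is fabricated so the snippet runs on its own. In a project you would pass the `bdp` object created by `source("r/bdp_data.R")`.

```r
# Hypothetical helper (not part of BDP): names of sources whose files are absent.
bdp_missing_files <- function(manifest) {
  paths <- vapply(manifest$sources, function(src) src$path, character(1))
  names(paths)[!file.exists(paths)]
}

# Stand-in manifest: one file that exists, one that does not.
present <- tempfile(fileext = ".tsv")
writeLines("targetId\tscore", present)
demo <- list(sources = list(
  present_tsv = list(path = present),
  absent_vcf  = list(path = file.path(tempdir(), "does_not_exist.vcf"))
))

missing_sources <- bdp_missing_files(demo)
if (length(missing_sources) > 0) {
  message("Missing files for: ", paste(missing_sources, collapse = ", "),
          " (try 'bdp pull')")
}
```

An empty result means every listed source is present. Anything else suggests the pull step was skipped or the lock file changed since the last pull.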
Reproducibility#
The combination of bdp.lock (exact source versions) and r/bdp_data.yml (the reader package recommended for each source) gives you a reproducible starting point. The manifest names packages but does not pin their versions, so for full R environment reproducibility, consider pairing with renv or Docker images from bioconductor/bioconductor_docker.
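As a lightweight complement to renv or Docker, the installed versions of the packages named in the manifest can be recorded next to the lock file. This is only a sketch: `bdp_package_versions()` is a hypothetical helper that BDP does not provide, and the demo manifest lists base packages (`utils`, `stats`) purely so the example runs anywhere.

```r
# Hypothetical helper (not part of BDP): installed versions of manifest packages.
bdp_package_versions <- function(manifest) {
  pkgs <- unique(unlist(manifest$packages, use.names = FALSE))
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  versions <- vapply(
    pkgs[installed],
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  list(
    r = R.version.string,            # R version in use
    versions = versions,             # named vector: package -> version
    not_installed = pkgs[!installed] # packages the manifest names but R cannot load
  )
}

# Stand-in manifest (assumption: same 'packages' layout as r/bdp_data.yml).
demo <- list(packages = list(bioconductor = character(0), cran = c("utils", "stats")))
snapshot <- bdp_package_versions(demo)
```

The result could be saved with `saveRDS()` or printed into a run log. It documents what was installed at analysis time but does not restore it, which is the part renv or a container image would handle.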
Full Example#
See the complete R/Bioconductor analysis example on Codeberg — includes bdp_data.yml + bdp_data.R (generated) and analysis.R using VariantAnnotation, Biostrings, rtracklayer, ontologyIndex, and readr.