R / Bioconductor Integration

Use bdp generate r to produce a YAML data manifest and an R loader script with format-aware Bioconductor package recommendations. BDP knows which R package reads each file format and generates install helpers automatically.

Generate Config

bash
bdp generate r

Output: two files in r/:

r/bdp_data.yml — YAML config with paths, formats, and reader hints:

yaml
# BDP data paths - auto-generated by 'bdp generate r'. Do not edit.
# Usage: cfg <- yaml::read_yaml("r/bdp_data.yml")
sources:
  clinvar_variants_vcf:
    path: "../.bdp/data/sources/clinvar/variants/2024.01/variants_2024.01.vcf"
    format: vcf
    reader: "VariantAnnotation::readVcf"
    package: "VariantAnnotation"
  ensembl_homo_sapiens_gtf:
    path: "../.bdp/data/sources/ensembl/homo_sapiens/110/homo_sapiens_110.gtf"
    format: gtf
    reader: "rtracklayer::import"
    package: "rtracklayer"
  uniprot_p01308_fasta:
    path: "../.bdp/data/sources/uniprot/P01308/2024.03/P01308_2024.03.fasta"
    format: fasta
    reader: "Biostrings::readAAStringSet"
    package: "Biostrings"
packages:
  bioconductor:
    - Biostrings
    - VariantAnnotation
    - rtracklayer
  cran:
    - yaml

r/bdp_data.R — loader script with install helper:

r
# BDP data loader - auto-generated by 'bdp generate r'. Do not edit.
# Usage: source("r/bdp_data.R")
bdp_install_packages <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(c("Biostrings", "VariantAnnotation", "rtracklayer"))
  install.packages(c("yaml"))
}
bdp <- yaml::read_yaml(file.path(dirname(sys.frame(1)$ofile), "bdp_data.yml"))
.bdp_root <- normalizePath(file.path(dirname(sys.frame(1)$ofile), ".."), mustWork = FALSE)
for (name in names(bdp$sources)) {
  bdp$sources[[name]]$path <- file.path(.bdp_root, bdp$sources[[name]]$path)
}

Quick Start

r
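# One-time: install required packages.
# If yaml is not installed yet, this first source() stops at yaml::read_yaml(),
# but bdp_install_packages() is already defined by then.
# Call source() from the top level: the loader locates itself via sys.frame(1)$ofile.
source("r/bdp_data.R")
bdp_install_packages()

# Load the data manifest (creates the `bdp` list)
source("r/bdp_data.R")

# Access any source
bdp$sources$uniprot_p01308_fasta$path    # absolute file path
bdp$sources$uniprot_p01308_fasta$reader  # "Biostrings::readAAStringSet"
bdp$sources$uniprot_p01308_fasta$format  # "fasta"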
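
Because each source carries its reader as a "package::function" string, the recommended reader can be looked up and called programmatically. The following is a minimal sketch, not part of the generated loader: `bdp_read()` is a hypothetical helper, and the demo manifest is a stand-in that uses a base R reader so the sketch runs without Bioconductor installed.

```r
# Hypothetical helper (not generated by BDP): read one source with the
# reader recorded in its manifest entry.
bdp_read <- function(manifest, name, ...) {
  src <- manifest$sources[[name]]
  if (is.null(src)) stop("Unknown source: ", name)

  # Split "package::function" into its two parts.
  parts <- strsplit(src$reader, "::", fixed = TRUE)[[1]]
  if (length(parts) != 2) {
    stop("Expected reader in 'package::function' form, got: ", src$reader)
  }
  if (!requireNamespace(parts[1], quietly = TRUE)) {
    stop("Package '", parts[1], "' is not installed")
  }

  # Look up the exported function and call it on the source path.
  reader <- getExportedValue(parts[1], parts[2])
  reader(src$path, ...)
}

# Stand-in manifest with the same shape as `bdp` (assumed for illustration).
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(gene = c("INS", "TP53"), score = c(0.9, 0.4)),
          demo_path, row.names = FALSE)
demo <- list(sources = list(
  demo_csv = list(path = demo_path, format = "csv",
                  reader = "utils::read.csv", package = "utils")
))

tbl <- bdp_read(demo, "demo_csv")
print(tbl)
```

With the real manifest, `bdp_read(bdp, "uniprot_p01308_fasta")` would call Biostrings::readAAStringSet on the resolved path, provided Biostrings is installed.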

Package Mapping

BDP maps file formats to their canonical Bioconductor/CRAN readers:

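| Format | Package | Reader Function | Source |
|--------|---------|-----------------|--------|
| fasta | Biostrings | readAAStringSet() | Bioconductor |
| vcf | VariantAnnotation | readVcf() | Bioconductor |
| gtf | rtracklayer | import() | Bioconductor |
| gff | rtracklayer | import() | Bioconductor |
| obo | ontologyIndex | get_ontology() | CRAN |
| tsv | readr | read_tsv() | CRAN |
| csv | readr | read_csv() | CRAN |
| json | jsonlite | fromJSON() | CRAN |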

Only packages needed for the data types in your project are included in the install helper.
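
To see which of the recommended packages are still missing before running the install helper, the `packages` section of the manifest can be checked against the local library. This is a sketch under assumptions: `bdp_missing_packages()` is a hypothetical helper, not something BDP generates, and the demo manifest below uses made-up entries so the check runs anywhere.

```r
# Hypothetical helper (not generated by BDP): list recommended packages
# that are not installed in the current R library.
bdp_missing_packages <- function(manifest) {
  wanted <- unique(c(unlist(manifest$packages$bioconductor),
                     unlist(manifest$packages$cran)))
  if (length(wanted) == 0) return(character(0))
  installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
  as.character(wanted[!installed])
}

# Stand-in manifest: two packages that ship with R and one that does not exist.
demo <- list(packages = list(
  bioconductor = list("stats"),
  cran = list("utils", "bdpNoSuchPackage000")
))

missing_pkgs <- bdp_missing_packages(demo)
print(missing_pkgs)
```

With the real manifest, `bdp_missing_packages(bdp)` returns an empty character vector once everything listed under `packages` is installed.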

Example: Gene-Disease Variant Analysis

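A complete analysis using five data types (FASTA, GTF, VCF, OBO, TSV) with their canonical Bioconductor/CRAN readers. It assumes the project also contains MONDO and Open Targets sources (mondo_disease_terms_obo and opentargets_associations_tsv), which the sample manifest above does not list, and that GenomicFeatures is installed separately, since the install helper shown above does not include it: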

r
source("r/bdp_data.R")
library(VariantAnnotation)
library(Biostrings)
library(rtracklayer)
# In recent Bioconductor releases makeTxDbFromGFF() has moved to txdbmaker;
# load that package instead if GenomicFeatures reports the function as deprecated.
library(GenomicFeatures)
library(ontologyIndex)
library(readr)

# Load protein sequences (UniProt FASTA)
proteins <- readAAStringSet(bdp$sources$uniprot_p01308_fasta$path)

# Build gene model from Ensembl GTF
txdb <- makeTxDbFromGFF(bdp$sources$ensembl_homo_sapiens_gtf$path, format = "gtf")
gene_ranges <- genes(txdb)  # named gene_ranges so it does not shadow genes()

# Filter pathogenic ClinVar variants.
# CLNSIG may hold several values per record, so test each record separately.
vcf <- readVcf(bdp$sources$clinvar_variants_vcf$path, genome = "GRCh38")
clnsig <- info(vcf)$CLNSIG
is_pathogenic <- vapply(as.list(clnsig),
                        function(x) any(grepl("Pathogenic", x, fixed = TRUE)),
                        logical(1), USE.NAMES = FALSE)
pathogenic_vcf <- vcf[is_pathogenic, ]

# Load MONDO disease ontology
mondo <- get_ontology(bdp$sources$mondo_disease_terms_obo$path,
                      propagate_relationships = "is_a")

# Load Open Targets associations
associations <- read_tsv(bdp$sources$opentargets_associations_tsv$path,
                         show_col_types = FALSE)

# Map variants to genes via genomic overlap
variant_gr <- rowRanges(pathogenic_vcf)
hits <- findOverlaps(variant_gr, gene_ranges)
variant_gene_map <- data.frame(
  gene_id = gene_ranges$gene_id[subjectHits(hits)],
  chrom   = as.character(seqnames(variant_gr))[queryHits(hits)],
  pos     = start(variant_gr)[queryHits(hits)]
)

# Join with disease association scores
result <- merge(variant_gene_map,
                associations[associations$overallAssociationScore > 0.5, ],
                by.x = "gene_id", by.y = "targetId")

Workflow

  1. Add sources: bdp source add clinvar:variants-vcf@2024.01
  2. Pull data: bdp pull
  3. Generate R config: bdp generate r
  4. Install packages: source("r/bdp_data.R"); bdp_install_packages()
  5. Write your analysis (see example above)
  6. Commit bdp.yml, bdp.lock, r/bdp_data.yml, and r/bdp_data.R
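
Before writing the analysis in step 5, it can help to confirm that every path in the manifest points to a file that exists, for example to catch a source that was added but never pulled. The sketch below assumes a hypothetical helper, `bdp_missing_files()`, which BDP does not generate; the demo manifest is a stand-in with one existing and one absent file.

```r
# Hypothetical helper (not generated by BDP): return the manifest paths
# that do not exist on disk, named by source.
bdp_missing_files <- function(manifest) {
  paths <- vapply(manifest$sources, function(src) src$path, character(1))
  paths[!file.exists(paths)]
}

# Stand-in manifest: one file that exists, one that does not.
present <- tempfile(fileext = ".tsv")
file.create(present)
demo <- list(sources = list(
  present_tsv = list(path = present),
  absent_vcf  = list(path = file.path(tempdir(), "bdp_demo_not_pulled.vcf"))
))

missing_files <- bdp_missing_files(demo)
print(names(missing_files))
```

With the real manifest, a non-empty result from `bdp_missing_files(bdp)` suggests rerunning bdp pull before continuing.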

Reproducibility

The combination of bdp.lock (exact source versions) and r/bdp_data.yml (reader and package recommendations for each source) gives you a reproducible starting point. The manifest lists package names without versions, so for full R environment reproducibility, consider pairing with renv or Docker images from bioconductor/bioconductor_docker.

Full Example

See the complete R/Bioconductor analysis example on Codeberg. It includes the generated bdp_data.yml and bdp_data.R, plus an analysis.R that uses VariantAnnotation, Biostrings, rtracklayer, ontologyIndex, and readr.