Data Sources

A data source is a versioned, typed reference to a specific entry published by an organization — a single UniProt protein, an HPO ontology release, an Ensembl gene annotation, or any other dataset available through BDP.

Identifier Format

Every data source is addressed with a single identifier:

Code

<organization>:<entry>-<format>@<version>

bash

bdp source add uniprot:P01308-fasta@1.0

| Part | Example | Meaning |
| --- | --- | --- |
| `organization` | `uniprot` | The organization publishing the data |
| `entry` | `P01308` | The specific record or dataset within that organization |
| `format` | `fasta` | File format to retrieve |
| `version` | `1.0` | Release or version (see Versioning) |
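
As a rough illustration of the format above, the following Python sketch splits an identifier into its four parts. It is not BDP's own parser: the `SourceId` class and `parse_source_id` function are hypothetical names, and treating everything after the last hyphen as the format is an assumption made so that hyphenated entries such as `homo-sapiens` still parse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceId:
    """Hypothetical container for the four parts of an identifier."""
    organization: str
    entry: str
    format: str
    version: str


def parse_source_id(identifier: str) -> SourceId:
    """Split '<organization>:<entry>-<format>@<version>' into its parts."""
    # Peel off the version first, so hyphens inside a version
    # (for example a calendar date) are never mistaken for the format separator.
    body, at, version = identifier.rpartition("@")
    if not at or not version:
        raise ValueError(f"missing explicit version: {identifier!r}")
    if version == "latest":
        raise ValueError("@latest is not supported; pin an explicit version")

    organization, colon, rest = body.partition(":")
    if not colon or not organization:
        raise ValueError(f"missing organization: {identifier!r}")

    # Assumption: the format is whatever follows the LAST hyphen,
    # because entries themselves may contain hyphens.
    entry, hyphen, fmt = rest.rpartition("-")
    if not hyphen or not entry or not fmt:
        raise ValueError(f"missing entry or format: {identifier!r}")

    return SourceId(organization, entry, fmt, version)
```

Calling `parse_source_id("uniprot:P01308-fasta@1.0")` yields the organization `uniprot`, entry `P01308`, format `fasta`, and version `1.0`, while an identifier without a version raises `ValueError`.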

Source Types

Data sources in BDP are typed. The type determines what kind of biological entity the source represents and how it integrates with the knowledge graph.

| Type | Description | Examples |
| --- | --- | --- |
| `protein` | Protein sequences and annotations | UniProt entries |
| `gene` | Gene coordinates, biotypes, cross-references | Ensembl genes |
| `transcript` | Transcript sequences and annotations | Ensembl transcripts |
| `genomic_sequence` | Chromosome / contig sequences | NCBI RefSeq |
| `genome` | Full genome assemblies | GenBank |
| `taxonomy` | Organism classification trees | NCBI Taxonomy |
| `go_term` | Gene Ontology terms and relationships | Gene Ontology |
| `interpro_entry` | Protein family and domain classifications | InterPro |
| `pathway` | Biological pathway definitions | Reactome |
| `disease` | Disease ontology terms | MONDO |
| `phenotype` | Phenotype ontology terms | HPO |
| `compound` | Chemical compounds | ChEBI, ChEMBL |
| `drug` | Drug information and interactions | DrugCentral, OpenFDA FAERS |
| `variant` | Genetic variants and clinical significance | ClinVar |
| `structure` | 3D molecular structures | PDB |
| `annotation` | Cross-database annotations and links | Open Targets, STRING |
| `bundle` | Pre-assembled collection of related sources | See Bundles |

Dependencies

Data sources can depend on other data sources. A protein entry may depend on a taxonomy entry for its organism; a pathway may depend on the gene and compound entries it references. BDP tracks these relationships and version-pins them — when you pull a source, its dependencies are resolved to exact versions and recorded in bdp.lock alongside the source itself.

Dependencies are either required or optional. Required dependencies are always pulled. Optional dependencies are declared but not pulled unless you explicitly include them.

This is also what makes bundles work: a bundle is just a data source whose dependencies are the individual sources it aggregates.
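
The resolution behaviour described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Deps` class, the `resolve` function, and every identifier in `EXAMPLE_CATALOG` (including the bundle and the dependency edges between sources) are hypothetical, and BDP's real resolver may order or deduplicate sources differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deps:
    """Hypothetical dependency record for one pinned source."""
    required: tuple[str, ...] = ()
    optional: tuple[str, ...] = ()


# Hypothetical catalog: a bundle is just a source whose required
# dependencies are the individual sources it aggregates.
EXAMPLE_CATALOG = {
    "example:insulin-bundle@1.0": Deps(
        required=("uniprot:P01308-fasta@1.0", "hpo:hp-obo@1.0"),
    ),
    "uniprot:P01308-fasta@1.0": Deps(
        required=("ncbi:taxonomy-tsv@1.0",),
        optional=("interpro:example-tsv@1.0",),
    ),
    "hpo:hp-obo@1.0": Deps(),
    "ncbi:taxonomy-tsv@1.0": Deps(),
    "interpro:example-tsv@1.0": Deps(),
}


def resolve(root, catalog, include_optional=frozenset()):
    """Return the root plus everything to pull, in depth-first order."""
    resolved = []
    seen = set()
    stack = [root]
    while stack:
        ident = stack.pop()
        if ident in seen:
            continue  # shared dependencies are pulled only once
        seen.add(ident)
        resolved.append(ident)
        deps = catalog[ident]
        # Required dependencies always; optional ones only when requested.
        wanted = list(deps.required)
        wanted += [d for d in deps.optional if d in include_optional]
        stack.extend(reversed(wanted))
    return resolved
```

Resolving the hypothetical bundle without options returns the bundle, the UniProt entry, its taxonomy dependency, and the HPO entry; the optional InterPro source appears only when its identifier is passed in `include_optional`.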

Organizations

The organization part of the identifier is a first-class concept in BDP. Organizations own data sources, define how they are versioned, and are responsible for keeping them up to date.

The organizations currently available — UniProt, NCBI, HPO, Ensembl, and the rest — are maintained by BDP through its ingestion pipelines. Their versioning logic is implemented and kept in sync by the BDP team. If any of these organizations would like to take ownership of their data on BDP and publish it directly, we would love to make that happen — get in touch.

Organizations and labs will also be able to publish their own data; this is on our roadmap. If your lab, institution, or project produces biological data and you would like to make it available through BDP, with proper versioning, checksums, and reproducibility guarantees, reach out and we will be happy to collaborate and prioritise accordingly.

Versioning

Versioning is the hardest part of BDP — and intentionally opinionated.

Every organization versions differently. There is no universal standard: UniProt uses release numbers (2024_01), HPO uses calendar dates (2024-04-26), NCBI GenBank uses release integers, and individual protein entries like P01308 are versioned by sequence version. BDP normalises these into a consistent @version string per organization, but the underlying scheme is organization-specific.
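
To make the variety concrete, here is a small Python sketch that recognises the three upstream version shapes mentioned above. The regular expressions and the names `UPSTREAM_SCHEMES` and `classify_upstream_version` are assumptions for illustration only; the sketch does not show how BDP actually maps an upstream release onto its own @version string.

```python
import re

# Assumed shapes of the upstream version strings mentioned in the text.
UPSTREAM_SCHEMES = {
    "release_number": re.compile(r"\d{4}_\d{2}"),       # e.g. 2024_01
    "calendar_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # e.g. 2024-04-26
    "release_integer": re.compile(r"\d+"),              # a plain integer
}


def classify_upstream_version(raw: str) -> str:
    """Return the name of the first scheme whose pattern matches `raw` in full."""
    for name, pattern in UPSTREAM_SCHEMES.items():
        if pattern.fullmatch(raw):
            return name
    raise ValueError(f"unrecognised upstream version: {raw!r}")
```

A real implementation would need one rule set per organization, since two organizations can use the same shape with different meanings.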

@latest is not supported, by design. BDP requires an explicit version, and omitting it is an error. Reproducibility is the core guarantee, and @latest would silently break it the moment an upstream organization ships a new release.

What BDP guarantees:

- The version you specify is the version you get, every time.
- The resolved file, checksum, and all pinned dependencies are recorded in bdp.lock on the first bdp pull.
- Subsequent pulls reproduce the exact same bytes, regardless of what the upstream organization has since released.
- To adopt a newer release, you update the version explicitly.
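
The byte-for-byte guarantee rests on recording a checksum once and comparing against it on every later pull. The Python sketch below shows that idea in miniature. The `LockEntry` structure, the `lock` and `verify` helpers, and the choice of SHA-256 are all assumptions; the actual layout of bdp.lock and the hash algorithm BDP uses are not specified here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LockEntry:
    """Hypothetical lock record; not the real bdp.lock schema."""
    identifier: str
    sha256: str
    dependencies: tuple[str, ...] = ()


def lock(identifier: str, data: bytes, dependencies=()) -> LockEntry:
    """Record the checksum of the bytes received on the first pull."""
    digest = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumption
    return LockEntry(identifier, digest, tuple(dependencies))


def verify(entry: LockEntry, data: bytes) -> None:
    """Raise if later bytes differ from what the lock entry recorded."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != entry.sha256:
        raise ValueError(
            f"checksum mismatch for {entry.identifier}: "
            f"expected {entry.sha256}, got {actual}"
        )
```

With this shape, a changed upstream file fails loudly in `verify` instead of silently replacing the pinned data.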

bash

# Specify the exact version you want
bdp source add uniprot:P01308-fasta@1.0
 
# Update to a newer version explicitly
bdp source update uniprot:P01308-fasta@2.0

Project Manifest

Data sources are declared in bdp.yml at the root of your project:

bdp.yml

project:
  name: insulin-variant-analysis
  version: 1.0.0

sources:
  - uniprot:P01308-fasta@1.0
  - hpo:hp-obo@1.0
  - ensembl:homo-sapiens-gtf@1.0

Running bdp pull downloads all declared sources, verifies their checksums, resolves their dependencies, and writes bdp.lock. Commit both bdp.yml and bdp.lock to version control.

BDP-Maintained Organizations

| Organization | What's indexed |
| --- | --- |
| UniProt | Protein sequences and annotations |
| NCBI | Taxonomy, GenBank, RefSeq |
| Gene Ontology | GO terms and annotations |
| InterPro | Protein family and domain data |
| HPO | Human Phenotype Ontology |
| MONDO | Disease ontology |
| ChEBI | Chemical compounds |
| Reactome | Biological pathways |
| Ensembl | Gene coordinates and biotypes |
| PubMed / Europe PMC | Literature |
| ClinVar | Clinical variants |
| Open Targets | Disease–target associations |
| ClinicalTrials | Clinical trial data |
| ChEMBL | Bioactive molecules |
| STRING | Protein interaction networks |
| DisGeNET | Gene–disease associations |
| DrugCentral | Drug information |
| OpenFDA FAERS | Adverse event reports |

Help Shape Data Sources

We are not experts in every bioinformatics subdomain — and we know it. We've done our best to model versioning for each organization, but there are 18 of them, each with its own release conventions, and we are almost certainly getting some of them wrong.

That's why your feedback matters more than anything else here.

Is our versioning wrong? Tell us. Did a version string not match what you expected? Did a pull break because we resolved the wrong release? We want to know.

Is an organization missing? Open an issue — we prioritise based on what people actually need.

Does the identifier format not work for your workflow? The organization:entry-format@version scheme felt right to us, but we're open to being wrong about that too.

→ Open an issue · Chat on Matrix · Contribute on Codeberg
