Data Sources

A data source is a versioned, typed reference to a specific entry published by an organization — a single UniProt protein, an HPO ontology release, an Ensembl gene annotation, or any other dataset available through BDP.

Identifier Format

Every data source is addressed with a single identifier:

Code
<organization>:<entry>-<format>@<version>
bash
bdp source add uniprot:P01308-fasta@1.0

| Part | Example | Meaning |
| --- | --- | --- |
| organization | uniprot | The organization publishing the data |
| entry | P01308 | The specific record or dataset within that organization |
| format | fasta | File format to retrieve |
| version | 1.0 | Release or version (see below) |
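
As a rough illustration of how the four parts fit together, the identifier can be split with a few string operations. This is a minimal sketch, not BDP's own parser: `SourceId` and `parse_source_id` are hypothetical names, and the rule that the format is whatever follows the last hyphen before `@` is an assumption inferred from examples such as `ensembl:homo-sapiens-gtf@1.0`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceId:
    """Hypothetical container for the four identifier parts."""
    organization: str
    entry: str
    format: str
    version: str


def parse_source_id(identifier: str) -> SourceId:
    """Split '<organization>:<entry>-<format>@<version>' into its parts."""
    # The version follows the last '@'; it may itself contain '-' or '_'.
    head, at, version = identifier.rpartition("@")
    if not at or not version:
        raise ValueError(f"missing @version in {identifier!r}")
    # The organization precedes the first ':'.
    organization, colon, rest = head.partition(":")
    if not colon or not organization:
        raise ValueError(f"missing organization in {identifier!r}")
    # Assumption: the format follows the last '-', so an entry such as
    # 'homo-sapiens' may contain hyphens of its own.
    entry, dash, fmt = rest.rpartition("-")
    if not dash or not entry or not fmt:
        raise ValueError(f"missing entry or format in {identifier!r}")
    return SourceId(organization, entry, fmt, version)


print(parse_source_id("uniprot:P01308-fasta@1.0"))
```

Splitting on `@` first keeps date-style versions such as `2024-04-26` intact, because their hyphens never reach the entry/format split.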

Source Types

Data sources in BDP are typed. The type determines what kind of biological entity the source represents and how it integrates with the knowledge graph.

| Type | Description | Examples |
| --- | --- | --- |
| protein | Protein sequences and annotations | UniProt entries |
| gene | Gene coordinates, biotypes, cross-references | Ensembl genes |
| transcript | Transcript sequences and annotations | Ensembl transcripts |
| genomic_sequence | Chromosome / contig sequences | NCBI RefSeq |
| genome | Full genome assemblies | GenBank |
| taxonomy | Organism classification trees | NCBI Taxonomy |
| go_term | Gene Ontology terms and relationships | Gene Ontology |
| interpro_entry | Protein family and domain classifications | InterPro |
| pathway | Biological pathway definitions | Reactome |
| disease | Disease ontology terms | MONDO |
| phenotype | Phenotype ontology terms | HPO |
| compound | Chemical compounds | ChEBI, ChEMBL |
| drug | Drug information and interactions | DrugCentral, OpenFDA FAERS |
| variant | Genetic variants and clinical significance | ClinVar |
| structure | 3D molecular structures | PDB |
| annotation | Cross-database annotations and links | Open Targets, STRING |
| bundle | Pre-assembled collection of related sources | See Bundles |

Dependencies

Data sources can depend on other data sources. A protein entry may depend on a taxonomy entry for its organism; a pathway may depend on the gene and compound entries it references. BDP tracks these relationships and version-pins them — when you pull a source, its dependencies are resolved to exact versions and recorded in bdp.lock alongside the source itself.

Dependencies are either required or optional. Required dependencies are always pulled. Optional dependencies are declared but not pulled unless you explicitly include them.

This is also what makes bundles work: a bundle is just a data source whose dependencies are the individual sources it aggregates.
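
The required/optional rule and the bundle idea can be sketched as a small graph walk. Everything here is illustrative: the `Source` record, the `resolve` function, and the `demo:` identifiers are hypothetical, and BDP's actual resolver and its data model are not described on this page.

```python
from collections.abc import Iterable
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical record: one source and the dependencies it declares."""
    identifier: str
    required: list[str] = field(default_factory=list)
    optional: list[str] = field(default_factory=list)


def resolve(root: str, registry: dict[str, Source],
            include: Iterable[str] = ()) -> list[str]:
    """Return root plus every dependency that would be pulled.

    Required dependencies are always followed; optional ones only when
    their identifier is listed in `include`.
    """
    included = set(include)
    resolved: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already pulled via another path
        seen.add(current)
        resolved.append(current)
        source = registry[current]
        wanted = list(source.required)
        wanted += [dep for dep in source.optional if dep in included]
        # Reversed so dependencies are visited in declaration order.
        stack.extend(reversed(wanted))
    return resolved


# Made-up identifiers; a bundle is just a source with required dependencies.
registry = {
    "demo:proteins-bundle@1.0": Source("demo:proteins-bundle@1.0",
                                       required=["demo:P1-fasta@1.0"]),
    "demo:P1-fasta@1.0": Source("demo:P1-fasta@1.0",
                                required=["demo:taxon-tsv@1.0"],
                                optional=["demo:structure-pdb@1.0"]),
    "demo:taxon-tsv@1.0": Source("demo:taxon-tsv@1.0"),
    "demo:structure-pdb@1.0": Source("demo:structure-pdb@1.0"),
}

print(resolve("demo:proteins-bundle@1.0", registry))
```

Because every identifier in the graph already carries an exact version, the resolved list is the kind of information a lock file would need to pin.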

Organizations

The organization part of the identifier is a first-class concept in BDP. Organizations own data sources, define how they are versioned, and are responsible for keeping them up to date.

The organizations currently available — UniProt, NCBI, HPO, Ensembl, and the rest — are maintained by BDP through its ingestion pipelines. Their versioning logic is implemented and kept in sync by the BDP team. If any of these organizations would like to take ownership of their data on BDP and publish it directly, we would love to make that happen — get in touch.

Publishing your own data is on our roadmap. If your lab, institution, or project produces biological data and you would like to make it available through BDP with proper versioning, checksums, and reproducibility guarantees, reach out: we will be happy to collaborate and prioritise accordingly.

Versioning

Versioning is the hardest part of BDP — and intentionally opinionated.

Every organization versions differently. There is no universal standard: UniProt uses release numbers (2024_01), HPO uses calendar dates (2024-04-26), NCBI GenBank uses release integers, and individual protein entries like P01308 are versioned by sequence version. BDP normalises these into a consistent @version string per organization, but the underlying scheme is organization-specific.

@latest is not supported, by design. BDP requires an explicit version, and omitting it is an error. Reproducibility is the core guarantee, and @latest would silently break it the moment an upstream organization ships a new release.
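
The explicit-version rule is simple enough to express as a check. The sketch below is our own illustration with a hypothetical function name and error messages; it does not reproduce the errors the bdp CLI actually prints.

```python
def require_explicit_version(identifier: str) -> str:
    """Return the version of an identifier, rejecting a missing one or 'latest'."""
    # Everything after the last '@' is treated as the version.
    _, at, version = identifier.rpartition("@")
    if not at or not version:
        raise ValueError(f"{identifier!r}: an explicit @version is required")
    if version.lower() == "latest":
        raise ValueError(f"{identifier!r}: @latest is not supported, pin a version")
    return version


print(require_explicit_version("uniprot:P01308-fasta@1.0"))
```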

What BDP guarantees:

  • The version you specify is the version you get, every time
  • The resolved file, checksum, and all pinned dependencies are recorded in bdp.lock on the first bdp pull
  • Subsequent pulls reproduce the exact same bytes, regardless of what the upstream organization has since released
  • To adopt a newer release, you update the version explicitly
bash
# Specify the exact version you want
bdp source add uniprot:P01308-fasta@1.0
# Update to a newer version explicitly
bdp source update uniprot:P01308-fasta@2.0
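
The "same bytes every time" guarantee can be pictured as a record-then-verify step. This is a minimal sketch under stated assumptions: the lock is modelled as a plain dictionary from identifier to SHA-256 digest, whereas the real bdp.lock format and checksum algorithm are not specified on this page, and `pull` here is a hypothetical function, not the CLI command.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the downloaded bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def pull(identifier: str, fetched: bytes, lock: dict[str, str]) -> bytes:
    """Record the checksum on the first pull; verify against it afterwards."""
    digest = sha256_hex(fetched)
    recorded = lock.get(identifier)
    if recorded is None:
        # First pull: pin these exact bytes.
        lock[identifier] = digest
    elif recorded != digest:
        # Later pull: anything other than the pinned bytes is an error.
        raise ValueError(
            f"checksum mismatch for {identifier}: "
            f"expected {recorded}, got {digest}"
        )
    return fetched


lock: dict[str, str] = {}
pull("uniprot:P01308-fasta@1.0", b">example\nACGT\n", lock)  # records the digest
pull("uniprot:P01308-fasta@1.0", b">example\nACGT\n", lock)  # verifies, succeeds
```

A changed upstream file under the same version would fail the second branch rather than silently replace the pinned data.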

Project Manifest

Data sources are declared in bdp.yml at the root of your project:

bdp.yml
project:
  name: insulin-variant-analysis
  version: 1.0.0

sources:
  - uniprot:P01308-fasta@1.0
  - hpo:hp-obo@1.0
  - ensembl:homo-sapiens-gtf@1.0

Running bdp pull downloads all declared sources, verifies their checksums, resolves their dependencies, and writes bdp.lock. Commit both bdp.yml and bdp.lock.

BDP-Maintained Organizations

| Organization | What's indexed |
| --- | --- |
| UniProt | Protein sequences and annotations |
| NCBI | Taxonomy, GenBank, RefSeq |
| Gene Ontology | GO terms and annotations |
| InterPro | Protein family and domain data |
| HPO | Human Phenotype Ontology |
| MONDO | Disease ontology |
| ChEBI | Chemical compounds |
| Reactome | Biological pathways |
| Ensembl | Gene coordinates and biotypes |
| PubMed / Europe PMC | Literature |
| ClinVar | Clinical variants |
| Open Targets | Disease–target associations |
| ClinicalTrials | Clinical trial data |
| ChEMBL | Bioactive molecules |
| STRING | Protein interaction networks |
| DisGeNET | Gene–disease associations |
| DrugCentral | Drug information |
| OpenFDA FAERS | Adverse event reports |

Help Shape Data Sources

We are not experts in every bioinformatics subdomain — and we know it. We've done our best to model versioning for each organization, but there are 18 of them, each with its own release conventions, and we are almost certainly getting some of them wrong.

That's why your feedback matters more than anything else here.

Is our versioning wrong? Tell us. Did a version string not match what you expected? Did a pull break because we resolved the wrong release? We want to know.

Is an organization missing? Open an issue — we prioritise based on what people actually need.

Does the identifier format not work for your workflow? The organization:entry-format@version scheme felt right to us, but we're open to being wrong about that too.

Open an issue · Chat on Matrix · Contribute on Codeberg