Notice: This documentation is being updated; some of the information may be out of date or incorrect.
Introduction #
The FAIR Data Pipeline is intended to enable tracking of the provenance of FAIR (findable, accessible, interoperable and reusable) data used in epidemiological modelling. Pipeline APIs written in C, C++, FORTRAN, Java, Julia, Python and R can be called by modelling software for data ingestion. These APIs interact with a local relational database that stores metadata and with the local filesystem, and are configured using a YAML file associated with the model run. Local files and metadata can be synchronised with a remote registry via a command line interface (`fair`).
The key benefits of using the FAIR Data Pipeline are:
- Open source: all code is available on the FAIRDataPipeline GitHub
- Data recorded in a FAIR fashion (metadata on all data and code open and available for inspection)
- Provenance tracing allows model outputs to be traced to inputs and modelling code
- Multiple language support
- Designed to run on a broad range of platforms (including HPC and inside Safe Havens)
- Designed so that setup and completion happen online (to download and upload data) while the model itself runs offline (as Safe Havens require)
- Open metadata provides knowledge of or access to shared central data for specific domains (e.g. COVID-19 epidemiological modelling)
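The provenance tracing described above can be illustrated with a minimal sketch. It uses Python's standard `sqlite3` module as a stand-in for a local relational metadata store. The table names, column names, and the functions `make_registry`, `record_run` and `trace` are all hypothetical and chosen for illustration only; they are not the schema or API of the FAIR Data Pipeline registry.

```python
import sqlite3


def make_registry() -> sqlite3.Connection:
    """Create an in-memory toy registry (hypothetical schema, not the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE code_run (id INTEGER PRIMARY KEY, script TEXT NOT NULL);
        CREATE TABLE run_data (
            run_id INTEGER NOT NULL REFERENCES code_run(id),
            data_product TEXT NOT NULL,
            role TEXT NOT NULL CHECK (role IN ('input', 'output'))
        );
        """
    )
    return conn


def record_run(conn, script, inputs, outputs) -> int:
    """Store one model run together with the data products it read and wrote."""
    run_id = conn.execute(
        "INSERT INTO code_run (script) VALUES (?)", (script,)
    ).lastrowid
    rows = [(run_id, name, "input") for name in inputs]
    rows += [(run_id, name, "output") for name in outputs]
    conn.executemany(
        "INSERT INTO run_data (run_id, data_product, role) VALUES (?, ?, ?)", rows
    )
    return run_id


def trace(conn, data_product):
    """Return (script, sorted inputs) for every run that produced data_product."""
    runs = conn.execute(
        "SELECT r.id, r.script FROM code_run r "
        "JOIN run_data d ON d.run_id = r.id "
        "WHERE d.data_product = ? AND d.role = 'output' ORDER BY r.id",
        (data_product,),
    ).fetchall()
    result = []
    for run_id, script in runs:
        inputs = [
            row[0]
            for row in conn.execute(
                "SELECT data_product FROM run_data "
                "WHERE run_id = ? AND role = 'input' ORDER BY data_product",
                (run_id,),
            )
        ]
        result.append((script, inputs))
    return result


if __name__ == "__main__":
    registry = make_registry()
    # Hypothetical script and data product names.
    record_run(registry, "seir_model.py", ["population", "contact_rates"], ["infection_curve"])
    print(trace(registry, "infection_curve"))
```

Because every run is stored with both its inputs and its outputs, a single query per output is enough to recover the modelling code and the input data behind it.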
Running Models #
To use the FAIR Data Pipeline with a piece of modelling software, you must add a language-specific Pipeline API as a dependency and interact with data registered in the pipeline via the methods it presents. Each model run must be configured using a `config.yml` file, which specifies inputs and outputs by metadata.
graph LR;
    subgraph CLI
        fair
    end
    subgraph Local API
        API[Pipeline API]
        CY[config.yml]
    end
    subgraph localhost
        LR[Registry]
        FS[File Store]
    end
    subgraph Model
        MC[Model code]
    end
    fair --> CY
    CY --> API
    fair --> MC
    MC --> |read/write/link_*| API
    API --> |read/write/link_*| LR
    LR --> |read/link_*| API
    API --> |read/link_*| MC
    API --> |write_*| FS
    FS --> |read_*| API
    MC --> |"(from link_write)"| FS
    FS --> |"(from link_read)"| MC
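The interaction between model code, configuration and file store can be sketched roughly as follows. `ToyPipeline` is a hypothetical stand-in written for this example, not one of the real Pipeline APIs. The configuration is a plain Python dictionary rather than a parsed `config.yml`, and the `read`, `write`, `data_product` and `path` keys are assumptions. The method names `link_read` and `link_write` are borrowed from the diagram labels, but their signatures here are invented.

```python
import tempfile
from pathlib import Path


class ToyPipeline:
    """Hypothetical stand-in for a Pipeline API; not the real library interface."""

    def __init__(self, config: dict, data_store: Path):
        # Index the declared inputs and outputs by data product name.
        self.inputs = {e["data_product"]: e for e in config.get("read", [])}
        self.outputs = {e["data_product"]: e for e in config.get("write", [])}
        self.data_store = data_store
        # (action, data_product) pairs standing in for registry metadata.
        self.log = []

    def link_read(self, data_product: str) -> Path:
        """Return the path of a declared input and note that it was read."""
        if data_product not in self.inputs:
            raise KeyError(f"{data_product!r} is not declared as an input")
        self.log.append(("read", data_product))
        return self.data_store / self.inputs[data_product]["path"]

    def link_write(self, data_product: str) -> Path:
        """Return a path for a declared output and note that it was written."""
        if data_product not in self.outputs:
            raise KeyError(f"{data_product!r} is not declared as an output")
        path = self.data_store / self.outputs[data_product]["path"]
        path.parent.mkdir(parents=True, exist_ok=True)
        self.log.append(("write", data_product))
        return path


def run_model(pipeline: ToyPipeline) -> None:
    """Toy model: read one integer parameter and write twice its value."""
    days = int(pipeline.link_read("demo/parameters").read_text().split(",")[1])
    pipeline.link_write("demo/results").write_text(f"doubled_days,{days * 2}\n")


def demo():
    """Run the toy model in a temporary data store; return output text and log."""
    config = {
        "read": [{"data_product": "demo/parameters", "path": "inputs/parameters.csv"}],
        "write": [{"data_product": "demo/results", "path": "outputs/results.csv"}],
    }
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        (store / "inputs").mkdir()
        (store / "inputs" / "parameters.csv").write_text("days,10\n")
        pipeline = ToyPipeline(config, store)
        run_model(pipeline)
        output = (store / "outputs" / "results.csv").read_text()
    return output, pipeline.log


if __name__ == "__main__":
    print(demo())
```

The model code never hard-codes a file location: it asks for a data product by name and receives a path, so the record of what was read and written is produced as a side effect of normal use.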
Getting data #
The command line interface `fair` is used to download and upload the data and metadata required for, and produced by, model runs.
graph LR;
    subgraph Remote
        RR
        OS
        URI
    end
    subgraph Local
        LR
        FS
    end
    RR[Remote Registry]-->|fair pull| LR[Local Registry]
    LR-->|fair push| RR
    RR-->OS(Managed Object Store)
    RR-->URI(Arbitrary URI)
    LR-->FS(Local Filesystem)
    OS-->|fair pull| FS
    URI-->|fair pull| FS
    FS-->|fair push| OS
    FS-->|fair push| URI
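The direction of data flow for pulling and pushing can be sketched conceptually as below. This is a toy model only: the `Registry` class and the `sync`, `pull` and `push` functions are hypothetical, hold everything in memory, and say nothing about how the `fair` command line tool is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy registry pairing metadata with the storage it points to."""

    metadata: dict = field(default_factory=dict)  # data product name -> storage key
    files: dict = field(default_factory=dict)     # storage key -> file contents


def sync(source: Registry, target: Registry, names) -> list:
    """Copy metadata and the files it refers to from source to target."""
    copied = []
    for name in names:
        key = source.metadata[name]
        if target.metadata.get(name) == key and key in target.files:
            continue  # target already holds this data product
        target.files[key] = source.files[key]
        target.metadata[name] = key
        copied.append(name)
    return copied


def pull(remote: Registry, local: Registry, names) -> list:
    """Bring inputs from the remote side to the local side."""
    return sync(remote, local, names)


def push(local: Registry, remote: Registry, names) -> list:
    """Send outputs from the local side to the remote side."""
    return sync(local, remote, names)


if __name__ == "__main__":
    remote = Registry({"demo/parameters": "obj-1"}, {"obj-1": b"days,10\n"})
    local = Registry()
    pull(remote, local, ["demo/parameters"])
    # A model run would happen here and register its output locally.
    local.metadata["demo/results"] = "obj-2"
    local.files["obj-2"] = b"doubled_days,20\n"
    push(local, remote, ["demo/results"])
    print(sorted(remote.metadata))
```

Metadata and file contents move together in both directions, which mirrors the pairing of registry and storage on each side of the diagram.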