Notice: This documentation is being updated; some of the information may be out of date or incorrect.

Introduction

The FAIR Data Pipeline is intended to enable tracking of the provenance of FAIR (findable, accessible, interoperable and reusable) data used in epidemiological modelling. Pipeline APIs written in C, C++, FORTRAN, Java, Julia, Python and R can be called by modelling software for data ingestion. The APIs interact with a local relational database that stores metadata and with the local filesystem, and are configured using a YAML file associated with the model run. Local files and metadata can be synchronised with a remote registry via a command line interface (fair).

The key benefits of using the FAIR Data Pipeline are:

  • Open source: all code is available on the FAIRDataPipeline GitHub
  • Data recorded in a FAIR fashion (metadata on all data and code is open and available for inspection)
  • Provenance tracing allows model outputs to be traced to inputs and modelling code
  • Multiple language support
  • Designed to run on a broad range of platforms (including HPC, inside Safe Havens)
  • Designed so that a run is set up and completed online (to download and upload data) while the model itself runs offline (as Safe Havens require)
  • Open metadata provides knowledge of or access to shared central data for specific domains (e.g. COVID-19 epidemiological modelling)

Running Models

To use the FAIR Data Pipeline with a piece of modelling software, you must add a language-specific Pipeline API as a dependency and interact with data registered in the pipeline through the methods it provides. Each model run must be configured using a config.yml file, which identifies inputs and outputs by their metadata.

graph LR;
    subgraph CLI
        fair
    end
    subgraph Local API
        API[Pipeline API]
        CY[config.yml]
    end
    subgraph localhost
        LR[Registry]
        FS[File Store]
    end
    subgraph Model
        MC[Model code]
    end
 
    fair --> CY
    CY --> API

    fair --> MC

    MC --> |read/write/link_*| API
    API --> |read/write/link_*| LR

    LR --> |read/link_*| API
    API --> |read/link_*| MC

    API --> |write_*| FS
    FS  --> |read_*| API

    MC --> |"(from link_write)"| FS
    FS --> |"(from link_read)"| MC
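
The flow in the diagram above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real Pipeline API: the class `ToyPipeline`, its `finalise` method, the data product names and the dictionary standing in for a parsed config.yml are all hypothetical. Only the names `link_read` and `link_write` are taken from the diagram, and their signatures here are assumed. The sketch shows model code asking the API for file paths, reading and writing the File Store directly, and leaving a record of which inputs produced which outputs.

```python
import json
import tempfile
from pathlib import Path


class ToyPipeline:
    """Hypothetical stand-in for a Pipeline API (not the real library)."""

    def __init__(self, config, store_dir):
        self.config = config          # stands in for a parsed config.yml
        self.store = Path(store_dir)  # plays the role of the File Store
        self.registry = {}            # plays the role of the local Registry
        self.inputs = []              # data products read during this run
        self.outputs = []             # data products written during this run

    def link_read(self, data_product):
        """Return the path of an input that the config declares under 'read'."""
        if data_product not in self.config["read"]:
            raise KeyError(f"{data_product!r} is not listed under 'read'")
        self.inputs.append(data_product)
        return self.registry[data_product]

    def link_write(self, data_product):
        """Return a path that the model code should write its output to."""
        if data_product not in self.config["write"]:
            raise KeyError(f"{data_product!r} is not listed under 'write'")
        path = self.store / (data_product.replace("/", "_") + ".json")
        self.registry[data_product] = path
        self.outputs.append(data_product)
        return path

    def finalise(self):
        """Return a provenance record linking this run's inputs and outputs."""
        return {"inputs": list(self.inputs), "outputs": list(self.outputs)}


def run_model(pipeline):
    """Model code: read one input, write one output, both via linked paths."""
    cases = json.loads(pipeline.link_read("example/cases").read_text())
    out_path = pipeline.link_write("example/total")
    out_path.write_text(json.dumps({"total": sum(cases)}))
    return pipeline.finalise()


def demo():
    # Assumed, simplified config: lists of data product names only.
    config = {"read": ["example/cases"], "write": ["example/total"]}
    with tempfile.TemporaryDirectory() as tmp:
        pipeline = ToyPipeline(config, tmp)
        # Register an input by hand; in practice it would already be registered.
        source = Path(tmp) / "cases.json"
        source.write_text(json.dumps([3, 5, 8]))
        pipeline.registry["example/cases"] = source
        record = run_model(pipeline)
        total = json.loads(pipeline.registry["example/total"].read_text())["total"]
    return record, total
```

Calling `demo()` runs the toy model once and returns the provenance record together with the computed total. The point of the design is that the model code never chooses file locations itself: it names data products, and the API resolves them against the config and the registry.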

Getting data

The command line interface fair is used to download and upload data and metadata required for and produced by model runs.

graph LR;
    subgraph Remote
        RR
        OS
        URI
    end
    subgraph Local
        LR
        FS
    end

    RR[Remote Registry]-->|fair pull| LR[Local Registry]
    LR-->|fair push| RR

    RR-->OS(Managed Object Store)
    RR-->URI(Arbitrary URI)
    LR-->FS(Local Filesystem)

    OS-->|fair pull| FS
    URI-->|fair pull| FS
    FS-->|fair push| OS
    FS-->|fair push| URI
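
The data flow in the diagram above can be illustrated with a small Python sketch. It is a conceptual model only and does not call or reproduce the fair command line interface: the functions `pull` and `push`, the dictionaries standing in for registries and file stores, and the rule that a push copies only entries missing from the remote are all assumptions made for illustration.

```python
def pull(remote_registry, remote_store, local_registry, local_store, names):
    """Copy metadata and files for the named data products to the local side."""
    for name in names:
        entry = remote_registry[name]
        local_registry[name] = dict(entry)
        local_store[entry["path"]] = remote_store[entry["path"]]


def push(local_registry, local_store, remote_registry, remote_store):
    """Copy entries that the remote does not yet hold; return their names."""
    pushed = []
    for name, entry in local_registry.items():
        if name not in remote_registry:
            remote_registry[name] = dict(entry)
            remote_store[entry["path"]] = local_store[entry["path"]]
            pushed.append(name)
    return pushed


def demo_sync():
    # Remote side: one registered data product and its file contents.
    remote_registry = {"example/cases": {"path": "cases.json"}}
    remote_store = {"cases.json": "[3, 5, 8]"}
    local_registry, local_store = {}, {}

    # Download the input needed for a model run.
    pull(remote_registry, remote_store, local_registry, local_store,
         ["example/cases"])

    # A model run (possibly offline) registers a new output locally.
    local_registry["example/total"] = {"path": "total.json"}
    local_store["total.json"] = '{"total": 16}'

    # Upload what the remote is missing.
    pushed = push(local_registry, local_store, remote_registry, remote_store)
    return pushed, sorted(remote_registry)
```

In this sketch metadata and files always travel together, mirroring the paired registry and storage arrows in the diagram.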