DPAPI functions #
Note that this is a living document and the following is subject to change.
This page gives a full working example of the user-written config.yaml file alongside the working config file generated by fair run. Note that the Data Pipeline API takes the working config file as an input.
The following example downloads some data from outside the pipeline, does some processing in R (for example), and records the original file and the resultant data product into the pipeline.
In this simple example, the user should run the following from the terminal:
fair pull config.yaml
fair run config.yaml
fair push config.yaml
These functions require a config.yaml file to be supplied by the user. This file should specify various metadata associated with the code run, including where external objects come from and the aliases that will be used in the submission script, which data objects are to be read and written, and the location of the submission script.
User written config.yaml #
run_metadata:
  description: Register a file in the pipeline
  local_data_registry_url: https://localhost:8000/api/
  remote_data_registry_url: https://data.fairdatapipeline.org/api/
  default_input_namespace: SCRC
  default_output_namespace: soniamitchell
  write_data_store: /Users/SoniaM/datastore/
  local_repo: /Users/Soniam/Desktop/git/SCRC/SCRCdata
  script: |-
    R -f inst/SCRC/scotgov_management/submission_script.R ${{CONFIG_DIR}}

register:
- external_object: records/SARS-CoV-2/scotland/cases-and-management
  namespace_name: Scottish Government Open Data Repository
  namespace_full_name: Scottish Government Open Data Repository
  namespace_website: https://statistics.gov.scot/
  root: https://statistics.gov.scot/sparql.csv?query=
  path: |
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX data: <http://statistics.gov.scot/data/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX mp: <http://statistics.gov.scot/def/measure-properties/>
    PREFIX dim: <http://purl.org/linked-data/sdmx/2009/dimension#>
    PREFIX sdim: <http://statistics.gov.scot/def/dimension/>
    PREFIX stat: <http://statistics.data.gov.uk/def/statistical-entity#>
    SELECT ?featurecode ?featurename ?date ?measure ?variable ?count
    WHERE {
      ?indicator qb:dataSet data:coronavirus-covid-19-management-information;
                 dim:refArea ?featurecode;
                 dim:refPeriod ?period;
                 sdim:variable ?varname;
                 qb:measureType ?type.
      {?indicator mp:count ?count.} UNION {?indicator mp:ratio ?count.}
      ?featurecode <http://publishmydata.com/def/ontology/foi/displayName> ?featurename.
      ?period rdfs:label ?date.
      ?varname rdfs:label ?variable.
      ?type rdfs:label ?measure.
    }
  title: Data associated with COVID-19
  description: The data provide past data around COVID-19 for the daily updates provided by the Scottish Government.
  unique_name: COVID-19 management information
  file_type: csv
  release_date: ${{DATETIME}}
  version: 0.${{DATE}}.0
  primary: True

write:
- data_product: records/SARS-CoV-2/scotland/cases-and-management/ambulance
  description: Ambulance data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/calls
  description: Calls data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/carehomes
  description: Care homes data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/hospital
  description: Hospital data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/mortality
  description: Mortality data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/nhsworkforce
  description: NHS workforce data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/schools
  description: Schools data
  version: 0.${{DATE}}.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/testing
  description: Testing data
  version: 0.${{DATE}}.0
Working config.yaml #
Running fair run should create a working config.yaml file, which is then read by the Data Pipeline API.
run_metadata:
  description: Register a file in the pipeline
  local_data_registry_url: https://localhost:8000/api/
  remote_data_registry_url: https://data.fairdatapipeline.org/api/
  default_input_namespace: soniamitchell
  default_output_namespace: soniamitchell
  write_data_store: /Users/SoniaM/datastore/
  local_repo: /Users/Soniam/Desktop/git/SCRC/SCRCdata
  script: |-
    R -f inst/SCRC/scotgov_management/submission_script.R /Users/SoniaM/datastore/coderun/20210511-231444/
  public: true
  latest_commit: 221bfe8b52bbfb3b2dbdc23037b7dd94b49aaa70
  remote_repo: https://github.com/ScottishCovidResponse/SCRCdata

read:
- data_product: records/SARS-CoV-2/scotland/cases-and-management
  use:
    data_product: records/SARS-CoV-2/scotland/cases-and-management
    version: 0.20210414.0
    namespace: soniamitchell

write:
- data_product: records/SARS-CoV-2/scotland/cases-and-management/ambulance
  description: Ambulance data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/calls
  description: Calls data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/carehomes
  description: Care homes data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/hospital
  description: Hospital data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/mortality
  description: Mortality data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/nhsworkforce
  description: NHS workforce data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/schools
  description: Schools data
  use:
    version: 0.20210414.0
- data_product: records/SARS-CoV-2/scotland/cases-and-management/testing
  description: Testing data
  use:
    version: 0.20210414.0
submission_script.R #
The user should supply a submission script which, in this case, registers an external object, reads it in, and then writes it back to the pipeline as a data product component. In the above example, this script is located in <local_repo>/inst/SCRC/scotgov_management/submission_script.R.
library(SCRCdataAPI)
# Open the connection to the local registry with a given config file
config <- file.path(Sys.getenv("FDP_CONFIG_DIR"), "config.yaml")
script <- file.path(Sys.getenv("FDP_CONFIG_DIR"), "script.sh")
handle <- initialise(config, script)
# Return location of file stored in the pipeline
input_path <- link_read(handle, "records/SARS-CoV-2/scotland/cases-and-management/mortality")
# Process raw data and write data product
data <- read.csv(input_path)
array <- some_processing(data) # e.g. data wrangling, running a model, etc.
index <- write_array(array,
                     handle,
                     data_product = "records/SARS-CoV-2/scotland/cases-and-management/mortality",
                     component = "mortality_data",
                     dimension_names = list(location = rownames(array),
                                            date = colnames(array)))
issue_with_component(index,
                     handle,
                     issue = "this data is bad",
                     severity = 7)
finalise(handle)
initialise() #
- responsible for reading the working config.yaml file
- registers the working config.yaml file, submission script, and GitHub repo
- registers a CodeRun (since the CodeRun UUID should be referenced if ${{RUN_ID}} is specified in a DataProduct name)
- returns a handle (inspected in the sketch below) containing:
  - the working config.yaml file contents
  - the object id for this file
  - the object id for the submission script file
  - the object id for the CodeRun
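For example, once initialise() has been called as in the submission script above, the returned handle can be inspected with base R. This is a minimal sketch only; the exact field names inside the handle depend on the implementation and are not specified here.

# Sketch: create a handle and inspect its top-level structure
config <- file.path(Sys.getenv("FDP_CONFIG_DIR"), "config.yaml")
script <- file.path(Sys.getenv("FDP_CONFIG_DIR"), "script.sh")
handle <- initialise(config, script)
str(handle, max.level = 1)  # field names depend on the implementation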
Note that since StorageLocation has a uniqueness constraint on path, hash, public, and storage_root, files with the same hash in the same storage_root and with the same public flag should generate new Object entries that point to the same path. Likewise, files with the same hash in the same storage_root and with the same public flag should not be duplicated in the data store (see the sketch below).
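As an illustration, the local registry could be queried for a StorageLocation by hash to confirm that a re-registered file reuses the existing entry. The sketch below is not part of the example above and assumes the registry's storage_location endpoint accepts a hash filter and returns paginated JSON with a results field.

library(httr)
# Sketch: look up a StorageLocation in the local registry by file hash
# (assumes the endpoint supports filtering on hash)
hash <- "b03bbbe1205b3de70b1ae7573cf11c8b2555d2ed"
response <- GET("https://localhost:8000/api/storage_location/", query = list(hash = hash))
results <- content(response)$results  # "results" key assumed from paginated JSON
# Registering the same file again (same storage_root and public flag) should
# reuse this StorageLocation rather than create a new one
length(results)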
link_read() #
- responsible for returning the path of an external object in the local data store
- updates handle with the file path (if not already present) and useful metadata
- returns the file path
read_array() #
- responsible for reading the correct data product, which at this point has been downloaded from the remote data store by fair pull (see the sketch below)
- reads a specified version of the data
- updates handle
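A minimal sketch of read_array() usage, assuming it takes the handle, a data product name, and a component name (mirroring the arguments to write_array() in the submission script above):

# Sketch: read a named component from a data product downloaded by fair pull
array <- read_array(handle,
                    data_product = "records/SARS-CoV-2/scotland/cases-and-management/mortality",
                    component = "mortality_data")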
link_write() #
- updates handle with useful metadata
- returns a path to write to (see the sketch below)
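A minimal sketch of link_write() usage, assuming it mirrors link_read() and takes the handle and a data product name; the data product name below is hypothetical:

# Sketch: obtain a path in the data store to write a non-HDF5 file to
output_path <- link_write(handle, "records/SARS-CoV-2/scotland/cases-and-management/raw-data")  # hypothetical name
write.csv(data, output_path, row.names = FALSE)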
write_array() #
- responsible for writing an array as a component to an HDF5 file
- writes the component to the HDF5 file
- updates handle (see the sketch below):
  - if this is the first component to be written, updates handle with the storage location
  - if this is not the first component to be written, references the storage location from the handle
  - if the component is already recorded in the handle, returns the index of this handle reference invisibly
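For example, writing a second component to the same data product reuses the storage location already recorded in the handle. The sketch below reuses the write_array() call from the submission script; the second component name (cases_data) and both arrays are hypothetical:

# Sketch: the first call records the storage location in the handle
index1 <- write_array(mortality_array, handle,
                      data_product = "records/SARS-CoV-2/scotland/cases-and-management/mortality",
                      component = "mortality_data",
                      dimension_names = list(location = rownames(mortality_array),
                                             date = colnames(mortality_array)))
# Sketch: a second call to the same data product references that storage location
index2 <- write_array(cases_array, handle,
                      data_product = "records/SARS-CoV-2/scotland/cases-and-management/mortality",
                      component = "cases_data",
                      dimension_names = list(location = rownames(cases_array),
                                             date = colnames(cases_array)))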
raise_issue() #
- responsible for writing issue-related metadata to the handle
- records issue and data product / component related metadata in the handle
- note that where an issue is associated with an entire object, the whole_object component is referenced; the whole_object component is generated automatically whenever a new Object is created
- components or whole data products may be referenced by name or via a reference (for instance returned by write_array())
- multiple components or data products may be linked to a single issue, either via two functions (one to raise issues and one to attach them) or via arguments to raise_issue() that allow multiple components to be attached (see the sketch below)
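Using only the functions that appear in the submission script, one issue can be attached to more than one component by calling issue_with_component() once per index returned by write_array(). This is a sketch; whether raise_issue() itself accepts multiple components is left open above, and the issue text is hypothetical:

# Sketch: attach the same issue to two components via their handle indices
issue_with_component(index1, handle,
                     issue = "values before March 2020 are known to be incomplete",
                     severity = 5)
issue_with_component(index2, handle,
                     issue = "values before March 2020 are known to be incomplete",
                     severity = 5)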
finalise() #
- renames files with their hash (see the sketch below)
  - until this point, we've arbitrarily named each file, e.g. dat-{random_hash}.{extension}
  - this is renamed to, for example, b03bbbe1205b3de70b1ae7573cf11c8b2555d2ed.{extension}
- renames data products if variables are present, e.g. for human/outbreak/simulation_run-${{RUN_ID}}, ${{RUN_ID}} is replaced with the CodeRun UUID
- records metadata (e.g. location, components, various descriptions, issues) in the data registry
- updates the code run in the data registry
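The renaming step amounts to hashing the file contents and moving the file. The sketch below illustrates the idea only (it is not part of the API), using the digest package and a hypothetical temporary file name:

library(digest)
# Sketch: a file written as dat-{random_hash}.csv is renamed to its SHA-1 hash
temporary_path <- "/Users/SoniaM/datastore/dat-9f86d081.csv"  # hypothetical
file_hash <- digest(temporary_path, algo = "sha1", file = TRUE)
file.rename(temporary_path, file.path(dirname(temporary_path), paste0(file_hash, ".csv")))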
submission_script.py #
Alternatively, the submission script may be written in Python.
from data_pipeline_api.standard_api import StandardAPI
from pandas import read_csv  # e.g. pandas for reading the CSV

# Open the connection to the local registry with a given config file
with StandardAPI.from_config("config.yaml") as api:
    # Return location of file stored in the pipeline and read it in
    data = read_csv(api.link_read("records/SARS-CoV-2/scotland/cases-and-management"))
    # Process raw data and write data product
    matrix = some_processing(data)  # e.g. data wrangling, running a model, etc.
    api.write_array("records/SARS-CoV-2/scotland/cases-and-management/mortality", "mortality_data", matrix)
    api.issue_with_component("records/SARS-CoV-2/scotland/cases-and-management/mortality", "mortality_data", "this data is bad", "7")
    api.finalise()
submission_script.jl #
Alternatively, the submission script may be written in Julia.
using DataPipeline
using CSV, DataFrames  # e.g. CSV.jl and DataFrames.jl for reading the raw data
# Open the connection to the local registry with a given config file
handle = initialise("working-config.yaml", "script.sh")
# Return location of file stored in the pipeline
input_path = link_read(handle, "records/SARS-CoV-2/scotland/cases-and-management/mortality")
# Process raw data and write data product
data = CSV.read(input_path, DataFrame)
array = some_processing(data) # e.g. data wrangling, running a model, etc.
index = write_estimate(array,
                       handle,
                       data_product = "records/SARS-CoV-2/scotland/cases-and-management/mortality",
                       component = "mortality_data")
issue_with_component(index,
                     handle,
                     issue = "this data is bad",
                     severity = 7)
finalise(handle)