File API #
The file API manages file access, provenance, and metadata #
The API is accessed as a “session”. All reads and writes are recorded and logged into a file when the session closes. Files are identified by their metadata, though the metadata is handled differently for reads (where the files are expected to exist) and writes (where they typically do not), described in more detail below.
The file API behaviour is entirely determined by a yaml configuration file (referred to here as a “config.yaml” file, and described below) provided at initialisation. This configuration file defines the “data directory” that the file API should interact with. That directory must contain a file called “metadata.yaml”, described below, that defines the metadata associated with the files in the data directory. The data directory and the metadata.yaml file can be automatically created by a download script which reads the config.yaml file and downloads appropriate data and metadata.
When a model or script is run (in the “run”), any output files are written to the data directory, and an “access.yaml” file is created that enumerates exactly which files were read and written to during the run. The access.yaml file contains sufficient information to upload all of the data and metadata from the run to the data store and data registry respectively. This can be carried out automatically using an upload script if desired. Note that the access.yaml file may not be written until the connection to the API is closed (this is certainly true for the python implementation). When the file API is initialised a “run_id” is created to uniquely identify that invocation. It is constructed by forming the SHA1 hash of the configuration file content, plus the date time string.
For normal modelling runs, the only interaction with the File API happens through setting the config.yaml file (and running the download and upload scripts), but the rest of the information (formats of the metadata.yaml and access.yaml file, and the low-level File API calls themselves are provided here for completeness).
config.yaml file format #
The config file lets users specify metadata to be used during file lookup, and configure overall file API behaviour. A simple example:
data_directory: .
access_log: access-{run_id}.yaml
run_metadata:
description: A test model
data_registry_url: https://data.fairdatapipeline.org/api/
default_input_namespace: SCRC
default_output_namespace: model_test
submission_script: model.py
remote_uri: ssh://boydorr.gla.ac.uk/srv/ftp/scrc/
remote_uri_override: ftp://boydorr.gla.ac.uk/scrc/
read:
- where:
data_product: human/commutes
use:
version: 1.0
- where:
data_product: human/population
use:
filename: my-human-population.csv
write:
- where:
data_product: human/outbreak-timeseries
Component:
use:
namespace: simple_network_sim
data_product: human/outbreak-timeseries
data_directory
specifies the file system root used for data access (default “.”). It may be relative; in which case it is relative to the directory containing the config file. The data directory must contain a metadata.yaml
file.
access_log
specifies the filename used to record the access log (default “access-{run_id}.yaml”). It may be relative; in which case it is relative to the directory containing the config file. It may contain the string {run_id}
, which will be replaced with the run id. It may be set to the boolean value False to indicate that no access log should be written.
run_id
specifies the run id to be used, otherwise a hash of the config contents and the date will be used.
run_metadata
provides metadata for the run that will be passed through to the access log.
The where
sections specify metadata subsets that are matched in the read and write processes. The metadata values may use glob syntax, in which case matching is done against the glob. The corresponding use
sections contain metadata that is used to update the call metadata before the file access is attempted. A filename
may be specified directly, in which case it will be used without any further lookup.
Any other attributes will be ignored.
metadata.yaml file format #
The metadata file contains metadata for all files in the file system “database”. A simple example:
-
data_product: human/commutes
version: 1
extension: csv
filename: human/commutes/1.csv
verified_hash: 075abd810909918419cf7495c16f1afec6fa010c
-
data_product: human/compartment-transition
version: 1
extension: csv
filename: human/compartment-transition/1.csv
verified_hash: 65662d0461471f36a06b32ca6d4003ca4493848f
Each section defines the metadata for a single file, including its filename, relative to the directory containing the metadata.yaml file.
access.yaml format #
The access file is generated whenever close()
is called on the API. It records basic information about the run, and a log of file accesses. An example:
data_directory: .
run_id: 84b87c5f60
open_timestamp: 2020-06-24 14:30:22.010927
close_timestamp: 2020-06-24 14:30:22.038766
config:
...
run_metadata:
git_repo: https://github.com/ScottishCovidResponse/simple_network_sim
git_sha: 353697d0a04ef5d6d5a04ef9aef514cbd72a55fd
...
io:
- type: read
timestamp: 2020-06-24 14:30:22.018370
call_metadata:
data_product: human/mixing-matrix
extension: csv
access_metadata:
data_product: human/mixing-matrix
version: 1.0.0
extension: csv
filename: human/mixing-matrix/1.csv
verified_hash: 075abd810909918419cf7495c16f1afec6fa010c
calculated_hash: 075abd810909918419cf7495c16f1afec6fa010c
- type: write
timestamp: 2020-06-24 14:30:22.038511
call_metadata:
data_product: human/estimatec
extension: csv
access_metadata:
data_product: human/estimatec
extension: csv
source: simple_network_sim
filename: human/estimatec/84b87c5f60.csv
calculated_hash: 91a6791ab4f6d3a4616066ffcae641ca3da79567
data_directory
specifies the file system root used for data access, either as an absolute path, or relative to the config.yaml file used to generate the run.
run_id
specifies the run id of the run.
config
reproduces the config.yaml used to generate this run verbatim.
run_metadata
contains additional metadata about the file API execution taken from the config and possibly overridden using the set_run_metadata
function.
open_timestamp
and close_timestamp
record time at which the file API was initialised, and the time at which close()
was called.
io
points to a list of file access sections containing a common format:
type
is either read or write.
timestamp
is the timestamp of the access.
call_metadata
contains the metadata provided to the open_for_read
or open_for_write
call.
access_metadata
contains the metadata used to open the file. The process for obtaining this metadata is described in the open_for_read
process and open_for_write
process sections below.
open_for_read process #
- Search config for all read sections that are a subset of the given metadata and update the call metadata with the corresponding overrides.
- Search the metadata file for all metadata sections that are a superset of the updated metadata; if any results, use the metadata section with the highest version.
- Use the filename defined in the metadata; fail if there is no filename specified, or if the file is not found.
- Calculate the hash of the file and store it in the metadata.
- If hash verification is enabled, check that there is a verified hash in the metadata, and that it matches the calculated hash; fail otherwise.
- Record the read.
- Open the file for read and return the file handle.
open_for_write process #
- Search config for all write sections that are a subset of the given metadata and update the call metadata with the corresponding overrides.
- If the metadata does not contain a filename, use the metadata and the run id to construct a standard filename, and add it to the metadata.
- Create all missing parent directories.
- If the file already exists, open the file for update, else open the file for write (and thus create it implicitly).
- Register a call-back to record the write on close and return the open file handle.
The File API consists of five logical functions (in python, implementation details may vary):
Function | Description |
---|---|
init(configuration_filename) | Initialise the API with a configuration file. |
open_for_read(metadata) | Use metadata to open a file for reading. |
open_for_write(metadata) | Use the metadata to open a file for writing. |
close() | Write the access log to disk. May be called again. |
set_run_metadata(key, value) | Associate a (key, value) metadata pair with the run. This is used by the Standard API to transmit the model uri and git_sha to the access.yaml. |