Terminology used in this document
- datum
- A specific value, encoded in a particular way, that travels through the data pipeline.
- config.yaml
- A file (potentially with a different name) used by the data pipeline to allow users to override default pipeline behaviour. See the File API specification for more details.
- metadata.yaml
- A file used by the data pipeline to describe available data files, listing their associated metadata. See the File API specification for more details.
- access.yaml
- A file (potentially with a different name) generated by the data pipeline API to record file access. See the File API specification for more details.
Metadata
- data_product
- Identifies which kind of quantity a datum represents (e.g. “human/mixing-matrix”). Path-formatted to permit structure in the filename scheme (defined below). The desired data_product is typically specified in model code, and it is a core part of the data identifiers used in config.yaml, metadata.yaml, and access.yaml.
- version
- A semantic version (semver) string identifying a version of a data_product. If no version is specified, the file API selects the most recent one.
- component
- Identifies a part of a data_product.
- filename
- Specifies the path to a file, typically relative to the data root. Only required on read, and typically inferred from metadata.yaml.
- extension
- Specifies the extension of a file. Required on write to generate a standard filename. Typically provided by a datatype API.
- run_id
- Specifies a unique identifier for a model run. Required on write to generate a standard filename. Typically generated by the file API.
- verified_hash
- Specifies a “verified good” SHA-1 hash for a file. Used by the file API to verify file contents. Typically defined in metadata.yaml.
- calculated_hash
- Specifies the SHA-1 hash computed by the file API for a file. Typically only defined in access.yaml.
- max_warning
- Specifies the maximum known warning level for a particular datum. Could be used by the file API to filter “bad” data (currently not supported). Typically defined in metadata.yaml.
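To illustrate how verified_hash and calculated_hash relate, here is a minimal Python sketch of a hash check. It is not the file API's implementation. The function names calculate_hash and hash_matches are hypothetical, and it assumes hashes are stored as lowercase hexadecimal strings, which this document does not confirm.

```python
import hashlib


def calculate_hash(path, chunk_size=65536):
    """Return the SHA-1 hex digest of the file at `path`.

    The file is read in chunks so that large data files need not fit in memory.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path, verified_hash):
    """Return True if the file's computed hash equals the verified_hash.

    Assumption: verified_hash is a hex string; surrounding whitespace and
    letter case are ignored for the comparison.
    """
    return calculate_hash(path) == verified_hash.strip().lower()
```

In this sketch, the value returned by calculate_hash plays the role of calculated_hash, and hash_matches compares it against a verified_hash such as one read from metadata.yaml.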
Filenames
{data root}/{data_product}…/{run_id}.{extension}
e.g. {data root}/human/mixing-matrix/12345.h5
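As a rough illustration of the scheme above, the following Python sketch assembles a filename of the same shape as the example. The function name standard_filename and its parameter names are hypothetical. The elided "…" segment of the scheme is deliberately left out, since its contents are not specified here.

```python
from pathlib import PurePosixPath


def standard_filename(data_root, data_product, run_id, extension):
    """Build {data root}/{data_product}/{run_id}.{extension}.

    Assumption: the elided "…" part of the documented scheme is not
    modelled; only the segments visible in the example are joined.
    """
    product = data_product.strip("/")  # keep the product relative to the root
    ext = extension.lstrip(".")        # accept "h5" or ".h5"
    return str(PurePosixPath(data_root) / product / f"{run_id}.{ext}")


print(standard_filename("/data", "human/mixing-matrix", 12345, "h5"))
# prints "/data/human/mixing-matrix/12345.h5"
```

Because data_product is path-formatted, joining it onto the data root yields the nested directory structure directly.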