1. First Steps with DATS¶

This document offers an overview of the DATS model from a practical perspective, detailing how DATS may be used to document a specific dataset.

The DATS model is centered around the Dataset entity, which supports most of the relevant information about the data being observed.

The main building blocks of the DATS model are defined as “entities”, which and for convenience purposes, may be compared to the different “sections” of information in a flat document. Each entity has a number of properties that are instantiated either as other entities or as direct entries. For the latter, information may may be structured (e.g., integer, date, URI) or unstructured (string, or free text entries).

First and foremost, Dataset entity aims to cater for essential provenance information: who, when, what, why, where, and how. By answering these questions, each dataset source will define its own view on what a dataset is. The Dataset entity is also designed to declare which variables were measured and what type of data was collected.

1.1. What is the dataset about?¶

The nature of the information available in a dataset can be recorded via the DATS Dimension entity. It is the object to use for reporting variables measured and for which data have been collected.

The DATS Dimension object can be qualitied using the DATS DataType entity.

The DATS DataType covers four aspects of a variable’s nature: type of information (what the data is about), method (how the data was generated), platform (the instrumentation, software and reagents used to generate the data), and instrument (the specific device used to generate the data).

Importantly, it is key to remember that Dataset may be constitutive parts of another Dataset. Each of these dataset parts can be used to describe a particular aspect of a dataset in greater details. For instance, a dataset describing a multi-omics experiment may contain several datasets, one focusing on transcriptomics, one focusing on metabolomics and so on.

1.2. Why was the data produced?¶

As a Dataset property, the “description” is a textual narrative that typically indicates the dataset’s purpose and why it was produced.

In addition, in the extended DATS it is possible to describe the Study that produced one, or several related datasets, including the purpose, objective, or hypothesis that gave origin to the dataset(s) defined as belonging to a study.

Related studies may also be grouped to constitute a series.

Tracking dataset spatial and temporal properties

1.3. Where was the dataset collected and where was it produced?¶

The DATS Dataset property spatialCoverage includes a description of the geography covered by the dataset and/or measured by the dataset’s dimensions or variables.

spatialCoverage is instantiated within a Place entity, which maps to the entity bearing the same name in schema.org (http://schema.org/Place), to “geoLocation” in the DataCite schema (http://schema.datacite.org/meta/kernel-4.0/) and to “Feature” in GeoJSON (https://tools.ietf.org/html/rfc7946).

1.4. When was the dataset produced?¶

DATS model provides a Date object to records key Date(s) associated with the description of a Dataset.

For each Date, users have to identify its type, in relation to a specific event (e.g. creation, update, validation, verification, deprecation…).

Such generic mechanism of providing Date and temporal information offers flexibility and extensibility. Dates may be repeated and differentiated by type. This allows for extensions to new types of dates that may be required in specific scenarios. The actual definition of the types is delegated to existing ontologies.

1.5. Who produced the dataset?¶

Using the Dataset’s “creators” property, DATS records the Person and/or Organization associated with the dataset, and supports documenting their roles (e.g., creator, curator, developer, funder, principal investigator).

1.6. Where and How can the dataset be accessed?¶

DATS provides for a comprehensive description of the ways to access a Dataset. This information can be reported in the Access entity, that is part of DatasetDistribution as well as part of the description of a DataRepository. It covers information such as the dataset landing page and/or access URL if available, a description of the type of access (such as download, remote access, remote service, enclave or not available) as well as any authorization or authentication needed to access the dataset.