Integration of the DIVA
The Data Integrity and Validation Architecture (DIVA) is one of the main building blocks of the MODERATE platform. It is responsible for the validation and quality analysis of the datasets that are ingested into the platform.
This document outlines a possible approach to how the DIVA retrieves datasets from the MODERATE platform to run its data quality pipelines. The main components involved in this workflow include:
- The DIVA itself, which retrieves datasets for validation and quality analysis.
- The object storage service, where the datasets are stored. This service is implemented on top of Google Cloud Storage (GCS), with MODERATE prioritizing the use of the S3-compatible API for GCS to ensure interoperability.
- The platform HTTP API, serving as the main entry point for DIVA to interact with the MODERATE platform. It exposes the catalogue of datasets, allowing the DIVA to list and retrieve the metadata of the datasets that are available for validation.
The details of the API can be reviewed in its interactive API documentation, which is based on OpenAPI and is automatically generated by FastAPI. You can deploy the API documentation locally by following these instructions:
You need Docker to run the development stack.
- Clone the repository MODERATE-Project/moderate-platform-api
- Run the task to deploy the development stack:
task dev-up
- Access the API documentation at http://localhost:8000/docs
- Once you are done, you can stop the development stack by running
task dev-down
Don't worry if the task complains about not being able to pull the image for the Trust Services. The API documentation will work anyway.
The following sequence diagram shows a high-level view of how the DIVA could interact with the MODERATE platform to retrieve datasets and orchestrate the data quality pipelines:
sequenceDiagram
participant USER as User or Periodic Job
participant API
participant DIVA
participant S3 as Object Storage
USER->>API: Requests validation of dataset
alt DIVA is the orchestrator
DIVA->>API: Retrieves list of datasets that are pending validation
API->>DIVA: Responds with the list of datasets
else the platform is the orchestrator
API->>DIVA: Requests data validation for a dataset
DIVA->>API: Acknowledges the request
end
alt access to object storage via the HTTP API
DIVA->>API: Requests download URL for the dataset
API->>DIVA: Responds with pre-signed URL for download
S3->>DIVA: Downloads dataset over HTTP
else direct access to object storage
DIVA->>S3: Uses dataset metadata to fetch it directly
S3->>DIVA: Downloads dataset over S3 protocol
end
DIVA->>DIVA: Runs the data quality pipeline for the dataset
DIVA->>API: Reports pipeline completion status and results
API->>USER: Checks data quality results
Some key points to consider:
- The HTTP API endpoints that the DIVA would use to retrieve the list of datasets and report pipeline results are not yet available. They would be implemented if the MODERATE team decides to adopt this approach.
- There are two possible alternatives for downloading datasets, depending on what is more convenient for the DIVA:
- Access to the object storage via the HTTP API, which provides pre-signed URLs for downloading the datasets.
- Direct access to the object storage service, using the dataset metadata to fetch the datasets directly over the S3 protocol.
- There are also two possible alternatives for triggering the data quality pipeline: either DIVA periodically checks the API, or the platform pushes requests to DIVA, which then executes the pipelines on demand.
- Data quality pipeline runs could be either requested manually by end users via the MODERATE platform web UI or triggered by periodic jobs scheduled to run at specific intervals.
How to download a dataset using the MODERATE API
This is a specific example of how to implement one of the aforementioned alternatives for downloading datasets, namely the one where access is via the HTTP API, which in turn provides presigned URLs for downloading the dataset files.
In this example, we're going to use the public MODERATE API URL, which is deployed at https://api.gw.moderate.cloud
[!WARNING] Please note that the public deployment of the MODERATE platform is intermittently online for the time being.
The first thing that we need to do is get an access token, which is obtained by calling the /api/token
endpoint and passing our username and password.
$ curl --silent --location 'https://api.gw.moderate.cloud/api/token' --header 'Content-Type: application/x-www-form-urlencoded' --data-urlencode 'username=<username>' --data-urlencode 'password=<password>' | jq
{
"access_token": "<jwt-access-token>",
"expires_in": 300,
"refresh_expires_in": 1800,
"refresh_token": "<jwt-refresh-token>",
"token_type": "Bearer",
"not-before-policy": 0,
"session_state": "<session-uuid>",
"scope": "profile email"
}
In MODERATE, there are two dataset entities:
- Asset objects are the actual specific dataset files that users upload to the platform.
- Assets are logical groupings of asset objects. An asset may have several asset objects. All asset objects in a given asset should have some form of relationship or connection.
For example, an asset could be the energy consumption of a building, and the asset objects in that asset could be dataset files for specific devices within that building.
We can browse the catalogue of assets by calling the /asset
endpoint. The following request does not apply any filters and limits the results to one, so we will retrieve the first asset:
$ curl --silent --location 'https://api.gw.moderate.cloud/asset?limit=1' --header 'Authorization: Bearer <jwt-access-token>' | jq
[
{
"uuid": "4477de94-4ffc-490a-ba02-93f0e71db80c",
"name": "One Asset",
"meta": null,
"id": 1,
"objects": [
{
"key": "andres.garcia-assets/weather-0558e356-355b-4379-955b-dfe5023da2af.parquet",
"tags": null,
"created_at": "2024-04-03T10:27:26.247281",
"series_id": null,
"sha256_hash": "aecc871c3aaac446c60009a2902c2a714a35efd117b440f79f6d9c3856261f8e",
"proof_id": null,
"id": 1
},
{
"key": "andres.garcia-assets/customers-100000-7f596126-4771-45e1-9e8a-237e3635eb7c.csv",
"tags": null,
"created_at": "2024-04-03T10:33:05.948939",
"series_id": null,
"sha256_hash": "446d645458479d841c8c0239f6d4f882e4735e63db21ff980c53058eabc6beda",
"proof_id": null,
"id": 4
},
{
"key": "andres.garcia-assets/flights-1m-48653f22-0d5c-4f10-aa2b-81ef35959445.parquet",
"tags": null,
"created_at": "2024-04-03T10:33:20.951760",
"series_id": null,
"sha256_hash": "71ccd0758a73ac9d89ccda6107e3cbfc7e4cd3249d3766f881e48f7513e601fd",
"proof_id": null,
"id": 5
}
],
"access_level": "public"
}
]
Now that we know the ID of the asset that we want to download, we can call the /asset/<id>/download-urls
endpoint, which will return a list of presigned download URLs. These URLs enable any user (e.g., a software service in the DIVA) to download the asset object files in a time-limited fashion by embedding the credentials into the download URL itself.
$ curl --silent --location 'https://api.gw.moderate.cloud/asset/1/download-urls' --header 'Authorization: Bearer <jwt-access-token>' | jq
[
{
"key": "andres.garcia-assets/weather-0558e356-355b-4379-955b-dfe5023da2af.parquet",
"download_url": "https://storage.googleapis.com/moderate-platformapi/andres.garcia-assets/weather-0558e356-<...>"
},
{
"key": "andres.garcia-assets/customers-100000-7f596126-4771-45e1-9e8a-237e3635eb7c.csv",
"download_url": "https://storage.googleapis.com/moderate-platformapi/andres.garcia-assets/customers-100000-7f596126-<...>"
},
{
"key": "andres.garcia-assets/flights-1m-48653f22-0d5c-4f10-aa2b-81ef35959445.parquet",
"download_url": "https://storage.googleapis.com/moderate-platformapi/andres.garcia-assets/flights-1m-48653f22-<...>"
}
]