
Using Versatile Data Kit to Ingest and Process Data from REST API

A tutorial on how to use Versatile Data Kit (VDK) to ingest data from a REST API, with some tips on how to also install the VDK server.

Angelica Lo Duca
Towards Data Science
6 min read · Mar 15, 2022



Versatile Data Kit (VDK) is an open-source framework, recently released by VMware. It works with a data lake where you can ingest data in different formats and then easily access it. In my previous article, entitled An Overview of Versatile Data Kit, I described the main concepts behind VDK, including a step-by-step guide to install it. In a subsequent article, entitled From Raw Data to a Cleaned Database: A Deep Dive into Versatile Data Kit, I illustrated how to connect VDK to Trino DB, as well as how to ingest and process CSV files.

In this article, I provide you with another example in VDK: extracting data from a REST API. Since access to the REST API requires a secret key, I also describe how to configure the VDK server to store the secret.

The article is organized as follows:

  • Definition of the Scenario
  • Environment Setup
  • Data Ingestion in VDK
  • Data Processing in VDK

1 Definition of the Scenario

The objective of this scenario is to ingest and process, in VDK, Vincent Van Gogh’s paintings available in Europeana, a well-known European platform for cultural heritage. Europeana exposes all its Cultural Heritage objects through a public REST API, including more than 700 works whose author is Vincent Van Gogh. For each record, the REST API provides a lot of information, including completeness, country, data provider, creator, title, and much more.

The following JSON shows a record extract returned by the Europeana REST API:

[{
  "completeness": 10,
  "country": ["Belgium"],
  "dataProvider": ["Catholic University of Leuven"],
  "dcCreator": ["http://data.europeana.eu/agent/base/59832",
                "Vincent van Gogh"],
  "dcCreatorLangAware": {
    "def": ["http://data.europeana.eu/agent/base/59832"],
    "en": ["Vincent van Gogh"]
  },
  "dcDescription": ["KU Leuven. Glasdia’s kunstgeschiedenis. Université de Louvain, tussen 1839 en 1939. Fotograaf onbekend. Toegevoegde informatie op dia. Stroming/Stijl: Postimpressionisme. Creatie/Bouw: 1889. Techniek/Materiaal: Olieverf op doek. Huidige locatie: Nederland, Otterlo, Kröller-Müller Museum. EuroPhot. Kunstgeschiedenis. 19de eeuw. Schilderkunst. Portret. EuroPhot. Art history. 19th century. Paintings. Portrait."],
  ...
}]

The access point to the REST API is available at the following link:

https://api.europeana.eu/record/v2/search.json

It is necessary to register for free to obtain an API key.
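
To get a feel for the endpoint before wiring it into VDK, the following sketch performs a single search request with the requests library; YOUR_API_KEY is a placeholder, and the wskey, query, rows, and start parameters are the same ones used in the ingestion step later:

import requests

# Query the Europeana search endpoint for Van Gogh's works (first page only)
params = {
    "wskey": "YOUR_API_KEY",            # the key received by email
    "query": 'who:"Vincent Van Gogh"',  # restrict results to Van Gogh
    "rows": 100,                        # items per page
    "start": 1,                         # 1-based pagination cursor
}
response = requests.get("https://api.europeana.eu/record/v2/search.json", params=params)
response.raise_for_status()
payload = response.json()
print(payload["totalResults"], "works found")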

2 Environment Setup

To run this example, I need:

  • Versatile Data Kit
  • Trino DB
  • Versatile Data Kit plugin for Trino
  • Versatile Data Kit Server
  • Europeana API key

In my previous article, I have already described how to configure Trino DB and the Versatile Data Kit plugin for Trino, so you can refer to it if you need to install them.

2.1 Versatile Data Kit Server

In this example, I install the Versatile Data Kit (VDK) server locally. I will use the VDK server to store the Europeana API key.

The VDK server has some prerequisites, which I can install by following their official guides. Then I can install the Versatile Data Kit Server through the following command:

vdk server --install

For more information on how to install the VDK server, you can refer to the VDK official documentation.

Once the installation is complete, I should be able to access the VDK REST API at this link:

http://localhost:8092/swagger-ui.html

In addition, from the Docker dashboard, I should be able to see the installed VDK server, as shown in the following figure:

Image by Author

2.2 Europeana API key

I can obtain a Europeana API key by registering at this link. I have to complete a form that requests my email address. Once I submit the form, the API key is sent by email.

I can add the Europeana API key as a VDK property as follows:

vdk properties --set-secret api_key

VDK will ask you for the following information:

Job Name: online-exhibition
Job Team: my-team
api_key: YOUR_API_KEY

After this step, the Europeana API key should be stored in your VDK server. I can check whether it has been correctly stored by running the following command:

vdk properties --list

I enter the Job Name and the Job Team, and the system returns the Europeana API key.
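
For reference, this stored property is what a data job step later reads back through the job input object. The following minimal sketch assumes the property name api_key set above; the full ingestion step is shown in Section 3:

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Read the secret stored with `vdk properties --set-secret api_key`
    api_key = job_input.get_property("api_key")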

3 Data Ingestion in VDK

Data Ingestion uploads the output of the call to the Europeana REST API into the database. It is performed through the following steps:

  • delete the existing table (if any)
  • create a new table
  • ingest table values directly from the REST API.

Since this example accesses the Europeana REST API, it requires an active Internet connection to work properly.
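
In VDK, each step is a separate SQL or Python file inside the data job directory, and the steps run in alphabetical order of their file names. A possible layout for this job is sketched below; the file names are illustrative, not necessarily those used in the official example:

online-exhibition/
├── 10_drop_assets_table.sql      (delete the existing table)
├── 20_create_assets_table.sql    (create a new table)
├── 30_ingest_assets.py           (ingest values from the REST API)
└── config.ini                    (job configuration)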

3.1 Delete the existing table

I write this step as an SQL command, which simply drops the table if it exists:

DROP TABLE IF EXISTS assets

The output of the REST API is imported into a table called assets.

3.2 Create a new table

Now I define a new step, which creates a new table that will contain the output of the REST API. I import the output as raw values, some of which may contain JSON:

CREATE TABLE assets (completeness int,
country varchar,
dataProvider varchar,
dcCreator varchar,
dcCreatorLangAware varchar,
dcTitleLangAware varchar,
edmDatasetName varchar,
edmIsShownBy varchar,
edmPreview varchar,
edmTimespanLabel varchar,
edmTimespanLabelLangAware varchar,
europeanaCollectionName varchar,
europeanaCompleteness int,
"guid" varchar,
id varchar,
index int,
language varchar,
link varchar,
previewNoDistribute boolean,
provider varchar,
rights varchar,
score double,
"timestamp" varchar,
timestamp_created varchar,
timestamp_created_epoch varchar,
timestamp_update varchar,
timestamp_update_epoch varchar,
title varchar,
"type" varchar,
ugc varchar
)

The following table shows an example of the assets table:

Image by Author

The previous table shows only some fields, to give an idea of how raw the imported data are.

3.3 Ingest table values directly from the REST API

The next step is a Python script that imports values directly from the REST API. I implement the run() function, which is automatically invoked by VDK, as follows. First, I retrieve the api_key through the get_property() method of the job_input object. Then I access the Europeana REST API. Since Europeana returns the information through pagination, I loop across all the pages. Finally, I store the retrieved information by using the send_tabular_data_for_ingestion() method. The complete code of the run() function is shown below:

import pandas as pd
import requests

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Read the Europeana API key stored as a VDK property
    api_key = job_input.get_property("api_key")
    start = 1
    rows = 100

    basic_url = f"https://api.europeana.eu/record/v2/search.json?wskey={api_key}&query=who:%22Vincent%20Van%20Gogh%22"
    url = f"{basic_url}&rows={rows}&start={start}"
    response = requests.get(url)
    response.raise_for_status()
    payload = response.json()
    n_items = int(payload["totalResults"])

    # Europeana paginates the results, so loop until all the items are retrieved
    while start < n_items:
        if start > n_items - rows:
            rows = n_items - start + 1
        url = f"{basic_url}&rows={rows}&start={start}"
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()["items"]
        df = pd.DataFrame(payload)
        # Send the current page of records to the assets table
        job_input.send_tabular_data_for_ingestion(
            df.itertuples(index=False),
            destination_table="assets",
            column_names=df.columns.tolist(),
        )
        start = start + rows
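
Once all the steps are in place, the whole data job can be executed locally through the VDK CLI, from the directory that contains the job folder (online-exhibition is the job name registered above):

vdk run online-exhibition

VDK runs the SQL and Python steps in order and ingests the retrieved records into the assets table.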

4 Data Processing in VDK

The imported table is quite dirty, so it should be cleaned. Data Processing cleans the assets table and extracts only some useful information, which includes the following columns:

  • country
  • edmPreview
  • provider
  • title
  • rights

I store the result in a table named cleaned_assets. I define two steps for data processing:

  • delete the cleaned_assets table (if any)
  • create and populate the cleaned_assets table.

4.1 Delete the cleaned_assets table

As for the assets table, this step is a single SQL statement that drops the table if it exists:

DROP TABLE IF EXISTS cleaned_assets

4.2 Create and populate the cleaned_assets table

This step manipulates the assets table to extract only useful information. I can write it as follows:

CREATE TABLE cleaned_assets AS (SELECT
SUBSTRING(country, 3, LENGTH(country)-4) AS country,
SUBSTRING(edmPreview, 3, LENGTH(edmPreview)-4) AS edmPreview,
SUBSTRING(provider, 3, LENGTH(provider)-4) AS provider,
SUBSTRING(title, 3, LENGTH(title)-4) AS title,
SUBSTRING(rights, 3, LENGTH(rights)-4) AS rights
FROM assets)

The previous code cleans some fields through the SUBSTRING() function: each raw value is stored as a stringified list such as ["Belgium"], so taking the substring that starts at the third character with length LENGTH(column)-4 strips the surrounding brackets and quotes and leaves only the bare value (Belgium).

Now the dataset is ready for further analysis! Enjoy it!
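
As a quick check that the cleaning worked, the cleaned_assets table can also be queried directly from Python with the Trino client. The following is a minimal sketch, assuming the trino package is installed and that host, port, user, catalog, and schema match your local Trino configuration:

import trino

# Connect to the local Trino instance that stores the VDK tables
# (adjust host, port, user, catalog, and schema to your own setup)
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="admin",
    catalog="mysql",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT country, COUNT(*) AS n FROM cleaned_assets GROUP BY country ORDER BY n DESC")
for country, n in cur.fetchall():
    print(country, n)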

Summary

Congratulations! You have just learned how to import data from a REST API in VDK, as well as how to configure the VDK server to store secrets. You can download the example described in this article from the VDK official repository, under the section examples → online-exhibition.

That is all for now. Stay tuned for further updates on VDK!
