
Intro to AsterixDB – Part 1

A "one size fits a bunch" Big Data Management System with geospatial capabilities

Photo by Nastya Dulhiier on Unsplash

Introduction

AsterixDB is a scalable, open-source Big Data Management System (BDMS) with geospatial capabilities. It offers many more features, which you can explore here, hence its description as "one size fits a bunch." This tutorial, however, will not go too deep into the complexities and potential of AsterixDB. Instead, I will be creating three blog posts that act as a general, practical tutorial for those new to AsterixDB while also showcasing AsterixDB’s new geospatial functions. The posts will go in the following order:

  1. Prepare and Load the Data
  2. Perform Spatial Queries
  3. Measure Performance

In this blog, I will be discussing how to prepare and load the data. Thus, I will explain how to download the data, create our schema (dataverse, datatypes, and datasets), load our data, and transform and insert our data. Furthermore, we will use different file types so we can learn how to input data in a variety of formats with AsterixDB. In the next section, we will download our data.

Selecting the Data

To start, we need data that contains geospatial attributes. For this, we will go to the UCR Star website (a free visual catalog of big spatial datasets). There is a variety of datasets to choose from, and you are welcome to download whichever you want (though it might be better to do that after this tutorial). In our example, we will use a subset of the NE/urban_areas (Urban Areas) and Yin/instagram-twitter (Instagram-Twitter) datasets. The Urban Areas dataset uses polygons as its geospatial attribute to represent areas of "dense human habitation". The Instagram-Twitter dataset, on the other hand, uses points as its geospatial attribute to represent where users have tweeted on Twitter or posted on Instagram. I selected these datasets to keep our understanding of geospatial queries simple and intuitive: with one set that contains polygons (Urban Areas) and another that contains points (Instagram-Twitter), we can create queries that analyze the relationships between the polygons and points – e.g., figuring out how many points fall within each polygon. Some of the questions we want to answer by the end of this tutorial include the following:

  1. How many Urban Areas and Tweets & Instagram posts are in our dataset?
  2. How many Tweets & Instagram posts does each urban area have?
  3. How many Tweets & Instagram posts are not within Urban areas?
  4. Which Urban Area has the most/least Tweets & Instagram posts?
  5. How many Tweets & Instagram posts are within close range of one another?

To simplify the amount of data and to reduce querying time, we will only look at (a part of) the West Coast of North America – from British Columbia, Canada to Baja California, Mexico. Again, here is the subset of data for Urban Areas and Instagram-Twitter. The links should bring you to a map that is similar to the images below:

Screenshot of NE/urban_areas dataset (left) and yin/instagram-twitter (right)

In the bottom left corner, we can see the attributes of the dataset (aside from the geometry attribute). Click "Download data" to see your downloading format options. Make sure you are on the tab "Visible Area" and not "Full Dataset". For the Urban Areas dataset, select the format JSON+WKT. If you do not see this as an option, refresh the page and try again. An example of downloading the dataset can be found below:

GIF showing how to download the NE/urban_areas dataset

Downloading the Instagram-Twitter dataset is similar; however, select CSV as the format instead. It is important to note that the JSON+WKT format works well with AsterixDB. This is not the case with CSV, so the CSV file will require slightly more effort to load into AsterixDB. In the next section, we will move on to defining our data model.

Defining the Data Model

Now that we have the files downloaded, we need to install AsterixDB. Please look at the documentation here to get started. Once AsterixDB is installed, start your sample cluster. With the cluster running, open http://localhost:19006 in your browser to access the AsterixDB Administration Console. Next, we will define our data model by entering the following dataverse and datatype definitions:

DROP DATAVERSE PracticeData IF EXISTS;
CREATE DATAVERSE PracticeData;
USE PracticeData;

CREATE TYPE UrbanAreasType AS {
  id: UUID,
  g: geometry?,
  scalerank: int?,
  featurecla: string?,
  area_sqkm: double?,
  min_zoom: double?
};

CREATE TYPE TempInstagramTwitterType AS {
  x: double,
  y: double,
  user_id: int,
  timestamp: int32
};

CREATE TYPE InstagramTwitterType AS {
  id: UUID,
  geometry: geometry,
  user_id: int,
  timestamp: int32
};

To start, we name the dataverse PracticeData. A dataverse is essentially the namespace where everything – our datatypes and datasets – will be stored. Thus, it is important that USE PracticeData; is present before running your queries moving forward.
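As a side note, if you would rather not prepend USE PracticeData; to every query, SQL++ also lets you qualify a dataset name with its dataverse, as in this small example (UrbanAreas is a dataset we create in a later section):

SELECT *
FROM PracticeData.UrbanAreas
LIMIT 5;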

Now, please carefully examine the schema for the three datatypes. Note that each attribute in a datatype is first named, and then the attribute’s data type is specified after the colon. Furthermore, the attribute names must correspond to the attributes of the data being loaded (we will go over a specific example in a later section).

UrbanAreasType is relatively straightforward. We define id with type UUID, a geometry attribute labeled g, and, less importantly, scalerank of type int, featurecla of type string, area_sqkm of type double, and min_zoom of type double. I would also like to note that we add a ? after a data type to denote that the attribute is optional. We do this because some records in Urban Areas are missing the geometry, scalerank, featurecla, area_sqkm, or min_zoom attribute.
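Once the data is loaded (we do this in a later section), you can see the optional attributes at work by counting records where a value is absent. This is just a sketch using min_zoom as an example; IS UNKNOWN is standard SQL++ and matches both NULL and MISSING values:

USE PracticeData;
SELECT COUNT(*) AS records_without_min_zoom
FROM UrbanAreas AS u
WHERE u.min_zoom IS UNKNOWN;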

As for the other two datatypes, we create TempInstagramTwitterType as a temporary datatype to hold the data from our CSV file. We will then transfer and transform the data from the resulting dataset into the dataset that uses InstagramTwitterType. We do this for two reasons: (1) CSV lacks support for the primitive type geometry, and (2) we do not want to increase overhead by creating geometries inside our spatial queries (which we will go over in a later section). This is exemplified in TempInstagramTwitterType, where we have attributes labeled x and y with the datatype double, while InstagramTwitterType has an attribute labeled geometry with the datatype geometry. We will eventually convert the former two attributes into a single geometry attribute stored in the dataset that uses InstagramTwitterType. The same reasoning applies to the absence of an id attribute in TempInstagramTwitterType: our temporary dataset does not need one, but we will need an id attribute (which we will autogenerate) for the dataset that uses InstagramTwitterType.
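To make the overhead point concrete, here is a hypothetical sketch of the kind of spatial filter we would be forced to write directly against the temporary dataset – every point would be rebuilt at query time (the coordinates and the distance threshold are arbitrary placeholders, not values from this tutorial):

USE PracticeData;
-- st_make_point runs once per row, per query, because the
-- points were never materialized in the dataset.
SELECT COUNT(*) AS nearby_posts
FROM TempInstagramTwitter AS t
WHERE st_distance(st_make_point(t.x, t.y), st_make_point(-122.4, 37.8)) < 0.5;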

In the next section, we will create our datasets and load our data.

Creating and Loading the Data

Here, we will create our datasets and load the data by running the statements below:

USE PracticeData;

CREATE DATASET UrbanAreas (UrbanAreasType) PRIMARY KEY id AUTOGENERATED;
LOAD DATASET UrbanAreas USING localfs
  (("path"="127.0.0.1:///Users/andretran/Downloads/NE_urban_areas.json"),
   ("format"="adm"));

CREATE EXTERNAL DATASET TempInstagramTwitter (TempInstagramTwitterType) USING localfs
  (("path"="127.0.0.1:///Users/andretran/Downloads/yin_instagram-twitter.csv"),
   ("format"="delimited-text"),
   ("header"="true"));

CREATE DATASET InstagramTwitter (InstagramTwitterType) PRIMARY KEY id AUTOGENERATED;

First, we again make sure we are using the dataverse PracticeData.

Now, for the first dataset, we create UrbanAreas using the datatype UrbanAreasType, set the primary key to id, and specify that it is AUTOGENERATED (this just means AsterixDB will create the unique IDs). If our data already had an attribute with unique IDs, we could omit AUTOGENERATED. We then load the dataset from the local file system using localfs, which requires two parameters. First, we specify the path of the file. Then, we specify the format as adm, since the file is in JSON. With these two parameters set, the data should load properly.
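One caveat worth knowing (this is my reading of AsterixDB’s LOAD semantics, so treat it as a hedge): LOAD is a bulk load into an empty dataset, so if a load fails partway – say, because of a mistyped path – the simplest reset is to drop and recreate the dataset before retrying:

USE PracticeData;
DROP DATASET UrbanAreas IF EXISTS;
CREATE DATASET UrbanAreas (UrbanAreasType) PRIMARY KEY id AUTOGENERATED;
-- ...then re-run the LOAD DATASET statement with the corrected path.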

For our second dataset, TempInstagramTwitter, we will not be creating an internal dataset and loading the file into it. Instead, we create an external dataset, because we will transform that data into our third dataset – InstagramTwitter. This reduces the overhead of loading the file, since this is only a temporary dataset. Defining it requires three parameters. Again, we specify the path. Then, we specify the format as delimited-text, since the file is in CSV. Lastly, we set header to true so that the first line of the file (the column headers) is not read as data. The data should now be accessible.

To be certain our data has been loaded, we run the following queries:

USE PracticeData;
SELECT *
FROM UrbanAreas;

and

USE PracticeData;
SELECT * 
FROM TempInstagramTwitter;

The queries should return the following:

Screenshot of queried data from UrbanAreas dataset (left) and TempInstagramTwitter dataset (right)

With this info, we can answer our first question, which was "How many Urban Areas and Tweets & Instagram posts are in our dataset?" As we can see in the screenshots, we have 163 rows in our UrbanAreas dataset and 126,263 rows in our TempInstagramTwitter dataset.
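If you would rather not scroll through the console output to count rows, a standard SQL++ aggregate returns the count directly (run the same query against TempInstagramTwitter for the second number):

USE PracticeData;
SELECT COUNT(*) AS num_rows
FROM UrbanAreas;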

Now, with both files accessible through the first two datasets, we create our third dataset, InstagramTwitter. Again, we set our primary key to id and have our IDs autogenerated by AsterixDB. We purposely do not load any data into this dataset, as we will transform our TempInstagramTwitter data and insert it into this dataset in the next section.

Transforming and Inserting the Data

Here, we will transform our data from TempInstagramTwitter to InstagramTwitter. As previously mentioned, CSV lacks support for geometry as a primitive type. So, we will utilize the built-in spatial function st_make_point to take the x and y attributes from TempInstagramTwitter and transform them into the primitive type geometry. Additionally, we will need to transfer the other attributes (user_id and timestamp) to the new dataset. We do this by querying the data from TempInstagramTwitter and inserting the result into the InstagramTwitter dataset. Thus, our query should look like the following:

USE PracticeData;
INSERT INTO InstagramTwitter
(
 SELECT st_make_point(x, y) AS geometry, user_id, timestamp
 FROM TempInstagramTwitter 
);

A few clarifications about the code snippet above: we set the name of our transformed geometry attribute with AS geometry to ensure that the attribute has the correct name. If we did not set a name, AsterixDB would insert our transformed geometry data into the InstagramTwitter dataset as an attribute with no proper name. This would make it nearly impossible to query and would leave the geometry attribute without any data. Afterwards, run the query below:

USE PracticeData;
SELECT VALUE i 
FROM InstagramTwitter AS i;

The query should return the following:

Screenshot of queried data from InstagramTwitter dataset
Screenshot of queried data from InstagramTwitter dataset

From the screenshot above, we can verify that our data has loaded by comparing against our earlier screenshot of the TempInstagramTwitter dataset. Again, we have 126,263 rows in our transformed dataset, which means the data loaded properly. We can also see that there is no longer an x or y attribute. Instead, we have a single geometry attribute of type geometry that holds the transformed data from the x and y attributes.
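As a quicker check than comparing screenshots, you can count the rows of the transformed dataset directly. And since TempInstagramTwitter was only a staging step, you can optionally drop it afterwards (the cleanup is my suggestion, not a required part of the workflow):

USE PracticeData;
SELECT COUNT(*) AS transformed_rows
FROM InstagramTwitter;
-- Optional cleanup once the count matches:
-- DROP DATASET TempInstagramTwitter;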

Conclusion

With our third dataset loaded, we are finally done loading our data into AsterixDB. In the next part, we will perform spatial queries on our datasets that contain geometric attributes.

