Doing Data Science at the Command Line in Google Cloud Platform

Jose Ramón Cajide
Towards Data Science
12 min read · Oct 3, 2018


Data engineering is about gathering data, storing it in a suitable way, doing some processing and serving it, perhaps to a data scientist.

Every data scientist must face many data engineering and data preparation tasks before enjoying the long-awaited modeling stage. Also, when starting a new project you must think about the trade-offs involved in choosing the right language and platform.

Leaving the platform aside for the moment, R, Python, Julia, Matlab and Octave are some of the languages used in data science, especially R and Python, which have grown a lot in recent years and have great community support.

However, here I want to explain how you can tackle many of those initial tasks using your operating system's command line, and why the shell should be part of your stack.

We will discover how command line programs like curl, sed, jq or csvkit can simplify many of your repetitive tasks: you write less code, it's portable, and it's often faster thanks to their low-level interfaces.

In this post we will use these tools together, piping their commands, without the need for any IDE or notebook, to download, convert, transform and load the data into a big data warehouse.

The data, published by EMT de Madrid (Empresa Municipal de Transportes de Madrid), is available at http://opendata.emtmadrid.es. It is a collection of files with records from April 2017 to August 2018 covering all the trips made by users of BiciMad (the public bike rental service in Madrid).

© Oriol Salvador (https://www.flickr.com/photos/boarderland/)

At the time of writing, the collection consists of 17 files totalling 26.35 GB, with 28,550,144 records, each of them logging a bike trip between two bike stations whose information is also provided in a separate file, as we will show later on.

In order to download all those files, I prefer to do some web scraping, so I can automate the downloads and fetch new data programmatically.

This can be easily accomplished from the command line. First we must check that all the links follow a common pattern that identifies the files. Using the browser's developer console we can inspect the DOM and the HTML source code to find the links.

The previous image shows that all the files follow a common structure, so it's time to do some coding!

Optional

I love serverless architectures and cloud computing because they democratize data science, providing unlimited, scalable, fault-tolerant and cheap resources to every data scientist without the need for specific hardware or system maintenance. Google Cloud Platform is one of my favourites and I'm going to use it to launch a Compute Engine instance and execute all the commands there.

This is basically the same as doing it on my own laptop, but I will benefit from the huge bandwidth of a Google Cloud data center. Google also provides all the machines with the Google Cloud SDK ready to use, without the need to manage the authentication required for the last steps of this exercise. This is something that fits well with my idea of doing Agile Data Science.

We have several ways to interact with Google Cloud Platform: an awesome web interface, powerful APIs, libraries for the main programming languages… and the practical SDK. Because I will only be using the shell, the SDK is my first option. By using it I can also be confident that all the GCP commands can be executed on any operating system, and it will be easy to show you what I'm doing.

If you want to do it by yourself, you can find the instructions here to download and set it up on your operating system, but if you are using Linux or Mac (and I hope you do), it's as easy as:

curl https://sdk.cloud.google.com | bash
exec -l $SHELL
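
You can verify that the SDK was installed correctly with:

gcloud version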

Once the installation has finished, you can begin its setup with:

gcloud init

For the previous step you will need a Google Cloud Platform account. It’s free.

Now I'm ready to launch a virtual machine. I don't need a very powerful one, but having enough storage to download all the files is mandatory, so I'll set up a 200 GB disk.

gcloud beta compute --project=datascience-open-data instances create bicimad-vm --zone=europe-west1-b --machine-type=n1-standard-2 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=874126907357-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --image=ubuntu-minimal-1804-bionic-v20180917 --image-project=ubuntu-os-cloud --boot-disk-size=200GB --boot-disk-type=pd-standard --boot-disk-device-name=bicimad-vm

Once the virtual machine has been deployed, we can log into it from Google Cloud Shell or from our local computer:

gcloud compute --project "datascience-open-data" ssh --zone "europe-west1-b" "bicimad-vm"

Downloading data

Now we are ready to move on and download the document source code into a local file:

curl 'http://opendata.emtmadrid.es/Datos-estaticos/Datos-generales-(1)' > bicimad.html

We can inspect the source code:

cat bicimad.html

Here we check, once again, that the file links all share the same structure:

<li style="margin-bottom: 6px;margin-left: 12px;list-style-type: none;display: list-item;"><img src="/Imagenes/Extensiones-de-archivos/zip_logo.aspx"/><a target="_blank" href="/getattachment/8bb73c41-eab0-4e6a-ac92-80c8c68aacc2/201704_Usage_Bicimad.aspx" title="Datos de uso de Abril de 2017. Nueva ventana" > Datos de uso de Abril de 2017</a></li><li style="margin-bottom: 6px;margin-left: 12px;list-style-type: none;display: list-item;"><img src="/Imagenes/Extensiones-de-archivos/zip_logo.aspx"/><a target="_blank" href="/getattachment/11054216-35d1-4003-b76b-8421c4a46eb4/201705_Usage_Bicimad.aspx" title="Datos de uso de Mayo de 2017. Nueva ventana" > Datos de uso de Mayo de 2017</a></li>

Now we have to extract those file links. This is easy to do by combining sed with a little regular expression knowledge:

cat bicimad.html | sed -n 's/.*href="\/getattachment\([^"]*Bicimad\.aspx\).*/\1/p'

This is what we get:
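
Based on the two links shown in the HTML above, and assuming each link sits on its own line in the source, the extracted paths should look something like this:

/8bb73c41-eab0-4e6a-ac92-80c8c68aacc2/201704_Usage_Bicimad.aspx
/11054216-35d1-4003-b76b-8421c4a46eb4/201705_Usage_Bicimad.aspx
...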

Now we only need to complete the file path in the remote server with the domain name (http://opendata.emtmadrid.es) to get the full URL:

cat bicimad.html | sed -n 's/.*href="\/getattachment\([^"]*Bicimad\.aspx\).*/\1/p' | sed -e 's/^/http:\/\/opendata.emtmadrid.es\/getattachment/'

We save the output of this final sed processing to a temporary file, tmp_file:

cat bicimad.html | sed -n 's/.*href="\/getattachment\([^"]*Bicimad\.aspx\).*/\1/p' | sed -e 's/^/http:\/\/opendata.emtmadrid.es\/getattachment/' > tmp_file

This temporary file has all the URLs pointing to the compressed data files:

cat tmp_file

And the result:
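
Prefixing the domain gives the full download URLs, for example:

http://opendata.emtmadrid.es/getattachment/8bb73c41-eab0-4e6a-ac92-80c8c68aacc2/201704_Usage_Bicimad.aspx
http://opendata.emtmadrid.es/getattachment/11054216-35d1-4003-b76b-8421c4a46eb4/201705_Usage_Bicimad.aspx
...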

Now we only have to walk through each of the previous URLs and download them to our system. One way to do it is by combining curl + parallel. This way we can download several files at the same time. First we install parallel:

sudo apt-get update
sudo apt-get install parallel

Once installed, we read tmp_file, sending all the links to curl:

cat tmp_file | parallel -j 4 curl -O

This process will take a few minutes. Once finished we can check the downloaded files by executing ls -l in the current directory:
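
Given the link pattern above, the listing should include the 17 downloaded archives, named after the month they cover, something like:

201704_Usage_Bicimad.aspx
201705_Usage_Bicimad.aspx
...
201808_Usage_Bicimad.aspx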

The next step is to uncompress the files. We need to install unzip first:

sudo apt-get install unzip

… and then we unzip all the files ending in Bicimad.aspx:

TMP_DIR="."
ZIP_FILES=*Bicimad.aspx
for FILE in `ls ${TMP_DIR}/${ZIP_FILES}`; do
  echo $FILE
  unzip -o $FILE
done

All the zipped files are uncompressed into the same folder, so ls -l *.json shows us the JSON files with the data:

Next we are going to inspect the JSON files' contents to understand the structure of the data provided to us. When working with JSON files, jq is our best friend, so we are going to install it as usual. Then we print the first line of any of the files:

sudo apt-get install jq
cat 201808_Usage_Bicimad.json | head -1 | jq '.'

We can see that this JSON follows a typical schema for data exported from MongoDB:

{
  "_id": {
    "$oid": "5b6779012f384302541d6813"
  },
  "user_day_code": "6c30d6e283ea7a160379fa9adb20b93d2c06e16853ad0804e26485e98066f6ba",
  "idplug_base": 11,
  "track": {
    "type": "FeatureCollection",
    "features": [
      {
        "geometry": {
          "type": "Point",
          "coordinates": [
            -3.7078158,
            40.4127144997222
          ]
        },
        "type": "Feature",
        "properties": {
          "var": "28005,ES,Madrid,Madrid,CALLE SAN BRUNO 1,Madrid",
          "speed": 6.19,
          "secondsfromstart": 190
        }
      },
      {
        "geometry": {
          "type": "Point",
          "coordinates": [
            -3.7071841,
            40.4156114997222
          ]
        },
        "type": "Feature",
        "properties": {
          "var": "28012,ES,Madrid,Madrid,PLAZA MAYOR 27,Madrid",
          "speed": 3.47,
          "secondsfromstart": 130
        }
      },
      {
        "geometry": {
          "type": "Point",
          "coordinates": [
            -3.7048058,
            40.4167895
          ]
        },
        "type": "Feature",
        "properties": {
          "var": "28013,ES,Madrid,Madrid,CALLE ARENAL 1,Madrid",
          "speed": 3.61,
          "secondsfromstart": 71
        }
      }
    ]
  },
  "user_type": 1,
  "idunplug_base": 7,
  "travel_time": 228,
  "idunplug_station": 1,
  "ageRange": 0,
  "idplug_station": 43,
  "unplug_hourTime": {
    "$date": "2018-08-01T01:00:00.000+0200"
  },
  "zip_code": ""
}

Each record provides us with the following variables:

  • _id: record identifier
  • user_day_code: user identifier for that day
  • idunplug_station: origin bike station
  • idunplug_base: base in the origin bike station
  • idplug_station: drop-off bike station
  • idplug_base: base in the drop-off bike station
  • unplug_hourTime: rental start time
  • travel_time: the time, in seconds, between unplugging and plugging the bike
  • track: trip path
  • user_type: 1: users with an annual subscription, 2: occasional users, 3: corporate users
  • ageRange: 1: [0–16] years, 2: [17–18], 3: [19–26], 4: [27–40], 5: [41–65], 6: [>66]
  • zip_code: user postal code

Complete description (in Spanish)

Of all the previous variables, there is one with a special structure: track. This key is in GeoJSON format, a convenient way to record geographical data about the bike's position during the trip. Take a look at this key and you will notice that the system tracks the bike's position every 60 seconds. It's a shame that this information is not available for some of the provided periods.
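
For example, you can use jq to list the secondsfromstart values of the first record's track points and see the (roughly) 60-second sampling, using the field names shown in the sample above:

cat 201808_Usage_Bicimad.json | head -1 | jq '.track.features[].properties.secondsfromstart'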

Loading data into BigQuery

Google BigQuery is perhaps the most powerful serverless data warehouse and a good place to process big amounts of data in seconds. It's possible to import JSON files into a BigQuery table, but it cannot handle GeoJSON yet, although it does support Geographic Information Systems (GIS).

Meanwhile, I've decided to convert the *.json files into *.csv with the help of jq, but before going into this, it's recommended to change the file encodings. The provided files use ISO-8859-1 encoding, and this will cause problems with some special Spanish characters in some strings.

We can check the file encoding with:

file --mime-encoding 201808_Usage_Bicimad.json
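
In this case it should report something like:

201808_Usage_Bicimad.json: iso-8859-1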

The best solution is to convert the file to UTF-8 using iconv :

iconv -f iso-8859-1 -t UTF-8 201808_Usage_Bicimad.json > tmp_file.json  && mv -f tmp_file.json 201808_Usage_Bicimad.json

Now it's time to convert from JSON to CSV. As pointed out before, not all the monthly JSON files have the track key with the geolocation values, and I've asked EMT to fix this. Meanwhile, we will only process the objects that contain that information.

First we check whether track exists, and then we iterate over all the keys to get their values and flatten them in a denormalized way. Once that's done, we can use the @csv filter and write the result to a file.

cat 201808_Usage_Bicimad.json | jq -r '. | select(.track != null) | .track.features[]  as $track | [._id[], .user_day_code, .idplug_base, .user_type, .idunplug_base, .travel_time, .idunplug_station, .ageRange, .idplug_station, .unplug_hourTime[], .zip_code]  + [$track.geometry.coordinates[0], $track.geometry.coordinates[1], $track.properties.var, $track.properties.speed, $track.properties.secondsfromstart]|@csv' > 201808_Usage_Bicimad.csv
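
Note that each track point becomes its own CSV row, so the CSV file will have more lines than the JSON file has records. A quick sanity check of both line counts:

wc -l 201808_Usage_Bicimad.json 201808_Usage_Bicimad.csv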

A highly recommended tool to work with CSV files is csvkit. Although not strictly necessary, we are going to use it to inspect the CSV files in a better way. These are the steps needed to have it running:

sudo apt-get install python-dev python-pip python-setuptools build-essential
pip install --upgrade setuptools
pip install --user csvkit
export PATH=$HOME/.local/bin:$PATH

Let’s check the result for some random columns:

cat 201808_Usage_Bicimad.csv | head -10 | csvcut -c 1,3,10,12,13,14 | csvlook --max-columns 6 --no-header-row

It's time to load the CSV files into BigQuery! Because loading such big files directly from the local machine is not a good idea, let's copy them into a Google Cloud Storage bucket first. The following commands create a bucket and upload a file into it:

gsutil mb gs://open-datasets
gsutil -m cp 201808_Usage_Bicimad.csv gs://open-datasets/bicimad

Then we are going to create a new dataset in BigQuery:

bq --location=EU mk bicimad

Then we must create a table in the dataset. In this example I’m including the table schema:

bq mk --table bicimad.usage_201808 oid:STRING,user_day_code:STRING,idplug_base:STRING,user_type:INTEGER,idunplug_base:INTEGER,travel_time:INTEGER,idunplug_station:INTEGER,ageRange:INTEGER,idplug_station:INTEGER,unplug_hourTime:STRING,zip_code:STRING,longitude:FLOAT,latitude:FLOAT,var:STRING,speed:FLOAT,secondsfromstart:INTEGER

Finally we create a load job to import the CSV file from the Google Cloud Storage bucket into the new table:

bq load --source_format=CSV --replace --quote='"'  bicimad.usage_201808 gs://open-datasets/bicimad/201808_Usage_Bicimad.csv

When loading big amounts of data into BigQuery, it's recommended to use partitioned tables. Partitioned tables make it easier to query the data, improve query performance and reduce your bill. In this case we are loading each month of data into a different table, an old-fashioned way of partitioning data, but feel free to improve it.
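
As a rough sketch of that improvement, you could create a single date-partitioned table instead; this assumes you add a TIMESTAMP column (hypothetically named unplug_ts here, parsed from unplug_hourTime during the transformation) to partition on:

bq mk --table --time_partitioning_type=DAY --time_partitioning_field=unplug_ts bicimad.usage oid:STRING,user_day_code:STRING,idplug_base:STRING,user_type:INTEGER,idunplug_base:INTEGER,travel_time:INTEGER,idunplug_station:INTEGER,ageRange:INTEGER,idplug_station:INTEGER,unplug_ts:TIMESTAMP,zip_code:STRING,longitude:FLOAT,latitude:FLOAT,var:STRING,speed:FLOAT,secondsfromstart:INTEGER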

Once the loading job has finished we can query the data and check the result:

bq query --use_legacy_sql=false  'SELECT travel_time, unplug_hourTime, var FROM `datascience-open-data.bicimad.usage_201808` LIMIT 10'

Or we can also use the BigQuery UI in the Google Cloud Console, the recommended way for developing your SQL queries:

All together now!

That's the process for loading a single CSV file into our database. To load all of them, we can write a small bash script that loops through all the JSON files in the directory, transforms each one into CSV, uploads it to the cloud and loads it into its respective table.

This is exactly what transform_and_load.sh does:

#! /bin/bash

TMP_DIR="."
REPORT_FILENAME_PATTERN=*Bicimad.json

for FILE in `ls ${TMP_DIR}/${REPORT_FILENAME_PATTERN}`; do
  BASENAME=$(basename $FILE .json)
  YEAR_MONTH=`echo ${BASENAME:0:6}`
  iconv -f iso-8859-1 -t UTF-8 "$FILE" > "$FILE.new" && mv -f "$FILE.new" "$FILE"
  cat $FILE | jq -r '. | select(.track != null) | .track.features[] as $track | [._id[], .user_day_code, .idplug_base, .user_type, .idunplug_base, .travel_time, .idunplug_station, .ageRange, .idplug_station, .unplug_hourTime[], .zip_code] + [$track.geometry.coordinates[0], $track.geometry.coordinates[1], $track.properties.var, $track.properties.speed, $track.properties.secondsfromstart]|@csv' > $BASENAME.csv
  gsutil -m cp $BASENAME.csv gs://open-datasets/bicimad
  bq mk --table bicimad.usage_$YEAR_MONTH oid:STRING,user_day_code:STRING,idplug_base:STRING,user_type:INTEGER,idunplug_base:INTEGER,travel_time:INTEGER,idunplug_station:INTEGER,ageRange:INTEGER,idplug_station:INTEGER,unplug_hourTime:STRING,zip_code:STRING,longitude:FLOAT,latitude:FLOAT,var:STRING,speed:FLOAT,secondsfromstart:INTEGER
  bq load --source_format=CSV --replace --quote='"' bicimad.usage_$YEAR_MONTH gs://open-datasets/bicimad/$BASENAME.csv
done

We add the execute permission for all users to the script and run it:

chmod 755 transform_and_load.sh
./transform_and_load.sh

Once finished, we can query all the 2018 data to get the median bike rental duration by age range (I don't like averages with non-normally distributed data).

SELECT
  ageRange,
  ROUND(APPROX_QUANTILES(travel_time/60, 1000)[OFFSET(500)], 2) AS median_travel_time
FROM
  `datascience-open-data.bicimad.usage_2018*`
GROUP BY
  ageRange
ORDER BY
  2 DESC

It seems that the youngest users (ageRange 1) are also the heaviest users in terms of bike rental duration. We could now check whether they are also the most frequent users, and so on.
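
Since each trip is spread across several CSV rows (one per track point), a quick sketch of that follow-up check could count distinct trip ids (the oid column) per age range:

bq query --use_legacy_sql=false 'SELECT ageRange, COUNT(DISTINCT oid) AS total_trips FROM `datascience-open-data.bicimad.usage_2018*` GROUP BY ageRange ORDER BY total_trips DESC'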

Bike stations data

As I mentioned before, we are going to repeat the same steps for the dataset with information about each bike station. This one is distributed as a RAR archive, so you may need to install unrar first (for example with sudo apt-get install unrar):

curl -o Bicimad_Estacions_201808.rar http://opendata.emtmadrid.es/getattachment/8321277e-4f8b-4a45-89fd-63cbeefa1cf1/Bicimad_Estacions_201808.aspx && unrar x Bicimad_Estacions_201808.rar && rm Bicimad_Estacions_201808.rar
iconv -f iso-8859-1 -t UTF-8 Bicimad_Estacions_201808.json > tmp_file.json && mv -f tmp_file.json Bicimad_Estacions_201808.json

This is the information available in the dataset:

{
  "activate": 1,
  "name": "Colón A",
  "reservations_count": 0,
  "light": 0,
  "total_bases": 18,
  "free_bases": 14,
  "number": "106a",
  "longitude": "-3.6877227",
  "no_available": 0,
  "address": "Calle Serrano nº 34",
  "latitude": "40.4251002",
  "dock_bikes": 3,
  "id": 111
}

Let's convert it into CSV format:

cat Bicimad_Estacions_201808.json | jq -r '.stations[] | [.[]] | @csv' > bike_stations.csv
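
The [.[]] construct relies on the keys of each station object coming in the same order as the table schema below; a more defensive, equivalent version (field names taken from the sample record above) would list them explicitly:

cat Bicimad_Estacions_201808.json | jq -r '.stations[] | [.activate, .name, .reservations_count, .light, .total_bases, .free_bases, .number, .longitude, .no_available, .address, .latitude, .dock_bikes, .id] | @csv' > bike_stations.csv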

Because this dataset is small (12.9 MB), we can load it into BigQuery directly from our local disk.

First we create a new table:

bq mk --table bicimad.bike_stations activate:INTEGER,name:STRING,reservations_count:INTEGER,light:INTEGER,total_bases:INTEGER,free_bases:INTEGER,number:STRING,longitude:FLOAT,no_available:INTEGER,address:STRING,latitude:FLOAT,dock_bikes:INTEGER,id:INTEGER

and then we create a load job:

bq load --source_format=CSV --replace --quote='"'  bicimad.bike_stations bike_stations.csv

Looking at the data, we can see that there are several records for each bike station, registering changes in free_bases and other fields over a time frame that is not provided with the dataset.

Even so, we can check the number of available bike stations:

SELECT
  COUNT(DISTINCT name) AS total
FROM
  `datascience-open-data.bicimad.bike_stations`

Once we have loaded all the data, we can delete our compute instance:

gcloud compute instances delete bicimad-vm --zone=europe-west1-b --quiet
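
Optionally, if you no longer need the intermediate CSV files in Cloud Storage (the data is already in BigQuery), you can delete them as well:

gsutil -m rm -r gs://open-datasets/bicimad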

That's all for now. In the next post we'll perform some exploratory data analysis using SQL. No R or Python yet, I promise!

