Building a dataset for the São Paulo Subway operation

Douglas Navarro
Towards Data Science
7 min read · Aug 23, 2018


Have you ever taken the subway when it is crowded? I know this happens in most large cities, but here in São Paulo it may look like this:

This is what a crowded subway station may look like here in SP

Now, as a Brazilian and a Paulistano (that's Portuguese for 'a person who lives in São Paulo'), what can I do about it?

The transparency webpage of the Metrô subway company provides data on passenger demand, but with low granularity: it only groups average entries on weekdays and on weekends for each subway line. That's not bad, but we can do better.


I know that the webpage of the ViaQuatro company, which operates one of the subway lines under a concession from the government, provides the operation status of all the lines in (almost) real time:

The ViaQuatro company webpage provides operation status for all of the subway lines, as highlighted in red

People who take the SP subway on a daily basis intuitively know that a 'normal' operation status is not enough to conclude that the stations won't be crowded, but a 'reduced speed' or 'paralyzed' status is potentially related to the chaos shown in the first picture.

Unfortunately, this data is not stored. At least not anywhere that we, ordinary citizens, can access it, monitor it, evaluate it, present it, and so on. Why don't we solve that using Python?

The steps shown in the rest of this article are a didactic, although completely functional, presentation of the project. In case you are a developer or want to go straight to the final result, feel free to explore the GitHub repository.

So…

Our goal is to continuously store, in a non-invasive fashion, the history of operation statuses of the São Paulo subway lines and make it easily available to any citizen.

To reach this goal we will write an autonomous web scraper capable of requesting the ViaQuatro webpage, extracting the operation statuses along with a timestamp, and storing them.

In order to do that we will need the requests package to download the web page and BeautifulSoup for processing it with ease:

>>> import requests
>>> from bs4 import BeautifulSoup

Ok, to download the page all we need to do is

>>> vq_home_request = requests.get('http://www.viaquatro.com.br')

If the request was successful, we have the HTML code in

>>> vq_home_content = vq_home_request.text
>>> vq_home_content
'\r\n<!DOCTYPE HTML>\r\n<html lang="pt-BR">\r\n<head>\r\n\t<meta charset="UTF-8">\r\n <title>ViaQuatro | Seja Bem-vindo</title>\r\n\t<meta name="description" content="ViaQuatro - Informa&#231;&#245;es sobre rotas, tarifas, esta&#231;&#245;es e muito mais." /> \r\n\t<meta name="keywords" content="" />\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge" >\r\n\t<script src="/Content/js/3rd/modernizr.custom.97134.js"></script>\r\n <!-- Use Zepto for best performance on WebKit based browser -->\r\n\t\t<!-- <script src="js/zepto/zepto.js" type="text/javascript" charset="utf-8"></script> -->\r\n\t<link rel="stylesheet" type="text/css" href="/Content/css/default.css" media="all" /> <!-- pra ser carregado em todas as páginas -->\r\n
(...)
We do have the HTML inside our environment!

Now let's use BeautifulSoup to browse the DOM and extract the page title, in order to check that the HTML elements are organized inside the BeautifulSoup object:

>>> soup = BeautifulSoup(vq_home_content, 'html.parser')
>>> soup.title
<title>ViaQuatro | Seja Bem-vindo</title>

Now to the hard part — extracting what really matters. Inspecting the HTML using dev tools, we see that the 'OPERATION' panel that shows the current status of the subway lines is a <section class="operacao"> element. Inside it, we have all the data we need. So let's pull it out so we can run the following queries on it:

>>> operation_column = soup.find(class_="operacao")

We need to inspect the page code in order to find out what element contains the data we want

Let's also initialize a dictionary to store the lines' statuses using a dict comprehension:

>>> lines_metro = ['azul', 'verde', 'vermelha', 'amarela', 'lilás', 'prata']
>>> lines_cptm = ['rubi', 'diamante', 'esmeralda', 'turquesa', 'coral', 'safira', 'jade']
>>> all_lines = lines_metro + lines_cptm
>>> extracted_status = {line: '' for line in all_lines}
>>> extracted_status
{'amarela': '', 'vermelha': '', 'diamante': '', 'turquesa': '', 'coral': '', 'prata': '', 'lilás': '', 'esmeralda': '', 'safira': '', 'rubi': '', 'verde': '', 'azul': '', 'jade': ''}

Moving on with our investigation, we see that the status of the Amarela (yellow) subway line is wrapped in a span tag with a CSS class called status:

>>> status_amarela = operation_column.find(class_="status").text
>>> extracted_status['amarela'] = status_amarela

For all other lines, the data is grouped in three <div class="linhas"> tags, one for each company that participates in the urban rail system. Inside each of those divs we have an unordered list, where each item contains the line number, its name, and its current operation status inside another div that looks like <div class="info">.

In order to extract all of this, we will iterate over the <div class="linhas"> divs and then over each <div class="info"> div to get what we really want: the operation status and line name.

>>> lines_containers = operation_column.find_all(class_="linhas")
>>> for container in lines_containers:
...     line_info_divs = container.find_all(class_="info")
...     for div in line_info_divs:
...         spans = div.find_all("span")
...         line_title = spans[0].text.lower()
...         line_status = spans[1].text.lower()
...         extracted_status[line_title] = line_status
>>> extracted_status
{'safira': 'normal', 'prata': 'normal', 'rubi': 'normal', 'azul': 'normal', 'turquesa': 'normal', 'amarela': 'Operaç\xe3o Normal', 'lilás': 'normal', 'esmeralda': 'normal', 'coral': 'normal', 'verde': 'normal', 'diamante': 'normal', 'jade': 'normal', 'vermelha': 'normal'}

Now we have the operation status of all the lines inside our dictionary! Another crucial piece of information is the timestamp for this data. It is shown in the 'OPERATION' panel, right by the title, and it is the only <time> element in the page. Let's pull it out of there:

>>> time_data = soup.find('time').text
>>> time_data
'18/08/2018 23:11'

Finally, we have all the data! I confess I needed a moment to appreciate this.

Web scraping is so cool!

At this point we can simply write extracted_status and time_data to a file in our local system or store it basically anywhere using the magic of the internet.
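For instance, the local-file option could look roughly like this (a minimal sketch; statuses.csv is just a hypothetical file name that gets appended to on every run):

>>> import csv
>>> with open('statuses.csv', 'a', newline='', encoding='utf-8') as f:
...     writer = csv.writer(f)
...     for line in all_lines:
...         # one row per line: timestamp, line name, operation status
...         writer.writerow([time_data, line, extracted_status[line]])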

But let's go the extra mile and use Google Sheets, as it is quite popular, accessible to non-developers, and comes with a well-documented API. On top of that, the gspread package wraps all of the requests for us, and oauth2client takes care of the token acquisition procedure using our API credentials.

I know there is a lot of hidden stuff going on but being able to do so much with so little code is one of the beauties of programming in a language like Python.

In a nutshell, what we need to do is:

  • Get our Google credentials to use them later for authorization (this is manual and only needs to be done once)
  • Obtain authorization using the credentials through our code
  • Write data to a public spreadsheet through our code

To get the credentials, we can follow this step-by-step guide provided by the gspread documentation. By the end of this procedure what you get is a client_secret.json file that looks like this:

{
"private_key_id": "2cd … ba4",
"private_key": "-----BEGIN PRIVATE KEY-----\nNrDyLw … jINQh/9\n-----END PRIVATE KEY-----\n",
"client_email": "473000000000-yoursisdifferent@developer.gserviceaccount.com",
"client_id": "473 … hd.apps.googleusercontent.com",
"type": "service_account"
}

This is important: we need to share the spreadsheet we will be using with our API user. Otherwise, we get an access denied error when trying to write the data. This user is represented by the client_email field in the json file.

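If you are not sure which e-mail address to share the spreadsheet with, a quick way to check is to read it straight from the credentials file (assuming it was saved as client_secret.json, as above):

>>> import json
>>> with open('client_secret.json') as f:
...     secret = json.load(f)
>>> secret['client_email']  # share the spreadsheet with this address
'473000000000-yoursisdifferent@developer.gserviceaccount.com'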

Back to the code: let's import gspread and oauth2client and load the credentials from the .json file into our environment:

>>> import gspread
>>> from oauth2client.service_account import ServiceAccountCredentials
>>> scope = ['https://spreadsheets.google.com/feeds',
...          'https://www.googleapis.com/auth/drive']
>>> creds = ServiceAccountCredentials.from_json_keyfile_name(
...     'client_secret.json',
...     scope)

Now let's get authorized using our credentials and initialize the spreadsheet. To initialize the spreadsheet we need its ID, which is shown in the URL we use to access it in the browser:

https://docs.google.com/spreadsheets/d/1WiKE6SUCqAaUF9qOJOg_tMeOSA_lm_kjm83qwXF9dSg

>>> client = gspread.authorize(creds)
>>> SPREADSHEET_ID = "1WiKE6SUCqAaUF9qOJOg_tMeOSA_lm_kjm83qwXF9dSg"
>>> data_sheet = client.open_by_key(SPREADSHEET_ID).worksheet("data")

Note that we do not write using the spreadsheet object directly, but through a worksheet inside it, so it makes sense that we need to specify the worksheet's name.

Let's write our data! The append_row method does exactly what it sounds like it does: we just pass it a list, and each element is written to a different column of that row.

>>> for line in all_lines:
...     data_sheet.append_row([time_data, line, extracted_status[line]])
We can actually see the data being stored in real time!

In order to have an autonomous scraper that does not require keeping your computer turned on all the time, we must organize this code and have it running on a cloud service. Some protection against exceptions is also important, but these topics are beyond the scope of this article. Feel free to take a look at the GitHub repo in case you want more details!
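Just to give a rough idea, a minimal sketch of that kind of protection could wrap everything we did above in a single function and simply try again on the next cycle when something fails (scrape_and_store and the polling interval below are placeholders for illustration, not the actual code from the repo):

import time
import requests

POLL_INTERVAL_SECONDS = 600  # hypothetical: scrape every 10 minutes

def scrape_and_store():
    """Placeholder for the request/parse/append_row steps shown above."""
    ...

while True:
    try:
        scrape_and_store()
    except requests.RequestException as exc:
        # network or HTTP problem: log it and try again on the next cycle
        print(f"Request failed, retrying next cycle: {exc}")
    time.sleep(POLL_INTERVAL_SECONDS)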

I’ll be analyzing this data in a future article! We’ll see what comes out of it!

[Edit] An awesome analysis has been made by Paulo! Read it here!
