Extracting Data for Machine Learning

Three methods to obtain data for your next machine learning project

Rebecca Vickery
Towards Data Science

Photo by Matthieu Oger on Unsplash

The most important first step in any machine learning project is to obtain good-quality data. As a data scientist, you will often need to use a variety of methods to extract data sets. You might use publicly available data, data available via an API, data found in a database, or, in many cases, a combination of these methods.

In the following post, I am going to give a brief introduction to three different methods for extracting data in Python. For the purposes of this post, I will cover how to extract data whilst working in a Jupyter Notebook. I covered how to use some of these methods from the command line in an earlier post.

SQL

If you need to obtain data from a relational database, the chances are that you will need to use SQL. You can connect a Jupyter Notebook to most common database types using a library called SQLAlchemy. The SQLAlchemy documentation describes which databases are supported and how to connect to each type.

You can use SQLAlchemy directly to view and query tables, or you can write raw queries. To connect to your database you will need a URL that includes your credentials. You can then use the create_engine command to create the connection.

from sqlalchemy import create_engine
engine = create_engine('dialect+driver://username:password@host:port/database')
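
For example, a PostgreSQL connection might look like the following (the credentials and database name here are placeholders):

from sqlalchemy import create_engine

# Hypothetical PostgreSQL connection; substitute your own credentials
engine = create_engine('postgresql://username:password@localhost:5432/mydatabase')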

You can now write database queries and return the result.

# Open a connection and execute a raw SQL query
connection = engine.connect()
result = connection.execute("select * from my_table")
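
If you have pandas installed, you can also read a query straight into a data frame, which is often more convenient for analysis (a minimal sketch, reusing the engine defined above):

import pandas as pd

# Read the query results directly into a DataFrame
df = pd.read_sql("select * from my_table", engine)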

Scraping

Web scraping is used to download data from web pages and extract the required information from them. There are a number of Python libraries that can be used for this, but one of the simplest to use is Beautiful Soup.

You can install this package via pip.

pip install beautifulsoup4

Let’s work through a simple example of how to use this. We are going to use Beautiful Soup and the urllib library to scrape hotel names and prices from the TripAdvisor website.

Let’s import all the libraries we will be working with.

from bs4 import BeautifulSoup
import urllib.request

Next, we want to download the content of the page that we want to scrape. I am going to scrape prices for hotels on the Greek island of Crete, so I am using a URL that contains hotel listings for that destination.

The code below defines the URL as a variable, uses the urllib library to open the page, and uses Beautiful Soup to parse it and return the results in an easy-to-read format.

URL = 'https://www.tripadvisor.co.uk/Hotels-g189413-Crete-Hotels.html'
page = urllib.request.urlopen(URL)
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

Next, let’s obtain a list of hotel names on the page. We are going to use the find_all function, which allows you to extract the parts of the document that you are interested in. You can filter the document with find_all in a number of ways: by passing in a string, a regular expression, or a list of tag names. You can also filter on one of the tag’s attributes, which is the method we will use here.
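
As a quick illustration, the snippet below applies some of these filters to the soup object we created above (the tag names are just examples):

import re

soup.find_all('b')                # filter by tag name
soup.find_all(re.compile('^b'))   # filter by regular expression on tag names
soup.find_all(['a', 'div'])       # filter by a list of tag names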

To understand how best to access a data point, you need to inspect the code for that element on the web page. To find the code for the hotel name, right-click on the name in the listing and select Inspect. The browser’s developer tools will open, with the section of code that contains the hotel name highlighted.

We can see that the hotel name is the only piece of text in the class named listing_title. The code below passes this class name as an attribute filter to the find_all function, along with the div tag.

content_name = soup.find_all('div', attrs={'class': 'listing_title'})
print(content_name)

This returns each section of code containing the hotel name as a list.

To extract the hotel names from this code we can use Beautiful Soup’s getText function.

content_name_list = []

for div in content_name:
    # Take the first line of text in each section, which is the hotel name
    content_name_list.append(div.getText().split('\n')[0])

print(content_name_list)

This returns the hotel names as a list.

We can get the price in a similar way. Inspecting the code for a price we can see it has the following structure.

So we can use very similar code to extract this section.

content_price = soup.find_all('div', attrs={'class': 'price-wrap'})
print(content_price)

There is a slight complication with the price; if we run the following code, we will see it.

content_price_list = []

for div in content_price:
    content_price_list.append(div.getText().split('\n')[0])

print(content_price_list)

Where a hotel listing shows a price cut, both the original price and the sale price are returned, along with some text. To make this useful, we just want the actual price we would pay if we booked the hotel today.

We can use some simple logic to obtain the last price shown in the text.

content_price_list = []

for a in content_price:
    a_split = a.getText().split('\n')[0]
    if len(a_split) > 5:
        # A price cut is shown: keep only the last four characters, e.g. '£123'
        content_price_list.append(a_split[-4:])
    else:
        content_price_list.append(a_split)

print(content_price_list)

This gives a list containing only the final prices.
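
Note that this string-slicing approach is fragile. As an aside, a regular expression can make the parsing more robust; the sketch below keeps the last £-prefixed amount in each section, assuming prices are rendered like '£123' (an assumption about TripAdvisor's formatting):

import re

content_price_list = []
for div in content_price:
    # Find all £-prefixed amounts and keep the last one, i.e. today's price
    prices = re.findall(r'£\d[\d,]*', div.getText())
    if prices:
        content_price_list.append(prices[-1])

print(content_price_list)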

API

An API (application programming interface), in the context of data extraction, is a web-based system that provides an endpoint you can connect to programmatically to retrieve data. Typically the data will be returned in JSON or XML format.

In machine learning, you may need to obtain data using this method. I will give a simple example of how you might obtain weather data from a publicly available API known as Dark Sky. To access this API you will need to sign up; 1,000 free calls are provided per day, which should be plenty to try this out.

To access the data from Dark Sky I will be using the requests library. To start with, I need to obtain the correct URL to request data from. Dark Sky provides both forecasted and historical weather data. For this example, I am going to use the historical data, and I can obtain the correct URL for this from the documentation.

The URL has the following structure.

https://api.darksky.net/forecast/[key]/[latitude],[longitude],[time]
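
As a sketch, you might build this URL from its parts with an f-string (the key, coordinates, and timestamp below are placeholders):

# Placeholders: substitute your own key, coordinates, and timestamp
key = 'your-dark-sky-key'
latitude, longitude = 35.3378, -25.3741
timestamp = '2019-07-01T12:00:00'

request_url = f'https://api.darksky.net/forecast/{key}/{latitude},{longitude},{timestamp}'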

We will use the requests library to obtain the results for a particular latitude and longitude, and date and time. Let’s imagine that, after obtaining daily prices for hotels in Crete, we wanted to find out whether price correlated with the weather in some way. As an example, let’s choose the coordinates for one of the hotels in the listing, the Mitsis Laguna Resort and Spa.

First, we construct the URL with the coordinates and the date and time we require. Using the requests library, we can access the data in JSON format.

import requests

request_url = 'https://api.darksky.net/forecast/fd82a22de40c6dca7d1ae392ad83eeb3/35.3378,-25.3741,2019-07-01T12:00:00'
result = requests.get(request_url).json()
result

We can normalize the results into a data frame to make it easier to read and analyse.

import pandas as pd
from pandas.io.json import json_normalize

df = pd.DataFrame.from_dict(json_normalize(result), orient='columns')
df.head()
Part of the resulting data frame

There is a lot more you can do to automate the extraction of this data using these methods. For the web scraping and API methods, it is possible to write functions that automate the process, making it easy to extract data for a larger number of days and/or locations, as sketched below. In this post, I wanted to give a simple overview with enough code to explore these methods. In future posts, I will write more in-depth articles covering how to build complete data sets and analyse them using these methods.
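
As a taster, a small helper function along these lines could collect weather data for a range of dates (the function name, key variable, and date range here are illustrative):

import requests

API_KEY = 'your-dark-sky-key'  # placeholder: use your own key

def get_weather(key, lat, lon, timestamp):
    # Fetch one observation from the Dark Sky API and return the parsed JSON
    url = f'https://api.darksky.net/forecast/{key}/{lat},{lon},{timestamp}'
    return requests.get(url).json()

# Collect a week of midday observations for the same hotel coordinates
results = [get_weather(API_KEY, 35.3378, -25.3741, f'2019-07-{day:02d}T12:00:00')
           for day in range(1, 8)]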

Thanks for reading!
