
3 Tips for First-Time Machine Learning Projects


Practical tools and techniques that helped build a winning capstone project.


It has now been a few months since my capstone team wrapped up work on our forest fire management machine learning project. It was a great experience that ultimately resulted in us winning the 2020 Ontario-Wide Software Engineering Capstone Competition! As school gets underway and students begin new projects, I thought it’d be useful to share some tips that helped me while working on this project.

Overall, my big takeaway from working on a capstone involving machine learning was that 90% of the work is whipping the data into submission (or becoming friends with the data, depending on your perspective).

Understanding the importance of clean data, exploratory data analysis, and data wrangling was absolutely vital to the success of our project. I think this point is often underemphasized in coursework and can lead people to have a skewed perception of the work involved in such a project. So here are a few tools and techniques (with examples) that helped my team keep a healthy relationship with the data and made the project much more enjoyable.

1. Collaboration with Google Colab

Google Colab is a tool for coding and running Jupyter Notebooks (often used for machine learning projects) from Google Drive. There are two big advantages of using Google Colab over local Python scripts or even local Jupyter notebooks.

First, Colab gives you access to a free cloud GPU, which can really speed up your workflow. Second, Colab allows data to be stored on Google Drive (every account gets up to 15 GB free), which makes collaboration on data wrangling much easier. To take advantage of these capabilities, our team used a shared folder on Google Drive containing our Colab notebooks and a shared data folder of gzipped csv files, which were output by some notebooks and used as inputs by others. Since the Google Drive folder was shared, everyone could easily access the latest data produced by other members of the team.

Here’s an example of importing a gzipped csv from a shared folder using a Google Colab notebook:
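A minimal sketch of how this looks; the folder and file names below are placeholders for whatever your shared Drive layout is.

from google.colab import drive
import pandas as pd

# Mount the shared Google Drive folder into the Colab runtime
drive.mount('/content/drive')

# pandas infers gzip compression from the .gz extension and
# decompresses while reading, all in one call
# (placeholder path for the shared data folder)
df = pd.read_csv('/content/drive/My Drive/capstone-data/weather_hourly.csv.gz')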

If you're working with large csv files, I suggest storing them with gzip compression: the pandas package can decompress and read the file in a single line of code, and it does so quickly.

In our project, we had 15 active notebook files that had different inter-dependencies, so having one unambiguous place for data storage was very helpful.

2. Speeding up Data Acquisition with aiohttp

Temperature, wind, and humidity all have a significant effect on whether a forest fire will occur and whether it will spread. Since these values change significantly throughout the day, we had to obtain hourly data from weather stations across Canada over several years.

Luckily for us, the Government of Canada (Environment and Climate Change Canada) had historical hourly weather data for hundreds of weather stations across Canada, in some cases going back as far as 1866! They even provided a completely free API for downloading the historical weather data in bulk.

The API for downloading hourly data required the user to specify the station ID, the year, and the month. So if we wanted data from 150 stations for the last 10 years, that would mean 150 × 10 × 12 = 18,000 API calls. Making those calls in sequence would take a long time, so we used the aiohttp package to make the API calls concurrently.
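Here's a sketch of the approach. The URL pattern and station IDs below are illustrative placeholders for the Environment and Climate Change Canada bulk-data endpoint; check the current documentation for the exact parameters.

import asyncio
import aiohttp

# Illustrative URL pattern for the bulk weather data endpoint;
# the exact parameters may differ from what the API expects today
BASE_URL = ('https://climate.weather.gc.ca/climate_data/bulk_data_e.html'
            '?format=csv&timeframe=1&stationID={station}&Year={year}&Month={month}')

async def fetch_csv(session, station, year, month):
    # One GET request returning the raw csv text for a station/year/month
    url = BASE_URL.format(station=station, year=year, month=month)
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(stations, years):
    # Create every request up front and run them all concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_csv(session, station, year, month)
                 for station in stations
                 for year in years
                 for month in range(1, 13)]
        return await asyncio.gather(*tasks)

# In Colab/Jupyter an event loop is already running, so we can await directly
# (the station IDs here are placeholders):
# csv_texts = await fetch_all(stations=[5415, 31688], years=range(2010, 2020))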

The above code can be used for fetching any sort of csv data available through a URL.

3. Vectorization

During feature engineering, our team hypothesized that absolute humidity might provide additional information that would help with the prediction of forest fires. Absolute humidity can be derived from the temperature and relative humidity using a standard formula.

Here’s the formula in code:
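(The sketch below uses one common approximation for absolute humidity in grams per cubic metre; the exact constants may differ slightly from the ones in our notebook.)

import math

def calculate_abs_humidity(temperature, rel_humidity):
    # Absolute humidity (g/m^3) from temperature (°C) and relative humidity (%),
    # using a common approximation of the saturation vapour pressure
    return (6.112 * math.exp((17.67 * temperature) / (temperature + 243.5))
            * rel_humidity * 2.1674) / (273.15 + temperature)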

There are several ways to apply this for every hour of the weather data in the pandas dataframe. We could iterate through each row in the dataframe and compute the absolute humidity:

%%timeit -n 3
abs_humidity_list = []
for index, row in df.iterrows():
    temperature = row['Temp (°C)']
    rel_humidity = row['Rel Hum (%)']
    abs_humidity = calculate_abs_humidity(temperature, rel_humidity)
    abs_humidity_list.append(abs_humidity)

Result: 3 loops, best of 3: 8.49 s per loop

Although the above code does the job, it runs a slow Python-level loop, calling the function once for every row. For faster results, the pandas apply function can be used (it still calls the function once per row, but with less overhead):

%%timeit -n 3
abs_humidity_series = df.apply(lambda row:
    calculate_abs_humidity(row['Temp (°C)'], row['Rel Hum (%)']),
    axis=1)

Result: 3 loops, best of 3: 1.9 s per loop

Even though this is faster, the pandas apply function is still fundamentally a loop. For truly speedy calculation, we want to apply the calculate_abs_humidity function to all temperature and relative humidity pairs at the same time. In most of our project notebooks, our team used numpy's vectorize function for this:

import numpy as np  # run in its own cell, since %%timeit must be the first line of a cell

%%timeit -n 3
abs_humidity_np = np.vectorize(calculate_abs_humidity)(
    df['Temp (°C)'],
    df['Rel Hum (%)'])

Result: 3 loops, best of 3: 34.2 ms per loop

Not a bad improvement. However, during the writing of this post I found out that using numpy's vectorize still isn't true vectorization. In fact, we can't do true vectorization with calculate_abs_humidity because inside that function we use math.exp, which can only operate on a single number at a time and not on long lists of numbers (vectors).

Luckily, numpy has an equivalent exponential function that works on vectors, np.exp, so if we make a small adjustment to our original calculate_abs_humidity function, switching out the exponential function, we get:
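(A sketch mirroring the approximate constants used earlier.)

import numpy as np

def calculate_abs_humidity_np(temperature, rel_humidity):
    # Same formula as before, but np.exp works element-wise on whole
    # arrays/Series, so entire columns can be passed in at once
    return (6.112 * np.exp((17.67 * temperature) / (temperature + 243.5))
            * rel_humidity * 2.1674) / (273.15 + temperature)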

All of a sudden we have a function that can take in entire pandas dataframe columns:

%%timeit -n 3
calculate_abs_humidity_np(df['Temp (°C)'], df['Rel Hum (%)'])

Result: 3 loops, best of 3: 5.64 ms per loop

True vectorization!

You might say, "Does vectorization really matter that much? Saving 8 seconds does not seem like that big of a deal." For small datasets, I agree that it won’t make a huge difference. For reference, the examples above used hourly data for the last 10 years from a single weather station which resulted in 78,888 rows (still relatively small).

But what if we wanted to calculate the absolute humidity for data from 100 weather stations over the same time period? Iterating through each row would now mean waiting around 13 minutes, and this is where vectorization begins to make a serious difference. Vectorization also has the added bonus of making your code more concise and readable, so I recommend making a habit of using it whenever possible.

Final Notes

The eventual forest fire prediction model used in the Capstone project was trained on a dataset containing over 16 million rows (each row contained information for a 20×20 km grid on a given day in the last 10 years). Assembling this dataset turned out to be a crucial part of the project and the whole process was made much easier through the use of:

  • Google Colab – for easily sharing and collaborating on data
  • aiohttp – for acquiring large amounts of data quickly
  • vectorized functions – for performing efficient operations on datasets with millions of rows

I’ve included all example code in a notebook file here: https://github.com/ivanzvonkov/examples-3-tips-for-ml-projects/blob/master/example-notebook.ipynb.

I hope you have the opportunity to try some of these tools in your next project!

If you want to check out the repository of the actual capstone project you can find it here: https://github.com/ivanzvonkov/forestcasting.

Thanks for reading, all feedback appreciated!

