Previously, we analyzed the Brazilian Wildfire dataset. We looked at the dataset’s characteristics, such as the mean number of wildfires in each state, the standard deviation of the number of wildfires in each state, and the distribution of the number of wildfires across all of the states. We also looked at the number of wildfires versus the year for each state in a given month.
In this post, we will build a machine learning model to predict the number of wildfires in a state for any given year.
The workflow for building our machine learning model will be as follows:
- Data Preparation and Cleaning
- Feature Selection & Engineering
- Model Selection
- Model Tuning & Testing
Luckily, the Brazilian dataset is structured, cleaned and labelled, so the first step is essentially done for us. (In many cases you will have to deal with unstructured and unlabelled data, which reinforces the adage that "data scientists spend 80% of their time finding, cleaning and structuring data.")
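Even so, it never hurts to verify. A minimal sanity check (a quick sketch, using the same amazon.csv file and encoding as the rest of this post) might look like:
import pandas as pd

# Basic structural checks before modelling
df = pd.read_csv("amazon.csv", encoding="ISO-8859-1")
print(df.shape)            # number of rows and columns
print(df.dtypes)           # confirm the column types are sensible
print(df.isnull().sum())   # confirm there are no missing values to clean up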
Now let’s build this model!
The first thing we need to do is import the necessary Python packages:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
Next, we can write a function that initializes our data. There is some freedom in how you choose to preprocess your data; for this example we will keep it simple and initialize the data for a specific state of interest. We also engineer a month category feature, which we will include in our model:
def initialize_data(state):
    # Read the dataset (Latin-1 encoding handles the Portuguese characters)
    df = pd.read_csv("amazon.csv", encoding="ISO-8859-1")
    # Engineer a numeric month category feature from the month names
    df['month_cat'] = df['month'].astype('category')
    df['month_cat'] = df['month_cat'].cat.codes
    # Keep only the rows for the state of interest
    df = df[df['state'] == state]
    return df
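As a quick, optional check, you can inspect how the month names were mapped to integer codes (using ‘Sergipe’ purely as an example here):
df = initialize_data('Sergipe')
# Each month name gets a stable integer code via cat.codes
print(df[['month', 'month_cat']].drop_duplicates())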
Next we define a function that allows us to split our data for training and testing:
def train_test_split(year, df):
    # Train on all years before 'year'; test on 'year' itself
    df_train = df[df['year'] < year]
    df_test = df[df['year'] == year]
    # Features: year and month category; target: number of wildfires
    X_train = np.array(df_train[['year', 'month_cat']])
    y_train = np.array(df_train['number'])
    X_test = np.array(df_test[['year', 'month_cat']])
    y_test = np.array(df_test['number'])
    return X_train, X_test, y_train, y_test
The function ‘train_test_split’ uses ‘year’ to split the data for model training and testing. For example, if ‘year’ = 2015, the training set is all of the wildfire data before the year 2015 and the testing set is all of the wildfire data during the year 2015. We then define the feature and target variables, where the features are the year and the month category, and the target is the number of wildfires.
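As a small illustration (assuming ‘df’ comes from ‘initialize_data’ above), a split at 2015 gives two feature columns, year and month category, in each feature array:
X_train, X_test, y_train, y_test = train_test_split(2015, df)
print(X_train.shape, X_test.shape)   # (rows before 2015, 2) and (rows in 2015, 2)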
We now define a function that specifies the model parameters for the Random Forest algorithm. We can use this function to optimize the model during testing: we change the N_ESTIMATORS and MAX_DEPTH values until the error metric is minimized. This process is called hyperparameter tuning:
def model_tuning(N_ESTIMATORS, MAX_DEPTH):
    # Fix random_state so results are reproducible across runs
    model = RandomForestRegressor(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=42)
    return model
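To make the tuning process concrete, here is a minimal sketch of a manual grid search (the candidate values are arbitrary choices, and X_train, X_test, y_train, y_test are assumed to come from ‘train_test_split’ above):
# Try a few (n_estimators, max_depth) pairs and keep the pair with the lowest MAE
best_mae, best_params = float("inf"), None
for n_estimators in [10, 50, 100]:
    for max_depth in [5, 10, 50]:
        model = model_tuning(n_estimators, max_depth)
        model.fit(X_train, y_train)
        mae = mean_absolute_error(y_test, model.predict(X_test).astype(int))
        if mae < best_mae:
            best_mae, best_params = mae, (n_estimators, max_depth)
print("Best (n_estimators, max_depth):", best_params, "MAE:", best_mae)
scikit-learn’s GridSearchCV can automate this kind of search with cross-validation, but the explicit loop makes the idea clear.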
The next thing we do is define a function that fits the model to the training data, predicts the number of wildfires, and returns the mean absolute error (MAE) of the predictions:
def predict_fire(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    # Cast predictions to integers, since fire counts are whole numbers
    y_pred = model.predict(X_test).astype(int)
    mae = mean_absolute_error(y_test, y_pred)
    print("Mean Absolute Error: ", mae)
    # Compare the first few predictions against the actual values
    df_results = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test})
    print(df_results.head())
    return mae
Finally, we define a ‘main’ function to test the model with different input values. Below we call our ‘initialize_data’, ‘model_tuning’, and ‘predict_fire’ functions twice, generating predictions for the states ‘Sergipe’ and ‘Distrito Federal’ in the year 2017 and calculating the mean absolute error (MAE):
def main():
    # Predict 2017 wildfires for Sergipe
    df = initialize_data('Sergipe')
    X_train, X_test, y_train, y_test = train_test_split(2017, df)
    model = model_tuning(50, 50)
    predict_fire(model, X_train, X_test, y_train, y_test)

    # Predict 2017 wildfires for Distrito Federal
    df = initialize_data('Distrito Federal')
    X_train, X_test, y_train, y_test = train_test_split(2017, df)
    model = model_tuning(50, 50)
    predict_fire(model, X_train, X_test, y_train, y_test)

if __name__ == "__main__":
    main()
Running this prints the MAE and the first few predicted versus actual wildfire counts for each of the two states.
We can also analyze the performance of the model across all states as follows:
def main():
    df = pd.read_csv("amazon.csv", encoding="ISO-8859-1")
    # Loop over every unique state and evaluate the model on 2017
    for state in set(df['state'].values):
        df_state = initialize_data(state)
        X_train, X_test, y_train, y_test = train_test_split(2017, df_state)
        model = model_tuning(50, 50)
        predict_fire(model, X_train, X_test, y_train, y_test)
        print(state)

if __name__ == "__main__":
    main()
This outputs the prediction values, the actual values, the state names, and the MAE values for the predictions for each state. The model can be improved through hyperparameter tuning, trying other tree-based methods (XGBoost, LightGBM, CatBoost) and trying neural network time series models (LSTM, RNN, CNN, WaveNet CNN). I encourage you to play around with the model and see how low you can get the MAE with random forests. Afterwards, try applying some of the other methods I suggested and see if you get any improvements in accuracy. All of the code for this is available on GitHub. Good luck and happy machine learning!
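For example, swapping in XGBoost is straightforward, since it follows the same fit/predict interface as scikit-learn (this sketch assumes the xgboost package is installed; ‘model_tuning_xgb’ is a hypothetical helper, not part of the code above):
from xgboost import XGBRegressor

def model_tuning_xgb(n_estimators, max_depth):
    # Drop-in alternative to model_tuning using gradient-boosted trees
    return XGBRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
The returned model can be passed to ‘predict_fire’ unchanged.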