Predicting Forest Cover Type with Tensorflow and model deployment in GCP

Using a Kaggle competition to get started with Tensorflow and learn how to deploy the model in GCP

Dipika Baad
Towards Data Science


Forest Cover Type Classification using Tensorflow by Dipika Baad

In this post, I will share:

  1. How I started with Tensorflow
  2. Solving a Kaggle competition with deep learning
  3. Deploying model in GCP
  4. Building a pipeline in GCP for ML

The data used in this project is from the Kaggle Forest Cover Type Prediction competition. Although it is no longer an active competition, it fit my criteria of numerical/categorical data that is easy to work with, so we can focus on building the model in Tensorflow and on building a small pipeline for GCP. Details of the data are provided on the competition's data description page. The data is provided by the US Geological Survey and the USFS (Forest Service). Seven forest cover types will be predicted in this problem:

1 — Spruce/Fir
2 — Lodgepole Pine
3 — Ponderosa Pine
4 — Cottonwood/Willow
5 — Aspen
6 — Douglas-fir
7 — Krummholz

I will dive into solving this problem with the following steps:

  1. Loading dataset
  2. Preprocessing dataset
  3. Getting started with Tensorflow
  4. Creating tensorflow Dataset
  5. Model building with Keras
  6. Training the model
  7. Testing the model
  8. Submitting results to Kaggle
  9. Deploying model in GCP

1. Loading dataset

Download the data from the Kaggle competition above and store it in your Google Drive or locally in an appropriate folder. Set your folder path in the FOLDER_NAME variable. In my case, I store the data in Google Drive and use Google Colab to read the input files. If you wish to write the code in a Kaggle notebook, you can follow the code I have published on Kaggle along with this article; the only changes are in how the data is loaded and stored.

Let’s get started with Google Colab by mounting the drive:
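Something along these lines does the job (the FOLDER_NAME path below is only an example; point it to wherever you stored the competition files):

from google.colab import drive

# Mount Google Drive so the Kaggle files can be read from it
drive.mount('/content/gdrive')

# Example path only; adjust to your own folder
FOLDER_NAME = '/content/gdrive/My Drive/forest_cover_type/'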

This will give a link to get an authorization code, which you need to enter into the input box presented. Once that is done, we are ready to load the data into a dataframe.
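A minimal sketch of the loading step, assuming the competition file is named train.csv inside FOLDER_NAME:

import pandas as pd

# Read the Kaggle training file into a dataframe and take a first look at it
train_df = pd.read_csv(FOLDER_NAME + 'train.csv')
print(train_df.shape)
train_df.describe()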

Output:

Part of the output

The train dataset has 15120 rows. From the describe table, it can be seen that the Soil_Type7 and Soil_Type15 columns have constant 0 values, so they should be removed. The numerical columns, i.e. everything except the categorical ones (Soil_Types and Wilderness_Areas), should be normalized to get better results. In the next step, we will do all the preprocessing needed to make the data ready for prediction.

2. Preprocessing dataset

The one-hot encoded soil type columns will be merged into a single categorical column, and the wilderness area columns will be converted in the same way.
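A rough sketch of that merge is shown below. It assumes the standard Kaggle column names (Soil_Type1 to Soil_Type40, Wilderness_Area1 to Wilderness_Area4), drops the constant Soil_Type7 and Soil_Type15 columns noted earlier, and uses idxmax to recover the active category per row; the exact handling of the NA category may differ from my original code.

# Columns that make up the one-hot encoded groups
soil_cols = [c for c in train_df.columns if c.startswith('Soil_Type')]
wilderness_cols = [c for c in train_df.columns if c.startswith('Wilderness_Area')]

# Drop the constant-zero soil type columns
train_df = train_df.drop(columns=['Soil_Type7', 'Soil_Type15'])
soil_cols = [c for c in soil_cols if c not in ('Soil_Type7', 'Soil_Type15')]

# Collapse each one-hot group back into a single categorical column
train_df['Soil_Types'] = train_df[soil_cols].idxmax(axis=1)
train_df['Wilderness_Areas'] = train_df[wilderness_cols].idxmax(axis=1)
train_df = train_df.drop(columns=soil_cols + wilderness_cols)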

Output:

Soil types 8 and 25 had only one row each, so they were converted into another column with an NA subscript. This is optional; you can drop those columns too.

For the numerical columns, MinMaxScaler is the transformer that will be applied to get normalised columns. Before we do that, we need to split the data into train, validation and test sets so that normalisation is applied to each of those pieces of data.

Splitting data into train, val and test

The data will be split into train, validation and test sets: 60% train, 20% validation and 20% test.
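A minimal sketch of the 60/20/20 split using sklearn's train_test_split (the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

# 20% for test, then 25% of the remaining 80% for validation (i.e. 20% of the whole)
train, test = train_test_split(train_df, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)
print(len(train), len(val), len(test))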

Output:

Once the data is split, normalisation can be applied as follows:
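A sketch of the scaling step; whether the scaler is fit on the training split alone or refit per split is a design choice, and here it is fit on the training split only:

from sklearn.preprocessing import MinMaxScaler

# Numerical columns: everything except the id, target and categorical columns
numeric_cols = [c for c in train.columns
                if c not in ('Id', 'Cover_Type', 'Soil_Types', 'Wilderness_Areas')]

scaler = MinMaxScaler()
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
val[numeric_cols] = scaler.transform(val[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])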

3. Getting started with Tensorflow

Tensorflow is Google's open source deep learning library, used in research as well as production. The core components you need to know are tensors and graphs. A tensor is an n-dimensional vector or matrix in which data is stored and on which operations are performed. A graph describes all the operations and the connections between nodes, and it can be run on multiple CPUs or GPUs.

In this post, I explain how I created my first model using Tensorflow rather than going into the very basics of Tensorflow. If you are interested in those, you can go through the basic tutorials here. I must admit that going through the Tensorflow documentation wasn't as easy as PyTorch's. With PyTorch I was able to build my first model within a week, but with the Tensorflow documentation it is hard to figure out even the right way to load data, and on top of that there are conflicts between function formats and backward compatibility across versions. Hence, following tutorials can end in errors that are not always easy to solve.

Install Tensorflow on your machine or in Google Colab. I used Google Colab with a GPU runtime. Tensorflow 2 is used for this problem.

!pip install tensorflow

4. Creating tensorflow Dataset

After all the processing, we will load the data into a tensorflow Dataset. This helps in building the input pipeline for the model. A batch_size has to be defined so that the data is accessed in batches rather than all at once. Optionally, you can shuffle the rows as well.
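Roughly, the conversion looks like this; the helper name df_to_dataset and the batch size of 32 are my own choices:

import tensorflow as tf

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    # Split off the label and feed the remaining columns as a dict of features
    dataframe = dataframe.copy()
    labels = dataframe.pop('Cover_Type')
    dataframe = dataframe.drop(columns=['Id'], errors='ignore')  # Id is not a feature
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    return ds.batch(batch_size)

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)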

Output:

Part of the output (the column names could not all fit horizontally in one screenshot)

5. Model building with Keras

The model is built with tensorflow Keras using a feed-forward neural architecture. The feature layer is built with tensorflow Keras layers.DenseFeatures; categorical and numerical columns are handled separately when creating this input layer, as shown in the code. The model is defined in the build_model function with two hidden layers of 100 and 50 nodes, followed by an output layer with 8 output neurons. It is 8 because the cover types are integer labels from 1 to 7 rather than 0 to 6, so class 0 is unused and is ignored in the confusion matrix. The adam optimizer and the relu activation function are used. The code for building the model and the pipeline is as follows:
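The sketch below shows one way this can look with tf.feature_column; the vocabularies are taken from the training split, and the softmax output with sparse_categorical_crossentropy is my assumption about the exact output activation and loss used:

from tensorflow import feature_column
from tensorflow.keras import layers

# Numeric features
feature_columns = [feature_column.numeric_column(c) for c in numeric_cols]

# Categorical features are re-encoded as indicator (one-hot) columns inside the feature layer
for col in ['Wilderness_Areas', 'Soil_Types']:
    vocab = sorted(train[col].unique())
    cat = feature_column.categorical_column_with_vocabulary_list(col, vocab)
    feature_columns.append(feature_column.indicator_column(cat))

feature_layer = layers.DenseFeatures(feature_columns)

def build_model(feature_layer):
    model = tf.keras.Sequential([
        feature_layer,
        layers.Dense(100, activation='relu'),
        layers.Dense(50, activation='relu'),
        layers.Dense(8, activation='softmax'),  # labels run 1-7, so index 0 is unused
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model(feature_layer)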

6. Training the model

For training, we provide the training and validation datasets to the model's fit function. The validation loss makes it easier to watch for overfitting during training. The model is trained for 100 epochs.
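A minimal version of the training call:

# Train for 100 epochs, monitoring the validation metrics to watch for overfitting
history = model.fit(train_ds, validation_data=val_ds, epochs=100)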

Output:

Part of the output

In the output, it is clear that the network is learning, since the train accuracy increases alongside the validation accuracy. If at some point the validation accuracy had started dropping while the training accuracy kept increasing, that would be the point where the model starts overfitting, and you would know to stop the epochs there. If the accuracy is random, then the model is not learning anything. With the architecture designed here, I got a good accuracy of ~85% on the train set and ~84% on the validation set.

In the model summary, you can see the number of parameters at each layer. In the first hidden layer, 51 input features are connected to 100 hidden nodes, giving 5100 weights for the fully connected layer plus 100 bias params (one per node), which adds up to 5200 params. In the next layer, 100 nodes are connected to 50 nodes, giving 5000 weights plus 50 bias params (one per node in the second hidden layer), making 5050 params. The remaining param counts are calculated similarly. This is how you read the model summary; it also shows how many params are learned in total. At the end, the model is saved to a directory.
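The summary and the save step can look like this (the directory name matches the folder uploaded to Cloud Storage later):

# Inspect the layer-by-layer parameter counts
model.summary()

# Save the trained model in SavedModel format for deployment later
model.save('forest_model_layer_100_50_epoch_100')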

7. Testing the model

To test how the model works on the test data, we will use the classification report from sklearn. The classification report shows precision, recall and f1-score for each forest cover type, as well as the overall accuracy.
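A sketch of the evaluation, assuming test_ds was built without shuffling so the predictions line up with the rows of the test dataframe:

import numpy as np
from sklearn.metrics import classification_report

# The predicted class is the index of the highest probability for each row
pred_probs = model.predict(test_ds)
pred_classes = np.argmax(pred_probs, axis=1)

print(classification_report(test['Cover_Type'], pred_classes))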

Output:

An average accuracy of ~80% was achieved with this model. This is a good enough result for a simple feed-forward neural network architecture. Now we can get the results for the Kaggle test dataset using the model and submit them to Kaggle.

8. Submitting results to Kaggle

We need to predict the cover type for the test data given on Kaggle. This data is in the same folder as the data downloaded in the first step. The expected output is a csv file with Id and Cover_Type columns. The same transformations applied to the train data have to be applied to the test data. The following code shows how to get the results for the test data:
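A rough sketch of this step, assuming the Kaggle file is named test.csv and reusing the scaler and column lists from the preprocessing section; the submission file name is arbitrary:

# Load the Kaggle test file and repeat the same preprocessing steps
kaggle_df = pd.read_csv(FOLDER_NAME + 'test.csv')
kaggle_df = kaggle_df.drop(columns=['Soil_Type7', 'Soil_Type15'])
kaggle_df['Soil_Types'] = kaggle_df[soil_cols].idxmax(axis=1)
kaggle_df['Wilderness_Areas'] = kaggle_df[wilderness_cols].idxmax(axis=1)
kaggle_df = kaggle_df.drop(columns=soil_cols + wilderness_cols)
kaggle_df[numeric_cols] = scaler.transform(kaggle_df[numeric_cols])

# Predict in batches; the Kaggle test file has no Cover_Type column
kaggle_ds = tf.data.Dataset.from_tensor_slices(dict(kaggle_df)).batch(32)
preds = np.argmax(model.predict(kaggle_ds), axis=1)

submission = pd.DataFrame({'Id': kaggle_df['Id'], 'Cover_Type': preds})
submission.to_csv(FOLDER_NAME + 'submission.csv', index=False)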

Output:

Once you have the file ready, you can upload it in the My Submissions section on the Kaggle competition page. Once submitted, you can see the score after a while. After my submission, I got an accuracy of ~62% on the test data. You can make as many submissions as you like with different experiments and try to increase this score. That's all; you can start participating in different competitions and experiment with different kinds of datasets. I started mine with a simple prediction problem with numeric/categorical data. My goal was to learn Tensorflow with a real-world example, so I started with the dataset of a Kaggle competition that was no longer active but had a simple problem to work with.

9. Deploying model in GCP

Google AI Platform

The model saved in the previous section can be deployed in Google Cloud so that it is accessible to any application you have. I assume that you have basic knowledge of Google Cloud and have worked with it a little, as I will not be explaining how to get started with GCP (there are many beginner courses for that on Coursera / GCP Qwiklabs which you can take).

Prerequisites:

If you don't have one yet, you can create a free Google Cloud account, which comes with $300 of free credits; we will need it in the next steps. Also install the Google Cloud SDK on your computer. Create a project in Google Cloud if there isn't one already. Make sure you have a service account key downloaded from IAM and stored in an environment variable. (Refer to the Google Cloud documentation for the basic setup needed to interact with GCP via the command console.)

export GOOGLE_APPLICATION_CREDENTIALS=<PATH TO KEY JSON FILE>

Run the following command to authenticate with your Google Cloud account and set the project by following the instructions in the output.

gcloud auth login

As a first step, upload the folder forest_model_layer_100_50_epoch_100 from the previous section to Google Cloud Storage. I created a bucket forest-cover-model and uploaded the folder to that location.
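For example, with the gsutil tool that ships with the Cloud SDK (assuming the SavedModel directory is in your current working directory):

gsutil mb gs://forest-cover-model
gsutil cp -r forest_model_layer_100_50_epoch_100 gs://forest-cover-model/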

Once that is done, you are ready to deploy the model.

gcloud components install beta

This is needed before deploying the model.

gcloud beta ai-platform versions create v7 \
  --model ForestPredictor \
  --origin gs://forest-cover-model/forest_model_layer_100_50_epoch_100/ \
  --runtime-version=2.1 \
  --python-version=3.7

The v7 version was the one that worked for me after some experiments; you can start with v1 as the version name. This way you can keep different versions of the model. Suitable runtime versions can be found here. The origin parameter is the Google Cloud Storage path where the model is stored.

Initially, I was planning to build a custom prediction routine with preprocessor classes etc., but unfortunately, after creating all of that, I found out at deployment time that custom prediction routines only work for tensorflow>=1.13,<2. The platform is evolving, so support may come in the future; check here for updates. (The code for the custom pipeline exists in my repository, which I have shared for those who are interested.)

To test the deployed model, you can browse to AI Platform > Models and click on the version number you want to test under the model name. There is a Test & Use option where you can provide custom input. The input format is as follows; use this example for testing.

{
  "instances": [
    {
      "Elevation": 0.4107053527,
      "Aspect": 0.9833333333,
      "Slope": 0.2121212121,
      "Horizontal_Distance_To_Hydrology": 0.0,
      "Vertical_Distance_To_Hydrology": 0.2235142119,
      "Horizontal_Distance_To_Roadways": 0.3771251932,
      "Hillshade_9am": 0.7716535433,
      "Hillshade_Noon": 0.842519685,
      "Hillshade_3pm": 0.6141732283,
      "Horizontal_Distance_To_Fire_Points": 0.9263906315,
      "Wilderness_Areas": "Wilderness_Area1",
      "Soil_Types": "Soil_Type33"
    }
  ]
}

You can see the output as follows:

The output gives the probabilities of each cover type for the given input.

Once this is all working fine, you can use the model to make predictions for any input data. I have shared the code in the Github repository; the file forest_classification.py contains the code for calling the model and the input pipeline.

I split the input data into small chunks since the service failed to return results for all the rows at once. These are the kinds of things you need to manage in your applications. You can refer to that code if you are interested in exploring how to create a pipeline for a model deployed in GCP.
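For a rough idea of what such a call can look like, here is a minimal sketch using the googleapiclient library; PROJECT_ID and the chunk size are placeholders, and this is not the exact code from the repository:

from googleapiclient import discovery

PROJECT_ID = 'my-gcp-project'   # placeholder: use your own project ID
MODEL_NAME = 'ForestPredictor'
VERSION = 'v7'

service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION)

def predict_in_chunks(instances, chunk_size=100):
    # Send the instances in small batches to avoid oversized requests
    predictions = []
    for i in range(0, len(instances), chunk_size):
        body = {'instances': instances[i:i + chunk_size]}
        response = service.projects().predict(name=name, body=body).execute()
        predictions.extend(response['predictions'])
    return predictions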

Tadaa! You are ready to use the model in a practical solution, with versioning maintained for various experiments so you can track their performance. In the real world, apart from just building a model, these skills are very important. For fun, you can try to see how such pipelines are built in other cloud environments if that's what you prefer. If you wish to experiment with optimizing the model with various parameters or other optimization functions, you can refer to my previous articles, where I have suggested improvements and other options for building deep neural network models. Although those were for PyTorch, the basics of changing architectures and parameters remain the same regardless of which framework you use; you just have to find the equivalent ways of doing the same things in other libraries. I hope this helps you get started with GCP and Kaggle as well as the Tensorflow framework!

As always — Happy experimenting and learning :)
