Plain and Simple Estimators

Yufeng G
Towards Data Science
5 min read · Sep 7, 2017


Machine learning is awesome, except when it forces you to do advanced math. The tools for machine learning have gotten dramatically better, and training your own model has never been easier.

We’ll utilize our understanding of our dataset rather than an understanding of the raw mathematics to code a model that gets us insights.

In this episode, we are going to train a simple classifier using only a handful of lines of code. Here’s all the code that we’ll look at today:

TensorFlow Estimators for machine learning
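A compact sketch of the whole workflow, written against the Estimators API as it stood in the TensorFlow 1.x releases this episode targets. The synthetic measurements stand in for the iris CSV files, the step count is reduced to keep the run quick, and the temp directory is an arbitrary choice for `model_dir`:

```python
import tempfile

import numpy as np
import tensorflow as tf  # written against the TF 1.x-era Estimators API

# Synthetic stand-in for the iris data: 120 rows, 4 features, 3 classes.
rng = np.random.RandomState(0)
features = rng.rand(120, 4).astype(np.float32)
labels = rng.randint(0, 3, size=120)

# Feature columns describe the data coming into the model.
feature_columns = [tf.feature_column.numeric_column("flower_features", shape=[4])]

# A canned linear classifier: 3 output classes, checkpoints in model_dir.
classifier = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    n_classes=3,
    model_dir=tempfile.mkdtemp())

# The input function creates the ops that feed (features_dict, labels)
# to the model; the dict key must match the feature column's name.
def make_input_fn(x, y):
    def input_fn():
        return {"flower_features": tf.constant(x)}, tf.constant(y)
    return input_fn

# Train, then evaluate with the same classifier object.
classifier.train(input_fn=make_input_fn(features, labels), steps=100)
metrics = classifier.evaluate(input_fn=make_input_fn(features, labels), steps=1)
print(metrics["accuracy"])
```

On random labels the accuracy will hover near chance; with the real iris measurements, the same handful of lines is the whole model.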

To train our classifier, we’ll use TensorFlow, Google’s open-source machine learning library. TensorFlow has a pretty large API surface, but the part we are going to focus on is its high-level API, called Estimators.

Estimators package up the training loop for us, so that we can train a model by configuring it, rather than coding it by hand. This takes away much of the boilerplate, allowing us to think at a higher level of abstraction. This means we’ll get to play with the interesting parts of machine learning, and not get bogged down in too many details.

Since we’ve only covered linear models so far, we’ll stick with that here. We will revisit this example in the future to extend its capabilities.

Flower classification: just as interesting as wine vs beer?

This week we’ll be building a model to distinguish between 3 different types of very similar flowers. I realize that this is a tiny bit less exciting than the beer and wine from the previous episode, but these flowers are a bit more difficult to distinguish, making this a more interesting challenge.

In particular, we’ll be classifying different species of iris flowers. Now, I’m not sure I could pick out an iris flower from a field of roses, but our model is aiming to distinguish Iris Setosa, Iris Versicolour, and Iris Virginica.

Iris Setosa, Iris Versicolour, and Iris Virginica

We have a dataset of measurements of the length and width of these flowers’ petals and sepals. These 4 columns will serve as our ‘features’.

Load the data

After importing TensorFlow and NumPy, we’ll load our dataset in, using TensorFlow’s load_csv_with_header function. The data, or features, are presented as floating point numbers, and the ‘label’ for each row of data, or target, is recorded as an integer: 0, 1, or 2, corresponding to the 3 species of flowers.
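`load_csv_with_header` lived under `tf.contrib.learn` in TensorFlow 1.x; its header row carries the sample and feature counts followed by the class names, and it returns a namedtuple with `.data` and `.target` attributes. A minimal NumPy stand-in, with made-up rows in the same layout (this helper is a sketch, not the TF implementation):

```python
import collections
import io

import numpy as np

Dataset = collections.namedtuple("Dataset", ["data", "target"])

def load_csv_with_header(f, target_dtype=np.int64, features_dtype=np.float32):
    # Header: num_samples, num_features, then the class names.
    header = f.readline().strip().split(",")
    n_samples, n_features = int(header[0]), int(header[1])
    rows = np.genfromtxt(f, delimiter=",")
    data = rows[:, :n_features].astype(features_dtype)   # the measurements
    target = rows[:, n_features].astype(target_dtype)    # the label column
    assert data.shape == (n_samples, n_features)
    return Dataset(data=data, target=target)

# Made-up rows in the iris_training.csv layout: 4 measurements, then the label.
csv_text = """3,4,setosa,versicolor,virginica
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
4.9,3.1,1.5,0.1,0
"""
training_set = load_csv_with_header(io.StringIO(csv_text))
print(training_set.data.shape, training_set.target)
```

The named attributes (`training_set.data`, `training_set.target`) are exactly what we print and inspect next.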

I’ve printed out the results of our loading, and we can see that we are now able to access the training data and the associated labels, or targets, using named attributes.

Build the model

Next we’ll build the model. To do this, we’ll first set up the feature columns. Feature columns define the types of data coming into the model. We are using a single 4-dimensional feature column to represent our features, calling it “flower_features”.

Building our model using estimators is super simple. Using `tf.estimator.LinearClassifier`, we can instantiate the model by passing in the feature columns we just created; the number of different outputs that the model predicts, in this case 3; and a directory to store the model’s training progress and output files. This allows TensorFlow to pick up training later on from where it left off, if needed.
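As a snippet (TF 1.x-era API; the temp directory is an arbitrary stand-in for wherever you want training progress stored):

```python
import tempfile

import tensorflow as tf

# One 4-dimensional numeric column, named to match the input data's key.
feature_columns = [tf.feature_column.numeric_column("flower_features", shape=[4])]

classifier = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    n_classes=3,                    # three iris species
    model_dir=tempfile.mkdtemp())   # where checkpoints and outputs go
```

Because checkpoints land in `model_dir`, re-running with the same directory resumes from the last saved state rather than starting over.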

Input functions

This classifier object will keep track of state for us, and we are now almost ready to move on to the training. There is one final piece to connect our model to the training data, and that is the input function. The job of the input function is to create the TensorFlow operations that generate data for the model.

So we go from raw data to the input function, which passes the data along; the feature columns then map it into the model. Notice that we use the same name for the features as we did when defining the feature column: this is how the data is associated with the right column.
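The contract is simple enough to show without TensorFlow at all: the input function returns a pair of (features dictionary, labels), and the dictionary key must match the feature column’s name. A plain-Python sketch of that contract (all names illustrative):

```python
import numpy as np

FEATURE_NAME = "flower_features"  # must match the feature column's name

def input_fn(features, labels):
    # An Estimator input function returns (features_dict, labels);
    # in real TensorFlow these would be tensors or a tf.data pipeline.
    return ({FEATURE_NAME: np.asarray(features, dtype=np.float32)},
            np.asarray(labels))

features_dict, labels = input_fn([[6.4, 2.8, 5.6, 2.2]], [2])
print(sorted(features_dict), labels)
```

If the key and the column name disagree, the model has no way to find its inputs, which is the most common beginner stumble with Estimators.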

Run the training

Now it’s time to run our training. To train our model, we’ll just run classifier.train(), with the input function passed in as an argument. This is how we connect our dataset to the model.

The train function handles the training loop, iterating over the dataset and improving the model’s performance with each step. And just like that, we’ve completed 1000 training steps! Our dataset is not huge, so this completed rather quickly.

Evaluation time

Now it’s time to evaluate our results. We can do this using the same classifier object from before, as it holds the trained state of the model. To determine how good our model is, we run classifier.evaluate() and pass in our test dataset, and extract the accuracy from the metrics returned.

We got an accuracy of 96.66%! Not bad at all!
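That accuracy metric is just the fraction of test examples the model labels correctly; on a 30-example test split, 29 correct works out to the 96.66…% reported here. A quick NumPy check with hypothetical predictions:

```python
import numpy as np

# Accuracy is the fraction of predictions that match the true labels.
predictions = np.array([0] * 10 + [1] * 10 + [2] * 10)       # hypothetical model output
labels      = np.array([0] * 10 + [1] * 10 + [2] * 9 + [1])  # one example mislabeled
accuracy = np.mean(predictions == labels)
print(accuracy)  # 29/30
```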

Estimators: a straightforward workflow

Let’s pause here for this week, and review what we’ve achieved so far using Estimators.

The Estimators API gives us a nice workflow: getting our raw data, passing it through an input function, setting up our feature columns and model structure, running our training, and running our evaluation. This easy-to-understand framework allows us to think about our data and its properties, rather than the underlying math, which is a great place to be!

What’s next

Today we looked at a very simple version of TensorFlow’s high-level API, using a canned estimator. In future episodes, we’ll look at how to augment this model with more details, use more complex data, and add more advanced features.

Liked this episode? Check out the whole playlist on YouTube!
