Detecting the Language of a Person’s Name using a PyTorch RNN

Derrick Mwiti · Heartbeat · Oct 12, 2018

In this tutorial, we’ll build a Recurrent Neural Network (RNN) in PyTorch that will classify people’s names by their languages. We assume that the reader has a basic understanding of PyTorch and machine learning in Python.

At the end of this tutorial, we’ll be able to predict the language of a name based on its spelling. The dataset of names used in this tutorial can be downloaded here. This tutorial is adapted from PyTorch’s official docs, where you can find more details about the implementation.

Plan of Attack

  1. Data Pre-processing
  2. Turning the Names into PyTorch Tensors
  3. Building the RNN
  4. Testing the RNN
  5. Training the RNN
  6. Plotting the Results
  7. Evaluating the Results
  8. Predicting on New Names
  9. Conclusion

Data Pre-processing

As is the case with any machine learning task, we’ll kick off by loading and preparing our dataset. Upon downloading the dataset, we notice that there’s a folder called names inside the data folder. It contains text files with surnames in eighteen different languages.

In order to load all the files in one go, we’ll use a Python module known as glob. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. Results are returned in an arbitrary order. We’ll use it to load all the files in the folder that end with .txt.
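Following the official tutorial this post adapts, a sketch of this step might look like the code below (the data/names/*.txt path assumes the dataset was extracted into a data folder next to the script):

```python
import glob

# Find every name file in the extracted dataset.
def find_files(path):
    return glob.glob(path)

all_files = find_files('data/names/*.txt')
print(all_files[:3])  # e.g. ['data/names/Arabic.txt', 'data/names/Chinese.txt', ...]
```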

The names are currently in Unicode format, so we have to convert them to plain ASCII. This removes the diacritics from the words; for example, the French name Béringer will be converted to Beringer.
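A minimal sketch of the conversion, using Python’s built-in unicodedata module (the allowed-character vocabulary follows the official tutorial):

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def unicode_to_ascii(s):
    # Decompose each character and drop the combining marks (diacritics).
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

print(unicode_to_ascii('Béringer'))  # Beringer
```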

In the next step, we create a dictionary with a list of names for each language.

We can view the first fifteen French names in the dictionary as shown below.
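A sketch of how the dictionary could be built from the files found earlier (category_lines and all_categories are the names used in the official tutorial; find_files and unicode_to_ascii come from the previous snippets):

```python
import os

category_lines = {}   # language -> list of names
all_categories = []   # list of languages

def read_lines(filename):
    with open(filename, encoding='utf-8') as f:
        return [unicode_to_ascii(line.strip()) for line in f]

for filename in find_files('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    category_lines[category] = read_lines(filename)

n_categories = len(all_categories)
print(category_lines['French'][:15])
```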


Turning the Names into PyTorch Tensors

When working with data in PyTorch, we have to convert it to PyTorch tensors, which are very similar to NumPy arrays. In our case, we have to convert each letter into a tensor: a one-hot vector filled with 0s except for a 1 at the index of the current letter. Let’s show how this is done by converting the letter M to a one-hot vector.
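A sketch, assuming the all_letters vocabulary defined during pre-processing:

```python
import torch

# One-hot encode a single letter as a <1 x n_letters> tensor.
def letter_to_tensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][all_letters.find(letter)] = 1
    return tensor

print(letter_to_tensor('M'))
```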

To represent a whole name, we join the one-hot letter vectors into a 2D matrix (with an extra batch dimension of 1, since PyTorch expects batched input).
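A sketch of the per-name tensor:

```python
# Stack the one-hot letter vectors into a <name_length x 1 x n_letters> tensor.
def line_to_tensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for i, letter in enumerate(line):
        tensor[i][0][all_letters.find(letter)] = 1
    return tensor

print(line_to_tensor('Mwiti').size())  # torch.Size([5, 1, 57])
```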

Building the RNN

When creating a neural network in PyTorch, we subclass torch.nn.Module, the base class for all neural network modules. torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. torch.nn.LogSoftmax() applies the log(Softmax(x)) function to an n-dimensional input tensor.
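A sketch of the network, closely following the single-layer recurrent module from the official tutorial:

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # Two linear layers read the concatenated input and hidden state.
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.softmax(self.i2o(combined))
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)
```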

Testing the RNN

We kick off by creating an instance of the RNN class and passing in the arguments as required.
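For example (128 hidden units is the value used in the official tutorial; n_letters and n_categories come from the pre-processing step):

```python
n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)
```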

We’d like the network to give us the probability of each language. In order to achieve this, we’ll pass the tensor for a single letter, together with an initial hidden state, through the network.
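A sketch of a single forward step:

```python
input_tensor = letter_to_tensor('M')
hidden = torch.zeros(1, n_hidden)

# output holds the log-probability of each language given this single letter.
output, next_hidden = rnn(input_tensor, hidden)
print(output)
```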

Training the RNN

The network outputs a likelihood for each category. To interpret it, we use Tensor.topk to get the index of the greatest value, which corresponds to the predicted language.
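Something along these lines:

```python
def category_from_output(output):
    # topk(1) returns the largest value and its index along the category dimension.
    top_value, top_index = output.topk(1)
    category_i = top_index[0].item()
    return all_categories[category_i], category_i

print(category_from_output(output))
```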

Next, we need a quick way to obtain a random training example: a name together with the language it belongs to.
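A sketch of a helper that samples a random (language, name) pair and returns the corresponding tensors:

```python
import random
import torch

def random_choice(items):
    return items[random.randint(0, len(items) - 1)]

def random_training_example():
    category = random_choice(all_categories)
    line = random_choice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = line_to_tensor(line)
    return category, line, category_tensor, line_tensor

for _ in range(5):
    category, line, _, _ = random_training_example()
    print('category =', category, '/ line =', line)
```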

The next step is to define the loss function and create an optimizer that will update the parameters of the model according to its gradients. We also specify a learning rate for our model.
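A sketch: NLLLoss pairs with the LogSoftmax output of the network, and 0.005 is the learning rate used in the official tutorial (which updates the weights manually, whereas here we use an SGD optimizer for the same effect):

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.NLLLoss()
learning_rate = 0.005
optimizer = optim.SGD(rnn.parameters(), lr=learning_rate)
```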

We move forward to define a function that will create the input and output tensors, compare the final output to the target output, and finally do back-propagation.
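A sketch of the training step, using the criterion and optimizer defined above:

```python
def train(category_tensor, line_tensor):
    hidden = rnn.init_hidden()
    optimizer.zero_grad()

    # Feed the name one letter at a time; only the final output is scored.
    for i in range(line_tensor.size(0)):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()
    optimizer.step()
    return output, loss.item()
```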

The next step is to run the train function on a large number of examples while keeping track of the losses for later plotting.
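A sketch of the loop; the iteration counts follow the official tutorial:

```python
n_iters = 100000
print_every = 5000
plot_every = 1000

current_loss = 0
all_losses = []

for iteration in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = random_training_example()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    if iteration % print_every == 0:
        guess, _ = category_from_output(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% %.4f %s / %s %s' % (
            iteration, iteration / n_iters * 100, loss, line, guess, correct))

    if iteration % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
```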

Plotting the Results

We plot the results using Matplotlib’s pyplot. The plot shows the average loss over time, which tells us how well the network is learning.
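A sketch, plotting the average losses recorded during training:

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)
plt.xlabel('Iterations (thousands)')
plt.ylabel('Average loss')
plt.show()
```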

Evaluating the Results

We’ll create a confusion matrix in order to see how the network performed on different categories. The bright spots off the main diagonal show the languages it guesses incorrectly.
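A sketch of the evaluation, again following the official tutorial (evaluate runs a forward pass without updating the weights; the 10,000-sample count is an assumption):

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import torch

def evaluate(line_tensor):
    hidden = rnn.init_hidden()
    for i in range(line_tensor.size(0)):
        output, hidden = rnn(line_tensor[i], hidden)
    return output

confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Tally how often each true language is guessed as each language.
for _ in range(n_confusion):
    category, line, category_tensor, line_tensor = random_training_example()
    output = evaluate(line_tensor)
    guess, guess_i = category_from_output(output)
    confusion[all_categories.index(category)][guess_i] += 1

# Normalize each row to the fraction of guesses per true language.
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
plt.show()
```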

Predicting on New Names

We’ll define a function that will take in a name and return the likely languages the name is from.
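A sketch (the name passed to predict at the end is just an example):

```python
def predict(name, n_predictions=3):
    print('\n> %s' % name)
    with torch.no_grad():
        output = evaluate(line_to_tensor(name))

        # Print the top languages with their log-probabilities.
        top_values, top_indices = output.topk(n_predictions, 1, True)
        for i in range(n_predictions):
            value = top_values[0][i].item()
            category_index = top_indices[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))

predict('Mwiti')
```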

Reference

Classifying Names with a Character-Level RNN (the official PyTorch tutorial this post is adapted from)

Conclusion

If you’d like to learn more about PyTorch, there are a bunch of tutorials nested in its official docs. If you’re looking to learn more about RNNs, Brian Mwangi has written a fantastic guide as well.


