Translate a Pandas data frame using the googletrans library

Amanda Iglesias Moreno
Towards Data Science
4 min read · Jan 10, 2020

Google translator logo

Googletrans is a free, unofficial Python library that uses the Google Translate web API. In this article, we explain how to use the library to translate strings as well as data frames in Python.

Translate a string

Googletrans is a third-party library that we can install by using pip. After installing the library, we import the module googletrans.

The first step is to create a translator object. Then, we use the method translate, passing as first argument the string we want to translate. This method returns an object, whose attribute .text provides the translated string as follows:
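The steps above can be sketched as follows. This is a minimal sketch that assumes a googletrans release whose translate method is synchronous (newer releases may require await). The helper name translate_string and its optional translator parameter are our own illustration, not part of the library; the real Translator is only created when no translator is passed in, because it needs network access.

```python
# Install the library first, for example:  pip install googletrans

def translate_string(text, translator=None):
    """Translate `text` with the googletrans defaults (hypothetical helper)."""
    if translator is None:
        # Imported lazily: the library must be installed and the
        # translate call below needs network access.
        from googletrans import Translator
        translator = Translator()
    translated = translator.translate(text)  # returns a Translated object
    return translated.text                   # the translated string
```

Calling translate_string("Hola Mundo") sends the string to Google Translate and returns the translated text.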

The source and the destination language can be specified by using the src and dest arguments. If src is not provided, googletrans tries to detect the source language; if dest is not provided, it translates into English. Below, we translate the previous string Hola Mundo (in Spanish) into Italian and German.
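A possible way to write this, again assuming the synchronous googletrans API. The function translate_to_many is a hypothetical helper introduced for illustration; src and dest are the library's own argument names, and "es", "it" and "de" are the language codes for Spanish, Italian and German.

```python
def translate_to_many(text, dest_languages, src="es", translator=None):
    """Return {language code: translated text} for each destination language."""
    if translator is None:
        # Imported lazily: requires the library and network access.
        from googletrans import Translator
        translator = Translator()
    return {
        dest: translator.translate(text, src=src, dest=dest).text
        for dest in dest_languages
    }
```

For the example in the text, the call would be translate_to_many("Hola Mundo", ["it", "de"]).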

We can consult the supported languages for translation by using the googletrans.LANGUAGES attribute. This attribute is a dictionary that maps each language code (e.g. es) to the name of the language (e.g. spanish).
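Since the attribute is a plain dictionary of language codes and names, it can be searched like any other dictionary. The sketch below uses a hypothetical helper, find_language_codes, and a three-entry excerpt in the same shape as the library's dictionary, so that it runs without the library installed.

```python
def find_language_codes(name_fragment, languages):
    """Return the codes whose language name contains `name_fragment`."""
    fragment = name_fragment.lower()
    return sorted(
        code for code, name in languages.items() if fragment in name.lower()
    )

# Excerpt with the same code -> name shape as googletrans.LANGUAGES
sample_languages = {"es": "spanish", "it": "italian", "de": "german"}
print(find_language_codes("ital", sample_languages))
```

With the library installed, the full dictionary can be passed instead: from googletrans import LANGUAGES.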

As mentioned before, the translate method returns a Translated object. This object has the following attributes:

translate method attributes
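To inspect these attributes in one go, we can collect them into a dictionary. The sketch assumes the attribute names src, dest, origin, text and pronunciation of the Translated object; the helper translated_to_dict is hypothetical and works on any object that exposes those names.

```python
# Attribute names of the object returned by Translator.translate
TRANSLATED_FIELDS = ("src", "dest", "origin", "text", "pronunciation")

def translated_to_dict(translated):
    """Collect the main attributes of a Translated object into a dict."""
    # getattr with a default keeps the helper tolerant of missing attributes
    return {field: getattr(translated, field, None) for field in TRANSLATED_FIELDS}
```

Here src and dest are the language codes, origin is the input string, and text is the translation.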

So far, we have translated a single string into English, Italian, and German. Now, we want to achieve a much more challenging task: translate a complete data frame.

Translate a data frame

The data frame we are going to translate can be downloaded from the following link:

datos.madrid.es is Madrid's data service which contains nearly 500 data sets, covering a wide range of topics such as business, transport, tourism, or culture. Currently, more and more European cities provide an open data portal, allowing companies, citizens, researchers, and other public institutions to make use of the data. The data sets are usually provided in the language of the country. In the case of datos.madrid.es, most data sets are available in Spanish.

In this article, we translate the data set that contains information about students who attend Spanish courses for foreigners in the Municipal Office during the first semester of 2018. This data set includes information about the students such as the level of the course they attend, the sex, the age, the nationality, the level of education and the administrative status.

First, we download the csv file from datos.madrid.es. Then, we load it into a Pandas data frame using the pandas.read_csv function and visualize the first 5 rows using the pandas.DataFrame.head method.
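The loading step might look like the sketch below. The semicolon separator is an assumption about the file layout (the downloaded file may also need an encoding argument), and the two sample rows are invented stand-ins so that the example runs without the download; only the column names come from the data set.

```python
import io

import pandas as pd

def load_students(source, sep=";"):
    """Load the csv into a data frame; `sep` is an assumed delimiter."""
    return pd.read_csv(source, sep=sep)

# Invented sample rows standing in for the downloaded file
sample_csv = io.StringIO("Nivel;Sexo;Edad\nA1;Mujer;25\nA2;Hombre;31\n")

df = load_students(sample_csv)
print(df.head())  # first 5 rows (here only 2 exist)
```

With the real file, the path of the downloaded csv would be passed to load_students instead of the in-memory sample.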

The data set contains 8 columns: (1) Nivel, (2) Sexo, (3) Edad, (4) Situación adminitrativa, (5) Nacionalidad-País, (6) Area Geográfica, (7) Categoría profesional, and (8) Nivel de estudios.

Now, we start with the translation! First, we translate the column names by using the pandas.DataFrame.rename method as follows:
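One way to sketch this step is to build a mapping from each original column name to its translation and pass it to rename. The helper translate_columns and its translate parameter (any function that takes a string and returns a string) are hypothetical; passing the function in keeps the example runnable without network access.

```python
import pandas as pd

def translate_columns(df, translate):
    """Return a copy of `df` with every column name passed through `translate`."""
    mapping = {column: translate(column) for column in df.columns}
    return df.rename(columns=mapping)

# With googletrans (synchronous API assumed), `translate` could be:
#   translator = Translator()
#   translate = lambda s: translator.translate(s, src="es", dest="en").text
```

rename returns a new data frame by default, so the original column names are kept until the result is assigned back.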

Using the pandas.DataFrame.columns attribute, we can check that the translation was carried out correctly. In this article, we are just translating the column names, but further data cleaning will be needed (replacing spaces with underscores and upper case letters with lower case letters).
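The check and the clean-up mentioned above could be sketched like this. The English column names are hypothetical examples, not the actual output of googletrans.

```python
import pandas as pd

# Hypothetical translated column names on an empty data frame
df = pd.DataFrame(columns=["Level", "Sex", "Age", "Level of studies"])

# Check the translated names
print(df.columns.tolist())

# Clean-up: lower case letters and underscores instead of spaces
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)
print(df.columns.tolist())
```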

After translating the column names, we translate the rest of the data (cell values). First, we create a dictionary where the keys are the original terms in Spanish and the values are their English translations, which is the key-to-value direction that pandas.DataFrame.replace expects.
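A sketch of building such a dictionary, keyed by the original Spanish term so that it can be passed to pandas.DataFrame.replace. Each unique string is translated only once, which keeps the number of requests small. The helper build_translations and its translate parameter (a string-to-string function, e.g. a thin wrapper around Translator.translate) are hypothetical.

```python
import pandas as pd

def build_translations(df, translate):
    """Map every unique string cell value in `df` to its translation."""
    translations = {}
    for column in df.columns:
        # unique, non-missing elements of the column
        for term in df[column].dropna().unique():
            # only translate strings, and each string only once
            if isinstance(term, str) and term not in translations:
                translations[term] = translate(term)
    return translations
```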

Before changing the data frame, we check the translations made by googletrans. As we can observe, some terms are not properly translated (e.g. A2 or A2 DELE). In that case, we change them manually in the following manner:
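The manual correction can be sketched as a dictionary update. We assume here that level codes such as A2 and A2 DELE should simply be kept unchanged; the helper apply_manual_fixes and the placeholder value are hypothetical.

```python
def apply_manual_fixes(translations, fixes):
    """Return a copy of `translations` with the entries in `fixes` overriding it."""
    corrected = dict(translations)
    corrected.update(fixes)
    return corrected

# Placeholder standing in for a wrong automatic translation
translations = {"A2": "<bad translation>", "Mujer": "Woman"}

# Assumption: level codes are kept as they are
fixes = {"A2": "A2", "A2 DELE": "A2 DELE"}

translations = apply_manual_fixes(translations, fixes)
```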

Now, we can modify the data frame by using the pandas.DataFrame.replace method, passing the previously created dictionary as its input.
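A small sketch of this last step, using invented sample rows and a hand-written dictionary in place of the one produced by googletrans. A flat dictionary passed to replace maps whole cell values (not substrings) to their replacements across all columns.

```python
import pandas as pd

# Invented sample rows; the real data frame comes from the csv file
df = pd.DataFrame({"Sexo": ["Mujer", "Hombre"], "Nivel": ["A2", "A2 DELE"]})

# Hand-written stand-in for the translation dictionary (Spanish -> English)
translations = {"Mujer": "Woman", "Hombre": "Man"}

# Cells equal to a key are replaced by the corresponding value
df_en = df.replace(translations)
print(df_en)
```

Values without an entry in the dictionary, such as the level codes here, are left untouched.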

data frame in English

As we can observe, the entire data set is translated 💪 Job done :)

Interesting readings

Thanks for reading!! 🍀
