Language localization: an end-to-end project on data science and FIFA 20

Exploratory data analysis and Twitter sentiment analysis with Python and Plotly visualizations.

Héctor Ramírez, Ph.D.
Towards Data Science


“What is language localization?” you may ask.

Language localization is the process of adapting a product’s translation to a specific country or region. It is the second phase of a larger process of product translation and cultural adaptation (for specific countries, regions, cultures or groups) to account for differences in distinct markets, a process known as internationalization and localization. As explained here:

The localization process is most generally related to the cultural adaptation and translation of software, video games, and websites, as well as audio/voiceover, video, or other multimedia content, and less frequently to any written translation (which may also involve cultural adaptation processes). Localization can be done for regions or countries where people speak different languages or where the same language is spoken.

So at some point, I was referred to this project: Imagine that Electronic Arts (EA) wants to know which language would be a good target for translating future versions of the FIFA videogame. This task is assigned to you, the data scientist on duty, and you have no resources but a Kaggle dataset with all the players’ attributes from the game.

With no official data or any previous research, we have to make our way with the players dataset and perhaps gather our own data. But there’s nothing to fear: this is what a data scientist has to deal with in almost every new project.

TL;DR: Along the way, we use the full dataset of the attributes and skills of the players in the game. This dataset contains an attribute called international reputation (IR) which tells how well-known a player is, internationally speaking. We draw conclusions based on the assumption that a large number of players of the same nationality and with high IR might influence the playability of the game in their country.

Apart from this, we gather and analyze a collection of tweets mentioning the game. By computing the sentiment score of the tweet text, we can identify locations where people speak more positively about the game. This, of course, could provide better insights for the localization of the game.

This story keeps code to a minimum for readability. The full Python source code and the results can be found in the following repository: https://github.com/hectoramirez/Language-localization_FIFA

FIFA 20 languages

We start off by studying the languages already included in FIFA 20. According to the official site, FIFA 20 is currently playable in the following 21 languages, with region-specific commentary provided:

Arabic, Czech, Danish, German (Germany), English (American), Spanish (Spain), Spanish (Mexico), French (France), Italian, Japanese, Korean, Dutch, Norwegian, Polish, Portuguese (Brazil), Portuguese (Portugal), Russian, Swedish, Turkish, Chinese (simplified), Chinese (traditional).

Although this is only a small subset of the world’s (and the players’) languages, we can be fairly sure that most countries are covered thanks to the popularity of those languages. To see this, I constructed a dataframe by joining the data from this public repository. The head looks like this:

I then filtered the languages included in the game and located the associated countries on a map (here and throughout, I used Plotly Express for the maps):

In the map, only one language is associated with each country; what we are really interested in, however, are the countries with no language assigned.

We can see that, except for the Balkans, Southeast Asia, and a couple of other countries, most of the world is fairly well covered. We would therefore like to look into the countries left out, and perhaps into second (or regional) languages spoken in the countries already covered.
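The filtering and mapping step can be sketched roughly as follows. This is only an illustration on a toy table: the column names (`country`, `iso_alpha3`, `language`) and the truncated language set are assumptions, not the ones used in the repository.

```python
import pandas as pd

# Hypothetical toy version of the countries/languages dataframe;
# the column names are assumptions for illustration only.
countries = pd.DataFrame({
    "country": ["Spain", "Mexico", "Croatia", "Japan", "Hungary"],
    "iso_alpha3": ["ESP", "MEX", "HRV", "JPN", "HUN"],
    "language": ["Spanish", "Spanish", "Croatian", "Japanese", "Hungarian"],
})

# A few of the 21 game languages listed earlier (truncated for brevity).
game_languages = {"Spanish", "Japanese", "German", "English"}

# Split countries by whether their language is already in the game.
is_covered = countries["language"].isin(game_languages)
covered = countries[is_covered]
uncovered = countries[~is_covered]


def plot_coverage(df):
    """Draw a world map coloring each country by its language."""
    import plotly.express as px  # imported lazily; only needed for plotting

    return px.choropleth(
        df,
        locations="iso_alpha3",  # ISO-3 codes are Plotly's default location mode
        color="language",
        hover_name="country",
    )

# plot_coverage(covered).show()  # uncomment to open the interactive map
```

Countries that end up in `uncovered` are the blank areas on the map, which is where the rest of the analysis focuses.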

The FIFA 20 players dataset

The FIFA 20 players dataset can be obtained from this Kaggle repository. As the description states, the dataset — made of 103 columns and 18278 entries — contains:

  • 100+ attributes.
  • URL of the scraped player.
  • Player positions, with the role in the club and in the national team.
  • Player attributes with statistics as attacking, skills, defense, mentality, GK skills, etc.
  • Player personal data like nationality, club, date of birth, wage, salary, etc.

Multiple kinds of analyses can be performed from this dataset. For the purposes of this analysis, however, we may only need a player’s nationality and club. Let’s also keep the short name for identification, and the overall and international reputation scores as they surely can provide some insights. The head of the dataframe looks like:

Now, one of the key ideas in this analysis is that the more popular a player is, the more representative he is of his country. In other words, a country with more good players is likely to play the game more. Take Croatia, for example: in the last World Cup they finished second with a team of well-recognized players; it would make sense that this fact influences the playability of the game across the country, and that perhaps Croatian is a language we are looking for.


We have the international reputation field to measure this influence. However, first we will need to add a language field to the data. For this, we use the countries/languages dataframe to associate a language to a player’s nationality.
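A minimal sketch of this join, assuming a players table with `short_name`, `nationality` and `international_reputation` columns and a country/language table with one row per (country, language) pair. The player names and IR values are placeholders, and the column names should be treated as assumptions.

```python
import pandas as pd

# Placeholder players; names and IR values are made up for illustration.
players = pd.DataFrame({
    "short_name": ["Player A", "Player B", "Player C"],
    "nationality": ["Croatia", "Spain", "Paraguay"],
    "international_reputation": [4, 3, 1],
})

# One row per (country, language) pair, so multilingual countries repeat.
country_languages = pd.DataFrame({
    "country": ["Croatia", "Spain", "Spain", "Paraguay", "Paraguay"],
    "language": ["Croatian", "Spanish", "Catalan", "Spanish", "Guarani"],
})

# A left join keeps every player and attaches each language of his country.
players_lang = players.merge(
    country_languages,
    left_on="nationality",
    right_on="country",
    how="left",
).drop(columns="country")

# Players per language and IR level: the numbers behind a countplot.
counts = (
    players_lang.groupby(["language", "international_reputation"])
    .size()
    .rename("players")
    .reset_index()
)
```

Note that a player from a multilingual country appears once per language after the merge, so the counts are per (player, language) pair rather than per player.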

Afterwards, we can look at a countplot to see how many players are associated with each language. Remember that we are interested in languages with more players with a high international reputation (IR). Here is the full set of languages by IR:

Top languages

Notice that IR is an attribute that takes a discrete value from 1 to 5. Now, the thing is that most of those languages are already included in the game. Keeping only the languages not yet included, we get:

Top languages not yet included in FIFA

Remember that we associated every language spoken in a country with that country. Therefore, some players are counted more than once: once for each language spoken in their country.

In order to filter out some languages we note the following facts regarding the top countries:

  • First, we are assuming that all Spanish players speak Basque, Catalan, and Galician, which is clearly not true (and the same holds for many other countries). In fact, Catalan (or Valencian) is spoken by 19%, Galician by 5%, and Basque by 2% of the population in Spain. [Ref.]
  • Occitan, spoken in Southern France, Monaco, Italy’s Occitan Valleys, as well as Spain’s Val d’Aran, has a range from 100,000 to 800,000 total native speakers. [Ref.]
  • Guarani is an indigenous language of South America. It is one of the official languages of Paraguay, where it is spoken by the majority of the population. It is also spoken by communities in neighboring countries, including parts of northeastern Argentina, southeastern Bolivia and southwestern Brazil, and is a second official language of the Argentine province of Corrientes. It has 4.85 million (cited in 1995) native speakers though. [Ref.]
  • Nynorsk is one of the two written standards of the Norwegian language. [Ref.]
  • Although English is the more common first language elsewhere in Ireland, Irish is spoken as a first language in substantial areas. The official status of the Irish language remains high in the Republic of Ireland, and the total number of people who answered ‘yes’ to being able to speak Irish in April 2016 was 1,761,420, which represents 39.8 per cent of respondents out of a population of 4,921,500 (2019 estimate) in the Republic of Ireland. In Northern Ireland 104,943 identify as being able to speak Irish out of a population of 1,882,000 (2018 estimate). [Ref.]

Taking these facts into account, we make the following decisions:

  • For Catalan speakers, we will only keep Spanish players playing in FC Barcelona, mainly due to their policy of teaching Catalan to their players. [Ref.]
  • For Guarani speakers, we will only keep Paraguayan players, as Paraguay is the only one of these countries where the language is spoken by the majority of the population.
  • We keep all Irish speakers.
  • We drop Basque, Galician, Occitan and Nynorsk.
  • We keep all the remaining languages with fewer speakers.
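These rules can be expressed as a single predicate over (player, language) records. The sketch below uses plain dictionaries with hypothetical keys (`nationality`, `club`, `language`) and placeholder players and clubs; it is one way to encode the list above, not necessarily how the repository does it.

```python
# Languages dropped outright, per the list above.
DROPPED = {"Basque", "Galician", "Occitan", "Nynorsk"}


def keep_row(row):
    """Return True if a (player, language) record survives the rules above.

    `row` needs the (hypothetical) keys: nationality, club, language.
    """
    language = row["language"]
    if language in DROPPED:
        return False
    if language == "Catalan":
        # Only Spanish players at FC Barcelona count as Catalan speakers.
        return row["nationality"] == "Spain" and row["club"] == "FC Barcelona"
    if language == "Guarani":
        # Only Paraguayan players count as Guarani speakers.
        return row["nationality"] == "Paraguay"
    return True  # Irish and every other language are kept as they are


# Placeholder records covering each rule.
records = [
    {"short_name": "Player A", "nationality": "Spain", "club": "FC Barcelona", "language": "Catalan"},
    {"short_name": "Player B", "nationality": "Spain", "club": "Club X", "language": "Catalan"},
    {"short_name": "Player C", "nationality": "Argentina", "club": "Club Y", "language": "Guarani"},
    {"short_name": "Player D", "nationality": "Paraguay", "club": "Club Z", "language": "Guarani"},
    {"short_name": "Player E", "nationality": "Spain", "club": "Club W", "language": "Basque"},
    {"short_name": "Player F", "nationality": "Republic of Ireland", "club": "Club V", "language": "Irish"},
]
kept = [r["short_name"] for r in records if keep_row(r)]
```

With a pandas dataframe holding the same columns, the same predicate could be applied row-wise with `df[df.apply(keep_row, axis=1)]`.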

This leaves us with a fair set of players with distinct nationalities and languages, which we now want to group by international reputation.

International reputation

We now want to make an analysis of the players’ IR given their language.

International Reputation, also known as International Recognition, is an attribute that affects the player’s rating according to his club’s local and international prestige. It is based essentially on the popularity, history and results of them both. Basically, IR was created in order to adjust the players’ rating relatively to everything that doesn’t actually have to do with his technical, physical and mental capacities. It converges artificially so that the players who have the most fans around the world always get the highest ratings, but in practice there is no real effect. [Ref.]

By averaging the IR for each language, we finally arrive at a list of the languages not yet included in the video game, ordered by the players’ mean international reputation. Take a look:

Top player’s languages sorted by IR

Here we made several changes. First, we dropped languages with fewer than 15 players (count) in order to avoid bias. We also normalized the IR score so that it now ranges across [0,1]. Finally, we appended the most representative countries and country codes to each language (this is for plotting purposes).
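A rough sketch of these steps on toy data. The text does not say which normalization was used, so min-max scaling of the per-language means is an assumption here, as are the column names and the lowered player threshold.

```python
import pandas as pd

# Toy (player, language) table; labels and values are made up for illustration.
players_lang = pd.DataFrame({
    "language": ["Lang A"] * 3 + ["Lang B"] * 2 + ["Lang C"],
    "international_reputation": [3, 2, 1, 1, 2, 1],
})

MIN_PLAYERS = 2  # the article uses 15; lowered here to fit the toy data

# Player count and mean IR per language.
stats = players_lang.groupby("language")["international_reputation"].agg(
    n_players="count", mean_ir="mean"
)

# Drop languages backed by too few players.
stats = stats[stats["n_players"] >= MIN_PLAYERS]

# Min-max scaling to [0, 1]: one possible normalization (an assumption).
ir = stats["mean_ir"]
ranking = stats.assign(
    ir_norm=(ir - ir.min()) / (ir.max() - ir.min())
).sort_values("ir_norm", ascending=False)
```

Min-max scaling makes the top language score exactly 1 and the bottom one exactly 0; it is undefined when all means are equal, which a real pipeline would need to guard against.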

This gives us the languages not yet included in the video game, ordered by the players’ mean IR. Looking at it, we can conclude that Catalan, Slovenian, Croatian, Bosnian and Hungarian are the most prominent languages.

Let’s finally locate these languages on a map:

Notice how the Balkan countries are the most prominent candidates for language localization given the players’ IR.

Social media analysis

In this second part, we perform a social media analysis to obtain insights about how well people talk about the game, specifically, on Twitter.

Tweets can be collected with tweepy, a Python library for accessing the Twitter API. I don’t intend to give a tutorial on how to collect tweets; however, it is quite straightforward by following the Docs. You can also take a look at my script, but remember that you need to use your own credentials.

Each tweet object comes in JSON format, a mix of ‘root-level’ attributes, and child objects (which are represented with the {} notation). The Twitter developer page gives the following example:

{
"created_at": "Wed Oct 10 20:19:24 +0000 2018",
"id": 1050118621198921728,
"id_str": "1050118621198921728",
"text": "To make room for more expression, we will now count all emojis as equal—including those with gender‍‍‍ ‍‍and skin t… https://t.co/MkGjXf9aXm",
"user": {},
"entities": {}
}

This is of course a small sample out of the huge dictionary composing each tweet.

And it is clear that, for our purposes, we only need a few of these attributes: the text, the text language and the tweet location. Unfortunately, these attributes don’t come in a clean format; instead, they are spread across the JSON levels. For example, the tweet location coordinates are located in

tweet_object['place']['bounding_box']['coordinates']
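Since a tweet without an attached place carries a null `place`, a chained lookup like the one above raises an error on most tweets. A small defensive helper avoids that; `get_nested` is a hypothetical name, and the tweet below is a trimmed-down example with made-up values.

```python
def get_nested(obj, *keys, default=None):
    """Walk nested dicts, returning `default` if any level is missing or None."""
    for key in keys:
        if not isinstance(obj, dict) or obj.get(key) is None:
            return default
        obj = obj[key]
    return obj


# Trimmed-down tweet: only the fields discussed here, with made-up values.
tweet_object = {
    "text": "Loving #FIFA20",
    "lang": "en",
    "place": {
        "country": "Croatia",
        "bounding_box": {
            "coordinates": [[[15.9, 45.7], [16.1, 45.7], [16.1, 45.9], [15.9, 45.9]]]
        },
    },
    "user": {"location": "Zagreb"},
}

# Flatten the few attributes we care about into a single-level dict.
flat = {
    "text": tweet_object.get("text"),
    "lang": tweet_object.get("lang"),
    "place_country": get_nested(tweet_object, "place", "country"),
    "place_coords": get_nested(tweet_object, "place", "bounding_box", "coordinates"),
    "user_location": get_nested(tweet_object, "user", "location"),
}

# A tweet with no place attached: the lookup returns None instead of failing.
no_place = {"text": "hi", "place": None}
missing = get_nested(no_place, "place", "bounding_box", "coordinates")
```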

Because of this nesting, the collected tweets need an extensive cleaning and transformation process. Indeed, I wrote a separate story explaining the process of flattening the dictionary; selecting, cleaning, translating and calculating the sentiment of the text; cleaning the text language field; and assigning each tweet a proper location using the place field or the user location field. You can find the story here:

Once we have a processed tweets dataframe, we are ready to continue our study. Notice that Twitter does not allow tweet data to be made publicly available, so I’m not publishing any sample of the dataframe, just a one-row snapshot to show you the fields (which we will explain throughout the text):

I collected 52830 tweets like the one in the image over the course of several days, containing the following keywords: ‘#FIFA20’, ‘#FIFA21’, ‘FIFA20’, ‘FIFA21’, ‘FIFA 20’, ‘FIFA 21’ and ‘#EASPORTSFIFA’ (also, keep in mind that Twitter does not allow collecting past tweets). As mentioned, all these tweets were processed according to our needs here. Therefore, the final dataset comes with relevant data regarding the location of the tweet (country and coordinates), the sentiment of the English version of the text (which varies in the range [-1,1]), and the language the text was tweeted in. Keep in mind, however, that this language is “detected” and thus could be wrongly assigned, as we’ll see.

In the following, we will follow four paths:

  1. We will study the tweets with exact location (with coordinates) as they are the most reliable tweets regarding their location.
  2. We then study the tweets by the language they are written in.
  3. Then, by the manually-set user-location.
  4. Finally, we go back to the most prominent languages by international reputation (obtained in the last section) and study the sentiment of the tweets coming from countries where those languages are spoken.

Tweets with coordinates (exact location)

Unfortunately, only around 1% of the tweets in the world come with exact geolocation mainly because the user must enable this option.

In our 52830-tweet dataset, only 485 tweets provide a coordinate-level location.

By filtering these tweets, we can group them by language and compute their mean sentiment. This is what we found:

Tweets’ languages (with exact location) sorted by mean sentiment

Looking at the dataframe, we can clearly see that all languages but Tagalog / Filipino, Indonesian and Czech are already included in the game. However, the count and sentiment of these are not really informative. Therefore, we can conclude that this sample of tweets does not provide enough insight, and thus we need to look into the other tweet attributes.

Tweets by tweet language

Let’s move on to the language field. This field was taken from the lang attribute in the Twitter JSON and converted to the standard language name. As mentioned in the documentation:

When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected.

So we group the dataset by this column and compute the mean sentiment by language. Furthermore, to analyze the localization of these languages, we assign each of them the country where that language is tweeted the most. Here is what we found:

Tweets detected languages and location sorted by mean sentiment
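A table like this one can be built with a single pass over the processed tweets. The sketch below uses only the standard library and made-up (language, country, sentiment) triples; the field layout is an assumption.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up (language, country, sentiment) triples standing in for processed tweets.
tweets = [
    ("Spanish", "Mexico", 0.4),
    ("Spanish", "Mexico", 0.1),
    ("Spanish", "Spain", -0.2),
    ("Turkish", "Turkey", 0.3),
]

countries_by_lang = defaultdict(Counter)  # language -> tweet count per country
sentiments_by_lang = defaultdict(list)    # language -> list of sentiment scores
for language, country, sentiment in tweets:
    countries_by_lang[language][country] += 1
    sentiments_by_lang[language].append(sentiment)

summary = {
    language: {
        "count": len(scores),
        "mean_sentiment": mean(scores),
        # most_common(1) returns [(country, n)] for the most frequent country
        "top_country": countries_by_lang[language].most_common(1)[0][0],
    }
    for language, scores in sentiments_by_lang.items()
}

# Languages ordered from most to least positive mean sentiment.
ranked = sorted(summary, key=lambda lang: summary[lang]["mean_sentiment"], reverse=True)
```

Assigning a language to its most frequent country is only as good as the language labels themselves, which is why a mislabeled language can land in an implausible country.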

Let’s look at some anomalies: according to this table, Haitian is mostly tweeted from Ghana; Lithuanian, Estonian and Tagalog / Filipino, from Brazil; and Slovenian from Zimbabwe! This simply means that the language detector does not work well on these tweets.

These anomalies are too many to be dismissed as outliers, so we can conclude that the Twitter lang attribute is not reliable enough for our analysis, and thus we need to rely on the manually-set user-location field.

Tweets by user location

As previously mentioned, the user-location field in the Twitter JSON is a field that the user fills with raw text. This means that the field may or may not give an actual place and, if it does, it could be a city, state or province rather than a country.

To overcome this issue, in the process of cleaning this field, we used GeoPy to identify a location (which could be an address) and assign a country to it. Therefore, in the dataset, the location field provides GeoPy's "detected" country. However, the non-null values amount to only about half of the full dataset, meaning that for the rest either the user didn’t provide a location or the text provided didn’t contain an actual location.
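The lookup can be sketched as a small function that takes any geocoder callable. The text does not say which GeoPy geocoder was used, so Nominatim is an assumption here, and the function names are hypothetical. A fake geocoder is included so the logic can be tried offline.

```python
def country_from_text(text, geocode):
    """Return the country for a free-text location, or None if unresolved.

    `geocode` is any callable that takes a string and returns either None or
    an object whose `.raw["address"]["country"]` holds the country name.
    """
    if not text or not text.strip():
        return None
    location = geocode(text)
    if location is None:
        return None
    return location.raw.get("address", {}).get("country")


def make_nominatim_geocoder(user_agent):
    """Build a real geocoder with GeoPy (needs the package and network access)."""
    from functools import partial
    from geopy.geocoders import Nominatim  # assumption: Nominatim as the backend

    geolocator = Nominatim(user_agent=user_agent)
    # addressdetails=True asks Nominatim to include the address breakdown.
    return partial(geolocator.geocode, addressdetails=True, language="en")


# Offline stand-in mimicking the shape of a geocoding result.
class FakeLocation:
    def __init__(self, country):
        self.raw = {"address": {"country": country}}


def fake_geocode(text):
    known = {"Zagreb": "Croatia", "Dublin, Ireland": "Ireland"}
    return FakeLocation(known[text]) if text in known else None


resolved = [country_from_text(t, fake_geocode) for t in ["Zagreb", "the moon", ""]]
```

Passing the geocoder in as an argument also makes it easy to add caching or rate limiting around the real service, which matters when resolving tens of thousands of user locations.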

We group the set by location and again compute the mean sentiment. As we are actually interested in languages, we add the languages spoken in those countries and then follow the steps in the last section to remove the languages already included in the game and the others we discarded (Basque, Galician, Occitan, etc.). We also remove entries with fewer than 20 tweets (count) in order to avoid bias. Here is what we found:

Tweets languages and user location sorted by mean sentiment

This dataset is clearly more informative than in the previous cases!

Among the top languages, we can highlight Bulgarian, Hebrew, Hindi and Irish. Maori in New Zealand and Aymara in Bolivia might not be very interesting given that they are spoken by indigenous populations. Notice as well that all countries but Greece and the Czech Republic have a positive mean sentiment.

Let’s finally locate these languages on a map:

We obtained the top countries where people are apparently speaking positively about the game. However, the languages spoken in those countries are different from those found when analyzing the players’ international reputation. Given this fact, let’s check how the languages from the last section scored in the tweets set.

Sentiment of the top languages by international reputation

In the previous section we obtained the top languages by the players’ international reputation. To finish the analysis, we would like to know the sentiment of those languages and see whether there is a language that stands out from the others by both IR and sentiment.

We then merge the “Top player’s languages sorted by IR” dataset with the “Tweets languages and user location sorted by mean sentiment” dataset to obtain:

Top languages by both IR and sentiment, sorted by sentiment

The above dataframe is sorted by sentiment and the first column represents the position of the languages by international reputation. This allows for an easy and fast comparison of both cases.

Perhaps the best case is Hebrew, which is second by sentiment, with a well-above-average sentiment score, and 10th by IR. Ukrainian comes next. On the other hand, Catalan previously had the best IR by far, but it does not fare well by sentiment, although let’s remember that this score covers the whole of Spain and not only Catalonia.
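A merge of this kind can be sketched as follows, using toy tables with made-up labels and numbers. The column names (`ir_norm`, `mean_sentiment`, `ir_rank`) are assumptions.

```python
import pandas as pd

# Toy tables with made-up labels and numbers; column names are assumptions.
by_ir = pd.DataFrame({
    "language": ["Lang A", "Lang B", "Lang C"],
    "ir_norm": [1.0, 0.6, 0.2],
})
by_sentiment = pd.DataFrame({
    "language": ["Lang C", "Lang B", "Lang D"],
    "mean_sentiment": [0.30, 0.10, 0.35],
})

# Rank by IR first (1 = highest), so the position survives the re-sorting.
by_ir["ir_rank"] = by_ir["ir_norm"].rank(ascending=False, method="first").astype(int)

# An inner join keeps only the languages present in both tables.
combined = (
    by_ir.merge(by_sentiment, on="language", how="inner")
    .sort_values("mean_sentiment", ascending=False)
    .set_index("ir_rank")
)
```

Using the IR rank as the index puts the position by IR in the first column, so a language that ranks high on both criteria shows up as a small index value near the top of the table.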

Let’s finally look at these languages on a map:

Conclusions

In this project, we aimed to give insights on the language localization of the FIFA videogame.

We started by showing the countries where one or more of the languages already included in the FIFA game are spoken. We noticed that the Balkan countries and Southeast Asia are not covered by the currently available languages.

Then in the first section, we localized languages not included in the game and highlighted them by the international reputation (IR) of players who speak that language. This illustrates that languages like Catalan, Slovenian, Croatian, Bosnian or Hungarian are spoken by well-known players and this could influence the playability of the game in those countries/regions.

Regarding the social media analysis, only a small portion of the world’s tweets (almost 1%) contain exact geolocation. In our dataset, they sum up to almost 500 tweets. We showed that this subset is not large enough to draw conclusions.

We then processed the tweets by the manually-set user location, found the corresponding country and associated a main language with that country. We then located those languages on a map and colored them by mean sentiment. We found that Bulgarian, Hebrew, Hindi and Irish are the languages (not included in the game) of the countries where people speak most positively.

Finally, we found that some languages stand out by IR whereas others stand out by sentiment. In the last part, we aimed to select one or more languages that stand out by both attributes. Interestingly, Hebrew, which ranks second by sentiment with a well-above-average score, ranks 10th by IR and is thus the most prominent language to target.

Next steps would involve using non-public data, or EA’s official data, to make a better-informed decision.
