What App Descriptions Tell Us: Text Data Preprocessing in Python

Applying NLP to app descriptions on the App Store

Finn Qiao
Towards Data Science


Continuing with the theme of data cleaning and exploration, much of effective NLP analysis depends on the preprocessing of textual data. I have thus decided to perform step-by-step preprocessing of some textual data derived from Apple App Store descriptions, followed by K-Means clustering of the resulting text.

Why the App Store? This dataset from Kaggle contains 7197 apps with all their respective app descriptions. App descriptions are where app creators try their best to “sell” their app. With the preprocessed data, I examine the question of “Are app descriptions good predictors of app genres?”.

The preprocessing “framework” I use here is as follows:

  1. Translations*
  2. Remove non-alphabetic characters
  3. Convert all to lower case
  4. Tokenization
  5. Remove stop words
  6. Stemming**
  7. Analysis

* I wanted to translate all the descriptions first as a regex for non-alphabetic characters would have removed languages like Japanese and Chinese.

** I left out lemmatization in this particular case as I wanted to look at lexical diversity later.

As usual, we start off by reading in the csv files with the relevant data. Here, we merge the dataframe with basic app information with another dataframe with app descriptions.
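The read-and-merge step can be sketched as follows. The file and column names come from the Kaggle “Mobile App Store” dataset, but small in-memory stand-ins are used here so the sketch runs on its own; adjust the names if your copy of the data differs.

```python
import io
import pandas as pd

# In-memory stand-ins for the two Kaggle CSVs (app info and app descriptions)
apps_csv = io.StringIO("id,track_name,prime_genre\n1,ChessMate,Games\n2,SkyCast,Weather\n")
desc_csv = io.StringIO("id,app_desc\n1,Play chess online\n2,Hourly weather forecasts\n")

apps = pd.read_csv(apps_csv)          # basic app information
descriptions = pd.read_csv(desc_csv)  # app descriptions

# Merge on the shared app id so each row carries both metadata and its description
df = apps.merge(descriptions, on='id', how='inner')
print(df.shape)  # (2, 4)
```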

As the datatypes for the feature variables are largely appropriate, let’s take a quick look at the underlying data.

Unsurprisingly, most of the apps on the App Store are games, with around 54% of the dataset. The next most representative genres are ‘Entertainment’ and ‘Education’, far behind at around 7% and 6%.

The highest rated genres on average are ‘Productivity’ and ‘Music’. It’s interesting to note how ‘Catalogs’, ‘Finance’, and ‘Book’ are by far the lowest rated apps in the App Store.

Preprocessing Process

1 Translating to English

Preprocessing starts by translating all the app descriptions to English. Here, we use the googletrans package to make an API call to Google Translate. Unfortunately this API call has a 15k character limit. While not the most elegant of solutions, we can work around this limitation by reinitializing the translator for each row in the dataframe.

To avoid the needless operation of making an API call when the language is already English, we can use the langdetect package for a conditional check and only make the API call when the detected language is not 'en'.

There were 46 apps with descriptions that raised errors. While length was the issue for some, the errors could not be identified for others. Since this is a small portion of the dataset, we can drop these apps.
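The translate-only-when-needed check can be sketched as below. The real pipeline calls langdetect.detect and googletrans.Translator, which need network access; tiny labeled stand-ins are used here so the control flow runs on its own.

```python
# Stand-in for langdetect.detect: pretend ASCII text is English.
def detect(text):
    return 'en' if text.isascii() else 'ja'

# Stand-in for the googletrans call; in the real pipeline a fresh Translator
# per row works around the 15k-character API limit.
def translate_to_english(text):
    return '<translated> ' + text

def ensure_english(description):
    # Skip the costly API call when the description is already English
    if detect(description) == 'en':
        return description
    return translate_to_english(description)

print(ensure_english('A simple note-taking app'))  # unchanged
print(ensure_english('シンプルなメモアプリ'))          # goes through translation
```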

2 Removing non-alphabetic characters

Now that we are sure that languages such as Japanese or Chinese will not be filtered out, we create the regular expression [^a-zA-Z ], which matches everything except alphabetic characters and spaces. Anything it matches is removed with the re.sub() method inside the helper function cleaned(). The spaces are kept so that when tokenization occurs, the string can be split on them.

3 Convert to lower case

Another step in normalizing the text data is by converting all the characters to lower case. This is as simple as running str.lower() on the target column of the dataframe.
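Steps 2 and 3 can be sketched together in one small helper (a minimal version of the cleaned() function mentioned above):

```python
import re

# Keep letters and spaces, drop everything else, then normalize to lower case
def cleaned(text):
    letters_only = re.sub(r'[^a-zA-Z ]', '', text)
    return letters_only.lower()

print(cleaned('Best Photo-Editor 2018!!'))  # 'best photoeditor '
```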

4, 5, 6 Tokenization, removing stop words, and stemming

As the next 3 steps are achieved with the nltk package, I created an aggregate helper function for them.

Tokenization refers to splitting a long string into smaller chunks, or tokens. It is quite similar to running a split function on a string to return a list of individual components based on a defined separator. We tokenize the string here with word_tokenize in the nltk package.

Stop words are words that are filtered out as they do not contribute much to the overall meaning of the text. These include words like ‘a’, ‘to’, ‘and’. We get a set of stop words for the English language through stopwords.words('english') in the nltk package.

Stemming refers to the removal of affixes from a word. For example, ‘climbing’ becomes ‘climb’. We initialize a stemmer with SnowballStemmer('english') in the nltk package.

Example of sanitized output

The helper function below will first tokenize the string. Then it checks if each token is a stop word. Finally, it runs the stemmer on the word if it is not a stop word and appends the word to a list.
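A simplified stand-in for that helper is sketched below. The article’s version uses nltk’s word_tokenize, stopwords.words('english'), and SnowballStemmer('english'); here a whitespace split, a tiny stop-word set, and a crude suffix stripper illustrate the same tokenize → filter → stem flow without needing the nltk corpora.

```python
# Tiny stop-word set standing in for nltk's stopwords.words('english')
STOP_WORDS = {'a', 'an', 'and', 'the', 'to', 'of', 'in', 'your'}

def crude_stem(word):
    # Naive affix removal standing in for the Snowball stemmer
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize_filter_stem(text):
    tokens = text.split()                 # 4. tokenization
    result = []
    for token in tokens:
        if token in STOP_WORDS:           # 5. stop-word removal
            continue
        result.append(crude_stem(token))  # 6. stemming
    return result

print(tokenize_filter_stem('climbing to the top of your game'))
# ['climb', 'top', 'game']
```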

7 Analysis

Let’s now dive into the sanitized data.

7.1 Word Cloud Generation

What are the most common words used for apps in a particular genre? While a simple word count or TF-IDF vectorizer would return a ranking of the words, the top words can be visualized more effectively in a word cloud.

To generate a word cloud for each genre, I created a corpus (or collection) of the sanitized app descriptions for each one. Luckily, there was a wordcloud package that conveniently generated wordclouds from a given corpus.

The wordcloud package works by creating a list of the top 200 words for a corpus and an accompanying list of normalized word counts for each respective word. The Python Image Library is then used to draw the wordcloud. This summary does not do the author’s code justice and you can read more about it here.
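The core frequency computation the package performs can be sketched in a few lines (the actual package additionally handles layout and drawing via PIL):

```python
from collections import Counter

# Top words of a corpus, with counts normalized against the most frequent word
def top_word_frequencies(corpus_tokens, max_words=200):
    counts = Counter(corpus_tokens).most_common(max_words)
    top_count = counts[0][1]
    return [(word, count / top_count) for word, count in counts]

tokens = ['weather', 'forecast', 'weather', 'radar', 'weather', 'forecast']
print(top_word_frequencies(tokens))
# 'weather' appears most often, so it normalizes to 1.0
```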

Wordclouds for games and weather apps
Wordclouds for shopping and music apps

Taking a small sample of the wordclouds for ‘Games’, ‘Weather’, ‘Shopping’, and ‘Music’, we see that the most prominent words are indeed what we would expect for the particular genres. While the distinction seems quite clear for these genres, it is a little fuzzy for others, something we will get back to during clustering.

7.2 Lexical Diversity

Do different genres like ‘Games’ naturally result in more colorful language and descriptions? I seek to answer this by using a simple formula to determine lexical diversity: the number of unique words in each list of filtered words divided by the total number of words for that description. The higher the number, the more diversified the vocabulary.
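The formula is simple enough to state directly in code:

```python
# Lexical diversity: unique tokens divided by total tokens
def lexical_diversity(tokens):
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 'dragon' repeats, so 4 unique words out of 5 total
print(lexical_diversity(['slay', 'dragon', 'collect', 'dragon', 'egg']))  # 0.8
```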

As expected, the average lexical diversity score is the highest for ‘Games’ and ‘Book’. This is likely due to the diversity of sub genres for both genres and the tendency for such descriptions to be more “engaging”.

Descriptions for catalogs are unsurprisingly dull.

7.3 Sentiment Analysis


Are the app descriptions for some genres more positive than those of other genres? I decided to explore this by running sentiment analysis with the TextBlob package.

A sentiment polarity of 0 indicates a neutral sentiment, a polarity below 0 indicates negative sentiment, and a polarity above 0 indicates (you guessed it) positive sentiment.
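A toy scorer illustrating this polarity convention is sketched below. The article itself uses TextBlob’s pattern-based analyzer; here a tiny hand-made lexicon stands in for it, with word scores in [-1, 1] averaged over the description.

```python
# Tiny illustrative lexicon (not TextBlob's real one)
LEXICON = {'fun': 0.6, 'great': 0.8, 'scary': -0.6, 'war': -0.5}

def polarity(tokens):
    # Average the scores of the words we recognize; unknown-only text is neutral
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity(['great', 'fun', 'puzzle']))  # 0.7  -> positive
print(polarity(['scary', 'war', 'game']))    # -0.55 -> negative
```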

The categories of ‘Games’, ‘Finance’, and ‘Medical’ have the lowest average sentiment polarities. ‘Games’ could have very low sentiment polarities if the games contained distressing themes such as war or horror (themes that are well represented). The descriptions of ‘Finance’ and ‘Medical’ could be low as the app might be describing an unfavorable financial or medical condition.

There seem to be a few tear-jerkers (John Green) making up the outliers for books. Mentions of ‘capturing happier times’ seems to make up the outliers for ‘Photo & Video’ apps.

7.4 K-Means Clustering

Now to the question we were attempting to answer at the start, “Are App descriptions good predictors of app genres?”.

To answer this, I apply K-means clustering on the word vectors to see how well they map to actual categories.

When we run the clustering with 10 clusters, the top 10 words in the resulting clusters are as follows:

It seems at first glance that some clusters can be labelled:

Cluster 0: Apps for children

Cluster 3: Music apps

Cluster 4: Games involving war/fighting/monsters

Cluster 6: Puzzle games

Cluster 7: Photo and video apps

Cluster 9: Car-related apps

So how do these clusters map to the actual app genres?

The labels above seem largely representative but there seems to be an issue with games being overrepresented in the sample. The shades for other categories are too faded to derive any meaningful insight. The same heat map is generated again without apps in the ‘Games’ genre.

Without the heavily represented ‘Games’ genre, the representation of the other genres is now much clearer. Some obvious clusters are cluster 3 for ‘Education’, cluster 4 for ‘Finance’ and ‘Shopping’, cluster 7 for ‘Weather’, and cluster 9 for ‘Photo & Video’.

What if we increased the number of clusters to a number close to the total number of genres (22 without games)?

Let’s also cross-compare these mappings with the top words in each cluster:

With 20 clusters, the well defined clusters seem to be the following:

Cluster 0: Shopping

Cluster 9: Health & Fitness

Cluster 10: Music

Cluster 11: Photo & Video

Cluster 13: Education

Cluster 14: Finance

Cluster 17: Weather

There are various reasons why only a third of the genres are well represented by the clusters, the two most visible ones being unbalanced sample sizes and word overlaps across genres.

Unbalanced sample sizes

As seen in the initial EDA, around 54% of apps were games. Even after taking games out, a large portion was still represented by ‘Education’ and ‘Entertainment’ apps.

All the games.

It would have been interesting to select a constant sample size across all genres but that would require a larger overall sample size.

Word overlaps

Despite the small sample size, ‘Weather’ apps were well identified by cluster 17. It is arguable that the genres that were well identified by clusters are those with unique nomenclature.

Wordclouds for utilities and reference apps

Looking at the wordclouds for genres like ‘Utilities’ and ‘Reference’, they contain words that are well represented across multiple other genres as well.

Perhaps a list of popular tokens could be formed across all genres and filtered out of the currently sanitized strings. Despite that, genres like ‘Productivity’ and ‘Utilities’ are likely to still overlap and not be perfectly clustered.

I definitely hope to create a model that takes in more than just app descriptions to predict the app genre after fixing some issues with this NLP analysis.

I’ve also been playing around with markovify to generate app descriptions from each genre. Here are a few examples for the ‘Sports’ genre:

Not great as you can tell ¯\_(ツ)_/¯
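For reference, the idea behind markovify boils down to a Markov chain over words; a minimal bigram version (not markovify’s actual implementation, which works at the sentence level) looks like this:

```python
import random

# Learn word -> next-word transitions from a corpus
def build_chain(corpus):
    chain = {}
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain

# Walk the chain from a start word, picking a random learned successor each step
def generate(chain, start, length=8, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    word, out = start, [start]
    for _ in range(length - 1):
        options = chain.get(word)
        if not options:
            break
        word = rng.choice(options)
        out.append(word)
    return ' '.join(out)

# Invented mini-corpus in the spirit of sports app descriptions
corpus = 'track your team stats track your scores live scores for your team'
chain = build_chain(corpus)
print(generate(chain, 'track'))
```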

Thanks for reading and the code can be found here!
