
DS Project: How to Predict Google Apps Rating?

Feed your data science knowledge by practicing this exercise.

In this article, I share my experience analyzing and predicting Google Apps ratings. This is one of the tests I had to solve in data science interviews. Note that this is my personal way of solving the problem; it may help you develop your own reasoning.

As with any other DS project, we start by downloading the data file from here.

You can check the code I used by viewing the Jupyter Notebook of this project on my GitHub from here.

Without further ado, let’s go!

Getting started with the data

The first step is data preprocessing. We clean, explore, and transform the columns of the data according to their formats and types. Our data has the following format:

import pandas as pd

apps = pd.read_csv('GooglePlayApp-ELHOUD.csv')
apps.info()

The data frame contains 13 different columns and 8281 rows. The column "Rating" is the target vector y of our model: what we try to predict. We inspect the different values of "Rating":

apps['Rating'].value_counts()

While inspecting the values of "Rating", we notice that there is an unreasonable rating in our data (19.0). In general, an app rating is between 0 and 5 stars. We delete this value to avoid biasing our model. We could replace it with 1.9 if we thought it was a typing mistake, but since we can’t be sure and we have no direct contact with the data owner, it is better to remove it.
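A minimal sketch of this filtering step, assuming the "Rating" column is already numeric. The helper name `drop_invalid_ratings` and the toy data are hypothetical and not taken from the notebook. Missing ratings are kept here, so that only out-of-range values are removed.

```python
import pandas as pd

def drop_invalid_ratings(df, low=0.0, high=5.0):
    """Return a copy of df without rows whose 'Rating' lies outside [low, high].

    Rows with a missing rating are kept; handling them is a separate decision.
    """
    rating = df['Rating']
    keep = rating.isna() | rating.between(low, high)
    return df[keep].copy()

# Toy data (hypothetical), including one impossible rating of 19.0
toy = pd.DataFrame({'App': ['a', 'b', 'c', 'd'],
                    'Rating': [4.1, 19.0, None, 3.5]})
cleaned = drop_invalid_ratings(toy)
print(cleaned)
```

On the real data this would be called as `apps = drop_invalid_ratings(apps)`.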

After that, we check all the duplicate apps and remove them:

print('Number of apps at the beginning:', len(apps))
apps.drop_duplicates(subset='App', inplace=True) 
print('Number of apps after removing duplicates:', len(apps))

I recommend dropping three columns (Current Ver, Android Ver, and Last Updated), since they are unnecessary for our analysis and have no direct effect on the rating (we visualize the correlations later).
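Dropping these columns could look like the sketch below. The column names come from the text above; the helper name `drop_unused_columns` and the toy row are hypothetical.

```python
import pandas as pd

# Column names as listed in the text
UNUSED_COLUMNS = ['Current Ver', 'Android Ver', 'Last Updated']

def drop_unused_columns(df):
    """Return df without the unused columns; absent columns are ignored."""
    return df.drop(columns=UNUSED_COLUMNS, errors='ignore')

# Toy row (hypothetical values) to show the effect
toy = pd.DataFrame({'App': ['a'], 'Rating': [4.1], 'Current Ver': ['1.0'],
                    'Android Ver': ['4.4 and up'],
                    'Last Updated': ['January 7, 2018']})
trimmed = drop_unused_columns(toy)
print(trimmed.columns.tolist())  # ['App', 'Rating']
```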

In order to visualize the data, we have to convert it to numerical form. The conversion is done by replacing all the strings and transforming them, in different ways, into a numerical format. The figure below recapitulates all the cleaning, scaling, and conversion steps:

All the code for the transformations I used is available in my GitHub repository.
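To give an idea of what such string-to-number conversions can look like, here is a small sketch. The raw formats it handles ('10,000+' for installs, '$4.99' for prices, '19M' or '512k' for sizes) are assumptions about the file, and the function names are hypothetical; the actual transformations are the ones in the repository.

```python
import math

def parse_installs(text):
    """'10,000+' -> 10000 (assumed raw format)."""
    return int(text.replace(',', '').replace('+', ''))

def parse_price(text):
    """'$4.99' -> 4.99 and '0' -> 0.0 (assumed raw format)."""
    return float(text.replace('$', ''))

def parse_size_mb(text):
    """'19M' -> 19.0 and '512k' -> 0.5; anything else -> NaN (assumed raw format)."""
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        return float(text[:-1]) / 1024
    return math.nan

print(parse_installs('10,000+'), parse_price('$4.99'), parse_size_mb('19M'))
```

With pandas, each helper can be applied to a column, for example `apps['Installs'].map(parse_installs)`.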

Data Visualization

Once all the transformations are done, we start by visualizing the distribution of the "apps ratings" and the distribution of the "apps size" using Seaborn. Let’s take a look, check for normality, and try to correct it if needed:

import numpy as np
import seaborn as sns
from scipy.stats import norm
# Note: distplot is deprecated since seaborn 0.11
sns.distplot(apps['Rating'], fit=norm)
print('- Total number of ratings:', len(apps['Rating']))
print('- Mean of distribution of rating :', np.mean(apps['Rating']))
print('- Standard deviation:', np.std(apps['Rating']))

We notice that the rating distribution, with a mean of 4.16 and a standard deviation of 0.559, does not strictly follow the fitted normal curve. This information will help in defining and developing the model later. Let’s check the probability plot:

import matplotlib.pyplot as plt
from scipy import stats
fig = plt.figure()
prob = stats.probplot(apps['Rating'], plot=plt)

So, certainly not normal: we have skewness.
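The skewness can also be put into a number. The sketch below computes the Fisher-Pearson coefficient g1 = m3 / m2^1.5 from the second and third central moments; a value of 0 means a symmetric sample, and a negative value means a longer tail on the low side. The function name and the toy ratings are hypothetical and only illustrate the calculation.

```python
def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (biased central moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical ratings with a long tail towards low values
toy_ratings = [4.5, 4.4, 4.6, 4.3, 4.5, 4.2, 3.0, 1.5]
print(sample_skewness(toy_ratings))      # negative: the tail is on the left
print(sample_skewness([1, 2, 3, 4, 5]))  # symmetric sample
```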

I thought of applying a transformation to bring the data closer to a Gaussian distribution. Let’s apply a Box-Cox transformation to the data and see what happens… The Box-Cox transformation is expressed as follows:
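For reference, the standard one-parameter Box-Cox transformation of a positive value y with parameter λ is given below. The symbols y and λ are the usual textbook notation and may differ from those in the original figure.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln y, & \lambda = 0,
\end{cases}
\qquad y > 0
```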

We then re-visualize the distribution of the "apps ratings" after the transformation:

Much more like a Gaussian (normal) distribution! We will be using this transformation for the rest of this project.
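A minimal sketch of how such a transformation can be done with SciPy. The names `bcx_target` and `lam` are the ones used later in this article, but how they were produced in the notebook is an assumption here, and the ratings array is hypothetical toy data.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Hypothetical positive ratings standing in for apps['Rating']
ratings = np.array([4.1, 3.9, 4.7, 4.5, 2.8, 4.3, 4.4, 3.2, 4.6, 4.0])

# With lmbda omitted, boxcox estimates lambda by maximum likelihood and
# returns the transformed values together with that estimate.
bcx_target, lam = boxcox(ratings)

# inv_boxcox undoes the transformation, which is needed to report
# errors on the original rating scale.
recovered = inv_boxcox(bcx_target, lam)
print('lambda:', lam)
print('round trip ok:', np.allclose(recovered, ratings))
```

Note that Box-Cox requires strictly positive inputs, so missing or zero ratings would have to be handled first.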

Next, we visualize different correlations. This is a crucial step that helps us choose the important features of our model. The correlation matrix is shown below:

apps.corr()

Some remarks related to the correlations:

  • The number of installs is clearly correlated with the number of reviews (correlation coefficient of 0.59).
  • Price and rating are almost uncorrelated (the coefficient is about 0.02 in absolute value), so any tendency for more expensive apps to be rated lower is very weak.
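By default, `DataFrame.corr()` computes the Pearson correlation coefficient for each pair of columns. The sketch below shows that calculation for a single pair in plain Python; the function name and the toy numbers are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical installs and reviews for five apps
installs = [1_000, 5_000, 10_000, 50_000, 100_000]
reviews = [20, 90, 150, 1_200, 1_900]
print(round(pearson_r(installs, reviews), 3))
```

The coefficient ranges from -1 to 1, and values close to 0, such as the one between price and rating, indicate almost no linear relationship.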

Prediction models

We build two models:

  • Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

chosen_features = ['Reviews', 'Size', 'Installs', 'Type', 'Category', 'Price', 'Content Rating', 'Genres']
X = apps[chosen_features]
y = bcx_target  # transformed rating
# Hold out 33% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=10)
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)
y_rfpred = rf_reg.predict(X_test)

For a simple RF Regressor (with the transformed rating), we get:

  • Mean Squared Error: 0.269
  • Mean Absolute Error: 0.352
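For readers less familiar with these two metrics, here is what they compute, written out in plain Python. The values below are hypothetical toy numbers, not the model’s predictions; in the project itself, the `mean_squared_error` and `mean_absolute_error` functions imported from scikit-learn do this work.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences; large errors weigh more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute differences, in the same unit as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted ratings
actual = [4.0, 3.5, 5.0, 4.5]
predicted = [4.5, 3.5, 4.0, 4.0]
print(mse(actual, predicted))  # 0.375
print(mae(actual, predicted))  # 0.5
```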

In the figure below, we have a better visualization of the actual and predicted ratings from the Random Forest Regressor:

Now, let’s visualize the effect of the number of estimators of the random forest on MSE:

from scipy.special import inv_boxcox

estimators = np.arange(10, 500, 10)
mse_list = []
for i in estimators:
    # Refit the forest with i trees and score it on the test set
    rf_reg.set_params(n_estimators=i)
    rf_reg.fit(X_train, y_train)
    y_rfpred = rf_reg.predict(X_test)
    # Undo the Box-Cox transformation so the MSE is on the original rating scale
    mse_list.append(mean_squared_error(inv_boxcox(y_test, lam), inv_boxcox(y_rfpred, lam)))
plt.figure(figsize=(10, 5))
plt.xlabel("No. of Estimators")
plt.ylabel("MSE")
plt.title("Effect of Number of Estimators")
plt.plot(estimators, mse_list)

We get the lowest MSE for a number of estimators around 370. The lowest MSE is approximately 0.2697.
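Instead of reading the minimum off the plot, it can be picked programmatically. A small sketch, where the helper name `best_setting` and the toy values are hypothetical:

```python
def best_setting(settings, scores):
    """Return the (setting, score) pair with the lowest score."""
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return settings[best_index], scores[best_index]

# Hypothetical values standing in for `estimators` and `mse_list`
toy_estimators = [10, 20, 30, 40]
toy_mse = [0.281, 0.274, 0.270, 0.272]
print(best_setting(toy_estimators, toy_mse))  # (30, 0.27)
```

With the variables from the loop above, the call would be `best_setting(estimators, mse_list)`.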

  • XGBoost Regressor
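import xgboost as xgb
xgb_model = xgb.XGBRegressor()  # hyperparameters are not shown in the article; defaults assumed
xgb_model.fit(X_train, y_train)
y_xgpred = xgb_model.predict(X_test)
mse = mean_squared_error(inv_boxcox(y_test, lam), inv_boxcox(y_xgpred, lam))
print('Mean Squared Error:', mse)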

The MSE of XGBoost is approximately 0.2695.

In the following figure, we compare the prediction performance of the two models (Random Forest and XGBoost).

Perspectives

I stopped here to keep the article from getting too long. If I were to go further, I would try these two ideas:

  • Trying some neural network models (using Keras). I believe that with a well-chosen architecture for this problem, we could get good results.
  • Getting my hands on the second part of the dataset and applying NLP (Natural Language Processing) to predict the rating of an application based on the review comments of its users (using NLP techniques such as tokenization, text segmentation…).

I recommend checking the code I used by viewing the Jupyter Notebook of this project on my GitHub from here, and following me for more related articles and projects.

