Data analysis: ingredients of skincare products not predictive of product price

In this project, I used a combination of machine learning algorithms to test how informative ingredients are in determining a skincare product’s price.

Dora Modrall Sperling
Towards Data Science

--

Photo credit: Diego Cervo / Getty Images

Introduction

The skincare industry is extremely profitable — the global market for cosmetic skincare in 2020 was estimated at 145.3 billion USD.

However, how skincare companies price their products seems to be a closely-guarded secret. For example, a 16-ounce tub of La Mer cream is priced at 2,475 USD on the La Mer official website, while the much cheaper Nivea Creme moisturizer has a nearly identical ingredient list. The Ordinary is another brand that has become hugely popular for its affordable skincare products and no-frills ingredients. It keeps costs down by spending less on marketing than other skincare companies and by using only ingredients that have been shown to be effective (e.g., retinoids and vitamin C). This raises the question — when you buy a skincare product, are you paying for brand and marketing costs, or for the ingredients and formulations?

Of course, when most people pay a lot for skincare, they expect high-quality ingredients and formulations that yield the best results. It is in the interest of luxury skincare companies to perpetuate the idea that a higher price tag is synonymous with higher-quality ingredients and performance. In this article, I use various machine learning algorithms in Scikit-Learn to predict the price of a product. The first part of my project is exploratory, comparing binary classifiers trained on data with and without brand information to predict whether a product is “cheap” or “expensive”. The second part attempts to build multiclass classifiers to sort products into 1 of 4 price categories — “cheapest”, “cheap”, “expensive”, and “most expensive”.

I hypothesized that including the product ingredient list in the training data would not cause the classifiers to perform better, i.e., that a product’s ingredients do not contribute meaningfully to its price.

Instead, I expected that “brand” would be a much more informative feature for predicting the price of a product.

The dataset I used in this project is the “Skincare Products and their Ingredients” dataset by Erin Ward.

Dataset Details

Dataset of skincare products from LookFantastic.com, consisting of 1138 skincare products including their names, URLs, product types, ingredients and prices.

My project consists of 5 parts:

1) Binary classification using ingredient list, amount of product (i.e., contents), and product type.

2) Binary classification using brand, ingredient list, amount of product, and product type.

3) Binary classification using brand and product type (without any information about the product amount or ingredients).

4) Multiclass classification using brand, ingredient list, amount of product, and product type.

5) Multiclass classification using brand and product type (without any information about the product amount or ingredients).

Preparing the dataset

To preprocess the data and prepare it for an algorithm, I first converted the “price” column into a column of floating-point numbers, removing the “£” symbol.

df['price'] = df['price'].str.replace('£', '')
df['price'] = df['price'].astype(float)

I also added a “brand” column and filled in the brand of each product manually. Then, I added a “contents” column containing the amount of product in each item, using a regex to extract the milliliter, gram, or kilogram amount from “product_name”. For example, the item “Acorelle Pure Harvest Body Perfume — 100ml” has the value “100” in the “contents” column (the limitations of this process are discussed in the “Experimental Limitations” section).

contents = []
for title in df['product_name']:
    try:
        # extract e.g. "100ml", "50g" or "1kg" from the product name
        m = re.search(r'\d+(ml|g|kg)', title)
        contents.append(m.group())
    except AttributeError:
        contents.append(np.nan)
df['contents'] = contents

Unfortunately, the “contents” column contained 155 missing values, where the regex was not able to find a product amount in “product_name”. I manually searched for and filled volume amounts into the “contents” column to replace these missing values. I decided to create this “contents” column because if the ingredient list had any predictive value, it is logical that the amount of those ingredients would have an effect on the product price as well. I kept the columns “brand”, “contents”, “product_type”, and “ingredients” as my predictive features in my dataset.

Next, I removed all “ml” and “g” units from the “contents” column. Then I cast the “contents” and “price” columns to floating-point numbers.

skincare['contents'] = skincare['contents'].str.replace('ml', '')
skincare['contents'] = skincare['contents'].str.replace('g', '')
skincare['contents'] = skincare['contents'].astype(float)
skincare['price'] = skincare['price'].astype(float)

Here is the prepared dataset, ready to be processed and fed to an algorithm:

Finished dataframe with product names, brands, URLs, product type, ingredients, price (in GBP) and contents

Most of the products came in at under 1 kg, and most of these were priced below £100. There seems to be little relationship between a product’s price and the amount of product you get.

As the histogram below shows, the vast majority of products are priced below £50.

The mean price in the dataset is around £24.

Distribution of prices in the skincare product dataset

In the binary classification experiments, I defined 2 price categories based on the median price: “cheap” (class 0, under £18.90; half of the products in the dataset were below this price) and “expensive” (class 1, over £18.90; half of the products in the dataset were above this price).

skincare['price'] = skincare['price'].apply(lambda x: 1 if x > 18.90 else 0)

Admittedly, this distinction is arbitrary. But the goal of this section was simply to see whether any information about price could be obtained from a combination of the predictors “brand”, “contents”, “product_type”, and “ingredients”.

In the multiclass classification experiments, I defined 4 price categories: “cheapest” (<£9.95), “cheap” (£9.95–£18.90), “expensive” (£18.90–£31.25) and “most expensive” (>£31.25), using the quartiles of the price distribution so that each category contains roughly the same number of products (code below).

def price_to_class(x):
    if x <= 9.95:
        return 0
    elif x <= 18.90:
        return 1
    elif x <= 31.25:
        return 2
    else:
        return 3

skincare['price'] = skincare['price'].apply(price_to_class)
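As a side note (not part of the original code), the same quartile bins could be produced more compactly with pandas. A minimal sketch, assuming “price” still holds the raw float prices and writing the labels to a separate, hypothetical “price_class” column:

import pandas as pd

# Split prices into 4 equal-sized bins (quartiles) labelled 0-3,
# mirroring price_to_class above.
skincare['price_class'] = pd.qcut(skincare['price'], q=4, labels=[0, 1, 2, 3])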

Preprocessing

I used a column transformer to prepare the dataset to be fed to the algorithms. “product_type” had 14 unique values encoded as strings, so I used OneHotEncoder() on this column to get the corresponding dummy variables. I used StandardScaler() on the numeric “contents” column to center its values at mean 0 with a standard deviation of 1 (StandardScaler() was not used in the experiments with decision trees, since decision trees do not benefit from feature scaling).

Finally, I used CountVectorizer() on the “ingredients” column. I chose CountVectorizer() for the “ingredients” column because one-hot-encoding each ingredient would cause errors if unseen ingredients appeared in the test set’s “ingredients” column. CountVectorizer() simply ignores vocabulary it has not been trained on.
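To illustrate this behaviour with a toy example (not part of the original analysis): a fitted CountVectorizer assigns zero counts to tokens it never saw during fitting instead of raising an error.

from sklearn.feature_extraction.text import CountVectorizer

# Fit on two ingredient lists, then transform a list containing an
# unseen ingredient ("squalane"), which is silently ignored.
vect = CountVectorizer()
vect.fit(["water glycerin parfum", "water butylene glycol"])
print(vect.get_feature_names_out())
print(vect.transform(["water squalane"]).toarray())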

For the count vectorizer, I defined a custom tokenizer to remove unnecessary characters or artefacts of the dataset, splitting on commas (the ingredients were separated by commas, e.g., “water, butylene glycol, …”).

def tokenizer(x) -> list:
    x = x.replace('(', '')
    x = x.replace(')', '')
    x = x.replace('\xa0', '')
    x = x.replace('.', ', ')
    x = x.replace('&', ', ')
    x = re.split(', ', x)
    return x

These transformers were combined using make_column_transformer(). The resulting column transformer was then chained in a pipeline with each machine learning algorithm (a sketch of this follows the code below).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

ohe = OneHotEncoder()
vect = CountVectorizer(tokenizer=tokenizer)
scaler = StandardScaler()
ct = make_column_transformer(
    (ohe, ['product_type']),
    (vect, 'ingredients'),
    (scaler, ['contents']),
    remainder='passthrough')
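For context, chaining the column transformer with an estimator looks roughly like the sketch below. Logistic regression stands in for whichever algorithm is being evaluated, and only the columns handled by the transformer above are used; the column selection for X and the hyperparameters are illustrative, not the original code.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Column transformer + estimator = a single object that can be fit and scored.
pipe = make_pipeline(ct, LogisticRegression(max_iter=1000))

# Predictor columns handled by ct, and the price-category labels.
X = skincare[['product_type', 'ingredients', 'contents']]
y = skincare['price']

scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())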

Binary Classification of Price Category

In the first part of my experiment, I included all predictors except “brand”; that is, I used “contents”, “product_type”, and “ingredients”. In the second part, I included all predictors: “brand”, “contents”, “product_type”, and “ingredients”. In the third part, I removed all information about both the ingredients and the amount of product and used only the “brand” and “product_type” columns. I suspected that “contents” does not provide useful information about price. For example, the average serum (coming in 30–50ml bottles) can cost over 30 euros, while body wash can come in 1-liter bottles and cost only a fraction of that price.

In binary classification, I tested the accuracy of the perceptron, logistic regression, decision tree, K-nearest neighbors (KNN), and support vector machine (SVM) classifiers. I also included three ensemble learning models: a voting classifier, bootstrap aggregation (bagging), and a random forest classifier. Hyperparameter tuning was performed by randomized grid search on the logistic regression, decision tree, KNN, SVM, and random forest classifiers (GridSearchCV was extremely slow and evidently too computationally expensive).

Randomized grid search was not used on the perceptron, voting, and bagging classifiers. I chose to give the ensemble voting classifier the 3 best-performing algorithms (all fitted with the optimal hyperparameters obtained during the randomized grid search on these algorithms). The bagging classifier used decision trees that were fitted with the optimal hyperparameters obtained during randomized grid search on the decision tree classifier.
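Roughly, the tuning and ensembling steps looked like the sketch below (reusing ct, X, and y from the earlier sketch). The parameter ranges and the “tuned” values handed to the voting and bagging classifiers are placeholders, not the values found in the actual search.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier, BaggingClassifier

# Randomized search over a pipeline; parameters are addressed via the
# step-name prefix ("logisticregression__...").
pipe = make_pipeline(ct, LogisticRegression(max_iter=1000))
param_dist = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
search = RandomizedSearchCV(pipe, param_dist, n_iter=3, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Voting classifier built from the three strongest (already-tuned) models.
voting = make_pipeline(ct, VotingClassifier(estimators=[
    ('logreg', LogisticRegression(C=1, max_iter=1000)),
    ('svm', SVC(C=1)),
    ('perceptron', Perceptron()),
]))

# Bagging over decision trees that reuse the tree's tuned hyperparameters.
bagging = make_pipeline(ct, BaggingClassifier(
    DecisionTreeClassifier(max_depth=10), n_estimators=100))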

Binary Classification Results and Discussion

Table 1. For each classifier, the 10-fold cross-validation score was calculated. These 10-fold cross-validation scores are reported above. The perceptron, ensemble voting, and bagging classifiers did not undergo a randomized hyperparameter search, though the bagging classifier’s n_estimators hyperparameter was tuned. The best accuracies of each experiment iteration are highlighted in yellow for visibility.

On average, classifiers performed best with only “brand” and “product_type” as predictors, with an average accuracy across classifiers of roughly 80.3%. When the “ingredients” column was included, the random forest classifier performed best, whether or not “brand” was present. I suspect that the random forest classifier performed best when ingredients were present because vectorizing the ingredient lists produced extremely long feature vectors. In a random forest, only d features are sampled (without replacement) at each node of each tree, where d defaults to the square root of the total number of features. This effectively reduces the dimensionality considered at each split and helps curb overfitting on extremely long feature vectors.
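In scikit-learn, this per-split subsampling is controlled by the max_features parameter, whose default for classification is the square root of the feature count; the snippet below simply makes that explicit (the n_estimators value is illustrative).

from sklearn.ensemble import RandomForestClassifier

# Consider only sqrt(n_features) candidate features at each split, which
# decorrelates the trees and tempers overfitting on very wide feature matrices.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt')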

The extremely long feature vectors might also explain the poor performance of KNN when the “ingredients” column was included. The feature space was likely so sparse that neighboring data points were too far away from each other to make meaningful predictions (i.e., the dataset was suffering from the curse of dimensionality).

The best-performing model, on average, was the ensemble voting classifier, with an average of ~78.5% accuracy. This makes sense, since the ensemble voting classifier was trained with the three highest-performing non-ensemble machine learning algorithms (logistic regression, SVM, and perceptron), and made predictions by majority vote. However, the best accuracy was obtained by logistic regression when using only “brand” and “product_type”, with an accuracy of 83.3%. Overall, parametric models including SVM, logistic regression, and perceptrons performed better than non-parametric models including KNN and decision tree classifiers.

Multiclass Classification of Price Category

I built multiclass classifiers to put skincare products into one of four categories — “cheapest” (<£9.95), “cheap” (£9.95–£18.90), “expensive” (£18.90–£31.25) and “most expensive” (>£31.25). As described in the dataset-preparation section, these boundaries come from the quartiles of the price column, so each category holds a roughly equal share of the products.


The goal of this section was to see if a specific price range could be predicted from a combination of the predictors “brand”, “contents”, “product_type”, and “ingredients”.

By the time I had finished the binary classification experiments, it had become clear that “brand” was a useful predictor of price category. Therefore, in this section, I performed multiclass classification using all predictors: “brand”, “contents”, “product_type”, and “ingredients”. I compared these results with multiclass classification involving “brand” and “product_type”, with no information about the ingredients or amount of product included.

In multiclass classification, I tested the accuracy of the logistic regression, decision tree, K-nearest neighbors (KNN), and support vector machine (SVM) classifiers. I also included the same three ensemble learning models: a voting classifier, bootstrap aggregation (bagging), and random forests.

I did not include the perceptron because making the perceptron accommodate multiclass labels was cumbersome and likely would not have yielded good results. Instead, I explored different multiclass strategies in logistic regression and SVM. The decision tree, K-nearest neighbors, ensemble voting, bootstrap aggregation (which used decision trees) and random forests classifiers did not have to be explicitly modified to accommodate multiclass labels.

I used one-vs-rest and multinomial classification in my logistic regression models. In one-vs-rest, a separate binary logistic regression is fitted for each class, estimating the probability that a point belongs to that class rather than to any of the others. Multinomial logistic regression instead fits a single model that predicts probabilities for all classes at once via the softmax function.

The one-vs-rest strategy works the same way for SVMs, but since SVMs do not naturally output probabilities, I also used a one-vs-one strategy in the SVM classifier. In the one-vs-one strategy, the classifier splits the multiclass problem into one binary classification problem per pair of classes (e.g., “cheap” vs “most expensive”, “expensive” vs “cheapest”, etc.).
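In scikit-learn terms, the four multiclass set-ups described above look roughly like this (hyperparameters are left at their defaults here; the tuned values differed):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Logistic regression: one binary model per class vs. a single softmax model.
logreg_ovr = LogisticRegression(multi_class='ovr', max_iter=1000)
logreg_multinomial = LogisticRegression(multi_class='multinomial', max_iter=1000)

# SVM: one classifier per class (one-vs-rest) or per pair of classes (one-vs-one).
svm_ovr = OneVsRestClassifier(SVC())
svm_ovo = OneVsOneClassifier(SVC())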

Hyperparameter tuning was performed by randomized grid search on the logistic regression, decision tree, KNN, and random forests algorithms. The bagging classifier was fitted with decision trees with the optimized hyperparameters found with RandomizedSearchCV(), and the voting classifier was fitted with the three best-performing algorithms with optimized hyperparameters. I did not perform RandomizedSearchCV() on the SVM classifiers because the OneVsOneClassifier() and OneVsRestClassifier() did not have easily-optimizable hyperparameters.

Multiclass Classification Results and Discussion

Table 2. For each classifier, the 10-fold cross-validation score was calculated after a randomized grid search to optimize its hyperparameters. These 10-fold cross-validation scores are reported below. The SVM, ensemble voting, and bagging classifiers did not undergo a randomized hyperparameter search, though the bagging classifier’s n_estimators hyperparameter was tuned. The best accuracies of each experiment iteration are highlighted in yellow for visibility.

On average, classifiers performed better with only “brand” and “product_type” as predictors, with an average accuracy across classifiers of roughly 59.3%. However, these results are dismal compared to the modest accuracy scores obtained in the previous section.

A quick visualization of the learning curve for one of the learning algorithms (ensemble voting in this case) might shed some light onto the problem.

The training accuracy is quite high with small standard deviations, while the test accuracy is low. Because there is a large gap between training and test accuracy, it is clear that this model suffers from high variance, meaning that either the complexity of the model needs to be reduced or more data needs to be collected.
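The learning curve itself was produced with scikit-learn’s learning_curve utility; the reconstruction below is a sketch (the stand-in voting pipeline, the training sizes, and the plot styling are not the originals; ct, X, and y are as in the earlier sketches).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in ensemble voting pipeline (illustrative hyperparameters).
model = make_pipeline(ct, VotingClassifier(estimators=[
    ('logreg', LogisticRegression(max_iter=1000)),
    ('svm', SVC()),
    ('tree', DecisionTreeClassifier(max_depth=10)),
]))

# Training and validation accuracy for increasing amounts of training data.
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

plt.plot(train_sizes, train_scores.mean(axis=1), label='training accuracy')
plt.plot(train_sizes, test_scores.mean(axis=1), label='validation accuracy')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()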

The best-performing model, on average, was the ensemble voting classifier, with an average of ~59.6% accuracy. This was expected since the voting classifier was trained with the three highest-performing non-ensemble machine learning algorithms (in this case, the logistic regression with one-vs-rest, SVM with one-vs-one, and a decision tree classifier), and made predictions by majority vote. KNN, again, was the worst-performing, probably because even without the “ingredients” column, the feature space was too sparse to make meaningful predictions about the specific price range of products.

Experimental Limitations

There are many problems with the experimental design of this project. One of the most important is the encoding of ingredients and their names with CountVectorizer(). The same ingredient is sometimes listed under different names from product to product. For example, in one product, an ingredient was called “fragrance”, and in another, “parfum”. When CountVectorizer() is applied, these identical ingredients are separated into two different feature columns, making the matrix even sparser. It is possible that there are hundreds of duplicate ingredients in separate columns because of different naming conventions or even spelling — I am not familiar enough with chemistry to catch these redundancies. In future experiments, ingredient names should be thoroughly researched to find alternative names for the same ingredient across the ingredient lists in the dataset. The ingredient lists should then be standardized so that each ingredient has exactly one name and vice versa.
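A lightweight first step toward this could be a manually curated synonym map applied to the tokenized ingredient lists; the sketch below uses an obviously incomplete, hypothetical mapping.

# Hypothetical (and far from complete) map of alternative ingredient names.
SYNONYMS = {
    'parfum': 'fragrance',
    'aqua': 'water',
    'tocopherol': 'vitamin e',
}

def standardize(ingredients: list) -> list:
    """Map each tokenized ingredient to a canonical name where one is known."""
    return [SYNONYMS.get(i.strip().lower(), i.strip().lower()) for i in ingredients]

# standardize(tokenizer('Aqua, Parfum, Glycerin'))
# -> ['water', 'fragrance', 'glycerin']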

A second problem with this experiment was that the training and test datasets were not stratified by brand. There were many brands represented by only one product in the dataset. Because of the CountVectorizer() preprocessing step, if one of these products ended up in the test dataset, the model wouldn’t be able to use its brand to make a prediction about the price category. Collecting more data would solve the problem of brands that are represented by only one product in the dataset. Gathering data on other products by these brands would help models make more predictions about a brand’s typical price range.

A third problem with this experiment is that the order of ingredients within the ingredient list matters. For example, if “water” appears at the beginning of the ingredient list, it means there is more water in that product than any other ingredient; ingredients at the end of the list have the lowest concentrations. CountVectorizer() does not preserve information about the order of ingredients, so information about the relative amounts of each ingredient is lost, obscuring any potential insight into the effect of ingredients on price. In the future, a custom vectorizer should be designed to split a list of ingredients, record the index of each ingredient in the list, and fill the feature columns with those indices to indicate each ingredient’s position.
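A rough sketch of what such an order-aware encoding could look like (this is a hypothetical helper, not something scikit-learn provides out of the box): each ingredient column stores the ingredient’s 1-based position in the list, with 0 for absent ingredients.

import numpy as np

def position_encode(ingredient_lists, vocabulary):
    """Encode each product as a vector of ingredient positions.

    ingredient_lists: one tokenized ingredient list per product.
    vocabulary: dict mapping ingredient name -> column index.
    Entry [i, j] is the 1-based position of ingredient j in product i's
    list, or 0 if that ingredient is absent.
    """
    X = np.zeros((len(ingredient_lists), len(vocabulary)))
    for i, ingredients in enumerate(ingredient_lists):
        for pos, ing in enumerate(ingredients, start=1):
            j = vocabulary.get(ing)
            if j is not None and X[i, j] == 0:  # keep the first occurrence
                X[i, j] = pos
    return X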

A fourth problem with this experiment is the “contents” column. When I created it, I stripped product amounts (like 100ml, 50g, etc.) of their units and treated grams and milliliters as interchangeable. Grams measure mass, while milliliters measure volume. 1 gram is equivalent to 1 milliliter of water, so by dropping the “ml” and “g” labels I effectively assumed that all products had roughly the same density as water. Of course, this is an inaccurate assumption, but I did not have access to the density of each product to convert grams to milliliters properly. It was likely a mistake to create this column, and in future experiments it should not be included.

A fifth problem with this experiment is that the dataset contains no information about how much these brands spend on marketing. Marketing costs are recouped in the price of the product, so marketing budget influences price. I was tempted to treat “brand” as a proxy for marketing spend, but even within brands there is likely variation in marketing budget by product line. For example, one product line might be marketed more aggressively than another within the same brand — L’Oreal, for instance, has the L’Oreal Luxe line, which is priced higher on average than the rest of the company’s products. “Brand” is therefore not a straightforward proxy for marketing budget, and so not an ideal predictor of price category. A solution for this problem is to gather more data points per brand and across different product lines.

Conclusion

The results of the experiments with binary and multiclass classifiers show that brand, indeed, is somewhat predictive of price. Predicting the price with the ingredient list of a product yielded worse accuracies, whether or not brand was included as a predictor. The best accuracies were achieved in binary classifiers that only used the features “product_type” and “brand”. Multiclass classifiers that attempted to separate products into four price ranges with these same features, however, performed very poorly. Because models were unable to perform a more complex task like multiclass classification, it is clear that additional data is needed to make these models more robust. Nevertheless, this cursory analysis has shown that one should not assume that a hefty price tag means higher-quality skincare ingredients.

Datasets and code can be found in my GitHub.


Master’s student of Digital Text Analysis at the University of Antwerp in Belgium. Data science enthusiast.