Tips and tricks for building a price estimation model for used cars

Challenges and solutions for creating a robust machine learning prediction model for car prices

Photo by Campbell on Unsplash

I recently published a short Medium story presenting the motivations, goals, and challenges that prompted the ELCA Data Science team to develop a machine learning (ML) model to accurately estimate the price of used cars on the Swiss market. Price prediction has long attracted considerable attention from the machine learning community, and the rise of online marketplaces for all kinds of used goods has increased the need for automated tools that can predict reasonable price tags.

In this note, I will briefly present the architecture of our model, along with some strategies we employed to tackle the following key challenges:

  • How to handle and encode categorical variables in this setting,
  • How to compensate for an insufficient number of data samples for some car models, and
  • How to implement an outlier detection approach that protects our training process from noisy data.

In particular, our cross-market entity-embedding solution is a transfer learning approach that boosts the predictive accuracy of our model by around 4.3%.

Datasets and task formalization

We worked with three distinct car datasets composed of car sale announcements from different Swiss and European digital marketplaces.

  • The AutoScout24-CH dataset contains 119’414 announcements and was used for training and validation purposes.
  • AutoScout24-DE is a set of 558’295 German sale announcements (extracted from the European website version). This larger dataset is used for our transfer learning approach.
  • A third Swiss dataset (Comparis-CH, 111’972 samples) is used as the test set, to assess model generalization.

We limit ourselves to a small and simple set of car features, known to the general public and easy to collect for most used cars.

The set of features we adopted in our car modeling approach. Image by author.

The goal is to perform a standard regression task with this feature set, in which the target quantity is the sale price y_n of car x_n in Swiss francs (CHF), divided by the mean price of the respective car model in the training set.

This simple manipulation makes the targets a relative quantity and stabilizes the optimization process. Estimated prices in CHF are obtained by simply scaling back the prediction. Root Mean Squared Error (RMSE) is the loss function of choice. We report it with respect to the original targets in CHF for interpretability reasons.
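To make this concrete, here is a minimal sketch of the normalization and the scaling-back step, using toy data (column and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training announcements: one row per car,
# with the listed price in CHF and the car-model class.
df = pd.DataFrame({
    "model": ["golf", "golf", "polo", "polo"],
    "price_chf": [18_000.0, 22_000.0, 9_000.0, 11_000.0],
})

# Mean price per car model, computed on the training set only.
model_means = df.groupby("model")["price_chf"].mean()

# Relative target: each price divided by the mean price of its model.
df["target"] = df["price_chf"] / df["model"].map(model_means)

# Predictions come out as relative quantities; scaling back by the
# same per-model means recovers prices in CHF.
pred_relative = np.array([0.95, 1.02, 0.88, 1.10])  # dummy model output
pred_chf = pred_relative * df["model"].map(model_means).to_numpy()

# RMSE reported on the original CHF scale for interpretability.
rmse_chf = np.sqrt(np.mean((pred_chf - df["price_chf"].to_numpy()) ** 2))
print(f"RMSE (CHF): {rmse_chf:.0f}")
```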

Model architecture

We identified and validated an XGBoost regressor as the best model for our prediction pipeline. XGBoost is a popular and powerful framework that implements the Gradient Boosting technique. At its core, it’s an ensemble method that combines many weak decision trees into a strong predictor. All model search and optimization experiments were carried out following standard machine learning procedures, with 10-fold stratified cross-validation on the AutoScout24-CH dataset. The figure below shows the final prediction pipeline.

Our price prediction pipeline. Image by author.
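As a rough sketch of this validation protocol, the snippet below runs 10-fold cross-validation stratified on the car-model label; the synthetic data and the XGBoost hyperparameters are illustrative assumptions, not our tuned configuration:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the preprocessed features, relative price
# targets, and car-model labels used for stratification.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))
y_rel = rng.normal(loc=1.0, scale=0.2, size=1_000)
model_labels = rng.integers(0, 20, size=1_000)

# 10-fold cross-validation, stratified on the car model so that every
# fold contains a similar mix of models.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, val_idx in skf.split(X, model_labels):
    reg = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
    reg.fit(X[train_idx], y_rel[train_idx])
    pred = reg.predict(X[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_rel[val_idx], pred)))
print(f"mean validation RMSE: {np.mean(fold_rmse):.4f}")
```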

Encoding categorical variables

The manufacturer (maker) and car model (model) features encode most of the available information about the price class of a new car. The datasets contain 37 different car manufacturers and 211 car models. As these are categorical features (as opposed to continuous), we need to encode them before we can concatenate them with the remaining features and use them to train a model. The simplest encoding approach is the standard one-hot scheme, where we create as many binary variables as there are classes in each field. This method is almost equivalent to training a different regression model for each class. Given that data is scarce for many car models, this encoding scheme is not enough. A method that maps the variables to a single continuous space is preferable, as it would allow the model to leverage similarities between classes.
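For concreteness, this is roughly what the one-hot scheme looks like in pandas (toy data, hypothetical column names); with 37 makers and 211 models it would add around 248 sparse binary columns that carry no notion of similarity between classes:

```python
import pandas as pd

# Two toy announcements; get_dummies creates one binary column
# per maker/model class.
cars = pd.DataFrame({"maker": ["vw", "bmw"], "model": ["golf", "x3"]})
print(pd.get_dummies(cars, columns=["maker", "model"]))
```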

A simple representation choice that accomplishes this goal is to represent the categorical variables with two continuous values each: the mean price of the class in the training data, and its standard deviation. Intuitively, this encoding method enriches the car representation vector with approximate information about the starting price and the depreciation range. Nevertheless, it does not offer the possibility to learn more fine-grained details of the depreciation pattern for each car class.
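A minimal sketch of this mean/std encoding, assuming hypothetical train/validation frames; note that the statistics must be computed on the training split only, to avoid leaking the target into validation data:

```python
import pandas as pd

# Toy stand-ins for the training and validation splits.
train_df = pd.DataFrame({"model": ["golf", "golf", "x3", "x3"],
                         "price_chf": [18_000, 22_000, 35_000, 41_000]})
val_df = pd.DataFrame({"model": ["golf", "x3"],
                       "price_chf": [20_500, 37_000]})

# Per-class mean and standard deviation, computed on training data only.
stats = train_df.groupby("model")["price_chf"].agg(["mean", "std"])

# Replace the categorical column with the two continuous statistics.
for split in (train_df, val_df):
    split["model_mean_price"] = split["model"].map(stats["mean"])
    split["model_price_std"] = split["model"].map(stats["std"])
print(val_df)
```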

The Entity Embedding (EE) approach proposed by C. Guo and F. Berkhahn in 2016 and displayed in the schema below is a more involved method, which uses a small neural network to learn a mapping from the one-hot representation of each variable to a latent representation. This mapping is implemented as an encoding EE layer for each variable, whose weights are trained alongside the rest of the model on the price prediction task.

Illustration of entity embedding layers (EE), which correspond to extra layers on top of each one-hot encoded input. Image by C. Guo and F. Berkhahn.
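A minimal PyTorch sketch of such a network (the original paper used Keras; the embedding widths, hidden size, and number of numeric features below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PriceNet(nn.Module):
    """Small regression network with entity-embedding layers."""

    def __init__(self, n_makers=37, n_models=211, n_numeric=5):
        super().__init__()
        self.maker_emb = nn.Embedding(n_makers, 4)   # EE layer for maker
        self.model_emb = nn.Embedding(n_models, 8)   # EE layer for model
        self.mlp = nn.Sequential(
            nn.Linear(4 + 8 + n_numeric, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, maker_idx, model_idx, numeric):
        # Learned embeddings are concatenated with the numeric car
        # features; the embedding weights are trained end-to-end
        # against the price target.
        x = torch.cat([self.maker_emb(maker_idx),
                       self.model_emb(model_idx),
                       numeric], dim=1)
        return self.mlp(x).squeeze(1)
```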

Comparing the mean validation performance of the three encoding approaches, variable substitution with the class mean price and standard deviation slightly outperforms the EE approach.

Data scarcity: knowledge transfer to the rescue

The lack of sufficient data samples for some car models might explain the disappointing results of the EE method. Some models occur fewer than 100 times in AutoScout24-CH, which is too little data to learn sufficiently precise mappings. We therefore train the EE network on the larger AutoScout24-DE dataset, since the German automobile market presents a price distribution similar to the Swiss one. The relative distances between car classes in the latent EE space should hence be transferable.

Our cross-market entity-embedding approach to overcome the data scarcity problem. EE encoding layers are trained on German data and then transferred to our Swiss model. Image by author.
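In code, the transfer step amounts to training the network on the German data and then reusing its frozen embedding tables as fixed continuous encodings for the Swiss pipeline; here is a sketch reusing the hypothetical PriceNet from above:

```python
import torch

net = PriceNet()
# ... train `net` on the AutoScout24-DE announcements here ...

# Freeze the learned lookup table: one latent vector per car model.
with torch.no_grad():
    model_codes = net.model_emb.weight.clone()   # shape (211, 8)

# Encode a Swiss sample's car model with the German-trained embedding;
# the resulting row is concatenated with the other features and fed
# to the Swiss XGBoost regressor.
swiss_model_idx = torch.tensor([42])             # hypothetical class id
ee_features = model_codes[swiss_model_idx]       # shape (1, 8)
```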

This cross-market transfer learning approach improves the Swiss RMSE by about 4.3%, making the EE approach superior to the other encoding schemes.

Train on likely pricing trends: outlier detection

Further analysis of the datasets revealed that sale announcements are not always a reliable ground truth for prices. There is no control or limit on how inaccurate a listed price tag can be, all the more so since we do not account for car accessories and many other important car features. There is a clear need to "see through the noise" and train the model only on samples that adhere to a general pricing trend for each car manufacturer/model.

Our solution is to apply a rather aggressive outlier removal procedure to each car model separately. We perform density-based outlier detection in three separate two-dimensional subspaces of the feature-target space: Price-Mileage, Price-Age, and Price-Power. Mileage and age are our two wear-level indicators and the two primary causes of depreciation; we also found a non-negligible set of samples whose price is abnormally high with respect to the listed power class. If a sample is detected as an outlier in any of the three planes, it is removed from the dataset. We use the Local Outlier Factor (LOF) algorithm, which finds anomalous data points by measuring their local density deviation with respect to their neighbors. The following schema clarifies this approach:

The proposed outlier detection approach, based on the Local Outlier Factor (LOF) method, applied to three different feature-target subspaces. Image by author.
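A minimal sketch of this filter on synthetic data for a single car model, where a sample is dropped as soon as LOF flags it in any of the three planes:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic announcements for one car model.
rng = np.random.default_rng(0)
group = {
    "price":   rng.normal(20_000, 3_000, 200),
    "mileage": rng.normal(80_000, 20_000, 200),
    "age":     rng.normal(6, 2, 200),
    "power":   rng.normal(110, 25, 200),
}

# Run LOF independently in the Price-Mileage, Price-Age, and
# Price-Power planes; fit_predict returns -1 for outliers.
keep = np.ones(200, dtype=bool)
for feature in ("mileage", "age", "power"):
    plane = np.column_stack([group["price"], group[feature]])
    keep &= LocalOutlierFactor(n_neighbors=20).fit_predict(plane) == 1

filtered = {k: v[keep] for k, v in group.items()}
print(f"kept {keep.sum()} / {keep.size} samples")
```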

We remove approximately 12.5% of the training data and retrain the entire model on the cleaned datasets (including the German EE layers). This results in a further large performance boost.

Results and performances

We report both the final validation/test RMSE and the Mean Absolute Error (MAE) of our ML model, computed over all car models. For the vast majority of cars, the average estimation error is less than 20% of the price variability of that particular car model in the training data. Given the relative simplicity of our car modeling approach, this performance is more than satisfactory.

Our model can be queried through this demo interface. The user enters the characteristics of a car and obtains a price estimate from the predictor, along with a confidence interval and the number of same-model cars seen during training. We hope this quick overview of our experiments and the solutions we employed is a helpful read for anyone intending to tackle similar problems from a machine learning perspective.

References

C. Guo and F. Berkhahn, "Entity Embeddings of Categorical Variables," arXiv:1604.06737, 2016.