by Ginna Gomez with Greg Page
Equivocal Zones: Overview
Typically, we measure the performance of classification models by using all of the records in some particular dataset. We might say, for instance, that a model achieved "71 percent accuracy against the training set, and 68.4 percent accuracy against the validation set", or that the model has a "sensitivity rate of 81.2 percent against the training set." However, there is no requirement that we consider every record when making predictions.
Instead, there might be scenarios in which we would wish to focus only on some particular subset of records, based on the underlying probabilities with which our model made its classification decisions.
Let’s say, for instance, that we are running a subscription service, and that we have built a model to predict customer subscription renewals. Rather than just focus on the categorical predicted outcomes of "Yes" or "No", we might instead prefer to zero in on the probability of renewal for each consumer in our database. To concentrate our energy in the most efficient way, we might ignore the customers whose renewal prediction is less than 0.20 (perhaps they are a lost cause), and the ones whose probabilities are above 0.80 (they seem like a ‘lock’ for renewal). By channeling our energies towards the people whose probabilities fall between 0.20 and 0.80, we can most efficiently leverage our marketing resources.
Alternatively, a business could encounter a scenario in which it makes the most sense to focus on records whose probability of class membership is very high. Perhaps we are selling luxury vacation home rentals, and our method relies on a manpower-intensive, in-person sales pitch. In such a scenario, we could concentrate our sales resources on those customers whose probability of purchase exceeds some high threshold – by only targeting those customers, we would avoid spending expensive, in-person sales effort on prospects who are unlikely to buy.
In Applied Predictive Modeling, Kuhn and Johnson describe the use of an "equivocal or indeterminate zone", in which a classification is simply not predicted. For example, in a two-class problem, a modeler could label samples with prediction probabilities between 0.40 and 0.60 as "equivocal" or "indeterminate", and use only the remaining samples for calculations and more detailed analysis [1].
Used Cars in Belarus: Preparing the Data & Building the Model
In the model presented below, we start with a dataset containing information about used car offerings in Belarus. After binning the used car prices (our outcome variable) into three relatively evenly sized groups, and binning several other input variables so that they can be used in a Naive Bayes process, we build a Naive Bayes classification model and evaluate its performance.
This cars dataset, created by data scientist Kirill Lepchenkov and released under a CC0 public domain license, contains 38,145 observations. It can be downloaded from Kaggle at: https://www.kaggle.com/lepchenkov/usedcarscatalog
Variables
manufacturer_name: Manufacturer name of the car. There are more than 50 categorical options, such as Acura, Alfa Romeo, Audi, BMW, Buick, Chevrolet, Mercedes-Benz, Nissan, Volvo, among others.
transmission: Whether the transmission is mechanical or automatic.
color: Color of the car’s exterior, with 12 unique levels.
year_produced: The year in which the car was produced. This variable was numeric, but we converted it into a factor. The "older" cars are those built before the median year, and the "newer" cars are those built during or after the median year.
engine_fuel: The type of fuel used by the car (diesel, electric, or gas).
engine_has_gas: Whether the car’s engine is equipped with a propane tank and tubing.
engine_capacity: The engine capacity was a numeric variable, measured in liters, but we converted it to a categorical variable with 5 levels: lowest engine capacity, slightly below average engine capacity, average engine capacity, slightly above average engine capacity, and highest engine capacity.
body_type: Body type of the car. This variable has 12 levels, including cabriolet, coupe, limousine, pickup, sedan, universal, SUV, and others.
has_warranty: Whether the car still has a valid warranty.
state: This variable contains three levels: new, owned, and emergency. Emergency means the car has been damaged, sometimes severely.
drive_train: Whether the car has a front, rear, or all-wheel drive train.
is_exchangeable: Whether the owner is ready to exchange this car for another car, with little or no additional payment.
location_region: A category with six levels, each representing different parts of Belarus.
duration_listed: How long the car has been listed in the catalog, measured in days. The variable was numeric, but we converted it into two categories: shorter and longer.
number_of_photos: Number of photos posted of the car in the catalog. This variable was numeric but we converted it into two categories: higher and lower.
price_usd: The price of a car as listed in the catalog, in USD. This variable was numeric, but we converted it into a categorical variable with 3 bins: cheap, moderate, and expensive.
First, we need to load all the libraries we will use for the analysis, and load the dataset into the environment. read_csv2() is used here because the dataset uses a ‘;’ delimiter between columns; the file also contains some Cyrillic characters, which readr reads as UTF-8 by default.
The original dataset came with a series of 10 other feature variables with boolean values. Those variables’ meaning was unclear to us, and we dropped them from the dataset.
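Here is a minimal sketch of that setup. The file name (cars.csv) and the names of the dropped boolean columns (feature_0 through feature_9) are assumptions; adjust them to match the downloaded file.

```r
library(tidyverse)   # readr, dplyr, ggplot2
library(e1071)       # naiveBayes()
library(caret)       # confusionMatrix()

# read_csv2() expects ';' as the column delimiter; readr reads UTF-8 by default
cars <- read_csv2("cars.csv")   # assumed file name

# Drop the ten unexplained boolean feature columns (assumed names)
cars <- cars %>% select(-starts_with("feature_"))
```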

We need to bin the price_usd variable so that it can be used as a categorical outcome in a Naive Bayes model.
First, we identified 386 records whose price_usd values were missing, and we removed those records from the dataset:
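A sketch of that step:

```r
sum(is.na(cars$price_usd))   # 386 listings lack a price

# Keep only the rows with a known price
cars <- cars %>% filter(!is.na(price_usd))
```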
With the help of the quantile() function, as shown below, we split the records into three balanced groups, labeled as "cheap", "moderate", and "expensive."
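A sketch of the binning, using the tertiles of price_usd as the cut points:

```r
# Tertile cut points yield three nearly equal-sized groups
cutoffs <- quantile(cars$price_usd, probs = c(0, 1/3, 2/3, 1))

cars$price_usd <- cut(cars$price_usd, breaks = cutoffs,
                      labels = c("cheap", "moderate", "expensive"),
                      include.lowest = TRUE)
```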

Here are the three outcome category groups, with nearly identical numbers of records in each:
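The counts can be verified with a simple call to table():

```r
table(cars$price_usd)
```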

We renamed the modified data set carsMod.
Next, we converted the character variables and logical variables into factors.
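A sketch of both steps, using dplyr's across(); the original conversions could equally have been done column by column:

```r
carsMod <- cars %>%
  mutate(across(where(is.character), as.factor),
         across(where(is.logical),   as.factor))
```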


The engine_capacity variable was numeric, so we converted it into five categorical bins, as follows:
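A sketch of the conversion, assuming quintile-based cut points; if any of the quantile boundaries coincide, the breaks vector would need to be de-duplicated before being passed to cut():

```r
carsMod$engine_capacity <- cut(
  carsMod$engine_capacity,
  breaks = quantile(carsMod$engine_capacity, probs = seq(0, 1, 0.2), na.rm = TRUE),
  labels = c("lowest", "slightly below average", "average",
             "slightly above average", "highest"),
  include.lowest = TRUE)
```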
We converted the odometer variable into five categorical bins, as follows:
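The same pattern applies here; the column name odometer_value and the bin labels are our assumptions:

```r
carsMod$odometer_value <- cut(
  carsMod$odometer_value,
  breaks = quantile(carsMod$odometer_value, probs = seq(0, 1, 0.2), na.rm = TRUE),
  labels = c("lowest", "low", "average", "high", "highest"),
  include.lowest = TRUE)
```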
The variables year_produced, number_of_photos, up_counter, and duration_listed were all split into two outcome groups, based on their respective median values. Those conversions were performed as follows:
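A sketch of the four median splits; the "higher"/"lower" labels for up_counter are our assumption, since the text does not name them:

```r
carsMod$year_produced <- factor(
  ifelse(carsMod$year_produced >= median(carsMod$year_produced), "newer", "older"))

carsMod$number_of_photos <- factor(
  ifelse(carsMod$number_of_photos >= median(carsMod$number_of_photos), "higher", "lower"))

carsMod$up_counter <- factor(
  ifelse(carsMod$up_counter >= median(carsMod$up_counter), "higher", "lower"))

carsMod$duration_listed <- factor(
  ifelse(carsMod$duration_listed >= median(carsMod$duration_listed), "longer", "shorter"))
```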
Finally, as shown below, our variables were fully converted into factors:
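A quick structural check:

```r
str(carsMod)   # every column should now display as a Factor
```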

Then we partitioned this data, randomly assigning 60 percent of the records to the training set, with the other 40 percent going to the validation set. The seed value of 699 is used here for code reproducibility.
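A sketch of the partition; the object names carsTrain and carsValid are our own:

```r
set.seed(699)   # for reproducibility

train_rows <- sample(nrow(carsMod), round(0.6 * nrow(carsMod)))
carsTrain  <- carsMod[train_rows, ]
carsValid  <- carsMod[-train_rows, ]
```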
To assess the suitability of our potential independent variables for use in a naive Bayes model, we built proportional barplots. These barplots are constructed with the training set data only, and they demonstrate the proportional differences in price_usd among the various levels of these potential input variables.
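Each plot follows the same pattern; here is one example, shown for the transmission variable:

```r
ggplot(carsTrain, aes(x = transmission, fill = price_usd)) +
  geom_bar(position = "fill") +   # stacked bars scaled to proportions
  labs(y = "proportion of price_usd outcomes")
```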





Based on these graphs, we decided to eliminate duration_listed and is_exchangeable, as the price_usd proportions differ very little across those variables’ levels.
We also eliminated manufacturer_name and model_name, which contained 55 and 1,116 unique levels, respectively.
Next, we ran the Naive Bayes classification model:
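A sketch of the fit, dropping the four excluded variables first (the object names nb_train and nb_model are our own):

```r
nb_train <- carsTrain %>%
  select(-duration_listed, -is_exchangeable,
         -manufacturer_name, -model_name)

nb_model <- naiveBayes(price_usd ~ ., data = nb_train)
```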
Then we generated confusion matrices, using both the training set and the validation set data:
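A sketch, using caret's confusionMatrix():

```r
# Training set performance
train_pred <- predict(nb_model, carsTrain)
confusionMatrix(train_pred, carsTrain$price_usd)

# Validation set performance
valid_pred <- predict(nb_model, carsValid)
confusionMatrix(valid_pred, carsValid$price_usd)
```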


From the two confusion matrices shown above, we can see very similar model performance against both the training and validation sets – that suggests that the model was not overfit to the training set. In both cases, we can see considerable outperformance of the naive rule (No Information Rate), with accuracy more than doubling.
Changing the Game: Narrowing the Focus
What if, instead of aiming to predict which of the three outcome classes a car would land in, we shifted our focus to just the cars that were most likely to wind up in a specific outcome class? By examining the probabilities associated with the class predictions, we can accomplish this.
First, we use the predict() function with the type="raw" argument, which returns the class membership probabilities that our model assigns to each record. (For the sake of brevity, we will not explain the entire process by which these numbers are generated, but we do recommend this YouTube tutorial for a breakdown of how the class probabilities are computed.)
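For example (the object name probs is our own):

```r
probs <- predict(nb_model, carsValid, type = "raw")
head(probs)   # one column of probabilities per outcome level
```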
After attaching the outcome class probability predictions to the validation set, we can use the arrange() function from dplyr to sort the records by predicted class probability. For instance, the first line of code below gives us the 500 cars from the validation set that are most likely to belong to the ‘cheap’ outcome group; the line immediately beneath it generates a table that shows the actual outcome classes for those 500 records.
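A sketch of those steps:

```r
# Attach the three probability columns to the validation records
carsValidProbs <- cbind(carsValid, probs)

# The 500 cars most likely to be 'cheap', and their actual outcome classes
cheap500 <- head(arrange(carsValidProbs, desc(cheap)), 500)
table(cheap500$price_usd)
```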

From this group, we can see that 89.2% of the cars fall into the "cheap" outcome group. If we were specifically looking to identify cars whose prices were unknown, but expected to land in this range, such a process would have tremendous practical value – for every ten cars we decided to inspect more closely, roughly nine would be priced within our target range.
We can use a similar process with the other outcome classes:
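Reusing the carsValidProbs frame from above, the "moderate" version looks like this:

```r
moderate500 <- head(arrange(carsValidProbs, desc(moderate)), 500)
table(moderate500$price_usd)
```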

And finally, we can repeat the process for the "expensive" outcome group:
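Again reusing carsValidProbs:

```r
expensive500 <- head(arrange(carsValidProbs, desc(expensive)), 500)
table(expensive500$price_usd)
```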

When we sift through the validation set to find the 500 records that our model most strongly predicts to be "expensive", we will almost always identify cars that truly are expensive.
But Wait, Isn’t This Cheating?
At first, the very concept of only classifying the records for which the underlying classification probabilities meet some threshold may seem unfair. In some ways, it is unfair – if we were comparing the performance of a model that used this approach to one that attempted to classify every record that it saw, then yes, that would be a completely unfair comparison.
Of course, a company or a researcher who plans to publish or otherwise share the results would have to very clearly disclose to the audience that the results were only achieved after the records in the equivocal zone were removed from consideration. We could never claim, for example, that the naive Bayes model shown above achieved "98.6 percent accuracy" in labeling expensive cars, without very clearly noting that this occurred only after the data had been filtered so that only the records most likely to belong to the "expensive" group remained.
In the business world, however, models are not built for hackathons or for publications. Instead, they are built with practical purposes in mind – and the elimination of ‘equivocal zone’ records from large groups of records can deliver very important practical benefits. For answering business questions such as whether a customer will renew a subscription, whether a prospect will make a purchase, or whether a current customer will upgrade to a premium plan, the probabilities that undergird model classification predictions can offer hugely informative clues about where a business’s efforts should be focused in order to deliver the greatest possible return on investment (ROI). Acknowledging that a 51 percent likelihood and a 99 percent likelihood are quite meaningfully different – even though they result in the same predicted categorical outcome – is a huge step towards generating the best possible ROI from any classification model.

References
- Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013, p. 254.
- Lepchenkov, Kirill. Used Cars Catalog (CC0 public domain dataset). Kaggle, December 2, 2019. https://www.kaggle.com/lepchenkov/usedcarscatalog