The world’s leading publication for data science, AI, and ML professionals.

Electric Cars In the Netherlands: Exploratory Data Analysis with Python

Data analysis and visualization with Python, Pandas, and Bokeh

Smart EQ Car, Image Source https://en.wikipedia.org/wiki/Smart_electric_drive
Smart EQ Car, Image Source https://en.wikipedia.org/wiki/Smart_electric_drive

When was the first electric car registered? (Spoiler: it was much earlier than most people may think.) Which Cars are more expensive, the electric Porcshe or Jaguar? Exploratory data analysis (EDA) is not only an important part of building every data pipeline, but it is also a pretty interesting process. In this article, I will use the Dutch RDW (Netherlands Vehicle Authority) public dataset to find information about electric cars. We will see which data can be extracted and displayed with Python, Pandas, and Bokeh.

Let’s get started.

Loading the data

The RDW ("Rijks Dienst Wegverkeer", https://www.rdw.nl) is a Dutch organization that handles approvals and registration of motorized vehicles and driving licenses in the Netherlands. As a public governmental institution, it has its data available to everyone. Most interesting for us is the "Gekentekende voertuigen" ("Vehicles with license plates") dataset. It is available for free under a Public Domain license and can be downloaded from opendata.rdw.nl. The file size is about 10 GB; it contains information about all vehicles, registered in the Netherlands since 1952. Processing a file of such a size can also be a challenge – which makes the task more interesting.

I will use Jupyter Lab, in this case, it is more convenient than using a standard IDE because reloading the 10 GB file every time a project is starting not looks like a good idea. Also, I will use Pandas for processing and Bokeh for visualization. First, let’s import the needed libraries:

import os
import pandas as pd
import numpy as np
import datetime

from bokeh.io import show, output_notebook, export_png
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource, LabelSet, Whisker
from bokeh.palettes import *
output_notebook()

Now we are ready to load the dataset. Let’s try a "naive" approach first:

filename = "Open_Data_RDW__Gekentekende_voertuigen.csv"
df = pd.read_csv(filename)
display(df)

After running this code, the PC is freezing for about 30 seconds… and the Python kernel is crushing. Oops. It’s not only loading slowly, but we also don’t have enough memory. At least on my computer, 32 GB of RAM was not enough for that task.

Well, if we cannot load the file in memory, we can read it line by line; this approach is well known since the time of IBM mainframes and tape drives. Let’s read the first lines of the file and see what is inside:

filename = "Open_Data_RDW__Gekentekende_voertuigen.csv"
with open(filename, 'r') as f:
    header_str = f.readline()
    print(header_str)
    for _ in range(10):
        print(f.readline())

The result looks like this:

Kenteken,Voertuigsoort,Merk,Handelsbenaming,Vervaldatum APK,Datum tenaamstelling,Bruto BPM,Inrichting,Aantal zitplaatsen,Eerste kleur,Tweede kleur,Aantal cilinders,Cilinderinhoud,Massa ledig voertuig,Toegestane maximum massa voertuig,Massa rijklaar,Maximum massa trekken ongeremd,Maximum trekken massa geremd,Datum eerste toelating,Datum eerste tenaamstelling in Nederland,Wacht op keuren,Catalogusprijs,WAM verzekerd,Maximale constructiesnelheid,Laadvermogen,Oplegger geremd,Aanhangwagen autonoom geremd,Aanhangwagen middenas geremd,Aantal staanplaatsen,Aantal deuren,Aantal wielen,Afstand hart koppeling tot achterzijde voertuig,Afstand voorzijde voertuig tot hart koppeling,Afwijkende maximum snelheid,Lengte,Breedte,Europese voertuigcategorie,Europese voertuigcategorie toevoeging,Europese uitvoeringcategorie toevoeging,Plaats chassisnummer,Technische max. massa voertuig,Type,Type gasinstallatie,Typegoedkeuringsnummer,Variant,Uitvoering,Volgnummer wijziging EU typegoedkeuring,Vermogen massarijklaar,Wielbasis,Export indicator,Openstaande terugroepactie indicator,Vervaldatum tachograaf,Taxi indicator,Maximum massa samenstelling,Aantal rolstoelplaatsen,Maximum ondersteunende snelheid,Jaar laatste registratie tellerstand,Tellerstandoordeel,Code toelichting tellerstandoordeel,Tenaamstellen mogelijk,Vervaldatum APK DT,Datum tenaamstelling DT,Datum eerste toelating DT,Datum eerste tenaamstelling in Nederland DT,Vervaldatum tachograaf DT,Maximum last onder de vooras(sen) (tezamen)/koppeling,Type remsysteem voertuig code,Rupsonderstelconfiguratiecode,Wielbasis voertuig minimum,Wielbasis voertuig maximum,Lengte voertuig minimum,Lengte voertuig maximum,Breedte voertuig minimum,Breedte voertuig maximum,Hoogte voertuig,Hoogte voertuig minimum,Hoogte voertuig maximum,Massa bedrijfsklaar minimaal,Massa bedrijfsklaar maximaal,Technisch toelaatbaar massa koppelpunt,Maximum massa technisch maximaal,Maximum massa technisch minimaal,Subcategorie Nederland,Verticale belasting koppelpunt getrokken voertuig,Zuinigheidsclassificatie,Registratie datum goedkeuring (afschrijvingsmoment BPM),Registratie datum goedkeuring (afschrijvingsmoment BPM) DT,API Gekentekende_voertuigen_assen,API Gekentekende_voertuigen_brandstof,API Gekentekende_voertuigen_carrosserie,API Gekentekende_voertuigen_carrosserie_specifiek,API Gekentekende_voertuigen_voertuigklasse
85XXXA,Personenauto,VOLKSWAGEN,CALIFORNIA,20230702,20220915,10437,kampeerwagen,,GROEN,Niet geregistreerd,5,2461,2088,2800,2188,700,2000,20010626,20010626,Geen verstrekking in Open Data,,Ja,,,,,,,0,4,0,0,,0,0,M1,,,r. in watergoot v. voorruit,2800,,,e1*96/79*0066*10,AJTCKX0,N1P00J2SGFM52B010U,1,0.03,292,Nee,Nee,,Nee,4500,0,0.00,2022,Logisch,00,Ja,07/02/2023 12:00:00 AM,09/15/2022 12:00:00 AM,06/26/2001 12:00:00 AM,06/26/2001 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,https://opendata.rdw.nl/resource/3huj-srit.json,https://opendata.rdw.nl/resource/8ys7-d773.json,https://opendata.rdw.nl/resource/vezc-m2t6.json,https://opendata.rdw.nl/resource/jhie-znh9.json,https://opendata.rdw.nl/resource/kmfi-hrps.json
85XXXB,Personenauto,PEUGEOT,3*RFN*,20230920,20210224,5162,hatchback,5,ZWART,Niet geregistreerd,4,1997,1194,1719,1294,625,1300,20010720,20010720,Geen verstrekking in Open Data,,Ja,,,,,,,4,4,0,0,,420,0,M1,,,op r. schroefveerkoker onder motorkap,1719,,,e2*98/14*0244*00,C,B,0,0.08,261,Nee,Nee,,Nee,3019,0,,2022,Logisch,00,Ja,09/20/2023 12:00:00 AM,02/24/2021 12:00:00 AM,07/20/2001 12:00:00 AM,07/20/2001 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,D,,,https://opendata.rdw.nl/resource/3huj-srit.json,https://opendata.rdw.nl/resource/8ys7-d773.json,https://opendata.rdw.nl/resource/vezc-m2t6.json,https://opendata.rdw.nl/resource/jhie-znh9.json,https://opendata.rdw.nl/resource/kmfi-hrps.json
...
85XXXN,Personenauto,NISSAN,NISSAN MURANO,20240106,20111126,18921,stationwagen,5,ZWART,Niet geregistreerd,6,3498,1833,2380,1933,750,1585,20081206,20081206,Geen verstrekking in Open Data,,Ja,,,,,,,4,4,0,0,,484,0,M1,,,r. voorzitting by dwarsbalk,2380,Z51,,e1*2001/116*0478*00,A,A01,0,0.1,283,Nee,Nee,,Nee,3965,0,,2023,Logisch,00,Ja,01/06/2024 12:00:00 AM,11/26/2011 12:00:00 AM,12/06/2008 12:00:00 AM,12/06/2008 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,E,,,https://opendata.rdw.nl/resource/3huj-srit.json,https://opendata.rdw.nl/resource/8ys7-d773.json,https://opendata.rdw.nl/resource/vezc-m2t6.json,https://opendata.rdw.nl/resource/jhie-znh9.json,https://opendata.rdw.nl/resource/kmfi-hrps.json

As we can see, there are plenty of different data fields, and we actually don’t need all of them. About every car, I want to know only its type, license plate, model name, price, and registration date. This database is old enough, and there is no field indicating if the car is electric or not. But at least, there is a field, containing the "Number of cylinders", which can help us to exclude the cars which are not electric.

Now we have only 7 fields to load, and in Pandas, we can specify the columns list, which drastically reduces the data size. The second trick is to specify the pd.UInt32Dtype to columns ‘Number of cylinders’ and ‘Price’. I also want to see only "personal" cars ("Personenauto" in Dutch), and not trucks or buses:

cols = ['Kenteken', 'Voertuigsoort', 'Merk', 'Handelsbenaming', 'Aantal cilinders', 'Catalogusprijs']
cols_date = ['Datum eerste tenaamstelling in Nederland']

filename = "Open_Data_RDW__Gekentekende_voertuigen.csv"
df = pd.read_csv(filename, usecols=cols + cols_date, parse_dates=cols_date, 
                 dtype={"Catalogusprijs": pd.UInt32Dtype(), 
                        "Aantal cilinders": pd.UInt32Dtype()})
display(df)

df = df[df['Voertuigsoort'] == 'Personenauto']
df.info(memory_usage="deep")

Now the file was loaded correctly, and as the "info" method shows, the memory usage is 2.5 GB:

Dataset information, Image by author
Dataset information, Image by author

But because of the large file size, data loading still takes a long time. The easiest way is to save the filtered dataset as a new file, and to use this file for further experiments:

df.to_csv("Open_Data_RDW__Gekentekende_voertuigen_short.csv", sep=',', 
          encoding='utf-8')

This file has only a 580 KB size, which is much smaller than the original 10 GB, and its loading does not cause any problems.

We also don’t need a "Voertuigsoort" field anymore, and dropping this column will release some RAM and screen space. And as a last step, let’s translate data fields to English from Dutch – it is not mandatory for analysis but will be more convenient to readers:

df = df.drop('Voertuigsoort', axis=1)

translations_dict_en_nl = {
                           'Kenteken': 'License plate', 
                           'Merk': 'Model', 
                           'Handelsbenaming': 'Trade name', 
                           'Aantal cilinders': 'Number of Cylinders', 
                           'Catalogusprijs': 'Catalog price',
                           'Datum eerste tenaamstelling in Nederland': 'First registration NL', 
                          }

df.rename(translations_dict_en_nl, axis='columns', inplace=True)

And now we are ready to go.

Basic analysis

In the beginning, let’s see the main properties of the dataset, like data samples and dimensionality:

display(df)

display(df.shape[0])

display(df.isna().sum())

The display(df) method shows us the first and last rows of the dataset, so we can see what the data looks like. The second line shows us the total amount of records, which can be useful for calculations, and the last request will return the number of Null values per column.

The output looks like this:

Dataframe properties, Image by author
Dataframe properties, Image by author

We have 9,487,265 records, where each car has a license plate, model, and registration date (these fields are probably mandatory for the registration), but the other fields, like "Trade name" or "Catalog price" are missing for some cars. Technically, we don’t need any cleaning right now, but for some requests (like price distribution) we should remove Null values before making the request.

As an example of this approach, let’s sort the data, to see the most expensive and the cheapest cars in the Netherlands:

df[df['Catalog price'].notna()].sort_values(by=['Catalog price'], 
                                            ascending=False)
Dataframe sorted by price, Image by author
Dataframe sorted by price, Image by author

The result is interesting. In first place is "PEUGEOT 5008" with a €9,700,305 price, which is weird because its price in Google is about €41,000 – maybe it is an error in the database, or the owner spent a lot of money for upgrades 😉 Or maybe it is a brand-new electric "PEUGEOT E-5008", but it was going to be released only in 2024. Anyway, already at this point we can see that the public data is not always consistent. For the "PORSCHE CAYENNE" in 2nd place, the price is probably real. For others, it is hard to tell, I’m not an expert in luxury cars, if someone knows more, please write in the comments below. As for the cheapest cars, they have a €1 price. Probably they were imported in NL as "parts" from a second-hand market, so the owner declared the lowest possible value.

Data transform

Let’s check how good our data is for further analysis. The first car model in the list is "PEUGEOT", let’s display all cars with the same name. The "unique" method will return only unique values in the column:

display(df[df['Model'].str.contains("PEUGEOT", case=False)]['Model'].unique())

The output looks like this:

"Peugeot" models request, Image by author
"Peugeot" models request, Image by author

We can see that the model names in the database are not consistent. Some cars have the name "PEUGEOT", other cars were saved as "PEUGEOT BOXER" or "PEUGEOT/MOBILCAR". To group the cars by model name, the first word "PEUGEOT" is enough, and the right part of the name can be removed. It’s also better to convert all characters to the upper case because the car model in theory can be written as "PEUGEOT" or "Peugeot". And just to be sure that there are no extra characters, I’ll call the "strip" method, which removes extra spaces from the string. I created a method "name_normalize", which is doing that type of conversion:

def model_normalize(s_val: str):
    """ Convert 'PEUGEOT BOXER/GLOBE-TRAVE ' to 'PEUGEOT' """
    if s_val and isinstance(s_val, str) and len(s_val) > 0:
        return s_val.replace("-", " ").replace("/", " ").split()[0].upper().strip()
    return None

A more flexible conversion can be made using regular expressions, but this code looks enough for our task. When we have this method, we can convert all rows in the Pandas dataframe by using the "map" function:

df["Model"] = df['Model'].map(lambda s: model_normalize(s))

Let’s now process the "Trade name" field:

"Trade name" samples, Image by author
"Trade name" samples, Image by author

As we can see, most of the cars have the manufacturer name in the first field and the trade name in the second field, like "VOLVO" + "C30" in the screenshot. But some other cars have the manufacturer name duplicated in both columns, like "NISSAN" + "NISSAN MURANO". Let’s make it more consistent by removing the duplicates, and as a bonus, it will also make the dataset a bit smaller:

def name_normalize(model: str, trade_name: str):
    """ Remove duplicates and convert the name to upper case """
    if isinstance(trade_name, str) and len(trade_name) > 0:
        name = trade_name.upper().strip()
        # Remove duplicates from model and trade name: 
        # ("TESLA", "TESLA MODEL 3") => ("TESLA", "MODEL 3")
        if name.split()[0] == model:
            # "TESLA MODEL 3" => [TESLA, MODEL, 3] => "MODEL 3"
            return ' '.join(name.split()[1:])  
        return name
    return None

The isinstance check here is important because the "Trade name" field is optional, and some records have None instead of the string, getting a len(None) will obviously make the method crash.

To update the dataframe, we can use an "apply" method in Pandas:

df["Trade name"] =  df.apply(lambda x: name_normalize(model=x['Model'], 
                                           trade_name=x['Trade name']), 
                                           axis=1)

Let’s check the result. Having this data, we can extract some useful information, for example, let’s see Top-50 of the most popular cars in the Netherlands:

n_top = 50
all_models = df_models["Model"].to_numpy()
models, counts = np.unique(all_models, return_counts=True)
cs = counts.argsort()  # Take sort indexes from 'counts' array
x = counts[cs][-n_top:]
y = models[cs][-n_top:]

p = figure(y_range=y, width=1400, height=600, 
           title="Top-%d cars in the Netherlands (data 2023)" % n_top)
p.hbar(right=x, y=y, height=0.8, color=Viridis256[:n_top])
p.xgrid.grid_line_color = None
p.x_range.start = 0
p.below[0].formatter.use_scientific = False
show(p)

The np.unique method can calculate amounts per model, and we don’t need to do it manually. The second tricky part here is to sort two arrays (car amounts and car models) simultaneously, we get the sort indexes sequence using the counts.argsort method, then we apply the same indexes to the "models" array.

The Bokeh library is good for drawing charts like this:

Top car models bar chart, Image by author
Top car models bar chart, Image by author

The next part of the data transformation is more tricky – we need to determine if the car is electric or not. It is tricky because every manufacturer has its own naming system, and there is no universal rule for that. For some brands, like "TESLA", it’s easy – all Tesla cars are electric. For other models like "HYUNDAI IONIQ" or "NISSAN LEAF" the specific keyword is present in the name, and for some other cars, there is no clear rule at all ("HONDA E" is electric, but "HONDA EE8" is not).

Using Google search and car manufacturer’s websites, I’ve created this dictionary:

electric_cars = {
    "AIWAYS": ['U5', 'U6'],
    "AUDI": ['E-TRON'],
    "BMW": ['I3', 'I4', 'I7', 'IX'],
    "CITROEN": ['E-C4'],
    "FIAT": ['500E', 'ELETTRA'],
    "FORD": ['MACH-E'],
    "HONDA": ['"E"', '"E ADVANCE"'],  
    "HYUNDAI": ['IONIQ', 'KONA'],
    "JAGUAR": ['I-PACE'],
    "KIA": ['NIRO', 'E-SOUL'],
    "LEXUS": ['RZ'],
    "LUCID": ['AIR'],
    "MAZDA": ['MX-30'],
    "MERCEDES": ['EQA', 'EQB', 'EQC', 'EQS', 'EQV'],
    "MG": ['ZS EV'],
    "MINI": ['COOPER SE'],
    "NISSAN": ['ALTRA', 'ARIYA', 'EVALIA', 'LEAF', 'NUVU'],
    "OPEL": ['AMPERA-E', 'COMBO-E', 'CORSA-E', 'MOKKA-E', 'VIVARO-E', 'ZAFIRA-E'],
    "PEUGEOT": ['E-208', 'E-2008', 'E-RIFTER', 'E-TRAVELLER'],
    "POLESTAR": ['2', '3'],
    "PORSCHE": ['TAYCAN'],
    "RENAULT": ['MASTER', 'TWINGO', 'KANGOO ELEC', 'ZOE'],
    "SKODA": ['ENYAQ'],
    "SMART": ['EQ'],
    "TESLA": [''],
    "TOYOTA": ['BZ'],
    "VOLKSWAGEN": ['ID.3', 'ID.4', 'ID.5', 'E-GOLF'],
    "VOLVO": ['C40', 'XC40']
}

Now I can easily check if the specific keyword is present for the car model, or if there is a direct match for the model name. And as a last check, I can use the number of cylinders that I have in the database. If this value is greater than zero, then we know that the car is not fully electric. The final method (well, maybe not final, but it works more or less well for our task) looks like this:

def is_electric(model: str, trade_name: str, cylinders: int):
    """ Determine if the car is electric """
    if isinstance(cylinders, int) and cylinders > 0:
        return False
    for e_model, e_names in electric_cars.items():
        if model == e_model:
            for e_name in e_names:
                if trade_name and (e_name in trade_name or e_name.replace('"', '') == trade_name):
                    return True
                if trade_name is None and len(e_name) == 0:
                    return True
    return False

As a sort of unit test, we can use this method with different parameters:

print(is_electric("AUDI", "E-TRON S SPORTBACK 55 QUATTRO"))
print(is_electric("AUDI", "80 COUPE"))

print(is_electric("HONDA", "E"))
print(is_electric("HONDA", "EE 8"))
print(is_electric("HONDA", "INTEGRA TYPE R"))

print(is_electric("NISSAN", "MICRA"))
print(is_electric("NISSAN", "LEAF 62KWH"))

print(is_electric("TESLA", "ANY"))

Using this function we can easily add a new field to the dataframe and keep only electric cars:

df["Electric"] = df.apply(lambda x: is_electric(model=x['Model'], 
                                       trade_name=x['Trade name'], 
                                       cylinders=x['Number of Cylinders']), 
                          axis=1)

df_electric = df.query("Electric == True").drop(columns=['Number of Cylinders',
                                                         'Electric'])

If everything was done correctly, we should get something like this:

Electric cars dataframe, Image by author
Electric cars dataframe, Image by author

Analysis

Now we are finally ready to start our analysis of electric cars in the Netherlands.

As a warm-up, it is easy to calculate the mean, standard deviation, and percentiles:

print(f"Cars total: {df.shape[0]}")
print(f"Cars electric: {df_electric.shape[0]} ({100*df_electric.shape[0]/df.shape[0]:.2f}%)")

# Calculate percentiles - all cars

prices = df[df['Catalog price'].notna()]['Catalog price'].to_numpy()

print("Price mean:", np.mean(prices))
print("Price standard deviation:", np.std(prices))
print("Percentiles [5, 25, 50, 75, 95]:", np.percentile(prices, [5, 25, 50, 75, 95]))

# Calculate percentiles - electric cars

prices = df_electric[df_electric['Catalog price'].notna()]['Catalog price'].to_numpy()

print("Electric cars price mean:", np.mean(prices))
print("Electric cars price standard deviation:", np.std(prices))
print("Electric cars percentiles [5, 25, 50, 75, 95]:", np.percentile(prices, [5, 25, 50, 75, 95]))

The output looks like this:

Mean, standard deviation, and percentiles results, Image by author
Mean, standard deviation, and percentiles results, Image by author

Here we can see that there are 9,487,265 cars in the Netherlands, and only 278,141 of them (2.93%) are electric. Well, in 2023 we are only at the beginning of that era. According to the rvo.nl report, there were 1.22% of electric cars in 2019, 1.98% in 2020, and 2.55% in 2021, so the numbers are growing and it would be interesting to compare the results 10–20 years later. As for the prices of non-electric cars, the 95th percentile is €71,381. It means that 95% of cars in the Netherlands have a price lower than this value. Electric cars are in a much more "premium" segment – the mean price is €49,975 and the 95th percentile is €106,989.

When did the first electric cars appear in the Netherlands, and how has this amount changed over the years? It is easy to answer this question. Let’s build a bar graph of car registrations per quarter. To do this, I need to create a new Quarter field in the Pandas dataframe and group data by this field. We can extract the quarter number from a "datetime" object in Python, but Pandas already have all the needed converters:

reg_dates = df_electric[["Model", "Trade name", "First registration NL"]].copy()
reg_dates["Quarter"] = reg_dates['First registration NL'].dt.to_period('Q')

data_per_year = reg_dates.groupby(['Quarter'], as_index=False).size()

dates = data_per_year['Quarter']
amount = data_per_year['size']

p = figure(x_axis_type='datetime', width=1600, height=500, 
           title=f"Electric car registrations in the Netherlands, 1992-2022")
p.vbar(x=dates, top=amount, width=datetime.timedelta(days=3*22), line_color='black')
p.xaxis[0].ticker.desired_num_ticks = 30
p.xgrid.grid_line_color = None
show(p)

The result is interesting:

Electric car registrations, Image by author
Electric car registrations, Image by author

The first (and the single one in the whole country for the next 15 years!) electric car was registered in the Netherlands in 1992, more than 30 years ago. We can easily find in the dataset that it was a Fiat Panda Elettra, a small two-seat car with a 70 km/h maximum speed, 100 km range, and 12 6V lead-acid batteries as a power source. The 3 next Tesla Roadster cars were registered only in 2009. The second interesting thing here is a seasonality pattern – it is easy to see that the biggest number of registrations occurs at the end of each year (the more detailed graph will be at the end of this article).

Having the electric cars dataframe, it is also easy to see a price distribution:

df_prices = df_electric[df_electric['Catalog price'].notna()]
prices_to_display = df_prices.query('`Catalog price` < 170000')['Catalog price'].to_numpy()

hist_e, edges_e = np.histogram(prices_to_display, density=False, bins=50)

# Draw
p = figure(width=1400, height=500, 
           title=f"Electric cars price distribution in the Netherlands ({df_electric.shape[0]} cars total)")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start = 15000
p.x_range.end = 150000
p.y_range.start = 0
p.xaxis[0].ticker.desired_num_ticks = 20
p.left[0].formatter.use_scientific = False
p.below[0].formatter.use_scientific = False
p.xaxis.axis_label = "Price, EUR"
p.yaxis.axis_label = "Amount"
show(p)

The output looks like this:

Electric car price distribution, Image by author
Electric car price distribution, Image by author

As we can see, the distribution is mostly right-skewed. Before calculating the histogram I removed the outliers from the right, otherwise, the picture looks mostly empty because of 2–3 cars with >1 M€ price.

Using the histogram we can see that most electric cars in the Netherlands have a price in the €40–70K range, but we don’t know what models they are. We can explore prices in more detail – let’s group prices by model name:

df_price = df_electric[df_electric['Catalog price'].notna()]

def q0(x):
    return x.quantile(0.01)

def q1(x):
    return x.quantile(0.25)

def q3(x):
    return x.quantile(0.75)

def q4(x):
    return x.quantile(0.99)

agg_data = {'Catalog price': ['size', 'min', q0, q1, 'median', q3, q4, 'max']}

prices = df_price[['Model', 'Catalog price']].groupby('Model', as_index=False).agg(agg_data)
display(prices)

Here I grouped all cars by model and aggregated that data into a table:

Electric cars grouped by model name, Image by author
Electric cars grouped by model name, Image by author

Let’s draw this data in the form of a box plot:

prices = prices.sort_values(by=('Catalog price', 'median'), ascending=True)

models = prices["Model"].to_numpy()
q1 = prices["Catalog price"]["q1"].to_numpy()
q3 = prices["Catalog price"]["q3"].to_numpy()
v_min = prices["Catalog price"]["q0"].to_numpy()
v_max = prices["Catalog price"]["q4"].to_numpy()

palette = (Inferno10 + Magma10 + Plasma10 + Viridis10)[:models.shape[0]]
source = ColumnDataSource(data=dict(models=models, 
                                    bottom=q1, 
                                    top=q3, 
                                    color=palette, 
                                    lower=v_min, 
                                    upper=v_max))

p = figure(x_range=models, width=1400, height=500, 
           title=f"Electric cars price distribution in the Netherlands") 
whisker = Whisker(base="models", upper="upper", lower="lower", source=source)
p.add_layout(whisker)
p.vbar(x='models', top='top', bottom='bottom', width=0.9, color='color', 
       line_color="black", source=source)

p.left[0].formatter.use_scientific = False
p.y_range.start = 0
show(p)

Here I sorted all models by price. In a combination with a box plot, we can see a clear picture of prices distribution:

Electric car manufacturers and prices box plot, Image by author
Electric car manufacturers and prices box plot, Image by author

Not surprisingly, at the top of the distribution are the famous luxury cars such as Porsche, Jaguar, or Lucid (which I’ve never heard about before, by the way). But what is more surprising, the cheapest cars from this distribution are not the most popular. For example, there are only 1,269 "Smart" and 15,414 "Renault" cars in the Netherlands, which is fewer than 25% compared to the 65,885 "Tesla" models. I was even thinking that there is an error in the graph, but the cars sales UK distribution for 2021 look in general, the same.

The maximum range in km can be an important factor in choosing an electric car, and it would be interesting to build a graph displaying a correlation between range and price. But alas, there is no "Range" field in the RDW dataset. Some values for different electric cars can be obtained from the Kaggle dataset. But practically, there is no direct match between both tables. For example, there is an "E-TRON SPORTBACK 50 QUATTRO" model in the dataset. In the RDW data, there are two cars with similar names, "E-TRON SPORTBACK 50" and "Q4 SPORTBACK 50 E-TRON", but I have no idea if these models are actually the same. Every letter in the model naming can have its own meaning, and without being a car expert, it is impossible to match all the names for all car manufacturers. Though readers who wish can try it on their own.

At least, having a cars price and registration date from RDW data, we can build the scatter plot:

df_data = df_electric[df_electric['Catalog price'].notna()]

models = df_data['Model'].unique()

p = figure(x_axis_type='datetime', width=1400, height=800, 
           title="Electric car prices and registrations in the Netherlands")
palette = (Inferno10 + Magma10 + Plasma10 + Viridis10)[:len(models)]
draw_ratio = 15
for ind, model in enumerate(models):
    df_model = df_data[df_data['Model'] == model]
    if df_model.shape[0]//draw_ratio == 0:
        continue

    df_model = df_model.sample(df_model.shape[0]//draw_ratio)
    x = df_model['First registration NL'].to_numpy()
    y = df_model['Catalog price'].to_numpy()
    p.scatter(x, y, size=2, color=palette[ind], legend_label=model[:3]) 

p.left[0].formatter.use_scientific = False
p.legend.orientation = "horizontal"
p.legend.location = "top_center"
p.legend.click_policy = "mute"
p.xaxis[0].ticker.desired_num_ticks = 20
show(p)

Here I group all cars by the model name, then I take a 1:15 random sample from each subset (we obviously cannot draw all 216316 cars on one plot). A really nice feature in Bokeh, is the ability to "mute" labels by clicking them, it allows, for example, to highlight only Tesla cars with a mouse click:

Prices and registration dates scatter plot, Image by author
Prices and registration dates scatter plot, Image by author

It is pretty interesting to see a seasonality pattern – it looks like car registrations occur in "waves" 4 times per year, and a large number of new cars are registered at the end of each year. Maybe some customers are especially waiting for end-of-the-year sales, or just want to start a new year having a new car? The second interesting thing is plenty of electric cars in the medium price range, appeared in 2019. Before this year, the possible choice was only between cheap and premium models (relatively cheap, of course, as most electric cars are in the premium segment, compared to the gasoline ones).

Conclusion

As we can see, using the dataset from the "real world" gives some challenges. It’s not only the data size, but incomplete or inconsistent data, even syntax errors in the data fields (I saw a "Tesla Raodster" instead of "Tesla Roadster", for example), and so on. At the same time, it is much more fun to explore this data and find interesting patterns in it, and I suggest readers do similar experiments on their own. This analysis was made for fun and self-education purposes, and there is obviously, enough room for improvement, like finding a better detection method if the car is electric or not, finding a driving range for different models, and so on. And it is nice that the Dutch government keeps this data public and available to everyone. If anyone knows about datasets from other countries, please add the links in the comments below, and I will try to do a similar analysis and compare the results in the next post.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles will be published, as well as full access to thousands of stories from other authors.

Thanks for reading.


Related Articles