Overviewing the Global Chocolate Trade

Using network analytics to explore international trade data from UN Comtrade

In this article, I explore the UN Comtrade international trade database by focusing on the "Chocolate and other food preparations containing cocoa" trade category. While this particular focus gives the article a clear direction in a quite literal niche market, the analytical steps and methodological layers are generic: building on them, one can quickly analyse any sort of international trade relationship, from energy to weaponry. By incorporating the spatiotemporal dimensions of trade, and optionally enriching it with information on, for example, political events and international conflicts, one can also link such events to their macro-economic footprint. While the implications of international trade data analysis are far-reaching, let's now use network and exploratory data science to dive into the top chocolate exporters and their trading relationships.

All images in this article, unless stated otherwise in the caption, were created by the author.

1. Data collection

Once I logged into the Comtrade website, I arrived at its TradeFlow interface – a nice online platform where I could easily build queries to obtain international trade data of any sort.

In that interface, one can freely select the exporting parties, the traded goods, the time period, and a few more features characteristic of trade. For this article, I picked the group of items marked HS1806 ("Chocolate and other food preparations containing cocoa"), using the terminology of the so-called Harmonized System. Additionally, I focused on exported goods only, between all possible Reporters (who export) and Partners (who receive), over a time frame covering the past 20 years. Note: one query can cover 12 years at most, so I ran two queries, each covering 10 years. Once a query completes, the web-based platform automatically downloads the resulting .csv file.

2. Explore the data

Once I had downloaded the two time periods into two .csv files, I started to explore them. The first thing I do is combine the two files into one Pandas DataFrame and then explore its features:

import pandas as pd

df1 = pd.read_csv('chocolate_trade_2012_2022.csv', encoding='iso-8859-1')
df2 = pd.read_csv('chocolate_trade_2002_2012.csv', encoding='iso-8859-1')

# DataFrame.append was removed in pandas 2.0, so concatenate instead
df = pd.concat([df1, df2])
print(len(df))
df.head(5)

The output of this cell:

So, it seems that the merged DataFrame has more than 100k records! That is quite some data on sweets. Let’s see the features of each record:

print(df.keys())

The output of this cell:

As this cell shows, there are quite a few – 48, to be exact – columns in this spreadsheet. However, to uncover the main patterns behind chocolate trading, I will only focus on a few of them.

Namely, on the sending and receiving parties, whose English country names are stored in the ReporterDesc and PartnerDesc columns. I also add information on the value and quantity of the traded items – PrimaryValue, containing the total cash value of the transactions, and GrossWgt, the total physical mass of the transported sweets. Finally, I also keep part of the temporal information – in particular, RefYear – showing the year each transaction occurred.

Then, I create a cleaned version of the original DataFrame by only keeping the selected columns and dropping those records that do not correspond to any actual trade.

Note: once looking at the missing values, it turns out that there are many transactions with a PrimaryValue field where the GrossWgt is actually missing. This is a data quality issue, so I was careful with the filtering steps below.
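A one-liner makes this overlap explicit – a quick sketch checking how many records carry a price but no weight:

# fraction of records that have a PrimaryValue but lack the GrossWgt field
print(round((df.PrimaryValue.notna() & df.GrossWgt.isna()).mean(), 3))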

# compute the fraction of missing values for each selected feature
features_to_keep = ['PartnerDesc', 'ReporterDesc', 'PrimaryValue', 'GrossWgt', 'RefYear']

print('Get the fraction of missing values:')
for feat in features_to_keep:
    print(feat, round((len(df) - len(df.dropna(subset=[feat]))) / len(df), 3))

The output of this code block:

So it seems that about half the trading transactions miss the GrossWgt information – however, they still have a price tag!

# filter the data
print(len(df))
df = df[features_to_keep]
df2 = df.dropna(subset=['PrimaryValue'])
df2 = df2[df2.PrimaryValue > 0]
df2 = df2[df2.ReporterDesc != 'World']
df2 = df2[df2.PartnerDesc != 'World']
print(len(df2))
df2.head(3)

The outputs of this cell:

The filtering only decreased the size of the data by a few per cent, implying that very few countries are missing out on the chocolate export. Seems intuitive!

Now, create simple distribution plots to see what each of these continuous variables looks like:

import matplotlib.pyplot as plt

# this function does some nice formatting on the axes and labels
def format_axis(ax):
    for pos in ['right', 'top']:   ax.spines[pos].set_edgecolor('w')
    for pos in ['bottom', 'left']: ax.spines[pos].set_edgecolor('k')
    # tick.label was removed in newer Matplotlib; set tick label sizes via tick_params
    ax.tick_params(axis='x', length=6, width=2, colors='k', labelsize=12)
    ax.tick_params(axis='y', length=6, width=2, colors='k', labelsize=12)

f, ax = plt.subplots(1,4,figsize=(15,4))

ax[0].hist(df2['PrimaryValue'], bins = 20)
ax[1].hist(df2['GrossWgt'], bins = 20)
ax[2].hist(df2['RefYear'], bins = 8)
ax[3].scatter(df2['GrossWgt'], df2['PrimaryValue'], alpha = 0.1)  # axes match the labels set below

ax[0].set_ylabel('Number of records', fontsize = 14)
ax[0].set_xlabel('PrimaryValue', fontsize = 14)
ax[1].set_xlabel('GrossWgt', fontsize = 14)
ax[2].set_xlabel('RefYear', fontsize = 14)
ax[3].set_ylabel('PrimaryValue', fontsize = 14)
ax[3].set_xlabel('GrossWgt', fontsize = 14)

ax[0].set_yscale('log')
ax[1].set_yscale('log')

ax[3].set_yscale('log')
ax[3].set_xscale('log')

for aax in ax: format_axis(aax)

plt.tight_layout()

The output:

The first two figures show the log-scaled distributions of the cash value and gross weight of the trade transactions, each following a relatively unsurprising power-law-like trend, with many smaller transactions and a few extremely large ones.
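If one wants to go beyond eyeballing the histograms, the third-party powerlaw package offers a quick tail fit – a sketch, assuming the package is installed:

import powerlaw

# fit a heavy-tailed model to the transaction values and report the exponent
fit = powerlaw.Fit(df2['PrimaryValue'].values)
print(round(fit.power_law.alpha, 2), fit.xmin)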

Then comes the temporal evolution – the number of transactions per year – and a scatter plot indicating a high correlation between the total weight and the price of cocoa products: the more that is traded, the more it costs.
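To put a number on that correlation, here is a quick sketch (assuming NumPy is available) of the Pearson correlation between log values and log weights, using only the records where both fields are present:

import numpy as np

# Pearson correlation of log(value) vs log(weight) on complete records
both = df2.dropna(subset=['GrossWgt'])
both = both[both.GrossWgt > 0]
print(round(np.corrcoef(np.log(both.PrimaryValue), np.log(both.GrossWgt))[0, 1], 3))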

Now, let’s take a look at the extremes – which countries have been the top suppliers during the past two decades:

# select the numeric column before summing, then rank by total export value
df2.groupby(by = 'ReporterDesc')[['PrimaryValue']].sum().sort_values(by = 'PrimaryValue', ascending = False).head(10)

The result of this code block is the table of top exporter countries:

As the trade values are stored in current dollar value, I summed them over the 20-year window to come up with this top list. Knowing the global chocolate brands, I was somewhat reassured to see Germany, Belgium, and Switzerland at the top, while the USA probably made it there due to the sheer size of its market.

Now flip this around and see who the biggest customers of cocoa products are:
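Mirroring the exporter ranking, a minimal sketch of this flipped query simply groups by PartnerDesc instead:

# total value of cocoa products received by each partner country
df2.groupby(by = 'PartnerDesc')[['PrimaryValue']].sum().sort_values(by = 'PrimaryValue', ascending = False).head(10)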

3. Country rankings

After seeing the aggregated, 20-year top 10, let's see how the country rankings evolved and how the market changed over time:

# get the list of overall top 10 countries
top10 = df2.groupby(by = 'ReporterDesc')[['PrimaryValue']].sum().sort_values(by = 'PrimaryValue', ascending = False).head(10).index.to_list()
top10_ranks = {t : [] for t in top10}

# store the latest order here for the visualization
top10_latest = {}

# for each year, create a ranking, and store them in a dictionary of lists
# where each key is a country
for year in range(2002, 2023):

    df2_year = df2[df2.RefYear==year]
    df2_year = df2_year.groupby(by = 'ReporterDesc')[['PrimaryValue']].sum().sort_values(by = 'PrimaryValue', ascending = False)

    df2_year['rank'] = [1 + i for i in range(len(df2_year))]

    for c, r in df2_year.to_dict()['rank'].items():
        if c in top10_ranks:
            top10_ranks[c].append((year, r))
            if year == 2022:
                top10_latest[c] = r

top10_latest = [c for c, r in sorted(top10_latest.items(), key=lambda x: x[1])]
top10_ranks  

The output of this code block:

Now, let's visualise how this ranking has evolved and see what positions the overall top ten countries have taken during the past twenty years.

f, ax = plt.subplots(1,1,figsize=(15,5))

for country in top10_latest:

    ranks = top10_ranks[country]    
    y, r = zip(*ranks)
    ax.plot(y, r, 'o-', label = country, linewidth = 2)

format_axis(ax)
ax.set_ylim([13.5,0.5])
ax.set_xlim([2001,2023])
ax.legend(loc=(1.00, 0.14), frameon = False, handletextpad=.5, fontsize = 16)
ax.set_xlabel('Year', fontsize = 20)
ax.set_ylabel('Annual rank', fontsize = 20)

The graph tells us that the rule of Germany and Belgium has never been questioned. The Netherlands closely followed them, but Italy, slowly climbing since 2002, eventually took its third place, and has since slipped back to fifth.

At the extremes, France seems to have gone through the biggest fall, starting at position 3 and arriving at position 9, while Poland, coming up from as low as position 13, made it to 4th place by 2022!

4. The Trade Network

Trading is a classic example of network behaviour – actors connecting to each other via goods, services, and cash transactions. So, the natural next step in my explorative analysis is to build up the trade network of countries. For this, I am going to rely on the package called NetworkX.

This will be a directed network where each source node is an exporting country, while the target nodes are the destinations of the transactions. Additionally, each connection is weighted based on the dollar value of the total trade between each pair of countries during the past 20 years combined.

Additionally, I create a column Weight that is the log of the PrimaryValue.

import networkx as nx
import math

df2['Weight'] = [math.log(v) for v in df2.PrimaryValue.to_list()]

G = nx.from_pandas_edgelist(df2, source='ReporterDesc', target='PartnerDesc', edge_attr= 'Weight', create_using = nx.DiGraph)
G.number_of_nodes(), G.number_of_edges()
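One subtlety worth flagging: df2 holds one row per exporter-importer pair per year, and when from_pandas_edgelist meets a duplicate edge in a DiGraph, later rows simply overwrite the earlier Weight. To weight each link by the total 20-year trade, as described above, a minimal sketch would aggregate the pairs first (df_pairs and G_total are just illustrative names):

# sum the yearly transactions per (exporter, importer) pair,
# then log-transform the 20-year totals into edge weights
df_pairs = df2.groupby(['ReporterDesc', 'PartnerDesc'], as_index=False)[['PrimaryValue']].sum()
df_pairs['Weight'] = [math.log(v) for v in df_pairs.PrimaryValue.to_list()]
G_total = nx.from_pandas_edgelist(df_pairs, source='ReporterDesc', target='PartnerDesc', edge_attr='Weight', create_using=nx.DiGraph)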

This graph, built directly from df2, has 245 nodes and 11,741 links, so I generate it again, but filter out each transaction that is smaller than 0.1% of the largest one!

top_value = max(df2.PrimaryValue)
df3 = df2[df2.PrimaryValue>top_value*0.001]

G = nx.from_pandas_edgelist(df3, source='ReporterDesc', target='PartnerDesc', edge_attr= 'Weight', create_using = nx.DiGraph)
G.number_of_nodes(), G.number_of_edges()


Based on the outputs of this cell, the filtered graph has 51 nodes and 123 links connecting them! Let's export it into a Gephi file format and create the final visualisation in Gephi by applying a force layout, setting the node sizes to be proportional to the number of significant trading partners (degree in the filtered network), and scaling the link widths based on the total traded value between the connected parties. Additionally, I colour the nodes based on their network communities.
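A minimal sketch of that hand-off, assuming the filtered graph G from above (the degree attribute and file name are illustrative choices):

# store each node's degree so Gephi can map it to node size,
# then write the graph in Gephi's GEXF format
nx.set_node_attributes(G, dict(G.degree()), 'degree')
nx.write_gexf(G, 'chocolate_trade_network.gexf')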

The final visual:

A few more thoughts on the final network, visualised below, with the 30 highest-degree nodes labelled. As the colouring indicates, there are five communities – more densely interconnected subgraphs of the whole trade network, uncovering some rather unexpected, unofficial trade unions, for instance, one centred around the rising star of the field, Poland. Germany, France, and Italy seem to have built their own trading axis, while the Netherlands falls in the same community as the UAE. Interestingly, Belgium appears to be the largest node strongly linked to Asia, India, and Australia.
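The community colouring itself came from Gephi's modularity tool; for reference, recent NetworkX releases (2.8+) ship a Louvain implementation that can sketch a comparable partition on the undirected projection of the filtered graph:

from networkx.algorithms.community import louvain_communities

# Louvain partition on the undirected projection of the trade graph
communities = louvain_communities(G.to_undirected(), weight='Weight', seed=42)
print(len(communities), [len(c) for c in communities])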

Conclusion

In my opinion, Comtrade is a rich resource for analysing various aspects and dimensions of international trade, including volume-based and temporal statistics as well as network relationships. While this article focuses on the basics and aims to provide a starting point for trade data analysis, these data can easily be used to study macro-economic phenomena, such as the effects of international conflicts, sanctions, new regulations, and emerging trends in different industry domains.

