Notes from Industry

The magic of innovation has always been set off at the confluence of diverse knowledge pools, where stubborn scholars experiment with new ideas to push scientific frontiers. It is akin to a chemical reaction unleashed the moment knowledge areas merge like molecules, when crafty techniques and abstract theories are brought together in an inferno of ignition and alchemy.
The same goes for economics and data science, two professions that not only share common objectives but also draw from the same well of knowledge: statistics. Over the last few years, they have become increasingly entwined.
Einav and Levin (2014), in a paper titled ‘Economics in the age of big data’, called this development "the empirical turn". In their prognosis for the future, they further state that economics will increasingly adopt the techniques that the machine learning community uses to analyze big data. Indeed, the emergence of Python packages such as EconML, DoWhy and DeepIV bears testimony to the accuracy of their prediction.
In light of these interesting developments, the aim of this article is to illustrate how clustering, an unsupervised learning method, may be used to solve the problem of optimal taxation.
The Dataset

Working with a rather small dataset of 200 Somalia-based companies, this article will focus on finding hidden patterns that might aid in the setting of appropriate taxation rates. A quick look at the information table of the dataset shows that we have both rich geographical and financial information. There are 14 columns in total.
The most significant columns are revenue in US dollars, years of consecutive activity, growth potential, the industry in which the company operates, and its tax burden. Geographic location is of course important as well.
If you would like to get a feel for the data, or if you want to play around with it yourself, you can download it from Datapane below:
Interesting Columns:
- Growth_Potential_Index (0–100, with higher values indicating that the company could grow rapidly in the short to medium term)
- Monthly_Revenue ($) (the monthly gross revenue of the company)
- An_Revenue ($) (monthly revenue extrapolated over a year)
- Tax_Burden (the official tax rate that is deducted from the gross revenue)
- Tax_Income (the income local administrations derive from the company annually)
- Industry (rather obvious)
- Geo-information (we have longitude and latitude, city and province)
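To make the relationship between the revenue and tax columns concrete, here is a minimal sketch in Python with pandas. The three rows are made-up values and not taken from the dataset. The formula for Tax_Income (annual revenue times tax burden, with the burden stored as a fraction) is an assumption for illustration; the dataset may derive that column differently.

```python
import pandas as pd

# Made-up rows for illustration only; not values from the actual dataset.
df = pd.DataFrame({
    "Industry": ["Finance & Money Transfer", "Agriculture & Livestock", "Energy & Electricity"],
    "Monthly_Revenue ($)": [10_000.0, 4_000.0, 6_500.0],
    "Tax_Burden": [0.15, 0.40, 0.20],  # assumed to be stored as a fraction
})

# Annual revenue is the monthly revenue extrapolated over twelve months.
df["An_Revenue ($)"] = df["Monthly_Revenue ($)"] * 12

# Assumption: tax income is the annual revenue multiplied by the tax burden.
df["Tax_Income"] = df["An_Revenue ($)"] * df["Tax_Burden"]

print(df[["Industry", "An_Revenue ($)", "Tax_Income"]])
```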
This dataset can be used for a variety of purposes, but the purpose of this article will be to apply unsupervised learning algorithms to cluster the companies and identify whether they are over-taxed or under-taxed.
But before we go into exploring the topic of tax incidence, let us first conduct some exploratory analysis of the dataset.

Figure 1) shows that a large number of the companies are located on the Benadir coast and in Mogadishu, the capital of Somalia. These are followed by larger cities such as Bosaso, Baidoa and Beledweyne, of which the first is known for its bustling port and fishing industry, while the latter two are major agricultural trade centers in the southern part of the country.

Figure 2) illustrates that the vast majority of companies in our dataset operate in the financial services sector, reflecting Somalia’s mobile money and remittances boom of the post-civil war era (2000s and 2010s). The second subplot reveals cluster patterns: some firms generate high tax revenue, while others generate considerably less. The firms also differ in their growth potential.

Figure 3), lamentably, has few treasures of insight buried in it. The only eye-catching pattern seems to indicate that modern companies are in need of support, perhaps tax reductions to spur growth and unlock economies of scale: one of the scatter plots shows a cluster of companies with low growth potential and relatively meagre monthly revenue. Then again, this may be a case of tax evasion, which would nullify my suggestion of tax incentives to spur growth.

Figure 4) certainly offers the most profound insight detected thus far. We can clearly see, on account of the contrasting colours and sizes of the bars, that the Manufacturing & Construction sector generates the highest share of tax income in relation to annual revenue. Agriculture & Livestock also exhibits a disproportionate tax rate in comparison to the likes of Energy & Electricity and Finance & Money Transfer. The plot below lends further credibility to this observation.

In Figure 5), we compare growth potential by sector with tax burden by sector, in descending order. It is clear as day that Manufacturing & Construction and Agriculture & Livestock exhibit the highest tax burden share of all industries, even though they have the greatest growth potential. A possible explanation for this might be the fact that both sectors pay import and export taxes to the major ports of Mogadishu and Bosaso. Government intervention is therefore needed to relieve the burden of taxes and transportation costs, which undoubtedly hamper growth. In addition, clustering algorithms might help identify hidden patterns that need to be addressed.
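The sector comparisons behind Figures 4) and 5) boil down to a group-by aggregation. The following is a rough sketch of how such a table could be computed with pandas. The four rows are invented for illustration, and the helper column names (total_revenue, total_tax, mean_growth, tax_share) are hypothetical, not part of the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real dataset has 200 companies.
df = pd.DataFrame({
    "Industry": ["Manufacturing & Construction", "Manufacturing & Construction",
                 "Finance & Money Transfer", "Finance & Money Transfer"],
    "An_Revenue ($)": [100_000.0, 300_000.0, 200_000.0, 600_000.0],
    "Tax_Income": [40_000.0, 112_000.0, 30_000.0, 90_000.0],
    "Growth_Potential_Index": [80, 90, 50, 60],
})

# Aggregate revenue, tax income and growth potential per sector.
by_sector = df.groupby("Industry").agg(
    total_revenue=("An_Revenue ($)", "sum"),
    total_tax=("Tax_Income", "sum"),
    mean_growth=("Growth_Potential_Index", "mean"),
)

# Share of tax income in relation to annual revenue, in descending order.
by_sector["tax_share"] = by_sector["total_tax"] / by_sector["total_revenue"]
by_sector = by_sector.sort_values("tax_share", ascending=False)

print(by_sector)
```

Sorting by tax_share and reading mean_growth alongside it gives the same kind of side-by-side view as the figure: sectors at the top of the table carry the heaviest burden relative to what they earn.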
K-means Clustering
Machine learning models are commonly divided into three categories: 1) supervised learning, 2) unsupervised learning, and 3) reinforcement learning. In the first category, the outcome variable is already known, whether it is a continuous numeric variable or a class that must be predicted. Clustering falls under category two, where there is no obvious outcome to predict. The main objective is to find hidden patterns and simply organize disorganized data points into groups.
Thus the starting point of a clustering exercise is always a blob of data points, and our goal is to enable the model to find hidden patterns and assign each point to a group (cluster) based on similarity across columns. There are many applications of clustering algorithms, from biogenetics to customer segmentation to content recommendation engines.

As in Figure 6) above, we always start by inspecting the scattered points across the most important variables. The goal of this effort is to visually gauge the optimal number of clusters. Isn’t it obvious that the optimal number of clusters for the given situation is five? Observe the clusters in the top left and top right, a blob of closely clustered data points in the middle, and the two at the bottom left and bottom right. As an alternative to this primitive method, we could use the Elbow Method and calculate the inertia scores.

Inertia scores represent the within-cluster sum of squares (WCSS), that is, the sum of the squared distances between each data point and the center of the cluster it belongs to. For a given number of clusters, K-means seeks to minimize this WCSS; the Elbow Method then looks for the smallest number of clusters beyond which the WCSS stops falling sharply. The larger the WCSS, the looser the clusters, and vice versa. From Figure 7) we can see that the best balance between WCSS and the number of clusters is at five, where the line plot bends like an elbow.
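To illustrate what the inertia score measures, the sketch below computes the WCSS by hand and compares it with the inertia_ attribute of scikit-learn's KMeans for a range of cluster counts. The data are synthetic (five artificial blobs on a 0–100 scale standing in for the two plotted variables), and the wcss helper is a hypothetical function written for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two plotted features: five blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [80, 20], [80, 80], [50, 50]])
X = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])


def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())


inertias = {}  # WCSS as reported by scikit-learn
manual = {}    # WCSS recomputed by hand from labels and centroids
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    manual[k] = wcss(X, model.labels_, model.cluster_centers_)

# Print both values per k; the elbow is where the drop flattens out.
for k in inertias:
    print(k, round(inertias[k], 1), round(manual[k], 1))
```

On data like these, the score falls steeply up to five clusters and only marginally afterwards, which is the bend the Elbow Method looks for. Plotting inertias against k would reproduce a line plot of the kind described above.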

Having detected that five is the optimal number of clusters, we can now turn to implementing the K-means clustering model.
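A minimal sketch of such an implementation with scikit-learn could look as follows. The data here are synthetic blobs rather than the actual companies, and the choice of Monthly_Revenue ($) and Growth_Potential_Index as the two clustering features is an assumption based on the scatter plots discussed above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data: five blobs in the revenue / growth plane.
rng = np.random.default_rng(42)
centers = np.array([[20, 20], [20, 80], [80, 20], [80, 80], [50, 50]])
points = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])

# Assumed feature columns; both are on a 0-100 scale in this toy example.
features = ["Monthly_Revenue ($)", "Growth_Potential_Index"]
df = pd.DataFrame(points, columns=features)

# Fit K-means with five clusters and store each company's cluster ID.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[features])

# Cluster sizes and the coordinates of the five cluster centers.
print(df["Cluster"].value_counts().sort_index())
print(pd.DataFrame(kmeans.cluster_centers_, columns=features).round(1))
```

One design note: K-means relies on Euclidean distance, so when revenue is measured in dollars and the growth index runs from 0 to 100, standardizing the features first (for example with scikit-learn's StandardScaler) is usually advisable. Note also that the cluster IDs K-means assigns are arbitrary and can change between runs.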
The outcome shows us the five neatly identifiable clusters with their IDs.

Conclusion
It is often the final task to give names to the clusters and communicate results to decision makers. In our case, the cluster names are as follows:
Cluster 0 (bottom right): High revenue but low growth potential (HL).
Cluster 1 (bottom left): Low revenue and low growth potential (LL).
Cluster 2 (top right): High revenue and high growth potential (HH).
Cluster 3 (top left): Low revenue but high growth potential (LH).
Cluster 4 (middle): Medium revenue and medium growth potential (MM).
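The naming scheme above can also be derived programmatically from the cluster centers. The sketch below uses hypothetical centroid coordinates on a 0–100 scale, ordered to match the cluster IDs listed above, and assumed cut-offs of 35 and 65 for the low, medium and high bands. None of these numbers come from the actual model.

```python
import pandas as pd

# Hypothetical centroids (0-100 scale), ordered to match cluster IDs 0-4 above.
centroids = pd.DataFrame({
    "revenue": [80, 20, 80, 20, 50],
    "growth": [20, 20, 80, 80, 50],
})


def level(value, low=35, high=65):
    """Map a centroid coordinate to L, M or H using assumed cut-offs."""
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "M"


# First letter describes revenue, second letter describes growth potential.
centroids["name"] = [
    level(r) + level(g) for r, g in zip(centroids["revenue"], centroids["growth"])
]

print(centroids)
```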
In terms of economic policy, the following recommendations can be drawn from this exercise:
- The Energy & Electricity sector falls under the LL Cluster 1. Firms in this sector are subject to a low tax burden and this should remain as is. The likelihood of this sector further developing is high in the socio-economic environment of contemporary Somalia.
- The Telecommunication sector falls under the category of HL Cluster 0. Firms in the sector are subject to a low tax burden, although they are among the highest revenue generating. It seems as though this sector is under-regulated. These companies should pay their fair share of taxation, which would give more room for policy to aid other, more ailing sectors.
- HH Cluster 2 is definitely the Manufacturing & Construction sector, currently subject to one of the highest tax burdens (around 38%). We identified earlier how this sector suffers from import duties on raw materials at the major ports. Increased revenue from the Telecommunication sector could permit authorities to lower this burden.
- The LH Cluster 3 could be identified as the Agriculture & Livestock sector, which is the backbone of the Somali economy. Much of the country’s workforce is employed in this sector. Besides that, it plays a pivotal role in food security and must thus be nurtured and protected from unreasonable duties and taxes at ports and airports. The high tax burden, currently close to 40%, has to come down to realize these strategic goals.
- The last cluster we have not yet discussed is the MM Cluster 4. It can be identified as the Finance & Money Transfer industry, which, as we have seen, has a very low tax burden of around 15%. For the sake of the more strategically important industries, policy makers are advised to increase this rate.
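Matching clusters to sectors, as done in the recommendations above, can be checked with a simple cross-tabulation. The following sketch uses invented cluster assignments and tax burdens (stored as fractions, which is an assumption) to show the idea. It does not reproduce the actual results.

```python
import pandas as pd

# Made-up cluster assignments for illustration; not results from the real dataset.
df = pd.DataFrame({
    "Industry": ["Telecommunication", "Telecommunication", "Energy & Electricity",
                 "Energy & Electricity", "Energy & Electricity", "Telecommunication"],
    "Cluster": [0, 0, 1, 1, 1, 1],
    "Tax_Burden": [0.12, 0.14, 0.18, 0.20, 0.22, 0.16],  # assumed fractions
})

# Rows are clusters, columns are industries, cells are company counts.
counts = pd.crosstab(df["Cluster"], df["Industry"])

# The industry with the most companies in each cluster.
dominant = counts.idxmax(axis=1)

# Average tax burden of the companies in each cluster.
mean_burden = df.groupby("Cluster")["Tax_Burden"].mean()

print(counts)
print(dominant)
print(mean_burden.round(3))
```

A table like counts also shows how cleanly a sector maps onto a cluster: if an industry is spread across several clusters, a single sector-wide recommendation deserves more caution.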
In this article, we’ve examined how economics and data science can be combined to gain a new perspective on pressing issues. We have explored the K-means clustering algorithm and applied it to the problem of optimal taxation. Of course, there are many more clustering algorithms out there, such as DBSCAN, Affinity Propagation, Mean Shift, and so on. But that is a story for another post.
Subscribe to the newsletter on my blog where I’m building up a readership for future publications. My Twitter handle is @warsame_words and I welcome feedback and constructive criticism – for the latter, LinkedIn is a welcome avenue. Thank you for accompanying me on this journey. Follow me on Medium to stay in tune with my latest data-related articles like these:
Data for Good – Somalia Drought Management
Data Analyst vs. Data Scientist