
I have some experience using Python for data analysis on academic projects, and I wanted to do a straightforward project combining Python and SQL to extract some insights from data. I came across the "Olympics dataset – 120 years of data", which covers every participant of the Olympic Games since the beginning. Such an analysis is interesting because it lets us see whether a system of state support in sport results in higher medal productivity among athletes. As much as it was a method of propaganda, state-supported sport also helped children find a purpose in life and helped people stay fit and healthy through a mass sport culture. The success of the USSR's athletes at the Olympics is a great example of that. Although the country only joined the IOC in 1951 and competed from 1952 until 1992*, the results are staggering and worth exploring a bit further.

My hypothetical client:
I am working closely with SportsStats to find interesting facts for their partners. Reviewing the data provided will hopefully yield newsworthy stories or health insights.
Hypotheses:
1) Outside of the USA, all countries with a state-controlled sport system will perform better than the amateur ones.
2) Each Soviet-bloc country will have a time lag while the new system is being implemented. Therefore, every country's results will improve after 3 Olympic cycles from the start of its participation.
Approach:
- Import and clean up the data; split the raw dataset into First World and Second World country tables;
- Filter by teams and years to see performance;
- Run a statistical analysis using the Pearson correlation and its p-value to find correlations.
Step 1: Import and clean up data.
We begin by importing the libraries and the CSV file of the dataset. Once it is imported, it is best to familiarize ourselves with the data we have.
After looking through the columns, it is quite clear that not all of the data provided is necessary for the analysis, so we will drop some of the columns altogether.
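As a rough sketch of this step (not the exact code from the project), loading the CSV with pandas and keeping a subset of columns could look like the snippet below. It assumes the file uses the column names shown in the sample header; the list `KEEP_COLUMNS`, the helper `load_events` and the two sample rows are illustrative only, so that the snippet runs without the real file.

```python
import io

import pandas as pd

# Assumption: the post does not say which columns were kept,
# so this subset is only a plausible choice for the later steps.
KEEP_COLUMNS = ["Name", "Team", "NOC", "Year", "Season", "Sport", "Event", "Medal"]


def load_events(source):
    """Read the athlete-events CSV and keep only the columns used later."""
    raw = pd.read_csv(source)
    # A quick look at what we have before trimming anything.
    print(raw.shape)
    print(list(raw.columns))
    return raw[KEEP_COLUMNS]


# Made-up stand-in for the real file; replace with the path to the CSV.
sample_csv = io.StringIO(
    "ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal\n"
    "1,A Athlete,M,24,180,80,Soviet Union,URS,1956 Summer,1956,Summer,Melbourne,"
    "Gymnastics,Gymnastics Men's Rings,Gold\n"
    "2,B Athlete,F,21,170,60,United States,USA,1956 Summer,1956,Summer,Melbourne,"
    "Swimming,Swimming Women's 100 metres Freestyle,\n"
)
events = load_events(sample_csv)
```

Rows without a medal keep an empty `Medal` value (read as NaN), which is handy later when counting medals only.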
Once we have filtered the required columns, it is time to split the dataset into Soviet and capitalist tables. For more accurate results, we will restrict the time frame to 1952–1992, since within that period the Soviet system was implemented and active in the Warsaw Pact countries.
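The split described above can be sketched as a filter on the year range plus a membership test on country codes. The two sets of NOC codes below are assumptions: the Soviet-bloc set follows the countries named in this post, and the capitalist set is just a hypothetical handful of countries mentioned later, not the full list used in the project.

```python
import pandas as pd

# Assumed NOC codes for the countries discussed in this post.
SOVIET_NOCS = {"URS", "GDR", "BUL", "TCH", "POL", "ROU", "HUN", "YUG", "CUB"}
# Hypothetical, incomplete list for illustration only.
CAPITALIST_NOCS = {"USA", "GBR", "AUS", "JPN", "ITA"}


def split_blocs(events, start=1952, end=1992):
    """Return (soviet, capitalist) tables limited to the given years."""
    in_window = events[events["Year"].between(start, end)]  # inclusive bounds
    soviet = in_window[in_window["NOC"].isin(SOVIET_NOCS)]
    capitalist = in_window[in_window["NOC"].isin(CAPITALIST_NOCS)]
    return soviet, capitalist


# Tiny made-up frame so the sketch runs on its own.
events = pd.DataFrame(
    {
        "NOC": ["URS", "USA", "HUN", "USA", "FRA"],
        "Year": [1956, 1956, 1948, 1996, 1960],
        "Medal": ["Gold", "Silver", "Gold", "Gold", None],
    }
)
soviet, capitalist = split_blocs(events)
```

Countries that appear in neither set (France in the sample) simply drop out of both tables.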
Step 2: Filter against teams and years to see the performance.
Once we have our countries split into tables, it is time to see how they performed on the global scale! We will create a function to count the medals of each country, as well as its overall contribution to the Soviet pool of medals won. When we run the function on the first 3 Olympic cycles, we get the following:
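One possible shape for such a function is sketched below; it is not the project's actual code. It assumes the table has `NOC`, `Year`, `Event` and `Medal` columns, and the name `medal_table` is made up. Because the raw data has one row per athlete, the sketch collapses team events to a single medal per country, which is a design choice of this example.

```python
import pandas as pd


def medal_table(bloc, years):
    """Medals per country and Games, plus each country's share of the pool."""
    medals = bloc[bloc["Medal"].notna() & bloc["Year"].isin(years)]
    # Count a team medal once instead of once per team member.
    medals = medals.drop_duplicates(subset=["Year", "Event", "NOC", "Medal"])
    counts = medals.groupby(["NOC", "Year"]).size().unstack(fill_value=0)
    counts["Total"] = counts.sum(axis=1)
    counts["Share_%"] = (100 * counts["Total"] / counts["Total"].sum()).round(1)
    return counts.sort_values("Total", ascending=False)


# Made-up rows: two gymnasts from the same team event share one medal.
bloc = pd.DataFrame(
    {
        "NOC": ["URS", "URS", "URS", "HUN", "HUN"],
        "Year": [1952, 1952, 1956, 1952, 1956],
        "Event": ["Team All-Around", "Team All-Around", "100 metres", "Sabre", "Freestyle"],
        "Medal": ["Gold", "Gold", "Silver", "Gold", None],
    }
)
table = medal_table(bloc, years=[1952, 1956, 1960])
print(table)
```

The same function can be pointed at the capitalist table without changes, since it only relies on the shared column names.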


Judging by the numbers presented, the trend is not consistent across the board. The USSR, Bulgaria, Czechoslovakia, Poland and Romania show an increase in medals as time passes. However, Hungary (which finished 3rd overall at the 1952 Olympics) and Yugoslavia show a downward trend. It is also worth noting that until 1964 Germany participated as a unified team, so it is a good idea to look at the medal situation from the 1968 Games onward.


With the introduction of East Germany, most teams show a similar trend. The USSR, East Germany, Bulgaria, Cuba, Poland and Romania show an increase in medals as time passes, whereas Hungary, Yugoslavia (after a slight increase) and Czechoslovakia show a downward trend. This makes our second hypothesis a bit of a mixed bag: most countries start showing better results with time, but some countries show the opposite dynamic.
We will use the same function to establish which countries in the capitalist bloc contribute most to the overall medal performance.


The two summaries of the Games allow us to compare the performance of the Soviet-bloc countries against the capitalist ones. Once the metric was calculated, it turned out that the highest-performing capitalist countries were more productive until 1968. Even outside of the USA, these countries performed better than their Soviet-bloc counterparts until the rising results of East Germany tipped the scales from 1972 onward.
(*NOTE: we excluded the 1980 and 1984 Olympics because of the boycotts by capitalist countries in 1980 and by Soviet-bloc countries in 1984.)
After running the function on all of the Games, we managed to isolate the best-performing countries. Now we can create a table of their most productive disciplines, so that we can determine which system was best at a given sport and which field proved most fruitful for each bloc.
We can clearly see on the bar chart above which sports accounted for most of the USA's success at the Olympics. By running these charts for every team on our lists, we will create the performance table for both blocs.
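The per-country breakdown by sport can be sketched as a simple count of medals per `Sport` value. As before, this is an illustration under assumptions rather than the original code: the helper name `top_sports` and the sample rows are invented, and team medals are again counted once per event.

```python
import pandas as pd


def top_sports(bloc, noc, n=5):
    """Return the n sports that brought the given country the most medals."""
    medals = bloc[(bloc["NOC"] == noc) & bloc["Medal"].notna()]
    # One medal per event, even if several team members are listed.
    medals = medals.drop_duplicates(subset=["Year", "Event", "Medal"])
    return medals["Sport"].value_counts().head(n)


# Made-up rows for illustration.
bloc = pd.DataFrame(
    {
        "NOC": ["USA", "USA", "USA", "USA", "USA", "GBR"],
        "Year": [1956, 1956, 1956, 1956, 1956, 1956],
        "Sport": ["Athletics", "Athletics", "Basketball", "Basketball", "Swimming", "Cycling"],
        "Event": ["100 metres", "Long Jump", "Basketball", "Basketball", "Freestyle", "Sprint"],
        "Medal": ["Gold", "Silver", "Gold", "Gold", None, "Bronze"],
    }
)
usa_sports = top_sports(bloc, "USA")
print(usa_sports)
```

Calling `usa_sports.plot.bar()` on the result would give a bar chart similar in spirit to the ones used in this step, provided matplotlib is installed.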
All of the countries in question have excellence in several sports to thank for their high performance. Filtering each country by the year of the Games, we get the following results.


These graphs indicate that the Soviet-bloc countries favored team sports, where they remained a formidable force throughout the measured period. Results in individual disciplines depended more on which sporting school was strong in the given country (skiing in the USSR, for example). It is also visible that results improved after the Soviet-style reform of a country's sporting system, as well as after federations were founded from scratch (boxing in Cuba, which was not popularized until the revolution).
Capitalist countries, on the other hand, were better at individual disciplines such as athletics, swimming and cycling. As later findings show, the sport that is most popular or widespread in a country ends up carrying its medal hopes, like athletics for the USA, cycling for the UK or hockey for the UK and Australia.
Step 3: Run statistical analysis via Pearson and P-value to find correlations.
Pearson correlation (the bivariate correlation) is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
The P-value is the probability of finding a result at least as extreme as the current one if the correlation coefficient were in fact zero (the null hypothesis). If this probability is lower than the conventional 5% (P<0.05), the correlation coefficient is called statistically significant.
We are going to find correlations between the performance of sports in each table. Since the biggest contributors to the medal pool are athletics and swimming, we will compare them first and then see how well the other disciplines perform.
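To make the two statistics concrete, here is a minimal sketch using only the standard library. The Pearson coefficient is computed from its definition; the p-value, however, comes from a permutation test, which is a stand-in for the analytic p-value that a routine such as `scipy.stats.pearsonr` reports. The medal counts are invented for illustration and are not results from the dataset.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value: how often shuffled data correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (hits + 1) / (n_perm + 1)


# Invented medal counts per Games for two sports.
athletics = [10, 12, 15, 18, 20, 25, 27, 30]
swimming = [4, 5, 7, 8, 9, 12, 13, 15]

r = pearson_r(athletics, swimming)
p = permutation_p_value(athletics, swimming)
print(f"r = {r:.3f}, p = {p:.4f}")
```

With only a handful of Games per country, any p-value should be read with caution, whichever way it is computed.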

It looks like the correlation is quite strong for both tables. This comes as no surprise, since we are looking at the most successful teams across the Games. Let's look at the correlation between athletics and cycling.

The Soviet table shows an even stronger correlation here, but our capitalist table gave a weak Pearson coefficient with a very high P-value. Hence there is no statistically significant linear relationship between these two sports in the capitalist table; any relationship between them would have to be non-linear. The situation with water sports is similar across the whole water-sport family: good performers in one usually perform well in the others.

However, the capitalist table shows more of a linear correlation, whereas the Soviet one looks like a non-linear relationship.
But what about team sports, where the Soviet teams should perform "so much better"? Well, the results are less than straightforward:

The capitalist table shows hardly any correlation, whereas the Soviet performance is more linear, although not entirely.
Conclusions.
- Adoption of the Soviet sport system tends to improve the performance of most countries over consecutive Games, even though the biggest beneficiary, judging by the results, was the USSR.
- Capitalist countries with independent systems of preparation show more consistent results in their strongest disciplines.
- The Eastern European school of gymnastics gave a huge advantage to the Soviet-bloc countries. From the early 1950s, gymnastics was widely popularized for the "improvement of physical wellbeing of soviet people". Later that decree turned into a way to earn more medals at the highest level, which resulted in Soviet dominance on the Olympic stage. Apart from Japan, no capitalist country performed decisively better in gymnastics at the Olympics.
- A special mention needs to go to fencing. As a competitive sport, it originated in Italy, which has long been the most formidable force in the sport. Much praise can also be given to the Hungarian school of fencing, considered the second best in Europe. After the division of Europe, the high standards of Hungarian fencing were taught all over the Soviet bloc, resulting in much improvement among the Soviet athletes.
- The strong schools of each country that joined the Soviet bloc contributed to the other countries' training methods. The Soviet system took the experience of one nation and adopted the practice. Good examples are fencing, gymnastics and water polo.
Hypotheses: Verdict
- Outside of the USA, all countries with a state-controlled sport system will perform better than the amateur ones. Partially correct – from the 1972 Games onward.
- Each Soviet-bloc country will have a time lag while the new system is being implemented. Therefore, every country's results will improve after 3 Olympic cycles from the start of its participation. Partially correct – some countries showed a downward trend. After the 1968 inclusion of East Germany, most countries showed steady growth.
Discussion.
As much as sport was a method of political propaganda for the Soviet regime, it was also a mass phenomenon that drew on the best practices of the countries it encompassed and implemented them across the board. The same approach was taken by China in preparation for the 2008 Games, and it propelled the country to become one of the major forces at the Games. There is much room for international cooperation when it comes to the best training methods, which would make the Games more competitive and the pool of victors wider and more inclusive of other nations.
Thank you for reading! You can find the code on my GitHub. Any feedback, comments or suggestions would be much appreciated.