
Julia is fairly well known in the world of scientific computing. Since the release of a stable 1.0 version in 2018, it has gradually matured into a powerful general-purpose programming language. Julia is dynamically typed, designed to be as fast as C (see benchmarks), and has an impressive math-friendly syntax. I recently completed an introductory course on Coursera and then started to include Julia in my daily workflow. As a small project, I decided to use DataFrames in Julia to visualize Covid-19 time-series data. During this process, I realized that information on using and troubleshooting Julia is relatively hard to find, which could raise the barrier to entry for new users like me. I therefore decided to put together this guide, in which I share the process, the code and the results.
I usually work in Jupyter notebooks because they are easy to use and get the job done. I have not included detailed steps to set up Julia on your system, since that is relatively straightforward. Keep in mind that:
- You need a working Jupyter installation; Anaconda is highly recommended.
- Julia binaries are available online; don’t forget to verify the sha256 checksums.
- You will need the ‘IJulia’ package to make Julia work with Jupyter notebooks; these instructions worked for me! I was easily able to set up a working Julia environment in Elementary OS (based on Ubuntu).
Let’s get started: Loading the essential packages
Similar to Python, Julia also makes use of a number of packages which can be loaded in a Jupyter notebook. Unlike many Python packages, most of them are written in Julia itself! ‘Pkg’ is the built-in package manager in Julia and handles their installation, update and removal. ‘Pkg’ comes with its own REPL (read-eval-print loop) mode. Enter the ‘Pkg’ REPL by pressing ] from the Julia REPL. To get back to the Julia REPL, press backspace or ^C.

Packages can be added from the ‘Pkg’ REPL using its add command. Alternatively, you can add them directly in the Jupyter notebook, for example by executing: import Pkg; Pkg.add("HypothesisTests"). Since packages are updated on a regular basis (bug fixes, new features etc.), some of the code linked in this post may stop working for you at a later point in time. Don’t worry though, Julia has a very clever solution. ‘Pkg’ maintains two additional files, ‘Project.toml’ and ‘Manifest.toml’, which record dependencies, versions, package names, UUIDs etc. These files can be easily shared, and will let you recreate the exact working environment that I have.
Use the ‘InstantiateFromURL’ package in the first cell of your notebook to download these files directly from my GitHub repository. It also activates the required environment.
The packages can then be loaded and precompiled in the next cell. Package versions will be exactly the same as those I used while working on this project.
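As an alternative sketch that relies only on the built-in ‘Pkg’ API instead of ‘InstantiateFromURL’, and assuming ‘Project.toml’ and ‘Manifest.toml’ have already been saved in the same folder as the notebook, the environment could be activated and restored roughly like this. The list of packages on the last line is my assumption about what this walkthrough needs, not something taken from the original project files.

```julia
import Pkg

# Activate the environment described by Project.toml / Manifest.toml
# in the current folder (assumed to be where the notebook lives).
Pkg.activate(".")

# Install exactly the package versions recorded in Manifest.toml.
Pkg.instantiate()

# Load (and precompile on first use) the packages used in this guide.
# This list is an assumption; adjust it to match your Project.toml.
using CSV, DataFrames, Dates, Plots, StatsPlots
```

The first run can take a few minutes because every package is precompiled; later sessions start much faster.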
Importing data
For this exercise, I made use of COVID-19 data from the well-known GitHub repository maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The data is in CSV format and is updated daily. It can be read directly from the repository into a DataFrame.
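A minimal sketch of the import step, assuming the ‘CSV’ and ‘DataFrames’ packages plus the ‘Downloads’ standard library (available from Julia 1.6). The helper name ‘load_timeseries’ is hypothetical. The small inline table only mimics the layout of the real file so that the parsing step can be tried without a network connection; its values are made up.

```julia
using CSV, DataFrames, Downloads

# Raw URL of the global confirmed-cases time series in the CSSE repository.
const CONFIRMED_URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

# Hypothetical helper: download the CSV to a temporary file, then parse it.
load_timeseries(url::AbstractString) = CSV.read(Downloads.download(url), DataFrame)

# Offline stand-in with the same column layout (values are made up).
sample = """
Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Germany,51.0,9.0,0,0
New South Wales,Australia,-33.8,151.2,0,0
"""
sample_df = CSV.read(IOBuffer(sample), DataFrame)

# With a network connection, the real data would be loaded with:
# data_df = load_timeseries(CONFIRMED_URL)
```

Empty fields, such as the ‘Province/State’ entry for Germany above, are read as missing by default.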
A first look
Before moving ahead, let’s take a look at our data. I want to know more about its structure, size and type.

- size(data_df) returns the number of rows and columns within the DataFrame. The output will look different depending on the day you retrieve the data, since a new column is added every day.
- Use names(data_df) to list all column names. This will be quite handy later on!

- A quick investigation reveals that we have a column with country names and one column for each individual date. Missing entries are represented by Julia’s special missing object.
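The inspection steps listed above can be tried on a toy table. This is only a sketch: the DataFrame below is a made-up stand-in with the same column layout as the CSSE file, and it assumes a recent ‘DataFrames’ release in which names returns strings.

```julia
using DataFrames

# Toy stand-in for the imported data (values are made up).
data_df = DataFrame(
    "Province/State" => [missing, "New South Wales"],
    "Country/Region" => ["Germany", "Australia"],
    "Lat"            => [51.0, -33.8],
    "Long"           => [9.0, 151.2],
    "1/22/20"        => [0, 0],
    "1/23/20"        => [0, 0],
)

dims      = size(data_df)       # (number of rows, number of columns)
col_names = names(data_df)      # column names as a vector of strings
preview   = first(data_df, 5)   # peek at the first few rows
summary   = describe(data_df)   # per-column summary, including missing counts

# Number of rows without a province, i.e. country-level entries.
n_country_level = count(ismissing, data_df[!, "Province/State"])
```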
Selecting a country within the DataFrame
I want to be able to select the time-series data of a given country from within the DataFrame. Some countries also have a ‘Province/State’ listed, which is not very useful for me. However, the rows with ‘Province/State’ listed as missing contain the sum total of all regions within that country.
I will write a function that keeps only the rows with ‘Province/State’ missing and discards the rest. Then I can easily find the country by matching the ‘Country/Region’ entry with my supplied keyword. While exploring the data, I have already noted that country names are written as Australia, Germany, India etc., so my input keywords should match them exactly.
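One way such a function could look is sketched below. The name ‘country_row’ is hypothetical and the toy table uses made-up counts; only the two column names come from the data set.

```julia
using DataFrames

# Toy stand-in for the imported data (counts are made up).
data_df = DataFrame(
    "Province/State" => [missing, "French Guiana", missing],
    "Country/Region" => ["France", "France", "India"],
    "1/22/20"        => [0, 0, 0],
    "1/23/20"        => [2, 0, 1],
)

# Keep only country-level rows (no province), then match the country name.
function country_row(df::AbstractDataFrame, country::AbstractString)
    country_level = filter("Province/State" => ismissing, df)
    return filter("Country/Region" => ==(country), country_level)
end

france = country_row(data_df, "France")
```

A keyword that does not match any country-level row returns an empty DataFrame, so checking nrow of the result is a cheap safeguard against typos.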
Dealing with dates
Dates can be read directly from the column names. In order to plot time-series data (dates on the x-axis), we need them in a proper date format, so the column-name strings have to be parsed into date objects.
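A small sketch of the conversion using the ‘Dates’ standard library. It assumes the date columns start at position 5 (after ‘Province/State’, ‘Country/Region’, ‘Lat’ and ‘Long’) and that names follow the month/day/two-digit-year pattern such as 1/22/20; the list of column names here is a shortened stand-in.

```julia
using Dates

# Shortened stand-in for names(data_df).
col_names = ["Province/State", "Country/Region", "Lat", "Long",
             "1/22/20", "1/23/20", "12/31/20"]

# Assumption: everything from the fifth column onwards is a date.
date_strings = col_names[5:end]

# "y" parses the two-digit year literally (year 20), so shift by 2000 years.
fmt   = dateformat"m/d/y"
dates = [Date(s, fmt) + Year(2000) for s in date_strings]
```

The resulting vector of Date values can be passed directly as the x-axis of a plot.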
Creating input data for a list of countries
So far, I have created the data needed for the x-axis of a time-series plot. On the y-axis, we can plot the total number of confirmed cases on each date. However, I want to do this for a list of countries of my choosing, which will help compare the spread of infection between them. Recall that I earlier created a function which returns a row of data for a particular country. We can use this function to loop through a user-specified list of countries.
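The loop could be sketched as follows. The helper ‘country_row’ is the same hypothetical function as before, repeated here so the snippet runs on its own, and the table is a made-up stand-in. The result ‘y’ has one column per country and one row per date.

```julia
using DataFrames

# Toy stand-in for the imported data (counts are made up).
data_df = DataFrame(
    "Province/State" => [missing, missing, "French Guiana", missing],
    "Country/Region" => ["Germany", "India", "France", "France"],
    "Lat"            => [51.0, 21.0, 4.0, 46.0],
    "Long"           => [9.0, 78.0, -53.0, 2.0],
    "1/22/20"        => [0, 0, 0, 0],
    "1/23/20"        => [1, 0, 0, 2],
    "1/24/20"        => [3, 1, 0, 5],
)

# Hypothetical helper: country-level row for one country.
function country_row(df::AbstractDataFrame, country::AbstractString)
    country_level = filter("Province/State" => ismissing, df)
    return filter("Country/Region" => ==(country), country_level)
end

countries = ["Germany", "India", "France"]

y = DataFrame()
for country in countries
    row = country_row(data_df, country)
    # Columns 5:end are assumed to hold the dates; flatten the 1×n slice.
    y[!, country] = vec(Matrix(row[:, 5:end]))
end
```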

Bringing it all together – It’s time to plot!
We now have the y-axis data (‘y’ DataFrame) suitable for a time-series plot for various countries. The ‘StatsPlots’ package provides a macro, ‘@df’, for plotting directly from a DataFrame. We have earlier loaded the ‘Plots’ package, which uses the ‘gr()’ backend to generate figures.
Plot outputs are shown inline when returned to a cell. Here, I have used ‘display’ to explicitly show the original plot overlaid with a red dashed line (using plot!) marking the one million count (a worrisome number to say the least). Additional plot options have been added to improve how the plot looks. A comprehensive guide on plotting can be found here.
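A rough sketch of such a plot, assuming ‘Plots’ and ‘StatsPlots’ are installed. The numbers in ‘y’ are made up purely to give the lines a shape, and the styling options are my own choices rather than the ones used for the figure in this post.

```julia
using DataFrames, Dates, Plots, StatsPlots

# Made-up cumulative counts, one column per country, one row per date.
dates = collect(Date(2020, 1, 22):Day(1):Date(2020, 1, 26))
y = DataFrame(
    "Country A" => [100_000, 300_000, 600_000, 900_000, 1_200_000],
    "Country B" => [50_000, 100_000, 200_000, 400_000, 800_000],
    "Country C" => [10_000, 20_000, 40_000, 80_000, 160_000],
)

# cols(1:n) selects every column of y; each one becomes a line.
n = ncol(y)
p = @df y plot(dates, cols(1:n),
               xlabel = "Date", ylabel = "Confirmed cases",
               linewidth = 2, legend = :topleft)

# Overlay a red dashed line at the one-million mark.
plot!(p, [dates[1], dates[end]], [1_000_000, 1_000_000],
      linestyle = :dash, color = :red, label = "1 million")
```

In a notebook, ending the cell with p (or calling display(p)) shows the figure inline.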


Calculating the number of daily reported cases
Another interesting metric to look at is the number of cases reported on a daily basis. We already have a DataFrame containing the total number of reported cases on a given date, so the daily increase is simply the difference between two consecutive dates. This operation can easily be performed for all the days at once.
A bar plot is usually more informative in such cases. For countries where the total number of reported cases starts to saturate, we can see that the increase in daily cases starts to go down. It will ideally become zero once no new cases are reported. Italy, Germany and the UK have managed to limit the growth, whereas India is currently seeing an exponential increase.
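The difference between consecutive dates can be sketched with Julia’s built-in diff applied to every column. The cumulative counts below are made up.

```julia
using DataFrames

# Made-up cumulative counts, one column per country, one row per date.
y = DataFrame(
    "Germany" => [0, 1, 3, 6],
    "India"   => [0, 0, 1, 4],
)

# diff returns x[i+1] - x[i], so every column becomes one element shorter.
daily = mapcols(diff, y)
```

Because the result has one row fewer than ‘y’, it should be plotted against the dates starting from the second day.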


Finding the top five countries with highest number of reported cases
In order to find the countries with the highest number of confirmed cases, we can sort the original DataFrame ‘data_df’ by the values in the last (most recent) date column. Sorting is done in descending order using the ‘sort!’ function (the bang version, which sorts in place).
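A minimal sketch of the sorting step on a toy table with placeholder country names and made-up counts:

```julia
using DataFrames

# Toy stand-in for data_df (placeholder names, made-up counts).
data_df = DataFrame(
    "Country/Region" => ["Country A", "Country B", "Country C",
                         "Country D", "Country E", "Country F"],
    "1/22/20"        => [5, 1, 9, 3, 7, 2],
    "1/23/20"        => [10, 60, 30, 50, 20, 40],
)

# The last column holds the most recent date.
latest = names(data_df)[end]

# Sort in place, largest value first, then keep the first five rows.
sort!(data_df, latest, rev = true)
top5 = first(data_df, 5)
top5_names = top5[!, "Country/Region"]
```

Since sort! reorders ‘data_df’ itself, use sort (without the bang) if the original row order is needed later.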

It can be particularly useful to visualize which countries occupied the top five spots as the days progressed. Luckily, Julia allows us to easily generate an animation using the ‘@animate’ macro. We simply need to loop through the dates and generate a plot of the top five countries for each day. The plots are then combined into an animation which can be run at any desired frame rate. The code used for the single-day plot is simply placed inside the animation loop.
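A sketch of the animation loop, assuming the ‘Plots’ package with its default backend. The table uses placeholder names and made-up counts, and the output file name ‘top5.gif’ is arbitrary.

```julia
using DataFrames, Plots

# Toy stand-in: placeholder names, made-up counts, one column per date.
data_df = DataFrame(
    "Country/Region" => ["Country A", "Country B", "Country C",
                         "Country D", "Country E", "Country F"],
    "1/22/20"        => [5, 1, 9, 3, 7, 2],
    "1/23/20"        => [10, 60, 30, 50, 20, 40],
    "1/24/20"        => [15, 90, 35, 120, 25, 80],
)
date_cols = names(data_df)[2:end]

# One frame per date: sort a copy by that date and plot the top five.
anim = @animate for d in date_cols
    top5 = first(sort(data_df, d, rev = true), 5)
    bar(top5[!, "Country/Region"], top5[!, d],
        title = "Top five on $d", ylabel = "Confirmed cases",
        legend = false)
end

# Combine the frames into a GIF at 2 frames per second.
gif(anim, "top5.gif", fps = 2)
```

Using sort rather than sort! inside the loop leaves the original table untouched between frames.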
Number of deaths, recovered and currently infected
The CSSE GitHub repository also contains time-series data for the number of deaths and recovered cases. We can use the same code as shown earlier to import this CSV data into two new DataFrames: ‘data_df_recovered’ and ‘data_df_deaths’.
We can collect data from these new DataFrames, once again for a list of countries. This creates new DataFrames ‘y_r’ and ‘y_d’, which contain data for recovered cases and number of deaths, respectively. ‘y’ contains the number of confirmed cases, the same as earlier.
A grouped bar plot is a useful way to visualize multiple data sets (number of confirmed, recovered, infected, deaths) for a list of countries. Recall that we have already created DataFrames ‘y_r’ and ‘y_d’ above. Since we know the total number of confirmed cases, we can also calculate the number of currently infected = number of confirmed cases − (number of recovered cases + number of deaths).
In order to make a bar plot, we need to collect data from all the individual DataFrames. The last row of each DataFrame (the most recent date) is converted into a vector, and these vectors are then added as columns to the 2-D array ‘Y’. This array has one row per country and four columns, which correspond to the confirmed, recovered, deaths and infected data.
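The two steps above, computing the currently infected count and assembling ‘Y’, might be sketched like this. The three DataFrames hold made-up counts for two placeholder countries; converting each whole table to a matrix just to read its last row is a simplification that is fine for data of this size.

```julia
using DataFrames

# Made-up cumulative counts: one column per country, one row per date.
y   = DataFrame("Country A" => [10, 20, 30], "Country B" => [5, 8, 12])  # confirmed
y_r = DataFrame("Country A" => [0, 5, 12],   "Country B" => [0, 2, 6])   # recovered
y_d = DataFrame("Country A" => [0, 1, 2],    "Country B" => [0, 0, 1])   # deaths

# Most recent date = last row; one entry per country.
confirmed = Matrix(y)[end, :]
recovered = Matrix(y_r)[end, :]
deaths    = Matrix(y_d)[end, :]

# Currently infected = confirmed - (recovered + deaths).
infected = confirmed .- (recovered .+ deaths)

# Rows: countries. Columns: confirmed, recovered, deaths, infected.
Y = hcat(confirmed, recovered, deaths, infected)
```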
Each row of the 2-D array ‘Y’ now contains 4 data points for a given country. The plot type ‘groupedbar’ allows us to read this data directly and arrange groups of bars according to country. Extra plot options control the appearance of the bars.
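A sketch of the grouped bar plot using ‘groupedbar’ from ‘StatsPlots’. The matrix holds made-up numbers for two placeholder countries, and the country names are attached through xticks, which is my own choice rather than necessarily what was done for the figure in this post.

```julia
using Plots, StatsPlots

countries = ["Country A", "Country B"]

# Rows: countries. Columns: confirmed, recovered, deaths, infected (made up).
Y = [30 12 2 16;
     12  6 1  5]

p = groupedbar(Y,
               bar_position = :dodge, bar_width = 0.7,
               xticks = (1:length(countries), countries),
               label = ["Confirmed" "Recovered" "Deaths" "Infected"],
               ylabel = "Number of cases", legend = :topright)
```

Note that label takes a 1×4 row matrix (spaces, not commas), so that each column of ‘Y’ gets its own legend entry.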

Germany seems to have done a great job at managing the pandemic, as is evident from the similar ‘Confirmed’ and ‘Recovered’ numbers. For countries such as India and Brazil, the recovery looks promising, although there is still quite some catching up to do. France has more infected people than recovered ones.
Finding the top five countries with highest number of recovered cases and number of deaths
In order to find the countries with the highest number of recovered cases, we can sort the DataFrame ‘data_df_recovered’ by the values in the last (most recent) date column. Sorting is done in descending order using the ‘sort!’ function (the bang version, which sorts in place), similar to what was done earlier. The number of deaths follows the same logic using the DataFrame ‘data_df_deaths’.


Conclusion: Julia – A versatile tool for problems in Data Science
Through this article, I have tried to present a beginner’s experience in using Julia for tackling some basic problems in Data Science. Full code can be found here. I am aware that I might have missed some details, which is expected since this is by no means a definitive guide. Other resources also exist on the web, although such guides can be hard to find. I am still new to the world of Julia and would be very happy to receive some feedback, even suggestions to improve the code. I will continue to explore advanced Julia libraries, and use them to create more insightful visualizations using large and complex datasets. Stay tuned for more such guides. Thank you for taking the time to read this post! Feel free to connect with me on LinkedIn.
References:
- Link to complete Julia code gist
- https://syl1.gitbook.io/julia-language-a-concise-tutorial/language-core/getting-started
- https://julialang.org/
- Another excellent Data Visualization tutorial using Python