Getting Started with Data Analysis in Julia

A step-by-step introduction to data analysis in Julia.

René
Towards Data Science

Photo by Pixabay from Pexels

Getting started with Julia is pretty straightforward, especially if you are already familiar with Python. For this walk-through we will be using Covid-19 data provided by the Center for Systems Science and Engineering at Johns Hopkins University in their GitHub repository.

Getting Started

For our data analysis we will be using just a few packages to keep things simple: CSV, DataFrames, Dates and Plots. Simply type the keyword “using” followed by the name of the package and you are ready to go.
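
A minimal sketch of what loading the four packages could look like, assuming they are already installed in the active environment:

```julia
# Load the packages used throughout this walk-through.
using CSV         # reading and writing CSV files
using DataFrames  # tabular data structures
using Dates       # date parsing and date arithmetic (standard library)
using Plots       # plotting
```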

In case the packages have not yet been added to your project environment, you can add them easily.
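
One way to do this, sketched here with Julia's built-in package manager Pkg (the same can be done from the REPL's package mode), is:

```julia
using Pkg

# CSV, DataFrames and Plots are registered packages. Dates ships with Julia
# as a standard library, so it normally does not need to be added separately.
Pkg.add(["CSV", "DataFrames", "Plots"])
```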

Reading Data

Reading the data is done in a few easy steps. First we specify the URL of the CSV file. Secondly we specify the path to the file on our local machine. We will join the present working directory and file name “confirmed.csv” as path. Then we download the file from the URL to the specified path. The fourth and final step is reading the CSV file into a data frame called “df”.
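
The four steps could be sketched as follows. The URL below is an assumption about where the confirmed-cases time series lives in the Johns Hopkins repository, so check it against the repository before relying on it. The sketch uses the Downloads standard library (Julia 1.6 and later); on older versions the built-in download function plays the same role.

```julia
using CSV, DataFrames, Downloads

# Step 1: URL of the CSV file (assumed location in the JHU CSSE repository).
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

# Step 2: local path = present working directory + file name.
path = joinpath(pwd(), "confirmed.csv")

# Step 3: download the file from the URL to the local path.
Downloads.download(url, path)

# Step 4: read the CSV file into a data frame called df.
df = CSV.read(path, DataFrame)
```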

Let’s view the first 10 rows of our data frame.
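
A short sketch of that call. To keep the snippet runnable on its own, a tiny stand-in data frame with made-up values replaces the downloaded data here; the same call applies to the real “df”.

```julia
using DataFrames

# Toy stand-in for df (made-up values, not real case counts).
df = DataFrame("Country/Region" => ["A", "B", "C"], "1/22/20" => [0, 1, 2])

# Returns the first 10 rows, or all rows if there are fewer than 10.
first(df, 10)
```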

Tidying Data

For this walk-through we do not need the columns Province/State, Lat and Long, so we will drop them first. Because we call select! (with an exclamation mark) rather than select, the data frame is modified in place.
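
Dropping the three columns might look like the sketch below. It again uses a small made-up data frame with the same column names as the real file, and the Not selector to say "everything except these columns".

```julia
using DataFrames

# Toy stand-in with the same column layout as the downloaded file (made-up values).
df = DataFrame(
    "Province/State" => [missing, "X"],
    "Country/Region" => ["A", "B"],
    "Lat" => [1.0, 2.0],
    "Long" => [3.0, 4.0],
    "1/22/20" => [0, 5],
)

# Keep every column except the three listed; the ! means df itself is changed.
select!(df, Not(["Province/State", "Lat", "Long"]))

names(df)
```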

There are multiple rows for Australia, as there are for some other countries. To plot the data per country, we first have to aggregate the data. We will do that with the split-apply-combine technique. First we split the data by country using the groupby function. Then we apply a sum function per group (i.e. per country) to all the date columns, so we need to exclude the first column “Country/Region”. And finally, we combine the results into one data frame.
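
A sketch of split-apply-combine on a made-up data frame in which country “A” appears twice. The names gdf, date_cols and df_country are chosen for illustration only.

```julia
using DataFrames

# Toy stand-in: country "A" has two rows, as Australia does in the real data.
df = DataFrame(
    "Country/Region" => ["A", "A", "B"],
    "1/22/20" => [1, 2, 3],
    "1/23/20" => [4, 5, 6],
)

# Split: one group per country.
gdf = groupby(df, "Country/Region")

# All columns except the grouping column are date columns.
date_cols = names(df, Not("Country/Region"))

# Apply + combine: sum every date column per group and collect the results
# in one data frame; renamecols = false keeps the original column names.
df_country = combine(gdf, date_cols .=> sum; renamecols = false)
```

With these toy values, the two rows for “A” collapse into a single row holding 3 for the first date and 9 for the second.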

Let’s see what we have so far.

Our data frame now has (at the time of writing) 320 columns. However, we would prefer to have one column with dates and one column with values, which we will call “Cases”. In other words, we are going to transform the data frame from a wide format to a long format, using the stack function.
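
The wide-to-long transformation might be sketched like this, using a made-up data frame with two date columns. The result name df_long is illustrative.

```julia
using DataFrames

# Toy stand-in in wide format: one column per date (made-up values).
df = DataFrame(
    "Country/Region" => ["A", "B"],
    "1/22/20" => [1, 3],
    "1/23/20" => [2, 4],
)

# Stack every column except "Country/Region": the former column names go
# into "Date" and the former cell values go into "Cases".
df_long = stack(df, Not("Country/Region"); variable_name = "Date", value_name = "Cases")
```

Two countries times two dates gives four rows and three columns in the long format.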

Here are the last ten rows of our data frame, now nicely in long format.

There is one more thing left to do. We need to convert the column “Date” from a categorical string format to a date format so that we can plot time series.
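
A sketch of the conversion, assuming the date strings follow the month/day/two-digit-year pattern of the column names in the source file (for example “1/22/20”). Julia parses a two-digit year literally, as the year 20, so 2000 years are added afterwards.

```julia
using DataFrames, Dates

# Toy stand-in in long format (made-up values).
df = DataFrame("Date" => ["1/22/20", "1/23/20"], "Cases" => [1, 2])

# string.() also handles a categorical column; the parsed year is 0020,
# hence the shift by 2000 years.
df.Date = Date.(string.(df.Date), dateformat"m/d/y") .+ Year(2000)
```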

So, here is the description of our final tidy data frame.
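
Such a summary can be produced with the describe function from DataFrames, sketched here on a made-up tidy data frame:

```julia
using DataFrames, Dates

# Toy stand-in for the tidy data frame (made-up values).
df = DataFrame(
    "Country/Region" => ["A", "A", "B", "B"],
    "Date" => [Date(2020, 1, 22), Date(2020, 1, 23), Date(2020, 1, 22), Date(2020, 1, 23)],
    "Cases" => [1, 2, 3, 4],
)

# One summary row per column (element type, missing count, basic statistics).
summary_df = describe(df)
```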

Let’s write our tidy data to disk before visualizing the data.
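
Writing the data frame back to a CSV file could look like this. The file name “confirmed_tidy.csv” is a hypothetical choice for illustration.

```julia
using CSV, DataFrames, Dates

# Toy stand-in for the tidy data frame (made-up values).
df = DataFrame(
    "Country/Region" => ["A", "B"],
    "Date" => [Date(2020, 1, 22), Date(2020, 1, 22)],
    "Cases" => [1, 3],
)

# Hypothetical output file in the present working directory.
out_path = joinpath(pwd(), "confirmed_tidy.csv")

CSV.write(out_path, df)
```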

Visualizing data

In our first plot we are going to visualize the (cumulative) confirmed Covid-19 cases for the US.
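
A sketch of such a plot. The numbers are made up so the snippet runs on its own, and the title and labels are illustrative; with the real tidy data frame only the filter and plot calls are needed.

```julia
using DataFrames, Dates, Plots

# Toy stand-in for the tidy data frame (made-up values, not real case counts).
df = DataFrame(
    "Country/Region" => ["US", "US", "US", "Italy", "Italy", "Italy"],
    "Date" => repeat([Date(2020, 1, 22), Date(2020, 1, 23), Date(2020, 1, 24)], 2),
    "Cases" => [1, 3, 6, 0, 2, 5],
)

# Keep only the rows for the US.
us = filter("Country/Region" => ==("US"), df)

# Line plot of cumulative cases over time.
p = plot(us.Date, us.Cases;
    title = "Confirmed cases US",
    xlabel = "Date",
    ylabel = "Cases",
    label = "US",
    legend = :topleft,
    linewidth = 2,
)
```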

Plotting time series for multiple countries in one plot is pretty straightforward. First you create a base plot and then you add one layer per country.
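
The layering idea can be sketched with plot for the empty base plot and plot! for each added series. The countries and numbers below are made up for illustration.

```julia
using DataFrames, Dates, Plots

# Toy stand-in for the tidy data frame (made-up values).
df = DataFrame(
    "Country/Region" => ["US", "US", "Italy", "Italy"],
    "Date" => repeat([Date(2020, 1, 22), Date(2020, 1, 23)], 2),
    "Cases" => [1, 3, 0, 2],
)

# Base plot: no data yet, only the shared attributes.
p = plot(; title = "Confirmed cases", xlabel = "Date", ylabel = "Cases", legend = :topleft)

# One layer per country; plot! adds a series to the existing plot p.
for country in ["US", "Italy"]
    sub = filter("Country/Region" => ==(country), df)
    plot!(p, sub.Date, sub.Cases; label = country, linewidth = 2)
end

p
```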

In our last plot, we are going to plot the daily new cases for the US. To do so, we have to calculate the differences between two successive days. As a result, the value for the first day in the time series will not be available.
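
One way to compute the daily differences is the diff function, which returns one element fewer than its input. In the sketch below a missing value is therefore prepended for the first day. The column name “NewCases” and the numbers are illustrative.

```julia
using DataFrames, Dates, Plots

# Toy stand-in: cumulative cases for the US on three days (made-up values).
df = DataFrame(
    "Country/Region" => ["US", "US", "US"],
    "Date" => [Date(2020, 1, 22), Date(2020, 1, 23), Date(2020, 1, 24)],
    "Cases" => [1, 3, 6],
)

us = filter("Country/Region" => ==("US"), df)
sort!(us, "Date")  # differences only make sense in chronological order

# Difference between successive days; the first day has no predecessor.
us.NewCases = vcat(missing, diff(us.Cases))

# Plot from the second day onwards, where a value is available.
p = plot(us.Date[2:end], us.NewCases[2:end];
    title = "Daily new cases US",
    xlabel = "Date",
    ylabel = "New cases",
    label = "US",
)
```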

Finally, we will save our figure to disk.
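
Saving can be done with the savefig function from Plots, which picks the output format from the file extension. The file name “new_cases.png” and the toy series are hypothetical.

```julia
using Plots

# Toy figure standing in for the plot created earlier.
p = plot(1:3, [1, 4, 9]; label = "toy series")

# Hypothetical output file in the present working directory.
fig_path = joinpath(pwd(), "new_cases.png")

savefig(p, fig_path)
```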

Final thoughts

In this walk-through article we covered the basics of using Julia for data analysis. In my experience, Julia walks like Python. Both languages are easy to learn and to code in, and both are open-source. What I love about Julia is its high performance and its interoperability with other programming languages like Python. What I love about Python is its enormous collection of packages and its large online community. Let me know your thoughts.

For more information about Julia, visit https://julialang.org and https://juliahub.com.

Senior Information Manager with a passion for all things data. Official author of Towards Data Science (TDS).