Julia: A New Age of Data Science

Learn why Julia is the future of data science and write your first Julia code

Abid Ali Awan
Towards Data Science


Cover photo by Starline | Freepik

Introduction

Julia is a high-level, general-purpose language for writing code that is fast to execute and easy to implement for scientific calculations. The language is designed around the needs of scientific researchers and data scientists, so that experimentation and design implementation can be optimized. Julia (programming language).

“Julia was built for scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing” - the developers behind the Julia language.

Python is still popular among data science enthusiasts because it offers an ecosystem loaded with libraries that make data science work easier. But Python isn’t fast or convenient enough, and it comes with security vulnerabilities because most of its libraries are built on other languages such as JavaScript, Java, C, and C++. Julia’s fast execution and convenient development make it quite attractive to the data science community, and the majority of its libraries are written directly in Julia, which provides an extra layer of security. InfoWorld.

According to Jeremy Howard, “Python is not the future of machine learning”.

In his recent video, he talks about how frustrating Python is for running machine learning models: you have to rely on libraries from other languages, such as CUDA, and parallel computing requires yet more libraries, which makes it quite challenging. Jeremy has also suggested that if you want to be future-proof, you should start learning Julia, as it will overtake Python within a few years.

Julia

Compared with Python, Julia takes the lead in several areas, as described below.

Julia is fast

Python can be optimized with external libraries and optimization tools, but Julia is faster by default because it comes with JIT compilation and type declarations.
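
As a rough illustration of type declarations and JIT compilation, here is a minimal sketch. The function name sum_of_squares and the input vector are made up for this example. The first call includes compilation time, so the timing is taken on a second call.

```julia
# Hypothetical example: a type-annotated loop that Julia compiles to native code.
function sum_of_squares(xs::Vector{Float64})::Float64
    total = 0.0
    for x in xs
        total += x * x
    end
    return total
end

xs = collect(1.0:1000.0)      # 1.0, 2.0, ..., 1000.0
sum_of_squares(xs)            # first call triggers JIT compilation
@time sum_of_squares(xs)      # later calls reuse the compiled method
```

The type annotations are optional here; Julia would infer the types anyway, but declaring them documents the intent.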

Math-friendly syntax

Julia attracts non-programmer scientists by providing a simple syntax for math operations that is similar to how formulas are written in the non-computing world.
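
A few lines are enough to get a feel for this. The sketch below uses made-up values and only built-in Julia syntax: implicit multiplication, matrix literals, the backslash solver, and a Unicode operator.

```julia
using LinearAlgebra

f(x) = 2x^2 + 3x + 1        # numeric literals multiply variables directly
y = f(2)                    # 2*2^2 + 3*2 + 1

A = [1 2; 3 4]              # 2x2 matrix literal
b = [5, 6]
x = A \ b                   # solve the linear system A*x = b

r = √16                     # Unicode operators such as √ are valid Julia
```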

Automatic memory management

Julia manages memory automatically and also gives you more freedom to control garbage collection manually. In Python, keeping memory usage under control for large workloads often means explicitly deleting objects and collecting information about memory usage, which can be daunting in some cases.

Superior parallelism

Running scientific algorithms efficiently means using all available resources, for example through parallel computing on a multi-core processor. In Python, you need external packages for parallel computing, or you must serialize and deserialize data passed between threads, which can be hard to use. In Julia, the implementation is much simpler because parallelism is built in.
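
To give an idea of what built-in parallelism looks like, here is a minimal sketch using the Base.Threads standard module. The function name parallel_squares is hypothetical. Each iteration writes to its own array slot, so no locking is needed. It assumes Julia was started with several threads (for example julia -t 4); with a single thread the loop simply runs serially.

```julia
using Base.Threads

# Hypothetical example: fill an array in parallel, one slot per iteration.
function parallel_squares(n::Int)
    out = Vector{Int}(undef, n)
    @threads for i in 1:n
        out[i] = i^2          # independent writes, so there is no data race
    end
    return out
end

println("threads available: ", nthreads())
squares = parallel_squares(10)
```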

Native machine learning libraries

Flux is a machine learning library for Julia and there are other deep learning frameworks under development that are entirely written in Julia and can be modified as needed by the user. These libraries come with GPU acceleration, so you don’t need to worry about the slow training of deep learning models.

For more information, you can read the InfoWorld article.

Overview

In this article, I will discuss the advantages of the Julia language and show how easy it is to use DataFrames.jl, just like pandas in Python. I will use simple examples and a few lines of code to demonstrate data manipulation and data visualization. We will be using the famous Heart Disease UCI | Kaggle dataset, which has a binary classification of heart disease based on multiple factors.

Exploring Heart Disease Dataset

Before we start coding, I want you to look at A tutorial on DataFrames.jl prepared for JuliaCon2021, as most of my code is inspired by that live session.

Let’s set up your Julia REPL: either use JuliaPro or set up VS Code for Julia. If you are using a cloud notebook like me, I suggest adding the code below to your Dockerfile and building it.

The Docker code below only works in the Deepnote environment.

FROM gcr.io/deepnote-200602/templates/deepnote
RUN wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz && \
    tar -xvzf julia-1.6.2-linux-x86_64.tar.gz && \
    sudo mv julia-1.6.2 /usr/lib/ && \
    sudo ln -s /usr/lib/julia-1.6.2/bin/julia /usr/bin/julia && \
    rm julia-1.6.2-linux-x86_64.tar.gz && \
    julia -e 'using Pkg; pkg"add IJulia LinearAlgebra SparseArrays Images MAT"'
ENV DEFAULT_KERNEL_NAME "julia-1.6.2"

Installing Julia packages

The method below will help you download and install multiple libraries at once.
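
One way to do this is to pass a list of package names to Pkg.add. The package list here is an assumption based on the libraries used later in this article, not necessarily the exact list from the original notebook.

```julia
using Pkg

# Assumed package list for this walkthrough; adjust it to your needs.
Pkg.add(["CSV", "DataFrames", "CategoricalArrays", "Chain", "StatsPlots", "GLM"])
```

Installation needs network access and can take a few minutes the first time, because the packages are precompiled.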

Importing Packages

We will be focusing on loading data, data manipulation, and visualization.
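
A typical set of imports for this kind of work might look like the sketch below. It assumes the packages have already been installed and mirrors the libraries mentioned in this article.

```julia
using CSV, DataFrames, CategoricalArrays   # loading and manipulating tables
using Chain                                # @chain pipelines
using Statistics                           # mean and friends (standard library)
using StatsPlots                           # plotting recipes on top of Plots.jl
using GLM                                  # generalized linear models
```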

Loading Data

We are using the famous Heart Disease UCI | Kaggle dataset for our beginner-level data analysis.

Features/Columns:

  1. age
  2. sex
  3. chest pain type (4 values)
  4. resting blood pressure
  5. serum cholesterol in mg/dl
  6. fasting blood sugar > 120 mg/dl
  7. resting electrocardiographic results (values 0,1,2)
  8. maximum heart rate achieved
  9. exercise-induced angina
  10. oldpeak
  11. the slope of the peak exercise ST segment
  12. number of major vessels
  13. thal: 3 to 7 where 5 is normal.

Simply use CSV.read(), just like pd.read_csv() in pandas, and your data will be loaded as a data frame.

For more information, see Comparison with Python/R/Stata · DataFrames.jl (juliadata.org).
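
The call looks roughly like the sketch below. To keep it self-contained, it writes a tiny stand-in file with made-up rows to a temporary path instead of using the real Kaggle file; with the real data you would pass the path to your downloaded CSV.

```julia
using CSV, DataFrames

# Made-up rows standing in for the Kaggle file (not the real data).
csv_text = """
age,sex,chol,target
63,1,233,1
37,1,250,1
57,0,354,0
"""
path = tempname() * ".csv"
write(path, csv_text)

df = CSV.read(path, DataFrame)   # second argument: the table type to build
first(df, 2)                     # peek at the first two rows
```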

Checking the shape of the dataframe

Checking the distribution of multiple columns. We can observe the mean, min, max, and missing values all in one place by using describe().
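
Both checks are one-liners. The sketch below uses a small made-up data frame in place of the heart disease data; size returns the shape and describe accepts the names of the statistics you want.

```julia
using DataFrames, Statistics

# Made-up rows standing in for the heart disease data.
df = DataFrame(age = [63, 37, 41, 56],
               chol = [233, 250, 204, 236],
               target = [1, 1, 0, 0])

shape = size(df)                                          # (rows, columns)
summary_df = describe(df, :mean, :min, :max, :nmissing)   # one row per column
```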

Data select

We convert columns to categorical types with :fbs => categorical => :fbs, and we use Between to select multiple columns at once. The select function makes it simple to select columns and manipulate their types.
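
Here is a minimal sketch of that pattern on made-up rows. It assumes CategoricalArrays.jl is installed, since that package provides categorical. The column range passed to Between is chosen so that it does not overlap with the converted columns.

```julia
using DataFrames, CategoricalArrays

# Made-up rows; the column names follow the heart disease dataset.
df = DataFrame(age = [63, 37, 41], sex = [1, 1, 0], cp = [3, 2, 1],
               fbs = [1, 0, 0], target = [1, 1, 0])

df2 = select(df,
             Between(:age, :cp),                  # columns age through cp
             :fbs => categorical => :fbs,         # convert, keep the same name
             :target => categorical => :target)
```

select returns a new data frame, so the original df keeps its integer columns.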

Using Chain

If you want to apply multiple operations to your dataset in one go, I suggest using the @chain macro from Chain.jl; it is the equivalent of %>% in R.

  • The dropmissing function removes rows with missing values from the data frame. We don’t have any missing values in our dataset, so this is simply for a showcase.
  • The groupby function groups the data frame on a given column.
  • The combine function merges the rows of each group using an aggregation function.

For more operations, check out the Chain.jl documentation.

We have grouped the data frame by target and then combined five columns to get their mean values.
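
The pipeline described above can be sketched as follows. The rows are made up and only two columns are averaged instead of five, to keep the example short. sort = true is passed to groupby so that the group order is predictable.

```julia
using Chain, DataFrames, Statistics

# Made-up rows standing in for the heart disease data.
df = DataFrame(age = [63, 37, 41, 56, 57, 67],
               chol = [233, 250, 204, 236, 354, 286],
               target = [1, 1, 1, 1, 0, 0])

result = @chain df begin
    dropmissing                          # no-op here: the toy data has no missing values
    groupby(:target, sort = true)        # one group per target value
    combine([:age, :chol] .=> mean)      # creates age_mean and chol_mean
end
```

Each step receives the result of the previous one as its first argument, which is what makes the block read like a %>% pipeline.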

Another way to use groupby and combine is with names(df, Real), which returns all the columns that hold real values.
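
A small sketch of this variant, again on made-up rows. A text column is included to show that names(df, Real) skips it. Note that the grouping column target is numeric too, so it is also averaged.

```julia
using DataFrames, Statistics

# Made-up rows; sex is stored as text here only to show that it gets skipped.
df = DataFrame(age = [63, 37, 57, 67], chol = [233, 250, 354, 286],
               sex = ["M", "M", "F", "M"], target = [1, 1, 0, 0])

numeric_cols = names(df, Real)     # column names whose element type is a real number
means = combine(groupby(df, :target, sort = true), numeric_cols .=> mean)
```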

We can also group by multiple columns and combine them with nrow, which displays the number of rows in each sub-group.

We can also combine and then unstack, as shown below, so that one category becomes the index and the other becomes the columns.
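
Counting rows per sub-group and reshaping the result could look like the sketch below. The rows are made up, and the output column name count is an arbitrary choice for this example.

```julia
using DataFrames

# Made-up rows: two categorical columns from the heart disease data.
df = DataFrame(sex = [1, 1, 0, 1, 0, 1],
               target = [1, 1, 1, 0, 0, 0])

# Number of rows in every (target, sex) sub-group.
counts = combine(groupby(df, [:target, :sex], sort = true), nrow => :count)

# Wide format: one row per target, one column per sex value.
wide = unstack(counts, :target, :sex, :count)
```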

Groupby

A plain groupby call shows all the groups at once; to access a specific group, you need to index the result with a key.

We can use any of the following:

gd[(target=0,)] | gd[Dict(:target => 0)] | gd[(0,)]

to get the specific group we are focusing on, which helps when we are dealing with multiple categories. gd[1] will show the first group, where target = 0.
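
The sketch below shows the three key styles side by side on made-up rows. sort = true is an addition here, so that target = 0 is guaranteed to be the first group when indexing by position.

```julia
using DataFrames

df = DataFrame(age = [63, 37, 57, 67], target = [1, 1, 0, 0])   # made-up rows
gd = groupby(df, :target, sort = true)   # sorted, so target = 0 comes first

g_named = gd[(target = 0,)]              # NamedTuple key
g_dict  = gd[Dict(:target => 0)]         # Dict key
g_tuple = gd[(0,)]                       # plain Tuple key
g_first = gd[1]                          # positional index
```

All four expressions return the same sub-data-frame in this setup.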

Density Plot

We will use the StatsPlots.jl package to plot graphs and charts. This package contains statistical recipes that extend the Plots.jl functionality. Just like seaborn, with simple code we can get a density plot in which the separate groups are distinguished by color.

We are going to group by the target column and plot chol, which displays the cholesterol distribution.
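
A grouped density plot can be sketched in one line with the @df macro from StatsPlots.jl. The cholesterol values below are made up for illustration, and the axis labels are arbitrary choices.

```julia
using DataFrames, StatsPlots

# Made-up cholesterol values for two target groups.
df = DataFrame(chol = [233, 250, 204, 236, 192, 263, 354, 294, 199, 239, 275, 168],
               target = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# One density curve per target value, colored automatically.
p = @df df density(:chol, group = :target, xlabel = "chol", ylabel = "density")
```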

Group Histogram

Similar to the density plot, we can use groupedhist to plot histograms for the different target categories.
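
In StatsPlots.jl, grouped histograms are available through groupedhist. The sketch below uses made-up values; bar_position = :dodge places the bars of each group side by side, and :stack would stack them instead.

```julia
using DataFrames, StatsPlots

# Made-up cholesterol values for two target groups.
df = DataFrame(chol = [233, 250, 204, 236, 192, 263, 354, 294, 199, 239, 275, 168],
               target = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

p = groupedhist(df.chol, group = df.target, bar_position = :dodge)
```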

Multiple plots

You can plot multiple graphs on the same figure by using ! at the end of the function name, for example boxplot!().

The example below shows a violin plot, boxplot, and dotplot of cholesterol grouped by target.
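
A layered plot of this kind can be sketched as follows, using made-up values. The styling keywords are arbitrary choices. The functions ending in ! draw onto the current figure instead of creating a new one.

```julia
using DataFrames, StatsPlots

# Made-up cholesterol values for two target groups.
df = DataFrame(chol = [233, 250, 204, 236, 192, 263, 354, 294, 199, 239, 275, 168],
               target = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

labels = string.(df.target)                       # categorical x-axis labels
p = violin(labels, df.chol, linewidth = 0, label = "violin")
boxplot!(labels, df.chol, fillalpha = 0.75, label = "box")
dotplot!(labels, df.chol, marker = (:black, stroke(0)), label = "points")
```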

Predictive Model

We will use the GLM package; just like in R, you can use the y ~ x formula syntax to train the model.

The example below has x = trestbps, age, chol, thalach, oldpeak, slope, ca and y = target, which is binary. We are going to train our generalized linear model with a binomial distribution to predict heart disease. As we can see, our model is trained, but it still needs some tuning to get better performance.
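
As a rough sketch of the formula interface, the code below fits a logistic regression with GLM.jl on a handful of made-up rows. It uses only two predictors (age and chol) rather than the seven listed above, so the coefficients carry no meaning for the real dataset.

```julia
using DataFrames, GLM

# Made-up rows; the two classes overlap on purpose so that the fit is well defined.
df = DataFrame(age    = [63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48],
               chol   = [233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275],
               target = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Binomial family with a logit link, i.e. logistic regression.
model = glm(@formula(target ~ age + chol), df, Binomial(), LogitLink())

coefs = coef(model)        # intercept, age, chol
probs = predict(model)     # fitted probabilities for the training rows
```

Printing model in the REPL shows the coefficient table with standard errors and p-values, similar to summary() in R.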

Conclusion

We have showcased how simple it is to write Julia code and how powerful it is when it comes to scientific calculations. We have discovered that this language has the potential to overtake Python thanks to its simple syntax and higher performance. Julia is still new to data science, but I am sure it is the future of machine learning and artificial intelligence.

To be honest, I am also learning new things about Julia every day. If you want to know more about parallel computing, deep learning on GPUs, or machine learning in general, look out for my future articles. I haven’t explored the predictive models further, as this is an introductory article with generalized examples. So, if you think there is more I could do, do let me know and I will try to add it in my next article.

You can follow me on LinkedIn and Polywork, where I publish an article every week.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Originally published at https://www.analyticsvidhya.com on July 30, 2021.
