Julia: A New Age of Data Science
Learn why Julia is the future of data science and write your first Julia code
Introduction
Julia is a high-level, general-purpose language for writing code that is fast to execute and easy to implement for scientific calculations. The language is designed around the needs of scientific researchers and data scientists, so that they can optimize experimentation and design implementation. Source: Julia (programming language).
“Julia was built for scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing” - the developers behind the Julia language.
Python is still popular among data science enthusiasts because it offers an ecosystem loaded with libraries that make data science work easier, but Python isn’t fast or convenient enough, and it comes with security vulnerabilities, as most of its libraries are built on other languages such as JavaScript, Java, C, and C++. Julia's fast execution and convenient development make it quite attractive to the data science community, and the majority of its libraries are written directly in Julia, which provides an extra layer of security. Source: InfoWorld.
According to Jeremy Howard “Python is not the future of machine learning”.
In his recent video, he talks about how frustrating Python is when it comes to running machine learning models: you have to use libraries written in other languages, such as CUDA, and for parallel computing you must use yet more libraries, which makes it quite challenging. Jeremy has also suggested that if you want to be future-proof, you should start learning Julia, as it will overtake Python within a few years.
Julia
Compared with Python, Julia takes the lead in several areas, described below.
Julia is fast
Python can be optimized by using external libraries and optimization tools, but Julia is faster by default as it comes with JIT compilation and type declarations.
Math-friendly syntax
Julia attracts non-programmer scientists by providing simple syntax for math operations that resembles how math is written outside of computing.
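As a small illustration of this math-like syntax, here is a minimal sketch using only base Julia. The function names f and hypotenuse are made up for this example.

```julia
# A numeric literal can be written directly in front of a variable,
# so 2x^2 means 2 * x^2, just like on paper.
f(x) = 2x^2 + 3x + 1

# Unicode operators are allowed, for example √ for the square root.
hypotenuse(a, b) = √(a^2 + b^2)

# A dot after the function name applies it to every element of an array.
xs = [1, 2, 3]
ys = f.(xs)

println(f(2))              # 15
println(hypotenuse(3, 4))  # 5.0
println(ys)                # [6, 15, 28]
```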
Automatic memory management
Like Python, Julia manages memory automatically through garbage collection, and it also gives you more freedom to control the garbage collector manually, for example by triggering a collection with GC.gc(). In Python, keeping track of memory usage in large numerical workloads can be daunting in some cases.
Superior parallelism
It’s necessary to use all available resources while running scientific algorithms, for example by running them in parallel on a multi-core processor. In Python, you typically rely on the multiprocessing module or external packages for parallel computing, and data passed between workers has to be serialized and deserialized, which can be hard to use. In Julia, it’s much simpler to implement, as parallelism is built into the language.
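To give a feel for the built-in parallelism, here is a minimal sketch that uses the Threads module shipped with Julia. The function name threaded_sum_of_squares is hypothetical, and the chunking scheme is just one simple way to avoid threads writing to the same variable.

```julia
using Base.Threads

function threaded_sum_of_squares(xs)
    # Split the indices into at most nthreads() contiguous chunks.
    chunk_len = max(1, cld(length(xs), nthreads()))
    chunks = collect(Iterators.partition(eachindex(xs), chunk_len))
    partials = zeros(Float64, length(chunks))

    # Each iteration writes only to its own slot, so no locking is needed.
    @threads for i in eachindex(chunks)
        s = 0.0
        for j in chunks[i]
            s += xs[j]^2
        end
        partials[i] = s
    end
    return sum(partials)
end

xs = collect(1.0:1000.0)
println("threads available: ", nthreads())
println(threaded_sum_of_squares(xs))
```

Julia starts with a single thread unless you ask for more, for example with julia -t auto or the JULIA_NUM_THREADS environment variable, so the speed-up only shows up when more than one thread is available.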
Native machine learning libraries
Flux is a machine learning library for Julia and there are other deep learning frameworks under development that are entirely written in Julia and can be modified as needed by the user. These libraries come with GPU acceleration, so you don’t need to worry about the slow training of deep learning models.
For more information, you can read the InfoWorld article.
Overview
In this article, I will discuss the advantages of the Julia language and show how easy it is to use DataFrames.jl, just like pandas in Python. I will use simple examples and a few lines of code to demonstrate data manipulation and data visualization. We will be using the famous Heart Disease UCI | Kaggle dataset, which has a binary classification of heart disease based on multiple factors.
Exploring Heart Disease Dataset
Before we start coding, I want you to look into A tutorial on DataFrames.jl prepared for JuliaCon 2021, as most of my code is inspired by the live conference.
Let’s set up your Julia REPL: either use JuliaPro or set up VS Code for Julia. If you are using a cloud notebook like me, I suggest adding the code below to your Dockerfile and building it.
The Docker code below only works in the Deepnote environment.
FROM gcr.io/deepnote-200602/templates/deepnote
RUN wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz && \
    tar -xvzf julia-1.6.2-linux-x86_64.tar.gz && \
    sudo mv julia-1.6.2 /usr/lib/ && \
    sudo ln -s /usr/lib/julia-1.6.2/bin/julia /usr/bin/julia && \
    rm julia-1.6.2-linux-x86_64.tar.gz && \
    julia -e "using Pkg;pkg\"add IJulia LinearAlgebra SparseArrays Images MAT\""
ENV DEFAULT_KERNEL_NAME "julia-1.6.2"
Installing Julia packages
The method below will help you download and install multiple libraries at once.
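A minimal sketch of installing several packages in one call. The exact package list is my assumption, based on the libraries named in this article.

```julia
using Pkg

# Packages for loading, manipulating, plotting, and modeling the data.
# This list is an assumption; adjust it to what you actually need.
packages = ["CSV", "DataFrames", "CategoricalArrays", "Chain", "StatsPlots", "GLM"]

# Pkg.add accepts a vector of names and installs them all at once.
Pkg.add(packages)
```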
Importing Packages
We will be focusing on data loading, manipulation, and visualization.
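The imports could look like the sketch below, assuming the packages named in this article are already installed. Statistics ships with Julia and provides mean.

```julia
using CSV, DataFrames, CategoricalArrays, Chain, StatsPlots, GLM
using Statistics  # standard library: mean, median, std, ...
```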
Loading Data
We are using the famous Heart Disease UCI | Kaggle dataset for our beginner-level data analysis.
Features/Columns:
- age
- sex
- chest pain type (4 values)
- resting blood pressure
- serum cholesterol in mg/dl
- fasting blood sugar > 120 mg/dl
- resting electrocardiographic results (values 0,1,2)
- maximum heart rate achieved
- exercise-induced angina
- oldpeak
- the slope of the peak exercise ST segment
- number of major vessels
- thal: 3 to 7 where 5 is normal.
Simply use CSV.read(), just like pandas pd.read_csv(), and your data will be loaded as a data frame.
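Here is a minimal sketch of the loading step. The file name heart.csv is an assumption, and the three inline rows are made-up stand-ins so the snippet runs without the file.

```julia
using CSV, DataFrames

# With the Kaggle file on disk (file name is an assumption):
# df = CSV.read("heart.csv", DataFrame)

# Stand-in so the snippet runs on its own: made-up rows, real column names.
sample = """
age,sex,chol,target
63,1,233,1
41,0,204,1
57,1,131,0
"""
df = CSV.read(IOBuffer(sample), DataFrame)

first(df, 2)  # peek at the first two rows
```

The second argument tells CSV.read which table type to build, which is why DataFrame is passed explicitly.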
For more information, see Comparison with Python/R/Stata · DataFrames.jl (juliadata.org).
Checking the shape of the dataframe
Checking the distribution of multiple columns: we can observe the mean, min, max, and missing values all at once by using describe().
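Both checks can be sketched on a tiny made-up data frame. The values are placeholders, not rows from the real dataset.

```julia
using DataFrames

# Toy data with one missing value so that nmissing has something to show.
df = DataFrame(
    age = [63, 41, 57, 52],
    chol = [233, 204, 131, missing],
    target = [1, 1, 0, 0],
)

println(size(df))  # (4, 3), i.e. (rows, columns)

# Default summary of every column.
describe(df)

# Or request only the statistics you care about.
stats = describe(df, :mean, :min, :max, :nmissing)
```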
Data select
We convert columns into categorical types by using :fbs => categorical => :fbs, and we use Between to select multiple columns at once. The select function makes it simple to select columns and manipulate their types.
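A minimal sketch of select on a made-up data frame. The column names follow the dataset, the values are placeholders, and the particular columns chosen are my own pick.

```julia
using DataFrames, CategoricalArrays

df = DataFrame(
    age = [63, 41, 57],
    sex = [1, 0, 1],
    cp = [3, 1, 0],
    trestbps = [145, 130, 140],
    chol = [233, 204, 131],
    fbs = [1, 0, 0],
    target = [1, 1, 0],
)

df2 = select(df,
    :age,
    Between(:cp, :chol),                # cp, trestbps, chol
    :fbs => categorical => :fbs,        # source => function => new name
    :target => categorical => :target,
)
```

select returns a new data frame and leaves df untouched. The in-place variant is select!.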
Using Chain
If you want to apply multiple operations on your dataset all at once, I suggest using the @chain functionality; in R, the equivalent is %>%.
- The dropmissing function removes rows with missing values from the data frame. We don’t have any missing values in our dataset, so this is simply for showcase.
- The groupby function groups the data frame on a given column.
- The combine function merges the rows of a data frame using an aggregation function.
For more operations, check out the Chain.jl documentation.
We have grouped the data frame by target and then combined five columns to get the mean values.
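The chain described above can be sketched as follows. To keep it short I use a made-up data frame and aggregate two columns instead of five.

```julia
using DataFrames, Chain, Statistics

df = DataFrame(
    age = [63, 41, 57, 52],
    chol = [233, 204, 131, missing],
    target = [1, 1, 0, 0],
)

# Each step receives the previous result as its first argument.
result = @chain df begin
    dropmissing                        # drops the row with the missing chol
    groupby(:target)
    combine([:age, :chol] .=> mean)    # creates age_mean and chol_mean
end
```

The broadcast .=> pairs every listed column with mean, so adding more columns to the vector is all it takes to aggregate them too.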
Another way to use groupby and combine is with names(df, Real), which returns all the columns with real values.
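A minimal sketch with made-up data. Removing the grouping column with setdiff is my own addition: when target is stored as an integer it counts as a real-valued column too.

```julia
using DataFrames, Statistics

df = DataFrame(
    age = [63, 41, 57, 52],
    chol = [233, 204, 131, 250],
    label = ["a", "b", "c", "d"],   # not Real, so it is skipped
    target = [1, 1, 0, 0],
)

# names(df, Real) keeps columns whose element type is a subtype of Real.
numeric_cols = setdiff(names(df, Real), ["target"])

result = combine(groupby(df, :target), numeric_cols .=> mean)
```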
We can also group by multiple columns using groupby and combine them with nrow, which displays the number of rows in each sub-group.
We can also combine and then unstack the result, so that one category becomes the row index and the other becomes the columns.
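Counting rows per sub-group and then reshaping the counts can be sketched like this, again on made-up values. Grouping by target and sex is my assumption about which columns to use.

```julia
using DataFrames

df = DataFrame(
    sex = [1, 0, 1, 1, 0],
    target = [1, 1, 0, 0, 0],
)

# One row per (target, sex) pair, with the count in a column named nrow.
counts = combine(groupby(df, [:target, :sex]), nrow)

# Long to wide: target stays as rows, each sex value becomes a column.
wide = unstack(counts, :target, :sex, :nrow)
```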
Groupby
The simple groupby function shows all groups at once; to access a specific group you need a few Julia tricks.
We can use 👇
gd[(target=0,)] | gd[Dict(:target => 0)] | gd[(0,)]
to get the specific group we are focusing on, which helps when we are dealing with multiple categories. gd[1] shows the first group, where target = 0.
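The three lookups can be tried on a small made-up data frame. The variable name gd follows the article, the values are placeholders.

```julia
using DataFrames

df = DataFrame(age = [57, 52, 63, 41], target = [0, 0, 1, 1])
gd = groupby(df, :target)

# Three equivalent ways to pick the group where target == 0.
by_named_tuple = gd[(target = 0,)]
by_dict = gd[Dict(:target => 0)]
by_tuple = gd[(0,)]

# Groups can also be picked by position.
first_group = gd[1]
```

Lookups by key stay correct even if the order of the groups changes, which makes them safer than positional indexing when there are many categories.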
Density Plot
We will use the StatsPlots.jl package to plot graphs and charts. This package contains statistical recipes that extend the Plots.jl functionality. Just like seaborn, with simple code we can get a density plot in which the separate groups are distinguished by color.
We are going to group by the target column and plot the chol column, which displays the cholesterol distribution.
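A minimal sketch of the grouped density plot. The data is randomly generated placeholder data, not the real dataset, and the axis labels and legend position are my own choices.

```julia
using DataFrames, StatsPlots, Random

# Random placeholder values standing in for the real chol column.
Random.seed!(1)
df = DataFrame(
    chol = vcat(rand(180:260, 30), rand(200:320, 30)),
    target = repeat([0, 1], inner = 30),
)

# @df lets us refer to columns as symbols inside the plot call.
plt = @df df density(:chol, group = :target,
                     xlabel = "chol", ylabel = "density", legend = :topright)
```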
Group Histogram
Similar to the density plot, we can use groupedhist to plot a histogram for different target categories.
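A minimal sketch of the grouped histogram on random placeholder data. The bar_position and bins settings are my own choices.

```julia
using DataFrames, StatsPlots, Random

# Random placeholder values standing in for the real chol column.
Random.seed!(1)
df = DataFrame(
    chol = vcat(rand(180:260, 30), rand(200:320, 30)),
    target = repeat([0, 1], inner = 30),
)

# :dodge draws the groups side by side; :stack would pile them up.
plt = @df df groupedhist(:chol, group = :target,
                         bar_position = :dodge, bins = 10)
```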
Multiple plots
You can plot multiple graphs on the same figure by using ! at the end of the function name, for example boxplot!().
Here we overlay a violin plot, boxplot, and dotplot of cholesterol, grouped by target.
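The layering can be sketched as follows, on random placeholder data. The styling keywords are my own choices, following the pattern in the StatsPlots.jl README.

```julia
using DataFrames, StatsPlots, Random

# Random placeholder values standing in for the real chol column.
Random.seed!(1)
df = DataFrame(
    chol = vcat(rand(180:260, 30), rand(200:320, 30)),
    target = repeat([0, 1], inner = 30),
)

# The first call creates the figure; string.() turns target into labels.
plt = @df df violin(string.(:target), :chol, linewidth = 0, label = "violin")

# Calls ending in ! draw on top of the current figure.
@df df boxplot!(string.(:target), :chol, fillalpha = 0.75, linewidth = 2, label = "boxplot")
@df df dotplot!(string.(:target), :chol, marker = (:black, stroke(0)), label = "dotplot")
```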
Predictive Model
We will use the GLM package; just like in R, you can use y ~ x to train the model.
The example has x = trestbps, age, chol, thalach, oldpeak, slope, ca and y = target, which is binary. We are going to train our generalized linear model with a binomial distribution to predict heart disease. As we can see, our model is trained, but it still needs some tuning to get better performance.
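A minimal sketch of the model fit. The data frame here is filled with random placeholder numbers, so the coefficients are meaningless; with the real dataset you would pass the loaded data frame instead. Using LogitLink is my assumption for the link function.

```julia
using DataFrames, GLM, Random

# Random placeholder data with the article's column names.
Random.seed!(42)
n = 200
df = DataFrame(
    trestbps = rand(100:180, n),
    age = rand(30:75, n),
    chol = rand(150:350, n),
    thalach = rand(90:190, n),
    oldpeak = round.(4 .* rand(n), digits = 1),
    slope = rand(0:2, n),
    ca = rand(0:3, n),
    target = rand(0:1, n),
)

# Logistic regression: binomial family with a logit link.
model = glm(
    @formula(target ~ trestbps + age + chol + thalach + oldpeak + slope + ca),
    df, Binomial(), LogitLink(),
)

# Predicted probability of target == 1 for each row.
probs = predict(model, df)
```

Printing model shows the coefficient table with standard errors and p-values, which is a reasonable starting point for the tuning mentioned above.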
Conclusion
We have showcased how simple it is to write Julia code and how powerful it is when it comes to scientific calculations. We have seen that this language has the potential to overtake Python thanks to its simple syntax and higher performance. Julia is still new to data science, but I am sure it is the future of machine learning and artificial intelligence.
To be honest, I am also learning new things about Julia every day, and if you want to know more about parallel computing, deep learning using GPUs, or machine learning in general, do look out for my other articles in the future. I haven’t explored the predictive models further, as this is an introductory article with generalized examples. So, if you think there is more I could do, do let me know and I will try to add it in my next article.
You can follow me on LinkedIn and Polywork, where I publish an article every week.
Originally published at https://www.analyticsvidhya.com on July 30, 2021.