Working with MS SQL Server in Julia

Time to supercharge your data analysis workflow

Vikas Negi
Towards Data Science



SQL databases are among the most widely deployed pieces of software in the world. They form the backbone of numerous applications, ranging from business data analytics to weather forecasting. A number of client-server implementations currently exist, and Microsoft’s SQL Server is one of them. The fully featured Developer edition is available for free, and it runs on Windows, on Linux, and via Docker.

Data scientists often need to interact with data stored in SQL databases. While it’s easy to find guides on how to do this with languages such as Python, tutorials for Julia are quite scarce. Therefore, in this article, I will focus on how to work with SQL Server using Julia. Example code is generated using a Pluto notebook with Julia 1.9.1 running on Linux (Elementary OS).

Prerequisites

1. SQL Server 2022

You need to have a SQL server running locally. The easiest way to set one up is via Docker. Instructions for SQL Server 2022 are given here. To verify that the Docker container is running, use the following command:

watch -n 2 sudo docker ps -a

This will refresh every 2 seconds, and the STATUS column should show something like ‘Up X minutes’, where X is the time elapsed since the container was started.

2. Microsoft ODBC driver 17 for Linux

Instructions are given here. I was unable to connect to the database using the newer driver 18, so I cannot recommend using it.

3. sqlcmd utility (optional)

The sqlcmd utility lets you enter Transact-SQL statements, and it is a great way to test whether everything is working as expected. Follow the instructions here.

Loading packages

The following Julia packages will be needed. When using a Pluto notebook, its built-in package manager will automatically download and install them for you.

using ODBC, DBInterface, DataFrames

Check drivers

Open Database Connectivity (ODBC) drivers allow us to make connections to the SQL server. Using the ODBC.jl package, we can check the currently available drivers on our system:

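The check itself is a one-liner; a minimal sketch (the output is a name-to-configuration mapping and will vary per system):

```julia
using ODBC

# List the ODBC drivers currently registered on this system
ODBC.drivers()
```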

It is also possible to install a driver once its location is known.

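A sketch of the installation call; the library path below is an assumption based on a typical msodbcsql17 install location, so substitute the actual path of the shared library on your machine:

```julia
using ODBC

# Register a driver by name, pointing at its shared library
# (path is an assumption -- adjust to your installation)
ODBC.adddriver("ODBC Driver 17 for SQL Server",
               "/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.10.so.1.1")
```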

To remove a driver, use:

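For example, assuming the driver name used above:

```julia
using ODBC

# Unregister the driver by name
ODBC.removedriver("ODBC Driver 17 for SQL Server")
```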

Add connection

Using a full connection string, we can now connect to the locally running SQL server that was set up previously. The IP address, port, an existing database name, user ID, and password are needed. Note that if the database name is unknown, we can connect to ‘master’, as this database always exists by default.

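A sketch of such a connection; the server address, user ID, and password below are placeholders to be replaced with your own:

```julia
using ODBC, DBInterface

# Connect to the default 'master' database using a full connection string
conn_master = DBInterface.connect(ODBC.Connection,
    "Driver={ODBC Driver 17 for SQL Server};" *
    "Server=127.0.0.1,1433;Database=master;UID=sa;PWD=<YourPassword>")
```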

List all existing databases

Using the conn_master object, we can now execute queries on the server. Let’s list all the databases.

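A minimal version of such a query; the helper name list_db follows the article’s naming, and the query against sys.databases is standard T-SQL:

```julia
using ODBC, DBInterface, DataFrames

# sys.databases holds one row per database on the server
function list_db(conn)
    DataFrame(DBInterface.execute(conn, "SELECT name FROM sys.databases"))
end

list_db(conn_master)
```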

Create a new database

To create a new database, we should first check whether the name already exists using the list_db function. If not, we create it as shown below, with ‘FruitsDB’ as an example.

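A sketch of this check-then-create step, reusing the list_db helper; the function name create_db is illustrative:

```julia
using ODBC, DBInterface

# Create a database only if its name is not already taken
function create_db(conn, db_name)
    if db_name in list_db(conn).name
        @info "Database $db_name already exists"
    else
        DBInterface.execute(conn, "CREATE DATABASE $db_name")
    end
end

create_db(conn_master, "FruitsDB")
```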

Listing all the databases again, we can verify that ‘FruitsDB’ has now been created.


Create a new table

SQL Server databases can contain a number of tables, which are simply ordered collections of data. A table itself is a collection of rows, also known as records. Before we can start populating a table, we first need to create it within an existing database. As an example, let’s create a table called ‘Price_and_Origin’ within ‘FruitsDB’. This table will contain three columns: Name (string), Price (float), and Origin (string). Note that VARCHAR(50) denotes variable-size string data; 50 is the maximum size in bytes, which for a single-byte encoding also equals the maximum length of the string.
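A sketch of the table creation, assuming a connection to ‘FruitsDB’ with the same placeholder credentials as before:

```julia
using ODBC, DBInterface

# Connect directly to FruitsDB (credentials are placeholders)
conn_fruit = DBInterface.connect(ODBC.Connection,
    "Driver={ODBC Driver 17 for SQL Server};" *
    "Server=127.0.0.1,1433;Database=FruitsDB;UID=sa;PWD=<YourPassword>")

# Create the table with three columns
DBInterface.execute(conn_fruit, """
    CREATE TABLE Price_and_Origin
    (
        Name   VARCHAR(50),
        Price  FLOAT,
        Origin VARCHAR(50)
    )""")
```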

Add to new table

Once a table exists, we can add data to it. The easiest way is to use a DataFrame as the source. Remember that our table ‘Price_and_Origin’ expects three columns: name, price, and origin. Therefore, we can use some dummy data as shown below:

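For instance, a small DataFrame with made-up rows (the names, prices, and origins here are purely illustrative):

```julia
using DataFrames

# Dummy data matching the three columns of Price_and_Origin
df_fruits = DataFrame(
    Name   = ["Apple", "Banana", "Mango", "Orange", "Kiwi"],
    Price  = [50.0, 20.0, 75.0, 30.0, 99.0],
    Origin = ["Austria", "Costa Rica", "India", "Spain", "New Zealand"])
```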

To insert values, we can make use of the DBInterface.executemany function, which allows passing multiple values in sequence. This can be done as shown in the function below. The finally clause ensures that the database connection is closed using the DBInterface.close! function. This is generally good practice, and it helps avoid accidentally reusing the same connection for something else.

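A sketch of such a function; the name add_to_fruit_table follows the article, the credentials are placeholders, and passing the columns as a named tuple of vectors is one way DBInterface.executemany accepts multiple rows:

```julia
using ODBC, DBInterface, DataFrames

function add_to_fruit_table(df::DataFrame)
    conn = DBInterface.connect(ODBC.Connection,
        "Driver={ODBC Driver 17 for SQL Server};" *
        "Server=127.0.0.1,1433;Database=FruitsDB;UID=sa;PWD=<YourPassword>")
    try
        # Each '?' is bound column-wise from the vectors below
        DBInterface.executemany(conn,
            "INSERT INTO Price_and_Origin (Name, Price, Origin) VALUES (?, ?, ?)",
            (name = df.Name, price = df.Price, origin = df.Origin))
    finally
        # Always release the connection, even if the insert fails
        DBInterface.close!(conn)
    end
end

add_to_fruit_table(df_fruits)
```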

Let’s verify that the table got populated as we expected. We first set up a connection ‘conn_fruit’ to ‘FruitsDB’ on the SQL server. Then we can select all entries from the table ‘Price_and_Origin’ and pass them to a DataFrame sink.

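A minimal sketch, again with placeholder credentials:

```julia
using ODBC, DBInterface, DataFrames

conn_fruit = DBInterface.connect(ODBC.Connection,
    "Driver={ODBC Driver 17 for SQL Server};" *
    "Server=127.0.0.1,1433;Database=FruitsDB;UID=sa;PWD=<YourPassword>")

# Materialize the query result as a DataFrame
DataFrame(DBInterface.execute(conn_fruit, "SELECT * FROM Price_and_Origin"))
```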

Updating a table

Following the same sequence as shown in the previous section, the database can now be updated with new data.

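For example, reusing the add_to_fruit_table function with a second batch of made-up rows (values illustrative):

```julia
using DataFrames

# Two new fruits to append to the existing table
df_new = DataFrame(Name   = ["Lichi", "Pear"],
                   Price  = [100.0, 45.0],
                   Origin = ["China", "South Africa"])

add_to_fruit_table(df_new)
```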

Let’s verify if the new data is indeed present within the database.

Note that the number of rows is now 7.

Removing duplicates

Re-executing the add_to_fruit_table function from above would add duplicate rows to the table.

“Lichi” and “Pear” now appear twice.

Using a common table expression (CTE), we can delete duplicate rows from a given table. The following function helps us achieve this:

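A sketch of such a function; the CTE numbers identical rows with ROW_NUMBER() and deletes every copy beyond the first (the function name is illustrative, and conn is an open connection to ‘FruitsDB’):

```julia
using ODBC, DBInterface

function remove_duplicates(conn)
    DBInterface.execute(conn, """
        WITH CTE AS
        (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY Name, Price, Origin
                       ORDER BY Name) AS row_num
            FROM Price_and_Origin
        )
        DELETE FROM CTE WHERE row_num > 1""")
end
```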

Let’s check whether the rows are now unique.

The duplicate entries have been removed.

Delete records

It is often necessary to remove entries matching a certain condition from a table within a database. For example, we can remove all fruits whose price is greater than 95, as shown below:
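A sketch of the delete_rows function (the threshold of 95 matches this example, and conn_db is an open connection to the database):

```julia
using ODBC, DBInterface

# Delete all rows whose Price exceeds 95
function delete_rows(conn_db, table_name)
    DBInterface.execute(conn_db,
        "DELETE FROM $table_name WHERE Price > 95")
end

delete_rows(conn_fruit, "Price_and_Origin")
```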

Fruits with a price greater than 95 have been removed.

Delete table

Using a DROP statement within the DBInterface.execute function, a table can be deleted. The rest of the function remains the same as in delete_rows.

DBInterface.execute(conn_db, "DROP TABLE $table_name")

Conclusion

The DBInterface.execute function accepts any valid SQL statement as input. It is therefore possible to execute all kinds of queries as outlined here, in addition to what has already been presented. As shown earlier, the results of a query can be easily passed to a Julia DataFrame sink, which can then be used to perform additional operations.

The packages ODBC.jl and DBInterface.jl are being actively maintained and seem to integrate well with existing workflows especially if they involve the use of DataFrames. This opens up exciting new possibilities for performing data analysis and visualization using Julia. I hope you found this exercise useful. Thank you for your time! Connect with me on LinkedIn or visit my Web 3.0 powered website.
