Introduction to SQL in Python for Data Scientists

The data scientist’s guide for using SQL in Python environment.

Nick Minaie, PhD
Towards Data Science

--

This article provides an overview of the basic SQL statements for data scientists, and explains how a SQL engine can be instantiated in Python and used for querying data from a database.

As a data scientist using Python, you often need to get your data from a relational database that is hosted either on your local server, or on the cloud (e.g. AWS cloud). There are many ways to approach this. For example, you can query your data in Oracle, save the file as a .csv file, and then import it in Python. However, the most efficient way it to use SQL directly in Python. Coupling SQL and Pandas would give you many options to query, process, and use the data for your project in Python.

First Things First! What is SQL?

SQL (aka Structured Query Language) is a programming language used for managing or querying data stored in a relational database management system (RDBMS). SQL has been the dominant language for handling structured data where the entities in the database (e.g. tables, or table entities) are related (that is why these databases are called relational databases). There are other options for handling such data, but SQL has been the most popular, widely used language in the industry.

How is “SQL” pronounced?

SQL was developed at IBM in the early 1970s, and it was originally called “SEQUEL (Structured English Query Language)”. Later on, the name was changed to SQL (Structured Query Language) due to a trademark issue. However, the pronunciation “see-qu-el” (/ˈsiːkwəl/ ) stayed with the language, and that is the adopted pronunciation by most practitioners.

[Pro tip: when you go to an interview, make sure you pronounce it “see-qu-el”, if you want the job!]

What Does a Relational Database Look Like?

Amazon Web Services provides the best definition for a relational databl:

A relational database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database. Each column in a table holds a certain kind of data and a field stores the actual value of an attribute. The rows in the table represent a collection of related values of one object or entity. Each row in a table could be marked with a unique identifier called a primary key, and rows among multiple tables can be made related using foreign keys. This data can be accessed in many different ways without reorganizing the database tables themselves.

Databases can have very complex designs, with many tables, and each table with many entities (columns) and many rows. It would be extremely difficult, or maybe even impossible, to query data when the relationships between tables is not known. ERD (Entity Relationship Diagram) is used to visualize these relationships and also show the entities in each table and their datatypes. Your database administrator should be able to give your the database’s ERD.

A Sample ERD — https://www.visual-paradigm.com/guide/data-modeling/what-is-entity-relationship-diagram/

How is SQL used in Python?

There are many ways to use SQL in Python. Multiple libraries have been developed for this purpose that can be utilized. SQLite and MySQL are examples of these libraries.

In this article, we are going to use Python Pandas in conjunction with sqlalchemy library.

Creating the SQL Engine

Let’s get started! We need to install and then import the libraries first. We will be using the create_engine feature from this library.

!pip install sqlalchemyimport pandas as pd
from sqlalchemy import create_engine

Once the library is imported, we need to create a SQL engine using this command which creates a new class`.Engine` instance.

engine = create_engine(*args, **kwargs)

The first argument is usually a string that indicates database dialect and connection arguments in the form of a URL and can be written as:

dialect[+driver]://user:password@host/dbname[?key=value..]

where dialect is a database name such as mysql, oracle, postgresql, etc., and driver the name of a DBAPI, such as psycopg2, pyodbc, cx_oracle, etc. More details on this can be found at https://www.sqlalchemy.org/.

Data Queries using SQL Statements

Now that you are connected to the database, you can submit data queries. to use sqlalchemy you need to wrap your SQL statements in a container, send it to the database, get the response back, and then put the response in a panda dataframe. The two primary clauses that must be present in every query are SELECT, and FROM.

  • SELECT allows you to select a subset of columns (or all of them) from a table,
  • FROM specifies which table the column(s) are being pulled from.

For example, the following code snippet will return all entities (columns) from table_1saves the response into a dataframe, and displays the head.

sql = """
SELECT *
FROM table_1
"""
df = pd.read_sql_query(sql, engine)
df.head()

Instead, you could also pull specific columns from the table, using this code instead:

SELECT entity_1, entity_2, entity_3
FROM table_1

If you are dealing with multiple tables (which you surely will in a real world project), you may need to specify which entity from which table because the entities of interest may come from different tables in the database. We will discuss how multiple entities from multiple tables can be queries, but this example is for entities from one table. In this case, you can use namespacing in your SQL statement:

SELECT table_1.entity_1, table_1.entity_2
FROM table_1

You can also assign aliases to each entity name or table name for simplification or readability purposes:

SELECT t.entity_1 AS name, t.entity_2 AS id
FROM table_1 AS t

If you want to get the distinct rows from a column, you can send this SQL statement:

SELECT DISTINCT entity_1
FROM table_1

If you want to order your data by a specific column (or multiple columns) you can use ORDER BY and specify a you want ASC (ascending) or DESC (descending) order. Remember, if you use multiple columns in ORDER BY the order in which SQL orders the data will be from left to right.

SELECT entity_1, entity_2, entity_3
FROM table_1
ORDER BY entity_1 DESC, entity_3 ASC

Sometime you are dealing with a very large dataset, but you may only need to retrieve a limited dataset from the database. If this case, you can use LIMIT:

SELECT *
FROM table_1
LIMIT 10

If you want to include a condition for the query, you can use WHERE. You can either use boolean conditions, or wildcards for string entities. For example:

SELECT *
FROM table_1
WHERE entity_1 > 5

or WHERE entity_1 BETWEEN 5 and 10

or WHERE entity_1 > 5 AND entity_1 < 10.

Wildcards (or wild characters) are symbols used to replace or represent one or more characters in a word. The one familiar one is * that is used for zero or many charactersor ? that is used for one character. We can use wildcards effectively in SQL when querying string entities using LIKE statement in SQL. The different between % and * is that % also account for underscore, but * does not. In Python, you should use %% instead of one %. The statement below returns all rows in which entity_3 starts with M.

SELECT *
FROM table_1
WHERE entity_3 LIKE "M%%"

ILIKE make this query insensitive to the character case, and NOT LIKE returns all rows in which the entity is NOT like the wildcard.

To deal with the null values, you can use:

SELECT *
FROM table_1
WHERE entity_1 IS NULL

or WHERE entity_1 IS NOT NULL.

Often you need to aggregate the data, group the data, and apply conditions to aggregated data. These aggregate statements include COUNT, AVG, MIN, MAX, and SUM. For example:

SELECT SUM(entity_1) AS sum, entity_2
FROM table_1

When using aggregates, you should use HAVING instead of WHERE, like:

SELECT SUM(entity_1) AS sum, entity_2
FROM table_1
HAVING entity_2 BETWEEN 5 and 10

To group your data by specific entity, you can use GROUP BY:

SELECT entity_1, SUM(entity_2) AS sum
FROM table_1
GROUP BY entity_1
HAVING entity_3 BETWEEN 5 and 10

Joining Tables

When querying data from multiple tables, you need to join these tables. There are multiple ways to join tables in SQL. Figure below illustrates these joins. You will likely work with inner joins more often, but it is important to understand what each type of join does.

Different Types of Table Joins in SQL — https://www.dofactory.com/sql/join

Joining tables can only be done when there is a common entity between the two tables, and you need to define that relationship using ON.

SELECT t_1.entity_1, t_2.entity_2
FROM table_1 AS t_1
INNER JOIN table_2 AS t_2 ON t_2.key = t_1.key

These statements cover the basics of SQL in Python. You can combine these statements based on the database you are dealing with, the type of data you need. There are many more statements that you can use. https://www.w3schools.com/sql/ provides a more comprehensive overview of SQL statements.

Nick Minaie, PhD (LinkedIn Profile) is a senior leader and a visionary data scientist, and represents a unique combination of leadership skills, world-class data-science expertise, business acumen, and the ability to lead organizational change. His mission is to advance the practice of Artificial Intelligence (AI) and Machine Learning in the industry.

--

--

Machine Learning and Artificial Intelligence Leader, Blockchain Enthusiast, Entrepreneur