The world’s leading publication for data science, AI, and ML professionals.

Who designed this database?

A well designed database and up to date documentation is the ideal, but if that is not the case and you are not in the position to improve…

Photo by Sebastian Herrmann on Unsplash
Photo by Sebastian Herrmann on Unsplash

A well designed Database and up to date documentation is the ideal, but if that is not the case and you are not in the position to improve the database design, what are some workarounds?

Consider the following scenario

You’re working as a freelance data scientist and are hired to do some analysis of a company’s data. That seems simple enough, you know how to use SQL and Python, and how to conduct hypothesis tests, and you are told that you’ll be given a database schema.

The company has been offering discounts at different levels to some customers. As your first task, you’re asked to figure out if offering product discounts is useful for increasing the number of items in an order, and if so is it specific to a certain discount or discounts?

Take a look at the schema

The first thing you do is look at the database schema you’ve been provided and come up with a plan for gathering the data.

The provided database schema
The provided database schema

You see that product discounts are listed in the Order Details table. You already see an issue with this database design, a space in this table’s title which isn’t a good sign, but you can work around it by quoting the table name.

What does the Order Details table data look like?

You’ll be using Python’s sqlite3 package to query the database and plan to store the queried data in a Pandas Dataframe.

First you want to look at a few entries in the table.

import sqlite3
import pandas as pd
query = """SELECT * FROM 'Order Details' LIMIT 10"""
df = pd.DataFrame(cur.execute(query).fetchall(), columns = [description[0] for description in cur.description])
df.head(10)

Instead of data you receive this error:

OperationalError                          Traceback (most recent call last)
<ipython-input-5-8791be329a46> in <module>
----> 1 cur.execute(query).fetchall()

OperationalError: no such table: Order Details

Hmm. After double checking the schema to check for a spelling error, it’s clear that there is no table called Order Details.

If the schema is wrong, what are the table names?

Unfortunately, how to do this is dependent on what relational database implementation you are working with, but in our case, sqlite, you can use the following query.

query = """
        SELECT name FROM sqlite_master
        WHERE type='table'
        ORDER BY name;
        """
cur.execute(query).fetchall()

The returned table names are:

[('Category',),
 ('Customer',),
 ('CustomerCustomerDemo',),
 ('CustomerDemographic',),
 ('Employee',),
 ('EmployeeTerritory',),
 ('Order',),
 ('OrderDetail',),
 ('Product',),
 ('Region',),
 ('Shipper',),
 ('Supplier',),
 ('Territory',)]

WHAT. Not one table name matches the schema. At this point, you can’t trust the schema, but hopefully the relationships between each table still hold even if the table names don’t match.

Now that you know the correct table name you can try to retrieve some data again.

query = """SELECT * FROM 'OrderDetail' LIMIT 10"""
df = pd.DataFrame(cur.execute(query).fetchall(), columns = [description[0] for description in cur.description])
df.head(10)

This time you get some data!

Output from the OrderDetail table
Output from the OrderDetail table

You see that there are more inconsistencies between the schema and the current database implementation. There is an Id column not shown on the schema and the capitalization of the OrderId column and ProductId don’t match the schema. Although capitalization doesn’t matter in SQL, once you put that data into a Pandas dataframe, capitalization will matter. You’ll need to make note of that for doing future queries and analysis.

You’d now be in position to conduct your hypothesis test using this table which I won’t go into detail in this post, but what if you need data from some other tables?

What about the Order table?

You may have noticed above that there is a table called Order. If you are familiar with SQL, you know that ORDER is a reserved word. You should suspect that querying data from the Order table will be problematic, but how so?

As a simple example, you can try to query a few rows of data from that table.

query = """SELECT * FROM Order LIMIT 5"""
df = pd.DataFrame(cur.execute(query).fetchall(), columns = [description[0] for description in cur.description])
df.head()

As expected you receive an error

---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
<ipython-input-31-dc8e46beab2b> in <module>
      1 query = """SELECT * FROM Order LIMIT 5"""
      2 
----> 3 df = pd.DataFrame(cur.execute(query).fetchall(), columns = [description[0] for description in cur.description])
      4 df.head()

OperationalError: near "Order": syntax error

What can you do about this? How can SQL know that Order is a table name not a reserved word. Unfortunately the solution is database implementation dependent. I’ll show you how to deal with it in the case of sqlite. You will need to enclose the table name in backticks.

query = """SELECT * FROM `Order` LIMIT 5"""
df = pd.DataFrame(cur.execute(query).fetchall(), columns = [description[0] for description in cur.description])
df.head()

This time some data is retrieved.

Data from the Order table
Data from the Order table

If you reference the problematic schema from above you will notice that what was specified to be OrderID is actually called Id, and will need to be noted.

What if you need to reference a column of the Order table? You still use the backticks. Here’s an example of a more complicated query.

query = """
SELECT `Order`.Id AS OrderId, 
       SUM(OrderDetail.Quantity) AS TotalQty 
FROM `Order` JOIN OrderDetail 
ON `Order`.Id = OrderDetail.OrderId 
GROUP By `Order`.Id
"""

You can see that anywhere Order is used it needs to be enclosed by backticks. If you needed to use this as a subquery, renaming Order.Id to OrderId will help keep the rest of the query "clean".

Thank you for taking the time to read this post. I hope the workarounds I detailed will be helpful when you run into poorly designed and/or documented databases.


Related Articles