
Pandas is a Python library for data analysis and manipulation. SQL is a programming language used to communicate with databases; most relational database management systems (RDBMSs) use SQL to operate on the tables stored in a database. What they have in common is that both Pandas and SQL operate on tabular data (i.e., tables consisting of rows and columns).
Both Pandas and SQL are essential tools for data scientists and analysts. There are, of course, alternatives to both, but they are the predominant tools in the field.
Since both Pandas and SQL operate on tabular data, similar operations can be expressed in either one. In this post, we will compare Pandas and SQL with regard to typical operations in the data analysis process.
We will use the customer churn dataset that is available on Kaggle.
For Pandas, I will use Google Colab. After uploading the CSV file to the Colab environment, the dataset is read into a Pandas dataframe.
import pandas as pd
churn = pd.read_csv("/content/churn.csv")
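Right after reading the file, the shape attribute offers a quick sanity check that the load worked; it returns a (rows, columns) tuple:
churn.shape  # (number of rows, number of columns)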
For SQL, I created a MySQL database on Amazon RDS and used MySQL Workbench to connect to it. After establishing the connection, I created a table called "CHURN" and loaded the data from the CSV file into it. You can do the same by running the following SQL statements in MySQL Workbench.
CREATE TABLE CHURN (
RowNumber INT NOT NULL,
CustomerId INT NOT NULL,
Surname VARCHAR(20) NULL,
CreditScore INT NULL,
Geography VARCHAR(20) NULL,
Gender VARCHAR(20) NULL,
Age INT NULL,
Tenure INT NULL,
Balance DECIMAL(10,2) NULL,
NumOfProducts INT NULL,
HasCrCard INT NULL,
IsActiveMember INT NULL,
EstimatedSalary DECIMAL(10,2) NULL,
Exited INT NULL,
PRIMARY KEY(RowNumber)
);
LOAD DATA LOCAL INFILE "C:/Users/soner/Desktop/SQL/churn.csv" INTO TABLE CHURN
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n' IGNORE 1 LINES;
Note: If you use the latest version (8.x) of MySQL Workbench, the LOAD DATA LOCAL INFILE statement above will not work out of the box, because MySQL 8 disables the local_infile option by default; it has to be enabled on both the server and the client. I use version 6.
We now have a churn dataframe and a CHURN table. Let’s start on doing basic operations that are commonly done in the data analysis process.
Getting an overview of the data
"Overview" is a little vague here; what I mean is a list of the columns and their data types.
The DESC statement followed by the table name will do the job in SQL.
DESC CHURN;

For Pandas, "dtypes" method will return the data type for each column. We can also use the "columns" method just to see the list of columns.
churn.dtypes
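If you want the column list, the dtypes, and the non-null counts in a single call, the info method (a standard Pandas alternative, not used in the rest of this post) summarizes all of them:
churn.columns  # just the list of column names
churn.info()   # column names, dtypes, non-null counts, and memory usage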

Displaying the first 5 rows
This can also be considered part of getting an overview of the data. The number 5 is just a common convention; you can change it.
On the SQL side, we will use the SELECT statement.
SELECT * FROM CHURN LIMIT 5;

We selected the first 5 rows (LIMIT 5) of all columns (*).
Not every DBMS uses the same syntax; we are using MySQL here. SQL Server, for instance, uses a SELECT TOP statement instead of LIMIT.
Pandas offers a simpler syntax for this operation.
churn.head(5) # The default is 5 so churn.head() also works
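Pandas also provides the complementary tail method, which returns the last n rows of the dataframe:
churn.tail()  # last 5 rows by default; churn.tail(10) returns the last 10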

Selecting particular column(s)
For SQL, we just need to replace "*" with the column names.
SELECT CustomerId, Geography FROM CHURN;

In Pandas, it is done as follows.
churn[['CustomerId', 'Geography']]
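One subtlety worth noting on the Pandas side: single brackets return a Series, while double brackets return a dataframe, even for a single column:
churn['Geography']    # a pandas Series
churn[['Geography']]  # a one-column dataframe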

Selecting rows that fit a condition
When analyzing data, we often need to take conditions into account. For instance, in our dataset, the task might be comparing countries based on some measure, so it is important to be able to select data points conditionally.
We can use the WHERE clause in SQL. For instance, the following query returns the selected columns for rows where the value in the Geography column is France.
SELECT CustomerId, Geography FROM CHURN
WHERE Geography = 'France'
LIMIT 5;

For Pandas, the dataframe is filtered based on the condition and then the desired columns and rows are selected.
churn[churn.Geography == 'France'][['CustomerId','Geography']][:5]
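The chained selection above works, but a single loc call that combines the row condition and the column list is the more idiomatic Pandas way of writing the same query:
churn.loc[churn.Geography == 'France', ['CustomerId', 'Geography']].head()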

Unique values in columns
It is important to analyze the categorical columns as well as the numerical ones. One key characteristic of a categorical column is the set of unique categories it contains.
Let’s see which countries exist in the Geography column.
The SELECT DISTINCT statement returns the distinct values in a column.
SELECT DISTINCT Geography FROM CHURN;

The same operation can be done with the unique method in Pandas.
churn.Geography.unique()
array(['France', 'Spain', 'Germany'], dtype=object)
The number of unique values in columns
In the Geography column, there are only 3 different countries, so we can count them visually. In some cases, though, the number of unique categories is high and we need a way to count them.
For SQL, we can combine COUNT with DISTINCT as follows:
SELECT COUNT(DISTINCT Geography) FROM CHURN;
3
For Pandas, the nunique method will do the job.
churn.Geography.nunique()
3
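As a quick extension, calling nunique on the entire dataframe reports the number of unique values in every column at once:
churn.nunique()  # unique-value count for each column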
The number of rows in each category
We have covered how to check unique categories and the number of unique categories. Taking it one step further, it might be more useful to see how many data points (i.e. rows) each category has.
For SQL, it can be done using COUNT and GROUP BY statements as in the following query:
SELECT Geography, COUNT(Geography) AS 'count'
FROM CHURN
GROUP BY Geography;

We selected two columns: Geography, and the count of rows for each Geography value. Without the GROUP BY clause at the end, the query would return a single row containing the total number of rows in the table.
The same task can be done with the value_counts method in Pandas.
churn.Geography.value_counts()
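value_counts also takes a normalize parameter if you want each category's share of the total rather than raw counts:
churn.Geography.value_counts(normalize=True)  # proportions instead of counts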

Groupby and aggregate functions
Both SQL and Pandas allow applying aggregate functions to grouped data, which is essential for efficient data analysis.
For instance, we can check the average age of customers in each country. The following SQL query accomplishes this and also sorts the results by average age.
SELECT Geography, AVG(Age) AS 'Average Age'
FROM CHURN
GROUP BY Geography
ORDER BY AVG(Age) DESC;

Pandas follows the same idea, just with a different syntax. Note that the chained calls need to be wrapped in parentheses (or written on one line) to be valid Python.
(churn[['Geography', 'Age']]
 .groupby('Geography')
 .mean()
 .sort_values(by='Age', ascending=False))
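If you need several aggregations at once (the SQL equivalent would be selecting multiple aggregate functions in the same query), groupby combines with agg; the named-aggregation syntax below requires pandas 0.25 or later:
churn.groupby('Geography').agg(
    avg_age=('Age', 'mean'),                # AVG(Age)
    customer_count=('CustomerId', 'count')  # COUNT(CustomerId)
)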

Conclusion
We have covered only a small portion of what can be done with Pandas or SQL, but these operations are essential to a typical data analysis process. I consider them the basics one needs to master before learning more complex operations.
As you have seen in the examples, the logic behind Pandas and SQL is quite similar. Once you are familiar with one of them, learning the other is fairly easy; it mostly becomes a matter of syntax.
If you are working, or plan to work, in the field of data science, I strongly recommend learning both Pandas and SQL.
Thank you for reading. Please let me know if you have any feedback.