Pandas vs. SQL — Part 4: Pandas Is More Convenient

Aditya Parameswaran
Towards Data Science
8 min read · Nov 24, 2022


Photo by gamene at Openverse (CC BY 2.0)

tl;dr: In this post, we compare Pandas vs. SQL on the third of three axes: convenience. We describe six ways in which the Pandas dataframe data model is more convenient for data science and machine learning use cases.

In this fourth offering of our epic battle between Pandas vs. SQL, we illustrate how Pandas is more convenient than SQL for data science and machine learning. Pandas was designed by data scientists for data scientists, and benefited from thousands of improvements contributed back enthusiastically by the open-source data science community — all with an eye towards greater utility and ease-of-use. So it’s no surprise that it’s a good fit!

Before we begin, if you missed our previous Pandas vs. SQL offerings, you can still catch up here: Part 1: The Food Court and the Michelin-Style Restaurant, Part 2: Pandas Is More Concise, and Part 3: Pandas Is More Flexible. Our previous posts focused on comparisons of the dataframe data model and dataframe algebra — in this post, we focus on the dataframe ergonomics: specifically, how dataframes are used.

For easy lookup, here is a handy list of the multiple ways Pandas dataframes are more convenient than their relational/SQL counterparts:

  1. In Pandas, you can incrementally construct queries as you go along; in SQL, you cannot.
  2. In Pandas, operating on and naming intermediate results is easy; in SQL it is harder.
  3. In Pandas, it is easy to get a quick sense of the data; in SQL it is much harder.
  4. Pandas has native support for visualization; SQL does not.
  5. Pandas makes it easy to do machine learning; SQL does not.
  6. Pandas preserves order to help users verify correctness of intermediate steps — and allows users to operate on order; SQL does not.

For this post, we’ll use a dataset from Kaggle’s 5-day data cleaning challenge: a dataset of building permits in San Francisco. The data comes from the San Francisco government data portal, which makes it available under the Public Domain Dedication and License v1.0.

1. In Pandas, you can incrementally construct queries as you go along; in SQL, you cannot.

An important distinction between Pandas and SQL is that Pandas allows users to incrementally layer operations on top of others to construct more complicated queries. At the same time, users can inspect intermediate results of these query fragments — in an effort to verify correctness as they go along. Debugging is a breeze with Pandas!

So in our dataset, say we want to focus on permits corresponding to Geary Street. We can extract that subset of the dataset as follows:
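A minimal sketch of this step (assuming the Kaggle CSV has been loaded into a dataframe named 'sf_permits', and that street names appear without suffixes in a 'Street Name' column):

```python
import pandas as pd

# Load the building permits dataset; the filename follows the Kaggle challenge.
sf_permits = pd.read_csv('Building_Permits.csv')

# Keep only the permits whose street name is Geary.
sf_permits[sf_permits['Street Name'] == 'Geary']
```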

One thing we might notice is that Geary spans many neighborhoods, encoded here in the column 'Neighborhoods - Analysis Boundaries'. Suppose we want to examine only this column (and drop the remaining 42 columns); we can simply append the clause [['Neighborhoods - Analysis Boundaries']] to the end of the previous expression:
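Continuing the sketch from above (the double brackets keep the result a dataframe rather than a series):

```python
# The same filter as before, now keeping only the neighborhood column.
sf_permits[sf_permits['Street Name'] == 'Geary'][['Neighborhoods - Analysis Boundaries']]
```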

This is a lot of rows: 1,966. Then, as our last step, say we want to identify the neighborhoods on Geary with the most permits. One way to do so is by appending a 'value_counts', which tallies the permits per neighborhood and, by default, sorts the tallies in descending order.
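Roughly, that appended step might look like this, again building on the earlier sketch:

```python
# Tally permits per neighborhood; value_counts sorts the counts
# in descending order by default.
(sf_permits[sf_permits['Street Name'] == 'Geary']
    ['Neighborhoods - Analysis Boundaries']
    .value_counts())
```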

Interesting, so the top neighborhood is the Tenderloin, followed by Outer Richmond. Note that while this sequence of operations can certainly be expressed in SQL, it would be much more painful. We can't simply append operators to the end of a SQL query: there are specific locations within the query where we would need to make changes. For example, to change which columns are displayed, we would need to modify the SELECT clause near the start of the query. Pandas instead allows you to think operationally (or imperatively) and construct your final result step by step, examining intermediate results along the way.

2. In Pandas, operating on and naming intermediate results is easy; in SQL it is harder.

Since Pandas is embedded in a real programming language, Python, it borrows many of the familiar programmatic idioms for operating on dataframes. In particular, we can assign a dataframe expression to a variable; these variables can then be operated on and/or assigned to other variables.

We’ll take a simple example to illustrate. Since this dataset is from a data cleaning challenge, say we suspect that there might be many null values. We can check how many there are per column, using the following:
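One way to do this, as a sketch (reusing the 'sf_permits' dataframe from earlier):

```python
# Count the null values in each column of the dataframe.
missing_values_count = sf_permits.isnull().sum()
missing_values_count
```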

That’s a lot of null values! Suppose I want to create a cleaned version of my dataset, dropping columns with too many null values, with the threshold set at 190,000 non-null values. (The overall dataset has about 199,000 rows.)
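A sketch using 'dropna', whose 'thresh' parameter expresses exactly this kind of threshold:

```python
# Drop every column with fewer than 190,000 non-null values.
sf_permits_cleaned = sf_permits.dropna(axis='columns', thresh=190000)
sf_permits_cleaned
```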

Wow — the number of columns drops from 43 to just 13. As we saw here, we were able to easily define a new variable 'sf_permits_cleaned' (just like we created the earlier variable 'missing_values_count') using standard programmatic variable assignment, and subsequently operate on it. This approach is natural for programmers. In SQL, one can accomplish a similar effect via views, but defining views and operating on them is less intuitive and more cumbersome.

3. In Pandas, it is easy to get a quick sense of the data; in SQL it is much harder.

Pandas offers quick ways to understand the data and metadata of a dataframe. We've already seen examples of this: printing a dataframe by simply using its variable name, or calling 'head()'/'tail()'. For convenience, so the output fits on a screen, some rows and columns are elided with '...' while still giving users a high-level picture of the data.

If we want to inspect a summary of the columns and their types, one convenient function offered by Pandas is 'info()', which lists the columns of the dataset, their types, and the number of non-null values in each. We can use this function to inspect the dataframe we just created.
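Concretely:

```python
# Lists each column's name, non-null count, and dtype.
sf_permits_cleaned.info()
```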

So it looks like the only column still containing null values is the 'Description' column; all other columns are fully filled in.

Another useful Pandas function, targeted at numerical columns, is 'describe()', which provides a convenient summary of these columns, with counts, means, standard deviations, and quantiles.
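For instance:

```python
# Count, mean, standard deviation, min/max, and quartiles per numeric column.
sf_permits_cleaned.describe()
```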

Hmm, so there appears to be a street number 0. Curious!

Unfortunately, SQL offers no similar conveniences for understanding the shape and characteristics of one’s dataset — you’d have to write custom queries for this purpose. For the previous example, the length of such a query would be proportional to the number of numeric columns.

4. Pandas has native support for visualization; SQL does not.

Analysis of tables of numbers will only get you so far. Often what you need are visual ways of making sense of the information in dataframes. Unlike SQL, which requires you to load your data into a separate visualization or BI (Business Intelligence) tool, Pandas offers built-in visualization support right within the library. For example, I can simply call 'plot()' to see a bar chart of the 'Current Status' of various permits.
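One way this might look (tallying the statuses first, then plotting; Pandas' 'plot' renders via Matplotlib under the hood):

```python
# Tally permits per status and render the tallies as a bar chart.
sf_permits_cleaned['Current Status'].value_counts().plot(kind='bar')
```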

It looks like the vast majority of permits are in the completed, issued, and filed categories, with a small number in other categories.

The power of this feature is obvious: unlike with SQL databases, you don’t need to leave the library if you want to generate visualizations — you can do it right there! If you’d like to power up your visualization experience, there are any number of visualization libraries that integrate tightly with Pandas, including Matplotlib, seaborn, and Altair. And if you’re lazy, like me, and don’t wish to write any code at all to generate visualizations, you can use Lux, our Pandas-native visualization recommendation library, to generate visualizations for you automatically, all tuned to your dataset. Read more about Lux here.

5. Pandas makes it easy to do machine learning; SQL does not.

Machine learning is a key component of data science, enabling users not just to make sense of unstructured data such as images, video, and text, but also to make predictions about the future. Since Pandas is tightly integrated into the data science ecosystem, it comes as no surprise that it works well with machine learning libraries, including common ones like scikit-learn, PyTorch, and NumPy. Here, we’ll use spaCy, a relatively new natural language processing library, to make sense of a text column in our dataset. spaCy offers various pretrained models for word embeddings, named entity recognition, part-of-speech tagging, classification, and more. To install spaCy, we run the following commands:
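Something like the following, run from the command line (the second command downloads the 'en_core_web_md' model used below):

```bash
pip install spacy
python -m spacy download en_core_web_md
```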

Now that we've installed it, suppose we want to understand the types of activities (e.g., demolition, removal, replacement) involved in each permit application (i.e., each row) in our dataset. This is hard to see upfront, as it is buried within the free-text field 'Description'. Let's use the package to extract a list of the verbs mentioned in this field: we first load spaCy's 'en_core_web_md' model, then use it to extract each verb from the tokenization of each description, storing the verbs in a list, as follows.
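Here is a sketch of that step. The helper function and the 'Verbs' column name are our own choices for illustration, and note that running a spaCy pipeline over roughly 199,000 descriptions takes a while:

```python
import pandas as pd
import spacy

# Load the pretrained English pipeline downloaded above.
nlp = spacy.load('en_core_web_md')

def extract_verbs(description):
    # The 'Description' column still contains nulls, so guard against them.
    if pd.isnull(description):
        return []
    # Run the pipeline and keep the text of every token tagged as a verb.
    return [token.text for token in nlp(description) if token.pos_ == 'VERB']

# Store the list of verbs for each permit in a new column
# (the column name 'Verbs' is our choice, for illustration).
sf_permits_cleaned['Verbs'] = sf_permits_cleaned['Description'].apply(extract_verbs)
```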

So, as we can see above, the model does a reasonable job of extracting verbs, even though it does miss a few (e.g., install). With the increasing availability of large pretrained models (e.g., transformer models), I expect even greater integration of such models into day-to-day data processing within pandas.

Integrating machine learning into SQL databases is extraordinarily difficult. While some databases offer machine-learning-specific constructs (e.g., BigQuery ML), users are limited in what they can accomplish and do not have fine-grained control. Another, kludgier approach is to use UDFs (user-defined functions) for machine learning. Often, users simply end up exporting their data out of the database to perform machine learning.

6. Pandas preserves order to help users verify correctness of intermediate steps — and allows users to operate on order; SQL does not.

Pandas preserves order. This is important for debugging and validation as one constructs more complicated query expressions. Continuing with my example fresh after the spaCy extraction of verbs, say I want to use the 'explode' function to expand the individual verbs in the previous dataframe into multiple rows, one per verb; I can do it simply as follows.
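Assuming the 'Verbs' column from the sketch above:

```python
# One output row per verb; exploded rows keep their original index values,
# which makes it easy to line the result up with the input.
sf_permits_cleaned.explode('Verbs')
```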

Notice that I now have three rows corresponding to the original row 1, one for each of the extracted verbs. This preservation of order makes it easy to verify the correctness of this step. With a SQL database, this would be much harder: since order is not guaranteed, one would need to scan the entire output to see where a given row ended up (or instead add an ORDER BY clause to enforce a specific output order).

Conclusion

In this post, we covered various ways in which Pandas is more convenient than SQL from an end-user perspective. This includes the ease of constructing Pandas queries correctly, via preservation of order, incremental composition, naming, manipulation, and inspection along the way. It also includes integration with other data science and data analysis needs, such as visualization and machine learning: Pandas not only lets users visualize and perform predictive modeling entirely within the library, but also provides the hooks to connect its outputs to other popular visualization and machine learning libraries and packages, especially within the PyData ecosystem. Ultimately, Pandas sits within a full-fledged programming language, Python, and inherits all of its constituent power.

If you can think of other examples where Pandas is more convenient than SQL, or vice versa, we’d love to hear them! Feel free to respond to our tweet, and follow us on Twitter or LinkedIn for more Pandas / Python / data science content!



Associate Prof @ UC Berkeley and Cofounder/President @ Ponder; simplifying data analytics