Data Prep with Spark DataFrames

Using PySpark to continue investigating the Financial Services Consumer Complaint Database

The serenity you’d hope to have while filing a complaint with the Consumer Financial Protection Bureau — Photo by Stephen Walker on Unsplash

As we saw in last week’s blog, the big three credit reporting agencies are among the most complained about companies in the US Federal Financial Services Consumer Complaint Database. It’s interesting to think about everything your credit score is used for: getting a loan, renting an apartment, getting a cell phone plan. Scratch that. It’s more frightening than interesting. So today we are going to dig further into the Complaint database to answer several questions that have been on my mind:

  • How does the number of complaints received relate to the day of the week?
  • How has the number of complaints changed over the years?
  • How has the number of complaints varied by month?
  • Residents of which states complain the most?
  • Do people from different states submit complaints at different rates?
  • Which products get the most complaints?

Once you have your data in a Spark DataFrame (if not, check out last week’s post), you’re ready to do some exploration and cleaning. The PySpark DataFrame, PySpark Column and PySpark Functions documentation will definitely be your friends as you work in your own context (Ross, Monica, and Chandler, respectively…sorry Joey, I’m still not sure where your place in the world of Data Science lies). For my project on the US Federal Financial Services Consumer Complaint Database, I’ll be counting, filling, and dropping NaNs, converting date columns to datetime and extracting date features, and checking values against a list of acceptable values.

More accessible output formatting when using Jupyter Notebooks

Before getting started, I recommend that you set the Spark configuration like so:

spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

This setting makes the output look more like pandas and less like command-line SQL, and once it is set you no longer need to call show() to see the output. Alternatively, you can use .toPandas() or .toPandas().T (for transpose) to get pandas-style output. Just remember to only do this on DataFrames small enough to fit in memory; it’s easy to crash your kernel with an oversized pandas DataFrame.
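
For a quick peek at a few rows without flooding the notebook, something like this works (a minimal sketch; df is the complaints DataFrame from last week’s post):

# preview five rows; transposing puts the long column names down the side
df.limit(5).toPandas().T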

Counting NaNs and Nulls

Note that in PySpark NaN is not the same as Null. Both of these are also different from an empty string “”, so you may want to check for each of them, on top of any dataset-specific filler values.

Check for NaNs like this:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns])

You can see here that this formatting is definitely easier to read than the standard output, which does not do well with long column titles, but it does still require scrolling right to see the remaining columns.

This data set has no NaNs, so I move on to Null.

You can count your Null values using the following code:

from pyspark.sql.functions import when, count, col

df.select([count(when(col(c).isNull(), c)).alias(c)
           for c in df.columns]).toPandas().T
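
Empty strings won’t show up in either of the checks above. This data set didn’t need it, but if you suspect empty strings in your own data, a sketch along the same lines (restricted to string-typed columns) might look like this:

from pyspark.sql.functions import when, count, col

# count empty strings per string-typed column
string_cols = [f.name for f in df.schema.fields
               if f.dataType.simpleString() == 'string']
df.select([count(when(col(c) == '', c)).alias(c)
           for c in string_cols]).toPandas().T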

Check for Duplicates

To check for duplicates, I compared df.count() to df.distinct().count(). In this case I had none.
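
In code, that comparison is simply:

# if these two counts differ, the DataFrame contains duplicate rows
print(df.count(), df.distinct().count())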

Dealing with Null values

Next, I decided to drop the single row with a null value in company_response_to_consumer. As you can see, the syntax is very similar to pandas.

df_clean = df.dropna(subset='company_response_to_consumer')

For the consumer_disputed column, I decided to replace null values with No, while adding a flag column for this change:

# add flag column
df_clean = df_clean.withColumn('null_c_disputed',
                               df_clean['consumer_disputed?'].isNull())

# fill na in consumer_disputed? with 'No'
df_clean = df_clean.fillna('No', subset='consumer_disputed?')

Again, this is very similar to pandas, with the exception of the new syntax for adding columns (withColumn). You’ll see an alternate way to add new columns in the next section.

Converting strings to date(time) and generating date-related features

from pyspark.sql.functions import to_date, datediff, date_format, month

# add datetime columns
df_clean = df_clean.select('*', to_date(df_clean['date_received'],
                                        'MM/dd/yyyy').alias('date_received_dt'))
df_clean = df_clean.select('*', to_date(df_clean['date_sent_to_company'],
                                        'MM/dd/yyyy').alias('date_sent_to_company_dt'))

# drop string date columns
df_clean = df_clean.drop('date_received') \
                   .drop('date_sent_to_company')

Now I’m ready to add my new features:

# add time difference between receipt and sent to company
df_clean = df_clean.withColumn('transit_time',
                               datediff(df_clean['date_sent_to_company_dt'],
                                        df_clean['date_received_dt']))

# add submitted day of week (1=Monday, 7=Sunday)
df_clean = df_clean.withColumn('dow_submitted',
                               date_format('date_received_dt', 'u'))

# add submitted month, year
df_clean = df_clean.withColumn('y_submitted',
                               date_format('date_received_dt', 'y')) \
                   .withColumn('m_submitted',
                               month('date_received_dt'))

This allowed me to investigate the temporal aspects of the data set.
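
The plots below come from simple aggregations over the new date features; here is a sketch of the day-of-week count (the bar chart itself is plain pandas plotting, assuming matplotlib is available in the notebook):

# complaints per day of week (1=Monday, 7=Sunday); small enough to collect
dow_counts = df_clean.groupby('dow_submitted').count() \
                     .toPandas().sort_values('dow_submitted')
dow_counts.plot.bar(x='dow_submitted', y='count', legend=False)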

It’s interesting that people mostly submit complaints Monday through Friday, even though there is an online form. I wonder how this breaks down by submission type.

There does seem to be potential for a cyclical pattern here, but it’s hard to say from this graph alone. Another area for future investigation.

There is a clear upward trend in complaints, from under 75k in 2012 to 275k in 2019.

Comparing to a list of acceptable values

I cleaned up my state column using a list of state abbreviations, replacing all non-standard responses with a single value “unknown.”
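
The states list referenced below is just the set of USPS two-letter abbreviations; one way to define it (a sketch, with DC included since it appears later in the analysis):

# two-letter USPS abbreviations for the 50 states, plus DC
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL',
          'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME',
          'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH',
          'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI',
          'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']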

# add clean state column, drop original column, rename new column
df_clean = df_clean.withColumn('state_c',
                               when(col('state').isin(states), col('state'))
                               .otherwise('unknown')) \
                   .drop('state') \
                   .withColumnRenamed('state_c', 'state')

I ended up with 26k rows with ‘unknown’ states — a prime reason to restrict inputs, but at least this is now contained, and not a whole bunch of random mistyped state values.

With this, I grouped by state and converted my now 50-row, 2-column dataframe to pandas. Then I joined this dataframe with my state-based data (thank you SimpleMaps CC4.0), and graphed using geopandas (and shapefiles from census.gov).

state_counts = df_clean.groupby("state").count().toPandas()
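
A hedged sketch of the join-and-plot step follows; the file paths and the column names (STUSPS in the census shapefile, state and population in the SimpleMaps table) are assumptions that may need adjusting for your copies of the files:

import geopandas as gpd
import pandas as pd

# state boundaries from census.gov (path and filename are illustrative)
gdf = gpd.read_file('data/cb_2018_us_state_20m.shp')

# SimpleMaps state table with population (path and column names assumed)
pop = pd.read_csv('data/simplemaps_states.csv')  # columns: state, population

# attach complaint counts and population to the state geometries
gdf = gdf.merge(state_counts, left_on='STUSPS', right_on='state') \
         .merge(pop, on='state')

# complaints per 100 residents, plotted for the contiguous US only
gdf['per_100'] = gdf['count'] / gdf['population'] * 100
contiguous = gdf[~gdf['STUSPS'].isin(['AK', 'HI'])]
contiguous.plot(column='per_100', legend=True, figsize=(12, 6))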

In GeoPandas, I found that Alaska’s most western islands did not play nice with my visualizations, and made the choice to only plot the Contiguous United States for today — more on how to separate and manipulate Shapely Multipolygons another day.

This is still not a very attractive choice of projection (forcing a globe onto a flat screen), but we are getting the gist of where the majority of the complaints are coming from — the most populated states. I wanted to even the playing field and account for population size:

The choropleth auto-scales the coloring, so we must have a small state with a disproportionate number of complaint submissions. A little digging shows that Washington, DC, leads the way with more than 1.1 complaints per 100 residents.

Products by Complaint Count

Lastly, I performed another groupby().count() to see which products have been the biggest cause for concern.
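
A sketch of that aggregation, sorted so the most-complained-about products come first (assuming the column is named product, in line with the snake_case names above):

# complaint counts per product, largest first
df_clean.groupby('product').count().sort('count', ascending=False)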

We find Credit reporting twice among the top four most problematic products. This matches what we discovered in last week’s blog, where the three big credit reporting agencies, Equifax, Experian, and TransUnion, were the top three financial companies in terms of complaints.


As always, you can check out the details on the GitHub repo. Happy coding!
