Hands-on Tutorials, Data Engineering, Data Cleansing, SQL, and Tableau

"Garbage in, garbage out." – Anonymous
Introduction
Many data engineering, data analytics, and data science project lifecycles include a data cleansing process. The primary purpose of cleansing data is to detect and correct corrupt, inaccurate, or incomplete data records. In addition, effective cleansing helps ensure that data analytics and data science products convey the needed information completely and accurately.
In a recent position at a large enterprise, I developed an employee travel analytics dashboard in Tableau. I used SQL to acquire the data from the IBM DB2 database of a trip request and management system.
Each trip request included the destination city, the state or province (if applicable), and the country, among many other data elements. Unfortunately, the trip request web form did not perform any edit checks on these fields, so trip requesters misspelled many place names and filled the application's database records with incorrect geographic locations.
Since the travel analytics dashboard was updated weekly, I needed a straightforward, low-effort process to clean newly introduced geographic data errors each week. The method I developed used Tableau to identify geographic place name errors, Google and Wikipedia to determine and confirm the correct values, and SQL to retrieve and clean the data.
This article describes a revised version of the method I used to clean the geographic data. It presents the solution with a small-scale project, including a sample database table, SQL code, and a Tableau data visualization project.
Sample Data Cleansing Project Overview
The sample geographic place-name data cleansing project involves these steps and components:
- Use a SQL SELECT statement to retrieve US city population data from a database table. I used Microsoft SQL Server for this sample project, but the technique works with any relational database management system, such as Oracle, MySQL, or PostgreSQL.
- Load the city population data retrieved with SQL into a Tableau worksheet.
- Create a map of cities and their populations. Tableau will automatically detect and report on invalid city names or state codes. In other words, Tableau will identify city names and state codes that do not exist in its internal geographic place name database.
- For invalid place names, use Google search to find the correct spelling of the location. Also, use Wikipedia, if necessary, to confirm the city name, state code, and population value combination.
- Add or modify CASE statements in the SQL statement described in step 1 above to correct misspelled city names and state codes.
- Reload the city population data retrieved with the modified SQL statement into the Tableau worksheet.
- Review the city population map. If inaccuracies remain, repeat steps 4 through 7 until the SQL statement has cleaned all dirty data.
"In God we trust, all others bring data." – W. Edwards Deming
Sample City Population Database Table
To demonstrate the data cleansing technique, I created a table called City_Population with these columns:
- City – The name of the city
- State_Code – The two-character code for the state
- Population – The 2020 US Census population or the latest post-2010 estimated population recorded in Wikipedia

I then populated the table with seventeen rows, rife with data errors, as shown below.
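The full seventeen-row table isn't reproduced here, but it can be sketched as follows. This snippet uses Python's sqlite3 module as a lightweight stand-in for the SQL Server table; the specific rows and population figures are illustrative placeholders built from the misspellings discussed later in the article (Aims, Amerillo, Charlestowne, and the bad "MP" state code), not the article's actual data.

```python
import sqlite3

# In-memory stand-in for the SQL Server City_Population table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE City_Population (
        City        TEXT,     -- the name of the city
        State_Code  TEXT,     -- the two-character state code
        Population  INTEGER   -- 2020 Census or latest estimated population
    )
""")

# A handful of rows "rife with data errors" (the article's table has
# seventeen); the population figures here are illustrative.
dirty_rows = [
    ("Aims",         "IA",  66000),   # misspelling of Ames
    ("Amerillo",     "TX", 200000),   # misspelling of Amarillo
    ("Charlestowne", "SC", 150277),   # misspelling of Charleston
    ("Minneapolis",  "MP", 429954),   # invalid state code (should be MN)
    ("Denver",       "CO", 715522),   # clean row
]
conn.executemany("INSERT INTO City_Population VALUES (?, ?, ?)", dirty_rows)
conn.commit()
```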

Base SQL SELECT Query
The SELECT query shown below retrieves the City, State_Code, and Population values for all rows in the City_Population table. Note that square brackets have been placed around Population since it is a keyword in SQL Server.
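As a runnable stand-in for SQL Server, the base query can be sketched with Python's sqlite3 module. Conveniently, SQLite accepts the same square-bracket identifier quoting that SQL Server uses, so `[Population]` works in both. The single sample row here is illustrative.

```python
import sqlite3

# Minimal illustrative table (see the article for the full seventeen rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City_Population (City TEXT, State_Code TEXT, Population INTEGER)")
conn.execute("INSERT INTO City_Population VALUES ('Aims', 'IA', 66000)")

# Base query: square brackets around Population because it is a
# keyword in SQL Server; SQLite accepts this quoting style as well.
base_query = """
    SELECT City,
           State_Code,
           [Population]
    FROM   City_Population
    ORDER BY City
"""
for city, state, population in conn.execute(base_query):
    print(city, state, population)
```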
Loading Data into Tableau
In my previous position, I called SQL queries directly from Tableau Desktop to load data from SQL Server and other databases into Tableau projects. For this project, I am using Tableau Public, which does not support custom SQL queries. Tableau Public does support various other data sources, and I chose Excel for this project. Here are the general steps to load the data into Excel and, in turn, into Tableau Public:
- Run the SELECT query in SQL Server Management Studio (SSMS) if you use SQL Server; otherwise, use the query tool of your choice for your database.
- Copy and paste the data returned from the query into an Excel worksheet. Save the Excel workbook.
- Open Tableau Public (or Tableau Desktop) and add the Excel file as a data source.
Follow the steps shown below to copy data from the SSMS output window.

Following is the Excel spreadsheet that includes the data pasted from SSMS. Note that the worksheet is called city_population_dirty, and the workbook is called city_population_dirty.xlsx. I loaded this worksheet into Tableau. Later, I will load a "clean" worksheet into Tableau.

To load the data into Tableau, open Tableau Public or Tableau Desktop. Next, under the Connect header, click on [Microsoft Excel]. When prompted to open a file, navigate to city_population_dirty.xlsx and click [Open].

The dirty city population data should load into Tableau as shown below.

Reviewing Dirty Data in Tableau
After loading the data into Tableau, I created a map in a worksheet called "Population Dirty." Note the message "10 unknown" in the bottom-right corner of the screen. This message indicates that Tableau could not map 10 of the place names included in the data.

To identify the dirty data, click on "10 unknown." Tableau will display the Edit Locations dialog box like the one shown below.
Note the unrecognized city names, such as Aims and Amerillo. I could have replaced these values with values from the dropdown box (located under "Unrecognized"). While doing so would clean the data for this Tableau dashboard, the cleaned data would not be available for other Tableau projects or other uses. So instead, for the project described in this article, I decided to clean the data in SQL, making the cleaned data available for all downstream uses, including Tableau.

Finding the Correct City Names in Google
Google search uses autocomplete technology, much like Microsoft's IntelliSense in its Visual Studio tools, to suggest possible results as the user enters text. In addition, Google corrects spelling and appears to use a "sounds like" algorithm to identify search terms.
For the unrecognized city of "Aims," I reviewed the data in the Excel file and noted that the state code for the record with the dirty value is "IA." Thus, when I search for "Aims ia," Google returns "Ames, IA" and "City in Iowa" as its first result. Confident that "Ames" is the correct spelling of "Aims," I continue to the next misspelled city.

If I am unsure that Google has returned the correct city, I also look up the place name suggested by Google in Wikipedia. When I search Google for "Charlestowne SC," for example, it returns "charleston sc" and "Charleston – A city in South Carolina." To verify, I look up the city in Wikipedia and compare its population with the population in the Excel spreadsheet. The value is 150,277 in both sources.

Cleaning Data in SQL
To clean place names in SQL, I use CASE statements within the SELECT query. For example, the query shown below corrects ten city names. It also updates the state code for Minnesota from "MP" to "MN."
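A sketch of this CASE-statement approach, again using Python's sqlite3 module as a stand-in for SQL Server: each known misspelling gets its own CASE branch. Only a few of the article's ten city-name corrections are shown, and the table rows here are illustrative.

```python
import sqlite3

# Illustrative dirty rows (a subset of the article's table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City_Population (City TEXT, State_Code TEXT, Population INTEGER)")
conn.executemany(
    "INSERT INTO City_Population VALUES (?, ?, ?)",
    [("Aims", "IA", 66000), ("Charlestowne", "SC", 150277), ("Minneapolis", "MP", 429954)],
)

# Cleansing query: one CASE branch per known misspelling, plus a
# branch correcting the invalid "MP" state code to "MN".
clean_query = """
    SELECT CASE City
               WHEN 'Aims'         THEN 'Ames'
               WHEN 'Amerillo'     THEN 'Amarillo'
               WHEN 'Charlestowne' THEN 'Charleston'
               ELSE City
           END AS City,
           CASE State_Code
               WHEN 'MP' THEN 'MN'
               ELSE State_Code
           END AS State_Code,
           [Population]
    FROM   City_Population
    ORDER BY City
"""
for row in conn.execute(clean_query):
    print(row)
```

The ELSE branches are what make the query safe to rerun each week: rows that are already clean pass through unchanged, and new misspellings only require adding another WHEN line.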
To prepare the data from this query for Tableau, I run it and copy its output data to an Excel worksheet called "city_population_clean" in a workbook called "city_population_clean.xlsx," as shown below.

Loading and Viewing the Clean City Population Data in Tableau
Next, I load the clean dataset into Tableau and build a map worksheet based on the clean data. The process is identical to the one described above for the dirty data.
Note that the worksheet no longer displays a message like "10 unknown" in the bottom-right corner. That’s because Tableau recognized all place names from the input file and placed each of the cities on the map.

The final step is to save the "clean" worksheet of the Tableau project to the Tableau Public server. You can view it here.

Pros and Cons of this Data Cleansing Approach
While the data cleansing method presented here successfully cleaned the place names in the city population table, it has advantages and disadvantages compared with other methods.
Pros
- The process is simple.
- Dirty data can be quickly detected and corrected.
- This approach is practical when end-users or software infrequently introduce dirty data into the database table (few changes needed to SQL query).
Cons
- It requires SQL coding skills, since the query may need to change with each new data set.
- It is prone to error, since the code must change whenever new dirty data is detected; even a good coder can make a mistake.
- The SQL code may become unwieldy. Eventually, the query may contain hundreds of lines, making it difficult to maintain.
Alternative Cleansing Methods
While the data cleansing process described above offers a practical solution, you might consider other solutions. For example, in an ideal system, data errors would be detected and corrected by the web form upon data entry by the employees who request the trips. Such an approach would save time and money by eliminating most geographic data cleansing needs.
Another data cleansing method would involve a table that maps dirty place names to clean place names. The SQL query would substitute clean place names for dirty values retrieved from the city population table.
Conclusion
The geographic data cleansing method described here proved effective in maintaining the travel analytics dashboard. Even with hundreds of new travel records each week, the entire process to acquire, clean, and publish the data took less than an hour.
I hope you found this information about data cleansing helpful. For me, data cleansing is one of the most challenging and satisfying steps in the lifecycle of data engineering, data analytics, and data science projects.
"It is a capital mistake to theorize before one has data." – Sherlock Holmes