Dealing with Data Disasters: Simple Fixes

Kristin Henry
Towards Data Science
4 min readJul 18, 2017

--

How often have you received a data set that’s all set and ready for analytics or data visualization? One that didn’t come from a viz example or tutorial? I don’t know about you, but almost all of the data that ends up in my hands requires at least a little bit of cleaning.

Most of these fixes are fairly simple, and some could easily be done manually. However, sometimes I get a large set of files and manually fixing each one would be rather time consuming. So I’ve been developing a little command line tool, in python, to speed up some the drudgery of data cleaning. The tool creates a new file, so that the original data file is still intact.

Using the Tool via Command Line

Skim

python cleaner.py -skim

This removes any empty or extra rows at the beginning of the csv file. This one may seem easy to fix by hand. Just open up the file and remove the troublesome rows. Easy-peasy.

But what if it’s a set of 2,000+ files, and there’s a logo, and maybe an email address in the first two rows of every file. Some of the files have one empty row under those, and some have three. Some have a mailing address as well.

To deal with these extra lines above the header row, the ‘-skim’ flag tells the cleaner.py script to save all the data to a new csv file without the extra rows. The empty rows are easy. The script just checks if there is anything in those rows, and skips ahead to the first row with content.

empty rows above header row

If there are extra rows, with a logo or contact information, things get slightly more tricky. Most of the time, the file has several columns of data and the extra rows are just one or two columns. In that case, the script checks if the first rows are less than half as long as the data rows (which get checked for length before this).

extra rows above header row

However, this approach fails in cases where there are only a few columns of data. There are other edge cases, but this is the most common.

extra row, but too few columns of data → fail to

When this is the case, it’s probably best to just use the ‘-rows’ flag to indicate which rows to save.

Select specific rows

If you know which rows you want, you can just select them with the ‘-rows’ flag.

python cleaner.py -rows 1-20

If you just want to skip the first 10 rows, you’d use just the row number you want to start at. The script assumes you want everything after this.

python cleaner.py -rows 10

This can also be combined with ‘-skim’ and ‘-cols’

python cleaner.py -skim -rows 10 -cols 2-5

This removes extra headers and saves specific rows and columns from the original file. I use this a lot when I’m creating tests, or sample sets.

Select Specific Columns

python cleaner.py -cols 2, 5-9, 12
  • can select specific columns individually, and in ranges, by index
  • index starts at 0.

Functions a little differently than when selecting rows.

  • if only one index given --> only that column is selected.
  • to select a column and all those following, such as all after 8, use “+”
 -cols 8+ 

Why?

The column selection differs from row selection because I’ve found that I’m usually interested in specific columns in a raw data set. Sometimes, a data set has a lot of issues, and I only want to work with one column (or a few) at a time. With rows, I’m usually just wanting a small subset and it doesn’t really matter which rows. I might want just the last 20 rows, but I probably don’t need to be specific about the 9th and 21st.

What happens by default?

A number of things get done, automatically.

“Unnamed” header-removal

Converting an excel sheet to csv may result in empty cells of first row to be filled with ‘Unnamed: #’, so these are removed from the header row.

Empty column and row removal

Some csv files have empty columns and rows. These are automatically removed. If a file has a lot of empty columns and rows, I might run the cleaner, without any flags, before doing anything else.

Remove extra tables and sums row:

Some files have summary tables below all the actual data, or a row with column sums. These are removed.

Flatten headers:

If there are multiple layers of headers, they get flattened into a single row.

So that a table like this

will be converted to something like this

Note: If the ‘save as json’ option is set, this flattening function is ignored…we’ll talk more about json in another article.

In the next article in this series, we’ll talk about some slightly more complicated fixes…like extracting a missing header name from the content of a column.

The source code for this project is available on github.

--

--

Generative and Data Artist. Creative Coder. Data Visualization Consultant. Founder of GalaxyGoo. http://kristinhenry.github.io/ Admin on vis.social