
Fixing Common CSV Errors with Regex

A powerful, underused tool that can fix many common data problems.


Photo by Jackson Simmer on Unsplash

Regular expressions are a powerful tool that is often overlooked.

In this post, I am going through several common issues with CSV files and fixing them using regular expressions.

Often as a data scientist, you work with large datasets that you did not produce yourself. Data received from an external provider is especially prone to errors, because you have no direct control over how the data was exported before transfer. When these errors occur, there are several options available – one of them is regular expressions.

There is a joke about using regular expressions in software development that goes something along the lines of this:

You have a problem that could benefit from regex, and now you have two problems.

But regular expressions are less complicated than they’re made out to be. Essentially, what you’re trying to accomplish is describing a pattern. Regex support is well established in many languages, and the syntax is relatively consistent across them.

However, in this post, I will not be showing specific code chunks. Instead, I will showcase the expressions in Notepad++, an incredibly lightweight editor. Experimenting with regex in an editor instead of directly in code allows for rapid testing of the expressions you create.

Notepad++ can handle CSV files reasonably well up to a million records. However, when file sizes are larger, the better option is to contact the data provider to fix the issues at the source.

Many online regex tools can help build out expressions, but direct feedback from an editor can be more beneficial. In addition, the patterns you craft often break down for some instances in the file, which you may not see in an online regex tool without uploading your entire file.


Encoding

File Encodings (Photo by Author)

One of the first issues you may hit with a CSV file is strange characters or weird newline behavior. There is no catch-all solution, as files arrive in many different encodings, but a common fix is to open everything as UTF-8. This will not always solve your problems, but handling every file consistently in one encoding makes the remaining issues much easier to diagnose.
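If you prefer to do the conversion in code rather than in the editor, a best-effort re-encode can be sketched in Python. The candidate encoding list and the file names in the usage comment are assumptions – adjust them to your data.

```python
def decode_best_effort(raw: bytes):
    """Try common encodings in order; latin-1 accepts any byte
    sequence, so this function never raises."""
    for enc in ("utf-8-sig", "utf-8", "cp1252", "latin-1"):
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue

# Hypothetical usage: re-save an export as UTF-8.
# enc, text = decode_best_effort(open("export.csv", "rb").read())
# open("export-utf8.csv", "w", encoding="utf-8", newline="").write(text)
```

Once everything is written back out as UTF-8, any remaining oddities are genuine data problems rather than encoding artifacts.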


Visualization

It may seem strange to talk about visualizing a CSV file, but simply looking at the data can often clearly show the issues you’re facing. Many editors, however, hide some characters in the GUI by default. Sometimes something as small as an extra space can be the root cause of your issues. So, when handling file errors, ensure you can see every symbol within the file and confirm that what you see is what you expect.

To show all file symbols in Notepad++, go to View > Show Symbol > Show All Characters.
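The same check can be done outside the editor: in Python, repr() makes hidden characters explicit, which is a quick way to confirm what a line really contains.

```python
def show_symbols(lines, n=5):
    """Return repr() of the first n lines so that hidden characters
    (tabs, carriage returns, trailing spaces) become visible."""
    return [repr(line) for line in lines[:n]]
```

For example, a trailing space and a CRLF line ending that an editor might hide show up plainly in the repr() output.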

Showing all File Symbols (Photo by Author)

Changing Date Formats

Most file-reading functions will have options to parse dates. However, these generally assume that every date follows the same format. When the format is consistent, determining that format is a relatively straightforward process.

But, if you’ve ever worked in the real world, you’ll know everyone writes their dates differently. Everyone has their reasons, and those reasons make sense to them. To argue is usually fighting a losing battle.

However, if your file has multiple DateTime formats, you can update these formats with regular expressions.

Multiple DateTime Formats (Photo by Author)

A few things to note here:

  • The date portions are captured as groups using the parentheses.
  • The groups are referenced in the replace portion of the regular expression: \1 for the first group, \2 for the second, and so on.

The aim is to capture the date portions as a group and rearrange the groups into a single format consistent with the rest of the dates.
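As a sketch of the same idea in code: assuming the stray dates are in dd/mm/yyyy form, a substitution with backreferences reorders the captured groups into ‘yyyy-mm-dd’. The input format here is an assumption – adapt the groups to the formats in your file.

```python
import re

# Capture day, month, and year; \3-\2-\1 in the replacement refers to
# the third, second, and first captured groups, reordering the date.
dmy = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")

def normalize_dates(text):
    return dmy.sub(r"\3-\2-\1", text)
```

Running `normalize_dates("25/12/2021,14.2")` yields `2021-12-25,14.2`, while dates already in the target format are left untouched.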

I know no one asked, but the best date format is ‘yyyy-mm-dd’, and if you need time, then ‘yyyy-mm-dd hh:mm:ss’. The units of this format strictly decrease in scale from left to right. Each element is separated, leading zeros are used, and the time is separated from the date. Easily machine-readable and logically clear. The best format.

If you add the month in date fields as abbreviated text, you are a monster. Stop.


Text Qualifiers

Text is one of the most flexible data formats, and precisely because of this flexibility, it becomes the source of many problems. Text fields often contain the column separator that is used in the file.

Because of this issue, text qualifiers are added. A text qualifier, usually a character such as a quotation mark, marks the beginning and end of a text field. A file reader will ignore any column separators after the first text qualifier until the second qualifier is met. When these qualifiers are not present from the start, it is difficult to add them effectively at a later time.

The best option is to try to get text qualifiers. Perhaps you can ask for another data extract with text qualifiers.

The reason this is the first option I mention is that text is too flexible. The following replacement expression requires that the adjacent columns have a consistent, well-structured pattern. There is no way to guarantee that the remaining character is not part of the text without a consistent pattern. Text is a nightmare. Ask for text qualifiers.

If getting text qualifiers is not an option and there is some structure to the rest of the data, there are some options. The examples show an extra comma in text, which is the most common problem of this type.

The following pattern takes advantage of a few aspects of text and the adjacent columns.

  • In text, there is usually a space after a comma, so the in-text pattern can be captured easily regardless of the surrounding text.
  • There may be multiple commas. The comma-plus-text pattern is matched zero or more times with the * character.
  • The adjacent columns are matched with a consistent pattern.
  • The group captured within the parentheses is referenced with \1. For multiple groups, use \2, \3, etc.

File Separators within Text (Photo by Author)
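In code, the same idea might look like the following sketch. The column layout (an integer id, one free-text column, a numeric value) and the choice to add quotation marks are assumptions; the key trick is that an in-text comma is followed by a space, while a bare separator is not.

```python
import re

# [^,]+(?:, [^,]+)* matches text whose internal commas are each
# followed by a space, so the real separators around the field are
# left alone. The three-column layout is an assumed example.
row = re.compile(r"^(\d+),([^,]+(?:, [^,]+)*),(\d+)$")

def quote_text_field(line):
    # Wrap the recovered text column in text qualifiers.
    return row.sub(r'\1,"\2",\3', line)
```

For instance, `quote_text_field("1,Hello, world, nice day,42")` returns `1,"Hello, world, nice day",42`, so a downstream reader sees exactly three columns.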

Mid Line Breaks

A CSV with a line break in the middle of a record is a common issue in data exports. This error often occurs when data is exported without a text qualifier and the original data-entry system allowed line breaks within the text.

The most straightforward case is when the break splits only a single line. While it is usually easy to find the offending rows (developers will often count the number of delimiters in each line to locate them), it is far more challenging to fix these issues automatically.

Text with Newline Character (Photo by Author)

When there is only a single break in a line due to text, it can be handled automatically with a negative lookahead. Essentially, a negative lookahead is a regular expression that matches a pattern only when it is not followed by another pattern.

The first part of this is easy to identify in regex. Match the newline character.

The second part, the negative lookahead, is more complex. The pattern to look for is what you would expect at the start of the next line. For example, if each line starts with a date in ‘yyyy-mm-dd’ form followed by a comma, the pattern to match is \r\n(?!\d\d\d\d-\d\d-\d\d,), replacing the match with the content of the first group, \1.

A few things to keep note of here:

  • Lookaheads are specified with the character ‘?’ followed by either the ‘!’ character (for negative lookaheads) or the ‘=’ character (for positive lookaheads), inside parentheses: (?!…) or (?=…).
  • The entire date format pattern is specified. This pattern must be general enough to capture any date in the file.
  • The group captured within the parentheses is referenced with \1. For multiple groups, use \2, \3, etc.
  • The replacement removes the newline character, matched with ‘\n’. However, with Windows-style line endings, each line break is a carriage return ‘\r’ followed by a newline ‘\n’, shown as CRLF in Notepad++.

Particularly astute readers will notice the additional comma in the matching pattern. After matching the first group, a comma is added to enforce that only the first column has this format. Matching the first few columns significantly improves the reliability of the pattern matching, but there is a trade-off: if the problem text field is in the first column of the CSV, this pattern is no longer viable. If the first few columns do have a clean, consistent pattern, though, you should match each of them, as it is improbable that the break in the text reproduces the same pattern across multiple columns.
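The same lookahead translates directly to code. Here a record is assumed to start with a ‘yyyy-mm-dd’ date and a comma, and a spurious break is rejoined with a space – both choices are assumptions to adapt to your file.

```python
import re

# A newline is spurious unless the next line starts like a new record:
# a date followed by a comma (assumed record format).
spurious_break = re.compile(r"\r?\n(?!\d{4}-\d{2}-\d{2},)")

def join_broken_lines(text):
    return spurious_break.sub(" ", text)
```

A break inside a record is removed, while the genuine record boundary before the next date survives.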

When the first column is the problem column for the file, the replacement is more manageable.

Text with Newline Character (Photo by Author)

The basis for this pattern is that some of the columns in the CSV have a well-formatted design. One way to ensure there is at least some proper formatting in the CSV is to ask for an ID or a date in the first column. This ID guarantees some structure on which to base your pattern matching.

The easiest solution for multiple line breaks in a text field is to repeat the pattern replacement until no matches are found. While this is not the most elegant solution, it sidesteps a real issue: when the number of breaks is unknown, an unknown number of groups would be required to combine the text into a single line in one pass. Repeating the replacement avoids that.
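In an editor, this means pressing Replace All repeatedly; in code, the loop can be written explicitly. The record-start pattern is the same assumed date prefix as before. Note that a single global substitution already handles non-overlapping matches in one pass, so the loop mainly matters for patterns whose matches overlap – but it faithfully mirrors the editor workflow.

```python
import re

# Assumed record format: each genuine line starts with a date and comma.
spurious_break = re.compile(r"\r?\n(?!\d{4}-\d{2}-\d{2},)")

def fix_until_stable(text):
    # Repeat the replacement until the text stops changing.
    while True:
        fixed = spurious_break.sub(" ", text)
        if fixed == text:
            return fixed
        text = fixed
```

A record split across three lines collapses back into one, while the boundary before the next dated record is preserved.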

Unescaped Characters

Another common issue is text containing special characters that are not escaped. The problem is even worse when the unescaped characters are the text qualifiers themselves.

The following regular expression looks quite complicated, but each component is relatively straightforward. The aim is to capture the quotation marks that are unescaped and are not text qualifiers. The second condition means ignoring any quote with a preceding or a following separator.

Unescaped Characters (Photo by Author)

The pattern here uses some different regular expression components.

  • The negative lookbehind, specified with ‘(?<!…)’, means that the following pattern cannot have the escape character ‘\’ or the separator before it. The escape character is written with an additional escape, ‘\\’, to specify the exact character match.
  • The second group uses exceptions to handle the valid text qualifiers with a trailing separator. This group captures the quote to be escaped and the character that follows it.
  • The replacement places the literal escape character (specified with the double escape) before the content of the second group, \2.
  • For stand-alone double quotes, this pattern will need to be run twice, since the unescaped second quote is placed back into the text by the first replacement.
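A comparable fix can be sketched in Python with lookarounds. It assumes backslash escaping and a comma separator, and treats a quote as in-text when it is neither adjacent to a separator nor at a line boundary; real files may need extra conditions.

```python
import re

# Match a quote preceded by an ordinary character (not a separator,
# backslash, or newline) and not followed by a separator, a newline,
# or the end of the line. Assumed: backslash escaping, comma separator.
inner_quote = re.compile(r'(?<=[^,\\\n])"(?!,|\n|$)')

def escape_inner_quotes(line):
    return inner_quote.sub(r'\\"', line)
```

For example, `escape_inner_quotes('1,"say "hi" now",2')` returns `1,"say \"hi\" now",2` – the opening and closing qualifiers are untouched because they sit next to separators.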

Conclusion

Regular expressions are a powerful tool that is often overlooked, yet they are available in virtually all editors and programming languages. Since most data science work is data wrangling, regular expressions become an essential tool in a data scientist’s arsenal.

Despite the strange syntax and challenging learning curve, regular expressions are simply a pattern-matching tool. If you have trouble finding the correct expression, speaking the pattern you want to capture out loud, with its exceptions and conditions, is a great place to start.

Moreover, data cleaning and structural problems are often solvable with a single regular expression. These expressions become increasingly powerful when you consider not only the malformed pattern but also the correct patterns surrounding it. Unfortunately, people often ignore the known patterns present in a file and focus solely on the malformations. This is why their expressions do not capture all the cases or, worse, capture cases they don’t want.

Ultimately, understanding regular expressions is worth the time invested.


If you’re interested in reading articles about novel data science tools and understanding machine learning algorithms, consider following me on Medium.


