What is it like to analyze unclean data?

Four levels of user experience for user-defined functions

Shawn Cao
Towards Data Science


Photo by Aaron Meacham on Unsplash

One day, I saw this image in a Reddit post, and it made me laugh out loud. How much time do our data folks spend doing “data laundry”?
[¹ The image was originally found in this textbook on data analytics.]

Some people call it “data cleansing”, “data cleaning”, or “data preparation”, but it all points to the same fundamental challenge: How do we get more accurate insights out of messy data?

It’s one thing to do data prep for large, structured, and predictable data. When you know what to expect, you can build automated data cleaning jobs within an ETL pipeline, and if you’re building it for repeated use, it can certainly be cost-effective.
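To make that concrete, here is a minimal sketch of the kind of cleaning step that might run inside such an automated ETL job. The field names and rules are hypothetical, purely for illustration:

// A minimal sketch (hypothetical fields and rules) of an automated cleaning step:
// trim whitespace, normalize casing, and drop rows missing required fields.
const cleanRows = (rows) =>
  rows
    .map((row) => ({
      ...row,
      email: (row.email || '').trim().toLowerCase(),
      country: (row.country || '').trim().toUpperCase(),
    }))
    .filter((row) => row.email !== '' && row.country !== '');

// cleanRows([{ email: ' A@B.com ', country: 'us ' }, { email: '', country: 'DE' }])
// => [{ email: 'a@b.com', country: 'US' }]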

But there are instances where the investment in an ETL pipeline is not justified, in which case you have to resort to manual cleaning:

  • ephemeral or temporary data: not worth building new jobs for.
  • real-time data: you don’t want to wait long to process it.
  • small data: a spreadsheet shared by a colleague.

In this piece, I want to discuss a framework for approaching data cleaning in the context of poorly structured, real-time, or small data. We’ll start by considering the desirable features of good data cleaning and the available options.

User Experiences In Data Cleaning

In the abstract, the ideal data cleaning process would have the following three features:

  1. Distributed: able to process data of unlimited size with predictable (low) response latency.
  2. Real-time: able to handle streaming data over a moving window quickly and accurately.
  3. Functional: able to run user-defined ETL logic (within limits) in a common programming language.

Among these platform features, Functional is the one most directly related to “data cleaning.” For it, a platform generally offers data analysts four main options:

  1. NO-CODE: an intuitive UI lets users enable transform/extract logic, often with just a few clicks.
  2. FORMULA: a pre-defined function that users can apply to a dataset, as is common in Excel, Google Sheets, and similar tools.
  3. COMMON-LANG: a popular programming language, such as Python, R, or JavaScript, that analysts use to write code for specific ETL logic.
  4. SPECIAL-LANG: a special-purpose language that users must learn in order to write their data processing logic.

From a user experience perspective, the options nearer the top of the list offer a smoother experience, while the options nearer the bottom offer greater control and customizability. A good analytical product for poorly structured, real-time, or small data should aim for the top options while striking a balance so that users can still get everything they need from the product. Option #4 is far from ideal for this type of data.

Case Studies

We will skip illustrating FORMULA, which is already familiar to most Excel and Google Sheets users. Similarly, we will skip SPECIAL-LANG; the DAX Guide is a good reference for DAX, a special-purpose language used by many Power BI users.

Columns.ai is building a data analytics tool that is simple, fast, and accessible by prioritizing user experience over feature set. Let’s take a look at two examples of how it supports NO-CODE and COMMON-LANG in its simple data analyzer.

CASE 1 (NO-CODE) — bucket a column and group by buckets.

Bucketing a column into histogram-style ranges and grouping by those buckets is a common task. Usually, people want to understand a specific metric per bucket, answering questions like “how many customers fall into each spending range?” or “how many states fall into each GDP range?” Since this is so common, users shouldn’t need to write code; they can just use the bucketing slider in the UI, as shown in this demo:

Analyze metrics with dynamic buckets

We think this NO-CODE experience is the best one (a level 1 user experience): it lets users make progress extremely fast.
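Conceptually, the slider is doing something like the following sketch on your behalf: assigning each numeric value to a fixed-width bucket and aggregating per bucket. The data and bucket width here are made up, and the actual implementation inside Columns.ai may differ:

// Assign each value to a fixed-width bucket, then count rows per bucket.
const spending = [12, 55, 230, 480, 75, 310, 95];
const bucketWidth = 100;

const counts = spending.reduce((acc, value) => {
  const bucket = Math.floor(value / bucketWidth) * bucketWidth;
  const label = `${bucket}-${bucket + bucketWidth - 1}`;
  acc[label] = (acc[label] || 0) + 1;
  return acc;
}, {});

console.log(counts);
// => { '0-99': 4, '200-299': 1, '400-499': 1, '300-399': 1 }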

CASE 2 (COMMON-LANG) — write code to transform a dirty column.

Writing logic to extract keys from messy text is a routine task. Using a regex is a common approach, and JavaScript is a popular language whose adoption keeps growing, so there are plenty of resources online for learning how to write a simple JS snippet to do this kind of transformation.

This example analyzes messy data presented in this Google Spreadsheet. After loading it, we can write a simple function to extract hash keys such as #key1, #key2, and #key3, and then compute metrics for those keys. The JS function looks like this:

// Extract a hash key (e.g. #key1) from the 'dirty' column;
// rows without a hash key are labeled 'none'.
const func = () => {
  const d = nebula.column('dirty');
  const m = d.match(/^.*(#[a-z0-9]+).*$/);
  if (!m || m.length < 2) return 'none';
  return m[1];
};

// Create a computed column 'clean' from the function above,
// then group by it and sum the 'value' metric.
columns
  .apply('clean', columns.Type.STRING, func)
  .select('clean', sum('value'))
  .run();
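If you just want to sanity-check the regex itself, here is a small standalone sketch that runs in any browser console or Node REPL, independent of the Columns.ai API. The sample strings are made up for illustration:

// Try the same regex on a couple of made-up strings.
const extractKey = (text) => {
  const m = text.match(/^.*(#[a-z0-9]+).*$/);
  return m && m.length > 1 ? m[1] : 'none';
};

console.log(extractKey('order 1234 #key1 pending')); // => '#key1'
console.log(extractKey('no hash key in this row'));  // => 'none'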

You can copy and paste the above code into the console of this visualizer and try it out live: data visualizer.

If you do, you will end up with the same result as illustrated in this demo:

Supporting a common language like JavaScript is a level 3 user experience for making progress on analytical problems, and it gives users vast capacity to implement any logic in their analysis.

Summary

In this post, we explored how critical data cleaning is for data analysts. Sometimes it isn’t fun, but if we do it well, we can still deliver a good user experience.

We went through a few examples to get a feel for the different levels of user experience in supporting functions that help users make progress in data cleaning.

Finally, thank you for reading this far. I hope “data cleaning” becomes fun and easy in our everyday data analytics.

If you enjoyed the post, please follow us in any of the ways listed here.
