
Examining and Cleaning Data

Understanding categorical and quantitative data, levels of measurement and datatypes, and how data is inspected and cleaned.

Image by Author

Data science beginners tend to struggle with data cleaning. They don't know the right commands to use, and they don't understand why a command works on one column but returns an error for another. Having explained this over and over, I decided to write a comprehensive guide that I can always refer beginners to.

The first thing to do on receiving a new dataset is to inspect it, clean it and transform it into a format suitable for use. Datasets are cleaned to eliminate inconsistencies, errors and empty values, and to avoid inaccurate, imprecise, meaningless and inconsistent results, insights and models. Before cleaning is performed, the dataset is inspected and scrutinized to determine the right cleaning steps to take.

Cleaning is a major part of the preprocessing operation and can be performed using Python (NumPy and Pandas) or Power BI (Power Query and M). This article uses Python with Jupyter Notebook to run commands and focuses on data cleaning only. Data cleaning is not straightforward: it is time consuming and, because every dataset has its peculiarities, not easily automated. However, certain commands are used frequently, and we will highlight some of them in this article.

Most statisticians and analysts are comfortable with quantitative data but not qualitative data, so I will briefly explain data classification and the statistical levels of measurement. These are quite different from datatypes and data fields, and even though most learning models require encoding data to numerical types, understanding the levels of measurement and data classification will help us understand these conversions and the operations that can be performed on them.


Understanding numerical and categorical data

Numerical data, also known as quantitative variables, is measured with numbers that indicate the amount and units that indicate the quantity; think of measurement in physics, with its fundamental and derived quantities. Examples include time, temperature, currency, weight, age, years of experience, salary and price. They are all described by numbers and units, and units must be uniform before numerical data can be compared.

Categorical data, also referred to as qualitative variables, describes attributes or classes such as colours, states, sizes, car models, countries, animal species, chemical compounds and classes of foods.


Understanding levels of measurement

The four levels of measurement in statistics are Nominal, Ordinal, Interval and Ratio.

  • Nominal Level

Most categorical data falls into the nominal level. The categories are just names and classes with no progressive relationship between them; examples are names of cities or colours. They are disjoint, and every observation falls into exactly one category. Nominal data can be assigned numeric values and encoded as numbers for classification algorithms, but these numeric values are arbitrary: arithmetic operations and comparisons, including ordering, cannot be performed on the encoding numbers. Only frequency counts can be performed. You will see this when we use .describe(), .unique() and .value_counts() while inspecting our data.
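As a minimal sketch (the 'city' column and its values here are hypothetical), nominal categories can be encoded with arbitrary numeric codes, and only frequency counts carry meaning:

import pandas as pd

# Hypothetical nominal column of city names
data = pd.DataFrame({'city': ['Lagos', 'Abuja', 'Lagos', 'Kano', 'Abuja']})

# Arbitrary numeric codes for classification algorithms;
# the numbers carry no order or magnitude
codes, categories = pd.factorize(data['city'])
data['city_code'] = codes

# Only frequency counts are meaningful for nominal data
print(data['city'].value_counts())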

  • Ordinal Level

Ordinal data is categorical, but ordering can be performed on it, so when it is encoded using numbers, progressive or consecutive numbers must be used because the order matters. A common example is user feedback: very bad, bad, indifferent, good, very good, usually encoded as 1 = very bad, 2 = bad, 3 = indifferent, 4 = good, 5 = very good. Another good example is educational level: less than high school, high school graduate, Bachelor's degree, Master's degree, Doctorate degree. This measurement is progressive. However, we cannot tell anything about the size of the difference between categories, so arithmetic operations cannot be performed, since the scale difference cannot be interpreted.

Numerical data can be converted to categorical data (the ordinal level of measurement). For example, weight could be converted to very heavy, heavy, medium, light and very light, as in the sketch after the figure below.

Converting numerical data to categorical data | Image by Author
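A minimal sketch of this conversion (the bin edges and labels here are hypothetical), using pd.cut to bin a numerical column into ordered categories:

import pandas as pd

# Hypothetical weights in kg
weights = pd.Series([3.2, 12.5, 45.0, 80.3, 150.7])

# pd.cut bins numerical data into ordered (ordinal) categories
labels = ['very light', 'light', 'medium', 'heavy', 'very heavy']
categories = pd.cut(weights, bins=[0, 5, 20, 60, 100, 200], labels=labels)
print(categories)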
  • Interval Level

This level is for numerical data only; it has regular, constant intervals like a scale or a number line. The scale has a zero point, but that zero point does not represent the absence of the quantity. The values here are relative, taken with reference to a point, which is usually the zero point. Examples are temperature measured in Celsius, height measured with reference to sea level, and dates and years measured as BC and AD. For this measurement level, multiplying or dividing values by each other is meaningless; only the differences between them are relevant. Mean, standard deviation, range and some other statistical analyses can be performed, and values can also be sorted.

  • Ratio Level

Numerical values with an absolute zero, where a score of zero represents the total absence of the quantity, are classified at the ratio level of measurement. These values have the same characteristics as real numbers. Examples include price, age, salary, and the height and weight of an object.


Understanding datatypes

Datatypes in Pandas are also known as dtypes. Pandas datatypes include:

  • float (float64), which represents decimal numbers such as 1.0
  • int (int64), which represents integers such as 1
  • datetime (datetime64[ns]), which represents dates, times and time differences, such as pd.Timestamp('20180310')
  • string (object), which covers alphabetic strings and columns containing a mix of datatypes; an example is 'foo'.

Categorical data is stored as strings, and so has the object datatype. Columns containing currencies, ages, dates and other numerical data can also be read in as object datatypes and will require conversion to the proper Pandas datatype. Pandas datatypes are designed to optimize the operations that can be performed on them; the datetime dtype, for example, makes it easier to perform operations involving dates and times. The difference between float and int is the decimal part: age, for instance, is best stored as int values and currencies as float, as in the sketch after the figure below.

Pandas datatypes | Image by Author
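A minimal sketch of such a conversion, assuming a hypothetical frame where age and price were read in as strings:

import pandas as pd

# Both columns arrive as the object dtype
data = pd.DataFrame({'age': ['25', '31', '47'], 'price': ['19.99', '5.50', '3.25']})
print(data.dtypes)

# Convert each column to the proper Pandas datatype
data['age'] = data['age'].astype('int64')
data['price'] = data['price'].astype('float64')
print(data.dtypes)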

EXAMINING DATA

Let's discuss the examination of data and the commands regularly used for it. We examine data to gain basic insights into our datasets: we check the column sizes, datatypes, unique values, null values, duplicate values and outliers. NumPy and Pandas are imported and the data is loaded. A first glimpse of the data is obtained using data.head(), and a brief summary using data.info().

Importing libraries and loading data | Image by Author
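A minimal sketch of this setup (the file name dataset.csv is a placeholder for your own data):

import numpy as np
import pandas as pd

# Load the dataset; replace 'dataset.csv' with your own file path
data = pd.read_csv('dataset.csv')

data.head()   # first glimpse of the data
data.info()   # columns, datatypes and non-null counts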

The frequently used commands for examining data are shown below:

data.head()

Gives a brief glimpse of the data by showing the first five rows. Passing a number in the parentheses displays that many rows.

data.tail()

Shows the last five rows. Passing a number in the parentheses displays that many rows.

data.sample()

Shows a single randomly selected row. Passing a number in the parentheses displays that many rows.

data.info()

Shows the columns, their datatypes and the number of non-null values.

data.duplicated().describe()

data.duplicated() returns a Series of True and False values; chaining .describe() on it shows the number of unique values and the most frequent (top) value. One unique value (False) means there are no duplicate rows. Two unique values mean there are duplicates, and subtracting the frequency of the top value from the total count gives the number of duplicate rows.
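As a shortcut (assuming data is the DataFrame loaded earlier), the duplicate count can also be obtained directly, since True values in a Boolean Series sum as 1:

# Count duplicate rows directly
num_duplicates = data.duplicated().sum()
print(num_duplicates)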

data.describe()

This shows us the statistical summary of the numerical and datetime columns: count, mean, min, max and standard deviation. From this we can easily check for outliers. Object columns must be described separately by selecting them, as in data[['diagnosis', 'country', 'gender']].describe(). This shows the count (number of non-null rows in each column), unique (number of unique values in each column), top (the most frequent value) and freq (the frequency of the most frequent value).

data['Column_name'].unique()

Returns an array of the unique values in a single column.

data['Column_name'].value_counts()

Returns the unique values of a column together with their frequencies, sorted in descending order, so the first value is the mode. It also indicates the number of unique values and their datatype.
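For example (assuming a hypothetical 'gender' column):

# Unique values and their frequencies
print(data['gender'].unique())        # e.g. array(['M', 'F'], dtype=object)
print(data['gender'].value_counts())  # sorted by frequency, so the mode comes first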

Some of the above functions return a new object rather than modifying the existing one; passing inplace=True modifies the existing object in place.

To get more information about a command, simply replace the '()' at the end with '?' in your Jupyter Notebook. For example, running data.describe? displays basic documentation for describe().

Using describe() for numerical and object datatypes | Image by Author

CLEANING DATA

Our basic cleaning involves dropping (selected columns, outliers, null values and duplicates) and transforming (converting column datatypes, replacing null values with specified values, renaming columns). The steps you take depend on your dataset.

The frequently used commands for cleaning data are shown below:

data.drop_duplicates()

Drops duplicate rows.

data.dropna()

Drops rows with null or missing values.

data.fillna(0, inplace=True)

Fills null values with 0; any value can replace 0. inplace=True modifies the object in place instead of returning a new one.

data.drop('index_label', axis=0)

Deletes the row with that index label. Passing a column name with axis=1 deletes that column instead. inplace can also be used.

All of the above functions can be applied to specific columns, for example data[['diagnosis', 'country', 'gender']].dropna().

You have to decide whether it is best to drop null and duplicate values. In certain datasets duplicate rows are expected and can be permitted, while in others duplicates rarely occur and eliminating them will not affect the data. Null values in numeric columns can be assigned the mean, median or mode of that column so as not to lose the data in the other columns of those rows, as in the sketch below.
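A minimal sketch of this, assuming a hypothetical numeric 'age' column:

# Fill missing ages with the column mean so the other
# columns of those rows are not lost
data['age'] = data['age'].fillna(data['age'].mean())

# The median or mode work the same way:
# data['age'].fillna(data['age'].median())
# data['age'].fillna(data['age'].mode()[0])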

data['Column_name'] = data['Column_name'].str.replace('.', '', regex=False)

Performs string substitution element-wise. regex=False treats the pattern as a literal string; without it, '.' can be interpreted as a regular-expression wildcard and match every character.

data['Column_name'].replace('old_value', 'new_value', inplace=True)

Replaces every occurrence of the exact old value of the selected column with the new value.

data.loc[data['Column_name'].str.contains('old_value'), 'Column_name'] = 'new_value'

Replaces every value in the selected column that contains the old value with the new value. Note the use of .loc to target only that column; assigning through data[...] directly would overwrite every column of the matching rows.

data['Column_name'] = data['Column_name'].str.lower()

Converts every value in that column to lowercase.

data['Column_name'] = data['Column_name'].str.capitalize()

Capitalizes every value in that column.

data['Column_name'] = pd.to_numeric(data['Column_name'])

Converts column to numeric datatype.

data['Column_name'] = pd.to_datetime(data['Column_name'])

Converts column to datetime datatype.
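Putting the string and conversion commands together: a minimal sketch, assuming a hypothetical 'price' column stored as strings like '$1,200.50':

# Strip currency symbols and separators, then convert to numeric;
# errors='coerce' turns any value that still cannot be parsed into NaN
data['price'] = data['price'].str.replace('$', '', regex=False)
data['price'] = data['price'].str.replace(',', '', regex=False)
data['price'] = pd.to_numeric(data['price'], errors='coerce')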

data.rename(index={'old_index': 'new_index'}, columns={'old_column': 'new_column'})

Used to rename the columns and row index.

Examining and cleaning data can be tricky at times, requiring domain knowledge to interpret the data and decide how it should be transformed. For numerical values, attention is paid to outliers, while for categorical data misspellings are common and can be fixed using the replace or contains methods shown above; see the sketch below.
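A minimal sketch of fixing misspellings, assuming a hypothetical 'country' column:

# Map known misspellings to the correct value
data['country'] = data['country'].replace({'Nigria': 'Nigeria', 'nigeria': 'Nigeria'})

# Or catch any value containing a telltale fragment;
# na=False treats missing values as non-matches
data.loc[data['country'].str.contains('igeria', na=False), 'country'] = 'Nigeria'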

The lists of functions above are not exhaustive and will require modification for specific cases. Replacing the '()' at the end of a command with '?' in Jupyter Notebook brings up the documentation. Additional information can be found in recognized texts like 'Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython' by Wes McKinney.

If you enjoyed this article, drop a comment and share with your colleagues.

We can continue the conversation on Twitter at https://twitter.com/ChijiokeGodwin_ or on LinkedIn at https://www.linkedin.com/in/chijioke-godwin-nwankpa-4a0a83106.

