
Data Preparation Cheatsheet

Common feature engineering/EDA tasks, compiled

Image from Unsplash

The longest part of any Data Analysis/Science task is preparing and configuring your data properly. A model only performs as well as the data it is fed, and there are many transformations the data may have to undergo before it is ready for model training. Over the years I have compiled a Notion page that highlights many of the common tasks Data Scientists need to perform during data preparation. I've listed a few of the examples below, but the full set can be found at the following link. I will continue to expand it as I come across other common functions that I use repeatedly during EDA or Feature Engineering.

Note: All these examples are in Python and mainly use the Pandas, NumPy, and scikit-learn libraries. For visualization, Matplotlib or Seaborn is used.

Table of Contents

  1. Checking for Missing Values in a DataFrame
  2. Dropping a Column
  3. Applying a Function to a Column
  4. Plot Value Counts of a Column
  5. Sort DataFrame by Column Values
  6. Dropping Rows Based on a Column Value
  7. Ordinal Encoding
  8. Encoding DataFrame with all Categorical Variables
  9. Additional Resources

Checking for Missing Values in a DataFrame

This code block uses the Pandas functions isnull() and sum() to give a summary of missing values from all columns in your dataset.
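A minimal sketch of what such a block might look like, assuming a small illustrative DataFrame named df:

```python
import pandas as pd

# Illustrative DataFrame with a few missing entries
df = pd.DataFrame({"age": [25, None, 31], "city": ["NYC", "LA", None]})

# isnull() flags each missing cell; sum() totals them per column
print(df.isnull().sum())
```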

Dropping a Column

To drop a column, use the pandas drop() function on the column of your choice; to drop multiple columns, just add their names to the list of column names.
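For example, a quick sketch using the same kind of illustrative df:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31], "city": ["NYC", "LA", "SF"]})

# Drop one column; list several names to drop multiple columns at once
df = df.drop(columns=["city"])
```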

Applying a Function to a Column

Many feature engineering tasks require encoding or data transformation, which can be done with plain Python functions. Using the pandas apply() function, you can apply a function you have written to an entire column, either to create a new column or to transform the one you have chosen.
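A rough example of the pattern; to_decade is an illustrative function rather than one from the original article:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31]})

def to_decade(age):
    return (age // 10) * 10

# apply() runs the function on every value; assign the result to a new column,
# or back to "age" to transform the original column in place
df["age_decade"] = df["age"].apply(to_decade)
```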

Plot Value Counts of a Column

A common task for feature engineering is understanding how balanced your dataset is. For example, in a binary classification problem, if nearly 90% of the data points belong to one class and only 10% to the other, the model will tend to predict the first class the majority of the time. To help avoid this, it's essential to visualize the counts of your response variable in particular. The pandas value_counts() function gives a count of the occurrences of each value in a column, and the plot() function lets you visualize this as a bar graph.
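A small sketch of this, assuming the response variable is a column named "label":

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"label": [1, 0, 1, 1, 1, 0, 1, 1]})

# Count the occurrences of each class and plot the counts as a bar chart
df["label"].value_counts().plot(kind="bar")
plt.show()
```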

Sort DataFrame by Column Values

Sometimes for data analysis you want to view your data in a specific order; you can pass one or more columns to the sort_values() function of your DataFrame.
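For instance, a minimal sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31], "city": ["NYC", "LA", "NYC"]})

# Sort by one or more columns; ascending can be set per column
df_sorted = df.sort_values(by=["city", "age"], ascending=[True, False])
```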

Dropping Rows Based on a Column Value

If you ever want to subset your data according to the values of a column, you can do this by capturing the index of a specific set of rows. With that index in hand, you can then use the drop() function to remove the rows you have identified.
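A sketch of that pattern, with an illustrative age threshold:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 15, 31], "city": ["NYC", "LA", "SF"]})

# Capture the index of rows matching a condition, then drop those rows
rows_to_drop = df[df["age"] < 18].index
df = df.drop(rows_to_drop)
```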

Ordinal Encoding

Ordinal Encoding is one of many ways to encode your categorical data. There are various other methods of encoding, such as One-Hot Encoding, which I have linked here. Ordinal Encoding is used when your column follows an inherent order that you want to retain.
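A minimal sketch using scikit-learn's OrdinalEncoder with an explicitly ordered, illustrative "size" column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Passing an explicit category order preserves the column's inherent ranking
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()
```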

Encoding DataFrame with all Categorical Variables

In the case that your dataset has only categorical columns, you might need to create a pipeline/function to encode your entire dataset. Note that before using such a function, you should identify whether order matters for each column you are working with.
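One way such a helper might look; encode_categoricals and its ordinal_cols parameter are illustrative names rather than the original function. Columns flagged as ordinal are ordinal-encoded, and everything else is one-hot encoded:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def encode_categoricals(df, ordinal_cols=None):
    """Ordinal-encode columns with an inherent order; one-hot encode the rest."""
    ordinal_cols = ordinal_cols or []
    encoded = df.copy()
    if ordinal_cols:
        # Default categories are alphabetical; pass categories=... to OrdinalEncoder
        # if the true ordering differs
        encoded[ordinal_cols] = OrdinalEncoder().fit_transform(encoded[ordinal_cols])
    nominal_cols = [c for c in encoded.columns if c not in ordinal_cols]
    return pd.get_dummies(encoded, columns=nominal_cols)

df = pd.DataFrame({"size": ["small", "large", "medium"], "color": ["red", "blue", "red"]})
encoded_df = encode_categoricals(df, ordinal_cols=["size"])
```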

Additional Resources


Data wrangling is essential to prepare your data for model training. Python libraries such as Pandas, NumPy, and scikit-learn make it easy to manipulate and transform your data as necessary. Even with so many new ML algorithms entering the field, it's still essential to understand how to prepare your data for the model you will be using; whether it's a traditional model such as Logistic Regression or a domain like NLP, data preparation is a must.

I hope some of these examples have been useful and save time for those performing EDA or Feature Engineering on their own datasets. Check out the Notion link for all the other examples I've documented; it will continue to be updated. I've also attached other resources and cheatsheets for Feature Engineering that I have found helpful above. Feel free to connect with me on LinkedIn or follow me on Medium for more of my writing. Share any thoughts or feedback, and thank you for reading!
