Data Preprocessing for Non-Techies: Feature Exploration and Engineering

Part Two — Checklist of Most Common Practices

Melody Ann Ucros

Now that we have covered the basic terms and definitions for data types and structure in my previous post, let’s dive into the creative and most time-consuming side of data science: cleaning and feature engineering.

What are some of the basic strategies that data scientists use to clean their data AND improve the amount of information they get from it?

The cleaning and engineering strategies you use usually depend on the business problem and the type of target variable, since these influence the choice of algorithm and its data preparation requirements.

Therefore, I will give you a basic checklist that can help any beginner (including me) brainstorm what to do with the data at this stage.

The most important part of data cleaning is experimentation: checking how applying one or more of these strategies affects your model’s ability to actually predict or classify.

Also, although there is some logic to the order, keep in mind that these steps happen iteratively, and you will go back and forth between:

→ Exploration, Cleaning, Creation, and Selection

Data Exploration

A. Variable Identification:

  1. Context of Target Variable (logical connection)
  2. Data Type per Feature (character, numeric, etc.)
  3. Variable Category (Continuous, Categorical, etc.)

B. Uni-variate Analysis:

  1. Central Tendency & Spread for Continuous
  2. Distribution (levels) for Categorical
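
To make the uni-variate checks above concrete, here is a minimal sketch using pandas. The toy DataFrame and its column names (`price`, `city`) are made up for illustration; swap in your own data.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "price": [100.0, 120.0, 130.0, 150.0, 900.0],
    "city": ["Madrid", "Madrid", "Bogota", "Miami", "Madrid"],
})

# Continuous variable: central tendency and spread.
price_summary = df["price"].describe()  # count, mean, std, min, quartiles, max
price_median = df["price"].median()

# Categorical variable: how often each level appears.
city_counts = df["city"].value_counts()

print(price_summary)
print("median:", price_median)
print(city_counts)
```

Comparing the mean and the median is a quick first hint of skew or outliers: a mean far above the median usually means a few very large values are pulling it up.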

C. Bi-variate Analysis:

  1. Correlation of Continuous Variables
  2. Two-Way Table or Stacked Columns for Categorical
  3. Chi-Square Test for Categorical
  4. Z-Test / T-Test (or ANOVA) for Categorical vs Continuous
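
A rough sketch of the first three bi-variate checks, assuming pandas and SciPy are installed. The data and column names are invented, and with a sample this small the chi-square p-value is only there to show the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "sqft": [50, 60, 70, 80, 90, 100],
    "price": [100, 125, 140, 165, 180, 200],
    "has_garage": ["no", "no", "no", "yes", "yes", "yes"],
    "sold": ["no", "no", "yes", "yes", "yes", "yes"],
})

# 1. Correlation between two continuous variables (Pearson by default).
corr = df["sqft"].corr(df["price"])

# 2. Two-way table for two categorical variables.
table = pd.crosstab(df["has_garage"], df["sold"])

# 3. Chi-square test of independence on that table.
chi2, p_value, dof, expected = chi2_contingency(table)

print("correlation:", round(corr, 3))
print(table)
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
```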

Data Cleaning

A. Remove Noise:

  1. Duplicates
  2. Paragraph Columns
  3. Erroneous Values
  4. Contradictions
  5. Mislabels
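
Here is one way a few of these noise-removal steps could look in pandas. This is a sketch under assumptions: the columns are invented, and the valid age range (0 to 120) is a rule I picked for the example, not a universal one.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, -5, 41],
    "status": ["Sold", "sold ", "sold ", "SOLD", "listed"],
})

# 1. Duplicates: drop rows that are exact copies of an earlier row.
clean = df.drop_duplicates().copy()

# 3. Erroneous values: ages outside an assumed valid range become missing.
clean["age"] = clean["age"].where(clean["age"].between(0, 120))

# 5. Mislabels: normalise spelling variants of the same label.
clean["status"] = clean["status"].str.strip().str.lower()

print(clean)
```

Turning an impossible value into a missing value (instead of deleting the row) keeps the rest of the row usable and hands the problem over to the missing-values step.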

B. Missing Values:

  1. Delete
  2. Mean/Mode/Median Imputation
  3. Prediction Model
  4. KNN Imputation
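
The options above can be sketched as follows, assuming pandas and scikit-learn. The data is made up, and `n_neighbors=2` is an arbitrary choice for such a tiny table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "sqft": [50.0, 60.0, np.nan, 80.0, 90.0],
    "rooms": [1.0, 2.0, 2.0, np.nan, 3.0],
    "city": ["Madrid", None, "Madrid", "Miami", "Madrid"],
})

# 1. Delete: keep only rows with no missing values.
dropped = df.dropna()

# 2. Median imputation for a numeric column, mode for a categorical one.
simple = df.copy()
simple["sqft"] = simple["sqft"].fillna(simple["sqft"].median())
simple["city"] = simple["city"].fillna(simple["city"].mode()[0])

# 4. KNN imputation: fill numeric gaps from the most similar rows.
num_cols = ["sqft", "rooms"]
knn = df.copy()
knn[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

print(dropped)
print(simple)
print(knn)
```

Notice how much data the "delete" option costs here: only two of five rows survive, which is why imputation is often worth trying first.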

C. Outliers:

  1. Cut-Off or Delete
  2. Natural Log
  3. Binning
  4. Assign Weights
  5. Mean/Mode/Median Imputation
  6. Build Predictive Model
  7. Treat them separately
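
As an illustration of the first two options, here is a small sketch with pandas and NumPy. It flags outliers with the common 1.5 x IQR rule, which is my assumption for the example; the right cut-off depends on your data and business problem.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value.
price = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 5000.0])

# Flag outliers with the 1.5 x IQR rule (an assumed, commonly used cut-off).
q1, q3 = price.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (price < lower) | (price > upper)

# 1. Cut-off (cap values at the bounds) or delete the flagged rows.
capped = price.clip(lower=lower, upper=upper)
trimmed = price[~is_outlier]

# 2. Natural log: compresses large values instead of removing them.
logged = np.log(price)

print("bounds:", lower, upper)
print(capped.tolist())
print(logged.round(2).tolist())
```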

D. Variable Transformation:

  1. Logarithm
  2. Square / Cube root
  3. Binning / Discretization
  4. Dummies
  5. Factorization
  6. Other Data Type
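
Most of these transformations are one-liners in pandas and NumPy. A minimal sketch, with invented columns and bin edges that I chose only for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "income": [20000, 35000, 50000, 120000],
    "age": [22, 37, 45, 68],
    "city": ["Madrid", "Miami", "Madrid", "Bogota"],
})

# 1. Logarithm and 2. square root: reduce right skew.
df["income_log"] = np.log(df["income"])
df["income_sqrt"] = np.sqrt(df["income"])

# 3. Binning / discretization (the bin edges are an assumption).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)

# 4. Dummies: one 0/1 column per category level.
dummies = pd.get_dummies(df["city"], prefix="city")

# 5. Factorization: one integer code per category level.
codes, uniques = pd.factorize(df["city"])

print(df)
print(dummies)
print(codes, list(uniques))
```

Dummies are usually the safer choice for models that would read integer codes as an ordering; factorized codes are more compact and tend to work fine with tree-based models.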

Feature Creation

A. Indicator Features

  1. Threshold (ex. below certain price = poor)
  2. Combination of features (ex. premium house if 2 bed, 2 bath)
  3. Special Events (ex. Christmas Day or Black Friday)
  4. Event Type (ex. paid vs unpaid based on traffic source)
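
Indicator features are just 0/1 flags built from rules. Here is a sketch in pandas; the thresholds, the column names, and the idea that "ads" is the only paid traffic source are all assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "price": [80, 250, 400],
    "bedrooms": [1, 2, 3],
    "bathrooms": [1, 2, 1],
    "date": pd.to_datetime(["2023-12-25", "2023-11-24", "2023-07-04"]),
    "traffic_source": ["ads", "organic", "ads"],
})

# 1. Threshold: flag rows below an assumed price cut-off.
df["is_low_price"] = (df["price"] < 100).astype(int)

# 2. Combination of features: at least 2 bedrooms and 2 bathrooms.
df["is_premium"] = ((df["bedrooms"] >= 2) & (df["bathrooms"] >= 2)).astype(int)

# 3. Special event: Christmas Day.
df["is_christmas"] = (
    (df["date"].dt.month == 12) & (df["date"].dt.day == 25)
).astype(int)

# 4. Event type: paid vs unpaid, based on traffic source.
df["is_paid"] = df["traffic_source"].isin(["ads"]).astype(int)

print(df)
```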

B. Representation Features

  1. Domain and Time Extractions (ex. purchase_day_of_week)
  2. Numeric to Categorical (ex. years in school to “elementary”)
  3. Grouping sparse classes (ex. keep “sold”, all others become “other”)
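
The three representation examples could be sketched like this in pandas. The data is invented, and the school-year bin edges and the labels other than "elementary" are my own assumptions.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-07"]),
    "years_in_school": [5, 10, 16],
    "status": ["sold", "pending", "withdrawn"],
})

# 1. Time extraction: day of the week from a date.
df["purchase_day_of_week"] = df["purchase_date"].dt.day_name()

# 2. Numeric to categorical (the bin edges are an assumption).
df["education"] = pd.cut(
    df["years_in_school"],
    bins=[0, 6, 12, 30],
    labels=["elementary", "secondary", "higher"],
)

# 3. Grouping sparse classes: keep "sold", everything else becomes "other".
df["status_grouped"] = df["status"].where(df["status"] == "sold", "other")

print(df)
```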

C. Interaction Features

  1. Sum of Features
  2. Difference of Features
  3. Product of Features
  4. Quotient of Features
  5. Unique Formula
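
Interaction features combine two or more columns with simple arithmetic. A minimal pandas sketch with made-up housing columns:

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "bedrooms": [2, 3],
    "bathrooms": [1, 2],
    "price": [200000.0, 300000.0],
    "sqft": [800.0, 1500.0],
    "year_built": [1990, 2010],
    "year_sold": [2020, 2020],
})

df["total_rooms"] = df["bedrooms"] + df["bathrooms"]   # 1. sum
df["age_at_sale"] = df["year_sold"] - df["year_built"]  # 2. difference
df["bed_x_bath"] = df["bedrooms"] * df["bathrooms"]     # 3. product
df["price_per_sqft"] = df["price"] / df["sqft"]         # 4. quotient

print(df)
```

Be careful with quotients built from the target variable (like price per square foot when price is what you predict): they leak the answer into the features.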

D. Conjunctive Features

  1. Markov Blanket
  2. Linear Predictor

E. Disjunctive Features

  1. Centroid
  2. PCA
  3. LDA
  4. SVD
  5. PLS
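
Of the methods in this list, PCA is probably the one beginners meet first, so here is a small sketch of it with scikit-learn. The data is synthetic (two columns are built to be nearly redundant), and keeping 2 components is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: column 1 is almost a copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Replace the 3 original columns with 2 new combined features.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratio tells you how much information each new feature keeps; here two components should capture almost everything because one original column was redundant.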

F. Programming

  1. Logic (FRINGE)
  2. Genetic

Feature Selection

A. Filter Methods

  1. Correlation
  2. Statistical Score
  3. Ranking (Relief Algorithm)
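
Filter methods score each feature on its own, before any model is trained. Here is a sketch of the first two with pandas and scikit-learn; the data is synthetic, and `f_regression` is just one of several statistical scores you could plug in.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends strongly on "signal", weakly on "weak".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "weak": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y = 3 * X["signal"] + 0.5 * X["weak"] + rng.normal(scale=0.5, size=200)

# 1. Correlation: rank features by absolute correlation with the target.
corr_rank = X.corrwith(y).abs().sort_values(ascending=False)

# 2. Statistical score: keep the k features with the best F-statistic.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

print(corr_rank)
print(selected)
```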

B. Wrapper Methods

  1. Forward Stepwise
  2. Backward Stepwise
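
Wrapper methods train a model repeatedly while adding or removing one feature at a time. A minimal sketch using scikit-learn's `SequentialFeatureSelector` (available from version 0.24); the dataset is synthetic, and asking for 2 features with a linear regression is an assumption for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 features, only 2 of which actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=5.0, random_state=0
)

# 1. Forward: start with no features and add the most helpful one each round.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# 2. Backward: start with all features and drop the least helpful one each round.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print("forward keeps: ", forward.get_support())
print("backward keeps:", backward.get_support())
```

Because every round retrains the model with cross-validation, wrapper methods get slow quickly as the number of features grows.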

C. Embedded Methods

  1. Ridge Regression
  2. Lasso Regression
  3. Decision Trees
  4. Elastic Net
  5. XGBoost
  6. SVM
  7. LightGBM
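
Embedded methods select features as a side effect of training the model. Lasso is a simple one to show: its penalty pushes the coefficients of unhelpful features to exactly zero. This is a sketch on synthetic data with scikit-learn, and `alpha=0.5` is a value I picked for the example; in practice you would tune it.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Lasso penalizes coefficient size, so put features on the same scale first.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(X_scaled, y)

# Features whose coefficient is not zero are the ones the model kept.
kept = np.flatnonzero(lasso.coef_)

print(lasso.coef_.round(2))
print("kept feature indices:", kept)
```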

BONUS

Check out my classmate’s Kaggle, where he applied most of these methods to get to the Top 2% of the Leaderboard in the Housing Regression Challenge:

If you thought this was useful, please SHARE it with your friends and CLAP BACK. And feel free to COMMENT below if you feel something can be added to one of the strategies. We are in this learning journey together!

Author: Melody Ann Ucros

Masters in Big Data & Business Analytics Student @ IEBusinessSchool

Director of Operations @ Fundie, a social ventures consultancy and impact fund.

Follow me on Medium or reach out on LinkedIn

#bigdata #datascience #featureengineering #datacleaning

Melody Ann Ucros

Entrepreneurial Techie who loves helping startups, playing with data, leading projects & exchanging knowledge with impact-makers around the world.