
Data Types From A Machine Learning Perspective With Examples


Almost anything can be turned into data. A solid understanding of the different data types is a prerequisite for Exploratory Data Analysis (EDA) and for feature engineering for Machine Learning models. You may also need to convert the data types of some variables so that you can choose appropriate visual encodings in data visualization and storytelling.

Most data can be categorized into four basic types from a Machine Learning perspective: numerical data, categorical data, time series data, and text.

Data Types From A Machine Learning Perspective

Numerical Data

Numerical data is any data where the data points are exact numbers. Statisticians also call numerical data quantitative data. This data has meaning as a measurement, such as house prices, or as a count, such as the number of residential properties in Los Angeles or the number of houses sold in the past year.

Numerical data can be either continuous or discrete. Continuous data can assume any value within a range, whereas discrete data can take only distinct, separate values.

Numerical Data

For example, the number of students taking a Python class is a discrete data set. You can only have whole-number values like 10, 25, or 33. A class cannot have 12.75 students enrolled: a student either joins the class or does not. Continuous data, on the other hand, are numbers that can fall anywhere within a range. For example, a student could have an average score of 88.25, which falls between 0 and 100.

The takeaway here is that numerical data is not ordered in time. It is just a set of numbers that we have collected.
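The distinction between discrete and continuous values can be sketched in a few lines of Python. The values below are made up for illustration, and `is_discrete` is a hypothetical helper, not a standard function. It is only a rough heuristic: a continuous measurement can happen to land on a whole number.

```python
from statistics import mean

# Hypothetical example values, not taken from a real dataset.
enrollment = [10, 25, 33]          # discrete: whole-number counts of students
avg_scores = [88.25, 72.5, 91.0]   # continuous: any value between 0 and 100


def is_discrete(values):
    """Rough heuristic: treat a column as discrete if every value is a whole number."""
    return all(float(v).is_integer() for v in values)


print(is_discrete(enrollment))  # True
print(is_discrete(avg_scores))  # False

# A summary statistic of discrete data can itself be fractional,
# even though no single class can have a fractional enrollment.
print(mean(enrollment))
```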

Categorical Data

Categorical data represents characteristics, such as a hockey player’s position, team, or hometown. Categorical data can take numerical values. For example, we might use 1 for the colour red and 2 for blue. But these numbers don’t have a mathematical meaning. That is, we can’t meaningfully add them together or take the average.

In the context of supervised classification, categorical data would be the class label. Examples include whether a person is a man or a woman, or whether a property is residential or commercial.
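As a minimal sketch of why integer codes for categories carry no arithmetic meaning, the snippet below codes a made-up colour column as integers and then as one-hot vectors. One-hot encoding is one common alternative and is not something this article prescribes. The code 3 for green and the `one_hot` helper are assumptions added for illustration.

```python
# Hypothetical colour column; the integer codes below are arbitrary labels.
colours = ["red", "blue", "blue", "red", "green"]

# Integer coding: 1 for red and 2 for blue as in the text, 3 for green (assumed).
codes = {"red": 1, "blue": 2, "green": 3}
encoded = [codes[c] for c in colours]
# The mean of `encoded` can be computed, but it does not describe any colour.


def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows


categories, rows = one_hot(colours)
print(categories)  # ['blue', 'green', 'red']
print(rows[0])     # [0, 0, 1], the row for "red"
```

Unlike the integer codes, the one-hot rows do not imply that one colour is larger than another.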

There is also something called ordinal data, which in some sense is a mix of numerical and categorical data. In ordinal data, the data still falls into categories, but those categories are ordered or ranked in some particular way. An example would be class difficulty, such as beginner, intermediate, and advanced. We could label classes with these three levels, and the levels have a natural order of increasing difficulty.

Another example is taking quantitative data and splitting it into groups, which gives us ordered bins or categories derived from another type of data.
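Splitting a quantitative variable into ordered bins might look like the sketch below. The bin edges (60 and 80), the labels, and the scores are hypothetical choices for illustration only.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for an average score between 0 and 100.
edges = [60, 80]                    # boundaries between neighbouring bins
labels = ["low", "medium", "high"]  # ordered from lowest to highest


def to_bin(score):
    """Map a score to an ordered category: below 60 low, [60, 80) medium, 80 and up high."""
    return labels[bisect_right(edges, score)]


scores = [88.25, 59.9, 60, 72.5]
print([to_bin(s) for s in scores])  # ['high', 'low', 'medium', 'medium']
```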

Ordinal Data

For plotting purposes, ordinal data is treated in much the same way as categorical data, but the groups are usually ordered from lowest to highest so that the ordering is preserved.
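Keeping ordinal groups in rank order for a plot axis could be done roughly as follows. The rank mapping and the class counts are assumptions made up for this sketch, and the integers only encode order, not distance between levels.

```python
# Assumed ranking for the class-difficulty example; the integers only encode order.
DIFFICULTY_RANK = {"beginner": 0, "intermediate": 1, "advanced": 2}

# Hypothetical number of classes offered at each difficulty level.
class_counts = {"advanced": 4, "beginner": 12, "intermediate": 7}

# Alphabetical order would put "advanced" first and hide the natural ranking.
alphabetical = sorted(class_counts)

# Sorting by rank keeps the groups ordered from lowest to highest.
by_rank = sorted(class_counts, key=DIFFICULTY_RANK.get)

print(alphabetical)  # ['advanced', 'beginner', 'intermediate']
print(by_rank)       # ['beginner', 'intermediate', 'advanced']
```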

Time Series Data

Time series data is a sequence of numbers collected at regular intervals over some period of time. It is very important, especially in fields like finance. Each data point has a temporal value attached to it, such as a date or a timestamp, which lets you look for trends over time.

For example, we might measure the average number of home sales over many years. The difference between time series data and numerical data is that, rather than being a collection of numerical values with no time ordering, time series data has an implied ordering: there is a first data point collected and a last data point collected.
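The role of the implied ordering can be illustrated with a small sketch. The yearly home-sales figures below are invented for the example. The point is only that operations such as year-over-year change depend on the data being in time order.

```python
from datetime import date

# Hypothetical yearly home-sales figures, deliberately listed out of order.
observations = [
    (date(2019, 12, 31), 480),
    (date(2017, 12, 31), 510),
    (date(2018, 12, 31), 455),
    (date(2020, 12, 31), 550),
]

# Sorting by timestamp restores the implied ordering.
series = sorted(observations, key=lambda obs: obs[0])
first, last = series[0], series[-1]

# Year-over-year change only makes sense once the points are in time order.
changes = [
    (later[0].year, later[1] - earlier[1])
    for earlier, later in zip(series, series[1:])
]

print(first[0].year, last[0].year)  # 2017 2020
print(changes)                      # [(2018, -55), (2019, 25), (2020, 70)]
```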

CREA

Text

Text data is basically just words. Often the first thing you do with text is turn it into numbers using a technique such as the bag-of-words representation.

These are the four types of data from a Machine Learning perspective. The type of data you have affects the algorithms you can use for feature engineering and modeling, and the questions you can ask of the data.
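A bare-bones bag-of-words representation can be sketched with the standard library alone. The two-sentence corpus and the helper names `tokenize` and `bag_of_words` are hypothetical. Real projects would more likely rely on an existing library vectorizer.

```python
import re
from collections import Counter

# Hypothetical two-document corpus.
docs = [
    "The house sold fast",
    "The house price and the land price rose",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# The vocabulary is every distinct word across the corpus, in a fixed order.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})


def bag_of_words(text):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length vector of word counts, which is the numeric form most models expect.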

Let me know if you have any questions or comments. I would like to write an article about feature engineering based on different data types in the future. Thank you for reading.

