
Complete Data Science Project: Data Understanding

A project on forecasting the number of orders in a manufacturing company listed on the stock exchange, based on the Global Management Challenge.


Photo by Chris Liverani on Unsplash

This article is part of my Data Science Project series, in which I present a guide to the world of Machine Learning, covering basic theoretical assumptions as well as practical solutions. Whether you are a beginner who is simply curious about artificial intelligence or you already work as a Data Scientist, I hope the series will reach each of you and motivate you to continue your work. The articles follow the phases of the Data Science process according to the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology:

  1. Business Understanding
  2. Data Understanding — you are here
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

You will find the whole project on GitHub.

Data Understanding as a phase of the Data Science Project Life Cycle, Image by Author, inspired by source

The goal of our business problem is to correctly predict the number of product orders, so the data we have is the raw material from which the solution will be built. In this phase, we need to obtain the data and then assess its usefulness.

Garbage In Garbage Out, Image by Author

We should understand its strengths and limitations because, according to the GIGO principle (Garbage In, Garbage Out), a model fed with poor data will also produce results of low quality. The data should be representative as well as useful for solving our problem.

Acquire Data

In the previous article, I mentioned that the reports are stored in spreadsheets. We selected the variables that, according to the business analysis, influence the number of products ordered. In several cases the data comes from earlier reports, which is logical: the past affects the future, not only our current actions. Moreover, as you can see, the variables differ between the markets. For the traditional markets we used advertising for the product and for the corporate image, as well as the commission and support for agents and distributors, and their number. For the Internet, we used the percentage of unsuccessful visits to the website and the website development spending. We also included aspects of the company’s operation, such as the number of training days for employees, the management budget and the product assembly time. Furthermore, we used the backlog of orders (which appeared when there were more orders than products actually available), the customers’ evaluation of the product and the price at which we wanted to sell it. We noticed that 4 different histories (scenarios) recur throughout the contest, so we took the history as a variable, together with the decision cycle, to capture seasonality. When selecting the variables, we were careful to avoid look-ahead bias: data that only becomes known after the target is observed looks very valuable (it correlates strongly with the predicted variable) but is actually worthless for forecasting. For each variable we noted the row of the report it comes from and whether it is taken from the current report (c) or the previous one (p).

Variables and their location in the report, Image by Author

We have 417 reports, 20 of which are the starting histories the teams receive, because, as I mentioned earlier, there were 4 scenarios and each history consisted of 5 reports. We wrote a script that collected all the reports into files with variables, split by product and sales area.

Script to connect reports, Image by Author
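
The image above shows the actual script; below is only a minimal sketch of the idea, assuming the reports sit as .xlsx files in a reports/ directory. The file layout and cell coordinates are hypothetical placeholders, not the real report structure.

import glob
import pandas as pd

rows = []
for path in glob.glob("reports/*.xlsx"):
    raw = pd.read_excel(path, header=None)
    # Pull each variable from its known position in the report;
    # the row/column indices below are illustrative only.
    rows.append({
        "NumOrders": raw.iat[10, 1],
        "DirAdv": raw.iat[15, 1],
        "Price": raw.iat[20, 1],
    })

data_Europe = pd.DataFrame(rows)
data_Europe.to_csv("data_Europe.csv", index=False)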

I decided against creating a single model, because the variables that affected the number of orders sometimes differed between areas. In the end, 3 datasets were created, one for each sales area. Thanks to this approach, we had more examples than the number of reports.

Exploration

Let’s have a look at our data and see the first five rows of each dataset.
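
In pandas this takes a single call per dataset; a minimal sketch, assuming the three DataFrames are already loaded:

for name, df in [("Europe", data_Europe), ("Nafta", data_Nafta),
                 ("Internet", data_Internet)]:
    print(name)
    print(df.head())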

The first five rows in the data_Europe set, Image by Author
The first five rows in the data_Nafta set, Image by Author
The first five rows in the data_Internet set, Image by Author

Let’s check the number of rows in each dataset, what types the variables have and whether all values are non-null.
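
The listings below were presumably produced with pandas’ info(); a minimal sketch:

for df in (data_Europe, data_Nafta, data_Internet):
    df.info()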

RangeIndex: 1190 entries, 0 to 1189
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   NumOrders       1190 non-null   int64   
 1   History         1190 non-null   category
 2   Cycle           1190 non-null   category
 3   DirAdv          1190 non-null   float64 
 4   CorpAdv         1190 non-null   float64 
 5   Commission      1190 non-null   float64 
 6   AgentsDistr     1190 non-null   int64   
 7   Training        1190 non-null   float64 
 8   ManagBudget     1190 non-null   float64 
 9   BacklogOrders   1190 non-null   int64   
 10  RandD           1190 non-null   float64 
 11  Price           1190 non-null   float64 
 12  AssemblyTime    1190 non-null   float64 
 13  MarketShares_p  1190 non-null   float64 
 14  MeanPrice       1190 non-null   float64 
 15  NumSales_p      1190 non-null   int64   
 16  Product         1190 non-null   category
dtypes: category(3), float64(10), int64(4)
memory usage: 134.3 KB
RangeIndex: 1191 entries, 0 to 1190
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   NumOrders       1191 non-null   int64   
 1   History         1191 non-null   category
 2   Cycle           1191 non-null   category
 3   DirAdv          1191 non-null   float64 
 4   CorpAdv         1191 non-null   float64 
 5   Commission      1191 non-null   float64 
 6   AgentsDistr     1191 non-null   int64   
 7   Training        1191 non-null   float64 
 8   ManagBudget     1191 non-null   float64 
 9   BacklogOrders   1191 non-null   int64   
 10  RandD           1191 non-null   float64 
 11  Price           1191 non-null   float64 
 12  AssemblyTime    1191 non-null   float64 
 13  MarketShares_p  1191 non-null   float64 
 14  MeanPrice       1191 non-null   float64 
 15  NumSales_p      1191 non-null   int64   
 16  Product         1191 non-null   category
dtypes: category(3), float64(10), int64(4)
memory usage: 134.4 KB
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   NumOrders       1187 non-null   int64   
 1   History         1187 non-null   category
 2   Cycle           1187 non-null   category
 3   DirAdv          1187 non-null   float64 
 4   CorpAdv         1187 non-null   float64 
 5   Support         1187 non-null   float64 
 6   FailedVisits    1187 non-null   float64 
 7   Training        1187 non-null   float64 
 8   ManagBudget     1187 non-null   float64 
 9   WebDev          1187 non-null   float64 
 10  BacklogOrders   1187 non-null   int64   
 11  RandD           1187 non-null   float64 
 12  Price           1187 non-null   float64 
 13  AssemblyTime    1187 non-null   float64 
 14  MarketShares_p  1187 non-null   float64 
 15  MeanPrice       1187 non-null   float64 
 16  NumSales_p      1187 non-null   int64   
 17  Product         1187 non-null   category
dtypes: category(3), float64(12), int64(3)
memory usage: 143.2 KB

The difference in the number of rows between the datasets is due to the fact that some teams decided not to sell products in a particular area. There are almost 1,200 examples in each dataset, which makes it quite a small set for this number of variables. In general, the larger the sample, the better it represents the population, provided the sample is not biased. There are no null values, but that does not mean that all the data is correct; there may still be errors, which we will check in the next analyses. The variables History, Cycle and Product are categorical. The remaining variables are numerical, and for these we will present the most important summary metrics.
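
The summary tables that follow correspond to pandas’ describe(), which reports the count, mean, standard deviation, quartiles and min/max of every numerical column; a minimal sketch:

# Categorical columns are excluded from the summary automatically.
for df in (data_Europe, data_Nafta, data_Internet):
    print(df.describe())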

Summary of all numerical variables for data_Europe, Image by Author
Summary of all numerical variables for data_Nafta, Image by Author
Summary of all numerical variables for data_Internet, Image by Author

Most orders were placed in Europe and the fewest in the Nafta area. The conclusions are similar for each area. The 75th percentile of the number of orders is significantly smaller than the maximum value, so we can expect outliers here; such situations probably arose when a very strong team played the Global Management Challenge, or when a team set high advertising spend and/or low prices. The advertising variables also show outliers close to the maximum. Half of the teams had no BacklogOrders and the rest had very low values, though extremes occurred here too. The biggest problem seems to be MarketShares_p, which contains a lot of 0 values. This is an impossible situation: if a team sold products, it must have had some market share. The zeros appear because a team had to buy a market-research service to obtain this information, and many teams chose not to, even though it is valuable information. We will have to deal with this problem in the data preparation phase.

Let’s see what percentage of the values are equal to 0, which we consider invalid.
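
A sketch that reproduces the printout below, assuming the zeros we are counting live in the MarketShares_p column:

for name, df in [("Europe", data_Europe), ("Nafta", data_Nafta),
                 ("Internet", data_Internet)]:
    pct = round((df["MarketShares_p"] == 0).mean() * 100, 2)
    print(f"In the {name} dataset, {pct}% of the values are equal to 0.")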

In the Europe dataset, 50.59% of the values are equal to 0.
In the Nafta dataset, 50.63% of the values are equal to 0.
In the Internet dataset, 50.8% of the values are equal to 0.

The slight differences result from the different numbers of rows. Half of the values are missing, and this data is very valuable because it shows the company’s position in the market.

Let’s group the data by product.

_dataset ➜ data_Europe | data_Nafta | data_Internet
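
A sketch of the grouping, run once per dataset; taking the mean per product is my assumption about what the tables show, and numeric_only keeps pandas from trying to average the categorical columns:

# Mean of every numerical column for each product.
data_Europe.groupby("Product").mean(numeric_only=True)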

Group data_Europe by Product, Image by Author
Group data_Nafta by Product, Image by Author
Group data_Internet by Product, Image by Author

Product 1 sells the most and Product 3 the least. Nafta is the area where the least money is spent on advertising; this is because, in every scenario, it is a developing area where the business has only recently started. Some variables have identical values across products and areas, simply because they do not vary by product or area. It should also be noted that the teams’ average product price is highest on the Internet.

Group Number of orders by History and Cycle, Image by Author

The cycle number and history are variables that seem to represent seasonality well. Moreover, the number of orders increases with each subsequent cycle (company expansion), although there are deviations from this rule (market downturns, etc.).

A boxplot is a convenient and clear way to present basic statistics. It consists of a box and whiskers. The box boundaries are the lower and upper quartiles, i.e. the values below which 25% and 75% of the observations fall, so the height of the box equals the interquartile range, IQR = Q3 - Q1. The median always lies inside the box. The whiskers extend to the most extreme observations within 1.5 times the IQR of the box edges; values beyond them are marked as outliers. Of course, a sample may contain no such outliers, in which case the whiskers end at the minimum and maximum of the sample, respectively.
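
A sketch of how such a panel of boxplots can be drawn with seaborn, one subplot per numerical column (the grid size is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

num_cols = data_Europe.select_dtypes("number").columns
fig, axes = plt.subplots(4, 4, figsize=(16, 12))
for ax, col in zip(axes.flat, num_cols):
    sns.boxplot(y=data_Europe[col], ax=ax)
plt.tight_layout()
plt.show()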

Boxplot for numerical data, Image by Author

Okay, we can see that the data is very similar across the areas; the most important differences come from the different variables and the different demand. From here on, the analysis will concern only the Europe area, and it is on this set that we will carry out data preparation, modeling and deployment. If the results are satisfactory, the same solution can be applied to the other areas. Let’s see how often the categorical values appear.
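
Counting category frequencies is straightforward with value_counts(); a minimal sketch:

for col in ("History", "Cycle", "Product"):
    print(data_Europe[col].value_counts(), "\n")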

The frequency for categorical data in Europe, Image by Author

The data is not balanced with respect to history: most reports come from history 2 and the fewest from history 1. With each subsequent cycle we have fewer reports, because some teams did not save their last reports.

A pairplot is a grid of scatter plots for all pairs of numerical variables. The observations are split into subgroups according to the Product variable. In addition, a histogram with a kernel density estimate, i.e. the empirical distribution of the analyzed quantitative variable, is drawn for each variable.
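
With seaborn this is one call; a minimal sketch:

import seaborn as sns

# Scatter plots for all pairs of numerical variables, coloured by product,
# with kernel density estimates on the diagonal.
sns.pairplot(data_Europe, hue="Product", diag_kind="kde")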

Pairplot for Europe, Image by Author

The heat map below shows Pearson’s linear correlation for Europe. A value of 1 indicates a strong positive linear correlation and -1 a strong negative one. A value of 0 means no linear correlation, which does not rule out some other, non-linear relationship. From the map we can estimate the number of closely related variables.
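
A sketch of such a heat map; numeric_only restricts the correlation to the numerical columns:

import matplotlib.pyplot as plt
import seaborn as sns

corr = data_Europe.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()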

Correlation heatmap for Europe, Image by Author

PCA is used to reduce dimensionality. We get rid of unnecessary data and thereby remove noise, grouping features that carry the same information. The chart shows how many dimensions correspond to a given percentage of the variance: 90% of the variance/information can be described with half the number of variables.
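
A sketch of the explained-variance curve with scikit-learn; standardising the features first and dropping the target NumOrders are my assumptions here:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = data_Europe.select_dtypes("number").drop(columns=["NumOrders"])
X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(0.9, linestyle="--")  # the 90% variance threshold
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()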

PCA explained variance, Image by Author

We now have an overview of the variables. If you think I could have done something better, or that something is missing, please let me know.

See you in the next post!
