COMPLETE DATA SCIENCE PROJECT

This article is part of the Data Science Project series, a guide to the world of Machine Learning in which I describe the basic theoretical assumptions as well as practical solutions. Whether you are a beginner who is simply curious about artificial intelligence or you already work as a Data Scientist, I hope the series will reach you and motivate you in your own work. The series is divided into the phases of the Data Science process according to the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology:
- Business Understanding
- Data Understanding — you are here
- Data Preparation
- Modeling
- Evaluation
- Deployment
You will find the whole project on GitHub.

The goal of our business problem is to correctly predict the number of product orders, so the data we have is the material from which the solution will be built. In this phase we need to acquire the data and then assess its usefulness.

We should understand its strengths and limitations because, according to the GIGO principle (Garbage In, Garbage Out), a model built on poor data will produce results of equally poor quality. The data should be representative of the problem and genuinely useful for solving it.
Acquire Data
In the previous article, I mentioned that the reports are stored in spreadsheets. We selected the variables that, according to the business analysis, influence the number of products ordered. In several cases the data comes from earlier reports, which is logical: the past affects the future, not only our current decisions. Moreover, as you can see, the variables differ between the markets. We used advertising for the product and corporate (image) advertising, the commission and support for agents and distributors, and their number. For the Internet market we added the percentage of unsuccessful visits to the website and the website development budget. We also included variables describing the company's operations, such as the number of training days for employees, the management budget and the product assembly time. Furthermore, we used the backlog of orders (which appears when there are more orders than products actually available), the customers' evaluation of the product and the price at which we wanted to sell it. We noticed that four different histories circulate in the competition, so we took the history as a variable, along with the decision cycle, to capture seasonality. When selecting the variables, we kept the look-ahead phenomenon in mind: data that only becomes known after the target is observed looks very valuable (it correlates strongly with the predicted variable) but is in fact worthless. For each variable we noted the line of the report it comes from and whether it comes from the current report (c) or the previous one (p).

We have 417 reports, 20 of which are the starting histories that the teams receive at the beginning, because, as I mentioned earlier, there were 4 scenarios and each history consists of 5 reports. We wrote a script that collects all the reports into files with the variables grouped by product and sales area, as sketched below.
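A minimal sketch of such a collection script, assuming the reports are .xlsx files in a reports/ directory and that VARIABLE_CELLS maps each variable to its cell in the report sheet (the directory, sheet layout and cell addresses are placeholders, not the actual report format):

```python
import glob
import pandas as pd
from openpyxl import load_workbook

# Hypothetical mapping of variable name -> cell in the report sheet.
VARIABLE_CELLS = {"DirAdv": "B12", "CorpAdv": "B13", "Price": "B20"}

rows = []
for path in glob.glob("reports/*.xlsx"):
    sheet = load_workbook(path, data_only=True).active
    row = {name: sheet[cell].value for name, cell in VARIABLE_CELLS.items()}
    row["report"] = path
    rows.append(row)

# One row per report; the real script additionally split the rows
# by product and sales area before writing them to files.
pd.DataFrame(rows).to_csv("collected_reports.csv", index=False)
```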

I decided not to create a single model, because in the different areas different variables sometimes affected the number of orders. In the end, three datasets were created, one for each sales area. Thanks to this approach we have more examples than the number of reports.
Exploration
Let’s have a look at our data and see the first five rows of each dataset.
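For example, with pandas (the CSV file names are placeholders; the dataframe names data_Europe, data_Nafta and dataInternet match the grouping used later in the article):

```python
import pandas as pd

# Load the three per-area datasets produced by the collection script.
data_Europe = pd.read_csv("data_Europe.csv")
data_Nafta = pd.read_csv("data_Nafta.csv")
dataInternet = pd.read_csv("data_Internet.csv")

for name, df in [("Europe", data_Europe), ("Nafta", data_Nafta), ("Internet", dataInternet)]:
    print(name)
    print(df.head())  # first five rows of each dataset
```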



Let’s check the number of rows in each dataset, the types of the variables, and whether all values are non-null.
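A quick way to get this overview is pandas’ info() method, continuing with the dataframes loaded above:

```python
# History, Cycle and Product are stored as the pandas "category" dtype.
for df in (data_Europe, data_Nafta, dataInternet):
    for col in ["History", "Cycle", "Product"]:
        df[col] = df[col].astype("category")
    df.info()  # row count, dtypes and non-null counts
```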
RangeIndex: 1190 entries, 0 to 1189
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NumOrders 1190 non-null int64
1 History 1190 non-null category
2 Cycle 1190 non-null category
3 DirAdv 1190 non-null float64
4 CorpAdv 1190 non-null float64
5 Commission 1190 non-null float64
6 AgentsDistr 1190 non-null int64
7 Training 1190 non-null float64
8 ManagBudget 1190 non-null float64
9 BacklogOrders 1190 non-null int64
10 RandD 1190 non-null float64
11 Price 1190 non-null float64
12 AssemblyTime 1190 non-null float64
13 MarketShares_p 1190 non-null float64
14 MeanPrice 1190 non-null float64
15 NumSales_p 1190 non-null int64
16 Product 1190 non-null category
dtypes: category(3), float64(10), int64(4)
memory usage: 134.3 KB
RangeIndex: 1191 entries, 0 to 1190
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NumOrders 1191 non-null int64
1 History 1191 non-null category
2 Cycle 1191 non-null category
3 DirAdv 1191 non-null float64
4 CorpAdv 1191 non-null float64
5 Commission 1191 non-null float64
6 AgentsDistr 1191 non-null int64
7 Training 1191 non-null float64
8 ManagBudget 1191 non-null float64
9 BacklogOrders 1191 non-null int64
10 RandD 1191 non-null float64
11 Price 1191 non-null float64
12 AssemblyTime 1191 non-null float64
13 MarketShares_p 1191 non-null float64
14 MeanPrice 1191 non-null float64
15 NumSales_p 1191 non-null int64
16 Product 1191 non-null category
dtypes: category(3), float64(10), int64(4)
memory usage: 134.4 KB
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NumOrders 1187 non-null int64
1 History 1187 non-null category
2 Cycle 1187 non-null category
3 DirAdv 1187 non-null float64
4 CorpAdv 1187 non-null float64
5 Support 1187 non-null float64
6 FailedVisits 1187 non-null float64
7 Training 1187 non-null float64
8 ManagBudget 1187 non-null float64
9 WebDev 1187 non-null float64
10 BacklogOrders 1187 non-null int64
11 RandD 1187 non-null float64
12 Price 1187 non-null float64
13 AssemblyTime 1187 non-null float64
14 MarketShares_p 1187 non-null float64
15 MeanPrice 1187 non-null float64
16 NumSales_p 1187 non-null int64
17 Product 1187 non-null category
dtypes: category(3), float64(12), int64(3)
memory usage: 143.2 KB
The difference in the number of rows between the datasets comes from the fact that some teams decided not to sell products in a particular area. There are almost 1,200 examples in each dataset, which is quite a small set for this number of variables. In general, the larger the sample, the better it represents the population, provided the sample is not biased. There are no null values, but that does not mean all the data is correct; there may still be errors, which we will check in the next analyses. The variables History, Cycle and Product are categorical. The remaining variables are numerical, and for them we will present the most important metrics.



Most orders were placed in Europe and the fewest in the Nafta area, and we can draw similar conclusions from the data of each area. The 75th percentile is significantly smaller than the maximum value, so we can expect outliers here; such a situation probably occurred when a very good team played the Global Management Challenge, or when a team set high advertising budgets and/or low prices. The advertising variables also show outliers close to the maximum. Half of the teams had no BacklogOrders and the rest had very low values, although there were extremes here as well. The biggest problem seems to be MarketShares_p, where there are a lot of 0 values. This is an impossible situation: if a team sold products, it must have had some market share. The missing data results from the fact that a team had to buy this service to obtain the information, and many teams decided not to, even though it is valuable. We will have to deal with this problem in the data preparation phase.
Let’s see what percentage of the values are equal to 0, which we consider to be false.
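One way to compute this (a sketch assuming the zeros we are counting are in the MarketShares_p column discussed above):

```python
# Percentage of MarketShares_p values equal to 0 in each dataset.
for name, df in [("Europe", data_Europe), ("Nafta", data_Nafta), ("Internet", dataInternet)]:
    pct_zero = (df["MarketShares_p"] == 0).mean() * 100
    print(f"In {name} dataset {pct_zero:.2f}% of the values are equal to 0.")
```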
In the Europe dataset 50.59% of the values are equal to 0.
In the Nafta dataset 50.63% of the values are equal to 0.
In the Internet dataset 50.8% of the values are equal to 0.
The slight differences result from the different numbers of rows. Half of the data is missing, and it is very valuable data because it shows the company’s position on the market.
Group data by product.
_dataset ➜ data_Europe | data_Nafta | dataInternet
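A minimal version of that grouping (the caption above means the same code is run for each of the three dataframes):

```python
# Mean of every numeric variable per product; run once per area dataset.
_dataset = data_Europe  # or data_Nafta, dataInternet
print(_dataset.groupby("Product").mean(numeric_only=True))
```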



Product 1 is sold the most and Product 3 the least. Nafta is the area where the least money is spent on advertising, because in every scenario it starts as a developing area where the business has only recently begun. Some variables have the same values, because they do not differ across products and areas. It is also worth noting that the average price of the group’s products is highest on the Internet.

The cycle number and the history are variables that seem to represent seasonality well. Moreover, with each subsequent cycle the number of orders increases (the company expands), although there are deviations from this rule (market downturns, etc.).
A boxplot (box-and-whisker plot) is a convenient and clear way of presenting basic statistics. It consists of a box and whiskers. The box boundaries are determined by the lower and upper quartiles, i.e. the values below which 25% and 75% of the observations fall, respectively. The height of the box therefore equals the IQR = Q3 − Q1 (interquartile range). The median always lies inside the box. The whiskers extend up to 1.5 × IQR beyond the box; values higher or lower than that are called outliers. Of course, it may happen that there are no such outliers in the sample, in which case the whiskers end at the minimum and the maximum of the sample, respectively.
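For instance, a boxplot of the number of orders per decision cycle can be drawn with seaborn (a sketch; the plots in the article may use other variable combinations):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box and whiskers of NumOrders for each decision cycle in the Europe dataset.
sns.boxplot(data=data_Europe, x="Cycle", y="NumOrders")
plt.show()
```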

Okay, we can see that the data from the different areas are very similar; the most important differences come from the area-specific variables and the different demand. From now on the analysis will concern only the Europe area, and on this dataset we will also carry out data preparation, modeling and deployment. If the results are good, the solution can be transferred to the other areas. Let’s see how often each categorical value appears.
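The frequencies can be checked with value_counts(), for example:

```python
# How often each category appears in the Europe dataset.
for col in ["History", "Cycle", "Product"]:
    print(data_Europe[col].value_counts(), "\n")
```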

The data is not balanced with respect to history: most of the reports come from history 2 and the fewest from history 1. With each subsequent cycle we have fewer reports, because some teams did not save their last reports.
A pairplot is a grid of scatter plots for every pair of numerical variables. The observations are split into subgroups by the Product variable. In addition, for each variable a histogram with a density estimate is drawn, i.e. the empirical distribution of the analyzed quantitative variable.
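A seaborn sketch of such a pairplot (density estimates on the diagonal, points coloured by Product):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plots for every pair of numeric variables, split by Product,
# with a density estimate of each variable on the diagonal.
sns.pairplot(data_Europe, hue="Product", diag_kind="kde")
plt.show()
```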

A heat map with Pearson’s linear correlation for Europe is presented. A value of 1 indicates a strong positive linear relationship and -1 a strong negative one. A value of 0 means no linear correlation, which does not rule out some other, non-linear relationship. From the map we can estimate how many variables are closely related.
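Such a map can be produced, for example, with:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between the numeric variables of the Europe dataset.
corr = data_Europe.select_dtypes("number").corr()  # Pearson by default
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f")
plt.show()
```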

PCA (Principal Component Analysis) is used to reduce the number of dimensions. We get rid of unnecessary data and thereby remove noise, grouping features that carry the same information. The plot shows how many dimensions are needed to preserve a given percentage of variance: 90% of the variance/information can be described with half the number of variables.
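A sketch of such a check with scikit-learn (standardising the numeric predictors first and plotting the cumulative explained variance; dropping NumOrders as the target is my assumption):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise the numeric predictors and fit PCA with all components.
X = data_Europe.select_dtypes("number").drop(columns=["NumOrders"])
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first n components.
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_.cumsum(), marker="o")
plt.xlabel("Number of dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()
```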

We now have an overview of the variables. If you think I could have done something better or added something, please write it.
See you in the next post!