Case Study: Applying a Data Science Process Model to a Real-World Scenario

Development of a machine learning model for materials planning in the supply chain

Jonas Dieckmann
Towards Data Science

--

In today’s rapidly changing environment, one of the most critical challenges facing companies is the ability to predict future demand accurately. This is especially true for supply chain teams, where accurate demand planning is vital for maintaining customer satisfaction and keeping costs under control.

In this case study, we will explore how a data science process model can help companies tackle this challenge hands-on by leveraging statistical forecasting methods. The goal of the fictitious company was to develop a more accurate demand planning process that reduces stock-outs, increases inventory turnover, and improves overall supply chain performance.

Image by Unsplash

This project is a powerful example of how data science can transform a business by unlocking new insights, increasing efficiency, and improving decision-making. I hope that this case study will help you to think about the potential applications in your organization and showcase how you can apply the process model DASC-PM successfully.

Please note that the entire article has also been published in the publication below and was written by Daniel Badura and Jonas Dieckmann:

Chapter 3: “Development of a Machine Learning Model for Materials Planning in the Supply Chain” in: Schulz et al. (2023): DASC-PM v1.1 Case Studies. Available from: https://www.researchgate.net/publication/368661660_DASC-PM_v11_Case_Studies

1. Domain and project description

SCHRAMME AG is a leading provider of dressings, band-aids, and bandages. Management believes that there is potential for qualitative optimization and cost savings in materials planning and the resulting production processes. Management assigns an internal project manager the task of developing a machine-learning-based model to plan the materials and requirements in the supply chain. Due to negative experiences in previous data science projects, it is proposed that this project should initially be developed using a process model.

The DASC-PM is chosen to ensure a structured and scientific process for project management. To gain an overview of the project assignment, the project manager initially works out various use cases that are then checked for suitability and feasibility. The suitable use cases then serve as the basis for figuring out the specific problems and the design of the project. This design is then checked again for suitability and feasibility.

Image by Unsplash

Starting point and use case development

The company currently plans manually and then produces over 2,500 different products. In the last few quarters, it increasingly faced inventory shortages for some product series, while for individual products inventories exceeded storage capacities. While the controlling department complains about rising storage costs due to imprecise planning, the demand planners lament the insufficient amount of time available for planning. For some time, the head of the supply chain has criticized the fact that planning is done solely manually and that the opportunities of digitalization are not being taken advantage of.

Project goals
One goal of the project is the development of a machine learning model with which a large part of the product requirements should be planned automatically in the future, based on various influencing factors. The demand planners should increasingly focus on the planning of important product groups and advertising. The system should take account of seasonality, trends, and market developments, and achieve a planning accuracy of 75%. This means that the forecast quantities for each product should deviate from actual requirements by no more than 25%. Order histories, inventory and sales figures for customers, and internal advertising plans should be used as potential data sources.
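To make the 75% target tangible: planning accuracy can be understood as one minus the absolute percentage deviation between forecast and actual demand per product. The following minimal sketch illustrates this calculation in Python; the column names and figures are illustrative assumptions, not data from the case study.

```python
import pandas as pd

# Illustrative forecast vs. actual demand per product (values are assumptions)
df = pd.DataFrame({
    "product":  ["A", "B", "C"],
    "forecast": [1200, 340, 80],
    "actual":   [1000, 400, 75],
})

# Absolute percentage deviation of the forecast from the actual requirement
df["deviation"] = (df["forecast"] - df["actual"]).abs() / df["actual"]

# Planning accuracy as 1 - deviation; the target is at least 75% per product
df["accuracy"] = 1 - df["deviation"]
df["meets_target"] = df["deviation"] <= 0.25

print(df)
```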

Phase 1: Project Order (Schulz et al. 2022)

Current team set-up
Along with the inclusion of the Supply Chain department, close collaboration with Sales and IT is also expected. The planning team in the Supply Chain department now consists of a global market demand planning team that deals with long-term planning (6–18 months) based on market developments, product life cycles, and strategic focus. In individual markets, there are local customer demand planning teams that implement short-term materials and advertising planning (0–6 months) for retail through the corresponding sales channels.

The data science model to be developed should support the monthly planning cycles and quantify short-term and long-term material requirements. The projection is then loaded into the internal planning software, where it should be analyzed and, if need be, supplemented or corrected. The final planning quantity will ultimately be used by the factories for production planning. To take account of customer- and product-specific expertise, seasonality, and experience from the past, individual members of the planning team should be included in the project, allocating up to 20% of their working hours to it.

Suitability Check
An important aspect of use case selection is the suitability check. The project manager examines whether the project can fundamentally be classified as feasible and whether the requirements can be met with the available resources. Expert interviews have shown that the problem in general is very well suited for the deployment of data science, and comparable projects have already been undertaken externally and published. The data science team confirmed that there is a sufficient number of potentially suitable methods for this project and that the required data sources are available.

Finally, the project manager analyzes feasibility. It is necessary to coordinate with the IT department to check the available infrastructure and the expertise of the involved employees. The available Microsoft cloud infrastructure and the data science team's experience with Databricks make the project appear fundamentally achievable. The project risk is classified as moderate overall since the planners assume a major role as controllers in the implementation phase and the results are checked.

Data Science Process Model DASC-PM (Schulz et al. 2022)

Project design

Based on the problem and specific aspects of the domains, the project manager, the head of the supply chain, and a data scientist are now responsible for formally designing the project.

The project objective is assumed to be an improvement in planning accuracy and a reduction in manual processes, tied to the aim of developing an appropriate model. According to an initial estimate, the cost framework totals EUR 650,000. A period of six months is proposed as the timeframe for the development, with an additional six months planned for process integration.

Since, in contrast to many other projects, full planning and a detailed description of the course of a project are usually not possible in the data science context, the project manager only prepares a project outline with the basic cornerstones already indicated in the previous sections. The budget includes financial resources for 1 full-time project manager, 2 full-time data scientists, and 0.5 full-time data engineers. As already mentioned, the demand planners should allocate roughly 20% of their working hours to share their expertise and experience.

The project as a whole should be handled with an agile working method, following the DASC-PM phases according to the Scrum methodology. The work is done iteratively in the areas of data provision, analysis, deployment, and application, with the preceding and following phase moving into focus in each phase. Back-steps are especially important if gaps or problems are found in key areas that can only be solved by returning to the previous phase. The project outline is prepared visually and placed in a highly visible area of the SCHRAMME AG office for all participants. The entire project description is then checked for suitability and feasibility once again before the process moves on to the next phase.

2. Data provision

Data preparation

SCHRAMME AG has several data sources that can be included in automatic planning. Besides the historical sales data from the ERP system, order histories and customer data from the CRM system are options, along with inventories and marketing measures. Azure Data Factory is used to prepare a cloud-based pipeline that loads, transforms, and integrates the data from the various source systems. The primary basis for the automatic forecasts should be the order histories; the remaining data is used either as background information for the planning teams or, if need be, to carry out cluster analyses in advance. In the initial phase of the project, the individual data sources still exhibit large differences in quality and structure. That is why adjustments are made together with the IT and technical departments so that the forecasts can later be prepared on a solid basis.

ELT data preparation process for analysis. Image by author
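The case study does not spell out the pipeline code, but a data integration step of this kind could look roughly like the following PySpark sketch on Databricks. Table and column names (raw.erp_order_history, raw.inventory, etc.) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source tables landed by Azure Data Factory
orders = spark.table("raw.erp_order_history")
stock = spark.table("raw.inventory")

# Aggregate order quantities to a monthly product level as the forecasting basis
monthly_demand = (
    orders
    .withColumn("month", F.trunc("order_date", "month"))
    .groupBy("product_id", "month")
    .agg(F.sum("quantity").alias("demand"))
)

# Enrich with inventory figures as background information for the planning teams
curated = monthly_demand.join(stock, on="product_id", how="left")

# Persist the curated layer for downstream analysis in Databricks
curated.write.format("delta").mode("overwrite").saveAsTable("curated.monthly_demand")
```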

Data management

The data management process is automated by data engineers and runs on a daily schedule to always remain up to date. To keep the complexity reasonable, the most promising data sources are processed first and the pipeline is then incrementally expanded with Continuous Integration / Continuous Deployment (CI/CD). After deployment, the processed data is stored in Azure Data Lake Storage, where it can be used for future analysis with Azure Databricks. The Data Lake also stores backups of the prepared data and analysis results as well as other data such as protocols, quality metrics, and credential structures. Write and read authorizations as well as plan versioning also ensure that only the latest planning period can be processed, so that values from the past no longer change.
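The rule that only the latest planning period may be modified could also be enforced in the pipeline itself, for example along the lines of the following hedged sketch (table names and the plan-version column are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table holding all stored plan versions
plans = spark.table("curated.plan_versions")

# Guard: only periods newer than anything already stored may be written,
# so values from the past can no longer change
latest_period = plans.agg(F.max("planning_period")).collect()[0][0]
new_plan = (
    spark.table("staging.new_plan")
    .filter(F.col("planning_period") > latest_period)
)

new_plan.write.format("delta").mode("append").saveAsTable("curated.plan_versions")
```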

Phase 2: Data Provision (Schulz et al. 2022)

Exploratory data analysis

An important step in data preparation is the exploratory data analysis (EDA), where various statistics and visualizations are produced to start with. This results in an overview of the distributions, outliers, and correlations in the data. The results of the EDA provide insights into characteristics to be considered in the next phase of the analysis. In a second step, feature selection and feature engineering are used to select the relevant characteristics or to produce new features. A dimension reduction method such as principal component analysis is applied for data with high dimensionality. The EDA provides information about the existing demand histories of SCHRAMME AG.

Example of results from exploratory data analysis. Image by author
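A typical EDA pass over the demand histories might look like the sketch below; the file name and the wide product-by-month layout are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical monthly demand history, one column per product
demand = pd.read_parquet("curated_monthly_demand.parquet")

# Distributions and correlations as a first overview
print(demand.describe())
print(demand.corr())

# Simple outlier flag: months more than three standard deviations from the mean
z_scores = (demand - demand.mean()) / demand.std()
print((z_scores.abs() > 3).sum())

# Dimension reduction over the high-dimensional product space
scaled = StandardScaler().fit_transform(demand.fillna(0))
pca = PCA(n_components=10)
pca.fit(scaled)
print(pca.explained_variance_ratio_.cumsum())
```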

3. Analysis

Identification of suitable analysis methods

The feasibility check at the beginning of the project made it clear that this project can and should be solved with data science methods. The two data scientists involved initially provide an overview of existing methods that are well suited to the problem at hand, which belongs to the regression problem class within supervised learning. Fundamentally, this is a type of time series analysis that can be extended with additional factors or multiple regression.

In connection with the key area of scientificity, the latest developments in research on comparable problems were examined. This showed that XGBoost, ARIMA, Facebook Prophet, and LightGBM are frequently named methods for this problem class. A data scientist documents the corresponding advantages and disadvantages of each method and sorts them by complexity and computational intensity. To obtain first indications of how well SCHRAMME AG's products can be modeled, the project team initially selects simpler models, adopting classical exponential smoothing and the ARIMA model family.
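As a rough illustration of these first, simpler candidates, exponential smoothing and an ARIMA model can be fitted per product with statsmodels. The data source and the chosen model orders are assumptions; in the project, they would be tuned per time series.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly demand series for a single product
series = pd.read_csv("demand_product_a.csv", index_col="month", parse_dates=True)["demand"]

# Classical exponential smoothing with additive trend and yearly seasonality
es_fit = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
es_forecast = es_fit.forecast(18)

# Simple ARIMA baseline; (p, d, q) would be optimized per product in practice
arima_fit = ARIMA(series, order=(1, 1, 1)).fit()
arima_forecast = arima_fit.forecast(18)
```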

Phase 3: Analysis (Schulz et al. 2022)

Application of analysis methods

Since multiple users are involved in the analysis process for this project, the team initially relies on a suitable notebook-based development environment in Databricks. Along the typical machine learning workflow, the code for the import and data cleaning is implemented first. To ensure validity, the underlying dataset is divided into training, validation, and test data using cross-validation. The selected methods are then applied to the training and validation datasets to optimize the models. In this context, the parameters of the methods are repeatedly tuned and, where sensible, the number of dimensions is reduced. The data scientists at SCHRAMME AG document the execution and validation results of the individual runs. The ARIMA family of models fundamentally performs better than exponential smoothing, even if the target accuracy of 75% cannot yet be achieved, with a current value of 62.4%. The RMSE and MAPE metrics also show potential for optimization.

Comparison of the ARIMA forecast with actual need. Image by author
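The case study does not show the validation code, but a rolling-origin (expanding window) evaluation with RMSE and MAPE could be sketched as follows; the data source, split sizes, and ARIMA order are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly demand series for a single product
series = pd.read_csv("demand_product_a.csv", index_col="month", parse_dates=True)["demand"]

def rolling_origin_validation(series, n_splits=4, horizon=3, order=(1, 1, 1)):
    """Expanding-window validation for a single demand series (simplified)."""
    rmses, mapes = [], []
    for i in range(n_splits):
        cutoff = len(series) - (n_splits - i) * horizon
        train = series.iloc[:cutoff]
        valid = series.iloc[cutoff:cutoff + horizon]
        forecast = ARIMA(train, order=order).fit().forecast(horizon)
        rmses.append(np.sqrt(mean_squared_error(valid, forecast)))
        mapes.append(mean_absolute_percentage_error(valid, forecast))
    return np.mean(rmses), np.mean(mapes)

rmse, mape = rolling_origin_validation(series)
print(f"RMSE: {rmse:.0f}, MAPE: {mape:.2%}, planning accuracy approx. {1 - mape:.1%}")
```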

The parameter configurations and the basis for selecting the final model after the first application iteration are documented and prepared for the project manager and the head of the supply chain in a way that is understandable for the business side. It becomes apparent in particular that some product groups have very unusual seasonality and that certain products are generally very difficult to predict. Even if the product portfolio of SCHRAMME AG was affected comparatively little by the temporary closures (lockdowns) during the corona pandemic, a slight decline in demand for dressing products has been observed. It is assumed that less activity and transport, as well as fewer accidents and injuries, account for this drop.

The trend can be modeled quite well with the analysis methods used. To move closer to the target accuracy, technically more complex methods, which had already proven relevant and applicable during the identification of suitable methods, are used in a further experiment. After several iterations of parameter optimization and cross-validation, the Prophet and XGBoost methods demonstrated the highest validation results at 73.4% and 65.8%, respectively.

The data scientists consider Prophet to be the most suitable method among the applied approaches and determine the planning accuracy on the test time series. Even if the accuracy of 73.4% is slightly below the target value, a significant improvement in planning accuracy is achieved. The MAPE is 16.64% and the RMSE 8,130, which implies a lower absolute deviation compared to the RMSE of the XGBoost method (10,134). Similar to the first experiment, however, there are product groups that are very difficult to predict overall (37.2%) and negatively impact the cumulative accuracy.

Performance comparison of various methods. Image by author
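For the more complex methods, the Prophet fit for a single product could look roughly like the following sketch (the input file and column names are assumptions; XGBoost would additionally require engineered lag features):

```python
import pandas as pd
from prophet import Prophet

# Hypothetical demand history in the column layout Prophet expects (ds = date, y = demand)
history = (
    pd.read_csv("demand_product_a.csv", parse_dates=["month"])
    .rename(columns={"month": "ds", "demand": "y"})
)

# Prophet models trend and yearly seasonality explicitly, which suits demand data
model = Prophet(yearly_seasonality=True)
model.fit(history)

# 18-month horizon, matching the planning window of the case study
future = model.make_future_dataframe(periods=18, freq="MS")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```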

Evaluation

The results of the analyses are used as the basis for a logical evaluation and classification by the head of the supply chain and the analysts, organized and moderated by the project manager. The metrics adopted for evaluation are the cumulative planning accuracy across all products, as defined in advance, together with the common RMSE and MAPE metrics. The department needs a realistic, trackable, and reliable basis for determining requirements at the product level.

Evaluation of the three best models. Image by author

The benchmark for planning accuracy is assumed to be the current (manually planned) median accuracy of 58% over the last two years. The evaluation of the results shows that many product groups overall can be planned with a high degree of accuracy by using the data science model and vastly exceed the benchmark. However, there are also product groups whose accuracy is similar to that of manual planning. Above all, the drainage product area needs to be discussed: it performs much worse with the model than with manual planning and appears unsuitable for a statistical calculation of requirements with the methods used to date.

Evaluation of the best model, distributed across product groups. Image by author

From a technical perspective, the head of the supply chain believes that it makes little sense to plan such product groups statistically, since only limited planning accuracy is possible due to their specific seasonal and trend-based characteristics. She recommends introducing an error threshold value on a product basis to determine which products should be predicted with the model and which product groups will be removed from the modeling and still planned manually. A value slightly below the current benchmark seems suitable as a threshold since, from the department's perspective, nearly equal accuracy with less manual effort is still an improvement on the way to achieving the project objective. The project manager documents the results of the evaluation together with the decisions and measures adopted.
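This threshold rule could be implemented as a simple filter over per-product validation results, for example like this (the product groups and accuracy values are illustrative; only the 58% benchmark comes from the case study):

```python
import pandas as pd

# Illustrative per-product-group validation results
results = pd.DataFrame({
    "product_group":  ["dressings", "band-aids", "bandages", "drainage"],
    "model_accuracy": [0.81, 0.77, 0.69, 0.37],
})

# Threshold slightly below the manual benchmark of 58%
THRESHOLD = 0.55

# Groups above the threshold are planned statistically, the rest stay manual
results["planning_mode"] = results["model_accuracy"].apply(
    lambda acc: "statistical" if acc >= THRESHOLD else "manual"
)
print(results)
```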

After the first full modeling run, the required quantities of all selected products for the next 18 months are documented as the analysis result. These can now be utilized and integrated into the teams' planning process.

4. Deployment

The team now enters the utilization phase of the DASC-PM for integration.

Phase 4: Deployment (Schulz et al. 2022)

Technical-methodological preparation

It is possible to rely on the existing infrastructure for deployment. The forecasts are loaded into the planning software IBM Planning Analytics, where they are tested and reprocessed. The so-called TurboIntegrator, a central component of IBM Planning Analytics, is used to automate the loading process. The OLAP structure of Planning Analytics allows for the creation of flexible views where users can choose their own context (time reference, product groups, etc.) and adjust calculations in real-time. Furthermore, the reporting software Qlik Sense is also integrated for more in-depth analyses. Here, the components of the time series (trend, seasonality, noise) can be visualized on the one hand, and additional information such as outliers and median values can be displayed on the other. The final plans are loaded into the Data Lake after processing by the planning teams so they can be referenced in the future.
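The case study does not describe the export format, but the hand-over to the TurboIntegrator load process could be as simple as a flat file with one row per product and month, as in this hypothetical sketch:

```python
import pandas as pd

# Hypothetical final forecast table from the analysis phase
forecast = pd.read_parquet("final_forecast.parquet")

# Flat export that a TurboIntegrator process could pick up and load into the cube
export = forecast[["product_id", "month", "forecast_quantity"]]
export.to_csv("forecast_export.csv", index=False, sep=";")
```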

Ensuring technical feasibility

The forecasts themselves are automatically regenerated at the beginning of the month. The planners can make their corrections during the first four working days of the month and view the results in the planning system in real-time. Since the algorithms run in a cloud environment, the computing power can be scaled if need be. To allow all processes to run automatically, changes to the data sources should be minimized. If there is a need for adjustment, the data engineer is informed and the interface document, which records all information on data sources and connections, is updated. The planning and forecasting system is a mixture of the cloud (Microsoft Azure) and an on-premise system (Planning Analytics), with the planners only having active access to the on-premise structures. Access rights are granted so that local planners only have access to their areas, while global planners can view all topics. After the end of the development phase, support services are mainly handled by the IT department. In the case of complex problems, data scientists or data engineers are also consulted.

Image by Unsplash

Ensuring applicability

Users of the solution are the local and global planning teams. Since team members have less of a technical background, training sessions are held to help them interpret the forecasts and classify their quality. The user interface is also designed with a focus on clarity and understandability. Simple line and bar charts for trends and benchmarks are used, along with tables reduced to what is most important. The users are included in the development from the beginning to ensure technical correctness and relevance and to build familiarity with the solution before the end of the development phase. In addition, complete documentation is drafted. The technical part of the documentation mostly builds on the interface document by demonstrating the data structures and connections, while the content part is jointly prepared with the users.

Technical preparation

To ensure that the new solution does not lose relevance or quality after a few months, work continues to be done on improvements after the completion of the first development phase, even if substantially less time is spent on it. The most important aspect of the ongoing improvement is the constant automated adjustment of the prediction model to new data. Other parts of the system still requiring manual work at the beginning are also automated over time. A change in various parameters such as the forecast horizon or threshold values for the accuracy of the prediction can be made by the planners themselves in Planning Analytics, with the model remaining flexible. Problems occurring after the release of the first version are entered via the IT ticket system and assigned to the data science area. At regular intervals, it is also checked whether the model still satisfies the expectations of the company or whether changes are necessary.
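A monthly retraining job of the kind described here could, in a simplified form, refit the selected Prophet model on the latest curated history; the file names, schedule, and parameters below are assumptions.

```python
import pandas as pd
from prophet import Prophet

def retrain_and_forecast(history: pd.DataFrame, horizon: int = 18) -> pd.DataFrame:
    """Refit the forecasting model on the latest history (simplified monthly job)."""
    model = Prophet(yearly_seasonality=True)
    model.fit(history)  # expects columns ds (month) and y (demand)
    future = model.make_future_dataframe(periods=horizon, freq="MS")
    return model.predict(future)[["ds", "yhat"]]

# Hypothetical monthly trigger: load the newest curated data and regenerate the plan
latest_history = pd.read_parquet("curated_monthly_demand_latest.parquet")
retrain_and_forecast(latest_history).to_parquet("forecast_latest.parquet")
```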

5. (Application) Use and summary

Phase 5: Application (Schulz et al. 2022)

With the transition to the use of the developed model, the Data Science Process Model (DASC-PM) enters its last phase. As a whole, SCHRAMME AG was able to achieve the objectives it had set in the supply chain area by using a structured and holistic approach. Additional or new projects can now be derived from here. The planning processes were largely automated and supported by machine learning algorithms. The relevant stakeholders in management, finance, and the supply chain were highly satisfied. After initial skepticism, the planning team itself is now also convinced by the reduction in workload and the possibility of prioritization. However, it is also conceivable that weak points will surface during use and that more iterations will be required in later phases.

The case study as a whole showed that non-linear process models in particular are advantageous for the area of data science. The DASC-PM is a suitable, novel process model that can be transferred to numerous other domains and problems.

Conclusion

In conclusion, data science plays an integral role in solving complex business problems by identifying hidden patterns and extracting actionable insights from data. Through this case study, we demonstrated how data science techniques can be used to develop predictive models that help businesses make informed decisions, e.g., in the supply chain.

While this case study focuses on demand planning, the process model can be used in various ways, such as for building personalized recommendations on e-commerce websites, identifying fraud in financial transactions, or predicting customer churn in telecom or subscription-based businesses.

However, it’s essential to note that real-world data science projects pose several challenges, such as data quality issues, lack of domain expertise, and inadequate communication between stakeholders. In comparison, fictitious case studies provide an idealized environment with clean, well-labeled data and well-defined problem statements. Thus, real-world projects require a pragmatic approach that takes into account various factors such as business objectives, data quality, computational resources, and ethical considerations. I am pretty sure you know this from your own experience. Do not underestimate reality!

In summary, data science has immense potential to transform industries and society and to create new opportunities for businesses. The DASC-PM (or any other) process model can help structure the approach logically and ensure clear guidance for both the business stakeholders and the project team itself.

Please let me know about your experience with data science projects. How do you structure them & what are the biggest challenges? Feel free to leave a comment!

Image by Unsplash

References

The whole case study has been published in:

[1] Schulz et al. (2023): DASC-PM v1.1 Case Studies. Available from: https://www.researchgate.net/publication/368661660_DASC-PM_v11_Case_Studies

Process images have been taken from:

[2] Schulz et al. (2022): DASC-PM v1.1 – A Process Model for Data Science Projects. NORDAKADEMIE gAG Hochschule der Wirtschaft. ISBN: 978-3-00-064898-4, DOI: 10.25673/32872.2
