
Notes from Industry

Integrating Scikit-learn Machine Learning models into the Microsoft .NET ecosystem using Open Neural Network Exchange (ONNX) format

Using the ONNX format to deploy a trained Scikit-learn lead scoring predictive model into the .NET ecosystem

Photo by Miguel Á. Padriñán from Pexels

As part of a team working on the design and development of a lead scoring system prototype, I faced the challenge of integrating machine learning models into a target environment built around the Microsoft .NET ecosystem. Technically, I implemented the lead scoring predictive model using Scikit-learn's built-in Logistic Regression algorithm. For the phases of initial data analysis, data preprocessing, exploratory data analysis (EDA), and data preparation for the model building itself, I used the Jupyter Notebook environment powered by the Anaconda distribution for scientific computing in Python. I had previously explored serving Python models through Flask, a micro web framework written in the same language. This time, however, the goal was to deploy the machine learning model written in Python into the .NET ecosystem, using the C# programming language and the Visual Studio IDE.

* Note: The source code presented in this article is for demonstration purposes only, simplified to emphasize the core concepts, the libraries, and the conversion approach.

What is Lead Scoring?

Lead scoring is a shared sales and marketing methodology for ranking leads according to their sales readiness for the company. It assigns values to the leads in the company's database, guiding the marketing and sales teams as prospects are converted into "hot leads" and, ultimately, customers/clients. From a strategic point of view, lead scoring is a significant process for adjusting the company's strategy and improving the sales teams' performance and efficiency.

Why ML in this story?

Digital marketing, lead generation, and sales teams usually generate a lot of lead data, typically stored in a predetermined structured format on one or more platforms. Since data is the new oil of the digital era, how companies handle and analyze it is essential. Relying on human expertise alone to extract valuable insights requires many people and teams, and even then it is a sustainable model only for large-scale corporations with experienced, senior digital marketing, sales, and business analysis profiles.

On the other side, the machine learning approach arises naturally as soon as the buzzword "Big Data" is mentioned. In general, machine learning algorithms do not replace human resources; they establish themselves as a complementary tool that improves and boosts the process and rates of lead conversion. By utilizing the mathematics and statistics behind the data, machine learning can open the horizon and "catch" additional in-depth insights and conclusions that are not visible to the human eye. Lead scoring fits naturally into the supervised machine learning domain, since the complete history of generated data can serve as the labelled dataset that the supervised learning approach requires.

ONNX in the picture

After developing the machine learning model for lead scoring, I started investigating the possibilities of integrating it into the .NET world and accessing it from the C# programming language. Model deployment is a widespread topic of discussion regardless of the target environment, whether the model comes from Keras, TensorFlow, or a Java-based stack. I do not consider it an issue, but rather a challenge of system interoperability, integration simplicity, and maintenance; at the end of the day, it all comes down to the complexity of the deployment strategy. Accordingly, our challenge was to mitigate the technology "gap": doing the data analysis and science work in Python and Scikit-learn while having the opportunity to use the model directly as part of the target system infrastructure, written in C# and supported by the .NET Core framework.

Open Neural Network Exchange (ONNX) is an open file format standard for representing machine learning models. By design, it is a framework-agnostic standard, built to provide format interoperability between tools. I used the ONNX standard for converting and storing the machine learning model. Furthermore, I used ONNX Runtime, a cross-platform, high-performance inference engine that provides a set of APIs for integrating with different target environments, in our case the .NET Core framework with C#.

Dataset overview

To build the initial version of the lead scoring prototype, I decided to use the Lead Scoring Dataset case study publicly accessible on Kaggle. The dataset consists of historical data retrieved by the marketing/sales teams of a company named X Education. The company is modeled as an online learning platform, where professionals/students can find courses of interest and participate in one or more of them. The data is generated from different sources, such as web pages, landing pages, and forms. There are also additional fields populated during the process of contacting candidates interested in specific areas of learning.

The initial dataset covers 9,240 unique records organized in 37 columns, including fields for unique prospect identification. The data analysis, the EDA process, and the data preprocessing and preparation are beyond the scope of this article, so I will proceed directly to the practical approach of deploying the machine learning model between the different working environments.

Building the ONNX Model

I will briefly present the core aspects of building the machine learning model as a prerequisite for creating the ONNX model. For this purpose, I divided the processed and optimized dataset into train and test subsets using the standard splitting procedure from sklearn.model_selection, reserving 30% of the data for testing, as sketched below.

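A minimal sketch of this step; the file name and the 'Converted' target column are assumptions based on the Kaggle case study:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name: the processed dataset with the input features
# plus the 'Converted' target column from the Kaggle case study
df = pd.read_csv('leads_processed.csv')
X = df.drop('Converted', axis=1)
y = df['Converted']

# Reserve 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```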

It is time for the model building. I created and fitted a logistic regression machine learning model (configured with the 'liblinear' solver for the optimization problem) using the sklearn.linear_model library.

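Reconstructed as described, assuming the split from the previous step:

```python
from sklearn.linear_model import LogisticRegression

# 'liblinear' is the solver named in the text for the optimization problem
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
```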

Using Scikit-learn's pipeline mechanism, I created a simple pipeline containing the data scaling step (the standard scaling algorithm imported from sklearn.preprocessing) followed by the model.

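A sketch of the pipeline, under the same assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standard scaling followed by the logistic regression model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='liblinear')),
])
pipeline.fit(X_train, y_train)
```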

Also, looking at it from another design perspective, it is equally valid to scale only the numeric attributes (adjusting them to a common mean of zero).

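One way to express that variant is a ColumnTransformer; the column names below are hypothetical picks from the Kaggle dataset, not necessarily the ones used in the original project:

```python
from sklearn.compose import ColumnTransformer

# Hypothetical numeric columns; only these are scaled, while the remaining
# (dummy-encoded) attributes pass through unchanged
numeric_features = ['TotalVisits', 'Total Time Spent on Website',
                    'Page Views Per Visit']
preprocessor = ColumnTransformer(
    [('num', StandardScaler(), numeric_features)],
    remainder='passthrough')

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression(solver='liblinear')),
])
```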

Finally, I used the cross_val_predict functionality from sklearn.model_selection and generated the cross-validated estimates, explicitly configuring the 'predict_proba' estimator method. This estimator method retrieves the probabilities of successful target lead conversion; in fact, it addresses the challenge of assigning a lead score to every prospect.

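A sketch of the cross-validated scoring step:

```python
from sklearn.model_selection import cross_val_predict

# Cross-validated probabilities; column 1 holds the positive class,
# i.e. the probability of a successful lead conversion (the lead score)
y_proba = cross_val_predict(pipeline, X_train, y_train,
                            cv=5, method='predict_proba')
lead_scores = y_proba[:, 1]
```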

* Note: The process of model validation, evaluation, and performance tuning is beyond the scope of this article – it is part of the machine learning design.

Once the machine learning model is ready, it is time to create the ONNX model. The standard ONNX packages that should be imported are shown below. As a side note, I would like to emphasize that importing the ONNX packages is only possible when skl2onnx is installed in the development environment (the module can be installed using the "pip install skl2onnx" command).

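The imports typically needed for the conversion:

```python
import skl2onnx
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
```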

Creating the ONNX model practically means converting the Scikit-learn model into a file that follows the ONNX format. This is done using the convert_sklearn() method, which takes the previously created model and its input dimension, in this case the number of features in the processed dataset. In this scenario, the feature engineering and dimensionality reduction process resulted in a total of 19 input features.

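A minimal sketch of the conversion call, assuming the simple scaler-plus-model pipeline from above; the input name 'float_input' is an assumption carried through the later C# snippets:

```python
# One float tensor with 19 features per prospect; None leaves the
# batch dimension flexible
initial_type = [('float_input', FloatTensorType([None, 19]))]
onnx_model = convert_sklearn(pipeline, initial_types=initial_type)
```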

The last step is the ONNX model export, serialized to a previously specified system location/path (with the file extension 'onnx').
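
A sketch of the export, with a hypothetical file name:

```python
# Serialize the converted model to disk
with open('lead_scoring.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
```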

One important piece of information here is the set of machine learning algorithms currently supported by the ONNX converter library. The version, as well as the list of all integrated algorithms, can be generated using the following commands (the output includes the complete list of supported algorithms).

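For reference, one way to print the version and the registered converters using the skl2onnx API:

```python
import skl2onnx
from skl2onnx import supported_converters

# Package version and the Scikit-learn models with registered converters
print(skl2onnx.__version__)
print(supported_converters(from_sklearn=True))
```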

Thus, I always recommend checking the list to validate that the algorithm you are using for model building is supported. This check should be done in the machine learning phase, more precisely during algorithm selection. It is worth mentioning that any algorithm that is not part of the ONNX package can be supported as well, but the appropriate custom configuration and adjustments are beyond the scope of this article. There are also some well-known limitations of the package, which can be addressed by following the official documentation page.

.NET Core integration

As mentioned, converting the sklearn model into the ONNX model is the first part of the solution and a prerequisite for the next step, importing the ONNX model into the .NET ecosystem. This can be done using the Microsoft.ML.OnnxRuntime NuGet package. I decided to use the Microsoft Azure Functions template to demonstrate the ONNX model's integration into .NET, as well as the integration validation and testing described in the following section. I created the Azure Function as a new project within a Visual Studio solution targeting the .NET Core 3.1 framework.

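A minimal skeleton of such a function; the function and class names are assumptions, not the exact project code:

```csharp
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class LeadScoringFunction
{
    // HTTP-triggered Azure Function (v3 model, .NET Core 3.1)
    [FunctionName("LeadScoring")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        // The request body carries the 19 input features as JSON
        string requestBody = await new StreamReader(req.Body).ReadToEndAsync();

        // ... deserialization, inference, and response (sketched below)
        return new OkObjectResult(requestBody);
    }
}
```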

Initially, I extracted the request body from the HTTP request and deserialized it into a dynamic data object, a technique that resolves types at run time. Then, I used the straightforward approach of declaring and initializing a variable for every field in the request body. It is worth mentioning that the number of variables (request body parameters) must be identical to the number of input features defined at the moment of ONNX model creation.

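A sketch of that step, with hypothetical request body field names:

```csharp
using Newtonsoft.Json;

// Deserialize into a dynamic object so field types are resolved at run time;
// the field names are hypothetical and must match the request body
dynamic data = JsonConvert.DeserializeObject(requestBody);

float totalVisits = (float)data.TotalVisits;
float totalTimeSpent = (float)data.TotalTimeSpentOnWebsite;
float pageViewsPerVisit = (float)data.PageViewsPerVisit;
// ... one variable per remaining input feature (19 in total)
```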

Afterwards, I defined the local system path of the previously saved ONNX model and created the input tensor needed for building the list of input features and creating the inference session. With the session initialized, it is time to run it and extract the model output in raw format. An inference session can also be used for making predictions with the already created ONNX model in the Jupyter Notebook solution.

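A sketch of the tensor creation and session execution; the model path and feature order are assumptions, while 'float_input' matches the name used at conversion time:

```csharp
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Hypothetical local path to the exported model
var modelPath = @"C:\models\lead_scoring.onnx";

// Shape [1, 19]: one prospect, 19 input features (only three assigned here;
// in the real code every feature variable fills its own slot)
var features = new float[19];
features[0] = totalVisits;
features[1] = totalTimeSpent;
features[2] = pageViewsPerVisit;

var tensor = new DenseTensor<float>(features, new[] { 1, 19 });
var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("float_input", tensor)
};

using var session = new InferenceSession(modelPath);
using var results = session.Run(inputs);
```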

After the session execution, I parsed the retrieved raw value into a collection of DisposableNamedOnnxValue objects, which is then used for extracting the resulting array into the form of a Dictionary<long, float>. The dictionary structure is then used for extracting the probability of converting the lead (the lead scoring result).

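A sketch of the parsing step; the output names follow the convention skl2onnx typically uses for converted classifiers:

```csharp
using System.Linq;

// skl2onnx classifiers typically expose two outputs: 'output_label' and
// 'output_probability', the latter a sequence of class-to-probability maps
var probabilities = results
    .First(r => r.Name == "output_probability")
    .AsEnumerable<NamedOnnxValue>()
    .First()
    .AsDictionary<long, float>();

// Probability of conversion for the positive class (labelled 1)
float leadScore = probabilities[1];
```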

Integration validation

Considering that the lead scoring model, initially designed and written in Python and Scikit-learn, is now successfully integrated, I will test and validate the complete scenario using the Postman API platform. Since I wrapped the integration in the Microsoft Azure Functions template, I can access it via an HTTP call locally on my development machine, following the previously configured path (URL, port, and function name binding).

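For illustration, a request of roughly this shape could be posted from Postman (hypothetical field names and values; 7071 is the Azure Functions runtime's default local port):

```
POST http://localhost:7071/api/LeadScoring
Content-Type: application/json

{
  "TotalVisits": 5.0,
  "TotalTimeSpentOnWebsite": 674.0,
  "PageViewsPerVisit": 2.5
}
```

The real request body carries one field per input feature, 19 in total.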

Taking advantage of Visual Studio's integrated debugger, I drilled down into the model response, investigating the objects' structure and types.


Final words

In this article, I presented part of the practical implementation of a lead scoring prototype we are currently working on. It describes the technical approach to converting and deploying ML models from Python and Scikit-learn into different target technology environments, in this case the .NET ecosystem. Utilizing the advantages of the ONNX format, I demonstrated the simplest form of building ONNX models and using the format's flexibility to incorporate them as a structural part of the target environment's source code. Besides bridging the technical differences between data science and application development platforms, this approach also provides the opportunity to integrate the already designed model with the benefits of ML.NET (the machine learning framework developed for .NET). Moreover, the ONNX NuGet package and the implementation approach can also be used to integrate a Scikit-learn model into a Web API solution.

Further applications

Apart from building the lead scoring system, we have successfully created a lead decision prototype that follows a similar pattern, except for the machine learning algorithm selection. So far, I have developed the lead decision integration module using the Random Forest classifier, which is also on the list of algorithms supported by the ONNX package. Thus, from a business perspective, we provide the opportunity to generate lead decisions based on historical and current data.

Moreover, while doing the exploratory data analysis and designing and building the ML model, we generate data insights and knowledge about the preferable types and formats of data that should be gathered and tracked in the future. This comprehensive data analysis is followed by the model interpretation process, which provides other valuable perspectives on feature importance and correlations.

– – –

Thank you for reading the content, which I strongly believe you will find clear, detailed, and helpful.

My current focus and expertise relate to the latest cutting-edge technologies in enterprise web development and architecture, specifically the Microsoft .NET ecosystem. Additionally, I enjoy working on machine learning applications in data science, bioinformatics, and digital marketing/lead generation. I would be very grateful if you took the time to comment on, support, and share the article.


Originally published at https://www.linkedin.com.

