Building a Machine Learning Model for Lead Decisions using the Tree Ensemble Learner in KNIME

Practical guide for building a classification-based predictive model for Lead Decisions in the KNIME Platform

Miodrag Cekikj
Towards Data Science


Image by Author

Having had the opportunity to apply machine learning algorithms in the lead generation field, I have been able to design and implement various lead scoring and lead decision predictive models. Previously, while covering the technical part of integrating Python-based Scikit-learn models into the Microsoft .NET ecosystem, I also introduced the concept, potential, and benefits of such lead scoring solutions. In the further applications section of that article, I also mentioned developing a lead decision integration module using classification-based algorithms. Assuming that the integration approach follows the steps already described, in this article I will focus on the lead decision model design, powered by the KNIME platform. Hence, I will emphasize the methodological approach of building the system, including the core nodes for importing, normalizing, preparing, and applying the data for training, testing, evaluating, and optimizing the decision model.

* Note: The solution design presented in this article is intended for demonstration purposes and is simplified to emphasize the core fundamentals and building blocks. Nevertheless, it is a fully applicable foundation for building and exporting predictive decision-based models within real testing or production-deployed systems as well.

** Note: I am not going to cover in detail how the tree ensemble machine learning algorithm works, nor the practical meaning and various possibilities of its configuration parameters. Drilling down into the process and explaining each configuration property are separate, comprehensive topics that require special attention and description. More information can be found in the literature and documentation referenced within the article, as well as in the official libraries' documentation.

What is Lead Decision?

Lead Decision follows almost the same definition I used in the previous article. The only difference is the technique of assigning specific decisions to the company's leads database to guide the marketing and sales teams through conversion to end clients. This difference stems directly from the technical approach of treating the model's output, which assigns the decisions directly (excluding the calculation of the associated probabilities). Although it seems like a more generalized scheme, it is still a comprehensive and beneficial process for adjusting the company's strategy and improving the marketing and sales teams' performance and efficiency.

Importing the data

As mentioned in the referenced article on building the Lead Scoring prototype, I used the identical dataset publicly available on Kaggle. Considering that the initial data overview, the exploratory data analysis (EDA), and the data preprocessing are beyond the scope of this article, I will proceed directly to presenting the processed data structure and importing it into the KNIME platform. For this part, I used the Jupyter Notebook environment and the data analysis and wrangling libraries NumPy and pandas, supported by the statistical data visualization libraries Matplotlib and seaborn.

The resulting dataset dimensions, the attributes, and the process of storing the data on the local system path are shown below (the data file is saved in comma-separated values format).

Image by Author
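For readers who want to reproduce this hand-off in Python, a minimal sketch of the export step could look like the following; the file names are placeholders, and the actual EDA and preprocessing steps from the notebook are omitted here.

```python
import pandas as pd

# Load the raw Kaggle export (placeholder file name) into a DataFrame
leads = pd.read_csv("Leads.csv")

# ... exploratory data analysis and preprocessing steps would happen here ...

# Persist the processed dataset in comma-separated values format for the KNIME import
leads.to_csv("leads_processed.csv", index=False)

# Quick check of the resulting dimensions to verify against the table imported into KNIME
print(leads.shape)
```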

I will not apply additional feature engineering or dimensionality reduction in this scenario, and will import all features (retrieved as a result of the data analysis and processing) directly into the KNIME platform.

Image by Author

Data Filtering and Normalization

Considering that the data was initially explored and processed in the Jupyter Notebook workspace, I will move on to explaining the design of the data filtering and normalization module, visualized in the screenshot presented below.

Image by Author

I set up the data filtering using the Column Filter node in manual configuration mode, which removes the one extra automatically generated column storing the incremental id values of the records. This node can also be used for more advanced filtering scenarios in data preprocessing, as it supports wildcard and type selection filtering modes.

Image by Author

After the data filtering procedure, I added the standard Shuffle node to achieve a random record order (default configuration applied). This is followed by the Normalizer node for normalizing the numerical features (in this case 'TotalVisits', 'Total Time Spent on Website', and 'Page Views Per Visit') using Z-score linear normalization, based on the Gaussian (0, 1) distribution.

Image by Author
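For readers who prefer to see the transformation in code, here is a plain-pandas analogue of the Shuffle and Normalizer steps; it is only an illustration, not what KNIME executes internally, and the file name and seed value are assumptions.

```python
import pandas as pd

# Load the processed dataset (placeholder file name)
df = pd.read_csv("leads_processed.csv")

# Shuffle: randomize the record order, with a fixed seed for reproducibility
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Z-score normalization: (x - mean) / std, yielding mean 0 and standard deviation 1
numeric_cols = ["TotalVisits", "Total Time Spent on Website", "Page Views Per Visit"]
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
```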

Going further, I also connected the normalized data output to the Cronbach Alpha node, which compares the variance of the individual columns with the variance of the sum of all columns. In general, it is a reliability coefficient calculated as a measure of internal consistency, i.e., of how closely related the columns are. The interpretation of the coefficient is beyond the scope of this article.
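For reference, the underlying calculation can be sketched in a few lines of Python; this is the textbook formula, not KNIME's internal implementation.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_records, n_columns) array:
    alpha = k / (k - 1) * (1 - sum of column variances / variance of the row sums)."""
    k = items.shape[1]
    column_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - column_variances.sum() / total_variance)
```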

I will wrap up this module with the Number to String node, converting the target attribute 'Converted' into a string type (the format required by the subsequent modules for preparing the data and building the machine learning model).

Image by Author

Data Partitioning

This module represents the stage just before the actual model building procedure. Here, I use the Partitioning node and two pairs of Table View nodes to split the data into training and test subsets, a prerequisite for proceeding with the supervised machine learning approach.

Image by Author

I want to mention that I decided to split the dataset relatively, assigning 70% of the data to the training set and 30% to the test set, and configured the node to draw randomly using an explicitly specified random seed. This keeps the execution results consistent across multiple flow executions; in this case, it ensures that the dataset is always divided identically. The table views are inserted only to pass the data along to the following modules and to improve the flow structure and readability.

Image by Author
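As an illustration of what this stage amounts to, here is a rough scikit-learn equivalent of the Number to String and Partitioning steps; the file name, the seed value, and the use of train_test_split are assumptions for the sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the processed dataset (placeholder file name)
df = pd.read_csv("leads_processed.csv")

# Equivalent of the Number to String node: represent the target as a string
df["Converted"] = df["Converted"].astype(str)

X = df.drop(columns=["Converted"])
y = df["Converted"]

# 70/30 relative split, drawn randomly with an explicit seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)
```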

Model Building and Optimization

Generally, I decided to use the Tree Ensemble Learner and Tree Ensemble Predictor nodes to implement the supervised machine learning model. In contrast to the lead scoring, logistic regression-based solution from the article referenced in the introduction, building a lead decision system means implementing machine learning algorithms for classification. Since I have also mentioned the Random Forest classifier, I want to emphasize that the Tree Ensemble Learner represents an ensemble of decision trees (as in the case of Random Forest variants); KNIME provides Random Forest Learner and Random Forest Predictor nodes as well. Still, I chose the tree ensemble implementation for its richer configuration possibilities and the simplified attribute statistics needed for interpreting the results; the result interpretation itself (feature importance) is beyond this article's scope. The Tree Ensemble nodes are surrounded by Parameter Optimization Loop Start and Parameter Optimization Loop End nodes providing hyperparameter optimization/tuning. The execution of this process is presented in the screenshot below.

Image by Author
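As a rough point of reference, the learner/predictor pair corresponds to fitting and applying a random forest style ensemble. The sketch below uses scikit-learn's RandomForestClassifier, continues from the split in the previous sketch, and its parameter values and mapping are approximations, not the workflow's configuration.

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of decision trees trained on the prepared training subset
model = RandomForestClassifier(
    n_estimators=100,    # roughly 'Number of Models'
    max_depth=10,        # roughly 'Maximum Levels'
    min_samples_leaf=5,  # roughly 'Minimum Node Size'
    random_state=42,
)
model.fit(X_train, y_train)          # counterpart of the Tree Ensemble Learner
predictions = model.predict(X_test)  # counterpart of the Tree Ensemble Predictor
```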

The Parameter Optimization Loop Start node starts a parameter optimization loop based on the previously provided configuration. It uses the concept of flow variables in KNIME, ensuring that each parameter value is passed as a specific flow variable. I used the standard settings approach, configuring 'Number of Models', 'Maximum Levels', and 'Minimum Node Size' as the tree ensemble algorithm parameters. Using the configured value intervals and step sizes, the loop builds and evaluates a different classification model for each parameter combination.

Image by Author

Simultaneously, the Parameter Optimization Loop End node collects the objective function value from a flow variable and transfers control back to the corresponding loop iteration. It also provides the Best Parameters output port, from which I retrieve the parameter combination that maximizes the configured accuracy objective.

Image by Author
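Conceptually, the loop pair behaves like the grid search sketched below; the grid values are placeholders rather than the configured intervals, and the sketch reuses the training and test subsets from the earlier snippets.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Iterate over a small parameter grid, score each model on the test subset,
# and keep the combination with the highest accuracy (the objective function)
best_accuracy, best_params = 0.0, None
for n_models, max_levels, min_node_size in product([50, 100, 200], [5, 10, 20], [1, 5, 10]):
    candidate = RandomForestClassifier(
        n_estimators=n_models,
        max_depth=max_levels,
        min_samples_leaf=min_node_size,
        random_state=42,
    )
    candidate.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, candidate.predict(X_test))
    if accuracy > best_accuracy:
        best_accuracy, best_params = accuracy, (n_models, max_levels, min_node_size)

print(best_accuracy, best_params)
```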

This design approach is supported by a Scorer node that evaluates each loop iteration's model predictions on the previously defined test dataset. To achieve this flow, I connect the Tree Ensemble Predictor node (default configuration) to the scoring mechanism and place both inside the optimization loop.

The Tree Ensemble Learner node configuration is worth mentioning, though. In the Attribute Selection area, the only thing that needs to be configured is the Target column, the 'Converted' dataset feature.

Image by Author

Going further, in the Tree Options area, I configured the Information Gain Ratio as the algorithm's split criterion; it normalizes the information gain of an attribute by how much entropy that attribute has.

Image by Author
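For clarity, here is a minimal sketch of that calculation: the information gain of a candidate split divided by the split's own entropy, which penalizes splits that scatter the data into many small branches. This is the textbook definition, not KNIME's internal code.

```python
import numpy as np

def entropy(labels) -> float:
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain_ratio(parent_labels, child_label_groups) -> float:
    """Information gain of a split divided by the split information (entropy of the split itself)."""
    n = len(parent_labels)
    weights = np.array([len(group) / n for group in child_label_groups])
    gain = entropy(parent_labels) - sum(
        w * entropy(group) for w, group in zip(weights, child_label_groups)
    )
    split_information = float(-(weights * np.log2(weights)).sum())
    return gain / split_information
```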

As visible at the bottom of the configuration window, it is explicitly noted that "maxLevels", "minNodeSize" and "nrModels" are controlled by variables. These are the input flow variables produced by the loop process I explained previously. As a result, every created model is trained with a different combination of configuration parameters. Since an explicit flow variable match is needed, I mapped the specific names on the Flow Variables tab.

Image by Author

In the Ensemble Configuration area, I set the data sampling to random mode and configured square-root attribute sampling, so that a different set of attributes is used at each tree node. As before, I also assigned a specific static random seed value.

Image by Author
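In scikit-learn terms, these sampling settings roughly correspond to the options below; this is only an analogy and not the KNIME node itself.

```python
from sklearn.ensemble import RandomForestClassifier

# Random data sampling (bootstrap), square-root attribute sampling, and a static seed
sampled_model = RandomForestClassifier(
    bootstrap=True,
    max_features="sqrt",
    random_state=42,
)
```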

The Parameter Optimization Loop End node provides two output ports, Best Parameters and All Parameters. I used the All Parameters overview to analyze the performance of the lead decision classifier across the different parameter combinations. This is good practice for building an initial understanding of how the model behaves and for indicating how it could be optimized further, although those techniques are beyond the scope of this article.

Image by Author

At the end of this module, I used the Table Writer node to keep the best parameters as a record in a table stored on the local system path.

Image by Author
Image by Author

Lead Decision Model Extraction

The final module extracts the optimized machine learning classifier using separate Tree Ensemble Learner and Predictor nodes. The technique I am using here is a follow-up to the previous module, where I stored the best parameters retrieved from the algorithm optimization procedure. I used the Table Reader node to read the table from the previously defined system location and took advantage of the Table Row to Variable node to convert the table record into flow variables. The variable conversion configuration, which filters the accuracy value parameter, is shown in the following screenshot.

Image by Author
Image by Author

Following the identical approach with flow variable inputs, I configured the variable parameters as the Tree Ensemble algorithm parameters on the Tree Ensemble Learner. I have to emphasize that, in order to extract the best-performing lead decision model, the training dataset must be identical to the one used in the building and optimization module. The same applies to the testing data. This is achieved through the specific random seed value explained in the previous module. Moreover, the static random seed configured in the Ensemble Configuration of the Tree Ensemble Learner should also be identical in the building and optimization module and in the final model extraction module.
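Put in plain Python, the persistence-and-extraction chain resembles the sketch below; the file name, column names, and parameter values are placeholders, and the training data is assumed to be the same split used earlier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Counterpart of the Table Writer node: persist the best parameter combination
pd.DataFrame(
    [{"nrModels": 100, "maxLevels": 10, "minNodeSize": 5}]
).to_csv("best_params.csv", index=False)

# Counterpart of the Table Reader and Table Row to Variable nodes: load it back as values
best = pd.read_csv("best_params.csv").iloc[0]

# Train the final model with the recovered parameters and the same static seed,
# so the extracted model matches the one found during optimization
final_model = RandomForestClassifier(
    n_estimators=int(best["nrModels"]),
    max_depth=int(best["maxLevels"]),
    min_samples_leaf=int(best["minNodeSize"]),
    random_state=42,
)
final_model.fit(X_train, y_train)
```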

As a final touch, I inserted a Scorer (JavaScript) node to analyze the performance of the extracted lead decision model. In this case, I selected the scorer that supports JavaScript views in order to generate a more insightful and interactive summary. The test data predictions, including the confusion matrix, the class statistics, and the overall statistics, are presented in the screenshot below.

Image by Author

Here, class 0 represents the 'Not Converted' leads, while class 1 represents the 'Converted' leads. This column could also be encoded as a categorical attribute during data preprocessing, so that descriptive category labels are used instead of integers converted into strings.
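For comparison, the same kind of summary can be produced in a few lines with scikit-learn's metrics, continuing from the previous sketches; this mirrors what the scorer reports rather than reproducing its interactive view.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Confusion matrix plus per-class and overall statistics on the held-out test data
y_pred = final_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Overall accuracy:", round(accuracy_score(y_test, y_pred), 3))
```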

* Note: The resulting summary reflects the evaluation of the extracted lead decision model. Additional optimization, interpretation of the results, and generation of the features' importance are not part of this topic. However, the model building results are satisfactory and acceptable, considering the required scenario goal and the achieved overall accuracy above 80%.

Final Words

In this article, I covered the step-by-step design of a complete lead decision system developed using the KNIME platform. In this context, I covered the usage of the core nodes used to establish a working infrastructure for providing lead decisions on previously analyzed and processed data. This approach is part of a broader, systematic way of working with and interpreting marketing and sales data to establish a more insightful and practical lead generation process.

This solution was designed and developed entirely in KNIME, an open-source data analytics, reporting, and integration platform providing a set of various components for machine learning. It is a stimulating environment that empowers the process of building machine learning models and, at the same time, presents it in a very intuitive visual way. In addition, I consider it an excellent platform for rapid model building, especially where there is a specific need to try and analyze multiple different machine learning algorithms. At the same time, it can easily be used by researchers or professionals with no coding background, thanks to its simple interface and integrated documentation for building different machine learning pipelines, flows, and models.

— — — — — — — — — — — —

Thank you for taking the time to read the article. I hope you found it insightful and inspiring.

Feel free to share your thoughts on it; I would appreciate your suggestions and point of view.

Originally published at https://www.linkedin.com.
