**Companies will gain tremendous growth opportunities by simply cutting back budgets on initiatives that are not working — the low-hanging fruit in business.**

I once worked as a data scientist at a startup. Most of the startup’s revenue came from inbound leads. Although a satisfying number of leads came in every day, the conversion rate was not ideal. Curious about how to improve conversion and revenue with limited resources, I started a lead scoring project to find the prospect cohorts most likely to convert.

I expected to spend a few weeks building the data pipeline and running the machine learning model before I could get any meaningful insights from the data. As you may be aware, a lead scoring project takes non-trivial effort.

Surprisingly, shortly after I aggregated the data, I found something interesting — the data indicated that the company should ignore at least 80% of its inbound leads. These leads often came from prospects who clearly could not afford the product — once the sales team saw all visitor attributes in one place, they could quickly tell who would never convert.

For example, metric A proved to be a key predictor of conversion. While a high metric A didn’t guarantee a closed-won deal, a low metric A reliably indicated a closed-lost one, as shown in the diagram below.
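The pattern is easy to check once the data is aggregated. Here is a minimal sketch with invented numbers — the metric values, the 50-point threshold, and the records below are all hypothetical:

```python
# Hypothetical lead records: (metric A value, 1 if the deal closed won)
leads = [(5, 0), (90, 1), (70, 1), (10, 0), (85, 0), (15, 0), (60, 1), (8, 0)]

def close_rate(rows):
    """Fraction of leads in `rows` that converted."""
    return sum(won for _, won in rows) / len(rows)

# Split leads into cohorts at an illustrative threshold of 50
low_cohort = [r for r in leads if r[0] <= 50]
high_cohort = [r for r in leads if r[0] > 50]

print(close_rate(low_cohort))   # → 0.0  (never converts)
print(close_rate(high_cohort))  # → 0.75 (converts often, but not always)
```

In this toy data the low-metric cohort never converts, mirroring the pattern in the diagram: a low value signals a closed loss, while a high value helps but doesn’t guarantee a win.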

However, the sales team spent 50%—if not more—of their time chasing the orange cohort. Why? The team didn’t have a complete picture of each prospect in front of them.

Once the company implemented the lead scoring system to guide sales efforts, sales productivity tripled within three months. The sales team moved efforts away from those low-quality leads and closed more deals and bigger deals.

Later on, as I worked on more data science initiatives with a wide range of companies, including both B2B and B2C, I noticed this repeating pattern. For example, many companies’ ad campaign performances look like the diagram below: a large proportion of the budget is spent on campaigns that generate little to no return. These campaigns should be turned off immediately.

Once a company cuts back on underperforming marketing campaigns or other initiatives, it immediately gets a longer runway for high-potential pilots.

**In the current economy, where there is considerable uncertainty, generating more revenue with fewer resources is more relevant than ever.**

You may ask why companies would continue to spend money on something that is not promising. Well, they wouldn’t. Most executives would quickly trim spending on stalling initiatives as soon as they saw a complete picture of their performance. The issue, however, is that they don’t have such a picture until they have already wasted tens of thousands, if not millions, of dollars. Take marketing data, for example; in a company actively running marketing campaigns, the data flow looks like the diagram below.

Without dedicated minds and effort devoted to data strategy, most companies don’t have a holistic view of their business performance. Therefore, the opportunities to cut the resources wasted on underperforming initiatives never reach the executives. Many companies treat data science as a nice-to-have and don’t invest in analytics until executives can no longer wrap their heads around what’s working, what isn’t, and why.

In reality, data science sits on every company’s critical path — customer acquisition.

**Monitoring business performance in real-time and swiftly halting initiatives with negative returns is the low-hanging fruit for every company. It will give a company more time and cash to survive the unstable economy.**

You can read Connect the Dots in Data Strategy to learn more about how to seize the low-hanging fruit. In the following articles, I will discuss how to run experiments and identify what works in businesses with data. Stay tuned!

Seizing the Low-Hanging Fruit in Business with Data Science was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


For a long time, R has been my go-to tool for most data science-related tasks. I particularly love how easily I can get things done fast and fluently. The emergence of the tidyverse has been a real game-changer for data wrangling, exploratory analysis, and data visualization. Moreover, Shiny — a framework for building versatile and beautiful web applications — has become increasingly popular.

However, when it comes to machine and deep learning, Python seems to be several steps ahead with ML/DL frameworks such as scikit-learn, PyTorch, and TensorFlow/Keras. Consequently, I find myself using (and liking) Python more and more.

For me, the frustrating part was that I often wanted to deploy and expose computer vision and NLP models in Shiny apps. Even though Shiny has recently become available for Python, Shiny for Python is currently in a very early stage of development. Similar tools are indeed available to Python users, e.g. Streamlit and Gradio are great for exposing ML models, but I find that they are somewhat limited compared to Shiny — particularly when it comes to creating custom user interfaces. There are of course more options for Python that I will need to explore, but I really like developing Shiny apps.

I therefore set out to learn how to do the following:

- Use R’s reticulate package to use Python code in a Shiny app.
- Implement a pre-trained transformer model to process user input.
- Containerize apps containing both R and Python functionality as well as serving a transformer model.

In this article, we’ll go through all three steps to see how it’s done.

I must admit that getting started with reticulate and bridging the gap between R and Python was somewhat tricky and involved a fair bit of trial, error, and tenacity. The good news is that now you might not have to. Here, I will show one way to use Python code in R and Shiny.

First things first, make sure you have all the required packages installed, including shiny and reticulate, and start a new Shiny app project.

Next, we will need to set up a Python environment. There are different ways of doing this, e.g. by using virtualenv as per the reticulate documentation. Here, we will use conda to set up our environment from a YAML file in which we specify all the necessary dependencies. Create a new file called environment.yml in your project’s home directory with the following contents:

```yaml
name: my_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - pip
  - pip:
      - numpy
```

As you can see, we’ve specified that the environment should be called my_env, in which Python 3.8 will be installed along with pip — an installer that in turn fetches the numpy package, which we will need for a simple function we’ll create.

We can then create the environment by opening a terminal in our RStudio session and running the following command:

conda env create -f environment.yml

You may check that the environment has been created with the command conda env list. To activate the environment, run the following:

conda activate my_env

Now, in order for reticulate to find the version of Python we just installed, copy and save the output of the command which python for later. It should end in something like “../miniconda3/envs/my_env/bin/python”.

If you’ve initialized your project as a Shiny web app, you should already have a file called app.R. For this minimal example, we’ll keep it simple and modify the app that’s already made for us. At the top of this script, insert the following lines:

```r
library(shiny)
library(reticulate)

Sys.setenv(RETICULATE_PYTHON = "python-path-in-my_env")
reticulate::use_condaenv("my_env")
```

You may also set RETICULATE_PYTHON in a file called .Rprofile instead.

Now we can create a simple Python function that we will use in the server code of app.R. Start by creating a script, which you may call python_functions.py, with the following lines of code:

```python
import numpy as np

def make_bins(x, length):
    low = min(x)
    high = max(x)
    length = int(length)
    return np.linspace(low, high, length).tolist()
```

This function finds the lowest and highest values of a vector and uses NumPy’s linspace() function to return a list of equidistant numbers ranging from lowest to highest. The length of the list is equal to length and will be set interactively by users of the app. With the function defined, we can move on to app.R and modify the script to use our Python function.
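Before wiring the function into Shiny, it’s worth a quick sanity check directly in Python (the input vector here is just an example):

```python
import numpy as np

def make_bins(x, length):
    low = min(x)
    high = max(x)
    length = int(length)
    return np.linspace(low, high, length).tolist()

# Five equidistant breakpoints spanning the data from min to max
print(make_bins([4, 1, 9, 3], 5))  # → [1.0, 3.0, 5.0, 7.0, 9.0]
```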

Right below the two lines we inserted in app.R in the previous step, we add the following line:

reticulate::py_run_file("python_functions.py")

This line makes our make_bins() function available whenever we start a Shiny session. Now, remove or comment out the following line in the server code:

bins <- seq(min(x), max(x), length.out = input$bins + 1)

We then substitute this line with the following:

bins <- py$make_bins(x, input$bins + 1)

Note the py$ part, which signals that the function is a Python function. We can finally run the app and hopefully see that it works!

So now we’ll try to implement a transformer model that classifies the emotion expressed in a given input text. The model we’ll be using is called *distilbert-base-uncased-emotion*, which you may learn more about on the Hugging Face site. If you haven’t done so already, I encourage you to explore the site, the available models, and the supported NLP and computer vision tasks.

We first need to add the packages torch and transformers to our environment.yml file in the following way:

```yaml
name: my_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - pip
  - pip:
      - torch
      - transformers
```

We can then update the environment with the following commands:

conda deactivate

conda env update -f environment.yml --prune

The --prune flag ensures that unnecessary packages are removed when updating the environment.

With torch and transformers installed, we’re ready to write new Python functions that allow us to use the model.

```python
import torch
from transformers import pipeline

def get_model():
    model = pipeline(
        "text-classification",
        model="bhadresh-savani/distilbert-base-uncased-emotion",
        top_k=-1,
    )
    return model

def get_predictions(input_text, classifier):
    predictions = classifier(input_text)
    return predictions
```

The first time you run get_model(), the model is downloaded, which may take a minute or two. It’s a good idea to run get_predictions() outside of a Shiny session to get an idea of what the output looks like.

Now, we can finally build an app that makes use of the model. I’ve provided a simple working script below, which you can try out if you’ve completed the previous steps.

You’ll notice that we load the emotion classification model on line 10 with model <- py$get_model().

Then, on lines 31–33, we apply the model to some input text provided by users and convert the output to a data frame, which makes plotting much easier.

```r
predictions <- py$get_predictions(input$text, model)
df <- map_df(predictions[[1]], unlist)
```

It can sometimes be tricky to convert the output of a Python function to a data type that R can work with. In case you run into trouble in your own project, you might find the reticulate docs useful (see “Type Conversion”).

Below you can see what the app will look like.

Docker and container technology offer a great way of running code and applications with full control over environments. We first need to create a Dockerfile, which can often be quite difficult and time-consuming. Here, I will show one solution for combining Python, R, Shiny, and transformer models in a single Docker image. It may not be the most efficient one, and some dependencies and commands may be superfluous. Thus, you might be able to reduce both the time it takes to build the image and its size by tinkering with the Dockerfile, which specifies how the image is built.

The first line of the Dockerfile indicates the base image. By default, the latest version is used. Usually, when putting the app into production, it’s advisable to opt for a specific version. The next few lines install a few dependencies including R.

```dockerfile
FROM continuumio/miniconda3

RUN apt-get update -y; apt-get upgrade -y; \
    apt-get install -y vim-tiny vim-athena ssh r-base-core \
    build-essential gcc gfortran g++
```

Normally, I prefer using Docker images created by the Rocker Project, which makes it very easy to write Dockerfiles for containerizing Shiny apps and other R-based applications. However, I ran into some issues when adding Python to the mix and decided to try a different way.

Next, our environment is installed and activated just as before. I have to admit that I’m not yet entirely sure how many of the environment variables below are strictly necessary.

```dockerfile
COPY environment.yml environment.yml
RUN conda env create -f environment.yml
RUN echo "conda activate my_env" >> ~/.bashrc

ENV CONDA_EXE /opt/conda/bin/conda
ENV CONDA_PREFIX /opt/conda/envs/my_env
ENV CONDA_PYTHON_EXE /opt/conda/bin/python
ENV CONDA_PROMPT_MODIFIER (my_env)
ENV CONDA_DEFAULT_ENV my_env
ENV PATH /opt/conda/envs/my_env/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```

Then, we download our model with the following line:

```dockerfile
RUN python -c "from transformers import pipeline; pipeline('text-classification', model='bhadresh-savani/distilbert-base-uncased-emotion')"
```

It is important to do this at build time, as opposed to runtime. If we were to do this at runtime, every session would start by downloading the model!

There are different ways of installing R packages. This way is pretty straightforward.

```dockerfile
RUN R -e "install.packages(c('dplyr','purrr','ggplot2','shiny','reticulate'), repos = 'http://cran.us.r-project.org')"
```

Note that this is actually a single line. You can also see that we’re installing dplyr, purrr, and ggplot2, which are the tidyverse packages we actually need. Consequently, we need to load these specific packages and remove library(tidyverse) from app.R.

For some reason, I wasn’t able to install the entire tidyverse at this time. Besides, doing so would take considerably longer and result in an image that would be larger than necessary.

Lastly, we copy everything in our project folder to the image’s working directory, grant read/write permissions, expose a port, and specify a final command that actually runs the app.

```dockerfile
COPY . ./
RUN chmod ugo+rwx ./
EXPOSE 3838
CMD ["R", "-e", "shiny::runApp('/', host = '0.0.0.0', port = 3838)"]
```

Below you can see the entire Dockerfile.

If we simply call our Dockerfile “Dockerfile”, Docker will by default look for this file when we run the following command:

docker build -t mysimpleapp .

The -t is for ‘tag’, and we will tag this image ‘mysimpleapp’. The dot at the end indicates that the build context is the current directory.

In case you run into trouble due to disk space limitations, you may increase the allowed disk space in the Docker settings, or, in case you have large dangling images you don’t need, you can run docker system prune or docker system prune -a. Be aware that the latter command will remove all unused images!

Finally, keeping our fingers crossed, we can try and run our app!

docker run -it -p 3838:3838 mysimpleapp

The -it flag means that we want to run in interactive mode, so that we may see what’s going on ‘inside’ our container as it starts up. This might be helpful in case something unexpectedly goes wrong.

In your console, you should then see R starting up followed by ‘Listening on http://0.0.0.0:3838’. Point your browser to this address and check that the app works.

I’ve deployed a slightly more advanced app here. This app, which I’ve called Wine Finder, uses a semantic search model called all-MiniLM-L6-v2 that lets users search for wines they might like by typing queries describing the qualities they’re looking for in a wine. For instance, a query might be phrased as *“full-bodied with notes of red berries”*. The app includes descriptions of roughly 130,000 wines, which are ranked by relevance with respect to the query. The dataset is available here. Below you can see what the app looks like.

It may be a little slow to load since I’ve allowed the service to basically shut down when not in use. This results in “cold starts”, which is much cheaper than having the service constantly running.

We’ve seen how to implement Python functions and transformer models in Shiny apps and how to wrap it all up in a Docker image ready to be deployed as a web service. It is relatively easy to deploy Docker images as web apps using cloud providers such as AWS and Microsoft Azure. However, for personal projects, I think Google Cloud is the cheapest option at the moment. If you would like to deploy e.g. a Shiny app on Google Cloud, make sure to check out my step-by-step guide to using Google Cloud Run for deploying Shiny apps. No matter which of these providers you use, the process is roughly the same. You will need to have an account, push your Docker image to a container registry, and then set up a web service using the image.

We haven’t covered deployment of Shiny apps that run Python code directly on shinyapps.io. The procedure is slightly different, as described in this tutorial. Be aware that if you plan on exposing transformer models in your app, shinyapps.io may not be a viable option, at least not if you’re on the free tier. However, if you don’t need the app to actually contain a large transformer model and/or a lot of data, you may consider simply calling the Hugging Face Inference API for the given model.

Serving Transformer Models in Shiny Apps was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

In this article, we’ll bring together two fundamental topics in statistical modeling, namely the covariance matrix and heteroskedasticity.

Covariance matrices are the workhorses of statistical inference. They are used for determining whether regression coefficients are statistically significant (i.e. different from zero), and for constructing confidence intervals for each coefficient. To do this work, they make a few crucial assumptions. Chief among these is that the model’s errors are homoskedastic, i.e. they have constant variance, and that the errors are not auto-correlated.

In practice, it is possible to assume non-auto-correlatedness of errors especially in cross-sectional data (but not in time-series settings).

Unfortunately, the assumption of homoskedasticity of errors is, more often than not, not met. Fortunately, there are ways to build a covariance matrix of regression coefficients (and thereby perform sound statistical inference) in the face of non-constant variance i.e. heteroskedastic regression errors.

In this article, we’ll study one such technique, known as **White’s heteroskedasticity-consistent estimator** (named after its creator, Halbert White), in which we will build a covariance matrix of regression coefficients that is robust to heteroskedastic regression errors.

This article is part 1 of the following two part series:

**PART 1: Introducing White’s Heteroskedasticity Consistent Estimator**

PART 2: A tutorial on White’s Heteroskedasticity Consistent Estimator using Python and Statsmodels

In PART 1, we will get into the theory of the HC estimator while in PART 2, we walk through a Python based tutorial on how to use it for doing statistical inference that is robust to heteroskedasticity.

Consider the following general form of the regression model:

In this model, we express the response variable **y** as some unspecified function of the regression variables matrix **X**.

A linear form of this model that consists of two regression variables and an intercept would be expressed as follows:

Here, the column of ones corresponds to the intercept, and the remaining terms are the two regression variables, their coefficients, and the error term.

In practice, it is useful to express Eq (1) or (1a) compactly as follows:

Or in its full matrix glory as follows (in our example, k=3):

*β_1, β_2, …, β_k* represent the true, population-level values of the coefficients, corresponding to the situation when the sample is the entire population.

But in all practical situations, the sample data set is some random subset of size *n* from the population. When a linear model is trained, a.k.a. “fitted”, on this sample data set, we have the following equation of the fitted model:

Here, **y** and **X** hold the sample’s data, and *β_1_cap, β_2_cap, …, β_k_cap* are the fitted coefficient values.

Since fitting the model on different samples, each of size *n*, will yield a different set of coefficient values each time, the fitted coefficients *β_1_cap, β_2_cap, …, β_k_cap* can be considered random variables. The fitted values *β_1_cap, β_2_cap, …, β_k_cap* each have a mean value, which can be shown to be the corresponding true population value *β_1, β_2, …, β_k*, and they have a variance around that mean.

The covariance matrix of the regression model’s fitted coefficients contains the variances of the fitted coefficients and the covariances of the fitted coefficients with each other. Here is what this matrix looks like for a model containing *k* regression variables (including the intercept):

The covariance matrix is a square matrix of size *[k x k]*. An element at position *(i, j)* in this matrix contains the covariance of the *i*th and the *j*th fitted coefficients. The values along the main diagonal are the variances of the *k* fitted coefficients, while the off-diagonal elements contain the covariances between fitted coefficients. The square roots of the main diagonal elements are the standard errors of the fitted coefficients. The matrix is symmetric around the main diagonal.

The covariance matrix of regression coefficients is used to determine whether the model’s coefficients are statistically significant, and to calculate the confidence interval for each coefficient.

For linear models of the form **y** = **Xβ** + **ϵ** (which form the backbone of statistical modeling), the formula for the covariance matrix contains within it another covariance matrix, namely the covariance matrix of the model’s error term **ϵ**:

Incidentally, note that both matrices contain conditional variances and covariances, conditioned as they are upon the regression matrix **X**.

Just as with the fitted coefficients, each error *ϵ_i* is a random variable with a mean and a variance. If the model’s errors are homoskedastic (constant variance), then each error has the same variance *σ²*, and the covariance matrix of the errors can be written as *σ²***I**:

Where **I** is an identity matrix of size *[n x n]*.

In a linear model, when the model’s errors are homoskedastic and non-auto-correlated, the covariance matrix of fitted regression coefficients has the following nice and simple form, the derivation of which is covered in my article on covariance matrices:

In the above equation, **X**’ is the transpose of **X**. The transpose is like turning the matrix on its side: the rows of **X** become the columns of **X**’ and vice versa.

Eq (6) suggests that to calculate the covariance matrix of the fitted coefficients, one must have access to **X** and to the error variance *σ²*, which in practice is estimated from the fitted model’s residuals.
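As a numerical sketch of Eq (6) — my own synthetic example, with all variable names being assumptions — the covariance matrix and the coefficients’ standard errors can be computed with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample: an intercept column plus two regression variables (k = 3)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])                 # true population coefficients
y = X @ beta + rng.normal(scale=1.5, size=n)      # homoskedastic errors

# Fit by least squares: beta_cap = (X'X)^(-1) X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_cap = XtX_inv @ X.T @ y

# Estimate sigma^2 with s^2 = RSS / (n - k), then Eq (6): Cov = s^2 (X'X)^(-1)
resid = y - X @ beta_cap
s2 = resid @ resid / (n - k)
cov = s2 * XtX_inv

# Standard errors are the square roots of the main diagonal
std_errors = np.sqrt(np.diag(cov))
```

The diagonal of `cov` holds the variances of the three fitted coefficients, exactly as described for the covariance matrix above.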

Unfortunately, reality is seldom simple. While we can often assume non-auto-correlatedness of error terms, especially in cross-sectional data sets, heteroskedasticity of errors is quite commonly encountered in both cross-sectional and time series data sets.

When the error term is heteroskedastic, the covariance matrix of errors looks like this:

In this matrix, *σ²* is just a common scaling factor which we have extracted out of the matrix, so that each *ω_i = σ²_i/σ²*. It is easy to see that when the errors are homoskedastic, *σ²_i = σ²* for all *i*, so *ω_i = 1* for all *i* and **Ω** = **I**, the identity matrix.

Incidentally, some statistical texts use **Ω** to represent a covariance matrix of errors that are both heteroskedastic and auto-correlated.

Either way, when the covariance matrix of the errors is not *σ²***I**, and is instead *σ²***Ω**, the covariance matrix of the fitted coefficients takes the form shown in Eq (8) below:

If you are curious about how Eq (8) was derived, I have mentioned its derivation at the end of the article.

For now, we need to see how best to estimate the variance of the fitted coefficients in the face of heteroskedastic errors using equation (8).

Equation (8) consists of three segments (colored green, yellow, and green respectively). Since the **X** matrix is completely accessible to the experimenter, the two green terms, each of the form *(**X**’**X**)^(−1)*, can be computed directly from the data.

Recollect that when the errors were homoskedastic and non-autocorrelated, we could estimate *σ²***I** using the variance *s²* of the fitted model’s residuals. With heteroskedastic errors, however, *σ²***Ω** contains *n* distinct unknown variances, which cannot all be estimated from a single sample of *n* residuals.

And this is where the estimator proposed by Halbert White in 1980 comes into play. White proposed a way to estimate the yellow term *(**X**’σ²**ΩX**)* at the heart of equation (8). Specifically, he proved the following:

Equation (9) deserves some explanation. The *plim* operator on the L.H.S. of Eq (9) stands for **probability limit**. It is a short-hand way of saying that the random variable computed by the summation on the L.H.S. of Eq (9) **converges in probability** to the random variable on the R.H.S. as the size of the data set *n* tends to infinity. A simple way to understand “convergence in probability” is to imagine two random variables **A** and **B**. If **A** converges in probability to **B**, the probability distribution of **A** becomes more and more like the probability distribution of **B** as *n* (the size of the data sample) increases, and it becomes (almost) identical to the probability distribution of **B** as *n* tends to infinity. Thus, the properties of **A** become indistinguishable from the properties of **B** as *n* becomes arbitrarily large.
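A small simulation — my own illustration, not from the article — makes convergence in probability concrete: the sample mean of i.i.d. draws strays less and less from the true mean as *n* grows:

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 5.0

def typical_deviation(n, trials=500):
    # Draw `trials` independent samples of size n and measure how far
    # their sample means typically land from the true mean
    samples = rng.normal(loc=true_mean, scale=2.0, size=(trials, n))
    return np.abs(samples.mean(axis=1) - true_mean).mean()

dev_small, dev_large = typical_deviation(10), typical_deviation(10_000)
print(dev_small, dev_large)  # the deviation shrinks roughly like 1/sqrt(n)
```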

The fact that the summation on the L.H.S. of Eq (9) and the term on the R.H.S are random variables can be deduced as follows:

Let’s look at the R.H.S. of Eq (9). Each time one randomly selects a sample of size *n*, the regression matrix **X** is likely to take on a different set of values. That makes **X**, and therefore the whole expression on the R.H.S., a random variable.

Now let’s look at the L.H.S. On the L.H.S., we have the terms **x**’_i and **x**_i, which are the transpose of the *i*th row of **X** and the *i*th row itself, respectively. Both are random variables for the same reason that **X** is.

Let’s inspect the summation on the L.H.S.:

**x**_i is the *i*th row of **X**, so it is a row vector of size *[1 x k]*, while its transpose **x**’_i is a column vector of size *[k x 1]*. Each term **x**’_i **x**_i in the summation is therefore a *[k x k]* matrix.

Now let’s look at the R.H.S. of Eq (9):

**X** is the regression matrix of size *[n x k]*, so the product **X**’σ²**ΩX** is again a *[k x k]* matrix.

With this estimate in hand, let’s revisit the formula for the covariance of the fitted coefficients in the face of heteroskedastic errors:

Thanks to Dr. White, we now have a way to estimate the yellow term at the center of Eq (8) as follows:

Equation (10) is known as **White’s Heteroskedasticity Consistent (HC) Estimator**. It gives the regression modeler a way to estimate the asymptotic covariance matrix of the fitted regression coefficients in the face of heteroskedastic errors. The word ‘asymptotic’ implies that the estimator is valid, strictly speaking, only for infinitely large data sets. More on this fact below.
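A minimal NumPy sketch of the estimator — my own illustration, with the data and names being assumptions; the final scaling is the small-sample correction discussed below:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with heteroskedastic errors: the error's spread grows with x
n, k = 500, 2
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # error sd proportional to x

# Ordinary least squares fit
XtX_inv = np.linalg.inv(X.T @ X)
beta_cap = XtX_inv @ X.T @ y
resid = y - X @ beta_cap

# White's HC0 sandwich: (X'X)^(-1) [X' diag(e_i^2) X] (X'X)^(-1),
# where the middle term estimates the yellow term of Eq (8)
middle = (X * resid[:, None] ** 2).T @ X
hc0 = XtX_inv @ middle @ XtX_inv

# MacKinnon-White small-sample scaling (often called HC1)
hc1 = hc0 * n / (n - k)

robust_se = np.sqrt(np.diag(hc1))
```

With data like this, the robust standard errors will generally differ from the classical ones of Eq (6) — which is precisely the point of the estimator.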

While using it, we should keep in mind its following two limitations:

Recall that this estimator is based on an identity that is valid only as the data set becomes arbitrarily large, i.e. technically as n → ∞. In practical settings, this limitation makes the estimator strictly valid only for very large data sets. For smaller data sets (known as small samples), say when *n* is less than or equal to a couple of hundred data points, White’s HC estimator tends to underestimate the variance of the fitted coefficients, making them look more statistically significant than they actually are. This underestimation in small samples can usually be corrected by dividing Eq (10) by *(n — k)*, as shown by MacKinnon and White in their 1985 paper (see paper link at the end of the article).

The second potential issue with White’s HC estimator is that it assumes that there is little to no auto-correlation in the errors of the regression model. This assumption makes it suitable only for cross-sectional and panel data sets and makes it especially unsuitable for time series data sets which typically contain auto-correlations extending to several periods into the past.

All things said, White’s heteroskedasticity-consistent estimator provides a powerful means to estimate the covariance matrix of fitted coefficients, and thereby perform consistent statistical inference in the face of heteroskedasticity.

We start with the linear model:

A least-squares estimation of **β** yields the following estimator:

Substituting **y** in (b) with **Xβ** + **ϵ** from (a) and simplifying, we get:

The variance of a random variable is the expected value of the square of its mean-subtracted value: *Var(**X**) = E[(**X** − **X**_mean)²]*. The mean of the coefficient estimates **β**_cap is simply the population value **β**. Thus we continue with the derivation as follows:

The blue colored term at the center of Eq (e) is the covariance of the regression model’s errors conditioned upon **X**. We know that this covariance matrix of errors can be represented as *σ²***Ω**; substituting it into Eq (e) yields Eq (8).

In next week’s article, we’ll walk through a tutorial on how to use White’s Heteroskedasticity Consistent Estimator using Python and Statsmodels. Stay tuned!

White, Halbert. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” *Econometrica*, vol. 48, no. 4, 1980, pp. 817–38. *JSTOR*, https://doi.org/10.2307/1912934. Accessed 25 Sep. 2022. **PDF download link**

James G MacKinnon, Halbert White, Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties, *Journal of Econometrics, Volume 29, Issue 3, 1985, Pages 305–325, ISSN 0304–4076*, https://doi.org/10.1016/0304-4076(85)90158-7. (https://www.sciencedirect.com/science/article/pii/0304407685901587) **PDF download link**

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

*If you liked this article, please follow me at **Sachin Date** to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.*

Introducing the White’s Heteroskedasticity Consistent Estimator was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
