How to Predict Business Success & Failure

Why Some Small Businesses Fail and How to Optimize for Success With Machine Learning

Nadim Kawwa
Towards Data Science


Photo by Eaters Collective on Unsplash

Starting a business is no trivial task: there are many expenses to consider, such as sunk costs, overhead, and loans. Metrics for success include, but are not limited to, growth, sales revenue, customer retention, and balance sheets.

The viability of a business is first and foremost dependent on one axiom: that it stays open.

This topic is of primary interest to businesses, which must remain viable, and to customers, who expect good and reliable service. It is also important in finance, where lenders must justify giving out loans and may be obligated by law to explain why applications are denied.

We therefore formulate the motivation and purpose of this endeavor as follows: Can we predict if a business is open or closed? What are the main indicators of viability?

In this article we answer this question using the Yelp Dataset, a collection of relational data about local businesses across the USA and Canada. It contains a large volume of comprehensive information:

  • 6,685,900 reviews
  • 192,609 businesses
  • 10 metropolitan areas
  • 1,223,094 tips

The Yelp dataset is therefore an excellent case study on how aptly engineered features, coupled with machine learning models, can predict the success of a business beyond traditional approaches such as balance sheets.

In the sections that follow, we present each segment of the dataset as given and draw features from it:

  • Businesses (business.json): business-level data including location, attributes, and cuisine categories.
  • Reviews (review.json): full review text data, including the user_id that wrote the review and the business_id the review is written for.
  • Checkins (checkin.json): check-ins for businesses, where available.

Businesses

Feature Engineering

The business.json file contains business data including location data, attributes, and categories. The snippet below shows its schema:

{
    // string, 22 character unique string business id
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the business's name
    "name": "Garaje",

    // string, the full address of the business
    "address": "475 3rd St",

    // string, the city
    "city": "San Francisco",

    // string, 2 character state code, if applicable
    "state": "CA",

    // string, the postal code
    "postal code": "94107",

    // float, latitude
    "latitude": 37.7817529521,

    // float, longitude
    "longitude": -122.39612197,

    // float, star rating, rounded to half-stars
    "stars": 4.5,

    // integer, number of reviews
    "review_count": 1198,

    // integer, 0 or 1 for closed or open, respectively
    "is_open": 1,

    // object, business attributes to values. note: some attribute values might be objects
    "attributes": {
        "RestaurantsTakeOut": true,
        "BusinessParking": {
            "garage": false,
            "street": true,
            "validated": false,
            "lot": false,
            "valet": false
        }
    },

    // an array of strings of business categories
    "categories": [
        "Mexican",
        "Burgers",
        "Gastropubs"
    ],

    // an object of key day to value hours, hours are using a 24hr clock
    "hours": {
        "Monday": "10:00-21:00",
        "Tuesday": "10:00-21:00",
        "Friday": "10:00-21:00",
        "Wednesday": "10:00-21:00",
        "Thursday": "10:00-21:00",
        "Sunday": "11:00-18:00",
        "Saturday": "10:00-21:00"
    }
}

The target variable is is_open, where 0 indicates closed and 1 indicates open.

As a first step we need to explode the nested attributes into their corresponding values. For example, within the attributes feature we have a binary encoding for RestaurantsTakeOut, and we need to further expand the nested BusinessParking object.
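A rough sketch of this flattening with pandas, assuming business.json has been loaded into a DataFrame (the file path and variable names are illustrative):

import pandas as pd

# Assumed: the newline-delimited business.json loaded into a DataFrame.
business_df = pd.read_json("business.json", lines=True)

# Flatten the nested 'attributes' dict into top-level columns; missing
# attributes become empty dicts so json_normalize does not choke on None.
attrs = pd.json_normalize(business_df["attributes"].apply(lambda d: d or {}).tolist())

# Note: nested values such as BusinessParking may arrive as stringified dicts
# in the raw data and would need an ast.literal_eval pass before normalizing.
business_df = pd.concat(
    [business_df.drop(columns=["attributes"]).reset_index(drop=True), attrs],
    axis=1,
)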

Within the attributes feature we can also map ordered categorical variables to numerical values. For example, NoiseLevel is a string input with the following values:

  • very_loud
  • loud
  • average
  • quiet
  • NaN

It is therefore possible to encode these string features with numeric values, for example a scale that represents noise levels, as sketched below. We can repeat the same exercise for attributes such as AgesAllowed, Alcohol, RestaurantsAttire, and others.
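A minimal example of such an ordinal encoding; the exact scale values, and the assumption that the raw strings are already cleaned of stray quotes, are ours:

# Hypothetical ordinal scale for NoiseLevel; unmapped values (including NaN) stay NaN.
noise_scale = {"quiet": 1, "average": 2, "loud": 3, "very_loud": 4}
business_df["NoiseLevel"] = business_df["NoiseLevel"].map(noise_scale)

# The same pattern applies to other ordered attributes, e.g. RestaurantsAttire.
attire_scale = {"casual": 1, "dressy": 2, "formal": 3}
business_df["RestaurantsAttire"] = business_df["RestaurantsAttire"].map(attire_scale)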

Furthermore, the categories feature contains a list of strings that are not mutually exclusive: a single business can fall under Mexican, Burgers, and Gastropubs at the same time. These are therefore encoded as binary features.
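One way to build these binary columns, assuming categories is stored as a comma-separated string in the raw file (the documentation snippet above shows it as an array):

# One binary column per category; str.get_dummies splits the string on ", ".
cat_dummies = business_df["categories"].str.get_dummies(sep=", ")
business_df = business_df.join(cat_dummies)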

In a simple text-processing step, we notice that ~1,000 businesses are named Starbucks and ~800 are named McDonalds. We therefore define a binary chain feature, where 1 indicates that the business is part of a chain, with a name required to appear at least 5 times to be considered a chain.
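A sketch of the chain feature, using the 5-occurrence threshold stated above:

# A business counts as a chain if its exact name appears at least 5 times.
name_counts = business_df["name"].value_counts()
chain_names = set(name_counts[name_counts >= 5].index)
business_df["chain"] = business_df["name"].isin(chain_names).astype(int)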

We use the latitude and longitude features to draw geodetic features from the dataset. We use a minimum bounding box to query all other businesses within a given radius. In the context of this project we set the radius to 2 kilometers, a reasonable distance customers are willing to walk between businesses.

From geodetic data we can define features such as density, the number of businesses in the queried area. In addition we can compare each business against its surroundings by applying a z-score normalization: for example, the z-score of a business's price is the difference between its price and the mean of the group, divided by the group's standard deviation.
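A naive sketch of both features, assuming the price range column is numeric; the degree-to-kilometer conversion and the O(n²) loop are simplifications (a spatial index such as a k-d tree would scale better):

import numpy as np

KM_PER_DEG_LAT = 111.0  # rough kilometers per degree of latitude

def bounding_box_features(df, radius_km=2.0):
    """Count neighbors inside a ~radius_km bounding box around each business
    and z-score its price against that neighborhood (column names assumed)."""
    lat, lon = df["latitude"].to_numpy(), df["longitude"].to_numpy()
    price = df["RestaurantsPriceRange2"].to_numpy(dtype=float)
    dlat = radius_km / KM_PER_DEG_LAT
    density = np.zeros(len(df))
    price_z = np.zeros(len(df))
    for i in range(len(df)):
        dlon = radius_km / (KM_PER_DEG_LAT * np.cos(np.radians(lat[i])))
        mask = (np.abs(lat - lat[i]) <= dlat) & (np.abs(lon - lon[i]) <= dlon)
        density[i] = mask.sum() - 1  # exclude the business itself
        group = price[mask]
        std = np.nanstd(group)
        price_z[i] = (price[i] - np.nanmean(group)) / std if std > 0 else 0.0
    return density, price_z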

Boost Businesses With External Sources

For feature engineering, it can be helpful to draw on information beyond the dataset. Each business has a corresponding postal code in US or Canadian format.

Conveniently, the IRS releases Individual Income Statistics and Statistics Canada releases comparable income data. Although not specific to the business itself, the income of the locality can play a role in viability.

In order to preserve the privacy of citizens, the IRS does not release exact income figures; instead the data is categorical. For example, a value of 3 indicates income between $50,000 and $75,000, and a value of 5 indicates income between $100,000 and $200,000.

Hence we can match each postal code with the corresponding median household income, making sure to convert Canadian dollars to US dollars and binning Canadian income data per the IRS method.
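A sketch of that conversion and binning; the exchange rate and the exact bracket edges (chosen to match the examples above) are assumptions:

import pandas as pd

CAD_TO_USD = 0.75  # assumed exchange rate

# Bracket edges consistent with the examples above (3 -> $50k-$75k, 5 -> $100k-$200k).
bins = [0, 25_000, 50_000, 75_000, 100_000, 200_000, float("inf")]
labels = [1, 2, 3, 4, 5, 6]

def income_bracket(median_income, currency="USD"):
    """Convert a locality's median income to USD and bin it per the IRS categories."""
    usd = median_income * CAD_TO_USD if currency == "CAD" else median_income
    return int(pd.cut([usd], bins=bins, labels=labels)[0])

income_bracket(60_000)           # -> 3
income_bracket(150_000, "CAD")   # ~112,500 USD -> 5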

Reviews

The review.json file contains full review text data including the user_id that wrote the review and the business_id the review is written for.

Below is a snippet laying out the attributes:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

We could simply aggregate by business_id and be done with the feature engineering. However, given that each review has a timestamp associated with it, we can directly measure how a given location changes over time.

The mean of user star ratings returns the average business score. Grouping by year helps us see how these features change over time: is the business improving or lagging behind?
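A sketch of that aggregation, assuming review.json fits in memory (in practice it may need to be read in chunks); the first-to-last-year delta is one possible definition of a stars_change-style trend feature:

import pandas as pd

review_df = pd.read_json("review.json", lines=True)
review_df["year"] = pd.to_datetime(review_df["date"]).dt.year

# Average star rating per business and per year.
yearly = review_df.groupby(["business_id", "year"])["stars"].mean().unstack()

# Simple trend: difference between the last and first active year's average rating.
stars_change = yearly.apply(
    lambda row: row.dropna().iloc[-1] - row.dropna().iloc[0], axis=1
)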

Understanding Businesses

The first question to ask is: where are those ~192,000 businesses located? Per the plot below: Vegas, baby!

At the state level, however, the majority of businesses are located in Arizona, as shown below:

Furthermore, we can see that the majority of businesses have a rating between 3.0 and 4.5, with a mean around 3.5.

Finally, we compare the distribution of the open/closed label across business types by plotting counts colored by the target variable. Restaurants account for the large majority of businesses and have the highest proportion of closures.

Checkin Data

The checkin.json file lists all checkins for businesses where available:

{
    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, a comma-separated list of timestamps for each checkin, each with format YYYY-MM-DD HH:MM:SS
    "date": "2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02"
}

As a first step, we can explore trends in checkins as shown in the plot below, where the y-axis corresponds to the number of checkins aggregated over all years and split into 30-minute intervals. For example, the plot tells us that the peak average checkin on Saturday and Sunday nights occurs around 8 PM, along with a confidence interval.

On a more macro scale we can also explore checkins in December, which exhibits strong seasonality. In the plot below, the highlighted bands correspond to weekends.

From the data we can extract the average number of monthly checkins. In addition, we define the span as the time in seconds between the first and the last checkin: the longer a business has been open, the higher the probability it will remain open (see the Sunrise Problem).
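A sketch of both features; the 30-day month approximation and the variable names are ours:

import pandas as pd

checkin_df = pd.read_json("checkin.json", lines=True)

def checkin_features(date_str):
    """Parse the comma-separated timestamps, return the span in seconds between
    the first and last checkin and the approximate checkins per month."""
    stamps = pd.to_datetime([s.strip() for s in date_str.split(",")])
    span = (stamps.max() - stamps.min()).total_seconds()
    months = max(span / (30 * 24 * 3600), 1.0)  # crude 30-day months, at least 1
    return pd.Series({"span": span, "monthly_checkins": len(stamps) / months})

checkin_feats = checkin_df["date"].apply(checkin_features)
checkin_df = pd.concat([checkin_df[["business_id"]], checkin_feats], axis=1)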

Model Selection & Scoring

The question we are trying to answer falls under supervised classification. Given the heavily imbalanced target variable, we omit scoring based on accuracy and instead consider metrics such as precision, recall, the F1-score, and the area under the ROC curve (AUC).

In a preliminary round, we applied several supervised learning algorithms; the top performers were logistic regression and tree-based ensembles, with XGBoost coming out ahead.

In the following sections we elaborate on the individual techniques that were combined to produce reliable results.

Splits

The intent of feature engineering is to obtain a numerical representation of the data that can be fed directly into an algorithm. The first step is to split the data into training and testing sets.

In addition, we apply min-max feature scaling fitted on the training data only, to avoid data leakage. The training and testing data are then transformed using that fit.
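A minimal sketch of the split and scaling, assuming X and y hold the engineered features and the is_open target (the test size and random seed are arbitrary choices here):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stratified split keeps the open/closed ratio identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then transform both splits.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)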

Grid Search Cross Validation

The method, implemented in scikit-learn as GridSearchCV, is popular with data scientists because it is exhaustive: by combining cross validation with a grid search over candidate hyperparameters, we obtain a tuned model.
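A sketch with a deliberately small, hypothetical grid for an XGBoost classifier; the actual search space and scoring choice depend on the project:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBClassifier(),
    param_grid,
    scoring="roc_auc",  # AUC rather than accuracy, given the class imbalance
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)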

Feature Relevance

Certain algorithms, such as tree-based models, can struggle in high-dimensional spaces. Having too many features also increases the chance of including noisy ones.

The first approach for detecting noisy features is logistic regression with L1 regularization, which drives the weights of uninformative features to 0. The second approach uses the featexp package to identify noisy features.

Ultimately the most interpretable method was permutation feature importance: we shuffle a given feature and calculate the change in the model’s prediction error. A feature is ‘important’ if shuffling it significantly increases the error, and ‘unimportant’ if the error remains essentially unchanged.
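A minimal sketch using scikit-learn's implementation, assuming a fitted classifier named model and a list feature_names matching the columns of the scaled matrix:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn on the held-out set and measure the drop in AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)

# Print the ten most important features.
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(feature_names[idx], round(result.importances_mean[idx], 4))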

Dealing With Imbalances

Finally, the most impactful aspect of the project was dealing with the imbalanced target variable. Recall that only about 20% of businesses are listed as closed.

In practical terms this means the minority class is overwhelmed by the majority class: the algorithm does not see enough minority observations to learn a reliable decision boundary.

One way to address this is undersampling: keep the minority class as is and draw an equal number of observations at random from the majority class. In our experiments oversampling worked slightly better: duplicate minority-class observations until the dataset is balanced.

Whichever sampling method is chosen, it is crucial to apply it to the training data only. Resampling the test data is a common pitfall, since the test set would no longer reflect the real class distribution the model will face.
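A sketch of random oversampling applied to the training split only (a library such as imbalanced-learn offers the same operation, plus SMOTE-style variants, out of the box):

import numpy as np
import pandas as pd
from sklearn.utils import resample

# Combine training features and labels so rows are resampled together.
train = pd.DataFrame(np.asarray(X_train))
train["is_open"] = np.asarray(y_train)

majority = train[train["is_open"] == 1]
minority = train[train["is_open"] == 0]

# Duplicate minority-class rows (closed businesses) until classes are balanced.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)

X_train_bal = balanced.drop(columns="is_open").to_numpy()
y_train_bal = balanced["is_open"].to_numpy()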

Model Performance

The table below summarizes the performance of several models. Across the board, all models are better at predicting open businesses (Class 1) than shuttered ones (Class 0).

Overall we can see that tree-based models tend to perform best, with AUC hovering around 0.75.

That being said, model interpretability is as important as, if not more important than, raw performance. We must therefore dive deeper into each model to understand which features drive its decisions.

The table below shows the top 5 most important features for the models.

Interestingly, within the same model, such as logistic regression, being a Restaurant can carry a positive or a negative weight depending on the sampling method.

We also notice some expected features, such as RestaurantsPriceRange2, which relates to price, and some peculiar ones, such as AcceptsInsurance and BikeParking. In the case of insurance, the feature is likely relevant to businesses such as massage parlors or doctors’ clinics.

In the winning model, the following features are crucial for business success:

  • Be a Restaurant
  • Serve lunch
  • Also serve dinner
  • Your RestaurantsPriceRange2 matters
  • Be a chain

XGBoost Model Interpretation

The main features for XGBoost were obtained from the feature_importances_ attribute. However, we want a way to explain individual predictions and show how to optimize for business success.

For this study we use the SHAP (SHapley Additive exPlanations) package to derive individualized feature attributions. Shapley values are applied on a per-prediction basis to aid explainability and answer questions like “What caused my business to be marked as closed?”

To motivate the use of SHAP, we first look at XGBoost’s plot_importance method, which produces three different rankings depending on the importance type (see the sketch after the list below):

  • weight: the number of times a feature appears in a tree
  • gain: the average gain of the splits which use the feature
  • cover: the average coverage of the splits which use the feature, where coverage is defined as the number of samples affected by the split
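A sketch of plotting all three rankings side by side, assuming a fitted model named xgb_model:

import matplotlib.pyplot as plt
from xgboost import plot_importance

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, imp_type in zip(axes, ["weight", "gain", "cover"]):
    # Each importance type produces a different top-10 ranking.
    plot_importance(xgb_model, importance_type=imp_type, max_num_features=10,
                    ax=ax, title=imp_type)
plt.tight_layout()
plt.show()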

In the plot below we show the mean absolute SHAP value per feature. The x-axis shows the average magnitude of the change in model output when a feature is hidden from the model. Because the effect of hiding a feature depends on which other features are also hidden, Shapley values are used to enforce consistency and accuracy.

The plot below is a density scatter plot of SHAP values for each feature to identify how much impact each feature has on the model output for each observation in the dataset. The summary plot combines feature importance with feature effects. Each point on the summary plot is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The color represents the value of the feature from low to high.
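A minimal sketch for producing both plots with the shap package, assuming a fitted XGBoost model xgb_model and the held-out feature matrix X_test with its feature_names:

import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Bar plot of the mean |SHAP value| per feature (global importance).
shap.summary_plot(shap_values, X_test, feature_names=feature_names, plot_type="bar")

# Density scatter ("beeswarm"): one point per observation and feature,
# positioned by its SHAP value and colored by the feature's value.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)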

The coloring by feature value shows us that an increasing stars_change (average change in star rating over time) is a good predictor of staying open. For review_count (the number of reviews), having too few reviews can hurt, although a high review_count can also mean a high volume of negative reviews. The third most important feature, useful (total count of useful review votes), appears to be a positive indicator of success.

The plot also allows us to identify outliers: overlapping points are jittered in the y-axis direction, so we get a sense of the distribution of the Shapley values per feature. For a particular subset of businesses, a high count of useful votes can actually be an indicator of closing soon. For example, it is not unheard of for someone to comment “Stay away from the chicken, seemed undercooked”.

SHAP dependence plots are an additional visualization showing the effect of a single feature across the whole dataset. Unlike partial dependence plots, SHAP values account for interaction effects between features and are only defined in regions of the input space supported by data.

The plot below applies to review_density (normalized review count of business versus all others within 2km radius).

In the plot above, the vertical dispersion is driven by interaction effects, and another feature, here RestaurantsGoodForGroups, is chosen for coloring to highlight possible interactions. We can see that a higher review count relative to surrounding businesses is a good indicator of success. In addition, being good for groups tends to correlate positively with review_density.
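A sketch of the corresponding call, reusing the explainer output from the summary plots above (column names assumed to be present in X_test):

import shap

shap.dependence_plot(
    "review_density",
    shap_values,
    X_test,
    feature_names=feature_names,
    interaction_index="RestaurantsGoodForGroups",  # coloring feature
)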

The correlation matrix below ties into the plot from above. We can notice a positive correlation between review_density and RestaurantsGoodForGroups.

Conclusion

Knowing whether a business will close its doors or stay open is an assessment entrepreneurs undertake before investing. With machine learning, we are able to identify features that answer the original question of this article. We were also able to provide model interpretability on a per-prediction basis.

Although a comprehensive answer to the problem may remain out of reach, the solution nonetheless offers a path forward. Businesses can use the model interpretations to optimize for success and identify the metrics they need to improve.

Looking forward, we can take advantage of the text information within the review.json file. A sample NLP exercise would be to extract sentiments or groups of words, as a function of time or location, that indicate business performance.

The project was made possible thanks to sharpestminds.com and Aditya Subramanian. For an in-depth view of the process with the corresponding code, head to the GitHub repository:

https://github.com/NadimKawwa/PredictBusinessSuccess
