Using Yelp Data to Predict Restaurant Closure

Published in Towards Data Science · 11 min read · Jan 7, 2018

Michail Alifierakis is an aspiring data scientist and a Chemical Engineering PhD candidate at Princeton University, where he models the mechanical and electrical properties of complex materials. He was an Insight Data Science Fellow in Fall 2017. At Insight, in three weeks, he built a model that predicts restaurant closure within a four-year time frame.

As a fellow at Insight Data Science, I had the opportunity to spend three weeks building the Restaurant Success Model: a model that evaluates whether a restaurant is likely to succeed or fail within the next four years. The most challenging parts of this project were building a dataset that contained current information about restaurants that existed at some point in the past, and engineering predictive features.

Past efforts to predict the success of restaurants from the Yelp star rating were unsuccessful. On the other hand, the text of Yelp reviews is very predictive of restaurant closure on short time scales. In my work I engineered features that are predictive on longer time scales (four years) using metadata from Yelp reviews and features based on performance relative to surrounding restaurants. More details about the problem, the procedure used and the results obtained are presented below.

1. Defining the problem

The U.S. restaurant industry is very large, generating revenues of about $799 billion in 2016, distributed among more than 1 million businesses that employ about 10% of the U.S. workforce. Given the large number of small businesses and the total size of this industry, I decided to create a model that can help restaurant lenders (such as banks) and investors decide whether they should lend to or invest in a particular restaurant, based on the likelihood that it will fail within the next few years.

Restaurant closure is a very clear metric for success, although more complicated success metrics can be designed. Choosing this metric allowed me to frame the problem as a classification problem and made it easier to obtain labeled data. More detailed information about the performance of each restaurant is hard to find, as most restaurants are private companies.

2. Building the dataset

To the best of my knowledge, a labeled dataset that can be used to solve the above problem of restaurant closure does not exist. As a starting point for constructing such a dataset, I had to find a list of restaurants that existed at some point in the past and then match that list with current information about the restaurants.

My starting dataset was a Yelp dataset released in 2013. This dataset contains information about businesses in the Phoenix, AZ area. Using the training data of this dataset, I decided to work only with restaurants, and only with those that were still open when the dataset was obtained.

This dataset and all other datasets released by Yelp for academic use do not contain the real business identification codes or phone numbers. These codes, if available, would make it easy to obtain current data from the Yelp API. To overcome the lack of business ids, I used the Yelp Search API to search for each restaurant from the old list by name and address, but the results were disappointing.

Using this approach, only two thirds of the restaurants were matched with current information. The remaining searches gave results that did not correspond to the real restaurant. The basic problem with this search method, though, was not the number of data points but the bias in the way the results are returned.

The restaurants that were returned correctly through Yelp Search were restaurants that are either still open or closed very recently. This resulted in a dataset that contained almost exclusively successful restaurants (since even the restaurants that closed are restaurants that managed to remain open for about four years since the release of the original dataset). For the remaining restaurants I could not be certain whether they closed or not until I obtained specific information for each of those restaurants.

2a) Obtaining the right current restaurant information

Even though Yelp does not return restaurants that have been closed for a long time through the Yelp Search API, it does retain that information in its database. My solution was to use the Google Search API to search the yelp.com domain and extract the business ids of the restaurants that were not yet matched. Those business ids were used to pull current data directly from Yelp through the Yelp Business API. This allowed me to get information on most of the remaining restaurants and build a meaningful model.
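The id-extraction step can be illustrated with a small sketch. It assumes that the search results are plain URLs and that a Yelp business page has the form yelp.com/biz/<business id>; the function name and the example URLs are hypothetical, and the actual search call is left out because its details are not given here.

```python
from urllib.parse import unquote, urlparse


def yelp_business_id(url):
    """Return the path segment after /biz/ for a yelp.com URL, else None.

    Assumption: business pages live at https://www.yelp.com/biz/<business id>.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Accept yelp.com and its subdomains only.
    if host != "yelp.com" and not host.endswith(".yelp.com"):
        return None
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) >= 2 and parts[0] == "biz":
        return unquote(parts[1])
    return None


# Hypothetical search results for one unmatched restaurant.
results = [
    "https://www.yelp.com/search?find_desc=example+diner",
    "https://www.yelp.com/biz/example-diner-phoenix?osq=tacos",
]
ids = [bid for bid in map(yelp_business_id, results) if bid is not None]
print(ids)
```

Only the second URL yields an id; search pages and non-Yelp domains are skipped, so the remaining ids can be passed to the business lookup.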

2b) Joining the old and new datasets

To confirm that a restaurant in the old dataset and a restaurant in the new dataset were the same, I checked two criteria: whether the first four characters of either name were contained in the other name, and whether the first four characters of the two addresses were the same. For restaurants that matched only one of these two criteria, I manually checked to identify the reason, and I created a dictionary of restaurant names that have changed in the Yelp database since 2013 (e.g. Kentucky Fried Chicken changed to K.F.C.).
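A minimal sketch of this matching rule follows. The lowercasing and whitespace normalization, the function names and the record layout are assumptions made for illustration; the example records are invented.

```python
def _norm(text):
    """Lowercase and collapse whitespace (normalization is an assumption)."""
    return " ".join(text.lower().split())


def name_matches(old_name, new_name, k=4):
    """True if the first k characters of either name occur in the other name."""
    old, new = _norm(old_name), _norm(new_name)
    if not old or not new:
        return False
    return old[:k] in new or new[:k] in old


def address_matches(old_address, new_address, k=4):
    """True if both addresses start with the same k characters."""
    old, new = _norm(old_address), _norm(new_address)
    if not old or not new:
        return False
    return old[:k] == new[:k]


def classify_match(old, new):
    """Return 'match', 'review' (exactly one criterion met) or 'no_match'."""
    name_ok = name_matches(old["name"], new["name"])
    address_ok = address_matches(old["address"], new["address"])
    if name_ok and address_ok:
        return "match"
    if name_ok or address_ok:
        return "review"  # candidates for the manual check
    return "no_match"


# Invented records: a renamed restaurant at an unchanged address.
old = {"name": "Kentucky Fried Chicken", "address": "1234 E Example St"}
new = {"name": "K.F.C.", "address": "1234 E Example St"}
print(classify_match(old, new))
```

The renamed restaurant meets only the address criterion, so it lands in the "review" bucket, which is where a dictionary of known name changes would be applied.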

In total, the final dataset contains 3,327 restaurants and about 23% of them have closed since 2013. The process followed to create this dataset is outlined in the following graph.

Graph of process of creating the discussed dataset. The percentages indicate the percentage of data points carried from the previous step.

3. Feature engineering

The predictive ability of the original features provided by Yelp (e.g. the Yelp star rating) was very poor. As seen below, the Yelp star rating distributions look very similar for open and closed restaurants.

Yelp star distribution for restaurants that remained open in the 4-year period (black) and for restaurants that closed (red). On the left, the percentages per category are shown, where the similarity of the two categories becomes apparent. On the right, the absolute numbers are presented, which give a better picture of the class imbalance.

Generating meaningful features was key to building this model. I generated them from Yelp review and location metadata. Some of these features are the following:

Is the restaurant part of a chain? If the restaurant name appears more than once in the list then it is considered to be part of a chain. This includes national or local chains. Some chains that are represented by only one restaurant in the particular list did not count as a chain due to the way a chain is defined.
What is the local restaurant density? Based on the restaurant coordinates, I created a list of restaurants within a 1-mile radius of each restaurant in the list.
What is the review count, star rating and price (i.e. general dining cost) relative to surrounding restaurants? The surrounding restaurants within a 1-mile radius of each restaurant were identified (as in the restaurant density calculation), and the relative values of the review count, star rating and price were calculated by subtracting the group mean from each restaurant's value and dividing by the group's standard deviation.
What is each restaurant’s age? This value is approximated by the date of the first Yelp review. This means that restaurants that joined Yelp late or do not receive frequent reviews would appear younger than they really are. Also, the restaurant age is limited by the date Yelp was founded (i.e. 2004).
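The chain flag, the local density and the relative metrics described above could be computed roughly as follows. This is a sketch, not the project's code: the haversine distance, the use of the population standard deviation, the inclusion of each restaurant in its own comparison group, and all field names are assumptions, and the sample records are invented.

```python
import math
from collections import Counter
from statistics import mean, pstdev

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def add_features(restaurants, radius=1.0, keys=("review_count", "stars", "price")):
    """Return one feature dict per restaurant (O(n^2), fine for a few thousand)."""
    name_counts = Counter(r["name"] for r in restaurants)
    features = []
    for i, r in enumerate(restaurants):
        neighbors = [
            o for j, o in enumerate(restaurants)
            if j != i and haversine_miles(r["lat"], r["lon"], o["lat"], o["lon"]) <= radius
        ]
        row = {
            "is_chain": name_counts[r["name"]] > 1,  # name appears more than once
            "density": len(neighbors),               # restaurants within the radius
        }
        group = neighbors + [r]  # assumption: the restaurant is part of its own group
        for key in keys:
            values = [g[key] for g in group]
            spread = pstdev(values)
            # z-score relative to the local group; 0.0 when there is no spread
            row["relative_" + key] = (r[key] - mean(values)) / spread if spread > 0 else 0.0
        features.append(row)
    return features


# Invented sample: two nearby branches of one chain and one isolated restaurant.
sample = [
    {"name": "Burger Hut", "lat": 33.4500, "lon": -112.07, "review_count": 10, "stars": 4.0, "price": 1},
    {"name": "Burger Hut", "lat": 33.4510, "lon": -112.07, "review_count": 30, "stars": 3.0, "price": 2},
    {"name": "Solo Cafe", "lat": 33.6000, "lon": -112.07, "review_count": 50, "stars": 4.5, "price": 2},
]
features = add_features(sample)
for row in features:
    print(row)
```

An isolated restaurant has no neighbors to compare against, so its relative metrics fall back to zero here; a real pipeline might prefer to treat that case as missing data instead.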

4. Machine learning models and optimization

The dataset was split into an 80% training set and a 20% test set using stratified sampling. The basic problem with this dataset is that the two classes are not well separated, even after introducing additional features, some of which are described above.

There are many reasons why a restaurant can succeed or fail that are not included in our feature space (e.g. other neighboring restaurants, surrounding venues, updated tax system, health inspection results etc.). A complicated decision boundary would not be beneficial in this case. This was confirmed by testing the performance of different machine learning models on our data using accuracy, precision, recall and F1 score as evaluation metrics.

Because more complicated models did not improve performance, I chose to use a linear logistic regression model, which is simple and has good interpretability. Based on the use case of restaurant lending, I chose to optimize my model parameters for increased precision of open restaurants using grid search with cross-validation. The parameters optimized were the regularization strength (L2 regularization was used) and the intercept scaling factor. The results for my parameter choices are shown below.
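The setup described here might look roughly like the following scikit-learn sketch. The data is synthetic and only imitates the roughly 23% closure rate, and the parameter grid values and the five folds are assumptions, not the values used in the project. The liblinear solver is chosen because scikit-learn's intercept_scaling parameter only has an effect with that solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the article's dataset): 1 = stayed open, 0 = closed.
rng = np.random.default_rng(0)
n = 600
y = (rng.random(n) > 0.23).astype(int)          # roughly 23% closures
X = rng.normal(size=(n, 4)) + 0.6 * y[:, None]  # weakly separated classes

# 80/20 split with stratified sampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Score each candidate by the precision of the "open" class (label 1).
open_precision = make_scorer(precision_score, pos_label=1, zero_division=0)

# Hypothetical grid over regularization strength and intercept scaling.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "intercept_scaling": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # L2 penalty is the default
    param_grid,
    scoring=open_precision,
    cv=5,
)
search.fit(X_train, y_train)

test_precision = precision_score(y_test, search.predict(X_test), pos_label=1)
print(search.best_params_, round(test_precision, 3))
```

Scoring on precision alone can favor models that label very few restaurants as open, so it is worth checking recall for the chosen parameters as well.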

On the left, there is a list of evaluation metrics for the model performance. On the right, the confusion matrix is presented, which gives a different perspective on model performance.

As demonstrated above, the precision of open restaurants is 91%. This means that among the restaurants that the model predicts will remain open, 91% actually remained open. The remaining 9% are false positives. A bank that based its lending decisions on this model would potentially have a 4-year default rate of 9%, while a bank that gave loans to all restaurants in the list indiscriminately would have a 4-year default rate of about 23% (equal to the restaurant closure rate in our dataset).
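The arithmetic behind this comparison is short enough to spell out. The counts below are hypothetical, chosen only so that their ratio reproduces the reported 91%; they are not the entries of the actual confusion matrix.

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were correct."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("no positive predictions")
    return true_positives / predicted_positive


# Hypothetical counts among restaurants the model predicts will stay open.
predicted_open_and_stayed_open = 455
predicted_open_but_closed = 45

open_precision = precision(predicted_open_and_stayed_open, predicted_open_but_closed)
model_default_rate = 1.0 - open_precision  # share of approved loans that default
baseline_default_rate = 0.23               # closure rate reported for the whole dataset

print(f"precision of open class: {open_precision:.2f}")
print(f"4-year default rate with the model: {model_default_rate:.2f}")
print(f"4-year default rate lending to everyone: {baseline_default_rate:.2f}")
```

This treats every closure as a default, which is the same simplification made in the text.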

Looking at the confusion matrix above, it can be seen that the predictive ability of the model is very poor in the case of closed restaurants. Among the restaurants that are predicted as closed, only 36% of them actually ended up closing in a 4-year period. This is a result of the poor separation between the two classes that was achieved within our feature space. The precision of closed restaurants can be further improved but there is always a trade-off with the precision of open restaurants. Based on our use case (i.e. restaurant lending), I chose to focus my attention on improving the precision of open restaurants. The model should be adjusted further based on the risk a bank is willing to accept for the sake of offering more loans.

5. Feature importance and model interpretation

The feature importance that resulted from this model is shown below. The features that contributed towards the restaurants remaining open are shown in black, while the features that contributed towards the restaurants closing are shown in red.

A list of features ranked on decreasing importance. Features that contribute towards the restaurants remaining open are shown in black, while features that contribute towards restaurant closure are shown in red.

The most important feature, as ranked by our model, is whether the restaurant is part of a chain. The restaurants that are part of chains are more likely to remain open. This is not surprising as restaurant chains usually operate at a higher profit margin than individual restaurants.

The relative review count (i.e. the number of reviews relative to surrounding restaurants within a 1-mile radius) is the second most important feature that contributes towards the restaurant remaining open. It is hard to strictly label this metric as an indication or a cause of success. A large number of reviews is an indication of higher traffic in restaurants, but it is also a reason to appear higher in Yelp search results, which by itself can drive more traffic.

High restaurant density is correlated with higher closure rates. This is probably due to increased competition. It is interesting to compare this feature with similar restaurant density (i.e. the density of restaurants within a 1-mile radius that belong to the same food category). High restaurant density is negative for restaurant success, while high similar restaurant density is positive. For instance, owning a Chinese restaurant in an area with a large number of restaurants is generally negative for that restaurant, but if it is in an area with a lot of other Chinese restaurants (e.g. Chinatown), the risk of failure is reduced. One possible hypothesis for this observation is that the lack of differentiation, from a consumer perspective, between restaurants in areas like Chinatown reduces competition between individual restaurants (this is not generally true for commodities, but when trying a new restaurant the consumer generally lacks information). Another possible hypothesis is that consumers' appetite does not change easily, and therefore popular restaurants in Chinatown can drive traffic to surrounding restaurants when they are too busy to meet demand: people who go to a popular Chinese restaurant for Chinese food will prefer a nearby similar restaurant if their first choice is too busy to serve them. This topic is open to further research, and a deeper understanding can be achieved by focusing on the data in particular regions with a high density of similar restaurants.

Restaurants claimed on Yelp are more likely to remain open. A claimed Yelp business is a business where the owner has put the effort to go on Yelp and declare the business as their own. In that sense, a positive correlation with restaurant success was expected.

An increase in the number of relative reviews per week seems to contribute negatively to restaurant success. This is a counter-intuitive result, and it probably has two causes: 1) the relative reviews per week are calculated as the number of relative reviews divided by the restaurant age (time since the oldest review); the restaurant age is positively correlated with restaurant success, and dividing a metric by this number creates a negative correlation, which might be more important than the review count effect, and 2) the number of relative reviews per week is correlated with the relative review count that the model already took into account. Logistic regression models are not good at dealing with correlated features.

6. Suggested improvements

The results of this model are very promising and they indicate a significant improvement for lending purposes relative to a random model. The key for further improvement, in my opinion, is adding more features, possibly through utilizing different data sources.

One possible reason for a restaurant closure is health inspection ratings. Adding health inspection ratings as a feature in our model could increase its precision.
Another reason for restaurant closure is high rent charges. Adding rent pricing per region could help explain more restaurant closures.
A change in population demographics in certain areas of a city can increase or decrease traffic to some restaurants.
New surrounding venues are another reason that can drive traffic to restaurants and lead to success that cannot be predicted from this model in its current form.
Success of a restaurant is currently defined as the restaurant remaining open. A more accurate definition of success that would be more appropriate for lending purposes would be correlated to restaurant revenue. Even though the revenue of most restaurants is not public information, relevant metrics can be constructed. For instance, multiplying the number of weekly comments received by a restaurant with the price (i.e. general dining cost) of the restaurant can act as a useful metric.
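The revenue proxy suggested in the last item can be sketched in a few lines. The function names, the use of the first review date as the start of the observation window, and the example values are assumptions for illustration; the result is a relative score, not a revenue estimate in dollars.

```python
from datetime import date


def weekly_reviews(review_count, first_review, as_of):
    """Average number of reviews per week since the first review."""
    weeks = (as_of - first_review).days / 7
    if weeks <= 0:
        raise ValueError("as_of must be later than first_review")
    return review_count / weeks


def revenue_proxy(review_count, price_level, first_review, as_of):
    """Weekly reviews multiplied by the price level (general dining cost)."""
    return weekly_reviews(review_count, first_review, as_of) * price_level


# Invented example: 104 reviews over exactly 52 weeks at price level 2.
proxy = revenue_proxy(104, 2, date(2012, 1, 2), date(2012, 12, 31))
print(proxy)
```

Because the score depends on how actively customers review, it is best compared between restaurants in the same area and category rather than read in absolute terms.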

Summary

This model was built for restaurant lending purposes and identifies restaurants that remain open in a 4-year period with a precision of 91%.
The dataset was built by pulling recent information about restaurants that used to exist in 2013 in Phoenix, AZ through the Yelp and Google Search APIs.
Some very predictive features of this model were built using Yelp review and location metadata. This helped to construct relative metrics like restaurant density and quantities that are relative to surrounding restaurants.
The machine learning model used was a simple logistic regression model, which was optimized for precision of open restaurants using grid search with cross-validation.
One lesson learned is that the most important factor that defines whether a restaurant will remain open is whether it is part of a chain. Restaurants that belong to chains close less frequently.
Another lesson learned is that opening a restaurant in an area with a lot of other restaurants is generally negative, unless those restaurants offer similar food (e.g. opening a Chinese restaurant in Chinatown).
This model can be improved with the incorporation of further datasets such as health inspection data (not publicly available for Phoenix, AZ at the moment), and information about surrounding venues.

The code for this project can be found in this GitHub repository.