
Discover Your Next Favorite Restaurant - Exploration and Visualization on Yelp Dataset

Do you use Yelp to find good restaurants? This post reveals insights and patterns in the popular Yelp Dataset.

My generated word cloud for restaurant X (you will find X by the end of this post)

You just came to this city for the first time in your life.

It is lunchtime. You are trying to find a good local restaurant but unfortunately, you don’t know anyone in the area to get recommendations from. What do you do now?

If you choose to go on Yelp and look for answers, you are not the only one.

Photo by Jonas Leupe on Unsplash

As a platform which publishes crowd-sourced reviews about businesses, Yelp can help diners with their decision-making processes, providing valuable features such as restaurant ratings, tips, and reviews that other diners contributed. As of March 2020, there are 211 million cumulative reviews on Yelp [1].

With this massive amount of data, Yelp also releases a subset of their businesses, reviews, and user data for educational and academic purposes [2]. It also holds the "Yelp Dataset Challenge", which provides a chance for students to conduct research and analysis through mining this data (view the past rounds of winners and their papers here).

Now I am sure you are curious about what is in this dataset and what we can discover. Without further ado, let’s begin our journey of exploration!


Table of Curiosities

  1. What is in the business data?
  2. Where are the restaurants located?
  3. Which restaurants have high occurrences? Are their ratings comparable?
  4. What can we find in the restaurant attributes?
  5. How can we find top restaurants based on our needs?
  6. What can we do next?

Dataset

In this post, I will explore the Yelp dataset through Kaggle (link to the dataset here). Note that the dataset was updated in March 2020 with more recent data, and my analysis is performed on the updated dataset (as of July 2020).

This dataset contains 6 JSON files:

business.json – business data such as location, rating (stars), and attributes.
review.json – user reviews, including user IDs, business IDs, etc.
user.json – user data such as number of reviews written, friends, and some metadata.
checkin.json – check-in data, including check-in ID and business ID.
tip.json – tips written by users, which are shorter reviews. (At first I thought these were the "tips" that customers leave after the service. Another reminder to make sure we fully understand the data before analysis.)
photo.json – photo data, including captions and labels.

For full documentation on these files and data formats, please check out this link.

It is important to note that this dataset is a subset of all the data that Yelp owns so we need to be mindful before drawing any bold conclusions. This post uses the business.json and tip.json files and I will use some of the other files for future posts (review.json etc.).

To see my full Python code, check out my Kaggle kernel or my Github page. This post is meant to present analysis results and findings instead of going over the code in detail.


Peek at the Business Data

Since each file is in JSON format, we would need to load it into a DataFrame first. Then, let’s look at the dimensions of the business data:

(209393, 14)

There are 209393 businesses and 14 attributes.
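As a minimal sketch of the loading step, assuming the file stores one JSON object per line, pandas can read it with `lines=True`. The two records below are made up so the snippet runs without the real file; with the actual dataset you would pass the path to business.json instead of `sample`.

```python
import io
import pandas as pd

# Two made-up records in a one-JSON-object-per-line layout (hypothetical values).
sample = io.StringIO(
    '{"business_id": "a1", "name": "Cafe A", "city": "Toronto", "stars": 4.0}\n'
    '{"business_id": "b2", "name": "Diner B", "city": "Las Vegas", "stars": 3.5}\n'
)

# lines=True tells pandas to parse every line as a separate JSON object.
business = pd.read_json(sample, lines=True)

print(business.shape)    # (number of rows, number of columns)
print(business.columns)  # column names
```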

Let’s see what the different attributes are:

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'stars', 'review_count', 'is_open', 'attributes', 'categories', 'hours'], dtype='object')

Since we are mostly interested in the restaurants, we would need to subset it using the categories column. Let’s check if there is any missing data in this column:

0.002502471429321897

Of all the businesses in this dataset, only about 0.25% are missing the categories data. We can safely exclude them from our analysis. After that, let’s check the shape again:

(208869, 14)
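The missing-value check and the exclusion step could look roughly like the sketch below. The tiny DataFrame is a made-up stand-in for the business data, so the numbers differ from the ones reported above.

```python
import pandas as pd

# Made-up stand-in for the business data; one row has no categories.
business = pd.DataFrame({
    "name": ["Cafe A", "Diner B", "Salon C", "Shop D"],
    "categories": ["Restaurants, Cafes", "Restaurants", None, "Shopping"],
})

# Share of rows where `categories` is missing.
missing_share = business["categories"].isna().mean()
print(missing_share)

# Drop the rows without categories and check the shape again.
business = business.dropna(subset=["categories"])
print(business.shape)
```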

There are more than 1200 different business categories on Yelp [3]. Which category has the most businesses? Our intuition would probably be restaurants. Let’s quickly confirm with a bar graph:

Indeed, ‘Restaurants’ has the most businesses. Note that each business can be in up to three categories [3].
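One way to count businesses per category is sketched below. It assumes the `categories` column holds a single comma-separated string per business; the sample values are made up.

```python
import pandas as pd

# Made-up category strings, one per business (assumed comma-separated).
categories = pd.Series([
    "Restaurants, Pizza",
    "Restaurants, Bars",
    "Shopping",
])

# Split each string into a list, put every category on its own row, then count.
category_counts = categories.str.split(", ").explode().value_counts()
print(category_counts)
```

The resulting Series is already sorted in descending order, so something like `category_counts.head(20).plot(kind="barh")` would give a bar graph of the most common categories if matplotlib is installed.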

Now we can subset the dataset to restaurants only. One natural question to ask is: where are these restaurants located?
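The restaurant subset mentioned above could be taken with a simple substring match on the categories column. This is only a sketch on made-up rows, and it assumes rows with missing categories were already dropped.

```python
import pandas as pd

# Made-up rows; `categories` is assumed to be a comma-separated string.
business = pd.DataFrame({
    "name": ["Cafe A", "Salon C", "Diner B"],
    "categories": [
        "Restaurants, Cafes",
        "Hair Salons, Beauty & Spas",
        "Diners, Restaurants",
    ],
})

# Keep only rows whose category string mentions "Restaurants" (plain text match).
is_restaurant = business["categories"].str.contains("Restaurants", regex=False)
restaurants = business[is_restaurant].reset_index(drop=True)
print(restaurants)
```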

Geographic Visualizations

Since we have no prior information on where these selected restaurants are located, let’s plot them on a world map first:

It seems that all restaurants in the Yelp dataset are in North America. Let’s zoom in:

All restaurants are located either in the U.S. or Canada.

Let’s find out which cities have the most restaurants in this dataset:

Toronto and Las Vegas are the top two cities in terms of number of restaurants. Let’s look at where these restaurants are located in these two cities:

The brighter a region is, the more restaurants are located there.
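The city ranking above comes down to counting rows per city. A minimal sketch on made-up data (the city mix here is illustrative, not the dataset's actual distribution):

```python
import pandas as pd

# Made-up restaurant rows with only the `city` column.
restaurants = pd.DataFrame({
    "city": ["Toronto", "Las Vegas", "Toronto", "Phoenix", "Toronto", "Las Vegas"],
})

# Count restaurants per city; value_counts sorts from most to least frequent.
city_counts = restaurants["city"].value_counts()
top_cities = city_counts.head(2)
print(top_cities)
```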

Restaurants with High Occurrences & Rating Comparisons

Of all the restaurants in this dataset, which restaurants have higher occurrences than the others? Our intuition would probably be the fast food restaurant chains.

Our intuition is correct. Note that "Subway Restaurants" and "Subway" appear as two different restaurants, but they should be the same. Let’s fix it:

Subway and McDonald’s are the top two. Note that Tim Hortons, headquartered in Toronto, is also near the top.
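A small sketch of the name fix and the recount, using made-up names. `Series.replace` with a dict swaps exact values only, so other names are left untouched.

```python
import pandas as pd

# Made-up restaurant names, including the two Subway spellings.
names = pd.Series([
    "Subway", "Subway Restaurants", "McDonald's",
    "Subway", "Tim Hortons", "McDonald's",
])

# Map the variant spelling onto the canonical name, then count occurrences.
names = names.replace({"Subway Restaurants": "Subway"})
chain_counts = names.value_counts()
print(chain_counts)
```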

Since each individual restaurant branch has a ‘stars’ column (ratings between 1.0 and 5.0, in increments of 0.5), we can compare the ratings of some of these popular restaurant chains across branches with side-by-side box plots:

McDonald’s vs Burger King

Interestingly, their ratings look almost identical. Both medians are 2.0 and the spreads are roughly the same: the middle 50% of branches (25th to 75th percentile) have 1.5 to 2.5 stars.

Pizza Hut vs Domino’s Pizza

Comparing the two, Domino’s Pizza has slightly higher ratings than Pizza Hut. Branches in the middle 50% have 1.5–2.5 stars for Pizza Hut, and 2.0–3.0 stars for Domino’s Pizza.

Subway vs Jimmy John’s

Similar to the previous comparison, Jimmy John’s has slightly higher ratings than Subway. Branches in the middle 50% have 2.0–3.0 stars for Subway, and 2.5–3.5 stars for Jimmy John’s.

Taco Bell vs Chipotle Mexican Grill

The last comparison shows interesting results. Taco Bell has a larger spread than Chipotle, with a higher IQR (Q3-Q1). This could be partially due to the fact that Taco Bell has over 100 more restaurants in this dataset than Chipotle. It could also be that food quality and service are more consistent across Chipotle branches. Note that we don’t see the median line for Chipotle because it overlaps with its Q1 or Q3. Let’s check its median to confirm:

3.0

Okay, it overlaps with Q3. The overlap is caused by the large number of 3-star ratings for Chipotle.
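The numbers behind a box plot (Q1, median, Q3, and IQR) can be computed directly per chain. The sketch below uses two made-up chains, one of which has many identical ratings so that its median coincides with Q3, as described above.

```python
import pandas as pd

# Made-up branch ratings: "Chain A" has many 3.0 ratings, "Chain B" is spread out.
ratings = pd.DataFrame({
    "name": ["Chain A"] * 7 + ["Chain B"] * 5,
    "stars": [2.0, 2.5, 3.0, 3.0, 3.0, 3.0, 3.0,
              1.0, 2.0, 3.0, 4.0, 5.0],
})

# Quartiles per chain, one row per chain and one column per quantile.
quartiles = ratings.groupby("name")["stars"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["Q1", "median", "Q3"]

# Interquartile range: the height of the box in a box plot.
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]
print(quartiles)
```

When the median equals Q3, the median line is drawn on top of the upper edge of the box, which is why it seems to disappear.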

Throughout these comparisons, we can see that ratings can vary considerably across branches of the same chain (Chipotle might be an exception), and neither chain clearly dominates in any of the pairs we looked at.

Relationship between Number of Reviews and Rating

Besides the ‘stars’ attribute, there is also a ‘review_count’ attribute which indicates the number of reviews submitted by Yelp users. Let’s create a scatter plot to visualize the relationship between review count and rating:

We can see that as the rating increases from 1.0 to 4.0, the number of reviews tends to increase as well. However, as the rating increases further, especially from 4.5 to 5.0, the number of reviews shrinks.
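Besides the scatter plot, a per-rating summary makes the same pattern easy to read off. This is a sketch on made-up numbers chosen to mimic the shape described above, not values from the dataset.

```python
import pandas as pd

# Made-up ratings and review counts.
restaurants = pd.DataFrame({
    "stars": [1.0, 1.0, 4.0, 4.0, 5.0, 5.0],
    "review_count": [5, 7, 200, 300, 10, 20],
})

# Summarize review counts for each rating level.
reviews_by_rating = restaurants.groupby("stars")["review_count"].agg(
    ["mean", "median", "count"]
)
print(reviews_by_rating)
```

With matplotlib installed, `restaurants.plot.scatter(x="stars", y="review_count")` would draw the scatter plot itself.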

A Deeper Look at the Restaurant Attributes

One of the columns in the business data is ‘attributes’, which describes business amenities. Let’s take a look at them:

Index(['RestaurantsAttire', 'RestaurantsTakeOut', 'BusinessAcceptsCreditCards', 'NoiseLevel', 'GoodForKids', 'RestaurantsReservations', 'RestaurantsGoodForGroups', 'BusinessParking', 'RestaurantsPriceRange2', 'HasTV', 'Alcohol', 'BikeParking', 'RestaurantsDelivery', 'ByAppointmentOnly', 'OutdoorSeating', 'RestaurantsTableService', 'DogsAllowed', 'WiFi', 'Caters', 'Ambience', 'GoodForMeal', 'HappyHour', 'WheelchairAccessible', 'BYOB', 'Corkage', 'DietaryRestrictions', 'DriveThru', 'BusinessAcceptsBitcoin', 'Music', 'BestNights', 'GoodForDancing', 'Smoking', 'BYOBCorkage', 'CoatCheck', 'AgesAllowed', 'RestaurantsCounterService', 'Open24Hours', 'AcceptsInsurance', 'HairSpecializesIn'], dtype='object')

Bitcoin Restaurants

Most of these attributes are pretty straightforward. I find the ‘BusinessAcceptsBitcoin’ attribute interesting, so let’s see how many restaurants in this dataset accept bitcoin:

False    4245
True       74
Name: BusinessAcceptsBitcoin, dtype: int64

Of the 4319 restaurants that include this attribute, 74 accept bitcoin.
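A rough sketch of how the attribute columns and the bitcoin counts could be obtained. It assumes each row's `attributes` value is either a dict or missing, and that the flags are stored as the strings "True" and "False"; both are assumptions about the raw data, and the rows are made up.

```python
import pandas as pd

# Made-up rows: `attributes` is a dict per business, or None when absent.
restaurants = pd.DataFrame({
    "name": ["Cafe A", "Diner B", "Bar C", "Grill D"],
    "attributes": [
        {"BusinessAcceptsBitcoin": "False", "WiFi": "free"},
        None,
        {"BusinessAcceptsBitcoin": "True"},
        {"BusinessAcceptsBitcoin": "False"},
    ],
})

# Replace missing values with empty dicts, then expand the dicts into columns.
attrs = restaurants["attributes"].apply(lambda d: d if isinstance(d, dict) else {})
attr_df = pd.DataFrame(attrs.tolist(), index=restaurants.index)
print(attr_df.columns)

# Count the values of one attribute; rows without it are ignored by default.
bitcoin_counts = attr_df["BusinessAcceptsBitcoin"].value_counts()
print(bitcoin_counts)
```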

Relationship between Coat Check and Price Range

My intuition is that as the price range increases, the percentage of restaurants that provide coat check would increase as well.

Cross-tabulation between price range and whether there is coat check (listed numbers are percentages)

Note that for price range, ‘1’ means ‘$’, ‘2’ means ‘$$’, etc. See here for the mapping between the Yelp dollar signs and actual price ranges.

We can see that when the price range is $-$$, almost no restaurant has a coat check service. However, when the price range is $$$-$$$$, nearly half of the restaurants provide coat check.
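A percentage cross-tabulation like the one above can be built with `pd.crosstab` and `normalize="index"`, which makes every row sum to 1 before scaling to percent. The data below is made up, and the "True"/"False" strings are an assumption about how the flag is stored.

```python
import pandas as pd

# Made-up price ranges (1 = $, ..., 4 = $$$$) and coat check flags.
df = pd.DataFrame({
    "RestaurantsPriceRange2": [1, 1, 2, 2, 3, 3, 4, 4],
    "CoatCheck": ["False", "False", "False", "False",
                  "True", "False", "True", "False"],
})

# Row-normalized cross-tabulation, expressed as percentages per price range.
coat_check_pct = pd.crosstab(
    df["RestaurantsPriceRange2"], df["CoatCheck"], normalize="index"
) * 100
print(coat_check_pct)
```

The same call with the rating column on the rows and the delivery flag on the columns would give the delivery table discussed next.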

Relationship between Restaurant Delivery and Rating

My intuition is that as the restaurant rating increases from 1.0 to 3.0, the percentage of restaurants that provide delivery should also increase, and that as the rating increases from 3.0 to 5.0, the percentage might drop slightly.

Cross-tabulation between ratings and whether the restaurant provides delivery service

The percentage of restaurants that provide delivery service actually stays roughly the same, regardless of rating.

Note that there are many more bivariate or multivariate relationships we can investigate among the given attributes and ratings/number of reviews!

Discover Restaurants According to Your Needs

Let’s say you want to find a restaurant in Toronto that meets these criteria:

  1. Rating > 3.5 stars
  2. More than 100 reviews
  3. Open (important!)
  4. Price range is $$ ($11-$30)
  5. Accepts takeout service (#social-distancing)
  6. Accepts credit cards

Let’s suppose we want to see the top 15 restaurants ordered by ratings (first) and review counts (secondary):

28918                      Ramen Isshin
46423             Fresco's Fish & Chips
49260                           BarChef
37688    Descendant Detroit Style Pizza
4472                         Hodo Kwaja
49245                      Saigon Lotus
27989                           deKEFIR
60497                      Banh Mi Boys
30969           Manpuku Japanese Eatery
16129                       Le Gourmand
2827                White Brick Kitchen
30799                        La Palette
43941           Aoyama Sushi Restaurant
4794                      Mangia & Bevi
4136                          La Cubana

Now you are probably thinking about Ramen Isshin and what tips people give about this restaurant. To do that, we would need to look at the tip data. Here are five random tips that users give on Ramen Isshin:

1. 'Very small place fits about 30 people. They don't quite have a big sign.'
2. 'Try the soft tofu appetizer! It's quite light but I find it really tasty!'
3. 'They have a really cool chart in the front of their menu that breaks down each dish!'
4. 'Order the white sesame ramen! Quite tasty!'
5. 'By far, the most spacious ramenya in Toronto.'

A word cloud is a fun way to visualize popular words and phrases in a body of text. To make one, we first need to process the raw text (removing non-letter characters, punctuation, stopwords, etc.). Read more about NLP text processing here.
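A minimal sketch of that text processing, using two of the tips quoted above and only the standard library. The stopword list is a tiny hand-picked set for illustration; a real analysis would use a fuller list.

```python
import re
from collections import Counter

# Two of the Ramen Isshin tips quoted above.
tips = [
    "Order the white sesame ramen! Quite tasty!",
    "Try the soft tofu appetizer! It's quite light but I find it really tasty!",
]

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "it", "s", "but", "i"}

def tokenize(text):
    """Lowercase the text, keep runs of letters only, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

# Word frequencies across all tips.
word_counts = Counter(word for tip in tips for word in tokenize(tip))
print(word_counts.most_common(5))
```

If the third-party `wordcloud` package is installed, these frequencies can be turned into an image with `WordCloud().generate_from_frequencies(word_counts)`.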

Not a big fan of ramen and you prefer "Fresco’s Fish & Chips"? Sure… Here are some random tips that people gave for "Fresco’s Fish & Chips":

1. 'Check in for a free drink!!'
2. 'Add extra fish for only a few dollars, great for sharing!'
3. 'Ask to add extra piece of fish to order if you don't want too many chips but want two orders; more fish less chips less money. ;)'
4. 'Use the free drink checking offer!'
5. 'Sunday! 3 items for just $5! I like the shrimp!'

We can see that "check in for free drink" and "add extra fish" are each mentioned twice here. I know what you are probably thinking now… You are right. The featured image for this blog post is indeed for "Fresco’s Fish & Chips".


Coming Up Next

The Yelp dataset has much more textual information that we haven’t covered in this post. The review.json file contains user reviews, which are generally longer than tips. This data can be used to build models for sentiment analysis. In particular, the question is the following: Is it possible to predict the restaurant rating (or overall positive/negative sentiment) that a user gives based on their text review?

Check out how I created a sentiment classification model using logistic regression here.


Summary

We explored the business data within the Yelp dataset and examined the ratings of some fast food restaurant chains. We then took a look at different restaurant attributes and their relationships. Finally, we walked through an example of how to find top restaurants that fit our needs and used the tip data to create visualizations that help us understand restaurant tips.

I hope you enjoyed this post. Please share any thoughts you may have 🙂

DS/ML beginner? Check out my other post on how you can build your first Python classifiers with the Iris dataset:

Exploring Classifiers with Python Scikit-learn – Iris Dataset


References

[1] https://www.yelp-ir.com/overview/default.aspx
[2] https://www.yelp.com/dataset
[3] https://blog.yelp.com/2018/01/yelp_category_list

