Convert Yelp Dataset to CSV
How to load a massive file in Jupyter Notebook using Pandas
Round 13 of the Yelp dataset challenge started in January 2019, giving students the opportunity to win awards and conduct analysis or research for academic use.
This post demonstrates, step by step, how to convert the gigantic review.json file of the Yelp dataset (5.2 gigabytes, over 6 million reviews) into a more manageable CSV file. A file that large can be troublesome to load inside a Jupyter Notebook.
1. Downloading the Yelp dataset
(source: Yelp Dataset Challenge https://www.yelp.com/dataset/challenge)
This dataset includes information Yelp obtained on businesses, reviews, users, check-ins, tips and photos. The file is 8.69 gigabytes uncompressed, in JSON format (6 JSON files: business.json, review.json, user.json, checkin.json, tip.json and photo.json).
After unzipping the yelp_dataset file, another file will appear; add .tar to the end of the new file and unzip it again. The 6 JSON files should appear, along with 2 PDFs covering Yelp’s dataset agreement and the Round 13 description.
2. Loading the Yelp dataset in Jupyter Notebook
Since review.json does not include business details, we need to merge the review file with the business file if we want to know what kind of business each review is for.
Load the business.json file
Set up the business.json environment path to load using Pandas
import pandas as pd

business_json_path = 'data/business.json'
df_b = pd.read_json(business_json_path, lines=True)
Clean the business.json file
Only keep the businesses that are still open in the dataset
# 1 = open, 0 = closed
df_b = df_b[df_b['is_open']==1]
Drop any irrelevant columns (we must keep business_id to merge with the reviews)
drop_columns = ['hours','is_open','review_count']
df_b = df_b.drop(drop_columns, axis=1)
Split the businesses by category
We can find the relevant reviews that we want later on by querying for the relevant businesses first using categories.
For this example, we are only interested in RV related businesses.
business_RV = df_b[df_b['categories'].str.contains(
'RV Repair|RV Dealers|RV Rental|RV Parks|Campgrounds',
case=False, na=False)]
Find the relevant categories
Since a business can have multiple categories, it can be difficult to figure out how many different categories there are (hint: 1,290 different categories!). We can use the explode function, introduced in pandas v0.25, to split the categories.
# Make sure the Pandas version is above 0.25
# If not, upgrade Pandas version
# !pip3 install --upgrade pandas
pd.__version__
Once we confirm we have the updated Pandas version, let’s do the magic.
df_explode = df_b.assign(categories = df_b.categories
.str.split(', ')).explode('categories')
We can then list out all the individual categories
df_explode.categories.value_counts()
Find the categories containing RV
df_explode[df_explode.categories.str.contains('RV',
                      case=True, na=False)].categories.value_counts()
Loading a massive file as chunks in Pandas
With a huge file like the Yelp dataset, loading all the data at once will most likely exhaust the computer’s memory. Luckily, Pandas has an option to load large data by segmenting the file into smaller chunks.
Cleaning and loading the review.json file
Set up the review.json environment path to load using Pandas.
review_json_path = 'data/review.json'
Identifying the data type of each column can reduce memory usage. Luckily, Yelp provides documentation for its dataset, so we can set the data type of each column explicitly.
size = 1000000
review = pd.read_json(review_json_path, lines=True,
dtype={'review_id':str,'user_id':str,
'business_id':str,'stars':int,
'date':str,'text':str,'useful':int,
'funny':int,'cool':int},
chunksize=size)
Here, a chunk size of 1,000,000 means Pandas will read 1,000,000 lines at a time. Since review.json contains more than 6 million reviews (rows), reducing the chunk size makes it easier to load the data and check intermediate results faster.
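To see how the chunked reader behaves without touching the real 5.2 GB file, here is a minimal sketch that fabricates a tiny JSON-lines sample (the file name and columns are made up for illustration; only the read_json pattern mirrors the step above):

```python
import pandas as pd

# Fabricated miniature stand-in for review.json (JSON-lines format)
sample = (
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great!"}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 3, "text": "OK"}\n'
    '{"review_id": "r3", "business_id": "b1", "stars": 4, "text": "Good"}\n'
)
with open('sample_reviews.json', 'w') as f:
    f.write(sample)

total_rows = 0
# chunksize=2 here mimics reading the real file 1,000,000 lines at a time;
# read_json with chunksize returns an iterator of DataFrames
for chunk in pd.read_json('sample_reviews.json', lines=True, chunksize=2):
    total_rows += len(chunk)  # process each chunk here

print(total_rows)  # 3
```

Each iteration yields an ordinary DataFrame, so any per-chunk cleaning or filtering can happen inside the loop before the results are combined.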
Merge the review.json and business.json files
By merging only the relevant businesses with the review file, the final dataset will consist only of reviews for those businesses.
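The merge itself can be sketched as follows. This is a minimal illustration using fabricated miniature DataFrames in place of business_RV and the review chunks from the steps above; an inner merge on business_id keeps only reviews belonging to the RV businesses, and the merged chunks are concatenated into one frame:

```python
import pandas as pd

# Fabricated stand-ins; the real frames come from the loading steps above
business_RV = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'name': ['Happy RV Repair', 'Sunny RV Park'],
})
chunks = [
    pd.DataFrame({'business_id': ['b1', 'b9'], 'stars': [5, 2]}),
    pd.DataFrame({'business_id': ['b2', 'b1'], 'stars': [4, 3]}),
]

chunk_list = []
for chunk in chunks:
    # Inner merge keeps only reviews whose business_id appears in business_RV
    merged = pd.merge(business_RV, chunk, on='business_id', how='inner')
    chunk_list.append(merged)

df = pd.concat(chunk_list, ignore_index=True)
print(len(df))  # 3 reviews match an RV business; the b9 review is dropped
```

The concatenated df is the frame we then write out to CSV.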
Convert the new data frame into a CSV file
A more concise dataset can be loaded and shared much more readily.
csv_name = "yelp_reviews_RV_categories.csv"
df.to_csv(csv_name, index=False)
Conclusion
This post demonstrates how we can easily convert a massive JSON file to a CSV file within Jupyter Notebook, without the hassle of an external converter. Data is only useful when we know how to read and use it properly. Cleaning and processing the data thoroughly determines the quality of our analysis.
Next up: Analyzing Yelp Dataset with Scattertext visualization
Now that we have our CSV file, we are ready to analyze the data! In my next article, I will show how to visualize the data with Scattertext and spaCy.
Thanks for reading! I would love to hear your thoughts, comment here or send me a message. The code used for this is located in my GitHub repository.