
7 Real-World Datasets to Learn Everything Needed About Machine Learning

With references to sample use-case implementations

Photo by Franki Chamaki on Unsplash

I’m actively involved in mentoring and coaching people in data science, and I also guide students preparing for a career in the field. From my interactions with them, I find that many pay little attention to the dataset when choosing a project; quite a few still use dummy datasets for their data science portfolio projects. The goal of a portfolio project is to deepen your learning and to help you land a data science job, and dummy data helps with neither.

Using a real-world dataset for your project offers the following benefits:

  • Real data comes with its own challenges, so it is a good opportunity to learn about typical data issues and how to handle them
  • It is easier to understand and relate to the dataset
  • The dynamics in the feature attributes can drive further learning
  • It helps differentiate your work, and hence your profile as well

In this article, I am going to show you how to use some interesting real-world datasets to learn in detail about the key classes of machine learning algorithms, such as

  • Regression Algorithms
  • Classification Algorithms
  • Recommendation System
  • Sentiment Analysis
  • Image Recognition

I have also shared links to some well-regarded implementations of interesting use-cases for reference and learning purposes. If you prefer video format, check here – https://www.youtube.com/watch?v=8agqUFZOsO0

Housing Price Dataset

The Boston housing price dataset and the Melbourne housing price dataset are two popular housing price datasets. The Boston dataset is relatively small, with approximately 500 records, and it gives more weight to the location of the property. Its locality-related attributes include distance from the highway, pupil-to-teacher ratio, crime rate, pollution, and so on, all of which affect the property price.

The Melbourne housing price dataset is much larger, and its attributes focus on the property itself rather than the locality. Some of its attributes are the number of carports, building area, land area, year built, and so on. These two datasets offer different perspectives on the same problem, and it is worth understanding which key attributes in each dataset influence the price of a property.

Use-cases

  • Predicting the housing price – After analysis and preprocessing, divide the dataset into train and test sets, then implement a model to predict the housing price. If you are looking for a reference, here is a Kaggle notebook that evaluates the performance of different regression algorithms on housing price prediction.

Boston house price prediction

  • Analyzing the factors that play an important role in the price of a property – Here is a Kaggle notebook for reference on performing exploratory data analysis using the Melbourne housing data.

Melbourne || Comprehensive Housing Market Analysis

Algorithms and Tools

  • Regression algorithms like Linear Regression and SVM
  • For data analysis, libraries like Pandas and NumPy are useful, while Seaborn and Matplotlib help in producing good visualizations; a minimal regression sketch follows below
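
For a quick starting point, here is a minimal sketch of such a regression workflow with scikit-learn. The file name melbourne_housing.csv and the Price target column are assumptions; adjust them to the copy of the dataset you download.

```python
# A minimal sketch, assuming a local "melbourne_housing.csv" file with a
# numeric "Price" target column; adjust names to your copy of the dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("melbourne_housing.csv")

# Keep only numeric columns and drop rows with missing values for simplicity
df = df.select_dtypes("number").dropna()

X = df.drop(columns=["Price"])
y = df["Price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```

Swapping LinearRegression for SVR or a tree-based regressor changes only the model line, which makes it easy to compare algorithms as in the referenced notebook.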

Mall Customer Dataset

The Mall Customer dataset describes people visiting a mall, with attributes such as gender, age, income, and spending score. The data is not strictly real, but I find that it reflects the dynamics and characteristics of a real-world dataset, and it is a good dataset for learning about customer segmentation.

The ‘Spending Score’ attribute in this dataset is a derived attribute that reflects the purchasing power of an individual, most likely computed from the customer’s purchase history, visit frequency, and other similar transactional data.

Use-cases

  • Identifying different customer segments – You can learn to build a clustering-based customer segmentation model using this dataset. A good reference for customer segmentation is below

Customer Segmentation (K-Means) | Analysis

  • Studying the role of gender and age on spending – Here is a quick reference on using SweetViz to perform the exploratory analysis

EDA WITHIN SNAP OF FINGERS!!!

Algorithms and Tools

  • Clustering algorithms like K-Means and DBSCAN; a minimal K-Means sketch follows below
  • Classification algorithms like Random Forest, KNN, and Naive Bayes
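
Here is a minimal K-Means sketch to get started. The file name Mall_Customers.csv and the column names match the common Kaggle copy of the dataset but may differ in yours.

```python
# A minimal sketch, assuming the Kaggle "Mall_Customers.csv" file with the
# columns "Annual Income (k$)" and "Spending Score (1-100)"; rename as needed.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")
features = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Scale the features so income and spending score contribute equally
scaled = StandardScaler().fit_transform(features)

# Five clusters is a common choice here; the elbow method (plotting inertia
# for k = 1..10) is the usual way to pick k
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Segment"] = kmeans.fit_predict(scaled)

# Average profile of each segment
print(df.groupby("Segment")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())
```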

Credit Card Fraud Dataset

The Credit Card Fraud dataset contains credit card transactions made by European cardholders. A distinctive feature of this dataset is that it is highly imbalanced: out of roughly 285,000 transactions, only 492 are fraudulent and the rest are genuine. The issue with this kind of dataset is that if you predict every transaction to be genuine, the model’s accuracy is over 99%, which is misleading because the real focus should be on the ability to detect the anomalies in the dataset.

Also, since these are financial transactions, it is very important to identify fraudulent transactions accurately, but it is not advisable to flag genuine transactions as fraud, as that causes customer dissatisfaction. This is a good dataset for understanding the issues that come with an imbalanced dataset and for learning about the methods, techniques, and algorithms that can be used in this scenario.

Use-cases

  • Predicting fraudulent transactions (i.e., anomaly detection) – The reference below is an excellent example that explains some of the best techniques for working with highly imbalanced datasets

Credit Fraud || Dealing with Imbalanced Datasets

Algorithms

  • Unsupervised learning algorithms like Isolation Forest
  • Supervised learning algorithms like GBM and XGBoost; a class-weighted sketch follows below
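
As one simple way to handle the imbalance, the sketch below uses a class-weighted Random Forest and reports per-class precision and recall instead of accuracy; the notebook referenced above covers more advanced techniques such as under- and over-sampling. The file name creditcard.csv and the Class column follow the common Kaggle copy of the dataset.

```python
# A minimal sketch, assuming the Kaggle "creditcard.csv" file with a binary
# "Class" column (1 = fraud). Accuracy is deliberately not reported; per-class
# precision and recall are what matter on an imbalanced dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

# Stratify so the rare fraud cases appear in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" up-weights the minority (fraud) class; oversampling
# techniques such as SMOTE are a common alternative
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=-1)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), digits=4))
```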

Fake News Detection Dataset

This dataset contains attributes such as the title, the author, the content of the article, and a flag that captures whether the article is reliable. There is scope for applying a lot of NLP techniques to this dataset. The problem can be approached in two ways: one is to apply stemming and stop-word removal to the text and then use a neural network; the other is to generate features from the text (for example, TF-IDF vectors) and treat it as a classification problem, as in the sketch below.
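
Here is a minimal sketch of the second, feature-based approach: TF-IDF features fed into a linear classifier. The file name train.csv and the column names text and label are assumptions based on the common Kaggle copy of this dataset.

```python
# A minimal sketch of the feature-based approach, assuming the Kaggle
# "train.csv" file with "text" and "label" columns; adjust names as needed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv").dropna(subset=["text", "label"])

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF handles stop-word removal and feature generation in one step
vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

print(classification_report(y_test, clf.predict(X_test_vec)))
```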

With more and more news being shared on social media, fake news can spread misleading information quickly. Most of the time it remains harmless, but it can occasionally cause serious issues, so predicting fake news and stopping it from spreading is important and remains a challenge to this day.

Use-cases

  • Predicting whether the news is real or fake – This use-case can be implemented effectively using deep learning. Here is a simple, beginner-friendly example of using TensorFlow and Keras to implement fake news prediction

Fake News Detection

  • For those interested in reading and implementing methodologies discussed in research papers, check the research paper here; it explains an implementation of fake news prediction

Algorithms

  • LSTM; a minimal Keras sketch follows below
  • XGBoost, Logistic Regression, and/or other classification algorithms
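
And here is a minimal Keras sketch of the LSTM approach. The two texts below are placeholder data purely to make the snippet self-contained; in practice you would feed it the article texts and reliability flags from the dataset.

```python
# A minimal Keras sketch of the LSTM approach; the two texts are placeholders.
import numpy as np
import tensorflow as tf

texts = np.array(["example of a reliable article", "example of a fake article"])
labels = np.array([0, 1])

# Turn raw strings into padded integer sequences inside the model itself
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=20000, output_sequence_length=300
)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(input_dim=20000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(texts, labels, epochs=2, batch_size=2)
```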

Yelp Dataset

The Yelp dataset comprises 1.3 million tips provided by over 1.9 million users. It also includes over 1.4 million business attributes like hours, parking, availability, and ambiance, as well as reviews. The dataset is available in JSON format and includes four major files:

  • Business: contains details about each business, such as its address, location, rating, and operating hours
  • Review: contains the review text, the user who wrote it, and the business it was written about, along with details of how the review was received by the wider audience
  • User: contains details about each user, the ratings and reviews they have provided, their friend mappings, and other associated metadata
  • Tips: shorter than reviews, these convey suggestions to the business

This is a massive text corpus, so there is a lot of scope for applying various NLP techniques; and since the dataset has a massive user base, it is also a good dataset for studying user behavior.
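
The review file ships as newline-delimited JSON, so a chunked read keeps memory usage manageable. The file name below follows the standard Yelp download, and deriving coarse sentiment labels from star ratings is a common shortcut rather than part of the dataset itself.

```python
# A minimal loading sketch; the file name follows the standard Yelp download.
import pandas as pd

# The review file is newline-delimited JSON, so read it in chunks
chunks = pd.read_json(
    "yelp_academic_dataset_review.json", lines=True, chunksize=100000
)
reviews = next(chunks)  # the first 100k reviews are plenty for experimentation

# Coarse sentiment labels derived from star ratings (a common shortcut)
reviews["sentiment"] = (reviews["stars"] >= 4).map(
    {True: "positive", False: "negative"}
)
print(reviews[["stars", "sentiment", "text"]].head())
```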

Use-cases

  • Predicting success or failure for a business – The reviews and tips, along with location and other business attributes, can be used to predict the success or failure of a business. Here is an example notebook that helps find popular restaurants

Finding the perfect Restaurants on Yelp

  • Sentiment analysis based on reviews and tips – Here is a reference notebook that implements a sentiment classifier using classification trees

Sentiment Analysis of the Yelp Reviews Data

  • Analyzing user behavior, such as the interests of users with certain characteristics or from a particular location – This notebook is a good example of the exploratory analysis that can be performed on the review data

What’s in a review? – Yelp ratings EDA

  • Network analysis to understand influencers – Key influencers can be identified through network analysis of the user data

Algorithms

  • Decision Tree or Random Forest
  • Classification algorithms such as KNN, SVM, and/or Naive Bayes

Amazon Review Dataset

This Amazon dataset contains about 233 million customer reviews of Amazon products, grouped into 30 categories. The dataset is very flexible: it can be downloaded by product category, and there is also an option to download a subset of the data for small-scale implementation or experimentation.

This is a good dataset for implementing a recommendation system, and there are several research papers on implementations and on techniques for improving the accuracy of recommendation systems.

Use-cases

  • Building product recommendations for users – Here is a Git repository with an implementation of product recommendation using the Amazon review dataset

mandeep147/Amazon-Product-Recommender-System

  • Sentiment analysis based on customer reviews – A sentiment analysis model can be built from the review data. Also, since this is a massive dataset, a model trained on it can be tested on other datasets

Algorithm

  • Collaborative filtering algorithms for recommendation systems; a minimal sketch follows below
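
Here is a minimal collaborative filtering sketch using the scikit-surprise library (one option among many). The tiny DataFrame is placeholder data; with the real dataset you would load the reviewer, product, and rating columns (for example reviewerID, asin, and overall) instead.

```python
# A minimal sketch using scikit-surprise; the DataFrame is placeholder data.
import pandas as pd
from surprise import SVD, Dataset, Reader

ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3"],
    "item":   ["p1", "p2", "p1", "p3", "p2"],
    "rating": [5.0, 3.0, 4.0, 2.0, 5.0],
})

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["user", "item", "rating"]], reader)

# SVD-style matrix factorisation, trained on all of the data for brevity
algo = SVD(n_factors=50, random_state=42)
algo.fit(data.build_full_trainset())

# Estimate how user "u3" would rate the unseen product "p3"
print(algo.predict("u3", "p3").est)
```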

ImageNet Dataset

ImageNet is a massive image database comprising 14 million images across 20K visual categories. It is widely used in object detection and identification use-cases and is a common choice in academia for learning about image recognition with deep learning. This dataset was a major driver of the revolution in image recognition.

Use-cases

  • Object detection
  • Object Localization
  • Image Captioning

Algorithms

  • Convolutional Neural Network (CNN); a pretrained-model sketch follows below
  • Region-based Convolutional Neural Network (R-CNN)
  • Single Shot Detector (SSD)
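
As a quick way to try image recognition on ImageNet classes, the sketch below runs a Keras ResNet50 pre-trained on ImageNet against a single image; example.jpg is a placeholder path to any local image.

```python
# A minimal sketch using a Keras ResNet50 pre-trained on ImageNet;
# "example.jpg" is a placeholder path.
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions,
)
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")

img = image.load_img("example.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Print the top-3 ImageNet classes predicted for the image
for _, label, score in decode_predictions(model.predict(x), top=3)[0]:
    print(f"{label}: {score:.3f}")
```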

About Me

I am a data science professional with over 10 years of experience, and I have authored two books on data science. I write data science content with the aim of making it simple and accessible. Follow me on Medium. I also have a YouTube channel where I teach and talk about various data science concepts. If you are interested, subscribe to my channel below.

Data Science with Sharan

