The world’s leading publication for data science, AI, and ML professionals.

10 Open-Source Dataset Finders For Your Next ML Project

Best Resources to Upscale Your Skills and Portfolio

Photo by Sincerely Media on Unsplash

There are thousands of free courses online, and thanks to the internet we can now learn almost anything from home. However, if all you did was watch the videos, you may find that you have forgotten most of the material by the time you receive the course certificate. The best way to learn machine learning, or anything else, is by working on a project. Besides, it’s always great to add another personal project to your GitHub repo. 🙌

If you’re new to data science, you probably don’t know where to start. Don’t worry! Everyone starts somewhere. I still remember working on the famous Titanic dataset several years ago. In this article, I will share ten of the best data resources to help you get started.

1. UCI Machine Learning Repo

Difficulty: Easy

Dataset: https://archive.ics.uci.edu/ml/datasets.php

Screenshot by Author

What I like about it: This is a very beginner-friendly website. My professors in grad school used datasets from this website as lecture examples. There are 559 datasets, and almost all of them are relatively small and clean. As you can see from the screenshot, you can filter the datasets by the type of ML task you want to practice (classification, clustering, regression, etc.). Besides, since many people have worked with these datasets, you can look up what they have done and get inspiration from them.

2. Kaggle

Difficulty: Anywhere from easy to complex

Dataset: https://www.kaggle.com/datasets

Screenshot by Author

What I like about it: Kaggle provides a vast collection of datasets. You can find text, audio, numerical, and image data here. It allows users to find and publish datasets, so new ones appear pretty frequently. Kaggle is one of the most popular websites among data scientists, and it’s famous for its competitions. There are many ongoing contests, and you can actually win up to $100,000 in prizes! 🤩

3. Google Dataset Search

Difficulty: Intermediate to Advanced

Dataset: https://datasetsearch.research.google.com/

Screenshot by Author

What I like about it: Similar to how Google Scholar works, Dataset Search allows you to find datasets wherever they are hosted, whether on a publisher’s site, in a digital library, or on an author’s web page. You can filter the results by last-updated date, download format, usage rights, topic, and whether the data is free. Because it indexes over 25 million datasets, it might take some time to find the right one. If you want to gain some real-world data science experience, datasets published on .gov domains are the way to go.

4. AWS Open Data Registry

Difficulty: Intermediate to Advanced

Dataset: https://registry.opendata.aws/

Screenshot by Author

What I like about it: The AWS Open Data Registry also lets users add datasets. I like how each dataset has several tags, so you know what it can be used for at first glance. However, it doesn’t have a filter feature like Google Dataset Search. For each dataset, you can find usage examples, which is very helpful if you need some guidance. A lot of companies are looking for candidates with experience in big data tools such as AWS and PySpark. If you haven’t worked with big data yet, this is a great place to get started. You will learn how to access data from AWS S3 buckets and how to process large datasets.
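Registry entries usually point to data with an `s3://bucket/key` address. As a minimal sketch, assuming the bucket is public and reachable through the common virtual-hosted-style endpoint, the snippet below turns such an address into an HTTPS URL you could download from. The bucket and key are hypothetical, and the helper name `s3_uri_to_https` is my own, not part of any AWS library.

```python
from urllib.parse import quote, urlparse


def s3_uri_to_https(uri: str) -> str:
    """Turn an s3://bucket/key URI into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Keep "/" separators in the key and percent-encode everything else.
    # Assumption: the key contains no "?" or "#", which urlparse would split off.
    key = quote(parsed.path.lstrip("/"))
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


# Hypothetical bucket and key, for illustration only.
print(s3_uri_to_https("s3://example-open-data/trips/2020/part 01.csv"))
```

For anything beyond a single public file, such as listing objects or reading from a bucket that needs credentials, an SDK or the AWS CLI is likely the better tool.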

5. Wikipedia ML Datasets

Difficulty: Intermediate to Advanced

Dataset: https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

Screenshot by Author

What I like about it: The Wikipedia ML page lists various kinds of data, such as image, text, sound, and signal data. For each dataset, you can find a brief description, the data size, the format, and its machine learning use. If you are new to deep learning or NLP, you can start by exploring the data listed here.

6. Awesome Public Datasets

Difficulty: Anywhere from easy to complex

Dataset: https://github.com/awesomedata/awesome-public-datasets

Screenshot by Author

What I like about it: This repo contains data from various industries. When it comes to starting a personal project, it’s always better to find a topic you are passionate about. In this repo, you can find both famous and unique datasets. If you are a beginner, you can start with popular datasets like the IMDb database or Lending Club Loan Data. When searching for a dataset on Google or AWS, you might have no idea where to start. In that case, you can browse this repo for inspiration first and search for a specific dataset later.

7. Big Bad NLP Database

Difficulty: Intermediate to Advanced

Dataset: https://datasets.quantumstat.com/

Screenshot by Author

What I like about it: Similar to the Wikipedia ML page, this website also shows the data description, format, and ML use case. However, this website focuses on natural language processing tasks only. Some of the datasets are in JSON format, which requires some preprocessing before you can load the data into a data frame. Real-world data usually comes in different formats, so the more you practice, the better you get. Besides, text data tends to be really messy!
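To give a feel for that JSON preprocessing, here is a minimal sketch that flattens nested records into flat rows using only the standard library. The records are made up for illustration and do not come from any dataset on the site, and the `flatten` helper is a hypothetical name of my own.

```python
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


# Made-up records in a nested shape; not taken from any real dataset.
raw = """
[
  {"id": 1, "text": "Great movie!", "label": {"sentiment": "pos", "score": 0.9}},
  {"id": 2, "text": "Too long.", "label": {"sentiment": "neg", "score": 0.7}}
]
"""
rows = [flatten(r) for r in json.loads(raw)]
print(rows[0])
```

A list of flat dictionaries like `rows` can be passed to `pandas.DataFrame(rows)` if you have pandas installed. pandas also ships `pandas.json_normalize`, which covers the same use case.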

8. Bureau of Transportation Statistics

Difficulty: Intermediate to Advanced

Dataset: https://www.bts.gov/browse-statistical-products-and-data

Screenshot by Author

What I like about it: If you are interested in learning more about demand forecasting, you can start with this data source. It has historical data as well as recent COVID-19-related transportation statistics, which let you see how passenger behavior has changed. Because this is also real-world data, you will gain experience working with messy data.
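If you have never built a forecast before, a simple baseline is a good first step. The sketch below shows a seasonal naive forecast, which predicts each future period with the value from the same period one season earlier. The passenger counts are made up for illustration and are not BTS figures, and the function name is my own.

```python
def seasonal_naive(history: list[float], season: int, horizon: int) -> list[float]:
    """Forecast each future period with the value from one season earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    # Repeat the most recent season for as many periods as requested.
    return [last_season[i % season] for i in range(horizon)]


# Made-up quarterly passenger counts (in thousands), two years of history.
passengers = [120, 150, 180, 140, 125, 155, 190, 145]
print(seasonal_naive(passengers, season=4, horizon=6))
# prints [125, 155, 190, 145, 125, 155]
```

Any model you train later should beat this baseline on held-out data. If it doesn’t, the extra complexity is probably not paying off.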

9. VisualData

Difficulty: Intermediate to Advanced

Dataset: https://www.visualdata.io/discovery

Screenshot by Author

What I like about it: After learning the basics of ML, you might want to learn more about deep learning. VisualData contains 500+ datasets. You can sort the datasets by date added or by popularity. If you are new to deep learning, I suggest experimenting with the popular datasets first so that you can learn from those who have worked with them before.

10. Recommender Systems Datasets

Difficulty: Intermediate to Advanced

Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html

Screenshot by Author

What I like about it: This compilation from Julian McAuley includes datasets from Amazon, Goodreads, RentTheRunway, Facebook, Twitter, Reddit, and more. For each dataset, you can see its description, basic statistics, metadata, and an example. If you are curious about how I built a content-based filtering recommendation system, feel free to check out the details on my blog. 🙂

Conclusion

There you have it! We all know that data is the bread and butter of modern machine learning. I hope you found some inspiration in this article. If any of these data sources inspired you, I strongly recommend that you work on at least one project before 2021. Let’s finish this year strong!


If you find this helpful, please follow me and check out my other blogs. ❤️

Discovering your Music Taste with Python and Spotify API

How to Prepare for Business Case Interview as an Analyst?

10 Questions You Must Know to Ace any SQL Interviews

How to Convert Jupyter Notebooks into PDF

Understanding and Choosing the Right Probability Distributions with Examples

How to Set Up Automated Tasks in Linux Using Cron

