There are thousands of free courses online. Thanks to the internet, we can now learn almost anything online. However, if all you did was watch the videos, you may well have forgotten what you learned by the time you receive the course certificate. The best way to learn machine learning, or anything else, is by working on a project. Besides, it’s always great to add another personal project to your GitHub repo. 🙌
If you’re new to Data Science, you probably don’t know where to start. Don’t worry! Everyone starts somewhere. I still remember working on the famous Titanic dataset several years ago. In this article, I will share some of the best data resources to help you get started.
1. UCI Machine Learning Repo
Difficulty: Easy
Dataset: https://archive.ics.uci.edu/ml/datasets.php

What I like about it: This is a very beginner-friendly website. My professors in grad school used datasets from this website as lecture examples as well. There are 559 datasets, and almost all of them are relatively small and clean. You can filter the datasets by the ML task you want to practice (classification, clustering, regression, etc.). Besides, since many people have worked with these datasets, you can look up what they have done and get inspiration from them.
2. Kaggle
Difficulty: Anywhere from easy to complex
Dataset: https://www.kaggle.com/datasets

What I like about it: Kaggle provides a vast collection of datasets. You can find text, audio, numerical, and image data here. It allows users to find and publish datasets, so you will see new ones pretty frequently. Kaggle is one of the most popular websites among data scientists, and it’s famous for its competitions. There are many ongoing contests, and you can actually win up to $100,000 in prizes! 🤩
3. Google Dataset Search
Difficulty: Intermediate to Advanced
Dataset: https://datasetsearch.research.google.com/

What I like about it: Similar to how Google Scholar works, Dataset Search allows you to find datasets wherever they are hosted, whether on a publisher’s site, in a digital library, or on an author’s web page. You can filter the results by update time, download format, usage rights, topic, and whether they are free. Because it contains over 25 million datasets, it might take some time to find the right one. If you want to gain some real-world data science experience, using datasets published on .gov domains is the way to go.
4. AWS Open Data Registry
Difficulty: Intermediate to Advanced
Dataset: https://registry.opendata.aws/

What I like about it: AWS Open Data Registry also lets users add datasets. I like how each dataset has several tags, so we know what the dataset is used for at first glance. However, it doesn’t have a filter feature like Google Dataset Search. For each dataset, you can find usage examples, which is very helpful if you need some guidance. A lot of companies are looking for candidates with experience in big data tools such as AWS and PySpark. If you haven’t worked with big data yet, this is a great place to get started. You will learn how to access data from AWS S3 buckets and how to process large datasets.
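One common way to process a dataset that is too large to load at once is to read it in chunks and aggregate as you go. The snippet below is only a minimal sketch of that idea using pandas: the tiny in-memory CSV, the column names `city` and `trips`, and the chunk size are all made up for illustration and are not tied to any dataset in the registry.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file. In practice this would be
# a local file path or a file-like object for a downloaded dataset.
csv_data = io.StringIO("city,trips\nNYC,10\nLA,5\nNYC,7\nSF,3\nLA,4\n")

# Running totals per city, updated one chunk at a time so that the full
# file never has to sit in memory.
totals = {}
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Aggregate within the chunk first, then merge into the running totals.
    for city, trips in chunk.groupby("city")["trips"].sum().items():
        totals[city] = totals.get(city, 0) + int(trips)

print(totals)
```

With a real file you would pick a much larger `chunksize` (for example, a few hundred thousand rows), but the pattern stays the same.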
5. Wikipedia ML Datasets
Difficulty: Intermediate to Advanced
Dataset: https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

What I like about it: The Wikipedia ML page lists various kinds of data, such as image, text, sound, and signal data. For each dataset, you can find a brief description, the data size, the format, and its machine learning use case. If you are new to deep learning or NLP, you can start by exploring the data listed here.
6. Awesome Public Datasets
Difficulty: Anywhere from easy to complex
Dataset: https://github.com/awesomedata/awesome-public-datasets

What I like about it: This repo contains data from various industries. When it comes to starting a personal project, it’s always better to find a topic you are passionate about. In this repo, you can find both famous and unique datasets. If you are a beginner, you can start with some popular datasets like the IMDb database or Lending Club Loan Data. When searching for a dataset on Google or AWS, you might have no idea where to start. You can refer to this repo for inspiration and look for a specific dataset later.
7. Big Bad NLP Database
Difficulty: Intermediate to Advanced
Dataset: https://datasets.quantumstat.com/

What I like about it: Similar to the Wikipedia ML page, this website also shows the data description, format, and ML use case. However, this website focuses on natural language processing tasks only. Some of the datasets are in JSON format, which requires some preprocessing before you can read them into a data frame. Real-world data usually comes in different formats. The more you practice, the better you get. Besides, text data tends to be really messy!
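To give a feel for the JSON preprocessing step, here is a minimal sketch that flattens nested JSON records into a data frame with `pandas.json_normalize`. The sample records and their field names (`id`, `text`, `meta`) are hypothetical and do not come from any dataset on the site; real files will have their own structure.

```python
import json

import pandas as pd

# Hypothetical sample: two records with a nested "meta" object.
raw = """
[
  {"id": 1, "text": "Great movie!", "meta": {"lang": "en", "label": "pos"}},
  {"id": 2, "text": "Too long.", "meta": {"lang": "en", "label": "neg"}}
]
"""

# Parse the JSON string into a list of Python dictionaries.
records = json.loads(raw)

# Flatten nested fields into columns named like "meta.lang" and "meta.label".
df = pd.json_normalize(records)

print(df.columns.tolist())
print(df)
```

Deeper nesting or lists inside each record usually need extra handling, for example through the `record_path` and `meta` arguments of `json_normalize`.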
8. Bureau of Transportation Statistics
Difficulty: Intermediate to Advanced
Dataset: https://www.bts.gov/browse-statistical-products-and-data

What I like about it: If you are interested in learning more about demand forecasting, you can start with this data source. It has historical data as well as recent COVID-19-related transportation statistics, which allow you to understand how passenger behavior has changed. Because this is also real-world data, you will gain experience in working with messy data.
9. VisualData
Difficulty: Intermediate to Advanced
Dataset: https://www.visualdata.io/discovery

What I like about it: After learning the ML basics, you might want to learn more about deep learning. VisualData contains 500+ datasets. You can sort the datasets by how recently they were added or by popularity. If you are new to deep learning, I suggest experimenting with the popular datasets first so that you can learn from those who have worked with them before.
10. Recommender Systems Datasets
Difficulty: Intermediate to Advanced
Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html

What I like about it: This compilation from Julian McAuley includes datasets from Amazon, Goodreads, RentTheRunway, Facebook, Twitter, Reddit, and more. For each dataset, you can see its description, basic statistics, metadata, and an example. If you are curious about how I built a content-based filtering recommendation system, feel free to check out the details on my blog. 🙂
Conclusion
There you have it! We all know that data is the bread and butter of modern machine learning. I hope you found some inspiration in this article. If any of these data sources inspired you, I strongly recommend that you work on at least one project before 2021. Let’s finish this year strong!