The Top 17 Places to Find Datasets 📊

I’m often asked for good places to find data. Here are t̶e̶n̶ seventeen great sources.

Jeff Hale
Towards Data Science

Poppies data? Source: pixabay.com

Without further ado, here are the best places to find data, with some helpful information about each. Folks keep pointing me to new sources, so the list is expanding! If you have a favorite, please send it my way! 😀

Awesome Data 😎

Awesome Data is a GitHub repository with a seriously impressive list of datasets separated by category. It is updated regularly.

Data Is Plural 🔢

Jeremy Singer-Vine’s Data Is Plural weekly newsletter has great fresh data sources. I’m always impressed by the quality. The archive is available here.

Kaggle Datasets

In addition to competitions, Kaggle has a huge range of datasets. Kaggle Datasets provide great summary information and previews for most datasets. You can download the data or use their platform to analyze it in a Jupyter notebook. You can also contribute your own datasets and make them public or private.

Data.world 🌍

Like Kaggle, Data.world provides a wide range of user-contributed datasets. It also offers a platform for companies to store and organize their data.

Google Dataset Search Tool

I think it’s safe to say that Google knows a thing or two about search. It recently added a separate search functionality for datasets through its Google Dataset Search Tool. It’s worth a shot if you’re looking for data on a particular topic or from a particular source.

Hugging Face 🤗

Hugging Face has nearly 2,000 datasets, including many NLP datasets. I love their model cards that contain descriptions, intended uses and limitations, operating instructions, biases, training data and training procedure information, and evaluation results on many common metrics. Added Nov. 16, 2021.

Reddit Datasets

The subreddit r/datasets has lots of great datasets posted regularly by users. Added January 25, 2021.

OpenDaL 🕐

OpenDaL is a data aggregator that allows you to search using a variety of metadata. For example, you can search based on time or location.

Screenshot from OpenDaL.

Pandas Data Reader 🐼

The Pandas DataReader will help you pull data from online sources into Python pandas DataFrames. Most of the data sources are financial. Here’s the list of available data sources as of late 2020:

Here’s how you use it after installing it into a Python environment with pip install pandas-datareader.

import pandas_datareader as pdr

# Fetch the FRED series 'GS10' (10-year Treasury yield) as a pandas DataFrame.
pdr.get_data_fred('GS10')
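
If you only want part of a series, something like the sketch below might help. It assumes get_data_fred forwards start and end keyword arguments to the underlying FRED reader. The helper names fetch_fred_series and yearly_average, and the sample values, are made up for this illustration.

```python
import pandas as pd


def fetch_fred_series(series_id, start, end):
    """Download one FRED series for a date range (needs network access)."""
    # Imported here so the helper below still works without pandas-datareader.
    import pandas_datareader as pdr

    # Assumption: start and end are passed through to the FRED reader.
    return pdr.get_data_fred(series_id, start=start, end=end)


def yearly_average(df):
    """Average every column by calendar year; expects a DatetimeIndex."""
    return df.groupby(df.index.year).mean()


# Tiny hand-made frame standing in for downloaded data (values are invented).
sample = pd.DataFrame(
    {"GS10": [1.0, 3.0, 5.0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2021-01-01"]),
)
print(yearly_average(sample))
```

With a network connection, yearly_average(fetch_fred_series('GS10', '2015-01-01', '2020-12-31')) should give you one row per year.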

VisualData 👓

If you are looking for computer vision datasets, VisualData is a nice new source. It has some handy filtering options. Thanks to Jie Feng for reminding me of it! Added Nov. 2, 2020.

Data.gov 🏛

If you are looking to use the US government’s datasets, Data.gov has over 217,000 of them! Thanks to Michael Wallace for recommending it. Added Nov. 4, 2020.

data.europa.eu 🇪🇺

The official portal for European data has over a million datasets. data.europa.eu is hosted by the European Union. Added Oct. 28, 2021.

Awesome Satellite Imagery Datasets 🛰

Christoph Rieke has a GitHub repo that is just what it sounds like. Jacob Koehler led me to it. Added May 26, 2021.

Free GIS Data 🗺

The Free GIS Data website links to over 500 websites with free geographic datasets. The sites are nicely categorized, too! Added May 5, 2022.

Papers With Code 📑

Papers With Code has over 4,000 datasets as of mid-2021. The datasets are ranked by the number of papers they appear in. “The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code and evaluation tables.” And apparently datasets, too! 🎉

Python API Wrappers 🐍

I recently updated my list of Python API wrappers to help users see how popular each package is and whether it’s being actively maintained. My repo now uses shields.io to automatically display GitHub stars and the date of the most recent commit. This list was originally forked from the GitHub repo of Real Python via johnwmillr. My repo contains what I believe is the largest updated list of Python API wrappers, many of which can help you find the data you might need for a project.

APIs

Getting data from a documented API using Python might sound intimidating if you haven’t done it before, but it’s really not bad. Check out my guide to getting data from APIs here. 🚀
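
To give a feel for it, here is a minimal sketch that uses only the standard library. The endpoint https://api.example.com/v1/records, the q and limit parameters, and the "results" key are all hypothetical placeholders. A real API’s documentation will tell you what to use instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON response body (needs network)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_records(payload, key="results"):
    """Return the list stored under `key`, or an empty list if it is missing."""
    return payload.get(key, [])


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.com/v1/records", {"q": "poppies", "limit": 5})

# Parse a hand-written sample body instead of calling the network.
sample_body = '{"results": [{"id": 1, "name": "poppy"}]}'
records = extract_records(json.loads(sample_body))
print(url)
print(records)
```

Against a real service, you would call extract_records(fetch_json(url)) and adjust the key to match the response format in the API’s docs.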

Make your own

When all else fails, collecting your own data can be an excellent way to create a dataset for your needs. 😉

Recap

Do you have a favorite place to find data? Awesome! Share it on Twitter or leave it in the comments! 🎉

I hope you find this list helpful when you’re searching for data sources. If you do, please share it on your favorite social media. 🚀

I write about Python, data science, and other tech topics. If you’re into that kind of stuff, read more here and subscribe to my Data Awesome newsletter for awesome monthly curated data resources.

Source: pixabay.com

Happy data hunting! 😀

I write about data things. Follow me on Medium and join my Data Awesome mailing list to stay on top of the latest data tools and tips: https://dataawesome.com