The Top 17 Places to Find Datasets 📊
I’m often asked for good places to find data. Here are t̶e̶n̶ seventeen great sources.
Without further ado, here are the best places to find data, with some helpful information about each. Folks keep pointing me to new sources, so the list is expanding! If you have a favorite, please send it my way! 😀
Awesome Data 😎
Awesome Data is a GitHub repository with a seriously impressive list of datasets separated by category. It is updated regularly.
Data Is Plural 🔢
Jeremy Singer-Vine’s Data Is Plural weekly newsletter has great fresh data sources. I’m always impressed by the quality. The archive is available here.
Kaggle Datasets
In addition to competitions, Kaggle has a huge range of datasets. Kaggle Datasets provide great summary information and previews for most datasets. You can download the data or use their platform to analyze it in a Jupyter notebook. You can also contribute your own datasets and make them public or private.
Data.world 🌍
Like Kaggle, Data.world provides a wide range of user-contributed datasets. It also offers a platform for companies to store and organize their data.
Google Dataset Search Tool
I think it’s safe to say that Google knows a thing or two about search. It recently added a separate search functionality for datasets through its Google Dataset Search Tool. It’s worth a shot if you’re looking for data on a particular topic or from a particular source.
Hugging Face 🤗
Hugging Face has nearly 2,000 datasets, including many NLP datasets. I love their model cards that contain descriptions, intended uses and limitations, operating instructions, biases, training data and training procedure information, and evaluation results on many common metrics. Added Nov. 16, 2021.
Reddit Datasets
The subreddit r/datasets has lots of great datasets posted regularly by users. Added January 25, 2021.
OpenDaL 🕐
OpenDaL is a data aggregator that allows you to search using a variety of metadata. For example, you can search based on time or location.
Pandas Data Reader 🐼
The Pandas DataReader will help you pull data from online sources into Python pandas DataFrames. Most of the data sources are financial. Here’s the list of available data sources as of late 2020:
- Tiingo
- IEX
- Alpha Vantage
- Enigma
- Quandl
- St. Louis Fed (FRED)
- Kenneth French’s data library
- World Bank
- OECD
- Eurostat
- Thrift Savings Plan
- Nasdaq Trader symbol definitions
- Stooq
- MOEX
- Naver Finance
Here’s how you use it after installing it into a Python environment with pip install pandas-datareader.
import pandas_datareader as pdr
# Fetch the GS10 series (10-year Treasury constant maturity rate) from FRED
pdr.get_data_fred('GS10')
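If you only need a slice of a series, passing a date window can help. The following is a minimal sketch rather than the library's recommended usage: `date_window` and `fetch_fred` are hypothetical helper names, and it assumes pandas-datareader is installed, is compatible with your pandas version, and that FRED is reachable from your machine.

```python
from datetime import date


def date_window(years_back, today=None):
    """Return (start, end) dates covering the last `years_back` years."""
    end = today or date.today()
    try:
        start = end.replace(year=end.year - years_back)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        start = end.replace(year=end.year - years_back, day=28)
    return start, end


def fetch_fred(series_id, years_back=5):
    """Download one FRED series as a DataFrame (needs network access)."""
    # Imported lazily so the date helper works without pandas-datareader.
    import pandas_datareader as pdr

    start, end = date_window(years_back)
    return pdr.get_data_fred(series_id, start=start, end=end)


# Example call (requires network access):
# df = fetch_fred("GS10", years_back=5)
# print(df.tail())
```

Keeping the date logic separate from the download makes it easy to check the window before making any network request.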
VisualData 👓
If you are looking for computer vision datasets, VisualData is a nice new source. It has some handy filtering options. Thanks to Jie Feng for reminding me of it! Added Nov. 2, 2020.
Data.gov 🏛
If you are looking to use the US government’s datasets, Data.gov has over 217,000 of them! Thanks to Michael Wallace for recommending it. Added Nov. 4, 2020.
data.europa.eu 🇪🇺
The official portal for European data has over a million datasets. data.europa.eu is hosted by the European Union. Added Oct. 28, 2021.
Awesome Satellite Imagery Datasets 🛰
Christoph Rieke has a GitHub repo that is just what it sounds like. Jacob Koehler led me to it; added on May 26, 2021.
Free GIS Data 🗺
The Free GIS Data website links to over 500 websites with free geographic datasets. The sites are nicely categorized, too! Added May 5, 2022.
Papers With Code 📑
Papers With Code has over 4,000 datasets as of mid-2021. The datasets are ranked by the number of papers they appear in. “The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code and evaluation tables.” And apparently datasets! 🎉
Python API Wrappers 🐍
I recently updated my list of Python API wrappers to help users see how popular each package is and whether it’s being actively maintained. My repo now uses shields.io to automatically display GitHub stars and the date of the most recent commit. This list was originally forked from the GitHub repo of Real Python via johnwmillr. My repo contains what I believe is the largest updated list of Python API wrappers, many of which can help you find the data you might need for a project.
APIs
Getting data from a documented API using Python might sound intimidating if you haven’t done it before, but it’s really not bad. Check out my guide to getting data from APIs here. 🚀
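To give a flavor of what that looks like, here is a minimal sketch using only the Python standard library. The endpoint and its parameters are hypothetical placeholders (not a real service), and `build_url` and `fetch_json` are illustrative helper names; a real API will document its own URL, parameters, and authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, params):
    """Append query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Request a URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.com/v1/records", {"q": "rainfall", "limit": 5})
print(url)

# Example call (requires a real endpoint):
# records = fetch_json(url)
```

Many people reach for the third-party requests package instead of urllib; the overall shape (build a URL, send a request, parse JSON) stays the same.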
Make your own
When all else fails, collecting your own data can be an excellent way to create a dataset for your needs. 😉
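If you go this route, saving your observations in a plain CSV file keeps them easy to load later. Here is a small sketch with Python's built-in csv module; the column names and sample rows are made up for illustration, and `write_observations` is a hypothetical helper name.

```python
import csv
import io


def write_observations(rows, fieldnames, out):
    """Write a list of dict rows to a file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample observations, for illustration only.
observations = [
    {"date": "2022-05-01", "location": "backyard", "birds_seen": 4},
    {"date": "2022-05-02", "location": "backyard", "birds_seen": 7},
]

# An in-memory buffer stands in for a file here.
buffer = io.StringIO()
write_observations(observations, ["date", "location", "birds_seen"], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("observations.csv", "w", newline="") in place of the buffer.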
Recap
- Awesome Data
- Kaggle Datasets
- Data.world
- Google Dataset Search Tool
- Hugging Face
- r/datasets
- OpenDaL
- Pandas Data Reader
- Data Is Plural
- VisualData
- Data.gov
- data.europa.eu
- Awesome Satellite Imagery Datasets
- Free GIS Data
- Papers With Code
- API Wrappers
- APIs
- Make your own!
Do you have a favorite place to find data? Awesome! Share it on Twitter or leave it in the comments! 🎉
I hope you find this list helpful when you’re searching for data sources. If you do, please share it on your favorite social media. 🚀
I write about Python, data science, and other tech topics. If you’re into that kind of stuff, read more here and subscribe to my Data Awesome newsletter for awesome monthly curated data resources.
Happy data hunting! 😀