
If you are just getting started in data science and looking for some interesting datasets to play with, this might be the article for you. Many courses and books never move beyond the classic Titanic and Iris datasets. There is no harm in that, but these datasets have become so familiar that many people know by heart how many missing values or string columns they contain. This article is a chance to discover some fresh datasets to tinker with.
This article is part of a complete series on finding good datasets. Here are all the articles included in the series:
Part 1: Getting Datasets for Data Analysis tasks – Advanced Google Search
Part 2: Useful sites for finding datasets for Data Analysis tasks
Part 3: Creating custom image datasets for Deep Learning projects
Part 4: Import HTML tables into Google Sheets effortlessly
Part 5: Extracting tabular data from PDFs made easy with Camelot.
Part 6: Extracting information from XML files into a Pandas dataframe
Part 7: 5 Real-World datasets for honing your Exploratory Data Analysis skills
Palmer Archipelago penguin data
A drop-in replacement for Iris Dataset

The overused Iris flower dataset, also known as Fisher’s Iris dataset, is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher. The Palmer penguins dataset is a drop-in replacement for the classic Iris data. It consists of attributes of three penguin species: Adélie, Gentoo, and Chinstrap. It is a great introductory dataset for data exploration and visualization.
The data folder contains two CSV files:
- penguins_size.csv: Includes variables like species, body_mass, gender, island, etc.
- penguins_lter.csv: Original combined data for three penguin species.
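A first look at a file like penguins_size.csv usually starts with the column types and the number of missing values per column. The sketch below shows one way to do that with pandas. It reads a tiny inline sample rather than the real file so that it runs anywhere; the column names and values in the sample are assumptions made for illustration, not a copy of the Kaggle data.

```python
import io

import pandas as pd

# Tiny inline stand-in for penguins_size.csv. Column names and values
# are illustrative assumptions; check the real file on Kaggle.
SAMPLE_CSV = """species,island,body_mass_g,sex
Adelie,Torgersen,3750,MALE
Adelie,Torgersen,,FEMALE
Gentoo,Biscoe,5000,
Chinstrap,Dream,3500,FEMALE
"""


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


penguins = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(penguins)
print(summary)
```

To try this on the actual data, replace the `io.StringIO(...)` argument with the path to the downloaded CSV file.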
Link to Dataset: https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data
Starter Notebook: https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris
COVID-19 Clinical Trials dataset
Database of COVID-19 related clinical studies being conducted worldwide

ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is maintained by the National Institutes of Health. The COVID-19 Clinical Trials dataset consists of the COVID-19 related clinical trials listed on the site.
The dataset consists of XML files, each corresponding to one study. The filename is the NCT number, a unique identifier of a study in the ClinicalTrials.gov repository. A CSV file is also provided; it contains less information than the XML files but is sufficient for many analyses. The starter notebook explains how to convert the XML files into a pandas dataframe.
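The general idea of flattening one-study-per-file XML into rows can be sketched with the standard library alone. This is not the starter notebook's code, only a minimal illustration. The two sample studies, their NCT numbers, and the tag names (`clinical_study`, `id_info/nct_id`, `brief_title`, `overall_status`) are assumptions; inspect a real file from the dataset to confirm the actual layout.

```python
import xml.etree.ElementTree as ET

# Two made-up study records keyed by filename. Tag names and NCT
# numbers are assumptions for illustration, not real studies.
SAMPLE_STUDIES = {
    "NCT00000001.xml": """<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <brief_title>Example study A</brief_title>
  <overall_status>Recruiting</overall_status>
</clinical_study>""",
    "NCT00000002.xml": """<clinical_study>
  <id_info><nct_id>NCT00000002</nct_id></id_info>
  <brief_title>Example study B</brief_title>
</clinical_study>""",
}

# Output column name -> path of the tag inside the study element.
FIELDS = {
    "nct_id": "id_info/nct_id",
    "title": "brief_title",
    "status": "overall_status",
}


def study_to_record(xml_text: str) -> dict:
    """Flatten one study's XML into a dict; missing tags become None."""
    root = ET.fromstring(xml_text)
    return {name: root.findtext(path) for name, path in FIELDS.items()}


records = [study_to_record(text) for text in SAMPLE_STUDIES.values()]
for record in records:
    print(record)
```

A list of dicts like `records` can be passed directly to `pandas.DataFrame(records)` to get one row per study, with `None` showing up as a missing value.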
Link to Dataset: https://www.kaggle.com/parulpandey/covid19-clinical-trials-dataset
Starter Notebook: EDA on COVID-19 Clinical Trials
Article: Extracting information from XML files into a Pandas dataframe
Forbes Highest-Paid Athletes 1990–2020
Who earns the most in Sports?

This dataset consists of a complete list of the world’s highest-paid athletes since Forbes published its first list in 1990. In 2002, Forbes changed the reporting period from the full calendar year to June-to-June; consequently, there are no records for 2001. The dataset covers records up to 2020.
Link to Dataset: https://www.kaggle.com/parulpandey/forbes-highest-paid-athletes-19902020
Starter Notebook: 💰 Who earned the most in Sports in 2020🏆 ?
IT Salary Survey for EU region (2018–2020)
Annual Anonymous IT Salary Survey for the European region

An anonymous salary survey has been conducted annually since 2015 among European IT specialists, with a stronger focus on Germany. In 2020, 1,238 respondents volunteered to participate. The authors have made the data publicly available and shared it on Kaggle to reach a wider audience. The dataset contains rich information about salary patterns among IT professionals in the EU region.
Link to Dataset: https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region
Article: IT Salary Survey December 2020
U.S. International Air Traffic data (1990–2020)
Airport and airline Traffic by the US and International Carriers

This dataset comes from the U.S. International Air Passenger and Freight Statistics Report. As part of the T-100 program, USDOT receives traffic reports from US and international airlines operating to and from US airports. There are two datasets available:
- Departures: Data on all flights between US gateways and non-US gateways, irrespective of origin and destination.
- Passengers: Data on the total number of passengers for each month and year between a pair of airports, as serviced by a particular airline.
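Because the Passengers table is described as one record per month, airport pair, and airline, a natural first step is to sum over airlines to get total monthly traffic per route. The sketch below does this with a pandas `groupby`. The rows are made up, and the column names (`year`, `month`, `us_airport`, `foreign_airport`, `airline`, `total`) are assumptions for illustration; the real file may name them differently.

```python
import pandas as pd

# Made-up rows shaped like the Passengers table described above.
# Column names and all values are illustrative assumptions.
passengers = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020],
    "month": [1, 1, 2, 1],
    "us_airport": ["JFK", "JFK", "JFK", "JFK"],
    "foreign_airport": ["LHR", "LHR", "LHR", "LHR"],
    "airline": ["Airline A", "Airline B", "Airline A", "Airline A"],
    "total": [1000, 1500, 900, 800],
})

# Sum over airlines to get monthly traffic for each airport pair.
route_monthly = (
    passengers
    .groupby(["us_airport", "foreign_airport", "year", "month"], as_index=False)["total"]
    .sum()
)
print(route_monthly)
```

Dropping `month` from the grouping keys gives yearly totals per route instead, which is handy for plotting long-term trends.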
Link to Dataset: https://www.kaggle.com/parulpandey/us-international-air-traffic-data
Conclusion
There is no better way to learn something than by doing, and data science is no different. All these datasets are available on Kaggle and can be analyzed in its dockerized notebook environment, which means most of the libraries you would need for your analysis are already installed. The starter notebooks can help you get going quickly. You can begin by exploring one of the datasets and then turn your analysis into a blog post to share your results with the community.