
4 Secret Ways to Get Superb Datasets for Data Science Projects & Impress Your Boss.

From NASA to YouTube, let's take data science and machine learning to another level.

Data Science & Python for beginners.

Photo by Jason Hogan on Unsplash

If you’ve ever worked on a personal Data Science project, you’ve probably spent a lot of time browsing the internet looking for interesting datasets to analyze.

Although it can be fun to sift through dozens of datasets to find the perfect one, it can also be frustrating to download and import several CSV files only to realise that the data isn't interesting after all.

On top of that, making your data science journey more fun is crucial for your progress and keeps your motivation high.

I firmly believe that you will always be more efficient if you enjoy what you are doing. From experience, that was tip number five of my top six tips that helped me get promoted this year:

Python: 6 coding hygiene tips that helped me get promoted.

That's why, as a father and data scientist, I believe that finding the right dataset to make your machine learning model unique and enjoyable is crucial.

In this article, you will get a shortlist of four unique and original dataset sources that you can access for free with a single line of Python code.

Let’s start!

1. NASA 🚀

The first unique and original dataset provider I want to present is NASA, the National Aeronautics and Space Administration.

For people living on another planet (let me know if you appreciated the joke :)), NASA is the U.S. government agency responsible for science and technology related to air and space.

In case you weren't aware, NASA, as a publicly funded government organisation, makes its data freely available to the public, so you can run your deep learning and machine learning algorithms on it. The datasets provided by the organisation are split into two categories:

You can start exploring this data either by downloading the CSV directly from the official website linked above or by requesting the data with the NASA Python API. This API is made available by MIT for research purposes. You can install the library with the following command:

pip install python-nasa-api

You then need to pass in the ticker that corresponds to the dataset you want, and you will be ready to explore the different stars of the Milky Way.

Here is an example of a dataset that I requested using the Python NASA API. It shows the data released by the agency on discovery types over time in the galaxy:
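The exact calls of python-nasa-api are not shown in this article, so I will not guess at them here. As an alternative, here is a minimal sketch that uses only the Python standard library. It assumes the NASA Exoplanet Archive TAP endpoint, its `ps` table, and the columns `pl_name`, `discoverymethod`, `disc_year` and `default_flag`; those names, and the helper functions `build_url`, `fetch_csv` and `count_discoveries`, are my own choices for illustration, not something the library above provides.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NASA Exoplanet Archive TAP service and its "ps" table.
TAP_URL = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"
QUERY = (
    "select pl_name, discoverymethod, disc_year "
    "from ps where default_flag = 1"
)


def build_url(query=QUERY, fmt="csv"):
    """Build the request URL for a TAP query returning CSV."""
    return TAP_URL + "?" + urlencode({"query": query, "format": fmt})


def fetch_csv(url, timeout=60):
    """Download the query result as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def count_discoveries(csv_text):
    """Count planets per (discovery year, discovery method)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        (int(row["disc_year"]), row["discoverymethod"])
        for row in reader
        if row["disc_year"]  # skip rows without a discovery year
    )
```

Calling `count_discoveries(fetch_csv(build_url()))` should give you one count per year and discovery method, which you can then plot with the library of your choice.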

Example of Input dataset provided by NASA - Image by Authors

The second original dataset provider is Quandl.

2. Quandl for Banks and official organisations. 🇺🇸

The second API I recommend is more official but still a lot of fun. The Quandl API lets you get data directly from highly influential institutions such as the Bank of Canada, the US Federal Reserve, the Indian government, and others, with one line of Python code.

I have already written a full guide about this API; take a look if you are interested:
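To give you an idea of what that one line looks like, here is a minimal sketch. It assumes the `quandl` package is installed (`pip install quandl`) and that you have a free API key. The wrapper `fetch_series` is a hypothetical helper of mine, and the dataset code in the usage note is only illustrative and may no longer be served.

```python
def fetch_series(code, api_key, start_date=None):
    """Fetch one Quandl dataset as a pandas DataFrame.

    `code` is expected in the "DATABASE/DATASET" form used by Quandl.
    """
    if "/" not in code:
        raise ValueError("expected a code such as 'DATABASE/DATASET'")

    # Imported here so the check above works without the package installed.
    import quandl

    quandl.ApiConfig.api_key = api_key
    kwargs = {"start_date": start_date} if start_date else {}
    # This is the "one line": quandl.get returns a DataFrame indexed by date.
    return quandl.get(code, **kwargs)
```

For example, `fetch_series("FRED/GDP", api_key="YOUR_KEY", start_date="2010-01-01")` would request a GDP series, assuming that code is still available for your account.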

Python: I Have Tested Quandl API and How to Get Real Estates & Economics Data in One Line of Code.

The range of data provided by Quandl goes beyond financial and economic data; you can also explore real estate prices in your borough and hiring market activity, among many others.

I will let you discover the full guide above on how to get this data, but I believe it is a good asset if you need plenty of data to train a machine learning or deep learning model and, why not, to write an article about it.

Here is an example of the output of a comparison I made by analysing New York real estate data for Harlem and Midtown Manhattan:

Comparison of the monthly rental price (US $) of a one-bedroom apartment between Manhattan Midtown & Harlem (New York City). - Gif by Authors.

It is interesting to notice that the monthly rental price showed a significant increase in July 2014 in Manhattan. You can find the full code in this article.

Now let's move on to another way to get datasets, one you would probably never think about: YouTube.

3. YouTube 📺

Have you ever dreamed of creating your own YouTube algorithm? You can do it now using Python.

The third data provider is YouTube. Since 2015, YouTube has made its live data free to use through its own API, so you can even build your own YouTube algorithm.

Otherwise, you can use this data to run your own analysis, understand what's happening within your YouTube channel, and maybe discover interesting hidden patterns.

If you want to start now, you just have to follow a three-step process.

Process

Setting up the API is simple and requires three steps:

  • Install the YouTube Python API library
  • Set up your Google credentials
  • Request the data

You can find a full example in this brilliant article written by a fellow Towards Data Science writer (Chris Lovejoy):
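The three steps above can be sketched as follows. This assumes the `google-api-python-client` package (`pip install google-api-python-client`) and an API key created in the Google Cloud console with the YouTube Data API v3 enabled. The helper names `search_videos` and `extract_titles` are mine, chosen for illustration.

```python
def extract_titles(response):
    """Turn a search.list response into (video_id, title) pairs."""
    return [
        (item["id"]["videoId"], item["snippet"]["title"])
        for item in response.get("items", [])
    ]


def search_videos(api_key, query, max_results=5):
    """Search YouTube for videos matching `query` (needs network access)."""
    # Imported here so extract_titles can be used without the package.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)
    response = (
        youtube.search()
        .list(part="snippet", q=query, type="video", maxResults=max_results)
        .execute()
    )
    return extract_titles(response)
```

A call such as `search_videos("YOUR_KEY", "machine learning tutorial")` should return a short list of video IDs and titles that you can feed into your own ranking logic.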

I created my own YouTube algorithm (to stop me wasting time)

I want to conclude this article with some extra alternatives for developing more specific data science skills.

4. Other alternatives 🤟

Before closing this article, I want to give you a list of extra alternatives to improve your data science models in specific fields, such as geographic data, or to get better at sentiment analysis.

These sources can be used in other categories as well.

Geospatial data:

If you are looking to build an interactive geographic plot about any subject, I recommend using OpenDataSoft. A few weeks ago, I discovered this API while building a results-driven geospatial map for a client.

The datasets provided let you download all the geospatial information as a JSON file, which will save you hours of work.
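As a rough idea of how little work is left once you have that JSON file, here is a minimal sketch. It assumes the export is a GeoJSON FeatureCollection containing Point features, which will not be true of every dataset. The function name `load_points` is hypothetical.

```python
import json


def load_points(geojson_text):
    """Flatten GeoJSON Point features into dicts with lat/lon keys."""
    collection = json.loads(geojson_text)
    points = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue  # this sketch ignores lines and polygons
        # GeoJSON stores coordinates as [longitude, latitude].
        lon, lat = geometry["coordinates"][:2]
        properties = feature.get("properties") or {}
        points.append({**properties, "lat": lat, "lon": lon})
    return points
```

The resulting list of dictionaries can be passed straight to a DataFrame constructor or to a mapping library.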

Market data:

If you are studying finance and looking for a way to get market data, I strongly recommend the Yahoo Finance API.

I tested the API in real time to check its robustness, and I was positively surprised by its effectiveness.

You can find full information in this article.
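If you want a quick taste before reading the full article, here is a minimal sketch. It assumes the community-maintained `yfinance` package (`pip install yfinance`), which is an unofficial wrapper and not necessarily the client used in my test. The helpers `fetch_closes` and `daily_returns` are hypothetical names.

```python
def daily_returns(closes):
    """Compute simple day-over-day returns from a list of closing prices."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(closes, closes[1:])
    ]


def fetch_closes(symbol, period="1mo"):
    """Download recent closing prices for `symbol` (needs network access)."""
    # Imported here so daily_returns works without the package installed.
    import yfinance as yf

    history = yf.Ticker(symbol).history(period=period)
    return history["Close"].tolist()
```

For instance, `daily_returns(fetch_closes("AAPL"))` would give you roughly one month of daily returns to experiment with.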

📰 Data for NLP (Natural Language Processing) models:

If you want to get better at analysing text and develop your Natural Language Processing (NLP) skills, I highly recommend using the Twitter API.

I tested multiple online courses on this subject, and I discovered a unique and exciting way to build NLP models: using Twitter data.

I believe Twitter can be a very interesting way to develop your data science skills, because you are testing your artificial intelligence model on data from real people, and all of that for free.

I have never had so much fun, and my wife now sees me as a superhero (not exactly :)), which has made me much more focused on improving my skills.

Here is a course that I highly recommend:

2021: Algorithmic Trading with Machine Learning in Python

Conclusion:

I hope you enjoyed this article. Obviously, this list can be extended (Netflix, Amazon, Google, Kaggle, ...); it is just a sample of the datasets I found the most interesting.

As a father and data scientist, I am trying to get my son into it, and the best way I have found is to give him motivating and ambitious projects to work on. He loves it.


Thank you for your attention,

Happy coding, Sajid.

