
A Step-by-Step Guide to Completely Learn Data Science by Doing Projects

Build a portfolio and become job-ready as you learn

Photo by Prateek Katyal on Unsplash

There are over 5 million registered users on Kaggle, and over 5 million people have enrolled in at least one of Andrew Ng’s machine learning courses. The data science job market is highly competitive. Whether you are learning data science through a master’s program or on your own, hands-on practical exposure is essential to stand out, and it gives you much of the confidence that comes from real job experience.

What if I told you that you can get real data science experience while learning? The most efficient way to master data science is to learn by doing projects. Projects confront you with the real-world challenges that come up in the day-to-day job of a data scientist: you end up learning the concepts, implementing them, and troubleshooting the issues that arise. Most importantly, they help you build an impressive portfolio while you learn.

To become job-ready, you need practical exposure in the following areas:

  1. Data collection and cleaning
  2. Extracting the insights
  3. Machine Learning algorithms
  4. Improving communication skills and showing off

The problem many learners face is identifying projects that can actually teach data science. In this article, I am going to show some interesting datasets and projects that will help you learn the important aspects of data science. The only prerequisite is basic knowledge of a programming language used for data science. If you want to gain some programming knowledge first, check the ‘Learning to code using Python/R‘ section in the article here.

1. Data collection and cleaning

One key problem with following a curriculum to learn data science is that it doesn’t expose you to real-world issues. In most learning environments, the data provided is clean enough to be used directly, and Kaggle datasets are mostly clean too, or at least formatted for direct use. In reality, a data scientist would spend days collecting data from different sources and combining them into one master dataset, and the raw data would have quality and consistency issues.

So, to get better practical exposure to data collection and cleaning, the best way forward is to collect your own datasets. There is data everywhere; you just need to find an interesting problem. Let me make it simple by sharing some sample project ideas, along with references for learning and implementing web scraping.

Project 1 – Impact of weather and vaccination rates on daily Covid-19 cases

Data Required for analysis:

  • Weather data – temperature, rainfall, humidity, etc.
  • Daily vaccination rate
  • Total infected people
  • Daily covid case numbers

Key learning:

  • Web-scraping to collect data (see the sketch after this list)
  • Merging the different datasets collected
  • Cleaning and formatting the data
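
To get a feel for the scraping step, here is a minimal sketch using requests and BeautifulSoup. The URL and the table selectors are placeholders; adapt them to whichever weather or Covid-19 data source you pick.

```python
# Minimal web-scraping sketch; the URL and HTML structure are hypothetical.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://example.com/daily-weather"  # placeholder data source
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:
        rows.append(cells)

weather = pd.DataFrame(rows, columns=["date", "temperature", "rainfall", "humidity"])
weather.to_csv("weather.csv", index=False)
```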

Project 2 – Analysing movies on IMDB

The tricky part of this project is that it requires extracting data from many pages. To learn how to extract all the required data from IMDB, check the article below. The same approach can be applied to scraping data from any public source.

How to Scrape Multiple Pages of a Website Using a Python Web Scraper
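
As a rough sketch of the multi-page idea, the loop below walks a paginated listing by changing an offset parameter in the URL. The URL pattern and selectors are assumptions for illustration; inspect the site’s "next page" links to find the real ones.

```python
# Sketch of scraping multiple pages by iterating over a URL offset.
import time
import requests
from bs4 import BeautifulSoup

titles = []
for start in range(1, 202, 50):  # five listing pages of 50 results each
    url = f"https://example.com/movies?start={start}"  # hypothetical URL pattern
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    titles += [a.get_text(strip=True) for a in soup.select("h3 a")]
    time.sleep(1)  # be polite: throttle requests between pages
```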

Combining this dataset with data from social media, such as the follower counts and social influence of the lead actors, could lead to some cool insights and make your work unique and interesting. In the next section, we will look at extracting insights from the data.

Key learning:

  • Handling the missing data
  • Data transformation to make them consistent
  • Merging data collected from different sources (see the sketch after this list)
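
A quick sketch of the merging step, assuming two hypothetical CSVs produced earlier (the file and column names are placeholders):

```python
# Sketch of merging scraped movie data with social media data.
import pandas as pd

movies = pd.read_csv("imdb_movies.csv")        # scraped earlier
social = pd.read_csv("lead_actor_social.csv")  # e.g. follower counts

# Normalize the join key first; inconsistent casing and stray whitespace
# are the most common causes of silently dropped rows.
for df in (movies, social):
    df["lead_actor"] = df["lead_actor"].str.strip().str.lower()

merged = movies.merge(social, on="lead_actor", how="left")
print(merged["followers"].isna().mean())  # share of movies with no social match
```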

2. Extracting the insights

The data collected in the previous step can now be explored for insights. Start by coming up with a set of questions or hypotheses, then look for insights in the data and check the relationships between attributes. In the first project, the goal was to understand the influence of weather and vaccination rates on daily Covid-19 cases. The second project has no predefined approach; it is up to the creativity of the individual working on it. Your focus for the second dataset could be on understanding the patterns behind successful and unsuccessful movies, the impact of casting a popular actor or actress, popular genres, ideal movie length, and so on.

To learn more about extracting insights, check the notebooks below. They cover the common techniques and methodologies of exploratory data analysis.

Comprehensive data exploration with Python

Topic 1. Exploratory Data Analysis with Pandas
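
Before working through those notebooks, it helps to know what a first pass over any new dataset looks like. A minimal sketch (data.csv is a placeholder for your own file):

```python
# A typical first look at a freshly loaded dataset.
import pandas as pd

df = pd.read_csv("data.csv")
print(df.shape)       # rows x columns
print(df.dtypes)      # do the types match expectations?
print(df.describe())  # value ranges reveal outliers and unit problems
print(df.isna().mean().sort_values(ascending=False).head(10))  # worst missing columns
print(df.duplicated().sum())  # duplicate rows to drop or explain
```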

To perform a comprehensive data analysis, follow these three steps:

  • Step 1 – Formulate the questions
  • Step 2 – Look for patterns
  • Step 3 – Build a narrative

Let us look at each of them in detail below.

Formulate the questions

Always start by asking more questions about the dataset. The key is to understand the problem as deeply as possible; many data science projects fail due to a lack of focus on the actual root cause. The article below talks about using mental models to understand the problem and succeed.

5 Mental Models to Help Boost Your Data Science Career

Look for patterns

Use different analysis and visualization techniques to extract patterns from the dataset. The questions you formulated, along with inputs from other sources, should drive the initial analysis. Still, keeping an open mind will help you identify interesting insights; it is always possible to find patterns that contradict your expectations.

Look at the relationships between the attributes and how one influences another. This helps in shortlisting attributes for the machine learning model. Also, focus on handling attributes that have a lot of noise, including missing values.
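
A small sketch of that step, assuming a dataset with a numeric target column named target (the file and column names are placeholders for your own data):

```python
# Quantify pairwise relationships before shortlisting features.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # placeholder dataset with a "target" column
corr = df.select_dtypes("number").corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Pairwise correlations")
plt.show()

# Attributes strongly related to the target are modelling candidates;
# highly correlated attribute pairs are candidates for dropping one.
print(corr["target"].abs().sort_values(ascending=False).head(10))
```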

Build a narrative

Now it is time to pick the interesting findings and come up with a narrative. A narrative is the linking thread that walks the audience through the findings in the sequence they will understand best. Many important insights and findings are wasted if they are not packaged into a good narrative. For example, if you are working on a customer churn problem, the narrative could be organized as follows:

  • How many customers churn in a month?
  • How does the churn rate compare across the industry?
  • What is the general profile of our customers?
  • Who are the customers churning, and can they be grouped by profile type?
  • What is the revenue loss across the different profile types?
  • Which segments are of the highest importance?
  • Which customers churned for genuine reasons that can’t be prevented?
  • What are the top 10 reasons the remaining customers churn?
  • How could this be fixed, and what are the recommendations?

A good narrative helps you communicate the analysis clearly. The success of a data science project lies in the value it provides to the business; if the business team fails to see any actionable insights, the project is considered a failure. So, coming up with a good narrative is as important as performing a thorough analysis.

3. Machine learning algorithms

Now let us learn different machine learning algorithms by using them. I have included datasets and sample learning scripts for different categories of machine learning problems; these will be enough to learn the most commonly used algorithms. The problem types covered here are:

  • Supervised learning
  • Unsupervised learning
  • NLP
  • Computer vision problem
  • Recommendation system

Supervised learning

When we have a labeled dataset, we use supervised learning. The two key categories of supervised learning are regression and classification, and I have provided one dataset for each.

First, refer to the Kaggle notebooks below to get a better understanding of supervised learning algorithms. These well-documented scripts will help you understand the steps and standards involved in solving supervised learning problems. The first covers a regression problem and the second a classification problem.

Linear Regression 📈 House 🏡 price 💵 prediction

Titanic Tutorial

The goal of learning by doing is to get as much hands-on experience as possible. Use the above scripts as references and solve the datasets below. To make it even better, spend enough time reading through the Kaggle discussion forums. The forums are a goldmine of information, full of interesting techniques and tips for solving the problems better.

To increase your learning and maximize your chances of getting a job, follow these steps:

  • Start with analyzing the dataset
  • Identify the interesting patterns and insights
  • Understand the relationship between the independent variables and the target
  • Explore feature engineering
  • Try different models for prediction
  • Measure the accuracy
  • Refine by trying different features, algorithms, and parameter settings
  • Upload the code to your Git Repository
  • Write a blog and/or upload your notebook on Kaggle with details

Regression problem: The dataset attached for this problem is housing prices. It will help you learn about regression problems and the algorithms used to solve them. This particular dataset has more than 75 attributes describing each property, so you will get a hang of feature selection and the other typical issues in solving regression problems.

House Prices – Advanced Regression Techniques
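
As a starting point, here is a deliberately simple baseline sketch for this dataset. SalePrice is the target column named in the competition’s data description; treat the rest as assumptions to verify against train.csv.

```python
# Baseline regression sketch for the house-price data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("train.csv")
X = df.select_dtypes("number").drop(columns=["SalePrice"]).fillna(0)  # crude imputation
y = df["SalePrice"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_val, model.predict(X_val)))
```

A baseline like this gives you a number to beat as you layer in feature engineering and stronger models.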

Classification problem: Classification problems are those where we sort data into classes. The example below is a binary classification problem: a health insurer wants to predict which of its customers are interested in vehicle insurance. As with the regression problem, always start by analyzing the dataset; the better you understand the data, the better the prediction results.

Health Insurance Cross Sell Prediction 🏠 🏥
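
A comparable baseline sketch for this classification problem. Response is the target column in the Kaggle file, but verify the column names in your copy.

```python
# Baseline binary classification sketch for the cross-sell data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("train.csv")
X = pd.get_dummies(df.drop(columns=["Response"]))  # one-hot encode categoricals
y = df["Response"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```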

While solving these problems, focus on:

  • Learning different techniques to analyze the data
  • Learning about feature engineering techniques
  • Understanding which algorithms go well with which kinds of data
  • Documenting your scripts clearly and making them available in your Git repository
  • Writing a blog post about what you learned – trust me, it helps a lot

Unsupervised learning

Unsupervised learning is used when the dataset is unlabeled, for example, when we want to use customers’ profile information to group them into different categories. The approach to an unsupervised learning problem should be similar to supervised learning: always start with data analysis.

First, let us learn about clustering algorithms using the mall customer segmentation problem. The task is to create customer clusters based on the information provided. We don’t stop once the clusters are identified; we analyze further to understand the similarities within a cluster and the dissimilarities between clusters. Below is a well-documented sample script on approaching a clustering problem.

Mall Customers Clustering Analysis
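
For orientation, a minimal KMeans sketch on this dataset. The column names below match the Kaggle file, but double-check them in your copy.

```python
# KMeans sketch for the mall customers data, with the elbow method.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Mall_Customers.csv")
cols = ["Annual Income (k$)", "Spending Score (1-100)"]
X = StandardScaler().fit_transform(df[cols])

# Elbow method: inertia for k = 1..10; look for the bend when plotted.
print([KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
       for k in range(1, 11)])

df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
print(df.groupby("cluster")[cols].mean())  # profile each cluster
```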

Now let us scale up and work with sensor data. This will teach you about working with data produced by IoT devices. While human-readable data like customer profiles is easy to work with and understand, sensor data is usually tricky: it requires much more analysis to extract insights, which are generally not visible from looking at the dataset directly.

Household Electric Power Consumption
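
A sketch of one way to tame that data: aggregate the raw readings into daily profiles first, then cluster the days. The file layout follows the UCI dataset description (semicolon-separated, ‘?’ for missing values), but verify it against your download.

```python
# Aggregate sensor readings into daily profiles, then cluster the days.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

power = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?")
power["timestamp"] = pd.to_datetime(power["Date"] + " " + power["Time"], dayfirst=True)

daily = (power.set_index("timestamp")["Global_active_power"]
              .resample("D").agg(["mean", "max"])
              .dropna())

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(daily)
)
print(pd.Series(labels).value_counts())  # how many days fall in each usage regime
```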

This example will give you a better understanding of clustering problems. While learning, focus on the following areas:

  • Understanding different algorithms
  • Which algorithm works better on what data?
  • Data transformation to suit the requirements of the algorithm
  • Visualizations that help in comparing the clusters

NLP

The next area of focus is natural language processing. An increasing amount of data is being generated on social media and other online platforms, and many companies are starting to focus on it because it contains vital information.

The tweets dataset below will help you get familiar with text data. The issues with text data are quite different from those of structured data, so it needs a different set of techniques and approaches. While working on the dataset, focus on:

  • Techniques and methods for data cleaning
  • Eliminating stop words and other tokens that don’t add value
  • Handling the noise in the dataset
  • Libraries used for extracting the sentiments

If you are new to natural language processing, first refer to the introductory script here. It shows how to approach and solve an NLP problem. Then use what you learn to work on the dataset below.

Coronavirus tweets NLP – Text Classification
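
A compact sketch of a cleaning-plus-classification pipeline for this dataset. The file name, encoding, and column names (OriginalTweet, Sentiment) follow the Kaggle listing, but verify them in your copy.

```python
# Text cleaning and a TF-IDF + logistic regression baseline for the tweets.
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def clean(text: str) -> str:
    text = re.sub(r"http\S+|@\w+|#", "", text)     # drop URLs, mentions, hash signs
    return re.sub(r"[^a-z\s]", " ", text.lower())  # keep lowercase letters only

df = pd.read_csv("Corona_NLP_train.csv", encoding="latin-1")
X = df["OriginalTweet"].map(clean)
y = df["Sentiment"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=20_000),  # drops stop words too
    LogisticRegression(max_iter=1000),
)
print("Accuracy:", model.fit(X_train, y_train).score(X_val, y_val))
```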

Computer vision problem

Recent advancements in processing power have made image recognition feasible. Computer vision applications are increasingly used in:

  • Health-care
  • Security and Surveillance
  • Inspection and Predictive Maintenance
  • Autonomous Driving

To learn about convolutional neural networks and how they are applied to computer vision problems, go through the introductory script here. Then look into the image datasets from Kaggle below, which will help you learn about computer vision applications. While working on them, focus on the following areas (a minimal CNN sketch follows this list):

  • Techniques to optimize the image size without losing the information
  • Tools and frameworks that help in computer vision
  • Augmentation techniques when there isn’t enough image data
  • Pre-trained models available for better prediction
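
To make the list concrete, here is a minimal tf.keras CNN sketch that includes the augmentation layers mentioned above (available in Keras 2.6+). The input size and class count are placeholders to adjust for your dataset.

```python
# Minimal CNN with built-in augmentation layers (active only during training).
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 120  # placeholder, e.g. the number of dog breeds

model = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.RandomFlip("horizontal"),  # augmentation
    layers.RandomRotation(0.1),       # augmentation
    layers.Rescaling(1.0 / 255),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

For better accuracy, you can swap the convolutional stack for a pre-trained backbone such as tf.keras.applications.EfficientNetB0, as the last bullet suggests.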

There is a slight difference between the two datasets below. The first is about identifying dog breeds, which is a typical image recognition problem. Solving it will give you first-hand experience of the steps involved in an image recognition problem.

Dog Breed Identification

The second dataset is about object detection, where the goal is to correctly identify the objects in an image. It is a collection of satellite images of ships, and the problem is to identify all the ships present in each picture. This requires a lot of training, as in some cases the ships are really small or blend into the background.

Ship Detection from Aerial Images

Recommendation system

Recommendation systems are a very interesting technique that is popular among businesses, and they have helped many organizations improve sales and customer experience. According to this McKinsey industry report, about 35% of Amazon’s sales come from its recommendation system, and 75% of what people watch on Netflix is recommended to them.

MovieLens

If you want to learn how a recommendation system is implemented, check the notebook below. It will give you a good perspective on how recommendation systems work.

Movie Recommender Systems
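
As a taste of the idea, here is a small item-based collaborative filtering sketch on MovieLens-style ratings. The file and column names (ratings.csv with userId, movieId, rating) follow the MovieLens layout; verify them against the version you download, and note that computing the full similarity matrix only scales to the smaller releases.

```python
# Item-based collaborative filtering sketch on MovieLens-style ratings.
import pandas as pd

ratings = pd.read_csv("ratings.csv")

# Users as rows, movies as columns; each cell is that user's rating.
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating")
similarity = matrix.fillna(0).corr()  # movie-to-movie similarity

def similar_movies(movie_id: int, n: int = 5) -> pd.Series:
    """Movies whose rating patterns most resemble the given movie."""
    return similarity[movie_id].drop(movie_id).nlargest(n)

print(similar_movies(1))
```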

4. Improving communication skills and showing off

Write a blog and have a git repository

A good way to ensure your learning stays with you for a long time is to write about it. It also helps in establishing your credibility. The data science space is getting very competitive, so blogging can help you stand out. Make sure at least some of the projects you want to showcase in your resume are available in your Git repository.

Create a portfolio website

Having a portfolio website sends a strong message about your skills. A portfolio website is like an online version of your resume, so include all your work and accomplishments. If you are interested in creating a portfolio website for free using GitHub Pages, check the article below.

How to create a stunning personal portfolio website for free

Create a really good resume

The final step is creating an impressive resume. The knowledge you have gained so far doesn’t mean much without a good resume to present it. There are tools and techniques for coming up with an impactful resume, and here is an article to help you prepare one for yourself.

How to Build an Impressive Data Science Resume?

Closing comment

These projects are enough to learn the critical skills required of a data scientist. The notebooks referenced in this article should be used to better understand the concepts, but it is very important that you solve these problems yourself to learn the most. The hands-on experience you gain will boost your confidence and help you perform better in interviews. Knowledge gained by doing is many times deeper than knowledge gained by reading or watching tutorials, and it stays in memory far longer.


If you found this interesting, you might also like:

How To Grow From Non-Coder to Data Scientist in 6 Months

How to use Kaggle to Master Data Science

