There are over 5 million registered users on Kaggle, and over 5 million people have enrolled in at least one of Andrew Ng’s machine learning courses. The data science job market is highly competitive. Whether you are learning data science through a master’s program or on your own, being hands-on and having practical exposure is absolutely necessary to stand out. It gives you nearly as much confidence as real job experience does.
What if I told you that you can get real data science experience while you learn? The most efficient way to master data science is learning by doing projects. Projects throw at you the real-world challenges that come up in the day-to-day job of a data scientist. You end up learning the concepts, their implementation, and how to troubleshoot issues. Most importantly, projects help you build an impressive portfolio while you learn.
To become job-ready, you need hands-on practical exposure in the following areas:
- Data collection and cleaning
- Extracting the insights
- Machine Learning algorithms
- Improving communication skills and showing off
The problem many have is identifying projects that actually help them learn data science. In this article, I will share some interesting datasets and projects that cover the important aspects of data science. The only prerequisite is basic knowledge of a programming language used for data science. If you want to gain some programming knowledge first, check the ‘Learning to code using Python/R‘ section in the article here.
1. Data collection and cleaning
One key problem with following a curriculum to learn data science is that it doesn’t expose you to real-world issues. In most learning environments, the data provided is clean enough to be used directly. Kaggle datasets are mostly clean too, or at least formatted to be used directly. In reality, a data scientist would spend days collecting data from different sources and then combining them into one master dataset. Data in the wild usually has quality and consistency issues.
So, to get better practical exposure in data collection and cleaning, the best way forward is to collect your own datasets. There is data everywhere; you just need to find an interesting problem. Let me make it simple by sharing some sample project ideas, along with references for learning and implementing web scraping.
Project 1 – Impact of weather and vaccination rates on daily Covid-19 cases
Data Required for analysis:
- Weather data – temperature, rainfall, humidity, etc.
- Daily vaccination rate
- Total infected people
- Daily covid case numbers
Key learning:
- Web-scraping to collect data
- Merging the different datasets collected
- Cleaning and formatting the data
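To make this concrete, here is a minimal sketch of collecting and merging such data with Python. The URL, table layout, file names, and column names are all illustrative assumptions, not references to a specific site:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Scrape a (hypothetical) page that lists daily weather readings in an HTML table.
WEATHER_URL = "https://example.com/daily-weather"  # placeholder URL
response = requests.get(WEATHER_URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 3:
        rows.append({"date": cells[0], "temperature": cells[1], "rainfall": cells[2]})
weather = pd.DataFrame(rows)

# Load vaccination and case counts from CSVs (assumed already downloaded).
vaccinations = pd.read_csv("vaccinations.csv")  # assumed columns: date, daily_vaccinations
cases = pd.read_csv("daily_cases.csv")          # assumed columns: date, new_cases

# Normalize the join key, then merge everything into one master dataset.
for df in (weather, vaccinations, cases):
    df["date"] = pd.to_datetime(df["date"])

master = weather.merge(vaccinations, on="date").merge(cases, on="date")
print(master.head())
```

Real sources rarely share a common date format or granularity, which is exactly the kind of friction this project is meant to teach.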
Project 2 – Analysing movies on IMDB
The tricky part of this project is that it requires extracting data from many pages. To learn how to extract all the required data from IMDB, check the article below. The same concept can be applied to scraping data from any public source.
How to Scrape Multiple Pages of a Website Using a Python Web Scraper
Combining this dataset with data from social media would lead to some cool insights. The social media data could include the follower counts and social influence of the lead actors. This will make your work unique and interesting. In the next section, we will look at extracting insights from the data.
Key learning:
- Handling the missing data
- Data transformation to make them consistent
- Merging data collected from different sources
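As a starting point for the cleaning steps above, here is a small, self-contained sketch using pandas. The columns and messy values are made up to mirror what scraped IMDB data typically looks like:

```python
import pandas as pd

# Hypothetical scraped movie data: inconsistent types and missing values are common.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "runtime": ["142 min", None, "95 min"],
    "rating": ["8.1", "7.4", None],
    "gross": ["$1,200,000", "$850,000", "N/A"],
})

# Make numeric columns actually numeric; unparseable values become NaN.
movies["runtime"] = pd.to_numeric(
    movies["runtime"].str.replace(" min", "", regex=False), errors="coerce"
)
movies["rating"] = pd.to_numeric(movies["rating"], errors="coerce")
movies["gross"] = pd.to_numeric(
    movies["gross"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Simple imputation: fill missing numeric values with the column median.
for col in ["runtime", "rating", "gross"]:
    movies[col] = movies[col].fillna(movies[col].median())

print(movies)
```

Median imputation is just one option; whether to impute, drop, or flag missing values is itself a decision this project will force you to make.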
2. Extracting the insights
The data collected in the previous step can now be used to work on insights. Start by coming up with a set of questions or hypotheses, then look for insights in the data and check the relationships between attributes. In the first project, the goal was to understand the influence of weather and vaccination rates on daily Covid case numbers. The second project has no predefined approach; it is up to the creativity of the individual working on it. Your focus for the second dataset could be on understanding the patterns behind a successful or unsuccessful movie, the impact of having a popular actor or actress, popular genres, ideal movie length, etc.
To learn more about extracting insights, check the notebooks below. They help in understanding the common techniques and methodologies of exploratory data analysis.
To perform a comprehensive data analysis, one needs to follow the steps below:
- Step 1 – Formulate the questions
- Step 2 – Look for patterns
- Step 3 – Build a narrative
Let us look at each of them in detail below.
Formulate the questions
Always start by asking more questions about the dataset. The key here is developing the best possible understanding of the problem. Many data science projects fail due to a lack of focus on the actual root cause. The article below talks about using mental models to understand the problem well and be successful.
Look for patterns
Use different analysis and visualization techniques to extract patterns from the dataset. The questions you formulated, as well as inputs from other sources, should drive the initial analysis. Still, keeping an open mind will help you identify interesting insights; it is always possible to find patterns that contradict your expectations.
Look out for relationships between the attributes and how one influences another, as in the sketch below. This helps in shortlisting attributes for the machine learning model. Also, focus on handling attributes that have a lot of noise, including missing values.
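For example, a quick way to inspect such relationships is a correlation heatmap plus a scatter plot. This sketch assumes the Covid master dataset from Project 1, with illustrative column names:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes the master Covid dataset built earlier; column names are illustrative.
master = pd.read_csv("master_dataset.csv")

# Pairwise correlations between numeric attributes highlight candidate relationships.
corr = master[["temperature", "rainfall", "daily_vaccinations", "new_cases"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Attribute correlations")
plt.tight_layout()
plt.show()

# A scatter plot makes it easy to eyeball how one attribute influences another.
master.plot.scatter(x="daily_vaccinations", y="new_cases", alpha=0.5)
plt.show()
```

Remember that correlation only suggests a relationship; confirming it is what your formulated questions are for.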
Build a narrative
Now it is time to pick out the interesting findings and come up with a narrative. A narrative is the linking factor that walks the audience through the findings in a sequence they can best understand. Many important insights will go to waste if they are not packaged into a good narrative. For example, if you are working on a customer churn problem, the narrative could be organized as follows:
- How many customers churn in a month?
- What is the churn rate across the industry?
- What is the general profile of customers?
- Who are the ones churning? Group them based on their profile types
- What is the revenue loss across different profile types?
- Identify the segments of the highest importance
- Eliminate those who churned for a genuine reason that can’t be addressed
- Top 10 reasons for the others to churn
- How could this be fixed? What are the recommendations?
A good narrative helps you communicate the analysis clearly. The success of a data science project lies in the value it provides to the business; if the business team fails to see any actionable insights, the project is considered a failure. So, coming up with a good narrative is as important as performing a thorough analysis.
3. Machine learning algorithms
Now let us learn different machine learning algorithms by using them. I have included datasets and sample learning scripts for different categories of machine learning problems. These will be enough to get a solid grounding in the most commonly used algorithms. The problem types covered here are:
- Supervised learning
- Unsupervised learning
- NLP
- Computer vision problem
- Recommendation system
Supervised learning
When we have a labeled dataset, we use supervised learning. The key categories of supervised learning are regression and classification. I have provided two datasets, one for each.
First, refer to the Kaggle notebooks below to get a better understanding of supervised learning algorithms. These well-documented scripts will help you understand the steps and standards involved in solving supervised learning problems. The first covers a regression problem and the second a classification problem.
The goal of learning by doing is to get as much hands-on experience as possible. Use the above scripts as a reference and solve the datasets below. To make the most of it, spend enough time reading through the Kaggle discussion forums. The forums are a goldmine of information, full of interesting techniques and tips for solving the problems better.
To maximize your learning and your chances of getting a job, follow the steps below:
- Start with analyzing the dataset
- Identify the interesting patterns and insights
- Understand the relationship between the independent variables and the target
- Explore feature engineering
- Try different models for prediction
- Measure the accuracy
- Refine by trying different features, algorithms, and parameter settings
- Upload the code to your Git repository
- Write a blog and/or upload your notebook on Kaggle with details
Regression problem: The dataset attached for this problem is housing prices. It will help you learn about regression problems and the algorithms used to solve them. This particular dataset has more than 75 attributes describing each property, so you will get a feel for feature selection and other typical issues in solving regression problems. A minimal workflow sketch follows below.
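Here is a minimal sketch of such a regression workflow using scikit-learn. It assumes the Kaggle training file is saved as train.csv with a SalePrice target column, and it skips feature engineering for brevity:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Housing price training data; SalePrice assumed to be the target column.
train = pd.read_csv("train.csv")

# Keep it simple: numeric features only, median-impute the gaps.
X = train.select_dtypes("number").drop(columns=["SalePrice"])
X = X.fillna(X.median())
y = train["SalePrice"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate with root mean squared error on the held-out validation split.
preds = model.predict(X_val)
rmse = mean_squared_error(y_val, preds) ** 0.5
print(f"Validation RMSE: {rmse:,.0f}")
```

A random forest is just one reasonable baseline; trying other algorithms and feature sets is exactly the refinement loop described in the list above.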
Classification problem: Classification problems are those where we classify data into classes. The example below is a binary classification problem, where a health insurer wants to predict which of its customers will be interested in vehicle insurance. As with the regression problem, always start by analyzing the dataset: the better you understand the data, the better the prediction results. A matching sketch follows below.
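This classification sketch mirrors the regression one. The file name and the Response target column are assumptions based on the common Kaggle version of this vehicle insurance dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; "Response" assumed to be the 0/1 target column.
df = pd.read_csv("insurance_train.csv")
df = pd.get_dummies(df, drop_first=True)  # one-hot encode the categorical columns

X = df.drop(columns=["Response"])
y = df["Response"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features so logistic regression converges cleanly.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# ROC-AUC is a better score than plain accuracy when classes are imbalanced.
print("ROC-AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```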
While solving these problems, focus on:
- Learning different techniques to analyze the data
- Learn about feature engineering techniques
- Try to understand what algorithm goes well with what kind of data
- Document scripts clearly and make them available on your Git repository
- Write a blog post on your learning – trust me it helps a lot
Unsupervised learning
Unsupervised learning is used to work on datasets that are unlabeled, for example when we want to use customers’ profile information to group them into different categories. The approach to solving an unsupervised learning problem should be similar to supervised learning: always start with the data analysis.
First, let us learn about clustering algorithms using the Mall Customer Segmentation problem. It is about creating different customer clusters based on the information provided. We don’t stop once the clusters are identified; we can analyze further to understand the similarities within a cluster and the dissimilarities between clusters. Below is a sample script with clear documentation on approaching a clustering problem.
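Here is a minimal clustering sketch with scikit-learn’s KMeans, including the elbow method for choosing the number of clusters. The column names follow the common Kaggle version of the Mall Customers file and are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Mall customer data; column names follow the common Kaggle version of the file.
customers = pd.read_csv("Mall_Customers.csv")
X = customers[["Annual Income (k$)", "Spending Score (1-100)"]]
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: inertia drops sharply until roughly the "right" number of clusters.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    print(k, round(km.inertia_, 1))

# Fit the chosen model and attach the cluster label to each customer.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["cluster"] = kmeans.fit_predict(X_scaled)

# Comparing cluster means is a first step toward profiling each segment.
print(customers.groupby("cluster")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())
```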
Now let us scale up and work on sensor data, which will teach you about data produced by IoT devices. While human-readable data like customer profiles is easy to work with and understand, sensor data is usually trickier: it requires much more analysis to extract the insights, which are generally not visible by looking directly at the dataset.
This example will give you a better understanding of clustering problems. While learning, the focus should be on the areas below:
- Understanding different algorithms
- Which algorithm works better on what data?
- Data transformation to suit the requirements of the algorithm
- Visualizations that help in comparing the clusters
NLP
The next area of focus is natural language processing. An increasing amount of data is being generated on social media and other online platforms, and many companies are starting to focus on this data because it contains vital information.
The tweets dataset below will help you get familiar with text data. The issues with text data are quite different from those of structured data, and they need a different set of techniques and approaches. While working on the dataset, focus on the following (a minimal sketch follows the list):
- Techniques and methods for data cleaning
- Eliminating the stop words and others that don’t help
- Handling the noise in the dataset
- Libraries used for extracting the sentiments
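Here is the sketch mentioned above: basic text cleaning with NLTK stop words plus sentiment scoring with the rule-based VADER analyzer. The tweets.csv file and its text column are illustrative assumptions:

```python
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords")
nltk.download("vader_lexicon")

tweets = pd.read_csv("tweets.csv")  # assumed to have a "text" column

STOP_WORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, strip URLs/mentions/punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|[^a-z\s]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

tweets["clean_text"] = tweets["text"].astype(str).apply(clean)

# VADER is a quick rule-based sentiment scorer that works well on short tweets.
sia = SentimentIntensityAnalyzer()
tweets["sentiment"] = tweets["clean_text"].apply(
    lambda t: sia.polarity_scores(t)["compound"]
)
print(tweets[["clean_text", "sentiment"]].head())
```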
If you are new to natural language processing, first refer to the introductory script here. It helps in understanding how to approach and solve an NLP problem. Then use what you learn to work on the dataset below.
Computer vision problem
Recent advancements in processing power have made image recognition practical. Computer vision applications are increasingly used in:
- Health-care
- Security and Surveillance
- Inspection and Predictive Maintenance
- Autonomous Driving
To learn about convolutional neural networks and how they can be applied to computer vision problems, go through the introductory script here. Then look into the image datasets from Kaggle below, which can help you learn about computer vision applications. While working on them, focus on the following (a minimal CNN sketch follows the list):
- Techniques to optimize the image size without losing the information
- Tools and frameworks that help in computer vision
- Augmentation techniques when there isn’t enough image data
- Pre-trained models available for better prediction
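Below is the minimal CNN sketch referenced above, written with TensorFlow/Keras. It assumes images are organized into one folder per class; the image size, layer sizes, and epoch count are arbitrary starting points:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Images assumed to be organized as images/<class_name>/<file>.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images", image_size=(128, 128), batch_size=32
)
num_classes = len(train_ds.class_names)

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    # Light augmentation helps when the image dataset is small.
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
model.fit(train_ds, epochs=5)
```

For better accuracy with limited data, swapping the convolutional layers for a pre-trained backbone is the natural next step.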
There is a slight difference between the two datasets below. The first one is about identifying dog breeds; it is a typical image recognition problem. Solving it will give you first-hand experience of the steps involved in image recognition.
The second dataset is about object detection, where the goal is to correctly identify the objects in an image. It is a collection of satellite images of ships, and the problem is to identify all the ships present in every single picture. This requires a lot of training, as in some cases the ships are really small or blend into the background.
Recommendation system
Recommendation systems are a very interesting technique that is popular in industry. They have helped many organizations improve sales and customer experience. As per this McKinsey industry report, about 35% of sales on Amazon come from its recommendation system, and 75% of what people watch on Netflix comes from its recommendations.
If you want to learn about implementing a recommendation system, check below. It will give you a good perspective on how a recommendation system works.
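To give a flavor of how this works, here is a tiny item-based collaborative filtering sketch: items are considered similar when the same users rate them highly, and unseen items are scored by their similarity to items a user already rated. The ratings here are made up; a real project would load MovieLens-style data:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative ratings table; users, items, and scores are made up.
ratings = pd.DataFrame({
    "user":  ["u1", "u1", "u2", "u2", "u3", "u3"],
    "item":  ["A",  "B",  "A",  "C",  "B",  "C"],
    "score": [5,     3,    4,    5,    4,    2],
})

# Pivot into a user x item matrix; missing ratings become 0 for this simple sketch.
matrix = ratings.pivot(index="user", columns="item", values="score").fillna(0)

# Item-based collaborative filtering: items are similar if the same users like them.
item_sim = pd.DataFrame(
    cosine_similarity(matrix.T), index=matrix.columns, columns=matrix.columns
)

def recommend(user: str, top_n: int = 2) -> pd.Series:
    """Score unseen items by similarity to the items the user already rated."""
    seen = matrix.loc[user]
    weighted = item_sim.mul(seen, axis=0).sum()
    norm = item_sim.mul(seen.gt(0), axis=0).sum() + 1e-9
    scores = weighted / norm
    return scores[seen.eq(0)].sort_values(ascending=False).head(top_n)

print(recommend("u1"))
```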
4. Improving communication skills and showing off
Write a blog and have a git repository
A good way to ensure what you learn stays with you for a long time is to write about it. It also helps in establishing your credibility. The data science space is getting very competitive, and a blog can help you stand out. Make sure at least some of the projects you want to showcase in your resume are available in your Git repository.
Create a portfolio website
Having a portfolio website sends a strong message about your skills. A portfolio website is like an online version of your resume; include all your work and accomplishments. If you are interested in creating a portfolio website for free using GitHub Pages, check below:
How to create a stunning personal portfolio website for free
Create a really good resume
The final step is creating an impressive resume. The knowledge you have gained so far doesn’t mean much without a good resume. There are tools and techniques to come up with an impactful resume; here is an article to help you prepare one for yourself.
Closing comment
These projects are enough to learn the critical skills required of a data scientist. The notebooks provided as references in this article should be used to better understand the concepts, but it is very important that you solve these problems yourself to learn the most. The hands-on experience you gain will boost your confidence and help you perform better in interviews. Knowledge gained by doing is many times greater than knowledge gained by reading or watching tutorials, and it stays in memory far longer.
To stay connected
- If you like this article and are interested in similar ones, follow me on Medium. Subscribe to Medium for access to thousands of similar articles
- I teach and talk about various data science topics on my YouTube Channel. Subscribe to my channel here.
- Sign up to my email list here for more data science tips and to stay connected with my work