A framework for project success

Introduction
Data Science is still a booming field, with demand continuing to outstrip supply and many businesses expecting to increase their IT spend drastically over the next few years.
Although there has been a sharp rise in online courses, bootcamps and degrees, and with them an influx of junior talent, it is still a great time to get into Data Science.
There are some amazing resources out there for project ideas, but many of them have been done by most new Data Scientists. Pretty much everyone has done a Twitter sentiment analysis project (myself included!), looked at the Titanic dataset or done image classification of flowers. However, this is for good reason; these projects are a good way to show off skills and the solutions have been well documented.
In this article, I propose a framework for a good Data Science project, provide five project ideas and show how they fit within this framework.
Table of Contents
- The Anatomy of a Great Project
- A Note on Deployment
- Web Scraping + Regression
- Imbalanced Classification
- Time Series Forecasting
- Image Classification
- Natural Language Processing
The Anatomy of a Great Project
4 Components
There are 4 components to any Data Science project:
- Data collection and cleaning: The process of getting a dataset, resolving data issues and missing values, and making transformations to allow visualisation and modelling.
- Exploratory Data Analysis (EDA): Learning about the dataset, typically by visualisation. Understanding distributions, relationships between variables and data ranges.
- Modelling: The process of building a Machine Learning model to answer a problem statement.
- Deployment: Making the model available for other people or systems to interact with.
Don’t Ignore Learning
With each of these steps, we are learning different skills or tools. It is important to recognise that some projects may require much more cleaning than others, which means we are using more skills in this area. Not every project is about model building.
Data Scientists do not spend a huge portion of their time training models and some of your projects should reflect this.
While planning a project, we should always plan the skills and tools we will use and learn throughout the project.
The Issue With Most Data Science Projects
Deployment.
Most tutorials, videos and online courses do not cover outputs. When building a portfolio, it is essential that your hard work is understood by your audience. Since we can't guarantee that a potential employer is going to look through your code, you need digestible outputs from your work that aren't just code in a GitHub repo.
It also helps to tailor your outputs to the time spent during the project. If you spend 20 hours scraping and cleaning some data, why not upload it to Kaggle or write a tutorial?
We should always try to extract value from our work by creating human-digestible outputs that can be accessed by a range of audiences across a variety of platforms.
Bringing It All Together
I’ve summed this up in a framework, which I’ll modify for each project idea. This isn’t an exhaustive list of skills and outputs, but hopefully it helps to generate some ideas.

A Note on Deployment
Before we get into each project, I'd like to talk briefly about deployment. In a work environment, you'd typically deploy your model as a microservice on a cloud platform: AWS, GCP or Microsoft Azure. We don't really have this luxury at home, as there is a cost per second associated with these services. They do have free tiers, so I'd suggest looking into these if you want a more deployment-focused project.
As such, I'm going to suggest the same process for all of these projects. I am a huge advocate for deployment using Heroku and Streamlit. If you'd like to show some extra skills, it helps to build a Docker container so anyone can run the app locally (and you could theoretically push this container up to a cloud platform).
I'd suggest this tutorial, leaving out the Docker part, and pushing to Heroku as a single web app.
Web Scraping + Regression

Problem Statement
Build a web scraping script using Python to scrape data from a website. The best candidates for scraping are typically sites with listings such as homes, vehicles or furniture. The data you get from these lends itself to a regression problem because we’d be predicting prices.
Dataset
This is the key part of this project and most of your time will be spent building the scraper. Scraped data is often messy, so even more time might be spent cleaning. For a first-time project, I'd suggest using the BeautifulSoup package, but if you've done this before, try Scrapy or Selenium. There are many good resources online.
Beautiful Soup: Build a Web Scraper With Python – Real Python
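As a starting point, here's a minimal sketch of the scraping loop with BeautifulSoup. The URL and CSS classes are hypothetical placeholders; you'd swap in the real selectors for whichever site you target (and check its robots.txt and terms first).

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical listings site and CSS classes -- replace with your target.
BASE_URL = "https://example-listings.com/houses?page={}"
HEADERS = {"User-Agent": "Mozilla/5.0 (personal data science project)"}

listings = []
for page in range(1, 11):  # scrape the first 10 result pages
    response = requests.get(BASE_URL.format(page), headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Each listing card holds the price and features we want to model.
    for card in soup.find_all("div", class_="listing-card"):
        listings.append({
            "price": card.find("span", class_="price").get_text(strip=True),
            "beds": card.find("span", class_="beds").get_text(strip=True),
            "sqft": card.find("span", class_="sqft").get_text(strip=True),
        })

    time.sleep(1)  # be polite to the server

print(f"Scraped {len(listings)} listings")
```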
EDA
This is also going to be a large part of the project. Use Plotly to make plots that are aesthetically pleasing and combine with something like Datapane if you’d like to show the plots on a Medium blog post.
Try to tell a story here e.g. what can we understand about house prices using charts? Do they increase with square footage? Is a map helpful?
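As a rough sketch, a couple of Plotly Express calls go a long way here. I'm assuming a cleaned DataFrame with hypothetical price, sqft and beds columns (the trendline option needs statsmodels installed):

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("listings_clean.csv")  # hypothetical cleaned scrape output

# Does price increase with square footage? Colour by bedrooms for a third dimension.
fig = px.scatter(df, x="sqft", y="price", color="beds",
                 trendline="ols", title="Price vs. square footage")
fig.show()

# Price distribution -- useful for spotting skew before modelling.
px.histogram(df, x="price", nbins=50, title="Price distribution").show()
```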
Here’s an example of how I used some scraped data and EDA to write a blog post:
The Must Learn Technical Skills in 2021 for Data Scientists and Analysts
Modelling
In this project, modelling is more of a box-tick exercise. You're able to extract plenty of value from the dataset and the EDA, and by the time you get to modelling, you might feel tired of the project.
On the other hand, this is a good opportunity to learn: try a model you haven't used before, especially models like XGBoost, LightGBM or CatBoost, and gain a basic understanding of what they do.
If you want to push it up a notch, here’s a great Kaggle notebook on stacked regressions.
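A minimal sketch of the modelling step with XGBoost, assuming the same hypothetical cleaned listings file with numeric features:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("listings_clean.csv")  # hypothetical cleaned scrape output
X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Gradient-boosted trees handle tabular data well with little preprocessing.
model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
model.fit(X_train, y_train)

print(f"MAE: {mean_absolute_error(y_test, model.predict(X_test)):,.0f}")
```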
Deployment
This should be fairly straightforward using the tools I mentioned above. This is a good project to use SHAP for model interpretability and add some extra design if you have the CSS skills.
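As a sketch of what that app might look like with Streamlit and SHAP (the model file and feature names are hypothetical; run it with streamlit run app.py):

```python
# app.py
import joblib
import pandas as pd
import shap
import streamlit as st

model = joblib.load("house_price_model.pkl")  # hypothetical saved model

st.title("House Price Predictor")
sqft = st.number_input("Square footage", min_value=100, max_value=10000, value=1500)
beds = st.slider("Bedrooms", 1, 8, 3)

features = pd.DataFrame([{"sqft": sqft, "beds": beds}])
st.metric("Predicted price", f"${model.predict(features)[0]:,.0f}")

# SHAP shows which inputs pushed this prediction up or down.
explainer = shap.TreeExplainer(model)
contributions = explainer.shap_values(features)[0]
st.bar_chart(pd.Series(contributions, index=features.columns))
```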
Imbalanced Classification

Problem Statement
Classification problems are the projects that I personally find most enjoyable; however, class imbalance can be a huge spanner in the works. Class imbalance is where the outcome we are trying to predict is very rare. Typically, our model will just predict the majority class and underpredict the rare class. This has parallels with business problems such as fraud detection and medical diagnosis.
Dataset
I'd advise getting a dataset from Kaggle or an API. There are many fraud-detection-type datasets out there which will be highly imbalanced, though you could always go for something a little less imbalanced. If this is your first time with this type of project, I'd try to find a binary classification problem.
EDA
Use this to highlight the class imbalance; depending on your dataset, there might not be much you can find with charts alone.
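That said, a simple class-balance chart makes the problem concrete. A minimal sketch, assuming a hypothetical fraud dataset with an is_fraud label:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("fraud.csv")  # hypothetical imbalanced dataset

counts = df["is_fraud"].value_counts()
print(counts / len(df))  # the imbalance as proportions

px.bar(counts, title="Class balance: legitimate vs. fraudulent").show()
```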
Modelling
This will certainly take most of your time and will be where the class imbalance problem is solved. Over/under-sampling are common but there are many other methods worth exploring. You may also have to tune hyperparameters extensively.
You can explore the methods in this article:
8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset – Machine Learning Mastery
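As a minimal sketch of one of those tactics, here's over-sampling the minority class with SMOTE from imbalanced-learn, assuming the same hypothetical fraud dataset:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("fraud.csv")  # hypothetical imbalanced dataset
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Over-sample the minority class in the training set only -- never the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# class_weight="balanced" on the model is an alternative tactic worth trying.
model = RandomForestClassifier(random_state=42)
model.fit(X_res, y_res)

# Accuracy is misleading here; check precision and recall on the rare class.
print(classification_report(y_test, model.predict(X_test)))
```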
Deployment
Regression problems are typically more relatable (house prices, car prices) so you might want to try and come up with a more inventive way of deploying this model with more visualisations, if possible.
Time Series Forecasting

Problem Statement
Can we predict future trends over time, given information we already have? Time series forecasting is particularly popular in finance for predicting stock prices, and in retail for predicting sales or order volumes based on seasonal trends. There are some Data Scientist roles that are highly time-series focused, so this type of project could be a focus for you if you plan to go into one of these.
Dataset
Stock price prediction has been done to death and isn't easy. Unless you have a very particular interest in stocks, derivatives or crypto, or something specific you want to investigate (e.g. the effect of news on stock prices), I'd steer clear of a simple stock price forecasting problem.
There are many good datasets available online, so just pick something you enjoy! Avoid multivariate problems if it's your first time.
Another great option is to talk to any friends or family that own small businesses, or even local shops. For example, if your friend owns a taco truck, you could forecast how many onions they should buy. You'd have to get permission to share the data, but this would make a great case study and provide them with benefits. Ideally, you'd want 2 years of data.
EDA
Time series data makes for great visualisations. Data Analyst roles often use shop and product data for Tableau/Power BI tests, so this type of project is a great candidate for a Tableau dashboard to showcase some extra skills.
Modelling
ARIMA and SARIMA models are the most basic examples for time series. I'd also advise researching something more state-of-the-art like Facebook's Prophet algorithm.
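A minimal SARIMA sketch with statsmodels, assuming a hypothetical monthly sales series; the orders below are illustrative and should be tuned (e.g. by comparing AIC):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical monthly series with a DatetimeIndex.
sales = pd.read_csv("monthly_sales.csv", index_col="date", parse_dates=True)["sales"]

train, test = sales[:-12], sales[-12:]  # hold out the final year

# order=(p, d, q); seasonal_order=(P, D, Q, s) with s=12 for yearly seasonality.
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)

forecast = fit.forecast(steps=12)
print(forecast.head())
```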
Image Classification

Problem Statement
Can we train a model to identify what an image is? This is a great project because there are a lot of resources out there. You have the option to use traditional models but Convolutional Neural Networks drastically increase performance.
Dataset
The great thing about this project is you can get a dataset very easily. There are a lot online but if you want to gather your own you can use the Bing API or a Google Search downloader extension to get a dataset with relatively accurate labels.
EDA
This is fairly limited as there isn’t much to plot.
Modelling
If you’d like to start with non-Deep Learning methods, I’d recommend this tutorial using dimensionality reduction and sklearn models.
To get started with Deep Learning, the official Tensorflow tutorial is really helpful. Make sure you also experiment with pretrained models.
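A sketch of the pretrained route with Keras transfer learning, assuming a hypothetical images/<class_name>/ folder of labelled photos:

```python
import tensorflow as tf

# Hypothetical directory layout: images/<class_name>/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images", image_size=(224, 224), batch_size=32)
num_classes = len(train_ds.class_names)

# Freeze an ImageNet-pretrained backbone and train a small head on top.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```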
To really take this up a level, I’d recommend this fantastic tutorial on Image Caption generation using the Flickr dataset.
Deployment
Deployment here is a bit trickier because the model may be very large and you’ll need to build a way for the user to upload their image.
It's also worth noting that interpretability can be a problem in these types of models, and using something like SHAP could add a lot of value to your project.
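The upload part, at least, is only a few lines in Streamlit; the model call below is a hypothetical placeholder:

```python
import streamlit as st
from PIL import Image

st.title("Image Classifier")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Your upload")
    # label = predict(image)  # hypothetical call into your trained model
    # st.write(f"Predicted class: {label}")
```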
Natural Language Processing

Problem Statement
NLP is currently the most hyped area of Deep Learning, for good reason. The potential impact is massive and we are seeing real applications in many industries. It also allows businesses to understand their customers at scale.
First, consider which NLP task you want to tackle. NLP is more than just sentiment analysis, and understanding the terminology really helps when Googling code and tutorials. A good place to start is looking at benchmarks to understand the different task names, e.g. http://nlpprogress.com/ or https://paperswithcode.com/area/natural-language-processing
Dataset
Once you have an idea of the problem, it's time to think of a dataset. Many people go to Twitter for this due to its fairly intuitive API and the sheer volume of data. Otherwise, this is a really nice resource for NLP datasets.
My recommendation here would either be to follow your interests or try to extract insights about a company you want to work for. For example, if Microsoft release a new product, can you capture the sentiment or cluster topics using tweets?
NLP also has great potential for building something that people want to use. Text summarisation and autocomplete are really useful, or you could go down the fun route: here's a really fun example on Reddit using GPT-2.
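If you do go the Twitter route, a minimal pull with Tweepy's v2 client might look like this; the query is illustrative and you'd need your own bearer token:

```python
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Recent English tweets about a product, excluding retweets.
response = client.search_recent_tweets(
    query="microsoft surface -is:retweet lang:en",
    tweet_fields=["created_at"],
    max_results=100,
)

tweets = [(tweet.created_at, tweet.text) for tweet in response.data]
print(f"Collected {len(tweets)} tweets")
```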
EDA
Once you've got a dataset, you can do some really interesting EDA. My personal favourite is doing dimensionality reduction and clustering to visualise how models might understand the data. You can see my examples in the BERT article linked under Learn More below.
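A rough sketch of that approach with TF-IDF, PCA and k-means (swap TF-IDF for transformer embeddings for richer clusters; the file and column names are hypothetical):

```python
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

texts = pd.read_csv("tweets.csv")["text"]  # hypothetical text column

# Vectorise, cluster, then squash to 2D to see how a model might group the data.
vectors = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(texts)
clusters = KMeans(n_clusters=5, random_state=42).fit_predict(vectors)
coords = PCA(n_components=2).fit_transform(vectors.toarray())

px.scatter(x=coords[:, 0], y=coords[:, 1], color=clusters.astype(str),
           title="Documents in 2D, coloured by cluster").show()
```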
Modelling
Pretrained models are the best option here. I'd avoid using Bag of Words + sklearn models if you've done it already. Explore the Hugging Face library, which has a huge range of state-of-the-art models. I'd also advise not getting too bogged down in how every model works (although understanding the Transformer architecture is important!).
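Getting a state-of-the-art model running is only a few lines with the pipeline API (the first call downloads a default pretrained sentiment model):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

print(classifier("I can't believe how easy this library makes NLP."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```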
Deployment
Depending on your project, there are some fun solutions here. If you go for something simple and use HuggingFace, I’d suggest also using SHAP to interpret the model.
However, if you’ve built something with a real use case, you might want to explore more complex deployment methods using AWS.
Conclusion

Different projects require different skills and take more time in different areas.
A data science project won't always yield an amazing model. It's important that your effort isn't wasted by dumping code in a GitHub repository that no one will ever see. If you've put a lot of effort into something, think about how you can extract value from it and show the world. Chances are, if you've struggled to do something, someone else has too.
Learn More
Upgrade Your Beginner NLP Project with BERT
The Must Learn Technical Skills in 2021 for Data Scientists and Analysts