Projects are a great way to learn Data Science. They provide a meaningful, self-guided way to improve your skills and possibly solve real problems for you and others. Projects are also a great way to showcase your skills in a portfolio. While some small projects can be done in a day or two to get familiar with certain skills, libraries or topics, other projects are a little larger and require more upfront planning.
I have made the mistake of just starting a project and then losing track of all the little tasks I wanted to accomplish and ideas for solutions I had. In the end I created a lot of extra work for myself by not creating a plan beforehand. To save you the hassle I want to provide three frameworks you can use to plan your project and guide yourself from start to finish. The steps presented for each framework are of course not set in stone and you should adapt them to your specific situation. Some frameworks are also more specific than others, so just pick what best suits you!
CRISP-DM – Cross-Industry Standard Process for Data Mining
The CRISP-DM is a process model created in 1996 with funding from the EU [1]. It is one of the most widely used analytics process models. In 2015 IBM refined the approach and created ASUM-DM, but for our purposes CRISP-DM provides a good framework. The process model contains six steps, with some optional iterative loops between them. The steps are:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

Let’s go through them and their components.
**Business Understanding:**
1. Gather background information on your problem, domain etc., define your "business" or project objective and your success criteria. Remember, you are trying to solve a problem, not reach a certain accuracy score!
2. Assess the situation: Which resources can you use? Do you have any assumptions or constraints? Are there any potential risks? Have you considered ethical questions?
3. Determine the goals of your Data Science efforts; really understand what you are trying to achieve through your analysis or modeling.
4. Produce a plan (at least an outline).
**Data Understanding:**
1. Collect (initial) data: This could be existing data from Kaggle, Google, some database or data you already have. It could also be data you need to acquire through web scraping or other data collection methods.
2. Explore and describe your data: Understand the size, kind and complexity of your data (this may help you avoid costly errors in your analysis or modeling). Explore the data through EDA (understand distributions, summary statistics etc.); a minimal sketch follows below.
3. Check the quality of your data: Are there any missing values, wrong labels or inconsistencies? Is the metadata correct and useful?
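To make steps 2 and 3 concrete, here is a minimal first-pass EDA sketch, assuming a pandas-based workflow; the file name `data.csv` is a placeholder for your own data source:

```python
import pandas as pd

# Load the raw data ("data.csv" is a placeholder path)
df = pd.read_csv("data.csv")

# Size and kind: rows, columns, dtypes and memory usage
print(df.shape)
df.info()

# Distributions and summary statistics for numeric columns
print(df.describe())

# Quality checks: missing values per column and duplicate rows
print(df.isna().sum())
print(f"Duplicate rows: {df.duplicated().sum()}")
```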
The next two steps are not always linear and may involve many iterations of going back and forth between them.
**Data Preparation:**
1. Select (the right) data: Select your features and possibly split the data into train, validation and test sets.
2. Clean the data: Fill or drop missing values, correct data inconsistencies etc.
3. Extend the data: Extend your features (by computing new features) or your training examples (e.g. through data augmentation).
4. Format the data: Transform your data to make it suitable for the coming analysis or machine learning techniques (scaling, normalization, encoding etc.). A sketch of these steps follows below.
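As a rough sketch of these steps using scikit-learn (the file path and the column name `target` are placeholders, and the imputation and scaling choices are just illustrative, not recommendations):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("data.csv")  # placeholder path

# 1. Select data: features vs. target ("target" is a placeholder name)
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2.-4. Clean and format: impute missings, scale numerics, encode categoricals
numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

# Fit on the training data only, to avoid leaking test information
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
```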
**Modeling:**
1. Select modeling techniques: These could be traditional statistical models or more advanced machine/deep learning techniques. Select one or more models/algorithms based on your data, goals and constraints, and define one or more metrics.
2. Build/create the model: Specify the model(s) and hyperparameters, train it, and validate it (on a separate validation set if possible).
3. Assess the model: Assess the model(s)' performance based on your chosen metrics; see the sketch below.
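Continuing the sketch above, building and assessing a model might look like this (the random forest and the accuracy metric are illustrative choices, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 1. Select a technique and a metric
model = RandomForestClassifier(n_estimators=200, random_state=42)

# 2.-3. Build and assess: cross-validate on the training data so the
# test set stays untouched until the final evaluation
scores = cross_val_score(model, X_train_prepared, y_train,
                         cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Refit on the full training set once the setup looks reasonable
model.fit(X_train_prepared, y_train)
```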
**Evaluation:**
1. Evaluate the results: Are there clear, presentable results or any novel findings? Can the model or findings be used to fulfill your business goal?
2. Review the process: Did anything go wrong? Can it be quickly fixed? Are there any alternative solutions that could be explored?
3. Determine the next steps: If your business goal is not met by the results, you may have to go back and start over with a different approach; if your results are satisfactory, you can move on to deployment. A small check against a pre-defined success criterion is sketched below.
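The evaluation can be as simple as comparing your test metric against the success criterion you defined during Business Understanding. A minimal sketch, continuing from above (the threshold value is a placeholder):

```python
from sklearn.metrics import accuracy_score

# Placeholder success criterion from the Business Understanding phase
SUCCESS_THRESHOLD = 0.85

test_accuracy = accuracy_score(y_test, model.predict(X_test_prepared))
print(f"Test accuracy: {test_accuracy:.3f}")

if test_accuracy >= SUCCESS_THRESHOLD:
    print("Business goal met, move towards deployment.")
else:
    print("Goal not met, revisit data preparation or modeling.")
```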
**Deployment:**
1. Summarize your process and findings: Document what you have done, what your process was and how you interpret the results.
2. Create a deployment plan that identifies possible problems down the road.
3. Plan monitoring and maintenance of your solution; a small persistence sketch follows below.
4. Conduct a final review and document your process.
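One small but practical piece of deployment is persisting your fitted preprocessing and model together, so the deployed solution reproduces the training-time transformations exactly. A sketch using joblib (the file name is a placeholder):

```python
import joblib

# Persist the fitted preprocessing and model together
joblib.dump({"preprocess": preprocess, "model": model}, "project_model.joblib")

# Later, e.g. inside a scheduled job or a web service:
artifacts = joblib.load("project_model.joblib")
new_data = X_test  # stand-in for freshly arriving data
predictions = artifacts["model"].predict(
    artifacts["preprocess"].transform(new_data)
)
```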
Project Management Approach
You don’t need a fully-fledged project plan or a Gantt chart here, don’t worry. But the general approach that project management takes can of course be applied to your Data Science project. The PM approach can generally be divided into the following steps:
Initiation – Planning – Execution – Monitoring & Controlling – Closure
Initiation is the first phase, yet a critical one. This is where you determine your project goals and your available resources (data, packages, time). Once you have done this, evaluate whether your project is feasible. If it is: great! If not, you don’t have to abandon it permanently; maybe you are just missing some resources or time right now, and your idea will become feasible later.
During the Planning phase you determine which exact steps you are going to take to accomplish your goals. Think about what you are going to do with the data, which models you might use etc., and if you can, try to come up with a timeline or milestones you want to reach by certain dates. This will certainly help you stay on track. If you work with a partner or a team, you can use your plan to assign tasks.
After you have planned everything, it’s time to Execute. Work on the tasks you’ve set yourself and try to reach your milestones. During this phase it’s important to Monitor & Control what you are doing: Is your work still in line with your initial goal? Did everything go as expected? If not, what is different from what you expected, and what went wrong? Try to understand which adjustments may be necessary to reach your goals.
In the end it’s time to close your project. But Closure means more than just saving your work and pushing to your GitHub repo one last time. Document your work, evaluate how everything went and check if you really accomplished everything as planned. Capture your Lessons Learned for yourself and for a potential write-up. Recycle things you can use again like functions you’ve written or workflows you’ve established. And finally: Celebrate a little!
Drivetrain Approach
This approach was established by Jeremy Howard (of fast.ai), Margit Zwemer and Mike Loukides [2]. It was specifically designed to produce data products that achieve a certain goal. I think it’s a more "high-level" approach than the previous two, as its steps are less detailed and defined in broader terms. It features four steps that aim to use data to produce actionable outcomes.
The first step is to define a clear objective or goal you want to accomplish. Once you have a clear objective, you can start to figure out which levers you can pull to achieve it. Think about which actions you can take to improve existing solutions to a given problem. Then think about the data you need to collect so that those actions can achieve the objective. Once you’ve completed these steps, you can begin to think about which models you can build and use to combine your data and levers to produce the desired outcome.
I hope this overview gave you some ideas on how you can structure your Data Science Projects for yourself or your portfolio. Have fun planning and most importantly executing your projects!
-Merlin
References:
[1] https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
[2] https://www.oreilly.com/radar/drivetrain-approach-data-products/