Getting Started
I work as a software engineer and data scientist, and my code lives in both notebooks and software packages. The hardest lesson to learn when coding is "stop and think before writing code." It can feel great to jump straight into coding, but you may miss the bigger picture or a vital aspect of the project because you are working from your current assumptions. Software engineering and data science bring an exciting mix of projects: data cleaning, automation, DevOps, analytic development, tools, and more. Some projects cross over several of these areas. I have found it valuable to step back from the initial work and think about the problem at hand first. With that, I would like to introduce 30 common questions I consider before tackling a new project.
Use Cases: Past, Present, and Future
The first area I like to consider is use cases. When I approach a project, I may know only the original use case, but more often surface after talking with other stakeholders, customers, or teammates.
1. Sit down and think about your current use case. How would you design the code for this use case?
2. If there are past use cases you can examine, would you design the code differently to accommodate those?
3. Are there potential future use cases that may differ from your past and present cases? How would these change the way you develop your code?
4. Once you have a code structure in mind, sit down and discuss it with one or more other developers. Would they approach the problem differently? Why?
5. As you develop your code, consider how it can expand in the future. Would that be an easy feat, or will it be hard to accomplish? How can you make it more reusable or repeatable?
Data Acquisition and Cleaning
After understanding the possible use cases for the project, the next area to consider is data acquisition and cleaning. Depending on the data you are using, you may need to consider how you will ingest it, clean it, and utilize it in your work.
6. What data is required for this project to be successful?
7. Do you need more than one dataset that requires some aggregation, or will you utilize one dataset? Do you need to get this data yourself? If so, how will you get this data?
8. Will your code handle the I/O in your Python package or within a separate notebook that runs the code?
9. Will you interface with another team that already gives you access through a database or API?
10. Are there any processes you need to develop around acquiring or cleaning this data that will make the work easier?
11. What format is your data? Does this format matter to your code?
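One way to think about questions 8 and 11 is to keep the code that reads data apart from the code that cleans it, so the cleaning logic does not care whether the rows came from a file, a notebook cell, or an API. The sketch below is only an illustration under that assumption: the function names `read_rows` and `clean_rows`, the column names, and the sample data are all hypothetical.

```python
import csv
import io


def read_rows(source):
    """I/O layer: parse CSV text from any file-like object into dicts."""
    return list(csv.DictReader(source))


def clean_rows(rows, required):
    """Cleaning layer: strip whitespace and drop rows missing a required field."""
    cleaned = []
    for row in rows:
        stripped = {key: (value or "").strip() for key, value in row.items()}
        if all(stripped.get(name) for name in required):
            cleaned.append(stripped)
    return cleaned


# Hypothetical sample data standing in for a real file or API response.
sample = io.StringIO("id,city\n1, Boston \n2,\n3,Denver\n")
rows = clean_rows(read_rows(sample), required=["id", "city"])
print(rows)
```

Because `clean_rows` only sees a list of dictionaries, the same function could be called from a package or from a notebook, and swapping CSV for another format would only touch `read_rows`.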
Automation, Testing, and CI/CD Pipelines
After understanding the use cases of the project and what data I need, the next questions I want to answer are about automating processes. I enjoy automating what I can because it lets me focus on other work. Consider automating portions of your code with quick scripts you can run as needed, jobs that run on a schedule, or a CI/CD pipeline that performs everyday tasks like deployments.
12. Should the output of the work get created regularly? Will you ever need to repeat your analysis or output generation?
13. Does the output get used in a nightly job, a dashboard, or a frequent report?
14. How often should the output be produced? Daily? Weekly? Monthly?
15. Would anyone else need to reproduce your results?
16. Now that you know your code structure, does it live in a notebook, standalone script, or within a Python package?
17. Does your code need to be unit tested for stability?
18. If you need unit testing for your Python package, will you set up a CI/CD pipeline to run automated testing of the code?
19. How can a CI/CD pipeline help ensure your work is stable and behaving as expected?
20. If you are creating an analytic, do you need to develop metrics around it to prove it produces the expected results?
21. Can any aspect of this work be automated?
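To make questions 17 through 19 more concrete, here is a minimal sketch of the kind of unit test a CI/CD pipeline could run on every change. It uses Python's built-in `unittest` module; the `normalize` function is a hypothetical stand-in for a step in your own analytic, not something from a particular project.

```python
import unittest


def normalize(values):
    """Scale a list of numbers to the range 0..1 (hypothetical analytic step)."""
    low, high = min(values), max(values)
    if low == high:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


class NormalizeTests(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])
```

Saved in a test file, these tests can be run locally or by a pipeline with `python -m unittest`. The second test is the sort of edge case that tends to come up only after the code is in regular use, which is one reason to automate the check.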
Reusability and Readability
The next set of questions focuses on the reusability and readability of the code. Can someone new come onto your team, pick up your code, learn it, and use it quickly?
I enjoy writing documentation, so my code is usually well commented and comes with examples of how to run it. Adding documentation, examples, and tutorials is immensely helpful when bringing on new developers, as they can onboard quickly and feel they are contributing early. No one enjoys spending a long time trying to understand things before they feel they are contributing to the team.
In terms of reusability, this is where your use cases come in handy. How can your code serve your own use cases while staying general enough for others to pick up where you left off and apply it to theirs?
22. You have looked at your use cases. Can you standardize the classes or methods to fit the code's expansion?
23. Is it possible to create a standardized library for your work?
24. Can you expand your work to provide a utility or tool to others related to this work?
25. As you write code, is it clear to others what the code is doing?
26. Are you providing enough documentation to quickly onboard a new data scientist or software developer?
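As a small illustration of questions 25 and 26, a docstring that carries a runnable example gives a new teammate both the explanation and a way to check it. The `moving_average` function below is hypothetical and only meant as a sketch of the idea.

```python
def moving_average(values, window):
    """Return the simple moving average of ``values``.

    Each output element is the mean of ``window`` consecutive inputs.

    Example:
        >>> moving_average([1, 2, 3, 4], window=2)
        [1.5, 2.5, 3.5]
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

If this lives in a file, `python -m doctest` followed by the file name will run the example in the docstring, so the documentation is checked along with the code.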
Code Reviews
Lastly, there are code reviews. It may seem odd to think about code reviews before you start coding, but let’s take a step back. Are there common questions you get during your reviews, or discussions that come up frequently? Think about those cases and use them as you develop your code. Learn from past comments on code implementation and keep developing your skills.
27. Do you need to have a code review for your project?
28. If so, what comments do you anticipate getting on your work?
29. Who can you meet with before the code review to discuss architecture or design decisions?
30. Which comments are you required to address, and which can you set aside for later investigation?
Summary
Collectively, it may sound like a lot, but these 30 questions are a great way to think about how you will design your next data science engineering project. When you look back, what things do you consider before starting a new project? I look at five areas:
- Determine what use cases exist for this project work.
- Understand the data acquisition and cleaning that will be needed.
- Look into the possibilities of adding automation, testing, and CI/CD pipelines to your processes.
- Work to make your code readable and reusable so that new team members can onboard quickly and the code is easy to expand at a later stage.
- Use code reviews to your advantage and learn from them.