Let the code and automation work for you. If you often find yourself writing similar code over and over again, it’s time to consider automating the task or analysis. Maybe you’re kicking off similar-looking analyses to run overnight, or comparing model results across runs with different input parameters, but are you doing these tasks effectively? I commonly look to improve two areas in my daily work: (1) standard code checks using CI/CD, and (2) performing analyses using automated jobs.
Working in data science, I often find myself writing code across many different notebooks that get shared with other data scientists on the team. When my tasks begin to feel repetitive, or others are doing something very similar, I take a step back and look at the process. I check whether a CI/CD pipeline, a script, or an automated job can take on the work, so I am doing more by doing less.
CI/CD Pipelines for Code Checks
CI/CD pipelines are something I use in my workflows all the time. You can design these pipelines to perform everyday tasks and share the results with data scientists, and they are useful for far more than unit testing. You can kick off jobs from them, collect performance metrics when an analytic changes, deploy libraries, and more.
As a developer and data scientist, I look for ways to streamline my team’s processes and make it easier to run those standard checks. When I open a pull request, the pipelines run validations, routine checks, and more to confirm that the code works as expected and produces correct results. These pipelines speed up the review of both the code changes and the analytic changes behind them. By the time I am peer reviewing the analytical concepts, the primary checks have already completed automatically, and what is left for me is to check the code for logical errors, inefficiencies, and documentation.
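To make that concrete, here is a minimal sketch of the kind of check script a pipeline step might call on each pull request. The commands and paths are assumptions; substitute whatever checks your team relies on.

```python
"""Minimal sketch of a check script a pipeline step might call.

Assumes flake8 and pytest are installed; the commands and paths are
placeholders for whatever checks your team actually runs.
"""
import subprocess
import sys

# Each entry is one check the pipeline runs on every pull request.
CHECKS = [
    ["flake8", "src/"],                   # style and lint errors
    ["pytest", "tests/", "--maxfail=1"],  # unit tests, stop at first failure
]


def main() -> int:
    for cmd in CHECKS:
        print(f"Running: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A nonzero exit fails the pipeline step and blocks the merge.
            return result.returncode
    print("All checks passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the script reports through its exit code, any CI system (Azure Pipelines, GitHub Actions, Jenkins) can run it as a step and block the merge when a check fails.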
Questions I commonly ask myself when considering a new pipeline:
- Can the task run in less than 6 hours, which is our pipeline’s time limit?
- Can the result of the pipeline be easily shared during a pull request? How will the outcome be used to validate the code during a pull request?
- Does the task aid in deciding the pull request, or in releasing artifacts after the pull request is merged, such as deploying documentation, a website, or a library?
- Do you have common artifacts that you need to release during or after a pull request or merge? How can you automate those releases using pipelines? (One way to script such a release is sketched just below.)
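For the artifact-release case, a post-merge pipeline step can be a script as small as this sketch. It assumes a Python library released with the `build` and `twine` packages; your artifacts and registry will differ.

```python
"""Sketch of a post-merge release step for a Python library.

Assumes the `build` and `twine` packages are installed and that twine
reads credentials from the TWINE_USERNAME/TWINE_PASSWORD environment
variables; swap in your own artifacts and registry.
"""
import glob
import subprocess

# Build the source distribution and wheel into dist/.
subprocess.run(["python", "-m", "build"], check=True)

# Upload every built artifact; check=True fails the pipeline step
# loudly if the build or upload does not succeed.
artifacts = glob.glob("dist/*")
subprocess.run(["twine", "upload", *artifacts], check=True)
```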
Job Automation for Common Tasks
If I am not creating a pipeline to help automate my tasks, I am looking for efficiencies by creating and automating standard notebooks and jobs. I run many jobs on an hourly, nightly, or weekly basis for my team and only touch the code when I need to make updates or a job has alerted me of a failure. Automating these tasks lets me focus on other work while they run in the background on their own. As you look at your tasks for the week, are there any areas where you can incorporate an automated job?
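The “only hear about failures” part comes from wrapping the job’s work in an alert. Here is a minimal sketch, assuming a hypothetical `run_analysis` task and a placeholder webhook standing in for your team’s alerting channel:

```python
"""Sketch of a scheduled job wrapper that alerts only on failure.

`run_analysis` and ALERT_WEBHOOK are placeholders for your own task
and your team's alerting channel; the schedule itself lives in the
scheduler (cron, Databricks, Airflow, etc.), not in this script.
"""
import json
import traceback
import urllib.request

ALERT_WEBHOOK = "https://example.com/alerts"  # placeholder endpoint


def run_analysis() -> None:
    ...  # placeholder for the real scheduled work


def alert(message: str) -> None:
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    try:
        run_analysis()
    except Exception:
        # You only hear from the job when something breaks.
        alert(f"Scheduled analysis failed:\n{traceback.format_exc()}")
        raise
```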
Some jobs I have created are not run on a schedule at all. These jobs wait either for someone to press a button or for another process to trigger them. The code is in place to run the task whenever I need it, and I can rerun it with a click as required. For example, once a month I need to generate analytic artifacts for another team. I may or may not need the code at other times, so I keep it as a job and click a button to run it when asked. It takes 5 minutes to produce and package the artifacts, versus rewriting the code every time.
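Here is a sketch of what that button-triggered job could look like; the function name, paths, and defaults are all hypothetical stand-ins for the real artifact logic.

```python
"""Sketch of the button-triggered artifact job described above.

`build_artifacts`, the paths, and the --max-files default are all
hypothetical; the point is that inputs arrive as parameters, so a
rerun is a button click plus, optionally, new argument values.
"""
import argparse
import shutil


def build_artifacts(input_path: str, output_dir: str, max_files: int) -> None:
    ...  # placeholder for the real artifact-generation logic


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Generate the monthly analytic artifacts."
    )
    parser.add_argument("--input-path", default="data/monthly/")
    parser.add_argument("--output-dir", default="artifacts/")
    # Fixed at 200 today, but exposed as an input so it can change later.
    parser.add_argument("--max-files", type=int, default=200)
    args = parser.parse_args()

    build_artifacts(args.input_path, args.output_dir, args.max_files)
    # Package everything into one zip to hand off to the other team.
    shutil.make_archive("monthly_artifacts", "zip", args.output_dir)


if __name__ == "__main__":
    main()
```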
If a person isn’t kicking off the job, then another process is. Typically, my team sets these jobs up to be triggered by another job or by a pipeline; when triggered, the job runs and produces its results.
As I am creating jobs, things I tend to think about are:
- Is the task run only occasionally, so that it can live in a notebook, an unscheduled job, or a manual pipeline you trigger when you need to rerun it?
- Is the task run on a schedule? If so, create an automated job on that schedule that takes in the necessary inputs.
- What inputs does the job need, and which might a user want to change? These can be passed into the job when it starts. When choosing inputs, include items you may want to change later even if you are not changing them now. For example, the number of files to read in: you may want to keep it at 200 today but change it later, so it should be an input (like the `--max-files` parameter in the sketch above).
- Is the task something that runs after another action has completed? If so, create a job that starts when the previous item finishes, such as a CI/CD pipeline that kicks off an automated job, for example using Azure Pipelines to trigger jobs in Databricks (see the sketch after this list).
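For that last case, the pipeline step that triggers the Databricks job can be a short API call. Here is a rough sketch using the Databricks Jobs REST API via the `requests` package; the workspace URL, job ID, notebook parameter, and token environment variable are placeholders for your own setup.

```python
"""Rough sketch of a pipeline step triggering a Databricks job.

Uses the Databricks Jobs REST API (POST /api/2.1/jobs/run-now); the
workspace URL, job ID, notebook parameter, and token environment
variable are placeholders for your own setup.
"""
import os

import requests

DATABRICKS_HOST = "https://example.cloud.databricks.com"  # placeholder workspace
JOB_ID = 123  # placeholder: the ID of the job defined in Databricks

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "job_id": JOB_ID,
        # Optional: pass run-specific inputs through to the notebook.
        "notebook_params": {"run_date": "2021-06-01"},
    },
)
response.raise_for_status()
print("Triggered run:", response.json()["run_id"])
```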
Final Thoughts
Let the code and automation work for you. If you find yourself running similar tasks and analyses over and over, determine whether a pipeline or an automated job can take care of the work. The more you automate, the less chance there is for human error to creep into your process.
How have you used pipelines or jobs to aid your processes?
If you would like to read more, check out some of my other articles below!
Top 3 Challenges with Starting out as a Data Scientist
Why You Need a Data Science Mentor in 2021
4 Things I Didn’t Know About Being a Team Lead in Data Science