Two data science hacks to improve your workflow
Useful methods to expand the data science toolkit
Data science is fundamental to Pinpoint’s application. But, like most startups, we are still building out our data science architecture: how we load data, store models and runtime data, execute scripts, and output results. Truthfully, our architecture and setup will never be “complete,” because it should, and will, evolve as we expand and enhance our project portfolio. However, some of the concepts and components we leverage today would be useful in any data science environment. This post covers the libraries, methods, and code patterns that have proven useful in our own architecture at Pinpoint and that can enhance a data scientist’s daily workflow.
Use VSCode’s Jupyter Extension to Streamline Workflow
While Jupyter Notebooks are undoubtedly useful for testing code snippets or writing tutorials and demos, they are not much use in a production environment. Early in my career, I was constantly copying code snippets from my Python scripts into a Jupyter Notebook, executing and debugging the code within the notebook, and then transferring the code back into the scripts. Needless to say, this is not the most efficient process. VSCode’s Jupyter extension lets you define cells in any Python script using the #%% syntax and then execute the block of code within each cell. An interactive window is launched where your output appears and where you can freely write new code, just like a Jupyter Notebook but without all the back and forth. The interactive window opens alongside the editor you are working in, so you never have to leave your environment.
I have found this extension particularly useful in the exploratory analysis stage when starting a new project, for tasks like the following (a short sketch appears after the list):
- Look at data types
- Calculate value counts for various fields
- Create quick visualizations
- Calculate descriptive statistics about the data
- Test different modeling options
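Here is a minimal sketch of what those cells might look like in a plain Python script. The file and column names are placeholders, and it assumes pandas and matplotlib are installed; each #%% marker starts a new cell that VSCode can send to the interactive window.

```python
#%% Load the data (placeholder file name)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sample_data.csv")

#%% Look at data types
df.dtypes

#%% Value counts for a categorical field (placeholder column name)
df["category"].value_counts()

#%% Descriptive statistics
df.describe()

#%% Quick visualization
df["amount"].hist(bins=30)
plt.show()
```

Running any one of these cells sends just that block to the interactive window, so you can iterate on a single step without re-running the whole script.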
In the screenshot above, you can see I’ve defined a few cells in my script on the left and have my interactive window on the right, where the cell blocks are executed and I can write code. This is also the same Python file I used when testing out models for our entity recommendation engine.
Click here for the full reference on using the Jupyter extension within VSCode.
Use a Local Environment to Train and Score Pipelines
We use AWS S3 to store data across our platform, including data from our data science jobs. Embedded within all of our data science projects are methods to compress and write all necessary files to S3 after a job has completed: the input data fed into training and scoring, the models, the model outputs, and the predictions. Having a local pipeline allows us to quickly reproduce and debug errors, compare model metrics, investigate unusual model results, and test different customers’ use cases and behavior. We constantly run into unexpected input data types and values that we did not originally account for.
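The exact upload code is specific to each of our projects, but a minimal sketch of the idea, compressing a job’s artifacts and writing them to S3 with boto3, might look like the following. The bucket name, key prefix, and the joblib/gzip choices here are illustrative assumptions rather than our actual implementation.

```python
import gzip
import json
import boto3
import joblib

s3 = boto3.client("s3")
BUCKET = "my-data-science-artifacts"  # placeholder bucket name

def upload_job_artifacts(job_id, model, predictions, input_df):
    """Compress and write a job's model, predictions, and input data to S3."""
    prefix = f"jobs/{job_id}"

    # Serialize and compress the trained model
    joblib.dump(model, "model.joblib.gz", compress=("gzip", 3))
    s3.upload_file("model.joblib.gz", BUCKET, f"{prefix}/model.joblib.gz")

    # Write the input data as gzipped CSV
    input_df.to_csv("input.csv.gz", index=False, compression="gzip")
    s3.upload_file("input.csv.gz", BUCKET, f"{prefix}/input.csv.gz")

    # Write the predictions as gzipped JSON
    with gzip.open("predictions.json.gz", "wt") as f:
        json.dump(predictions, f)
    s3.upload_file("predictions.json.gz", BUCKET, f"{prefix}/predictions.json.gz")
```

Compressing before upload keeps storage and transfer costs down, and it makes pulling the same artifacts back down for local debugging trivial.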
With local pipelines in place, I am able to quickly download data, run it through the training or scoring pipeline with breakpoints placed throughout the code, and implement any necessary changes. This is extremely useful for tackling the problems and questions that arise on a daily basis. The only difference between the local and production pipelines is the data source: the production pipelines read and write data through our GraphQL API, whereas the local pipelines simply point to a local directory where I have downloaded the data from S3.
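As a rough sketch of that separation, the pipeline code can share a single entry point for loading data and swap the source based on the environment. The function names, the environment variable, and the fetch_from_api helper below are hypothetical, not our actual interfaces.

```python
import os
import pandas as pd

def fetch_from_api(job_id):
    # Hypothetical stand-in for the production GraphQL API client
    raise NotImplementedError("replace with the production API call")

def load_input_data(job_id):
    """Return the pipeline's input data from the production API or a local directory."""
    if os.getenv("PIPELINE_ENV", "local") == "production":
        return fetch_from_api(job_id)
    # Locally, point at files previously downloaded from S3
    local_path = os.path.join("local_data", job_id, "input.csv.gz")
    return pd.read_csv(local_path, compression="gzip")

def run_scoring(job_id):
    df = load_input_data(job_id)
    # ...the rest of the pipeline is identical in both environments
```

Everything downstream of the data-loading step runs identically in both environments, which is what makes stepping through production data locally with breakpoints so painless.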
Throughout our data science codebase you will find many other engineering practices and tools that are not specific to data science, including virtual environments, the debugger console, and logging. We recently added Loki log aggregation to our infrastructure, making it easy to search through errors and trace the phases of a pipeline run.
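Loki aggregates whatever the pipelines emit, so the main requirement on the Python side is consistent, descriptive log messages. A generic sketch, using only the standard library and placeholder names, might look like this:

```python
import logging

# A consistent format makes it easy to filter aggregated logs by level, pipeline, and phase
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("scoring_pipeline")  # placeholder logger name

logger.info("phase=load_data status=started")
try:
    logger.info("phase=scoring status=started")
    # ... run the scoring step ...
    logger.info("phase=scoring status=finished")
except Exception:
    # logger.exception includes the full traceback in the aggregated logs
    logger.exception("phase=scoring status=failed")
```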
Conclusion
Overall, there are countless ways to build out your data science infrastructure, with a plethora of databases, cloud services, and modeling techniques to support the machine learning process. There are even more tools, not specific to data science, that complement data science development. We’ve presented a couple of our favorites here and hope they can help alleviate some of the daily challenges of getting to a production-ready state.