My first Data Science project was a machine learning model that predicts used car prices. The main steps of the project were:
- Scraping used-car ads on a website
- Cleaning and preprocessing the scraped data
- Exploratory data analysis
- Model creation
- Model evaluation
The end result was quite satisfactory, with an R-squared score of 0.9. It exceeded my expectations, considering it was my first project. I wrote an article about this project if you’d like to read more about it.
Note: This story was originally published on datasciencehowto.com.
Such projects are great for learning and practicing. In that project, I spent lots of time using Pandas, Scikit-Learn, Beautiful Soup, and Seaborn, which are very popular Python libraries for data science.
The project was a great learning experience for me, and it made me feel that I had made the right decision in starting to learn data science.
However, it was also an incomplete project. I did all the steps mentioned above in a Jupyter notebook, and the project was never put into production. Nobody else ever used it as a price prediction service.
This is, I think, the most challenging part of machine learning in real life. It is not very difficult to create a model and make predictions with it. Deploying it into production and operating it as a continuous service is another world.
Building such an ML system requires many different types of operations that need to be synchronized and work in tandem. This process is referred to as machine learning operations, aka MLOps.
You may have heard of DevOps, which is a common practice in building large-scale software systems. MLOps can be considered DevOps for machine learning systems.
The Motivation for MLOps
Although building an ML model in a Jupyter notebook is good for learning, it is a long way from creating any business value. Building an ML system that is continuously operated in production is what really makes machine learning valuable.
Let’s take a moment and think about how my project of predicting used car prices could offer business value. If you have a business of buying and selling used cars, you can use it to find cars that are being sold for less than their market value. You will then be using machine learning to improve your business and increase your profit.
Another valuable product built on this model could be a website. People would use your website to learn the price of their cars according to current market conditions.
Neither of these can be achieved with a machine learning model that was created and trained once in a Jupyter notebook.
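To make the first idea a bit more concrete, here is a minimal sketch of how a trained model could be used to flag listings priced below its estimate. The function name `find_underpriced`, the listing fields `id` and `asking_price`, the `predict_price` callable, and the 15% discount threshold are all assumptions made for illustration; they are not taken from my original project.

```python
def find_underpriced(listings, predict_price, min_discount=0.15):
    """Return (id, asking_price, estimate) for listings priced well below the estimate.

    `predict_price` is any callable that maps a listing dict to an estimated
    market price, for example a thin wrapper around a trained model.
    """
    deals = []
    for listing in listings:
        estimate = predict_price(listing)
        # A listing counts as a deal if it is at least `min_discount` below the estimate.
        if estimate > 0 and listing["asking_price"] <= estimate * (1 - min_discount):
            deals.append((listing["id"], listing["asking_price"], estimate))
    return deals


# Hypothetical usage with a stand-in predictor that always returns 10,000.
listings = [
    {"id": 1, "asking_price": 8000},
    {"id": 2, "asking_price": 9500},
]
deals = find_underpriced(listings, predict_price=lambda listing: 10000.0)
```

In a real system the stand-in predictor would be replaced by the deployed model, and the listings would come from a regularly refreshed data source rather than a hard-coded list.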
Challenges for MLOps
I think we all agree that MLOps is a fundamental requirement for making machine learning a rewarding and beneficial tool. Not surprisingly, it is not an easy task.
The main challenges for MLOps are scalability, version control, and model decay. To overcome these challenges, ML systems are treated as software systems and operated with DevOps principles. It is important to note that machine learning code is a very small part of the entire ML system.
The two main principles of DevOps, continuous integration (CI) and continuous delivery (CD), apply to MLOps as well. In addition to these, a third one comes into play: continuous training (CT).
CI requires you to test and validate the code and components. In a machine learning system, you also need to test and validate the data. Data validation might be based on a schema. For instance, if there is a new or missing column in the input data, the entire system might break. The data validation scripts should catch such issues.
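As a rough illustration of schema-based validation, the sketch below checks each incoming row for missing columns, unexpected columns, and wrong types. The column names and types in `EXPECTED_COLUMNS` are hypothetical, and the sketch uses plain Python dictionaries instead of any particular validation library.

```python
# Hypothetical schema for used-car ads: column name -> expected Python type.
EXPECTED_COLUMNS = {
    "make": str,
    "model": str,
    "year": int,
    "mileage_km": int,
    "price": float,
}


def validate_schema(rows, expected=EXPECTED_COLUMNS):
    """Return a list of schema problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        # Type-check only the columns that are actually present.
        for col, typ in expected.items():
            if col in row and not isinstance(row[col], typ):
                problems.append(
                    f"row {i}: column {col!r} has type {type(row[col]).__name__}"
                )
    return problems
```

A pipeline could call `validate_schema` on every new batch and stop before training whenever the returned list is not empty.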
You also need to check the values themselves and take the necessary action if they are far outside the expected range. In that scenario, it is better to discard the new data and skip model training.
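One simple way to express such a check is to compare the median price of a new batch with a reference median from earlier data. This is only a sketch: the function name, the use of the median, and the factor of 3 used as the tolerance are assumptions chosen for illustration, and a real system would likely check more than one column.

```python
from statistics import median


def should_skip_training(new_prices, reference_median, max_ratio=3.0):
    """Return True if the new batch looks too unusual to train on."""
    if not new_prices:
        return True  # nothing to train on
    if any(price <= 0 for price in new_prices):
        return True  # non-positive prices indicate broken data
    ratio = median(new_prices) / reference_median
    # Skip if the batch median is more than `max_ratio` times higher or lower.
    return ratio > max_ratio or ratio < 1 / max_ratio
```

The median is used here because a handful of extreme ads should not be enough to reject an otherwise normal batch.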
Continuous training is of vital importance for all ML systems. In our case of predicting used car prices, several factors have an impact on car prices, such as market conditions, inflation rates, global trends, and so on. Thus, you cannot rely on a model that was trained a long while ago. To accurately reflect current conditions, you need to retrain the model continuously.
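A very simple retraining trigger could look like the sketch below, which retrains when the model is older than a given number of days and enough new ads have been collected. The 30-day limit, the 500-row minimum, and the function name are hypothetical values picked for the example; real systems often trigger on measured model performance as well.

```python
from datetime import date, timedelta


def needs_retraining(last_trained, today, new_rows, max_age_days=30, min_new_rows=500):
    """Return True if the model is stale and there is enough fresh data."""
    is_stale = (today - last_trained) > timedelta(days=max_age_days)
    has_enough_data = new_rows >= min_new_rows
    return is_stale and has_enough_data
```

A scheduler could run this check every day and start the training pipeline only when it returns True.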
In a typical ML system, many different software tools and libraries are used. Thus, you need to adopt a version control system in your workflow so that you can track which versions were used to produce a given model.
You also need to monitor and evaluate the model outputs. Machine learning is experimental in nature. There is no guarantee that your model will always produce accurate and reliable results.
Last but not least, all of this needs to be scalable and work efficiently. You might have the most accurate model, but if it’s not scalable, the odds are your product or service will fail.
MLOps is the key to turning your models into ML systems. What we have covered in this article is just the highlights of this domain. There is a wide range of tools and packages used in MLOps. I will be writing about these tools and how to use them in practice. Stay tuned for more MLOps.
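As a minimal sketch of such monitoring, the code below compares recent predictions with the prices the cars actually sold for, using the mean absolute percentage error (MAPE). The 15% threshold and the function names are assumptions for the example, not values from my project.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual| over all pairs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return total / len(actual)


def model_is_healthy(actual, predicted, max_mape=0.15):
    """Return True while the recent error stays at or below the threshold."""
    return mean_absolute_percentage_error(actual, predicted) <= max_mape
```

When `model_is_healthy` returns False, the system could raise an alert or kick off retraining instead of silently serving degraded predictions.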
Thank you for reading. Please let me know if you have any feedback.