From raw data to web app deployment with ATOM and Streamlit
Introduction
In this article we will show you how to create a simple web app that helps a data scientist quickly perform a basic analysis of the performance of predictive models on a provided dataset. The user will be able to upload their own dataset (as a .csv file) and tweak the machine learning pipeline in two ways: selecting which data cleaning steps to apply to the raw dataset, and choosing the models to train and evaluate. And we will do all of this in just 50 lines of code! How? By using the right libraries.
- We will use ATOM for data processing and model training. ATOM is a library designed for fast exploration of machine learning pipelines. Read this story if you want a gentle introduction to the package.
- We will use Streamlit to create the web app. Streamlit is a popular library to make beautiful data apps in a matter of minutes.
Build the web app
Set up
Start by making the necessary imports and setting up Streamlit's configuration. We select a wide layout to have space to display two plots next to each other.
import pandas as pd
import streamlit as st
from atom import ATOMClassifier

# Expand the web app across the whole screen
st.set_page_config(layout="wide")
Pipeline menu
The idea is to be able to modify the machine learning pipeline from a menu located on a sidebar. The menu will consist of checkboxes that will decide which elements (data cleaning steps or models) are added to the pipeline, like a recipe where you can choose your own ingredients.
Adding objects to a sidebar in Streamlit is done with st.sidebar. The data cleaning steps that we are going to implement are: feature scaling, encoding of categorical features, and imputation of missing values.
st.sidebar.title("Pipeline")
# Data cleaning options
st.sidebar.subheader("Data cleaning")
scale = st.sidebar.checkbox("Scale", False, "scale")
encode = st.sidebar.checkbox("Encode", False, "encode")
impute = st.sidebar.checkbox("Impute", False, "impute")
After that, we add the models that can be used to fit the data. This time we wrap the checkboxes in a dictionary to be able to loop over them later (note that we use ATOM’s model acronyms as keys).
# Model options
st.sidebar.subheader("Models")
models = {
"gnb": st.sidebar.checkbox("Gaussian Naive Bayes", True, "gnb"),
"rf": st.sidebar.checkbox("Random Forest", True, "rf"),
"et": st.sidebar.checkbox("Extra-Trees", False, "et"),
"xgb": st.sidebar.checkbox("XGBoost", False, "xgb"),
"lgb": st.sidebar.checkbox("LightGBM", False, "lgb"),
}
Note: Make sure to have the XGBoost and LightGBM packages installed to be able to use these models.
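Both optional dependencies are available from PyPI, so a one-line install should be enough (assuming the standard xgboost and lightgbm distribution names):

```shell
# Install the optional model back-ends used by the XGBoost and LightGBM checkboxes
pip install xgboost lightgbm
```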
Data ingestion
The sidebar menu is done, time to make the body of the app. The first part is the data ingestion, where we can upload the dataset that we want to use. Use streamlit’s file_uploader function for that.
st.header("Data")
data = st.file_uploader("Upload data:", type="csv")

# If a dataset is uploaded, show a preview
if data is not None:
    data = pd.read_csv(data)
    st.text("Data preview:")
    st.dataframe(data.head())
Model training and evaluation
Lastly, write the actual pipeline that will process the data, train the models and evaluate the results. Note that this example only works for binary classification tasks.
st.header("Results")
if st.sidebar.button("Run"):
    placeholder = st.empty()  # Placeholder to overwrite progress messages
    placeholder.write("Initializing atom...")

    # Initialize atom
    atom = ATOMClassifier(data, verbose=2, random_state=1)

    if scale:
        placeholder.write("Scaling the data...")
        atom.scale()
    if encode:
        placeholder.write("Encoding the categorical features...")
        atom.encode(strategy="LeaveOneOut", max_onehot=10)
    if impute:
        placeholder.write("Imputing the missing values...")
        atom.impute(strat_num="median", strat_cat="most_frequent")

    placeholder.write("Fitting the models...")
    to_run = [key for key, value in models.items() if value]
    atom.run(models=to_run, metric="f1")

    # Display metric results
    placeholder.write(atom.evaluate())

    # Draw plots
    col1, col2 = st.columns(2)
    col1.write(atom.plot_roc(title="ROC curve", display=None))
    col2.write(atom.plot_prc(title="PR curve", display=None))
else:
    st.write("No results yet. Click the run button!")
This is a rather large chunk of code, so let me explain what happens here. The if statement at the start creates a button in the sidebar that, when clicked, executes the code block inside it. As long as the button is not clicked, the pipeline does not run. This ensures that the pipeline doesn't start running after every click on one of the checkboxes in the menu. The code block inside the if statement does the following:
- Create a placeholder text block to write some progress information while the pipeline is running.
- Initialize an ATOMClassifier instance that will process the pipeline. With this command, the data is automatically split into a training and test set with an 80%-20% ratio.
- Run the data cleaning step corresponding to every checked checkbox in the sidebar.
- Use atom’s run method to train all the selected models on the training set.
- Output the models’ performance on the test set using the evaluate method.
- Display the Receiver Operating Characteristic curve and the Precision-Recall curve for all the trained models. The display=None argument is necessary to return the created matplotlib figure.
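To make the second step concrete, the 80%-20% shuffle-and-split that ATOMClassifier performs can be sketched in a few lines of plain Python. This is only an illustration of the splitting logic, not ATOM's actual implementation:

```python
import random

def split_80_20(rows, seed=1):
    """Shuffle a copy of the rows and cut off the last 20% as the test set."""
    shuffled = rows[:]  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20(list(range(100)))
print(len(train), len(test))  # 80 20
```

Fixing the seed (like random_state=1 in the app) makes the split reproducible across runs.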
Try it out
And just like that, the web app is done! Let’s give it a try. To run the app, open a terminal, go to the directory where the file is located, and run the command streamlit run <name_web_app>.py. The web app will automatically open in your default browser. Follow the steps described here to deploy it.
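As a terminal session, the launch step looks like this (both the directory and the script name below are placeholders; substitute your own):

```shell
# Move to the folder that contains the script, then launch the app
cd path/to/app_folder
streamlit run <name_web_app>.py
```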
The data used in the example shown is a variation on the Australian weather dataset from Kaggle. It can be downloaded from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow. The example’s full code can be found here.
Conclusion
We have seen how to use ATOM and Streamlit to quickly create a web app capable of exploring a basic machine learning pipeline. Thanks to the flexibility and ease-of-use of both libraries, it wouldn’t take much effort to improve the web app by adding new models, allowing regression pipelines, showing some extra plots, or increasing the complexity of the pipeline.
Related stories:
- https://towardsdatascience.com/atom-a-python-package-for-fast-exploration-of-machine-learning-pipelines-653956a16e7b
- https://towardsdatascience.com/how-to-test-multiple-machine-learning-pipelines-with-just-a-few-lines-of-python-1a16cb4686d
For further information about ATOM, check out the project’s GitHub or Documentation page. For bugs or feature requests, don’t hesitate to open an issue on GitHub or send me an email.