Photo by lucas Favre on Unsplash

LineaPy Data Science Workflow In Just Two Lines: MLOps Made Easy

Data engineering, simplified

Towards Data Science · 8 min read · May 31, 2022

Introduction:

LineaPy is a Python package for data science automation. According to the LineaPy documentation:

LineaPy is a Python package for capturing, analyzing, and automating data science workflows. At a high level, LineaPy traces the sequence of code execution to form a comprehensive understanding of the code and its context. This understanding allows LineaPy to provide a set of tools that help data scientists bring their work to production more quickly and easily, with just two lines of code.

I saw the LineaPy announcement last week. Like Apache Spark, it originated from UC Berkeley research and is now open-sourced. I tried LineaPy, and it looks very interesting and useful. If you want to know more about LineaPy, please continue reading.

Image by the author (using Excalidraw and Excalidraw stickers)

Table of Contents:

1. Why do we need LineaPy?
2. LineaPy Installation
3. Concepts
4. 2 Lines of Code
5. Walkthrough LineaPy Example
6. Conclusion

I have used the following tools to create the diagrams and code snippets.

  • Excalidraw
  • Gitmind
  • Gist
  • Carbon.now.sh

Why do we need LineaPy:

Taking data science work from development to production is a complex engineering process. An article in VentureBeat estimates that about 90% of data science projects don't make it to production; in short, only one out of ten projects does. It is easy to write messy code in a Jupyter notebook, since you will be doing a lot of EDA, statistical analysis, editing cells, deleting cells, etc. Keeping the notebook clean and in sequence is time-consuming and takes a lot of effort. Refactoring data science development code and building pipelines is complex, manual, and slow. LineaPy provides just 2 lines of code to take development code to production code and also generate the required pipeline.

Image by the author

LineaPy Installation:

Image by the author
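The image above shows the installation steps; in short, LineaPy is installed from PyPI, and in a notebook environment you load its extension. A minimal sketch based on the LineaPy docs, as run in a Colab/Jupyter cell:

# Install from PyPI, then load the notebook extension
!pip install lineapy
%load_ext lineapy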

Concepts:

Image by the author

Artifact:

  • An artifact refers to an intermediate result in the data science development process.
  • In the data science workflow, an artifact can be a model, a chart, a statistic, a dataframe, or a feature function.
  • LineaPy treats an artifact as both code and value: it stores the value of the artifact as well as the code needed to derive it.
Image by the author

Artifact Store:

  • Artifacts are stored in the artifact store.
  • The artifact store also saves the artifact's metadata, such as creation time and version.
  • The artifact store is globally accessible by everyone. The user can view, load, and build on artifacts across different development sessions and even different projects.

Pipeline:

  • A pipeline refers to a series of steps that transform data into useful information/product.

For example, below is a pipeline:

Image by the author
  • These pipelines are developed one component at a time, and the components are later connected to form the whole pipeline.
  • In LineaPy, each component is represented as an artifact, and LineaPy provides APIs to create pipelines from a group of artifacts.
Image by the author

2 Lines of Code:

Image by the author
  • Save the artifact you want to keep, such as a dataframe, variable, or model. Then call the artifact to get its value and code. The lineapy.save API creates a LineaArtifact and saves it to the database. The code to save an artifact is:
# Step 1: Store the variable as an artifact
import lineapy

saved_artifact = lineapy.save(model, "my_model_name")
  • The method requires two arguments: the variable to save and the string name to save it as. It returns the saved artifact.
  • The LineaArtifact object has two key APIs:
  • .get_value() returns the value of the artifact, e.g., an integer or a dataframe
  • .get_code() returns minimal essential code to create the value

The code is:

# Step 2: Check the value of the artifact
print(saved_artifact.get_value())

# Check the minimal essential code to generate the artifact
print(saved_artifact.get_code())

# Check the full session code
print(saved_artifact.get_session_code())

# List the saved artifacts
lineapy.catalog()

# Get version info of the retrieved artifact
desired_version = saved_artifact.version

# Check the version info
print(desired_version)
print(type(desired_version))
Image by the author

Walkthrough LineaPy Example:

About the dataset: The dataset is the Auto MPG dataset, available at the UCI Machine Learning Repository (Center for Machine Learning and Intelligent Systems).

Source: This dataset was taken from the StatLib library maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

Dataset: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dataset features:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

This tutorial uses the classic Auto MPG dataset and demonstrates how to build models to predict the fuel efficiency of late-1970s and early-1980s automobiles. This is a classic regression problem. To do this, you will provide the models with a description of many automobiles from that time period, including attributes like cylinders, displacement, horsepower, and weight.

The workflow followed in my example is below. The main goal is to show how to save artifacts, get their value/code, and generate the pipeline.

  1. Load the train and test data into a pandas dataframe.
  2. Perform EDA and statistical analysis.
  3. Save the final training and test data as artifacts using save().
  4. Check the artifacts using get().
  5. Build models using different methods.
  6. Choose the best model.
  7. Save the best model as an artifact using save().
  8. Display the artifact catalog using catalog().
  9. Build the pipeline using the saved artifacts.

To practice locally, please use Google Colab (I used Google Colab). The main objective of this article is to demo the usage of the LineaPy package, not model building or model performance; please check the TensorFlow tutorial in the references for more details on the models. Here the goal is to look at LineaPy functionality.

Declare all the necessary packages.

Image by the author
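The image shows the package declarations. As a sketch, assuming the libraries that appear later in the generated requirements file:

# Core libraries used in this walkthrough
import lineapy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras import layers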

Download the dataset and upload the data into a pandas dataframe.

Image by the author
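A minimal sketch of this step, following the TensorFlow regression tutorial this example is based on (the URL and column names come from that tutorial):

# Download the Auto MPG data and load it into a pandas dataframe
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ["MPG", "Cylinders", "Displacement", "Horsepower",
                "Weight", "Acceleration", "Model Year", "Origin"]
raw_dataset = pd.read_csv(url, names=column_names, na_values="?",
                          comment="\t", sep=" ", skipinitialspace=True)
dataset = raw_dataset.copy()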

Now do some exploratory data analysis and statistical analysis.

Image by the author
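As a sketch of the kind of analysis performed here (again following the TensorFlow tutorial; dataset is the dataframe loaded above):

# Drop rows with missing values and inspect summary statistics
dataset = dataset.dropna()
print(dataset.describe().transpose())

# Visualize pairwise relationships between a few features
sns.pairplot(dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
plt.show()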

Now save the train and test dataframes as artifacts using save().

Image by the author
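A sketch of the save calls; the variable names are assumptions, but the artifact names match the catalog output shown later:

# Save the pre-processed splits as LineaPy artifacts
lineapy.save(train_data, "train_data")
lineapy.save(train_labels, "train_labels")
lineapy.save(test_data, "test_data")
lineapy.save(test_labels, "test_labels")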

For example

Image by the author

The type of the artifact is

<class 'lineapy.graph_reader.apis.LineaArtifact'>

Display the value and code of the artifact:

Image by the author

The output of the saved artifact (the train dataframe) and its code:

Image by the author

All the EDA code and unnecessary code are dropped.

Image by the author

To display the original session code, use get_session_code().

Image by the author

The above get_session_code() call will show the entire session code, including all the EDA code.

Now build the models and choose the best model. You can use the artifacts stored previously to get the pre-processed train and test data sets.

1st model: Linear Regression Model

Image by the author
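A minimal sketch of the linear model, following the TensorFlow tutorial (the normalization layer and variable names are assumptions):

# Normalize the inputs, then fit a single-unit linear model
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_data))

linear_model = tf.keras.Sequential([normalizer, layers.Dense(units=1)])
linear_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                     loss="mean_absolute_error")
linear_model.fit(train_data, train_labels, epochs=100,
                 validation_split=0.2, verbose=0)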

2nd Model: DNN Model

Image by the author
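A comparable sketch for the DNN model, again assuming the tutorial's two-hidden-layer architecture:

# Two hidden layers of 64 units, then a single regression output
dnn_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])
dnn_model.compile(loss="mean_absolute_error",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
dnn_model.fit(train_data, train_labels, epochs=100,
              validation_split=0.2, verbose=0)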

Results:

Image by the author
Image by the author

The DNN model performs better than the linear regression model, so I will save the DNN model.

Now save the DNN model.

Image by the author
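In code, this is a single save() call (model_artifact is an assumed variable name):

# Save the chosen model as an artifact
model_artifact = lineapy.save(dnn_model, "dnn_model")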

To display the code of the model, use the get_code() method.

Image by the author

Below is the code generated when you use get_code() on the model artifact.

Image by the author

List all the saved artifacts.

Image by the author

The output displays all the stored artifacts:

train_data:0 created on 2022-05-28 02:07:02.669098
train_labels:0 created on 2022-05-28 02:07:10.961282
test_data:0 created on 2022-05-28 02:07:14.520631
test_labels:0 created on 2022-05-28 02:07:20.777316
dnn_model:0 created on 2022-05-28 02:15:23.632722

Now you can also build a data pipeline from the artifacts you have saved. The steps to generate the pipeline are:

  • Get the artifacts and assign them to variables, as sketched below.
Image by the author
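A minimal sketch using lineapy.get(); the variable names are assumptions:

# Retrieve previously saved artifacts by name
train_data_artifact = lineapy.get("train_data")
train_labels_artifact = lineapy.get("train_labels")
dnn_model_artifact = lineapy.get("dnn_model")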

Now build the data pipeline. It covers two components:

  • Pre-processing the data
  • Building the model
Image by the author
  • artifacts is the list of artifact names to be used for the pipeline. Here we are using the train and model artifacts; we don't actually need the test artifacts.
  • pipeline_name is the name of the pipeline. Here the pipeline name is titanic_pipeline.
  • dependencies is the dependency graph among the artifacts:
  • If artifact A depends on artifacts B and C, then the graph is specified as { A: { B, C } }
  • If A depends on B and B depends on C, then the graph is specified as { A: { B }, B: { C } }
  • output_dir is the location to put the files for running the pipeline.
  • framework is the name of the orchestration framework to use:
  • LineaPy currently supports "AIRFLOW" and "SCRIPT"
  • If "AIRFLOW", it will generate files that can run Airflow DAGs. You can execute them with the Airflow CLI.
  • If "SCRIPT", it will generate files that can run the pipeline as a Python script.
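Putting these parameters together, a sketch of the call; the artifact names and dependency graph follow this example, and the pipeline name matches the screenshots:

# Generate Airflow pipeline files from the saved artifacts
lineapy.to_pipeline(
    artifacts=["train_data", "train_labels", "dnn_model"],
    dependencies={"dnn_model": {"train_data", "train_labels"}},
    pipeline_name="titanic_pipeline",
    output_dir="./pipeline/",
    framework="AIRFLOW",
)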

Running lineapy.to_pipeline() generates several files that can be used to execute the pipeline from the UI or CLI. The following files are generated:

Image by the author

In this case, the pipeline_name is titanic_pipeline.

The files are stored as shown below.

Image by the author
Image by the author

The requirements file is automatically generated:

lineapy
matplotlib==3.2.2
numpy==1.21.6
pandas==1.3.5
seaborn==0.11.2
tensorflow==2.8.0
tensorflow.keras==2.8.0

The Dockerfile is generated automatically when you build the pipeline.

Image by the author

The Airflow DAG file is also generated.

Image by the author

The Python file containing all the pipeline code is generated as well.

Image by the author

For more information on building the pipelines, please check the documentation.

To know more about the artifact store, please check the documentation.

Please check out the references below for detailed information. There are also a few examples using the iris dataset and a housing price prediction dataset from Kaggle on their GitHub. I tried them locally using Google Colab.

Conclusion:

The LineaPy package will definitely help MLOps teams automate the workflow using just 2 lines of code: the save(), get(), and to_pipeline() methods. Maybe it can be used as a first cut, with further modifications done afterwards. I checked, and the refactored code looks good; the Dockerfile and the Airflow DAG file also look good. LineaPy is open source; for more information, check out their GitHub repo.

Please feel free to connect on LinkedIn.

References:

  1. Dataset source: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  2. About LineaPy: https://lineapy.org/why-lineapy/
  3. LineaPy GitHub repo: https://github.com/LineaLabs/lineapy
  4. Linea.ai: https://linea.ai/
  5. TensorFlow Keras regression tutorial: https://www.tensorflow.org/tutorials/keras/regression
