How to use the magic of pipelines

Using ETL to save the day

Ricardo Pinto
Towards Data Science


Photo by Sven Kucinic on Unsplash

You have surely heard of pipelines or ETL (Extract, Transform, Load), seen a pipeline method in some library, or come across a tool built for creating them. Yet you aren’t using them. So, let me introduce you to the fantastic world of pipelines.

Before learning how to use pipelines, we have to understand what they are.

A pipeline is a way to wrap and automate a process, which means the process will always be executed in the same way, with the same functions and parameters, and the outcome will always follow the predetermined standard.

So, as you may guess, the goal is to apply pipelines at every development stage, to guarantee that the process as executed never drifts away from the process as designed.

Made with Kapwing

There are two uses of pipelines in data science in particular, one in production and one during modelling/exploration, that are hugely important. They also make our lives much easier.

The first one is the data ETL. In production the ramifications are far greater, and so is the level of detail invested in it, but it can be summed up as:

E (Extract) — How am I going to collect the data? Will I collect it from one or several sites, one or more databases, or even a simple pandas CSV read? We can think of this stage as the data reading phase.

T (Transform) — What do I need to do for the data to become usable? This can be thought of as the conclusion of the exploratory data analysis: once we know what to do with the data (remove features, transform categorical variables into binary data, clean strings, etc.), we compile it all into a function that guarantees the cleaning will always be done in the same way.

L (Load) — This is simply saving the data in the desired format (CSV, database, etc.) somewhere, either in the cloud or locally, to use anytime, anywhere.

Creating this process is so simple that it can be done just by grabbing that exploratory data analysis notebook, putting that pandas read_csv inside a function, writing the several functions that prepare the data and compiling them into one, and finally creating a function that saves the result of the previous one.

With this in place, we can create a main function in a Python file and execute the whole ETL with a single line of code, without risking any changes. Not to mention the advantage of changing/updating everything in a single place.
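As a minimal sketch of what that can look like (the file paths, column names and cleaning steps here are placeholders for illustration, not taken from any specific project):

import pandas as pd


def extract(path='data/raw_data.csv'):
    # E: read the raw data (could just as well be a database query or an API call)
    return pd.read_csv(path)


def transform(df):
    # T: apply the cleaning decided during the exploratory data analysis
    df = df.drop(columns=['unused_feature'])         # remove features
    df = pd.get_dummies(df, columns=['category'])    # categorical -> binary columns
    df['text'] = df['text'].str.strip().str.lower()  # clean strings
    return df


def load(df, path='data/clean_data.csv'):
    # L: save the result where the rest of the project can pick it up
    df.to_csv(path, index=False)


def main():
    load(transform(extract()))


if __name__ == '__main__':
    main()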

And the second, and likely the most advantageous, pipeline helps solve one of the most common problems in machine learning: parametrization.

How many times have we faced questions like these: which model should I choose? Should I use normalization or standardization?

“Screenshot captured by author”

Libraries such as scikit-learn offer the Pipeline class, where we can chain pre-processing steps such as normalization, standardization or even a custom transformer with one or more models and their respective parameter grids, and then wrap everything in a cross-validated grid search. Afterwards, every combination is tested and the results are returned, or even just the best one, as in the following code:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def build_model(X, y):
    # chain text vectorization, TF-IDF weighting and a multi-output classifier
    # (tokenize is a custom tokenizer function defined elsewhere in the project)
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
    ])

    # specify parameters for grid search
    # (uncomment the entries you want to include in the search)
    parameters = {
        # 'vect__ngram_range': ((1, 1), (1, 2)),
        # 'vect__max_df': (0.5, 0.75, 1.0),
        # 'vect__max_features': (None, 5000, 10000),
        # 'tfidf__use_idf': (True, False),
        # 'clf__estimator__n_estimators': [50, 100, 150, 200],
        # 'clf__estimator__max_depth': [20, 50, 100, 200],
        # 'clf__estimator__random_state': [42]
    }

    # create grid search object
    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=1)
    return cv
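
A rough usage sketch, assuming X and y hold the texts and labels coming out of the data ETL and that some of the parameter entries above are uncommented:

model = build_model(X, y)
model.fit(X, y)                        # runs the cross-validated grid search over the whole pipeline

print(model.best_params_)              # the winning parameter combination
best_pipeline = model.best_estimator_  # already refitted on the full data, ready to predict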

At this stage, the sky is the limit! There is no limit to the parameters you can put inside the pipeline. However, depending on the dataset and the chosen parameters, it can take an eternity to finish. Even so, it is a very good tool to narrow down the search.

We can add a function to read the data that comes out of the data ETL, and another to save the trained model, and we have a model ETL, wrapping up this stage.
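
A minimal sketch of that wrapper, reusing build_model from the snippet above and assuming the data ETL wrote a clean CSV with a text column plus one column per label (the file paths and column names here are placeholders, and pickle is just one possible way to save the model):

import pickle

import pandas as pd


def load_data(path='data/clean_data.csv'):
    # read the output of the data ETL
    df = pd.read_csv(path)
    return df['text'], df.drop(columns=['text'])


def save_model(model, path='models/model.pkl'):
    # persist the fitted grid search / pipeline for later use
    with open(path, 'wb') as f:
        pickle.dump(model, f)


def main():
    X, y = load_data()
    model = build_model(X, y)
    model.fit(X, y)
    save_model(model)


if __name__ == '__main__':
    main()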

Beyond everything we have talked about, the greatest advantage of creating pipelines is that the replicability and maintainability of your code improve enormously.

So, what are you waiting for to start creating pipelines?

An example of these pipelines can be found in this project.

