
Python: The (unofficial) OOP crash course for (aspiring) data scientists!

Classes, attributes, methods, inheritance, encapsulation, and all that other mumbo jumbo you're not sure if you really need (spoiler: for…

Credit: Pixabay

Background

Python is experiencing tremendous growth in both market demand and user base; whether you’re a developer, analyst, researcher, or engineer, there’s a good chance Python is being used in your domain. The barrier to entry could not be lower, with so many free educational materials available (such as Automate the Boring Stuff). Oddly, however, this accessibility has a downside: Pythonistas are plateauing prematurely, cutting short their study of Python fundamentals in favor of domain-specific libraries and frameworks. This is especially true of present-day data science culture; after gaining intermediate familiarity with strings, lists, and dictionaries, prospective Pythonistas jump directly into NumPy, Pandas, and Scikit-Learn (if not Keras, TensorFlow, or PyTorch).

So should you study data structures and algorithms (DS&A), implementing a binary tree or linked list from scratch? What is the appropriate "I’ve learned enough Python fundamentals" threshold for starting your domain-specific journey? There’s no one-size-fits-all answer; however, I recommend familiarizing yourself with OOP at the very least before jumping into topics like regression, classification, and clustering.

What is Object Oriented Programming (OOP)?

To make sense of OOP, we need to briefly discuss functional programming. Without diving deep into that paradigm, suffice it to say that functional programming separates functions (actions) from data (information), whereas OOP views this as a false dichotomy. Have you used Python’s built-in list before? (Yes.) Surely you’ve noticed that the append method lets you insert an element after the current last element, automatically incrementing the list’s length? (Yes.)

Congrats! You like OOP. The fact that the data type and its methods (functions attached to an object) form one cohesive whole is the core "essence" of OOP.
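To see this in action with the humble list (a quick illustration of the append example above, not part of the original snippets):

reviews = ['great food']
reviews.append('slow service') # append is a method bound to the list object
len(reviews)
>>>
2 # the data and its behavior travel together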

In the following code snippets, I’m going to define a class or two and demonstrate some essential OOP concepts around an established machine learning task, so you can see how OOP can benefit the data scientist. We will be classifying Yelp reviews using any of a variety of ML algorithms. In fact, this class will receive only two mandatory arguments: the data and the model you wish to use. (There are, of course, several other optional arguments.) This class architecture will allow us to do the following:

1. Divide data into train and test sets,
2. Preprocess and vectorize data into TF-IDF tokens,
3. Train the model,
4. Compute accuracy and related metrics on test set performance, and
5. Save the model via pickle.

This pipeline will greatly increase your efficiency. I, like many others, jumped into data science material before having a solid handle on the fundamentals. I’d scroll up and down through notebooks looking for the right variables I had defined earlier. And if I wanted to compare the performance of two or three models, it became a nightmare to make sure I referenced the appropriate variable names. You’ll see that with the following approach, comparing model performance becomes a trivial task, and (with any luck) I’ll have made an OOP evangelist out of you!

Sentiment Analysis with OOP

For display issues, see this GitHub Gist.

As you’ll note in the gist, I’ve defined two classes: DataSplitter and Classifier. For starters, let’s just look at DataSplitter; we’ll visit Classifier shortly thereafter.

DataSplitter receives a dataframe, the name of the text column, and the name of the sentiment column. Then train_test_split is used to assign the following attributes to the class: x_train, x_test, y_train, and y_test. Note that the random_state and test_percent parameters have default values. This means that, unless you specifically change either of these parameters, two class instances will have identical x_train, x_test, y_train, and y_test attributes. This will be useful, as we can compare ML models directly without worrying that they were trained on (slightly) different datasets.
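In case the embedded gist doesn’t render for you, here’s a rough sketch of what DataSplitter could look like. The parameter names test_percent and random_state come from the description above; the default values and the rest of the body are my own assumptions, so check the gist for the author’s exact code:

from sklearn.model_selection import train_test_split

class DataSplitter():
    def __init__(self, data, x_var, y_var, test_percent=0.3, random_state=42):
        self.data = data
        self.x_var = x_var
        self.y_var = y_var
        # identical defaults -> identical splits across instances
        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(
            data[x_var], data[y_var],
            test_size=test_percent, random_state=random_state)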

When a DataSplitter class object is instantiated, you can access these values very simply:

import pandas as pd
data = pd.read_csv('yelp_samples_500k.csv')
d = data.sample(n=1000)
ds = DataSplitter(data=d,x_var='text',y_var='sentiment')
ds.x_test
>>>
267157    This place always has a line so I expected to ...
197388    Would give them a zero stars if I could. They ...

As you can see, the self keyword binds these attributes to the object. Using dot notation, it’s trivial to retrieve these attributes.

On to our next OOP concept: inheritance! For this example we’ll move on to the second class, Classifier. Inheritance simply means one class inheriting functionality from a previously defined class. In our case, Classifier will inherit all the functionality of DataSplitter. Note two things: (1) DataSplitter receives no class to inherit from in its definition, DataSplitter(), whereas Classifier(DataSplitter) does. (2) The super keyword is used in Classifier’s __init__ method. This has the effect of executing DataSplitter’s __init__ method before moving on to the instructions specific to Classifier’s own __init__. Bottom line: we train/test/split our data without retyping all that code again!
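Since the gist carries the actual code, here’s a minimal sketch of just the inheritance mechanics; the argument names mirror how the class is called later in this post, and the body is illustrative only:

class Classifier(DataSplitter):  # inherits everything DataSplitter can do
    def __init__(self, data, model_instance, x_var, y_var, **kwargs):
        # run DataSplitter's __init__ first: x_train/x_test/y_train/y_test now exist
        super().__init__(data, x_var, y_var, **kwargs)
        self.model_instance = model_instance  # Classifier-specific setup continues...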

After the __init__ method, you’ll see __vectorize. Note the double underscores preceding the name. This is how encapsulation is achieved in Python. Encapsulation means the object has access to attributes and methods that are not available to the programmer. In other words, they’re abstracted away (or encapsulated) so as not to distract the programmer.

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
c = Classifier(data=d, model_instance=nb,
               x_var='text', y_var='sentiment')
c.__vectorize('some text')
>>>
AttributeError: 'Classifier' object has no attribute '__vectorize'

In other words, the object has access to these methods, however, we do not!
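(Strictly speaking, Python enforces this via name mangling rather than true privacy; if you insist, the method is still reachable under its mangled name, though doing so defeats the convention:)

c._Classifier__vectorize('some text') # works: __vectorize is stored as _Classifier__vectorize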

Speaking of methods, if you hadn’t guessed, methods are simply functions built into a class object. The methods we’ve defined above include __vectorize, fit, __evaluate_accuracy, metrics, predict, and save. Starting with __vectorize, we use NLTK’s word_tokenize function to tokenize all words and punctuation into unigrams. Next, we use NLTK’s ngrams function to create bigrams and trigrams:

'I like dogs' #text
['I', 'like', 'dogs'] #unigrams
['I like', 'like dogs'] #bigrams
['I like dogs'] #trigram 
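As a hedged sketch of that tokenization step (the helper name tokenize_ngrams is mine, not from the gist), the NLTK calls might look like this:

from nltk import word_tokenize, ngrams

def tokenize_ngrams(text):
    unigrams = word_tokenize(text)                        # ['I', 'like', 'dogs']
    bigrams = [' '.join(g) for g in ngrams(unigrams, 2)]  # ['I like', 'like dogs']
    trigrams = [' '.join(g) for g in ngrams(unigrams, 3)] # ['I like dogs']
    return unigrams + bigrams + trigrams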

This approach improves greatly on unigrams alone because it lets ML models learn the difference between "good" and "not good," for example. Note that I haven’t removed stopwords or punctuation, nor stemmed the word tokens; doing so might be a good homework assignment if you’re interested in expanding the functionality!

The __vectorize method is delivered to the pipeline attribute, which is created by the fit method. The fit method also trains the ML model supplied to __init__ and creates predictions (preds). The next method, __evaluate_accuracy, determines the binary accuracy of the model and assigns it as a class attribute for ease of access later (no need to recompute it multiple times). Our metrics method will either retrieve binary accuracy for us or print the classification report (precision, recall, etc.).

Our predict method productionizes this code with simplicity: supply text to the method call, and either a class will be assigned or the probability of belonging to class 1 will be returned. (If you’re interested in multiclass classification, some adjustments will be necessary; I’ll leave that to you as a homework problem!) Lastly, the save method receives a file path and pickles the entire object. This is neat: all we have to do is open the pickled file to access all class methods and attributes, including the fully trained model!
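Putting those pieces together, here is a hedged sketch of how fit, metrics, predict, and save could hang together, built only from the description above and reusing the tokenize_ngrams helper sketched earlier. The post describes __vectorize itself being handed to the pipeline; in this sketch I pass the module-level helper instead so that pickling the whole object stays straightforward. For the author’s exact implementation, see the gist:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

class Classifier(DataSplitter):
    def __init__(self, data, model_instance, x_var, y_var, **kwargs):
        super().__init__(data, x_var, y_var, **kwargs)  # as sketched earlier
        self.model_instance = model_instance
        self.fit()  # train immediately on instantiation (my assumption)

    def __vectorize(self, text):
        # thin wrapper; the pipeline below uses the module-level helper directly
        return tokenize_ngrams(text)

    def fit(self):
        # TF-IDF over unigrams/bigrams/trigrams, followed by the supplied model
        self.pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(tokenizer=tokenize_ngrams, lowercase=False)),
            ('model', self.model_instance)])
        self.pipeline.fit(self.x_train, self.y_train)
        self.preds = self.pipeline.predict(self.x_test)

    def __evaluate_accuracy(self):
        self.accuracy = accuracy_score(self.y_test, self.preds)

    def metrics(self, report=False):
        if report:
            print(classification_report(self.y_test, self.preds))
            return
        if not hasattr(self, 'accuracy'):
            self.__evaluate_accuracy()
        return f'{self.accuracy * 100} percent accurate'

    def predict(self, text, prob=False):
        if prob:
            return self.pipeline.predict_proba([text])[0][1]  # probability of class 1
        return self.pipeline.predict([text])[0]

    def save(self, path):
        with open(path + '.pkl', 'wb') as f:  # appends .pkl to the supplied path
            pickle.dump(self, f)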

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb_model = Classifier(data=data, model_instance=nb,
                      x_var='text', y_var='sentiment')
nb_model.metrics()
>>>
'92.77733333333333 percent accurate'

Let’s compare this to the random forest classifier!

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf_model = Classifier(data=data, model_instance=rf,
                      x_var='text', y_var='sentiment')
rf_model.metrics()
>>>
'86.29666666666667 percent accurate'

It appears that our Naive Bayes model outperforms our random forest classifier. Saving our object (and opening it again for later use) is as easy as:

import pickle

## saving
nb_model.save('nb_model') # will append .pkl to the end of the input
## opening for later use
with open('nb_model.pkl','rb') as f:
    loaded_model = pickle.load(f)
loaded_model.predict("This tutorial was super awesome!",prob=True)
>>>
0.943261472480177 # 94% certain of positive sentiment

I hope you’ve enjoyed this tutorial; my goals were to be concise, informative, and practical. To meet all three objectives, some OOP topics didn’t make the cut (polymorphism, for example). Please comment if you found this helpful. Likewise, if you’d like me to explore similar concepts, post a comment and I’ll see if it’s something I can work into future articles.

Thank you for reading – If you think my content is alright, please subscribe! 🙂

