
Object-Oriented Programming for Data Scientists

How a switch can make your code production-ready, reduce code complexity, and improve team efficiency.

Image by Joanna Reichert from Pixabay

With so much to learn in the way of programming, data analysis, machine learning, Artificial Intelligence, mathematics, and all of the many other components of data science, it’s fair to say that the learning of concepts becomes an arduous affair when becoming a data scientist.

With data scientists coming from a multitude of backgrounds, many of them not computer science-based, it’s completely understandable that some computer science principles are passed over in favor of getting to the good stuff: completing data analyses.

One of those concepts is object-oriented programming (OOP).

When you ask current data scientists for their opinion on OOP, you’ll probably come back with a mixed bag of answers. In some cases, OOP can be incredibly instrumental in reducing the complexity and time it takes to complete an analysis. In others, OOP can result in having more code than you need or even know what to do with.

Depending on your situation or the project you’re working on, you may find it beneficial to switch to an OOP approach, or you may find that it hinders your progress.

However, one thing is certain: as a data scientist, it never hurts to have an extra skill in your back pocket that may come in handy when you least expect it. In other words, why not learn a little about OOP and see how you or your team can benefit from its principles?


A quick introduction to object-oriented programming (OOP).

Like many excellent things (classic rock, the Ford GT40, and the Apollo 11 space mission, to name a few), object-oriented programming was developed before the turn of the century. Beginning in the 1960s, and then becoming more mainstream in the 1980s, OOP became the go-to method for managing the complexity of large programs.

The goal of OOP is to organize code in a way that makes it maintainable, easy to read, and above all else, reusable. The code gets organized into two different structures that work together, otherwise known as objects and classes. A class is a piece of code that defines particular attributes and functions that can be used to create objects. Think of a class as a blueprint. Objects are pieces of code created using a class. Each object contains the properties given to it by the class and can have unique values assigned to each of those properties. In other words, an object is a single unique instance of a class. Think of an object like a house that was built using a blueprint (the class).

Here’s a visual example using dogs:

Example of the relationship between classes and objects. Inspired by: Erin Doherty

Each dog is the same in that they all have the same attributes of breed, name, birthday, age, and color. However, my dog is unique. He has different values for each attribute. Therefore, I created a unique instance of the class Dog and used it to describe my dog. If I purchased another dog, I could create a second instance of the class to describe my new dog. In other words, I reuse the code created for the class without having to type it out again.
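A minimal Python sketch of the dog example might look like the following. The attribute names follow the list above; the specific dogs, their values, and the choice to compute age from the birthday (rather than store it separately) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dog:
    """Blueprint: every Dog has the same set of attributes."""
    breed: str
    name: str
    birthday: date
    color: str

    def age(self, today: date) -> int:
        """Age in whole years as of `today`, derived from the birthday."""
        had_birthday = (today.month, today.day) >= (self.birthday.month, self.birthday.day)
        return today.year - self.birthday.year - (0 if had_birthday else 1)


# Two objects built from the same blueprint, each with its own values.
my_dog = Dog(breed="Beagle", name="Rex", birthday=date(2018, 5, 1), color="brown")
new_dog = Dog(breed="Poodle", name="Luna", birthday=date(2021, 9, 15), color="white")
```

The class is written once; describing a second dog only takes one more call to Dog(...).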

As mentioned above, classes and objects can also contain methods. An object inherits every method defined by the class it was created from. However, an object may also contain unique methods that don’t appear in the class. For example, all dogs know how to sit. Therefore, the Dog class will contain the sit() method. In addition to sit(), I also taught my dog to stay() and speak(). Therefore, the object describing my dog will contain unique methods that don’t appear anywhere else.

Here’s how that looks visually:

Example of the inheritance of methods from class to object, with the addition of two unique methods to the object that don't originally appear in the class. Inspired by: Erin Doherty
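In Python, the picture above could be sketched roughly as follows. The sit(), stay(), and speak() names come from the example; the dog names and return strings are made up. Attaching methods to a single object with types.MethodType is only one way to do this, and in everyday code a subclass is the more common choice.

```python
import types


class Dog:
    def __init__(self, name):
        self.name = name

    def sit(self):
        # Defined on the class, so every Dog object can do this.
        return f"{self.name} sits"


def stay(self):
    return f"{self.name} stays"


def speak(self):
    return f"{self.name} barks"


my_dog = Dog("Rex")
other_dog = Dog("Luna")

# Bind the extra tricks to one object only; the class itself is unchanged.
my_dog.stay = types.MethodType(stay, my_dog)
my_dog.speak = types.MethodType(speak, my_dog)
```

Here my_dog can sit, stay, and speak, while other_dog can only sit.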

However, this is a pretty mundane example. You get the idea, but it doesn’t necessarily make sense in the context of data science. Not a problem. Let’s take a look at this from a data science perspective.

Rose Day explained it best when she described how OOP can be used in data science in this article. She uses the example of a team member who uses a specific set of functions to clean data that generally works for all data sets the team uses. The team member creates a data cleaning library in an object-oriented manner, which allows the other team members to clean their data using the same functions without having to write them all from scratch. Now, any team member who wants to clean their data can just use the library instead of writing new methods each time.
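To make the idea concrete, here is a small sketch of what such a shared cleaning class could look like. It is not the code from the article mentioned above: the DataCleaner name, its methods, and the sample rows are all hypothetical, and rows are plain dictionaries so the example runs with the standard library alone.

```python
class DataCleaner:
    """Hypothetical shared cleaning helper; rows are plain dicts."""

    def __init__(self, required_columns):
        self.required_columns = list(required_columns)

    def strip_whitespace(self, rows):
        """Trim leading and trailing spaces from every string value."""
        return [
            {key: value.strip() if isinstance(value, str) else value
             for key, value in row.items()}
            for row in rows
        ]

    def drop_incomplete(self, rows):
        """Drop rows where a required column is missing, None, or empty."""
        return [
            row for row in rows
            if all(row.get(col) not in (None, "") for col in self.required_columns)
        ]

    def clean(self, rows):
        """Run the cleaning steps in a fixed order."""
        return self.drop_incomplete(self.strip_whitespace(rows))


raw_rows = [
    {"name": "  Ada ", "age": 36},
    {"name": "   ", "age": 41},
    {"name": "Grace", "age": None},
]
cleaner = DataCleaner(required_columns=["name", "age"])
clean_rows = cleaner.clean(raw_rows)
```

Any teammate can import the class and call clean() on their own rows, so the cleaning rules live in one place.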

For a further example on how to use OOP in a data science context from someone who can explain it much better than I can, check out this article that walks you through exactly how to better wrangle data using classes and objects using examples and source code:

Improve Your Data Wrangling With Object Oriented Programming


Why do data scientists need OOP?

The Yang to OOP’s Yin is functional programming, a style of programming that avoids the shared state common in OOP by building programs from functions rather than from classes and objects. Functional programming rests on the principles of pure functions (a function always returns the same result for the same inputs and has no side effects), recursion (repetition is expressed through functions that call themselves rather than through for and while loops), referential transparency (an expression can be replaced by its value without changing the program’s behavior), and immutability (a variable can’t be modified once it has been initialized).
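As a rough illustration of those principles in Python (which does not enforce any of them), consider the sketch below. The function names and prices are invented for the example.

```python
from typing import Tuple


def add_tax(price: float, rate: float) -> float:
    """Pure function: the result depends only on the inputs, with no side effects."""
    return price * (1 + rate)


def total(prices: Tuple[float, ...]) -> float:
    """Recursion in place of a for or while loop."""
    if not prices:
        return 0.0
    return prices[0] + total(prices[1:])


# A tuple is immutable: it cannot be changed after it is created.
prices = (10.0, 20.0, 30.0)

# Instead of modifying `prices`, build a new tuple from it.
with_tax = tuple(map(lambda p: add_tax(p, 0.5), prices))
```

Because add_tax is pure, a call such as add_tax(10.0, 0.5) can be swapped for its value, 15.0, anywhere in the program, which is what referential transparency means in practice.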

For many tasks in data science, functional programming will work perfectly fine, and many data scientists will spend their careers writing strictly functional code. Ask around and you will find opinion split between those who use functional programming and those who prefer OOP. You may even find yourself switching between the two from one project to the next.

However, there are some key reasons why data scientists should at least have some knowledge of OOP in their back pocket.

  • One of the cornerstones of programming best practices is the DRY principle (Don’t Repeat Yourself). Say the name of the file containing your data changes from "data" to "dataset". If that name is typed out everywhere your functions and scripts read the file, you would need to go through your code line by line and change every reference of "data" to "dataset". However, if access to your data is organized into a class, you would only have to change the one reference to "data" inside the class, and every object created from that class would pick up the change.
  • Writing code using OOP principles can make your code easier to debug. When you see a bug popping up in a particular object, its class is the first place to look. This also helps when troubleshooting or implementing new features. Furthermore, when you make changes inside a class, the effects are largely contained to that class and the rest of your code is left alone. Thus, less code gets broken if something goes wrong, and the problem is easier to find and solve.
  • Data scientists often work with software developers to produce production-ready code, so it makes sense for data scientists to develop some OOP skills to help make the process a smooth one. Data scientists come from a multitude of backgrounds, often ones that aren’t computer science-related, so OOP principles aren’t necessarily part of their toolkit. However, OOP is commonplace for software developers. Therefore, it helps to learn the language to make teamwork easy and fluid.
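The filename scenario from the first bullet can be sketched like this. The SalesData class, the directory names, and the ".csv" extension are assumptions for illustration; the point is only that the name is written in one place.

```python
from pathlib import Path


class SalesData:
    """Hypothetical dataset wrapper; the filename lives in exactly one place."""

    filename = "dataset.csv"  # rename the file here and nowhere else

    def __init__(self, directory):
        self.directory = Path(directory)

    @property
    def path(self):
        # Every object builds its path from the single class-level filename.
        return self.directory / self.filename


raw = SalesData("raw")
processed = SalesData("processed")
```

Both raw.path and processed.path are built from SalesData.filename, so a rename is a one-line change.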

In short, using OOP will help improve team efficiency, will reduce code complexity, and will help you produce production-ready code that your software developers will be looking forward to receiving.


How to get better at object-oriented programming.

  • Focus only on the four principles of object-oriented programming. For data scientists, understanding the four basic principles of OOP is often enough to get by. Understanding anything beyond that is fantastic, but because OOP can be such an abstract concept, don’t worry about understanding the deep dark details until you really need them. The four principles of OOP are encapsulation (bundling data together with the methods that operate on it inside a class, and limiting direct outside access to that internal state), abstraction (hiding implementation details behind a simple interface to make objects simpler to deal with), inheritance (subclasses inherit traits from their superclasses, and objects get their traits from the classes they belong to), and polymorphism (objects of different classes can respond to the same method call, each in its own way). I explain these concepts in greater detail here.
  • Model a real-world problem using OOP. Modeling dog breeds, cars, and houses is a great way to get started with OOP, and many tutorials online cover these basics in great detail. However, these concepts are a little more concrete than the ones you may be dealing with when doing data analyses. Therefore, I suggest finding some real-world problems to solve using OOP. I particularly suggest following the tutorial I have linked under the "A quick introduction to object-oriented programming" section due to its data-wrangling nature.
  • Use a language built around classes to force yourself to use OOP. Strong typing is often mentioned in this context, but depending on who you ask, you’ll get a variety of answers that contradict one another when you ask for an example of a strongly-typed programming language, and typing is not what enforces OOP in any case. What matters is that Java requires every method to be declared inside a class, so I suggest using Java when learning OOP. Using Java will force you to use OOP to make your code work, whereas using another language may allow you to get away with not being perfectly object-oriented. This will help cement the concepts and will allow you to carry them over to your chosen language easily.
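The four principles named in the first bullet can be seen together in one short Python sketch. The Model, MeanModel, and MaxModel classes are hypothetical and deliberately tiny; they are not taken from any library.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Abstraction: callers only need fit() and predict()."""

    def __init__(self):
        # Encapsulation: internal state, marked private by the underscore convention.
        self._fitted = False

    def fit(self, values):
        self._learn(values)
        self._fitted = True
        return self

    def predict(self):
        if not self._fitted:
            raise RuntimeError("call fit() first")
        return self._estimate()

    @abstractmethod
    def _learn(self, values): ...

    @abstractmethod
    def _estimate(self): ...


class MeanModel(Model):  # Inheritance: reuses fit() and predict() from Model.
    def _learn(self, values):
        self._mean = sum(values) / len(values)

    def _estimate(self):
        return self._mean


class MaxModel(Model):
    def _learn(self, values):
        self._max = max(values)

    def _estimate(self):
        return self._max


# Polymorphism: the same calls work on either model, each in its own way.
predictions = [m.fit([1, 2, 6]).predict() for m in (MeanModel(), MaxModel())]
```

Note that Python only marks private state by convention; a language like Java would enforce it with access modifiers.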

Final thoughts.

As you can see, OOP isn’t just for software developers – it also contains many benefits that data scientists can take advantage of. It doesn’t hurt to have some OOP knowledge in the back of your mind when looking at a new project you want to complete efficiently, or when looking at an old project you want to optimize.

Besides, if the situation necessitates it, who wouldn’t want to write data science code that is production-ready, less complex, and written with greater team efficiency?

