Data-Oriented Programming in Python

A recap on Data-Oriented Programming by Yehonathan Sharvit but illustrated with Python examples (instead of JavaScript and Java)

Tam D Tran-The
Towards Data Science

--

Photo by AltumCode on Unsplash

Data-Oriented Programming by Yehonathan Sharvit is a great book that gives a gentle introduction to the concept of data-oriented programming (DOP) as an alternative to good old object-oriented programming (OOP). Sharvit deconstructs the elements of complexity that sometimes seems inevitable with OOP and summarizes the main principles of DOP that helps us make the system more manageable.

As its name suggests, DOP puts data first and foremost. This can be achieved by adhering to four main principles. These principles are language-agnostic. They can be represented in OOP languages (Java, C++, etc.), functional programming (FP) languages (Clojure, etc.) or general-purpose languages (Python, JavaScript). Whereas the author illustrates his examples using JavaScript and Java, this article attempts to demonstrate the ideas in Python.

Following along the article, you’ll find simple code snippets in Python that illustrate how each principle can be adhered to or broken. Sharvit also clarifies what the benefits and costs for each principle — many of them are relevant in Python whereas some are not.

Please note all the principles, corresponding advantages and drawbacks mentioned are credited to Yehonathan Sharvit, whereas the viewpoints on the applicability of these statements to Python, in conjunction with the Python code illustrations, are my own.

Principle #1: Separate code from data

“Separate code from data in a way that the code resides in functions whose behavior does not depend on data that is encapsulated in the function’s context.” — Yehonathan Sharvit

A natural way of adhering to this principle in Python is to use top-level functions (for code) and data classes that only have fields (for data). Whereas Sharvit illustrates in his book how to follow this principle in OOP and FP (functional programming) separately, my example in Python is a hybrid of OOP and FP.

Refer to the code snippet below as an example where code (behavior) is separated from data (facts/information).

from dataclasses import dataclass

@dataclass
class AuthorData:
"""Class for keeping track of an author in the system"""

first_name: str
last_name: str
n_books: int

def calculate_name(first_name: str, last_name: str):
return f"{first_name} {last_name}"

def is_prolific(n_books: int):
return n_books > 100

author_data = AuthorData("Isaac", "Asimov", 500)
calculate_name(author_data.first_name, author_data.last_name)
# 'Isaac Asimov'

Benefit # 1: “Code can be reused in different contexts” — Yehonathan Sharvit

As seen from the example above, calculate_name() can be used not only for authors but also for users, librarians, or anyone that has a first name and a last name field. The code that deals with full name calculation is separate from the code that deals with the creation of author data.

@dataclass
class UserData:
"""Class for keeping track of a user in the system"""

first_name: str
last_name: str
email: str

user_data = UserData("John", "Doe", "john.doe@gmail.com")
calculate_name(user_data.first_name, user_data.last_name)
# 'John Doe'

Benefit # 2: “Code can be tested in isolation” — Yehonathan Sharvit

Below is an example that doesn’t adhere to Principle #1.

class Address:
def __init__(self, street_num: int, street_name: str,
city: str, state: str, zip_code: int):
self.street_num = street_num
self.street_name = street_name
self.city = city
self.state = state
self.zip_code = zip_code


class Author:
def __init__(self, first_name: str, last_name: str, n_books: int,
address: Address):
self.first_name = first_name
self.last_name = last_name
self.n_books = n_books
self.address = address

@property
def full_name(self):
return f"{self.first_name} {self.last_name}"

@property
def is_prolific(self):
return self.n_books > 100


address = Address(651, "Essex Street", "Brooklyn", "NY", 11208)
author = Author("Issac", "Asimov", 500, address)
assert author.full_name == "Issac Asimov"

In order to test the full_name() property that lives inside the Author class, we need to instantiate the Author object, which requires us to have values for all attributes, including those unrelated to the behavior we are testing (such as n_books and address custom class). This is an unnecessarily complex and tedious setup just to test a single method.

On the other hand, in the DOP version, to test calculate_name() code, we can create data to be passed into the function in isolation.

assert calculate_name("Issac", "Asimov") == "Issac Asimov"

Cost # 1: “No control on what code can access what data” — Yehonathan Sharvit

“…in OOP, the data is encapsulated in an object, which guarantees that the data is accessible only by the object’s methods. In DOP, since data stands on its own, it can be accessed by any piece of code…which is inherently unsafe.” — Yehonathan Sharvit

This statement is not applicable in Python.

In Python, data held by a class can still be accessed by any piece of code that has a reference to the object. For example:

class Author:
def __init__(self, first_name: str, last_name: str, n_books: int):
self.first_name = first_name
self.last_name = last_name
self.n_books = n_books

@property
def full_name(self):
return f"{self.first_name} {self.last_name}"

@property
def is_prolific(self):
return self.n_books > 100

author = Author("Issac", "Asimov", 500, address)
author.full_name
# 'Issac Asimov'

Also, unless we store data in a global variable, we can still use scopes (functions, loops, etc.) to control who can access/change data in Python.

Cost #2: “There is no packaging” — Yehonathan Sharvit

“In DOP, the code that manipulates the data could be anywhere. This might make it difficult for developers to discover that [a specific function] is available, which could lead to wasted time and duplicated code.” — Yehonathan Sharvit

This is true with our Python example above. For instance, our AuthorData data class might be in one file and calculate_name() function might be in another file.

Principle #2: Represent data with generic data structures

“In DOP, data is represented with generic data structures, such as maps (or dictionaries) and arrays (or lists).” — Yehonathan Sharvit

In Python, our built-in options for generic data structures are dict , list , and tuple.

In this article, I use Python’s dataclass , which can be thought of as a “mutable named tuple with defaults.” Note that this was not what Sharvit meant by “generic data structure.” Python’s dataclass is a hybrid that is closer to OOP than DOP. However, compared with dictionaries and tuples, this alternative is less susceptible to typos, more descriptive with type hinting, helps represent nested complex structure in a clearer and more concise way, and more. Also, it can easily be turned into a dictionary or a tuple if we want to.

from dataclasses import dataclass, asdict

@dataclass
class AuthorData:
"""Class for keeping track of an author in the system"""

first_name: str
last_name: str
n_books: int

author_data = AuthorData("Isaac", "Asimov", 500)
asdict(author_data)
# {'first_name': 'Isaac', 'last_name': 'Asimov', 'n_books': 500}

Benefit #1: “The ability to use generic functions that are not limited to our specific use case” — Yehonathan Sharvit

Given generic structures, we can manipulate data using a rich set of built-in Python functions available on dict, list, tuple, etc.

Below are a few examples of generic functions that can be used to manipulate data stored in a dict .

author = {"first_name": "Issac", "last_name": "Asimov", "n_books": 500}

# Access dict values
author.get("first_name")

# Add new field to dict
author["alive"] = False

# Update existing field
author["n_books"] = 703

This means we don’t have to learn and remember the custom methods of everyone’s classes. Also, the generic functions can’t break if we change some library versions. They only break if the Python language changes them (which almost never happens).

Benefit #2: “Flexible data model” — Yehonathan Sharvit

“When using generic data structures, data can be created with no predefined shape, and its shape can be modified at will.” — Yehonathan Sharvit

In the example below, not all the dictionaries in the list have the same keys. The extra keys can exist in the second dictionary as long as the required fields are present.

names = []
names.append({"first_name": "Isaac", "last_name": "Asimov"})
names.append({"first_name": "Jane", "last_name": "Doe",
"suffix": "III", "age": 70})

Cost #1: “Performance hit” — Yehonathan Sharvit

This does not entirely translate to Python.

In Python, there is not much difference in performance between retrieving value of a class member and retrieving value associated to a key in a dictionary. Unlike Java, there is no compilation step in Python, which means there is no compiler optimization when it comes to accessing a class member.

However, not all generic data structures are equal. Lookup time for set and dict is more efficient than that for list and tuple , given that sets and dictionaries use hash function to determine any particular piece of data is right away, without a search.

Cost #2: “No data schema” — Yehonathan Sharvit

“When data is instantiated from a class, the information about the data shape is in the class definition. The existence of data schema at a class level makes it easy to discover the expected data shape. When data is represented with generic data structures, data schema is not part of the data representation.” — Yehonathan Sharvit

For example, we can easily tell the data shape of FullName which is instantiated as a class object below.

class FullName:
def __init__(self, first_name, last_name, suffix):
self.first_name = first_name
self.last_name = last_name
self.suffix = suffix

Cost #3: “No compile-time check that the data is valid” — Yehonathan Sharvit

This does not entirely translate to Python.

Again, there is no compilation step in Python as in Java. The only compile-time checking for Python would be running a tool like mypy.

However, Sharvit’s example about how data shape errors could slip through the crack with generic data structures can still somewhat be demonstrated in Python as below.

When data is passed to the FullName class that does not conform to the shape it expects, an error occurs at run time. For example, if we mistype the field that stores first name ( fist_name instead of first_name ), we would get TypeError: __init__() got an unexpected keyword argument 'fist_name'.

class FullName:
def __init__(self, first_name, last_name, suffix):
self.first_name = first_name
self.last_name = last_name
self.suffix = suffix

FullName(fist_name="Jane", last_name="Doe", suffix="II")

However, with generic data structures, mistyping the field might not result in an error or an exception. Rather, first name is mysteriously omitted from the result.

names = []
names.append({"first_name": "Jane", "last_name": "Doe", "suffix": "III"})
names.append({"first_name": "Isaac", "last_name": "Asimov"})
names.append({"fist_name": "John", "last_name": "Smith"})

print(f"{names[2].get('first_name')} {names[2].get('last_name')}")
# None Smith

Cost #4: “The need for explicit type casting” — Yehonathan Sharvit

This does not translate to Python.

Python is a dynamically typed language. It does not require explicit type casting.

Principle # 3: Data is immutable

“According to DOP, data should never change! Instead of mutating data, a new version of it is created.” — Yehonathan Sharvit

To adhere to this principle, we make our dataclassfrozen (i.e. immutable).

@dataclass(frozen=True)
class AuthorData:
"""Class for keeping track of an author in the system"""

first_name: str
last_name: str
n_books: int

The immutable data types in built-in Python are int , float , decimal , bool , string , tuple and range . Note that dict , list and set are mutable.

Benefit #1: “Data access to all with confidence” — Yehonathan Sharvit

“When data is mutable, we must be careful when passing data as an argument to a function since it can be mutated or cloned.” — Yehonathan Sharvit

In the example below, we originally pass an empty list as a default argument to the function. Since list is a mutable object, every time we call the function, the list gets mutated and a different default value gets used in the successive call.

def append_to_list(el, ls=[]):
ls.append(el)
return ls

append_to_list(1)
# [1]
append_to_list(2)
# [1, 2]
append_to_list(3)
# [1, 2, 3]

To fix the use case above, we can do:

def append_to_list(el, ls=None):
if ls is None:
ls = []
ls.append(el)
return ls

append_to_list(1)
# [1]
append_to_list(2)
# [2]

This code works as expected because None is immutable.

“When data is immutable, it can be passed to any function with confidence because data never changes.” — Yehonathan Sharvit

Benefit #2: “Predictable code behavior” — Yehonathan Sharvit

Here is an example of an unpredictable piece of code:

from datetime import date

dummy = {"age": 30}

if date.today().day % 2 == 0:
dummy["age"] = 40

The value of age in dummy dictionary is not predictable. It depends on whether you run the code on an even or odd day.

However, with immutable data, it is guaranteed that data never changes.

author_data = AuthorData("Issac", "Asimov", 500)

if date.today().day % 2 == 0:
author_data.n_books = 100
# dataclasses.FrozenInstanceError: cannot assign to field "n_books"

The piece of code above would error out, saying dataclasses.FrozenInstanceError: cannot assign to field "n_books" . With frozen data class, no matter it’s an even or odd day, author_data.n_books is always 500.

Benefit #3: “Fast equality checks” — Yehonathan Sharvit

Python has two similar operators for checking whether two objects are equal: is and == . is checks for identity (of objects) by comparing the integer equality of the memory address. == checks for equality (of values) by examining the actual content stored.

# String is immutable
x = "abc"

# Note that the identity of `x` and `abc` is the same
print(id(x))
# 140676188882480
print(id("abc"))
# 140676188882480

print(x == "abc")
# True
print(x is "abc")
# True

# List is mutable
y = [1, 2, 3]

# Note that the identity of `y` and `[1, 2, 3]` is different
print(id(y))
# 140676283875904
print(id([1, 2, 3])
# 140676283875584

print(y == [1, 2, 3])
# True
print(y is [1, 2, 3])
# Fasle

As seen above, is and == behaves the same way for x which is a string (i.e. immutable data type) but behaves differently for y which is a list (i.e. mutable data type). With immutable data objects, is behaves more predictably. Also,is is generally faster than == because comparing object addresses is faster than comparing all the fields. Immutable data thus enables fast equality checks by comparing data by reference.

Benefit #4: “Free concurrency safety” — Yehonathan Sharvit

When data is mutable in a multi-thread environment, race condition failure can occur. For example, assuming that two threads are attempting to access and modify the value of x by adding/subtracting 10 to/from it:

Race condition failure example. Image by Author.

There are three possible answers: x=90 , x=100 , and x=110 . Depending on the order of execution, the program’s behavior changes each time it is run, which is not safe and vulnerable to corruption. To ensure concurrency safety, data should be in an immutable state.

Cost #1: “Performance hit” — Yehonathan Sharvit

Given that list is mutable and tuple is immutable, as we expand both objects, list identity remains the same whereas a brand new tuple is created with a different identity.

list1 = [1, 2, 3]
tuple1 = (1, 2, 3)

print(id(list1))
# 140218642718848
print(id(tuple1))
# 140218642722496

list1 += [4, 5]
tuple1 += (4, 5)

print(id(list1))
# 140218642718848
print(id(tuple1))
# 140218642772352

The need to copy contents of immutable object into a new object every time we modify it requires additional memory and creates added cost on CPU power, especially for very large collections.

Cost #2: “Required library for immutable data structures” — Yehonathan Sharvit

This does not translate to Python.

frozenset and tuple are some basic built-in immutable data structures in Python. We are not always required to include a third-party library to adhere to data immutability principle.

Principle #4: Separate data schema from data representation

“In DOP, the expected shape of data is represented as (meta) data that is kept separately from the main data representation.” — Yehonathan Sharvit

Given below is a basic JSON schema (essentially a dictionary) that describes the format of data which is also represented as a dictionary. The schema defines which fields are required and the data types of the fields, whereas the data is represented by a generic data structure per Principle #3.

schema = {
"required": ["first_name", "last_name"],
"properties": {
"first_name": {"type": str},
"last_name": {"type": str},
"books": {"type": int},
}
}

data = {
"valid": {
"first_name": "Isaac",
"last_name": "Asimov",
"books": 500
},
"invalid1": {
"fist_name": "Isaac",
"last_name": "Asimov",
},
"invalid2": {
"first_name": "Isaac",
"last_name": "Asimov",
"books": "five hundred"
}
}

Data validation functions (or libraries) can be used to check whether a piece of data conforms to a data schema.

def validate(data):
assert set(schema["required"]).issubset(set(data.keys())), \
f"Data must have following fields: {schema['required']}"

for k in data:
if k in schema["properties"].keys():
assert type(data[k]) == schema["properties"][k]["type"], \
f"Field {k} must be of type {str(schema['properties'][k]['type'])}"

The validate function passes through when data is valid or returns errors with details in a human readable format when data is invalid.

validate(data["valid"]))
# No error

validate(data["invalid1"])
# AssertionError: Data must have following fields: ['first_name', 'last_name']

validate(data["invalid2"])
# AssertionError: Field books must be of type <class 'int'>

Benefit #1: “Optional fields” — Yehonathan Sharvit

“In OOP, allowing a class member to be optional is not easy. In DOP, it is natural to declare a field as optional in a map.” —Yehonathan Sharvit

This does not translate to Python.

In Python, even with OOP, allowing a class member to be optional is not hard. This benefit is therefore not strong in the context of Python. For example, below we can set the default argument of n_books to None to indicate the field is optional.

class Author:
def __init__(self, first_name: str, last_name: str, n_books: int = None):
self.first_name = first_name
self.last_name = last_name
self.n_books = n_books

@property
def fullname(self):
return f"{self.first_name} {self.last_name}"

@property
def is_prolific(self):
if self.n_books:
return self.n_books > 100

author = Author("Issac", "Asimov")

Benefit #2: “Advanced data validation conditions” — Yehonathan Sharvit

“In DOP, data validation occurs at run time. It allows the definition of data validation conditions that go beyond the type of a field.” — Yehonathan Sharvit

Compared with the minimal schema defined above, the following schema can be expanded to include more properties for each field.

schema = {
"required": ["first_name", "last_name"],
"properties": {
"first_name": {
"type": str,
"max_length": 100,
},
"last_name": {
"type": str,
"max_length": 100
},
"books": {
"type": int,
"min": 0,
"max": 10000,
},
}
}

While not all the advantages and disadvantages of DOP principles mentioned by Sharvit directly apply to Python, the fundamental principles remain robust. This approach promotes code that is easier to reason about, test and maintain. By embracing the principles and techniques of DOP, Python programmers can create more maintainable and scalable code, and unlock the full potential of their data.

Special thanks to Eddie Pantridge for his thoughtful comments and efforts towards improving this article.

--

--

Data scientist. Talks about anything data science and human-centered designs. Experience in finance and healthcare. https://www.linkedin.com/in/tamtranthe/