The world’s leading publication for data science, AI, and ML professionals.

Data Engineering – ORM and ODM with Python

Manipulate database data leveraging an object-oriented programming paradigm

Photo by David Clode on Unsplash
Photo by David Clode on Unsplash

Introduction

When working on Data Science projects, one fundamental pipeline to set up is the one regarding data collection. Real-world Machine Learning mainly differs from Kaggle-like problems because data is not static. We need to scrape websites, gather data from APIs, and so on. This way of collecting data might look chaotic, and it is! That’s why we need to structure our code following best practices to bring some sort of order to all this mess.

The ORM software pattern

Once you identified the sources from which you want to gather your data, you need to collect them in a structured way to store those in your Database. For example, you might decide that in order to train your LLM what you need are data sources which contain 3 fields: author, content, and link.

What you could do is to download the data, and then write SQL queries to store and retrieve data from your database. More commonly you might want to implement all the queries to perform CRUD operations. CRUD stands for create, read, update, and delete. These are the four basic functions of persistent storage.

Now, what you can do is encapsulate the complexity of writing SQL queries into an OMR (object-relational mapping) class that handles all the database operations. The OMR will be capable of interacting with SQL databases such as MySQL or PostgreSQL.

Let’ code

To start with, we need to import the necessary libraries and define a base class for all the OMR models.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# Define the base class for the ORM models
Base = declarative_base()

We are ready now to define a model. A model represents a table in our database. Here I am interested in building a document class to store my data. We need a user ID which will be our primary key for this model. If you are familiar with SQL, you should be familiar with this. In the model class, we also need to specify the table name to which the Python class will be associated. Let’s also rewrite the magic method repr to be sure our objects will be represented and printed in a good way.

# Define the Document model
class Document(Base):
    __tablename__ = 'documents'

    id = Column(Integer, primary_key=True, autoincrement=True)
    author = Column(String, nullable=False)
    content = Column(String, nullable=False)
    link = Column(String, nullable=False)

    def __repr__(self):
        return f"<Document(id={self.id}, author='{self.author}', content='{self.content}', link='{self.link}')>"

Let’s define a SQLite database

engine = create_engine('sqlite:///documents.db', echo=True)
Base.metadata.create_all(engine)

It is also important to initialize a session with which we will actually be able to perform operations over our database.

# Create a session factory
Session = sessionmaker(bind=engine)
session = Session()

We can add data to our database using Python!

# Add some documents to the database
def seed_data():
    doc1 = Document(author='John Doe', content='This is a sample document about SQLAlchemy.', link='http://example.com/doc1')
    doc2 = Document(author='Jane Smith', content='Understanding relationships in SQLAlchemy.', link='http://example.com/doc2')
    doc3 = Document(author='Alice Johnson', content='An introduction to Python ORM.', link='http://example.com/doc3')

    session.add_all([doc1, doc2, doc3])
    session.commit()

Same for querying the DB. We can also filter based on some attributes like the author’s name.

# Query the database
def query_data(author:str = None):

    if not author:
      print("nDocuments:")
      for document in session.query(Document).all():
          print(document)

    print(f"nDocuments by {author}:")
    john_doe_docs = session.query(Document).filter(Document.author == author).all()
    for document in john_doe_docs:
        print(document)

Check this link to learn more about SQLAlchemy ORM.

The ODM software pattern

The ODM pattern which stands for object document mapping is extremely similar to the ORM pattern, but instead of working with relational databases, we work with NoSQL databases such as MongoDB.

ODM maps object-oriented code to JSON-like documents and stores them in the database.

This means that the class we are going to build will shape how the JSON data is represented in the DB. For this reason, we will adopt Pydantic, which provides:

  • validation: automatically checks if the input data matches the expected types and constraints
  • serialization.: converts Python objects to and from formats like JSON or dictionaries.

Let’ code

We start with the imports as usual.

from pydantic import BaseModel, Field
from uuid import UUID, uuid4
from typing import Any, Dict
from pymongo import MongoClient

The class we are going to build will inherit from Pydantic BaseModel to use Pydantics features.

The type UUID is used to specify that the id variable should be a universally unique identifier. The Field class is provided by Pydantic to add constraints to variables (for validation) or to add some metadata. _defaultfactory=uuid4 means that when a document is created without an ID the method uuid4() is automatically called to create an ID. uuid4() generates a random number.

class Document(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    author: str
    content: str
    link: str

Now I want to define a method to generate an instance of the class starting from a dictionary object.

This will be a class method. In this way, we have a reference to the class (cls) and can leverage this to use the Pydantic feature to automatically convert a JSON-like object into a class instance.

The type returned is therefore a Document. We need to use the apexes ‘Document’ to return the type of the class we are implementing in the class itself.

    @classmethod
    def from_mongo(cls, data: Dict[str, Any]) -> 'Document':
        """
        Convert a MongoDB document to a Document instance.
        This ensures the MongoDB '_id' field is mapped to 'id'.
        """
        if '_id' in data:
            data['id'] = data.pop('_id')
        return cls(**data)

In a similar way we implement the to_mongo() method.


    def to_mongo(self) -> Dict[str, Any]:
        """
        Convert a Document instance to a MongoDB document.
        This ensures the 'id' field is mapped to '_id'.
        """
        data = self.dict()
        data['_id'] = data.pop('id')
        return data

The entire class then would be the following:

class Document(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    author: str
    content: str
    link: str

    @classmethod
    def from_mongo(cls, data: Dict[str, Any]) -> 'Document':
        """
        Convert a MongoDB document to a Document instance.
        This ensures the MongoDB '_id' field is mapped to 'id'.
        """
        if '_id' in data:
            data['id'] = data.pop('_id')
        return cls(**data)

    def to_mongo(self) -> Dict[str, Any]:
        """
        Convert a Document instance to a MongoDB document.
        This ensures the 'id' field is mapped to '_id'.
        """
        data = self.dict()
        data['_id'] = data.pop('id')
        return data

We can instantiate a MongoDB connection and use these methods to insert and fetch data easily to and from Mongo.

# Example usage with MongoDB
def main():
    # Connect to the local MongoDB instance
    client = MongoClient("mongodb://localhost:27017/")
    db = client.test_database
    collection = db.documents

    # Create a new document instance
    new_document = Document(author="Jane Doe", content="Learning MongoDB with Pydantic.", link="http://example.com/doc2")

    # Insert the document into MongoDB
    inserted_id = collection.insert_one(new_document.to_mongo()).inserted_id
    print(f"Inserted document with _id: {inserted_id}")

    # Fetch the document back from MongoDB
    mongo_document = collection.find_one({"_id": inserted_id})

    # Convert the MongoDB document to a Document instance
    document_instance = Document.from_mongo(mongo_document)
    print("Fetched and converted to Document instance:", document_instance)

    # Clean up (delete the inserted document)
    collection.delete_one({"_id": inserted_id})
    print("Deleted the inserted document.")

# Run the example
main()

Done!

This is a very simple example. You might consider implementing some other methods to streamline the interaction with Mongo. For example, some methods you could implement are:

  • save(): to allow an instance of the model to be inserted into the database
  • get_or_create(): method which attempts to find a document in a database, if not existing it is created.
  • bulk_insert(): to insert multiple documents in the database
  • find(): search in the database using advanced filters
  • bulk_find(): find multiple documents
  • get_collection_name(): determines the name of the MongoDB collection

Final Thoughts

In this article, I described how to implement the ORM and ODM paradigms to streamline the work with relational and document-based databases. These approaches abstract the complexity of managing the database and writing SQL queries by writing clean and maintainable code.

Follow me on Medium if you like this article! 😁

💼 Linkedin ️| 🐦 X (Twitter) | 💻 Website


Related Articles