
Introduction
When working on Data Science projects, one of the first pipelines to set up is data collection. Real-world Machine Learning differs from Kaggle-style problems mainly because data is not static: we need to scrape websites, gather data from APIs, and so on. This way of collecting data can look chaotic, and it is! That’s why we need to structure our code following best practices to bring some order to the mess.
The ORM software pattern
Once you have identified the sources from which you want to gather your data, you need to collect it in a structured way so it can be stored in your database. For example, you might decide that, in order to train your LLM, you need documents with three fields: author, content, and link.
One option is to download the data and then write raw SQL queries to store and retrieve it from your database. More commonly, you will want to implement queries for all the CRUD operations. CRUD stands for create, read, update, and delete: the four basic functions of persistent storage.
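To make the contrast concrete, here is a minimal sketch of what raw-SQL CRUD looks like without any ORM, using an in-memory SQLite database and a hypothetical `documents` table with the author/content/link fields mentioned above:

```python
import sqlite3

# Hypothetical in-memory database, just to illustrate hand-written CRUD
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "author TEXT, content TEXT, link TEXT)"
)

# Create
conn.execute(
    "INSERT INTO documents (author, content, link) VALUES (?, ?, ?)",
    ("John Doe", "Hand-written SQL.", "http://example.com/raw"),
)

# Read
row = conn.execute("SELECT author, content FROM documents WHERE id = 1").fetchone()
print(row)  # ('John Doe', 'Hand-written SQL.')

# Update
conn.execute("UPDATE documents SET content = ? WHERE id = 1", ("Edited.",))

# Delete
conn.execute("DELETE FROM documents WHERE id = 1")
conn.commit()
```

Every one of these statements is a string the database cannot check for us; this is exactly the boilerplate the ORM class below will encapsulate.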
Now, what you can do is encapsulate the complexity of writing SQL queries in an ORM (object-relational mapping) class that handles all the database operations. The ORM will be capable of interacting with SQL databases such as MySQL or PostgreSQL.
Let’s code
To start with, we need to import the necessary libraries and define a base class for all the ORM models.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

# Define the base class for the ORM models
Base = declarative_base()
We are now ready to define a model. A model represents a table in our database. Here I am interested in building a Document class to store my data. We need an id column, which will be the primary key for this model; if you are familiar with SQL, this will look familiar. In the model class, we also need to specify the table name the Python class will be mapped to. Let’s also override the __repr__ magic method to make sure our objects are represented and printed nicely.
# Define the Document model
class Document(Base):
    __tablename__ = 'documents'

    id = Column(Integer, primary_key=True, autoincrement=True)
    author = Column(String, nullable=False)
    content = Column(String, nullable=False)
    link = Column(String, nullable=False)

    def __repr__(self):
        return f"<Document(id={self.id}, author='{self.author}', content='{self.content}', link='{self.link}')>"
Let’s define a SQLite database:
engine = create_engine('sqlite:///documents.db', echo=True)
Base.metadata.create_all(engine)
It is also important to initialize a session, through which we will actually perform operations on our database.
# Create a session factory
Session = sessionmaker(bind=engine)
session = Session()
We can add data to our database using Python!
# Add some documents to the database
def seed_data():
    doc1 = Document(author='John Doe', content='This is a sample document about SQLAlchemy.', link='http://example.com/doc1')
    doc2 = Document(author='Jane Smith', content='Understanding relationships in SQLAlchemy.', link='http://example.com/doc2')
    doc3 = Document(author='Alice Johnson', content='An introduction to Python ORM.', link='http://example.com/doc3')
    session.add_all([doc1, doc2, doc3])
    session.commit()
The same goes for querying the DB. We can also filter on attributes such as the author’s name.
# Query the database
def query_data(author: str = None):
    if not author:
        print("\nDocuments:")
        for document in session.query(Document).all():
            print(document)
    else:
        print(f"\nDocuments by {author}:")
        docs = session.query(Document).filter(Document.author == author).all()
        for document in docs:
            print(document)
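So far we have covered create and read; for completeness, here is a sketch of the remaining CRUD operations, update and delete, with the same ORM approach. It is restated self-contained, assuming an in-memory SQLite database instead of the file-based one above:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Document(Base):
    __tablename__ = 'documents'
    id = Column(Integer, primary_key=True, autoincrement=True)
    author = Column(String, nullable=False)
    content = Column(String, nullable=False)
    link = Column(String, nullable=False)

# In-memory database for this sketch
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

doc = Document(author='John Doe', content='Draft.', link='http://example.com/doc1')
session.add(doc)
session.commit()

# Update: mutate the mapped object and commit; the ORM emits the UPDATE for us
doc.content = 'Final version.'
session.commit()

# Delete: remove the object and commit; the ORM emits the DELETE
session.delete(doc)
session.commit()

print(session.query(Document).count())  # 0
```

Notice that we never wrote a line of SQL: the session tracks our objects and generates the statements.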
Check this link to learn more about SQLAlchemy ORM.
The ODM software pattern
The ODM (object-document mapping) pattern is extremely similar to the ORM pattern, but instead of working with relational databases, we work with NoSQL databases such as MongoDB.
ODM maps object-oriented code to JSON-like documents and stores them in the database.
This means that the class we are going to build will shape how the JSON data is represented in the DB. For this reason, we will adopt Pydantic, which provides:
- validation: automatically checks that the input data matches the expected types and constraints
- serialization: converts Python objects to and from formats like JSON or dictionaries
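A minimal sketch of these two features, with a hypothetical model (the `year` field is made up just to show type checking):

```python
from pydantic import BaseModel, ValidationError

class Doc(BaseModel):
    author: str
    year: int  # hypothetical field, just to demonstrate validation

# Validation: the string "2023" is checked and coerced to the int 2023
d = Doc(author="Jane", year="2023")

# Serialization: convert the object to a plain dictionary
print(d.dict())

# Invalid input raises a ValidationError instead of silently passing through
try:
    Doc(author="Jane", year="not a number")
except ValidationError:
    print("validation failed")
```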
Let’s code
We start with the imports as usual.
from pydantic import BaseModel, Field
from uuid import UUID, uuid4
from typing import Any, Dict
from pymongo import MongoClient
The class we are going to build will inherit from Pydantic’s BaseModel so it can use Pydantic’s features.
The UUID type specifies that the id field should be a universally unique identifier. The Field class is provided by Pydantic to add constraints to fields (for validation) or to attach metadata. default_factory=uuid4 means that when a document is created without an ID, uuid4() is automatically called to create one. uuid4() generates a random UUID.
class Document(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    author: str
    content: str
    link: str
Now I want to define a method that builds an instance of the class from a dictionary object. This will be a class method: that way we have a reference to the class (cls) and can leverage Pydantic to automatically convert a JSON-like object into a class instance. The return type is therefore Document. We need to put quotes around 'Document' because we are returning the type of the very class we are still defining.
@classmethod
def from_mongo(cls, data: Dict[str, Any]) -> 'Document':
    """
    Convert a MongoDB document to a Document instance.
    This ensures the MongoDB '_id' field is mapped to 'id'.
    """
    if '_id' in data:
        data['id'] = data.pop('_id')
    return cls(**data)
In a similar way we implement the to_mongo() method.
def to_mongo(self) -> Dict[str, Any]:
    """
    Convert a Document instance to a MongoDB document.
    This ensures the 'id' field is mapped to '_id'.
    """
    data = self.dict()
    data['_id'] = data.pop('id')
    return data
The entire class then would be the following:
class Document(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    author: str
    content: str
    link: str

    @classmethod
    def from_mongo(cls, data: Dict[str, Any]) -> 'Document':
        """
        Convert a MongoDB document to a Document instance.
        This ensures the MongoDB '_id' field is mapped to 'id'.
        """
        if '_id' in data:
            data['id'] = data.pop('_id')
        return cls(**data)

    def to_mongo(self) -> Dict[str, Any]:
        """
        Convert a Document instance to a MongoDB document.
        This ensures the 'id' field is mapped to '_id'.
        """
        data = self.dict()
        data['_id'] = data.pop('id')
        return data
We can instantiate a MongoDB connection and use these methods to insert and fetch data easily to and from Mongo.
# Example usage with MongoDB
def main():
    # Connect to the local MongoDB instance
    # (uuidRepresentation='standard' lets PyMongo encode our UUID ids)
    client = MongoClient("mongodb://localhost:27017/", uuidRepresentation="standard")
    db = client.test_database
    collection = db.documents

    # Create a new document instance
    new_document = Document(author="Jane Doe", content="Learning MongoDB with Pydantic.", link="http://example.com/doc2")

    # Insert the document into MongoDB
    inserted_id = collection.insert_one(new_document.to_mongo()).inserted_id
    print(f"Inserted document with _id: {inserted_id}")

    # Fetch the document back from MongoDB
    mongo_document = collection.find_one({"_id": inserted_id})

    # Convert the MongoDB document to a Document instance
    document_instance = Document.from_mongo(mongo_document)
    print("Fetched and converted to Document instance:", document_instance)

    # Clean up (delete the inserted document)
    collection.delete_one({"_id": inserted_id})
    print("Deleted the inserted document.")

# Run the example
if __name__ == "__main__":
    main()
Done!
This is a very simple example. You might consider implementing some other methods to streamline the interaction with Mongo. For example, some methods you could implement are:
- save(): insert an instance of the model into the database
- get_or_create(): attempt to find a document in the database, creating it if it does not exist
- bulk_insert(): insert multiple documents into the database
- find(): search the database using advanced filters
- bulk_find(): find multiple documents
- get_collection_name(): determine the name of the MongoDB collection
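As one sketch, here is how a get_or_create() helper could look. The FakeCollection stub is hypothetical, standing in for a pymongo Collection (which exposes the same find_one/insert_one methods) so the logic can be demonstrated without a running MongoDB:

```python
from typing import Any, Dict, Optional, Tuple

def get_or_create(collection, query: Dict[str, Any],
                  defaults: Optional[Dict[str, Any]] = None) -> Tuple[Dict[str, Any], bool]:
    """Return (document, created): find a matching document or insert a new one."""
    existing = collection.find_one(query)
    if existing is not None:
        return existing, False
    doc = {**query, **(defaults or {})}
    collection.insert_one(doc)
    return doc, True

# Tiny in-memory stand-in for a pymongo collection, for demonstration only
class FakeCollection:
    def __init__(self):
        self.docs = []
    def find_one(self, query):
        return next((d for d in self.docs
                     if all(d.get(k) == v for k, v in query.items())), None)
    def insert_one(self, doc):
        self.docs.append(doc)

col = FakeCollection()
doc, created = get_or_create(col, {"author": "Jane"}, {"content": "New"})
print(created)   # True: nothing matched, so the document was inserted
doc2, created2 = get_or_create(col, {"author": "Jane"})
print(created2)  # False: the existing document was returned
```

With a real pymongo collection the same function works unchanged, since it only relies on find_one() and insert_one().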
Final Thoughts
In this article, I described how to implement the ORM and ODM patterns to streamline work with relational and document-based databases. These approaches hide the complexity of managing the database and writing SQL queries behind clean, maintainable code.
Follow me on Medium if you like this article! 😁
💼 Linkedin ️| 🐦 X (Twitter) | 💻 Website