How to Make the Most of Pydantic

Explore techniques for data contract validation, higher interoperability with JSON Schemas, and simplified data model processing.

Pere Miquel Brull
Towards Data Science

--

Pydantic has been a game-changer in defining and using data types. It makes the code way more readable and robust while feeling like a natural extension to the language.

It is an easy-to-use tool that helps developers validate and parse data based on given definitions, all fully integrated with Python’s type hints. The principal use cases include reading application configurations, checking API requests, and creating any data structure one might need as an internal building block.

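As a minimal sketch of what a Pydantic model looks like (Pydantic v1 syntax, which this post follows throughout):

from pydantic import BaseModel

class Cat(BaseModel):
    name: str
    age: int

# Values are validated when the instance is created
felix = Cat(name="Felix", age=3)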

While these definitions might already be familiar to some readers, we will explore different techniques to:

  • Make our usage of Pydantic safer and easier to debug by properly enforcing data contracts.
  • Achieve higher interoperability with JSON Schemas.
  • Simplify data model processing with Python’s built-in functions.

Parsing data the safe way

JSON is the language of the web. Most data scientists and engineers have stumbled upon a file or API from which they expect a consistent structure before they can safely process its data.

Let’s imagine we are retrieving data from an API that works as a cat directory. For a specific endpoint, we expect the following data contract:

{
  "name": string,
  "age": integer,
  "address": {
    "city": string,
    "zip_code": string,
    "number": integer
  }
}

The structure defines a cat entry with a nested address definition. Defining Pydantic models to tackle this could then look like the code below:
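
# A sketch mirroring the contract above, field for field (Pydantic v1)
from pydantic import BaseModel

class Address(BaseModel):
    city: str
    zip_code: str
    number: int

class CatRequest(BaseModel):
    name: str
    age: int
    address: Address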

Notice how easily we can come up with a couple of models that match our contract. In this scenario, the definitions only required one nesting level, but Pydantic allows for straightforward combinations of any number of models.

The next step is to parse the given data using these schemas, transforming a raw JSON into a specific Python object that we can quickly play with.
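
Assuming the CatRequest model above and a made-up payload, a sketch of the parsing step:

raw = '{"name": "Felix", "age": 3, "address": {"city": "Barcelona", "zip_code": "08001", "number": 7}}'

# Pydantic v1: parse_raw decodes the JSON and validates it in one step
cat = CatRequest.parse_raw(raw)

# Equivalent: CatRequest(**json.loads(raw))
print(cat.name)          # Felix
print(cat.address.city)  # Barcelona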

All fields are transformed into attributes known by the Python interpreter and IDEs, which is a huge help.

So far, we have stayed on the happy path. However, seasoned developers plan further than that. What if the API starts sending different data? When everything is automated, teams need to make sure they can quickly pick up issues and safely react to them. Let’s see what happens if the contract breaks without updated definitions:
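
For instance, imagine the API starts sending the age as a string (a sketch with made-up values):

broken = {
    "name": "Felix",
    "age": "three",  # the contract promised an integer
    "address": {"city": "Barcelona", "zip_code": "08001", "number": 7},
}

CatRequest(**broken)  # raises pydantic.ValidationError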

Pydantic to the rescue! As the age field is defined as an int but we received a str, the snippet will raise a ValidationError. The same validation exception will be thrown for other kinds of inconsistencies, such as a missing field (if it is not defined as Optional).

Breaking the contract

What if the exact opposite happens? If the API starts adding more and more fields that were not part of the contract, and this scenario is not monitored correctly, we might be missing some hidden bugs.
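
For example, a payload carrying unexpected key and key2 fields parses without complaint (again, made-up values):

payload = {
    "name": "Felix",
    "age": 3,
    "address": {"city": "Barcelona", "zip_code": "08001", "number": 7},
    "key": "not in the contract",
    "key2": "also not in the contract",
}

cat = CatRequest(**payload)  # no error: key and key2 are silently dropped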

This snippet runs perfectly, and the data object will only contain the defined fields, but it is silently masking inconsistencies. The absence of an exception or a log line in a specific piece of code is no guarantee that the whole process is working as expected.

While this is the default behavior in Pydantic, we can tweak the configuration to forbid any additional field from being sent to the class definition:
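
# Pydantic v1: Extra.forbid in the model's Config rejects undeclared fields
from pydantic import BaseModel, Extra

class CatRequest(BaseModel):
    class Config:
        extra = Extra.forbid

    name: str
    age: int
    address: Address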

This small change in the configuration of the class will now throw a ValidationError when parsing data with extra fields:

ValidationError: 2 validation errors for CatRequest
key
  extra fields not permitted (type=value_error.extra)
key2
  extra fields not permitted (type=value_error.extra)

Interoperability

Those into Data Mesh literature will have come across this term already. When breaking down our siloed platforms and teams into smaller domains, we must share global data definitions among separate systems.

As technical enforcement will possibly be out of the question, the governance team needs to make sure that polyglot architectures can safely share and evolve from central and unique schemas that build a single source of truth.

One way of achieving a language-agnostic solution is by defining data structures as JSON Schemas. Luckily, this does not take Pydantic out of the equation, as there is a fantastic project that helps developers translate JSON Schemas directly into Pydantic models.

Let’s iterate over the example above and build a JSON Schema with it:

{
  "$id": "https://catz.org/schema/api/data/catRequest.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CatRequest",
  "description": "Cat API request definition.",
  "type": "object",
  "definitions": {
    "address": {
      "description": "Defines the city, code and number.",
      "type": "object",
      "properties": {
        "city": {
          "description": "Cat's city",
          "type": "string"
        },
        "zip_code": {
          "description": "Postal code",
          "type": "string"
        },
        "number": {
          "description": "House number",
          "type": "integer"
        }
      },
      "additionalProperties": false
    }
  },
  "properties": {
    "name": {
      "description": "Cat's name.",
      "type": "string"
    },
    "age": {
      "description": "Cat's age, in cat years.",
      "type": "integer"
    },
    "address": {
      "description": "Where does the cat live.",
      "$ref": "#/definitions/address"
    }
  },
  "required": ["name"],
  "additionalProperties": false
}

Here we define our CatRequest schema, describing not only its properties (name, age, and address) but also the nested address definition in one go.

Two valuable characteristics of JSON Schemas are:

  • The ability to select which fields are mandatory and optional by passing an array of required field names. For the sake of the explanation, we just chose name as a must-have.
  • Using additionalProperties, developers can easily control that these definitions do not become key-value dumps. Defining a contract is only helpful if we can ensure that it holds.

Let’s now convert the schema into Pydantic classes by using the datamodel-codegen CLI:

$ pip install datamodel-code-generator
$ datamodel-codegen --input cat.json --input-file-type jsonschema --output cat.py

This stores the model definitions in cat.py, adding not only the class configuration but also proper Optional typing and field descriptions.
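
The exact output depends on the datamodel-codegen version, but the generated cat.py looks roughly like this sketch:

from __future__ import annotations

from typing import Optional

from pydantic import BaseModel, Extra, Field

class Address(BaseModel):
    class Config:
        extra = Extra.forbid

    city: Optional[str] = Field(None, description="Cat's city")
    zip_code: Optional[str] = Field(None, description="Postal code")
    number: Optional[int] = Field(None, description="House number")

class CatRequest(BaseModel):
    class Config:
        extra = Extra.forbid

    name: str = Field(..., description="Cat's name.")
    age: Optional[int] = Field(None, description="Cat's age, in cat years.")
    address: Optional[Address] = Field(None, description="Where does the cat live.")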

JSON Schema has been key for powering our centralized schemas at OpenMetadata. With tools such as datamodel-codegen, we can then use the same definitions in different modules of the code built with multiple languages.

Singledispatch

It is common to have a single processing method where a pipeline has to run different logic for each data model. Developers often fall into an if-else frenzy that sets off every complexity alarm in the code.

What if there was a better way? What if it were built into Python? The singledispatch decorator and Pydantic are a match made in heaven. This decorator is a powerful tool in the functools module, helping maintain a more scalable codebase by registering specific behaviors for each data type.


Single dispatching is Python’s way of implementing function overloading, i.e., calling a single function that will know which internal logic to run depending on the arguments it receives. However, note that singledispatch only considers the type of the first argument. Luckily, we can package all the data we need into this single argument by using a Pydantic model.
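
A sketch with two hypothetical models (CatModel and DogModel) and a single entry point:

from functools import singledispatch

from pydantic import BaseModel

class CatModel(BaseModel):
    name: str

class DogModel(BaseModel):
    name: str

@singledispatch
def process(model) -> str:
    # Fallback for any type without a registered implementation
    raise NotImplementedError(f"No processing logic for {type(model).__name__}")

@process.register
def _(model: CatModel) -> str:
    return f"Processing the cat {model.name}"

@process.register
def _(model: DogModel) -> str:
    return f"Processing the dog {model.name}"

process(CatModel(name="Felix"))  # "Processing the cat Felix"
process(DogModel(name="Rex"))    # "Processing the dog Rex"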

With this approach, we can quickly create and test individual functions that respond to specific data needs, integrate new models into the loop without any hassle, and have a clear and direct approach for handling multiple data sources.

Conclusion

Pydantic is one of those tools that created a before and after in the Python ecosystem. Not only does it make developers’ lives easier by greatly improving code quality and readability, but it has also been the foundation for modern frameworks such as FastAPI.

In this post, we’ve seen best practices around:

  • Being able to track any change in a data contract with added or missing fields.
  • Using JSON Schemas to integrate data definitions among distributed systems while still using Pydantic for Python codebases.
  • Applying the singledispatch decorator to design processing pipelines that are easier to read, test, and scale.

--
