
pydantic

The library you must know if you juggle data around

Image by author

Working with complex nested data structures is hard. The traditional way to store this kind of data in Python is nested dictionaries. Although Python dictionaries are amazing, two issues typically arise: (1) How do I, as a developer, know which kind of data to expect in the passed dictionary, and (2) how do I prevent typos?

If you use classes, your editor or mypy prevents typos in attribute names. For dictionaries, any valid string can be a key.

A solution to both problems is to use a library: pydantic. It is a validation and parsing library which maps your data to a Python class.


Prerequisites

Install pydantic via

pip install pydantic

For this article, I assume that your data is a network of people in [people.json](https://gist.github.com/MartinThoma/517d20998501afc4fff72be032782d41). Each person has an ID, a name, a list of friends given by their IDs, a birthdate, and the amount of money in their bank account.

With this example.py, we want to calculate how much more money everybody has than their median friend:
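The original gist is not embedded here, so the following is only a rough sketch of what such an example.py could look like with plain dictionaries; the key names and helper functions are assumptions for illustration:

```python
import json
from decimal import Decimal
from statistics import median
from typing import Any, Dict, List


def load_people(path: str = "people.json") -> List[Dict[str, Any]]:
    """Read the raw person dictionaries from the JSON file."""
    with open(path) as fp:
        return json.load(fp)


def delta_to_median_friend(people: List[Dict[str, Any]]) -> Dict[int, Decimal]:
    """How much more money does each person have than their median friend?"""
    by_id = {person["id"]: person for person in people}
    deltas = {}
    for person in people:
        friends = person["friends"] or []  # "friends" may be None
        balances = [Decimal(str(by_id[fid]["bank_account"])) for fid in friends]
        if balances:
            own = Decimal(str(person["bank_account"]))
            deltas[person["id"]] = own - median(balances)
    return deltas


if __name__ == "__main__":
    print(delta_to_median_friend(load_people()))
```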

The problems we want to approach are:

  • Ugly None: Having None as a value for friends is pretty ugly. In some cases, None is different from an empty list. In this case, let’s just assume it means the same thing. Replacing missing data with something else is called imputation, and there are several techniques for it.
  • Ugly Any: The type annotations have Dict[str, Any], because it was considered too complicated or even impossible to know which values the dictionary representing a person can map to.
  • Ugly indexing: It’s just syntax, but ["id"] is 50% longer than .id . This is the reason why bunch/munch exists.
  • Typos: If you make a typo in any of the string indices of a dictionary, no static code analysis tool can help you detect it, and auto-completion will not work properly either.
  • Late Errors: Good software modularizes responsibilities. There is one module responsible for the business logic and one for input/output. When I write "module" I mean a unit of code – it could be a Python module, a class, a function, a method. It is bad to have a single function deal with those different types of complexity. Getting the business logic right is inherently different from making sure your input/output looks as expected. One is about defining and validating proper interfaces, the other is about understanding the domain.
  • Documentation: New developers regularly have to read the code of bigger applications. Usually, there is no time for somebody who knows everything about the application to explain every single part in great detail. Most of the time, there isn’t even a single person who knows everything – even if the project was written by a single developer. I always forget parts of my code. Hence documentation is crucial. In Python, documenting a function’s expected parameter types and return values is crucial. Having Dict[str, Any] is better than nothing, but way worse than Person.

Create a pydantic model

We create a new type for the ID of a person, simply because PersonId is so much more meaningful than just int. Then we subclass [pydantic.BaseModel](https://pydantic-docs.helpmanual.io/usage/models/#basic-model-usage):
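A possible version of the model (the exact field names are assumptions based on the description of people.json above):

```python
from datetime import date
from decimal import Decimal
from typing import List, NewType, Optional

from pydantic import BaseModel

PersonId = NewType("PersonId", int)


class Person(BaseModel):
    id: PersonId
    name: str
    bank_account: Decimal
    birthdate: date
    friends: Optional[List[PersonId]] = None
```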


Use it for input parsing

Next, we use [parse_file_as](https://pydantic-docs.helpmanual.io/usage/models/#parsing-data-into-a-specified-type) to read the JSON file:
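Assuming the Person model from above, this could look roughly like:

```python
from typing import List

from pydantic import parse_file_as

people: List[Person] = parse_file_as(List[Person], "people.json")
```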

Please note that datetime and Decimal are parsed automatically – you should still always look up how exactly that is done. Doing data validation early is good because errors then also happen early and in a known place. This means pydantic nudges you towards a good design. I love it 😍


Constrained Types

Constrained types are integers/floats in a certain value range or strings that match a RegEx 😃
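A small sketch of how constrained fields can look in pydantic v1 (the field names and bounds here are made up):

```python
from pydantic import BaseModel, conint, constr


class ConstrainedExample(BaseModel):
    age: conint(ge=0, le=130)                  # integer within a value range
    country_code: constr(regex=r"^[A-Z]{2}$")  # string that matches a RegEx
```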


Missing data: Use default values

If your JSON might be missing some attributes which you want to have, you need to work with default values. A typical default is None, which means that you need to change the type to Optional[what it was before]. This is typically pretty ugly, as you then need to check for None later in the code.

Especially for lists, you might want to consider using an empty list instead. You do it like this:
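A sketch of both variants (the concrete defaults here are invented):

```python
from typing import List

from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str = "unknown"                             # immutable default: assign directly
    friends: List[int] = Field(default_factory=list)  # mutable default: use a factory
```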

For immutable data types like strings, integers, floats, and tuples, you can simply assign the value. For mutable ones, you need to use Field with a default_factory that generates a new list every time. Learn why mutable defaults are evil if you don’t know it already.


Additional data: Ignore, allow, or forbid

It’s sometimes impossible to know at development time which attributes a JSON object has. Still, you need to pass them around. This is super unfortunate and should be challenged, but it can happen.

Pydantic calls those extras. If you ignore them, the parsed pydantic model will not know about them – ignored extra arguments are simply dropped. Allowing them means accepting that this unfortunate design is necessary; allowed extras will be part of the parsed object. Finally, forbidding extra arguments means a pydantic.ValidationError exception is raised if an extra argument occurs.

This is configured by adding a subclass called Config to the pydantic model:
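For example, forbidding extras could look like this (pydantic v1; Extra.ignore and Extra.allow are the other two options):

```python
from pydantic import BaseModel, Extra


class Person(BaseModel):
    id: int
    name: str

    class Config:
        extra = Extra.forbid  # or Extra.ignore / Extra.allow


Person(id=1, name="Ada", nickname="ada")  # raises pydantic.ValidationError
```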


Rename attributes

Names are important. Readability counts. In Python, variables should follow a snake_case naming scheme, while in JavaScript variables should follow a camelCase naming scheme. To fulfill both, pydantic offers [allow_population_by_field_name](https://pydantic-docs.helpmanual.io/usage/model_config/) as a config parameter.
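A sketch of how this could look, assuming the JSON uses a camelCase key bankAccount:

```python
from decimal import Decimal

from pydantic import BaseModel, Field


class Person(BaseModel):
    bank_account: Decimal = Field(alias="bankAccount")

    class Config:
        # Accept both Person(bankAccount=...) and Person(bank_account=...)
        allow_population_by_field_name = True
```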


Validators

Sometimes, simple types are not enough. You want to check more complex stuff.

The docs already give a pretty good example of such a scenario:
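Paraphrased from memory into a condensed sketch, the scenario from the pydantic v1 docs checks that two password fields match:

```python
from pydantic import BaseModel, validator


class UserModel(BaseModel):
    username: str
    password1: str
    password2: str

    @validator("password2")
    def passwords_match(cls, v, values):
        # `values` holds the fields that were already validated
        if "password1" in values and v != values["password1"]:
            raise ValueError("passwords do not match")
        return v
```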

You can check pretty much anything as long as you only need the class itself. Please don’t run queries against a database to do consistency checks, e.g. to ensure that a username is unique. Although you can likely make this work, it would be unexpected for the creation of a "data container" to trigger a database query.

In our case, we might want to prevent people from being friends with themselves:
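A sketch of such a validator (note that id must be declared before friends so it is already available in values):

```python
from typing import List, Optional

from pydantic import BaseModel, validator


class Person(BaseModel):
    id: int
    friends: Optional[List[int]] = None

    @validator("friends")
    def not_a_friend_of_yourself(cls, friends, values):
        if friends and values.get("id") in friends:
            raise ValueError("a person cannot be their own friend")
        return friends
```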

Instead of throwing an exception, we can also simply fix it:
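The same validator could silently clean the list instead (again only a sketch):

```python
from typing import List, Optional

from pydantic import BaseModel, validator


class Person(BaseModel):
    id: int
    friends: Optional[List[int]] = None

    @validator("friends")
    def remove_self_friendship(cls, friends, values):
        # Drop the person's own id from the list instead of raising an error
        if friends is None:
            return friends
        return [friend_id for friend_id in friends if friend_id != values.get("id")]
```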


Property-based Testing with Pydantic

Photo by Science in HD on Unsplash

Property-based tests auto-generate inputs for the function under test and make sure a certain property is fulfilled. In the simplest case, this property is that the function under test does not crash. If you want to learn more about this type of testing, read my article about property-based testing with hypothesis.
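The original test is not shown here; a minimal sketch with hypothesis could look like this (the asserted property is just an illustration on my part, and the module name example is an assumption):

```python
from example import Person  # the Person model defined above
from hypothesis import given, strategies as st


@given(person_a=st.from_type(Person), person_b=st.from_type(Person))
def test_two_people(person_a: Person, person_b: Person) -> None:
    # Illustrative property: any two bank accounts should be comparable.
    assert (person_a.bank_account <= person_b.bank_account
            or person_a.bank_account >= person_b.bank_account)
```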

By the way, this test actually pointed out a potential issue:

Falsifying example: test_two_people(
    person_a=Person(id=0, name='', bank_account=Decimal('NaN'), birthdate=datetime.date(2000, 1, 1), friends=[]),
    person_b=Person(id=0, name='', bank_account=Decimal('NaN'), birthdate=datetime.date(2000, 1, 1), friends=[]),
)

More neat stuff

Pydantic is pretty awesome:

  • You can generate a schema from its models (source),
  • a mypy plugin gives even better type checks,
  • serialization to a dictionary can be done with the .dict() method, and serialization to a JSON string with the .json() method – see the small sketch below.
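Using the Person model from above, that could look like this:

```python
person = Person(
    id=1,
    name="Ada",
    bank_account="1000.00",
    birthdate="1990-01-01",
    friends=[2, 3],
)

print(Person.schema_json(indent=2))  # JSON schema generated from the model
print(person.dict())                 # serialize to a Python dictionary
print(person.json())                 # serialize to a JSON string
```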

Operational Safety

Photo by ThisisEngineering RAEng on Unsplash

One part that usually worries me is general support. Here are some indicators that pydantic has a healthy community:

  • ✔️ GitHub: 4.5k stars, 404 forks, 172 contributors
  • ✔️ Usage: FastAPI uses it. Microsoft Onefuzz uses it. AWS Lambda Powertools as well, and many machine learning projects do, too.
  • PyPI project: Sadly, this only has one maintainer. I always like it a bit better if there are two, simply to prevent problems if one of them loses their password.
  • ✔️ Self-decided project status: The maintainer considers pydantic to be production-ready/stable.

Summary

pydantic is an awesome data parsing and validation library. It can help you a lot in getting better type annotations into your Python code. Use it!

