---
title: 'Turning around a Django site without a single line of Python (This Old Pony #68)'
layout: newsletter
published: true
date: '2018-11-13T11:04:00.000Z'
---

It’s not magic and it’s not a trick. It’s not even a clickbait bait and switch. Note, however, that I didn’t say “without a line of *any* code”.

You see, the heart of 99.9% of all Django applications is a database, usually a relational database like PostgreSQL or MySQL. Its health, or at least the health of the data, is affected by the application and in turn affects how the application works for users.

It’s not uncommon to find database data and database design issues behind what look like pure application bugs. Now these may indeed be bugs, but they’re bugs that only affect a small part of the application, or only one direction of the customer-application data pipeline. Treating the data more strictly in the database would likely surface those errors earlier and prevent problems from bubbling up later.

I’ve mentioned some of these before, like the scourge of nullable fields[0]. However, there are all kinds of other constraints we expect our data to satisfy that, when left unapplied or applied improperly, result in headaches and heartburn. And you can solve for them by adding constraints directly to your database without writing even a line of Python (although writing that Python might be a very good idea).

A concrete example should motivate this.

We recently had a client come to us with a problem in part of a data publishing process. In this process, publishing a document version has the effect of changing the status of the current version to “archived” and creating a fresh copy in the “published” state. The latter is what the API presents to customers.

Whether due to customer excitement or something else in the long trail from customer seat to web application server, some requests to publish data were being registered twice in such quick succession that multiple published versions of a document were created. As a result, customers hitting the high-volume API would see errors fetching their data due to the underlying MultipleObjectsReturned error.

Now, the ultimate goal is to prevent a customer from publishing with such superhuman speed in the first place, so that duplicate requests like this never get through. But the more pressing concern was the data: our client’s business is based on that API, so delivering it correctly came first.

A first-pass solution is to replace “get” queryset methods with “first” methods. This works, although it necessitates some change in the logic (checking for None rather than catching an ObjectDoesNotExist exception). But while it works, it’s still unsatisfactory. Where else in the code might we find a dependence on the implied 1:1 relationship of a published version with its source document? And what kind of problems could this pose for understanding the data in the future?

For the former, we could wrap this logic in a queryset method and use it everywhere. Such a method could safely return a single object even when several published versions exist, and still raise the proper exception when none do. But it doesn’t solve the data problem. To do that we need a unique relationship.

Unique constraints are available right in our Django models, in the Meta options class. The problem is that they’re universal. That is, we can set up a unique constraint tying a document version to its source document, but it will apply to every single document version, even those that are not published. We want to ensure that there is only one published version, but we don’t care if there are 200 archived versions. The “unique_together” option is of no help here.

However, the database can do this! PostgreSQL, among others, has a concept of partially indexing columns (fields) on a database table (model). And we can use it to create a unique constraint that applies only to documents in the published state. It can be applied with one line of SQL (forgive the line breaks, added for readability):
 

```sql
CREATE UNIQUE INDEX published_documents
    ON document_versions (document_id)
    WHERE is_published IS TRUE;
```

Now, in practice such a constraint should be added via a Django database migration, which, yes, involves adding lines of Python, but you don’t *have to* do that to get the core benefit. And that benefit is that the database will _never let duplicated versions be published again_. If nothing else were updated, and no other improvements made to the UI, the worst that would happen is that a customer would see a brief error message in their dashboard in addition to the first success message. The data would be published, the data would be consistent, and the API would continue responding as expected.

Additional constraints and checks are fairly simple things to add, but because they’re not supported by Django’s ORM[1][2] they’re easy to forget about. Shoring up your data rules from within the database is a great way not just to reduce certain classes of errors but also to reduce the scope of code your team needs to focus on.

Unconstrainedly yours,
Ben

[0] There are Very Good Reasons (TM) to use nullable fields, but these should be deployed strategically (e.g. supporting legacy fields, migrating new ones, etc).
[1] For now! Field-level checks are on their way in Django 2.2. So are partial indexes, but not, as described here, as a unique constraint across columns.
[2] There are several Django packages that provide some of the aforementioned functionality, including unique constraints (e.g. https://github.com/mattiaslinnap/django-partial-index)