Wellfire Interactive // Expertise for established Django SaaS applications

I thought I saw a VARCHAR, or, suboptimal model modeling in Django (This Old Pony #45)

Short quiz: where do you keep your application’s persistent data?

If you answered “in a database”, congratulations![0]

It’s the rare web application that functions without a database of one kind or another, most often a relational, SQL-backed database like PostgreSQL or MySQL[1]. The wonderful thing about relational databases is their use of schemas to define both the individual data fields and the relations between data. This can also be a pain point in applications. Many developers overlook the importance of designing their database, especially when using an ORM, and working through these design decisions later can pose development challenges.

So this week let’s take a look at a few anti-patterns in database design that pop up in Django apps and a few ways to address them. We’ll also touch on a few items that are model-related but not DB-related.

Under-normalized data

In a normalized database, data is organized so that it is not repeated and so that it is grouped logically or descriptively. For example, if we were modeling musical bands and the members of those bands, we could have a “Band” model with a bunch of fields for the names of band members, or for musicians we could have fields listing the various bands they were members of. The problem here is that we need to keep information updated in multiple places. So a better solution, the normal solution, would be to have (i) a musicians table, (ii) a band table, and then (iii) a band members table that links musicians to bands.
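
To make that concrete, here’s a minimal sketch of that normalized layout as Django models - the model names and fields here are purely illustrative:

    from django.db import models


    class Musician(models.Model):
        name = models.CharField(max_length=200)


    class Band(models.Model):
        name = models.CharField(max_length=200)
        # The through model keeps membership out of both the Band and
        # Musician tables, so nothing is repeated in multiple places
        members = models.ManyToManyField(
            Musician, through="BandMember", related_name="bands"
        )


    class BandMember(models.Model):
        musician = models.ForeignKey(Musician, on_delete=models.CASCADE)
        band = models.ForeignKey(Band, on_delete=models.CASCADE)
        joined = models.DateField(null=True, blank=True)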

There’s nothing controversial about this.

Frequently a design starts out with some data added onto a model, and instead of moving to a normalized form the result is just more fields (columns) tacked onto existing models, often with repeated data. For example, multiple address fields (for mailing address, physical address, HQ, etc.).

The bonus here is a simpler query, I suppose, but it means having to repeat things like validation and field constraints - which often leads to errors when changes are required.

It can also require a lot of the explicit conditional checks that disappear when working with iterables instead.
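
Here’s a rough sketch of the difference, with hypothetical Business and Address models - the under-normalized version grows a column (and a conditional) per address, while the normalized version is just something to iterate over:

    from django.db import models


    class Business(models.Model):
        name = models.CharField(max_length=200)
        # The under-normalized approach keeps tacking on columns here:
        # mailing_address, physical_address, hq_address, ... each with its
        # own validation and its own "if business.hq_address:" check


    class Address(models.Model):
        # Normalized: one row per address, one shared set of field
        # definitions, and code can loop over business.addresses.all()
        business = models.ForeignKey(
            Business, on_delete=models.CASCADE, related_name="addresses"
        )
        kind = models.CharField(max_length=20)  # e.g. "mailing", "physical"
        street = models.CharField(max_length=255)
        city = models.CharField(max_length=100)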

Over-normalized databases

Occasionally you might see a structure that suggests an obvious normalized design and implement it, only to never use more than one such relation, meaning that for the purposes of working with the data the normalization is superfluous.

An easy-to-grasp example is contact information. We’ve seen data like phone numbers or locations (like the addresses above) represented in separate tables to allow for alternate scenarios, like a business having multiple locations, only for the result to be a 1:1 equivalent. From a strict database design standpoint there’s nothing wrong with this; however, it adds a small but tangible amount of complexity to development. Updates require additional queries, and things like ModelForms are trickier.

The half-solution: if for some reason changing the database structure is just totally out of the question, you can still simplify the usage by adding properties to the primary model which get and set values in the secondary model (e.g. the location), obviating the need to ever deal directly with the secondary model. Then follow up by updating the save method on the primary model to save the secondary model as well - you might even track the initial values in the primary model to see if they’ve changed before calling the secondary model’s save method, to minimize extraneous queries.
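
A rough sketch of that half-solution, assuming a hypothetical Business model with a one-to-one Location it shouldn’t have to expose (here a simple dirty flag stands in for tracking the initial values):

    from django.db import models


    class Location(models.Model):
        city = models.CharField(max_length=100)


    class Business(models.Model):
        name = models.CharField(max_length=200)
        location = models.OneToOneField(Location, on_delete=models.CASCADE)

        @property
        def city(self):
            # Read through to the secondary model so callers never touch it
            return self.location.city

        @city.setter
        def city(self, value):
            self.location.city = value
            self._location_changed = True

        def save(self, *args, **kwargs):
            super().save(*args, **kwargs)
            # Only hit the database for the secondary model if it changed
            if getattr(self, "_location_changed", False):
                self.location.save()
                self._location_changed = False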

The full solution is to denormalize these data. As an example, for an address model you’d want to add the same field definitions from that model to the parent model and migrate the data over. Provided you’re only concerned about retrieving data, you could add a property to the parent model that returns a dictionary or another structure of only the address fields if you need to satisfy an implied interface somewhere (e.g. a template).
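
For the retrieval-only case, a hypothetical property like this one can keep an “address-shaped” interface alive for templates after the fields have been folded into the parent model:

    from django.db import models


    class Business(models.Model):
        name = models.CharField(max_length=200)
        # Former Address fields, migrated onto the parent model
        street = models.CharField(max_length=255, blank=True)
        city = models.CharField(max_length=100, blank=True)
        postal_code = models.CharField(max_length=20, blank=True)

        @property
        def address(self):
            # Satisfies templates that previously expected an Address object
            return {
                "street": self.street,
                "city": self.city,
                "postal_code": self.postal_code,
            }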

Missing constraints

We can do a lot to validate our data before it gets into the database, but unless your data doesn’t matter at all it’s a Very Good (TM) idea to enforce data rules in the database itself. If you have to choose between assuming the data is correct when it’s added versus assuming it’s correct when it’s retrieved, I’d always go with assuming it’s correct when it’s retrieved, which means we need to add constraints on what’s allowed to be added.

The first of these constraints is uniqueness. If a field should be unique, we shouldn’t need exception handling around non-unique values or relations being returned. This could include singularly unique values or unique combinations (using unique_together).
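
In model terms that’s just a sketch like this (the Subscription model is hypothetical; newer Django versions also offer UniqueConstraint in Meta.constraints):

    from django.conf import settings
    from django.db import models


    class Subscription(models.Model):
        # Singularly unique: the database itself will refuse duplicates
        token = models.CharField(max_length=64, unique=True)
        user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
        plan = models.CharField(max_length=50)

        class Meta:
            # Unique combination: one subscription per user per plan
            unique_together = [("user", "plan")]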

Similarly, if you have bounding values for a field, include these in the model definition, not just in forms. The one caveat here is that Django will not create database constraints for max/min values; rather, these will be used for model-level validation.
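
As a sketch, validators carry those bounds at the model level - enforced when full_clean or a ModelForm runs, not by a database constraint (the Ticket model here is hypothetical):

    from django.core.validators import MaxValueValidator, MinValueValidator
    from django.db import models


    class Ticket(models.Model):
        # Validated at the model level (full_clean, ModelForms), but Django
        # will not create a database-level check constraint from these
        quantity = models.PositiveIntegerField(
            validators=[MinValueValidator(1), MaxValueValidator(10)]
        )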

Wrong or suboptimal field types

This is less common but can be one of the most troubling kinds of wrong schema implementation.

A less problematic example is using a text field for JSON data. Hey, we’ve all been there. The latest major versions of Django’s contrib.postgres module have a JSON field that actually stores JSON - not just the wrapper string - in the database. This offers a couple of advantages: one, guaranteed valid JSON from the database, and two, it’s queryable on its own.
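
On PostgreSQL the swap is small; a hedged sketch with a hypothetical Event model:

    from django.contrib.postgres.fields import JSONField
    from django.db import models


    class Event(models.Model):
        name = models.CharField(max_length=200)
        # Stored as real JSON rather than a string blob, so it is queryable,
        # e.g. Event.objects.filter(payload__source="api")
        payload = JSONField(default=dict)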

A more significant issue is using character fields for non-character values, e.g. for numbers, UUIDs, etc. Sometimes this will work “fine” but it’s not leaning on the database for optimal field performance and validity.
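
A small sketch of the difference, using a UUID on a hypothetical ApiKey model as the example:

    import uuid

    from django.db import models


    class ApiKey(models.Model):
        # Not key = models.CharField(max_length=36) - an actual UUID column
        # gets a compact native type plus database-level validity
        key = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)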

Using any kind of non-indexed field for searchable data of any kind is also a “field smell”. In fact, if you need to do any kind of non-trivial searching you should take a serious look at using search fields. And again contrib.postgres shows up with some very handy fields and helpers for making separate search-only fields.
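
For instance, SearchVectorField plus a GIN index gives you a dedicated, indexed search column; a rough sketch (keeping the vector up to date, whether via a trigger or in save, is left out here):

    from django.contrib.postgres.indexes import GinIndex
    from django.contrib.postgres.search import SearchVectorField
    from django.db import models


    class Article(models.Model):
        title = models.CharField(max_length=200)
        body = models.TextField()
        # A separate, search-only field backed by an index built for searching
        search_vector = SearchVectorField(null=True)

        class Meta:
            indexes = [GinIndex(fields=["search_vector"])]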

This is hardly an exhaustive list. Let’s not forget unnecessary many-to-many relations in lieu of array fields and… well, hopefully you get the picture.

In some cases these fields may be alterable in place. If not, the solution is to create a replacement field, combined with a data migration and often new accessor methods as aliases for the old attribute name; a sequence of migrations that swaps the fields out will work as well.
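
The data migration step in that sequence might look roughly like this - the app label, model, and field names here are all hypothetical, and it assumes the old column already holds values valid for the new field type:

    from django.db import migrations


    def copy_legacy_references(apps, schema_editor):
        # Use the historical model from the app registry, not a direct import
        Order = apps.get_model("shop", "Order")
        for order in Order.objects.all().iterator():
            # Copy from the old CharField into its replacement field
            order.reference = order.legacy_reference
            order.save(update_fields=["reference"])


    class Migration(migrations.Migration):

        dependencies = [
            ("shop", "0007_add_reference_field"),
        ]

        operations = [
            migrations.RunPython(copy_legacy_references, migrations.RunPython.noop),
        ]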

What about NoSQL?

I remember when the NoSQL database trend started, with the release of databases like CouchDB and later MongoDB[2]. The excitement for many developers was that not only did you not need to use SQL[3] but you didn’t need to worry about schemas anymore. You could just write your data into a document as it is and “BOOM” done!

Except what they forgot is that this means you’re still writing a schema, just without a definition![4] If you then have code that tries to access data that was structured differently you’ll either get a runtime error or have to start adding lots of conditional logic to handle it. It turns out that just because you’re not using a relational database doesn’t mean you don’t have to worry about things like field definitions or document schema migration. Instead, when you change the schema you just need to update every single document/object with the previous schema.

Anyhow, there are analogs to these anti-patterns in NoSQL databases, including analogs to both over- and under-normalization, but we have not done nearly enough work in this space to offer significant advice about how best to design schemas in them.

WITH SELECT first_name FROM author AS,
Ben

[0] Sure, there are other options, too - some data may need to be stored in directly accessible files - but I’m willing to bet you’re still using a database.
[1] Or SQL Server or Oracle or MariaDB or …. yes, there are more databases than just PostgreSQL and MySQL, and more beyond RDBMS’s, too.
[2] These were predated by the ZODB (Zope Object Database), a Python object database used primarily by the Plone CMS
[3] Maybe it’s Stockholm syndrome but I love SQL
[4] The irony of critiquing this while using a dynamically typed language is not lost on the author

Learn from more articles like this how to make the most out of your existing Django site.