Wellfire Interactive // Expertise for established Django SaaS applications

The problem with fat models, or, an OOPs mistake (This Old Pony #69)

The term “fat models” references the idea of moving as much business logic, especially that related to your Django models, out of views (controllers) and into the model classes themselves. All things being equal, “fat models, skinny views”[0] is pretty good advice. And it gets us closer to what object oriented programming is supposed to look like, and what it’s supposed to do for us.

However… however! There’s a common anti-pattern that crops up in Django apps, often enough that it seems like I make passing mention of it in every other newsletter. Here’s something from two weeks ago when I wrote about “leverage points” (emphasis added):
 

A great deal of complexity and performance problems alike stem from poor or insufficient use of QuerySets in the Django ORM. The complexity comes in the form of gigantic view functions or methods, ad hoc querysets which are easily copied and pasted - wrongly, not to mention less testable code. The performance hits often come from data manipulation of one kind or another that could and should be pushed to the database but is instead punted to Python code or, worse, to yet additional queries. Organizing querysets into _ fat managers _ and taking advantage of advanced querying (as well as update patterns) in querysets so often results in significant organizational, testing, and performance benefits that code-wise it should be the first lever you look for.

It’s common to find code executing properly designed methods on model instances by iterating over a queryset. Sometimes there’s nothing that can be done about this, but all too often this is the wrong approach, and the result is, at best, unnecessary performance degradation.

Anyhow, I think I finally figured out why we keep doing this (“we” because, hey, I’ve done it too). It’s not laziness or ignorance, not always at least. No, it’s a fundamental misunderstanding of the relationship between database data and the ORM’s OOP abstraction.

You see, our model classes are at once just that, Python classes for Python objects, and also an abstraction layer over the database and it’s organizational system. That probably doesn’t sound controversial, but understanding the directionality is crucial. If you think about the database as first a place to store your object data then yes, getting, fetching, updating, and deleting all make sense on an individual instance level, pretty much always. Give me my objects!

But instead, think about the classes as the abstraction over the database, where the database is instead taken as the given. We may use the ORM to design and manage the database schema, but it’s the database that is the authority, the leader, the source of truth. The ORM and our object oriented classes are the abstraction layer. Now the principle of accessing objects individually by themselves starts to look less appealing.

Mapping a method call across a list of objects is - quite often - mapping at the wrong abstraction level. Instead we should be mapping across rows of data, e.g. an update() or delete() call on a queryset. Those are trivial examples, but they each map an update, however instead of mapping across Python objects they map across database rows. And to map across database rows you need to push the mapping to the database.

Let me motivate with another example.

Let’s say you have a model call NewspaperArticle and the act of publishing it involves changing the status, invaliding cached values, firing a webhook call, and emailing an editor that the article has been published, or republished.

You could write the NewspaperArticle.publish() method like so:
 

class NewspaperArticle(models.Model):    def publish(self, publish\_user):        self.published\_time = now()        self.status = ArticleStatus.published        self.last\_published\_user = publish\_user        self.save()     self.category.publish()        published\_webhook.delay(self.pk)        alert\_editors.delay(self.pk)

When an article is published the method updates the published time, sets the publishing user, saves the instance to the database of course, and then makes some additional calls. It calls the “publish” method on the related category, which for our purposes we’ll assume does nothing more than invalidate the cache for the category page listings, then fires some asynchronous tasks to send a webhook and email the editors.

This is fine! But what happens if this needs to scale? What if you need to publish hundred or thousands of articles at once? When we think about scaling issues we usually think about huge amounts of data, but even small collections of calls over dozens, hundred, or thousands of instances add up and can degrade user experience, cause timeouts, or worse.

You see, referencing this encapsulating method on each object instance is the OOPiest way of publishing these articles. And if you had a NewspaperArticleContainer it would probably have a method called “publish” which would just iterate over each of its NewspaperArticle members and call “publish”.

The key is recognizing that a collection of NewspaperArticles as a queryset is categorically different than any other iteratable of NewspaperArticles, like a list or a NewspaperArticleContainer. That’s because it doesn’t just represent an iterable of NewspaperArticle instances, it represents a single set of rows, which can be read, deleted, updated, or used for additional queries doing all of the same.

And it almost always pays to operate on that set.

On the flip side, it means you have to write another method in addition to the model method.

class ArticleQueryset(models.Model):    def publish(self, publish\_user):        self.update(published\_time=now(), status=ArticleStatus.published) keys = self.list\_values('pk', flat=True) Category.objects.filter( pk\_\_in=self.values\_list('category')).publish() for key in keys:        published\_webhook.delay(key)        alert\_editors.delay(key)

It may look like I’m punting on the category cache invalidation, but that’s also something that should be pushed to the queryset.

When we’re prototyping and writing tests we rarely work with data en mass so it’s really easy to get away with iterated or mapped calls. But once the number of instances increases as is wont to happen in production, you’re certain to find that it’s worth reassessing OOPs-driven logic for mapped data transformations.

Selected for update yours,
Ben

[0] “controllers” instead of “views” in the original nomenclature

Learn from more articles like this how to make the most out of your existing Django site.