I'm a big believer in the importance of a "recent additions" feed. Any time you're building an application that involves users adding and editing records it's useful to have a page somewhere that shows the most recent objects that have been created across multiple different types of data.
I've used a number of techniques to build these in the past - from an extra database table (e.g. the Django Admin's LogEntry model) to a Solr or Elasticsearch index that exists just to serve recent additions.
For a recent small project I found myself needing a recent additions feed and realized that there's a new, simple way to build one thanks to the QuerySet.union() method introduced in Django 1.11 back in April 2017.
Consider a number of different ORM models that can be added by users, each with a created timestamp field.
Prior to QuerySet.union(), building a combined recent additions feed across multiple models was difficult: it's easy to show recent additions for a single model, but how can we intersperse and paginate additions made to models stored across more than one table?
Consider the following three models:
class Project(models.Model):
    name = models.CharField(max_length=128)
    description = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class Image(models.Model):
    project = models.ForeignKey(
        Project, related_name='images', on_delete=models.CASCADE
    )
    image = models.ImageField()
    created = models.DateTimeField(auto_now_add=True)

class Comment(models.Model):
    project = models.ForeignKey(
        Project, related_name='comments', on_delete=models.CASCADE
    )
    comment = models.TextField()
    created = models.DateTimeField(auto_now_add=True)
Let's build a single QuerySet that returns objects from all three models ordered by their created dates, most recent first.
Using .values() we can reduce these different models to a common subset of fields, which we can then .union() together like so:
recent = Project.objects.values(
    'pk', 'created'
).union(
    Image.objects.values('pk', 'created'),
    Comment.objects.values('pk', 'created'),
).order_by('-created')[:4]
Now if we print out list(recent) it will look something like this:
[{'created': datetime.datetime(2018, 3, 24, 1, 27, 23, 625195, tzinfo=<UTC>),
'pk': 28},
{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
'pk': 15},
{'created': datetime.datetime(2018, 3, 23, 20, 14, 3, 31648, tzinfo=<UTC>),
'pk': 5},
{'created': datetime.datetime(2018, 3, 23, 18, 57, 36, 585376, tzinfo=<UTC>),
'pk': 11}]
We've successfully combined recent additions from three different tables! Here's what the SQL for that looks like:
>>> from django.db import connection
>>> print(connection.queries[-1]['sql'])
(SELECT "myapp_project"."id", "myapp_project"."created" FROM "myapp_project")
UNION (SELECT "myapp_image"."id", "myapp_image"."created" FROM "myapp_image")
UNION (SELECT "myapp_comment"."id", "myapp_comment"."created" FROM "myapp_comment")
There's just one problem: we got back a bunch of pk and created records, but we don't know which model each of those rows represents.
We can fix this by using Django's annotate() method combined with a Value() object to attach a constant string to each record specifying the type of the row it represents. Here's how to do that for a single model:
>>> from django.db.models import Value, CharField
>>> list(Image.objects.annotate(
...     type=Value('image', output_field=CharField())
... ).values('pk', 'type', 'created')[:2])
[{'created': datetime.datetime(2018, 3, 22, 17, 16, 33, 964900, tzinfo=<UTC>),
'pk': 3,
'type': 'image'},
{'created': datetime.datetime(2018, 3, 22, 17, 49, 47, 527907, tzinfo=<UTC>),
'pk': 4,
'type': 'image'}]
We've added the key/value pair 'type': 'image' to every record returned from the queryset. Now let's do that to all three of our models and combine the results using .union():
recent = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values(
    'pk', 'created', 'type'
).union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')[:4]
If we run list(recent) we get this:
[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
'pk': 15,
'type': 'comment'},
{'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
'pk': 29,
'type': 'image'},
{'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
'pk': 15,
'type': 'project'},
{'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
'pk': 14,
'type': 'comment'}]
This is looking pretty good! We've successfully run a single SQL UNION query across three different tables and returned the combined results in reverse chronological order. Thanks to the type column we know which model each record corresponds to.
Now we need to inflate those primary key references into full ORM objects from their corresponding tables.
The most efficient way to do this is to collect together the IDs for each type and then run a single SQL query per type to load the full objects.
Here's code that does exactly that: it first collects the list of primary keys that need to be loaded for each type, then executes an efficient SQL IN query against each type to fetch the underlying objects:
records = list(recent)
type_to_queryset = {
    'image': Image.objects.all(),
    'comment': Comment.objects.all(),
    'project': Project.objects.all(),
}
# Collect the pks we need to load for each type:
to_load = {}
for record in records:
    to_load.setdefault(record['type'], []).append(record['pk'])
# Fetch them
fetched = {}
for type, pks in to_load.items():
    for object in type_to_queryset[type].filter(pk__in=pks):
        fetched[(type, object.pk)] = object
# Annotate 'records' with loaded objects
for record in records:
    key = (record['type'], record['pk'])
    record['object'] = fetched[key]
After running the above code, records looks like this:
[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
'object': <Comment: a comment>,
'pk': 15,
'type': 'comment'},
{'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
'object': <Image: Image object (29)>,
'pk': 29,
'type': 'image'},
{'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
'object': <Project: Recent changes demo>,
'pk': 15,
'type': 'project'},
{'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
'object': <Comment: Here is another comment>,
'pk': 14,
'type': 'comment'}]
We can now feed this to a template and use it to render our recent additions page.
Here's a function that implements the above in a re-usable way:
def combined_recent(limit, **kwargs):
    datetime_field = kwargs.pop('datetime_field', 'created')
    querysets = []
    for key, queryset in kwargs.items():
        querysets.append(
            queryset.annotate(
                recent_changes_type=Value(
                    key, output_field=CharField()
                )
            ).values('pk', 'recent_changes_type', datetime_field)
        )
    union_qs = querysets[0].union(*querysets[1:])
    records = []
    for row in union_qs.order_by('-{}'.format(datetime_field))[:limit]:
        records.append({
            'type': row['recent_changes_type'],
            'when': row[datetime_field],
            'pk': row['pk']
        })
    # Now we bulk-load each object type in turn
    to_load = {}
    for record in records:
        to_load.setdefault(record['type'], []).append(record['pk'])
    fetched = {}
    for key, pks in to_load.items():
        for item in kwargs[key].filter(pk__in=pks):
            fetched[(key, item.pk)] = item
    # Annotate 'records' with loaded objects
    for record in records:
        record['object'] = fetched[(record['type'], record['pk'])]
    return records
This is also available as a gist.
I can now use that function to combine arbitrary querysets (provided they share a created datestamp field) like so:
recent = combined_recent(
    20,
    project=Project.objects.all(),
    image=Image.objects.all(),
    comment=Comment.objects.all(),
)
This will return the most recent 20 records across all three types, with the results looking like this:
[{'when': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
'object': <Comment: a comment>,
'pk': 15,
'type': 'comment'},
{'when': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
'object': <Image: Image object (29)>,
'pk': 29,
'type': 'image'},
{'when': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
'object': <Project: Recent changes demo>,
'pk': 15,
'type': 'project'},
{'when': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
'object': <Comment: Here is another comment>,
'pk': 14,
'type': 'comment'}]
If you're going to render these objects on a page, it's pretty likely you'll need to load additional data about them. My example models above are deliberately simplified, but in any serious Django project it's likely they will have additional references to other tables.
We can apply Django's magic select_related() and prefetch_related() methods directly to the querysets we pass to the function, like so:
recent = combined_recent(
    20,
    project=Project.objects.all().prefetch_related('tags'),
    image=Image.objects.all().select_related('uploaded_by'),
    comment=Comment.objects.all().select_related('author'),
)
Django's query optimizer is smart enough to ignore those calls entirely when building the initial union queries, so even with the above extras the initial union query will still look like this:
(SELECT "myapp_project"."id", "myapp_project"."created", 'project' AS "recent_changes_type" FROM "myapp_project")
UNION (SELECT "myapp_image"."id", "myapp_image"."created", 'image' AS "recent_changes_type" FROM "myapp_image")
UNION (SELECT "myapp_comment"."id", "myapp_comment"."created", 'comment' AS "recent_changes_type" FROM "myapp_comment")
ORDER BY (2) DESC LIMIT 20
The select_related() and prefetch_related() clauses will then be incorporated into the subsequent SQL queries that are used to efficiently inflate the full objects from the database.
There are a bunch of interesting extensions that can be made to this pattern.
Want pagination? The initial unioned queryset can be paginated using offset/limit by slicing the queryset, or using the Django Paginator class.
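The offset/limit arithmetic behind that kind of pagination can be sketched in plain Python (the `page_bounds` helper below is my own illustration, not part of Django or the code above):

```python
def page_bounds(page_number, per_page):
    """Return the (start, stop) slice indices for a 1-indexed page."""
    if page_number < 1:
        raise ValueError("page numbers are 1-indexed")
    offset = (page_number - 1) * per_page
    return offset, offset + per_page

# Slicing a Django queryset with these bounds compiles to
# OFFSET/LIMIT in the generated SQL, e.g. union_qs[start:stop]
start, stop = page_bounds(3, 20)  # third page of 20: records 40-59
```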
Want more efficient pagination (since offset/limit tends to get slow after the first few thousand rows)? We're ordering by created already which means it's not difficult to build efficient range-based pagination, requesting all records where the created date is less than the earliest date seen on the previous page.
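A sketch of that range-based approach in plain Python (the `next_page_cursor` helper and the dictionaries standing in for result rows are my own illustration). Note that Django only supports a limited set of operations on a combined queryset, so the `created__lt` filter is best applied to each component queryset before calling .union():

```python
import datetime

def next_page_cursor(records):
    """Given one page of records sorted by 'created' descending,
    return the cursor (earliest created value on the page) to use
    when requesting the next page, or None if the page is empty."""
    if not records:
        return None
    return records[-1]['created']

# Hypothetical Django usage: rebuild the union with each component
# queryset filtered by created__lt=cursor, then take the next page
# with .order_by('-created')[:limit] - fast with an index on created.
page = [
    {'pk': 15, 'created': datetime.datetime(2018, 3, 24, 15, 51)},
    {'pk': 29, 'created': datetime.datetime(2018, 3, 24, 15, 50)},
    {'pk': 5, 'created': datetime.datetime(2018, 3, 23, 20, 14)},
]
cursor = next_page_cursor(page)
```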
Since everything is based on regular Django querysets, it's possible to build all kinds of variants of the recent additions feed. So far we've just built one showing all changes across an entire application, but it's not hard to apply additional filters to only show changes made by a specific user, or changes made relating to a specific foreign key relationship. If you can represent it as a collection of querysets that each expose a created column you can combine them into a single feed.
You don't even need to use records that share a created column: if you have objects with columns of differing names you can use an annotation to alias those columns, like so:
recent = combined_recent(
    20,
    project=Project.objects.annotate(
        when=models.F('updated')
    ).prefetch_related('tags'),
    image=Image.objects.annotate(
        when=models.F('uploaded_at')
    ).select_related('uploaded_by'),
    comment=Comment.objects.annotate(
        when=models.F('commented_at')
    ).select_related('created_by'),
    datetime_field='when'
)
I haven't extensively load-tested this pattern, but I expect it will work fine for databases with tens-of-thousands of records but may start running into trouble if you have millions of records (though an index on the created column should help a lot). If you need a recent additions feed on something larger scale than that you should probably look at a separate logging table or an external index in something like Elasticsearch instead.
For another interesting thing you can do with .union() check out my article on Implementing faceted search with Django and PostgreSQL.
I've added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL "like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotations) and any combination of the above. Some example searches:
It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.

I love this kind of search interface, because the counts tell you so much more about the underlying data. Turns out I was most active in quoting people talking about JavaScript back in 2007, for example.
I usually build faceted search engines using either Solr or Elasticsearch (though the first version of search on this blog was actually powered by Hyper Estraier) - but I'm hosting this blog as simply and inexpensively as possible on Heroku and I don't want to shell out for a SaaS search solution or run an Elasticsearch instance somewhere myself. I thought I'd have to go back to using Google Custom Search.
Then I read Postgres full-text search is Good Enough! by Rachid Belaid - closely followed by Postgres Full-Text Search With Django by Nathan Shafer - and I decided to have a play with the new PostgreSQL search functionality that was introduced in Django 1.10.
... and wow! Full-text search is yet another example of a feature that's been in PostgreSQL for nearly a decade now, incrementally improving with every release to the point where it's now really, really good.
At its most basic level a search system needs to handle four things:

- finding documents that match a user's query
- stemming, so that a search for "run" also matches "running" and "runs"
- ranking results by their relevance to the query
- returning those results quickly
Modern PostgreSQL ticks all of those boxes. Let's put it to work.
Here's how to execute a full-text search query against a simple text column:
from blog.models import Entry
from django.contrib.postgres.search import SearchVector
results = Entry.objects.annotate(
    searchable=SearchVector('body')
).filter(searchable='django')
The generated SQL looks something like this:
SELECT "blog_entry"."id", ...,
to_tsvector(COALESCE("blog_entry"."body", %s)) AS "searchable"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body", %s))
      @@ (plainto_tsquery(%s)) = true
ORDER BY "blog_entry"."created" DESC
The SearchVector class constructs a stemmed, stopword-removed representation of the body column ready to be searched. The resulting queryset contains entries that are a match for "django".
My blog entries are stored as HTML, but I don't want search to include those HTML tags. One (extremely un-performant) solution is to use Django's Func helper to apply a regular expression inside PostgreSQL to strip tags before they are considered for search:
from django.db.models import Value, F, Func
results = Entry.objects.annotate(
    searchable=SearchVector(
        Func(
            F('body'), Value('<.*?>'), Value(''), Value('g'),
            function='regexp_replace'
        )
    )
).filter(searchable='http')
Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL to_tsvector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.
This works, but performance isn't great. PostgreSQL ends up having to scan every row and construct a list of search vectors for each one every time you execute a query.
If you want it to go fast, you need to add a special search vector column to your table and then create the appropriate index on it. As of Django 1.11 this is trivial:
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Entry(models.Model):
    # ...
    search_document = SearchVectorField(null=True)

    class Meta:
        indexes = [
            GinIndex(fields=['search_document'])
        ]
Django's migration system will automatically add both the field and the special GIN index.
What's trickier is populating that search_document field. Django does not yet provide an easy way to populate it directly in the initial INSERT call, instead recommending that you populate it with a SQL UPDATE statement after the fact. Here's a snippet that will populate the field for everything in that table (and strip tags at the same time):
def strip_tags_func(field):
    return Func(
        F(field), Value('<.*?>'), Value(''), Value('g'),
        function='regexp_replace'
    )

Entry.objects.update(
    search_document=(
        SearchVector('title', weight='A') +
        SearchVector(strip_tags_func('body'), weight='C')
    )
)
I'm using a neat feature of the SearchVector class here: it can be concatenated together using the + operator, and each component can be assigned a weight of A, B, C or D. These weights affect ranking calculations later on.
We could just set this up to run periodically (as I did in my initial implementation), but we can get better real-time results by ensuring this field gets updated automatically when the rest of the model is modified. Some people solve this with PostgreSQL triggers, but I'm still more comfortable handling this kind of thing in Python code - so I opted to use Django's signals mechanism instead.
Since I need to run search queries across three different types of blog content - Entries, Blogmarks and Quotations - I added a method to each model that returns the text fragments corresponding to each of the weight values. Here's that method for my Quotation model:
class Quotation(models.Model):
    quotation = models.TextField()
    source = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag, blank=True)

    def index_components(self):
        return {
            'A': self.quotation,
            'B': ' '.join(self.tags.values_list('tag', flat=True)),
            'C': self.source,
        }
As you can see, I'm including the tags that have been assigned to the quotation in the searchable document.
Here are my signals - loaded once via an import statement in my blog application's AppConfig.ready() method:
@receiver(post_save)
def on_save(sender, **kwargs):
    if not issubclass(sender, BaseModel):
        return
    transaction.on_commit(make_updater(kwargs['instance']))

@receiver(m2m_changed)
def on_m2m_changed(sender, **kwargs):
    instance = kwargs['instance']
    model = kwargs['model']
    if model is Tag:
        transaction.on_commit(make_updater(instance))
    elif isinstance(instance, Tag):
        for obj in model.objects.filter(pk__in=kwargs['pk_set']):
            transaction.on_commit(make_updater(obj))

def make_updater(instance):
    components = instance.index_components()
    pk = instance.pk

    def on_commit():
        search_vectors = []
        for weight, text in components.items():
            search_vectors.append(
                SearchVector(
                    Value(text, output_field=models.TextField()),
                    weight=weight
                )
            )
        instance.__class__.objects.filter(pk=pk).update(
            search_document=reduce(operator.add, search_vectors)
        )

    return on_commit
(The full code can be found here).
The on_save method is pretty straightforward - it checks if the model that was just saved has my BaseModel as a base class, then it calls make_updater to get a function to be executed by the transaction.on_commit hook.
The on_m2m_changed handler is significantly more complicated. There are a number of scenarios in which this will be called - I'm reasonably confident that the idiom I use here will capture all of the modifications that should trigger a re-indexing operation.
Running a search now looks like this:
from django.contrib.postgres.search import SearchQuery

results = Entry.objects.filter(
    search_document=SearchQuery('django')
)
We need one more thing though: we need to sort our search results by relevance. PostgreSQL has pretty good relevance built in, and sorting by the relevance score can be done by applying a Django ORM annotation:
from django.contrib.postgres.search import SearchRank

query = SearchQuery('ibm')
results = Entry.objects.filter(
    search_document=query
).annotate(
    rank=SearchRank(F('search_document'), query)
).order_by('-rank')
We now have basic full text search implemented against a single Django model, making use of a GIN index. This is lightning fast.
My site has three types of content, represented in three different models and hence three different underlying database tables.
I'm using an abstract base model to define common fields shared by all three: the created date, the slug (used to construct permalink urls) and the search_document field populated above.
As of Django 1.11 it's possible to combine queries across different tables using the SQL union operator.
Here's what that looks like for running a search across three tables, all with the same search_document search vector field. I need to use .values() to restrict the querysets I am unioning to the same subset of fields:
query = SearchQuery('django')
rank_annotation = SearchRank(F('search_document'), query)
qs = Blogmark.objects.annotate(
    rank=rank_annotation,
).filter(
    search_document=query
).values('pk', 'created', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
).order_by('-rank')[:5]
# Output
<QuerySet [
{'pk': 186, 'rank': 0.875179, 'created': datetime.datetime(2008, 4, 8, 13, 48, 18, tzinfo=<UTC>)},
{'pk': 134, 'rank': 0.842655, 'created': datetime.datetime(2007, 10, 20, 13, 46, 56, tzinfo=<UTC>)},
{'pk': 1591, 'rank': 0.804502, 'created': datetime.datetime(2009, 9, 28, 23, 32, 4, tzinfo=<UTC>)},
{'pk': 5093, 'rank': 0.788616, 'created': datetime.datetime(2010, 2, 26, 19, 22, 47, tzinfo=<UTC>)},
{'pk': 2598, 'rank': 0.786928, 'created': datetime.datetime(2007, 1, 26, 12, 38, 46, tzinfo=<UTC>)}
]>
This is not enough information though - I have the primary keys, but I don't know which type of model they belong to. In order to retrieve the actual resulting objects from the database I need to know which type of content is represented by each of those results.
I can achieve that using another annotation:
qs = Blogmark.objects.annotate(
    rank=rank_annotation,
    type=models.Value('blogmark', output_field=models.CharField())
).filter(
    search_document=query
).values('pk', 'type', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
        type=models.Value('entry', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
        type=models.Value('quotation', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
).order_by('-rank')[:5]
# Output:
<QuerySet [
{'pk': 186, 'type': u'quotation', 'rank': 0.875179},
{'pk': 134, 'type': u'quotation', 'rank': 0.842655},
{'pk': 1591, 'type': u'entry', 'rank': 0.804502},
{'pk': 5093, 'type': u'blogmark', 'rank': 0.788616},
{'pk': 2598, 'type': u'blogmark', 'rank': 0.786928}
]>
Now I just need to write a function that can take a list of types and primary keys and return the full objects from the database:
def load_mixed_objects(dicts):
    """
    Takes a list of dictionaries, each of which must at least have a 'type'
    and a 'pk' key. Returns a list of ORM objects of those various types.
    Each returned ORM object has a .original_dict attribute populated.
    """
    to_fetch = {}
    for d in dicts:
        to_fetch.setdefault(d['type'], set()).add(d['pk'])
    fetched = {}
    for key, model in (
        ('blogmark', Blogmark),
        ('entry', Entry),
        ('quotation', Quotation),
    ):
        ids = to_fetch.get(key) or []
        objects = model.objects.prefetch_related('tags').filter(pk__in=ids)
        for obj in objects:
            fetched[(key, obj.pk)] = obj
    # Build list in same order as dicts argument
    to_return = []
    for d in dicts:
        item = fetched.get((d['type'], d['pk'])) or None
        if item:
            item.original_dict = d
            to_return.append(item)
    return to_return
One last challenge: when I add filtering by type, I'm going to want to selectively union together only a subset of these querysets. I need a queryset to start unions against, but I don't yet know which queryset I will be using. I can abuse Django's queryset.none() method to create an empty ValuesQuerySet in the correct shape, like this:
qs = Entry.objects.annotate(
    type=models.Value('empty', output_field=models.CharField()),
    rank=rank_annotation
).values('pk', 'type', 'rank').none()
Now I can progressively build up my union in a loop like this:
for klass in (Entry, Blogmark, Quotation):
    qs = qs.union(klass.objects.annotate(
        rank=rank_annotation,
        # Derive the type label from the model class:
        type=models.Value(
            klass.__name__.lower(), output_field=models.CharField()
        )
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'))
The Django ORM is smart enough to compile away the empty queryset when it constructs the SQL, which ends up looking something like this:
(((SELECT "blog_entry"."id",
    'entry' AS "type",
    ts_rank("blog_entry"."search_document", plainto_tsquery(%s)) AS "rank"
FROM "blog_entry"
WHERE "blog_entry"."search_document" @@ (plainto_tsquery(%s)) = TRUE
ORDER BY "blog_entry"."created" DESC))
UNION
(SELECT "blog_blogmark"."id",
    'blogmark' AS "type",
    ts_rank("blog_blogmark"."search_document", plainto_tsquery(%s)) AS "rank"
FROM "blog_blogmark"
WHERE "blog_blogmark"."search_document" @@ (plainto_tsquery(%s)) = TRUE
ORDER BY "blog_blogmark"."created" DESC))
UNION
(SELECT "blog_quotation"."id",
    'quotation' AS "type",
    ts_rank("blog_quotation"."search_document", plainto_tsquery(%s)) AS "rank"
FROM "blog_quotation"
WHERE "blog_quotation"."search_document" @@ (plainto_tsquery(%s)) = TRUE
ORDER BY "blog_quotation"."created" DESC)
So far, our search engine can only handle user-entered query strings. If I am going to build a faceted search interface I need to be able to handle filtering as well. I want the ability to filter by year, tag and type.
The key difference between filtering and querying (borrowing these definitions from Elasticsearch) is that querying is loose - it involves stemming and stopwords - while filtering is exact. Additionally, querying affects the calculated relevance score while filtering does not - a document either matches the filter or it doesn't.
Since PostgreSQL is a relational database, filtering can be handled by simply constructing extra SQL where clauses using the Django ORM.
Each of the filters I need requires a slightly different approach. Filtering by type is easy - I just selectively include or exclude that model from my union queryset.
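That per-type selection logic can be sketched in plain Python (the `querysets_to_union` helper is my own illustration, and plain lists stand in here for the real per-model querysets):

```python
def querysets_to_union(all_querysets, selected_types):
    """Pick which per-model querysets to include in the union.
    With no explicit selection, every type is searched."""
    if not selected_types:
        return list(all_querysets.values())
    return [
        qs for type_name, qs in all_querysets.items()
        if type_name in selected_types
    ]

# Lists stand in for the real querysets in this sketch:
all_querysets = {'entry': ['e1'], 'blogmark': ['b1'], 'quotation': ['q1']}
chosen = querysets_to_union(all_querysets, {'entry', 'quotation'})
```

The chosen querysets would then be unioned onto the empty .none() queryset shown above.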
Year and month work like this:
selected_year = request.GET.get('year', '')
selected_month = request.GET.get('month', '')
if selected_year:
    qs = qs.filter(created__year=int(selected_year))
if selected_month:
    qs = qs.filter(created__month=int(selected_month))
Tags involve a join through a many-to-many relationship against the Tags table. We want to be able to apply more than one tag, for example this search for all items tagged both python and javascript. Django's ORM makes this easy:
selected_tags = request.GET.getlist('tag')
for tag in selected_tags:
    qs = qs.filter(tags__tag=tag)
There is just one more ingredient needed to complete our faceted search: facet counts!
Again, the way we calculate these is different for each of our filters. For types, we need to call .count() on a separate queryset for each of the types we are searching:
queryset = make_queryset(Entry, 'entry')
type_counts['entry'] = queryset.count()
(the make_queryset function is defined here)
For years we can do this:
from django.db.models.functions import TruncYear

for row in queryset.order_by().annotate(
    year=TruncYear('created')
).values('year').annotate(n=models.Count('pk')):
    year_counts[row['year']] = year_counts.get(
        row['year'], 0
    ) + row['n']
Tags are trickiest. Let's take advantage of the fact that Django's ORM knows how to construct sub-selects if you pass another queryset to the __in operator:
tag_counts = {}
type_name = 'entry'
queryset = make_queryset(Entry, 'entry')
for tag, count in Tag.objects.filter(**{
    '%s__in' % type_name: queryset
}).annotate(
    n=models.Count('tag')
).values_list('tag', 'n'):
    tag_counts[tag] = tag_counts.get(tag, 0) + count
Having constructed the various facet counts in the view function, the template is really simple:
{% if type_counts %}
  <h3>Types</h3>
  <ul>
  {% for t in type_counts %}
    <li><a href="{% add_qsarg "type" t.type %}">{{ t.type }}</a> {{ t.n }}</li>
  {% endfor %}
  </ul>
{% endif %}
{% if year_counts %}
  <h3>Years</h3>
  <ul>
  {% for t in year_counts %}
    <li><a href="{% add_qsarg "year" t.year|date:"Y" %}">{{ t.year|date:"Y" }}</a> {{ t.n }}</li>
  {% endfor %}
  </ul>
{% endif %}
{% if tag_counts %}
  <h3>Tags</h3>
  <ul>
  {% for t in tag_counts %}
    <li><a href="{% add_qsarg "tag" t.tag %}">{{ t.tag }}</a> {{ t.n }}</li>
  {% endfor %}
  </ul>
{% endif %}
I am using custom template tags here to add arguments to the current URL. I've built systems like this in the past where the URLs are instead generated in the view logic, which I think I prefer. As always, perfect is the enemy of shipped.
And because the results are just a Django queryset, we can use Django's pagination helpers for the pagination links.
The full current version of the code at time of writing can be seen here. You can follow my initial implementation of this feature through the following commits: 7e3a0217 c7e7b30c 7f6b524c a16ddb5e 7055c7e1 74c194d9 f3ffc100 6c24d9fd cb88c2d4 2c262c75 776a562a b8484c50 0b361c78 1322ada2 79b1b13d 3955f41b 3f5ca052.
And that's how I built faceted search on top of PostgreSQL and Django! I don't have my blog comments up and running yet, so please post any thoughts or feedback over on this GitHub issue or over on this thread on Hacker News.
Update 9th September 2021: A few years after implementing this I started to notice performance issues with my blog, which turned out to be caused by search engine crawlers hitting every possible combination of facets, triggering a ton of expensive SQL queries. I excluded /search from being crawled using robots.txt which fixed the problem.
Tags: django, full-text-search, orm, postgresql, projects, search, facetedsearch
Appending the request URL to SQL statements in Django
A clever frame-walking monkey-patch which pulls the most recent HttpRequest object out of the Python stack and adds the current request.path to each SQL query as an SQL comment, so you can see it in debugging tools such as slow query logs and the PostgreSQL "select * from pg_stat_activity" query.
Tags: chris-lamb, debugging, django, orm, postgresql, python, sql, recovered, monkeypatch
Cache Machine: Automatic caching for your Django models
This is the third new ORM caching layer for Django I've seen in the past month! Cache Machine was developed for zamboni, the port of addons.mozilla.org to Django. Caching is enabled using a model mixin class (to hook up some post_delete hooks) and a custom caching manager. Invalidation works by maintaining a "flush list" of dependent cache entries for each object--this is currently stored in memcached and hence has potential race conditions, but a comment in the source code suggests that this could be solved by moving to redis.
Tags: cachemachine, caching, django, memcached, mozilla, orm, ormcaching, python, redis
Entry.objects.all()[:10].batch_select("tags") to execute two queries - one pulling back the first ten entries and another using an "IN" query against the tags table to pull back all of the tags for those entries in one go.
Via Psychic Origami
Tags: batchselect, django, john-montgomery, manytomany, orm, python, selectrelated, sql
Model.objects.raw(SQL), the smarter {% if %} tag and class-based generic views.
Tags: classbasedviews, django, multidb, orm, python
DB2 support for Django is coming
From IBM, under the Apache 2.0 License. I'm not sure if this makes it hard to bundle it with the rest of Django, which uses the BSD license.
Tags: antonio-cangiano, bsd, databases, db2, django, ibm, licenses, open-source, orm, python
Django 1.0 alpha release notes
The big features are newforms-admin, unicode everywhere, the queryset-refactor ORM improvements and auto-escaping in templates.
Tags: alpha, autoescaping, django, django-admin, newformsadmin, orm, python, querysetrefactor, unicode
Django's ORM is, in my opinion, the unsung gem of the framework. For the subset of SQL that's used in most web applications it's very hard to beat. It's a beautiful piece of API design, and I tip my hat to the people who designed and built it.
If you haven't spent much time with the ORM, two key features are lazy evaluation and chaining. Consider the following statement:
entries = Entry.objects.all()
Assuming you have created an Entry model of some sort, the above statement will create a Django QuerySet object representing all of the entries in the database. It will not result in the execution of any SQL - QuerySets are lazily evaluated, and are only executed at the last possible moment. The most common situation in which SQL will be executed is when the object is used for iteration:
for entry in entries:
    print(entry.title)
This usually happens in a template:
<ul>
{% for entry in entries %}
    <li>{{ entry.title }}</li>
{% endfor %}
</ul>
Lazy evaluation works nicely with template fragment caching - even if you pass a QuerySet to a template it won't be executed if the fragment it is used in can be served from the cache.
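For example (a hypothetical fragment, assuming django.core.cache is configured), wrapping the loop in Django's {% cache %} tag means the QuerySet is only evaluated when the fragment isn't already cached:

```
{% load cache %}
{% cache 300 entry_list %}
<ul>
{% for entry in entries %}
    <li>{{ entry.title }}</li>
{% endfor %}
</ul>
{% endcache %}
```

On a cache hit the {% for %} loop never runs, so no SQL is executed at all.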
You can modify QuerySets as many times as you like before they are executed:
entries = Entry.objects.all()
today = datetime.date.today()
entries_this_year = entries.filter(
    posted__year = today.year
)
entries_last_year = entries.filter(
    posted__year = today.year - 1
)
Again, no SQL has been executed, but we now have two QuerySets which, when iterated, will produce the desired result.
Chaining comes in when you want to apply multiple modifications to a QuerySet. Here are blog entries from 2006 that weren't posted in January:
Entry.objects.filter(
    posted__year = 2006
).exclude(posted__month = 1)
And here's entries from that year posted to the category named "Personal", ordered by title:
Entry.objects.filter(
    posted__year = 2006
).filter(
    category__name = "Personal"
).order_by('title')
The above can also be expressed like this:
Entry.objects.filter(
    posted__year = 2006,
    category__name = "Personal"
).order_by('title')
The parallels to jQuery are pretty clear. The jQuery API is built around chaining, and the jQuery animation library even uses a form of lazy evaluation to automatically queue up effects to run in sequence:
jQuery('div#message').addClass(
    'borderfade'
).animate({
    'borderWidth': '+10px'
}, 1000).fadeOut();
One of the neatest things about jQuery is the plugin model, which takes advantage of JavaScript's prototype inheritance and makes it trivially easy to add new chainable methods. If we wanted to package the above dumb effect up as a plugin, we could do so like this:
jQuery.fn.dumbBorderFade = function() {
    return this.addClass(
        'borderfade'
    ).animate({
        'borderWidth': '+10px'
    }, 1000).fadeOut();
};
Now we can apply it to an element like so:
jQuery('div#message').dumbBorderFade();
Django supports adding custom ORM access methods by implementing a custom Manager. In the above examples, Entry.objects is the Manager. The downside of this approach is that methods added to a Manager can only be used at the beginning of the chain.
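The limitation is easy to see with a framework-free mimic of the Manager/QuerySet split (Manager, QuerySet and recent() here are hypothetical stand-ins, not Django's classes):

```python
class QuerySet:
    # Simplified stand-in: a chainable wrapper around a list.
    def __init__(self, items):
        self.items = items

    def filter(self, pred):
        return QuerySet([i for i in self.items if pred(i)])

class Manager:
    # Custom methods live here, not on the QuerySet.
    def __init__(self, items):
        self._qs = QuerySet(items)

    def filter(self, pred):
        return self._qs.filter(pred)

    def recent(self):
        return self._qs.filter(lambda i: i > 5)

objects = Manager([1, 4, 6, 9])
print(objects.recent().items)  # [6, 9] - works at the start of a chain
try:
    objects.filter(lambda i: i > 0).recent()
except AttributeError:
    # filter() returned a plain QuerySet, which has no recent()
    print("recent() is not available mid-chain")
```

Once filter() hands back a QuerySet, the Manager's custom methods are gone - which is exactly the problem the next section solves.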
Luckily, Managers also provide a hook for returning a custom QuerySet. This means we can create our own QuerySet subclass and add new methods to it, in a way that's reminiscent of jQuery:
from django.db import models
from django.db.models.query import QuerySet
import datetime

class EntryQuerySet(QuerySet):
    def on_date(self, date):
        next = date + datetime.timedelta(days = 1)
        return self.filter(
            posted__gte = date,
            posted__lt = next
        )

class EntryManager(models.Manager):
    def get_query_set(self):
        return EntryQuerySet(self.model)

class Entry(models.Model):
    ...
    objects = EntryManager()
The above gives us a new method on the QuerySets returned by Entry.objects called on_date(), which lets us filter entries down to those posted on a specific date. Now we can run queries like the following:
Entry.objects.filter(
    category__name = 'Personal'
).on_date(datetime.date(2008, 5, 1))
This method works fine, but it requires quite a bit of boilerplate code - a QuerySet subclass and a Manager subclass plus the wiring to pull them all together. Wouldn't it be neat if you could declare the extra QuerySet methods inside the model definition itself?
It turns out you can, and it's surprisingly easy. Here's the syntax I came up with:
from django.db.models.query import QuerySet

class Entry(models.Model):
    ...
    objects = QuerySetManager()
    ...

    class QuerySet(QuerySet):
        def on_date(self, date):
            return self.filter(
                ...
            )
Here I've made the custom QuerySet class an inner class of the model definition. I've also replaced the default manager with a QuerySetManager. All this class does is return the QuerySet inner class for the current model from get_query_set. The implementation looks like this:
class QuerySetManager(models.Manager):
    def get_query_set(self):
        return self.model.QuerySet(self.model)
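The wiring can be demonstrated without Django at all - this is a hypothetical mimic of the pattern, using a descriptor in place of models.Manager and a stub in place of the real QuerySet:

```python
class BaseQuerySet:
    # Stub standing in for django.db.models.query.QuerySet.
    def __init__(self, model):
        self.model = model

class QuerySetManager:
    # Descriptor standing in for models.Manager: attribute access on
    # the model class returns that model's own QuerySet inner class.
    def __get__(self, instance, model):
        return model.QuerySet(model)

class Entry:
    objects = QuerySetManager()

    class QuerySet(BaseQuerySet):
        def on_date(self, date):
            # A real implementation would call self.filter(...)
            return "filtered %s by %s" % (self.model.__name__, date)

print(Entry.objects.on_date("2008-05-01"))
# filtered Entry by 2008-05-01
```

The key move is the same in both versions: the manager looks up self.model.QuerySet at call time, so each model supplies its own chainable methods with no per-model manager subclass.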
I'm pretty happy with this; it makes it trivial to add custom QuerySet methods and does so without any monkeypatching or deep reliance on Django ORM internals. I think the ease with which this can be achieved is a testament to the quality of the ORM API.
Via A comment on "Django's tipping point"
Tags: django, ivansagalaev, masterslave, mysql, mysqlcluster, orm, python, replication
Tags: django, malcolm-tredinnick, orm, python, querysetrefactor, sql
Tags: caching, david-cramer, django, orm, ormcaching, python
Announcing StaticGenerator for Django
Simple but powerful static file generator for Django applications--just tell it about your model instances and it will create an entire static site based on calling get_absolute_url() on each one. Uses signals to repopulate the cache when a model changes.
Tags: caching, django, jared-kuolt, orm, performance, static, static-generator
Tags: databases, django, djangoevolution, migration, orm, schema
Using the extra() QuerySet modifier in Django for WeGoEat
You can use the extra(select={...}) modifier on a QuerySet to pull in extra values calculated using subqueries.
Tags: django, orm, python, queryset, ryan-kanno, subqueries
Tags: django, djangoorm, models, orm, python, sqlalchemy, tranquil