#postgres #django #python

Bulk Upsert with Django and Postgres


This short article will cover what upserting is, and how to accomplish that using Python, Django and Postgres.

 

Python and Postgres

 

The Problem

There are plenty of cases where we want to create a database record if it doesn't exist, or update it if it does (in this context, a record "exists" if a row with the same identifier or primary key value is already present). To address that need, Django's ORM provides us with an update_or_create() method, which you can call on the default manager of a model. See the documentation for more details on its usage.
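For reference, a typical call looks something like this. The Reading model is a made-up example (I'll reuse it in the snippets below), not anything from an actual codebase:

```python
from django.db import models


class Reading(models.Model):
    # Hypothetical model, used purely for illustration in this post.
    sensor_id = models.CharField(max_length=64, unique=True)
    value = models.FloatField()


# One record means two queries: a SELECT to find it, then an INSERT or an UPDATE.
obj, created = Reading.objects.update_or_create(
    sensor_id="sensor-42",     # lookup field, used to find an existing row
    defaults={"value": 21.5},  # values applied whether creating or updating
)
```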

While that works just fine for a single record, the problem is that it issues two hits to the database: one for learning if the record exists or not, and another for either inserting or updating depending on the result of the first query. This duplication of queries, as you can imagine, doesn't really scale when we're working with large batches of records. For example, if we want to create or update a batch of 1000 records, we would need to hit the database 2000 times, which is far from ideal.

We could, of course, go from 2N operations to N + 2 by using bulk_create() and bulk_update() as shown in the documentation. With our previous example of 1000 records, we would need to hit the database 1000 times just to learn whether each record exists, splitting the records into two smaller batches: those to be inserted and those to be updated. Then, on each of the smaller batches, we could call the respective bulk method, which issues a single query, ending up with 1002 queries issued in total.
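Sketched out, the N + 2 version might look like the following (reusing the hypothetical Reading model from above; raw_rows is assumed to be a list of plain dicts):

```python
# raw_rows: e.g. [{"sensor_id": "sensor-42", "value": 21.5}, ...]
to_create, to_update = [], []

for row in raw_rows:
    try:
        obj = Reading.objects.get(sensor_id=row["sensor_id"])  # one query per record: N queries
    except Reading.DoesNotExist:
        to_create.append(Reading(**row))
    else:
        obj.value = row["value"]
        to_update.append(obj)

Reading.objects.bulk_create(to_create)                    # one query
Reading.objects.bulk_update(to_update, fields=["value"])  # one query
```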

While that may seem like a huge improvement, database hits are largely I/O-bound operations: the cost is dominated by round trips to the database rather than by the statements themselves, which in our case (simple selects, inserts and updates) are cheap to execute. This means that neither of the two approaches is an acceptable solution, and as N gets bigger and bigger, both present more or less the same significant bottleneck in our system.


Bulk Upsert

What if the example above could be reduced to just a few queries? Here comes bulk upsert to the rescue. Upsert, a term derived from merging 'insert' and 'update', is a SQL technique that allows us to create or update a record with a single database hit. What's better is that you can perform it in bulk as well, drastically reducing the number of queries issued to the database.
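At the SQL level, the building block is Postgres's INSERT ... ON CONFLICT ... DO UPDATE clause. A minimal, hand-rolled version through Django's database connection might look like this (the app_reading table and its columns follow the hypothetical model from earlier):

```python
from django.db import connection

rows = [("sensor-42", 21.5), ("sensor-43", 19.8)]

# One multi-row VALUES list; the actual values are still passed as bound parameters.
placeholders = ", ".join(["(%s, %s)"] * len(rows))
params = [value for row in rows for value in row]

sql = f"""
    INSERT INTO app_reading (sensor_id, value)
    VALUES {placeholders}
    ON CONFLICT (sensor_id)               -- sensor_id carries a UNIQUE constraint
    DO UPDATE SET value = EXCLUDED.value  -- EXCLUDED holds the row we tried to insert
"""

with connection.cursor() as cursor:
    cursor.execute(sql, params)  # the whole batch in a single round trip
```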

I say a few queries and not exactly one, because even though you can technically do bulk upsert in a single query, the number may increase a bit depending on some implementation details. For example, going with a temporary table solution adds one extra query. Automatically handling timestamp fields like created_at and updated_at (which we'll see later) adds an additional one. If your initial batch size is too big, then chunking it adds one more query per chunk. However, the end result still leaves you with significantly fewer queries. Also, you'd probably need to chunk your batch even with the normal, built-in bulk methods. And some of the other steps, such as creating the temporary table or learning whether the model is tracked, are always a single extra query no matter how large the batch gets (a constant cost), so they're not really something to worry about.

Django's ORM, though, does not provide us with built-in constructs to perform bulk upsert, as opposed to SQLAlchemy. For that reason, we'll write a short and simple utility class in Python that does just that. The solution is not that Django-centric; with some small tweaks it can easily be used with pure Python or any other framework. But since we're bypassing the ORM entirely, the SQL statements are written and tested against the PL/pgSQL dialect. This means the syntax might change quite a bit, or may not be supported at all in another DBMS, depending on what you're using.

In order not to clutter this blog post with snippets, you can find the entire code in this gist. Let me briefly go over how to use the class and the steps that the execution goes through, from a bird's-eye view (a rough sketch of the class's shape follows the list):

  1. You will first need to create a BulkUpserter instance with the corresponding attributes, which are rather self-explanatory.
  2. To pull the trigger, you will have to call my_instance.upsert(). This method is the entry point, and it's the only one you will need to call directly. The rest of the code in the class serves this method, and we'll only go over it for explanation purposes.
  3. Inside the body of the upsert() method, we first create a temporary table within a context manager. The temporary table, among other things, offers the benefit of early data validation, as its definition is copied from the original table that we want to ultimately write to.
  4. Next, we populate the newly created temporary table with records, derived from the instance's data attribute, which holds a list of dicts representing records.
  5. Lastly, the data is written to the main table, using the ON CONFLICT clause to achieve upsert.
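
To make the flow above a bit more concrete, here is a heavily simplified sketch of the shape such a class could take. To be clear, this is not the gist's code: the attribute names, method names and SQL are illustrative, and details like validation, chunking and timestamp handling are left out.

```python
from contextlib import contextmanager

from django.db import connection


class BulkUpserter:
    """Illustrative skeleton only; the gist holds the full implementation."""

    def __init__(self, model, data, columns, on_conflict_col):
        self.table = model._meta.db_table       # e.g. "app_reading"
        self.data = data                        # list of dicts representing records
        self.columns = columns                  # e.g. ["sensor_id", "value"]
        self.on_conflict_col = on_conflict_col  # column carrying a UNIQUE constraint

    def upsert(self):
        # Entry point: create a temp table, fill it, then upsert into the main table.
        with connection.cursor() as cursor, self._temp_table(cursor) as temp:
            self._populate_temp_table(cursor, temp)
            self._write_to_main_table(cursor, temp)

    @contextmanager
    def _temp_table(self, cursor):
        temp = f"{self.table}_tmp"
        # Copy the main table's definition so that bad data fails early, in the temp table.
        cursor.execute(f"CREATE TEMPORARY TABLE {temp} (LIKE {self.table} INCLUDING DEFAULTS)")
        try:
            yield temp
        finally:
            cursor.execute(f"DROP TABLE IF EXISTS {temp}")

    def _populate_temp_table(self, cursor, temp):
        # Identifiers come from the model, not from user input, so plain interpolation is fine.
        cols = ", ".join(self.columns)
        row_tpl = "(" + ", ".join(["%s"] * len(self.columns)) + ")"
        placeholders = ", ".join([row_tpl] * len(self.data))
        params = [record[col] for record in self.data for col in self.columns]
        cursor.execute(f"INSERT INTO {temp} ({cols}) VALUES {placeholders}", params)

    def _write_to_main_table(self, cursor, temp):
        cols = ", ".join(self.columns)
        updates = ", ".join(
            f"{col} = EXCLUDED.{col}" for col in self.columns if col != self.on_conflict_col
        )
        cursor.execute(
            f"INSERT INTO {self.table} ({cols}) "
            f"SELECT {cols} FROM {temp} "
            f"ON CONFLICT ({self.on_conflict_col}) DO UPDATE SET {updates}"
        )
```

Usage then boils down to something like BulkUpserter(model=Reading, data=rows, columns=["sensor_id", "value"], on_conflict_col="sensor_id").upsert(), with rows being a list of dicts.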

 

Other Considerations

First and foremost, this is a bare-bones solution to the problem. It should work fine for simple use cases, but you may need to tweak and adjust it to your own situation. The goal is to provide a base structure you can build upon. What follows is a list of concerns and possible improvements that you need to be aware of before using the code.

Invalid input. One thing to keep in mind is that upserting will fail if the raw data contains two records with the same value for the on_conflict_col, in which case nothing will be written to the database. This is true whether or not that value already exists in the database. You would need to drop the duplicates before attempting to upsert. There are situations in which such an operation makes sense and you don't really lose much by dropping duplicated records; consider the implications for your case. Also, the on_conflict_col must represent a field with a UNIQUE constraint, otherwise upserting will fail as well. Which makes sense if you think about it, as this field is used to resolve the conflict and identify records.
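A simple way to do that, keeping only the last occurrence of each conflicting value (a sketch; whether keeping the last one is the right policy depends on your data):

```python
def drop_duplicates(data, on_conflict_col):
    """Keep only the last record seen for each value of the conflict column."""
    deduped = {record[on_conflict_col]: record for record in data}
    return list(deduped.values())


clean_rows = drop_duplicates(raw_rows, on_conflict_col="sensor_id")
```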

Benchmarks. As far as performance is concerned, I managed to upsert 3000 records in under a second on my local machine. And while that is not a real benchmark, as it depends on a lot of factors specific to my machine's hardware (CPU clock rate and number of cores, RAM frequency and latency, SSD lookup speed), software versions (Linux, Python, Django, Postgres, psycopg2 etc.), database state (indexes, query plan cache) and other things I may be forgetting, it's generally quite fast. You are advised to perform your own benchmarks to get a better idea.

Chunking. You would also probably want to add some chunking logic to the code if you're working with very large batches. As mentioned above, with raw data in the thousands it does not seem to struggle at all, and there's no significant memory or processor usage spike on my local machine. But as the number of records gets bigger, having everything execute in one go is not the best idea. The current structure hits the database only four times, so don't feel guilty about taking a few more trips, especially considering what you're gaining compared to the traditional approach. If you can't resist running the batches in parallel, though, keep in mind the class of problems that arise from parallel computing.
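If you do get there, a small helper that slices the raw data into fixed-size chunks and runs one upsert per chunk is usually enough (sketch; the constructor arguments follow the illustrative skeleton above, not necessarily the gist):

```python
def chunked(data, size=1000):
    """Yield successive slices of `data`, each holding at most `size` records."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


for chunk in chunked(clean_rows, size=1000):
    BulkUpserter(
        model=Reading,
        data=chunk,
        columns=["sensor_id", "value"],
        on_conflict_col="sensor_id",
    ).upsert()
```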

Auto-updated fields. Database tables more often than not will contain created_at and updated_at timestamp fields. While we usually don't have to think about that at the ORM level (models would likely be inheriting from a TrackedModel abstract model or something like that), when we're writing pure SQL we need to fill these fields explicitly, as Django-level defaults are not carried over to the database (Postgres can't make use of Python callables). To avoid having to write a custom migration that sets those database defaults for every tracked model using BulkUpserter, the class handles this itself. We first detect whether the model is tracked or not, and then fill created_at or updated_at, depending on which action will be applied to the record. The process works as expected even when the model isn't tracked. If for some reason this isn't something you'd like the code to do, removing that part should be fairly straightforward and leave you with a much cleaner class.
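In SQL terms, the idea boils down to something like the statement below for the final write, assuming the model was detected as tracked (again illustrative, using the hypothetical table names from earlier):

```python
# created_at is filled only for rows that end up being inserted; rows that hit the
# conflict go through the DO UPDATE branch, which refreshes updated_at and leaves
# the existing created_at untouched.
sql = """
    INSERT INTO app_reading (sensor_id, value, created_at, updated_at)
    SELECT sensor_id, value, now(), now() FROM app_reading_tmp
    ON CONFLICT (sensor_id)
    DO UPDATE SET value = EXCLUDED.value,
                  updated_at = now()
"""
```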