I'm hunting for the best solution on how to handle keeping large sets of DB records "sorted" in a performant manner.
Most of us have work on projects at some point where we have needed to have ordered lists of objects. Whether it be a to-do list sorted by priority, or a list of documents that a user can sort in whatever order they want.
A traditional approach for this on a Rails project is to use something like the acts_as_list
gem, or something similar. These systems typically add some sort of "postion" or "sort order" column to each record, which is then used when querying out the records in a traditional order by position
SQL query.
This approach seems to work fine for smaller datasets, but can be hard to manage on large data sets with hundreds (or thousands) of records needing to be sorted. Changing the sort position of even a single object will require updating every single record in the database that is in the same sort group. This requires potentially thousands of writes to the database for every sort order update.
One typical alternative approach is to store a single, serialized column that contains the full sort order (for instance a serialized array column that contains the ordered list of record ids). Using this approach, we are now required to query the records out from the database using this specified order.
Ideally, it would be nice if the database just used the order of the in clause so a query of select * from foo where id in (67,23,1362,24)
returned the records in that order, but I'm not aware of a database that does this.
MySQL has a built-in ORDER BY Field()
clause where you could do something like ORDER BY FIELD(67,23,1362,24)
PostgreSQL doesn't seem to have any similar functions. There are two somewhat widely discussed solutions that I have come across.
- Using the PostgreSQL
CASE
clause like the following:
SELECT * FROM foo ORDER BY CASE WHEN id=67 THEN 1 WHEN id=23 THEN 2 WHEN id=1362 THEN 3 WHEN id=24 THEN 4 END
- Using a custom
JOIN
table like the following:
SELECT * FROM foo JOIN (VALUES (67,1), (23,2), (1362,3), (24,4)) as x(id, ordering) ON foo.id = x.id ORDER BY x.ordering
I'm currently using solution #2 as it was easier to implement as a module that I can include in any AR::Base
class and have it work well with other scopes.
I'm wondering if there is an obvious solution I'm just missing, or any alternatives or alternate implementations of the db schema to begin with that might make this a non-issue.
What are the best ways to keep a large set of data sorted according to user specified order with PostgreSQL?
For this particular use case that I'm dealing with, we allow sorting more than one object at a time. The objects (images) are laid out in a grid on the UI and the user can select a single image to reorder via drag and drop, or they can select multiple images to be reordered via drag and drop at the same time. This makes using something like the acts_as_list#insert_at
method less useful. I suppose we could call #insert_at
or manually #update_attribute(:position, N)
one at a time for each item that was moved in position, but I'd like to do it all in one fell swoop if possible.
I think it depends a lot on the size of your list. For example, if N < 100, you'll get lots of reasonable behavior out of a serialized sort array. You can even do the sorting in ruby without much problem. You do end up reading and writing the list a lot, but it's small, so who cares?
OTOH, if N = 100000, you have some not-so-nice behaviors from a serialized array. Now the single reads and writes are larger, parse times are slower, etc. If you want to do things like get items 30000-30099 to show on a page, it's gonna get painful.
However, if you have large N, you can still get great performance out of an indexed column, while still keeping the number of records updated equal to the number of records being moved. You just have to use a sort column with a datatype with a continuous sort function, like float or string (not integer). Then, there are various convenient ways to update the sort column to reorder, depending on how much integrity you want to enforce.
Note: if you consider the pathological case, with very large datasets, and with extremely bad reordering operations (e.g. swapping two adjacent items 2^^64 times) you can run into problems, either with collisions in the sort key (float), or ballooning space requirement (string). In practice, this isn't really a concern, and the sort column can be renormalized periodically anyway to avoid such issues.