-
-
Save beginor/9d9f90bc58e1313f6aecd107f8296732 to your computer and use it in GitHub Desktop.
CREATE SEQUENCE public.global_id_seq; | |
ALTER SEQUENCE public.global_id_seq OWNER TO postgres; | |
CREATE OR REPLACE FUNCTION public.id_generator() | |
RETURNS bigint | |
LANGUAGE 'plpgsql' | |
AS $BODY$ | |
DECLARE | |
our_epoch bigint := 1314220021721; | |
seq_id bigint; | |
now_millis bigint; | |
-- the id of this DB shard, must be set for each | |
-- schema shard you have - you could pass this as a parameter too | |
shard_id int := 1; | |
result bigint:= 0; | |
BEGIN | |
SELECT nextval('public.global_id_seq') % 1024 INTO seq_id; | |
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis; | |
result := (now_millis - our_epoch) << 23; | |
result := result | (shard_id << 10); | |
result := result | (seq_id); | |
return result; | |
END; | |
$BODY$; | |
ALTER FUNCTION public.id_generator() OWNER TO postgres; | |
-- SELECT public.id_generator() |
shard_id is datacenter id, sec_id is worker id
you mean "seq_id" ? The seq_id here is not assigned a value, is its default value 0?
Sorry, I made a mistake, this is just a simplified version of snow flake id, the shard_id
should be datacenter id or worker id.
there is only one fixed part (should be the datacenter id or worker id) in the generated result, like this:
timestamp | shard_id | seq_id |
---|---|---|
1641784918526 | 020 | 113 |
1641785292978 | 020 | 114 |
Why do you call this generator Snowflake? It's a generator that Instagram once used a long time ago.
https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
@soroktu Yes, it is.
A warning for anyone using this: this implementation generates non-monotonic IDs but Twitter's implementation always generates monotonic IDs (per-shard). Whether or not this matters depends on your use-case.
(Monotonic means that on any given shard IDs which are generated later in time are always larger than older ones.)
Details of how this happens: so the first bytes (time, shard) are monotonic at least to the extent that time is monotonic (pretty close), but the potential problem is the last few bytes: nextval(seq_name) % 1024
. Say the sequence is currently at 1022 and we generate a few ids within one millisecond then the id sequence will be X + 1022
, X + 1023
, X + 0000
, X + 0001
. So later ids in the sequence are smaller than earlier ones.
Obviously it can also happen if the system clock isn't monotonic, Twitters implementation also prevents this.
👀😱
What’s the alternative? Do you have any solution to this? @davidshepherd7
Do you have any solution to this?
Basically: no.
Our most likely approach is to first figure out if we really need monotonicity, if we don't then we'll just continue using this. If we do then we'll switch to a simpler ID format, something like (shard << 10) | sequence
. We can do this because we don't strictly need the time part: we don't need compatibility with any other related ID generators and we don't need to spin up arbitrary new shards without communication. That would mean losing out on one other somewhat nice feature though: future IDs would become easily guessable (which snowflake IDs aren't, unless your volume is extremely high).
Twitter's implementation gets monotonicity by tracking the previous timestamp, and busy-waiting until the next millisecond if it would generate a non-monotonic ID. We don't think that's possible in postgres outside of a C extension because there's no sufficiently performant and transaction-ignoring way to share the previous timestamp to everywhere that would need it. If we found that we really needed exactly Twitters ID format we would probably go and write a C extension.
Here's an extension, but I'm not sure if that solves the problem: https://github.com/mausimag/pgflake
This is also interesting, but I think it probably suffers the same problem:
https://github.com/mastodon/mastodon/blob/main/lib/mastodon/snowflake.rb#L68-L106
Oh neat, it looks like pgflake probably does it right (from a quick glance). The part inside if (last_time == curr_time)
looks the same as twitter's implementation.
@davidshepherd7 , what implementation did you guys settle with? We are also in the same boat and deciding which id generation method to use out of the plethora of options available out there! What I like about this, as compared to pgflake, is that we dont need to build this. We can just create the function public.id_generator
when we are creating our db/tables on a new postgres instance.
But am worried about "generates non-monotonic IDs" part you mentioned. We already have a created_at
column in each table and use that when we need "proper" sorting. With that in mind can you elaborate on "we'll switch to a simpler ID format, something like (shard << 10) | sequence." part.?Did you go with that design, if yes, did you find any issue?
Thanks in advance!
We're in an awkward situation where we have since tables that need a monotonic Id still using sequential IDs and others using the approach in the snippet. Resolving this hasn't been a priority yet, I'm not sure where we'll end up.
JFYI, mausimag/pgflake#3
Want to point out that this generates ids with Instagram's layout which is slightly different from Twitter's (Instagram uses 13 bits for shard ID + 10 for seq number and Twitter uses 10 for a machine ID + 12 for a seq number). So if you want to generate snowflakes w/ the Twitter layout you should change the seq number to % 4096
(2^12) and change the shifts to << 22
and << 12
.
There's another extension for this here: https://github.com/pgEdge/snowflake-sequences I'm not sure if/how it's different to pgflake.
Can somebody hint how to get now_millis
from stored ID?
Since the row creation time is encoded in snowflake ID, it should be retrievable. I need it for example to "get only records created today".
I would like to ask the general snowflake id will have two variables, one is datacenterId, the other is workerId, which two variables should correspond here?