wood_spirit 3 days ago

I am a big fan of the new uuid v7 format.

It has the advantage of being a drop-in replacement in most places where people use v4 today. It also has an advantage over other specs such as ULID: it can be parsed easily even in languages and databases with no libraries, because you just need some obvious substr, replace and from_hex calls to extract the timestamp. Other specs typically use some custom lexically sortable base64 or something that always needs a library.
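
A minimal sketch of that extraction in Python, assuming the canonical 8-4-4-4-12 hex text form and the v7 layout in which the first 48 bits are a Unix timestamp in milliseconds. The function names are illustrative, and the sample value is the UUIDv7 example given in RFC 9562.

```python
from datetime import datetime, timezone


def uuid7_timestamp_ms(text: str) -> int:
    """Return the millisecond Unix timestamp stored in a UUIDv7 string."""
    hex_digits = text.replace("-", "")   # "replace": drop the dashes
    return int(hex_digits[:12], 16)      # "substr" + "from_hex": first 48 bits


def uuid7_datetime(text: str) -> datetime:
    """Return the embedded timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(uuid7_timestamp_ms(text) / 1000, tz=timezone.utc)


sample = "017F22E2-79B0-7CC3-98C4-DC0C0C07398F"
print(uuid7_timestamp_ms(sample))  # 1645557742000
print(uuid7_datetime(sample))      # 2022-02-22 19:22:22+00:00
```

The same three steps map onto most SQL dialects' string and hex functions, which is the point being made here.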

Early drafts of the spec included a few bits to increment if there were local ids generated in the same millisecond for sequencing. This was a good fit for lots of use cases like using the new ids for events generated in normal client apps. Even though it didn’t make the final spec, I think it is worth implementing, as it doesn’t break compatibility.
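
A rough sketch of what such a local sequence could look like in Python. It assumes a 12-bit counter stored in the bits right after the version nibble (the field the spec calls rand_a), reset whenever the millisecond changes. The class name, the counter width and the overflow handling are assumptions for illustration, not the wording of any draft.

```python
import os
import threading
import time
import uuid


class LocalUuid7Generator:
    """UUIDv7-shaped IDs with a per-millisecond counter in the 12 rand_a bits."""

    def __init__(self, clock_ms=None):
        # clock_ms is injectable so the behaviour can be tested deterministically.
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()  # the counter is shared state, so it needs a lock
        self._last_ms = -1
        self._counter = 0

    def next(self) -> uuid.UUID:
        with self._lock:
            now = self._clock_ms()
            if now > self._last_ms:
                self._last_ms = now
                self._counter = 0
            else:
                # Same millisecond (or the clock stepped back): bump the counter.
                self._counter += 1
                if self._counter > 0xFFF:
                    # Counter exhausted: borrow the next millisecond.
                    self._last_ms += 1
                    self._counter = 0
            ms, counter = self._last_ms, self._counter
        rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
        value = (
            ((ms & 0xFFFFFFFFFFFF) << 80)  # 48-bit millisecond timestamp
            | (0x7 << 76)                  # version 7
            | (counter << 64)              # 12-bit local sequence
            | (0x2 << 62)                  # RFC variant bits "10"
            | rand_b                       # 62 random bits
        )
        return uuid.UUID(int=value)
```

IDs from one generator then sort in creation order even within a single millisecond, while the 62 random bits still separate IDs produced by different machines.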

  • sedatk 3 days ago

    There’s already a 72-bit random part. That should be sufficient to address conflicts.

    Incrementing a sequence completely kills the purpose of a UUID, and requires serialization/synchronization semantics. If you need that, just use a long integer.

    • wood_spirit 2 days ago

      There is utility in knowing that event a comes before event b in the same local system even if both are generated in the same millisecond. I have found this useful, e.g. when UI latency gets so low that you can have a user interaction and a menu opening in the same millisecond. Being able to plot them on a timeline without any kind of joins is nice.

      Anyway, as I said, it was dropped from the spec

    • Retr0id 2 days ago

      If you have a billion users and they each generate 64 random 72-bit numbers, you have a ~35% chance of a collision.

      • cowsandmilk 2 days ago

        If they all did that in the same millisecond?

        • zer00eyz 2 days ago

          Today we're at 768 threads on the latest AMD system. Sub-millisecond performance is possible (I don't know with this algo).

          If you got a spare 50k kicking around we could set up a test system and find out how likely it is to happen...

          • asah 2 days ago

            Billion in one millisecond...

        • Retr0id 2 days ago

          Ah, yeah, ignore me.

    • n42 2 days ago

      What do you consider the purpose of a UUID?

      • sedatk 2 days ago

        Asynchronous unique ID generation.

        • HexDecOctBin 2 days ago

          You can have both asynchrony and sequence by encoding thread ID in the UUID too, and make the sequence a thread local state.

          • sedatk 2 days ago

            Then, just use thread ID and integer sequence pairs instead of trying to stuff them into an arbitrary binary format.

            • HexDecOctBin 2 days ago

              Thread IDs get repeated across reboots. Integer sequence may also repeat in a distributed scenario, unless you want a massive bottleneck. You do need other stuff (timestamps, random number, etc.).

              • sedatk 2 days ago

                Yes, but that’s what parent proposed? :shrug:

oezi 3 days ago

I have recently wondered why Ruby on Rails is using a full-length SHA256 for their ETag fingerprinting (64 characters) when a UUID at 36 chars would probably be entirely enough to prevent collisions and be more readable at the same time. Esbuild on the other hand seems to use just 32 bits (8 chars) for their content hash.
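
For comparison, a small Python sketch of a content-derived fingerprint at the three lengths mentioned. The helper name and the truncation by hex prefix are assumptions for illustration, not what Rails or esbuild actually do.

```python
import hashlib


def content_fingerprint(content: bytes, hex_chars: int = 64) -> str:
    """Return the first hex_chars hex digits of the SHA-256 of content."""
    return hashlib.sha256(content).hexdigest()[:hex_chars]


css = b"body { margin: 0; }\n"
full = content_fingerprint(css)       # 64 hex chars = 256 bits
short = content_fingerprint(css, 32)  # 32 hex chars = 128 bits, a UUID's bit budget
tiny = content_fingerprint(css, 8)    # 8 hex chars = 32 bits
print(full, short, tiny, sep="\n")
```

Whatever the length, identical bytes always produce the identical tag, which a random or time-based UUID would not.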

  • nertzy 3 days ago

    Isn’t it because you can generate the same content two different times and hash it and come to the same ETag value?

    Using UUID here wouldn’t help because you don’t want different identifiers for the same content. Time-based UUID versions would negate the point of ETag, and otherwise if you use UUIDv8 and simply put a hash value in there, all you’re doing is reducing the bit depth of the hash and changing its formatting, for limited benefit.

    • oezi 3 days ago

      I would assume that you would only create a new UUID if the content of the tagged file changed serverside.

      Benefits are readability and a reduced amount of data to be transferred. UUID is reasonably safe to be unique for the ETag use case (I think 64 bits actually would be enough).

      • ninkendo 3 days ago

        The point of the content hash is to make it trivial to verify that the content hasn’t changed from when its hash was made. If you just make a uuid that has nothing to do with the file’s contents, you could easily forget to update the UUID when you do change its content, leading to invalid caches (or generate a new UUID even though the content hasn’t changed, leading to wasteful invalidation.)

        Having the filename be a simple hash of the content guarantees that you don’t make the mistakes above, and makes it trivial to verify.

        For example, if my CSS files are compiled from a build script, and a caching proxy sits in front of my web server, I can set content-hashed files to infinite lifetime on the caching proxy and not worry about invalidating anything. Even if I clean my build output and rebuild, if the resulting CSS file is identical, it will get the same hash again, automatically. If I used UUIDs and blew away my output folder and rebuilt, suddenly all files have new UUIDs even though their contents are identical, which is wasteful.

      • vlovich123 3 days ago

        SHA256 has the benefit that you can generate the ETag deterministically without needing to maintain a database (i.e. content-based hashing). That way you also don’t need to track whether the content changes, which reduces bugs that might creep in with UUIDs. Also, if typically you only update a subset of all files, then aside from not needing to keep track of assigned UUIDs per file, you can do a partial update. Reasons to do content-based hashing are not invalidated because of a new UUID format.

  • paulddraper 2 days ago

    UUIDs and hashes are not the same.

    For example, hashes are often taken over untrusted data, which could be manipulated to produce a collision.

    UUIDs aren't meant to protect against that.

    I'm sure RoR just did the straightforward thing, didn't get cute, and called it a day.

  • stouset 2 days ago

    For the same reason that git blobs are identified by their SHA and not a synthetic identifier. It’s a content hash.

sedatk 2 days ago

I don’t understand the part where monotonicity of UUIDs is discussed. UUIDs should never be assumed monotonic, or in a specific format per se. If you strictly need monotonicity, just use an integer counter. Let UUIDs be black boxes, and assume that v7 is just a better black box that deals with DB indexes better.

  • sgarland 2 days ago

    The nice thing about them is you don’t have to assume, though, because the version is baked into an octet. Does the 3rd field start with a 4? v4. 7? v7. Etc.

    Re: monotonicity, as I view it, v7 is the best compromise I can make with devs as a DBRE where the DB isn’t destroyed, and I don’t have to try to make them redesign huge swaths of their app.

    • sedatk 2 days ago

      The part I'm talking about proposes "counters" in UUID, not just date/time.

  • paulddraper 2 days ago

    The monotonicity can be useful in multiple contexts: colocating database data by time, providing "sooner than" comparisons.

    Integers are monotonic but can't be distributed like UUIDs.

    Unless you make them 128 bits ;)

    As usual, most people are not dumb most of the time, even if it seems that way.

    • sgarland 2 days ago

      > [integers] can’t be distributed like UUIDs

      They can, to an extent. The use of integers as a primary key has been a solved problem for quite some time, usually by either interleaving distribution among servers, or a coordinator handing chunks out.

      If you mean enabling the ability to do joins across physical databases, my counter to that is it’s an unsupported method by any RDBMS, and should be discouraged. You can’t have foreign key constraints across DBs, and without those, I in no way trust the application to consistently do the right thing and maintain referential integrity. I’ve seen too many instances of it going wrong.

      The only way I can see it working is something involving Postgres’ FDW, but I’m still not convinced that could maintain atomic updates on its own; maybe with a Pub/Sub in addition? This rapidly gets hideously complicated. Just design a good, normalized schema that can maintain performance at scale. Then when/if it doesn’t, shard with something that handles the logic for you and is tested, like Vitess or Citus.

      • paulddraper 2 days ago

        For example, imagine a client that can generate a UUID and at a later time save that to remote database.

        Or imagine two separate databases that get merged.

        • sgarland 2 days ago

          > For example, imagine a client that can generate a UUID and at a later time save that to remote database.

          DBs can return inserted data to you; Postgres and SQLite can return the entire row, and MySQL can return the last generated auto-increment ID.

          > Or imagine two separate databases that get merged.

          This is sometimes a legitimate need, yes, but it does make me smirk a bit since it goes against the concept of microservices owning their own domain (which I never thought was a great idea for most). However, it’s also quite possible to merge DBs that used integers. Depending on the number of tables and their relationships (or rather, lack of formally defined ones) it may be challenging, but nothing that can’t be handled.

          I mostly just question the desire to dramatically harm DB performance (in the case of UUIDv4) for the sake of undoing future problems more easily.

          • paulddraper 21 hours ago

            An example of the latter is when I worked with healthcare systems.

            It was not uncommon for systems to merge datasets, either due to literal M&A or to share records and coordinate care.

            A globally unique ID was important, despite not having a globally centralized system.

          • paulddraper 2 days ago

            > DBs can return inserted data to you; Postgres and SQLite can return the entire row, and MySQL can return the last generated auto-increment ID.

            Assuming you can+want to talk to a database right then.

            The useful part of UUIDs is that they can be generated anywhere, locally, remotely, same DB, separate DB, online, offline, and never change.

    • sedatk 2 days ago

      > colocating database data by time, providing "sooner than" comparisons.

      If you need to perform date/time related operations, use date/time related data types, not an unrelated type that happens to have some arbitrary timestamp embedded in its binary layout.

      > Integers are monotonic but can't be distributed like UUIDs.

      Yes, use UUIDs if you need distribution, use integers if you need monotonicity. If you need "monotonic and distributed", you need an external authority for proper distribution of those IDs. Then, an integer would still work.

      • paulddraper 2 days ago

        > use date/time...not...timestamp

        :/

    • cm2187 2 days ago

      And if you have a clustered index like in MS SQL Server, a non-monotonic UUID results in inserting the data in the middle of the table (bad performance) rather than appending to the end.

      • Tostino 2 days ago

        For the Postgres fans out there, it also kills performance on that side of the fence. You have things like WAL amplification due to using things like UUID v4 (random prefix). I think v7 should greatly help with that.

        • sgarland 2 days ago

          It also hurts query performance in some circumstances even in Postgres, due to the Visibility Map.

  • bongodongobob 2 days ago

    Integer counters are a problem because they leak information. In most cases I've encountered that's not acceptable.

    • sgarland 2 days ago

      So don’t expose them in the URL. Or have separate internal and external IDs. So many options that don’t destroy B+trees.

    • switch007 2 days ago

      UUIDv7, for example, leaks the timestamp

      I’ve met more than one architect who hand-waves that fact away during a “leaking integers is bad!” campaign

    • sedatk 2 days ago

      Monotonic UUIDs leak information too.

pphysch 3 days ago

For v7, the last chunk of bits (rand_b) can be "pseudorandom OR serial". There is no flag bit that must indicate which approach was used.

Therefore, given a compliant UUIDv7 sample, it is impossible to interpret those bits. You can't say whether they are random or serial without knowing the implementation or running a statistical analysis of consecutive samples. It's a black box.

The standard would be improved if it just said those bits MUST be uniquely generated for a particular timestamp (e.g. with PRNG or atomic counter).

Logically, that's what it already means, and it opens up interesting v8-style application-specific usages of those bits (like encoding type metadata in a small subset, leaving the rest random), while also complying with the otherwise excellent v7 standard.
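
A sketch in Python of the "type metadata in a small subset" idea, assuming the top 6 bits of rand_b carry a type tag and the remaining 56 bits stay random. The tag table, the field split and the function names are hypothetical, and a reader can only decode the tag if they already know this layout was used.

```python
import os
import time
import uuid

# Hypothetical type tags; each must fit in 6 bits (0..63).
TYPE_TAGS = {"user": 1, "post": 2, "comment": 3}
TAG_NAMES = {code: name for name, code in TYPE_TAGS.items()}


def tagged_uuid7(kind: str, now_ms=None) -> uuid.UUID:
    """Build a v7-shaped UUID whose rand_b starts with a 6-bit type tag."""
    ms = time.time_ns() // 1_000_000 if now_ms is None else now_ms
    rand_a = int.from_bytes(os.urandom(2), "big") & 0xFFF  # 12 random bits
    rand_tail = int.from_bytes(os.urandom(7), "big")       # 56 random bits
    rand_b = (TYPE_TAGS[kind] << 56) | rand_tail           # 6 + 56 = 62 bits
    value = (
        ((ms & 0xFFFFFFFFFFFF) << 80)  # 48-bit millisecond timestamp
        | (0x7 << 76)                  # version 7
        | (rand_a << 64)
        | (0x2 << 62)                  # RFC variant bits "10"
        | rand_b
    )
    return uuid.UUID(int=value)


def tag_of(u: uuid.UUID):
    """Return the type name stored in the tag bits, or None if unknown."""
    return TAG_NAMES.get((u.int >> 56) & 0x3F)


post_id = tagged_uuid7("post")
print(post_id, tag_of(post_id))
```

The result still reports version 7 and sorts by time like any other v7 value; the cost is 6 fewer random bits per ID.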

  • sedatk 3 days ago

    Serial is just a terrible idea for UUID. UUIDs shouldn’t require synchronization to be generated.

refulgentis 3 days ago

v7 is really helpful for meaningful UX improvements.

ex. I'm loading your documents on startup.

Eventually, we're going to display them as a list on your home screen, newest to oldest.

Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUID v7 that's in the filename.

Is it perfect? No, e.g. we could have a really old doc that's the most recently modified, and the doc ID is a proxy for the creation date.

But it's much better than the status quo of "we're parsing 1000+ docs at ~random at startup, please wait 5 seconds for the list to stop updating over and over."
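
A minimal Python sketch of that ordering, assuming filenames of the form `<uuid>.<extension>`. The function names are illustrative, and files whose names are not valid v7 UUIDs are simply listed last so another date source can handle them.

```python
import uuid


def created_ms(filename: str):
    """Return the UUIDv7 timestamp embedded in the filename, or None."""
    stem = filename.rsplit(".", 1)[0]
    try:
        parsed = uuid.UUID(stem)
    except ValueError:
        return None                # not a UUID at all
    if parsed.version != 7:
        return None                # a UUID, but without a v7 timestamp
    return parsed.int >> 80        # top 48 bits: Unix time in milliseconds


def newest_first(filenames):
    """Sort v7-named files newest to oldest; unrecognised names go last."""
    dated, undated = [], []
    for name in filenames:
        ms = created_ms(name)
        if ms is None:
            undated.append(name)
        else:
            dated.append((ms, name))
    dated.sort(reverse=True)
    return [name for _, name in dated] + undated


docs = [
    "017f22e2-79b0-7cc3-98c4-dc0c0c07398f.json",
    "notes.json",
    "018f22e2-79b0-7cc3-98c4-dc0c0c07398f.json",
]
print(newest_first(docs))
```

This orders by creation time only, so it is a first pass for the list rather than a replacement for the real modified date.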

  • canadiantim 2 days ago

    Tho presumably the uuid would give you the creation date but not the modified date. Still very useful.

  • sedatk 2 days ago

    Or just use the file date.

    • JadeNB 2 days ago

      > Or just use the file date.

      Your parent says they don't want to wait for the file system:

      > Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUID v7 that's in the filename.

      • sedatk 2 days ago

        > Your parent says they don't want to wait for the file system:

        That information comes for free when you're iterating files in a directory. There's no extra waiting than the file name itself because file dates are kept in the same structure that keeps the file names.

        • nemothekid 2 days ago

          This doesn't seem to be true in the standard directory implementations for iterating a directory in Python or Rust.

          • sedatk 2 days ago

            You're right. It's free on Windows, but isn't on Unix apparently. https://doc.rust-lang.org/std/fs/struct.DirEntry.html#method...

            Still, I find that justification to rely on certain binary format of an ID format weird. Just use the dates in filenames if you truly need such a mechanism.

            • refulgentis 2 days ago

              What if I need an ID in the filename and I get the date sorting for free?

              You know, like described in the OP.

              Is it okay if that's useful?

              • sedatk 2 days ago

                Yes. I agree with the sentiment that UUIDv7 being chronological can be useful. But, in this specific example, I think it’s a design smell to design your feature around the format of the filename and UUID generation algorithm. I’d say, wait for the FS if you have to instead of creating failure-prone dependencies like that.

                • refulgentis 2 days ago

                  What part is failure prone?

                  What is it being relied upon for?

                  Alternatively, more explicitly, let's look at it from this angle:

                  Let's follow exactly what you're recommending: parse it from the file.

                  Then add a fault-tolerant layer in front that parses a UUID-v7 from the filename.

                  What do you think of that?

                  • sedatk 2 days ago

                    The assumption that the app files would always use the same format and the same UUID algorithm in that format is a totally unnecessary tight coupling for a “loading UI”. The potential future cost isn’t worth it.

                    Adding layers, etc. Again, it’s a loading UI.

                    Obviously, we’re talking about a fantasy app here. I’m weighing options based on my understanding of it.

                    • refulgentis 2 days ago

                      Gotcha, a better name for it is fantasy app.

                      Let's have the fantasy app do exactly as you're recommending.

                      Now, the fantasy app also happens to store its file using this filename format: {uuid}.json

                      What objections are there to parsing the uuid from the filename and using it to sort?

                      Assuming you again mention that the filename might not be a valid UUID:

                      Is it possible to account for that and fall back to the safe behavior? :)

                      • sedatk 2 days ago

                        No, because you wouldn’t know if UUID algorithm was changed. It’s a completely unnecessary coupling, like tying your shoelaces together before running.

                        • refulgentis 2 days ago

                          Reductio ad absurdum: same argument applies to any persisted UUID.

                          Do you understand? On second read, could be too short and unnecessarily Latin-y. :)

maxfurman 3 days ago

I'm having trouble understanding the use of v8. It can be pretty much any bits as long as it has 1000 in the right spot? It strikes me as too minimal to be useful. I must be missing something
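
To make the question concrete, a minimal Python sketch of building a v8 value by hand: take any 128-bit payload and overwrite only the version nibble (1000) and the two variant bits (10). The function name is an assumption for illustration, and the bits are set manually because not every Python release accepts `version=8` in the `uuid.UUID` constructor.

```python
import uuid


def make_uuid_v8(payload: int) -> uuid.UUID:
    """Force the version and variant fields on an arbitrary 128-bit payload."""
    value = payload & ((1 << 128) - 1)
    value &= ~(0xF << 76)   # clear the 4 version bits
    value |= 0x8 << 76      # version 8 = binary 1000
    value &= ~(0x3 << 62)   # clear the 2 variant bits
    value |= 0x2 << 62      # RFC variant = binary 10
    return uuid.UUID(int=value)


print(make_uuid_v8(0))               # 00000000-0000-8000-8000-000000000000
print(make_uuid_v8((1 << 128) - 1))  # ffffffff-ffff-8fff-bfff-ffffffffffff
```

The other 122 bits pass through untouched, so their meaning is entirely up to the application.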

  • SigmundA 3 days ago

    The useful part is you can do anything you want with the other bits and have it still be a valid UUID.

  • yardstick 2 days ago

    Being able to do anything with the remaining bits is very useful.

    You can do any scheme that suits your individual feature's needs, and it will be a valid UUID still.

    This also means future schemes can be implemented right now without having to get a formal UUID version.

    You could use the first few bits to indicate production vs qa vs dev data.

    Or a subtle hint to what it might be for (eg is this UUID a product identifier or a user identifier or a post or a comment etc). Similar to how AWS etc prefix IDs with their type.

    • edflsafoiewq 2 days ago

      But it kind of defeats the purpose of encoding which of a fixed set of generation methods you are using in the ID, which is presumably to avoid having to check that none of the O(N^2) pair-wise combinations of N methods produce collisions.

SturgeonsLaw 5 days ago

[flagged]

  • lukev 3 days ago

    The cool thing about the various versions of UUID is that they're all compatible. The differences almost all come down to database locality (and therefore performance.)

    The exception is if you're extracting the time portion of a time-based UUID and using it for purposes other than as a unique key, but in my experience this is typically considered bad practice and time is usually stored in a separate column for cases where it matters for business purposes.

  • ktm5j 3 days ago

    Well, technically this is all about different versions of the same standard.

  • paulddraper 2 days ago

    But no one ever said UUID v_ replaces all the others.

    They aren't "versions" so much as variants.

  • refulgentis 3 days ago

    It's not necessarily that it's dismissive, more so that it's a fuzzy pattern-matching comment that's incorrect, and just a wordless link. Trivial to make, nontrivial to respond to: "Funny", in the way in-group cultural references usually are - responding means you're taking it too seriously. Yet, incorrect enough that it'll misinform anyone who isn't diligently reading the full article and doesn't understand the historical context. Noise that's likely to generate noise. Trolling, just missing active intent to derail.