Choosing the right data type & means of generating unique primary keys

This article is one of several I've written while building a web app in public. It’s called Lumeno and you can follow its development on Twitter — @LumenoDev, or sign up to be notified when it is launched.

While building Lumeno, I’ve been taking some time to investigate the options for how to safely generate unique primary keys for database records. Most of the time, as a developer, you probably won’t need to think about this topic, but for those that do, this article is for you…

The voice of reason

Before we dig into the meat of this topic, it is worth pointing out that I’ve seen quite a few developers assume that they need to use a more complex system in order to generate unique identifiers for their primary keys.

If you’re building a system that is enterprise in nature, or you’re familiar with and need to use features like clusters, regions, replicas, and so on, then this is probably true. However, most of the time, developers are building modest applications powered by databases running on a single, modest server.

If this is you, then the default method of generating unique primary keys (auto-incrementing integers), should serve you just fine. Indeed, there are significant benefits to sticking with them…

Auto-incrementing integers

If you’re at least somewhat familiar with databases, then you’ll know that this is the standard way of generating unique keys. Behind the scenes, the system maintains an internal counter. When you insert a record, it increments the counter and sets the new record’s primary key to that value.

Is this a good approach?

In short, yes! While there is a slight overhead in maintaining the counter, in practice it’s negligible. On the plus side, your records will always be unique, easily-indexed and super fast (in fact, its always the fastest option).

When should I opt for the something different?

I’ve already mentioned a few scenarios when you might want to consider something different, but even in those cases, you should ask yourself if you can still stick with incrementing integers. Sometimes, simply adjusting your schema to use a larger data type e.g. BIG INT, is enough.

Even if it’s not such a minimal adjustment, in a lot of cases, if you design your application and database correctly, then you can still use them.

However, if you find you are making too many compromises, your app is becoming unwieldy, or your setup makes it too complex to use a database counter, then it may be time to reach for something else…

Alternatives… it’s a jungle out there

Okay, so you’ve decided that auto-incrementing integers are not for you. What do you use instead? Well, if you consult your database manual, you probably won’t get much help. Databases provide a default method. If you don’t want to use it, then you’re largely on your own. So, you go to the internet…

Congratulations! You’ve discovered endless different ways of generating a unique string of characters to use as a primary key and you come away with no clear understanding of which one to use. Fortunately, when it comes to a database at least, choosing one really depends on just four things:

Uniqueness

Every method of generating a key has a possibility of causing a collision (two generated keys being the same). Therefore, you want to choose a method with as low a risk of collision as possible.

Lexicography.

A complicated-sounding word that basically refers to some means of organization. In the database, we’re talking about being able to order our records using the key.

Efficiency

Depending on the method you choose, your database might already include a function to generate a key using the algorithm. However, even if it does, it may be better to generate it within your app so the DB has less to do. Note that there are tradeoffs with this approach e.g. not being able to do things like INSERT INTO SELECT.

Storage

Everything from the length of the key, to what it consists of, to how you store it in the database e.g. as an integer, a string, as binary etc.

Now we know what to look for, let’s examine some options. I’m only going to cover a few because (as you’d discover from research) there are only so many ways you can generate a string of characters. In other words, many of the different algorithms are really just variations on a theme.

UUIDS aka the gold standard

These are the granddaddy of unique keys. The acronym stands for universally unique identifier, which is quite frankly absurd. How can we possibly know it is unique in the universe? Microsoft opted to call it GUID, which is a globally unique identifier. That makes much more sense, but for whatever reason, outside of Microsoft (and its products), UUID is the term that stuck.

So, what is it?

A UUID is a 128 bit string of alphanumeric characters broken up by dashes to make it readable. 128 bits translates to 32 characters, so it looks like this:

123e4567-e89b-12d3-a456-426614174000

Simple enough, right? Well no. There are actually 5 different ways of creating a UUID. Seriously? But wait, it gets better!

Draft proposals exist for creating versions 6, 7 and 8. If you’re reading this and thinking to yourself, “how can there possibly need to be 8 different ways to create a string of characters?”, then I’m definitely with you.

I’m not going to go into the different algorithms used to generate the versions. All I’ll say is that if you want a truly random UUID, then you need to use V4.

How do I create them?

There are countless libraries available for every programming language you can probably think of. Further, many databases include native functions for generating them (you’ll need to check which versions they generate though). Laravel includes the Str::uuid() and Str::orderedUuid() functions.

ULIDS

This stands for universally unique lexicographically sortable identifier. It aims to improve on the current UUID formats by offering shorter, sortable keys.

ULIDs are 26 characters in length and use a format that includes a timestamp at the beginning (which makes them sortable). A ULID looks like this:

01ARZ3NDEKTSV4RRFFQ69G5FAV

ULIDs also include some other features, which make it more attractive than UUIDs such as case insensitivity, base32, and so on.

How do I create them?

The Github repo containing the specification for the format also maintains a list of implementations in different programming languages (the default is JavaScript, which makes it ideal for a lot of applications).

Snowflakes

This way of generating keys was developed by Twitter for their own systems and it has since been adopted by other big tech companies.

Unlike UUIDs and ULIDs, snowflake IDs only consist of numbers. In addition, they are the shortest at just 16 characters. A snowflake looks like this:

3398978255192064

Snowflake identifiers also have the benefit of starting with a timestamp, which makes them sortable.

How do I create them?

Twitter have archived the open-source repository containing the source code for their original implementation. It appears to have now been folded into their Twitter server project. However, many separate implementations exist for most of the major programming languages.

Which should we use?

The short answer is, it depends. Each method has benefits, whether it is ease-of-use, library availability, overall familiarity, performance, efficiency, the need to work with data already in a similar format, etc.

Since there is no “correct” way as such, I’ll explain which of them I chose and why I did. To do that, I’ll go back to the checklist I wrote earlier in the article:

Uniqueness

Each of the three implementations claims to handle this well, with all of them being able to generate thousands of keys per millisecond without collision. So in truth, this isn’t really a factor when choosing one over the others.

There are some limits when it comes to the maximum number of keys you can create, but these are way off in the future e.g. the year 2060 or more, and it is somewhat unlikely any system built today will still be in operation then, or at the very least, won’t have been significantly changed in the preceding years.

Lexicography

Certain implementations of UUID are sortable (versions 1 and 2), however since these versions depend on MAC addresses, they tend not to be used due to potential security risks. You’ll also find variations of version 4, which have reordered the key so that the timestamp is first, this makes it sortable, but concerns have been raised about collision / randomness as a result.

ULIDs were always designed to be sortable by beginning with timestamps. The same is also true of Snowflakes.

Efficiency

This one should be fairly obvious. Since UUIDs have the longest length, they take the longest to generate, followed by ULIDs, and then Snowflakes. I did some superficial benchmarks to see what the difference was and the results were quite striking. They might not seem like much, but at scale, when you need to generate thousands of records a second, it really adds up:

Standard UUID v4 - 5.6 milliseconds
Sortable UUID v4 - 8.3 milliseconds
ULID             - 3.1 milliseconds
Snowflake        - 0.4 milliseconds

In addition, since the primary keys will also serve as foreign keys, queries that use them as part of a join will be significantly faster if they are integers. Put simply, integers are better than strings in every metric and in any scenario.

Storage

Like efficiency, UUID takes up the most space, followed by ULID, and then Snowflake. It should also be noted that indexing of strings takes up a lot more space and is less efficient than indexing of integers.

For example, a 100,000 record index involving integer primary keys is a couple of megabytes, but for strings it can jump to several hundred. You can make this more efficient by converting the strings to binary, but then your app has to handle converting to and from binary which is cumbersome.

Finally, I seeded a test database of 100,000 users and several million records per ancillary table. With Snowflake IDs, the DB was roughly 2.7 GB. When using strings (UUID or ULID), it was around 6.2 GB. This might not seem like much, but the smaller the DB is, the more of it you can fit into memory, resulting in much better performance.

Conclusion

Of the three options, Snowflake is the clear winner. First, it is the shortest, which means it uses less space. It also begins with a timestamp, which makes it sortable. Finally, since it is an integer, it can be indexed faster, and queries that use it in joins will also be measurably faster.

Another added benefit, is that even though Snowflake uses integers, it does not use simple auto-incrementation, which means it is impossible to guess how many records are in the database table. While this is security through obscurity, it is a nice little cherry to have on the cake.