Published on: Nov 4, 2023
When working with data in digital applications, we often need a unique identifier to refer to individual records. The unique identifier allows us to distinguish between two records even if they might have identical data.
In some scenarios we can simply re-use some existing data for this. In a table of users, for example, we might use the email address as the unique identifier. In other cases, we need to generate a unique identifier ourselves.
The simplest way to generate a unique identifier is to count up from 1. This is how many databases work by default. The first record created in the database is given the ID 1, the second 2, and so on.
There are a couple of drawbacks to this approach. Firstly, it doesn't scale well. As your system grows, you might end up creating thousands or even millions of records per second. Your system will also likely be spread out geographically, with different servers handling different requests. This makes maintaining a single counter extremely difficult.
The second issue is security. If your system has a vulnerability that lets an attacker fetch records by ID, the attacker can pull every record by requesting ID 1, 2, 3 and so on. They only have to iterate over the numbers. This is called an enumeration attack.
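To make the enumeration attack concrete, here is a minimal Python sketch. The `records` store and the `fetch_record` function are hypothetical stand-ins for a vulnerable endpoint, not part of any real system.

```python
from itertools import count

# Hypothetical in-memory "database" keyed by sequential integer IDs.
records = {1: "alice", 2: "bob", 3: "carol"}

def fetch_record(record_id):
    """Stand-in for an endpoint that returns a record by ID with no access check."""
    return records.get(record_id)

def enumerate_records():
    """Walk the IDs 1, 2, 3, ... and stop at the first missing one."""
    stolen = []
    for record_id in count(1):
        record = fetch_record(record_id)
        if record is None:
            break
        stolen.append(record)
    return stolen

print(enumerate_records())
```

The attacker needs no knowledge of the data at all, only the fact that IDs start at 1 and increase by one.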
Having seen the issues with simple integer IDs, let's explore the more modern standard.
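UUID stands for Universally Unique Identifier. It is a 128-bit number that is typically represented as a string of 32 hexadecimal digits, split into 5 groups of 8, 4, 4, 4 and 12 digits separated by hyphens. For example: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, where each x is a hexadecimal digit.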
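As a quick illustration, most languages ship a UUID library. The sketch below uses Python's standard `uuid` module to generate one and check the layout described above. The variable names are only for illustration, and the printed UUID will differ on every run.

```python
import uuid

# Generate a random UUID with Python's standard library.
example = uuid.uuid4()
text = str(example)

print(text)                       # canonical 8-4-4-4-12 form
groups = text.split("-")
print([len(g) for g in groups])   # lengths of the five groups
print(len(example.hex))           # number of hex digits without hyphens
print(len(example.bytes) * 8)     # size in bits
```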
In fact, we have our very own random UUID generator you can play with here!
We'll go over the specific algorithms used to generate UUIDs later, but for now let's just say they're generated randomly. Being 128 bits long means the probability of ever picking two identical UUIDs is almost zero. This means we can easily scale our system, as IDs for objects can be generated anywhere at any time.
Not generating the IDs sequentially also means we don't have to worry about enumeration attacks. Even if an attacker can get into our system, they would have to guess the UUID of each record which will also be near impossible.
Let's take a look at how UUIDs are generated and how they've evolved over the years.
To keep track of the different UUID versions, the standard reserves the 13th hexadecimal digit (the first digit of the 3rd group) for a number representing the version. 4 bits are reserved for this. A further 2 bits, at the start of the 4th group, are reserved for the variant.
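A minimal Python sketch of reading those reserved bits back out of a UUID string. It assumes a V4 UUID in the common RFC 4122 layout, generated here with the standard `uuid` module. The variable names are illustrative.

```python
import uuid

u = uuid.uuid4()
text = str(u)

# Strip the hyphens and read the 13th hex digit (index 12): the version.
hex_digits = text.replace("-", "")
version_digit = hex_digits[12]

# The variant lives in the top bits of the 17th hex digit (index 16),
# which is the first digit of the 4th group. In the RFC 4122 layout
# those two bits are 1 and 0, so the digit is one of 8, 9, a or b.
variant_digit = hex_digits[16]
top_two_bits = int(variant_digit, 16) >> 2

print(version_digit, u.version)
print(variant_digit, bin(top_two_bits))
```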
V1 UUIDs were introduced in the late 1980s and were not actually random! They were generated by combining the system's current time with a "node identifier", i.e. something that identifies the specific device/server the ID was generated on. This was usually just the MAC address.
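The time and node parts can be pulled back out of a V1 UUID. The sketch below uses Python's `uuid.uuid1()` and relies on the V1 timestamp being a count of 100-nanosecond intervals since 15 October 1582. Note that Python falls back to a random node value when it cannot find a MAC address, so the node printed here is not guaranteed to be a real hardware address.

```python
import uuid
from datetime import datetime, timedelta, timezone

u = uuid.uuid1()

# The 60-bit timestamp counts 100-nanosecond intervals since
# 15 October 1582, the start of the Gregorian calendar.
gregorian_epoch = datetime(1582, 10, 15, tzinfo=timezone.utc)
created_at = gregorian_epoch + timedelta(microseconds=u.time // 10)

# The node is a 48-bit number and forms the last group of the UUID.
node_hex = f"{u.node:012x}"

print(u)
print(created_at)
print(node_hex)
```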
V2 came in the 1990s and was very similar to V1, with minor technical changes. We won't discuss it here.
V3 came in the late 1990s and is deterministic, based on 2 parameters: the namespace and the name. The namespace is a UUID itself and the name is just a string. The V3 UUID is generated by taking the MD5 hash of the combined namespace and name.
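The V3 construction can be sketched by hand in Python and compared against the standard library's `uuid.uuid3`. The name `example.com` is an arbitrary illustrative value, and `uuid.NAMESPACE_DNS` is one of the predefined namespaces that ships with the module.

```python
import hashlib
import uuid

namespace = uuid.NAMESPACE_DNS      # a predefined namespace UUID
name = "example.com"                # arbitrary illustrative name

# Hash the 16 namespace bytes followed by the UTF-8 bytes of the name.
digest = hashlib.md5(namespace.bytes + name.encode("utf-8")).digest()

# Passing version=3 overwrites the version and variant bits.
manual = uuid.UUID(bytes=digest, version=3)

builtin = uuid.uuid3(namespace, name)
print(manual)
print(manual == builtin)
```

Running this twice prints the same UUID both times, which is the point of a deterministic scheme.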
V4 came in the early 2000s and is arguably the most popular way to generate UUIDs. V4 UUIDs are just generated randomly! All 122 remaining bits (the total 128 bits minus the 6 version and variant bits) are generated randomly.
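As a rough sketch of what a V4 generator does, the Python below starts from 128 random bits and overwrites the 6 reserved ones. In practice you would simply call `uuid.uuid4()`. The manual version is only here to show where the bits go.

```python
import os
import uuid

# Start from 128 random bits.
raw = bytearray(os.urandom(16))

# Overwrite the 4 version bits (high nibble of byte 6) with 0100 = 4.
raw[6] = (raw[6] & 0x0F) | 0x40

# Overwrite the 2 variant bits (top of byte 8) with 10.
raw[8] = (raw[8] & 0x3F) | 0x80

manual = uuid.UUID(bytes=bytes(raw))
print(manual, manual.version)
```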
V5 is also from the early 2000s and is the newest of the five versions covered here. It was simply an iteration of V3 to use the SHA-1 hash instead of MD5.
Version 1 and 2 UUIDs allowed easy guarantees of uniqueness when using multiple machines. It also meant that the UUIDs themselves could be traced back not only to the time they were generated but also to the machine they were generated on. This was useful for debugging purposes, but in many scenarios it is a security concern, since every ID leaks when and where it was created.
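Version 3 and 5 UUIDs are deterministic: the same namespace and name always produce the same UUID. This makes it easy to detect duplicates of values you don't want repeated, such as user emails, because a duplicate value produces an ID you have already seen.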
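A small Python sketch of this duplicate detection using `uuid.uuid5`. The `USER_NAMESPACE` value, the `user_id` helper, and the choice to trim and lowercase emails before hashing are all assumptions made for illustration.

```python
import uuid

# Hypothetical application-specific namespace; any fixed UUID works
# as long as every caller uses the same one.
USER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "users.example.com")

def user_id(email):
    """Derive a stable ID from a normalised email address."""
    return uuid.uuid5(USER_NAMESPACE, email.strip().lower())

seen = set()
duplicates = []
for email in ["ann@example.com", "bob@example.com", "Ann@Example.com "]:
    uid = user_id(email)
    if uid in seen:
        duplicates.append(email)
    seen.add(uid)

print(duplicates)
```

The third address differs only in case and trailing whitespace, so it maps to the same UUID as the first and is reported as a duplicate.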
Version 4 UUIDs are the simplest to generate and are the most popular. Because they are generated randomly, collisions are extremely unlikely. The birthday paradox tells us that you'd have to generate around 2.7 quintillion (2,700,000,000,000,000,000) random UUIDs before you'd have even a 50% chance of a single collision!
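The birthday bound can be estimated with the standard approximation n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible values and p is the target collision probability. The Python sketch below applies it to the 122 random bits of a V4 UUID. The function name is illustrative.

```python
import math

RANDOM_BITS = 122                 # bits left after version and variant
space = 2 ** RANDOM_BITS          # number of possible V4 UUIDs

def uuids_for_collision_probability(p):
    """Approximate n such that n random draws collide with probability p."""
    return math.sqrt(2 * space * math.log(1 / (1 - p)))

n_half = uuids_for_collision_probability(0.5)
print(f"{n_half:.3e}")
```

For p = 0.5 this comes out at roughly 2.7 × 10^18 UUIDs.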
In Crudly, we use V4 UUIDs for all our unique identifiers. Every entity in a table will automatically get a random UUID generated that can be used to identify it. This means you can easily scale your system without worrying about collisions.
There are many ways to uniquely identify records in your database. V4 UUIDs are arguably the easiest and best scaling. By using Crudly to manage your data, you'll have an easy-to-use and scalable way to identify your entities, ensuring both security and efficiency in your application's backend.