« Brother UUIDs », or why UUIDs are not unique

(Two years later: I was wrong. I opened some RFC comment and when staring a lot at the RFC it turns out there's no problem.)

The RFC problem

This is an actual problem in the RFC, that should be fixed one day, hopefully. Check out the « 4.1.1. Variant » section, applicable for RFC-compliant UUIDs, which defines the bits that indicate if an UUID is RFC-compliant.

Obviously this serves no purpose (#tautology) since a non-compliant UUID can accidentaly have two bits sets to «RFC-compliant». You would then have a non-compliant compliant UUID. 🙃

4.1.1.  Variant

   The variant field determines the layout of the UUID.  That is, the
   interpretation of all other bits in the UUID depends on the setting
   of the bits in the variant field.  As such, it could more accurately
   be called a type field; we retain the original term for
   compatibility.  The variant field consists of a variable number of
   the most significant bits of octet 8 of the UUID.

   The following table lists the contents of the variant field, where
   the letter "x" indicates a "don't-care" value.

   Msb0  Msb1  Msb2  Description

    0     x     x    Reserved, NCS backward compatibility.

    1     0     x    The variant specified in this document.

    1     1     0    Reserved, Microsoft Corporation backward
                     compatibility

    1     1     1    Reserved for future definition.

   Interoperability, in any form, with variants other than the one
   defined here is not guaranteed, and is not likely to be an issue in
   practice.

So, there are « don't-care » values. How great. Let's consider the version-specific issues:

Non-compliant UUIDs (NCS backward compatibility)

Nothing to say here, they have been defined once and won't be generated anew.

UUIDv0

All bits are set to 0, nothing to see here.

UUIDv1

This could be problematic, since the « Msb2 » bit isn't defined, but as it overlaps with the « Clock ID » short integer, it's not an issue. Once again: once generated, an UUID won't be generated anew.

Do you see where we're going ?

UUIDv2

They don't even exist.

Things to see here : « Less than nothing - Hegel and the shadow of dialectal materialism - Slavoj Ẑiẑek ».

UUIDv4

Fully random. We don't care about random bits. Same rule applies.

UUIDv3 and UUIDv5 : the real problem

There we go.

These are made using the output of a hash function (MD5 or SHA1).

1
2
3
uuid = hash( namespace || payload )
uuid = bitmask (uuid, version) # easy, that's a nibble to define
uuid = bitmask (uuid, variant) # how to handle the « don't-care » values ?

You have three options :

  • Set the « don't-care » bit to 0
  • Set the « don't-care » bit to 1
  • Leave the « don't-care » bit untouched

Fun fact:

For both SHA1 and MD5, different implementations can have different results. Let's have a look at the Null UUID namespace, and the payload « D » (One byte, 0x44).

SHA1:

$ python3 -c 'import uuid;print(uuid.uuid5(uuid.UUID("0"*32),"D"))'
0bbecd2a-c3aa-52d5-b2ae-fd1832e84cf0
$ uuidgen --namespace 00000000-0000-0000-0000-000000000000 --name '44' --sha1 -x
0bbecd2a-c3aa-52d5-92ae-fd1832e84cf0

MD5:

$ python3 -c 'import uuid;print(uuid.uuid3(uuid.UUID("0"*32),"D"))'
57e30c8f-1e14-3114-a0ed-bd76e4f49aba
$ uuidgen --namespace 00000000-0000-0000-0000-000000000000 --name '44' --md5 -x
57e30c8f-1e14-3114-80ed-bd76e4f49aba

One-liner highlighting the differences :

$  python3 -c 'p="D";import uuid as u;import subprocess as S;n=u.UUID("0"*32);print([list(filter(lambda x:x[0]!=x[1],zip(S.getoutput("uuidgen -n{} -N{} -{}".format(n,p,a)),str(f(n,p))))) for a,f in [["m",u.uuid3],["s",u.uuid5]]])'
[[('8', 'a')], [('9', 'b')]]

uuid.py only sets the right two bits to what they need to be :

1
2
3
4
5
6
# File: /usr/lib/python3.6/uuid.py
#
# Set the variant to RFC 4122.
int &= ~(0xc000 << 48) # AND 0b1100
int |= 0x8000 << 48    # OR  0b1000
                       #   = 0b10??

In libuuid (util-linux) they overwrite the whole byte

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
// File: util-linux-2.32.1/libuuid/src/gen_uuid.c

/* index with UUID_VARIANT_xxx and shift 5 bits */
static unsigned char variant_bits[] = { 0x00, 0x04, 0x06, 0x07 };

/* ... */

  out[6] &= ~(UUID_TYPE_MASK << UUID_TYPE_SHIFT);
  out[6] |= (UUID_TYPE_DCE_MD5 << UUID_TYPE_SHIFT);

  out[8] &= ~(UUID_VARIANT_MASK << UUID_VARIANT_SHIFT);
  out[8] |= (variant_bits[UUID_VARIANT_DCE] << UUID_VARIANT_SHIFT);

The constants are defined here :

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
// File: util-linux-2.32.1/libuuid/src/uuid.h
/* UUID Variant definitions */
#define UUID_VARIANT_NCS    0
#define UUID_VARIANT_DCE    1
#define UUID_VARIANT_MICROSOFT  2
#define UUID_VARIANT_OTHER  3

#define UUID_VARIANT_SHIFT  5
#define UUID_VARIANT_MASK     0x7

/* UUID Type definitions */
#define UUID_TYPE_DCE_TIME   1
#define UUID_TYPE_DCE_SECURITY 2
#define UUID_TYPE_DCE_MD5    3
#define UUID_TYPE_DCE_RANDOM 4
#define UUID_TYPE_DCE_SHA1   5

#define UUID_TYPE_SHIFT      4
#define UUID_TYPE_MASK     0xf

Which makes :

1
2
3
4
5
6
7
8
9
// File: util-linux-2.32.1/libuuid/src/gen_uuid.c
  out[6] &= ~(0xf << 4); // AND 0b00001111
  out[6] |=  (3 << 4);   //  OR 0b00110000
                       //   = 0b0011????
                       //   = 0x   3   ?

  out[8] &= ~(0x7 << 5); // AND 0b11100000
  out[8] |= (4 << 5);    //  OR 0b10000000
                       //   = 0b100?????

So, as a summary (and as expected):

  • libuuid sets 10x to 100
  • uuid.py sets 10x to 10?

This means that for UUIDv3 and UUIDv5, the bit number 61 (little endian) is undefined (« don't-care »).

1
B = uuid.UUID(int=A.int^1<<61)

Both implementations are respecting the specification :

  • uuid.py does not alter a bit that is a « don't-care » value
  • libuuid sets to 0 a bit which is in a « don't-care » state

None are violating the RFC4122 which states that we don't care about this bit, bit nevertheless, the two following UUIDs are not identical !

0bbecd2a-c3aa-52d5-b2ae-fd1832e84cf0
0bbecd2a-c3aa-52d5-92ae-fd1832e84cf0

Conclusion

I honestly don't know.

  • The RFC does not care about this bit.
  • libuuid uses the safe engineering standard of setting to zero an unspecified bit
  • uuid.py strictly applies the RFC and does not alter this bit

Maybe I should consider publishing an errata ?

Errata : libuuid is wrong

Turns out, the « don't-care » status of this bit is resolved as well for the UUIDv3 and UUIDv5 case. The 4.3. section of the RFC is a huge mess, but

4.3.  Algorithm for Creating a Name-Based UUID

   The version 3 or 5 UUID is meant for generating UUIDs from "names"
   that are drawn from, and unique within, some "name space".  The
   concept of name and name space should be broadly construed, and not
   limited to textual names.

[.. snip ..]

   o  Set the clock_seq_hi_and_reserved field to octet 8 of the hash.

   o  Set the two most significant bits (bits 6 and 7) of the
      clock_seq_hi_and_reserved to zero and one, respectively.

[.. snip ..]

   o  Convert the resulting UUID to local byte order.

There you go, the RFC said « Set the two Msb to 0 and 1, which means leave Msb2 untouched, which means libuuid is wrong and the Python stdlib is right.