(Two years later: I was wrong. I opened some RFC comment and when staring a lot at the RFC it turns out there's no problem.)
The RFC problem
This is an actual problem in the RFC, that should be fixed one day, hopefully. Check out the « 4.1.1. Variant » section, applicable for RFC-compliant UUIDs, which defines the bits that indicate if an UUID is RFC-compliant.
Obviously this serves no purpose (#tautology) since a non-compliant UUID can accidentaly have two bits sets to «RFC-compliant». You would then have a non-compliant compliant UUID. 🙃
4.1.1. Variant
The variant field determines the layout of the UUID. That is, the
interpretation of all other bits in the UUID depends on the setting
of the bits in the variant field. As such, it could more accurately
be called a type field; we retain the original term for
compatibility. The variant field consists of a variable number of
the most significant bits of octet 8 of the UUID.
The following table lists the contents of the variant field, where
the letter "x" indicates a "don't-care" value.
Msb0 Msb1 Msb2 Description
0 x x Reserved, NCS backward compatibility.
1 0 x The variant specified in this document.
1 1 0 Reserved, Microsoft Corporation backward
compatibility
1 1 1 Reserved for future definition.
Interoperability, in any form, with variants other than the one
defined here is not guaranteed, and is not likely to be an issue in
practice.
So, there are « don't-care » values. How great. Let's consider the version-specific issues:
Non-compliant UUIDs (NCS backward compatibility)
Nothing to say here, they have been defined once and won't be generated anew.
UUIDv0
All bits are set to 0, nothing to see here.
UUIDv1
This could be problematic, since the « Msb2 » bit isn't defined, but as it overlaps with the « Clock ID » short integer, it's not an issue. Once again: once generated, an UUID won't be generated anew.
Do you see where we're going ?
UUIDv2
They don't even exist.
Things to see here : « Less than nothing - Hegel and the shadow of dialectal materialism - Slavoj Ẑiẑek ».
UUIDv4
Fully random. We don't care about random bits. Same rule applies.
UUIDv3 and UUIDv5 : the real problem
There we go.
These are made using the output of a hash function (MD5 or SHA1).
1 2 3 |
|
You have three options :
- Set the « don't-care » bit to 0
- Set the « don't-care » bit to 1
- Leave the « don't-care » bit untouched
Fun fact:
For both SHA1 and MD5, different implementations can have different results. Let's have a look at the Null UUID namespace, and the payload « D » (One byte, 0x44).
SHA1:
$ python3 -c 'import uuid;print(uuid.uuid5(uuid.UUID("0"*32),"D"))'
0bbecd2a-c3aa-52d5-b2ae-fd1832e84cf0
$ uuidgen --namespace 00000000-0000-0000-0000-000000000000 --name '44' --sha1 -x
0bbecd2a-c3aa-52d5-92ae-fd1832e84cf0
MD5:
$ python3 -c 'import uuid;print(uuid.uuid3(uuid.UUID("0"*32),"D"))'
57e30c8f-1e14-3114-a0ed-bd76e4f49aba
$ uuidgen --namespace 00000000-0000-0000-0000-000000000000 --name '44' --md5 -x
57e30c8f-1e14-3114-80ed-bd76e4f49aba
One-liner highlighting the differences :
$ python3 -c 'p="D";import uuid as u;import subprocess as S;n=u.UUID("0"*32);print([list(filter(lambda x:x[0]!=x[1],zip(S.getoutput("uuidgen -n{} -N{} -{}".format(n,p,a)),str(f(n,p))))) for a,f in [["m",u.uuid3],["s",u.uuid5]]])'
[[('8', 'a')], [('9', 'b')]]
uuid.py only sets the right two bits to what they need to be :
1 2 3 4 5 6 |
|
In libuuid (util-linux) they overwrite the whole byte
1 2 3 4 5 6 7 8 9 10 11 12 |
|
The constants are defined here :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Which makes :
1 2 3 4 5 6 7 8 9 |
|
So, as a summary (and as expected):
- libuuid sets 10x to 100
- uuid.py sets 10x to 10?
This means that for UUIDv3 and UUIDv5, the bit number 61 (little endian) is undefined (« don't-care »).
1 |
|
Both implementations are respecting the specification :
- uuid.py does not alter a bit that is a « don't-care » value
- libuuid sets to 0 a bit which is in a « don't-care » state
None are violating the RFC4122 which states that we don't care about this bit, bit nevertheless, the two following UUIDs are not identical !
0bbecd2a-c3aa-52d5-b2ae-fd1832e84cf0
0bbecd2a-c3aa-52d5-92ae-fd1832e84cf0
Conclusion
I honestly don't know.
- The RFC does not care about this bit.
- libuuid uses the safe engineering standard of setting to zero an unspecified bit
- uuid.py strictly applies the RFC and does not alter this bit
Maybe I should consider publishing an errata ?
Errata : libuuid is wrong
Turns out, the « don't-care » status of this bit is resolved as well for the UUIDv3 and UUIDv5 case. The 4.3. section of the RFC is a huge mess, but
4.3. Algorithm for Creating a Name-Based UUID
The version 3 or 5 UUID is meant for generating UUIDs from "names"
that are drawn from, and unique within, some "name space". The
concept of name and name space should be broadly construed, and not
limited to textual names.
[.. snip ..]
o Set the clock_seq_hi_and_reserved field to octet 8 of the hash.
o Set the two most significant bits (bits 6 and 7) of the
clock_seq_hi_and_reserved to zero and one, respectively.
[.. snip ..]
o Convert the resulting UUID to local byte order.
There you go, the RFC said « Set the two Msb to 0 and 1, which means leave Msb2 untouched, which means libuuid is wrong and the Python stdlib is right.