For context, this project branches from this fun project by Nolen "eieio" Royalty.
(For you non-Python-using readers, x ** y means x to the power of y.)
One of the known limitations of his approach is that the "ctrl-f" search feature is a bit spoofed. In a nutshell, if you search "1234" it appears to work fine, but if you mash find-next, it takes about 100 clicks to cycle back to the beginning of the list. But if you do a little napkin math, there should be more like ~10^32 hits in the full list.
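For the curious, here's one way to do that napkin math in Python (a rough estimate of my own, counting matches of a 4-hex-digit search at a single fixed alignment across the ~2^122 distinct UUIDs):

# each fixed alignment matches a 4-hex-digit search with probability 1/16**4
print(2**122 / 16**4)  # ~8.1e31, i.e. on the order of 10^32 hits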
This felt like a fun and challenging puzzle to me. What would it take to accurately cycle through all the string matches, in order?
So I chose my constraints.
I was having flashbacks to my number theory class at NCSSM and I was pretty sure modular arithmetic was the right tool here, but I had to relearn quite a bit after 15 years of not using it. Wikipedia saved the day... again.
In my gut I just knew we needed a simpler transform function. I went with modular multiplication here, using a multiplicative inverse to reverse the transform. If you squint really hard, this looks like dumbed-down RSA. Since we need to permute 2^122 numbers, we can use all the odd numbers < 2^123: multiplying by an odd P is invertible mod 2^123 (P is coprime to 2^123) and maps odd numbers to odd numbers, so the permutation is unique and invertible. Let's visualize that in Python:
M = 2**123
P = 0x3141592653589793238462643383279 # must be odd
Q = mult_inverse(P, M)

def index_to_uuid(i):
    x = i * 2 + 1
    y = x * P % M
    return y // 2

def uuid_to_index(u):
    y = u * 2 + 1
    x = y * Q % M
    return x // 2
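As a quick sanity check of that claim, here's a toy-sized version (throwaway constants of my own, not the real M and P): multiplying the odd residues by an odd number just permutes them.

M_toy = 2**7
P_toy = 5  # any odd value works
odds = list(range(1, M_toy, 2))
mapped = sorted(x * P_toy % M_toy for x in odds)
assert mapped == odds  # multiplication by an odd P permutes the odd residues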
We can use Euler's Theorem to compute the inverse of some arbitrary P mod M as long as P is coprime to M. Since M is a power of 2, P can be any number not divisible by 2. Choosing a large high-entropy P is needed to satisfy the "appearance of random" goal.
We need P * Q % M == 1, and we know P ** totient(M) % M == 1, so we can infer Q = P ** (totient(M) - 1) % M. Now we need the totient of M, which for a power of 2 turns out to be just M/2. The good news is that P ** (M/2 - 1) % M can be computed in log(M) time and is needed only once for the whole page. Basically we know M/2 - 1 = 1 + 2 + 4 + 8 + ..., so we can compute (P ** 1) * (P ** 2) * (P ** 4) * ... (all mod M) by repeated squaring.
def mult_inverse(P, M):
    # Computes Q = P ** (totient(M) - 1) % M by repeated squaring,
    # where totient(M) = M // 2 because M is a power of 2.
    q = 1
    pp = P
    i = 1
    while i < M // 2 - 1:
        q = (q * pp) % M
        pp = (pp * pp) % M
        i *= 2
    return q
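If you're on Python 3.8+, the built-in pow accepts a negative exponent for modular inverses, which makes a handy cross-check (just a verification aid, not part of the page itself):

assert mult_inverse(P, M) == pow(P, -1, M)
assert P * mult_inverse(P, M) % M == 1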
M = 2**123
P = 0x3141592653589793238462643383279 # must be odd
Q = mult_inverse(P, M)
x = 0x123456
y = index_to_uuid(x)
x2 = uuid_to_index(y)
print(hex(Q)) # 0x6c0ce98c29a14f187a559bbed3397c9
print(hex(y)) # 0x2f62449bd09199bd4f43388ad3fa1e2
print(hex(x2)) # 0x123456
for i in range(20):
    print(f'{index_to_uuid(i):031x}')
# 18a0ac9329ac4bc991c2313219c193c
# 09e205b97d04e35cb54693964d44bb5
# 3b235edfd05d7aefd8caf5fa80c7e2e
# 2c64b80623b61282fc4f585eb44b0a7
# 1da6112c770eaa161fd3bac2e7ce320
# 0ee76a52ca6741a943581d271b51599
# 0028c3791dbfd93c66dc7f8b4ed4812
# 316a1c9f711870cf8a60e1ef8257a8b
# 22ab75c5c4710862ade54453b5dad04
# 13ecceec17c99ff5d169a6b7e95df7d
# 052e28126b223788f4ee091c1ce11f6
# 366f8138be7acf1c18726b80506446f
# 27b0da5f11d366af3bf6cde483e76e8
# 18f23385652bfe425f7b3048b76a961
# 0a338cabb88495d582ff92aceaedbda
# 3b74e5d20bdd2d68a683f5111e70e53
# 2cb63ef85f35c4fbca08577551f40cc
# 1df7981eb28e5c8eed8cb9d98577345
# 0f38f14505e6f42211111c3db8fa5be
# 007a4a6b593f8bb534957ea1ec7d837
The first step is to deal with the "substring" concept. We can check valid alignments against the hyphens first, then remove the hyphens from both the search string and the UUID, so we're working with pure hex strings. Likewise, we can remove the "pinned" bits from the UUID string and work in a contiguous 122-bit number space.
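Here's a minimal sketch of the hyphen-alignment check (the template string and helper are my own illustration, not code from the original project): slide the search string across the canonical 8-4-4-4-12 layout and keep only the offsets where the hyphens line up.

UUID_TEMPLATE = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'

def valid_alignments(search):
    # offsets where 'search' can sit in the displayed UUID string,
    # i.e. hyphens in the search line up with hyphens in the template
    hits = []
    for off in range(len(UUID_TEMPLATE) - len(search) + 1):
        window = UUID_TEMPLATE[off:off + len(search)]
        if all((t == '-') == (s == '-') for t, s in zip(window, search)):
            hits.append(off)
    return hits

print(len(valid_alignments('1234')))  # 17 -- a hex-only search can't straddle a hyphen
print(valid_alignments('34-ab'))      # [6, 11, 16, 21] -- a hyphen pins the alignment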
The next step is to split up the possible alignments into separate searches. So if you search for "1234", you look for the next hit at offset 0, the next hit at offset 1, etc., independently. There are at most 32 different alignments for a single-hex-digit search.
Now that we are looking for a hex substring at a fixed offset, it becomes a math problem. For example: if you were searching 32-bit numbers for 0x##1234##, you could express it as:
X % (2**24) >= 0x123400 and X % (2**24) <= 0x1234ff
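Checking a few hand-picked 32-bit values against that condition (my own throwaway examples):

for X in (0x00123400, 0xab1234cd, 0xab123500, 0x12340000):
    print(hex(X), 0x123400 <= X % (2**24) <= 0x1234ff)
# 0x123400 True
# 0xab1234cd True
# 0xab123500 False
# 0x12340000 False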
This points towards the search API we want to implement:
def find_next(start_idx, search, search_bits, pad_bits):
    ...
In the above example: search_bits = 16 and pad_bits = 8, indicating our search string is 4 hex digits (4 bits per digit) and there are 2 hex digits on the right. start_idx would be the current scroll / cursor position, and search would be the integer value of the search hex string.
Rather than finding the closest (index order) match for 0x##1234##, we could instead look for the closest match for 0x##123400. It turns out this is easier to do, and part of my general solution anyway.
Considering that our transform is just multiplication (plus a little odd-number bookkeeping), it has the nice property that an offset between two indices maps to a fixed offset between the corresponding UUIDs, and vice versa. I.e. for any index a, UUID u, and step d:
assert (index_to_uuid(a + d) - index_to_uuid(a)) % 2**122 == d * P % 2**122
assert (uuid_to_index(u + d) - uuid_to_index(u)) % 2**122 == d * Q % 2**122
We can use this to find a step to the next full suffix match. Essentially we compute the UUID for the current index, compute the UUID offset to the target suffix, transform that offset back into an index offset, then reduce it modulo 2**search_bits. Since the upper bits cannot affect the lower bits in the transform, masking out the upper bits gives the shortest step that hits the target.
def find_next_suffix(start_idx, search, search_bits):
    start_uuid = index_to_uuid(start_idx)
    diff_uuid = search - start_uuid
    diff_index = diff_uuid * Q % (2 ** search_bits)
    return start_idx + diff_index
index = find_next_suffix(start_idx=0xbeefd00d, search=0x123400, search_bits=24)
print(hex(index)) # 0xbf909fe4
print(hex(index_to_uuid(index))) # 0xaf0db6574fb6dc48791e2b92123400
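Since the step is unique modulo 2**24, the previous index with this suffix sits a full 2**24 indices earlier, i.e. before the start index. A quick check of my own:

assert index_to_uuid(index) % 2**24 == 0x123400  # the suffix matches
assert index - 2**24 < 0xbeefd00d                # no earlier hit since start_idx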
This modulo-2**search_bits trick is really nice: it lets us ignore the upper bits entirely and find the shortest step to a given suffix. But in non-suffix cases with many pad_bits we can't possibly search through all the possible suffixes explicitly, so another trick is needed.
The gist of it is that we use the Euclidean algorithm to find smaller and smaller index steps that continue to match the search criteria, starting from a match found using the full suffix search.
Empirically it seems to consistently find the smallest step, but the proof is left as an exercise to the reader.
At a high level it looks like this:
def find_next(start_idx, search, search_bits, pad_bits):
    idx0 = find_next_suffix(start_idx, search << pad_bits, search_bits + pad_bits)
    step0 = idx0 - start_idx
    step = min_step(step0, search_bits, pad_bits)
    return start_idx + step
To set the stage for how we use the Euclidean algorithm, let's just watch it unfold for some sample inputs. We will use it to reduce index diffs, but we care about the UUID diffs during the process, so I'll print out both along the way.
search_bits = 8
pad_bits = 8
N2 = 2 ** (search_bits + pad_bits)
idiff0 = N2
udiff0 = idiff0 * P % N2
idiff1 = Q % N2
udiff1 = idiff1 * P % N2
while -udiff0 < 2 ** pad_bits:
    print(f"{idiff0:08d} {udiff0:03d} | {idiff1:08d} {udiff1:03d}")
    if idiff0 > idiff1:
        idiff0 -= idiff1
        udiff0 -= udiff1
    else:
        idiff1 -= idiff0
        udiff1 -= udiff0
output
00065536 000 | 00038857 001
00026679 -01 | 00038857 001
00026679 -01 | 00012178 002
00014501 -03 | 00012178 002
00002323 -05 | 00012178 002
00002323 -05 | 00009855 007
00002323 -05 | 00007532 012
00002323 -05 | 00005209 017
00002323 -05 | 00002886 022
00002323 -05 | 00000563 027
00001760 -32 | 00000563 027
00001197 -59 | 00000563 027
00000634 -86 | 00000563 027
00000071 -113 | 00000563 027
00000071 -113 | 00000492 140
00000071 -113 | 00000421 253
00000071 -113 | 00000350 366
00000071 -113 | 00000279 479
00000071 -113 | 00000208 592
00000071 -113 | 00000137 705
00000071 -113 | 00000066 818
The gist of the Euclidean algorithm is: start with two numbers, subtract the smaller from the bigger, and repeat. Eventually you arrive at two copies of the same number, and that's the GCD of the original numbers. In our case we start with two index diffs (idiff), and also track the corresponding uuid diffs (udiff). When we subtract idiffs we also subtract udiffs. We're using the search_bits modulo to focus only on the lower bits we care about. And we've specifically chosen the initial index diffs to have minimal uuid diffs (0 and 1). As the index diffs are subtracted they become smaller diffs (we want a minimal index diff, after all). Along the way the uuid diff ends up growing gradually.
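For reference, here's the plain subtraction form of the Euclidean algorithm described above, run on the two starting idiffs from the trace (just an illustration; the search itself never needs the GCD):

a, b = 65536, 38857  # the initial idiffs from the trace
while a != b:
    if a > b:
        a -= b
    else:
        b -= a
print(a)  # 1 -- gcd(65536, 38857) is 1 because 38857 is odd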
Now assuming we start from a full suffix match of 0x123400, we can look at these growing uuid diffs as diffs we can tolerate adding to our initial match until it exceeds 0xff, at which point the addition carries and spills into our match bits (e.g. 0x1234ff -> 0x123500). That is why -udiff0 < 2 ** pad_bits is the loop condition above.
So to adapt this to a min_step() function, we need to add a third column to the algorithm, which is the initial step found by find_next_suffix(). And whenever possible, we subtract an idiff from it without going negative, and without the udiff exceeding 2 ** pad_bits.
def min_step(step, search_bits, pad_bits):
    step_udiff = 0
    N2 = 2 ** (search_bits + pad_bits)
    N1 = 2 ** pad_bits
    idiff0 = N2
    udiff0 = idiff0 * P % N2
    idiff1 = Q % N2
    udiff1 = idiff1 * P % N2
    while udiff0 > -N1:
        if idiff0 > idiff1:
            idiff0 -= idiff1
            udiff0 -= udiff1
        else:
            idiff1 -= idiff0
            udiff1 -= udiff0
        # subtract the current idiff0 from the step as many times as we can,
        # without the step going negative or the accumulated udiff reaching N1
        while step - idiff0 > 0 and step_udiff - udiff0 < N1:
            step -= idiff0
            step_udiff -= udiff0
    return step
start = 0xbaaaaaad
idx = find_next(start_idx=start, search=0xf00d, search_bits=16, pad_bits=8)
print(hex(idx)) # 0xbaaad7eb
print(hex(index_to_uuid(idx))) # 0x3d7e69b6a2cfc8157ee22a476f00d4f
# brute force verification is possible when pad_bits is not too big
for i in range(start, idx):
    u = index_to_uuid(i)
    assert (u >> 8) % (2 ** 16) != 0xf00d
There you have it! Precise incremental searching of a pseudo-random UUID ordering.
Building the translations for real UUID4 bits and handling multi-alignment searching should be straightforward.
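As a rough sketch of that last point (my own illustration, with a hypothetical find_next_any helper, and assuming each alignment has already been translated into a (search, search_bits, pad_bits) triple in the contiguous 122-bit space), multi-alignment search is just a min over per-alignment searches:

def find_next_any(start_idx, alignments):
    # 'alignments' is a list of (search, search_bits, pad_bits) tuples,
    # one per valid alignment of the original search string
    return min(find_next(start_idx, s, sb, pb) for s, sb, pb in alignments)

Each alignment is searched independently and the nearest hit wins; find-next would then repeat from one past that hit.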